Monitoring¶
Redshift Spectra provides comprehensive monitoring through AWS CloudWatch, X-Ray, and custom metrics.
Architecture¶
flowchart TB
subgraph Lambdas["Lambda Functions"]
API[API Handler]
WORKER[Worker]
AUTH[Authorizer]
end
subgraph CloudWatch["CloudWatch"]
LOGS[Log Groups]
METRICS[Metrics]
ALARMS[Alarms]
DASH[Dashboard]
end
subgraph XRay["X-Ray"]
TRACES[Traces]
MAP[Service Map]
end
subgraph Alerts["Alerting"]
SNS[SNS Topic]
EMAIL[Email]
SLACK[Slack]
end
Lambdas -->|Logs| LOGS
Lambdas -->|Metrics| METRICS
Lambdas -->|Traces| XRay
METRICS --> ALARMS
ALARMS --> SNS
SNS --> EMAIL
SNS --> SLACK
LOGS --> DASH
METRICS --> DASH
CloudWatch Logs¶
Log Groups¶
Each Lambda function has its own log group:
| Log Group | Description |
|---|---|
/aws/lambda/spectra-api-handler |
API request processing |
/aws/lambda/spectra-worker |
Async job execution |
/aws/lambda/spectra-authorizer |
Authorization requests |
Structured Logging¶
Spectra uses AWS Lambda Powertools for structured JSON logging:
{
"level": "INFO",
"message": "Query submitted",
"timestamp": "2026-01-29T10:00:00.000Z",
"service": "spectra",
"cold_start": false,
"function_name": "spectra-api-handler",
"function_memory_size": 512,
"function_request_id": "abc-123",
"correlation_id": "req-xyz",
"tenant_id": "tenant-123",
"job_id": "job-abc",
"query_type": "SELECT"
}
Log Insights Queries¶
Find slow queries:
fields @timestamp, @message, tenant_id, duration_ms
| filter @message like /Query completed/
| filter duration_ms > 5000
| sort @timestamp desc
| limit 100
Find errors by tenant:
fields @timestamp, @message, tenant_id, error_message
| filter level = "ERROR"
| stats count(*) by tenant_id
| sort count desc
Metrics¶
Built-in Lambda Metrics¶
| Metric | Description | Alarm Threshold |
|---|---|---|
Invocations |
Total invocations | - |
Errors |
Failed invocations | > 1% |
Duration |
Execution time | P99 > 5s |
Throttles |
Throttled invocations | > 0 |
ConcurrentExecutions |
Concurrent runs | > 80% limit |
Custom Metrics¶
Spectra emits custom metrics via Lambda Powertools:
from aws_lambda_powertools import Metrics
metrics = Metrics(namespace="Spectra")
@metrics.log_metrics
def handler(event, context):
metrics.add_metric(name="QuerySubmitted", unit="Count", value=1)
metrics.add_dimension(name="TenantId", value=tenant_id)
| Metric | Dimensions | Description |
|---|---|---|
QuerySubmitted |
TenantId | Queries submitted |
QueryCompleted |
TenantId, Status | Queries completed |
QueryDuration |
TenantId | Query execution time |
ResultSize |
TenantId | Result row count |
SessionHit |
TenantId | Session cache hits |
SessionMiss |
TenantId | Session cache misses |
X-Ray Tracing¶
Enable Tracing¶
Set in environment:
Service Map¶
flowchart LR
CLIENT[Client] --> APIGW[API Gateway]
APIGW --> LAMBDA[Lambda]
LAMBDA --> DDB[DynamoDB]
LAMBDA --> RS[Redshift]
LAMBDA --> S3[S3]
Trace Analysis¶
View traces in the AWS Console to identify:
- Slow downstream calls
- Error sources
- Cold start impact
- Service dependencies
Alarms¶
Critical Alarms¶
# terraform/modules/monitoring/main.tf
resource "aws_cloudwatch_metric_alarm" "api_errors" {
alarm_name = "spectra-api-errors"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "Errors"
namespace = "AWS/Lambda"
period = 300
statistic = "Sum"
threshold = 5
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
FunctionName = "spectra-api-handler"
}
}
Recommended Alarms¶
| Alarm | Condition | Severity |
|---|---|---|
| API Errors | > 5 errors in 5 min | Critical |
| High Latency | P99 > 10s | Warning |
| Throttling | Any throttles | Critical |
| DynamoDB Errors | > 0 errors | Critical |
| Worker Backlog | Queue age > 5 min | Warning |
Dashboard¶
Create Dashboard¶
The monitoring module creates a CloudWatch dashboard:
resource "aws_cloudwatch_dashboard" "main" {
dashboard_name = "spectra-${var.environment}"
dashboard_body = jsonencode({
widgets = [
# Invocations
{
type = "metric"
width = 12
height = 6
properties = {
title = "API Invocations"
region = var.region
metrics = [
["AWS/Lambda", "Invocations", "FunctionName", "spectra-api-handler"]
]
}
}
# ... more widgets
]
})
}
Dashboard Widgets¶
graph TB
subgraph Dashboard["Spectra Dashboard"]
subgraph Row1["Request Metrics"]
W1[Invocations]
W2[Latency P50/P99]
W3[Error Rate]
end
subgraph Row2["Query Metrics"]
W4[Queries by Status]
W5[Query Duration]
W6[Result Size]
end
subgraph Row3["System Health"]
W7[Concurrent Executions]
W8[DynamoDB Capacity]
W9[Session Cache Hit Rate]
end
end
Alerting¶
SNS Integration¶
resource "aws_sns_topic" "alerts" {
name = "spectra-alerts-${var.environment}"
}
resource "aws_sns_topic_subscription" "email" {
topic_arn = aws_sns_topic.alerts.arn
protocol = "email"
endpoint = var.alert_email
}
Slack Integration¶
Use AWS Chatbot or a Lambda function for Slack notifications:
def notify_slack(event, context):
message = json.loads(event['Records'][0]['Sns']['Message'])
slack_message = {
"text": f"🚨 Alert: {message['AlarmName']}",
"attachments": [{
"color": "danger",
"fields": [
{"title": "Status", "value": message['NewStateValue']},
{"title": "Reason", "value": message['NewStateReason']}
]
}]
}
requests.post(SLACK_WEBHOOK_URL, json=slack_message)
Best Practices¶
Set Up Alarms First
Configure alarms before going to production. Don't wait for incidents.
Use Log Insights
CloudWatch Log Insights is powerful for ad-hoc analysis. Learn the query syntax.
Monitor by Tenant
Use dimensions to track per-tenant metrics. Identify noisy neighbors.