Monitoring and Observability in Modern DevOps Environments
Most posts about observability start with a diagram of the three pillars — metrics, logs, traces — and end with a pitch for whatever vendor's logo is biggest that year. This isn't that post.
The shape I care about is narrower: your alarms, log retention, and notification routing are configuration that lives in a repo, goes through review, and gets deployed like anything else. The teams I've seen struggle with monitoring didn't pick the wrong vendor. Half their alarms were created by hand in the AWS console two years ago by someone who left, and nobody can tell you why the threshold is 47.
What follows is how a production AWS environment I maintain handles this — CloudWatch alarms as Terraform, metric math for rates that matter, retention tiers scaled to environment, SNS routing that lands in the right inbox without bespoke integrations.
Monitoring config is code
If your alarms don't live in version control, you don't have monitoring — you have folklore. Every "why did this page us at 3am?" question needs an answer that starts with git log.
Once alarms are code, you get:
- A master switch — tear down every alarm in a sandbox with one variable flip.
- Environment-appropriate thresholds — staging can tolerate a 5% Lambda error rate; production can't.
- Review on changes — a threshold drop from 5% to 1% goes through the same MR process as an API change.
- Drift detection — if someone clicks around in the console, your next plan tells you.
Pattern 1: The master enable/disable switch
Every alarm in the monitoring module is gated on a single enable_monitoring variable using Terraform's count pattern. This lets sandbox environments run with monitoring off entirely while production and staging inherit the default:
variable "enable_monitoring" {
description = "Master switch to enable/disable monitoring resources"
type = bool
default = true
}
resource "aws_cloudwatch_metric_alarm" "lambda_throttles_warning" {
count = var.enable_monitoring ? 1 : 0
alarm_name = "${var.name_prefix}-lambda-throttles-warning"
namespace = "AWS/Lambda"
metric_name = "Throttles"
dimensions = { FunctionName = var.lambda_function_name }
statistic = "Sum"
period = 300
comparison_operator = "GreaterThanOrEqualToThreshold"
threshold = local.lambda_config.throttle_threshold
evaluation_periods = 1
treat_missing_data = "notBreaching"
alarm_actions = [aws_sns_topic.warning.arn]
}
count = var.enable_monitoring ? 1 : 0 is idiomatic Terraform for "create this resource only if the flag is on." For for_each resources (per-table DynamoDB alarms, for example), the equivalent is for_each = var.enable_monitoring ? local.tables : {}.
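As a concrete sketch of that second form (the local.tables map, the resource name, and the threshold here are assumptions for illustration, not the module's actual definitions):

resource "aws_cloudwatch_metric_alarm" "dynamodb_system_errors" {
  # An empty map when the switch is off means zero instances get created.
  for_each = var.enable_monitoring ? local.tables : {}

  alarm_name          = "${var.name_prefix}-${each.key}-system-errors-critical"
  namespace           = "AWS/DynamoDB"
  metric_name         = "SystemErrors"
  dimensions          = { TableName = each.key }
  statistic           = "Sum"
  period              = 300
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0
  evaluation_periods  = 1
  treat_missing_data  = "notBreaching"
  alarm_actions       = [aws_sns_topic.critical.arn]
}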
Per-resource enabled flags rot. One master switch at the module level is the gate to flip when provisioning a short-lived test environment or silencing an alarm storm.
Pattern 2: Metric math for derived rates
Raw counters lie. "47 Lambda errors in the last 5 minutes" means very different things during peak traffic (rounding error) vs. a quiet window (something is broken). The signal you actually want is the rate: errors as a percentage of invocations.
CloudWatch supports this through metric math. An alarm with multiple metric_query blocks can reference underlying metrics by ID and compute a derived value:
resource "aws_cloudwatch_metric_alarm" "lambda_error_rate_critical" {
count = var.enable_monitoring ? 1 : 0
alarm_name = "${var.name_prefix}-lambda-error-rate-critical"
comparison_operator = "GreaterThanThreshold"
threshold = local.lambda_config.error_rate_threshold
evaluation_periods = 2
datapoints_to_alarm = 2
treat_missing_data = "notBreaching"
alarm_actions = [aws_sns_topic.critical.arn]
metric_query {
id = "e1"
label = "ErrorRatePercent"
expression = "IF(m2>0,(m1/m2)*100,0)"
return_data = true
}
metric_query {
id = "m1"
metric {
namespace = "AWS/Lambda"
metric_name = "Errors"
dimensions = { FunctionName = var.lambda_function_name }
period = 300
stat = "Sum"
}
}
metric_query {
id = "m2"
metric {
namespace = "AWS/Lambda"
metric_name = "Invocations"
dimensions = { FunctionName = var.lambda_function_name }
period = 300
stat = "Sum"
}
}
}
The expression IF(m2>0,(m1/m2)*100,0) is the key pattern. Without the IF guard, a zero-invocations window leaves m1/m2 as 0/0 — CloudWatch emits no datapoint, and how the alarm reacts then depends entirely on your treat_missing_data setting. That's a coupling you don't want: the alarm's behavior during a quiet window is decided by a config line in a different place. The guard returns a clean 0 whenever there's no traffic, so the alarm only fires on actual error bursts regardless of how missing data is treated.
Same shape works for API Gateway 5XX rate (5XXError / Count), 4XX rate (4XXError / Count), and any other "errors per request" metric you care about.
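As a sketch, the API Gateway 5XX version changes only the namespace, metric names, and dimension (the var.api_name input and local.apigw_config are assumptions, not shown elsewhere in this module):

resource "aws_cloudwatch_metric_alarm" "apigw_5xx_rate_critical" {
  count = var.enable_monitoring ? 1 : 0

  alarm_name          = "${var.name_prefix}-apigw-5xx-rate-critical"
  comparison_operator = "GreaterThanThreshold"
  threshold           = local.apigw_config.error_rate_threshold
  evaluation_periods  = 2
  datapoints_to_alarm = 2
  treat_missing_data  = "notBreaching"
  alarm_actions       = [aws_sns_topic.critical.arn]

  # Same IF-guarded rate expression as the Lambda alarm above.
  metric_query {
    id          = "e1"
    label       = "5XXRatePercent"
    expression  = "IF(m2>0,(m1/m2)*100,0)"
    return_data = true
  }

  metric_query {
    id = "m1"
    metric {
      namespace   = "AWS/ApiGateway"
      metric_name = "5XXError"
      dimensions  = { ApiName = var.api_name }
      period      = 300
      stat        = "Sum"
    }
  }

  metric_query {
    id = "m2"
    metric {
      namespace   = "AWS/ApiGateway"
      metric_name = "Count"
      dimensions  = { ApiName = var.api_name }
      period      = 300
      stat        = "Sum"
    }
  }
}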
Pattern 3: Per-environment retention tiers
Log retention is one of those things nobody thinks about until the AWS bill shows CloudWatch Logs costing more than Lambda compute. The fix is trivial once monitoring is code: set retention per environment via a single variable.
resource "aws_cloudwatch_log_group" "this" {
name = "/aws/lambda/${var.function_name}"
retention_in_days = var.log_retention_days
tags = var.tags
}
The defaults I use:
| Environment | Retention (days) | Rationale |
|---|---|---|
| prod | 30 | Compliance + enough history to debug a last-week incident |
| stage | 14 | Useful for the current sprint; not a long-term archive |
| dev | 7 | Debugging the current branch, nothing more |
| sandbox | 3 | Ephemeral; cost floor |
Each environment's env.hcl overrides the default:
# stage/env.hcl
locals {
log_retention_days = 14
}
Two lines per environment. No console click-through, no drift, no surprise bill. The value flows through Terragrunt inputs into every module that creates a log group.
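The wiring looks roughly like this, assuming a conventional Terragrunt layout in which each component's terragrunt.hcl sits below the environment's env.hcl (the paths are illustrative):

# stage/my-service/terragrunt.hcl (path illustrative)
locals {
  # Pull in the nearest env.hcl above this directory.
  env = read_terragrunt_config(find_in_parent_folders("env.hcl"))
}

inputs = {
  log_retention_days = local.env.locals.log_retention_days
}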
If you need long-term retention for compliance, ship logs to S3 via a subscription filter into a Kinesis Data Firehose delivery stream; don't raise CloudWatch retention to 400 days. S3 is an order of magnitude cheaper, and you can lifecycle objects to Glacier.
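A sketch of that subscription filter. CloudWatch Logs subscription filters deliver to Lambda, Kinesis Data Streams, or Kinesis Data Firehose, so the S3 leg goes through a Firehose delivery stream; the stream and IAM role referenced here are assumed to exist elsewhere:

resource "aws_cloudwatch_log_subscription_filter" "logs_to_s3" {
  name            = "${var.name_prefix}-logs-to-s3"
  log_group_name  = aws_cloudwatch_log_group.this.name
  filter_pattern  = "" # empty pattern forwards every event
  destination_arn = var.firehose_delivery_stream_arn # assumed input
  role_arn        = var.logs_to_firehose_role_arn    # assumed input
}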
Pattern 4: SNS routing, and the Teams channel-email trick
Two SNS topics, split by severity:
resource "aws_sns_topic" "critical" {
name = "${var.name_prefix}-alerts-critical"
kms_master_key_id = var.kms_key_id
}
resource "aws_sns_topic" "warning" {
name = "${var.name_prefix}-alerts-warning"
kms_master_key_id = var.kms_key_id
}
resource "aws_sns_topic_subscription" "critical_email" {
for_each = local.enable_email ? toset(local.email_addresses) : toset([])
topic_arn = aws_sns_topic.critical.arn
protocol = "email"
endpoint = each.value
}
Critical alarms (error rates, 5XX rates, DynamoDB system errors) go to the critical topic. Warnings (throttles, 4XX rates, p99 latency) go to the warning topic. Both topics are KMS-encrypted. Email addresses come in via a list input, so adding a new subscriber is a config change, not a console click.
The Teams piece is worth calling out because it's often described incorrectly. There is no Teams webhook involved here. Microsoft Teams exposes a per-channel email address — any message sent to that address posts into the channel. So "route critical alerts to the ops channel" becomes: subscribe the Teams channel's email address to the SNS topic, same as any other email subscriber. SNS fans out → Teams ingests the email → the channel gets the message. No bot, no bespoke integration, no token to rotate.
email_addresses = compact([
var.ops_oncall_email,
"<teams-channel-id>@amer.teams.ms",
])
It's the simplest possible thing that works, and it survives vendor changes to Teams webhook APIs because it doesn't use them.
Pattern 5: Per-resource-type alarm taxonomy
Different AWS services expose different metrics, but the alarm taxonomy stays consistent: errors, throughput problems, and latency. For each resource type I cover all three.
Lambda:
- Error rate (metric math, critical) — errors as % of invocations
- Throttles (raw count, warning) — concurrency ceiling hit
- Duration p99 (extended statistic, warning) — performance regression
API Gateway:
- 5XX rate (metric math, critical) — server failures
- 4XX rate (metric math, warning) — client issues or misrouted requests
- Latency p99 (extended statistic, warning) — slow endpoints
DynamoDB (using for_each over tables):
- SystemErrors (critical) — AWS-side failures
- UserErrors (warning) — malformed requests, usually our bug
- ReadThrottleEvents / WriteThrottleEvents (warning) — capacity undersized
- SuccessfulRequestLatency (warning) — slow queries
Each threshold is configurable via a monitoring_config variable, so staging can run looser thresholds than production without duplicating resource definitions. Defaults in the module, per-environment overrides slot in through the same env.hcl path that sets log retention.
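A sketch of what that variable can look like; the field names and defaults are illustrative, not the module's real schema:

variable "monitoring_config" {
  description = "Per-environment alarm thresholds (illustrative shape)"
  type = object({
    lambda_error_rate_threshold = number # percent of invocations
    lambda_throttle_threshold   = number # throttles per 5-minute period
    apigw_5xx_rate_threshold    = number # percent of requests
    dynamodb_latency_threshold  = number # milliseconds
  })
  default = {
    lambda_error_rate_threshold = 1
    lambda_throttle_threshold   = 5
    apigw_5xx_rate_threshold    = 1
    dynamodb_latency_threshold  = 100
  }
}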
What this buys you
- Sandbox spin-up creates no alarms — one flag, no page storms.
- A threshold change is a three-line MR with a reviewer, not a console click.
- A new DynamoDB table gets all five alarm types automatically because it's in the for_each.
- Log retention and thresholds are set per environment by the same file that sets the account ID.
- Teams gets notified without a webhook, without a token, without a dedicated integration.
None of this is clever. It's taking the same "config is code" discipline you already apply to application infrastructure and extending it to the observability layer. Your alarms shouldn't be folklore — they should be in the repo, tagged, reviewed, and deployed, same as everything else.