
Citadel Cloud Management
Prometheus Monitoring Stack Blueprint
DevOps Pipelines
Created by Kenny Ogunlowo
Product Description
Monitoring-as-code is what separates teams that learn about outages from their dashboards from teams that learn about them from their customers. At Cigna, the healthcare data pipeline team had 47 CloudWatch alarms, but none had been updated when the service architecture changed. Half of the alarms monitored resources that no longer existed. The other half used thresholds set at initial launch that were no longer relevant. The team learned about a 3-hour data pipeline failure from a downstream consumer, not from any alarm. This template manages monitoring configuration as code and deploys it through the same pipeline as the application.
Pipeline Stages
- validate — `promtool check config prometheus.yml` and `promtool check rules rules/*.yml` validate Prometheus configuration syntax. Grafana dashboard JSON validated against the Grafana API schema.
- test-rules — `promtool test rules tests/*.yml` runs unit tests against alerting rules. Each rule is tested with sample metrics that should trigger and should not trigger the alert.
- lint-dashboards — Custom linter checks Grafana dashboards for: missing datasource variables, hardcoded time ranges, panels without units, queries without `rate()` on counters.
- deploy-dev — Prometheus rules applied via `kubectl apply -f` to the monitoring namespace. Grafana dashboards provisioned via the HTTP API (`POST /api/dashboards/db`). AlertManager config updated via `amtool`.
- smoke-test — Fires a test alert by pushing a metric via Pushgateway. Verifies the alert routes through AlertManager to the correct Slack channel. Validates PagerDuty integration receives the test incident.
- deploy-prod — Manual approval. Prometheus Operator CRDs applied: ServiceMonitor, PodMonitor, PrometheusRule. Grafana dashboards deployed via provisioning ConfigMap. AlertManager secrets updated via Sealed Secrets.
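The four lint-dashboards checks could be sketched roughly as follows. This is a minimal illustration, not the blueprint's actual linter. The function name `lint_dashboard`, the `_total` suffix heuristic for spotting counters, reading "hardcoded time range" as an absolute dashboard `time` value, and looking for units under `fieldConfig.defaults.unit` are all assumptions made for the example.

```python
import re

# Assumption: counters follow the Prometheus naming convention of ending in _total.
COUNTER_NAME = re.compile(r"\b[a-zA-Z_:][a-zA-Z0-9_:]*_total\b")
# Functions that are generally appropriate wrappers for a counter.
RATE_CALL = re.compile(r"\b(?:rate|irate|increase)\s*\(")


def lint_dashboard(dashboard: dict) -> list[str]:
    """Return human-readable findings for one parsed Grafana dashboard JSON object."""
    findings = []

    # Hardcoded time range: anything that is not relative to "now" is flagged.
    time_range = dashboard.get("time", {})
    for bound in ("from", "to"):
        value = str(time_range.get(bound, "now"))
        if not value.startswith("now"):
            findings.append(f"dashboard: hardcoded time range ({bound}={value})")

    for panel in dashboard.get("panels", []):
        if panel.get("type") == "row":
            continue  # row panels only group other panels; nested panels are not handled here
        title = panel.get("title", "<untitled>")

        # Datasource should be a template variable such as ${DS_PROMETHEUS}.
        ds = panel.get("datasource")
        uid = ds.get("uid") if isinstance(ds, dict) else ds
        if not (isinstance(uid, str) and uid.startswith("$")):
            findings.append(f"{title}: datasource is not a variable")

        # Panels without a unit.
        unit = panel.get("fieldConfig", {}).get("defaults", {}).get("unit")
        if not unit:
            findings.append(f"{title}: panel has no unit")

        # Counter queried without rate(), irate() or increase().
        for target in panel.get("targets", []):
            expr = target.get("expr", "")
            if COUNTER_NAME.search(expr) and not RATE_CALL.search(expr):
                findings.append(f"{title}: counter queried without rate()")

    return findings
```

In a pipeline, a wrapper would load each dashboard file with `json.load`, call `lint_dashboard`, and fail the stage if any findings come back. The regex checks are deliberately coarse; a stricter linter would parse the PromQL expression instead.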
Security Gates
- No secrets in dashboards — Lint step checks that Grafana dashboard JSON contains no hardcoded datasource URLs, credentials, or internal hostnames.
- Alert rule review — Changes to alerting rules require security team review. An overly broad alert can mask a real incident. A removed alert can leave a gap in coverage.
- Sealed Secrets for AlertManager — PagerDuty API keys, Slack webhook URLs, and email credentials encrypted with Sealed Secrets. Only the cluster can decrypt them.
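The "no secrets in dashboards" gate can be approximated by walking the dashboard JSON and flagging suspicious keys and values. The sketch below is illustrative only: the function names, the list of credential-like key names, and the `.internal` / `.local` / `.corp` hostname suffixes are assumptions, and a real gate would tune them to the organisation's naming.

```python
import re

# Illustrative patterns; none of these come from the blueprint itself.
URL = re.compile(r"https?://[^\s\"']+")
INTERNAL_HOST = re.compile(r"\b[\w.-]+\.(?:internal|local|corp)\b")
CREDENTIAL_KEY = re.compile(
    r"^(password|passwd|secret|token|api_?key|basicAuthPassword)$", re.IGNORECASE
)


def iter_scalars(node, path="$", key=""):
    """Yield (json_path, nearest_key, value) for every scalar in nested JSON data."""
    if isinstance(node, dict):
        for k, v in node.items():
            yield from iter_scalars(v, f"{path}.{k}", k)
    elif isinstance(node, list):
        for i, v in enumerate(node):
            yield from iter_scalars(v, f"{path}[{i}]", key)
    else:
        yield path, key, node


def find_secrets(dashboard: dict) -> list[str]:
    """Return findings for hardcoded URLs, credential-like keys and internal hostnames."""
    findings = []
    for path, key, value in iter_scalars(dashboard):
        if CREDENTIAL_KEY.match(key) and value not in ("", None):
            findings.append(f"{path}: credential-like key")
        if isinstance(value, str):
            if URL.search(value):
                findings.append(f"{path}: hardcoded URL")
            elif INTERNAL_HOST.search(value):
                findings.append(f"{path}: internal hostname")
    return findings
```

This version flags every URL, including legitimate documentation links in panel descriptions, so an allowlist would likely be needed before using it as a blocking gate.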
What Breaks First
- Prometheus OOM from cardinality explosion — A new ServiceMonitor scrapes a target with 100K unique label combinations. Prometheus memory doubles overnight. Fix: add `metricRelabelings` to drop high-cardinality labels and set `sampleLimit` (`sample_limit` in a plain Prometheus scrape config) on the ServiceMonitor.
- Grafana dashboard overwrite from provisioning — A developer edits a dashboard in the Grafana UI, but the next pipeline run overwrites it with the version from git. Fix: set `allowUiUpdates: false` in the provisioning config and educate the team that all changes go through git.
- AlertManager route match ordering — A catch-all route defined before specific routes causes all alerts to go to the general channel. Fix: order routes from most specific to least specific, and test routing with `amtool config routes test`.
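The route-ordering failure is easy to see in a toy model of first-match routing. The sketch below is a simplification, not AlertManager's implementation: it assumes a flat list of sibling routes, equality-only matchers, and no `continue: true`. The `Route` class, receiver names, and labels are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict = field(default_factory=dict)  # an empty dict acts as a catch-all


def first_match(routes, labels, default="default"):
    """Return the receiver of the first route whose matchers all equal the alert's labels."""
    for route in routes:
        if all(labels.get(k) == v for k, v in route.match.items()):
            return route.receiver
    return default


alert = {"alertname": "PipelineStalled", "team": "data", "severity": "critical"}

# Catch-all listed first: it matches everything, so the specific route is never reached.
broken = [
    Route("general-slack"),
    Route("data-pagerduty", {"team": "data", "severity": "critical"}),
]

# Using the number of matchers as a rough proxy for specificity, most specific first.
fixed = sorted(broken, key=lambda r: len(r.match), reverse=True)

print(first_match(broken, alert))  # general-slack
print(first_match(fixed, alert))   # data-pagerduty
```

Matcher count is only a heuristic for "most specific"; in practice the ordering is a design decision made per routing tree, which is why checking it against real label sets is still worthwhile.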