{"product_id":"prometheus-monitoring-stack-blueprint","title":"Prometheus Monitoring Stack Blueprint","description":"\u003ch3\u003ePrometheus Monitoring Stack Blueprint\u003c\/h3\u003e\n\u003cp\u003eMonitoring-as-code is the practice that separates teams who find out about outages from their customers and teams who find out from their dashboards. At Cigna, the healthcare data pipeline team had 47 CloudWatch alarms, but none of them had been updated when the service architecture changed. Half the alarms monitored resources that no longer existed. The other half had thresholds set during initial launch that were no longer relevant. The team found out about a 3-hour data pipeline failure from a downstream consumer, not from any alarm. This template manages monitoring configuration as code, deployed through the same pipeline as the application.\u003c\/p\u003e\n\n\u003ch3\u003ePipeline Stages\u003c\/h3\u003e\n\u003cul\u003e\n\u003cli\u003e\n\u003cstrong\u003evalidate\u003c\/strong\u003e — \u003ccode\u003epromtool check config prometheus.yml\u003c\/code\u003e and \u003ccode\u003epromtool check rules rules\/*.yml\u003c\/code\u003e validate Prometheus configuration syntax. Grafana dashboard JSON validated against the Grafana API schema.\u003c\/li\u003e\n\u003cli\u003e\n\u003cstrong\u003etest-rules\u003c\/strong\u003e — \u003ccode\u003epromtool test rules tests\/*.yml\u003c\/code\u003e runs unit tests against alerting rules. 
Each rule is tested with sample metrics that should trigger and should not trigger the alert.\u003c\/li\u003e\n\u003cli\u003e\n\u003cstrong\u003elint-dashboards\u003c\/strong\u003e — Custom linter checks Grafana dashboards for: missing datasource variables, hardcoded time ranges, panels without units, queries without \u003ccode\u003erate()\u003c\/code\u003e on counters.\u003c\/li\u003e\n\u003cli\u003e\n\u003cstrong\u003edeploy-dev\u003c\/strong\u003e — Prometheus rules applied via \u003ccode\u003ekubectl apply -f\u003c\/code\u003e to the monitoring namespace. Grafana dashboards provisioned via the HTTP API (\u003ccode\u003ePOST \/api\/dashboards\/db\u003c\/code\u003e). AlertManager config updated via \u003ccode\u003eamtool\u003c\/code\u003e.\u003c\/li\u003e\n\u003cli\u003e\n\u003cstrong\u003esmoke-test\u003c\/strong\u003e — Fires a test alert by pushing a metric via Pushgateway. Verifies the alert routes through AlertManager to the correct Slack channel. Validates PagerDuty integration receives the test incident.\u003c\/li\u003e\n\u003cli\u003e\n\u003cstrong\u003edeploy-prod\u003c\/strong\u003e — Manual approval. Prometheus Operator CRDs applied: ServiceMonitor, PodMonitor, PrometheusRule. Grafana dashboards deployed via provisioning ConfigMap. AlertManager secrets updated via Sealed Secrets.\u003c\/li\u003e\n\u003c\/ul\u003e\n\n\u003ch3\u003eSecurity Gates\u003c\/h3\u003e\n\u003cul\u003e\n\u003cli\u003e\n\u003cstrong\u003eNo secrets in dashboards\u003c\/strong\u003e — Lint step checks that Grafana dashboard JSON contains no hardcoded datasource URLs, credentials, or internal hostnames.\u003c\/li\u003e\n\u003cli\u003e\n\u003cstrong\u003eAlert rule review\u003c\/strong\u003e — Changes to alerting rules require security team review. An overly broad alert can mask a real incident. 
A removed alert can leave a gap in coverage.\u003c\/li\u003e\n\u003cli\u003e\n\u003cstrong\u003eSealed Secrets for AlertManager\u003c\/strong\u003e — PagerDuty API keys, Slack webhook URLs, and email credentials encrypted with Sealed Secrets. Only the cluster can decrypt them.\u003c\/li\u003e\n\u003c\/ul\u003e\n\n\u003ch3\u003eWhat Breaks First\u003c\/h3\u003e\n\u003cul\u003e\n\u003cli\u003e\n\u003cstrong\u003ePrometheus OOM from cardinality explosion\u003c\/strong\u003e — A new ServiceMonitor scrapes a target with 100K unique label combinations. Prometheus memory doubles overnight. Fix: add \u003ccode\u003emetricRelabelings\u003c\/code\u003e to drop high-cardinality labels and set \u003ccode\u003esampleLimit\u003c\/code\u003e on the ServiceMonitor (the CRD field that maps to Prometheus's \u003ccode\u003esample_limit\u003c\/code\u003e), so a scrape that exceeds the limit fails instead of being ingested.\u003c\/li\u003e\n\u003cli\u003e\n\u003cstrong\u003eGrafana dashboard overwrite from provisioning\u003c\/strong\u003e — A developer edits a dashboard in the Grafana UI, but the next pipeline run overwrites it with the version from git. Fix: set \u003ccode\u003eallowUiUpdates: false\u003c\/code\u003e in the provisioning config and educate the team that all changes go through git.\u003c\/li\u003e\n\u003cli\u003e\n\u003cstrong\u003eAlertManager route match ordering\u003c\/strong\u003e — A catch-all route defined before specific routes causes all alerts to go to the general channel. 
Fix: order routes from most specific to least specific, and test routing with \u003ccode\u003eamtool config routes test\u003c\/code\u003e.\u003c\/li\u003e\n\u003c\/ul\u003e","brand":"Citadel Cloud Management","offers":[{"title":"Default Title","offer_id":54890411229475,"sku":"CCM-DEV-008","price":42.0,"currency_code":"USD","in_stock":true}],"thumbnail_url":"\/\/cdn.shopify.com\/s\/files\/1\/0979\/8539\/7027\/files\/citadel-devops-product_df2001e8-6038-4d45-be9f-10cec32a6770.jpg?v=1775138240","url":"https:\/\/www.citadelcloudmanagement.com\/products\/prometheus-monitoring-stack-blueprint","provider":"Citadel Cloud Management","version":"1.0","type":"link"}