
Citadel Cloud Management
Kubernetes Multi-Cluster Architecture Blueprint
Architecture Blueprints
Created by Kenny Ogunlowo
Product Description
The Problem This Blueprint Solves
Your team adopted Kubernetes, but the cluster that "works in dev" collapses under production traffic. Pods get OOMKilled, HPA scales too slowly during traffic spikes, and a single node failure cascades into a full service outage because nobody configured pod disruption budgets or topology spread constraints. Your on-call rotation is burning through engineers at an unsustainable rate.
This blueprint reflects the production Kubernetes platform I operated across 340 microservices at a Fortune 500 healthcare company — handling 12,000 requests per second with 99.97% measured availability over 18 months.
What You Get
- Architecture diagrams — Cluster topology, namespace isolation strategy, ingress flow, service mesh configuration, and CI/CD deployment pipeline (Draw.io)
- Terraform + Helm charts — EKS/AKS/GKE cluster provisioning, Karpenter node autoscaler, cert-manager, external-dns, Istio service mesh base install, OPA Gatekeeper constraint templates
- Resource tuning guide — CPU/memory request and limit calculation methodology with actual production profiling examples (a minimal illustration follows this list)
- Incident runbook — Top 15 Kubernetes failure modes with diagnosis commands and remediation steps
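To make the tuning guide concrete, here is a minimal sketch of the kind of resource specification it produces. Every value is a hypothetical placeholder, and the heuristic in the comments (requests near observed P95 usage, memory limit equal to the request, no CPU limit) is one common approach, not necessarily the exact methodology in the guide.

```yaml
# Hypothetical example: a Deployment fragment with tuned resources.
# All numbers are placeholders; derive real values from production profiling.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api            # hypothetical workload name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: app
          image: registry.example.com/payments-api:1.4.2   # placeholder image
          resources:
            requests:
              cpu: 250m         # e.g. observed P95 CPU usage plus headroom
              memory: 512Mi     # e.g. observed P95 memory usage plus headroom
            limits:
              memory: 512Mi     # limit = request avoids node-level memory overcommit
              # CPU limit intentionally omitted: CPU throttling often hurts P99 latency
```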
Key Architecture Decisions
- Karpenter over Cluster Autoscaler — Cluster Autoscaler works at the node group level, so you end up with over-provisioned node groups to handle mixed workload types. Karpenter provisions individual nodes matched to pending pod requirements, cutting compute spend 30-40% while improving scheduling speed from minutes to seconds (see the NodePool sketch after this list).
- Istio over Linkerd for service mesh — If you need mTLS with FIPS 140-2 compliance, JWT-based authorization policies, and traffic mirroring for canary analysis, Istio is the only mesh that covers all three. Linkerd is simpler but lacks authorization policy depth for regulated environments (a JWT policy sketch follows this list).
- OPA Gatekeeper for policy enforcement — Prevents misconfigurations before they reach the cluster. The included constraint templates enforce: no containers running as root, mandatory resource requests/limits, required pod disruption budgets, and mandatory topology spread constraints. These four policies would have prevented roughly 80% of the production incidents I have investigated (a sample constraint template follows this list).
- Namespace-per-team over namespace-per-environment — Teams own their namespace from dev through production. Environment separation happens at the cluster level (dev cluster vs prod cluster). This gives teams autonomy while keeping blast radius contained.
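For the Karpenter decision, the sketch below shows a minimal NodePool using the karpenter.sh/v1 API. The instance-category values, CPU limit, and the `default` EC2NodeClass name are assumptions for illustration; the blueprint's actual Helm values may differ.

```yaml
# Minimal sketch: a NodePool that lets Karpenter pick instance types
# matched to pending pods instead of scaling fixed node groups.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                 # assumes a default EC2NodeClass exists
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]     # hypothetical instance families
  limits:
    cpu: "1000"                       # cap total CPU this pool may provision
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m              # repack and remove underutilized nodes
```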
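As an illustration of the authorization depth mentioned above, here is a minimal sketch of Istio's JWT enforcement: a RequestAuthentication that validates tokens and an AuthorizationPolicy that only admits requests carrying a valid principal. The issuer URL, app label, and namespace are hypothetical.

```yaml
# Hypothetical example: require a valid JWT for all traffic to one app.
apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
  name: payments-jwt
  namespace: payments                 # hypothetical team namespace
spec:
  selector:
    matchLabels:
      app: payments-api
  jwtRules:
    - issuer: "https://idp.example.com"                          # placeholder issuer
      jwksUri: "https://idp.example.com/.well-known/jwks.json"   # placeholder JWKS endpoint
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: payments-require-jwt
  namespace: payments
spec:
  selector:
    matchLabels:
      app: payments-api
  action: ALLOW
  rules:
    - from:
        - source:
            # Only requests with a token from the trusted issuer match;
            # a non-empty ALLOW policy implicitly denies everything else.
            requestPrincipals: ["https://idp.example.com/*"]
```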
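And a minimal sketch of one of the four Gatekeeper policies: rejecting containers that omit resource limits. The Rego here is deliberately simplified (it ignores initContainers, for example); the bundled templates are more complete.

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlimits
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLimits
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlimits

        # Flag any container that has no resources.limits block.
        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          not container.resources.limits
          msg := sprintf("container %v has no resource limits", [container.name])
        }
---
# Instantiate the template for all Pods.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLimits
metadata:
  name: require-limits
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
```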
Who This Blueprint Is For
- Platform Engineers building a shared Kubernetes platform for multiple product teams
- SREs responsible for cluster reliability and on-call burden reduction
- DevOps Engineers migrating workloads from EC2/VMs to Kubernetes
- Engineering Directors evaluating Kubernetes adoption readiness
Your First 48 Hours
Deploy the EKS Terraform module into a sandbox account with two managed node groups. Install Karpenter using the provided Helm values and deploy the sample workload that simulates a traffic spike. Watch Karpenter provision and deprovision nodes in real time. On day two, install OPA Gatekeeper with the four base constraint templates and attempt to deploy a pod without resource limits — Gatekeeper should reject it. This validates your policy enforcement pipeline end-to-end.
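For the day-two validation, a test manifest might look like the sketch below; with a required-limits constraint active (such as the one shown earlier), the Gatekeeper admission webhook should deny it. Names are placeholders.

```yaml
# Hypothetical test: this Pod omits resources entirely, so a
# required-limits Gatekeeper constraint should deny admission.
apiVersion: v1
kind: Pod
metadata:
  name: limits-test
spec:
  containers:
    - name: app
      image: nginx:1.27     # any image works; the pod should never be admitted
      # no resources block: expect a "denied the request" error
      # from the Gatekeeper admission webhook citing the constraint
```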
Limitations and Trade-offs
Istio adds 5-10ms of P99 latency per hop and consumes 200-400MB of memory per sidecar proxy. For latency-critical workloads with budgets under 5ms, consider excluding those services from the mesh (see the sketch below). The Terraform modules target EKS on AWS — AKS and GKE adaptations require modifying the node provisioning and IAM sections. Karpenter's upstream implementation targets AWS; GKE's node auto-provisioning and AKS's node autoprovisioning fill a similar role but have different configuration surfaces not covered here.
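One way to exclude a latency-critical workload is per-pod injection opt-out. A minimal sketch, assuming namespace-wide sidecar injection is otherwise enabled and using a hypothetical service name:

```yaml
# Hypothetical example: opt a single workload out of sidecar injection
# while the rest of its namespace stays in the mesh.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: quote-engine                    # hypothetical latency-critical service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: quote-engine
  template:
    metadata:
      labels:
        app: quote-engine
        sidecar.istio.io/inject: "false"   # skip the Envoy sidecar for these pods
    spec:
      containers:
        - name: app
          image: registry.example.com/quote-engine:2.0.1   # placeholder image
```

Note the trade-off: an excluded service no longer participates in mesh mTLS, authorization policies, or traffic mirroring, so plan its security controls separately.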