50 DevOps Interview Questions for Senior Engineers
Senior DevOps and platform engineering interviews in 2026 test far more than tool proficiency. Interviewers probe your understanding of distributed systems, your approach to incident management, your ability to design for reliability at scale, and your judgment in trade-off decisions. Memorizing commands does not work at this level — you need to demonstrate depth of understanding and production experience.
This collection covers 50 questions organized by domain, each with the kind of answer that impresses senior engineering interviewers. The answers are not scripts to memorize — they are frameworks for structuring your own responses with your specific experience.
CI/CD and Deployment (Questions 1-10)
1. How do you decide between rolling deployments, blue-green, and canary releases?
Rolling deployments are the simplest — old pods are replaced incrementally with new pods. They work when your application is stateless and backward-compatible. Blue-green deployments run two full environments simultaneously and switch traffic atomically. Use blue-green when you need instant rollback capability and your database schema changes are backward-compatible. Canary deployments route a small percentage of traffic to the new version while monitoring metrics. Use canary when you need to validate a change under real production load before full rollout — particularly for changes that might affect latency, error rates, or business metrics in ways that staging cannot reproduce.
The key factor is risk tolerance multiplied by blast radius. A configuration change to a logging service tolerates rolling deployment. A payment processing change demands canary with automated rollback triggers.
2. What does an ideal CI pipeline look like for a microservices architecture?
A monorepo CI pipeline uses path-based filtering to trigger only affected service pipelines. Each service pipeline runs: lint and static analysis (30 seconds), unit tests (under 2 minutes), build container image, scan image for vulnerabilities, integration tests against real dependencies using Testcontainers (under 5 minutes), push image to registry with immutable tag (commit SHA), and update the GitOps manifest repository.
For polyrepo architectures, each repository has its own pipeline but contract tests verify compatibility between services. Schema registries prevent breaking changes in event-driven systems.
Total time target: under 10 minutes from push to deployable artifact.
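The stages above can be sketched as a GitHub Actions workflow. This is a minimal illustration, not a prescribed pipeline — the paths, registry URL, and make targets are assumptions:

```yaml
# Sketch of one service's pipeline with path-based filtering.
name: api-service-ci
on:
  push:
    paths:
      - "services/api/**"   # only this service's changes trigger the run
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make lint              # lint and static analysis
      - run: make unit-test         # unit tests, target under 2 minutes
      - run: docker build -t registry.example.com/api:${{ github.sha }} services/api
      - run: make image-scan        # vulnerability scan of the built image
      - run: make integration-test  # Testcontainers against real dependencies
      - run: docker push registry.example.com/api:${{ github.sha }}  # immutable tag = commit SHA
```

The commit SHA as the image tag is what makes rollback trivial later: every deployable artifact maps to exactly one commit.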
3. How do you handle database schema migrations in a CI/CD pipeline?
Schema migrations must be backward-compatible. The expand-and-contract pattern: first, add the new column/table without removing anything (expand). Deploy application code that writes to both old and new structures. Once all application instances are updated, deploy a cleanup migration that removes the old structure (contract).
Tools like Flyway or Liquibase manage migration ordering and track which migrations have been applied. The CI pipeline runs migrations against a test database before deploying to production. Never run destructive migrations (DROP TABLE, DROP COLUMN) in the same release as the application changes that depend on them.
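The expand-and-contract pattern might look like this as two migrations shipped in separate releases (table and column names are illustrative):

```sql
-- Expand (release N): add the new structure without removing anything.
ALTER TABLE customers ADD COLUMN full_name TEXT;

-- Release N's application code writes both first_name/last_name and
-- full_name; a backfill job populates full_name for historical rows.

-- Contract (release N+2, only after every instance writes the new column):
ALTER TABLE customers DROP COLUMN first_name;
ALTER TABLE customers DROP COLUMN last_name;
```

Note the release gap between expand and contract: it guarantees that no running application version still depends on the dropped columns.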
4. Describe your approach to managing secrets in CI/CD pipelines.
Secrets never exist in source code, pipeline definitions, or container images. Use the CI platform's native secret management (GitHub Actions secrets, GitLab CI variables) for pipeline credentials. For application secrets, inject them at runtime from a secret manager (AWS Secrets Manager, HashiCorp Vault, Azure Key Vault) through sidecar containers, init containers, or the External Secrets Operator on Kubernetes.
Rotate secrets on a schedule (90 days maximum for credentials, shorter for API keys). Audit secret access through CloudTrail or Vault audit logs. Use short-lived credentials (OIDC federation from CI to cloud providers) instead of long-lived access keys wherever possible.
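OIDC federation from a CI job to a cloud provider can be sketched like this for GitHub Actions and AWS (the role ARN and region are placeholders):

```yaml
# Sketch: short-lived AWS credentials via OIDC instead of stored access keys.
on: push
permissions:
  id-token: write   # allow the job to request an OIDC token
  contents: read
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/ci-deploy
          aws-region: us-east-1
```

The IAM role's trust policy restricts which repository and branch can assume it, so there is no long-lived access key to leak or rotate.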
5. How would you implement a CI/CD pipeline for infrastructure as code?
Terraform CI pipeline stages: terraform fmt -check (style), terraform validate (syntax), tfsec or checkov (security scanning), terraform plan (preview changes), human review of the plan output, terraform apply (deployment). The plan output is posted as a pull request comment so reviewers see exactly what changes will be made.
State is stored in a remote backend (S3 + DynamoDB for locking) with encryption at rest. State files contain sensitive data and must never be committed to Git. Use workspaces or directory-based separation for environment isolation (dev/staging/production).
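A remote backend with locking and encryption can be declared like this (bucket and table names are placeholders):

```hcl
# Sketch of an S3 backend with DynamoDB locking.
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"
    key            = "production/app/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true                  # encryption at rest
    dynamodb_table = "terraform-locks"     # prevents concurrent applies
  }
}
```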
6. What is GitOps, and when would you not use it?
GitOps uses a Git repository as the single source of truth for declarative infrastructure. A controller (ArgoCD, Flux) running in the cluster watches the repository and reconciles cluster state with the declared state. Benefits: full audit trail, easy rollback via git revert, drift detection.
Situations where GitOps is not ideal: rapid prototyping where the feedback loop of commit-push-sync is slower than kubectl apply, emergency hotfixes where you need sub-minute deployment (though you should still backfill the Git change), and stateful operations like database migrations that require imperative steps.
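The reconciliation loop described above is configured through a resource like this Argo CD Application (repository URL and paths are placeholders):

```yaml
# Sketch: Argo CD watches the manifest repo and reconciles the cluster to it.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/acme/gitops-manifests
    targetRevision: main
    path: apps/api-service/production
  destination:
    server: https://kubernetes.default.svc
    namespace: api
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift back to the declared state
```

With selfHeal enabled, a kubectl edit made directly against the cluster is reverted automatically — Git stays the single source of truth.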
7. How do you manage CI/CD for a team of 50 engineers with 20 microservices?
Standardization through shared pipeline templates. Create reusable workflow files (GitHub Actions reusable workflows, GitLab CI includes, Jenkins shared libraries) that encode your organization's build, test, scan, and deploy patterns. Individual service repositories reference these templates and provide only service-specific configuration.
A platform team owns the pipeline templates, dependency update automation (Renovate or Dependabot), and the deployment infrastructure (ArgoCD, runner fleet). Service teams own their application code, tests, and Dockerfile.
8. How do you test infrastructure changes before applying to production?
Layered testing: static analysis (tfsec, checkov) catches misconfigurations without applying anything. Plan review shows the exact diff. Ephemeral environments (spin up a full environment from Terraform, run E2E tests, tear down) validate the infrastructure actually works. For critical changes, apply to a staging environment that mirrors production topology and observe for 24-48 hours before production.
Terratest and Kitchen-Terraform enable writing Go or Ruby tests that apply Terraform, validate the resulting resources, and destroy them automatically.
9. Describe your approach to rollback in production.
Immutable deployment artifacts make rollback straightforward: deploy the previous known-good container image. For Kubernetes, GitOps rollback is a git revert of the manifest change. For infrastructure, rollback means reverting the Terraform configuration and re-applying it, which requires careful handling of resources that cannot simply be destroyed and recreated (databases with data, DNS records with propagation delays).
The key principle: rollback must be faster than fixing forward. If your rollback procedure takes 30 minutes, you need to simplify it. Target: under 5 minutes from decision to rollback to traffic serving the previous version.
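In practice the two rollback paths look something like this (commit SHA and deployment name are placeholders):

```shell
# GitOps rollback: revert the manifest commit and let the controller reconcile.
git revert <bad-commit-sha>
git push

# Direct rollback when minutes matter (backfill the Git change afterward):
kubectl rollout undo deployment/api-service
kubectl rollout status deployment/api-service --timeout=5m
```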
10. How do you handle feature flags in a DevOps context?
Feature flags decouple deployment from release. Deploy code behind a flag in the off position, then enable it for internal users, then a percentage of production traffic, then everyone. This enables trunk-based development (no long-lived feature branches), targeted rollback (disable the flag instead of redeploying), and A/B testing.
Use a dedicated feature flag service (LaunchDarkly, Unleash, AWS AppConfig) rather than environment variables. The service provides audit trails, gradual rollouts, targeting rules, and kill switches. Clean up stale flags aggressively — flag debt is real technical debt.
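Under the hood, the gradual-rollout feature of these services relies on deterministic bucketing: each user is hashed into a stable percentile, so the same user always gets the same decision and raising the percentage only ever adds users. A minimal sketch of the mechanic (not any vendor's actual implementation):

```python
import hashlib

def in_rollout(user_id: str, flag_name: str, percentage: float) -> bool:
    """Deterministically decide whether a user is inside a percentage rollout."""
    # Hash flag+user so each flag buckets users independently.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF * 100  # stable value in [0, 100]
    return bucket < percentage

# The decision is stable across calls, so raising the rollout percentage
# expands the audience without flapping users already enabled.
```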
Kubernetes and Container Orchestration (Questions 11-20)
11. Explain how a Kubernetes Deployment performs a rolling update.
When you update the pod template in a Deployment, the Deployment controller creates a new ReplicaSet with the updated template and scales it up while scaling down the old ReplicaSet. The maxSurge parameter controls how many extra pods can exist during the transition, and maxUnavailable controls how many pods can be unavailable. The new pods must pass readiness probes before receiving traffic and before old pods are terminated. If new pods fail readiness probes, the rollout stalls, preventing a bad deployment from completing.
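The knobs described above live in the Deployment spec. A sketch with illustrative values:

```yaml
# maxSurge/maxUnavailable tuning plus the readiness probe that gates the rollout.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most one extra pod during the transition
      maxUnavailable: 0    # never drop below the desired replica count
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.2.3
          readinessProbe:        # new pods must pass this before old pods go away
            httpGet:
              path: /healthz
              port: 8080
```

With maxSurge: 1 and maxUnavailable: 0, capacity never dips during the rollout at the cost of briefly running one extra pod.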
12. How do you debug a pod stuck in CrashLoopBackOff?
First, kubectl describe pod <name> to see events — look at the last state exit code and reason. Exit code 1 is application error (check logs with kubectl logs <pod> --previous). Exit code 137 is OOM kill (increase memory limits or fix memory leak). Exit code 0 with CrashLoopBackOff means the container keeps completing successfully but Kubernetes expects it to run continuously (check your entrypoint command). Also verify that ConfigMaps, Secrets, and PersistentVolumeClaims referenced by the pod actually exist in the namespace.
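The debugging sequence above, as commands (pod and namespace names are placeholders):

```shell
kubectl describe pod api-7f9d -n production       # events, last state, restart count
kubectl logs api-7f9d -n production --previous    # logs from the crashed container
kubectl get pod api-7f9d -n production \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```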
13. When would you use a StatefulSet instead of a Deployment?
StatefulSets provide stable network identities (pod-0, pod-1, pod-2 with predictable DNS names), ordered startup and shutdown, and stable persistent storage (each pod gets its own PVC that follows it across rescheduling). Use StatefulSets for databases (PostgreSQL, MySQL, MongoDB), message brokers (Kafka, RabbitMQ), and distributed systems that require stable identity for cluster membership.
For stateless applications — web servers, API services, workers — always use Deployments.
14. Explain Kubernetes resource requests and limits. What happens when you get them wrong?
Requests are the resources the scheduler reserves for a container: a pod is only placed on a node with enough unreserved capacity to satisfy its requests. Limits are the maximum a container can consume. If a container exceeds its memory limit, it is OOM killed. If it exceeds its CPU limit, it is throttled.
Setting requests too low causes pods to be scheduled on overcommitted nodes, leading to CPU throttling and memory pressure. Setting requests too high wastes cluster capacity — nodes appear full when pods are not actually using their reserved resources. Setting limits too low causes unnecessary OOM kills and throttling. Not setting them at all allows a single pod to consume all node resources.
Best practice: set requests based on observed p95 usage (Prometheus metrics), set CPU limits generously or not at all (CPU throttling is often worse than allowing burst), and set memory limits at 1.5-2x the observed p95.
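Applying that practice to a container spec looks like this (the values themselves are illustrative — yours come from your metrics):

```yaml
# Requests from observed p95; memory limit ~2x p95; no CPU limit to allow burst.
resources:
  requests:
    cpu: 250m        # observed p95 CPU
    memory: 512Mi    # observed p95 memory
  limits:
    memory: 1Gi      # ~2x p95; exceeding this OOM-kills the container
    # no cpu limit: avoids throttling, the pod may burst up to node capacity
```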
15. How do you implement network policies in Kubernetes?
NetworkPolicies are namespace-scoped resources that control traffic flow between pods. By default, all pods can communicate with all other pods. A NetworkPolicy with an empty podSelector applies to all pods in the namespace.
Start with a deny-all default policy, then explicitly allow required communication paths. For example, allow the API pods to reach the database pods on port 5432, allow the ingress controller to reach the API pods on port 8080, and deny everything else. You need a CNI plugin that enforces NetworkPolicies — Calico, Cilium, or Weave Net. Basic CNIs such as kubenet and flannel accept the policy objects but do not enforce them.
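The deny-all default plus one explicit allow path can be sketched as (labels and ports are illustrative):

```yaml
# Default-deny for the namespace, then an explicit API -> database allow rule.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}            # empty selector: applies to every pod in the namespace
  policyTypes: [Ingress, Egress]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-db
spec:
  podSelector:
    matchLabels:
      app: postgres
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api
      ports:
        - protocol: TCP
          port: 5432
```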
16. Describe how you would perform a zero-downtime Kubernetes cluster upgrade.
For managed Kubernetes (EKS, AKS, GKE), use the provider's upgrade mechanism with node pool rolling updates. Upgrade the control plane first, then upgrade node pools one at a time. Set PodDisruptionBudgets on critical workloads to ensure minimum availability during node drains.
Sequence: review the Kubernetes changelog for breaking API changes. Update any deprecated API versions in your manifests. Upgrade the control plane. Create a new node pool with the target version. Cordon and drain old nodes (pods are rescheduled on new nodes). Delete the old node pool. Verify all workloads are healthy.
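The cordon-and-drain step of that sequence, as commands (node names are placeholders):

```shell
kubectl cordon ip-10-0-1-23                       # stop scheduling new pods here
kubectl drain ip-10-0-1-23 \
  --ignore-daemonsets --delete-emptydir-data      # evict pods, respecting PodDisruptionBudgets
kubectl get pods -A -o wide | grep ip-10-0-1-23   # confirm the node is empty before removal
```

Because drain respects PodDisruptionBudgets, a budget of minAvailable: 2 on a 3-replica service guarantees the drain pauses rather than taking the service below two healthy replicas.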
17. What is a service mesh, and when is it justified?
A service mesh (Istio, Linkerd, Consul Connect) provides mutual TLS between services, traffic management (retries, timeouts, circuit breaking), and observability (distributed tracing, service-level metrics) without changing application code. It works by injecting a sidecar proxy (Envoy) into every pod.
Justified when: you have 20+ microservices needing consistent security and observability policies, you need mTLS between all services for compliance, or you need advanced traffic management (canary by header, traffic mirroring).
Not justified when: you have fewer than 10 services, your team lacks the operational capacity to manage the mesh, or you can achieve your goals with simpler tools (NetworkPolicies for segmentation, application-level retries).
18. How do you manage Kubernetes configurations across multiple environments?
Kustomize overlays: a base directory contains the common manifests, and overlay directories (dev, staging, production) patch specific values (replica count, resource limits, image tags, ConfigMap values). Alternatively, Helm charts with per-environment values files.
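A minimal overlay sketch, assuming a base/ directory of common manifests (directory layout, names, and values are illustrative):

```yaml
# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: registry.example.com/api
    newTag: "1.2.3"          # per-environment image tag
patches:
  - patch: |-
      - op: replace
        path: /spec/replicas
        value: 6             # production runs more replicas than dev
    target:
      kind: Deployment
      name: api-service
```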
The critical rule: environment differences should be minimal and explicit. If your staging configuration diverges significantly from production, staging loses its value as a production proxy.
19. Explain pod affinity, anti-affinity, and topology spread constraints.
Pod affinity places pods on the same node as other specific pods (co-locate cache with application for low latency). Pod anti-affinity ensures pods are spread across nodes or zones (do not put all replicas of a critical service on the same node). Topology spread constraints distribute pods evenly across topology domains (zones, nodes) with configurable skew tolerance.
For production services: use pod anti-affinity with topologyKey: topology.kubernetes.io/zone to spread replicas across availability zones. This ensures a single zone failure does not take down all replicas.
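That anti-affinity rule looks like this in the pod spec (label values are illustrative):

```yaml
# Hard requirement: no two replicas of this service in the same zone.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: api-service
        topologyKey: topology.kubernetes.io/zone
```

Use preferredDuringSchedulingIgnoredDuringExecution instead when you want spreading as a soft preference that still allows scheduling if zones are unbalanced.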
20. How do you handle persistent storage for stateful workloads on Kubernetes?
Use StorageClasses that map to appropriate cloud provider storage. For databases: use volumeBindingMode: WaitForFirstConsumer to ensure the PV is created in the same zone as the pod. For high-performance workloads: use local SSDs (with the caveat that data is lost if the node fails). For shared storage across pods: use EFS (AWS), Azure Files, or Filestore (GCP) with ReadWriteMany access mode.
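A StorageClass with topology-aware binding might look like this, assuming the AWS EBS CSI driver (name and parameters are illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-wait
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer   # PV is created in the pod's zone, not before
reclaimPolicy: Retain                     # keep the volume if the claim is deleted
```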
Always take regular snapshots through VolumeSnapshots. Test restore procedures quarterly — a backup you have never restored is not a backup.
Infrastructure as Code (Questions 21-30)
21. How do you structure Terraform for a large organization?
Separate state files by blast radius and change frequency. A common pattern: networking (VPC, subnets, VPN) in one state, shared services (monitoring, logging, CI runners) in another, and each application in its own state. Use Terraform modules for reusable components and a private module registry for organization-wide standards.
Remote state with locking (S3 + DynamoDB, GCS, Azure Blob) prevents concurrent modifications. State file access is restricted through IAM policies — only the CI pipeline and designated operators can run terraform apply.
22. What is Terraform state drift, and how do you handle it?
State drift occurs when the actual infrastructure diverges from what Terraform state records — someone modifies a resource through the console, another tool modifies it, or an external process changes configuration. Detect drift with terraform plan (run on a schedule in CI). Remediate by either importing the change into state (terraform import) or reverting the manual change.
Prevention: restrict console write access in production, use AWS Config or Azure Policy to detect out-of-band changes, and run drift detection on a daily schedule with alerts.
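Scheduled drift detection can be a small CI job — a sketch using GitHub Actions (workflow details are assumptions):

```yaml
name: terraform-drift
on:
  schedule:
    - cron: "0 6 * * *"     # daily
jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
      - run: terraform plan -detailed-exitcode -input=false
        # -detailed-exitcode returns 2 when drift exists: the job fails
        # and your normal CI alerting surfaces it
```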
23. Explain Terraform workspaces versus directory-based environment separation.
Workspaces share the same configuration but maintain separate state files. They work for simple differences (instance sizes, replica counts) but become unwieldy when environments have structural differences. Directory-based separation (envs/dev/, envs/staging/, envs/production/) gives each environment its own configuration and state, with shared modules providing consistency.
Most mature teams use directory-based separation because it provides explicit, reviewable differences between environments and avoids the risk of applying a production change to the wrong workspace.
24. How do you handle Terraform module versioning?
Publish modules to a private registry (Terraform Cloud, GitLab, Artifactory) with semantic versioning. Consumers pin to a version constraint (~> 2.1 allows 2.1.x but not 2.2.0). Breaking changes increment the major version. Module changes go through the same code review and testing as application code.
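Consuming a registry module with a pinned constraint looks like this (registry host, module path, and inputs are placeholders):

```hcl
module "vpc" {
  source  = "registry.example.com/platform/vpc/aws"
  version = "~> 2.1"   # allows 2.1.x, blocks 2.2.0 and above
  cidr    = "10.0.0.0/16"
}
```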
For cross-team modules: maintain a CHANGELOG, test with Terratest, and provide migration guides for major version bumps.
25. Describe how you would migrate existing infrastructure into Terraform.
Use terraform import to bring existing resources under Terraform management. The process: inventory all resources (AWS Config, Azure Resource Graph, GCP Asset Inventory), write the Terraform configuration for each resource, import each resource into state, run terraform plan to verify no changes are detected, and then commit.
Critical: never import into a state file that manages other resources until you have verified the plan is clean. A misconfigured import can trigger destruction of existing resources.
26-30. Additional IaC questions cover: policy-as-code with OPA/Sentinel, managing Terraform provider upgrades, handling secrets in Terraform (never in state — use external data sources or SOPS), Terraform vs Pulumi vs CDK trade-offs, and managing Terraform at scale with Terragrunt or Spacelift.
Monitoring, Observability, and Incident Response (Questions 31-40)
31. What is the difference between monitoring and observability?
Monitoring answers predefined questions: "Is the CPU above 80%? Is the error rate above 1%?" Observability enables answering questions you did not predict: "Why is latency high for users in Germany but not the US? Why did this specific order fail?" Observability requires three pillars: metrics (aggregated measurements), logs (discrete events with context), and traces (request paths through distributed services).
32. How do you design alerts that are actionable?
Every alert must answer: what is broken, who is affected, and what should the on-call engineer do first. An alert that says "CPU is high" is useless. An alert that says "API latency p99 exceeds 2s — 15% of checkout requests are timing out — check API server resource utilization and database connection pool" is actionable.
Reduce alert noise by: alerting on symptoms (user impact) not causes (CPU usage), using multi-window multi-burn-rate alerting for SLOs, grouping related alerts, and suppressing alerts during known maintenance windows.
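One window of a multi-window burn-rate alert for a 99.9% SLO might look like this as a Prometheus rule (metric names follow common conventions but are assumptions here; a full implementation pairs this fast window with a slower one):

```yaml
groups:
  - name: slo-burn-rate
    rules:
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Checkout API burning error budget at 14.4x sustainable rate"
          runbook: "Check recent deploys, then database connection pool saturation"
```

The 14.4 multiplier is the conventional fast-burn threshold: at that rate, a 30-day error budget is gone in about two days.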
33. Describe your incident management process.
Detection (automated alert or user report), triage (severity assignment: SEV1-SEV4), communication (status page update, stakeholder notification), mitigation (restore service, even with a temporary fix), root cause analysis (after service is stable), remediation (permanent fix), and postmortem (blameless writeup with action items and deadlines).
SEV1 criteria: revenue-impacting outage, data loss risk, security breach. SEV1 response: incident commander, dedicated communication channel, 15-minute status updates, all-hands until mitigated.
34. How do you implement SLOs and error budgets?
Define SLIs (Service Level Indicators): latency p99, error rate, availability. Set SLOs based on user expectations: "99.9% of requests complete in under 500ms over a 30-day window." The error budget is the allowed failure: 0.1% of 30 days = 43.2 minutes of downtime.
When the error budget is healthy (plenty remaining), ship features aggressively. When the error budget is depleted, freeze deployments and focus on reliability. This creates a concrete, data-driven balance between velocity and stability.
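The arithmetic behind the 43.2-minute figure generalizes to any SLO and window:

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of allowed downtime for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return (1 - slo) * total_minutes

print(round(error_budget_minutes(0.999, 30), 2))   # 43.2 minutes for 99.9%
print(round(error_budget_minutes(0.9999, 30), 2))  # 4.32 minutes for 99.99%
```

The order-of-magnitude drop from each extra nine is why SLO targets should come from user expectations, not aspiration: every nine multiplies the engineering cost.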
35-40. Additional observability questions cover: distributed tracing implementation, log aggregation architecture, chaos engineering practices, capacity planning methodology, monitoring Kubernetes cluster health, and cost monitoring and anomaly detection.
System Design and Architecture (Questions 41-50)
41. Design a deployment pipeline for a regulated environment (HIPAA/SOC2).
Add compliance gates: SAST and SCA scans must pass with zero critical findings, infrastructure changes require approval from the security team (enforced through branch protection rules), all deployments are logged to an immutable audit trail, secrets are rotated automatically, container images are signed and verified before deployment, and network policies enforce microsegmentation.
42. How do you design for multi-region high availability?
Active-active with global load balancing (Route 53, Azure Traffic Manager, Cloud DNS). Database replication across regions (Aurora Global Database, Cosmos DB multi-region, Cloud Spanner). Stateless application tier in each region. Circuit breaker patterns for cross-region dependencies. Regular failover testing (monthly).
43-50. Additional design questions cover: migrating from monolith to microservices, designing an internal developer platform, implementing zero-trust network architecture, designing a multi-tenant SaaS platform, implementing disaster recovery with RPO/RTO targets, designing a platform for ML model serving, managing technical debt in infrastructure, and building a FinOps practice.
Preparing for the Interview
Study the concepts, but practice articulating your real experience using the STAR method (Situation, Task, Action, Result) with specific metrics. "I reduced deployment time from 45 minutes to 8 minutes by implementing parallel test execution and Docker layer caching" is stronger than "I improved CI/CD pipelines."
Citadel Cloud Management's DevOps courses cover every topic in this guide with hands-on labs that build production experience, not just theoretical knowledge. The Career Resources collection includes interview preparation guides, system design templates, and salary negotiation frameworks for senior DevOps roles.
For deep dives into specific technical areas, the DevOps Tools and Security Frameworks collections provide production-ready configurations you can study, implement, and reference.
Preparing for DevOps interviews? Start with Citadel's free courses to fill knowledge gaps and build hands-on experience across Kubernetes, Terraform, CI/CD, and observability. Browse all resources for comprehensive preparation materials.