| Networking | Full VPC control | VPC attachment adds cold start latency |
| Best for | Long-running services, stateful workloads | Event-driven processing, APIs with variable load |
| Operational overhead | Medium (cluster management, node patching) | Low (provider manages everything) |
7. Explain VPC design for a production workload.
A production VPC design follows this pattern:
- CIDR: size it for growth. A /16 gives 65,536 IPs. Never start with a /24: you will run out
- Subnets: minimum 3 AZs. Each AZ gets public, private, and isolated subnets. Public for load balancers, private for application instances, isolated for databases
- NAT gateways: one per AZ for high availability. Cost optimization: share one NAT gateway across AZs in non-production environments
- Security groups: stateful, instance-level. Allow only required ports. Never open inbound rules to 0.0.0.0/0 except on the listener ports of an internet-facing load balancer (typically 443)
- NACLs: stateless, subnet-level. Use as a secondary defense for compliance requirements
- VPC Flow Logs: enabled on all subnets, sent to CloudWatch Logs or S3 for audit
- Transit Gateway: for multi-VPC or multi-account connectivity, replace VPC peering with Transit Gateway at 3+ VPCs
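As a rough illustration of the CIDR and subnet points above, the following sketch uses Python's standard `ipaddress` module to carve a /16 into one /20 per AZ and tier. The AZ labels, tier names, the /20 size, and the `plan_subnets` helper are assumptions made for the example, not part of any AWS API.

```python
import ipaddress


def plan_subnets(vpc_cidr, azs, tiers, new_prefix):
    """Assign one equally sized subnet to every (AZ, tier) pair.

    Hypothetical helper: blocks are handed out sequentially from the
    start of the VPC range, which is only one of many possible layouts.
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = vpc.subnets(new_prefix=new_prefix)  # generator of /new_prefix blocks
    plan = {}
    for az in azs:
        for tier in tiers:
            plan[(az, tier)] = next(blocks)
    return plan


plan = plan_subnets(
    "10.0.0.0/16",
    azs=["a", "b", "c"],
    tiers=["public", "private", "isolated"],
    new_prefix=20,
)

for (az, tier), net in plan.items():
    # AWS reserves 5 addresses in every subnet, so usable hosts are lower.
    print(f"az={az} tier={tier:<8} cidr={net} usable={net.num_addresses - 5}")
```

A /16 split into /20 blocks yields 16 subnets, so the 9 used here leave 7 spare blocks for a fourth AZ or an extra tier.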
8. How do you implement zero-downtime deployments?
Three primary strategies:
- Blue-green: maintain two identical environments. Deploy to the inactive environment, test, then switch traffic via DNS or load balancer. Rollback is instant (switch back). Cost: 2x infrastructure during deployment
- Canary: route a small percentage of traffic (1-5%) to the new version. Monitor error rates and latency. Gradually increase traffic if metrics are healthy. Rollback: route 100% back to the old version
- Rolling: replace instances one at a time (or in batches). ECS and Kubernetes support this natively. Minimum healthy percentage ensures capacity during rollout
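The canary strategy above boils down to a small decision loop. The sketch below is one possible shape for it; the step schedule, the thresholds, and the `CanaryMetrics` and `next_canary_weight` names are illustrative assumptions rather than the behavior of any particular deployment tool.

```python
from dataclasses import dataclass


@dataclass
class CanaryMetrics:
    error_rate: float       # fraction of failed requests, e.g. 0.002 for 0.2%
    p99_latency_ms: float   # 99th percentile latency of the new version


def next_canary_weight(current, metrics, steps=(1, 5, 25, 50, 100),
                       max_error_rate=0.01, max_p99_ms=500.0):
    """Return the next traffic percentage for the new version.

    Returns 0 to signal a rollback (all traffic back to the old version).
    The thresholds and step schedule are assumed values for illustration.
    """
    healthy = (metrics.error_rate <= max_error_rate
               and metrics.p99_latency_ms <= max_p99_ms)
    if not healthy:
        return 0
    for step in steps:
        if step > current:
            return step
    return 100  # already fully promoted


good = CanaryMetrics(error_rate=0.001, p99_latency_ms=200.0)
bad = CanaryMetrics(error_rate=0.05, p99_latency_ms=200.0)
print(next_canary_weight(1, good))  # healthy at 1%: move to the next step
print(next_canary_weight(5, bad))   # unhealthy: roll back
```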
For database schema changes, use the expand-and-contract pattern: add the new column/table first (expand), deploy application code that uses both old and new schemas, then remove the old column (contract) in a subsequent deployment.
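A minimal sketch of expand-and-contract, using an in-memory SQLite database so it can run anywhere. The `users` table, the `fullname` and `display_name` columns, and the `save_user_name` helper are hypothetical; in a real rollout each phase would ship as a separate deployment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")
conn.execute("INSERT INTO users (fullname) VALUES ('Grace Hopper')")

# Phase 1 (expand): add the new column. Old application code keeps working
# because it never references display_name.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


# Phase 2 (transition): new application code writes both columns...
def save_user_name(conn, user_id, name):
    conn.execute(
        "UPDATE users SET fullname = ?, display_name = ? WHERE id = ?",
        (name, name, user_id),
    )


save_user_name(conn, 1, "Ada King")

# ...and a backfill copies old values for rows not yet rewritten.
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")

# Phase 3 (contract): once nothing reads fullname, remove it. The table is
# rebuilt here instead of using DROP COLUMN so the sketch also works on
# older SQLite versions.
conn.executescript("""
    CREATE TABLE users_new (id INTEGER PRIMARY KEY, display_name TEXT);
    INSERT INTO users_new (id, display_name) SELECT id, display_name FROM users;
    DROP TABLE users;
    ALTER TABLE users_new RENAME TO users;
""")

print(conn.execute("SELECT id, display_name FROM users ORDER BY id").fetchall())
```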
9. What is Infrastructure as Code, and which tool would you recommend?
Infrastructure as Code treats infrastructure provisioning as software: version-controlled, reviewed, tested, and repeatable. The three major tools in 2026 are Terraform (multi-cloud, HCL), CloudFormation (AWS-native, YAML/JSON), and Pulumi (multi-cloud, general-purpose languages).
My recommendation depends on context. For multi-cloud: Terraform. For AWS-only with compliance requirements: CloudFormation. For developer-heavy teams: Pulumi. See our detailed IaC comparison for a full analysis.
10. How do you design for cost optimization in the cloud?
Cost optimization is a continuous practice, not a one-time exercise:
- Right-sizing: analyze CPU and memory utilization over 14+ days. Most EC2 instances are over-provisioned by 40-60%. Use AWS Compute Optimizer or third-party tools (Spot.io, CloudHealth)
- Reserved capacity: commit to 1-year or 3-year reservations for steady-state workloads. Savings: 30-60% versus on-demand
- Spot instances: use for fault-tolerant workloads (batch processing, CI/CD runners, stateless workers). Savings: 60-90%
- Storage tiering: implement lifecycle policies. S3 Standard → Infrequent Access (30 days) → Glacier (90 days) → Deep Archive (365 days)
- Serverless where appropriate: Lambda is cheaper than a 24/7 container for workloads with <1M invocations/month and sporadic traffic patterns
- Tagging: enforce resource tags for cost allocation. Every resource must have `team`, `environment`, and `project` tags
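The storage tiering schedule above can be written down as a lifecycle rule. The sketch below builds a plain dictionary in the shape that, to my understanding, boto3's `put_bucket_lifecycle_configuration` expects; the rule ID, the `logs/` prefix, and both helper functions are assumptions for the example, and nothing here calls AWS.

```python
def lifecycle_rule(rule_id, prefix=""):
    """Build one S3 lifecycle rule mirroring the 30/90/365-day schedule."""
    return {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
            {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
        ],
    }


def storage_class_for_age(age_days, transitions):
    """Return the storage class an object of the given age would sit in.

    Local helper for reasoning about the rule; S3 applies transitions itself.
    """
    current = "STANDARD"
    for transition in sorted(transitions, key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


rule = lifecycle_rule("tier-down-logs", prefix="logs/")
lifecycle_configuration = {"Rules": [rule]}

for age in (10, 45, 120, 400):
    print(age, storage_class_for_age(age, rule["Transitions"]))
```

With real credentials, the dictionary would be passed as the `LifecycleConfiguration` argument alongside a `Bucket` name.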
Security and Compliance Questions (11-18)
11. How do you implement the principle of least privilege in AWS?
Start with zero permissions and add only what is required. Specific tactics: use IAM Access Analyzer to identify unused permissions. Implement permission boundaries on IAM users and roles. Use AWS Organizations SCPs to set guardrails across accounts. Review CloudTrail logs monthly to identify permissions granted but never used. Automate permission reviews with custom Lambda functions that flag roles with `*` actions.
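The last tactic, flagging roles with `*` actions, could look roughly like the sketch below. It inspects an IAM policy document that has already been loaded as a dictionary; the `wildcard_statements` name and the sample policy are made up for illustration, and the check deliberately ignores partial wildcards such as `s3:Get*` and `NotAction` statements.

```python
def wildcard_statements(policy):
    """Return Allow statements whose Action is "*" or "service:*"."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):   # a single statement may be a bare object
        statements = [statements]
    flagged = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):   # Action may be a string or a list
            actions = [actions]
        if any(a == "*" or a.endswith(":*") for a in actions):
            flagged.append(stmt)
    return flagged


sample_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::example-bucket/*"},
        {"Effect": "Allow", "Action": ["ec2:*"], "Resource": "*"},
        {"Effect": "Deny", "Action": "*", "Resource": "*"},
    ],
}

flagged = wildcard_statements(sample_policy)
for stmt in flagged:
    print("overly broad:", stmt["Action"])
```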
12. Explain encryption at rest and in transit. When do you need both?
Always. Encryption at rest protects data stored on disk (EBS, S3, RDS). Use AWS KMS with customer-managed keys for sensitive data. Encryption in transit protects data moving between services (TLS 1.2 or 1.3). For compliance (HIPAA, PCI-DSS, SOC 2), both are mandatory without exception.
13. How would you design a secrets management strategy?
Use a centralized secrets manager (AWS Secrets Manager, HashiCorp Vault). Rotate secrets automatically on a schedule (90 days for API keys, 365 days for certificates). Never store secrets in environment variables, code, or configuration files. Use IAM roles for service-to-service authentication instead of shared secrets where possible. Audit secret access through CloudTrail.
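The rotation schedule above is easy to audit mechanically. This is a minimal sketch assuming you can obtain each secret's last rotation date from your secrets manager; the `ROTATION_DAYS` table and `rotation_due` helper are hypothetical names.

```python
from datetime import date, timedelta

# Maximum age per secret type, taken from the schedule described above.
ROTATION_DAYS = {"api_key": 90, "certificate": 365}


def rotation_due(kind, last_rotated, today):
    """Return True when a secret of this kind has reached its maximum age."""
    return today - last_rotated >= timedelta(days=ROTATION_DAYS[kind])


print(rotation_due("api_key", date(2026, 1, 1), date(2026, 3, 1)))
print(rotation_due("api_key", date(2026, 1, 1), date(2026, 4, 1)))
```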
14. What is a landing zone, and why does it matter?
A landing zone is a pre-configured, secure, multi-account AWS environment that follows best practices. AWS Control Tower automates landing zone setup with guardrails, centralized logging (CloudTrail, Config), and account vending. It matters because retrofitting security and governance onto an existing multi-account environment is 5-10x harder than building it correctly from the start.
15. How do you approach compliance (HIPAA, SOC 2, PCI-DSS) in cloud architecture?
Map compliance controls to cloud services. Use AWS Artifact for compliance reports. Enable AWS Config rules that continuously monitor for compliance drift. Implement network segmentation (separate VPCs or accounts for regulated workloads). Encrypt everything. Log everything. Restrict access. Automate evidence collection for auditors. The biggest mistake: treating compliance as a one-time checklist instead of continuous monitoring.
16. Explain the difference between security groups and NACLs.
Security groups are stateful (return traffic is automatically allowed), operate at the instance level, and support allow rules only. NACLs are stateless (return traffic must be explicitly allowed), operate at the subnet level, and support both allow and deny rules. Use security groups as the primary firewall. Use NACLs as an additional layer for compliance requirements or to block specific IP ranges at the subnet level.
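To make the NACL half of that comparison concrete: NACL rules are evaluated in ascending rule-number order, the first match wins, and anything unmatched is denied. The sketch below models only inbound evaluation with a simplified rule shape; the dictionary fields and the `evaluate_nacl` function are assumptions for illustration, not the AWS data model.

```python
import ipaddress


def evaluate_nacl(rules, src_ip, port):
    """Return "allow" or "deny" for an inbound packet.

    Rules are checked lowest number first; the first matching rule decides.
    """
    source = ipaddress.ip_address(src_ip)
    for rule in sorted(rules, key=lambda r: r["number"]):
        in_cidr = source in ipaddress.ip_network(rule["cidr"])
        in_ports = rule["from_port"] <= port <= rule["to_port"]
        if in_cidr and in_ports:
            return rule["action"]
    return "deny"  # implicit final deny


inbound_rules = [
    # An explicit deny for one range, numbered below the broad allow.
    {"number": 100, "action": "deny", "cidr": "203.0.113.0/24",
     "from_port": 0, "to_port": 65535},
    {"number": 200, "action": "allow", "cidr": "0.0.0.0/0",
     "from_port": 443, "to_port": 443},
]

print(evaluate_nacl(inbound_rules, "203.0.113.7", 443))
print(evaluate_nacl(inbound_rules, "198.51.100.1", 443))
```

A security group cannot express the rule numbered 100, because security groups support allow rules only.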
17. How do you protect against DDoS attacks in the cloud?
Layer the defenses: AWS Shield Standard (free, automatic L3/L4 protection). CloudFront for edge distribution and absorption. WAF rules for L7 protection (rate limiting, geo-blocking, SQL injection filtering). Shield Advanced ($3,000/month) for dedicated response team and cost protection. Auto-scaling to absorb traffic spikes. Route 53 health checks for DNS-level failover.
18. What is your approach to identity federation?
Use SAML 2.0 or OIDC to federate corporate identities into cloud environments. Never create IAM users for human access; use SSO through AWS IAM Identity Center (formerly AWS SSO) connected to your corporate IdP (Okta, Microsoft Entra ID (formerly Azure AD), Google Workspace). Service-to-service: use IAM roles with IRSA (IAM Roles for Service Accounts) on EKS, or instance profiles for EC2.
Design and Architecture Questions (19-26)
19. Design a system that processes 10,000 events per second.
Ingestion: Kinesis Data Streams (or Kafka on MSK) with appropriate shard count. Processing: Lambda consumers for simple transformations, or ECS/EKS for complex stateful processing. Storage: DynamoDB for real-time lookups, S3 for raw event archive, Redshift or Athena for analytics. Key design decisions: partition key selection for even distribution, dead-letter queues for failed events, and exactly-once processing semantics (in practice, at-least-once delivery combined with idempotent consumers).
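"Appropriate shard count" can be estimated with simple arithmetic. The sketch below assumes the per-shard write limits I understand AWS to document for provisioned Kinesis streams (1,000 records per second and roughly 1 MB per second); verify them against current quotas before relying on the numbers. The `required_shards` helper and the `headroom` factor are illustrative.

```python
import math

# Assumed per-shard write limits for a provisioned Kinesis data stream.
RECORDS_PER_SHARD_PER_SEC = 1_000
BYTES_PER_SHARD_PER_SEC = 1_000_000


def required_shards(events_per_sec, avg_event_bytes, headroom=1.0):
    """Estimate shards needed for ingestion; headroom > 1 adds spare capacity."""
    by_count = events_per_sec * headroom / RECORDS_PER_SHARD_PER_SEC
    by_bytes = events_per_sec * avg_event_bytes * headroom / BYTES_PER_SHARD_PER_SEC
    # Whichever limit is hit first determines the shard count.
    return max(1, math.ceil(max(by_count, by_bytes)))


print(required_shards(10_000, avg_event_bytes=500))    # record-count bound
print(required_shards(10_000, avg_event_bytes=2_000))  # throughput bound
```

Small events are bound by the record limit and large events by the byte limit, which is why the average payload size belongs in the answer.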
20. How do you design a disaster recovery strategy?
Four DR tiers with increasing cost and lower RTO/RPO: Backup and restore (hours RTO, 24h RPO, lowest cost). Pilot light (minutes RTO, minutes RPO, keep minimum infrastructure running). Warm standby (seconds to minutes RTO, near-zero RPO, scaled-down replica). Multi-site active-active (near-zero RTO/RPO, highest cost). Choose based on business impact analysis and budget.
21. Explain the differences between synchronous and asynchronous communication in microservices.
Synchronous (REST, gRPC): caller waits for response. Simple to implement, creates tight coupling, propagates failures. Use for read operations and time-sensitive requests. Asynchronous (SQS, SNS, EventBridge, Kafka): caller sends and continues. Loose coupling, inherent retry capability, eventual consistency. Use for writes, background processing, and cross-service communication.
22. How do you implement observability in a distributed system?
Three pillars: Logs (structured JSON, centralized in CloudWatch or ELK/OpenSearch). Metrics (Prometheus/CloudWatch, track the four golden signals: latency, traffic, errors, saturation). Traces (OpenTelemetry with X-Ray or Jaeger, trace ID propagation across services). Add SLOs on top: define measurable reliability targets and alert when error budgets are consumed.
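The error budget behind an SLO is plain arithmetic: a 99.9% target over 30 days leaves roughly 43.2 minutes of allowed unavailability. A small sketch follows, with hypothetical function names and a request-based budget as one of several possible definitions.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full unavailability the SLO permits over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of a request-based error budget still unspent.

    Negative values mean the budget is exhausted.
    """
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return 1.0 - failed_requests / allowed_failures


print(round(error_budget_minutes(0.999), 1))
print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

An alert on `budget_remaining` dropping below a chosen threshold is one way to implement "alert when error budgets are consumed".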
23. What is a service mesh, and when should you use one?
A service mesh (Istio, Linkerd, AWS App Mesh) provides network-level features (mTLS, traffic management, observability) without application code changes. Use when you have 20+ microservices and need consistent security policies, traffic splitting for canary deployments, or circuit breakers across services. Do not use for fewer than 10 services — the operational overhead is not justified.
24. How do you handle data consistency across microservices?
Accept eventual consistency as the default. Use the Saga pattern for distributed transactions: choreography (services emit events) or orchestration (a central coordinator manages the flow). Use the Outbox pattern to ensure atomic writes to the database and event publication. For strong consistency requirements, consider using a single service boundary instead of splitting across services.
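A minimal sketch of the Outbox pattern, using in-memory SQLite in place of a production database. The business row and the event row are written in one transaction, and a separate relay publishes pending events afterwards. Table names, the `OrderPlaced` event, and the `place_order` and `relay_once` functions are hypothetical, and `publish` stands in for whatever broker client you use.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total_cents INTEGER);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY, event_type TEXT,
                         payload TEXT, published INTEGER DEFAULT 0);
""")


def place_order(conn, customer, total_cents):
    """Write the order and its event atomically: both commit or neither does."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "INSERT INTO orders (customer, total_cents) VALUES (?, ?)",
            (customer, total_cents),
        )
        order_id = cur.lastrowid
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderPlaced", json.dumps({"order_id": order_id,
                                        "total_cents": total_cents})),
        )
    return order_id


def relay_once(conn, publish):
    """Publish pending events in order and mark each one as sent."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        # A crash after publish but before the UPDATE causes a re-send,
        # so consumers should be idempotent.
        publish(event_type, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


sent = []
place_order(conn, "acme", 1999)
relay_once(conn, lambda event_type, payload: sent.append((event_type, payload)))
print(sent)
```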
25. Explain the sidecar pattern and give an example.
A sidecar is a helper container deployed alongside the main application container in the same pod (Kubernetes) or task (ECS). Examples: Envoy proxy for service mesh, Fluentd for log collection, Vault agent for secrets injection, OpenTelemetry collector for trace export. The sidecar handles cross-cutting concerns without modifying application code.
26. How do you design a multi-tenant SaaS architecture?
Three isolation models: Silo (separate infrastructure per tenant — most isolated, most expensive). Pool (shared infrastructure, data isolation through tenant ID in every query). Bridge (shared compute, separate databases per tenant). Choose based on compliance requirements, tenant size variation, and cost targets. Most SaaS platforms use pool for small tenants and silo for enterprise tenants.
Behavioral and Leadership Questions (27-30)
27. Describe a time you had to push back on a technical decision.
Structure your answer: situation, the decision you disagreed with, the data you presented, the outcome. Focus on how you influenced without authority. Example: recommending against a premature microservices migration by presenting the operational cost analysis and proposing a phased approach instead.
28. How do you evaluate build vs. buy decisions?
Framework: total cost of ownership (not just license cost — include integration, maintenance, training). Time to market. Strategic differentiation (build what differentiates you, buy commodity). Team capability. Vendor risk (financial stability, lock-in, exit strategy). I have seen organizations waste $500K+ building custom solutions that were available as $50K/year SaaS products.
29. How do you keep your architecture knowledge current?
Read AWS/Azure/GCP release notes weekly. Follow re:Invent, Build, and Next announcements. Maintain a personal lab environment for hands-on testing. Participate in architecture review boards. Read post-mortems from other organizations. Pursue certifications on a 2-year cycle.
30. How do you communicate architecture decisions to non-technical stakeholders?
Focus on business outcomes, not technical details. Use diagrams (C4 model for architecture, sequence diagrams for flows). Quantify: "This reduces our downtime from 4 hours/year to 5 minutes/year, protecting $X in revenue." Frame trade-offs in terms stakeholders understand: cost, timeline, risk. Document decisions in ADRs (Architecture Decision Records) that capture context, alternatives, and consequences.
Frequently Asked Questions
How many questions should I expect in a cloud architect interview?
Typically 4-6 deep questions in a 60-minute technical round, not 30 rapid-fire questions. Interviewers probe depth on each answer. Expect 2-3 system design questions and 1-2 behavioral questions per round, with 3-5 interview rounds total.
Which certifications help most for cloud architect interviews?
AWS Solutions Architect Professional (SAP-C02) carries the most weight. Follow our AWS certification roadmap for the optimal sequence. The Certified Kubernetes Administrator (CKA) certification is increasingly expected for architect roles in 2026.
Should I specialize in one cloud provider?
For your first architect role, deep expertise in one provider (typically AWS) is more valuable than shallow knowledge across three. After establishing yourself, cross-cloud knowledge becomes valuable for senior and principal architect positions.
Preparation Resources
Explore our free courses program for hands-on labs covering the services and patterns discussed in these questions. For salary benchmarking, see our DevOps vs SRE comparison which includes architect-track compensation data. Browse the cloud certifications collection for structured preparation paths.
*Sources: author's direct interview experience (200+ interviews conducted, 2018-2026), AWS Well-Architected Framework, Google SRE Book, Martin Fowler's microservices patterns, DORA research program, enterprise architecture standards from Lockheed Martin and Cigna Healthcare.*