
Citadel Cloud Management
Database Sharding and Replication Blueprint
Architecture Blueprints
Created by Kenny Ogunlowo
Product Description
The Problem This Blueprint Solves
Your production database is a single RDS instance that your DBA configured by hand in the console 18 months ago. Automated backups stop at the 7-day default retention. There are no read replicas, so reporting queries slow down the primary. Each application manages its own connection pool, which leaves 800 idle connections open. A failover test last quarter took 4 minutes, during which your application returned 500 errors because the connection string was hardcoded to the primary endpoint.
This blueprint is the database architecture I designed for a fintech platform running Aurora PostgreSQL with 99.995% measured availability, handling 28,000 transactions per second with sub-5ms P99 read latency and automated failover in under 30 seconds.
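To put the availability figure in context, here is a small arithmetic sketch that converts an availability percentage into a yearly downtime budget and shows how much of that budget a single failover consumes. The function name and the 365-day year are illustrative assumptions, not part of the blueprint.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_budget_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime allowed in the period for a given availability target."""
    return period_minutes * (100 - availability_pct) / 100


budget = downtime_budget_minutes(99.995)  # roughly 26.3 minutes per year

# Share of the yearly budget used by one failover of a given length.
for label, minutes in [("4-minute failover", 4.0), ("30-second failover", 0.5)]:
    print(f"{label}: {minutes / budget:.1%} of the yearly budget")
```

At 99.995%, one 4-minute failover uses roughly 15% of the yearly budget, while a 30-second failover uses about 2%.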
What You Get
- Architecture diagrams — Multi-AZ cluster topology, read replica routing, connection pooling layer, backup and recovery pipeline, monitoring dashboard architecture (Draw.io)
- Terraform modules — Aurora PostgreSQL cluster with Multi-AZ, RDS Proxy for connection pooling, automated snapshot management with cross-region copy, Parameter Group tuning, Performance Insights configuration, and CloudWatch alarms for key database metrics
- Operational runbook — Failover procedure, slow query investigation playbook, connection pool troubleshooting, backup restoration steps, and major version upgrade procedure
- Performance tuning guide — PostgreSQL parameter recommendations by workload type, index strategy methodology, query optimization patterns, and vacuum tuning guidelines
Key Architecture Decisions
- Aurora over standard RDS for production workloads: Aurora's storage layer automatically keeps 6 copies of the data across 3 AZs, grows to 128 TiB without pre-provisioning, and fails over faster (typically 15-30 seconds) than standard Multi-AZ RDS (60-120 seconds). The roughly 20% price premium pays for itself in operational simplicity and reliability.
- RDS Proxy over self-managed connection pooling: Self-managed pools (PgBouncer as a separate service, HikariCP inside each application) have to be deployed, sized, and maintained per application. RDS Proxy is managed, scales automatically, keeps client connections open across a failover, and supports IAM authentication. One proxy serves every application connecting to the same cluster.
- Reader endpoint plus a custom endpoint for analytics: The default reader endpoint spreads connections across all replicas. Custom endpoints let you route OLTP reads to one set of replicas and heavy analytics queries to a separate, larger replica, so analytics queries do not compete with production reads for CPU and memory.
- Automated cross-region snapshot copy over manual backup: Automated snapshots stay in the same region as the cluster, so a Lambda function triggered by snapshot completion copies each one to a DR region. If the primary region fails, you restore from the cross-region copy. Manual backup procedures depend on humans remembering to run them.
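The cross-region snapshot copy could be sketched as a Lambda handler like the one below. This is a minimal illustration, not the function shipped with the blueprint. It assumes the triggering EventBridge event carries the snapshot ARN in detail.SourceArn, and the environment variables DR_REGION and DR_KMS_KEY_ID, the "dr-" naming prefix, and the helper name build_copy_request are all hypothetical. The only AWS call used is the boto3 RDS client's copy_db_cluster_snapshot.

```python
import os


def build_copy_request(event, kms_key_id=None):
    """Build the arguments for copy_db_cluster_snapshot from a snapshot event.

    Assumes the event carries the snapshot ARN in event["detail"]["SourceArn"].
    """
    source_arn = event["detail"]["SourceArn"]
    source_region = source_arn.split(":")[3]
    snapshot_name = source_arn.split(":cluster-snapshot:", 1)[1]
    # Automated snapshots are named "rds:<name>", but target identifiers
    # may not contain colons, so drop the prefix and replace any that remain.
    if snapshot_name.startswith("rds:"):
        snapshot_name = snapshot_name[len("rds:"):]
    request = {
        "SourceDBClusterSnapshotIdentifier": source_arn,
        "TargetDBClusterSnapshotIdentifier": "dr-" + snapshot_name.replace(":", "-"),
        "SourceRegion": source_region,
        "CopyTags": True,
    }
    if kms_key_id:
        # Encrypted snapshots need a KMS key that exists in the DR region.
        request["KmsKeyId"] = kms_key_id
    return request


def handler(event, context):
    # Imported lazily so build_copy_request can be tested without boto3.
    import boto3

    # The copy is requested from the destination (DR) region.
    rds = boto3.client("rds", region_name=os.environ["DR_REGION"])
    request = build_copy_request(event, os.environ.get("DR_KMS_KEY_ID"))
    response = rds.copy_db_cluster_snapshot(**request)
    return response["DBClusterSnapshot"]["DBClusterSnapshotArn"]
```

Keeping the request construction in a pure function makes the naming and region logic easy to unit test before the handler ever touches an AWS account.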
Who This Blueprint Is For
- Database Administrators migrating from self-managed PostgreSQL to Aurora
- Backend Engineers building applications that need high-availability database access
- Platform teams standardizing database infrastructure for multiple product teams
- SREs responsible for database reliability and on-call response for database incidents
Your First 48 Hours
On day one, deploy the Terraform module for the Aurora cluster with RDS Proxy into a sandbox account. Connect your application through RDS Proxy and verify that connections are pooled: pg_stat_activity should show fewer backend connections than your applications have open. On day two, trigger a manual failover with aws rds failover-db-cluster and measure how long it takes. Verify that your application sees zero connection errors during the failover when connected through RDS Proxy, and compare that with connecting directly to the cluster endpoint.
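One way to measure the day-two failover is to run a small probe loop while the failover happens and report the longest window in which the database was unreachable. The sketch below is an assumption about how you might do this, not a tool included in the blueprint. The check callable is hypothetical: it should run a trivial query such as SELECT 1 through whichever endpoint you are testing and return True on success.

```python
import time


def run_probe(check, duration_s=120.0, interval_s=0.5):
    """Call check() repeatedly and record (seconds_since_start, ok) samples."""
    samples = []
    start = time.monotonic()
    while True:
        elapsed = time.monotonic() - start
        if elapsed >= duration_s:
            break
        try:
            ok = bool(check())
        except Exception:
            # Any connection or query error counts as a failed probe.
            ok = False
        samples.append((elapsed, ok))
        time.sleep(interval_s)
    return samples


def longest_outage(samples):
    """Return the longest span from a first failed probe to the next success."""
    longest = 0.0
    outage_start = None
    for t, ok in samples:
        if not ok and outage_start is None:
            outage_start = t
        elif ok and outage_start is not None:
            longest = max(longest, t - outage_start)
            outage_start = None
    if outage_start is not None:
        # The probe ended while still failing; count up to the last sample.
        longest = max(longest, samples[-1][0] - outage_start)
    return longest
```

Run the probe once against the RDS Proxy endpoint and once against the cluster endpoint, starting it shortly before you issue the failover command. The resolution of the result is limited by interval_s, so a shorter interval gives a tighter measurement.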
Limitations and Trade-offs
Aurora PostgreSQL does not support every PostgreSQL extension, so check compatibility before migrating workloads that depend on extensions such as PostGIS, TimescaleDB, or pgvector (Aurora supports pgvector as of PostgreSQL 15.4). RDS Proxy adds 1-2ms of latency per query because of its connection multiplexing layer, which is negligible for most workloads but measurable when you need sub-millisecond latency. Restoring a cross-region snapshot creates a new cluster with a new endpoint, so applications need updated connection strings unless they connect through a Route 53 CNAME record. Aurora Serverless v2 can pause to 0 ACU when idle on engine versions that support auto-pause, which suits dev environments; otherwise its minimum is 0.5 ACU (about $43/month), so standard provisioned instances may be cheaper for predictable workloads.
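The $43/month figure for a 0.5 ACU floor can be reproduced with simple arithmetic. The sketch below assumes a price of $0.12 per ACU-hour and a 730-hour month. Both are illustrative assumptions: prices vary by region and change over time, so substitute your own numbers.

```python
def monthly_acu_cost(acu, price_per_acu_hour=0.12, hours_per_month=730):
    """Estimate the monthly cost of holding a constant ACU level.

    The default price and hours are assumptions for illustration only.
    """
    return acu * price_per_acu_hour * hours_per_month


floor_cost = monthly_acu_cost(0.5)  # about $43.80 under the assumed price
print(f"0.5 ACU floor: ${floor_cost:.2f}/month")
```

Compare this floor against the on-demand price of the smallest provisioned instance class you would otherwise run when deciding between Serverless v2 and provisioned capacity.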