Overview of How AWS S3 is built
This episode of The Pragmatic Engineer (hosted by Gergely Orosz) is a deep technical conversation with Mai-Lan Tomsen Bukovec (VP of Data & Analytics at AWS), who has run Amazon S3 for 13 years. It explains S3's scale, architecture, and design trade-offs (including the shift from eventual to strong consistency), the engineering practices used to guarantee correctness and durability at massive scale, and recent product primitives (S3 Tables and S3 Vectors). It is useful both as a systems-engineering case study and as insight into how a very large infrastructure org balances conservatism and invention.
Key stats & scale (what “massive” means)
- Objects: > 500 trillion objects
- Data: hundreds of exabytes stored (1 exabyte = 1,000 petabytes)
- Traffic: hundreds of millions of transactions/second globally; > 10^15 (a quadrillion) requests per year
- Hardware footprint: tens of millions of hard drives across millions of servers, in ~120 availability zones across 38 regions
- Max single-object size: 50 TB (up from the earlier 5 TB limit)
- Durability promise: 11 nines (designed and audited at many system levels)
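To make the durability figure concrete, here is a back-of-envelope calculation. The interpretation (11 nines read as an annual per-object design target) and the arithmetic are mine, not from the episode:

```python
# Back-of-envelope reading of "11 nines" at S3's stated scale.
# Assumption (mine): 99.999999999% durability is treated as an annual
# per-object design target, i.e. P(lose a given object in a year) <= 1e-11.
objects_stored = 500e12            # > 500 trillion objects, per the episode
annual_loss_probability = 1e-11    # 1 - 0.99999999999

expected_losses_per_year = objects_stored * annual_loss_probability
print(f"design-bound expected losses per year: {expected_losses_per_year:.0f}")
# ~5,000 objects/year is the bound implied by the target alone at this scale --
# which is why the raw math is backed by replication, auditors, and repair
# systems rather than relied on in isolation (see the durability section below).
```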
Origin and product evolution
- Launched: design started ~2005; S3 launched in 2006 as the first AWS service. Initially optimized for durability and availability with an eventually-consistent model.
- Pricing strategy: aggressive price cuts from launch (US$0.15/GB/month initially), driven by a mission to make storage economical enough that customers keep their data; AWS passed its cost reductions on to customers as lower prices rather than keeping them as margin.
- Major feature timeline highlights:
- Glacier (archival tier) launched 2012 for very low-cost, high-latency storage.
- Intelligent-tiering 2018 (automatic tiering/discounting).
- Rise of Parquet/“data lakes” (2013–2020) → Iceberg tables adoption (2019–2020).
- S3 Tables launched Dec 2024 (manages Parquet files and exposes SQL-like table semantics).
- S3 Vectors previewed in July 2025, with GA shortly thereafter; native support for embeddings/vectors.
Core architecture & terminology
- User-facing primitives: buckets, objects, keys, plus newer native primitives — S3 Tables (table abstraction over Parquet files) and S3 Vectors (native vector datatype).
- Fundamental operations remain simple: PUT (write) and GET (read), plus LIST, DELETE, COPY, and conditional variants (put-if-absent, put-if-match, copy-if-absent, delete-if-match); a small API sketch follows this list.
- Under the hood: index subsystem (metadata), caching layers, and a wide storage substrate (replicated across availability zones). Index metadata is consulted on most operations (HEAD, LIST, GET, PUT, etc.).
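For a sense of what these primitives look like from the client side, here is a minimal sketch using boto3. It assumes a boto3 version recent enough to expose the conditional-write parameters, and the bucket and key names are purely illustrative:

```python
# Minimal sketch of S3's basic primitives plus a conditional write ("put-if-absent").
# Assumes a recent boto3 version with conditional-write support (IfNoneMatch / IfMatch)
# and that "my-bucket" exists; all names here are illustrative.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

# Plain PUT and GET -- the core object API.
s3.put_object(Bucket="my-bucket", Key="reports/2025/q1.csv", Body=b"col1,col2\n1,2\n")
body = s3.get_object(Bucket="my-bucket", Key="reports/2025/q1.csv")["Body"].read()
print(len(body), "bytes read back")

# Conditional PUT: only succeed if no object exists under this key yet.
try:
    s3.put_object(
        Bucket="my-bucket",
        Key="locks/job-42",
        Body=b"owner=worker-7",
        IfNoneMatch="*",   # put-if-absent: fail with 412 if the key already exists
    )
    print("lock acquired")
except ClientError as err:
    if err.response["Error"]["Code"] == "PreconditionFailed":
        print("someone else holds the lock")
    else:
        raise
```

Conditional writes are what make simple coordination patterns (like the lock sketch above) possible on plain object storage without a separate database.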
Consistency: from eventual to strong (how and why)
- Original model: eventual consistency to maximize availability and durability in 2006; acceptable for many web workloads.
- Customer demand and new workloads (analytics, databases, AI) pushed S3 to strong consistency.
- Engineering solution:
- Replicated journal: a distributed, ordered log chained across storage nodes, so every write receives a sequence number and a defined position in the log.
- Cache coherency protocol with a failure allowance: lets many servers process requests in parallel while tolerating some failures, and ensures readers always observe the latest committed sequence number (a toy model of the sequence-number idea follows this list).
- Trade-offs: added complexity and hardware cost; AWS decided not to charge customers for strong consistency—rolled it out transparently without raising prices or materially increasing latencies.
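The episode describes these mechanisms only at a high level. The toy model below is my own simplification (not S3 internals) of the core idea: an ordered, sequence-numbered journal as the source of truth, with caches that never serve an entry older than the journal's latest committed write for that key:

```python
# Toy model (my own simplification, not S3 internals) of a sequence-numbered
# journal plus a cache that never serves metadata older than the journal's
# latest committed write for a key. Illustrates why ordered writes make
# read-after-write consistency tractable.
import itertools

class Journal:
    """Ordered log of writes; every write gets a monotonically increasing sequence number."""
    def __init__(self):
        self._seq = itertools.count(1)
        self.latest = {}          # key -> (seq, value) of the newest committed write

    def append(self, key, value):
        seq = next(self._seq)
        self.latest[key] = (seq, value)
        return seq

class Cache:
    """Serves reads, but validates its entry against the journal's sequence number."""
    def __init__(self, journal):
        self.journal = journal
        self.entries = {}         # key -> (seq, value)

    def get(self, key):
        committed_seq, committed_value = self.journal.latest[key]
        cached = self.entries.get(key)
        if cached is None or cached[0] < committed_seq:
            # Stale or missing: refresh from the authoritative, ordered journal.
            self.entries[key] = (committed_seq, committed_value)
        return self.entries[key][1]

journal = Journal()
cache = Cache(journal)
journal.append("photo.jpg", "v1")
print(cache.get("photo.jpg"))     # v1
journal.append("photo.jpg", "v2")
print(cache.get("photo.jpg"))     # v2 -- the stale cached v1 is never returned
```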
Durability, correctness, and verification
- Durability is addressed at multiple levels: physical layout (replication across racks/AZs/regions), repair systems, and auditor microservices that inspect stored bytes and kick off repairs (a hypothetical sketch of the auditor pattern follows this list).
- Operational model: hundreds of microservices (200+ in the regional S3 control plane) perform health checks, audit, repair, metrics collection.
- Proving correctness:
- AWS uses formal methods / automated reasoning (proper formal proofs) to verify consistency and other properties.
- Critical subsystems (index, cross-region replication, APIs) are formally specified and proofs/checks are run on code check-ins to avoid regressions.
- This is used continuously (not a one-off proof) — “proof on every check-in” for critical paths.
- Practical verification: AWS can measure and report durability over time (auditors validate actual behavior vs math/assumptions).
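As one illustration of the auditor pattern described above, here is a hypothetical sketch (names and structure are mine, not S3's actual services): re-read stored replicas, compare against the checksum recorded at write time, and queue repairs for anything that no longer matches:

```python
# Hypothetical auditor sketch: periodically re-read stored replicas, compare
# against the checksum recorded at write time, and queue a repair for anything
# that no longer matches. Illustrative only, not S3's actual services.
import hashlib
from dataclasses import dataclass

@dataclass
class Replica:
    node: str
    data: bytes

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def audit(object_key: str, expected: str, replicas: list[Replica], repair_queue: list):
    """Verify every replica's bytes; enqueue repairs rather than failing loudly."""
    for replica in replicas:
        if checksum(replica.data) != expected:
            repair_queue.append((object_key, replica.node))

# Example: one replica has silently rotted and gets scheduled for repair.
good = b"hello s3"
replicas = [Replica("node-a", good), Replica("node-b", b"hellx s3"), Replica("node-c", good)]
repairs: list = []
audit("photos/cat.jpg", checksum(good), replicas, repairs)
print(repairs)   # [('photos/cat.jpg', 'node-b')]
```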
Failure domains, crash consistency & failure allowances
- Correlated failure: multiple resources failing together (e.g., many nodes on the same rack or in the same AZ). This is the primary availability risk; S3's design spreads replicas across independent fault domains to avoid correlation (see the placement sketch after this list).
- Crash consistency: systems are designed to return to a consistent state after fail-stop events; engineers reason about reachable states in presence of failure.
- Failure allowance: caches and protocol design include tolerance thresholds (sized by metrics) so the system survives typical failures without customer-visible impact. Sizing is metric-driven; scale helps decorrelate workloads and make these allowances more effective.
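A minimal sketch of the fault-domain idea, assuming a simple greedy placement rule (illustrative only, not S3's actual placement algorithm):

```python
# Illustrative sketch (not S3's placement algorithm) of spreading replicas across
# independent fault domains so a single rack or AZ failure cannot take out all copies.
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    name: str
    az: str
    rack: str

def place_replicas(nodes: list[Node], copies: int) -> list[Node]:
    """Greedily pick nodes so no two chosen replicas share an AZ (and thus a rack)."""
    chosen: list[Node] = []
    used_azs: set[str] = set()
    for node in nodes:
        if node.az not in used_azs:
            chosen.append(node)
            used_azs.add(node.az)
        if len(chosen) == copies:
            return chosen
    raise RuntimeError("not enough independent fault domains for the requested replica count")

nodes = [
    Node("n1", "az-1", "rack-1"), Node("n2", "az-1", "rack-2"),
    Node("n3", "az-2", "rack-7"), Node("n4", "az-3", "rack-9"),
]
print([n.name for n in place_replicas(nodes, copies=3)])   # ['n1', 'n3', 'n4']
```

The point is that the probability of losing every copy is dominated by correlated events, so independence has to be enforced explicitly at placement time rather than assumed.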
S3 Vectors: new primitive for embeddings
- Motivation: embeddings/semantic vectors make huge amounts of heterogeneous data searchable; customers need to store billions/trillions of vectors cheaply.
- Design choices:
- New native data type (not just blobs or objects) built for massive scale.
- Neighborhood-based search: vector neighborhoods (clusters) are pre-computed offline and updated asynchronously; newly added vectors are placed into existing neighborhoods (a toy illustration follows this list).
- Query flow: only a small subset of the data (the relevant neighborhoods) is loaded into fast memory for the nearest-neighbor computation, giving warm query latencies of roughly 100 ms or less.
- Scale numbers from launch:
- Up to 2 billion vectors per index.
- Up to 20 trillion vectors per vector “bucket”.
- Positioning: not meant to replace ultra-low-latency vector DBs for all workloads, but provides massively scalable, cost-effective storage and reasonably low-latency search that leverages S3’s economics.
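To make the neighborhood idea concrete, here is a toy, in-memory illustration (my own, not S3 Vectors' actual algorithm): vectors are bucketed by nearest centroid as a stand-in for the offline clustering step, and a query scans only the few closest neighborhoods instead of the whole dataset:

```python
# Toy illustration (mine, not S3 Vectors' actual algorithm) of neighborhood-based
# search: vectors are bucketed into precomputed neighborhoods (here, nearest centroid),
# and a query only scans the few closest neighborhoods instead of every vector.
import numpy as np

rng = np.random.default_rng(0)
dim, n_vectors, n_neighborhoods = 8, 10_000, 32

vectors = rng.normal(size=(n_vectors, dim)).astype(np.float32)
centroids = vectors[rng.choice(n_vectors, n_neighborhoods, replace=False)]  # stand-in for offline clustering

# "Offline" step: assign every vector to its nearest centroid (its neighborhood).
assignments = np.argmin(np.linalg.norm(vectors[:, None] - centroids[None], axis=2), axis=1)
neighborhoods = {c: np.where(assignments == c)[0] for c in range(n_neighborhoods)}

def query(q: np.ndarray, probes: int = 3, k: int = 5) -> np.ndarray:
    """Load only the closest `probes` neighborhoods and do exact search within them."""
    nearest_cells = np.argsort(np.linalg.norm(centroids - q, axis=1))[:probes]
    candidates = np.concatenate([neighborhoods[int(c)] for c in nearest_cells])
    dists = np.linalg.norm(vectors[candidates] - q, axis=1)
    return candidates[np.argsort(dists)[:k]]            # ids of approximate nearest neighbors

print(query(rng.normal(size=dim).astype(np.float32)))
```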
Pricing & cost engineering (how low costs are achieved)
- AWS designs down to the metal: hardware selection, data-center ops, repair/tech processes and code are all targets for cost optimization.
- Tactics:
- Set byte-cost targets and optimize each layer (hardware, software, ops); a hypothetical byte-cost calculation follows this list.
- Archival tiers (Glacier) accept higher access latency to radically reduce storage cost (e.g., initial Glacier pricing vs standard S3).
- Automated tiering (intelligent-tiering) to reduce total cost of ownership (TCO) for inactive data.
- AWS often absorbs infrastructure costs for features (e.g., strong consistency) to preserve the simple customer experience.
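As a rough illustration of what "byte-cost targets" means in practice, here is a hypothetical amortization calculation; every number below is made up for the example:

```python
# Hypothetical byte-cost arithmetic (all numbers made up for illustration):
# amortizing a drive's purchase price over its service life gives a raw
# storage-cost floor, before power, facilities, operations, and durability overhead.
drive_capacity_tb = 20            # assumed drive size
drive_price_usd = 300             # assumed purchase price
service_life_months = 60          # assumed 5-year amortization
replication_factor = 3            # assumed copies kept for durability

raw_cost_per_gb_month = drive_price_usd / (drive_capacity_tb * 1000) / service_life_months
durable_cost_per_gb_month = raw_cost_per_gb_month * replication_factor
print(f"raw: ${raw_cost_per_gb_month:.5f}/GB-month, with replication: ${durable_cost_per_gb_month:.5f}/GB-month")
# Every layer (denser drives, longer service life, lower overhead per stored byte)
# moves this number, which is the sense in which AWS "designs down to the metal"
# against a byte-cost target.
```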
Engineering culture & org practices
- Two cultural tenets emphasized in S3:
- “Respect what came before” — preserve the long-standing guarantees (durability/availability, “it just works”).
- “Be technically fearless” — innovate and add primitives (tables, vectors, conditionals) while preserving core properties.
- Hiring and team traits:
- Engineers across career stages; common traits: ownership, relentless curiosity, emphasis on correctness and operational thinking.
- Typical S3 engineer concerns include crash consistency, correlated failure reasoning, and system-level correctness.
- Simplicity principle: despite internal complexity, the user model must remain simple (clear API, SQL for tables, simple vector APIs).
Practical takeaways for engineers and teams
- At very large scale, math and formal verification become essential: formal methods scale better than purely empirical testing for reasoning about rare corner cases.
- Design for correlated failures (not just independent failures); replication across independent fault domains is mandatory.
- Make scale an advantage: design primitives so bigger scale improves traits (decorrelation, amortization of repair systems, richer metrics).
- Ownership and operational practices (auditors, repair automation, metrics) are as important as code.
- If you want to work on infra like S3: cultivate curiosity, learn reasoning about failures and proofs, embrace operational responsibility and long-lived software design.
Notable quotes & highlights
- “If you imagine stacking all of our drives one on top of another it would go all the way to the International Space Station and just about back.” — visualization of S3’s physical scale.
- Strong consistency rollout decision: AWS implemented strong consistency for all S3 requests and chose not to charge customers for it — a noteworthy engineering + product decision.
- “At S3 scale, math has to save you.” — argument for formal methods / automated reasoning in critical systems.
Recommended follow-ups (topics to explore)
- Formal methods and automated reasoning applied to distributed storage and consistency proofs.
- Iceberg/Parquet table formats and how they map to object storage.
- Embedding/nearest-neighbor indexing techniques (neighborhood/clustering, ANN algorithms) and trade-offs between in-memory vector DBs vs disk-backed/vector primitives at massive scale.
- Operational practices for auditors, repair systems, and how to measure/validate durability in production.
Appendix — short glossary
- Index subsystem: metadata store controlling object names, tags, timestamps, consulted on nearly every API call.
- Replicated journal: ordered log across storage nodes enabling strong consistency (sequence numbers on writes).
- Failure allowance: designed tolerance for component failures so the system still meets availability and correctness SLAs.
- Crash consistency: property that a system returns to a consistent state after a crash/fail-stop event.
(The guest also recommended reading on multimodal embeddings as an important research direction, and, as a personal recommendation, a non-technical book about ecology and supporting bees.)
