This is a growing collection of research papers and whitepapers I’ve read — or am currently reading — across systems engineering, distributed computing, reliability, and software infrastructure. I use this page as a reading log and reference point, occasionally updating it with new insights.
Currently Reading
June 2025: Dynamo: Amazon’s Highly Available Key-value Store
DeCandia et al. (Amazon Web Services). A deep dive into Dynamo, the distributed system that inspired many modern NoSQL databases. Focuses on availability, eventual consistency, and partition tolerance.
June 2025: The Google File System
Ghemawat, Gobioff, Leung. Describes the fault-tolerant distributed file system that formed the backbone of Google’s infrastructure, optimized for large data throughput rather than POSIX compliance.
Recently Read
- SRE Book – Chapter: Eliminating Toil
An SRE staple for understanding ops efficiency, automation goals, and engineering productivity.
On My Radar
The Datacenter as a Computer (v3)
An overview of datacenter-scale design (energy, efficiency, architecture) that treats the whole datacenter as a single computer.
Raft: In Search of an Understandable Consensus Algorithm
A simplified, readable alternative to Paxos with real-world adoption in tools like etcd and Consul.
Spanner: Google’s Globally Distributed Database
How Google achieves strong consistency at global scale, leveraging TrueTime and synchronized clocks.
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean & Sanjay Ghemawat. A foundational paper describing a programming model and system for processing large data sets with distributed algorithms.
The Tail at Scale
Dean & Barroso. Discusses how rare but high-latency responses affect system performance, and what techniques help mitigate them.
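
A quick note to myself on why the tail matters in The Tail at Scale: when one user request fans out to many backends, a rare slow response on any single backend becomes a common slow response for the request as a whole. The sketch below is only back-of-the-envelope arithmetic. It assumes each backend call is independently slow with a fixed probability (here 1%), which real systems only approximate, and the function name `p_slow_request` is my own, not something from the paper.

```python
def p_slow_request(fanout: int, p_slow_leaf: float) -> float:
    """Probability that at least one of `fanout` independent backend calls is slow.

    Assumption: each call is slow independently with probability `p_slow_leaf`.
    """
    return 1.0 - (1.0 - p_slow_leaf) ** fanout


# How often a request is slow when each backend is slow 1% of the time.
for fanout in (1, 10, 100):
    print(fanout, round(p_slow_request(fanout, 0.01), 3))
# prints:
# 1 0.01
# 10 0.096
# 100 0.634
```

Under these assumptions, a request that touches 100 backends is slow roughly 63% of the time, even though each individual backend is slow only 1% of the time.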