Advanced
Reliable Data Systems
Building reliable, scalable, and maintainable data systems
⏱️ 50 min read
📅 Updated Jan 2025
👤 Based on Martin Kleppmann's DDIA
📚 Reference: Designing Data-Intensive Applications
This chapter is based on Martin Kleppmann's seminal work on reliable, scalable, and maintainable systems - essential reading for data engineers.
Beginner Reading Mode
Think of reliability as "the system stays correct when problems arise". Reading focus:
- Trade-offs among reliability, scalability, and maintainability
- The concepts of replication, partitioning, and consistency
- Design choices driven by business risk
Glossary: DE-GLOSSARY.md
Light Prerequisites
- Understand that data systems can fail because of network, hardware, or software problems
- Know the basics of transactions and of keeping data in sync across nodes
- Appreciate that "fast" and "consistent" often trade off against each other
Key Terms (3 Layers)
Term: RPO/RTO
Plain definition: The maximum amount of data you can lose and the maximum time you can take to recover during an incident.
Technical definition: Recovery Point Objective (tolerated data loss, expressed as a time window) and Recovery Time Objective (target time to restore service).
Practical example: An RPO of 5 minutes and an RTO of 30 minutes for a payment transaction system.
Term: Eventual Consistency
Plain definition: Data on different nodes can differ temporarily, but it ends up the same.
Technical definition: A consistency model for distributed systems in which replicas are synchronized asynchronously and converge once updates have propagated.
Practical example: A profile update is visible in region A first and in region B a few seconds later.
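The RPO/RTO example can be turned into a quick arithmetic check. This is a minimal sketch under two simplifying assumptions that are not from the source: worst-case data loss equals the interval between backups, and recovery time is the sum of detection, restore, and verification time. The function names and the sample durations are hypothetical.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Assumption: the worst-case data loss is one full backup interval."""
    return backup_interval <= rpo


def meets_rto(detect: timedelta, restore: timedelta,
              verify: timedelta, rto: timedelta) -> bool:
    """Assumption: recovery time = detection + restore + verification."""
    return detect + restore + verify <= rto


# Targets from the practical example: RPO 5 minutes, RTO 30 minutes.
rpo = timedelta(minutes=5)
rto = timedelta(minutes=30)

# Hypothetical setup: backups every 15 minutes cannot meet a 5-minute RPO.
print(meets_rpo(timedelta(minutes=15), rpo))  # False
# Hypothetical drill timings: 5 + 15 + 5 = 25 minutes fits a 30-minute RTO.
print(meets_rto(timedelta(minutes=5), timedelta(minutes=15),
                timedelta(minutes=5), rto))   # True
```

In practice the inputs should come from measured restore drills rather than estimates.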
The Three Pillars
Any data system must satisfy three fundamental concerns:
🔒 Reliability
System continues to work correctly even when things go wrong
- Hardware faults
- Software bugs
- Human errors
📈 Scalability
System's ability to cope with increased load
- Volume growth
- Traffic spikes
- Complexity management
🔧 Maintainability
Ease of keeping system running smoothly
- Operability
- Simplicity
- Evolvability
Storage Engine Internals
Understanding how databases store data helps make better technology choices.
B-Trees vs LSM-Trees
| Aspect | B-Tree (PostgreSQL, MySQL) | LSM-Tree (Cassandra, RocksDB) |
| --- | --- | --- |
| Write Pattern | Update in place | Append-only |
| Write Speed | Moderate | High |
| Read Speed | Fast | Variable (depends on compaction) |
| Best For | Read-heavy workloads | Write-heavy workloads |
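To make the append-only write pattern and the "variable" read cost concrete, here is a toy LSM-style store. It is a teaching sketch, not how Cassandra or RocksDB are implemented: the class name `TinyLSM`, the `memtable_limit` parameter, and the in-memory "segments" are all assumptions made for illustration, and it omits write-ahead logging, deletes, and Bloom filters.

```python
import bisect


class TinyLSM:
    """Toy LSM-tree: an in-memory memtable plus immutable sorted segments."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.segments = []  # sorted lists of (key, value); newest segment last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes only touch the memtable; older segments are never modified.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Append the memtable as a new sorted, immutable segment.
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # A read may have to check several segments, newest first.
        for segment in reversed(self.segments):
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None

    def compact(self):
        # Merge all segments, letting newer values overwrite older ones.
        merged = {}
        for segment in self.segments:
            merged.update(segment)
        self.segments = [sorted(merged.items())] if merged else []


db = TinyLSM(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    db.put(key, value)
print(len(db.segments), db.get("a"))  # 2 3
db.compact()
print(len(db.segments), db.get("b"))  # 1 2
```

Before compaction a lookup can scan two segments; afterwards it scans one, which is why read latency in this design depends on how far compaction has progressed.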
Data Replication
Replication keeps copies of data on multiple machines for fault tolerance and performance.
Replication Strategies
| Strategy | How It Works | Trade-offs |
| --- | --- | --- |
| Single Leader | One primary accepts writes, replicas follow | Simple, potential bottleneck |
| Multi-Leader | Multiple primaries accept writes | Conflict resolution needed |
| Leaderless | All replicas accept writes (quorum) | Complex, high availability |
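The "quorum" in the leaderless row is commonly expressed as the condition w + r > n: with n replicas, a write acknowledged by w nodes and a read that contacts r nodes must share at least one node. The sketch below checks that condition by brute force. It is a simplified model under stated assumptions (one write, no concurrent writers, no failed nodes, no sloppy quorums), and `possible_reads` is a hypothetical helper, not any database's API.

```python
from itertools import combinations


def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """True when every read set shares at least one node with every write set."""
    return w + r > n


def possible_reads(n: int, w: int, r: int) -> list:
    # Each replica stores (version, value); all start with the old value.
    replicas = [(0, "old")] * n
    # The write is acknowledged by the first w replicas only.
    for i in range(w):
        replicas[i] = (1, "new")
    # A reader contacts any r replicas and keeps the highest version it sees.
    outcomes = {
        max(replicas[i] for i in nodes)[1]
        for nodes in combinations(range(n), r)
    }
    return sorted(outcomes)


print(quorum_overlaps(3, 2, 2), possible_reads(3, 2, 2))  # True ['new']
print(quorum_overlaps(3, 1, 1), possible_reads(3, 1, 1))  # False ['new', 'old']
```

With n=3, w=2, r=2 every read returns the new value; with w=1, r=1 some reads can still return the old one.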
Partitioning (Sharding)
Splitting large datasets across multiple machines.
Partitioning Strategies
- Key Range: Sort by key, assign ranges (efficient range queries)
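- Hash of Key: Spreads keys evenly and reduces hot spots, but gives up efficient range queries; a single very hot key can still overload one partition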
- List: Explicit mapping (e.g., region-based)
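The three strategies can be sketched as three small routing functions. The boundaries, region map, and partition counts below are hypothetical values chosen for illustration. The hash variant uses `hashlib.sha256` rather than Python's built-in `hash()`, because the built-in is randomized per process for strings and would route the same key differently across restarts.

```python
import bisect
import hashlib

# Hypothetical boundaries: partition 0 = keys < "h", 1 = "h".."o", 2 = "p" and up.
RANGE_BOUNDARIES = ["h", "p"]

# Hypothetical explicit mapping from region code to partition.
REGION_TO_PARTITION = {"ID": 0, "SG": 0, "US": 1}


def range_partition(key: str) -> int:
    """Key range: neighbouring keys land together, so range scans stay local."""
    return bisect.bisect_right(RANGE_BOUNDARIES, key)


def hash_partition(key: str, num_partitions: int) -> int:
    """Hash of key: a stable hash spreads keys, but key ordering is lost."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def list_partition(region: str) -> int:
    """List: look the partition up in an explicit mapping."""
    return REGION_TO_PARTITION[region]


print([range_partition(k) for k in ["apple", "house", "zebra"]])  # [0, 1, 2]
print(hash_partition("user:42", 4) == hash_partition("user:42", 4))  # True
print(list_partition("US"))  # 1
```

One design caveat: with a plain `hash % num_partitions` scheme, changing the partition count moves most keys, which is why production systems often use a fixed number of partitions or consistent hashing instead.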
Transactions and ACID
🔐 ACID Properties
- Atomicity: All or nothing
- Consistency: Valid state to valid state
- Isolation: Concurrent transactions don't interfere
- Durability: Committed data survives crashes
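Atomicity is the easiest of these properties to demonstrate. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the `accounts` table, the `transfer` helper, and the balances are hypothetical. The credit is deliberately applied before the debit so that a failing debit shows the earlier credit being rolled back as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money atomically: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            # Violates the CHECK constraint if src would go negative.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


print(transfer(conn, "alice", "bob", 500), balances(conn))
# False {'alice': 100, 'bob': 50}
print(transfer(conn, "alice", "bob", 30), balances(conn))
# True {'alice': 70, 'bob': 80}
```

The failed transfer leaves both balances untouched even though the credit to `bob` had already executed inside the transaction.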
Consistency Models
| Model | Description | Use Case |
| --- | --- | --- |
| Strong Consistency | All reads see latest write | Financial transactions |
| Eventual Consistency | Reads may be stale temporarily | Social media, caching |
| Causal Consistency | Related operations ordered | Collaborative editing |
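The difference between a strong read and an eventually consistent read can be illustrated with a leader and one asynchronous follower. This is a single-process simulation under simplifying assumptions: the `Leader` and `Follower` classes are hypothetical, the "network" is a queue, and replication happens only when `apply_pending` is called.

```python
from collections import deque


class Leader:
    def __init__(self):
        self.data = {}
        self.log = deque()  # changes not yet applied by the follower

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))  # replicated later, asynchronously


class Follower:
    def __init__(self):
        self.data = {}

    def apply_pending(self, leader):
        # Drain the replication log in order; afterwards the follower has converged.
        while leader.log:
            key, value = leader.log.popleft()
            self.data[key] = value


leader, follower = Leader(), Follower()
leader.write("profile:42", "new bio")

print(leader.data.get("profile:42"))    # new bio (read from the leader)
print(follower.data.get("profile:42"))  # None (the follower still lags)

follower.apply_pending(leader)
print(follower.data.get("profile:42"))  # new bio (replicas have converged)
```

Reading from the leader here stands in for a strongly consistent read, while reading from the follower before `apply_pending` shows the temporarily stale result that eventual consistency permits.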
Decision Framework: Reliability Trade-offs
| Decision Point | Choose Option A If... | Choose Option B If... |
| --- | --- | --- |
| Strong vs Eventual consistency | The use case is critical (billing, ledger, strict inventory) | The workload is analytics/monitoring that tolerates synchronization delay |
| Sync replication vs Async replication | Durability is the priority even if latency rises | Throughput/latency and cross-region scale are the priority |
| Strict correctness vs Availability | Rejecting a request is better than returning a wrong result | Continuing to serve is better, even with possible stale reads |
Failure Modes & Anti-Patterns
Reliability Anti-Patterns
- Assuming the network is reliable: no timeout or retry policy is in place.
- No idempotency on retries: transient failures produce duplicate writes.
- Ignoring replication lag: read-after-write inconsistencies surprise users.
- No chaos/failure testing: major failure modes are discovered only in production.
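The duplicate-write problem from retries is usually addressed with an idempotency key: the client attaches the same key to every retry of one logical request, and the server stores the first result under that key. The sketch below is a minimal in-memory illustration; `PaymentStore`, `apply_deposit`, and the key format are hypothetical, and a real system would persist the key and the write in the same transaction.

```python
class PaymentStore:
    """Toy write path that deduplicates retries by idempotency key."""

    def __init__(self):
        self.balance = 0
        self.results = {}  # idempotency_key -> result of the first attempt

    def apply_deposit(self, idempotency_key: str, amount: int) -> int:
        # A retried request returns the stored result instead of writing again.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        self.balance += amount
        self.results[idempotency_key] = self.balance
        return self.balance


store = PaymentStore()
print(store.apply_deposit("req-1", 100))  # 100
print(store.apply_deposit("req-1", 100))  # 100 (retry, no second write)
print(store.apply_deposit("req-2", 50))   # 150
```

Without the key check, the retried `req-1` would have credited the account twice.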
Production Readiness Checklist
Reliability Checklist
- RTO/RPO targets are defined and tested.
- Retry, backoff, and circuit breakers are implemented.
- Idempotency on the write path is validated.
- Replication lag and failover metrics are monitored.
- Backup restore drills are run regularly.
- An incident runbook for partitions/outages is available.
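The retry, backoff, and circuit breaker item can be sketched in a few dozen lines. This is one possible shape, not a reference implementation: the class and function names, thresholds, and delays are assumptions, the backoff uses "full jitter" (a random delay up to an exponentially growing cap), and the breaker is a simplified three-state design with no thread safety.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to stay open
        self.clock = clock              # injectable for tests
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # a success closes the circuit again
        return result


def retry_with_backoff(fn, attempts=4, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep):
    """Retry fn on failure, sleeping a random time up to an exponential cap."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # do not hammer a dependency the breaker has cut off
        except Exception:
            if attempt == attempts - 1:
                raise
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

A typical composition is `retry_with_backoff(lambda: breaker.call(fetch))`, where `fetch` is a hypothetical function that calls the remote dependency. Retries are only safe when the operation is idempotent.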
✏️ Exercise: System Design
Design a data system for a ride-hailing app with these requirements:
- High write throughput (location updates every 5 seconds)
- Real-time driver-rider matching (low latency reads)
- Trip history must be durable
- Payment data requires strong consistency
Questions:
- What storage engine? B-Tree or LSM?
- Single or multi-leader replication?
- Partitioning strategy?
- Consistency model for each data type?
🎯 Quick Quiz
1. B-Trees are generally better than LSM-Trees for?
A. Write-heavy workloads
B. Read-heavy workloads with range queries
C. Log aggregation
D. Time-series data
2. What does the 'I' in ACID stand for?
A. Integration
B. Isolation
C. Integrity
D. Indexing
3. Single-leader replication is simpler than multi-leader because?
A. It has better performance
B. No conflict resolution needed
C. It uses less storage
D. It supports more concurrent writes
Conclusion
Building reliable data systems requires understanding the trade-offs involved, including the choice the CAP theorem forces between consistency and availability when a network partition occurs. Choose technologies and architectures that match your specific requirements.
🎯 Key Takeaways
- Reliability, Scalability, and Maintainability are fundamental
- Understand storage engine trade-offs
- Choose replication strategy based on use case
- Not all data needs strong consistency
📚 References & Resources
Primary Sources
- Designing Data-Intensive Applications - Martin Kleppmann (O'Reilly, 2017)
Chapters 1, 3, 5, and 6: Reliable, Scalable, and Maintainable Applications; Storage and Retrieval; Replication; Partitioning
- Fundamentals of Data Engineering - Joe Reis & Matt Housley (O'Reilly, 2022)
Chapter 15: Reliability Engineering for Data
- Site Reliability Engineering - Betsy Beyer et al. (O'Reilly, 2016)
Chapters 1-2: Introduction, SRE Approach
Official Documentation
Articles & Papers