Advanced

Reliable Data Systems

Building reliable, scalable, and maintainable data systems

⏱️ 50 min read 📅 Updated Jan 2025 👤 Based on Martin Kleppmann's DDIA

📚 Reference: Designing Data-Intensive Applications

This chapter is based on Martin Kleppmann's seminal work on reliable, scalable, and maintainable systems, which is essential reading for data engineers.

Beginner Reading Mode

Think of reliability as "the system stays correct when problems arrive". Focus your reading on:

  1. The trade-offs among reliability, scalability, and maintainability
  2. The concepts of replication, partitioning, and consistency
  3. Design choices based on business risk

Glossary: DE-GLOSSARY.md

Light Prerequisites

Key Terms (3 Layers)

Term: RPO/RTO

Plain definition: The maximum amount of data you can lose and the maximum time allowed to recover during an incident.

Technical definition: Recovery Point Objective (tolerance for data loss) and Recovery Time Objective (target recovery time).

Practical example: An RPO of 5 minutes and an RTO of 30 minutes for a payment transaction system.
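
To make the RPO/RTO targets concrete, the following is a minimal sketch that checks a backup schedule and a measured restore time against them. The function name `meets_objectives` and the sample numbers for the backup interval and restore time are hypothetical, and the sketch assumes the worst-case data loss equals one full backup interval.

```python
from datetime import timedelta


def meets_objectives(backup_interval: timedelta,
                     measured_restore_time: timedelta,
                     rpo: timedelta,
                     rto: timedelta) -> dict:
    """Compare a backup schedule and a restore drill result with RPO/RTO targets.

    Assumption: in the worst case a failure happens right before the next
    backup, so the data lost equals one full backup interval.
    """
    worst_case_loss = backup_interval
    return {
        "rpo_ok": worst_case_loss <= rpo,
        "rto_ok": measured_restore_time <= rto,
    }


# Targets from the payment example: RPO 5 minutes, RTO 30 minutes.
result = meets_objectives(
    backup_interval=timedelta(minutes=15),        # hypothetical schedule
    measured_restore_time=timedelta(minutes=20),  # hypothetical drill result
    rpo=timedelta(minutes=5),
    rto=timedelta(minutes=30),
)
print(result)  # {'rpo_ok': False, 'rto_ok': True}
```

In this made-up scenario the restore is fast enough, but backing up every 15 minutes cannot satisfy a 5-minute RPO.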

Term: Eventual Consistency

Plain definition: Data on different nodes may differ temporarily, but ends up the same.

Technical definition: A consistency model in distributed systems in which replicas are synchronized asynchronously and converge to the same state over time.

Practical example: A profile update is visible in region A first, and only a few seconds later in region B.
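
The region A / region B example can be imitated with a toy simulation. This is only a sketch: the class names, the key `profile:42:name`, and the explicit `replicate()` call (standing in for network delay) are all assumptions made for illustration.

```python
from collections import deque


class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}


class AsyncReplicatedStore:
    """Writes land in region A immediately and reach region B only when the
    replication queue is drained (a stand-in for network delay)."""

    def __init__(self):
        self.region_a = Replica("A")
        self.region_b = Replica("B")
        self.pending = deque()  # changes not yet applied to region B

    def write(self, key, value):
        self.region_a.data[key] = value
        self.pending.append((key, value))

    def replicate(self):
        # Apply queued changes in the order they were written.
        while self.pending:
            key, value = self.pending.popleft()
            self.region_b.data[key] = value


store = AsyncReplicatedStore()
store.write("profile:42:name", "Sari")
before = store.region_b.data.get("profile:42:name")  # region B is still stale
store.replicate()
after = store.region_b.data.get("profile:42:name")   # replicas have converged
print(before, after)  # None Sari
```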

The Three Pillars

Any data system must satisfy three fundamental concerns:

🔒 Reliability

The system continues to work correctly even when things go wrong

  • Hardware faults
  • Software bugs
  • Human errors

📈 Scalability

The system's ability to cope with increased load

  • Volume growth
  • Traffic spikes
  • Complexity management

🔧 Maintainability

The ease of keeping the system running smoothly

  • Operability
  • Simplicity
  • Evolvability

Storage Engine Internals

Understanding how databases store data helps make better technology choices.

B-Trees vs LSM-Trees

| Aspect | B-Tree (PostgreSQL, MySQL) | LSM-Tree (Cassandra, RocksDB) |
|---|---|---|
| Write Pattern | Update in-place | Append-only |
| Write Speed | Moderate | High |
| Read Speed | Fast | Variable (compaction) |
| Best For | Read-heavy workloads | Write-heavy workloads |
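
To illustrate why LSM-Tree writes are append-only and why read speed varies with compaction, here is a toy sketch. It is not how Cassandra or RocksDB are implemented; `TinyLSM`, its `memtable_limit` parameter, and the in-memory "segments" are hypothetical simplifications.

```python
import bisect


class TinyLSM:
    """Toy LSM-tree: an in-memory memtable plus immutable sorted segments."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.segments = []  # oldest first; each is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes never modify existing segments; they only touch the memtable.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Write the memtable out as a new sorted, immutable segment.
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Check segments newest-first; more segments means slower reads.
        for segment in reversed(self.segments):
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None

    def compact(self):
        # Merge all segments, keeping only the newest value per key.
        merged = {}
        for segment in self.segments:  # oldest to newest
            merged.update(segment)
        self.segments = [sorted(merged.items())]


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)    # memtable full: flushed as segment 1
db.put("a", 10)
db.put("c", 3)    # memtable full: flushed as segment 2
print(db.get("a"), len(db.segments))  # 10 2
db.compact()
print(db.get("a"), len(db.segments))  # 10 1
```

A B-Tree would instead overwrite the page holding key `a` in place, which is why its reads need to look in only one location.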

Data Replication

Replication keeps copies of data on multiple machines for fault tolerance and performance.

Replication Strategies

| Strategy | How It Works | Trade-offs |
|---|---|---|
| Single Leader | One primary accepts writes, replicas follow | Simple, potential bottleneck |
| Multi-Leader | Multiple primaries accept writes | Conflict resolution needed |
| Leaderless | All replicas accept writes (quorum) | Complex, high availability |
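
The quorum idea behind leaderless replication can be sketched as follows. With n replicas, a write acknowledged by w of them and a read that queries r of them are guaranteed to overlap on at least one replica when w + r > n. The class `QuorumStore`, the explicit `version` numbers, and the `reachable` lists are assumptions for illustration, not the API of any real database.

```python
class QuorumStore:
    """Toy leaderless store with n replicas, write quorum w, read quorum r."""

    def __init__(self, n=3, w=2, r=2):
        if w + r <= n:
            raise ValueError("need w + r > n so read and write quorums overlap")
        self.replicas = [{} for _ in range(n)]
        self.w, self.r = w, r

    def write(self, key, value, version, reachable):
        # `reachable` lists the replica indexes that respond to this request.
        if len(reachable) < self.w:
            raise RuntimeError("write quorum not reached")
        for i in reachable:
            self.replicas[i][key] = (version, value)

    def read(self, key, reachable):
        if len(reachable) < self.r:
            raise RuntimeError("read quorum not reached")
        answers = [self.replicas[i][key] for i in reachable
                   if key in self.replicas[i]]
        if not answers:
            return None
        # Return the value with the highest version among the responses.
        return max(answers, key=lambda answer: answer[0])[1]


store = QuorumStore(n=3, w=2, r=2)
store.write("x", "old", version=1, reachable=[0, 1, 2])
store.write("x", "new", version=2, reachable=[0, 1])  # replica 2 is down
print(store.read("x", reachable=[1, 2]))  # new
```

Replica 2 still holds the stale value, but the read also reaches replica 1, so the newer version wins.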

Partitioning (Sharding)

Splitting large datasets across multiple machines.

Partitioning Strategies
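
As an illustration, two widely used strategies are hash partitioning and key-range partitioning. The sketch below shows one possible way to express each; the function names, the choice of SHA-256, and the sample boundaries are assumptions, not a prescribed design.

```python
import bisect
import hashlib


def hash_partition(key: str, num_partitions: int) -> int:
    """Hash partitioning: spreads keys evenly but loses key ordering."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def range_partition(key: str, boundaries: list) -> int:
    """Key-range partitioning: keeps adjacent keys together for range scans.

    `boundaries` is a sorted list of split points; partition i holds keys
    below boundaries[i], and the last partition holds everything else.
    """
    return bisect.bisect_right(boundaries, key)


boundaries = ["h", "p"]  # partition 0: keys < "h", 1: "h" up to "p", 2: the rest
print(range_partition("alice", boundaries))  # 0
print(range_partition("maria", boundaries))  # 1
print(range_partition("zoe", boundaries))    # 2

# The same key always maps to the same hash partition.
print(hash_partition("alice", 4) == hash_partition("alice", 4))  # True
```

Range partitioning makes scans over neighbouring keys cheap but can create hot spots; hashing evens out load at the cost of efficient range queries.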

Transactions and ACID

🔐 ACID Properties
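
ACID stands for Atomicity, Consistency, Isolation, and Durability. As a small sketch of atomicity, the example below uses Python's built-in `sqlite3` module. The `accounts` table, the names, and the amounts are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money atomically: both updates commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # Violates the CHECK constraint if src has too little money.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 500)  # fails: alice only has 100
balances = dict(conn.execute(
    "SELECT name, balance FROM accounts ORDER BY name"))
print(ok, balances)  # False {'alice': 100, 'bob': 50}
```

The credit to `bob` was executed before the failure, yet it does not survive: the rollback undoes the whole transaction.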

Consistency Models

| Model | Description | Use Case |
|---|---|---|
| Strong Consistency | All reads see latest write | Financial transactions |
| Eventual Consistency | Reads may be stale temporarily | Social media, caching |
| Causal Consistency | Related operations ordered | Collaborative editing |
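
One common way to decide whether two operations are causally related is a vector clock. The text above does not name a technique, so treat this as an illustrative choice. In this sketch each clock is a plain dict mapping a node name to a counter; the `compare` function and the editing scenario are hypothetical.

```python
def compare(a: dict, b: dict) -> str:
    """Classify the causal relationship between two vector clocks."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a causally precedes b
    if b_le_a:
        return "after"       # b causally precedes a
    return "concurrent"      # neither saw the other; order is not constrained


edit_1 = {"alice": 1}              # Alice types a sentence
edit_2 = {"alice": 1, "bob": 1}    # Bob replies after seeing Alice's edit
edit_3 = {"alice": 2}              # Alice edits again without seeing Bob's reply

print(compare(edit_1, edit_2))  # before
print(compare(edit_2, edit_3))  # concurrent
```

A causally consistent system must show `edit_1` before `edit_2` everywhere, while `edit_2` and `edit_3` may appear in either order.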

Decision Framework: Reliability Trade-offs

| Decision Point | Choose Option A If... | Choose Option B If... |
|---|---|---|
| Strong vs Eventual consistency | The use case is critical (billing, ledger, strict inventory) | The workload is analytics/monitoring that tolerates synchronization delay |
| Sync replication vs Async replication | Durability is the priority, even if latency increases | Throughput/latency and cross-region scale are the priority |
| Strict correctness vs Availability | It is better to reject a request than to return a wrong result | It is better to keep serving, accepting possible stale reads |

Failure Modes & Anti-Patterns

Reliability Anti-Patterns

Production Readiness Checklist

Reliability Checklist

  1. RTO/RPO targets are defined and tested.
  2. Retry, backoff, and circuit breakers are in place.
  3. Idempotency on the write path is validated.
  4. Replication lag and failover metrics are monitored.
  5. Backup restore drills are run regularly.
  6. An incident runbook for partitions/outages is available.
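
Two of the checklist items, retry with backoff and write-path idempotency, work together: retries are only safe when repeating a write has no extra effect. The sketch below is one possible shape for both. `TransientError`, `retry_with_backoff`, `IdempotentWriter`, and the request ID `req-1` are hypothetical names, and a circuit breaker is left out for brevity.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type for failures worth retrying (e.g. timeouts)."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=2.0, sleep=time.sleep):
    """Call `operation` until it succeeds, waiting longer after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter to avoid synchronized retries.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


class IdempotentWriter:
    """Applies each request at most once, keyed by a client-supplied ID."""

    def __init__(self):
        self.balance = 0
        self.seen = {}  # request_id -> result of the first application

    def deposit(self, request_id, amount):
        if request_id in self.seen:
            return self.seen[request_id]  # duplicate: replay stored result
        self.balance += amount
        self.seen[request_id] = self.balance
        return self.balance


writer = IdempotentWriter()
attempts = {"count": 0}


def flaky_deposit():
    attempts["count"] += 1
    result = writer.deposit("req-1", 100)      # the write itself succeeds
    if attempts["count"] < 3:
        raise TransientError("response lost")  # but the caller never hears back
    return result


final = retry_with_backoff(flaky_deposit, sleep=lambda seconds: None)
print(final, writer.balance, attempts["count"])  # 100 100 3
```

Without the request ID check, the three attempts would have deposited 300 instead of 100.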

✏️ Exercise: System Design

Design a data system for a ride-hailing app with these requirements:

  1. High write throughput (location updates every 5 seconds)
  2. Real-time driver-rider matching (low latency reads)
  3. Trip history must be durable
  4. Payment data requires strong consistency

Questions:

  1. What storage engine? B-Tree or LSM?
  2. Single or multi-leader replication?
  3. Partitioning strategy?
  4. Consistency model for each data type?

🎯 Quick Quiz

1. B-Trees are generally a better fit than LSM-Trees for which of the following?

A. Write-heavy workloads
B. Read-heavy workloads with range queries
C. Log aggregation
D. Time-series data

2. What does the 'I' in ACID stand for?

A. Integration
B. Isolation
C. Integrity
D. Indexing

3. Why is single-leader replication simpler than multi-leader replication?

A. It has better performance
B. No conflict resolution needed
C. It uses less storage
D. It supports more concurrent writes

Conclusion

Building reliable data systems requires understanding the trade-offs among consistency, availability, and partition tolerance (the CAP theorem): when a network partition occurs, a system has to choose between staying consistent and staying available. Choose technologies and architectures that match your specific requirements.

🎯 Key Takeaways
