Intermediate
Storage Systems
Data Warehouse, Data Lake, Lakehouse, and storage best practices
⏱️ 45 min read
📅 Updated Jan 2025
👤 By DataLearn Team
Beginner Reading Mode
Think of a storage system as the "home" of your data. Reading focus:
- The differences between warehouse, lake, and lakehouse
- Partition patterns that make queries fast
- Common risks such as small files and schema changes
Glossary: DE-GLOSSARY.md
Light Prerequisites
- Understand tables, columns, and basic queries
- Know the difference between structured and semi-structured data
- Have heard of the concepts of storage cost and query cost
Key Terms (3 Layers)
Term: Partitioning
Plain definition: Splitting large data into small, organized parts.
Technical definition: A strategy for storing data in logical segments (e.g., per date/tenant) to improve read performance.
Practical example: A transactions table partitioned by day, so a last-7-days query does not scan 2 years of data.
Term: Schema Evolution
Plain definition: The structure of data changes over time.
Technical definition: The ability of a system to absorb schema changes (add/drop/alter columns) without breaking pipelines.
Practical example: The product team adds a promo_code column and the pipeline keeps running with a fallback default.
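The fallback-default idea above can be sketched in a few lines of Python. This is an illustrative example, not production code; the record fields and the promo_code default are hypothetical.

```python
# Hypothetical sketch: tolerate a newly added column (promo_code)
# by applying a fallback default for records written before the change.
records = [
    {"order_id": 1, "amount": 120_000},                         # old schema
    {"order_id": 2, "amount": 80_000, "promo_code": "NY2025"},  # new schema
]

def normalize(record, default_promo=None):
    """Return a record with the full current schema, filling missing columns."""
    return {
        "order_id": record["order_id"],
        "amount": record["amount"],
        "promo_code": record.get("promo_code", default_promo),
    }

rows = [normalize(r) for r in records]
print(rows)
```

Old records come out with `promo_code=None` instead of crashing the pipeline, which is the essence of a backward-compatible "add column" change.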
The Evolution of Data Storage
Data storage has evolved from simple file systems into sophisticated systems
optimized for diverse workloads. Understanding the differences between Data Warehouse,
Data Lake, and Lakehouse is fundamental for a data engineer.
🏢 Data Warehouse
Structured only
Schema-on-write
Expensive
→
🌊 Data Lake
Any format
Schema-on-read
Cheap storage
→
🏔️ Lakehouse
Best of both
ACID + flexibility
Modern standard
Data Warehouse
A Data Warehouse is a centralized repository for structured data
that has been processed and optimized for analytics queries.
Characteristics of Data Warehouse
- Structured Schema: Data must conform to a predefined schema
- Schema-on-Write: Data is validated on write
- Optimized for Analytics: Columnar storage, compression, indexing
- ACID Compliance: Transaction integrity
- SQL-First: Queries use standard SQL
Popular Data Warehouses
| Warehouse | Type | Best For | Pricing Model |
|---|---|---|---|
| Snowflake | Cloud-native SaaS | Enterprise, multi-cloud | Compute + Storage |
| BigQuery | Serverless (GCP) | Large-scale analytics | Query + Storage |
| Redshift | Cluster-based (AWS) | AWS ecosystem | Node-based |
| ClickHouse | Open source | High-performance OLAP | Self-managed |
Data Lake
A Data Lake stores data in its raw form without requiring a
predefined schema. Data can be structured, semi-structured, or unstructured.
Data Lake Architecture
🌊 Three-Layer Architecture
Raw Layer (Bronze):
Data as-is from sources. Minimal transformation. Keep original format.
Cleaned Layer (Silver):
Cleaned, deduplicated, schema applied. Ready for analysis.
Curated Layer (Gold):
Business aggregates, feature stores, ready for consumption.
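The three layers above can be illustrated with a toy pipeline. This is a minimal sketch, assuming an in-memory list of events; all field names are illustrative.

```python
# Toy sketch of the bronze -> silver -> gold flow.
raw_events = [  # Bronze: data as-is, duplicates and bad records included
    {"user": "a", "amount": "10"},
    {"user": "a", "amount": "10"},   # duplicate
    {"user": "b", "amount": None},   # bad record
]

# Silver: deduplicate, drop bad rows, apply types
seen, silver = set(), []
for e in raw_events:
    key = (e["user"], e["amount"])
    if e["amount"] is None or key in seen:
        continue
    seen.add(key)
    silver.append({"user": e["user"], "amount": int(e["amount"])})

# Gold: business aggregate ready for dashboards/consumption
gold = {}
for row in silver:
    gold[row["user"]] = gold.get(row["user"], 0) + row["amount"]
print(gold)  # {'a': 10}
```

In a real lake each layer would be a separate storage path (e.g., Parquet/Delta tables), but the contract is the same: bronze keeps everything, silver is clean and typed, gold is aggregated.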
Storage Formats for Data Lake
| Format | Type | Best For | Compression |
|---|---|---|---|
| CSV | Text, row-based | Simple exports | Poor |
| JSON | Text, semi-structured | API data, logs | Moderate |
| Parquet | Binary, columnar | Analytics (industry standard) | Excellent |
| Delta Lake | Parquet + transaction log | ACID transactions | Excellent |
| Iceberg | Table format | Schema evolution, time travel | Excellent |
💡 Why Parquet?
- Columnar: Read only the columns you need
- Compression: Built-in compression (e.g., Snappy), often around 10x smaller than CSV
- Predicate Pushdown: Filters are applied at the storage level
- Schema embedded: Files are self-describing
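The "columnar" benefit can be shown without Parquet at all. The sketch below contrasts row-oriented and column-oriented layouts in plain Python; the table contents are made up for illustration.

```python
# Minimal illustration of why columnar layout helps analytics:
# reading one column from a column store touches only that column's values.
rows = [  # Row-oriented: each record stored together (like CSV/JSON)
    {"id": 1, "name": "ana", "amount": 10},
    {"id": 2, "name": "bob", "amount": 20},
    {"id": 3, "name": "cy",  "amount": 30},
]

# Column-oriented: one contiguous array per column (what Parquet does on disk)
columns = {
    "id": [1, 2, 3],
    "name": ["ana", "bob", "cy"],
    "amount": [10, 20, 30],
}

# SELECT SUM(amount): the row store must visit every field of every record...
total_row_store = sum(r["amount"] for r in rows)
# ...while the column store reads only the "amount" array.
total_col_store = sum(columns["amount"])
assert total_row_store == total_col_store == 60
```

On disk the difference is I/O, not just iteration: a columnar reader skips the bytes of unneeded columns entirely, and per-column compression works better because values in one column are similar.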
Data Lakehouse
A Data Lakehouse combines the best of both worlds:
the flexibility of a Data Lake and the reliability of a Data Warehouse.
Lakehouse Technologies
🏔️ Key Lakehouse Features
- ACID Transactions: Ensures data consistency
- Schema Enforcement & Evolution: Flexible but controlled
- Time Travel: Query data at any point in time
- Zero-Copy Cloning: Cheap data copies for testing
- Unified Batch & Streaming: Same pipeline for both
| Technology | Format | Key Feature |
|---|---|---|
| Delta Lake | Open source (Databricks) | ACID + time travel + Z-ordering |
| Apache Iceberg | Open table format | Hidden partitioning, partition evolution |
| Apache Hudi | Open source | Incremental processing, upserts |
Partitioning Strategies
Partitioning splits data into chunks based on chosen columns,
allowing queries to skip data that is not relevant.
Common Partitioning Strategies
📅 Time-based
Pattern: year=2024/month=01/day=15/
Best for: Time-series data, event logs
Query: WHERE date >= '2024-01-01'
🌍 Region-based
Pattern: region=apac/country=id/
Best for: Geo-distributed data
Query: WHERE region = 'apac'
📊 Category-based
Pattern: category=electronics/
Best for: Business categories
Query: WHERE category = 'electronics'
🔢 Hash-based
Pattern: bucket=01/ (user_id % 100)
Best for: Even distribution
Benefit: Even load distribution, avoids data skew
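The time-based pattern above can be made concrete with a short sketch of Hive-style partition paths and how a date filter prunes them. The directory layout is illustrative.

```python
# Hedged sketch: build Hive-style partition paths and show how a date
# filter limits which directories a query must scan.
from datetime import date, timedelta

def partition_path(d: date) -> str:
    """Time-based partition path, e.g. year=2024/month=01/day=15/"""
    return f"year={d.year}/month={d.month:02d}/day={d.day:02d}/"

def partitions_for_range(start: date, end: date) -> list:
    """Only these directories are scanned; all other partitions are skipped."""
    days = (end - start).days + 1
    return [partition_path(start + timedelta(i)) for i in range(days)]

# A "last 7 days" query touches 7 directories, not the whole table
paths = partitions_for_range(date(2024, 1, 9), date(2024, 1, 15))
print(len(paths), paths[0])  # 7 year=2024/month=01/day=09/
```

This is exactly the pruning that `WHERE date >= '2024-01-01'` triggers in an engine that understands the partition layout.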
⚠️ Partitioning Anti-Patterns
- Too many partitions: > 10K partitions slow down metadata operations
- Too few partitions: Files too large, poor parallelism
- High cardinality: Partitioning by user_id creates millions of folders
- Mutable partitioning: A partition column whose values change frequently
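The "too many partitions" and "high cardinality" anti-patterns can be caught with a back-of-envelope check: the partition count is the product of the partition columns' cardinalities. The numbers below are hypothetical.

```python
# Back-of-envelope check before choosing partition columns:
# total partitions = product of each column's distinct-value count.
from math import prod

reasonable = {"date": 730, "region": 5}         # ~2 years x 5 regions
print(prod(reasonable.values()))                 # 3650 partitions: fine

too_fine = {"date": 730, "user_id": 1_000_000}   # high-cardinality column
print(prod(too_fine.values()))                   # 730,000,000: disastrous
```

If the estimate lands far above the ~10K range mentioned above, drop the high-cardinality column from the partition spec and use hash bucketing or clustering instead.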
Schema Evolution
Schemas are not static: requirements change. Data systems must support
schema evolution without breaking existing pipelines.
Schema Evolution Patterns
| Change | Compatibility | Notes |
|---|---|---|
| Add new column | ✅ Backward & forward | Safest change |
| Drop column | ⚠️ Forward only | Old code expects the column |
| Rename column | ❌ Breaking | Equivalent to drop + add |
| Change data type | ⚠️ Depends | int → long OK, long → int risky |
| Add nested field | ✅ Backward | For complex types |
```sql
-- Create a Delta table with an initial three-column schema
CREATE TABLE users (
  id INT,
  name STRING,
  email STRING
) USING DELTA;

-- Allow new columns to be merged into the schema automatically
SET spark.databricks.delta.schema.autoMerge.enabled = true;

-- The fourth value introduces a new column (a phone number);
-- with auto-merge enabled it is added to the schema instead of failing
INSERT INTO users
SELECT 1, 'John', 'john@email.com', '+1234567890';

-- Each schema change is recorded as a new table version
DESCRIBE HISTORY users;
```
Storage Best Practices
✅ DO's
- Use columnar formats (Parquet/ORC) for analytics
- Partition by commonly filtered columns
- Keep file sizes between 128 MB and 1 GB
- Use appropriate compression (Snappy for speed, Gzip for size)
- Implement data retention policies
- Version your data (time travel)
❌ DON'Ts
- Don't store many small files (small file problem)
- Don't use CSV for large datasets
- Don't partition by high-cardinality columns
- Don't store sensitive data without encryption
- Don't forget to compact files periodically
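The last two DON'Ts go together: periodic compaction is how you fix the small file problem. The sketch below is illustrative planning logic, not a real tool; engines like Delta (`OPTIMIZE`) do this for you.

```python
# Illustrative sketch: greedily group small files into compaction batches
# that approach a target output size (128 MB, per the DO's above).
TARGET = 128 * 1024 * 1024  # 128 MB target file size

def plan_compaction(file_sizes, target=TARGET):
    """Group file sizes into batches whose totals stay near the target."""
    batches, current, current_size = [], [], 0
    for size in sorted(file_sizes):
        if current and current_size + size > target:
            batches.append(current)       # close the batch at the target
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

# 1000 files of 1 MB each collapse into 8 merged outputs
sizes = [1 * 1024 * 1024] * 1000
print(len(plan_compaction(sizes)))  # 8
```

Scanning 8 right-sized files instead of 1000 tiny ones cuts metadata listing, open/close overhead, and scheduler pressure.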
Decision Guide: When to Use What?
| Use Case | Recommended Storage | Why |
|---|---|---|
| BI dashboards, SQL analytics | Data Warehouse (Snowflake/BigQuery) | Fast queries, SQL-native |
| Raw data landing, ML training data | Data Lake (S3 + Parquet) | Cost-effective, flexible |
| Streaming data, real-time analytics | Lakehouse (Delta/Iceberg) | ACID + streaming support |
| Regulatory compliance, audit | Lakehouse with time travel | Historical point-in-time queries |
| Data science experimentation | Data Lake | Flexibility for various formats |
Decision Framework: Storage Architecture
| Decision Point | Choose Option A If... | Choose Option B If... |
|---|---|---|
| Warehouse vs Lakehouse | Workload is dominated by BI with a stable relational model | You need BI + ML + semi-structured data on a single platform |
| Partition by date vs multi-key | Main queries are time-based with daily appends | Queries frequently filter by tenant/region + date |
| Parquet vs JSON | Large-scale analytics use case (fast scans, cost-efficient) | Raw event/log data still in exploration |
| Hot vs warm/cold tiering | Data is queried frequently for operational dashboards | Historical data for audit/compliance and occasional queries |
Failure Modes & Anti-Patterns
Storage-Layer Anti-Patterns
- Over-partitioning: too many small partitions make queries slower.
- Small files explosion: thousands of tiny files degrade metadata operations and scan performance.
- No schema governance: schema changes break downstream pipelines.
- CSV-first analytics: high scan costs and expensive parsing.
- No lifecycle policy: storage costs keep rising with no tiering controls.
Production Readiness Checklist
Storage Checklist before Production
- File format and partitioning strategy tested against real queries.
- Compaction schedule for small files is active.
- Schema evolution policy is documented.
- Encryption at rest is enabled with clear key management.
- Retention and tiering policies match SLA and compliance requirements.
- Time travel/versioning is available for data rollback.
- Storage cost monitoring per domain/data product is in place.
- Role-based access control is applied to sensitive datasets.
✏️ Exercise: Storage Design
Design a storage architecture for an e-commerce system with these requirements:
- 100M orders/year, query pattern mostly by date and region
- User behavior events (clicks, views) - 1B events/day
- Product catalog with frequent schema changes
- Need to support: BI dashboards, ML recommendations, audit
Specify:
- Storage type for each dataset
- Partitioning strategy
- File format
- Hot/warm/cold tiering strategy
🎯 Quick Quiz
1. What is the main difference between a Data Warehouse and a Data Lake?
A. A warehouse is cheaper
B. A lake is only for unstructured data
C. A warehouse requires a predefined schema, a lake accepts any format
D. A lake is faster for analytics
2. Which file format is best for analytics workloads?
A. CSV, because it is human-readable
B. JSON, because it is flexible
C. Parquet, because of its columnar storage
D. XML, because it is structured
3. Lakehouse menambahkan capability apa ke Data Lake?
A. Cheaper storage
B. ACID transactions and schema enforcement
C. Support for unstructured data
D. SQL query capability
Conclusion
Choosing the right storage system is a critical decision in data engineering.
Modern data architectures increasingly converge on the Lakehouse pattern
with open table formats (Delta Lake, Iceberg, Hudi) to get flexibility
and reliability in a single platform.
🎯 Key Takeaways
- Data Warehouse = Structured, schema-on-write, SQL analytics
- Data Lake = Any format, cheap, flexible
- Lakehouse = Best of both + ACID transactions
- Use Parquet for analytics, partition wisely
- Plan for schema evolution from day one
📚 References & Resources
Primary Sources
- Fundamentals of Data Engineering - Joe Reis & Matt Housley (O'Reilly, 2022)
Chapters 8-9: Storage Systems, Data Warehouses and Data Lakes
- Designing Data-Intensive Applications - Martin Kleppmann (O'Reilly, 2017)
Chapter 10: Batch Processing, Chapter 11: Stream Processing (Partitioning concepts)
- Spark: The Definitive Guide - Bill Chambers & Matei Zaharia (O'Reilly, 2018)
Chapter 6: Spark SQL, Chapter 9: Data Sources (Parquet, Delta Lake)