Intermediate

Storage Systems

Data Warehouse, Data Lake, Lakehouse, and storage best practices

⏱️ 45 min read 📅 Updated Jan 2025 👤 By DataLearn Team

Mode Baca Pemula

Anggap storage system sebagai "rumah data". Fokus baca:

  1. Perbedaan warehouse, lake, dan lakehouse
  2. Pola partition yang bikin query cepat
  3. Risiko umum seperti small files dan schema berubah

Kamus istilah: DE-GLOSSARY.md

Prasyarat Ringan

Istilah Penting (3 Lapis)

Istilah: Partitioning

Definisi awam: Membagi data besar jadi bagian kecil yang rapi.

Definisi teknis: Strategi menyimpan data dalam segmen logis (misalnya per tanggal/tenant) untuk meningkatkan performa baca.

Contoh praktis: Tabel transaksi dipartisi per hari agar query 7 hari terakhir tidak scan 2 tahun data.

Istilah: Schema Evolution

Definisi awam: Struktur data berubah seiring waktu.

Definisi teknis: Kemampuan sistem menyerap perubahan skema (tambah/hapus/ubah kolom) tanpa memutus pipeline.

Contoh praktis: Tim produk menambah kolom promo_code dan pipeline tetap jalan dengan fallback default.

The Evolution of Data Storage

Data storage telah berevolusi dari simple file systems menjadi sophisticated systems yang di-optimized untuk berbagai workloads. Memahami perbedaan antara Data Warehouse, Data Lake, dan Lakehouse adalah fundamental untuk data engineer.

🏢 Data Warehouse

Structured only
Schema-on-write
Expensive

🌊 Data Lake

Any format
Schema-on-read
Cheap storage

🏔️ Lakehouse

Best of both
ACID + flexibility
Modern standard

Data Warehouse

Data Warehouse adalah centralized repository untuk structured data yang telah diproses dan di-optimize untuk analytics queries.

Characteristics of Data Warehouse

Popular Data Warehouses

Warehouse Type Best For Pricing Model
Snowflake Cloud-native SaaS Enterprise, multi-cloud Compute + Storage
BigQuery Serverless (GCP) Large-scale analytics Query + Storage
Redshift Cluster-based (AWS) AWS ecosystem Node-based
ClickHouse Open Source High-performance OLAP Self-managed

Data Lake

Data Lake menyimpan data dalam format mentah (raw) tanpa requiring predefined schema. Data bisa structured, semi-structured, atau unstructured.

Data Lake Architecture

🌊 Three-Layer Architecture

Raw Layer (Bronze):

Data as-is from sources. Minimal transformation. Keep original format.

Cleaned Layer (Silver):

Cleaned, deduplicated, schema applied. Ready for analysis.

Curated Layer (Gold):

Business aggregates, feature stores, ready for consumption.

Storage Formats for Data Lake

Format Type Best For Compression
CSV Text, row-based Simple exports Poor
JSON Text, semi-structured API data, logs Moderate
Parquet Binary, columnar Analytics (industry standard) Excellent
Delta Lake Parquet + transaction log ACID transactions Excellent
Iceberg Table format Schema evolution, time travel Excellent

💡 Why Parquet?

Data Lakehouse

Data Lakehouse menggabungkan best of both worlds: flexibility dari Data Lake dan reliability dari Data Warehouse.

Lakehouse Technologies

🏔️ Key Lakehouse Features

Technology Format Key Feature
Delta Lake Open source (Databricks) ACID + Time Travel + Z-Ordering
Apache Iceberg Open table format Hidden partitioning, partition evolution
Apache Hudi Open source Incremental processing, upserts

Partitioning Strategies

Partitioning membagi data menjadi chunks berdasarkan kolom tertentu, memungkinkan query untuk skip data yang tidak relevan.

Common Partitioning Strategies

📅 Time-based

Pattern: year=2024/month=01/day=15/

Best for: Time-series data, event logs

Query: WHERE date >= '2024-01-01'

🌍 Region-based

Pattern: region=apac/country=id/

Best for: Geo-distributed data

Query: WHERE region = 'apac'

📊 Category-based

Pattern: category=electronics/

Best for: Business categories

Query: WHERE category = 'electronics'

🔢 Hash-based

Pattern: bucket=01/ (user_id % 100)

Best for: Even distribution

Query: Avoid data skew

⚠️ Partitioning Anti-Patterns

Schema Evolution

Schema tidak static - requirements berubah. Data systems harus support schema evolution tanpa breaking existing pipelines.

Schema Evolution Patterns

Change Compatibility Notes
Add new column ✅ Backward & Forward Safest change
Drop column ⚠️ Forward only Old code expects column
Rename column ❌ Breaking Equivalent to drop + add
Change data type ⚠️ Depends int → long OK, long → int risky
Add nested field ✅ Backward For complex types
-- Delta Lake: Schema evolution example -- Original table CREATE TABLE users ( id INT, name STRING, email STRING ) USING DELTA; -- Enable schema evolution SET spark.databricks.delta.schema.autoMerge.enabled = true; -- Insert with new column - automatically adds 'phone' INSERT INTO users SELECT 1, 'John', 'john@email.com', '+1234567890'; -- View schema evolution history DESCRIBE HISTORY users;

Storage Best Practices

✅ DO's

❌ DON'Ts

Decision Guide: When to Use What?

Use Case Recommended Storage Why
BI Dashboards, SQL Analytics Data Warehouse (Snowflake/BigQuery) Fast queries, SQL-native
Raw data landing, ML training data Data Lake (S3 + Parquet) Cost-effective, flexible
Streaming data, real-time analytics Lakehouse (Delta/Iceberg) ACID + streaming support
Regulatory compliance, audit Lakehouse with time travel Historical point-in-time queries
Data science experimentation Data Lake Flexibility for various formats

Decision Framework: Storage Architecture

Decision Point Pilih Opsi A Jika... Pilih Opsi B Jika...
Warehouse vs Lakehouse Workload dominan BI dengan model relasional stabil Perlu gabungan BI + ML + data semi-structured dalam satu platform
Partition by Date vs Multi-key Query utama time-based dan data append harian Filter sering by tenant/region + date
Parquet vs JSON Use case analytics skala besar (scan cepat, biaya efisien) Data mentah event/log yang masih eksploratif
Hot vs Warm/Cold Tiering Data sering diquery untuk dashboard operasional Data historis untuk audit/compliance dan query sesekali

Failure Modes & Anti-Patterns

Anti-Patterns pada Storage Layer

Production Readiness Checklist

Checklist Storage sebelum Production

  1. Format file dan strategi partisi sudah diuji pada query nyata.
  2. Compaction schedule untuk small files sudah aktif.
  3. Schema evolution policy terdokumentasi.
  4. Encryption at rest aktif dan key management jelas.
  5. Retention policy + tiering policy sesuai SLA dan compliance.
  6. Time travel/versioning tersedia untuk rollback data.
  7. Monitoring biaya storage per domain/data product aktif.
  8. Access control (role-based) untuk dataset sensitif diterapkan.

✏️ Exercise: Storage Design

Desain storage architecture untuk sistem e-commerce dengan requirement:

Tentukan:

  1. Storage type untuk masing-masing data
  2. Partitioning strategy
  3. File format
  4. Hot/warm/cold tiering strategy

🎯 Quick Quiz

1. Perbedaan utama Data Warehouse dan Data Lake adalah?

A. Warehouse lebih murah
B. Lake hanya untuk unstructured data
C. Warehouse requires predefined schema, Lake accepts any format
D. Lake lebih cepat untuk analytics

2. Format file terbaik untuk analytics workload adalah?

A. CSV karena human-readable
B. JSON karena flexible
C. Parquet karena columnar storage
D. XML karena structured

3. Lakehouse menambahkan capability apa ke Data Lake?

A. Cheaper storage
B. ACID transactions and schema enforcement
C. Support for unstructured data
D. SQL query capability

Kesimpulan

Memilih storage system yang tepat adalah critical decision dalam data engineering. Modern data architecture cenderung menggunakan Lakehouse pattern dengan open table formats (Delta Lake, Iceberg, Hudi) untuk mendapatkan flexibility dan reliability dalam satu platform.

🎯 Key Takeaways

📚 References & Resources

Primary Sources

Official Documentation

Articles & Papers