Intermediate
Data Ingestion
Batch vs streaming, ETL vs ELT, and ingestion patterns
⏱️ 40 min read
📅 Updated Jan 2025
👤 By DataLearn Team
Beginner Reading Mode
Think of ingestion as "how data gets in". Focus your reading on:
- When to use batch vs streaming
- How the delivery model affects data duplication
- How to replay data after a failure
Glossary of terms: DE-GLOSSARY.md
Light Prerequisites
- Understand source systems and data destinations (warehouse/lake)
- Know the concept of scheduled jobs (hourly/daily)
- Have seen duplicate or late-arriving data before
Key Terms (3 Layers)
Term: At-least-once Delivery
Plain definition: Data is guaranteed to arrive, but may be delivered more than once.
Technical definition: A delivery semantic that guarantees no event is lost, at the cost of potential duplicates.
Practical example: An order event can arrive twice during a retry and is then cleaned up via dedup.
Term: Deduplication Key
Plain definition: A unique key used to recognize that two records are the same.
Technical definition: A stable identifier used to remove duplicate records during ingestion and merge.
Practical example: The combination of order_id + event_time is used to prevent double inserts.
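A minimal sketch of how a deduplication key absorbs at-least-once duplicates; the order_id + event_time key follows the example above, while the in-memory seen_keys set stands in for what would normally be a warehouse MERGE or a keyed state store.

# Drop events whose (order_id, event_time) key was already seen.
def deduplicate(events, seen_keys=None):
    seen_keys = set() if seen_keys is None else seen_keys
    unique_events = []
    for event in events:
        key = (event["order_id"], event["event_time"])
        if key in seen_keys:
            continue  # duplicate delivery from a retry: skip it
        seen_keys.add(key)
        unique_events.append(event)
    return unique_events

# At-least-once delivery: the same event shows up twice after a retry.
events = [
    {"order_id": "A1", "event_time": "2025-01-01T10:00:00", "amount": 120},
    {"order_id": "A1", "event_time": "2025-01-01T10:00:00", "amount": 120},
]
assert len(deduplicate(events)) == 1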
What Is Data Ingestion?
Data ingestion is the process of moving data from source systems
to destination systems (data lake, warehouse, etc.). It is the second stage of the
Data Engineering Lifecycle, after data generation.
💡 Key Concept
Good ingestion is invisible - data arrives reliably, on time, and in the expected format.
Batch vs Streaming Ingestion
There are two main paradigms in data ingestion: batch and streaming.
| Aspect | Batch Processing | Stream Processing |
|---|---|---|
| Data Size | Large volumes | Continuous small chunks |
| Latency | Minutes to hours | Milliseconds to seconds |
| Complexity | Simpler | More complex |
| Cost | Lower (off-peak processing) | Higher (always-on) |
| Use Case | Historical analysis, reports | Fraud detection, real-time alerts |
When to Use What?
✅ Choose Batch If:
- Data arrives periodically (daily/hourly)
- Complex transformations needed
- Cost optimization is priority
- Historical data processing
⚠️ Choose Streaming If:
- Real-time decisions required
- Immediate anomaly detection
- Event-driven architecture
- Low-latency customer experience
ETL vs ELT vs EtLT
Data transformation patterns have evolved along with the capabilities of modern systems.
🔄 Evolution of Data Processing
ETL (Traditional):
Extract → Transform (on-prem) → Load to DW
ELT (Modern Cloud):
Extract → Load to DW → Transform (in DW)
EtLT (Current Best Practice):
Extract → light Transform → Load → Transform
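A runnable sketch of the EtLT idea, using sqlite3 as a stand-in warehouse so it executes end-to-end; the table names, field normalization, and final aggregate are illustrative, and in practice the Load and big-T steps would target BigQuery, Snowflake, or a lakehouse.

import json
import sqlite3

def light_transform(raw_records):
    # "little t": cheap, mechanical fixes only — parse JSON, normalize keys, drop obvious junk
    cleaned = []
    for raw in raw_records:
        record = json.loads(raw)
        record = {k.lower().strip(): v for k, v in record.items()}
        if record.get("order_id"):
            cleaned.append(record)
    return cleaned

raw = ['{"Order_ID": "A1", "Amount": 120}', '{"Order_ID": "", "Amount": 5}']
staged = light_transform(raw)                      # Extract -> t

db = sqlite3.connect(":memory:")                   # -> Load into a staging table
db.execute("CREATE TABLE staging_orders (order_id TEXT, amount REAL)")
db.executemany("INSERT INTO staging_orders VALUES (?, ?)",
               [(r["order_id"], r["amount"]) for r in staged])

db.execute(                                        # -> Transform (big T) inside the warehouse
    "CREATE TABLE analytics_orders AS "
    "SELECT order_id, SUM(amount) AS total FROM staging_orders GROUP BY order_id"
)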
| Pattern | Best For | Pros | Cons |
|---|---|---|---|
| ETL | Small data, strict governance | Clean data in DW, security control | Slower, limited by processing power |
| ELT | Cloud DW, exploration | Fast loading, flexible | Raw data in DW, storage cost |
| EtLT | Data lakes, streaming | Balance of both, keep raw + clean | More complex pipeline |
Ingestion Patterns
📅 Scheduled/Batch
Frequency: Hourly, daily, weekly
Tools: Airflow, cron, dbt
Use: Reports, data exports, analytics
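A minimal daily batch DAG sketch, assuming Airflow 2.4+ is available; the dag_id, task body, and table partitioning idea are illustrative rather than a specific connector setup.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_orders(ds, **_):
    # ds is the logical date Airflow passes in, e.g. "2025-01-01";
    # pull that day's orders from the source and land them in the lake/warehouse here.
    print(f"extracting orders for partition {ds}")

with DAG(
    dag_id="daily_orders_ingestion",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",      # one batch run per day
    catchup=False,
) as dag:
    PythonOperator(task_id="extract_orders", python_callable=extract_orders)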
🔄 Micro-Batch
Frequency: Every few minutes
Tools: Spark Structured Streaming
Use: Near real-time with batch simplicity
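A micro-batch sketch with Spark Structured Streaming, assuming PySpark is available; the S3 paths, schema, and 5-minute trigger are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("micro_batch_ingestion").getOrCreate()

events = (
    spark.readStream
    .format("json")
    .schema("order_id STRING, event_time TIMESTAMP, amount DOUBLE")
    .load("s3://raw-bucket/orders/")           # illustrative source path
)

query = (
    events.writeStream
    .trigger(processingTime="5 minutes")        # micro-batch: process accumulated files every 5 minutes
    .option("checkpointLocation", "s3://checkpoints/orders/")
    .format("parquet")
    .start("s3://lake/orders/")                 # illustrative destination path
)
query.awaitTermination()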
⚡ Real-time Streaming
Frequency: Event-by-event
Tools: Kafka, Flink, Spark Streaming
Use: Fraud detection, live dashboards
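An event-by-event consumption sketch, assuming the kafka-python client and a broker exposing an "orders" topic; the fraud rule is purely illustrative.

import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="latest",
)

for message in consumer:                 # each event is handled as soon as it arrives
    order = message.value
    if order.get("amount", 0) > 10_000:  # illustrative real-time rule
        print(f"possible fraud: {order}")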
🔄 CDC (Change Data Capture)
Trigger: Database changes
Tools: Debezium, Fivetran
Use: Keep DW synchronized with OLTP
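A sketch of applying Debezium-style change events (op / before / after) to keep a target in sync; the in-memory dict stands in for the warehouse MERGE a real pipeline would run.

def apply_change_events(target, change_events):
    for event in change_events:
        if event["op"] in ("c", "u", "r"):      # create, update, snapshot read -> upsert
            row = event["after"]
            target[row["id"]] = row
        elif event["op"] == "d":                # delete -> remove the row
            target.pop(event["before"]["id"], None)
    return target

table = {}
apply_change_events(table, [
    {"op": "c", "after": {"id": 1, "status": "created"}},
    {"op": "u", "after": {"id": 1, "status": "paid"}},
    {"op": "d", "before": {"id": 1, "status": "paid"}},
])
assert table == {}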
Data Quality in Ingestion
Don't just move data - validate it. Data quality checks should happen at ingestion time.
Types of Data Quality Checks
| Check Type | Example | Action on Failure |
|---|---|---|
| Schema Validation | Column exists, correct type | Reject or quarantine |
| Completeness | No null in required fields | Alert, continue with default |
| Range Check | Age between 0-150 | Clamp or reject |
| Referential Integrity | Order has valid customer_id | Reject or orphan handling |
| Uniqueness | No duplicate IDs | Deduplicate or reject |
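The snippet below sketches a few of these checks with the legacy great_expectations dataset API; the file name and column names are illustrative.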
import great_expectations as ge

df = ge.read_csv("data.csv")

# Run each check and collect its result so one failure does not hide the others.
results = [
    df.expect_column_values_to_not_be_null("customer_id"),
    df.expect_column_values_to_be_between("age", min_value=0, max_value=150),
    df.expect_column_values_to_be_unique("order_id"),
]

if not all(result["success"] for result in results):
    raise ValueError("Data quality checks failed!")
Error Handling Strategies
Things will go wrong. Plan for it.
🚨 Error Handling Patterns
1. Dead Letter Queue (DLQ) (see the sketch after this list)
- Failed records go to separate queue for inspection
- Pipeline continues processing valid records
- Manual intervention for DLQ items
2. Circuit Breaker
- Stop processing after N consecutive failures
- Prevent cascading failures
- Auto-recovery after cool-down period
3. Idempotency
- Same input produces same output, no duplicates
- Essential for retry logic
- Use deterministic IDs
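A combined sketch of patterns 1 and 3 above: invalid records are parked in a dead-letter list instead of stopping the run, and a deterministic ID makes retries idempotent; the validation rule and ID scheme are illustrative.

import hashlib
import json

def deterministic_id(record):
    # Same input -> same ID, so a retried record overwrites itself instead of duplicating.
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

def ingest_batch(records, sink, dead_letter_queue):
    for record in records:
        try:
            if "order_id" not in record:
                raise ValueError("missing order_id")
            sink[deterministic_id(record)] = record   # idempotent upsert
        except ValueError as err:
            # Dead Letter Queue: park the bad record for inspection, keep processing the rest.
            dead_letter_queue.append({"record": record, "error": str(err)})

sink, dlq = {}, []
ingest_batch([{"order_id": "A1"}, {"amount": 10}, {"order_id": "A1"}], sink, dlq)
assert len(sink) == 1 and len(dlq) == 1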
Ingestion Metrics to Monitor
- 📊 Volume: rows/bytes per run
- ⏱️ Latency: end-to-end time
- ✅ Success Rate: % of successful runs
- ⚠️ Error Rate: % of failed records
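A minimal sketch of capturing these four signals around an ingestion run; the run_ingestion callable and printing the metrics (instead of pushing them to a real monitoring backend) are illustrative.

import time

def run_with_metrics(run_ingestion, records):
    start = time.time()
    processed, failed = run_ingestion(records)            # callable returns (processed_count, failed_count)
    total = max(len(records), 1)
    metrics = {
        "volume_rows": len(records),                       # Volume
        "latency_seconds": round(time.time() - start, 3),  # Latency, end-to-end
        "success_rate": processed / total,                 # share of records ingested successfully
        "error_rate": failed / total,                      # share of records that failed
    }
    print(metrics)  # in production: push to Prometheus, CloudWatch, Datadog, etc.
    return metrics

# Toy usage: an "ingestion" that fails on records missing order_id.
run_with_metrics(
    lambda recs: (sum("order_id" in r for r in recs), sum("order_id" not in r for r in recs)),
    [{"order_id": "A1"}, {"amount": 10}],
)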
Case Study: Tokopedia's Ingestion Pipeline
🛒 Multi-Modal Ingestion
Batch Layer:
- Historical order data → Daily Airflow jobs → BigQuery
- Product catalog → Hourly CDC → Data Lake (Parquet)
Speed Layer:
- Real-time events → Kafka → Flink → Real-time dashboard
- Fraud signals → Immediate processing → Block/Allow
Serving Layer:
- Pre-aggregated metrics → Cache layer → Sub-second queries
Decision Framework: Ingestion Strategy
| Decision Point | Choose Option A If... | Choose Option B If... |
|---|---|---|
| Batch vs Streaming | Data feeds periodic reports and high delay is tolerable | The use case is operational and real-time (fraud, personalization, alerting) |
| CDC vs Full Extract | The source supports log-based CDC and you need near real-time sync | Volume is small or the source does not support CDC |
| ETL vs ELT | Strict pre-processing is required before data enters the main platform | The warehouse/lakehouse is powerful and transformation flexibility is needed |
| Strict Reject vs Quarantine | Quality must be very strict (e.g. financial transactions) | The pipeline must keep running while problematic records are isolated |
Failure Modes & Anti-Patterns
Most Common Failure Modes
- Duplicate delivery: retries without idempotency produce duplicate data.
- Undetected schema drift: ingestion succeeds but downstream breaks.
- Poison message: one bad record halts the consumer.
- Silent data loss: errors are skipped with no alert and no DLQ.
- No replay plan: recovery is impossible once the outage is over.
Production Readiness Checklist
Ingestion Checklist Before Production
- Every source has a clear schema contract.
- Idempotency keys and dedup logic have been tested.
- A DLQ and reprocessing flow are in place.
- Validation checks are active (schema, completeness, uniqueness).
- Alerts for lag, error rate, and throughput are configured.
- An ingestion incident runbook is available.
- Backfill/replay procedures are documented.
- Ingestion cost per source is monitored.
✏️ Exercise: Design Ingestion Pipeline
Design an ingestion pipeline for a ride-hailing app with these requirements:
- 100K rides/day, need real-time driver-rider matching
- Payment processing (can be batch)
- Customer analytics dashboard (daily refresh OK)
- Fraud detection (must be real-time)
Decide for each data type:
- Batch or streaming?
- ETL, ELT, or EtLT?
- What quality checks?
- Error handling strategy?
🎯 Quick Quiz
1. Which pattern fits historical data analysis best?
A. Real-time streaming
B. Scheduled batch processing
C. CDC streaming
D. Micro-batch
2. What is the main advantage of ELT over ETL?
A. It is more secure
B. Transformations run in the more powerful data warehouse
C. It is cheaper in hardware terms
D. It suits small data better
3. What is the purpose of a Dead Letter Queue?
A. To store all successfully processed data
B. To store failed records for manual inspection
C. To send alerts to engineers
D. To back up data automatically
Conclusion
Data ingestion is the foundation of every data pipeline. Choose between batch and streaming
based on latency requirements. Modern pipelines use the EtLT pattern
to combine the flexibility of ELT with the governance of ETL.
🎯 Key Takeaways
- Batch for cost, streaming for speed
- ELT is modern standard, EtLT for complex cases
- Validate data quality at ingestion
- Plan for failures with DLQ and circuit breakers
- Monitor volume, latency, and error rates
📚 References & Resources
Primary Sources
- Fundamentals of Data Engineering - Joe Reis & Matt Housley (O'Reilly, 2022)
Chapters 8-9: Ingestion Patterns, Batch vs Streaming
- Designing Data-Intensive Applications - Martin Kleppmann (O'Reilly, 2017)
Chapter 10: Batch Processing, Chapter 11: Stream Processing
- Data Pipelines Pocket Reference - James Densmore (O'Reilly, 2021)
Chapter 2: Ingestion Patterns