Intermediate

Data Ingestion

Batch vs streaming, ETL vs ELT, and ingestion patterns

⏱️ 40 min read 📅 Updated Jan 2025 👤 By DataLearn Team

Beginner Reading Mode

Think of ingestion as "how data gets in". Focus your reading on:

  1. When to use batch vs streaming
  2. How the delivery model affects data duplication
  3. How to replay data after a failure

Glossary: DE-GLOSSARY.md

Light Prerequisites

Key Terms (3 Layers)

Term: At-least-once Delivery

Plain definition: The data is guaranteed to arrive, but it may be delivered more than once.

Technical definition: A delivery semantic that guarantees events are never lost, at the cost of potential duplicates.

Practical example: An order event can arrive twice during a retry and is cleaned up afterwards through deduplication.

Term: Deduplication Key

Plain definition: A unique key for recognizing identical records.

Technical definition: A stable identifier used to remove duplicate records during ingestion and merge.

Practical example: The combination of order_id + event_time is used to prevent double inserts.
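To make these two terms concrete, here is a minimal deduplication sketch over an at-least-once feed, assuming events are plain dicts and that order_id + event_time is a stable deduplication key (the event shape is illustrative):

```python
def deduplicate(events):
    """Drop duplicates from an at-least-once event feed."""
    seen = set()
    unique = []
    for event in events:
        key = (event["order_id"], event["event_time"])  # deduplication key
        if key not in seen:  # first delivery wins; retries become no-ops
            seen.add(key)
            unique.append(event)
    return unique

events = [
    {"order_id": "A1", "event_time": "2025-01-01T10:00:00", "amount": 50},
    {"order_id": "A1", "event_time": "2025-01-01T10:00:00", "amount": 50},  # retry duplicate
    {"order_id": "A2", "event_time": "2025-01-01T10:05:00", "amount": 75},
]
print(len(deduplicate(events)))  # 2
```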

What is Data Ingestion?

Data ingestion is the process of moving data from source systems to destination systems (a data lake, a warehouse, etc.). It is the second stage of the Data Engineering Lifecycle, after data generation.

💡 Key Concept

Good ingestion is invisible: data arrives reliably, on time, and in the expected format.

Batch vs Streaming Ingestion

There are two main paradigms in data ingestion: batch and streaming.

| Aspect | Batch Processing | Stream Processing |
|---|---|---|
| Data Size | Large volumes | Continuous small chunks |
| Latency | Minutes to hours | Milliseconds to seconds |
| Complexity | Simpler | More complex |
| Cost | Lower (off-peak processing) | Higher (always-on) |
| Use Case | Historical analysis, reports | Fraud detection, real-time alerts |

When to Use What?

✅ Choose Batch If:

  1. Latency of minutes to hours is acceptable (reports, historical analysis)
  2. You want simpler, cheaper pipelines that can run off-peak

⚠️ Choose Streaming If:

  1. The use case needs results within seconds (fraud detection, real-time alerts)
  2. The value of the data decays quickly enough to justify an always-on system

ETL vs ELT vs EtLT

Data transformation patterns have evolved along with the capabilities of modern systems.

🔄 Evolution of Data Processing

ETL (Traditional):

Extract → Transform (on-prem) → Load to DW

ELT (Modern Cloud):

Extract → Load to DW → Transform (in DW)

EtLT (Current Best Practice):

Extract → light Transform → Load → Transform

| Pattern | Best For | Pros | Cons |
|---|---|---|---|
| ETL | Small data, strict governance | Clean data in DW, security control | Slower, limited by processing power |
| ELT | Cloud DW, exploration | Fast loading, flexible | Raw data in DW, storage cost |
| EtLT | Data lakes, streaming | Balance of both, keep raw + clean | More complex pipeline |
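To illustrate the EtLT flow, here is a minimal sketch in which the light transform masks PII before loading, and the heavy modeling is deferred to the warehouse. The file paths, column name, and masking rule are assumptions for the example:

```python
import csv
import hashlib

def mask_email(value):
    # light "t": hash PII so raw emails never reach the landing zone
    return hashlib.sha256(value.encode()).hexdigest()

def etlt(source_path, landing_path):
    # Extract: read raw rows from the source
    with open(source_path, newline="") as src:
        rows = list(csv.DictReader(src))
    # light Transform: mask PII, leave everything else raw
    for row in rows:
        row["email"] = mask_email(row["email"])
    # Load: land the rows; the heavy Transform runs later inside the
    # warehouse (e.g., SQL models) on this raw-but-safe copy
    with open(landing_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```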

Ingestion Patterns

📅 Scheduled/Batch

Frequency: Hourly, daily, weekly

Tools: Airflow, cron, dbt

Use: Reports, data exports, analytics
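A minimal scheduled-batch sketch with Airflow (assuming Airflow 2.4+; the DAG id and the extract_orders task are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_orders():
    # Placeholder: pull yesterday's orders from the source and land them
    print("extracting the daily orders batch")

with DAG(
    dag_id="daily_orders_ingestion",  # hypothetical pipeline name
    start_date=datetime(2025, 1, 1),
    schedule="@daily",                # batch cadence: once per day
    catchup=False,
):
    PythonOperator(task_id="extract_orders", python_callable=extract_orders)
```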

🔄 Micro-Batch

Frequency: Every few minutes

Tools: Spark Structured Streaming

Use: Near real-time with batch simplicity
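A minimal micro-batch sketch with Spark Structured Streaming: events are processed every five minutes rather than one by one. The schema and paths are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("micro_batch_ingestion").getOrCreate()

# Source: JSON events landing in a directory (placeholder path and schema)
events = (
    spark.readStream
    .schema("order_id STRING, amount DOUBLE, event_time TIMESTAMP")
    .json("/landing/orders")
)

# Sink: process every 5 minutes as a micro-batch instead of event-by-event
query = (
    events.writeStream
    .format("parquet")
    .option("path", "/lake/orders")
    .option("checkpointLocation", "/checkpoints/orders")  # enables restart/replay
    .trigger(processingTime="5 minutes")
    .start()
)
query.awaitTermination()
```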

⚡ Real-time Streaming

Frequency: Event-by-event

Tools: Kafka, Flink, Spark Streaming

Use: Fraud detection, live dashboards
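A minimal event-by-event consumer sketch using the kafka-python client; the topic, broker address, and process() handler are assumptions:

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

def process(event):
    # Placeholder handler: score, enrich, or forward the event
    print("processing", event)

consumer = KafkaConsumer(
    "orders",                            # placeholder topic
    bootstrap_servers="localhost:9092",  # placeholder broker
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    enable_auto_commit=False,            # commit only after successful processing
)

for message in consumer:
    process(message.value)
    consumer.commit()  # at-least-once: offsets advance only after processing
```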

🔄 CDC (Change Data Capture)

Trigger: Database changes

Tools: Debezium, Fivetran

Use: Keep DW synchronized with OLTP
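A sketch of applying change events to keep a warehouse table in sync. The op/before/after event shape loosely follows Debezium's convention, and the SQL assumes a PostgreSQL target with a unique id column; treat both as assumptions:

```python
def apply_change(cursor, change):
    """Apply one change event to the warehouse copy of `customers`."""
    op = change["op"]  # "c" = create, "u" = update, "d" = delete
    if op in ("c", "u"):
        row = change["after"]
        cursor.execute(
            "INSERT INTO customers (id, name) VALUES (%s, %s) "
            "ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name",  # upsert keeps replays safe
            (row["id"], row["name"]),
        )
    elif op == "d":
        cursor.execute(
            "DELETE FROM customers WHERE id = %s",
            (change["before"]["id"],),
        )
```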

Data Quality in Ingestion

Don't just move data - validate it. Data quality checks should happen at ingestion time.

Types of Data Quality Checks

| Check Type | Example | Action on Failure |
|---|---|---|
| Schema Validation | Column exists, correct type | Reject or quarantine |
| Completeness | No nulls in required fields | Alert, continue with default |
| Range Check | Age between 0-150 | Clamp or reject |
| Referential Integrity | Order has valid customer_id | Reject or orphan handling |
| Uniqueness | No duplicate IDs | Deduplicate or reject |
```python
# Python example: data quality checks with Great Expectations
import great_expectations as ge

df = ge.read_csv("data.csv")

# Define expectations; collect every result so all checks are evaluated
checks = [
    df.expect_column_values_to_not_be_null("customer_id"),
    df.expect_column_values_to_be_between("age", min_value=0, max_value=150),
    df.expect_column_values_to_be_unique("order_id"),
]

if not all(check["success"] for check in checks):
    raise ValueError("Data quality checks failed!")
```

Error Handling Strategies

Things will go wrong. Plan for it.

🚨 Error Handling Patterns

1. Dead Letter Queue (DLQ): route records that fail processing to a separate queue or table for manual inspection and later replay.

2. Circuit Breaker: stop calling a repeatedly failing downstream system for a cooldown period instead of hammering it with retries.

3. Idempotency: design writes so that reprocessing the same record produces the same result, making retries safe (see the sketch below).
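Here is a minimal sketch of the DLQ and idempotency patterns together, with an in-memory queue and a seen-keys set standing in for real infrastructure:

```python
from collections import deque

dead_letter_queue = deque()  # stand-in for a real DLQ topic or table
processed_keys = set()       # stand-in for an idempotency store

def ingest(record):
    key = record["order_id"]  # idempotency key
    if key in processed_keys:
        return  # duplicate retry: a safe no-op
    try:
        if record.get("amount", 0) < 0:
            raise ValueError("negative amount")
        # ... write the record to the destination here ...
        processed_keys.add(key)
    except Exception as exc:
        # Park the failed record with its error for manual inspection and replay
        dead_letter_queue.append({"record": record, "error": str(exc)})

ingest({"order_id": "A1", "amount": 10})
ingest({"order_id": "A1", "amount": 10})  # retry: ignored (idempotent)
ingest({"order_id": "A2", "amount": -5})  # bad record: lands in the DLQ
print(len(dead_letter_queue))  # 1
```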

Ingestion Metrics to Monitor

📊 Volume: rows/bytes per run

⏱️ Latency: end-to-end time

✅ Success Rate: % of successful runs

⚠️ Error Rate: % of failed records
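A small sketch of computing these four metrics from per-run records (the record shape is an assumption):

```python
def run_metrics(runs):
    """Summarize ingestion runs into the four core metrics."""
    total_rows = sum(r["rows"] for r in runs)
    return {
        "volume_rows": total_rows,                                       # Volume
        "avg_latency_s": sum(r["seconds"] for r in runs) / len(runs),    # Latency
        "success_rate": sum(r["ok"] for r in runs) / len(runs),          # Success Rate
        "error_rate": sum(r["failed_rows"] for r in runs) / total_rows,  # Error Rate
    }

runs = [
    {"rows": 1000, "seconds": 120, "ok": True, "failed_rows": 3},
    {"rows": 900, "seconds": 150, "ok": False, "failed_rows": 900},
]
print(run_metrics(runs))
```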

Case Study: Tokopedia's Ingestion Pipeline

🛒 Multi-Modal Ingestion

Batch Layer: scheduled bulk loads of historical data for reporting and analytics.

Speed Layer: streaming ingestion of events for low-latency, real-time views.

Serving Layer: merges batch and speed outputs into one consistent view for consumers.

Decision Framework: Ingestion Strategy

| Decision Point | Choose Option A If... | Choose Option B If... |
|---|---|---|
| Batch vs Streaming | Data feeds periodic reports and delay tolerance is high | The use case is operational and real-time (fraud, personalization, alerting) |
| CDC vs Full Extract | The source supports log-based CDC and you need near-real-time sync | Volume is small or the source does not support CDC |
| ETL vs ELT | Strict pre-processing is required before data enters the main platform | The warehouse/lakehouse is powerful and you need transformation flexibility |
| Strict Reject vs Quarantine | Quality must be extremely strict (e.g., financial transactions) | The pipeline must keep running while problem records are isolated |

Failure Modes & Anti-Patterns

Most Common Failure Modes

Production Readiness Checklist

Ingestion Checklist Before Production

  1. Every source has a clear schema contract.
  2. Idempotency keys and dedup logic have been tested.
  3. A DLQ and reprocessing flow are in place.
  4. Validation checks are active (schema, completeness, uniqueness).
  5. Alerts for lag, error rate, and throughput are configured.
  6. An ingestion incident runbook is available.
  7. The backfill/replay procedure is documented.
  8. Ingestion cost per source is monitored.

✏️ Exercise: Design Ingestion Pipeline

Design the ingestion for a ride-hailing app with the following requirements:

Decide for each data type:

  1. Batch or streaming?
  2. ETL, ELT, or EtLT?
  3. What quality checks?
  4. Error handling strategy?

🎯 Quick Quiz

1. Which pattern is suitable for historical data analysis?

A. Real-time streaming
B. Scheduled batch processing
C. CDC streaming
D. Micro-batch

2. What is the main advantage of ELT over ETL?

A. More secure
B. Transformations run in the more powerful data warehouse
C. Cheaper in terms of hardware
D. Better suited to small data

3. What is the purpose of a Dead Letter Queue?

A. Storing all data that was processed successfully
B. Storing failed records for manual inspection
C. Sending alerts to engineers
D. Automatically backing up data

Conclusion

Data ingestion is the foundation of a data pipeline. Choose between batch and streaming based on latency requirements. Modern pipelines use the EtLT pattern to get the flexibility of ELT with the governance of ETL.

🎯 Key Takeaways

  1. Choose batch for periodic, delay-tolerant workloads; choose streaming when seconds matter.
  2. EtLT combines the flexibility of ELT with the governance of ETL.
  3. Validate data at ingestion time: schema, completeness, range, referential integrity, uniqueness.
  4. Plan for failure with DLQs, idempotency, and documented replay procedures.
