Foundation

Data Engineering Lifecycle

A foundational framework for understanding end-to-end data engineering

⏱️ 25 min read 📅 Updated Jan 2025 👤 Based on "Fundamentals of Data Engineering" by Joe Reis & Matt Housley

            🎯 Learning Objectives
            Understand the 5 stages of the Data Engineering Lifecycle
Get to know the undercurrents that cut across the entire lifecycle
Identify tools and technologies for each stage
Understand the trade-offs in architectural decisions

        

Beginner Reading Mode

If this is your first time learning about the lifecycle, read in this order:

Definition of the lifecycle
The 5 stages (Generate -> Serve)
Undercurrents (security, data management, DataOps)

Glossary: DE-GLOSSARY.md

Light Prerequisites

You understand the Data Engineer role in general terms
You know the basic concepts of databases and APIs
You have seen a simple ETL/ELT flow before

Key Terms (3 Layers)

Term: Lifecycle

Plain-language definition: The life of data from the moment it is created until it is used.

Technical definition: The end-to-end stages of managing data: generation, storage, ingestion, transformation, serving.

Practical example: A transaction event from an application lands in storage, gets processed, and then appears on a dashboard.

Term: Undercurrents

Plain-language definition: The important things that are always present at every stage.

Technical definition: Cross-cutting concerns such as security, governance, orchestration, and observability.

Practical example: Whatever the stage, logging and access control are still required.

What Is the Data Engineering Lifecycle?

The Data Engineering Lifecycle is a framework that describes how data moves from its source to its consumers. It was developed by Joe Reis & Matt Housley to help data engineers see the big picture of their work.

💡 Key Insight

"The big idea of this book is the data engineering lifecycle: data generation, storage, ingestion, transformation, and serving. Since the dawn of data, we've seen the rise and fall of innumerable specific technologies and vendor products, but the data engineering lifecycle stages have remained essentially unchanged." — Joe Reis

The 5 Stages of Data Engineering Lifecycle

📡 GENERATE (Source Systems) → 💾 STORAGE (Data Repositories) → 🔄 INGEST (Data Movement) → ⚙️ TRANSFORM (Data Processing) → 📊 SERVE (Data Consumption)

🌊 Undercurrents (Cross-Cutting Concerns)

🔒 Security

📋 Data Management

🚀 DataOps

🎼 Orchestration

💻 Software Engineering

Stage 1: Generation (Source Systems)

📡

Data Generation

Data originates in many different source systems. As a Data Engineer, you need to understand the characteristics of each source (volume, change frequency, schema stability) in order to design the right ingestion strategy.

Common Source Systems:

OLTP Databases - MySQL, PostgreSQL, MongoDB (application data)
APIs - REST APIs, GraphQL, webhooks
Files - CSV, JSON, Parquet, logs
Message Queues - Kafka, RabbitMQ, SQS
IoT Devices - Sensors, telemetry data
Third-party Data - External APIs, data vendors

📱 Case Study: E-commerce Platform

An Indonesian e-commerce platform such as Tokopedia would typically have multiple sources, for example:

PostgreSQL - User data, orders, products
Clickstream - User behavior (Apache Kafka)
Logs - Application logs (ELK Stack)
APIs - Payment gateway (Midtrans, Xendit)
Mobile App - Event tracking (Firebase)

Stage 2: Storage

💾

Data Storage

Storage is where data is kept, in various formats and for various purposes. Choosing the right storage system is key to both performance and cost efficiency.

Storage Type	Use Case	Examples
Data Warehouse	Structured analytics, BI	Snowflake, BigQuery, Redshift
Data Lake	Raw data storage, ML	S3, GCS, Azure Data Lake
Data Lakehouse	Warehouse-style tables (ACID transactions, schema enforcement) on data lake storage	Delta Lake, Apache Iceberg
OLAP	Fast analytics queries	ClickHouse, Druid, Pinot
Time-Series DB	Metrics, IoT data	InfluxDB, TimescaleDB

Stage 3: Ingestion

🔄

Data Ingestion

Ingestion is the process of moving data from source systems into storage. There are two main patterns: batch and streaming.

Batch vs Streaming:

Aspect	Batch Processing	Stream Processing
Latency	Minutes to hours	Milliseconds to seconds
Data Volume	Large volumes	Continuous, smaller chunks
Use Case	Daily reports, ETL	Real-time analytics, fraud detection
Tools	Airflow, Spark Batch	Kafka, Flink, Spark Streaming

Common Ingestion Patterns:

CDC (Change Data Capture) - Track database changes in real-time
API Polling - Pull data from external APIs
File Transfer - SFTP, cloud storage sync
Event Streaming - Push-based via message queues
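
As a rough illustration of pull-based batch ingestion, the sketch below polls a source and only ingests rows newer than a stored watermark. The class name `IncrementalIngestor`, the `updated_at` field, and the in-memory `storage` list are all hypothetical stand-ins, not part of any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class IncrementalIngestor:
    """Batch-style ingestion that only pulls rows newer than a stored watermark."""

    watermark: int = 0                           # highest `updated_at` seen so far
    storage: list = field(default_factory=list)  # stand-in for a data lake table

    def run_batch(self, source_rows):
        # Select only rows changed since the previous run.
        new_rows = [r for r in source_rows if r["updated_at"] > self.watermark]
        self.storage.extend(new_rows)
        if new_rows:
            self.watermark = max(r["updated_at"] for r in new_rows)
        return len(new_rows)


source = [
    {"order_id": 1, "updated_at": 100},
    {"order_id": 2, "updated_at": 105},
]
ingestor = IncrementalIngestor()
first = ingestor.run_batch(source)    # both rows are new
source.append({"order_id": 3, "updated_at": 110})
second = ingestor.run_batch(source)   # only order 3 is new
print(first, second, ingestor.watermark)
```

A watermark like this cannot see deleted rows, which is one reason CDC reads the database change log instead of polling the table.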

Stage 4: Transformation

⚙️

Data Transformation

Transformation is the process of turning raw data into a format that is more useful for analytics and machine learning.

Types of Transformations:

Cleaning - Handling nulls, duplicates, outliers
Normalization - Structuring data, schema enforcement
Aggregation - Summarizing data (SUM, AVG, COUNT)
Enrichment - Adding context from other sources
Feature Engineering - Creating ML features
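
The first few transformation types can be chained as in the minimal sketch below (cleaning, then enrichment, then aggregation). The sample records, the `user_city` lookup, and the function names are invented for illustration; a real pipeline would more likely use SQL, dbt, or Spark.

```python
from collections import defaultdict

raw_orders = [
    {"order_id": 1, "user_id": "u1", "amount": 50000},
    {"order_id": 1, "user_id": "u1", "amount": 50000},   # duplicate
    {"order_id": 2, "user_id": "u2", "amount": None},    # null amount
    {"order_id": 3, "user_id": "u1", "amount": 25000},
]
user_city = {"u1": "Jakarta", "u2": "Bandung"}  # lookup used for enrichment


def clean(rows):
    """Drop rows with a null amount and keep the first row per order_id."""
    seen, out = set(), []
    for row in rows:
        if row["amount"] is None or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        out.append(row)
    return out


def enrich(rows, cities):
    """Add a city column from a second source."""
    return [{**row, "city": cities.get(row["user_id"], "unknown")} for row in rows]


def aggregate(rows):
    """Sum the order amount per city."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["city"]] += row["amount"]
    return dict(totals)


revenue_by_city = aggregate(enrich(clean(raw_orders), user_city))
print(revenue_by_city)  # {'Jakarta': 75000}
```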

ETL vs ELT:

ETL (Extract, Transform, Load): Transformation happens before the data is loaded into the warehouse. A good fit for data whose structure is already well defined.

ELT (Extract, Load, Transform): Data is loaded as-is and transformed inside the warehouse. More flexible for exploratory analysis because the raw data stays available.
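
To make the difference concrete, here is a small sketch that runs both styles against an in-memory SQLite database standing in for a warehouse. The table names and sample rows are assumptions for illustration only.

```python
import sqlite3

raw_events = [("u1", " 50000 "), ("u2", "n/a"), ("u1", "25000")]
conn = sqlite3.connect(":memory:")  # stand-in for a warehouse

# ELT: load the raw strings untouched, then transform with SQL inside the warehouse.
conn.execute("CREATE TABLE raw_payments (user_id TEXT, amount_text TEXT)")
conn.executemany("INSERT INTO raw_payments VALUES (?, ?)", raw_events)
conn.execute(
    """
    CREATE TABLE payments_clean AS
    SELECT user_id, CAST(TRIM(amount_text) AS INTEGER) AS amount
    FROM raw_payments
    WHERE TRIM(amount_text) GLOB '[0-9]*'
    """
)


# ETL: transform in application code first, then load only the clean rows.
def transform(rows):
    out = []
    for user_id, amount_text in rows:
        text = amount_text.strip()
        if text.isdigit():
            out.append((user_id, int(text)))
    return out


conn.execute("CREATE TABLE payments_etl (user_id TEXT, amount INTEGER)")
conn.executemany("INSERT INTO payments_etl VALUES (?, ?)", transform(raw_events))

query = "SELECT user_id, amount FROM {} ORDER BY amount"
elt_rows = conn.execute(query.format("payments_clean")).fetchall()
etl_rows = conn.execute(query.format("payments_etl")).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_payments").fetchone()[0]
```

Both paths end with the same clean rows, but only the ELT path still has the rejected `"n/a"` record in `raw_payments`, so it can be re-examined later.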

Stage 5: Serving

📊

Data Serving

Serving is the final stage, where data is consumed by end users. Different users have different needs.

Data Consumers:

Data Analysts - SQL queries, BI tools (Tableau, Looker)
Data Scientists - Feature stores, notebooks
Applications - APIs, embedded analytics
ML Systems - Model training, inference
Reverse ETL - Sync back to operational systems

Reverse ETL:

Reverse ETL is the process of sending data from the warehouse back to operational systems. Example: sending customer segmentation data from the warehouse to a CRM (Salesforce, HubSpot).
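
One common way to keep such a sync cheap is to send only the records that differ from what the operational system already holds. The sketch below assumes both sides can be read into dictionaries keyed by customer ID. `plan_sync` and the sample data are hypothetical, and no real CRM API is called.

```python
def plan_sync(warehouse_segments, crm_state):
    """Return the CRM updates needed so the CRM matches the warehouse."""
    return [
        {"customer_id": customer_id, "segment": segment}
        for customer_id, segment in sorted(warehouse_segments.items())
        if crm_state.get(customer_id) != segment
    ]


warehouse_segments = {"c1": "loyal", "c2": "at_risk", "c3": "new"}
crm_state = {"c1": "loyal", "c2": "new"}   # what the CRM currently stores

updates = plan_sync(warehouse_segments, crm_state)
print(updates)  # c2 changed segment, c3 is missing from the CRM
```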

The Undercurrents

Undercurrents are concerns that cut across the entire lifecycle. They are not tied to any one stage; they matter at every stage.

            🔒 Security
            Encryption at rest and in transit
Access control and authentication
PII handling and data masking
Audit logging
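
As one possible take on PII handling, the sketch below shows two techniques: a keyed hash that keeps values joinable without exposing them, and a display mask for emails. `SECRET_KEY` is a placeholder assumption; in practice the key would come from a secret manager.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key-rotate-me"  # hypothetical; never hard-code a real key


def pseudonymize(value: str) -> str:
    """Keyed hash: stable for joins, not reversible without the key."""
    return hmac.new(SECRET_KEY, value.lower().encode(), hashlib.sha256).hexdigest()


def mask_email(email: str) -> str:
    """Keep the first character and the domain, hide the rest."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}"


print(mask_email("budi@example.com"))  # b***@example.com
print(pseudonymize("budi@example.com") == pseudonymize("Budi@Example.com"))  # True
```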

        

            📋 Data Management
            Data Catalog & Discovery
Data Lineage
Data Quality
Master Data Management

        

            🚀 DataOps
            CI/CD for data pipelines
Automated testing
Infrastructure as Code
Monitoring & Alerting

        

            🎼 Orchestration
            Workflow scheduling
Dependency management
Error handling & retries
Resource management
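
Dependency management and retries can be sketched with the standard library alone. This is not how Airflow is implemented; the task names, `run_with_retries`, and the simulated flaky task are assumptions for illustration.

```python
from graphlib import TopologicalSorter

# task -> set of tasks it depends on (hypothetical task names)
dag = {
    "ingest_orders": set(),
    "ingest_clicks": set(),
    "transform_sales": {"ingest_orders", "ingest_clicks"},
    "publish_dashboard": {"transform_sales"},
}


def run_with_retries(task, max_attempts=3):
    """Call the task, retrying on RuntimeError up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except RuntimeError:
            if attempt == max_attempts:
                raise


attempts = {"ingest_clicks": 0}


def flaky_ingest_clicks():
    # Fails on the first call to simulate a transient error.
    attempts["ingest_clicks"] += 1
    if attempts["ingest_clicks"] < 2:
        raise RuntimeError("transient network error")
    return "ok"


tasks = {name: (lambda: "ok") for name in dag}
tasks["ingest_clicks"] = flaky_ingest_clicks

run_order = []
for name in TopologicalSorter(dag).static_order():  # dependencies come first
    run_with_retries(tasks[name])
    run_order.append(name)
print(run_order)
```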

        

            💻 Software Engineering
            Version control (Git)
Code reviews
Documentation
Testing best practices

        

Case Study: End-to-End Lifecycle

🏪 Gojek Food Delivery Analytics

Business Context: Gojek wants real-time analytics for food delivery orders.

Lifecycle Implementation:

Generate: Order data from the app (Firebase Events), driver location (GPS), restaurant data (PostgreSQL)
Storage: Raw data to S3 (Data Lake), cleaned data to BigQuery (Data Warehouse)
Ingest: Kafka Streams for real-time events, an Airflow DAG for the daily batch
Transform: dbt for modeling, Spark for large-scale aggregation
Serve: Looker dashboards for business teams, an API for the driver app

Undercurrents Applied:

Security: PII masking for customer data
DataOps: Automated testing for data quality
Orchestration: Airflow to manage dependencies
Data Management: A data catalog with Amundsen

Decision Framework: Choosing a Lifecycle Design

Decision Point	Choose Option A If...	Choose Option B If...
Batch vs Streaming	The latency target is minutes to hours and cost efficiency is the focus	You need a response within seconds or less, e.g. fraud detection or real-time operations
Warehouse vs Lakehouse	The workload is mostly SQL-based BI with simple governance needs	You need one platform for BI, ML, and raw data
ETL vs ELT	Data must be transformed before it lands (strict compliance)	You want the flexibility to explore and transform inside the warehouse
Serving via BI vs API	The main consumers are analysts and business users	The main consumers are operational applications or products

Failure Modes & Anti-Patterns

            Common Mistakes to Avoid
            Tool-first architecture: choosing the tool first and defining the problem later.
No data contracts: schemas change silently and break downstream consumers.
Missing ownership: a pipeline fails and nobody knows who owns it.
Serving without an SLA: a dashboard is assumed to be real-time even though its freshness was never defined.
No backfill strategy: after an incident is resolved, the historical data still has gaps.
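
To show what a data contract check might look like at its simplest, the sketch below validates each record against an agreed set of fields and types before it is passed downstream. `CONTRACT` and `contract_violations` are hypothetical names; real teams often use JSON Schema, Avro, or Protobuf for this.

```python
CONTRACT = {"order_id": int, "user_id": str, "amount": int}  # hypothetical contract


def contract_violations(record, contract=CONTRACT):
    """Return a list of human-readable problems; an empty list means the record conforms."""
    problems = []
    for field_name, expected_type in contract.items():
        if field_name not in record:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(record[field_name], expected_type):
            actual = type(record[field_name]).__name__
            problems.append(f"wrong type for {field_name}: {actual}")
    for field_name in sorted(record.keys() - contract.keys()):
        problems.append(f"unexpected field: {field_name}")
    return problems


good = {"order_id": 1, "user_id": "u1", "amount": 50000}
bad = {"order_id": "1", "user_id": "u1"}  # the producer silently changed the schema

print(contract_violations(good))  # []
print(contract_violations(bad))
```

Rejecting or quarantining the `bad` record at ingestion surfaces the schema change immediately instead of letting it break a dashboard later.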

        

Production Readiness Checklist

            Pre-Go-Live Checklist
            Every dataset has a technical owner and a business owner.
Freshness and success-rate SLAs/SLOs are documented.
Data quality checks are active (schema, null, duplicate, range).
Retries, timeouts, and idempotency have been tested.
Source-to-serving lineage is traceable.
A runbook exists for incidents and backfill procedures.
PII masking/encryption matches the data classification.
Cost monitoring (storage + compute) and alert thresholds are in place.
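
Two of the checklist items, idempotency and freshness, are easy to test in isolation. The sketch below uses an in-memory SQLite table as a stand-in for the target store. The table layout, `load_batch`, and `is_fresh` are illustrative assumptions.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount INTEGER, loaded_at TEXT)"
)


def load_batch(rows, loaded_at):
    """Idempotent load: re-running a batch overwrites rows instead of duplicating them."""
    conn.executemany(
        "INSERT OR REPLACE INTO orders (order_id, amount, loaded_at) VALUES (?, ?, ?)",
        [(order_id, amount, loaded_at.isoformat()) for order_id, amount in rows],
    )


def is_fresh(now, max_age):
    """Freshness check: the newest load must be no older than max_age."""
    latest = conn.execute("SELECT MAX(loaded_at) FROM orders").fetchone()[0]
    return latest is not None and now - datetime.fromisoformat(latest) <= max_age


batch = [(1, 50000), (2, 25000)]
t0 = datetime(2025, 1, 1, 6, 0, tzinfo=timezone.utc)
load_batch(batch, t0)
load_batch(batch, t0)  # simulated retry of the same batch
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

print(row_count)                                                  # 2
print(is_fresh(t0 + timedelta(minutes=30), timedelta(hours=1)))   # True
print(is_fresh(t0 + timedelta(hours=2), timedelta(hours=1)))      # False
```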

        

✏️ Exercise: Design a Data Pipeline

Imagine you are a Data Engineer at Bukalapak. Design the lifecycle for the following analytics system:

Source: Website clickstream, transaction database, payment gateway API
Goal: Real-time sales dashboard + daily comprehensive reports

Questions:

Which storage systems fit this use case?
Batch or streaming for the clickstream data?
Which undercurrents are most critical for this system?

🎯 Quick Quiz

1. Which stage is responsible for moving data from the source to storage?

A. Generation

B. Storage

C. Ingestion

D. Transformation

2. Which of these is an example of an undercurrent (not a stage)?

A. Data Lake

B. Security

C. ETL Pipeline

D. API Serving

3. What is the main advantage of ELT over ETL?

A. It is faster because the transformation happens at the source

B. It is more flexible for exploratory analysis because the raw data is kept

C. It is more secure for sensitive data

D. It does not require a storage system

Key Takeaways

            🎯 Summary
            5 Stages: Generation → Storage → Ingestion → Transformation → Serving
Undercurrents covered here: Security, Data Management, DataOps, Orchestration, Software Engineering (the book also lists Data Architecture as a sixth)
The lifecycle stages stay constant while technologies keep changing
Understanding the lifecycle helps with architectural decisions
Every stage has trade-offs that need to be weighed

        

📚 References & Resources

Primary Sources

Fundamentals of Data Engineering - Joe Reis & Matt Housley (O'Reilly, 2022)
Chapter 2: The Data Engineering Lifecycle (covers both the stages and the undercurrents)

Official Documentation

Apache Kafka Documentation - Event Streaming
Delta Lake Documentation - Storage Layer
Airflow: DAGs & Orchestration