Intermediate
Orchestration with Apache Airflow
Workflow management, DAG design, and best practices
⏱️ 55 min read
📅 Updated Jan 2025
👤 By DataLearn Team
Beginner Reading Mode
Think of orchestration as the "workflow manager" for your data. Reading focus:
- Why task dependencies must be explicit
- Configuring retries, timeouts, and concurrency
- Safe principles when backfilling historical data
Glossary: DE-GLOSSARY.md
Light Prerequisites
- Understand that a pipeline consists of several sequential steps
- Know that processes can fail and need retries
- Have seen a scheduled job before (cron or a simple scheduler)
Key Terms (3 Layers)
Term: Idempotent DAG
Plain definition: Running it repeatedly still produces correct data.
Technical definition: A workflow whose operations are safe to rerun, with no duplicate side effects or inconsistencies.
Practical example: The load task uses a merge key instead of a blind insert, so reruns do not duplicate rows.
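The merge-key idea can be sketched in plain Python (this is an illustration of the principle, not Airflow-specific code; the table and row shapes are hypothetical):

```python
def load_idempotent(table, rows, key="order_id"):
    """Upsert rows into `table` keyed by `key`; safe to re-run."""
    index = {row[key]: i for i, row in enumerate(table)}
    for row in rows:
        if row[key] in index:
            table[index[row[key]]] = row   # overwrite the existing row
        else:
            index[row[key]] = len(table)
            table.append(row)              # insert a new row
    return table

target = []
batch = [{"order_id": 1, "amount": 10}, {"order_id": 2, "amount": 20}]
load_idempotent(target, batch)
load_idempotent(target, batch)  # rerun: still 2 rows, no duplicates
```

A blind `table.extend(rows)` would leave 4 rows after the rerun; merging on the key is what makes the retry safe.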
Term: Retry Backoff
Plain definition: On failure, try again with increasingly long pauses.
Technical definition: A retry strategy with growing intervals, reducing load on the system during transient errors.
Practical example: An API timeout is retried after 1m, 2m, then 5m before being marked as failed.
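The growing-interval schedule can be sketched as a small helper (`backoff_delays` is an illustrative name, not an Airflow API):

```python
def backoff_delays(base_s=60, factor=2, retries=3, cap_s=300):
    """Return the wait in seconds before each retry attempt,
    growing geometrically and capped at a maximum delay."""
    return [min(base_s * factor ** i, cap_s) for i in range(retries)]

print(backoff_delays())  # [60, 120, 240]
```

In Airflow itself, setting `retries`, `retry_exponential_backoff=True`, and `max_retry_delay` on a task (or in `default_args`) produces this behavior.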
Why Orchestration Matters
Data pipelines don't run in isolation. They have dependencies, schedules, and failure scenarios.
Orchestration manages these complexities automatically.
💡 Without Orchestration
Cron jobs → Silent failures → Data staleness → Manual recovery → 3AM pages
✅ With Orchestration
Defined dependencies → Automatic retries → Alerting → Self-healing → Peace of mind
Apache Airflow Overview
Apache Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring workflows.
Created at Airbnb in 2014, it is now an Apache Top-Level Project.
Core Concepts
| Concept | Description | Analogy |
| --- | --- | --- |
| DAG | Directed Acyclic Graph - the workflow definition | Recipe |
| Task | Unit of work within a DAG | Recipe step |
| Operator | Template for a Task (what to run) | Tool (oven, knife) |
| Hook | Connection to external systems | Adapter |
| Sensor | Waits for an external event | Timer |
Airflow Architecture
🌐 Web Server
UI for monitoring and management
⚙️ Scheduler
Decides when to run tasks
💼 Worker
Executes tasks (Celery/Kubernetes)
🗄️ Metadata DB
PostgreSQL/MySQL for state
Writing Your First DAG
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-team',
    'depends_on_past': False,
    'email': ['alert@company.com'],
    'email_on_failure': True,
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    'my_first_dag',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval=timedelta(days=1),
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=['tutorial'],
) as dag:
    extract = BashOperator(
        task_id='extract_data',
        bash_command='echo "Extracting..."',
    )

    def transform_logic(**context):
        # BashOperator pushes its last stdout line to XCom,
        # so this pulls "Extracting..." from the extract_data task
        data = context['ti'].xcom_pull(task_ids='extract_data')
        return f"Transformed: {data}"

    transform = PythonOperator(
        task_id='transform_data',
        python_callable=transform_logic,
    )

    load = BashOperator(
        task_id='load_data',
        bash_command='echo "Loading..."',
    )

    extract >> transform >> load
Common Operators
| Operator | Use Case | Example |
| --- | --- | --- |
| PythonOperator | Run Python functions | Data transformation |
| BashOperator | Run shell commands | File operations |
| SQLExecuteQueryOperator | Execute SQL | dbt, data warehouse |
| S3FileTransformOperator | S3 file operations | Data lake tasks |
| DockerOperator | Run containers | Isolated environments |
| HttpOperator | API calls | External services |
DAG Design Patterns
📊 ETL Pattern
Extract → Transform → Load
Linear dependency chain
🌳 Branching
Parallel processing paths
Join back at the end
🔀 Dynamic Tasks
TaskGroup for repeated work
Map over list of items
⏱️ Sensors
Wait for external events
File arrival, API ready
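The sensor pattern above boils down to "poll until the event arrives or a timeout expires". A conceptual sketch in plain Python (`wait_for` is a hypothetical helper, not an Airflow class):

```python
import time

def wait_for(condition, poke_interval_s=0.01, timeout_s=1.0):
    """Poll `condition` until it returns True or the timeout elapses,
    mirroring how a Sensor 'pokes' for an external event."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return True          # event arrived: downstream tasks may proceed
        time.sleep(poke_interval_s)
    return False                 # timed out: the sensor fails

attempts = {"n": 0}

def file_has_arrived():
    attempts["n"] += 1
    return attempts["n"] >= 3    # pretend the file shows up on the third poke

print(wait_for(file_has_arrived))  # True
```

In Airflow, built-in sensors such as `FileSensor` expose the same knobs as `poke_interval` and `timeout` parameters.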
Best Practices
⚠️ DO's and DON'Ts
✅ DO:
- Keep tasks idempotent (safe to rerun)
- Use TaskFlow API for new code
- Store secrets in Airflow Connections or a secrets backend
- Set appropriate retry policies
- Monitor DAG run duration trends
❌ DON'T:
- Process large data in DAG file
- Store state in XCom for large objects
- Write business logic in operators
- Use top-level code that runs on DAG parse
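A minimal sketch of the TaskFlow API recommended above, assuming Airflow 2.x is installed; the DAG id and task bodies are placeholders:

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule_interval="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def etl_taskflow():
    @task
    def extract():
        # placeholder payload; a real task would query a source system
        return {"orders": 42}

    @task
    def transform(payload: dict) -> dict:
        return {**payload, "status": "clean"}

    transform(extract())  # the dependency is inferred from the data flow

etl_taskflow()
```

Compared with the classic-operator version, there is no explicit `>>` wiring or `xcom_pull`: passing one task's return value into another both sets the dependency and moves the data via XCom.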
Monitoring and Alerting
Airflow provides multiple monitoring mechanisms:
- Email alerts: On failure, retry, SLA miss
- Web UI: Real-time DAG and task status
- Metrics: StatsD, Prometheus integration
- Logs: Centralized logging for debugging
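Beyond the built-in channels, alerting is often wired through failure callbacks. A plain-Python illustration of the idea (the runner and context dict here are simplified stand-ins; in Airflow you would pass a function like `notify_failure` as `on_failure_callback` in `default_args`):

```python
alerts = []

def notify_failure(context):
    # hypothetical alert sender; a real one would post to Slack/email/pager
    alerts.append(f"Task {context['task_id']} failed: {context['error']}")

def run_task(task_id, fn):
    """Minimal runner mimicking the executor invoking the callback on failure."""
    try:
        fn()
    except Exception as exc:
        notify_failure({"task_id": task_id, "error": str(exc)})

def flaky_extract():
    raise RuntimeError("API timeout")

run_task("extract_data", flaky_extract)
print(alerts)  # one alert recorded for the failed task
```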
Decision Framework: Orchestration Design
| Decision Point | Choose Option A If... | Choose Option B If... |
| --- | --- | --- |
| Single DAG vs. multiple DAGs | The flow is simple with linear dependencies | Domains, ownership, or SLAs differ |
| Schedule-based vs. event-driven | Jobs are stable and periodic (hourly/daily) | Execution depends on events or file arrival |
| TaskFlow API vs. classic operators | The pipeline is Python-centric with reusable functions | You need built-in operators, shell commands, or legacy SQL integration |
| Catchup enabled vs. disabled | You need controlled historical backfills | Only the latest run matters (near-real-time ops) |
Failure Modes & Anti-Patterns
Airflow Anti-Patterns
- Heavy logic at DAG parse time: the scheduler slows down and errors become hard to trace.
- Huge XCom payloads: the metadata DB bloats and performance drops.
- No retry budget: flaky tasks fail immediately with no chance of recovery.
- Non-idempotent tasks: retries produce duplicate loads.
- Unbounded backfills: the cluster saturates under excessive historical runs.
Production Readiness Checklist
Production DAG Checklist
- The DAG is idempotent and safe to retry.
- Retries, timeouts, SLAs, and concurrency are configured.
- Secrets are not hardcoded (use Connections or a Secret Backend).
- Alerting to Slack/Email/Pager is active.
- A runbook for failed DAGs and manual recovery is available.
- The backfill strategy and resource guardrails are documented.
- Cross-DAG dependencies are explicit (datasets/triggers/external task sensors).
- Run-duration trends are monitored, not just failure counts.
✏️ Exercise: Build a Complete DAG
Create a DAG for an e-commerce ETL pipeline with:
- Extract orders from PostgreSQL (daily)
- Extract customer updates from API
- Transform and join data with Python
- Load to BigQuery
- Send success/failure notification
Requirements:
- Proper error handling and retries
- Parallel extraction where possible
- SLA of 2 hours
- Email on failure
🎯 Quick Quiz
1. What is the role of the Scheduler in Airflow?
A. Executes tasks directly
B. Decides when tasks should run
C. Provides a UI for monitoring
D. Stores task logs
2. Which operator is suited to running a Python function?
A. BashOperator
B. PythonOperator
C. DockerOperator
D. Sensor
3. What is a DAG?
A. Database Access Gateway
B. Directed Acyclic Graph - a workflow definition
C. Data Analytics Group
D. Distributed Application Grid
Alternatives to Airflow
| Tool | Best For | Key Difference |
| --- | --- | --- |
| Prefect | Modern Python workflows | Dynamic DAGs, hybrid mode |
| Dagster | Data-aware orchestration | Asset-based, type checking |
| Temporal | Microservices workflows | Durable execution |
| Azure Data Factory | Azure ecosystem | Fully managed, visual editor |
Conclusion
Orchestration is the backbone of data engineering. Airflow provides a
mature, flexible platform for managing workflow complexity.
Master DAG design patterns, best practices, and monitoring to
build reliable pipelines.
🎯 Key Takeaways
- DAG = workflow definition, Task = unit of work
- Use appropriate operators for each task type
- Design for idempotency and retries
- Monitor trends, not just failures
- Consider alternatives based on your needs
📚 References & Resources
Primary Sources
- Fundamentals of Data Engineering - Joe Reis & Matt Housley (O'Reilly, 2022)
Chapters 12-13: Orchestration, Workflow Management
- Data Pipelines Pocket Reference - James Densmore (O'Reilly, 2021)
Chapter 5: Workflow Orchestration with Airflow
- Data Pipelines with Apache Airflow - Bas Harenslak & Julian de Ruiter (Manning, 2021)
Chapters 1-4: Airflow Basics, DAG Design Patterns