Intermediate

Orchestration with Apache Airflow

Workflow management, DAG design, and best practices

⏱️ 55 min read 📅 Updated Jan 2025 👤 By DataLearn Team

Beginner Reading Mode

Think of orchestration as the "workflow manager" for your data. Focus your reading on:

  1. Why task dependencies must be explicit
  2. Configuring retries, timeouts, and concurrency
  3. Safety principles when backfilling historical data

Glossary: DE-GLOSSARY.md

Light Prerequisites

Key Terms (3 Layers)

Term: Idempotent DAG

Plain definition: Running it repeatedly still produces correct data.

Technical definition: A workflow whose operations are safe to rerun, with no duplicated data or inconsistent side effects.

Practical example: The load task uses a merge key instead of a blind insert, so a rerun does not duplicate rows.
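The merge-key idea can be sketched with SQLite's upsert, standing in for a warehouse MERGE (the table and column names here are illustrative):

```python
import sqlite3

# In-memory stand-in for a warehouse table with a merge key (order_id).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")

def load_orders(rows):
    """Idempotent load: rerunning with the same rows never duplicates data."""
    conn.executemany(
        "INSERT INTO orders (order_id, amount) VALUES (?, ?) "
        "ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount",
        rows,
    )
    conn.commit()

batch = [(1, 9.99), (2, 24.50)]
load_orders(batch)
load_orders(batch)  # rerun of the same task: no duplicate rows

count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(count)  # 2
```

A blind `INSERT` here would leave 4 rows after the rerun; the merge key makes the second run a no-op.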

Term: Retry Backoff

Plain definition: On failure, try again with increasingly long pauses.

Technical definition: A retry strategy with escalating intervals, reducing load on a system that is failing transiently.

Practical example: An API timeout is retried after 1m, 2m, then 5m before being marked failed.
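Airflow enables this per task with `retry_exponential_backoff=True`; the delay schedule itself can be sketched as a pure function (the doubling factor and cap below are illustrative, and Airflow also adds jitter):

```python
from datetime import timedelta

def backoff_delays(base: timedelta, retries: int, cap: timedelta) -> list[timedelta]:
    """Exponential backoff: double the wait on each retry, up to a cap."""
    return [min(base * (2 ** attempt), cap) for attempt in range(retries)]

delays = backoff_delays(timedelta(minutes=1), retries=3, cap=timedelta(minutes=5))
print([int(d.total_seconds() // 60) for d in delays])  # [1, 2, 4]
```

The cap matters: without it, a task with many retries can wait hours between attempts and silently blow through its SLA.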

Why Orchestration Matters

Data pipelines don't run in isolation. They have dependencies, schedules, and failure scenarios. Orchestration manages these complexities automatically.

💡 Without Orchestration

Cron jobs → Silent failures → Data staleness → Manual recovery → 3AM pages

✅ With Orchestration

Defined dependencies → Automatic retries → Alerting → Self-healing → Peace of mind

Apache Airflow Overview

Apache Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring workflows. Created at Airbnb in 2014, it is now an Apache Top-Level Project.

Core Concepts

| Concept | Description | Analogy |
| --- | --- | --- |
| DAG | Directed Acyclic Graph - the workflow definition | Recipe |
| Task | Unit of work within a DAG | Recipe step |
| Operator | Template for a task (what to run) | Tool (oven, knife) |
| Hook | Connection to external systems | Adapter |
| Sensor | Waits for an external event | Timer |

Airflow Architecture

🌐 Web Server

UI for monitoring and management

⚙️ Scheduler

Decides when to run tasks

💼 Worker

Executes tasks (Celery/Kubernetes)

🗄️ Metadata DB

PostgreSQL/MySQL for state

Writing Your First DAG

```python
# dag_example.py
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

# Default arguments applied to all tasks
default_args = {
    'owner': 'data-team',
    'depends_on_past': False,
    'email': ['alert@company.com'],
    'email_on_failure': True,
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
}

# DAG definition
with DAG(
    'my_first_dag',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval=timedelta(days=1),
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=['tutorial'],
) as dag:
    # Task 1: Extract
    extract = BashOperator(
        task_id='extract_data',
        bash_command='echo "Extracting..."',
    )

    # Task 2: Transform
    def transform_logic(**context):
        data = context['ti'].xcom_pull(task_ids='extract_data')
        return f"Transformed: {data}"

    transform = PythonOperator(
        task_id='transform_data',
        python_callable=transform_logic,
    )

    # Task 3: Load
    load = BashOperator(
        task_id='load_data',
        bash_command='echo "Loading..."',
    )

    # Define dependencies
    extract >> transform >> load
```

Common Operators

| Operator | Use Case | Example |
| --- | --- | --- |
| PythonOperator | Run Python functions | Data transformation |
| BashOperator | Run shell commands | File operations |
| SQLExecuteQueryOperator | Execute SQL | dbt, data warehouse |
| S3FileTransformOperator | S3 file operations | Data lake tasks |
| DockerOperator | Run containers | Isolated environments |
| HttpOperator | API calls | External services |

DAG Design Patterns

📊 ETL Pattern

Extract → Transform → Load
Linear dependency chain

🌳 Branching

Parallel processing paths
Join back at the end

🔀 Dynamic Tasks

TaskGroup for repeated work
Map over list of items

⏱️ Sensors

Wait for external events
File arrival, API ready
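The branching pattern hinges on a callable that returns the task_id to follow, which is how `BranchPythonOperator` works; a minimal sketch of such a callable (the task names and threshold are illustrative):

```python
def choose_load_path(record_count: int) -> str:
    """Branch callable: return the task_id of the downstream path to run.
    With BranchPythonOperator, tasks not on the returned path are skipped."""
    if record_count > 100_000:
        return "full_reload"       # hypothetical task_id
    return "incremental_load"      # hypothetical task_id

print(choose_load_path(5_000))    # incremental_load
print(choose_load_path(250_000))  # full_reload
```

Both paths can then converge on a join task with `trigger_rule="none_failed_min_one_success"` so the skipped branch does not block it.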

Best Practices

⚠️ DO's and DON'Ts

✅ DO:

  1. Keep tasks idempotent and atomic.
  2. Store credentials and config in Connections and Variables, not in code.
  3. Set retries, timeouts, and SLAs explicitly.
  4. Keep DAG files lightweight - the scheduler parses them constantly.

❌ DON'T:

  1. Run heavy computation or network calls at the top level of a DAG file.
  2. Pass large datasets through XCom.
  3. Hardcode secrets or environment-specific paths.
  4. Rely on an execution order that is not declared as a dependency.

Monitoring and Alerting

Airflow provides multiple monitoring mechanisms:

  1. Email alerts via email_on_failure and email_on_retry
  2. Custom callbacks (on_failure_callback, on_success_callback, on_retry_callback)
  3. SLA misses, surfaced in the UI and via sla_miss_callback
  4. Per-try task logs in the web UI
  5. Metrics exported via StatsD for dashboards and alerting
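An `on_failure_callback` receives the task's context dict; a sketch of a message formatter such a callback might call (field access mirrors Airflow's context keys, while the wording and the stand-in values are illustrative):

```python
from types import SimpleNamespace

def build_failure_message(context: dict) -> str:
    """Format an alert from the context dict Airflow passes to
    on_failure_callback; task_instance carries the ids and log URL."""
    ti = context["task_instance"]
    return f"DAG {ti.dag_id} / task {ti.task_id} failed. Logs: {ti.log_url}"

# Exercise it with a stand-in for Airflow's TaskInstance:
fake_ti = SimpleNamespace(dag_id="daily_etl", task_id="load_orders",
                          log_url="http://airflow.local/log")
msg = build_failure_message({"task_instance": fake_ti})
print(msg)
```

A real callback would post this string to Slack or a pager instead of printing it, and would be wired up via `default_args={"on_failure_callback": ...}`.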

Decision Framework: Orchestration Design

| Decision Point | Choose Option A If... | Choose Option B If... |
| --- | --- | --- |
| Single DAG vs multiple DAGs | The flow is simple with linear dependencies | Domains, ownership, or SLAs differ |
| Schedule-based vs event-driven | Jobs are stable and periodic (hourly/daily) | Execution depends on events or file arrival |
| TaskFlow API vs classic operators | The pipeline is Python-centric with reusable functions | You need built-in operators, shell commands, or legacy SQL |
| Catchup enabled vs disabled | You need controlled historical backfills | You only care about the latest run (near-real-time ops) |
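The catchup row has a practical corollary: with catchup or backfill, the scheduler creates one run per data interval, and each run should write only to its own partition so reruns stay safe. A sketch of deriving that partition from the run's logical date (the bucket and path layout are hypothetical):

```python
from datetime import datetime

def partition_path(logical_date: datetime) -> str:
    """Each backfilled run writes to the partition for its own data
    interval, so a rerun overwrites one day instead of appending globally."""
    # "analytics-lake" is a hypothetical bucket name
    return f"s3://analytics-lake/orders/dt={logical_date:%Y-%m-%d}/"

print(partition_path(datetime(2024, 1, 15)))
# s3://analytics-lake/orders/dt=2024-01-15/
```

In a DAG this date would come from the task context (the `logical_date` / `ds` template fields), never from `datetime.now()`, which would break backfills.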

Failure Modes & Anti-Patterns

Common Airflow Anti-Patterns

  1. Top-level DAG-file code that hits databases or APIs at parse time
  2. Non-idempotent tasks that duplicate data when retried
  3. Using XCom as a data transport for large payloads
  4. One giant DAG mixing unrelated domains and schedules
  5. Enabling catchup on a new DAG with an old start_date by accident

Production Readiness Checklist

Production DAG Checklist

  1. The DAG is idempotent and safe to retry.
  2. Retries, timeouts, SLAs, and concurrency are configured.
  3. Secrets are not hardcoded (use Connections/Secrets Backend).
  4. Alerting to Slack/email/pager is active.
  5. A runbook for failed DAGs and manual recovery exists.
  6. The backfill strategy and resource guardrails are documented.
  7. Cross-DAG dependencies are explicit (datasets/triggers/external task sensors).
  8. Run-duration trends are monitored, not just failure counts.
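Several of these items (retries, timeouts, SLAs, alerting) live in one place: the DAG's `default_args`. A sketch of what a hardened config might look like, using real Airflow task parameters but illustrative values and an illustrative email address:

```python
from datetime import timedelta

# Hypothetical hardened defaults covering the retry/timeout/SLA/alerting items:
default_args = {
    "owner": "data-team",
    "retries": 3,
    "retry_delay": timedelta(minutes=2),
    "retry_exponential_backoff": True,           # back off between retries
    "execution_timeout": timedelta(minutes=30),  # kill runaway tasks
    "sla": timedelta(hours=2),                   # flag late runs in the UI
    "email_on_failure": True,
    "email": ["alerts@example.com"],             # illustrative address
}

print(sorted(default_args))
```

Setting these once in `default_args` keeps individual task definitions clean while still allowing per-task overrides.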

✏️ Exercise: Build a Complete DAG

Create a DAG for an e-commerce ETL pipeline with:

  1. Extract orders from PostgreSQL (daily)
  2. Extract customer updates from API
  3. Transform and join data with Python
  4. Load to BigQuery
  5. Send success/failure notification

Requirements:

  1. Tasks must be idempotent and safe to retry.
  2. Retries with backoff and sensible timeouts are configured.
  3. Dependencies between the five steps are declared explicitly.

🎯 Quick Quiz

1. What is the function of the Scheduler in Airflow?

A. Executes tasks directly
B. Decides when tasks should run
C. Provides a UI for monitoring
D. Stores task logs

2. Which operator is suited to running a Python function?

A. BashOperator
B. PythonOperator
C. DockerOperator
D. Sensor

3. What does DAG mean?

A. Database Access Gateway
B. Directed Acyclic Graph - the workflow definition
C. Data Analytics Group
D. Distributed Application Grid

Alternatives to Airflow

| Tool | Best For | Key Difference |
| --- | --- | --- |
| Prefect | Modern Python workflows | Dynamic DAGs, hybrid mode |
| Dagster | Data-aware orchestration | Asset-based, type checking |
| Temporal | Microservices workflows | Durable execution |
| Azure Data Factory | Azure ecosystem | Fully managed, visual editor |

Conclusion

Orchestration is the backbone of data engineering. Airflow provides a mature, flexible platform for managing workflow complexity. Master DAG design patterns, best practices, and monitoring to build reliable pipelines.

🎯 Key Takeaways

  1. Orchestration replaces brittle cron chains with declared dependencies, retries, and alerting.
  2. DAGs, tasks, operators, hooks, and sensors are Airflow's core building blocks.
  3. Idempotent, well-configured tasks make retries and backfills safe.
  4. Monitor run-duration trends and alerting paths, not just failure counts.
