Intermediate

Data Pipeline Monitoring

Observability, alerting, SLA management, and incident response

⏱️ 35 min read 📅 Updated Jan 2025 👤 By DataLearn Team

Beginner Reading Mode

Think of monitoring as a "data quality alarm system". Focus your reading on:

  1. Key metrics: freshness, completeness, validity, and cost
  2. Designing alerts that are not noisy but still detect problems quickly
  3. A basic runbook for responding to data incidents

Glossary: DE-GLOSSARY.md

Light Prerequisites

Key Terms (3 Layers)

Term: Freshness SLO

Plain definition: The maximum acceptable delay for data.

Technical definition: A Service Level Objective for the event-to-availability delay of a data product.

Practical example: The transactions table must be updated no later than 10 minutes after an event arrives.
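A freshness SLO like the 10-minute example can be checked by comparing the newest event timestamp in a table against the current time. The following is a minimal sketch, not a definitive implementation: the function names `freshness_lag` and `breaches_freshness_slo` are hypothetical, and in practice the newest timestamp would come from a query against the table.

```python
from datetime import datetime, timedelta, timezone

# SLO taken from the example above: data may lag events by at most 10 minutes.
FRESHNESS_SLO = timedelta(minutes=10)


def freshness_lag(latest_event_time: datetime, now: datetime) -> timedelta:
    """Return how far the newest available row lags behind `now`."""
    return now - latest_event_time


def breaches_freshness_slo(
    latest_event_time: datetime,
    now: datetime,
    slo: timedelta = FRESHNESS_SLO,
) -> bool:
    """True when the lag exceeds the SLO (hypothetical helper)."""
    return freshness_lag(latest_event_time, now) > slo


# Illustrative timestamps; a real check would read MAX(event_time) from the table.
now = datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc)
fresh = datetime(2025, 1, 15, 11, 53, tzinfo=timezone.utc)  # 7 minutes behind
stale = datetime(2025, 1, 15, 11, 45, tzinfo=timezone.utc)  # 15 minutes behind

print(breaches_freshness_slo(fresh, now))  # False
print(breaches_freshness_slo(stale, now))  # True
```

Using timezone-aware timestamps avoids false breaches when the pipeline and the checker run in different time zones.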

Term: Alert Fatigue

Plain definition: The team becomes numb to alerts because too many unimportant alarms fire.

Technical definition: Reduced incident response effectiveness caused by high noise and poor alert prioritization.

Practical example: Hundreds of minor alerts cause a critical failure alert to be handled late.

Why Monitor Data Pipelines?

Key Metrics to Monitor

Technical Metrics

Metric | Description | Alert When
Pipeline Duration | Time to complete | > SLA or baseline + 20%
Success Rate | % successful runs | < 99% over 24h
Data Volume | Rows/bytes processed | Anomalous deviation
Freshness | Age of newest data | > expected update interval
Resource Usage | CPU, memory, disk | > 80% capacity
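Three of the alert conditions in the table are simple threshold rules and can be evaluated directly from run statistics. The sketch below assumes a hypothetical `RunStats` record and alert names; it covers duration, success rate, and resource usage, and leaves out volume anomalies and freshness, which need history or timestamps.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Hypothetical summary of one pipeline over the last 24 hours."""
    duration_s: float            # duration of the latest run
    baseline_duration_s: float   # typical duration for this pipeline
    successful_runs_24h: int
    total_runs_24h: int
    resource_utilization: float  # peak CPU/memory/disk as a fraction, 0.0 to 1.0


def evaluate_technical_alerts(stats: RunStats) -> list[str]:
    """Return the names of the threshold rules that are currently violated."""
    alerts = []
    # Pipeline Duration: alert when slower than baseline + 20%.
    if stats.duration_s > stats.baseline_duration_s * 1.20:
        alerts.append("pipeline_duration")
    # Success Rate: alert when below 99% over 24 hours.
    if stats.total_runs_24h > 0:
        success_rate = stats.successful_runs_24h / stats.total_runs_24h
        if success_rate < 0.99:
            alerts.append("success_rate")
    # Resource Usage: alert above 80% of capacity.
    if stats.resource_utilization > 0.80:
        alerts.append("resource_usage")
    return alerts


degraded = RunStats(1500, 1200, 95, 96, 0.65)
healthy = RunStats(1300, 1200, 96, 96, 0.50)

print(evaluate_technical_alerts(degraded))  # ['pipeline_duration', 'success_rate']
print(evaluate_technical_alerts(healthy))   # []
```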

Data Quality Metrics

Monitoring Tools

Tool | Type | Best For
Airflow UI | Built-in | Pipeline execution status
Prometheus + Grafana | Open Source | Metrics and dashboards
Datadog | SaaS | Full-stack observability
Monte Carlo | SaaS | Data observability
PagerDuty | SaaS | Incident management

Alerting Strategy

🚨 Alert Fatigue Prevention

Alert Severity Levels

Severity | Response Time | Example
P0 - Critical | 15 minutes | Production pipeline down, data loss
P1 - High | 1 hour | Data freshness SLA breach
P2 - Medium | 4 hours | Resource usage high, degradation
P3 - Low | Next business day | Optimization opportunities
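The response times in the table can be turned into a concrete deadline for each alert. This sketch makes two assumptions that are not in the table: "next business day" is read as 17:00 on the next weekday, and public holidays are ignored. The function names are hypothetical.

```python
from datetime import date, datetime, time, timedelta

# Response-time targets from the severity table.
RESPONSE_TARGETS = {
    "P0": timedelta(minutes=15),
    "P1": timedelta(hours=1),
    "P2": timedelta(hours=4),
}


def next_business_day(day: date) -> date:
    """Return the next Monday-to-Friday date after `day` (holidays ignored)."""
    nxt = day + timedelta(days=1)
    while nxt.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        nxt += timedelta(days=1)
    return nxt


def response_deadline(severity: str, fired_at: datetime) -> datetime:
    """Return the latest time by which someone should respond to the alert."""
    if severity == "P3":
        # Assumption: respond by 17:00 on the next business day.
        return datetime.combine(next_business_day(fired_at.date()), time(17, 0))
    if severity not in RESPONSE_TARGETS:
        raise ValueError(f"unknown severity: {severity}")
    return fired_at + RESPONSE_TARGETS[severity]


friday_afternoon = datetime(2025, 1, 17, 16, 0)  # a Friday

print(response_deadline("P0", friday_afternoon))  # 2025-01-17 16:15:00
print(response_deadline("P3", friday_afternoon))  # 2025-01-20 17:00:00
```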

SLA Management

Service Level Agreements for data:

📋 SLA Components

Incident Response

Incident Response Process

  1. Detect: Alert fires, automated detection
  2. Triage: Assess severity and impact
  3. Mitigate: Stop the bleeding (rerun, revert)
  4. Resolve: Fix root cause
  5. Post-mortem: Document and learn
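The five steps above form a fixed sequence, so an incident tracker can enforce that no stage is skipped. The following is an illustrative sketch only; the `Stage` and `Incident` names are hypothetical and real incident tools model many more states.

```python
from enum import Enum


class Stage(Enum):
    """The five stages of the process, in order."""
    DETECT = 1
    TRIAGE = 2
    MITIGATE = 3
    RESOLVE = 4
    POST_MORTEM = 5


class Incident:
    """Hypothetical incident record that can only move forward one stage at a time."""

    def __init__(self, title: str):
        self.title = title
        self.stage = Stage.DETECT
        self.history = [Stage.DETECT]

    def advance(self) -> Stage:
        """Move to the next stage; fail if the post-mortem is already reached."""
        if self.stage is Stage.POST_MORTEM:
            raise ValueError("incident has already reached the post-mortem stage")
        self.stage = Stage(self.stage.value + 1)
        self.history.append(self.stage)
        return self.stage


incident = Incident("orders pipeline failed")
incident.advance()  # DETECT -> TRIAGE
incident.advance()  # TRIAGE -> MITIGATE
print(incident.stage.name)  # MITIGATE
```

Keeping the stage history makes it easy to reconstruct a timeline later, when the post-mortem is written.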

Common Incident Types

Type | Symptom | Typical Fix
Pipeline Failure | Job error, timeout | Fix code, retry, backfill
Data Delay | Source system late | Wait, adjust schedule
Schema Drift | New column, type change | Update schema, tests
Data Quality | Anomaly detected | Quarantine, investigate
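Schema drift, as listed in the table, shows up as new columns or changed types. One way to detect it is to compare the observed schema against an expected one. In this sketch both schemas are plain dictionaries mapping column name to type name; the function name and the example columns are hypothetical.

```python
def detect_schema_drift(
    expected: dict[str, str], observed: dict[str, str]
) -> dict[str, list]:
    """Compare two {column: type} mappings and report the differences."""
    added = sorted(set(observed) - set(expected))
    removed = sorted(set(expected) - set(observed))
    # Columns present in both schemas whose declared type differs.
    type_changed = sorted(
        (col, expected[col], observed[col])
        for col in set(expected) & set(observed)
        if expected[col] != observed[col]
    )
    return {"added": added, "removed": removed, "type_changed": type_changed}


# Illustrative schemas, not taken from a real system.
expected_schema = {"order_id": "int", "amount": "float", "created_at": "timestamp"}
observed_schema = {
    "order_id": "int",
    "amount": "string",
    "created_at": "timestamp",
    "coupon_code": "string",
}

drift = detect_schema_drift(expected_schema, observed_schema)
print(drift["added"])         # ['coupon_code']
print(drift["type_changed"])  # [('amount', 'float', 'string')]
```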

Decision Framework: Monitoring Strategy

Decision Point | Choose Option A If... | Choose Option B If...
Freshness-first vs Quality-first | Stakeholders are sensitive to delays in operational data | Stakeholders are sensitive to the accuracy of figures (finance/compliance)
Metric alert vs anomaly alert | Business thresholds are clear and stable | Data patterns are dynamic and hard to capture with static thresholds
Centralized on-call vs Domain on-call | The team is small and ownership is not yet separated | Per-domain data products are already mature
P0/P1 strict paging vs business-hour handling | Incidents directly affect revenue or critical operations | Impact is moderate and does not need an immediate response
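The difference between a metric alert and an anomaly alert can be illustrated on daily row counts. The sketch below uses a z-score against recent history as one simple anomaly rule; this choice, the 3.0 limit, and the sample numbers are assumptions for illustration, not a recommendation from the table.

```python
from statistics import mean, stdev


def static_threshold_alert(value: float, minimum: float) -> bool:
    """Metric alert: fire when the value drops below a fixed business threshold."""
    return value < minimum


def zscore_anomaly_alert(
    value: float, history: list[float], z_limit: float = 3.0
) -> bool:
    """Anomaly alert: fire when the value is far from the recent average."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_limit


# Hypothetical row counts for the last seven daily loads.
row_counts = [1000, 1040, 980, 1010, 995, 1025, 990]

# A load of 600 rows stays above a static floor of 500 but is far off the pattern.
print(static_threshold_alert(600, minimum=500))  # False
print(zscore_anomaly_alert(600, row_counts))     # True
print(zscore_anomaly_alert(1015, row_counts))    # False
```

The static rule is easier to explain to stakeholders, while the anomaly rule adapts as volumes grow, which matches the trade-off in the table.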

Failure Modes & Anti-Patterns

Anti-Patterns in Monitoring

Production Readiness Checklist

Monitoring Checklist before Production

  1. Top metrics are agreed on: freshness, completeness, success rate, latency.
  2. Thresholds and severity levels have a documented rationale.
  3. Every alert has an owner, a runbook, and an escalation path.
  4. An automated SLA report is available to key stakeholders.
  5. Noise reduction is active (dedup, cooldown, grouping).
  6. The incident response process has been tested through a simple drill.
  7. A postmortem template is used consistently for P0/P1 incidents.
  8. A monthly review is held to tune alerts and SLOs.
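The noise reduction item in the checklist can be sketched as a small suppressor that deduplicates repeated alerts within a cooldown window. The `AlertSuppressor` class, the alert key format, and the 10-minute cooldown are hypothetical; grouping is not covered here.

```python
class AlertSuppressor:
    """Suppress repeats of the same alert key within a cooldown window."""

    def __init__(self, cooldown_s: float):
        self.cooldown_s = cooldown_s
        self._last_sent: dict[str, float] = {}  # alert key -> time of last notification

    def should_notify(self, alert_key: str, now_s: float) -> bool:
        """Return True if a notification should be sent for this alert now."""
        last = self._last_sent.get(alert_key)
        if last is not None and now_s - last < self.cooldown_s:
            return False  # duplicate inside the cooldown window
        self._last_sent[alert_key] = now_s
        return True


suppressor = AlertSuppressor(cooldown_s=600)  # assumed 10-minute cooldown

print(suppressor.should_notify("orders.freshness", now_s=0))    # True
print(suppressor.should_notify("orders.freshness", now_s=120))  # False
print(suppressor.should_notify("orders.volume", now_s=130))     # True
print(suppressor.should_notify("orders.freshness", now_s=700))  # True
```

The cooldown is measured from the last notification that was actually sent, so an alert that keeps firing still produces a reminder once per window.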

✏️ Exercise: Design Monitoring Strategy

For a company data warehouse:

  1. Define 4 critical metrics to monitor
  2. Set alert thresholds with severity levels
  3. Choose tools: Grafana for metrics, PagerDuty for alerts
  4. Create a runbook for 2 common incident types

🎯 Quick Quiz

1. Which metric matters most to stakeholders?

A. CPU usage
B. Data freshness
C. Disk space
D. Network latency

2. What is the response time for a P0 incident?

A. 24 hours
B. 4 hours
C. 15 minutes
D. Next business day

3. What is the purpose of a post-mortem?

A. To blame someone
B. To document and learn from incidents
C. To cover up the problem
D. To cut the budget

Conclusion

Monitoring is a critical component of reliable data systems. With the right metrics, actionable alerting, and a defined incident response process, teams can maintain trust in their data.

🎯 Key Takeaways
