Intermediate
Data Pipeline Monitoring
Observability, alerting, SLA management, and incident response
⏱️ 35 min read
📅 Updated Jan 2025
👤 By DataLearn Team
Beginner Reading Mode
Think of monitoring as a "data quality alarm system". Reading focus:
- Key metrics: freshness, completeness, validity, and cost
- Alert design that avoids noise while still detecting issues quickly
- Basic runbooks for data incident response
Glossary: DE-GLOSSARY.md
Light Prerequisites
- Understand that data pipelines can run late or fail silently
- Know the concepts of metric dashboards and alert notifications
- Have helped troubleshoot a simple data issue before
Key Terms (3 Layers)
Term: Freshness SLO
Plain definition: The maximum acceptable data delay.
Technical definition: A Service Level Objective for the event-to-availability lag of a data product.
Practical example: The transactions table must be updated no later than 10 minutes after an event arrives.
Term: Alert Fatigue
Plain definition: The team becomes numb to alerts because too many unimportant alarms fire.
Technical definition: Reduced incident-response effectiveness caused by high alert noise and poor prioritization.
Practical example: Hundreds of minor alerts cause a critical failure alert to be handled late.
Why Monitor Data Pipelines?
- Detect issues early: Before stakeholders notice
- Meet SLAs: Ensure data is fresh and accurate
- Debug faster: Rich context for root cause analysis
- Build trust: Keep stakeholders confident in the data
Key Metrics to Monitor
Technical Metrics
| Metric | Description | Alert When |
|--------|-------------|------------|
| Pipeline Duration | Time to complete | > SLA or baseline + 20% |
| Success Rate | % successful runs | < 99% over 24h |
| Data Volume | Rows/bytes processed | Anomalous deviation |
| Freshness | Age of newest data | > expected update interval |
| Resource Usage | CPU, memory, disk | > 80% capacity |
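To make the Freshness row concrete, here is a minimal sketch of a freshness check in Python. The `get_max_event_timestamp` helper and the `transactions` table are assumptions for illustration; in practice the helper would wrap a warehouse query such as `SELECT MAX(event_ts) FROM transactions`.

```python
from datetime import datetime, timedelta, timezone

def get_max_event_timestamp(table: str) -> datetime:
    # Stand-in for a warehouse query such as:
    #   SELECT MAX(event_ts) FROM transactions
    # Here it fakes a timestamp 12 minutes in the past for demonstration.
    return datetime.now(timezone.utc) - timedelta(minutes=12)

def check_freshness(table: str, max_lag: timedelta) -> bool:
    """Return False (and emit an alert line) when data is staler than the SLO."""
    lag = datetime.now(timezone.utc) - get_max_event_timestamp(table)
    if lag > max_lag:
        print(f"ALERT: {table} freshness lag {lag} exceeds SLO {max_lag}")
        return False
    return True

# Freshness SLO from the Key Terms example: transactions visible within 10 minutes.
check_freshness("transactions", max_lag=timedelta(minutes=10))
```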
Data Quality Metrics
- Null Rate: % of null values in critical columns
- Duplicate Rate: % duplicate primary keys
- Schema Changes: Unexpected column additions/removals
- Distribution Shifts: Statistical anomalies in data patterns
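The null and duplicate rates above reduce to simple ratios. Below is a small pandas sketch with made-up sample data and a hypothetical 1% alert threshold.

```python
import pandas as pd

# Made-up sample: one duplicated primary key and two nulls in a critical column.
df = pd.DataFrame({
    "order_id": [1, 2, 2, 4, 5],
    "amount": [10.0, None, 15.5, None, 8.0],
})

# Null Rate: % of null values in a critical column.
null_rate = df["amount"].isna().mean()

# Duplicate Rate: % of rows whose primary key appears more than once.
dup_rate = df["order_id"].duplicated(keep=False).mean()

print(f"null_rate={null_rate:.1%}, duplicate_rate={dup_rate:.1%}")

# Hypothetical threshold: alert when more than 1% of the column is null.
if null_rate > 0.01:
    print("ALERT: amount null rate above 1% threshold")
```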
Monitoring Tools
| Tool | Type | Best For |
|------|------|----------|
| Airflow UI | Built-in | Pipeline execution status |
| Prometheus + Grafana | Open Source | Metrics and dashboards |
| Datadog | SaaS | Full-stack observability |
| Monte Carlo | SaaS | Data observability |
| PagerDuty | SaaS | Incident management |
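As one example of the Prometheus + Grafana route, a batch pipeline can push run metrics to a Pushgateway for Prometheus to scrape. This is a sketch, not the only pattern: the Pushgateway address, metric names, and the `daily_sales` pipeline are assumptions, while the `prometheus_client` calls are real.

```python
import time
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
duration = Gauge("pipeline_duration_seconds", "Wall-clock pipeline runtime",
                 ["pipeline"], registry=registry)
last_success = Gauge("pipeline_last_success_timestamp", "Unix time of last success",
                     ["pipeline"], registry=registry)

start = time.time()
# ... run the pipeline here ...
duration.labels(pipeline="daily_sales").set(time.time() - start)
last_success.labels(pipeline="daily_sales").set(time.time())

# Batch jobs push because they may finish before Prometheus scrapes them.
# The Pushgateway address below is an assumption for this sketch.
push_to_gateway("localhost:9091", job="data_pipelines", registry=registry)
```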
Alerting Strategy
🚨 Alert Fatigue Prevention
- Only alert on actionable issues
- Use severity levels (P0, P1, P2)
- Set up escalation policies
- Review and tune alert thresholds regularly
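One concrete noise-reduction tactic is a cooldown gate that suppresses repeats of the same alert. The sketch below keeps state in memory only, which is purely illustrative; production systems persist this state or rely on the alerting tool's dedup and grouping features.

```python
import time

_last_sent: dict[str, float] = {}

def should_send(alert_key: str, cooldown_s: float = 900.0) -> bool:
    """Suppress repeats of the same alert within the cooldown window."""
    now = time.time()
    if now - _last_sent.get(alert_key, float("-inf")) < cooldown_s:
        return False  # duplicate inside the cooldown window: drop it
    _last_sent[alert_key] = now
    return True

# Identical alerts within 15 minutes collapse into a single notification.
print(should_send("daily_sales:freshness_breach"))  # True  -> notify
print(should_send("daily_sales:freshness_breach"))  # False -> suppressed
```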
Alert Severity Levels
| Severity | Response Time | Example |
|----------|---------------|---------|
| P0 - Critical | 15 minutes | Production pipeline down, data loss |
| P1 - High | 1 hour | Data freshness SLA breach |
| P2 - Medium | 4 hours | Resource usage high, degradation |
| P3 - Low | Next business day | Optimization opportunities |
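These levels can be encoded directly in routing logic. The sketch below mirrors the response times in the table above; the routing targets ("pager" vs. "ticket queue") are illustrative choices, not a specific tool's API.

```python
from dataclasses import dataclass

@dataclass
class SeverityPolicy:
    response_minutes: int
    page_on_call: bool  # page immediately vs. queue for business hours

# Mirrors the severity table above; routing targets are illustrative.
POLICIES = {
    "P0": SeverityPolicy(response_minutes=15, page_on_call=True),
    "P1": SeverityPolicy(response_minutes=60, page_on_call=True),
    "P2": SeverityPolicy(response_minutes=240, page_on_call=False),
    "P3": SeverityPolicy(response_minutes=24 * 60, page_on_call=False),
}

def route_alert(severity: str, message: str) -> None:
    policy = POLICIES[severity]
    target = "pager (on-call)" if policy.page_on_call else "ticket queue"
    print(f"[{severity}] -> {target}, respond within "
          f"{policy.response_minutes} min: {message}")

route_alert("P0", "Production pipeline down, possible data loss")
route_alert("P2", "Worker memory at 85% capacity")
```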
SLA Management
Service Level Agreements for data:
📋 SLA Components
- Freshness: Data updated within X hours
- Completeness: % of expected records received
- Accuracy: % of records passing quality checks
- Availability: % of time data is queryable
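Completeness and availability reduce to simple ratios that can be checked against targets on every run or every day. The numbers and targets in this sketch are illustrative, not recommended values.

```python
def completeness(received: int, expected: int) -> float:
    """Completeness: share of expected records actually received."""
    return received / expected

def availability(uptime_minutes: int, total_minutes: int) -> float:
    """Availability: share of time the data was queryable."""
    return uptime_minutes / total_minutes

# Illustrative daily numbers checked against hypothetical SLA targets.
targets = {"completeness": 0.99, "availability": 0.995}
observed = {
    "completeness": completeness(received=987_000, expected=1_000_000),
    "availability": availability(uptime_minutes=1_436, total_minutes=1_440),
}
for name, target in targets.items():
    status = "OK" if observed[name] >= target else "BREACH"
    print(f"{name}: {observed[name]:.3%} vs target {target:.1%} -> {status}")
```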
Incident Response
Incident Response Process
1. Detect: Alert fires, automated detection
2. Triage: Assess severity and impact
3. Mitigate: Stop the bleeding (rerun, revert)
4. Resolve: Fix root cause
5. Post-mortem: Document and learn
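For the Mitigate step, the most common first response is a rerun. Orchestrators such as Airflow provide built-in retry settings; the generic sketch below only illustrates the exponential-backoff idea and is not any specific tool's API.

```python
import time

def run_with_retries(run_pipeline, max_attempts: int = 3, base_delay_s: float = 30.0):
    """First-line mitigation: rerun a failed job with exponential backoff.
    This only masks transient faults; the root cause is still fixed in Resolve."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_pipeline()
        except Exception as exc:  # production code would catch narrower errors
            if attempt == max_attempts:
                raise  # mitigation failed: escalate to an incident
            delay = base_delay_s * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)

# Demo: a flaky job that succeeds on the second attempt.
attempts = {"n": 0}
def flaky_job():
    attempts["n"] += 1
    if attempts["n"] < 2:
        raise RuntimeError("transient source timeout")
    return "ok"

print(run_with_retries(flaky_job, base_delay_s=0.1))
```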
Common Incident Types
| Type | Symptom | Typical Fix |
|------|---------|-------------|
| Pipeline Failure | Job error, timeout | Fix code, retry, backfill |
| Data Delay | Source system late | Wait, adjust schedule |
| Schema Drift | New column, type change | Update schema, tests |
| Data Quality | Anomaly detected | Quarantine, investigate |
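Schema drift in particular is cheap to detect automatically by diffing the live schema against an expected contract, as in the sketch below. The column names and types are hypothetical.

```python
# Expected contract for the table; names and types are hypothetical.
EXPECTED_SCHEMA = {"order_id": "BIGINT", "amount": "DOUBLE", "event_ts": "TIMESTAMP"}

def check_schema(actual: dict[str, str]) -> list[str]:
    """Diff the live schema against the contract and describe each drift."""
    problems = []
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in actual:
            problems.append(f"missing column: {col}")
        elif actual[col] != dtype:
            problems.append(f"type change: {col} {dtype} -> {actual[col]}")
    for col in actual.keys() - EXPECTED_SCHEMA.keys():
        problems.append(f"unexpected new column: {col}")
    return problems

# Simulated live schema: one type change and one new column.
live = {"order_id": "BIGINT", "amount": "VARCHAR",
        "event_ts": "TIMESTAMP", "coupon": "VARCHAR"}
for issue in check_schema(live):
    print("SCHEMA DRIFT:", issue)
```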
Decision Framework: Monitoring Strategy
| Decision Point | Choose Option A If... | Choose Option B If... |
|----------------|-----------------------|-----------------------|
| Freshness-first vs Quality-first | Stakeholders are sensitive to delays in operational data | Stakeholders are sensitive to numeric correctness (finance/compliance) |
| Metric alert vs anomaly alert | Business thresholds are clear and stable | Data patterns are dynamic and static thresholds fit poorly (see the sketch below) |
| Centralized on-call vs Domain on-call | The team is small and ownership is not yet split | Per-domain data products are mature |
| P0/P1 strict paging vs business-hour handling | Incidents directly impact revenue or critical operations | Impact is moderate and no immediate response is needed |
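When the anomaly-alert option wins (dynamic patterns, no stable threshold), even a simple statistical baseline helps. The sketch below flags a day's row count when it sits more than three standard deviations from recent history; the figures are made up.

```python
import statistics

def volume_is_anomalous(history: list[int], today: int,
                        z_threshold: float = 3.0) -> bool:
    """Flag today's row count when it deviates strongly from the recent
    baseline, with no fixed business threshold required."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(today - mean) / stdev > z_threshold

# 14 days of typical volume (made-up figures), then a sharp one-day drop.
history = [98_500, 101_200, 99_800, 100_400, 97_900, 102_100, 100_800,
           99_300, 101_700, 98_900, 100_100, 99_600, 100_900, 101_400]
print(volume_is_anomalous(history, today=52_000))   # True: investigate upstream
print(volume_is_anomalous(history, today=100_300))  # False: within normal range
```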
Failure Modes & Anti-Patterns
Monitoring Anti-Patterns
- Alert spam: so many non-actionable alerts that important incidents are missed.
- No ownership mapping: alerts come in, but it is unclear who should respond.
- Dashboard-only monitoring: metrics exist, but there are no automated alerts.
- No postmortem loop: incidents recur because no corrective action is taken.
- Technical-only SLIs: the pipeline looks healthy, but the business data is still unusable.
Production Readiness Checklist
Monitoring Checklist before Production
- Top metrics are agreed upon: freshness, completeness, success rate, latency.
- Thresholds and severity levels have documented rationale.
- Every alert has an owner, a runbook, and an escalation path.
- Automated SLA reports are available to key stakeholders.
- Noise reduction is active (dedup, cooldown, grouping).
- The incident response process is tested through a simple drill.
- A postmortem template is used consistently for P0/P1 incidents.
- A monthly review tunes alerts and SLOs.
✏️ Exercise: Design Monitoring Strategy
For a company data warehouse:
- Define 4 critical metrics to monitor
- Set alert thresholds with severity levels
- Choose tools: Grafana for metrics, PagerDuty for alerts
- Create runbooks for 2 common incident types
🎯 Quick Quiz
1. Which metric matters most to stakeholders?
A. CPU usage
B. Data freshness
C. Disk space
D. Network latency
2. What is the required response time for a P0 incident?
A. 24 hours
B. 4 hours
C. 15 minutes
D. Next business day
3. What is the purpose of a post-mortem?
A. Blaming someone
B. Documenting and learning from incidents
C. Covering up the problem
D. Cutting the budget
Conclusion
Monitoring is a critical component of reliable data systems. With the right metrics, actionable alerting, and a clear incident response process, teams can maintain trust in their data.
🎯 Key Takeaways
- Monitor technical metrics (duration, success rate) and data quality
- Define SLAs for freshness, completeness, accuracy
- Use severity levels to prevent alert fatigue
- Have a clear incident response process
- Learn from incidents through post-mortems
📚 References & Resources
Primary Sources
- Site Reliability Engineering, Betsy Beyer et al. (O'Reilly, 2016), Chapters 6-7: Monitoring and Alerting
- The Site Reliability Workbook, Niall Murphy et al. (O'Reilly, 2018)