Advanced
DataOps & Observability
CI/CD for data, data quality testing, lineage, and data catalog
⏱️ 40 min read
📅 Updated Jan 2025
👤 By DataLearn Team
Beginner Reading Mode
Think of DataOps as "the way a data team works to stay stable and fast". Focus on:
- The minimum observability metrics that must be monitored
- The incident flow: detection, triage, recovery
- Quality gates before a pipeline is released
Glossary: DE-GLOSSARY.md
Light Prerequisites
- Understand that data pipelines need monitoring just like software applications
- Know the concepts of alerting and incident ownership
- Have heard of automated testing before deployment
Key Terms (3 Layers)
Term: Data Freshness
Plain definition: How recent the data available for use is.
Technical definition: The time difference between when an event occurs and when its data is ready for consumption in the target system.
Practical example: A dashboard may lag actual transactions by at most 15 minutes.
Term: Error Budget
Plain definition: The amount of failure that is still considered acceptable.
Technical definition: The complement of the SLO target: the share of downtime or failed runs that is tolerated before high-priority escalation.
Practical example: An SLO of 99.5% means an error budget of about 3.6 hours of failure per 30-day month (0.5% of 720 hours).
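The error budget arithmetic can be checked with a few lines of Python. This is a minimal sketch that assumes a time-based availability SLO and a 30-day month; the function name `error_budget_hours` is illustrative and not part of any library.

```python
def error_budget_hours(slo: float, period_days: int = 30) -> float:
    """Return the tolerated failure time, in hours, for an availability SLO.

    slo is a fraction (0.995 means 99.5%); period_days is an assumed window.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_hours = period_days * 24
    return (1.0 - slo) * total_hours


budget = error_budget_hours(0.995)
print(f"{budget:.1f} hours")  # 3.6 hours
```

A tighter SLO shrinks the budget quickly: at 99.9% the same 30-day window leaves roughly 43 minutes.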
What is DataOps?
DataOps is the application of DevOps practices to the world of data.
Its goal is to automate and speed up the data lifecycle while maintaining quality.
💡 DataOps Definition
"A collaborative data management practice focused on improving the communication, integration, and automation of data flows between data managers and data consumers." (Gartner glossary definition; the term itself was popularized by Andy Palmer in 2015.)
The Three Pillars of Data Observability
📊 Metrics
Track pipeline health, data freshness, volume anomalies
📋 Logs
Detailed execution logs for debugging and auditing
🔍 Traces
End-to-end data lineage across pipeline stages
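Two of the metrics named above, data freshness and volume anomalies, can be sketched in plain Python. This is an illustration under stated assumptions: the 15-minute limit, the z-score threshold of 3, and all function names are hypothetical choices, not the API of any observability tool.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean, stdev


def is_stale(latest_event_time: datetime, now: datetime,
             max_lag: timedelta = timedelta(minutes=15)) -> bool:
    """Freshness check: True if the newest loaded event is older than max_lag."""
    return (now - latest_event_time) > max_lag


def is_volume_anomaly(history: list[int], today: int,
                      z_threshold: float = 3.0) -> bool:
    """Volume check: True if today's row count is far from the recent mean."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold


now = datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc)
print(is_stale(now - timedelta(minutes=20), now))  # True

daily_rows = [1000, 1020, 980, 1010, 990]
print(is_volume_anomaly(daily_rows, today=400))   # True
print(is_volume_anomaly(daily_rows, today=1005))  # False
```

A z-score is only one option; seasonal data (weekday versus weekend volume) usually needs a baseline per period rather than a single mean.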
CI/CD for Data Pipelines
Continuous Integration/Deployment for data involves:
- Version Control: Git untuk SQL, Python, dan configs
- Automated Testing: Unit tests, integration tests, data diff
- Environment Promotion: Dev → Staging → Production
- Rollback Capability: Quick recovery from bad deployments
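```yaml
name: dbt CI

on:
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # Assumption: dbt-postgres is a placeholder; install the adapter for your warehouse.
      - name: Install dbt
        run: pip install dbt-core dbt-postgres
      # Assumes profiles.yml defines a "ci" target with credentials supplied via secrets.
      - name: Run dbt tests
        run: |
          dbt deps
          dbt seed --target ci
          dbt run --target ci
          dbt test --target ci
```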
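The "data diff" item in the automated testing list compares the output of a changed pipeline against the current production output. A minimal sketch, assuming both result sets fit in memory and are keyed by primary key; `data_diff` and the sample rows are hypothetical.

```python
def data_diff(before: dict, after: dict) -> dict:
    """Compare two row sets shaped as {primary_key: row}.

    Returns sorted lists of keys that were added, removed, or changed.
    """
    added = sorted(after.keys() - before.keys())
    removed = sorted(before.keys() - after.keys())
    changed = sorted(
        key for key in before.keys() & after.keys() if before[key] != after[key]
    )
    return {"added": added, "removed": removed, "changed": changed}


prod_rows = {1: {"amount": 100}, 2: {"amount": 250}, 3: {"amount": 75}}
ci_rows = {1: {"amount": 100}, 2: {"amount": 255}, 4: {"amount": 40}}

print(data_diff(prod_rows, ci_rows))
# {'added': [4], 'removed': [3], 'changed': [2]}
```

For warehouse-scale tables the same comparison is normally pushed down into SQL (for example, a full outer join on the key) instead of loading rows into Python.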
Data Quality Testing Frameworks
| Tool | Type | Best For |
| --- | --- | --- |
| Great Expectations | Open Source | Comprehensive data validation |
| Soda Core | Open Source | Data quality as code |
| Deequ | Open Source (AWS) | Spark-based validation |
| dbt Tests | Built-in | Warehouse-native testing |
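The frameworks above differ in syntax, but most of their basic checks reduce to a few row-level assertions. The sketch below mimics three common ones (not null, unique, accepted values) in plain Python. It does not use the API of Great Expectations, Soda Core, Deequ, or dbt; the function names and sample rows are hypothetical.

```python
def check_not_null(rows: list[dict], column: str) -> list[int]:
    """Return indexes of rows where the column is missing or None."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


def check_unique(rows: list[dict], column: str) -> list[int]:
    """Return indexes of rows whose value repeats an earlier row."""
    seen, duplicates = set(), []
    for i, row in enumerate(rows):
        value = row.get(column)
        if value in seen:
            duplicates.append(i)
        seen.add(value)
    return duplicates


def check_accepted_values(rows: list[dict], column: str, accepted: set) -> list[int]:
    """Return indexes of rows whose value is outside the accepted set."""
    return [i for i, row in enumerate(rows) if row.get(column) not in accepted]


orders = [
    {"order_id": 1, "status": "paid"},
    {"order_id": 2, "status": "refunded"},
    {"order_id": 2, "status": "paid"},
    {"order_id": None, "status": "unknown"},
]

failures = {
    "order_id not null": check_not_null(orders, "order_id"),
    "order_id unique": check_unique(orders, "order_id"),
    "status accepted": check_accepted_values(
        orders, "status", {"paid", "refunded", "pending"}
    ),
}
print(failures)
# {'order_id not null': [3], 'order_id unique': [2], 'status accepted': [3]}
```

Returning the failing row indexes instead of a bare pass/fail flag makes triage easier when a check fails in CI.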
Data Lineage
Data lineage tracks the journey of data from source to destination:
- Column-level lineage: Track individual fields
- Impact analysis: What breaks if I change this?
- Root cause analysis: Where did this error originate?
🎯 Benefits of Lineage
- Understand data dependencies
- Faster incident resolution
- Compliance and auditing
- Optimize pipeline costs
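Impact analysis and root cause analysis are both graph traversals over lineage: one walks downstream, the other upstream. A minimal table-level sketch, assuming lineage is stored as an adjacency map; the dataset names and the `LINEAGE` structure are invented for illustration.

```python
from collections import deque

# Hypothetical table-level lineage: each key feeds the datasets in its list.
LINEAGE = {
    "raw.orders": ["staging.orders"],
    "raw.customers": ["staging.customers"],
    "staging.orders": ["marts.revenue", "marts.customer_ltv"],
    "staging.customers": ["marts.customer_ltv"],
    "marts.revenue": ["dashboard.finance"],
}


def downstream(graph: dict, start: str) -> set:
    """Impact analysis: everything reachable from start (breadth-first)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def invert(graph: dict) -> dict:
    """Flip edge direction so that children point to their parents."""
    inverted = {}
    for parent, children in graph.items():
        for child in children:
            inverted.setdefault(child, []).append(parent)
    return inverted


def upstream(graph: dict, start: str) -> set:
    """Root cause analysis: every dataset that start depends on."""
    return downstream(invert(graph), start)


print(sorted(downstream(LINEAGE, "staging.orders")))
# ['dashboard.finance', 'marts.customer_ltv', 'marts.revenue']
print(sorted(upstream(LINEAGE, "marts.customer_ltv")))
# ['raw.customers', 'raw.orders', 'staging.customers', 'staging.orders']
```

Column-level lineage uses the same traversal with `table.column` nodes, at the cost of a much larger graph.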
Data Catalog
A data catalog is an inventory of all data assets together with their metadata:
- Technical metadata: Schemas, data types, locations
- Business metadata: Definitions, ownership, SLAs
- Operational metadata: Usage stats, quality scores
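The three metadata layers can be pictured as fields on a single catalog record. This is a sketch with hypothetical field names and sample values, not the data model of DataHub, Atlas, or any other catalog.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CatalogEntry:
    name: str
    # Technical metadata
    schema: dict            # column name -> data type
    location: str
    # Business metadata
    description: str = ""
    owner: Optional[str] = None
    sla: Optional[str] = None
    # Operational metadata
    queries_last_30d: int = 0
    quality_score: Optional[float] = None


def missing_business_metadata(entries: list) -> list:
    """Return names of datasets that lack an owner or a description."""
    return [e.name for e in entries if not e.owner or not e.description]


entries = [
    CatalogEntry(
        name="marts.revenue",
        schema={"order_id": "bigint", "amount": "numeric"},
        location="warehouse.marts.revenue",
        description="Daily recognized revenue",
        owner="finance-data",
        sla="fresh by 06:00 UTC",
        queries_last_30d=412,
        quality_score=0.98,
    ),
    CatalogEntry(
        name="staging.orders",
        schema={"order_id": "bigint"},
        location="warehouse.staging.orders",
    ),
]

print(missing_business_metadata(entries))  # ['staging.orders']
```

A report like this is a cheap way to find datasets with no owner before an incident exposes the gap.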
| Catalog Tool | Features | Integration |
| --- | --- | --- |
| DataHub | Open source, rich lineage | dbt, Airflow, Snowflake |
| Apache Atlas | Governance, classifications | Hadoop ecosystem |
| Monte Carlo | Data observability SaaS | Cloud warehouses |
| Collibra | Enterprise governance | Broad enterprise support |
SLIs, SLOs, and SLAs
| Term | Definition | Example |
| --- | --- | --- |
| SLI (Service Level Indicator) | Metric to measure | Pipeline success rate |
| SLO (Service Level Objective) | Target for the metric | 99.9% success rate |
| SLA (Service Level Agreement) | Contract with penalties | 99.5% or money back |
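The relationship between the three terms is easiest to see with numbers: the SLI is measured, then compared against the internal SLO and the contractual SLA. A minimal sketch using the example targets from the table; the run history is made up and the function names are illustrative.

```python
def success_rate(run_results: list) -> float:
    """SLI: fraction of pipeline runs that succeeded."""
    if not run_results:
        raise ValueError("need at least one run to compute an SLI")
    return sum(1 for ok in run_results if ok) / len(run_results)


def meets_target(sli: float, target: float) -> bool:
    """Compare a measured SLI against an SLO or SLA target."""
    return sli >= target


# Hypothetical month: 1000 runs, 2 of them failed.
runs = [True] * 998 + [False] * 2
sli = success_rate(runs)

print(sli)                       # 0.998
print(meets_target(sli, 0.999))  # False: the internal SLO is missed
print(meets_target(sli, 0.995))  # True: the external SLA still holds
```

Keeping the SLO stricter than the SLA, as in the table, gives the team a warning zone before any contractual penalty applies.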
Decision Framework: Observability Priorities
| Decision Point | Choose Option A If... | Choose Option B If... |
| --- | --- | --- |
| Start with freshness vs. completeness | The use case is time-sensitive (daily dashboards, ops monitoring) | The use case is sensitive to numeric accuracy (finance, billing) |
| Warehouse-native tests vs. external tool | The team is small, the stack is simple, and you want to get started quickly | You need observability across platforms and sources |
| Centralized vs. team-based alerting | The data team is still small and ownership is not yet fragmented | Each domain already has its own on-call rotation |
| Column lineage vs. table lineage | You need detailed RCA for critical metrics | You are just starting and need high-level dependency visibility |
Failure Modes & Anti-Patterns
Common DataOps Anti-Patterns
- Alert fatigue: so many low-value alerts that the important ones get ignored.
- No owner per dataset: incidents take a long time to resolve because the escalation path is unclear.
- Testing only in prod: data quality bugs are found only after the business has already used the data.
- Lineage not kept up to date: RCA is slow because the metadata is stale.
- No error budget: a reliability target exists, but there is no mechanism to enforce it.
Production Readiness Checklist
DataOps & Observability Checklist
- The top 10 critical datasets each have an owner and an SLA.
- SLIs for freshness, completeness, and success rate are monitored.
- Quality tests run both in CI and on the production schedule.
- Alert severity levels and routing to owners are clearly defined.
- At least table-level lineage is available end-to-end.
- An incident template and a postmortem template exist.
- An error budget per domain has been agreed across teams.
- The reliability dashboard is reviewed regularly (weekly or monthly).
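The checklist items on ownership and alert routing can be expressed as a small lookup: each dataset maps to an owner, and each severity maps to a channel. A sketch with hypothetical team names, channels, and a fallback on-call; real routing would live in the alerting tool's configuration.

```python
# Hypothetical ownership and routing tables.
OWNERS = {
    "marts.revenue": "finance-data",
    "staging.orders": "platform-data",
}
ROUTES = {"critical": "pager", "warning": "chat", "info": "log"}


def route_alert(dataset: str, severity: str,
                fallback_owner: str = "data-platform-oncall") -> dict:
    """Decide who is notified, and through which channel, for an alert."""
    if severity not in ROUTES:
        raise ValueError(f"unknown severity: {severity}")
    return {
        "dataset": dataset,
        "owner": OWNERS.get(dataset, fallback_owner),
        "channel": ROUTES[severity],
        # Flag datasets without an owner so the gap is visible, not silent.
        "unowned": dataset not in OWNERS,
    }


print(route_alert("marts.revenue", "critical"))
# {'dataset': 'marts.revenue', 'owner': 'finance-data', 'channel': 'pager', 'unowned': False}
print(route_alert("staging.payments", "warning"))
# {'dataset': 'staging.payments', 'owner': 'data-platform-oncall', 'channel': 'chat', 'unowned': True}
```

Sending only critical alerts to the pager is one simple guard against alert fatigue.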
✏️ Exercise: Implement Data Observability
Design an observability stack for a data warehouse:
- Pick the 3 SLIs that matter most to your team
- Set an SLO target for each of them
- Choose tools, for example Great Expectations for testing and DataHub for lineage
- Set up alert channels (PagerDuty, Slack)
🎯 Quick Quiz
1. What is the main goal of DataOps?
A. Deleting all incorrect data
B. Improving the velocity and reliability of data pipelines
C. Replacing data engineers with automation
D. Moving data to the cloud
2. Which tool is suited to data quality testing?
A. Kubernetes
B. Great Expectations
C. Terraform
D. Jenkins
3. What is the difference between an SLO and an SLA?
A. An SLO is an internal target, an SLA is an external contract
B. SLOs are for software, SLAs are for hardware
C. There is no difference
D. An SLO is stricter than an SLA
Conclusion
DataOps and observability are critical capabilities for a mature data team.
With CI/CD, automated testing, and proper monitoring, teams can move fast
without fear of breaking production data.
🎯 Key Takeaways
- DataOps = DevOps practices applied to data
- Observability = Metrics + Logs + Traces (end-to-end lineage)
- Test data pipelines like software
- Define and monitor SLIs/SLOs
- Invest in data catalog for discoverability