Intermediate

Data Serving & APIs

The final stage: Making data accessible to consumers

⏱️ 35 min read 📅 Updated Jan 2025 👤 By DataLearn Team

Beginner Reading Mode

Think of data serving as "how data gets delivered to its users". As you read, focus on:

  1. Who the consumers are and what latency they need
  2. Which serving pattern fits BI, APIs, or reverse ETL
  3. How to maintain data contracts so breaking changes stay rare

Glossary: DE-GLOSSARY.md

Light Prerequisites

Key Terms (3 Layers)

Term: Data Contract

Plain definition: A promise about data format between the producer and the consumer of the data.

Technical definition: An agreement on schema, semantics, SLA, and versioning that prevents breaking changes.

Practical example: The customer_id field must be a non-null string; any change must go through a new API version.
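
To make the customer_id example concrete, here is a minimal sketch of a contract check in plain Python. The names FieldRule, CUSTOMER_CONTRACT_V1, and contract_violations are hypothetical and exist only for this illustration; a real project would more likely rely on a schema library or contract tests in CI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldRule:
    """One rule in a contract: expected Python type and nullability."""
    name: str
    type_: type
    nullable: bool = False


# Hypothetical contract; only the customer_id rule comes from the example above.
CUSTOMER_CONTRACT_V1 = [FieldRule("customer_id", str, nullable=False)]


def contract_violations(record: dict, rules: list) -> list:
    """Return readable violations; an empty list means the record conforms."""
    problems = []
    for rule in rules:
        value = record.get(rule.name)
        if value is None:
            # A missing key and an explicit null are treated the same way.
            if not rule.nullable:
                problems.append(f"{rule.name}: must not be null")
        elif not isinstance(value, rule.type_):
            problems.append(
                f"{rule.name}: expected {rule.type_.__name__}, "
                f"got {type(value).__name__}"
            )
    return problems
```

Running a check like this on sample records before publishing a dataset is one way to catch a breaking change early, before consumers see it.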

Term: Reverse ETL

Plain definition: Sending data from the warehouse back to operational tools.

Technical definition: Scheduled synchronization from analytical models to SaaS systems (CRM, ads, support tools).

Practical example: The "high churn risk" segment is sent to the CRM so the CS team can follow up.

The Data Serving Stage

Data serving is the final stage of the data engineering lifecycle. All the earlier work (generation, storage, ingestion, transformation) delivers no value if consumers cannot access the data.

💡 Key Principle

"Data engineers don't just build pipelines - they build products that deliver data."

Data Consumers and Their Needs

| Consumer | Access Pattern | Latency Requirement |
| --- | --- | --- |
| Business Analysts | SQL queries, dashboards | Seconds to minutes |
| Data Scientists | Feature stores, notebooks | Minutes |
| Applications | APIs, low-latency queries | Milliseconds |
| External Partners | Secure APIs, data exports | Varies |

Data Serving Patterns

📊 Analytics Serving

Tools: Tableau, Looker, Metabase

Store: Data Warehouse

Pre-aggregated tables for fast dashboards
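
As a rough illustration of a pre-aggregated table, the sketch below builds a daily rollup in an in-memory SQLite database. SQLite stands in for a real warehouse here, and the orders and daily_sales tables, their columns, and the sample rows are all invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_date TEXT, region TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('2025-01-01', 'APAC', 120.0),
        ('2025-01-01', 'APAC', 80.0),
        ('2025-01-01', 'EMEA', 50.0),
        ('2025-01-02', 'APAC', 30.0);

    -- Rollup table: one row per (date, region) instead of one row per order.
    CREATE TABLE daily_sales AS
    SELECT order_date, region,
           SUM(amount) AS revenue,
           COUNT(*)    AS order_count
    FROM orders
    GROUP BY order_date, region;
    """
)


def daily_revenue(day: str) -> list:
    """Dashboard query: reads the small rollup, never the raw orders table."""
    return conn.execute(
        "SELECT region, revenue, order_count FROM daily_sales "
        "WHERE order_date = ? ORDER BY region",
        (day,),
    ).fetchall()
```

The dashboard pays the aggregation cost once, when the rollup is built, rather than on every page load. The trade-off is that the rollup has to be refreshed whenever new orders arrive.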

🤖 ML Feature Store

Tools: Feast, Tecton, SageMaker

Store: Key-value, vector DB

Low-latency feature retrieval

🔌 Operational APIs

Tools: FastAPI, GraphQL

Store: PostgreSQL, Redis

Application-facing endpoints

🔄 Reverse ETL

Tools: Hightouch, Census

Dest: Salesforce, HubSpot

Sync warehouse to SaaS tools

API Design for Data

REST API Best Practices

```python
# FastAPI example: Data API
from typing import Optional

from fastapi import FastAPI, Query
from pydantic import BaseModel

app = FastAPI()


class SalesMetrics(BaseModel):
    date: str
    revenue: float
    orders: int


# ✅ Good: resource-based, filtered
@app.get("/api/v1/sales")
async def get_sales(
    start_date: str = Query(..., description="YYYY-MM-DD"),
    end_date: str = Query(..., description="YYYY-MM-DD"),
    region: Optional[str] = Query(None),  # optional filter
):
    # Implementation: query the serving store using the filters above
    rows: list[SalesMetrics] = []  # placeholder for the query result
    return {"data": rows}


# ✅ Pagination for large datasets
@app.get("/api/v1/transactions")
async def get_transactions(
    cursor: Optional[str] = None,
    limit: int = Query(100, le=1000),  # cap the page size at 1000
):
    # Implementation: fetch up to `limit` rows that come after `cursor`
    return {"data": [], "next_cursor": "abc123"}
```
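
The next_cursor value in the pagination endpoint can be produced in many ways. One common approach, sketched here using only the standard library, is an opaque cursor that encodes the last id the client saw. The TRANSACTIONS list and the helpers encode_cursor, decode_cursor, and fetch_page are hypothetical and stand in for a real query layer.

```python
import base64
import json
from typing import Optional

# Hypothetical dataset, already sorted by id (stands in for a database table).
TRANSACTIONS = [{"id": i, "amount": 10.0 * i} for i in range(1, 8)]


def encode_cursor(last_id: int) -> str:
    """Pack the last seen id into an opaque, URL-safe token."""
    raw = json.dumps({"last_id": last_id}).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor: str) -> int:
    """Recover the last seen id from a token made by encode_cursor."""
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))["last_id"]


def fetch_page(cursor: Optional[str] = None, limit: int = 3) -> dict:
    last_id = decode_cursor(cursor) if cursor else 0
    # Fetch one extra row to learn whether another page exists.
    rows = [t for t in TRANSACTIONS if t["id"] > last_id][: limit + 1]
    page = rows[:limit]
    has_more = len(rows) > limit
    next_cursor = encode_cursor(page[-1]["id"]) if has_more else None
    return {"data": page, "next_cursor": next_cursor}
```

A client keeps calling fetch_page with the returned next_cursor until it comes back as None. Unlike offset-based paging, this keeps pages stable when new rows are appended while the client is iterating.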

Reverse ETL

Traditional ETL brings data into the warehouse. Reverse ETL pushes data out to operational tools.
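
A minimal sketch of the push direction, assuming the warehouse rows are already loaded as dictionaries. The churn_score field, the 0.8 threshold, and the push callable (which stands in for a CRM client) are assumptions made for this illustration and are not the API of any specific tool.

```python
from typing import Callable, Iterator


def batched(rows: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def sync_segment(
    rows: list,
    push: Callable[[list], None],  # hypothetical CRM client call
    batch_size: int = 2,
    threshold: float = 0.8,  # assumed cut-off for "high churn risk"
) -> int:
    """Select at-risk customers and push them downstream in batches."""
    at_risk = [
        {"customer_id": r["customer_id"], "segment": "high churn risk"}
        for r in rows
        if r["churn_score"] >= threshold
    ]
    sent = 0
    for batch in batched(at_risk, batch_size):
        push(batch)  # one API call per batch keeps request counts low
        sent += len(batch)
    return sent
```

Passing push in as a parameter keeps the selection logic testable without a live CRM. In a scheduled job, that parameter would be the real client's bulk upsert call.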

🔄 Reverse ETL Use Cases

Performance Optimization

| Technique | Use Case | Implementation |
| --- | --- | --- |
| Caching | Repeated queries | Redis, materialized views |
| Pre-aggregation | Dashboard metrics | Rollup tables, cubes |
| Partitioning | Large tables | Date/region partitions |
| Indexing | Point lookups | B-tree, inverted indexes |
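
To illustrate the caching row, here is a small in-process cache with a time-to-live and explicit invalidation. It is only a sketch: the TTLCache class is hypothetical, and a production setup would more likely use Redis or a materialized view as listed above.

```python
import time
from typing import Any, Callable, Hashable


class TTLCache:
    """Cache query results for a fixed number of seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries = {}   # key -> (expires_at, value)

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: skip the expensive query
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: Hashable) -> None:
        """Drop an entry, for example after the underlying table is reloaded."""
        self._entries.pop(key, None)
```

The TTL bounds how stale a served result can be, and invalidate covers the case where the pipeline knows the data has changed before the TTL expires.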

Data Products

Treat your data outputs as products with:

Decision Framework: Data Serving Patterns

| Decision Point | Choose Option A If... | Choose Option B If... |
| --- | --- | --- |
| BI Serving vs API Serving | The primary consumers are analysts and business users | The primary consumers are product or operational applications |
| Pre-compute vs On-demand query | Latency is tight and query patterns repeat | Queries are ad hoc and varied, and flexibility matters |
| Pull API vs Reverse ETL push | Downstream systems can query whenever they need | Data must be actively synced to CRM, ads, or operational tools |

Failure Modes & Anti-Patterns

Anti-Patterns in the Serving Layer

Production Readiness Checklist

Data Serving Checklist

  1. Metric definitions and the semantic layer are agreed across teams.
  2. Latency and freshness SLAs are published to consumers.
  3. Contract/API schema tests run in CI.
  4. The caching and invalidation plan is documented.
  5. Access control is enforced for sensitive endpoints and datasets.
  6. Usage, error rate, and cost per endpoint are monitored.
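
The latency SLA and monitoring items can be checked with very little code. The sketch below computes a nearest-rank p95 and an error rate from a list of request samples. The sample shape (latency_ms, status) and the function names are assumptions for illustration; real deployments usually take these numbers from their metrics system.

```python
import math


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile: smallest sample with >= pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def endpoint_health(samples: list, p95_target_ms: float) -> dict:
    """Summarize one endpoint: p95 latency, 5xx error rate, SLA verdict."""
    latencies = [s["latency_ms"] for s in samples]
    errors = sum(1 for s in samples if s["status"] >= 500)
    p95 = percentile(latencies, 95)
    return {
        "p95_ms": p95,
        "error_rate": errors / len(samples),
        "meets_sla": p95 <= p95_target_ms,
    }
```

Comparing p95 rather than the average against the target matters because a handful of slow requests can hide behind a healthy mean.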

✏️ Exercise: Design a Data Product

Design a Customer Lifetime Value API for the marketing team:

  1. Define the API contract (endpoints, request/response)
  2. Choose storage and serving technology
  3. Design for 100ms p95 latency
  4. Plan for 10K requests/minute
  5. Define SLA and error handling
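
As a starting point for the sizing steps, the arithmetic below converts the request rate to requests per second and estimates how many requests are in flight at once using Little's law (concurrency = arrival rate × time in system). Plugging in the p95 latency instead of the mean is an assumption that gives a conservative estimate, and required_concurrency is a hypothetical helper name.

```python
def required_concurrency(requests_per_minute: float, latency_ms: float) -> float:
    """Little's law: average in-flight requests = rate (req/s) * latency (s)."""
    requests_per_second = requests_per_minute / 60
    return requests_per_second * (latency_ms / 1000)


# 10K requests/minute is roughly 167 requests/second; at 100 ms each,
# about 17 requests are being served at any moment.
in_flight = required_concurrency(10_000, 100)
```

This says nothing about bursts, so the serving tier still needs headroom above that figure.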

🎯 Quick Quiz

1. How does Reverse ETL differ from traditional ETL?

A. It uses different tools
B. It moves data from the warehouse to operational tools
C. It is faster than ETL
D. It is only for real-time data

2. Which technique suits a dashboard with repeated queries?

A. Full table scan
B. Caching dan pre-aggregation
C. Sequential reads
D. Dynamic SQL generation

3. What is the typical latency requirement for operational APIs?

A. Minutes
B. Seconds
C. Milliseconds
D. Hours

Conclusion

Data serving is a stage that is often overlooked but critically important. Data engineers must understand what different consumers need and choose the right technologies and patterns to serve them.

🎯 Key Takeaways
