Beginner

Source Systems

Memahami berbagai sumber data dan cara mengekstraknya

⏱️ 30 min read πŸ“… Updated Jan 2025 πŸ‘€ By DataLearn Team

Mode Baca Pemula

Anggap source system sebagai β€œasal data”. Fokus baca:

  1. Jenis source yang paling umum
  2. Karakteristik data dari tiap source
  3. Metode ekstraksi yang cocok per source

Kamus istilah: DE-GLOSSARY.md

Prasyarat Ringan

Istilah Penting (3 Lapis)

Istilah: Source System

Definisi awam: Tempat asal data sebelum masuk pipeline data.

Definisi teknis: Sistem operasional atau eksternal yang menghasilkan data untuk ingestion.

Contoh praktis: PostgreSQL aplikasi, API payment gateway, log clickstream.

Istilah: CDC

Definisi awam: Ambil hanya data yang berubah.

Definisi teknis: Teknik capture perubahan insert/update/delete dari source database.

Contoh praktis: Order baru/diupdate otomatis tersinkron ke warehouse tanpa full dump.

Apa itu Source Systems?

Source Systems adalah sistem atau aplikasi yang menciptakan data. Ini adalah titik awal dari data engineering lifecycle. Tanpa memahami source systems, kita tidak bisa membangun pipeline yang efektif.

πŸ’‘ Key Concept

Data engineers don't control source systems, but must understand them deeply to build reliable pipelines.

Jenis-jenis Source Systems

πŸ—„οΈ

OLTP Databases

MySQL, PostgreSQL, Oracle - Transaksi bisnis harian

πŸ“‘

APIs

REST, GraphQL - Data dari third-party services

πŸ“

Files

CSV, JSON, Parquet, Log files

πŸ”„

Message Queues

Kafka, RabbitMQ - Event streaming real-time

πŸ“±

IoT Devices

Sensors, wearables, smart devices

☁️

SaaS Apps

Salesforce, HubSpot, Google Analytics

1. OLTP Databases (Online Transaction Processing)

OLTP databases adalah sistem transaksi bisnis utama. Mereka designed untuk ACID compliance dan low-latency writes, bukan untuk analytics.

Common OLTP Databases

Database Type Best For CDC Support
PostgreSQL Open Source RDBMS Web apps, complex queries βœ… Logical Replication
MySQL Open Source RDBMS Web apps, high read traffic βœ… Binary Log
SQL Server Enterprise RDBMS Enterprise applications βœ… Change Tracking
MongoDB Document Store Flexible schemas, JSON data βœ… Change Streams

Extraction Methods

πŸ“‹ Full Extract

When: Small tables (< 1M rows), initial load

SELECT * FROM users;

Pros: Simple, complete snapshot

Cons: Slow untuk large tables, heavy load on source

πŸ“… Incremental (Timestamp)

When: Tables dengan updated_at column

SELECT * FROM orders WHERE updated_at > :last_extract_time;

Pros: Efficient, lower load

Cons: Misses hard deletes, requires timestamp column

πŸ”’ Incremental (Auto-increment)

When: Append-only tables dengan ID

SELECT * FROM events WHERE id > :last_max_id;

Pros: Reliable, no timestamp needed

Cons: Doesn't capture updates/deletes

πŸ”„ CDC (Change Data Capture)

When: Real-time requirements, audit needs

Tools: Debezium, AWS DMS, Fivetran

Pros: Captures all changes, low latency

Cons: More complex setup

CDC (Change Data Capture) Deep Dive

CDC adalah metode terbaik untuk merekam setiap perubahan data di source system.

πŸ”„ How CDC Works

Database (Write Ops) β†’ Transaction Log (WAL/BinLog) β†’ CDC Connector β†’ Message Queue β†’ Data Pipeline

πŸ› οΈ Popular CDC Tools

2. APIs (Application Programming Interfaces)

APIs adalah cara umum untuk mengakses data dari third-party services atau internal microservices.

Types of APIs

Type Format Best For
REST JSON, HTTP verbs General purpose, simple CRUD
GraphQL Query language Flexible data fetching, mobile apps
gRPC Protocol Buffers High performance, internal services
Webhook HTTP POST callbacks Real-time event notifications

API Pagination Patterns

# Python example: Handling pagination dengan cursor-based API import requests def fetch_all_data(api_url, api_key): all_data = [] cursor = None while True: params = {"limit": 100} if cursor: params["cursor"] = cursor response = requests.get( api_url, headers={"Authorization": f"Bearer {api_key}"}, params=params ) data = response.json() all_data.extend(data["results"]) # Check for next page cursor = data.get("next_cursor") if not cursor: break return all_data

⚠️ API Best Practices

3. Message Queues & Event Streaming

Message queues dan event streaming systems adalah source untuk real-time data.

Apache Kafka Concepts

πŸ“Š Kafka Architecture

Producers β†’ Topics (Partitioned) β†’ Consumers

Topics β†’ Partitions β†’ Segments (on disk)
Consumer Groups β†’ Parallel processing

Concept Description
Topic Category/feed name (e.g., "user-events", "orders")
Partition Ordered, immutable sequence of messages within a topic
Offset Unique ID of a message within a partition
Consumer Group Group of consumers that share work (one consumer per partition)
Retention How long to keep messages (time-based or size-based)

4. Files & Object Storage

Files masih menjadi source data yang umum, terutama untuk:

File Formats Comparison

Format Structure Best For Compression
CSV Text, row-based Simple data, human-readable ❌ Poor
JSON Text, nested objects Semi-structured data, APIs ⚠️ Moderate
Parquet Binary, columnar Analytics, large datasets βœ… Excellent
Avro Binary, row-based Streaming, schema evolution βœ… Good
ORC Binary, columnar Hadoop ecosystem βœ… Excellent

5. IoT & Streaming Data

IoT devices menghasilkan data dengan karakteristik khusus:

πŸ“‘ IoT Protocols

Source System Discovery Checklist

Sebelum membangun pipeline, selalu tanyakan:

βœ… Discovery Questions

Volume:

Velocity:

Variety:

Veracity:

Case Study: E-commerce Data Sources

πŸ›’ Tokopedia Source Systems Architecture

Transactional Data:

  • PostgreSQL (Product Catalog, Orders) β†’ CDC via Debezium
  • MySQL (User Profiles) β†’ Binary log replication

Behavioral Data:

  • Mobile App Events β†’ Kafka (via SDK)
  • Website Clickstream β†’ Kafka (via JavaScript tracker)

Third-party:

  • Payment Gateway API β†’ Scheduled API pulls
  • Logistics Partner β†’ Webhook callbacks

Decision Framework: Source System Strategy

Decision Point Pilih Opsi A Jika... Pilih Opsi B Jika...
CDC vs Batch Extract Butuh sinkronisasi near real-time dan source mendukung logs Latency longgar atau source legacy tanpa CDC support
Pull API vs Webhook/Event push Provider tidak menyediakan event endpoint Perlu update cepat dan efisiensi network
Single source of truth vs Multiple source fusion Ada sistem otoritatif yang jelas Nilai bisnis butuh penggabungan dari banyak sistem operasional

Failure Modes & Anti-Patterns

Anti-Patterns di Source Integration

Production Readiness Checklist

Checklist Source System

  1. Owner source system dan contact escalation jelas.
  2. Schema contract/versioning disepakati.
  3. Latency expectation per source terdokumentasi.
  4. Rate limit, retry, timeout policy diterapkan.
  5. Delete/update semantics diuji end-to-end.
  6. Fallback plan saat source unavailable tersedia.

✏️ Exercise: Source System Analysis

Kamu diminta untuk mendesain pipeline untuk sebuah fintech lending app. Identifikasi source systems berikut dan extraction method yang tepat:

  1. User registration data (PostgreSQL, 100 inserts/day)
  2. Transaction history (MySQL, 10K inserts/day)
  3. Credit score dari third-party API (daily update)
  4. Mobile app events (real-time fraud detection)
  5. Bank statement uploads (PDF files)

Hint: Pertimbangkan volume, latency requirement, dan format data.

🎯 Quick Quiz

1. Metode terbaik untuk capture semua perubahan data termasuk delete adalah?

A. Full Extract
B. Timestamp-based Incremental
C. CDC (Change Data Capture)
D. API Polling

2. Format file terbaik untuk analytics workload adalah?

A. CSV
B. JSON
C. Parquet
D. XML

3. Kafka partition key digunakan untuk?

A. Data encryption
B. Determining which partition a message goes to
C. Message compression
D. Authentication

Kesimpulan

Memahami source systems adalah fundamental untuk data engineering. Setiap source memiliki karakteristik berbeda: OLTP databases untuk transaksi, APIs untuk integrasi, Kafka untuk streaming, dan files untuk batch.

Pilih extraction method yang sesuai: Full extract untuk small tables, incremental untuk large tables, dan CDC untuk real-time requirements.

🎯 Key Takeaways

πŸ“š References & Resources

Primary Sources

Official Documentation

Articles & Websites