Advanced

Real-time Streaming with Kafka

Kafka architecture, producers/consumers, stream processing, and patterns

⏱️ 55 min read 📅 Updated Jan 2025 👤 By DataLearn Team

Mode Baca Pemula

Anggap streaming sebagai "alur data berjalan terus-menerus". Fokus baca:

  1. Konsep topic, partition, producer, dan consumer
  2. Perbedaan event-time vs processing-time
  3. Strategi menangani pesan gagal (DLQ) dan replay

Kamus istilah: DE-GLOSSARY.md

Prasyarat Ringan

Istilah Penting (3 Lapis)

Istilah: Consumer Group

Definisi awam: Sekelompok pembaca yang berbagi tugas baca topic.

Definisi teknis: Mekanisme skalabilitas Kafka di mana partisi didistribusikan ke beberapa consumer dalam group yang sama.

Contoh praktis: Empat consumer membaca topic 8 partisi agar throughput pemrosesan naik.

Istilah: Dead Letter Queue (DLQ)

Definisi awam: Tempat menaruh pesan yang gagal diproses agar tidak hilang.

Definisi teknis: Kanal terpisah untuk event poison/invalid setelah retry melewati batas.

Contoh praktis: Event JSON rusak dikirim ke DLQ untuk investigasi tanpa menghentikan stream utama.

What is Apache Kafka?

Apache Kafka adalah distributed event streaming platform yang digunakan untuk membangun real-time data pipelines dan streaming applications. Dibuat oleh LinkedIn, sekarang Apache Top-Level Project.

🚀 Kafka Use Cases

Kafka Architecture

Core Concepts

Concept Description
Topic Category/feed name to which records are published
Partition Ordered, immutable sequence of records within a topic
Producer Client that publishes records to topics
Consumer Client that subscribes to topics and processes records
Broker Kafka server that stores and serves data
Consumer Group Set of consumers that jointly consume a topic

Producers and Consumers

Producer Configuration

from kafka import KafkaProducer producer = KafkaProducer( bootstrap_servers=['localhost:9092'], key_serializer=lambda x: x.encode('utf-8'), value_serializer=lambda x: x.encode('utf-8'), acks='all', # Wait for all replicas retries=3 ) # Send message future = producer.send('orders', key='user-123', value='{"amount": 100}') producer.flush()

Consumer with Group

from kafka import KafkaConsumer consumer = KafkaConsumer( 'orders', bootstrap_servers=['localhost:9092'], group_id='payment-processor', auto_offset_reset='earliest', enable_auto_commit=True ) for message in consumer: process_payment(message.value)

Kafka Streams vs ksqlDB

Feature Kafka Streams ksqlDB
API Java/Scala library SQL-like syntax
Use Case Complex stateful processing Quick stream processing
Learning Curve Steep Shallow

Example: ksqlDB

-- Create stream from topic CREATE STREAM orders ( order_id STRING, amount DOUBLE, customer_id STRING ) WITH ( kafka_topic='orders', value_format='json' ); -- Aggregate in real-time CREATE TABLE hourly_revenue AS SELECT windowStart, windowEnd, SUM(amount) as total FROM orders WINDOW TUMBLING (SIZE 1 HOUR) GROUP BY windowStart, windowEnd;

Kafka Connect

Framework untuk integrating Kafka dengan external systems:

✅ When to Use Connect vs Custom Code

Kafka Operations

Decision Framework: Streaming Architecture

Decision Point Pilih Opsi A Jika... Pilih Opsi B Jika...
Kafka Streams vs ksqlDB Logic kompleks, butuh kontrol code-level tinggi Use case SQL-like cepat untuk agregasi streaming
At-least-once vs Exactly-once Toleran duplicate kecil dan prioritaskan throughput Tidak toleran duplicate (billing/financial events)
High partition count vs Moderate Traffic sangat tinggi dan consumer parallel banyak Traffic moderat, ingin operasi cluster lebih sederhana
Connectors vs Custom Consumer Integrasi standar tanpa logic kompleks Perlu transformasi domain dan kontrol error handling detail

Failure Modes & Anti-Patterns

Anti-Patterns pada Real-time Streaming

Production Readiness Checklist

Checklist Streaming sebelum Production

  1. Topic naming, retention, dan replication factor terdokumentasi.
  2. Partition key strategy sesuai kebutuhan ordering/scaling.
  3. Schema registry + compatibility rules diterapkan.
  4. DLQ dan replay strategy tersedia.
  5. Consumer lag, throughput, error rate dimonitor real-time.
  6. SLA latency end-to-end terdefinisi.
  7. Backpressure dan failover behavior sudah diuji.
  8. Runbook incident streaming siap (broker down, lag spike, poison events).

✏️ Exercise: Design Kafka Architecture

Desain streaming platform untuk ride-hailing app:

  1. Topic design: rides, locations, payments
  2. Partition strategy: ride_id for rides, geo-hash for locations
  3. Consumer groups: pricing-engine, driver-matching, analytics
  4. Use Kafka Streams untuk real-time pricing calculation

🎯 Quick Quiz

1. Fungsi partition di Kafka?

A. Hanya untuk backup
B. Enable parallelism dan scalability
C. Untuk security encryption
D. Tidak ada fungsi khusus

2. Apa beda Kafka Streams dan ksqlDB?

A. Streams lebih cepat
B. ksqlDB menggunakan SQL, Streams menggunakan code
C. ksqlDB hanya untuk batch
D. Streams tidak scalable

3. Replication factor 3 artinya?

A. 3 topics yang sama
B. Data direplikasi ke 3 brokers
C. 3 consumers wajib ada
D. 3 partitions minimum

Kesimpulan

Apache Kafka adalah backbone dari modern real-time data architectures. Dengan understanding yang baik tentang producers, consumers, partitions, dan stream processing, kamu dapat membangun systems yang scalable dan reliable.

🎯 Key Takeaways

📚 References & Resources

Primary Sources

Official Documentation

Articles & Guides