Advanced
Real-time Streaming with Kafka
Kafka architecture, producers/consumers, stream processing, and patterns
⏱️ 55 min read
📅 Updated Jan 2025
👤 By DataLearn Team
Beginner Reading Mode
Think of streaming as "data that keeps flowing continuously". Focus your reading on:
- The concepts of topic, partition, producer, and consumer
- The difference between event-time and processing-time
- Strategies for handling failed messages (DLQ) and replay
Glossary: DE-GLOSSARY.md
Light Prerequisites
- Understand the difference between batch (scheduled) and streaming (continuous)
- Know that events can arrive late or out of order
- Have heard of a queue or message broker
Key Terms (3 Layers)
Term: Consumer Group
Plain definition: A group of readers that share the work of reading a topic.
Technical definition: Kafka's scalability mechanism in which a topic's partitions are distributed across the consumers in the same group.
Practical example: Four consumers read an 8-partition topic to increase processing throughput.
Term: Dead Letter Queue (DLQ)
Plain definition: A place to put messages that failed processing so they are not lost.
Technical definition: A separate channel for poison/invalid events once retries exceed a limit.
Practical example: A corrupt JSON event is routed to the DLQ for investigation without stopping the main stream.
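The consumer group example above (four consumers sharing an 8-partition topic) can be sketched in plain Python. This is a hypothetical helper illustrating a round-robin style of partition assignment, not the Kafka client's actual assignor code.

```python
# Sketch of how a consumer group's partitions might be distributed: each
# partition is owned by exactly one consumer in the group, so adding
# consumers (up to the partition count) raises read parallelism.

def assign_partitions(num_partitions, consumers):
    """Spread partition ids round-robin across the consumers in one group."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        owner = consumers[p % len(consumers)]
        assignment[owner].append(p)
    return assignment

plan = assign_partitions(8, ["c1", "c2", "c3", "c4"])
print(plan)  # each consumer owns two partitions, e.g. c1 -> [0, 4]
```

With 8 partitions and 4 consumers, every consumer ends up owning exactly two partitions; a fifth consumer beyond the partition count would sit idle.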
What is Apache Kafka?
Apache Kafka is a distributed event streaming platform used to build real-time data pipelines and streaming applications. Originally created at LinkedIn, it is now an Apache Top-Level Project.
🚀 Kafka Use Cases
- Event Sourcing: Log all state changes
- Message Queue: Decouple microservices
- Stream Processing: Real-time analytics
- Log Aggregation: Collect logs from distributed systems
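The event-sourcing use case above means the current state is never stored directly; it is derived by replaying the full event log. A minimal sketch, with illustrative event names (not a Kafka API):

```python
# Event sourcing in miniature: fold an ordered event log into the current
# state. In Kafka, the topic itself would be this ordered log.

def replay(events):
    """Rebuild an account balance by replaying every recorded change."""
    balance = 0
    for event in events:
        if event["type"] == "deposited":
            balance += event["amount"]
        elif event["type"] == "withdrawn":
            balance -= event["amount"]
    return balance

log = [
    {"type": "deposited", "amount": 100},
    {"type": "withdrawn", "amount": 30},
    {"type": "deposited", "amount": 50},
]
print(replay(log))  # 120
```

Because the log is the source of truth, replaying it from offset zero always reproduces the same state, which is also what makes reprocessing and auditing possible.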
Kafka Architecture
Core Concepts
| Concept | Description |
| --- | --- |
| Topic | Category/feed name to which records are published |
| Partition | Ordered, immutable sequence of records within a topic |
| Producer | Client that publishes records to topics |
| Consumer | Client that subscribes to topics and processes records |
| Broker | Kafka server that stores and serves data |
| Consumer Group | Set of consumers that jointly consume a topic |
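Why keys matter for partitions: records with the same key always land in the same partition, which is what gives per-key ordering. Kafka's default partitioner hashes the key with murmur2; the sketch below uses crc32 as an illustrative stand-in, not the real algorithm.

```python
import zlib

# Keyed partitioning sketch: hash the record key, take it modulo the
# partition count. Same key -> same partition -> per-key ordering.

def partition_for(key: str, num_partitions: int) -> int:
    return zlib.crc32(key.encode("utf-8")) % num_partitions

p1 = partition_for("user-123", 8)
p2 = partition_for("user-123", 8)
print(p1 == p2)  # True: every event for user-123 keeps its order
```

The flip side is that changing the partition count reshuffles key-to-partition mappings, which is why partition counts are usually planned up front.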
Producers and Consumers
Producer Configuration
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    key_serializer=lambda x: x.encode('utf-8'),
    value_serializer=lambda x: x.encode('utf-8'),
    acks='all',   # wait for all in-sync replicas to acknowledge
    retries=3     # retry transient send failures
)

# Keyed send: all events for the same key go to the same partition
future = producer.send('orders',
                       key='user-123',
                       value='{"amount": 100}')
producer.flush()  # block until buffered records are actually sent
Consumer with Group
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'orders',
    bootstrap_servers=['localhost:9092'],
    group_id='payment-processor',              # consumers sharing this id split the partitions
    value_deserializer=lambda x: x.decode('utf-8'),
    auto_offset_reset='earliest',              # start from the beginning if no committed offset
    enable_auto_commit=True                    # commit offsets periodically in the background
)

for message in consumer:
    process_payment(message.value)
Kafka Streams vs ksqlDB
| Feature | Kafka Streams | ksqlDB |
| --- | --- | --- |
| API | Java/Scala library | SQL-like syntax |
| Use Case | Complex stateful processing | Quick stream processing |
| Learning Curve | Steep | Shallow |
Example: ksqlDB
CREATE STREAM orders (
  order_id STRING,
  amount DOUBLE,
  customer_id STRING
) WITH (
  kafka_topic='orders',
  value_format='json'
);

CREATE TABLE hourly_revenue AS
SELECT customer_id,
       WINDOWSTART AS window_start,
       WINDOWEND AS window_end,
       SUM(amount) AS total
FROM orders
WINDOW TUMBLING (SIZE 1 HOUR)
GROUP BY customer_id;
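To make the tumbling-window idea concrete, here is a minimal Python simulation of the same computation: bucket each order into a fixed 1-hour window by its timestamp and sum the amounts. Timestamps are epoch seconds and field names are illustrative.

```python
from collections import defaultdict

WINDOW_SIZE = 3600  # 1 hour, in seconds

def hourly_revenue(orders):
    """Sum order amounts per non-overlapping (tumbling) 1-hour window."""
    buckets = defaultdict(float)
    for order in orders:
        # Truncate the timestamp down to the start of its window
        window_start = (order["ts"] // WINDOW_SIZE) * WINDOW_SIZE
        buckets[window_start] += order["amount"]
    return dict(buckets)

orders = [
    {"ts": 0,    "amount": 100.0},
    {"ts": 1800, "amount": 50.0},   # same first hour
    {"ts": 3700, "amount": 25.0},   # falls into the second hour
]
print(hourly_revenue(orders))  # {0: 150.0, 3600: 25.0}
```

Tumbling windows never overlap: each event belongs to exactly one window, unlike hopping or sliding windows.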
Kafka Connect
A framework for integrating Kafka with external systems:
- Source Connectors: Database → Kafka (Debezium, JDBC)
- Sink Connectors: Kafka → Database (S3, Elasticsearch, JDBC)
✅ When to Use Connect vs Custom Code
- Use Connect for simple, standard integrations
- Use custom consumers/producers for complex transformations
Kafka Operations
- Replication Factor: Minimum 3 for production
- Partition Count: More partitions = more parallelism
- Retention: How long to keep data (time or size based)
- Monitoring: Consumer lag, broker metrics, throughput
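Of the monitoring signals listed above, consumer lag is the most important: it is the broker's log-end offset minus the group's committed offset, per partition. A minimal sketch with illustrative offset numbers:

```python
# Consumer lag sketch: how far behind the group is on each partition.
# A steadily growing lag means consumers can't keep up with producers.

def consumer_lag(log_end_offsets, committed_offsets):
    """Return lag per partition id (missing commits count as offset 0)."""
    return {
        p: log_end_offsets[p] - committed_offsets.get(p, 0)
        for p in log_end_offsets
    }

end = {0: 1500, 1: 1490, 2: 1480}          # latest offset on each partition
committed = {0: 1500, 1: 1200, 2: 1479}    # what the group has processed
print(consumer_lag(end, committed))  # {0: 0, 1: 290, 2: 1}
```

In practice these numbers come from tools like `kafka-consumer-groups.sh` or broker metrics; the point is to alert on lag growth, not just its absolute value.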
Decision Framework: Streaming Architecture
| Decision Point | Choose Option A if... | Choose Option B if... |
| --- | --- | --- |
| Kafka Streams vs ksqlDB | Logic is complex and you need deep code-level control | You want a quick SQL-like approach for streaming aggregations |
| At-least-once vs Exactly-once | You can tolerate occasional duplicates and prioritize throughput | You cannot tolerate duplicates (billing/financial events) |
| High partition count vs Moderate | Traffic is very high with many parallel consumers | Traffic is moderate and you want simpler cluster operations |
| Connectors vs Custom Consumer | Standard integration with no complex logic | You need domain-specific transformations and fine-grained error handling |
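For the at-least-once versus exactly-once row: with at-least-once delivery, duplicates can arrive after retries or rebalances, so billing-style consumers often deduplicate by a unique event id to get effectively-once processing. A minimal sketch with hypothetical event shapes:

```python
# Idempotent-consumer sketch: skip events whose id was already processed,
# turning at-least-once delivery into effectively-once processing.

def process_events(events, seen=None):
    """Apply each event at most once, tracking processed ids in `seen`."""
    seen = set() if seen is None else seen
    applied = []
    for event in events:
        if event["id"] in seen:
            continue  # duplicate redelivery: already handled, skip it
        seen.add(event["id"])
        applied.append(event["id"])
    return applied

events = [{"id": "e1"}, {"id": "e2"}, {"id": "e1"}]  # e1 redelivered
print(process_events(events))  # ['e1', 'e2']
```

In a real system the `seen` set would live in a durable store (or be replaced by a unique-key constraint in the sink database) so deduplication survives consumer restarts.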
Failure Modes & Anti-Patterns
Anti-Patterns in Real-time Streaming
- No key strategy: poor partitioning, and use cases that need ordering break.
- Consumer lag ignored: "real-time" data arrives late without anyone noticing.
- Unlimited retention by default: broker storage costs balloon.
- No DLQ: poison events halt the pipeline.
- Unmanaged schema evolution: a producer update breaks consumers.
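The "No DLQ" anti-pattern above has a standard fix: retry a handler a bounded number of times, then route the poison message to a dead-letter sink instead of blocking the stream. A minimal sketch where `handler` and `dlq` are injected stand-ins, not Kafka APIs:

```python
# DLQ pattern sketch: bounded retries, then park the message aside so one
# bad event cannot stall the whole partition.

def consume_with_dlq(message, handler, dlq, max_retries=3):
    """Try `handler` up to max_retries times; on exhaustion, dead-letter."""
    for _attempt in range(max_retries):
        try:
            handler(message)
            return "processed"
        except Exception:
            continue  # transient failure: try again
    dlq.append(message)  # retries exhausted: record it and move on
    return "dead-lettered"

dlq = []

def always_fails(msg):
    raise ValueError("bad payload")

status = consume_with_dlq({"raw": "{broken json"}, always_fails, dlq)
print(status, len(dlq))  # dead-lettered 1
```

In production the `dlq` list would be a separate Kafka topic (for example `orders.dlq`, name illustrative), with its own monitoring and a replay path once the defect is fixed.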
Production Readiness Checklist
Streaming checklist before going to production:
- Topic naming, retention, and replication factor documented.
- Partition key strategy matches ordering/scaling needs.
- Schema registry + compatibility rules in place.
- DLQ and replay strategy available.
- Consumer lag, throughput, and error rate monitored in real time.
- End-to-end latency SLA defined.
- Backpressure and failover behavior tested.
- Streaming incident runbook ready (broker down, lag spike, poison events).
✏️ Exercise: Design Kafka Architecture
Design a streaming platform for a ride-hailing app:
- Topic design: rides, locations, payments
- Partition strategy: ride_id for rides, geo-hash for locations
- Consumer groups: pricing-engine, driver-matching, analytics
- Use Kafka Streams for real-time pricing calculation
🎯 Quick Quiz
1. What is the function of partitions in Kafka?
A. Backup only
B. Enabling parallelism and scalability
C. Security encryption
D. No specific function
2. What is the difference between Kafka Streams and ksqlDB?
A. Streams is faster
B. ksqlDB uses SQL, Streams uses code
C. ksqlDB is batch-only
D. Streams does not scale
3. What does a replication factor of 3 mean?
A. 3 identical topics
B. Data is replicated to 3 brokers
C. 3 consumers are required
D. A minimum of 3 partitions
Conclusion
Apache Kafka is the backbone of modern real-time data architectures. With a solid understanding of producers, consumers, partitions, and stream processing, you can build scalable and reliable systems.
🎯 Key Takeaways
- Kafka: distributed, fault-tolerant, high-throughput
- Topics partitioned for parallelism
- Consumer groups enable scalable consumption
- Choose Streams (code) or ksqlDB (SQL) based on complexity
- Use Connect for standard integrations
📚 References & Resources
Primary Sources
- Kafka: The Definitive Guide, 2nd Edition (Real-Time Data and Stream Processing) - Gwen Shapira et al. (O'Reilly, 2022)
- Designing Event-Driven Systems (Using Kafka and Microservices) - Ben Stopford (O'Reilly, 2018)
- Kafka Streams in Action - William Bejeck (Manning, 2018)