Advanced
Apache Spark for Big Data
Spark architecture, RDDs, DataFrames, Spark SQL, and optimization
⏱️ 60 min read
📅 Updated Jan 2025
👤 By DataLearn Team
Beginner Reading Mode
Think of Spark as a "large-scale data computation engine". Focus your reading on:
- When Spark is needed versus a regular SQL warehouse
- The job, partition, and shuffle concepts that matter most
- Basic tuning practices that prevent slow or failed jobs
Glossary: DE-GLOSSARY.md
Light Prerequisites
- You understand basic data transformations (filter, join, aggregate)
- You know that large datasets cannot be processed efficiently on a single machine
- You have seen performance bottlenecks in complex queries
Key Terms (3 Layers)
Term: Shuffle
Plain definition: Data is moved between workers so it can be regrouped.
Technical definition: Redistribution of partitions across executors, typically triggered by join/groupBy/orderBy operations.
Practical example: A large join runs slowly because heavy shuffle traffic makes the network the bottleneck.
Term: Skew
Plain definition: Some workers receive far more data than others.
Technical definition: An imbalanced key/partition distribution that produces straggler tasks.
Practical example: A single "unknown" key dominates, so one task finishes long after the rest (see the salting sketch below).
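A common mitigation for skew is key salting: split a hot key into several artificial sub-keys so its rows spread across multiple partitions. A minimal sketch, assuming a large `events` DataFrame joined to a smaller `dim` table on a skewed `key` column (all names here are illustrative):

```python
from pyspark.sql import functions as F

NUM_SALTS = 8  # illustrative; size against the observed skew

# Fact side: append a random salt so the hot key is split
# across NUM_SALTS sub-keys instead of landing in one partition
salted_events = events.withColumn(
    "salted_key",
    F.concat_ws("_", F.col("key"),
                (F.rand() * NUM_SALTS).cast("int").cast("string"))
)

# Dimension side: replicate each row once per salt value so
# every salted fact key still finds its match
salts = spark.range(NUM_SALTS).withColumnRenamed("id", "salt")
salted_dim = dim.crossJoin(salts).withColumn(
    "salted_key",
    F.concat_ws("_", F.col("key"), F.col("salt").cast("string"))
)

# The skewed work is now spread over NUM_SALTS tasks
joined = salted_events.join(salted_dim, "salted_key")
```

On Spark 3.x, adaptive query execution can also split skewed partitions automatically (spark.sql.adaptive.skewJoin.enabled), which often removes the need for manual salting.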
What is Apache Spark?
Apache Spark adalah unified analytics engine untuk big data processing,
dengan built-in modules untuk streaming, SQL, machine learning, dan graph processing.
⚡ Why Spark?
- In-memory processing: often 10-100x faster than Hadoop MapReduce for iterative, in-memory workloads
- Ease of use: APIs in Python, Scala, Java, R
- Unified engine: Batch + streaming + ML + SQL
- Fault tolerance: Resilient through lineage
Spark Architecture
Core Components
| Component | Role |
|---|---|
| Driver Program | Main program; creates the SparkContext and schedules tasks |
| Cluster Manager | Allocates resources (Standalone, YARN, Mesos, Kubernetes) |
| Executors | Run tasks and store data on worker nodes |
| SparkContext | Entry point for Spark functionality |
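To make the roles concrete, the short sketch below is itself a driver program: the builder creates the session (which wraps the SparkContext), and the master URL decides which cluster manager supplies executors. `local[*]` here is just for experimentation; in production this would be a YARN, Kubernetes, or standalone master URL.

```python
from pyspark.sql import SparkSession

# This script is the driver; the session wraps the SparkContext.
# master("local[*]") runs executors as in-process threads -- swap in
# yarn or k8s://... when submitting to a real cluster.
spark = (SparkSession.builder
         .appName("architecture-demo")
         .master("local[*]")
         .getOrCreate())

print(spark.sparkContext.master)   # which cluster manager we attached to
print(spark.sparkContext.appName)
spark.stop()
```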
RDDs vs DataFrames vs Datasets
| API | Type Safety | Optimization | Best For |
|---|---|---|---|
| RDD | Yes | Manual | Fine-grained control, legacy code |
| DataFrame | No (runtime) | Catalyst optimizer | Most use cases (Python/R) |
| Dataset | Yes (compile-time) | Catalyst optimizer | Scala/Java type safety |
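The difference is easiest to see side by side: the same word count is declarative with DataFrames (and runs through Catalyst) but manual with RDDs. A minimal sketch, reusing an existing `spark` session:

```python
from pyspark.sql import functions as F

data = [("spark",), ("hadoop",), ("spark",)]

# DataFrame: declarative; Catalyst plans and optimizes the aggregation
df = spark.createDataFrame(data, ["word"])
df.groupBy("word").agg(F.count("*").alias("n")).show()

# RDD: manual key-value plumbing, no automatic optimization
rdd = spark.sparkContext.parallelize([row[0] for row in data])
print(rdd.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b).collect())
```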
Spark SQL
```python
from pyspark.sql import SparkSession

# Entry point for the DataFrame and SQL APIs
spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()

# Read Parquet data and register it as a temporary view for SQL
df = spark.read.parquet("s3://bucket/data/")
df.createOrReplaceTempView("sales")

result = spark.sql("""
    SELECT product, SUM(amount) AS total
    FROM sales
    WHERE date >= '2024-01-01'
    GROUP BY product
    ORDER BY total DESC
""")
result.show()
```
Spark Optimization
Catalyst Optimizer
Spark's query optimizer automatically improves performance:
- Predicate pushdown: Filter early at data source
- Column pruning: Read only needed columns
- Constant folding: Pre-calculate constants
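You can watch these rewrites with explain(). In the sketch below (column names illustrative), the Parquet scan in the physical plan should show the date predicate under PushedFilters and only the selected columns in ReadSchema:

```python
df = spark.read.parquet("s3://bucket/data/")

# The filter is pushed into the Parquet reader and only the two
# selected columns are read from disk
query = df.filter(df.date >= "2024-01-01").select("product", "amount")
query.explain(True)  # parsed, analyzed, optimized, and physical plans
```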
Partitioning and Bucketing
```python
# Partition output by year and month so time-range filters only
# scan the matching directories (partition pruning)
df.write \
    .partitionBy("year", "month") \
    .parquet("s3://bucket/output/")

# Bucket by user_id into 100 buckets; bucketed tables are saved as
# managed tables and let joins on user_id skip the shuffle
df.write \
    .bucketBy(100, "user_id") \
    .saveAsTable("bucketed_table")
```
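The payoff comes at read time: filtering on the partition columns prunes whole directories, so Spark never opens irrelevant files. A short sketch against the output written above:

```python
# Only the year=2024/month=1 directory is scanned; check the
# physical plan's PartitionFilters entry to confirm the pruning
jan = (spark.read.parquet("s3://bucket/output/")
       .filter("year = 2024 AND month = 1"))
jan.explain()
```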
Caching and Persistence
```python
df.cache()  # default storage level for DataFrames: MEMORY_AND_DISK

from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)  # or choose a level explicitly
df.unpersist()  # release the cache once the DataFrame is no longer reused
```
Structured Streaming
Unified API for batch and streaming:
```python
# Read a live stream of records from a Kafka topic
stream_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host:port") \
    .option("subscribe", "events") \
    .load()

# Print each micro-batch to the console as it arrives
query = stream_df.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

query.awaitTermination()  # block until the stream stops or fails
```
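For anything beyond console debugging, give the query a checkpoint location: Spark persists Kafka offsets and operator state there, so a restarted query resumes where it left off instead of reprocessing or losing state. A sketch with illustrative paths, continuing from the stream_df above:

```python
# checkpointLocation must be durable storage (S3/HDFS), not local disk;
# both paths here are illustrative
query = stream_df.writeStream \
    .outputMode("append") \
    .format("parquet") \
    .option("path", "s3://bucket/stream-output/") \
    .option("checkpointLocation", "s3://bucket/checkpoints/events/") \
    .start()
```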
Spark on Cloud
| Service | Provider | Features |
|---|---|---|
| EMR | AWS | Managed Hadoop/Spark clusters |
| Dataproc | GCP | Serverless Spark, autoscaling |
| HDInsight | Azure | Enterprise integration |
| Databricks | Multi-cloud | Managed Delta Lake, notebooks |
Decision Framework: Spark Workload Design
| Decision Point | Choose Option A If... | Choose Option B If... |
|---|---|---|
| DataFrame vs RDD | Most transformations are tabular and SQL-like | You need special low-level control (rare cases) |
| Batch vs Structured Streaming | Minutes-to-hours latency is still acceptable | You need a near-real-time pipeline with micro-batch/continuous processing |
| Broadcast join vs Shuffle join | The dimension table is small and frequently joined to a large fact table (see the sketch below) | Both tables are large and the data distribution is balanced |
| Cache vs Recompute | The DataFrame is reused across many actions | It is used once, or memory pressure is high |
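For the broadcast case, you can hint the small side explicitly: Spark ships a full copy to every executor and skips shuffling the large table entirely. A minimal sketch, assuming a large `facts` DataFrame and a small `dims` lookup table (names illustrative):

```python
from pyspark.sql.functions import broadcast

# dims is copied to each executor; facts is never shuffled
joined = facts.join(broadcast(dims), on="product_id", how="left")
joined.explain()  # the physical plan should show BroadcastHashJoin
```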
Failure Modes & Anti-Patterns
Spark Anti-Patterns
- Blind repartition: raising the partition count without analysis just adds shuffle overhead.
- Ignored skew: one partition becomes the bottleneck and the job times out.
- Excessive collect(): pulling large results to the driver causes driver OOM (see the sketch after this list).
- No checkpoint for streaming: state is lost on restart.
- Unbounded small-file output: downstream query performance degrades.
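The collect() anti-pattern usually has a cheap fix: keep the heavy work on the executors and pull back only a bounded sample or an aggregate. A short sketch:

```python
# Anti-pattern: rows = df.collect()  -- pulls the full dataset to the driver

preview = df.limit(20).collect()        # bounded sample for inspection
total = df.count()                      # aggregate instead of raw rows
df.write.parquet("s3://bucket/out/")    # or persist results distributed
```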
Production Readiness Checklist
Spark Checklist Before Production
- Execution plans reviewed (explain) for critical queries.
- Join strategy and skew handling validated.
- Output partitioning matches downstream query patterns.
- Checkpointing + exactly-once semantics verified for streaming jobs.
- Resource configs (executors/cores/memory) tuned against real workloads (see the sketch after this list).
- Monitoring active for job duration, stage failures, and spill metrics.
- Compaction/rewrite strategy in place for output files.
- Runbooks prepared for retry, backfill, and recovery.
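For the resource-config item, tuning is usually expressed when the session (or spark-submit) is configured. A hedged starting point; every value below is illustrative and should be derived from your data volume, cluster size, and observed spill/GC metrics:

```python
from pyspark.sql import SparkSession

# Illustrative numbers only -- not a recommendation for any workload
spark = (SparkSession.builder
         .appName("tuned-job")
         .config("spark.executor.instances", "10")
         .config("spark.executor.cores", "4")
         .config("spark.executor.memory", "8g")
         .config("spark.sql.shuffle.partitions", "400")
         .config("spark.sql.adaptive.enabled", "true")  # AQE, Spark 3.x+
         .getOrCreate())
```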
✏️ Exercise: Optimize a Spark Job
You are given a slow job. Implement these optimizations (a reference sketch follows the list):
- Cache DataFrames that are used multiple times
- Partition by date for time-range filters
- Use a broadcast join for the small lookup table
- Coalesce before writing to reduce small files
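One possible shape for the solution; table names, paths, and the coalesce factor are illustrative, so treat this as a reference sketch rather than the single correct answer:

```python
from pyspark.sql.functions import broadcast

events = spark.read.parquet("s3://bucket/events/")   # large fact table
lookup = spark.read.parquet("s3://bucket/lookup/")   # small lookup table

# Broadcast join: the small table is copied to executors, no shuffle
enriched = events.join(broadcast(lookup), "product_id")

# Cache: enriched feeds several actions below
enriched.cache()

enriched.groupBy("date").count().show()
enriched.groupBy("product_id").count() \
    .orderBy("count", ascending=False).show(5)

# Partition by date for time-range filters; coalesce first so each
# partition directory gets a handful of files instead of thousands
(enriched.coalesce(32)
    .write
    .partitionBy("date")
    .parquet("s3://bucket/enriched/"))

enriched.unpersist()
```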
🎯 Quick Quiz
1. What is Spark's main advantage over Hadoop MapReduce?
A. Spark is older and more stable
B. Spark uses in-memory processing
C. Spark is batch-only
D. Spark is not fault tolerant
2. Which API is recommended for Python?
A. RDD
B. DataFrame
C. Dataset
D. Java API
3. What does the Catalyst Optimizer do?
A. Error handling only
B. Automatic query optimization
C. Resource allocation
D. Security encryption
Conclusion
Apache Spark is an essential tool for big data processing. With the DataFrame API, Spark SQL, and Structured Streaming, engineers can build scalable, performant pipelines.
🎯 Key Takeaways
- Spark: in-memory, 10-100x faster than MapReduce
- Use DataFrames for Python, Datasets for Scala
- Leverage Catalyst optimizer and caching
- Structured Streaming unifies batch and streaming
- Partition wisely for optimal performance
📚 References & Resources
Primary Sources
- Spark: The Definitive Guide - Bill Chambers & Matei Zaharia (O'Reilly, 2018). Chapters 1-12: Spark basics, Structured APIs, Spark SQL, optimization.
- Learning Spark, 2nd Edition: Lightning-Fast Data Analytics - Jules Damji et al. (O'Reilly, 2020).
- High Performance Spark - Holden Karau & Rachel Warren (O'Reilly, 2017).