Advanced

Apache Spark for Big Data

Spark architecture, RDDs, DataFrames, Spark SQL, and optimization

⏱️ 60 min read 📅 Updated Jan 2025 👤 By DataLearn Team

Beginner Reading Mode

Think of Spark as a "large-scale data computation engine". Focus your reading on:

  1. When Spark is needed instead of a regular SQL warehouse
  2. The job, partition, and shuffle concepts with the biggest impact
  3. Basic tuning practices to prevent slow or failing jobs

Glossary of terms: DE-GLOSSARY.md

Light Prerequisites

Key Terms (3 Layers)

Term: Shuffle

Plain definition: Data is moved between workers so it can be regrouped.

Technical definition: The redistribution of partitions across executors, typically during join/groupBy/orderBy operations.

Practical example: A large join job runs slowly because of heavy shuffle, and the network becomes the bottleneck.

Term: Skew

Plain definition: Some workers receive far more data than others.

Technical definition: An imbalance in key/partition distribution that produces straggler tasks.

Practical example: A single "unknown" key dominates, so one task finishes long after the rest.

What is Apache Spark?

Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing.

⚡ Why Spark?

Spark Architecture

Core Components

Component | Role
Driver Program | Main program, creates SparkContext, schedules tasks
Cluster Manager | Allocates resources (Standalone, YARN, Mesos, Kubernetes)
Executors | Run tasks and store data on worker nodes
SparkContext | Entry point for Spark functionality

RDDs vs DataFrames vs Datasets

API | Type Safety | Optimization | Best For
RDD | Yes | Manual | Fine-grained control, legacy code
DataFrame | No (runtime checks) | Catalyst optimizer | Most use cases (Python/R)
Dataset | Yes (compile-time) | Catalyst optimizer | Type safety in Scala/Java

Spark SQL

# Spark SQL with DataFrames
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()

# Read data
df = spark.read.parquet("s3://bucket/data/")

# SQL queries
df.createOrReplaceTempView("sales")
result = spark.sql("""
    SELECT product, SUM(amount) AS total
    FROM sales
    WHERE date >= '2024-01-01'
    GROUP BY product
    ORDER BY total DESC
""")
result.show()

Spark Optimization

Catalyst Optimizer

Spark's Catalyst optimizer automatically improves query performance through techniques such as predicate pushdown, column pruning, constant folding, and join reordering.

Partitioning and Bucketing

# Partition by date for time-series data
df.write \
    .partitionBy("year", "month") \
    .parquet("s3://bucket/output/")

# Bucketing for efficient joins
df.write \
    .bucketBy(100, "user_id") \
    .saveAsTable("bucketed_table")

Caching and Persistence

# Cache frequently used DataFrames
df.cache()  # For DataFrames, equivalent to persist(MEMORY_AND_DISK)

# Different storage levels
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)

Structured Streaming

Unified API for batch and streaming:

# Read stream from Kafka
stream_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host:port") \
    .option("subscribe", "events") \
    .load()

# Process and write to console
query = stream_df.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

query.awaitTermination()

Spark on Cloud

Service | Provider | Features
EMR | AWS | Managed Hadoop/Spark clusters
Dataproc | GCP | Managed Spark/Hadoop, autoscaling, serverless option
HDInsight | Azure | Enterprise integration
Databricks | Multi-cloud | Managed Delta Lake, notebooks

Decision Framework: Spark Workload Design

Decision Point | Choose Option A If... | Choose Option B If...
DataFrame vs RDD | Most transformations are tabular and SQL-like | You need specialized low-level control (rare)
Batch vs Structured Streaming | Minute-to-hour latency is still acceptable | You need a near real-time pipeline with micro-batch/continuous processing
Broadcast join vs Shuffle join | The dimension table is small and frequently joined to a large fact table | Both tables are large and the data distribution is balanced
Cache vs Recompute | The DataFrame is reused across many actions | It is used once, or memory pressure is high

Failure Modes & Anti-Patterns

Anti-Patterns in Spark

Production Readiness Checklist

Spark Checklist before Production

  1. Execution plans reviewed (explain) for critical queries.
  2. Join strategy and skew handling validated.
  3. Output partitioning matches downstream query patterns.
  4. Checkpointing and exactly-once semantics verified for streaming.
  5. Resource configs (executors/cores/memory) tuned against real workloads.
  6. Monitoring active for job duration, stage failures, and spill metrics.
  7. Compaction/rewrite strategy in place for output files.
  8. Runbooks prepared for retry, backfill, and recovery.

✏️ Exercise: Optimize Spark Job

Given a slow job, implement the following optimizations:

  1. Cache DataFrames that are used multiple times
  2. Partition by date for time-range filters
  3. Use a broadcast join for the small lookup table
  4. Coalesce before writing to reduce small files

🎯 Quick Quiz

1. What is Spark's main advantage over Hadoop MapReduce?

A. Spark is older and more stable
B. Spark uses in-memory processing
C. Spark is batch-only
D. Spark is not fault tolerant

2. Which API is recommended for Python?

A. RDD
B. DataFrame
C. Dataset
D. Java API

3. What does the Catalyst Optimizer do?

A. Hanya untuk error handling
B. Automatic query optimization
C. Resource allocation
D. Security encryption

Conclusion

Apache Spark is an essential tool for big data processing. With the DataFrame API, Spark SQL, and Structured Streaming, engineers can build scalable, performant pipelines.

🎯 Key Takeaways

📚 References & Resources

Primary Sources

Official Documentation

Articles & Guides