Intermediate

Cloud Data Platforms

AWS, Google Cloud, Azure for data engineering

⏱️ 50 min read 📅 Updated Jan 2025 👤 By DataLearn Team

Beginner Reading Mode

Think of a cloud platform as the "foundation of your data infrastructure". Focus your reading on:

  1. How the core services of AWS, GCP, and Azure differ
  2. The trade-offs among flexibility, cost, and complexity
  3. How to avoid lock-in from the start

Glossary: DE-GLOSSARY.md

Light Prerequisites

Key Terms (3 Layers)

Term: Vendor Lock-in

Plain definition: It is hard to switch platforms because you depend too heavily on one vendor.

Technical definition: Dependence on proprietary services/APIs that raises the cost and risk of migration.

Practical example: A pipeline relies on features exclusive to one warehouse, so migrating requires a major rewrite.

Term: Auto-suspend

Plain definition: Compute stops automatically when it is not in use, to save money.

Technical definition: An idle-shutdown mechanism on warehouse/cluster compute, used for FinOps optimization.

Practical example: A BI cluster shuts down automatically after 5 minutes idle to reduce the monthly bill.
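
The idle-shutdown rule behind auto-suspend can be sketched as a single comparison. This is a minimal illustration, not any vendor's API: `should_suspend` and `IDLE_TIMEOUT` are hypothetical names, and the 5-minute threshold simply mirrors the practical example above.

```python
from datetime import datetime, timedelta

# Hypothetical threshold, matching the 5-minute example in the text.
IDLE_TIMEOUT = timedelta(minutes=5)


def should_suspend(last_query_at: datetime, now: datetime,
                   idle_timeout: timedelta = IDLE_TIMEOUT) -> bool:
    """Return True when compute has been idle for at least idle_timeout."""
    return now - last_query_at >= idle_timeout


now = datetime(2025, 1, 15, 12, 0, 0)
print(should_suspend(datetime(2025, 1, 15, 11, 58, 0), now))  # idle 2 min -> False
print(should_suspend(datetime(2025, 1, 15, 11, 50, 0), now))  # idle 10 min -> True
```

A shorter timeout saves more money but makes the first query after a pause wait for compute to resume, so the threshold is a cost versus latency trade-off.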

Why Cloud for Data Engineering?

Big Three Cloud Providers

🟠 AWS

Market leader, broadest service offering

🔵 Google Cloud

Best-in-class analytics and ML

🔷 Azure

Enterprise integration, hybrid cloud

AWS Data Services

| Service | Purpose | Comparable To |
| --- | --- | --- |
| S3 | Object storage (Data Lake) | GCS, Azure Blob |
| Redshift | Data Warehouse | BigQuery, Synapse |
| EMR | Managed Spark/Hadoop | Dataproc, HDInsight |
| Glue | Serverless ETL | Dataflow, Data Factory |
| Athena | Serverless SQL queries | BigQuery, Synapse SQL |
| Kinesis | Streaming data | Pub/Sub, Event Hubs |
| MWAA | Managed Airflow | Cloud Composer, Data Factory |
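
The comparison table can also be kept as a small lookup structure, which is handy when documenting a migration. The sketch below copies the rows of the table; the dictionary layout and the function names `comparable_service` and `aws_services_for` are illustrative assumptions, and the mappings are rough functional equivalents rather than drop-in replacements.

```python
# Rows copied from the AWS service table; the layout is illustrative.
AWS_EQUIVALENTS = {
    "S3": {"gcp": "GCS", "azure": "Azure Blob"},
    "Redshift": {"gcp": "BigQuery", "azure": "Synapse"},
    "EMR": {"gcp": "Dataproc", "azure": "HDInsight"},
    "Glue": {"gcp": "Dataflow", "azure": "Data Factory"},
    "Athena": {"gcp": "BigQuery", "azure": "Synapse SQL"},
    "Kinesis": {"gcp": "Pub/Sub", "azure": "Event Hubs"},
    "MWAA": {"gcp": "Cloud Composer", "azure": "Data Factory"},
}


def comparable_service(aws_service: str, provider: str) -> str:
    """Look up the service listed as comparable on another provider."""
    try:
        return AWS_EQUIVALENTS[aws_service][provider.lower()]
    except KeyError:
        raise KeyError(f"No mapping for {aws_service!r} on {provider!r}") from None


def aws_services_for(target: str, provider: str) -> list[str]:
    """Reverse lookup: AWS services listed as comparable to a target service."""
    key = provider.lower()
    return [aws for aws, eq in AWS_EQUIVALENTS.items() if eq.get(key) == target]


print(comparable_service("Redshift", "gcp"))  # BigQuery
print(aws_services_for("BigQuery", "gcp"))    # ['Redshift', 'Athena']
```

The reverse lookup shows why such mappings are only approximate: BigQuery covers ground that AWS splits between Redshift and Athena.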

Google Cloud Platform (GCP)

🎯 GCP Strengths

GCP Data Services

| Service | Purpose |
| --- | --- |
| Cloud Storage | Object storage (Standard, Nearline, Coldline, Archive classes) |
| BigQuery | Serverless data warehouse |
| Dataproc | Managed Spark/Hadoop |
| Dataflow | Stream and batch processing (Apache Beam) |
| Pub/Sub | Messaging and streaming |
| Cloud Composer | Managed Apache Airflow |
| Data Fusion | Visual ETL (CDAP-based) |

Microsoft Azure

🔷 Azure Strengths

Azure Data Services

| Service | Purpose |
| --- | --- |
| Azure Blob Storage | Object storage (Hot, Cool, Archive tiers) |
| Synapse Analytics | Unified analytics (SQL pools, Spark) |
| Data Factory | Visual ETL and data integration |
| HDInsight | Managed Hadoop/Spark/Kafka |
| Event Hubs | Streaming platform (Kafka-compatible) |
| Stream Analytics | Real-time stream processing |

Multi-Cloud Strategy

Best practice: avoid vendor lock-in by building on portable technologies, such as open data formats and portable tools.
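
One way to keep pipeline code portable is to write it against a small storage interface instead of a vendor SDK. The following is a minimal sketch under that assumption: `ObjectStore`, `LocalObjectStore`, and `copy_object` are hypothetical names, and the local filesystem stands in for S3, GCS, or Azure Blob so that the example runs without cloud credentials.

```python
import tempfile
from pathlib import Path
from typing import Protocol


class ObjectStore(Protocol):
    """Hypothetical minimal interface the pipeline depends on."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class LocalObjectStore:
    """Filesystem-backed stand-in; an adapter for S3, GCS, or Azure Blob
    would expose the same two methods."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


def copy_object(store: ObjectStore, src: str, dst: str) -> None:
    """Pipeline step written only against the interface, not a vendor SDK."""
    store.put(dst, store.get(src))


with tempfile.TemporaryDirectory() as tmp:
    store = LocalObjectStore(Path(tmp))
    store.put("raw/orders.csv", b"id,amount\n1,9.99\n")
    copy_object(store, "raw/orders.csv", "staged/orders.csv")
    print(store.get("staged/orders.csv").decode())
```

With this shape, moving to another provider means writing one new adapter class; steps such as `copy_object` stay unchanged.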

Cost Optimization

💰 Cost Best Practices

Decision Framework: Cloud Platform Strategy

| Decision Point | Choose Option A If... | Choose Option B If... |
| --- | --- | --- |
| Single-cloud vs Multi-cloud | Small team, focus on fast delivery, simple governance | You need resilience across vendors or face regional regulatory constraints |
| Managed service vs Self-managed | Velocity and minimal operational burden are the priority | You need full control over configuration or specialized performance tuning |
| Serverless vs Provisioned | Workload fluctuates and usage is hard to predict | Workload is steadily high and you are optimizing long-term cost |
| Cloud-native vs Portable stack | You accept lock-in in exchange for the vendor's most complete feature set | You want flexibility to migrate and leverage in cost negotiations |
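
The serverless versus provisioned row comes down to a break-even calculation: pay-per-use wins below a certain monthly volume, fixed capacity wins above it. The sketch below assumes a simple model (serverless billed per TB scanned, provisioned billed per hour whether used or not). All prices are hypothetical placeholders, not real vendor rates.

```python
def serverless_cost(tb_scanned: float, price_per_tb: float) -> float:
    """Monthly cost when billing is per TB scanned."""
    return tb_scanned * price_per_tb


def provisioned_cost(hours: float, price_per_hour: float) -> float:
    """Monthly cost when capacity is billed per hour, used or not."""
    return hours * price_per_hour


def break_even_tb(hours: float, price_per_hour: float, price_per_tb: float) -> float:
    """TB scanned per month at which both models cost the same."""
    return provisioned_cost(hours, price_per_hour) / price_per_tb


# Hypothetical prices for illustration only.
PRICE_PER_TB = 5.0
PRICE_PER_HOUR = 2.0
HOURS_PER_MONTH = 730.0

print(break_even_tb(HOURS_PER_MONTH, PRICE_PER_HOUR, PRICE_PER_TB))  # 292.0

for tb in (50, 400):
    fixed = provisioned_cost(HOURS_PER_MONTH, PRICE_PER_HOUR)
    cheaper = "serverless" if serverless_cost(tb, PRICE_PER_TB) < fixed else "provisioned"
    print(f"{tb} TB/month -> {cheaper}")
```

Under these assumed prices, a workload scanning 50 TB a month is cheaper on serverless and one scanning 400 TB is cheaper on provisioned capacity. Plugging in actual rates and measured usage gives the real threshold.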

Failure Modes & Anti-Patterns

Anti-Patterns on Cloud Data Platforms

Production Readiness Checklist

Cloud Platform Checklist before Production

  1. Landing zone, IAM baseline, and network segmentation are in place.
  2. Infrastructure as Code is used for primary provisioning.
  3. Budget alerts and quota guardrails are active per project/workload.
  4. Backup, DR, and RTO/RPO targets are defined.
  5. Data residency/compliance requirements are validated.
  6. The observability stack is active (cost, performance, reliability).
  7. Service ownership and the escalation path are documented.
  8. Vendor lock-in risk is recorded with a mitigation plan.
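
The budget alert in item 3 is usually a set of thresholds on the ratio of spend to budget. The following is a minimal sketch of that logic only: `crossed_thresholds` and the 50/80/100 percent defaults are assumptions for illustration, and in practice the alerts would be configured in the provider's own billing tools.

```python
def crossed_thresholds(spend: float, budget: float,
                       thresholds: tuple[float, ...] = (0.5, 0.8, 1.0)) -> list[float]:
    """Return the alert thresholds (as fractions of budget) that spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    return [t for t in thresholds if ratio >= t]


# A workload that has spent 850 of a 1000 monthly budget.
print(crossed_thresholds(850.0, 1000.0))  # [0.5, 0.8]
```

Evaluating this per project or workload, as the checklist suggests, keeps one runaway job from hiding inside an account-wide total.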

✏️ Exercise: Cloud Architecture Design

Design a data architecture for an e-commerce company on AWS:

  1. Data Lake: S3 for raw data (JSON, CSV, Parquet)
  2. Ingestion: Kinesis for streaming, Glue for batch
  3. Storage: Redshift as the data warehouse
  4. Processing: EMR for Spark transformations
  5. Orchestration: MWAA (Managed Airflow)
  6. Serving: Athena for ad-hoc queries
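
One way to check a design like this is to write the data flow down as a dependency graph and confirm it has a valid build order. The sketch below uses Python's standard `graphlib` module. The edges are an assumed data flow for this exercise (ingestion lands in S3, EMR reads from S3, Redshift is loaded from EMR output, Athena queries S3), not a prescribed design. MWAA is left out because it schedules the other components rather than carrying data.

```python
from graphlib import TopologicalSorter

# Each component maps to the components it reads from.
# These edges are an assumption made for the exercise.
DEPENDS_ON = {
    "Kinesis": set(),
    "Glue": set(),
    "S3": {"Kinesis", "Glue"},
    "EMR": {"S3"},
    "Redshift": {"EMR"},
    "Athena": {"S3"},
}

# static_order() raises CycleError if the design contains a circular dependency.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(" -> ".join(order))
```

The resulting order is also a reasonable sequence for provisioning the components or for laying out tasks in an Airflow DAG.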

🎯 Quick Quiz

1. Which AWS service is comparable to BigQuery?

A. S3
B. Redshift
C. EMR
D. Kinesis

2. What is Google Cloud's main strength?

A. Cheapest storage
B. Best-in-class analytics and ML
C. Largest market share
D. Best Windows integration

3. How do you avoid vendor lock-in?

A. Use only proprietary services
B. Use open formats and portable tools
C. Move to on-premise
D. Use only one cloud

Conclusion

Each cloud provider has distinct strengths. AWS has the broadest ecosystem, GCP excels at analytics/ML, and Azure integrates well with enterprise environments. Choose based on your specific requirements and your team's expertise.

🎯 Key Takeaways
