Intermediate
Cloud Data Platforms
AWS, Google Cloud, Azure for data engineering
⏱️ 50 min read
📅 Updated Jan 2025
👤 By DataLearn Team
Mode Baca Pemula
Anggap cloud platform sebagai "fondasi infrastruktur data". Fokus baca:
- Perbedaan layanan inti AWS, GCP, dan Azure
- Kompromi antara fleksibilitas, biaya, dan kompleksitas
- Cara menghindari lock-in dari awal
Kamus istilah: DE-GLOSSARY.md
Prasyarat Ringan
- Paham compute, storage, dan network sebagai komponen dasar
- Tahu model biaya cloud umumnya pay-as-you-go
- Pernah lihat kebutuhan deploy cepat vs kontrol infrastruktur
Istilah Penting (3 Lapis)
Istilah: Vendor Lock-in
Definisi awam: Sulit pindah platform karena terlalu tergantung satu vendor.
Definisi teknis: Ketergantungan pada service/API proprietary yang meningkatkan biaya dan risiko migrasi.
Contoh praktis: Pipeline bergantung fitur eksklusif satu warehouse sehingga migrasi butuh rewrite besar.
Istilah: Auto-suspend
Definisi awam: Komputasi berhenti otomatis saat tidak dipakai untuk hemat biaya.
Definisi teknis: Mekanisme idle shutdown pada compute warehouse/cluster untuk optimasi FinOps.
Contoh praktis: Cluster BI mati otomatis setelah 5 menit idle untuk menekan bill bulanan.
Why Cloud for Data Engineering?
- Scalability: Scale up/down on demand
- Cost Efficiency: Pay for what you use (OPEX vs CAPEX)
- Managed Services: Less operational overhead
- Global Reach: Data centers worldwide
- Integration: Rich ecosystem of services
Big Three Cloud Providers
🟠 AWS
Market leader, broadest service offering
🔵 Google Cloud
Best-in-class analytics and ML
🔷 Azure
Enterprise integration, hybrid cloud
AWS Data Services
| Service |
Purpose |
Comparable To |
| S3 |
Object storage (Data Lake) |
GCS, Azure Blob |
| Redshift |
Data Warehouse |
BigQuery, Synapse |
| EMR |
Managed Spark/Hadoop |
Dataproc, HDInsight |
| Glue |
Serverless ETL |
Dataflow, Data Factory |
| Athena |
Serverless SQL queries |
BigQuery, Synapse SQL |
| Kinesis |
Streaming data |
Pub/Sub, Event Hubs |
| MWAA |
Managed Airflow |
Cloud Composer, ADF |
Google Cloud Platform (GCP)
🎯 GCP Strengths
- BigQuery: Serverless, separates storage/compute
- AI/ML: Vertex AI, TensorFlow integration
- Dataflow: Apache Beam as managed service
- Looker: Modern BI platform
GCP Data Services
| Service |
Purpose |
| Cloud Storage |
Object storage (Multi-regional, Nearline, Coldline) |
| BigQuery |
Serverless data warehouse |
| Dataproc |
Managed Spark/Hadoop |
| Dataflow |
Stream and batch processing (Apache Beam) |
| Pub/Sub |
Messaging and streaming |
| Cloud Composer |
Managed Apache Airflow |
| Data Fusion |
Visual ETL/CDAP-based |
Microsoft Azure
🔷 Azure Strengths
- Enterprise Integration: Active Directory, Office 365
- Hybrid Cloud: Azure Stack for on-prem
- Fabric: Unified analytics platform
- Power BI: Market-leading BI tool
Azure Data Services
| Service |
Purpose |
| Azure Blob Storage |
Object storage (Hot, Cool, Archive tiers) |
| Synapse Analytics |
Unified analytics (SQL pools, Spark) |
| Data Factory |
Visual ETL and data integration |
| HDInsight |
Managed Hadoop/Spark/Kafka |
| Event Hubs |
Streaming platform (Kafka-compatible) |
| Stream Analytics |
Real-time stream processing |
Multi-Cloud Strategy
Best practice: avoid vendor lock-in with portable technologies:
- Open formats: Parquet, Delta Lake, Iceberg
- Open tools: Spark, Airflow, dbt
- Containerization: Kubernetes for portability
- Abstraction layers: Terraform, dbt adapters
Cost Optimization
💰 Cost Best Practices
- Use reserved instances for predictable workloads
- Right-size your compute (don't over-provision)
- Use lifecycle policies for data tiering
- Monitor with cloud cost management tools
- Enable auto-scaling for variable workloads
Decision Framework: Cloud Platform Strategy
| Decision Point |
Pilih Opsi A Jika... |
Pilih Opsi B Jika... |
| Single-cloud vs Multi-cloud |
Tim kecil, fokus delivery cepat, governance sederhana |
Butuh resilience lintas vendor atau constraint regulasi regional |
| Managed service vs Self-managed |
Prioritas velocity dan minim beban operasional |
Butuh kontrol penuh konfigurasi/performance khusus |
| Serverless vs Provisioned |
Workload fluktuatif dan usage sulit diprediksi |
Workload stabil tinggi dan optimasi biaya jangka panjang |
| Cloud-native vs Portable stack |
Accept lock-in demi fitur vendor paling lengkap |
Ingin fleksibilitas migrasi dan negosiasi biaya |
Failure Modes & Anti-Patterns
Anti-Patterns pada Cloud Data Platform
- Lift-and-shift mindset: desain on-prem dipindah mentah ke cloud tanpa optimasi.
- No cost guardrails: query/ad-hoc runaway menyebabkan bill meledak.
- Over-service sprawl: terlalu banyak service tanpa governance ownership.
- Ignoring data egress: biaya transfer antar region/cloud diabaikan.
- No IaC baseline: environment drift antar dev/staging/prod.
Production Readiness Checklist
Checklist Cloud Platform sebelum Production
- Landing zone, IAM baseline, dan network segmentation siap.
- Infrastructure as Code digunakan untuk provisioning utama.
- Budget alert dan quota guardrails aktif per project/workload.
- Backup, DR, dan target RTO/RPO terdefinisi.
- Data residency/compliance requirement tervalidasi.
- Observability stack aktif (cost, performance, reliability).
- Service ownership dan escalation path terdokumentasi.
- Vendor lock-in risk dicatat dengan mitigation plan.
✏️ Exercise: Cloud Architecture Design
Desain arsitektur data untuk e-commerce di AWS:
- Data Lake: S3 untuk raw data (JSON, CSV, Parquet)
- Ingestion: Kinesis untuk streaming, Glue untuk batch
- Storage: Redshift untuk data warehouse
- Processing: EMR untuk Spark transformations
- Orchestration: MWAA (Managed Airflow)
- Serving: Athena untuk ad-hoc queries
🎯 Quick Quiz
1. Service apa yang setara BigQuery di AWS?
A. S3
B. Redshift
C. EMR
D. Kinesis
2. Keunggulan utama Google Cloud?
A. Cheapest storage
B. Best-in-class analytics and ML
C. Largest market share
D. Best Windows integration
3. Bagaimana menghindari vendor lock-in?
A. Gunakan hanya proprietary services
B. Gunakan open formats dan portable tools
C. Pindah ke on-premise
D. Gunakan satu cloud saja
Kesimpulan
Setiap cloud provider memiliki kekuatan unik. AWS memiliki ekosistem terluas, GCP unggul dalam analytics/ML, dan Azure terintegrasi dengan baik untuk enterprise. Pilih berdasarkan kebutuhan spesifik dan expertise tim.
🎯 Key Takeaways
- Cloud offers scalability, managed services, and cost efficiency
- AWS: Broadest portfolio, market leader
- GCP: Best analytics (BigQuery) and ML
- Azure: Enterprise integration and hybrid cloud
- Use open formats to avoid vendor lock-in
📚 References & Resources
Primary Sources
- Fundamentals of Data Engineering - Joe Reis & Matt Housley (O'Reilly, 2022)
Chapter 18: Cloud Data Platforms
- Cloud Data Platform Architecture - James Serra (Apress, 2022)
- Google Cloud Platform in Action - John Geewax (Manning, 2018)
Official Documentation
Articles & Guides