Beginner
Python for Data Engineering
Advanced Python, pandas, boto3, working with APIs
⏱️ 45 min read
📅 Updated Jan 2025
👤 By DataLearn Team
Beginner Reading Mode
Think of Python as the "Swiss Army knife" of data engineering. Focus your reading on:
- The core libraries used most often in pipelines
- Coding patterns that are safe for production (retry, logging, testing)
- The limits of Python: when it fits, and when SQL/Spark is the better choice
Glossary: DE-GLOSSARY.md
Light Prerequisites
- You have written basic Python scripts (functions, loops, imports)
- You know common data formats such as CSV/JSON
- You understand that pipelines can fail due to API/database timeouts
Key Terms (3 Layers)
Term: Virtual Environment
Plain definition: A separate Python environment for each project.
Technical definition: Dependency isolation so that library versions in different projects do not conflict.
Practical example: Project A uses pandas 2.2 while project B safely stays on another version.
Term: Idempotent Script
Plain definition: A script that produces the correct result no matter how many times it is rerun.
Technical definition: An ETL program that is safe to rerun without side effects such as duplicated data.
Practical example: A daily job uses an upsert key so that a retry does not add duplicate rows (see the sketch below).
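To make idempotency concrete, here is a minimal sketch using psycopg2 and PostgreSQL's ON CONFLICT upsert; the daily_sales table, its key columns, and the connection string are hypothetical.

```python
import psycopg2

# Hypothetical table and upsert key, for illustration only
UPSERT_SQL = """
    INSERT INTO daily_sales (sale_date, store_id, revenue)
    VALUES (%s, %s, %s)
    ON CONFLICT (sale_date, store_id)
    DO UPDATE SET revenue = EXCLUDED.revenue;
"""

def load_daily_sales(rows):
    # Rerunning this load updates existing (sale_date, store_id) rows
    # instead of inserting duplicates, so retries are safe
    with psycopg2.connect("dbname=warehouse") as conn:  # placeholder DSN
        with conn.cursor() as cur:
            cur.executemany(UPSERT_SQL, rows)
```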
Why Python for Data Engineering?
- Easy to learn: Readable syntax, great for beginners
- Rich ecosystem: pandas, PySpark, Airflow, and more
- Community: Largest data science/engineering community
- Integration: Works with virtually all data tools
Essential Libraries
| Library | Purpose | Use Case |
|---------|---------|----------|
| pandas | Data manipulation | ETL, data cleaning |
| requests | HTTP library | API ingestion |
| boto3 | AWS SDK | S3, Redshift operations |
| psycopg2 | PostgreSQL adapter | Database connections |
| pyarrow | Columnar data | Parquet handling |
Pandas for Data Processing
```python
import pandas as pd

# Read common file formats into DataFrames
df = pd.read_csv('data.csv')
df_json = pd.read_json('data.json')
df_parquet = pd.read_parquet('data.parquet')

# Derive a new column from existing ones
df['total'] = df['price'] * df['quantity']

# Aggregate: total revenue and order count per month
monthly = df.groupby('month').agg({
    'revenue': 'sum',
    'orders': 'count'
})

# Fill missing values; assignment avoids the chained-assignment
# pitfalls of inplace=True
df['column'] = df['column'].fillna(0)
```
Working with APIs
```python
import requests

def fetch_api_data(url, headers=None):
    # Fetch JSON from an endpoint with a timeout and basic error handling
    try:
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")
        return None

def fetch_all_pages(base_url):
    # Walk page-numbered results until the API returns nothing
    all_data = []
    page = 1
    while True:
        data = fetch_api_data(f"{base_url}?page={page}")
        if not data:
            break
        all_data.extend(data['results'])
        page += 1
    return all_data
```
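The reading focus above calls out retry as a production-safe pattern. One way to get it with requests is urllib3's built-in Retry support; this is a minimal sketch, and the retry counts and status codes are illustrative choices, not fixed recommendations.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry transient failures with exponential backoff (illustrative settings)
retries = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[429, 500, 502, 503, 504],
)
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retries))

# The endpoint is a placeholder
response = session.get("https://api.example.com/data", timeout=30)
```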
AWS with boto3
```python
import boto3
import pandas as pd

s3 = boto3.client('s3')

# Upload a local file to S3
s3.upload_file('local.csv', 'my-bucket', 'data/file.csv')

# List objects under a prefix
response = s3.list_objects_v2(Bucket='my-bucket', Prefix='data/')
for obj in response.get('Contents', []):
    print(obj['Key'])

# Read a CSV from S3 straight into pandas
obj = s3.get_object(Bucket='my-bucket', Key='data.csv')
df = pd.read_csv(obj['Body'])
```
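One caveat worth knowing: list_objects_v2 returns at most 1,000 keys per call. For larger prefixes, boto3's paginator follows the continuation tokens for you; a short sketch with the same example bucket:

```python
# Paginate through prefixes holding more than 1,000 objects
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-bucket', Prefix='data/'):
    for obj in page.get('Contents', []):
        print(obj['Key'])
```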
Database Connections
```python
import psycopg2
from contextlib import contextmanager

@contextmanager
def get_db_connection(conn_string):
    # Commit on success, roll back on failure, always close
    conn = psycopg2.connect(conn_string)
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise  # re-raise with the original traceback
    finally:
        conn.close()
```
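A minimal usage sketch for the context manager above; the connection string, table, and query are hypothetical:

```python
# Placeholder DSN and table name
with get_db_connection("dbname=warehouse user=etl") as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM orders")
        print(cur.fetchone()[0])
```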
```python
# Bulk-load a DataFrame with SQLAlchemy
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:pass@host:5432/db')
df.to_sql('table_name', engine, if_exists='append', index=False)
```
Python Best Practices
✅ DO's
- Use type hints for function signatures (see the sketch after this list)
- Handle exceptions gracefully
- Use context managers (with statement)
- Write unit tests with pytest
- Use virtual environments
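To illustrate the first two DO's, a small sketch; the function and its comma-stripping rule are hypothetical:

```python
from typing import Optional

def parse_price(raw: str) -> Optional[float]:
    """Convert a raw price string such as '1,299.00' to a float."""
    try:
        return float(raw.strip().replace(',', ''))
    except (ValueError, AttributeError):
        # Malformed or non-string input: signal failure explicitly
        return None
```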
Decision Framework: Python Engineering Practices
| Decision Point | Choose Option A If... | Choose Option B If... |
|----------------|-----------------------|-----------------------|
| Simple script vs. package project | Quick, short-lived experiments | Production pipeline with a long lifecycle |
| pandas vs. PySpark | Data fits in a single machine's memory | Large data needs distributed compute |
| Synchronous vs. async I/O | Few, simple API integrations | Many parallel network-bound calls (sketch below) |
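To make the last row concrete, a minimal async sketch using asyncio with the third-party aiohttp client; the URLs are placeholders:

```python
import asyncio
import aiohttp

async def fetch(session: aiohttp.ClientSession, url: str) -> dict:
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
        resp.raise_for_status()
        return await resp.json()

async def fetch_many(urls: list[str]) -> list[dict]:
    # Issue all network-bound calls concurrently instead of one by one
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

# Placeholder endpoint
urls = [f"https://api.example.com/data?page={i}" for i in range(1, 6)]
results = asyncio.run(fetch_many(urls))
```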
Failure Modes & Anti-Patterns
Python Anti-Patterns for DE
- Straight notebook-to-prod: logic that is hard to test and maintain.
- No dependency pinning: the pipeline breaks when a package updates.
- Broad exception handling: errors stay hidden and corrupted data slips through (contrast the sketch below).
- No typing/contracts: interface bugs surface too late.
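A sketch of that broad-vs-specific contrast; parse_record and the sample record are hypothetical:

```python
import logging

logger = logging.getLogger(__name__)

def parse_record(raw: dict) -> dict:
    # Hypothetical parse step: fields may be missing or malformed
    return {"id": raw["id"], "price": float(raw["price"])}

raw = {"id": 1, "price": "oops"}

# Anti-pattern: a bare except hides every failure, including real bugs
try:
    row = parse_record(raw)
except Exception:
    row = None  # the error context is silently lost

# Better: catch only expected errors and log the context
try:
    row = parse_record(raw)
except (KeyError, ValueError) as e:
    logger.warning("Skipping bad record %r: %s", raw, e)
    row = None
```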
Production Readiness Checklist
Python DE Checklist
- Project structure is modular and reusable.
- Dependency lockfile/pinning is in place.
- Unit tests plus pipeline integration tests exist.
- Structured logging with clear error context.
- Retry/timeout enabled for all external I/O.
- CI checks (lint, test, type check) must pass.
✏️ Exercise: Build ETL Pipeline
Build an ETL pipeline with Python (a starter skeleton follows the list below):
- Extract: fetch data from a REST API
- Transform: clean with pandas (handle nulls, type conversion)
- Load: save to PostgreSQL, with an S3 backup
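If you want a starting point, here is a hedged skeleton that wires together the pieces from this article; every URL, table, and bucket name is a placeholder:

```python
import boto3
import pandas as pd
import requests
from sqlalchemy import create_engine

def extract(url: str) -> pd.DataFrame:
    # Placeholder endpoint; reuse the retry/session setup shown earlier
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return pd.DataFrame(resp.json()['results'])

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Handle nulls and type conversion
    df['price'] = pd.to_numeric(df['price'], errors='coerce').fillna(0)
    return df.dropna(subset=['id'])

def load(df: pd.DataFrame) -> None:
    engine = create_engine('postgresql://user:pass@host:5432/db')  # placeholder
    df.to_sql('api_data', engine, if_exists='append', index=False)
    df.to_csv('/tmp/api_data.csv', index=False)
    boto3.client('s3').upload_file('/tmp/api_data.csv', 'my-bucket', 'backups/api_data.csv')

if __name__ == '__main__':
    load(transform(extract('https://api.example.com/data')))
```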
🎯 Quick Quiz
1. Which library is used for AWS S3 in Python?
A. aws-python
B. boto3
C. s3lib
D. aws-sdk
2. Which pandas method reads Parquet files?
A. pd.read_parquet()
B. pd.read_json()
C. pd.read_csv()
D. pd.read_table()
3. Why use a context manager?
A. It speeds up code
B. It auto-cleans up resources
C. It reduces lines of code
D. No reason
Conclusion
Python is an essential language for data engineers. By mastering pandas, API interactions, cloud SDKs, and database connections, you can build robust ETL pipelines.
🎯 Key Takeaways
- pandas untuk data manipulation
- requests untuk API ingestion
- boto3 untuk AWS operations
- psycopg2/SQLAlchemy untuk databases
- Always use proper error handling
📚 References & Resources
Primary Sources
- Python for Data Analysis, 3rd Edition: Data Wrangling with pandas - Wes McKinney (O'Reilly, 2022)
- Architecture Patterns with Python - Harry Percival (O'Reilly, 2020)
- Effective Python - Brett Slatkin (Addison-Wesley, 2019)