Beginner

Python for Data Engineering

Advanced Python, pandas, boto3, working with APIs

⏱️ 45 min read 📅 Updated Jan 2025 👤 By DataLearn Team

Beginner Reading Mode

Think of Python as the "Swiss Army knife" of data engineering. Reading focus:

  1. The core libraries used most often in pipelines
  2. Coding patterns that are safe for production (retry, logging, testing)
  3. Where Python fits, and where SQL/Spark is the better choice

Glossary: DE-GLOSSARY.md

Light Prerequisites

Key Terms (3 Layers)

Term: Virtual Environment

Layman's definition: A separate Python environment for each project.

Technical definition: Isolates package dependencies so that library versions do not clash across projects.

Practical example: Project A uses pandas 2.2 while project B stays safe on a different version (see the sketch below).
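One common workflow for creating and activating a virtual environment from the terminal, shown as a minimal sketch (the directory name .venv is just a convention):

python -m venv .venv
source .venv/bin/activate        # on Windows: .venv\Scripts\activate
pip install pandas==2.2.0        # installs only inside this environment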

Term: Idempotent Script

Layman's definition: A script that can be rerun many times and still produce the correct result.

Technical definition: An ETL program that is safe to rerun without side effects such as duplicated data.

Practical example: A daily job uses an upsert key so that a retry does not add duplicate rows (see the sketch below).
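To make that concrete, here is a minimal sketch of an idempotent load using PostgreSQL's ON CONFLICT upsert; the table and column names (daily_orders, order_id) are hypothetical:

import psycopg2

# Rerunning this load is safe: an existing order_id is updated, never duplicated
UPSERT_SQL = """
    INSERT INTO daily_orders (order_id, revenue, load_date)
    VALUES (%s, %s, %s)
    ON CONFLICT (order_id) DO UPDATE
    SET revenue = EXCLUDED.revenue, load_date = EXCLUDED.load_date;
"""

def load_orders(conn, rows):
    # rows: iterable of (order_id, revenue, load_date) tuples
    with conn.cursor() as cur:
        cur.executemany(UPSERT_SQL, rows)
    conn.commit()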

Why Python for Data Engineering?

Python pairs readable syntax with a deep library ecosystem (pandas, requests, boto3, psycopg2), which is why it has become the default glue language for ETL and pipeline work.

Essential Libraries

Library  | Purpose            | Use Case
pandas   | Data manipulation  | ETL, data cleaning
requests | HTTP library       | API ingestion
boto3    | AWS SDK            | S3, Redshift operations
psycopg2 | PostgreSQL adapter | Database connections
pyarrow  | Columnar data      | Parquet handling

Pandas for Data Processing

import pandas as pd

# Read data from various sources
df_csv = pd.read_csv('data.csv')
df_json = pd.read_json('data.json')
df_parquet = pd.read_parquet('data.parquet')

df = df_csv  # work with the CSV data below

# Data transformation
df['total'] = df['price'] * df['quantity']

# Aggregation
monthly = df.groupby('month').agg({
    'revenue': 'sum',
    'orders': 'count'
})

# Handle missing data (plain assignment avoids the chained-assignment
# pitfalls of inplace=True)
df['column'] = df['column'].fillna(0)
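Pandas loads everything into memory, which matters for the Pandas-vs-PySpark decision later in this guide. As a minimal sketch (the file and column names are hypothetical), chunksize lets you stream a large CSV instead of loading it whole:

import pandas as pd

# Process a large CSV in 100k-row chunks instead of loading it at once
total = 0.0
for chunk in pd.read_csv('big_orders.csv', chunksize=100_000):
    total += (chunk['price'] * chunk['quantity']).sum()
print(f"Grand total: {total}")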

Working with APIs

import requests

def fetch_api_data(url, headers=None):
    try:
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")
        return None

# Paginated API handling
def fetch_all_pages(base_url):
    all_data = []
    page = 1
    while True:
        data = fetch_api_data(f"{base_url}?page={page}")
        # Stop on errors or when a page comes back empty
        if not data or not data.get('results'):
            break
        all_data.extend(data['results'])
        page += 1
    return all_data
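For production ingestion you usually also want automatic retries with backoff. One common approach (a sketch, not the only option) mounts urllib3's Retry policy onto a requests Session:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry transient failures up to 3 times with exponential backoff
retry = Retry(total=3, backoff_factor=1,
              status_forcelist=[429, 500, 502, 503, 504])
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry))

response = session.get("https://api.example.com/data", timeout=30)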

AWS with boto3

import boto3
import pandas as pd

# S3 operations
s3 = boto3.client('s3')

# Upload file
s3.upload_file('local.csv', 'my-bucket', 'data/file.csv')

# List objects
response = s3.list_objects_v2(Bucket='my-bucket', Prefix='data/')
for obj in response.get('Contents', []):
    print(obj['Key'])

# Read directly to pandas
obj = s3.get_object(Bucket='my-bucket', Key='data.csv')
df = pd.read_csv(obj['Body'])
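Note that list_objects_v2 returns at most 1,000 keys per call. For larger prefixes, boto3's built-in paginator handles continuation tokens for you (bucket and prefix names are placeholders):

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# Iterate over every page of results transparently
for page in paginator.paginate(Bucket='my-bucket', Prefix='data/'):
    for obj in page.get('Contents', []):
        print(obj['Key'])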

Database Connections

import psycopg2
from contextlib import contextmanager

@contextmanager
def get_db_connection(conn_string):
    conn = psycopg2.connect(conn_string)
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise  # re-raise without losing the original traceback
    finally:
        conn.close()

# Using SQLAlchemy for pandas integration
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:pass@host:5432/db')
df.to_sql('table_name', engine, if_exists='append', index=False)
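For completeness, here is how the context manager above might be used; the DSN and query are hypothetical:

# Commit, rollback, and close are all handled by the context manager
with get_db_connection('postgresql://user:pass@host:5432/db') as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM table_name;")
        print(cur.fetchone())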

Python Best Practices

Decision Framework: Python Engineering Practices

Decision Point                    | Choose Option A If...                  | Choose Option B If...
Simple script vs packaged project | Quick, short-lived experiment          | Production pipeline with a long lifecycle
Pandas vs PySpark                 | Data fits in a single machine's memory | Large data needs distributed compute
Synchronous vs async I/O          | Few, simple API integrations           | Many parallel network-bound calls
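As a sketch of the "many parallel network-bound calls" case, the standard-library ThreadPoolExecutor parallelizes blocking requests calls without a full async rewrite (the URLs are placeholders):

import requests
from concurrent.futures import ThreadPoolExecutor

urls = [f"https://api.example.com/items?page={p}" for p in range(1, 11)]

def fetch(url):
    return requests.get(url, timeout=30).json()

# Issue up to 8 requests concurrently; results come back in input order
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch, urls))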

Failure Modes & Anti-Patterns

The most common Python anti-patterns in DE mirror the checklist below: non-idempotent jobs that duplicate rows on retry, bare except blocks that swallow error context, and external I/O calls without timeouts or retries.

Production Readiness Checklist

Python DE Checklist

  1. Project structure is modular and reusable.
  2. Dependency lockfile/pinning is in place.
  3. Unit tests + pipeline integration tests exist.
  4. Structured logging with clear error context.
  5. Retry/timeout enabled for external I/O (see the sketch after this list).
  6. CI checks (lint, test, type check) must pass.
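A minimal sketch of item 5: a generic retry-with-backoff decorator for flaky external calls. In real projects a library such as tenacity is a common alternative:

import time
import functools

def retry(times=3, backoff=2.0):
    """Retry a function with exponential backoff on any exception."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = backoff
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == times:
                        raise  # out of attempts: surface the error
                    time.sleep(delay)
                    delay *= 2
        return wrapper
    return decorator

@retry(times=3)
def flaky_extract():
    ...  # e.g. an API or database call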

✏️ Exercise: Build ETL Pipeline

Build an ETL pipeline with Python:

  1. Extract: fetch data from a REST API
  2. Transform: clean it with pandas (handle nulls, type conversion)
  3. Load: save to PostgreSQL, with a backup copy to S3

🎯 Quick Quiz

1. Which library is used for AWS S3 in Python?

A. aws-python
B. boto3
C. s3lib
D. aws-sdk

2. Which pandas method reads Parquet files?

A. pd.read_parquet()
B. pd.read_json()
C. pd.read_csv()
D. pd.read_table()

3. Why use a context manager?

A. It makes code faster
B. Automatic cleanup of resources
C. It reduces the number of code lines
D. No particular reason

Conclusion

Python is a must-have language for data engineers. By mastering pandas, API interactions, cloud SDKs, and database connections, you can build robust ETL pipelines.
