⚙️ Data Engineering

Master the end-to-end data lifecycle: from generation and storage to ingestion, transformation, and serving. Based on "Fundamentals of Data Engineering" by Joe Reis & Matt Housley.

📚 Reference: Fundamentals of Data Engineering

This learning path is structured around the Data Engineering Lifecycle framework by Joe Reis & Matt Housley (O'Reilly, 2022).

The Data Engineering Lifecycle

Generation → Storage → Ingestion → Transformation → Serving

24 of 29 topics completed (83%)

🏗️ Phase 1: Foundation & Building Blocks

Introduction to Data Engineering

What is data engineering, roles, skills, and data maturity model

⏱️ 15 min ● Beginner ✓ Available

Data Engineering Lifecycle

The 5 stages: Generation, Storage, Ingestion, Transformation, Serving

⏱️ 25 min ● Beginner ✓ Available

Data Architecture

9 principles of good architecture, monolith vs microservices, data mesh

⏱️ 40 min ● Intermediate ✓ Available

Technology Selection

Team capabilities, TCO, cloud vs on-prem, build vs buy decisions

⏱️ 20 min ● Intermediate ✓ Available

Source Systems

OLTP databases, APIs, IoT, message queues, CDC

⏱️ 30 min ● Beginner ✓ Available

🔄 Phase 2: The Data Engineering Lifecycle in Depth

Storage Systems

Data warehouse, data lake, lakehouse, partitioning, schema evolution

⏱️ 45 min ● Intermediate ✓ Available

Data Ingestion

Batch vs streaming, CDC, ETL vs ELT, error handling

⏱️ 40 min ● Intermediate ✓ Available

Data Modeling & Transformation

Normalization, dbt, materialized views, query optimization

⏱️ 50 min ● Intermediate ✓ Available

Orchestration with Apache Airflow

DAG design, workflow patterns, monitoring, best practices

⏱️ 55 min ● Intermediate ✓ Available

Data Serving & APIs

Analytics serving, ML serving, reverse ETL, data products

⏱️ 35 min ● Intermediate ✓ Available

Reliable Data Systems

Kleppmann: Replication, partitioning, ACID, consistency models

⏱️ 50 min ● Advanced ✓ Available

Data Pipeline Patterns

Densmore: Patterns, anti-patterns, testing, idempotency

⏱️ 40 min ● Intermediate ✓ Available

DataOps & Observability

CI/CD for data, data quality testing, lineage, data catalog

⏱️ 40 min ● Advanced ✓ Available

Security & Data Governance

Encryption, access control, PII handling, compliance

⏱️ 35 min ● Intermediate ✓ Available

Cloud Data Platforms

AWS, Google Cloud, Azure for data engineering

⏱️ 50 min ● Intermediate ✓ Available

Apache Spark for Big Data

Spark architecture, RDDs, DataFrames, Spark SQL, optimization

⏱️ 60 min ● Advanced ✓ Available

Real-time Streaming with Kafka

Kafka architecture, producers/consumers, stream processing

⏱️ 55 min ● Advanced ✓ Available

Python for Data Engineering

Advanced Python, pandas, boto3, working with APIs

⏱️ 45 min ● Beginner ✓ Available

Advanced SQL for Data Engineers

Window functions, CTEs, query optimization, execution plans

⏱️ 50 min ● Intermediate ✓ Available

dbt (Data Build Tool)

dbt models, tests, documentation, best practices

⏱️ 45 min ● Intermediate ✓ Available

Data Pipeline Monitoring

Observability, alerting, SLA management, incident response

⏱️ 35 min ● Intermediate ✓ Available

The Future of Data Engineering

Emerging trends, AI/ML integration, data mesh, modern data stack

⏱️ 30 min ● Beginner ✓ Available

📌 Phase 6: Community-Driven Special Topics

Data Contracts & Schema Evolution in Production

Compatibility policy, versioning, CI checks, and safe rollout strategy

⏱️ 45 min ● Advanced ✓ Available

Idempotency vs Atomicity vs Exactly-Once (Practical)

Rerun-safe design, dedup strategies, and realistic consistency guarantees

⏱️ 50 min ● Advanced ✓ Available

Data Freshness, Completeness, and SLAs/SLOs for Data

Metric design, tiered SLOs, and actionable alerting

⏱️ 40 min ● Intermediate Coming Soon

Small Files Problem & Compaction Playbook

Storage anti-patterns, metadata overhead, and remediation workflow

⏱️ 35 min ● Intermediate Coming Soon

Airflow Boundaries: Orchestrate, Don't Transform

DAG boundaries, anti-pattern cleanup, and operational guardrails

⏱️ 35 min ● Intermediate Coming Soon

dbt at Scale: Incremental, State, and Cost Control

Slim CI, incremental pitfalls, and enterprise dbt operations

⏱️ 45 min ● Advanced Coming Soon

Reference Architecture by Constraint (Stack Playbooks)

Stack selection by latency, cost, team size, and compliance constraints

⏱️ 45 min ● Advanced Coming Soon

🎯 Capstone Projects

🚀 Project 1: End-to-End ETL Pipeline

Build a complete pipeline: extract from an API, transform with Python/pandas, and load into PostgreSQL.

View Project →

📊 Project 2: Real-time Analytics Dashboard

Set up Kafka for streaming, process with Spark Streaming, and visualize with Grafana.

View Project →

🏢 Project 3: Data Warehouse Migration

Migrate from an on-premises database to a cloud data warehouse (BigQuery/Snowflake).

View Project →

📚 Resources

🔗 Code Repository

Download all code examples and starter templates for each chapter.

GitHub →

💾 Sample Datasets

E-commerce, logs, IoT, and user behavior datasets for practice.

Browse →

📝 Cheat Sheets

Quick references for SQL commands, Airflow operators, and Spark transformations.

Download PDF →

📖 Reference Book

"Fundamentals of Data Engineering" by Joe Reis & Matt Housley (O'Reilly).

Learn More →