📊 DataLearn

⚙️ Data Engineering

Master the end-to-end data lifecycle: from generation and storage to ingestion, transformation, and serving. Based on "Fundamentals of Data Engineering" by Joe Reis & Matt Housley.

29 Topics · 24 Available · 6 Phases

📚 Reference: Fundamentals of Data Engineering

This learning path follows the Data Engineering Lifecycle framework by Joe Reis & Matt Housley (O'Reilly, 2022).

The Data Engineering Lifecycle

Generation → Storage → Ingestion → Transformation → Serving
24 of 29 topics completed (83%)

🏗️ Phase 1: Foundation & Building Blocks

01

Introduction to Data Engineering

What data engineering is, plus roles, skills, and the data maturity model

⏱️ 15 min ● Beginner ✓ Available
02

Data Engineering Lifecycle

The 5 stages: Generation, Storage, Ingestion, Transformation, Serving

⏱️ 25 min ● Beginner ✓ Available
03

Data Architecture

9 principles of good architecture, monolith vs microservices, data mesh

⏱️ 40 min ● Intermediate ✓ Available
04

Technology Selection

Team capabilities, TCO, cloud vs on-prem, build vs buy decisions

⏱️ 20 min ● Intermediate ✓ Available
05

Source Systems

OLTP databases, APIs, IoT, message queues, CDC

⏱️ 30 min ● Beginner ✓ Available

🔄 Phase 2: The Data Engineering Lifecycle in Depth

06

Storage Systems

Data warehouse, data lake, lakehouse, partitioning, schema evolution

⏱️ 45 min ● Intermediate ✓ Available
07

Data Ingestion

Batch vs streaming, CDC, ETL vs ELT, error handling

⏱️ 40 min ● Intermediate ✓ Available
08

Data Modeling & Transformation

Normalization, dbt, materialized views, query optimization

⏱️ 50 min ● Intermediate ✓ Available
09

Orchestration with Apache Airflow

DAG design, workflow patterns, monitoring, best practices

⏱️ 55 min ● Intermediate ✓ Available
10

Data Serving & APIs

Analytics serving, ML serving, reverse ETL, data products

⏱️ 35 min ● Intermediate ✓ Available
11

Reliable Data Systems

Kleppmann: Replication, partitioning, ACID, consistency models

⏱️ 50 min ● Advanced ✓ Available
12

Data Pipeline Patterns

Densmore: Patterns, anti-patterns, testing, idempotency

⏱️ 40 min ● Intermediate ✓ Available
13

DataOps & Observability

CI/CD for data, data quality testing, lineage, data catalog

⏱️ 40 min ● Advanced ✓ Available
14

Security & Data Governance

Encryption, access control, PII handling, compliance

⏱️ 35 min ● Intermediate ✓ Available
15

Cloud Data Platforms

AWS, Google Cloud, Azure for data engineering

⏱️ 50 min ● Intermediate ✓ Available
16

Apache Spark for Big Data

Spark architecture, RDDs, DataFrames, Spark SQL, optimization

⏱️ 60 min ● Advanced ✓ Available
17

Real-time Streaming with Kafka

Kafka architecture, producers/consumers, stream processing

⏱️ 55 min ● Advanced ✓ Available
18

Python for Data Engineering

Advanced Python, pandas, boto3, working with APIs

⏱️ 45 min ● Beginner ✓ Available
19

SQL Advanced for Data Engineers

Window functions, CTEs, query optimization, execution plans

⏱️ 50 min ● Intermediate ✓ Available
20

dbt (data build tool)

dbt models, tests, documentation, best practices

⏱️ 45 min ● Intermediate ✓ Available
21

Data Pipeline Monitoring

Observability, alerting, SLA management, incident response

⏱️ 35 min ● Intermediate ✓ Available
22

The Future of Data Engineering

Emerging trends, AI/ML integration, data mesh, modern data stack

⏱️ 30 min ● Beginner ✓ Available

📌 Phase 6: Community-Driven Special Topics

23

Data Contracts & Schema Evolution in Production

Compatibility policy, versioning, CI checks, and safe rollout strategy

⏱️ 45 min ● Advanced ✓ Available
24

Idempotency vs Atomicity vs Exactly-Once (Practical)

Rerun-safe design, dedup strategies, and realistic consistency guarantees

⏱️ 50 min ● Advanced ✓ Available
25

Data Freshness, Completeness, and Data SLAs/SLOs

Metric design, tiered SLOs, and actionable alerting

⏱️ 40 min ● Intermediate Coming Soon
26

Small Files Problem & Compaction Playbook

Storage anti-patterns, metadata overhead, and remediation workflow

⏱️ 35 min ● Intermediate Coming Soon
27

Airflow Boundaries: Orchestrate, Don't Transform

DAG boundaries, anti-pattern cleanup, and operational guardrails

⏱️ 35 min ● Intermediate Coming Soon
28

dbt at Scale: Incremental, State, and Cost Control

Slim CI, incremental pitfalls, and enterprise dbt operations

⏱️ 45 min ● Advanced Coming Soon
29

Reference Architecture by Constraint (Stack Playbooks)

Stack selection by latency, cost, team size, and compliance constraints

⏱️ 45 min ● Advanced Coming Soon

🎯 Capstone Projects

🚀 Project 1: End-to-End ETL Pipeline

Build a complete pipeline: extract from an API, transform with Python/pandas, and load into PostgreSQL.

📊 Project 2: Real-time Analytics Dashboard

Set up Kafka for streaming, process with Spark Streaming, and visualize with Grafana.

🏢 Project 3: Data Warehouse Migration

Migrate from an on-premises database to a cloud data warehouse (BigQuery/Snowflake).

📚 Resources

🔗 Code Repository

Download all code examples and starter templates for each chapter.

💾 Sample Datasets

E-commerce, logs, IoT, and user behavior datasets for practice.

📝 Cheat Sheets

Quick reference for SQL commands, Airflow operators, and Spark transformations.

📖 Reference Book

"Fundamentals of Data Engineering" by Joe Reis & Matt Housley (O'Reilly).
