Intermediate

SQL Advanced for Data Engineers

Window functions, CTEs, query optimization, and execution plans

⏱️ 50 min read 📅 Updated Jan 2025 👤 By DataLearn Team

Mode Baca Pemula

Anggap SQL advanced sebagai "alat utama transformasi data". Fokus baca:

  1. Window function untuk ranking, running total, dan dedup
  2. CTE untuk query kompleks yang tetap mudah dibaca
  3. Cara membaca execution plan agar query lebih cepat

Kamus istilah: DE-GLOSSARY.md

Prasyarat Ringan

Istilah Penting (3 Lapis)

Istilah: Window Function

Definisi awam: Fungsi yang menghitung nilai sambil tetap melihat baris lain di kelompok yang sama.

Definisi teknis: Operasi analitik SQL dengan OVER(PARTITION BY ... ORDER BY ...) tanpa menghilangkan detail row.

Contoh praktis: Hitung ranking penjualan per kota tanpa kehilangan detail transaksi.

Istilah: Explain Plan

Definisi awam: Peta langkah mesin database saat menjalankan query.

Definisi teknis: Representasi eksekusi query (scan, join, sort) untuk diagnosis performa.

Contoh praktis: Dari plan terlihat tabel besar di-full scan, lalu diperbaiki dengan partisi/index.

Why Advanced SQL Matters

SQL adalah bahasa universal data. Data engineers yang menguasai SQL advanced dapat:

Common Table Expressions (CTEs)

CTEs make complex queries readable and reusable:

-- Basic CTE WITH monthly_sales AS ( SELECT DATE_TRUNC('month', order_date) as month, SUM(amount) as total_sales FROM orders GROUP BY 1 ), growth AS ( SELECT month, total_sales, LAG(total_sales) OVER (ORDER BY month) as prev_month, (total_sales - LAG(total_sales) OVER (ORDER BY month)) / LAG(total_sales) OVER (ORDER BY month) * 100 as growth_pct FROM monthly_sales ) SELECT * FROM growth;

Window Functions

Window functions perform calculations across rows related to the current row:

Common Window Functions

Function Description Example
ROW_NUMBER() Unique sequential number Ranking without ties
RANK() Rank with gaps for ties 1, 2, 2, 4...
DENSE_RANK() Rank without gaps 1, 2, 2, 3...
LEAD()/LAG() Access next/previous row Period-over-period calc
SUM() OVER Running totals Cumulative sales
-- Top 3 products per category SELECT * FROM ( SELECT category, product_name, sales, ROW_NUMBER() OVER ( PARTITION BY category ORDER BY sales DESC ) as rank FROM product_sales ) ranked WHERE rank <= 3;

Query Optimization

Reading Execution Plans

📊 Key Metrics

Optimization Techniques

Technique When to Use
Indexes Frequently filtered/joined columns
Partitioning Large tables, time-series data
Column pruning SELECT only needed columns
Predicate pushdown Filter early, filter often
-- BAD: Function on indexed column SELECT * FROM orders WHERE YEAR(order_date) = 2024; -- Can't use index! -- GOOD: Range query SELECT * FROM orders WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'; -- Uses index

Recursive CTEs

For hierarchical data (org charts, bill of materials):

-- Employee hierarchy WITH RECURSIVE employee_hierarchy AS ( -- Base case: top-level managers SELECT id, name, manager_id, 1 as level FROM employees WHERE manager_id IS NULL UNION ALL -- Recursive case: employees with managers SELECT e.id, e.name, e.manager_id, h.level + 1 FROM employees e JOIN employee_hierarchy h ON e.manager_id = h.id ) SELECT * FROM employee_hierarchy;

Advanced Joins

Self-Join

-- Find employees and their managers SELECT e.name as employee, m.name as manager FROM employees e LEFT JOIN employees m ON e.manager_id = m.id;

Lateral Join

-- Top N per group efficiently SELECT c.category_name, top_products.* FROM categories c LEFT JOIN LATERAL ( SELECT product_name, sales FROM products p WHERE p.category_id = c.id ORDER BY sales DESC LIMIT 3 ) top_products ON true;

Decision Framework: SQL Optimization Strategy

Decision Point Pilih Opsi A Jika... Pilih Opsi B Jika...
Window functions vs Self-join Butuh ranking/running totals secara ekspresif Engine terbatas dan query plan self-join lebih efisien
CTE vs Temp table Transformasi logika bertahap dan readable Butuh materialisasi intermediate untuk performa
Denormalized mart vs Raw query on source Dashboard berulang dan SLA cepat Analisis ad-hoc sesekali dengan kebutuhan fleksibel

Failure Modes & Anti-Patterns

Anti-Patterns SQL

Production Readiness Checklist

Checklist SQL Production

  1. Query plan (`EXPLAIN`) ditinjau untuk model kritikal.
  2. Kolom yang dipilih explicit, bukan wildcard.
  3. Join keys dan grain table tervalidasi.
  4. Query punya tests untuk edge-case data.
  5. Cost/performance baseline didokumentasikan.
  6. SQL style guide dipatuhi lintas tim.

✏️ Exercise: Complex SQL Query

Tulis query untuk customer cohort analysis:

  1. Find first purchase month for each customer (cohort)
  2. Calculate retention by cohort over time
  3. Use window functions for period-over-period

🎯 Quick Quiz

1. Fungsi ROW_NUMBER() digunakan untuk?

A. Menjumlahkan nilai
B. Memberi nomor unik pada setiap row
C. Mengurutkan data
D. Menghapus duplicate

2. Kenapa hindari fungsi di WHERE clause?

A. Terlalu panjang
B. Tidak bisa pakai index
C. Error syntax
D. Lambat proses

3. Recursive CTE cocok untuk?

A. Simple aggregation
B. Hierarchical data
C. Date calculations
D. String manipulation

Kesimpulan

Advanced SQL adalah force multiplier untuk data engineers. Dengan window functions, CTEs, dan optimization techniques, kamu dapat solve complex problems efficiently di database level.

🎯 Key Takeaways

📚 References & Resources

Primary Sources

Official Documentation

Articles & Guides