Intermediate
Data Modeling & Transformation
Normalization, dbt, dimensional modeling, and query optimization
⏱️ 50 min read
📅 Updated Jan 2025
👤 By DataLearn Team
Beginner Reading Mode
Think of modeling as "arranging data so it is easy to use". Reading focus:
- Why table structure affects analysis
- When to choose a historical model vs an overwrite model
- How to transform in stages to keep costs down
Glossary: DE-GLOSSARY.md
Light Prerequisites
- Comfortable with basic SQL SELECT, JOIN, and GROUP BY
- Aware that BI workloads need stable tables
- Have seen dimension data change before (e.g., a customer's address)
Key Terms (3 Layers)
Term: SCD Type 2
Plain definition: Store the history of changes instead of overwriting the old version.
Technical definition: A historical dimension pattern that adds a new row per change, with effective start/end columns.
Practical example: When a customer's tier changes, the old record is closed and a new record is opened.
Term: Incremental Model
Plain definition: Process only new/changed data, not the full dataset.
Technical definition: A transformation strategy that processes deltas based on a watermark or key for efficiency.
Practical example: A daily model only computes transactions since the last run.
What is Data Modeling?
Data Modeling is the process of defining data structures to store and organize data efficiently. A good data model ensures data can be accessed, understood, and used effectively.
💡 Analogy
A data model is like a house blueprint. Before building, we need to know: how many rooms, where the kitchen goes, and how people will move through the space.
Database Normalization
Normalization is the process of organizing data to reduce redundancy and dependency.
Normal Forms (1NF → 3NF)
Normalization Example
Before (Unnormalized):

| order_id | customer_name | products                | total |
|----------|---------------|-------------------------|-------|
| 1        | John Doe      | Laptop, Mouse, Keyboard | 1500  |

After 3NF:

orders table:

| order_id | customer_id | order_date |
|----------|-------------|------------|
| 1        | 101         | 2024-01-15 |

customers table:

| customer_id | customer_name |
|-------------|---------------|
| 101         | John Doe      |
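The split above can be run end to end with Python's built-in sqlite3 module. This is a minimal sketch: the per-item prices are invented so they sum to the original 1500 total, and the multi-valued `products` cell becomes its own `order_items` table (one row per item), which 1NF requires.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 3NF layout: customers hold descriptive attributes, orders reference them
# by key, and the repeating "products" group gets one row per item.
cur.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                     order_date TEXT);
CREATE TABLE order_items (order_id INTEGER, product TEXT, price REAL);
""")

cur.execute("INSERT INTO customers VALUES (101, 'John Doe')")
cur.execute("INSERT INTO orders VALUES (1, 101, '2024-01-15')")
cur.executemany("INSERT INTO order_items VALUES (?, ?, ?)",
                [(1, 'Laptop', 1200), (1, 'Mouse', 100), (1, 'Keyboard', 200)])

# The original flat row can be reconstructed with joins when needed.
cur.execute("""
SELECT o.order_id, c.customer_name,
       group_concat(i.product, ', ') AS products,
       sum(i.price) AS total
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items i ON i.order_id = o.order_id
GROUP BY o.order_id, c.customer_name
""")
flat_row = cur.fetchone()
print(flat_row)  # (1, 'John Doe', 'Laptop, Mouse, Keyboard', 1500.0)
```

Note the trade-off the next section warns about: reads now require two joins, which is exactly why analytics stores often denormalize again.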
⚠️ When NOT to Normalize
- Analytics/Data Warehouse: Denormalized is often faster for reads
- Read-heavy workloads: Joins are expensive at scale
- Time-series data: Flat structure often preferred
Dimensional Modeling (Star Schema)
For data warehouses, dimensional modeling is the best practice.
⭐ Star Schema Structure
fact_orders
(measures: quantity, amount, discount)
↑ ↑ ↑ ↑ ↑
dim_date
dim_customer
dim_product
dim_store
Fact vs Dimension Tables

| Aspect      | Fact Table                    | Dimension Table              |
|-------------|-------------------------------|------------------------------|
| Content     | Measurements, metrics         | Descriptive attributes       |
| Granularity | One row per event/transaction | One row per entity           |
| Size        | Large (millions/billions)     | Smaller (thousands/millions) |
| Updates     | Append-only                   | Slowly changing (SCD)        |
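A star-schema query in miniature, sketched with sqlite3: measures come from the fact table, grouping attributes come from the dimensions. The column names (`category`, `month`) are illustrative, not a fixed standard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE fact_orders (date_key INTEGER, product_key INTEGER,
                          quantity INTEGER, amount REAL);
""")
cur.executemany("INSERT INTO dim_product VALUES (?, ?)",
                [(1, 'Electronics'), (2, 'Books')])
cur.executemany("INSERT INTO dim_date VALUES (?, ?)",
                [(20240101, '2024-01'), (20240201, '2024-02')])
cur.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)",
                [(20240101, 1, 2, 500.0), (20240101, 2, 1, 20.0),
                 (20240201, 1, 1, 250.0)])

# Typical BI question: revenue per category per month. Group measures
# from the fact table by attributes from the surrounding dimensions.
cur.execute("""
SELECT d.month, p.category, sum(f.amount) AS revenue
FROM fact_orders f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.month, p.category
ORDER BY d.month, p.category
""")
rows = cur.fetchall()
print(rows)
# [('2024-01', 'Books', 20.0), ('2024-01', 'Electronics', 500.0),
#  ('2024-02', 'Electronics', 250.0)]
```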
dbt (Data Build Tool)
dbt is a tool for transforming data using SQL. It turns data analysts into data engineers.
🎯 Why dbt?
- Version control for SQL transformations
- Automatic documentation generation
- Testing framework built-in
- Modular, reusable SQL with Jinja
- Runs in your data warehouse (no data movement)
dbt Project Structure
my_project/
├── models/
│ ├── staging/ # Raw → Clean (1:1 with sources)
│ │ ├── stg_orders.sql
│ │ └── stg_customers.sql
│ ├── intermediate/ # Business logic
│ │ ├── int_orders_enriched.sql
│ │ └── int_customer_metrics.sql
│ └── marts/ # Business-facing tables
│ ├── fct_orders.sql
│ └── dim_customers.sql
├── tests/
├── macros/
└── dbt_project.yml
Example dbt Model
WITH orders AS (
SELECT * FROM {{ ref('stg_orders') }}
),
customers AS (
SELECT * FROM {{ ref('stg_customers') }}
)
SELECT
o.order_id,
o.order_date,
c.customer_id,
c.customer_name,
o.amount,
o.status
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
dbt Tests
version: 2
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id
      - name: amount
        tests:
          - positive_value
Materialization Strategies
How data is stored after transformation:

| Materialization | Best For                            | Trade-off                |
|-----------------|-------------------------------------|--------------------------|
| View            | Light transformations, always fresh | Computed on every query  |
| Table           | Heavy transformations, slow sources | Full rebuild every run   |
| Incremental     | Large tables, append-only data      | More complex setup       |
| Ephemeral       | Intermediate CTEs                   | Not queryable directly   |
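The incremental strategy boils down to a watermark check: each run processes only source rows newer than the latest timestamp already in the target. A minimal sketch with sqlite3 (table and column names are illustrative; this is not dbt's actual machinery):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE stg_orders (order_id INTEGER, amount REAL, loaded_at TEXT);
CREATE TABLE fct_orders (order_id INTEGER, amount REAL, loaded_at TEXT);
""")

def run_incremental():
    # High-water mark: the newest timestamp already present in the target.
    cur.execute("SELECT coalesce(max(loaded_at), '1900-01-01') FROM fct_orders")
    watermark = cur.fetchone()[0]
    # Process only the delta past the watermark.
    cur.execute("INSERT INTO fct_orders SELECT * FROM stg_orders WHERE loaded_at > ?",
                (watermark,))
    return cur.rowcount  # rows processed this run

cur.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)",
                [(1, 10.0, '2024-01-01'), (2, 20.0, '2024-01-02')])
first_run = run_incremental()
print(first_run)  # 2 -- first run processes everything

cur.execute("INSERT INTO stg_orders VALUES (3, 30.0, '2024-01-03')")
second_run = run_incremental()
print(second_run)  # 1 -- later runs process only the new row
```

This is also why the table calls incremental "more complex setup": the watermark logic, plus handling for late-arriving rows, is extra code a full refresh never needs.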
Query Optimization
Indexing Strategies
📇 When to Index
- Columns frequently used in WHERE clauses
- JOIN columns
- Columns used in ORDER BY
Don't index: Low cardinality columns (gender, boolean), frequently updated columns
Query Performance Tips

Filter early so joins and scans touch less data:

SELECT *
FROM big_table
JOIN small_table ON ...
WHERE created_at >= '2024-01-01';

Select only the columns you need instead of SELECT *:

SELECT customer_id, name
FROM customers;

Keep predicates sargable -- wrapping a column in a function blocks index use:

-- ❌ Cannot use an index on created_at
WHERE YEAR(created_at) = 2024

-- ✅ Range predicate can use the index
WHERE created_at >= '2024-01-01'
  AND created_at < '2025-01-01'
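The sargability point can be made visible with SQLite's EXPLAIN QUERY PLAN: a function wrapped around the indexed column forces a full scan, while the bare range predicate hits the index. A small sketch (table and index names are illustrative; SQLite's plan wording varies slightly between versions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE big_table (id INTEGER, created_at TEXT)")
cur.execute("CREATE INDEX idx_created ON big_table(created_at)")

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the plan detail text.
    return " ".join(r[-1] for r in cur.execute("EXPLAIN QUERY PLAN " + sql))

# Function on the column: the planner cannot use idx_created -> full scan.
scan = plan("SELECT * FROM big_table WHERE strftime('%Y', created_at) = '2024'")

# Bare range on the column: the planner searches via the index.
search = plan("""SELECT * FROM big_table
                 WHERE created_at >= '2024-01-01'
                   AND created_at < '2025-01-01'""")

print(scan)    # mentions SCAN (no index)
print(search)  # mentions SEARCH ... idx_created
```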
Transformation Best Practices
🥉 Bronze (Raw)
- 1:1 with source
- Minimal cleaning
- Preserve history
🥈 Silver (Clean)
- Cleaned & typed
- Business logic applied
- Deduplicated
🥇 Gold (Mart)
- Business aggregates
- Star schema
- Consumption-ready
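The three layers above can be walked through with a toy pipeline in plain Python: raw rows preserved as-is (bronze), typed and deduplicated (silver), aggregated for consumption (gold). All data and field names here are invented for illustration.

```python
bronze = [  # 1:1 with source, including a duplicate and messy types
    {"order_id": "1", "amount": "100.0", "status": " paid "},
    {"order_id": "1", "amount": "100.0", "status": " paid "},
    {"order_id": "2", "amount": "50.5",  "status": "PAID"},
    {"order_id": "3", "amount": "20.0",  "status": "refunded"},
]

# Silver: cast types, trim/normalize strings, deduplicate on the business key.
silver = {}
for row in bronze:
    silver[int(row["order_id"])] = {
        "amount": float(row["amount"]),
        "status": row["status"].strip().lower(),
    }

# Gold: a business aggregate ready for a dashboard.
gold = {"paid_revenue": sum(r["amount"] for r in silver.values()
                            if r["status"] == "paid")}
print(gold)  # {'paid_revenue': 150.5}
```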
Decision Framework: Modeling & Transformation
| Decision Point | Choose Option A If... | Choose Option B If... |
|---|---|---|
| Normalized vs Dimensional | The focus is OLTP transactions and strong update integrity | The focus is analytics, aggregate queries, and high-performance BI |
| Star vs Snowflake Schema | Simplicity and query performance for analysts come first | Dimensions must be normalized for consistency of complex attributes |
| Full Refresh vs Incremental | The dataset is small and transform logic changes often | The dataset is large and fast-growing, with tight SLAs |
| SCD Type 1 vs Type 2 | There is no need to keep old values | An audit trail and point-in-time analysis are required |
Failure Modes & Anti-Patterns
Anti-Patterns in the Modeling Layer
- No clear grain: the fact table mixes granularities and aggregates come out wrong.
- Metric definition drift: KPI definitions differ across dashboards.
- Over-denormalization: queries are fast at first, but maintenance keeps getting more expensive.
- Transforms without tests: data bugs slip into production.
- Unused intermediate models: lineage gets complicated with no real value.
Production Readiness Checklist
Modeling Checklist before Production
- Every fact table has a clear grain statement.
- Every important metric has a single definition shared across teams.
- Primary key, not_null, and relationships tests are active.
- Incremental logic is tested against late-arriving data.
- The SCD strategy matches the business's historical needs.
- Model lineage is documented for impact analysis.
- Materialization is chosen based on cost vs latency.
- A safe backfill procedure exists for critical models.
✏️ Exercise: Design a Data Model
Design a data model for e-commerce analytics with these requirements:
- Track orders, customers, products, and payments
- Support: sales by category, customer lifetime value, monthly trends
- Product catalog changes over time (SCD Type 2)
Create:
- ER diagram (fact and dimension tables)
- dbt model structure
- Key metrics definitions
🎯 Quick Quiz
1. What is the primary goal of normalization?
A. Mempercepat query
B. Mengurangi redundancy dan dependency
C. Mempermudah backup
D. Menambah security
2. In a star schema, what does the fact table contain?
A. Descriptive attributes
B. Measurements and metrics
C. Metadata only
D. User credentials
3. The dbt 'incremental' materialization is best suited for?
A. Small tables that change completely every run
B. Large tables with append-only data
C. Intermediate CTEs
D. Tables with frequent updates to historical data
Advanced Dimensional Modeling (Kimball Method)
📚 Reference: The Data Warehouse Toolkit
Ralph Kimball's dimensional modeling approach is the industry standard for data warehouse design.
Slowly Changing Dimensions (SCD)
Dimensions change over time. Kimball defined strategies for handling these changes:
| Type | Description | Use Case |
|---|---|---|
| SCD Type 0 | Fixed dimension - never changes | Date of birth, original timestamp |
| SCD Type 1 | Overwrite old value | Typos, no history needed |
| SCD Type 2 | Add new row with date range | Track historical changes (most common) |
| SCD Type 3 | Add column for previous value | Track only one previous value |
| SCD Type 4 | History table + current table | Fast current lookup + full history |
| SCD Type 6 | Hybrid: Type 1 + Type 2 + Type 3 | Complex tracking requirements |
SCD Type 2 Implementation
-- dim_customer with SCD Type 2
customer_key (PK) | customer_id | name | city | valid_from | valid_to | is_current
------------------|-------------|---------|------------|------------|------------|----------
1001 | C001 | John | Jakarta | 2023-01-01 | 2023-06-15 | N
1002 | C001 | John | Surabaya | 2023-06-15 | 9999-12-31 | Y
-- Query for point-in-time analysis
SELECT * FROM dim_customer
WHERE customer_id = 'C001'
AND '2023-03-01' BETWEEN valid_from AND valid_to; -- Returns Jakarta record
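The point-in-time lookup above can be run directly against the two example rows using sqlite3: the query returns whichever version was valid on the requested date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY, customer_id TEXT, name TEXT,
    city TEXT, valid_from TEXT, valid_to TEXT, is_current TEXT)""")

# The two dim_customer rows from the example above.
cur.executemany("INSERT INTO dim_customer VALUES (?,?,?,?,?,?,?)", [
    (1001, 'C001', 'John', 'Jakarta',  '2023-01-01', '2023-06-15', 'N'),
    (1002, 'C001', 'John', 'Surabaya', '2023-06-15', '9999-12-31', 'Y'),
])

# "As of" 2023-03-01, the Jakarta version was in effect.
cur.execute("""
SELECT city FROM dim_customer
WHERE customer_id = 'C001'
  AND '2023-03-01' BETWEEN valid_from AND valid_to
""")
result = cur.fetchall()
print(result)  # [('Jakarta',)]
```

This works because ISO-8601 date strings sort lexicographically, so BETWEEN behaves like a date comparison even with TEXT columns.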
SCD Decision Playbook (Practical)

| Attribute Condition | Recommendation | Reason |
|---|---|---|
| Typo/data correction with no historical business value | Type 1 | Overwriting is enough; simpler and cheaper on storage |
| Attribute changes matter for historical analysis | Type 2 | Point-in-time analysis and an audit trail are needed |
| Only the current value plus one previous value is needed | Type 3 | Lightweight, but history is limited |
| Current-value lookups are very frequent, yet history is still mandatory | Type 4 | Separating current/history keeps operational queries fast |
| Mixed needs (overwrite + history + previous value) | Type 6 | Flexible, but the most complex to maintain |
Design Contract for SCD Type 2
- Business key: the natural key from the source (e.g., customer_id).
- Surrogate key: a technical key per row version (e.g., customer_key).
- Effective window: valid_from and valid_to.
- Current flag: is_current for fast access to the active version.
- Change hash: a hash of the tracked attributes so change detection stays cheap.
- Audit columns: ingested_at, updated_by_job, source_system.
The Most Common SCD Type 2 Mistakes
- Not closing the old row when a value changes (two active rows for one business key).
- Using the ingestion timestamp as valid_from when event/business time is needed.
- Not defining which attributes are history-tracked (every column gets the Type 2 treatment).
- Having no strategy for late-arriving updates and backfills.
SQL Pattern: Change Detection + SCD Type 2 Upsert

-- Step 1: detect new or changed rows and stage them as changed_rows.
CREATE TEMP TABLE changed_rows AS
WITH staged AS (
    SELECT
        customer_id,
        customer_name,
        city,
        md5(concat_ws('|', customer_name, city)) AS change_hash,
        business_effective_at
    FROM stg_customers
),
current_dim AS (
    SELECT *
    FROM dim_customer
    WHERE is_current = TRUE
)
SELECT s.*
FROM staged s
LEFT JOIN current_dim d
    ON s.customer_id = d.customer_id
WHERE d.customer_id IS NULL
   OR s.change_hash <> d.change_hash;

-- Step 2: close the currently active version for each changed key.
UPDATE dim_customer d
SET valid_to = s.business_effective_at - INTERVAL '1 second',
    is_current = FALSE
FROM changed_rows s
WHERE d.customer_id = s.customer_id
  AND d.is_current = TRUE;

-- Step 3: insert the new versions as the current rows.
INSERT INTO dim_customer (
    customer_key, customer_id, customer_name, city,
    change_hash, valid_from, valid_to, is_current
)
SELECT
    nextval('dim_customer_key_seq'),
    customer_id, customer_name, city,
    change_hash, business_effective_at, '9999-12-31', TRUE
FROM changed_rows;
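The same three steps can be exercised end to end in Python against SQLite. This is a sketch under stated substitutions: `md5()` becomes Python's hashlib, the key sequence becomes AUTOINCREMENT, and `valid_to` is closed at the new row's effective time rather than one second before it.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id TEXT, customer_name TEXT, city TEXT,
    change_hash TEXT, valid_from TEXT, valid_to TEXT, is_current INTEGER)
""")

def upsert(customer_id, name, city, effective_at):
    # Step 1: hash the tracked attributes and compare to the current version.
    change_hash = hashlib.md5(f"{name}|{city}".encode()).hexdigest()
    cur.execute("""SELECT change_hash FROM dim_customer
                   WHERE customer_id = ? AND is_current = 1""", (customer_id,))
    row = cur.fetchone()
    if row and row[0] == change_hash:
        return  # no attribute change -> nothing to do
    # Step 2: close the old version (if any).
    cur.execute("""UPDATE dim_customer SET valid_to = ?, is_current = 0
                   WHERE customer_id = ? AND is_current = 1""",
                (effective_at, customer_id))
    # Step 3: open the new version as current.
    cur.execute("""INSERT INTO dim_customer
                   (customer_id, customer_name, city, change_hash,
                    valid_from, valid_to, is_current)
                   VALUES (?, ?, ?, ?, ?, '9999-12-31', 1)""",
                (customer_id, name, city, change_hash, effective_at))

upsert('C001', 'John', 'Jakarta', '2023-01-01')
upsert('C001', 'John', 'Jakarta', '2023-02-01')   # same hash: ignored
upsert('C001', 'John', 'Surabaya', '2023-06-15')  # city changed: new version

cur.execute("""SELECT city, valid_to, is_current FROM dim_customer
               ORDER BY customer_key""")
versions = cur.fetchall()
print(versions)
# [('Jakarta', '2023-06-15', 0), ('Surabaya', '9999-12-31', 1)]
```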
Production Checks for SCD

-- Check 1: business keys with more than one current row.
SELECT customer_id
FROM dim_customer
WHERE is_current = TRUE
GROUP BY customer_id
HAVING count(*) > 1;

-- Check 2: overlapping validity windows for the same business key.
SELECT a.customer_id
FROM dim_customer a
JOIN dim_customer b
    ON a.customer_id = b.customer_id
   AND a.customer_key <> b.customer_key
   AND a.valid_from <= b.valid_to
   AND b.valid_from <= a.valid_to;
SCD Type 2 + dbt Incremental Pattern
If you use dbt, a practical approach is incremental + merge: close the old row and add a new version whenever a tracked attribute changes. Note that a merge keyed on the surrogate key alone will only insert the new versions; flipping the old row's is_current to false still needs a separate step (for example, a post-hook update).
{{ config(
materialized='incremental',
incremental_strategy='merge',
unique_key='customer_sk'
) }}
WITH source_data AS (
SELECT
customer_id,
customer_name,
city,
business_effective_at,
md5(concat_ws('|', customer_name, city)) AS change_hash
FROM {{ ref('stg_customers') }}
{% if is_incremental() %}
WHERE business_effective_at >= (
SELECT coalesce(max(valid_from), '1900-01-01') FROM {{ this }}
)
{% endif %}
),
current_dim AS (
SELECT *
FROM {{ this }}
WHERE is_current = true
),
changed_rows AS (
SELECT s.*
FROM source_data s
LEFT JOIN current_dim d ON s.customer_id = d.customer_id
WHERE d.customer_id IS NULL OR s.change_hash <> d.change_hash
)
SELECT
{{ dbt_utils.generate_surrogate_key(['customer_id', 'business_effective_at']) }} AS customer_sk,
customer_id,
customer_name,
city,
change_hash,
business_effective_at AS valid_from,
'9999-12-31' AS valid_to,
true AS is_current
FROM changed_rows;
dbt Implementation Checklist for SCD
- Use business_effective_at, not just the ingestion time.
- Separate Type 1 and Type 2 attributes from the start of the model design.
- Add tests: unique, not_null, and a custom "single current row" test.
- Prepare a dedicated backfill job for late-arriving changes.
Exercise: Debug an SCD Incident (Double Current + Overlap)
Scenario: In dim_customer, the BI team reports that the same customer appears twice as current, and point-in-time reports keep changing.
Problematic sample data:
customer_key | customer_id | city      | valid_from           | valid_to             | is_current
-------------|-------------|-----------|----------------------|----------------------|-----------
2001         | C100        | Jakarta   | 2024-01-01 00:00:00  | 9999-12-31 00:00:00  | true
2002         | C100        | Bandung   | 2024-03-01 00:00:00  | 9999-12-31 00:00:00  | true -- BUG
2003         | C101        | Surabaya  | 2024-01-01 00:00:00  | 2024-04-01 00:00:00  | false
2004         | C101        | Malang    | 2024-03-20 00:00:00  | 9999-12-31 00:00:00  | true -- OVERLAP
Tasks:
- Write a query to detect business keys with >1 current row.
- Write a query to detect overlapping valid_from/valid_to windows.
- Fix the data: close the old rows so exactly one current row remains per customer_id.
- Add a dbt test so this bug can no longer slip through.
Minimum expected output:
- Each customer_id has exactly one row with is_current = true.
- No overlapping validity periods for the same key.
- A point-in-time query for any given date always returns exactly one row.
Fact Table Types
📊 Transaction Fact
Grain: One row per transaction
Example: Order line items
Most common, atomic level data
📅 Periodic Snapshot
Grain: One row per period
Example: Daily inventory balance
State of world at intervals
📈 Accumulating Snapshot
Grain: One row per process
Example: Order fulfillment pipeline
Track milestones in a process
Dimension Design Patterns
1. Junk Dimensions
Combine low-cardinality flags/indicators into one dimension to reduce clutter:
dim_order_flags:
- flag_key (PK)
- is_promotion
- is_free_shipping
- is_gift_wrap
- flag_description
2. Degenerate Dimensions
Transaction identifiers kept directly on the fact table without a dimension of their own (e.g., an order number or invoice number).
3. Role-Playing Dimensions
Same dimension used multiple ways in one fact table:
fact_orders:
- order_date_key (→ dim_date)
- ship_date_key (→ dim_date)
- delivery_date_key (→ dim_date)
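The role-playing pattern above translates into joining the same dimension table multiple times under different aliases, one per date key. A minimal sketch with sqlite3 (column names follow the list above; `full_date` is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT);
CREATE TABLE fact_orders (order_id INTEGER, order_date_key INTEGER,
                          ship_date_key INTEGER, delivery_date_key INTEGER);
""")
cur.executemany("INSERT INTO dim_date VALUES (?, ?)",
                [(1, '2024-01-01'), (2, '2024-01-03'), (3, '2024-01-05')])
cur.execute("INSERT INTO fact_orders VALUES (500, 1, 2, 3)")

# One physical dim_date, three logical roles via aliases.
cur.execute("""
SELECT o.full_date AS ordered, s.full_date AS shipped, d.full_date AS delivered
FROM fact_orders f
JOIN dim_date o ON f.order_date_key = o.date_key      -- role: order date
JOIN dim_date s ON f.ship_date_key = s.date_key       -- role: ship date
JOIN dim_date d ON f.delivery_date_key = d.date_key   -- role: delivery date
""")
row = cur.fetchone()
print(row)  # ('2024-01-01', '2024-01-03', '2024-01-05')
```

The design choice: one shared dim_date keeps calendar attributes consistent across every role, instead of maintaining three near-identical date tables.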
4. Conformed Dimensions
Dimensions that are shared across multiple fact tables (enterprise-wide consistency):
- Same dim_customer used in sales, support, marketing facts
- Same dim_product used across all business units
- Critical for drilling across different processes
Bus Architecture
Kimball's approach to the enterprise data warehouse: integrate business processes via conformed dimensions.
🚌 Data Warehouse Bus Matrix

| Business Process | Date | Customer | Product | Store | Promotion |
|------------------|------|----------|---------|-------|-----------|
| Sales            | ✓    | ✓        | ✓       | ✓     | ✓         |
| Inventory        | ✓    |          | ✓       | ✓     |           |
| Returns          | ✓    | ✓        | ✓       | ✓     |           |
Conclusion
Data modeling is the foundation of analytics. Choose normalization for OLTP and dimensional modeling for OLAP. Use dbt to manage transformations with best practices: version control, testing, and documentation.
🎯 Key Takeaways
- Normalize for OLTP, denormalize for analytics
- Star schema: facts in center, dimensions around
- dbt brings software engineering to SQL
- Materialize based on query patterns and data size
- Always add tests for data quality
📚 References & Resources
Primary Sources
- The Data Warehouse Toolkit - Ralph Kimball & Margy Ross (Wiley, 2013)
Chapters 1-6: Dimensional Modeling, SCD Types, Fact Tables, Bus Architecture
- Fundamentals of Data Engineering - Joe Reis & Matt Housley (O'Reilly, 2022)
Chapters 10-11: Data Transformation, dbt and Analytics Engineering
- Building a Scalable Data Warehouse with Data Vault 2.0 - Daniel Linstedt et al. (Morgan Kaufmann, 2016)
Chapter 3: Normalization and Data Warehouse Modeling Approaches