Intermediate

Data Modeling & Transformation

Normalization, dbt, dimensional modeling, and query optimization

⏱️ 50 min read 📅 Updated Jan 2025 👤 By DataLearn Team

Mode Baca Pemula

Anggap modeling sebagai "menata data agar mudah dipakai". Fokus baca:

  1. Kenapa struktur tabel memengaruhi analisis
  2. Kapan pilih model historis vs model overwrite
  3. Cara transformasi bertahap agar hemat biaya

Kamus istilah: DE-GLOSSARY.md

Prasyarat Ringan

Istilah Penting (3 Lapis)

Istilah: SCD Type 2

Definisi awam: Menyimpan riwayat perubahan data, bukan menimpa versi lama.

Definisi teknis: Pola dimensi historis dengan row baru per perubahan serta kolom efektif mulai/akhir.

Contoh praktis: Saat tier customer berubah, record lama ditutup dan record baru dibuka.

Istilah: Incremental Model

Definisi awam: Proses hanya data baru/berubah, bukan ulang semua data.

Definisi teknis: Strategi transformasi yang memproses delta berdasarkan watermark atau key untuk efisiensi.

Contoh praktis: Model harian hanya menghitung transaksi sejak run terakhir.

Apa itu Data Modeling?

Data Modeling adalah proses mendefinisikan struktur data untuk menyimpan dan mengorganisir data secara efisien. Good data model memastikan data dapat diakses, dipahami, dan digunakan dengan baik.

💡 Analogy

Data model seperti blueprint rumah. Sebelum membangun, kita perlu tahu: berapa kamar, dimana kitchen, bagaimana alur pergerakan.

Database Normalization

Normalization adalah proses mengorganisir data untuk mengurangi redundancy dan dependency.

Normal Forms (1NF → 3NF)

1NF

Atomic values
No repeating groups

2NF

1NF +
No partial dependency

3NF

2NF +
No transitive dependency

Normalization Example

Before (Unnormalized):

order_id customer_name products total
1 John Doe Laptop, Mouse, Keyboard 1500

After 3NF:

orders table
order_id customer_id order_date
1 101 2024-01-15
customers table
customer_id customer_name
101 John Doe

⚠️ When NOT to Normalize

Dimensional Modeling (Star Schema)

Untuk data warehouses, dimensional modeling adalah best practice.

⭐ Star Schema Structure

fact_orders

(measures: quantity, amount, discount)

↑ ↑ ↑ ↑ ↑

dim_date dim_customer dim_product dim_store

Fact vs Dimension Tables

Aspect Fact Table Dimension Table
Content Measurements, metrics Descriptive attributes
Granularity One row per event/transaction One row per entity
Size Large (millions/billions) Smaller (thousands/millions)
Updates Append-only Slowly changing (SCD)

dbt (Data Build Tool)

dbt adalah tool untuk transformasi data menggunakan SQL. Mengubah data analysts menjadi data engineers.

🎯 Why dbt?

dbt Project Structure

my_project/ ├── models/ │ ├── staging/ # Raw → Clean (1:1 with sources) │ │ ├── stg_orders.sql │ │ └── stg_customers.sql │ ├── intermediate/ # Business logic │ │ ├── int_orders_enriched.sql │ │ └── int_customer_metrics.sql │ └── marts/ # Business-facing tables │ ├── fct_orders.sql │ └── dim_customers.sql ├── tests/ ├── macros/ └── dbt_project.yml

Example dbt Model

-- models/marts/fct_orders.sql WITH orders AS ( SELECT * FROM {{ ref('stg_orders') }} ), customers AS ( SELECT * FROM {{ ref('stg_customers') }} ) SELECT o.order_id, o.order_date, c.customer_id, c.customer_name, o.amount, o.status FROM orders o LEFT JOIN customers c ON o.customer_id = c.customer_id

dbt Tests

-- schema.yml version: 2 models: - name: stg_orders columns: - name: order_id tests: - unique - not_null - name: customer_id tests: - not_null - relationships: to: ref('stg_customers') field: customer_id - name: amount tests: - positive_value

Materialization Strategies

Cara data disimpan setelah transformation:

Materialization Best For Trade-off
View Light transformations, always fresh Computed on every query
Table Heavy transformations, slow sources Full rebuild every run
Incremental Large tables, append-only data More complex setup
Ephemeral Intermediate CTEs Not queryable directly

Query Optimization

Indexing Strategies

📇 When to Index

Don't index: Low cardinality columns (gender, boolean), frequently updated columns

Query Performance Tips

-- ✅ DO: Filter early SELECT * FROM big_table WHERE created_at >= '2024-01-01' -- Filter before join JOIN small_table ON ...; -- ✅ DO: Select only needed columns SELECT customer_id, name -- Not SELECT * FROM customers; -- ✅ DO: Use appropriate data types -- TIMESTAMP not VARCHAR for dates -- INTEGER not VARCHAR for IDs -- ❌ DON'T: Use functions on indexed columns -- This prevents index usage: WHERE YEAR(created_at) = 2024 -- ✅ DO: Use range instead WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01'

Transformation Best Practices

🥉 Bronze (Raw)

1:1 with source
Minimal cleaning
Preserve history

🥈 Silver (Clean)

Cleaned & typed
Business logic
Deduplicated

🥇 Gold (Mart)

Business aggregates
Star schema
Consumption-ready

Decision Framework: Modeling & Transformation

Decision Point Pilih Opsi A Jika... Pilih Opsi B Jika...
Normalized vs Dimensional Fokus transaksi OLTP dan integritas update tinggi Fokus analytics, query agregasi, dan BI performa tinggi
Star vs Snowflake Schema Prioritas simplicity dan performa query untuk analyst Perlu normalisasi dimensi untuk konsistensi atribut kompleks
Full Refresh vs Incremental Dataset kecil dan logic transform sering berubah Dataset besar, growth tinggi, dan SLA ketat
SCD Type 1 vs Type 2 Tidak butuh histori nilai lama Butuh audit trail dan point-in-time analysis

Failure Modes & Anti-Patterns

Anti-Patterns di Layer Modeling

Production Readiness Checklist

Checklist Modeling sebelum Production

  1. Setiap fact table punya grain statement yang jelas.
  2. Setiap metric penting punya definisi baku lintas tim.
  3. Primary key, not_null, relationships tests aktif.
  4. Incremental logic diuji untuk late-arriving data.
  5. SCD strategy sesuai kebutuhan historis bisnis.
  6. Lineage model terdokumentasi untuk impact analysis.
  7. Materialization dipilih berdasarkan cost vs latency.
  8. Backfill procedure aman tersedia untuk model kritikal.

✏️ Exercise: Design a Data Model

Desain data model untuk e-commerce analytics dengan requirements:

Create:

  1. ER diagram (fact and dimension tables)
  2. dbt model structure
  3. Key metrics definitions

🎯 Quick Quiz

1. Tujuan utama normalization adalah?

A. Mempercepat query
B. Mengurangi redundancy dan dependency
C. Mempermudah backup
D. Menambah security

2. Dalam star schema, fact table berisi?

A. Descriptive attributes
B. Measurements and metrics
C. Metadata only
D. User credentials

3. dbt materialization 'incremental' cocok untuk?

A. Small tables that change completely every run
B. Large tables with append-only data
C. Intermediate CTEs
D. Tables with frequent updates to historical data

Advanced Dimensional Modeling (Kimball Method)

📚 Reference: The Data Warehouse Toolkit

Ralph Kimball's dimensional modeling approach is the industry standard for data warehouse design.

Slowly Changing Dimensions (SCD)

Dimensions change over time. Kimball defined strategies for handling these changes:

Type Description Use Case
SCD Type 0 Fixed dimension - never changes Date of birth, original timestamp
SCD Type 1 Overwrite old value Typos, no history needed
SCD Type 2 Add new row with date range Track historical changes (most common)
SCD Type 3 Add column for previous value Track only one previous value
SCD Type 4 History table + current table Fast current lookup + full history
SCD Type 6 Hybrid: Type 1 + Type 2 + Type 3 Complex tracking requirements

SCD Type 2 Implementation

-- dim_customer with SCD Type 2
customer_key (PK) | customer_id | name    | city       | valid_from | valid_to   | is_current
------------------|-------------|---------|------------|------------|------------|----------
1001              | C001        | John    | Jakarta    | 2023-01-01 | 2023-06-15 | N
1002              | C001        | John    | Surabaya   | 2023-06-15 | 9999-12-31 | Y

-- Query for point-in-time analysis
SELECT * FROM dim_customer
WHERE customer_id = 'C001'
  AND '2023-03-01' BETWEEN valid_from AND valid_to;  -- Returns Jakarta record

SCD Decision Playbook (Praktis)

Kondisi Atribut Rekomendasi Alasan
Typo/koreksi data tanpa nilai historis bisnis Type 1 Overwrite cukup, lebih sederhana dan hemat storage
Perubahan atribut penting untuk analisis historis Type 2 Butuh point-in-time analysis dan audit trail
Hanya perlu nilai sekarang + 1 nilai sebelumnya Type 3 Ringan, tapi histori terbatas
Lookup current sangat sering, histori tetap wajib Type 4 Pisah current/history untuk performa query operasional
Kebutuhan campuran (overwrite + histori + nilai lama) Type 6 Fleksibel, tapi maintenance paling kompleks

Desain Kontrak untuk SCD Type 2

Kesalahan SCD Type 2 yang paling sering

SQL Pattern: Change Detection + SCD Type 2 Upsert

-- Step 1: Cari perubahan atribut yang di-track WITH staged AS ( SELECT customer_id, customer_name, city, md5(concat_ws('|', customer_name, city)) AS change_hash, business_effective_at FROM stg_customers ), current_dim AS ( SELECT * FROM dim_customer WHERE is_current = TRUE ) SELECT s.* FROM staged s LEFT JOIN current_dim d ON s.customer_id = d.customer_id WHERE d.customer_id IS NULL OR s.change_hash <> d.change_hash; -- Step 2: Tutup row lama yang berubah UPDATE dim_customer d SET valid_to = s.business_effective_at - INTERVAL '1 second', is_current = FALSE FROM changed_rows s WHERE d.customer_id = s.customer_id AND d.is_current = TRUE; -- Step 3: Insert versi baru INSERT INTO dim_customer ( customer_key, customer_id, customer_name, city, change_hash, valid_from, valid_to, is_current ) SELECT nextval('dim_customer_key_seq'), customer_id, customer_name, city, change_hash, business_effective_at, '9999-12-31', TRUE FROM changed_rows;

Production Checks untuk SCD

-- Check 1: tidak boleh ada >1 row current per business key SELECT customer_id FROM dim_customer WHERE is_current = TRUE GROUP BY customer_id HAVING count(*) > 1; -- Check 2: periode valid tidak boleh overlap SELECT a.customer_id FROM dim_customer a JOIN dim_customer b ON a.customer_id = b.customer_id AND a.customer_key <> b.customer_key AND a.valid_from <= b.valid_to AND b.valid_from <= a.valid_to;

SCD Type 2 + dbt Incremental Pattern

Jika kamu memakai dbt, pendekatan praktis adalah: incremental + merge untuk menutup row lama dan menambah versi baru saat ada perubahan atribut.

-- models/marts/dim_customer_scd2.sql {{ config( materialized='incremental', incremental_strategy='merge', unique_key='customer_sk' ) }} WITH source_data AS ( SELECT customer_id, customer_name, city, business_effective_at, md5(concat_ws('|', customer_name, city)) AS change_hash FROM {{ ref('stg_customers') }} {% if is_incremental() %} WHERE business_effective_at >= ( SELECT coalesce(max(valid_from), '1900-01-01') FROM {{ this }} ) {% endif %} ), current_dim AS ( SELECT * FROM {{ this }} WHERE is_current = true ), changed_rows AS ( SELECT s.* FROM source_data s LEFT JOIN current_dim d ON s.customer_id = d.customer_id WHERE d.customer_id IS NULL OR s.change_hash <> d.change_hash ) SELECT {{ dbt_utils.generate_surrogate_key(['customer_id', 'business_effective_at']) }} AS customer_sk, customer_id, customer_name, city, change_hash, business_effective_at AS valid_from, '9999-12-31' AS valid_to, true AS is_current FROM changed_rows;

Checklist implementasi dbt untuk SCD

Exercise: Debug SCD Incident (Double Current + Overlap)

Skenario: Pada dim_customer, tim BI melaporkan customer yang sama muncul dua kali sebagai current, dan laporan point-in-time berubah-ubah.

Sample data bermasalah:

customer_key | customer_id | city | valid_from | valid_to | is_current -------------|-------------|-----------|----------------------|----------------------|----------- 2001 | C100 | Jakarta | 2024-01-01 00:00:00 | 9999-12-31 00:00:00 | true 2002 | C100 | Bandung | 2024-03-01 00:00:00 | 9999-12-31 00:00:00 | true -- BUG 2003 | C101 | Surabaya | 2024-01-01 00:00:00 | 2024-04-01 00:00:00 | false 2004 | C101 | Malang | 2024-03-20 00:00:00 | 9999-12-31 00:00:00 | true -- OVERLAP

Tugas:

  1. Tulis query untuk mendeteksi business key dengan >1 current row.
  2. Tulis query untuk mendeteksi overlap window valid_from/valid_to.
  3. Perbaiki data: tutup row lama dan sisakan tepat satu row current per customer_id.
  4. Tambahkan dbt test agar bug ini tidak lolos lagi.

Expected output minimal:

Fact Table Types

📊 Transaction Fact

Grain: One row per transaction

Example: Order line items

Most common, atomic level data

📅 Periodic Snapshot

Grain: One row per period

Example: Daily inventory balance

State of world at intervals

📈 Accumulating Snapshot

Grain: One row per process

Example: Order fulfillment pipeline

Track milestones in a process

Dimension Design Patterns

1. Junk Dimensions

Combine low-cardinality flags/indicators into one dimension to reduce clutter:

-- Instead of having 5 separate flag columns in fact table: -- is_promotion, is_free_shipping, is_gift_wrap, ... -- Create a junk dimension: dim_order_flags: - flag_key (PK) - is_promotion - is_free_shipping - is_gift_wrap - flag_description -- 'Promo + Free Ship', 'Standard', etc.

2. Degenerate Dimensions

Transaction identifiers that don't need their own dimension:

-- order_number, invoice_number - stored directly in fact table -- No separate dimension needed (no attributes to describe)

3. Role-Playing Dimensions

Same dimension used multiple ways in one fact table:

fact_orders: - order_date_key (→ dim_date) - ship_date_key (→ dim_date) - delivery_date_key (→ dim_date) -- Single dim_date table, referenced multiple times with different roles

4. Conformed Dimensions

Dimensions that are shared across multiple fact tables (enterprise-wide consistency):

Bus Architecture

Kimball's approach to enterprise data warehouse - integrate via conformed dimensions:

🚌 Data Warehouse Bus Matrix

Business Process Date Customer Product Store Promotion
Sales
Inventory
Returns

Kesimpulan

Data modeling adalah foundation dari analytics. Pilih normalization untuk OLTP, dimensional modeling untuk OLAP. Gunakan dbt untuk mengelola transformasi dengan best practices: version control, testing, dan documentation.

🎯 Key Takeaways

📚 References & Resources

Primary Sources

Official Documentation

Articles & Websites