Intermediate

Data Modeling & Transformation

Normalization, dbt, dimensional modeling, and query optimization

⏱️ 50 min read 📅 Updated Jan 2025 👤 By DataLearn Team

Mode Baca Pemula

Anggap modeling sebagai "menata data agar mudah dipakai". Fokus baca:

Kenapa struktur tabel memengaruhi analisis
Kapan pilih model historis vs model overwrite
Cara transformasi bertahap agar hemat biaya

Kamus istilah: DE-GLOSSARY.md

Prasyarat Ringan

Paham SQL SELECT, JOIN, dan GROUP BY dasar
Tahu bahwa kebutuhan BI butuh tabel yang stabil
Pernah lihat perubahan data dimensi (mis. alamat customer)

Istilah Penting (3 Lapis)

Istilah: SCD Type 2

Definisi awam: Menyimpan riwayat perubahan data, bukan menimpa versi lama.

Definisi teknis: Pola dimensi historis dengan row baru per perubahan serta kolom efektif mulai/akhir.

Contoh praktis: Saat tier customer berubah, record lama ditutup dan record baru dibuka.

Istilah: Incremental Model

Definisi awam: Proses hanya data baru/berubah, bukan ulang semua data.

Definisi teknis: Strategi transformasi yang memproses delta berdasarkan watermark atau key untuk efisiensi.

Contoh praktis: Model harian hanya menghitung transaksi sejak run terakhir.

Apa itu Data Modeling?

Data Modeling adalah proses mendefinisikan struktur data untuk menyimpan dan mengorganisir data secara efisien. Good data model memastikan data dapat diakses, dipahami, dan digunakan dengan baik.

💡 Analogy

Data model seperti blueprint rumah. Sebelum membangun, kita perlu tahu: berapa kamar, dimana kitchen, bagaimana alur pergerakan.

Database Normalization

Normalization adalah proses mengorganisir data untuk mengurangi redundancy dan dependency.

Normal Forms (1NF → 3NF)

1NF

Atomic values
No repeating groups

→

2NF

1NF +
No partial dependency

→

3NF

2NF +
No transitive dependency

Normalization Example

Before (Unnormalized):

order_id	customer_name	products	total
1	John Doe	Laptop, Mouse, Keyboard	1500

After 3NF:

orders table
order_id	customer_id	order_date
1	101	2024-01-15

customers table
customer_id	customer_name
101	John Doe

            ⚠️ When NOT to Normalize
            Analytics/Data Warehouse: Denormalized is often faster for reads
Read-heavy workloads: Joins are expensive at scale
Time-series data: Flat structure often preferred

        

Dimensional Modeling (Star Schema)

Untuk data warehouses, dimensional modeling adalah best practice.

⭐ Star Schema Structure

fact_orders
(measures: quantity, amount, discount)
↑ ↑ ↑ ↑ ↑

                    dim_date
                    dim_customer
                    dim_product
                    dim_store
                

Fact vs Dimension Tables

Aspect	Fact Table	Dimension Table
Content	Measurements, metrics	Descriptive attributes
Granularity	One row per event/transaction	One row per entity
Size	Large (millions/billions)	Smaller (thousands/millions)
Updates	Append-only	Slowly changing (SCD)

dbt (Data Build Tool)

dbt adalah tool untuk transformasi data menggunakan SQL. Mengubah data analysts menjadi data engineers.

            🎯 Why dbt?
            Version control for SQL transformations
Automatic documentation generation
Testing framework built-in
Modular, reusable SQL with Jinja
Runs in your data warehouse (no data movement)

        

dbt Project Structure

my_project/
├── models/
│   ├── staging/           # Raw → Clean (1:1 with sources)
│   │   ├── stg_orders.sql
│   │   └── stg_customers.sql
│   ├── intermediate/      # Business logic
│   │   ├── int_orders_enriched.sql
│   │   └── int_customer_metrics.sql
│   └── marts/             # Business-facing tables
│       ├── fct_orders.sql
│       └── dim_customers.sql
├── tests/
├── macros/
└── dbt_project.yml
        

Example dbt Model

-- models/marts/fct_orders.sql
WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),
customers AS (
    SELECT * FROM {{ ref('stg_customers') }}
)

SELECT
    o.order_id,
    o.order_date,
    c.customer_id,
    c.customer_name,
    o.amount,
    o.status
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
        

dbt Tests

-- schema.yml
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id
      - name: amount
        tests:
          - positive_value
        

Materialization Strategies

Cara data disimpan setelah transformation:

Materialization	Best For	Trade-off
View	Light transformations, always fresh	Computed on every query
Table	Heavy transformations, slow sources	Full rebuild every run
Incremental	Large tables, append-only data	More complex setup
Ephemeral	Intermediate CTEs	Not queryable directly

Query Optimization

Indexing Strategies

📇 When to Index

Columns frequently used in WHERE clauses
JOIN columns
Columns used in ORDER BY

Don't index: Low cardinality columns (gender, boolean), frequently updated columns

Query Performance Tips

-- ✅ DO: Filter early
SELECT *
FROM big_table
WHERE created_at >= '2024-01-01'  -- Filter before join
JOIN small_table ON ...;

-- ✅ DO: Select only needed columns
SELECT customer_id, name  -- Not SELECT *
FROM customers;

-- ✅ DO: Use appropriate data types
-- TIMESTAMP not VARCHAR for dates
-- INTEGER not VARCHAR for IDs

-- ❌ DON'T: Use functions on indexed columns
-- This prevents index usage:
WHERE YEAR(created_at) = 2024

-- ✅ DO: Use range instead
WHERE created_at >= '2024-01-01'
  AND created_at < '2025-01-01'
        

Transformation Best Practices

🥉 Bronze (Raw)

1:1 with source
Minimal cleaning
Preserve history

🥈 Silver (Clean)

Cleaned & typed
Business logic
Deduplicated

🥇 Gold (Mart)

Business aggregates
Star schema
Consumption-ready

Decision Framework: Modeling & Transformation

Decision Point	Pilih Opsi A Jika...	Pilih Opsi B Jika...
Normalized vs Dimensional	Fokus transaksi OLTP dan integritas update tinggi	Fokus analytics, query agregasi, dan BI performa tinggi
Star vs Snowflake Schema	Prioritas simplicity dan performa query untuk analyst	Perlu normalisasi dimensi untuk konsistensi atribut kompleks
Full Refresh vs Incremental	Dataset kecil dan logic transform sering berubah	Dataset besar, growth tinggi, dan SLA ketat
SCD Type 1 vs Type 2	Tidak butuh histori nilai lama	Butuh audit trail dan point-in-time analysis

Failure Modes & Anti-Patterns

            Anti-Patterns di Layer Modeling
            No clear grain: fact table campur granularitas dan hasil agregasi jadi salah.
Metric definition drift: definisi KPI beda antar dashboard.
Over-denormalization: query cepat awal, tapi maintenance makin mahal.
Transform tanpa tests: bug data lolos ke production.
Unused intermediate models: lineage rumit tanpa value nyata.

        

Production Readiness Checklist

            Checklist Modeling sebelum Production
            Setiap fact table punya grain statement yang jelas.
Setiap metric penting punya definisi baku lintas tim.
Primary key, not_null, relationships tests aktif.
Incremental logic diuji untuk late-arriving data.
SCD strategy sesuai kebutuhan historis bisnis.
Lineage model terdokumentasi untuk impact analysis.
Materialization dipilih berdasarkan cost vs latency.
Backfill procedure aman tersedia untuk model kritikal.

        

✏️ Exercise: Design a Data Model

Desain data model untuk e-commerce analytics dengan requirements:

Track orders, customers, products, and payments
Support: Sales by category, Customer lifetime value, Monthly trends
Product catalog changes over time (SCD Type 2)

Create:

ER diagram (fact and dimension tables)
dbt model structure
Key metrics definitions

🎯 Quick Quiz

1. Tujuan utama normalization adalah?

A. Mempercepat query

B. Mengurangi redundancy dan dependency

C. Mempermudah backup

D. Menambah security

2. Dalam star schema, fact table berisi?

A. Descriptive attributes

B. Measurements and metrics

C. Metadata only

D. User credentials

3. dbt materialization 'incremental' cocok untuk?

A. Small tables that change completely every run

B. Large tables with append-only data

C. Intermediate CTEs

D. Tables with frequent updates to historical data

Advanced Dimensional Modeling (Kimball Method)

📚 Reference: The Data Warehouse Toolkit

Ralph Kimball's dimensional modeling approach is the industry standard for data warehouse design.

Slowly Changing Dimensions (SCD)

Dimensions change over time. Kimball defined strategies for handling these changes:

Type	Description	Use Case
SCD Type 0	Fixed dimension - never changes	Date of birth, original timestamp
SCD Type 1	Overwrite old value	Typos, no history needed
SCD Type 2	Add new row with date range	Track historical changes (most common)
SCD Type 3	Add column for previous value	Track only one previous value
SCD Type 4	History table + current table	Fast current lookup + full history
SCD Type 6	Hybrid: Type 1 + Type 2 + Type 3	Complex tracking requirements

SCD Type 2 Implementation

-- dim_customer with SCD Type 2
customer_key (PK) | customer_id | name    | city       | valid_from | valid_to   | is_current
------------------|-------------|---------|------------|------------|------------|----------
1001              | C001        | John    | Jakarta    | 2023-01-01 | 2023-06-15 | N
1002              | C001        | John    | Surabaya   | 2023-06-15 | 9999-12-31 | Y

-- Query for point-in-time analysis
SELECT * FROM dim_customer
WHERE customer_id = 'C001'
  AND '2023-03-01' BETWEEN valid_from AND valid_to;  -- Returns Jakarta record

SCD Decision Playbook (Praktis)

Kondisi Atribut	Rekomendasi	Alasan
Typo/koreksi data tanpa nilai historis bisnis	Type 1	Overwrite cukup, lebih sederhana dan hemat storage
Perubahan atribut penting untuk analisis historis	Type 2	Butuh point-in-time analysis dan audit trail
Hanya perlu nilai sekarang + 1 nilai sebelumnya	Type 3	Ringan, tapi histori terbatas
Lookup current sangat sering, histori tetap wajib	Type 4	Pisah current/history untuk performa query operasional
Kebutuhan campuran (overwrite + histori + nilai lama)	Type 6	Fleksibel, tapi maintenance paling kompleks

Desain Kontrak untuk SCD Type 2

Business key: key alami dari source (contoh: customer_id).
Surrogate key: key teknis per versi row (contoh: customer_key).
Effective window: valid_from dan valid_to.
Current flag: is_current untuk akses cepat versi aktif.
Change hash: hash atribut tracked agar deteksi perubahan murah.
Audit columns: ingested_at, updated_by_job, source_system.

            Kesalahan SCD Type 2 yang paling sering
            Tidak menutup row lama saat nilai berubah (dua row aktif untuk satu business key).
Menggunakan timestamp ingestion sebagai valid_from padahal butuh event/business time.
Tidak mendefinisikan atribut mana yang di-track historinya (semua kolom di-Type2-kan).
Tidak punya strategi late-arriving updates dan backfill.

        

SQL Pattern: Change Detection + SCD Type 2 Upsert

-- Step 1: Cari perubahan atribut yang di-track
WITH staged AS (
  SELECT
    customer_id,
    customer_name,
    city,
    md5(concat_ws('|', customer_name, city)) AS change_hash,
    business_effective_at
  FROM stg_customers
),
current_dim AS (
  SELECT *
  FROM dim_customer
  WHERE is_current = TRUE
)
SELECT s.*
FROM staged s
LEFT JOIN current_dim d
  ON s.customer_id = d.customer_id
WHERE d.customer_id IS NULL
   OR s.change_hash <> d.change_hash;

-- Step 2: Tutup row lama yang berubah
UPDATE dim_customer d
SET valid_to = s.business_effective_at - INTERVAL '1 second',
    is_current = FALSE
FROM changed_rows s
WHERE d.customer_id = s.customer_id
  AND d.is_current = TRUE;

-- Step 3: Insert versi baru
INSERT INTO dim_customer (
  customer_key, customer_id, customer_name, city,
  change_hash, valid_from, valid_to, is_current
)
SELECT
  nextval('dim_customer_key_seq'),
  customer_id, customer_name, city,
  change_hash, business_effective_at, '9999-12-31', TRUE
FROM changed_rows;
        

Production Checks untuk SCD

-- Check 1: tidak boleh ada >1 row current per business key
SELECT customer_id
FROM dim_customer
WHERE is_current = TRUE
GROUP BY customer_id
HAVING count(*) > 1;

-- Check 2: periode valid tidak boleh overlap
SELECT a.customer_id
FROM dim_customer a
JOIN dim_customer b
  ON a.customer_id = b.customer_id
 AND a.customer_key <> b.customer_key
 AND a.valid_from <= b.valid_to
 AND b.valid_from <= a.valid_to;
        

SCD Type 2 + dbt Incremental Pattern

Jika kamu memakai dbt, pendekatan praktis adalah: incremental + merge untuk menutup row lama dan menambah versi baru saat ada perubahan atribut.

-- models/marts/dim_customer_scd2.sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='customer_sk'
) }}

WITH source_data AS (
    SELECT
        customer_id,
        customer_name,
        city,
        business_effective_at,
        md5(concat_ws('|', customer_name, city)) AS change_hash
    FROM {{ ref('stg_customers') }}
    {% if is_incremental() %}
      WHERE business_effective_at >= (
        SELECT coalesce(max(valid_from), '1900-01-01') FROM {{ this }}
      )
    {% endif %}
),
current_dim AS (
    SELECT *
    FROM {{ this }}
    WHERE is_current = true
),
changed_rows AS (
    SELECT s.*
    FROM source_data s
    LEFT JOIN current_dim d ON s.customer_id = d.customer_id
    WHERE d.customer_id IS NULL OR s.change_hash <> d.change_hash
)
SELECT
    {{ dbt_utils.generate_surrogate_key(['customer_id', 'business_effective_at']) }} AS customer_sk,
    customer_id,
    customer_name,
    city,
    change_hash,
    business_effective_at AS valid_from,
    '9999-12-31' AS valid_to,
    true AS is_current
FROM changed_rows;
        

            Checklist implementasi dbt untuk SCD
            Gunakan business_effective_at, bukan sekadar waktu ingestion.
Pisahkan atribut Type 1 dan Type 2 sejak awal desain model.
Tambahkan tests: unique, not_null, dan custom test "single current row".
Siapkan job backfill khusus untuk data telat (late arriving changes).

        

Exercise: Debug SCD Incident (Double Current + Overlap)

Skenario: Pada dim_customer, tim BI melaporkan customer yang sama muncul dua kali sebagai current, dan laporan point-in-time berubah-ubah.

Sample data bermasalah:

customer_key | customer_id | city      | valid_from           | valid_to             | is_current
-------------|-------------|-----------|----------------------|----------------------|-----------
2001         | C100        | Jakarta   | 2024-01-01 00:00:00  | 9999-12-31 00:00:00  | true
2002         | C100        | Bandung   | 2024-03-01 00:00:00  | 9999-12-31 00:00:00  | true   -- BUG
2003         | C101        | Surabaya  | 2024-01-01 00:00:00  | 2024-04-01 00:00:00  | false
2004         | C101        | Malang    | 2024-03-20 00:00:00  | 9999-12-31 00:00:00  | true   -- OVERLAP
            

Tugas:

Tulis query untuk mendeteksi business key dengan >1 current row.
Tulis query untuk mendeteksi overlap window valid_from/valid_to.
Perbaiki data: tutup row lama dan sisakan tepat satu row current per customer_id.
Tambahkan dbt test agar bug ini tidak lolos lagi.

Expected output minimal:

Untuk setiap customer_id, hanya ada satu row dengan is_current = true.
Tidak ada overlap periode valid untuk key yang sama.
Point-in-time query untuk tanggal tertentu selalu mengembalikan tepat satu baris.

Fact Table Types

📊 Transaction Fact

Grain: One row per transaction

Example: Order line items

Most common, atomic level data

📅 Periodic Snapshot

Grain: One row per period

Example: Daily inventory balance

State of world at intervals

📈 Accumulating Snapshot

Grain: One row per process

Example: Order fulfillment pipeline

Track milestones in a process

Dimension Design Patterns

1. Junk Dimensions

Combine low-cardinality flags/indicators into one dimension to reduce clutter:

-- Instead of having 5 separate flag columns in fact table:
-- is_promotion, is_free_shipping, is_gift_wrap, ...

-- Create a junk dimension:
dim_order_flags:
  - flag_key (PK)
  - is_promotion
  - is_free_shipping
  - is_gift_wrap
  - flag_description -- 'Promo + Free Ship', 'Standard', etc.
        

2. Degenerate Dimensions

Transaction identifiers that don't need their own dimension:

-- order_number, invoice_number - stored directly in fact table
-- No separate dimension needed (no attributes to describe)
        

3. Role-Playing Dimensions

Same dimension used multiple ways in one fact table:

fact_orders:
  - order_date_key (→ dim_date)
  - ship_date_key (→ dim_date)
  - delivery_date_key (→ dim_date)
-- Single dim_date table, referenced multiple times with different roles
        

4. Conformed Dimensions

Dimensions that are shared across multiple fact tables (enterprise-wide consistency):

Same dim_customer used in sales, support, marketing facts
Same dim_product used across all business units
Critical for drilling across different processes

Bus Architecture

Kimball's approach to enterprise data warehouse - integrate via conformed dimensions:

🚌 Data Warehouse Bus Matrix

Business Process	Date	Customer	Product	Store	Promotion
Sales	✓	✓	✓	✓	✓
Inventory	✓		✓	✓
Returns	✓	✓	✓	✓

Kesimpulan

Data modeling adalah foundation dari analytics. Pilih normalization untuk OLTP, dimensional modeling untuk OLAP. Gunakan dbt untuk mengelola transformasi dengan best practices: version control, testing, dan documentation.

            🎯 Key Takeaways
            Normalize for OLTP, denormalize for analytics
Star schema: facts in center, dimensions around
dbt brings software engineering to SQL
Materialize based on query patterns and data size
Always add tests for data quality

        

📚 References & Resources

Primary Sources

The Data Warehouse Toolkit - Ralph Kimball & Margy Ross (Wiley, 2013)
Chapters 1-6: Dimensional Modeling, SCD Types, Fact Tables, Bus Architecture
Fundamentals of Data Engineering - Joe Reis & Matt Housley (O'Reilly, 2022)
Chapters 10-11: Data Transformation, dbt and Analytics Engineering
Building a Scalable Data Warehouse with Data Vault 2.0 - Daniel Linstedt et al. (Morgan Kaufmann, 2016)
Chapter 3: Normalization and Data Warehouse Modeling Approaches