Intermediate
dbt (Data Build Tool) Deep Dive
dbt models, tests, documentation, and best practices
β±οΈ 45 min read
π
Updated Jan 2025
π€ By DataLearn Team
Mode Baca Pemula
Anggap dbt sebagai "framework transformasi SQL yang terstruktur". Fokus baca:
- Susunan model staging, intermediate, dan marts
- Test otomatis untuk kualitas data
- Cara deploy aman antara dev dan production
Kamus istilah: DE-GLOSSARY.md
Prasyarat Ringan
- Paham SQL transformation di data warehouse
- Tahu konsep git dan environment dev/prod
- Pernah lihat kebutuhan dokumentasi lineage model
Istilah Penting (3 Lapis)
Istilah: Materialization
Definisi awam: Cara dbt menyimpan hasil model.
Definisi teknis: Konfigurasi output model sebagai view, table, incremental, atau ephemeral.
Contoh praktis: Model dashboard dipilih table agar query BI cepat dan stabil.
Istilah: Slim CI
Definisi awam: Menjalankan test hanya pada model yang berubah.
Definisi teknis: Strategi CI dbt berbasis state untuk mempercepat validasi perubahan.
Contoh praktis: PR kecil hanya mengetes model terkait, bukan seluruh ratusan model.
What is dbt?
dbt (data build tool) enables data analysts and engineers to transform data in their warehouses
by simply writing SELECT statements. dbt handles the boilerplateβturning these statements into tables and views.
π― dbt Philosophy
"The T in ELT doesn't stand for Toolβit stands for Transform, and that should be done with code."
Why dbt?
- Version Control: Git integration for SQL transformations
- Modularity: Reusable models with {{ ref() }}
- Testing: Built-in data quality tests
- Documentation: Auto-generated docs and lineage
- CI/CD: Slim CI, defer to production
Project Structure
my_dbt_project/
βββ models/
β βββ staging/ # Source-aligned models
β β βββ _sources.yml
β β βββ stg_orders.sql
β β βββ stg_customers.sql
β βββ intermediate/ # Business logic
β β βββ int_orders_enriched.sql
β β βββ int_customer_metrics.sql
β βββ marts/ # Business-facing
β βββ _models.yml
β βββ fct_orders.sql
β βββ dim_customers.sql
βββ macros/ # Reusable SQL
βββ tests/ # Custom tests
βββ snapshots/ # SCD Type 2
βββ seeds/ # Static data
βββ dbt_project.yml
Model Configuration
Materializations
| Type |
Use Case |
Behavior |
| view |
Light transformations |
Rebuilt on each run |
| table |
Heavy transformations |
Full rebuild |
| incremental |
Large tables, append-only |
Only new rows processed |
| ephemeral |
Intermediate CTEs |
Not persisted |
{{
config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge'
)
}}
SELECT *
FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
WHERE created_at > (
SELECT MAX(created_at) FROM {{ this }}
)
{% endif %}
Testing in dbt
version: 2
models:
- name: fct_orders
description: "One record per order"
columns:
- name: order_id
description: "Primary key"
tests:
- unique
- not_null
- name: customer_id
tests:
- not_null
- relationships:
to: ref('dim_customers')
field: customer_id
- name: amount
tests:
- not_null
- dbt_utils.accepted_range:
min_value: 0
- name: status
tests:
- accepted_values:
values: ['pending', 'shipped', 'delivered', 'cancelled']
Custom Tests with Macros
{% macro test_positive_values(model, column_name) %}
SELECT {{ column_name }}
FROM {{ model }}
WHERE {{ column_name }} <= 0
{% endmacro %}
Documentation and Lineage
π Documentation Best Practices
- Document every model with business context
- Document column purposes, not just names
- Add tags for discoverability
- Keep docs close to code (in YML files)
Advanced dbt Features
Snapshots (SCD Type 2)
{% snapshot customers_snapshot %}
{{
config(
target_schema='snapshots',
unique_key='customer_id',
strategy='timestamp',
updated_at='updated_at'
)
}}
SELECT * FROM {{ source('crm', 'customers') }}
{% endsnapshot %}
Packages
packages:
- package: dbt-labs/dbt_utils
version: 1.1.1
- package: dbt-labs/codegen
version: 0.12.1
- git: "https://github.com/dbt-labs/dbt-audit-helper"
revision: main
Decision Framework: dbt Implementation
| Decision Point |
Pilih Opsi A Jika... |
Pilih Opsi B Jika... |
| table vs incremental |
Volume data kecil-menengah, rebuild penuh masih cepat |
Volume besar, SLA ketat, dan perubahan bersifat append/update terkontrol |
| staging granularity |
Source stabil dan transformasi sederhana |
Source beragam, butuh cleaning dan standardisasi ketat |
| generic tests vs custom tests |
Aturan kualitas umum (unique, not_null, relationships) |
Aturan bisnis spesifik domain (mis. margin tidak negatif) |
| single project vs mesh/monorepo |
Tim kecil dan domain masih satu |
Multi-domain team dengan ownership jelas per data product |
Failure Modes & Anti-Patterns
Anti-Patterns pada dbt Project
- Model naming inconsistency: sulit cari lineage dan ownership model.
- No test gates: deployment lolos walau quality regression.
- Incremental without unique key: duplicate data dan upsert salah.
- Business logic in staging: layer jadi tidak jelas dan sulit dirawat.
- Docs ignored: model dipakai tanpa pemahaman konteks bisnis.
Production Readiness Checklist
Checklist dbt sebelum Production
- Layering model jelas: staging, intermediate, marts.
- Semua model penting punya tests minimum (unique/not_null/relationships).
- CI menjalankan slim run + tests sebelum merge.
- Incremental model punya unique_key dan backfill strategy.
- Snapshots untuk entity historis penting sudah dikonfigurasi.
- Dokumentasi model + kolom terisi dan dipublish.
- Owner model per domain sudah jelas.
- Failure alert untuk dbt jobs sudah terhubung ke channel on-call.
βοΈ Exercise: Build dbt Project
Buat dbt project untuk e-commerce:
- Setup sources.yml untuk raw data
- Create staging models (stg_orders, stg_customers)
- Build intermediate: int_orders_enriched
- Create marts: fct_orders, dim_customers
- Add tests: unique, not_null, relationships
π― Quick Quiz
1. Materialization apa untuk large append-only tables?
A. view
B. table
C. incremental
D. ephemeral
2. Fungsi {{ ref() }}?
A. Reference schema
B. Reference another model
C. Reference seed
D. Reference macro
3. Snapshots digunakan untuk?
A. Backup data
B. Track slowly changing dimensions
C. Speed up queries
D. Encrypt data
Kesimpulan
dbt mengubah cara tim data bekerjaβdari spreadsheet dan manual processes ke software engineering practices. Dengan testing, documentation, dan version control, dbt brings rigor to analytics.
π― Key Takeaways
- dbt transforms data using SQL SELECT statements
- Choose materialization based on transformation cost
- Test everything: uniqueness, nulls, relationships
- Document as you build
- Use snapshots for SCD Type 2
π References & Resources
Primary Sources
- Fundamentals of Data Engineering - Joe Reis & Matt Housley (O'Reilly, 2022)
Chapters 10-11: Data Transformation, dbt
- dbt Documentation Best Practices - dbt Labs (docs.getdbt.com)
Official Documentation
Articles & Guides