Intermediate
Security & Data Governance
Encryption, access control, PII handling, and compliance
⏱️ 35 min read
📅 Updated Jan 2025
👤 By DataLearn Team
Beginner Reading Mode
Think of security as the "home security system" for your data. Focus on:
- Which data is sensitive and why access to it must be restricted
- The differences between encryption, access control, and auditing
- The minimum compliance requirements for an organization
Glossary: DE-GLOSSARY.md
Light Prerequisites
- Understand that customer data must not be accessible to just anyone
- Know the basic concepts of user roles and permissions
- Have heard the terms PII or personal data
Key Terms (3 Layers)
Term: Data Classification
Plain definition: Grouping data by its level of sensitivity.
Technical definition: A data-labeling framework (public/internal/confidential/restricted) that determines which security controls apply.
Practical example: Customer emails are labeled confidential and must be masked in non-production environments.
Term: Least Privilege
Plain definition: Grant just enough access, no more.
Technical definition: An IAM principle that limits access rights to only the resources and actions that are actually needed.
Practical example: The analyst role can only SELECT; it cannot DROP production tables.
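The Data Classification term above can be made concrete with a small sketch. The column names, the `COLUMN_LABELS` catalog, and the masking rule below are all hypothetical; real projects usually keep these labels in a data catalog rather than in code.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """Classification labels, ordered from least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Hypothetical column catalog: column name -> classification label.
COLUMN_LABELS = {
    "product_name": Sensitivity.PUBLIC,
    "order_total": Sensitivity.INTERNAL,
    "customer_email": Sensitivity.CONFIDENTIAL,
    "card_number": Sensitivity.RESTRICTED,
}


def must_mask(column: str, environment: str) -> bool:
    """Return True if the column has to be masked in this environment.

    Assumed rule: confidential and restricted columns are masked
    everywhere except production. Unlabeled columns are treated as
    restricted, so a missing label fails safe.
    """
    label = COLUMN_LABELS.get(column, Sensitivity.RESTRICTED)
    return environment != "prod" and label >= Sensitivity.CONFIDENTIAL
```

Treating unlabeled columns as restricted is a design choice: it makes forgetting to classify a column an inconvenience rather than a leak.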
Data Security Fundamentals
Security in data engineering follows the CIA Triad:
🔒 Confidentiality
Only authorized users and services can access the data. Typical controls: encryption at rest and in transit.
✓ Integrity
Data is accurate and has not been tampered with. Typical controls: checksums and validation.
▲ Availability
Data is accessible when needed. Typical controls: backups, redundancy, and disaster recovery.
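As an illustration of the integrity leg of the triad, here is a minimal checksum sketch using Python's standard library. The function names are illustrative. A plain checksum like this catches accidental corruption; if an attacker could replace both the file and its checksum, a keyed HMAC or a digital signature would be needed instead.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_checksum(data: bytes, expected_hex: str) -> bool:
    """Check that the data still matches a previously recorded digest."""
    # compare_digest runs in constant time, which avoids leaking
    # how much of the digest matched.
    return hmac.compare_digest(sha256_hex(data), expected_hex)
```

A pipeline would typically record `sha256_hex(payload)` when a file is written and call `verify_checksum` before loading it downstream.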
Encryption
Encryption at Rest
- Transparent Data Encryption (TDE): Database-level encryption
- File-level encryption: Encrypting Parquet/S3 files
- Key management: AWS KMS, Azure Key Vault, HashiCorp Vault
Encryption in Transit
- TLS (the successor to the deprecated SSL): For all network communications
- VPN: Secure connections between networks
- mTLS: Mutual authentication for service-to-service
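The TLS and mTLS bullets above could look like this on the client side, using Python's standard `ssl` module. The function name and the certificate paths are hypothetical; the minimum version of TLS 1.2 is an assumption, not a requirement from this lesson.

```python
import ssl
from typing import Optional


def make_client_context(client_cert: Optional[str] = None,
                        client_key: Optional[str] = None) -> ssl.SSLContext:
    """Build a TLS client context; add a client certificate for mTLS."""
    # The default context verifies the server certificate and hostname.
    ctx = ssl.create_default_context()
    # Refuse protocol versions older than TLS 1.2 (assumed policy).
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    if client_cert and client_key:
        # Presenting our own certificate is what makes the handshake
        # mutual: the server can now authenticate the client too.
        ctx.load_cert_chain(certfile=client_cert, keyfile=client_key)
    return ctx
```

Calling `make_client_context()` gives plain TLS; passing both file paths, for example `make_client_context("client.pem", "client.key")`, turns it into an mTLS client, provided the server is configured to request client certificates.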
⚠️ Key Management Best Practices
- Rotate keys regularly (at least annually)
- Use separate keys for different environments
- Never hardcode keys in code
- Enable key versioning for audit trail
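Two of the practices above, no hardcoded keys and key versioning, can be sketched as follows. This is an in-memory illustration only: `KeyRing`, `load_key_from_env`, and the environment variable name are hypothetical, and in production a managed service such as the KMS or Vault products listed earlier would hold the keys.

```python
import base64
import os


class KeyRing:
    """Keeps every key version so old data stays decryptable after rotation."""

    def __init__(self) -> None:
        self._keys = {}  # version number -> key bytes

    def add(self, version: int, key: bytes) -> None:
        # Versions are append-only; overwriting one would break the audit trail.
        if version in self._keys:
            raise ValueError(f"key version {version} already exists")
        self._keys[version] = key

    def current_version(self) -> int:
        """New data should always be protected with the newest version."""
        return max(self._keys)

    def get(self, version: int) -> bytes:
        return self._keys[version]


def load_key_from_env(var_name: str) -> bytes:
    """Read a base64-encoded key from the environment instead of source code."""
    value = os.environ.get(var_name)
    if value is None:
        raise RuntimeError(f"environment variable {var_name} is not set")
    return base64.b64decode(value)
```

Storing the key version next to each encrypted record is what lets a rotation happen gradually: old records name the version they need, while new writes use `current_version()`.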
Access Control
RBAC vs ABAC
| Model | Based On | Example |
|---|---|---|
| RBAC (Role-Based) | User's role in the organization | "Analysts can read sales data" |
| ABAC (Attribute-Based) | Multiple attributes of the user, data, and context | "Users in EU can access EU data during business hours" |
Row-Level and Column-Level Security
-- Snowflake: column-level security with a dynamic masking policy
-- Only the ADMIN role sees the real value; every other role sees a placeholder
CREATE MASKING POLICY email_mask AS (val STRING)
  RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() = 'ADMIN' THEN val
    ELSE '***MASKED***'
  END;

-- Attach the policy to the email column
ALTER TABLE customers MODIFY COLUMN email
  SET MASKING POLICY email_mask;
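The masking policy above controls what a role sees inside a column. Row-level security instead controls which rows a role sees at all. Here is a minimal Python sketch of that idea, assuming a hypothetical `ROLE_REGIONS` mapping and a `region` field on each row; in Snowflake itself this would normally be enforced with a row access policy rather than in application code.

```python
# Hypothetical mapping of roles to the regions whose rows they may see.
ROLE_REGIONS = {
    "EU_ANALYST": {"EU"},
    "ADMIN": {"EU", "US"},
}


def visible_rows(rows: list, role: str) -> list:
    """Return only the rows whose region the role is allowed to see."""
    # Roles that are not listed get an empty set, so they see nothing.
    allowed = ROLE_REGIONS.get(role, set())
    return [row for row in rows if row["region"] in allowed]
```

For example, `visible_rows(customers, "EU_ANALYST")` would drop every row whose `region` is not `"EU"`, while an unlisted role gets an empty result.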
PII Handling
Personally Identifiable Information (PII) requires special handling:
PII Discovery and Classification
- Automated scanning: Amazon Macie, Microsoft Purview (formerly Azure Purview), Collibra
- Pattern matching: Regex for SSN, email, credit cards
- ML-based: NLP for unstructured data
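The pattern-matching approach above might be sketched like this. The regular expressions are deliberately simple assumptions (US-style SSNs, 13 to 19 digit card numbers) and will produce both false positives and misses on real data; the Luhn check is added to filter out digit runs that cannot be valid card numbers.

```python
import re

EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13-19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")


def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def find_pii(text: str) -> dict:
    """Return candidate PII values found in free text, grouped by type."""
    cards = [
        m.group().strip()  # the pattern may capture one trailing separator
        for m in CARD_RE.finditer(text)
        if luhn_valid(m.group())
    ]
    return {
        "email": EMAIL_RE.findall(text),
        "ssn": SSN_RE.findall(text),
        "credit_card": cards,
    }
```

Results from a scanner like this are best treated as candidates for review and classification, not as a definitive inventory.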
PII Protection Techniques
| Technique | Description | Reversible? |
|---|---|---|
| Tokenization | Replace the value with a random token | Yes (with vault) |
| Masking | Show partial data (***-**-1234) | No |
| Hashing | One-way transformation | No (though unsalted hashes of low-entropy values such as SSNs can be recovered by brute force or rainbow tables) |
| Encryption | Protect sensitive values | Yes (with key) |
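Three of the techniques in the table can be sketched with Python's standard library alone. Everything here is illustrative: `TokenVault` is an in-memory stand-in for a real token vault, and the hashing example uses a keyed hash (HMAC-SHA256) rather than a bare hash, which is a substitution made to blunt the brute-force problem with low-entropy values. Encryption is left out because the standard library has no symmetric cipher.

```python
import hashlib
import hmac
import secrets


def mask_ssn(ssn: str) -> str:
    """Masking: keep the last four digits and hide the rest."""
    return "***-**-" + ssn[-4:]


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Keyed hashing: same input and key always give the same digest."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


class TokenVault:
    """Tokenization: swap a value for a random token and remember the pair."""

    def __init__(self) -> None:
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so joins on the tokenized column still work.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(16)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        """Only callers with vault access can get the original back."""
        return self._token_to_value[token]
```

The token carries no information about the original value, so reversibility depends entirely on who can reach the vault. The keyed hash is deterministic, which keeps it usable as a join key, but it cannot be looked up in a precomputed table without the key.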
Compliance Frameworks
📋 Major Regulations
- GDPR (EU): Right to be forgotten, data portability, consent
- CCPA/CPRA (California): Consumer privacy rights, opt-out
- HIPAA (US Healthcare): Protected Health Information (PHI)
- SOC 2: Security controls for service organizations
- PCI-DSS: Payment card data protection
Data Governance
Framework for managing data availability, usability, integrity, and security:
Governance Components
- Data Stewardship: Assigning ownership and accountability
- Data Quality: Standards and monitoring
- Master Data Management: Single source of truth
- Metadata Management: Documentation and lineage
- Policy Management: Retention, access, privacy policies
Data Retention Policies
🗑️ Retention Best Practices
- Define retention periods by data type
- Automate data archival and deletion
- Legal hold capabilities for litigation
- Audit trail of deletions
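The retention practices above can be combined into one small sketch: retention periods per data type, a legal-hold flag that blocks deletion, and an audit entry for each deletion. The data types, periods, and field names are hypothetical placeholders, not recommended values.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical retention periods, in days, per data type.
RETENTION_DAYS = {"web_logs": 90, "support_tickets": 730}


@dataclass
class Record:
    record_id: str
    data_type: str
    created: date
    legal_hold: bool = False


def due_for_deletion(records, today):
    """Return the records whose retention period has fully elapsed."""
    due = []
    for r in records:
        days = RETENTION_DAYS.get(r.data_type)
        if days is None or r.legal_hold:
            # No policy defined, or frozen for litigation: never auto-delete.
            continue
        if r.created + timedelta(days=days) < today:
            due.append(r)
    return due


def deletion_audit_entries(records, today):
    """Build one audit entry per deleted record (IDs only, no payload)."""
    return [
        {"record_id": r.record_id, "data_type": r.data_type,
         "deleted_on": today.isoformat()}
        for r in records
    ]
```

A scheduled job could call `due_for_deletion`, delete the returned records, and then persist the entries from `deletion_audit_entries` to an append-only log.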
Decision Framework: Security & Governance
| Decision Point | Choose Option A If... | Choose Option B If... |
|---|---|---|
| RBAC vs ABAC | Organizational roles are clear and simple | You need dynamic policies based on user, data, or context attributes |
| Masking vs Tokenization | You only need to hide the displayed data | You need a replacement for sensitive values that can be referenced again |
| Central Governance vs Federated | The organization is at an early stage and the data team is centralized | There are many data domains with ownership per business unit |
| Default deny vs Exception-based | Data is highly sensitive and risk tolerance is low | It is an internal exploration environment with strong guardrails |
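The RBAC vs ABAC and default-deny rows above can be illustrated side by side. This is a sketch under assumptions: the role names, the permission table, and the 09:00 to 17:00 definition of business hours are all hypothetical.

```python
from dataclasses import dataclass

# RBAC: permissions hang off the role. Anything not listed is denied.
ROLE_PERMISSIONS = {
    "analyst": {("sales", "read")},
    "engineer": {("sales", "read"), ("sales", "write")},
}


def rbac_allows(role: str, resource: str, action: str) -> bool:
    """Default deny: unknown roles and unlisted actions return False."""
    return (resource, action) in ROLE_PERMISSIONS.get(role, set())


@dataclass(frozen=True)
class Request:
    user_region: str
    data_region: str
    hour: int  # local hour of the request, 0-23


def abac_allows(req: Request) -> bool:
    """ABAC: the decision combines user, data, and context attributes."""
    same_region = req.user_region == req.data_region
    business_hours = 9 <= req.hour < 17  # assumed definition
    return same_region and business_hours
```

The RBAC check needs only a static table, which is why it suits simple organizations. The ABAC check has to be evaluated per request because its inputs, such as the time of day, change.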
Failure Modes & Anti-Patterns
Security/Governance Anti-Patterns
- PII sprawl: sensitive data is scattered without classification.
- Shared credentials: the audit trail is unclear because accounts are shared.
- Policy on paper only: rules exist, but there is no technical enforcement.
- Manual access reviews: excessive access is not revoked quickly.
- No deletion workflow: right-to-erasure compliance requirements go unmet.
Production Readiness Checklist
Security Checklist Before Production
- Data classification matrix is applied (public/internal/confidential/restricted).
- Encryption at rest and in transit is enabled by default.
- RBAC/ABAC policies are tested for least privilege.
- PII masking/tokenization runs at the serving layer.
- Audit logging is enabled for access to sensitive data.
- Retention and deletion policies are automated.
- Key rotation schedule is documented and tested.
- An incident response plan for security breaches is in place.
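The audit logging item in the checklist could be sketched as a function that emits one structured entry per access to a sensitive table. The table list and field names are hypothetical; most warehouses also provide built-in access history that should be preferred where available.

```python
import json
from datetime import datetime, timezone

# Hypothetical list of tables classified as sensitive.
SENSITIVE_TABLES = {"customers", "payments"}


def audit_record(user: str, role: str, table: str, action: str, now=None):
    """Return a JSON audit line for sensitive tables, or None otherwise."""
    if table not in SENSITIVE_TABLES:
        return None
    # Timestamps are recorded in UTC so entries sort consistently.
    ts = (now or datetime.now(timezone.utc)).isoformat()
    entry = {"ts": ts, "user": user, "role": role,
             "table": table, "action": action}
    return json.dumps(entry, sort_keys=True)
```

Logging the individual user rather than only the role is what makes the trail useful, which is also why shared credentials undermine auditing.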
✏️ Exercise: Design Security Architecture
Design the security for an e-commerce company's data warehouse:
- Encryption: at rest (AES-256) and in transit (TLS 1.3)
- Access control: RBAC with the roles Analyst, Engineer, Admin
- PII handling: Tokenize credit cards, mask emails for non-admins
- Compliance: GDPR-compliant, with 7-year data retention only where a legal obligation (for example, financial record-keeping) justifies it
- Audit: Log every query that accesses PII
🎯 Quick Quiz
1. What is the purpose of encryption at rest?
A. Speeding up queries
B. Protecting stored data from unauthorized access
C. Reducing storage size
D. Making backups easier
2. What is the difference between RBAC and ABAC?
A. RBAC is more secure than ABAC
B. RBAC is based on roles; ABAC is based on multiple attributes
C. ABAC is faster to implement
D. There is no difference
3. Which regulation applies to the personal data of individuals in the EU?
A. CCPA
B. HIPAA
C. GDPR
D. PCI-DSS
Conclusion
Security and governance are not an afterthought; they must be designed in from the start. Data engineers need to understand encryption, access control, and compliance requirements to build systems that can be trusted.
🎯 Key Takeaways
- Follow CIA Triad: Confidentiality, Integrity, Availability
- Encrypt at rest and in transit
- Implement principle of least privilege
- Classify and protect PII appropriately
- Understand relevant compliance frameworks