Plans003 2026 02 20 Transform Layer
Set Up @packages/transform
Overview
Create the dbt project at @packages/transform. Implements all four transform layers (raw sources, staging, intermediate, marts) using dbt-core + dbt-duckdb. Per-year column normalization is handled in staging SQL — marts always receive consistent English-named columns regardless of source year.
Result: uv run dbt run produces all 4 mart tables (mart_revenue, mart_program_progress, mart_channel_marketing, mart_school_progress) in atlas.db, queryable with real data from all loaded years.
Goals
- Scaffold
@packages/transformwith Python/uv project (pyproject.toml, profiles.yml, dbt_project.yml) - Declare raw tables as dbt sources (
sources.yml) - Implement all 4 staging models with per-year COALESCE normalization for column differences
- Implement staging schema tests (not_null, unique, accepted_values)
- Implement intermediate models:
int_ordersandint_enrollments - Implement all 4 mart models
- Implement mart schema tests
- Verify:
uv run dbt run && uv run dbt testpasses with real 2026 data
Non-Goals
- Implementing the Format layer (Plan 004)
- Running the sync layer — raw tables are assumed to already exist (populated by Plan 002)
- Odoo-specific transform logic (future)
- dbt docs serving (nice to have, not required for completion)
- Multi-entity (EPN) support in SQL — models should accept
entityvariable but EPN-specific testing is out of scope
Phases
- Phase 1: Python/dbt scaffold (pyproject.toml, profiles.yml, dbt_project.yml, requirements.txt)
- Phase 2: Raw source declarations (sources.yml)
- Phase 3: Staging models + schema tests (stg_transactions, stg_students, stg_targets, stg_marketing_activity)
- Phase 4: Intermediate models (int_orders, int_enrollments)
- Phase 5: Mart models (mart_revenue, mart_program_progress, mart_channel_marketing, mart_school_progress) + mart tests
- Phase 6: Verification —
uv run dbt run && uv run dbt testpasses
Success
-
@packages/transform/pyproject.tomlandrequirements.txtexist withdbt-coreanddbt-duckdb -
profiles.ymlpoints to../../atlas.dbwith 4 threads -
dbt_project.ymlconfigures staging/intermediate as views, marts as tables - All 4 staging models exist and produce clean English-named columns
-
uv run dbt testpasses on all staging schema tests -
int_ordersandint_enrollmentsexist and join correctly - All 4 mart models exist and are materialized as tables
-
uv run dbt runcompletes without errors on real 2026 data -
uv run dbt testpasses on all mart schema tests
Requirements
- Plan 002 completed:
atlas.dbpopulated with raw tables - Python 3.11+ available
uvinstalled (uv --version)dbt-coreanddbt-duckdbinstallable viauv pipatlas.dbexists at project root
Context
Why This Approach
- dbt-core + dbt-duckdb is the standard, purpose-built toolchain for this pattern
- Staging models as views: no storage cost, always fresh on re-run
- Mart models as tables: pre-aggregated for fast reads by the Format layer and dashboard
- Per-year normalization in staging via COALESCE: one SQL model handles all years without branching —
COALESCE("INTAKE", SPLIT_PART("BULAN/INTAKE", ' ', 2))resolves the 2026 split column transparently int_enrollmentsfilters out non-enrollment rows (book sales, events) so all downstream marts are enrollment-only
Key Constraints
- Raw column names are Indonesian and differ by year — staging models must handle all known variations via COALESCE
- 2026 LAPORAN EVALUASI adds
NO. KWITANSI(receipt number) — this is the join key between transactions and students; prior years must fall back to student name matching int_ordersjoinsstg_transactions×stg_studentsonNO. KWITANSIwhere available, falling back toNAMA SISWA= student name- Unit code normalization in
int_orders:WLC INGGRIS→WLC_ENGLISH,WLC NON INGGRIS→WLC_NON_ENGLISH - Customer type normalization in
int_orders:B→New,L→Renewal,A→Alumni int_enrollmentsexcludes:BUKU REAL,BUKU WLC,BUKU MANDARIN,EVENT - WLC,EVENT - ICC,SIMULASI JLPT,SIMULASI HSK,KIRIM - REAL,KIRIM - TM,TOP UP POINand anyamount <= 0mart_school_progressgroups by year only (no month) — school analysis is annualmart_channel_marketinguses window function forcontribution_pct— DuckDB supports this natively
Edge Cases
- Students in
stg_studentswithout a matching transaction (no receipt number, no name match) — left join inint_ordersso transaction rows are never dropped - Transactions without a matching student row — left join returns NULL student fields; mart queries handle NULLs
TRY_CASTused for all numeric columns in staging to avoid parse failures on dirty dataQUALIFY ROW_NUMBER()deduplication instg_transactionsonNO. KWITANSIhandles duplicate receipt rows in source
Tradeoffs
- Staging as views vs tables: views chosen for simplicity (always fresh); if query performance becomes an issue, can be changed to incremental tables later
- Single staging model per raw table (not per year): one model handles all years via COALESCE, reducing file count; downside is more complex SQL per model
- No Jinja variables for entity filtering in staging: marts include
entity_code/unit_codecolumns so filtering happens at mart query time in the Format layer
Skills
plan— plan file format and conventions
Boundaries
- Always: Update Progress.md after each task completion
- Always: All English column names in staging must match the glossary.md terminology exactly
- Always: Use
TRY_CASTfor all numeric columns to avoid hard failures on dirty source data - Always: Staging models use
WHEREto filter null PKs; never pass nulls downstream - Ask first: Any new mart model not in the original spec
- Ask first: Changing materialization strategy from the default (views for stg/int, tables for marts)
- Never: Apply renaming or business logic in the raw layer (raw = exact source copy)
- Never: Skip schema tests for staging models
Questions
- Join key between transactions and students? →
NO. KWITANSI(2026+), fall back to name match for prior years - Enrollment exclusion list? → Defined in analytics.md, used in int_enrollments
- Should
stg_marketing_activityhandle the weekly breakdown columns (M1/M2/M3/M4) or just monthly totals? → Keep monthly channel totals for now; weekly decomposition deferred until raw structure is stabilized. - Should
mart_revenueinclude ayear_typecolumn (current / last_year / best_year) or leave that computation to the Format layer? → Leave year_type computation to the Format layer.