Atlas Plan
Plan 003 · 2026-02-20 · Transform Layer

Set Up @packages/transform

Overview

Create the dbt project at @packages/transform. It implements all four transform layers (raw sources, staging, intermediate, marts) using dbt-core + dbt-duckdb. Per-year column normalization is handled in staging SQL — marts always receive consistent English-named columns regardless of source year.

Result: uv run dbt run produces all 4 mart tables (mart_revenue, mart_program_progress, mart_channel_marketing, mart_school_progress) in atlas.db, queryable with real data from all loaded years.

Goals

  • Scaffold @packages/transform with Python/uv project (pyproject.toml, profiles.yml, dbt_project.yml)
  • Declare raw tables as dbt sources (sources.yml)
  • Implement all 4 staging models with per-year COALESCE normalization for column differences
  • Implement staging schema tests (not_null, unique, accepted_values)
  • Implement intermediate models: int_orders and int_enrollments
  • Implement all 4 mart models
  • Implement mart schema tests
  • Verify: uv run dbt run && uv run dbt test passes with real 2026 data

Non-Goals

  • Implementing the Format layer (Plan 004)
  • Running the sync layer — raw tables are assumed to already exist (populated by Plan 002)
  • Odoo-specific transform logic (future)
  • dbt docs serving (nice to have, not required for completion)
  • Multi-entity (EPN) support in SQL — models should accept entity variable but EPN-specific testing is out of scope

Phases

  • Phase 1: Python/dbt scaffold (pyproject.toml, profiles.yml, dbt_project.yml, requirements.txt)
  • Phase 2: Raw source declarations (sources.yml)
  • Phase 3: Staging models + schema tests (stg_transactions, stg_students, stg_targets, stg_marketing_activity)
  • Phase 4: Intermediate models (int_orders, int_enrollments)
  • Phase 5: Mart models (mart_revenue, mart_program_progress, mart_channel_marketing, mart_school_progress) + mart tests
  • Phase 6: Verification — uv run dbt run && uv run dbt test passes

Success

  • @packages/transform/pyproject.toml and requirements.txt exist with dbt-core and dbt-duckdb
  • profiles.yml points to ../../atlas.db with 4 threads
  • dbt_project.yml configures staging/intermediate as views, marts as tables
  • All 4 staging models exist and produce clean English-named columns
  • uv run dbt test passes on all staging schema tests
  • int_orders and int_enrollments exist and join correctly
  • All 4 mart models exist and are materialized as tables
  • uv run dbt run completes without errors on real 2026 data
  • uv run dbt test passes on all mart schema tests

Requirements

  • Plan 002 completed: atlas.db populated with raw tables
  • Python 3.11+ available
  • uv installed (uv --version)
  • dbt-core and dbt-duckdb installable via uv pip
  • atlas.db exists at project root

Context

Why This Approach

  • dbt-core + dbt-duckdb is the standard, purpose-built toolchain for this pattern
  • Staging models as views: no storage cost, always fresh on re-run
  • Mart models as tables: pre-aggregated for fast reads by the Format layer and dashboard
  • Per-year normalization in staging via COALESCE: one SQL model handles all years without branching — COALESCE("INTAKE", SPLIT_PART("BULAN/INTAKE", ' ', 2)) resolves the 2026 split column transparently
  • int_enrollments filters out non-enrollment rows (book sales, events) so all downstream marts are enrollment-only
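The per-year COALESCE approach above can be sketched as a single staging model. This is a minimal illustration, not the final model: the source name, the `row_num`/`amount` columns, and the null-PK filter column are assumptions; only the `"INTAKE"` / `"BULAN/INTAKE"` handling is taken from the plan.

```sql
-- stg_students (sketch): one model handles all years via COALESCE.
-- 2026 sources carry a dedicated "INTAKE" column; earlier years carry a
-- combined "BULAN/INTAKE" column whose second token is the intake value.
select
    try_cast("NO" as integer)                              as row_num,  -- assumed column
    "NAMA SISWA"                                           as student_name,
    coalesce("INTAKE", split_part("BULAN/INTAKE", ' ', 2)) as intake,
    try_cast("NOMINAL" as decimal(18, 2))                  as amount    -- assumed column
from {{ source('raw', 'students') }}
where "NAMA SISWA" is not null  -- never pass null PKs downstream
```

Because the raw layer is an exact source copy, both spellings of the intake column must exist (as NULLs where absent) for COALESCE to resolve them in one branch-free model.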

Key Constraints

  • Raw column names are Indonesian and differ by year — staging models must handle all known variations via COALESCE
  • 2026 LAPORAN EVALUASI adds NO. KWITANSI (receipt number) — this is the join key between transactions and students; prior years must fall back to student name matching
  • int_orders joins stg_transactions × stg_students on NO. KWITANSI where available, falling back to NAMA SISWA = student name
  • Unit code normalization in int_orders: WLC INGGRIS → WLC_ENGLISH, WLC NON INGGRIS → WLC_NON_ENGLISH
  • Customer type normalization in int_orders: B → New, L → Renewal, A → Alumni
  • int_enrollments excludes: BUKU REAL, BUKU WLC, BUKU MANDARIN, EVENT - WLC, EVENT - ICC, SIMULASI JLPT, SIMULASI HSK, KIRIM - REAL, KIRIM - TM, TOP UP POIN and any amount <= 0
  • mart_school_progress groups by year only (no month) — school analysis is annual
  • mart_channel_marketing uses window function for contribution_pct — DuckDB supports this natively

Edge Cases

  • Students in stg_students without a matching transaction (no receipt number, no name match) — left join in int_orders so transaction rows are never dropped
  • Transactions without a matching student row — left join returns NULL student fields; mart queries handle NULLs
  • TRY_CAST used for all numeric columns in staging to avoid parse failures on dirty data
  • QUALIFY ROW_NUMBER() deduplication in stg_transactions on NO. KWITANSI handles duplicate receipt rows in source
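The TRY_CAST and QUALIFY points above combine naturally in stg_transactions. A minimal sketch, assuming a `"NOMINAL"` amount column and an amount-based tie-break (neither is specified in the plan):

```sql
-- stg_transactions (sketch): cast dirty numerics safely, then keep exactly
-- one row per receipt number using DuckDB's QUALIFY clause.
select
    "NO. KWITANSI"                        as receipt_number,
    try_cast("NOMINAL" as decimal(18, 2)) as amount  -- assumed source column
from {{ source('raw', 'transactions') }}
where "NO. KWITANSI" is not null  -- filter null PKs in staging
qualify row_number() over (
    partition by "NO. KWITANSI"
    order by amount desc nulls last  -- assumed tie-break rule
) = 1
```

QUALIFY filters on the window result directly, so no wrapping subquery is needed; TRY_CAST returns NULL instead of erroring on unparseable values, which the not_null tests will then surface.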

Tradeoffs

  • Staging as views vs tables: views chosen for simplicity (always fresh); if query performance becomes an issue, can be changed to incremental tables later
  • Single staging model per raw table (not per year): one model handles all years via COALESCE, reducing file count; downside is more complex SQL per model
  • No Jinja variables for entity filtering in staging: marts include entity_code/unit_code columns so filtering happens at mart query time in the Format layer

Skills

  • plan — plan file format and conventions

Boundaries

  • Always: Update Progress.md after each task completion
  • Always: All English column names in staging must match the glossary.md terminology exactly
  • Always: Use TRY_CAST for all numeric columns to avoid hard failures on dirty source data
  • Always: Staging models use WHERE to filter null PKs; never pass nulls downstream
  • Ask first: Any new mart model not in the original spec
  • Ask first: Changing materialization strategy from the default (views for stg/int, tables for marts)
  • Never: Apply renaming or business logic in the raw layer (raw = exact source copy)
  • Never: Skip schema tests for staging models

Questions

  • Join key between transactions and students? → NO. KWITANSI (2026+), fall back to name match for prior years
  • Enrollment exclusion list? → Defined in analytics.md, used in int_enrollments
  • Should stg_marketing_activity handle the weekly breakdown columns (M1/M2/M3/M4) or just monthly totals? → Keep monthly channel totals for now; weekly decomposition deferred until raw structure is stabilized.
  • Should mart_revenue include a year_type column (current / last_year / best_year) or leave that computation to the Format layer? → Leave year_type computation to the Format layer.
