Atlas Plan
Plan 003 · 2026-02-20 · Transform Layer

Set Up @packages/transform

Overview

Create the dbt project at @packages/transform. It implements all four transform layers (raw sources, staging, intermediate, marts) using dbt-core + dbt-duckdb. Per-year column normalization is handled in staging SQL — marts always receive consistent English-named columns regardless of source year.

Result: uv run dbt run produces all 4 mart tables (mart_revenue, mart_program_progress, mart_channel_marketing, mart_school_progress) in atlas.db, queryable with real data from all loaded years.

Goals

  • Scaffold @packages/transform with Python/uv project (pyproject.toml, profiles.yml, dbt_project.yml)
  • Declare raw tables as dbt sources (sources.yml)
  • Implement all 4 staging models with per-year COALESCE normalization for column differences
  • Implement staging schema tests (not_null, unique, accepted_values)
  • Implement intermediate models: int_orders and int_enrollments
  • Implement all 4 mart models
  • Implement mart schema tests
  • Verify: uv run dbt run && uv run dbt test passes with real 2026 data

Non-Goals

  • Implementing the Format layer (Plan 004)
  • Running the sync layer — raw tables are assumed to already exist (populated by Plan 002)
  • Odoo-specific transform logic (future)
  • dbt docs serving (nice to have, not required for completion)
  • Multi-entity (EPN) support in SQL — models should accept entity variable but EPN-specific testing is out of scope

Phases

  • Phase 1: Python/dbt scaffold (pyproject.toml, profiles.yml, dbt_project.yml, requirements.txt)
  • Phase 2: Raw source declarations (sources.yml)
  • Phase 3: Staging models + schema tests (stg_transactions, stg_students, stg_targets, stg_marketing_activity)
  • Phase 4: Intermediate models (int_orders, int_enrollments)
  • Phase 5: Mart models (mart_revenue, mart_program_progress, mart_channel_marketing, mart_school_progress) + mart tests
  • Phase 6: Verification — uv run dbt run && uv run dbt test passes

Success

  • @packages/transform/pyproject.toml and requirements.txt exist with dbt-core and dbt-duckdb
  • profiles.yml points to ../../atlas.db with 4 threads
  • dbt_project.yml configures staging/intermediate as views, marts as tables
  • All 4 staging models exist and produce clean English-named columns
  • uv run dbt test passes on all staging schema tests
  • int_orders and int_enrollments exist and join correctly
  • All 4 mart models exist and are materialized as tables
  • uv run dbt run completes without errors on real 2026 data
  • uv run dbt test passes on all mart schema tests

Requirements

  • Plan 002 completed: atlas.db populated with raw tables
  • Python 3.11+ available
  • uv installed (uv --version)
  • dbt-core and dbt-duckdb installable via uv pip
  • atlas.db exists at project root

Context

Why This Approach

  • dbt-core + dbt-duckdb is the standard, purpose-built toolchain for this pattern
  • Staging models as views: no storage cost, always fresh on re-run
  • Mart models as tables: pre-aggregated for fast reads by the Format layer and dashboard
  • Per-year normalization in staging via COALESCE: one SQL model handles all years without branching — COALESCE("INTAKE", SPLIT_PART("BULAN/INTAKE", ' ', 2)) resolves the 2026 split column transparently
  • int_enrollments filters out non-enrollment rows (book sales, events) so all downstream marts are enrollment-only
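The per-year COALESCE approach above can be sketched as a single staging model. This is a minimal illustration, not the final model: the source name, the `row_num`/`amount` columns, and the null-PK filter column are assumptions; only the `"INTAKE"` / `"BULAN/INTAKE"` handling is taken from the plan.

```sql
-- stg_students (sketch): one model handles all years via COALESCE.
-- 2026 sources carry a dedicated "INTAKE" column; earlier years carry a
-- combined "BULAN/INTAKE" column whose second token is the intake value.
select
    try_cast("NO" as integer)                              as row_num,  -- assumed column
    "NAMA SISWA"                                           as student_name,
    coalesce("INTAKE", split_part("BULAN/INTAKE", ' ', 2)) as intake,
    try_cast("NOMINAL" as decimal(18, 2))                  as amount    -- assumed column
from {{ source('raw', 'students') }}
where "NAMA SISWA" is not null  -- never pass null PKs downstream
```

Because the raw layer is an exact source copy, both spellings of the intake column must exist (as NULLs where absent) for COALESCE to resolve them in one branch-free model.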

Key Constraints

  • Raw column names are Indonesian and differ by year — staging models must handle all known variations via COALESCE
  • 2026 LAPORAN EVALUASI adds NO. KWITANSI (receipt number) — this is the join key between transactions and students; prior years must fall back to student name matching
  • int_orders joins stg_transactions × stg_students on NO. KWITANSI where available, falling back to NAMA SISWA = student name
  • Unit code normalization in int_orders: WLC INGGRIS → WLC_ENGLISH, WLC NON INGGRIS → WLC_NON_ENGLISH
  • Customer type normalization in int_orders: B → New, L → Renewal, A → Alumni
  • int_enrollments excludes: BUKU REAL, BUKU WLC, BUKU MANDARIN, EVENT - WLC, EVENT - ICC, SIMULASI JLPT, SIMULASI HSK, KIRIM - REAL, KIRIM - TM, TOP UP POIN and any amount <= 0
  • mart_school_progress groups by year only (no month) — school analysis is annual
  • mart_channel_marketing uses window function for contribution_pct — DuckDB supports this natively

Edge Cases

  • Students in stg_students without a matching transaction (no receipt number, no name match) — left join in int_orders so transaction rows are never dropped
  • Transactions without a matching student row — left join returns NULL student fields; mart queries handle NULLs
  • TRY_CAST used for all numeric columns in staging to avoid parse failures on dirty data
  • QUALIFY ROW_NUMBER() deduplication in stg_transactions on NO. KWITANSI handles duplicate receipt rows in source
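The TRY_CAST and QUALIFY points above combine naturally in stg_transactions. A minimal sketch, assuming a `"NOMINAL"` amount column and an amount-based tie-break (neither is specified in the plan):

```sql
-- stg_transactions (sketch): cast dirty numerics safely, then keep exactly
-- one row per receipt number using DuckDB's QUALIFY clause.
select
    "NO. KWITANSI"                        as receipt_number,
    try_cast("NOMINAL" as decimal(18, 2)) as amount  -- assumed source column
from {{ source('raw', 'transactions') }}
where "NO. KWITANSI" is not null  -- filter null PKs in staging
qualify row_number() over (
    partition by "NO. KWITANSI"
    order by amount desc nulls last  -- assumed tie-break rule
) = 1
```

QUALIFY filters on the window result directly, so no wrapping subquery is needed; TRY_CAST returns NULL instead of erroring on unparseable values, which the not_null tests will then surface.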

Tradeoffs

  • Staging as views vs tables: views chosen for simplicity (always fresh); if query performance becomes an issue, can be changed to incremental tables later
  • Single staging model per raw table (not per year): one model handles all years via COALESCE, reducing file count; downside is more complex SQL per model
  • No Jinja variables for entity filtering in staging: marts include entity_code/unit_code columns so filtering happens at mart query time in the Format layer

Skills

  • plan — plan file format and conventions

Boundaries

  • Always: Update Progress.md after each task completion
  • Always: All English column names in staging must match the glossary.md terminology exactly
  • Always: Use TRY_CAST for all numeric columns to avoid hard failures on dirty source data
  • Always: Staging models use WHERE to filter null PKs; never pass nulls downstream
  • Ask first: Any new mart model not in the original spec
  • Ask first: Changing materialization strategy from the default (views for stg/int, tables for marts)
  • Never: Apply renaming or business logic in the raw layer (raw = exact source copy)
  • Never: Skip schema tests for staging models

Questions

  • Join key between transactions and students? → NO. KWITANSI (2026+), fall back to name match for prior years
  • Enrollment exclusion list? → Defined in analytics.md, used in int_enrollments
  • Should stg_marketing_activity handle the weekly breakdown columns (M1/M2/M3/M4) or just monthly totals? → Keep monthly channel totals for now; weekly decomposition deferred until raw structure is stabilized.
  • Should mart_revenue include a year_type column (current / last_year / best_year) or leave that computation to the Format layer? → Leave year_type computation to the Format layer.
