Atlas Plan

Phase 1: Audit Documentation

Purpose: Inspect and document every sheet in all 8 source xlsx files as a permanent reference
Status: COMPLETE — done in planning session 2026-02-21. See @plan/audit/.

T-001 - Create @plan/audit/ and sheet index

Status: completed
Priority: P0
Dependencies: none

Acceptance

@plan/audit/ directory exists
@plan/audit/sheet-index.md exists with full sheet table across all 8 files
All known anomalies documented (SEPTEMBER25 typo, REKAP FEBUARI typo, leading space in REKAP JULI, hidden rows, ICC vs ICA distinction)

T-002 - Create per-file audit snapshots

Status: completed
Priority: P1
Dependencies: T-001

Acceptance

8 snapshot docs exist in @plan/audit/
Each doc includes: file path, all sheets with type/status, column lists for data sheets, header rows, row counts, anomalies

Files

@plan/audit/ions-2023-laporan.md
@plan/audit/ions-2023-database-siswa.md
@plan/audit/ions-2024-laporan.md
@plan/audit/ions-2024-database-siswa.md
@plan/audit/ions-2025-laporan.md
@plan/audit/ions-2025-database-siswa.md
@plan/audit/ions-2026-laporan.md
@plan/audit/ions-2026-database-siswa.md

Phase 2: YAML Config Fixes

Purpose: Correct all structural errors in @source/config YAML files

T-003 - Fix ions-2026.yaml

Update 2026 config: fix targets header rows, add organizations entry, remove marketing_activity entries.

Status: completed
Priority: P0
Dependencies: none

Acceptance

transactions entry: header_row: 6, no merge_header_rows (flat header)
targets entry: merge_header_rows: [10, 11], data_start_row: 12 (rows 1–9 are letter salutation + month label; row 10 = outer header, row 11 = sub-labels)
students entry: all 12 monthly sheets listed with correct 2026 suffix pattern; SEPTEMBER25 typo preserved (exact sheet name in file)
organizations entry added: file = DATABASE SISWA 2026, sheet = REFERENSI, type: referensi
All marketing_activity / REKAP sheet entries removed
No references to REKAP * sheets anywhere in the file

Notes

Targets 2-row header: row 10 outer labels (NO., TENANT, BIAYA STANDAR, TARGET REVENUE, TARGET, REALISASI REVENUE, REALISASI, REALISASI SISWA RIIL, JUMLAH KELAS, RATA-RATA JUMLAH SISWA PER KELAS, JUMLAH SISWA AKTIF SAAT INI, JUMLAH TOTAL KELAS AKTIF, RATA-RATA JUMLAH SISWA TIAP KELAS AKTIF). Row 11 sub-labels: SISWA STANDAR under col F (TARGET) and col H (REALISASI).
The extract reader must merge rows 10+11 the same way as the 2023 transaction merge.

T-004 - Fix ions-2025.yaml and ions-2024.yaml

Both years have flat single-row headers at row 6. No merge_header_rows needed. Add organizations entries. Remove marketing_activity entries.

Status: completed
Priority: P0
Dependencies: none

Acceptance

ions-2025.yaml transactions entry: header_row: 6, no merge_header_rows
ions-2025.yaml has no targets entry (sheet absent in 2025)
ions-2025.yaml organizations entry added: DATABASE SISWA 2025, sheet = REFERENSI, type: referensi
ions-2025.yaml all marketing_activity / REKAP entries removed
ions-2024.yaml same structure as 2025
ions-2024.yaml organizations entry added: DATABASE SISWA 2024, sheet = REFERENSI, type: referensi
All 12 monthly student sheets listed with correct year suffix (25, 24)
Marketing activity sheet names (REKAP *, ' REKAP JULI') removed from both files

Files

@source/config/ions-2025.yaml
@source/config/ions-2024.yaml

T-005 - Fix ions-2023.yaml

2023 is the only year with a genuine 2-row merged header in transactions. No organizations sheet. No targets.

Status: completed
Priority: P0
Dependencies: none

Acceptance

transactions entry: merge_header_rows: [6, 7] (row 6 = NO, HARI, TANGGAL, BARSHEET, PENDAPATAN[group], TENANT...; row 7 = PROGRAM, INTAKE, NAMA SISWA, JUMLAH under PENDAPATAN)
No targets entry
No organizations entry (REFERENSI sheet absent in 2023)
All 12 student sheets listed with 2023 naming pattern: JANUARI 23, FEBRUARI 23, ..., DESEMBER (no year suffix on December)
All marketing_activity / REKAP entries removed (REKAP JAN, REKAP FEB, etc.)

Notes

2023 marketing REKAP sheets use short names: REKAP JAN, REKAP FEB, REKAP MAR, REKAP APRIL, REKAP MEI, REKAP JUNI, REKAP JULI, REKAP AGUSTUS, REKAP SEPTEMBER, REKAP OKTOBER, REKAP NOVEMBER, REKAP DESEMBER
These must all be removed from config, not just marked absent

T-006 - Define canonical CSV schema and extend SourceConfig type

Create a schema.ts file with the authoritative column definitions for all 4 file types. Extend SourceFileConfig TypeScript type with merge_header_rows?: number[], data_start_row?: number, and type field including 'referensi'.

Status: completed
Priority: P0
Dependencies: T-003, T-004, T-005

Acceptance

@packages/pipeline/@source/extract/schema.ts exported constants for each file type:
- TRANSACTIONS_COLUMNS: string[] — 17 canonical English column names in order
- STUDENTS_COLUMNS: string[] — 26 canonical English column names in order
- TARGETS_COLUMNS: string[] — 13 canonical English column names in order
- ORGANIZATIONS_COLUMNS: string[] — 3 canonical English column names
SourceFileConfig type updated with merge_header_rows?: number[], data_start_row?: number
File type union updated to include 'referensi' alongside existing types
pnpm --filter @packages/pipeline test:type passes

Column reference — transactions (17):

no, day, date, barsheet, receipt, program, variant, intake, student, amount, unit, category, payment_via, payment_channel, student_status, student_kind, payment_status

Column reference — students (26):

no, intake, datetime, name, birth_place_date, birth_place, birth_date, identity, phone, email, address, organization, unit, age_cluster, course_type, program, class_level, schedule, meetings, duration, class_type, payment_method, payment_pattern, source, sex, verified

Column reference — targets (13):

no, unit, cost_standard, revenue_target, student_target, revenue_actuals, student_actuals_standard, student_actuals_real, class_count, avg_students_per_class, active_students_now, active_classes_total, avg_students_active

Column reference — organizations (3):

organization, category, age_cluster

Files

@packages/pipeline/@source/extract/schema.ts
@packages/pipeline/@source/config.ts

Phase 3: Extract Step

Purpose: Implement SheetJS-based xlsx → CSV extraction for all years

T-007 - Implement extract/reader.ts and extract/referensi.ts

Two reading functions: readSheet for standard flat and merged-header sheets, readReferensiSheet for the REFERENSI pivot matrix. Both output { headers: string[], rows: Record<string, string>[] }.

Status: completed
Priority: P0
Dependencies: T-006

Acceptance — readSheet

@packages/pipeline/@source/extract/reader.ts exports readSheet(filePath, sheetName, options) where options includes headerRow, columns, mergeHeaderRows?, dataStartRow?
Standard single-row: reads row at headerRow, maps Indonesian → English via inverted columns map, returns data rows starting at headerRow + 1 (or dataStartRow if specified)
merge_header_rows: [R1, R2] case: reads row R1 and R2. For each column position: if R1 value is non-empty AND R2 value is non-empty, use R2 value as column name (sub-label wins). Otherwise use R1 value. Then map to English via columns.
Reads ALL rows regardless of hidden state — never checks row_dimensions
Skips rows where all mapped column values are empty/null
For students: splits T/T/L value at first comma → birth_place (before) and birth_date (after, trimmed). If no comma, populates birth_place_date only; leaves birth_place and birth_date as empty string.
Missing source columns (e.g. receipt absent in 2023) output as empty string for that column
Returns rows with exactly the canonical column set (from schema.ts) — no extra columns, no missing columns
Vitest unit test: @source/__tests__/extract.test.ts — tests single-row header, 2-row merged header, missing columns, birth_place split

Acceptance — readReferensiSheet

@packages/pipeline/@source/extract/referensi.ts exports readReferensiSheet(filePath, sheetName)
Locates ORGANISASI header in the sheet (cell containing 'ORGANISASI')
Reads category header row (row below ORGANISASI header): SMA, SMP, TK-SD, UNIVERSITAS, LAIN-LAIN
Reads age_cluster column (col B): KIDS, TEENS, ADULTS — carried forward until next non-empty value
For each matrix cell that is non-empty: emits { organization, category, age_cluster }
Stops when it hits the next section (TIPE KURSUS or similar non-organization heading)
Returns { headers: ['organization', 'category', 'age_cluster'], rows: [...] }

Notes

SheetJS: use XLSX.utils.sheet_to_json(ws, { header: 1, raw: false }) to get raw row arrays as strings
Invert the YAML columns map (values are Indonesian source names, keys are English target names) for lookup
For merge detection: group names that trigger sub-label preference: any R1 value where R2 position has a non-empty value. Do NOT hardcode group names like PENDAPATAN — detect generically.

Files

@packages/pipeline/@source/extract/reader.ts
@packages/pipeline/@source/extract/referensi.ts
@packages/pipeline/@source/tests/extract.test.ts

T-008 - Implement extract/writer.ts

CSV writing module. Writes canonical-column CSVs to @source/clean/. Implements the frozen-snapshot overwrite guard.

Status: completed
Priority: P0
Dependencies: T-007

Acceptance

@packages/pipeline/@source/extract/writer.ts exports writeCsv(outputPath, headers, rows, overwrite?)
If file exists and overwrite is false (default): logs [extract] skip: {filename} already exists and returns
If file does not exist or overwrite is true: writes CSV
Header row written first, then data rows
Values containing commas, double-quotes, or newlines wrapped in double-quotes; internal quotes escaped as ""
UTF-8 output, no BOM
Empty rows (all values empty string) still written — header-only CSVs for future months are valid
Vitest unit test: @source/__tests__/writer.test.ts — tests overwrite guard, quoting edge cases, header-only output

Files

@packages/pipeline/@source/extract/writer.ts
@packages/pipeline/@source/tests/writer.test.ts

T-009 - Implement extract/index.ts and wire CLI

Main extract orchestrator. Iterates config entries, dispatches to readSheet or readReferensiSheet, calls writeCsv. Generates output filename per the naming convention.

Status: completed
Priority: P0
Dependencies: T-007, T-008

Acceptance

@packages/pipeline/@source/extract/index.ts exports extractAll(config, options) where options = { cleanDir, overwrite }
Filename convention: ions-{YYYY}-{MM}-{type}.csv for monthly sheets; ions-{YYYY}-organizations.csv for REFERENSI
- {MM} = zero-padded month number derived from sheet name (e.g. JANUARI → 01, FEBRUARI → 02)
- Targets special case: {MM} derived from period label in row 9 (DATA BULAN FEBRUARI 2026 → 02)
- Examples: ions-2026-01-transactions.csv, ions-2026-02-targets.csv, ions-2025-organizations.csv
Dispatches to readReferensiSheet when file type is referensi
Dispatches to readSheet for all other types
Logs per-file result: [extract] ions-2026-01-transactions.csv: 847 rows or [extract] skip: already exists
pnpm run extract -- --entity IONS --year 2026 runs 2026 extraction
pnpm run extract -- --entity IONS --year 2026 --overwrite re-extracts existing files
pnpm --filter @packages/pipeline test:type passes

Indonesian month → number mapping (hardcoded constant):

JANUARI=01, FEBRUARI=02, MARET=03, APRIL=04, MEI=05, JUNI=06, JULI=07, AGUSTUS=08, SEPTEMBER=09, OKTOBER=10, NOVEMBER=11, DESEMBER=12

Files

@packages/pipeline/@source/extract/index.ts
@packages/pipeline/@source/index.ts (update: wire extract sub-command, pass --overwrite flag)

T-010 - Extract all years to @source/clean/ and commit

Run extraction for all 4 years. Verify row counts and column names. Commit all 100 CSVs.

Status: completed
Priority: P0
Dependencies: T-009, T-003, T-004, T-005

Acceptance

2023: 12 transaction CSVs + 12 student CSVs = 24 files
2024: 12 transaction CSVs + 12 student CSVs + 1 organizations CSV = 25 files
2025: 12 transaction CSVs + 12 student CSVs + 1 organizations CSV = 25 files
2026: 12 transaction CSVs + 12 student CSVs + 1 targets CSV + 1 organizations CSV = 26 files
Total: 100 CSV files in @source/clean/
ions-2026-02-targets.csv has 13 correct English column headers (not C0, _1...)
ions-2023-01-transactions.csv has program, intake, student, amount columns (from merged row 7)
ions-2026-01-students.csv has birth_place_date, birth_place, birth_date columns
All files committed to git with message: data(pipeline): extract @source/clean CSVs for all years 2023–2026

Expected row counts (approximate):

Year	Transactions	Students
2023	~7,914 total across 12 months	~7,844 total
2024	~8,666 total	~10,201 total
2025	~10,110 total	~10,168 total
2026	~1,191 real rows (Jan+Feb only; other months near-zero stubs)	~1,640 real rows

Files

@source/clean/ (all 100 CSV files)

Phase 4: Load Step

Purpose: Refactor load to read from @source/clean/ CSVs; remove marketing_activity

T-011 - Refactor load/csv.ts to read from @source/clean/

Replace the xlsx-direct load with CSV loading via DuckDB read_csv_auto. Remove the marketing_activity load block entirely.

Status: completed
Priority: P0
Dependencies: T-009

Acceptance

load/csv.ts syncCsv(config, options) reads from @source/clean/ by default
Constructs CSV filename using same convention as extract: ions-{YYYY}-{MM}-{type}.csv and ions-{YYYY}-organizations.csv
Uses DuckDB read_csv_auto with header := true, all_varchar := true
If CSV does not exist: logs warning [sync] skip: {filename} not found — run extract first and continues
If CSV is header-only (0 data rows): logs [sync] {table}/{sheet}: 0 rows (empty sheet) and skips load
marketing_activity file type entries in config are ignored / load block removed
pnpm run sync -- --entity IONS --year 2026 loads from CSVs; raw_targets has correct columns; raw_students ≤ 15k rows
pnpm --filter @packages/pipeline test:type passes

Notes

--source csv is now the default; --source xlsx can remain as a legacy flag for debugging (still points to old xlsx adapter)
raw_marketing_activity is no longer created in DuckDB — this is intentional

Files

@packages/pipeline/@source/load/csv.ts
@packages/pipeline/@source/index.ts (update default source)

Phase 5: Validate Step

Purpose: SQL checks on raw_* tables after load

T-012 - Implement validate/checks.ts

Define all validation checks as typed objects. Pure logic, no DuckDB connection.

Status: completed
Priority: P0
Dependencies: none

Acceptance

@packages/pipeline/@source/validate/checks.ts exports ValidationCheck[]
ValidationCheck type: { name: string, description: string, severity: 'error' | 'warning', sql: string }
Checks implemented:
- raw_transactions_empty — error: SELECT COUNT(*) FROM raw_transactions HAVING COUNT(*) = 0
- raw_students_empty — error: SELECT COUNT(*) FROM raw_students HAVING COUNT(*) = 0
- raw_targets_empty — warning: SELECT COUNT(*) FROM raw_targets HAVING COUNT(*) = 0 (only 2026 has targets)
- raw_targets_bad_columns — error: checks that column unit exists and is not entirely null in raw_targets
- raw_students_bloat — warning: row count > 20,000 signals junk sheets were loaded
- raw_transactions_invalid_year — warning: rows where extracted year from intake < 2020 or > current year + 1
- raw_transactions_unknown_unit — warning: unit values not in known set (REAL, IONs, IONS, ICC, ICA, NP, DTC, TM, WLC, WLC INGGRIS, WLC NON INGGRIS, DETECTION, MANDARIN)
- raw_transactions_negative_amount — warning: TRY_CAST(amount AS DOUBLE) < 0
No check for raw_marketing_activity — it no longer exists
Vitest unit test: @source/__tests__/validate.test.ts — tests check SQL strings and severity values

Files

@packages/pipeline/@source/validate/checks.ts
@packages/pipeline/@source/tests/validate.test.ts

T-013 - Implement validate/index.ts and wire CLI

Main validate orchestrator. Runs all checks, writes JSON report.

Status: completed
Priority: P0
Dependencies: T-012

Acceptance

@packages/pipeline/@source/validate/index.ts exports validateAll(config, options)
Runs all checks from checks.ts against atlas.db
Collects: check name, severity, passed/failed, failing row count, sample rows (up to 5)
Writes output/validation/{entity.toLowerCase()}-{year}-validation.json (2-space indent)
Console summary: ✓ for passed, ✗ for failed; groups errors then warnings
Exit code 0 if all pass or warnings only; exit code 1 if any error-severity check fails
pnpm run validate -- --entity IONS --year 2026 runs and produces report
pnpm --filter @packages/pipeline test:type passes

Files

@packages/pipeline/@source/validate/index.ts
@packages/pipeline/@source/index.ts (update: wire validate sub-command)

Phase 6: dbt Quality

Purpose: Remove dead model; add is_valid flagging; add audit model

T-014 - Remove stg_marketing_activity and raw_marketing_activity

Delete the dead staging model and its source declaration. Verify dbt still runs without error.

Status: completed
Priority: P0
Dependencies: T-011

Acceptance

@python/analytics/models/staging/stg_marketing_activity.sql deleted
raw_marketing_activity source entry removed from @python/analytics/models/staging/sources.yml
mart_channel_marketing schema entry in marts/schema.yml updated if it referenced stg_marketing_activity (it doesn't — but verify)
cd @python/analytics && uv run dbt run succeeds without errors
cd @python/analytics && uv run dbt test passes all remaining tests
mart_channel_marketing still populates correctly (it reads int_enrollments, unaffected)

Notes

mart_channel_marketing reads from int_enrollments via payment_via/payment_channel columns — completely independent of REKAP data. Verify this with a row count check before and after.

Files

@python/analytics/models/staging/stg_marketing_activity.sql (delete)
@python/analytics/models/staging/sources.yml

T-015 - Add is_valid and invalid_reason to int_orders.sql

Add two computed columns. Priority order of invalidity checks matters — first matching condition sets the reason.

Status: completed
Priority: P0
Dependencies: T-014

Acceptance

int_orders.sql has is_valid (boolean) and invalid_reason (varchar) in the final SELECT
is_valid CASE logic (in priority order):
1. period_year IS NULL → false, reason 'null_period'
2. period_year < 2020 → false, reason 'invalid_period_year'
3. period_year > EXTRACT(YEAR FROM CURRENT_DATE) + 1 → false, reason 'invalid_period_year'
4. amount IS NULL OR amount <= 0 → false, reason 'invalid_amount'
5. unit_code_normalized IS NULL → false, reason 'unknown_unit'
6. else → true, reason NULL
cd @python/analytics && uv run dbt run --select int_orders succeeds
No null values in is_valid column (every row is true or false)

Files

@python/analytics/models/intermediate/int_orders.sql

T-016 - Filter int_enrollments to is_valid = true

One-line change. All mart models inherit the filter via int_enrollments.

Status: completed
Priority: P0
Dependencies: T-015

Acceptance

int_enrollments.sql WHERE clause includes AND is_valid = true
uv run dbt run completes all models
mart_revenue row count is plausible (lower than before due to invalid row exclusion)
uv run dbt test passes all schema tests

Files

@python/analytics/models/intermediate/int_enrollments.sql

T-017 - Create audit_flagged_orders dbt view

Exposes all invalid orders for inspection. View — no storage cost, always fresh.

Status: completed
Priority: P1
Dependencies: T-015

Acceptance

@python/analytics/models/intermediate/audit_flagged_orders.sql exists
Configured as {{ config(materialized='view') }}
SELECT * FROM {{ ref('int_orders') }} WHERE is_valid = false
Schema entry in @python/analytics/models/intermediate/schema.yml
uv run dbt run --select audit_flagged_orders succeeds

Files

@python/analytics/models/intermediate/audit_flagged_orders.sql
@python/analytics/models/intermediate/schema.yml (new or updated)

T-018 - Add dbt schema tests for is_valid columns

Ensure is_valid and invalid_reason behave correctly.

Status: completed
Priority: P1
Dependencies: T-017

Acceptance

Schema test: is_valid has no nulls in int_orders
Schema test: invalid_reason is not null for all rows where is_valid = false
uv run dbt test passes all tests including new ones

Files

@python/analytics/models/intermediate/schema.yml

T-019 - End-to-end verification and AGENTS.md update

Full pipeline run from extract through format. Update docs.

Status: cancelled — superseded by T-019a, T-019b, T-019c (all completed)
Priority: P0
Dependencies: T-010, T-011, T-013, T-016, T-017

Acceptance

pnpm run extract -- --entity IONS --year 2026 skips existing CSVs (frozen)
pnpm run sync -- --entity IONS --year 2026 loads from CSVs; raw_targets ~55 rows with correct columns; raw_students ≤ 15k rows; no raw_marketing_activity table
pnpm run validate -- --entity IONS --year 2026 exits 0; JSON report written; no error-severity failures
cd @python/analytics && uv run dbt run completes; audit_flagged_orders view exists
cd @python/analytics && uv run dbt test passes
pnpm run format -- --entity IONS --period 2026-02 produces output/report.json with correct sections
Root AGENTS.md updated: extract command documented; @source/clean/ frozen CSV snapshots explained; REKAP exclusion noted; audit_flagged_orders mentioned
@packages/pipeline/AGENTS.md updated: extract + validate sub-commands documented; readReferensiSheet noted
@python/analytics/AGENTS.md updated: is_valid/invalid_reason pattern documented; stg_marketing_activity removal noted

Files

AGENTS.md
@packages/pipeline/AGENTS.md
@python/analytics/AGENTS.md
output/validation/ions-2026-validation.json (generated, gitignored)

T-019a - Update extract step: amount formatting, date formatting, identity rename

Update reader.ts to produce cleaner values in CSVs. Re-extract all years with --overwrite and commit updated CSVs.

Status: completed
Priority: P0
Dependencies: T-019

Context

The staging SQL currently fails because:

amount values are "100,000" strings — TRY_CAST returns NULL, filtering all rows out
date values may be Excel serial numbers or locale strings — needs ISO "2026-01-02" format
identity column name should be id_raw to reflect its role

These are formatting concerns belonging in the extract step, not staging SQL.

Boundary

Layer	Responsibility
Extract (CSV)	Amount plain number, date ISO, `id_raw` raw string
Staging SQL	Identity split, period parsing, unit normalization, customer type translation

Acceptance

Notes

SheetJS raw: false may already convert Excel serials to locale strings (e.g. "1/2/2026"). The formatter needs to handle both forms.
birth_date in students is a human-readable text like "18 Desember 2003" — do NOT apply formatDate to it. It is not a machine date.
datetime in students often has time component ("2026-01-02 00:00:00") — strip time component, output date only.
formatAmount must handle Indonesian thousands separator (.) as well as Western (,) — some years use periods.

Files

@packages/pipeline/@source/extract/reader.ts
@packages/pipeline/@source/extract/schema.ts
@packages/pipeline/@source/tests/extract.test.ts
@source/clean/ (all 100 CSVs re-extracted with --overwrite)

T-019b - Simplify staging SQL: direct English column refs, identity split, period parsing

Replace all atlas_coalesce_columns_or_null macro lookups with direct English column references. Add identity split columns. Add period parsing. Simplify int_orders to use staging outputs.

Status: completed
Priority: P0
Dependencies: T-019a

What changes

stg_transactions.sql — full rewrite to direct column refs:

Remove all atlas_coalesce_columns_or_null / atlas_column_or_null macro calls
Direct references: receipt, date, unit, program, variant, intake, student, amount, payment_via, payment_channel, student_status, student_kind, payment_status, category, barsheet
TRY_CAST(amount AS DECIMAL(18,2)) now works directly (no REPLACE needed — amount is already plain number)
TRY_CAST(date AS DATE) sufficient (no atlas_parse_date needed — date is already ISO)

Period parsing added here (from intake):

CASE UPPER(intake)
  WHEN 'JANUARI%' THEN 1 ... WHEN 'DESEMBER%' THEN 12
  ELSE NULL
END AS period_month,
TRY_CAST(REGEXP_EXTRACT(intake, '(20[0-9]{2})', 1) AS INTEGER) AS period_year

Customer type translation added here:

CASE UPPER(student_status)
  WHEN 'B' THEN 'New'
  WHEN 'L' THEN 'Renewal'
  WHEN 'A' THEN 'Alumni'
  ELSE student_status
END AS customer_type

Receipt synthesis for legacy rows (no receipt column): CONCAT('LEGACY-', no, '-', program, '-', student, '-', date)
WHERE: NULLIF(receipt,'') IS NOT NULL OR NULLIF(program,'') IS NOT NULL
Cleaned CTE: amount IS NOT NULL AND unit IS NOT NULL AND program IS NOT NULL AND program <> 'program'

stg_students.sql — full rewrite to direct column refs + identity split:

Remove all atlas_coalesce_columns_or_null macro calls
Direct references: id_raw, name, datetime, organization, unit, age_cluster, course_type, program, class_level, intake, source, student_status, payment_method, sex, receipt
Unit normalization CASE stays in staging (not CSV): UPPER(unit) IN ('WLC INGGRIS'...) THEN 'WLC_ENGLISH'
Period parsing added (same as transactions, from intake)
Identity split (5 columns):
- id_ktp: type = KTP or KKTP
- id_kp: type = KARTU PELAJAR or KARTU PELAJAR PS
- id_sim: type = SIM
- id_pass: type = PASSPORT or PASPORT
- id_nis: type = NIS or NOMOR INDUK SEKOLAH or NOMOR INDUK
- Unrecognized types → all 5 NULL (visible via id_raw)
WHERE: NULLIF(name,'') IS NOT NULL AND NULLIF(name,'') NOT IN ('name','NAMA')

int_orders.sql — simplification:

Remove the 24-line CASE block for period_month (now from staging)
Remove the period_year COALESCE block (now from staging)
Remove the customer_type CASE block (B→New etc.) — now from stg_transactions
period_year = COALESCE(t.period_year, s.period_year)
period_month = COALESCE(t.period_month, s.period_month)
customer_type = t.customer_type (already translated)
Identity columns passed through from stg_students: id_raw, id_ktp, id_kp, id_sim, id_pass, id_nis

Acceptance

Notes

atlas_parse_date macro can be removed from stg_transactions/stg_students since dates are now ISO. Keep the macro definition in helpers.sql in case other models need it.
stg_targets.sql is unaffected — it reads from raw_targets which has its own column schema.
The QUALIFY ROW_NUMBER() deduplication in stg_transactions should be preserved.
stg_students WHERE filter must exclude header-artifact rows: name NOT IN ('name', 'NAMA') to handle any CSV header leak.

Files

@python/analytics/models/staging/stg_transactions.sql
@python/analytics/models/staging/stg_students.sql
@python/analytics/models/intermediate/int_orders.sql
@python/analytics/models/intermediate/schema.yml (add id_ktp, id_kp, id_sim, id_pass, id_nis to int_orders docs)

T-019c - Final end-to-end verification

Run the complete pipeline. Verify marts are populated. Verify format output. Close plan.

Status: completed
Priority: P0
Dependencies: T-019b

Acceptance

Files

AGENTS.md (update)
output/report.json (generated, gitignored)

On this page