Phase 1: Audit Documentation
- Purpose: Inspect and document every sheet in all 8 source xlsx files as a permanent reference
- Status: COMPLETE — done in planning session 2026-02-21. See `@plan/audit/`.
T-001 - Create @plan/audit/ and sheet index
- Status: completed
- Priority: P0
- Dependencies: none
Acceptance
- `@plan/audit/` directory exists
- `@plan/audit/sheet-index.md` exists with full sheet table across all 8 files
- All known anomalies documented (`SEPTEMBER25` typo, `REKAP FEBUARI` typo, leading space in `' REKAP JULI'`, hidden rows, ICC vs ICA distinction)
Files
- @plan/audit/sheet-index.md
T-002 - Create per-file audit snapshots
- Status: completed
- Priority: P1
- Dependencies: T-001
Acceptance
- 8 snapshot docs exist in `@plan/audit/`
- Each doc includes: file path, all sheets with type/status, column lists for data sheets, header rows, row counts, anomalies
Files
- @plan/audit/ions-2023-laporan.md
- @plan/audit/ions-2023-database-siswa.md
- @plan/audit/ions-2024-laporan.md
- @plan/audit/ions-2024-database-siswa.md
- @plan/audit/ions-2025-laporan.md
- @plan/audit/ions-2025-database-siswa.md
- @plan/audit/ions-2026-laporan.md
- @plan/audit/ions-2026-database-siswa.md
Phase 2: YAML Config Fixes
- Purpose: Correct all structural errors in @source/config YAML files
T-003 - Fix ions-2026.yaml
Update 2026 config: fix targets header rows, add organizations entry, remove marketing_activity entries.
- Status: completed
- Priority: P0
- Dependencies: none
Acceptance
- `transactions` entry: `header_row: 6`, no `merge_header_rows` (flat header)
- `targets` entry: `merge_header_rows: [10, 11]`, `data_start_row: 12` (rows 1–9 are letter salutation + month label; row 10 = outer header, row 11 = sub-labels)
- `students` entry: all 12 monthly sheets listed with correct 2026 suffix pattern; `SEPTEMBER25` typo preserved (exact sheet name in file)
- `organizations` entry added: file = DATABASE SISWA 2026, sheet = `REFERENSI`, `type: referensi`
- All `marketing_activity` / REKAP sheet entries removed
- No references to `REKAP *` sheets anywhere in the file
Notes
- Targets 2-row header: row 10 outer labels (`NO.`, `TENANT`, `BIAYA STANDAR`, `TARGET REVENUE`, `TARGET`, `REALISASI REVENUE`, `REALISASI`, `REALISASI SISWA RIIL`, `JUMLAH KELAS`, `RATA-RATA JUMLAH SISWA PER KELAS`, `JUMLAH SISWA AKTIF SAAT INI`, `JUMLAH TOTAL KELAS AKTIF`, `RATA-RATA JUMLAH SISWA TIAP KELAS AKTIF`). Row 11 sub-labels: `SISWA STANDAR` under col F (TARGET) and col H (REALISASI).
- The extract reader must merge rows 10+11 the same way as the 2023 transaction merge.
Files
- @source/config/ions-2026.yaml
T-004 - Fix ions-2025.yaml and ions-2024.yaml
Both years have flat single-row headers at row 6. No merge_header_rows needed. Add organizations entries. Remove marketing_activity entries.
- Status: completed
- Priority: P0
- Dependencies: none
Acceptance
- `ions-2025.yaml` `transactions` entry: `header_row: 6`, no `merge_header_rows`
- `ions-2025.yaml` has no `targets` entry (sheet absent in 2025)
- `ions-2025.yaml` `organizations` entry added: DATABASE SISWA 2025, sheet = `REFERENSI`, `type: referensi`
- `ions-2025.yaml`: all `marketing_activity` / REKAP entries removed
- `ions-2024.yaml`: same structure as 2025
- `ions-2024.yaml` `organizations` entry added: DATABASE SISWA 2024, sheet = `REFERENSI`, `type: referensi`
- All 12 monthly student sheets listed with correct year suffix (`25`, `24`)
- Marketing activity sheet names (`REKAP *`, `' REKAP JULI'`) removed from both files
Files
- @source/config/ions-2025.yaml
- @source/config/ions-2024.yaml
T-005 - Fix ions-2023.yaml
2023 is the only year with a genuine 2-row merged header in transactions. No organizations sheet. No targets.
- Status: completed
- Priority: P0
- Dependencies: none
Acceptance
- `transactions` entry: `merge_header_rows: [6, 7]` (row 6 = `NO`, `HARI`, `TANGGAL`, `BARSHEET`, `PENDAPATAN` [group], `TENANT`...; row 7 = `PROGRAM`, `INTAKE`, `NAMA SISWA`, `JUMLAH` under PENDAPATAN)
- No `targets` entry
- No `organizations` entry (REFERENSI sheet absent in 2023)
- All 12 student sheets listed with 2023 naming pattern: `JANUARI 23`, `FEBRUARI 23`, ..., `DESEMBER` (no year suffix on December)
- All `marketing_activity` / REKAP entries removed (`REKAP JAN`, `REKAP FEB`, etc.)
Notes
- 2023 marketing REKAP sheets use short names: `REKAP JAN`, `REKAP FEB`, `REKAP MAR`, `REKAP APRIL`, `REKAP MEI`, `REKAP JUNI`, `REKAP JULI`, `REKAP AGUSTUS`, `REKAP SEPTEMBER`, `REKAP OKTOBER`, `REKAP NOVEMBER`, `REKAP DESEMBER`
- These must all be removed from the config, not just marked absent
Files
- @source/config/ions-2023.yaml
T-006 - Define canonical CSV schema and extend SourceConfig type
Create a schema.ts file with the authoritative column definitions for all 4 file types. Extend SourceFileConfig TypeScript type with merge_header_rows?: number[], data_start_row?: number, and type field including 'referensi'.
- Status: completed
- Priority: P0
- Dependencies: T-003, T-004, T-005
Acceptance
- `@packages/pipeline/@source/extract/schema.ts` exports constants for each file type:
  - `TRANSACTIONS_COLUMNS: string[]` — 17 canonical English column names in order
  - `STUDENTS_COLUMNS: string[]` — 26 canonical English column names in order
  - `TARGETS_COLUMNS: string[]` — 13 canonical English column names in order
  - `ORGANIZATIONS_COLUMNS: string[]` — 3 canonical English column names
- `SourceFileConfig` type updated with `merge_header_rows?: number[]`, `data_start_row?: number`
- File type union updated to include `'referensi'` alongside existing types
- `pnpm --filter @packages/pipeline test:type` passes
Column reference — transactions (17):
no, day, date, barsheet, receipt, program, variant, intake, student, amount, unit, category, payment_via, payment_channel, student_status, student_kind, payment_status
Column reference — students (26):
no, intake, datetime, name, birth_place_date, birth_place, birth_date, identity, phone, email, address, organization, unit, age_cluster, course_type, program, class_level, schedule, meetings, duration, class_type, payment_method, payment_pattern, source, sex, verified
Column reference — targets (13):
no, unit, cost_standard, revenue_target, student_target, revenue_actuals, student_actuals_standard, student_actuals_real, class_count, avg_students_per_class, active_students_now, active_classes_total, avg_students_active
Column reference — organizations (3):
organization, category, age_cluster
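The constants and type extension above could look like this minimal sketch; the exact export shapes are assumptions, with only the column names and option fields taken from the task:

```typescript
// Sketch only — assumed shapes for schema.ts and the SourceFileConfig
// extension; everything beyond the names listed in the task is illustrative.
export const ORGANIZATIONS_COLUMNS: string[] = [
  "organization",
  "category",
  "age_cluster",
];

export type SourceFileType =
  | "transactions"
  | "students"
  | "targets"
  | "referensi"; // newly added member of the union

export interface SourceFileConfig {
  type: SourceFileType;
  header_row?: number;
  merge_header_rows?: number[]; // e.g. [10, 11] for the 2026 targets sheet
  data_start_row?: number; // e.g. 12 when data starts below the sub-labels
}
```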
Files
- @packages/pipeline/@source/extract/schema.ts
- @packages/pipeline/@source/config.ts
Phase 3: Extract Step
- Purpose: Implement SheetJS-based xlsx → CSV extraction for all years
T-007 - Implement extract/reader.ts and extract/referensi.ts
Two reading functions: `readSheet` for standard flat-header and merged-header sheets, `readReferensiSheet` for the REFERENSI pivot matrix. Both output `{ headers: string[], rows: Record<string, string>[] }`.
- Status: completed
- Priority: P0
- Dependencies: T-006
Acceptance — readSheet
- `@packages/pipeline/@source/extract/reader.ts` exports `readSheet(filePath, sheetName, options)` where options include `headerRow`, `columns`, `mergeHeaderRows?`, `dataStartRow?`
- Standard single-row case: reads the row at `headerRow`, maps Indonesian → English via the inverted `columns` map, returns data rows starting at `headerRow + 1` (or `dataStartRow` if specified)
- `merge_header_rows: [R1, R2]` case: reads rows R1 and R2. For each column position: if the R1 value is non-empty AND the R2 value is non-empty, use the R2 value as the column name (sub-label wins); otherwise use the R1 value. Then map to English via `columns`.
- Reads ALL rows regardless of hidden state — never checks `row_dimensions`
- Skips rows where all mapped column values are empty/null
- For `students`: splits the `T/T/L` value at the first comma → `birth_place` (before) and `birth_date` (after, trimmed). If there is no comma, populates `birth_place_date` only; leaves `birth_place` and `birth_date` as empty strings.
- Missing source columns (e.g. `receipt` absent in 2023) output as empty strings for that column
- Returns rows with exactly the canonical column set (from `schema.ts`) — no extra columns, no missing columns
- Vitest unit test: `@source/__tests__/extract.test.ts` — tests single-row header, 2-row merged header, missing columns, birth_place split
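The birth_place split rule can be sketched as follows; the helper name is hypothetical, and whether `birth_place_date` is also kept when a comma is present is an assumption:

```typescript
// Sketch of the T/T/L split rule (hypothetical helper name).
// Splits "Surabaya, 18 Desember 2003" at the FIRST comma only.
export function splitBirthPlaceDate(raw: string): {
  birth_place_date: string;
  birth_place: string;
  birth_date: string;
} {
  const idx = raw.indexOf(",");
  if (idx === -1) {
    // No comma: keep the raw value, leave the split fields empty.
    return { birth_place_date: raw, birth_place: "", birth_date: "" };
  }
  return {
    // Assumption: the raw value is retained even when the split succeeds.
    birth_place_date: raw,
    birth_place: raw.slice(0, idx).trim(),
    birth_date: raw.slice(idx + 1).trim(),
  };
}
```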
Acceptance — readReferensiSheet
- `@packages/pipeline/@source/extract/referensi.ts` exports `readReferensiSheet(filePath, sheetName)`
- Locates the `ORGANISASI` header in the sheet (cell containing `'ORGANISASI'`)
- Reads the category header row (row below the ORGANISASI header): `SMA`, `SMP`, `TK-SD`, `UNIVERSITAS`, `LAIN-LAIN`
- Reads the age_cluster column (col B): `KIDS`, `TEENS`, `ADULTS` — carried forward until the next non-empty value
- For each non-empty matrix cell: emits `{ organization, category, age_cluster }`
- Stops when it hits the next section (`TIPE KURSUS` or a similar non-organization heading)
- Returns `{ headers: ['organization', 'category', 'age_cluster'], rows: [...] }`
Notes
- SheetJS: use `XLSX.utils.sheet_to_json(ws, { header: 1, raw: false })` to get raw row arrays as strings
- Invert the YAML `columns` map (values are Indonesian source names, keys are English target names) for lookup
- Merge detection: prefer the sub-label at any R1 position where the R2 value is non-empty. Do NOT hardcode group names like `PENDAPATAN` — detect generically.
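The generic 2-row merge could be sketched like this; the helper name is hypothetical, and the sketch additionally lets a sub-label fill positions left empty in R1 by a merged outer cell (which the 2023 `PENDAPATAN` group requires, since `sheet_to_json` only places a merged value in its first cell):

```typescript
// Sketch of the generic 2-row header merge (hypothetical helper name).
// Sub-label wins wherever one exists; no group names are hardcoded.
export function mergeHeaderRows(r1: string[], r2: string[]): string[] {
  const width = Math.max(r1.length, r2.length);
  const merged: string[] = [];
  for (let i = 0; i < width; i++) {
    const outer = (r1[i] ?? "").trim();
    const sub = (r2[i] ?? "").trim();
    // Non-empty sub-label overrides (or fills a merged-cell gap in) the
    // outer label; otherwise the outer label is used as-is.
    merged.push(sub !== "" ? sub : outer);
  }
  return merged;
}
```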
Files
- @packages/pipeline/@source/extract/reader.ts
- @packages/pipeline/@source/extract/referensi.ts
- @packages/pipeline/@source/tests/extract.test.ts
T-008 - Implement extract/writer.ts
CSV writing module. Writes canonical-column CSVs to @source/clean/. Implements the frozen-snapshot overwrite guard.
- Status: completed
- Priority: P0
- Dependencies: T-007
Acceptance
- `@packages/pipeline/@source/extract/writer.ts` exports `writeCsv(outputPath, headers, rows, overwrite?)`
- If the file exists and `overwrite` is false (default): logs `[extract] skip: (unknown) already exists` and returns
- If the file does not exist or `overwrite` is true: writes the CSV
- Header row written first, then data rows
- Values containing commas, double quotes, or newlines are wrapped in double quotes; internal quotes escaped as `""`
- UTF-8 output, no BOM
- Empty rows (all values empty string) still written — header-only CSVs for future months are valid
- Vitest unit test: `@source/__tests__/writer.test.ts` — tests overwrite guard, quoting edge cases, header-only output
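The quoting rule can be sketched as a pair of small pure functions (names hypothetical):

```typescript
// Sketch of the CSV quoting rule (hypothetical helper names): wrap fields
// containing commas, double quotes, or newlines; double internal quotes.
export function csvField(value: string): string {
  if (/[",\n\r]/.test(value)) {
    return '"' + value.replace(/"/g, '""') + '"';
  }
  return value;
}

export function csvLine(values: string[]): string {
  return values.map(csvField).join(",");
}
```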
Files
- @packages/pipeline/@source/extract/writer.ts
- @packages/pipeline/@source/tests/writer.test.ts
T-009 - Implement extract/index.ts and wire CLI
Main extract orchestrator. Iterates config entries, dispatches to readSheet or readReferensiSheet, calls writeCsv. Generates output filename per the naming convention.
- Status: completed
- Priority: P0
- Dependencies: T-007, T-008
Acceptance
- `@packages/pipeline/@source/extract/index.ts` exports `extractAll(config, options)` where options = `{ cleanDir, overwrite }`
- Filename convention: `ions-{YYYY}-{MM}-{type}.csv` for monthly sheets; `ions-{YYYY}-organizations.csv` for REFERENSI
  - `{MM}` = zero-padded month number derived from the sheet name (e.g. `JANUARI` → `01`, `FEBRUARI` → `02`)
  - Targets special case: `{MM}` derived from the period label in row 9 (`DATA BULAN FEBRUARI 2026` → `02`)
- Examples: `ions-2026-01-transactions.csv`, `ions-2026-02-targets.csv`, `ions-2025-organizations.csv`
- Dispatches to `readReferensiSheet` when the file type is `referensi`
- Dispatches to `readSheet` for all other types
- Logs per-file result: `[extract] ions-2026-01-transactions.csv: 847 rows` or `[extract] skip: already exists`
- `pnpm run extract -- --entity IONS --year 2026` runs the 2026 extraction
- `pnpm run extract -- --entity IONS --year 2026 --overwrite` re-extracts existing files
- `pnpm --filter @packages/pipeline test:type` passes
Indonesian month → number mapping (hardcoded constant):
JANUARI=01, FEBRUARI=02, MARET=03, APRIL=04, MEI=05, JUNI=06, JULI=07, AGUSTUS=08, SEPTEMBER=09, OKTOBER=10, NOVEMBER=11, DESEMBER=12
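The mapping and the filename convention together could be sketched as follows (helper name hypothetical; the targets row-9 special case is omitted):

```typescript
// MONTHS is the hardcoded Indonesian month → number mapping from the task.
const MONTHS: Record<string, string> = {
  JANUARI: "01", FEBRUARI: "02", MARET: "03", APRIL: "04",
  MEI: "05", JUNI: "06", JULI: "07", AGUSTUS: "08",
  SEPTEMBER: "09", OKTOBER: "10", NOVEMBER: "11", DESEMBER: "12",
};

// Sketch of the filename convention (hypothetical helper name).
export function csvName(year: number, sheetName: string, type: string): string {
  if (type === "referensi") return `ions-${year}-organizations.csv`;
  // Sheet names look like "JANUARI 26" or "SEPTEMBER25": match the longest
  // month name that prefixes the trimmed, uppercased sheet name.
  const upper = sheetName.trim().toUpperCase();
  const month = Object.keys(MONTHS)
    .sort((a, b) => b.length - a.length) // longest-first, defensive
    .find((m) => upper.startsWith(m));
  if (!month) throw new Error(`no month in sheet name: ${sheetName}`);
  return `ions-${year}-${MONTHS[month]}-${type}.csv`;
}
```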
Files
- @packages/pipeline/@source/extract/index.ts
- @packages/pipeline/@source/index.ts (update: wire extract sub-command, pass --overwrite flag)
T-010 - Extract all years to @source/clean/ and commit
Run extraction for all 4 years. Verify row counts and column names. Commit all 100 CSVs.
- Status: completed
- Priority: P0
- Dependencies: T-009, T-003, T-004, T-005
Acceptance
- 2023: 12 transaction CSVs + 12 student CSVs = 24 files
- 2024: 12 transaction CSVs + 12 student CSVs + 1 organizations CSV = 25 files
- 2025: 12 transaction CSVs + 12 student CSVs + 1 organizations CSV = 25 files
- 2026: 12 transaction CSVs + 12 student CSVs + 1 targets CSV + 1 organizations CSV = 26 files
- Total: 100 CSV files in `@source/clean/`
- `ions-2026-02-targets.csv` has 13 correct English column headers (not `C0`, `_1`, ...)
- `ions-2023-01-transactions.csv` has `program`, `intake`, `student`, `amount` columns (from merged row 7)
- `ions-2026-01-students.csv` has `birth_place_date`, `birth_place`, `birth_date` columns
- All files committed to git with message: `data(pipeline): extract @source/clean CSVs for all years 2023–2026`
Expected row counts (approximate):
| Year | Transactions | Students |
|---|---|---|
| 2023 | ~7,914 total across 12 months | ~7,844 total |
| 2024 | ~8,666 total | ~10,201 total |
| 2025 | ~10,110 total | ~10,168 total |
| 2026 | ~1,191 real rows (Jan+Feb only; other months near-zero stubs) | ~1,640 real rows |
Files
- @source/clean/ (all 100 CSV files)
Phase 4: Load Step
- Purpose: Refactor load to read from @source/clean/ CSVs; remove marketing_activity
T-011 - Refactor load/csv.ts to read from @source/clean/
Replace the xlsx-direct load with CSV loading via DuckDB read_csv_auto. Remove the marketing_activity load block entirely.
- Status: completed
- Priority: P0
- Dependencies: T-009
Acceptance
- `load/csv.ts` `syncCsv(config, options)` reads from `@source/clean/` by default
- Constructs CSV filenames using the same convention as extract: `ions-{YYYY}-{MM}-{type}.csv` and `ions-{YYYY}-organizations.csv`
- Uses DuckDB `read_csv_auto` with `header := true, all_varchar := true`
- If the CSV does not exist: logs warning `[sync] skip: (unknown) not found — run extract first` and continues
- If the CSV is header-only (0 data rows): logs `[sync] {table}/{sheet}: 0 rows (empty sheet)` and skips the load
- `marketing_activity` file type entries in config are ignored / load block removed
- `pnpm run sync -- --entity IONS --year 2026` loads from CSVs; `raw_targets` has correct columns; `raw_students` ≤ 15k rows
- `pnpm --filter @packages/pipeline test:type` passes
Notes
- `--source csv` is now the default; `--source xlsx` can remain as a legacy flag for debugging (still points to the old xlsx adapter)
- `raw_marketing_activity` is no longer created in DuckDB — this is intentional
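A minimal sketch of the SQL string built per CSV; the table naming and the helper itself are assumptions, and only the `read_csv_auto` options come from the task:

```typescript
// Sketch (hypothetical helper): builds the DuckDB load statement for one
// CSV. String construction only; the actual DuckDB client call is elsewhere.
export function loadCsvSql(table: string, csvPath: string): string {
  const path = csvPath.replace(/'/g, "''"); // escape for the SQL string literal
  return (
    `CREATE OR REPLACE TABLE ${table} AS ` +
    `SELECT * FROM read_csv_auto('${path}', header := true, all_varchar := true)`
  );
}
```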
Files
- @packages/pipeline/@source/load/csv.ts
- @packages/pipeline/@source/index.ts (update default source)
Phase 5: Validate Step
- Purpose: SQL checks on raw_* tables after load
T-012 - Implement validate/checks.ts
Define all validation checks as typed objects. Pure logic, no DuckDB connection.
- Status: completed
- Priority: P0
- Dependencies: none
Acceptance
- `@packages/pipeline/@source/validate/checks.ts` exports `ValidationCheck[]`
- `ValidationCheck` type: `{ name: string, description: string, severity: 'error' | 'warning', sql: string }`
- Checks implemented:
  - `raw_transactions_empty` — error: `SELECT COUNT(*) FROM raw_transactions HAVING COUNT(*) = 0`
  - `raw_students_empty` — error: `SELECT COUNT(*) FROM raw_students HAVING COUNT(*) = 0`
  - `raw_targets_empty` — warning: `SELECT COUNT(*) FROM raw_targets HAVING COUNT(*) = 0` (only 2026 has targets)
  - `raw_targets_bad_columns` — error: checks that the column `unit` exists and is not entirely null in `raw_targets`
  - `raw_students_bloat` — warning: row count > 20,000 signals junk sheets were loaded
  - `raw_transactions_invalid_year` — warning: rows where the year extracted from `intake` is < 2020 or > current year + 1
  - `raw_transactions_unknown_unit` — warning: `unit` values not in the known set (`REAL`, `IONs`, `IONS`, `ICC`, `ICA`, `NP`, `DTC`, `TM`, `WLC`, `WLC INGGRIS`, `WLC NON INGGRIS`, `DETECTION`, `MANDARIN`)
  - `raw_transactions_negative_amount` — warning: `TRY_CAST(amount AS DOUBLE) < 0`
- No check for `raw_marketing_activity` — it no longer exists
- Vitest unit test: `@source/__tests__/validate.test.ts` — tests check SQL strings and severity values
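One check entry might look like this, using the `ValidationCheck` type from the acceptance list; the `description` wording is illustrative:

```typescript
// ValidationCheck type as specified in the acceptance criteria.
export interface ValidationCheck {
  name: string;
  description: string; // illustrative wording below
  severity: "error" | "warning";
  sql: string;
}

export const rawTransactionsEmpty: ValidationCheck = {
  name: "raw_transactions_empty",
  description: "raw_transactions must contain at least one row",
  severity: "error",
  // HAVING makes the query return a row only when the table is empty,
  // so "any row returned" = check failed.
  sql: "SELECT COUNT(*) FROM raw_transactions HAVING COUNT(*) = 0",
};
```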
Files
- @packages/pipeline/@source/validate/checks.ts
- @packages/pipeline/@source/tests/validate.test.ts
T-013 - Implement validate/index.ts and wire CLI
Main validate orchestrator. Runs all checks, writes JSON report.
- Status: completed
- Priority: P0
- Dependencies: T-012
Acceptance
- `@packages/pipeline/@source/validate/index.ts` exports `validateAll(config, options)`
- Runs all checks from `checks.ts` against `atlas.db`
- Collects: check name, severity, passed/failed, failing row count, sample rows (up to 5)
- Writes `output/validation/{entity.toLowerCase()}-{year}-validation.json` (2-space indent)
- Console summary: `✓` for passed, `✗` for failed; groups errors then warnings
- Exit code 0 if all pass or warnings only; exit code 1 if any error-severity check fails
- `pnpm run validate -- --entity IONS --year 2026` runs and produces the report
- `pnpm --filter @packages/pipeline test:type` passes
Files
- @packages/pipeline/@source/validate/index.ts
- @packages/pipeline/@source/index.ts (update: wire validate sub-command)
Phase 6: dbt Quality
- Purpose: Remove dead model; add is_valid flagging; add audit model
T-014 - Remove stg_marketing_activity and raw_marketing_activity
Delete the dead staging model and its source declaration. Verify dbt still runs without error.
- Status: completed
- Priority: P0
- Dependencies: T-011
Acceptance
- `@python/analytics/models/staging/stg_marketing_activity.sql` deleted
- `raw_marketing_activity` source entry removed from `@python/analytics/models/staging/sources.yml`
- `mart_channel_marketing` schema entry in `marts/schema.yml` updated if it referenced `stg_marketing_activity` (it doesn't — but verify)
- `cd @python/analytics && uv run dbt run` succeeds without errors
- `cd @python/analytics && uv run dbt test` passes all remaining tests
- `mart_channel_marketing` still populates correctly (it reads `int_enrollments`, which is unaffected)
Notes
- `mart_channel_marketing` reads from `int_enrollments` via the `payment_via`/`payment_channel` columns — completely independent of REKAP data. Verify this with a row count check before and after.
Files
- @python/analytics/models/staging/stg_marketing_activity.sql (delete)
- @python/analytics/models/staging/sources.yml
T-015 - Add is_valid and invalid_reason to int_orders.sql
Add two computed columns. Priority order of invalidity checks matters — first matching condition sets the reason.
- Status: completed
- Priority: P0
- Dependencies: T-014
Acceptance
- `int_orders.sql` has `is_valid` (boolean) and `invalid_reason` (varchar) in the final SELECT
- `is_valid` CASE logic (in priority order):
  - `period_year IS NULL` → false, reason `'null_period'`
  - `period_year < 2020` → false, reason `'invalid_period_year'`
  - `period_year > EXTRACT(YEAR FROM CURRENT_DATE) + 1` → false, reason `'invalid_period_year'`
  - `amount IS NULL OR amount <= 0` → false, reason `'invalid_amount'`
  - `unit_code_normalized IS NULL` → false, reason `'unknown_unit'`
  - else → true, reason NULL
- `cd @python/analytics && uv run dbt run --select int_orders` succeeds
- No null values in the `is_valid` column (every row is true or false)
Files
- @python/analytics/models/intermediate/int_orders.sql
T-016 - Filter int_enrollments to is_valid = true
One-line change. All mart models inherit the filter via int_enrollments.
- Status: completed
- Priority: P0
- Dependencies: T-015
Acceptance
- `int_enrollments.sql` WHERE clause includes `AND is_valid = true`
- `uv run dbt run` completes all models
- `mart_revenue` row count is plausible (lower than before due to invalid-row exclusion)
- `uv run dbt test` passes all schema tests
Files
- @python/analytics/models/intermediate/int_enrollments.sql
T-017 - Create audit_flagged_orders dbt view
Exposes all invalid orders for inspection. View — no storage cost, always fresh.
- Status: completed
- Priority: P1
- Dependencies: T-015
Acceptance
- `@python/analytics/models/intermediate/audit_flagged_orders.sql` exists
- Configured as `{{ config(materialized='view') }}`
- `SELECT * FROM {{ ref('int_orders') }} WHERE is_valid = false`
- Schema entry in `@python/analytics/models/intermediate/schema.yml`
- `uv run dbt run --select audit_flagged_orders` succeeds
Files
- @python/analytics/models/intermediate/audit_flagged_orders.sql
- @python/analytics/models/intermediate/schema.yml (new or updated)
T-018 - Add dbt schema tests for is_valid columns
Ensure is_valid and invalid_reason behave correctly.
- Status: completed
- Priority: P1
- Dependencies: T-017
Acceptance
- Schema test: `is_valid` has no nulls in `int_orders`
- Schema test: `invalid_reason` is not null for all rows where `is_valid = false`
- `uv run dbt test` passes all tests including the new ones
Files
- @python/analytics/models/intermediate/schema.yml
T-019 - End-to-end verification and AGENTS.md update
Full pipeline run from extract through format. Update docs.
- Status: cancelled — superseded by T-019a, T-019b, T-019c (all completed)
- Priority: P0
- Dependencies: T-010, T-011, T-013, T-016, T-017
Acceptance
- `pnpm run extract -- --entity IONS --year 2026` skips existing CSVs (frozen)
- `pnpm run sync -- --entity IONS --year 2026` loads from CSVs; `raw_targets` ~55 rows with correct columns; `raw_students` ≤ 15k rows; no `raw_marketing_activity` table
- `pnpm run validate -- --entity IONS --year 2026` exits 0; JSON report written; no error-severity failures
- `cd @python/analytics && uv run dbt run` completes; `audit_flagged_orders` view exists
- `cd @python/analytics && uv run dbt test` passes
- `pnpm run format -- --entity IONS --period 2026-02` produces `output/report.json` with correct sections
- Root `AGENTS.md` updated: extract command documented; `@source/clean/` frozen CSV snapshots explained; REKAP exclusion noted; `audit_flagged_orders` mentioned
- `@packages/pipeline/AGENTS.md` updated: extract + validate sub-commands documented; `readReferensiSheet` noted
- `@python/analytics/AGENTS.md` updated: `is_valid`/`invalid_reason` pattern documented; `stg_marketing_activity` removal noted
Files
- AGENTS.md
- @packages/pipeline/AGENTS.md
- @python/analytics/AGENTS.md
- output/validation/ions-2026-validation.json (generated, gitignored)
T-019a - Update extract step: amount formatting, date formatting, identity rename
Update reader.ts to produce cleaner values in CSVs. Re-extract all years with --overwrite and commit updated CSVs.
- Status: completed
- Priority: P0
- Dependencies: T-019
Context
The staging SQL currently fails because:
- `amount` values are `"100,000"` strings — `TRY_CAST` returns NULL, filtering all rows out
- `date` values may be Excel serial numbers or locale strings — needs ISO `"2026-01-02"` format
- the `identity` column name should be `id_raw` to reflect its role
These are formatting concerns that belong in the extract step, not in staging SQL.
Boundary
| Layer | Responsibility |
|---|---|
| Extract (CSV) | Amount plain number, date ISO, id_raw raw string |
| Staging SQL | Identity split, period parsing, unit normalization, customer type translation |
Acceptance
- `reader.ts` `formatAmount(value)` function: strips all thousands separators, returns a plain number string (`"100,000"` → `"100000"`, `""` → `""`)
- `reader.ts` `formatDate(value)` function: handles Excel serial integers and locale date strings (`"2 January 2026"`, `"2026-01-02 00:00:00"`), outputs ISO `"YYYY-MM-DD"` or `""` if unparseable
- `reader.ts` applies `formatAmount` to the `amount` column in transactions output
- `reader.ts` applies `formatDate` to the `date` column in transactions and the `datetime` column in students
- `STUDENTS_COLUMNS` in `schema.ts`: `identity` renamed to `id_raw` (same position, same count — 26 columns)
- `reader.ts` maps the source `identity`/`T/T/L`-derived raw value to the `id_raw` column (not `identity`)
- Vitest unit tests added/updated in `extract.test.ts`: `formatAmount` and `formatDate` covering empty, comma number, Excel serial, ISO string, locale string, malformed
- `pnpm --filter @packages/pipeline test` passes
- `pnpm run extract -- --entity IONS --year 2023 --overwrite` runs successfully
- `pnpm run extract -- --entity IONS --year 2024 --overwrite` runs successfully
- `pnpm run extract -- --entity IONS --year 2025 --overwrite` runs successfully
- `pnpm run extract -- --entity IONS --year 2026 --overwrite` runs successfully
- Spot check: `ions-2026-01-transactions.csv` — `amount` column has `"100000"` not `"100,000"`; `date` column has `"2026-01-02"` not `"2 January 2026"`
- Spot check: `ions-2026-01-students.csv` — header has an `id_raw` column, not `identity`
- All 100 updated CSVs committed to git
Notes
- SheetJS `raw: false` may already convert Excel serials to locale strings (e.g. `"1/2/2026"`). The formatter needs to handle both forms.
- `birth_date` in students is human-readable text like `"18 Desember 2003"` — do NOT apply `formatDate` to it. It is not a machine date.
- `datetime` in students often has a time component (`"2026-01-02 00:00:00"`) — strip the time component, output the date only.
- `formatAmount` must handle the Indonesian thousands separator (`.`) as well as the Western one (`,`) — some years use periods.
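The two formatters might be sketched as follows; the serial-number epoch math and the locale-string fallback are assumptions to be verified against the real data, and decimal fractions are treated as out of scope (amounts are whole rupiah):

```typescript
// Sketch of formatAmount: strips Western (,) and Indonesian (.) thousands
// separators; a separator is only stripped when followed by a 3-digit group.
export function formatAmount(value: string): string {
  const v = value.trim();
  if (v === "") return "";
  return v.replace(/[.,](?=\d{3}(\D|$))/g, "");
}

// Sketch of formatDate: Excel serials, ISO strings with optional time,
// then a lenient Date-parse fallback for locale strings like "2 January 2026".
export function formatDate(value: string): string {
  const v = value.trim();
  if (v === "") return "";
  // Excel serial number: days since 1899-12-30 (25569 = 1970-01-01).
  if (/^\d+$/.test(v)) {
    const ms = (Number(v) - 25569) * 86400 * 1000;
    return new Date(ms).toISOString().slice(0, 10);
  }
  // ISO date, optionally with a time component: keep the date part only.
  const iso = v.match(/^\d{4}-\d{2}-\d{2}/);
  if (iso) return iso[0];
  // Fallback for locale strings (engine-dependent; a real implementation
  // should be stricter). Local date components avoid timezone day-shifts.
  const parsed = new Date(v);
  if (!Number.isNaN(parsed.getTime())) {
    const y = parsed.getFullYear();
    const m = String(parsed.getMonth() + 1).padStart(2, "0");
    const d = String(parsed.getDate()).padStart(2, "0");
    return `${y}-${m}-${d}`;
  }
  return ""; // unparseable, per the acceptance criteria
}
```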
Files
- @packages/pipeline/@source/extract/reader.ts
- @packages/pipeline/@source/extract/schema.ts
- @packages/pipeline/@source/tests/extract.test.ts
- @source/clean/ (all 100 CSVs re-extracted with --overwrite)
T-019b - Simplify staging SQL: direct English column refs, identity split, period parsing
Replace all atlas_coalesce_columns_or_null macro lookups with direct English column references. Add identity split columns. Add period parsing. Simplify int_orders to use staging outputs.
- Status: completed
- Priority: P0
- Dependencies: T-019a
What changes
stg_transactions.sql — full rewrite to direct column refs:
- Remove all `atlas_coalesce_columns_or_null` / `atlas_column_or_null` macro calls
- Direct references: `receipt`, `date`, `unit`, `program`, `variant`, `intake`, `student`, `amount`, `payment_via`, `payment_channel`, `student_status`, `student_kind`, `payment_status`, `category`, `barsheet`
- `TRY_CAST(amount AS DECIMAL(18,2))` now works directly (no `REPLACE` needed — amount is already a plain number)
- `TRY_CAST(date AS DATE)` is sufficient (no `atlas_parse_date` needed — date is already ISO)
- Period parsing added here (from `intake`): `CASE WHEN UPPER(intake) LIKE 'JANUARI%' THEN 1 ... WHEN UPPER(intake) LIKE 'DESEMBER%' THEN 12 ELSE NULL END AS period_month` (must be a searched CASE with LIKE — a simple `CASE UPPER(intake) WHEN 'JANUARI%'` compares the `%` literally and never matches), plus `TRY_CAST(REGEXP_EXTRACT(intake, '(20[0-9]{2})', 1) AS INTEGER) AS period_year`
- Customer type translation added here: `CASE UPPER(student_status) WHEN 'B' THEN 'New' WHEN 'L' THEN 'Renewal' WHEN 'A' THEN 'Alumni' ELSE student_status END AS customer_type`
- Receipt synthesis for legacy rows (no receipt column): `CONCAT('LEGACY-', no, '-', program, '-', student, '-', date)`
- WHERE: `NULLIF(receipt,'') IS NOT NULL OR NULLIF(program,'') IS NOT NULL`
- Cleaned CTE: `amount IS NOT NULL AND unit IS NOT NULL AND program IS NOT NULL AND program <> 'program'`
stg_students.sql — full rewrite to direct column refs + identity split:
- Remove all `atlas_coalesce_columns_or_null` macro calls
- Direct references: `id_raw`, `name`, `datetime`, `organization`, `unit`, `age_cluster`, `course_type`, `program`, `class_level`, `intake`, `source`, `student_status`, `payment_method`, `sex`, `receipt`
- Unit normalization CASE stays in staging (not CSV): `UPPER(unit) IN ('WLC INGGRIS'...) THEN 'WLC_ENGLISH'`
- Period parsing added (same as transactions, from `intake`)
- Identity split (5 columns):
  - `id_ktp`: type = `KTP` or `KKTP`
  - `id_kp`: type = `KARTU PELAJAR` or `KARTU PELAJAR PS`
  - `id_sim`: type = `SIM`
  - `id_pass`: type = `PASSPORT` or `PASPORT`
  - `id_nis`: type = `NIS` or `NOMOR INDUK SEKOLAH` or `NOMOR INDUK`
  - Unrecognized types → all 5 NULL (still visible via `id_raw`)
- WHERE: `NULLIF(name,'') IS NOT NULL AND NULLIF(name,'') NOT IN ('name','NAMA')`
int_orders.sql — simplification:
- Remove the 24-line CASE block for `period_month` (now from staging)
- Remove the `period_year` COALESCE block (now from staging)
- Remove the `customer_type` CASE block (`B` → `New`, etc.) — now from `stg_transactions`
- `period_year = COALESCE(t.period_year, s.period_year)`
- `period_month = COALESCE(t.period_month, s.period_month)`
- `customer_type = t.customer_type` (already translated)
- Identity columns passed through from `stg_students`: `id_raw`, `id_ktp`, `id_kp`, `id_sim`, `id_pass`, `id_nis`
Acceptance
- `stg_transactions.sql` has zero `atlas_coalesce_columns_or_null` / `atlas_column_or_null` calls
- `stg_students.sql` has zero macro calls (except `atlas_parse_date` if still needed)
- `stg_students.sql` outputs `id_raw`, `id_ktp`, `id_kp`, `id_sim`, `id_pass`, `id_nis`
- `stg_transactions.sql` outputs `period_year`, `period_month`, `customer_type`
- `stg_students.sql` outputs `period_year`, `period_month`
- `int_orders.sql` uses `t.period_year`, `t.period_month` from staging (no inline parsing)
- `int_orders.sql` uses `t.customer_type` from staging (no inline CASE)
- `int_orders.sql` passes through `id_raw`, `id_ktp`, `id_kp`, `id_sim`, `id_pass`, `id_nis`
- `cd @python/analytics && uv run dbt run` completes with non-zero rows in all mart models
- `mart_revenue` for `period_year = 2026, period_month = 2` has non-zero `revenue_actuals`
- `cd @python/analytics && uv run dbt test` passes all tests
- `audit_flagged_orders` rows reflect only genuinely invalid data (not column-name-mismatch false positives)
Notes
- The `atlas_parse_date` macro can be removed from stg_transactions/stg_students since dates are now ISO. Keep the macro definition in `helpers.sql` in case other models need it.
- `stg_targets.sql` is unaffected — it reads from `raw_targets`, which has its own column schema.
- The `QUALIFY ROW_NUMBER()` deduplication in `stg_transactions` should be preserved.
- The `stg_students` WHERE filter must exclude header-artifact rows: `name NOT IN ('name', 'NAMA')` to handle any CSV header leak.
Files
- @python/analytics/models/staging/stg_transactions.sql
- @python/analytics/models/staging/stg_students.sql
- @python/analytics/models/intermediate/int_orders.sql
- @python/analytics/models/intermediate/schema.yml (add id_ktp, id_kp, id_sim, id_pass, id_nis to int_orders docs)
T-019c - Final end-to-end verification
Run the complete pipeline. Verify marts are populated. Verify format output. Close plan.
- Status: completed
- Priority: P0
- Dependencies: T-019b
Acceptance
- `pnpm run sync -- --entity IONS --year 2026` loads from the updated CSVs; `raw_students` has an `id_raw` column; the `raw_transactions` `amount` column has plain numbers
- `pnpm run validate -- --entity IONS --year 2026` exits 0; all checks pass or are warning-only
- `cd @python/analytics && uv run dbt run` completes; `int_orders` has non-zero rows; `int_enrollments` has non-zero rows
- `mart_revenue` for 2026-01 and 2026-02 has non-zero `revenue_actuals`
- `mart_channel_marketing` has non-zero rows
- `audit_flagged_orders` row count is low (only genuine data quality issues)
- `pnpm run format -- --entity IONS --period 2026-02` produces `output/report.json` with populated `units` sections (non-empty revenue, programs, channels, schools)
- `pnpm run sync -- --entity IONS --year 2023` and `uv run dbt run` produce non-zero historical data in marts
- `pnpm --filter @packages/pipeline test` passes
- `cd @python/analytics && uv run dbt test` passes all tests
- AGENTS.md updated to document the `id_ktp`, `id_kp`, `id_sim`, `id_pass`, `id_nis` columns in stg_students
Files
- AGENTS.md (update)
- output/report.json (generated, gitignored)