Atlas Plan
Plans009 2026 02 21 Data Quality Pipeline

Phase 1: Audit Documentation

  • Purpose: Inspect and document every sheet in all 8 source xlsx files as a permanent reference
  • Status: COMPLETE — done in planning session 2026-02-21. See @plan/audit/.

T-001 - Create @plan/audit/ and sheet index

  • Status: completed
  • Priority: P0
  • Dependencies: none

Acceptance

  • @plan/audit/ directory exists
  • @plan/audit/sheet-index.md exists with full sheet table across all 8 files
  • All known anomalies documented (SEPTEMBER25 typo, REKAP FEBUARI typo, leading space in REKAP JULI, hidden rows, ICC vs ICA distinction)

Files

  • @plan/audit/sheet-index.md

T-002 - Create per-file audit snapshots

  • Status: completed
  • Priority: P1
  • Dependencies: T-001

Acceptance

  • 8 snapshot docs exist in @plan/audit/
  • Each doc includes: file path, all sheets with type/status, column lists for data sheets, header rows, row counts, anomalies

Files

  • @plan/audit/ions-2023-laporan.md
  • @plan/audit/ions-2023-database-siswa.md
  • @plan/audit/ions-2024-laporan.md
  • @plan/audit/ions-2024-database-siswa.md
  • @plan/audit/ions-2025-laporan.md
  • @plan/audit/ions-2025-database-siswa.md
  • @plan/audit/ions-2026-laporan.md
  • @plan/audit/ions-2026-database-siswa.md

Phase 2: YAML Config Fixes

  • Purpose: Correct all structural errors in @source/config YAML files

T-003 - Fix ions-2026.yaml

Update 2026 config: fix targets header rows, add organizations entry, remove marketing_activity entries.

  • Status: completed
  • Priority: P0
  • Dependencies: none

Acceptance

  • transactions entry: header_row: 6, no merge_header_rows (flat header)
  • targets entry: merge_header_rows: [10, 11], data_start_row: 12 (rows 1–9 are letter salutation + month label; row 10 = outer header, row 11 = sub-labels)
  • students entry: all 12 monthly sheets listed with correct 2026 suffix pattern; SEPTEMBER25 typo preserved (exact sheet name in file)
  • organizations entry added: file = DATABASE SISWA 2026, sheet = REFERENSI, type: referensi
  • All marketing_activity / REKAP sheet entries removed
  • No references to REKAP * sheets anywhere in the file

Notes

  • Targets 2-row header: row 10 outer labels (NO., TENANT, BIAYA STANDAR, TARGET REVENUE, TARGET, REALISASI REVENUE, REALISASI, REALISASI SISWA RIIL, JUMLAH KELAS, RATA-RATA JUMLAH SISWA PER KELAS, JUMLAH SISWA AKTIF SAAT INI, JUMLAH TOTAL KELAS AKTIF, RATA-RATA JUMLAH SISWA TIAP KELAS AKTIF). Row 11 sub-labels: SISWA STANDAR under col F (TARGET) and col H (REALISASI).
  • The extract reader must merge rows 10+11 the same way as the 2023 transaction merge.

Files

  • @source/config/ions-2026.yaml

T-004 - Fix ions-2025.yaml and ions-2024.yaml

Both years have flat single-row headers at row 6. No merge_header_rows needed. Add organizations entries. Remove marketing_activity entries.

  • Status: completed
  • Priority: P0
  • Dependencies: none

Acceptance

  • ions-2025.yaml transactions entry: header_row: 6, no merge_header_rows
  • ions-2025.yaml has no targets entry (sheet absent in 2025)
  • ions-2025.yaml organizations entry added: DATABASE SISWA 2025, sheet = REFERENSI, type: referensi
  • ions-2025.yaml all marketing_activity / REKAP entries removed
  • ions-2024.yaml same structure as 2025
  • ions-2024.yaml organizations entry added: DATABASE SISWA 2024, sheet = REFERENSI, type: referensi
  • All 12 monthly student sheets listed with correct year suffix (25, 24)
  • Marketing activity sheet names (REKAP *, ' REKAP JULI') removed from both files

Files

  • @source/config/ions-2025.yaml
  • @source/config/ions-2024.yaml

T-005 - Fix ions-2023.yaml

2023 is the only year with a genuine 2-row merged header in transactions. No organizations sheet. No targets.

  • Status: completed
  • Priority: P0
  • Dependencies: none

Acceptance

  • transactions entry: merge_header_rows: [6, 7] (row 6 = NO, HARI, TANGGAL, BARSHEET, PENDAPATAN[group], TENANT...; row 7 = PROGRAM, INTAKE, NAMA SISWA, JUMLAH under PENDAPATAN)
  • No targets entry
  • No organizations entry (REFERENSI sheet absent in 2023)
  • All 12 student sheets listed with 2023 naming pattern: JANUARI 23, FEBRUARI 23, ..., DESEMBER (no year suffix on December)
  • All marketing_activity / REKAP entries removed (REKAP JAN, REKAP FEB, etc.)

Notes

  • 2023 marketing REKAP sheets use short names: REKAP JAN, REKAP FEB, REKAP MAR, REKAP APRIL, REKAP MEI, REKAP JUNI, REKAP JULI, REKAP AGUSTUS, REKAP SEPTEMBER, REKAP OKTOBER, REKAP NOVEMBER, REKAP DESEMBER
  • These must all be removed from config, not just marked absent

Files

  • @source/config/ions-2023.yaml

T-006 - Define canonical CSV schema and extend SourceConfig type

Create a schema.ts file with the authoritative column definitions for all 4 file types. Extend SourceFileConfig TypeScript type with merge_header_rows?: number[], data_start_row?: number, and type field including 'referensi'.

  • Status: completed
  • Priority: P0
  • Dependencies: T-003, T-004, T-005

Acceptance

  • @packages/pipeline/@source/extract/schema.ts exported constants for each file type:
    • TRANSACTIONS_COLUMNS: string[] — 17 canonical English column names in order
    • STUDENTS_COLUMNS: string[] — 26 canonical English column names in order
    • TARGETS_COLUMNS: string[] — 13 canonical English column names in order
    • ORGANIZATIONS_COLUMNS: string[] — 3 canonical English column names
  • SourceFileConfig type updated with merge_header_rows?: number[], data_start_row?: number
  • File type union updated to include 'referensi' alongside existing types
  • pnpm --filter @packages/pipeline test:type passes

Column reference — transactions (17):

no, day, date, barsheet, receipt, program, variant, intake, student, amount, unit, category, payment_via, payment_channel, student_status, student_kind, payment_status

Column reference — students (26):

no, intake, datetime, name, birth_place_date, birth_place, birth_date, identity, phone, email, address, organization, unit, age_cluster, course_type, program, class_level, schedule, meetings, duration, class_type, payment_method, payment_pattern, source, sex, verified

Column reference — targets (13):

no, unit, cost_standard, revenue_target, student_target, revenue_actuals, student_actuals_standard, student_actuals_real, class_count, avg_students_per_class, active_students_now, active_classes_total, avg_students_active

Column reference — organizations (3):

organization, category, age_cluster

Files

  • @packages/pipeline/@source/extract/schema.ts
  • @packages/pipeline/@source/config.ts

Phase 3: Extract Step

  • Purpose: Implement SheetJS-based xlsx → CSV extraction for all years

T-007 - Implement extract/reader.ts and extract/referensi.ts

Two reading functions: readSheet for standard flat and merged-header sheets, readReferensiSheet for the REFERENSI pivot matrix. Both output { headers: string[], rows: Record<string, string>[] }.

  • Status: completed
  • Priority: P0
  • Dependencies: T-006

Acceptance — readSheet

  • @packages/pipeline/@source/extract/reader.ts exports readSheet(filePath, sheetName, options) where options includes headerRow, columns, mergeHeaderRows?, dataStartRow?
  • Standard single-row: reads row at headerRow, maps Indonesian → English via inverted columns map, returns data rows starting at headerRow + 1 (or dataStartRow if specified)
  • merge_header_rows: [R1, R2] case: reads row R1 and R2. For each column position: if R1 value is non-empty AND R2 value is non-empty, use R2 value as column name (sub-label wins). Otherwise use R1 value. Then map to English via columns.
  • Reads ALL rows regardless of hidden state — never checks row_dimensions
  • Skips rows where all mapped column values are empty/null
  • For students: splits T/T/L value at first comma → birth_place (before) and birth_date (after, trimmed). If no comma, populates birth_place_date only; leaves birth_place and birth_date as empty string.
  • Missing source columns (e.g. receipt absent in 2023) output as empty string for that column
  • Returns rows with exactly the canonical column set (from schema.ts) — no extra columns, no missing columns
  • Vitest unit test: @source/__tests__/extract.test.ts — tests single-row header, 2-row merged header, missing columns, birth_place split

Acceptance — readReferensiSheet

  • @packages/pipeline/@source/extract/referensi.ts exports readReferensiSheet(filePath, sheetName)
  • Locates ORGANISASI header in the sheet (cell containing 'ORGANISASI')
  • Reads category header row (row below ORGANISASI header): SMA, SMP, TK-SD, UNIVERSITAS, LAIN-LAIN
  • Reads age_cluster column (col B): KIDS, TEENS, ADULTS — carried forward until next non-empty value
  • For each matrix cell that is non-empty: emits { organization, category, age_cluster }
  • Stops when it hits the next section (TIPE KURSUS or similar non-organization heading)
  • Returns { headers: ['organization', 'category', 'age_cluster'], rows: [...] }

Notes

  • SheetJS: use XLSX.utils.sheet_to_json(ws, { header: 1, raw: false }) to get raw row arrays as strings
  • Invert the YAML columns map (values are Indonesian source names, keys are English target names) for lookup
  • For merge detection: group names that trigger sub-label preference: any R1 value where R2 position has a non-empty value. Do NOT hardcode group names like PENDAPATAN — detect generically.

Files

  • @packages/pipeline/@source/extract/reader.ts
  • @packages/pipeline/@source/extract/referensi.ts
  • @packages/pipeline/@source/tests/extract.test.ts

T-008 - Implement extract/writer.ts

CSV writing module. Writes canonical-column CSVs to @source/clean/. Implements the frozen-snapshot overwrite guard.

  • Status: completed
  • Priority: P0
  • Dependencies: T-007

Acceptance

  • @packages/pipeline/@source/extract/writer.ts exports writeCsv(outputPath, headers, rows, overwrite?)
  • If file exists and overwrite is false (default): logs [extract] skip: {filename} already exists and returns
  • If file does not exist or overwrite is true: writes CSV
  • Header row written first, then data rows
  • Values containing commas, double-quotes, or newlines wrapped in double-quotes; internal quotes escaped as ""
  • UTF-8 output, no BOM
  • Empty rows (all values empty string) still written — header-only CSVs for future months are valid
  • Vitest unit test: @source/__tests__/writer.test.ts — tests overwrite guard, quoting edge cases, header-only output

Files

  • @packages/pipeline/@source/extract/writer.ts
  • @packages/pipeline/@source/tests/writer.test.ts

T-009 - Implement extract/index.ts and wire CLI

Main extract orchestrator. Iterates config entries, dispatches to readSheet or readReferensiSheet, calls writeCsv. Generates output filename per the naming convention.

  • Status: completed
  • Priority: P0
  • Dependencies: T-007, T-008

Acceptance

  • @packages/pipeline/@source/extract/index.ts exports extractAll(config, options) where options = { cleanDir, overwrite }
  • Filename convention: ions-{YYYY}-{MM}-{type}.csv for monthly sheets; ions-{YYYY}-organizations.csv for REFERENSI
    • {MM} = zero-padded month number derived from sheet name (e.g. JANUARI01, FEBRUARI02)
    • Targets special case: {MM} derived from period label in row 9 (DATA BULAN FEBRUARI 202602)
    • Examples: ions-2026-01-transactions.csv, ions-2026-02-targets.csv, ions-2025-organizations.csv
  • Dispatches to readReferensiSheet when file type is referensi
  • Dispatches to readSheet for all other types
  • Logs per-file result: [extract] ions-2026-01-transactions.csv: 847 rows or [extract] skip: already exists
  • pnpm run extract -- --entity IONS --year 2026 runs 2026 extraction
  • pnpm run extract -- --entity IONS --year 2026 --overwrite re-extracts existing files
  • pnpm --filter @packages/pipeline test:type passes

Indonesian month → number mapping (hardcoded constant):

JANUARI=01, FEBRUARI=02, MARET=03, APRIL=04, MEI=05, JUNI=06, JULI=07, AGUSTUS=08, SEPTEMBER=09, OKTOBER=10, NOVEMBER=11, DESEMBER=12

Files

  • @packages/pipeline/@source/extract/index.ts
  • @packages/pipeline/@source/index.ts (update: wire extract sub-command, pass --overwrite flag)

T-010 - Extract all years to @source/clean/ and commit

Run extraction for all 4 years. Verify row counts and column names. Commit all 100 CSVs.

  • Status: completed
  • Priority: P0
  • Dependencies: T-009, T-003, T-004, T-005

Acceptance

  • 2023: 12 transaction CSVs + 12 student CSVs = 24 files
  • 2024: 12 transaction CSVs + 12 student CSVs + 1 organizations CSV = 25 files
  • 2025: 12 transaction CSVs + 12 student CSVs + 1 organizations CSV = 25 files
  • 2026: 12 transaction CSVs + 12 student CSVs + 1 targets CSV + 1 organizations CSV = 26 files
  • Total: 100 CSV files in @source/clean/
  • ions-2026-02-targets.csv has 13 correct English column headers (not C0, _1...)
  • ions-2023-01-transactions.csv has program, intake, student, amount columns (from merged row 7)
  • ions-2026-01-students.csv has birth_place_date, birth_place, birth_date columns
  • All files committed to git with message: data(pipeline): extract @source/clean CSVs for all years 2023–2026

Expected row counts (approximate):

YearTransactionsStudents
2023~7,914 total across 12 months~7,844 total
2024~8,666 total~10,201 total
2025~10,110 total~10,168 total
2026~1,191 real rows (Jan+Feb only; other months near-zero stubs)~1,640 real rows

Files

  • @source/clean/ (all 100 CSV files)

Phase 4: Load Step

  • Purpose: Refactor load to read from @source/clean/ CSVs; remove marketing_activity

T-011 - Refactor load/csv.ts to read from @source/clean/

Replace the xlsx-direct load with CSV loading via DuckDB read_csv_auto. Remove the marketing_activity load block entirely.

  • Status: completed
  • Priority: P0
  • Dependencies: T-009

Acceptance

  • load/csv.ts syncCsv(config, options) reads from @source/clean/ by default
  • Constructs CSV filename using same convention as extract: ions-{YYYY}-{MM}-{type}.csv and ions-{YYYY}-organizations.csv
  • Uses DuckDB read_csv_auto with header := true, all_varchar := true
  • If CSV does not exist: logs warning [sync] skip: {filename} not found — run extract first and continues
  • If CSV is header-only (0 data rows): logs [sync] {table}/{sheet}: 0 rows (empty sheet) and skips load
  • marketing_activity file type entries in config are ignored / load block removed
  • pnpm run sync -- --entity IONS --year 2026 loads from CSVs; raw_targets has correct columns; raw_students ≤ 15k rows
  • pnpm --filter @packages/pipeline test:type passes

Notes

  • --source csv is now the default; --source xlsx can remain as a legacy flag for debugging (still points to old xlsx adapter)
  • raw_marketing_activity is no longer created in DuckDB — this is intentional

Files

  • @packages/pipeline/@source/load/csv.ts
  • @packages/pipeline/@source/index.ts (update default source)

Phase 5: Validate Step

  • Purpose: SQL checks on raw_* tables after load

T-012 - Implement validate/checks.ts

Define all validation checks as typed objects. Pure logic, no DuckDB connection.

  • Status: completed
  • Priority: P0
  • Dependencies: none

Acceptance

  • @packages/pipeline/@source/validate/checks.ts exports ValidationCheck[]
  • ValidationCheck type: { name: string, description: string, severity: 'error' | 'warning', sql: string }
  • Checks implemented:
    • raw_transactions_empty — error: SELECT COUNT(*) FROM raw_transactions HAVING COUNT(*) = 0
    • raw_students_empty — error: SELECT COUNT(*) FROM raw_students HAVING COUNT(*) = 0
    • raw_targets_empty — warning: SELECT COUNT(*) FROM raw_targets HAVING COUNT(*) = 0 (only 2026 has targets)
    • raw_targets_bad_columns — error: checks that column unit exists and is not entirely null in raw_targets
    • raw_students_bloat — warning: row count > 20,000 signals junk sheets were loaded
    • raw_transactions_invalid_year — warning: rows where extracted year from intake < 2020 or > current year + 1
    • raw_transactions_unknown_unit — warning: unit values not in known set (REAL, IONs, IONS, ICC, ICA, NP, DTC, TM, WLC, WLC INGGRIS, WLC NON INGGRIS, DETECTION, MANDARIN)
    • raw_transactions_negative_amount — warning: TRY_CAST(amount AS DOUBLE) < 0
  • No check for raw_marketing_activity — it no longer exists
  • Vitest unit test: @source/__tests__/validate.test.ts — tests check SQL strings and severity values

Files

  • @packages/pipeline/@source/validate/checks.ts
  • @packages/pipeline/@source/tests/validate.test.ts

T-013 - Implement validate/index.ts and wire CLI

Main validate orchestrator. Runs all checks, writes JSON report.

  • Status: completed
  • Priority: P0
  • Dependencies: T-012

Acceptance

  • @packages/pipeline/@source/validate/index.ts exports validateAll(config, options)
  • Runs all checks from checks.ts against atlas.db
  • Collects: check name, severity, passed/failed, failing row count, sample rows (up to 5)
  • Writes output/validation/{entity.toLowerCase()}-{year}-validation.json (2-space indent)
  • Console summary: for passed, for failed; groups errors then warnings
  • Exit code 0 if all pass or warnings only; exit code 1 if any error-severity check fails
  • pnpm run validate -- --entity IONS --year 2026 runs and produces report
  • pnpm --filter @packages/pipeline test:type passes

Files

  • @packages/pipeline/@source/validate/index.ts
  • @packages/pipeline/@source/index.ts (update: wire validate sub-command)

Phase 6: dbt Quality

  • Purpose: Remove dead model; add is_valid flagging; add audit model

T-014 - Remove stg_marketing_activity and raw_marketing_activity

Delete the dead staging model and its source declaration. Verify dbt still runs without error.

  • Status: completed
  • Priority: P0
  • Dependencies: T-011

Acceptance

  • @python/analytics/models/staging/stg_marketing_activity.sql deleted
  • raw_marketing_activity source entry removed from @python/analytics/models/staging/sources.yml
  • mart_channel_marketing schema entry in marts/schema.yml updated if it referenced stg_marketing_activity (it doesn't — but verify)
  • cd @python/analytics && uv run dbt run succeeds without errors
  • cd @python/analytics && uv run dbt test passes all remaining tests
  • mart_channel_marketing still populates correctly (it reads int_enrollments, unaffected)

Notes

  • mart_channel_marketing reads from int_enrollments via payment_via/payment_channel columns — completely independent of REKAP data. Verify this with a row count check before and after.

Files

  • @python/analytics/models/staging/stg_marketing_activity.sql (delete)
  • @python/analytics/models/staging/sources.yml

T-015 - Add is_valid and invalid_reason to int_orders.sql

Add two computed columns. Priority order of invalidity checks matters — first matching condition sets the reason.

  • Status: completed
  • Priority: P0
  • Dependencies: T-014

Acceptance

  • int_orders.sql has is_valid (boolean) and invalid_reason (varchar) in the final SELECT
  • is_valid CASE logic (in priority order):
    1. period_year IS NULL → false, reason 'null_period'
    2. period_year < 2020 → false, reason 'invalid_period_year'
    3. period_year > EXTRACT(YEAR FROM CURRENT_DATE) + 1 → false, reason 'invalid_period_year'
    4. amount IS NULL OR amount <= 0 → false, reason 'invalid_amount'
    5. unit_code_normalized IS NULL → false, reason 'unknown_unit'
    6. else → true, reason NULL
  • cd @python/analytics && uv run dbt run --select int_orders succeeds
  • No null values in is_valid column (every row is true or false)

Files

  • @python/analytics/models/intermediate/int_orders.sql

T-016 - Filter int_enrollments to is_valid = true

One-line change. All mart models inherit the filter via int_enrollments.

  • Status: completed
  • Priority: P0
  • Dependencies: T-015

Acceptance

  • int_enrollments.sql WHERE clause includes AND is_valid = true
  • uv run dbt run completes all models
  • mart_revenue row count is plausible (lower than before due to invalid row exclusion)
  • uv run dbt test passes all schema tests

Files

  • @python/analytics/models/intermediate/int_enrollments.sql

T-017 - Create audit_flagged_orders dbt view

Exposes all invalid orders for inspection. View — no storage cost, always fresh.

  • Status: completed
  • Priority: P1
  • Dependencies: T-015

Acceptance

  • @python/analytics/models/intermediate/audit_flagged_orders.sql exists
  • Configured as {{ config(materialized='view') }}
  • SELECT * FROM {{ ref('int_orders') }} WHERE is_valid = false
  • Schema entry in @python/analytics/models/intermediate/schema.yml
  • uv run dbt run --select audit_flagged_orders succeeds

Files

  • @python/analytics/models/intermediate/audit_flagged_orders.sql
  • @python/analytics/models/intermediate/schema.yml (new or updated)

T-018 - Add dbt schema tests for is_valid columns

Ensure is_valid and invalid_reason behave correctly.

  • Status: completed
  • Priority: P1
  • Dependencies: T-017

Acceptance

  • Schema test: is_valid has no nulls in int_orders
  • Schema test: invalid_reason is not null for all rows where is_valid = false
  • uv run dbt test passes all tests including new ones

Files

  • @python/analytics/models/intermediate/schema.yml

T-019 - End-to-end verification and AGENTS.md update

Full pipeline run from extract through format. Update docs.

  • Status: cancelled — superseded by T-019a, T-019b, T-019c (all completed)
  • Priority: P0
  • Dependencies: T-010, T-011, T-013, T-016, T-017

Acceptance

  • pnpm run extract -- --entity IONS --year 2026 skips existing CSVs (frozen)
  • pnpm run sync -- --entity IONS --year 2026 loads from CSVs; raw_targets ~55 rows with correct columns; raw_students ≤ 15k rows; no raw_marketing_activity table
  • pnpm run validate -- --entity IONS --year 2026 exits 0; JSON report written; no error-severity failures
  • cd @python/analytics && uv run dbt run completes; audit_flagged_orders view exists
  • cd @python/analytics && uv run dbt test passes
  • pnpm run format -- --entity IONS --period 2026-02 produces output/report.json with correct sections
  • Root AGENTS.md updated: extract command documented; @source/clean/ frozen CSV snapshots explained; REKAP exclusion noted; audit_flagged_orders mentioned
  • @packages/pipeline/AGENTS.md updated: extract + validate sub-commands documented; readReferensiSheet noted
  • @python/analytics/AGENTS.md updated: is_valid/invalid_reason pattern documented; stg_marketing_activity removal noted

Files

  • AGENTS.md
  • @packages/pipeline/AGENTS.md
  • @python/analytics/AGENTS.md
  • output/validation/ions-2026-validation.json (generated, gitignored)

T-019a - Update extract step: amount formatting, date formatting, identity rename

Update reader.ts to produce cleaner values in CSVs. Re-extract all years with --overwrite and commit updated CSVs.

  • Status: completed
  • Priority: P0
  • Dependencies: T-019

Context

The staging SQL currently fails because:

  1. amount values are "100,000" strings — TRY_CAST returns NULL, filtering all rows out
  2. date values may be Excel serial numbers or locale strings — needs ISO "2026-01-02" format
  3. identity column name should be id_raw to reflect its role

These are formatting concerns belonging in the extract step, not staging SQL.

Boundary

LayerResponsibility
Extract (CSV)Amount plain number, date ISO, id_raw raw string
Staging SQLIdentity split, period parsing, unit normalization, customer type translation

Acceptance

  • reader.ts formatAmount(value) function: strips all commas, returns plain number string ("100,000""100000", """")
  • reader.ts formatDate(value) function: handles Excel serial integers, locale date strings ("2 January 2026", "2026-01-02 00:00:00"), outputs ISO "YYYY-MM-DD" or "" if unparseable
  • reader.ts applies formatAmount to amount column in transactions output
  • reader.ts applies formatDate to date column in transactions and datetime column in students
  • STUDENTS_COLUMNS in schema.ts: identity renamed to id_raw (same position, same count — 26 columns)
  • reader.ts maps source identity/T/T/L-derived raw value to id_raw column (not identity)
  • Vitest unit tests added/updated in extract.test.ts: formatAmount and formatDate covering empty, comma number, Excel serial, ISO string, locale string, malformed
  • pnpm --filter @packages/pipeline test passes
  • pnpm run extract -- --entity IONS --year 2023 --overwrite runs successfully
  • pnpm run extract -- --entity IONS --year 2024 --overwrite runs successfully
  • pnpm run extract -- --entity IONS --year 2025 --overwrite runs successfully
  • pnpm run extract -- --entity IONS --year 2026 --overwrite runs successfully
  • Spot check: ions-2026-01-transactions.csvamount column has "100000" not "100,000"; date column has "2026-01-02" not "2 January 2026"
  • Spot check: ions-2026-01-students.csv — header has id_raw column not identity
  • All 100 updated CSVs committed to git

Notes

  • SheetJS raw: false may already convert Excel serials to locale strings (e.g. "1/2/2026"). The formatter needs to handle both forms.
  • birth_date in students is a human-readable text like "18 Desember 2003" — do NOT apply formatDate to it. It is not a machine date.
  • datetime in students often has time component ("2026-01-02 00:00:00") — strip time component, output date only.
  • formatAmount must handle Indonesian thousands separator (.) as well as Western (,) — some years use periods.

Files

  • @packages/pipeline/@source/extract/reader.ts
  • @packages/pipeline/@source/extract/schema.ts
  • @packages/pipeline/@source/tests/extract.test.ts
  • @source/clean/ (all 100 CSVs re-extracted with --overwrite)

T-019b - Simplify staging SQL: direct English column refs, identity split, period parsing

Replace all atlas_coalesce_columns_or_null macro lookups with direct English column references. Add identity split columns. Add period parsing. Simplify int_orders to use staging outputs.

  • Status: completed
  • Priority: P0
  • Dependencies: T-019a

What changes

stg_transactions.sql — full rewrite to direct column refs:

  • Remove all atlas_coalesce_columns_or_null / atlas_column_or_null macro calls
  • Direct references: receipt, date, unit, program, variant, intake, student, amount, payment_via, payment_channel, student_status, student_kind, payment_status, category, barsheet
  • TRY_CAST(amount AS DECIMAL(18,2)) now works directly (no REPLACE needed — amount is already plain number)
  • TRY_CAST(date AS DATE) sufficient (no atlas_parse_date needed — date is already ISO)
  • Period parsing added here (from intake):
    CASE UPPER(intake)
      WHEN 'JANUARI%' THEN 1 ... WHEN 'DESEMBER%' THEN 12
      ELSE NULL
    END AS period_month,
    TRY_CAST(REGEXP_EXTRACT(intake, '(20[0-9]{2})', 1) AS INTEGER) AS period_year
  • Customer type translation added here:
    CASE UPPER(student_status)
      WHEN 'B' THEN 'New'
      WHEN 'L' THEN 'Renewal'
      WHEN 'A' THEN 'Alumni'
      ELSE student_status
    END AS customer_type
  • Receipt synthesis for legacy rows (no receipt column): CONCAT('LEGACY-', no, '-', program, '-', student, '-', date)
  • WHERE: NULLIF(receipt,'') IS NOT NULL OR NULLIF(program,'') IS NOT NULL
  • Cleaned CTE: amount IS NOT NULL AND unit IS NOT NULL AND program IS NOT NULL AND program <> 'program'

stg_students.sql — full rewrite to direct column refs + identity split:

  • Remove all atlas_coalesce_columns_or_null macro calls
  • Direct references: id_raw, name, datetime, organization, unit, age_cluster, course_type, program, class_level, intake, source, student_status, payment_method, sex, receipt
  • Unit normalization CASE stays in staging (not CSV): UPPER(unit) IN ('WLC INGGRIS'...) THEN 'WLC_ENGLISH'
  • Period parsing added (same as transactions, from intake)
  • Identity split (5 columns):
    • id_ktp: type = KTP or KKTP
    • id_kp: type = KARTU PELAJAR or KARTU PELAJAR PS
    • id_sim: type = SIM
    • id_pass: type = PASSPORT or PASPORT
    • id_nis: type = NIS or NOMOR INDUK SEKOLAH or NOMOR INDUK
    • Unrecognized types → all 5 NULL (visible via id_raw)
  • WHERE: NULLIF(name,'') IS NOT NULL AND NULLIF(name,'') NOT IN ('name','NAMA')

int_orders.sql — simplification:

  • Remove the 24-line CASE block for period_month (now from staging)
  • Remove the period_year COALESCE block (now from staging)
  • Remove the customer_type CASE block (BNew etc.) — now from stg_transactions
  • period_year = COALESCE(t.period_year, s.period_year)
  • period_month = COALESCE(t.period_month, s.period_month)
  • customer_type = t.customer_type (already translated)
  • Identity columns passed through from stg_students: id_raw, id_ktp, id_kp, id_sim, id_pass, id_nis

Acceptance

  • stg_transactions.sql has zero atlas_coalesce_columns_or_null / atlas_column_or_null calls
  • stg_students.sql has zero macro calls (except atlas_parse_date if still needed)
  • stg_students.sql outputs id_raw, id_ktp, id_kp, id_sim, id_pass, id_nis
  • stg_transactions.sql outputs period_year, period_month, customer_type
  • stg_students.sql outputs period_year, period_month
  • int_orders.sql uses t.period_year, t.period_month from staging (no inline parsing)
  • int_orders.sql uses t.customer_type from staging (no inline CASE)
  • int_orders.sql passes through id_raw, id_ktp, id_kp, id_sim, id_pass, id_nis
  • cd @python/analytics && uv run dbt run completes with non-zero rows in all mart models
  • mart_revenue for period_year=2026, period_month=2 has non-zero revenue_actuals
  • cd @python/analytics && uv run dbt test passes all tests
  • audit_flagged_orders rows reflect only genuinely invalid data (not column-name-mismatch false positives)

Notes

  • atlas_parse_date macro can be removed from stg_transactions/stg_students since dates are now ISO. Keep the macro definition in helpers.sql in case other models need it.
  • stg_targets.sql is unaffected — it reads from raw_targets which has its own column schema.
  • The QUALIFY ROW_NUMBER() deduplication in stg_transactions should be preserved.
  • stg_students WHERE filter must exclude header-artifact rows: name NOT IN ('name', 'NAMA') to handle any CSV header leak.

Files

  • @python/analytics/models/staging/stg_transactions.sql
  • @python/analytics/models/staging/stg_students.sql
  • @python/analytics/models/intermediate/int_orders.sql
  • @python/analytics/models/intermediate/schema.yml (add id_ktp, id_kp, id_sim, id_pass, id_nis to int_orders docs)

T-019c - Final end-to-end verification

Run the complete pipeline. Verify marts are populated. Verify format output. Close plan.

  • Status: completed
  • Priority: P0
  • Dependencies: T-019b

Acceptance

  • pnpm run sync -- --entity IONS --year 2026 loads from updated CSVs; raw_students has id_raw column; raw_transactions amount column has plain numbers
  • pnpm run validate -- --entity IONS --year 2026 exits 0; all checks pass or warning-only
  • cd @python/analytics && uv run dbt run completes; int_orders has non-zero rows; int_enrollments has non-zero rows
  • mart_revenue for 2026-01 and 2026-02 has non-zero revenue_actuals
  • mart_channel_marketing has non-zero rows
  • audit_flagged_orders row count is low (only genuine data quality issues)
  • pnpm run format -- --entity IONS --period 2026-02 produces output/report.json with populated units sections (non-empty revenue, programs, channels, schools)
  • pnpm run sync -- --entity IONS --year 2023 and uv run dbt run produces non-zero historical data in marts
  • pnpm --filter @packages/pipeline test passes
  • cd @python/analytics && uv run dbt test passes all tests
  • AGENTS.md updated to document id_ktp, id_kp, id_sim, id_pass, id_nis columns in stg_students

Files

  • AGENTS.md (update)
  • output/report.json (generated, gitignored)

On this page

Phase 1: Audit Documentation
T-001 - Create @plan/audit/ and sheet index
Acceptance
Files
T-002 - Create per-file audit snapshots
Acceptance
Files
Phase 2: YAML Config Fixes
T-003 - Fix ions-2026.yaml
Acceptance
Notes
Files
T-004 - Fix ions-2025.yaml and ions-2024.yaml
Acceptance
Files
T-005 - Fix ions-2023.yaml
Acceptance
Notes
Files
T-006 - Define canonical CSV schema and extend SourceConfig type
Acceptance
Column reference — transactions (17):
Column reference — students (26):
Column reference — targets (13):
Column reference — organizations (3):
Files
Phase 3: Extract Step
T-007 - Implement extract/reader.ts and extract/referensi.ts
Acceptance — readSheet
Acceptance — readReferensiSheet
Notes
Files
T-008 - Implement extract/writer.ts
Acceptance
Files
T-009 - Implement extract/index.ts and wire CLI
Acceptance
Indonesian month → number mapping (hardcoded constant):
Files
T-010 - Extract all years to @source/clean/ and commit
Acceptance
Expected row counts (approximate):
Files
Phase 4: Load Step
T-011 - Refactor load/csv.ts to read from @source/clean/
Acceptance
Notes
Files
Phase 5: Validate Step
T-012 - Implement validate/checks.ts
Acceptance
Files
T-013 - Implement validate/index.ts and wire CLI
Acceptance
Files
Phase 6: dbt Quality
T-014 - Remove stg_marketing_activity and raw_marketing_activity
Acceptance
Notes
Files
T-015 - Add is_valid and invalid_reason to int_orders.sql
Acceptance
Files
T-016 - Filter int_enrollments to is_valid = true
Acceptance
Files
T-017 - Create audit_flagged_orders dbt view
Acceptance
Files
T-018 - Add dbt schema tests for is_valid columns
Acceptance
Files
T-019 - End-to-end verification and AGENTS.md update
Acceptance
Files
T-019a - Update extract step: amount formatting, date formatting, identity rename
Context
Boundary
Acceptance
Notes
Files
T-019b - Simplify staging SQL: direct English column refs, identity split, period parsing
What changes
Acceptance
Notes
Files
T-019c - Final end-to-end verification
Acceptance
Files