Atlas Plan

Overview

Implement a clean, auditable data pipeline that replaces direct xlsx→DuckDB loading with a two-step process: extract xlsx sheets to canonical CSV files, then load CSVs into DuckDB. Add a validation step between load and transform. Add is_valid flagging in dbt to preserve bad rows while keeping marts clean. Document every source sheet in @plan/audit/ as a snapshot reference.

The key problems being solved:

raw_targets is completely broken (702 auto-named columns, 400k garbage rows) due to wrong header row
raw_students is 20× bloated (201k rows, should be ~10k) due to compute/summary sheets being included
2023 LAPORAN has a 2-row merged header — sub-columns (PROGRAM, INTAKE, NAMA SISWA, JUMLAH) are under a PENDAPATAN group header in row 6; actual column names are in row 7
Invalid INTAKE years (e.g. SEPTEMBER 2032) pass silently into marts
stg_marketing_activity is a dead model — raw_marketing_activity is loaded but nothing downstream uses it
No documented record of which sheets exist, which are loaded, and which are excluded

Depends on: Plan 008 (Package Restructure) — complete. All work targets @packages/pipeline and @python/analytics.

Audit: T-001 and T-002 (audit documentation) completed in planning session 2026-02-21. Findings embedded in Tasks.md and @plan/audit/.

Goals

Implement extract step in @packages/pipeline: SheetJS-based xlsx → @source/clean/*.csv for all configured sheets across all 4 years (2023–2026)
Implement load step to read from @source/clean/*.csv into DuckDB raw_* tables (replacing direct xlsx reading)
Implement validate step: SQL checks on raw_* tables post-load, output to output/validation/*.json
Fix all @source/config/*.yaml files: correct header rows, add merge_header_rows for 2023 transactions only, remove all marketing_activity entries
Remove stg_marketing_activity.sql and its source declaration — it is a dead model with no downstream dependencies
Add is_valid + invalid_reason columns to int_orders in @python/analytics
Add audit_flagged_orders dbt model (view) exposing all invalid rows

Non-Goals

Implementing a pipeline orchestrator (run sub-command) — stubs are fine for now
Any EPN entity support
Real-time or streaming data
Connecting validation output to a dashboard or alert system
Historical backfill of flagged data — flagging is forward-looking from this plan
Extracting REKAP (marketing summary) sheets to CSV — these are derived summaries used for manual validation only, not source data
Refactoring stg_* staging SQL to use new English column names — future plan

Phases

Phase 1: Audit docs — complete (done in planning session 2026-02-21)
Phase 2: YAML config fixes — correct header rows, remove marketing_activity, add organizations entries
Phase 3: Extract step — SheetJS-based xlsx → CSV for all years, all file types
Phase 4: Load step — refactor load to read from @source/clean/ CSVs; remove marketing_activity load block
Phase 5: Validate step — SQL checks on raw_* tables, JSON report output
Phase 6: dbt quality — remove stg_marketing_activity; add is_valid flagging; add audit_flagged_orders

Success

Requirements

Plan 008 completed: @packages/pipeline and @python/analytics exist
@source/raw/ contains all 8 xlsx files
@source/clean/ directory exists with .gitkeep
xlsx (SheetJS) available via catalog:util in pipeline dependencies
Python/uv available for dbt verification

Context

Why This Approach

CSV intermediary layer: xlsx files have structural problems (merged headers, wrong header rows, hidden rows) that DuckDB's read_xlsx cannot handle correctly. Extracting to CSV in TypeScript with SheetJS gives full control over header detection, row skipping, hidden-row handling, and column mapping before data enters DuckDB.
CSVs committed to git (frozen snapshots): provides a permanent, diffable audit trail. A new extraction only happens intentionally, not automatically on every sync run. This separates "re-extract from source" (rare) from "re-load into DuckDB" (common).
Canonical English column names in CSVs: all years produce the same columns. Absent columns for a year are empty strings. This eliminates per-year column aliasing in dbt staging.
is_valid flag, not hard exclusion: invalid rows are retained in int_orders with a flag. Full audit trail. Marts filter to is_valid = true.
Removing stg_marketing_activity: it is a dead model — nothing downstream reads it. mart_channel_marketing reads from int_enrollments (payment channel breakdown from transaction data), not from REKAP-derived data. Removing it eliminates confusion and a broken dependency.
REKAP sheets excluded from extraction: REKAP sheets are Excel-computed summaries derived from the monthly transaction sheets. Every number in them is computable from the transaction CSVs we already extract. They serve as manual validation reference only.

Key Constraints

2023 LAPORAN only has 2-row merged header: row 6 = PENDAPATAN group (spans cols E–H), row 7 = PROGRAM, INTAKE, NAMA SISWA, JUMLAH. 2024, 2025, 2026 LAPORAN all have flat single-row headers at row 6. Only 2023 transactions needs merge_header_rows: [6, 7].
2026 targets has 2-row header at rows 10–11: row 10 = outer labels, row 11 = sub-labels (SISWA STANDAR under TARGET and REALISASI). Needs merge_header_rows: [10, 11]. Data starts at row 12. Rows 1–8 are a letter salutation — skip entirely.
Targets exists only in 2026: EVALUASI DATA SISWA sheet absent in 2023, 2024, 2025.
Organizations (REFERENSI) exists only in 2024–2026: absent in 2023.
Hidden rows must be ignored: xlsx files use Excel autoFilter extensively. 747 of 847 rows in JANUARI 2026 are marked hidden. Both SheetJS and openpyxl read ALL rows regardless of hidden state — this is correct. Never filter by row_dimensions.hidden.
Targets CSV named by period: EVALUASI DATA SISWA row 9 contains "DATA BULAN FEBRUARI 2026". Extract derives the month number from this string. Filename: ions-2026-02-targets.csv.
@source/clean/ CSVs are frozen: extract must NOT overwrite existing CSVs unless --overwrite is passed.
2023 students payment columns swapped: CARA BAYAR and POLA PEMBAYARAN are in reversed positions compared to 2024–2026. The YAML column map already handles this — payment_method: CARA BAYAR, payment_pattern: POLA PEMBAYARAN in all years, but the Indonesian values differ by year.
2026 DATABASE SISWA SEPTEMBER25 typo: the September data sheet is misnamed SEPTEMBER25 (not SEPTEMBER26). Config must reference this exact name.
ICC = IONs Culinary College, ICA = IONs Culinary Academy: both appear as unit values in transaction data. Treated as separate units.
stg_marketing_activity uses adapter.get_columns_in_relation: if raw_marketing_activity table is removed from DuckDB without deleting the model file, dbt will error. Both must be removed together.

Edge Cases

Empty monthly sheets: future months in 2025/2026 have 0 data rows but stubs exist. Extract produces a header-only CSV. Load skips empty CSVs gracefully (logs warning, does not fail).
Targets period detection: if row 9 of EVALUASI DATA SISWA does not contain a parseable month/year, fall back to the CLI --year argument and log a warning.
Organizations matrix layout: REFERENSI is not a flat table. It's a pivot matrix: age cluster labels in col B, organization category headers (SMA, SMP, TK-SD, UNIVERSITAS, LAIN-LAIN) in row 3 cols D–H, values in the matrix cells. readReferensiSheet must unpivot to 3 columns: organization, category, age_cluster.
birth_place/birth_date split: T/T/L format is "City, DD Month YYYY". Split at first comma: before = birth_place, after (trimmed) = birth_date. If no comma found, put full value in birth_place_date only, leave birth_place and birth_date empty.
N0 column in REKAP (not extracted): the first column is N0 with a zero, not NO with an O. Not relevant since REKAP is excluded.

Tradeoffs

SheetJS vs openpyxl for extraction: SheetJS chosen for language consistency with @packages/pipeline. Both read hidden rows by default (correct behavior).
100 CSV files committed to git: adds ~5–15 MB to the repo. Acceptable — provides full audit trail and diff visibility when source data changes.
Removing stg_marketing_activity entirely vs keeping as stub: removing is cleaner. The REKAP data it was trying to load was always garbage (wrong structure). If REKAP analysis is needed in future, it warrants a proper dedicated model with the stateful section reader.
Organizations as 3 separate year CSVs (not combined): allows year-over-year comparison of the lookup list. dbt can UNION ALL them if needed.

Skills

plan — plan file format and conventions
vitest — additional tests for extract and validate logic

Boundaries

Always: Update Progress.md after each task completion
Always: Verify pnpm test:type passes after any TypeScript change
Always: Run uv run dbt run && uv run dbt test after any dbt model change
Always: The --overwrite flag must be explicitly passed to re-extract an existing CSV — never silently overwrite
Always: Read ALL rows from xlsx sheets — never filter by hidden row state
Ask first: Any change to existing mart SQL beyond adding is_valid filter to int_enrollments
Ask first: Adding new dbt models beyond audit_flagged_orders
Ask first: Changing stg_* SQL logic (staging refactor is future scope)
Never: Delete or overwrite @source/clean/ CSVs without explicit --overwrite flag
Never: Change @source/raw/ xlsx files
Never: Hardcode year numbers in extract/load/validate TypeScript — always read from config or CLI args
Never: Extract REKAP sheets — they are excluded by design

Data Quality Pipeline