Atlas Plan 009 (2026-02-21)

Data Quality Pipeline

Overview

Implement a clean, auditable data pipeline that replaces direct xlsx→DuckDB loading with a two-step process: extract xlsx sheets to canonical CSV files, then load CSVs into DuckDB. Add a validation step between load and transform. Add is_valid flagging in dbt to preserve bad rows while keeping marts clean. Document every source sheet in @plan/audit/ as a snapshot reference.

The key problems being solved:

  • raw_targets is completely broken (702 auto-named columns, 400k garbage rows) due to wrong header row
  • raw_students is ~20× too large (201k rows against an expected ~10k) because compute/summary sheets are included in the load
  • 2023 LAPORAN has a 2-row merged header — sub-columns (PROGRAM, INTAKE, NAMA SISWA, JUMLAH) are under a PENDAPATAN group header in row 6; actual column names are in row 7
  • Invalid INTAKE years (e.g. SEPTEMBER 2032) pass silently into marts
  • stg_marketing_activity is a dead model — raw_marketing_activity is loaded but nothing downstream uses it
  • No documented record of which sheets exist, which are loaded, and which are excluded
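The 2023 merged-header problem above can be sketched as a small flattening step. This is an illustrative sketch only, not the pipeline's actual code: `flattenHeader` is a hypothetical name, and it assumes a merged group cell has a value only in its first column and applies to every column until the row ends (true for the 2023 layout, where PENDAPATAN spans the last four columns).

```typescript
// Hypothetical sketch: flatten a 2-row merged header (group row + sub-label
// row) into single column names. The group label is carried forward across
// its merged span; columns with no group keep their own label.
function flattenHeader(
  groupRow: (string | null)[],
  labelRow: (string | null)[],
): string[] {
  let currentGroup: string | null = null;
  return labelRow.map((label, i) => {
    // A merged group cell only holds a value in its first column.
    const g = groupRow[i];
    if (g) currentGroup = g;
    if (!label) return currentGroup ?? `col_${i}`;
    return currentGroup ? `${currentGroup} ${label}` : label;
  });
}

// 2023 LAPORAN: row 6 carries the PENDAPATAN group, row 7 the sub-labels.
const header = flattenHeader(
  [null, null, null, null, "PENDAPATAN", null, null, null],
  ["NO", "TANGGAL", "UNIT", "KETERANGAN", "PROGRAM", "INTAKE", "NAMA SISWA", "JUMLAH"],
);
// header[4] === "PENDAPATAN PROGRAM", header[0] === "NO"
```

Loading row 7 alone (as the old config effectively did for the wrong row) is what produced the auto-named columns in raw_targets; merging both rows yields stable, meaningful names.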

Depends on: Plan 008 (Package Restructure) — complete. All work targets @packages/pipeline and @python/analytics.

Audit: T-001 and T-002 (audit documentation) completed in planning session 2026-02-21. Findings embedded in Tasks.md and @plan/audit/.

Goals

  • Implement extract step in @packages/pipeline: SheetJS-based xlsx → @source/clean/*.csv for all configured sheets across all 4 years (2023–2026)
  • Implement load step to read from @source/clean/*.csv into DuckDB raw_* tables (replacing direct xlsx reading)
  • Implement validate step: SQL checks on raw_* tables post-load, output to output/validation/*.json
  • Fix all @source/config/*.yaml files: correct header rows, add merge_header_rows for 2023 transactions only, remove all marketing_activity entries
  • Remove stg_marketing_activity.sql and its source declaration — it is a dead model with no downstream dependencies
  • Add is_valid + invalid_reason columns to int_orders in @python/analytics
  • Add audit_flagged_orders dbt model (view) exposing all invalid rows
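The validate step will run its checks as SQL against the raw_* tables; as a minimal sketch of one such rule (the invalid-INTAKE-year check from the problem list), the logic can be shown as a pure function. The name `checkIntakeYear` and the year bounds are illustrative assumptions, not the pipeline's actual API.

```typescript
// Hypothetical sketch of one validation rule: flag INTAKE values whose
// trailing year falls outside a plausible window. Returns null for valid
// values, or a reason string suitable for the JSON validation report.
function checkIntakeYear(
  intake: string,
  minYear = 2020,
  maxYear = 2027,
): string | null {
  const match = intake.match(/(\d{4})\s*$/); // trailing 4-digit year
  if (!match) return `unparseable intake: ${intake}`;
  const year = Number(match[1]);
  if (year < minYear || year > maxYear) {
    return `intake year out of range: ${intake}`;
  }
  return null; // valid
}

checkIntakeYear("SEPTEMBER 2026"); // → null (valid)
checkIntakeYear("SEPTEMBER 2032"); // → "intake year out of range: SEPTEMBER 2032"
```

The same reason strings could later feed the invalid_reason column in int_orders, so the CLI report and the dbt flag stay consistent.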

Non-Goals

  • Implementing a pipeline orchestrator (run sub-command) — stubs are fine for now
  • Any EPN entity support
  • Real-time or streaming data
  • Connecting validation output to a dashboard or alert system
  • Historical backfill of flagged data — flagging is forward-looking from this plan
  • Extracting REKAP (marketing summary) sheets to CSV — these are derived summaries used for manual validation only, not source data
  • Refactoring stg_* staging SQL to use new English column names — future plan

Phases

  • Phase 1: Audit docs — completed in planning session 2026-02-21
  • Phase 2: YAML config fixes — correct header rows, remove marketing_activity, add organizations entries
  • Phase 3: Extract step — SheetJS-based xlsx → CSV for all years, all file types
  • Phase 4: Load step — refactor load to read from @source/clean/ CSVs; remove marketing_activity load block
  • Phase 5: Validate step — SQL checks on raw_* tables, JSON report output
  • Phase 6: dbt quality — remove stg_marketing_activity; add is_valid flagging; add audit_flagged_orders

Success

  • @plan/audit/ contains sheet-index.md and 8 per-file snapshot docs — done in planning session
  • All @source/config/*.yaml files updated: correct header rows, no marketing_activity entries, organizations entries for 2024–2026
  • pnpm run extract -- --entity IONS --year 2026 produces CSV files in @source/clean/ for all 2026 sheets
  • @source/clean/ contains committed CSVs for all years (2023–2026): 48 transactions + 48 students + 1 targets + 3 organizations = 100 files
  • All CSVs use canonical English column names; all years have the same columns (absent = empty string)
  • pnpm run sync -- --entity IONS --year 2026 loads from @source/clean/ CSVs into raw_* tables
  • raw_targets has correct column names and ~55 rows (not 199k)
  • raw_students row count is ≤ 15,000 per year (not 201k)
  • raw_marketing_activity table no longer created or loaded
  • stg_marketing_activity.sql deleted; uv run dbt run succeeds without it
  • pnpm run validate -- --entity IONS --year 2026 produces output/validation/ions-2026-validation.json
  • int_orders has is_valid and invalid_reason columns
  • int_enrollments and all mart models filter to is_valid = true
  • mart_channel_marketing still works (it reads int_enrollments, not stg_marketing_activity)
  • uv run dbt test passes all schema tests

Requirements

  • Plan 008 completed: @packages/pipeline and @python/analytics exist
  • @source/raw/ contains all 8 xlsx files
  • @source/clean/ directory exists with .gitkeep
  • xlsx (SheetJS) available via catalog:util in pipeline dependencies
  • Python/uv available for dbt verification

Context

Why This Approach

  • CSV intermediary layer: xlsx files have structural problems (merged headers, wrong header rows, hidden rows) that DuckDB's read_xlsx cannot handle correctly. Extracting to CSV in TypeScript with SheetJS gives full control over header detection, row skipping, hidden-row handling, and column mapping before data enters DuckDB.
  • CSVs committed to git (frozen snapshots): provides a permanent, diffable audit trail. A new extraction only happens intentionally, not automatically on every sync run. This separates "re-extract from source" (rare) from "re-load into DuckDB" (common).
  • Canonical English column names in CSVs: all years produce the same columns. Absent columns for a year are empty strings. This eliminates per-year column aliasing in dbt staging.
  • is_valid flag, not hard exclusion: invalid rows are retained in int_orders with a flag. Full audit trail. Marts filter to is_valid = true.
  • Removing stg_marketing_activity: it is a dead model — nothing downstream reads it. mart_channel_marketing reads from int_enrollments (payment channel breakdown from transaction data), not from REKAP-derived data. Removing it eliminates confusion and a broken dependency.
  • REKAP sheets excluded from extraction: REKAP sheets are Excel-computed summaries derived from the monthly transaction sheets. Every number in them is computable from the transaction CSVs we already extract. They serve as manual validation reference only.
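The canonical-column rule above can be sketched in a few lines. The column names here are a subset taken from this plan; `toCanonicalRow` and the exact column list are illustrative assumptions, not the extractor's real interface.

```typescript
// Hypothetical sketch of the canonical-column rule: every year's CSV carries
// the same English columns, with an empty string wherever a year lacks the
// source column. This keeps dbt staging free of per-year column aliasing.
const CANONICAL_COLUMNS = ["program", "intake", "student_name", "amount"] as const;

function toCanonicalRow(source: Record<string, string>): Record<string, string> {
  const row: Record<string, string> = {};
  for (const col of CANONICAL_COLUMNS) {
    row[col] = source[col] ?? ""; // absent column for this year → empty string
  }
  return row;
}

// A row from a year that lacks `amount` still produces all four columns:
toCanonicalRow({ program: "CULINARY", intake: "SEPTEMBER 2023", student_name: "A" });
// → { program: "CULINARY", intake: "SEPTEMBER 2023", student_name: "A", amount: "" }
```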

Key Constraints

  • Only the 2023 LAPORAN has a 2-row merged header: row 6 = PENDAPATAN group (spans cols E–H), row 7 = PROGRAM, INTAKE, NAMA SISWA, JUMLAH. The 2024, 2025, and 2026 LAPORAN files all have flat single-row headers at row 6, so only the 2023 transactions config needs merge_header_rows: [6, 7].
  • 2026 targets has 2-row header at rows 10–11: row 10 = outer labels, row 11 = sub-labels (SISWA STANDAR under TARGET and REALISASI). Needs merge_header_rows: [10, 11]. Data starts at row 12. Rows 1–8 are a letter salutation — skip entirely.
  • Targets exists only in 2026: the EVALUASI DATA SISWA sheet is absent in 2023, 2024, and 2025.
  • Organizations (REFERENSI) exists only in 2024–2026: absent in 2023.
  • Hidden rows must be ignored: xlsx files use Excel autoFilter extensively. 747 of 847 rows in JANUARI 2026 are marked hidden. Both SheetJS and openpyxl read ALL rows regardless of hidden state — this is correct. Never filter by row_dimensions.hidden.
  • Targets CSV named by period: EVALUASI DATA SISWA row 9 contains "DATA BULAN FEBRUARI 2026". Extract derives the month number from this string. Filename: ions-2026-02-targets.csv.
  • @source/clean/ CSVs are frozen: extract must NOT overwrite existing CSVs unless --overwrite is passed.
  • 2023 students payment columns swapped: CARA BAYAR and POLA PEMBAYARAN are in reversed positions compared to 2024–2026. The YAML column map already handles this — payment_method: CARA BAYAR, payment_pattern: POLA PEMBAYARAN in all years, but the Indonesian values differ by year.
  • 2026 DATABASE SISWA SEPTEMBER25 typo: the September data sheet is misnamed SEPTEMBER25 (not SEPTEMBER26). Config must reference this exact name.
  • ICC = IONs Culinary College, ICA = IONs Culinary Academy: both appear as unit values in transaction data. Treated as separate units.
  • stg_marketing_activity uses adapter.get_columns_in_relation: if raw_marketing_activity table is removed from DuckDB without deleting the model file, dbt will error. Both must be removed together.
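The header constraints above translate into two config entries. This is a hedged sketch of the YAML shape: merge_header_rows is named in this plan, but the surrounding keys (sheet, data_start_row) and entry names are assumptions about the config format, not its confirmed schema.

```yaml
# Hypothetical shape of @source/config entries for the two merged-header cases.
transactions_2023:
  sheet: "LAPORAN"
  merge_header_rows: [6, 7]    # row 6 = PENDAPATAN group, row 7 = sub-labels

targets_2026:
  sheet: "EVALUASI DATA SISWA"
  merge_header_rows: [10, 11]  # row 10 = outer labels, row 11 = sub-labels
  data_start_row: 12           # rows 1–8 are a letter salutation, skipped
```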

Edge Cases

  • Empty monthly sheets: future months in 2025/2026 have 0 data rows but stubs exist. Extract produces a header-only CSV. Load skips empty CSVs gracefully (logs warning, does not fail).
  • Targets period detection: if row 9 of EVALUASI DATA SISWA does not contain a parseable month/year, fall back to the CLI --year argument and log a warning.
  • Organizations matrix layout: REFERENSI is not a flat table. It's a pivot matrix: age cluster labels in col B, organization category headers (SMA, SMP, TK-SD, UNIVERSITAS, LAIN-LAIN) in row 3 cols D–H, values in the matrix cells. readReferensiSheet must unpivot to 3 columns: organization, category, age_cluster.
  • birth_place/birth_date split: T/T/L format is "City, DD Month YYYY". Split at first comma: before = birth_place, after (trimmed) = birth_date. If no comma found, put full value in birth_place_date only, leave birth_place and birth_date empty.
  • N0 column in REKAP (not extracted): the first column is N0 with a zero, not NO with an O. Not relevant since REKAP is excluded.
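Two of the edge cases above (the T/T/L split and targets period detection) have precise enough rules in this plan to sketch directly. The function names are illustrative; the first-comma split rule, the empty-split fallback, and the "DATA BULAN FEBRUARI 2026" detection are from the plan, while the null-return fallback stands in for "use the CLI --year argument and log a warning".

```typescript
// T/T/L split: "City, DD Month YYYY" — split at the FIRST comma only.
// If no comma is found, the raw value stays in birth_place_date and both
// split fields are left empty, per the plan's rule.
function splitBirthPlaceDate(raw: string): { birth_place: string; birth_date: string } {
  const comma = raw.indexOf(",");
  if (comma === -1) return { birth_place: "", birth_date: "" };
  return {
    birth_place: raw.slice(0, comma).trim(),
    birth_date: raw.slice(comma + 1).trim(),
  };
}

// Targets period detection: parse the month out of a cell like
// "DATA BULAN FEBRUARI 2026"; return null when nothing parseable is found,
// so the caller can fall back to the CLI --year argument and log a warning.
const MONTHS: Record<string, number> = {
  JANUARI: 1, FEBRUARI: 2, MARET: 3, APRIL: 4, MEI: 5, JUNI: 6,
  JULI: 7, AGUSTUS: 8, SEPTEMBER: 9, OKTOBER: 10, NOVEMBER: 11, DESEMBER: 12,
};

function detectTargetsMonth(cell: string): number | null {
  const match = cell.toUpperCase().match(/BULAN\s+([A-Z]+)\s+\d{4}/);
  return match ? MONTHS[match[1]] ?? null : null;
}

splitBirthPlaceDate("Jakarta, 01 Januari 2005");
// → { birth_place: "Jakarta", birth_date: "01 Januari 2005" }
detectTargetsMonth("DATA BULAN FEBRUARI 2026"); // → 2
```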

Tradeoffs

  • SheetJS vs openpyxl for extraction: SheetJS chosen for language consistency with @packages/pipeline. Both read hidden rows by default (correct behavior).
  • 100 CSV files committed to git: adds ~5–15 MB to the repo. Acceptable — provides full audit trail and diff visibility when source data changes.
  • Removing stg_marketing_activity entirely vs keeping as stub: removing is cleaner. The REKAP data it was trying to load was always garbage (wrong structure). If REKAP analysis is needed in future, it warrants a proper dedicated model with the stateful section reader.
  • Organizations as 3 separate year CSVs (not combined): allows year-over-year comparison of the lookup list. dbt can UNION ALL them if needed.

Skills

  • plan — plan file format and conventions
  • vitest — additional tests for extract and validate logic

Boundaries

  • Always: Update Progress.md after each task completion
  • Always: Verify pnpm test:type passes after any TypeScript change
  • Always: Run uv run dbt run && uv run dbt test after any dbt model change
  • Always: The --overwrite flag must be explicitly passed to re-extract an existing CSV — never silently overwrite
  • Always: Read ALL rows from xlsx sheets — never filter by hidden row state
  • Ask first: Any change to existing mart SQL beyond adding is_valid filter to int_enrollments
  • Ask first: Adding new dbt models beyond audit_flagged_orders
  • Ask first: Changing stg_* SQL logic (staging refactor is future scope)
  • Never: Delete or overwrite @source/clean/ CSVs without explicit --overwrite flag
  • Never: Change @source/raw/ xlsx files
  • Never: Hardcode year numbers in extract/load/validate TypeScript — always read from config or CLI args
  • Never: Extract REKAP sheets — they are excluded by design

Questions

  • Should extract produce one CSV per sheet? → Yes. Naming: ions-{YYYY}-{MM}-{title}.csv. Organizations: ions-{YYYY}-organizations.csv (no month).
  • month_intake vs intake? → Use intake only. 2026 BULAN column is dropped; INTAKE maps to intake for all years.
  • Keep birth_place_date raw? → Yes. Also split into birth_place and birth_date. Leave split fields blank if format inconsistent.
  • Should REKAP sheets be extracted? → No. Derived summaries, used for manual validation only.
  • Should stg_marketing_activity be removed? → Yes. Dead model, nothing downstream reads it.
  • Extract all 12 months regardless of whether they have data? → Yes. Future months produce header-only CSVs.
  • Organizations: separate CSV per year? → Yes. 2024, 2025, 2026 each get their own file.
  • Hidden rows: read all or only visible? → Read all. Excel filters are display-only.
  • Targets filename? → ions-{YYYY}-{MM}-targets.csv where MM is derived from row 9 content.
  • marketing_activity column name collision? → Not relevant — REKAP excluded entirely.
  • 2023/2024/2025 merge_header_rows? → Only 2023 transactions. 2024/2025 LAPORAN already have flat single-row headers.
