Plans009 2026 02 21 Data Quality Pipeline
Completed
- T-001: Create @plan/audit/ and sheet index — done in planning session
- T-002: Create per-file audit snapshots — done in planning session
- T-003: Fix ions-2026.yaml
- T-004: Fix ions-2025.yaml and ions-2024.yaml
- T-005: Fix ions-2023.yaml
- T-006: Define canonical CSV schema and extend SourceConfig type
- T-007: Implement extract/reader.ts and extract/referensi.ts
- T-008: Implement extract/writer.ts
- T-009: Implement extract/index.ts and wire CLI
- T-010: Extract all years to @source/clean/ and commit
- T-011: Refactor load/csv.ts to read from @source/clean/
- T-012: Implement validate/checks.ts
- T-013: Implement validate/index.ts and wire CLI
- T-014: Remove stg_marketing_activity and raw_marketing_activity
- T-015: Add is_valid and invalid_reason to int_orders.sql
- T-016: Filter int_enrollments to is_valid = true
- T-017: Create audit_flagged_orders dbt view
- T-018: Add dbt schema tests for is_valid columns
- T-019a: Update extract step — amount formatting, date ISO, identity → id_raw
- T-019b: Simplify staging SQL — direct English refs, identity split, period parsing
- T-019c: Final end-to-end verification
In Progress
- None
Cancelled
- T-019: Superseded by T-019a/b/c (all completed)
Learnings
- 2023 is the ONLY year with a 2-row merged header in transactions.
- xlsx autoFilter is display-only — read ALL rows regardless of hidden state.
stg_marketing_activitywas a dead model. Removed with zero downstream impact.raw_studentsbloat was caused by OLAH/SUMMARY sheets being loaded. Fixed via CSV layer.raw_targetshad 702 garbage columns due to wrong header row. Fixed via merge_header_rows config.- Staging SQL column name mismatch:
stg_transactions/stg_studentslook up Indonesian names (TENANT,NAMA SISWA,JUMLAH) but raw tables now have English canonical names (unit,student,amount). All mart models are 0-row as a result. T-019b fixes this. amountin CSVs is"100,000"(thousands-separator comma).TRY_CASTfails silently. Fix: T-019a strips commas in extract → plain"100000"string.- 188 students appear with >1 distinct identity value across time. Causes: 93 formatting differences, 85 data entry errors, 9 genuine KTP/Kartu Pelajar transitions (student turned 17).
- Identity columns decided:
id_rawin CSV (raw string);id_ktp,id_kp,id_sim,id_pass,id_nisparsed in staging SQL. No merging/deduplication — separate rows.dim_studentsdeferred. birth_datein students is a text string ("18 Desember 2003") — NOT a machine date. Do not apply date formatting to it.- Period parsing (
JANUARI 2026→ year/month integers) moved to staging SQL (not CSV). - Unit code normalization and customer type translation (
B→New) stay in staging SQL. - Clean CSV boundary: amount (plain number), date (ISO), id_raw (raw string). That's all.
- Direct canonical-column refs in staging remove the alias mismatch and restore populated marts/format outputs.
- For mixed-year verification,
syncmust use append mode on subsequent years (--mode append) or later loads replace earlier raw data. - dbt accepted-values tests surfaced mixed-case
Realvariants, so unit normalization now runs instg_transactionstoo. - Final closure tasks can be evidenced directly from merged PR artifacts (
#19) and merge SHA traces in Progress logs.
Questions
- None — all resolved.
Next
- Plan complete. No remaining work.