Atlas Plan
Plans009 2026 02 21 Data Quality Pipeline

Completed

  • T-001: Create @plan/audit/ and sheet index — done in planning session
  • T-002: Create per-file audit snapshots — done in planning session
  • T-003: Fix ions-2026.yaml
  • T-004: Fix ions-2025.yaml and ions-2024.yaml
  • T-005: Fix ions-2023.yaml
  • T-006: Define canonical CSV schema and extend SourceConfig type
  • T-007: Implement extract/reader.ts and extract/referensi.ts
  • T-008: Implement extract/writer.ts
  • T-009: Implement extract/index.ts and wire CLI
  • T-010: Extract all years to @source/clean/ and commit
  • T-011: Refactor load/csv.ts to read from @source/clean/
  • T-012: Implement validate/checks.ts
  • T-013: Implement validate/index.ts and wire CLI
  • T-014: Remove stg_marketing_activity and raw_marketing_activity
  • T-015: Add is_valid and invalid_reason to int_orders.sql
  • T-016: Filter int_enrollments to is_valid = true
  • T-017: Create audit_flagged_orders dbt view
  • T-018: Add dbt schema tests for is_valid columns
  • T-019a: Update extract step — amount formatting, date ISO, identity → id_raw
  • T-019b: Simplify staging SQL — direct English refs, identity split, period parsing
  • T-019c: Final end-to-end verification

In Progress

  • None

Cancelled

  • T-019: Superseded by T-019a/b/c (all completed)

Learnings

  • 2023 is the ONLY year with a 2-row merged header in transactions.
  • xlsx autoFilter is display-only — read ALL rows regardless of hidden state.
  • stg_marketing_activity was a dead model. Removed with zero downstream impact.
  • raw_students bloat was caused by OLAH/SUMMARY sheets being loaded. Fixed via CSV layer.
  • raw_targets had 702 garbage columns due to wrong header row. Fixed via merge_header_rows config.
  • Staging SQL column name mismatch: stg_transactions/stg_students look up Indonesian names (TENANT, NAMA SISWA, JUMLAH) but raw tables now have English canonical names (unit, student, amount). All mart models are 0-row as a result. T-019b fixes this.
  • amount in CSVs is "100,000" (thousands-separator comma). TRY_CAST fails silently. Fix: T-019a strips commas in extract → plain "100000" string.
  • 188 students appear with >1 distinct identity value across time. Causes: 93 formatting differences, 85 data entry errors, 9 genuine KTP/Kartu Pelajar transitions (student turned 17).
  • Identity columns decided: id_raw in CSV (raw string); id_ktp, id_kp, id_sim, id_pass, id_nis parsed in staging SQL. No merging/deduplication — separate rows. dim_students deferred.
  • birth_date in students is a text string ("18 Desember 2003") — NOT a machine date. Do not apply date formatting to it.
  • Period parsing (JANUARI 2026 → year/month integers) moved to staging SQL (not CSV).
  • Unit code normalization and customer type translation (BNew) stay in staging SQL.
  • Clean CSV boundary: amount (plain number), date (ISO), id_raw (raw string). That's all.
  • Direct canonical-column refs in staging remove the alias mismatch and restore populated marts/format outputs.
  • For mixed-year verification, sync must use append mode on subsequent years (--mode append) or later loads replace earlier raw data.
  • dbt accepted-values tests surfaced mixed-case Real variants, so unit normalization now runs in stg_transactions too.
  • Final closure tasks can be evidenced directly from merged PR artifacts (#19) and merge SHA traces in Progress logs.

Questions

  • None — all resolved.

Next

  • Plan complete. No remaining work.

On this page