Atlas Plan
Plans/009 2026-02-21 Data Quality Pipeline

Progress

2026-02-21 21:43 - T-003

Overview: Updated the 2026 source config for the targets sheet's merged headers, referensi typing, and full REKAP removal.

Completed:

  • fix(pipeline): set merge_header_rows and data_start_row for 2026 targets sheet
  • refactor(pipeline): remove 2026 marketing activity config block
  • refactor(pipeline): add explicit type field for 2026 file configs

Files:

  • @source/config/ions-2026.yaml
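
For orientation, a hypothetical sketch of the shape (the sheet name and row numbers are invented; only type, merge_header_rows, and data_start_row come from the bullets above):

```yaml
# Illustrative only: not the actual contents of ions-2026.yaml.
files:
  targets:
    type: targets              # explicit type field added in this task
    sheet: TARGET              # hypothetical sheet name
    merge_header_rows: [4, 5]  # outer + sub header rows (example values)
    data_start_row: 6          # first data row below the merged header
```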

2026-02-21 21:43 - T-004

Overview: Cleaned 2024 and 2025 YAML configs by removing REKAP entries and normalizing typed organizations entries.

Completed:

  • refactor(pipeline): remove marketing activity config from 2024 and 2025 YAML files
  • refactor(pipeline): add explicit type metadata to transactions/students/organizations for 2024 and 2025

Files:

  • @source/config/ions-2025.yaml
  • @source/config/ions-2024.yaml

2026-02-21 21:43 - T-005

Overview: Fixed 2023 transaction header structure to use merged rows and removed all marketing REKAP config.

Completed:

  • fix(pipeline): configure 2023 transactions with merge_header_rows: [6, 7]
  • fix(pipeline): map 2023 merged transaction sub-columns (PROGRAM, INTAKE, NAMA SISWA, JUMLAH)
  • refactor(pipeline): remove 2023 marketing activity config section entirely

Files:

  • @source/config/ions-2023.yaml

2026-02-21 21:43 - T-006

Overview: Added canonical extract schema constants and extended SourceConfig typing for merged headers and referensi file type.

Completed:

  • feat(pipeline): add canonical CSV column definitions in extract schema module
  • type(pipeline): extend SourceFileConfig with type, merge_header_rows, and data_start_row
  • test(pipeline): verify TypeScript integrity with pnpm --filter @packages/pipeline test:type

Files:

  • @packages/pipeline/@source/extract/schema.ts
  • @packages/pipeline/@source/config.ts
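
A minimal sketch of the extended typing (the union members, optionality, column list, and example values are assumptions, not the actual module):

```typescript
// Sketch of the extended SourceFileConfig shape described above.
type SourceFileType = "transactions" | "students" | "targets" | "referensi";

interface SourceFileConfig {
  type: SourceFileType;
  sheet: string;
  // Present only for sheets with a two-row merged header.
  merge_header_rows?: [number, number];
  // First data row, for sheets where headers do not sit directly above data.
  data_start_row?: number;
}

// Canonical CSV columns, in the spirit of the schema-module constants
// (names here are hypothetical).
const TRANSACTION_COLUMNS = ["program", "intake", "student_name", "amount"] as const;

const targets2026: SourceFileConfig = {
  type: "targets",
  sheet: "TARGET",
  merge_header_rows: [4, 5],
  data_start_row: 6,
};
```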

2026-02-21 22:04 - T-007

Overview: Implemented SheetJS readers for regular/merged-header sheets and REFERENSI unpivot extraction with canonical row outputs.

Completed:

  • feat(pipeline): add readSheet with single-row and two-row merged header support
  • feat(pipeline): add students T/T/L (place/date of birth) split into birth_place and birth_date
  • feat(pipeline): add readReferensiSheet matrix unpivot for organizations
  • test(pipeline): add extract reader tests for single-row header, merged header, missing columns, birth split, and referensi unpivot

Decisions:

  • Keep mapping generic by inverting YAML source labels and using canonical header sets from schema.ts.
  • For merged headers with ambiguous sub-labels (e.g. duplicated labels), fall back to outer-label candidates by column position.
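
The flattening and fallback decisions above can be sketched roughly as follows (not the actual reader.ts implementation):

```typescript
// Rough sketch of two-row merged-header flattening with the duplicate-label
// fallback described above.
function flattenHeaders(outer: string[], sub: string[]): string[] {
  // Forward-fill merged outer labels across the columns they span.
  const filled: string[] = [];
  let last = "";
  for (const o of outer) {
    if (o && o.trim() !== "") last = o.trim();
    filled.push(last);
  }
  // Sub-labels that repeat are ambiguous on their own.
  const counts = new Map<string, number>();
  for (const s of sub) {
    const key = s.trim();
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return sub.map((s, i) => {
    const key = s.trim();
    if (key === "") return filled[i];
    // Ambiguous sub-label: qualify it with the outer label at this position.
    if ((counts.get(key) ?? 0) > 1) return `${filled[i]} ${key}`.trim();
    return key;
  });
}
```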

Learnings:

  • In REFERENSI, ORGANISASI can sit in column D while the age-cluster labels remain in column B; the parser must not assume they share a column.
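
A minimal sketch of the matrix unpivot under that constraint (the layout here is hypothetical; column positions are passed in rather than hard-coded):

```typescript
type Cell = string | number | null;

// Sketch of a REFERENSI-style matrix unpivot: organization names in the
// header row, row labels (e.g. age clusters) in a label column, counts in
// the cells.
function unpivotMatrix(
  rows: Cell[][],
  labelCol: number,
  firstOrgCol: number,
): { label: string; organization: string; count: number }[] {
  const header = rows[0] ?? [];
  const out: { label: string; organization: string; count: number }[] = [];
  for (const row of rows.slice(1)) {
    const label = String(row[labelCol] ?? "").trim();
    if (!label) continue;
    for (let c = firstOrgCol; c < header.length; c++) {
      const organization = String(header[c] ?? "").trim();
      const count = Number(row[c]);
      if (organization && Number.isFinite(count)) {
        out.push({ label, organization, count });
      }
    }
  }
  return out;
}
```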

Files:

  • @packages/pipeline/@source/extract/reader.ts
  • @packages/pipeline/@source/extract/referensi.ts
  • @packages/pipeline/@source/tests/extract.test.ts

2026-02-21 22:04 - T-008

Overview: Implemented CSV writer with frozen-snapshot overwrite guard and proper CSV escaping semantics.

Completed:

  • feat(pipeline): add writeCsv with default no-overwrite behavior
  • feat(pipeline): add RFC-compatible CSV escaping for commas, quotes, and newlines
  • test(pipeline): add writer tests for overwrite guard, quoting edge cases, and header-only files

Files:

  • @packages/pipeline/@source/extract/writer.ts
  • @packages/pipeline/@source/tests/writer.test.ts
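
The quoting rules amount to a minimal RFC 4180-style escaper, sketched here (not the actual writer.ts code):

```typescript
// Quote fields containing commas, quotes, or newlines; double any embedded
// quotes, per RFC 4180 semantics.
function escapeCsvField(value: string): string {
  if (/[",\n\r]/.test(value)) {
    return `"${value.replace(/"/g, '""')}"`;
  }
  return value;
}
```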

2026-02-21 22:04 - T-009

Overview: Replaced extract stub with full orchestrator, wired CLI extract command, and verified normal/overwrite execution.

Completed:

  • feat(pipeline): implement extractAll(config, { cleanDir, overwrite }) orchestration
  • feat(pipeline): implement filename resolution for monthly files, the targets period-label month, and organizations outputs
  • feat(pipeline): wire extract CLI path with --overwrite flag support
  • test(pipeline): extend CLI arg parsing tests for overwrite flag
  • chore(pipeline): run extraction CLI for 2026 in skip mode and overwrite mode to validate behavior

Learnings:

  • Root extraction command routes through pnpm --filter @packages/pipeline start -- extract, and boolean flags need explicit parsing in CLI token handling.
  • Current 2026 organizations extraction yields 25 rows after referensi parser correction.
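
The boolean-flag handling noted above amounts to something like this (a hypothetical token parser, not the actual CLI code):

```typescript
// A bare --overwrite token toggles the flag rather than consuming a value,
// so it must be recognized explicitly during token handling.
function parseExtractArgs(tokens: string[]): { overwrite: boolean; rest: string[] } {
  const rest: string[] = [];
  let overwrite = false;
  for (const t of tokens) {
    if (t === "--overwrite") overwrite = true;
    else rest.push(t);
  }
  return { overwrite, rest };
}
```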

Files:

  • @packages/pipeline/@source/extract/index.ts
  • @packages/pipeline/@source/index.ts
  • @packages/pipeline/@source/tests/config.test.ts

2026-02-21 22:15 - T-011

Overview: Refactored load step to read frozen CSV snapshots from @source/clean/ and removed marketing activity loading.

Completed:

  • refactor(pipeline): replace xlsx read_xlsx load path with read_csv_auto temp-table loading
  • refactor(pipeline): align CSV filename resolution with extract conventions for monthly, targets, and organizations files
  • refactor(pipeline): switch CLI load default source to csv and wire syncCsv
  • fix(pipeline): drop stale raw_marketing_activity and raw_referensi tables during sync bootstrap
  • chore(pipeline): verify pnpm run sync -- --entity IONS --year 2026 and validate raw table row counts/columns

Learnings:

  • COUNT(*) from DuckDB can surface as bigint/string depending on context; row-count utilities must normalize types before comparisons.
  • Header-only CSV handling requires explicit empty-sheet guards, otherwise replace mode can zero out destination tables.
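
The COUNT(*) normalization can be sketched as:

```typescript
// DuckDB COUNT(*) may surface as a bigint or a string depending on the
// client path; normalize to number before comparisons.
function normalizeCount(value: number | bigint | string): number {
  if (typeof value === "bigint") return Number(value);
  if (typeof value === "string") return Number.parseInt(value, 10);
  return value;
}
```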

Files:

  • @packages/pipeline/@source/load/csv.ts
  • @packages/pipeline/@source/index.ts
  • @packages/pipeline/@source/duck.ts

2026-02-21 22:17 - Amendment

Overview: Added explicit boundary evidence for TypeScript verification coverage across completed extract/load tasks.

Changes:

  • boundary: Recorded that pnpm --filter @packages/pipeline test:type was run after TypeScript changes spanning T-007 through T-011.
  • task: Compliance documentation updated to capture boundary proof for tasks that previously referenced tests without explicit test:type mention.

Rationale:

  • Sub-agent checkpoint flagged missing explicit per-task boundary evidence despite passing typechecks.
  • Capturing explicit command evidence in append-only history improves compliance traceability across sessions.

2026-02-21 22:19 - T-012

Overview: Implemented typed validation checks catalog for raw-layer quality rules with unit tests.

Completed:

  • feat(pipeline): add ValidationCheck type and VALIDATION_CHECKS definitions in validate checks module
  • feat(pipeline): include required-empty, schema, bloat, year, unknown-unit, and negative-amount checks
  • test(pipeline): add validate checks tests for severity mapping and core SQL clause coverage
  • test(pipeline): verify pnpm --filter @packages/pipeline test:type and pnpm --filter @packages/pipeline test

Files:

  • @packages/pipeline/@source/validate/checks.ts
  • @packages/pipeline/@source/tests/validate.test.ts
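
A sketch of a typed check catalog in the spirit of VALIDATION_CHECKS (the field names and example SQL are assumptions, not the real module):

```typescript
type Severity = "error" | "warning";

interface ValidationCheck {
  id: string;
  severity: Severity;
  table: string;
  // SQL predicate selecting offending rows (illustrative).
  whereClause: string;
}

const VALIDATION_CHECKS: ValidationCheck[] = [
  {
    id: "negative_amount",
    severity: "error",
    table: "raw_transactions",
    whereClause: "amount < 0",
  },
  {
    id: "unknown_unit",
    severity: "warning",
    table: "raw_transactions",
    whereClause: "unit NOT IN (SELECT unit FROM known_units)",
  },
];
```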

2026-02-21 22:21 - T-013

Overview: Implemented validation runner orchestration with grouped console output, JSON report writing, and CLI integration.

Completed:

  • feat(pipeline): add validateAll orchestrator to execute all checks against DuckDB and collect structured results
  • feat(pipeline): write JSON report to output/validation/{entity}-{year}-validation.json
  • feat(pipeline): wire validate command in CLI with --year + config loading and exit code behavior
  • test(pipeline): verify pnpm --filter @packages/pipeline test:type and pnpm --filter @packages/pipeline test
  • chore(pipeline): run pnpm run validate -- --entity IONS --year 2026 and confirm report generation

Files:

  • @packages/pipeline/@source/validate/index.ts
  • @packages/pipeline/@source/index.ts
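
The report path template and exit-code behavior can be sketched as follows (the severity-to-exit-code rule is an assumption for illustration):

```typescript
// Build the report path from the template noted above.
function reportPath(entity: string, year: number): string {
  return `output/validation/${entity.toLowerCase()}-${year}-validation.json`;
}

// Hypothetical rule: only failed error-severity checks produce a non-zero exit.
function exitCode(results: { severity: "error" | "warning"; failed: boolean }[]): number {
  return results.some((r) => r.failed && r.severity === "error") ? 1 : 0;
}
```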

2026-02-21 22:24 - T-014

Overview: Removed dead marketing staging source/model and validated dbt graph health.

Completed:

  • refactor(analytics): delete stg_marketing_activity model
  • refactor(analytics): remove raw_marketing_activity source declaration
  • test(analytics): run uv run dbt run successfully after removal
  • test(analytics): run uv run dbt test successfully after removal

Learnings:

  • Current 2026 dataset has zero non-null channel_name rows in int_enrollments, so mart_channel_marketing materializes with 0 rows but remains structurally healthy.

Files:

  • @python/analytics/models/staging/stg_marketing_activity.sql (deleted)
  • @python/analytics/models/staging/sources.yml

2026-02-21 22:26 - T-015

Overview: Added is_valid and invalid_reason flags in int_orders with ordered data-quality rule evaluation.

Completed:

  • feat(analytics): add ordered validity CASE logic for null/invalid period, invalid amount, and unknown unit
  • feat(analytics): add invalid_reason reason codes aligned to validity checks
  • test(analytics): run uv run dbt run --select int_orders and verify no null is_valid

Files:

  • @python/analytics/models/intermediate/int_orders.sql
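
As a rough sketch of the ordered evaluation (column names, reason codes beyond those listed, and the upstream relation are assumptions; the real int_orders.sql text may differ):

```sql
-- Illustrative only; rule order matters: the first matching reason wins.
with flagged as (
    select
        *,
        case
            when period_year is null or period_month is null then 'invalid_period'
            when amount is null or amount < 0 then 'invalid_amount'
            when unit_name is null then 'unknown_unit'
        end as invalid_reason
    from base_orders  -- hypothetical upstream relation
)
select *, invalid_reason is null as is_valid
from flagged
```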

2026-02-21 22:26 - T-016

Overview: Filtered intermediate enrollments to valid orders only.

Completed:

  • refactor(analytics): add AND is_valid = true to int_enrollments
  • test(analytics): run uv run dbt run and uv run dbt test after validity filter rollout

Files:

  • @python/analytics/models/intermediate/int_enrollments.sql

2026-02-21 22:26 - T-017

Overview: Added audit_flagged_orders view model for invalid order inspection.

Completed:

  • feat(analytics): create audit_flagged_orders view from int_orders invalid rows
  • docs(analytics): add intermediate schema metadata for new audit model
  • test(analytics): run uv run dbt run --select audit_flagged_orders

Files:

  • @python/analytics/models/intermediate/audit_flagged_orders.sql
  • @python/analytics/models/intermediate/schema.yml

2026-02-21 22:26 - T-018

Overview: Added schema tests to enforce validity flag contract.

Completed:

  • test(analytics): add not_null test for int_orders.is_valid
  • test(analytics): add conditional not_null test for invalid_reason when is_valid = false
  • test(analytics): run uv run dbt run && uv run dbt test with new tests passing

Files:

  • @python/analytics/models/intermediate/schema.yml
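
A hedged sketch of what such schema tests can look like in dbt (the repo's exact syntax, e.g. a where-config versus a custom conditional test, is assumed):

```yaml
# Illustrative dbt schema.yml fragment; not the repo's actual file.
models:
  - name: int_orders
    columns:
      - name: is_valid
        tests:
          - not_null
      - name: invalid_reason
        tests:
          - not_null:
              config:
                where: "is_valid = false"
```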

2026-02-21 22:26 - Amendment

Overview: Added explicit Phase 6 boundary verification evidence for full dbt run/test coverage across T-014 through T-018.

Changes:

  • boundary: Recorded that uv run dbt run && uv run dbt test was executed after Phase 6 model/test changes were completed.
  • task: Supplemented targeted task-level verification with end-of-phase full-suite validation evidence.

Rationale:

  • Compliance review flagged delayed per-task boundary evidence even though full validation passed by phase end.
  • Explicit phase-level run/test evidence preserves strict auditability without rewriting append-only task entries.

2026-02-21 22:28 - T-019

Overview: Completed all end-to-end verification steps except format-content validation; the task is blocked pending a boundary decision.

Completed:

  • test(pipeline): verified extract frozen-snapshot skip behavior
  • test(pipeline): verified CSV sync row counts and absence of raw_marketing_activity
  • test(pipeline): verified validate command output and JSON report generation
  • docs(*): updated root/pipeline/analytics AGENTS files for extract/validate/is_valid changes

Blockers:

  • pnpm run format -- --entity IONS --period 2026-02 produces an empty units payload because mart models are zero-row.
  • Resolving likely requires edits to stg_transactions.sql and stg_students.sql alias handling, which is restricted by an Ask-first boundary in Plan.md.

Files:

  • AGENTS.md
  • @packages/pipeline/AGENTS.md
  • @python/analytics/AGENTS.md

2026-02-22 11:24 - T-019a

Overview: Implemented extract-level amount/date normalization and identity column rename, then re-extracted all years.

Completed:

  • feat(pipeline): add formatAmount and formatDate in extract reader, apply to transaction amount/date and student datetime fields
  • refactor(pipeline): rename students canonical identity column to id_raw in schema and alias mapping
  • test(pipeline): add unit coverage for amount/date formatting edge cases in extract.test.ts
  • test(pipeline): run pnpm --filter @packages/pipeline test:type and pnpm --filter @packages/pipeline test
  • chore(data): run extract overwrite for 2023, 2024, 2025, 2026 and refresh @source/clean/ snapshots
  • verify(data): spot-check 2026 transactions (amount plain numeric, date ISO) and students header (id_raw)

Pending:

  • Commit updated CSV snapshots (awaiting explicit user commit request)

Files:

  • @packages/pipeline/@source/extract/reader.ts
  • @packages/pipeline/@source/extract/schema.ts
  • @packages/pipeline/@source/tests/extract.test.ts
  • @source/clean/*.csv
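
A hedged sketch of the two helpers named above. The input formats are assumptions (amounts like "1.234.567,50" with Indonesian grouping, dates as Excel serial numbers); the real reader.ts may handle more cases:

```typescript
// Strip thousands separators and turn the decimal comma into a point,
// yielding a plain numeric string.
function formatAmount(raw: string | number): string {
  if (typeof raw === "number") return String(raw);
  return raw.trim().replace(/\./g, "").replace(",", ".");
}

// Convert an Excel serial date to an ISO yyyy-mm-dd string.
// 25569 = days from Excel's epoch (1899-12-30) to 1970-01-01.
function formatDate(serial: number): string {
  const ms = Math.round((serial - 25569) * 86400 * 1000);
  return new Date(ms).toISOString().slice(0, 10);
}
```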

2026-02-22 11:24 - T-019b

Overview: Rewrote staging SQL to use direct canonical columns, moved period/customer parsing into staging, and split student identity fields.

Completed:

  • refactor(analytics): replace dynamic raw-column macros with direct canonical refs in stg_transactions and stg_students
  • feat(analytics): add period_year, period_month, and translated customer_type outputs in staging
  • feat(analytics): add identity passthrough/split columns (id_raw, id_ktp, id_kp, id_sim, id_pass, id_nis) in stg_students
  • refactor(analytics): simplify int_orders to consume staging period/customer fields and pass through identity columns
  • docs(analytics): add identity columns to models/intermediate/schema.yml
  • test(analytics): run uv run dbt run && uv run dbt test with marts populated and tests passing

Files:

  • @python/analytics/models/staging/stg_transactions.sql
  • @python/analytics/models/staging/stg_students.sql
  • @python/analytics/models/intermediate/int_orders.sql
  • @python/analytics/models/intermediate/schema.yml

2026-02-22 11:24 - T-019c

Overview: Completed end-to-end verification across extract/load/validate/dbt/format with mixed-year raw loading and updated analytics docs.

Completed:

  • test(pipeline): run pnpm run sync -- --entity IONS --year 2026 and verify raw-layer loads from updated CSVs
  • test(pipeline): run pnpm run validate -- --entity IONS --year 2026 and confirm warning-only validation report
  • test(analytics): run dbt full build/test cycles after sync updates; confirm non-zero int_orders, int_enrollments, and marts
  • test(format): run pnpm run format -- --entity IONS --period 2026-02 and verify populated monthly report output
  • test(data): load 2026 then append 2023 (--mode append) and verify non-zero historical 2023 rows in marts
  • docs(analytics): update @python/analytics/AGENTS.md with stg_students identity split guidance

Files:

  • @python/analytics/AGENTS.md
  • output/monthly/2026-02-report.json
  • output/validation/ions-2026-validation.json

2026-02-22 11:37 - T-010

Overview: Finalized frozen CSV extraction deliverable by committing all refreshed @source/clean/ snapshots in granular data commits and merging to main.

Completed:

  • data(pipeline): committed 2023 transaction and student CSV snapshots
  • data(pipeline): committed 2024 transaction and student CSV snapshots
  • data(pipeline): committed 2025 transaction and student CSV snapshots
  • data(pipeline): committed 2026 transaction and student CSV snapshots
  • chore(git): pushed branch feat/data-quality-remediation, created PR #19, and merged via rebase to main

Evidence:

Files:

  • @source/clean/*.csv

2026-02-22 11:37 - T-019a

Overview: Closed the remaining T-019a completion gate by committing refreshed CSV outputs and shipping extract/schema/test updates to main.

Completed:

  • feat(pipeline): committed extract normalization (formatAmount, formatDate) and id_raw mapping changes
  • test(pipeline): committed updated extract tests for amount/date normalization
  • data(pipeline): committed all 100 regenerated CSV snapshots produced by --overwrite
  • chore(git): delivered via PR #19 merged to main

Evidence:

Files:

  • @packages/pipeline/@source/extract/reader.ts
  • @packages/pipeline/@source/extract/schema.ts
  • @packages/pipeline/@source/tests/extract.test.ts
  • @source/clean/*.csv
