Atlas Plan
Plans002 2026 02 20 Sync and Source Config

Set Up @packages/sync + Source Config

Overview

Scaffold the @packages/sync package and establish the per-year YAML source config system. The sync layer reads config files from source/config/, resolves file paths and sheet names per year, and loads raw data into DuckDB using native read_xlsx. Also seeds LibSQL lookup tables from registry values.

Result: running bun run sync --source xlsx --entity IONS --year 2026 populates atlas.db with all raw tables (raw_transactions, raw_students, raw_organizations, raw_targets, raw_marketing_activity).

Goals

  • Create source/config/ions-YYYY.yaml files for 2023–2026 encoding all per-year file/sheet/column variations
  • Scaffold @packages/sync as a Bun workspace package with correct tsconfig, eslint, and build config
  • Implement DuckDB connection + raw table creation (duck.ts)
  • Implement config loader (config.ts) that reads YAML and resolves paths
  • Implement xlsx sync runner (xlsx.ts) that calls read_xlsx for each configured sheet
  • Implement seed.ts to seed LibSQL lookup tables from registry constants
  • Wire CLI (index.ts) with --source, --entity, --year, --month flags
  • Verify: bun run sync --source xlsx --entity IONS --year 2026 completes without error and raw tables exist in atlas.db

Non-Goals

  • Running dbt transforms (Plan 003)
  • Populating the LibSQL operational tables with order/person data (Plan 005)
  • Odoo adapter (future)
  • Syncing years other than 2023–2026 (config files only cover known years)
  • CI/CD pipeline

Phases

  • Phase 1: Source config files — source/config/ions-YYYY.yaml for all 4 years
  • Phase 2: Package scaffold — @packages/sync config files (package.json, tsconfig, eslint)
  • Phase 3: Core sync code — duck.ts, config.ts, xlsx.ts, seed.ts
  • Phase 4: CLI + integration — index.ts wired, root scripts updated, type-check + run verification

Success

  • source/config/ions-2026.yaml exists with correct file, sheet, column, and header_row entries
  • source/config/ions-2025.yaml, ions-2024.yaml, ions-2023.yaml exist with their year-specific variations
  • @packages/sync added to workspace packages in root package.json
  • bun run test:type --filter @packages/sync passes
  • bun run sync --source xlsx --entity IONS --year 2026 exits 0
  • atlas.db contains raw_transactions, raw_students, raw_organizations, raw_targets, raw_marketing_activity after sync

Requirements

  • DuckDB node bindings available (duckdb-async or @duckdb/node-api)
  • js-yaml or equivalent for YAML parsing in Bun
  • @libsql/client for LibSQL seeding (seed.ts)
  • @repo/typescript/lib tsconfig preset
  • @repo/lint/base ESLint config
  • source/ directory with the 8 xlsx files already present
  • @plan/registry.md as source of truth for seed values

Context

Why This Approach

  • Per-year YAML config files decouple column name variations from code — adding a new year only requires a new config file, no code changes
  • DuckDB read_xlsx handles extraction natively; no parsing code required
  • header_row is configurable per file+year because 2023 LAPORAN EVALUASI starts at row 6 vs row 7 in other years
  • Config-driven sheet name resolution handles the FEBRUARI 23 (2023, space) vs FEBRUARI26 (2026, no space) naming differences
  • LibSQL seeding in this layer (seed.ts) because lookup table values come from registry.md which is independent of dbt

Key Constraints

  • Raw layer preserves source column names (Indonesian) exactly — no renaming in sync layer; all renaming happens in dbt staging
  • read_xlsx in DuckDB requires header := true and a header_row offset for files where data doesn't start at row 1
  • 2023 DATABASE SISWA: header row is row 7; monthly sheet names include a space (JANUARI 23)
  • 2026 LAPORAN EVALUASI: adds NO. KWITANSI (receipt number) and NAMA KELAS columns — 17 cols vs 15 in 2024/2025
  • 2026 DATABASE SISWA: BULAN/INTAKE split into BULAN + INTAKE (separate columns); TANGGAL/HARI/JAM renamed to TANGGAL
  • COMBINED sheets in LAPORAN EVALUASI 2024–2026 contain #REF! errors — do not load these sheets
  • OLAH *, SUMMARY *, Sheet* tabs are internal computation sheets — do not load

Edge Cases

  • 2026 DATABASE SISWA: OLAH/summary tabs for months 3–12 still carry 25 suffix (copy error) — only load the data sheets (MARET26, etc.), not the OLAH tabs
  • Summary2026 sheet label says 2025 (copy error) — ignore, not loaded
  • If a monthly sheet is missing (e.g. future months), skip gracefully with a warning
  • --month flag should limit loading to a single month sheet when provided (faster for incremental updates)

Tradeoffs

  • YAML over TypeScript config: YAML is more readable for non-developers editing year configs; TypeScript config would give type safety but requires a build step to edit
  • Loading all months per year (default): complete historical data in one run; --month flag available for incremental
  • seed.ts is idempotent (upsert): safe to run multiple times without duplicating lookup data

Skills

  • plan — plan file format and conventions

Boundaries

  • Always: Update Progress.md after each task completion
  • Always: Preserve Indonesian source column names in raw tables — never rename in sync layer
  • Always: Use catalog: references in package.json, never pin versions directly
  • Always: Use workspace:* for @repo/lint and @repo/typescript
  • Ask first: Any deviation from YAML config file format
  • Ask first: Adding dependencies not discussed in planning
  • Never: Apply business logic or column renaming in the sync layer
  • Never: Load COMBINED, OLAH *, SUMMARY *, or Sheet* tabs
  • Never: Run dbt or write to LibSQL operational tables (beyond seed.ts lookups)

Questions

  • Config file format? → YAML in source/config/ions-YYYY.yaml
  • Which DuckDB node binding? → duckdb-async (already used in architecture examples)
  • YAML parser? → js-yaml (lightweight, Bun-compatible)
  • Should seed.ts be idempotent (upsert) or error on duplicate? → Upsert/replace
  • Should the CLI also accept --month for single-month incremental sync? → Yes

On this page