# Analytics
The analytics layer for Atlas. Covers the DuckDB analytical database, the dbt transform pipeline, and all mart definitions.
For the overall pipeline context see architecture.md. For the operational database schema see model.md.
## Overview
The analytics layer is a Modern Data Stack pipeline built on DuckDB and dbt-core. It implements the ELT pattern — raw source data is loaded first, then transformed through layered SQL models into analysis-ready marts that power report assembly (report.json) and presentation generation.
## DuckDB

### Role
DuckDB is the analytical engine. It runs in-process — no server required. All four transform layers (raw, staging, intermediate, mart) live in a single DuckDB file (atlas.db).
### Why DuckDB
| Capability | Usage in Atlas |
|---|---|
| CSV loading | `read_csv_auto()` loads canonical CSV snapshots extracted by the pipeline |
| In-process OLAP | Runs embedded in pipeline/format/presentation scripts without a database server |
| Fast aggregations | Revenue totals, year-over-year comparisons, cohort counts — vectorized execution |
| Full SQL | Window functions, CTEs, `PIVOT`, `UNPIVOT`, `QUALIFY` — used extensively in mart models |
| dbt compatible | dbt-duckdb adapter is the official DuckDB dbt backend |
| Local file | `atlas.db` — no network dependency for local development |
### Connection
```ts
import { DuckDBInstance } from '@duckdb/node-api'

const instance = await DuckDBInstance.create('atlas.db')
const conn = await instance.connect()
```

### Raw layer setup
Each xlsx source file is first extracted to canonical CSV snapshots (via SheetJS in @packages/pipeline), then loaded into DuckDB via read_csv_auto. Tables are created or replaced on each load run — the raw layer always reflects the latest source data.
```sql
-- Transactions ledger (from canonical CSV)
CREATE OR REPLACE TABLE raw_transactions AS
SELECT * FROM read_csv_auto(
  '@source/clean/ions-2026-transactions-*.csv'
);

-- Student database (from canonical CSV)
CREATE OR REPLACE TABLE raw_students AS
SELECT * FROM read_csv_auto(
  '@source/clean/ions-2026-students-*.csv'
);
```

## dbt Project
### Setup
The dbt project lives at @python/analytics/. It uses the dbt-duckdb adapter and is managed via uv.
```sh
# Install
cd @python/analytics
uv sync

# Run all models
uv run dbt run

# Run specific layer
uv run dbt run --select staging
uv run dbt run --select intermediate
uv run dbt run --select marts

# Test
uv run dbt test

# Generate and serve docs
uv run dbt docs generate
uv run dbt docs serve
```

### profiles.yml
```yaml
atlas:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: ../../atlas.db
      threads: 4
```

### dbt_project.yml
```yaml
name: atlas
version: '1.0.0'
profile: atlas
model-paths: ['models']
test-paths: ['tests']

models:
  atlas:
    staging:
      +materialized: view
    intermediate:
      +materialized: view
    marts:
      +materialized: table
```

Staging and intermediate models are materialized as views (no storage cost, always fresh). Mart models are materialized as tables (pre-aggregated, fast reads for the Format and presentation layers).
## Transform Layers

### Layer 1 — Raw (raw_*)
Loaded from canonical CSV snapshots in @source/clean/. Column names are already in English (normalized during the extract step). No type casting, no filtering at this stage. Validation-stage QA exports write monthly flagged rows to @source/invalid/ with reason metadata while preserving raw tables unchanged.
| Table | Source file | Source sheet |
|---|---|---|
| raw_transactions | LAPORAN EVALUASI IONs [year] TM-WLC.xlsx | Monthly ledger sheets (JANUARI, FEBRUARI, …) |
| raw_students | DATABASE SISWA IONS [year] TM-WLC.xlsx | Monthly enrollment sheets (JANUARI26, FEBRUARI26, …) |
| raw_organizations | DATABASE SISWA IONS [year] TM-WLC.xlsx | REFERENSI sheet |
| raw_targets | LAPORAN EVALUASI IONs [year] TM-WLC.xlsx | EVALUASI DATA SISWA sheet |
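The `*` in the snapshot paths above is a filesystem glob: `read_csv_auto` reads every matching monthly file into a single table. A minimal sketch of the matching rule, using Python's stdlib `fnmatch` (the filenames here are hypothetical, for illustration only):

```python
from fnmatch import fnmatch

# Hypothetical monthly snapshot filenames produced by the extract step
snapshots = [
    "ions-2026-transactions-januari.csv",
    "ions-2026-transactions-februari.csv",
    "ions-2026-students-januari.csv",
]

# The transactions pattern picks up every monthly ledger snapshot,
# and skips the student snapshots that share the same directory
pattern = "ions-2026-transactions-*.csv"
matched = [name for name in snapshots if fnmatch(name, pattern)]
print(matched)  # the two transactions files, in list order
```

Because the raw tables are created with `CREATE OR REPLACE`, re-running the load after a new monthly snapshot lands simply widens the glob's match set.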
### Layer 2 — Staging (stg_*)
Light cleaning only. Column names are already in English (canonical CSV headers). Types cast. Rows deduplicated. No joins, no business logic.
Naming rule: one staging model per raw table. raw_transactions → stg_transactions.
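The staging contract (cast, deduplicate, nothing else) can be sketched in plain Python. Here `try_cast_decimal` mirrors DuckDB's `TRY_CAST`, which yields NULL instead of raising on a bad value, and the seen-set mirrors the keep-first `ROW_NUMBER() = 1` dedup; the rows are illustrative, not real data:

```python
from decimal import Decimal, InvalidOperation

def try_cast_decimal(value):
    # Mirror TRY_CAST semantics: None (NULL) instead of an error on bad input
    try:
        return Decimal(value)
    except (InvalidOperation, TypeError):
        return None

rows = [
    {"receipt": "R-001", "amount": "1500000"},
    {"receipt": "R-001", "amount": "1500000"},  # duplicate receipt, dropped
    {"receipt": "R-002", "amount": "n/a"},      # uncastable amount -> None
]

seen, staged = set(), []
for row in rows:
    if row["receipt"] in seen:
        continue  # keep the first occurrence only
    seen.add(row["receipt"])
    staged.append({
        "receipt_number": row["receipt"],
        "amount": try_cast_decimal(row["amount"]),
    })
```

Note that business rules (e.g. dropping zero-amount rows) do not belong here; those come in the intermediate layer.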
#### stg_transactions
Raw CSV columns are already English canonical names. Staging applies deduplication and type casting.
```sql
SELECT
  receipt AS receipt_number,
  TRY_CAST(date AS DATE) AS date,
  unit AS unit_code,
  program AS product_name,
  class AS variant_name,
  intake AS period_label,
  student AS student_name,
  TRY_CAST(amount AS DECIMAL(18,2)) AS amount,
  channel AS payment_channel,
  method AS payment_method_name,
  type AS customer_type_name,
  status AS payment_status_name,
  category AS product_category,
  barsheet AS batch_ref
FROM raw_transactions
WHERE receipt IS NOT NULL
QUALIFY ROW_NUMBER() OVER (PARTITION BY receipt ORDER BY date) = 1
```

#### stg_students
```sql
SELECT
  student AS full_name,
  date AS enrollment_date,
  organization AS organization_name,
  unit AS unit_code,
  age_cluster,
  course_type AS course_type_name,
  program AS product_name,
  class AS variant_name,
  intake AS period_label,
  source AS channel_name,
  type AS customer_type_name,
  payment AS payment_channel,
  sex AS sex_code,
  id_raw
FROM raw_students
WHERE student IS NOT NULL
```

#### stg_targets
```sql
SELECT
  "TENANT" AS unit_code,
  TRY_CAST("TARGET REVENUE" AS DECIMAL(18,2)) AS revenue_target,
  TRY_CAST("TARGET SISWA STANDAR" AS DECIMAL) AS student_target,
  TRY_CAST("REALISASI REVENUE" AS DECIMAL(18,2)) AS revenue_actuals,
  TRY_CAST("REALISASI SISWA RIIL" AS INTEGER) AS student_actuals
FROM raw_targets
WHERE "TENANT" IS NOT NULL
```

### Layer 3 — Intermediate (int_*)
Business logic applied. Joins across staging tables. FK resolution. Customer type classification. No aggregation.
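The classification logic in this layer is plain label lookup. A stdlib sketch of the period-label split and customer-type mapping that int_orders expresses in SQL (period labels follow the Indonesian month-name convention used in the source sheets):

```python
# Indonesian month names -> month numbers, as in the int_orders CASE expression
MONTHS = {
    "JANUARI": 1, "FEBRUARI": 2, "MARET": 3, "APRIL": 4,
    "MEI": 5, "JUNI": 6, "JULI": 7, "AGUSTUS": 8,
    "SEPTEMBER": 9, "OKTOBER": 10, "NOVEMBER": 11, "DESEMBER": 12,
}

# Customer-type codes -> normalized labels
CUSTOMER_TYPES = {"B": "New", "L": "Renewal", "A": "Alumni"}

def parse_period_label(label: str) -> tuple[int, int]:
    """'JANUARI 2026' -> (2026, 1), mirroring the SPLIT_PART + CASE logic."""
    month_name, year = label.split(" ", 1)
    return int(year), MONTHS[month_name]
```

An unrecognized month or type code raises a `KeyError` here; the SQL `CASE` without `ELSE` instead yields NULL, which is why downstream quality gates check for NULL `period_month` and `customer_type`.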
#### int_orders
Joins transactions with students to produce one row per order with all relevant dimensions.
```sql
WITH base AS (
  SELECT
    t.receipt_number,
    t.date,
    t.unit_code,
    t.product_name,
    t.variant_name,
    t.amount,
    t.payment_channel,
    t.payment_method_name,
    t.payment_status_name,
    t.customer_type_name,
    t.product_category,
    t.batch_ref,
    s.organization_name,
    s.age_cluster,
    s.course_type_name,
    s.channel_name,
    s.sex_code,
    s.full_name AS student_name,
    -- Extract period year and month from period_label
    CAST(SPLIT_PART(t.period_label, ' ', 2) AS INTEGER) AS period_year,
    CASE SPLIT_PART(t.period_label, ' ', 1)
      WHEN 'JANUARI' THEN 1 WHEN 'FEBRUARI' THEN 2
      WHEN 'MARET' THEN 3 WHEN 'APRIL' THEN 4
      WHEN 'MEI' THEN 5 WHEN 'JUNI' THEN 6
      WHEN 'JULI' THEN 7 WHEN 'AGUSTUS' THEN 8
      WHEN 'SEPTEMBER' THEN 9 WHEN 'OKTOBER' THEN 10
      WHEN 'NOVEMBER' THEN 11 WHEN 'DESEMBER' THEN 12
    END AS period_month,
    -- Normalize customer type (qualified with t.: stg_students also has customer_type_name)
    CASE t.customer_type_name
      WHEN 'B' THEN 'New'
      WHEN 'L' THEN 'Renewal'
      WHEN 'A' THEN 'Alumni'
    END AS customer_type,
    -- Normalize unit code (qualified with t.: stg_students also has unit_code)
    CASE t.unit_code
      WHEN 'WLC INGGRIS' THEN 'WLC_ENGLISH'
      WHEN 'WLC NON INGGRIS' THEN 'WLC_NON_ENGLISH'
      ELSE t.unit_code
    END AS unit_code_normalized
  FROM stg_transactions t
  LEFT JOIN stg_students s
    ON t.student_name = s.full_name
    AND t.period_label = s.period_label
)
-- Quality gate: is_valid + invalid_reason computed downstream
SELECT * FROM base
WHERE receipt_number IS NOT NULL
```

#### int_enrollments
Filters int_orders to Enrollment-type orders only (excludes book sales, event fees, etc.).
```sql
SELECT *
FROM int_orders
WHERE product_category NOT IN ('BUKU REAL', 'BUKU WLC', 'BUKU MANDARIN', 'EVENT - WLC',
                               'EVENT - ICC', 'SIMULASI JLPT', 'SIMULASI HSK',
                               'KIRIM - REAL', 'KIRIM - TM', 'TOP UP POIN')
  AND amount > 0
```

### Layer 4 — Marts (mart_*)
Pre-aggregated, analysis-ready tables. One mart per report section. All mart models are materialized as tables for fast Format and presentation-layer reads.
Runtime boundary note:

- Dashboard runtime no longer reads DuckDB marts directly.
- Dashboard data is served by @services/api from Turso/LibSQL.
- DuckDB marts remain the source for report generation and presentation workflows.
#### mart_revenue
Powers the Revenue Comparison slide and dashboard revenue view.
```sql
SELECT
  unit_code_normalized AS unit_code,
  period_year,
  period_month,
  SUM(amount) AS revenue_actuals,
  COUNT(DISTINCT receipt_number) AS transaction_count,
  -- Period label for the presentation layer
  period_year || '-' || LPAD(period_month::VARCHAR, 2, '0') AS period
FROM int_orders
GROUP BY unit_code_normalized, period_year, period_month
```

Joined with targets at query time (in the Format layer):
```sql
SELECT
  r.*,
  t.revenue_target,
  t.revenue_target - r.revenue_actuals AS gap_to_target,
  ROUND(r.revenue_actuals / t.revenue_target * 100, 2) AS achievement_pct
FROM mart_revenue r
LEFT JOIN stg_targets t
  ON r.unit_code = t.unit_code
WHERE r.period_year = {{ var("year") }}
  AND r.period_month = {{ var("month") }}
  AND r.unit_code = '{{ var("unit") }}'
```

#### mart_program_progress
Powers the Key Comparison and Program Progress slides.
```sql
SELECT
  unit_code_normalized AS unit_code,
  product_name,
  product_category,
  course_type_name,
  period_year,
  period_month,
  customer_type,
  COUNT(DISTINCT receipt_number) AS order_count,
  COUNT(DISTINCT student_name) AS student_count,
  SUM(amount) AS revenue
FROM int_enrollments
GROUP BY
  unit_code_normalized,
  product_name,
  product_category,
  course_type_name,
  period_year,
  period_month,
  customer_type
```

#### mart_channel_marketing
Powers the Channel Marketing slide and dashboard channels view.
```sql
SELECT
  unit_code_normalized AS unit_code,
  channel_name,
  period_year,
  period_month,
  customer_type,
  COUNT(DISTINCT receipt_number) AS closings,
  COUNT(DISTINCT student_name) AS students,
  ROUND(
    COUNT(DISTINCT receipt_number) * 100.0 /
    SUM(COUNT(DISTINCT receipt_number)) OVER (
      PARTITION BY unit_code_normalized, period_year, period_month
    ), 2
  ) AS contribution_pct
FROM int_enrollments
WHERE channel_name IS NOT NULL
GROUP BY
  unit_code_normalized,
  channel_name,
  period_year,
  period_month,
  customer_type
```

#### mart_school_progress
Powers the School Progress slide — shows which organizations produced students each year.
```sql
SELECT
  unit_code_normalized AS unit_code,
  organization_name,
  period_year,
  COUNT(DISTINCT student_name) AS student_count
FROM int_enrollments
WHERE organization_name IS NOT NULL
GROUP BY
  unit_code_normalized,
  organization_name,
  period_year
```

Pivoted at query time in the Format layer to produce the year × organization matrix shown in the presentation.
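That reshape is mechanical: one row per organization, one column per year, zero-filled where an organization produced no students. A stdlib sketch of the pivot (rows are illustrative, not real data):

```python
from collections import defaultdict

# Illustrative mart_school_progress rows for one unit
rows = [
    {"organization_name": "SMA 1", "period_year": 2025, "student_count": 12},
    {"organization_name": "SMA 1", "period_year": 2026, "student_count": 18},
    {"organization_name": "SMA 2", "period_year": 2026, "student_count": 7},
]

# Collect counts keyed by (organization, year)
matrix = defaultdict(dict)
for r in rows:
    matrix[r["organization_name"]][r["period_year"]] = r["student_count"]

# One ordered column per year, 0 where the organization had no students
years = sorted({r["period_year"] for r in rows})
table = {org: [counts.get(y, 0) for y in years] for org, counts in matrix.items()}
```

In SQL the same shape can come from DuckDB's `PIVOT`, which is one of the features listed in the Why DuckDB table above.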
## dbt Tests
Each model has built-in dbt tests defined in schema.yml files.
```yaml
# models/staging/schema.yml
models:
  - name: stg_transactions
    columns:
      - name: receipt_number
        tests:
          - not_null
          - unique
      - name: amount
        tests:
          - not_null
      - name: unit_code
        tests:
          - accepted_values:
              values: ['REAL', 'WLC INGGRIS', 'WLC NON INGGRIS', 'TM', 'NP', 'ICC', 'ICA', 'DTC', 'IONs']
  - name: stg_targets
    columns:
      - name: unit_code
        tests:
          - not_null
          - unique
```

## Lineage
Full data lineage from source to mart:

xlsx workbooks → canonical CSV snapshots (@source/clean/) → raw_* → stg_* → int_* → mart_* → report.json → presentation