AINA data engine room - VDS local closure report - 2026-06-11

AINA Personalization Engine Closure Handoff

A founder-readable and agent-ready map of what was built, what exists on disk, how it matches the original mission, and what remains open.

Ali Mehdi Mukadam - co-authored with Codex - repo /srv/aina/aina-data-engine-room - branch ali/personalization-engine-mission-2026-06-09

The Single Idea

The repo now contains a local, self-contained AINA data engine that turns the original personalization-engine documents into a working source-backed system: public source snapshots, Hugging Face ingestion, O*NET-backed title/occupation data, role/workflow packets, deterministic planner/runtime paths, evaluator and feedback receipts, GDPval practice gates, local API/UI fixtures, and ICP title service-tier routing. It is not production-live and it is not externally deployed. It is a strong VDS-local internal beta engine with review holds. The remaining open work is to clear the last ambiguous ICP title backlog and then harden the runtime for real users.

- 47,284 title rows serviceable after adjudication.
- 7,201 title rows still ambiguous and not safely routed.
- 0 production, external-write, real-user-data, telemetry, or deployment unlocks.

01 Founder Summary

AINA's original idea was not "recommend a course." The shared documents described a personalized AI capability engine that understands a person's role, workflows, readiness, goals, constraints, and proof of work, then chooses the next best learning path and evaluates whether the person can actually perform better with AI.

| Area | Current status | Plain-English meaning |
| --- | --- | --- |
| Source foundation | Built and validated locally | The engine has a local data backbone instead of relying on loose CSVs or chat prompts. |
| Hugging Face data | Downloaded, processed, and mapped | Anthropic Economic Index and OpenAI GDPval are real bounded inputs on the VDS. |
| O*NET/public source backbone | O*NET 30.3 loaded; BLS wage/employment gap disclosed | Occupations/tasks are grounded in public taxonomy; BLS facts are not faked when cache/access is absent. |
| Role/workflow packets | 45,564 packet JSON files generated | The engine can generate role/workflow intelligence packets at scale. |
| Runtime loop | Local API, curriculum, sandbox, submit/evaluation fixtures pass | A synthetic learner loop works locally. |
| ICP title coverage | 47,284 title rows serviceable after adjudication | The engine can serve many title rows now, mostly through fallback-safe routing. |
| Production release | Blocked by design | No public runtime, external writes, real-user data, production telemetry, or deployment promotion has been unlocked. |

02 Match Against The Original Plan

The original source set was mirrored into this repo under docs/source_foundations/ainpe-files-shared/. The mission synthesis lives at docs/planning/aina-personalization-engine-mission-2026-06-09.md.

| Milestone | Original target | Current status | Evidence |
| --- | --- | --- | --- |
| M0 | Mission and source lock | Complete | Mission plan Markdown/HTML exists. |
| M1 | Canonical source warehouse | Mostly complete, BLS fact gap disclosed | artifacts/warehouse/, source authority audit. |
| M2 | Work Intelligence Graph | Complete for local beta surface | Packets and packet-quality gate. |
| M3 | CurriculumInputPacket and planner | Complete locally | Schemas, planner, contract tests. |
| M4 | Practice and evaluator loop | Complete locally | Evaluator, event replay, feedback receipts. |
| M5 | GDPval sandbox and HF proof | Complete locally with holds | Sandbox, GDPval receipts, HF runtime map. |
| M6 | Beta API and product surface | Complete as local fixture | API runtime and beta UI shell tests. |
| M7 | ICP title coverage and scale path | Partially complete and paused | Title coverage/adjudication receipts; 7,201 rows remain ambiguous. |

Pipeline: Source (O*NET, HF, GDPval) -> Packets (role/workflow graph) -> Runtime (plan and practice) -> Evidence (evaluate, replay, validate)

03 What The Engine Can Do Today

The engine can build a local warehouse, load O*NET, process selected Hugging Face data, create role/workflow packets, attach source references, generate local runtime decisions, produce GDPval-style sandbox payloads, evaluate synthetic submissions, replay learner events, classify title rows, and emit validation receipts plus Markdown/HTML reports.

| Not yet | Why |
| --- | --- |
| Serve real public learners | Real-user data, external writes, public runtime, and production telemetry are intentionally blocked. |
| Claim production quality for every title | 7,201 title rows remain ambiguous, and many fallback routes are useful but not deep title-specific proof. |
| Use GDPval raw reference/deliverable folders broadly | The engine uses selected GDPval task rows and source refs; bulk file downloads still require license/cache review. |
| Treat BLS wage/employment facts as complete | O*NET is loaded; BLS OEWS wage/employment facts are recorded as a cache/access gap with 0 rows. |

04 Current Quantitative State

| Metric | Current value |
| --- | --- |
| Full validation status | pass |
| ai_suitability rows | 74,225 |
| linkedin_jobs rows | 129,165 |
| Generated packet JSON files | 45,564 |
| Semantic review rows checked | 45,564 |
| Hugging Face downloaded bytes | 186,743,670 |
| GDPval task rows processed | 220 |
| O*NET occupation rows | 1,016 |
| BLS wage/employment fact rows | 0 |

| ICP title bucket | Count | Meaning |
| --- | --- | --- |
| Serviceable after adjudication | 47,284 | Exact or fallback-safe local service. |
| Excluded or not ICP | 19,740 | Deliberately held out of service. |
| Still ambiguous | 7,201 | Needs deterministic or GPT/Codex adjudication. |
| Remaining prompt candidates | 6,872 | Rows ready for the next model-review lane. |
| Structured model decisions applied | 4,786 | Accepted model decisions already merged. |
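
The first three ICP title buckets above add up to the 74,225 rows that the title run classified, so they can be read as a partition of the title table. The sketch below checks that arithmetic and reports each bucket's share. The dictionary keys and the function name are illustrative assumptions, not identifiers from the repo.

```python
# Bucket counts copied from the ICP title table; key names are hypothetical.
ICP_TITLE_BUCKETS = {
    "serviceable_after_adjudication": 47_284,
    "excluded_or_not_icp": 19_740,
    "still_ambiguous": 7_201,
}
TOTAL_TITLE_ROWS = 74_225  # rows classified by the title run


def bucket_shares(buckets: dict[str, int]) -> dict[str, float]:
    """Return each bucket's share of the total as a percentage (one decimal)."""
    total = sum(buckets.values())
    return {name: round(100 * count / total, 1) for name, count in buckets.items()}


# The three buckets should account for every classified row.
assert sum(ICP_TITLE_BUCKETS.values()) == TOTAL_TITLE_ROWS
print(bucket_shares(ICP_TITLE_BUCKETS))
```

On these numbers roughly 63.7% of rows are serviceable, 26.6% are excluded, and 9.7% are still ambiguous.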

05 What Happened In The Latest Title Run

| Phase | Result |
| --- | --- |
| Base deterministic coverage | Classified all 74,225 rows and made 45,564 serviceable before adjudication. |
| Deterministic residual sweep | Reduced the ambiguous queue from 8,514 to 7,491. |
| Normal GPT salvage | Merged batches 038, 040, 041, and 042, adding 304 consensus decisions and reducing the ambiguous queue to 7,201. |

Batch 039 is partial: reviewer a exists, reviewer b timed out, so it was not merged. Historical Claude/Haiku/Sonnet and Codex Spark artifacts remain on disk as evidence history, but the latest user direction was normal GPT/Codex only.
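
The merge rule described here (two reviewers per batch, consensus decisions only, no merge when a reviewer is missing) can be sketched as follows. This is a minimal illustration under assumptions: the `BatchReview` record, the title keys, and the decision labels are hypothetical and do not reflect the repo's actual schemas.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BatchReview:
    """Hypothetical record of one batch's reviewer outputs (None = missing)."""

    batch_id: str
    reviewer_a: dict[str, str] | None
    reviewer_b: dict[str, str] | None


def consensus_decisions(batch: BatchReview) -> dict[str, str]:
    """Return decisions both reviewers agree on; empty if either reviewer is missing."""
    if batch.reviewer_a is None or batch.reviewer_b is None:
        return {}  # partial batch: one-reviewer output is never merged
    return {
        title: decision
        for title, decision in batch.reviewer_a.items()
        if batch.reviewer_b.get(title) == decision
    }


# Illustrative data only: the reviewers agree on t1 and disagree on t2.
complete = BatchReview(
    "038",
    {"t1": "serve_with_fallback", "t2": "exclude"},
    {"t1": "serve_with_fallback", "t2": "serve_with_fallback"},
)
partial = BatchReview("039", {"t1": "serve_with_fallback"}, None)

print(consensus_decisions(complete))  # only t1 survives
print(consensus_decisions(partial))   # nothing is merged
```

Returning an empty result for a partial batch keeps the ambiguous rows in the queue, which mirrors how batch 039 was handled.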

06 Safety And Production Boundary

| Boundary | Current state |
| --- | --- |
| Public runtime | Blocked |
| External writes | Blocked |
| Real-user data | Blocked |
| Production telemetry | Blocked |
| Deployment promotion | Blocked |
| Local internal synthetic beta | Ready with review holds |

The production approval receipt is production_blocked_approval_required. That is good: the repo is not pretending to be safer or more production-ready than it is.
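
A check like the one below could guard the "0 unlocks" claim in an automated test. It is only a sketch: the dictionary keys and the "blocked" value are assumed stand-ins for the five boundaries in the table, and the real receipt schema is not shown in this report.

```python
# Hypothetical in-memory view of the five production boundaries.
BOUNDARIES = {
    "public_runtime": "blocked",
    "external_writes": "blocked",
    "real_user_data": "blocked",
    "production_telemetry": "blocked",
    "deployment_promotion": "blocked",
}


def unlocked_boundaries(state: dict[str, str]) -> list[str]:
    """Return the sorted names of boundaries that are not blocked."""
    return sorted(name for name, value in state.items() if value != "blocked")


unlocked = unlocked_boundaries(BOUNDARIES)
print(f"unlocks: {len(unlocked)}")  # the report's headline count is 0
```

Listing the unlocked names, rather than returning a boolean, makes a failing check say which boundary moved.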

07 Local Disk And Git Inventory

| Item | Value |
| --- | --- |
| Repo path | /srv/aina/aina-data-engine-room |
| Branch | ali/personalization-engine-mission-2026-06-09 |
| Latest local commit before report | a2e9b29 Checkpoint ICP title inventory after GPT salvage |
| Remote | No remote configured |
| Tracked files | 473 before this closure handoff; 477 after committing this handoff plus the inventory CSV/JSON |
| Untracked non-ignored files | 0 |

| Path | Approx size | Meaning |
| --- | --- | --- |
| /srv/aina/aina-data-engine-room | 13G | Entire repo folder including venv and generated artifacts. |
| .venv/ | 5.2G | Python environment. |
| artifacts/ | 7.2G | Generated data, packets, receipts, reports, reviewer outputs. |
| artifacts/packets/ | 892M | 45,564 generated packet JSON files. |
| artifacts/semantic_review/ | 210M | Full deterministic semantic review outputs. |
| artifacts/raw/huggingface/ | 179M | Selected Hugging Face files. |
| src/ | 5.1M | 59 Python modules. |
| tests/ | 1.7M | 45 pytest files. |

The full file inventory is written to docs/handoff/2026-06-11-local-file-inventory.csv, with a machine-readable summary at docs/handoff/2026-06-11-local-file-inventory-summary.json. It inventories every non-.git file under the repo, including .venv, caches, generated packets, generated artifacts, source files, docs, tests, scripts, reports, and the inventory files themselves.

| Classification | Count | Meaning |
| --- | --- | --- |
| generated_runtime_data | 45,607 | Generated packets, DuckDB/parquet/fixture/runtime artifacts. |
| environment_cache | 30,887 | .venv dependency/environment files. |
| generated_review_output | 417 | Model review outputs, logs, prepared outputs, repair outputs, merge receipts. |
| source_material | 184 | Imported/shared source foundation docs and curriculum references. |
| generated_evidence | 157 | Validation receipts and generated report companions. |
| tool_cache | 152 | Codegraph, pytest, ruff, Python bytecode caches. |
| hand_authored_code | 59 | Python source modules under src/. |
| hand_authored_test | 46 | Test/support files under tests/. |
| downloaded_source_cache | 42 | Hugging Face and public source cache files. |
| agent_deliverable | 18 | Planning, reports, runbooks, and handoffs authored/rendered by agents. |
| generated_review_input | 7 | Reviewer prompts and review input/merge artifacts. |
| hand_authored_config | 4 | Repo readme/config files. |
| hand_authored_script | 2 | Operational scripts. |
| lockfile | 1 | Dependency lockfile. |

The inventory CSV intentionally excludes .git internals because staging and committing mutate them. The summary JSON records .git separately as 2,298 files and 15,092,838 bytes at generation time.
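
To re-derive the classification counts from the inventory CSV, a tally along these lines may help. The column name `classification` and the sample rows are assumptions made for illustration; check the real header of docs/handoff/2026-06-11-local-file-inventory.csv before relying on it.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def count_by_classification(handle: TextIO, column: str = "classification") -> Counter:
    """Tally inventory rows by the given column (column name is an assumption)."""
    reader = csv.DictReader(handle)
    return Counter(row[column] for row in reader)


# Made-up rows standing in for the real inventory file.
sample = io.StringIO(
    "path,classification\n"
    "src/example_module.py,hand_authored_code\n"
    "tests/test_example.py,hand_authored_test\n"
    "artifacts/packets/example.json,generated_runtime_data\n"
    "artifacts/packets/example_2.json,generated_runtime_data\n"
)
counts = count_by_classification(sample)
print(counts.most_common())
```

For the real file, pass `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.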

08 Repository Map

| Area | Purpose |
| --- | --- |
| README.md, pyproject.toml, uv.lock | Repo overview, package metadata, dependencies, CLI entrypoint. |
| scripts/rebuild_all.sh | Main rebuild helper. |
| scripts/run_icp_spark_backlog.py | Bounded Codex/GPT title-adjudication wave runner. The model is controlled by --model. |
| src/aina_data_engine/ | Engine modules: contracts, ingestion, HF maps, title coverage, packets, runtime, quality, GDPval gates, telemetry, CLI. |
| tests/ | 45 pytest files covering source ingest, runtime, title coverage, GDPval gates, deployment boundaries. |
| docs/source_foundations/ | Original planning and curriculum documents. |
| docs/planning/ | Mission, milestones, and execution board. |
| docs/reports/ and docs/handoff/ | Founder reports and handoffs. |
| artifacts/validation/ | Machine-readable receipts for every validation gate. |
| artifacts/reports/ | Markdown/HTML report companions for validation gates. |

09 Hugging Face And Public Source Map

| Source | Current local use |
| --- | --- |
| Anthropic/EconomicIndex | Selected files downloaded and processed into Economic Index role/signal maps. |
| openai/gdpval | Train parquet downloaded and processed into 220 GDPval tasks and sandbox/replacement lanes. |
| O*NET 30.3 | 1,016 occupations and 18,796 tasks cached into warehouse parquet files. |
| BLS OEWS | Official-cache import path exists; current VDS has 0 wage/employment rows because source/cache access is absent. |
| LinkedIn/job-market rows | Used as market signals, not canonical truth. |

10 Validation State

cd /srv/aina/aina-data-engine-room
uv run pytest -q
uv run ruff check .
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate

| Check | Result |
| --- | --- |
| Pytest | 172 passed |
| Ruff | All checks passed |
| Engine validation | status: pass |

11 How To Resume

cd /srv/aina/aina-data-engine-room
git status --short --branch
jq '{status, tables, icp_title_adjudication_summary, production_deployment_approval_summary}' artifacts/validation/full_validation.json
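
For agents that would rather not depend on jq, the same four-key summary can be pulled with the Python standard library. This is a sketch: the key names come from the jq filter above, while the function names are illustrative and not part of the repo's CLI.

```python
import json
from pathlib import Path

# Top-level keys selected by the jq filter above.
SUMMARY_KEYS = (
    "status",
    "tables",
    "icp_title_adjudication_summary",
    "production_deployment_approval_summary",
)


def summarize_validation(receipt: dict) -> dict:
    """Keep only the summary keys; missing keys become None, as jq yields null."""
    return {key: receipt.get(key) for key in SUMMARY_KEYS}


def load_summary(path: Path) -> dict:
    """Read a validation receipt from disk and return its summary view."""
    return summarize_validation(json.loads(path.read_text(encoding="utf-8")))


if __name__ == "__main__":
    receipt_path = Path("artifacts/validation/full_validation.json")
    if receipt_path.exists():  # only meaningful when run from the repo root
        print(json.dumps(load_summary(receipt_path), indent=2))
```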

Continue clearing the title backlog with normal GPT/Codex only:

uv run python scripts/run_icp_spark_backlog.py \
  --root /srv/aina/aina-data-engine-room \
  --start-index 043 \
  --wave-size 3 \
  --batch-size 100 \
  --max-waves 1 \
  --max-parallel-reviewers 3 \
  --review-timeout 1200 \
  --repair-timeout 600 \
  --usage-limit-retries 1 \
  --usage-limit-sleep 120 \
  --model gpt-5.4-mini
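
As a rough planning aid, the arithmetic below estimates how much work remains at these settings. It assumes, without confirmation from the script itself, that --batch-size is rows per batch, --wave-size is batches per wave, and batch indices are zero-padded to three digits. The function name is hypothetical.

```python
import math


def plan_waves(candidates: int, batch_size: int, wave_size: int, start_index: int) -> dict:
    """Estimate batch and wave counts under the assumed flag semantics."""
    batches = math.ceil(candidates / batch_size)
    waves = math.ceil(batches / wave_size)
    first_wave = [f"{start_index + i:03d}" for i in range(min(wave_size, batches))]
    return {"batches": batches, "waves": waves, "first_wave": first_wave}


# 6,872 remaining prompt candidates, 100 rows per batch, 3 batches per wave.
plan = plan_waves(candidates=6_872, batch_size=100, wave_size=3, start_index=43)
print(plan)
```

Under those assumptions the backlog is about 69 batches, or 23 waves, and the first wave would cover batches 043 to 045. The estimate ignores rows cleared by deterministic filters and any rerun of partial batch 039.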

12 Next-Agent Prompt

Codex - Continue ICP title coverage - preserve local-only safety boundary

You are continuing the AINA Personalization Engine title-coverage milestone on the VDS at /srv/aina/aina-data-engine-room.

Read first:
- docs/handoff/2026-06-11-personalization-engine-closure-handoff.md
- docs/planning/aina-personalization-engine-mission-2026-06-09.md
- artifacts/validation/full_validation.json
- artifacts/validation/icp_title_adjudication_v1.json
- scripts/run_icp_spark_backlog.py

Current state:
- Branch: ali/personalization-engine-mission-2026-06-09
- Latest checkpoint before this handoff: a2e9b29
- Serviceable title rows: 47,284
- Excluded/not ICP rows: 19,740
- Remaining ambiguous rows: 7,201
- Remaining prompt candidates: 6,872
- Production/external/real-user/deployment unlocks: 0

Task:
Continue reducing the ambiguous ICP title backlog using deterministic filters where obvious and normal GPT/Codex adjudication where evidence is weak or ambiguous. Do not use Claude/Haiku/Sonnet/Spark for new review unless Ali explicitly changes direction. Keep all production boundaries closed. Do not push or merge; keep this self-contained on the VDS unless Ali changes the git policy.

Watch-out: do not let the next agent call the goal complete just because validation is green; M7 still has ambiguous title rows.

13 Recommended Next Steps

| Step | Reason |
| --- | --- |
| Clear the remaining title backlog in bounded waves. | The next milestone is reducing 7,201 ambiguous rows while preserving evidence and safety. |
| Add deterministic filters only where obvious and tested. | Do not turn fuzzy occupational judgment into silent rules. |
| Rerun or discard partial batch 039. | Do not merge one-reviewer output. |
| Keep GDPval and production approval receipts green. | Title coverage should not accidentally unlock public runtime. |
| Define production-quality thresholds. | Decide what separates serve_now from serve_with_fallback. |

14 Status Line

Where to start

This is a strong local internal beta data engine checkpoint, not a production launch. The open blocker is the remaining ambiguous ICP title backlog plus the broader production approval boundary.