AINA data engine room - VDS local closure report - 2026-06-11
AINA Personalization Engine Closure Handoff
A founder-readable and agent-ready map of what was built, what exists on disk, how it matches the original mission, and what remains open.
Ali Mehdi Mukadam - co-authored with Codex - repo /srv/aina/aina-data-engine-room - branch ali/personalization-engine-mission-2026-06-09
The Single Idea
The repo now contains a local, self-contained AINA data engine that turns the original personalization-engine documents into a working source-backed system: public source snapshots, Hugging Face ingestion, O*NET-backed title/occupation data, role/workflow packets, deterministic planner/runtime paths, evaluator and feedback receipts, GDPval practice gates, local API/UI fixtures, and ICP title service-tier routing. It is not production-live and it is not externally deployed. It is a strong VDS-local internal beta engine with review holds. The remaining open work centers on clearing the last ambiguous ICP title backlog and then hardening the runtime for real users.
47,284 title rows serviceable after adjudication.
7,201 title rows still ambiguous and not safely routed.
0 production, external write, real-user-data, telemetry, or deployment unlocks.
01
Founder Summary
AINA's original idea was not "recommend a course." The shared documents described a personalized AI capability engine that understands a person's role, workflows, readiness, goals, constraints, and proof of work, then chooses the next best learning path and evaluates whether the person can actually perform better with AI.
| Area | Current status | Plain-English meaning |
| --- | --- | --- |
| Source foundation | Built and validated locally | The engine has a local data backbone instead of relying on loose CSVs or chat prompts. |
| Hugging Face data | Downloaded, processed, and mapped | Anthropic Economic Index and OpenAI GDPval are real bounded inputs on the VDS. |
| O*NET/public source backbone | O*NET 30.3 loaded; BLS wage/employment gap disclosed | Occupations/tasks are grounded in public taxonomy; BLS facts are not faked when cache/access is absent. |
| Role/workflow packets | 45,564 packet JSON files generated | The engine can generate role/workflow intelligence packets at scale. |
| Runtime loop | Local API, curriculum, sandbox, submit/evaluation fixtures pass | A synthetic learner loop works locally. |
| ICP title coverage | 47,284 title rows serviceable after adjudication | The engine can serve many title rows now, mostly through fallback-safe routing. |
| Production release | Blocked by design | No public runtime, external writes, real-user data, production telemetry, or deployment promotion has been unlocked. |
02
Match Against The Original Plan
The original source set was mirrored into this repo under docs/source_foundations/ainpe-files-shared/. The mission synthesis lives at docs/planning/aina-personalization-engine-mission-2026-06-09.md.
| Milestone | Original target | Current status | Evidence |
| --- | --- | --- | --- |
| M0 | Mission and source lock | Complete | Mission plan Markdown/HTML exists. |
| M1 | Canonical source warehouse | Mostly complete, BLS fact gap disclosed | artifacts/warehouse/, source authority audit. |
| M2 | Work Intelligence Graph | Complete for local beta surface | Packets and packet-quality gate. |
| M3 | CurriculumInputPacket and planner | Complete locally | Schemas, planner, contract tests. |
| M4 | Practice and evaluator loop | Complete locally | Evaluator, event replay, feedback receipts. |
| M5 | GDPval sandbox and HF proof | Complete locally with holds | Sandbox, GDPval receipts, HF runtime map. |
| M6 | Beta API and product surface | Complete as local fixture | API runtime and beta UI shell tests. |
| M7 | ICP title coverage and scale path | Partially complete and paused | Title coverage/adjudication receipts; 7,201 rows remain ambiguous. |
03
What The Engine Can Do Today
The engine can build a local warehouse, load O*NET, process selected Hugging Face data, create role/workflow packets, attach source references, generate local runtime decisions, produce GDPval-style sandbox payloads, evaluate synthetic submissions, replay learner events, classify title rows, and emit validation receipts plus Markdown/HTML reports.
| Not yet | Why |
| --- | --- |
| Serve real public learners | Real-user data, external writes, public runtime, and production telemetry are intentionally blocked. |
| Claim production quality for every title | 7,201 title rows remain ambiguous and many fallback routes are useful but not deep title-specific proof. |
| Use GDPval raw reference/deliverable folders broadly | The engine uses selected GDPval task rows and source refs; bulk file downloads still require license/cache review. |
| Treat BLS wage/employment facts as complete | O*NET is loaded; BLS OEWS wage/employment facts are recorded as a cache/access gap with 0 rows. |
04
Current Quantitative State
| Metric | Current value |
| --- | --- |
| Full validation status | pass |
| ai_suitability rows | 74,225 |
| linkedin_jobs rows | 129,165 |
| Generated packet JSON files | 45,564 |
| Semantic review rows checked | 45,564 |
| Hugging Face downloaded bytes | 186,743,670 |
| GDPval task rows processed | 220 |
| O*NET occupation rows | 1,016 |
| BLS wage/employment fact rows | 0 |

| ICP title bucket | Count | Meaning |
| --- | --- | --- |
| Serviceable after adjudication | 47,284 | Exact or fallback-safe local service. |
| Excluded or not ICP | 19,740 | Deliberately held out of service. |
| Still ambiguous | 7,201 | Needs deterministic or GPT/Codex adjudication. |
| Remaining prompt candidates | 6,872 | Rows ready for the next model-review lane. |
| Structured model decisions applied | 4,786 | Accepted model decisions already merged. |
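The first three buckets should partition the full set of 74,225 classified title rows. A minimal sketch of that reconciliation check follows, using only counts reported above. The dictionary keys and function names are illustrative and are not taken from the repo, and treating the last two table rows as sub-counts rather than buckets is an assumption.

```python
# Counts copied from this report; all names below are hypothetical.
TOTAL_TITLE_ROWS = 74_225

BUCKETS = {
    "serviceable_after_adjudication": 47_284,
    "excluded_or_not_icp": 19_740,
    "still_ambiguous": 7_201,
}


def reconcile(buckets: dict[str, int], total: int) -> int:
    """Return rows not accounted for by any bucket (0 means a clean partition)."""
    return total - sum(buckets.values())


def share(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


unaccounted = reconcile(BUCKETS, TOTAL_TITLE_ROWS)
ambiguous_share = share(BUCKETS["still_ambiguous"], TOTAL_TITLE_ROWS)
print(unaccounted, ambiguous_share)
```

With the reported counts the buckets reconcile exactly, and the ambiguous backlog is a little under a tenth of all title rows.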
05
What Happened In The Latest Title Run
| Phase | Result |
| --- | --- |
| Base deterministic coverage | Classified all 74,225 rows and made 45,564 serviceable before adjudication. |
| Deterministic residual sweep | Reduced the ambiguous queue from 8,514 to 7,491. |
| Normal GPT salvage | Merged batches 038, 040, 041, and 042, adding 304 consensus decisions and reducing the ambiguous queue to 7,201. |
Batch 039 is partial: output from reviewer a exists, but reviewer b timed out, so the batch was not merged. Historical Claude/Haiku/Sonnet and Codex Spark artifacts remain on disk as evidence history, but the latest user direction was to use normal GPT/Codex only.
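The merge rule implied here (merge a batch only when both reviewers returned output, and count only decisions they agree on) can be sketched as follows. This is an illustration under assumptions: the `ReviewBatch` structure, the field names, the decision labels, and the definition of consensus as identical decisions are all hypothetical, and the repo's actual merge logic may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewBatch:
    """Hypothetical container for one adjudication batch."""
    batch_id: str
    reviewer_a: Optional[dict[str, str]]  # title row id -> decision; None if missing
    reviewer_b: Optional[dict[str, str]]  # None models a reviewer that timed out


def consensus_decisions(batch: ReviewBatch) -> dict[str, str]:
    """Return decisions both reviewers agree on; empty if either reviewer is missing."""
    if batch.reviewer_a is None or batch.reviewer_b is None:
        return {}  # one-reviewer output is never merged
    return {
        row_id: decision
        for row_id, decision in batch.reviewer_a.items()
        if batch.reviewer_b.get(row_id) == decision
    }


# Synthetic example data, not taken from any real batch.
complete = ReviewBatch(
    "example-complete",
    reviewer_a={"t1": "serviceable", "t2": "excluded", "t3": "ambiguous"},
    reviewer_b={"t1": "serviceable", "t2": "serviceable", "t3": "ambiguous"},
)
partial = ReviewBatch("example-partial", reviewer_a={"t1": "serviceable"}, reviewer_b=None)

merged = consensus_decisions(complete)   # rows where a and b agree
skipped = consensus_decisions(partial)   # nothing merged for a partial batch
```

Rows where the reviewers disagree simply stay out of the merged set, so they remain in the ambiguous queue for a later pass.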
06
Safety And Production Boundary
| Boundary | Current state |
| --- | --- |
| Public runtime | Blocked |
| External writes | Blocked |
| Real-user data | Blocked |
| Production telemetry | Blocked |
| Deployment promotion | Blocked |
| Local internal synthetic beta | Ready with review holds |
The production approval receipt is production_blocked_approval_required. That is good: the repo is not pretending to be safer or more production-ready than it is.
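A guard of the following shape is one way to keep that boundary machine-checkable. It is a sketch only: the boundary keys and both helper functions are hypothetical names, and only the receipt value `production_blocked_approval_required` comes from this report.

```python
BLOCKED_RECEIPT = "production_blocked_approval_required"  # value quoted in this report

# Hypothetical keys mirroring the boundary table; True would mean "unlocked".
boundaries = {
    "public_runtime": False,
    "external_writes": False,
    "real_user_data": False,
    "production_telemetry": False,
    "deployment_promotion": False,
}


def unlocked(boundary_state: dict[str, bool]) -> list[str]:
    """Return the sorted names of any boundaries that have been unlocked."""
    return sorted(name for name, is_open in boundary_state.items() if is_open)


def assert_production_blocked(receipt: str, boundary_state: dict[str, bool]) -> None:
    """Raise if the receipt or any boundary suggests production has been opened."""
    open_boundaries = unlocked(boundary_state)
    if receipt != BLOCKED_RECEIPT or open_boundaries:
        raise RuntimeError(
            f"production boundary violated: receipt={receipt!r}, open={open_boundaries}"
        )


# Passes silently while the receipt is blocked and there are 0 unlocks.
assert_production_blocked(BLOCKED_RECEIPT, boundaries)
```

Running a check like this after every title-coverage wave would surface an accidental unlock before it reaches a commit.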
07
Local Disk And Git Inventory
| Item | Value |
| --- | --- |
| Repo path | /srv/aina/aina-data-engine-room |
| Branch | ali/personalization-engine-mission-2026-06-09 |
| Latest local commit before report | a2e9b29 Checkpoint ICP title inventory after GPT salvage |
| Remote | No remote configured |
| Tracked files | 473 before this closure handoff; 477 after committing this handoff plus the inventory CSV/JSON |
| Untracked non-ignored files | 0 |

| Path | Approx size | Meaning |
| --- | --- | --- |
| /srv/aina/aina-data-engine-room | 13G | Entire repo folder including venv and generated artifacts. |
The full file inventory is written to docs/handoff/2026-06-11-local-file-inventory.csv, with a machine-readable summary at docs/handoff/2026-06-11-local-file-inventory-summary.json. It inventories every non-.git file under the repo, including .venv, caches, generated packets, generated artifacts, source files, docs, tests, scripts, reports, and the inventory files themselves.
Model review outputs, logs, prepared outputs, repair outputs, merge receipts.
| Category | Files | Meaning |
| --- | --- | --- |
| source_material | 184 | Imported/shared source foundation docs and curriculum references. |
| generated_evidence | 157 | Validation receipts and generated report companions. |
| tool_cache | 152 | Codegraph, pytest, ruff, Python bytecode caches. |
| hand_authored_code | 59 | Python source modules under src/. |
| hand_authored_test | 46 | Test/support files under tests/. |
| downloaded_source_cache | 42 | Hugging Face and public source cache files. |
| agent_deliverable | 18 | Planning, reports, runbooks, and handoffs authored/rendered by agents. |
| generated_review_input | 7 | Reviewer prompts and review input/merge artifacts. |
| hand_authored_config | 4 | Repo readme/config files. |
| hand_authored_script | 2 | Operational scripts. |
| lockfile | 1 | Dependency lockfile. |
The inventory CSV intentionally excludes .git internals because staging and committing mutate them. The summary JSON records .git separately as 2,298 files and 15,092,838 bytes at generation time.
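An inventory with that shape can be produced by a directory walk that lists every file outside the top-level `.git` directory and tallies `.git` separately. The sketch below is not the script that generated the handoff files; the function name and the returned structure are assumptions made for illustration.

```python
import os
from pathlib import Path


def inventory(root: str) -> dict:
    """Walk root, listing files outside top-level .git and tallying .git separately."""
    files = []      # (relative POSIX path, size in bytes) for non-.git files
    git_files = 0   # number of files under the top-level .git directory
    git_bytes = 0   # total size of those files
    root_path = Path(root)
    for dirpath, _dirnames, filenames in os.walk(root_path):
        rel_parts = Path(dirpath).relative_to(root_path).parts
        in_git = bool(rel_parts) and rel_parts[0] == ".git"
        for name in filenames:
            full = Path(dirpath) / name
            size = os.lstat(full).st_size  # lstat does not follow symlinks
            if in_git:
                git_files += 1
                git_bytes += size
            else:
                files.append((full.relative_to(root_path).as_posix(), size))
    files.sort()  # stable ordering keeps repeated inventories diffable
    return {"files": files, "git_files": git_files, "git_bytes": git_bytes}
```

Keeping `.git` out of the file list means the inventory does not change merely because the inventory itself was staged or committed.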
You are continuing the AINA Personalization Engine title-coverage milestone on the VDS at /srv/aina/aina-data-engine-room.
Read first:
- docs/handoff/2026-06-11-personalization-engine-closure-handoff.md
- docs/planning/aina-personalization-engine-mission-2026-06-09.md
- artifacts/validation/full_validation.json
- artifacts/validation/icp_title_adjudication_v1.json
- scripts/run_icp_spark_backlog.py
Current state:
- Branch: ali/personalization-engine-mission-2026-06-09
- Latest checkpoint before this handoff: a2e9b29
- Serviceable title rows: 47,284
- Excluded/not ICP rows: 19,740
- Remaining ambiguous rows: 7,201
- Remaining prompt candidates: 6,872
- Production/external/real-user/deployment unlocks: 0
Task:
Continue reducing the ambiguous ICP title backlog using deterministic filters where obvious and normal GPT/Codex adjudication where evidence is weak or ambiguous. Do not use Claude/Haiku/Sonnet/Spark for new review unless Ali explicitly changes direction. Keep all production boundaries closed. Do not push or merge; keep this self-contained on the VDS unless Ali changes the git policy.
Watch-out: do not let the next agent call the goal complete just because validation is green; M7 still has ambiguous title rows.
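The watch-out above amounts to a two-condition completion check. A minimal sketch follows, with a hypothetical function name and arguments; it assumes that "complete" means zero ambiguous rows, which the repo may define differently.

```python
def m7_title_coverage_complete(validation_status: str, ambiguous_rows: int) -> bool:
    """Treat M7 as complete only when validation passes AND no ambiguous rows remain."""
    return validation_status == "pass" and ambiguous_rows == 0


# Values from this report: validation is green, yet 7,201 rows are still ambiguous.
current = m7_title_coverage_complete("pass", 7_201)
```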
13
Recommended Next Steps
| Step | Reason |
| --- | --- |
| Clear the remaining title backlog in bounded waves. | The next milestone is reducing 7,201 ambiguous rows while preserving evidence and safety. |
| Add deterministic filters only where obvious and tested. | Do not turn fuzzy occupational judgment into silent rules. |
| Rerun or discard partial batch 039. | Do not merge one-reviewer output. |
| Keep GDPval and production approval receipts green. | Title coverage should not accidentally unlock public runtime. |
| Define production-quality thresholds. | Decide what separates serve_now from serve_with_fallback. |
14
Status Line
Where to start
This is a strong local internal beta data engine checkpoint, not a production launch. The open blocker is the remaining ambiguous ICP title backlog plus the broader production approval boundary.