AINA Data Engine Room - paused-goal technical handoff - 2026-06-12

Technical Deep Dive Handoff

What exists in the repo, what has been achieved, why title-only repair is now too small, and what should happen next.

Ali Mehdi Mukadam - co-authored with Codex - repo: /srv/aina/aina-data-engine-room

The Single Idea

The repo is no longer just a title-coverage project. It is a self-contained VDS-local data engine for turning job, occupation, workflow, evidence, curriculum, evaluator, and embedding assets into a production-ready Personalization Engine. The next correction is important: ambiguous titles should not be repaired in isolation when the source row already has job description, company, industry, seniority, responsibilities, and other context.

01 - TL;DR

Current Truth For The Engineer

Surface | Current state
Repo path | /srv/aina/aina-data-engine-room
Branch | ali/ain-506-p0-gate-2026-06-12
Audited implementation head | 77ab1c2 Add AIN-510 semantic reranking proof
Remote | No remote configured
Worktree | Dirty: three tracked files modified and two untracked WIP files
Last validation receipt | artifacts/validation/full_validation.json says status: pass, generated before the current WIP edits
Main scale | occupations=110184, ai_suitability=110184, linkedin_jobs=129165, wedge_occupations=47837
Frozen embedding corpus | 294315 chunks, 27559686 token count, 309 manifest shards
Gemini vector snapshot | 6067 vectors using gemini-embedding-2 at 768 dimensions
Runtime proof | AIN-510 local exact-cosine retrieval passes with caveats; production runtime authority is still blocked
Product object | AIFluencyCapabilityMap exists as a five-layer headless P0 fixture

The next milestone should promote JD/company/industry/responsibility context into source authority before spending more LLM or Gemini tokens on ambiguous titles.

02 - Git State

Current Branch Is Paused With WIP

## ali/ain-506-p0-gate-2026-06-12
 M .gitignore
 M src/aina_data_engine/cli.py
 M src/aina_data_engine/reports.py
?? src/aina_data_engine/runtime_source_authority_repair.py
?? tests/test_runtime_source_authority_repair.py

Landed code is current through 77ab1c2. The working tree contains a paused AIN-510 source-authority repair slice that has not been validated or committed in this checkpoint. Local branches include the current June 12 branch plus June 9 semantic/mission branches and main. This checkout has no configured Git remote.
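
The dirty-worktree claim can be checked mechanically before resuming. The following is a minimal sketch, assuming the text comes from `git status --short --branch`; the helper name `summarize_status` and the returned keys are hypothetical and not part of the repo.

```python
def summarize_status(short_branch_output: str) -> dict:
    """Summarize the text printed by `git status --short --branch`."""
    branch = None
    tracked_changed = 0
    untracked = 0
    for line in short_branch_output.splitlines():
        if not line.strip():
            continue
        if line.startswith("## "):
            # Header line; drop any "...remote" tracking suffix if present.
            branch = line[3:].split("...")[0]
        elif line.startswith("??"):
            untracked += 1
        else:
            # Any other two-character status code marks a tracked change.
            tracked_changed += 1
    return {
        "branch": branch,
        "tracked_changed": tracked_changed,
        "untracked": untracked,
        "dirty": (tracked_changed + untracked) > 0,
    }


# The status text recorded in this handoff, line for line.
recorded = "\n".join([
    "## ali/ain-506-p0-gate-2026-06-12",
    " M .gitignore",
    " M src/aina_data_engine/cli.py",
    " M src/aina_data_engine/reports.py",
    "?? src/aina_data_engine/runtime_source_authority_repair.py",
    "?? tests/test_runtime_source_authority_repair.py",
])
summary = summarize_status(recorded)
```

For the recorded status this yields three tracked changes and two untracked files, which matches the worktree description in the TL;DR.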

Branch | Head | Role
ali/ain-506-p0-gate-2026-06-12 | 77ab1c2 | Current production embedding and AIN-510 branch
ali/personalization-engine-mission-2026-06-09 | f31a202 | Earlier mission execution branch
ali/semantic-engine-room-2026-06-09 | 33b56ae | Earlier semantic pipeline branch
ali/semantic-risk-provenance-2026-06-09 | 88c8032 | Semantic risk/provenance branch
main | 88c8032 | Local main, behind current branch

03 - Title-Only Loop

The Case Manager Diagnosis

What We Should Stop Centering

Title-only repair asks a generic string like "case manager" to carry too much meaning. It can waste LLM tokens and still choose the wrong role family.

What We Should Center Next

Use title plus JD text, company, industry, responsibilities, tools, seniority, risk tags, and source refs to resolve the role context before embedding or runtime retrieval.

The linkedin_jobs table already carries company, location, experience_level, job_description, cleaned descriptions, extracted_skills, industry, sub_industry, title_normalized, soc_code, function, seniority, exposure scores, risk tags, and quality flags. The AIN-506 contract also says LinkedIn/job chunks should use title + company/industry if safe + cleaned responsibility/workflow text, not title-only rows.

Raw job row (title + JD + company) -> Role-context evidence (clean, sourced, typed) -> Embedding + retrieval (workflow, curriculum, evaluator) -> AI Fluency Capability Map

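
The first hop of that flow, from a raw job row to clean role-context evidence and then to embedding text, could look roughly like the sketch below. Everything here is illustrative: `RoleContextEvidence`, `build_role_context`, `chunk_text`, and the `responsibilities` and `source_ref` input keys are hypothetical names, the sample row is invented, and the sketch does not implement the "if safe" check on company or industry that the AIN-506 contract calls for.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RoleContextEvidence:
    """Hypothetical typed record for one job row's role context."""
    title: str
    company: Optional[str]
    industry: Optional[str]
    seniority: Optional[str]
    responsibilities: Tuple[str, ...]
    source_ref: str


def _clean(value) -> Optional[str]:
    """Collapse whitespace; treat empty or missing values as None."""
    if value is None:
        return None
    text = re.sub(r"\s+", " ", str(value)).strip()
    return text or None


def build_role_context(row: dict, max_responsibilities: int = 5) -> Optional[RoleContextEvidence]:
    title = _clean(row.get("title"))
    source_ref = _clean(row.get("source_ref"))
    if title is None or source_ref is None:
        return None  # no title or no provenance: not eligible as evidence
    seen, kept = set(), []
    for item in row.get("responsibilities") or []:
        text = _clean(item)
        if text and text.lower() not in seen:  # drop case-insensitive duplicates
            seen.add(text.lower())
            kept.append(text)
    return RoleContextEvidence(
        title=title,
        company=_clean(row.get("company")),
        industry=_clean(row.get("industry")),
        seniority=_clean(row.get("seniority")),
        responsibilities=tuple(kept[:max_responsibilities]),
        source_ref=source_ref,
    )


def chunk_text(ev: RoleContextEvidence) -> str:
    """Embedding text from curated fields only, never the raw JD dump."""
    parts = [f"Role: {ev.title}"]
    for label, value in (("Company", ev.company), ("Industry", ev.industry),
                         ("Seniority", ev.seniority)):
        if value:
            parts.append(f"{label}: {value}")
    if ev.responsibilities:
        parts.append("Responsibilities: " + "; ".join(ev.responsibilities))
    return "\n".join(parts)


# Invented example row, for illustration only.
row = {
    "title": "Case  Manager",
    "industry": "Hospitals and Health Care",
    "responsibilities": ["Coordinate patient discharge plans",
                         "coordinate patient discharge plans",
                         "Document care notes"],
    "source_ref": "linkedin_jobs:example-row",
}
evidence = build_role_context(row)
text = chunk_text(evidence)
```

Returning `None` when the source ref is missing keeps provenance mandatory, so a row without it never reaches the embedding corpus.
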
04 - Repo Map

What Is On Disk

Directory | Size | Purpose
src/aina_data_engine/ | 7.9 MB | Python package and CLI implementation
tests/ | 3.2 MB | Pytest coverage for contracts, gates, runtime, embeddings, AI fluency, reports
docs/ | 35 MB | Planning docs, handoffs, reports, runbooks, source foundations
artifacts/ | 9.9 GB | Generated validation receipts, raw data, embeddings, packets, reports, reviews, warehouse
data/sources/ | 59 MB | Title-expansion source bundles and provenance
evidence/canonical/ | 8.5 MB | Evidence Atlas Parquet files
.venv/ | 5.2 GB | Ignored Python runtime

Tracked file counts are 394 under artifacts, 309 under docs, 89 under src, 63 under tests, 17 under evidence, and 13 under data. Ignored files total 77,804, dominated by generated artifacts and the local virtualenv. The non-ignored untracked files are only the two WIP AIN-510 repair files.

05 - Python Architecture

What The Code Does

Module | Role
ingest.py | Builds DuckDB from raw and processed sources; writes canonical snapshots and master taxonomy
packets.py | Builds role intelligence packets, teaching packets, AI affordance packs, readiness snapshots, curriculum input packets
schemas.py | Pydantic object contracts for learner, runtime, packets, events, evaluator structures
embedding_contracts.py | Pydantic/Pandera contracts for embedding configs, chunks, records, output frames
production_embeddings.py | Corpus construction, chunking, manifests, Gemini vectors, exact-cosine quality pairs
runtime_retrieval_proof.py | AIN-510 local exact-cosine runtime proof over existing vectors
ai_fluency_capability.py | Headless AI Fluency Capability Map contracts and fixture output
evidence_enrichment.py | Loads Evidence Atlas Parquet and enriches packets
reports.py | Full validation receipt and cross-artifact checks

The CLI spans AIN-506 embeddings, AIN-510 retrieval/runtime authority, source authority, title coverage, runtime payloads, AI fluency, GDPval/Hugging Face practice lanes, and release gates.

06 - Data Stores

DuckDB, Parquet, HF, Evidence Atlas

Main DuckDB

Table | Rows | Purpose
linkedin_jobs | 129,165 | Job/posting substrate with JD, company, skills, title, SOC, function, seniority, risk, quality flags
occupations | 110,184 | Local occupation/title universe
ai_suitability | 110,184 | Exposure, augmentation, risk, priority, wedge, function, seniority
wedge_occupations | 47,837 | AINA wedge subset

Parquet Evidence

Path | Rows | Use
production semantic corpus | 294,315 | Frozen embedding chunks
Gemini vectors | 6,067 | Current vector snapshot
role_responsibility_evidence.parquet | 43,196 | Responsibilities by role/bucket/function/seniority
workflow_intelligence.parquet | 3,152 | Workflow names, confidence, risk domains, source refs
workflow_ai_affordance.parquet | 3,152 | Trust, risk, data sensitivity, capability IDs, assist/cannot-trust lists
workflow_tool_evidence.parquet | 413 | Tool evidence by workflow/bucket

Hugging Face data exists under artifacts/raw/huggingface/, including Anthropic Economic Index files and OpenAI GDPval Parquet. Derived HF role and GDPval maps are under artifacts/derived/huggingface/. Title expansion inputs include gold-IP role spine, WRTMJ, and MVP2 market mapping bundles under data/sources/title_expansion/.

07 - Done So Far

What Has Been Achieved

Foundation: Local DuckDB/Parquet warehouse, role packets, runtime decisions, evaluator fixtures, validation receipts.
Coverage: 50,053 serviceable rows after title adjudication; 110,184 occupation/title rows in current DuckDB.
Source Authority: Registry points to engine-room ledgers, salvage map, jobs-research, and Evidence Atlas assets.
Embeddings: AIN-506 contract, 294,315 frozen chunks, 6,067 Gemini vectors, top 500/top 1,000 vector coverage.
Retrieval: AIN-510 local exact-cosine proof passes with rerank caveats and deterministic fallback.
AI Fluency: Five-layer capability map, top-band coverage receipt, conservative six-question onboarding seed boundary.

The work has moved from a broad title map to a headless production-readiness spine. The missing layer is not another title sweep. It is context-first role evidence.

08 - Technology Roles

Why Each Piece Exists

Technology | Use
DuckDB | Local warehouse, joins, packet generation, vector snapshot querying
Parquet | Typed columnar storage for chunks, vectors, O*NET, Evidence Atlas
Pydantic | API/object contracts, learner/runtime/evaluator/AI Fluency objects
Pandera | Dataframe contracts for chunk and vector tables
JSON/JSONL | Receipts, ledgers, row-level proof, spot checks, model outputs
Gemini Embedding 2 | Semantic relevance layer for clean chunks; exact cosine remains source-of-truth retrieval
LLMs | Structured title adjudication, semantic QA, repair reasoning, not a substitute for source authority

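
"Exact cosine" here means brute-force scoring of the query against every stored vector, with no approximate index in between. The following is a minimal pure-Python sketch of that idea, not the code in runtime_retrieval_proof.py; the function names and the toy two-dimensional vectors are assumptions for illustration (the real snapshot uses 768 dimensions).

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: List[float], vectors: Dict[str, List[float]], k: int = 3) -> List[Tuple[str, float]]:
    """Score every stored vector and return the k best as (chunk_id, score)."""
    scored = [(cosine(query, vec), chunk_id) for chunk_id, vec in vectors.items()]
    # Sort by descending score; break ties by chunk id so results are deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(chunk_id, score) for score, chunk_id in scored[:k]]


# Toy vectors standing in for chunk embeddings.
toy_vectors = {"chunk-a": [1.0, 0.0], "chunk-b": [0.0, 1.0], "chunk-c": [1.0, 1.0]}
ranked = top_k([1.0, 0.0], toy_vectors, k=3)
```

A full scan is linear in the number of vectors, which is acceptable for a snapshot of a few thousand vectors and keeps the ranking reproducible.
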
09 - WIP

Paused Source-Authority Repair

File | Status
.gitignore | WIP allow-rule for AIN-510 source-authority repair JSONL
src/aina_data_engine/cli.py | WIP CLI wiring
src/aina_data_engine/reports.py | WIP validation integration
src/aina_data_engine/runtime_source_authority_repair.py | Untracked WIP deterministic repair receipt generator
tests/test_runtime_source_authority_repair.py | Untracked WIP focused test

The WIP intent is sound, but it is too narrow if it only fixes four unresolved cases. Use it as a small proof fixture for a general JD-aware role-context evidence layer.

10 - Next Plan

Milestones From Here

Milestone | Goal | Acceptance
M0 | Freeze and reconcile paused WIP | Worktree clean or WIP explicitly parked; validation matches current code
M1 | Build role_context_evidence_v1 | Title + JD + company + industry + responsibility context becomes first-class source authority
M2 | JD-aware disambiguation gate | Ambiguous titles resolve by context when context exists and stay caveated when it does not
M3 | Rebuild clean embedding corpus around role context | No raw JD dumps, no bad labels in embedding text, source refs on every chunk
M4 | Progressive Gemini embedding | Dry run, 50-row check, live 500, 5k, 25k, then batch candidate only if clean
M5 | AIN-510 retrieval promotion with context | Fewer or zero unresolved caveats for context-rich cases; fallback remains available
M6 | AI Fluency production readiness | Top 500 and top 1,000 capability maps improve using role context without public runtime unlocks
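
The M2 acceptance rule (resolve by context when context exists, stay caveated when it does not) can be expressed as a small deterministic gate. The sketch below uses keyword hits as a stand-in for whatever signal the real gate ends up using; `GateDecision`, `disambiguate`, the thresholds, and the keyword lists are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class GateDecision:
    title: str
    role_family: Optional[str]
    status: str  # "resolved" or "caveated"
    reason: str


def disambiguate(title: str, context_text: Optional[str],
                 family_keywords: Dict[str, List[str]],
                 min_hits: int = 2, min_margin: int = 1) -> GateDecision:
    """Pick a role family only when the context is present and decisive."""
    text = (context_text or "").lower()
    if not text.strip() or not family_keywords:
        return GateDecision(title, None, "caveated", "no context available")
    # Count how many of each family's keywords appear in the context text.
    scores = {family: sum(1 for kw in kws if kw.lower() in text)
              for family, kws in family_keywords.items()}
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    best_family, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    if best >= min_hits and best - runner_up >= min_margin:
        return GateDecision(title, best_family, "resolved",
                            f"{best} context hits vs {runner_up}")
    return GateDecision(title, None, "caveated", "context present but not decisive")


# Illustrative keyword lists; a real gate would draw these from source authority.
families = {
    "healthcare_case_management": ["patient", "discharge", "clinical", "care plan"],
    "legal_case_management": ["litigation", "attorney", "court", "filing"],
}
with_context = disambiguate(
    "case manager",
    "Coordinate patient discharge planning with clinical teams.",
    families,
)
without_context = disambiguate("case manager", None, families)
```

The point of the design is that a title with no context, or with context that does not separate the candidates, keeps its caveat instead of being forced into a family.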

Resume Commands

cd /srv/aina/aina-data-engine-room
git status --short --branch
git log --oneline --decorate -12
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-source-authority-registry
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-source-authority-inventory
uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-start-here

Do not run live Gemini again until the role-context source family has eligibility, repair, dry-run, and semantic spot-check proof.

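
One way to make that rule and the M4 ladder hard to skip is to encode them as a guard around the embedding runner. This is a sketch under stated assumptions: `REQUIRED_PROOFS`, `LADDER`, `live_gemini_allowed`, and `run_ladder` are hypothetical names, the `run_stage` callable is a stand-in that makes no real Gemini calls, and treating the 50-row check as the first rung after the dry-run proof is my reading of the milestone table.

```python
from typing import Callable, Dict, List, Tuple

# Proofs that must all exist before any live embedding stage runs.
REQUIRED_PROOFS = ("eligibility", "repair", "dry_run", "semantic_spot_check")

# Stage name and row count, smallest first.
LADDER: List[Tuple[str, int]] = [
    ("check_50", 50),
    ("live_500", 500),
    ("live_5k", 5_000),
    ("live_25k", 25_000),
]


def live_gemini_allowed(proofs: Dict[str, bool]) -> bool:
    """True only when every required proof is explicitly marked True."""
    return all(proofs.get(name) is True for name in REQUIRED_PROOFS)


def run_ladder(run_stage: Callable[[str, int], bool],
               proofs: Dict[str, bool],
               ladder: List[Tuple[str, int]] = LADDER) -> dict:
    """Run stages in order and stop at the first one that is not clean."""
    if not live_gemini_allowed(proofs):
        return {"completed": [], "stopped_at": "preconditions", "batch_candidate": False}
    completed: List[str] = []
    for name, rows in ladder:
        if not run_stage(name, rows):
            return {"completed": completed, "stopped_at": name, "batch_candidate": False}
        completed.append(name)
    return {"completed": completed, "stopped_at": None, "batch_candidate": True}


# Stub runner for illustration: pretends the 5k stage surfaces a problem.
def stub_stage(name: str, rows: int) -> bool:
    return rows < 5_000


all_proofs = {name: True for name in REQUIRED_PROOFS}
result = run_ladder(stub_stage, all_proofs)
blocked = run_ladder(stub_stage, {"eligibility": True})
```

A run only becomes a batch candidate when every rung reports clean, which mirrors the "only if clean" wording of M4.
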
Where To Start

Resume by building the JD-aware role-context evidence spine, not by repairing ambiguous titles in isolation.