AINA Data Engine Room - paused-goal technical handoff - 2026-06-12

Technical Deep Dive Handoff

What exists in the repo, what has been achieved, why title-only repair is now too small, and what should happen next.

Ali Mehdi Mukadam - co-authored with Codex - repo: /srv/aina/aina-data-engine-room

The Single Idea

The repo is no longer just a title-coverage project. It is a self-contained VDS-local data engine for turning job, occupation, workflow, evidence, curriculum, evaluator, and embedding assets into a production-ready Personalization Engine. The next correction is important: ambiguous titles should not be repaired in isolation when the source row already has job description, company, industry, seniority, responsibilities, and other context.

01 - TL;DR

Current Truth For The Engineer

Surface	Current state
Repo path	`/srv/aina/aina-data-engine-room`
Branch	`ali/ain-506-p0-gate-2026-06-12`
Audited implementation head	`77ab1c2 Add AIN-510 semantic reranking proof`
Remote	No remote configured
Worktree	Dirty: three tracked files modified and two untracked WIP files
Last validation receipt	`artifacts/validation/full_validation.json` says `status: pass`, generated before the current WIP edits
Main scale	`occupations=110184`, `ai_suitability=110184`, `linkedin_jobs=129165`, `wedge_occupations=47837`
Frozen embedding corpus	`294315` chunks, `27559686` token count, `309` manifest shards
Gemini vector snapshot	`6067` vectors using `gemini-embedding-2` at 768 dimensions
Runtime proof	AIN-510 local exact-cosine retrieval passes with caveats; production runtime authority is still blocked
Product object	`AIFluencyCapabilityMap` exists as a five-layer headless P0 fixture

The next milestone should promote JD/company/industry/responsibility context into source authority before spending more LLM or Gemini tokens on ambiguous titles.

02 - Git State

Current Branch Is Paused With WIP

## ali/ain-506-p0-gate-2026-06-12
 M .gitignore
 M src/aina_data_engine/cli.py
 M src/aina_data_engine/reports.py
?? src/aina_data_engine/runtime_source_authority_repair.py
?? tests/test_runtime_source_authority_repair.py

Landed code is current through 77ab1c2. The working tree contains a paused AIN-510 source-authority repair slice that has not been validated or committed in this checkpoint. Local branches include the current June 12 branch plus June 9 semantic/mission branches and main. This checkout has no configured Git remote.
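Whether the last validation receipt still covers the working tree can be checked mechanically. The following is a minimal sketch, assuming the receipt is a JSON object with a top-level `status` key (as the TL;DR table reports) and that file modification time is an acceptable proxy for "generated before the WIP edits". The helper name `receipt_is_current` is hypothetical and is not part of the repo.

```python
import json
from pathlib import Path


def receipt_is_current(receipt_path, source_paths):
    """Return True only if the receipt passed and is newer than every source file.

    Assumptions (not confirmed by the repo): the receipt JSON is an object with
    a top-level "status" key, and file mtime approximates generation time.
    """
    receipt = Path(receipt_path)
    if not receipt.exists():
        return False
    status = json.loads(receipt.read_text(encoding="utf-8")).get("status")
    if status != "pass":
        return False
    receipt_mtime = receipt.stat().st_mtime
    # Any existing source file modified after the receipt makes it stale.
    return all(
        Path(p).stat().st_mtime <= receipt_mtime
        for p in source_paths
        if Path(p).exists()
    )
```

Called with `artifacts/validation/full_validation.json` and the three modified tracked files, a `False` result would confirm that validation needs to be rerun before the WIP slice is committed.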

Branch	Head	Role
`ali/ain-506-p0-gate-2026-06-12`	`77ab1c2`	Current production embedding and AIN-510 branch
`ali/personalization-engine-mission-2026-06-09`	`f31a202`	Earlier mission execution branch
`ali/semantic-engine-room-2026-06-09`	`33b56ae`	Earlier semantic pipeline branch
`ali/semantic-risk-provenance-2026-06-09`	`88c8032`	Semantic risk/provenance branch
`main`	`88c8032`	Local main, behind current branch

03 - Title-Only Loop

The Case Manager Diagnosis

What We Should Stop Centering

Title-only repair asks a generic string like "case manager" to carry too much meaning. It can waste LLM tokens and still choose the wrong role family.

What We Should Center Next

Use title plus JD text, company, industry, responsibilities, tools, seniority, risk tags, and source refs to resolve the role context before embedding or runtime retrieval.

The `linkedin_jobs` table already carries `company`, `location`, `experience_level`, `job_description`, cleaned descriptions, `extracted_skills`, `industry`, `sub_industry`, `title_normalized`, `soc_code`, `function`, `seniority`, exposure scores, risk tags, and quality flags. The AIN-506 contract also says LinkedIn/job chunks should use title + company/industry if safe + cleaned responsibility/workflow text, not title-only rows.
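The "title + company/industry + cleaned responsibility text" shape can be illustrated with a small composer. This is a sketch under stated assumptions, not the repo's chunk builder: the function name, the separator, and the 600-character cap are hypothetical, and `job_description` stands in for whichever cleaned-description column the contract actually uses.

```python
import re

# Field names mirror the linkedin_jobs fields listed above; everything else
# here (function name, separator, length cap) is an illustrative assumption.
CONTEXT_FIELDS = ("company", "industry", "sub_industry", "seniority", "function")


def build_role_context_text(row, description_field="job_description",
                            max_description_chars=600):
    """Compose embedding text from title plus context instead of title alone."""
    title = str(row.get("title_normalized") or "").strip()
    parts = [f"title: {title}"] if title else []
    for field in CONTEXT_FIELDS:
        value = str(row.get(field) or "").strip()
        if value:
            parts.append(f"{field}: {value}")
    # Collapse whitespace so line breaks in the posting do not leak into chunks.
    description = re.sub(r"\s+", " ", str(row.get(description_field) or "")).strip()
    if description:
        # Truncate so the chunk carries responsibility signal, not a raw JD dump.
        parts.append(f"responsibilities: {description[:max_description_chars]}")
    return " | ".join(parts)
```

Two rows that share the title "case manager" but differ in industry and responsibilities produce different texts, which is the property title-only rows lack.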

04 - Repo Map

What Is On Disk

Directory	Size	Purpose
`src/aina_data_engine/`	7.9 MB	Python package and CLI implementation
`tests/`	3.2 MB	Pytest coverage for contracts, gates, runtime, embeddings, AI fluency, reports
`docs/`	35 MB	Planning docs, handoffs, reports, runbooks, source foundations
`artifacts/`	9.9 GB	Generated validation receipts, raw data, embeddings, packets, reports, reviews, warehouse
`data/sources/`	59 MB	Title-expansion source bundles and provenance
`evidence/canonical/`	8.5 MB	Evidence Atlas Parquet files
`.venv/`	5.2 GB	Ignored Python runtime

Tracked file counts are 394 under artifacts, 309 under docs, 89 under src, 63 under tests, 17 under evidence, and 13 under data. Ignored files total 77,804, dominated by generated artifacts and the local virtualenv. The non-ignored untracked files are only the two WIP AIN-510 repair files.

05 - Python Architecture

What The Code Does

Module	Role
`ingest.py`	Builds DuckDB from raw and processed sources; writes canonical snapshots and master taxonomy
`packets.py`	Builds role intelligence packets, teaching packets, AI affordance packs, readiness snapshots, curriculum input packets
`schemas.py`	Pydantic object contracts for learner, runtime, packets, events, evaluator structures
`embedding_contracts.py`	Pydantic/Pandera contracts for embedding configs, chunks, records, output frames
`production_embeddings.py`	Corpus construction, chunking, manifests, Gemini vectors, exact-cosine quality pairs
`runtime_retrieval_proof.py`	AIN-510 local exact-cosine runtime proof over existing vectors
`ai_fluency_capability.py`	Headless AI Fluency Capability Map contracts and fixture output
`evidence_enrichment.py`	Loads Evidence Atlas Parquet and enriches packets
`reports.py`	Full validation receipt and cross-artifact checks

The CLI spans AIN-506 embeddings, AIN-510 retrieval/runtime authority, source authority, title coverage, runtime payloads, AI fluency, GDPval/Hugging Face practice lanes, and release gates.

06 - Data Stores

DuckDB, Parquet, HF, Evidence Atlas

Main DuckDB

Table	Rows	Purpose
`linkedin_jobs`	129,165	Job/posting substrate with JD, company, skills, title, SOC, function, seniority, risk, quality flags
`occupations`	110,184	Local occupation/title universe
`ai_suitability`	110,184	Exposure, augmentation, risk, priority, wedge, function, seniority
`wedge_occupations`	47,837	AINA wedge subset
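The row counts above can double as a quick sanity check after reopening the warehouse. A minimal sketch, assuming a connection object whose `execute(sql)` result supports `fetchone()` (true of Python's `sqlite3`, and assumed here for the DuckDB Python client as well); `check_row_counts` is a hypothetical helper, not an existing CLI command.

```python
# Expected counts copied from the Main DuckDB table above.
EXPECTED_ROWS = {
    "linkedin_jobs": 129_165,
    "occupations": 110_184,
    "ai_suitability": 110_184,
    "wedge_occupations": 47_837,
}


def check_row_counts(conn, expected=EXPECTED_ROWS):
    """Return {table: (expected, actual)} for every table whose count differs."""
    mismatches = {}
    for table, want in expected.items():
        # Table names come from the fixed mapping, never from user input.
        got = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        if got != want:
            mismatches[table] = (want, got)
    return mismatches
```

An empty result means every listed table matches. A missing table raises the driver's own error rather than being reported as a mismatch.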

Parquet Evidence

Path or asset	Rows	Use
Production semantic corpus	294,315	Frozen embedding chunks
Gemini vectors	6,067	Current vector snapshot
`role_responsibility_evidence.parquet`	43,196	Responsibilities by role/bucket/function/seniority
`workflow_intelligence.parquet`	3,152	Workflow names, confidence, risk domains, source refs
`workflow_ai_affordance.parquet`	3,152	Trust, risk, data sensitivity, capability IDs, assist/cannot-trust lists
`workflow_tool_evidence.parquet`	413	Tool evidence by workflow/bucket

Hugging Face data exists under `artifacts/raw/huggingface/`, including Anthropic Economic Index files and OpenAI GDPval Parquet. Derived HF role and GDPval maps are under `artifacts/derived/huggingface/`. Title expansion inputs include gold-IP role spine, WRTMJ, and MVP2 market mapping bundles under `data/sources/title_expansion/`.

07 - Done So Far

What Has Been Achieved

Foundation: Local DuckDB/Parquet warehouse, role packets, runtime decisions, evaluator fixtures, validation receipts.

Coverage: 50,053 serviceable rows after title adjudication; 110,184 occupation/title rows in current DuckDB.

Source Authority: Registry points to engine-room ledgers, salvage map, jobs-research, and Evidence Atlas assets.

Embeddings: AIN-506 contract, 294,315 frozen chunks, 6,067 Gemini vectors, top 500/top 1,000 vector coverage.

Retrieval: AIN-510 local exact-cosine proof passes with rerank caveats and deterministic fallback.

AI Fluency: Five-layer capability map, top-band coverage receipt, conservative six-question onboarding seed boundary.

The work has moved from a broad title map to a headless production-readiness spine. The missing layer is not another title sweep. It is context-first role evidence.

08 - Technology Roles

Why Each Piece Exists

Technology	Use
DuckDB	Local warehouse, joins, packet generation, vector snapshot querying
Parquet	Typed columnar storage for chunks, vectors, O*NET, Evidence Atlas
Pydantic	API/object contracts, learner/runtime/evaluator/AI Fluency objects
Pandera	Dataframe contracts for chunk and vector tables
JSON/JSONL	Receipts, ledgers, row-level proof, spot checks, model outputs
Gemini Embedding 2	Semantic relevance layer for clean chunks; exact cosine remains source-of-truth retrieval
LLMs	Structured title adjudication, semantic QA, and repair reasoning; not a substitute for source authority
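"Exact cosine" in the table above means scoring the query against every stored vector instead of using an approximate index, which is practical at the current snapshot size of 6,067 vectors. The sketch below shows the technique in plain Python; it is an illustration only, not the code in `runtime_retrieval_proof.py`, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k=5):
    """Score every stored vector (no approximate index) and return the k best.

    `vectors` maps a chunk id to its embedding. Ties break on chunk id so the
    ranking is deterministic across runs.
    """
    scored = [(cosine_similarity(query, vec), chunk_id)
              for chunk_id, vec in vectors.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(chunk_id, score) for score, chunk_id in scored[:k]]
```

Breaking ties on chunk id keeps the ranking reproducible, which matters when the result is written into a receipt and compared between runs.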

09 - WIP

Paused Source-Authority Repair

File	Status
`.gitignore`	WIP allow-rule for AIN-510 source-authority repair JSONL
`src/aina_data_engine/cli.py`	WIP CLI wiring
`src/aina_data_engine/reports.py`	WIP validation integration
`src/aina_data_engine/runtime_source_authority_repair.py`	Untracked WIP deterministic repair receipt generator
`tests/test_runtime_source_authority_repair.py`	Untracked WIP focused test

The WIP intent is sound, but it is too narrow if it only fixes four unresolved cases. Use it as a small proof fixture for a general JD-aware role-context evidence layer.

10 - Next Plan

Milestones From Here

Milestone	Goal	Acceptance
M0	Freeze and reconcile paused WIP	Worktree clean or WIP explicitly parked; validation matches current code
M1	Build `role_context_evidence_v1`	Title + JD + company + industry + responsibility context becomes first-class source authority
M2	JD-aware disambiguation gate	Ambiguous titles resolve by context when context exists and stay caveated when it does not
M3	Rebuild clean embedding corpus around role context	No raw JD dumps, no bad labels in embedding text, source refs on every chunk
M4	Progressive Gemini embedding	Dry run, 50-row check, live 500, 5k, 25k, then batch candidate only if clean
M5	AIN-510 retrieval promotion with context	Fewer or zero unresolved caveats for context-rich cases; fallback remains available
M6	AI Fluency production readiness	Top 500 and top 1,000 capability maps improve using role context without public runtime unlocks
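The M2 acceptance rule (resolve by context when context exists, stay caveated when it does not) can be sketched as a routing check. The field list, the two-field threshold, the status strings, and the `GateDecision` type are all assumptions for illustration; the real gate is not yet specified. The sketch only decides which path a row takes and does not pick a role family.

```python
from dataclasses import dataclass

# Hypothetical field list and threshold; the real M2 gate is not yet specified.
GATE_FIELDS = ("job_description", "company", "industry", "seniority")
MIN_CONTEXT_FIELDS = 2


@dataclass(frozen=True)
class GateDecision:
    title: str
    status: str  # "resolve_with_context" or "stay_caveated"
    evidence_fields: tuple


def disambiguation_gate(row):
    """Route a row to context-based resolution only if enough context is present."""
    present = tuple(f for f in GATE_FIELDS if str(row.get(f) or "").strip())
    if len(present) >= MIN_CONTEXT_FIELDS:
        status = "resolve_with_context"
    else:
        status = "stay_caveated"
    return GateDecision(
        title=str(row.get("title") or ""),
        status=status,
        evidence_fields=present,
    )
```

Recording `evidence_fields` on every decision keeps the gate auditable: a reviewer can see which context justified the resolution, and a caveated row shows exactly what was missing.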

Resume Commands

cd /srv/aina/aina-data-engine-room
git status --short --branch
git log --oneline --decorate -12
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-source-authority-registry
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-source-authority-inventory
uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-start-here

Do not run live Gemini again until the role-context source family has eligibility, repair, dry-run, and semantic spot-check proof.
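That precondition can be expressed as a preflight check that refuses a live run while any proof is missing. A minimal sketch: the four proof names follow the sentence above, while the mapping shape, the `"pass"` status value, and the function name are assumptions.

```python
REQUIRED_PROOFS = ("eligibility", "repair", "dry_run", "semantic_spot_check")


def live_embedding_allowed(proofs):
    """Return (allowed, missing) given a mapping of proof name -> status string.

    A proof counts only when its status is exactly "pass" (an assumed value);
    absent or failed proofs are listed in `missing` in a fixed order.
    """
    missing = [name for name in REQUIRED_PROOFS if proofs.get(name) != "pass"]
    return (not missing, missing)
```

Returning the list of missing proofs, not just a boolean, lets the caller report what still has to be produced before any Gemini tokens are spent.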

Where To Start

Resume by building the JD-aware role-context evidence spine, not by repairing ambiguous titles in isolation.


topics:
  - personalization-engine
  - data-engine-room
  - production-readiness
subtopics:
  - source-authority
  - jd-aware-role-context
  - gemini-embeddings
  - ai-fluency-capability-map
  - runtime-retrieval
  - artifact-inventory
