Technical Deep Dive Handoff
What exists in the repo, what has been achieved, why title-only repair is now too small, and what should happen next.
The repo is no longer just a title-coverage project. It is a self-contained VDS-local data engine for turning job, occupation, workflow, evidence, curriculum, evaluator, and embedding assets into a production-ready Personalization Engine. The next correction is important: ambiguous titles should not be repaired in isolation when the source row already has job description, company, industry, seniority, responsibilities, and other context.
Current Truth For The Engineer
| Surface | Current state |
|---|---|
| Repo path | /srv/aina/aina-data-engine-room |
| Branch | ali/ain-506-p0-gate-2026-06-12 |
| Audited implementation head | 77ab1c2 Add AIN-510 semantic reranking proof |
| Remote | No remote configured |
| Worktree | Dirty: three tracked files modified and two untracked WIP files |
| Last validation receipt | artifacts/validation/full_validation.json reports status: pass, but it was generated before the current WIP edits and does not cover them |
| Main scale | occupations=110184, ai_suitability=110184, linkedin_jobs=129165, wedge_occupations=47837 |
| Frozen embedding corpus | 294,315 chunks, 27,559,686 tokens, 309 manifest shards |
| Gemini vector snapshot | 6,067 vectors using gemini-embedding-2 at 768 dimensions |
| Runtime proof | AIN-510 local exact-cosine retrieval passes with caveats; production runtime authority is still blocked |
| Product object | AIFluencyCapabilityMap exists as a five-layer headless P0 fixture |
Current Branch Is Paused With WIP
    ## ali/ain-506-p0-gate-2026-06-12
    M .gitignore
    M src/aina_data_engine/cli.py
    M src/aina_data_engine/reports.py
    ?? src/aina_data_engine/runtime_source_authority_repair.py
    ?? tests/test_runtime_source_authority_repair.py
Landed code is current through 77ab1c2. The working tree contains a paused AIN-510 source-authority repair slice that has not been validated or committed in this checkpoint. Local branches include the current June 12 branch plus June 9 semantic/mission branches and main. This checkout has no configured Git remote.
| Branch | Head | Role |
|---|---|---|
| ali/ain-506-p0-gate-2026-06-12 | 77ab1c2 | Current production embedding and AIN-510 branch |
| ali/personalization-engine-mission-2026-06-09 | f31a202 | Earlier mission execution branch |
| ali/semantic-engine-room-2026-06-09 | 33b56ae | Earlier semantic pipeline branch |
| ali/semantic-risk-provenance-2026-06-09 | 88c8032 | Semantic risk/provenance branch |
| main | 88c8032 | Local main, behind current branch |
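The dirty-worktree state listed above (three tracked modifications, two untracked files) can be checked mechanically before resuming. The following is a minimal sketch, assuming the text comes from `git status --short --branch`; the function name `summarize_short_status` and the returned keys are hypothetical and are not part of the repo's CLI.

```python
from collections import Counter


def summarize_short_status(output: str) -> dict:
    """Summarize the text printed by `git status --short --branch`."""
    branch = None
    counts = Counter()
    for line in output.splitlines():
        if not line.strip():
            continue
        if line.startswith("## "):
            # Branch header; drop any "...upstream" suffix if a remote exists.
            branch = line[3:].split("...")[0].strip()
            continue
        # Split the status code from the path. Stripping first collapses the
        # staged/unstaged column distinction, which is fine for counting.
        code, _path = line.strip().split(None, 1)
        counts["untracked" if code == "??" else "tracked_changes"] += 1
    return {
        "branch": branch,
        "tracked_changes": counts["tracked_changes"],
        "untracked": counts["untracked"],
        "clean": sum(counts.values()) == 0,
    }
```

Feeding it the status output from this checkpoint should report a non-clean worktree, which is a signal that the last validation receipt may not describe the current code.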
The Case Manager Diagnosis
What We Should Stop Centering
Title-only repair asks a generic string such as "case manager" to carry too much meaning. It can spend LLM tokens and still choose the wrong role family, because the title alone does not say which domain the role belongs to.
What We Should Center Next
Use title plus JD text, company, industry, responsibilities, tools, seniority, risk tags, and source refs to resolve the role context before embedding or runtime retrieval.
The linkedin_jobs table already carries company, location, experience_level, job_description, cleaned descriptions, extracted_skills, industry, sub_industry, title_normalized, soc_code, function, seniority, exposure scores, risk tags, and quality flags. The AIN-506 contract also says LinkedIn/job chunks should use title + company/industry if safe + cleaned responsibility/workflow text, not title-only rows.
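As an illustration of the "title + company/industry if safe + cleaned responsibility/workflow text" rule, here is a minimal sketch of assembling chunk text from one job row. It makes several assumptions: the row is a plain dict, the cleaned description lives in a column named `cleaned_description` (the exact column name is not stated here), "safe" is modeled as a caller-supplied allow-list, and the 600-character cap is arbitrary. None of these names come from the AIN-506 contract itself.

```python
import re


def build_role_context_text(row, safe_fields=frozenset({"industry"}), max_chars=600):
    """Compose embedding text from a job row instead of using the title alone."""
    lines = []
    title = (row.get("title_normalized") or "").strip()
    if title:
        lines.append(f"Title: {title}")
    # Company and industry are included only when the caller marks them safe.
    for field in ("company", "industry"):
        value = (row.get(field) or "").strip()
        if field in safe_fields and value:
            lines.append(f"{field.capitalize()}: {value}")
    # Collapse whitespace and cap the length so no raw JD dump reaches the chunk.
    cleaned = re.sub(r"\s+", " ", row.get("cleaned_description") or "").strip()
    if cleaned:
        lines.append(f"Responsibilities: {cleaned[:max_chars]}")
    # Flag rows where nothing beyond the title was available.
    title_only = len(lines) <= 1
    return {"text": "\n".join(lines), "title_only": title_only}
```

Rows that come back with `title_only` set are the ones that would still need a caveat downstream, since no context was available to disambiguate them.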
What Is On Disk
| Directory | Size | Purpose |
|---|---|---|
| src/aina_data_engine/ | 7.9 MB | Python package and CLI implementation |
| tests/ | 3.2 MB | Pytest coverage for contracts, gates, runtime, embeddings, AI fluency, reports |
| docs/ | 35 MB | Planning docs, handoffs, reports, runbooks, source foundations |
| artifacts/ | 9.9 GB | Generated validation receipts, raw data, embeddings, packets, reports, reviews, warehouse |
| data/sources/ | 59 MB | Title-expansion source bundles and provenance |
| evidence/canonical/ | 8.5 MB | Evidence Atlas Parquet files |
| .venv/ | 5.2 GB | Ignored Python runtime |
Tracked file counts are 394 under artifacts, 309 under docs, 89 under src, 63 under tests, 17 under evidence, and 13 under data. Ignored files total 77,804, dominated by generated artifacts and the local virtualenv. The non-ignored untracked files are only the two WIP AIN-510 repair files.
What The Code Does
| Module | Role |
|---|---|
| ingest.py | Builds DuckDB from raw and processed sources; writes canonical snapshots and master taxonomy |
| packets.py | Builds role intelligence packets, teaching packets, AI affordance packs, readiness snapshots, curriculum input packets |
| schemas.py | Pydantic object contracts for learner, runtime, packets, events, evaluator structures |
| embedding_contracts.py | Pydantic/Pandera contracts for embedding configs, chunks, records, output frames |
| production_embeddings.py | Corpus construction, chunking, manifests, Gemini vectors, exact-cosine quality pairs |
| runtime_retrieval_proof.py | AIN-510 local exact-cosine runtime proof over existing vectors |
| ai_fluency_capability.py | Headless AI Fluency Capability Map contracts and fixture output |
| evidence_enrichment.py | Loads Evidence Atlas Parquet and enriches packets |
| reports.py | Full validation receipt and cross-artifact checks |
The CLI spans AIN-506 embeddings, AIN-510 retrieval/runtime authority, source authority, title coverage, runtime payloads, AI fluency, GDPval/Hugging Face practice lanes, and release gates.
DuckDB, Parquet, HF, Evidence Atlas
Main DuckDB
| Table | Rows | Purpose |
|---|---|---|
| linkedin_jobs | 129,165 | Job/posting substrate with JD, company, skills, title, SOC, function, seniority, risk, quality flags |
| occupations | 110,184 | Local occupation/title universe |
| ai_suitability | 110,184 | Exposure, augmentation, risk, priority, wedge, function, seniority |
| wedge_occupations | 47,837 | AINA wedge subset |
Parquet Evidence
| Path | Rows | Use |
|---|---|---|
| production semantic corpus | 294,315 | Frozen embedding chunks |
| Gemini vectors | 6,067 | Current vector snapshot |
| role_responsibility_evidence.parquet | 43,196 | Responsibilities by role/bucket/function/seniority |
| workflow_intelligence.parquet | 3,152 | Workflow names, confidence, risk domains, source refs |
| workflow_ai_affordance.parquet | 3,152 | Trust, risk, data sensitivity, capability IDs, assist/cannot-trust lists |
| workflow_tool_evidence.parquet | 413 | Tool evidence by workflow/bucket |
Hugging Face data exists under artifacts/raw/huggingface/, including Anthropic Economic Index files and OpenAI GDPval Parquet. Derived HF role and GDPval maps are under artifacts/derived/huggingface/. Title expansion inputs include gold-IP role spine, WRTMJ, and MVP2 market mapping bundles under data/sources/title_expansion/.
What Has Been Achieved
The work has moved from a broad title map to a headless production-readiness spine. The missing layer is not another title sweep. It is context-first role evidence.
Why Each Piece Exists
| Technology | Use |
|---|---|
| DuckDB | Local warehouse, joins, packet generation, vector snapshot querying |
| Parquet | Typed columnar storage for chunks, vectors, O*NET, Evidence Atlas |
| Pydantic | API/object contracts, learner/runtime/evaluator/AI Fluency objects |
| Pandera | Dataframe contracts for chunk and vector tables |
| JSON/JSONL | Receipts, ledgers, row-level proof, spot checks, model outputs |
| Gemini Embedding 2 | Semantic relevance layer for clean chunks; exact cosine remains source-of-truth retrieval |
| LLMs | Structured title adjudication, semantic QA, and repair reasoning; not a substitute for source authority |
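For readers unfamiliar with the term, "exact cosine" retrieval means scoring the query against every stored vector rather than using an approximate index. The sketch below shows the idea in plain Python; it is not the code in runtime_retrieval_proof.py, and the function names and the dict-of-lists vector store are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k=3):
    """Score every stored vector (brute force) and return the k best matches."""
    scored = [(chunk_id, cosine(query, vec)) for chunk_id, vec in vectors.items()]
    # Highest score first; ties broken by chunk id so results are deterministic.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]
```

Brute force is linear in the number of vectors, which is cheap at a snapshot size of a few thousand vectors and gives a ground truth against which any later approximate index could be compared.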
Paused Source-Authority Repair
| File | Status |
|---|---|
| .gitignore | WIP allow-rule for AIN-510 source-authority repair JSONL |
| src/aina_data_engine/cli.py | WIP CLI wiring |
| src/aina_data_engine/reports.py | WIP validation integration |
| src/aina_data_engine/runtime_source_authority_repair.py | Untracked WIP deterministic repair receipt generator |
| tests/test_runtime_source_authority_repair.py | Untracked WIP focused test |
The WIP intent is sound, but it is too narrow if it only fixes four unresolved cases. Use it as a small proof fixture for a general JD-aware role-context evidence layer.
Milestones From Here
| Milestone | Goal | Acceptance |
|---|---|---|
| M0 | Freeze and reconcile paused WIP | Worktree clean or WIP explicitly parked; validation matches current code |
| M1 | Build role_context_evidence_v1 | Title + JD + company + industry + responsibility context becomes first-class source authority |
| M2 | JD-aware disambiguation gate | Ambiguous titles resolve by context when context exists and stay caveated when it does not |
| M3 | Rebuild clean embedding corpus around role context | No raw JD dumps, no bad labels in embedding text, source refs on every chunk |
| M4 | Progressive Gemini embedding | Dry run, 50-row check, live 500, 5k, 25k, then batch candidate only if clean |
| M5 | AIN-510 retrieval promotion with context | Fewer or zero unresolved caveats for context-rich cases; fallback remains available |
| M6 | AI Fluency production readiness | Top 500 and top 1,000 capability maps improve using role context without public runtime unlocks |
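The M2 acceptance rule (resolve by context when context exists, stay caveated when it does not) can be sketched as a small gate. This is an illustrative sketch only: the keyword-overlap scoring stands in for whatever resolver M2 actually uses, and `GateResult`, `disambiguate`, `CONTEXT_FIELDS`, and the role-family names in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Context columns consulted by the gate; a subset of the linkedin_jobs fields.
CONTEXT_FIELDS = ("job_description", "company", "industry", "seniority")


@dataclass
class GateResult:
    title: str
    status: str  # "resolved" or "caveated"
    role_family: Optional[str]
    reason: str


def disambiguate(row, family_keywords):
    """Resolve an ambiguous title from row context, or keep it caveated."""
    title = row.get("title") or ""
    context = " ".join(str(row.get(f) or "") for f in CONTEXT_FIELDS).lower()
    if not context.strip():
        return GateResult(title, "caveated", None, "no context on source row")
    # Count how many of each family's keywords appear in the context text.
    scores = {
        family: sum(keyword in context for keyword in keywords)
        for family, keywords in family_keywords.items()
    }
    best = max(scores.values(), default=0)
    winners = sorted(f for f, s in scores.items() if s == best and s > 0)
    if len(winners) == 1:
        return GateResult(title, "resolved", winners[0], f"matched {best} keyword(s)")
    # Either nothing matched or two families tied: do not guess.
    return GateResult(title, "caveated", None, "context present but not decisive")
```

For example, a "case manager" row whose description mentions patients and discharge planning would resolve to a healthcare family under a keyword map such as `{"healthcare_case_management": ["patient", "discharge"], "legal_case_management": ["litigation", "court"]}`, while the same title with no description stays caveated rather than being guessed.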
Resume Commands
    cd /srv/aina/aina-data-engine-room
    git status --short --branch
    git log --oneline --decorate -12
    uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-source-authority-registry
    uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-source-authority-inventory
    uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-start-here
Resume by building the JD-aware role-context evidence spine, not by repairing ambiguous titles in isolation.