AINA Data Engine Room - handoff - 2026-06-12

Production Embedding Top-Worked Gate Handoff

The top-worked title proof lane is now complete, and the source-authority map is wired into corpus freezing.

Ali Mehdi Mukadam - co-authored with Codex - reading time 5 minutes

The Single Idea

This checkpoint closes the top-worked title embedding gap without embedding raw junk labels. The run repaired the last four generic posting-style titles into clean role-intent titles, embedded only those four with Gemini Embedding 2 through paid Vertex ADC, and added a source-authority preflight so future corpus freezes must verify the hidden-gem sources before generating embedding manifests.

01 Status

Top-Worked Lane

The full active goal remains in progress. This checkpoint completes the top-worked title proving ground and keeps runtime promotion separate from build-time vector proof.

Top ICP title vectors: 1000 / 1000
Top hardening-band vectors: 500 / 500
Total Gemini vectors in parquet: 2263
Failed Gemini rows: 0

The live receipt reports status: pass, production_eligible: true, and a quality-pair gap of 0.247901 using strict_title_soc_function_v2.
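A minimal sketch of how the receipt fields above could be checked programmatically rather than by eye. The top-level keys (status, production_eligible, failed_quality_gates, metrics) and the two vector-count names appear elsewhere in this handoff; their nesting under metrics, the key name quality_pair_gap, and the min_gap threshold are assumptions for illustration, not the confirmed receipt schema.

```python
from typing import Any


def receipt_problems(receipt: dict[str, Any], min_gap: float = 0.0) -> list[str]:
    """Return human-readable reasons the receipt should not be trusted."""
    problems: list[str] = []

    if receipt.get("status") != "pass":
        problems.append(f"status is {receipt.get('status')!r}, expected 'pass'")
    if receipt.get("production_eligible") is not True:
        problems.append("production_eligible is not true")
    if receipt.get("failed_quality_gates"):
        problems.append(f"failed quality gates: {receipt['failed_quality_gates']}")

    # Assumption: the counts live under "metrics"; adjust to the real schema.
    metrics = receipt.get("metrics") or {}
    expected_counts = {"top_1000_vector_count": 1000, "top_500_vector_count": 500}
    for key, want in expected_counts.items():
        if metrics.get(key) != want:
            problems.append(f"{key} is {metrics.get(key)!r}, expected {want}")

    # Hypothetical key name for the quality-pair gap reported in the receipt.
    gap = metrics.get("quality_pair_gap")
    if not isinstance(gap, (int, float)) or gap <= min_gap:
        problems.append(f"quality_pair_gap is {gap!r}, expected > {min_gap}")

    return problems
```

In use, the dictionary would come from json.load on artifacts/validation/ain_506_live_gemini_embedding_run_v1.json, and an empty list would mean the receipt matches the numbers reported here.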

02 What Changed

Source Authority First

1. Registry verifies hidden-gem sources.
2. Corpus freeze requires registry proof.
3. Posting phrases repair into role intent.
4. Only clean repaired rows are embedded.

run_production_embedding_corpus now regenerates and verifies production_source_authority_registry_v1 before writing the corpus receipt. The registry checks the harvest source map, source truth ledger, title ledger, mapping-chain ledger, salvage map, jobs-research title audit, clean candidates, jobs-research manifest, and the evidence-atlas title and responsibility parquet files.
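The preflight pattern described above can be sketched as a guard that refuses to write a receipt while any required source is absent. This is an illustration only: SourceAuthorityError, missing_sources, and freeze_with_preflight are hypothetical names, the path mapping is supplied by the caller, and the sketch checks presence only, whereas the real registry also regenerates and verifies its inputs.

```python
from pathlib import Path
from typing import Callable, Mapping


class SourceAuthorityError(RuntimeError):
    """Raised when a required source is missing at freeze time."""


def missing_sources(root: Path, required: Mapping[str, str]) -> list[str]:
    """Return the names of required sources whose paths do not exist under root."""
    return sorted(
        name for name, rel in required.items() if not (root / rel).exists()
    )


def freeze_with_preflight(
    root: Path,
    required: Mapping[str, str],
    write_receipt: Callable[[], str],
) -> str:
    """Run the source check first; only call write_receipt if nothing is missing."""
    missing = missing_sources(root, required)
    if missing:
        raise SourceAuthorityError("missing sources: " + ", ".join(missing))
    return write_receipt()
```

The design point is ordering: the receipt writer is passed in as a callable so it cannot run before the check succeeds.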

Original phrase | Repaired embedding title
management training - entry level | management trainee
office expansion- entry level professionals wanted | office associate
entry level openings: fast-paced marketing team | marketing associate
restaurant / hospitality experience - entry level positions | hospitality associate

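As an illustration of the four repairs listed above, the mapping can be expressed as a normalized lookup in which anything not explicitly repaired stays out of the embedding set. The handoff does not describe how the pipeline actually derives repaired titles, so the lookup-table approach and the names REPAIRS, normalize, and embedding_title are assumptions.

```python
import re
from typing import Optional

# The four phrase-to-title repairs reported in this checkpoint.
REPAIRS = {
    "management training - entry level": "management trainee",
    "office expansion- entry level professionals wanted": "office associate",
    "entry level openings: fast-paced marketing team": "marketing associate",
    "restaurant / hospitality experience - entry level positions": "hospitality associate",
}


def normalize(phrase: str) -> str:
    """Collapse runs of whitespace, trim the ends, and lowercase."""
    return re.sub(r"\s+", " ", phrase).strip().lower()


def embedding_title(phrase: str) -> Optional[str]:
    """Return the repaired title, or None if the phrase has no known repair."""
    return REPAIRS.get(normalize(phrase))
```

Returning None for unknown phrases keeps raw posting-style labels from being embedded by default.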
03 Evidence

Receipts

Primary receipts live under /srv/aina/aina-data-engine-room/artifacts/validation/, especially ain_506_live_gemini_embedding_run_v1.json, ain_506_production_semantic_embedding_corpus_v1.json, production_source_authority_registry_v1.json, and ain_506_p0_embedding_contract_gate_v1.json.

Receipt | Metric | Value
Live Gemini run | top_1000_vector_count | 1000
Live Gemini run | top_500_vector_count | 500
Live Gemini run | failed_quality_gates | []
Zero-new dry run | candidate_count | 0
Corpus freeze | chunk_count | 298869
Corpus freeze | manifest_shard_count | 314
Source registry | trusted_jobs_research_titles | 15104
Source registry | clean_candidate_row_count | 44440

uv run pytest tests/test_production_embeddings.py tests/test_embedding_contracts.py -q
uv run ruff check src/aina_data_engine/production_embedding_eligibility.py src/aina_data_engine/production_embeddings.py tests/test_production_embeddings.py
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate

Results: pytest reported 35 passed, Ruff reported "All checks passed", the P0 gate passed, and full validation passed.

04 Files Touched

Code And Artifacts

Code changes are in src/aina_data_engine/production_embeddings.py, src/aina_data_engine/production_embedding_eligibility.py, and tests/test_production_embeddings.py. Generated evidence updated the live Gemini receipt, quality pairs, corpus receipt, source-authority registry, eligibility and repair receipts, repaired top-worked overlay, vector parquet, and chunk parquet.

05 What This Means

No More Rediscovering This Layer

The engine can now prove complete Gemini coverage over the top-worked title proving ground, including the top 500 hardening band. More importantly, source authority is no longer just a report. The corpus-freeze command now has to prove that the jobs-research title audit, clean candidates, evidence atlas, ledgers, salvage map, and source truth ledger are present before it claims a production semantic corpus.

06 Recommended Next Move

Expand Carefully

Start with the next source family that has high value and manageable risk: regenerate scoped eligibility for serviceable_title, spot-check 50 repaired rows for title noise and label leakage, dry-run the first 500-1000 new candidates, and embed progressively only if the semantic check holds.
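The spot-check and dry-run steps above could be staged with two small helpers. This is a sketch under assumptions: spot_check_sample and first_batch are hypothetical names, rows are treated as opaque items, and the fixed seed is only there so a reviewer can reproduce the same 50-row sample.

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def spot_check_sample(rows: Sequence[T], k: int = 50, seed: int = 0) -> list[T]:
    """Pick up to k rows for manual review, reproducibly for a given seed."""
    if len(rows) <= k:
        return list(rows)
    return random.Random(seed).sample(list(rows), k)


def first_batch(candidates: Sequence[T], limit: int = 500) -> list[T]:
    """Take the first `limit` candidates for a dry run, preserving order."""
    return list(candidates[:limit])
```

Raising limit from 500 toward 1000 only after the smaller dry run passes the semantic check matches the progressive approach recommended here.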

cd /srv/aina/aina-data-engine-room
git status --short --branch
git log -3 --oneline
jq '{status,valid,production_eligible,metrics,failed_quality_gates}' artifacts/validation/ain_506_live_gemini_embedding_run_v1.json
jq '{status,valid,metrics,checks,failed_checks}' artifacts/validation/production_source_authority_registry_v1.json
uv run pytest tests/test_production_embeddings.py tests/test_embedding_contracts.py -q
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate

Where to start

Resume from the receipts, not from memory: the source-authority registry and top-worked embedding receipt are now the proof surfaces.