Production Embedding Top-Worked Gate Handoff
The top-worked title proof lane is now complete, and the source-authority map is wired into corpus freezing.
This checkpoint closes the top-worked title embedding gap without embedding raw junk labels. The run repaired the last four generic posting-style titles into clean role-intent titles, embedded only those four with Gemini Embedding 2 through paid Vertex ADC, and added a source-authority preflight so future corpus freezes must verify the hidden-gem sources before generating embedding manifests.
Top-Worked Lane
The full active goal remains in progress. This checkpoint completes the top-worked title proving ground and keeps runtime promotion separate from build-time vector proof.
The live receipt reports status: pass, production_eligible: true, and a quality-pair gap of 0.247901 using strict_title_soc_function_v2.
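A minimal sketch of how a caller might gate on that receipt instead of reading it by eye. The field names `status`, `production_eligible`, and `failed_quality_gates` are the ones this note reports. The helper names `load_receipt` and `receipt_passes` are hypothetical and are not part of the engine.

```python
import json
from pathlib import Path


def load_receipt(path: Path) -> dict:
    """Read a JSON receipt from disk."""
    return json.loads(path.read_text(encoding="utf-8"))


def receipt_passes(receipt: dict) -> bool:
    """Return True only for a clean, production-eligible run.

    A missing key counts as a failure, so an incomplete receipt
    can never be mistaken for a passing one.
    """
    return (
        receipt.get("status") == "pass"
        and receipt.get("production_eligible") is True
        and receipt.get("failed_quality_gates") == []
    )
```

Treating an absent `failed_quality_gates` as a failure is a deliberate choice in this sketch: the gate should only open on explicit proof.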
Source Authority First
Registry verifies hidden-gem sources.
Corpus freeze requires registry proof.
Posting phrases repair into role intent.
Only clean repaired rows are embedded.
run_production_embedding_corpus now regenerates and verifies production_source_authority_registry_v1 before writing the corpus receipt. The registry checks harvest source map, source truth ledger, title ledger, mapping-chain ledger, salvage map, jobs-research title audit, clean candidates, jobs-research manifest, and evidence-atlas title/responsibility parquet files.
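The preflight idea can be sketched roughly as follows. This is an illustration under assumptions, not the engine's implementation: `verify_sources` and `require_registry_proof` are hypothetical names, the check is reduced to "file exists and is non-empty", and the report keys simply mirror the `status`, `valid`, `checks`, and `failed_checks` fields of the registry receipt.

```python
from pathlib import Path


def verify_sources(sources: dict[str, Path]) -> dict:
    """Check that every named source file exists and is non-empty."""
    checks = {
        name: path.is_file() and path.stat().st_size > 0
        for name, path in sources.items()
    }
    failed = sorted(name for name, ok in checks.items() if not ok)
    return {
        "status": "fail" if failed else "pass",
        "valid": not failed,
        "checks": checks,
        "failed_checks": failed,
    }


def require_registry_proof(sources: dict[str, Path]) -> dict:
    """Raise before any corpus receipt is written if a source is missing."""
    report = verify_sources(sources)
    if not report["valid"]:
        raise RuntimeError(
            f"source authority preflight failed: {report['failed_checks']}"
        )
    return report
```

The point of raising rather than returning a flag is that a corpus freeze cannot proceed past a failed preflight by accident.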
| Original phrase | Repaired embedding title |
|---|---|
| management training - entry level | management trainee |
| office expansion- entry level professionals wanted | office associate |
| entry level openings: fast-paced marketing team | marketing associate |
| restaurant / hospitality experience - entry level positions | hospitality associate |
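As an illustration only, the four repairs above can be expressed as an explicit lookup, so that anything without a repaired title is simply not embedded. The real repair logic is not shown in this note. `REPAIRS` and `repaired_title` are hypothetical names, and the lowercase and whitespace normalization is an assumption.

```python
from typing import Optional

# The four phrase-to-title repairs listed in the table above.
REPAIRS = {
    "management training - entry level": "management trainee",
    "office expansion- entry level professionals wanted": "office associate",
    "entry level openings: fast-paced marketing team": "marketing associate",
    "restaurant / hospitality experience - entry level positions": "hospitality associate",
}


def repaired_title(raw: str) -> Optional[str]:
    """Return the clean role-intent title, or None if no repair is known."""
    # Lowercase and collapse runs of whitespace before the lookup.
    key = " ".join(raw.lower().split())
    return REPAIRS.get(key)
```

Returning `None` for unknown phrases keeps raw posting-style labels out of the embedding input by default.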
Receipts
Primary receipts live under /srv/aina/aina-data-engine-room/artifacts/validation/, especially ain_506_live_gemini_embedding_run_v1.json, ain_506_production_semantic_embedding_corpus_v1.json, production_source_authority_registry_v1.json, and ain_506_p0_embedding_contract_gate_v1.json.
| Receipt | Metric | Value |
|---|---|---|
| Live Gemini run | top_1000_vector_count | 1000 |
| Live Gemini run | top_500_vector_count | 500 |
| Live Gemini run | failed_quality_gates | [] |
| Zero-new dry run | candidate_count | 0 |
| Corpus freeze | chunk_count | 298869 |
| Corpus freeze | manifest_shard_count | 314 |
| Source registry | trusted_jobs_research_titles | 15104 |
| Source registry | clean_candidate_row_count | 44440 |
uv run pytest tests/test_production_embeddings.py tests/test_embedding_contracts.py -q
uv run ruff check src/aina_data_engine/production_embedding_eligibility.py src/aina_data_engine/production_embeddings.py tests/test_production_embeddings.py
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
Results: pytest reported 35 passed, Ruff reported "All checks passed", the P0 gate passed, and full validation passed.
Code And Artifacts
Code changes are in src/aina_data_engine/production_embeddings.py, src/aina_data_engine/production_embedding_eligibility.py, and tests/test_production_embeddings.py. Generated evidence updated the live Gemini receipt, quality pairs, corpus receipt, source-authority registry, eligibility and repair receipts, repaired top-worked overlay, vector parquet, and chunk parquet.
No More Rediscovering This Layer
The engine can now prove complete Gemini coverage over the top-worked title proving ground, including the top 500 hardening band. More importantly, source authority is no longer just a report. The corpus-freeze command now has to prove that the jobs-research title audit, clean candidates, evidence atlas, ledgers, salvage map, and source truth ledger are present before it claims a production semantic corpus.
Expand Carefully
Start with the next source family that has high value and manageable risk: regenerate scoped eligibility for serviceable_title, spot-check 50 repaired rows for title noise and label leakage, dry-run the first 500-1000 new candidates, and embed progressively only if the semantic check holds.
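The expansion steps above could be sketched like this. Everything here is hypothetical scaffolding: `embed_batch` and `semantic_check_holds` stand in for the real embedding call and semantic check, and the doubling batch schedule is an assumption rather than a documented policy.

```python
import random
from typing import Callable, Iterator, Sequence


def spot_check_sample(rows: Sequence[dict], k: int = 50, seed: int = 0) -> list[dict]:
    """Pick a reproducible sample of repaired rows for manual review."""
    rng = random.Random(seed)
    return rng.sample(list(rows), min(k, len(rows)))


def progressive_batches(
    candidates: Sequence[str], first: int = 500, growth: int = 2
) -> Iterator[Sequence[str]]:
    """Yield batches that start small and grow by a fixed factor."""
    start, size = 0, first
    while start < len(candidates):
        yield candidates[start:start + size]
        start += size
        size *= growth


def embed_progressively(
    candidates: Sequence[str],
    embed_batch: Callable[[Sequence[str]], None],
    semantic_check_holds: Callable[[Sequence[str]], bool],
    first: int = 500,
) -> int:
    """Embed batch by batch, stopping at the first batch that fails the check."""
    embedded = 0
    for batch in progressive_batches(candidates, first):
        if not semantic_check_holds(batch):
            break  # stop expanding; nothing from this batch is embedded
        embed_batch(batch)
        embedded += len(batch)
    return embedded
```

A fixed seed keeps the 50-row spot check reproducible across sessions, so a reviewer can return to the same rows.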
cd /srv/aina/aina-data-engine-room
git status --short --branch
git log -3 --oneline
jq '{status,valid,production_eligible,metrics,failed_quality_gates}' artifacts/validation/ain_506_live_gemini_embedding_run_v1.json
jq '{status,valid,metrics,checks,failed_checks}' artifacts/validation/production_source_authority_registry_v1.json
uv run pytest tests/test_production_embeddings.py tests/test_embedding_contracts.py -q
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
Resume from the receipts, not from memory: the source-authority registry and top-worked embedding receipt are now the proof surfaces.