# Gemini Clean-Before-Embed Live 500 Handoff

Date: 2026-06-12
Repo: `/srv/aina/aina-data-engine-room`
Branch: `ali/ain-506-p0-gate-2026-06-12`
Issues: AIN-506, AIN-508, AIN-510
Author: Ali Mehdi Mukadam - co-authored with Codex

## The Single Idea

This run turned the clean-before-embed plan from policy into live data-engine proof. AINA now has a guarded Gemini semantic layer with 6,034 vectors, including the first 500 clean `semantic_review` vectors, restored top-1,000/top-500 title coverage, stricter batch safeguards, a source-authority inventory that points future agents to prior cleanup before they burn tokens rediscovering it, and an AIN-510 retrieval promotion gate that blocks runtime promotion until cross-asset retrieval proof is real.

## What Changed

1. The source-authority inventory expanded from 26 to 39 entity assets. It now names the HF-derived maps, top-worked title receipts, serviceable-title checkpoint, and the hidden evidence-atlas layers: workflow tool evidence, IWA evidence, realism corpus, and qualitative corpus.
2. Batch manifest generation is now blocked unless the corpus is explicitly frozen with `eligible_only=true`. Full corpus freezes still work for analysis, but they cannot emit API-ready Gemini JSONL shards containing `repair_first` or `blocked` rows.
3. Scoped eligibility loading now prefers scoped ledgers. A `--source-family semantic_review` repair pass no longer risks reading a stale unscoped eligibility file.
4. Hard marketplace posting artifacts now block repair. Rows such as urgent hiring blurbs and hiring-event announcements are classified as `blocked_hard_posting_artifact`, not converted into embedding authority.
5. Scoped live embedding writes are additive. A scoped run now preserves existing out-of-family vectors, including repaired overlays, instead of pruning them from the shared snapshot.
6. A durable 50-row clean candidate spot check was created before the live call. It sampled the exact clean/progressive `semantic_review` candidate lane: 50 pass, 0 fail, 0 hard posting noise.
7. A live Vertex Gemini Embedding 2 run embedded 500 clean `semantic_review` rows with repaired rows excluded. It had 0 failed rows.
8. A bounded top-worked recovery run re-embedded 84 missing top-worked vectors after the pre-fix scoped writer temporarily dropped coverage. The snapshot is now restored.
9. AIN-510 retrieval promotion is now a real command: `ain-510-retrieval-promotion-gate`. It reads the current Gemini vector snapshot, recomputes exact-cosine quality pairs, checks sensitive buckets and cross-asset family coverage, and keeps runtime authority off.

## Current State

The current vector snapshot is:

- Total Gemini vectors: `6,034`
- `top_worked_title`: `1,000`
- `serviceable_title`: `3,440`
- `semantic_review`: `500`
- `hf_role_signal`: `826`
- `gdpval_task`: `220`
- `alipe_vision_doc`: `32`
- `harvest_source_map`: `16`

Top-band coverage is healthy again:

- Top 1,000 vector count: `1,000`
- Top 500 hardening vector count: `500`
- Latest cosine gap: `0.207005`
- Latest failed rows: `0`

The scoped `semantic_review` eligibility ledger now shows:

- Rows: `42,199`
- Clean progressive candidates: `29,470`
- Repair-first rows: `12,688`
- Hard-blocked rows: `41`

Runtime authority is still not promoted. The embeddings are live build-time data and internal retrieval material. AIN-510 now proves why: current vectors have title, serviceable, semantic-review, HF, GDPVal, ALIPE, and harvest-map coverage, but `workflow_vector_count` is still `0`; the explicit sensitive mismatch fixture suite and rollback proof are also not present yet.
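The counts above can be sanity-checked with simple arithmetic, and the three missing proofs can be written down as explicit blockers. The sketch below is illustrative only: `reconciles`, `promotion_blockers`, and the blocker labels are hypothetical names chosen for this note, not the logic or output of `ain-510-retrieval-promotion-gate`. Only the numbers are taken from this handoff.

```python
"""Sketch: reconcile the reported snapshot counts and list promotion blockers."""

# Per-family vector counts copied from the snapshot summary in this handoff.
FAMILY_COUNTS = {
    "top_worked_title": 1_000,
    "serviceable_title": 3_440,
    "semantic_review": 500,
    "hf_role_signal": 826,
    "gdpval_task": 220,
    "alipe_vision_doc": 32,
    "harvest_source_map": 16,
}
REPORTED_TOTAL = 6_034

# Scoped semantic_review eligibility ledger, also copied from this handoff.
LEDGER_PARTS = {
    "clean_progressive": 29_470,
    "repair_first": 12_688,
    "hard_blocked": 41,
}
LEDGER_ROWS = 42_199


def reconciles(parts: dict[str, int], reported_total: int) -> bool:
    """Return True when the per-bucket counts add up to the reported total."""
    return sum(parts.values()) == reported_total


def promotion_blockers(
    workflow_vector_count: int,
    has_mismatch_fixtures: bool,
    has_rollback_proof: bool,
) -> list[str]:
    """Hypothetical helper: name each missing proof that keeps promotion off."""
    blockers = []
    if workflow_vector_count == 0:
        blockers.append("no_workflow_vectors")
    if not has_mismatch_fixtures:
        blockers.append("no_sensitive_mismatch_fixtures")
    if not has_rollback_proof:
        blockers.append("no_rollback_proof")
    return blockers


if __name__ == "__main__":
    print(reconciles(FAMILY_COUNTS, REPORTED_TOTAL))
    print(reconciles(LEDGER_PARTS, LEDGER_ROWS))
    print(promotion_blockers(0, False, False))
```

Both totals reconcile: the seven family counts sum to 6,034, and the three ledger buckets sum to 42,199. An empty blocker list would be necessary for promotion in this sketch, but the real gate checks more than these three conditions.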

AIN-510 current result:

- Status: `blocked_for_runtime_promotion`
- Valid gate/report: `true`
- Promotion eligible: `false`
- Raw vector rows: `6,034`
- Valid Gemini vectors: `6,034`
- Stale vector rows: `0`
- Known similar pairs: `50`
- Known dissimilar pairs: `50`
- Similar/dissimilar cosine gap: `0.207005`
- Workflow vectors: `0`
- Curriculum vectors: `252`
- Evaluator vectors: `220`
- Runtime embedding authority promoted: `false`
- Public runtime allowed: `false`

## Important Artifacts

- Source start page: `/srv/aina/aina-data-engine-room/artifacts/validation/source_authority_start_here_v1.json`
- Expanded inventory: `/srv/aina/aina-data-engine-room/artifacts/validation/production_source_authority_inventory_v1.json`
- Clean candidate spot check: `/srv/aina/aina-data-engine-room/artifacts/validation/semantic_review_clean_candidate_spot_check_v1.json`
- Semantic live checkpoint: `/srv/aina/aina-data-engine-room/artifacts/validation/ain_506_semantic_review_progressive_live_500_v1.json`
- Shared latest live run receipt: `/srv/aina/aina-data-engine-room/artifacts/validation/ain_506_live_gemini_embedding_run_v1.json`
- AIN-510 retrieval promotion gate: `/srv/aina/aina-data-engine-room/artifacts/validation/ain_510_retrieval_promotion_gate_v1.json`
- Vector snapshot: `/srv/aina/aina-data-engine-room/artifacts/embeddings/production/vectors/model=gemini-embedding-2/dim=768/schema_version=embedding_contract_v1/ain_506_live_gemini_embedding_run_v1.parquet`
- Updated learning: `/srv/aina/aina-data-engine-room/docs/learnings/2026-06-12-gemini-embedding-clean-before-embed.md`

## Proof Run

These commands passed in this run:

```bash
uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-start-here
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
uv run pytest tests/test_production_embeddings.py tests/test_embedding_contracts.py tests/test_source_authority_start_here.py -q
uv run ruff check src/aina_data_engine/production_embeddings.py src/aina_data_engine/production_embedding_eligibility.py src/aina_data_engine/cli.py tests/test_production_embeddings.py tests/test_embedding_contracts.py tests/test_source_authority_start_here.py
git diff --check
```

The targeted tests are now `49 passed`.

## Linear Proof

- AIN-510 gate proof comment: `70e817cc-758a-48a0-bc04-f264c3cafcfd`
- AIN-506 embedding checkpoint comment: `8c2ad1b1-f28f-4125-8cde-e92a0ce255c9`
- AIN-508 clean-before-embed comment: `7085d79e-80c5-4dca-bc54-9f5886abfc1c`

## Artifact Tracking Policy

The old `.gitignore` ignored the whole `artifacts/` tree. That was too blunt: it protected heavy generated data, but it also hid small durable receipts and forced agents to remember `git add -f`.

The repo now uses a selective artifact policy:

- bulk data remains ignored: embeddings, vectors, Parquet, DuckDB, packets, raw downloads, and large row-level ledgers;
- durable receipts show up normally: `artifacts/validation/*.json`, `artifacts/reports/*.md`, and `artifacts/reports/*.html`;
- narrow QA JSONL receipts show up normally: `*_quality_pairs.jsonl`, `*_spot_check*.jsonl`, `*semantic_sample_50.jsonl`, and `*sample_50.jsonl`;
- force-add is now an escape hatch, not the expected path for routine proof.

## Things Not To Regress

Do not run broad live embeddings with `--include-repaired` by default. Repaired rows need their own source-family semantic QA.

Do not write batch manifests from a non-eligible corpus. The default corpus freeze can be used for analysis, but only `--eligible-only` can emit API-ready Gemini manifest shards.

Do not let source-family repair commands read stale unscoped ledgers.

Do not prune other families during scoped vector writes. Scoped runs must be additive unless the operator explicitly runs a full prune/rebuild lane.
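
The additive rule for scoped writes can be illustrated with a small in-memory merge. This is a minimal sketch under stated assumptions, not the writer in `production_embeddings.py`: the function name `merge_scoped_vectors` and the row fields `chunk_id` and `source_family` are hypothetical, and the real snapshot is a Parquet file rather than a list of dicts.

```python
"""Sketch: additive scoped write that never prunes out-of-family vectors."""
from collections import Counter


def merge_scoped_vectors(existing, new_rows, scoped_family):
    """Upsert new_rows for one family while keeping every existing row.

    `chunk_id` and `source_family` are hypothetical field names.
    """
    for row in new_rows:
        if row["source_family"] != scoped_family:
            # A scoped run should only ever write rows for its own family.
            raise ValueError(f"row {row['chunk_id']} is outside {scoped_family}")
    # Start from the full existing snapshot so other families survive.
    merged = {row["chunk_id"]: row for row in existing}
    # New rows overwrite same-id rows and add the rest; nothing is deleted.
    merged.update({row["chunk_id"]: row for row in new_rows})
    return list(merged.values())


def family_counts(rows):
    """Count rows per source family, for a before/after coverage check."""
    return Counter(row["source_family"] for row in rows)


if __name__ == "__main__":
    snapshot = [
        {"chunk_id": "t1", "source_family": "top_worked_title"},
        {"chunk_id": "s1", "source_family": "semantic_review"},
    ]
    fresh = [
        {"chunk_id": "s1", "source_family": "semantic_review"},
        {"chunk_id": "s2", "source_family": "semantic_review"},
    ]
    print(family_counts(merge_scoped_vectors(snapshot, fresh, "semantic_review")))
```

Comparing `family_counts` before and after a scoped run is one cheap way to catch the kind of coverage drop that the 84-vector top-worked recovery had to repair.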

Do not treat the current live embedding layer as public runtime authority yet. The data-engine vector layer is useful now, but AIN-510 currently blocks runtime promotion because specific proof is still missing.

## Known Residual Work

The serviceable-title 3,000 checkpoint still has `needs_attention` because its next sampler flagged 3 rows. It is superseded by the current vector snapshot, but serviceable-title expansion should not continue until those sampler concerns are resolved.

The `semantic_review` repaired corpus still has 12,487 repaired chunks. They are parked. They can be used only after a separate 50-row repaired-corpus semantic QA pass.

AIN-510 now counts sensitive bucket presence, and all buckets are represented in the current vector snapshot. What is still missing is the explicit sensitive mismatch fixture suite: regulated, healthcare, legal, HR, frontline, general-business, generic-neighbor, and marketplace-artifact cases must be tested as retrieval mismatches, not only counted as present.

AIN-510 also shows the biggest next embedding gap: clean workflow vectors are absent from the current vector snapshot. Title-to-workflow runtime reliance should not be promoted until workflow families pass eligibility, semantic QA, progressive embedding, and exact-cosine retrieval checks.

## Exact Resume Commands

```bash
cd /srv/aina/aina-data-engine-room
git status --short --branch
uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-start-here
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-eligibility --source-family semantic_review
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family semantic_review --max-new 500 --dry-run --workers 8
```

The recommended next build step is not another broad title scale-up.
First prepare the workflow family:

```bash
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-eligibility --source-family workflow_seed --source-family workflow_intelligence --source-family workflow_ai_affordance --source-family workflow_tool_evidence
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repair-queue --source-family workflow_seed --source-family workflow_intelligence --source-family workflow_ai_affordance --source-family workflow_tool_evidence
```

Then run a 50-row semantic QA over the clean workflow candidates before the first live `500` workflow embedding run. Rerun AIN-510 after that.

## Footer

Ali Mehdi Mukadam - co-authored with Codex - 2026-06-12

```yaml
topics:
  - aina-data-engine-room
  - gemini-embeddings
  - personalization-engine
subtopics:
  - clean-before-embed
  - semantic-review
  - ain-506
  - ain-510
```