# Gemini Clean-Before-Embed Live 500 Handoff

Date: 2026-06-12
Repo: `/srv/aina/aina-data-engine-room`
Branch: `ali/ain-506-p0-gate-2026-06-12`
Issues: AIN-506, AIN-508, AIN-510
Author: Ali Mehdi Mukadam - co-authored with Codex

## The Single Idea

This run turned the clean-before-embed plan from policy into live data-engine
proof. AINA now has a guarded Gemini semantic layer with 6,034 vectors, including
the first 500 clean `semantic_review` vectors, and top-1,000/top-500 title
coverage is restored. Batch safeguards are stricter. A source-authority inventory
points future agents to prior cleanup before they burn tokens rediscovering it.
An AIN-510 retrieval promotion gate blocks runtime promotion until cross-asset
retrieval proof is real.

## What Changed

1. The source-authority inventory expanded from 26 to 39 entity assets. It now
   names the HF-derived maps, top-worked title receipts, serviceable-title
   checkpoint, and the hidden evidence-atlas layers: workflow tool evidence,
   IWA evidence, realism corpus, and qualitative corpus.
2. Batch manifest generation is now blocked unless the corpus is explicitly
   frozen with `eligible_only=true`. Full corpus freezes still work for analysis,
   but they cannot emit API-ready Gemini JSONL shards containing `repair_first`
   or `blocked` rows.
3. Scoped eligibility loading now prefers scoped ledgers. A
   `--source-family semantic_review` repair pass no longer risks reading a stale
   unscoped eligibility file.
4. Hard marketplace posting artifacts now block repair. Rows such as urgent
   hiring blurbs and hiring-event announcements are classified as
   `blocked_hard_posting_artifact`, not converted into embedding authority.
5. Scoped live embedding writes are additive. A scoped run now preserves existing
   out-of-family vectors, including repaired overlays, instead of pruning them
   from the shared snapshot.
6. A durable 50-row clean candidate spot check was created before the live call.
   It sampled the exact clean/progressive `semantic_review` candidate lane:
   50 pass, 0 fail, 0 hard posting noise.
7. A live Vertex Gemini Embedding 2 run embedded 500 clean `semantic_review`
   rows with repaired rows excluded. It had 0 failed rows.
8. A bounded top-worked recovery run re-embedded 84 missing top-worked vectors
   after the pre-fix scoped writer temporarily dropped coverage. The snapshot is
   now restored.
9. AIN-510 retrieval promotion is now a real command:
   `ain-510-retrieval-promotion-gate`. It reads the current Gemini vector
   snapshot, recomputes exact-cosine quality pairs, checks sensitive buckets and
   cross-asset family coverage, and keeps runtime authority off.
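
The exact-cosine check in item 9 can be sketched as follows. This is a minimal illustration, not the gate's implementation: it assumes the gap is the mean cosine over known similar pairs minus the mean over known dissimilar pairs, and `cosine`, `cosine_gap`, and the toy vectors are hypothetical names and data.

```python
import math
from statistics import mean


def cosine(a: list[float], b: list[float]) -> float:
    """Exact cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_gap(vectors, similar_pairs, dissimilar_pairs) -> float:
    # Assumed definition: mean(similar cosines) - mean(dissimilar cosines).
    similar = mean(cosine(vectors[a], vectors[b]) for a, b in similar_pairs)
    dissimilar = mean(cosine(vectors[a], vectors[b]) for a, b in dissimilar_pairs)
    return similar - dissimilar


# Toy 2-d vectors; the real snapshot stores 768-dimensional rows.
toy_vectors = {"nurse": [1.0, 0.0], "rn": [1.0, 0.0], "welder": [0.0, 1.0]}
gap = cosine_gap(toy_vectors, [("nurse", "rn")], [("nurse", "welder")])
```

Under this assumed definition, a larger gap means known similar pairs sit clearly above known dissimilar pairs.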

## Current State

The current vector snapshot is:

- Total Gemini vectors: `6,034`
- `top_worked_title`: `1,000`
- `serviceable_title`: `3,440`
- `semantic_review`: `500`
- `hf_role_signal`: `826`
- `gdpval_task`: `220`
- `alipe_vision_doc`: `32`
- `harvest_source_map`: `16`

Top-band coverage is healthy again:

- Top 1,000 vector count: `1,000`
- Top 500 hardening vector count: `500`
- Latest cosine gap: `0.207005`
- Latest failed rows: `0`

The scoped `semantic_review` eligibility ledger now shows:

- Rows: `42,199`
- Clean progressive candidates: `29,470`
- Repair-first rows: `12,688`
- Hard-blocked rows: `41`
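
A quick arithmetic check confirms that the counts above are internally consistent. The sketch assumes the three ledger lanes are mutually exclusive and that every vector belongs to exactly one source family; `parts_sum_to_total` and the dictionary key names for the ledger are hypothetical, not part of the CLI.

```python
def parts_sum_to_total(total: int, parts: dict[str, int]) -> bool:
    """True when the named parts account for every row in the total."""
    return sum(parts.values()) == total


# Counts copied from this handoff; the ledger key names are illustrative.
semantic_review_ledger = {
    "clean_progressive": 29_470,
    "repair_first": 12_688,
    "hard_blocked": 41,
}
snapshot_families = {
    "top_worked_title": 1_000,
    "serviceable_title": 3_440,
    "semantic_review": 500,
    "hf_role_signal": 826,
    "gdpval_task": 220,
    "alipe_vision_doc": 32,
    "harvest_source_map": 16,
}

ledger_ok = parts_sum_to_total(42_199, semantic_review_ledger)
snapshot_ok = parts_sum_to_total(6_034, snapshot_families)
```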

Runtime authority is still not promoted. The embeddings are live build-time data
and internal retrieval material. AIN-510 now shows why: the current vectors cover
the title, serviceable, semantic-review, HF, GDPVal, ALIPE, and harvest-map
families, but `workflow_vector_count` is still `0`, and the explicit sensitive
mismatch fixture suite and rollback proof do not exist yet.

AIN-510 current result:

- Status: `blocked_for_runtime_promotion`
- Valid gate/report: `true`
- Promotion eligible: `false`
- Raw vector rows: `6,034`
- Valid Gemini vectors: `6,034`
- Stale vector rows: `0`
- Known similar pairs: `50`
- Known dissimilar pairs: `50`
- Similar/dissimilar cosine gap: `0.207005`
- Workflow vectors: `0`
- Curriculum vectors: `252`
- Evaluator vectors: `220`
- Runtime embedding authority promoted: `false`
- Public runtime allowed: `false`
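
The blocking logic behind this result can be sketched as a small decision function. This is an illustration under stated assumptions, not the `ain-510-retrieval-promotion-gate` implementation: the `GateInputs` fields, the blocker labels, and the `eligible_for_runtime_promotion` status are hypothetical, and only the three missing proofs named in this handoff are modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateInputs:
    workflow_vector_count: int
    has_sensitive_mismatch_fixtures: bool
    has_rollback_proof: bool


def promotion_blockers(inputs: GateInputs) -> list[str]:
    """Return every missing proof; an empty list means nothing blocks promotion."""
    blockers = []
    if inputs.workflow_vector_count == 0:
        blockers.append("workflow_vectors_missing")
    if not inputs.has_sensitive_mismatch_fixtures:
        blockers.append("sensitive_mismatch_fixtures_missing")
    if not inputs.has_rollback_proof:
        blockers.append("rollback_proof_missing")
    return blockers


def gate_status(inputs: GateInputs) -> str:
    # The "eligible" label is a placeholder; only the blocked status appears in this handoff.
    if promotion_blockers(inputs):
        return "blocked_for_runtime_promotion"
    return "eligible_for_runtime_promotion"


# State described in this handoff: no workflow vectors, no fixtures, no rollback proof.
current = GateInputs(
    workflow_vector_count=0,
    has_sensitive_mismatch_fixtures=False,
    has_rollback_proof=False,
)
status = gate_status(current)
```

Collecting every blocker, rather than stopping at the first, keeps the report useful as a to-do list for the next run.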

## Important Artifacts

- Source start page:
  `/srv/aina/aina-data-engine-room/artifacts/validation/source_authority_start_here_v1.json`
- Expanded inventory:
  `/srv/aina/aina-data-engine-room/artifacts/validation/production_source_authority_inventory_v1.json`
- Clean candidate spot check:
  `/srv/aina/aina-data-engine-room/artifacts/validation/semantic_review_clean_candidate_spot_check_v1.json`
- Semantic live checkpoint:
  `/srv/aina/aina-data-engine-room/artifacts/validation/ain_506_semantic_review_progressive_live_500_v1.json`
- Shared latest live run receipt:
  `/srv/aina/aina-data-engine-room/artifacts/validation/ain_506_live_gemini_embedding_run_v1.json`
- AIN-510 retrieval promotion gate:
  `/srv/aina/aina-data-engine-room/artifacts/validation/ain_510_retrieval_promotion_gate_v1.json`
- Vector snapshot:
  `/srv/aina/aina-data-engine-room/artifacts/embeddings/production/vectors/model=gemini-embedding-2/dim=768/schema_version=embedding_contract_v1/ain_506_live_gemini_embedding_run_v1.parquet`
- Updated learning:
  `/srv/aina/aina-data-engine-room/docs/learnings/2026-06-12-gemini-embedding-clean-before-embed.md`

## Proof Run

These commands passed in this run:

```bash
uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-start-here
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
uv run pytest tests/test_production_embeddings.py tests/test_embedding_contracts.py tests/test_source_authority_start_here.py -q
uv run ruff check src/aina_data_engine/production_embeddings.py src/aina_data_engine/production_embedding_eligibility.py src/aina_data_engine/cli.py tests/test_production_embeddings.py tests/test_embedding_contracts.py tests/test_source_authority_start_here.py
git diff --check
```

The targeted tests now report `49 passed`.

## Linear Proof

- AIN-510 gate proof comment: `70e817cc-758a-48a0-bc04-f264c3cafcfd`
- AIN-506 embedding checkpoint comment: `8c2ad1b1-f28f-4125-8cde-e92a0ce255c9`
- AIN-508 clean-before-embed comment: `7085d79e-80c5-4dca-bc54-9f5886abfc1c`

## Artifact Tracking Policy

The old `.gitignore` ignored the whole `artifacts/` tree. That was too blunt:
it protected heavy generated data, but it also hid small durable receipts and
forced agents to remember `git add -f`.

The repo now uses a selective artifact policy:

- bulk data remains ignored: embeddings, vectors, Parquet, DuckDB, packets, raw
  downloads, and large row-level ledgers;
- durable receipts show up normally: `artifacts/validation/*.json`,
  `artifacts/reports/*.md`, and `artifacts/reports/*.html`;
- narrow QA JSONL receipts show up normally:
  `*_quality_pairs.jsonl`, `*_spot_check*.jsonl`, `*semantic_sample_50.jsonl`,
  and `*sample_50.jsonl`;
- force-add is now an escape hatch, not the expected path for routine proof.
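
The policy can be modeled as a small path classifier. This sketch mirrors the rules listed above and is not the repo's actual `.gitignore`: `is_tracked_artifact` is a hypothetical helper, and it assumes the QA JSONL name patterns apply anywhere under `artifacts/`.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# (directory, filename pattern) pairs for durable receipts.
DURABLE_RECEIPTS = (
    ("artifacts/validation", "*.json"),
    ("artifacts/reports", "*.md"),
    ("artifacts/reports", "*.html"),
)
# Narrow QA receipts, matched on filename only.
QA_NAME_PATTERNS = (
    "*_quality_pairs.jsonl",
    "*_spot_check*.jsonl",
    "*semantic_sample_50.jsonl",
    "*sample_50.jsonl",
)


def is_tracked_artifact(path: str) -> bool:
    """True when the selective policy would let this artifact show up in git status."""
    p = PurePosixPath(path)
    if p.parts[:1] != ("artifacts",):
        return False  # this sketch only models the artifacts/ tree
    for directory, pattern in DURABLE_RECEIPTS:
        if p.parent == PurePosixPath(directory) and fnmatchcase(p.name, pattern):
            return True
    return any(fnmatchcase(p.name, pattern) for pattern in QA_NAME_PATTERNS)
```

Everything under `artifacts/` that matches neither list, such as Parquet vectors or large row-level ledgers, stays ignored by default in this model.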

## Things Not To Regress

Do not run broad live embeddings with `--include-repaired` by default. Repaired
rows need their own source-family semantic QA.

Do not write batch manifests from a non-eligible corpus. The default corpus
freeze can be used for analysis, but only `--eligible-only` can emit API-ready
Gemini manifest shards.
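
A minimal sketch of that guard, assuming each frozen row carries an `id`, `text`, and `status` field. `ManifestBlocked`, `build_manifest_rows`, and the row shape are hypothetical; only the `repair_first` and `blocked` labels and the `eligible_only` flag come from this handoff.

```python
class ManifestBlocked(RuntimeError):
    """Raised when a corpus freeze must not produce API-ready shards."""


NON_ELIGIBLE_STATUSES = {"repair_first", "blocked"}


def build_manifest_rows(rows: list[dict], eligible_only: bool) -> list[dict]:
    """Return manifest rows, or refuse if the freeze is not eligible-only."""
    if not eligible_only:
        raise ManifestBlocked("corpus was not frozen with eligible_only=true")
    # Defense in depth: even an eligible-only freeze must not leak bad rows.
    leaked = [row["id"] for row in rows if row["status"] in NON_ELIGIBLE_STATUSES]
    if leaked:
        raise ManifestBlocked(f"non-eligible rows in an eligible-only freeze: {leaked}")
    return [{"id": row["id"], "text": row["text"]} for row in rows]
```

Checking both the flag and the row statuses means a mislabeled freeze fails loudly instead of emitting shards.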

Do not let source-family repair commands read stale unscoped ledgers.
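
One way to enforce this is to resolve the ledger path from the requested scope and fail closed when the scoped file is missing. The file-naming scheme and `resolve_eligibility_ledger` are assumptions for illustration, and refusing to fall back is stricter than the "prefers scoped ledgers" behavior described in this handoff.

```python
from pathlib import Path


def resolve_eligibility_ledger(ledger_dir: Path, source_families: list[str]) -> Path:
    """Pick the ledger for a run; scoped runs never read the unscoped file."""
    if not source_families:
        return ledger_dir / "eligibility.jsonl"  # hypothetical unscoped name
    scope = "+".join(sorted(source_families))
    scoped = ledger_dir / f"eligibility__{scope}.jsonl"  # hypothetical scoped name
    if not scoped.exists():
        # Fail closed rather than silently reading a possibly stale unscoped ledger.
        raise FileNotFoundError(f"no scoped eligibility ledger for {scope}")
    return scoped
```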

Do not prune other families during scoped vector writes. Scoped runs must be
additive unless the operator explicitly runs a full prune/rebuild lane.
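
The additive rule amounts to an upsert keyed by row ID. The sketch below assumes each vector row has an `id` and a `source_family`; `merge_scoped_vectors` and the row shape are hypothetical, not the production writer.

```python
def merge_scoped_vectors(existing: list[dict], new_rows: list[dict], scope: set[str]) -> list[dict]:
    """Upsert in-scope rows while leaving every out-of-family row untouched."""
    out_of_scope = [row["id"] for row in new_rows if row["source_family"] not in scope]
    if out_of_scope:
        raise ValueError(f"rows outside the run scope: {out_of_scope}")
    merged = {row["id"]: row for row in existing}  # start from the full snapshot
    for row in new_rows:
        merged[row["id"]] = row  # replace or append; never prune
    return list(merged.values())


snapshot = [
    {"id": "t1", "source_family": "top_worked_title", "vector": [0.1]},
    {"id": "s1", "source_family": "semantic_review", "vector": [0.2]},
]
fresh = [
    {"id": "s1", "source_family": "semantic_review", "vector": [0.9]},
    {"id": "s2", "source_family": "semantic_review", "vector": [0.3]},
]
result = merge_scoped_vectors(snapshot, fresh, scope={"semantic_review"})
```

A pruning writer would instead rebuild the snapshot from `fresh` alone, which is how a scoped run can drop out-of-family coverage.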

Do not treat the current live embedding layer as public runtime authority yet.
The data-engine vector layer is useful now for build-time and internal retrieval,
but AIN-510 blocks runtime promotion until the missing proof exists.

## Known Residual Work

The serviceable-title 3,000 checkpoint still has `needs_attention` because its
next sampler flagged 3 rows. It is superseded by the current vector snapshot, but
serviceable-title expansion should not continue until those sampler concerns are
resolved.

The `semantic_review` repaired corpus still has 12,487 repaired chunks. They are
parked. They can be used only after a separate 50-row repaired-corpus semantic QA
pass.

AIN-510 now counts sensitive bucket presence, and all buckets are represented in
the current vector snapshot. What is still missing is the explicit sensitive
mismatch fixture suite: regulated, healthcare, legal, HR, frontline,
general-business, generic-neighbor, and marketplace-artifact cases must be
tested as retrieval mismatches, not only counted as present.

AIN-510 also shows the biggest next embedding gap: clean workflow vectors are
absent from the current vector snapshot. Title-to-workflow runtime reliance
should not be promoted until workflow families pass eligibility, semantic QA,
progressive embedding, and exact-cosine retrieval checks.

## Exact Resume Commands

```bash
cd /srv/aina/aina-data-engine-room
git status --short --branch
uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-start-here
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-eligibility --source-family semantic_review
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family semantic_review --max-new 500 --dry-run --workers 8
```

The recommended next build step is not another broad title scale-up. First
prepare the workflow families:

```bash
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-eligibility --source-family workflow_seed --source-family workflow_intelligence --source-family workflow_ai_affordance --source-family workflow_tool_evidence
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repair-queue --source-family workflow_seed --source-family workflow_intelligence --source-family workflow_ai_affordance --source-family workflow_tool_evidence
```

Then run a 50-row semantic QA pass over the clean workflow candidates before the
first live 500-row workflow embedding run. Rerun AIN-510 after that.
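
The 50-row QA step could draw its sample deterministically so the receipt is reproducible. This is a sketch under assumptions: `draw_spot_check`, `summarize_verdicts`, the seed, the placeholder IDs, and the rule that any failure blocks the live run are all hypothetical; only the 50-row size and the pass/fail tally style come from this handoff.

```python
import random
from collections import Counter


def draw_spot_check(candidate_ids: list[str], sample_size: int = 50, seed: int = 510) -> list[str]:
    """Pick a reproducible sample; sorting first makes the draw independent of input order."""
    if len(candidate_ids) < sample_size:
        raise ValueError("not enough candidates for the requested sample")
    return random.Random(seed).sample(sorted(candidate_ids), sample_size)


def summarize_verdicts(verdicts: dict[str, str]) -> dict[str, object]:
    """Tally reviewer verdicts; assumed rule: every sampled row must pass."""
    counts = Counter(verdicts.values())
    all_passed = bool(verdicts) and counts["pass"] == len(verdicts)
    return {"pass": counts["pass"], "fail": counts["fail"], "ready_for_live_run": all_passed}


candidates = [f"workflow-{i:05d}" for i in range(1_000)]  # placeholder IDs
sample = draw_spot_check(candidates)
summary = summarize_verdicts({row_id: "pass" for row_id in sample})
```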

## Footer

Ali Mehdi Mukadam - co-authored with Codex - 2026-06-12

```yaml
topics:
  - aina-data-engine-room
  - gemini-embeddings
  - personalization-engine
subtopics:
  - clean-before-embed
  - semantic-review
  - ain-506
  - ain-510
```