AINA Data Engine Room - AIN-506 / AIN-510 - 2026-06-12

Gemini Clean-Before-Embed Source Authority Handoff

A compact checkpoint for the next agent: the source-authority inventory is now the front door before more embedding work.

The Single Idea

The embedding lane now has an anti-loop guardrail: agents must inventory and reuse existing cleaned source assets before running more Gemini embedding calls. This protects AINA from embedding noisy titles, stale labels, raw market rows, or company/employer context that has not been converted into clean derived chunks.

01

What Changed

02

Current Proof

Authority assets inventoried: 26
Company/employer lineage assets: 5
Current Gemini vectors: 5534
Top 1,000 vector coverage: 1000
Top 500 vector coverage: 500
Focused tests passed: 43

The production-source-authority-inventory, source-authority-start-here, ain-506-p0-gate, and validate checks all pass. Ruff also passes on the touched source and test files.

03

Important Boundaries

No live Gemini call was made in this slice. The interrupted semantic_review work remains in a dry-run-only state and must not be promoted to live embedding until a 50-row semantic spot check passes.
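The 50-row spot check can be made reproducible so that a later agent reviews the same rows. The following is a minimal sketch, not the engine's actual implementation: ReviewedRow, draw_spot_check, spot_check_passes, the seed, and the pass rule (all 50 reviewed rows judged clean) are assumptions introduced for illustration.

```python
import random
from dataclasses import dataclass

SPOT_CHECK_SIZE = 50  # from the handoff: a 50-row semantic spot check


@dataclass(frozen=True)
class ReviewedRow:
    """Hypothetical record of one manually reviewed row."""
    row_id: str
    is_clean: bool  # hypothetical reviewer verdict


def draw_spot_check(row_ids, size=SPOT_CHECK_SIZE, seed=506):
    """Pick a reproducible sample of row ids for manual review."""
    ids = sorted(set(row_ids))  # sort so the sample does not depend on input order
    if len(ids) < size:
        raise ValueError(f"need at least {size} rows, got {len(ids)}")
    return random.Random(seed).sample(ids, size)


def spot_check_passes(reviewed, size=SPOT_CHECK_SIZE):
    """Assumed pass rule: exactly `size` distinct rows reviewed, all clean."""
    distinct = {r.row_id for r in reviewed}
    if len(distinct) != size or len(reviewed) != size:
        return False
    return all(r.is_clean for r in reviewed)
```

A fixed seed keeps the sample stable across sessions, which matters when the review is interrupted and resumed by a different agent.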

Company/employer lineage exists, including historical market_v2 company and employer registries, but the inventory marks those assets as reference-only. They are blocked from embedding authority until converted into clean derived artifacts.
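The reference-only rule can be pictured as a filter over inventory records. This is a sketch under stated assumptions: the AuthorityAsset fields, the function names, and the example asset names are hypothetical and are not taken from the real inventory schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthorityAsset:
    """Hypothetical inventory record; field names are illustrative."""
    name: str
    family: str
    reference_only: bool = False    # lineage kept for lookup, not for embedding
    is_clean_derived: bool = False  # converted into a clean derived artifact


def embedding_authority(assets):
    """Assets that may feed embedding: clean derived and not reference-only."""
    return [a for a in assets if a.is_clean_derived and not a.reference_only]


def blocked_assets(assets):
    """Everything else stays out of the embedding lane."""
    allowed = {a.name for a in embedding_authority(assets)}
    return [a for a in assets if a.name not in allowed]


# Illustrative inventory; asset names are made up for this example.
inventory = [
    AuthorityAsset("market_v2_company_registry", "company", reference_only=True),
    AuthorityAsset("market_v2_employer_registry", "employer", reference_only=True),
    AuthorityAsset("semantic_review_clean_chunks", "semantic_review", is_clean_derived=True),
]
```

In this sketch the two registry entries land in blocked_assets(inventory) and only the clean derived chunk set is returned by embedding_authority(inventory).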

04

Exact Resume Commands

cd /srv/aina/aina-data-engine-room
git status --short --branch
uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-start-here
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-source-authority-inventory
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-eligibility --source-family semantic_review
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repair-queue --source-family semantic_review
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repaired-corpus --source-family semantic_review
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family semantic_review --include-repaired --max-new 500 --dry-run --workers 8

Only after the 50-row semantic spot check passes:

uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family semantic_review --include-repaired --max-new 500 --allow-live-gemini --confirm-paid-api --workers 8 --write-every 50 --timeout-seconds 60

Where to start

Start with the inventory and the semantic-review 50-row spot check. No batch and no live call until that family proves it is clean.
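The dry-run versus live boundary can also be enforced in a small wrapper that only emits the live command once the spot check has passed. A minimal sketch: the flags are copied from the resume commands in this handoff, while embedding_command and its spot_check_passed argument are hypothetical helpers, not part of the aina-data-engine CLI.

```python
import shlex

ROOT = "/srv/aina/aina-data-engine-room"

# Shared prefix of the dry-run and live gemini-embedding-run commands.
BASE = [
    "uv", "run", "aina-data-engine", "--root", ROOT,
    "gemini-embedding-run",
    "--source-family", "semantic_review",
    "--include-repaired",
    "--max-new", "500",
]


def embedding_command(spot_check_passed: bool, live: bool = False) -> list[str]:
    """Build the argv for a dry run, or for a live run if the gate allows it."""
    if not live:
        return BASE + ["--dry-run", "--workers", "8"]
    if not spot_check_passed:
        # Hypothetical guard mirroring the rule stated in this handoff.
        raise PermissionError(
            "live Gemini run blocked: 50-row semantic spot check has not passed"
        )
    return BASE + [
        "--allow-live-gemini", "--confirm-paid-api",
        "--workers", "8",
        "--write-every", "50",
        "--timeout-seconds", "60",
    ]


if __name__ == "__main__":
    # Print the dry-run command; nothing is executed here.
    print(shlex.join(embedding_command(spot_check_passed=False)))
```

The wrapper only builds and prints the command, so it makes no Gemini call on its own.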