AINA Data Engine Room · Production embeddings · 2026-06-15

Semantic Review Embedding Checkpoint And 5k Rollback

A clean 500-vector advance, a useful 5k failure, and a verified rollback to a promotion-ready local retrieval state.

Ali Mehdi Mukadam · co-authored with Codex · 5 minute read
The Single Idea

The semantic_review family safely advanced by 500 clean Gemini vectors, but the attempted 5,000-vector expansion reduced the known-pair cosine separation below the required 0.15 floor. Scale-up stopped, the failed expansion was backed up, only the failed-gate rows were pruned, and the engine returned to a promotion-ready local retrieval state.

01 · What changed

The family advanced by 500 vectors safely.

Before this slice, the current good vector authority had 151,007 Gemini vectors and 1,000 semantic_review vectors. The repaired-input semantic QA was added first, then the progressive 500 live run passed.

Receipt | Result
production_embedding_semantic_qa_v1__source_family=semantic_review__input=repaired.json | pass, 50 sampled, 0 failed, 0 raw JD hits, 0 legacy review gate hits
500 live Gemini run | 500 new vectors, 0 provider failures, total vectors 151,507
AIN-510 after 500 | promotion_ready, known-pair cosine gap 0.190303
02 · Stop condition

The 5k run was technically clean but semantically too broad.

The 5k dry-run was clean: 5,000 candidates, eligible_only=true, no quality-excluded candidates, and no live API call. The live run then completed with 0 provider failures, but the known-pair cosine gap fell to 0.146566, below the required 0.15 floor, and two retrieval gates failed.

Metric | Value
New vectors in 5k run | 5,000
Total vectors before rollback | 156,507
semantic_review vectors before rollback | 6,500
Provider failures | 0
Known-pair cosine gap | 0.146566
Failed gates | gate_1_known_pairs_separate, gate_5_runtime_retrieval_candidate
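
The stop condition can be sketched as a small gate check. This is a minimal illustration, not the AIN-510 implementation: the report does not define how the known-pair cosine gap is computed, so the sketch assumes it is the mean cosine similarity of known-related pairs minus the mean of known-unrelated pairs. The function names and the treatment of a gap exactly at the floor as passing are also assumptions; only the 0.15 floor and the two reported gap values come from this report.

```python
import math

GAP_FLOOR = 0.15  # required floor stated in this report


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def known_pair_gap(related_pairs, unrelated_pairs):
    """ASSUMED definition: mean cosine of related pairs minus mean cosine of unrelated pairs."""
    related = sum(cosine(a, b) for a, b in related_pairs) / len(related_pairs)
    unrelated = sum(cosine(a, b) for a, b in unrelated_pairs) / len(unrelated_pairs)
    return related - unrelated


def passes_gap_gate(gap, floor=GAP_FLOOR):
    """Hypothetical gate: pass only when the gap is at or above the floor."""
    return gap >= floor


# Gap values reported above: after the 500 run and after the 5k run.
gap_after_500 = 0.190303
gap_after_5k = 0.146566
ok_500 = passes_gap_gate(gap_after_500)
ok_5k = passes_gap_gate(gap_after_5k)
```

Under these assumptions the 500-step gap clears the floor and the 5k gap does not, which matches the gate outcomes recorded in the table.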
03 · Rollback proof

The good vector authority was restored deterministically.

The latest 5,000 rows were separable by embedding_created_at. Rows at or after 2026-06-15T11:49:00Z were pruned from the Parquet authority, and the DuckDB cache was rebuilt from that pruned file.

Back up: Saved the failed 5k vector Parquet and pre-refresh DuckDB cache.
Prune: Kept 151,507 rows and dropped exactly 5,000 later rows.
Refresh: Rebuilt production_gemini_vectors from Parquet.
Prove: AIN-510, reconciliation, registry, runtime readiness, and validate all pass.
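
The prune step can be illustrated with a sketch of the cutoff rule. It works on in-memory rows rather than the real Parquet authority, and the helper names (parse_utc, prune_at_cutoff) and the expected_dropped guard are hypothetical. The embedding_created_at column and the 2026-06-15T11:49:00Z cutoff are taken from this report.

```python
from datetime import datetime, timezone

# Cutoff from this report: rows at or after this instant belong to the failed 5k run.
CUTOFF = datetime(2026, 6, 15, 11, 49, 0, tzinfo=timezone.utc)


def parse_utc(ts):
    """Parse an ISO-8601 timestamp; a trailing 'Z' is rewritten for older Python versions."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def prune_at_cutoff(rows, cutoff=CUTOFF, expected_dropped=None):
    """Keep rows strictly before the cutoff; optionally verify the dropped count.

    Raising on a mismatch means the caller never writes a pruned file whose
    size disagrees with the run being rolled back.
    """
    kept = [r for r in rows if parse_utc(r["embedding_created_at"]) < cutoff]
    dropped = len(rows) - len(kept)
    if expected_dropped is not None and dropped != expected_dropped:
        raise ValueError(f"expected to drop {expected_dropped} rows, would drop {dropped}")
    return kept, dropped


# Toy rows for illustration only; the real authority holds 156,507 rows before pruning.
sample_rows = [
    {"id": "a", "embedding_created_at": "2026-06-15T11:48:59Z"},
    {"id": "b", "embedding_created_at": "2026-06-15T11:49:00Z"},
    {"id": "c", "embedding_created_at": "2026-06-15T12:30:00Z"},
]
kept_rows, dropped_count = prune_at_cutoff(sample_rows, expected_dropped=2)
```

In the real rollback the same guard would be called with expected_dropped=5000, so a cutoff that caught more or fewer rows would fail loudly instead of silently changing the authority.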
Check | Result
AIN-510 | promotion_ready
AIN-510 vector rows | 151,507
AIN-510 semantic_review vectors | 1,500
AIN-510 cosine gap | 0.190303
Stale vectors | 0
DuckDB vector rows | 151,507
Source authority registry | pass, 151,507 vectors
validate | pass
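
The counts in this report reconcile arithmetically. The sketch below replays them as a ledger and checks that the restored sources agree. The function names are hypothetical, and this is a reading aid rather than the engine's reconciliation command; every number is taken from the tables above.

```python
def apply_run(total, family, delta):
    """Add (or remove, if negative) vectors from both the total and the family count."""
    return total + delta, family + delta


def counts_agree(counts):
    """True when every named source reports the same row count."""
    return len(set(counts.values())) == 1


# Baseline before this slice: all Gemini vectors, semantic_review vectors.
total, review = 151_007, 1_000

total, review = apply_run(total, review, 500)      # progressive 500 live run
after_500 = (total, review)

total, review = apply_run(total, review, 5_000)    # attempted 5k expansion
after_5k = (total, review)

total, review = apply_run(total, review, -5_000)   # rollback prune
after_rollback = (total, review)

# Row counts reported by the three restored sources.
restored = {"ain_510": 151_507, "duckdb": 151_507, "registry": 151_507}
restored_ok = counts_agree(restored) and restored["ain_510"] == after_rollback[0]
```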
04 · Subagent finding

The next-family risk is not schema. It is thin text.

A read-only data-quality reviewer agreed that semantic_review was acceptable for the progressive 500. The reviewer also recommended withholding Gemini spend from jobs_research_role: its repaired corpus is structurally valid but semantically thin, with mostly title-only rows, 50 of 50 sampled rows failing semantic QA, and most rows below the token floor.
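
A thin-text check of this kind can be sketched as follows. The report does not state the token floor or how tokens are counted, so the sketch takes the floor as a parameter and uses whitespace splitting as a stand-in tokenizer. Both are assumptions, as are the function name and the sample rows. Only the text_for_embedding field name comes from this report.

```python
def thin_text_report(rows, token_floor):
    """Count rows whose text_for_embedding has fewer tokens than token_floor.

    Tokens are approximated by whitespace splitting (an assumption; the real
    QA may use the embedding provider's tokenizer).
    """
    total = len(rows)
    below = sum(1 for r in rows if len(r["text_for_embedding"].split()) < token_floor)
    share = below / total if total else 0.0
    return {"rows": total, "below_floor": below, "share_below_floor": share}


# Illustrative rows only: one title-only row and one row with richer context.
example_rows = [
    {"text_for_embedding": "Research Analyst"},
    {"text_for_embedding": "Research Analyst who designs surveys, cleans panel data, "
                           "and reports findings to the product team each quarter"},
]
report = thin_text_report(example_rows, token_floor=5)  # floor of 5 is a placeholder
```

Running a check like this before any provider call keeps title-only families out of paid embedding runs, which is the point of the hold on jobs_research_role.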

05 · Mission mapping

This advanced M2 without weakening the runtime boundary.

Mission slice | Status
M2.S1 source-family eligibility | Continued; repaired-input QA added for semantic_review
M2.S2 progressive Gemini runs | 500 passed; 5k attempted and rolled back after quality gate failure
AIN-510 retrieval proof | Restored to promotion_ready
Runtime boundary | Public runtime, real-user data, external writes, production telemetry, and runtime embedding authority remain blocked
Clean-before-embed rule | Strengthened by the jobs_research_role hold decision
06 · Next actions

Partition before scaling this family again.

Do not rerun semantic_review at 5k as one mixed family until either the quality-pair suite is expanded or the family is partitioned into cleaner subfamilies. Hold jobs_research_role until richer JD, responsibility, workflow, function, seniority, or source-intelligence context is included in text_for_embedding. For the next embedding family, choose one with passing semantic QA, enough text signal, no label-only rows, and a clean 500-step proof.
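
The four selection criteria for the next family can be expressed as a pre-spend checklist. This is a sketch under stated assumptions: the FamilyReadiness fields, the blockers function, and the max_thin_share threshold are all hypothetical, and "enough text signal" is approximated as a cap on the share of rows below the token floor. The example families are invented for illustration and do not describe real corpora.

```python
from dataclasses import dataclass


@dataclass
class FamilyReadiness:
    """Hypothetical summary of one source family's pre-embedding evidence."""
    name: str
    semantic_qa_passed: bool
    share_below_token_floor: float  # 0.0 to 1.0
    label_only_rows: int
    step_500_promotion_ready: bool


def blockers(family, max_thin_share=0.1):
    """Return the reasons a family should not receive embedding spend yet.

    max_thin_share is an assumed threshold; the report gives no number.
    """
    reasons = []
    if not family.semantic_qa_passed:
        reasons.append("semantic QA not passing")
    if family.share_below_token_floor > max_thin_share:
        reasons.append("too little text signal")
    if family.label_only_rows > 0:
        reasons.append("label-only rows present")
    if not family.step_500_promotion_ready:
        reasons.append("no clean 500-step proof")
    return reasons


# Invented examples: one ready family, one thin family.
ready = FamilyReadiness("example_ready_family", True, 0.02, 0, True)
thin = FamilyReadiness("example_thin_family", False, 0.80, 120, False)
ready_blockers = blockers(ready)
thin_blockers = blockers(thin)
```

An empty list means the family may proceed to its next progressive step; any entry is a reason to repair or partition first.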

07 · Resume commands

Start from the restored authority.

cd /srv/aina/aina-data-engine-room
git status --short --branch
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-chunk-vector-reconciliation
uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-registry-v2
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
Where to start

Start from the 151,507-vector snapshot, treat the failed 5k semantic-review expansion as evidence, and repair or partition before spending again.