Semantic Review Embedding Checkpoint And 5k Rollback
A clean 500-vector advance, a useful 5k failure, and a verified rollback to a promotion-ready local retrieval state.
The semantic_review family safely advanced by 500 clean Gemini vectors, but the attempted 5,000-vector expansion reduced the known-pair cosine separation below the required 0.15 floor. Scale-up stopped, the failed expansion was backed up, only the failed-gate rows were pruned, and the engine returned to a promotion-ready local retrieval state.
The family advanced by 500 vectors safely.
Before this slice, the current good vector authority had 151,007 Gemini vectors and 1,000 semantic_review vectors. The repaired-input semantic QA was added first, then the progressive 500 live run passed.
| Receipt | Result |
|---|---|
| production_embedding_semantic_qa_v1__source_family=semantic_review__input=repaired.json | pass, 50 sampled, 0 failed, 0 raw JD hits, 0 legacy review gate hits |
| 500 live Gemini run | 500 new vectors, 0 provider failures, total vectors 151,507 |
| AIN-510 after 500 | promotion_ready, known-pair cosine gap 0.190303 |
The 5k run was technically clean but semantically too broad.
The 5k dry-run was clean: 5,000 candidates, eligible_only=true, no quality-excluded candidates, and no live API call. The live run completed without provider failures, but retrieval quality dropped below the floor.
| Metric | Value |
|---|---|
| New vectors in 5k run | 5,000 |
| Total vectors before rollback | 156,507 |
| semantic_review vectors before rollback | 6,500 |
| Provider failures | 0 |
| Known-pair cosine gap | 0.146566 |
| Failed gates | gate_1_known_pairs_separate, gate_5_runtime_retrieval_candidate |
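The gate outcome in the table comes down to comparing the known-pair cosine gap with the 0.15 floor. The following is a minimal sketch of that comparison. The report does not define how the gap is computed, so `known_pair_gap` assumes it is the mean cosine of known-matching pairs minus the mean cosine of known-non-matching pairs; that definition and all function names are illustrative, not the engine's implementation.

```python
import math

FLOOR = 0.15  # required known-pair cosine separation, as stated in the report


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def known_pair_gap(positive_pairs, negative_pairs):
    """ASSUMED definition: mean cosine of matching pairs minus mean cosine
    of non-matching pairs. The report does not spell this out."""
    pos = sum(cosine(a, b) for a, b in positive_pairs) / len(positive_pairs)
    neg = sum(cosine(a, b) for a, b in negative_pairs) / len(negative_pairs)
    return pos - neg


def gate_passes(gap, floor=FLOOR):
    """A gap at or above the floor passes; anything below it fails."""
    return gap >= floor


print(gate_passes(0.190303))  # gap reported after the 500-vector run -> True
print(gate_passes(0.146566))  # gap reported after the 5k run -> False
```

The 5k gap misses the floor by roughly 0.0034, which is why a provider-clean run can still fail the retrieval gate.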
The good vector authority was restored deterministically.
The latest 5,000 rows were separable by embedding_created_at. Rows at or after 2026-06-15T11:49:00Z were pruned from the Parquet authority, and the DuckDB cache was rebuilt from that pruned file.
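The timestamp-based prune described above can be sketched with the standard library alone. This is an illustration of the cutoff rule only: the real rollback operated on the Parquet authority, and the row shape, the `id` field, and the function names here are assumptions. Only the `embedding_created_at` column and the cutoff value come from the report.

```python
from datetime import datetime, timezone

# Cutoff taken from the report; rows at or after it belong to the failed 5k run.
CUTOFF = datetime(2026, 6, 15, 11, 49, 0, tzinfo=timezone.utc)


def parse_utc(ts):
    """Parse an ISO-8601 timestamp with a trailing 'Z' into an aware datetime."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def split_at_cutoff(rows, cutoff=CUTOFF):
    """Return (kept, pruned). A row is pruned when embedding_created_at >= cutoff."""
    kept, pruned = [], []
    for row in rows:
        is_late = parse_utc(row["embedding_created_at"]) >= cutoff
        (pruned if is_late else kept).append(row)
    return kept, pruned


# Hypothetical rows, for illustration only.
rows = [
    {"id": "a", "embedding_created_at": "2026-06-15T10:02:11Z"},
    {"id": "b", "embedding_created_at": "2026-06-15T11:49:00Z"},
    {"id": "c", "embedding_created_at": "2026-06-15T12:30:45Z"},
]
kept, pruned = split_at_cutoff(rows)
print([r["id"] for r in kept], [r["id"] for r in pruned])  # ['a'] ['b', 'c']
```

Returning the pruned rows separately, rather than discarding them, mirrors the report's order of operations: the failed expansion was backed up before anything was removed.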
production_gemini_vectors was rebuilt from Parquet.
| Check | Result |
|---|---|
| AIN-510 | promotion_ready |
| AIN-510 vector rows | 151,507 |
| AIN-510 semantic_review vectors | 1,500 |
| AIN-510 cosine gap | 0.190303 |
| Stale vectors | 0 |
| DuckDB vector rows | 151,507 |
| Source authority registry | pass, 151,507 vectors |
| validate | pass |
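The counts in this report can be cross-checked with simple arithmetic. The sketch below replays the 500-vector run, the 5k run, and the rollback from the starting counts, then confirms that the three reported row counts agree. The helper name and dictionary layout are illustrative; every number is taken from the tables and text above.

```python
def add_family_vectors(counts, n):
    """Apply a semantic_review run of n vectors (negative n models the prune)."""
    return {
        "total": counts["total"] + n,
        "semantic_review": counts["semantic_review"] + n,
    }


baseline = {"total": 151_007, "semantic_review": 1_000}  # before this slice
after_500 = add_family_vectors(baseline, 500)
after_5k = add_family_vectors(after_500, 5_000)
restored = add_family_vectors(after_5k, -5_000)

# Row counts reported by the three independent checks after rollback.
reported_rows = {
    "AIN-510": 151_507,
    "DuckDB": 151_507,
    "source authority registry": 151_507,
}

assert after_500 == {"total": 151_507, "semantic_review": 1_500}
assert after_5k == {"total": 156_507, "semantic_review": 6_500}
assert restored == after_500
assert set(reported_rows.values()) == {restored["total"]}
print("counts reconcile:", restored)
```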
The next-family risk is not schema. It is thin text.
A read-only data-quality reviewer agreed that semantic_review was acceptable for the progressive 500. The reviewer also recommended holding jobs_research_role back from Gemini spend: its repaired corpus is structurally valid but semantically thin, with mostly title-only rows, 50 of 50 sampled rows failing semantic QA, and most rows below the token floor.
This advanced M2 without weakening the runtime boundary.
| Mission slice | Status |
|---|---|
| M2.S1 source-family eligibility | Continued; repaired-input QA added for semantic_review |
| M2.S2 progressive Gemini runs | 500 passed; 5k attempted and rolled back after quality gate failure |
| AIN-510 retrieval proof | Restored to promotion_ready |
| Runtime boundary | Public runtime, real-user data, external writes, production telemetry, and runtime embedding authority remain blocked |
| Clean-before-embed rule | Strengthened by the jobs_research_role hold decision |
Partition before scaling this family again.
Do not rerun semantic_review at 5k as one mixed family until either the quality-pair suite is expanded or the family is partitioned into cleaner subfamilies. Hold jobs_research_role until richer JD, responsibility, workflow, function, seniority, or source-intelligence context is included in text_for_embedding. For the next embedding family, choose one with passing semantic QA, enough text signal, no label-only rows, and a clean 500-step proof.
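The selection criteria for the next family can be expressed as a small pre-spend check. This is a sketch under stated assumptions: the `FamilyReport` fields, the `max_thin_fraction` parameter, and the reading of "enough text signal" as "few rows below the token floor" are all hypothetical, and the report gives no numeric threshold for any of them. The example values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class FamilyReport:
    """Hypothetical summary of one source family's pre-embedding checks."""
    name: str
    qa_sampled: int
    qa_failed: int
    label_only_rows: int
    rows_below_token_floor: int
    total_rows: int
    passed_500_step: bool


def hold_reasons(report, max_thin_fraction):
    """Return the reasons to hold a family; an empty list means it may proceed.

    max_thin_fraction is an ASSUMED knob: the largest tolerated share of rows
    below the token floor. The report does not specify a value.
    """
    reasons = []
    if report.qa_sampled == 0 or report.qa_failed > 0:
        reasons.append("semantic QA not passing")
    if report.label_only_rows > 0:
        reasons.append("label-only rows present")
    thin = report.total_rows == 0 or (
        report.rows_below_token_floor / report.total_rows > max_thin_fraction
    )
    if thin:
        reasons.append("too many rows below the token floor")
    if not report.passed_500_step:
        reasons.append("no clean 500-step proof")
    return reasons


# Invented example values, not measurements from the engine.
thin_family = FamilyReport("thin_family", 50, 50, 900, 800, 1000, False)
ready_family = FamilyReport("ready_family", 50, 0, 0, 20, 1000, True)

print(hold_reasons(thin_family, max_thin_fraction=0.1))
print(hold_reasons(ready_family, max_thin_fraction=0.1))
```

Returning a list of reasons instead of a single boolean keeps the hold decision auditable, in the same spirit as the named gates in AIN-510.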
Start from the restored authority.
cd /srv/aina/aina-data-engine-room
git status --short --branch
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-chunk-vector-reconciliation
uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-registry-v2
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
Start from the 151,507-vector snapshot, treat the failed 5k semantic-review expansion as evidence, and repair or partition before spending again.