AINA Data Engine Room / Production Embeddings / 2026-06-12

Serviceable Title Embedding 2,000 Checkpoint

A third controlled Gemini Embedding 2 slice moved serviceable-title coverage to 2,000 vectors while the next batch sampler stayed clean.

Ali Mehdi Mukadam - co-authored with Codex - local VDS checkpoint

The Single Idea

The serviceable-title Gemini Embedding 2 lane advanced by another controlled 500 rows after the 1,500-vector checkpoint. The serviceable-title vector count is now 2,000, total Gemini vectors are 4,094, the live quality gates still pass, and the next 500 candidates pass the compact semantic sampler.

2,000 serviceable-title vectors after dry run
4,094 total Gemini vectors in the local store
0 next-sampler noise flags and label leaks

01 / Since 1,500

Another Conservative Slice, No Batch

The repo started clean at commit 772daee, with the 1,500 checkpoint receipt already proving the next 500 candidates were clean. A third conservative live slice was run against serviceable_title only. No batch job was submitted.

uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run \
  --source-family serviceable_title \
  --include-repaired \
  --max-new 500 \
  --allow-live-gemini \
  --confirm-paid-api \
  --workers 8 \
  --write-every 50 \
  --timeout-seconds 60

02 / Live Result

The Serviceable-Title Index Advanced To 2,000

Metric                                         Value
Status                                         pass
New vectors embedded                           500
Failed rows                                    0
Existing serviceable-title vectors before run  1,500
Serviceable-title vectors after dry run        2,000
Total Gemini vectors                           4,094
Known-pair cosine gap                          0.225806
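For readers who want a concrete picture of a known-pair cosine gap, the sketch below shows one plausible way such a number could be computed. This report does not define the metric, so the definition used here (mean cosine similarity over known matching pairs minus mean cosine similarity over known non-matching pairs) is an assumption, as are the function names and the toy vectors.

```python
import math


def cosine(a, b):
    """Exact cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def known_pair_gap(vectors, matching_pairs, non_matching_pairs):
    """Mean cosine of matching pairs minus mean cosine of non-matching pairs.

    ASSUMPTION: this definition is illustrative; the real gate may differ.
    """
    pos = [cosine(vectors[a], vectors[b]) for a, b in matching_pairs]
    neg = [cosine(vectors[a], vectors[b]) for a, b in non_matching_pairs]
    return sum(pos) / len(pos) - sum(neg) / len(neg)


# Toy vectors (hypothetical): "a" and "b" point the same way, "c" is orthogonal.
toy_vectors = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
toy_gap = known_pair_gap(toy_vectors, [("a", "b")], [("a", "c")])
```

Under this assumed definition, a larger positive gap means known matches sit measurably closer together than known non-matches, which is the property a quality gate of this kind would want to hold as the index grows.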
03 / Dry Run And Sampler

The Following 500 Are Clean To Evaluate

The dry run selected another 500 serviceable-title candidates without invoking Gemini. The compact semantic sampler checked 50 rows and found no posting/location/company flags and no label leakage into weak or missing-authority text.

Check                   Value
Next candidate count    500
Rows sampled            50
Noise flags             0
Label leaks             0
Orphan vectors pruned   0
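The sampler step described above can be pictured with a minimal sketch: draw a fixed-size, seeded sample from the candidate rows and count rows that trip a noise pattern. Everything here is hypothetical, including the function name, the row shape (`id`, `text`), and the regular expressions; the real sampler's rules are not given in this report, and the label-leak check is left out because its definition is not stated.

```python
import random
import re

# HYPOTHETICAL noise patterns standing in for posting/location/company flags.
NOISE_PATTERNS = {
    "posting": re.compile(r"\b(hiring|apply now|urgently)\b", re.IGNORECASE),
    "location": re.compile(r"\b(remote|onsite|hybrid)\b", re.IGNORECASE),
    "company": re.compile(r"\b(inc|llc|ltd)\b", re.IGNORECASE),
}


def sample_and_flag(rows, sample_size=50, seed=0):
    """Sample rows reproducibly and report which ones match a noise pattern."""
    rng = random.Random(seed)  # fixed seed so the same rows are re-checked
    sample = rng.sample(rows, min(sample_size, len(rows)))
    flagged = []
    for row in sample:
        hits = [name for name, pat in NOISE_PATTERNS.items()
                if pat.search(row["text"])]
        if hits:
            flagged.append((row["id"], hits))
    return {
        "rows_sampled": len(sample),
        "noise_flags": len(flagged),
        "flagged": flagged,
    }


# Toy candidates (hypothetical): 500 clean titles, as in a passing run.
clean_rows = [{"id": i, "text": "Senior Data Engineer"} for i in range(500)]
clean_report = sample_and_flag(clean_rows)
```

Seeding the sampler keeps the check repeatable, so a rerun before the next slice inspects the same rows unless the candidate set itself has changed.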
04 / Boundary

This Still Does Not Promote Public Runtime Authority

This checkpoint is build-time semantic-layer progress. The P0 gate still reports runtime_embedding_authority_promoted: false; exact cosine remains the source of truth, and AIN-510 still owns runtime retrieval promotion.
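A guard of the following shape is one way to keep build-time tooling from treating this checkpoint as runtime promotion. It is a sketch only: the key `runtime_embedding_authority_promoted` comes from the P0 gate output quoted above, while the function name and the idea of passing the gate report as a plain dictionary are assumptions.

```python
def is_build_time_only(gate_report: dict) -> bool:
    """Return True only when the gate explicitly reports no runtime promotion.

    A missing key is treated as unsafe, so the check fails closed.
    """
    return gate_report.get("runtime_embedding_authority_promoted") is False


# Hypothetical gate report fragment matching the state described above.
current_gate = {"runtime_embedding_authority_promoted": False}
build_time_only = is_build_time_only(current_gate)
```

Failing closed on a missing key is a deliberate choice in this sketch: an incomplete gate report should block the build rather than be read as permission.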

05 / Validation

The Local Checkpoint Passed

Check             Result
Focused tests     38 passed
Ruff              All checks passed
P0 gate           pass
Full validation   pass

06 / Next Slice

Continue From The 2,000 Receipt

cd /srv/aina/aina-data-engine-room
git status --short --branch
jq '{
  status,
  valid,
  live_metrics: .live_run.metrics,
  dry_run_metrics: .post_run_dry_run.metrics,
  next_sampler: .next_500_semantic_sampler
}' artifacts/validation/ain_506_serviceable_title_progressive_live_2000_v1.json

The next sensible move is another 500-row serviceable-title live slice, run only after the compact sampler has been rerun. Do not batch this source family yet; the repair queue is still substantial, and the goal is clean progressive scale rather than fast embedding of junk rows.
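The "sampler first, then slice" rule could be enforced with a small check against the receipt before any live run. The sketch below is a guess at how that might look: the top-level keys `status`, `valid`, and `next_500_semantic_sampler` appear in the jq filter above, but the value `"pass"` for `status`, the inner field names `noise_flags` and `label_leaks`, and both function names are assumptions.

```python
import json
from pathlib import Path


def load_receipt(path):
    """Read a checkpoint receipt JSON file into a dictionary."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def ready_for_next_slice(receipt: dict) -> bool:
    """True when the receipt passed and the next-500 sampler is clean.

    ASSUMPTION: the sampler block exposes `noise_flags` and `label_leaks`.
    """
    sampler = receipt.get("next_500_semantic_sampler", {})
    return (
        receipt.get("status") == "pass"
        and receipt.get("valid") is True
        and sampler.get("noise_flags") == 0
        and sampler.get("label_leaks") == 0
    )


# Hypothetical receipt fragment mirroring the numbers reported in this checkpoint.
example_receipt = {
    "status": "pass",
    "valid": True,
    "next_500_semantic_sampler": {"noise_flags": 0, "label_leaks": 0},
}
example_ready = ready_for_next_slice(example_receipt)
```

In practice the dictionary would come from `load_receipt` pointed at the receipt file named above, and a `False` result would mean rerunning the sampler or working the repair queue before spending on another live slice.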

Where To Start

Start with ain_506_serviceable_title_progressive_live_2000_v1.json; it is the compact proof that this slice passed and the next slice is ready to evaluate.