AINA Data Engine Room / Production Embeddings / 2026-06-12

Serviceable Title Embedding 1,000 Checkpoint

A controlled Gemini Embedding 2 slice moved serviceable-title coverage to 1,000 vectors while keeping marketplace noise and label leakage out of the next batch.

Ali Mehdi Mukadam - co-authored with Codex - local VDS checkpoint

The Single Idea

The production embedding lane advanced one controlled step: the Gemini index now holds 1,000 serviceable-title vectors, the top 1,000 and top 500 proving bands are complete, and the next 500 serviceable candidates passed a compact semantic sampler. This is build-time semantic-layer progress, not a public runtime unlock.

1. Clean: Regional, year, and location noise now route through title-authority repair.
2. Embed: 500 new serviceable-title vectors were created through live Gemini Embedding 2.
3. Verify: The post-run dry run sees 1,000 serviceable-title vectors and no orphans.
4. Hold Boundary: Build-time pass does not promote public runtime authority.

01 / What Changed

Marketplace Cleanup Got Stricter Without Rejecting Real Titles

The serviceable-title marketplace cleaner was tightened so flattened region, posting-year, and trailing-location noise is repaired before embedding. The regression tests also protect the opposite case: seasonal and internship titles are allowed when they are real title concepts rather than postings.

Raw title shape | Clean embedding title
microsoft presales specialist (us southeast commercial) | microsoft presales specialist
microsoft presales specialist us southeast commercial | microsoft presales specialist
americas: marketing & communications internship summer 2024 | marketing & communications internship
tool & die maker beaver dam | tool & die maker

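The repair rules in the table above can be sketched roughly as follows. This is a minimal illustration, not the project's cleaner: the function name `clean_title`, the lookup tables, and the regular expressions are all hypothetical, and the real title-authority repair presumably draws on much larger vocabularies.

```python
import re

# Hypothetical lookup tables for illustration only; the real cleaner's
# vocabularies are not shown in this report.
REGION_PHRASES = ("us southeast commercial",)
REGION_PREFIXES = ("americas",)
KNOWN_LOCATIONS = ("beaver dam",)
SEASONS = ("spring", "summer", "fall", "autumn", "winter")

# Matches a trailing "<season> <year>" such as " summer 2024".
_SEASON_YEAR = re.compile(r"\s+(?:%s)\s+(?:19|20)\d{2}$" % "|".join(SEASONS))
# Matches a trailing parenthesised group such as " (us southeast commercial)".
_TRAILING_PAREN = re.compile(r"\s*\(([^()]*)\)$")


def clean_title(raw: str) -> str:
    """Strip posting noise from a title (illustrative rules only)."""
    title = " ".join(raw.lower().split())
    # Leading region prefix such as "americas: ".
    for prefix in REGION_PREFIXES:
        if title.startswith(prefix + ":"):
            title = title[len(prefix) + 1:].strip()
    # Trailing parenthesised region.
    match = _TRAILING_PAREN.search(title)
    if match and match.group(1).strip() in REGION_PHRASES:
        title = title[:match.start()]
    # Flattened trailing region or location without parentheses.
    for phrase in REGION_PHRASES + KNOWN_LOCATIONS:
        if title.endswith(" " + phrase):
            title = title[:-len(phrase)].rstrip()
    # A trailing "<season> <year>" is posting noise; a bare season word
    # or the word "internship" is left alone as a possible title concept.
    return _SEASON_YEAR.sub("", title)
```

Under these assumed tables, the four raw titles in the table map to their clean forms, while a title such as "summer camp counselor" (an invented example of a seasonal title concept) passes through unchanged because no year follows the season.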
02 / Live Result

The Serviceable-Title Index Advanced To 1,000

uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family serviceable_title --include-repaired --max-new 500 --allow-live-gemini --confirm-paid-api --workers 8 --write-every 50 --timeout-seconds 60

Metric | Value
Status | pass
New vectors embedded | 500
Failed rows | 0
Total Gemini vectors | 3,094
Serviceable-title vectors after dry run | 1,000
Top 1,000 vector count | 1,000
Top 500 vector count | 500
Known-pair cosine gap | 0.232981

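The report does not define the known-pair cosine gap. One plausible reading, assumed here purely for illustration, is the mean cosine similarity of known-related title pairs minus the mean cosine similarity of known-unrelated pairs. The function names and the pair format below are hypothetical.

```python
import math
from statistics import fmean


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def known_pair_gap(related_pairs, unrelated_pairs):
    """Assumed definition: mean related cosine minus mean unrelated cosine."""
    related = fmean([cosine(a, b) for a, b in related_pairs])
    unrelated = fmean([cosine(a, b) for a, b in unrelated_pairs])
    return related - unrelated
```

Under this reading, a larger gap means the embedding separates related titles from unrelated ones more clearly. If the project defines the gap differently, the sketch does not apply.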
03 / Dry Run And Sampler

The Next 500 Are Ready For Another Small Progressive Run

The post-run dry run selected the next 500 serviceable-title candidates without live API use. The compact semantic sampler checked 50 rows and found no posting/location/company noise flags and no label leakage into weak or missing-authority embedding text.

Check | Value
Next candidate count | 500
Rows sampled | 50
Noise flags | 0
Label leaks | 0
Orphan vectors pruned | 0

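A compact sampler of this kind might look like the sketch below. It is a guess at the shape, not the project's sampler: the row fields (`embedding_text`, `authority`, `label`), the noise patterns, and the leak rule are all assumptions made for illustration.

```python
import random
import re

# Hypothetical noise patterns; the real sampler's checks are not shown here.
NOISE_PATTERNS = {
    "posting_year": re.compile(r"\b(?:19|20)\d{2}\b"),
    "parenthetical": re.compile(r"\([^()]*\)"),
}


def sample_and_check(rows, sample_size=50, seed=0):
    """Sample candidate rows and count noise flags and label leaks."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = rng.sample(rows, min(sample_size, len(rows)))
    noise_flags = 0
    label_leaks = 0
    for row in sample:
        text = row["embedding_text"].lower()
        if any(p.search(text) for p in NOISE_PATTERNS.values()):
            noise_flags += 1
        label = (row.get("label") or "").lower()
        # Assumed leak rule: a weak or missing-authority row whose label
        # string appears inside its embedding text.
        if row.get("authority") in ("weak", "missing") and label and label in text:
            label_leaks += 1
    return {
        "rows_sampled": len(sample),
        "noise_flags": noise_flags,
        "label_leaks": label_leaks,
    }
```

A result with zero noise flags and zero label leaks would correspond to the clean outcome reported in the table above.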
04 / Runtime Boundary

Build-Time Pass Is Not Public Runtime Authority

This checkpoint makes the semantic layer stronger. It does not promote public runtime embedding authority. The P0 gate still reports runtime_embedding_authority_promoted: false; exact cosine remains the source of truth, and runtime promotion still belongs to the AIN-510 retrieval gate.
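The boundary can be expressed as a small guard. The key `runtime_embedding_authority_promoted` comes from the P0 gate described above. The flat dictionary shape and the function names are assumptions for illustration.

```python
def runtime_promotion_blocked(gate_report: dict) -> bool:
    """True only when the gate explicitly reports promotion as false.

    A missing key is treated as unsafe rather than as "not promoted".
    """
    return gate_report.get("runtime_embedding_authority_promoted") is False


def require_build_time_only(gate_report: dict) -> None:
    """Raise if the gate does not explicitly hold the runtime boundary."""
    if not runtime_promotion_blocked(gate_report):
        raise RuntimeError(
            "runtime_embedding_authority_promoted is not explicitly false; "
            "runtime promotion belongs to the retrieval gate"
        )
```

Treating an absent key as a failure is a deliberate choice in this sketch: a build-time checkpoint should not pass merely because the gate report is incomplete.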

05 / Validation

The Checkpoint Is Verified Locally

Check | Result
Focused tests | 38 passed
Ruff | All checks passed
P0 gate | pass
Full validation | pass
Embedding human-review field grep | No matches

06 / Next Slice

Continue From The Receipt, Not A Fresh Scan

cd /srv/aina/aina-data-engine-room
git status --short --branch
jq '{status, valid, live_metrics: .live_run.metrics, dry_run_metrics: .post_run_dry_run.metrics, next_sampler: .next_500_semantic_sampler}' artifacts/validation/ain_506_serviceable_title_progressive_live_1000_v1.json

The next sensible move is another 500-vector serviceable-title live slice, run only after the compact sampler has been rerun. Batch runs should wait until a source family has repeated progressive proof and a clean repair queue.
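For scripted checks, the same projection as the jq command can be sketched in Python. The top-level keys mirror the ones named in the jq filter. The function names are hypothetical, and nothing is assumed about the fields inside each metrics object.

```python
import json
from pathlib import Path

# Receipt path as given in the jq command, relative to the repository root.
RECEIPT = Path(
    "artifacts/validation/ain_506_serviceable_title_progressive_live_1000_v1.json"
)


def summarize_receipt(receipt: dict) -> dict:
    """Project the receipt onto the fields the jq filter selects.

    Missing keys become None, matching jq's null for absent paths.
    """
    return {
        "status": receipt.get("status"),
        "valid": receipt.get("valid"),
        "live_metrics": (receipt.get("live_run") or {}).get("metrics"),
        "dry_run_metrics": (receipt.get("post_run_dry_run") or {}).get("metrics"),
        "next_sampler": receipt.get("next_500_semantic_sampler"),
    }


def load_summary(path: Path = RECEIPT) -> dict:
    """Read the receipt from disk and return its summary."""
    return summarize_receipt(json.loads(path.read_text(encoding="utf-8")))
```

Calling `load_summary()` from the repository root would return the summary as a dictionary, which a follow-up script could inspect before deciding whether to start the next slice.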

Where To Start

Start with ain_506_serviceable_title_progressive_live_1000_v1.json; it is the compact proof that this slice passed and the next slice is ready to evaluate.