Serviceable Title Embedding 1,000 Checkpoint
A controlled Gemini Embedding 2 slice moved serviceable-title coverage to 1,000 vectors while keeping marketplace noise and label leakage out of the next batch.
The production embedding lane advanced one controlled step: the serviceable-title Gemini index now holds 1,000 vectors, the top-1,000 and top-500 proving bands are complete, and the next 500 serviceable candidates passed a compact semantic sampler. This is build-time semantic-layer progress, not a public runtime unlock.
Marketplace Cleanup Got Stricter Without Rejecting Real Titles
The serviceable-title marketplace cleaner was tightened so flattened region, posting-year, and trailing-location noise is repaired before embedding. The regression tests also protect the opposite case: seasonal and internship titles are allowed when they are real title concepts rather than postings.
| Raw title shape | Clean embedding title |
|---|---|
| microsoft presales specialist (us southeast commercial) | microsoft presales specialist |
| microsoft presales specialist us southeast commercial | microsoft presales specialist |
| americas: marketing & communications internship summer 2024 | marketing & communications internship |
| tool & die maker beaver dam | tool & die maker |
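The repairs in the table can be sketched as a small rule-based cleaner. This is a minimal illustration, not the project's cleaner: the function name `clean_title`, the noise lists, and the rule order are all assumptions chosen only to reproduce the four rows above.

```python
import re

# Hypothetical noise lists for illustration; the real cleaner's rules are not shown in this report.
REGION_PREFIXES = ("americas", "emea", "apac")
TRAILING_NOISE = ("us southeast commercial", "beaver dam")
SEASONS = ("spring", "summer", "fall", "autumn", "winter")

_PREFIX_RE = re.compile(r"^(?:%s)\s*:\s*" % "|".join(REGION_PREFIXES))
_PAREN_RE = re.compile(r"\s*\([^)]*\)\s*$")
_SEASON_YEAR_RE = re.compile(r"\s+(?:%s)\s+(?:19|20)\d{2}$" % "|".join(SEASONS))
_YEAR_RE = re.compile(r"\s+(?:19|20)\d{2}$")


def clean_title(raw: str) -> str:
    """Strip posting noise from a title while keeping real title concepts."""
    title = re.sub(r"\s+", " ", raw.strip().lower())
    # Drop a leading "region:" prefix such as "americas:".
    title = _PREFIX_RE.sub("", title)
    # Drop a trailing parenthetical (assumed here to be region or posting detail).
    title = _PAREN_RE.sub("", title)
    # Drop a trailing "season year" posting cycle; a bare season word is kept.
    title = _SEASON_YEAR_RE.sub("", title)
    # Drop a trailing bare posting year.
    title = _YEAR_RE.sub("", title)
    # Drop known trailing region or location phrases.
    for phrase in TRAILING_NOISE:
        if title.endswith(" " + phrase):
            title = title[: -len(phrase) - 1]
    return title.strip()
```

Because only a season followed by a year is removed, a title such as `summer internship` passes through unchanged, which is the opposite case the regression tests are described as protecting.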
The Serviceable-Title Index Advanced To 1,000
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family serviceable_title --include-repaired --max-new 500 --allow-live-gemini --confirm-paid-api --workers 8 --write-every 50 --timeout-seconds 60
| Metric | Value |
|---|---|
| Status | pass |
| New vectors embedded | 500 |
| Failed rows | 0 |
| Total Gemini vectors | 3,094 |
| Serviceable-title vectors after dry run | 1,000 |
| Top 1,000 vector count | 1,000 |
| Top 500 vector count | 500 |
| Known-pair cosine gap | 0.232981 |
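This report does not define the known-pair cosine gap. One plausible reading, assumed here, is the mean cosine similarity of known-related title pairs minus the mean for known-unrelated pairs. The sketch below computes that quantity on toy vectors; the function names and the pair lists are hypothetical.

```python
import math
from statistics import mean


def cosine(a, b):
    """Exact cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def known_pair_gap(vectors, related_pairs, unrelated_pairs):
    """Assumed definition: mean related-pair cosine minus mean unrelated-pair cosine."""
    related = mean(cosine(vectors[a], vectors[b]) for a, b in related_pairs)
    unrelated = mean(cosine(vectors[a], vectors[b]) for a, b in unrelated_pairs)
    return related - unrelated


# Toy example: "a" and "b" point the same way, "c" is orthogonal to both.
toy_vectors = {"a": [1.0, 0.0], "b": [2.0, 0.0], "c": [0.0, 3.0]}
toy_gap = known_pair_gap(toy_vectors, [("a", "b")], [("a", "c")])
```

Under this reading, a larger positive gap means the embeddings separate related titles from unrelated ones more cleanly.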
The Next 500 Are Ready For Another Small Progressive Run
The post-run dry run selected the next 500 serviceable-title candidates without live API use. The compact semantic sampler checked 50 rows and found no posting/location/company noise flags and no label leakage into weak or missing-authority embedding text.
| Check | Value |
|---|---|
| Next candidate count | 500 |
| Rows sampled | 50 |
| Noise flags | 0 |
| Label leaks | 0 |
| Orphan vectors pruned | 0 |
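A compact sampler of this kind might look like the sketch below. The row fields (`embedding_text`, `authority`, `label`), the noise pattern, and the fixed seed are assumptions for illustration; the report does not show the real sampler's schema or rules.

```python
import random
import re

# Hypothetical noise pattern: posting years and posting phrases in embedding text.
NOISE_RE = re.compile(r"\b(?:19|20)\d{2}\b|\bhiring\b|\bapply now\b")


def check_row(row):
    """Flag posting noise and label leakage for one candidate row."""
    text = row["embedding_text"].lower()
    noise = bool(NOISE_RE.search(text))
    label = row.get("label") or ""
    # A leak is a label appearing in the text of a weak or missing-authority row.
    leak = row["authority"] in ("weak", "missing") and bool(label) and label.lower() in text
    return {"noise": noise, "label_leak": leak}


def run_sampler(rows, k=50, seed=0):
    """Sample up to k rows deterministically and count the flags."""
    sampled = random.Random(seed).sample(rows, min(k, len(rows)))
    results = [check_row(r) for r in sampled]
    return {
        "rows_sampled": len(sampled),
        "noise_flags": sum(r["noise"] for r in results),
        "label_leaks": sum(r["label_leak"] for r in results),
    }
```

A fixed seed keeps the sample reproducible, so a rerun before the next live slice inspects the same rows unless the candidate list changes.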
Build-Time Pass Is Not Public Runtime Authority
This checkpoint makes the semantic layer stronger. It does not promote public runtime embedding authority. The P0 gate still reports runtime_embedding_authority_promoted: false; exact cosine remains the source of truth, and runtime promotion still belongs to the AIN-510 retrieval gate.
The Checkpoint Is Verified Locally
| Check | Result |
|---|---|
| Focused tests | 38 passed |
| Ruff | All checks passed |
| P0 gate | pass |
| Full validation | pass |
| Embedding human-review field grep | No matches |
Continue From The Receipt, Not A Fresh Scan
cd /srv/aina/aina-data-engine-room
git status --short --branch
jq '{status, valid, live_metrics: .live_run.metrics, dry_run_metrics: .post_run_dry_run.metrics, next_sampler: .next_500_semantic_sampler}' artifacts/validation/ain_506_serviceable_title_progressive_live_1000_v1.json
The next sensible move is another 500-title serviceable-title live slice, run only after the compact sampler has been rerun. Batch runs should wait until a source family has repeated progressive proof and a clean repair queue.
Start with ain_506_serviceable_title_progressive_live_1000_v1.json; it is the compact proof that this slice passed and the next slice is ready to evaluate.
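For scripted checks, the jq projection above can be mirrored in Python. This is a sketch under assumptions: the top-level key names come from the jq command, but the helper names and the readiness rule (status `pass`, `valid` true, sampler section present) are hypothetical and say nothing about fields inside the metrics objects.

```python
import json
from pathlib import Path

RECEIPT = Path("artifacts/validation/ain_506_serviceable_title_progressive_live_1000_v1.json")


def summarize_receipt(receipt: dict) -> dict:
    """Mirror the jq projection: keep only the fields the checkpoint relies on."""
    return {
        "status": receipt.get("status"),
        "valid": receipt.get("valid"),
        "live_metrics": (receipt.get("live_run") or {}).get("metrics"),
        "dry_run_metrics": (receipt.get("post_run_dry_run") or {}).get("metrics"),
        "next_sampler": receipt.get("next_500_semantic_sampler"),
    }


def ready_for_next_slice(summary: dict) -> bool:
    """Assumed rule: passing status, valid receipt, and a sampler section present."""
    return (
        summary["status"] == "pass"
        and summary["valid"] is True
        and summary["next_sampler"] is not None
    )


if __name__ == "__main__" and RECEIPT.exists():
    # Run from the repository root so the relative receipt path resolves.
    summary = summarize_receipt(json.loads(RECEIPT.read_text(encoding="utf-8")))
    print(json.dumps(summary, indent=2))
    print("ready for next slice:", ready_for_next_slice(summary))
```

Reading the receipt rather than rescanning the index keeps the continuation step cheap and tied to the evidence this checkpoint recorded.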