AINA Data Engine Room · Production Embeddings · 2026-06-12

Production Embedding Serviceable Progress Handoff

A cleaner source-family gate plus the first live serviceable-title Gemini slice.

Ali Mehdi Mukadam · co-authored with Codex · 4 minute read · Branch ali/ain-506-p0-gate-2026-06-12

The Single Idea

The embedding lane now has a cleaner source-family gate, and the first live serviceable-title Gemini slice is embedded. Rows with weak or missing title authority no longer embed SOC, function, or seniority labels as truth, and market-posting noise is held for repair before any Gemini call.

01 · What Changed

Cleaner Text Before More Vectors

Serviceable-title rows now separate clean role text from doubtful classifications. Rows with derived_clean_title or exact_trusted authority can embed SOC/function/seniority fields. Rows with exact_weak or missing title authority keep those classifications in metadata only and set classification_labels_embedded: false.
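The authority rule above can be sketched in a few lines. This is a minimal illustration, not the code in production_embedding_eligibility.py: the ServiceableRow shape, the build_chunk name, and the " | " text layout are assumptions, while the four authority values and the classification_labels_embedded flag come from this note.

```python
from dataclasses import dataclass
from typing import Optional

# Authorities whose classification labels may be embedded, per the rule above.
LABEL_EMBEDDING_AUTHORITIES = {"derived_clean_title", "exact_trusted"}


@dataclass
class ServiceableRow:  # hypothetical row shape, for illustration only
    clean_title: str
    title_authority: Optional[str]  # None models a missing authority
    soc: Optional[str] = None
    function: Optional[str] = None
    seniority: Optional[str] = None


def build_chunk(row: ServiceableRow) -> dict:
    """Return the text to embed plus metadata for one serviceable-title row."""
    labels = {"soc": row.soc, "function": row.function, "seniority": row.seniority}
    embed_labels = row.title_authority in LABEL_EMBEDDING_AUTHORITIES

    parts = [row.clean_title]
    if embed_labels:
        # Trusted authority: labels become part of the embedded text.
        parts += [f"{name}: {value}" for name, value in labels.items() if value]

    return {
        "text": " | ".join(parts),
        # Labels always stay in metadata, whether or not they were embedded.
        "metadata": {
            **labels,
            "title_authority": row.title_authority or "missing",
            "classification_labels_embedded": embed_labels,
        },
    }
```

With this shape, an exact_weak row embeds only its clean title, and its SOC code stays available in metadata for later repair or audit.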

Gate: Classify source family, title authority, text quality, and vector status.

Repair: Clean W2, city/state, shift, ZIP, brand, and posting fragments.

Sample: Check 50 rows for posting noise and weak-label leakage.

Embed: Run Gemini only after the source-family sample is clean.
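As an illustration of the Repair step, the sketch below strips a few of the fragment types named above from a raw title. The patterns, the repair_title name, and the " - " segment convention are all assumptions for this example; the actual repair rules are not shown in this note. Brand removal is omitted because it would need a lookup list.

```python
import re

# Hypothetical patterns for some fragment types named in the Repair step.
_FRAGMENT_PATTERNS = [
    re.compile(r"\bW-?2\b", re.IGNORECASE),  # employment-type tag
    re.compile(r"\b\d{5}(?:-\d{4})?\b"),     # US ZIP or ZIP+4 (also hits any 5-digit number)
    re.compile(r"\b(?:1st|2nd|3rd|day|night|evening)\s+shift\b", re.IGNORECASE),
    re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)*,\s*[A-Z]{2}\b"),  # "City, ST"
]


def repair_title(raw: str) -> str:
    """Drop posting fragments segment by segment and rejoin what survives."""
    # Split on separators surrounded by whitespace so hyphenated words stay whole.
    segments = re.split(r"\s+[-|/]\s+", raw)
    kept = []
    for segment in segments:
        for pattern in _FRAGMENT_PATTERNS:
            segment = pattern.sub(" ", segment)
        # Collapse leftover whitespace and trim dangling punctuation.
        segment = re.sub(r"\s+", " ", segment).strip(" ,-|/")
        if segment:
            kept.append(segment)
    return " - ".join(kept)
```

Under these patterns, `Forklift Operator - 2nd Shift - Dallas, TX 75201 W2` reduces to `Forklift Operator`, while `Registered Nurse - ICU` passes through unchanged. A real repair pass would need broader patterns and the 50-row sample check as a backstop.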

The repaired-chunk writer now also deduplicates by (chunk_id, text_hash), preserving row lineage when two different source rows repair to the same semantic title.
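One reading of the dedupe key is sketched below: because chunk_id is part of the key, two source rows that repair to identical text both survive, and only exact repeats of the same chunk are dropped. The chunk dictionary shape and the SHA-256 hash are assumptions; the note does not say how text_hash is computed.

```python
import hashlib


def text_hash(text: str) -> str:
    # Assumption: a SHA-256 hex digest stands in for the real text_hash.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def dedupe_chunks(chunks: list) -> list:
    """Keep the first chunk for each (chunk_id, text_hash) pair, in input order."""
    seen = set()
    unique = []
    for chunk in chunks:
        key = (chunk["chunk_id"], text_hash(chunk["text"]))
        if key in seen:
            continue  # exact repeat of the same chunk
        seen.add(key)
        unique.append(chunk)
    return unique
```

Keying on text_hash alone would have collapsed the two source rows into one chunk and lost the link back to one of them.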

02 · Key Metrics

Where The Corpus Stands

Metric	Value
Total current Gemini vectors	2,594
Serviceable vectors now current	500
Serviceable next dry-run candidates	500
Serviceable live failures	0
Top 1,000 vector coverage	1,000
Top 500 vector coverage	500
Known-pair cosine gap	0.22896
Serviceable repaired chunks	10,279

03 · Semantic QA

The Rows Looked Right Before Spend

The final 50-row serviceable candidate sample had zero posting-noise flags and zero weak-label leaks. Good role text can be embedded, but weak labels do not become semantic authority.

Sample Check	Result
Rows checked	50
Posting-noise flags	0
Weak-label leaks	0
`derived_clean_title`, labels embedded	19
`exact_trusted`, labels embedded	10
`exact_weak`, labels metadata-only	14
`missing`, labels metadata-only	7
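The sample check tabulated above could be computed along these lines. This is a sketch under assumptions: the row dictionary keys, the check_sample name, and the is_noisy callback are hypothetical, and the noise detector itself is left to the caller.

```python
import random
from collections import Counter

TRUSTED = {"derived_clean_title", "exact_trusted"}


def check_sample(rows, is_noisy, sample_size=50, seed=0):
    """Tally posting-noise flags and weak-label leaks over a sample of rows."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = list(rows) if len(rows) <= sample_size else rng.sample(rows, sample_size)
    report = Counter(rows_checked=0, posting_noise_flags=0, weak_label_leaks=0)
    for row in sample:
        report["rows_checked"] += 1
        if is_noisy(row["text"]):
            report["posting_noise_flags"] += 1
        authority = row.get("title_authority") or "missing"
        embedded = row["classification_labels_embedded"]
        # A leak is a row that embeds labels without a trusted authority.
        if embedded and authority not in TRUSTED:
            report["weak_label_leaks"] += 1
        mode = "embedded" if embedded else "metadata_only"
        report[f"{authority}:{mode}"] += 1
    return dict(report)


def sample_is_clean(report) -> bool:
    return report["posting_noise_flags"] == 0 and report["weak_label_leaks"] == 0
```

Gating the live run on sample_is_clean keeps the paid Gemini call behind both checks at once.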

04 · Commands

Proof Commands

uv run pytest tests/test_production_embeddings.py tests/test_embedding_contracts.py -q
uv run ruff check src/aina_data_engine/production_embedding_eligibility.py src/aina_data_engine/production_embeddings.py tests/test_production_embeddings.py
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family serviceable_title --include-repaired --max-new 500 --allow-live-gemini --confirm-paid-api --workers 8 --write-every 50 --timeout-seconds 60
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family top_worked_title --include-repaired --max-new 19 --allow-live-gemini --confirm-paid-api --workers 4 --write-every 19 --timeout-seconds 60
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate

Validation	Result
Focused pytest	38 passed
Ruff	All checks passed
AIN-506 P0 gate	pass
Full validation	pass

05 · Still Pending

The Full Goal Is Still Active

The next step is not to batch everything blindly. Continue progressively by source family: run the next serviceable slice only after the dry run remains clean, promote to 5,000 only if failures stay at zero, and keep repair-first rows out of live embedding until their repaired chunks pass spot checks.
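The progression rule above can be written as a small decision function. This is only a sketch: the function names, the row keys repair_first and spot_check_passed, and the choice to stay at the current slice size when failures appear are assumptions, since the note states only when promotion is allowed.

```python
def plan_next_slice(current_size: int, dry_run_clean: bool, live_failures: int,
                    promoted_size: int = 5000) -> dict:
    """Decide whether to run the next live slice and how large it may be."""
    if not dry_run_clean:
        # No live spend until the dry run is clean again.
        return {"run_live": False, "max_new": 0}
    if live_failures == 0:
        # Promotion is allowed only while failures stay at zero.
        return {"run_live": True, "max_new": promoted_size}
    # Assumption: with failures present, stay at the current size.
    return {"run_live": True, "max_new": current_size}


def live_eligible(rows: list) -> list:
    """Hold repair-first rows back until their repaired chunks pass spot checks."""
    return [
        row for row in rows
        if not row.get("repair_first") or row.get("spot_check_passed")
    ]
```

The max_new value maps onto the --max-new flag used in the commands in this note; everything else here is illustrative.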

06 · Exact Resume

Restart Without Rediscovery

cd /srv/aina/aina-data-engine-room
git status --short --branch
jq '{status, valid, metrics}' artifacts/validation/ain_506_serviceable_title_progressive_live_500_v1.json
jq '{status, valid, metrics}' artifacts/validation/ain_506_live_gemini_embedding_run_v1_dry_run.json
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family serviceable_title --include-repaired --max-new 500 --dry-run

Where to start

If the serviceable dry run still reports 500 existing vectors, 500 next candidates, and zero pruning, the next autonomous move is to run the next progressive serviceable live slice.
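The resume condition can also be checked mechanically against the dry-run artifact. In this sketch the top-level keys valid and metrics follow the jq filters above, but the metric key names (existing_vectors, next_candidates, pruned) and the assumption that valid is a boolean are hypothetical; inspect the real artifact before relying on them.

```python
import json
from pathlib import Path

# Hypothetical metric keys; the expected values come from the paragraph above.
EXPECTED_METRICS = {"existing_vectors": 500, "next_candidates": 500, "pruned": 0}


def load_report(path) -> dict:
    """Read a validation artifact such as the dry-run JSON."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def ready_for_next_live_slice(report: dict, expected: dict = EXPECTED_METRICS) -> bool:
    """True only if the report is valid and every expected metric matches."""
    if not report.get("valid"):
        return False
    metrics = report.get("metrics", {})
    return all(metrics.get(key) == value for key, value in expected.items())
```

Any mismatch is a signal to stop and re-inspect the corpus state by hand rather than proceed to a paid run.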


topics:
  - personalization-engine
  - production-embeddings
  - source-family-gating
subtopics:
  - gemini-embedding-2
  - serviceable-title-coverage
  - semantic-quality-gates
  - title-authority-repair
