AINA Data Engine Room · Production Embeddings · 2026-06-12

Production Embedding Serviceable Progress Handoff

A cleaner source-family gate plus the first live serviceable-title Gemini slice.

Ali Mehdi Mukadam · co-authored with Codex · 4 minute read · Branch ali/ain-506-p0-gate-2026-06-12

The Single Idea

The embedding lane now has a cleaner source-family gate, and the first live serviceable-title Gemini slice is embedded. Rows with weak or missing title authority no longer embed SOC/function/seniority labels as truth, and market-posting noise is held for repair before any Gemini call.

01 · What Changed

Cleaner Text Before More Vectors

Serviceable-title rows now separate clean role text from doubtful classifications. Rows with derived_clean_title or exact_trusted authority can embed SOC/function/seniority fields. Rows with exact_weak or missing title authority keep those classifications in metadata only and set classification_labels_embedded: false.
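
The authority rule above can be sketched as follows. This is a minimal illustration, not the code in production_embedding_eligibility.py: the function name, the row dict shape, and the way labels are appended to the text are assumptions. Only the authority values and the classification_labels_embedded flag come from this handoff.

```python
# Hypothetical sketch: decide whether classification labels enter the embedded text.
TRUSTED_AUTHORITIES = {"derived_clean_title", "exact_trusted"}
LABEL_FIELDS = ("soc", "function", "seniority")  # assumed field names


def build_embedding_payload(row: dict) -> dict:
    authority = row.get("title_authority")  # None models a missing authority
    labels = {k: row[k] for k in LABEL_FIELDS if row.get(k)}
    # Labels are embedded only for trusted authorities, and only if any exist.
    embed_labels = authority in TRUSTED_AUTHORITIES and bool(labels)

    text = row["clean_title"]
    if embed_labels:
        text += " | " + " | ".join(f"{k}: {v}" for k, v in labels.items())

    return {
        "text": text,
        # Labels always stay available as metadata, whatever the authority.
        "metadata": {"title_authority": authority or "missing", **labels},
        "classification_labels_embedded": embed_labels,
    }
```

With this shape, an exact_weak row still carries its SOC code in metadata, but the text sent for embedding is the clean title alone.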

Gate: Classify source family, title authority, text quality, and vector status.
Repair: Clean W2, city/state, shift, ZIP, brand, and posting fragments.
Sample: Check 50 rows for posting noise and weak-label leakage.
Embed: Run Gemini only after the source-family sample is clean.
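
As a rough illustration of the repair step, the sketch below strips a few posting fragments from a raw title. The patterns and function names are hypothetical and cover only W2, ZIP, and shift fragments. The real repair rules are not shown in this handoff, and city/state and brand cleanup would likely need lookup tables rather than regular expressions.

```python
import re

# Hypothetical noise patterns; these are illustrative, not the production rules.
NOISE_PATTERNS = [
    re.compile(r"\bW-?2\b", re.IGNORECASE),  # employment-type tag
    re.compile(r"\b\d{5}(?:-\d{4})?\b"),     # US ZIP or ZIP+4
    re.compile(r"\b(?:1st|2nd|3rd|day|night|evening)\s+shift\b", re.IGNORECASE),
]


def has_posting_noise(text: str) -> bool:
    """Flag text that still contains any known posting fragment."""
    return any(p.search(text) for p in NOISE_PATTERNS)


def repair_title(raw: str) -> str:
    """Remove posting fragments, then drop tokens left with no letters or digits."""
    text = raw
    for pattern in NOISE_PATTERNS:
        text = pattern.sub(" ", text)
    tokens = [t for t in text.split() if any(ch.isalnum() for ch in t)]
    return " ".join(tokens)
```

For example, repair_title("Forklift Operator - 2nd Shift (W2) 75201") returns "Forklift Operator", and has_posting_noise on that result is False, so the row could move on to the sample stage.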

The repaired-chunk writer now also deduplicates by (chunk_id, text_hash), which preserves row lineage when two different source rows repair to the same semantic title.
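
A minimal sketch of that dedupe key, assuming each chunk is a dict with chunk_id and text fields and that text_hash is a SHA-256 of the repaired text (the actual hash function and chunk schema are not stated here):

```python
import hashlib


def text_hash(text: str) -> str:
    # Assumed hashing scheme; the writer's real hash may differ.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def dedupe_chunks(chunks: list[dict]) -> list[dict]:
    """Keep the first chunk seen for each (chunk_id, text_hash) pair."""
    seen: set[tuple[str, str]] = set()
    kept: list[dict] = []
    for chunk in chunks:
        key = (chunk["chunk_id"], text_hash(chunk["text"]))
        if key in seen:
            continue  # exact repeat of the same chunk and the same text
        seen.add(key)
        kept.append(chunk)
    return kept
```

Keying on the pair rather than on text_hash alone is what keeps lineage: two source rows that repair to "Forklift Operator" have the same hash but different chunk_id values, so both survive.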

02 · Key Metrics

Where The Corpus Stands

Metric | Value
Total current Gemini vectors | 2,594
Serviceable vectors now current | 500
Serviceable next dry-run candidates | 500
Serviceable live failures | 0
Top 1,000 vector coverage | 1,000
Top 500 vector coverage | 500
Known-pair cosine gap | 0.22896
Serviceable repaired chunks | 10,279

03 · Semantic QA

The Rows Looked Right Before Spend

The final 50-row serviceable candidate sample had zero posting-noise flags and zero weak-label leaks. Good role text can be embedded, but weak labels do not become semantic authority.
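
The sample check can be pictured as a small tally over the candidate rows. The sketch below is an assumption about its shape: the field names posting_noise, title_authority, and classification_labels_embedded on each row, and the summarize_sample function itself, are hypothetical.

```python
from collections import Counter

TRUSTED_AUTHORITIES = {"derived_clean_title", "exact_trusted"}


def summarize_sample(rows: list[dict]) -> dict:
    """Count posting-noise flags and weak-label leaks in a candidate sample."""
    noise_flags = sum(1 for r in rows if r["posting_noise"])
    # A leak is a row that embeds labels without a trusted title authority.
    leaks = sum(
        1
        for r in rows
        if r["classification_labels_embedded"]
        and r["title_authority"] not in TRUSTED_AUTHORITIES
    )
    return {
        "rows_checked": len(rows),
        "posting_noise_flags": noise_flags,
        "weak_label_leaks": leaks,
        "by_authority": dict(Counter(r["title_authority"] for r in rows)),
        "clean": noise_flags == 0 and leaks == 0,
    }
```

Under this sketch, a sample counts as clean only when both counters are zero, which is the condition the embed stage waits for.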

Sample Check | Result
Rows checked | 50
Posting-noise flags | 0
Weak-label leaks | 0
derived_clean_title, labels embedded | 19
exact_trusted, labels embedded | 10
exact_weak, labels metadata-only | 14
missing, labels metadata-only | 7

04 · Commands

Proof Commands

uv run pytest tests/test_production_embeddings.py tests/test_embedding_contracts.py -q
uv run ruff check src/aina_data_engine/production_embedding_eligibility.py src/aina_data_engine/production_embeddings.py tests/test_production_embeddings.py
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family serviceable_title --include-repaired --max-new 500 --allow-live-gemini --confirm-paid-api --workers 8 --write-every 50 --timeout-seconds 60
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family top_worked_title --include-repaired --max-new 19 --allow-live-gemini --confirm-paid-api --workers 4 --write-every 19 --timeout-seconds 60
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate

Validation | Result
Focused pytest | 38 passed
Ruff | All checks passed
AIN-506 P0 gate | pass
Full validation | pass

05 · Still Pending

The Full Goal Is Still Active

The next step is not to batch everything blindly. Continue progressively by source family: run the next serviceable slice only after the dry run remains clean, promote to 5,000 only if failures stay at zero, and keep repair-first rows out of live embedding until their repaired chunks pass spot checks.
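
One way to encode that progression is a small planning function. This is a sketch under assumptions: the handoff does not say when promotion to 5,000 becomes due or what happens after a failure, so the ready_to_promote flag and the choice to stay at the current slice size after failures are hypothetical.

```python
def plan_next_slice(
    dry_run_clean: bool,
    live_failures: int,
    ready_to_promote: bool,
    slice_size: int = 500,
    promoted_size: int = 5000,
) -> int:
    """Return the --max-new value for the next live run, or 0 to hold."""
    if not dry_run_clean:
        return 0  # hold: repair and re-sample before spending on Gemini
    if ready_to_promote and live_failures == 0:
        return promoted_size  # promote only with a zero-failure record
    return slice_size  # assumed fallback: keep the current slice size
```

The returned number maps onto the --max-new flag used in the commands above, and a return of 0 means no live run should be started.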

06 · Exact Resume

Restart Without Rediscovery

cd /srv/aina/aina-data-engine-room
git status --short --branch
jq '{status, valid, metrics}' artifacts/validation/ain_506_serviceable_title_progressive_live_500_v1.json
jq '{status, valid, metrics}' artifacts/validation/ain_506_live_gemini_embedding_run_v1_dry_run.json
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family serviceable_title --include-repaired --max-new 500 --dry-run

Where to start

If the serviceable dry run still reports 500 existing vectors, 500 next candidates, and zero pruning, the next autonomous move is the next progressive serviceable live slice.
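
That resume condition could be checked mechanically against a dry-run report. In the sketch below, the top-level valid and metrics keys match the jq filters above, but the metric key names (existing_vectors, next_candidates, pruned) are hypothetical placeholders, since the actual report schema is not shown here.

```python
def ready_for_next_live_slice(report: dict, expected: int = 500) -> bool:
    """True when a dry-run report matches the resume condition."""
    metrics = report.get("metrics", {})
    return (
        report.get("valid") is True
        and metrics.get("existing_vectors") == expected  # hypothetical key
        and metrics.get("next_candidates") == expected   # hypothetical key
        and metrics.get("pruned") == 0                   # hypothetical key
    )
```

A report that is missing any of these keys fails the check, so an unexpected schema leads to a hold rather than a live run.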