Production Embedding Serviceable Progress Handoff
A cleaner source-family gate plus the first live serviceable-title Gemini slice.
The embedding lane now has a cleaner source-family gate, and the first live serviceable-title Gemini slice is embedded. Rows with weak or missing title authority no longer embed SOC/function/seniority labels as truth, and market-posting noise is held for repair before any Gemini call.
Cleaner Text Before More Vectors
Serviceable-title rows now separate clean role text from doubtful classifications. Rows with derived_clean_title or exact_trusted authority can embed SOC/function/seniority fields. Rows with exact_weak or missing title authority keep those classifications in metadata only and set classification_labels_embedded: false.
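The gating rule above can be sketched as follows. This is a minimal illustration, not the code in production_embedding_eligibility.py: the four authority values and the classification_labels_embedded flag come from this handoff, while the function name, the row keys (clean_title, title_authority, soc, function, seniority), and the payload layout are assumptions made for the example.

```python
from typing import Any

# Authority values named in this handoff.
TRUSTED_AUTHORITIES = {"derived_clean_title", "exact_trusted"}
# Assumed field names for the SOC/function/seniority classifications.
LABEL_FIELDS = ("soc", "function", "seniority")


def build_embedding_payload(row: dict[str, Any]) -> dict[str, Any]:
    """Split a row into the text to embed and the metadata kept beside it."""
    # A row with no authority value is treated as "missing".
    authority = row.get("title_authority") or "missing"
    labels = {k: row[k] for k in LABEL_FIELDS if row.get(k)}

    # Labels enter the embedded text only under trusted authority.
    embed_labels = authority in TRUSTED_AUTHORITIES and bool(labels)

    parts = [row["clean_title"]]
    if embed_labels:
        parts += [f"{k}: {v}" for k, v in labels.items()]

    return {
        "text": "\n".join(parts),
        "metadata": {
            "title_authority": authority,
            "classification_labels_embedded": embed_labels,
            # Labels always stay in metadata, whether or not they were embedded.
            **labels,
        },
    }
```

With this shape, an exact_weak row still carries its labels for later repair or audit, but the text sent for embedding contains only the clean title.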
The repaired-chunk writer now also dedupes by (chunk_id, text_hash), which preserves row lineage when two different source rows repair to the same semantic title.
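A small sketch of why the composite key matters, assuming each repaired chunk is a dict with chunk_id and text_hash keys (the function name and the source_row field are hypothetical). Deduping on text_hash alone would collapse two source rows that repair to the same title; keying on the pair drops only true repeats.

```python
from typing import Any, Iterable


def dedupe_chunks(chunks: Iterable[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep the first chunk seen for each (chunk_id, text_hash), in input order."""
    seen: set[tuple[str, str]] = set()
    kept: list[dict[str, Any]] = []
    for chunk in chunks:
        key = (chunk["chunk_id"], chunk["text_hash"])
        if key in seen:
            continue  # same chunk written twice: drop the repeat
        seen.add(key)
        kept.append(chunk)
    return kept


chunks = [
    {"chunk_id": "a1", "text_hash": "h9", "source_row": 101},
    {"chunk_id": "b2", "text_hash": "h9", "source_row": 202},  # same text, different row
    {"chunk_id": "a1", "text_hash": "h9", "source_row": 101},  # exact repeat
]
unique = dedupe_chunks(chunks)
```

Here both a1 and b2 survive even though they share a text hash, so each source row keeps its own lineage.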
Where The Corpus Stands
| Metric | Value |
|---|---|
| Total current Gemini vectors | 2,594 |
| Serviceable vectors now current | 500 |
| Serviceable next dry-run candidates | 500 |
| Serviceable live failures | 0 |
| Top 1,000 vector coverage | 1,000 |
| Top 500 vector coverage | 500 |
| Known-pair cosine gap | 0.22896 |
| Serviceable repaired chunks | 10,279 |
The Rows Looked Right Before Spend
The final 50-row serviceable candidate sample had zero posting-noise flags and zero weak-label leaks. Good role text can be embedded, but weak labels do not become semantic authority.
| Sample Check | Result |
|---|---|
| Rows checked | 50 |
| Posting-noise flags | 0 |
| Weak-label leaks | 0 |
| derived_clean_title, labels embedded | 19 |
| exact_trusted, labels embedded | 10 |
| exact_weak, labels metadata-only | 14 |
| missing, labels metadata-only | 7 |
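The weak-label portion of this check could be tallied along these lines. It is a sketch under assumptions: the row keys title_authority and classification_labels_embedded follow the names used in this handoff, the function name is hypothetical, and posting-noise detection is not modeled because the handoff does not describe how it is flagged.

```python
from collections import Counter
from typing import Any, Iterable

WEAK_AUTHORITIES = {"exact_weak", "missing"}


def audit_sample(rows: Iterable[dict[str, Any]]) -> dict[str, Any]:
    """Tally rows by (authority, labels embedded) and count weak-label leaks."""
    buckets: Counter = Counter()
    leaks = 0
    for row in rows:
        authority = row.get("title_authority") or "missing"
        embedded = bool(row.get("classification_labels_embedded"))
        buckets[(authority, embedded)] += 1
        # A leak is a weak or missing authority row whose labels were embedded.
        if authority in WEAK_AUTHORITIES and embedded:
            leaks += 1
    return {
        "rows_checked": sum(buckets.values()),
        "weak_label_leaks": leaks,
        "buckets": dict(buckets),
    }


def make_rows(authority: str, embedded: bool, n: int) -> list[dict[str, Any]]:
    """Build n synthetic rows for one bucket of the table above."""
    return [
        {"title_authority": authority, "classification_labels_embedded": embedded}
        for _ in range(n)
    ]


# Synthetic rows that mirror the bucket counts in the table above.
sample = (
    make_rows("derived_clean_title", True, 19)
    + make_rows("exact_trusted", True, 10)
    + make_rows("exact_weak", False, 14)
    + make_rows("missing", False, 7)
)
report = audit_sample(sample)
```

The four buckets sum to the 50 rows checked, and no row falls into a weak-authority bucket with labels embedded.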
Proof Commands
uv run pytest tests/test_production_embeddings.py tests/test_embedding_contracts.py -q
uv run ruff check src/aina_data_engine/production_embedding_eligibility.py src/aina_data_engine/production_embeddings.py tests/test_production_embeddings.py
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family serviceable_title --include-repaired --max-new 500 --allow-live-gemini --confirm-paid-api --workers 8 --write-every 50 --timeout-seconds 60
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family top_worked_title --include-repaired --max-new 19 --allow-live-gemini --confirm-paid-api --workers 4 --write-every 19 --timeout-seconds 60
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
| Validation | Result |
|---|---|
| Focused pytest | 38 passed |
| Ruff | All checks passed |
| AIN-506 P0 gate | pass |
| Full validation | pass |
The Full Goal Is Still Active
The next step is not to batch everything blindly. Continue progressively by source family: run the next serviceable slice only after the dry run remains clean, promote to 5,000 only if failures stay at zero, and keep repair-first rows out of live embedding until their repaired chunks pass spot checks.
Restart Without Rediscovery
cd /srv/aina/aina-data-engine-room
git status --short --branch
jq '{status, valid, metrics}' artifacts/validation/ain_506_serviceable_title_progressive_live_500_v1.json
jq '{status, valid, metrics}' artifacts/validation/ain_506_live_gemini_embedding_run_v1_dry_run.json
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family serviceable_title --include-repaired --max-new 500 --dry-run
If the serviceable dry run still reports 500 existing vectors, 500 next candidates, and zero pruning, the next autonomous move is the next progressive serviceable live slice.
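That go/no-go condition could be checked mechanically, roughly as below. Only the top-level status, valid, and metrics keys are implied by the jq commands above; the metric names existing_vectors, next_candidates, and pruned, and both function names, are assumptions for illustration and would need to match the real artifact schema.

```python
import json
from pathlib import Path
from typing import Any


def load_report(path: str) -> dict[str, Any]:
    """Read a validation artifact from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def ready_for_next_slice(report: dict[str, Any]) -> bool:
    """True when the dry run matches the expected restart state."""
    metrics = report.get("metrics", {})
    return (
        report.get("valid") is True
        and metrics.get("existing_vectors") == 500   # assumed metric name
        and metrics.get("next_candidates") == 500    # assumed metric name
        and metrics.get("pruned") == 0               # assumed metric name
    )
```

Any mismatch, including a missing metric, returns False, so an unexpected report shape holds the live run instead of allowing it.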