O*NET Occupation Family Complete Embedding Checkpoint
The full repaired O*NET occupation source family is now embedded locally.
onet_occupation_evidence is now fully embedded for the current repaired corpus: 2,828 vectors, zero failed Gemini rows, zero remaining candidates, and all runtime boundaries still blocked.
What changed
After the 500-vector proof, Codex expanded onet_occupation_evidence to all 2,828 repaired chunks, verified semantic QA, embedded the remaining 2,328 candidates through Gemini Embedding 2, and confirmed the completion dry run has zero remaining candidates.
Live embedding result
| Item | Result |
|---|---|
| Source family | onet_occupation_evidence |
| Clean repaired corpus size | 2,828 |
| Existing O*NET occupation vectors before final run | 500 |
| Final dry-run candidates before live run | 2,328 |
| Live embedded new vectors | 2,328 |
| Failed Gemini rows | 0 |
| Remaining O*NET occupation candidates after live run | 0 |
| Total Gemini vectors | 19,912 |
| O*NET occupation vectors | 2,828 |
| Workflow intelligence vectors | 3,152 |
| Workflow seed vectors | 6,866 |
| Known-pair cosine gap | 0.190463 |
| Stale vectors | 0 |
The run used gemini-embedding-2 at 768 dimensions through Vertex AI, authenticated with Application Default Credentials (ADC). Runtime embedding authority remains false.
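The counts in the table above can be cross-checked with simple arithmetic. The following is a minimal sketch, not part of the aina-data-engine tooling; the variable and function names are illustrative assumptions, and the numbers are copied from the table.

```python
# Counts copied from the "Live embedding result" table.
corpus_size = 2_828         # clean repaired corpus size
vectors_before = 500        # vectors that existed after the 500-vector proof
dry_run_candidates = 2_328  # candidates reported by the final dry run
live_embedded = 2_328       # new vectors written by the live run
failed_rows = 0             # failed Gemini rows


def remaining_candidates(corpus: int, before: int, embedded: int) -> int:
    """Chunks in the corpus that still lack a vector after the live run."""
    return corpus - before - embedded


remaining = remaining_candidates(corpus_size, vectors_before, live_embedded)
family_total = vectors_before + live_embedded

print(remaining, family_total)  # prints: 0 2828
```

The dry-run candidate count equals the corpus size minus the existing vectors, and with zero failed rows every candidate became a vector, which is why the family total matches the corpus size.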
Validation state
| Gate or receipt | Result |
|---|---|
| Repaired semantic QA | 50 / 50 pass |
| Completion dry run | 0 remaining candidates |
| AIN-506 P0 gate | pass |
| AIN-510 retrieval promotion gate | promotion_ready |
| Chunk/vector reconciliation | pass |
| Source authority registry v2 | pass |
| Runtime readiness | valid |
| Artifact exposure scan | pass, 0 active findings |
| Docs frontmatter | valid |
| Full validation | pass |

Reconciliation counters

| Counter | Value |
|---|---|
| Combined chunk authority | 335,865 |
| Repaired authority chunks | 41,190 |
| Vector rows | 19,912 |
| Matched vectors | 19,912 |
| Stale vectors | 0 |
| Unvectorized chunks | 315,953 |
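The counters above are internally consistent, which a short check can confirm. This is a sketch under two assumptions that the report does not state explicitly: that unvectorized chunks equal combined chunk authority minus vector rows, and that matched plus stale vectors equal vector rows. The dictionary keys and the `reconcile` helper are hypothetical names.

```python
# Values copied from the counter table.
counters = {
    "combined_chunk_authority": 335_865,
    "vector_rows": 19_912,
    "matched_vectors": 19_912,
    "stale_vectors": 0,
    "unvectorized_chunks": 315_953,
}


def reconcile(c: dict) -> list:
    """Return a list of violated invariants (empty when everything agrees)."""
    problems = []
    # Assumed invariant: every authority chunk is either vectorized or not.
    if c["combined_chunk_authority"] - c["vector_rows"] != c["unvectorized_chunks"]:
        problems.append("unvectorized count != authority minus vector rows")
    # Assumed invariant: every vector row is either matched or stale.
    if c["matched_vectors"] + c["stale_vectors"] != c["vector_rows"]:
        problems.append("matched + stale != vector rows")
    return problems


problems = reconcile(counters)
print(problems or "reconciled")  # prints: reconciled
```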
Completed clean source families
| Source family | Vectors |
|---|---|
| workflow_seed | 6,866 |
| workflow_intelligence | 3,152 |
| onet_occupation_evidence | 2,828 |
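As a quick arithmetic check, the three completed families can be summed and compared with the total Gemini vector count reported earlier. The sketch below only restates numbers from this report; the variable names are illustrative.

```python
# Vector counts for the completed clean source families.
completed = {
    "workflow_seed": 6_866,
    "workflow_intelligence": 3_152,
    "onet_occupation_evidence": 2_828,
}
total_gemini_vectors = 19_912  # from the live embedding result table

completed_total = sum(completed.values())
other_vectors = total_gemini_vectors - completed_total

print(completed_total, other_vectors)  # prints: 12846 7066
```

The three families account for 12,846 of the 19,912 vectors. The remaining 7,066 are not attributed to any of these three families in this report.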
jobs_research_role remains blocked because its repaired rows failed semantic QA: they are title-only or store-number artifacts. Do not retry that family until a richer JD/context repair exists.
Verification commands
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repaired-corpus --source-family onet_occupation_evidence --limit 3000 --shard-size 500
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family onet_occupation_evidence --include-repaired --limit 50
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family onet_occupation_evidence --include-repaired --dry-run --max-new 5000 --selection-mode progressive
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family onet_occupation_evidence --include-repaired --max-new 5000 --selection-mode progressive --allow-live-gemini --confirm-paid-api --workers 8 --timeout-seconds 60 --write-every 250
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
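The command sequence above could be assembled programmatically so that the paid live run is never included by accident. The following is a sketch, not part of the project: `build_commands` and `include_live` are hypothetical names, while every subcommand and flag is copied verbatim from the commands listed above. The sketch only builds and prints the command lines; it does not execute them.

```python
import shlex

ROOT = "/srv/aina/aina-data-engine-room"
FAMILY = "onet_occupation_evidence"


def build_commands(root=ROOT, family=FAMILY, include_live=False):
    """Return the verification commands as argv lists, in the documented order."""
    prefix = ["uv", "run", "aina-data-engine", "--root", root]
    fam = ["--source-family", family]
    embed = ["gemini-embedding-run", *fam, "--include-repaired"]
    select = ["--max-new", "5000", "--selection-mode", "progressive"]

    commands = [
        prefix + ["production-embedding-repaired-corpus", *fam,
                  "--limit", "3000", "--shard-size", "500"],
        prefix + ["production-embedding-semantic-qa", *fam,
                  "--include-repaired", "--limit", "50"],
        prefix + embed + ["--dry-run"] + select,
    ]
    if include_live:
        # The live run calls the paid API, so it is opt-in in this sketch.
        commands.append(
            prefix + embed + select
            + ["--allow-live-gemini", "--confirm-paid-api", "--workers", "8",
               "--timeout-seconds", "60", "--write-every", "250"]
        )
    commands.append(prefix + ["ain-510-retrieval-promotion-gate"])
    commands.append(prefix + ["validate"])
    return commands


for argv in build_commands():
    print(shlex.join(argv))
```

To actually run the sequence, each argv list could be passed to `subprocess.run(argv, check=True)` so that the first failing step stops the sequence.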
What remains
The next likely public source family is onet_task_evidence, but it needs a fresh repair and QA lane. Its rows likely carry generic labels such as Core, E0, and onet_task_evidence, so work through the stages in order before scaling: eligibility, repaired corpus, 50-row semantic QA, and a 500-row proof.
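A pre-check for generic labels might look like the sketch below. It is an assumption about how such a filter could work, not the project's repair logic: the `is_generic_label` helper and the sample rows are invented for illustration, and only the three label strings come from this report.

```python
# Labels this report names as likely generic for onet_task_evidence.
GENERIC_LABELS = {"core", "e0", "onet_task_evidence"}


def is_generic_label(title: str) -> bool:
    """True when the title is one of the known generic labels (case-insensitive)."""
    return title.strip().lower() in GENERIC_LABELS


# Hypothetical sample rows, invented for this example.
rows = [
    {"title": "Core"},
    {"title": "E0"},
    {"title": "Inspect welds for structural defects"},
]

flagged = [row["title"] for row in rows if is_generic_label(row["title"])]
print(flagged)  # prints: ['Core', 'E0']
```

Rows flagged this way would need a richer title before semantic QA can say anything meaningful about them.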
jobs_research_responsibility may be high value later, but it needs JD/context/source-authority repair first, because the adjacent role family (jobs_research_role) already exposed store-number artifacts.
Resume with O*NET task evidence only after generic-title repair passes semantic QA; keep exact cosine and runtime boundaries unchanged.
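The staged approach for the next family can be expressed as an ordered gate: no stage starts until every earlier stage has passed. This is a minimal sketch under that assumption; the stage identifiers and the `next_stage` helper are hypothetical names, with the order taken from the plan described above.

```python
from typing import Optional, Set

# Stage order taken from the plan for onet_task_evidence.
STAGES = [
    "eligibility",
    "repaired_corpus",
    "semantic_qa_50",
    "proof_500",
    "scale",
]


def next_stage(passed: Set[str]) -> Optional[str]:
    """Return the first stage that has not passed, or None when all are done."""
    for stage in STAGES:
        if stage not in passed:
            return stage
    return None


print(next_stage({"eligibility", "repaired_corpus"}))  # prints: semantic_qa_50
```

Because the stages are checked in order, a later stage that somehow passed early does not unlock anything: `next_stage({"eligibility", "semantic_qa_50"})` still returns `repaired_corpus`.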