O*NET Occupation 500 Embedding Checkpoint
The first live O*NET occupation embedding rung is clean, local, and can be rolled back.
onet_occupation_evidence has passed its first live embedding rung: 500 clean public-taxonomy chunks embedded through Gemini Embedding 2, zero failed rows, zero stale vectors, and no runtime promotion.
What changed
Codex ran the O*NET occupation source family through the clean-before-embed ladder: eligibility, deterministic repair, 50-row semantic QA, dry-run candidate selection, live Gemini embedding, and the full post-live gate stack.
aina-495702. The family has 2,828 eligible source rows. This checkpoint proves the first 500; the remaining 2,328 should be embedded only if the same family gates stay green.
Live embedding result
| Item | Result |
|---|---|
| Source family | onet_occupation_evidence |
| Eligibility rows | 2,828 |
| Repaired chunks for first rung | 500 |
| Semantic QA | 50 / 50 pass |
| Dry-run candidates | 500 |
| Live embedded new vectors | 500 |
| Failed Gemini rows | 0 |
| Total Gemini vectors | 17,584 |
| O*NET occupation vectors | 500 |
| Workflow intelligence vectors | 3,152 |
| Workflow seed vectors | 6,866 |
| Known-pair cosine gap | 0.190463 |
| Stale vectors | 0 |
The run used gemini-embedding-2 at 768 dimensions. Runtime embedding authority remains false.
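The known-pair cosine gap is reported above without a definition. A minimal sketch of one plausible reading, assuming the gap is the mean cosine similarity of pairs known to be related minus the mean for pairs known to be unrelated. The function names and the toy two-dimensional vectors are hypothetical and are not taken from aina-data-engine.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def known_pair_gap(related_pairs, unrelated_pairs):
    """Assumed definition: mean similarity of related pairs minus mean of unrelated pairs."""
    related = [cosine(a, b) for a, b in related_pairs]
    unrelated = [cosine(a, b) for a, b in unrelated_pairs]
    return sum(related) / len(related) - sum(unrelated) / len(unrelated)


# Toy vectors for illustration only; real vectors would have 768 dimensions.
related_pairs = [([1.0, 0.0], [1.0, 0.0])]
unrelated_pairs = [([1.0, 0.0], [0.0, 1.0])]
gap = known_pair_gap(related_pairs, unrelated_pairs)
```

Under this reading, a larger positive gap means the embedding separates related text from unrelated text more clearly.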
Validation state
| Gate or receipt | Result |
|---|---|
| Repaired semantic QA | 50 / 50 pass |
| Live Gemini run | 500 vectors, 0 failures |
| AIN-506 P0 gate | pass |
| AIN-510 retrieval promotion gate | promotion_ready |
| Chunk/vector reconciliation | pass |
| Source authority registry v2 | pass |
| Runtime readiness | valid |
| Artifact exposure scan | pass, 0 active findings |
| Docs frontmatter | valid |
| Full validation | pass |

| Counter | Value |
|---|---|
| Combined chunk authority | 333,537 |
| Repaired authority chunks | 38,862 |
| Vector rows | 17,584 |
| Matched vectors | 17,584 |
| Stale vectors | 0 |
| Unvectorized chunks | 315,953 |
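The counters above can be cross-checked with simple arithmetic. A minimal sketch, assuming that matched plus stale vectors should equal vector rows and that unvectorized chunks should equal combined chunk authority minus matched vectors. Those relationships are inferred from the reported numbers, not from the reconciliation gate itself, and the `Counters` class and `reconcile` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counters:
    combined_chunks: int
    vector_rows: int
    matched_vectors: int
    stale_vectors: int
    unvectorized_chunks: int


def reconcile(c):
    """Return a list of problems; an empty list means the counters agree."""
    problems = []
    # Assumed invariant: every vector row is either matched or stale.
    if c.matched_vectors + c.stale_vectors != c.vector_rows:
        problems.append("vector rows are not fully accounted for")
    if c.stale_vectors != 0:
        problems.append("stale vectors present")
    # Assumed invariant: chunks without a matched vector are unvectorized.
    if c.combined_chunks - c.matched_vectors != c.unvectorized_chunks:
        problems.append("unvectorized count does not match chunk totals")
    return problems


# Values copied from the counter table in this checkpoint.
checkpoint = Counters(
    combined_chunks=333_537,
    vector_rows=17_584,
    matched_vectors=17_584,
    stale_vectors=0,
    unvectorized_chunks=315_953,
)
problems = reconcile(checkpoint)
```

With the reported values, 333,537 minus 17,584 equals 315,953, so the list of problems is empty.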
What this does not unlock
This is a local source-family embedding proof, not a production release. Public runtime, real-user data, external writes, production telemetry, donor deletion, raw market dump embedding, malformed row embedding, and learner-linked payload embedding all remain blocked.
Labels remain routing metadata. The embedded text uses clean public occupation evidence; it does not promote source labels into unquestioned truth.
Verification commands
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family onet_occupation_evidence --include-repaired --limit 50
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family onet_occupation_evidence --include-repaired --dry-run --max-new 500 --selection-mode progressive
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family onet_occupation_evidence --include-repaired --max-new 500 --selection-mode progressive --allow-live-gemini --confirm-paid-api --workers 8 --timeout-seconds 60 --write-every 250
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
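The non-paid commands above can be chained so that a later step runs only if the earlier ones succeed. A minimal sketch, assuming each command exits non-zero when its gate fails (the source does not state the exit-code behavior). The helper names `offline_ladder` and `run_ladder` are hypothetical, and the live run with --allow-live-gemini and --confirm-paid-api is deliberately left out.

```python
import shlex
import subprocess

ROOT = "/srv/aina/aina-data-engine-room"
FAMILY = "onet_occupation_evidence"
BASE = ["uv", "run", "aina-data-engine", "--root", ROOT]


def offline_ladder():
    """Commands copied from the list above, minus the paid live embedding run."""
    family = ["--source-family", FAMILY, "--include-repaired"]
    return [
        BASE + ["production-embedding-semantic-qa"] + family + ["--limit", "50"],
        BASE + ["gemini-embedding-run"] + family
        + ["--dry-run", "--max-new", "500", "--selection-mode", "progressive"],
        BASE + ["ain-506-p0-gate"],
        BASE + ["ain-510-retrieval-promotion-gate"],
        BASE + ["validate"],
    ]


def run_ladder(commands, runner=subprocess.run):
    """Run commands in order; stop at the first non-zero exit code."""
    for argv in commands:
        print("$", shlex.join(argv))
        result = runner(argv)
        if result.returncode != 0:
            return False
    return True
```

Calling `run_ladder(offline_ladder())` would execute the five checks in sequence. The `runner` argument exists so the ordering logic can be exercised without invoking the CLI.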
What remains
The next clean move is to complete the remaining O*NET occupation family before moving to noisier sources. Expand the repaired corpus to all 2,828 rows, run the same semantic QA and dry-run checks, embed the remaining 2,328 candidates only if the gates stay green, then repeat AIN-510 and validation.
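The scale-only-if-green rule can be expressed as a small planning function. A minimal sketch: `plan_next_rung`, the `rung_size` parameter, and the gate names in the dictionary are hypothetical, while the row counts come from this checkpoint.

```python
def plan_next_rung(eligible, embedded, rung_size, gates):
    """Decide how many rows to embed next; embed nothing if any gate is not green."""
    failing = sorted(name for name, ok in gates.items() if not ok)
    if failing:
        return {"embed": 0, "blocked_by": failing}
    remaining = max(eligible - embedded, 0)
    return {"embed": min(rung_size, remaining), "blocked_by": []}


# Gate names are illustrative labels, not identifiers from the tool.
gates = {
    "semantic_qa": True,
    "dry_run": True,
    "ain_506_p0": True,
    "ain_510_promotion": True,
    "validate": True,
}
plan = plan_next_rung(eligible=2_828, embedded=500, rung_size=2_328, gates=gates)
```

Here the plan covers all 2,328 remaining rows in one rung. A smaller `rung_size` would split the remainder into several rungs, with the gates rechecked before each one.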
jobs_research_role remains blocked because its repaired rows failed semantic QA as title-only/store-number artifacts. That is a guardrail doing its job, not a technical failure.
Resume by scaling onet_occupation_evidence to the full repaired family. Treat O*NET tasks as the next public source family only after generic-title repair has been strengthened.