AINA Data Engine Room · Personalization Engine · 2026-06-15

O*NET Occupation Family Complete Embedding Checkpoint

The full repaired O*NET occupation source family is now embedded locally.

Ali Mehdi Mukadam · co-authored with Codex · 2026-06-15

The Single Idea

onet_occupation_evidence is now fully embedded for the current repaired corpus: 2,828 vectors, zero failed Gemini rows, zero remaining candidates, and all runtime boundaries still blocked.

01 · Change

What changed

After the 500-vector proof, Codex expanded onet_occupation_evidence to all 2,828 repaired chunks, verified semantic QA, embedded the remaining 2,328 candidates through Gemini Embedding 2, and confirmed the completion dry run has zero remaining candidates.

Repair: 2,828 repaired public taxonomy chunks.
Embed: 2,328 new live Gemini vectors, zero failed rows.
Verify: AIN-510, reconciliation, registry, exposure, and validate all pass.

02 · Live Run

Live embedding result

Item: Result
Source family: onet_occupation_evidence
Clean repaired corpus size: 2,828
Existing O*NET occupation vectors before final run: 500
Final dry-run candidates before live: 2,328
Live embedded new vectors: 2,328
Failed Gemini rows: 0
Remaining O*NET occupation candidates after live: 0
Total Gemini vectors: 19,912
O*NET occupation vectors: 2,828
Workflow intelligence vectors: 3,152
Workflow seed vectors: 6,866
Known-pair cosine gap: 0.190463
Stale vectors: 0

The run used gemini-embedding-2 at 768 dimensions through Vertex AI, authenticated with Application Default Credentials (ADC). Runtime embedding authority remains false.
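The checkpoint reports a known-pair cosine gap of 0.190463 but does not define the metric. As a rough sketch only, assuming the gap is the mean cosine similarity of pairs expected to match minus the mean for pairs expected not to match, it could be computed as below. The function names and the definition itself are assumptions, not the engine's code.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[Vector, Vector]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Exact cosine similarity between two vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def known_pair_gap(related: Sequence[Pair], unrelated: Sequence[Pair]) -> float:
    """Mean similarity of related pairs minus mean similarity of unrelated pairs.

    ASSUMPTION: this definition is illustrative. The checkpoint does not say
    how its known-pair cosine gap is calculated.
    """
    if not related or not unrelated:
        raise ValueError("both pair lists must be non-empty")
    mean_related = sum(cosine_similarity(a, b) for a, b in related) / len(related)
    mean_unrelated = sum(cosine_similarity(a, b) for a, b in unrelated) / len(unrelated)
    return mean_related - mean_unrelated
```

Under this reading, a larger positive gap means known matches sit clearly closer together than known non-matches. The functions work for any dimension, including the 768-dimension vectors used in this run.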

03 · Gates

Validation state

Gate or receipt: Result
Repaired semantic QA: 50 / 50 pass
Completion dry run: 0 remaining candidates
AIN-506 P0 gate: pass
AIN-510 retrieval promotion gate: promotion_ready
Chunk/vector reconciliation: pass
Source authority registry v2: pass
Runtime readiness: valid
Artifact exposure scan: pass, 0 active findings
Docs frontmatter: valid
Full validation: pass

Counter: Value
Combined chunk authority: 335,865
Repaired authority chunks: 41,190
Vector rows: 19,912
Matched vectors: 19,912
Stale vectors: 0
Unvectorized chunks: 315,953

04 · Families

Completed clean source families

Source family: Vectors
workflow_seed: 6,866
workflow_intelligence: 3,152
onet_occupation_evidence: 2,828

jobs_research_role remains blocked: its repaired rows failed semantic QA because they are title-only or store-number artifacts. Do not retry that family until a richer job-description (JD) and context repair exists.
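The counters in section 03 and the per-family vector counts above can be cross-checked with simple arithmetic. The sketch below uses only numbers from this checkpoint. The identities it tests are assumptions about what reconciliation should imply, not a description of the actual reconciliation gate.

```python
# Counters copied from this checkpoint.
COMBINED_CHUNKS = 335_865
VECTOR_ROWS = 19_912
MATCHED_VECTORS = 19_912
STALE_VECTORS = 0
UNVECTORIZED_CHUNKS = 315_953

ONET_VECTORS_BEFORE = 500
ONET_VECTORS_NEW = 2_328

FAMILY_VECTORS = {
    "workflow_seed": 6_866,
    "workflow_intelligence": 3_152,
    "onet_occupation_evidence": 2_828,
}


def check(condition: bool, message: str) -> None:
    """Raise ValueError with a readable message when a count does not add up."""
    if not condition:
        raise ValueError(message)


def reconcile() -> dict:
    """Verify the assumed identities and return the derived totals."""
    # The 500-vector proof plus the final live run should equal the family total.
    check(
        ONET_VECTORS_BEFORE + ONET_VECTORS_NEW
        == FAMILY_VECTORS["onet_occupation_evidence"],
        "O*NET occupation vectors do not add up",
    )
    # ASSUMPTION: every vector row is either matched to a chunk or stale.
    check(MATCHED_VECTORS + STALE_VECTORS == VECTOR_ROWS, "vector rows do not add up")
    # ASSUMPTION: unvectorized chunks are all chunks minus matched vectors.
    check(
        COMBINED_CHUNKS - MATCHED_VECTORS == UNVECTORIZED_CHUNKS,
        "unvectorized chunk count does not add up",
    )
    listed = sum(FAMILY_VECTORS.values())
    return {"listed_family_vectors": listed, "other_vectors": VECTOR_ROWS - listed}
```

All three identities hold for the reported numbers. The three families listed here account for 12,846 of the 19,912 vectors. This checkpoint does not break down the other 7,066 by family.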

05 · Proof

Verification commands

uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repaired-corpus --source-family onet_occupation_evidence --limit 3000 --shard-size 500
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family onet_occupation_evidence --include-repaired --limit 50
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family onet_occupation_evidence --include-repaired --dry-run --max-new 5000 --selection-mode progressive
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family onet_occupation_evidence --include-repaired --max-new 5000 --selection-mode progressive --allow-live-gemini --confirm-paid-api --workers 8 --timeout-seconds 60 --write-every 250
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
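To re-run the non-paid checks in order and stop at the first failure, a small wrapper along these lines may help. It is a sketch under two assumptions: the CLI signals failure with a nonzero exit code, and the dry run, the AIN-510 gate, and validate are safe to repeat. The live embedding command is deliberately left out because it calls the paid API. run_steps and STEPS are hypothetical names.

```python
import shlex
import subprocess
from typing import Callable, Sequence

BASE = "uv run aina-data-engine --root /srv/aina/aina-data-engine-room"

# Commands copied from the list above, minus the paid live embedding run.
STEPS = [
    f"{BASE} gemini-embedding-run --source-family onet_occupation_evidence "
    "--include-repaired --dry-run --max-new 5000 --selection-mode progressive",
    f"{BASE} ain-510-retrieval-promotion-gate",
    f"{BASE} validate",
]


def run_steps(steps: Sequence[str], runner: Callable = subprocess.run) -> int:
    """Run each command in order and return how many completed.

    ASSUMPTION: a nonzero exit code means the step failed.
    """
    for index, step in enumerate(steps, start=1):
        result = runner(shlex.split(step), check=False)
        if result.returncode != 0:
            raise RuntimeError(
                f"step {index} failed with exit code {result.returncode}: {step}"
            )
    return len(steps)
```

Calling run_steps(STEPS) from the repository checkout runs the three commands in sequence. The runner parameter exists only so the wrapper can be exercised without invoking the real CLI.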

06 · Next

What remains

The next likely public source family is onet_task_evidence, but it needs a fresh repair and QA lane. It likely has generic labels such as Core, E0, and onet_task_evidence, so start with eligibility, repaired corpus, 50-row semantic QA, and a 500-row proof before scaling.
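As a first pass at the eligibility step, rows whose title is only a generic label could be set aside for repair before any semantic QA. This is a minimal sketch, assuming each row is a dict with a hypothetical title field. The label set contains only the three examples named above, and the real task corpus may need more.

```python
from typing import Iterable, List, Tuple

# Generic labels named in this checkpoint, compared case-insensitively.
GENERIC_LABELS = frozenset({"core", "e0", "onet_task_evidence"})


def is_generic_title(title: str) -> bool:
    """Return True when the title is empty or only a generic label."""
    normalized = title.strip().lower()
    return normalized == "" or normalized in GENERIC_LABELS


def split_by_title(rows: Iterable[dict]) -> Tuple[List[dict], List[dict]]:
    """Split rows into (usable, needs_repair) based on the title field.

    ASSUMPTION: the "title" key is hypothetical; adjust it to the real schema.
    """
    usable: List[dict] = []
    needs_repair: List[dict] = []
    for row in rows:
        title = str(row.get("title", ""))
        if is_generic_title(title):
            needs_repair.append(row)
        else:
            usable.append(row)
    return usable, needs_repair
```

Counting the needs_repair rows before building the repaired corpus gives an early estimate of how much of the family needs generic-title repair.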

jobs_research_responsibility may be high value later, but run JD, context, and source-authority repair first, because the adjacent role family already exposed store-number artifacts.

Where to start

Resume with O*NET task evidence only after generic-title repair passes semantic QA; keep exact cosine and runtime boundaries unchanged.