AINA Data Engine Room · Personalization Engine · 2026-06-15

O*NET Occupation 500 Embedding Checkpoint

The first live O*NET occupation embedding rung is clean, local, and rollbackable.

Ali Mehdi Mukadam · co-authored with Codex · 2026-06-15

The Single Idea

onet_occupation_evidence has passed its first live embedding rung: 500 clean public-taxonomy chunks embedded through Gemini Embedding 2, zero failed rows, zero stale vectors, and no runtime promotion.

01 · Change

What changed

Codex ran the O*NET occupation source family through the clean-before-embed ladder: eligibility, deterministic repair, 50-row semantic QA, dry-run candidate selection, live Gemini embedding, and the full post-live gate stack.

Repair: 500 repaired chunks with title, SOC, metric, and O*NET release provenance.
Embed: 500 new live Gemini vectors through Vertex ADC on project aina-495702.
Verify: AIN-510, reconciliation, registry, exposure, and validate all pass.

The family has 2,828 eligible source rows. This checkpoint proves the first 500; the remaining 2,328 should be embedded only if the same family gates stay green.

02 · Live Run

Live embedding result

Item: Result
Source family: onet_occupation_evidence
Eligibility rows: 2,828
Repaired chunks for first rung: 500
Semantic QA: 50 / 50 pass
Dry-run candidates: 500
Live embedded new vectors: 500
Failed Gemini rows: 0
Total Gemini vectors: 17,584
O*NET occupation vectors: 500
Workflow intelligence vectors: 3,152
Workflow seed vectors: 6,866
Known-pair cosine gap: 0.190463
Stale vectors: 0

The run used gemini-embedding-2 at 768 dimensions. Runtime embedding authority remains false.
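
The report does not define the known-pair cosine gap. A minimal sketch of one common reading, assumed here: the mean cosine similarity of pairs known to be related minus the mean cosine similarity of pairs known to be unrelated. The function names and the toy 3-dimensional vectors are hypothetical; they are not outputs of gemini-embedding-2 and not the engine's own implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def known_pair_gap(related_pairs, unrelated_pairs):
    """Assumed definition: mean related similarity minus mean unrelated similarity."""
    related = [cosine(a, b) for a, b in related_pairs]
    unrelated = [cosine(a, b) for a, b in unrelated_pairs]
    return sum(related) / len(related) - sum(unrelated) / len(unrelated)


# Toy vectors for illustration only.
related_pairs = [([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])]
unrelated_pairs = [([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])]
gap = known_pair_gap(related_pairs, unrelated_pairs)
```

Under this reading, a positive gap means related pairs sit closer together than unrelated ones, which is the property a retrieval gate would want to see preserved after each rung.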

03 · Gates

Validation state

Gate or receipt: Result
Repaired semantic QA: 50 / 50 pass
Live Gemini run: 500 vectors, 0 failures
AIN-506 P0 gate: pass
AIN-510 retrieval promotion gate: promotion_ready
Chunk/vector reconciliation: pass
Source authority registry v2: pass
Runtime readiness: valid
Artifact exposure scan: pass, 0 active findings
Docs frontmatter: valid
Full validation: pass

Counter: Value
Combined chunk authority: 333,537
Repaired authority chunks: 38,862
Vector rows: 17,584
Matched vectors: 17,584
Stale vectors: 0
Unvectorized chunks: 315,953
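
The counters above are mutually consistent under two identities that are assumed here, not stated in the report: matched plus stale vectors equals vector rows, and combined chunks minus matched vectors equals unvectorized chunks. A minimal sketch of that check follows; the class and function names are hypothetical and are not part of the aina-data-engine CLI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReconciliationCounters:
    combined_chunks: int
    vector_rows: int
    matched_vectors: int
    stale_vectors: int
    unvectorized_chunks: int


def reconciliation_problems(c):
    """Return a list of violated identities (assumed relationships, see lead-in)."""
    problems = []
    if c.matched_vectors + c.stale_vectors != c.vector_rows:
        problems.append("matched + stale != vector rows")
    if c.combined_chunks - c.matched_vectors != c.unvectorized_chunks:
        problems.append("combined - matched != unvectorized")
    return problems


# Values copied from the counter table in this report.
reported = ReconciliationCounters(
    combined_chunks=333_537,
    vector_rows=17_584,
    matched_vectors=17_584,
    stale_vectors=0,
    unvectorized_chunks=315_953,
)
problems = reconciliation_problems(reported)
```

An empty list means the reported counters agree with both assumed identities.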

04 · Boundary

What this does not unlock

This is a local source-family embedding proof, not a production release. Public runtime, real-user data, external writes, production telemetry, donor deletion, raw market dump embedding, malformed row embedding, and learner-linked payload embedding all remain blocked.

Labels remain routing metadata. The embedded text uses clean public occupation evidence; it does not promote source labels into unquestioned truth.

05 · Proof

Verification commands

uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family onet_occupation_evidence --include-repaired --limit 50
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family onet_occupation_evidence --include-repaired --dry-run --max-new 500 --selection-mode progressive
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family onet_occupation_evidence --include-repaired --max-new 500 --selection-mode progressive --allow-live-gemini --confirm-paid-api --workers 8 --timeout-seconds 60 --write-every 250
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate

06 · Next

What remains

The next clean move is to complete the remaining O*NET occupation family before moving to noisier sources. Expand the repaired corpus to all 2,828 rows, run the same semantic QA and dry-run checks, embed the remaining 2,328 candidates only if the gates stay green, then repeat AIN-510 and validation.
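
The scaling rule in the paragraph above can be sketched as a small decision function: embed the remaining rows only when every gate is green. The function name, the gate keys, and the cap parameter are hypothetical labels chosen for illustration; the row counts come from this report.

```python
ELIGIBLE_ROWS = 2828   # eligible source rows in onet_occupation_evidence
EMBEDDED_ROWS = 500    # rows embedded in this checkpoint


def next_rung(eligible, embedded, gates, max_new):
    """Return (rows_to_embed, failing_gates); rows_to_embed is 0 if any gate fails."""
    failing = [name for name, ok in gates.items() if not ok]
    if failing:
        return 0, failing
    remaining = eligible - embedded
    return min(remaining, max_new), []


# Hypothetical gate names mirroring the checks named in this report.
gates = {
    "semantic_qa": True,
    "dry_run": True,
    "ain_506_p0": True,
    "ain_510": True,
    "validate": True,
}
rows, failing = next_rung(ELIGIBLE_ROWS, EMBEDDED_ROWS, gates, max_new=2328)
```

With all gates green the sketch returns the 2,328 remaining rows; a single failing gate returns zero rows and names the gate, which matches the stop-on-red behaviour described for jobs_research_role.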

jobs_research_role remains blocked because its repaired rows failed semantic QA: they were title-only or store-number artifacts. That is a guardrail doing its job, not a technical failure.

Where to start

Resume by scaling onet_occupation_evidence to the full repaired family, then treat O*NET tasks as the next public source family only after stronger generic-title repair.