AINA Data Engine Room · Personalization Engine · 2026-06-15

O*NET Occupation 500 Embedding Checkpoint

The first live O*NET occupation embedding rung is clean, local, and rollbackable.

Ali Mehdi Mukadam · co-authored with Codex · 2026-06-15

The Single Idea

onet_occupation_evidence has passed its first live embedding rung: 500 clean public-taxonomy chunks embedded through Gemini Embedding 2, zero failed rows, zero stale vectors, and no runtime promotion.

01 · Change

What changed

Codex ran the O*NET occupation source family through the clean-before-embed ladder: eligibility, deterministic repair, 50-row semantic QA, dry-run candidate selection, live Gemini embedding, and the full post-live gate stack.

Repair: 500 repaired chunks with title, SOC, metric, and O*NET release provenance.

Embed: 500 new live Gemini vectors through Vertex ADC on project aina-495702.

Verify: AIN-510, reconciliation, registry, exposure, and validate all pass.

The family has 2,828 eligible source rows. This checkpoint proves the first 500; the remaining 2,328 should be embedded only if the same family gates stay green.

02 · Live Run

Live embedding result

Item	Result
Source family	`onet_occupation_evidence`
Eligibility rows	2,828
Repaired chunks for first rung	500
Semantic QA	50 / 50 pass
Dry-run candidates	500
Live embedded new vectors	500
Failed Gemini rows	0
Total Gemini vectors	17,584
O*NET occupation vectors	500
Workflow intelligence vectors	3,152
Workflow seed vectors	6,866
Known-pair cosine gap	0.190463
Stale vectors	0

The run used gemini-embedding-2 at 768 dimensions. Runtime embedding authority remains false.
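
The report does not say how the known-pair cosine gap is computed. One plausible reading, offered here only as an assumption, is the mean cosine similarity of known-related pairs minus the mean cosine similarity of known-unrelated pairs. The sketch below illustrates that reading with toy 2-dimensional vectors rather than real 768-dimensional embeddings; the function names are hypothetical and are not part of aina-data-engine.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def known_pair_cosine_gap(related_pairs, unrelated_pairs):
    """Assumed definition: mean similarity of related pairs minus
    mean similarity of unrelated pairs. A larger gap means the
    embedding separates the two groups more clearly."""
    related = [cosine(a, b) for a, b in related_pairs]
    unrelated = [cosine(a, b) for a, b in unrelated_pairs]
    return sum(related) / len(related) - sum(unrelated) / len(unrelated)


# Toy data: one identical pair (cosine 1) and one orthogonal pair (cosine 0).
toy_related = [([1.0, 0.0], [1.0, 0.0])]
toy_unrelated = [([1.0, 0.0], [0.0, 1.0])]
toy_gap = known_pair_cosine_gap(toy_related, toy_unrelated)
```

Under this reading, a positive gap such as the reported 0.190463 would mean related pairs score higher on average than unrelated ones.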

03 · Gates

Validation state

Gate or receipt	Result
Repaired semantic QA	50 / 50 pass
Live Gemini run	500 vectors, 0 failures
AIN-506 P0 gate	pass
AIN-510 retrieval promotion gate	promotion_ready
Chunk/vector reconciliation	pass
Source authority registry v2	pass
Runtime readiness	valid
Artifact exposure scan	pass, 0 active findings
Docs frontmatter	valid
Full validation	pass

Counter	Value
Combined chunk authority	333,537
Repaired authority chunks	38,862
Vector rows	17,584
Matched vectors	17,584
Stale vectors	0
Unvectorized chunks	315,953
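
The counters above are arithmetically consistent with each other. The sketch below re-derives the two dependent counters from the other three, assuming that stale vectors are vector rows without a matching chunk and that unvectorized chunks are authority chunks without a matched vector. Those definitions and the `reconcile` helper are assumptions for illustration, not the engine's actual reconciliation code.

```python
def reconcile(total_chunks, vector_rows, matched_vectors):
    """Derive the dependent counters from the three primary counts."""
    return {
        # Vector rows that no longer correspond to a current chunk.
        "stale_vectors": vector_rows - matched_vectors,
        # Authority chunks that have no matched vector yet.
        "unvectorized_chunks": total_chunks - matched_vectors,
    }


# Values taken from the counter table in this report.
counters = reconcile(
    total_chunks=333_537,
    vector_rows=17_584,
    matched_vectors=17_584,
)
```

With the reported values this yields 0 stale vectors and 315,953 unvectorized chunks, matching the table.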

04 · Boundary

What this does not unlock

This is a local source-family embedding proof, not a production release. Public runtime, real-user data, external writes, production telemetry, donor deletion, raw market dump embedding, malformed row embedding, and learner-linked payload embedding all remain blocked.

Labels remain routing metadata. The embedded text uses clean public occupation evidence; it does not promote source labels into unquestioned truth.

05 · Proof

Verification commands

uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family onet_occupation_evidence --include-repaired --limit 50
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family onet_occupation_evidence --include-repaired --dry-run --max-new 500 --selection-mode progressive
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family onet_occupation_evidence --include-repaired --max-new 500 --selection-mode progressive --allow-live-gemini --confirm-paid-api --workers 8 --timeout-seconds 60 --write-every 250
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
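
The six commands above could be chained so that a failing step stops the ladder before the paid live run. The sketch below builds the same argument lists and runs them in order. It assumes the CLI exits non-zero when a gate fails, which this report does not confirm; `ladder_commands` and `run_ladder` are hypothetical helper names.

```python
import subprocess

ROOT = "/srv/aina/aina-data-engine-room"
FAMILY = "onet_occupation_evidence"
BASE = ["uv", "run", "aina-data-engine", "--root", ROOT]


def ladder_commands():
    """Return the six verification commands from this report, in order."""
    family = ["--source-family", FAMILY, "--include-repaired"]
    select = ["--max-new", "500", "--selection-mode", "progressive"]
    live = [
        "--allow-live-gemini", "--confirm-paid-api",
        "--workers", "8", "--timeout-seconds", "60", "--write-every", "250",
    ]
    return [
        BASE + ["production-embedding-semantic-qa"] + family + ["--limit", "50"],
        BASE + ["gemini-embedding-run"] + family + ["--dry-run"] + select,
        BASE + ["ain-506-p0-gate"],
        BASE + ["gemini-embedding-run"] + family + select + live,
        BASE + ["ain-510-retrieval-promotion-gate"],
        BASE + ["validate"],
    ]


def run_ladder(commands, runner=subprocess.run):
    """Run commands in order; stop at the first non-zero exit code.

    Returns the number of commands that completed successfully.
    Assumption: a failed gate is signalled by a non-zero exit code.
    """
    completed = 0
    for argv in commands:
        if runner(argv).returncode != 0:
            break
        completed += 1
    return completed
```

Calling `run_ladder(ladder_commands())` would execute the real commands, including the paid live embedding step, so the `runner` parameter is injectable to allow a dry rehearsal with a stub.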

06 · Next

What remains

The next clean move is to complete the remaining O*NET occupation family before moving to noisier sources. Expand the repaired corpus to all 2,828 rows, run the same semantic QA and dry-run checks, embed the remaining 2,328 candidates only if the gates stay green, then repeat AIN-510 and validation.
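
The scaling rule above can be expressed as a small guard: the remaining candidate count is released only when every gate is green. This is a minimal sketch under that reading; the `remaining_candidates` helper and the boolean gate map are hypothetical simplifications of the gate receipts listed earlier.

```python
def remaining_candidates(eligible_rows, embedded_rows, gates):
    """Return (rows cleared for the next run, names of failing gates).

    gates maps a gate name to True when that gate is green.
    If any gate is not green, no rows are cleared.
    """
    failing = [name for name, green in gates.items() if not green]
    if failing:
        return 0, failing
    return eligible_rows - embedded_rows, []


# Gate names follow this report; the boolean values are illustrative.
green_gates = {
    "semantic_qa": True,
    "ain_506_p0": True,
    "ain_510_retrieval_promotion": True,
    "validate": True,
}
cleared, blockers = remaining_candidates(2_828, 500, green_gates)
```

With all gates green and the counts from this checkpoint, the guard clears the remaining 2,328 rows; flipping any gate to `False` clears none and names the blocker.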

jobs_research_role remains blocked: its repaired rows failed semantic QA because they were title-only or store-number artifacts. That is the guardrail doing its job, not a technical failure.

Where to start

Resume by scaling onet_occupation_evidence to the full repaired family. Treat O*NET tasks as the next public source family only after generic-title repair has been strengthened.


topics:
  - aina-data-engine
  - personalization-engine
  - embeddings
subtopics:
  - onet-occupation-evidence
  - gemini-embedding-2
  - source-family-proof
  - retrieval-gate
