AINA Data Engine Room · Personalization Engine · 2026-06-15

O*NET Task 500 Embedding Checkpoint

The O*NET task family passed its first 500-vector proof, but only after a genuine semantic repair of the embedding input.

Ali Mehdi Mukadam · co-authored with Codex · 2026-06-15

The Single Idea

onet_task_evidence has passed its first live 500-vector proof, but only after Codex caught and fixed generic label leakage before calling Gemini.

01 · Change

What changed

The first manual sample exposed generic source labels like E0, E2, and analytics_only_nc_nd leaking into embedding titles and evidence prefixes. Codex stopped before live API usage, fixed the deterministic repair path, re-sampled real rows, and only then embedded the first 500 candidates.

Eligibility: 131,095 O*NET task rows, all repair-first.

Repair: 491 of 500 proof rows received generic-title replacement.

Embed: 500 new vectors, zero failed Gemini rows.
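
The leakage check described above can be sketched as a small pre-embedding filter. This is a minimal illustration, not the repair path Codex actually shipped: the regular expression, the function names, and the row shape are assumptions, built only from the three labels named in this report (E0, E2, analytics_only_nc_nd).

```python
import re

# Assumed pattern: the report names E0, E2 and analytics_only_nc_nd as
# examples of generic labels; the real rule set is not shown in the source.
GENERIC_LABEL = re.compile(r"^(E\d+|analytics_only_nc_nd)$")


def is_generic_title(title: str) -> bool:
    """Return True when a title is only a source label, not task meaning."""
    return bool(GENERIC_LABEL.match(title.strip()))


def leaking_rows(rows: list[dict]) -> list[dict]:
    """Rows whose title would leak a generic label into the embedding input."""
    return [row for row in rows if is_generic_title(row.get("title", ""))]


# Hypothetical sample rows for illustration only.
sample = [
    {"id": 1, "title": "E0"},
    {"id": 2, "title": "analytics_only_nc_nd"},
    {"id": 3, "title": "11-1011 task - Direct or coordinate an organization's financial or budget activities"},
]

flagged = leaking_rows(sample)
print([row["id"] for row in flagged])  # prints [1, 2]
```

Running a check like this on a manual sample before any live call is what lets the run stop early, as it did here, instead of paying for vectors built from label artifacts.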

02 · Repair

Why the repair mattered

The source text contains useful SOC, task, metric, and provenance signals, but source labels are not learner-facing truth. The repaired chunk now centers task meaning, not label artifacts.

title: 11-1011 task - Direct or coordinate an organization's financial or budget activities...
text: Evidence atlas onet task evidence: 11-1011 task - Direct or coordinate...
SOC: 11-1011. Task: Direct or coordinate...

The original labels remain provenance or metric context only. They are not title authority.
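
One way to read the repaired shape above is as a deterministic builder that derives the title from the SOC code and task text, and keeps the original label only as provenance. The sketch below assumes that reading; the function name, the field names, and the 80-character title cutoff are hypothetical, and the sample task string is just the visible portion of the example above.

```python
def build_repaired_chunk(soc: str, task: str, source_label: str,
                         max_title_task: int = 80) -> dict:
    """Build a chunk whose title and text come from task meaning only."""
    # Assumed truncation rule: shorten long task text in the title only.
    if len(task) <= max_title_task:
        short_task = task
    else:
        short_task = task[:max_title_task].rstrip() + "..."

    title = f"{soc} task - {short_task}"
    text = (
        f"Evidence atlas onet task evidence: {title}\n"
        f"SOC: {soc}. Task: {task}"
    )
    # The source label is kept as provenance context, never as title authority.
    return {
        "title": title,
        "text": text,
        "provenance": {"source_label": source_label},
    }


chunk = build_repaired_chunk(
    soc="11-1011",
    task="Direct or coordinate an organization's financial or budget activities",
    source_label="E0",
)
print(chunk["title"])
```

Because the builder has no randomness and no external calls, the same source row always yields the same chunk, which is what makes a re-sample after the repair meaningful.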

03 · Live Run

Live embedding result

Item	Result
Source family	`onet_task_evidence`
Family eligibility rows	131,095
Repaired proof sample size	500
Generic-title replacements	491
Semantic QA	50 / 50 pass
Dry-run candidates	500
Live embedded new vectors	500
Failed Gemini rows	0
Total Gemini vectors	20,412
O*NET task vectors	500
O*NET occupation vectors	2,828
Known-pair cosine gap	0.190463
Stale vectors	0
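
The report does not define how the known-pair cosine gap is computed. A common reading, assumed here, is the mean cosine similarity of pairs known to be related minus the mean for pairs known to be unrelated, so that a larger positive gap means better separation. The function names and toy vectors below are illustrative only and do not reproduce the 0.190463 figure.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def known_pair_gap(related_pairs, unrelated_pairs) -> float:
    """Assumed definition: mean related cosine minus mean unrelated cosine."""
    mean_related = sum(cosine(a, b) for a, b in related_pairs) / len(related_pairs)
    mean_unrelated = sum(cosine(a, b) for a, b in unrelated_pairs) / len(unrelated_pairs)
    return mean_related - mean_unrelated


# Toy two-dimensional vectors, not real embeddings.
related = [([1.0, 0.0], [1.0, 0.0])]
unrelated = [([1.0, 0.0], [0.0, 1.0])]
gap = known_pair_gap(related, unrelated)
print(round(gap, 6))  # prints 1.0
```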

04 · Gates

Validation state

Gate or receipt	Result
Focused tests	5 passed
Ruff on changed Python/test files	pass
Repaired semantic QA	50 / 50 pass
AIN-506 P0 gate	pass
AIN-510 retrieval promotion gate	promotion_ready
Chunk/vector reconciliation	pass
Source authority registry v2	pass
Runtime readiness	valid
Artifact exposure scan	pass, 0 active findings
Docs frontmatter	valid
Full validation	pass

Counter	Value
Combined chunk authority	336,365
Vector rows	20,412
Matched vectors	20,412
Stale vectors	0
Unvectorized chunks	315,953
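
The counters above are internally consistent: 336,365 chunks minus 20,412 matched vectors leaves 315,953 unvectorized chunks. A reconciliation of this kind can be sketched with set arithmetic over identifiers. The function name and the meaning given to "stale" (a vector with no matching chunk) are assumptions; the actual gate may instead compare content hashes.

```python
def reconcile(chunk_ids: set[str], vector_ids: set[str]) -> dict[str, int]:
    """Count matched, stale, and unvectorized items by identifier."""
    matched = chunk_ids & vector_ids
    stale = vector_ids - chunk_ids          # assumed meaning of "stale"
    unvectorized = chunk_ids - vector_ids
    return {
        "chunks": len(chunk_ids),
        "vectors": len(vector_ids),
        "matched": len(matched),
        "stale": len(stale),
        "unvectorized": len(unvectorized),
    }


# Toy identifiers for illustration.
result = reconcile({"c1", "c2", "c3", "c4"}, {"c1", "c2"})
print(result)

# Arithmetic check against the counters reported in the table.
assert 336_365 - 20_412 == 315_953
```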

05 · Boundary

What this does not unlock

This is a local 500-row proof for a large source family. It does not unlock full O*NET task embedding, runtime embedding authority, public runtime, real-user data, external writes, production telemetry, donor repo mutation, raw market dumps, malformed rows, or learner-linked payloads.

06 · Next

What remains

The next sensible step is a 5,000-row repaired O*NET task corpus, followed by another manual sample, semantic QA, and a dry run. Only if the content remains clean should a 5,000-row live rung follow. Do not batch-embed the full family yet.
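
The ordering described above (sample, semantic QA, dry run, then live) can be expressed as a gate runner that refuses to call the live embedder unless every earlier gate passes. This is a sketch of the control flow only; the gate names, the `run_rung` function, and the stand-in callables are hypothetical.

```python
from typing import Callable

Gate = tuple[str, Callable[[], bool]]


def run_rung(gates: list[Gate], embed_live: Callable[[], int]) -> dict:
    """Run gates in order and embed only if all of them pass."""
    for name, check in gates:
        if not check():
            # Stop before spending any live API calls.
            return {"status": "stopped", "failed_gate": name, "embedded": 0}
    return {"status": "embedded", "failed_gate": None, "embedded": embed_live()}


# Stand-in gates for illustration; real checks would inspect the corpus.
live_calls: list[int] = []
outcome = run_rung(
    gates=[
        ("manual_sample", lambda: True),
        ("semantic_qa", lambda: False),   # simulate a failed QA pass
        ("dry_run", lambda: True),
    ],
    embed_live=lambda: live_calls.append(1) or 5000,
)
print(outcome["status"], outcome["failed_gate"], len(live_calls))
# prints: stopped semantic_qa 0
```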

Where to start

Resume with a 5,000-row O*NET task repaired corpus and another semantic sample before spending more Gemini calls.

topics:
  - aina-data-engine
  - personalization-engine
  - embeddings
subtopics:
  - onet-task-evidence
  - repair-before-embed
  - gemini-embedding-2
  - retrieval-gate
