AINA Data Engine Room · Personalization Engine · 2026-06-15

O*NET Task 500 Embedding Checkpoint

The O*NET task family passed its first 500-vector embedding proof, after a semantic repair of its titles and evidence prefixes.

Ali Mehdi Mukadam · co-authored with Codex · 2026-06-15

The Single Idea

onet_task_evidence has passed its first live 500-vector proof, but only because Codex caught and fixed generic label leakage before any Gemini call was made.

01 · Change

What changed

The first manual sample exposed generic source labels like E0, E2, and analytics_only_nc_nd leaking into embedding titles and evidence prefixes. Codex stopped before live API usage, fixed the deterministic repair path, re-sampled real rows, and only then embedded the first 500 candidates.
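
The leak check described above can be sketched as a simple pre-flight scan over candidate titles. This is a minimal illustration, not the project's actual repair code: the function names are hypothetical, and generalizing E0 and E2 to "E followed by digits" is an assumption made here for the example.

```python
import re

# Assumption: generic labels look like "E<digits>" or the literal
# "analytics_only_nc_nd". Only E0, E2, and analytics_only_nc_nd are
# named in this report; a real list would come from the source registry.
GENERIC_LABEL_RE = re.compile(r"\b(?:E\d+|analytics_only_nc_nd)\b")


def find_label_leaks(title: str) -> list[str]:
    """Return every generic source label found in an embedding title."""
    return GENERIC_LABEL_RE.findall(title)


def needs_title_repair(title: str) -> bool:
    """True when a title carries a label artifact instead of task meaning."""
    return bool(find_label_leaks(title))
```

Running a scan like this on the sample before any live call keeps the failure cheap: a leaked label costs a re-sample, not a batch of paid embeddings.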

Eligibility: 131,095 O*NET task rows, all repair-first.
Repair: 491 of 500 proof rows received generic-title replacement.
Embed: 500 new vectors, zero failed Gemini rows.

02 · Repair

Why the repair mattered

The source text contains useful SOC, task, metric, and provenance signals, but source labels are not learner-facing truth. The repaired chunk now centers task meaning, not label artifacts.

title: 11-1011 task - Direct or coordinate an organization's financial or budget activities...
text: Evidence atlas onet task evidence: 11-1011 task - Direct or coordinate...
SOC: 11-1011. Task: Direct or coordinate...

The original labels remain provenance or metric context only. They are not title authority.
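
A deterministic repair of this shape could look like the sketch below. It only mirrors the layout of the sample chunk above; the class, function, and field names are hypothetical, and treating the "SOC: ... Task: ..." line as part of the chunk text is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepairedChunk:
    title: str
    text: str
    # Original source labels are kept for provenance only, never in the title.
    provenance_labels: tuple[str, ...] = ()


def build_repaired_chunk(
    soc: str,
    task: str,
    source_labels: tuple[str, ...] = (),
    family: str = "onet task evidence",
) -> RepairedChunk:
    """Build title and text from SOC code and task statement only."""
    title = f"{soc} task - {task}"
    text = f"Evidence atlas {family}: {title}\nSOC: {soc}. Task: {task}"
    return RepairedChunk(title=title, text=text, provenance_labels=source_labels)
```

Because the title is derived only from the SOC code and the task statement, the same input row always produces the same chunk, and a label such as E0 can appear only in the provenance field.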

03 · Live Run

Live embedding result

Item                          Result
Source family                 onet_task_evidence
Family eligibility rows       131,095
Repaired proof sample size    500
Generic-title replacements    491
Semantic QA                   50 / 50 pass
Dry-run candidates            500
Live embedded new vectors     500
Failed Gemini rows            0
Total Gemini vectors          20,412
O*NET task vectors            500
O*NET occupation vectors      2,828
Known-pair cosine gap         0.190463
Stale vectors                 0
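
This report does not define how the known-pair cosine gap is computed. One plausible reading, shown here purely as an assumption, is the mean cosine similarity of pairs known to be related minus the mean for pairs known to be unrelated. The function names and the pair lists are hypothetical.

```python
import math
from collections.abc import Sequence

Vector = Sequence[float]
Pair = tuple[Vector, Vector]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def known_pair_gap(related: list[Pair], unrelated: list[Pair]) -> float:
    """Mean similarity of related pairs minus mean similarity of unrelated pairs."""
    mean_related = sum(cosine(a, b) for a, b in related) / len(related)
    mean_unrelated = sum(cosine(a, b) for a, b in unrelated) / len(unrelated)
    return mean_related - mean_unrelated
```

Under this reading, a positive gap means related pairs sit closer together than unrelated ones, and a gap near zero would suggest the vectors are not separating meaning.
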
04 · Gates

Validation state

Gate or receipt                      Result
Focused tests                        5 passed
Ruff on changed Python/test files    pass
Repaired semantic QA                 50 / 50 pass
AIN-506 P0 gate                      pass
AIN-510 retrieval promotion gate     promotion_ready
Chunk/vector reconciliation          pass
Source authority registry v2         pass
Runtime readiness                    valid
Artifact exposure scan               pass, 0 active findings
Docs frontmatter                     valid
Full validation                      pass

Counter                     Value
Combined chunk authority    336,365
Vector rows                 20,412
Matched vectors             20,412
Stale vectors               0
Unvectorized chunks         315,953
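
The counters are internally consistent: 20,412 matched vectors plus 315,953 unvectorized chunks equals the 336,365 chunks under authority. A reconciliation of this kind can be sketched with set operations on IDs. The function name is hypothetical, and defining a stale vector as one whose chunk ID is no longer in the authority set is an assumption; the real check may compare content hashes instead.

```python
def reconcile(chunk_ids: set[str], vector_ids: set[str]) -> dict[str, int]:
    """Compare the chunk authority set against the stored vector set."""
    matched = chunk_ids & vector_ids        # vectors backed by a current chunk
    stale = vector_ids - chunk_ids          # vectors with no current chunk (assumed meaning)
    unvectorized = chunk_ids - vector_ids   # chunks not yet embedded
    return {
        "combined_chunk_authority": len(chunk_ids),
        "vector_rows": len(vector_ids),
        "matched_vectors": len(matched),
        "stale_vectors": len(stale),
        "unvectorized_chunks": len(unvectorized),
    }
```

Two identities must hold for the gate to pass under these definitions: chunks equal matched plus unvectorized, and vector rows equal matched plus stale.
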
05 · Boundary

What this does not unlock

This is a local 500-row proof for a large source family. It does not unlock full O*NET task embedding, runtime embedding authority, public runtime, real-user data, external writes, production telemetry, donor repo mutation, raw market dumps, malformed rows, or learner-linked payloads.

06 · Next

What remains

The next sensible step is a 5,000-row repaired O*NET task corpus, followed by another manual sample, semantic QA, and a dry run. Only if the content remains clean should a 5,000-row live rung follow. Do not batch this family yet.
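
The ordering above amounts to a chain of gates in which the live call is reachable only when every earlier check has passed. A minimal sketch follows; the gate names and the `run_gates` helper are hypothetical and stand in for whatever checks the project actually runs.

```python
from typing import Callable

# A gate is a name plus a zero-argument check that returns True on pass.
Gate = tuple[str, Callable[[], bool]]


def run_gates(gates: list[Gate]) -> tuple[bool, list[str]]:
    """Run gates in order and stop at the first failure.

    Returns (all_passed, names_of_gates_that_passed).
    """
    passed: list[str] = []
    for name, check in gates:
        if not check():
            return False, passed
        passed.append(name)
    return True, passed
```

A caller would list the manual sample, semantic QA, and dry run as gates, and spend Gemini calls on the live rung only when `run_gates` returns `True`.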

Where to start

Resume with a 5,000-row repaired O*NET task corpus and another semantic sample before spending more Gemini calls.