O*NET Task 500 Embedding Checkpoint
The O*NET task family passed its first 500-vector proof, but only after a semantic repair to its embedding titles and evidence prefixes.
onet_task_evidence has passed its first live 500-vector proof, but only after Codex caught and fixed generic label leakage before calling Gemini.
What changed
The first manual sample exposed generic source labels like E0, E2, and analytics_only_nc_nd leaking into embedding titles and evidence prefixes. Codex stopped before live API usage, fixed the deterministic repair path, re-sampled real rows, and only then embedded the first 500 candidates.
Why the repair mattered
The source text contains useful SOC, task, metric, and provenance signals, but source labels are not learner-facing truth. The repaired chunk now centers task meaning, not label artifacts.
title: 11-1011 task - Direct or coordinate an organization's financial or budget activities...
text: Evidence atlas onet task evidence: 11-1011 task - Direct or coordinate...
SOC: 11-1011. Task: Direct or coordinate...
The original labels remain provenance or metric context only. They are not title authority.
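A minimal sketch of the kind of deterministic title repair described above, not the actual repair path. The row fields (`title`, `soc`, `task`), the label pattern, and the function names are all hypothetical; only the label examples (E0, E2, analytics_only_nc_nd) and the repaired title shape come from this report. The sample task text is abbreviated where the report elides it.

```python
import re

# Assumption: generic source labels look like "E<digits>" or the literal
# "analytics_only_nc_nd". The real label set may be larger.
GENERIC_LABEL = re.compile(r"^(E\d+|analytics_only_nc_nd)$")


def is_generic_title(title: str) -> bool:
    """True when the title is only a source label, not task meaning."""
    return bool(GENERIC_LABEL.match(title.strip()))


def repair_row(row: dict, max_task_chars: int = 80) -> dict:
    """Return a copy of the row with a task-centered title.

    The original label is kept as provenance and never used as the title.
    """
    repaired = dict(row)
    title = row.get("title", "")
    if not is_generic_title(title):
        return repaired
    task = row["task"].strip()
    if len(task) > max_task_chars:
        task = task[:max_task_chars].rstrip() + "..."
    repaired["title"] = f"{row['soc']} task - {task}"
    repaired["source_label"] = title.strip()  # provenance only
    return repaired


sample = {
    "title": "E0",
    "soc": "11-1011",
    "task": "Direct or coordinate an organization's financial or budget activities",
}
fixed = repair_row(sample)
print(fixed["title"])
```

Keeping the label in a separate provenance field, rather than deleting it, matches the stated intent that labels stay available as context without becoming title authority.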
Live embedding result
| Item | Result |
|---|---|
| Source family | onet_task_evidence |
| Family eligibility rows | 131,095 |
| Repaired proof sample size | 500 |
| Generic-title replacements | 491 |
| Semantic QA | 50 / 50 pass |
| Dry-run candidates | 500 |
| Live embedded new vectors | 500 |
| Failed Gemini rows | 0 |
| Total Gemini vectors | 20,412 |
| O*NET task vectors | 500 |
| O*NET occupation vectors | 2,828 |
| Known-pair cosine gap | 0.190463 |
| Stale vectors | 0 |
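
The report does not define how the known-pair cosine gap is computed. One plausible reading, offered here only as an assumption, is the mean cosine similarity of pairs known to be related minus the mean over pairs known to be unrelated. The sketch below uses that definition with toy vectors; the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def known_pair_gap(related_pairs, unrelated_pairs):
    """Assumed definition: mean related similarity minus mean unrelated similarity."""
    related = sum(cosine(a, b) for a, b in related_pairs) / len(related_pairs)
    unrelated = sum(cosine(a, b) for a, b in unrelated_pairs) / len(unrelated_pairs)
    return related - unrelated


# Toy vectors, not real embeddings.
related_pairs = [([1.0, 0.0], [1.0, 0.0])]
unrelated_pairs = [([1.0, 0.0], [0.0, 1.0])]
gap = known_pair_gap(related_pairs, unrelated_pairs)
print(gap)
```

Under this reading, a positive gap means related pairs sit closer together than unrelated ones, which is the direction a retrieval gate would want.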
Validation state
| Gate or receipt | Result |
|---|---|
| Focused tests | 5 passed |
| Ruff on changed Python/test files | pass |
| Repaired semantic QA | 50 / 50 pass |
| AIN-506 P0 gate | pass |
| AIN-510 retrieval promotion gate | promotion_ready |
| Chunk/vector reconciliation | pass |
| Source authority registry v2 | pass |
| Runtime readiness | valid |
| Artifact exposure scan | pass, 0 active findings |
| Docs frontmatter | valid |
| Full validation | pass |

| Counter | Value |
|---|---|
| Combined chunk authority | 336,365 |
| Vector rows | 20,412 |
| Matched vectors | 20,412 |
| Stale vectors | 0 |
| Unvectorized chunks | 315,953 |
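
The counters above are internally consistent: 20,412 matched vectors plus 315,953 unvectorized chunks equals the 336,365 chunks in the combined authority. A minimal sketch of a chunk/vector reconciliation follows, assuming each chunk and vector carries a shared identifier. The function name and the ID-set approach are illustrative and are not taken from the actual reconciliation gate.

```python
def reconcile(chunk_ids, vector_ids):
    """Compare chunk authority IDs against vector row IDs.

    matched      - vectors whose chunk still exists
    stale        - vectors with no corresponding chunk
    unvectorized - chunks with no vector yet
    """
    chunks = set(chunk_ids)
    vectors = set(vector_ids)
    return {
        "matched": len(chunks & vectors),
        "stale": len(vectors - chunks),
        "unvectorized": len(chunks - vectors),
    }


# Toy IDs to show the three buckets.
toy = reconcile(chunk_ids=["c1", "c2", "c3", "c4"], vector_ids=["c1", "c2", "old9"])
print(toy)

# Arithmetic check on the reported counters.
combined_chunks = 336_365
matched_vectors = 20_412
unvectorized_chunks = 315_953
counters_consistent = matched_vectors + unvectorized_chunks == combined_chunks
print(counters_consistent)
```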
What this does not unlock
This is a local 500-row proof for a large source family. It does not unlock full O*NET task embedding, runtime embedding authority, public runtime, real-user data, external writes, production telemetry, or donor repo mutation. Nor does it permit raw market dumps, malformed rows, or learner-linked payloads.
What remains
The next sensible step is a 5,000-row repaired O*NET task corpus, followed by another manual sample, semantic QA, a dry run, and only then a 5,000-row live rung if the content remains clean. Do not batch this family yet.
Resume with a 5,000-row O*NET task repaired corpus and another semantic sample before spending more Gemini calls.
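
The rung sequence above (repaired corpus, manual sample, semantic QA, dry run, then live embedding) can be sketched as a simple pre-flight check. This is an illustration under stated assumptions: the `RungEvidence` type, its fields, and the rule that dry-run candidates must equal the rung size are hypothetical, and the real gates may check more.

```python
from dataclasses import dataclass


@dataclass
class RungEvidence:
    rung_size: int             # rows planned for the live rung
    manual_sample_clean: bool  # manual sample showed no label leakage
    qa_passed: int             # semantic QA rows that passed
    qa_total: int              # semantic QA rows checked
    dry_run_candidates: int    # rows the dry run would embed


def live_rung_allowed(evidence: RungEvidence) -> bool:
    """Allow paid embedding calls only when every earlier step is clean."""
    return (
        evidence.manual_sample_clean
        and evidence.qa_total > 0
        and evidence.qa_passed == evidence.qa_total
        and evidence.dry_run_candidates == evidence.rung_size
    )


# The 500-row rung as reported: sample clean after repair, 50 / 50 QA, 500 dry-run candidates.
rung_500 = RungEvidence(500, True, 50, 50, 500)
# A hypothetical 5,000-row rung with a single QA failure stays blocked.
rung_5000 = RungEvidence(5000, True, 49, 50, 5000)
print(live_rung_allowed(rung_500), live_rung_allowed(rung_5000))
```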