O*NET Task 5k Embedding Checkpoint
The task family passed the 5,000-row progressive rung after a stronger repair caught subtler label contamination.
onet_task_evidence now has 5,000 clean repaired Gemini vectors, with zero failed rows and zero stale vectors, after Codex generalized the repair for metric and work-activity labels before the live API run.
What changed
The 5k sample exposed source labels that the first 500-row proof did not catch: n/a, mean_expert_rating, mean_worker_rating, worker_desire_minus_expert_capacity, and work-activity labels. Codex stopped before the live Gemini run, made the task repair general rather than label-list driven, and rebuilt the repaired corpus.
Live embedding result
| Item | Result |
|---|---|
| Source family | onet_task_evidence |
| Family eligibility rows | 131,095 |
| Repaired corpus size | 5,000 |
| Generic-title replacements | 5,000 |
| Dry-run new candidates | 4,708 |
| Existing valid task vectors retained | 292 |
| Old task vectors pruned | 208 |
| Live embedded new vectors | 4,708 |
| Failed Gemini rows | 0 |
| Total Gemini vectors | 24,912 |
| O*NET task vectors | 5,000 |
| Known-pair cosine gap | 0.190463 |
| Stale vectors | 0 |
The run used gemini-embedding-2 at 768 dimensions through Vertex ADC on paid project aina-495702. Runtime embedding authority remains disabled.
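The counts in the table above can be cross-checked with simple arithmetic. The sketch below is illustrative only: the dictionary keys are hypothetical names chosen for this example, not fields from the engine's receipts, and the reading of "retained plus pruned" as the task-vector count before this run is an assumption.

```python
# Counts copied from the live embedding result table.
# Key names are hypothetical; they are not the engine's receipt fields.
counts = {
    "repaired_corpus": 5000,
    "dry_run_new": 4708,
    "retained": 292,
    "pruned": 208,
    "live_embedded": 4708,
    "failed": 0,
    "task_vectors": 5000,
}


def check_embedding_counts(c: dict) -> list[str]:
    """Return a list of mismatches; an empty list means the counts reconcile."""
    problems = []
    if c["dry_run_new"] + c["retained"] != c["repaired_corpus"]:
        problems.append("new candidates + retained != repaired corpus size")
    if c["live_embedded"] + c["failed"] != c["dry_run_new"]:
        problems.append("embedded + failed != dry-run candidates")
    if c["retained"] + c["live_embedded"] != c["task_vectors"]:
        problems.append("retained + embedded != O*NET task vectors")
    return problems


problems = check_embedding_counts(counts)
# Assumption: retained + pruned is the task-vector count before this run.
prior_task_vectors = counts["retained"] + counts["pruned"]
print(problems or "counts reconcile")
print("task vectors before this run:", prior_task_vectors)
```

Under that assumption the prior count comes to 500, which is consistent with the earlier 500-row proof.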
Validation state
| Gate or receipt | Result |
|---|---|
| Focused O*NET/semantic-QA tests | 6 passed |
| Ruff on changed files | pass |
| Semantic QA | 50 / 50 pass |
| AIN-506 P0 gate | pass |
| AIN-510 retrieval gate | promotion_ready |
| Chunk-vector reconciliation | pass |
| Source-authority registry v2 | pass |
| Runtime readiness | headless hardening ready |
| Artifact exposure scan | 0 active findings |
| Full validate | pass |

| Current authority count | Value |
|---|---|
| Base chunks | 294,675 |
| Repaired distinct authority chunks | 46,190 |
| Combined chunk authority | 340,865 |
| Matched vectors | 24,912 |
| Unvectorized chunks | 315,953 |
| Source-authority registry rows | 35 |
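
The authority counts follow from one another, and the same arithmetic gives the share of combined chunks that currently have a matched vector. This is a worked example using the table's figures; the variable names are illustrative and the coverage ratio is a derived number, not a reported gate.

```python
# Figures copied from the current authority count table.
base_chunks = 294_675
repaired_chunks = 46_190
matched_vectors = 24_912

combined = base_chunks + repaired_chunks   # combined chunk authority
unvectorized = combined - matched_vectors  # chunks without a matched vector
coverage = matched_vectors / combined      # derived share, not a reported gate

print(f"combined={combined:,} unvectorized={unvectorized:,} coverage={coverage:.2%}")
```

Roughly 7.3% of combined chunk authority is vectorized at this checkpoint.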
Why the second repair mattered
The first proof removed obvious labels like E0, E2, and analytics_only_nc_nd. The 5k proof found subtler labels that looked like metrics or source columns while still carrying real SOC/task meaning. The repair now centers the real task text whenever a task statement is parseable.
title: 11-3051 task - Document testing procedures, methodologies, or criteria.
text: Evidence atlas onet task evidence: 11-3051 task - Document testing procedures...
SOC: 11-3051. Task: Document testing procedures, methodologies, or criteria.
The original labels remain source context only. They are not title authority and do not lead the embedding body.
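A minimal sketch of that rule, assuming a row exposes a SOC code, a task statement, and the original source label. The field names (`soc_code`, `task_statement`, `source_label`), the `repair_title` helper, and the parseability test are all hypothetical; this is not the repair Codex shipped, only an illustration of centering the task text and demoting the label to context.

```python
import re

# SOC codes in the example above look like "11-3051".
SOC_PATTERN = re.compile(r"^\d{2}-\d{4}$")


def repair_title(row: dict) -> dict:
    """Build a task-centered title; keep the source label as context only."""
    soc = (row.get("soc_code") or "").strip()
    task = (row.get("task_statement") or "").strip()
    label = (row.get("source_label") or "").strip()

    # Assumption: "parseable" means a well-formed SOC code plus a statement
    # of at least two words. The real criterion is not given in this note.
    parseable = bool(SOC_PATTERN.match(soc)) and len(task.split()) >= 2
    if not parseable:
        return {"title": None, "body_lead": None,
                "source_context": label, "repaired": False}

    return {
        "title": f"{soc} task - {task}",
        "body_lead": f"SOC: {soc}. Task: {task}",
        "source_context": label,  # never used as the title or body lead
        "repaired": True,
    }


row = {
    "soc_code": "11-3051",
    "task_statement": "Document testing procedures, methodologies, or criteria.",
    "source_label": "mean_expert_rating",
}
print(repair_title(row)["title"])
```

Because the rule keys on the task statement rather than on a list of known bad labels, a new metric-style label would not need a code change to be handled.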
Product boundary
This is a local 5,000-row progressive proof, not a batch unlock. Batch submission, public runtime, real-user data, external writes, production telemetry, donor deletion, raw market dump embedding, malformed row embedding, learner-linked payload embedding, and runtime embedding authority promotion all remain blocked.
Exact cosine remains the retrieval source of truth. Deterministic fallbacks and source authority still govern runtime decisions.
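For readers unfamiliar with the term, the sketch below shows what exact cosine retrieval means in general: every stored vector is scored against the query, with no approximate index in between. It assumes that "exact" here means this brute-force comparison; the function names and toy vectors are illustrative and unrelated to the engine's code.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_top_k(query: list[float], vectors: dict, k: int = 3) -> list[tuple]:
    """Score every vector against the query and return the k best (key, score)."""
    scored = [(cosine(query, vec), key) for key, vec in vectors.items()]
    # Sort by descending score, then by key so ties are deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(key, score) for score, key in scored[:k]]


# Toy 2-dimensional vectors; the run described above used 768 dimensions.
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
results = exact_top_k([1.0, 0.0], toy_vectors, k=2)
print(results)
```

The deterministic tie-break keeps repeated queries reproducible, which matters when retrieval output feeds a gate.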
Where to continue
The next sensible rung is a 25,000-row repaired corpus with a fresh semantic sample, dry run, live run, and post-live AIN-510/reconciliation stack. If that stays clean, O*NET task evidence can become a batch candidate for the remaining family.
cd /srv/aina/aina-data-engine-room
git status --short --branch
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repaired-corpus --source-family onet_task_evidence --limit 25000 --shard-size 2500
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family onet_task_evidence --include-repaired --limit 50
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family onet_task_evidence --include-repaired --dry-run --max-new 25000 --selection-mode progressive
Start with the 25k repaired dry run. Do not batch O*NET task evidence until the next progressive rung stays clean.
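The progressive-rung rule can be sketched as a small decision function. The definition of "clean" used here (zero failed rows, zero stale vectors, every semantic QA sample passing) is an assumption drawn from the results reported above; the real gate stack also includes AIN-510 and chunk-vector reconciliation, which this sketch does not model. `RungResult`, `rung_is_clean`, and `next_step` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RungResult:
    rows: int
    failed_rows: int
    stale_vectors: int
    qa_passed: int
    qa_total: int


def rung_is_clean(r: RungResult) -> bool:
    """Assumed definition of clean: no failures, no stale vectors, full QA pass."""
    return (
        r.failed_rows == 0
        and r.stale_vectors == 0
        and r.qa_total > 0
        and r.qa_passed == r.qa_total
    )


def next_step(r: RungResult, ladder: tuple = (500, 5000, 25000)) -> str:
    """Advance one rung only after a clean result; never skip straight to batch."""
    if not rung_is_clean(r):
        return "repair and rerun rung"
    larger = [n for n in ladder if n > r.rows]
    return f"run {larger[0]}-row rung" if larger else "batch candidate"


current = RungResult(rows=5000, failed_rows=0, stale_vectors=0,
                     qa_passed=50, qa_total=50)
print(next_step(current))
```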