AINA Data Engine Room · Personalization Engine · 2026-06-15

O*NET Task 5k Embedding Checkpoint

The task family passed the 5,000-row progressive rung after a stronger repair caught subtler label contamination.

Ali Mehdi Mukadam · co-authored with Codex · 2026-06-15

The Single Idea

onet_task_evidence now has 5,000 clean repaired Gemini vectors, with zero failed rows and zero stale vectors, after Codex generalized the repair for metric and work-activity labels before the live API run.

01 · Change

What changed

The 5k sample exposed source labels that the first 500-row proof did not catch: n/a, mean_expert_rating, mean_worker_rating, worker_desire_minus_expert_capacity, and work-activity labels. Codex stopped before the live Gemini run, made the task repair general rather than driven by a fixed label list, and rebuilt the repaired corpus.

Repair: 5,000 of 5,000 rows received generic-title replacement and placeholder cleanup.
Embed: 4,708 new vectors; 292 existing repaired task vectors retained.
Safety: 0 failed Gemini rows, 0 stale vectors, runtime authority still false.

02 · Live Run

Live embedding result

Item: Result
Source family: onet_task_evidence
Family eligibility rows: 131,095
Repaired corpus size: 5,000
Generic-title replacements: 5,000
Dry-run new candidates: 4,708
Existing valid task vectors retained: 292
Old task vectors pruned: 208
Live embedded new vectors: 4,708
Failed Gemini rows: 0
Total Gemini vectors: 24,912
O*NET task vectors: 5,000
Known-pair cosine gap: 0.190463
Stale vectors: 0

The run used gemini-embedding-2 at 768 dimensions through Vertex ADC on paid project aina-495702. Runtime embedding authority remains disabled.
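
This report does not define the known-pair cosine gap listed in the table. One plausible reading, assumed here and not confirmed by the source, is the mean cosine similarity of known-related pairs minus the mean cosine similarity of control pairs. The function names and the toy vectors below are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def known_pair_gap(known_pairs, control_pairs):
    """Mean cosine of known pairs minus mean cosine of control pairs.

    Assumption: this is only one possible definition of the metric.
    """
    known = [cosine(a, b) for a, b in known_pairs]
    control = [cosine(a, b) for a, b in control_pairs]
    return sum(known) / len(known) - sum(control) / len(control)


# Toy 3-dimensional vectors; the real run used 768 dimensions.
gap = known_pair_gap(
    known_pairs=[([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])],
    control_pairs=[([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])],
)
```

Under this reading, a larger positive gap means related pairs sit measurably closer together than unrelated ones.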

03 · Proof

Validation state

Gate or receipt: Result
Focused O*NET/semantic-QA tests: 6 passed
Ruff on changed files: pass
Semantic QA: 50 / 50 pass
AIN-506 P0 gate: pass
AIN-510 retrieval gate: promotion_ready
Chunk-vector reconciliation: pass
Source-authority registry v2: pass
Runtime readiness: headless hardening ready
Artifact exposure scan: 0 active findings
Full validate: pass

Current authority count: Value
Base chunks: 294,675
Repaired distinct authority chunks: 46,190
Combined chunk authority: 340,865
Matched vectors: 24,912
Unvectorized chunks: 315,953
Source-authority registry rows: 35
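
The reported counts are internally consistent, which can be checked with plain arithmetic. The sketch below is not the chunk-vector reconciliation gate itself; it is a hypothetical helper that re-derives the totals from the figures in this report, assuming the combined authority is base plus repaired chunks and unvectorized chunks are the combined total minus matched vectors.

```python
def reconcile_counts(counts):
    """Return a list of human-readable mismatches; empty means consistent."""
    problems = []
    # Combined authority should be base chunks plus repaired chunks.
    if counts["base"] + counts["repaired"] != counts["combined"]:
        problems.append("base + repaired != combined")
    # Chunks without a vector are whatever the matched vectors do not cover.
    if counts["combined"] - counts["matched"] != counts["unvectorized"]:
        problems.append("combined - matched != unvectorized")
    # The 5k task vectors are the new live vectors plus the retained ones.
    if counts["new_vectors"] + counts["retained"] != counts["task_vectors"]:
        problems.append("new + retained != task vectors")
    return problems


reported = {
    "base": 294_675,
    "repaired": 46_190,
    "combined": 340_865,
    "matched": 24_912,
    "unvectorized": 315_953,
    "new_vectors": 4_708,
    "retained": 292,
    "task_vectors": 5_000,
}
problems = reconcile_counts(reported)
```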

04 · Repair

Why the second repair mattered

The first proof removed obvious labels like E0, E2, and analytics_only_nc_nd. The 5k proof found subtler labels that looked like metrics or source columns while still carrying real SOC/task meaning. The repair now centers the real task text whenever a task statement is parseable.

title: 11-3051 task - Document testing procedures, methodologies, or criteria.
text: Evidence atlas onet task evidence: 11-3051 task - Document testing procedures...
SOC: 11-3051. Task: Document testing procedures, methodologies, or criteria.

The original labels remain source context only. They are not title authority and do not lead the embedding body.
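
A minimal sketch of that idea, assuming titles follow the `<SOC> task - <statement>` shape shown in the example above. The regular expression, the function name, and the `Source context:` suffix are hypothetical and are not taken from the repair code; the sketch reproduces only the `SOC: ... Task: ...` line of the example.

```python
import re

# Assumed title shape: "11-3051 task - Document testing procedures, ..."
TITLE_RE = re.compile(r"^(?P<soc>\d{2}-\d{4})\s+task\s+-\s+(?P<task>\S.*)$")


def build_embedding_body(title, source_labels=()):
    """Lead with the real task text; keep source labels as trailing context.

    Returns None when no task statement can be parsed, so the caller can
    decide how to handle the row instead of embedding a bare label.
    """
    match = TITLE_RE.match(title.strip())
    if match is None:
        return None
    body = f"SOC: {match.group('soc')}. Task: {match.group('task').strip()}"
    if source_labels:
        # Labels never lead the body; they only follow the task text.
        body += " Source context: " + ", ".join(source_labels) + "."
    return body


body = build_embedding_body(
    "11-3051 task - Document testing procedures, methodologies, or criteria.",
    source_labels=("mean_expert_rating",),
)
```

Because the decision depends on whether a task statement parses, not on a list of known bad labels, a new metric-like label needs no code change to be demoted to context.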

05 · Boundary

Product boundary

This is a local 5,000-row progressive proof, not a batch unlock. Batch submission, public runtime, real-user data, external writes, production telemetry, donor deletion, raw market dump embedding, malformed row embedding, learner-linked payload embedding, and runtime embedding authority promotion all remain blocked.

Exact cosine remains the retrieval source of truth. Deterministic fallbacks and source authority still govern runtime decisions.
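
For illustration, exact cosine retrieval can be read as a brute-force scan that scores every stored vector against the query, as opposed to an approximate nearest-neighbor index. That reading is an assumption, and the function names and toy data below are hypothetical.

```python
import math


def _unit(vector):
    """Scale a vector to unit length so a dot product equals cosine."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def exact_top_k(query, vectors, k=3):
    """Score every stored vector against the query and return the best k.

    `vectors` maps a chunk id to its embedding. Ties are broken by chunk id
    so the ranking is deterministic.
    """
    q = _unit(query)
    scored = []
    for chunk_id, vector in vectors.items():
        score = sum(a * b for a, b in zip(q, _unit(vector)))
        scored.append((score, chunk_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(chunk_id, score) for score, chunk_id in scored[:k]]


# Toy 2-dimensional store; real vectors in this run have 768 dimensions.
store = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
top = exact_top_k([1.0, 0.0], store, k=2)
```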

06 · Resume

Where to continue

The next sensible rung is a 25,000-row repaired corpus with a fresh semantic sample, dry run, live run, and post-live AIN-510/reconciliation stack. If that stays clean, O*NET task evidence can become a batch candidate for the rest of the family.

cd /srv/aina/aina-data-engine-room
git status --short --branch
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repaired-corpus --source-family onet_task_evidence --limit 25000 --shard-size 2500
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family onet_task_evidence --include-repaired --limit 50
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family onet_task_evidence --include-repaired --dry-run --max-new 25000 --selection-mode progressive

Where to start

Start with the 25k repaired dry run. Do not batch O*NET task evidence until the next progressive rung stays clean.
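
The progressive-rung rule can be sketched as a small gate: move to the next rung only when the current one is clean. The ladder values (500, 5,000, 25,000) come from this report; the `RungResult` type, the function names, and the choice of which checks count as "clean" are assumptions, and the real gate stack includes more checks than these.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RungResult:
    rows: int
    failed_rows: int
    stale_vectors: int
    semantic_qa_passed: int
    semantic_qa_total: int


def rung_is_clean(result):
    """Assumed subset of gates: no failures, no stale vectors, full QA pass."""
    return (
        result.failed_rows == 0
        and result.stale_vectors == 0
        and result.semantic_qa_total > 0
        and result.semantic_qa_passed == result.semantic_qa_total
    )


def next_rung(result, ladder=(500, 5_000, 25_000)):
    """Return the next larger rung size, or None if blocked or at the top."""
    if not rung_is_clean(result):
        return None
    larger = [size for size in ladder if size > result.rows]
    return larger[0] if larger else None


# Figures reported for the 5k rung in this checkpoint.
current = RungResult(
    rows=5_000,
    failed_rows=0,
    stale_vectors=0,
    semantic_qa_passed=50,
    semantic_qa_total=50,
)
upcoming = next_rung(current)
```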