AINA Data Engine Room · Personalization Engine · 2026-06-15

O*NET Task 5k Embedding Checkpoint

The task family passed the 5,000-row progressive rung after a stronger repair caught subtler label contamination.

Ali Mehdi Mukadam · co-authored with Codex · 2026-06-15

The Single Idea

onet_task_evidence now has 5,000 clean repaired Gemini vectors, with zero failed rows and zero stale vectors, after Codex generalized the repair for metric and work-activity labels before the live API run.

01 · Change

What changed

The 5k sample exposed source labels that the first 500-row proof did not catch: n/a, mean_expert_rating, mean_worker_rating, worker_desire_minus_expert_capacity, and work-activity labels. Codex stopped before the live Gemini run, made the task repair general rather than label-list driven, and rebuilt the repaired corpus.

Repair: 5,000 of 5,000 rows received generic-title replacement and placeholder cleanup.

Embed: 4,708 new vectors; 292 existing repaired task vectors retained.

Safety: 0 failed Gemini rows, 0 stale vectors, runtime authority still false.

02 · Live Run

Live embedding result

Item	Result
Source family	`onet_task_evidence`
Family eligibility rows	131,095
Repaired corpus size	5,000
Generic-title replacements	5,000
Dry-run new candidates	4,708
Existing valid task vectors retained	292
Old task vectors pruned	208
Live embedded new vectors	4,708
Failed Gemini rows	0
Total Gemini vectors	24,912
O*NET task vectors	5,000
Known-pair cosine gap	0.190463
Stale vectors	0

The run used gemini-embedding-2 at 768 dimensions through Vertex ADC on paid project aina-495702. Runtime embedding authority remains disabled.
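
The vector counts in the table above reconcile arithmetically. A minimal Python check, using only the reported figures; the variable and function names are illustrative and are not engine fields:

```python
# Figures copied from the live-run table; names are illustrative only.
corpus_size = 5_000
new_vectors = 4_708
retained_vectors = 292
pruned_vectors = 208


def reconcile_run(corpus: int, new: int, retained: int) -> bool:
    """Return True when new plus retained vectors cover the corpus exactly."""
    return new + retained == corpus


covered = reconcile_run(corpus_size, new_vectors, retained_vectors)

# Task vectors that existed before this run: those kept plus those pruned.
prior_task_vectors = retained_vectors + pruned_vectors

print(covered, prior_task_vectors)
```

The retained and pruned counts sum to 500, which is consistent with the earlier 500-row proof, assuming every vector from that proof was either retained or pruned.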

03 · Proof

Validation state

Gate or receipt	Result
Focused O*NET/semantic-QA tests	6 passed
Ruff on changed files	pass
Semantic QA	50 / 50 pass
AIN-506 P0 gate	pass
AIN-510 retrieval gate	promotion_ready
Chunk-vector reconciliation	pass
Source-authority registry v2	pass
Runtime readiness	headless hardening ready
Artifact exposure scan	0 active findings
Full validate	pass

Current authority count	Value
Base chunks	294,675
Repaired distinct authority chunks	46,190
Combined chunk authority	340,865
Matched vectors	24,912
Unvectorized chunks	315,953
Source-authority registry rows	35
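
The authority counts can be cross-checked the same way. This sketch only restates the table's arithmetic; the coverage ratio is a derived figure, not a reported one, and the variable names are illustrative:

```python
# Figures copied from the authority-count table; names are illustrative only.
base_chunks = 294_675
repaired_chunks = 46_190
matched_vectors = 24_912

# Combined authority is base plus repaired distinct chunks.
combined = base_chunks + repaired_chunks

# Unvectorized chunks are whatever the matched vectors do not cover.
unvectorized = combined - matched_vectors

# Derived for orientation only: share of combined authority with a vector.
coverage_pct = round(100 * matched_vectors / combined, 1)

print(combined, unvectorized, coverage_pct)
```

Both derived totals match the table (340,865 and 315,953), and roughly 7.3% of combined chunk authority currently has a matched vector.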

04 · Repair

Why the second repair mattered

The first proof removed obvious labels like E0, E2, and analytics_only_nc_nd. The 5k proof found subtler labels that looked like metrics or source columns while still carrying real SOC/task meaning. The repair now centers the real task text whenever a task statement is parseable.

title: 11-3051 task - Document testing procedures, methodologies, or criteria.
text: Evidence atlas onet task evidence: 11-3051 task - Document testing procedures...
SOC: 11-3051. Task: Document testing procedures, methodologies, or criteria.

The original labels remain source context only. They are not title authority and do not lead the embedding body.
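
A minimal sketch of the "center the real task text" rule, under stated assumptions. It is not the engine's repair code: the function name `repair_title`, the regex, and the assumption that raw text carries a `SOC: <code>. Task: <statement>` line are all hypothetical, chosen to mirror the example above.

```python
import re

# Hypothetical pattern: assumes the raw text contains "SOC: <code>. Task: <statement>".
TASK_RE = re.compile(r"SOC:\s*(?P<soc>\d{2}-\d{4})\.\s*Task:\s*(?P<task>.+)")


def repair_title(raw_title: str, raw_text: str) -> tuple[str, list[str]]:
    """Return (title, source_context).

    When a task statement parses, the title is rebuilt from the SOC code and
    the task text, and the original label is demoted to source context.
    When nothing parses, the raw title is returned unchanged.
    """
    match = TASK_RE.search(raw_text)
    if match is None:
        return raw_title, []
    title = f"{match['soc']} task - {match['task'].strip()}"
    return title, [raw_title]


title, context = repair_title(
    "mean_expert_rating",
    "SOC: 11-3051. Task: Document testing procedures, methodologies, or criteria.",
)
print(title)
print(context)
```

Because the rule keys on whether a task statement parses, rather than on a list of known bad labels, a label never seen before is handled the same way as `mean_expert_rating`.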

05 · Boundary

Product boundary

This is a local 5,000-row progressive proof, not a batch unlock. Batch submission, public runtime, real-user data, external writes, production telemetry, donor deletion, raw market dump embedding, malformed row embedding, learner-linked payload embedding, and runtime embedding authority promotion all remain blocked.

Exact cosine remains the retrieval source of truth. Deterministic fallbacks and source authority still govern runtime decisions.
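
For reference, "exact cosine" means scoring the query against every stored vector with no approximate index in between. A minimal pure-Python sketch; the function names and the toy vectors are illustrative and are not the engine's retrieval code:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_top_k(query: list[float], vectors: dict[str, list[float]], k: int = 3):
    """Score every stored vector against the query and return the k best (key, score) pairs."""
    scored = [(key, cosine(query, vec)) for key, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional vectors for illustration; real vectors here are 768-dimensional.
store = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
top = exact_top_k([1.0, 0.0], store, k=2)
print(top)
```

The full scan is linear in the number of stored vectors, which is the trade-off accepted for having no approximation error in the ranking.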

06 · Resume

Where to continue

The next sensible rung is a 25,000-row repaired corpus with a fresh semantic sample, dry run, live run, and post-live AIN-510/reconciliation stack. If that stays clean, O*NET task evidence can become a batch candidate for the remaining family.

cd /srv/aina/aina-data-engine-room
git status --short --branch
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repaired-corpus --source-family onet_task_evidence --limit 25000 --shard-size 2500
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family onet_task_evidence --include-repaired --limit 50
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family onet_task_evidence --include-repaired --dry-run --max-new 25000 --selection-mode progressive

Where to start

Start with the 25k repaired dry run. Do not batch O*NET task evidence until the next progressive rung stays clean.
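
One way to make "stays clean" concrete is a single boolean over the receipts reported for a rung. This is a hypothetical sketch: the function `rung_is_clean` and the choice of which gates to include are assumptions, not the engine's promotion logic.

```python
def rung_is_clean(
    failed_rows: int,
    stale_vectors: int,
    qa_passed: int,
    qa_total: int,
    gates: dict[str, bool],
) -> bool:
    """Hypothetical gate: all counters are clean and every named gate passed."""
    return (
        failed_rows == 0
        and stale_vectors == 0
        and qa_passed == qa_total
        and all(gates.values())
    )


# Values taken from the 5,000-row receipts; gate selection is an assumption.
gates_5k = {
    "AIN-506 P0": True,
    "AIN-510 retrieval": True,  # reported as promotion_ready
    "chunk-vector reconciliation": True,
    "full validate": True,
}
clean_5k = rung_is_clean(0, 0, 50, 50, gates_5k)
print(clean_5k)
```

Under this reading, a single failed row, stale vector, semantic-QA miss, or failing gate at the 25k rung would keep the family out of batch candidacy.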


topics:
  - aina-data-engine
  - personalization-engine
  - embeddings
subtopics:
  - onet-task-evidence
  - repair-before-embed
  - gemini-embedding-2
  - retrieval-gate
