AINA Data Engine Room · Personalization Engine · 2026-06-15

O*NET Task 25k Embedding Checkpoint

The O*NET task family passed a large foreground proof and is now ready for guarded batch-candidate work.

Ali Mehdi Mukadam · co-authored with Codex · 2026-06-15

The Single Idea

onet_task_evidence passed the 25,000-row progressive proof: 20,000 new Gemini vectors, zero failed rows, zero stale vectors, and AIN-510 still promotion_ready.

01 · Change

What changed

Codex built a 25,000-row repaired corpus and verified that every row received generic-title replacement and placeholder cleanup. It then ran literal scans for bad label prefixes, passed semantic QA, dry-ran candidate selection, confirmed the AIN-506 P0 gate, and ran the foreground Vertex ADC embedding job.

Repair: 25,000 repaired chunks, zero skipped repairs.
Embed: 20,000 new O*NET task vectors, 5,000 retained.
Safety: 0 failed rows, 0 stale vectors, runtime authority still false.

02 · Live Run

Live embedding result

Item: Result
Source family: onet_task_evidence
Family eligibility rows: 131,095
Repaired corpus size: 25,000
Dry-run new candidates: 20,000
Existing valid task vectors retained: 5,000
Live embedded new vectors: 20,000
Failed Gemini rows: 0
Total Gemini vectors: 44,912
O*NET task vectors: 25,000
Known-pair cosine gap: 0.190463
Stale vectors: 0

The run used gemini-embedding-2 at 768 dimensions through Vertex ADC on paid project aina-495702. Runtime embedding authority remains disabled.

03 · Proof

Validation state

Gate or receipt: Result
Literal repaired-text scan: pass
Focused O*NET/semantic-QA tests: 6 passed
Ruff on changed files: pass
Semantic QA: 50 / 50 pass
AIN-506 P0 gate: pass
AIN-510 retrieval gate: promotion_ready
Chunk-vector reconciliation: pass
Source-authority registry v2: pass
Runtime readiness: headless hardening ready
Artifact exposure scan: 0 active findings
Full validate: pass

Current authority count: Value
Base chunks: 294,675
Repaired distinct authority chunks: 66,190
Combined chunk authority: 360,865
Matched vectors: 44,912
Combined vector coverage: 12.4457%
Unvectorized chunks: 315,953
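
The authority counts above are related by simple arithmetic, and the short sketch below re-derives the combined total, the unvectorized remainder, and the coverage percentage from the three reported inputs. The variable names are illustrative only and are not taken from the aina-data-engine code.

```python
# Inputs copied from the authority counts in this report.
base_chunks = 294_675
repaired_chunks = 66_190
matched_vectors = 44_912

# Combined chunk authority is the sum of base and repaired distinct chunks.
combined_chunks = base_chunks + repaired_chunks

# Every combined chunk without a matched vector is unvectorized.
unvectorized_chunks = combined_chunks - matched_vectors

# Coverage is matched vectors over combined chunks, as a percentage.
coverage_pct = round(100 * matched_vectors / combined_chunks, 4)

print(f"combined={combined_chunks:,}")
print(f"unvectorized={unvectorized_chunks:,}")
print(f"coverage={coverage_pct}%")
```

The three derived values agree with the reported 360,865 combined chunks, 315,953 unvectorized chunks, and 12.4457% coverage.
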
04 · Meaning

Why this matters

The O*NET task layer now contributes a large public, source-grounded task semantic base for role-to-task, role-to-workflow, task exposure, capability requirements, evaluator scenarios, and curriculum gap matching. The repaired embedding text centers SOC and task statements, not source labels or metric names.

title: 11-3051 task - Document testing procedures, methodologies, or criteria.
text: Evidence atlas onet task evidence: 11-3051 task - Document testing procedures...
SOC: 11-3051. Task: Document testing procedures, methodologies, or criteria.
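
As a rough illustration of how the title and the SOC/Task line in the sample above could be assembled from a SOC code and a task statement, here is a minimal sketch. The function names and the SOC pattern check are assumptions made for this example, not the engine's actual repair code, and the sketch deliberately does not try to reconstruct the elided middle of the text field.

```python
import re

# Assumption: a plain SOC code looks like "11-3051" (two digits, dash, four digits).
SOC_PATTERN = re.compile(r"^\d{2}-\d{4}$")


def build_task_title(soc_code: str, task_statement: str) -> str:
    """Hypothetical helper: title centered on the SOC code and task statement."""
    soc_code = soc_code.strip()
    if not SOC_PATTERN.match(soc_code):
        raise ValueError(f"unexpected SOC code: {soc_code!r}")
    return f"{soc_code} task - {task_statement.strip()}"


def build_soc_task_line(soc_code: str, task_statement: str) -> str:
    """Hypothetical helper: the 'SOC: ... Task: ...' line from the sample."""
    soc_code = soc_code.strip()
    if not SOC_PATTERN.match(soc_code):
        raise ValueError(f"unexpected SOC code: {soc_code!r}")
    return f"SOC: {soc_code}. Task: {task_statement.strip()}"


soc = "11-3051"
task = "Document testing procedures, methodologies, or criteria."
title = build_task_title(soc, task)
soc_task_line = build_soc_task_line(soc, task)
print(title)
print(soc_task_line)
```
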
05 · Boundary

Product boundary

This is local production-readiness proof, not production launch. Runtime embedding authority, public runtime, real-user data, external writes, production telemetry, donor deletion, raw market dump embedding, malformed row embedding, and learner-linked payload embedding all remain blocked.

Exact cosine remains the retrieval source of truth. Deterministic fallbacks, source authority, guardrails, and rollback boundaries still govern runtime decisions.
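
To make "exact cosine" concrete, the sketch below scores a query against every stored vector and sorts the results, with no approximate index in between. It uses tiny 3-dimensional toy vectors rather than the 768-dimension embeddings from the run, and the function names and chunk ids are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k=3):
    """Brute-force search: score every vector, then keep the k best.

    Ties are broken by chunk id so the ordering is deterministic.
    """
    scored = [(cosine(query, vec), chunk_id) for chunk_id, vec in vectors.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(chunk_id, score) for score, chunk_id in scored[:k]]


# Toy data: chunk ids and vectors are made up for illustration.
toy_vectors = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.0, 1.0, 0.0],
    "chunk-c": [1.0, 1.0, 0.0],
}
top = exact_top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
for chunk_id, score in top:
    print(chunk_id, round(score, 4))
```

Brute-force scoring is linear in the number of vectors, which is the trade-off for getting an exact ranking rather than an approximate one.
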

06 · Batch

Batch-readiness interpretation

The 25k rung makes onet_task_evidence a reasonable batch candidate for the remaining repaired family, but only after a batch-manifest guard confirms eligible-only source-family selection, no blocked rows, no raw market rows, no malformed rows, no quality-excluded candidates, exact source scoping, and requeue by chunk_id and text_hash.
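
One possible shape for such a guard is sketched below. The row fields, reason strings, and function names are all hypothetical; only the checks themselves (family scoping, eligibility, blocked, raw market, malformed, quality-excluded) and the requeue key of chunk_id plus text_hash come from the description above. The sketch assumes a row is retained when its (chunk_id, text_hash) pair is already embedded and requeued when the hash has changed.

```python
from dataclasses import dataclass
from typing import Optional

FAMILY = "onet_task_evidence"


@dataclass(frozen=True)
class CandidateRow:
    # Hypothetical row shape; the real manifest schema may differ.
    chunk_id: str
    text_hash: str
    source_family: str
    eligible: bool = True
    blocked: bool = False
    raw_market: bool = False
    malformed: bool = False
    quality_excluded: bool = False


def violation(row: CandidateRow, family: str) -> Optional[str]:
    """Return the first guard violation for a row, or None if it is clean."""
    if row.source_family != family:
        return "out_of_scope_family"
    if not row.eligible:
        return "not_eligible"
    if row.blocked:
        return "blocked"
    if row.raw_market:
        return "raw_market"
    if row.malformed:
        return "malformed"
    if row.quality_excluded:
        return "quality_excluded"
    return None


def guard_manifest(rows, family, embedded_keys):
    """Split rows into queue, retained, and violations; pass only if no violations."""
    queue, retained, violations = [], [], []
    seen = set()
    for row in rows:
        reason = violation(row, family)
        if reason is not None:
            violations.append((row.chunk_id, reason))
            continue
        key = (row.chunk_id, row.text_hash)
        if key in embedded_keys:
            retained.append(key)  # same chunk and same text: keep the vector
        elif key not in seen:
            seen.add(key)
            queue.append(key)  # new chunk or changed text: (re)queue
    return {"ok": not violations, "queue": queue,
            "retained": retained, "violations": violations}


rows = [
    CandidateRow("c1", "h1", FAMILY),                  # already embedded
    CandidateRow("c2", "h2b", FAMILY),                 # text changed since embed
    CandidateRow("c3", "h3", FAMILY, malformed=True),  # must not pass
    CandidateRow("c4", "h4", "market_dump"),           # wrong family
]
result = guard_manifest(rows, FAMILY, {("c1", "h1"), ("c2", "h2a")})
print(result)
```

In this sketch the manifest fails closed: any single violation sets ok to False, so a batch would not start until the offending rows are removed upstream.
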

cd /srv/aina/aina-data-engine-room
git status --short --branch
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-chunk-vector-reconciliation
uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-registry-v2

Where to start

Start with a guarded batch-candidate manifest for the remaining repaired O*NET task rows, or move to the next source family if batch implementation needs its own slice.