AINA Data Engine Room · Personalization Engine · 2026-06-15

O*NET Task 25k Embedding Checkpoint

The O*NET task family passed a large foreground proof and is now ready for guarded batch-candidate work.

Ali Mehdi Mukadam · co-authored with Codex · 2026-06-15

The Single Idea

onet_task_evidence passed the 25,000-row progressive proof: 20,000 new Gemini vectors, zero failed rows, zero stale vectors, and AIN-510 still promotion_ready.

01 · Change

What changed

Codex built a 25,000-row repaired corpus and verified that every row received generic-title replacement and placeholder cleanup. It then ran literal scans for bad label prefixes, passed semantic QA, dry-ran candidate selection, confirmed AIN-506, and ran the foreground Vertex ADC embedding job.

Repair: 25,000 repaired chunks, zero skipped repairs.

Embed: 20,000 new O*NET task vectors, 5,000 retained.

Safety: 0 failed rows, 0 stale vectors, runtime authority still false.

02 · Live Run

Live embedding result

Item	Result
Source family	`onet_task_evidence`
Family eligibility rows	131,095
Repaired corpus size	25,000
Dry-run new candidates	20,000
Existing valid task vectors retained	5,000
Live embedded new vectors	20,000
Failed Gemini rows	0
Total Gemini vectors	44,912
O*NET task vectors	25,000
Known-pair cosine gap	0.190463
Stale vectors	0
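
The split between retained, new, and stale vectors in the table can be pictured as a comparison of each chunk's current text hash with the hash that was embedded earlier. The sketch below is illustrative only: `Chunk`, `plan_embedding`, and the `existing_vectors` mapping are hypothetical names, and the definition of "stale" used here (a vector whose embedded hash no longer matches the chunk text) is an assumption, not the engine's confirmed rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text_hash: str  # hash of the repaired embedding text


def plan_embedding(chunks, existing_vectors):
    """Classify chunks for a dry run.

    existing_vectors maps chunk_id -> text_hash of the text that was
    embedded earlier (hypothetical shape, assumed for this sketch).
    """
    retained, new, stale = [], [], []
    for chunk in chunks:
        embedded_hash = existing_vectors.get(chunk.chunk_id)
        if embedded_hash == chunk.text_hash:
            # A valid vector already exists for exactly this text.
            retained.append(chunk.chunk_id)
        else:
            # No vector yet, or the text changed since it was embedded.
            new.append(chunk.chunk_id)
            if embedded_hash is not None:
                stale.append(chunk.chunk_id)
    return {"retained": retained, "new": new, "stale": stale}


plan = plan_embedding(
    [Chunk("a", "h1"), Chunk("b", "h2"), Chunk("c", "h3")],
    {"a": "h1", "b": "old-hash"},
)
```

Under this reading, a dry run that reports 20,000 new candidates and 5,000 retained vectors with zero stale vectors means every existing vector still matches its chunk text.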

The run used gemini-embedding-2 at 768 dimensions through Vertex ADC on paid project aina-495702. Runtime embedding authority remains disabled.

03 · Proof

Validation state

Gate or receipt	Result
Literal repaired-text scan	pass
Focused O*NET/semantic-QA tests	6 passed
Ruff on changed files	pass
Semantic QA	50 / 50 pass
AIN-506 P0 gate	pass
AIN-510 retrieval gate	promotion_ready
Chunk-vector reconciliation	pass
Source-authority registry v2	pass
Runtime readiness	headless hardening ready
Artifact exposure scan	0 active findings
Full validate	pass

Current authority count	Value
Base chunks	294,675
Repaired distinct authority chunks	66,190
Combined chunk authority	360,865
Matched vectors	44,912
Combined vector coverage	12.4457%
Unvectorized chunks	315,953
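
The authority counts above are internally consistent, which can be checked with a few lines of arithmetic. This sketch only recomputes figures from the table; the variable names are illustrative.

```python
# Figures copied from the authority count table.
base_chunks = 294_675
repaired_chunks = 66_190
matched_vectors = 44_912

# Combined chunk authority is the sum of base and repaired chunks.
combined_chunks = base_chunks + repaired_chunks

# Unvectorized chunks are those without a matched vector.
unvectorized_chunks = combined_chunks - matched_vectors

# Coverage is matched vectors as a percentage of combined authority.
coverage_pct = round(100 * matched_vectors / combined_chunks, 4)

print(combined_chunks, unvectorized_chunks, coverage_pct)
```

The recomputed values match the reported 360,865 combined chunks, 315,953 unvectorized chunks, and 12.4457% coverage.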

04 · Meaning

Why this matters

The O*NET task layer now contributes a large, public, source-grounded semantic base of tasks for role-to-task, role-to-workflow, task exposure, capability requirements, evaluator scenarios, and curriculum gap matching. The repaired embedding text centers SOC and task statements, not source labels or metric names.

title: 11-3051 task - Document testing procedures, methodologies, or criteria.
text: Evidence atlas onet task evidence: 11-3051 task - Document testing procedures...
SOC: 11-3051. Task: Document testing procedures, methodologies, or criteria.
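
As a rough illustration of the repair shown above, the sketch below derives the `SOC: ... Task: ...` form from a title and runs a literal prefix scan. The title pattern, the function names, and the choice of "Evidence atlas" as a bad label prefix are assumptions inferred from the single example here, not the engine's actual rules.

```python
import re

# Assumed title shape, inferred from the one example: "<SOC> task - <statement>".
TITLE_PATTERN = re.compile(r"^(?P<soc>\d{2}-\d{4}) task - (?P<task>.+)$")

# Hypothetical list of source-label prefixes that should not lead embedding text.
BAD_PREFIXES = ("Evidence atlas",)


def repaired_text(title: str) -> str:
    """Build embedding text that centers the SOC code and the task statement."""
    match = TITLE_PATTERN.match(title.strip())
    if match is None:
        raise ValueError(f"unrecognized title shape: {title!r}")
    return f"SOC: {match['soc']}. Task: {match['task'].strip()}"


def has_bad_prefix(text: str) -> bool:
    """Literal scan: does the text start with a known source label?"""
    return text.lstrip().startswith(BAD_PREFIXES)


title = "11-3051 task - Document testing procedures, methodologies, or criteria."
fixed = repaired_text(title)
```

A literal scan of this kind is cheap enough to run over every repaired row before any embedding call is made.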

05 · Boundary

Product boundary

This is local production-readiness proof, not production launch. Runtime embedding authority, public runtime, real-user data, external writes, production telemetry, donor deletion, raw market dump embedding, malformed row embedding, and learner-linked payload embedding all remain blocked.

Exact cosine remains the retrieval source of truth. Deterministic fallbacks, source authority, guardrails, and rollback boundaries still govern runtime decisions.
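
For readers unfamiliar with the term, "exact cosine" here is taken to mean brute-force cosine similarity over every stored vector rather than an approximate index. The following is a minimal pure-Python sketch of that idea; `cosine` and `exact_top_k` are hypothetical helpers and do not describe the engine's retrieval code.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k=3):
    """Score every stored vector against the query and return the k best.

    vectors maps chunk_id -> vector. Ties are broken by chunk_id so the
    ranking is deterministic.
    """
    scored = [(cosine(query, vec), chunk_id) for chunk_id, vec in vectors.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(chunk_id, score) for score, chunk_id in scored[:k]]


ranked = exact_top_k(
    [1.0, 0.0],
    {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]},
    k=2,
)
```

Scoring every vector is linear in corpus size, but it has no recall loss, which is one reason an exact pass can serve as the reference that other retrieval paths are compared against.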

06 · Batch

Batch-readiness interpretation

The 25k rung makes onet_task_evidence a reasonable batch candidate for the remaining repaired family, but only after a batch-manifest guard confirms eligible-only source-family selection, no blocked rows, no raw market rows, no malformed rows, no quality-excluded candidates, exact source scoping, and requeue by chunk_id and text_hash.
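
One way to picture the batch-manifest guard described above is as a function that returns every violation and lets the batch proceed only when the list is empty. The row fields and reason strings below are hypothetical; only the checks themselves (eligible source family, no blocked, raw market, malformed, or quality-excluded rows, and a unique `chunk_id` plus `text_hash` requeue key) come from the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestRow:
    # Hypothetical row shape for illustration only.
    chunk_id: str
    text_hash: str
    source_family: str
    blocked: bool = False
    raw_market: bool = False
    malformed: bool = False
    quality_excluded: bool = False


def guard_manifest(rows, family="onet_task_evidence"):
    """Return (chunk_id, reasons) for each row that violates the guard."""
    violations = []
    seen_keys = set()
    for row in rows:
        reasons = []
        if row.source_family != family:
            reasons.append("wrong_source_family")
        if row.blocked:
            reasons.append("blocked")
        if row.raw_market:
            reasons.append("raw_market")
        if row.malformed:
            reasons.append("malformed")
        if row.quality_excluded:
            reasons.append("quality_excluded")
        # Requeue identity is the pair (chunk_id, text_hash).
        key = (row.chunk_id, row.text_hash)
        if key in seen_keys:
            reasons.append("duplicate_requeue_key")
        seen_keys.add(key)
        if reasons:
            violations.append((row.chunk_id, reasons))
    return violations


rows = [
    ManifestRow("c1", "h1", "onet_task_evidence"),
    ManifestRow("c2", "h2", "onet_task_evidence", malformed=True),
    ManifestRow("c3", "h3", "other_family"),
    ManifestRow("c1", "h1", "onet_task_evidence"),
]
violations = guard_manifest(rows)
```

Collecting every reason per row, rather than stopping at the first failure, keeps the guard's output usable as a repair worklist.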

cd /srv/aina/aina-data-engine-room
git status --short --branch
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-chunk-vector-reconciliation
uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-registry-v2

Where to start

Start with a guarded batch-candidate manifest for the remaining repaired O*NET task rows, or move to the next source family if batch implementation needs its own slice.

topics:
  - aina-data-engine
  - personalization-engine
  - embeddings
subtopics:
  - onet-task-evidence
  - repair-before-embed
  - gemini-embedding-2
  - retrieval-gate
  - batch-candidate-proof
