O*NET Task 25k Embedding Checkpoint
The O*NET task family passed a large foreground proof and is now ready for guarded batch-candidate work.
onet_task_evidence passed the 25,000-row progressive proof: 20,000 new Gemini vectors, zero failed rows, zero stale vectors, and the AIN-510 retrieval gate still at promotion_ready.
What changed
Codex built a 25,000-row repaired corpus, verified that every row received generic-title replacement and placeholder cleanup, ran literal scans for bad label prefixes, passed semantic QA, dry-ran candidate selection, confirmed the AIN-506 P0 gate, and then ran the foreground Vertex ADC embedding job.
Live embedding result
| Item | Result |
|---|---|
| Source family | onet_task_evidence |
| Family eligibility rows | 131,095 |
| Repaired corpus size | 25,000 |
| Dry-run new candidates | 20,000 |
| Existing valid task vectors retained | 5,000 |
| Live embedded new vectors | 20,000 |
| Failed Gemini rows | 0 |
| Total Gemini vectors | 44,912 |
| O*NET task vectors | 25,000 |
| Known-pair cosine gap | 0.190463 |
| Stale vectors | 0 |
The run used gemini-embedding-2 at 768 dimensions through Vertex ADC on paid project aina-495702. Runtime embedding authority remains disabled.
Validation state
| Gate or receipt | Result |
|---|---|
| Literal repaired-text scan | pass |
| Focused O*NET/semantic-QA tests | 6 passed |
| Ruff on changed files | pass |
| Semantic QA | 50 / 50 pass |
| AIN-506 P0 gate | pass |
| AIN-510 retrieval gate | promotion_ready |
| Chunk-vector reconciliation | pass |
| Source-authority registry v2 | pass |
| Runtime readiness | headless hardening ready |
| Artifact exposure scan | 0 active findings |
| Full validate | pass |

| Current authority count | Value |
|---|---|
| Base chunks | 294,675 |
| Repaired distinct authority chunks | 66,190 |
| Combined chunk authority | 360,865 |
| Matched vectors | 44,912 |
| Combined vector coverage | 12.4457% |
| Unvectorized chunks | 315,953 |
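The derived rows in the authority table follow from the three counted rows. As a quick arithmetic check, here is a sketch that uses only the figures in the table; the variable names are illustrative and are not identifiers from the engine room.

```python
# Figures copied from the authority-count table.
base_chunks = 294_675
repaired_chunks = 66_190
matched_vectors = 44_912

# Combined authority is the sum of base and repaired distinct chunks.
combined = base_chunks + repaired_chunks
# Everything without a matched vector is still unvectorized.
unvectorized = combined - matched_vectors
# Coverage as a percentage, rounded to the four decimals used in the table.
coverage_pct = round(100 * matched_vectors / combined, 4)

print(combined, unvectorized, coverage_pct)  # 360865 315953 12.4457
```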
Why this matters
The O*NET task layer now contributes a large public, source-grounded task semantic base for role-to-task, role-to-workflow, task exposure, capability requirements, evaluator scenarios, and curriculum gap matching. The repaired embedding text centers SOC and task statements, not source labels or metric names.
title: 11-3051 task - Document testing procedures, methodologies, or criteria.
text: Evidence atlas onet task evidence: 11-3051 task - Document testing procedures...
SOC: 11-3051. Task: Document testing procedures, methodologies, or criteria.
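The sample suggests that the repaired text can be derived from the title by separating the SOC code from the task statement. A minimal sketch follows, assuming the `SOC: ... Task: ...` line is the repaired form and that titles follow the `<SOC> task - <statement>` shape of the single example. The function name and regular expression are hypothetical and are not the engine room's actual repair code.

```python
import re

# Assumed title shape, inferred from one example: "<SOC code> task - <statement>".
TITLE_PATTERN = re.compile(r"^(?P<soc>\d{2}-\d{4}) task - (?P<task>.+)$")


def build_repaired_text(title: str) -> str:
    """Return 'SOC: <code>. Task: <statement>' for a title in the assumed shape."""
    match = TITLE_PATTERN.match(title.strip())
    if match is None:
        # Refuse rather than guess, so unrecognized rows are not silently embedded.
        raise ValueError(f"unrecognized title shape: {title!r}")
    return f"SOC: {match['soc']}. Task: {match['task']}"


repaired = build_repaired_text(
    "11-3051 task - Document testing procedures, methodologies, or criteria."
)
print(repaired)
```

Raising on an unrecognized title, rather than falling back to the raw string, is a design choice in this sketch that keeps source labels from leaking back into the embedding text.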
Product boundary
This is a local production-readiness proof, not a production launch. Runtime embedding authority, public runtime, real-user data, external writes, production telemetry, donor deletion, raw market dump embedding, malformed row embedding, and learner-linked payload embedding all remain blocked.
Exact cosine remains the retrieval source of truth. Deterministic fallbacks, source authority, guardrails, and rollback boundaries still govern runtime decisions.
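Exact cosine is read here as brute-force scoring of the query against every stored vector, with no approximate index in between. The sketch below illustrates that idea in pure Python with toy 3-dimensional vectors (the run itself used 768 dimensions). The chunk ids and function names are assumptions for illustration only.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def exact_top_k(query, vectors, k=2):
    """Score every stored vector (no approximate index) and return the k best."""
    scored = [(cosine(query, vec), chunk_id) for chunk_id, vec in vectors.items()]
    return sorted(scored, reverse=True)[:k]


# Toy store; the chunk ids are hypothetical.
store = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.6, 0.8, 0.0],
    "chunk-c": [0.0, 0.0, 1.0],
}
ranked = exact_top_k([1.0, 0.0, 0.0], store)
print(ranked)
```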
Batch-readiness interpretation
The 25k rung makes onet_task_evidence a reasonable batch candidate for the remaining repaired family, but only after a batch-manifest guard confirms eligible-only source-family selection, no blocked rows, no raw market rows, no malformed rows, no quality-excluded candidates, exact source scoping, and requeue by chunk_id and text_hash.
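The guard conditions listed there can be pictured as a single pass over candidate rows. This is a sketch only: the row fields `source_family`, `eligible`, `blocked`, `raw_market`, `malformed`, and `quality_excluded` are hypothetical names chosen to mirror the prose, not the engine room's schema. Only `chunk_id` and `text_hash` come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    # chunk_id and text_hash come from the text; every other field name is an assumption.
    chunk_id: str
    text_hash: str
    source_family: str
    eligible: bool = True
    blocked: bool = False
    raw_market: bool = False
    malformed: bool = False
    quality_excluded: bool = False


def guard_manifest(rows, family="onet_task_evidence"):
    """Return (requeue_keys, violations); proceed only if violations is empty."""
    keys, violations = [], []
    for row in rows:
        reasons = []
        if row.source_family != family:
            reasons.append("wrong source family")
        if not row.eligible:
            reasons.append("not eligible")
        for flag in ("blocked", "raw_market", "malformed", "quality_excluded"):
            if getattr(row, flag):
                reasons.append(flag)
        if reasons:
            violations.append((row.chunk_id, reasons))
        else:
            # Requeue identity: a chunk whose text changed gets a different key.
            keys.append((row.chunk_id, row.text_hash))
    return keys, violations


rows = [
    Candidate("c1", "h1", "onet_task_evidence"),
    Candidate("c2", "h2", "onet_task_evidence", malformed=True),
]
keys, violations = guard_manifest(rows)
print(keys)
print(violations)
```

Collecting every violation instead of stopping at the first one is a choice made for this sketch, so that a rejected manifest can be reviewed in full before a retry.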
cd /srv/aina/aina-data-engine-room
git status --short --branch
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-chunk-vector-reconciliation
uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-registry-v2
Start with a guarded batch-candidate manifest for the remaining repaired O*NET task rows, or move to the next source family if batch implementation needs its own slice.