AINA Data Engine Room - Local Handoff - 2026-06-15

O*NET Task 50k Embedding Checkpoint

Ali Mehdi Mukadam - co-authored with Codex

The single idea

onet_task_evidence now has 50,000 Gemini vectors, backed by the full repaired O*NET task overlay of 131,095 chunks. The second large foreground run added 25,000 new vectors with 0 failed rows, kept the exact-cosine quality gates green, and left all public and runtime production unlocks off.

What Changed

Expanded the repaired O*NET task corpus from the prior 25,000 proof slice to the full 131,095 available repaired chunks.
Confirmed the repair pattern is deterministic and uniform: each repaired chunk receives generic_title_replacement and placeholder_noise_cleanup.
Verified repaired text and repaired manifests do not leak human_review or known generic title prefixes.
Semantic QA passed 50 / 50.
The live Gemini Embedding 2 run added 25,000 vectors with 0 failed rows through Vertex ADC on aina-495702.
The AIN-510 retrieval promotion gate refreshed to promotion_ready, with 0 stale vectors.
Reconciliation, source authority, runtime readiness, exposure scan, frontmatter, full validation, and focused pytest all passed.

Current Counts

Item	Result
Full repaired O*NET task chunks	131,095
O*NET task vectors	50,000
New vectors in this checkpoint	25,000
Failed Gemini rows	0
Total Gemini vectors	69,912
Combined chunk authority	466,960
Unvectorized chunks	397,048
Combined vector coverage	14.9717%
Known-pair cosine gap	0.190463
Stale vectors	0
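
The derived rows in the table can be rechecked from the two base counts. The sketch below is only an arithmetic check. The variable names are illustrative, and the remaining O*NET task figure is an upper bound that assumes every repaired chunk is eligible for embedding.

```python
# Base counts copied from the Current Counts table.
total_gemini_vectors = 69_912
combined_chunk_authority = 466_960
onet_task_chunks = 131_095
onet_task_vectors = 50_000

# Derived values that the table also reports.
unvectorized = combined_chunk_authority - total_gemini_vectors
coverage_pct = 100 * total_gemini_vectors / combined_chunk_authority

# Upper bound on the remaining O*NET task backlog (assumes no exclusions apply).
onet_task_remaining_max = onet_task_chunks - onet_task_vectors

print(unvectorized)             # 397048
print(f"{coverage_pct:.4f}%")   # 14.9717%
print(onet_task_remaining_max)  # 81095
```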

Validation State

Gate or receipt	Result
`production-embedding-semantic-qa --include-repaired`	50 / 50 pass
`ain-506-p0-gate`	pass
`gemini-embedding-run` live	pass
`ain-510-retrieval-promotion-gate`	promotion_ready
`production-chunk-vector-reconciliation`	pass
`source-authority-registry-v2`	pass
`production-runtime-readiness`	ready_to_harden_headless_production_runtime
`artifact-exposure-scan`	pass, 0 active findings
`docs-frontmatter-check`	valid
`validate`	pass
Focused pytest	64 passed

Product Boundary

This is local production-readiness proof, not production launch. Public runtime, real-user data, external writes, production telemetry, runtime embedding authority promotion, donor repo mutation/deletion, raw market dump embedding, malformed row embedding, and learner-linked/private payload embedding all remain blocked.

Exact cosine remains the source-of-truth retrieval mode. Deterministic fallbacks, source authority, runtime caveats, and rollback boundaries still govern serving decisions.
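
For readers unfamiliar with the term, exact cosine retrieval means scoring the query against every stored vector rather than using an approximate index. The following is a minimal pure-Python sketch of that idea, not the engine room's implementation. The function names, the dict-of-lists storage, and the tie-break on chunk_id are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k=3):
    """Brute-force scan: score every vector, return the k best (chunk_id, score) pairs.

    `vectors` maps a hypothetical chunk_id to its embedding.
    Ties are broken by chunk_id so the ordering is deterministic.
    """
    scored = [(cosine(query, vec), chunk_id) for chunk_id, vec in vectors.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(chunk_id, score) for score, chunk_id in scored[:k]]


# Toy example with 2-dimensional vectors.
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
toy_result = exact_top_k([1.0, 0.0], toy_vectors, k=2)
```

A full scan costs time linear in the number of vectors, but it has no index drift and no recall loss, which is a plausible reason to keep it as the reference mode.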

Batch Readiness

The foreground path works but is too slow for the remaining O*NET task backlog. The best next implementation slice is a remaining-only batch lane, not another blind manifest.

Manifest rows must exclude already embedded (chunk_id, text_hash) pairs.
Manifest rows must be source_family=onet_task_evidence.
Manifest rows must come from the repaired corpus, not unrepaired base text.
Manifest rows must exclude blocked, repair-first-unrepaired, raw market, malformed, quality-excluded, learner-linked, and quarantine rows.
Failed rows must requeue by chunk_id and text_hash.

The existing repaired manifest directory is API-shaped, but it contains the full repaired family, including rows that are already embedded. Do not submit it as-is.
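
The manifest rules above can be sketched as a filter plus a requeue step. This is a sketch under stated assumptions, not the batch lane itself. The ManifestRow fields `repaired` and `status`, the status labels, and both function names are hypothetical. Only chunk_id, text_hash, and source_family=onet_task_evidence come from this note.

```python
from dataclasses import dataclass

# Hypothetical status labels standing in for the exclusion rule; real flags may differ.
EXCLUDED_STATUSES = frozenset({
    "blocked", "repair_first_unrepaired", "raw_market", "malformed",
    "quality_excluded", "learner_linked", "quarantine",
})


@dataclass(frozen=True)
class ManifestRow:
    chunk_id: str
    text_hash: str
    source_family: str
    repaired: bool       # assumed marker: True if the text comes from the repaired corpus
    status: str = "ok"   # assumed marker for exclusion categories


def remaining_only(rows, embedded_pairs):
    """Keep only rows that satisfy every manifest rule and are not yet embedded."""
    keep = []
    for row in rows:
        if (row.chunk_id, row.text_hash) in embedded_pairs:
            continue  # already embedded with this exact text
        if row.source_family != "onet_task_evidence":
            continue  # wrong source family
        if not row.repaired:
            continue  # unrepaired base text
        if row.status in EXCLUDED_STATUSES:
            continue  # blocked, quarantined, and similar rows
        keep.append(row)
    return keep


def requeue_failed(failed_rows):
    """Requeue failures keyed by (chunk_id, text_hash), dropping duplicate keys."""
    seen = set()
    queue = []
    for row in failed_rows:
        key = (row.chunk_id, row.text_hash)
        if key not in seen:
            seen.add(key)
            queue.append(row)
    return queue


# Toy data: only c2 should survive the filter.
sample_rows = [
    ManifestRow("c1", "h1", "onet_task_evidence", True),
    ManifestRow("c2", "h2", "onet_task_evidence", True),
    ManifestRow("c3", "h3", "onet_task_evidence", False),
    ManifestRow("c4", "h4", "onet_task_evidence", True, "quarantine"),
    ManifestRow("c5", "h5", "other_family", True),
]
sample_remaining = remaining_only(sample_rows, embedded_pairs={("c1", "h1")})
```

Because the key is the (chunk_id, text_hash) pair rather than chunk_id alone, a chunk whose repaired text later changes would be treated as not yet embedded under this sketch.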

Resume Commands

cd /srv/aina/aina-data-engine-room
git status --short --branch
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-chunk-vector-reconciliation