O*NET Task 50k Embedding Checkpoint
onet_task_evidence now has 50,000 Gemini vectors, backed by a full repaired O*NET task overlay of 131,095 chunks. The second large foreground run added 25,000 new vectors with 0 failed rows, kept exact-cosine quality gates green, and left public/runtime production unlocks off.
What Changed
- Expanded the O*NET task repaired corpus from the prior 25,000 proof slice to the full available 131,095 repaired chunks.
- Confirmed the repair pattern is deterministic and uniform: each repaired chunk receives `generic_title_replacement` and `placeholder_noise_cleanup`.
- Verified repaired text and repaired manifests do not leak `human_review` or known generic title prefixes.
- Semantic QA passed 50 / 50.
- Live Gemini Embedding 2 added 25,000 vectors with 0 failed rows through Vertex ADC on `aina-495702`.
- AIN-510 refreshed to `promotion_ready`, with 0 stale vectors.
- Reconciliation, source authority, runtime readiness, exposure scan, frontmatter, full validation, and focused pytest all passed.
Current Counts
| Item | Result |
|---|---|
| Full repaired O*NET task chunks | 131,095 |
| O*NET task vectors | 50,000 |
| New vectors in this checkpoint | 25,000 |
| Failed Gemini rows | 0 |
| Total Gemini vectors | 69,912 |
| Combined chunk authority | 466,960 |
| Unvectorized chunks | 397,048 |
| Combined vector coverage | 14.9717% |
| Known-pair cosine gap | 0.190463 |
| Stale vectors | 0 |
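
The two derived rows in the table (unvectorized chunks and combined vector coverage) follow from the two totals. A minimal sketch of that arithmetic, assuming coverage is defined as total Gemini vectors divided by combined chunk authority. The function name `coverage_summary` is illustrative and is not part of the `aina-data-engine` CLI.

```python
def coverage_summary(total_vectors: int, total_chunks: int) -> dict:
    """Derive the unvectorized count and coverage percentage from two totals.

    Assumption: coverage = total_vectors / total_chunks, rounded to 4 decimals.
    """
    if total_chunks <= 0:
        raise ValueError("total_chunks must be positive")
    if not 0 <= total_vectors <= total_chunks:
        raise ValueError("total_vectors must be between 0 and total_chunks")
    unvectorized = total_chunks - total_vectors
    coverage_pct = round(100.0 * total_vectors / total_chunks, 4)
    return {"unvectorized": unvectorized, "coverage_pct": coverage_pct}


# Totals taken from the Current Counts table.
summary = coverage_summary(total_vectors=69_912, total_chunks=466_960)
print(summary)
```

With the table's totals this reproduces 397,048 unvectorized chunks and 14.9717% coverage.
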
Validation State
| Gate or receipt | Result |
|---|---|
| `production-embedding-semantic-qa --include-repaired` | 50 / 50 pass |
| `ain-506-p0-gate` | pass |
| `gemini-embedding-run live` | pass |
| `ain-510-retrieval-promotion-gate` | promotion_ready |
| `production-chunk-vector-reconciliation` | pass |
| `source-authority-registry-v2` | pass |
| `production-runtime-readiness` | ready_to_harden_headless_production_runtime |
| `artifact-exposure-scan` | pass, 0 active findings |
| `docs-frontmatter-check` | valid |
| `validate` | pass |
| Focused pytest | 64 passed |
Product Boundary
This is local production-readiness proof, not production launch. Public runtime, real-user data, external writes, production telemetry, runtime embedding authority promotion, donor repo mutation/deletion, raw market dump embedding, malformed row embedding, and learner-linked/private payload embedding all remain blocked.
Exact cosine remains the source-of-truth retrieval mode. Deterministic fallbacks, source authority, runtime caveats, and rollback boundaries still govern serving decisions.
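
For readers unfamiliar with the term, exact cosine retrieval scores the query against every stored vector rather than using an approximate index. The sketch below is a generic brute-force illustration in pure Python, not the engine room's implementation. The names `cosine` and `exact_cosine_top_k` and the toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_cosine_top_k(query, vectors, k=3):
    """Score every stored vector against the query and return the k best.

    `vectors` maps a chunk id to its embedding. Ties are broken by chunk id
    so the ranking is deterministic.
    """
    scored = [(chunk_id, cosine(query, vec)) for chunk_id, vec in vectors.items()]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Toy 2-dimensional vectors, for illustration only.
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
top = exact_cosine_top_k([1.0, 0.0], toy_vectors, k=2)
print(top)
```

Because every vector is scored, the result is exact by construction, which is what makes it usable as a reference when judging any faster mode.
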
Batch Readiness
The foreground path works but is too slow for the remaining O*NET task backlog. The next best implementation slice is a remaining-only batch lane, not another blind manifest.
- Manifest rows must exclude already embedded `(chunk_id, text_hash)` pairs.
- Manifest rows must be `source_family=onet_task_evidence`.
- Manifest rows must come from the repaired corpus, not unrepaired base text.
- Manifest rows must exclude blocked, repair-first-unrepaired, raw market, malformed, quality-excluded, learner-linked, and quarantine rows.
- Failed rows must requeue by `chunk_id` and `text_hash`.
The existing repaired manifest directory is API-shaped, but it contains the full repaired family, including rows already embedded. Do not submit that as-is.
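
The manifest rules above can be sketched as a single filter pass. This is a minimal illustration under stated assumptions, not the project's batch lane: the `ManifestRow` shape, the `repaired` field, the `flags` field, the flag spellings in `EXCLUDED_FLAGS`, and the function name `remaining_only` are all hypothetical. Only `chunk_id`, `text_hash`, and `source_family=onet_task_evidence` come from the rules themselves.

```python
from dataclasses import dataclass

# Hypothetical flag spellings for the exclusion categories listed above.
EXCLUDED_FLAGS = frozenset({
    "blocked", "repair_first_unrepaired", "raw_market", "malformed",
    "quality_excluded", "learner_linked", "quarantine",
})


@dataclass(frozen=True)
class ManifestRow:
    chunk_id: str
    text_hash: str
    source_family: str
    repaired: bool                 # assumed marker for repaired-corpus rows
    flags: frozenset = frozenset() # assumed per-row exclusion flags


def remaining_only(rows, embedded_pairs):
    """Keep rows that still need embedding and satisfy every manifest rule.

    `embedded_pairs` is the set of (chunk_id, text_hash) pairs that already
    have a vector. A failed row never enters that set, so running the filter
    again requeues it under the same (chunk_id, text_hash) key.
    """
    seen = set()
    remaining = []
    for row in rows:
        key = (row.chunk_id, row.text_hash)
        if key in embedded_pairs or key in seen:
            continue  # already embedded, or duplicated within this manifest
        if row.source_family != "onet_task_evidence":
            continue  # wrong source family
        if not row.repaired:
            continue  # unrepaired base text
        if row.flags & EXCLUDED_FLAGS:
            continue  # blocked, malformed, quarantined, and similar
        seen.add(key)
        remaining.append(row)
    return remaining


rows = [
    ManifestRow("c1", "h1", "onet_task_evidence", True),   # already embedded
    ManifestRow("c2", "h2", "onet_task_evidence", True),   # kept
    ManifestRow("c3", "h3", "onet_task_evidence", False),  # unrepaired
    ManifestRow("c4", "h4", "other_family", True),         # wrong family
    ManifestRow("c5", "h5", "onet_task_evidence", True, frozenset({"quarantine"})),
    ManifestRow("c1", "h1b", "onet_task_evidence", True),  # text changed, kept
]
batch = remaining_only(rows, embedded_pairs={("c1", "h1")})
print([(r.chunk_id, r.text_hash) for r in batch])
```

Keying on the pair rather than on `chunk_id` alone means a chunk whose repaired text changed gets a new hash and is embedded again, while an unchanged chunk is skipped.
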
Resume Commands
cd /srv/aina/aina-data-engine-room
git status --short --branch
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-chunk-vector-reconciliation