O*NET Task 75k Embedding Checkpoint
Live Gemini vectors advanced again, while runtime authority stays local and reversible.
The O*NET task evidence family has reached 75,000 live Gemini vectors with zero failed rows and zero stale vectors. The repo also has an updated remaining-only candidate manifest for the 56,095 repaired O*NET task chunks that are still unembedded.
What changed
onet_task_evidence, with zero quality exclusions.
aina-495702, Gemini Embedding 2, 768 dimensions.
The important operational correction is that the remaining-only manifest excludes already-vectorized chunk/text-hash pairs. That prevents the next batch or foreground run from paying to embed work already done.
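The exclusion of already-vectorized chunk/text-hash pairs can be pictured as a set difference keyed on the pair. The following is a minimal Python sketch of that idea, not the repo's implementation: the record fields `chunk_id` and `text_hash`, the function name `remaining_only`, and the sample data are all hypothetical.

```python
from typing import Iterable

# Hypothetical key: (chunk_id, text_hash). The real manifest schema is not shown in this note.
Pair = tuple[str, str]


def remaining_only(candidates: Iterable[dict], vectorized: set[Pair]) -> list[dict]:
    """Keep candidates whose (chunk_id, text_hash) pair has no vector yet."""
    remaining = []
    seen: set[Pair] = set()
    for chunk in candidates:
        pair = (chunk["chunk_id"], chunk["text_hash"])
        # Skip pairs that are already embedded, and duplicates within the candidate list.
        if pair in vectorized or pair in seen:
            continue
        seen.add(pair)
        remaining.append(chunk)
    return remaining


# Illustrative data only.
candidates = [
    {"chunk_id": "c1", "text_hash": "aaa"},  # already embedded
    {"chunk_id": "c2", "text_hash": "bbb"},  # never embedded
    {"chunk_id": "c1", "text_hash": "ccc"},  # same chunk ID, different text hash
]
vectorized = {("c1", "aaa")}

for chunk in remaining_only(candidates, vectorized):
    print(chunk["chunk_id"], chunk["text_hash"])
```

In this sketch the key is the pair rather than the chunk ID alone, so a chunk whose text hash differs from the embedded one is still selected, while an unchanged chunk is never paid for twice.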
Proof commands run
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family onet_task_evidence --include-repaired --dry-run --max-new 25000 --selection-mode progressive
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family onet_task_evidence --include-repaired --max-new 25000 --selection-mode progressive --allow-live-gemini --confirm-paid-api --workers 16 --timeout-seconds 120 --max-retries 5 --write-every 1000
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family onet_task_evidence --include-repaired --limit 50
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repaired-corpus --source-family onet_task_evidence --remaining-only --shard-size 5000
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-chunk-vector-reconciliation
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-runtime-readiness
uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-registry-v2
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
Resume from here
Next safest move: dry-run another 25,000 O*NET task tranche, then run it live only if the selector again reports only onet_task_evidence, zero quality exclusions, and zero stale/orphan candidates.
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family onet_task_evidence --include-repaired --dry-run --max-new 25000 --selection-mode progressive
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family onet_task_evidence --include-repaired --max-new 25000 --selection-mode progressive --allow-live-gemini --confirm-paid-api --workers 16 --timeout-seconds 120 --max-retries 5 --write-every 1000
Use the remaining-only manifest or the guarded foreground selector; do not submit the older full repaired manifest.
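The go/no-go rule for the live tranche (only onet_task_evidence selected, zero quality exclusions, zero stale/orphan candidates) can be written as an explicit check. This is a hedged Python sketch under stated assumptions: the `DryRunSummary` shape and the `live_run_blockers` helper are hypothetical, and the CLI's real dry-run output format is not documented in this note, so the values would have to be filled in from whatever the dry run actually reports.

```python
from dataclasses import dataclass, field


@dataclass
class DryRunSummary:
    """Hypothetical summary of a dry run; NOT the CLI's real output schema."""
    source_families: set[str] = field(default_factory=set)
    quality_exclusions: int = 0
    stale_candidates: int = 0
    orphan_candidates: int = 0


def live_run_blockers(summary: DryRunSummary,
                      expected_family: str = "onet_task_evidence") -> list[str]:
    """Return the reasons a live run should not proceed; an empty list means go."""
    blockers = []
    # The selector must report exactly the expected family and nothing else.
    if summary.source_families != {expected_family}:
        blockers.append(f"unexpected source families: {sorted(summary.source_families)}")
    if summary.quality_exclusions != 0:
        blockers.append(f"quality exclusions: {summary.quality_exclusions}")
    if summary.stale_candidates != 0:
        blockers.append(f"stale candidates: {summary.stale_candidates}")
    if summary.orphan_candidates != 0:
        blockers.append(f"orphan candidates: {summary.orphan_candidates}")
    return blockers


# Illustrative values only, typed in by hand from a dry-run report.
clean = DryRunSummary(source_families={"onet_task_evidence"})
dirty = DryRunSummary(source_families={"onet_task_evidence", "other_family"},
                      stale_candidates=3)

print(live_run_blockers(clean))
print(live_run_blockers(dirty))
```

Returning the list of blockers rather than a bare boolean keeps the reason for a refused live run visible in the log, which suits a paid, guarded step.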