AINA Data Engine Room · local handoff · 2026-06-15

O*NET Task 75k Embedding Checkpoint

Live Gemini vectors advanced again, while runtime authority stays local and reversible.

The Single Idea

The O*NET task evidence family has reached 75,000 live Gemini vectors with zero failed rows and zero stale vectors. The repo also has an updated remaining-only candidate manifest for the 56,095 repaired O*NET task chunks that are still unembedded.

01 · Progress

What changed

Dry-run: 25,000 clean candidates, all from onet_task_evidence, with zero quality exclusions.
Live run: Vertex ADC on aina-495702, Gemini Embedding 2, 768 dimensions.
New vectors: 25,000 new rows, zero failed rows; total vectors now 94,912.
Remainder: 56,095 O*NET task chunks remain in the remaining-only manifest.

The important operational correction is that the remaining-only manifest excludes already-vectorized chunk/text-hash pairs. That prevents the next batch or foreground run from paying to embed work already done.
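
The exclusion rule can be sketched as a set difference keyed on the (chunk, text-hash) pair. This is a minimal illustration, not the repo's implementation: the Chunk type, its field names, and the remaining_only helper are all hypothetical.

```python
from typing import Iterable, NamedTuple


class Chunk(NamedTuple):
    # Hypothetical record shape; the real manifest schema is not shown in this handoff.
    chunk_id: str
    text_hash: str


def remaining_only(candidates: Iterable[Chunk], vectorized: Iterable[Chunk]) -> list[Chunk]:
    """Keep only candidates whose (chunk_id, text_hash) pair has no vector yet."""
    done = {(v.chunk_id, v.text_hash) for v in vectorized}
    return [c for c in candidates if (c.chunk_id, c.text_hash) not in done]


candidates = [Chunk("task-001", "aaa"), Chunk("task-002", "bbb"), Chunk("task-003", "ccc")]
# task-002 was embedded earlier, but from text with a different hash.
vectorized = [Chunk("task-001", "aaa"), Chunk("task-002", "old")]

remaining = remaining_only(candidates, vectorized)
print(remaining)
```

Keying on the pair rather than on the chunk id alone means a chunk whose text changed is still treated as unembedded, while an unchanged chunk is never paid for twice.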

02 · Retrieval

Current vector authority

Metric: Value
Total Gemini vectors: 94,912
O*NET task vectors: 75,000
O*NET occupation vectors: 2,828
Top 1,000 vector coverage: 1,000
Top 500 vector coverage: 500
Stale vectors: 0
Known-pair cosine gap: 0.190463
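
This handoff does not define the known-pair cosine gap. One plausible reading, used here purely as an assumption, is the mean cosine similarity of known related pairs minus the mean over unrelated control pairs. The function names and toy vectors below are hypothetical and do not reproduce the 0.190463 figure.

```python
import math
from typing import Sequence

Vector = Sequence[float]
Pair = tuple[Vector, Vector]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def known_pair_gap(known_pairs: Sequence[Pair], control_pairs: Sequence[Pair]) -> float:
    """ASSUMED definition: mean similarity of known pairs minus mean of control pairs."""
    mean_known = sum(cosine(a, b) for a, b in known_pairs) / len(known_pairs)
    mean_control = sum(cosine(a, b) for a, b in control_pairs) / len(control_pairs)
    return mean_known - mean_control


# Toy 2-D vectors: the known pair is identical, the control pair is orthogonal.
known = [([1.0, 0.0], [1.0, 0.0])]
control = [([1.0, 0.0], [0.0, 1.0])]
gap = known_pair_gap(known, control)
print(gap)
```

Under this reading, a positive gap means related pairs sit closer in embedding space than unrelated ones, which is the property a retrieval promotion gate would want to see.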

AIN-510 remains promotion_ready, but public runtime, real-user data, external writes, production telemetry, and runtime embedding authority are still off.

03 · Commands

Proof commands run

uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family onet_task_evidence --include-repaired --dry-run --max-new 25000 --selection-mode progressive
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family onet_task_evidence --include-repaired --max-new 25000 --selection-mode progressive --allow-live-gemini --confirm-paid-api --workers 16 --timeout-seconds 120 --max-retries 5 --write-every 1000
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family onet_task_evidence --include-repaired --limit 50
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repaired-corpus --source-family onet_task_evidence --remaining-only --shard-size 5000
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-chunk-vector-reconciliation
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-runtime-readiness
uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-registry-v2
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate

04 · Next

Resume from here

Next safest move: dry-run another 25,000-row O*NET task tranche, then run it live only if the selector again reports candidates from onet_task_evidence only, zero quality exclusions, and zero stale or orphan candidates.
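
The go/no-go condition can be written down as a small check over the dry-run result. This is a sketch under assumptions: the report dictionary and its keys (source_families, quality_exclusions, stale_candidates, orphan_candidates) are hypothetical, since the selector's actual output format is not shown here.

```python
def live_run_allowed(report: dict) -> bool:
    """Return True only if the dry-run report meets every condition for a live run."""
    # All keys below are hypothetical names chosen for this illustration.
    families = set(report["source_families"])
    return (
        families == {"onet_task_evidence"}
        and report["quality_exclusions"] == 0
        and report["stale_candidates"] == 0
        and report["orphan_candidates"] == 0
    )


clean_report = {
    "source_families": ["onet_task_evidence"],
    "quality_exclusions": 0,
    "stale_candidates": 0,
    "orphan_candidates": 0,
}
# Same report, but one candidate leaked in from another source family.
mixed_report = {**clean_report, "source_families": ["onet_task_evidence", "other_family"]}

print(live_run_allowed(clean_report), live_run_allowed(mixed_report))
```

Any single failed condition blocks the paid run, which keeps the dry-run as the only gate that has to be trusted.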

uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family onet_task_evidence --include-repaired --dry-run --max-new 25000 --selection-mode progressive
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family onet_task_evidence --include-repaired --max-new 25000 --selection-mode progressive --allow-live-gemini --confirm-paid-api --workers 16 --timeout-seconds 120 --max-retries 5 --write-every 1000

Where to start

Use the remaining-only manifest or the guarded foreground selector; do not submit the older full repaired manifest.
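
For planning, the 56,095 remaining chunks can be translated into run and shard counts. The arithmetic below assumes that --max-new 25000 caps each live run and that --shard-size 5000 splits the remaining-only manifest into fixed-size shards with one shorter final shard; neither behavior is confirmed in this handoff, and the split helper is hypothetical.

```python
REMAINING = 56_095   # unembedded O*NET task chunks, from this handoff
MAX_NEW = 25_000     # value passed to --max-new
SHARD_SIZE = 5_000   # value passed to --shard-size


def split(total: int, size: int) -> list[int]:
    """Split total into full blocks of `size` plus one shorter final block, if any."""
    full, rest = divmod(total, size)
    return [size] * full + ([rest] if rest else [])


tranches = split(REMAINING, MAX_NEW)
shards = split(REMAINING, SHARD_SIZE)

print(tranches)                 # sizes of the live runs still needed
print(len(shards), shards[-1])  # shard count and size of the final shard
```

Under these assumptions, the remainder takes two full 25,000-row tranches plus a final run of 6,095 rows, and the manifest spans 12 shards with 1,095 rows in the last one.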