AINA Data Engine Room - Local Handoff - 2026-06-15
O*NET Task Remaining Batch Candidate Checkpoint
The single idea
The repo now has a remaining-only batch-candidate artifact for the unembedded part of onet_task_evidence. It does not submit a Gemini Batch API job; it creates the clean local manifest surface needed before a future submit/poll/ingest lane can safely run.
What Changed
- Added `--remaining-only` to `production-embedding-repaired-corpus`.
- Kept remaining-only output separate under `production_embedding_remaining_batch_candidate_v1`.
- Excluded already embedded `(chunk_id, text_hash)` pairs from the manifest.
- Generated the real O*NET task remaining-only candidate set.
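
The exclusion step above can be pictured as a set-membership filter keyed on the `(chunk_id, text_hash)` pair. The following is a minimal sketch under stated assumptions, not the repo's actual code: the function name `remaining_rows` and the dict-shaped rows are hypothetical, and only the two key names come from this note.

```python
from typing import Iterable, Iterator

# Hypothetical row shape: a dict carrying at least "chunk_id" and "text_hash".
Row = dict[str, str]
Pair = tuple[str, str]


def remaining_rows(candidates: Iterable[Row], embedded: set[Pair]) -> Iterator[Row]:
    """Yield only the rows whose (chunk_id, text_hash) pair has no vector yet."""
    for row in candidates:
        if (row["chunk_id"], row["text_hash"]) not in embedded:
            yield row


if __name__ == "__main__":
    candidates = [
        {"chunk_id": "a", "text_hash": "h1"},  # already embedded, dropped
        {"chunk_id": "b", "text_hash": "h2"},  # never embedded, kept
        {"chunk_id": "a", "text_hash": "h3"},  # same chunk, new text, kept
    ]
    embedded = {("a", "h1")}
    for row in remaining_rows(candidates, embedded):
        print(row["chunk_id"], row["text_hash"])
```

Keying on the pair rather than on `chunk_id` alone means that, in this sketch, a chunk whose text hash differs from the embedded one still counts as remaining.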
Result
| Item | Result |
|---|---|
| Full repaired O*NET task chunks before filter | 131,095 |
| Existing O*NET task vectors excluded | 50,000 |
| Remaining manifest rows | 81,095 |
| Manifest shards | 17 |
| Skipped repairs | 0 |
| Manifest family check | 81,095 / 81,095 |
| AIN-510 after manifest | promotion_ready |
| Reconciliation after manifest | pass |
| Stale vectors after manifest | 0 |
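
The row and shard counts in the table are consistent with each other and with the `--shard-size 5000` value used in the Resume commands. A quick arithmetic check, assuming shards are filled to the shard size and only the last one is partial (the note does not state the shard layout):

```python
import math

full_chunks = 131_095      # full repaired O*NET task chunks before filter
existing_vectors = 50_000  # already embedded, excluded from the manifest
shard_size = 5_000         # value passed to --shard-size

remaining = full_chunks - existing_vectors
shards = math.ceil(remaining / shard_size)
# Assumption: every shard except the last holds exactly shard_size rows.
last_shard_rows = remaining - (shards - 1) * shard_size

print(remaining)        # 81095
print(shards)           # 17
print(last_shard_rows)  # 1095
```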
Boundary
This is a batch candidate, not a batch submission. No live Gemini call was made by this slice, and no batch job was submitted. Public runtime, real-user data, external writes, production telemetry, donor repo mutation/deletion, and runtime embedding authority promotion remain blocked.
Do not submit the older full repaired manifest for this family. Use only the remaining-only manifest path from this checkpoint if a submit/poll/ingest lane is implemented next.
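
If a submit lane is built later, the rule above could be enforced with a local pre-submit guard. This is only a sketch: `assert_remaining_only_manifest` is a hypothetical helper, and it assumes that `production_embedding_remaining_batch_candidate_v1` appears as a directory component of the manifest path and that the caller already knows the manifest row count. It makes no Gemini call.

```python
from pathlib import Path

REMAINING_ONLY_DIR = "production_embedding_remaining_batch_candidate_v1"
EXPECTED_ROWS = 81_095  # remaining manifest rows recorded at this checkpoint


def assert_remaining_only_manifest(manifest_path: Path, row_count: int) -> None:
    """Raise ValueError unless the manifest looks like the remaining-only one."""
    # Assumption: the artifact name is a directory somewhere in the path.
    if REMAINING_ONLY_DIR not in manifest_path.parts:
        raise ValueError(f"not a remaining-only manifest path: {manifest_path}")
    if row_count != EXPECTED_ROWS:
        raise ValueError(f"expected {EXPECTED_ROWS} rows, got {row_count}")


if __name__ == "__main__":
    # Hypothetical path layout, for illustration only.
    good = Path("artifacts") / REMAINING_ONLY_DIR / "manifest.jsonl"
    assert_remaining_only_manifest(good, 81_095)
    print("manifest accepted")
```

Checking both the path and the row count keeps the older full repaired manifest, with its 131,095 rows, from passing by accident.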
Resume
cd /srv/aina/aina-data-engine-room
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repaired-corpus --source-family onet_task_evidence --remaining-only --shard-size 5000
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-chunk-vector-reconciliation