AINA Data Engine Room - Local Handoff - 2026-06-15

O*NET Task Remaining Batch Candidate Checkpoint

Ali Mehdi Mukadam - co-authored with Codex

The single idea

The repo now has a remaining-only batch-candidate artifact for the unembedded part of onet_task_evidence. It does not submit a Gemini Batch API job; it creates the clean local manifest surface needed before a future submit/poll/ingest lane can safely run.

What Changed

  1. Added the --remaining-only flag to the production-embedding-repaired-corpus command.
  2. Kept remaining-only output separate under production_embedding_remaining_batch_candidate_v1.
  3. Excluded already embedded (chunk_id, text_hash) pairs from the manifest.
  4. Generated the real O*NET task remaining-only candidate set.
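
The exclusion in step 3 can be sketched as below. This is a minimal illustration, not the repo's actual implementation: the function name filter_remaining, the dict row shape, and the sample values are hypothetical. Only the (chunk_id, text_hash) pair key is taken from this note.

```python
from typing import Iterable, Iterator


def filter_remaining(
    chunks: Iterable[dict],
    embedded_pairs: set[tuple[str, str]],
) -> Iterator[dict]:
    """Yield only chunks whose (chunk_id, text_hash) pair has no vector yet."""
    for chunk in chunks:
        key = (chunk["chunk_id"], chunk["text_hash"])
        if key not in embedded_pairs:
            yield chunk


# Hypothetical sample data, for illustration only.
chunks = [
    {"chunk_id": "c1", "text_hash": "h1"},
    {"chunk_id": "c2", "text_hash": "h2"},
    {"chunk_id": "c3", "text_hash": "h3-new"},
]
embedded = {("c1", "h1"), ("c3", "h3-old")}

remaining = list(filter_remaining(chunks, embedded))
print([c["chunk_id"] for c in remaining])
```

In this sketch, keying on the pair rather than on chunk_id alone means a chunk whose text changed after it was embedded (c3 here) is still treated as remaining.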

Result

Item                                            Result
Full repaired O*NET task chunks before filter   131,095
Existing O*NET task vectors excluded            50,000
Remaining manifest rows                         81,095
Manifest shards                                 17
Skipped repairs                                 0
Manifest family check                           81,095 / 81,095
AIN-510 after manifest                          promotion_ready
Reconciliation after manifest                   pass
Stale vectors after manifest                    0
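
The counts above are mutually consistent, which a short arithmetic check can confirm. The sketch assumes shards are filled to the --shard-size value of 5000 used in the Resume commands, with only the last shard partial. That packing rule is an assumption, not something this note states.

```python
import math

full_chunks = 131_095      # full repaired O*NET task chunks before filter
already_embedded = 50_000  # existing O*NET task vectors excluded
shard_size = 5_000         # --shard-size from the Resume commands

remaining_rows = full_chunks - already_embedded

# Assumption: every shard is full except possibly the last one.
shard_count = math.ceil(remaining_rows / shard_size)
last_shard_rows = remaining_rows - (shard_count - 1) * shard_size

print(remaining_rows, shard_count, last_shard_rows)
```

Under that assumption the 81,095 remaining rows split into 16 full shards plus one final shard of 1,095 rows, matching the 17 shards reported.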

Boundary

This is a batch candidate, not a batch submission. No live Gemini call was made by this slice, and no batch job was submitted. Public runtime, real-user data, external writes, production telemetry, donor repo mutation/deletion, and runtime embedding authority promotion remain blocked.

Do not submit the older full repaired manifest for this family: it still includes the 50,000 chunks that already have vectors. If a submit/poll/ingest lane is implemented next, use only the remaining-only manifest path from this checkpoint.
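
A future submit lane could enforce this boundary with a path guard along the following lines. This is a hedged sketch only: the function name require_remaining_only and the example paths are hypothetical, and the repo may lay out its artifacts differently. The directory name is the one given under What Changed.

```python
from pathlib import PurePosixPath

# Output directory name for the remaining-only candidate, as named in this note.
REMAINING_DIR = "production_embedding_remaining_batch_candidate_v1"


def require_remaining_only(manifest_path: str) -> PurePosixPath:
    """Return the path if it sits under the remaining-only directory, else raise."""
    path = PurePosixPath(manifest_path)
    if REMAINING_DIR not in path.parts:
        raise ValueError(f"refusing non-remaining-only manifest: {path}")
    return path


# Hypothetical paths, for illustration only.
ok = require_remaining_only(f"artifacts/{REMAINING_DIR}/shard-0001.jsonl")
print(ok.name)

try:
    require_remaining_only("artifacts/some_full_manifest/shard-0001.jsonl")
except ValueError as exc:
    print(exc)
```

Matching on a whole path component rather than a substring avoids accepting a similarly named directory by accident.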

Resume

cd /srv/aina/aina-data-engine-room
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repaired-corpus --source-family onet_task_evidence --remaining-only --shard-size 5000
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-chunk-vector-reconciliation