AINA Data Engine Room / Handoff / 2026-06-12

Source Authority And Embedding Checkpoint

A technical resume point for the production embedding lane after wiring prior title intelligence into the corpus and running the first post-cleanup live Gemini slice.

The Single Idea

This run turned the rediscovery loop into repo behavior. The engine room now consults prior jobs-research title authority, clean candidate evidence, source ledgers, and salvage maps before deciding what can be embedded. It also ran a live Gemini Embedding 2 slice over 500 clean top-worked-title chunks, then stopped expansion because three quality gates still fail.

01 / What Changed

Prior Work Is Now A Build-Time Input

The corpus builder now loads the jobs-research title audit and clean candidate evidence before creating title chunks. Marketplace-shaped titles such as "j.p. morgan wealth management - private client advisor - tulsa ok" now embed as "private client advisor", with the original title preserved as provenance.
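The resolution step described above can be sketched roughly as follows. This is a minimal illustration, not the code in production_embeddings.py: the names TITLE_AUTHORITY, ResolvedTitle, and resolve_title are hypothetical, and the lookup table holds only the one example from this handoff.

```python
from dataclasses import dataclass

# Hypothetical authority map: noisy marketplace title -> trusted clean title.
# In the real lane this would be built from the jobs-research title audit and
# clean candidate evidence; the single entry is the example from the handoff.
TITLE_AUTHORITY = {
    "j.p. morgan wealth management - private client advisor - tulsa ok":
        "private client advisor",
}


@dataclass(frozen=True)
class ResolvedTitle:
    embed_title: str     # text used to build the embedding chunk
    original_title: str  # untouched source title, kept as provenance
    cleaned: bool        # True when prior authority supplied a replacement


def resolve_title(raw_title: str, authority: dict[str, str]) -> ResolvedTitle:
    """Swap in the trusted title when one exists; always keep the original."""
    # Normalize case and whitespace so lookups are not defeated by formatting.
    key = " ".join(raw_title.lower().split())
    clean = authority.get(key)
    if clean is None:
        return ResolvedTitle(raw_title, raw_title, cleaned=False)
    return ResolvedTitle(clean, raw_title, cleaned=True)
```

Keeping original_title on the result means a later audit can still trace every cleaned chunk back to its source row.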

Find: Load prior jobs-research, evidence atlas, ledgers, and salvage maps.
Clean: Resolve noisy marketplace titles before embedding text is built.
Gate: Classify each chunk as embed-now, progressive-only, repair-first, or blocked.
Embed: Run live vectors only for clean eligible slices, then stop on quality failure.
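
The Gate and Embed steps above can be sketched as a loop that only touches embed-now chunks and stops after the first slice that fails a quality check. This is a hedged illustration: the function run_slices, its callables embed_slice and quality_ok, and the default slice size are assumptions, and the handoff does not say how progressive-only chunks are treated, so the sketch simply leaves them out.

```python
from enum import Enum
from typing import Callable, Iterable


class Eligibility(str, Enum):
    # The four labels named in the Gate step.
    EMBED_NOW = "embed-now"
    PROGRESSIVE_ONLY = "progressive-only"
    REPAIR_FIRST = "repair-first"
    BLOCKED = "blocked"


def run_slices(
    chunks: Iterable[tuple[str, Eligibility]],
    embed_slice: Callable[[list[str]], None],  # hypothetical: embeds one slice
    quality_ok: Callable[[], bool],            # hypothetical: runs the gates
    slice_size: int = 500,
) -> int:
    """Embed eligible chunk ids slice by slice; return how many were embedded."""
    eligible = [cid for cid, label in chunks if label is Eligibility.EMBED_NOW]
    embedded = 0
    for start in range(0, len(eligible), slice_size):
        batch = eligible[start:start + slice_size]
        embed_slice(batch)
        embedded += len(batch)
        if not quality_ok():
            break  # stop expansion as soon as a quality check fails
    return embedded
```

Checking quality after every slice caps the spend on a failing lane at one slice, which matches the 500-vector stop recorded in this run.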

02 / Source Authority

The Hidden Gems Are Registered

Metric | Value
Source records tracked | 10
Jobs-research title audit rows | 37,478
Trusted jobs-research titles | 15,104
Clean candidate rows | 44,440
Clean candidate titles | 6,700
JP Morgan title resolves cleanly | true

The registry points future agents to the harvest source map, title ledger, mapping-chain ledger, cross-repo salvage map, jobs-research title audit, jobs-research clean candidates, jobs-research source intelligence, evidence atlas aliases, evidence atlas responsibilities, and source truth ledger.

03 / Corpus And Eligibility

The Full Corpus Is Frozen, But Not All Rows Are Embed-Ready

Metric | Value
Corpus chunks | 298,869
Estimated tokens | 27,745,254
Title chunks cleaned by prior authority | 19,578
Embed now | 68,650
Repair first | 230,219
Batch candidates | 0

Batch remains locked. That is intentional: no source family moves to batch until it has clean eligibility, repair proof, progressive vector proof, and passing retrieval gates.
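
The batch lock amounts to an all-of-four check per source family. The sketch below shows one way to express it; the class SourceFamilyProof and the functions missing_proofs and batch_unlocked are hypothetical names, and only the four conditions themselves come from this handoff.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SourceFamilyProof:
    # One flag per precondition listed in the handoff.
    clean_eligibility: bool
    repair_proof: bool
    progressive_vector_proof: bool
    retrieval_gates_pass: bool


def missing_proofs(proof: SourceFamilyProof) -> list[str]:
    """Names of the preconditions this source family has not met yet."""
    return [f.name for f in fields(proof) if not getattr(proof, f.name)]


def batch_unlocked(proof: SourceFamilyProof) -> bool:
    """A family may move to batch only when every precondition holds."""
    return not missing_proofs(proof)
```

Returning the list of missing proofs, rather than a bare boolean, tells the next agent which repair to start with.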

04 / Live Embedding

The API Works, The Quality Gate Held

Metric | Value
Status | progressive_partial
Live Gemini invoked | true
New vectors | 500
Failed rows | 0
Total vectors | 2,077
Top 1,000 vector count | 814
Top 500 vector count | 478
Cosine gap | 0.139058

The failed gates are gate_1_known_pairs_separate, gate_2_top_1000_and_500_complete, and gate_5_runtime_retrieval_candidate. Do not expand until these are addressed.
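
A small sketch of the stop decision, using the numbers reported above. It assumes the validation report exposes failed_quality_gates as a top-level list (the jq command in this handoff selects that key); the functions may_expand and coverage are hypothetical helpers, not part of the repo.

```python
def may_expand(report: dict) -> bool:
    """Allow another live slice only when the report lists zero failed gates.

    A missing failed_quality_gates key is treated as a failure, so an
    incomplete report can never unlock more spend.
    """
    failed = report.get("failed_quality_gates")
    return failed is not None and len(failed) == 0


def coverage(vector_count: int, target: int) -> float:
    """Fraction of a top-title target set that already has vectors."""
    return vector_count / target


# Values copied from this handoff; the dict layout is an assumption.
report = {
    "status": "progressive_partial",
    "failed_quality_gates": [
        "gate_1_known_pairs_separate",
        "gate_2_top_1000_and_500_complete",
        "gate_5_runtime_retrieval_candidate",
    ],
}

top_1000 = coverage(814, 1000)  # 814 of the top 1,000 titles have vectors
top_500 = coverage(478, 500)    # 478 of the top 500 titles have vectors
```

With these inputs the top 1,000 set is 186 vectors short and the top 500 set is 22 short, which is consistent with gate_2_top_1000_and_500_complete failing.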

05 / Gate Cleanup

Old Human-Review Metadata Is Gone From This Lane

Generated embedding metadata now uses label_authority_status, not review_status. The recursive JSON scan found zero old gate keys or values across 298,869 eligibility rows and 289,720 repair rows.

uv run python - <<'PY'
# Recursive scan found:
# production_embedding_eligibility_v1.jsonl rows=298869 bad=0
# production_embedding_repair_queue_v1.jsonl rows=289720 bad=0
PY
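
The scan body is not recorded in this handoff, so the following is only a sketch of what a recursive JSONL scan of this kind might look like. The functions contains_banned and scan_jsonl are hypothetical; review_status is the one old key named above, and any banned values would have to be supplied by the caller.

```python
import json
from typing import Any, Iterable


def contains_banned(
    node: Any, banned_keys: frozenset[str], banned_values: frozenset[str]
) -> bool:
    """Walk nested dicts and lists, looking for a banned key or string value."""
    if isinstance(node, dict):
        return any(
            key in banned_keys or contains_banned(value, banned_keys, banned_values)
            for key, value in node.items()
        )
    if isinstance(node, list):
        return any(contains_banned(item, banned_keys, banned_values) for item in node)
    return isinstance(node, str) and node in banned_values


def scan_jsonl(
    lines: Iterable[str],
    banned_keys: frozenset[str],
    banned_values: frozenset[str] = frozenset(),
) -> tuple[int, int]:
    """Return (rows, bad) for a JSONL stream, skipping blank lines."""
    rows = bad = 0
    for line in lines:
        if not line.strip():
            continue
        rows += 1
        if contains_banned(json.loads(line), banned_keys, banned_values):
            bad += 1
    return rows, bad
```

Passing an open file handle for production_embedding_eligibility_v1.jsonl together with frozenset({"review_status"}) would yield a rows/bad pair in the same shape as the comment lines above.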

Resume With Gate Repair, Not More Spend

Start by improving the quality-pair construction and repairing the 70 top-worked-title rows that are still repair-first. Then rerun the top-worked-title dry-run and only execute another small live slice after Gate 1 and Gate 2 pass.

cd /srv/aina/aina-data-engine-room
jq '{status, valid, failed_quality_gates, metrics}' artifacts/validation/ain_506_live_gemini_embedding_run_v1.json
sed -n '2327,2405p' src/aina_data_engine/production_embeddings.py
head -20 artifacts/validation/ain_506_live_gemini_embedding_run_v1_quality_pairs.jsonl

Where To Start

The source-authority loop is addressed; the next work is semantic quality, top-title completion, and only then more embeddings.