AINA Data Engine Room / Handoff / 2026-06-12

Source Authority And Embedding Checkpoint

A technical resume point for the production embedding lane after wiring prior title intelligence into the corpus and running the first post-cleanup live Gemini slice.

The Single Idea

This run turned the rediscovery loop into repo behavior. The engine room now consults prior jobs-research title authority, clean candidate evidence, source ledgers, and salvage maps before deciding what can be embedded. It also ran a real Gemini Embedding 2 live slice for 500 clean top-worked-title chunks, then correctly stopped expansion because quality gates are still partial.

01 / What Changed

Prior Work Is Now A Build-Time Input

The corpus builder now loads the jobs-research title audit and clean candidate evidence before creating title chunks. Marketplace-shaped titles such as "j.p. morgan wealth management - private client advisor - tulsa ok" now embed as "private client advisor", with the original title preserved as provenance.
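The title cleanup described above can be sketched as a lookup against a trusted title table. This is a minimal illustration, not the builder's actual code: `TITLE_AUTHORITY`, `ResolvedTitle`, and `resolve_title` are hypothetical names, and the single table entry is the example from this note rather than real audit data.

```python
from dataclasses import dataclass

# Hypothetical authority table: noisy marketplace title -> trusted clean title.
# In the real lane this would be loaded from the jobs-research title audit.
TITLE_AUTHORITY = {
    "j.p. morgan wealth management - private client advisor - tulsa ok":
        "private client advisor",
}


@dataclass(frozen=True)
class ResolvedTitle:
    embed_title: str            # text used when building the embedding chunk
    original_title: str         # untouched input, kept as provenance
    cleaned_by_authority: bool  # True when the authority table supplied the title


def resolve_title(raw_title: str, authority: dict = TITLE_AUTHORITY) -> ResolvedTitle:
    """Resolve a raw title against the authority table, preserving the original."""
    # Lowercase and collapse whitespace so lookups are stable.
    key = " ".join(raw_title.lower().split())
    clean = authority.get(key)
    if clean is None:
        # Unknown title: fall back to the normalized text (an assumption).
        return ResolvedTitle(key, raw_title, False)
    return ResolvedTitle(clean, raw_title, True)
```

Keeping `original_title` alongside `embed_title` means a cleaned chunk can always be traced back to the marketplace string it came from.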

Find: Load prior jobs-research, evidence atlas, ledgers, and salvage maps.

Clean: Resolve noisy marketplace titles before embedding text is built.

Gate: Classify each chunk as embed-now, progressive-only, repair-first, or blocked.

Embed: Run live vectors only for clean eligible slices, then stop on quality failure.
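The gate step above can be pictured as a small classifier over per-chunk flags. The four status names come from this note; the flag names (`source_blocked`, `needs_repair`, `progressive_only`) and the precedence between them are assumptions made for illustration, since the note does not state the real decision rules.

```python
from collections import Counter
from enum import Enum


class Eligibility(Enum):
    EMBED_NOW = "embed-now"
    PROGRESSIVE_ONLY = "progressive-only"
    REPAIR_FIRST = "repair-first"
    BLOCKED = "blocked"


def classify_chunk(*, source_blocked: bool, needs_repair: bool,
                   progressive_only: bool) -> Eligibility:
    """Map hypothetical per-chunk flags to one of the four statuses.

    Assumed precedence: blocked > repair-first > progressive-only > embed-now.
    """
    if source_blocked:
        return Eligibility.BLOCKED
    if needs_repair:
        return Eligibility.REPAIR_FIRST
    if progressive_only:
        return Eligibility.PROGRESSIVE_ONLY
    return Eligibility.EMBED_NOW


def summarize(chunks: list) -> Counter:
    """Count chunks per status; each chunk is a dict of the three flags."""
    return Counter(classify_chunk(**flags).value for flags in chunks)
```

A summary like this is one way to produce the per-status counts reported in the eligibility table.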

02 / Source Authority

The Hidden Gems Are Registered

Metric	Value
Source records tracked	10
Jobs-research title audit rows	37,478
Trusted jobs-research titles	15,104
Clean candidate rows	44,440
Clean candidate titles	6,700
JP Morgan title resolves cleanly	true

The registry points future agents to the harvest source map, title ledger, mapping-chain ledger, cross-repo salvage map, jobs-research title audit, jobs-research clean candidates, jobs-research source intelligence, evidence atlas aliases, evidence atlas responsibilities, and source truth ledger.

03 / Corpus And Eligibility

The Full Corpus Is Frozen, But Not All Rows Are Embed-Ready

Metric	Value
Corpus chunks	298,869
Estimated tokens	27,745,254
Title chunks cleaned by prior authority	19,578
Embed now	68,650
Repair first	230,219
Batch candidates	0

Batch remains locked. That is intentional: no source family moves to batch until it has clean eligibility, repair proof, progressive vector proof, and passing retrieval gates.
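The batch lock reads as a conjunction of four proofs. A minimal sketch of that rule follows, assuming each proof can be reduced to a boolean; `SourceFamilyProof`, `batch_unlocked`, and `missing_requirements` are hypothetical names, not identifiers from the repo.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SourceFamilyProof:
    clean_eligibility: bool
    repair_proof: bool
    progressive_vector_proof: bool
    retrieval_gates_passing: bool


def batch_unlocked(proof: SourceFamilyProof) -> bool:
    """A source family may move to batch only when every proof is present."""
    return all(asdict(proof).values())


def missing_requirements(proof: SourceFamilyProof) -> list:
    """List the proofs that are still missing, in field order."""
    return [name for name, ok in asdict(proof).items() if not ok]
```

With retrieval gates still failing, `missing_requirements` would report `retrieval_gates_passing` and `batch_unlocked` would stay false, which matches the zero batch candidates above.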

04 / Live Embedding

The API Works, The Quality Gate Held

Metric	Value
Status	progressive_partial
Live Gemini invoked	true
New vectors	500
Failed rows	0
Total vectors	2,077
Top 1,000 vector count	814
Top 500 vector count	478
Cosine gap	0.139058

The failed gates are gate_1_known_pairs_separate, gate_2_top_1000_and_500_complete, and gate_5_runtime_retrieval_candidate. Do not expand until these are addressed.
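The top-title counts in the table translate into coverage figures as follows. This sketch only reproduces the arithmetic and the stop rule; the helper names are hypothetical, and the reading that Gate 2 requires full coverage of both sets is an assumption based on the gate's name.

```python
# Failed gate names as reported in this note.
FAILED_GATES = [
    "gate_1_known_pairs_separate",
    "gate_2_top_1000_and_500_complete",
    "gate_5_runtime_retrieval_candidate",
]


def coverage(vector_count: int, target: int) -> float:
    """Fraction of the target title set that already has a vector."""
    if target <= 0:
        raise ValueError("target must be positive")
    return vector_count / target


def missing(vector_count: int, target: int) -> int:
    """Number of titles in the target set still without a vector."""
    return max(target - vector_count, 0)


def may_expand(failed_gates: list) -> bool:
    """Expansion is allowed only when no quality gate is failing."""
    return len(failed_gates) == 0


top_1000 = coverage(814, 1000)  # 0.814, with 186 titles missing
top_500 = coverage(478, 500)    # 0.956, with 22 titles missing
```

Both sets are incomplete, so under the stated assumption Gate 2 fails on either count alone, and `may_expand(FAILED_GATES)` stays false.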

05 / Gate Cleanup

Old Human-Review Metadata Is Gone From This Lane

Generated embedding metadata now uses label_authority_status, not review_status. The recursive JSON scan found zero old gate keys or values across 298,869 eligibility rows and 289,720 repair rows.

uv run python - <<'PY'
# Recursive scan found:
# production_embedding_eligibility_v1.jsonl rows=298869 bad=0
# production_embedding_repair_queue_v1.jsonl rows=289720 bad=0
PY
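The scan body is not included in this note. A recursive check of this kind might look like the sketch below: it walks every nested dict and list in each JSONL row and flags rows that contain a forbidden key or string value. `review_status` is the only old key named here; the function names and the `bad_values` parameter are assumptions.

```python
import json
from typing import Any, Iterable


def contains_forbidden(node: Any, bad_keys: frozenset, bad_values: frozenset) -> bool:
    """Return True if a forbidden key or string value appears anywhere in node."""
    if isinstance(node, dict):
        return any(
            key in bad_keys or contains_forbidden(value, bad_keys, bad_values)
            for key, value in node.items()
        )
    if isinstance(node, list):
        return any(contains_forbidden(item, bad_keys, bad_values) for item in node)
    return isinstance(node, str) and node in bad_values


def scan_jsonl_lines(lines: Iterable[str], bad_keys: frozenset,
                     bad_values: frozenset = frozenset()) -> tuple:
    """Return (rows, bad) for an iterable of JSONL lines, skipping blank lines."""
    rows = bad = 0
    for line in lines:
        if not line.strip():
            continue
        rows += 1
        if contains_forbidden(json.loads(line), bad_keys, bad_values):
            bad += 1
    return rows, bad
```

Passing an open file handle for each of the two JSONL artifacts, with `bad_keys=frozenset({"review_status"})`, would yield `rows` and `bad` counts in the same shape as the comment lines above.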

06 / Next Slice

Resume With Gate Repair, Not More Spend

Start by improving the quality-pair construction and repairing the 70 top-worked-title rows that are still repair-first. Then rerun the top-worked-title dry-run and only execute another small live slice after Gate 1 and Gate 2 pass.

cd /srv/aina/aina-data-engine-room
jq '{status, valid, failed_quality_gates, metrics}' artifacts/validation/ain_506_live_gemini_embedding_run_v1.json
sed -n '2327,2405p' src/aina_data_engine/production_embeddings.py
head -20 artifacts/validation/ain_506_live_gemini_embedding_run_v1_quality_pairs.jsonl

Where To Start

The source-authority loop is addressed; the next work is semantic quality, top-title completion, and only then more embeddings.