Source Authority And Embedding Checkpoint
A technical resume point for the production embedding lane after wiring prior title intelligence into the corpus and running the first post-cleanup live Gemini slice.
This run turned the rediscovery loop into repo behavior. The engine room now consults prior jobs-research title authority, clean candidate evidence, source ledgers, and salvage maps before deciding what can be embedded. It also ran a live Gemini Embedding 2 slice of 500 clean top-worked-title chunks, then stopped expansion because three quality gates still fail (run status: progressive_partial).
Prior Work Is Now A Build-Time Input
The corpus builder now loads the jobs-research title audit and clean candidate evidence before creating title chunks. Marketplace-shaped titles such as "j.p. morgan wealth management - private client advisor - tulsa ok" now embed as "private client advisor", with the original title preserved as provenance.
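The title cleanup described above can be sketched roughly as follows. This is a minimal illustration, not the builder's actual code: `TITLE_AUTHORITY`, `TitleChunk`, and `build_title_chunk` are hypothetical names, and the lookup table stands in for whatever the jobs-research title audit provides.

```python
from dataclasses import dataclass

# Hypothetical authority map: normalized marketplace title -> trusted title.
# In the real builder this would be loaded from the jobs-research title audit.
TITLE_AUTHORITY = {
    "j.p. morgan wealth management - private client advisor - tulsa ok":
        "private client advisor",
}


@dataclass(frozen=True)
class TitleChunk:
    embed_text: str             # text sent to the embedding model
    original_title: str         # raw title, kept as provenance
    cleaned_by_authority: bool  # True when a prior-authority match was used


def build_title_chunk(raw_title: str, authority: dict[str, str]) -> TitleChunk:
    # Normalize case and whitespace before looking the title up.
    key = " ".join(raw_title.lower().split())
    canonical = authority.get(key)
    if canonical is None:
        # No prior authority: embed the raw title unchanged.
        return TitleChunk(raw_title, raw_title, False)
    return TitleChunk(canonical, raw_title, True)


chunk = build_title_chunk(
    "J.P. Morgan Wealth Management - Private Client Advisor - Tulsa OK",
    TITLE_AUTHORITY,
)
print(chunk.embed_text, "|", chunk.original_title)
```

Keeping the raw title on the chunk means a later audit can trace any embedded text back to the source row that produced it.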
The Hidden Gems Are Registered
| Metric | Value |
|---|---|
| Source records tracked | 10 |
| Jobs-research title audit rows | 37,478 |
| Trusted jobs-research titles | 15,104 |
| Clean candidate rows | 44,440 |
| Clean candidate titles | 6,700 |
| JP Morgan title resolves cleanly | true |
The registry points future agents to the harvest source map, title ledger, mapping-chain ledger, cross-repo salvage map, jobs-research title audit, jobs-research clean candidates, jobs-research source intelligence, evidence atlas aliases, evidence atlas responsibilities, and source truth ledger.
The Full Corpus Is Frozen, But Not All Rows Are Embed-Ready
| Metric | Value |
|---|---|
| Corpus chunks | 298,869 |
| Estimated tokens | 27,745,254 |
| Title chunks cleaned by prior authority | 19,578 |
| Embed now | 68,650 |
| Repair first | 230,219 |
| Batch candidates | 0 |
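As a quick sanity check on the table above, the embed-now and repair-first counts should add up to the full corpus. The short sketch below verifies that and derives two ratios that are not in the report itself: the embed-ready share (roughly 23%) and the average estimated tokens per chunk (about 93). The variable names are illustrative only.

```python
# Figures copied from the corpus table.
corpus_chunks = 298_869
estimated_tokens = 27_745_254
embed_now = 68_650
repair_first = 230_219

# The two eligibility buckets should partition the frozen corpus.
assert embed_now + repair_first == corpus_chunks

embed_ready_share = embed_now / corpus_chunks
avg_tokens_per_chunk = estimated_tokens / corpus_chunks

print(f"embed-ready share: {embed_ready_share:.1%}")
print(f"average tokens per chunk: {avg_tokens_per_chunk:.1f}")
```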
The API Works, The Quality Gate Held
| Metric | Value |
|---|---|
| Status | progressive_partial |
| Live Gemini invoked | true |
| New vectors | 500 |
| Failed rows | 0 |
| Total vectors | 2,077 |
| Top 1,000 vector count | 814 |
| Top 500 vector count | 478 |
| Cosine gap | 0.139058 |
The failed gates are gate_1_known_pairs_separate, gate_2_top_1000_and_500_complete, and gate_5_runtime_retrieval_candidate. Do not expand until these are addressed.
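To make the Gate 2 shortfall concrete, the sketch below computes coverage for the top-1,000 and top-500 title sets from the vector counts in the table. It assumes, based only on the gate name, that "complete" means every title in each set has a vector; the real gate logic may differ, and `coverage` is a hypothetical helper.

```python
def coverage(vectors_present: int, target: int) -> tuple[float, int]:
    """Return (fraction covered, number of vectors still missing)."""
    return vectors_present / target, target - vectors_present


# Counts copied from the run table.
top_sets = {
    "top_1000": (814, 1000),
    "top_500": (478, 500),
}

results = {}
for name, (present, target) in top_sets.items():
    share, missing = coverage(present, target)
    results[name] = (share, missing)
    # Assumed pass condition: nothing missing from the set.
    status = "pass" if missing == 0 else "fail"
    print(f"{name}: {share:.1%} covered, {missing} missing -> {status}")
```

Under that assumption, 186 of the top 1,000 titles and 22 of the top 500 still need vectors before Gate 2 can pass.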
Old Human-Review Metadata Is Gone From This Lane
Generated embedding metadata now uses label_authority_status, not review_status. The recursive JSON scan found zero old gate keys or values across 298,869 eligibility rows and 289,720 repair rows.
uv run python - <<'PY'
# Recursive scan found:
# production_embedding_eligibility_v1.jsonl rows=298869 bad=0
# production_embedding_repair_queue_v1.jsonl rows=289720 bad=0
PY
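The body of that scan is not reproduced here. A minimal sketch of what such a recursive check could look like follows. The only old key taken from the text is `review_status`; `OLD_VALUES` is an empty placeholder because the old gate values are not listed in this checkpoint, and the function names are hypothetical.

```python
import json
from pathlib import Path

OLD_KEYS = {"review_status"}  # the old key named in the text
OLD_VALUES: set[str] = set()  # placeholder: old gate values are not listed here


def contains_old_gate(node) -> bool:
    """Walk nested dicts and lists looking for an old gate key or value."""
    if isinstance(node, dict):
        return any(
            key in OLD_KEYS or contains_old_gate(value)
            for key, value in node.items()
        )
    if isinstance(node, list):
        return any(contains_old_gate(item) for item in node)
    return isinstance(node, str) and node in OLD_VALUES


def scan_jsonl(path: Path) -> tuple[int, int]:
    """Return (rows scanned, rows containing old gate metadata)."""
    rows = bad = 0
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            rows += 1
            bad += contains_old_gate(json.loads(line))
    return rows, bad
```

Calling `scan_jsonl(Path(...))` on each JSONL artifact and printing `rows` and `bad` would produce output in the same shape as the summary above.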
Resume With Gate Repair, Not More Spend
Start by improving the quality-pair construction and repairing the 70 top-worked-title rows that are still repair-first. Then rerun the top-worked-title dry-run and only execute another small live slice after Gate 1 and Gate 2 pass.
cd /srv/aina/aina-data-engine-room
jq '{status, valid, failed_quality_gates, metrics}' artifacts/validation/ain_506_live_gemini_embedding_run_v1.json
sed -n '2327,2405p' src/aina_data_engine/production_embeddings.py
head -20 artifacts/validation/ain_506_live_gemini_embedding_run_v1_quality_pairs.jsonl
The source-authority loop is addressed; the next work is semantic quality, top-title completion, and only then more embeddings.