AINA Data Engine Room handoff · 2026-06-13 · branch ali/ain-506-p0-gate-2026-06-12

JD-Aware Repair And Live Embedding Checkpoint

A clean-before-embed repair slice that fixed glued JD responsibility text and embedded one newly eligible JD-aware chunk as a Gemini vector.

The Single Idea

This checkpoint repairs part of the JD-aware role-context evidence layer before embedding, then embeds exactly one newly eligible JD-aware semantic chunk with Gemini Embedding 2 through the paid Vertex ADC path. Rows that had real job-description text but failed snippet extraction because of glued headings like ResponsibilitiesApply are now parsed deterministically; rows with no JD/job context remain blocked.

01 · Repair

What Changed

The JD-aware responsibility extractor now handles common job-posting formatting defects: punctuation without following spaces, glued responsibility headings, and section headings such as Typical Tasks, Daily Tasks, What You Will Do, and Key Accountabilities.

This is deterministic cleanup, not LLM repair. It does not mutate raw LinkedIn/job rows, does not embed raw JD dumps, and does not loosen safety gates.
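A minimal sketch of what this kind of deterministic cleanup can look like, not the extractor's actual code. The function name `repair_jd_text`, the regexes, and the choice to put a line break after a glued heading are assumptions for illustration; the heading names come from the text above, with Responsibilities taken from the ResponsibilitiesApply example.

```python
import re

# Heading names taken from this handoff; the list itself is illustrative.
HEADINGS = (
    "Key Accountabilities",
    "What You Will Do",
    "Typical Tasks",
    "Daily Tasks",
    "Responsibilities",
)

# A known heading (any letter case) glued directly to a capitalised word,
# e.g. "ResponsibilitiesApply". The scoped (?i:...) flag keeps the
# uppercase lookahead case-sensitive.
_HEADING_RE = re.compile(
    r"\b((?i:" + "|".join(re.escape(h) for h in HEADINGS) + r"))(?=[A-Z])"
)

# Sentence punctuation with no following space, e.g. "standards.Support".
# Requiring a lowercase letter before it leaves "U.S." and "3.5" alone.
_SENTENCE_GLUE_RE = re.compile(r"(?<=[a-z])([.;:!?])(?=[A-Z])")

# A comma wedged between two letters, e.g. "customers,restock".
_COMMA_GLUE_RE = re.compile(r"(?<=[A-Za-z]),(?=[A-Za-z])")


def repair_jd_text(text: str) -> str:
    """Return a cleaned copy of the text; the input is never modified."""
    text = _HEADING_RE.sub(r"\1\n", text)
    text = _SENTENCE_GLUE_RE.sub(r"\1 ", text)
    text = _COMMA_GLUE_RE.sub(", ", text)
    return text


sample = "ResponsibilitiesApply visual standards.Support customers,restock shelves"
print(repair_jd_text(sample))
```

Because each rule is a pure string substitution, running the function twice gives the same result as running it once, which keeps the repair reproducible without any model call.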
02 · Counts

Data Movement

Measure                              Before   After
Rows with JD summaries                  954     962
Rows with responsibility snippets       954     962
Evidence-layer repair_first rows         16       8
Evidence-layer blocked rows              34      34
JD-aware clean corpus chunks            288     289
JD-aware vectors                        288     289
Total Gemini vectors                   6506    6507
Stale vectors                             0       0
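
The movement in these counts can be checked with a few lines of arithmetic. This is only a sketch: the dictionary keys are made-up labels, and the numbers are copied from the Data Movement counts. That the 8 newly parsed rows are the same 8 that left repair_first is an inference from the matching deltas, not something the pipeline output states.

```python
# (before, after) pairs copied from the Data Movement counts.
counts = {
    "rows_with_jd_summaries": (954, 962),
    "rows_with_responsibility_snippets": (954, 962),
    "repair_first_rows": (16, 8),
    "blocked_rows": (34, 34),
    "jd_aware_clean_chunks": (288, 289),
    "jd_aware_vectors": (288, 289),
    "total_gemini_vectors": (6506, 6507),
    "stale_vectors": (0, 0),
}

# Change per measure: positive means the count grew.
deltas = {name: after - before for name, (before, after) in counts.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+d}")
```

The deltas line up with the narrative: eight more rows have summaries and snippets, eight fewer sit in repair_first, the blocked set is unchanged, and exactly one chunk and one vector were added.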
03 · Gemini

Live Embedding Proof

Candidate   1 JD-aware chunk; Temporary Sales Associate Tommy Hilfiger
Model       gemini-embedding-2; 768 dimensions; paid Vertex ADC path
Result      1 embedded; 0 failed; cosine gap 0.196032

uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family jd_aware_role_context --max-new 1 --selection-mode progressive --allow-live-gemini --confirm-paid-api --timeout-seconds 120 --max-retries 2

04 · Authority

Authority Sync

AIN-510 caught a temporary authority mismatch after the live vector: the vector existed before the global authoritative chunk table had been refreshed. The fix was to refresh the full production semantic corpus with raw market rows still excluded, then rerun reconciliation and downstream receipts.
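
A rough sketch of why the order of operations matters here, not the AIN-510 check itself. It assumes a vector counts as matched when its chunk id appears in the authoritative chunk table and as stale otherwise; the names `reconcile` and `Reconciliation` and the chunk ids are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Reconciliation:
    matched: int
    stale: int


def reconcile(
    vector_chunk_ids: List[str], authoritative_chunk_ids: Iterable[str]
) -> Reconciliation:
    """Count vectors whose chunk id is, or is not, in the authoritative table."""
    authoritative = set(authoritative_chunk_ids)
    matched = sum(1 for chunk_id in vector_chunk_ids if chunk_id in authoritative)
    return Reconciliation(matched=matched, stale=len(vector_chunk_ids) - matched)


# Toy ids only. "c3" stands in for the newly embedded chunk.
vectors = ["c1", "c2", "c3"]

# Chunk table not yet refreshed: the new vector has no authoritative chunk.
print(reconcile(vectors, ["c1", "c2"]))

# After the refresh the table includes the new chunk and everything matches.
print(reconcile(vectors, ["c1", "c2", "c3", "c4"]))
```

Under this assumption the mismatch is purely a sequencing problem: the vector is valid, but reconciliation cannot confirm it until the chunk table it is checked against has been rebuilt.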

Final Measure             Count
Base chunks              294672
Repaired overlay chunks   27844
Combined chunks          322516
Vector rows                6507
Matched vectors            6507
Stale vectors                 0

05 · Proof

Verification

Focused tests, ruff, semantic QA, chunk/vector reconciliation, AIN-506, AIN-510, production runtime readiness, source-authority registry v2, runtime contracts, source-authority start-here, and full validation all pass. AIN-510 is promotion_ready, while public runtime, real-user data, external writes, and production telemetry remain disabled.

06 · Remaining

What Remains

The full production goal is not complete. Next, inspect the remaining 8 JD-aware repair-first rows, preserve the 34 blocked rows unless source refs can be found, use the 50 JD-aware fixtures to harden AI Fluency maps, and continue embeddings only by source family after semantic QA and eligibility gates.

Where To Start Next

Start with the remaining 8 JD-aware repair-first rows; graduate only rows with clean compressed JD context, and keep everything else out until source authority improves.
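
One way to picture the graduation rule is as a small triage function. This is a sketch under assumptions: the `JdRow` fields, the `triage` name, and the "eligible" label are invented for illustration, while repair_first and blocked are the states named in this handoff. The real gates (semantic QA, eligibility, source authority) are richer than three booleans.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JdRow:
    row_id: str
    has_source_ref: bool                  # hypothetical: a source ref was found
    compressed_jd_context: Optional[str]  # hypothetical: cleaned, compressed JD text
    passed_semantic_qa: bool              # hypothetical: semantic QA result


def triage(row: JdRow) -> str:
    """Return 'blocked', 'repair_first', or 'eligible' for one row."""
    if not row.has_source_ref:
        # No source authority: stays blocked until a source ref can be found.
        return "blocked"
    if not row.compressed_jd_context or not row.passed_semantic_qa:
        # Has a source but no clean compressed context yet.
        return "repair_first"
    return "eligible"


rows = [
    JdRow("r1", False, None, False),
    JdRow("r2", True, None, False),
    JdRow("r3", True, "Assist customers; maintain visual standards.", True),
]
for row in rows:
    print(row.row_id, triage(row))
```

Checking source authority first means a row can never graduate on text quality alone, which matches the instruction to keep everything else out until source authority improves.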