JD-Aware Derived Evidence Embedding Handoff
A clean-before-embed checkpoint that repairs real JD context, refreshes Gemini vectors, and keeps exact-cosine retrieval green.
This checkpoint fixes the title-only trap for the JD-aware embedding lane. The engine now extracts useful responsibility evidence from messy but source-linked job descriptions, embeds derived role-context text instead of raw JD fields, and keeps exact-cosine retrieval green with current Gemini vectors.
Cleaner JD Evidence
The JD-aware responsibility extractor now handles common marketplace formatting problems: semicolon-separated task lists, glued action boundaries such as "programsBuilds", spaced action boundaries such as "podium Engaging", and colon-introduced duty lists such as "by: Greeting guests". It also filters out the preamble of law-enforcement descriptions, so that snippets favor the actual records and clerical tasks.
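The four boundary cases above can be sketched in a few lines of Python. This is only an illustration of the idea, not the extractor in jd_aware_role_context.py: the pattern names, the function name, the sample sentences, and the choice to drop the text before a colon are all assumptions.

```python
import re

# Hypothetical patterns; the real rules in jd_aware_role_context.py may differ.
# Glued boundary: a lowercase letter directly followed by a capitalized
# action word ending in "s" or "ing" ("programsBuilds").
GLUED = re.compile(r"(?<=[a-z])(?=[A-Z][a-z]+(?:s|ing)\b)")
# Spaced boundary: a mid-sentence capitalized "-ing" word ("podium Engaging").
SPACED = re.compile(r"(?<=[a-z]) (?=[A-Z][a-z]+ing\b)")
# Colon-introduced list: a colon followed by a capitalized duty ("by: Greeting").
COLON = re.compile(r":\s+(?=[A-Z])")


def split_responsibilities(text: str) -> list[str]:
    """Split messy JD text into candidate responsibility snippets."""
    # Assumption: text before the first colon is an introduction, so drop it.
    body = COLON.split(text, maxsplit=1)[-1]
    # Turn glued and spaced action boundaries into explicit separators.
    body = GLUED.sub(";", body)
    body = SPACED.sub(";", body)
    parts = [part.strip(" .") for part in body.split(";")]
    # Keep only fragments long enough to read as a task.
    return [part for part in parts if len(part.split()) >= 2]
```

Heuristics like these can misfire on proper nouns or brand names, which is one reason to run eligibility and semantic QA over the derived snippets before embedding them.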
The production embedding corpus now includes derived JD summaries and derived responsibility evidence in the embedding text. It still excludes raw JD field names like jd_summary and responsibility_snippets, and semantic QA checks that raw JD text is not embedded.
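A minimal sketch of that separation, assuming the embedding text is assembled from derived values and then scanned for raw field names. The function names and the line labels ("Role", "Context", "Responsibility") are hypothetical; only the two field names come from this handoff.

```python
# Raw JD field names that must never appear in embedding text.
RAW_JD_FIELD_NAMES = ("jd_summary", "responsibility_snippets")


def build_embedding_text(title: str, summary: str, evidence: list[str]) -> str:
    """Assemble embedding text from derived values only (hypothetical layout)."""
    lines = [f"Role: {title}", f"Context: {summary}"]
    lines += [f"Responsibility: {item}" for item in evidence]
    return "\n".join(lines)


def leaked_field_names(embedding_text: str) -> list[str]:
    """Return any raw JD field names found in the text; empty means clean."""
    lowered = embedding_text.lower()
    return [name for name in RAW_JD_FIELD_NAMES if name in lowered]
```

A QA pass in this style would fail any chunk for which leaked_field_names returns a non-empty list, for example when a serialized record is embedded instead of the derived text.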
Numbers That Matter
| Surface | Result |
|---|---|
| JD-aware role-context rows | 1,004 |
| Rows with job context | 970 |
| Rows with responsibility snippets | 970 |
| JD-aware repair-first rows | 0 |
| JD-aware clean corpus chunks | 292 |
| JD-aware semantic QA | 50 pass / 0 fail |
| Live Gemini embedded rows | 292 |
| Live Gemini failed rows | 0 |
| Total current vectors | 6,510 |
| Stale vectors | 0 |
| Full base corpus chunks | 294,675 |
| Combined chunk authority | 322,519 |
| AIN-510 known-pair cosine gap | 0.190463 |
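
For readers unfamiliar with the last row, here is one way a known-pair cosine gap could be computed. The definition used here (cosine of the known pair minus the best cosine among non-matching candidates) is an assumption, as are the function names; the AIN-510 gate may define the gap differently.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Exact cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def known_pair_gap(
    query: list[float],
    known_match: list[float],
    distractors: list[list[float]],
) -> float:
    """Assumed definition: known-pair cosine minus the best distractor cosine."""
    best_distractor = max(cosine(query, d) for d in distractors)
    return cosine(query, known_match) - best_distractor
```

Under this reading, a positive gap means the known pair outranks every distractor, and a larger gap means more headroom before retrieval would flip.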
Verification Trail
uv run pytest tests/test_jd_aware_role_context.py tests/test_production_embeddings.py::test_production_embedding_corpus_adds_clean_contextual_runtime_families tests/test_production_embeddings.py::test_production_embedding_semantic_qa_passes_clean_contextual_sample -q
uv run ruff check src/aina_data_engine/jd_aware_role_context.py src/aina_data_engine/production_embeddings.py tests/test_jd_aware_role_context.py tests/test_production_embeddings.py
uv run aina-data-engine --root /srv/aina/aina-data-engine-room jd-aware-role-context-evidence --fixture-limit 50
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-corpus --source-family jd_aware_role_context
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-eligibility --source-family jd_aware_role_context
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family jd_aware_role_context --limit 50
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family jd_aware_role_context --max-new 292 --selection-mode progressive --dry-run
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family jd_aware_role_context --max-new 292 --selection-mode progressive --allow-live-gemini --confirm-paid-api --timeout-seconds 300 --max-retries 2
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-corpus --full-corpus --market-limit 0
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-chunk-vector-reconciliation
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-runtime-readiness
uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-registry-v2
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
Where The Work Lives
Hand-edited code lives in src/aina_data_engine/jd_aware_role_context.py, src/aina_data_engine/production_embeddings.py, src/aina_data_engine/chunk_vector_reconciliation.py, tests/test_jd_aware_role_context.py, and tests/test_production_embeddings.py.
Core regenerated receipts live under artifacts/validation/, especially the JD-aware evidence, source-family corpus, eligibility, semantic QA, live Gemini run, AIN-510 gate, chunk/vector reconciliation, source-authority registry, and full validation receipts.
Bulk artifacts remain intentionally ignored by git. The vector/corpus Parquet and DuckDB files are current on this VDS and should be preserved with checkpoint checksums/bundles rather than force-added.
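One way to produce checkpoint checksums for those ignored files is a small SHA-256 manifest. This is a sketch under assumptions: the function names, the suffix filter, and the manifest shape are illustrative and not taken from the repo.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large Parquet or DuckDB files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def checksum_manifest(
    root: Path,
    suffixes: tuple[str, ...] = (".parquet", ".duckdb"),
) -> dict[str, str]:
    """Map each bulk artifact under root (relative path) to its SHA-256 digest."""
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file() and path.suffix in suffixes
    }
```

The resulting dictionary can be written out as JSON next to the checkpoint bundle, so a later session can confirm the vector and corpus files are byte-identical without adding them to git.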
Still Local And Controlled
This checkpoint did not mutate donor repos, did not promote public runtime, did not enable production telemetry, did not use real learner data, and did not make runtime embedding authority public. The AIN-510 gate reports that local exact-cosine retrieval is promotion-ready with rollback proof, but production runtime embedding authority remains false.
Where To Continue
The next autonomous slice should use this now-clean JD-aware family to harden the AI Fluency loop and retrieval contracts: inspect the four remaining AIN-510 source-authority caveat cases, decide which can be resolved through current JD-aware context, and then expand clean-before-embed to the next source family only after the same eligibility and semantic QA ladder passes.
Resume in /srv/aina/aina-data-engine-room on branch ali/ain-506-p0-gate-2026-06-12. Start with git status --short --branch, git log -5 --oneline, and this handoff: docs/handoff/2026-06-13-jd-aware-derived-evidence-embedding-handoff.md. Trust current repo receipts over older handoffs. Verify AIN-506, AIN-510, production-chunk-vector-reconciliation, source-authority-registry-v2, and validate before new embedding scale-up. Continue the production semantic spine goal by resolving remaining AIN-510 caveat cases and expanding clean-before-embed source-family by source-family. Do not mutate donor repos or promote public runtime without explicit Ali approval.
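The pre-scale-up verification order above could be scripted roughly as follows. The subcommand names and root path come from this handoff; the helper names, the injectable runner, and the assumption that each gate exits non-zero on failure are illustrative and should be checked against the actual CLI.

```python
import subprocess

ROOT = "/srv/aina/aina-data-engine-room"

# Gates to verify before any new embedding scale-up, in the order listed above.
GATES = (
    "ain-506-p0-gate",
    "ain-510-retrieval-promotion-gate",
    "production-chunk-vector-reconciliation",
    "source-authority-registry-v2",
    "validate",
)


def gate_command(gate: str) -> list[str]:
    """Build the uv invocation for one gate subcommand."""
    return ["uv", "run", "aina-data-engine", "--root", ROOT, gate]


def run_gates(gates=GATES, runner=subprocess.run) -> list[str]:
    """Run gates in order and stop at the first failure.

    Assumption: a failing gate is signalled by a non-zero exit code.
    """
    passed = []
    for gate in gates:
        result = runner(gate_command(gate))
        if result.returncode != 0:
            raise RuntimeError(f"gate failed: {gate}")
        passed.append(gate)
    return passed
```

Calling run_gates() from the repo root executes the real commands; passing a different runner allows a dry check of the ordering without touching the engine.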