AINA Data Engine Room · Handoff · 2026-06-13

JD-Aware Derived Evidence Embedding Handoff

A clean-before-embed checkpoint that repairs real JD context, refreshes Gemini vectors, and keeps exact-cosine retrieval green.

Ali Mehdi Mukadam · co-authored with Codex · VDS local-only

The Single Idea

This checkpoint fixes the title-only trap for the JD-aware embedding lane. The engine now extracts useful responsibility evidence from messy but source-linked job descriptions, embeds derived role-context text instead of raw JD fields, and keeps exact-cosine retrieval green with current Gemini vectors.

01 · What Changed

Cleaner JD Evidence

The JD-aware responsibility extractor now handles common marketplace formatting problems: semicolon-separated task lists, glued action boundaries such as "programsBuilds", spaced action boundaries such as "podium Engaging", and colon-introduced duty lists such as "by: Greeting guests". It also filters out the law-enforcement description preamble so that snippets favor actual records and clerical tasks.
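The boundary repairs described above can be sketched roughly as follows. This is a minimal illustration, not the code in jd_aware_role_context.py: the function name split_responsibilities, the ACTION_VERBS list, and the set of colon introducers are all assumptions made for the example, and the law-enforcement preamble filter is not covered.

```python
import re

# Hypothetical verb list; the real extractor's vocabulary is not shown in this handoff.
ACTION_VERBS = ("Builds", "Engaging", "Greeting", "Maintains", "Files")
_VERBS = "|".join(ACTION_VERBS)

# Glued boundary: a lowercase letter immediately followed by a capitalized action verb.
_GLUED = re.compile(rf"(?<=[a-z])(?=(?:{_VERBS})\b)")
# Spaced boundary: a lowercase letter, one space, then a capitalized action verb.
_SPACED = re.compile(rf"(?<=[a-z]) (?=(?:{_VERBS})\b)")
# Colon-introduced duty list: drop everything up to and including "by:" and similar.
_COLON_INTRO = re.compile(r"^.*?\b(?:by|including|such as):\s*", re.IGNORECASE)


def split_responsibilities(text: str) -> list[str]:
    """Split messy JD text into candidate responsibility snippets."""
    # Turn both kinds of action boundary into explicit semicolon separators.
    text = _GLUED.sub("; ", text)
    text = _SPACED.sub("; ", text)
    snippets = []
    for part in text.split(";"):
        # Keep only the duty that follows a colon introducer, if there is one.
        part = _COLON_INTRO.sub("", part.strip()).strip(" .")
        if part:
            snippets.append(part)
    return snippets
```

For example, split_responsibilities("Runs youth programsBuilds partnerships") yields two snippets, "Runs youth programs" and "Builds partnerships". A fixed verb list keeps the sketch short; it would miss any action verb not listed.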

The production embedding corpus now includes derived JD summaries and derived responsibility evidence in the embedding text. It still excludes raw JD field names like jd_summary and responsibility_snippets, and semantic QA checks that raw JD text is not embedded.
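One way to picture the derived-text rule and the raw JD key scan is the sketch below. It rests on assumptions: the function names, the "Role:", "Context:", and "Responsibility:" labels, and the argument shapes are hypothetical. Only the two raw field names come from this handoff.

```python
# Raw JD field names named in this handoff; the full list is not given here.
RAW_JD_KEYS = ("jd_summary", "responsibility_snippets")


def build_embedding_text(role_title, derived_summary, derived_responsibilities):
    """Compose embedding text from derived values only (hypothetical layout)."""
    lines = [f"Role: {role_title}"]
    if derived_summary:
        lines.append(f"Context: {derived_summary}")
    for item in derived_responsibilities:
        lines.append(f"Responsibility: {item}")
    return "\n".join(lines)


def find_raw_key_leaks(embedding_text):
    """Return any raw JD field names that appear in the embedding text."""
    lowered = embedding_text.lower()
    return [key for key in RAW_JD_KEYS if key in lowered]


text = build_embedding_text(
    "Records Clerk",
    "Maintains case records for a county office",
    ["Files reports", "Answers records requests"],
)
print(find_raw_key_leaks(text))  # an empty list means no raw field names leaked
```

A scan like this only catches leaked field names. Checking that raw JD prose itself is not embedded needs a separate comparison against the source text.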

Source JD: Messy postings with real responsibilities and marketplace formatting artifacts.
Derived Context: Short summaries and snippets, with company values kept reference-only.
Semantic QA: 50-row sample, raw JD key scan, label-gate scan, source refs present.
Gemini Vectors: 292 current JD-aware chunks embedded at 768 dimensions.

02 · Key Proof

Numbers That Matter

Surface                                   Result
JD-aware role-context rows                1,004
Rows with job context                     970
Rows with responsibility snippets         970
JD-aware repair-first rows                0
JD-aware clean corpus chunks              292
JD-aware semantic QA                      50 pass / 0 fail
Live Gemini embedded rows                 292
Live Gemini failed rows                   0
Total current vectors                     6,510
Stale vectors                             0
Full base corpus chunks                   294,675
Combined chunk authority                  322,519
AIN-510 known-pair cosine gap             0.190463

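For readers new to the terms, exact-cosine retrieval means scoring the query against every stored vector rather than using an approximate index. The sketch below illustrates that, plus one plausible reading of a "known-pair cosine gap": the expected match's score minus the best other score. This handoff does not define the gap, so that definition, the function names, and the toy vectors are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k=3):
    """Brute-force retrieval: score every vector, return the k best (key, score) pairs."""
    scored = sorted(((cosine(query, vec), key) for key, vec in vectors.items()), reverse=True)
    return [(key, score) for score, key in scored[:k]]


def known_pair_gap(query, vectors, expected_key):
    """Assumed definition: expected match score minus the best competing score."""
    scores = {key: cosine(query, vec) for key, vec in vectors.items()}
    best_other = max(score for key, score in scores.items() if key != expected_key)
    return scores[expected_key] - best_other


# Toy 2-dimensional vectors; the real vectors in this checkpoint have 768 dimensions.
toy = {"clerk": [1.0, 0.0], "chef": [0.0, 1.0], "admin": [1.0, 1.0]}
print(exact_top_k([1.0, 0.0], toy, k=2))
print(known_pair_gap([1.0, 0.0], toy, "clerk"))
```

Under this reading, a positive gap means the known pair outranks every distractor, and a larger gap means a wider margin.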
03 · Commands

Verification Trail

uv run pytest tests/test_jd_aware_role_context.py tests/test_production_embeddings.py::test_production_embedding_corpus_adds_clean_contextual_runtime_families tests/test_production_embeddings.py::test_production_embedding_semantic_qa_passes_clean_contextual_sample -q
uv run ruff check src/aina_data_engine/jd_aware_role_context.py src/aina_data_engine/production_embeddings.py tests/test_jd_aware_role_context.py tests/test_production_embeddings.py
uv run aina-data-engine --root /srv/aina/aina-data-engine-room jd-aware-role-context-evidence --fixture-limit 50
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-corpus --source-family jd_aware_role_context
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-eligibility --source-family jd_aware_role_context
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family jd_aware_role_context --limit 50
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family jd_aware_role_context --max-new 292 --selection-mode progressive --dry-run
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family jd_aware_role_context --max-new 292 --selection-mode progressive --allow-live-gemini --confirm-paid-api --timeout-seconds 300 --max-retries 2
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-corpus --full-corpus --market-limit 0
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-chunk-vector-reconciliation
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-runtime-readiness
uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-registry-v2
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate

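When resuming, the closing gate commands above can be replayed in order with a small wrapper like this sketch. The subcommand names and the root path are copied from the trail. The wrapper itself, the stop-at-first-failure behavior, and the assumption that the CLI exits non-zero on a failed gate are illustrative and not confirmed by this handoff.

```python
import shlex
import subprocess

ROOT = "/srv/aina/aina-data-engine-room"

# Gate subcommands in the order they appear in the verification trail.
GATE_SUBCOMMANDS = [
    "ain-510-retrieval-promotion-gate",
    "production-chunk-vector-reconciliation",
    "ain-506-p0-gate",
    "production-runtime-readiness",
    "source-authority-registry-v2",
    "validate",
]


def gate_command(subcommand):
    """Build the argv list for one gate command."""
    return ["uv", "run", "aina-data-engine", "--root", ROOT, subcommand]


def run_gates(subcommands, runner=subprocess.run):
    """Run gates in order; return the first failing subcommand, or None if all pass.

    Assumes a failed gate exits non-zero, which this handoff does not confirm.
    """
    for subcommand in subcommands:
        command = gate_command(subcommand)
        print("$", shlex.join(command))
        result = runner(command)
        if result.returncode != 0:
            return subcommand
    return None
```

Calling run_gates(GATE_SUBCOMMANDS) from the repo checkout runs the real commands. The runner parameter exists so the loop can be exercised with a stub instead of the live CLI.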
04 · Files

Where The Work Lives

Hand-edited code lives in src/aina_data_engine/jd_aware_role_context.py, src/aina_data_engine/production_embeddings.py, src/aina_data_engine/chunk_vector_reconciliation.py, tests/test_jd_aware_role_context.py, and tests/test_production_embeddings.py.

Core regenerated receipts live under artifacts/validation/, especially the JD-aware evidence, source-family corpus, eligibility, semantic QA, live Gemini run, AIN-510 gate, chunk/vector reconciliation, source-authority registry, and full validation receipts.

Bulk artifacts remain intentionally ignored by git. The vector/corpus Parquet and DuckDB files are current on this VDS and should be preserved with checkpoint checksums/bundles rather than force-added.
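A checksum manifest for the ignored bulk files could look like the sketch below. The function names, the manifest layout, and the suffix list are assumptions; the repo may already have its own checkpoint tooling, in which case that should be preferred.

```python
import hashlib
import json
from pathlib import Path

# Assumed suffixes for the bulk vector/corpus files mentioned above.
BULK_SUFFIXES = {".parquet", ".duckdb"}


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large artifacts are never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def build_manifest(root):
    """Map each bulk file's relative path to its SHA-256 digest and size in bytes."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in BULK_SUFFIXES:
            manifest[path.relative_to(root).as_posix()] = {
                "sha256": sha256_file(path),
                "bytes": path.stat().st_size,
            }
    return manifest


def write_manifest(root, out_path):
    """Write the manifest as sorted, indented JSON and return it."""
    manifest = build_manifest(root)
    Path(out_path).write_text(json.dumps(manifest, indent=2, sort_keys=True) + "\n")
    return manifest
```

The manifest is small enough to commit even though the artifacts are not, so a later session can confirm that the Parquet and DuckDB files on the VDS are still the ones this checkpoint produced.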

05 · Boundaries

Still Local And Controlled

This checkpoint did not mutate donor repos, did not promote public runtime, did not enable production telemetry, did not use real learner data, and did not make runtime embedding authority public. AIN-510 says local exact-cosine retrieval is promotion-ready with rollback proof, but production runtime embedding authority remains false.

06 · Next Work

Where To Continue

The next autonomous slice should use this now-clean JD-aware family to harden the AI Fluency loop and retrieval contracts. First, inspect the four remaining AIN-510 source-authority caveat cases and decide which can be resolved through current JD-aware context. Then expand clean-before-embed to the next source family, but only after the same eligibility and semantic QA ladder passes.

Resume in /srv/aina/aina-data-engine-room on branch ali/ain-506-p0-gate-2026-06-12.
Start with git status --short --branch, git log -5 --oneline, and this handoff:
docs/handoff/2026-06-13-jd-aware-derived-evidence-embedding-handoff.md.
Trust current repo receipts over older handoffs. Verify AIN-506, AIN-510,
production-chunk-vector-reconciliation, source-authority-registry-v2, and
validate before new embedding scale-up. Continue the production semantic spine
goal by resolving remaining AIN-510 caveat cases and expanding clean-before-embed
source-family by source-family. Do not mutate donor repos or promote public
runtime without explicit Ali approval.

Where to start: use the refreshed JD-aware vectors as the clean contextual base, then resolve the remaining source-authority caveats before expanding the next embedding family.