M2 Workflow Embedding Checkpoint
A small clean-before-embed run completed the jobs-research workflow vector family.
M2 now has one more clean source-authoritative family embedded: the remaining jobs_research_workflow repaired chunks. This was a small progressive live run, not a batch run and not a broad title expansion.
What changed
The workflow family moved through the clean-before-embed ladder: repaired corpus confirmation, deterministic semantic QA, dry-run candidate selection, small live Vertex ADC embedding, AIN-510, vector reconciliation, source-authority registry refresh, exposure scan, runtime readiness, AIN-506, and full validation.
- Confirmed the repaired corpus exists for jobs_research_workflow.
- Ran deterministic semantic QA over 50 workflow rows.
- Dry-ran Gemini candidate selection with repaired chunks included.
- Embedded the 56 missing workflow vectors.
- Refreshed the receipts that make the vector snapshot authoritative locally.
Live embedding scope
| Item | Result |
|---|---|
| Source family | jobs_research_workflow |
| Existing vectors before run | 33 |
| jobs_research_workflow vector count after run | 89 |
| Workflow-related vector count after run | 99 |
| Stale vectors after AIN-510 | 0 |
| Known-pair cosine gap | 0.190463 |
The live call used Gemini Embedding 2 at 768 dimensions through Vertex ADC on project aina-495702. No Developer API key route was used.
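The counts in the table can be cross-checked with simple arithmetic: 33 existing vectors plus the 56 newly embedded ones should equal the 89 reported after the run. A minimal sketch of that check follows; the variable names are illustrative and are not part of the aina-data-engine CLI.

```shell
#!/usr/bin/env sh
# Counts copied from this checkpoint; variable names are hypothetical.
before=33      # existing jobs_research_workflow vectors before the run
embedded=56    # missing workflow vectors embedded by the live run
after=89       # jobs_research_workflow vector count reported after the run

expected=$((before + embedded))

if [ "$expected" -eq "$after" ]; then
  echo "vector count consistent: $before + $embedded = $after"
else
  echo "vector count mismatch: expected $expected, reported $after" >&2
fi
```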
Semantic QA proof
jobs_research_workflow semantic QA passed with 50 sampled rows, 50 passes, no failures, no raw JD key hits, and no legacy review gate hits. The adjacent jobs_research_tool plus workflow_tool_evidence family also passed QA, and its dry run found no new candidates because all 83 chunks already had accepted vectors.
Locked boundaries
The run did not change runtime or release posture. Public runtime, real-user data, external writes, production telemetry, runtime embedding authority, batch promotion, and donor repo mutation all remain off.
Verification
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family jobs_research_tool --source-family workflow_tool_evidence --limit 50
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family jobs_research_workflow --limit 50
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family jobs_research_workflow --include-repaired --max-new 100 --selection-mode progressive --workers 4 --timeout-seconds 60 --max-retries 4 --write-every 25 --allow-live-gemini --confirm-paid-api
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-chunk-vector-reconciliation
uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-registry-v2
uv run aina-data-engine --root /srv/aina/aina-data-engine-room docs-frontmatter-check
uv run aina-data-engine --root /srv/aina/aina-data-engine-room artifact-exposure-scan
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-runtime-readiness
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
All commands passed once vector reconciliation was rerun serially. The first reconciliation attempt ran in parallel with AIN-510 and read the old vector count, so the serialized rerun is the durable receipt.
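One way to avoid the stale read is to run the gates strictly one after another and stop at the first failure. The sketch below is a hedged illustration, not the project's own tooling: run_serial and ROOT are hypothetical names, and the commented usage reuses two commands from the verification list.

```shell
#!/usr/bin/env sh
# run_serial (hypothetical helper): run each argument as one command line,
# in order, one at a time. It stops at the first failure so a later step
# never reads state while an earlier step is still running or has failed.
run_serial() {
  for cmd in "$@"; do
    echo "==> $cmd"
    if ! sh -c "$cmd"; then
      echo "FAILED: $cmd" >&2
      return 1
    fi
  done
}

# Example usage (uncomment to run against the real workspace):
# ROOT=/srv/aina/aina-data-engine-room
# run_serial \
#   "uv run aina-data-engine --root $ROOT ain-510-retrieval-promotion-gate" \
#   "uv run aina-data-engine --root $ROOT production-chunk-vector-reconciliation"
```

Because each command finishes before the next starts, reconciliation only runs after AIN-510 has written its result.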
Next work
Continue M2 with another bounded source family. Recommended order: workflow_seed / workflow_intelligence, then jobs_research_responsibility, and only then serviceable_title after stronger source-authority repair proof.
cd /srv/aina/aina-data-engine-room
git status --short --branch
git log -3 --oneline
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-chunk-vector-reconciliation
Do not return to broad title embedding until the source-family gate proves it will not re-embed stale labels or posting artifacts.