Workflow Seed Family Complete Embedding Checkpoint
The current repaired workflow seed family is fully embedded; no remaining candidates.
The workflow_seed source family is now fully embedded for the current repaired corpus. The repaired-corpus request with a 25,000 limit returned only 6,866 clean repaired chunks, which establishes the family's current size. Codex therefore embedded the remaining 1,866 candidates and confirmed that the final dry run reports zero remaining candidates.
What changed
The 5k checkpoint left 1,866 clean workflow seed chunks unembedded. This checkpoint completed that remainder through the same gated path: semantic QA, dry run, AIN-506, live Vertex ADC call, completion dry run, retrieval gate, reconciliation, source registry, exposure scan, and full validation.
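The gated path is strictly ordered: a later step should not run once an earlier gate has failed. The following is a minimal Python sketch of that ordering, not the project's actual orchestration. The `run_gated` helper, the `GATES` list, and the injectable `runner` are hypothetical names introduced for illustration. Only the subcommand names and the `--root` path are taken from this checkpoint, and the sketch assumes the CLI signals failure with a nonzero exit code.

```python
import subprocess
from typing import Callable, Sequence

# Command prefix shared by every command in this checkpoint.
BASE = ["uv", "run", "aina-data-engine", "--root", "/srv/aina/aina-data-engine-room"]

# Illustrative subset of the gated path (the flag-free gates only).
GATES: list[list[str]] = [
    ["ain-506-p0-gate"],
    ["ain-510-retrieval-promotion-gate"],
    ["production-chunk-vector-reconciliation"],
    ["source-authority-registry-v2"],
    ["artifact-exposure-scan"],
    ["validate"],
]


def default_runner(argv: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def run_gated(
    gates: Sequence[Sequence[str]],
    runner: Callable[[Sequence[str]], int] = default_runner,
) -> list[str]:
    """Run gates in order and stop at the first nonzero exit code.

    Returns the names of the gates that passed. Assumption: a nonzero
    exit code means the gate failed.
    """
    passed: list[str] = []
    for args in gates:
        code = runner(BASE + list(args))
        if code != 0:
            raise RuntimeError(f"gate failed: {args[0]} (exit code {code})")
        passed.append(args[0])
    return passed
```

Passing a fake `runner` makes the ordering testable without touching the paid embedding API or the real engine room.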
Live embedding result
| Item | Result |
|---|---|
| Source family | workflow_seed |
| Clean repaired corpus cap | 6,866 |
| Existing workflow_seed vectors before final run | 5,000 |
| Final dry-run candidates before live | 1,866 |
| Live embedded new vectors | 1,866 |
| Failed Gemini rows | 0 |
| Remaining workflow_seed candidates after live | 0 |
| Total Gemini vectors | 13,932 |
| Workflow seed vectors | 6,866 |
| Workflow vectors overall | 6,965 |
| Known-pair cosine gap | 0.190463 |
| Stale vectors | 0 |
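The counts in this table can be cross-checked with simple arithmetic: the vectors that existed before the final run plus the newly embedded vectors should equal the clean repaired corpus cap, leaving zero candidates. The sketch below illustrates that check. The `EmbeddingCounts` class and its field names are hypothetical; only the numbers come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingCounts:
    """Hypothetical container for the workflow_seed counts in the table."""

    corpus_cap: int      # clean repaired corpus cap
    vectors_before: int  # existing vectors before the final run
    live_embedded: int   # new vectors from the live run
    failed_rows: int     # rows the embedding call failed on

    def vectors_after(self) -> int:
        """Vectors expected once the live run has finished."""
        return self.vectors_before + self.live_embedded

    def remaining(self) -> int:
        """Candidates still unembedded after the live run."""
        return self.corpus_cap - self.vectors_after()


counts = EmbeddingCounts(
    corpus_cap=6_866,
    vectors_before=5_000,
    live_embedded=1_866,
    failed_rows=0,
)

# 5,000 + 1,866 = 6,866, so nothing is left to embed.
assert counts.vectors_after() == 6_866
assert counts.remaining() == 0
assert counts.failed_rows == 0
```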
Validation state
| Gate or receipt | Result |
|---|---|
| Semantic QA | 50 / 50 pass |
| Completion dry run | 0 remaining candidates |
| ain-506-p0-gate | pass |
| ain-510-retrieval-promotion-gate | promotion_ready |
| production-chunk-vector-reconciliation | pass |
| source-authority-registry-v2 | pass |
| artifact-exposure-scan | 0 active findings |
| validate | pass |

| Counter | Value |
|---|---|
| Combined chunk authority | 329,385 |
| Vector rows | 13,932 |
| Matched vectors | 13,932 |
| Unvectorized chunks | 315,453 |
| BLS OEWS occupation rows | 1,103 |
| BLS OEWS wage/employment rows | 1,103 |
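The first four counters are internally consistent: every vector row is matched to a chunk, and the unvectorized count is the combined chunk authority minus the matched vectors. The sketch below shows one way to express that relationship. The `reconcile` function and its result keys are hypothetical, not the logic of `production-chunk-vector-reconciliation`.

```python
def reconcile(combined_chunks: int, vector_rows: int, matched_vectors: int) -> dict:
    """Derive reconciliation figures from the three raw counters.

    Assumption: an unmatched vector row has no chunk in the combined
    authority, and an unvectorized chunk has no matched vector.
    """
    if matched_vectors > vector_rows or matched_vectors > combined_chunks:
        raise ValueError("matched vectors cannot exceed vector rows or chunks")
    return {
        "unvectorized_chunks": combined_chunks - matched_vectors,
        "unmatched_vectors": vector_rows - matched_vectors,
        "coverage": matched_vectors / combined_chunks,
    }


result = reconcile(combined_chunks=329_385, vector_rows=13_932, matched_vectors=13_932)

# 329,385 - 13,932 = 315,453, which matches the table.
assert result["unvectorized_chunks"] == 315_453
assert result["unmatched_vectors"] == 0
```

Under these numbers the embedded share of the combined chunk authority is a little over 4%, which is consistent with treating this checkpoint as a single source-family proof and not as corpus-wide coverage.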
Product boundary
This is a strong source-family proof, not a public release. Runtime embedding authority, public runtime, real-user data, external writes, production telemetry, donor repo mutation, raw market dump embedding, and private learner-linked payload embedding all remain blocked.
Commands run
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repaired-corpus --source-family workflow_seed --limit 25000 --shard-size 2500
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family workflow_seed --include-repaired --limit 50
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family workflow_seed --include-repaired --dry-run --max-new 25000 --selection-mode progressive
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family workflow_seed --include-repaired --max-new 25000 --selection-mode progressive --allow-live-gemini --confirm-paid-api --workers 8 --timeout-seconds 60 --write-every 250
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family workflow_seed --include-repaired --dry-run --max-new 25000 --selection-mode progressive
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-chunk-vector-reconciliation
uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-registry-v2
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-runtime-readiness
uv run aina-data-engine --root /srv/aina/aina-data-engine-room artifact-exposure-scan
uv run aina-data-engine --root /srv/aina/aina-data-engine-room docs-frontmatter-check
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
What remains
The next M2 family should be chosen by source-authority value and cleanliness, not by size alone. Good candidates are jobs_research_role, jobs_research_responsibility, workflow_intelligence, or O*NET evidence families, but each needs its own eligibility, repaired corpus, 50-row semantic QA, dry run, live 500, scale proof, AIN-510, exposure scan, and handoff.
cd /srv/aina/aina-data-engine-room
git status --short --branch
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-eligibility --source-family jobs_research_role
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repaired-corpus --source-family jobs_research_role --limit 500 --shard-size 250
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family jobs_research_role --include-repaired --limit 50
Pick the next source family by cleanliness and product value. The workflow_seed family is complete; do not use that as permission to batch noisy families.