Workflow Intelligence Family Complete Embedding Checkpoint
The full repaired workflow intelligence source family is now embedded locally.
The workflow_intelligence source family is now fully embedded for the current repaired corpus: 3,152 vectors, zero failed Gemini rows, and zero remaining candidates. All retrieval and runtime gates remain local-only and green.
What changed
After the 500-vector proof, Codex expanded workflow_intelligence to all 3,152 repaired chunks, verified semantic QA, embedded the remaining 2,652 candidates through Gemini Embedding 2, and confirmed the completion dry run has zero remaining candidates.
Live embedding result
| Item | Result |
|---|---|
| Source family | workflow_intelligence |
| Clean repaired corpus size | 3,152 |
| Existing workflow_intelligence vectors before final run | 500 |
| Final dry-run candidates before live | 2,652 |
| Live embedded new vectors | 2,652 |
| Failed Gemini rows | 0 |
| Remaining workflow_intelligence candidates after live | 0 |
| Total Gemini vectors | 17,084 |
| Workflow intelligence vectors | 3,152 |
| Workflow seed vectors | 6,866 |
| Workflow vectors overall | 10,117 |
| Known-pair cosine gap | 0.190463 |
| Stale vectors | 0 |
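
The counts in this table can be cross-checked with simple arithmetic: the 500 vectors from the proof run plus the 2,652 newly embedded vectors should equal the repaired corpus size. A minimal sketch using only the figures reported above; `check_embedding_run` is an illustrative helper, not part of aina-data-engine:

```python
def check_embedding_run(existing: int, embedded: int, failed: int,
                        remaining: int, corpus_size: int) -> list[str]:
    """Return a list of inconsistencies; an empty list means the counts agree."""
    problems = []
    if existing + embedded != corpus_size:
        problems.append(
            f"existing + embedded = {existing + embedded}, expected {corpus_size}"
        )
    if failed != 0:
        problems.append(f"{failed} failed rows")
    if remaining != 0:
        problems.append(f"{remaining} candidates remain")
    return problems


# Figures taken from the live embedding result table.
problems = check_embedding_run(
    existing=500, embedded=2_652, failed=0, remaining=0, corpus_size=3_152
)
print(problems or "counts are consistent")
```
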
Validation state
| Gate or receipt | Result |
|---|---|
| Semantic QA | 50 / 50 pass |
| Completion dry run | 0 remaining candidates |
| ain-506-p0-gate | pass |
| ain-510-retrieval-promotion-gate | promotion_ready |
| production-chunk-vector-reconciliation | pass |
| source-authority-registry-v2 | pass |
| artifact-exposure-scan | 0 active findings |
| validate | pass |

| Counter | Value |
|---|---|
| Combined chunk authority | 333,037 |
| Vector rows | 17,084 |
| Matched vectors | 17,084 |
| Unvectorized chunks | 315,953 |
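
These counters are internally consistent: every vector row is matched, and the unvectorized count equals the chunk total minus the matched vectors. A small check, assuming those two relationships are what the reconciliation is meant to show (the source does not spell out the gate's rules); the variable names are illustrative:

```python
combined_chunks = 333_037
vector_rows = 17_084
matched_vectors = 17_084
unvectorized_chunks = 315_953

# Vectors that exist without a matching chunk (should be none).
orphan_vectors = vector_rows - matched_vectors
# Chunks that have no vector yet.
expected_unvectorized = combined_chunks - matched_vectors
# Share of the combined chunk authority that is embedded so far.
coverage = matched_vectors / combined_chunks

assert orphan_vectors == 0
assert expected_unvectorized == unvectorized_chunks
print(f"vector coverage: {coverage:.2%}")
```

On these figures, roughly 5% of the combined chunk authority is embedded so far.
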
Source-family status
workflow_seed and workflow_intelligence are now complete for their current repaired corpora. jobs_research_role remains blocked because its repaired rows failed semantic QA: they were flagged as title-only or store-number artifacts.
Commands run
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repaired-corpus --source-family workflow_intelligence --limit 5000 --shard-size 1000
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family workflow_intelligence --include-repaired --limit 50
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family workflow_intelligence --include-repaired --dry-run --max-new 5000 --selection-mode progressive
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family workflow_intelligence --include-repaired --max-new 5000 --selection-mode progressive --allow-live-gemini --confirm-paid-api --workers 8 --timeout-seconds 60 --write-every 250
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family workflow_intelligence --include-repaired --dry-run --max-new 5000 --selection-mode progressive
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-chunk-vector-reconciliation
uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-registry-v2
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-runtime-readiness
uv run aina-data-engine --root /srv/aina/aina-data-engine-room artifact-exposure-scan
uv run aina-data-engine --root /srv/aina/aina-data-engine-room docs-frontmatter-check
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
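
The post-run verification gates in this sequence could be scripted so that a failing step halts the run. A minimal Python sketch, assuming each subcommand exits non-zero on failure (the source does not confirm exit-code behavior); `build_command`, `run_sequence`, and `print_only` are hypothetical helpers, not part of aina-data-engine:

```python
import subprocess
from typing import Callable, Sequence

ROOT = "/srv/aina/aina-data-engine-room"


def build_command(subcommand: str, *args: str) -> list[str]:
    """Build the argv for one aina-data-engine invocation under uv."""
    return ["uv", "run", "aina-data-engine", "--root", ROOT, subcommand, *args]


def run_sequence(
    commands: Sequence[list[str]],
    runner: Callable[[list[str]], int] = lambda argv: subprocess.run(argv).returncode,
) -> int:
    """Run commands in order; return how many completed before the first failure."""
    completed = 0
    for argv in commands:
        if runner(argv) != 0:  # assumption: non-zero exit code means the gate failed
            break
        completed += 1
    return completed


# Post-run verification gates, in the order they were run.
POST_RUN_GATES = [
    build_command("ain-510-retrieval-promotion-gate"),
    build_command("production-chunk-vector-reconciliation"),
    build_command("source-authority-registry-v2"),
    build_command("production-runtime-readiness"),
    build_command("artifact-exposure-scan"),
    build_command("docs-frontmatter-check"),
    build_command("validate"),
]


def print_only(argv: list[str]) -> int:
    """Stub runner: print the command instead of executing it."""
    print(" ".join(argv))
    return 0


completed = run_sequence(POST_RUN_GATES, runner=print_only)
print(f"{completed} of {len(POST_RUN_GATES)} gates completed")
```

The stub runner keeps the example side-effect free; omitting the `runner` argument would execute each command through `subprocess.run`.
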
What remains
Pick the next family by semantic value and repair readiness. onet_occupation_evidence is a good next candidate because it is smaller and grounded in a public taxonomy. onet_task_evidence is larger and valuable, but needs careful generic-title repair. jobs_research_responsibility is high value but currently risky because early samples share the store-manager artifact pattern.
cd /srv/aina/aina-data-engine-room
git status --short --branch
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-eligibility --source-family onet_occupation_evidence
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repaired-corpus --source-family onet_occupation_evidence --limit 500 --shard-size 250
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family onet_occupation_evidence --include-repaired --limit 50
Start with O*NET occupation evidence. It is public, bounded, and likely to improve role grounding without repeating marketplace-noise mistakes.