Workflow Seed Embedding And BLS Cache Checkpoint
The workflow seed family now has live vectors from repaired semantic text, and the local BLS cache gap is closed.
This checkpoint advanced M2 from a small workflow embedding proof to a larger repair-first workflow seed proof. The workflow_seed family now has 500 live Gemini Embedding 2 vectors embedded from the repaired corpus rather than from label-like raw seed text. The run also closed the local BLS cache gap: the VDS now holds parseable BLS OEWS files.
Repair-first QA is now real
The clean-before-embed ladder now supports semantic QA over repaired corpora. The production-embedding-semantic-qa command has a new --include-repaired flag, and repaired QA receipts carry __input=repaired in their names so they do not overwrite the base-corpus failure evidence.
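The receipt separation described above can be pictured with a minimal sketch. The helper name qa_receipt_path, the semantic_qa__ prefix, and the .json extension are all hypothetical; only the __input=repaired marker comes from this checkpoint, and the real receipt layout may differ.

```python
from pathlib import Path


def qa_receipt_path(receipt_dir: Path, source_family: str, include_repaired: bool) -> Path:
    """Build a receipt path (hypothetical layout, not the engine's actual one).

    Repaired-corpus runs get an '__input=repaired' marker in the file name,
    so they land in a different file than the base-corpus receipt.
    """
    stem = f"semantic_qa__{source_family}"  # assumed naming prefix
    if include_repaired:
        stem += "__input=repaired"  # marker taken from the checkpoint text
    return receipt_dir / f"{stem}.json"


base = qa_receipt_path(Path("receipts"), "workflow_seed", include_repaired=False)
repaired = qa_receipt_path(Path("receipts"), "workflow_seed", include_repaired=True)
print(base.name)
print(repaired.name)
```

Because the two names differ, writing the repaired receipt leaves the base-corpus failure evidence untouched on disk.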
Chunk/vector reconciliation now dedupes repaired parquet scopes by (chunk_id, text_hash) for authority counts while still reporting duplicate repaired overlay rows. The source-snapshots CLI also reports the correct public_source_snapshot_v1 schema.
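As a rough illustration of that reconciliation rule, the sketch below counts distinct (chunk_id, text_hash) keys for authority purposes and separately reports how many overlay rows were duplicates. The function name and the dict-based row shape are assumptions for illustration; the actual reconciliation reads repaired parquet scopes.

```python
from collections import Counter


def reconcile_repaired_rows(rows):
    """Return (authority_count, duplicate_rows) for repaired overlay rows.

    rows: iterable of dicts with 'chunk_id' and 'text_hash' keys
    (an assumed in-memory stand-in for the repaired parquet rows).
    """
    key_counts = Counter((r["chunk_id"], r["text_hash"]) for r in rows)
    authority_count = len(key_counts)  # each distinct key counts once
    duplicate_rows = sum(n - 1 for n in key_counts.values())  # extra copies
    return authority_count, duplicate_rows


rows = [
    {"chunk_id": "c1", "text_hash": "h1"},
    {"chunk_id": "c1", "text_hash": "h1"},  # duplicate overlay row
    {"chunk_id": "c1", "text_hash": "h2"},  # same chunk, different repaired text
    {"chunk_id": "c2", "text_hash": "h3"},
]
print(reconcile_repaired_rows(rows))  # (3, 1)
```

Keeping the duplicate count as a separate number means the authority total stays stable while the overlay duplication remains visible in the report.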
Workflow seed vector coverage moved forward
| Item | Result |
|---|---|
| Source family | workflow_seed |
| Repaired chunks materialized | 500 |
| Repaired semantic QA sample | 50 / 50 pass |
| Live embedded new vectors | 500 |
| Failed Gemini rows | 0 |
| Total Gemini vectors | 7,566 |
| Workflow vectors | 599 |
| Known-pair cosine gap | 0.190463 |
| Stale vectors | 0 |
Live Gemini was invoked only through the configured Vertex ADC path for gemini-embedding-2 at 768 dimensions. Runtime embedding authority is still not promoted.
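This checkpoint does not define how the known-pair cosine gap in the table is computed. One plausible reading, shown here purely as an assumption, is the mean cosine similarity of known-related pairs minus the mean for unrelated pairs. The function names and toy vectors below are illustrative only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def known_pair_cosine_gap(related_pairs, unrelated_pairs):
    """Assumed definition: mean related similarity minus mean unrelated similarity."""
    mean_related = sum(cosine(a, b) for a, b in related_pairs) / len(related_pairs)
    mean_unrelated = sum(cosine(a, b) for a, b in unrelated_pairs) / len(unrelated_pairs)
    return mean_related - mean_unrelated


# Toy 2-dimensional vectors; real vectors in this run have 768 dimensions.
related = [([1.0, 0.0], [1.0, 0.0])]
unrelated = [([1.0, 0.0], [0.0, 1.0])]
gap = known_pair_cosine_gap(related, unrelated)
print(gap)
```

Under this reading, a larger positive gap means related texts sit closer together than unrelated ones, which is the property a repaired corpus is expected to improve.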
BLS is now parsed from local cache
| Public source layer | Rows |
|---|---|
| O*NET occupations | 1,016 |
| O*NET tasks | 18,796 |
| BLS OEWS occupations | 1,103 |
| BLS OEWS wage/employment rows | 1,103 |
| Canonical occupation rows | 1,016 |
The source-authority and beta-admission receipts now reflect these parsed counts instead of the old BLS access-gap numbers. External/public runtime remains blocked.
The gate stack is green
Focused pytest, ruff, docs frontmatter, artifact exposure, runtime readiness, AIN-506, AIN-510, source authority, beta admission, deployment readiness, reconciliation, and full validation all passed. Final validation status is pass.
Continue the progressive ladder
The next clean M2 step is to continue with workflow_seed: run another semantic spot check, then consider a 5,000-vector run only if quality stays clean. Other source families should not be batch-embedded until their repaired corpus, 50-row semantic QA, dry run, and live 500-vector proof are all green.
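The ladder order for a source family can be sketched as a simple gate check. The step names and the next_ladder_step helper are hypothetical labels for the four gates named above, not identifiers from the engine.

```python
# Gates in the order given in this checkpoint (names are illustrative).
LADDER = ["repaired_corpus", "semantic_qa_50", "dry_run", "live_500"]


def next_ladder_step(gates: dict[str, bool]) -> str:
    """Return the first gate that is not yet green, in ladder order.

    A missing gate is treated as not green. If every gate is green,
    the family is eligible to be considered for batch embedding.
    """
    for step in LADDER:
        if not gates.get(step, False):
            return step
    return "batch_embedding_eligible"


print(next_ladder_step({"repaired_corpus": True, "semantic_qa_50": True}))
print(next_ladder_step({step: True for step in LADDER}))
```

Treating a missing gate as not green keeps the check conservative: a family with no recorded receipt for a step is never treated as ready.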
Start here next
cd /srv/aina/aina-data-engine-room
git status --short --branch
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family workflow_seed --include-repaired --limit 50
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family workflow_seed --include-repaired --dry-run --max-new 5000 --selection-mode progressive