AINA Data Engine Room · M2 Clean Repair Embed · 2026-06-15

Workflow Seed Embedding And BLS Cache Checkpoint

The workflow seed family now has live vectors from repaired semantic text, and the local BLS cache gap is closed.

The Single Idea

This checkpoint advanced M2 from a small workflow embedding proof to a larger repair-first workflow seed proof: workflow_seed now has 500 live Gemini Embedding 2 vectors from the repaired corpus, not from label-like raw seed text. The run also closed the local BLS cache gap because the VDS now has parseable BLS OEWS files.

01 · What changed

Repair-first QA is now real

The clean-before-embed ladder now supports semantic QA over repaired corpora. The production-embedding-semantic-qa command gains a new --include-repaired flag, and repaired QA receipts use __input=repaired so they do not overwrite the base-corpus failure evidence.

1. Base seed text failed on label-like rows.
2. Derived repaired chunks passed QA.
3. 500 candidates dry-ran cleanly.
4. 500 live vectors landed with 0 failures.
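
The receipt separation described above can be sketched as a naming rule. This is only an illustration: the function name, the semantic_qa__ prefix, and the .json extension are hypothetical, and only the __input=repaired marker comes from this checkpoint.

```python
def receipt_name(source_family: str, input_kind: str = "base") -> str:
    """Build a QA receipt file name (hypothetical layout, not the real one)."""
    stem = f"semantic_qa__{source_family}"
    # Only non-base inputs get a suffix, so the base receipt keeps its name
    # and a repaired run can never overwrite the base failure evidence.
    if input_kind != "base":
        stem += f"__input={input_kind}"
    return stem + ".json"


base = receipt_name("workflow_seed")
repaired = receipt_name("workflow_seed", "repaired")
print(base)
print(repaired)
```

Because the two names differ, the failed base-corpus receipt and the passing repaired receipt can sit side by side as evidence.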

Chunk/vector reconciliation now dedupes repaired parquet scopes by (chunk_id, text_hash) for authority counts while still reporting duplicate repaired overlay rows. The source-snapshots CLI also reports the correct public_source_snapshot_v1 schema.
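
A minimal sketch of that dedupe rule, assuming the repaired rows are available as dictionaries with chunk_id and text_hash keys. The function name and the output keys are hypothetical; the real reconciliation reads parquet scopes and may differ.

```python
from collections import Counter


def reconcile_repaired(rows):
    """Count unique (chunk_id, text_hash) pairs and surplus duplicate rows."""
    counts = Counter((r["chunk_id"], r["text_hash"]) for r in rows)
    return {
        # Authority count: each distinct pair is counted once.
        "authority_count": len(counts),
        # Duplicates are still reported rather than silently dropped.
        "duplicate_overlay_rows": sum(n - 1 for n in counts.values()),
    }


rows = [
    {"chunk_id": "c1", "text_hash": "h1"},
    {"chunk_id": "c1", "text_hash": "h1"},  # duplicate overlay row
    {"chunk_id": "c2", "text_hash": "h2"},
    {"chunk_id": "c1", "text_hash": "h9"},  # same chunk, different text
]
result = reconcile_repaired(rows)
print(result)
```

Keying on the pair rather than on chunk_id alone means a chunk whose text changed is counted as a separate row and not collapsed into the earlier one.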

02 · Live embedding

Workflow seed vector coverage moved forward

Item                             Result
Source family                    workflow_seed
Repaired chunks materialized     500
Repaired semantic QA sample      50 / 50 pass
Live embedded new vectors        500
Failed Gemini rows               0
Total Gemini vectors             7,566
Workflow vectors                 599
Known-pair cosine gap            0.190463
Stale vectors                    0

Live Gemini was invoked only through the configured Vertex ADC path for gemini-embedding-2 at 768 dimensions. Runtime embedding authority is still not promoted.
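
The checkpoint does not define how the known-pair cosine gap is computed. One plausible reading, shown here purely as an assumption, is the mean cosine similarity of known related pairs minus the mean for unrelated control pairs. The function names and toy vectors below are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def known_pair_gap(related_pairs, control_pairs):
    """Assumed definition: mean related similarity minus mean control similarity."""
    related = sum(cosine(a, b) for a, b in related_pairs) / len(related_pairs)
    control = sum(cosine(a, b) for a, b in control_pairs) / len(control_pairs)
    return related - control


# Toy 2-dimensional vectors; the real vectors have 768 dimensions.
gap = known_pair_gap(
    related_pairs=[([1.0, 0.0], [1.0, 0.0])],
    control_pairs=[([1.0, 0.0], [0.0, 1.0])],
)
print(gap)
```

Under this reading, a larger positive gap means related texts sit measurably closer together than unrelated ones.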

03 · Source cache

BLS is now parsed from local cache

Public source layer              Rows
O*NET occupations                1,016
O*NET tasks                      18,796
BLS OEWS occupations             1,103
BLS OEWS wage/employment rows    1,103
Canonical occupation rows        1,016

This updates the source-authority and beta-admission receipts away from the old BLS access-gap numbers. External/public runtime remains blocked.

04 · Verification

The gate stack is green

Focused pytest, ruff, docs frontmatter, artifact exposure, runtime readiness, AIN-506, AIN-510, source authority, beta admission, deployment readiness, reconciliation, and full validation all passed. Final validation status is pass.

05 · What remains

Continue the progressive ladder

The next clean M2 step is to continue workflow_seed: run another semantic spot check, then consider a 5,000-vector run only if quality stays clean. Other source families should not be batch-embedded until their repaired corpus, 50-row semantic QA, dry run, and live 500 proof are all green.
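
The gating rule above can be sketched as an ordered check. The step names, the status dictionary, and the return values are hypothetical stand-ins for whatever receipts the engine actually reads.

```python
# Ladder steps in the order this checkpoint lists them (names are illustrative).
LADDER = ("repaired_corpus", "semantic_qa_50", "dry_run", "live_500")


def next_step(status: dict) -> str:
    """Return the first ladder step that is not green, or the batch marker."""
    for step in LADDER:
        if status.get(step) != "pass":
            return step
    return "eligible_for_larger_batch"


workflow_seed = {step: "pass" for step in LADDER}
other_family = {"repaired_corpus": "pass", "semantic_qa_50": "fail"}

print(next_step(workflow_seed))
print(next_step(other_family))
```

Checking the steps in order keeps a family from reaching a batch run while any earlier step is still failing or missing.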

06 · Resume commands

Start here next

cd /srv/aina/aina-data-engine-room
git status --short --branch
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family workflow_seed --include-repaired --limit 50
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family workflow_seed --include-repaired --dry-run --max-new 5000 --selection-mode progressive
Where to start: keep the clean-before-embed ladder intact; the repaired corpus is the authority for workflow seed embedding, not the raw seed label rows.