AINA Data Engine Room · Personalization Engine · 2026-06-15

Workflow Seed Family Complete Embedding Checkpoint

The current repaired workflow seed family is fully embedded, with no candidates remaining.

Ali Mehdi Mukadam · co-authored with Codex · 2026-06-15

The Single Idea

The workflow_seed source family is now fully embedded for the current repaired corpus. The 25k request showed that the family currently holds only 6,866 clean repaired chunks, of which 5,000 were already embedded. Codex therefore embedded the remaining 1,866 candidates, and the final dry run confirmed zero remaining candidates.

01 · Change

What changed

The 5k checkpoint left 1,866 clean workflow seed chunks unembedded. This checkpoint completed that remainder through the same gated path: semantic QA, dry run, AIN-506, live Vertex ADC call, completion dry run, retrieval gate, reconciliation, source registry, exposure scan, and full validation.

Requested: Materialize 25k repaired workflow seed chunks.
Actual clean cap: The source family currently has 6,866 repaired chunks, so the correct action was family completion rather than synthetic scale.

02 · Live Run

Live embedding result

Item                                              Result
Source family                                     workflow_seed
Clean repaired corpus cap                         6,866
Existing workflow_seed vectors before final run   5,000
Final dry-run candidates before live              1,866
Live embedded new vectors                         1,866
Failed Gemini rows                                0
Remaining workflow_seed candidates after live     0
Total Gemini vectors                              13,932
Workflow seed vectors                             6,866
Workflow vectors overall                          6,965
Known-pair cosine gap                             0.190463
Stale vectors                                     0
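
The counts above follow from simple arithmetic: the live batch is the smaller of the requested size and the number of clean chunks that do not yet have a vector. The following is a minimal Python sketch of that relationship. The function names `remaining_candidates` and `planned_batch` are hypothetical illustrations and are not part of the aina-data-engine CLI.

```python
def remaining_candidates(clean_cap: int, already_embedded: int) -> int:
    """Clean chunks in the family that still lack a vector."""
    if already_embedded > clean_cap:
        raise ValueError("more vectors than clean chunks; reconcile first")
    return clean_cap - already_embedded


def planned_batch(requested: int, clean_cap: int, already_embedded: int) -> int:
    """A request larger than the clean remainder is capped, never padded."""
    return min(requested, remaining_candidates(clean_cap, already_embedded))


# Figures reported in this checkpoint: 25k requested, 6,866 clean, 5,000 embedded.
batch = planned_batch(requested=25_000, clean_cap=6_866, already_embedded=5_000)
print(batch)                                        # 1866
print(remaining_candidates(6_866, 5_000 + batch))   # 0
```

Capping the batch at the clean remainder is what turns a 25k request into family completion rather than synthetic scale.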

03 · Gates

Validation state

Gate or receipt                           Result
Semantic QA                               50 / 50 pass
Completion dry run                        0 remaining candidates
ain-506-p0-gate                           pass
ain-510-retrieval-promotion-gate          promotion_ready
production-chunk-vector-reconciliation    pass
source-authority-registry-v2              pass
artifact-exposure-scan                    0 active findings
validate                                  pass

Counter                          Value
Combined chunk authority         329,385
Vector rows                      13,932
Matched vectors                  13,932
Unvectorized chunks              315,453
BLS OEWS occupation rows         1,103
BLS OEWS wage/employment rows    1,103
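
The reconciliation counters are internally consistent: every vector row matches a chunk, and the unvectorized count equals combined chunks minus matched vectors. The sketch below shows one way to check that by hand. The `Counters` class and both functions are hypothetical names for illustration, not the logic of the production-chunk-vector-reconciliation command.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counters:
    combined_chunk_authority: int
    vector_rows: int
    matched_vectors: int
    unvectorized_chunks: int


def reconciliation_problems(c: Counters) -> list:
    """Return a list of human-readable inconsistencies (empty means consistent)."""
    problems = []
    if c.matched_vectors != c.vector_rows:
        problems.append("some vector rows do not match a chunk")
    if c.combined_chunk_authority - c.matched_vectors != c.unvectorized_chunks:
        problems.append("unvectorized count is not chunks minus matched vectors")
    return problems


def coverage(c: Counters) -> float:
    """Share of all chunks that currently have a matched vector."""
    return c.matched_vectors / c.combined_chunk_authority


# Counter values reported in this checkpoint.
reported = Counters(329_385, 13_932, 13_932, 315_453)
print(reconciliation_problems(reported))   # []
print(f"{coverage(reported):.2%}")         # 4.23%
```

Roughly 4% of the combined chunk authority is vectorized, which is consistent with treating this as a source-family proof rather than full corpus coverage.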

04 · Boundary

Product boundary

This is a strong source-family proof, not a public release. Runtime embedding authority, public runtime, real-user data, external writes, production telemetry, donor repo mutation, raw market dump embedding, and private learner-linked payload embedding all remain blocked.

Exact cosine remains the retrieval source of truth. Deterministic SOC/O*NET and service-tier fallbacks remain authoritative when vector confidence is weak.
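
The retrieval rule above can be sketched as a brute-force exact cosine scan that defers to a deterministic lookup when the best score is weak. This is an illustrative sketch only: the `min_score` threshold of 0.5, the `retrieve` function, and the `fallback` callable are assumptions, and the report does not state how weak confidence is actually defined.

```python
import math
from typing import Callable, Mapping, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Exact cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(
    query: Vector,
    index: Mapping[str, Vector],
    fallback: Callable[[], str],
    min_score: float = 0.5,  # hypothetical threshold, not from the report
) -> Tuple[str, str]:
    """Return (result, route) where route is 'vector' or 'fallback'."""
    best_id, best_score = None, float("-inf")
    for chunk_id, vector in index.items():  # full scan, no approximate index
        score = cosine(query, vector)
        if score > best_score:
            best_id, best_score = chunk_id, score
    if best_id is None or best_score < min_score:
        return fallback(), "fallback"  # deterministic path stays authoritative
    return best_id, "vector"


index = {"chunk-a": [1.0, 0.0], "chunk-b": [0.0, 1.0]}
print(retrieve([0.9, 0.1], index, lambda: "soc-lookup"))    # ('chunk-a', 'vector')
print(retrieve([-1.0, -1.0], index, lambda: "soc-lookup"))  # ('soc-lookup', 'fallback')
```

A full scan is affordable at 13,932 vectors and avoids the recall loss an approximate index could introduce.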

05 · Verification

Commands run

uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repaired-corpus --source-family workflow_seed --limit 25000 --shard-size 2500
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family workflow_seed --include-repaired --limit 50
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family workflow_seed --include-repaired --dry-run --max-new 25000 --selection-mode progressive
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family workflow_seed --include-repaired --max-new 25000 --selection-mode progressive --allow-live-gemini --confirm-paid-api --workers 8 --timeout-seconds 60 --write-every 250
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family workflow_seed --include-repaired --dry-run --max-new 25000 --selection-mode progressive
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-chunk-vector-reconciliation
uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-registry-v2
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-runtime-readiness
uv run aina-data-engine --root /srv/aina/aina-data-engine-room artifact-exposure-scan
uv run aina-data-engine --root /srv/aina/aina-data-engine-room docs-frontmatter-check
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
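
The commands above form a gated sequence: each step should run only if the previous one succeeded. The sketch below shows one way to script that ordering in Python. It assumes each subcommand signals failure with a non-zero exit code, which this report does not confirm. `run_gated` and `run_command` are hypothetical helpers, and the demonstration uses a stand-in runner so nothing is actually executed.

```python
import subprocess
from typing import Callable, List, Optional, Sequence, Tuple

Command = Sequence[str]


def run_command(argv: Command) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def run_gated(
    steps: Sequence[Command],
    run: Callable[[Command], int] = run_command,
) -> Tuple[List[Command], Optional[Command]]:
    """Run steps in order; stop at the first non-zero exit code."""
    passed: List[Command] = []
    for argv in steps:
        if run(argv) != 0:
            return passed, argv  # later steps are never started
        passed.append(argv)
    return passed, None


BASE = ["uv", "run", "aina-data-engine", "--root", "/srv/aina/aina-data-engine-room"]
steps = [BASE + ["ain-506-p0-gate"], BASE + ["artifact-exposure-scan"], BASE + ["validate"]]

# Stand-in runner for illustration: pretend the exposure scan fails.
fake = lambda argv: 1 if argv[-1] == "artifact-exposure-scan" else 0
passed, failed = run_gated(steps, run=fake)
print(len(passed), failed[-1])  # 1 artifact-exposure-scan
```

Passing the runner as a parameter keeps the ordering logic testable without touching the paid live embedding path.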

06 · Next

What remains

The next M2 family should be chosen by source-authority value and cleanliness, not by size alone. Good candidates are jobs_research_role, jobs_research_responsibility, workflow_intelligence, or O*NET evidence families, but each needs its own eligibility, repaired corpus, 50-row semantic QA, dry run, live 500, scale proof, AIN-510, exposure scan, and handoff.

cd /srv/aina/aina-data-engine-room
git status --short --branch
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-eligibility --source-family jobs_research_role
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repaired-corpus --source-family jobs_research_role --limit 500 --shard-size 250
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family jobs_research_role --include-repaired --limit 50
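
The per-family requirements listed above can be tracked as an ordered checklist, so that no family skips ahead to a live run. This is a minimal sketch: the step labels are copied from the list in this section, while `REQUIRED_STEPS` and `next_step` are hypothetical names and the progress shown is an invented example, not the real state of any family.

```python
from typing import Iterable, Optional

# Order taken from the requirements listed in this section.
REQUIRED_STEPS = (
    "eligibility", "repaired corpus", "50-row semantic QA", "dry run",
    "live 500", "scale proof", "AIN-510", "exposure scan", "handoff",
)


def next_step(completed: Iterable[str]) -> Optional[str]:
    """Return the earliest required step not yet completed, or None if all are done."""
    done = set(completed)
    unknown = done - set(REQUIRED_STEPS)
    if unknown:
        raise ValueError(f"unknown steps: {sorted(unknown)}")
    for step in REQUIRED_STEPS:
        if step not in done:
            return step
    return None


# Hypothetical progress for a family that has not started yet.
print(next_step([]))                                   # eligibility
print(next_step(["eligibility", "repaired corpus"]))   # 50-row semantic QA
```

Because the earliest missing step is always returned, a family that somehow recorded a later step without an earlier one is still sent back to the gap.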

Where to start

Pick the next source family by cleanliness and product value. The workflow_seed family is complete; do not use that as permission to batch noisy families.