AINA Data Engine Room · Personalization Engine · 2026-06-15

Workflow Seed 5k Embedding Checkpoint

A repair-first Gemini embedding ladder step for source-authoritative workflow seeds.

Ali Mehdi Mukadam · co-authored with Codex · 2026-06-15

The Single Idea

This checkpoint completes the workflow_seed 5k progressive embedding step. The engine room expanded the repaired corpus to 5,000 clean semantic chunks, checked the sample, dry-ran candidate selection, and added 4,500 live Gemini vectors with zero failures and no runtime unlock.

01 · Change

What changed in this slice

The previous checkpoint proved the first 500 repaired workflow_seed vectors. This slice scaled the same source family to the 5k rung without changing the release boundary.

Repair: 5,000 source-ref-injected workflow seed chunks.
QA: 50-row repaired semantic sample passed before live calls.
Embed: 4,500 new Gemini vectors through Vertex ADC.
Gate: AIN-510, reconciliation, exposure, runtime, and validate stayed green.
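The 4,500 figure follows from the ladder being progressive: chunks that already have a vector are skipped, so a 5,000-chunk corpus with 500 existing vectors leaves 4,500 new candidates. A minimal sketch of that selection, assuming candidates are taken in corpus order and capped by a limit. The function name and the chunk ids are hypothetical, and the engine's actual selection logic is not shown in this checkpoint.

```python
from typing import Iterable, List


def select_new_candidates(chunk_ids: Iterable[str],
                          embedded_ids: Iterable[str],
                          max_new: int) -> List[str]:
    """Return up to max_new chunk ids that have no vector yet, in input order."""
    embedded = set(embedded_ids)
    candidates: List[str] = []
    for chunk_id in chunk_ids:
        if len(candidates) >= max_new:
            break  # cap reached; the rest waits for a later rung
        if chunk_id not in embedded:
            candidates.append(chunk_id)
    return candidates


# Hypothetical ids standing in for the 5,000 repaired chunks.
corpus = [f"workflow_seed-{i:05d}" for i in range(5000)]
already_embedded = corpus[:500]  # the 500 vectors from the previous checkpoint

new_candidates = select_new_candidates(corpus, already_embedded, max_new=5000)
print(len(new_candidates))  # 4500
```

Because existing vectors are filtered out before the cap is applied, re-running the same rung selects nothing new, which is what makes a dry run cheap to repeat.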

02 · Live Run

Live embedding result

Item | Result
Source family | workflow_seed
Repaired chunks materialized | 5,000
Existing workflow_seed vectors before live run | 500
Dry-run new candidates | 4,500
Live embedded new vectors | 4,500
Failed Gemini rows | 0
Total Gemini vectors | 12,066
Workflow seed vectors | 5,000
Workflow vectors overall | 5,099
Known-pair cosine gap | 0.190463
Stale vectors | 0
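The counts in the table above can be cross-checked with simple arithmetic. The sketch below is illustrative only: the class and field names are hypothetical, and it assumes every dry-run candidate ends up either embedded or failed, which the checkpoint implies but does not state as a rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LiveRunCounts:
    repaired_chunks: int
    vectors_before: int
    dry_run_candidates: int
    live_embedded: int
    failed_rows: int
    vectors_after: int
    stale_vectors: int


def check_live_run(c: LiveRunCounts) -> List[str]:
    """Return human-readable inconsistencies; an empty list means the counts agree."""
    problems: List[str] = []
    if c.repaired_chunks - c.vectors_before != c.dry_run_candidates:
        problems.append("dry-run candidates != repaired chunks - existing vectors")
    # Assumption: each candidate is either embedded or counted as failed.
    if c.live_embedded + c.failed_rows != c.dry_run_candidates:
        problems.append("embedded + failed rows do not cover every candidate")
    if c.vectors_before + c.live_embedded != c.vectors_after:
        problems.append("vector total after the run does not add up")
    if c.stale_vectors != 0:
        problems.append("stale vectors present")
    return problems


# Values copied from the live embedding result table.
reported = LiveRunCounts(
    repaired_chunks=5000,
    vectors_before=500,
    dry_run_candidates=4500,
    live_embedded=4500,
    failed_rows=0,
    vectors_after=5000,
    stale_vectors=0,
)
print(check_live_run(reported))  # []
```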

Runtime embedding authority is still not promoted. Exact cosine remains the retrieval source of truth, and public runtime, real-user data, external writes, and production telemetry remain blocked.
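"Exact cosine" is read here as scoring a query against every stored vector instead of consulting an approximate index. The sketch below shows that idea in pure Python on toy two-dimensional vectors. The function names and the data are hypothetical, and this is not the engine room's implementation.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_top_k(query: Sequence[float],
                vectors: Dict[str, Sequence[float]],
                k: int) -> List[Tuple[str, float]]:
    """Score the query against every vector and return the k best (id, score) pairs."""
    scored = [(cosine(query, vec), key) for key, vec in vectors.items()]
    # Highest score first; ties broken by id so the ordering is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(key, score) for score, key in scored[:k]]


# Toy vectors, not real embeddings.
store = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
top = exact_top_k([1.0, 0.0], store, k=2)
print([key for key, _ in top])  # ['a', 'c']
```

The full scan costs time linear in the number of vectors, which is the trade made for having no approximation error in the ranking.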

03 · Gates

Authority and validation state

Gate or receipt | Result
Semantic QA before live | 50 / 50 pass
Semantic QA after live | 50 / 50 pass
ain-506-p0-gate | pass
ain-510-retrieval-promotion-gate | promotion_ready
production-chunk-vector-reconciliation | pass
source-authority-registry-v2 | pass
artifact-exposure-scan | 0 active findings
validate | pass

Counter | Value
Combined chunk authority | 327,519
Vector rows | 12,066
Matched vectors | 12,066
Unvectorized chunks | 315,453
BLS OEWS occupation rows | 1,103
BLS OEWS wage/employment rows | 1,103
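The reconciliation counters are internally consistent: every vector row is matched, and matched plus unvectorized chunks equals the combined chunk authority. A small sketch of that check follows. The function and key names are hypothetical, and the real reconciliation command may test more than this.

```python
from typing import Dict, Union


def reconcile(combined_chunks: int, vector_rows: int,
              matched_vectors: int, unvectorized_chunks: int) -> Dict[str, Union[int, float]]:
    """Derive the residuals a clean reconciliation should drive to zero."""
    return {
        # Vector rows that point at no known chunk.
        "orphan_vectors": vector_rows - matched_vectors,
        # Chunks that are neither vectorized nor listed as unvectorized.
        "unaccounted_chunks": combined_chunks - matched_vectors - unvectorized_chunks,
        # Share of the combined chunk authority that has a vector.
        "coverage": matched_vectors / combined_chunks,
    }


# Values copied from the counter table.
result = reconcile(
    combined_chunks=327_519,
    vector_rows=12_066,
    matched_vectors=12_066,
    unvectorized_chunks=315_453,
)
print(result["orphan_vectors"], result["unaccounted_chunks"])  # 0 0
print(f"{result['coverage']:.2%}")  # 3.68%
```

At roughly 3.7% coverage, most of the chunk authority is still unvectorized, which is consistent with the ladder being climbed one source family at a time.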

04 · Proof

Important artifacts

The durable proof lives in the validation receipts, especially production_embedding_repaired_corpus_v1__source_family=workflow_seed.json, production_embedding_semantic_qa_v1__source_family=workflow_seed__input=repaired.json, ain_506_live_gemini_embedding_run_v1.json, ain_510_retrieval_promotion_gate_v1.json, production_chunk_vector_reconciliation_v1.json, source_authority_registry_v2.json, artifact_exposure_scan_v1.json, and full_validation.json.
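A quick way to confirm the receipts named above are all present is to test for each file by name. The sketch below assumes they sit together in one directory. That location is not given in this checkpoint, so the directory is a parameter and the demo uses a temporary folder; the function name is hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List

# File names copied from the artifact list above.
RECEIPTS = [
    "production_embedding_repaired_corpus_v1__source_family=workflow_seed.json",
    "production_embedding_semantic_qa_v1__source_family=workflow_seed__input=repaired.json",
    "ain_506_live_gemini_embedding_run_v1.json",
    "ain_510_retrieval_promotion_gate_v1.json",
    "production_chunk_vector_reconciliation_v1.json",
    "source_authority_registry_v2.json",
    "artifact_exposure_scan_v1.json",
    "full_validation.json",
]


def missing_receipts(receipt_dir: Path, names: Iterable[str] = RECEIPTS) -> List[str]:
    """Return the receipt file names that do not exist in receipt_dir."""
    base = Path(receipt_dir)
    return [name for name in names if not (base / name).is_file()]


# Demo in a throwaway directory: create every receipt except the last one.
with tempfile.TemporaryDirectory() as tmp:
    for name in RECEIPTS[:-1]:
        (Path(tmp) / name).write_text("{}")
    demo_missing = missing_receipts(Path(tmp))

print(demo_missing)  # ['full_validation.json']
```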

05 · Verification

Commands run

uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repaired-corpus --source-family workflow_seed --limit 5000 --shard-size 1000
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family workflow_seed --include-repaired --limit 50
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family workflow_seed --include-repaired --dry-run --max-new 5000 --selection-mode progressive
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family workflow_seed --include-repaired --max-new 5000 --selection-mode progressive --allow-live-gemini --confirm-paid-api --workers 8 --timeout-seconds 60 --write-every 250
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-chunk-vector-reconciliation
uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-registry-v2
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-runtime-readiness
uv run aina-data-engine --root /srv/aina/aina-data-engine-room artifact-exposure-scan
uv run aina-data-engine --root /srv/aina/aina-data-engine-room docs-frontmatter-check
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
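The order of the commands above matters: repair, QA, and the dry run come before the paid live call, and the gates follow it. One way to enforce that order is a small runner that stops at the first failing step. This is a sketch under the assumption that each command signals failure with a non-zero exit code, which this checkpoint does not confirm for the CLI. The demo uses stand-in Python commands in place of the real ones.

```python
import subprocess
import sys
from typing import Optional, Sequence, Tuple


def run_in_order(commands: Sequence[Sequence[str]]) -> Tuple[int, Optional[Sequence[str]]]:
    """Run commands one by one and stop at the first non-zero exit code.

    Returns (number of commands that succeeded, failing command or None).
    """
    for index, argv in enumerate(commands):
        result = subprocess.run(list(argv))
        if result.returncode != 0:
            return index, argv
    return len(commands), None


# Stand-in steps: the second one fails, so the third never runs.
demo_steps = [
    [sys.executable, "-c", "pass"],
    [sys.executable, "-c", "raise SystemExit(3)"],
    [sys.executable, "-c", "pass"],
]
completed, failed = run_in_order(demo_steps)
print(completed, failed is not None)  # 1 True
```

Passing each command as an argument list avoids shell quoting issues and keeps the live flags explicit in the call.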

06 · Resume

What remains

The next M2 ladder step is workflow_seed 25k, but not as a blind live run. First materialize the larger repaired corpus, run another 50-row semantic QA sample, and dry-run candidate selection. Only proceed if repair quality, AIN-510, artifact exposure, and runtime boundaries remain green.

cd /srv/aina/aina-data-engine-room
git status --short --branch
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repaired-corpus --source-family workflow_seed --limit 25000 --shard-size 2500
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family workflow_seed --include-repaired --limit 50
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family workflow_seed --include-repaired --dry-run --max-new 25000 --selection-mode progressive
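The go/no-go rule for the 25k live run can be written down as an explicit check. The sketch below is illustrative: the function and parameter names are hypothetical, the all-rows-pass QA threshold is an assumption based on the 50 / 50 results reported here, and "promotion_ready" is treated as the green value for AIN-510 because that is what the gate table shows.

```python
from typing import List


def live_run_blockers(qa_passed: int, qa_total: int, ain_510_status: str,
                      active_exposure_findings: int, runtime_unlocked: bool) -> List[str]:
    """Return the reasons a live run should not start; an empty list means go."""
    blockers: List[str] = []
    # Assumption: the QA sample must pass in full, as it did at 5k (50 / 50).
    if qa_total == 0 or qa_passed < qa_total:
        blockers.append("semantic QA sample is not fully passing")
    if ain_510_status != "promotion_ready":
        blockers.append("AIN-510 is not green")
    if active_exposure_findings > 0:
        blockers.append("artifact exposure scan has active findings")
    if runtime_unlocked:
        blockers.append("runtime boundary has been unlocked")
    return blockers


# The state reported at the 5k rung produces no blockers.
print(live_run_blockers(50, 50, "promotion_ready", 0, False))  # []
```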

Where to start

Start with the 25k dry-run ladder, not batch. The 5k proof is strong, but it is still a source-family proof, not a license to embed noisy or unrelated data.