Workflow Seed 5k Embedding Checkpoint
A repair-first Gemini embedding ladder step for source-authoritative workflow seeds.
This checkpoint completes the workflow_seed 5k progressive embedding step. The engine room expanded the repaired corpus to 5,000 clean semantic chunks, passed a 50 / 50 semantic QA sample, dry-ran candidate selection, and added 4,500 live Gemini vectors with zero failures and no runtime unlock.
What changed in this slice
The previous checkpoint proved the first 500 repaired workflow_seed vectors. This slice scaled the same source family to the 5k rung without changing the release boundary.
Live embedding result
| Item | Result |
|---|---|
| Source family | workflow_seed |
| Repaired chunks materialized | 5,000 |
| Existing workflow_seed vectors before live run | 500 |
| Dry-run new candidates | 4,500 |
| Live embedded new vectors | 4,500 |
| Failed Gemini rows | 0 |
| Total Gemini vectors | 12,066 |
| Workflow seed vectors | 5,000 |
| Workflow vectors overall | 5,099 |
| Known-pair cosine gap | 0.190463 |
| Stale vectors | 0 |
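
The counts in this table are internally consistent, which a few lines of arithmetic can confirm. The sketch below uses only figures from the table; the variable names are illustrative, and the 99-vector remainder is an inference from the two workflow totals rather than a reported figure.

```python
# Figures copied from the live embedding result table.
existing_seed_vectors = 500      # workflow_seed vectors before the live run
dry_run_candidates = 4_500       # new candidates selected by the dry run
live_embedded = 4_500            # vectors written by the live run
failed_rows = 0
seed_vectors_after = 5_000
workflow_vectors_overall = 5_099

# Every dry-run candidate should have been embedded, with nothing dropped.
assert live_embedded == dry_run_candidates
assert failed_rows == 0

# The 5k rung is the previous 500 vectors plus the 4,500 new ones.
assert existing_seed_vectors + live_embedded == seed_vectors_after

# Inferred, not reported: workflow vectors outside the workflow_seed family.
other_workflow_vectors = workflow_vectors_overall - seed_vectors_after
print(other_workflow_vectors)  # 99
```
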
Authority and validation state
| Gate or receipt | Result |
|---|---|
| Semantic QA before live | 50 / 50 pass |
| Semantic QA after live | 50 / 50 pass |
| ain-506-p0-gate | pass |
| ain-510-retrieval-promotion-gate | promotion_ready |
| production-chunk-vector-reconciliation | pass |
| source-authority-registry-v2 | pass |
| artifact-exposure-scan | 0 active findings |
| validate | pass |

| Counter | Value |
|---|---|
| Combined chunk authority | 327,519 |
| Vector rows | 12,066 |
| Matched vectors | 12,066 |
| Unvectorized chunks | 315,453 |
| BLS OEWS occupation rows | 1,103 |
| BLS OEWS wage/employment rows | 1,103 |
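
The reconciliation counters can be cross-checked the same way: the unvectorized count should equal the chunk authority minus the matched vectors. This is a minimal sketch using only the numbers in the table; the variable names and the coverage ratio are illustrative additions, not fields from the reconciliation receipt.

```python
# Figures copied from the counter table.
combined_chunks = 327_519
vector_rows = 12_066
matched_vectors = 12_066
unvectorized_chunks = 315_453

# Every vector row should match a chunk in the authority set.
assert matched_vectors == vector_rows

# Chunks without a vector are the authority total minus the matched vectors.
assert combined_chunks - matched_vectors == unvectorized_chunks

# Share of the chunk authority that currently has a vector.
coverage = matched_vectors / combined_chunks
print(f"{coverage:.2%}")  # 3.68%
```
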
Important artifacts
The durable proof lives in the validation receipts, especially production_embedding_repaired_corpus_v1__source_family=workflow_seed.json, production_embedding_semantic_qa_v1__source_family=workflow_seed__input=repaired.json, ain_506_live_gemini_embedding_run_v1.json, ain_510_retrieval_promotion_gate_v1.json, production_chunk_vector_reconciliation_v1.json, source_authority_registry_v2.json, artifact_exposure_scan_v1.json, and full_validation.json.
Commands run
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repaired-corpus --source-family workflow_seed --limit 5000 --shard-size 1000
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family workflow_seed --include-repaired --limit 50
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family workflow_seed --include-repaired --dry-run --max-new 5000 --selection-mode progressive
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family workflow_seed --include-repaired --max-new 5000 --selection-mode progressive --allow-live-gemini --confirm-paid-api --workers 8 --timeout-seconds 60 --write-every 250
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-chunk-vector-reconciliation
uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-registry-v2
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-runtime-readiness
uv run aina-data-engine --root /srv/aina/aina-data-engine-room artifact-exposure-scan
uv run aina-data-engine --root /srv/aina/aina-data-engine-room docs-frontmatter-check
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
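
The post-run gates above are order-sensitive, so when scripting them it can help to stop at the first failure instead of running the rest against a bad state. The following is a rough sketch, not part of the engine room: `run_in_order` and `run_with_subprocess` are hypothetical helpers, and it assumes the CLI signals failure through a non-zero exit code, which the source does not confirm.

```python
import subprocess
from typing import Callable, Sequence

# Base invocation copied from the commands above.
BASE = ["uv", "run", "aina-data-engine", "--root", "/srv/aina/aina-data-engine-room"]

# The post-run gate subcommands listed above, in the order they were run.
GATES = [
    ["ain-510-retrieval-promotion-gate"],
    ["production-chunk-vector-reconciliation"],
    ["source-authority-registry-v2"],
    ["production-runtime-readiness"],
    ["artifact-exposure-scan"],
    ["docs-frontmatter-check"],
    ["validate"],
]


def run_in_order(
    steps: Sequence[Sequence[str]],
    run: Callable[[list[str]], int],
) -> list[str]:
    """Run each step in order and stop at the first non-zero exit code.

    Returns the subcommand names that completed successfully.
    """
    completed: list[str] = []
    for step in steps:
        exit_code = run(BASE + list(step))
        if exit_code != 0:
            break
        completed.append(step[0])
    return completed


def run_with_subprocess(argv: list[str]) -> int:
    # Assumption: a failed gate exits non-zero.
    return subprocess.run(argv).returncode
```

Calling `run_in_order(GATES, run_with_subprocess)` would execute the gates for real; passing a stub runner instead makes the ordering logic checkable without touching the corpus or the paid API.
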
What remains
The next M2 ladder step is workflow_seed 25k, but not as a blind live run. First materialize the larger repaired corpus, run another 50-row semantic QA sample, and dry-run candidate selection. Only proceed if repair quality, AIN-510, artifact exposure, and runtime boundaries remain green.
cd /srv/aina/aina-data-engine-room
git status --short --branch
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repaired-corpus --source-family workflow_seed --limit 25000 --shard-size 2500
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family workflow_seed --include-repaired --limit 50
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family workflow_seed --include-repaired --dry-run --max-new 25000 --selection-mode progressive
Start with the 25k dry-run ladder, not a batch run. The 5k proof is strong, but it remains a source-family proof, not a license to embed noisy or unrelated data.
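
The go/no-go rule for the 25k rung can be written down as a small check over the preflight results. This is only a sketch under stated assumptions: `PreflightResult` and its field names are hypothetical and do not reflect the receipt schema, and treating "green" as a fully passing QA sample, a `promotion_ready` AIN-510 status, and zero active exposure findings is read off this checkpoint's tables, not a documented threshold.

```python
from dataclasses import dataclass


@dataclass
class PreflightResult:
    # All field names are hypothetical; they are not the receipt schema.
    qa_passed: int                 # rows passing the semantic QA sample
    qa_sampled: int                # rows sampled (50 in this checkpoint)
    ain_510_status: str            # "promotion_ready" in this checkpoint
    active_exposure_findings: int  # 0 in this checkpoint
    runtime_boundary_ok: bool      # True while the runtime stays locked


def blockers(result: PreflightResult) -> list[str]:
    """Return the reasons not to start the live run; an empty list means go."""
    reasons: list[str] = []
    if result.qa_sampled == 0 or result.qa_passed < result.qa_sampled:
        reasons.append("semantic QA sample is not fully passing")
    if result.ain_510_status != "promotion_ready":
        reasons.append("AIN-510 is not promotion_ready")
    if result.active_exposure_findings > 0:
        reasons.append("artifact exposure scan has active findings")
    if not result.runtime_boundary_ok:
        reasons.append("runtime boundary changed")
    return reasons


# This checkpoint's values produce no blockers.
print(blockers(PreflightResult(50, 50, "promotion_ready", 0, True)))  # []
```

Returning the list of reasons instead of a single boolean keeps the decision auditable: a blocked rung records which gate stopped it.
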