AINA Data Engine Room · Personalization Engine · 2026-06-15

Workflow Intelligence Family Complete Embedding Checkpoint

The full repaired workflow intelligence source family is now embedded locally.

Ali Mehdi Mukadam · co-authored with Codex · 2026-06-15

The Single Idea

The workflow_intelligence source family is now fully embedded for the current repaired corpus: 3,152 vectors, zero failed Gemini rows, and zero remaining candidates. All retrieval and runtime gates remain green and local-only.

01 · Change

What changed

After the 500-vector proof, Codex expanded workflow_intelligence to all 3,152 repaired chunks, ran semantic QA on a 50-chunk sample (50 / 50 pass), embedded the remaining 2,652 candidates through Gemini Embedding 2, and confirmed that the completion dry run reports zero remaining candidates.
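The "remaining candidates" figure can be read as a set difference between the repaired chunks and the chunks that already have a vector. The sketch below is only an illustration of that reading: the function name `remaining_candidates` and the `wi-` chunk IDs are hypothetical, and the checkpoint does not show how the CLI actually selects candidates.

```python
def remaining_candidates(repaired_chunk_ids, embedded_chunk_ids):
    """Return the chunk IDs that still need a vector, preserving corpus order."""
    embedded = set(embedded_chunk_ids)
    return [cid for cid in repaired_chunk_ids if cid not in embedded]


# Hypothetical IDs, shaped only to match the counts in this checkpoint.
repaired = [f"wi-{i:04d}" for i in range(3152)]
proof_vectors = repaired[:500]  # assumption: stands in for the 500-vector proof

before_live = remaining_candidates(repaired, proof_vectors)
after_live = remaining_candidates(repaired, repaired)

print(len(before_live), len(after_live))  # prints: 2652 0
```

Under this reading, a dry run that reports zero candidates means every repaired chunk ID already has a matching vector.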

Repair: 3,152 repaired chunks with placeholder-noise cleanup.
Embed: 2,652 new live Gemini vectors, zero failed rows.
Verify: AIN-510, reconciliation, registry, exposure, and validate all pass.

02 · Live Run

Live embedding result

Item: Result
Source family: workflow_intelligence
Clean repaired corpus size: 3,152
Existing workflow_intelligence vectors before final run: 500
Final dry-run candidates before live: 2,652
Live embedded new vectors: 2,652
Failed Gemini rows: 0
Remaining workflow_intelligence candidates after live: 0
Total Gemini vectors: 17,084
Workflow intelligence vectors: 3,152
Workflow seed vectors: 6,866
Workflow vectors overall: 10,117
Known-pair cosine gap: 0.190463
Stale vectors: 0
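The checkpoint reports a known-pair cosine gap but does not define how it is computed. One plausible reading, offered here purely as an assumption, is the mean cosine similarity of pairs known to be related minus the mean for pairs known to be unrelated. The function names, pair lists, and toy vectors below are all hypothetical and are not taken from the AINA codebase.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def known_pair_gap(vectors, related_pairs, unrelated_pairs):
    """Assumed definition: mean related similarity minus mean unrelated similarity."""
    related = mean(cosine(vectors[a], vectors[b]) for a, b in related_pairs)
    unrelated = mean(cosine(vectors[a], vectors[b]) for a, b in unrelated_pairs)
    return related - unrelated


# Toy 2-D vectors; real embeddings have far more dimensions.
toy = {
    "invoice_approval": [1.0, 0.0],
    "invoice_review": [1.0, 0.0],
    "forklift_safety": [0.0, 1.0],
}
gap = known_pair_gap(
    toy,
    related_pairs=[("invoice_approval", "invoice_review")],
    unrelated_pairs=[("invoice_approval", "forklift_safety")],
)
print(gap)  # prints: 1.0
```

Under this assumed definition, a positive gap means related chunks sit closer together than unrelated ones. The toy result is not comparable to the 0.190463 reported above.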

03 · Gates

Validation state

Gate or receipt: Result
Semantic QA: 50 / 50 pass
Completion dry run: 0 remaining candidates
ain-506-p0-gate: pass
ain-510-retrieval-promotion-gate: promotion_ready
production-chunk-vector-reconciliation: pass
source-authority-registry-v2: pass
artifact-exposure-scan: 0 active findings
validate: pass

Counter: Value
Combined chunk authority: 333,037
Vector rows: 17,084
Matched vectors: 17,084
Unvectorized chunks: 315,953
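The reported counts can be cross-checked with simple arithmetic. The sketch below uses only figures stated in this checkpoint. The dictionary keys and the `reconcile` helper are hypothetical names chosen for illustration, not part of the CLI.

```python
# Figures copied from the live-run and counter tables in this checkpoint.
counts = {
    "corpus_size": 3152,
    "vectors_before_final_run": 500,
    "live_embedded": 2652,
    "failed_rows": 0,
    "combined_chunk_authority": 333037,
    "vector_rows": 17084,
    "matched_vectors": 17084,
    "unvectorized_chunks": 315953,
}


def reconcile(c):
    """Return named boolean checks over the reported counts."""
    return {
        "family_fully_embedded":
            c["vectors_before_final_run"] + c["live_embedded"] == c["corpus_size"],
        "no_failed_rows": c["failed_rows"] == 0,
        "all_vectors_matched": c["matched_vectors"] == c["vector_rows"],
        "authority_accounted_for":
            c["vector_rows"] + c["unvectorized_chunks"] == c["combined_chunk_authority"],
    }


checks = reconcile(counts)
print(all(checks.values()))  # prints: True
```

The four checks restate the figures above: 500 + 2,652 = 3,152 for the family, and 17,084 + 315,953 = 333,037 for the combined chunk authority.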

04 · Status

Source-family status

workflow_seed and workflow_intelligence are now complete for their current repaired corpora. jobs_research_role remains blocked: its repaired rows failed semantic QA because they are title-only or store-number artifacts.

This is still a local source-family proof. Runtime embedding authority, public runtime, real-user data, external writes, production telemetry, donor mutation, raw market dumps, and private learner-linked payloads remain blocked.

05 · Verification

Commands run

uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repaired-corpus --source-family workflow_intelligence --limit 5000 --shard-size 1000
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family workflow_intelligence --include-repaired --limit 50
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family workflow_intelligence --include-repaired --dry-run --max-new 5000 --selection-mode progressive
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family workflow_intelligence --include-repaired --max-new 5000 --selection-mode progressive --allow-live-gemini --confirm-paid-api --workers 8 --timeout-seconds 60 --write-every 250
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family workflow_intelligence --include-repaired --dry-run --max-new 5000 --selection-mode progressive
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-chunk-vector-reconciliation
uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-registry-v2
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-runtime-readiness
uv run aina-data-engine --root /srv/aina/aina-data-engine-room artifact-exposure-scan
uv run aina-data-engine --root /srv/aina/aina-data-engine-room docs-frontmatter-check
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
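The post-run gate commands above could be chained so that the first failure stops the sequence. The sketch below is a minimal illustration, assuming each subcommand exits non-zero when its gate fails. That exit-code behavior is not confirmed by this checkpoint, and `build_command` and `run_gates` are hypothetical helper names.

```python
import subprocess

ROOT = "/srv/aina/aina-data-engine-room"

# Post-run subcommands in the order listed in this checkpoint.
GATES = [
    "ain-510-retrieval-promotion-gate",
    "production-chunk-vector-reconciliation",
    "source-authority-registry-v2",
    "production-runtime-readiness",
    "artifact-exposure-scan",
    "docs-frontmatter-check",
    "validate",
]


def build_command(subcommand, root=ROOT):
    """Build the argv list for one aina-data-engine subcommand."""
    return ["uv", "run", "aina-data-engine", "--root", root, subcommand]


def run_gates(subcommands, runner=subprocess.run):
    """Run gates in order; return the first failing subcommand, or None.

    Assumes a non-zero exit code signals a failed gate (not confirmed).
    """
    for subcommand in subcommands:
        result = runner(build_command(subcommand))
        if result.returncode != 0:
            return subcommand
    return None
```

Calling `run_gates(GATES)` on the host would execute the real commands. The `runner` parameter exists so the sequence can be exercised with a stub and without touching the corpus or any paid API.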

06 · Resume

What remains

Pick the next family by semantic value and repair readiness. onet_occupation_evidence is a good next candidate because it is smaller and grounded in a public taxonomy. onet_task_evidence is larger and valuable, but it needs careful generic-title repair. jobs_research_responsibility is high value but currently risky because early samples share the store-manager artifact pattern.

cd /srv/aina/aina-data-engine-room
git status --short --branch
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-eligibility --source-family onet_occupation_evidence
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repaired-corpus --source-family onet_occupation_evidence --limit 500 --shard-size 250
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-semantic-qa --source-family onet_occupation_evidence --include-repaired --limit 50

Where to start

Start with O*NET occupation evidence. It is public, bounded, and likely to improve role grounding without repeating marketplace-noise mistakes.