AINA Data Engine Room · AIN-506 · 2026-06-12 · VDS local branch ali/ain-506-p0-gate-2026-06-12

Production Semantic Embedding Milestone Handoff

A clean semantic unlock is now in place: freeze broad, embed carefully, and never let doubtful raw rows become retrieval authority.

The Single Idea

The production embedding lane now has the correct architecture and proof path: Gemini Embedding 2 through Vertex ADC, clean semantic chunks as the corpus, exact cosine as the retrieval authority, and deterministic fallback preserved. We stopped further paid expansion, kept the clean top-band proof, and explicitly removed orphan vectors from the older raw-market-inclusive run.

302,234 clean semantic corpus chunks

9,353 Gemini vectors retained against the clean corpus

647 old raw-market orphan vectors pruned

0 new model calls in the cleanup pass

1,000 / 1,000 top ICP title vectors present

500 / 500 hardening-band title vectors present
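
As an illustration of what "exact cosine as the retrieval authority" means in the summary above, here is a minimal sketch that scores every stored vector against the query instead of consulting an approximate index. The function names, the toy 3-dimensional vectors, and the tie-breaking rule are assumptions for illustration and are not taken from the AINA codebase.

```python
import math

def cosine(a, b):
    """Exact cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)

def exact_top_k(query, vectors, k=3):
    """Score every stored vector against the query (no approximate index)."""
    scored = [(cosine(query, vec), chunk_id) for chunk_id, vec in vectors.items()]
    # Sort by score descending, then chunk_id ascending so ties stay stable.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(chunk_id, score) for score, chunk_id in scored[:k]]

# Toy 3-dimensional vectors (hypothetical); the real lane uses 768 dimensions.
vectors = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.0, 1.0, 0.0],
    "chunk-c": [0.6, 0.8, 0.0],
}
top = exact_top_k([1.0, 0.0, 0.0], vectors, k=2)
print(top)
```

Brute-force scoring is linear in the number of vectors, which is usually acceptable at the scale of a 9,353-vector proof layer and removes index recall as a source of error.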

01 · Current Status

The production embedding lane is implemented and passing on the VDS. It uses the funded Vertex ADC route for gemini-embedding-2 at 768 dimensions rather than the legacy bare Developer API key route. Secrets remain in the ignored .env.local file; receipts record only presence, auth mode, project, location, and booleans.
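
The receipt rule above (record presence and mode, never the secret) could be sketched as follows. The receipt field names mirror those reported by the P0 gate later in this document; the environment variable names and the flat dictionary layout are hypothetical and chosen only for illustration.

```python
def build_auth_receipt(env):
    """Summarise auth configuration without copying any secret value."""
    return {
        # Only presence is recorded, never the key itself.
        "gemini_api_key_present": bool(env.get("GEMINI_API_KEY")),
        # The variable names below are assumptions, not the project's real names.
        "auth_mode": env.get("AINA_AUTH_MODE", "unknown"),
        "vertex_project": env.get("VERTEX_PROJECT"),
        "vertex_location": env.get("VERTEX_LOCATION"),
        "live_gemini_api_invoked": False,
    }

# Example with a fake environment mapping (values are placeholders).
fake_env = {
    "GEMINI_API_KEY": "not-a-real-key",
    "AINA_AUTH_MODE": "vertex_adc",
    "VERTEX_PROJECT": "aina-495702",
    "VERTEX_LOCATION": "us",
}
receipt = build_auth_receipt(fake_env)
print(receipt)
```

In real use the mapping passed in would typically be os.environ after .env.local has been loaded; the point of the sketch is that the key's value never appears in the output.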

Ali stopped further expansion before we spent more embedding tokens. The full clean corpus freeze had already completed locally and used zero Gemini model calls. I then ran only a zero-new-vector cleanup pass to prune old vector rows that no longer belong to the clean corpus; its receipt records live_gemini_api_invoked: false.

Surface	Current proof
Clean semantic corpus chunks	302,234
Gemini vectors retained against clean corpus	9,353
Old orphan vectors pruned	647
Top 1,000 ICP title vectors	1,000 / 1,000
Top 500 hardening vectors	500 / 500
Failed embedding rows in current receipt	0
Live Gemini model call during cleanup	false

The prior “10k” run was useful and passed quality gates, but after tightening the raw-data policy the active clean-current vector layer is 9,353 vectors (9,353 retained plus 647 pruned accounts for the original 10,000). The 647 removed rows were orphaned from the old raw-market-inclusive progressive corpus and should not be treated as retrieval authority.

02 · Clean Corpus Policy

The corpus now embeds processed semantic documents, not noisy raw dumps. Full corpus means every usable harvested semantic source family, while raw public market rows are excluded by default.

Guardrail

Market job rows enter only when an explicit positive --market-limit is passed, and only from fully parseable structured CSV files. Malformed market CSV files contribute zero rows until repaired upstream. This prevents parser-salvaged rows, doubtful labels, or incorrect raw snippets from being embedded and reinforced.
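
The guardrail above can be read as an all-or-nothing rule per file. The sketch below shows one way to express it with the standard csv module; the function name and the required column names ("title", "company") are hypothetical, and the real loader may apply different or additional checks.

```python
import csv
import io

def load_market_rows(csv_text, market_limit, required=("title", "company")):
    """Return up to market_limit rows, or [] if the file is not fully parseable."""
    if market_limit <= 0:
        return []  # raw market rows are excluded unless a positive limit is given
    try:
        # strict=True makes the parser raise csv.Error on malformed input.
        reader = csv.DictReader(io.StringIO(csv_text), strict=True)
        fieldnames = reader.fieldnames or []
        if any(column not in fieldnames for column in required):
            return []
        rows = []
        for row in reader:
            # DictReader stores extra cells under the key None and fills
            # missing cells with None; either one means a ragged row.
            if None in row or any(value is None for value in row.values()):
                return []  # one bad row disqualifies the whole file
            rows.append(row)
    except csv.Error:
        return []
    # The limit is applied only after the whole file has validated.
    return rows[:market_limit]

good = "title,company\nEngineer,Acme\nAnalyst,Beta\n"
ragged = "title,company\nEngineer\n"
print(len(load_market_rows(good, market_limit=1)))
print(len(load_market_rows(good, market_limit=0)))
print(len(load_market_rows(ragged, market_limit=10)))
```

Validating the entire file before applying the limit is deliberate: a file that only parses for its first few rows would otherwise slip through as a partially salvaged source.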

The clean full-corpus freeze used:

uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-corpus --full-corpus --market-limit 0 --shard-size 10000

Freeze metric	Value
Status	pass
Chunks	302,234
Token estimate	30,204,466
Manifest shards	48
Tier 1 chunks	96,617
Tier 2 chunks	195,009
Tier 3 chunks	10,608
Raw market rows included	0
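
As a quick consistency check on the freeze metrics above, the tier counts should add up to the chunk total, and the token estimate implies an average chunk size. The sketch below uses only the numbers from the table; the variable names are illustrative, and the per-chunk average says nothing about how tokens are distributed across tiers.

```python
# Figures copied from the freeze metric table.
tier_chunks = {"tier_1": 96_617, "tier_2": 195_009, "tier_3": 10_608}
total_chunks = 302_234
token_estimate = 30_204_466

# The three tiers should account for every chunk in the corpus.
assert sum(tier_chunks.values()) == total_chunks

# Average estimated tokens per chunk (just under 100).
avg_tokens = token_estimate / total_chunks

# Share of the corpus held by each tier, in percent.
tier_share = {tier: 100 * count / total_chunks for tier, count in tier_chunks.items()}

print(f"average tokens per chunk: {avg_tokens:.2f}")
for tier, share in tier_share.items():
    print(f"{tier}: {share:.1f}% of chunks")
```

An average of roughly 100 estimated tokens per chunk gives a rough way to size any future bounded run before spending on it.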

Major families included: serviceable titles, top worked titles, semantic reviews, O*NET occupations/tasks, Hugging Face role signals, GDPval task rows, jobs-research roles/responsibilities/workflows/tools/affordances, evidence atlas layers, ALIPE/source docs, workflow seeds, and harvest/source-map references.

03 · Live Gemini Proof

The successful live paid embedding run used Vertex ADC with project aina-495702, location us, model gemini-embedding-2, and 768 dimensions.

After the clean corpus freeze, the cleanup command intentionally requested zero new vectors:

uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --limit 0 --workers 1 --write-every 1000 --selection-mode progressive --allow-live-gemini --confirm-paid-api --timeout-seconds 30 --max-retries 0

Current receipt	Value
Status	pass
live_gemini_api_invoked	false
Existing clean vectors	9,353
New vectors in cleanup	0
Orphan vectors pruned	647
Mean vector norm	1.0
Similar/dissimilar cosine gap	0.172978
Gate 1 known pairs	50 similar and 50 dissimilar
Quality gates	all pass
Production eligibility for current top-band proof	true
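
The receipt reports a mean vector norm and a similar/dissimilar cosine gap. One plausible way to compute such metrics is sketched below, assuming the gap is the mean cosine of known-similar pairs minus the mean cosine of known-dissimilar pairs. That definition, the thresholds, and all names are assumptions for illustration; the actual gate logic is not shown in this handoff.

```python
import math
from statistics import mean

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

def quality_gate(vectors, similar_pairs, dissimilar_pairs,
                 min_gap=0.1, norm_tol=1e-3):
    """Hypothetical gate: unit-length vectors and a clear cosine gap."""
    mean_norm = mean(norm(v) for v in vectors.values())
    similar = mean(cosine(vectors[a], vectors[b]) for a, b in similar_pairs)
    dissimilar = mean(cosine(vectors[a], vectors[b]) for a, b in dissimilar_pairs)
    gap = similar - dissimilar
    return {
        "mean_vector_norm": mean_norm,
        "cosine_gap": gap,
        # min_gap and norm_tol are illustrative thresholds, not project values.
        "pass": abs(mean_norm - 1.0) <= norm_tol and gap >= min_gap,
    }

# Toy 2-dimensional unit vectors standing in for 768-dimensional embeddings.
toy_vectors = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0]}
result = quality_gate(toy_vectors,
                      similar_pairs=[("a", "b")],
                      dissimilar_pairs=[("a", "c")])
print(result)
```

With 50 similar and 50 dissimilar pairs, as in Gate 1, the same function would simply receive longer pair lists.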

This does not mean the full 302,234-chunk corpus is fully embedded: the 9,353 retained vectors cover roughly 3% of it. It means the clean-current vector layer is good enough for top-band/runtime proving without spending more tokens right now.

04 · Files And Receipts

Corpus receipt: artifacts/validation/ain_506_production_semantic_embedding_corpus_v1.json
Corpus report: artifacts/reports/ain_506_production_semantic_embedding_corpus_v1.md
Corpus HTML: artifacts/reports/ain_506_production_semantic_embedding_corpus_v1.html
Chunk table: artifacts/embeddings/production/chunks/schema_version=embedding_contract_v1/ain_506_production_semantic_embedding_corpus_v1.parquet
Gemini manifests: artifacts/embeddings/production/gemini_batch_manifests/model=gemini-embedding-2/dim=768/schema_version=embedding_contract_v1/
Live run receipt: artifacts/validation/ain_506_live_gemini_embedding_run_v1.json
Vector table: artifacts/embeddings/production/vectors/model=gemini-embedding-2/dim=768/schema_version=embedding_contract_v1/ain_506_live_gemini_embedding_run_v1.parquet
DuckDB: artifacts/embeddings/production/ain_506_live_gemini_embedding_run_v1.duckdb

05 · Validation

uv run pytest tests/test_production_embeddings.py tests/test_embedding_contracts.py -q
uv run ruff check src/aina_data_engine/production_embeddings.py src/aina_data_engine/cli.py tests/test_production_embeddings.py tests/test_embedding_contracts.py
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate

Check	Result
Focused tests	15 passed
Ruff	All checks passed
AIN-506 P0 gate	pass
Full validation	pass

The P0 gate reports gemini_api_key_present: true, auth_mode: vertex_adc, vertex_project: aina-495702, vertex_location: us, live_api_enabled: true, paid_api_project_confirmed: true, runtime_embedding_authority_promoted: false, and live_gemini_api_invoked: false.
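
A handoff reader may want to confirm the no-live posture mechanically instead of by eye. The sketch below checks a few of the flags listed above against expected values; it assumes the gate output is available as a flat JSON object with those keys, which is a guess about the receipt layout, and the function name is hypothetical.

```python
import json

# Values that should hold while embedding expansion is parked.
EXPECTED_NO_LIVE = {
    "auth_mode": "vertex_adc",
    "runtime_embedding_authority_promoted": False,
    "live_gemini_api_invoked": False,
}

def check_no_live(receipt):
    """Return (field, expected, actual) for each mismatch; empty means OK."""
    return [
        (field, expected, receipt.get(field))
        for field, expected in EXPECTED_NO_LIVE.items()
        if receipt.get(field) != expected
    ]

# Flags as reported in this handoff, written as a flat JSON object (assumed layout).
gate_receipt = json.loads("""
{
  "gemini_api_key_present": true,
  "auth_mode": "vertex_adc",
  "vertex_project": "aina-495702",
  "vertex_location": "us",
  "live_api_enabled": true,
  "paid_api_project_confirmed": true,
  "runtime_embedding_authority_promoted": false,
  "live_gemini_api_invoked": false
}
""")
mismatches = check_no_live(gate_receipt)
print(mismatches)
```

An empty mismatch list means the gate output matches the parked state described in this handoff.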

06 · Parked Work

Do not immediately submit the remaining clean corpus to Gemini just because manifests exist. Decide the expansion lane first, before the next run.

Hold: Keep the current 9,353 clean vectors as the active proof layer.

Prioritize: Use the clean 302,234-chunk corpus as the backlog and expand by source family or priority band.

Protect: Continue excluding raw market rows unless repaired into processed, labeled, fully parseable sources.

No-live verification:

cd /srv/aina/aina-data-engine-room
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate

If Ali explicitly resumes embedding expansion, start bounded and clean:

cd /srv/aina/aina-data-engine-room
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --limit 25000 --workers 24 --write-every 1000 --selection-mode progressive --allow-live-gemini --confirm-paid-api --timeout-seconds 90 --max-retries 5

The runner resumes by (chunk_id, text_hash), prunes vectors outside the current corpus, and skips already-written vectors unless --force is passed.
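
The resume behavior described above can be sketched as a small planning step over (chunk_id, text_hash) keys. This is an illustrative model only: the function name, the data shapes, and the return structure are assumptions, and the real runner may order or batch work differently.

```python
def plan_run(corpus, existing, limit, force=False):
    """Plan one embedding run.

    corpus:   list of (chunk_id, text_hash) keys in priority order
    existing: set of (chunk_id, text_hash) keys that already have vectors
    """
    corpus_keys = set(corpus)
    # Vectors whose key is no longer in the corpus are orphans to prune.
    prune = sorted(existing - corpus_keys)
    kept = existing & corpus_keys
    # Skip already-written vectors unless force is requested.
    candidates = corpus if force else [key for key in corpus if key not in kept]
    return {"prune": prune, "embed": candidates[:limit], "kept": len(kept)}

corpus = [("c1", "h1"), ("c2", "h2"), ("c3", "h3")]
# c2 was embedded from older text, and x is no longer in the corpus at all.
existing = {("c1", "h1"), ("c2", "old"), ("x", "h9")}

cleanup = plan_run(corpus, existing, limit=0)
resume = plan_run(corpus, existing, limit=2)
print(cleanup)
print(resume)
```

With a limit of 0 the plan embeds nothing but still lists orphans to prune, which matches the shape of the cleanup pass described earlier. Keying on the text hash as well as the chunk id means a chunk whose text changed is treated as new work and its old vector as an orphan.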

Where to start

Use the 9,353-vector proof now, treat the 302,234 clean chunks as the expansion backlog, and only spend more embedding calls after the source-family priority queue is explicit.

Ali Mehdi Mukadam · co-authored with Codex · 2026-06-12

topics:
  - aina-data-engine-room
  - personalization-engine
  - production-embeddings
  - ain-506
subtopics:
  - gemini-embedding-2
  - vertex-adc
  - semantic-corpus
  - raw-market-exclusion
  - exact-cosine-retrieval
