AINA Data Engine Room · AIN-506 · 2026-06-12 · VDS local branch ali/ain-506-p0-gate-2026-06-12

Production Semantic Embedding Milestone Handoff

A clean semantic unlock is now in place: freeze broad, embed carefully, and never let doubtful raw rows become retrieval authority.

The Single Idea

The production embedding lane now has the correct architecture and proof path: Gemini Embedding 2 through Vertex ADC, clean semantic chunks as the corpus, exact cosine as the retrieval authority, and deterministic fallback preserved. We stopped further paid expansion, kept the clean top-band proof, and explicitly removed orphan vectors from the older raw-market-inclusive run.
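As an illustration of what "exact cosine as the retrieval authority" can mean in practice, here is a minimal sketch that scores every stored vector by brute force rather than consulting an approximate index. The function names, the toy store, and the tie-break rule are assumptions for illustration, not the lane's actual code.

```python
import math

def cosine(a, b):
    """Exact cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def exact_top_k(query, vectors, k=5):
    """Score every stored vector (no approximate index) and return the top k.

    `vectors` maps a chunk id to its embedding. Ties are broken by chunk id
    so the ranking is deterministic (an assumed rule, for illustration).
    """
    scored = [(cosine(query, vec), chunk_id) for chunk_id, vec in vectors.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(chunk_id, score) for score, chunk_id in scored[:k]]

# Toy 3-dimensional vectors; the real lane uses 768 dimensions.
store = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.0, 1.0, 0.0],
    "chunk-c": [0.6, 0.8, 0.0],
}
top = exact_top_k([1.0, 0.0, 0.0], store, k=2)
```

At 9,353 vectors of 768 dimensions, a full scan like this stays cheap, which is one reason an exact ranking can serve as the authority while the vector layer is still small.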

302,234 clean semantic corpus chunks
9,353 Gemini vectors retained against the clean corpus
647 old raw-market orphan vectors pruned
0 new model calls in the cleanup pass
1,000 / 1,000 top ICP title vectors present
500 / 500 hardening-band title vectors present
01 · Current Status

The production embedding lane is implemented and passing on the VDS. It uses the funded Vertex ADC route for gemini-embedding-2 at 768 dimensions, not the legacy bare Developer API key route. Secrets stay in the ignored .env.local file; receipts record only presence, auth mode, project, location, and booleans.
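A minimal sketch of a receipt that records presence and routing metadata but never a secret value. The environment variable names and the default auth mode are hypothetical; the receipt keys mirror the field names the P0 gate reports.

```python
def build_auth_receipt(env):
    """Summarize auth configuration without copying any secret value.

    `env` is a mapping such as os.environ. The variable names read here are
    hypothetical placeholders, not the lane's real configuration keys.
    """
    api_key = env.get("GEMINI_API_KEY", "")
    return {
        "gemini_api_key_present": bool(api_key),  # presence only, never the key
        "auth_mode": env.get("AINA_AUTH_MODE", "vertex_adc"),  # assumed default
        "vertex_project": env.get("VERTEX_PROJECT", ""),
        "vertex_location": env.get("VERTEX_LOCATION", ""),
    }

receipt = build_auth_receipt({
    "GEMINI_API_KEY": "not-a-real-key",
    "VERTEX_PROJECT": "aina-495702",
    "VERTEX_LOCATION": "us",
})
```

Because the key is reduced to a boolean before the receipt is built, the receipt can be committed or shared without a separate redaction step.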

Ali stopped further expansion before we spent more embedding tokens. The full clean corpus freeze had already completed locally and used zero Gemini model calls. I then ran only a zero-new-vector cleanup pass with live_gemini_api_invoked: false to prune old vector rows that no longer belong to the clean corpus.

Surface                                       Current proof
Clean semantic corpus chunks                  302,234
Gemini vectors retained against clean corpus  9,353
Old orphan vectors pruned                     647
Top 1,000 ICP title vectors                   1,000 / 1,000
Top 500 hardening vectors                     500 / 500
Failed embedding rows in current receipt      0
Current model call during cleanup             false

The prior “10k” run was useful and passed quality gates, but after the raw-data policy was tightened the active clean-current vector layer is 9,353 vectors: the 9,353 retained plus the 647 pruned account for the 10,000 rows of that run. The 647 removed rows were orphaned from the old raw-market-inclusive progressive corpus and must not be treated as retrieval authority.

02 · Clean Corpus Policy

The corpus now embeds processed semantic documents, not noisy raw dumps. Full corpus means every usable harvested semantic source family, while raw public market rows are excluded by default.

Guardrail

Market job rows only enter when an explicit positive --market-limit is passed, and only from fully parseable structured CSV files. Malformed market CSV files contribute zero rows until repaired upstream. This prevents parser-salvaged rows, doubtful labels, or incorrect raw snippets from being embedded and reinforced.
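The guardrail can be sketched as an all-or-nothing loader: a non-positive limit yields nothing, and a file with any malformed row yields nothing rather than a salvaged subset. The function name, its parameters, and the column-count test for "fully parseable" are assumptions for illustration; the real parser may apply stricter checks.

```python
import csv
import io

def load_market_rows(csv_text, market_limit, expected_columns):
    """Return up to `market_limit` rows, or [] if the file is not fully parseable.

    A file is rejected whole if its header differs from `expected_columns`,
    any row has the wrong number of fields, or the csv module raises an error.
    """
    if market_limit <= 0:
        return []  # raw market rows are excluded by default
    try:
        reader = csv.reader(io.StringIO(csv_text, newline=""), strict=True)
        header = next(reader, None)
        if header != list(expected_columns):
            return []
        rows = []
        for fields in reader:
            if len(fields) != len(header):
                return []  # malformed file: contribute zero rows, salvage nothing
            rows.append(dict(zip(header, fields)))
    except csv.Error:
        return []
    return rows[:market_limit]

columns = ["title", "company"]
good = "title,company\nEngineer,Acme\nAnalyst,Beta\n"
bad = "title,company\nEngineer,Acme\nAnalyst\n"
```

With these toy inputs, `load_market_rows(good, 0, columns)` and `load_market_rows(bad, 10, columns)` both return an empty list, while `load_market_rows(good, 1, columns)` returns only the first row.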

The clean full-corpus freeze used:

uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-corpus --full-corpus --market-limit 0 --shard-size 10000

Freeze metric             Value
Status                    pass
Chunks                    302,234
Token estimate            30,204,466
Manifest shards           48
Tier 1 chunks             96,617
Tier 2 chunks             195,009
Tier 3 chunks             10,608
Raw market rows included  0
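
The freeze numbers can be cross-checked with simple arithmetic. This sketch copies the values from the freeze receipt above; the dictionary keys are illustrative names, not the receipt's real schema.

```python
freeze = {
    "chunks": 302_234,
    "token_estimate": 30_204_466,
    "tier_1": 96_617,
    "tier_2": 195_009,
    "tier_3": 10_608,
    "raw_market_rows": 0,
}

tier_total = freeze["tier_1"] + freeze["tier_2"] + freeze["tier_3"]
assert tier_total == freeze["chunks"]  # the three tiers account for every chunk
assert freeze["raw_market_rows"] == 0  # consistent with --market-limit 0

# Average estimated tokens per chunk, useful when sizing a bounded paid run.
tokens_per_chunk = freeze["token_estimate"] / freeze["chunks"]
print(f"{tokens_per_chunk:.1f} estimated tokens per chunk")
```

The tiers sum exactly to 302,234, and the token estimate works out to roughly 99.9 tokens per chunk.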

Major families included: serviceable titles, top worked titles, semantic reviews, O*NET occupations/tasks, Hugging Face role signals, GDPval task rows, jobs-research roles/responsibilities/workflows/tools/affordances, evidence atlas layers, ALIPE/source docs, workflow seeds, and harvest/source-map references.

03 · Live Gemini Proof

The successful live paid embedding run used Vertex ADC with project aina-495702, location us, model gemini-embedding-2, and 768 dimensions.

After the clean corpus freeze, the cleanup command intentionally requested zero new vectors:

uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --limit 0 --workers 1 --write-every 1000 --selection-mode progressive --allow-live-gemini --confirm-paid-api --timeout-seconds 30 --max-retries 0

Current receipt                                    Value
Status                                             pass
live_gemini_api_invoked                            false
Existing clean vectors                             9,353
New vectors in cleanup                             0
Orphan vectors pruned                              647
Mean vector norm                                   1.0
Similar/dissimilar cosine gap                      0.172978
Gate 1 known pairs                                 50 similar and 50 dissimilar
Quality gates                                      all pass
Production eligibility for current top-band proof  true
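
For readers unfamiliar with the two vector-quality numbers in the receipt, here is a minimal sketch of how such metrics could be computed. It assumes the gap is the mean cosine over known-similar pairs minus the mean cosine over known-dissimilar pairs; the receipt does not spell out the exact definition, and the function names are illustrative.

```python
import math

def norm(v):
    """L2 norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def cosine(a, b):
    """Exact cosine similarity; assumes neither vector is all zeros."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

def mean_norm(vectors):
    """Average L2 norm; 1.0 indicates unit-normalized vectors."""
    return sum(norm(v) for v in vectors) / len(vectors)

def cosine_gap(similar_pairs, dissimilar_pairs):
    """Mean cosine of similar pairs minus mean cosine of dissimilar pairs."""
    sim = sum(cosine(a, b) for a, b in similar_pairs) / len(similar_pairs)
    dis = sum(cosine(a, b) for a, b in dissimilar_pairs) / len(dissimilar_pairs)
    return sim - dis

# Toy 2-dimensional pairs; Gate 1 uses 50 similar and 50 dissimilar pairs.
similar = [([1.0, 0.0], [0.8, 0.6])]
dissimilar = [([1.0, 0.0], [0.0, 1.0])]
gap = cosine_gap(similar, dissimilar)
avg = mean_norm([[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]])
```

A positive gap means the model places known-similar pairs closer together than known-dissimilar ones, which is the property the gate is checking.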

This does not mean the full 302,234-chunk corpus is fully embedded. It means the clean-current vector layer is good enough for top-band/runtime proving without spending more tokens right now.

04 · Files And Receipts

05 · Validation

uv run pytest tests/test_production_embeddings.py tests/test_embedding_contracts.py -q
uv run ruff check src/aina_data_engine/production_embeddings.py src/aina_data_engine/cli.py tests/test_production_embeddings.py tests/test_embedding_contracts.py
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate

Check            Result
Focused tests    15 passed
Ruff             All checks passed
AIN-506 P0 gate  pass
Full validation  pass

The P0 gate reports gemini_api_key_present: true, auth_mode: vertex_adc, vertex_project: aina-495702, vertex_location: us, live_api_enabled: true, paid_api_project_confirmed: true, runtime_embedding_authority_promoted: false, and live_gemini_api_invoked: false.

Parked Work

Do not immediately submit the remaining clean corpus to Gemini just because manifests exist. The next run should first decide the expansion lane.

Hold: Keep the current 9,353 clean vectors as the active proof layer.
Prioritize: Use the clean 302,234-chunk corpus as the backlog and expand by source family or priority band.
Protect: Continue excluding raw market rows unless repaired into processed, labeled, fully parseable sources.
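
One way to make the expansion lane explicit before any paid call is a small planner that orders the backlog by tier and then by an agreed family ranking. This is only a sketch: the record keys (`chunk_id`, `tier`, `family`), the family labels, and the ordering rule are all hypothetical.

```python
def plan_expansion(backlog, family_priority, limit):
    """Pick up to `limit` unembedded chunks, ordered by tier then family rank.

    `backlog` is a list of dicts with hypothetical keys: chunk_id, tier, family.
    Families missing from `family_priority` are skipped rather than guessed at.
    """
    rank = {family: i for i, family in enumerate(family_priority)}
    eligible = [c for c in backlog if c["family"] in rank]
    eligible.sort(key=lambda c: (c["tier"], rank[c["family"]], c["chunk_id"]))
    return eligible[:limit]

backlog = [
    {"chunk_id": "c3", "tier": 2, "family": "onet_tasks"},
    {"chunk_id": "c1", "tier": 1, "family": "semantic_reviews"},
    {"chunk_id": "c2", "tier": 1, "family": "serviceable_titles"},
    {"chunk_id": "c4", "tier": 1, "family": "unranked_family"},
]
selected = plan_expansion(
    backlog,
    family_priority=["serviceable_titles", "semantic_reviews", "onet_tasks"],
    limit=2,
)
```

Skipping unranked families keeps the queue explicit: nothing is embedded until someone has placed its family in the priority list.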

No-live verification:

cd /srv/aina/aina-data-engine-room
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate

If Ali explicitly resumes embedding expansion, start bounded and clean:

cd /srv/aina/aina-data-engine-room
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --limit 25000 --workers 24 --write-every 1000 --selection-mode progressive --allow-live-gemini --confirm-paid-api --timeout-seconds 90 --max-retries 5

The runner resumes by (chunk_id, text_hash), prunes vectors outside the current corpus, and skips already-written vectors unless --force is passed.
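
The resume behavior described above can be sketched with plain set operations. The function name and return shape are assumptions; only the (chunk_id, text_hash) key, the pruning of out-of-corpus vectors, and the `--force` semantics come from the runner's description.

```python
def plan_run(corpus, stored, limit, force=False):
    """Split a run into vectors to prune and chunks to embed.

    `corpus` and `stored` are sets of (chunk_id, text_hash) keys. A chunk
    whose text changed has a new hash, so its old vector counts as an orphan.
    """
    to_prune = sorted(stored - corpus)  # vectors outside the current corpus
    candidates = sorted(corpus if force else corpus - stored)
    to_embed = candidates[:limit]  # a limit of 0 embeds nothing
    return to_prune, to_embed

corpus = {("c1", "h1"), ("c2", "h2b"), ("c3", "h3")}
stored = {("c1", "h1"), ("c2", "h2a"), ("old", "hx")}

cleanup_prune, cleanup_embed = plan_run(corpus, stored, limit=0)
resume_prune, resume_embed = plan_run(corpus, stored, limit=10)
```

With `limit=0` the plan prunes the two stale keys and embeds nothing, which mirrors a zero-new-vector cleanup pass; with a positive limit it additionally queues only the chunks that have no current vector.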

Where to start

Use the 9,353-vector proof now, treat the 302,234 clean chunks as the expansion backlog, and only spend more embedding calls after the source-family priority queue is explicit.