Production Semantic Embedding Milestone Handoff
A clean semantic unlock is now in place: freeze broad, embed carefully, and never let doubtful raw rows become retrieval authority.
The production embedding lane now has the correct architecture and proof path: Gemini Embedding 2 through Vertex ADC, clean semantic chunks as the corpus, exact cosine as the retrieval authority, and deterministic fallback preserved. We stopped further paid expansion, kept the clean top-band proof, and explicitly removed orphan vectors from the older raw-market-inclusive run.
Current Status
The production embedding lane is implemented and passing on the VDS. It uses the funded Vertex ADC route for gemini-embedding-2 at 768 dimensions, not the legacy bare Developer API key route. Secrets remain in ignored .env.local; receipts only record presence, auth mode, project, location, and booleans.
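The receipt rule above (record presence and booleans, never secret values) can be sketched as follows. This is a minimal illustration, not the engine's code: the function name `build_auth_receipt`, the environment variable names, and the way `auth_mode` is derived are all assumptions. Only the receipt field names come from the P0 gate output quoted in this handoff.

```python
from typing import Mapping


def build_auth_receipt(env: Mapping[str, str]) -> dict:
    """Summarize auth configuration without copying any secret value."""
    # Hypothetical variable names; the real .env.local keys may differ.
    api_key = env.get("GEMINI_API_KEY", "")
    project = env.get("VERTEX_PROJECT", "")
    location = env.get("VERTEX_LOCATION", "")
    return {
        # Presence only: the key itself never enters the receipt.
        "gemini_api_key_present": bool(api_key),
        # Assumed rule: a configured Vertex project implies the ADC route.
        "auth_mode": "vertex_adc" if project else "api_key",
        "vertex_project": project or None,
        "vertex_location": location or None,
    }


receipt = build_auth_receipt(
    {
        "GEMINI_API_KEY": "not-a-real-key",
        "VERTEX_PROJECT": "aina-495702",
        "VERTEX_LOCATION": "us",
    }
)
print(receipt)
```

The point of the shape is that a receipt can be committed or shared without leaking credentials, because the secret is reduced to a boolean before anything is written.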
Ali stopped further expansion before we spent more embedding tokens. The full clean corpus freeze had already completed locally and used zero Gemini model calls. I then ran only a zero-new-vector cleanup pass with live_gemini_api_invoked: false to prune old vector rows that no longer belong to the clean corpus.
| Surface | Current proof |
|---|---|
| Clean semantic corpus chunks | 302,234 |
| Gemini vectors retained against clean corpus | 9,353 |
| Old orphan vectors pruned | 647 |
| Top 1,000 ICP title vectors | 1,000 / 1,000 |
| Top 500 hardening vectors | 500 / 500 |
| Failed embedding rows in current receipt | 0 |
| Current model call during cleanup | false |
The prior “10k” run was useful and passed quality gates, but after tightening the raw-data policy the active clean-current vector layer is 9,353 vectors. The 647 removed rows were orphaned from the old raw-market-inclusive progressive corpus and should not be treated as retrieval authority.
Clean Corpus Policy
The corpus now embeds processed semantic documents, not noisy raw dumps. Full corpus means every usable harvested semantic source family, while raw public market rows are excluded by default.
Market job rows only enter when an explicit positive --market-limit is passed, and only from fully parseable structured CSV files. Malformed market CSV files contribute zero rows until repaired upstream. This prevents parser-salvaged rows, doubtful labels, or incorrect raw snippets from being embedded and reinforced.
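The all-or-nothing market policy can be sketched with the standard `csv` module. This is a hedged illustration only: `load_market_rows` is a hypothetical helper, and "fully parseable" is assumed here to mean that every row has exactly the header's columns and the parser raises no error. The real engine may apply stricter checks.

```python
import csv
import io


def load_market_rows(csv_text: str, market_limit: int) -> list[dict]:
    """Return up to market_limit rows, or nothing if the file is not fully parseable."""
    if market_limit <= 0:
        return []  # raw market rows are excluded by default
    rows = []
    try:
        reader = csv.DictReader(io.StringIO(csv_text), strict=True)
        if not reader.fieldnames:
            return []  # no header, nothing trustworthy to embed
        for row in reader:
            # DictReader stores surplus fields under the None key and fills
            # missing fields with None; either one marks a malformed row.
            if None in row or any(value is None for value in row.values()):
                return []  # one bad row disqualifies the whole file
            rows.append(row)
    except csv.Error:
        return []  # parser errors also contribute zero rows
    return rows[:market_limit]


good = "title,company\nEngineer,Acme\nAnalyst,Beta\n"
ragged = "title,company\nEngineer,Acme\nAnalyst\n"
print(len(load_market_rows(good, 10)), len(load_market_rows(ragged, 10)))
```

Rejecting the whole file rather than skipping bad rows is the design choice that keeps parser-salvaged fragments out of the corpus: a file either passes cleanly or waits for an upstream repair.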
The clean full-corpus freeze used:
uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-corpus --full-corpus --market-limit 0 --shard-size 10000
| Freeze metric | Value |
|---|---|
| Status | pass |
| Chunks | 302,234 |
| Token estimate | 30,204,466 |
| Manifest shards | 48 |
| Tier 1 chunks | 96,617 |
| Tier 2 chunks | 195,009 |
| Tier 3 chunks | 10,608 |
| Raw market rows included | 0 |
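As a quick consistency check on the reported figures, the tier counts should sum to the chunk total. The short sketch below verifies that and derives two figures not stated in the receipts: the average token estimate per chunk (about 100) and the share of the clean corpus that currently has a retained vector (roughly 3%). The coverage figure assumes each retained vector corresponds to exactly one clean chunk.

```python
chunks_total = 302_234
tiers = {"tier_1": 96_617, "tier_2": 195_009, "tier_3": 10_608}
token_estimate = 30_204_466
vectors_retained = 9_353  # from the current-status table

# The three tiers should account for every chunk in the freeze.
assert sum(tiers.values()) == chunks_total

avg_tokens_per_chunk = token_estimate / chunks_total
embedded_share = vectors_retained / chunks_total  # assumes one vector per chunk

print(f"average tokens per chunk: {avg_tokens_per_chunk:.1f}")
print(f"embedded share of clean corpus: {embedded_share:.2%}")
```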
Major families included: serviceable titles, top worked titles, semantic reviews, O*NET occupations/tasks, Hugging Face role signals, GDPval task rows, jobs-research roles/responsibilities/workflows/tools/affordances, evidence atlas layers, ALIPE/source docs, workflow seeds, and harvest/source-map references.
Live Gemini Proof
The successful live paid embedding run used Vertex ADC with project aina-495702, location us, model gemini-embedding-2, and 768 dimensions.
After the clean corpus freeze, the cleanup command intentionally requested zero new vectors:
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --limit 0 --workers 1 --write-every 1000 --selection-mode progressive --allow-live-gemini --confirm-paid-api --timeout-seconds 30 --max-retries 0
| Current receipt | Value |
|---|---|
| Status | pass |
| live_gemini_api_invoked | false |
| Existing clean vectors | 9,353 |
| New vectors in cleanup | 0 |
| Orphan vectors pruned | 647 |
| Mean vector norm | 1.0 |
| Similar/dissimilar cosine gap | 0.172978 |
| Gate 1 known pairs | 50 similar and 50 dissimilar |
| Quality gates | all pass |
| Production eligibility for current top-band proof | true |
This does not mean the full 302,234-chunk corpus is fully embedded. It means the clean-current vector layer is good enough for top-band/runtime proving without spending more tokens right now.
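The norm and cosine-gap figures in the receipt can be read against a sketch like the one below. It is an assumption about how such a gate could be computed, not the engine's actual gate: the gap is taken as mean cosine over known-similar pairs minus mean cosine over known-dissimilar pairs, and the function names are hypothetical. No pass threshold is shown because the handoff does not state one.

```python
import math


def norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def cosine(a, b):
    """Exact cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm(a) * norm(b))


def mean_norm(vectors):
    """Average vector length; 1.0 indicates unit-normalized embeddings."""
    return sum(norm(v) for v in vectors) / len(vectors)


def cosine_gap(similar_pairs, dissimilar_pairs):
    """Mean similar-pair cosine minus mean dissimilar-pair cosine (assumed definition)."""
    mean_similar = sum(cosine(a, b) for a, b in similar_pairs) / len(similar_pairs)
    mean_dissimilar = sum(cosine(a, b) for a, b in dissimilar_pairs) / len(dissimilar_pairs)
    return mean_similar - mean_dissimilar


# Toy 2-D vectors stand in for 768-dimensional embeddings.
similar = [([1.0, 0.0], [1.0, 0.0])]
dissimilar = [([1.0, 0.0], [0.0, 1.0])]
gap = cosine_gap(similar, dissimilar)
unit_check = mean_norm([[1.0, 0.0], [0.0, 1.0]])
print(gap, unit_check)
```

A positive gap means known-similar pairs score higher than known-dissimilar pairs on average, which is the minimum property exact-cosine retrieval needs before the vectors are trusted for ranking.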
Files And Receipts
- Corpus receipt: artifacts/validation/ain_506_production_semantic_embedding_corpus_v1.json
- Corpus report: artifacts/reports/ain_506_production_semantic_embedding_corpus_v1.md
- Corpus HTML: artifacts/reports/ain_506_production_semantic_embedding_corpus_v1.html
- Chunk table: artifacts/embeddings/production/chunks/schema_version=embedding_contract_v1/ain_506_production_semantic_embedding_corpus_v1.parquet
- Gemini manifests: artifacts/embeddings/production/gemini_batch_manifests/model=gemini-embedding-2/dim=768/schema_version=embedding_contract_v1/
- Live run receipt: artifacts/validation/ain_506_live_gemini_embedding_run_v1.json
- Vector table: artifacts/embeddings/production/vectors/model=gemini-embedding-2/dim=768/schema_version=embedding_contract_v1/ain_506_live_gemini_embedding_run_v1.parquet
- DuckDB: artifacts/embeddings/production/ain_506_live_gemini_embedding_run_v1.duckdb
Validation
uv run pytest tests/test_production_embeddings.py tests/test_embedding_contracts.py -q
uv run ruff check src/aina_data_engine/production_embeddings.py src/aina_data_engine/cli.py tests/test_production_embeddings.py tests/test_embedding_contracts.py
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
| Check | Result |
|---|---|
| Focused tests | 15 passed |
| Ruff | All checks passed |
| AIN-506 P0 gate | pass |
| Full validation | pass |
The P0 gate reports gemini_api_key_present: true, auth_mode: vertex_adc, vertex_project: aina-495702, vertex_location: us, live_api_enabled: true, paid_api_project_confirmed: true, runtime_embedding_authority_promoted: false, and live_gemini_api_invoked: false.
Parked Work
Do not submit the remaining clean corpus to Gemini just because manifests exist. Decide the expansion lane before the next run.
No-live verification:
cd /srv/aina/aina-data-engine-room
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
If Ali explicitly resumes embedding expansion, start bounded and clean:
cd /srv/aina/aina-data-engine-room
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --limit 25000 --workers 24 --write-every 1000 --selection-mode progressive --allow-live-gemini --confirm-paid-api --timeout-seconds 90 --max-retries 5
The runner resumes by (chunk_id, text_hash), prunes vectors outside the current corpus, and skips already-written vectors unless --force is passed.
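The resume behavior described above can be sketched as a pure planning step. This is an illustration under stated assumptions, not the runner's implementation: `plan_run` and its argument shapes are hypothetical, and the corpus is modeled as an ordered list of `(chunk_id, text_hash)` keys.

```python
def plan_run(corpus, existing, limit, force=False):
    """Decide which vectors to prune, keep, and embed for one run.

    corpus:   ordered list of (chunk_id, text_hash) keys in the current clean corpus
    existing: set of (chunk_id, text_hash) keys already in the vector table
    limit:    maximum number of new embeddings to request (0 means cleanup only)
    force:    when True, re-embed keys even if a vector already exists
    """
    corpus_keys = set(corpus)
    orphans = existing - corpus_keys  # vectors outside the current corpus
    kept = existing & corpus_keys     # vectors that still match a clean chunk
    pending = [key for key in corpus if force or key not in kept]
    return {
        "prune": sorted(orphans),
        "keep": sorted(kept),
        "embed": pending[:limit],
    }


corpus = [("c1", "h1"), ("c2", "h2"), ("c3", "h3")]
existing = {("c1", "h1"), ("c2", "old"), ("x", "h9")}

cleanup = plan_run(corpus, existing, limit=0)
resume = plan_run(corpus, existing, limit=10)
print(cleanup["prune"], resume["embed"])
```

Keying on the pair rather than on `chunk_id` alone means a chunk whose text changed is treated as an orphan plus a pending embed, so stale vectors are not silently reused. With `limit=0` the plan prunes but embeds nothing, which mirrors the zero-new-vector cleanup pass.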
Use the 9,353-vector proof now, treat the 302,234 clean chunks as the expansion backlog, and only spend more embedding calls after the source-family priority queue is explicit.