# Production Semantic Embedding Corpus Receipt

Status: `pass`
Created: `2026-06-13T15:04:58Z`
Model target: `gemini-embedding-2` at `768` dimensions

## Summary

This freezes the first production semantic corpus for the Personalization Engine.
It builds semantic documents from harvested source families instead of blindly
embedding raw rows and writes an AIN-506-compatible Parquet chunk table. Gemini
API-ready JSONL shards are emitted only for `eligible_only` corpora.

Raw public market rows are excluded by default. Market-job chunks only enter
when the operator passes an explicit positive cap, and only from fully parseable
structured CSV files; malformed files contribute zero rows until repaired
upstream.

The corpus uses Gemini Embedding 2 document-style text prefixes
(`title: ... | text: ...`) and preserves `task_type` as manifest metadata. The
actual API request sends `output_dimensionality: 768` and does not send the older
task-type field because Gemini Embedding 2 uses prompt formatting for retrieval
behavior.

## Metrics

- Chunks frozen: `294675`
- Token estimate: `27662302`
- Source families: `25`
- Top 1,000 ICP titles present: `1000`
- Top 500 hardening titles present: `500`
- Manifest shards: `0`
- Manifest position: No API-ready batch shards were emitted because this corpus was not frozen with `eligible_only=true`.
- Title-authority status counts: `{"derived_clean_title": 15264, "exact_trusted": 17071, "exact_weak": 28065, "missing": 28298}`
- Title-family chunks cleaned by prior authority: `15264`

## Source Families

| Family | Chunks |
| --- | ---: |
| affordance_pack | 6626 |
| ai_fluency_headless_loop | 48 |
| alipe_vision_doc | 32 |
| gdpval_task | 220 |
| harvest_source_map | 16 |
| hf_role_signal | 907 |
| iwa_evidence | 476 |
| jd_aware_role_context | 292 |
| jobs_research_ai_affordance | 89 |
| jobs_research_responsibility | 43196 |
| jobs_research_role | 6656 |
| jobs_research_tool | 73 |
| jobs_research_workflow | 89 |
| named_tool_authority | 20 |
| onet_occupation_evidence | 2828 |
| onet_task_evidence | 131095 |
| qualitative_corpus | 5 |
| realism_corpus | 230 |
| semantic_review | 42199 |
| serviceable_title | 45499 |
| top_worked_title | 1000 |
| workflow_ai_affordance | 3051 |
| workflow_intelligence | 3152 |
| workflow_seed | 6866 |
| workflow_tool_evidence | 10 |

## Runtime Position

This corpus is production-eligible input material. The Gemini vectors become
production-eligible retrieval authority only after the live embedding run passes
the quality gates over the top 1,000 ICP band and top 500 hardening band, with
deterministic fallback preserved.

---

Ali Mehdi Mukadam - co-authored with Codex