Production Chunk Vector Reconciliation
The old 294,315 number is the base corpus. The current authority is base plus repaired overlay.
The Single Idea
M1 should use base_chunks_parquet + repaired_chunks_glob as the chunk authority: 294675 base chunks plus 27844 repaired chunks equals 322519 current chunks.
322519 current chunks
6510 Gemini vectors
0.020185 coverage
Authority Decision
The embedding DuckDB chunk table is a cache/snapshot with 294675 rows. It is not the current corpus authority. The current vector authority remains the Gemini vector Parquet joined exactly to the combined chunk corpus.
Top Families
| Family | Chunks | Vectors | Coverage | Status |
|---|---|---|---|---|
| onet_task_evidence | 131095 | 0 | 0.0 | source_evidence |
| serviceable_title | 60100 | 3440 | 0.057238 | current_engine_room |
| semantic_review | 54686 | 500 | 0.009143 | current_engine_room |
| jobs_research_responsibility | 43196 | 0 | 0.0 | donor_clean |
| workflow_seed | 7277 | 0 | 0.0 | donor_clean |
| jobs_research_role | 6656 | 0 | 0.0 | donor_clean |
| affordance_pack | 6626 | 0 | 0.0 | source_evidence |
| workflow_intelligence | 3152 | 0 | 0.0 | source_evidence |
| workflow_ai_affordance | 3051 | 0 | 0.0 | source_evidence |
| onet_occupation_evidence | 2828 | 0 | 0.0 | source_evidence |
| top_worked_title | 1084 | 1000 | 0.922509 | current_engine_room |
| hf_role_signal | 907 | 826 | 0.910695 | source_evidence |
Checks
| Check | Status |
|---|---|
| base_chunk_count_matches_planning_checkpoint | PASS |
| combined_chunk_count_matches_ain_510 | PASS |
| embedding_duckdb_chunk_table_marked_snapshot_not_authority | PASS |
| embedding_duckdb_vectors_match_vector_parquet | PASS |
| family_rows_cover_combined_chunks | PASS |
| linkedin_jobs_count_verified | PASS |
| matched_vectors_match_ain_510 | PASS |
| no_live_gemini_api_invoked | PASS |
| public_runtime_unpromoted | PASS |
| repaired_overlay_explains_chunk_delta | PASS |
| repaired_overlay_has_no_base_overlap | PASS |
| runtime_source_authority_repair_jsonl_private_identifier_scan_passed | PASS |
| stale_vectors_absent | PASS |
| vector_rows_match_ain_510 | PASS |
# Production Chunk Vector Reconciliation
Status: `pass`
Created: `2026-06-13T15:24:17Z`
## The Single Idea
The authoritative M1 corpus is not the old base Parquet alone. It is the
base production semantic corpus plus the repaired overlay:
`294675` base chunks + `27844`
repaired chunks = `322519` current chunks.
## Corpus And Vector Counts
- Base frozen chunk count: `294675`
- Repaired overlay chunk count: `27844`
- Repaired overlap with base: `0`
- Current combined chunk count: `322519`
- Gemini vector rows: `6510`
- Vectors matched to current corpus: `6510`
- Stale vectors: `0`
- Unvectorized current chunks: `316009`
- Current vector coverage ratio: `0.020185`
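The derived figures in this list follow from the five input counts by simple arithmetic. The sketch below recomputes them as a sanity check; the variable names are illustrative and are not taken from the reconciliation tooling, and rounding the ratio to six decimals is an assumption based on how the report prints it.

```python
# Input counts copied from the report; names are hypothetical.
base_chunks = 294675
repaired_chunks = 27844
repaired_overlap_with_base = 0
gemini_vector_rows = 6510
matched_vectors = 6510

# Combined corpus: base plus overlay, minus any rows counted twice.
combined_chunks = base_chunks + repaired_chunks - repaired_overlap_with_base

# Vectors that no longer point at a current chunk.
stale_vectors = gemini_vector_rows - matched_vectors

# Current chunks with no matching vector.
unvectorized_chunks = combined_chunks - matched_vectors

# Coverage ratio, rounded to six decimals (assumed report precision).
coverage_ratio = round(matched_vectors / combined_chunks, 6)

print(combined_chunks)      # 322519
print(stale_vectors)        # 0
print(unvectorized_chunks)  # 316009
print(coverage_ratio)       # 0.020185
```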
## Authority Decision
Use `base_chunks_parquet + repaired_chunks_glob` as the chunk authority for
M1 and later source-family gates. The embedding DuckDB
`production_embedding_chunks` table has
`294675` rows and is retained as a
snapshot/cache, not as current corpus authority.
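A minimal sketch of what this authority rule implies, using plain Python sets in place of the real Parquet inputs. The function name `reconcile`, the chunk IDs, and the assumption that each vector row carries exactly one chunk ID are all hypothetical; the actual pipeline may join on different keys.

```python
from typing import Dict, Set


def reconcile(base_ids: Set[str], repaired_ids: Set[str],
              vector_ids: Set[str]) -> Dict[str, int]:
    """Combine base and repaired chunk IDs, then match vectors exactly."""
    # The overlay must not redefine any base chunk.
    overlap = base_ids & repaired_ids
    if overlap:
        raise ValueError(f"{len(overlap)} repaired chunks overlap the base")

    # Chunk authority = base corpus plus repaired overlay.
    current_ids = base_ids | repaired_ids

    # Exact join: a vector counts only if its chunk ID is in the authority.
    matched = vector_ids & current_ids
    stale = vector_ids - current_ids

    return {
        "current_chunks": len(current_ids),
        "matched_vectors": len(matched),
        "stale_vectors": len(stale),
        "unvectorized_chunks": len(current_ids - matched),
    }


# Toy data only; real chunk IDs will look different.
report = reconcile(
    base_ids={"b1", "b2", "b3"},
    repaired_ids={"r1"},
    vector_ids={"b1", "r1"},
)
print(report)
```

Run against only the 294675-row DuckDB snapshot, the same check would miss the repaired overlay chunks entirely, which is why the snapshot is not treated as the authority.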
## Top Source Families
- `onet_task_evidence`: `131095` chunks, `0` vectors, `0.0` coverage, status `source_evidence`
- `serviceable_title`: `60100` chunks, `3440` vectors, `0.057238` coverage, status `current_engine_room`
- `semantic_review`: `54686` chunks, `500` vectors, `0.009143` coverage, status `current_engine_room`
- `jobs_research_responsibility`: `43196` chunks, `0` vectors, `0.0` coverage, status `donor_clean`
- `workflow_seed`: `7277` chunks, `0` vectors, `0.0` coverage, status `donor_clean`
- `jobs_research_role`: `6656` chunks, `0` vectors, `0.0` coverage, status `donor_clean`
- `affordance_pack`: `6626` chunks, `0` vectors, `0.0` coverage, status `source_evidence`
- `workflow_intelligence`: `3152` chunks, `0` vectors, `0.0` coverage, status `source_evidence`
- `workflow_ai_affordance`: `3051` chunks, `0` vectors, `0.0` coverage, status `source_evidence`
- `onet_occupation_evidence`: `2828` chunks, `0` vectors, `0.0` coverage, status `source_evidence`
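Each coverage value above appears to be vectors divided by chunks, rounded to six decimals. The sketch below reproduces that for three of the listed families; the `coverage` helper and the tuple layout are illustrative assumptions, not the report generator's code.

```python
from typing import Dict, List, Tuple

# (family, chunks, vectors) copied from the list above.
families: List[Tuple[str, int, int]] = [
    ("onet_task_evidence", 131095, 0),
    ("serviceable_title", 60100, 3440),
    ("semantic_review", 54686, 500),
]


def coverage(chunks: int, vectors: int) -> float:
    """Vectors per chunk, rounded to six decimals; 0.0 for an empty family."""
    if chunks == 0:
        return 0.0
    return round(vectors / chunks, 6)


coverage_by_family: Dict[str, float] = {
    name: coverage(chunks, vectors) for name, chunks, vectors in families
}

for name, ratio in coverage_by_family.items():
    print(f"{name}: {ratio}")
```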
## Checks
- PASS `base_chunk_count_matches_planning_checkpoint`
- PASS `combined_chunk_count_matches_ain_510`
- PASS `embedding_duckdb_chunk_table_marked_snapshot_not_authority`
- PASS `embedding_duckdb_vectors_match_vector_parquet`
- PASS `family_rows_cover_combined_chunks`
- PASS `linkedin_jobs_count_verified`
- PASS `matched_vectors_match_ain_510`
- PASS `no_live_gemini_api_invoked`
- PASS `public_runtime_unpromoted`
- PASS `repaired_overlay_explains_chunk_delta`
- PASS `repaired_overlay_has_no_base_overlap`
- PASS `runtime_source_authority_repair_jsonl_private_identifier_scan_passed`
- PASS `stale_vectors_absent`
- PASS `vector_rows_match_ain_510`
## Next Move
Build `source_authority_registry_v2` from this combined corpus baseline, then
run source-family eligibility before any further Gemini embedding scale-up.
---
Ali Mehdi Mukadam - co-authored with Codex
```yaml
topics:
- personalization-engine
- semantic-embeddings
- data-authority
subtopics:
- chunk-reconciliation
- vector-coverage
- source-family-gates
```