AINA Data Engine Room · Chunk Authority · 2026-06-13

Production Chunk Vector Reconciliation

The old 294,675 number is the base corpus only. The current authority is the base corpus plus the repaired overlay.

The Single Idea

M1 should use base_chunks_parquet + repaired_chunks_glob as the chunk authority: 294675 base chunks plus 27844 repaired chunks equals 322519 current chunks.

322519 current chunks
6510 Gemini vectors
0.020185 coverage
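The arithmetic behind these three figures can be sketched as follows. This is a minimal illustration, not the pipeline's code: the function names are hypothetical, and it assumes the repaired overlay only adds rows to the base corpus and never replaces them.

```python
# Counts copied from the report above.
BASE_CHUNKS = 294_675
REPAIRED_CHUNKS = 27_844
GEMINI_VECTORS = 6_510


def combined_chunk_count(base: int, repaired: int) -> int:
    """Current corpus size, assuming the overlay adds rows and replaces none."""
    return base + repaired


def coverage(vectors: int, chunks: int) -> float:
    """Fraction of chunks with a matched vector, rounded to six places."""
    if chunks == 0:
        return 0.0
    return round(vectors / chunks, 6)


current_chunks = combined_chunk_count(BASE_CHUNKS, REPAIRED_CHUNKS)
print(current_chunks)                            # 322519
print(coverage(GEMINI_VECTORS, current_chunks))  # 0.020185
```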

Authority Decision

The embedding DuckDB chunk table is a cache/snapshot with 294675 rows. It is not the current corpus authority. The current vector authority remains the Gemini vector Parquet joined exactly to the combined chunk corpus.
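One way to picture "joined exactly" is a set-based join on chunk identifiers. The sketch below is an assumption about the shape of that join, not the actual implementation: the function names and the toy ids are invented for illustration, and the real join presumably runs over the Parquet files rather than in-memory sets.

```python
from __future__ import annotations

from typing import Iterable


def combine_corpus(base_ids: Iterable[str], repaired_ids: Iterable[str]) -> set[str]:
    """Union base and repaired chunk ids, refusing any overlap between the two."""
    base, repaired = set(base_ids), set(repaired_ids)
    overlap = base & repaired
    if overlap:
        raise ValueError(f"{len(overlap)} repaired ids already exist in base")
    return base | repaired


def join_vectors(corpus_ids: set[str], vector_ids: Iterable[str]) -> tuple[set[str], set[str]]:
    """Exact-match join: return (matched, stale) vector ids against the corpus."""
    vectors = set(vector_ids)
    return vectors & corpus_ids, vectors - corpus_ids


# Toy ids, invented for illustration only.
corpus = combine_corpus(["b1", "b2", "b3"], ["r1"])
matched, stale = join_vectors(corpus, ["b2", "r1"])
print(len(corpus), sorted(matched), sorted(stale))  # 4 ['b2', 'r1'] []
```

A vector whose id is missing from the combined corpus lands in the stale set, which is why a snapshot table with only the base rows cannot serve as the join target: vectors for repaired chunks would look stale against it.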

Top Families

Family                        Chunks  Vectors  Coverage  Status
onet_task_evidence            131095        0  0.0       source_evidence
serviceable_title              60100     3440  0.057238  current_engine_room
semantic_review                54686      500  0.009143  current_engine_room
jobs_research_responsibility   43196        0  0.0       donor_clean
workflow_seed                   7277        0  0.0       donor_clean
jobs_research_role              6656        0  0.0       donor_clean
affordance_pack                 6626        0  0.0       source_evidence
workflow_intelligence           3152        0  0.0       source_evidence
workflow_ai_affordance          3051        0  0.0       source_evidence
onet_occupation_evidence        2828        0  0.0       source_evidence
top_worked_title                1084     1000  0.922509  current_engine_room
hf_role_signal                   907      826  0.910695  source_evidence
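The coverage column can be recomputed from the chunk and vector counts as a sanity check. The sketch below copies the rows from the table above and assumes coverage is vectors divided by chunks, rounded to six places; the helper name is hypothetical.

```python
# (family, chunks, vectors, reported_coverage) copied from the Top Families table.
TOP_FAMILIES = [
    ("onet_task_evidence", 131095, 0, 0.0),
    ("serviceable_title", 60100, 3440, 0.057238),
    ("semantic_review", 54686, 500, 0.009143),
    ("jobs_research_responsibility", 43196, 0, 0.0),
    ("workflow_seed", 7277, 0, 0.0),
    ("jobs_research_role", 6656, 0, 0.0),
    ("affordance_pack", 6626, 0, 0.0),
    ("workflow_intelligence", 3152, 0, 0.0),
    ("workflow_ai_affordance", 3051, 0, 0.0),
    ("onet_occupation_evidence", 2828, 0, 0.0),
    ("top_worked_title", 1084, 1000, 0.922509),
    ("hf_role_signal", 907, 826, 0.910695),
]


def family_coverage(chunks: int, vectors: int) -> float:
    """Vectors per chunk for one family, rounded to six places."""
    return round(vectors / chunks, 6) if chunks else 0.0


# Families whose recomputed coverage disagrees with the reported value.
mismatches = [
    name
    for name, chunks, vectors, reported in TOP_FAMILIES
    if family_coverage(chunks, vectors) != reported
]
top_chunks = sum(row[1] for row in TOP_FAMILIES)
top_vectors = sum(row[2] for row in TOP_FAMILIES)

print(mismatches)               # []
print(top_chunks, top_vectors)  # 320658 5766
```

The listed families account for 320658 of the 322519 chunks and 5766 of the 6510 vectors, so the remainder sits in families not shown in this top-families view.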

Checks

Check                                                                 Status
base_chunk_count_matches_planning_checkpoint                          PASS
combined_chunk_count_matches_ain_510                                  PASS
embedding_duckdb_chunk_table_marked_snapshot_not_authority            PASS
embedding_duckdb_vectors_match_vector_parquet                         PASS
family_rows_cover_combined_chunks                                     PASS
linkedin_jobs_count_verified                                          PASS
matched_vectors_match_ain_510                                         PASS
no_live_gemini_api_invoked                                            PASS
public_runtime_unpromoted                                             PASS
repaired_overlay_explains_chunk_delta                                 PASS
repaired_overlay_has_no_base_overlap                                  PASS
runtime_source_authority_repair_jsonl_private_identifier_scan_passed  PASS
stale_vectors_absent                                                  PASS
vector_rows_match_ain_510                                             PASS
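A few of these checks reduce to simple predicates over counts. The sketch below shows one possible shape for three of them. The `Snapshot` fields, the predicate definitions, and the runner are assumptions made for illustration; the report does not say how the checks are implemented. The overlap and stale counts are taken to be zero only because the report marks those checks PASS.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical container for the counts a reconciliation run would collect."""
    base_chunks: int
    repaired_chunks: int
    combined_chunks: int
    base_overlap_rows: int  # repaired rows whose id also appears in base
    stale_vectors: int      # vectors with no chunk in the combined corpus


# Check names come from the report; the predicates are assumed definitions.
CHECKS: dict[str, Callable[[Snapshot], bool]] = {
    "repaired_overlay_explains_chunk_delta":
        lambda s: s.combined_chunks - s.base_chunks == s.repaired_chunks,
    "repaired_overlay_has_no_base_overlap":
        lambda s: s.base_overlap_rows == 0,
    "stale_vectors_absent":
        lambda s: s.stale_vectors == 0,
}


def run_checks(snapshot: Snapshot) -> dict[str, str]:
    """Evaluate every check and map its name to PASS or FAIL."""
    return {
        name: "PASS" if check(snapshot) else "FAIL"
        for name, check in sorted(CHECKS.items())
    }


report = run_checks(Snapshot(294_675, 27_844, 322_519, 0, 0))
for name, status in report.items():
    print(f"{name:<40}{status}")
```

Keeping each check as a named predicate makes a FAIL traceable to one condition, which is useful if the counts drift after a further overlay repair.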

Where to start: build source_authority_registry_v2 from this combined corpus baseline before any further Gemini scale-up.