# Production Source Authority Inventory

Status: `pass`

Created: `2026-06-13T15:03:25Z`

## The Single Idea

Before AINA pays for more Gemini embeddings, the engine must ask: has this cleanup already been done somewhere else?

This inventory turns donor repos into a build-time source map so agents reuse cleaned title, role, responsibility, workflow, tool, affordance, and company/employer evidence instead of recreating weaker cleanup rules.

## Metrics

- Entity authority assets: `46`
- Present entity assets: `45`
- Company/employer assets inventoried: `5`
- Present company/employer assets: `5`
- Trusted jobs-research titles: `15104`
- Clean candidate rows: `44440`

## Entity Authority Inventory

| Asset | Status | Entity types | Embedding gate | Path |
| --- | --- | --- | --- | --- |
| `current_repo_canonical_title_aliases` | `present` | title_alias, canonical_title | `qa_reference_or_repair_source` | `/srv/aina/aina-data-engine-room/evidence/canonical/local_title_aliases.parquet` |
| `current_repo_canonical_role_responsibilities` | `present` | role, responsibility | `repair_first_then_progressive` | `/srv/aina/aina-data-engine-room/evidence/canonical/role_responsibility_evidence.parquet` |
| `current_repo_canonical_workflow_intelligence` | `present` | workflow, dwa_alignment, task | `repair_first_then_progressive` | `/srv/aina/aina-data-engine-room/evidence/canonical/workflow_intelligence.parquet` |
| `current_repo_canonical_workflow_ai_affordance` | `present` | workflow, ai_affordance, risk_boundary | `tier_3_after_relevance_proof` | `/srv/aina/aina-data-engine-room/evidence/canonical/workflow_ai_affordance.parquet` |
| `current_repo_canonical_affordance_packs` | `present` | role, workflow, ai_affordance, coverage | `tier_3_after_relevance_proof` | `/srv/aina/aina-data-engine-room/evidence/canonical/affordance_packs_v1.parquet` |
| `current_repo_canonical_task_evidence` | `present` | onet_task, task_statement | `repair_first_then_progressive` | `/srv/aina/aina-data-engine-room/evidence/canonical/task_evidence.parquet` |
| `current_repo_canonical_occupation_evidence` | `present` | onet_occupation, occupation_signal | `repair_first_then_progressive` | `/srv/aina/aina-data-engine-room/evidence/canonical/occupation_evidence.parquet` |
| `current_repo_canonical_workflow_tool_evidence` | `present` | workflow, tool, tool_evidence | `repair_first_then_progressive` | `/srv/aina/aina-data-engine-room/evidence/canonical/workflow_tool_evidence.parquet` |
| `engine_room_named_tool_source_authority` | `present` | tool, workflow, role_context, source_authority | `progressive_only_after_named_tool_receipt_passes` | `/srv/aina/aina-data-engine-room/artifacts/validation/named_tool_source_authority_v1.json` |
| `current_repo_canonical_iwa_evidence` | `present` | iwa, work_activity, onet_task | `repair_first_then_progressive` | `/srv/aina/aina-data-engine-room/evidence/canonical/iwa_evidence.parquet` |
| `current_repo_canonical_realism_corpus` | `present` | evaluator_output, realism_signal | `tier_3_after_relevance_proof` | `/srv/aina/aina-data-engine-room/evidence/canonical/realism_corpus.parquet` |
| `current_repo_canonical_qualitative_corpus` | `present` | qualitative_evidence, evaluator_output | `tier_3_after_relevance_proof` | `/srv/aina/aina-data-engine-room/evidence/canonical/qualitative_corpus.parquet` |
| `track1_role_curriculum_manifest` | `present` | curriculum, role_lesson, platform_alignment | `progressive_only_after_document_spot_check` | `/srv/aina/aina-data-engine-room/docs/source_foundations/aina-curriculum/track1_role_curriculum_v1/MANIFEST.md` |
| `track1_role_curriculum_platform_index` | `present` | curriculum, role_lesson, platform_index | `progressive_only_after_document_spot_check` | `/srv/aina/aina-data-engine-room/docs/source_foundations/aina-curriculum/track1_role_curriculum_v1/PLATFORM_INDEX.jsonl` |
| `engine_room_hf_role_signal_map` | `present` | role, economic_signal, source_map | `embed_now_after_eligibility` | `/srv/aina/aina-data-engine-room/artifacts/derived/huggingface/hf_role_signal_map.jsonl` |
| `engine_room_gdpval_task_map` | `present` | task, benchmark_scenario, soc_mapping | `embed_now_after_eligibility` | `/srv/aina/aina-data-engine-room/artifacts/derived/huggingface/gdpval_task_map.jsonl` |
| `engine_room_economic_index_role_map` | `present` | role, economic_signal, soc_mapping | `qa_reference_or_progressive_after_spot_check` | `/srv/aina/aina-data-engine-room/artifacts/derived/huggingface/economic_index_role_map.jsonl` |
| `engine_room_economic_index_legacy_signal_map` | `present` | role, legacy_signal, economic_signal | `qa_reference_only` | `/srv/aina/aina-data-engine-room/artifacts/derived/huggingface/economic_index_legacy_signal_map.jsonl` |
| `engine_room_hf_ingest_summary` | `present` | source_manifest, hf_ingest_receipt | `not_embedding_text` | `/srv/aina/aina-data-engine-room/artifacts/derived/huggingface/hf_ingest_summary.json` |
| `jobs_research_audience_titles_enriched_audit` | `present` | title, soc_mapping, canonical_role | `embed_now_after_eligibility_or_repair` | `/home/ali/conductor/repos/aina-jobs-research/project-summary-package/organized-outputs/02-audience-title-and-onet-workflow-pilot/outputs/outputs/audience_titles_enriched_audit.csv` |
| `engine_room_top_worked_title_icp_serviceable_1000` | `present` | title, icp_band, runtime_action | `already_progressive_safe_when_eligible` | `/srv/aina/aina-data-engine-room/artifacts/validation/top_worked_title_readiness_v1_icp_serviceable_top_1000.jsonl` |
| `engine_room_top_worked_title_raw_1000` | `present` | title, market_rank, visibility_band | `qa_reference_only` | `/srv/aina/aina-data-engine-room/artifacts/validation/top_worked_title_readiness_v1_raw_top_1000.jsonl` |
| `engine_room_top_worked_title_semantic_sample_50` | `present` | title, semantic_sample, qa_evidence | `qa_reference_only` | `/srv/aina/aina-data-engine-room/artifacts/validation/top_worked_title_readiness_v1_semantic_sample_50.jsonl` |
| `engine_room_ai_fluency_capability_map` | `present` | ai_fluency_capability, capability_map, proof_contract | `progressive_only_after_capability_source_spot_check` | `/srv/aina/aina-data-engine-room/artifacts/validation/ai_fluency_capability_map_v0.json` |
| `engine_room_ai_fluency_capability_map_jsonl` | `present` | ai_fluency_capability, workflow_capability_requirement, capability_observation, proof_artifact_index | `progressive_only_after_capability_source_spot_check` | `/srv/aina/aina-data-engine-room/artifacts/validation/ai_fluency_capability_map_v0.jsonl` |
| `engine_room_progressive_workflow_assessment_seed` | `present` | assessment_seed, capability_observation, workflow_gap | `not_embedding_text_until_privacy_safe_derived_rows_exist` | `/srv/aina/aina-data-engine-room/artifacts/validation/progressive_workflow_assessment_seed_v1.json` |
| `engine_room_ai_fluency_top_band_capability_coverage` | `present` | title, ai_fluency_capability, coverage_receipt | `qa_reference_only` | `/srv/aina/aina-data-engine-room/artifacts/validation/ai_fluency_top_band_capability_coverage_v1.json` |
| `engine_room_jd_aware_role_context_evidence` | `present` | role_context_evidence, linkedin_job_trace, tool_context, risk_context | `progressive_only_after_raw_jd_key_and_semantic_spot_check` | `/srv/aina/aina-data-engine-room/artifacts/validation/jd_aware_role_context_evidence_v1.jsonl` |
| `engine_room_ai_fluency_headless_loop` | `present` | ai_fluency_capability_map, capability_observation, proof_artifact_index, role_context_loop | `progressive_only_after_50_row_loop_gate` | `/srv/aina/aina-data-engine-room/artifacts/validation/ai_fluency_headless_loop_v1.jsonl` |
| `engine_room_serviceable_title_progressive_3000_receipt` | `present` | embedding_receipt, serviceable_title, quality_gate | `qa_reference_only_until_sampler_attention_resolved` | `/srv/aina/aina-data-engine-room/artifacts/validation/ain_506_serviceable_title_progressive_live_3000_v1.json` |
| `jobs_research_combined_clean_evidence_candidates_pass_draft` | `present` | title, responsibility, workflow, evidence_candidate | `progressive_only_after_50_row_semantic_spot_check` | `/home/ali/conductor/repos/aina-jobs-research/project-summary-package/organized-outputs/03-combined-source-intelligence-candidate-layer/outputs/outputs/combined_clean_evidence_candidates_v1.pass_draft.jsonl` |
| `jobs_research_combined_source_intelligence_inventory` | `present` | source_inventory, lineage | `not_embedding_text` | `/home/ali/conductor/repos/aina-jobs-research/project-summary-package/organized-outputs/03-combined-source-intelligence-candidate-layer/outputs/outputs/combined_source_intelligence_inventory_2026-05-07.json` |
| `jobs_research_clean_candidate_vector_index_benchmark` | `present` | retrieval_benchmark, clean_candidate_index | `qa_reference_only` | `/home/ali/conductor/repos/aina-jobs-research/project-summary-package/organized-outputs/04-clean-index-and-ruvector-benchmark/outputs/outputs/clean_candidate_index_benchmark_v1.json` |
| `jobs_research_direct_repair_status` | `present` | repair_note, responsibility | `repair_first_reference` | `/home/ali/conductor/repos/aina-jobs-research/project-summary-package/organized-outputs/06-llm-review-consensus-and-direct-repair/reports/DIRECT_REPAIR_HAIKU_BATCH_STATUS_2026-05-07.md` |
| `jobs_research_workflow_taxonomy_repair` | `missing` | workflow, taxonomy_repair | `repair_first_reference` | `/home/ali/conductor/repos/aina-jobs-research/project-summary-package/organized-outputs/07-workflow-seed-dwa-alignment-and-taxonomy-repair/exports/workflow_taxonomy_repair_suggestions_v1.jsonl` |
| `jobs_research_source_intelligence_manifest` | `present` | role, responsibility, workflow, tool, ai_affordance | `repair_first_until_family_spot_check_passes` | `/home/ali/conductor/repos/aina-jobs-research/project-summary-package/exports/source_intelligence_v1/manifest.json` |
| `jobs_research_review_csv_funnel_index` | `present` | role, responsibility, workflow, review_funnel | `not_embedding_text` | `/home/ali/conductor/repos/aina-jobs-research/project-summary-package/exports/source_intelligence_v1_review_csv_funnel/00_FUNNEL_REVIEW_INDEX.csv` |
| `jobs_research_workflow_seed_catalog` | `present` | workflow_seed, dwa_alignment | `repair_first_then_progressive` | `/home/ali/conductor/repos/aina-jobs-research/project-summary-package/exports/source_intelligence_v1_review_csv_funnel/05_atomic_workflow_seed_catalog_6866.csv` |
| `jobs_research_ai_affordances_by_workflow` | `present` | workflow, ai_affordance, safety_boundary | `tier_3_after_relevance_proof` | `/home/ali/conductor/repos/aina-jobs-research/project-summary-package/organized-outputs/08-ai-capability-trust-and-founder-reporting/outputs/outputs/ai_affordances_by_workflow_v1.jsonl` |
| `aina_core_evidence_atlas_title_aliases` | `present` | title_alias, canonical_title | `qa_reference_or_repair_source` | `/home/ali/conductor/repos/aina-core/evidence/canonical/local_title_aliases.parquet` |
| `aina_core_evidence_atlas_responsibilities` | `present` | responsibility, role | `repair_first_then_progressive` | `/home/ali/conductor/repos/aina-core/evidence/canonical/role_responsibility_evidence.parquet` |
| `aina_core_evidence_atlas_tasks` | `present` | onet_task, task_statement | `repair_first_then_progressive` | `/home/ali/conductor/repos/aina-core/evidence/canonical/task_evidence.parquet` |
| `historical_source_intelligence_title_alias_decisions` | `present` | title_alias, decision_layer | `qa_reference_only` | `/home/ali/Personalization Engine/personalization-engine-aina/data/processed/source_intelligence_alpha/title_alias_decisions_v2.jsonl` |
| `historical_source_intelligence_role_mapping_decisions` | `present` | role_mapping, decision_layer | `qa_reference_only` | `/home/ali/Personalization Engine/personalization-engine-aina/data/processed/source_intelligence_alpha/role_mapping_decisions_v1.jsonl` |
| `historical_bucket_canonicals` | `present` | bucket, canonical_role, function | `qa_reference_only` | `/home/ali/Personalization Engine/personalization-engine-aina/data/processed/bucket_canonicals/bucket_canonicals_v1.jsonl` |
| `alipe_and_engine_room_vision_docs` | `present` | vision_doc, learning_graph, curriculum_architecture | `already_progressive_safe_when_chunked` | `/srv/aina/aina-data-engine-room/docs/source_foundations/ainpe-files-shared` |

## Company And Employer Inventory

Company/employer evidence exists, but it is not yet promoted as embedding authority. It must first be converted into clean derived artifacts with role/title truth separated from employer context.

| Asset | Status | Authority | Embedding gate | Path |
| --- | --- | --- | --- | --- |
| `market_v2_company_context_registry` | `quarantined_reference_present` | `quarantined_reference_only` | `blocked_until_clean_derived_artifact` | `/home/ali/Personalization Engine/personalization-engine-aina/corrupt_data_archive/2026-05-04/data/processed/market_v2/company_context_registry_v2.jsonl` |
| `market_v2_reference_employer_postings` | `quarantined_reference_present` | `quarantined_reference_only` | `blocked_until_clean_derived_artifact` | `/home/ali/Personalization Engine/personalization-engine-aina/corrupt_data_archive/2026-05-04/data/processed/market_v2/reference_employer_postings_v2.jsonl` |
| `market_v2_company_context_registry_csv_export` | `quarantined_reference_present` | `quarantined_reference_only` | `blocked_until_clean_derived_artifact` | `/home/ali/Personalization Engine/personalization-engine-aina/corrupt_data_archive/2026-05-04/data/processed/market_v2/csv_exports/csv_exports/company_context_registry_v2.csv` |
| `market_v2_reference_employer_postings_csv_export` | `quarantined_reference_present` | `quarantined_reference_only` | `blocked_until_clean_derived_artifact` | `/home/ali/Personalization Engine/personalization-engine-aina/corrupt_data_archive/2026-05-04/data/processed/market_v2/csv_exports/csv_exports/reference_employer_postings_v2.csv` |
| `personalization_engine_market_v2_audit_script` | `present` | `audit_tooling_only` | `not_embedding_text` | `/home/ali/Personalization Engine/personalization-engine-aina/tools/audit_market_v2.py` |

## Source-Family Order

| Phase | Source families | Rule |
| ---: | --- | --- |
| 1 | `current_repo_canonical_evidence`, `top_worked_title`, `jd_aware_role_context`, `ai_fluency_headless_loop`, `serviceable_title`, `semantic_review`, `hf_role_signal`, `gdpval_task`, `alipe_vision_doc`, `harvest_source_map` | embed only rows already eligible or repaired from trusted source authority; semantic_review must pass a 50-row spot check before live calls |
| 2 | `jobs_research_role`, `jobs_research_responsibility`, `workflow_seed`, `onet_task_evidence`, `onet_occupation_evidence` | reuse jobs-research/evidence-atlas repairs first, then derive clean chunks; do not embed generic titles or label-only classifications |
| 3 | `jobs_research_tool`, `named_tool_authority`, `jobs_research_ai_affordance`, `affordance_pack`, `workflow_ai_affordance`, `workflow_intelligence`, `workflow_tool_evidence`, `iwa_evidence`, `realism_corpus`, `qualitative_corpus` | embed after relevance proof and placeholder cleanup; keep labels metadata-only when confidence is doubtful |

## Stepwise Lane

1. `source-authority-start-here`
2. `production-source-authority-inventory`
3. `production-embedding-eligibility --source-family <family>`
4. `production-embedding-repair-queue --source-family <family>`
5. `production-embedding-repaired-corpus --source-family <family>`
6. `gemini-embedding-run --source-family <family> --dry-run --max-new 500`
7. 50-row semantic spot check from learner/tutor/evaluator/platform/provenance perspectives
8. `gemini-embedding-run --source-family <family> --include-repaired --max-new 500 --allow-live-gemini --confirm-paid-api`
9. post-live exact-cosine and mismatch QA
10. 5k, then 25k, then batch only when the family stays clean

## Contract

- Labels are routing metadata, not truth.
- Good text with doubtful labels can be embedded only after the labels are excluded from embedding text authority.
- Raw market rows, malformed parser rows, and quarantined company/employer registries stay blocked until converted into clean derived chunks.
- Batch is a reward for progressive proof, not a shortcut around semantic QA.

---

Ali Mehdi Mukadam - co-authored with Codex
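
As an illustration of how the embedding gates in the inventory tables and the ramp in the Stepwise Lane might be checked in code, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Asset` class, the `NEVER_EMBED_PREFIXES` grouping, and the helpers `is_embedding_candidate` and `next_max_new` are hypothetical names, not part of the engine's CLI or codebase. Only the gate strings and the 500 / 5k / 25k caps come from this document.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: gates starting with these prefixes never reach the embedder.
# This grouping is illustrative and is not taken from the engine's code.
NEVER_EMBED_PREFIXES = (
    "not_embedding_text",
    "qa_reference_only",
    "blocked_until",
)

# Caps from the Stepwise Lane: 500 first, then 5k, then 25k, then batch.
RAMP = (500, 5_000, 25_000)


@dataclass(frozen=True)
class Asset:
    name: str
    status: str  # e.g. "present" or "missing"
    gate: str    # value of the "Embedding gate" column


def is_embedding_candidate(asset: Asset) -> bool:
    """True if the asset is present and its gate does not rule out embedding.

    A True result only means the asset may proceed to its gate's own checks
    (eligibility, repair, spot check); it is not permission to embed.
    """
    if asset.status != "present":
        return False
    return not asset.gate.startswith(NEVER_EMBED_PREFIXES)


def next_max_new(current: int, family_clean: bool) -> Optional[int]:
    """Return the next cap in the ramp, or None once batch is unlocked.

    Assumption: a family that is not clean holds at its current cap.
    """
    if not family_clean:
        return current
    larger = [cap for cap in RAMP if cap > current]
    return larger[0] if larger else None
```

Under these assumptions, `engine_room_hf_role_signal_map` (`present`, `embed_now_after_eligibility`) counts as a candidate, while `jobs_research_workflow_taxonomy_repair` (`missing`) and every quarantined company/employer asset do not. Gates such as `qa_reference_or_repair_source` are ambiguous under a prefix rule, so a real implementation would likely need an explicit per-gate mapping.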