Source Authority Start Here
A deterministic boot sequence so agents consume the known source estate before spending tokens rediscovering it.
The Single Idea
This artifact turns source memory into an executable receipt: read these anchors, respect these traps, and continue from the latest embedding checkpoint.
Serviceable-title vectors: 3440
Top-worked vectors: 1000
Total Gemini vectors: 6507
Corpus chunks: 294672
Repair-first rows: 231105
01 Source
Read the authority registry and title ledger first.
02 Repair
Use jobs-research and evidence atlas layers to clean derived chunks.
03 Embed
Run Gemini progressively; batch waits for clean source-family proof.
Must-Read Anchors
| Artifact | Status | Purpose | Path |
|---|---|---|---|
| production_source_authority_registry | present | build-time source authority and anti-regression title cleanup map | /srv/aina/aina-data-engine-room/artifacts/validation/production_source_authority_registry_v1.json |
| production_source_authority_inventory | present | cleaned-asset inventory for titles, companies, roles, responsibilities, workflows, tools, affordances, and evidence before embedding | /srv/aina/aina-data-engine-room/artifacts/validation/production_source_authority_inventory_v1.json |
| title_ledger | present | title precedence, count reconciliation, and export contract | /srv/aina/aina-data-engine-room/docs/TITLE-LEDGER.md |
| mapping_chain_ledger | present | title to role to workflow to capability join topology | /srv/aina/aina-data-engine-room/docs/MAPPING-CHAIN-LEDGER.md |
| cross_repo_salvage_map | present | prior-work navigation map; adapt before reinventing | /home/ali/conductor/aina-consolidated/20-references/linear/doc__cross-repo-salvage-map.md |
| jobs_research_source_map | present | jobs-research source-to-output lineage | /home/ali/conductor/repos/aina-jobs-research/project-summary-package/file-index/SOURCE_MAP.md |
| jobs_research_validation_report | present | export row counts and manifest-hash validation | /home/ali/conductor/repos/aina-jobs-research/project-summary-package/exports/source_intelligence_v1/validation_report.md |
| ain_core_stitch_ledger | present | stitched current copy of evidence atlas and source packages | /home/ali/conductor/repos/aina-core/STITCH-LEDGER.md |
| production_embedding_corpus_receipt | present | frozen semantic chunk corpus and batch manifest receipt | /srv/aina/aina-data-engine-room/artifacts/validation/ain_506_production_semantic_embedding_corpus_v1.json |
| production_embedding_eligibility_ledger | present | source-family eligibility and repair-first routing | /srv/aina/aina-data-engine-room/artifacts/validation/production_embedding_eligibility_v1.json |
| latest_serviceable_embedding_checkpoint | present | latest progressive serviceable-title Gemini checkpoint | /srv/aina/aina-data-engine-room/artifacts/validation/ain_506_serviceable_title_progressive_live_3000_v1.json |
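
A minimal sketch of a presence check over the anchors above, assuming they are plain files on the local filesystem. The helper name `missing_anchors` and the two-entry `ANCHORS` sample are illustrative and not part of `aina-data-engine`; the paths are copied from the table.

```python
from pathlib import Path

# Two anchors copied from the table above; extend with the remaining rows as needed.
ANCHORS = {
    "production_source_authority_registry": "/srv/aina/aina-data-engine-room/artifacts/validation/production_source_authority_registry_v1.json",
    "title_ledger": "/srv/aina/aina-data-engine-room/docs/TITLE-LEDGER.md",
}


def missing_anchors(anchors: dict[str, str]) -> list[str]:
    """Return the names of anchors whose path does not exist, sorted by name."""
    return sorted(name for name, path in anchors.items() if not Path(path).exists())


if __name__ == "__main__":
    missing = missing_anchors(ANCHORS)
    if missing:
        print("missing anchors:", ", ".join(missing))
    else:
        print("all listed anchors present")
```

A non-empty result is the situation in which re-scanning the estate is justified; otherwise the anchors can be read in order.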
Hidden Gems
| Source | Why It Matters | Path |
|---|---|---|
| jobs-research source intelligence export | Role buckets, responsibilities, workflow seeds, workflows, tools, and affordances already exist as a validated source package. | /home/ali/conductor/repos/aina-jobs-research/project-summary-package/exports/source_intelligence_v1/manifest.json |
| jobs-research clean candidate pass-draft layer | Use as clean evidence substrate before embedding or repairing noisy title rows. | /home/ali/conductor/repos/aina-jobs-research/project-summary-package/organized-outputs/03-combined-source-intelligence-candidate-layer/outputs/outputs/combined_clean_evidence_candidates_v1.pass_draft.jsonl |
| aina-core evidence atlas canonical parquets | Current stitched title alias, responsibility, task, workflow, tool, realism, and affordance evidence layers. | /home/ali/conductor/repos/aina-core/evidence/canonical |
| ALIPE LinkedIn and Indeed chunks | Powerful market substrate already referenced by packets, but it must enter only through clean derived chunks. | /home/ali/ALIPE/data/jobs/linkedin_indeed_clusters_v1/chunks_250k |
| personalization inventory | Layer and sample inventory for O*NET, jobs CSVs, review artifacts, runtime samples, and historical data quality ranking. | /home/ali/Personalization Engine/aina-inventory |
Routing Traps
| Trap | Routing Rule |
|---|---|
| Docs shell versus live engine | Never treat aina-personalization-engine as live engine code; use personalization-engine-aina or the engine-room/current core ledgers. |
| Raw market rows | Do not embed raw marketplace labels as semantic truth; derive clean chunks first. |
| Doubtful labels on good text | Preserve labels in metadata and exclude them from embedding text authority. |
| Context-blind responsibility classifier | Treat k2_classified_v1 as broken for serving until rerun with role context. |
| Old status labels | Do not skip prior work because it says reference, not_promoted, or old review language; inspect lineage and consume clean derived layers. |
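
The label-handling traps above can be sketched as a record builder that keeps marketplace labels out of the text that gets embedded. This is only an illustration: `build_embedding_record` and the field names `text`, `metadata`, and `source_labels` are assumptions, not the engine's actual schema, and the sample row is made up.

```python
def build_embedding_record(chunk_text: str, labels: dict[str, str]) -> dict:
    """Keep labels as routing metadata; only the clean chunk text is embedded."""
    return {
        # The only field that would be sent to the embedder.
        "text": chunk_text.strip(),
        # Labels are preserved for routing and audit, never embedded.
        "metadata": {"source_labels": dict(labels)},
    }


# Made-up sample row for illustration only.
record = build_embedding_record(
    "  Maintains payroll records and reconciles monthly ledgers.  ",
    {"marketplace_title": "Rockstar Finance Ninja"},
)
print(record["text"])
```

Keeping the label in `metadata` means a doubtful title can still be inspected or repaired later without having influenced the vector.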
Where to start: run `uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-start-here`, then continue from the latest valid embedding checkpoint instead of re-scanning the whole estate.
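
Continuing from a checkpoint rather than re-scanning can be sketched as selecting the next slice of rows that are not yet embedded. This rests on assumptions: the checkpoint is modelled as a plain set of embedded row IDs, and `next_slice` is a hypothetical helper, not an `aina-data-engine` function. The `max_new` parameter mirrors the `--max-new` flag that appears under Next Commands.

```python
from collections.abc import Iterable


def next_slice(candidate_ids: Iterable[str], embedded_ids: set[str], max_new: int) -> list[str]:
    """Return up to max_new candidate IDs that are absent from the checkpoint, in order."""
    selected: list[str] = []
    for row_id in candidate_ids:
        if len(selected) >= max_new:
            break
        if row_id not in embedded_ids:
            selected.append(row_id)
    return selected


# Made-up IDs for illustration only.
checkpoint = {"t-001", "t-002"}
print(next_slice(["t-001", "t-002", "t-003", "t-004", "t-005"], checkpoint, max_new=2))
```

Capping each slice keeps a progressive run small enough to inspect before any batch run is considered.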
Canonical markdown source
# Source Authority Start Here

Status: `pass`

Created: `2026-06-13T14:45:54Z`

Schema: `source_authority_start_here_v1`

## The Single Idea

This is the boot sequence for AINA Personalization Engine data work. It points agents at the already-discovered source ledgers before title cleaning, source harvest, Gemini embedding, or runtime mapping work starts. The goal is to stop rediscovering the same repos and to consume prior work in the right order.

## Current Counts

| Metric | Value |
| --- | ---: |
| Title audit rows | `37478` |
| Trusted jobs-research titles | `15104` |
| Pass-draft clean evidence rows | `44440` |
| Source-authority inventory assets | `46` |
| Company/employer assets inventoried | `5` |
| Corpus chunks frozen | `294672` |
| Corpus source families | `25` |
| Eligibility rows already embedded | `5950` |
| Eligibility rows progressive only | `57175` |
| Eligibility rows repair first | `231105` |
| Serviceable-title vectors | `3440` |
| Top-worked-title vectors | `1000` |
| Top-1,000 vector coverage | `1000` |
| Top-500 vector coverage | `500` |
| Total Gemini vectors | `6507` |

## Must-Read Anchors

| Artifact | Status | Purpose | Path |
| --- | --- | --- | --- |
| `production_source_authority_registry` | `present` | build-time source authority and anti-regression title cleanup map | `/srv/aina/aina-data-engine-room/artifacts/validation/production_source_authority_registry_v1.json` |
| `production_source_authority_inventory` | `present` | cleaned-asset inventory for titles, companies, roles, responsibilities, workflows, tools, affordances, and evidence before embedding | `/srv/aina/aina-data-engine-room/artifacts/validation/production_source_authority_inventory_v1.json` |
| `title_ledger` | `present` | title precedence, count reconciliation, and export contract | `/srv/aina/aina-data-engine-room/docs/TITLE-LEDGER.md` |
| `mapping_chain_ledger` | `present` | title to role to workflow to capability join topology | `/srv/aina/aina-data-engine-room/docs/MAPPING-CHAIN-LEDGER.md` |
| `cross_repo_salvage_map` | `present` | prior-work navigation map; adapt before reinventing | `/home/ali/conductor/aina-consolidated/20-references/linear/doc__cross-repo-salvage-map.md` |
| `jobs_research_source_map` | `present` | jobs-research source-to-output lineage | `/home/ali/conductor/repos/aina-jobs-research/project-summary-package/file-index/SOURCE_MAP.md` |
| `jobs_research_validation_report` | `present` | export row counts and manifest-hash validation | `/home/ali/conductor/repos/aina-jobs-research/project-summary-package/exports/source_intelligence_v1/validation_report.md` |
| `ain_core_stitch_ledger` | `present` | stitched current copy of evidence atlas and source packages | `/home/ali/conductor/repos/aina-core/STITCH-LEDGER.md` |
| `production_embedding_corpus_receipt` | `present` | frozen semantic chunk corpus and batch manifest receipt | `/srv/aina/aina-data-engine-room/artifacts/validation/ain_506_production_semantic_embedding_corpus_v1.json` |
| `production_embedding_eligibility_ledger` | `present` | source-family eligibility and repair-first routing | `/srv/aina/aina-data-engine-room/artifacts/validation/production_embedding_eligibility_v1.json` |
| `latest_serviceable_embedding_checkpoint` | `present` | latest progressive serviceable-title Gemini checkpoint | `/srv/aina/aina-data-engine-room/artifacts/validation/ain_506_serviceable_title_progressive_live_3000_v1.json` |

## Hidden Gems To Reuse

| Source | Why it matters | Path |
| --- | --- | --- |
| jobs-research source intelligence export | Role buckets, responsibilities, workflow seeds, workflows, tools, and affordances already exist as a validated source package. | `/home/ali/conductor/repos/aina-jobs-research/project-summary-package/exports/source_intelligence_v1/manifest.json` |
| jobs-research clean candidate pass-draft layer | Use as clean evidence substrate before embedding or repairing noisy title rows. | `/home/ali/conductor/repos/aina-jobs-research/project-summary-package/organized-outputs/03-combined-source-intelligence-candidate-layer/outputs/outputs/combined_clean_evidence_candidates_v1.pass_draft.jsonl` |
| aina-core evidence atlas canonical parquets | Current stitched title alias, responsibility, task, workflow, tool, realism, and affordance evidence layers. | `/home/ali/conductor/repos/aina-core/evidence/canonical` |
| ALIPE LinkedIn and Indeed chunks | Powerful market substrate already referenced by packets, but it must enter only through clean derived chunks. | `/home/ali/ALIPE/data/jobs/linkedin_indeed_clusters_v1/chunks_250k` |
| personalization inventory | Layer and sample inventory for O*NET, jobs CSVs, review artifacts, runtime samples, and historical data quality ranking. | `/home/ali/Personalization Engine/aina-inventory` |

## Routing Traps

| Trap | Routing rule |
| --- | --- |
| Docs shell versus live engine | Never treat aina-personalization-engine as live engine code; use personalization-engine-aina or the engine-room/current core ledgers. |
| Raw market rows | Do not embed raw marketplace labels as semantic truth; derive clean chunks first. |
| Doubtful labels on good text | Preserve labels in metadata and exclude them from embedding text authority. |
| Context-blind responsibility classifier | Treat k2_classified_v1 as broken for serving until rerun with role context. |
| Old status labels | Do not skip prior work because it says reference, not_promoted, or old review language; inspect lineage and consume clean derived layers. |

## Consumption Order

1. Read the production source authority registry and title ledger before any title work.
2. Read the mapping-chain ledger before any role, workflow, task, or curriculum join.
3. Use jobs-research authority to clean noisy market titles before embedding.
4. Treat labels as routing metadata unless the source authority says they are strong enough.
5. Embed only clean derived chunks; preserve bad or doubtful labels in metadata, not text authority.
6. Run progressive Gemini slices before batch. Batch only follows clean source-family proof.

## Checks

| Check | Result |
| --- | --- |
| Required anchors present | `True` |
| Source registry valid | `True` |
| Source inventory valid | `True` |
| Corpus receipt valid | `True` |
| Eligibility ledger valid | `True` |
| Latest serviceable checkpoint present | `True` |
| Embedding lane has no deprecated review-gate fields | `True` |
| Raw market rows remain blocked from embedding authority | `True` |

## Next Commands

1. `uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-start-here`
2. `uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-source-authority-inventory`
3. `uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate`
4. `uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate`
5. `uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-eligibility --source-family semantic_review`
6. `uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repair-queue --source-family semantic_review`
7. `uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repaired-corpus --source-family semantic_review`
8. `uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family semantic_review --include-repaired --max-new 500 --dry-run --workers 8`

## Where To Start

Start here, then continue from the current vector snapshot or latest valid embedding checkpoint. Do not re-scan the whole estate unless one of these anchors is missing or contradicted by live repo state.

---

Ali Mehdi Mukadam - co-authored with Codex - 2026-06-12

```yaml
topics:
  - aina-personalization-engine
  - source-authority
  - production-embeddings
subtopics:
  - anti-rediscovery
  - jobs-research
  - evidence-atlas
  - gemini-embedding-2
```