AINA Data Engine Room - 2026-06-13T14:45:54Z - status pass

Source Authority Start Here

A deterministic boot sequence so agents consume the known source estate before spending tokens rediscovering it.

The Single Idea

This artifact turns source memory into an executable receipt: read these anchors, respect these traps, and continue from the latest embedding checkpoint.

Serviceable-title vectors: 3440
Top-worked vectors: 1000
Total Gemini vectors: 6507
Corpus chunks: 294672
Repair-first rows: 231105

01 Source

Read the authority registry and title ledger first.

02 Repair

Use jobs-research and evidence atlas layers to clean derived chunks.

03 Embed

Run Gemini progressively; batch waits for clean source-family proof.

Where to start: run the commands listed under Next Commands, then continue from the latest valid embedding checkpoint instead of re-scanning the whole estate.

Canonical markdown source
# Source Authority Start Here

Status: `pass`
Created: `2026-06-13T14:45:54Z`
Schema: `source_authority_start_here_v1`

## The Single Idea

This is the boot sequence for AINA Personalization Engine data work. It points
agents at the already-discovered source ledgers before title cleaning, source
harvest, Gemini embedding, or runtime mapping work starts. The goal is to stop
rediscovering the same repos and to consume prior work in the right order.

## Current Counts

| Metric | Value |
| --- | ---: |
| Title audit rows | `37478` |
| Trusted jobs-research titles | `15104` |
| Pass-draft clean evidence rows | `44440` |
| Source-authority inventory assets | `46` |
| Company/employer assets inventoried | `5` |
| Corpus chunks frozen | `294672` |
| Corpus source families | `25` |
| Eligibility rows already embedded | `5950` |
| Eligibility rows progressive only | `57175` |
| Eligibility rows repair first | `231105` |
| Serviceable-title vectors | `3440` |
| Top-worked-title vectors | `1000` |
| Top-1,000 vector coverage | `1000` |
| Top-500 vector coverage | `500` |
| Total Gemini vectors | `6507` |

## Must-Read Anchors

| Artifact | Status | Purpose | Path |
| --- | --- | --- | --- |
| `production_source_authority_registry` | `present` | build-time source authority and anti-regression title cleanup map | `/srv/aina/aina-data-engine-room/artifacts/validation/production_source_authority_registry_v1.json` |
| `production_source_authority_inventory` | `present` | cleaned-asset inventory for titles, companies, roles, responsibilities, workflows, tools, affordances, and evidence before embedding | `/srv/aina/aina-data-engine-room/artifacts/validation/production_source_authority_inventory_v1.json` |
| `title_ledger` | `present` | title precedence, count reconciliation, and export contract | `/srv/aina/aina-data-engine-room/docs/TITLE-LEDGER.md` |
| `mapping_chain_ledger` | `present` | title to role to workflow to capability join topology | `/srv/aina/aina-data-engine-room/docs/MAPPING-CHAIN-LEDGER.md` |
| `cross_repo_salvage_map` | `present` | prior-work navigation map; adapt before reinventing | `/home/ali/conductor/aina-consolidated/20-references/linear/doc__cross-repo-salvage-map.md` |
| `jobs_research_source_map` | `present` | jobs-research source-to-output lineage | `/home/ali/conductor/repos/aina-jobs-research/project-summary-package/file-index/SOURCE_MAP.md` |
| `jobs_research_validation_report` | `present` | export row counts and manifest-hash validation | `/home/ali/conductor/repos/aina-jobs-research/project-summary-package/exports/source_intelligence_v1/validation_report.md` |
| `ain_core_stitch_ledger` | `present` | stitched current copy of evidence atlas and source packages | `/home/ali/conductor/repos/aina-core/STITCH-LEDGER.md` |
| `production_embedding_corpus_receipt` | `present` | frozen semantic chunk corpus and batch manifest receipt | `/srv/aina/aina-data-engine-room/artifacts/validation/ain_506_production_semantic_embedding_corpus_v1.json` |
| `production_embedding_eligibility_ledger` | `present` | source-family eligibility and repair-first routing | `/srv/aina/aina-data-engine-room/artifacts/validation/production_embedding_eligibility_v1.json` |
| `latest_serviceable_embedding_checkpoint` | `present` | latest progressive serviceable-title Gemini checkpoint | `/srv/aina/aina-data-engine-room/artifacts/validation/ain_506_serviceable_title_progressive_live_3000_v1.json` |
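
Before relying on the anchors, an agent may want to confirm they exist on the machine it is running on. The following is a minimal sketch of such a check, not the engine's own validation. The `ANCHORS` dict and the `missing_anchors` helper are hypothetical names, and only a subset of the paths from the table is included for brevity.

```python
from pathlib import Path

# A subset of the anchor paths from the table above.
# The dict name and the helper below are illustrative, not part of aina-data-engine.
ANCHORS = {
    "production_source_authority_registry": "/srv/aina/aina-data-engine-room/artifacts/validation/production_source_authority_registry_v1.json",
    "title_ledger": "/srv/aina/aina-data-engine-room/docs/TITLE-LEDGER.md",
    "mapping_chain_ledger": "/srv/aina/aina-data-engine-room/docs/MAPPING-CHAIN-LEDGER.md",
}


def missing_anchors(anchors: dict[str, str]) -> list[str]:
    """Return the artifact names whose path does not exist locally."""
    return [name for name, path in anchors.items() if not Path(path).exists()]
```

An empty result from `missing_anchors(ANCHORS)` means every listed anchor is present. A non-empty result is one of the conditions under which re-scanning the estate may be justified.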

## Hidden Gems To Reuse

| Source | Why it matters | Path |
| --- | --- | --- |
| jobs-research source intelligence export | Role buckets, responsibilities, workflow seeds, workflows, tools, and affordances already exist as a validated source package. | `/home/ali/conductor/repos/aina-jobs-research/project-summary-package/exports/source_intelligence_v1/manifest.json` |
| jobs-research clean candidate pass-draft layer | Use as clean evidence substrate before embedding or repairing noisy title rows. | `/home/ali/conductor/repos/aina-jobs-research/project-summary-package/organized-outputs/03-combined-source-intelligence-candidate-layer/outputs/outputs/combined_clean_evidence_candidates_v1.pass_draft.jsonl` |
| aina-core evidence atlas canonical parquets | Current stitched title alias, responsibility, task, workflow, tool, realism, and affordance evidence layers. | `/home/ali/conductor/repos/aina-core/evidence/canonical` |
| ALIPE LinkedIn and Indeed chunks | Powerful market substrate already referenced by packets, but it must enter only through clean derived chunks. | `/home/ali/ALIPE/data/jobs/linkedin_indeed_clusters_v1/chunks_250k` |
| personalization inventory | Layer and sample inventory for O*NET, jobs CSVs, review artifacts, runtime samples, and historical data quality ranking. | `/home/ali/Personalization Engine/aina-inventory` |

## Routing Traps

| Trap | Routing rule |
| --- | --- |
| Docs shell versus live engine | Never treat aina-personalization-engine as live engine code; use personalization-engine-aina or the engine-room/current core ledgers. |
| Raw market rows | Do not embed raw marketplace labels as semantic truth; derive clean chunks first. |
| Doubtful labels on good text | Preserve labels in metadata and exclude them from embedding text authority. |
| Context-blind responsibility classifier | Treat k2_classified_v1 as broken for serving until rerun with role context. |
| Old status labels | Do not skip prior work because it says reference, not_promoted, or old review language; inspect lineage and consume clean derived layers. |
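
The "Doubtful labels on good text" rule can be illustrated with a small record builder that keeps labels for routing and audit but never mixes them into the text sent to the embedder. This is a sketch under assumptions: `EmbeddingRecord`, `build_record`, and the `marketplace_title` key are hypothetical names, and the sample strings are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class EmbeddingRecord:
    text: str                                     # the only field meant to reach the embedder
    metadata: dict = field(default_factory=dict)  # labels live here, for routing and audit


def build_record(clean_text: str, raw_labels: dict[str, str]) -> EmbeddingRecord:
    """Keep doubtful labels in metadata and out of the embedded text."""
    return EmbeddingRecord(text=clean_text.strip(), metadata={"labels": dict(raw_labels)})


# Made-up sample values, for illustration only.
record = build_record(
    "  Maintains payroll ledgers and reconciles accounts. ",
    {"marketplace_title": "Rockstar Ninja"},
)
```

Here `record.text` holds only the cleaned sentence, while the marketplace label survives under `record.metadata["labels"]`.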

## Consumption Order

1. Read the production source authority registry and title ledger before any title work.
2. Read the mapping-chain ledger before any role, workflow, task, or curriculum join.
3. Use jobs-research authority to clean noisy market titles before embedding.
4. Treat labels as routing metadata unless the source authority says they are strong enough.
5. Embed only clean derived chunks; preserve bad or doubtful labels in metadata, not text authority.
6. Run progressive Gemini slices before batch. Batch only follows clean source-family proof.
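
Steps 5 and 6 can be read as a routing decision per chunk plus a gate per source family. The sketch below assumes each row carries two boolean flags. Those flags, the `Row` type, and both helpers are hypothetical, and the real eligibility ledger may encode this differently. Here "clean source-family proof" is approximated as "every row in the family is clean derived".

```python
from typing import Iterable, NamedTuple


class Row(NamedTuple):
    chunk_id: str
    source_family: str
    clean_derived: bool     # hypothetical flag: chunk came from a clean derived layer
    already_embedded: bool  # hypothetical flag: a vector already exists for this chunk


def lane(row: Row) -> str:
    """Assign a row to one of the three eligibility lanes named in Current Counts."""
    if row.already_embedded:
        return "already_embedded"
    return "progressive_only" if row.clean_derived else "repair_first"


def batch_allowed(rows: Iterable[Row], family: str) -> bool:
    """Allow batch only when the family has rows and all of them are clean derived."""
    family_rows = [r for r in rows if r.source_family == family]
    return bool(family_rows) and all(r.clean_derived for r in family_rows)
```

A family with even one unrepaired row stays on progressive slices in this sketch, which is a conservative reading of the rule that batch waits for proof.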

## Checks

| Check | Result |
| --- | --- |
| Required anchors present | `True` |
| Source registry valid | `True` |
| Source inventory valid | `True` |
| Corpus receipt valid | `True` |
| Eligibility ledger valid | `True` |
| Latest serviceable checkpoint present | `True` |
| Embedding lane has no deprecated review-gate fields | `True` |
| Raw market rows remain blocked from embedding authority | `True` |

## Next Commands

1. `uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-start-here`
2. `uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-source-authority-inventory`
3. `uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate`
4. `uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate`
5. `uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-eligibility --source-family semantic_review`
6. `uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repair-queue --source-family semantic_review`
7. `uv run aina-data-engine --root /srv/aina/aina-data-engine-room production-embedding-repaired-corpus --source-family semantic_review`
8. `uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family semantic_review --include-repaired --max-new 500 --dry-run --workers 8`
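
The commands above are ordered, so a wrapper that stops at the first failure can keep a later step from running against an unvalidated state. This is a minimal sketch, not part of `aina-data-engine`. The subcommands and flags are copied from the list above, while `run_steps`, `STEPS`, and the stop-on-failure policy are assumptions of this example.

```python
import subprocess
from typing import Callable, Sequence

ROOT = "/srv/aina/aina-data-engine-room"
BASE = ["uv", "run", "aina-data-engine", "--root", ROOT]

# Subcommands copied from the numbered list above, in the same order.
STEPS: list[list[str]] = [
    ["source-authority-start-here"],
    ["production-source-authority-inventory"],
    ["ain-506-p0-gate"],
    ["validate"],
    ["production-embedding-eligibility", "--source-family", "semantic_review"],
    ["production-embedding-repair-queue", "--source-family", "semantic_review"],
    ["production-embedding-repaired-corpus", "--source-family", "semantic_review"],
    ["gemini-embedding-run", "--source-family", "semantic_review",
     "--include-repaired", "--max-new", "500", "--dry-run", "--workers", "8"],
]


def _run(argv: Sequence[str]) -> int:
    """Execute one command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def run_steps(steps: Sequence[Sequence[str]],
              runner: Callable[[Sequence[str]], int] = _run) -> int:
    """Run steps in order; return how many completed before the first failure."""
    for done, step in enumerate(steps):
        if runner(BASE + list(step)) != 0:
            return done
    return len(steps)
```

Calling `run_steps(STEPS)` would execute the real commands. Passing a custom `runner` allows the sequence to be exercised without touching the engine room.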

## Where To Start

Start here, then continue from the current vector snapshot or latest valid
embedding checkpoint. Do not re-scan the whole estate unless one of these
anchors is missing or contradicted by live repo state.
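
Continuing from a checkpoint amounts to skipping what is already embedded and taking the next bounded slice. The checkpoint file's schema is not shown here, so this sketch takes plain ID collections as input. `next_slice` is a hypothetical helper, and `max_new` mirrors the `--max-new` flag from the commands above.

```python
from typing import Iterable


def next_slice(eligible_ids: Iterable[str],
               embedded_ids: Iterable[str],
               max_new: int) -> list[str]:
    """Return up to max_new eligible IDs not yet embedded, preserving input order."""
    done = set(embedded_ids)
    pending = [chunk_id for chunk_id in eligible_ids if chunk_id not in done]
    return pending[:max_new]
```

For example, `next_slice(["a", "b", "c", "d"], ["b"], 2)` returns `["a", "c"]`, so each run resumes where the last one stopped.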

---

Ali Mehdi Mukadam - co-authored with Codex - 2026-06-12

```yaml
topics:
  - aina-personalization-engine
  - source-authority
  - production-embeddings
subtopics:
  - anti-rediscovery
  - jobs-research
  - evidence-atlas
  - gemini-embedding-2
```