AINA data engine room - founder report - 2026-06-11

Founder Report

The local data engine is now a real foundation for personalization, with broader title coverage, packet evidence, and proof artifacts.

The Single Idea

The engine can now recognize 110,184 local occupations and attach evidence packs to 14,114 eligible roles. It is not production-ready yet, but it is no longer just a title map.

01 - What Changed

From Title Coverage Toward Runtime Intelligence

Before | Now
About 74k title rows. | 110,184 local occupations.
Evidence packs reached a narrow slice. | 14,114 eligible occupations now have evidence packs.
Some technically passing answers sounded wrong. | Warehouse/shipping now resolves to warehouse manager, not warehouse i.
Evidence matching was mostly exact title or small SOC fallback. | It now has exact title, explicit SOC, title-derived SOC, and guarded SOC-family paths.
Excluded roles could inherit evidence. | Excluded roles now skip evidence packs.
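
The matching order in the table can be pictured as a fallback cascade. The sketch below is illustrative only and is not the engine's code: the names (Role, Match, attach_evidence), the lookup tables, the use of the first two digits of a SOC code as its "family", and the allowlist used as the "guard" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Role:
    title: str
    soc: Optional[str] = None   # explicit SOC code, if the row carries one
    excluded: bool = False      # excluded roles must never receive evidence


@dataclass
class Match:
    pack: Optional[str]         # evidence pack id, or None when nothing attaches
    path: str                   # label saying which path produced the result


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so title lookups are stable."""
    return " ".join(title.lower().split())


def attach_evidence(
    role: Role,
    packs_by_title: Dict[str, str],
    packs_by_soc: Dict[str, str],
    soc_by_title: Dict[str, str],
    family_allowlist: Set[str],
) -> Match:
    # Excluded roles skip evidence entirely, before any matching is tried.
    if role.excluded:
        return Match(None, "excluded")

    key = normalize(role.title)

    # 1. Exact title match.
    if key in packs_by_title:
        return Match(packs_by_title[key], "exact_title")

    # 2. Explicit SOC code carried by the role.
    if role.soc and role.soc in packs_by_soc:
        return Match(packs_by_soc[role.soc], "explicit_soc")

    # 3. SOC code derived from the title (hypothetical lookup table).
    derived = soc_by_title.get(key)
    if derived and derived in packs_by_soc:
        return Match(packs_by_soc[derived], "title_derived_soc")

    # 4. Guarded SOC-family fallback: only for allowlisted families,
    #    and always labeled as a weak match.
    soc = role.soc or derived
    if soc:
        family = soc[:2]  # assumption: first two digits identify the family
        if family in family_allowlist:
            for code in sorted(packs_by_soc):
                if code.startswith(family):
                    return Match(packs_by_soc[code], "soc_family_weak")

    # Missing evidence stays visible instead of being papered over.
    return Match(None, "no_evidence")
```

Returning the path label with every result is what would let a weak family-level match be shown as weak rather than passed off as an exact hit.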

02 - Why It Matters

This Is Becoming The Personalization Engine

A learner can give a job title, the engine can resolve it, and for a growing set of roles it can attach grounded guidance about where AI helps, where AI should not be trusted, and which workflows matter.

The important improvement is honesty: weak matches are labeled, excluded roles stay excluded, and missing evidence stays visible.

03 - What To Review

Use The Human Surfaces

Artifact | What to look for
docs/handoff/2026-06-11-session-closeout-data-engine-room-handoff.md | Full technical handoff and repo map.
artifacts/reports/evidence_fanout_probe_v1.md | Evidence coverage measurement.
artifacts/reports/serving_probe_v1.md | Serving behavior on exact, messy, and OOD examples.
docs/handoff/2026-06-11-title-expansion-runtime-semantic-replay-handoff.md | Cumulative handoff for this lane.

You do not need to review raw JSON, parquet files, or Python internals unless you want to inspect the machinery.

04 - Gemini Embeddings

Worth Testing, Not A Runtime Shortcut

Gemini embeddings are worth looking at next. They could help compare job titles, responsibilities, workflows, and evidence packs semantically instead of relying only on deterministic token overlap.

The next move should be a sidecar evaluation, not a direct runtime switch. The question is whether embeddings find better matches without creating confident nonsense.
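
A sidecar evaluation of this kind could look roughly like the sketch below. It is a minimal illustration under stated assumptions: the deterministic baseline is modeled as Jaccard token overlap, the embedding vectors are assumed to be precomputed and passed in as plain lists (no Gemini API call is shown), and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase tokens; stands in for the deterministic baseline."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def sidecar_compare(
    query: str,
    candidates: List[str],
    vectors: Dict[str, Sequence[float]],
) -> dict:
    """Serve the baseline answer; record the embedding answer alongside it."""
    baseline = max(candidates, key=lambda c: token_overlap(query, c))
    shadow = max(candidates, key=lambda c: cosine(vectors[query], vectors[c]))
    return {
        "served": baseline,          # runtime behavior is unchanged
        "shadow": shadow,            # embedding result, logged for review only
        "agree": baseline == shadow,
    }
```

The disagreements are the interesting rows to review: each one is either a better match the baseline missed or a case of confident nonsense.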

05 - Pending Work

What Is Still Open

Pending area | Plain-English meaning
More evidence coverage | Most eligible titles still do not have promoted evidence packs.
Embedding evaluation | Gemini should be tested against the deterministic baseline.
More semantic replay | Only the first runtime semantic batch has been processed in this lane.
Quarantine decision | responsibility_registry_v2 remains off-limits until explicitly lifted.
Runtime packaging | This is still a local repo engine, not a product API/UI.

Run A Gemini Embeddings Evaluation Lane

Embed title/evidence/workflow text, compare embedding matches against evidence_fanout_probe_v1, inspect 50 to 100 real examples, and only promote embedding behavior if it beats the deterministic baseline.
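
The promotion rule can be written down as a small gate over the hand-reviewed examples. This is a sketch, not an agreed procedure: the Reviewed record, the embedding_confident flag, and the zero-tolerance default for confident wrong answers are assumptions added for illustration. Only the 50-example floor and the "must beat the baseline" condition come from the plan above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reviewed:
    baseline_correct: bool      # reviewer judged the deterministic match correct
    embedding_correct: bool     # reviewer judged the embedding match correct
    embedding_confident: bool   # hypothetical: embedding score was above a chosen threshold


def should_promote(
    examples: List[Reviewed],
    min_examples: int = 50,
    max_confident_wrong: int = 0,   # assumption: tolerate no confident nonsense
) -> bool:
    """Promote embedding behavior only if it beats the baseline on enough real examples."""
    n = len(examples)
    if n < min_examples:
        return False

    baseline_acc = sum(e.baseline_correct for e in examples) / n
    embedding_acc = sum(e.embedding_correct for e in examples) / n
    confident_wrong = sum(
        e.embedding_confident and not e.embedding_correct for e in examples
    )

    # Strictly better than the baseline, and within the confident-error budget.
    return embedding_acc > baseline_acc and confident_wrong <= max_confident_wrong
```

A tie with the baseline does not promote, which keeps the deterministic path as the default until embeddings clearly earn the switch.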

Founder Takeaway

The engine is useful enough to evaluate seriously, but not ready to connect to real learners without the next evidence-quality pass.