AINA Data Engine Room · Handoff · 2026-06-11

Runtime Eval Runs Handoff

Observed deterministic local answers and assertion outcomes for the 1,000-title runtime fixture sample.

Ali Mehdi Mukadam · co-authored with Codex · branch ali/personalization-engine-mission-2026-06-09

The Single Idea

The engine room now has observed local eval runs for the current 1,000-title runtime fixture sample. This moves the system from expected evaluator cases to observed results: every fixture now has deterministic local answer text, assertion outcomes, and a pass/fail status.

What Changed

Added src/aina_data_engine/runtime_eval_runs.py, wired the aina-data-engine runtime-eval-runs CLI command, and extended the tests to cover observed answers, assertion outcomes, blocked refusals, and CLI wiring.

No model calls were made. These are deterministic local fixtures for proving safety boundaries, answer shape, caveat visibility, blocked holds, and basic role/workflow fit.

Live Artifacts

Artifact	Rows	SHA-256
`/srv/aina/aina-data-engine-room/artifacts/validation/runtime_eval_runs_v1.jsonl`	1000	`3455608747e6eac88e07f1473667fd1ee7f4fa38a41860fda617de7c8d8fb90c`
`/srv/aina/aina-data-engine-room/artifacts/validation/runtime_eval_runs_v1_packet_quality_eval_runs.jsonl`	295	`0c4fda90b6f344c2fb36f8c47b5a34b41025a60a3c10d3c059a6bfef038ceb4e`
`/srv/aina/aina-data-engine-room/artifacts/validation/runtime_eval_runs_v1_caveat_eval_runs.jsonl`	670	`0e97c6c2730a8bed0af00021d2b6b3ac8aaed43058cf5465f599b38ccf0900e9`
`/srv/aina/aina-data-engine-room/artifacts/validation/runtime_eval_runs_v1_source_ref_rerun_eval_runs.jsonl`	1	`11a78bab821b3428f19b71aaf07466c03d4b2e83502256f3470a6b0ba6c2269a`
`/srv/aina/aina-data-engine-room/artifacts/validation/runtime_eval_runs_v1_hold_mining_eval_runs.jsonl`	34	`306594d77d3601c569563a08a7e889f9b057652652b5fee1b9d2beccf17622de`
`/srv/aina/aina-data-engine-room/artifacts/validation/runtime_eval_runs_v1_semantic_followup_eval_runs.jsonl`	462	`8f22b6e953c370695f00dea14060b504aa64206c8f5b2b47c62497a386d985c6`
`/srv/aina/aina-data-engine-room/artifacts/validation/runtime_eval_runs_v1_failing_eval_runs.jsonl`	0	`e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855`
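The row counts and digests in the table above can be rechecked locally. The following is a minimal sketch, not the project's own tooling: the helper names `jsonl_fingerprint` and `verify` are hypothetical, and it assumes one JSON object per non-empty line.

```python
import hashlib
from pathlib import Path


def jsonl_fingerprint(path: Path) -> tuple[int, str]:
    """Return (non-empty line count, SHA-256 hex digest) for one JSONL file."""
    data = path.read_bytes()
    # Assumption: each non-empty line is one eval run row.
    rows = sum(1 for line in data.splitlines() if line.strip())
    return rows, hashlib.sha256(data).hexdigest()


def verify(path: Path, expected_rows: int, expected_sha256: str) -> bool:
    """True when both the row count and the digest match the table."""
    rows, digest = jsonl_fingerprint(path)
    return rows == expected_rows and digest == expected_sha256
```

The digest listed for the failing-runs artifact is the SHA-256 of empty input, which is consistent with its row count of 0.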

Live Result

Metric	Count
Eval run rows	1000
Local serviceable eval runs	965
Packet-quality eval runs	295
Caveat eval runs	670
Blocked eval runs	35
Semantic follow-up eval runs	462
Failing eval runs	0

Eval status	Count
Passed	538
Passed with semantic follow-up	427
Passed blocking refusal	35
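The counts above can be cross-checked with simple arithmetic. This sketch uses the figures from the two tables. It assumes, based only on the numbers, that packet-quality and caveat runs together make up the serviceable rows and that every blocked row ends as a blocking refusal. The dictionary keys and `check_totals` are hypothetical names.

```python
counts = {
    "eval_run_rows": 1000,
    "local_serviceable": 965,
    "packet_quality": 295,
    "caveat": 670,
    "blocked": 35,
}
statuses = {
    "Passed": 538,
    "Passed with semantic follow-up": 427,
    "Passed blocking refusal": 35,
}


def check_totals(counts: dict[str, int], statuses: dict[str, int]) -> list[str]:
    """Return a list of inconsistencies; an empty list means the totals agree."""
    problems: list[str] = []
    total = counts["eval_run_rows"]
    if counts["local_serviceable"] + counts["blocked"] != total:
        problems.append("serviceable + blocked != total rows")
    # Assumption: packet-quality and caveat lanes partition the serviceable rows.
    if counts["packet_quality"] + counts["caveat"] != counts["local_serviceable"]:
        problems.append("packet-quality + caveat != serviceable")
    if sum(statuses.values()) != total:
        problems.append("status counts do not sum to total rows")
    if statuses["Passed blocking refusal"] != counts["blocked"]:
        problems.append("blocking refusals != blocked rows")
    return problems
```

With the reported figures, `check_totals(counts, statuses)` returns an empty list: 965 + 35 = 1000, 295 + 670 = 965, and 538 + 427 + 35 = 1000.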

Assertion Coverage

Assertion	Count
Mentions role title	1000
Names workflow or artifact	1000
Preserves human judgment boundary	1000
Blocks production claim	1000
Avoids real-user data	1000
References local synthetic scope	1000
Displays and keeps caveat visible	670
Uses source-backed context	295
Is packet-hardening ready	295
Refuses runtime plan until repair	35
Requires source evidence repair	35
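A coverage table like the one above could be recomputed from the run rows. The sketch below is illustrative only: it assumes each JSONL row carries an `assertions` object mapping assertion names to booleans, which is a hypothetical schema rather than the confirmed layout of runtime_eval_runs_v1.jsonl.

```python
import json
from collections import Counter


def assertion_coverage(jsonl_text: str) -> Counter:
    """Count, per assertion name, how many rows report it as true."""
    coverage: Counter = Counter()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        # Hypothetical field name; adjust to the real row schema.
        for name, passed in row.get("assertions", {}).items():
            if passed:
                coverage[name] += 1
    return coverage
```

Under that assumption, assertions that apply to every fixture would tally to 1000, while lane-specific ones would match their lane counts (670, 295, or 35).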

Spot Check

I inspected 50 actual rows, checking each for eval status, fixture lane, display title, function, pass flag, semantic flags, assertion count, and observed answer text. The table lists eight of them.

Title	Status	Runtime behavior
Seasonal Sales Associate	Passed	Source-backed packet-hardening answer.
Support Associate - Soma	Passed	Customer support answer with local scope.
Director of Business Intelligence	Passed	Data/insight answer with judgment boundary.
Customer Service Representative	Passed	Fallback precision caveat visible.
Salesperson	Passed with semantic follow-up	Sales workflow served locally; correction preserved.
Business Analyst	Passed with semantic follow-up	Data/analysis workflow served locally; correction preserved.
family law attorney	Passed blocking refusal	Runtime plan refused until source evidence repair.
teacher-special education	Passed blocking refusal	Runtime plan refused until source refs are attached.

Scope Boundaries

Local VDS only
No external writes
No real-user data
No production claim
No model calls
Source rows mutated: false
Human reviewer gate removed: false
Multi-LLM/evaluator review still required

Validation

All summary checks are true, including fixture validity, observed answers, assertion results, serviceable row pass, blocked row refusal, caveat visibility, external-domain blocking, and production-claim blocking.

cd /srv/aina/aina-data-engine-room
.venv/bin/python -m ruff check src tests
.venv/bin/python -m pytest -q

All checks passed.
199 passed in 219.78s.

Resume Commands

cd /srv/aina/aina-data-engine-room
.venv/bin/aina-data-engine --root /srv/aina/aina-data-engine-room runtime-eval-runs

cd /srv/aina/aina-data-engine-room
.venv/bin/aina-data-engine --root /srv/aina/aina-data-engine-room harvest-source-map
.venv/bin/aina-data-engine --root /srv/aina/aina-data-engine-room source-import-recipes
.venv/bin/aina-data-engine --root /srv/aina/aina-data-engine-room semantic-harvest-gate --sample-limit 1000
.venv/bin/aina-data-engine --root /srv/aina/aina-data-engine-room semantic-repair-queue
.venv/bin/aina-data-engine --root /srv/aina/aina-data-engine-room deterministic-semantic-repairs
.venv/bin/aina-data-engine --root /srv/aina/aina-data-engine-room semantic-patch-replay
.venv/bin/aina-data-engine --root /srv/aina/aina-data-engine-room runtime-intake
.venv/bin/aina-data-engine --root /srv/aina/aina-data-engine-room runtime-payloads
.venv/bin/aina-data-engine --root /srv/aina/aina-data-engine-room runtime-evaluator-fixtures
.venv/bin/aina-data-engine --root /srv/aina/aina-data-engine-room runtime-eval-runs
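
The full sequence above can also be driven from a small script. This is a sketch under stated assumptions, not part of the repository: `build_commands` and `run_pipeline` are hypothetical helpers, the subcommand names and the `--root` and `--sample-limit` flags are copied from the commands listed here, and stopping at the first failing step is a choice made for this example.

```python
import subprocess
from pathlib import Path

ROOT = Path("/srv/aina/aina-data-engine-room")

# Subcommands in the order listed in the resume commands.
STEPS = [
    ["harvest-source-map"],
    ["source-import-recipes"],
    ["semantic-harvest-gate", "--sample-limit", "1000"],
    ["semantic-repair-queue"],
    ["deterministic-semantic-repairs"],
    ["semantic-patch-replay"],
    ["runtime-intake"],
    ["runtime-payloads"],
    ["runtime-evaluator-fixtures"],
    ["runtime-eval-runs"],
]


def build_commands(root: Path = ROOT) -> list[list[str]]:
    """Build one argv list per pipeline step."""
    exe = str(root / ".venv" / "bin" / "aina-data-engine")
    return [[exe, "--root", str(root), *step] for step in STEPS]


def run_pipeline(root: Path = ROOT) -> None:
    """Run each step in order; check=True raises on the first non-zero exit."""
    for argv in build_commands(root):
        subprocess.run(argv, cwd=root, check=True)
```

Calling `run_pipeline()` on the VDS would replay the steps in sequence; `build_commands()` alone is useful for printing or reviewing the commands without executing anything.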