Runtime Eval Runs Handoff
Observed deterministic local answers and assertion outcomes for the 1,000-title runtime fixture sample.
The engine room now has observed local eval runs for the current 1,000-title runtime fixture sample. This moves the system from expected evaluator cases to observed results: deterministic local answer text, assertion outcomes, and pass/fail status for every fixture.
What Changed
Added src/aina_data_engine/runtime_eval_runs.py, wired the aina-data-engine runtime-eval-runs CLI subcommand, and extended tests for observed answers, assertion outcomes, blocked refusals, and CLI wiring.
Live Artifacts
| Artifact | Rows | SHA-256 |
|---|---|---|
| /srv/aina/aina-data-engine-room/artifacts/validation/runtime_eval_runs_v1.jsonl | 1000 | 3455608747e6eac88e07f1473667fd1ee7f4fa38a41860fda617de7c8d8fb90c |
| /srv/aina/aina-data-engine-room/artifacts/validation/runtime_eval_runs_v1_packet_quality_eval_runs.jsonl | 295 | 0c4fda90b6f344c2fb36f8c47b5a34b41025a60a3c10d3c059a6bfef038ceb4e |
| /srv/aina/aina-data-engine-room/artifacts/validation/runtime_eval_runs_v1_caveat_eval_runs.jsonl | 670 | 0e97c6c2730a8bed0af00021d2b6b3ac8aaed43058cf5465f599b38ccf0900e9 |
| /srv/aina/aina-data-engine-room/artifacts/validation/runtime_eval_runs_v1_source_ref_rerun_eval_runs.jsonl | 1 | 11a78bab821b3428f19b71aaf07466c03d4b2e83502256f3470a6b0ba6c2269a |
| /srv/aina/aina-data-engine-room/artifacts/validation/runtime_eval_runs_v1_hold_mining_eval_runs.jsonl | 34 | 306594d77d3601c569563a08a7e889f9b057652652b5fee1b9d2beccf17622de |
| /srv/aina/aina-data-engine-room/artifacts/validation/runtime_eval_runs_v1_semantic_followup_eval_runs.jsonl | 462 | 8f22b6e953c370695f00dea14060b504aa64206c8f5b2b47c62497a386d985c6 |
| /srv/aina/aina-data-engine-room/artifacts/validation/runtime_eval_runs_v1_failing_eval_runs.jsonl | 0 | e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 |
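The row counts and digests in the table can be rechecked locally. The following is a minimal sketch, assuming each artifact is JSONL with one record per non-empty line; the helper names `summarize_artifact` and `check_artifact` are hypothetical and not part of aina-data-engine.

```python
import hashlib
from pathlib import Path


def summarize_artifact(path: Path) -> tuple[int, str]:
    """Return (non-empty line count, SHA-256 hex digest) for one JSONL file."""
    data = path.read_bytes()
    # Assumption: one record per non-empty line.
    rows = sum(1 for line in data.splitlines() if line.strip())
    return rows, hashlib.sha256(data).hexdigest()


def check_artifact(path: Path, expected_rows: int, expected_sha256: str) -> bool:
    """True when both the row count and the digest match the table entry."""
    rows, digest = summarize_artifact(path)
    return rows == expected_rows and digest == expected_sha256
```

The digest listed for the zero-row failing-runs file is the SHA-256 of empty input, which is what an empty file should produce.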
Live Result
| Metric | Count |
|---|---|
| Eval run rows | 1000 |
| Local serviceable eval runs | 965 |
| Packet-quality eval runs | 295 |
| Caveat eval runs | 670 |
| Blocked eval runs | 35 |
| Semantic follow-up eval runs | 462 |
| Failing eval runs | 0 |

| Eval status | Count |
|---|---|
| Passed | 538 |
| Passed with semantic follow-up | 427 |
| Passed blocking refusal | 35 |
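The counts above reconcile with each other, and the arithmetic can be made explicit. This is a sketch only: the dictionary keys are hypothetical labels for the table rows, and the identities checked are observations about these numbers, not documented rules of the engine.

```python
# Counts copied from the Live Result tables; key names are illustrative.
metrics = {
    "eval_run_rows": 1000,
    "local_serviceable": 965,
    "packet_quality": 295,
    "caveat": 670,
    "blocked": 35,
    "failing": 0,
}
status_counts = {
    "passed": 538,
    "passed_with_semantic_followup": 427,
    "passed_blocking_refusal": 35,
}


def reconcile(metrics: dict, status_counts: dict) -> list[str]:
    """Return a list of mismatches; an empty list means the tables agree."""
    problems = []
    if metrics["packet_quality"] + metrics["caveat"] != metrics["local_serviceable"]:
        problems.append("packet-quality + caveat != local serviceable")
    if metrics["local_serviceable"] + metrics["blocked"] != metrics["eval_run_rows"]:
        problems.append("serviceable + blocked != total rows")
    if sum(status_counts.values()) != metrics["eval_run_rows"]:
        problems.append("status counts do not sum to total rows")
    if status_counts["passed_blocking_refusal"] != metrics["blocked"]:
        problems.append("blocking refusals != blocked rows")
    return problems


print(reconcile(metrics, status_counts))
```

With the figures reported here, 295 + 670 = 965, 965 + 35 = 1000, and 538 + 427 + 35 = 1000, so the check returns no mismatches.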
Assertion Coverage
| Assertion | Count |
|---|---|
| Mentions role title | 1000 |
| Names workflow or artifact | 1000 |
| Preserves human judgment boundary | 1000 |
| Blocks production claim | 1000 |
| Avoids real-user data | 1000 |
| References local synthetic scope | 1000 |
| Displays and keeps caveat visible | 670 |
| Uses source-backed context | 295 |
| Is packet-hardening ready | 295 |
| Refuses runtime plan until repair | 35 |
| Requires source evidence repair | 35 |
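A coverage table like this one could be rebuilt from the eval run rows. The sketch below assumes each JSONL row carries a mapping from assertion name to a boolean result under a field called `assertion_results`; that field name and the function `assertion_coverage` are assumptions for illustration, since the row schema is not shown in this handoff.

```python
import json
from collections import Counter
from typing import Iterable


def assertion_coverage(lines: Iterable[str]) -> Counter:
    """Count, per assertion name, the rows where that assertion passed."""
    coverage: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        # Assumed field: {"assertion_results": {"<assertion name>": true/false}}
        for name, passed in row.get("assertion_results", {}).items():
            if passed:
                coverage[name] += 1
    return coverage
```

Under that assumed schema, the assertions that apply to every fixture would count 1000, while lane-specific assertions would count only the rows in their lane (670, 295, or 35).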
Spot Check
I inspected 50 actual rows, checking eval status, fixture lane, display title, function, pass flag, semantic flags, assertion count, and observed answer text. Representative rows:
| Title | Status | Runtime behavior |
|---|---|---|
| Seasonal Sales Associate | Passed | Source-backed packet-hardening answer. |
| Support Associate - Soma | Passed | Customer support answer with local scope. |
| Director of Business Intelligence | Passed | Data/insight answer with judgment boundary. |
| Customer Service Representative | Passed | Fallback precision caveat visible. |
| Salesperson | Passed with semantic follow-up | Sales workflow served locally; correction preserved. |
| Business Analyst | Passed with semantic follow-up | Data/analysis workflow served locally; correction preserved. |
| family law attorney | Passed blocking refusal | Runtime plan refused until source evidence repair. |
| teacher-special education | Passed blocking refusal | Runtime plan refused until source refs are attached. |
Scope Boundaries
- Local VDS only
- No external writes
- No real-user data
- No production claim
- No model calls
- Source rows mutated: false
- Human reviewer gate removed
- Multi-LLM/evaluator review still required
Validation
All summary checks are true, including fixture validity, observed answers, assertion results, serviceable row pass, blocked row refusal, caveat visibility, external-domain blocking, and production-claim blocking.
cd /srv/aina/aina-data-engine-room
.venv/bin/python -m ruff check src tests
.venv/bin/python -m pytest -q
Ruff reported: All checks passed.
Pytest reported: 199 passed in 219.78s.
Resume Commands
Rerun this step only:
cd /srv/aina/aina-data-engine-room
.venv/bin/aina-data-engine --root /srv/aina/aina-data-engine-room runtime-eval-runs
Rerun the full chain in order:
cd /srv/aina/aina-data-engine-room
.venv/bin/aina-data-engine --root /srv/aina/aina-data-engine-room harvest-source-map
.venv/bin/aina-data-engine --root /srv/aina/aina-data-engine-room source-import-recipes
.venv/bin/aina-data-engine --root /srv/aina/aina-data-engine-room semantic-harvest-gate --sample-limit 1000
.venv/bin/aina-data-engine --root /srv/aina/aina-data-engine-room semantic-repair-queue
.venv/bin/aina-data-engine --root /srv/aina/aina-data-engine-room deterministic-semantic-repairs
.venv/bin/aina-data-engine --root /srv/aina/aina-data-engine-room semantic-patch-replay
.venv/bin/aina-data-engine --root /srv/aina/aina-data-engine-room runtime-intake
.venv/bin/aina-data-engine --root /srv/aina/aina-data-engine-room runtime-payloads
.venv/bin/aina-data-engine --root /srv/aina/aina-data-engine-room runtime-evaluator-fixtures
.venv/bin/aina-data-engine --root /srv/aina/aina-data-engine-room runtime-eval-runs
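The chain can also be driven from a small script that stops at the first failing step. This is a sketch under stated assumptions: the subcommands and the `--root` and `--sample-limit` flags are taken from the commands listed here, while `STEPS`, `build_command`, `run_chain`, and the injectable `runner` parameter are hypothetical names introduced for illustration.

```python
import subprocess
from typing import Callable, Optional, Sequence

ROOT = "/srv/aina/aina-data-engine-room"
CLI = ".venv/bin/aina-data-engine"

# Subcommands in the order the handoff lists them.
STEPS: list[list[str]] = [
    ["harvest-source-map"],
    ["source-import-recipes"],
    ["semantic-harvest-gate", "--sample-limit", "1000"],
    ["semantic-repair-queue"],
    ["deterministic-semantic-repairs"],
    ["semantic-patch-replay"],
    ["runtime-intake"],
    ["runtime-payloads"],
    ["runtime-evaluator-fixtures"],
    ["runtime-eval-runs"],
]


def build_command(step: Sequence[str]) -> list[str]:
    """Build the full argv for one subcommand."""
    return [CLI, "--root", ROOT, *step]


def run_chain(
    steps: Sequence[Sequence[str]],
    runner: Optional[Callable[[list[str]], int]] = None,
) -> list[str]:
    """Run steps in order; raise on the first non-zero exit code."""
    if runner is None:
        # Default runner executes the command from the repository root.
        def runner(cmd: list[str]) -> int:
            return subprocess.run(cmd, cwd=ROOT).returncode

    completed = []
    for step in steps:
        code = runner(build_command(step))
        if code != 0:
            raise RuntimeError(f"step {step[0]!r} exited with code {code}")
        completed.append(step[0])
    return completed
```

Calling `run_chain(STEPS)` executes the real commands. Passing a stub `runner` allows a dry run that only records the argv lists.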