AINA Data Engine Room · 2026-06-11

Runtime Semantic Adjudication Handoff

A review-ready surface for the 526 semantic follow-up rows created by the runtime eval chain.

Ali Mehdi Mukadam · co-authored with Codex · 6 minute read · /srv/aina/aina-data-engine-room

The Single Idea

The runtime engine now has a concrete review surface for the 526 semantic follow-up rows created by the local eval chain. Instead of leaving reviewers with raw eval-run JSONL, this slice turns those rows into prioritized lanes, 100-row batches, per-row review prompts, decision schemas, and split artifacts for function changes, broad title cleanup, and source-evidence repair.

Before: Semantic follow-up existed as raw eval rows with flags. Useful, but awkward for multi-LLM review.

After: Every follow-up row has a lane, priority, batch, prompt, decision schema, and scoped evidence snapshot.

01 What changed
02 Before and after
03 Review lanes
04 Semantic sanity check
05 Artifact inventory
06 Validation
07 What this means
08 Next best slice

What Changed

I added src/aina_data_engine/runtime_semantic_adjudication.py and wired it into the CLI as:

Codex · Generate The Adjudication Queue · local-only review surface

.venv/bin/aina-data-engine --root /srv/aina/aina-data-engine-room runtime-semantic-adjudication --batch-size 100

Watch-out: do not treat generated review prompts as approvals; they are prompts for adjudication.

The command reads runtime_eval_runs_v1_semantic_followup_eval_runs.jsonl, preserves the 526-row count, and writes a new local-only artifact family under /srv/aina/aina-data-engine-room/artifacts/validation/.
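The count-preservation guarantee above can be sketched as follows. This is a minimal illustration, not the code in runtime_semantic_adjudication.py: the helper names read_jsonl, write_jsonl, and assert_count_preserved are hypothetical, and the only assumption about the files is that they hold one JSON object per line.

```python
import json
from pathlib import Path


def read_jsonl(path: Path) -> list[dict]:
    """Load one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def write_jsonl(path: Path, rows: list[dict]) -> None:
    """Write rows back out, one JSON object per line, with stable key order."""
    with path.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, sort_keys=True) + "\n")


def assert_count_preserved(source_rows: list[dict], output_rows: list[dict]) -> None:
    """Fail loudly if a transform dropped or duplicated rows (hypothetical check)."""
    if len(source_rows) != len(output_rows):
        raise ValueError(
            f"row count changed: {len(source_rows)} in, {len(output_rows)} out"
        )
```

For this slice, a check of this kind would compare the 526 follow-up rows read from the eval-run file against the 526 adjudication rows written out.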

Every adjudication row includes title, source function, runtime function, fixture lane, eval status, semantic flags, review lane, priority score, batch assignment, prompt, decision schema, evidence snapshot, and local-only scope.
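A reviewer tool could guard against incomplete rows with a field check like the one below. The key names are assumptions: they are snake_case renderings of the thirteen fields listed above, and the real keys in the JSONL may be spelled differently.

```python
# Assumed key names, derived from the field list in the text; not confirmed
# against the actual artifact schema.
REQUIRED_FIELDS = (
    "title",
    "source_function",
    "runtime_function",
    "fixture_lane",
    "eval_status",
    "semantic_flags",
    "review_lane",
    "priority_score",
    "batch",
    "prompt",
    "decision_schema",
    "evidence_snapshot",
    "scope",
)


def missing_fields(row: dict) -> list[str]:
    """Return the required keys that are absent from one adjudication row."""
    return [name for name in REQUIRED_FIELDS if name not in row]
```

An empty result means the row carries every field a reviewer needs; anything else is worth stopping on before review begins.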

Before And After

Surface	Before	After
Runtime eval rows	1,000	1,000
Semantic follow-up rows	526 raw rows	526 adjudication rows
Review batches	none	6 batches of 100 or fewer
First batch	none	100 prioritized rows
Broad residual queue	implicit in flags	12-row split artifact
Source-evidence repair queue	implicit in flags	35-row split artifact
Function-change queue	implicit in flags	479-row split artifact
Model calls made	0	0
Production claims allowed	no	no
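The batch figures in the table follow from simple arithmetic: 526 rows at a batch size of 100 give five full batches and one batch of 26. A small sketch makes that explicit. The function names are hypothetical, and the assumption that ranks are 1-based and batches are filled in rank order is mine, not something the artifacts confirm.

```python
def batch_sizes(row_count: int, batch_size: int = 100) -> list[int]:
    """Return the size of each batch when rows are chunked in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    full, remainder = divmod(row_count, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])


def batch_for_rank(rank: int, batch_size: int = 100) -> int:
    """Map a 1-based rank to a 1-based batch number (assumed rank-order fill)."""
    if rank < 1:
        raise ValueError("rank is 1-based")
    return (rank - 1) // batch_size + 1


print(batch_sizes(526, 100))  # [100, 100, 100, 100, 100, 26]
print(batch_for_rank(526))    # 6
```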

Review Lanes

Lane	Rows	What reviewers decide
`adjudicate_caveat_function_change`	444	Whether a deterministic function override is semantically right for a caveated local runtime row.
`adjudicate_packet_function_change`	35	Whether a packet-quality candidate can keep the specialized runtime function, needs revision, or should be downgraded.
`recover_source_evidence_before_runtime`	35	Whether to attach source refs and rerun, hold for source mining, or exclude as not ICP.
`resolve_broad_general_business_context`	12	Whether broad residual titles can be assigned a specific function, kept broad with caveat, held, or require source context.
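The lane counts reconcile with the split artifacts: the two function-change lanes sum to 479, and all four lanes sum to 526. The sketch below reproduces that check. The lane-to-split mapping is my inference from the lane names and row counts, not something read from the code.

```python
LANE_ROWS = {
    "adjudicate_caveat_function_change": 444,
    "adjudicate_packet_function_change": 35,
    "recover_source_evidence_before_runtime": 35,
    "resolve_broad_general_business_context": 12,
}

# Assumed mapping from review lane to split artifact (inferred, not confirmed).
SPLIT_FOR_LANE = {
    "adjudicate_caveat_function_change": "function_changes",
    "adjudicate_packet_function_change": "function_changes",
    "recover_source_evidence_before_runtime": "source_evidence_repairs",
    "resolve_broad_general_business_context": "broad_context",
}


def split_totals(lane_rows: dict[str, int], split_for_lane: dict[str, str]) -> dict[str, int]:
    """Sum lane row counts into their split artifacts."""
    totals: dict[str, int] = {}
    for lane, rows in lane_rows.items():
        split = split_for_lane[lane]
        totals[split] = totals.get(split, 0) + rows
    return totals


totals = split_totals(LANE_ROWS, SPLIT_FOR_LANE)
print(totals)
# {'function_changes': 479, 'source_evidence_repairs': 35, 'broad_context': 12}
print(sum(totals.values()))  # 526
```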

The first batch starts with source-evidence and sensitive-domain risks, then broad residuals, then packet-quality function changes. That ordering puts the riskiest rows in front of the next reviewer run, which is what makes it high-leverage.

Semantic Sanity Check

I inspected the 50-row sample with rank, batch, review lane, priority score, title, source function, runtime function, flags, and artifact under test. The sample is not an auto-approval; it is a useful map of where deterministic rules might be right, overreaching, or under-sourced.

Rank	Title	Lane	Why it matters
1	`court judicial assistant teller`	source-evidence repair	The title/function mix is suspicious and should not graduate without source proof.
14	`Associate`	broad context resolution	The title is too generic; reviewer must choose specific function, caveat, hold, or source-context path.
30	`Customer Service Assistant`	packet function change	Tests whether administration to customer success is valid for a packet-quality candidate.
45	`Vice President Operations`	packet function change	Tests whether leadership strategy is better than operations for executive-context learning.
83	`Assistant Branch Manager`	caveat function change	Tests whether finance is the right runtime context despite upstream administration.

Artifact Inventory

Artifact	Rows	Bytes	SHA-256
`runtime_semantic_adjudication_v1.json`	150 lines	6,064	`dee1c80e...`
`runtime_semantic_adjudication_v1.jsonl`	526	2,402,826	`7bb9d9c4...`
`runtime_semantic_adjudication_v1_batch_001.jsonl`	100	457,476	`c646f01f...`
`runtime_semantic_adjudication_v1_broad_context.jsonl`	12	55,280	`59593e69...`
`runtime_semantic_adjudication_v1_function_changes.jsonl`	479	2,185,813	`e58f65ce...`
`runtime_semantic_adjudication_v1_semantic_sample_50.jsonl`	50	228,471	`00e9a579...`
`runtime_semantic_adjudication_v1_source_evidence_repairs.jsonl`	35	161,733	`30c123a2...`
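Anyone picking up these artifacts can recompute the inventory figures before trusting them. The sketch below uses only the standard library; the function names are hypothetical. Because the digests in the table are truncated, the comparison is a prefix match rather than a full equality check.

```python
import hashlib
from pathlib import Path


def describe_artifact(path: Path, chunk_size: int = 1 << 20) -> tuple[int, int, str]:
    """Return (line_count, byte_count, sha256_hex) for one file.

    line_count counts newline bytes, so a file without a trailing newline
    reports one fewer line than it has rows.
    """
    digest = hashlib.sha256()
    byte_count = 0
    line_count = 0
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
            byte_count += len(chunk)
            line_count += chunk.count(b"\n")
    return line_count, byte_count, digest.hexdigest()


def matches_prefix(sha256_hex: str, published_prefix: str) -> bool:
    """Compare a full digest against a truncated one such as 'dee1c80e...'."""
    return sha256_hex.startswith(published_prefix.rstrip(".").lower())
```

A prefix match on eight hex characters is a sanity check, not a proof of integrity; the full digest should be used wherever it is available.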

Validation

01 Focused: New module, CLI, and test surface linted.

02 Pipeline: Focused runtime pipeline tests passed: 11 tests.

03 Artifact: Generated 526 adjudication rows with all checks true.

04 Full: Ruff passed and full pytest passed: 201 tests.

Codex · Validate The Slice · prove code and artifact health

.venv/bin/python -m ruff check src tests
.venv/bin/python -m pytest -q

201 passed in 101.12s

Watch-out: passing tests prove mechanics, not semantic approval. The adjudication rows still need review decisions.

What This Means In Reality

This does not mean the 526 follow-up rows are approved. It means they are now ready for serious multi-LLM review. The engine can say: here are the rows where deterministic specialization changed runtime behavior; here are the rows still too broad; here are the rows with source-evidence gaps; here is exactly what a reviewer must decide for each one.

This moves the data engine closer to serving a broad ICP universe because it converts semantic uncertainty into a controlled adjudication queue. The platform can now harden high-confidence rows, revise questionable mappings, attach source refs where needed, and keep risky rows out of runtime claims.

Next Best Slice

Run multi-LLM adjudication against runtime_semantic_adjudication_v1_batch_001.jsonl, then merge the decisions into a deterministic replay artifact. The first batch is intentionally high-leverage: it includes source-evidence gaps, sensitive roles, broad residuals, and packet-quality function changes.
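One way the merge step could work is sketched below. The merge rule has not been decided in this slice, so everything here is an assumption: the row identifiers, the decision labels, and the rule itself (a strict majority wins, anything else falls back to a hold). Sorting by row identifier keeps the output deterministic regardless of the order in which reviewer outputs arrive.

```python
from collections import Counter


def merge_decisions(
    votes_by_row: dict[str, list[str]], fallback: str = "hold"
) -> list[dict]:
    """Merge per-row reviewer votes into one decision per row.

    Hypothetical rule: a decision wins only with a strict majority of votes;
    ties, splits, and rows with no votes get the fallback label.
    """
    merged = []
    for row_id in sorted(votes_by_row):
        votes = votes_by_row[row_id]
        decision = fallback
        for candidate, count in Counter(votes).items():
            # At most one candidate can hold more than half the votes.
            if count * 2 > len(votes):
                decision = candidate
                break
        merged.append(
            {"row_id": row_id, "decision": decision, "votes": sorted(votes)}
        )
    return merged
```

Keeping the raw votes next to the merged decision means a later replay can be audited without going back to the individual reviewer outputs.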

Where To Start

Start from artifacts/validation/runtime_semantic_adjudication_v1_batch_001.jsonl. It is the next review-ready unit: 100 rows, ranked, prompted, scoped, and decision-schema-bound.