AINA Data Engine Room · 2026-06-11

Runtime Semantic Adjudication Handoff

A review-ready surface for the 526 semantic follow-up rows created by the runtime eval chain.

Ali Mehdi Mukadam · co-authored with Codex · 6 minute read · /srv/aina/aina-data-engine-room

The Single Idea

The runtime engine now has a concrete review surface for the 526 semantic follow-up rows created by the local eval chain. Instead of leaving reviewers with raw eval-run JSONL, this slice turns those rows into prioritized lanes, 100-row batches, per-row review prompts, decision schemas, and split artifacts for function changes, broad title cleanup, and source-evidence repair.

Before: Semantic follow-up existed as raw eval rows with flags. Useful, but awkward for multi-LLM review.
After: Every follow-up row has a lane, priority, batch, prompt, decision schema, and scoped evidence snapshot.

  1. What changed
  2. Before and after
  3. Review lanes
  4. Semantic sanity check
  5. Artifact inventory
  6. Validation
  7. What this means
  8. Next best slice

01

What Changed

I added src/aina_data_engine/runtime_semantic_adjudication.py and wired it into the CLI as:

Codex · Generate The Adjudication Queue · local-only review surface
.venv/bin/aina-data-engine --root /srv/aina/aina-data-engine-room runtime-semantic-adjudication --batch-size 100
Watch-out: do not treat generated review prompts as approvals; they are prompts for adjudication.

The command reads runtime_eval_runs_v1_semantic_followup_eval_runs.jsonl, preserves the 526-row count, and writes a new local-only artifact family under /srv/aina/aina-data-engine-room/artifacts/validation/.
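
The batching arithmetic behind `--batch-size 100` can be sketched as follows. This is a minimal illustration, assuming plain fixed-size chunking over rows that are already in priority order; the helper names `read_jsonl` and `chunk` and the stand-in rows are hypothetical and not taken from the module.

```python
import json
from pathlib import Path


def read_jsonl(path: Path) -> list[dict]:
    """Load one JSON object per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def chunk(rows: list[dict], batch_size: int) -> list[list[dict]]:
    """Split rows into consecutive batches of at most batch_size rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


# Stand-in rows (hypothetical); the real input would come from read_jsonl(...).
rows = [{"row_index": i} for i in range(526)]
batches = chunk(rows, 100)

# The row count must be preserved across batching.
assert sum(len(batch) for batch in batches) == len(rows)
print(len(batches), len(batches[-1]))
```

With 526 rows and a batch size of 100, this yields five full batches and a final batch of 26 rows.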

Every adjudication row includes title, source function, runtime function, fixture lane, eval status, semantic flags, review lane, priority score, batch assignment, prompt, decision schema, evidence snapshot, and local-only scope.
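
One way to picture the row shape is a typed dictionary with one key per listed field. This is only a sketch: the key names and value types below are assumptions inferred from the field list above, not the schema the module actually emits.

```python
from typing import TypedDict


class AdjudicationRow(TypedDict):
    """Hypothetical row shape; key names and types are assumptions."""
    title: str
    source_function: str
    runtime_function: str
    fixture_lane: str
    eval_status: str
    semantic_flags: list[str]
    review_lane: str
    priority_score: float
    batch: int
    prompt: str
    decision_schema: dict
    evidence_snapshot: dict
    scope: str  # expected to mark the row as local-only


REQUIRED_KEYS = frozenset(AdjudicationRow.__annotations__)


def missing_keys(row: dict) -> set[str]:
    """Return the required keys that are absent from a candidate row."""
    return set(REQUIRED_KEYS - set(row))


# A deliberately incomplete row to show the check.
partial_row = {"title": "Associate", "review_lane": "resolve_broad_general_business_context"}
print(sorted(missing_keys(partial_row)))
```

A check like this is cheap to run before handing a batch to reviewers, since a row with a missing prompt or decision schema cannot be adjudicated.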

02

Before And After

| Surface | Before | After |
| --- | --- | --- |
| Runtime eval rows | 1,000 | 1,000 |
| Semantic follow-up rows | 526 raw rows | 526 adjudication rows |
| Review batches | none | 6 batches of 100 or fewer |
| First batch | none | 100 prioritized rows |
| Broad residual queue | implicit in flags | 12-row split artifact |
| Source-evidence repair queue | implicit in flags | 35-row split artifact |
| Function-change queue | implicit in flags | 479-row split artifact |
| Model calls made | 0 | 0 |
| Production claims allowed | no | no |

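
The split counts in the table can be reconciled against the 526-row total with a few lines of arithmetic. The sketch below assumes the three split artifacts are mutually exclusive, which the counts suggest but the report does not state; the dictionary keys and the `reconcile` helper are illustrative names.

```python
EXPECTED_TOTAL = 526

# Row counts copied from the table; keys are illustrative labels.
SPLIT_ROWS = {
    "function_changes": 479,
    "source_evidence_repairs": 35,
    "broad_context": 12,
}


def reconcile(split_rows: dict[str, int], expected_total: int) -> int:
    """Return the split total minus the expected total (0 means reconciled)."""
    return sum(split_rows.values()) - expected_total


difference = reconcile(SPLIT_ROWS, EXPECTED_TOTAL)
assert difference == 0, f"split artifacts are off by {difference} rows"
```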
03

Review Lanes

| Lane | Rows | What reviewers decide |
| --- | --- | --- |
| adjudicate_caveat_function_change | 444 | Whether a deterministic function override is semantically right for a caveated local runtime row. |
| adjudicate_packet_function_change | 35 | Whether a packet-quality candidate can keep the specialized runtime function, needs revision, or should be downgraded. |
| recover_source_evidence_before_runtime | 35 | Whether to attach source refs and rerun, hold for source mining, or exclude as not ICP. |
| resolve_broad_general_business_context | 12 | Whether broad residual titles can be assigned a specific function, kept broad with caveat, held, or require source context. |

The first batch starts with source-evidence and sensitive-domain risks, then broad residuals, then packet-quality function changes. Front-loading the riskiest rows makes the next reviewer run high-leverage.
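
The ordering described above can be sketched as a two-level sort: lane first, then priority score within the lane. This is an assumption-laden illustration. Placing the caveat lane last is a guess, sensitive-domain handling is left out because the report does not say how it is flagged, and the key names `review_lane` and `priority_score` are hypothetical.

```python
# Assumed lane order; only the first three positions are described in the report.
LANE_ORDER = [
    "recover_source_evidence_before_runtime",
    "resolve_broad_general_business_context",
    "adjudicate_packet_function_change",
    "adjudicate_caveat_function_change",
]
LANE_RANK = {lane: rank for rank, lane in enumerate(LANE_ORDER)}


def review_order(rows: list[dict]) -> list[dict]:
    """Sort by lane rank, then by descending priority score within a lane."""
    return sorted(
        rows,
        key=lambda row: (LANE_RANK[row["review_lane"]], -row["priority_score"]),
    )


# Illustrative rows only; titles and scores are placeholders.
example_rows = [
    {"title": "a", "review_lane": "adjudicate_caveat_function_change", "priority_score": 0.9},
    {"title": "b", "review_lane": "recover_source_evidence_before_runtime", "priority_score": 0.2},
    {"title": "c", "review_lane": "recover_source_evidence_before_runtime", "priority_score": 0.7},
    {"title": "d", "review_lane": "resolve_broad_general_business_context", "priority_score": 0.5},
]
ordered_titles = [row["title"] for row in review_order(example_rows)]
print(ordered_titles)
```

Because `sorted` is stable, rows with identical lane and score keep their input order, which keeps batch assignment reproducible across runs.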
04

Semantic Sanity Check

I inspected the 50-row sample with rank, batch, review lane, priority score, title, source function, runtime function, flags, and artifact under test. The sample is not an auto-approval; it is a useful map of where deterministic rules might be right, overreaching, or under-sourced.

| Rank | Title | Lane | Why it matters |
| --- | --- | --- | --- |
| 1 | court judicial assistant teller | source-evidence repair | The title/function mix is suspicious and should not graduate without source proof. |
| 14 | Associate | broad context resolution | The title is too generic; reviewer must choose specific function, caveat, hold, or source-context path. |
| 30 | Customer Service Assistant | packet function change | Tests whether administration to customer success is valid for a packet-quality candidate. |
| 45 | Vice President Operations | packet function change | Tests whether leadership strategy is better than operations for executive-context learning. |
| 83 | Assistant Branch Manager | caveat function change | Tests whether finance is the right runtime context despite upstream administration. |

05

Artifact Inventory

| Artifact | Rows | Bytes | SHA-256 |
| --- | --- | --- | --- |
| runtime_semantic_adjudication_v1.json | 150 lines | 6,064 | dee1c80e... |
| runtime_semantic_adjudication_v1.jsonl | 526 | 2,402,826 | 7bb9d9c4... |
| runtime_semantic_adjudication_v1_batch_001.jsonl | 100 | 457,476 | c646f01f... |
| runtime_semantic_adjudication_v1_broad_context.jsonl | 12 | 55,280 | 59593e69... |
| runtime_semantic_adjudication_v1_function_changes.jsonl | 479 | 2,185,813 | e58f65ce... |
| runtime_semantic_adjudication_v1_semantic_sample_50.jsonl | 50 | 228,471 | 00e9a579... |
| runtime_semantic_adjudication_v1_source_evidence_repairs.jsonl | 35 | 161,733 | 30c123a2... |

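
An inventory like this can be re-checked on disk by recomputing each file's row count, byte size, and digest. The sketch below uses only the standard library and compares against a hash prefix, since the table shows truncated digests. The helper names are hypothetical, and the demo runs on a throwaway temporary file rather than the real artifacts.

```python
import hashlib
import tempfile
from pathlib import Path


def describe_artifact(path: Path) -> dict:
    """Recompute non-empty line count, byte size, and SHA-256 for a file."""
    data = path.read_bytes()
    return {
        "rows": sum(1 for line in data.splitlines() if line.strip()),
        "bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def matches_inventory(path: Path, rows: int, size: int, sha_prefix: str) -> bool:
    """Compare a file against one inventory entry, using a digest prefix."""
    info = describe_artifact(path)
    return (
        info["rows"] == rows
        and info["bytes"] == size
        and info["sha256"].startswith(sha_prefix)
    )


# Demo on a temporary two-line JSONL file (not a real artifact).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.jsonl"
    demo.write_bytes(b'{"a": 1}\n{"a": 2}\n')
    info = describe_artifact(demo)
    ok = matches_inventory(demo, rows=2, size=18, sha_prefix=info["sha256"][:8])
    stale = matches_inventory(demo, rows=3, size=18, sha_prefix=info["sha256"][:8])
```

A prefix match is weaker than a full-digest match, so a stricter check would store and compare the complete 64-character digest.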
06

Validation

01 Focused: New module, CLI, and test surface linted.
02 Pipeline: Focused runtime pipeline tests passed: 11 tests.
03 Artifact: Generated 526 adjudication rows with all checks true.
04 Full: Ruff passed and full pytest passed: 201 tests.

Codex · Validate The Slice · prove code and artifact health
.venv/bin/python -m ruff check src tests
.venv/bin/python -m pytest -q

201 passed in 101.12s
Watch-out: passing tests prove mechanics, not semantic approval. The adjudication rows still need review decisions.
07

What This Means In Reality

This does not mean the 526 follow-up rows are approved. It means they are now ready for serious multi-LLM review. The engine can say: here are the rows where deterministic specialization changed runtime behavior; here are the rows still too broad; here are the rows with source-evidence gaps; here is exactly what a reviewer must decide for each one.

This moves the data engine closer to serving a broad ICP universe because it converts semantic uncertainty into a controlled adjudication queue. The platform can now harden high-confidence rows, revise questionable mappings, attach source refs where needed, and keep risky rows out of runtime claims.

08

Next Best Slice

Run multi-LLM adjudication against runtime_semantic_adjudication_v1_batch_001.jsonl, then merge the decisions into a deterministic replay artifact. The first batch is intentionally high-leverage: it includes source-evidence gaps, sensitive roles, broad residuals, and packet-quality function changes.
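
How decisions get merged is not specified yet, so the following is only one possible shape for that step: a plurality vote per row, with ties falling back to a hold. The decision labels (`approve`, `revise`, `hold`), the row identifiers, and the `merge_decisions` helper are all hypothetical.

```python
from collections import Counter


def merge_decisions(votes: dict[str, list[str]], fallback: str = "hold") -> dict[str, str]:
    """Merge per-row reviewer votes; ties and empty vote lists use the fallback.

    Rows are processed in sorted key order so repeated runs over the same
    votes produce the same output in the same order.
    """
    merged: dict[str, str] = {}
    for row_id in sorted(votes):
        counts = Counter(votes[row_id]).most_common()
        if not counts:
            merged[row_id] = fallback
            continue
        top_label, top_count = counts[0]
        tied = len(counts) > 1 and counts[1][1] == top_count
        merged[row_id] = fallback if tied else top_label
    return merged


# Placeholder votes from three hypothetical reviewers.
example_votes = {
    "row-002": ["approve", "approve", "revise"],
    "row-001": ["approve", "revise"],
    "row-003": [],
}
merged = merge_decisions(example_votes)
print(merged)
```

Falling back to a hold on ties is the conservative choice here: a contested row stays out of runtime claims until a reviewer resolves it.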

Where To Start

Start from artifacts/validation/runtime_semantic_adjudication_v1_batch_001.jsonl. It is the next review-ready unit: 100 rows, ranked, prompted, scoped, and decision-schema-bound.