Runtime Semantic Adjudication Handoff
A review-ready surface for the 526 semantic follow-up rows created by the runtime eval chain.
Instead of leaving reviewers with raw eval-run JSONL, this slice turns the 526 semantic follow-up rows from the local eval chain into prioritized lanes, 100-row batches, per-row review prompts, decision schemas, and split artifacts for function changes, broad title cleanup, and source-evidence repair.
- 01 What changed
- 02 Before and after
- 03 Review lanes
- 04 Semantic sanity check
- 05 Artifact inventory
- 06 Validation
- 07 What this means
- 08 Next best slice
What Changed
I added src/aina_data_engine/runtime_semantic_adjudication.py and wired it into the CLI as:
.venv/bin/aina-data-engine --root /srv/aina/aina-data-engine-room runtime-semantic-adjudication --batch-size 100
The command reads runtime_eval_runs_v1_semantic_followup_eval_runs.jsonl, preserves the 526-row count, and writes a new local-only artifact family under /srv/aina/aina-data-engine-room/artifacts/validation/.
Every adjudication row includes title, source function, runtime function, fixture lane, eval status, semantic flags, review lane, priority score, batch assignment, prompt, decision schema, evidence snapshot, and local-only scope.
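A minimal sketch of how a reviewer could confirm that a row carries every field listed above. The key names are assumptions: they are snake_case guesses derived from the field list, and the real artifact may spell them differently. `missing_keys` is a hypothetical helper, not part of the CLI.

```python
import json

# Hypothetical key names inferred from the field list above; adjust them
# to match the actual JSONL before relying on this check.
EXPECTED_KEYS = frozenset({
    "title", "source_function", "runtime_function", "fixture_lane",
    "eval_status", "semantic_flags", "review_lane", "priority_score",
    "batch", "prompt", "decision_schema", "evidence_snapshot", "scope",
})


def missing_keys(jsonl_line: str) -> set[str]:
    """Return the expected keys that are absent from one JSONL row."""
    row = json.loads(jsonl_line)
    return set(EXPECTED_KEYS - set(row))
```

Running this over each line of runtime_semantic_adjudication_v1.jsonl and collecting non-empty results would flag incomplete rows before review starts.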
Before And After
| Surface | Before | After |
|---|---|---|
| Runtime eval rows | 1,000 | 1,000 |
| Semantic follow-up rows | 526 raw rows | 526 adjudication rows |
| Review batches | none | 6 batches of 100 or fewer |
| First batch | none | 100 prioritized rows |
| Broad residual queue | implicit in flags | 12-row split artifact |
| Source-evidence repair queue | implicit in flags | 35-row split artifact |
| Function-change queue | implicit in flags | 479-row split artifact |
| Model calls made | 0 | 0 |
| Production claims allowed | no | no |
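The "6 batches of 100 or fewer" figure follows from the row count and the --batch-size value. A short sketch of that arithmetic, assuming batches are filled in order and only the last one is short (`batch_sizes` is a hypothetical helper):

```python
def batch_sizes(total_rows: int, batch_size: int) -> list[int]:
    """Split total_rows into full batches plus one short trailing batch."""
    full, remainder = divmod(total_rows, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])


sizes = batch_sizes(526, 100)
print(sizes)  # [100, 100, 100, 100, 100, 26]
```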
Review Lanes
| Lane | Rows | What reviewers decide |
|---|---|---|
| adjudicate_caveat_function_change | 444 | Whether a deterministic function override is semantically right for a caveated local runtime row. |
| adjudicate_packet_function_change | 35 | Whether a packet-quality candidate can keep the specialized runtime function, needs revision, or should be downgraded. |
| recover_source_evidence_before_runtime | 35 | Whether to attach source refs and rerun, hold for source mining, or exclude as not ICP. |
| resolve_broad_general_business_context | 12 | Whether broad residual titles can be assigned a specific function, kept broad with caveat, held, or require source context. |
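The lane counts can be cross-checked against the split artifacts. The sketch below assumes the two function-change lanes feed the function-changes split; that mapping is inferred from the counts (444 + 35 = 479), not read from the code, and the short split names are illustrative labels.

```python
from collections import Counter

# Row counts per review lane, copied from the table above.
LANE_ROWS = {
    "adjudicate_caveat_function_change": 444,
    "adjudicate_packet_function_change": 35,
    "recover_source_evidence_before_runtime": 35,
    "resolve_broad_general_business_context": 12,
}

# Assumed lane-to-split mapping, inferred from the reported counts.
SPLIT_OF_LANE = {
    "adjudicate_caveat_function_change": "function_changes",
    "adjudicate_packet_function_change": "function_changes",
    "recover_source_evidence_before_runtime": "source_evidence_repairs",
    "resolve_broad_general_business_context": "broad_context",
}

split_rows = Counter()
for lane, rows in LANE_ROWS.items():
    split_rows[SPLIT_OF_LANE[lane]] += rows

total_rows = sum(LANE_ROWS.values())
print(total_rows, dict(split_rows))
```

Under that mapping the four lanes account for all 526 follow-up rows, and the three splits come out at 479, 35, and 12 rows.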
Semantic Sanity Check
I inspected the 50-row sample, which lists rank, batch, review lane, priority score, title, source function, runtime function, flags, and artifact under test for each row. The sample is not an auto-approval list; it is a map of where deterministic rules might be right, overreaching, or under-sourced.
| Rank | Title | Lane | Why it matters |
|---|---|---|---|
| 1 | court judicial assistant teller | source-evidence repair | The title/function mix is suspicious and should not graduate without source proof. |
| 14 | Associate | broad context resolution | The title is too generic; reviewer must choose specific function, caveat, hold, or source-context path. |
| 30 | Customer Service Assistant | packet function change | Tests whether administration to customer success is valid for a packet-quality candidate. |
| 45 | Vice President Operations | packet function change | Tests whether leadership strategy is better than operations for executive-context learning. |
| 83 | Assistant Branch Manager | caveat function change | Tests whether finance is the right runtime context despite upstream administration. |
Artifact Inventory
| Artifact | Rows | Bytes | SHA-256 |
|---|---|---|---|
| runtime_semantic_adjudication_v1.json | 150 lines | 6,064 | dee1c80e... |
| runtime_semantic_adjudication_v1.jsonl | 526 | 2,402,826 | 7bb9d9c4... |
| runtime_semantic_adjudication_v1_batch_001.jsonl | 100 | 457,476 | c646f01f... |
| runtime_semantic_adjudication_v1_broad_context.jsonl | 12 | 55,280 | 59593e69... |
| runtime_semantic_adjudication_v1_function_changes.jsonl | 479 | 2,185,813 | e58f65ce... |
| runtime_semantic_adjudication_v1_semantic_sample_50.jsonl | 50 | 228,471 | 00e9a579... |
| runtime_semantic_adjudication_v1_source_evidence_repairs.jsonl | 35 | 161,733 | 30c123a2... |
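A reviewer picking up these files can confirm they match the inventory before starting. This is a minimal sketch, assuming the artifacts are readable locally; `check_artifact` is a hypothetical helper and not part of the CLI. Because the hashes above are truncated, it can only compare the leading hex characters.

```python
import hashlib
from pathlib import Path


def check_artifact(path: Path, expected_bytes: int, sha256_prefix: str) -> bool:
    """Compare a file's size and leading SHA-256 hex digits to the inventory."""
    data = path.read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    return len(data) == expected_bytes and digest.startswith(sha256_prefix)
```

For example, `check_artifact(Path("artifacts/validation/runtime_semantic_adjudication_v1_batch_001.jsonl"), 457476, "c646f01f")` should return True if the first batch is unchanged. A prefix match is weaker than a full-digest match, so it is a sanity check rather than an integrity guarantee.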
Validation
.venv/bin/python -m ruff check src tests
.venv/bin/python -m pytest -q
201 passed in 101.12s
What This Means In Reality
This does not mean the 526 follow-up rows are approved. It means they are now ready for serious multi-LLM review. The engine can say: here are the rows where deterministic specialization changed runtime behavior; here are the rows still too broad; here are the rows with source-evidence gaps; here is exactly what a reviewer must decide for each one.
This moves the data engine closer to serving a broad ICP universe because it converts semantic uncertainty into a controlled adjudication queue. The platform can now harden high-confidence rows, revise questionable mappings, attach source refs where needed, and keep risky rows out of runtime claims.
Next Best Slice
Run multi-LLM adjudication against runtime_semantic_adjudication_v1_batch_001.jsonl, then merge the decisions into a deterministic replay artifact. The first batch is intentionally high-leverage: it includes source-evidence gaps, sensitive roles, broad residuals, and packet-quality function changes.
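One way the merge step could stay deterministic is a strict-majority vote per row, with anything short of a majority held for a human. This is only a sketch of that idea, not the planned implementation; the decision labels and the `merge_decisions` helper are hypothetical.

```python
from collections import Counter


def merge_decisions(votes: list[str], hold_label: str = "hold") -> str:
    """Return the strict-majority decision, or hold_label when there is none."""
    if not votes:
        return hold_label
    top_label, top_count = Counter(votes).most_common(1)[0]
    # A label wins only if more than half of the reviewers chose it.
    if top_count * 2 <= len(votes):
        return hold_label
    return top_label


print(merge_decisions(["approve", "approve", "revise"]))  # approve
print(merge_decisions(["approve", "revise"]))             # hold
```

Holding ties instead of breaking them keeps the replay independent of reviewer ordering, so the same votes always produce the same merged decision.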
Start from artifacts/validation/runtime_semantic_adjudication_v1_batch_001.jsonl. It is the next review-ready unit: 100 rows, ranked, prompted, scoped, and decision-schema-bound.