# Gemini Workflow Embedding Slice Handoff

Date: 2026-06-12
Repo: `/srv/aina/aina-data-engine-room`
Branch: `ali/ain-506-p0-gate-2026-06-12`
Issues: AIN-506, AIN-508, AIN-510
Author: Ali Mehdi Mukadam - co-authored with Codex

## The Single Idea

This follow-up converted the first clean workflow family into live Gemini vectors without widening the blast radius. AINA now has 43 `jobs_research_workflow` vectors in the shared Gemini Embedding 2 snapshot, 7 exact quality exclusions enforced by the runner, and a cleaner artifact policy where embedding bulk stays on disk but out of git.

## What Changed

1. `jobs_research_workflow` was repaired into 89 candidate chunks. Placeholder fields, doubtful label fragments, and repeated instruction prefixes were removed from `text_for_embedding` before any live call.
2. A 50-row adversarial semantic spot check reviewed the repaired workflow slice. Result: 37 pass, 6 caution, 7 block. The 7 block rows were written to `production_embedding_quality_exclusions_v1.json`.
3. The live embedding runner now reads that quality-exclusion receipt and prunes exact `(chunk_id, text_hash)` rows before dry-run or live selection.
4. The runner now loads source-scoped repaired corpora for source-scoped live runs. This keeps stale overlays from older combined repair files out of a live source-family run.
5. Scoped live runs now preserve existing out-of-family repaired vectors even when `--include-repaired` is enabled. The dry run explicitly proved `orphan_existing_vector_count_pruned: 0` before the live call.
6. A controlled Vertex Gemini Embedding 2 live run embedded the 43 reviewed workflow chunks that were not blocked (37 pass + 6 caution) with 0 failed rows.
7. `artifacts/embeddings/` was removed from the git index while kept on disk. Bulk Parquet, vectors, DuckDB, sidecar files, and batch manifests are now ignored; receipts remain tracked.
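
The pruning and scoping behavior in items 3 to 5 can be illustrated with a minimal sketch. This is not the runner's actual code: the `Chunk` class, both function names, the sample IDs, and the `serviceable_title` family label are hypothetical, and the exclusion receipt is assumed to be a list of objects with `chunk_id` and `text_hash` keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    # Hypothetical record shape; the real corpus schema may differ.
    chunk_id: str
    text_hash: str
    source_family: str


def prune_excluded(candidates, exclusions):
    """Drop candidates whose exact (chunk_id, text_hash) pair is excluded."""
    blocked = {(e["chunk_id"], e["text_hash"]) for e in exclusions}
    return [c for c in candidates if (c.chunk_id, c.text_hash) not in blocked]


def select_for_scoped_run(candidates, existing_keys, source_family, max_new):
    """Pick up to max_new not-yet-embedded chunks from one source family.

    The function only returns new work. It never builds a prune list, so
    existing vectors from other families are left alone.
    """
    in_scope = [c for c in candidates if c.source_family == source_family]
    new = [c for c in in_scope if (c.chunk_id, c.text_hash) not in existing_keys]
    return new[:max_new]


# Toy data standing in for the repaired corpus and the exclusion receipt.
candidates = [
    Chunk("wf-001", "aaa", "jobs_research_workflow"),
    Chunk("wf-002", "bbb", "jobs_research_workflow"),
    Chunk("wf-003", "ccc", "jobs_research_workflow"),
    Chunk("title-001", "ddd", "serviceable_title"),
]
exclusions = [{"chunk_id": "wf-002", "text_hash": "bbb"}]
existing_keys = {("title-001", "ddd")}  # already embedded, out of family

eligible = prune_excluded(candidates, exclusions)
selected = select_for_scoped_run(eligible, existing_keys, "jobs_research_workflow", 10)
print([c.chunk_id for c in selected])  # ['wf-001', 'wf-003']
```

In this sketch, matching on the full pair rather than on `chunk_id` alone means a chunk whose text is repaired again, and therefore gets a new hash, becomes a candidate again and needs a fresh review.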

## Current Vector State

| Metric | Value |
|---|---:|
| Total Gemini vectors | 6,077 |
| New workflow vectors this slice | 43 |
| `jobs_research_workflow` vector count | 43 |
| Top 1,000 title vectors | 1,000 |
| Top 500 hardening vectors | 500 |
| Semantic review vectors | 500 |
| Serviceable title vectors | 3,440 |
| Failed Gemini rows in live slice | 0 |
| Stale vectors in AIN-510 | 0 |
| Cosine gap | 0.207005 |

AIN-510 now sees workflow vectors for the first time, but runtime promotion remains correctly blocked.
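
This handoff does not define how the cosine gap is computed. One common reading, assumed here and not confirmed by the gate receipt, is the mean cosine similarity over known similar pairs minus the mean over known dissimilar pairs. A minimal sketch under that assumption, with hypothetical function names and toy vectors:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_gap(similar_pairs, dissimilar_pairs):
    """ASSUMED definition: mean(similar) - mean(dissimilar)."""
    mean_sim = sum(cosine(a, b) for a, b in similar_pairs) / len(similar_pairs)
    mean_dis = sum(cosine(a, b) for a, b in dissimilar_pairs) / len(dissimilar_pairs)
    return mean_sim - mean_dis


# Toy 2-d vectors, not real embeddings.
similar_pairs = [([1.0, 0.0], [1.0, 0.0])]      # identical direction
dissimilar_pairs = [([1.0, 0.0], [0.0, 1.0])]   # orthogonal
gap = cosine_gap(similar_pairs, dissimilar_pairs)
print(gap)
```

Under this reading, a larger gap means the model separates the known similar pairs from the known dissimilar ones more cleanly.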

## AIN-510 Status

`ain-510-retrieval-promotion-gate` is valid and current.

- Status: `blocked_for_runtime_promotion`
- Promotion eligible: `false`
- Raw vector rows: `6,077`
- Valid vector rows: `6,077`
- Workflow vector count: `43`
- Curriculum vector count: `252`
- Evaluator vector count: `220`
- Known similar pairs: `50`
- Known dissimilar pairs: `50`
- Failed promotion checks:
  - `gate_4_sensitive_mismatch_fixture_suite_present`
  - `gate_5_runtime_rollback_proof_present`

Plain English: embeddings are now useful internal retrieval material, including workflow anchors, but they are not public/runtime authority until mismatch fixtures and rollback proof exist.
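
The gate outcome above follows an all-checks-must-pass pattern. A minimal sketch of that logic, not the gate's actual implementation: `evaluate_gate`, the passing-case status string, and the first check name are hypothetical, while the two failing check names are taken from the list above.

```python
def evaluate_gate(checks):
    """Return a gate summary; promotion requires every check to pass."""
    failed = [name for name, passed in checks.items() if not passed]
    return {
        # "eligible_for_runtime_promotion" is an invented label for the passing case.
        "status": (
            "blocked_for_runtime_promotion"
            if failed
            else "eligible_for_runtime_promotion"
        ),
        "promotion_eligible": not failed,
        "failed_promotion_checks": failed,
    }


checks = {
    "hypothetical_vector_rows_valid": True,  # placeholder, not a real gate name
    "gate_4_sensitive_mismatch_fixture_suite_present": False,
    "gate_5_runtime_rollback_proof_present": False,
}
result = evaluate_gate(checks)
print(result["status"])
print(result["failed_promotion_checks"])
```

With this shape, adding the mismatch fixtures and rollback proof flips the two failing checks, and eligibility follows without any other change.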

## Important Artifacts

- Quality exclusions: `/srv/aina/aina-data-engine-room/artifacts/validation/production_embedding_quality_exclusions_v1.json`
- Workflow spot check: `/srv/aina/aina-data-engine-room/artifacts/validation/workflow_repaired_candidate_spot_check_v1.json`
- Workflow spot-check rows: `/srv/aina/aina-data-engine-room/artifacts/validation/workflow_repaired_candidate_spot_check_v1_spot_check_50.jsonl`
- Live Gemini run receipt: `/srv/aina/aina-data-engine-room/artifacts/validation/ain_506_live_gemini_embedding_run_v1.json`
- AIN-510 gate: `/srv/aina/aina-data-engine-room/artifacts/validation/ain_510_retrieval_promotion_gate_v1.json`
- Source start page: `/srv/aina/aina-data-engine-room/artifacts/validation/source_authority_start_here_v1.json`
- Local vector snapshot, ignored by git: `/srv/aina/aina-data-engine-room/artifacts/embeddings/production/vectors/model=gemini-embedding-2/dim=768/schema_version=embedding_contract_v1/ain_506_live_gemini_embedding_run_v1.parquet`

## Proof Commands That Passed

```bash
uv run pytest tests/test_production_embeddings.py tests/test_embedding_contracts.py tests/test_source_authority_start_here.py -q
uv run ruff check src/aina_data_engine/production_embeddings.py src/aina_data_engine/production_embedding_eligibility.py src/aina_data_engine/cli.py tests/test_production_embeddings.py tests/test_embedding_contracts.py tests/test_source_authority_start_here.py
uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-start-here
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-506-p0-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room validate
git diff --check
```

The targeted test suite now reports `53 passed`.

## Artifact Policy Status

The repo no longer expects embedding bulk to be committed. `artifacts/embeddings/` is intentionally ignored and has been removed from the git index while remaining present on the VDS. This includes production chunks, vector Parquet, sidecar DuckDB, repaired Parquet, and Gemini batch manifests.

Track these instead:

- `artifacts/validation/*.json`
- selected small QA JSONL receipts such as `*_quality_pairs.jsonl` and `*_spot_check*.jsonl`
- `artifacts/reports/*.md`
- `artifacts/reports/*.html`
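
To audit a file list against this policy, a small helper along these lines may be useful. It is a sketch, not part of the repo: the function name is hypothetical, and placing the QA JSONL receipts under `artifacts/validation/` is an assumption based on where the spot-check rows live.

```python
from fnmatch import fnmatch

IGNORED_PREFIX = "artifacts/embeddings/"
TRACKED_PATTERNS = [
    "artifacts/validation/*.json",
    # ASSUMPTION: QA JSONL receipts sit in artifacts/validation/.
    "artifacts/validation/*_quality_pairs.jsonl",
    "artifacts/validation/*_spot_check*.jsonl",
    "artifacts/reports/*.md",
    "artifacts/reports/*.html",
]


def is_tracked_receipt(path):
    """True if a repo-relative path should be committed under the policy."""
    if path.startswith(IGNORED_PREFIX):
        return False  # embedding bulk stays on disk, out of git
    # Note: fnmatch's "*" also matches "/", so nested paths match too.
    return any(fnmatch(path, pattern) for pattern in TRACKED_PATTERNS)


paths = [
    "artifacts/validation/ain_510_retrieval_promotion_gate_v1.json",
    "artifacts/validation/workflow_repaired_candidate_spot_check_v1_spot_check_50.jsonl",
    "artifacts/embeddings/production/vectors/example.parquet",  # hypothetical file
]
for p in paths:
    print(p, is_tracked_receipt(p))
```

Feeding it the output of `git ls-files artifacts/` would flag any bulk file that is still in the index.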

## Linear Status

Posting to Linear was attempted but failed because the Linear session has expired (`401 auth_revoked`). Ready-to-post payloads for AIN-506, AIN-508, and AIN-510 are preserved here:

- `/srv/aina/aina-data-engine-room/artifacts/validation/linear_update_payloads_2026_06_12_gemini_workflow_embedding.json`
- `/srv/aina/aina-data-engine-room/artifacts/reports/linear_update_payloads_2026_06_12_gemini_workflow_embedding.md`
- `/srv/aina/aina-data-engine-room/artifacts/reports/linear_update_payloads_2026_06_12_gemini_workflow_embedding.html`

## Known Residual Work

The remaining 39 `jobs_research_workflow` repaired rows (89 candidates minus the 50 reviewed) were not batch-cleared. They passed deterministic placeholder/artifact scans, but only the first 50 rows received adversarial semantic review. Do not embed the remaining tail until it gets its own spot check and exclusions.

Workflow seed and combined workflow-family rows are still parked. Earlier repair showed many `workflow_seed` rows were taxonomy shells, not learner-facing semantic workflow assets.

AIN-510 still needs the sensitive mismatch fixture suite and rollback proof before runtime embedding authority can be promoted.

Some legacy non-embedding validation surfaces still use the old human-review language. The embedding lane itself now has an explicit check that human-review fields are absent.

## Exact Resume Commands

```bash
cd /srv/aina/aina-data-engine-room
git status --short --branch
uv run aina-data-engine --root /srv/aina/aina-data-engine-room source-authority-start-here
uv run aina-data-engine --root /srv/aina/aina-data-engine-room ain-510-retrieval-promotion-gate
uv run aina-data-engine --root /srv/aina/aina-data-engine-room gemini-embedding-run --source-family jobs_research_workflow --include-repaired --max-new 39 --dry-run --workers 8
```

Recommended next move: review the remaining 39 workflow rows semantically, add any new exclusions to `production_embedding_quality_exclusions_v1.json`, then run the next small `jobs_research_workflow` live slice. After workflow coverage is stable, build AIN-510 sensitive mismatch fixtures and rollback proof.

## Footer

Ali Mehdi Mukadam - co-authored with Codex - 2026-06-12

```yaml
topics:
  - aina-data-engine-room
  - gemini-embeddings
  - personalization-engine
subtopics:
  - clean-before-embed
  - workflow-vectors
  - artifact-policy
  - ain-506
  - ain-510
```