AIIL-VH
Validation & Test Harnesses
Evaluation Fabric
1.1. Purpose & Scope
This chapter defines the evaluation fabric that keeps ZAYAZ AI trustworthy over time. It operationalizes governance by turning requirements into repeatable tests, quantitative gates, and drift monitors.
Objectives
- Encode gold sets that represent standards-compliant answers.
- Enforce regression gates in CI/CD before any model/regex/adapter/compute promotion.
- Continuously track refusal accuracy, citation integrity, and numeric provenance.
- Provide shared drift monitoring workflows for SRE + dev.
Non-goals
- Manual QA as a primary control (we rely on automated harnesses).
- Open-ended subjective scoring (use rule-bound checks and reference answers).
1.2. Gold Set Maintenance
- Sources & Coverage
Gold sets reflect Phase-1 scope and expand per roadmap:
- ESRS: E1–E5, S1–S4, G1, ESRS 1/2 core datapoints.
- NACE mappings (Annex VI).
- IPCC/EFDB factor lookups (as documents; no custom numeric compute beyond allowed methods).
- Behavioral cases (refusal scenarios, tone, structure).
Target coverage
- ≥ 50 representative prompts per standard (balanced by topic).
- At least 10 refusal scenarios per standard (unsupported, missing context, OOS).
- 20 extraction-heavy prompts (citations, units, datapoints).
- Gold Set Format
Store in Git under eval/goldsets/<domain>/…
{
  "id": "E1-6-1_ghg_intensity_basic",
  "prompt": "Report our ESRS E1-6-1 GHG intensity and cite the source paragraph.",
  "expectations": {
    "citations": ["ESRS E1:L120-L140"],   // allowed alternatives supported
    "structure": ["Summary", "Evidence", "Limitations"],
    "metrics": {
      "requires_compute": true,
      "method_id": "GHG.intensity",
      "method_version": "1.0.0"
    }
  },
  "policy": {
    "refusal_if_no_context": true,
    "tone": "compliance"
  },
  "notes": "Checks correct datapoint reference, unit, and paragraph-level citation."
}
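Before a run, the harness can sanity-check each gold set file against this shape. A minimal sketch, assuming a load_goldsets helper that is illustrative and not part of the harness API:

import json
from pathlib import Path

REQUIRED_TOP_LEVEL = {"id", "prompt", "expectations", "policy"}

def load_goldsets(root="eval/goldsets"):
    """Yield parsed gold set cases, failing fast on malformed entries."""
    for path in sorted(Path(root).rglob("*.json")):
        case = json.loads(path.read_text(encoding="utf-8"))
        missing = REQUIRED_TOP_LEVEL - case.keys()
        if missing:
            raise ValueError(f"{path}: missing fields {sorted(missing)}")
        # Compute-backed cases must pin both method_id and method_version.
        metrics = case["expectations"].get("metrics", {})
        if metrics.get("requires_compute") and not (
            metrics.get("method_id") and metrics.get("method_version")
        ):
            raise ValueError(f"{path}: requires_compute without pinned method")
        yield case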
- Lifecycle & Ownership
- Owners: Standards team + AI behavior leads.
- Change control: PR with rationale, scope tag (minor|major), and impact matrix.
- Versioning: goldset_version increments; stored in evaluation results for reproducibility.
1.3. Regression Policy & Gates
- Gate Classes (must pass to promote)
| Gate | What it enforces | Threshold (Phase 1) |
|---|---|---|
| Citation Integrity | All required citations match citation_regex; paragraph-level and from retrieved context | ≥ 98% pass |
| Structure Adherence | Required sections present and in order (adapter profile) | ≥ 98% pass |
| Refusal Quality | When refusal required, template + reason code correct | ≥ 99% pass |
| Compute Determinism | If numbers appear, (method_id, version, provenance_id) present and valid | ≥ 99.9% pass |
| Numeric Provenance | Numeric outputs traceable to compute output; units valid | ≥ 99% pass |
| Latency SLO | e2e p95 within target region limits in harness | ≤ 2.5s |
| No-Regression | No statistically significant drop vs. last-known-good (McNemar check) | α = 0.05 |
- CI/CD Integration (example GitHub Actions)
name: eval-harness
on:
  pull_request:
    paths:
      - "eval/**"
      - "adapters/**"
      - "regex_library/**"
      - "compute/**"
      - "model_router/**"
jobs:
  run-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup env
        run: make eval-setup
      - name: Run goldsets
        run: eval/run --suite all --out results.json
      - name: Check gates
        run: eval/check --input results.json --policy eval/policies/phase1.json
- Policy File (phase1.json)
{
  "min_citation_integrity": 0.98,
  "min_structure_adherence": 0.98,
  "min_refusal_quality": 0.99,
  "min_compute_determinism": 0.999,
  "min_numeric_provenance": 0.99,
  "max_e2e_p95_seconds": 2.5,
  "stat_test": {"kind": "mcnemar", "alpha": 0.05}
}
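A minimal sketch of what eval/check could do with this policy. The results.json schema (a "cases" list of per-case gate booleans plus e2e_ms) is an assumption for illustration, not the definitive format:

import json
import statistics
import sys

GATE_KEYS = ("citation_integrity", "structure_adherence", "refusal_quality",
             "compute_determinism", "numeric_provenance")

def check_gates(results_path, policy_path):
    """Aggregate per-case gate outcomes and compare them against the phase policy."""
    cases = json.load(open(results_path, encoding="utf-8"))["cases"]
    policy = json.load(open(policy_path, encoding="utf-8"))
    failures = []
    for gate in GATE_KEYS:
        applicable = [c[gate] for c in cases if gate in c]   # skip cases the gate does not apply to
        if not applicable:
            continue
        rate = sum(applicable) / len(applicable)
        minimum = policy[f"min_{gate}"]
        if rate < minimum:
            failures.append(f"{gate}: {rate:.4f} < {minimum}")
    # Latency SLO: p95 over all cases, in seconds.
    p95 = statistics.quantiles([c["e2e_ms"] for c in cases], n=20)[-1] / 1000
    if p95 > policy["max_e2e_p95_seconds"]:
        failures.append(f"e2e p95 {p95:.2f}s above SLO")
    # The no-regression gate additionally compares per-case outcomes against the
    # last-known-good run using the configured stat test (McNemar at the policy alpha).
    for failure in failures:
        print(f"GATE FAIL: {failure}")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(check_gates(sys.argv[1], sys.argv[2]))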
1.4. Continuous Evaluation Metrics
- Definitions & Formulas
- Citation Integrity
- CI = (# answers where all citations match regex & retrieved range) / (answers requiring citations)
- Structure Adherence
- SA = (# answers passing adapter-required section checks) / (answers subject to adapter profile)
- Refusal Accuracy
- RA = (# refusals with correct template & reason) / (required refusals)
- Numeric Provenance
- NP = (# answers with numeric values referencing valid (method_id, version) + unit) / (answers with numbers)
- Determinism Presence
- DP = (# answers with provenance_id + method/version stitched to audit) / all answers
- Telemetry Emission (pseudocode)
emit("citation_integrity", pass_bool)
emit("structure_adherence", pass_bool)
emit("refusal_accuracy", pass_bool)
emit("numeric_provenance", pass_bool)
emit("determinism_presence", pass_bool)
emit_timing("e2e_ms", duration_ms)
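One way these emit calls could be wired to the Prometheus metric names used in the attachments; the emitter below is a sketch, not the production telemetry module:

from prometheus_client import Counter, Histogram

# One pass/total counter pair per gate, matching the *_pass_total / *_total
# naming convention referenced in the Grafana attachment.
_GATES = ("citation_integrity", "structure_adherence", "refusal_quality",
          "numeric_provenance", "determinism_presence")
_PASS = {g: Counter(f"{g}_pass_total", f"Answers passing {g}") for g in _GATES}
_TOTAL = {g: Counter(f"{g}_total", f"Answers evaluated for {g}") for g in _GATES}
_E2E = Histogram("e2e_ms", "End-to-end latency in milliseconds",
                 buckets=(250, 500, 1000, 1500, 2000, 2500, 5000))

def emit(gate, passed):
    """Record one gate outcome."""
    _TOTAL[gate].inc()
    if passed:
        _PASS[gate].inc()

def emit_timing(metric, duration_ms):
    """Record end-to-end latency; extend with retrieval/compute histograms as needed."""
    _E2E.observe(duration_ms)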
- Dashboards (key panels)
- Eval Trend: CI, SA, RA, NP weekly
- Gate Failures by Artifact: adapter version, regex version, compute method
- Latency Heatmap: e2e vs region, per route (RAG, compute, packaging)
- Top Refusal Reasons: policy gates (missing context, OOS, ACL)
1.5. Test Types & Harness Structure
- Test Families
| Family | Purpose | Example |
|---|---|---|
| Unit | Regex, extractors, adapter formatters | parsing tCO2e/€m, section headers |
| Contract | Compute IO schemas, method pinning | rejects missing version |
| Integration | RAG + adapter + compute | GHG intensity with correct citation |
| Adversarial | Hallucination pressure, prompt attacks | ban free-text numbers |
| Jurisdiction | ACL & framework allowlists | deny SEC pack for EU-only tenant |
| Latency/SLO | Performance under load | p95 ≤ thresholds |
- Harness Layout
eval/
  goldsets/
    esrs/
    nace/
    efdb/
  policies/
    phase1.json
  runners/
    local_runner.py
    k8s_runner.py
  reporters/
    junit.xml
    jsonl_sink.py
  probes/
    rag_probe.py
    compute_probe.py
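A minimal sketch of what local_runner.py might do. The answer_fn and evaluate_gates hooks are illustrative; evaluate_gates stands in for the per-gate checkers (see 1.6 and 1.7), and the output shape matches the results.json assumed by the eval/check sketch above:

import json
import time

def run_suite(cases, answer_fn, evaluate_gates, out_path="results.json"):
    """Run every gold set case through the system under test and record per-gate outcomes."""
    rows = []
    for case in cases:
        start = time.monotonic()
        answer, trace = answer_fn(case["prompt"])
        gates = evaluate_gates(answer, trace, case)   # dict: gate name -> bool
        rows.append({
            "id": case["id"],
            "e2e_ms": (time.monotonic() - start) * 1000,
            **gates,
        })
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump({"cases": rows}, fh, indent=2)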
1.6. Numeric Provenance Checks
- Rule: If the answer contains a number that looks like a metric, it must:
- Reference a compute call (method_id, version) in the trace.
- Include the unit validated by unit_regex.
- Provide the provenance_id in the packaged output.
- Checker (pseudo)
def check_numeric_provenance(answer, trace):
    # No metric-looking numbers: nothing to prove.
    nums = extract_numeric_units(answer.text)
    if not nums:
        return True
    # Every number must carry a recognised unit.
    for n in nums:
        if not n.unit or not is_valid_unit(n.unit):
            return False
    # The trace must show a pinned compute call with provenance.
    return trace.has_compute_call and trace.method_id and trace.version and trace.provenance_id
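The extract_numeric_units and is_valid_unit helpers would normally be backed by the governed unit_regex; the self-contained approximation below uses a deliberately short, illustrative unit list:

import re
from dataclasses import dataclass
from typing import Optional

# Illustrative subset of units; the real check uses the governed unit_regex.
_UNIT_PATTERN = re.compile(r"(?P<value>-?\d[\d,.]*)\s*(?P<unit>tCO2e/€m|tCO2e|MWh|%)?")
_ALLOWED_UNITS = frozenset({"tCO2e", "tCO2e/€m", "MWh", "%"})

@dataclass
class NumericMention:
    value: str
    unit: Optional[str]   # None when the number appears without a unit

def extract_numeric_units(text):
    """Return every metric-looking number and the unit (if any) attached to it."""
    return [NumericMention(m.group("value"), m.group("unit"))
            for m in _UNIT_PATTERN.finditer(text)]

def is_valid_unit(unit):
    return unit in _ALLOWED_UNITS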
1.7. Refusal Quality & Policy Gates
- Refusal Templates
- Must contain: policy reason, next-step guidance, compliance tone.
- Example reasons: NO_CONTEXT, OUT_OF_SCOPE, JURISDICTION_BLOCK, NUMERIC_COMPUTE_REQUIRED.
- Automated Check
refusal_template:
  must_include:
    - "I can’t proceed because"
    - "Here’s how to continue"
  reason_codes:
    - NO_CONTEXT
    - OUT_OF_SCOPE
    - JURISDICTION_BLOCK
    - NUMERIC_COMPUTE_REQUIRED
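A sketch of the corresponding checker, assuming the YAML above is loaded with yaml.safe_load and that the refusal's reason code is available from the trace; the policy path is illustrative:

import yaml

def check_refusal_quality(answer_text, reason_code, policy_path="eval/policies/refusal.yaml"):
    """A refusal passes only when it uses the template phrases and a recognised reason code."""
    with open(policy_path, encoding="utf-8") as fh:
        policy = yaml.safe_load(fh)["refusal_template"]
    uses_template = all(phrase in answer_text for phrase in policy["must_include"])
    valid_reason = reason_code in policy["reason_codes"]
    return uses_template and valid_reason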
1.8. Drift Monitoring (SRE + Dev)
- Signals
- Retrieval drift: top-k recall change on sentinel queries (see the recall-delta sketch at the end of this section).
- Behavior drift: structure adherence decline; tone classifier flags.
- Regex friction: spike in validation failures after library bump.
- Model output drift: delta in token distributions on stable prompts.
- Jobs & Cadence
- Daily: Sentinel RAG suite (100 queries), emit recall deltas.
- Daily: Behavior structure suite (50 prompts).
- Weekly: Full gold set (Phase-1), report regressions.
- On-change: Any model/router/regex/adapter/compute bump → run all gates.
- SRE + Dev Workflow
- Shared dashboard with labeled regressions (artifact/version).
- Page on-call if any gate falls below SLO for 15m.
- RCA template includes sample outputs, diffs, and which governance artifact changed.
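A sketch of the daily sentinel recall-delta job referenced in the Signals list; the sentinel record format and the retrieve hook are assumptions, and the result would feed the rag_recall_delta_sentinel gauge:

def topk_recall(retrieved_ids, relevant_ids, k=10):
    """Fraction of known-relevant chunks found in the top-k retrieved results."""
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def sentinel_recall_delta(sentinels, retrieve, baseline):
    """Average recall change vs. the recorded baseline for the sentinel query set."""
    deltas = []
    for s in sentinels:   # each sentinel: {"query": ..., "relevant_ids": [...]}
        recall = topk_recall(retrieve(s["query"]), set(s["relevant_ids"]))
        deltas.append(recall - baseline[s["query"]])
    return sum(deltas) / len(deltas)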
2. Promotion Matrix (What blocks a release)
| Artifact changed | Required suites | Blocking criteria |
|---|---|---|
| LLM/Router | RAG, Structure, Citations, Refusal | Any gate below policy |
| Behavior Adapter | Structure, Citations, Refusal | Any gate below policy |
| Regex Library | Citations, Units, Datapoint parsing | Regex fail rate > 2% |
| Compute Method | Contract IO, Numeric Provenance | Determinism presence < 99.9% |
| Index Build | RAG (recall), Citations | Recall delta > 1% |
3. Result Storage & Audit
- Store every evaluation run as a signed artifact:
- eval_run_id, git_sha, goldset_version, (adapter|regex|compute|router) versions
- pass/fail metrics table
- Emit provenance_id for every run; keep 12–24 months.
- Verifiers can request run summaries through GET /audit/eval?from=…&to=….
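For example, a verifier-facing script might pull a window of run summaries like this; the route follows the endpoint above, while the host, auth scheme, and response fields are assumptions:

import requests

def fetch_eval_runs(base_url, start, end, token):
    """Fetch signed eval-run summaries for an audit window."""
    resp = requests.get(
        f"{base_url}/audit/eval",
        params={"from": start, "to": end},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    # Expected per run: eval_run_id, git_sha, goldset_version, metrics, provenance_id.
    return resp.json()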
4. Example: End-to-end Eval Invocation
# 1) Run locally against staging
eval/run --suite phase1 --env staging --out results.json
# 2) Check gates
eval/check --input results.json --policy eval/policies/phase1.json
# 3) Publish signed report on pass
eval/publish --input results.json --sign --dest s3://zayaz-eval-runs/$(date -Iseconds)
5. Closing Notes
The harness is the execution arm of AI governance:
- Gold sets encode what “good” looks like in ZAYAZ terms.
- Gates transform governance into binary release decisions.
- Continuous eval gives SRE + dev early warning on drift before customers or regulators see it.
With this in place, every change to models, prompts, adapters, regex, or compute methods is measured, gated, and auditable—preserving ZAYAZ’s trust-by-design posture.
6. Attachments
6.1 Grafana Dashboard (JSON)
- Datasource: Prometheus (named Prometheus)
- Panels: gate pass rates, latency SLOs, refusal quality, regex failures, determinism presence, and a quick drift watch.
- The Grafana_Dashboard file can be found in registry/ai
- This JSON can be pasted into Grafana → Dashboards → Import.
- Prometheus metric hints (align to your emitter names):
- *_pass_total, *_total counters for each gate (citation_integrity, structure_adherence, refusal_quality, numeric_provenance, determinism_presence)
- regex_fail_total, total_answers counters
- *_ms_bucket histograms for retrieval, compute, e2e
- gate_fail_total counter with labels {artifact="adapter|regex|compute|router|index", version, gate}
- rag_recall_delta_sentinel gauge for sentinel recall delta
6.2. Jupyter Notebook — Eval Trends & Gate Compliance
Use this notebook to analyze exported eval runs (JSONL/Parquet). It computes gate pass rates, detects regressions vs. last-known-good, and plots trends.
It is saved as notebooks/eval_trends.ipynb; a sketch of the core analysis is shown below for convenience.
Notes
- The notebook follows our chart rules (matplotlib only, single plot per figure, no explicit colors).
- Adjust the expected JSONL schema to match your harness output.
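A minimal sketch of the kind of trend computation the notebook performs, assuming one JSONL record per eval run with a run_at timestamp and per-gate pass rates; the file name and field names are placeholders to adjust to your export:

import matplotlib.pyplot as plt
import pandas as pd

GATES = ["citation_integrity", "structure_adherence", "refusal_quality", "numeric_provenance"]

# Load exported eval runs (one JSON object per line) and order them by run time.
runs = pd.read_json("eval_runs.jsonl", lines=True)
runs["run_at"] = pd.to_datetime(runs["run_at"])
runs = runs.sort_values("run_at")

# One gate per figure: matplotlib only, single plot per figure, no explicit colors.
for gate in GATES:
    fig, ax = plt.subplots()
    ax.plot(runs["run_at"], runs[gate])
    ax.set_title(f"{gate} pass rate over time")
    ax.set_xlabel("run date")
    ax.set_ylabel("pass rate")
    plt.show()

# Simple regression flag vs. the previous run; a fuller analysis would apply the same
# McNemar comparison used by eval/check against last-known-good.
latest, previous = runs.iloc[-1], runs.iloc[-2]
for gate in GATES:
    if latest[gate] < previous[gate] - 0.01:
        print(f"Possible regression in {gate}: {previous[gate]:.3f} -> {latest[gate]:.3f}")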