AIIL-VH
Validation & Test Harnesses
Evaluation Fabric
1.1. Purpose & Scope
This chapter defines the evaluation fabric that keeps ZAYAZ AI trustworthy over time. It operationalizes governance by turning requirements into repeatable tests, quantitative gates, and drift monitors.
Objectives
- Encode gold sets that represent standards-compliant answers.
- Enforce regression gates in CI/CD before any model/regex/adapter/compute promotion.
- Continuously track refusal accuracy, citation integrity, and numeric provenance.
- Provide shared drift monitoring workflows for SRE + dev.
Non-goals
- Manual QA as a primary control (we rely on automated harnesses).
- Open-ended subjective scoring (use rule-bound checks and reference answers).
1.2. Gold Set Maintenance
- Sources & Coverage
Gold sets reflect Phase-1 scope and expand per roadmap:
- ESRS: E1–E5, S1–S4, G1, ESRS 1/2 core datapoints.
- NACE mappings (Annex VI).
- IPCC/EFDB factor lookups (as documents; no custom numeric compute beyond allowed methods).
- Behavioral cases (refusal scenarios, tone, structure).
Target coverage
- ≥ 50 representative prompts per standard (balanced by topic).
- At least 10 refusal scenarios per standard (unsupported, missing context, OOS).
- 20 extraction-heavy prompts (citations, units, datapoints).
- Gold Set Format
Store in Git under eval/goldsets/<domain>/…
{
  "id": "E1-6-1_ghg_intensity_basic",
  "prompt": "Report our ESRS E1-6-1 GHG intensity and cite the source paragraph.",
  "expectations": {
    "citations": ["ESRS E1:L120-L140"],   // allowed alternatives supported
    "structure": ["Summary", "Evidence", "Limitations"],
    "metrics": {
      "requires_compute": true,
      "method_id": "GHG.intensity",
      "method_version": "1.0.0"
    }
  },
  "policy": {
    "refusal_if_no_context": true,
    "tone": "compliance"
  },
  "notes": "Checks correct datapoint reference, unit, and paragraph-level citation."
}
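Before a run, the harness can sanity-check each gold set file against this shape. A minimal sketch, assuming a load_goldsets helper that is illustrative and not part of the harness API:

import json
from pathlib import Path

REQUIRED_TOP_LEVEL = {"id", "prompt", "expectations", "policy"}

def load_goldsets(root="eval/goldsets"):
    """Yield parsed gold set cases, failing fast on malformed entries."""
    for path in sorted(Path(root).rglob("*.json")):
        case = json.loads(path.read_text(encoding="utf-8"))
        missing = REQUIRED_TOP_LEVEL - case.keys()
        if missing:
            raise ValueError(f"{path}: missing fields {sorted(missing)}")
        # Compute-backed cases must pin both method_id and method_version.
        metrics = case["expectations"].get("metrics", {})
        if metrics.get("requires_compute") and not (
            metrics.get("method_id") and metrics.get("method_version")
        ):
            raise ValueError(f"{path}: requires_compute without pinned method")
        yield case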
- Lifecycle & Ownership
- Owners: Standards team + AI behavior leads.
- Change control: PR with rationale, scope tag (minor|major), and impact matrix.
- Versioning: goldset_version increments; stored in evaluation results for reproducibility.
1.3. Regression Policy & Gates
- Gate Classes (must pass to promote)
| Gate | What it enforces | Threshold (Phase 1) |
|---|---|---|
| Citation Integrity | All required citations match citation_regex; paragraph-level and from retrieved context | ≥ 98% pass |
| Structure Adherence | Required sections present and in order (adapter profile) | ≥ 98% pass |
| Refusal Quality | When refusal required, template + reason code correct | ≥ 99% pass |
| Compute Determinism | If numbers appear, (method_id, version, provenance_id) present and valid | ≥ 99.9% pass |
| Numeric Provenance | Numeric outputs traceable to compute output; units valid | ≥ 99% pass |
| Latency SLO | e2e p95 within target region limits in harness | ≤ 2.5s |
| No-Regression | No statistically significant drop vs. last-known-good (McNemar check) | α = 0.05 |
- CI/CD Integration (example GitHub Actions)
name: eval-harness
on:
  pull_request:
    paths:
      - "eval/**"
      - "adapters/**"
      - "regex_library/**"
      - "compute/**"
      - "model_router/**"
jobs:
  run-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup env
        run: make eval-setup
      - name: Run goldsets
        run: eval/run --suite all --out results.json
      - name: Check gates
        run: eval/check --input results.json --policy eval/policies/phase1.json
- Policy File (phase1.json)
{
  "min_citation_integrity": 0.98,
  "min_structure_adherence": 0.98,
  "min_refusal_quality": 0.99,
  "min_compute_determinism": 0.999,
  "min_numeric_provenance": 0.99,
  "max_e2e_p95_seconds": 2.5,
  "stat_test": {"kind": "mcnemar", "alpha": 0.05}
}
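A minimal sketch of what eval/check could do with this policy. The results.json schema (a "cases" list of per-case gate booleans plus e2e_ms) is an assumption for illustration, not the definitive format:

import json
import statistics
import sys

GATE_KEYS = ("citation_integrity", "structure_adherence", "refusal_quality",
             "compute_determinism", "numeric_provenance")

def check_gates(results_path, policy_path):
    """Aggregate per-case gate outcomes and compare them against the phase policy."""
    cases = json.load(open(results_path, encoding="utf-8"))["cases"]
    policy = json.load(open(policy_path, encoding="utf-8"))
    failures = []
    for gate in GATE_KEYS:
        applicable = [c[gate] for c in cases if gate in c]   # skip cases the gate does not apply to
        if not applicable:
            continue
        rate = sum(applicable) / len(applicable)
        minimum = policy[f"min_{gate}"]
        if rate < minimum:
            failures.append(f"{gate}: {rate:.4f} < {minimum}")
    # Latency SLO: p95 over all cases, in seconds.
    p95 = statistics.quantiles([c["e2e_ms"] for c in cases], n=20)[-1] / 1000
    if p95 > policy["max_e2e_p95_seconds"]:
        failures.append(f"e2e p95 {p95:.2f}s above SLO")
    # The no-regression gate additionally compares per-case outcomes against the
    # last-known-good run using the configured stat test (McNemar at the policy alpha).
    for failure in failures:
        print(f"GATE FAIL: {failure}")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(check_gates(sys.argv[1], sys.argv[2]))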
1.4. Continuous Evaluation Metrics
- Definitions & Formulas
- Citation Integrity
- CI = (# answers where all citations match regex & retrieved range) / (answers requiring citations)
- Structure Adherence
- SA = (# answers passing adapter-required section checks) / (answers subject to adapter profile)
- Refusal Accuracy
- RA = (# refusals with correct template & reason) / (required refusals)
- Numeric Provenance
- NP = (# answers with numeric values referencing valid (method_id, version) + unit) / (answers with numbers)
- Determinism Presence
- DP = (# answers with provenance_id + method/version stitched to audit) / all answers
- Telemetry Emission (pseudocode)
emit("citation_integrity", pass_bool)
emit("structure_adherence", pass_bool)
emit("refusal_accuracy", pass_bool)
emit("numeric_provenance", pass_bool)
emit("determinism_presence", pass_bool)
emit_timing("e2e_ms", duration_ms)
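One way these emit calls could be wired to the Prometheus metric names used in the attachments; the emitter below is a sketch, not the production telemetry module:

from prometheus_client import Counter, Histogram

# One pass/total counter pair per gate, matching the *_pass_total / *_total
# naming convention referenced in the Grafana attachment.
_GATES = ("citation_integrity", "structure_adherence", "refusal_quality",
          "numeric_provenance", "determinism_presence")
_PASS = {g: Counter(f"{g}_pass_total", f"Answers passing {g}") for g in _GATES}
_TOTAL = {g: Counter(f"{g}_total", f"Answers evaluated for {g}") for g in _GATES}
_E2E = Histogram("e2e_ms", "End-to-end latency in milliseconds",
                 buckets=(250, 500, 1000, 1500, 2000, 2500, 5000))

def emit(gate, passed):
    """Record one gate outcome."""
    _TOTAL[gate].inc()
    if passed:
        _PASS[gate].inc()

def emit_timing(metric, duration_ms):
    """Record end-to-end latency; extend with retrieval/compute histograms as needed."""
    _E2E.observe(duration_ms)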
- Dashboards (key panels)
- Eval Trend: CI, SA, RA, NP weekly
- Gate Failures by Artifact: adapter version, regex version, compute method
- Latency Heatmap: e2e vs region, per route (RAG, compute, packaging)
- Top Refusal Reasons: policy gates (missing context, OOS, ACL)
1.5. Test Types & Harness Structure
- Test Families
| Family | Purpose | Example |
|---|---|---|
| Unit | Regex, extractors, adapter formatters | parsing tCO2e/€m, section headers |
| Contract | Compute IO schemas, method pinning | rejects missing version |
| Integration | RAG + adapter + compute | GHG intensity with correct citation |
| Adversarial | Hallucination pressure, prompt attacks | ban free-text numbers |
| Jurisdiction | ACL & framework allowlists | deny SEC pack for EU-only tenant |
| Latency/SLO | Performance under load | p95 ≤ thresholds |
- Harness Layout
eval/
  goldsets/
    esrs/
    nace/
    efdb/
  policies/
    phase1.json
  runners/
    local_runner.py
    k8s_runner.py
  reporters/
    junit.xml
    jsonl_sink.py
  probes/
    rag_probe.py
    compute_probe.py
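A minimal sketch of what local_runner.py might do. The answer_fn and evaluate_gates hooks are illustrative; evaluate_gates stands in for the per-gate checkers (see 1.6 and 1.7), and the output shape matches the results.json assumed by the eval/check sketch above:

import json
import time

def run_suite(cases, answer_fn, evaluate_gates, out_path="results.json"):
    """Run every gold set case through the system under test and record per-gate outcomes."""
    rows = []
    for case in cases:
        start = time.monotonic()
        answer, trace = answer_fn(case["prompt"])
        gates = evaluate_gates(answer, trace, case)   # dict: gate name -> bool
        rows.append({
            "id": case["id"],
            "e2e_ms": (time.monotonic() - start) * 1000,
            **gates,
        })
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump({"cases": rows}, fh, indent=2)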
1.6. Numeric Provenance Checks
- Rule: If the answer contains a number that looks like a metric, it must:
- Reference a compute call (method_id, version) in the trace.
- Include the unit validated by unit_regex.
- Provide the provenance_id in the packaged output.
- Checker (pseudo)
def check_numeric_provenance(answer, trace):
    # No metric-looking numbers: nothing to prove.
    nums = extract_numeric_units(answer.text)
    if not nums:
        return True
    # Every number must carry a recognised unit.
    for n in nums:
        if not n.unit or not is_valid_unit(n.unit):
            return False
    # The trace must show a pinned compute call with provenance.
    return trace.has_compute_call and trace.method_id and trace.version and trace.provenance_id
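The extract_numeric_units and is_valid_unit helpers would normally be backed by the governed unit_regex; the self-contained approximation below uses a deliberately short, illustrative unit list:

import re
from dataclasses import dataclass
from typing import Optional

# Illustrative subset of units; the real check uses the governed unit_regex.
_UNIT_PATTERN = re.compile(r"(?P<value>-?\d[\d,.]*)\s*(?P<unit>tCO2e/€m|tCO2e|MWh|%)?")
_ALLOWED_UNITS = frozenset({"tCO2e", "tCO2e/€m", "MWh", "%"})

@dataclass
class NumericMention:
    value: str
    unit: Optional[str]   # None when the number appears without a unit

def extract_numeric_units(text):
    """Return every metric-looking number and the unit (if any) attached to it."""
    return [NumericMention(m.group("value"), m.group("unit"))
            for m in _UNIT_PATTERN.finditer(text)]

def is_valid_unit(unit):
    return unit in _ALLOWED_UNITS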
1.7. Refusal Quality & Policy Gates
- Refusal Templates
- Must contain: policy reason, next-step guidance, compliance tone.
- Example reasons: NO_CONTEXT, OUT_OF_SCOPE, JURISDICTION_BLOCK, NUMERIC_COMPUTE_REQUIRED.
- Automated Check
refusal_template:
  must_include:
    - "I can’t proceed because"
    - "Here’s how to continue"
  reason_codes:
    - NO_CONTEXT
    - OUT_OF_SCOPE
    - JURISDICTION_BLOCK
    - NUMERIC_COMPUTE_REQUIRED
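A sketch of the corresponding checker, assuming the YAML above is loaded with yaml.safe_load and that the refusal's reason code is available from the trace; the policy path is illustrative:

import yaml

def check_refusal_quality(answer_text, reason_code, policy_path="eval/policies/refusal.yaml"):
    """A refusal passes only when it uses the template phrases and a recognised reason code."""
    with open(policy_path, encoding="utf-8") as fh:
        policy = yaml.safe_load(fh)["refusal_template"]
    uses_template = all(phrase in answer_text for phrase in policy["must_include"])
    valid_reason = reason_code in policy["reason_codes"]
    return uses_template and valid_reason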
1.8. Drift Monitoring (SRE + Dev)
- Signals
- Retrieval drift: top-k recall change on sentinel queries (see the recall-delta sketch at the end of this section).
- Behavior drift: structure adherence decline; tone classifier flags.
- Regex friction: spike in validation failures after library bump.
- Model output drift: delta in token distributions on stable prompts.
- Jobs & Cadence
- Daily: Sentinel RAG suite (100 queries), emit recall deltas.
- Daily: Behavior structure suite (50 prompts).
- Weekly: Full gold set (Phase-1), report regressions.
- On-change: Any model/router/regex/adapter/compute bump → run all gates.
- SRE + Dev Workflow
- Shared dashboard with labeled regressions (artifact/version).
- Page on-call if any gate falls below SLO for 15m.
- RCA template includes sample outputs, diffs, and which governance artifact changed.
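A sketch of the daily sentinel recall-delta job referenced in the Signals list; the sentinel record format and the retrieve hook are assumptions, and the result would feed the rag_recall_delta_sentinel gauge:

def topk_recall(retrieved_ids, relevant_ids, k=10):
    """Fraction of known-relevant chunks found in the top-k retrieved results."""
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def sentinel_recall_delta(sentinels, retrieve, baseline):
    """Average recall change vs. the recorded baseline for the sentinel query set."""
    deltas = []
    for s in sentinels:   # each sentinel: {"query": ..., "relevant_ids": [...]}
        recall = topk_recall(retrieve(s["query"]), set(s["relevant_ids"]))
        deltas.append(recall - baseline[s["query"]])
    return sum(deltas) / len(deltas)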
2. Promotion Matrix (What blocks a release)
| Artifact changed | Required suites | Blocking criteria |
|---|---|---|
| LLM/Router | RAG, Structure, Citations, Refusal | Any gate below policy |
| Behavior Adapter | Structure, Citations, Refusal | Any gate below policy |
| Regex Library | Citations, Units, Datapoint parsing | Regex fail rate > 2% |
| Compute Method | Contract IO, Numeric Provenance | Determinism presence < 99.9% |
| Index Build | RAG (recall), Citations | Recall delta > 1% |
3. Result Storage & Audit
- Store every evaluation run as a signed artifact:
- eval_run_id, git_sha, goldset_version, (adapter|regex|compute|router) versions
- pass/fail metrics table
- Emit provenance_id for every run; keep 12–24 months.
- Verifiers can request run summaries through GET /audit/eval?from=…&to=….
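For example, a verifier-facing script might pull a window of run summaries like this; the route follows the endpoint above, while the host, auth scheme, and response fields are assumptions:

import requests

def fetch_eval_runs(base_url, start, end, token):
    """Fetch signed eval-run summaries for an audit window."""
    resp = requests.get(
        f"{base_url}/audit/eval",
        params={"from": start, "to": end},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    # Expected per run: eval_run_id, git_sha, goldset_version, metrics, provenance_id.
    return resp.json()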
4. Example: End-to-end Eval Invocation
# 1) Run locally against staging
eval/run --suite phase1 --env staging --out results.json
# 2) Check gates
eval/check --input results.json --policy eval/policies/phase1.json
# 3) Publish signed report on pass
eval/publish --input results.json --sign --dest s3://zayaz-eval-runs/$(date -Iseconds)
5. Closing Notes
The harness is the execution arm of AI governance:
- Gold sets encode what “good” looks like in ZAYAZ terms.
- Gates transform governance into binary release decisions.
- Continuous eval gives SRE + dev early warning on drift before customers or regulators see it.
With this in place, every change to models, prompts, adapters, regex, or compute methods is measured, gated, and auditable—preserving ZAYAZ’s trust-by-design posture.
6. Attachments
6.1 Grafana Dashboard (JSON)
- Datasource: Prometheus (named Prometheus)
- Panels: gate pass rates, latency SLOs, refusal quality, regex failures, determinism presence, and a quick drift watch.
- The Grafana_Dashboard file can be found in registry/ai
- This JSON can be pasted into Grafana → Dashboards → Import.
- Prometheus metric hints (align to your emitter names):
- *_pass_total, *_total counters for each gate (citation_integrity, structure_adherence, refusal_quality, numeric_provenance, determinism_presence)
- regex_fail_total, total_answers counters
- *_ms_bucket histograms for retrieval, compute, e2e
- gate_fail_total counter with labels {artifact="adapter|regex|compute|router|index", version, gate}
- rag_recall_delta_sentinel gauge for sentinel recall delta
6.2. Jupyter Notebook — Eval Trends & Gate Compliance
Use this notebook to analyze exported eval runs (JSONL/Parquet). It computes gate pass rates, detects regressions vs. last-known-good, and plots trends.
It is saved as notebooks/eval_trends.ipynb; a sketch of the core analysis is shown below for convenience.
Notes
- The notebook follows our chart rules (matplotlib only, single plot per figure, no explicit colors).
- Adjust the expected JSONL schema to match your harness output.
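A minimal sketch of the kind of trend computation the notebook performs, assuming one JSONL record per eval run with a run_at timestamp and per-gate pass rates; the file name and field names are placeholders to adjust to your export:

import matplotlib.pyplot as plt
import pandas as pd

GATES = ["citation_integrity", "structure_adherence", "refusal_quality", "numeric_provenance"]

# Load exported eval runs (one JSON object per line) and order them by run time.
runs = pd.read_json("eval_runs.jsonl", lines=True)
runs["run_at"] = pd.to_datetime(runs["run_at"])
runs = runs.sort_values("run_at")

# One gate per figure: matplotlib only, single plot per figure, no explicit colors.
for gate in GATES:
    fig, ax = plt.subplots()
    ax.plot(runs["run_at"], runs[gate])
    ax.set_title(f"{gate} pass rate over time")
    ax.set_xlabel("run date")
    ax.set_ylabel("pass rate")
    plt.show()

# Simple regression flag vs. the previous run; a fuller analysis would apply the same
# McNemar comparison used by eval/check against last-known-good.
latest, previous = runs.iloc[-1], runs.iloc[-2]
for gate in GATES:
    if latest[gate] < previous[gate] - 0.01:
        print(f"Possible regression in {gate}: {previous[gate]:.3f} -> {latest[gate]:.3f}")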