AIIL

Regex/Extractor Specs

1. Regex as AI Guardrails

1.1. Overview

While schemas and APIs define the “hard contracts” for disclosures, ZAYAZ also employs Regex and Extractor Specifications as AI guardrails at the generation layer.

Why this matters:

AI outputs can drift: even with strong prompting, models may produce citations in the wrong format, extra text around numbers, or invalid identifiers.
Regex validation acts as a pre-filter: AI answers must match strict patterns before they can be serialized into JSON or exported.
Extractors enforce determinism: text is parsed into canonical tokens (datapoint IDs, NACE codes, numeric values with units) that downstream systems can trust.

Together, regex and extractors ensure that AI free text is transformed into standards-compliant tokens before it ever enters the disclosure pipeline.

1.2. Core Regex Patterns

Domain	Regex Pattern Example	Purpose
Citations	\[.+?:L\d+-L\d+\]	Enforce ESRS-style citations with doc name + line range.
Datapoint IDs	^E([1-5]|S[1-4]|G1)(-[0-9]+)+$	Match ESRS datapoint references (e.g., E1-6-1).
NACE Codes	^[A-U]\d{2}(\.\d+)?$	Validate EU NACE sector/activity codes (e.g., C20.3, A01).
Numeric Units	^\d+(\.\d+)?\s?(tCO2e|MWh|…
Dates	^\d{4}(-Q[1-4])?$	Restrict to fiscal years or quarters (e.g., 2025, 2025-Q2).

AI governance link:

Prevents “hallucinated” formats.
Forces AI into compliance-bound language before reaching schema.
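
As a rough illustration of how such patterns behave, the Python sketch below compiles four of them and checks sample tokens. It assumes the NACE and date patterns carry explicit `{2}` and `{4}` quantifiers and uses the datapoint pattern from the library reference in 1.8. The names `CORE_PATTERNS` and `matches` are hypothetical, not part of ZAYAZ.

```python
import re

# Hypothetical registry of core patterns; the names are illustrative only.
CORE_PATTERNS = {
    "citation": re.compile(r"\[.+?:L\d+-L\d+\]"),
    "datapoint": re.compile(r"^E([1-5]|S[1-4]|G1)(-[0-9]+)+$"),
    "nace": re.compile(r"^[A-U]\d{2}(\.\d+)?$"),
    "date": re.compile(r"^\d{4}(-Q[1-4])?$"),
}


def matches(domain: str, token: str) -> bool:
    """Return True if the token satisfies the pattern registered for the domain."""
    return CORE_PATTERNS[domain].search(token) is not None


if __name__ == "__main__":
    samples = [
        ("citation", "[E1:L45-L60]"),
        ("datapoint", "E1-6-1"),
        ("nace", "C20.3"),
        ("nace", "A01"),
        ("date", "2025-Q2"),
        ("date", "Q2 2025"),  # free-form date, expected to be rejected
    ]
    for domain, token in samples:
        print(domain, repr(token), matches(domain, token))
```

Anchored patterns (`^…$`) are meant for single tokens, while the unanchored citation pattern can be searched for inside a longer answer.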

1.3. Extractor Layer

Regex is paired with extractor functions that normalize matches into canonical tokens.

Example:

Raw AI output:

“The company emitted 123.5 tonnes of CO₂ per € million revenue.”

Extractor result:

extractor-output.json
{
  "value": 123.5,
  "unit": "tCO2e/€m",
  "datapoint_ref": "E1-6-1"
}

Implementation sketch:

implementation-sketch.py
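import re

# A decimal value followed by an approved unit token.
NUMERIC_UNIT = re.compile(r"(\d+(?:\.\d+)?)\s?(tCO2e|MWh|€m|%)")

def extract_numeric_unit(text: str):
    """Return the first value/unit pair found in text, or None if there is none."""
    # Fold the subscript so that "tCO₂e" matches the canonical "tCO2e" token.
    # Free-text phrasings such as "tonnes of CO₂" still need a separate normalization step.
    match = NUMERIC_UNIT.search(text.replace("₂", "2"))
    if not match:
        return None
    return {"value": float(match.group(1)), "unit": match.group(2)}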
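
The sentence in the example above uses a free-text phrasing rather than a canonical unit token, so some normalization has to happen before a unit pattern can match it. The following is a minimal sketch of that step, assuming a small alias table. `UNIT_ALIASES`, `normalize_units` and `extract_value` are hypothetical names, the aliases cover only the phrasing from the example, and the sketch does not derive `datapoint_ref`.

```python
import re
from typing import Optional

# Hypothetical alias table: free-text phrasings mapped to canonical unit tokens.
UNIT_ALIASES = [
    (re.compile(r"tonnes of CO[₂2]e?", re.IGNORECASE), "tCO2e"),
    (re.compile(r"\s*per € ?million(?: revenue)?", re.IGNORECASE), "/€m"),
]

# Value followed by a canonical unit, optionally an intensity per € million.
VALUE_WITH_UNIT = re.compile(r"(\d+(?:\.\d+)?)\s?(tCO2e(?:/€m)?|MWh|€m|%)")


def normalize_units(text: str) -> str:
    """Rewrite known phrasings into canonical unit tokens."""
    for pattern, token in UNIT_ALIASES:
        text = pattern.sub(token, text)
    return text


def extract_value(text: str) -> Optional[dict]:
    """Normalize the text, then return the first value/unit pair or None."""
    match = VALUE_WITH_UNIT.search(normalize_units(text))
    if match is None:
        return None
    return {"value": float(match.group(1)), "unit": match.group(2)}


if __name__ == "__main__":
    print(extract_value("The company emitted 123.5 tonnes of CO₂ per € million revenue."))
```

Keeping normalization separate from matching means the approved unit pattern itself stays strict, and only the alias table grows as new phrasings are accepted.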

1.4. Integration with Eval Harness

Every AI answer is checked against regex/extractor specs in the Evaluation Harness before promotion.

Pass criteria: All required fields (citations, datapoints, values) are extractable.
Fail criteria: Regex mismatch or missing token → AI must retry or answer is rejected.

Example test (citation validation):

citation-validation.yaml
test_case: "GHG intensity disclosure"
ai_output: "See ESRS E1:L45-L60 for details."
expected: fail  # citation is missing the enclosing brackets

Governance link:

Keeps AI accountable to structural compliance, not just semantic accuracy.
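
A harness check of this kind could be sketched in Python as follows. This is only an illustration: `CitationCase` and `run_case` are hypothetical names, and the real Evaluation Harness may load its cases from YAML and check more than citations.

```python
import re
from dataclasses import dataclass

CITATION = re.compile(r"\[.+?:L\d+-L\d+\]")


@dataclass
class CitationCase:
    name: str
    ai_output: str
    expected: str  # "pass" or "fail"


def run_case(case: CitationCase) -> bool:
    """Return True when the observed outcome agrees with the expected one."""
    observed = "pass" if CITATION.search(case.ai_output) else "fail"
    return observed == case.expected


if __name__ == "__main__":
    case = CitationCase(
        name="GHG intensity disclosure",
        ai_output="See ESRS E1:L45-L60 for details.",
        expected="fail",
    )
    print(case.name, "ok" if run_case(case) else "unexpected outcome")
```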

1.5. Enforcement in Runtime

Regex/extractor enforcement is not just for tests — it runs in production AI responses:

Enforcement Stage	What Regex Validates	Action on Fail
Pre-serialization	Citations, datapoint IDs, NACE codes, units	Hard refusal → AI retries
Post-serialization	JSON payload re-validated against schema	Reject API call
Verifier Packaging	Evidence hashes match regex formats	Package rejected

This ensures no malformed AI output can bypass compliance gates.
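
The pre-serialization stage (hard refusal followed by a retry) might look roughly like the sketch below. The `generate` callable stands in for a model call and `max_attempts` is an assumed setting, since the retry limit is not specified here.

```python
import re
from typing import Callable, Optional

CITATION = re.compile(r"\[.+?:L\d+-L\d+\]")


def answer_with_retries(
    generate: Callable[[int], str], max_attempts: int = 3
) -> Optional[str]:
    """Call generate(attempt) until an answer contains a valid citation.

    Returns the first compliant answer, or None if every attempt is refused.
    """
    for attempt in range(max_attempts):
        candidate = generate(attempt)
        if CITATION.search(candidate):
            return candidate  # passes the pre-serialization gate
    return None  # hard refusal: nothing is forwarded to serialization


if __name__ == "__main__":
    # Stand-in for a model: the first draft lacks brackets, the second has them.
    drafts = ["See ESRS E1:L45-L60.", "See [E1:L45-L60]."]
    print(answer_with_retries(lambda attempt: drafts[min(attempt, 1)]))
```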

1.6. Worked Example — Scope 3 Extraction

AI free text (initial):

“Scope 3 Category 1 emissions are 500 tCO₂e, with revenue of €100m, giving an intensity of 5 tCO₂e per €m.”

Regex passes:

500 tCO2e (numeric unit)
€100m (currency)
5 tCO2e/€m (derived unit)

Extractor normalizes:

extractor-norm.json
{
  "scope3_cat1": 500,
  "revenue": 100,
  "value": 5,
  "unit": "tCO2e/€m"
}

Validation result: ✅ Passed → forwarded to JSON schema serialization.

If the AI had written:

“five hundred units of carbon dioxide equivalent”

❌ Regex mismatch → AI refusal → request retry with stricter prompt.
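
The worked example can be reproduced with a short Python sketch. It assumes a simple normalization (the subscript and the "per €m" phrasing are folded into canonical tokens) and reuses the key names from the JSON above. `extract_scope3` is a hypothetical helper, not the production extractor.

```python
import re
from typing import Optional

NUMBER = r"(\d+(?:\.\d+)?)"
INTENSITY = re.compile(NUMBER + r"\s?tCO2e/€m")
EMISSIONS = re.compile(NUMBER + r"\s?tCO2e(?!/)")  # absolute tonnes, not an intensity
REVENUE = re.compile(r"€" + NUMBER + r"m")


def extract_scope3(text: str) -> Optional[dict]:
    """Extract emissions, revenue and intensity; return None if any token is missing."""
    # Fold the subscript and the "per €m" phrasing into canonical tokens.
    canonical = text.replace("₂", "2").replace(" per €m", "/€m")
    emissions = EMISSIONS.search(canonical)
    revenue = REVENUE.search(canonical)
    intensity = INTENSITY.search(canonical)
    if not (emissions and revenue and intensity):
        return None  # regex mismatch: the caller should refuse and retry
    return {
        "scope3_cat1": float(emissions.group(1)),
        "revenue": float(revenue.group(1)),
        "value": float(intensity.group(1)),
        "unit": "tCO2e/€m",
    }


if __name__ == "__main__":
    print(extract_scope3(
        "Scope 3 Category 1 emissions are 500 tCO₂e, with revenue of €100m, "
        "giving an intensity of 5 tCO₂e per €m."
    ))
    print(extract_scope3("five hundred units of carbon dioxide equivalent"))
```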

1.7. Governance Controls

Regex/extractor enforcement aligns with AI governance in three ways:

Structural determinism: AI outputs become predictable tokens, not free text.
Compliance enforcement: Only ESRS-compliant datapoints and units are allowed.
Auditability: Regex logs (pass/fail) become part of the AI Validation Log for verifiers/regulators.

1.8. Regex Library Reference

The following table defines the approved regex patterns for ZAYAZ AI enforcement. All patterns are version-controlled; changes require a governance approval flow.

Name	Regex Pattern	Example Match	Purpose / Governance Note
citation_regex	\[.+?:L\d+-L\d+\]	[E1:L45-L60]	Enforces paragraph-level ESRS citations with doc name + line range. Required in AI output.
datapoint_regex	^E([1-5]|S[1-4]|G1)(-[0-9]+)+$
nace_regex	^[A-U]\d{2}(\.\d+)?$	C20.3, A01	Validates EU NACE activity codes. Restricts AI to official sector taxonomies.
unit_regex	^\d+(\.\d+)?\s?(tCO2e|MWh|kWh|…
currency_regex	^€\d+(\.\d+)?(m|bn)?$	€100m, €2.3bn
date_regex	^\d{4}(-Q[1-4])?$	2025, 2025-Q3	Fiscal years or quarters only; no vague/unsupported date formats.
hash_regex	^[a-f0-9]{64}$	f7c3bc1d808e04732adf679965ccc34c...	Validates dataset hashes (SHA-256). Ensures reproducibility in compute provenance.
provenance_id	^ZAYAZ-[A-Z0-9]{12}$	ZAYAZ-9F3A2D8B7C1E	Canonical provenance IDs. Guarantees each AI response is traceable.

Usage in Enforcement

Pre-Serialization: AI output checked against citation_regex, datapoint_regex, unit_regex.
Schema Serialization: Extractors convert matches into canonical JSON tokens.
Provenance & Audit: hash_regex and provenance_id ensure cryptographic reproducibility.
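
To illustrate the provenance step, the sketch below computes a SHA-256 digest with Python's standard `hashlib` and checks it, together with a provenance ID, against the two formats. The patterns are written with explicit `{64}` and `{12}` quantifiers, and `dataset_hash` and `is_valid_evidence` are hypothetical helper names.

```python
import hashlib
import re

HASH_REGEX = re.compile(r"^[a-f0-9]{64}$")
PROVENANCE_ID = re.compile(r"^ZAYAZ-[A-Z0-9]{12}$")


def dataset_hash(payload: bytes) -> str:
    """Return the SHA-256 hex digest of a payload (64 lowercase hex characters)."""
    return hashlib.sha256(payload).hexdigest()


def is_valid_evidence(digest: str, provenance_id: str) -> bool:
    """Both identifiers must match their formats before a package is accepted."""
    return (
        HASH_REGEX.fullmatch(digest) is not None
        and PROVENANCE_ID.fullmatch(provenance_id) is not None
    )


if __name__ == "__main__":
    digest = dataset_hash(b"example dataset bytes")
    print(is_valid_evidence(digest, "ZAYAZ-9F3A2D8B7C1E"))
```

Note that the format check only confirms that a value looks like a SHA-256 digest; confirming that it belongs to a given dataset still requires recomputing the hash.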

Example Config (Policy Engine Snippet)

policy-engine-snippet.json
{
  "rules": {
    "citation_pattern": "\\[.+?:L\\d+-L\\d+\\]",
    "datapoint_pattern": "^E([1-5]|S[1-4]|G1)(-[0-9]+)+$",
    "unit_pattern": "^\\d+(\\.\\d+)?\\s?(tCO2e|MWh|kWh|€m|%)$",
    "require_provenance_id": true
  }
}
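
A policy engine could consume this snippet along the following lines. The sketch parses the JSON (which turns the doubled backslashes into single ones) and compiles every key ending in `_pattern`. `load_rules` and `check` are hypothetical names, and the handling of `require_provenance_id` is left out.

```python
import json
import re

# The snippet above, kept in a raw string so the JSON escapes survive unchanged.
POLICY_JSON = r'''
{
  "rules": {
    "citation_pattern": "\\[.+?:L\\d+-L\\d+\\]",
    "datapoint_pattern": "^E([1-5]|S[1-4]|G1)(-[0-9]+)+$",
    "unit_pattern": "^\\d+(\\.\\d+)?\\s?(tCO2e|MWh|kWh|€m|%)$",
    "require_provenance_id": true
  }
}
'''


def load_rules(raw: str) -> dict:
    """Compile every *_pattern rule; keep other settings (booleans) as they are."""
    rules = json.loads(raw)["rules"]
    return {
        key: re.compile(value) if key.endswith("_pattern") else value
        for key, value in rules.items()
    }


RULES = load_rules(POLICY_JSON)


def check(rule: str, token: str) -> bool:
    """Return True if the token matches the named pattern rule."""
    return RULES[rule].search(token) is not None


if __name__ == "__main__":
    print(check("citation_pattern", "See [E1:L45-L60]."))
    print(check("unit_pattern", "12 MJ"))
```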

1.9. Governance Process for Updating Regex Specs

Regex specifications are compliance-critical artifacts. They cannot be changed ad hoc; every update must pass through the ZAYAZ AI Lifecycle governance flow.

Lifecycle Stages

Stage	Action	Responsible Role	Audit Output
Proposal	Developer submits PR adding/modifying regex entry in regex_library.json. Must include description, test cases, and rationale.	Developer / Feature Team	Git commit + PR review
Validation Harness	Automated suite runs: test inputs, golden outputs, fuzz tests (malformed inputs). Fails if new regex causes regressions.	CI/CD Pipeline	Validation report (pass/fail)
Governance Review	AI Governance Committee reviews change log, compliance risks (e.g., whether regex broadens scope improperly).	Governance Committee	Review log + approval/rejection
Staging Rollout	Regex spec applied in staging tenants. Shadow AI runs confirm no drift or refusal mismatches.	SRE / AI Ops Team	Shadow test logs + delta analysis
Promotion	Regex merged into production allowlist. Version bump applied to regex_library_version.	Release Manager	Release note entry
Deprecation	Old regex patterns remain supported for a grace period (configurable, default 30 days). Then marked deprecated.	AI Ops + Governance	Deprecation log

Example: Adding a New Unit Regex

PR Submission

pr-submission.json
{
  "name": "unit_regex_energy",
  "pattern": "^\\d+(\\.\\d+)?\\s?(MWh|kWh|GWh)$",
  "description": "Accept only energy values in MWh/kWh/GWh."
}

Validation Harness Output

PASS: "12.5 MWh"  
PASS: "1000 kWh"  
FAIL: "12 MJ" (rejected as intended)

Governance Approval

Committee approves, rationale recorded: “Restricts energy reporting to electricity-related units per ESRS E1 standard.”

Release Note

regex_library v1.3.0  
- Added unit_regex_energy for MWh/kWh/GWh (E1 compliance).
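
The validation step for this example can be approximated with a small Python harness. It is a sketch only: the case list extends the three inputs shown above with two malformed ones chosen for illustration, and `run_harness` is a hypothetical name.

```python
import json
import re

PR_ENTRY = json.loads(r'''
{
  "name": "unit_regex_energy",
  "pattern": "^\\d+(\\.\\d+)?\\s?(MWh|kWh|GWh)$",
  "description": "Accept only energy values in MWh/kWh/GWh."
}
''')

# (input, should_match) pairs: positive cases plus malformed or out-of-scope inputs.
CASES = [
    ("12.5 MWh", True),
    ("1000 kWh", True),
    ("12 MJ", False),
    ("MWh 12", False),
    ("12.5  MWh", False),  # two spaces: the pattern allows at most one
]


def run_harness(entry: dict, cases: list) -> list:
    """Return the cases whose outcome differs from the expectation (empty = all good)."""
    pattern = re.compile(entry["pattern"])
    return [
        (text, want) for text, want in cases
        if (pattern.fullmatch(text) is not None) != want
    ]


if __name__ == "__main__":
    failures = run_harness(PR_ENTRY, CASES)
    print("regressions:", failures)
```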

Enforcement Policy

No bypass: AI models and extractors must use only patterns in the approved regex_library.json.
Reproducibility: Every AI output records the regex_library_version used.
Auditability: Regex changes are logged in the same trail as model/router updates.

This way, regex governance is just as strict as compute methods or model promotion — reinforcing ZAYAZ’s principle that every enforcement artifact is part of the AI trust boundary.
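
One way to picture the "no bypass" and reproducibility rules is a small allowlist wrapper, sketched below. The layout of `regex_library.json` is not specified here, so the in-memory `RegexLibrary` structure and the record fields other than `regex_library_version` are assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RegexLibrary:
    version: str
    patterns: dict  # regex name -> pattern string

    def validate(self, name: str, token: str) -> dict:
        """Validate a token against an approved pattern and return an audit record."""
        if name not in self.patterns:
            # No bypass: unknown names are rejected rather than compiled ad hoc.
            raise KeyError(f"{name!r} is not in the approved regex library")
        passed = re.search(self.patterns[name], token) is not None
        return {
            "regex_name": name,
            "regex_library_version": self.version,
            "passed": passed,
        }


LIBRARY = RegexLibrary(
    version="1.3.0",
    patterns={"unit_regex_energy": r"^\d+(\.\d+)?\s?(MWh|kWh|GWh)$"},
)

if __name__ == "__main__":
    print(LIBRARY.validate("unit_regex_energy", "12.5 MWh"))
```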

1.10. Parallel Governance: Regex Library vs. Compute Method Registry

Both regex specs and compute methods are compliance-critical. They control what AI is allowed to output (regex) and how it derives numbers (compute methods). To prevent drift or hidden changes, ZAYAZ enforces the same governance lifecycle for both artifacts.

Comparison Table

Dimension	Compute Method Registry	Regex Library
Identifier	(method_id, version) composite key	(regex_name, version) composite key
Schema	JSONB: input, options, output schema	JSON: pattern, description, compliance tags
Lifecycle Stages	Proposal → Validation Harness → Governance Review → Promotion	Proposal → Validation Harness → Governance Review → Promotion
Versioning	Semantic versions (v1.0.0, v1.1.0…) allow side-by-side rollout	Semantic versions (v1.0.0, v1.1.0…) allow safe migrations
Deprecation Policy	Old versions discoverable but flagged as deprecated	Old regex patterns discoverable but flagged as deprecated
Auditability	Every compute call logs (method_id, version) + provenance_id	Every AI output logs (regex_name, version) + provenance_id
Enforcement Surface	API layer (POST /compute/factor)	AI output validators + post-processing layer
Example	GHG.intensity v1.0.0	citation_regex v1.3.0

Example: Citation Regex Parallel to Compute

Compute Call

compute-call.json
{
  "method_id": "GHG.intensity",
  "version": "1.0.0",
  "inputs": { "scope1": 100, "scope2": 200, "revenue": 50 }
}

→ Deterministic result, logged with (method_id, version).

Regex Enforcement

regex-enforcement.json
{
  "regex_name": "citation_regex",
  "version": "1.3.0",
  "pattern": "\\[.+?:L\\d+-L\\d+\\]",
  "description": "Ensure paragraph-level citations."
}

→ AI output must validate against this regex; failure = refusal.
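
The composite-key idea on the regex side can be sketched as a lookup keyed by `(regex_name, version)`, mirroring how a compute call is addressed by `(method_id, version)`. `REGISTRY`, `RegexEntry` and `enforce` are hypothetical names, and the record layout is an assumption.

```python
import re
from typing import Dict, NamedTuple, Tuple


class RegexEntry(NamedTuple):
    pattern: str
    description: str
    deprecated: bool = False


# Composite key (regex_name, version), so several versions can coexist.
REGISTRY: Dict[Tuple[str, str], RegexEntry] = {
    ("citation_regex", "1.3.0"): RegexEntry(
        pattern=r"\[.+?:L\d+-L\d+\]",
        description="Ensure paragraph-level citations.",
    ),
}


def enforce(regex_name: str, version: str, ai_output: str) -> dict:
    """Validate ai_output and return a log record keyed like a compute call."""
    entry = REGISTRY[(regex_name, version)]
    passed = re.search(entry.pattern, ai_output) is not None
    return {
        "regex_name": regex_name,
        "version": version,
        "deprecated": entry.deprecated,
        "result": "accepted" if passed else "refused",
    }


if __name__ == "__main__":
    print(enforce("citation_regex", "1.3.0", "See [E1:L45-L60] for details."))
```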

Unified Lifecycle Contract

Proposals: Both live in Git PRs.
Validation Harness: Both run regression tests.
Governance Review: AI Governance Committee approves or rejects.
Staging: Both tested in shadow AI runs before promotion.
Promotion/Deprecation: Both follow the same semantic versioning + allowlist pattern.

1.11. Closing Notes

Regex and Extractor Specs form the first compliance lock in the ZAYAZ AI pipeline. They ensure that:

AI cannot drift into unsupported formats.
Outputs are machine-extractable, schema-compatible, and auditable.
Every disclosure, from Scope 1 to Scope 3, starts with validated tokens before being packaged into JSON (§27), transformed into XBRL (§27.4), and exported to verifiers/regulators.
