AIIL
Regex/Extractor Specs
1. Regex as AI Guardrails
1.1. Overview
While schemas and APIs define the “hard contracts” for disclosures, ZAYAZ also employs Regex and Extractor Specifications as AI guardrails at the generation layer.
Why this matters:
- AI outputs can drift: even with strong prompting, models may produce citations in the wrong format, extra text around numbers, or invalid identifiers.
- Regex validation acts as a pre-filter: AI answers must match strict patterns before they can be serialized into JSON or exported.
- Extractors enforce determinism: text is parsed into canonical tokens (datapoint IDs, NACE codes, numeric values with units) that downstream systems can trust.
Together, regex + extractors ensure that AI free text is transformed into standards-compliant tokens before it ever enters the disclosure pipeline.
1.2. Core Regex Patterns
| Domain | Regex Pattern Example | Purpose |
|---|---|---|
| Citations | `\[.+?:L\d+-L\d+\]` | Enforce ESRS-style citations with doc name + line range. |
| Datapoint IDs | `^E([1-5]\|S[1-4]\|G1)(-[0-9]+)+$` | Match ESRS datapoint references (e.g., E1-6-1). |
| NACE Codes | `^[A-U]\d{2}(\.\d+)?$` | Validate EU NACE sector/activity codes (e.g., C20.3, A01). |
| Numeric Units | `^\d+(\.\d+)?\s?(tCO2e\|MWh\|kWh\|€m\|%)$` | Match numeric values followed by an approved unit (e.g., 500 tCO2e). |
| Dates | `^\d{4}(-Q[1-4])?$` | Restrict to fiscal years or quarters (e.g., 2025, 2025-Q2). |
AI governance link:
- Prevents “hallucinated” formats.
- Forces AI into compliance-bound language before reaching schema.
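As a rough illustration of how such patterns could be applied, the sketch below compiles three of the core patterns and checks single tokens against them. The dictionary name `CORE_PATTERNS` and the helper `matches` are hypothetical, and the NACE and date patterns assume a two-digit and a four-digit quantifier respectively.

```python
import re

# Hypothetical lookup of core patterns; quantifiers {2} and {4} are assumed.
CORE_PATTERNS = {
    "citation": re.compile(r"\[.+?:L\d+-L\d+\]"),
    "nace": re.compile(r"^[A-U]\d{2}(\.\d+)?$"),
    "date": re.compile(r"^\d{4}(-Q[1-4])?$"),
}


def matches(domain: str, text: str) -> bool:
    """Return True if text satisfies the pattern registered for domain."""
    pattern = CORE_PATTERNS[domain]
    if domain == "citation":
        # Citations are embedded in sentences, so search inside the text.
        return pattern.search(text) is not None
    # Other domains are single tokens and must match in full.
    return pattern.fullmatch(text) is not None
```

Using `fullmatch` for token-style fields is a deliberate choice in this sketch: it rejects trailing text or newlines that a bare `$` anchor would tolerate.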
1.3. Extractor Layer
Regex is paired with extractor functions that normalize matches into canonical tokens.
Example:
Raw AI output:
“The company emitted 123.5 tonnes of CO₂ per € million revenue.”
Extractor result:
```json
{
  "value": 123.5,
  "unit": "tCO2e/€m",
  "datapoint_ref": "E1-6-1"
}
```
Implementation sketch:
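```python
import re
from typing import Optional

# Number (integer or decimal), optional space, then one approved unit.
NUMERIC_UNIT = re.compile(r"(\d+(?:\.\d+)?)\s?(tCO2e|MWh|€m|%)")


def extract_numeric_unit(text: str) -> Optional[dict]:
    """Return the first value/unit pair found in text, or None if there is none."""
    match = NUMERIC_UNIT.search(text)
    if not match:
        return None
    return {"value": float(match.group(1)), "unit": match.group(2)}
```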
1.4. Integration with Eval Harness
Every AI answer is checked against regex/extractor specs in the Evaluation Harness before promotion.
- Pass criteria: All required fields (citations, datapoints, values) are extractable.
- Fail criteria: Regex mismatch or missing token → AI must retry or answer is rejected.
Example test (citation validation):
```yaml
test_case: "GHG intensity disclosure"
ai_output: "See ESRS E1:L45-L60 for details."
expected: fail  # citation is not wrapped in square brackets
```
Governance link:
- Keeps AI accountable to structural compliance, not just semantic accuracy.
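A harness check of this kind might look like the following minimal sketch. The `HarnessCase` dataclass and `run_case` function are hypothetical names introduced for illustration; only the citation pattern comes from the specification.

```python
import re
from dataclasses import dataclass

CITATION = re.compile(r"\[.+?:L\d+-L\d+\]")


@dataclass
class HarnessCase:
    """Hypothetical test-case record mirroring the fields in the example above."""
    name: str
    ai_output: str
    expected: str  # "pass" or "fail"


def run_case(case: HarnessCase) -> bool:
    """Return True when the observed outcome equals the expected outcome."""
    observed = "pass" if CITATION.search(case.ai_output) else "fail"
    return observed == case.expected
```

A case that expects `fail` therefore succeeds when the regex rejects the output, which is how negative examples such as the unbracketed citation can be kept in the regression suite.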
1.5. Enforcement in Runtime
Regex/extractor enforcement is not just for tests — it runs in production AI responses:
| Enforcement Stage | What Regex Validates | Action on Fail |
|---|---|---|
| Pre-serialization | Citations, datapoint IDs, NACE codes, units | Hard refusal → AI retries |
| Post-serialization | JSON payload re-validated against schema | Reject API call |
| Verifier Packaging | Evidence hashes match regex formats | Package rejected |
This ensures no malformed AI output can bypass compliance gates.
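The pre-serialization row (hard refusal, then retry) can be sketched as a bounded retry loop. This is an assumption-laden illustration: `generate` stands in for whatever callable produces a model answer, and `max_attempts` is a hypothetical setting, not a documented ZAYAZ parameter.

```python
import re
from typing import Callable, Optional

CITATION = re.compile(r"\[.+?:L\d+-L\d+\]")


def generate_with_guardrail(
    generate: Callable[[int], str],
    max_attempts: int = 3,
) -> Optional[str]:
    """Call generate(attempt) until the answer contains a valid citation.

    Returns the first compliant answer, or None if every attempt fails.
    """
    for attempt in range(max_attempts):
        answer = generate(attempt)
        if CITATION.search(answer):
            return answer  # passes the pre-serialization check
    return None  # all attempts refused: caller rejects the response
```

Returning `None` rather than the last malformed answer keeps the failure explicit, so a caller cannot serialize non-compliant text by accident.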
1.6. Worked Example — Scope 3 Extraction
AI free text (initial):
“Scope 3 Category 1 emissions are 500 tCO₂e, with revenue of €100m, giving an intensity of 5 tCO₂e per €m.”
Regex passes:
- 500 tCO2e (numeric unit)
- €100m (currency)
- 5 tCO2e/€m (derived unit)
Extractor normalizes:
```json
{
  "scope3_cat1": 500,
  "revenue": 100,
  "value": 5,
  "unit": "tCO2e/€m"
}
```
Validation result: ✅ Passed → forwarded to JSON schema serialization.
If the AI had written:
“five hundred units of carbon dioxide equivalent”
❌ Regex mismatch → AI refusal → request retry with stricter prompt.
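One practical detail in this example is that the free text writes "tCO₂e" with a subscript, while the patterns expect the ASCII form "tCO2e". The sketch below assumes a Unicode NFKC normalization step before matching; that step, the unanchored pattern variants, and the function names are illustrative assumptions rather than the documented ZAYAZ extractor.

```python
import re
import unicodedata

# Unanchored variants so the tokens can be found inside a sentence.
UNIT = re.compile(r"(\d+(?:\.\d+)?)\s?(tCO2e)")
CURRENCY = re.compile(r"€(\d+(?:\.\d+)?)(m|bn)?")


def normalize(text: str) -> str:
    """Fold compatibility characters (such as subscript 2) into plain forms."""
    return unicodedata.normalize("NFKC", text)


def extract_scope3(text: str):
    """Return emissions and revenue tokens, or None if either one is missing."""
    clean = normalize(text)
    emissions = UNIT.search(clean)
    revenue = CURRENCY.search(clean)
    if not emissions or not revenue:
        return None  # regex mismatch: caller should request a retry
    return {
        "scope3_cat1": float(emissions.group(1)),
        "revenue": float(revenue.group(1)),
    }
```

With this sketch, the compliant sentence yields both tokens, while "five hundred units of carbon dioxide equivalent" yields `None` and would trigger the retry path.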
1.7. Governance Controls
Regex/extractor enforcement aligns with AI governance in three ways:
- Structural determinism: AI outputs become predictable tokens, not free text.
- Compliance enforcement: Only ESRS-compliant datapoints and units are allowed.
- Auditability: Regex logs (pass/fail) become part of the AI Validation Log for verifiers/regulators.
1.8. Regex Library Reference
The following table defines the approved regex patterns for ZAYAZ AI enforcement. All patterns are version-controlled; changes require a governance approval flow.
| Name | Regex Pattern | Example Match | Purpose / Governance Note |
|---|---|---|---|
| citation_regex | `\[.+?:L\d+-L\d+\]` | [E1:L45-L60] | Enforces paragraph-level ESRS citations with doc name + line range. Required in AI output. |
| datapoint_regex | `^E([1-5]\|S[1-4]\|G1)(-[0-9]+)+$` | E1-6-1 | Matches ESRS datapoint references. |
| nace_regex | `^[A-U]\d{2}(\.\d+)?$` | C20.3, A01 | Validates EU NACE activity codes. Restricts AI to official sector taxonomies. |
| unit_regex | `^\d+(\.\d+)?\s?(tCO2e\|MWh\|kWh\|€m\|%)$` | 500 tCO2e, 12.5 MWh | Validates numeric values followed by an approved unit. |
| currency_regex | `^€\d+(\.\d+)?(m\|bn)?$` | €100m, €2.3bn | Validates euro amounts with an optional m/bn suffix. |
| date_regex | `^\d{4}(-Q[1-4])?$` | 2025, 2025-Q3 | Fiscal years or quarters only; no vague/unsupported date formats. |
| hash_regex | `^[a-f0-9]{64}$` | f7c3bc1d808e04732adf679965ccc34c... | Validates dataset hashes (SHA-256). Ensures reproducibility in compute provenance. |
| provenance_id | `^ZAYAZ-[A-Z0-9]{12}$` | ZAYAZ-9F3A2D8B7C1E | Canonical provenance IDs. Guarantees each AI response is traceable. |
Usage in Enforcement
- Pre-Serialization: AI output checked against citation_regex, datapoint_regex, unit_regex.
- Schema Serialization: Extractors convert matches into canonical JSON tokens.
- Provenance & Audit: hash_regex and provenance_id ensure cryptographic reproducibility.
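The provenance step can be illustrated with a short check that a SHA-256 digest and a provenance ID both have the expected shape. This is a sketch under assumptions: the helper names are hypothetical, and the patterns assume a 64-character lowercase hex digest and a 12-character ID suffix.

```python
import hashlib
import re

HASH_REGEX = re.compile(r"^[a-f0-9]{64}$")
PROVENANCE_ID = re.compile(r"^ZAYAZ-[A-Z0-9]{12}$")


def dataset_hash(payload: bytes) -> str:
    """Return the lowercase hex SHA-256 digest of the payload."""
    return hashlib.sha256(payload).hexdigest()


def is_valid_provenance(hash_value: str, provenance_id: str) -> bool:
    """Check both identifiers; fullmatch also rejects trailing newlines."""
    return (
        HASH_REGEX.fullmatch(hash_value) is not None
        and PROVENANCE_ID.fullmatch(provenance_id) is not None
    )
```

Note that a format check only confirms the value looks like a SHA-256 digest; confirming that it belongs to a given dataset still requires recomputing the hash and comparing.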
Example Config (Policy Engine Snippet)
```json
{
  "rules": {
    "citation_pattern": "\\[.+?:L\\d+-L\\d+\\]",
    "datapoint_pattern": "^E([1-5]|S[1-4]|G1)(-[0-9]+)+$",
    "unit_pattern": "^\\d+(\\.\\d+)?\\s?(tCO2e|MWh|kWh|€m|%)$",
    "require_provenance_id": true
  }
}
```
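A consumer of this snippet could load and compile the rules roughly as follows. The `load_rules` helper and the convention of compiling every key ending in `_pattern` are assumptions made for this sketch. The doubled backslashes are JSON escaping, so each pattern decodes to an ordinary regex with single backslashes.

```python
import json
import re

# Raw string so the JSON escapes (\\d, \\[) reach json.loads unchanged.
POLICY_JSON = r'''
{
  "rules": {
    "citation_pattern": "\\[.+?:L\\d+-L\\d+\\]",
    "datapoint_pattern": "^E([1-5]|S[1-4]|G1)(-[0-9]+)+$",
    "unit_pattern": "^\\d+(\\.\\d+)?\\s?(tCO2e|MWh|kWh|€m|%)$",
    "require_provenance_id": true
  }
}
'''


def load_rules(raw: str) -> dict:
    """Compile every *_pattern entry and pass other settings through as-is."""
    rules = json.loads(raw)["rules"]
    return {
        key: re.compile(value) if key.endswith("_pattern") else value
        for key, value in rules.items()
    }


rules = load_rules(POLICY_JSON)
```

Compiling at load time means a malformed pattern fails when the policy is loaded rather than on the first AI response that reaches it.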
1.9. Governance Process for Updating Regex Specs
Regex specifications are compliance-critical artifacts. They cannot be changed ad hoc; every update must pass through the ZAYAZ AI Lifecycle governance flow.
Lifecycle Stages
| Stage | Action | Responsible Role | Audit Output |
|---|---|---|---|
| Proposal | Developer submits PR adding/modifying regex entry in regex_library.json. Must include description, test cases, and rationale. | Developer / Feature Team | Git commit + PR review |
| Validation Harness | Automated suite runs: test inputs, golden outputs, fuzz tests (malformed inputs). Fails if new regex causes regressions. | CI/CD Pipeline | Validation report (pass/fail) |
| Governance Review | AI Governance Committee reviews change log, compliance risks (e.g., whether regex broadens scope improperly). | Governance Committee | Review log + approval/rejection |
| Staging Rollout | Regex spec applied in staging tenants. Shadow AI runs confirm no drift or refusal mismatches. | SRE / AI Ops Team | Shadow test logs + delta analysis |
| Promotion | Regex merged into production allowlist. Version bump applied to regex_library_version. | Release Manager | Release note entry |
| Deprecation | Old regex patterns remain supported for a grace period (configurable, default 30 days). Then marked deprecated. | AI Ops + Governance | Deprecation log |
Example: Adding a New Unit Regex
- PR Submission

```json
{
  "name": "unit_regex_energy",
  "pattern": "^\\d+(\\.\\d+)?\\s?(MWh|kWh|GWh)$",
  "description": "Accept only energy values in MWh/kWh/GWh."
}
```

- Validation Harness Output

```text
PASS: "12.5 MWh"
PASS: "1000 kWh"
FAIL: "12 MJ" (rejected as intended)
```

- Governance Approval: Committee approves, rationale recorded: “Restricts energy reporting to electricity-related units per ESRS E1 standard.”
- Release Note

```text
regex_library v1.3.0
- Added unit_regex_energy for MWh/kWh/GWh (E1 compliance).
```
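The validation step in this example can be approximated with a small regression helper. This is a sketch only: `run_harness` and its two input lists are hypothetical, and a real suite would also include the golden outputs and fuzz tests listed in the lifecycle table.

```python
import re


def run_harness(pattern: str, should_pass: list[str], should_fail: list[str]) -> list[str]:
    """Return regression messages; an empty list means every case behaved as expected."""
    compiled = re.compile(pattern)
    problems = []
    for text in should_pass:
        if not compiled.fullmatch(text):
            problems.append(f"expected PASS but failed: {text!r}")
    for text in should_fail:
        if compiled.fullmatch(text):
            problems.append(f"expected FAIL but passed: {text!r}")
    return problems


# Candidate pattern from the PR submission above.
UNIT_REGEX_ENERGY = r"^\d+(\.\d+)?\s?(MWh|kWh|GWh)$"
report = run_harness(UNIT_REGEX_ENERGY, ["12.5 MWh", "1000 kWh"], ["12 MJ"])
```

Listing inputs that must be rejected alongside inputs that must be accepted helps catch a pattern that has been broadened unintentionally.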
Enforcement Policy
- No bypass: AI models and extractors must use only patterns in the approved regex_library.json.
- Reproducibility: Every AI output records the regex_library_version used.
- Auditability: Regex changes are logged in the same trail as model/router updates.
This way, regex governance is just as strict as compute methods or model promotion — reinforcing ZAYAZ’s principle that every enforcement artifact is part of the AI trust boundary.
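To make the reproducibility and auditability points concrete, a validation log entry might carry the pattern name, the library version, and the outcome. The `ValidationLogEntry` fields below are assumed for illustration and are not the documented AI Validation Log schema.

```python
import re
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ValidationLogEntry:
    """Hypothetical audit record for one regex check."""
    regex_name: str
    regex_library_version: str
    passed: bool


def check_and_log(regex_name: str, pattern: str, text: str, library_version: str) -> dict:
    """Run one regex check and return a serializable log record."""
    passed = re.search(pattern, text) is not None
    return asdict(ValidationLogEntry(regex_name, library_version, passed))
```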
1.10. Parallel Governance: Regex Library vs. Compute Method Registry
Both regex specs and compute methods are compliance-critical. They control what AI is allowed to output (regex) and how it derives numbers (compute methods). To prevent drift or hidden changes, ZAYAZ enforces the same governance lifecycle for both artifacts.
Comparison Table
| Dimension | Compute Method Registry | Regex Library |
|---|---|---|
| Identifier | (method_id, version) composite key | (regex_name, version) composite key |
| Schema | JSONB: input, options, output schema | JSON: pattern, description, compliance tags |
| Lifecycle Stages | Proposal → Validation Harness → Governance Review → Staging Rollout → Promotion | Proposal → Validation Harness → Governance Review → Staging Rollout → Promotion |
| Versioning | Semantic versions (v1.0.0, v1.1.0…) allow side-by-side rollout | Semantic versions (v1.0.0, v1.1.0…) allow safe migrations |
| Deprecation Policy | Old versions discoverable but flagged as deprecated | Old regex patterns discoverable but flagged as deprecated |
| Auditability | Every compute call logs (method_id, version) + provenance_id | Every AI output logs (regex_name, version) + provenance_id |
| Enforcement Surface | API layer (POST /compute/factor) | AI output validators + post-processing layer |
| Example | GHG.intensity v1.0.0 | citation_regex v1.3.0 |
Example: Citation Regex Parallel to Compute
- Compute Call
```json
{
  "method_id": "GHG.intensity",
  "version": "1.0.0",
  "inputs": { "scope1": 100, "scope2": 200, "revenue": 50 }
}
```
→ Deterministic result, logged with (method_id, version).
- Regex Enforcement
```json
{
  "regex_name": "citation_regex",
  "version": "1.3.0",
  "pattern": "\\[.+?:L\\d+-L\\d+\\]",
  "description": "Ensure paragraph-level citations."
}
```
→ AI output must validate against this regex; failure = refusal.
Unified Lifecycle Contract
- Proposals: Both live in Git PRs.
- Validation Harness: Both run regression tests.
- Governance Review: AI Governance Committee approves or rejects.
- Staging: Both tested in shadow AI runs before promotion.
- Promotion/Deprecation: Both follow the same semantic versioning + allowlist pattern.
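The (regex_name, version) composite key and the allowlist idea could be modelled as a dictionary lookup, as in the sketch below. `RegexEntry`, `REGISTRY`, and `validate` are hypothetical names; the actual storage behind regex_library.json is not described here.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RegexEntry:
    """Hypothetical registry record for one versioned pattern."""
    pattern: str
    description: str
    deprecated: bool = False


# Allowlist keyed by the (regex_name, version) composite key.
REGISTRY: dict[tuple[str, str], RegexEntry] = {
    ("citation_regex", "1.3.0"): RegexEntry(
        r"\[.+?:L\d+-L\d+\]", "Ensure paragraph-level citations."
    ),
}


def validate(regex_name: str, version: str, text: str) -> bool:
    """Test text against an allowlisted pattern; unknown keys are an error."""
    entry = REGISTRY.get((regex_name, version))
    if entry is None:
        raise KeyError(f"{regex_name} {version} is not in the allowlist")
    return re.search(entry.pattern, text) is not None
```

Raising on an unknown key, instead of silently passing, reflects the no-bypass rule: a validator cannot fall back to a pattern that was never approved.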
1.11. Closing Notes
Regex and Extractor Specs form the first compliance lock in the ZAYAZ AI pipeline. They ensure that:
- AI cannot drift into unsupported formats.
- Outputs are machine-extractable, schema-compatible, and auditable.
- Every disclosure, from Scope 1 to Scope 3, starts with validated tokens before being packaged into JSON (§27), transformed into XBRL (§27.4), and exported to verifiers/regulators.