
AIIL

Regex/Extractor Specs

1. Regex as AI Guardrails

1.1. Overview

While schemas and APIs define the “hard contracts” for disclosures, ZAYAZ also employs Regex and Extractor Specifications as AI guardrails at the generation layer.

Why this matters:

  • AI outputs can drift: even with strong prompting, models may produce citations in the wrong format, extra text around numbers, or invalid identifiers.
  • Regex validation acts as a pre-filter: AI answers must match strict patterns before they can be serialized into JSON or exported.
  • Extractors enforce determinism: text is parsed into canonical tokens (datapoint IDs, NACE codes, numeric values with units) that downstream systems can trust.

Together, regex + extractors ensure that AI free text is transformed into standards-compliant tokens before it ever enters the disclosure pipeline.

1.2. Core Regex Patterns

| Domain | Regex Pattern Example | Purpose |
|---|---|---|
| Citations | `\[.+?:L\d+-L\d+\]` | Enforce ESRS-style citations with doc name + line range. |
| Datapoint IDs | `^E([1-5]\|S[1-4]\|G1)(-[0-9]+)+$` | Match ESRS datapoint references (e.g., E1-6-1). |
| NACE Codes | `^[A-U]\d{2}(\.\d+)?$` | Validate EU NACE sector/activity codes (e.g., C20.3, A01). |
| Numeric Units | `^\d+(\.\d+)?\s?(tCO2e\|MWh\|€m\|%)$` | Match a numeric value followed by a recognized unit (e.g., 123.5 tCO2e). |
| Dates | `^\d{4}(-Q[1-4])?$` | Restrict to fiscal years or quarters (e.g., 2025, 2025-Q2). |
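As a quick sanity check, the core patterns can be compiled and run against the examples quoted for them. This is a minimal sketch under stated assumptions: the quantifier braces (`{2}`, `{4}`) and the unit alternation are restored from the patterns used elsewhere in this section, the datapoint pattern is the one from the policy-engine snippet in 1.8, and the names `CORE_PATTERNS` and `matches` are hypothetical.

```python
import re

# Hypothetical registry name; patterns follow section 1.2 with braces restored.
CORE_PATTERNS = {
    "citation": re.compile(r"\[.+?:L\d+-L\d+\]"),
    "datapoint": re.compile(r"^E([1-5]|S[1-4]|G1)(-[0-9]+)+$"),
    "nace": re.compile(r"^[A-U]\d{2}(\.\d+)?$"),
    "numeric_unit": re.compile(r"^\d+(\.\d+)?\s?(tCO2e|MWh|€m|%)$"),
    "date": re.compile(r"^\d{4}(-Q[1-4])?$"),
}


def matches(domain: str, text: str) -> bool:
    """Return True if text satisfies the pattern registered for domain."""
    return CORE_PATTERNS[domain].search(text) is not None


# Examples taken from the Purpose column.
examples = [
    ("citation", "[E1:L45-L60]"),
    ("datapoint", "E1-6-1"),
    ("nace", "C20.3"),
    ("nace", "A01"),
    ("numeric_unit", "123.5 tCO2e"),
    ("date", "2025-Q2"),
]
for domain, text in examples:
    print(f"{domain:13} {text:14} {matches(domain, text)}")
```

Anchored patterns (`^...$`) validate a whole token, while the unanchored citation pattern is meant to find a citation inside a longer sentence.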

AI governance link:

  • Prevents “hallucinated” formats.
  • Forces AI into compliance-bound language before reaching schema.

1.3. Extractor Layer

Regex is paired with extractor functions that normalize matches into canonical tokens.

Example:

Raw AI output:

“The company emitted 123.5 tonnes of CO₂ per € million revenue.”

Extractor result:

extractor-output.json
{
  "value": 123.5,
  "unit": "tCO2e/€m",
  "datapoint_ref": "E1-6-1"
}

Implementation sketch:

implementation-sketch.py
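import re

# Number (group 1) followed by an optional space and a canonical unit token (group 2).
NUMERIC_UNIT = re.compile(r"(\d+(?:\.\d+)?)\s?(tCO2e|MWh|€m|%)")

def extract_numeric_unit(text: str):
    """Return {"value", "unit"} for the first number+unit match, or None."""
    # Only canonical unit tokens match; verbose phrasing such as
    # "tonnes of CO₂" must be normalized before this step.
    match = NUMERIC_UNIT.search(text)
    if not match:
        return None
    return {"value": float(match.group(1)), "unit": match.group(2)}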
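The sketch above matches only canonical unit tokens, so the raw sentence in the example ("123.5 tonnes of CO₂ per € million revenue") needs a normalization step before it yields the extractor result shown. One possible approach, offered as an illustration rather than the ZAYAZ implementation, is an alias table that maps verbose phrasings to canonical units. The alias list, the function name `extract_verbose`, and passing `datapoint_ref` in from the caller are all assumptions.

```python
import re

# Hypothetical alias table: verbose phrasing -> canonical unit token.
# Ordered longest first so the intensity phrasing wins over the plain one.
UNIT_ALIASES = [
    (r"tonnes of CO[₂2]e? per € million(?: revenue)?", "tCO2e/€m"),
    (r"tonnes of CO[₂2]e?", "tCO2e"),
]


def extract_verbose(text: str, datapoint_ref: str):
    """Return a canonical token dict for the first aliased unit found, else None."""
    for phrase, unit in UNIT_ALIASES:
        m = re.search(r"(\d+(?:\.\d+)?)\s+" + phrase, text, flags=re.IGNORECASE)
        if m:
            return {
                "value": float(m.group(1)),
                "unit": unit,
                # Assumed to come from the request context, not from the text.
                "datapoint_ref": datapoint_ref,
            }
    return None


raw = "The company emitted 123.5 tonnes of CO₂ per € million revenue."
print(extract_verbose(raw, "E1-6-1"))
```

Keeping the aliases in a reviewed table, rather than loosening the regex itself, keeps the set of accepted phrasings explicit and auditable.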

1.4. Integration with Eval Harness

Every AI answer is checked against regex/extractor specs in the Evaluation Harness before promotion.

  • Pass criteria: All required fields (citations, datapoints, values) are extractable.
  • Fail criteria: Regex mismatch or missing token → AI must retry or answer is rejected.

Example test (citation validation):

citation-validation.yaml
test_case: "GHG intensity disclosure"
ai_output: "See ESRS E1:L45-L60 for details."
expected: fail # citation is not enclosed in square brackets
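A harness runner for cases like this one could look roughly as follows. It is a sketch, not the actual Evaluation Harness: the cases are written as Python dictionaries to avoid a YAML dependency, the second case is added here for contrast, and `run_case` is a hypothetical name.

```python
import re

CITATION = re.compile(r"\[.+?:L\d+-L\d+\]")


def run_case(case: dict) -> bool:
    """Return True when the observed outcome equals the expected outcome."""
    observed = "pass" if CITATION.search(case["ai_output"]) else "fail"
    return observed == case["expected"]


cases = [
    {
        "test_case": "GHG intensity disclosure",
        "ai_output": "See ESRS E1:L45-L60 for details.",
        "expected": "fail",  # no enclosing brackets
    },
    {
        "test_case": "GHG intensity disclosure (bracketed)",
        "ai_output": "See [ESRS E1:L45-L60] for details.",
        "expected": "pass",
    },
]

for case in cases:
    status = "OK" if run_case(case) else "MISMATCH"
    print(f"{status}: {case['test_case']}")
```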

Governance link:

  • Keeps AI accountable to structural compliance, not just semantic accuracy.

1.5. Enforcement in Runtime

Regex/extractor enforcement is not just for tests — it runs in production AI responses:

| Enforcement Stage | What Regex Validates | Action on Fail |
|---|---|---|
| Pre-serialization | Citations, datapoint IDs, NACE codes, units | Hard refusal → AI retries |
| Post-serialization | JSON payload re-validated against schema | Reject API call |
| Verifier Packaging | Evidence hashes match regex formats | Package rejected |

This ensures no malformed AI output can bypass compliance gates.
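The pre-serialization stage ("hard refusal → AI retries") can be pictured as a bounded retry loop around the generator. The sketch below assumes a retry limit of three attempts and a generator callable that receives the attempt number; neither is specified in this section, and `generate_with_guardrail` and `flaky_model` are hypothetical names.

```python
import re
from typing import Callable, Optional

CITATION = re.compile(r"\[.+?:L\d+-L\d+\]")


def generate_with_guardrail(
    generate: Callable[[int], str], max_attempts: int = 3
) -> Optional[str]:
    """Call generate until its answer contains a valid citation.

    Returns the first compliant answer, or None if every attempt fails
    (the caller then rejects the response instead of serializing it).
    """
    for attempt in range(1, max_attempts + 1):
        answer = generate(attempt)
        if CITATION.search(answer):
            return answer
    return None


def flaky_model(attempt: int) -> str:
    """Stand-in for a model: malformed on the first try, compliant afterwards."""
    if attempt == 1:
        return "See ESRS E1:L45-L60 for details."
    return "See [ESRS E1:L45-L60] for details."


print(generate_with_guardrail(flaky_model))
```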

1.6. Worked Example — Scope 3 Extraction

AI free text (initial):

“Scope 3 Category 1 emissions are 500 tCO₂e, with revenue of €100m, giving an intensity of 5 tCO₂e per €m.”

Regex passes:

  • 500 tCO2e (numeric unit)
  • €100m (currency amount)
  • 5 tCO2e/€m (derived unit)

Extractor normalizes:

extractor-norm.json
{
  "scope3_cat1": 500,
  "revenue": 100,
  "value": 5,
  "unit": "tCO2e/€m"
}

Validation result: ✅ Passed → forwarded to JSON schema serialization.

If the AI had written:

“five hundred units of carbon dioxide equivalent”

❌ Regex mismatch → AI refusal → request retry with stricter prompt.
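The worked example can be traced end to end in a few lines. This sketch makes several assumptions that the section does not state: subscript digits are folded to ASCII before matching (so "tCO₂e" becomes "tCO2e"), "per €m" and "/€m" are both accepted for the intensity, and the extractor cross-checks that emissions divided by revenue equals the stated intensity (500 / 100 = 5). The function name `extract_scope3` is hypothetical.

```python
import re

# Fold Unicode subscript digits to ASCII so "tCO₂e" becomes "tCO2e".
SUBSCRIPTS = str.maketrans("₀₁₂₃₄₅₆₇₈₉", "0123456789")

NUM = r"(\d+(?:\.\d+)?)"
# Absolute emissions: a tCO2e value that is NOT followed by "/" or "per".
EMISSIONS = re.compile(NUM + r"\s?tCO2e(?!\s*(?:/|per\b))")
REVENUE = re.compile(r"€" + NUM + r"m\b")
INTENSITY = re.compile(NUM + r"\s?tCO2e\s*(?:/|per)\s*€m")


def extract_scope3(text: str):
    """Return the normalized token dict, or None on any mismatch."""
    text = text.translate(SUBSCRIPTS)
    found = (EMISSIONS.search(text), REVENUE.search(text), INTENSITY.search(text))
    if not all(found):
        return None  # regex mismatch: caller refuses and requests a retry
    emissions, revenue, intensity = (float(m.group(1)) for m in found)
    # Assumed cross-check: stated intensity must equal emissions / revenue.
    if revenue == 0 or abs(emissions / revenue - intensity) > 1e-9:
        return None
    return {
        "scope3_cat1": emissions,
        "revenue": revenue,
        "value": intensity,
        "unit": "tCO2e/€m",
    }


good = ("Scope 3 Category 1 emissions are 500 tCO₂e, with revenue of €100m, "
        "giving an intensity of 5 tCO₂e per €m.")
bad = "five hundred units of carbon dioxide equivalent"
print(extract_scope3(good))
print(extract_scope3(bad))
```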

1.7. Governance Controls

Regex/extractor enforcement aligns with AI governance in three ways:

  • Structural determinism: AI outputs become predictable tokens, not free text.
  • Compliance enforcement: Only ESRS-compliant datapoints and units are allowed.
  • Auditability: Regex logs (pass/fail) become part of the AI Validation Log for verifiers/regulators.

1.8. Regex Library Reference

The following table defines the approved regex patterns for ZAYAZ AI enforcement. All patterns are version-controlled; changes require a governance approval flow.

| Name | Regex Pattern | Example Match | Purpose / Governance Note |
|---|---|---|---|
| citation_regex | `\[.+?:L\d+-L\d+\]` | [E1:L45-L60] | Enforces paragraph-level ESRS citations with doc name + line range. Required in AI output. |
| datapoint_regex | `^E([1-5]\|S[1-4]\|G1)(-[0-9]+)+$` | E1-6-1 | Matches ESRS datapoint references. |
| nace_regex | `^[A-U]\d{2}(\.\d+)?$` | C20.3, A01 | Validates EU NACE activity codes. Restricts AI to official sector taxonomies. |
| unit_regex | `^\d+(\.\d+)?\s?(tCO2e\|MWh\|kWh\|€m\|%)$` | 500 tCO2e | Matches numeric values followed by a recognized unit. |
| currency_regex | `^€\d+(\.\d+)?(m\|bn)?$` | €100m, €2.3bn | Matches euro amounts, optionally in millions (m) or billions (bn). |
| date_regex | `^\d{4}(-Q[1-4])?$` | 2025, 2025-Q3 | Fiscal years or quarters only; no vague/unsupported date formats. |
| hash_regex | `^[a-f0-9]{64}$` | f7c3bc1d808e04732adf679965ccc34c... | Validates dataset hashes (SHA-256). Ensures reproducibility in compute provenance. |
| provenance_id | `^ZAYAZ-[A-Z0-9]{12}$` | ZAYAZ-9F3A2D8B7C1E | Canonical provenance IDs. Guarantees each AI response is traceable. |

Usage in Enforcement

  • Pre-Serialization: AI output checked against citation_regex, datapoint_regex, unit_regex.
  • Schema Serialization: Extractors convert matches into canonical JSON tokens.
  • Provenance & Audit: hash_regex and provenance_id ensure cryptographic reproducibility.
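For the provenance and audit step, a format check on the hash and provenance ID might look like the sketch below. It assumes the quantifiers `{64}` and `{12}` in the two patterns (consistent with a SHA-256 hex digest and the 12-character example ID); the function names are hypothetical.

```python
import hashlib
import re

HASH_REGEX = re.compile(r"^[a-f0-9]{64}$")
PROVENANCE_ID = re.compile(r"^ZAYAZ-[A-Z0-9]{12}$")


def dataset_hash(payload: bytes) -> str:
    """Return the lowercase hex SHA-256 digest of payload."""
    return hashlib.sha256(payload).hexdigest()


def is_valid_evidence(digest: str, provenance_id: str) -> bool:
    """Check that both identifiers have the expected format."""
    return bool(HASH_REGEX.match(digest)) and bool(PROVENANCE_ID.match(provenance_id))


digest = dataset_hash(b"example dataset bytes")
print(is_valid_evidence(digest, "ZAYAZ-9F3A2D8B7C1E"))
```

A format match only shows that the value looks like a SHA-256 digest; confirming that it belongs to a given dataset still requires recomputing the hash from the data.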

Example Config (Policy Engine Snippet)

policy-engine-snippet.json
{
  "rules": {
    "citation_pattern": "\\[.+?:L\\d+-L\\d+\\]",
    "datapoint_pattern": "^E([1-5]|S[1-4]|G1)(-[0-9]+)+$",
    "unit_pattern": "^\\d+(\\.\\d+)?\\s?(tCO2e|MWh|kWh|€m|%)$",
    "require_provenance_id": true
  }
}
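A policy engine could consume this configuration roughly as sketched here. The shape of the answer being checked (`citation`, `datapoint_ref`, `value_with_unit`, `provenance_id`) and the function name `check` are assumptions for illustration; only the rule names and patterns come from the snippet.

```python
import json
import re

# Raw string so the JSON escapes (\\d, \\[) reach the JSON parser unchanged.
POLICY_JSON = r'''
{
  "rules": {
    "citation_pattern": "\\[.+?:L\\d+-L\\d+\\]",
    "datapoint_pattern": "^E([1-5]|S[1-4]|G1)(-[0-9]+)+$",
    "unit_pattern": "^\\d+(\\.\\d+)?\\s?(tCO2e|MWh|kWh|€m|%)$",
    "require_provenance_id": true
  }
}
'''

RULES = json.loads(POLICY_JSON)["rules"]

# Hypothetical mapping from rule name to the answer field it validates.
FIELD_FOR_RULE = {
    "citation_pattern": "citation",
    "datapoint_pattern": "datapoint_ref",
    "unit_pattern": "value_with_unit",
}


def check(answer: dict, rules: dict = RULES) -> list:
    """Return the names of violated rules; an empty list means the answer passes."""
    failures = [
        rule
        for rule, field in FIELD_FOR_RULE.items()
        if not re.search(rules[rule], answer.get(field, ""))
    ]
    if rules.get("require_provenance_id") and not answer.get("provenance_id"):
        failures.append("require_provenance_id")
    return failures


ok_answer = {
    "citation": "[E1:L45-L60]",
    "datapoint_ref": "E1-6-1",
    "value_with_unit": "500 tCO2e",
    "provenance_id": "ZAYAZ-9F3A2D8B7C1E",
}
print(check(ok_answer))
```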

1.9. Governance Process for Updating Regex Specs

Regex specifications are compliance-critical artifacts. They cannot be changed ad hoc; every update must pass through the ZAYAZ AI Lifecycle governance flow.

Lifecycle Stages

| Stage | Action | Responsible Role | Audit Output |
|---|---|---|---|
| Proposal | Developer submits PR adding/modifying regex entry in regex_library.json. Must include description, test cases, and rationale. | Developer / Feature Team | Git commit + PR review |
| Validation Harness | Automated suite runs: test inputs, golden outputs, fuzz tests (malformed inputs). Fails if new regex causes regressions. | CI/CD Pipeline | Validation report (pass/fail) |
| Governance Review | AI Governance Committee reviews change log, compliance risks (e.g., whether regex broadens scope improperly). | Governance Committee | Review log + approval/rejection |
| Staging Rollout | Regex spec applied in staging tenants. Shadow AI runs confirm no drift or refusal mismatches. | SRE / AI Ops Team | Shadow test logs + delta analysis |
| Promotion | Regex merged into production allowlist. Version bump applied to regex_library_version. | Release Manager | Release note entry |
| Deprecation | Old regex patterns remain supported for a grace period (configurable, default 30 days). Then marked deprecated. | AI Ops + Governance | Deprecation log |

Example: Adding a New Unit Regex

  • PR Submission
pr-submission.json
{
  "name": "unit_regex_energy",
  "pattern": "^\\d+(\\.\\d+)?\\s?(MWh|kWh|GWh)$",
  "description": "Accept only energy values in MWh/kWh/GWh."
}
  • Validation Harness Output
PASS: "12.5 MWh"  
PASS: "1000 kWh"
FAIL: "12 MJ" (rejected as intended)
  • Governance Approval
Committee approves, rationale recorded: “Restricts energy reporting to electricity-related units per ESRS E1 standard.”

  • Release Note

regex_library v1.3.0  
- Added unit_regex_energy for MWh/kWh/GWh (E1 compliance).
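The validation-harness step for this proposal can be approximated with golden cases plus a handful of malformed inputs. The golden cases mirror the harness output above; the fuzz inputs and the function name `run_harness` are assumptions added for illustration, not the actual CI suite.

```python
import re

proposal = {
    "name": "unit_regex_energy",
    "pattern": r"^\d+(\.\d+)?\s?(MWh|kWh|GWh)$",
}

# (input, should_match) pairs taken from the harness output above.
golden = [("12.5 MWh", True), ("1000 kWh", True), ("12 MJ", False)]
# Hypothetical malformed inputs; every one of them must be rejected.
fuzz = ["", "MWh", "12.5.3 MWh", "12,5 MWh", "-3 kWh"]


def run_harness(proposal: dict, golden: list, fuzz_inputs: list) -> list:
    """Return (input, ok) pairs; ok is True when the regex behaves as intended."""
    pattern = re.compile(proposal["pattern"])
    report = []
    for text, should_match in golden:
        matched = pattern.search(text) is not None
        report.append((text, matched == should_match))
    for text in fuzz_inputs:
        report.append((text, pattern.search(text) is None))
    return report


report = run_harness(proposal, golden, fuzz)
for text, ok in report:
    print(("OK  " if ok else "FAIL"), repr(text))
```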

Enforcement Policy

  • No bypass: AI models and extractors must use only patterns in the approved regex_library.json.
  • Reproducibility: Every AI output records the regex_library_version used.
  • Auditability: Regex changes are logged in the same trail as model/router updates.
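The reproducibility rule implies that each validation result carries the library version alongside the outcome. A log entry could be assembled as in the sketch below; the field names, the `log_entry` function, and the use of a UTC timestamp are assumptions, and the version string is the one from the release note above.

```python
import json
import re
from datetime import datetime, timezone

REGEX_LIBRARY_VERSION = "1.3.0"  # taken from the release note example
CITATION = re.compile(r"\[.+?:L\d+-L\d+\]")


def log_entry(regex_name: str, pattern: re.Pattern, text: str, provenance_id: str) -> dict:
    """Build one validation-log record (hypothetical field names)."""
    return {
        "regex_name": regex_name,
        "regex_library_version": REGEX_LIBRARY_VERSION,
        "provenance_id": provenance_id,
        "result": "pass" if pattern.search(text) else "fail",
        "checked_at": datetime.now(timezone.utc).isoformat(),
    }


entry = log_entry("citation_regex", CITATION, "See [E1:L45-L60].", "ZAYAZ-9F3A2D8B7C1E")
print(json.dumps(entry, indent=2))
```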

This way, regex governance is just as strict as compute methods or model promotion — reinforcing ZAYAZ’s principle that every enforcement artifact is part of the AI trust boundary.

1.10. Parallel Governance: Regex Library vs. Compute Method Registry

Both regex specs and compute methods are compliance-critical. They control what AI is allowed to output (regex) and how it derives numbers (compute methods). To prevent drift or hidden changes, ZAYAZ enforces the same governance lifecycle for both artifacts.

Comparison Table

| Dimension | Compute Method Registry | Regex Library |
|---|---|---|
| Identifier | (method_id, version) composite key | (regex_name, version) composite key |
| Schema | JSONB: input, options, output schema | JSON: pattern, description, compliance tags |
| Lifecycle Stages | Proposal → Validation Harness → Governance Review → Promotion | Proposal → Validation Harness → Governance Review → Promotion |
| Versioning | Semantic versions (v1.0.0, v1.1.0…) allow side-by-side rollout | Semantic versions (v1.0.0, v1.1.0…) allow safe migrations |
| Deprecation Policy | Old versions discoverable but flagged as deprecated | Old regex patterns discoverable but flagged as deprecated |
| Auditability | Every compute call logs (method_id, version) + provenance_id | Every AI output logs (regex_name, version) + provenance_id |
| Enforcement Surface | API layer (POST /compute/factor) | AI output validators + post-processing layer |
| Example | GHG.intensity v1.0.0 | citation_regex v1.3.0 |

Example: Citation Regex Parallel to Compute

  • Compute Call
compute-call.json
{
  "method_id": "GHG.intensity",
  "version": "1.0.0",
  "inputs": { "scope1": 100, "scope2": 200, "revenue": 50 }
}

→ Deterministic result, logged with (method_id, version).

  • Regex Enforcement
regex-enforcement.json
{
  "regex_name": "citation_regex",
  "version": "1.3.0",
  "pattern": "\\[.+?:L\\d+-L\\d+\\]",
  "description": "Ensure paragraph-level citations."
}

→ AI output must validate against this regex; failure = refusal.
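The (regex_name, version) composite key suggests a lookup that refuses any pattern not present in the approved library. The sketch below illustrates that idea with an in-memory dictionary; the `RegexSpec` class, the `REGISTRY` contents beyond the entry shown above, and the choice to raise `KeyError` for unknown versions are assumptions.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RegexSpec:
    pattern: str
    description: str
    deprecated: bool = False  # deprecated specs stay discoverable but flagged


# Keyed by the (regex_name, version) composite key.
REGISTRY = {
    ("citation_regex", "1.3.0"): RegexSpec(
        pattern=r"\[.+?:L\d+-L\d+\]",
        description="Ensure paragraph-level citations.",
    ),
}


def validate(regex_name: str, version: str, text: str) -> bool:
    """Validate text against an approved spec; unknown keys are refused."""
    spec = REGISTRY.get((regex_name, version))
    if spec is None:
        raise KeyError(f"{regex_name} v{version} is not in the approved library")
    return re.search(spec.pattern, text) is not None


print(validate("citation_regex", "1.3.0", "See [E1:L45-L60]."))
```

Pinning the version in the key means an audit can replay a past validation with exactly the pattern that was in force at the time.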

Unified Lifecycle Contract

  • Proposals: Both live in Git PRs.
  • Validation Harness: Both run regression tests.
  • Governance Review: AI Governance Committee approves or rejects.
  • Staging: Both tested in shadow AI runs before promotion.
  • Promotion/Deprecation: Both follow the same semantic versioning + allowlist pattern.

1.11. Closing Notes

Regex and Extractor Specs form the first compliance lock in the ZAYAZ AI pipeline. They ensure that:

  • AI cannot drift into unsupported formats.
  • Outputs are machine-extractable, schema-compatible, and auditable.
  • Every disclosure, from Scope 1 to Scope 3, starts with validated tokens before being packaged into JSON (§27), transformed into XBRL (§27.4), and exported to verifiers/regulators.

