Dataset Registry (DR)
1. Canonical Dataset Contracts, Provenance & Access Control
1.1. Purpose and Scope
The Dataset Registry (DR) is the canonical, platform-wide registry for all datasets used within the ZAYAZ ecosystem for computation, validation, modeling, and reporting.
While standards frameworks define what must be reported and signals define what is measured, the Dataset Registry defines what data may be used, under which structural contract, with which provenance guarantees, and under which access rules.
The Dataset Registry is a horizontal system primitive and applies across all domains, including:
- CSRD / ESRS reporting
- PEF / OEF rule execution
- GHG factor calculations
- Scenario modeling (NGFS, IAMs)
- Financial impact and risk modeling
- AI-assisted inference, replay, and audit
1.2. Conceptual Position in the Architecture
The Dataset Registry operates at the data artifact layer, distinct from frameworks, rules, and signals.
| Layer | Systems |
|---|---|
| Governance & Obligations | standards_frameworks, standards_criteria |
| Semantic Measurement | SSSR, USO |
| Data Artifacts (this section) | Dataset Registry (DR) |
| Computation & Engines | MICE, NETZERO, SEM |
| Validation & Trust | DaVE, DICE, VTE |
| Audit & Replay | ALTD, DAL |
The Dataset Registry ensures that every computation performed by ZAYAZ is reproducible, auditable, and contract-bound.
1.3. What Is a Dataset in ZAYAZ?
In ZAYAZ, a dataset is not just data.
A dataset is a governed artifact with:
- explicit provenance
- a versioned structural schema
- declared integrity constraints
- controlled read semantics
- bounded validity in time
Datasets are treated as first-class artifacts, on equal footing with:
- policy artifacts
- rule definitions
- AI models
- evidence bundles
1.4. Core Responsibilities of the Dataset Registry
The Dataset Registry provides authoritative answers to:
- What dataset is this, exactly?
- Who provided it and under what license?
- What is its structural schema and schema version?
- How may it be read (snapshot, latest, time-bounded)?
- What integrity rules must hold?
- When is this dataset valid?
- Can it be used for this computation or framework?
No dataset may be consumed by a computation engine, validator, or AI workflow unless it is registered and approved in the Dataset Registry.
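As a sketch of how an engine could enforce this gate using the dataset_registry table defined in the next subsection (the dataset key and computation date below are illustrative), a dataset is resolvable only if a registered entry covers the computation date:

```sql
-- Hypothetical pre-consumption check: the dataset is resolvable only if a
-- registered entry exists whose validity window covers the computation date.
-- Approval workflows, if tracked elsewhere, would add further conditions.
SELECT dataset_id, version, schema_version
FROM dataset_registry
WHERE dataset_key = 'ghg_emission_factors'           -- illustrative key
  AND valid_from <= DATE '2025-06-30'                -- computation date
  AND (valid_to IS NULL OR valid_to >= DATE '2025-06-30');
```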
1.5. Canonical Dataset Registry Schema
The Dataset Registry is represented by the dataset_registry table.
```sql
CREATE TABLE dataset_registry (
    dataset_id         TEXT PRIMARY KEY,
    dataset_key        TEXT UNIQUE NOT NULL,
    name               TEXT NOT NULL,
    provider           TEXT NOT NULL,
    version            TEXT NOT NULL,
    license_ref        TEXT NOT NULL,
    checksum           TEXT NOT NULL,
    valid_from         DATE NOT NULL,
    valid_to           DATE,
    column_schema      JSONB NOT NULL,
    schema_version     TEXT NOT NULL,
    primary_keys       JSONB NOT NULL,
    partitions         JSONB,
    read_policy        JSONB,
    integrity_rules    JSONB,
    update_frequency   TEXT,
    document_location  TEXT,
    short_description  TEXT NOT NULL,
    notes              TEXT
);
```
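As an illustration of how an entry might be registered under this schema, the sketch below inserts a hypothetical GHG factor dataset; every identifier, version, checksum, and schema value is a placeholder, not canonical ZAYAZ content.

```sql
-- Hypothetical registration of a GHG factor dataset; identifiers, version,
-- checksum, and schema content are illustrative placeholders.
INSERT INTO dataset_registry (
    dataset_id, dataset_key, name, provider, version, license_ref, checksum,
    valid_from, column_schema, schema_version, primary_keys, short_description
) VALUES (
    'ds-ghg-0001',
    'ghg_emission_factors',
    'GHG Emission Factors 2024',
    'ExampleProvider',
    '2024.1',
    'lic-cc-by-4.0',
    'sha256:0f5e2a91c4b7d6e3',                        -- placeholder checksum
    DATE '2024-01-01',
    '{"columns": [
        {"name": "activity_id", "type": "text",    "nullable": false},
        {"name": "co2e_kg",     "type": "numeric", "nullable": false, "unit": "kgCO2e"}
     ]}'::jsonb,
    '1.0.0',
    '["activity_id"]'::jsonb,
    'Illustrative emission factor dataset used in the examples below.'
);
```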
2. Structural Schema Contract (column_schema)
2.1. The column_schema Contract
The column_schema field defines the machine-enforceable structural contract of the dataset.
It specifies:
- column names
- data types
- nullability
- units of measure
- optional semantic annotations
This schema is consumed by:
- validation engines (DaVE, DICE)
- micro-engines (MICE)
- form generators (FOGE)
- schema compatibility checks
- computation replay and audit
Any schema change must increment schema_version and follow the “Change Management & Compatibility (CMCB)” policy.
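A minimal sketch of what a column_schema value could look like, assuming a simple "columns" array with name, type, nullability, and unit keys (the exact key names are not prescribed here):

```sql
-- Illustrative column_schema value (JSONB). The key names inside the JSON
-- ("columns", "unit", "nullable") are assumptions, not a fixed platform standard.
UPDATE dataset_registry
SET column_schema = '{
      "columns": [
        {"name": "activity_id",    "type": "text",    "nullable": false},
        {"name": "co2e_kg",        "type": "numeric", "nullable": false, "unit": "kgCO2e"},
        {"name": "reporting_year", "type": "integer", "nullable": false}
      ]
    }'::jsonb,
    schema_version = '1.1.0'   -- any structural change increments schema_version
WHERE dataset_id = 'ds-ghg-0001';
```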
2.2. Read Policies and Deterministic Consumption
The read_policy field enforces deterministic dataset consumption.
Common patterns include:
- snapshot-only – immutable historical reproduction
- latest-partition – operational “current state” access
- time-bounded – constrained to explicit validity windows
Runtime engines must comply with read policies to preserve:
- auditability
- reproducibility
- regulatory defensibility
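As a sketch, a snapshot-only read policy for the hypothetical dataset registered above might be stored as follows; the key names ("mode", "snapshot_id", "allow_latest") are assumptions used for illustration:

```sql
-- Illustrative snapshot-only read policy (JSONB); the key names are assumed
-- for this sketch and are not prescribed by the registry schema itself.
UPDATE dataset_registry
SET read_policy = '{
      "mode": "snapshot-only",
      "snapshot_id": "2024-12-31",
      "allow_latest": false
    }'::jsonb
WHERE dataset_id = 'ds-ghg-0001';
```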
2.3. Integrity Rules and Trust Enforcement
integrity_rules define dataset-level guarantees such as:
- not-null constraints
- referential expectations
- statistical bounds
- row-count thresholds
- checksum verification
These rules are evaluated during:
- ingestion
- validation
- audit and replay
Violations propagate into trust scoring (DaVE / VTE) and audit logs (ALTD).
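A hedged example of how such guarantees could be encoded in integrity_rules; the rule names and thresholds are placeholders mirroring the categories listed above:

```sql
-- Illustrative integrity_rules value (JSONB); rule names and thresholds are
-- placeholders mirroring the guarantee categories listed above.
UPDATE dataset_registry
SET integrity_rules = '{
      "not_null": ["activity_id", "co2e_kg"],
      "row_count_min": 1000,
      "bounds": {"co2e_kg": {"min": 0}},
      "verify_checksum": true
    }'::jsonb
WHERE dataset_id = 'ds-ghg-0001';
```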
3. Relationship to Domain-Specific Dataset Registries
The Dataset Registry is platform-wide and domain-agnostic.
Domain-specific registries (for example pef_dataset_registry) are specializations, not replacements.
Conceptual hierarchy:
```
dataset_registry
├─ pef_dataset_registry
├─ ghg_factor_dataset_registry
├─ scenario_dataset_registry
└─ financial_impact_dataset_registry
```
All domain registries inherit:
- provenance guarantees
- schema contracts
- integrity enforcement
- read policy semantics
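One possible shape for such a specialization, assuming domain registries reference the platform registry by dataset_id and add domain-specific metadata (the PEF columns below are illustrative, not the actual ZAYAZ definition):

```sql
-- Hypothetical specialization: a PEF registry that references the platform
-- registry by dataset_id and adds PEF-specific metadata. Column names here
-- are illustrative, not the actual ZAYAZ definition.
CREATE TABLE pef_dataset_registry (
    pef_dataset_id TEXT PRIMARY KEY,
    dataset_id     TEXT NOT NULL REFERENCES dataset_registry(dataset_id),
    pef_category   TEXT NOT NULL,   -- e.g. the product category the data covers
    pefcr_version  TEXT             -- applicable PEFCR version, if any
);
```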
4. Relationship to Rule Authoring Kit (RAK)
The Rule Authoring Kit (RAK) references datasets exclusively via the Dataset Registry.
Rules never embed raw dataset assumptions. Instead, they bind to:
- dataset_id
- allowed dataset versions
- schema versions
- declared usage purpose
This ensures PEF and OEF rules remain:
- portable
- auditable
- forward-compatible
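A sketch of what such a binding could look like if materialized as a table; the table and column names are hypothetical and serve only to show that rules point at registry entries rather than embedding data assumptions:

```sql
-- Hypothetical materialization of a RAK rule-to-dataset binding; the table
-- and column names are illustrative and only show that rules reference
-- registry entries instead of embedding raw dataset assumptions.
CREATE TABLE rule_dataset_bindings (
    rule_id          TEXT NOT NULL,
    dataset_id       TEXT NOT NULL REFERENCES dataset_registry(dataset_id),
    allowed_versions JSONB NOT NULL,   -- e.g. '["2024.1", "2024.2"]'
    schema_versions  JSONB NOT NULL,   -- e.g. '["1.0.0", "1.1.0"]'
    usage_purpose    TEXT NOT NULL,    -- e.g. 'pef_rule_execution'
    PRIMARY KEY (rule_id, dataset_id)
);
```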
5. Audit, Replay, and Regulatory Assurance
The Dataset Registry is a foundational dependency for:
- computation replay
- regulatory audits
- verifier inspections
- AI decision traceability
Combined with ALTD and DAL, it allows ZAYAZ to answer:
Exactly which data, schema, and version produced this reported number — and why.
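In practice this amounts to a lookup like the sketch below, where the dataset_id and version are taken from the audit trail of the computation being replayed (values illustrative):

```sql
-- Sketch of the replay lookup implied above: given the dataset_id and version
-- recorded in the audit trail, recover the exact contract that produced a
-- reported number (values illustrative).
SELECT dataset_id, version, schema_version, checksum, read_policy
FROM dataset_registry
WHERE dataset_id = 'ds-ghg-0001'
  AND version = '2024.1';
```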
Design Principle
If a dataset is not registered, it does not exist. If it is registered, it is computable, auditable, and defensible.
This principle underpins ZAYAZ’s compliance-grade data intelligence architecture.