Dataset Registry

1. Canonical Dataset Contracts, Provenance & Access Control

1.1. Purpose and Scope

The Dataset Registry (DR) is the canonical, platform-wide registry for all datasets used within the ZAYAZ ecosystem for computation, validation, modeling, and reporting.

While standards frameworks define what must be reported and signals define what is measured, the Dataset Registry defines:

which data may be used, under what structural contract, and with what provenance guarantees and access rules.

The Dataset Registry is a horizontal system primitive and applies across all domains, including:

  • CSRD / ESRS reporting
  • PEF / OEF rule execution
  • GHG factor calculations
  • Scenario modeling (NGFS, IAMs)
  • Financial impact and risk modeling
  • AI-assisted inference, replay, and audit

1.2. Conceptual Position in the Architecture

The Dataset Registry operates at the data artifact layer, distinct from frameworks, rules, and signals.

Layer | Registry
Governance & Obligations | standards_frameworks, standards_criteria
Semantic Measurement | SSSR, USO
Data Artifacts (this section) | Dataset Registry (DR)
Computation & Engines | MICE, NETZERO, SEM
Validation & Trust | DaVE, DICE, VTE
Audit & Replay | ALTD, DAL

The Dataset Registry ensures that every computation performed by ZAYAZ is reproducible, auditable, and contract-bound.

1.3. What Is a Dataset in ZAYAZ?

In ZAYAZ, a dataset is not just data.

A dataset is a governed artifact with:

  • explicit provenance
  • a versioned structural schema
  • declared integrity constraints
  • controlled read semantics
  • bounded validity in time

Datasets are treated as first-class artifacts, on equal footing with:

  • policy artifacts
  • rule definitions
  • AI models
  • evidence bundles

1.4. Core Responsibilities of the Dataset Registry

The Dataset Registry provides authoritative answers to:

  • What dataset is this, exactly?
  • Who provided it and under what license?
  • What is its structural schema and schema version?
  • How may it be read (snapshot, latest, time-bounded)?
  • What integrity rules must hold?
  • When is this dataset valid?
  • Can it be used for this computation or framework?

No dataset may be consumed by a computation engine, validator, or AI workflow unless it is registered and approved in the Dataset Registry.

1.5. Canonical Dataset Registry Schema

The Dataset Registry is represented by the dataset_registry table.

dataset_registry.sql
CREATE TABLE dataset_registry (
  dataset_id         TEXT PRIMARY KEY,
  dataset_key        TEXT UNIQUE NOT NULL,
  name               TEXT NOT NULL,
  provider           TEXT NOT NULL,
  version            TEXT NOT NULL,
  license_ref        TEXT NOT NULL,
  checksum           TEXT NOT NULL,
  valid_from         DATE NOT NULL,
  valid_to           DATE,
  column_schema      JSONB NOT NULL,
  schema_version     TEXT NOT NULL,
  primary_keys       JSONB NOT NULL,
  partitions         JSONB,
  read_policy        JSONB,
  integrity_rules    JSONB,
  update_frequency   TEXT,
  document_location  TEXT,
  short_description  TEXT NOT NULL,
  notes              TEXT
);
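A minimal registration sketch is shown below. All values (dataset key, provider, license reference, checksum, schema, paths) are hypothetical and serve only to illustrate how the JSONB contract fields are populated.

-- Hypothetical registration of an emission-factor dataset (illustrative values only).
INSERT INTO dataset_registry (
  dataset_id, dataset_key, name, provider, version, license_ref, checksum,
  valid_from, valid_to, column_schema, schema_version, primary_keys,
  partitions, read_policy, integrity_rules, update_frequency,
  document_location, short_description
) VALUES (
  'ds-0001',
  'ghg_factors_example',
  'Example GHG Emission Factors',
  'Example Provider',
  '2024.1',
  'lic-cc-by-4.0',
  'sha256:0000000000000000000000000000000000000000000000000000000000000000',
  '2024-01-01',
  NULL,
  '[{"name": "activity_code", "type": "text",    "nullable": false},
    {"name": "factor_value",  "type": "numeric", "nullable": false, "unit": "kgCO2e/kWh"},
    {"name": "region",        "type": "text",    "nullable": true}]'::jsonb,
  '1.0.0',
  '["activity_code", "region"]'::jsonb,
  '["region"]'::jsonb,
  '{"mode": "snapshot-only"}'::jsonb,
  '{"min_row_count": 100, "not_null": ["activity_code", "factor_value"]}'::jsonb,
  'annual',
  's3://example-bucket/ghg_factors_example/2024.1/',
  'Illustrative emission-factor dataset registration'
);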

2. Structural Schema Contract (column_schema)

2.1. The column_schema Structural Contract

The column_schema field defines the machine-enforceable structural contract of the dataset.

It specifies:

  • column names
  • data types
  • nullability
  • units of measure
  • optional semantic annotations

This schema is consumed by:

  • validation engines (DaVE, DICE)
  • micro-engines (MICE)
  • form generators (FOGE)
  • schema compatibility checks
  • computation replay and audit

Any schema change must increment schema_version and follow the “Change Management & Compatibility (CMCB)” policy.
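As a sketch (assuming column_schema is stored as a JSONB array of column objects, as in the registration example above, and that an additive optional column is treated as backward-compatible under CMCB), a compliant change could look like:

-- Hypothetical additive change: append an optional column and bump schema_version.
-- Whether this also requires a new dataset version is governed by the CMCB policy.
UPDATE dataset_registry
SET column_schema  = column_schema
                     || '[{"name": "data_quality_flag", "type": "text", "nullable": true}]'::jsonb,
    schema_version = '1.1.0'
WHERE dataset_key = 'ghg_factors_example';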

2.2. Read Policies and Deterministic Consumption

The read_policy field enforces deterministic dataset consumption.

Common patterns include:

  • snapshot-only – immutable historical reproduction
  • latest-partition – operational “current state” access
  • time-bounded – constrained to explicit validity windows

Runtime engines must comply with read policies to preserve:

  • auditability
  • reproducibility
  • regulatory defensibility
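The following sketch shows how a runtime engine might honor read_policy. It assumes a hypothetical data table ghg_factors_example_data with snapshot_id and observation_date columns; the actual storage layout is engine-specific.

-- Look up the declared read policy and validity window for the dataset.
SELECT read_policy->>'mode' AS read_mode,
       valid_from,
       valid_to
FROM dataset_registry
WHERE dataset_key = 'ghg_factors_example';

-- snapshot-only: read exactly the snapshot pinned by the computation being replayed.
SELECT *
FROM ghg_factors_example_data
WHERE snapshot_id = 'snap-2024-01-01';   -- hypothetical snapshot identifier

-- time-bounded: restrict reads to the registry's declared validity window.
SELECT d.*
FROM ghg_factors_example_data AS d
JOIN dataset_registry AS r ON r.dataset_key = 'ghg_factors_example'
WHERE d.observation_date >= r.valid_from
  AND (r.valid_to IS NULL OR d.observation_date <= r.valid_to);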

2.3. Integrity Rules and Trust Enforcement

integrity_rules define dataset-level guarantees such as:

  • not-null constraints
  • referential expectations
  • statistical bounds
  • row-count thresholds
  • checksum verification

These rules are evaluated during:

  • ingestion
  • validation
  • audit and replay

Violations propagate into trust scoring (DaVE / VTE) and audit logs (ALTD).
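Two of these checks are sketched below directly in SQL, again against the hypothetical ghg_factors_example_data table. In practice the checks would be generated from the integrity_rules JSONB rather than hard-coded.

-- Row-count threshold check, driven by min_row_count from integrity_rules.
SELECT COUNT(*) >= (SELECT (integrity_rules->>'min_row_count')::int
                    FROM dataset_registry
                    WHERE dataset_key = 'ghg_factors_example') AS row_count_ok
FROM ghg_factors_example_data;

-- Not-null check on a contractually required column.
SELECT COUNT(*) = 0 AS not_null_ok
FROM ghg_factors_example_data
WHERE factor_value IS NULL;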

3. Relationship to Domain-Specific Dataset Registries

The Dataset Registry is platform-wide and domain-agnostic.

Domain-specific registries (for example pef_dataset_registry) are specializations, not replacements.

Conceptual hierarchy:

dataset_registry
├─ pef_dataset_registry
├─ ghg_factor_dataset_registry
├─ scenario_dataset_registry
└─ financial_impact_dataset_registry

All domain registries inherit:

  • provenance guarantees
  • schema contracts
  • integrity enforcement
  • read policy semantics
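One way to express this specialization is by reference to the canonical table, as sketched below; the PEF-specific columns shown are illustrative assumptions, not the actual domain schema.

-- Hypothetical PEF specialization: adds domain attributes, inherits everything else by reference.
CREATE TABLE pef_dataset_registry (
  dataset_id     TEXT PRIMARY KEY REFERENCES dataset_registry (dataset_id),
  pef_category   TEXT NOT NULL,   -- illustrative domain attribute
  impact_method  TEXT NOT NULL,   -- illustrative domain attribute
  notes          TEXT
);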

4. Relationship to Rule Authoring Kit (RAK)

The Rule Authoring Kit (RAK) references datasets exclusively via the Dataset Registry.

Rules never embed raw dataset assumptions. Instead, they bind to:

  • dataset_id
  • allowed dataset versions
  • schema versions
  • declared usage purpose

This ensures PEF and OEF rules remain:

  • portable
  • auditable
  • forward-compatible
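A minimal sketch of such a binding is shown below, assuming a hypothetical rule_dataset_binding table; the real RAK binding model may differ.

-- Hypothetical binding table: rules reference datasets only through the registry.
CREATE TABLE rule_dataset_binding (
  rule_id           TEXT NOT NULL,
  dataset_id        TEXT NOT NULL REFERENCES dataset_registry (dataset_id),
  allowed_versions  JSONB NOT NULL,   -- e.g. ["2024.1", "2024.2"]
  schema_version    TEXT NOT NULL,    -- schema contract the rule was authored against
  usage_purpose     TEXT NOT NULL,    -- e.g. 'pef_rule_execution'
  PRIMARY KEY (rule_id, dataset_id)
);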

5. Audit, Replay, and Regulatory Assurance

The Dataset Registry is a foundational dependency for:

  • computation replay
  • regulatory audits
  • verifier inspections
  • AI decision traceability

Combined with ALTD and DAL, it allows ZAYAZ to answer:

Exactly which data, schema, and version produced this reported number — and why.
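A sketch of that lookup, assuming a hypothetical computation_log table (standing in for the ALTD/DAL record) that pins the dataset used by each run:

-- Which data, schema, and version produced computation 'run-42'? (table and run id are hypothetical)
SELECT c.computation_id,
       r.dataset_key,
       r.version,
       r.schema_version,
       r.checksum
FROM computation_log AS c
JOIN dataset_registry AS r ON r.dataset_id = c.dataset_id
WHERE c.computation_id = 'run-42';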

Design Principle

If a dataset is not registered, it does not exist. If it is registered, it is computable, auditable, and defensible.

This principle underpins ZAYAZ’s compliance-grade data intelligence architecture.


