Dataset Registry

1. Canonical Dataset Contracts, Provenance & Access Control

1.1. Purpose and Scope

The Dataset Registry (DR) is the canonical, platform-wide registry for all datasets used within the ZAYAZ ecosystem for computation, validation, modeling, and reporting.

While standards frameworks define what must be reported and signals define what is measured, the Dataset Registry defines:

which data may be used, under what structural contract, with what provenance guarantees, and under which access rules.

The Dataset Registry is a horizontal system primitive and applies across all domains, including:

  • CSRD / ESRS reporting
  • PEF / OEF rule execution
  • GHG factor calculations
  • Scenario modeling (NGFS, IAMs)
  • Financial impact and risk modeling
  • AI-assisted inference, replay, and audit

1.2. Conceptual Position in the Architecture

The Dataset Registry operates at the data artifact layer, distinct from frameworks, rules, and signals.

Layer | Registry
Governance & Obligations | standards_frameworks, standards_criteria
Semantic Measurement | SSSR, USO
Data Artifacts (this section) | Dataset Registry (DR)
Computation & Engines | MICE, NETZERO, SEM
Validation & Trust | DaVE, DICE, VTE
Audit & Replay | ALTD, DAL

The Dataset Registry ensures that every computation performed by ZAYAZ is reproducible, auditable, and contract-bound.

1.3. What Is a Dataset in ZAYAZ?

In ZAYAZ, a dataset is not just data.

A dataset is a governed artifact with:

  • explicit provenance
  • a versioned structural schema
  • declared integrity constraints
  • controlled read semantics
  • bounded validity in time

Datasets are treated as first-class artifacts, on equal footing with:

  • policy artifacts
  • rule definitions
  • AI models
  • evidence bundles

1.4. Core Responsibilities of the Dataset Registry

The Dataset Registry provides authoritative answers to:

  • What dataset is this, exactly?
  • Who provided it and under what license?
  • What is its structural schema and schema version?
  • How may it be read (snapshot, latest, time-bounded)?
  • What integrity rules must hold?
  • When is this dataset valid?
  • Can it be used for this computation or framework?

No dataset may be consumed by a computation engine, validator, or AI workflow unless it is registered and approved in the Dataset Registry.
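This registration gate can be sketched in Python. The class and method names here (DatasetRegistry, resolve) are illustrative only, not the actual ZAYAZ API; the point is that engines resolve through the registry and unregistered or unapproved datasets fail fast.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    dataset_id: str
    version: str
    approved: bool


class DatasetRegistry:
    """Illustrative in-memory registry enforcing the registration gate."""

    def __init__(self) -> None:
        self._records: dict[tuple[str, str], DatasetRecord] = {}

    def register(self, record: DatasetRecord) -> None:
        self._records[(record.dataset_id, record.version)] = record

    def resolve(self, dataset_id: str, version: str) -> DatasetRecord:
        # Engines call this before touching any data: unregistered or
        # unapproved datasets raise instead of resolving.
        record = self._records.get((dataset_id, version))
        if record is None:
            raise LookupError(f"dataset {dataset_id}@{version} is not registered")
        if not record.approved:
            raise PermissionError(f"dataset {dataset_id}@{version} is not approved")
        return record


registry = DatasetRegistry()
registry.register(DatasetRecord("ds-ipcc-efdb", "1.0.0", approved=True))
record = registry.resolve("ds-ipcc-efdb", "1.0.0")
```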

1.5. Canonical Dataset Registry Schema

The Dataset Registry is represented by the dataset_registry table.

Input Table Structure

Source file: compute_executions.xlsx
  • caller_ip (TEXT): Source IP for security / anomaly analysis. Examples: 192.168.14.52, 15.188.110.23
  • created_at (TEXT): Execution timestamp (UTC). Examples: 2025-02-01T10:14:33Z, 2025-02-01T10:14:34Z
  • dataset_hashes (TEXT): Array/map of dataset artifact hashes used (e.g., ['IPCC-EFDB:sha256-...', 'IEA_SSP:sha256-...']). Examples: IPCC-EFDB:sha256-a123bc9f1d92f12398fabc77bb38d99f2c91afbd5311a37d, IEA_SSP:sha256-98231acd19fa0287fb29c8ff093a772cd44aa2bc1faff391, UN_FAO_WATER:sha256-71ab3cd19fa0aa77fb29c8ff093a772cd44aa2135d1ac191
  • error_code (TEXT): Machine-readable error code if status='error' (e.g., SCHEMA_VALIDATION_FAILED). Examples: null, SCHEMA_VALIDATION_FAILED
  • exec_id (string, primary key): Globally unique execution ID (ULID). Examples: 01J9Z9K4M2F7W7N4Q7ZJ9V9XGF, 01J9Z9KCN4AMJ5T2W6S5XYF4E9
  • inputs_hash (TEXT): Cryptographic hash (e.g., sha256:...) of the normalized input payload. Examples: sha256:3fa42eab87cbdd29d8fd0c9a8bd1cb6a543920d8e16c08c28a1c78c9231e4b92, sha256:a128ab769e12ba91bb221b0a8afbd5311a37d4a0e366620eefb4e32976cf81aa
  • latency_ms (TEXT): End-to-end compute latency in milliseconds. Examples: 143, 212
  • method_id (TEXT): Method ID (e.g., MID-00001). FK part 1 → ref_compute_method_registry.method_id. Examples: MID-00001, MID-00042
  • options_hash (TEXT): Cryptographic hash of the normalized options payload. Examples: sha256:92f8aa769e12b2a3bfb2f2c91afbd5311a37d4a0e366620eefb4e32976c029d2, sha256:81aeaa769e12f89dfbb2f2c91afbd5311a37d4a0e366620eefb4e32976acc2b0
  • output_hash (TEXT): Cryptographic hash of the normalized output payload; optional if execution failed pre-output. Examples: sha256:4cd920fa8912a4627bb9eab32c5fdd314ff2a119e80cc291eabe927caae123ef, sha256:f12920fa8912ff111bb9eab32c5fdd314ff2a119e80cc291eabe927caac44ab1
  • provenance_id (TEXT): Correlation ID returned to the caller; links UI/API to this audit row. Examples: prov-7f892b1a3cdd4e8f8fba2e, prov-c102aa39efdd4694948df2
  • region (TEXT): Region tag where compute ran (e.g., eu-north-1). Examples: eu-north-1, eu-central-1
  • status (TEXT): Final outcome; ok | error. Examples: ok, error
  • storage_ref (TEXT): Pointer to sealed blob with full input/output snapshots if retained. Examples: s3://zayaz-computations/exec/01J9Z9K4M2F7W7N4Q7ZJ9V9XGF/input_output.json, s3://zayaz-computations/exec/01J9Z9KCN4AMJ5T2W6S5XYF4E9/input_output.json
  • tenant_id (TEXT): Tenant/customer identifier (used for multi-tenant isolation and audit). Examples: eco-197-123-456-789, eco-044-001-882-331
  • version (TEXT): Exact version used (e.g., 1.0.0). FK part 2 → ref_compute_method_registry.version. Examples: 1.1.2, 1.0.0
Family: audit, classification, code, geo, id, metric
Used by engines: ALTD, DAVE, DICE, MICE, POSTH, SEM, ZARA, ZSSR
Used by modules: CHUB, RIHUB, SIS, ZARA
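The hash-valued signals above (inputs_hash, options_hash, output_hash, dataset_hashes) imply a canonical-serialization step before hashing. A minimal sketch, assuming canonical JSON (sorted keys, compact separators) as the normalization; the actual normalization ZAYAZ applies is not specified here:

```python
import hashlib
import json


def normalized_hash(payload: dict) -> str:
    # Canonical JSON makes the digest independent of key order and whitespace,
    # so the same logical payload always yields the same hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# An inputs_hash over a normalized input payload (values are illustrative).
inputs_hash = normalized_hash({"country": "SE", "year": 2025})

# A dataset_hashes entry pairs a dataset key with its artifact digest.
dataset_hashes = [
    "IPCC-EFDB:" + normalized_hash({"artifact": "IPCC-EFDB", "version": "1.0.0"})
]
```

Because the serialization is canonical, two callers submitting the same logical payload produce identical hashes, which is what makes replay comparison meaningful.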
dataset_registry.sql
CREATE TABLE dataset_registry (
    dataset_id        TEXT PRIMARY KEY,
    dataset_key       TEXT UNIQUE NOT NULL,
    name              TEXT NOT NULL,
    provider          TEXT NOT NULL,
    version           TEXT NOT NULL,
    license_ref       TEXT NOT NULL,
    checksum          TEXT NOT NULL,
    valid_from        DATE NOT NULL,
    valid_to          DATE,
    column_schema     JSONB NOT NULL,
    schema_version    TEXT NOT NULL,
    primary_keys      JSONB NOT NULL,
    partitions        JSONB,
    read_policy       JSONB,
    integrity_rules   JSONB,
    update_frequency  TEXT,
    document_location TEXT,
    short_description TEXT NOT NULL,
    notes             TEXT
);
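As a quick structural check, the DDL can be exercised against SQLite, whose flexible typing accepts the JSONB and DATE declarations (production would presumably target PostgreSQL). All row values below are invented for illustration:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Same column list as the dataset_registry DDL above.
conn.execute("""
CREATE TABLE dataset_registry (
    dataset_id        TEXT PRIMARY KEY,
    dataset_key       TEXT UNIQUE NOT NULL,
    name              TEXT NOT NULL,
    provider          TEXT NOT NULL,
    version           TEXT NOT NULL,
    license_ref       TEXT NOT NULL,
    checksum          TEXT NOT NULL,
    valid_from        DATE NOT NULL,
    valid_to          DATE,
    column_schema     JSONB NOT NULL,
    schema_version    TEXT NOT NULL,
    primary_keys      JSONB NOT NULL,
    partitions        JSONB,
    read_policy       JSONB,
    integrity_rules   JSONB,
    update_frequency  TEXT,
    document_location TEXT,
    short_description TEXT NOT NULL,
    notes             TEXT
)""")

# JSONB columns are stored as serialized JSON strings in this sketch.
conn.execute(
    "INSERT INTO dataset_registry (dataset_id, dataset_key, name, provider, "
    "version, license_ref, checksum, valid_from, column_schema, schema_version, "
    "primary_keys, short_description) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    (
        "ds-0001",
        "ipcc_efdb",
        "IPCC Emission Factor Database",
        "IPCC",
        "1.0.0",
        "lic-ipcc",
        "sha256:deadbeef",
        "2025-01-01",
        json.dumps([{"name": "factor", "type": "number", "nullable": False}]),
        "1",
        json.dumps(["factor_id"]),
        "Illustrative registry row.",
    ),
)

name, version = conn.execute(
    "SELECT name, version FROM dataset_registry WHERE dataset_key = ?",
    ("ipcc_efdb",),
).fetchone()
```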

2. Structural Schema Contract (column_schema)

2.1. Purpose of column_schema

The column_schema field defines the machine-enforceable structural contract of the dataset.

It specifies:

  • column names
  • data types
  • nullability
  • units of measure
  • optional semantic annotations

This schema is consumed by:

  • validation engines (DaVE, DICE)
  • micro-engines (MICE)
  • form generators (FOGE)
  • schema compatibility checks
  • computation replay and audit

Any schema change must increment schema_version and follow the “Change Management & Compatibility (CMCB)” policy.
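A minimal sketch of how such a contract might be enforced on a single record. The schema layout here (name/type/nullable/unit keys) is an assumption for illustration, not the registry's documented JSON shape:

```python
# Assumed column_schema layout; ZAYAZ's actual JSON shape may differ.
COLUMN_SCHEMA = [
    {"name": "country", "type": "text", "nullable": False},
    {"name": "emission_factor", "type": "number", "nullable": False,
     "unit": "kgCO2e/kWh"},
    {"name": "notes", "type": "text", "nullable": True},
]

PY_TYPES = {"text": str, "number": (int, float)}


def validate_row(row: dict, schema: list) -> list:
    """Return contract violations for one record (empty list = conformant)."""
    errors = []
    declared = {col["name"] for col in schema}
    for col in schema:
        value = row.get(col["name"])
        if value is None:
            if not col["nullable"]:
                errors.append(f"{col['name']}: null not allowed")
        elif not isinstance(value, PY_TYPES[col["type"]]):
            errors.append(f"{col['name']}: expected {col['type']}")
    # Columns outside the declared contract are also violations.
    errors.extend(
        f"{name}: not declared in schema" for name in sorted(set(row) - declared)
    )
    return errors


ok = validate_row({"country": "SE", "emission_factor": 0.012, "notes": None},
                  COLUMN_SCHEMA)
bad = validate_row({"country": None, "emission_factor": "high"}, COLUMN_SCHEMA)
```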

2.2. Read Policies and Deterministic Consumption

The read_policy field enforces deterministic dataset consumption.

Common patterns include:

  • snapshot-only – immutable historical reproduction
  • latest-partition – operational “current state” access
  • time-bounded – constrained to explicit validity windows

Runtime engines must comply with read policies to preserve:

  • auditability
  • reproducibility
  • regulatory defensibility
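The three patterns above can be sketched as a single resolution function. The policy keys and month-labelled partitions are assumptions for illustration:

```python
def resolve_partition(policy: dict, partitions: list) -> str:
    """Pick the single partition an engine may read under the given policy."""
    mode = policy["mode"]
    if mode == "snapshot-only":
        # Replays pin an explicit, immutable snapshot.
        return policy["snapshot"]
    if mode == "latest-partition":
        # Operational "current state": newest available partition.
        return max(partitions)
    if mode == "time-bounded":
        # Only partitions inside the declared validity window are eligible.
        eligible = [p for p in partitions if policy["from"] <= p <= policy["to"]]
        if not eligible:
            raise ValueError("no partition inside the validity window")
        return max(eligible)
    raise ValueError(f"unknown read policy: {mode}")


parts = ["2024-10", "2024-11", "2024-12", "2025-01"]
latest = resolve_partition({"mode": "latest-partition"}, parts)
bounded = resolve_partition(
    {"mode": "time-bounded", "from": "2024-01", "to": "2024-12"}, parts
)
```

Making the resolution a pure function of policy plus partition list is what keeps reads deterministic: the same registry state always yields the same read.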

2.3. Integrity Rules and Trust Enforcement

integrity_rules define dataset-level guarantees such as:

  • not-null constraints
  • referential expectations
  • statistical bounds
  • row-count thresholds
  • checksum verification

These rules are evaluated during:

  • ingestion
  • validation
  • audit and replay

Violations propagate into trust scoring (DaVE / VTE) and audit logs (ALTD).
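A sketch of rule evaluation over in-memory rows. The rule names here (min_row_count, not_null, bounds) are illustrative, not the registry's actual vocabulary:

```python
def check_integrity(rows: list, rules: dict) -> list:
    """Return violation messages; an empty list means the rules hold."""
    violations = []
    if len(rows) < rules.get("min_row_count", 0):
        violations.append("row count below threshold")
    for col in rules.get("not_null", []):
        if any(row.get(col) is None for row in rows):
            violations.append(f"{col} contains nulls")
    for col, (low, high) in rules.get("bounds", {}).items():
        if any(not (low <= row[col] <= high) for row in rows):
            violations.append(f"{col} outside [{low}, {high}]")
    return violations


rows = [{"factor": 0.4}, {"factor": 1.7}]
rules = {
    "min_row_count": 10,
    "not_null": ["factor"],
    "bounds": {"factor": (0.0, 1.0)},
}
violations = check_integrity(rows, rules)
```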

3. Relationship to Domain-Specific Dataset Registries

The Dataset Registry is platform-wide and domain-agnostic.

Domain-specific registries (for example pef_dataset_registry) are specializations, not replacements.

Conceptual hierarchy:

dataset_registry
├─ pef_dataset_registry
├─ ghg_factor_dataset_registry
├─ scenario_dataset_registry
└─ financial_impact_dataset_registry

All domain registries inherit:

  • provenance guarantees
  • schema contracts
  • integrity enforcement
  • read policy semantics

4. Relationship to Rule Authoring Kit (RAK)

The Rule Authoring Kit (RAK) references datasets exclusively via the Dataset Registry.

Rules never embed raw dataset assumptions. Instead, they bind to:

  • dataset_id
  • allowed dataset versions
  • schema versions
  • declared usage purpose

This ensures PEF and OEF rules remain:

  • portable
  • auditable
  • forward-compatible
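A rule binding and its compatibility check might look like the following; all field names and values are illustrative:

```python
# Hypothetical RAK binding: the rule names the dataset, the versions it
# tolerates, the schema version it was authored against, and its purpose.
RULE_BINDING = {
    "dataset_id": "ds-pef-secondary",
    "allowed_versions": ["2.1.0", "2.2.0"],
    "schema_version": "3",
    "purpose": "pef_rule_execution",
}


def binding_satisfied(binding: dict, entry: dict) -> bool:
    """True when a registry entry matches the rule's declared binding."""
    return (
        entry["dataset_id"] == binding["dataset_id"]
        and entry["version"] in binding["allowed_versions"]
        and entry["schema_version"] == binding["schema_version"]
    )


entry = {"dataset_id": "ds-pef-secondary", "version": "2.2.0",
         "schema_version": "3"}
compatible = binding_satisfied(RULE_BINDING, entry)
```

Because the rule carries only identifiers and version constraints, a dataset can be re-released without touching the rule, as long as the new release still satisfies the binding.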

5. Audit, Replay, and Regulatory Assurance

The Dataset Registry is a foundational dependency for:

  • computation replay
  • regulatory audits
  • verifier inspections
  • AI decision traceability

Combined with ALTD and DAL, it allows ZAYAZ to answer:

Exactly which data, schema, and version produced this reported number — and why.

Design Principle

If a dataset is not registered, it does not exist. If it is registered, it is computable, auditable, and defensible.

This principle underpins ZAYAZ’s compliance-grade data intelligence architecture.


