Dataset Registry

1. Canonical Dataset Contracts, Provenance & Access Control

1.1. Purpose and Scope

The Dataset Registry (DR) is the canonical, platform-wide registry for all datasets used within the ZAYAZ ecosystem for computation, validation, modeling, and reporting.

While standards frameworks define what must be reported and signals define what is measured, the Dataset Registry defines:

which data may be used, under what structural contract, and with what provenance guarantees and access rules.

The Dataset Registry is a horizontal system primitive and applies across all domains, including:

  • CSRD / ESRS reporting
  • PEF / OEF rule execution
  • GHG factor calculations
  • Scenario modeling (NGFS, IAMs)
  • Financial impact and risk modeling
  • AI-assisted inference, replay, and audit

1.2. Conceptual Position in the Architecture

The Dataset Registry operates at the data artifact layer, distinct from frameworks, rules, and signals.

Layer | Registry
Governance & Obligations | standards_frameworks, standards_criteria
Semantic Measurement | SSSR, USO
Data Artifacts (this section) | Dataset Registry (DR)
Computation & Engines | MICE, NETZERO, SEM
Validation & Trust | DaVE, DICE, VTE
Audit & Replay | ALTD, DAL

The Dataset Registry ensures that every computation performed by ZAYAZ is reproducible, auditable, and contract-bound.

1.3. What Is a Dataset in ZAYAZ?

In ZAYAZ, a dataset is not just data.

A dataset is a governed artifact with:

  • explicit provenance
  • a versioned structural schema
  • declared integrity constraints
  • controlled read semantics
  • bounded validity in time

Datasets are treated as first-class artifacts, on equal footing with:

  • policy artifacts
  • rule definitions
  • AI models
  • evidence bundles

1.4. Core Responsibilities of the Dataset Registry

The Dataset Registry provides authoritative answers to:

  • What dataset is this, exactly?
  • Who provided it and under what license?
  • What is its structural schema and schema version?
  • How may it be read (snapshot, latest, time-bounded)?
  • What integrity rules must hold?
  • When is this dataset valid?
  • Can it be used for this computation or framework?

No dataset may be consumed by a computation engine, validator, or AI workflow unless it is registered and approved in the Dataset Registry.

1.5. Canonical Dataset Registry Schema

The Dataset Registry is represented by the dataset_registry table.

dataset_registry.sql
CREATE TABLE dataset_registry (
  dataset_id         TEXT PRIMARY KEY,
  dataset_key        TEXT UNIQUE NOT NULL,
  name               TEXT NOT NULL,
  provider           TEXT NOT NULL,
  version            TEXT NOT NULL,
  license_ref        TEXT NOT NULL,
  checksum           TEXT NOT NULL,
  valid_from         DATE NOT NULL,
  valid_to           DATE,
  column_schema      JSONB NOT NULL,
  schema_version     TEXT NOT NULL,
  primary_keys       JSONB NOT NULL,
  partitions         JSONB,
  read_policy        JSONB,
  integrity_rules    JSONB,
  update_frequency   TEXT,
  document_location  TEXT,
  short_description  TEXT NOT NULL,
  notes              TEXT
);
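A minimal registration sketch is shown below. All values (dataset key, provider, license reference, checksum, schema, paths) are hypothetical and serve only to illustrate how the JSONB contract fields are populated.

-- Hypothetical registration of an emission-factor dataset (illustrative values only).
INSERT INTO dataset_registry (
  dataset_id, dataset_key, name, provider, version, license_ref, checksum,
  valid_from, valid_to, column_schema, schema_version, primary_keys,
  partitions, read_policy, integrity_rules, update_frequency,
  document_location, short_description
) VALUES (
  'ds-0001',
  'ghg_factors_example',
  'Example GHG Emission Factors',
  'Example Provider',
  '2024.1',
  'lic-cc-by-4.0',
  'sha256:0000000000000000000000000000000000000000000000000000000000000000',
  '2024-01-01',
  NULL,
  '[{"name": "activity_code", "type": "text",    "nullable": false},
    {"name": "factor_value",  "type": "numeric", "nullable": false, "unit": "kgCO2e/kWh"},
    {"name": "region",        "type": "text",    "nullable": true}]'::jsonb,
  '1.0.0',
  '["activity_code", "region"]'::jsonb,
  '["region"]'::jsonb,
  '{"mode": "snapshot-only"}'::jsonb,
  '{"min_row_count": 100, "not_null": ["activity_code", "factor_value"]}'::jsonb,
  'annual',
  's3://example-bucket/ghg_factors_example/2024.1/',
  'Illustrative emission-factor dataset registration'
);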

2. Structural Schema Contract (column_schema)

2.1. The column_schema Structural Contract

The column_schema field defines the machine-enforceable structural contract of the dataset.

It specifies:

  • column names
  • data types
  • nullability
  • units of measure
  • optional semantic annotations

This schema is consumed by:

  • validation engines (DaVE, DICE)
  • micro-engines (MICE)
  • form generators (FOGE)
  • schema compatibility checks
  • computation replay and audit

Any schema change must increment schema_version and follow the “Change Management & Compatibility (CMCB)” policy.
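As a sketch (assuming column_schema is stored as a JSONB array of column objects, as in the registration example above, and that an additive optional column is treated as backward-compatible under CMCB), a compliant change could look like:

-- Hypothetical additive change: append an optional column and bump schema_version.
-- Whether this also requires a new dataset version is governed by the CMCB policy.
UPDATE dataset_registry
SET column_schema  = column_schema
                     || '[{"name": "data_quality_flag", "type": "text", "nullable": true}]'::jsonb,
    schema_version = '1.1.0'
WHERE dataset_key = 'ghg_factors_example';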

2.2. Read Policies and Deterministic Consumption

The read_policy field enforces deterministic dataset consumption.

Common patterns include:

  • snapshot-only – immutable historical reproduction
  • latest-partition – operational “current state” access
  • time-bounded – constrained to explicit validity windows

Runtime engines must comply with read policies to preserve:

  • auditability
  • reproducibility
  • regulatory defensibility
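The following sketch shows how a runtime engine might honor read_policy. It assumes a hypothetical data table ghg_factors_example_data with snapshot_id and observation_date columns; the actual storage layout is engine-specific.

-- Look up the declared read policy and validity window for the dataset.
SELECT read_policy->>'mode' AS read_mode,
       valid_from,
       valid_to
FROM dataset_registry
WHERE dataset_key = 'ghg_factors_example';

-- snapshot-only: read exactly the snapshot pinned by the computation being replayed.
SELECT *
FROM ghg_factors_example_data
WHERE snapshot_id = 'snap-2024-01-01';   -- hypothetical snapshot identifier

-- time-bounded: restrict reads to the registry's declared validity window.
SELECT d.*
FROM ghg_factors_example_data AS d
JOIN dataset_registry AS r ON r.dataset_key = 'ghg_factors_example'
WHERE d.observation_date >= r.valid_from
  AND (r.valid_to IS NULL OR d.observation_date <= r.valid_to);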

2.3. Integrity Rules and Trust Enforcement

integrity_rules define dataset-level guarantees such as:

  • not-null constraints
  • referential expectations
  • statistical bounds
  • row-count thresholds
  • checksum verification

These rules are evaluated during:

  • ingestion
  • validation
  • audit and replay

Violations propagate into trust scoring (DaVE / VTE) and audit logs (ALTD).
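Two of these checks are sketched below directly in SQL, again against the hypothetical ghg_factors_example_data table. In practice the checks would be generated from the integrity_rules JSONB rather than hard-coded.

-- Row-count threshold check, driven by min_row_count from integrity_rules.
SELECT COUNT(*) >= (SELECT (integrity_rules->>'min_row_count')::int
                    FROM dataset_registry
                    WHERE dataset_key = 'ghg_factors_example') AS row_count_ok
FROM ghg_factors_example_data;

-- Not-null check on a contractually required column.
SELECT COUNT(*) = 0 AS not_null_ok
FROM ghg_factors_example_data
WHERE factor_value IS NULL;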

3. Relationship to Domain-Specific Dataset Registries

The Dataset Registry is platform-wide and domain-agnostic.

Domain-specific registries (for example pef_dataset_registry) are specializations, not replacements.

Conceptual hierarchy:

dataset_registry
├─ pef_dataset_registry
├─ ghg_factor_dataset_registry
├─ scenario_dataset_registry
└─ financial_impact_dataset_registry

All domain registries inherit:

  • provenance guarantees
  • schema contracts
  • integrity enforcement
  • read policy semantics
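One way to express this specialization is by reference to the canonical table, as sketched below; the PEF-specific columns shown are illustrative assumptions, not the actual domain schema.

-- Hypothetical PEF specialization: adds domain attributes, inherits everything else by reference.
CREATE TABLE pef_dataset_registry (
  dataset_id     TEXT PRIMARY KEY REFERENCES dataset_registry (dataset_id),
  pef_category   TEXT NOT NULL,   -- illustrative domain attribute
  impact_method  TEXT NOT NULL,   -- illustrative domain attribute
  notes          TEXT
);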

4. Relationship to Rule Authoring Kit (RAK)

The Rule Authoring Kit (RAK) references datasets exclusively via the Dataset Registry.

Rules never embed raw dataset assumptions. Instead, they bind to:

  • dataset_id
  • allowed dataset versions
  • schema versions
  • declared usage purpose

This ensures PEF and OEF rules remain:

  • portable
  • auditable
  • forward-compatible
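A minimal sketch of such a binding is shown below, assuming a hypothetical rule_dataset_binding table; the real RAK binding model may differ.

-- Hypothetical binding table: rules reference datasets only through the registry.
CREATE TABLE rule_dataset_binding (
  rule_id           TEXT NOT NULL,
  dataset_id        TEXT NOT NULL REFERENCES dataset_registry (dataset_id),
  allowed_versions  JSONB NOT NULL,   -- e.g. ["2024.1", "2024.2"]
  schema_version    TEXT NOT NULL,    -- schema contract the rule was authored against
  usage_purpose     TEXT NOT NULL,    -- e.g. 'pef_rule_execution'
  PRIMARY KEY (rule_id, dataset_id)
);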

5. Audit, Replay, and Regulatory Assurance

The Dataset Registry is a foundational dependency for:

  • computation replay
  • regulatory audits
  • verifier inspections
  • AI decision traceability

Combined with ALTD and DAL, it allows ZAYAZ to answer:

Exactly which data, schema, and version produced this reported number — and why.
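A sketch of that lookup, assuming a hypothetical computation_log table (standing in for the ALTD/DAL record) that pins the dataset used by each run:

-- Which data, schema, and version produced computation 'run-42'? (table and run id are hypothetical)
SELECT c.computation_id,
       r.dataset_key,
       r.version,
       r.schema_version,
       r.checksum
FROM computation_log AS c
JOIN dataset_registry AS r ON r.dataset_id = c.dataset_id
WHERE c.computation_id = 'run-42';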

Design Principle

If a dataset is not registered, it does not exist. If it is registered, it is computable, auditable, and defensible.

This principle underpins ZAYAZ’s compliance-grade data intelligence architecture.


