ZAR-FW
ZAYAZ Artifact Registry Framework
1. Introduction
The ZAYAZ Artifact Registry (ZAR) is the foundational persistence, lineage, and governance layer of the ZAYAZ platform.
It serves as the canonical system of record for all computational artifacts, enabling full traceability across the ESG data lifecycle—from raw inputs to validated disclosures.
ZAR is not a traditional registry. It is a deterministic, intelligence-aware infrastructure that connects:
- data (signals)
- computation (engines, models, rulesets)
- governance (validation, assurance, audit)
into a single, unified system of truth.
2. Role Within the ZAYAZ Architecture
ZAR operates as a core pillar of the Shared Intelligence Stack (SIS) and ensures coherence across all platform modules.
It integrates directly with:
- SSSR (Smart Searchable Signal Registry) → semantic definition of signals
- USO (Universal Signal Ontology) → runtime lineage and instance tracking
- ZSSR (Smart System Router) → routing and orchestration
- ZARA / ZAAM → AI-driven reasoning and interaction
- Verification & Assurance (VERA) → trust, validation, and audit workflows
Together, these systems form a closed-loop ESG intelligence architecture where every data point is:
Defined → Produced → Validated → Traced → Explained
3. The ZAYAZ Identifier System
At the core of ZAR lies the Canonical Identifier Architecture (CIA), which ensures that every element in the system is uniquely and immutably identifiable.
ZAYAZ distinguishes between three identity layers:
| Layer | Identifier | Purpose |
|---|---|---|
| Instance | USO ID | Identifies a specific occurrence of a signal |
| Type | CSI | Defines the semantic meaning of the signal |
| Artifact | CMI | Identifies the component that produced or processed the signal |
These identifiers operate at three distinct abstraction levels: CSI (type), CMI (artifact), and USO (instance). Together they form the ZAYAZ Identifier Trinity, which enables full traceability across the entire lifecycle of ESG data.
4. Canonical Signal Identifiers (CSI)
The Canonical Signal Identifier (CSI) defines the semantic identity of a signal and is governed within the SSSR.
Each signal field within the platform is assigned a CSI, making it:
- discoverable
- comparable
- auditable
- reusable across modules
CSI is governed within the SSSR and referenced by ZAR, but not owned by it.
CSI Structure
<MODULE_CODE>.<COMPONENT_ID>.<KIND>.<NAME>.v<MAJOR>_<MINOR>
Key Concepts
- MODULE_CODE represents the top-level ZAYAZ module (e.g. comp, vera, inpt)
- COMPONENT_ID corresponds to the frontmatter ID defined in the ZAYAZ manual
- KIND defines the role of the signal (e.g. INPUT, OUTPUT, METRIC)
- NAME is the canonical semantic identifier and is identical to the signal_name in the signal_registry. (The signal_name is a curated version of the column name.)
- VERSION tracks the evolution of the signal’s meaning
Example
comp.PEF-ME.OUTPUT.CO2E.v1_0
vera.TG-CORE.OUTPUT.TRUST_SCORE.v1_0
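To make the segment layout concrete, the following is a minimal Python sketch that splits a CSI string into its five segments. The `CSI` tuple and the `split_csi` helper are hypothetical names introduced here for illustration; they are not part of the ZAYAZ codebase, and the sketch checks structure only, not registry membership.

```python
from typing import NamedTuple


class CSI(NamedTuple):
    """Decomposed CSI (hypothetical helper type, for illustration only)."""
    module_code: str
    component_id: str
    kind: str
    name: str
    major: int
    minor: int


def split_csi(csi: str) -> CSI:
    """Split '<MODULE_CODE>.<COMPONENT_ID>.<KIND>.<NAME>.v<MAJOR>_<MINOR>'."""
    parts = csi.split(".")
    if len(parts) != 5:
        raise ValueError(f"expected 5 segments, got {len(parts)}: {csi!r}")
    module_code, component_id, kind, name, version = parts
    if not version.startswith("v"):
        raise ValueError(f"version must start with 'v': {version!r}")
    numbers = version[1:].split("_")
    # Exactly two all-digit parts; this rejects patch-style versions like v1_0_1.
    if len(numbers) != 2 or not all(n.isdigit() for n in numbers):
        raise ValueError(f"version must look like v<MAJOR>_<MINOR>: {version!r}")
    return CSI(module_code, component_id, kind, name,
               int(numbers[0]), int(numbers[1]))


print(split_csi("comp.PEF-ME.OUTPUT.CO2E.v1_0"))
```

Because NAME may contain underscores but no dots, splitting on "." is sufficient to separate the segments.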
Module Codes
| Module | Code |
|---|---|
| Input Hub | inpt |
| Computation Hub | comp |
| Reports & Insights | repo |
| SIS | siss |
| ZARA | zara |
| ZAAM | zaam |
| Risk (RIF) | risk |
| NetZero | netz |
| Verification & Assurance | vera |
| SEEL | seel |
| EcoWorld Academy | acad |
Design Principle — Documentation-Linked Identity
A core design principle of ZAYAZ is that:
Every CSI is directly resolvable to its originating component specification.
By aligning COMPONENT_ID with frontmatter IDs:
- auditors can trace signals directly to documented logic
- ZARA can explain how values are produced
- developers maintain a single source of truth
5. What ZAR Registers
ZAR maintains a governed catalog of all artifacts within the platform, including:
- computation engines (micro-engines, pipelines)
- schemas and data contracts
- AI models and feature generators
- routing rulesets and orchestration logic
- validation and assurance modules
Each artifact is assigned a Canonical Managed Identifier (CMI) and a ZAR Code, enabling:
- deterministic lineage tracking
- reproducible computations
- audit-ready traceability
6. End-to-End Traceability
ZAR enables full traceability across five stages:
- Input → data is ingested via Input Hub (inpt)
- Processing → computed through engines in Computation Hub (comp)
- Validation → evaluated via Verification & Assurance (vera)
- Routing → orchestrated via ZSSR
- Disclosure → exposed through Reports & Insights (repo)
At every step:
- the CMI identifies which artifact processed the data
- the CSI defines what the data represents
- the USO ID tracks which instance is being observed
This creates a fully connected lineage chain.
7. Core Capabilities
ZAR enables the following system-critical capabilities:
- Deterministic Lineage
Every signal instance can be traced through its complete processing chain.
- Replay & Reproducibility
Any ESG disclosure can be reconstructed using:
- CSI (semantic definition)
- CMI (execution logic)
- USO lineage
- Audit-Ready Architecture
Supports:
- CSRD assurance requirements
- ESRS data traceability
- ISO 14064 reproducibility
- AI Explainability
ZARA and ZAAM can:
- resolve any CSI to its component
- explain computation logic
- surface assumptions and validation layers
- Modular Scalability
ZAR supports:
- multi-tenant deployments
- white-label configurations
- global supply chain integration
8. Design Principles
ZAR is built on the following principles:
- Immutability
- Identifiers and lineage records are append-only
- Separation of Concerns
- SSSR → semantics (CSI)
- ZAR → artifacts (CMI)
- USO → runtime instances
- Deterministic Identity
- Every element is uniquely and consistently identifiable
- Documentation as Infrastructure
- Component identities are directly linked to system specifications
- Precision Before Automation
- All computations must be explainable, auditable, and verifiable
9. Strategic Positioning
ZAR transforms conventional ESG reporting into a traceable ESG infrastructure layer.
It enables organizations to move from:
- fragmented data handling to unified traceability
- opaque computations to explainable decision chains
- reactive compliance to auditable governance-by-design
In architectural terms, ZAR makes every sustainability-relevant output traceable back to:
- its semantic definition
- its producing artifact
- its runtime processing history
- its documented component specification
10. Transition to Canonical Identifier Architecture
The following section defines the Canonical Identifier Architecture (CIA) in detail, including:
- CSI (signal identity)
- CMI (artifact identity)
- USO (runtime identity)
- and their interaction across the ZAYAZ platform
APPENDIX A - CSI Naming Taxonomy
The <MODULE_CODE>, <COMPONENT_ID>, and <VERSION> segments are given.
Below are examples of <KIND> and <NAME> values for CSIs.
A.1. KIND
Represents the role or artifact type that the signal belongs to. Typically one per schema or message family.
| KIND | Description |
|---|---|
| INPUT | Input schema or raw signal |
| OUTPUT | Output schema or derived signal |
| SIGNAL | Atomic reusable signal |
| SCHEMA | JSON Schema or tabular schema type |
| CONFIG | Configuration schema or parameter set |
| FEATURE | Derived ML feature |
| METRIC | Aggregated KPI or model output |
| EVENT | System event schema |
| VIEW | Analytical or reporting view |
A.2. NAME conventions (semantic or technical label)
Describes what the signal is semantically. Uppercase with underscores for clarity. Name must be unambiguous across all components and is equivalent to the signal's signal_name.
| Example | Meaning |
|---|---|
| TRUST_SCORE | Weighted trust index (0–1) |
| CO2E | Carbon equivalent emissions |
| EF_QUALITY | Emission factor quality |
| SUPPLIER_TRUST | Supplier reliability score |
| EF_TIER | Emission factor source tier |
| WATER_USE | Water consumption metric |
APPENDIX B - CMI Naming Taxonomy
The <MODULE_CODE>, <COMPONENT_ID>, and <VERSION> segments are given.
Below are examples of <KIND> and <NAME> values for CMIs.
B.1. KIND
| KIND | Meaning |
|---|---|
| ENGINE | Executable micro-engine (Python, Node, etc.) |
| SCHEMA | Schema or data contract |
| SCRIPT | Script or ETL job |
| RULESET | Ruleset / policy definition |
| CONNECTOR | Integration adapter (e.g., SAP, QuickBooks) |
| MODEL | Trained ML model |
| UI | Front-end component |
| DASHBOARD | Visualization artifact |
| JOB | Orchestrated workflow (Airflow/StepFunction) |
| LIB | Shared library |
| TEST | Validation or regression test bundle |
B.2. NAME
The artifact or sub-function name within the component.
| Example | Meaning |
|---|---|
| Core | Main runtime module |
| Parser | Text parsing module |
| Validator | Rule validator |
| Connector | API connector |
| Decision | Output schema |
| OutputDecision | Decision schema type |
| InvoiceLines | Router ruleset |
| EU_Validator | Region-specific variant |
APPENDIX C - The Birth of a Signal
When a signal is born, a USO ID is created, and the appropriate CSI and CMI are assigned from their registries.
The canonical creation sequence
| Step | Action | Created / Assigned | Registry | Meaning |
|---|---|---|---|---|
| 1. Signal instance is generated | A micro-engine finishes a computation or data extraction. | — | — | “A new data record is born.” |
| 2. System mints a USO ID | New globally unique lineage identifier. | Created | USO (runtime) | “This is one unique signal instance.” |
| 3. System attaches CMI | Engine’s canonical artifact ID. | Assigned | ZAR | “It was produced by this artifact.” |
| 4. System attaches CSI | Canonical signal type ID. | Assigned | SSSR | “It is a signal of this conceptual type.” |
| 5. (Optional) Add origin_chain and origin_chain_codes | For future provenance hops. | Appended | USO | “Here’s its movement trail.” |
In plain language
- USO ID → created at runtime (new each time a data point exists)
- CMI → assigned from the ZAR registry (the artifact that produced it)
- CSI → assigned from the SSSR registry (the conceptual type of signal)
Example — “Invoice CO2E” in context
| Layer | Field | Example | How it got there | Meaning |
|---|---|---|---|---|
| USO | uso_id | 01JBF0W8S9Q0R1S2T3U4V5W6X | Auto-created ULID at runtime | This is one unique signal instance. |
| USO | primary_origin_cmi | comp.TAC.ENGINE.CORE.1_1_0 | Assigned from ZAR | It was first produced by this artifact. |
| USO | csi | comp.TAC.OUTPUT.CO2E.v1_0 | Assigned from SSSR | It is a signal of this conceptual type. |
| USO | origin_chain | [comp.TAC.ENGINE.CORE.1_1_0] | Initialized from producing artifact | Here is the ordered chain of artifacts that touched it. |
| USO | origin_chain_codes | [TAC12] | Derived from CMI short code | Compact representation of the processing trail. |
| USO | born_at | 2025-10-25T12:40Z | Auto-timestamped | When this signal instance was created. |
Later, if the same record passes through TrustGate:
| Field | New Value | Why |
|---|---|---|
| current_cmi | vera.TG-CORE.ENGINE.CORE.1_0_0 | Assigned from ZAR |
| origin_chain | […, vera.TG-CORE.ENGINE.CORE.1_0_0] | Appended |
| origin_chain_codes | […, TG3K7] | Appended |
Mental shortcut
Think of each registry as a “naming service” that the runtime joins together:
| Registry | Gives you | Example |
|---|---|---|
| ZAR | Who produced or consumed it | comp.TAC.ENGINE.CORE.1_1_0 |
| SSSR | What type of signal it is | comp.TAC.OUTPUT.CO2E.v1_0 |
| USO (runtime) | Which specific instance this is | 01JBF0W8S9Q0R1S2T3U4V5W6X |
Summary statement
When a signal is generated, the ZAYAZ platform:
- creates a new USO ID (unique lineage instance)
- assigns the correct CMI (the producing artifact, from ZAR)
- assigns the correct CSI (the signal type, from SSSR)
- optionally begins its origin_chain with the producing CMI and short code
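The creation sequence and the later TrustGate hop can be sketched in Python as follows. This is an illustrative model only: `SignalInstance`, `mint_signal`, and `record_hop` are hypothetical names, and `uuid.uuid4()` stands in for the ULID generator shown in the example table, because the Python standard library does not provide ULIDs.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class SignalInstance:
    """Runtime lineage record; field names follow the example table above."""
    uso_id: str
    csi: str
    primary_origin_cmi: str
    current_cmi: str
    origin_chain: List[str] = field(default_factory=list)
    origin_chain_codes: List[str] = field(default_factory=list)
    born_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def mint_signal(csi: str, cmi: str, short_code: str) -> SignalInstance:
    """Steps 2-5: mint an ID, attach CMI and CSI, start the origin chain."""
    return SignalInstance(
        uso_id=uuid.uuid4().hex,  # assumption: stand-in for a ULID
        csi=csi,
        primary_origin_cmi=cmi,
        current_cmi=cmi,
        origin_chain=[cmi],
        origin_chain_codes=[short_code],
    )


def record_hop(signal: SignalInstance, cmi: str, short_code: str) -> None:
    """Append a provenance hop; earlier chain entries are never rewritten."""
    signal.current_cmi = cmi
    signal.origin_chain.append(cmi)
    signal.origin_chain_codes.append(short_code)


sig = mint_signal("comp.TAC.OUTPUT.CO2E.v1_0", "comp.TAC.ENGINE.CORE.1_1_0", "TAC12")
record_hop(sig, "vera.TG-CORE.ENGINE.CORE.1_0_0", "TG3K7")
print(sig.origin_chain_codes)
```

Note that `primary_origin_cmi` stays fixed after minting, while `current_cmi` and the two chains change with each hop.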
APPENDIX D - CSI Validation & SSSR Enforcement
D.1. Purpose
The CSI validator ensures that every Canonical Signal Identifier is:
- syntactically valid
- semantically well-formed
- aligned with the ZAYAZ module system
- linked to a valid documented component
- versioned consistently
It should be enforced in:
- CI/CD
- schema publishing
- SSSR inserts/updates
- code generation pipelines
- linting for MDX/manual examples
D.2. Canonical CSI Format
<MODULE_CODE>.<COMPONENT_ID>.<KIND>.<NAME>.v<MAJOR>_<MINOR>
Example
comp.PEF-ME.OUTPUT.CO2E.v1_0
vera.TG-CORE.OUTPUT.TRUST_SCORE.v1_0
inpt.FOGE-FORM.INPUT.WATER_USE.v1_0
Allowed Module Codes
inpt
comp
repo
siss
zara
zaam
risk
netz
vera
seel
acad
Allowed KIND values
INPUT
OUTPUT
SIGNAL
SCHEMA
CONFIG
FEATURE
METRIC
EVENT
VIEW
D.3. Field Rules
- MODULE_CODE
- must be lowercase
- must be one of the approved module codes
- must be exactly one registered ZAYAZ module namespace
- COMPONENT_ID
- must match a valid frontmatter ID
- must be globally unique across the platform
- recommended pattern:
^[A-Z0-9]+(?:-[A-Z0-9]+)*$
- examples:
- PEF-ME
- TG-CORE
- FOGE-FORM
- KIND
- must be uppercase
- must belong to the approved enum
- NAME
- must be uppercase snake case or uppercase alphanumeric token
- recommended pattern:
^[A-Z][A-Z0-9_]*$
- examples:
- CO2E
- TRUST_SCORE
- VALIDATION_STATUS
- VERSION
- must use:
v<MAJOR>_<MINOR>
- examples:
- v1_0
- v2_1
- Full CSI
- no spaces
- no extra segments
- no lowercase in KIND or NAME
- no missing v prefix
- no dots inside segments
D.4. Regex
Use this as the base validator:
^(inpt|comp|repo|siss|zara|zaam|risk|netz|vera|seel|acad)\.([A-Z0-9]+(?:-[A-Z0-9]+)*)\.(INPUT|OUTPUT|SIGNAL|SCHEMA|CONFIG|FEATURE|METRIC|EVENT|VIEW)\.([A-Z][A-Z0-9_]*)\.v([0-9]+)_([0-9]+)$
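As a sketch of how Level 1 syntax validation might use this pattern, the snippet below compiles the same regex in Python and checks a few of the example CSIs. The names `CSI_PATTERN` and `is_valid_csi` are illustrative assumptions, not an existing ZAYAZ API.

```python
import re

# The base validator from D.4, split across lines for readability.
CSI_PATTERN = re.compile(
    r"^(inpt|comp|repo|siss|zara|zaam|risk|netz|vera|seel|acad)"
    r"\.([A-Z0-9]+(?:-[A-Z0-9]+)*)"
    r"\.(INPUT|OUTPUT|SIGNAL|SCHEMA|CONFIG|FEATURE|METRIC|EVENT|VIEW)"
    r"\.([A-Z][A-Z0-9_]*)"
    r"\.v([0-9]+)_([0-9]+)$"
)


def is_valid_csi(csi: str) -> bool:
    """Syntax-only check; registry and semantic levels are out of scope here."""
    # fullmatch also rejects a trailing newline, which '$' alone would tolerate.
    return CSI_PATTERN.fullmatch(csi) is not None


for candidate in (
    "comp.PEF-ME.OUTPUT.CO2E.v1_0",
    "calc.TAC.OUTPUT.CO2E.1_0",
    "comp.PEF-ME.OUTPUT.CO2E.v1_0_1",
):
    print(candidate, is_valid_csi(candidate))
```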
D.5. Validation Levels
Level 1 — Syntax Validation
Checks:
- regex match
- segment count
- allowed character set
- required v version prefix
Level 2 — Registry Validation
Checks:
- MODULE_CODE exists
- COMPONENT_ID exists in component/frontmatter registry
- component belongs to correct module
- KIND is valid enum
Level 3 — Semantic Validation
Checks:
- CSI not already assigned conflicting meaning
- version bump rules followed
- deprecated CSI not reused
- NAME uniqueness rules respected within intended scope
Level 4 — Governance Validation
Checks:
- change approved if semantic meaning changed
- major version bump for breaking semantic changes
- minor version bump only for non-breaking semantic refinements
D.6. Versioning Rules
Minor bump (v1_0 → v1_1)
Use when:
- description refined
- metadata expanded
- documentation clarified
- no semantic meaning change
Major bump (v1_1 → v2_0)
Use when:
- signal meaning changes
- methodology changes materially
- unit changes
- value interpretation changes
- framework mapping changes in a way that alters semantics
Forbidden
- changing meaning without version bump
- reusing deprecated CSI for new meaning
- patch-style CSI versions like
v1_0_1
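The bump rules can be expressed as a small check. The sketch below is an assumption-laden illustration: `parse_version`, `bump_kind`, and `bump_is_acceptable` are hypothetical helpers, and the policy that a major bump is also acceptable for a non-breaking change is an assumption, since this section only states when each bump is required.

```python
import re
from typing import Tuple

_VERSION = re.compile(r"v([0-9]+)_([0-9]+)")


def parse_version(version: str) -> Tuple[int, int]:
    """Parse 'v<MAJOR>_<MINOR>'; patch-style versions such as v1_0_1 fail."""
    match = _VERSION.fullmatch(version)
    if match is None:
        raise ValueError(f"not a CSI version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def bump_kind(old: str, new: str) -> str:
    """Classify the step from old to new as 'major', 'minor', or 'none'."""
    old_major, old_minor = parse_version(old)
    new_major, new_minor = parse_version(new)
    if new_major > old_major:
        return "major"
    if new_major == old_major and new_minor > old_minor:
        return "minor"
    return "none"


def bump_is_acceptable(old: str, new: str, meaning_changed: bool) -> bool:
    """A changed meaning needs a major bump; otherwise any forward bump passes."""
    kind = bump_kind(old, new)
    if meaning_changed:
        return kind == "major"
    return kind in ("minor", "major")  # assumption: over-bumping is tolerated


print(bump_kind("v1_0", "v1_1"), bump_kind("v1_1", "v2_0"))
```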
D.7. Example CSIs
Example Valid CSIs
comp.PEF-ME.OUTPUT.CO2E.v1_0
vera.TG-CORE.OUTPUT.TRUST_SCORE.v1_0
inpt.FOGE-FORM.INPUT.WATER_USE.v1_0
repo.REPORT-BUILDER.VIEW.ESRS_DASHBOARD.v2_0
risk.RIF-CORE.EVENT.RISK_ALERT.v1_1
Example Invalid CSIs
calc.TAC.OUTPUT.CO2E.1_0
Invalid:
- calc not approved
- missing v
comp.pef-me.OUTPUT.CO2E.v1_0
Invalid:
- component not uppercase frontmatter format
comp.PEF-ME.output.CO2E.v1_0
Invalid:
- KIND not uppercase
comp.PEF-ME.OUTPUT.co2e.v1_0
Invalid:
- NAME not uppercase
comp.PEF-ME.OUTPUT.CO2E.v1_0_1
Invalid:
- CSI does not use patch versioning
D.8. CI Enforcement Spec
Required checks in CI
Every new or changed CSI should be validated against:
- regex format
- approved module code list
- frontmatter/component registry lookup
- duplicate/conflict detection in SSSR
- version bump policy
Suggested CI failure messages
Invalid CSI: calc.TAC.OUTPUT.CO2E.1_0
Reason: module_code 'calc' is not registered. Use one of: inpt, comp, repo, siss, zara, zaam, risk, netz, vera, seel, acad.
Invalid CSI: comp.PEF-ME.OUTPUT.CO2E.1_0
Reason: version must use 'v<MAJOR>_<MINOR>' format, e.g. v1_0.
Invalid CSI: comp.PEF-ME.output.CO2E.v1_0
Reason: KIND must be one of INPUT, OUTPUT, SIGNAL, SCHEMA, CONFIG, FEATURE, METRIC, EVENT, VIEW.
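One way to produce messages of this shape is to check the segments one at a time and report the first failure. The following Python sketch does that; `csi_failure_reason` is a hypothetical helper, the order of checks is an assumption, and the wording of the segment-count, COMPONENT_ID, and NAME messages is invented here because the section does not specify them.

```python
import re
from typing import Optional

MODULES = ("inpt", "comp", "repo", "siss", "zara", "zaam",
           "risk", "netz", "vera", "seel", "acad")
KINDS = ("INPUT", "OUTPUT", "SIGNAL", "SCHEMA", "CONFIG",
         "FEATURE", "METRIC", "EVENT", "VIEW")


def csi_failure_reason(csi: str) -> Optional[str]:
    """Return the first reason a CSI is invalid, or None if it passes."""
    parts = csi.split(".")
    if len(parts) != 5:
        return f"expected 5 dot-separated segments, found {len(parts)}."
    module, component, kind, name, version = parts
    if module not in MODULES:
        return (f"module_code '{module}' is not registered. "
                f"Use one of: {', '.join(MODULES)}.")
    if not re.fullmatch(r"[A-Z0-9]+(?:-[A-Z0-9]+)*", component):
        return "COMPONENT_ID must be an uppercase frontmatter ID, e.g. PEF-ME."
    if kind not in KINDS:
        return f"KIND must be one of {', '.join(KINDS)}."
    if not re.fullmatch(r"[A-Z][A-Z0-9_]*", name):
        return "NAME must be uppercase snake case, e.g. TRUST_SCORE."
    if not re.fullmatch(r"v[0-9]+_[0-9]+", version):
        return "version must use 'v<MAJOR>_<MINOR>' format, e.g. v1_0."
    return None


bad = "comp.PEF-ME.output.CO2E.v1_0"
print(f"Invalid CSI: {bad}\nReason: {csi_failure_reason(bad)}")
```

Reporting only the first failure keeps CI output short; a stricter variant could collect every failing segment instead.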
D.9. SSSR Schema Enforcement Model
Since CSI does not have its own registry and lives inside SSSR, the cleanest model is:
- keep CSI as a first-class field in signal_registry
- validate it against:
- module registry
- component/frontmatter registry
- CSI rules
- optionally decompose it into indexed columns
Recommended signal_registry Structure
CREATE TABLE signal_registry (
signal_id TEXT PRIMARY KEY,
csi TEXT NOT NULL UNIQUE,
module_code TEXT NOT NULL,
component_id TEXT NOT NULL,
kind TEXT NOT NULL,
signal_name TEXT NOT NULL,
version_major INTEGER NOT NULL,
version_minor INTEGER NOT NULL,
display_name TEXT NOT NULL,
description TEXT,
value_type TEXT,
unit TEXT,
status TEXT NOT NULL DEFAULT 'active',
deprecated_by_csi TEXT NULL,
source_module_id TEXT NULL,
framework_tags JSONB,
metadata JSONB,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
created_by TEXT,
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
(NOTE: This is just a part of the full signal_registry table.)
D.10. Why store decomposed CSI columns too
Do not rely only on the full csi string.
Store:
- module_code
- component_id
- kind
- signal_name
- version_major
- version_minor
This gives us:
- faster filtering
- easier joins
- easier governance
- safer validation
- better analytics
The full csi remains the canonical string, but the decomposed columns make the system operable.
D.11. Recommended Constraints
- Module constraint
CHECK (module_code IN ('inpt','comp','repo','siss','zara','zaam','risk','netz','vera','seel','acad'))
- Kind constraint
CHECK (kind IN ('INPUT','OUTPUT','SIGNAL','SCHEMA','CONFIG','FEATURE','METRIC','EVENT','VIEW'))
- Version constraint
CHECK (version_major >= 0),
CHECK (version_minor >= 0)
- CSI format constraint
If the DB supports regex checks:
CHECK (
csi ~ '^(inpt|comp|repo|siss|zara|zaam|risk|netz|vera|seel|acad)\.([A-Z0-9]+(?:-[A-Z0-9]+)*)\.(INPUT|OUTPUT|SIGNAL|SCHEMA|CONFIG|FEATURE|METRIC|EVENT|VIEW)\.([A-Z][A-Z0-9_]*)\.v([0-9]+)_([0-9]+)$'
)
- Canonical string consistency
Ensure decomposed fields match the csi string through trigger or generated column logic.
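The consistency rule amounts to rebuilding the canonical string from the decomposed columns and comparing it with the stored csi. A minimal Python sketch of that comparison follows; in production this would live in a database trigger or generated column as described above, and the `compose_csi` and `is_consistent` names are hypothetical.

```python
from typing import Any, Mapping


def compose_csi(row: Mapping[str, Any]) -> str:
    """Rebuild the canonical CSI string from the decomposed columns."""
    return (
        f"{row['module_code']}.{row['component_id']}.{row['kind']}."
        f"{row['signal_name']}.v{row['version_major']}_{row['version_minor']}"
    )


def is_consistent(row: Mapping[str, Any]) -> bool:
    """True when the stored csi equals the string implied by its columns."""
    return compose_csi(row) == row["csi"]


row = {
    "csi": "vera.TG-CORE.OUTPUT.TRUST_SCORE.v1_0",
    "module_code": "vera",
    "component_id": "TG-CORE",
    "kind": "OUTPUT",
    "signal_name": "TRUST_SCORE",
    "version_major": 1,
    "version_minor": 0,
}
print(is_consistent(row))
```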
D.12. Recommended Foreign Keys
On signal_registry, reference the module/component documentation registry:
component_id REFERENCES documented_components(component_id)
And:
(module_code, component_id) REFERENCES documented_components(module_code, component_id)
This is the strongest way to enforce:
- frontmatter linkage
- documentation integrity
- ZARA explainability compatibility
D.13. Suggested documented_components Table
| Field | Purpose |
|---|---|
| component_id | Frontmatter ID, e.g. ZAR-FW, PEF-ME, TG-CORE |
| module_code | comp, vera, inpt, etc. |
| title | Human-readable title |
| slug | Docs route |
| source_file | MDX file path |
| doc_status | draft / review / active / deprecated |
| version | Spec version |
| owner_team | Responsible team |
| summary | One-paragraph description |
| parent_component_id | Optional link to parent spec/component |
| tags | Search/filter metadata |
| legacy_manual_ref | Optional backward reference |
| last_updated | Auditability / sync support |
CREATE TABLE documented_components (
component_id TEXT PRIMARY KEY,
module_code TEXT NOT NULL,
title TEXT NOT NULL,
slug TEXT,
doc_status TEXT NOT NULL DEFAULT 'active',
owner_team TEXT,
source_file TEXT
);
Add uniqueness:
CREATE UNIQUE INDEX documented_components_module_component_uidx
ON documented_components(module_code, component_id);
D.14. Trigger Strategy
Use a trigger on sssr_signals insert/update to:
- parse csi
- validate segment values
- populate decomposed fields
- verify (module_code, component_id) exists in documented_components
- reject semantic collisions
D.15. Pseudocode
on insert/update sssr_signals:
parse csi into module_code, component_id, kind, signal_name, version_major, version_minor
assert module_code in allowed_modules
assert kind in allowed_kinds
assert documented_components contains (module_code, component_id)
assert no conflicting active signal with same csi
assert versioning rules are respected
write parsed fields back into columns
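For readers who prefer something executable, the pseudocode could be prototyped in Python roughly as below. This is a sketch under stated assumptions: the in-memory `documented_components` set and `active_signals` dict stand in for the real tables, `on_insert` and `CsiRejected` are hypothetical names, only the insert path is modeled, and the versioning-rule assertion is left out.

```python
import re
from typing import Dict, Set, Tuple

CSI_RE = re.compile(
    r"(?P<module_code>inpt|comp|repo|siss|zara|zaam|risk|netz|vera|seel|acad)"
    r"\.(?P<component_id>[A-Z0-9]+(?:-[A-Z0-9]+)*)"
    r"\.(?P<kind>INPUT|OUTPUT|SIGNAL|SCHEMA|CONFIG|FEATURE|METRIC|EVENT|VIEW)"
    r"\.(?P<signal_name>[A-Z][A-Z0-9_]*)"
    r"\.v(?P<version_major>[0-9]+)_(?P<version_minor>[0-9]+)"
)


class CsiRejected(ValueError):
    """Raised when a row fails one of the trigger's assertions."""


def on_insert(csi: str,
              documented_components: Set[Tuple[str, str]],
              active_signals: Dict[str, dict]) -> dict:
    # Parse; the alternations also enforce the allowed modules and kinds.
    match = CSI_RE.fullmatch(csi)
    if match is None:
        raise CsiRejected(f"malformed CSI: {csi}")
    row = dict(match.groupdict())
    # The (module_code, component_id) pair must exist in the docs registry.
    if (row["module_code"], row["component_id"]) not in documented_components:
        raise CsiRejected(f"component not documented for module: {csi}")
    # No conflicting active signal with the same CSI.
    if csi in active_signals:
        raise CsiRejected(f"CSI already active: {csi}")
    # Write the parsed fields back as typed columns.
    row["version_major"] = int(row["version_major"])
    row["version_minor"] = int(row["version_minor"])
    row["csi"] = csi
    active_signals[csi] = row
    return row


components = {("vera", "TG-CORE")}
store: Dict[str, dict] = {}
print(on_insert("vera.TG-CORE.OUTPUT.TRUST_SCORE.v1_0", components, store))
```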
D.16. Recommended Indexes
CREATE UNIQUE INDEX sssr_signals_csi_uidx ON sssr_signals(csi);
CREATE INDEX sssr_signals_module_idx ON sssr_signals(module_code);
CREATE INDEX sssr_signals_component_idx ON sssr_signals(component_id);
CREATE INDEX sssr_signals_kind_idx ON sssr_signals(kind);
CREATE INDEX sssr_signals_name_idx ON sssr_signals(signal_name);
CREATE INDEX sssr_signals_status_idx ON sssr_signals(status);
D.17. Strong Governance Rules for SSSR
A signal record must not be created unless:
- CSI is valid
- component exists in documentation registry
- semantic definition is present
- display name is present
- value type is defined for machine handling
A signal record must be deprecated instead of overwritten when:
- meaning changes
- framework logic changes materially
- unit/value interpretation changes
A signal record may be revised in place only when:
- documentation is clarified
- metadata is enriched without semantic change
D.18. Best Practice: Canonical + Display Split
In SSSR, keep:
- csi as canonical machine identity
- display_name as human label
- description as semantic definition
Example:
| Field | Value |
|---|---|
| csi | vera.TG-CORE.OUTPUT.TRUST_SCORE.v1_0 |
| display_name | Trust Score |
| description | Weighted trust index between 0 and 1 for signal-level or aggregate validation confidence. |
This avoids semantic drift.
D.19. Final Recommendation
The strongest setup for ZAYAZ is:
- CSI stays inside SSSR
- CSI is validated by regex + registry + trigger
- component linkage is enforced against frontmatter-derived documentation metadata
- decomposed CSI fields are stored alongside the full canonical string
- CI validates examples and schema changes before merge
That gives us:
- documentation-linked identity
- strong DB enforcement
- reliable routing inputs
- ZARA-readable architecture
- audit-grade consistency
APPENDIX E — Signal Naming Governance Policy
E.1. Purpose
This appendix defines the governance policy for generating signal_name, classifying MODULE_CODE and KIND, and validating pre-version CSI structures for the ZAYAZ platform.
The policy governs the classification workflow of the Signal Classification Pipeline and applies to all signal records prepared for insertion into the SSSR signal registry.
The workflow relies on structured context extracted from signal_registry and table_registry, including:
- component title
- component description
- table description
- column reference
- column description
- cleaned datatype
- enum values or example content
- other relevant metadata required for classification
This information is exported into a working csi_registry for classification and review. Once approved, the enriched results are written back into signal_registry, where the final CSI is assembled.
E.2. Governing Principles
The following principles apply throughout the classification process:
- Classification before concatenation
Semantic classification must be completed before the full CSI is assembled.
- Validation before review
Automated checks must run before human review is triggered.
- JSON evidence before approval
Every processed column must produce a JSON evidence record.
- Human review only where needed
Manual review is reserved for low-confidence or flagged cases.
- Versioning remains outside the Signal Classification Pipeline
CSI versioning is assigned manually and appended later during Excel concatenation.
- Semantics over storage
signal_name and KIND must reflect semantic intent, not merely physical column names or storage formats.
E.3. Signal Classification Pipeline
The Signal Classification Pipeline prepares SSSR signal metadata in a deterministic, reviewable, and auditable way before final CSI concatenation.
The pipeline stages are:
- Datatype cleanup app
- MODULE_CODE app
- KIND app
- SIGNAL_NAME app
- Validator checks (pre-version only)
- JSON export
- Human review only for low-confidence cases
- Excel concatenation
E.4 Pipeline Stages
E.4.1 Datatype Cleanup App
Normalizes:
- base datatype
- nullability
- enum structure
- scalar vs array
- object vs text
- reference vs reference-list semantics
- timestamp/date conventions
Inputs:
- source_data_type (signal_type)
- column_description (signal_description)
- sample_values
- table_prefix
- source_table
Outputs:
- cleaned_data_type
- datatype_normalization_notes
- datatype_confidence
E.4.2 MODULE_CODE App
Classifies the correct module from the fixed approved module list (use Module Code):
MODULE_CODE Dictionary
- inpt
- comp
- repo
- siss
- zara
- zaam
- risk
- netz
- vera
- seel
- acad
Rules:
- MODULE_CODE must match one of the values above.
- Module classification is component-governed, not column-level.
- A component should map to exactly one module unless explicitly documented.
Inputs:
- component title (component_name)
- component description (component_description)
- table description (table_notes, short_description)
- table type hint (table_prefix)
- owning component context = component_name + component_description + table context
- existing component-to-module mapping where available (recommended external lookup table)
Outputs:
- module_code
- module_confidence
- rationale
Rule: MODULE_CODE should normally be determined at the component or table level, not independently per column.
E.4.3 KIND App
Classifies the semantic role of each field using the KIND policy defined below.
Inputs:
- table baseline context (table_notes, short_description)
- table type hint (table_prefix)
- column description (column_description / signal_description)
- cleaned datatype (cleaned_data_type)
- enum values or example content (sample_values)
- module context (module_code)
- component context (component_name, component_description)
Outputs:
- kind
- kind_confidence
- rationale
E.4.4 SIGNAL_NAME App
Generates the curated semantic signal_name used as the NAME segment in the CSI.
Inputs:
- column_reference
- column_description (signal_description)
- cleaned datatype (cleaned_data_type)
- enum values or example content (sample_values)
- table context (source_table, table_prefix, table_notes, short_description)
- component context (component_name, component_description)
- module context (module_code)
- approved naming governance rules
Outputs:
- signal_name
- signal_name_confidence
- naming basis
- review flags if ambiguous
E.4.5 Validator Checks
Runs pre-version validation on:
<MODULE_CODE>.<COMPONENT_ID>.<KIND>.<SIGNAL_NAME>
Example:
comp.AIIL-CON.CONFIG.METHOD_VERSION
Checks include:
- module validity
- component linkage
- allowed KIND
- naming policy compliance
- duplicate collision detection
- near-collision detection
- reserved-word and anti-pattern checks
Outputs:
- pre_version_key
- is_valid
- collision_check_result
- near_collision_result
- needs_review
- review_reason
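The collision and near-collision checks could be prototyped with the Python standard library as sketched below. The similarity method (`difflib` ratio), the 0.9 cutoff, the result strings, and the function names are all assumptions made for illustration; this policy does not specify how near-collisions are measured.

```python
import difflib
from typing import Dict, List


def pre_version_key(module_code: str, component_id: str,
                    kind: str, signal_name: str) -> str:
    """Assemble <MODULE_CODE>.<COMPONENT_ID>.<KIND>.<SIGNAL_NAME>."""
    return ".".join((module_code, component_id, kind, signal_name))


def check_collisions(key: str, existing_keys: List[str],
                     cutoff: float = 0.9) -> Dict[str, object]:
    """Flag exact duplicates and look-alike keys for human review."""
    exact = key in existing_keys
    # Only look for near matches when there is no exact duplicate.
    near = [] if exact else difflib.get_close_matches(
        key, existing_keys, n=3, cutoff=cutoff)
    if exact:
        reason = "duplicate pre-version key"
    elif near:
        reason = "near-collision with: " + ", ".join(near)
    else:
        reason = ""
    return {
        "pre_version_key": key,
        "collision_check_result": "collision" if exact else "ok",
        "near_collision_result": near,
        "needs_review": exact or bool(near),
        "review_reason": reason,
    }


existing = ["comp.AIIL-CON.CONFIG.METHOD_VERSION"]
candidate = pre_version_key("comp", "AIIL-CON", "CONFIG", "METHOD_VERSIONS")
print(check_collisions(candidate, existing))
```

A character-level ratio is a crude proxy for semantic similarity; it catches plural forms and typos but not synonyms.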
E.4.6. JSON Export
Exports one JSON record per processed column.
Recommended output file:
zarathustra-csi-proposals.json
This file serves as:
- audit evidence
- training data
- QA input
- migration/reference source
E.4.7. Human Review
Only low-confidence or flagged records are reviewed manually.
Typical review triggers include:
- ambiguous KIND
- weak or missing column descriptions
- near-collision results
- naming policy exceptions
- low total confidence scores
E.4.8. Excel Concatenation
Approved values are pasted into Excel and concatenated into the final CSI:
<MODULE_CODE>.<COMPONENT_ID>.<KIND>.<SIGNAL_NAME>.v<MAJOR>_<MINOR>
Versioning remains manual.
E.5. Classification Governance Rules
E.5.1. MODULE_CODE Governance
MODULE_CODE is component-governed.
It must not be invented independently for each field.
For most tables, all columns should inherit the same module as the owning component.
Example:
- component: AIIL-CON
- table: compute_method_registry
- module: comp
All signals in that table therefore inherit the comp.* namespace unless a documented exception exists.
E.5.2. KIND Governance
KIND is field-governed, but table-aware.
It must not be guessed from the column name alone.
A baseline kind may be established at table level, but field-level overrides are allowed and expected where the semantic role differs.
E.5.3. SIGNAL_NAME Governance
SIGNAL_NAME is field-governed and semantics-first.
It must not be copied blindly from the physical column name unless the physical name already expresses the correct semantic meaning according to policy.
column_reference remains the physical storage reference.
signal_name is the curated semantic identifier.
The NAME segment in the CSI is derived from signal_name, not from column_reference.
E.6. Confidence Model
E.6.1. MODULE_CODE confidence
Usually high confidence when:
- the component is already mapped
- the table description is clearly anchored
- MDX/frontmatter context is available
Low confidence when:
- the component spans multiple modules
- descriptions are vague
- ownership is unclear
E.6.2. KIND confidence
Usually high confidence when:
- datatype and description align
- the table has a clear semantic role
- the field meaning is obvious
Low confidence when:
- the field is generic (value, status, type, data)
- classification is ambiguous between CONFIG and SCHEMA
- classification is ambiguous between OUTPUT and METRIC
- classification is ambiguous between SIGNAL and FEATURE
E.6.3. SIGNAL_NAME confidence
Usually high confidence when:
- the field description is specific
- the semantic meaning is clear
- naming matches approved suffix and token conventions
- no collision or near-collision exists
Low confidence when:
- the field is generic
- the description is weak
- multiple expansions are plausible
- the field could be interpreted in more than one semantic way
E.7. KIND Classification Policy
E.7.1. Purpose
The KIND segment classifies the semantic role of a field or signal within the ZAYAZ platform.
It is not merely a datatype label and not merely a UI label. It expresses the field’s functional role in context.
Approved KIND values are:
- INPUT
- OUTPUT
- SIGNAL
- SCHEMA
- CONFIG
- FEATURE
- METRIC
- EVENT
- VIEW
Rules:
- KIND must match one of the values above.
- KIND is field-governed but table-aware.
- A table may define a baseline KIND, but field-level overrides are allowed.
Important constraints:
- *_SCHEMA_REF → must be SCHEMA
- CREATED_AT, UPDATED_AT (in registry tables) → must be CONFIG
- Numeric fields are not automatically METRIC
E.7.2. Core Principle
KIND must describe the semantic role of the field in the platform, not just how the field happens to be stored.
This means:
- a schema reference is not automatically an INPUT or OUTPUT
- a registry timestamp is not automatically an EVENT
- a config row is not automatically a METRIC
- a physical column name must not determine KIND by itself
E.7.3. Decision Hierarchy
The KIND app should classify in the following order:
A. Table or component context
What kind of object is the table primarily describing?
Examples:
- registry/config tables → baseline often CONFIG
- runtime payload tables → may contain INPUT, OUTPUT, SIGNAL
- analytical views → often VIEW or METRIC
- event logs → often EVENT
B. Column semantic meaning
What does the field actually represent?
Examples:
- schema references → often SCHEMA
- lifecycle metadata → often CONFIG
- computed KPI values → often METRIC
- derived model variables → often FEATURE
C. Datatype and shape
Use cleaned datatype as a secondary signal, not the primary one.
Examples:
- timestamp alone does not imply EVENT
- JSON alone does not imply SCHEMA
- enum alone does not imply CONFIG
E.7.4. KIND Definitions
INPUT
Use when the field represents an input value or input-facing signal consumed by a process, engine, form, or model.
Typical examples:
- activity input value
- emissions input quantity
- user-entered data field
- machine-provided input signal
Do not use for:
- references to input schemas
- method configuration describing inputs in general
OUTPUT
Use when the field represents a computed or emitted output value from a method, engine, or transformation.
Typical examples:
- CO2E
- TRUST_SCORE
- VALIDATION_RESULT
Do not use for:
- references to output schemas
- report display metadata
SIGNAL
Use when the field represents an atomic reusable signal that is not best modeled as an explicit input, an explicit output, or a higher-level metric.
Use sparingly. Prefer INPUT, OUTPUT, or METRIC when those are clearly more accurate.
SCHEMA Use when the field defines, references, or primarily concerns schema structure or data contracts.
Typical examples:
INPUTS_SCHEMA_REFOPTIONS_SCHEMA_REFOUTPUT_SCHEMA_REF
CONFIG Use when the field represents configuration, registry metadata, lifecycle settings, implementation bindings, dependency metadata, or governance-related setup information.
Typical examples:
METHOD_IDMETHOD_NAMEMETHOD_VERSIONLIFECYCLE_STATUSIMPLEMENTATION_REFMICRO_ENGINE_REFDATASET_REQUIREMENTSCREATED_ATUPDATED_AT
FEATURE
Use when the field represents a derived model feature used for ML/statistical processing rather than a business-facing metric.
METRIC
Use when the field represents an aggregated KPI, score, index, benchmark, or business-facing measurement.
Typical examples:
- TRUST_SCORE
- ECO_SCORE
- MATERIALITY_INDEX
- ABATEMENT_COST
EVENT
Use when the field belongs to an event record or explicitly represents an event signal or state-change record.
Do not use for:
- CREATED_AT
- UPDATED_AT
when those occur in registry/config tables.
VIEW
Use when the field belongs to a read-model, dashboard projection, analytical presentation layer, or reporting-specific view model.
E.7.5. Baseline + Override Model
Do not force a rigid table-wide KIND.
Instead use:
- a table-level baseline KIND
- per-column overrides where justified
Example: compute_method_registry
Baseline:
CONFIG
Overrides:
- INPUTS_SCHEMA_REF → SCHEMA
- OPTIONS_SCHEMA_REF → SCHEMA
- OUTPUT_SCHEMA_REF → SCHEMA
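The baseline-plus-override model can be sketched as a two-level lookup. The following is a minimal illustration only; the dictionaries and the function name `resolve_kind` are hypothetical and are not part of ZAR or the KIND app.

```python
# Hypothetical in-memory stand-ins for a table-level baseline and per-column overrides.
TABLE_BASELINES = {"compute_method_registry": "CONFIG"}

COLUMN_OVERRIDES = {
    ("compute_method_registry", "INPUTS_SCHEMA_REF"): "SCHEMA",
    ("compute_method_registry", "OPTIONS_SCHEMA_REF"): "SCHEMA",
    ("compute_method_registry", "OUTPUT_SCHEMA_REF"): "SCHEMA",
}


def resolve_kind(table_name: str, signal_name: str) -> str:
    """Return the per-column override if one exists, otherwise the table baseline."""
    override = COLUMN_OVERRIDES.get((table_name, signal_name))
    if override is not None:
        return override
    return TABLE_BASELINES[table_name]


print(resolve_kind("compute_method_registry", "METHOD_ID"))          # CONFIG
print(resolve_kind("compute_method_registry", "INPUTS_SCHEMA_REF"))  # SCHEMA
```

Keeping overrides keyed by (table, signal) means the baseline never has to be rigid: a column only deviates from it when an explicit, reviewable entry says so.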
E.7.6. Example Classification for compute_method_registry
| Field / signal_name | KIND |
|---|---|
| METHOD_ID | CONFIG |
| METHOD_NAME | CONFIG |
| METHOD_VERSION | CONFIG |
| LIFECYCLE_STATUS | CONFIG |
| METHOD_TYPE | CONFIG |
| DESCRIPTION | CONFIG |
| INPUTS_SCHEMA_REF | SCHEMA |
| OPTIONS_SCHEMA_REF | SCHEMA |
| OUTPUT_SCHEMA_REF | SCHEMA |
| IMPLEMENTATION_REF | CONFIG |
| MICRO_ENGINE_REF | CONFIG |
| ASSUMPTIONS_TEXT / ASSUMPTIONS_JSON | CONFIG |
| FRAMEWORK_REFS | CONFIG |
| DATASET_REQUIREMENTS | CONFIG |
| ACL_TAGS | CONFIG |
| CREATED_AT | CONFIG |
| UPDATED_AT | CONFIG |
E.7.7. Confidence and Review Rules
Auto-accept when:
- table purpose is clear
- field description clearly matches one KIND
- datatype aligns with the interpretation
- no near-equal alternative KIND is plausible
Human review required when:
- field is generic (value, type, status, data)
- ambiguous between CONFIG and SCHEMA
- ambiguous between OUTPUT and METRIC
- ambiguous between SIGNAL and FEATURE
- description is weak or missing
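One way to express the human-review triggers is as a function that returns the list of reasons a field cannot be auto-accepted. This is a sketch under stated assumptions: the function name, the `candidate_kinds` argument (the set of near-equal plausible KINDs), and the reason strings are all hypothetical, and "weak" descriptions are approximated here as empty ones.

```python
GENERIC_NAMES = {"value", "type", "status", "data"}

# KIND pairs that the review rules single out as ambiguous.
AMBIGUOUS_PAIRS = [("CONFIG", "SCHEMA"), ("OUTPUT", "METRIC"), ("SIGNAL", "FEATURE")]


def review_reasons(column_name, description, candidate_kinds):
    """Return the reasons human review is required; an empty list means auto-accept."""
    reasons = []
    if column_name.lower() in GENERIC_NAMES:
        reasons.append("generic field name")
    if not description or not description.strip():
        reasons.append("description is weak or missing")
    candidates = set(candidate_kinds)
    for first, second in AMBIGUOUS_PAIRS:
        if first in candidates and second in candidates:
            reasons.append(f"ambiguous between {first} and {second}")
    return reasons


print(review_reasons("method_id", "Unique method identifier", ["CONFIG"]))  # []
print(review_reasons("score", "", ["OUTPUT", "METRIC"]))
```

An empty result corresponds to the auto-accept case; any non-empty result could populate a review-reason field for the reviewer.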
E.7.8. Hard Exclusions
The KIND app must never infer:
- OUTPUT for *_SCHEMA_REF
- INPUT for *_SCHEMA_REF
- EVENT for CREATED_AT / UPDATED_AT in registry/config tables
- METRIC only because a field is numeric
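The hard exclusions lend themselves to a final guard that runs after classification and rejects forbidden results. The sketch below is illustrative; the function name and its flags (`in_registry_table`, `numeric_only_evidence`) are assumptions about what the classifier would know at that point.

```python
SCHEMA_REF_SUFFIX = "_SCHEMA_REF"
REGISTRY_TIMESTAMPS = {"CREATED_AT", "UPDATED_AT"}


def violates_hard_exclusion(kind, signal_name, *, in_registry_table,
                            numeric_only_evidence=False):
    """Return True if the proposed KIND breaks one of the hard exclusions."""
    # *_SCHEMA_REF fields must never be inferred as INPUT or OUTPUT.
    if signal_name.endswith(SCHEMA_REF_SUFFIX) and kind in {"INPUT", "OUTPUT"}:
        return True
    # Registry/config timestamps must never be inferred as EVENT.
    if kind == "EVENT" and in_registry_table and signal_name in REGISTRY_TIMESTAMPS:
        return True
    # METRIC must never be inferred only because the field is numeric.
    if kind == "METRIC" and numeric_only_evidence:
        return True
    return False


print(violates_hard_exclusion("OUTPUT", "OUTPUT_SCHEMA_REF", in_registry_table=True))  # True
print(violates_hard_exclusion("SCHEMA", "OUTPUT_SCHEMA_REF", in_registry_table=True))  # False
```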
E.8. Controlled Classification Dictionaries
The following controlled dictionaries define the allowed values for MODULE_CODE, KIND, and table_prefix.
These tables serve as:
- authoritative classification references
- validation sources for the Signal Classification Pipeline
- future candidates for formal registry tables in ZAR
E.8.1 MODULE_CODE Dictionary (zar.module_registry)
| Module | Module Code | Domain | Description |
|---|---|---|---|
| Input Hub | inpt | Data Acquisition | Structured ESG input, onboarding, system capability mapping |
| Computation Hub | comp | Analytics | Cross-domain computation & modeling |
| Reports & Insights Hub | repo | Disclosure | Report generation, visualization, stakeholder outputs |
| SIS | siss | Services | Shared governance services |
| ZARA | zara | Governance AI | Prompt-driven ESG orchestration |
| ZAAM | zaam | AI Assistance | Role-aware agent system |
| RIF | risk | Risk | ESG risk intelligence & escalation |
| NETZERO | netz | Climate | Decarbonization modeling & pathways |
| Verification & Assurance | vera | Trust | Verifier workflows & assurance logic |
| SEEL | seel | Materiality | Stakeholder engagement & materiality |
| EcoWorld Academy | acad | Education | Capacity building & ESG fluency |
Rules:
- MODULE_CODE must match one of the values above.
- Module classification is component-governed, not column-level.
- A component should map to exactly one module unless explicitly documented.
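A pipeline-side check of the first rule might look like the sketch below. It assumes the allowed codes are loaded into memory (here hard-coded from the table above) and that every code is four lowercase letters, as all listed codes are; the function name and error messages are hypothetical.

```python
import re

# Allowed values, copied from the MODULE_CODE dictionary table.
MODULE_CODES = frozenset({
    "inpt", "comp", "repo", "siss", "zara", "zaam",
    "risk", "netz", "vera", "seel", "acad",
})

# Format check: exactly four lowercase ASCII letters.
MODULE_CODE_FORMAT = re.compile(r"[a-z]{4}")


def check_module_code(code):
    """Return None if the code is valid, otherwise a short error message."""
    if MODULE_CODE_FORMAT.fullmatch(code) is None:
        return "module_code must be four lowercase letters"
    if code not in MODULE_CODES:
        return f"unknown module_code: {code!r}"
    return None


print(check_module_code("comp"))  # None
print(check_module_code("COMP"))  # module_code must be four lowercase letters
print(check_module_code("abcd"))  # unknown module_code: 'abcd'
```

Separating the format error from the membership error makes it easier to tell a typo in casing apart from a module that has not been registered yet.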
zar.module_registry
CREATE TABLE zar.module_registry (
module_code TEXT PRIMARY KEY,
module_name TEXT NOT NULL,
domain TEXT NOT NULL,
description TEXT NOT NULL,
sort_order INTEGER NOT NULL DEFAULT 100 CHECK (sort_order >= 0),
status TEXT NOT NULL DEFAULT 'active',
version TEXT NOT NULL DEFAULT '1_0_0',
source_doc_id TEXT,
approved_by TEXT,
notes TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
CONSTRAINT module_registry_module_code_chk
CHECK (module_code ~ '^[a-z]{4}$'),
CONSTRAINT module_registry_status_chk
CHECK (status IN ('active', 'deprecated', 'draft', 'retired')),
CONSTRAINT module_registry_version_chk
CHECK (version ~ '^[0-9]+_[0-9]+_[0-9]+$')
);
CREATE UNIQUE INDEX module_registry_module_name_uidx
ON zar.module_registry (module_name);
Seed insert:
INSERT INTO zar.module_registry
(module_code, module_name, domain, description, sort_order, status, version)
VALUES
('inpt', 'Input Hub', 'Data Acquisition', 'Structured ESG input, onboarding, system capability mapping', 10, 'active', '1_0_0'),
('comp', 'Computation Hub', 'Analytics', 'Cross-domain computation & modeling', 20, 'active', '1_0_0'),
('repo', 'Reports & Insights Hub', 'Disclosure', 'Report generation, visualization, stakeholder outputs', 30, 'active', '1_0_0'),
('siss', 'SIS', 'Services', 'Shared governance services', 40, 'active', '1_0_0'),
('zara', 'ZARA', 'Governance AI', 'Prompt-driven ESG orchestration', 50, 'active', '1_0_0'),
('zaam', 'ZAAM', 'AI Assistance', 'Role-aware agent system', 60, 'active', '1_0_0'),
('risk', 'RIF', 'Risk', 'ESG risk intelligence & escalation', 70, 'active', '1_0_0'),
('netz', 'NETZERO', 'Climate', 'Decarbonization modeling & pathways', 80, 'active', '1_0_0'),
('vera', 'Verification & Assurance', 'Trust', 'Verifier workflows & assurance logic', 90, 'active', '1_0_0'),
('seel', 'SEEL', 'Materiality', 'Stakeholder engagement & materiality', 100, 'active', '1_0_0'),
('acad', 'EcoWorld Academy', 'Education', 'Capacity building & ESG fluency', 110, 'active', '1_0_0');
This table also exists as a JSON file, zarathustra-module-registry.json, which is used as an input to the Signal Classification Pipeline.
E.8.2 KIND Dictionary (zar.kind_registry)
| KIND | Description |
|---|---|
| INPUT | Input schema or raw signal |
| OUTPUT | Output schema or derived signal |
| SIGNAL | Atomic reusable signal |
| SCHEMA | JSON Schema or tabular schema reference |
| CONFIG | Configuration, registry metadata, or parameters |
| FEATURE | Derived ML feature |
| METRIC | Aggregated KPI or model output |
| EVENT | System event or state-change record |
| VIEW | Analytical or reporting view |
Rules:
- KIND must match one of the values above.
- KIND is field-governed but table-aware.
- A table may define a baseline KIND, but field-level overrides are allowed.
Important constraints:
- *_SCHEMA_REF → must be SCHEMA
- CREATED_AT, UPDATED_AT (in registry tables) → must be CONFIG
- Numeric fields are not automatically METRIC
zar.kind_registry
CREATE TABLE zar.kind_registry (
csi_kind TEXT PRIMARY KEY,
csi_kind_description TEXT NOT NULL,
semantic_role TEXT,
usage_notes TEXT,
sort_order INTEGER NOT NULL DEFAULT 100 CHECK (sort_order >= 0),
status TEXT NOT NULL DEFAULT 'active',
version TEXT NOT NULL DEFAULT '1_0_0',
source_doc_id TEXT,
approved_by TEXT,
notes TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
CONSTRAINT kind_registry_csi_kind_chk
CHECK (csi_kind IN (
'INPUT',
'OUTPUT',
'SIGNAL',
'SCHEMA',
'CONFIG',
'FEATURE',
'METRIC',
'EVENT',
'VIEW'
)),
CONSTRAINT kind_registry_status_chk
CHECK (status IN ('active', 'deprecated', 'draft', 'retired')),
CONSTRAINT kind_registry_version_chk
CHECK (version ~ '^[0-9]+_[0-9]+_[0-9]+$')
);
Seed insert:
INSERT INTO zar.kind_registry
(csi_kind, csi_kind_description, semantic_role, usage_notes, sort_order, status, version)
VALUES
('INPUT', 'Input schema or raw signal', 'Input-facing', 'Use for runtime or user/system-provided input values.', 10, 'active', '1_0_0'),
('OUTPUT', 'Output schema or derived signal', 'Output-facing', 'Use for computed or emitted result values.', 20, 'active', '1_0_0'),
('SIGNAL', 'Atomic reusable signal', 'Neutral semantic', 'Use sparingly when neither INPUT, OUTPUT, nor METRIC is the best fit.', 30, 'active', '1_0_0'),
('SCHEMA', 'JSON Schema or tabular schema reference', 'Structural', 'Use for schema-defining or schema-reference fields such as *_SCHEMA_REF.', 40, 'active', '1_0_0'),
('CONFIG', 'Configuration, registry metadata, or parameters', 'Configuration', 'Baseline kind for most registry and method-definition tables.', 50, 'active', '1_0_0'),
('FEATURE', 'Derived ML feature', 'ML feature', 'Use for engineered features intended for models or scoring.', 60, 'active', '1_0_0'),
('METRIC', 'Aggregated KPI or model output', 'Business metric', 'Use for KPIs, indexes, scores, and business-facing measurements.', 70, 'active', '1_0_0'),
('EVENT', 'System event or state-change record', 'Event-driven', 'Use for event logs, alerts, state transitions, and emitted event records.', 80, 'active', '1_0_0'),
('VIEW', 'Analytical or reporting view', 'Presentation', 'Use for read-models, marts, dashboards, and reporting-facing projections.', 90, 'active', '1_0_0');
This table also exists as a JSON file, zarathustra-kind-registry.json, which is used as an input to the Signal Classification Pipeline.
E.8.3 Table Prefix Dictionary (zar.table_prefix_registry)
| Prefix | Description |
|---|---|
| data_ | Legacy or raw general-purpose data |
| dim_ | Dimension tables (countries, units, sectors) |
| fact_ | Fact/event tables (emissions, indicators, executions) |
| ref_ | Reference data (EFDB, NACE, method registries) |
| stg_ | Staging tables (raw Excel/API imports) |
| int_ | Intermediate tables (engine merge outputs) |
| agg_ | Aggregated data (KPI rollups) |
| mrt_ | Data marts (domain-tailored outputs) |
| tmp_ | Temporary pipeline tables |
| rl_ | Relation tables (many-to-many joins) |
| eng_ | Engine outputs (computed results, scored outputs) |
| mod_ | Module-owned business objects (user-facing state) |
| sig_ | Signal registry tables (signal definitions in SSSR) |
Usage in the Signal Classification Pipeline:
table_prefix is used as a strong heuristic signal for:
- KIND baseline classification
- table semantic role inference
- validation consistency checks
Examples:
- ref_ → typically CONFIG or SCHEMA-heavy tables
- fact_ → often EVENT, SIGNAL, or METRIC
- eng_ → often OUTPUT, FEATURE, or METRIC
- mrt_ → often VIEW or METRIC
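The prefix heuristic can be sketched as a lookup from the leading token of a table name to an ordered list of KIND hints. The mapping below covers only the four examples just listed; the function names and the table name `ref_emission_factors` are hypothetical.

```python
# Ordered KIND hints per prefix, taken from the examples above (first entry = primary hint).
BASELINE_KIND_HINTS = {
    "ref_": ["CONFIG", "SCHEMA"],
    "fact_": ["EVENT", "SIGNAL", "METRIC"],
    "eng_": ["OUTPUT", "FEATURE", "METRIC"],
    "mrt_": ["VIEW", "METRIC"],
}


def table_prefix(table_name):
    """Return the registered prefix of a table name, or None if it has none."""
    head, sep, _rest = table_name.partition("_")
    prefix = head + sep
    return prefix if prefix in BASELINE_KIND_HINTS else None


def baseline_kind_hints(table_name):
    """Return KIND hints for the table's prefix; an empty list means no hint applies."""
    prefix = table_prefix(table_name)
    return list(BASELINE_KIND_HINTS[prefix]) if prefix else []


print(baseline_kind_hints("ref_emission_factors"))     # ['CONFIG', 'SCHEMA']
print(baseline_kind_hints("compute_method_registry"))  # []
```

Because the prefix is only a heuristic, an empty result should fall through to the column-level and datatype-level signals rather than fail the classification.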
zar.table_prefix_registry
CREATE TABLE zar.table_prefix_registry (
table_prefix TEXT PRIMARY KEY,
table_prefix_desc TEXT NOT NULL,
baseline_kind_hint TEXT[],
usage_notes TEXT,
sort_order INTEGER NOT NULL DEFAULT 100 CHECK (sort_order >= 0),
status TEXT NOT NULL DEFAULT 'active',
version TEXT NOT NULL DEFAULT '1_0_0',
source_doc_id TEXT,
approved_by TEXT,
notes TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
CONSTRAINT table_prefix_registry_prefix_chk
CHECK (table_prefix ~ '^[a-z]+_$'),
CONSTRAINT table_prefix_registry_status_chk
CHECK (status IN ('active', 'deprecated', 'draft', 'retired')),
CONSTRAINT table_prefix_registry_version_chk
CHECK (version ~ '^[0-9]+_[0-9]+_[0-9]+$')
);
Seed insert:
INSERT INTO zar.table_prefix_registry
(table_prefix, table_prefix_desc, baseline_kind_hint, usage_notes, sort_order, status, version)
VALUES
('data_', 'Legacy / Raw general', ARRAY['SIGNAL','CONFIG'], 'General-purpose raw or inherited data structures.', 10, 'active', '1_0_0'),
('dim_', 'Dimensions (Countries, Units, Sectors)', ARRAY['CONFIG'], 'Reference-like dimensional structures used for joins and classification.', 20, 'active', '1_0_0'),
('fact_', 'Facts (events) (Emissions, indicators, executions)', ARRAY['EVENT','SIGNAL','METRIC'], 'Fact-style records often contain runtime observations, events, or measured outputs.', 30, 'active', '1_0_0'),
('ref_', 'Reference data (EFDB, NACE, method registries)', ARRAY['CONFIG','SCHEMA'], 'Reference and registry tables, often configuration-heavy with schema references.', 40, 'active', '1_0_0'),
('stg_', 'Staging (Raw Excel / API imports)', ARRAY['INPUT','SIGNAL'], 'Landing-zone data pending normalization or transformation.', 50, 'active', '1_0_0'),
('int_', 'Intermediate (Engine merge outputs)', ARRAY['SIGNAL','OUTPUT'], 'Intermediate computation structures between raw and final outputs.', 60, 'active', '1_0_0'),
('agg_', 'Aggregates (KPI rollups)', ARRAY['METRIC'], 'Aggregated KPI or rollup outputs.', 70, 'active', '1_0_0'),
('mrt_', 'Data marts (Domain-tailored outputs)', ARRAY['VIEW','METRIC'], 'Domain-facing analytical outputs and reporting structures.', 80, 'active', '1_0_0'),
('tmp_', 'Temporary (Pipeline intermediates)', ARRAY['SIGNAL','CONFIG'], 'Ephemeral pipeline support structures.', 90, 'active', '1_0_0'),
('rl_', 'Pure join tables / Relations (Many-to-many links)', ARRAY['CONFIG'], 'Relationship and join support tables.', 100, 'active', '1_0_0'),
('eng_', 'Outputs produced by computation engines (algorithmic results, scored outputs)', ARRAY['OUTPUT','FEATURE','METRIC'], 'Engine-produced computed outputs.', 110, 'active', '1_0_0'),
('mod_', 'Module-owned output tables (business objects, user-facing module state)', ARRAY['CONFIG','VIEW'], 'Business-object or user-facing module state tables.', 120, 'active', '1_0_0'),
('sig_', 'Signals registry (Signal definitions)', ARRAY['SIGNAL','CONFIG'], 'Signal-definition and metadata registry structures.', 130, 'active', '1_0_0');
This table also exists as a JSON file, zarathustra-table-prefix-registry.json, which is used as an input to the Signal Classification Pipeline.
E.8.5 zar.component_module_map
This table stabilizes the entire Signal Classification Pipeline:
- MODULE_CODE app = deterministic lookup
CREATE TABLE zar.component_module_map (
component_id TEXT PRIMARY KEY,
module_code TEXT NOT NULL,
confidence NUMERIC(3,2) DEFAULT 1.00,
source TEXT DEFAULT 'manual',
notes TEXT,
sort_order INTEGER NOT NULL DEFAULT 100 CHECK (sort_order >= 0),
status TEXT NOT NULL DEFAULT 'active',
version TEXT NOT NULL DEFAULT '1_0_0',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
CONSTRAINT fk_component_module
FOREIGN KEY (module_code)
REFERENCES zar.module_registry (module_code),
CONSTRAINT component_module_status_chk
CHECK (status IN ('active', 'deprecated', 'draft'))
);
This table becomes the authoritative bridge between documentation (components) and runtime classification (modules).
Used by:
- Signal Classification Pipeline MODULE_CODE
- Validator
- ZARA explainability
- Auditors
Example seed entries:
INSERT INTO zar.component_module_map
(component_id, module_code, confidence, source, sort_order)
VALUES
('AIIL-CON', 'comp', 0.99, 'manual', 10),
('ZAR-FW', 'siss', 0.95, 'manual', 20),
('TG-CORE', 'vera', 0.99, 'manual', 30);
Test of the first real join:
SELECT
c.component_id,
c.module_code,
m.module_name
FROM zar.component_module_map c
JOIN zar.module_registry m
ON c.module_code = m.module_code;
Output:
component_id | module_code | module_name
--------------+-------------+--------------------------
AIIL-CON | comp | Computation Hub
ZAR-FW | siss | SIS
TG-CORE | vera | Verification & Assurance
(3 rows)
zar.documented_component_registry
The zar.documented_component_registry gives us:
- canonical component_id
- title
- source MDX path
- owner
- status
- stronger linkage for component_module_map
- future ZARA explainability lookup
CREATE TABLE zar.documented_component_registry (
component_id TEXT PRIMARY KEY,
component_title TEXT NOT NULL,
module_code TEXT NOT NULL,
source_doc_id TEXT,
source_file TEXT,
slug TEXT,
owner_team TEXT,
status TEXT NOT NULL DEFAULT 'active',
version TEXT NOT NULL DEFAULT '1_0_0',
sort_order INTEGER NOT NULL DEFAULT 100 CHECK (sort_order >= 0),
notes TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
CONSTRAINT documented_component_status_chk
CHECK (status IN ('active', 'deprecated', 'draft', 'retired')),
CONSTRAINT documented_component_version_chk
CHECK (version ~ '^[0-9]+_[0-9]+_[0-9]+$'),
CONSTRAINT fk_documented_component_module
FOREIGN KEY (module_code)
REFERENCES zar.module_registry (module_code)
);
Architecture
ZAR Governance Layer (v1)
| Layer | Table | Purpose |
|---|---|---|
| Module taxonomy | module_registry | System domains |
| Signal semantics | kind_registry | Field roles |
| Data structure | table_prefix_registry | Table meaning |
| Component mapping | component_module_map | System wiring |
Together, these tables power the Signal Classification Pipeline, CSI, and ZARA.
Example validation query
SELECT *
FROM zar.module_registry
WHERE module_code = 'comp';
E.8.4 Design Note
These dictionaries should be treated as:
- controlled vocabularies
- validation constraints in the Signal Classification Pipeline
- future candidates for formal ZAR registry tables
Over time, they should be promoted into:
- module_registry
- kind_registry
- table_type_registry
within ZAR for full governance and traceability.
E.9. JSON Evidence Record
Each processed column must produce a JSON evidence record.
Recommended structure:
{
"component_id": "AIIL-CON",
"table_name": "compute_method_registry",
"column_reference": "version",
"column_description": "Semantic version of the method implementation and schema (e.g., 1.0.0). Enables side-by-side versions.",
"data_type": "text",
"module_code": "comp",
"kind": "CONFIG",
"signal_name": "METHOD_VERSION",
"confidence_scores": {
"module_confidence": 0.91,
"kind_confidence": 0.97,
"signal_name_confidence": 0.92,
"total_score": 0.94
},
"review_reason": null,
"existing_similar_signals": [],
"datatype_normalization_notes": "No datatype normalization required. Source type 'text' retained.",
"naming_basis": [
"column_description indicates semantic version of method",
"generic VERSION expanded to domain-specific METHOD_VERSION",
"matches existing suffix conventions"
],
"needs_review": false,
"pre_version_key": "comp.AIIL-CON.CONFIG.METHOD_VERSION",
"suggested_csi_pattern": "comp.AIIL-CON.CONFIG.METHOD_VERSION.v<MAJOR>_<MINOR>",
"collision_check_result": "no_conflict",
"near_collision_result": [],
"generated_at": "2026-03-24T12:00:00Z",
"generator_version": "zarathustra-naming-0.1.0"
}
zarathustra-csi-proposals.json
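The pre_version_key and suggested_csi_pattern fields in the record above follow a visible dot-joined pattern. The sketch below reproduces that pattern from the example values only; the function names are hypothetical, and the actual generator may apply additional normalization.

```python
def pre_version_key(module_code, component_id, kind, signal_name):
    """Join the four classified parts into the pre-version key shown in the record."""
    return ".".join([module_code, component_id, kind, signal_name])


def suggested_csi_pattern(key):
    """Append the version placeholder used in the example record."""
    return f"{key}.v<MAJOR>_<MINOR>"


key = pre_version_key("comp", "AIIL-CON", "CONFIG", "METHOD_VERSION")
print(key)                         # comp.AIIL-CON.CONFIG.METHOD_VERSION
print(suggested_csi_pattern(key))  # comp.AIIL-CON.CONFIG.METHOD_VERSION.v<MAJOR>_<MINOR>
```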
E.10. Summary
The Signal Naming Governance Policy ensures that ZAYAZ generates MODULE_CODE, KIND, and signal_name in a disciplined, explainable, and reviewable manner before final CSI concatenation.
It exists to ensure:
- semantic consistency across the SSSR
- documentation-linked traceability
- reduced naming drift
- collision prevention
- auditable AI-assisted classification
APPENDIX F - Query Results - Tests
F.1. Inspect the latest view to confirm the new run outputs landed correctly
SELECT
row_id,
source_signal_id,
column_reference,
cleaned_data_type,
module_code,
kind,
signal_name,
pre_version_key,
is_valid
FROM zar.v_codex_signal_registry_latest
ORDER BY row_id;
 row_id | source_signal_id | column_reference     | cleaned_data_type | module_code | kind   | signal_name          | pre_version_key                           | is_valid
--------+------------------+----------------------+-------------------+-------------+--------+----------------------+-------------------------------------------+----------
      1 | sssr-000343      | method_id            | TEXT              | comp        | CONFIG | METHOD_ID            | comp.AIIL-CON.CONFIG.METHOD_ID            | t
      2 | sssr-000344      | method_name          | TEXT              | comp        | CONFIG | METHOD_NAME          | comp.AIIL-CON.CONFIG.METHOD_NAME          | t
      3 | sssr-000345      | version              | TEXT              | comp        | CONFIG | METHOD_VERSION       | comp.AIIL-CON.CONFIG.METHOD_VERSION       | t
      4 | sssr-000346      | status               | TEXT              | comp        | CONFIG | LIFECYCLE_STATUS     | comp.AIIL-CON.CONFIG.LIFECYCLE_STATUS     | t
      5 | sssr-000347      | method_type          | ENUM              | comp        | CONFIG | METHOD_TYPE          | comp.AIIL-CON.CONFIG.METHOD_TYPE          | t
      6 | sssr-000348      | description          | TEXT              | comp        | CONFIG | DESCRIPTION          | comp.AIIL-CON.CONFIG.DESCRIPTION          | t
      7 | sssr-000349      | inputs_schema_json   | TEXT              | comp        | SCHEMA | INPUTS_SCHEMA_REF    | comp.AIIL-CON.SCHEMA.INPUTS_SCHEMA_REF    | t
      8 | sssr-000350      | options_schema_json  | TEXT              | comp        | SCHEMA | OPTIONS_SCHEMA_REF   | comp.AIIL-CON.SCHEMA.OPTIONS_SCHEMA_REF   | t
      9 | sssr-000351      | output_schema_json   | TEXT              | comp        | SCHEMA | OUTPUT_SCHEMA_REF    | comp.AIIL-CON.SCHEMA.OUTPUT_SCHEMA_REF    | t
     10 | sssr-000352      | implementation_ref   | TEXT              | comp        | CONFIG | IMPLEMENTATION_REF   | comp.AIIL-CON.CONFIG.IMPLEMENTATION_REF   | t
     11 | sssr-000353      | micro_engine_ref     | TEXT              | comp        | CONFIG | MICRO_ENGINE_REF     | comp.AIIL-CON.CONFIG.MICRO_ENGINE_REF     | t
     12 | sssr-000354      | assumptions_json     | JSONB             | comp        | CONFIG | ASSUMPTIONS_JSON     | comp.AIIL-CON.CONFIG.ASSUMPTIONS_JSON     | t
     13 | sssr-000355      | framework_refs       | JSONB             | comp        | CONFIG | FRAMEWORK_REFS       | comp.AIIL-CON.CONFIG.FRAMEWORK_REFS       | t
     14 | sssr-000356      | dataset_requirements | JSONB             | comp        | CONFIG | DATASET_REQUIREMENTS | comp.AIIL-CON.CONFIG.DATASET_REQUIREMENTS | t
     15 | sssr-000357      | acl_tags             | JSONB             | comp        | CONFIG | ACL_TAGS             | comp.AIIL-CON.CONFIG.ACL_TAGS             | t
     16 | sssr-000358      | created_at           | TIMESTAMPTZ       | comp        | CONFIG | CREATED_AT           | comp.AIIL-CON.CONFIG.CREATED_AT           | t
     17 | sssr-000359      | updated_at           | TIMESTAMPTZ       | comp        | CONFIG | UPDATED_AT           | comp.AIIL-CON.CONFIG.UPDATED_AT           | t
F.2. Follow-up query
SELECT
row_id,
column_reference,
kind,
kind_confidence,
kind_rationale,
kind_needs_review,
kind_review_reason
FROM zar.v_codex_signal_registry_latest
ORDER BY row_id;
 row_id | column_reference     | kind   | kind_confidence | kind_rationale                                                                                                                                         | kind_needs_review | kind_review_reason
--------+----------------------+--------+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+--------------------
      1 | method_id            | CONFIG |           0.960 | Baseline classification from zar.table_prefix_registry primary_kind_hint=CONFIG for table_prefix=ref_. Secondary hints: ['SCHEMA']. trust_score=0.960. | f                 |
      2 | method_name          | CONFIG |           0.960 | Baseline classification from zar.table_prefix_registry primary_kind_hint=CONFIG for table_prefix=ref_. Secondary hints: ['SCHEMA']. trust_score=0.960. | f                 |
      3 | version              | CONFIG |           0.960 | Baseline classification from zar.table_prefix_registry primary_kind_hint=CONFIG for table_prefix=ref_. Secondary hints: ['SCHEMA']. trust_score=0.960. | f                 |
      4 | status               | CONFIG |           0.960 | Baseline classification from zar.table_prefix_registry primary_kind_hint=CONFIG for table_prefix=ref_. Secondary hints: ['SCHEMA']. trust_score=0.960. | f                 |
      5 | method_type          | CONFIG |           0.960 | Baseline classification from zar.table_prefix_registry primary_kind_hint=CONFIG for table_prefix=ref_. Secondary hints: ['SCHEMA']. trust_score=0.960. | f                 |
      6 | description          | CONFIG |           0.960 | Baseline classification from zar.table_prefix_registry primary_kind_hint=CONFIG for table_prefix=ref_. Secondary hints: ['SCHEMA']. trust_score=0.960. | f                 |
      7 | inputs_schema_json   | SCHEMA |           0.980 | Schema-reference field override.                                                                                                                       | f                 |
      8 | options_schema_json  | SCHEMA |           0.980 | Schema-reference field override.                                                                                                                       | f                 |
      9 | output_schema_json   | SCHEMA |           0.980 | Schema-reference field override.                                                                                                                       | f                 |
     10 | implementation_ref   | CONFIG |           0.960 | Baseline classification from zar.table_prefix_registry primary_kind_hint=CONFIG for table_prefix=ref_. Secondary hints: ['SCHEMA']. trust_score=0.960. | f                 |
     11 | micro_engine_ref     | CONFIG |           0.960 | Baseline classification from zar.table_prefix_registry primary_kind_hint=CONFIG for table_prefix=ref_. Secondary hints: ['SCHEMA']. trust_score=0.960. | f                 |
     12 | assumptions_json     | CONFIG |           0.960 | Baseline classification from zar.table_prefix_registry primary_kind_hint=CONFIG for table_prefix=ref_. Secondary hints: ['SCHEMA']. trust_score=0.960. | f                 |
     13 | framework_refs       | CONFIG |           0.960 | Baseline classification from zar.table_prefix_registry primary_kind_hint=CONFIG for table_prefix=ref_. Secondary hints: ['SCHEMA']. trust_score=0.960. | f                 |
     14 | dataset_requirements | CONFIG |           0.960 | Baseline classification from zar.table_prefix_registry primary_kind_hint=CONFIG for table_prefix=ref_. Secondary hints: ['SCHEMA']. trust_score=0.960. | f                 |
     15 | acl_tags             | CONFIG |           0.960 | Baseline classification from zar.table_prefix_registry primary_kind_hint=CONFIG for table_prefix=ref_. Secondary hints: ['SCHEMA']. trust_score=0.960. | f                 |
     16 | created_at           | CONFIG |           0.960 | Registry timestamp field classified as CONFIG.                                                                                                         | f                 |
     17 | updated_at           | CONFIG |           0.960 | Registry timestamp field classified as CONFIG.                                                                                                         | f                 |