SIG-REG-PL

ZAYAZ Signal Registry & Signal Classification Pipeline Architecture

Overview

This document describes the current architecture of the ZAYAZ Signal Registry System (SSSR) and its supporting Signal Classification Pipeline.

The system is designed to:

  • Transform raw schemas (e.g., SQL DDL) into structured signal intelligence
  • Enable deterministic + AI-assisted classification
  • Support any complex data platform, not limited to ESG
  • Provide full traceability from source → signal → CSI

1. Architecture Layers

1.1 Ingestion Layer

Purpose: Convert source schemas into structured signal rows.

Tables

  • codex_signal_registry_working

Description

  • Central working table for all extracted signals
  • One row per column from source tables
  • Acts as the backbone of the entire pipeline
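
As a rough illustration of the "one row per column" idea, the sketch below turns a simple CREATE TABLE statement into working-table rows. It is not the actual ddl_to_desc.py logic: the function name `ddl_to_rows`, the row keys (`source_table`, `column_name`, `raw_data_type`), and the simplified regex are all assumptions made for this example.

```python
import re

# Hypothetical, simplified parser: handles plain "CREATE TABLE name (col type, ...);"
# statements only. Constraints, quoted identifiers, and parameterised types such as
# NUMERIC(10,2) are out of scope for this sketch.
CREATE_RE = re.compile(
    r"CREATE\s+TABLE\s+(\w+)\s*\((.*?)\)\s*;", re.IGNORECASE | re.DOTALL
)

def ddl_to_rows(ddl: str) -> list:
    """Return one working-table row (a dict) per column found in the DDL."""
    rows = []
    for table_name, body in CREATE_RE.findall(ddl):
        for part in body.split(","):
            tokens = part.split()
            if len(tokens) < 2:
                continue  # skip empty or malformed fragments
            rows.append({
                "source_table": table_name,
                "column_name": tokens[0],
                "raw_data_type": tokens[1],
            })
    return rows

ddl = """
CREATE TABLE fact_emissions (
    site_id INTEGER,
    co2_tonnes NUMERIC,
    recorded_at TIMESTAMP
);
"""
rows = ddl_to_rows(ddl)
for row in rows:
    print(row)
```

Each resulting dict corresponds to one candidate row in codex_signal_registry_working; a production parser would need a real SQL grammar rather than a regex.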

1.2 Signal Classification Pipeline

Purpose: Enrich signals through staged classification.

Each stage writes to its own results table with full traceability via run_id.

Stages

Stage 1: Datatype Normalization

  • Table: codex_datatype_results
  • Output: cleaned_data_type

Stage 2: Module Classification

  • Table: codex_module_results
  • Output: module_code

Stage 3: CSI KIND Classification

  • Table: codex_kind_results
  • Output: kind (CSI KIND)

Note: This is CSI KIND, distinct from other KIND concepts (e.g., CMI KIND).

Stage 4: Signal Naming

  • Table: codex_signal_name_results
  • Output: signal_name

Stage 5: Validation & CSI Preparation

  • Table: codex_validator_results

  • Output:

    • pre_version_key
    • is_valid
    • suggested_csi_pattern

Run Tracking

  • Table: codex_automation_runs
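
The staged design can be sketched as a small runner that applies stages in order and stamps every result row with the same run_id. The stage functions here are toy placeholders, not the real classification rules, and `run_pipeline` and the row layout are assumptions for illustration only.

```python
def run_pipeline(signal: dict, stages: list, run_id: int):
    """Apply stages in order; return the enriched signal and per-stage result rows."""
    enriched = dict(signal)
    result_rows = []
    for table, stage_fn in stages:
        output = stage_fn(enriched)          # each stage returns its output fields
        enriched.update(output)
        # One row per stage, written to that stage's own results table.
        result_rows.append({"table": table, "run_id": run_id, **output})
    return enriched, result_rows

# Toy stand-ins for two of the five stages (the rules are hypothetical).
stages = [
    ("codex_datatype_results",
     lambda s: {"cleaned_data_type": s["raw_data_type"].lower()}),
    ("codex_kind_results",
     lambda s: {"kind": "METRIC" if s["source_table"].startswith("agg_") else "SIGNAL"}),
]

signal = {"source_table": "agg_monthly_totals", "column_name": "total",
          "raw_data_type": "NUMERIC"}
enriched, result_rows = run_pipeline(signal, stages, run_id=42)
print(enriched)
print(result_rows)
```

Because every row carries the run_id, any output value can be traced back to the run that produced it.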

1.3 Derived Views

v_codex_signal_registry_latest

  • Combines working table + latest results from all stages
  • Represents the current "best state" of each signal
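
A minimal sketch of the "latest result per signal" selection follows. It assumes that run_id increases with each automation run and that results are keyed by a `signal_id`; neither is stated in this document, and the `StageResult` type is hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StageResult:
    signal_id: int   # hypothetical key into the working table
    run_id: int      # assumed to increase with each automation run
    value: str       # the stage output, e.g. a module_code

def latest_per_signal(results: list) -> dict:
    """Keep only the result with the highest run_id for each signal."""
    latest = {}
    for r in results:
        current = latest.get(r.signal_id)
        if current is None or r.run_id > current.run_id:
            latest[r.signal_id] = r
    return latest

module_results = [
    StageResult(1, run_id=10, value="inpt"),
    StageResult(1, run_id=12, value="comp"),
    StageResult(2, run_id=11, value="risk"),
]
latest_modules = latest_per_signal(module_results)
print(latest_modules)
```

The view would apply the same selection to each stage's results table and join the outcomes back to the working table.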

v_codex_signal_registry_review_queue

  • Filters signals requiring manual review
  • Driven by confidence thresholds and validation flags
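
The review-queue filter might look like the sketch below. The threshold value, the `confidences` field, and the `needs_review` helper are assumptions; the document only says that confidence thresholds and validation flags drive the queue.

```python
REVIEW_THRESHOLD = 0.8  # hypothetical cut-off; the source does not state a value

def needs_review(signal: dict, threshold: float = REVIEW_THRESHOLD) -> bool:
    """Flag a signal when any stage confidence is low or validation failed."""
    low_confidence = any(c < threshold for c in signal["confidences"].values())
    return low_confidence or not signal["is_valid"]

signals = [
    {"signal_name": "site_id", "is_valid": True,
     "confidences": {"module": 0.95, "kind": 0.91}},
    {"signal_name": "co2_tonnes", "is_valid": True,
     "confidences": {"module": 0.62, "kind": 0.97}},
    {"signal_name": "recorded_at", "is_valid": False,
     "confidences": {"module": 0.99, "kind": 0.99}},
]
review_queue = [s["signal_name"] for s in signals if needs_review(s)]
print(review_queue)
```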

v_codex_signal_registry_export_ready

  • Final curated dataset
  • Ready for promotion to SSSR / CSI

2. Component & Module Architecture

2.1 Component Registry

documented_component_registry

  • Source of truth for all ZAYAZ components
  • Derived from documentation (MDX frontmatter)

Key fields:

  • component_id
  • component_title
  • md_document_link
  • module_code (primary association)

2.2 Module Registry

module_registry

Canonical modules (examples):

  • inpt (Input Hub)
  • comp (Computation Hub)
  • repo (Reports & Insights)
  • siss (Signal Intelligence System)
  • zara
  • zaam
  • risk
  • netz
  • vera
  • seel
  • acad

2.3 Module Alias Layer

module_alias_registry

Purpose: Map documentation/frontmatter identifiers to canonical module codes.

Example:

  • computation-hub → comp
  • sis → siss

This decouples:

  • Documentation vocabulary
  • System-level canonical codes
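
Alias resolution can be sketched as a two-step lookup: accept canonical codes directly, otherwise consult the alias table. The dictionaries below hold only a small sample, the `resolve_module` helper is hypothetical, and the alias entry assumes the example above maps the documentation identifier computation-hub to the canonical code comp.

```python
from typing import Optional

# Sample data only; the real registries live in module_registry and
# module_alias_registry.
MODULE_REGISTRY = {"inpt", "comp", "repo", "siss"}
MODULE_ALIASES = {"computation-hub": "comp"}

def resolve_module(identifier: str) -> Optional[str]:
    """Return the canonical module code, or None if the identifier is unknown."""
    key = identifier.strip().lower()
    if key in MODULE_REGISTRY:
        return key                      # already canonical
    return MODULE_ALIASES.get(key)      # documentation vocabulary, if mapped

print(resolve_module("Computation-Hub"))
print(resolve_module("siss"))
print(resolve_module("unknown-module"))
```

Returning None for unmapped identifiers lets the caller decide whether to reject the row or send it to review.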

2.4 Component ↔ Module Mapping

component_module_map

Purpose: Resolve component-to-module relationships.

Supports:

  • Multi-module components
  • Canonical mapping
  • Frontmatter preservation

Key fields:

  • primary_module_code
  • module_codes[]
  • omr_module_ids[]

component_omr_module_map

Purpose: Fully normalized many-to-many mapping.

  • One row per (component, module alias)
  • Used for traceability and validation

component_module_map_staging

  • Raw ingestion from documentation exports
  • Never used directly in production logic

3. Lookup & Classification Systems

3.1 Table Prefix Registry

table_prefix_registry

Defines dataset semantics:

Examples:

  • ref_ → Reference data
  • fact_ → Events
  • dim_ → Dimensions
  • agg_ → Metrics

Also provides:

  • primary_kind_hint
  • secondary_kind_hints

Used directly by the classification pipeline.
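
A prefix lookup of this kind can be sketched as follows. The semantics column comes from the examples above; the `primary_kind_hint` values shown (EVENT for fact_, METRIC for agg_) are assumptions for illustration, and the hints for ref_ and dim_ are left unset because the document does not give them.

```python
from typing import Optional

# Sample registry; hint values are illustrative assumptions.
PREFIX_REGISTRY = {
    "ref_":  {"semantics": "Reference data", "primary_kind_hint": None},
    "fact_": {"semantics": "Events",         "primary_kind_hint": "EVENT"},
    "dim_":  {"semantics": "Dimensions",     "primary_kind_hint": None},
    "agg_":  {"semantics": "Metrics",        "primary_kind_hint": "METRIC"},
}

def lookup_prefix(table_name: str) -> Optional[dict]:
    """Return the registry entry whose prefix matches the table name, if any."""
    for prefix, entry in PREFIX_REGISTRY.items():
        if table_name.startswith(prefix):
            return entry
    return None

print(lookup_prefix("fact_emissions"))
print(lookup_prefix("users"))
```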


3.2 CSI KIND Registry

kind_registry

Defines CSI semantic categories:

  • INPUT
  • OUTPUT
  • SIGNAL
  • SCHEMA
  • CONFIG
  • FEATURE
  • METRIC
  • EVENT
  • VIEW

Note: This registry governs CSI KIND, not CMI KIND.


3.3 Table Registry

table_registry

Purpose: Central registry of all datasets/tables used across the ZAYAZ platform.

Stores:

  • Dataset definitions
  • Storage configuration
  • Governance metadata
  • Documentation links

This table acts as a foundational metadata layer and is actively used by other systems, including:

  • Signal registry (SSSR)
  • Classification pipeline
  • Documentation system

It provides table-level context that is picked up and reused by the signal registry, ensuring consistency between:

  • Physical data structures
  • Signal definitions
  • Documentation links

This table is the bridge between:

  • Physical data
  • Signal registry
  • Documentation

3.4 Signal Registry (SSSR) Model

Definition

The Signal Registry System (SSSR) is the canonical layer for representing structured signals derived from source datasets.

A signal is defined as:

A semantically enriched, uniquely identifiable representation of a data field, including its context, meaning, and classification within the ZAYAZ system.

| Concept | Description |
| --- | --- |
| Column | Physical field in a dataset (raw schema level) |
| Signal | Enriched representation of a column with semantic meaning |
| CSI | Standardized identity for a signal across systems |
| Metric | Aggregated or computed signal |

Signal Lifecycle

  1. Source Column

    • Extracted from schema (DDL / API / dataset)
  2. Working Signal

    • Stored in codex_signal_registry_working
    • Contains raw metadata
  3. Classified Signal

    • Enriched through pipeline stages:

      • datatype
      • module
      • CSI KIND
      • naming
      • validation
  4. Validated Signal (Pre-CSI)

    • Assigned pre_version_key
    • Checked for uniqueness and consistency
  5. CSI Signal

    • Promoted to canonical identity layer
    • Versioned and governed
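
The lifecycle above can be modelled as an ordered set of states with a single forward transition. The enum names and the `advance` helper are hypothetical; the sketch ignores rejection or rollback paths, which this document does not describe.

```python
from enum import IntEnum

class SignalState(IntEnum):
    SOURCE_COLUMN = 1
    WORKING = 2
    CLASSIFIED = 3
    VALIDATED = 4   # "Validated Signal (Pre-CSI)"
    CSI = 5

def advance(state: SignalState) -> SignalState:
    """Move a signal to the next lifecycle state; CSI is terminal."""
    if state is SignalState.CSI:
        raise ValueError("CSI is the final lifecycle state")
    return SignalState(state + 1)

state = SignalState.SOURCE_COLUMN
while state is not SignalState.CSI:
    state = advance(state)
    print(state.name)
```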

Relationship to Other Systems

table_registry

  • Provides dataset-level metadata
  • Defines context for signals
  • Signals inherit semantic meaning from their parent dataset

Component & Module System

  • Signals are linked to components via component_name
  • Modules are resolved via component_module_map
  • Enables system-level grouping and governance

CSI (Canonical Signal Identity)

  • Provides globally unique signal identity
  • Ensures cross-system consistency
  • Enables reuse and interoperability

Future: USO Integration

  • Signals will be extended with:

    • Unit
    • Scope
    • Object
  • Introduces an additional classification layer beyond CSI

Key Properties of a Signal

  • Traceable: Linked to source column and dataset
  • Classified: Assigned module, CSI KIND, and semantic name
  • Versioned: Evolves through CSI versioning
  • Reusable: Can be referenced across systems and modules
  • Governed: Subject to validation and review workflows

Design Principles

  • Signals are not tied to ESG — fully domain-agnostic
  • Signals are first-class entities, not just columns
  • Signals enable semantic interoperability across datasets
  • Signals are the foundation for analytics, AI, and reporting layers

4. Pipeline Flow

SQL → ddl_to_desc.py → codex_signal_registry_working

→ Datatype → Module → CSI KIND → Signal Name → Validator

→ v_codex_signal_registry_latest
→ review_queue
→ export_ready

→ SSSR → CSI

5. Design Principles

5.1 Separation of Concerns

  • Raw ingestion vs classification vs validation
  • Staging vs canonical vs derived

5.2 Traceability

  • Every transformation tracked via run_id
  • Full lineage from source column to final CSI

5.3 Deterministic + AI Hybrid

  • Rule-based baseline (prefix, schema, naming)
  • AI augmentation (future phases)

5.4 Multi-Module Awareness

  • Components can belong to multiple modules
  • Primary module used for deterministic classification

5.5 Documentation-Driven Architecture

  • MDX frontmatter → database
  • Database → documentation (bi-directional potential)

6. Next Phase (Planned)

6.1 Full SSSR Table

  • Promote export-ready signals to canonical registry
  • Add versioning + governance fields

6.2 USO Integration

  • Add Unit, Scope, Object classification

6.3 CSI Versioning Engine

  • Automatic version increments
  • Change detection
  • Backward compatibility

6.4 Validation Framework

  • Cross-signal validation
  • Schema enforcement
  • Data integrity rules

7. Summary

The current ZAYAZ architecture is a signal intelligence system that is:

  • Modular
  • Scalable
  • Traceable
  • Domain-agnostic

It is now ready for:

  • Full SSSR expansion
  • USO enrichment
  • Production-grade automation


11. CMI Naming Policy

11.1 Purpose

This chapter defines how Canonical Managed Identifiers (CMIs) are named, validated, proposed, and governed within ZAR.

It exists to ensure that artifact identity is:

  • deterministic
  • documentation-linked
  • reusable across versions
  • explainable to humans and AI systems
  • enforceable in registry and CI workflows

The CMI naming policy governs:

  • module_code
  • component_id
  • cmi_kind
  • artifact_name
  • semantic versioning for artifacts

11.2 Canonical CMI Format

<module_code>.<COMPONENT_ID>.<CMI_KIND>.<ARTIFACT_NAME>.<MAJOR>_<MINOR>_<PATCH>

Examples:

comp.PEF-ME.ENGINE.CORE.1_0_0
vera.TG-CORE.ENGINE.VALIDATOR.1_0_0
comp.AIIL-CON.SCHEMA.INPUTS.1_0_0
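
A CMI string in this format can be parsed with a single regular expression. The sketch below is one possible approach, not the registry's actual validator: the exact character classes (for example, whether digits are allowed in COMPONENT_ID) are assumptions, and `parse_cmi` and `format_cmi` are hypothetical helper names.

```python
import re
from typing import NamedTuple

# Assumed character rules: lowercase module code, uppercase hyphenated
# component id, uppercase kind, uppercase artifact name with underscores.
CMI_RE = re.compile(
    r"^(?P<module_code>[a-z]+)\."
    r"(?P<component_id>[A-Z0-9]+(?:-[A-Z0-9]+)*)\."
    r"(?P<cmi_kind>[A-Z]+)\."
    r"(?P<artifact_name>[A-Z][A-Z_]*)\."
    r"(?P<major>\d+)_(?P<minor>\d+)_(?P<patch>\d+)$"
)

class CMI(NamedTuple):
    module_code: str
    component_id: str
    cmi_kind: str
    artifact_name: str
    version: tuple  # (major, minor, patch)

def parse_cmi(text: str) -> CMI:
    m = CMI_RE.match(text)
    if m is None:
        raise ValueError(f"not a valid CMI: {text!r}")
    version = (int(m["major"]), int(m["minor"]), int(m["patch"]))
    return CMI(m["module_code"], m["component_id"], m["cmi_kind"],
               m["artifact_name"], version)

def format_cmi(cmi: CMI) -> str:
    version = "_".join(str(n) for n in cmi.version)
    return ".".join([cmi.module_code, cmi.component_id, cmi.cmi_kind,
                     cmi.artifact_name, version])

parsed = parse_cmi("vera.TG-CORE.ENGINE.VALIDATOR.1_0_0")
print(parsed)
print(format_cmi(parsed))
```

Parsing only checks the shape of the identifier; whether each segment exists in its registry is a separate check.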

11.3 Segment Rules

A. module_code

  • must be a valid canonical module code from zar.module_registry
  • lowercase
  • examples: comp, vera, siss, repo

B. COMPONENT_ID

  • must match a valid documented component in zar.documented_component_registry
  • uppercase frontmatter-style identifier
  • examples: PEF-ME, TG-CORE, AIIL-CON

C. CMI_KIND

  • must exist in zar.cmi_kind_registry
  • identifies what category of artifact this is
  • examples: ENGINE, SCHEMA, RULESET, MODEL

D. ARTIFACT_NAME

  • must exist in zar.cmi_artifact_name_registry for the chosen CMI_KIND
  • describes the canonical functional role of the artifact within the component
  • examples: CORE, VALIDATOR, INPUTS, ROUTING

E. VERSION

  • must use semantic artifact versioning in the form MAJOR_MINOR_PATCH
  • examples: 1_0_0, 2_1_3
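
Version increments on the underscore-separated form can be sketched as below. Resetting the lower segments on a MAJOR or MINOR bump follows the usual semantic-versioning convention and is an assumption here; the `bump_version` helper is hypothetical.

```python
def bump_version(version: str, part: str) -> str:
    """Increment one segment of a MAJOR_MINOR_PATCH version string."""
    major, minor, patch = (int(x) for x in version.split("_"))
    if part == "major":
        return f"{major + 1}_0_0"
    if part == "minor":
        return f"{major}_{minor + 1}_0"
    if part == "patch":
        return f"{major}_{minor}_{patch + 1}"
    raise ValueError(f"unknown version part: {part!r}")

print(bump_version("1_0_0", "patch"))
print(bump_version("2_1_3", "minor"))
print(bump_version("2_1_3", "major"))
```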

11.4 Core Principle for CMI_KIND

CMI_KIND classifies what type of artifact the registry entry represents.

It must describe the artifact category, not:

  • the programming language
  • the file name
  • the deployment environment
  • the team name

Examples:

  • executable processing logic → ENGINE
  • JSON schema or contract → SCHEMA
  • policy bundle or routing rules → RULESET
  • trained model package → MODEL

11.5 Core Principle for ARTIFACT_NAME

ARTIFACT_NAME defines the canonical functional role of the artifact inside the component.

It must be:

  • semantic
  • stable over time
  • implementation-neutral
  • chosen from the controlled vocabulary

ARTIFACT_NAME must not be derived from:

  • file names such as main.py or index.ts
  • frameworks such as fastapi or react
  • version labels such as v2
  • temporary developer naming

Good examples:

  • CORE
  • VALIDATOR
  • CALCULATOR
  • INPUTS
  • OUTPUT
  • ROUTING

Bad examples:

  • MAIN_PY
  • FASTAPI_HANDLER
  • VALIDATOR_V2
  • OUTPUT_SCHEMA_FINAL

11.6 Decision Hierarchy for Choosing CMI_KIND

When registering a new artifact, classify in this order:

  1. What is the artifact primarily?

    • execution logic → ENGINE
    • structure/contract → SCHEMA
    • standalone operational script → SCRIPT
    • policy/rules → RULESET
    • external system adapter → CONNECTOR
    • trained ML artifact → MODEL
    • front-end interaction component → UI
    • visual reporting surface → DASHBOARD
    • orchestrated workflow → JOB
    • shared package → LIB
    • test harness or QA artifact → TEST
  2. What role does it play in the platform?

    • prefer the most semantically precise CMI_KIND
  3. If two kinds are plausible:

    • choose the kind that best represents the artifact's primary governance and runtime identity
    • flag for review if ambiguity remains
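
The first step of this hierarchy is essentially a lookup, which the sketch below encodes. It assumes the caller lists the artifact's natures in order of importance, takes the first match as the primary kind, and flags the result for review when more than one kind is plausible; `choose_cmi_kind` is a hypothetical helper.

```python
# Mapping taken from step 1 of the hierarchy above.
PRIMARY_NATURE_TO_KIND = {
    "execution logic": "ENGINE",
    "structure/contract": "SCHEMA",
    "standalone operational script": "SCRIPT",
    "policy/rules": "RULESET",
    "external system adapter": "CONNECTOR",
    "trained ML artifact": "MODEL",
    "front-end interaction component": "UI",
    "visual reporting surface": "DASHBOARD",
    "orchestrated workflow": "JOB",
    "shared package": "LIB",
    "test harness or QA artifact": "TEST",
}

def choose_cmi_kind(natures: list):
    """Return (cmi_kind, needs_review) for natures listed most important first."""
    kinds = [PRIMARY_NATURE_TO_KIND[n] for n in natures if n in PRIMARY_NATURE_TO_KIND]
    unique = list(dict.fromkeys(kinds))   # de-duplicate, keep order
    if not unique:
        return None, True                 # nothing matched: needs a human
    return unique[0], len(unique) > 1     # ambiguity remains if several kinds fit

print(choose_cmi_kind(["execution logic"]))
print(choose_cmi_kind(["policy/rules", "execution logic"]))
```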

11.7 Decision Hierarchy for Choosing ARTIFACT_NAME

  1. Choose CMI_KIND first

  2. Identify the artifact's functional role within the component

  3. Select the approved ARTIFACT_NAME from zar.cmi_artifact_name_registry

  4. If no approved name exists:

    • propose a new controlled-vocabulary entry
    • do not improvise permanent naming in production registry rows

Examples:

ENGINE

  • main runtime logic → CORE
  • deterministic computation → CALCULATOR
  • validation logic → VALIDATOR
  • scoring logic → SCORER
  • transformation logic → TRANSFORMER
  • normalization logic → NORMALIZER

SCHEMA

  • input contract → INPUTS
  • output contract → OUTPUT
  • configurable options → OPTIONS
  • API request contract → REQUEST
  • API response contract → RESPONSE

RULESET

  • decision rules → DECISION
  • routing rules → ROUTING
  • validation rules → VALIDATION
  • classification rules → CLASSIFIER
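
Selecting a name from the controlled vocabulary can be sketched as a membership test on (artifact_name, cmi_kind) pairs, with a proposal status when no approved entry exists. The in-memory set holds only a sample of the vocabulary, and the `select_artifact_name` helper and its status labels are assumptions.

```python
# Sample of approved (artifact_name, cmi_kind) pairs.
ARTIFACT_NAME_REGISTRY = {
    ("CORE", "ENGINE"), ("CALCULATOR", "ENGINE"), ("VALIDATOR", "ENGINE"),
    ("INPUTS", "SCHEMA"), ("OUTPUT", "SCHEMA"),
    ("ROUTING", "RULESET"), ("DECISION", "RULESET"),
}

def select_artifact_name(cmi_kind: str, candidate: str) -> dict:
    """Approve a registered name, otherwise mark it as needing a proposal."""
    if (candidate, cmi_kind) in ARTIFACT_NAME_REGISTRY:
        return {"status": "approved", "artifact_name": candidate}
    # Not in the vocabulary for this kind: propose, do not improvise.
    return {"status": "proposal_required", "artifact_name": candidate}

print(select_artifact_name("ENGINE", "CORE"))
print(select_artifact_name("SCHEMA", "CORE"))
```

Note that a name approved for one CMI_KIND is not automatically valid for another, which is why the lookup uses the pair.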

11.8 Completeness Model

A component may contain multiple artifacts of the same CMI_KIND.

Examples:

comp.PEF-ME.ENGINE.CORE.1_0_0
comp.PEF-ME.ENGINE.CALCULATOR.1_0_0
comp.PEF-ME.ENGINE.VALIDATOR.1_0_0
comp.PEF-ME.SCHEMA.INPUTS.1_0_0
comp.PEF-ME.SCHEMA.OUTPUT.1_0_0

A CMI registry entry is therefore unique at the following grain:

(module_code, component_id, cmi_kind, artifact_name, version)

Not at:

(module_code, component_id, cmi_kind)
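
The difference between the two grains can be shown with a small in-memory registry keyed on the full five-part tuple. The `CmiRegistry` class is a hypothetical stand-in for a unique constraint on the registry table.

```python
class CmiRegistry:
    """Toy registry enforcing uniqueness at the full CMI grain."""

    def __init__(self):
        self._rows = set()

    def register(self, module_code, component_id, cmi_kind, artifact_name, version):
        key = (module_code, component_id, cmi_kind, artifact_name, version)
        if key in self._rows:
            raise ValueError(f"duplicate CMI: {key}")
        self._rows.add(key)

    def __len__(self):
        return len(self._rows)

registry = CmiRegistry()
# Same module, component, and kind, but different artifact names: both allowed.
registry.register("comp", "PEF-ME", "ENGINE", "CORE", "1_0_0")
registry.register("comp", "PEF-ME", "ENGINE", "CALCULATOR", "1_0_0")
print(len(registry))
```

Keying on (module_code, component_id, cmi_kind) alone would have rejected the second row, which is exactly what the completeness model rules out.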


11.9 CMI Completeness Validator

The CMI Completeness Validator should work in two layers.

Layer 1 — Deterministic baseline

Use:

  • component_id
  • component_title
  • md_document_link
  • controlled CMI_KIND vocabulary
  • controlled ARTIFACT_NAME vocabulary
  • registry rules for minimum expected artifact sets

This layer should check whether the component has the minimum required artifacts for its apparent role.

Examples of baseline expectations:

  • most executable components should have at least one ENGINE.CORE
  • schema-driven components often require SCHEMA.INPUTS and/or SCHEMA.OUTPUT
  • rule-heavy components may require RULESET.ROUTING, RULESET.DECISION, or RULESET.VALIDATION
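
A deterministic baseline check along these lines could be sketched as follows. The role labels (`executable`, `schema_driven`, `rule_heavy`), the rule structure, and the `missing_groups` helper are assumptions; each rule is modelled as a list of groups where at least one artifact from every group must be present.

```python
# Hypothetical minimum artifact sets per apparent component role.
COMPLETENESS_RULES = {
    "executable": [{"ENGINE.CORE"}],
    "schema_driven": [{"SCHEMA.INPUTS", "SCHEMA.OUTPUT"}],
    "rule_heavy": [{"RULESET.ROUTING", "RULESET.DECISION", "RULESET.VALIDATION"}],
}

def missing_groups(role: str, registered: set) -> list:
    """Return the rule groups for which no registered artifact is present.

    `registered` is a set of (cmi_kind, artifact_name) pairs.
    """
    present = {f"{kind}.{name}" for kind, name in registered}
    return [sorted(group)
            for group in COMPLETENESS_RULES.get(role, [])
            if not group & present]

print(missing_groups("executable", {("ENGINE", "CALCULATOR")}))
print(missing_groups("schema_driven", {("SCHEMA", "INPUTS")}))
```

An empty list means the baseline is satisfied; a non-empty list names the artifact groups that a later proposal step could suggest filling.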

Layer 2 — AI-assisted proposal

AI can be used after the deterministic baseline to:

  • read the MDX/component specification
  • infer likely missing artifacts
  • propose candidate CMI_KIND + ARTIFACT_NAME combinations
  • explain why they are proposed
  • flag ambiguous cases for human review

Recommended model:

  • deterministic registry validation first
  • AI-assisted proposal second
  • human approval where needed

This prevents uncontrolled invention while still allowing the system to scale.


11.10 Governance Rules

A CMI record must not be created unless:

  • module_code exists in zar.module_registry
  • component_id exists in zar.documented_component_registry
  • cmi_kind exists in zar.cmi_kind_registry
  • (artifact_name, cmi_kind) exists in zar.cmi_artifact_name_registry
  • the full CMI is unique for its version

A new artifact name must be added to the controlled vocabulary before production use.

A semantic artifact change should trigger a version increment, not silent overwrite.
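
The creation rules above can be expressed as a pre-insert check that returns every violation it finds. The registries are modelled as in-memory sets with sample contents, and `governance_violations` is a hypothetical helper rather than an existing ZAR function.

```python
FIELDS = ("module_code", "component_id", "cmi_kind", "artifact_name", "version")

def governance_violations(record: dict, registries: dict, existing_cmis: set) -> list:
    """Return a list of rule violations; an empty list means the record may be created."""
    violations = []
    if record["module_code"] not in registries["module_registry"]:
        violations.append("unknown module_code")
    if record["component_id"] not in registries["documented_component_registry"]:
        violations.append("unknown component_id")
    if record["cmi_kind"] not in registries["cmi_kind_registry"]:
        violations.append("unknown cmi_kind")
    pair = (record["artifact_name"], record["cmi_kind"])
    if pair not in registries["cmi_artifact_name_registry"]:
        violations.append("artifact_name not approved for cmi_kind")
    if tuple(record[f] for f in FIELDS) in existing_cmis:
        violations.append("duplicate CMI for this version")
    return violations

registries = {  # sample contents only
    "module_registry": {"comp", "vera"},
    "documented_component_registry": {"PEF-ME", "TG-CORE"},
    "cmi_kind_registry": {"ENGINE", "SCHEMA"},
    "cmi_artifact_name_registry": {("CORE", "ENGINE"), ("INPUTS", "SCHEMA")},
}
existing = {("comp", "PEF-ME", "ENGINE", "CORE", "1_0_0")}

new_record = {"module_code": "comp", "component_id": "PEF-ME", "cmi_kind": "ENGINE",
              "artifact_name": "CORE", "version": "1_0_1"}
print(governance_violations(new_record, registries, existing))
```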


11.11 Strategic Outcome

With controlled CMI_KIND, controlled ARTIFACT_NAME, and a completeness validator, ZAR gains:

  • deterministic artifact identity
  • stronger lineage precision
  • explainable runtime provenance
  • reusable artifact patterns across components
  • governance-ready validation for future automation

This makes the CMI layer as disciplined and scalable as the CSI layer.


APPENDIX A - The zar.cmi_artifact_name_registry

The Controlled Vocabulary (v1 Seed)

ENGINE

('CORE', 'ENGINE', 'Primary execution engine of the component', 'Execution', 'Main runtime logic'),
('CALCULATOR', 'ENGINE', 'Performs deterministic calculations', 'Execution', 'Used for formula-based processing'),
('VALIDATOR', 'ENGINE', 'Validates inputs or outputs', 'Validation', 'Rule-based or logic-based validation'),
('SCORER', 'ENGINE', 'Computes scores or indices', 'Scoring', 'Used for trust scores, risk scores, etc.'),
('TRANSFORMER', 'ENGINE', 'Transforms data structures', 'Transformation', 'Mapping or reshaping data'),
('AGGREGATOR', 'ENGINE', 'Aggregates multiple inputs', 'Aggregation', 'Rollups, groupings, summaries'),
('ENRICHER', 'ENGINE', 'Adds additional data or context', 'Enrichment', 'Joins, external lookups'),
('ROUTER', 'ENGINE', 'Routes signals or tasks', 'Orchestration', 'ZSSR-like routing logic'),
('PROCESSOR', 'ENGINE', 'Generic processing unit', 'Execution', 'Fallback when role is broad'),
('NORMALIZER', 'ENGINE', 'Normalizes data formats or values', 'Transformation', 'Standardization logic')

SCHEMA

('INPUTS', 'SCHEMA', 'Defines input structure', 'Structural', 'Used for input schemas'),
('OUTPUT', 'SCHEMA', 'Defines output structure', 'Structural', 'Used for output schemas'),
('OPTIONS', 'SCHEMA', 'Defines configuration options', 'Structural', 'Optional parameters'),
('PAYLOAD', 'SCHEMA', 'Defines message payload', 'Structural', 'Event/message structure'),
('CONTRACT', 'SCHEMA', 'Formal data contract', 'Structural', 'Interface definition'),
('RESPONSE', 'SCHEMA', 'Response schema', 'Structural', 'API responses'),
('REQUEST', 'SCHEMA', 'Request schema', 'Structural', 'API inputs')

SCRIPT

('ETL', 'SCRIPT', 'Extract-transform-load script', 'Data Processing', 'Batch pipelines'),
('MIGRATION', 'SCRIPT', 'Schema or data migration script', 'Maintenance', 'DB migrations'),
('SEED', 'SCRIPT', 'Seed data script', 'Initialization', 'Initial data population'),
('IMPORT', 'SCRIPT', 'Data import script', 'Ingestion', 'External ingestion'),
('EXPORT', 'SCRIPT', 'Data export script', 'Extraction', 'Data export jobs')

RULESET

('DECISION', 'RULESET', 'Decision logic ruleset', 'Decisioning', 'Decision trees, policies'),
('VALIDATION', 'RULESET', 'Validation rules', 'Validation', 'Constraint enforcement'),
('ROUTING', 'RULESET', 'Routing rules', 'Orchestration', 'Signal routing'),
('POLICY', 'RULESET', 'Policy definitions', 'Governance', 'Compliance rules'),
('CLASSIFIER', 'RULESET', 'Classification logic', 'Classification', 'Categorization rules')

CONNECTOR

('API', 'CONNECTOR', 'External API connector', 'Integration', 'Generic API'),
('SAP', 'CONNECTOR', 'SAP integration', 'Integration', 'ERP integration'),
('FILE_INGEST', 'CONNECTOR', 'File ingestion connector', 'Integration', 'CSV, Excel, etc.'),
('DATABASE', 'CONNECTOR', 'Database connector', 'Integration', 'SQL/NoSQL'),
('STREAM', 'CONNECTOR', 'Streaming connector', 'Integration', 'Kafka, etc.')

MODEL

('PREDICTOR', 'MODEL', 'Prediction model', 'ML', 'Forecasting'),
('CLASSIFIER', 'MODEL', 'Classification model', 'ML', 'Categorization'),
('SCORING', 'MODEL', 'Scoring model', 'ML', 'Risk/trust scoring'),
('EMBEDDER', 'MODEL', 'Embedding model', 'ML', 'Vector generation'),
('ANOMALY', 'MODEL', 'Anomaly detection model', 'ML', 'Outlier detection')

UI

('FORM', 'UI', 'Input form', 'Presentation', 'User data entry'),
('DASHBOARD', 'UI', 'UI dashboard component', 'Presentation', 'Interactive UI'),
('REVIEW', 'UI', 'Review interface', 'Presentation', 'Human validation'),
('WIDGET', 'UI', 'Reusable UI widget', 'Presentation', 'Componentized UI')

DASHBOARD

('ESRS_OVERVIEW', 'DASHBOARD', 'ESRS overview dashboard', 'Reporting', 'Regulatory reporting'),
('KPI_MONITOR', 'DASHBOARD', 'KPI monitoring dashboard', 'Reporting', 'Performance tracking'),
('EXEC_SUMMARY', 'DASHBOARD', 'Executive summary dashboard', 'Reporting', 'Management view')

JOB

('PIPELINE', 'JOB', 'Orchestrated pipeline', 'Orchestration', 'End-to-end flow'),
('SCHEDULER', 'JOB', 'Scheduled job', 'Orchestration', 'Cron-like execution'),
('BATCH', 'JOB', 'Batch job', 'Processing', 'Bulk processing'),
('REPROCESS', 'JOB', 'Reprocessing job', 'Recovery', 'Replay jobs')

LIB

('CORE', 'LIB', 'Core shared logic', 'Shared', 'Core logic reuse'),
('VALIDATION', 'LIB', 'Validation library', 'Shared', 'Reusable validation'),
('FORMAT', 'LIB', 'Formatting utilities', 'Shared', 'Data formatting')

TEST

('UNIT', 'TEST', 'Unit tests', 'QA', 'Low-level tests'),
('INTEGRATION', 'TEST', 'Integration tests', 'QA', 'System tests'),
('REGRESSION', 'TEST', 'Regression test suite', 'QA', 'Prevent regressions'),
('PERFORMANCE', 'TEST', 'Performance tests', 'QA', 'Load/stress testing')

APPENDIX B - ZAR Tables:

| Schema | Name | Type | Owner | Persistence | Access method | Size | Description |
| --- | --- | --- | --- | --- | --- | --- | --- |
| zar | cmi_artifact_name_registry | table | zar_admin | permanent | heap | 48 kB | |
| zar | cmi_completeness_rule_registry | table | zar_admin | permanent | heap | 16 kB | |
| zar | cmi_kind_registry | table | zar_admin | permanent | heap | 16 kB | |
| zar | cmi_proposal_results | table | zar_admin | permanent | heap | 320 kB | |
| zar | cmi_registry | table | zar_admin | permanent | heap | 8192 bytes | |
| zar | codex_automation_runs | table | zar_admin | permanent | heap | 16 kB | |
| zar | codex_datatype_results | table | zar_admin | permanent | heap | 16 kB | |
| zar | codex_kind_results | table | zar_admin | permanent | heap | 16 kB | |
| zar | codex_module_results | table | zar_admin | permanent | heap | 16 kB | |
| zar | codex_signal_name_results | table | zar_admin | permanent | heap | 16 kB | |
| zar | codex_signal_registry_working | table | zar_admin | permanent | heap | 16 kB | |
| zar | codex_validator_results | table | zar_admin | permanent | heap | 16 kB | |
| zar | component_module_map | table | zar_admin | permanent | heap | 160 kB | |
| zar | component_module_map_staging | table | zar_admin | permanent | heap | 192 kB | |
| zar | component_omr_module_map | table | zar_admin | permanent | heap | 120 kB | |
| zar | documented_component_registry | table | zar_admin | permanent | heap | 120 kB | |
| zar | documented_component_staging | table | zar_admin | permanent | heap | 8192 bytes | |
| zar | kind_registry | table | zar_admin | permanent | heap | 16 kB | |
| zar | module_alias_registry | table | zar_admin | permanent | heap | 16 kB | |
| zar | module_registry | table | zar_admin | permanent | heap | 16 kB | |
| zar | table_prefix_registry | table | zar_admin | permanent | heap | 48 kB | |
| zar | table_registry | table | zar_admin | permanent | heap | 8192 bytes | |


