SIG-REG-PL
ZAYAZ Signal Registry & Signal Classification Pipeline Architecture
Overview
This document describes the current architecture of the ZAYAZ Signal Registry System (SSSR) and its supporting Signal Classification Pipeline.
The system is designed to:
- Transform raw schemas (e.g., SQL DDL) into structured signal intelligence
- Enable deterministic + AI-assisted classification
- Support any complex data platform, not limited to ESG
- Provide full traceability from source → signal → CSI
1. Architecture Layers
1.1 Ingestion Layer
Purpose: Convert source schemas into structured signal rows.
Tables
codex_signal_registry_working
Description
- Central working table for all extracted signals
- One row per column from source tables
- Acts as the backbone of the entire pipeline
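As an illustration of the "one row per column" rule, the sketch below turns a single CREATE TABLE statement into working-table rows. It is not a reproduction of ddl_to_desc.py: the function name `ddl_to_rows`, the row keys (`table_name`, `column_name`, `raw_data_type`) and the regex-based parsing are assumptions made for this example, and a real extractor would need to handle quoting, constraints, and dialect differences.

```python
import re

# Hypothetical, simplified extraction step (assumption, not the actual script).
COLUMN_RE = re.compile(r"^\s*(\w+)\s+([A-Za-z]+(?:\s*\([^)]*\))?)", re.MULTILINE)
# Leading keywords that start a table-level constraint rather than a column.
SKIP = {"PRIMARY", "FOREIGN", "CONSTRAINT", "UNIQUE", "CHECK"}

def ddl_to_rows(ddl: str) -> list:
    """Return one working-table row per column of a single CREATE TABLE."""
    match = re.search(
        r"CREATE\s+TABLE\s+([\w.]+)\s*\((.*)\)\s*;?\s*$",
        ddl,
        re.IGNORECASE | re.DOTALL,
    )
    if match is None:
        return []
    table_name, body = match.group(1), match.group(2)
    rows = []
    for column_name, data_type in COLUMN_RE.findall(body):
        if column_name.upper() in SKIP:
            continue  # table-level constraint, not a column
        rows.append({
            "table_name": table_name,
            "column_name": column_name,
            "raw_data_type": data_type,
        })
    return rows

sample_ddl = """CREATE TABLE fact_emissions (
    id BIGINT,
    site_code VARCHAR(32) NOT NULL,
    co2_tonnes NUMERIC(12, 3),
    PRIMARY KEY (id)
);"""
rows = ddl_to_rows(sample_ddl)  # three rows, one per column
```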
1.2 Signal Classification Pipeline
Purpose: Enrich signals through staged classification.
Each stage writes to its own results table with full traceability via run_id.
Stages
Stage 1: Datatype Normalization
- Table: codex_datatype_results
- Output: cleaned_data_type
Stage 2: Module Classification
- Table: codex_module_results
- Output: module_code
Stage 3: CSI KIND Classification
- Table: codex_kind_results
- Output: kind (CSI KIND)
Note: This is CSI KIND, distinct from other KIND concepts (e.g., CMI KIND).
Stage 4: Signal Naming
- Table: codex_signal_name_results
- Output: signal_name
Stage 5: Validation & CSI Preparation
- Table: codex_validator_results
- Output: pre_version_key, is_valid, suggested_csi_pattern
Run Tracking
- Table: codex_automation_runs
1.3 Derived Views
v_codex_signal_registry_latest
- Combines working table + latest results from all stages
- Represents the current "best state" of each signal
v_codex_signal_registry_review_queue
- Filters signals requiring manual review
- Driven by confidence thresholds and validation flags
v_codex_signal_registry_export_ready
- Final curated dataset
- Ready for promotion to SSSR / CSI
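The logic behind the first two views can be sketched in a few lines: keep the newest result per signal, then flag rows that fail validation or fall below a confidence threshold. This is a minimal sketch under assumptions. The keys `signal_id` and `confidence`, the ordering of `run_id` values, and the 0.8 threshold are hypothetical and are not taken from the actual view definitions.

```python
def latest_per_signal(results: list) -> dict:
    """Keep, for each signal_id, the result row with the highest run_id.

    Assumes run_id values are comparable and increase over time.
    """
    latest = {}
    for row in results:
        current = latest.get(row["signal_id"])
        if current is None or row["run_id"] > current["run_id"]:
            latest[row["signal_id"]] = row
    return latest

def needs_review(row: dict, threshold: float = 0.8) -> bool:
    """Flag a row for manual review (the threshold value is an assumption)."""
    return (not row["is_valid"]) or row["confidence"] < threshold
```

Under the same assumptions, an export-ready row would be one for which `needs_review` returns False.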
2. Component & Module Architecture
2.1 Component Registry
documented_component_registry
- Source of truth for all ZAYAZ components
- Derived from documentation (MDX frontmatter)
Key fields:
- component_id
- component_title
- md_document_link
- module_code (primary association)
2.2 Module Registry
module_registry
Canonical modules (examples):
- inpt (Input Hub)
- comp (Computation Hub)
- repo (Reports & Insights)
- siss (Signal Intelligence System)
- zara
- zaam
- risk
- netz
- vera
- seel
- acad
2.3 Module Alias Layer
module_alias_registry
Purpose: Map documentation/frontmatter identifiers to canonical module codes.
Example:
- computation-hub → comp
- sis → siss
This decouples:
- Documentation vocabulary
- System-level canonical codes
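A minimal sketch of alias resolution, assuming the alias registry behaves like a simple lookup from documentation identifier to canonical code. The in-memory dictionaries stand in for module_registry and module_alias_registry, and the function name `resolve_module` is hypothetical.

```python
from typing import Optional

# Stand-ins for module_registry and module_alias_registry (subset only).
CANONICAL_MODULES = {"inpt", "comp", "repo", "siss"}
MODULE_ALIASES = {"computation-hub": "comp", "sis": "siss"}

def resolve_module(identifier: str) -> Optional[str]:
    """Map a frontmatter identifier to a canonical module code, if known."""
    key = identifier.strip().lower()
    if key in CANONICAL_MODULES:
        return key  # already canonical
    return MODULE_ALIASES.get(key)  # None when the alias is not registered
```

Returning None for unknown identifiers, rather than guessing, keeps unmapped documentation vocabulary visible for review.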
2.4 Component ↔ Module Mapping
component_module_map
Purpose: Resolve component-to-module relationships.
Supports:
- Multi-module components
- Canonical mapping
- Frontmatter preservation
Key fields:
- primary_module_code
- module_codes[]
- omr_module_ids[]
component_omr_module_map
Purpose: Fully normalized many-to-many mapping.
- One row per (component, module alias)
- Used for traceability and validation
component_module_map_staging
- Raw ingestion from documentation exports
- Never used directly in production logic
3. Lookup & Classification Systems
3.1 Table Prefix Registry
table_prefix_registry
Defines dataset semantics:
Examples:
- ref_ → Reference data
- fact_ → Events
- dim_ → Dimensions
- agg_ → Metrics
Also provides:
- primary_kind_hint
- secondary_kind_hints
Used directly by classification pipeline.
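The prefix lookup can be sketched as follows. Only the four prefix meanings listed above are used; the kind hints are omitted because their values are not given here. The dictionary stands in for table_prefix_registry and the function name `dataset_semantics` is hypothetical.

```python
from typing import Optional

# Stand-in for table_prefix_registry, limited to the examples in this section.
PREFIX_SEMANTICS = {
    "ref_": "Reference data",
    "fact_": "Events",
    "dim_": "Dimensions",
    "agg_": "Metrics",
}

def dataset_semantics(table_name: str) -> Optional[str]:
    """Return the dataset semantics implied by the table name prefix."""
    for prefix, meaning in PREFIX_SEMANTICS.items():
        if table_name.startswith(prefix):
            return meaning
    return None  # no registered prefix matched
```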
3.2 CSI KIND Registry
kind_registry
Defines CSI semantic categories:
- INPUT
- OUTPUT
- SIGNAL
- SCHEMA
- CONFIG
- FEATURE
- METRIC
- EVENT
- VIEW
Note: This registry governs CSI KIND, not CMI KIND.
3.3 Table Registry
table_registry
Purpose: Central registry of all datasets/tables used across the ZAYAZ platform.
Stores:
- Dataset definitions
- Storage configuration
- Governance metadata
- Documentation links
This table acts as a foundational metadata layer and is actively used by other systems, including:
- Signal registry (SSSR)
- Classification pipeline
- Documentation system
It provides table-level context that is picked up and reused by the signal registry, ensuring consistency between:
- Physical data structures
- Signal definitions
- Documentation links
This table is the bridge between:
- Physical data
- Signal registry
- Documentation
3.4 Signal Registry (SSSR) Model
Definition
The Signal Registry System (SSSR) is the canonical layer for representing structured signals derived from source datasets.
A signal is defined as:
A semantically enriched, uniquely identifiable representation of a data field, including its context, meaning, and classification within the ZAYAZ system.
Signal vs Related Concepts
| Concept | Description |
|---|---|
| Column | Physical field in a dataset (raw schema level) |
| Signal | Enriched representation of a column with semantic meaning |
| CSI | Standardized identity for a signal across systems |
| Metric | Aggregated or computed signal |
Signal Lifecycle
1. Source Column
   - Extracted from schema (DDL / API / dataset)
2. Working Signal
   - Stored in codex_signal_registry_working
   - Contains raw metadata
3. Classified Signal
   - Enriched through pipeline stages:
     - datatype
     - module
     - CSI KIND
     - naming
     - validation
4. Validated Signal (Pre-CSI)
   - Assigned pre_version_key
   - Checked for uniqueness and consistency
5. CSI Signal
   - Promoted to canonical identity layer
   - Versioned and governed
Relationship to Other Systems
table_registry
- Provides dataset-level metadata
- Defines context for signals
- Signals inherit semantic meaning from their parent dataset
Component & Module System
- Signals are linked to components via component_name
- Modules are resolved via component_module_map
- Enables system-level grouping and governance
CSI (Canonical Signal Identity)
- Provides globally unique signal identity
- Ensures cross-system consistency
- Enables reuse and interoperability
Future: USO Integration
- Signals will be extended with:
  - Unit
  - Scope
  - Object
- Introduces an additional classification layer beyond CSI
Key Properties of a Signal
- Traceable: Linked to source column and dataset
- Classified: Assigned module, CSI KIND, and semantic name
- Versioned: Evolves through CSI versioning
- Reusable: Can be referenced across systems and modules
- Governed: Subject to validation and review workflows
Design Principles
- Signals are not tied to ESG — fully domain-agnostic
- Signals are first-class entities, not just columns
- Signals enable semantic interoperability across datasets
- Signals are the foundation for analytics, AI, and reporting layers
4. Pipeline Flow
SQL → ddl_to_desc.py → codex_signal_registry_working
→ Datatype → Module → CSI KIND → Signal Name → Validator
→ v_codex_signal_registry_latest
→ review_queue
→ export_ready
→ SSSR → CSI
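The staged flow above can be sketched as a loop in which every stage appends to its own results list and every row written in a run carries the same run_id. This is an illustrative sketch only. The function name `run_stages`, the `signal_id` key, the use of a UUID as run_id, and the idea that later stages see earlier outputs are all assumptions.

```python
import uuid
from typing import Callable

# A stage is (results table name, classifier function). The table names come
# from the stage list in this document; the classifier functions are
# caller-supplied placeholders.
Stage = tuple[str, Callable[[dict], dict]]

def run_stages(signals: list, stages: list) -> dict:
    """Run every stage over every signal, tagging each result with one run_id."""
    run_id = str(uuid.uuid4())  # assumption: one id shared by the whole run
    results = {table: [] for table, _ in stages}
    for signal in signals:
        enriched = dict(signal)  # copy so the working row is not mutated
        for table, classify in stages:
            output = classify(enriched)
            results[table].append(
                {"run_id": run_id, "signal_id": signal["signal_id"], **output}
            )
            enriched.update(output)  # later stages can read earlier outputs
    return results
```

Keeping one results list per stage mirrors the one-table-per-stage layout, so a single stage can be rerun without touching the others.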
5. Design Principles
5.1 Separation of Concerns
- Raw ingestion vs classification vs validation
- Staging vs canonical vs derived
5.2 Traceability
- Every transformation tracked via run_id
- Full lineage from source column to final CSI
5.3 Deterministic + AI Hybrid
- Rule-based baseline (prefix, schema, naming)
- AI augmentation (future phases)
5.4 Multi-Module Awareness
- Components can belong to multiple modules
- Primary module used for deterministic classification
5.5 Documentation-Driven Architecture
- MDX frontmatter → database
- Database → documentation (bi-directional potential)
6. Next Phase (Planned)
6.1 Full SSSR Table
- Promote export-ready signals to canonical registry
- Add versioning + governance fields
6.2 USO Integration
- Add Unit, Scope, Object classification
6.3 CSI Versioning Engine
- Automatic version increments
- Change detection
- Backward compatibility
6.4 Validation Framework
- Cross-signal validation
- Schema enforcement
- Data integrity rules
7. Summary
The current ZAYAZ architecture is a signal intelligence system that is:
- Modular
- Scalable
- Traceable
- Domain-agnostic
It is now ready for:
- Full SSSR expansion
- USO enrichment
- Production-grade automation
11. CMI Naming Policy
11.1 Purpose
This chapter defines how Canonical Managed Identifiers (CMIs) are named, validated, proposed, and governed within ZAR.
It exists to ensure that artifact identity is:
- deterministic
- documentation-linked
- reusable across versions
- explainable to humans and AI systems
- enforceable in registry and CI workflows
The CMI naming policy governs:
- module_code
- component_id
- cmi_kind
- artifact_name
- semantic versioning for artifacts
11.2 Canonical CMI Format
<module_code>.<COMPONENT_ID>.<CMI_KIND>.<ARTIFACT_NAME>.<MAJOR>_<MINOR>_<PATCH>
Example:
comp.PEF-ME.ENGINE.CORE.1_0_0
vera.TG-CORE.ENGINE.VALIDATOR.1_0_0
comp.AIIL-CON.SCHEMA.INPUTS.1_0_0
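A hedged sketch of how the canonical format could be parsed into its five segments. The character classes in the pattern are inferred from the examples in this chapter and are an assumption, not a published grammar. `Cmi` and `parse_cmi` are hypothetical names, and the parser checks shape only, not registry membership.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the examples above (assumption).
CMI_RE = re.compile(
    r"^(?P<module_code>[a-z]+)\."
    r"(?P<component_id>[A-Z0-9]+(?:-[A-Z0-9]+)*)\."
    r"(?P<cmi_kind>[A-Z]+)\."
    r"(?P<artifact_name>[A-Z][A-Z0-9_]*)\."
    r"(?P<major>\d+)_(?P<minor>\d+)_(?P<patch>\d+)$"
)

class Cmi(NamedTuple):
    module_code: str
    component_id: str
    cmi_kind: str
    artifact_name: str
    version: tuple  # (major, minor, patch)

def parse_cmi(text: str) -> Optional[Cmi]:
    """Split a CMI string into segments, or return None if malformed."""
    m = CMI_RE.match(text)
    if m is None:
        return None
    version = (int(m["major"]), int(m["minor"]), int(m["patch"]))
    return Cmi(m["module_code"], m["component_id"], m["cmi_kind"],
               m["artifact_name"], version)
```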
11.3 Segment Rules
A. module_code
- must be a valid canonical module code from zar.module_registry
- lowercase
- examples: comp, vera, siss, repo
B. COMPONENT_ID
- must match a valid documented component in zar.documented_component_registry
- uppercase frontmatter-style identifier
- examples: PEF-ME, TG-CORE, AIIL-CON
C. CMI_KIND
- must exist in zar.cmi_kind_registry
- identifies what category of artifact this is
- examples: ENGINE, SCHEMA, RULESET, MODEL
D. ARTIFACT_NAME
- must exist in zar.cmi_artifact_name_registry for the chosen CMI_KIND
- describes the canonical functional role of the artifact within the component
- examples: CORE, VALIDATOR, INPUTS, ROUTING
E. VERSION
- must use semantic artifact versioning: MAJOR_MINOR_PATCH
- examples: 1_0_0, 2_1_3
11.4 Core Principle for CMI_KIND
CMI_KIND classifies what type of artifact the registry entry represents.
It must describe the artifact category, not:
- the programming language
- the file name
- the deployment environment
- the team name
Examples:
- executable processing logic → ENGINE
- JSON schema or contract → SCHEMA
- policy bundle or routing rules → RULESET
- trained model package → MODEL
11.5 Core Principle for ARTIFACT_NAME
ARTIFACT_NAME defines the canonical functional role of the artifact inside the component.
It must be:
- semantic
- stable over time
- implementation-neutral
- chosen from the controlled vocabulary
ARTIFACT_NAME must not be derived from:
- file names such as main.py or index.ts
- frameworks such as fastapi or react
- version labels such as v2
- temporary developer naming
Good examples:
- CORE
- VALIDATOR
- CALCULATOR
- INPUTS
- OUTPUT
- ROUTING
Bad examples:
- MAIN_PY
- FASTAPI_HANDLER
- VALIDATOR_V2
- OUTPUT_SCHEMA_FINAL
11.6 Decision Hierarchy for Choosing CMI_KIND
When registering a new artifact, classify in this order:
1. What is the artifact primarily?
   - execution logic → ENGINE
   - structure/contract → SCHEMA
   - standalone operational script → SCRIPT
   - policy/rules → RULESET
   - external system adapter → CONNECTOR
   - trained ML artifact → MODEL
   - front-end interaction component → UI
   - visual reporting surface → DASHBOARD
   - orchestrated workflow → JOB
   - shared package → LIB
   - test harness or QA artifact → TEST
2. What role does it play in the platform?
   - prefer the most semantically precise CMI_KIND
3. If two kinds are plausible:
   - choose the kind that best represents the artifact's primary governance and runtime identity
   - flag for review if ambiguity remains
11.7 Decision Hierarchy for Choosing ARTIFACT_NAME
1. Choose CMI_KIND first
2. Identify the artifact's functional role within the component
3. Select the approved ARTIFACT_NAME from zar.cmi_artifact_name_registry
4. If no approved name exists:
   - propose a new controlled-vocabulary entry
   - do not improvise permanent naming in production registry rows
Examples:
ENGINE
- main runtime logic → CORE
- deterministic computation → CALCULATOR
- validation logic → VALIDATOR
- scoring logic → SCORER
- transformation logic → TRANSFORMER
- normalization logic → NORMALIZER
SCHEMA
- input contract → INPUTS
- output contract → OUTPUT
- configurable options → OPTIONS
- API request contract → REQUEST
- API response contract → RESPONSE
RULESET
- decision rules → DECISION
- routing rules → ROUTING
- validation rules → VALIDATION
- classification rules → CLASSIFIER
11.8 Completeness Model
A component may contain multiple artifacts of the same CMI_KIND.
Examples:
comp.PEF-ME.ENGINE.CORE.1_0_0
comp.PEF-ME.ENGINE.CALCULATOR.1_0_0
comp.PEF-ME.ENGINE.VALIDATOR.1_0_0
comp.PEF-ME.SCHEMA.INPUTS.1_0_0
comp.PEF-ME.SCHEMA.OUTPUT.1_0_0
A CMI registry entry is therefore unique at the following grain:
(module_code, component_id, cmi_kind, artifact_name, version)
Not at:
(module_code, component_id, cmi_kind)
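The uniqueness grain can be illustrated with an in-memory set keyed on the full five-part tuple. The class `CmiRegistry` is a hypothetical stand-in for a uniqueness constraint on the registry table, shown only to make the grain concrete.

```python
class CmiRegistry:
    """Toy registry enforcing uniqueness at the five-part grain."""

    def __init__(self):
        self._keys = set()

    def register(self, module_code, component_id, cmi_kind, artifact_name, version):
        key = (module_code, component_id, cmi_kind, artifact_name, version)
        if key in self._keys:
            raise ValueError(f"duplicate CMI: {key}")
        self._keys.add(key)
        return key

registry = CmiRegistry()
# Two ENGINE artifacts for the same component coexist because artifact_name differs.
registry.register("comp", "PEF-ME", "ENGINE", "CORE", "1_0_0")
registry.register("comp", "PEF-ME", "ENGINE", "CALCULATOR", "1_0_0")
```

Registering the same five-part key a second time raises ValueError, whereas a key at the coarser three-part grain would have rejected the second ENGINE artifact above.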
11.9 CMI Completeness Validator
The CMI Completeness Validator should work in two layers.
Layer 1 — Deterministic baseline
Use:
- component_id
- component_title
- md_document_link
- controlled CMI_KIND vocabulary
- controlled ARTIFACT_NAME vocabulary
- registry rules for minimum expected artifact sets
This layer should check whether the component has the minimum required artifacts for its apparent role.
Examples of baseline expectations:
- most executable components should have at least one ENGINE.CORE
- schema-driven components often require SCHEMA.INPUTS and/or SCHEMA.OUTPUT
- rule-heavy components may require RULESET.ROUTING, RULESET.DECISION, or RULESET.VALIDATION
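The baseline expectations above could be encoded as requirement groups, where a group is satisfied when the component has at least one of its (CMI_KIND, ARTIFACT_NAME) pairs. This is a sketch under assumptions. The role labels (`executable`, `schema_driven`, `rule_heavy`) and the function name `unmet_requirements` are hypothetical, and the actual rules would live in zar.cmi_completeness_rule_registry, whose structure is not described here.

```python
# Hypothetical encoding of the baseline expectations listed above.
# Each role maps to a list of groups; a group is met by any one of its pairs.
BASELINE_RULES = {
    "executable": [{("ENGINE", "CORE")}],
    "schema_driven": [{("SCHEMA", "INPUTS"), ("SCHEMA", "OUTPUT")}],
    "rule_heavy": [{
        ("RULESET", "ROUTING"),
        ("RULESET", "DECISION"),
        ("RULESET", "VALIDATION"),
    }],
}

def unmet_requirements(role: str, artifacts: list) -> list:
    """Return the requirement groups the component does not yet satisfy."""
    present = set(artifacts)
    return [group for group in BASELINE_RULES.get(role, []) if not (group & present)]
```

An empty result means the deterministic baseline is met. A non-empty result lists the groups that Layer 2 or a human reviewer would need to look at.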
Layer 2 — AI-assisted proposal
AI can be used after the deterministic baseline to:
- read the MDX/component specification
- infer likely missing artifacts
- propose candidate CMI_KIND + ARTIFACT_NAME combinations
- explain why they are proposed
- flag ambiguous cases for human review
Recommended model:
- deterministic registry validation first
- AI-assisted proposal second
- human approval where needed
This prevents uncontrolled invention while still allowing the system to scale.
11.10 Governance Rules
A CMI record must not be created unless:
- module_code exists in zar.module_registry
- component_id exists in zar.documented_component_registry
- cmi_kind exists in zar.cmi_kind_registry
- (artifact_name, cmi_kind) exists in zar.cmi_artifact_name_registry
- the full CMI is unique for its version
A new artifact name must be added to the controlled vocabulary before production use.
A semantic artifact change should trigger a version increment, not silent overwrite.
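The creation preconditions can be sketched as a single check that returns every violated rule. The registries are passed in as plain Python sets standing in for the zar tables. The function name `governance_violations`, the record keys, and the message strings are assumptions made for this example.

```python
def governance_violations(record, modules, components, kinds, artifact_names, existing_cmis):
    """Return the list of violated creation rules (empty means creatable)."""
    violations = []
    if record["module_code"] not in modules:
        violations.append("unknown module_code")
    if record["component_id"] not in components:
        violations.append("unknown component_id")
    if record["cmi_kind"] not in kinds:
        violations.append("unknown cmi_kind")
    # The artifact name must be approved for this specific kind.
    if (record["artifact_name"], record["cmi_kind"]) not in artifact_names:
        violations.append("artifact_name not approved for cmi_kind")
    key = (
        record["module_code"],
        record["component_id"],
        record["cmi_kind"],
        record["artifact_name"],
        record["version"],
    )
    if key in existing_cmis:
        violations.append("CMI already registered for this version")
    return violations
```

Collecting all violations instead of stopping at the first one gives a CI job or reviewer the complete picture in one pass.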
11.11 Strategic Outcome
With controlled CMI_KIND, controlled ARTIFACT_NAME, and a completeness validator, ZAR gains:
- deterministic artifact identity
- stronger lineage precision
- explainable runtime provenance
- reusable artifact patterns across components
- governance-ready validation for future automation
This makes the CMI layer as disciplined and scalable as the CSI layer.
APPENDIX A - The zar.cmi_artifact_name_registry
The Controlled Vocabulary (v1 Seed)
ENGINE
('CORE', 'ENGINE', 'Primary execution engine of the component', 'Execution', 'Main runtime logic'),
('CALCULATOR', 'ENGINE', 'Performs deterministic calculations', 'Execution', 'Used for formula-based processing'),
('VALIDATOR', 'ENGINE', 'Validates inputs or outputs', 'Validation', 'Rule-based or logic-based validation'),
('SCORER', 'ENGINE', 'Computes scores or indices', 'Scoring', 'Used for trust scores, risk scores, etc.'),
('TRANSFORMER', 'ENGINE', 'Transforms data structures', 'Transformation', 'Mapping or reshaping data'),
('AGGREGATOR', 'ENGINE', 'Aggregates multiple inputs', 'Aggregation', 'Rollups, groupings, summaries'),
('ENRICHER', 'ENGINE', 'Adds additional data or context', 'Enrichment', 'Joins, external lookups'),
('ROUTER', 'ENGINE', 'Routes signals or tasks', 'Orchestration', 'ZSSR-like routing logic'),
('PROCESSOR', 'ENGINE', 'Generic processing unit', 'Execution', 'Fallback when role is broad'),
('NORMALIZER', 'ENGINE', 'Normalizes data formats or values', 'Transformation', 'Standardization logic')
SCHEMA
('INPUTS', 'SCHEMA', 'Defines input structure', 'Structural', 'Used for input schemas'),
('OUTPUT', 'SCHEMA', 'Defines output structure', 'Structural', 'Used for output schemas'),
('OPTIONS', 'SCHEMA', 'Defines configuration options', 'Structural', 'Optional parameters'),
('PAYLOAD', 'SCHEMA', 'Defines message payload', 'Structural', 'Event/message structure'),
('CONTRACT', 'SCHEMA', 'Formal data contract', 'Structural', 'Interface definition'),
('RESPONSE', 'SCHEMA', 'Response schema', 'Structural', 'API responses'),
('REQUEST', 'SCHEMA', 'Request schema', 'Structural', 'API inputs')
SCRIPT
('ETL', 'SCRIPT', 'Extract-transform-load script', 'Data Processing', 'Batch pipelines'),
('MIGRATION', 'SCRIPT', 'Schema or data migration script', 'Maintenance', 'DB migrations'),
('SEED', 'SCRIPT', 'Seed data script', 'Initialization', 'Initial data population'),
('IMPORT', 'SCRIPT', 'Data import script', 'Ingestion', 'External ingestion'),
('EXPORT', 'SCRIPT', 'Data export script', 'Extraction', 'Data export jobs')
RULESET
('DECISION', 'RULESET', 'Decision logic ruleset', 'Decisioning', 'Decision trees, policies'),
('VALIDATION', 'RULESET', 'Validation rules', 'Validation', 'Constraint enforcement'),
('ROUTING', 'RULESET', 'Routing rules', 'Orchestration', 'Signal routing'),
('POLICY', 'RULESET', 'Policy definitions', 'Governance', 'Compliance rules'),
('CLASSIFIER', 'RULESET', 'Classification logic', 'Classification', 'Categorization rules')
CONNECTOR
('API', 'CONNECTOR', 'External API connector', 'Integration', 'Generic API'),
('SAP', 'CONNECTOR', 'SAP integration', 'Integration', 'ERP integration'),
('FILE_INGEST', 'CONNECTOR', 'File ingestion connector', 'Integration', 'CSV, Excel, etc.'),
('DATABASE', 'CONNECTOR', 'Database connector', 'Integration', 'SQL/NoSQL'),
('STREAM', 'CONNECTOR', 'Streaming connector', 'Integration', 'Kafka, etc.')
MODEL
('PREDICTOR', 'MODEL', 'Prediction model', 'ML', 'Forecasting'),
('CLASSIFIER', 'MODEL', 'Classification model', 'ML', 'Categorization'),
('SCORING', 'MODEL', 'Scoring model', 'ML', 'Risk/trust scoring'),
('EMBEDDER', 'MODEL', 'Embedding model', 'ML', 'Vector generation'),
('ANOMALY', 'MODEL', 'Anomaly detection model', 'ML', 'Outlier detection')
UI
('FORM', 'UI', 'Input form', 'Presentation', 'User data entry'),
('DASHBOARD', 'UI', 'UI dashboard component', 'Presentation', 'Interactive UI'),
('REVIEW', 'UI', 'Review interface', 'Presentation', 'Human validation'),
('WIDGET', 'UI', 'Reusable UI widget', 'Presentation', 'Componentized UI')
DASHBOARD
('ESRS_OVERVIEW', 'DASHBOARD', 'ESRS overview dashboard', 'Reporting', 'Regulatory reporting'),
('KPI_MONITOR', 'DASHBOARD', 'KPI monitoring dashboard', 'Reporting', 'Performance tracking'),
('EXEC_SUMMARY', 'DASHBOARD', 'Executive summary dashboard', 'Reporting', 'Management view')
JOB
('PIPELINE', 'JOB', 'Orchestrated pipeline', 'Orchestration', 'End-to-end flow'),
('SCHEDULER', 'JOB', 'Scheduled job', 'Orchestration', 'Cron-like execution'),
('BATCH', 'JOB', 'Batch job', 'Processing', 'Bulk processing'),
('REPROCESS', 'JOB', 'Reprocessing job', 'Recovery', 'Replay jobs')
LIB
('CORE', 'LIB', 'Core shared logic', 'Shared', 'Core logic reuse'),
('VALIDATION', 'LIB', 'Validation library', 'Shared', 'Reusable validation'),
('FORMAT', 'LIB', 'Formatting utilities', 'Shared', 'Data formatting')
TEST
('UNIT', 'TEST', 'Unit tests', 'QA', 'Low-level tests'),
('INTEGRATION', 'TEST', 'Integration tests', 'QA', 'System tests'),
('REGRESSION', 'TEST', 'Regression test suite', 'QA', 'Prevent regressions'),
('PERFORMANCE', 'TEST', 'Performance tests', 'QA', 'Load/stress testing')
APPENDIX B - ZAR Tables:
| Schema | Name | Type | Owner | Persistence | Access method | Size | Description |
|---|---|---|---|---|---|---|---|
| zar | cmi_artifact_name_registry | table | zar_admin | permanent | heap | 48 kB | |
| zar | cmi_completeness_rule_registry | table | zar_admin | permanent | heap | 16 kB | |
| zar | cmi_kind_registry | table | zar_admin | permanent | heap | 16 kB | |
| zar | cmi_proposal_results | table | zar_admin | permanent | heap | 320 kB | |
| zar | cmi_registry | table | zar_admin | permanent | heap | 8192 bytes | |
| zar | codex_automation_runs | table | zar_admin | permanent | heap | 16 kB | |
| zar | codex_datatype_results | table | zar_admin | permanent | heap | 16 kB | |
| zar | codex_kind_results | table | zar_admin | permanent | heap | 16 kB | |
| zar | codex_module_results | table | zar_admin | permanent | heap | 16 kB | |
| zar | codex_signal_name_results | table | zar_admin | permanent | heap | 16 kB | |
| zar | codex_signal_registry_working | table | zar_admin | permanent | heap | 16 kB | |
| zar | codex_validator_results | table | zar_admin | permanent | heap | 16 kB | |
| zar | component_module_map | table | zar_admin | permanent | heap | 160 kB | |
| zar | component_module_map_staging | table | zar_admin | permanent | heap | 192 kB | |
| zar | component_omr_module_map | table | zar_admin | permanent | heap | 120 kB | |
| zar | documented_component_registry | table | zar_admin | permanent | heap | 120 kB | |
| zar | documented_component_staging | table | zar_admin | permanent | heap | 8192 bytes | |
| zar | kind_registry | table | zar_admin | permanent | heap | 16 kB | |
| zar | module_alias_registry | table | zar_admin | permanent | heap | 16 kB | |
| zar | module_registry | table | zar_admin | permanent | heap | 16 kB | |
| zar | table_prefix_registry | table | zar_admin | permanent | heap | 48 kB | |
| zar | table_registry | table | zar_admin | permanent | heap | 8192 bytes | |