Extraction Engines

1. Overview

Extraction Engines (EXTR) are micro-engines responsible for converting unstructured or semi-structured source material into structured, schema-conformant data signals suitable for downstream computation.

EXTR engines operate at the ingestion boundary of the ZAYAZ platform.

They do not compute ESG metrics.
They materialize raw signals from documents, files, text, or external feeds.

EXTR is an engine type within the MICE taxonomy.

2. Design Principles

- Loss-Aware Extraction: Any uncertainty, ambiguity, or confidence degradation introduced during extraction must be explicitly surfaced.
- Source Traceability: Every extracted value retains references to its original source (document, page, cell, paragraph, timestamp).
- Non-Interpretive: EXTR engines extract structure; they do not infer meaning, apply policy, or assess correctness.
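
A minimal Python sketch of how these three principles might look in practice. All names (`SourceRef`, `ExtractedField`, `extract_labelled_value`) and the confidence heuristic are assumptions made for illustration; they are not taken from the ZAYAZ codebase.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class SourceRef:
    """Hypothetical pointer back to where a value was found."""
    document_id: str
    page: int
    char_start: int
    char_end: int


@dataclass
class ExtractedField:
    """Hypothetical extracted value: raw text only, no interpretation."""
    name: str
    raw_value: Optional[str]
    confidence: float
    source: Optional[SourceRef]
    notes: List[str] = field(default_factory=list)


def extract_labelled_value(text: str, label: str, document_id: str, page: int) -> ExtractedField:
    """Find 'label: value' occurrences and surface any ambiguity."""
    pattern = re.compile(re.escape(label) + r"\s*:\s*([^\n]+)")
    matches = list(pattern.finditer(text))
    if not matches:
        return ExtractedField(label, None, 0.0, None, ["label not found"])
    first = matches[0]
    result = ExtractedField(
        name=label,
        raw_value=first.group(1).strip(),  # kept as text; units are not parsed here
        confidence=1.0,
        source=SourceRef(document_id, page, first.start(1), first.end(1)),
    )
    if len(matches) > 1:
        # Loss-aware: do not silently pick a winner; lower confidence and say why.
        result.confidence = 1.0 / len(matches)
        result.notes.append(f"{len(matches)} candidate matches; first one used")
    return result


page_text = "Total energy: 1,200 MWh\nNotes: restated\nTotal energy: 1,250 MWh"
energy = extract_labelled_value(page_text, "Total energy", "doc-001", 3)
print(energy)
```

The function returns the raw string and a note about the competing match rather than deciding which figure is correct; that decision belongs to downstream engines.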

3. Scope of Responsibility

3.1. What EXTR Engines Do

- Optical Character Recognition (OCR)
- Table and layout extraction
- Field extraction from text (NLP-assisted)
- Parsing of invoices, disclosures, and structured files
- Conversion of external feeds into internal signal envelopes

EXTR outputs are the first structured signals in the system.
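
As one illustration of parsing a structured file, the sketch below turns each cell of a CSV file into a record that carries its own cell reference. The function name `parse_csv_cells` and the record keys are hypothetical; only the Python standard library is used.

```python
import csv
import io
from typing import Dict, List


def parse_csv_cells(csv_text: str, document_id: str) -> List[Dict[str, object]]:
    """Turn each data cell into a record that carries its own cell reference."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header, data_rows = rows[0], rows[1:]
    records = []
    for row_index, row in enumerate(data_rows, start=2):  # row 1 is the header
        for col_index, raw in enumerate(row):
            records.append({
                "field": header[col_index] if col_index < len(header) else None,
                "raw_value": raw,  # kept as text: no unit or format normalization here
                "source": {
                    "document_id": document_id,
                    "row": row_index,
                    "column": col_index + 1,
                },
            })
    return records


sample = "site,electricity_kwh\nOslo,12 500\nBergen,n/a\n"
cells = parse_csv_cells(sample, "file-042")
for cell in cells:
    print(cell)
```

Values such as `12 500` and `n/a` pass through unchanged. Cleaning them up would be normalization or validation, which this page assigns to other engine types.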

4. What EXTR Engines Do Not Do

❌ Compute metrics or indicators (CALC)
❌ Validate business rules or completeness (VALI)
❌ Normalize units or formats (TRFM)
❌ Aggregate across entities or time (AGGR)
❌ Apply semantic enrichment or classification (META / RMAP)

5. Inputs

EXTR engines consume:

- Unstructured or semi-structured sources:
  - PDFs, scans, spreadsheets
  - Text documents or disclosures
  - External system feeds or APIs
- Extraction schemas or templates
- Optional confidence thresholds or extraction profiles
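
To make the role of a confidence threshold concrete, here is a small sketch of an extraction profile applied to per-field confidence scores. `ExtractionProfile`, `flag_low_confidence`, the field names, and the default threshold of 0.8 are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ExtractionProfile:
    """Hypothetical profile: fields to extract and the confidence needed to pass unflagged."""
    name: str
    schema_fields: Tuple[str, ...]
    min_confidence: float = 0.8


def flag_low_confidence(
    confidences: Dict[str, float], profile: ExtractionProfile
) -> Dict[str, List[str]]:
    """Label each schema field by extraction outcome; nothing is dropped."""
    report: Dict[str, List[str]] = {"accepted": [], "low_confidence": [], "not_extracted": []}
    for name in profile.schema_fields:
        if name not in confidences:
            report["not_extracted"].append(name)
        elif confidences[name] < profile.min_confidence:
            report["low_confidence"].append(name)
        else:
            report["accepted"].append(name)
    return report


profile = ExtractionProfile(
    name="invoice-default",
    schema_fields=("invoice_number", "total_amount", "due_date"),
)
report = flag_low_confidence({"invoice_number": 0.97, "total_amount": 0.62}, profile)
print(report)
```

The threshold only labels fields. Low-confidence values still travel downstream with their scores, so the uncertainty stays visible.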

6. Outputs

EXTR engines produce:

- Structured payloads conforming to declared schemas
- Confidence or certainty scores per extracted field
- Source references (document ID, page, range, position)
- Extraction metadata (engine version, profile, timestamp)
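
The four output elements could be combined into a single envelope along the following lines. This is a sketch only: the key names, the `invoice.v1` schema identifier, and the version string are invented for illustration and do not describe the actual ZAYAZ signal envelope.

```python
import json
from datetime import datetime, timezone
from typing import Dict, List, Optional


def build_envelope(
    schema_id: str,
    fields: List[Dict[str, object]],
    engine_version: str,
    profile: str,
    extracted_at: Optional[datetime] = None,
) -> Dict[str, object]:
    """Assemble one output envelope; all key names are illustrative."""
    when = extracted_at or datetime.now(timezone.utc)
    return {
        "schema": schema_id,
        "fields": fields,
        "metadata": {
            "engine_type": "EXTR",
            "engine_version": engine_version,
            "profile": profile,
            "extracted_at": when.isoformat(),
        },
    }


envelope = build_envelope(
    schema_id="invoice.v1",
    fields=[{
        "name": "total_amount",
        "value": "1 250,00",  # raw text as found in the source
        "confidence": 0.91,
        "source": {"document_id": "doc-001", "page": 2, "range": "B14", "position": [412, 630]},
    }],
    engine_version="0.0.0-example",
    profile="invoice-default",
    extracted_at=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
)
print(json.dumps(envelope, indent=2))
```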

EXTR outputs are typically consumed by:

- VALI (schema and structural validation)
- META (semantic enrichment)
- CALC (once validated)

7. Canonical Identification

Engine Type: EXTR
USO Code: EXTR
Category: Micro Engine (MICE)

8. Position in the Execution Chain

EXTR engines sit at the very start of the computation pipeline:

Source Material
   → EXTR
      → VALI
         → META
            → CALC / TRFM / AGGR

This ensures:

- clear separation between extraction and computation,
- explicit provenance from raw source to reported metric,
- auditability of ingestion steps.
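
The chain above can be sketched as a sequence of functions, each recording its engine type in a provenance trail. The stage bodies below are toy stand-ins invented for the example (real VALI, META, and CALC engines are not described on this page); only the ordering comes from the diagram.

```python
from typing import Callable, Dict, List

Signal = Dict[str, object]
Stage = Callable[[Signal], Signal]


def make_stage(engine_type: str, transform: Callable[[Signal], Signal]) -> Stage:
    """Wrap a transform so every pass through it is recorded in the signal's trail."""
    def run(signal: Signal) -> Signal:
        result = dict(transform(signal))
        result["trail"] = list(signal.get("trail", [])) + [engine_type]
        return result
    return run


def run_chain(source: Signal, stages: List[Stage]) -> Signal:
    """Apply the stages in order, feeding each output into the next stage."""
    signal = source
    for stage in stages:
        signal = stage(signal)
    return signal


# Toy stages: each lambda is a placeholder for a real engine.
extr = make_stage("EXTR", lambda s: {**s, "raw_value": str(s["text"]).split(":")[1].strip()})
vali = make_stage("VALI", lambda s: {**s, "structurally_valid": bool(s["raw_value"])})
meta = make_stage("META", lambda s: {**s, "tags": ["energy"]})
calc = make_stage("CALC", lambda s: {**s, "value_mwh": float(str(s["raw_value"]).split()[0])})

final = run_chain({"text": "Total energy: 1200 MWh"}, [extr, vali, meta, calc])
print(final["trail"])
```

Because the numeric conversion happens only in the CALC stand-in, the EXTR stage stays free of computation and the trail shows which engine types touched the signal.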

9. Registry View

10. Design Rationale

By isolating extraction into its own engine type, ZAYAZ ensures:

- ingestion uncertainty is visible and auditable,
- downstream engines operate only on declared structure,
- re-extraction is possible without recomputing metrics,
- assurance systems (AAE) can reason about where data originated.
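
One possible reading of the re-extraction point is that a downstream result can be keyed by a fingerprint of the extracted payload, so running extraction again only triggers recomputation when the payload actually changes. The sketch below illustrates that idea; `payload_fingerprint` and `MetricCache` are hypothetical names, and nothing here is claimed to be how ZAYAZ implements it.

```python
import hashlib
import json
from typing import Callable, Dict


def payload_fingerprint(payload: Dict[str, object]) -> str:
    """Hash a canonical JSON form of the payload (sorted keys, fixed separators)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class MetricCache:
    """Recompute a downstream result only when the extracted payload changes."""

    def __init__(self, compute: Callable[[Dict[str, object]], float]) -> None:
        self._compute = compute
        self._results: Dict[str, float] = {}
        self.compute_calls = 0

    def get(self, payload: Dict[str, object]) -> float:
        key = payload_fingerprint(payload)
        if key not in self._results:
            self.compute_calls += 1
            self._results[key] = self._compute(payload)
        return self._results[key]


cache = MetricCache(lambda p: float(p["energy_mwh"]) * 2)

first_run = {"energy_mwh": "1200", "site": "Oslo"}
second_run = {"site": "Oslo", "energy_mwh": "1200"}  # re-extracted, same content
corrected = {"energy_mwh": "1250", "site": "Oslo"}   # re-extracted, value changed

cache.get(first_run)
cache.get(second_run)
cache.get(corrected)
print(cache.compute_calls)
```

The fingerprint covers the payload only. Extraction metadata such as engine version or timestamp is deliberately left out of the hash in this sketch, so a re-run with a newer engine that yields identical values does not invalidate the cached result.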

Extraction is not “data prep” — it is a first-class, governed operation.

Stable
