EXTR

Extraction Engines

1. Overview

Extraction Engines (EXTR) are micro-engines responsible for converting unstructured or semi-structured source material into structured, schema-conformant data signals suitable for downstream computation.

EXTR engines operate at the ingestion boundary of the ZAYAZ platform.

They do not compute ESG metrics.
They materialize raw signals from documents, files, text, or external feeds.

EXTR is an engine type within the MICE taxonomy.

2. Design Principles

  1. Loss-Aware Extraction
    Any uncertainty, ambiguity, or confidence degradation introduced during extraction must be explicitly surfaced.

  2. Source Traceability
    Every extracted value retains references to its original source (document, page, cell, paragraph, timestamp).

  3. Non-Interpretive
    EXTR engines extract structure; they do not infer meaning, apply policy, or assess correctness.
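
The three principles above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `Candidate` type, the `extract_field` helper, and the confidence heuristic are hypothetical and not part of ZAYAZ. The function returns every match with its source position, lowers confidence when more than one match exists, and leaves the raw text untouched.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one extracted value (not a ZAYAZ type)."""
    value: str          # raw text as found; not normalized or interpreted
    paragraph: int      # 0-based paragraph index in the source text
    start: int          # character offset within that paragraph
    confidence: float   # illustrative heuristic, not a platform-defined score


def extract_field(text: str, pattern: str) -> list[Candidate]:
    """Return every match of `pattern` with its position and a confidence."""
    hits = []
    for p_idx, paragraph in enumerate(text.split("\n\n")):
        for m in re.finditer(pattern, paragraph):
            hits.append((m.group(0), p_idx, m.start()))
    # Ambiguity is surfaced, not resolved: competing hits share the confidence.
    confidence = 1.0 / len(hits) if hits else 0.0
    return [Candidate(v, p, s, confidence) for v, p, s in hits]


doc = "Total energy use: 1200 kWh.\n\nRevised figure: 1350 kWh."
for candidate in extract_field(doc, r"\d+ kWh"):
    print(candidate)
```

Both candidates are returned with a confidence of 0.5. Choosing between them, or converting the unit, would be a job for a downstream engine.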

3. Scope of Responsibility

3.1. What EXTR Engines Do

  • Optical Character Recognition (OCR)
  • Table and layout extraction
  • Field extraction from text (NLP-assisted)
  • Parsing of invoices, disclosures, and structured files
  • Conversion of external feeds into internal signal envelopes

EXTR outputs are the first structured signals in the system.

4. What EXTR Engines Do Not Do

  • ❌ Compute metrics or indicators (CALC)
  • ❌ Validate business rules or completeness (VALI)
  • ❌ Normalize units or formats (TRFM)
  • ❌ Aggregate across entities or time (AGGR)
  • ❌ Apply semantic enrichment or classification (META / RMAP)

5. Inputs

EXTR engines consume:

  • Unstructured or semi-structured sources:
    • PDFs, scans, spreadsheets
    • Text documents or disclosures
    • External system feeds or APIs
  • Extraction schemas or templates
  • Optional confidence thresholds or extraction profiles
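
As a rough sketch of how these inputs might be bundled, the following assumes a hypothetical `ExtractionRequest` and `ExtractionProfile`. The field names, the schema representation, and the `below_threshold` helper are illustrative assumptions, not the platform's actual contract.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ExtractionProfile:
    """Hypothetical profile; only the optional threshold is modeled."""
    name: str
    min_confidence: float = 0.0   # 0.0 means no field is ever flagged


@dataclass(frozen=True)
class ExtractionRequest:
    """Hypothetical bundle of the inputs listed above."""
    source_uri: str               # PDF, scan, spreadsheet, text, or feed
    schema: dict[str, str]        # field name -> declared type (assumed shape)
    profile: ExtractionProfile = field(
        default_factory=lambda: ExtractionProfile("default")
    )


def below_threshold(scores: dict[str, float], request: ExtractionRequest) -> list[str]:
    """List schema fields whose score is missing or under the threshold."""
    limit = request.profile.min_confidence
    return sorted(f for f in request.schema if scores.get(f, 0.0) < limit)


request = ExtractionRequest(
    source_uri="invoices/example.pdf",
    schema={"supplier": "string", "total_kwh": "number"},
    profile=ExtractionProfile("strict", min_confidence=0.8),
)
print(below_threshold({"supplier": 0.95, "total_kwh": 0.6}, request))  # ['total_kwh']
```

The helper only reports low-confidence fields. It does not drop them, which keeps the uncertainty visible to whatever consumes the output.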

6. Outputs

EXTR engines produce:

  • Structured payloads conforming to declared schemas
  • Confidence or certainty scores per extracted field
  • Source references (document ID, page, range, position)
  • Extraction metadata (engine version, profile, timestamp)

EXTR outputs are typically consumed by:

  • VALI (schema and structural validation)
  • META (semantic enrichment)
  • CALC (once validated)
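
The output elements listed above could be carried in a single envelope. The sketch below is one possible shape, under the assumption that an envelope is a plain serializable record. The class names, field names, and example values are hypothetical and do not describe the actual ZAYAZ signal envelope.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class SourceRef:
    document_id: str
    page: int
    position: str        # e.g. a cell or range reference; format is assumed


@dataclass(frozen=True)
class ExtractedField:
    value: str           # kept as raw text; no unit or format normalization
    confidence: float
    source: SourceRef


@dataclass(frozen=True)
class SignalEnvelope:
    schema_id: str                      # declared schema the payload conforms to
    fields: dict[str, ExtractedField]   # structured payload, one entry per field
    engine_version: str
    profile: str
    extracted_at: str                   # ISO 8601 timestamp


envelope = SignalEnvelope(
    schema_id="energy-invoice",
    fields={
        "total_kwh": ExtractedField("1200", 0.92, SourceRef("doc-001", 3, "B14")),
    },
    engine_version="0.1.0",
    profile="default",
    extracted_at=datetime.now(timezone.utc).isoformat(),
)
print(json.dumps(asdict(envelope), indent=2))
```

Keeping confidence and source reference on each field, rather than on the envelope as a whole, lets a consumer trace any single value back to its origin.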

7. Canonical Identification

  • Engine Type: EXTR
  • USO Code: EXTR
  • Category: Micro Engine (MICE)

8. Position in the Execution Chain

EXTR engines sit at the very start of the computation pipeline:

Source Material
→ EXTR
→ VALI
→ META
→ CALC / TRFM / AGGR

This ensures:

  • clear separation between extraction and computation,
  • explicit provenance from raw source to reported metric,
  • auditability of ingestion steps.
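
The ordering above can be sketched as a list of stages applied in sequence. The stage bodies are placeholders that only record their own name, and the `trail` key is a hypothetical stand-in for real provenance metadata. None of these names come from the platform.

```python
from typing import Callable

Signal = dict
Stage = Callable[[Signal], Signal]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only appends its name to the trail."""
    def stage(signal: Signal) -> Signal:
        return {**signal, "trail": signal.get("trail", []) + [name]}
    return stage


# Order taken from the chain above; CALC stands in for CALC / TRFM / AGGR.
PIPELINE = [make_stage(n) for n in ("EXTR", "VALI", "META", "CALC")]


def run(source_id: str) -> Signal:
    """Pass a signal through every stage in order and return the result."""
    signal: Signal = {"source": source_id}
    for stage in PIPELINE:
        signal = stage(signal)
    return signal


print(run("doc-001")["trail"])  # ['EXTR', 'VALI', 'META', 'CALC']
```

Because each stage returns a new signal instead of mutating its input, the recorded trail shows which steps a value passed through on its way from source to metric.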

9. Registry View

10. Design Rationale

By isolating extraction into its own engine type, ZAYAZ ensures:

  • ingestion uncertainty is visible and auditable,
  • downstream engines operate only on declared structure,
  • re-extraction is possible without recomputing metrics,
  • assurance systems (AAE) can reason about where data originated.

Extraction is not “data prep”; it is a first-class, governed operation.


Stable
