
ZAYAZ Automation System – Operator Runbook

1. Purpose

This document explains how scheduled and manual automations in the ZAYAZ system work, how they are triggered, and how operators should reason about failures, overrides, and safety.

The automation system is designed to be:

  • Centralized (single scheduler, single dispatcher)
  • Observable (events, dashboard, warnings)
  • Safe by default (manual-only jobs are never scheduled)
  • UI-driven (enable/disable + run-now without touching YAML)

2. Architecture Overview

GitHub (cron every 10 min)
        |
        v
POST /api/automation/tick
        |
        v
System API (Fly.io)
  - reads automation catalog
  - reads overrides store
  - evaluates cron + enable flags
  - dispatches jobs
        |
        v
GitHub Actions
  automation.yml (dispatcher)
        |
        v
npm run <script>
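
The hop from the System API to the automation.yml dispatcher could be made through GitHub's REST endpoint for workflow dispatch events. The sketch below only builds such a request and does not send it. The owner, repo, and ref values and the `script` input name are assumptions for illustration; the real workflow inputs are not described in this runbook.

```typescript
// Hypothetical request builder; nothing here is taken from the ZAYAZ codebase.
interface DispatchRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildDispatchRequest(
  owner: string,
  repo: string,
  script: string,
  token: string,
  ref: string = "main", // assumed branch to run the workflow on
): DispatchRequest {
  return {
    // GitHub accepts the workflow file name in place of a numeric workflow id.
    url: `https://api.github.com/repos/${owner}/${repo}/actions/workflows/automation.yml/dispatches`,
    method: "POST",
    headers: {
      Accept: "application/vnd.github+json",
      Authorization: `Bearer ${token}`,
    },
    // "script" is an assumed input name that the dispatcher would pass to `npm run`.
    body: JSON.stringify({ ref, inputs: { script } }),
  };
}
```

Keeping the request construction separate from the network call makes it easy to log exactly what was dispatched for a given automation.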

3. Key Components

Component | Responsibility
package.json | Source of truth for what scripts exist
scripts/system/automation.overrides.json | Human-facing catalog metadata + default schedules
docusaurus/static/system/automation.json | Generated public automation catalog
System API (/api/automation/*) | Scheduling, overrides, dispatch, logging
/data/automation.overrides.json (Fly volume) | Runtime state (enabled, cron overrides, lastRunAt)
automation.yml | Single dispatcher workflow
automation-tick.yml | Calls scheduler every 10 minutes

4. Automation Types

4.1. Scheduled Automations

These have:

  • enabledDefault: true
  • non-empty cron array

The scheduler dispatches them automatically whenever /api/automation/tick finds them due.

Examples:

  • Jira syncs & indices
  • Docs hygiene
  • Hecate audits
  • Search release
  • ZARA fixtures

4.2. Manual-Only Automations

These have:

  • enabledDefault: false
  • cron: [] (empty cron array)

They never run automatically but are visible and runnable from the UI.

Examples:

  • jira:backfill-links
  • zara:jira
  • zara:jira:create

These are usually:

  • corrective
  • side-effect-heavy
  • migration-oriented
  • potentially destructive
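
The distinction between scheduled and manual-only automations can be expressed as a small check on a catalog entry. This is a minimal sketch: only the `enabledDefault` and `cron` fields come from this runbook, while the `id` field, the type names, and the "unclassified" bucket are assumptions for illustration.

```typescript
// Assumed shape of a catalog entry.
interface CatalogEntry {
  id: string;
  enabledDefault: boolean;
  cron: string[];
}

type AutomationKind = "scheduled" | "manual-only" | "unclassified";

function classify(entry: CatalogEntry): AutomationKind {
  const hasCron = entry.cron.length > 0;
  if (entry.enabledDefault && hasCron) return "scheduled";
  if (!entry.enabledDefault && !hasCron) return "manual-only";
  // Mixed combinations (for example enabled with no cron) are not described
  // in the runbook, so this sketch flags them instead of guessing.
  return "unclassified";
}
```

A check like this could run when the catalog is generated, so that an entry with a mixed combination is caught before it reaches the scheduler.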

4.3. One-Off / Developer Scripts (Not in Automation System)

These are intentionally excluded from the automation system. The System Settings UI explains why each one is excluded.

Examples:

  • dev servers
  • setup scripts
  • composite helpers
  • scripts already wrapped by higher-level automations

5. Scheduling Model (Important)

  • All schedules live in data, not YAML.
  • GitHub has exactly one cron: every 10 minutes.
  • /api/automation/tick:
    • evaluates cron expressions
    • de-dupes using lastRunAt
    • respects enable/disable flags
    • dispatches only due jobs

This avoids:

  • duplicated schedules
  • YAML drift
  • hard-to-debug cron behavior
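
The tick behavior described in this section can be sketched as follows. This is an illustration under stated assumptions, not the System API's actual code: the `Job` shape, the 10-minute look-back window, and the rule "due if a cron slot inside the window is later than lastRunAt" are all assumed. The cron matcher is deliberately simplified: five fields, UTC, supporting only `*`, `*/n`, single values, ranges, and comma lists, and it requires day-of-month and day-of-week to both match.

```typescript
// Assumed runtime view of one automation after defaults and overrides are merged.
interface Job {
  id: string;
  enabled: boolean;
  cron: string[];
  lastRunAt?: string; // ISO timestamp, assumed format
}

const WINDOW_MINUTES = 10; // mirrors the 10-minute tick interval

// Match a single cron field against a numeric value.
function fieldMatches(field: string, value: number): boolean {
  return field.split(",").some((part) => {
    if (part === "*") return true;
    const step = part.match(/^\*\/(\d+)$/);
    if (step) return value % Number(step[1]) === 0;
    const range = part.match(/^(\d+)-(\d+)$/);
    if (range) return value >= Number(range[1]) && value <= Number(range[2]);
    return Number(part) === value;
  });
}

// Simplified 5-field cron match in UTC (minute hour day-of-month month day-of-week).
function cronMatches(expr: string, t: Date): boolean {
  const f = expr.trim().split(/\s+/);
  if (f.length !== 5) return false;
  return (
    fieldMatches(f[0], t.getUTCMinutes()) &&
    fieldMatches(f[1], t.getUTCHours()) &&
    fieldMatches(f[2], t.getUTCDate()) &&
    fieldMatches(f[3], t.getUTCMonth() + 1) &&
    fieldMatches(f[4], t.getUTCDay())
  );
}

// Return the ids of jobs that should be dispatched on this tick.
function dueJobs(jobs: Job[], now: Date): string[] {
  const due: string[] = [];
  const nowMinute = Math.floor(now.getTime() / 60000) * 60000;
  for (const job of jobs) {
    if (!job.enabled) continue; // respect enable/disable flags
    const last = job.lastRunAt ? Date.parse(job.lastRunAt) : 0;
    let isDue = false;
    // Walk back minute by minute over the window since the previous tick.
    for (let i = 0; i < WINDOW_MINUTES && !isDue; i++) {
      const slot = new Date(nowMinute - i * 60000);
      if (slot.getTime() <= last) break; // de-dupe: already ran at or after this slot
      isDue = job.cron.some((expr) => cronMatches(expr, slot));
    }
    if (isDue) due.push(job.id);
  }
  return due;
}
```

Looking back over the whole window, rather than only at the current minute, means a schedule such as `0 3 * * *` still fires when the tick arrives a few minutes late, while lastRunAt stops the following tick from dispatching it again.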

6. Overrides & Control

Enable / Disable

  • Controlled via UI
  • Stored in /data/automation.overrides.json
  • Takes precedence over defaults

Cron Overrides

  • Can be changed at runtime via the System API (a UI for this is planned)
  • Default cron comes from catalog
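
The precedence rules for enable flags and cron overrides amount to "use the runtime override when one is set, otherwise fall back to the catalog default". A minimal sketch, assuming hypothetical type and function names and an override record whose fields are all optional:

```typescript
// Defaults as published in the catalog (field names from this runbook).
interface CatalogDefaults {
  enabledDefault: boolean;
  cron: string[];
}

// Assumed shape of one entry in /data/automation.overrides.json.
interface RuntimeOverride {
  enabled?: boolean;
  cron?: string[];
  lastRunAt?: string;
}

interface EffectiveConfig {
  enabled: boolean;
  cron: string[];
}

function resolveEffective(
  defaults: CatalogDefaults,
  override?: RuntimeOverride,
): EffectiveConfig {
  return {
    // An explicit override wins, including an explicit `false`.
    enabled: override?.enabled ?? defaults.enabledDefault,
    cron: override?.cron ?? defaults.cron,
  };
}
```

Using `??` rather than `||` matters here: an operator toggling Enabled off stores `false`, which must not fall through to the default.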

Manual Run

  • “Run now” triggers /api/automation/run/:id
  • Bypasses scheduler
  • Uses the same dispatcher workflow
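
The difference between a manual run and a scheduled run can be sketched as below. Manual-only jobs are disabled by default yet runnable from the UI, so this sketch assumes that “Run now” checks neither the cron nor the enabled flag. The function names, the `Dispatch` callback, and the idea that an automation id doubles as the npm script name are assumptions for illustration.

```typescript
// Stand-in for whatever triggers the dispatcher workflow.
type Dispatch = (script: string) => void;

// Handle a "Run now" request for one automation id.
function runNow(id: string, knownIds: string[], dispatch: Dispatch): boolean {
  // Unknown ids are rejected so only cataloged automations can be triggered.
  if (knownIds.indexOf(id) === -1) return false;
  // No cron or enabled check here: the scheduler is bypassed entirely.
  dispatch(id);
  return true;
}
```

Both paths end in the same `dispatch` call, which is what keeps manual and scheduled runs on a single dispatcher workflow.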

7. Observability & Safety

Dashboard

The Automation Dashboard shows:

  • Tick activity
  • Dispatch success / errors
  • Lock contention
  • Catalog fetch issues
  • Top offending automations

This replaces email-based monitoring.

Locks

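  • A global tick lock prevents two ticks from running at the same time
  • This in turn prevents duplicate dispatches of the same job
  • Ticks skipped because the lock was held appear as “locked ticks” in the UI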
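
A global tick lock of this kind can be sketched as a single "held until" timestamp. This is an in-memory illustration only: the class name, the expiry (TTL) that frees the lock if a tick crashes, and passing the clock in as a parameter are all assumptions, and a real deployment would need a lock shared across processes.

```typescript
class TickLock {
  private heldUntil = 0;
  private ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Returns true if this tick may proceed, false if another tick holds the lock.
  tryAcquire(nowMs: number): boolean {
    if (nowMs < this.heldUntil) return false; // would be recorded as a "locked tick"
    this.heldUntil = nowMs + this.ttlMs;
    return true;
  }

  // Called when a tick finishes, so the next tick is not blocked until expiry.
  release(): void {
    this.heldUntil = 0;
  }
}
```

The expiry is a safety net: without it, a tick that dies before calling `release` would block every later tick.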

Audit Trail

  • All ticks and dispatches are logged
  • Available via /api/automation/events
  • Used by the dashboard
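
A dashboard figure such as “Top offending automations” could be derived from the event log along these lines. The event shape below is entirely assumed, since the response format of /api/automation/events is not described here, and "offender" is taken to mean the automations with the most failed dispatches.

```typescript
// Hypothetical event record; the real schema may differ.
interface AutomationEvent {
  type: "tick" | "dispatch";
  automationId?: string;
  ok: boolean;
}

interface Offender {
  id: string;
  failures: number;
}

// Count failed dispatches per automation and return the worst ones first.
function topOffenders(events: AutomationEvent[], limit: number): Offender[] {
  const counts: Record<string, number> = {};
  for (const e of events) {
    if (e.type === "dispatch" && !e.ok && e.automationId) {
      counts[e.automationId] = (counts[e.automationId] || 0) + 1;
    }
  }
  return Object.keys(counts)
    .map((id) => ({ id, failures: counts[id] }))
    // Most failures first; ties are broken by id so the order is stable.
    .sort((a, b) => b.failures - a.failures || a.id.localeCompare(b.id))
    .slice(0, limit);
}
```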

8. Operational Playbooks

“Something didn’t run”

  1. Open System Settings → Automation Dashboard
  2. Check:
     • catalog errors
     • dispatch errors
     • lock contention
  3. Inspect “Top offenders”
  4. Re-run manually if needed

“Disable temporarily”

  • Toggle Enabled off in UI
  • Scheduler will skip it immediately
  • No YAML changes required

“Run something dangerous”

  • Only possible for manual-only jobs
  • Explicitly labeled
  • Requires admin token
  • Never scheduled automatically

9. Design Principles (Why this system exists)

  • No cron in YAML except the scheduler
  • No hidden jobs
  • No email spam
  • No accidental side effects
  • Everything observable
  • Everything reversible

This is intentional.


