AIIL-RB

SRE Runbooks

1. Operational runbooks for AI services

1.1. Purpose

These runbooks describe how to detect, mitigate, and prevent failures in AI services without breaking compliance guarantees. Each playbook preserves:

  • Determinism: (method_id, version) pinned
  • Format control: (regex_name, version) enforced
  • Traceability: provenance_id in every hop
  • Jurisdiction: acl_tags honored on compute & export

1.2. Index Drift — Detection & Rollback

Signals (what to watch)

  • Accuracy shift: answers_with_all_citations / total_answers drops > 3% over 1h
  • Result inconsistency: repeat questions produce different citations (Jaccard < 0.5)
  • Shard skew: retriever_shard_miss_rate > 2%
  • Embedding mismatch: sudden rise in OOV (out-of-vocab) vectors for the same corpus
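
The result-inconsistency signal can be illustrated with a small sketch. It assumes each answer exposes its citations as a list of document IDs; the function names and the sample IDs are hypothetical, and only the 0.5 threshold comes from the signal above.

```python
def jaccard(a, b):
    """Jaccard similarity of two citation-ID collections (1.0 when both are empty)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0
    return len(set_a & set_b) / len(union)


def is_inconsistent(citations_run1, citations_run2, threshold=0.5):
    """Flag a repeated question whose two citation sets overlap less than the threshold."""
    return jaccard(citations_run1, citations_run2) < threshold


# Two runs of the same question share 1 of 3 distinct citations: 1/3 < 0.5
print(is_inconsistent(["doc-1", "doc-2"], ["doc-2", "doc-3"]))  # True
```

Treating two empty citation sets as identical is a design choice of this sketch; a production check might instead count an answer without citations as a failure in its own right.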

Triage (10-minute checklist)

  • Freeze writes
kubectl scale deploy retriever-writer --replicas=0 -n rag
  • Compare index manifests (prod vs last-known-good)
diff <(aws s3 ls s3://zayaz-indices/prod/ | awk '{print $3, $4}') <(aws s3 ls s3://zayaz-indices/lkg/ | awk '{print $3, $4}')
  • Drift sample: same 20 gold queries against current vs LKG
./bin/rag_probe --manifest prod --suite gold_20.json > prod.out
./bin/rag_probe --manifest lkg --suite gold_20.json > lkg.out
./bin/compare_hits prod.out lkg.out

Rollback (deterministic)

  • Point gateway to LKG index (feature flag)
helm upgrade rag-gateway ./charts/rag-gateway -n rag --reuse-values \
  --set retrieval.indexManifest=s3://zayaz-indices/lkg/manifest.json
  • Invalidate retriever cache
kubectl delete pods -l app=retriever -n rag

Verify & Unfreeze

  • Re-run gold_200 evaluation; require ∆top1@k ≤ 1% vs baseline.
  • Re-enable writes:
kubectl scale deploy retriever-writer --replicas=2 -n rag
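
The verification gate could be expressed as follows. This is a sketch under stated assumptions: it reads the delta as the absolute difference in top-1 accuracy between the current index and the baseline, and it assumes probe results are available as a mapping from query to ranked hit list. The data shapes and function names are hypothetical and are not the output format of rag_probe.

```python
def top1_accuracy(results, gold):
    """Fraction of gold queries whose top-ranked hit is the expected document."""
    if not gold:
        raise ValueError("gold set is empty")
    correct = sum(
        1 for query, expected in gold.items()
        if results.get(query) and results[query][0] == expected
    )
    return correct / len(gold)


def passes_unfreeze_gate(current, baseline, gold, max_delta=0.01):
    """True when top-1 accuracy moved by at most max_delta versus the baseline."""
    delta = abs(top1_accuracy(current, gold) - top1_accuracy(baseline, gold))
    return delta <= max_delta + 1e-9  # small tolerance for float rounding


gold = {"q1": "doc-a", "q2": "doc-b"}
baseline = {"q1": ["doc-a", "doc-x"], "q2": ["doc-b"]}
current = {"q1": ["doc-a"], "q2": ["doc-z", "doc-b"]}
print(passes_unfreeze_gate(current, baseline, gold))  # False
```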

Governance note: Index flips are logged with the change ticket and the hash of the LKG manifest; the audit log appends the provenance_id chain.

1.3. Behavior Adapter Misconfig — Resolution

Symptoms

  • Answers lose “compliance tone” or structured sections
  • Missing compute calls when numbers are requested
  • Regex failure spikes (citation/unit format)

Quick checks

  • Adapter release diff
kubectl rollout history deploy/behavior-adapter -n rag
kubectl logs deploy/behavior-adapter -n rag | tail -n 200
  • Policy set in effect
curl -s http://adapter:8080/debug/policy | jq .
  • Feature flags
zayaz-ffctl get behavior.adapter.profile

Fix pattern

  • Hot-swap to last-known-good profile
zayaz-ffctl set behavior.adapter.profile=lkg-2025-09-12
kubectl rollout restart deploy/behavior-adapter -n rag
  • Re-enable compute binding
zayaz-ffctl set behavior.adapter.compute_binding=ENFORCED

Post-fix validation

  • Run Eval Harness “structure-adherence” and “citation-coverage” suites.
  • Gate: citation_coverage ≥ 0.98, regex_pass ≥ 0.98, compute_invocation_rate within ±1%.
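
A minimal sketch of this gate, assuming the Eval Harness results can be reduced to a dictionary of rates between 0 and 1 and that "within ±1%" means an absolute difference of at most 0.01 from a pre-incident baseline rate. The function name, the dictionary keys, and the baseline parameter are illustrative assumptions.

```python
def adapter_gate_failures(metrics, baseline_compute_rate,
                          min_citation=0.98, min_regex=0.98,
                          compute_tolerance=0.01):
    """Return the names of the gate checks that fail (empty list means pass)."""
    failures = []
    if metrics["citation_coverage"] < min_citation:
        failures.append("citation_coverage")
    if metrics["regex_pass"] < min_regex:
        failures.append("regex_pass")
    drift = abs(metrics["compute_invocation_rate"] - baseline_compute_rate)
    if drift > compute_tolerance + 1e-9:  # small tolerance for float rounding
        failures.append("compute_invocation_rate")
    return failures


metrics = {"citation_coverage": 0.99, "regex_pass": 0.97,
           "compute_invocation_rate": 0.42}
print(adapter_gate_failures(metrics, baseline_compute_rate=0.40))
# ['regex_pass', 'compute_invocation_rate']
```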

Governance note: Adapter profiles are versioned artifacts; promotion requires committee approval, and each profile is referenced in the audit logs.

1.4. ACL Enforcement Failures — Troubleshoot

Common causes

  • Tenant outside allowed region for EU_ONLY
  • Method carries acl_tags not permitted for tenant (e.g., NO_PHI)
  • Missing/invalid tenant claims (JWT) or stale RBAC cache
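
To make the first and third causes concrete, together with the tag mismatch, here is a hedged sketch of how a denial reason might be derived from tenant claims and a method's acl_tags. Everything in it is an assumption for illustration: the claim fields, the region names, the permitted_tags model, and the reason strings are hypothetical and do not describe the actual policy engine. A stale RBAC cache is not modeled.

```python
EU_REGIONS = {"eu-north", "eu-west", "eu-central"}  # hypothetical region names


def denial_reasons(claims, method_acl_tags):
    """Return the reasons a request would be denied (empty list means allow)."""
    if not claims or "tenant_id" not in claims or "region" not in claims:
        return ["missing_or_invalid_claims"]
    reasons = []
    tags = set(method_acl_tags)
    # EU_ONLY methods may only be invoked by tenants located in an EU region.
    if "EU_ONLY" in tags and claims["region"] not in EU_REGIONS:
        reasons.append("region_outside_EU_ONLY")
    # Every tag on the method must be explicitly permitted for the tenant.
    permitted = set(claims.get("permitted_tags", ()))
    for tag in sorted(tags - permitted):
        reasons.append(f"tag_not_permitted:{tag}")
    return reasons


claims = {"tenant_id": "t-1", "region": "us-east", "permitted_tags": ["EU_ONLY"]}
print(denial_reasons(claims, ["EU_ONLY", "NO_PHI"]))
# ['region_outside_EU_ONLY', 'tag_not_permitted:NO_PHI']
```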

Runbook

  • Confirm denial reason
curl -s "https://audit/api/audit/logs?tenant_id=$TENANT&decision=DENY" | jq .
  • Inspect method ACL tags
SELECT method_id, version, acl_tags
FROM compute_method_registry
WHERE method_id='GHG.intensity' AND version='1.0.0';
  • Check tenant profile
curl -s https://tenant-svc/api/tenants/$TENANT | jq .jurisdiction
curl -s https://tenant-svc/api/tenants/$TENANT | jq .framework_allowlist
  • RBAC cache refresh
kubectl exec deploy/rbac-svc -n governance -- kill -HUP 1

Safe overrides (time-boxed)

  • Incident override token (auditor-approved):
zayaz-rbac override --tenant $TENANT --tag EU_ONLY --duration 2h --ticket INC-1234
  • Re-test request; ensure override is logged with decision_id.

Governance note: Overrides require ticket + expiry; audit entry must include authorizer and rationale.
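
The governance requirements for an override can be checked mechanically. The sketch below assumes an override is represented as a dictionary; the field names are hypothetical and are not the schema used by zayaz-rbac.

```python
from datetime import datetime, timedelta, timezone

# Fields the governance note requires, plus the override scope (names are assumed).
REQUIRED_FIELDS = ("tenant", "tag", "ticket", "authorizer", "rationale", "expires_at")


def override_problems(override, now):
    """Return a list of problems; an empty list means the override is valid and active."""
    problems = [f"missing:{field}" for field in REQUIRED_FIELDS if not override.get(field)]
    expires_at = override.get("expires_at")
    if expires_at is not None and expires_at <= now:
        problems.append("expired")
    return problems


issued = datetime(2025, 9, 12, 10, 0, tzinfo=timezone.utc)
override = {
    "tenant": "t-1", "tag": "EU_ONLY", "ticket": "INC-1234",
    "authorizer": "auditor@example.com", "rationale": "",
    "expires_at": issued + timedelta(hours=2),
}
print(override_problems(override, now=issued + timedelta(hours=3)))
# ['missing:rationale', 'expired']
```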

2. KPIs & SLO Monitoring

SLIs (AI-specific)

  • Retrieval quality: answers_with_all_citations / total_answers
  • Regex quality: answers_passing_regex / total_answers
  • Refusal quality: % refusals with correct template and policy reason
  • Compute determinism: % answers with method_id, version, and provenance_id (p_id) present
  • Latency: retrieval_p95, compute_p95, e2e_p95

SLOs (Phase 1 scope)

  • Retrieval p95 ≤ 1.1 s
  • End-to-end p95 ≤ 2.5 s (Oslo)
  • Citation coverage ≥ 98%
  • Regex pass rate ≥ 98%
  • Deterministic compute presence ≥ 99.9%
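
As an illustration of how the SLIs map onto these SLOs, the following sketch evaluates a snapshot of SLI values against the thresholds listed above. The SLI key names, the counter values, and the helper functions are assumptions made for the example; only the thresholds come from the list.

```python
# (direction, threshold): "max" = must not exceed, "min" = must not fall below
SLOS = {
    "retrieval_p95_s": ("max", 1.1),
    "e2e_p95_s": ("max", 2.5),
    "citation_coverage": ("min", 0.98),
    "regex_pass_rate": ("min", 0.98),
    "deterministic_compute_presence": ("min", 0.999),
}


def ratio(numerator, denominator):
    """Ratio SLI; returns None when there is no traffic to judge."""
    return None if denominator == 0 else numerator / denominator


def breached_slos(slis, slos=SLOS):
    """Return the names of SLOs violated by the given SLI snapshot."""
    breached = []
    for name, (direction, threshold) in slos.items():
        value = slis.get(name)
        if value is None:
            continue  # no data: skip rather than report a breach
        if (direction == "max" and value > threshold) or \
           (direction == "min" and value < threshold):
            breached.append(name)
    return breached


slis = {
    "retrieval_p95_s": 0.9,
    "e2e_p95_s": 2.7,
    "citation_coverage": ratio(985, 1000),
    "regex_pass_rate": ratio(970, 1000),
    "deterministic_compute_presence": ratio(1000, 1000),
}
print(breached_slos(slis))  # ['e2e_p95_s', 'regex_pass_rate']
```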

Prometheus alerts (examples)

prometheus-alerts.yaml
groups:
  - name: ai-slo
    rules:
      - alert: RetrievalLatencyP95High
        # retrieval_ms_bucket is assumed to be recorded in milliseconds, so the 1.1 s SLO is 1100
        expr: histogram_quantile(0.95, sum(rate(retrieval_ms_bucket[5m])) by (le)) > 1100
        for: 10m
        labels: {severity: page}
        annotations:
          summary: "retrieval p95 > 1.1s (10m)"
      - alert: CitationCoverageDrop
        expr: (sum(rate(answers_with_all_citations[15m])) / sum(rate(total_answers[15m]))) < 0.98
        for: 15m
        labels: {severity: page}
      - alert: RegexFailSpike
        expr: (1 - sum(rate(regex_pass[15m])) / sum(rate(total_answers[15m]))) > 0.02
        for: 10m
        labels: {severity: page}
      - alert: DeterminismMissing
        expr: (1 - sum(rate(compute_with_method_version[15m])) / sum(rate(total_answers[15m]))) > 0.001
        for: 10m
        labels: {severity: page}

Dashboards (Grafana panels)

  • RAG: top-k recall, citation coverage, ANN/BM25 split, shard miss rate
  • Adapter: structure adherence, refusal quality
  • Compute: p50/p95, error codes, method mix, dataset hash distribution
  • Packaging: JSON/XBRL success rates, signature failures

2.1. On-Call Quick Commands (crib sheet)

  • Roll back behavior adapter
helm rollback behavior-adapter 42 -n rag
  • Pin the compute method's "latest" pointer to the prior version
UPDATE compute_method_latest
SET version='1.1.0', updated_at=now()
WHERE method_id='GHG.intensity';
  • Roll back a problematic regex library (pin the prior version)
zayaz-ffctl set regex.library.version=1.3.0
kubectl rollout restart deploy/regex-validator -n governance
  • Flip retriever to BM25-only fallback
zayaz-ffctl set retrieval.mode=BM25_ONLY
kubectl rollout restart deploy/rag-gateway -n rag

  • Get last denial decisions
curl -s "https://audit/api/audit/logs?decision=DENY&from=$(date -u -d '-15 min' +%Y-%m-%dT%H:%M:%SZ)" | jq .

Note:

https://audit/api/... refers to the internal ALTD Audit Service DNS name, resolvable only within the ZAYAZ cluster or private network.

Public access (if enabled) is exposed separately via the ZAYAZ API Gateway (e.g. https://api.zayaz.io/v1/audit/...) and is not used for SRE diagnostics.

3. Incident Playbooks

  • Index Drift (P1)

    • Page SRE on-call; freeze writes; switch to LKG manifest
    • Run gold_200; confirm recovery
    • RCA: identify corpus change, embedder mismatch, or shard compaction bug
    • Prevent: add pre-publish canary eval to index pipeline
  • Behavior Adapter Regression (P1)

    • Revert to LKG profile; enable compute binding ENFORCED
    • Run “structure/citation” suites; monitor regex pass
    • RCA: diff prompt contracts; missing tone or refusal clauses
    • Prevent: add unit tests to adapter pack; gated promotion
  • ACL 403 Flood (P2)

    • Sample audit denials; confirm cause (claims vs policy)
    • Refresh RBAC cache; check tenant profiles
    • If policy bug: hotfix with time-boxed override + ticket
    • Prevent: add backtest suite on policy changes
  • Packaging Signature Failures (P1)

    • Verify HSM availability & signer service logs
    • Fail-closed: block exports; still return internal JSON with banner
    • Rotate signer keys if compromise suspected; re-sign recent packages
    • RCA + governance notification

4. Post-Incident Review (template)

  • Summary: what broke, when, blast radius
  • SLO impact: which were breached, for how long
  • Evidence: links to dashboards, audit decision_ids, provenance_ids
  • Root cause: technical + governance gaps
  • Fixes: code, config, policy, tests
  • Prevention: new gates (Eval Harness), new alerts, runbook updates
  • Owner / due dates: tracked in change management

5. Preventative Controls (continuous)

  • Pre-merge: Eval Harness on gold sets (retrieval, structure, citations)
  • Pre-deploy: shadow traffic & canaries (adapter/regex/compute)
  • Daily drift scans: compare ANN vs BM25 recall on sentinel queries
  • Weekly audit sampling: random 100 answers → verify method/version, regex version, and provenance stitched end-to-end
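
The weekly audit sampling could be sketched as below. It assumes answers are available as dictionaries carrying the pinned identifiers; the field names and the answer_id key are hypothetical. The sketch only checks that the fields are present and does not verify that provenance is stitched across every hop.

```python
import random

# Fields every sampled answer is expected to carry (names are assumed).
REQUIRED = ("method_id", "version", "regex_version", "provenance_id")


def sample_and_audit(answers, sample_size=100, seed=None):
    """Draw a random sample and list answers that are missing required fields."""
    rng = random.Random(seed)  # pass a seed to make the sample reproducible
    sample = rng.sample(answers, min(sample_size, len(answers)))
    failures = []
    for answer in sample:
        missing = [field for field in REQUIRED if not answer.get(field)]
        if missing:
            failures.append((answer.get("answer_id"), missing))
    return sample, failures


answers = [
    {"answer_id": "a1", "method_id": "GHG.intensity", "version": "1.0.0",
     "regex_version": "1.3.0", "provenance_id": "p-001"},
    {"answer_id": "a2", "method_id": "GHG.intensity", "version": "1.0.0",
     "regex_version": "1.3.0", "provenance_id": ""},
]
sample, failures = sample_and_audit(answers, seed=7)
print(failures)  # [('a2', ['provenance_id'])]
```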

Outcome

These runbooks ensure that when something slips (index drift, adapter misconfiguration, ACL mismatches, or SLO violations), we recover quickly and safely without compromising the integrity of AI outputs or our compliance guarantees.


