AIIL-RB

SRE Runbooks

1. Operational runbooks for AI services

1.1. Purpose

These runbooks help on-call engineers detect, mitigate, and prevent failures in AI services without breaking compliance guarantees. Each playbook preserves:

Determinism: (method_id, version) pinned
Format control: (regex_name, version) enforced
Traceability: provenance_id in every hop
Jurisdiction: acl_tags honored on compute & export

1.2. Index Drift — Detection & Rollback

Signals (what to watch)

Accuracy shift: answers_with_all_citations / total_answers drops > 3% over 1h
Result inconsistency: repeat questions produce different citations (Jaccard < 0.5)
Shard skew: retriever_shard_miss_rate > 2%
Embedding mismatch: sudden rise in OOV (out-of-vocab) vectors for the same corpus
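The first two signals reduce to simple arithmetic. The following is a minimal sketch, assuming citations are available as sets of document IDs and coverage ratios are read from the metrics store. The function names are illustrative, and reading "3%" as an absolute drop in the ratio is an assumption, not something the runbook tooling defines:

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def citations_inconsistent(a: set[str], b: set[str], threshold: float = 0.5) -> bool:
    """True when two runs of the same question cite sufficiently different sources."""
    return jaccard(a, b) < threshold


def coverage_dropped(baseline: float, current: float, max_drop: float = 0.03) -> bool:
    """True when citation coverage fell by more than max_drop (absolute) vs. baseline.

    Assumption: the 3% threshold is an absolute difference between ratios.
    """
    return (baseline - current) > max_drop
```

For example, two runs citing {doc-1, doc-2, doc-3} and {doc-2, doc-3, doc-4} share two of four distinct documents, so their Jaccard similarity is exactly 0.5 and the pair is not flagged by the strict `< 0.5` comparison.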

Triage (10-minute checklist)

Freeze writes

kubectl scale deploy retriever-writer --replicas=0 -n rag

Compare index manifests (prod vs last-known-good)

diff <(aws s3 ls s3://zayaz-indices/prod/) <(aws s3 ls s3://zayaz-indices/lkg/)

Drift sample: run the same 20 gold queries against the current and LKG manifests

./bin/rag_probe --manifest prod --suite gold_20.json > prod.out
./bin/rag_probe --manifest lkg  --suite gold_20.json > lkg.out
./bin/compare_hits prod.out lkg.out

Rollback (deterministic)

Point gateway to LKG index (feature flag)

helm upgrade rag-gateway ./charts/rag-gateway \
  --set retrieval.indexManifest=s3://zayaz-indices/lkg/manifest.json

Invalidate retriever cache

kubectl delete pods -l app=retriever -n rag

Verify & Unfreeze

Re-run gold_200 evaluation; require ∆top1@k ≤ 1% vs baseline.
Re-enable writes:

kubectl scale deploy retriever-writer --replicas=2 -n rag

Governance note: Index flips are logged with the change ticket and a hash of the LKG manifest; the audit log appends the provenance_id chain.
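The manifest hash recorded with the change ticket can be computed locally before the flip. This is a minimal sketch that assumes SHA-256 and a manifest already downloaded to disk. The runbook does not specify the hash algorithm, and `manifest_sha256` is a hypothetical helper name:

```python
import hashlib
from pathlib import Path


def manifest_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a manifest file, reading it in chunks.

    Assumption: SHA-256 is the digest the audit log expects.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Hashing the downloaded bytes, rather than a parsed and re-serialized copy, keeps the digest stable regardless of JSON key order or whitespace.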

1.3. Behavior Adapter Misconfig — Resolution

Symptoms

Answers lose “compliance tone” or structured sections
Missing compute calls when numbers are requested
Regex failure spikes (citation/unit format)

Quick checks

Adapter release diff

kubectl rollout history deploy/behavior-adapter -n rag
kubectl logs deploy/behavior-adapter -n rag | tail -n 200

Policy set in effect

curl -s http://adapter:8080/debug/policy | jq .

Feature flags

zayaz-ffctl get behavior.adapter.profile

Fix pattern

Hot-swap to last-known-good profile

zayaz-ffctl set behavior.adapter.profile=lkg-2025-09-12
kubectl rollout restart deploy/behavior-adapter -n rag

Re-enable compute binding

zayaz-ffctl set behavior.adapter.compute_binding=ENFORCED

Post-fix validation

Run Eval Harness “structure-adherence” and “citation-coverage” suites.
Gate: citation_coverage ≥ 0.98, regex_pass ≥ 0.98, compute_invocation_rate within ±1%.
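The gate can be expressed as a small check over the suite results. This is a sketch under stated assumptions: the `EvalResult` shape is hypothetical (the Eval Harness output format is not described here), and "within ±1%" is read as within 0.01 absolute of a baseline invocation rate:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    citation_coverage: float        # fraction in [0, 1]
    regex_pass: float               # fraction in [0, 1]
    compute_invocation_rate: float  # fraction in [0, 1]


def gate_failures(result: EvalResult, baseline_compute_rate: float) -> list[str]:
    """Return the names of failed gates; an empty list means the fix is validated."""
    failures = []
    if result.citation_coverage < 0.98:
        failures.append("citation_coverage")
    if result.regex_pass < 0.98:
        failures.append("regex_pass")
    # Assumption: "within ±1%" means within 0.01 absolute of the baseline rate.
    if abs(result.compute_invocation_rate - baseline_compute_rate) > 0.01:
        failures.append("compute_invocation_rate")
    return failures
```

Returning the list of failed gates, rather than a single boolean, makes it easier to paste the outcome into the incident ticket.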

Governance note: Adapter profiles are versioned artifacts that are referenced in audit logs; promotion requires committee approval.

1.4. ACL Enforcement Failures — Troubleshoot

Common causes

Tenant outside allowed region for EU_ONLY
Method carries acl_tags not permitted for tenant (e.g., NO_PHI)
Missing/invalid tenant claims (JWT) or stale RBAC cache
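When triaging, it can help to think of the three causes as independent checks. The sketch below is illustrative only: the real policy engine's logic is not documented here, and the parameter names, the reason strings, and the rule that a method's tags must all be permitted for the tenant are assumptions:

```python
def denial_reasons(
    method_tags: set[str],
    tenant_permitted_tags: set[str],
    tenant_region: str,
    eu_regions: set[str],
    claims_valid: bool,
) -> list[str]:
    """List every reason a request would be denied; an empty list means allow."""
    reasons = []
    if not claims_valid:
        reasons.append("invalid_claims")
    # EU_ONLY methods may only run for tenants located in an EU region.
    if "EU_ONLY" in method_tags and tenant_region not in eu_regions:
        reasons.append("region_outside_EU_ONLY")
    # Assumption: every tag on the method must be permitted for the tenant.
    not_permitted = sorted(method_tags - tenant_permitted_tags)
    if not_permitted:
        reasons.append("tags_not_permitted:" + ",".join(not_permitted))
    return reasons
```

Collecting all reasons instead of stopping at the first one mirrors the runbook steps that follow, which check claims, method tags, and the tenant profile separately.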

Runbook

Confirm denial reason

grep dec- <(curl -s "https://audit/api/audit/logs?tenant_id=$TENANT&decision=DENY") | jq .

Inspect method ACL tags

inspect-method-acl-tags.sql
SELECT method_id, version, acl_tags
FROM compute_method_registry
WHERE method_id='GHG.intensity' AND version='1.0.0';

Check tenant profile

curl -s https://tenant-svc/api/tenants/$TENANT | jq .jurisdiction
curl -s https://tenant-svc/api/tenants/$TENANT | jq .framework_allowlist

RBAC cache refresh

kubectl exec deploy/rbac-svc -n governance -- kill -HUP 1

Safe overrides (time-boxed)

Incident override token (auditor-approved):

zayaz-rbac override --tenant $TENANT --tag EU_ONLY --duration 2h --ticket INC-1234

Re-test request; ensure override is logged with decision_id.

Governance note: Overrides require ticket + expiry; audit entry must include authorizer and rationale.
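The governance requirements translate into a simple validity check on each override record. This is a hedged sketch: the `Override` fields are hypothetical (the actual zayaz-rbac record schema is not shown in this runbook), and the 2-hour ceiling is borrowed from the example command above rather than from a stated policy:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Override:
    tenant: str
    tag: str
    ticket: str
    authorizer: str
    rationale: str
    expires_at: datetime  # timezone-aware


def override_problems(
    o: Override,
    now: datetime,
    max_duration: timedelta = timedelta(hours=2),  # assumption, from the example
) -> list[str]:
    """Return governance problems with an override; an empty list means compliant."""
    problems = []
    for field in ("ticket", "authorizer", "rationale"):
        if not getattr(o, field).strip():
            problems.append(f"missing {field}")
    if o.expires_at <= now:
        problems.append("already expired")
    elif o.expires_at - now > max_duration:
        problems.append("expiry exceeds max duration")
    return problems
```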

2. KPIs & SLO Monitoring

SLIs (AI-specific)

Retrieval quality: answers_with_all_citations / total_answers
Regex quality: answers_passing_regex / total_answers
Refusal quality: % refusals with correct template and policy reason
Compute determinism: % answers with method_id+version+p_id present
Latency: retrieval_p95, compute_p95, e2e_p95

SLOs (Phase 1 scope)

Retrieval p95 ≤ 1.1 s
End-to-end p95 ≤ 2.5 s (Oslo)
Citation coverage ≥ 98%
Regex pass rate ≥ 98%
Deterministic compute presence ≥ 99.9%
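To make the relationship between SLIs and SLOs concrete, the sketch below evaluates a snapshot of SLI values against the Phase 1 targets listed above. The dictionary keys are illustrative names, not actual metric names, and treating a zero-traffic window as compliant is an assumption:

```python
# Phase 1 SLO targets from the list above: name -> (threshold, direction).
SLOS = {
    "retrieval_p95_s": (1.1, "max"),
    "e2e_p95_s": (2.5, "max"),
    "citation_coverage": (0.98, "min"),
    "regex_pass_rate": (0.98, "min"),
    "deterministic_compute": (0.999, "min"),
}


def ratio(numerator: int, denominator: int) -> float:
    """Ratio SLI; with no traffic there is nothing to violate, so report 1.0."""
    return numerator / denominator if denominator else 1.0


def breached(slis: dict[str, float]) -> list[str]:
    """Return the names of SLOs whose current SLI value violates the target."""
    out = []
    for name, (target, kind) in SLOS.items():
        value = slis[name]
        if (kind == "max" and value > target) or (kind == "min" and value < target):
            out.append(name)
    return out
```

For instance, a window with 970 regex-passing answers out of 1,000 gives `ratio(970, 1000) == 0.97`, which breaches the 98% regex pass rate target.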

Prometheus alerts (examples)

prometheus-alerts.yaml
groups:
- name: ai-slo
  rules:
  - alert: RetrievalLatencyP95High
    # retrieval_ms_bucket is assumed to be in milliseconds (per its name): 1.1 s = 1100 ms
    expr: histogram_quantile(0.95, sum(rate(retrieval_ms_bucket[5m])) by (le)) > 1100
    for: 10m
    labels: {severity: page}
    annotations:
      summary: "retrieval p95 > 1.1s (10m)"
  - alert: CitationCoverageDrop
    expr: (sum(rate(answers_with_all_citations[15m])) / sum(rate(total_answers[15m]))) < 0.98
    for: 15m
    labels: {severity: page}
  - alert: RegexFailSpike
    expr: (1 - sum(rate(regex_pass[15m])) / sum(rate(total_answers[15m]))) > 0.02
    for: 10m
    labels: {severity: page}
  - alert: DeterminismMissing
    expr: (1 - sum(rate(compute_with_method_version[15m])) / sum(rate(total_answers[15m]))) > 0.001
    for: 10m
    labels: {severity: page}

Dashboards (Grafana panels)

RAG: top-k recall, citation coverage, ANN/BM25 split, shard miss rate
Adapter: structure adherence, refusal quality
Compute: p50/p95, error codes, method mix, dataset hash distribution
Packaging: JSON/XBRL success rates, signature failures

2.1. On-Call Quick Commands (crib sheet)

Roll back behavior adapter

helm rollback behavior-adapter 42 -n rag

Pin the compute "latest" pointer to the prior version

UPDATE compute_method_latest
SET version='1.1.0', updated_at=now()
WHERE method_id='GHG.intensity';

Disable a problematic regex library (pin to a prior version)

zayaz-ffctl set regex.library.version=1.3.0
kubectl rollout restart deploy/regex-validator -n governance

Flip retriever to BM25-only fallback

zayaz-ffctl set retrieval.mode=BM25_ONLY
kubectl rollout restart deploy/rag-gateway -n rag

Get last denial decisions

curl -sG "https://audit/api/audit/logs" --data-urlencode "decision=DENY" --data-urlencode "from=$(date -Iminutes -u -d '-15 min')" | jq .

Note:

https://audit/api/... refers to the internal ALTD Audit Service DNS name, resolvable only within the ZAYAZ cluster or private network.

Public access (if enabled) is exposed separately via the ZAYAZ API Gateway (e.g. https://api.zayaz.io/v1/audit/...) and is not used for SRE diagnostics.

3. Incident Playbooks

Index Drift (P1)
- Page SRE on-call; freeze writes; switch to LKG manifest
- Run gold_200; confirm recovery
- RCA: identify corpus change, embedder mismatch, or shard compaction bug
- Prevent: add pre-publish canary eval to index pipeline
Behavior Adapter Regression (P1)
- Revert to LKG profile; enable compute binding ENFORCED
- Run “structure/citation” suites; monitor regex pass
- RCA: diff prompt contracts; missing tone or refusal clauses
- Prevent: add unit tests to adapter pack; gated promotion
ACL 403 Flood (P2)
- Sample audit denials; confirm cause (claims vs policy)
- Refresh RBAC cache; check tenant profiles
- If policy bug: hotfix with time-boxed override + ticket
- Prevent: add backtest suite on policy changes
Packaging Signature Failures (P1)
- Verify HSM availability & signer service logs
- Fail-closed: block exports; still return internal JSON with banner
- Rotate signer keys if compromise suspected; re-sign recent packages
- RCA + governance notification

4. Post-Incident Review (template)

Summary: what broke, when, blast radius
SLO impact: which were breached, for how long
Evidence: links to dashboards, audit decision_ids, provenance_ids
Root cause: technical + governance gaps
Fixes: code, config, policy, tests
Prevention: new gates (Eval Harness), new alerts, runbook updates
Owner / due dates: tracked in change management

5. Preventative Controls (continuous)

Pre-merge: Eval Harness on gold sets (retrieval, structure, citations)
Pre-deploy: shadow traffic & canaries (adapter/regex/compute)
Daily drift scans: compare ANN vs BM25 recall on sentinel queries
Weekly audit sampling: random 100 answers → verify method/version, regex version, and provenance stitched end-to-end
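The weekly audit sampling step can be sketched as a random draw followed by a completeness check. The record field names below are hypothetical stand-ins for whatever the audit store actually exposes, and the check covers only presence of the fields, not end-to-end verification of the provenance chain:

```python
import random
from typing import Optional

# Hypothetical field names; adjust to the actual audit record schema.
REQUIRED_FIELDS = ("method_id", "method_version", "regex_version", "provenance_id")


def sample_answers(answers: list[dict], k: int = 100, seed: Optional[int] = None) -> list[dict]:
    """Draw up to k answers uniformly at random without replacement."""
    rng = random.Random(seed)
    return rng.sample(answers, min(k, len(answers)))


def incomplete(answers: list[dict]) -> list[dict]:
    """Return the answers missing (or with an empty value for) any required field."""
    return [a for a in answers if any(not a.get(f) for f in REQUIRED_FIELDS)]
```

Passing a fixed `seed` makes a given week's sample reproducible, which is useful if an auditor later asks which answers were reviewed.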

Outcome

These runbooks ensure that when something slips (index drift, adapter misconfigs, ACL mismatches, or SLO violations), we recover safely and quickly without compromising the integrity of AI outputs or our compliance guarantees.
