
# 10. Reliability Engineering

Fridays' dependency graph is unusual: \~50 vendor APIs, 2+ LLM providers, and its own infrastructure, where the *product promise* (playbook objectives, §5.1) can degrade even while every internal component is healthy (§7.5). The reliability design follows from one framing decision: **availability is defined per (tenant, playbook), not globally** — a Zoho outage must cost exactly the Zoho-dependent playbook branches and nothing else.

#### 10.1 Availability model and degraded modes

**Vendor outage.** Per-endpoint circuit breakers \[1] open on error-rate and latency thresholds. Their inputs are the Rate-Limit Governor's signals (§3.5) plus synthetic canaries against the sandbox matrix (§3.6); the canaries distinguish "vendor down" from "our credentials broken", which trigger different pages and follow different runbooks. Breaker-open behavior per plane:

* **Reads:** serve from the Operational Cache with explicit staleness labels. Safe *because* of §6.5's precondition revalidation: any approval granted against stale state revalidates at dispatch, so a decision made during an outage cannot execute against a world that moved — graceful read degradation and the approval model compose instead of conflicting.
* **Writes:** queue, don't drop, don't silently skip. Affected playbook cells enter an explicit `paused: vendor_unavailable` status surfaced to the tenant — the same anti-fail-open doctrine as §4.4: an invisible pause is the §1.1 failure reproduced internally.
* **Recovery:** backlog drain is jittered and paced by the governor. The naive failure here is self-inflicted: every tenant's queued writes hitting QuickBooks the minute it recovers is a thundering herd that re-trips per-realm limits (§3.5) and re-opens breakers — recovery storms are a documented outage-extension mode \[1]\[2]. Drain priority follows §10.2's classes.
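
The read and write behavior above can be sketched as follows. This is a minimal illustration, assuming a consecutive-failure counter in place of the error-rate/latency thresholds the section describes; `EndpointBreaker`, `read`, `write`, and the threshold values are hypothetical names and numbers, not the platform's API.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class EndpointBreaker:
    """Per-endpoint breaker: opens after N consecutive failures (a simplification)."""
    failure_threshold: int = 5                    # assumed value
    cooldown_s: float = 30.0                      # assumed value
    clock: Callable[[], float] = time.monotonic   # injectable for tests
    failures: int = 0
    opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit a probe once the cooldown has elapsed.
        return self.clock() - self.opened_at >= self.cooldown_s

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def read(breaker: EndpointBreaker, fetch: Callable[[str], Any], cache: dict, key: str) -> dict:
    """Reads degrade to the cache and carry an explicit staleness label."""
    if breaker.allow():
        try:
            value = fetch(key)
            breaker.record(True)
            cache[key] = value
            return {"value": value, "stale": False}
        except Exception:
            breaker.record(False)
    if key in cache:
        return {"value": cache[key], "stale": True}
    raise LookupError(f"{key}: vendor unavailable and nothing cached")


def write(breaker: EndpointBreaker, send: Callable[[Any], None], backlog: list, action: Any) -> str:
    """Writes are queued, never dropped; the returned status is what the tenant sees."""
    if breaker.allow():
        try:
            send(action)
            breaker.record(True)
            return "dispatched"
        except Exception:
            breaker.record(False)
    backlog.append(action)
    return "paused: vendor_unavailable"
```

The backlog produced by `write` is what the recovery drain would later pace and jitter; that pacing is left out of this sketch.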

**LLM provider outage.** Planner/executor separation (§5.2) pays off directly: already-compiled plans continue executing, because executor transitions are code and cache-hit slot-filling runs on Tier B (§5.6), which fails over independently. A full provider outage stalls only *new plan compilation* for cache-miss instances, and running two providers per tier with hot failover (§5.6) bounds even that. Degradation therefore hits novel/complex playbook instances first and routine recurrent work last: the inverse of user pain.
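
A rough sketch of why cached plans are insulated from a provider outage, under the assumption that plan lookup is a cache check followed by ordered failover. `get_plan`, `ProviderDown`, and the provider callables are hypothetical stand-ins for illustration.

```python
from typing import Callable, Optional, Sequence


class ProviderDown(Exception):
    """Raised by a (hypothetical) provider call when the provider is unreachable."""


def get_plan(
    key: str,
    plan_cache: dict,
    providers: Sequence[Callable[[str], str]],
) -> Optional[str]:
    """Return a compiled plan, or None if compilation is stalled."""
    if key in plan_cache:
        # Cache hit: no LLM call, so a provider outage is invisible here.
        return plan_cache[key]
    for compile_with in providers:      # hot failover, in configured order
        try:
            plan = compile_with(key)
        except ProviderDown:
            continue
        plan_cache[key] = plan
        return plan
    return None                         # only cache-miss instances stall
```

Only the `None` branch represents degraded service, and it is reachable only when every provider in the tier is down and the instance has no compiled plan.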

**Approval path.** The Approval Feed must outlive everything else — a suspended `money_movement` action with an unreachable approval surface is a stuck business process. Push (APNs/FCM) is best-effort by platform design, so the feed is **pull-backed**: the mobile app renders the suspension queue from the API regardless of push delivery, and the approval API is deployed as an isolated minimal service (bulkhead \[1]) with a stricter SLO than the planner/executor plane. Expiry semantics (§6.5) already define the never-worse-than outcome: unanswered = no-op, so approval-path degradation fails closed by inheritance.
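
The fail-closed inheritance can be illustrated with a small sketch. It assumes a suspension record with an expiry timestamp and an optional decision; `Suspension`, `outcome`, and `feed` are hypothetical names, and "dispatch" here means "proceed to dispatch-time revalidation (§6.5)", not unconditional execution.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Suspension:
    action_id: str
    expires_at: float
    decision: Optional[str] = None   # "approved", "denied", or None if unanswered


def outcome(s: Suspension, now: float) -> str:
    if s.decision is not None:
        return "dispatch" if s.decision == "approved" else "no-op"
    # Unanswered: expiry resolves to a no-op, so an unreachable
    # approval surface can never cause an action to run.
    return "no-op" if now >= s.expires_at else "pending"


def feed(queue: list, now: float) -> list:
    """What the app renders on pull, independent of push delivery."""
    return [s for s in queue if outcome(s, now) == "pending"]
```

Because `feed` is computed from stored state on every pull, a lost push notification delays an approval but does not hide it.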

#### 10.2 Queueing, priority, and backpressure

Four work classes, strictly ordered:

| Class                  | Contents                                           | Guarantee                                                  |
| ---------------------- | -------------------------------------------------- | ---------------------------------------------------------- |
| **P0 — Lifecycle**     | Token refresh, webhook-subscription renewal (§4.4) | **Reserved capacity floor**, not just priority — see below |
| **P1 — Interactive**   | Post-approval dispatch; owner-initiated actions    | Latency SLO (§7.5); preempts P2/P3 at the governor (§3.5)  |
| **P2 — Scheduled**     | Playbook sweeps, cadence touches                   | Deadline-based; deferrable within cadence tolerance        |
| **P3 — Drain/rebuild** | Outage backlog, cache rebuild                      | Best-effort, jittered                                      |

P0's reserved floor is the non-obvious rule: under sustained overload, *priority* alone eventually starves the lowest class, and if renewals ever queue behind a large P2 sweep, tokens and subscriptions lapse — converting a load event into a mass fail-open event (§1.1, §3.4) whose recovery (tenant re-auth) is far more expensive than the load it deferred. Lifecycle work therefore has capacity that load shedding cannot touch; shedding order under backpressure is P3 → P2 → (never P1/P0).
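
One way to read the reserved floor is as execution slots that only lifecycle work may occupy. The sketch below assumes a fixed slot pool per vendor; `SlotPool`, `shed`, and the capacity numbers are hypothetical, and "shed" here means deferred or dropped from the queue, depending on the class.

```python
from collections import Counter

SHED_ORDER = ("P3", "P2")   # P1 and P0 are never shed


class SlotPool:
    """Execution slots where `p0_floor` slots are usable by P0 only."""

    def __init__(self, capacity: int, p0_floor: int):
        if not 0 < p0_floor < capacity:
            raise ValueError("floor must be positive and below capacity")
        self.capacity, self.p0_floor = capacity, p0_floor
        self.in_flight = Counter()

    def try_admit(self, cls: str) -> bool:
        total = sum(self.in_flight.values())
        if cls == "P0":
            ok = total < self.capacity
        else:
            # Non-lifecycle work may only fill the shared portion of the pool.
            shared_used = total - self.in_flight["P0"]
            ok = total < self.capacity and shared_used < self.capacity - self.p0_floor
        if ok:
            self.in_flight[cls] += 1
        return ok


def shed(queues: dict, excess: int) -> list:
    """Remove up to `excess` queued items, lowest class first."""
    removed = []
    for cls in SHED_ORDER:
        while excess > 0 and queues.get(cls):
            removed.append(queues[cls].pop())
            excess -= 1
    return removed
```

With `SlotPool(capacity=4, p0_floor=1)`, a sweep can occupy at most three slots, so a token refresh arriving mid-sweep is still admitted.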

Mechanics: bounded queues per class per vendor; **weighted fair queuing across tenants** within a class, so one tenant's 10k-invoice month-end sweep cannot starve peers — which composes cleanly with vendors whose limits are per-tenant anyway (QuickBooks per-realm, §3.5). All queues are at-least-once; §5.4's idempotency keys are what make that sufficient — redelivery is a vendor-side no-op — avoiding exactly-once queue infrastructure that would be both slower and still not exactly-once at the vendor boundary \[3].
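
The fairness property can be sketched with a self-clocked approximation of weighted fair queuing, where each job gets a virtual finish tag and the queue serves the smallest tag first. `TenantFairQueue`, the weights, and the unit cost are assumptions for illustration; the section does not specify which WFQ variant is intended.

```python
import heapq
from collections import defaultdict


class TenantFairQueue:
    """Serve jobs by virtual finish time so no tenant monopolizes a class."""

    def __init__(self, weights: dict):
        self.weights = weights
        self.last_finish = defaultdict(float)   # per-tenant last finish tag
        self.vtime = 0.0                         # tag of the last job served
        self.heap = []
        self.seq = 0                             # tie-breaker, keeps FIFO order

    def push(self, tenant: str, job: str, cost: float = 1.0) -> None:
        start = max(self.vtime, self.last_finish[tenant])
        finish = start + cost / self.weights.get(tenant, 1.0)
        self.last_finish[tenant] = finish
        heapq.heappush(self.heap, (finish, self.seq, tenant, job))
        self.seq += 1

    def pop(self) -> tuple:
        finish, _, tenant, job = heapq.heappop(self.heap)
        self.vtime = finish
        return tenant, job
```

If one tenant enqueues a large sweep and a second tenant then enqueues a single job, the second tenant's job is served after the first job of the sweep rather than after all of it.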

#### 10.3 Disaster recovery

RPO and RTO are set per store by *reconstructibility*, and the ordering is the interesting part:

| Store                     | RPO                                        | RTO                      | Rationale / recovery mode                                                                                                                                                                                                                                                                                                                      |
| ------------------------- | ------------------------------------------ | ------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Token Vault**           | \~0 (sync replication, multi-region)       | Minutes                  | Worst loss in the system: lost tokens = simultaneous re-auth across the entire tenant base — a human-labor recovery no engineering shortens. Backups are ciphertext (§4.3); KEKs are KMS multi-region; recovery = restore ciphertext + KMS grant re-attach.                                                                                    |
| **Action log + anchors**  | \~0 (sync replication; WORM replicas §7.1) | Minutes–hours            | Evidence layer; external anchors survive Fridays-side loss by construction and re-verify restored chains — the anchor is DR *for* the log's integrity, not just tamper-evidence.                                                                                                                                                               |
| **Payload store**         | Near-0 (versioned, replicated WORM)        | Hours                    | Dispute/provenance material (§7.2).                                                                                                                                                                                                                                                                                                            |
| **Approval/config state** | Minutes                                    | Minutes                  | Threshold bands (§6.2), delegations; small, hot-replicated. Loss degrades safely: missing state ⇒ cold-start suspend-heavy defaults — the failure mode is *over-asking*, never over-acting.                                                                                                                                                    |
| **Operational Cache**     | **∞ by design**                            | Bounded by vendor quotas | Rebuildable from vendors (§9.2 re-fetch-over-retain). The real constraint is that rebuild for N tenants is **rate-limit-bounded** (§3.5): cold-start RTO is a function of vendor quotas, not disk speed. Rebuild is therefore lazy and prioritized — entities referenced by active playbook instances first, long tail on demand — as P3 work. |
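
To make the rate-limit bound on cache rebuild concrete, here is a back-of-envelope sketch. The quota and entity counts used in the usage note are hypothetical, not real vendor limits, and `rebuild_floor_minutes` and `rebuild_plan` are illustrative names.

```python
import math


def rebuild_floor_minutes(entities: int, calls_per_minute: int, entities_per_call: int = 1) -> float:
    """Lower bound on rebuild time for one quota scope, ignoring retries and jitter."""
    calls = math.ceil(entities / entities_per_call)
    return calls / calls_per_minute


def rebuild_plan(entities: list, active_refs: set) -> tuple:
    """Split entities into an eager set (referenced by active playbooks) and a lazy tail."""
    eager = [e for e in entities if e in active_refs]
    lazy = [e for e in entities if e not in active_refs]
    return eager, lazy
```

Under an assumed quota of 500 calls per minute and one entity per call, 120,000 entities cannot be re-fetched in less than 240 minutes regardless of storage speed, which is why only the eager set is rebuilt up front.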

Two disciplines around the table. First, **restores are exercised, not assumed**: scheduled restore drills run per store class, because a backup's usability is unverifiable except by restoring it \[2]. Second, **vendor-outage game days** run against the sandbox matrix (§3.6), exercising breaker behavior, drain pacing, and the `paused:` status surfaces end-to-end rather than trusting them at the first real outage.

Crash recovery below the DR horizon is already specified: executor state persists per transition and replays idempotently (§5.4) — components are built crash-only \[4]; there is no clean-shutdown path whose absence corrupts state.
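
A toy model of persist-per-transition with idempotent replay, assuming the vendor deduplicates on an idempotency key as §5.4 describes. The `run` function, the journal shape, and the simulated vendor are all hypothetical; the point is that restart and normal start are the same code path.

```python
from typing import Optional


def run(plan: list, journal: dict, vendor: dict, crash_before_persist: Optional[int] = None) -> None:
    """Execute from the last persisted transition; calling this again is the recovery path."""
    for i in range(journal.get("next", 0), len(plan)):
        key = f"{journal['instance']}:{i}"       # idempotency key per transition
        if key not in vendor["seen"]:            # simulated vendor-side dedupe
            vendor["seen"].add(key)
            vendor["effects"].append(plan[i])
        if crash_before_persist == i:
            # Effect happened but the transition was not recorded.
            raise RuntimeError("simulated crash")
        journal["next"] = i + 1                  # persist after each transition
```

A crash between the effect and the persist causes the step to be replayed on restart, and the replay is a no-op at the vendor because the key has already been seen.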

#### 10.4 Vendor API deprecation management

Deprecation is a certainty across a 50-vendor catalog: each vendor sunsets versions on its own clock (Google's announced-deprecation windows, Microsoft Graph's beta→v1 churn, Intuit minor-version advances, Zoho's multi-major-version API history) \[5]. It is handled as routine change, not incident, because the machinery already exists:

* **Detection:** schema pinning fails closed on unrecognized change (§3.3); sandbox canaries (§3.6) catch *behavioral* drift that schema-compatible changes hide (same fields, different semantics — the nastier class). Vendor deprecation announcements feed the Application Tracker's calendar (§2.2 component 8), so sunsets are scheduled work with lead time, not surprises.
* **Migration:** a version bump is a connector release: updated wrapper/pin → full eval-harness run against the sandbox on the *new* version (§11.3) → dual-validation window where golden datasets execute against both versions and diffs are reviewed → staged per-tenant rollout with automatic rollback on eval-metric regression. The eval harness is the migration safety net; without per-playbook golden sets, API migrations across 50 vendors would be release roulette.
* **Aggregator payoff, again:** for aggregated categories (§3.2.3), the aggregator absorbs upstream deprecations behind its unified schema — Fridays migrates once per aggregator change instead of once per payroll vendor. Deprecation load was priced into the §3.2 transport decision; this is where it's collected.
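
The migration steps above can be sketched as two small functions: a dual-validation diff over a golden set, and a staged rollout that rolls back on metric regression. `dual_validate`, `staged_rollout`, the metric callable, and the tolerance are hypothetical; the actual harness is described in §11.3.

```python
from typing import Any, Callable


def dual_validate(golden: list, call_old: Callable[[Any], Any], call_new: Callable[[Any], Any]) -> list:
    """Run each golden case against both API versions and collect differences for review."""
    diffs = []
    for case in golden:
        old, new = call_old(case), call_new(case)
        if old != new:
            diffs.append({"case": case, "old": old, "new": new})
    return diffs


def staged_rollout(
    tenants: list,
    eval_metric: Callable[[list], float],
    baseline: float,
    tolerance: float = 0.0,
) -> dict:
    """Migrate tenants one stage at a time; roll everything back on regression."""
    migrated = []
    for tenant in tenants:
        migrated.append(tenant)
        if eval_metric(migrated) < baseline - tolerance:
            return {"status": "rolled_back", "migrated": []}
    return {"status": "complete", "migrated": migrated}
```

An empty diff list does not prove behavioral equivalence; it only shows that the golden set did not expose a difference, which is why the rollout gate still watches the eval metric.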

***

#### References (Section 10)

\[1] M. Nygard, *Release It! Design and Deploy Production-Ready Software*, 2nd ed. (Pragmatic, 2018) — circuit breakers, bulkheads, and recovery-storm/retry-amplification failure modes.

\[2] B. Beyer et al. (eds.), *Site Reliability Engineering* (O'Reilly, 2016) — data-integrity chapter: restore testing over backup possession; and cascading-failure chapter on recovery-time load management.

\[3] M. Kleppmann, *Designing Data-Intensive Applications* (O'Reilly, 2017) — at-least-once delivery + idempotent effects as the practical exactly-once (per §5.4); limits of end-to-end exactly-once across systems you don't control.

\[4] G. Candea, A. Fox, *Crash-Only Software* (HotOS 2003) — recovery path as the only path; basis for the executor's persist-per-transition design (§5.4).

\[5] Vendor deprecation policies: Google API deprecation policy (announced windows), Microsoft Graph versioning and deprecation guidance, Intuit API minor-version policy, Zoho API version lifecycle. Tracked per vendor in Appendix A; calendars maintained in the Application Tracker (§2.2).

***

*Next: Section 11 (Evaluation and Quality) — golden datasets per playbook, action-correctness metrics, sandbox E2E, approval-denial as production signal, red-teaming.*

***

### As-Built Reconciliation — V1

*Legend/sources as in §1's addendum.*

**Whole section is design-ahead-of-build.** Reliability/DR/SRE is acknowledged in the platform docs as an **open item** (SA §9) and appears only as Phase-3 line items (MR §3) — a **Tier-1 gap (A) with no companion doc yet**.

| §                                                                                            | Status      | As-built (source)                                                                                                                                                                                                                                             |
| -------------------------------------------------------------------------------------------- | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 10.1 Availability / degraded modes (circuit breakers, cache-serve-on-outage, queue-not-drop) | **PLANNED** | No circuit breakers or degraded-mode design in backend. Connector-liveness *detection* (→ `observability-and-slos-v1.md` §4.1) is the sensing half.                                                                                                           |
| 10.2 Queueing / priority / backpressure (P0–P3, **reserved lifecycle floor**, WFQ)           | **PARTIAL** | EXISTS: DB-backed **wakeup queue + coalescing** + routine **concurrency/catch-up** policy (AO §2, §5). Priority classes, the reserved lifecycle capacity floor, and weighted-fair-queuing across tenants — PLANNED (→ `rate-limiting-and-quotas-v1.md` §3.3). |
| 10.3 DR (RPO/RTO per store, vault recovery, **restore drills**, game days)                   | **PLANNED** | Backend ships only a **logical-backup command** (SC §9); no RPO/RTO targets, no restore drills. Backups **exclude the master key** (EXISTS — a leaked backup can't decrypt).                                                                                  |
| 10.4 Vendor API deprecation mgmt                                                             | **PLANNED** | Schema-pin-fails-closed depends on the gateway version-pinning (**ASPIRATIONAL**, §3.3) + sandbox CI (PLANNED).                                                                                                                                               |

**As-built reliability that *does* exist:** crash-recovery primitives (one auto-retry then explicit recovery, the **silent-run watchdog**, and startup/periodic **reconciliation**; AO §3) are real and align with the crash-only posture §5.4 assumes. Everything above the process level (multi-region, breakers, DR drills, game days) is unbuilt. **Recommend a companion `reliability-and-dr-v1.md`** to give this section a home (it and the eval harness are the two Tier-1 items with no companion).

