> For the complete documentation index, see [llms.txt](https://wiki.fridays.bot/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://wiki.fridays.bot/documentation/white-paper/13.-performance-and-cost-model.md).

# 13. Performance and Cost Model

The commercial thesis (per-business subscription, not per-seat) is only viable if inference cost per tenant is **bounded and predictable**, not proportional to how much work the model does. This section shows the architecture already specified makes it so, and that the numbers are *measured* (§7.4 metering), not modeled. All illustrative figures below are marked; they are structure, not committed unit economics.

#### 13.1 Inference cost per action: bounding via plan caching and tiering

The naive agent cost model — one frontier inference call per reasoning step (§5.2, ReAct \[1]) — makes cost proportional to task complexity × task volume, which is unbounded per tenant and fatal to a flat subscription. Two specified mechanisms sever that proportionality:

* **Plan caching (§5.2).** Recurrent playbook instances of the same (playbook version, connector set, parameter-schema hash) reuse a cached plan template; only slot values differ. The expensive step — frontier-model plan *compilation* — happens on cache miss (new playbook version, structural change), not per run. For a tenant whose `invoice_followup` runs daily, the plan compiles roughly **once per release, not once per day** — amortizing a fixed cost across every subsequent execution. This is the FrugalGPT cascade insight \[2] applied at plan granularity rather than per-call.
* **Tiered routing (§5.6).** Only Tier-A work (novel compilation, tone-sensitive drafting, anomaly replanning) hits frontier pricing; cache-hit slot-filling, classification, and extraction run on Tier-B/C models at a fraction of the cost. The cost-dominant operations by *volume* (routine recurrent actions) are served by the cheapest tier that holds §11 quality.

Illustrative decomposition (structure only — replace with metered values):

| Operation                        | Frequency             | Tier | Relative cost                        |
| -------------------------------- | --------------------- | ---- | ------------------------------------ |
| Plan compilation                 | \~once/release/tenant | A    | High, amortized to \~0 per action    |
| Cache-hit slot fill              | per action            | B    | Low                                  |
| Draft generation (external comm) | per suspended comm    | A    | Moderate, bounded by approval volume |
| Classification/extraction        | per event             | B/C  | Very low                             |

The load-bearing consequence: **marginal inference cost per routine action trends toward the cost of a Tier-B slot-fill**, not a frontier reasoning loop — the difference between a per-tenant COGS that is flat and one that scales with tenant activity. §7.4 metering validates cache-hit rates against realized spend, so this is an audited claim, not an architectural hope.

#### 13.2 Vendor API cost pass-through

Most vendor APIs in the catalog (QuickBooks, Gmail, Graph, Zoho) carry **no per-call dollar cost** — the tenant already pays the vendor for the underlying SaaS. The real cost is **quota scarcity** (§3.5), which is why metering records vendor consumption in native units (§7.4), not dollars. Two places dollars do enter:

* **Aggregators (§3.2.3):** Finch/Merge and social aggregators price per connection/API call. This is a real per-tenant COGS line, sized against the value of the payroll/social playbooks; it is a deliberate build-vs-buy trade (§3.2.3), and the aggregator fee is the "buy" cost that displaced N direct-integration engineering efforts.
* **Metered vendor tiers:** some vendors gate higher volume behind paid API tiers; where a playbook drives a tenant into one, that cost is attributed per tenant via §7.4 and factored into tier pricing (§13.4).

#### 13.3 Latency budget per playbook class

Two latency regimes, because conflating them mis-sizes both infrastructure and UX expectations:

* **Background (the common case):** playbook sweeps, cadence touches. Human-imperceptible latency requirement — an invoice-follow-up sweep completing in 30 s or 5 min is equivalent to the tenant. Optimize for throughput and cost (batchable, Tier-B, deferrable within cadence tolerance, §10.2), not latency.
* **Interactive (post-approval):** when the owner taps approve, the resulting dispatch should feel immediate. This is P1 priority (§10.2), preempting background work at the governor. Budget decomposition: gateway mediation (single-digit ms, §2.3) + vault fetch (ms) + **vendor API round-trip (100–800 ms, the dominant term)**. The architecture's own overhead is negligible against vendor latency — mediation is not the bottleneck, a claim made in §2.3 and quantified here.

Note the asymmetry: plan *compilation* latency (frontier model, potentially seconds) lands almost entirely in the background regime (§13.1: compilation is rare and off the interactive path), so the expensive-latency operation never sits in front of a waiting human. This is the latency dividend of planner/executor separation (§5.2), paid on top of the cost dividend.

#### 13.4 Unit economics: gross-margin sensitivity to token prices

Per-tenant gross margin is a **query against §7.4 metering** (LLM spend + aggregator fees + infra, attributed per tenant), not a spreadsheet estimate — which means margin is monitored continuously and per-tenant outliers (a tenant whose stack or volume inflates COGS) are visible, not discovered at quarter-end.

Sensitivity structure (illustrative — populate from metering):

* **Token-price exposure is bounded by §13.1.** Because routine actions are Tier-B/C and plan compilation is amortized, a frontier-price increase moves COGS far less than for an architecture that ran frontier inference per step. The plan-cache hit rate is the key lever: higher hit rate → less exposure to any tier's price. This is why §5.2's cache design is an *economic* control, not just a latency one.
* **Two-provider redundancy (§5.6) is also price insurance:** hot-swappable providers per tier let routing follow price/performance, so a single provider's increase is a routing decision, not a margin event.
* **Aggregator fees are the more linear cost** (per-connection), sized deliberately against the revenue of the categories they unlock; they are the predictable line, LLM spend the one the architecture works to flatten.

Break-even framing (structure): a flat subscription is defensible precisely when per-tenant COGS is bounded and sub-linear in tenant activity. §13.1 delivers the boundedness for the dominant cost (inference); §13.2 makes vendor cost mostly a scarcity problem the governor manages rather than a dollar problem; §7.4 makes the whole thing measured. The economic argument and the technical architecture are the same argument — the plan cache, the tier routing, and the metering choke point are simultaneously performance, cost, and margin controls.

***

#### References (Section 13)

\[1] S. Yao et al., *ReAct: Synergizing Reasoning and Acting in Language Models* (ICLR 2023) — per-step inference loop; the cost model §13.1 avoids. <https://arxiv.org/abs/2210.03629>

\[2] L. Chen, M. Zaharia, J. Zou, *FrugalGPT* (2023) — LLM cascades; empirical cost reductions at matched quality, applied here at plan granularity and in tier routing (§5.6). <https://arxiv.org/abs/2305.05176>

\[3] Metering methodology per §7.4 (price-sheet-snapshot-ID'd LLM spend; native-unit vendor quota); unit-economics figures in this section are illustrative placeholders pending production metering data.

***

*Next: Section 14 (Roadmap).*

***

### As-Built Reconciliation — V1

*Legend/sources as in §1's addendum.*

| §                                                     | Status      | As-built (source)                                                                                                                                                                                                                   |
| ----------------------------------------------------- | ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 13.1 Inference cost bounding (plan caching + tiering) | **PLANNED** | Plan caching depends on the planner/executor design (§5.2, PLANNED). **Budget hard-stops EXIST** (AO §6) and bound *worst-case* spend pre-invocation; the *expected*-cost reduction from caching is a Fridays design not yet built. |
| 13.2 Vendor API cost pass-through                     | **PLANNED** | `cost_events` EXISTS as the metering substrate but records **USD**; native vendor-quota-unit metering is PLANNED.                                                                                                                   |
| 13.3 Latency budget (background vs interactive)       | **PLANNED** | Approval-latency SLO → `observability-and-slos-v1.md` §4.2.                                                                                                                                                                         |
| 13.4 Unit economics (margin = query on metering)      | **PARTIAL** | `cost_events` per run (tokens/USD) EXISTS; the per-tenant margin *query* is PLANNED.                                                                                                                                                |

**Flag — the missing analysis (Tier-1 E):** the cost model assumes the plan-cache architecture (aspirational/planned); the as-built cost control is **budget hard-stops**, which bound the *worst case* but do not establish *expected* per-tenant COGS. Critically, the paper prices a **per-business subscription** while the hosted model is **instance-per-customer** — a dedicated Postgres + container per customer. At a flat low tier, a dedicated managed Postgres alone can rival the subscription price, and ops (patching/backup/monitoring ×N) scales linearly. That tension is **unanalyzed** in both the paper and the platform docs. Cross-ref `data-model-v1.md` §4 (N-instance migration/retention cost) and `observability-and-slos-v1.md` §6 (N-target observability tax).


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://wiki.fridays.bot/documentation/white-paper/13.-performance-and-cost-model.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.