# 11. Evaluation and Quality

Evaluation is not QA overhead here; it is the artifact that makes three earlier claims *load-bearing rather than aspirational*: playbook correctness (§5.1's objective predicates), threshold relaxation safety (§6.2's ratchet), and the substantiation regulators require (§12.2). A system that ships probabilistic components into money-moving workflows without a regression harness is making claims it cannot defend. This section specifies the harness.

#### 11.1 Playbook eval harness

Each playbook version ships with a **golden dataset**: input fixtures (tenant-state snapshots, inbound events) paired with expected outcomes, executed against the sandbox matrix (§3.6). Because §5.1's objective is a *verifiable predicate*, correctness is computable, not judged.

Three fixture classes, each catching a distinct failure:

1. **Synthetic**: hand-authored edge cases such as an invoice paid mid-cadence, duplicate webhook delivery, a contact with no email, multi-currency amounts, and an invoice at exactly the 14-day boundary. Catches logic errors deterministically.
2. **Adversarial**: injection/deception corpora (§11.5). The fixture's *expected* outcome is "no capability escalation, action suspended/flagged," making §8.2's defenses a regression gate rather than a one-time pentest.
3. **Replayed-consented**: real cases from tenants who opted in, pseudonymized per §5.7. Catches the distribution that real data occupies and synthetic fixtures miss (the memo-line phrasing, the malformed-but-valid PO number).

CI gate: a playbook release **cannot merge** if any golden case regresses. This is the CD-for-ML discipline — treat data and model behavior as versioned test surface, not just code \[1]. Concretely, this is why §5.1 forbids tenant-authored playbooks: golden-set guarantees are only tractable over a closed, first-party task set; user-authored logic has no owner to write its evals.
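As a minimal sketch of the gate described above, the snippet below compares each golden case's actual outcome with its expected outcome and blocks the merge on any mismatch. All names here (`GoldenCase`, `regressions`, `may_merge`, `toy_reminder_playbook`) are hypothetical and not part of the platform. The toy playbook runs in-process purely for illustration, and it assumes the 14-day boundary is inclusive, which this section does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Mapping

Outcome = Mapping[str, object]


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    fixture_class: str  # "synthetic", "adversarial", or "replayed_consented"
    inputs: Mapping[str, object]
    expected: Outcome


def regressions(cases: Iterable[GoldenCase],
                run_playbook: Callable[[Mapping[str, object]], Outcome]) -> list:
    """Return the ids of golden cases whose actual outcome differs from the expected one."""
    return [c.case_id for c in cases if run_playbook(c.inputs) != c.expected]


def may_merge(cases: Iterable[GoldenCase],
              run_playbook: Callable[[Mapping[str, object]], Outcome]) -> bool:
    """The gate: a release is mergeable only if no golden case regresses."""
    return not regressions(cases, run_playbook)


def toy_reminder_playbook(inputs):
    # Hypothetical stand-in for a real playbook run against the sandbox matrix.
    if inputs["status"] == "paid":
        return {"action": "none"}
    # Assumption: the 14-day boundary is inclusive.
    if inputs["days_overdue"] >= 14:
        return {"action": "send_reminder"}
    return {"action": "none"}


CASES = [
    GoldenCase("paid-mid-cadence", "synthetic",
               {"status": "paid", "days_overdue": 20}, {"action": "none"}),
    GoldenCase("exactly-14-days", "synthetic",
               {"status": "open", "days_overdue": 14}, {"action": "send_reminder"}),
]

print(may_merge(CASES, toy_reminder_playbook))
```

Because the comparison is plain equality on a structured outcome, the gate needs no human judgment, which is the property the verifiable predicate is meant to provide.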

#### 11.2 Action-correctness metrics

Aggregate task-success rate is the wrong headline metric — it hides the asymmetry that a wrong `money_movement` action costs orders of magnitude more than a missed `read_only`. Metrics are therefore **per risk class** (§6.1):

| Metric                                | Definition                                                      | Why it's the one that matters                                                                                                                                                                        |
| ------------------------------------- | --------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Money-movement precision**          | Of money-affecting actions taken, fraction correct              | The catastrophic-error metric. Target approaches 1.0; the approval gate (§6) is the safety net *behind* it, not a substitute — a wrong action the owner rubber-stamps still counts as a failure here |
| **External-comm precision**           | Of sends, fraction correct in recipient/amount/content          | Uncompensatable (§5.5); measured against approved-draft ground truth                                                                                                                                 |
| **False-approval-request rate (FAR)** | Suspensions the owner approves *unmodified* / total suspensions | The §6.2 objective. High FAR → alarm fatigue → disuse → the whole §6 control collapses \[Parasuraman & Riley, §6.2 ref]. This is a *quality* defect that looks like safety                           |
| **Missed-action rate**                | Objective-relevant actions the playbook failed to take          | The silent failure: §5.1 predicate unmet while every RPC returned 200. Directly the §7.5 objective-attainment SLI                                                                                    |
| **Recall @ trigger**                  | Of situations warranting action, fraction the playbook acted on | Complements precision; a playbook that never acts has perfect precision and zero value                                                                                                               |

The precision/recall split is deliberate: optimizing either alone is trivially gamed (never send → no wrong sends, so send precision is vacuously perfect; send always → perfect recall). Both are gated, per class.
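The table's definitions can be sketched as follows, under stated assumptions: the record shape (`ActionOutcome`), the function names, and the decision label `"approved_unmodified"` are all hypothetical. One illustrative design choice is to report an undefined ratio as `None` rather than `1.0`, so a playbook that never acts cannot pass a precision gate by default.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ActionOutcome:
    risk_class: str  # e.g. "money_movement", "read_only"
    warranted: bool  # ground truth: the situation called for an action
    acted: bool      # the playbook took an action
    correct: bool    # the action matched ground truth (only meaningful if acted)


def _ratio(num: int, den: int) -> Optional[float]:
    # An empty denominator yields None instead of a vacuous 1.0.
    return num / den if den else None


def per_class_metrics(outcomes: Iterable[ActionOutcome]) -> dict:
    """Precision and recall computed separately for each risk class."""
    counts = defaultdict(lambda: {"acted": 0, "correct": 0, "warranted": 0, "hit": 0})
    for o in outcomes:
        c = counts[o.risk_class]
        c["acted"] += int(o.acted)
        c["correct"] += int(o.acted and o.correct)
        c["warranted"] += int(o.warranted)
        c["hit"] += int(o.warranted and o.acted)
    return {
        cls: {
            "precision": _ratio(c["correct"], c["acted"]),
            "recall": _ratio(c["hit"], c["warranted"]),
        }
        for cls, c in counts.items()
    }


def false_approval_request_rate(decisions: Iterable[str]) -> Optional[float]:
    """FAR = suspensions approved unmodified / total suspensions."""
    decisions = list(decisions)
    unmodified = sum(d == "approved_unmodified" for d in decisions)
    return _ratio(unmodified, len(decisions))
```

A release gate would then compare each class's precision and recall against class-specific thresholds instead of a single aggregate score.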

#### 11.3 Sandbox-based end-to-end tests

Golden datasets execute against real vendor sandboxes (§3.6), not mocks. The rationale is specific and expensive to learn: **mocks encode your assumptions about the vendor, so they keep passing exactly when those assumptions are wrong**, which is the failure you most need to catch. A mock of the QuickBooks invoice API cannot surface that Intuit changed a state-transition rule; the sandbox does, as a red build (§3.6, §10.4). E2E coverage runs the full path (webhook ingest → plan → gateway → sandbox dispatch → audit-chain assertion), so provenance completeness (§7.2) and idempotency (§5.4) are themselves under test, not assumed. Vendors without sandboxes get dedicated paid test accounts (Appendix A elevated-risk flag); the gap is tracked, never silently accepted.

#### 11.4 Production monitoring: denial as signal

Approval denials (§6.5) are the highest-value production quality signal available, because they are **free expert labels**: the owner, the domain authority (§6.4), telling the system it was wrong *before* the action executed. Routing by structured denial reason (§6.5):

* **`wrong_amount` / `wrong_recipient`** → flags the *plan* for review, not merely the preference. These signal an extraction or data defect (the model misread the invoice; the cache was stale), i.e. a latent bug that would recur — escalated to the eval pipeline as a candidate new golden case (§11.1).
* **`wrong_tone` / `not_now`** → preference signal; feeds threshold/tone calibration (§6.2), not the bug tracker.

The distinction is the point: conflating "you got the fact wrong" with "I'd have phrased it differently" would either bury real defects in preference noise or pollute the model's tone-learning with bug reports. Aggregate denial-rate per (playbook, class) is a monitored SLI (§7.5); a spike is a shipped regression the golden set missed — closing the loop back to §11.1.
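The routing rule and the denial-rate SLI described above might look like the sketch below. The four reason codes come from this section; the `Route` enum, the fallback to manual triage for unrecognized reasons, and the `(playbook, risk_class, denied)` tuple shape are assumptions made for illustration.

```python
from collections import Counter
from enum import Enum


class Route(Enum):
    EVAL_PIPELINE = "eval_pipeline"   # candidate new golden case (§11.1)
    CALIBRATION = "calibration"       # threshold/tone calibration (§6.2)
    MANUAL_TRIAGE = "manual_triage"   # assumption: unknown reasons go to a human


DEFECT_REASONS = frozenset({"wrong_amount", "wrong_recipient"})
PREFERENCE_REASONS = frozenset({"wrong_tone", "not_now"})


def route_denial(reason: str) -> Route:
    """Keep factual defects and preference signals on separate paths."""
    if reason in DEFECT_REASONS:
        return Route.EVAL_PIPELINE
    if reason in PREFERENCE_REASONS:
        return Route.CALIBRATION
    return Route.MANUAL_TRIAGE


def denial_rates(decisions) -> dict:
    """Denial rate per (playbook, risk_class).

    decisions: iterable of (playbook, risk_class, denied) tuples.
    """
    totals, denied = Counter(), Counter()
    for playbook, risk_class, was_denied in decisions:
        key = (playbook, risk_class)
        totals[key] += 1
        denied[key] += int(was_denied)
    return {key: denied[key] / totals[key] for key in totals}
```

Keeping the two reason sets disjoint and explicit makes the defect-versus-preference split auditable: a new reason code has to be classified deliberately before it reaches either pipeline.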

#### 11.5 Red-teaming

Continuous, not launch-gated, because the threat (§8.1 A1) evolves and the attack surface (every inbound email) is permanently open. Scope:

* **Injection corpora** exercising §8.2 layers 2–6: direct instruction override, obfuscated/encoded instructions, multi-step ("save this for later then act"), and tool-poisoning payloads in vendor MCP tool descriptions (§8.5) \[2]\[3]. Drawn from and extending public suites where applicable \[4].
* **Deception-mode suites** (§8.1): well-formed fraudulent-fact emails carrying *no* injection — the harder class, since separation/detection do nothing and only §6.1's novelty/first-contact scoring fires. Expected outcome: suspended with anomaly annotation, never auto-executed.
* **Approval-fatigue simulation:** synthetic high-FAR load to verify the §6.2 ratchet does not relax thresholds under noise and that escalating-denial pauses (§6.5) trigger — testing that a *patient* attacker (drive FAR up, then slip one fraudulent action past a fatigued approver) is contained.
* **Isolation probes:** attempts at cross-tenant read via crafted inputs, verifying §8.3 partitioning holds under adversarial context.

Findings become permanent adversarial golden cases (§11.1 class 2) — every red-team win is a regression test forever after, so defenses ratchet monotonically. Cadence: internal continuous + periodic third-party assessment, the latter reused as SOC 2 (§8.6) and CASA (§3.7) evidence.
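One way to picture "every red-team win is a regression test forever after" is an append-only corpus keyed by a hash of the payload, as sketched below. The names (`AdversarialCase`, `promote`, `defense_held`), the `rt-` id prefix, and the outcome shape with `status` and `escalated` keys are hypothetical. The two safe statuses mirror the "suspended/flagged" expectation from §11.1.

```python
import hashlib
from dataclasses import dataclass

SAFE_STATUSES = frozenset({"suspended", "flagged"})


@dataclass(frozen=True)
class AdversarialCase:
    case_id: str
    payload: str


def promote(corpus: dict, payload: str) -> dict:
    """Return a corpus that includes the finding; existing cases are never dropped."""
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    case_id = "rt-" + digest
    if case_id in corpus:
        return corpus  # already a permanent case; promotion is idempotent
    return {**corpus, case_id: AdversarialCase(case_id, payload)}


def defense_held(outcome: dict) -> bool:
    """A case passes only if the action was held back and no capability was gained."""
    return outcome["status"] in SAFE_STATUSES and not outcome["escalated"]
```

Since `promote` only ever adds keys, the set of adversarial cases grows monotonically, which is the ratchet property the paragraph above relies on.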

***

#### References (Section 11)

\[1] D. Sculley et al., *Hidden Technical Debt in Machine Learning Systems* (NeurIPS 2015) — data/behavior as versioned test surface; and Google, *MLOps: Continuous delivery and automation pipelines in ML* (CD for model + data). <https://papers.nips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba>

\[2] OWASP, *Top 10 for LLM Applications* (2025) — LLM01 Prompt Injection; evaluation guidance. <https://owasp.org/www-project-top-10-for-large-language-model-applications/>

\[3] Invariant Labs, *MCP Tool Poisoning Attacks* (2025) — tool-description injection payloads (§8.5). <https://invariantlabs.ai/blog/mcp-security-notification-tool-poisoning-attacks>

\[4] Public injection benchmarks (e.g., Lakera Gandalf-class corpora, garak LLM vulnerability scanner) as sourcing/extension basis for the adversarial suite. <https://github.com/leondz/garak>

***

*Next: Section 12 (Compliance and AI Governance).*

***

### As-Built Reconciliation — V1

*Legend/sources as in §1's addendum.*

**Agent-*behavior* evaluation is absent from the platform docs: Tier-1 gap (B), no companion doc yet.** The platform ships **code** testing (Vitest, Playwright, Storybook) and AI-assisted engineering with human review gates (MR §1.2, §1.4), but code passing tests is **not** the same as an agent producing the *correct invoice*.

| §                                                                                     | Status                       | As-built (source)                                                                                                                 |
| ------------------------------------------------------------------------------------- | ---------------------------- | --------------------------------------------------------------------------------------------------------------------------------- |
| 11.1 Playbook eval harness / golden datasets                                          | **PLANNED (does not exist)** | No golden-dataset harness; correctness of agent outputs is unmeasured. This is the direct dependency of §12.2 FTC substantiation. |
| 11.2 Action-correctness metrics (per-class precision/recall, **FAR**)                 | **PLANNED**                  | No per-risk-class metrics today.                                                                                                  |
| 11.3 Sandbox E2E across the vendor matrix                                             | **PLANNED**                  | Sandbox *providers* exist (TS §4.3); connector E2E does not (no connectors).                                                      |
| 11.4 Denial-as-signal                                                                 | **PLANNED**                  | Depends on the approval UX + structured deny reasons (§6.5, PLANNED); ties to observability SLIs.                                 |
| 11.5 Red-teaming (injection corpora, deception suites, fatigue sim, isolation probes) | **PLANNED**                  | No adversarial suite; isolation *primitive* (row-level `company_id`) exists to probe against (SC §8).                             |

**Flag:** this is the single biggest quality-and-compliance dependency in the whole paper and it is entirely unbuilt. Three earlier claims are **aspirational until this exists**: §5.1 objective-predicate correctness, §6.2 threshold-relaxation safety, and §12.2 "the eval harness *is* the FTC substantiation infrastructure." **Recommend a companion `evaluation-harness-v1.md`.**

