bookly/DESIGN.md
2026-04-15 15:29:39 -07:00

# Bookly — Agent Design
A conversational customer support agent for a fictional online bookstore. Handles two depth use cases (order status, returns) and one breadth use case (policy questions) over a vanilla web chat UI, backed by Anthropic Claude Sonnet.
## Architecture
```
Browser -> /api/chat -> FastAPI -> agent.run_turn -> Claude
                                   ├── tool dispatch (lookup_order,
                                   │     check_return_eligibility,
                                   │     initiate_return, lookup_policy)
                                   └── validate_reply -> safe fallback
                                         on violation
```
**Stack:** Python 3.11, FastAPI, Uvicorn, the official Anthropic SDK with prompt caching, and a vanilla HTML/CSS/JS frontend.
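The shape of the tool-use loop at the center of the diagram can be sketched as follows. This is a minimal, dependency-free illustration — the real `agent.run_turn` calls the Anthropic SDK, and the handler registry, message shapes, and stub handler here are all assumptions:

```python
# Hypothetical sketch of the run_turn tool-dispatch loop.
TOOL_HANDLERS = {}  # tool name -> handler callable

def register(name):
    def deco(fn):
        TOOL_HANDLERS[name] = fn
        return fn
    return deco

@register("lookup_order")
def handle_lookup_order(order_id, email):
    # Stub: the real handler checks the order store and the email on file.
    return {"error": "order_not_found"}

def run_turn(model, history):
    """Call the model, dispatch any tool calls, stop on a plain-text reply."""
    while True:
        reply = model(history)
        if reply["type"] == "text":
            return reply["text"]
        result = TOOL_HANDLERS[reply["name"]](**reply["input"])
        history.append({"role": "tool", "name": reply["name"], "result": result})
```

The loop keeps dispatching until the model produces plain text, which is then handed to `validate_reply` before leaving the server.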
## Conversation and decision design
1. **XML-tagged sections** (`<critical_rules>`, `<scope>`, `<return_policy>`, `<tool_rules>`, `<clarifying_rules>`, `<tone>`, `<examples>`, `<reminders>`). Tags survive long-context drift better than prose headers and give addressable sections we can re-inject later.
2. **Primacy + recency duplication.** The 35 non-negotiable rules appear twice — at the top in `<critical_rules>` and at the bottom in `<reminders>`. Duplication at the beginning and end of the context window is insurance against rules being forgotten.
3. **Positive action rules, explicit NEVER prohibitions.** Positive framing for normal behavior ("Always call `lookup_order` before discussing order status"); explicit `NEVER` for hallucination-class failures.
4. **Policy as data, not as summary.** `RETURN_POLICY` is a structured dict rendered verbatim into `<return_policy>` at import time. The prompt and the `check_return_eligibility` tool read the same source of truth.
5. **Concrete refusal template.** A single fill-in-the-blank refusal line for off-topic requests, quoted in `<scope>` and referenced from both `<critical_rules>` and `<reminders>`. Templates shrink the decision space and keep things clear and simple for the user.
6. **Few-shot examples for the ambiguous cases only.** Missing order ID, supported policy lookup, off-topic refusal, multi-order disambiguation.
7. **Plain text only.** Explicit instruction to avoid markdown — the chat UI does not render it, and `**bold**` would print as raw asterisks.
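Principles 1, 2, and 4 combine at prompt-assembly time. A minimal sketch, assuming invented rule text and policy contents (the tag names are from the design; everything else is illustrative):

```python
# Hypothetical sketch of system-prompt assembly.
RETURN_POLICY = {
    "window_days": 30,                       # invented example values
    "condition": "unused, original packaging",
    "refund_method": "original payment method",
}

CRITICAL_RULES = [
    "Always call lookup_order before discussing order status.",
    "NEVER invent order IDs or dates.",
]

def render_policy(policy: dict) -> str:
    # Rendered verbatim so the prompt and check_return_eligibility
    # read the same source of truth.
    return "\n".join(f"{k}: {v}" for k, v in policy.items())

def build_system_prompt() -> str:
    rules = "\n".join(CRITICAL_RULES)
    return (
        f"<critical_rules>\n{rules}\n</critical_rules>\n"
        f"<return_policy>\n{render_policy(RETURN_POLICY)}\n</return_policy>\n"
        # ... <scope>, <tool_rules>, <clarifying_rules>, <tone>, <examples> ...
        f"<reminders>\n{rules}\n</reminders>"  # primacy + recency duplication
    )
```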
## Hallucination and safety controls
A system prompt is _mostly_ reliable, but models forget or ignore instructions from time to time. I've added guardrails on tools (similar to the hooks you'd see in Claude Code) to further enforce safety controls, plus an output validation layer that uses good old-fashioned regex to keep unapproved responses from reaching the user.
| Layer | Catches | Cost |
|---|---|---|
| 1. Prompt structure | Drift, tone, minor hallucinations | Tokens |
| 2. Runtime reminder injection | Long-conversation rule decay | Tokens |
| 3. Tool-side enforcement | Protocol violations even if the model ignores instructions | Code |
| 4. Output validation | Fabricated IDs/dates, markdown leakage, scope violations | Compute |
**Layer 1 — prompt structure.** Implemented in `agent.SYSTEM_PROMPT` per the seven principles above.
**Layer 2 — runtime reminder injection.** Before each `messages.create` call, `build_system_content` appends a short `CRITICAL_REMINDER` block to the system content. Once the conversation passes 5 turns, a stronger `LONG_CONVERSATION_REMINDER` is added. The big `SYSTEM_PROMPT` block carries `cache_control: {"type": "ephemeral"}` so it stays in the Anthropic prompt cache across turns; the reminder blocks are uncached so they can vary without busting the cache. Net per-turn cost: a few dozen tokens, plus cache reads on the long prompt.
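The cache-friendly split can be sketched as below. The block structure with `cache_control: {"type": "ephemeral"}` on the stable prompt matches the design; the reminder texts and the turn threshold name are placeholders:

```python
# Hypothetical sketch of build_system_content.
SYSTEM_PROMPT = "<critical_rules>...</critical_rules>"  # large, stable block
CRITICAL_REMINDER = "Reminder: follow the critical rules."           # placeholder
LONG_CONVERSATION_REMINDER = "Re-read <critical_rules> before replying."
LONG_CONVERSATION_TURNS = 5

def build_system_content(turn_count: int) -> list[dict]:
    blocks = [{
        "type": "text",
        "text": SYSTEM_PROMPT,
        # Only the stable block is cached across turns.
        "cache_control": {"type": "ephemeral"},
    }]
    # Uncached reminders can vary per turn without busting the cache.
    blocks.append({"type": "text", "text": CRITICAL_REMINDER})
    if turn_count > LONG_CONVERSATION_TURNS:
        blocks.append({"type": "text", "text": LONG_CONVERSATION_REMINDER})
    return blocks
```

Because the reminders come after the cached block, the cache prefix stays byte-identical across turns.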
**Layer 3 — tool-side enforcement.** Lives in `tools.py`. Each session carries a `SessionGuardState` with two sets: `eligibility_checks_passed` and `returns_initiated`. `handle_initiate_return` refuses with `eligibility_not_verified` unless the order is in the first set, and refuses `already_initiated` if it is in the second set. Even if the model ignores the system prompt entirely, it cannot start a return without going through the protocol. The error message is deliberately instructional — when the tool refuses, the model self-corrects on the next iteration of the tool-use loop. `handle_lookup_order` returns `order_not_found` (not a distinct auth error) on email mismatch to prevent enumeration.
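The guard-state mechanics can be sketched like this — the class and error codes are named in the design, while the return-payload shapes and the eligibility stub are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class SessionGuardState:
    eligibility_checks_passed: set[str] = field(default_factory=set)
    returns_initiated: set[str] = field(default_factory=set)

def handle_check_return_eligibility(state, order_id):
    # Stub: the real handler evaluates RETURN_POLICY against the order.
    state.eligibility_checks_passed.add(order_id)
    return {"eligible": True}

def handle_initiate_return(state, order_id):
    # Enforced in code: the protocol holds even if the model ignores the prompt.
    if order_id not in state.eligibility_checks_passed:
        # Instructional error so the model self-corrects next loop iteration.
        return {"error": "eligibility_not_verified",
                "hint": "Call check_return_eligibility first."}
    if order_id in state.returns_initiated:
        return {"error": "already_initiated"}
    state.returns_initiated.add(order_id)
    return {"status": "return_started"}
```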
**Layer 4 — output validation.** Implemented in `agent.validate_reply`, run on every final assistant text reply before it leaves the server. Deterministic regex checks for: ungrounded `BK-` order IDs (mentioned but never returned by a tool this turn), ungrounded ISO dates, markdown leakage (`**`, `__`, leading `#` or bullets), and out-of-scope keyword engagement that does not also contain the refusal template. On any violation, the bad reply is dropped — replaced with `SAFE_FALLBACK` and **never appended to history**, so it cannot poison future turns. The validator is deliberately heuristic: it catches the cheap wins (fabricated IDs, made-up dates, formatting leaks) and trusts layers 1–3 for everything subtler. No second LLM call — that would compound cost, latency, and a new failure surface.
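The grounding checks can be sketched as below. The violation categories come from the design; the exact regexes, return shape, and fallback text are illustrative:

```python
import re

SAFE_FALLBACK = "I'm sorry, I can't help with that."  # placeholder text
MARKDOWN = re.compile(r"(\*\*|__|^#|^[-*] )", re.MULTILINE)

def validate_reply(reply, tool_ids=frozenset(), tool_dates=frozenset()):
    """Return (ok, violation_code); the caller drops the reply on violation."""
    # BK- order IDs must have been returned by a tool this turn.
    for oid in re.findall(r"BK-\d+", reply):
        if oid not in tool_ids:
            return False, "ungrounded_order_id"
    # Same grounding rule for ISO dates.
    for date in re.findall(r"\d{4}-\d{2}-\d{2}", reply):
        if date not in tool_dates:
            return False, "ungrounded_date"
    # The chat UI renders plain text only, so markdown is a leak.
    if MARKDOWN.search(reply):
        return False, "markdown_leakage"
    return True, None
```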
## Production readiness
Bookly runs end-to-end, but several things a team would add before scaling traffic are deliberately out of the current scope. In priority order:
**Evals — three tiers.**
1. **Tier 1, CI regression set.** ~30 scripted scenarios covering the happy path, every refusal case, every tool failure mode, and a long-conversation drift test. Assertions target *protocol* (which tools were called, in which order, with which arguments) and *Layer 4 violation codes*, not exact wording. Deterministic via temperature 0 and a pinned model ID. Blocks merges.
2. **Tier 2, LLM-as-judge.** A growing labeled dataset scored on grounding, refusal correctness, policy accuracy, tone, and clarifying-question quality. The judge itself is validated against a small golden dataset.
3. **Tier 3, online.** Sample 15% of real conversations, run the Tier 2 judge asynchronously, alert on score regression. Flagged conversations feed back into the Tier 2 dataset.
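A Tier 1 assertion targets protocol, not wording. A sketch, assuming a hypothetical per-turn trace shape (the real trace schema is not specified here):

```python
# Hypothetical protocol assertion for the return flow.
def assert_return_protocol(trace):
    calls = [t["name"] for t in trace["tool_calls"]]
    # Eligibility must be checked before the return is initiated.
    assert "check_return_eligibility" in calls
    assert calls.index("check_return_eligibility") < calls.index("initiate_return")
    # And Layer 4 must have passed the final reply.
    assert trace["validation"]["violations"] == []
```

Because the assertions never touch the reply text, they stay stable across model updates that only change phrasing.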
**Observability.**
- **Per-turn structured trace** indexed by session+turn, containing the full message history, tool calls with inputs/outputs, latency breakdown, token counts, validation result and violation codes, and whether the reply was appended to history. Without this you debug blind.
- **Metrics.** Validation-failure rate by code, safe-fallback rate, refusal rate, eligibility-check-before-`initiate_return` compliance, per-tool error rate, p99 latency.
- **Alerts.** Page on validation-failure spikes, safe-fallback spikes, tool-API errors, latency regressions.
- **Thumbs feedback** wired to the trace ID, with low-rated turns auto-triaged into the Tier 2 dataset.
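The per-turn trace record might look like the following — field names are assumptions matching the list above, and `print` stands in for a real trace sink:

```python
import json
import time

# Hypothetical per-turn structured trace emitter.
def emit_trace(session_id, turn, tool_calls, reply, validation, appended):
    record = {
        "session": session_id,
        "turn": turn,                      # session+turn is the trace index
        "tool_calls": tool_calls,          # inputs/outputs per call
        "reply": reply,
        "validation": validation,          # result + violation codes
        "appended_to_history": appended,   # False when SAFE_FALLBACK fired
        "ts": time.time(),
    }
    print(json.dumps(record))              # stand-in for a real trace sink
    return record
```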
**Tradeoffs explicitly chosen.** Sessions are in-memory and would not survive a restart — fine for a single-node deployment, not for horizontal scale. The agent runs synchronously per request and has no streaming — adding streaming would improve perceived latency but adds a partial-validation problem (you cannot validate a reply you have not finished generating). The validator is heuristic and will miss semantic hallucinations — that is what the eval tiers are for.
The guardrails *prevent* bad outputs; the evals *measure* whether the guardrails are working; the observability tells you *when* they stop.