bookly/DESIGN.md
Cody Borders 3947180841 Harden security/perf, add literate program at /architecture
Security and performance fixes addressing a comprehensive review:

- Server-issued HMAC-signed session cookies; client-supplied session_id
  ignored. Prevents session hijacking via body substitution.
- Sliding-window rate limiter per IP and per session.
- SessionStore with LRU eviction, idle TTL, per-session threading locks,
  and a hard turn cap. Bounds memory and serializes concurrent turns for
  the same session so FastAPI's threadpool cannot corrupt history.
- Tool-use loop capped at settings.max_tool_use_iterations; Anthropic
  client gets an explicit timeout. No more infinite-loop credit burn.
- Every tool argument is regex-validated, length-capped, and
  control-character-stripped. asserts replaced with ValueError so -O
  cannot silently disable the checks.
- PII-safe warning logs: session IDs and reply bodies are hashed, never
  logged in the clear.
- hmac.compare_digest for email comparison (constant-time).
- Strict Content-Security-Policy plus X-Content-Type-Options,
  X-Frame-Options, Referrer-Policy, Permissions-Policy via middleware.
- Explicit handlers for anthropic.RateLimitError, APIConnectionError,
  APIStatusError, ValueError; static dir resolved from __file__.
- Prompt cache breakpoints on the last tool schema and the last message
  so per-turn input cost scales linearly, not quadratically.
- TypedDict handler argument shapes; direct block.name/block.id access.
- functools.lru_cache on _get_client.
- Anchored word-boundary regexes for out-of-scope detection to kill
  false positives on phrases like "I'd recommend contacting...".

Literate program:

- Bookly.lit.md is now the single source of truth for the five core
  Python files. Tangles byte-for-byte; verified via tangle.ts --verify.
- Prose walkthrough, three mermaid diagrams, narrative per module.
- Woven to static/architecture.html with the app's palette
  (background #f5f3ee) via scripts/architecture-header.html.
- New GET /architecture route serves the HTML with a relaxed CSP that
  allows pandoc's inline styles. Available at
  bookly.codyborders.com/architecture.
- scripts/rebuild_architecture_html.sh regenerates the HTML after edits.
- code_reviews/2026-04-15-1433-code-review.md captures the review that
  drove these changes.

All 37 tests pass.
2026-04-15 15:02:40 -07:00


Bookly — Agent Design

A conversational customer support agent for a fictional online bookstore. Handles two depth use cases (order status, returns) and one breadth use case (policy questions) over a vanilla web chat UI, backed by Anthropic Claude Sonnet.

Architecture

Browser -> /api/chat ->  FastAPI ->  agent.run_turn -> Claude
                                          │
                                          ├── tool dispatch (lookup_order,
                                          │   check_return_eligibility,
                                          │   initiate_return, lookup_policy)
                                          │
                                          └── validate_reply -> safe fallback
                                                                   on violation

Stack: Python 3.11, FastAPI, Uvicorn, the official Anthropic SDK with prompt caching, and an HTML/CSS/JS frontend.

Conversation and decision design

  1. XML-tagged sections (<critical_rules>, <scope>, <return_policy>, <tool_rules>, <clarifying_rules>, <tone>, <examples>, <reminders>). Tags survive long-context drift better than prose headers and give addressable sections we can re-inject later.
  2. Primacy + recency duplication. The 35 non-negotiable rules appear twice — at the top in <critical_rules> and at the bottom in <reminders>. Duplication at the beginning and end of the context window is insurance against rules being forgotten.
  3. Positive action rules, explicit NEVER prohibitions. Positive framing for normal behavior ("Always call lookup_order before discussing order status"); explicit NEVER for hallucination-class failures.
  4. Policy as data, not as summary. RETURN_POLICY is a structured dict rendered verbatim into <return_policy> at import time. The prompt and the check_return_eligibility tool read the same source of truth.
  5. Concrete refusal template. A single fill-in-the-blank refusal line for off-topic requests, quoted in <scope> and referenced from both <critical_rules> and <reminders>. Templates shrink the decision space and keep things clear and simple for the user.
  6. Few-shot examples for the ambiguous cases only. Missing order ID, supported policy lookup, off-topic refusal, multi-order disambiguation.
  7. Plain text only. Explicit instruction to avoid markdown — the chat UI does not render it, and **bold** would print as raw asterisks.
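
Principle 4 (policy as data) can be illustrated with a minimal sketch; the keys in `RETURN_POLICY` below are hypothetical stand-ins for the real dict, which both the prompt and `check_return_eligibility` read:

```python
# Hypothetical subset of the policy dict; the real one lives in the app.
RETURN_POLICY = {
    "window_days": 30,
    "condition": "unopened, original packaging",
    "refund_method": "original payment method",
}

def render_policy_block(policy: dict) -> str:
    """Render the policy dict verbatim into the <return_policy> prompt section."""
    lines = [f"- {key}: {value}" for key, value in policy.items()]
    return "<return_policy>\n" + "\n".join(lines) + "\n</return_policy>"
```

Rendering at import time means a policy edit changes the prompt and the tool in one place, so they cannot drift apart.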

Hallucination and safety controls

A system prompt is mostly reliable, but models forget or ignore instructions from time to time. I've added guardrails on tools (similar to hooks you'd see in Claude Code) to further enforce safety controls. There's also an output validation layer that uses good old-fashioned regex to prevent unapproved responses from being sent to the user.

| Layer | Catches | Cost |
| --- | --- | --- |
| 1. Prompt structure | Drift, tone, minor hallucinations | Tokens |
| 2. Runtime reminder injection | Long-conversation rule decay | Tokens |
| 3. Tool-side enforcement | Protocol violations even if the model ignores instructions | Code |
| 4. Output validation | Fabricated IDs/dates, markdown leakage, scope violations | Compute |

Layer 1 — prompt structure. Implemented in agent.SYSTEM_PROMPT per the seven principles above.

Layer 2 — runtime reminder injection. Before each messages.create call, build_system_content appends a short CRITICAL_REMINDER block to the system content. Once the conversation passes 5 turns, a stronger LONG_CONVERSATION_REMINDER is added. The big SYSTEM_PROMPT block carries cache_control: {"type": "ephemeral"} so it stays in the Anthropic prompt cache across turns; the reminder blocks are uncached so they can vary without busting the cache. Net per-turn cost: a few dozen tokens, plus cache reads on the long prompt.
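
A hedged sketch of what `build_system_content` might look like, assuming the block-list system-content shape described above (the reminder strings and the 5-turn threshold are placeholders for the real values):

```python
SYSTEM_PROMPT = "...the full XML-tagged prompt..."  # stands in for the real prompt
CRITICAL_REMINDER = "Re-read <critical_rules> before answering."
LONG_CONVERSATION_REMINDER = "This conversation is long; the rules still apply in full."

def build_system_content(turn_count: int) -> list[dict]:
    # The big static block carries cache_control so it stays in the
    # Anthropic prompt cache across turns; the reminder blocks are
    # uncached so they can vary without busting the cache.
    blocks = [
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        },
        {"type": "text", "text": CRITICAL_REMINDER},
    ]
    if turn_count > 5:
        blocks.append({"type": "text", "text": LONG_CONVERSATION_REMINDER})
    return blocks
```

The key design point is the split: everything that never changes goes before the cache breakpoint, everything that varies goes after it.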

Layer 3 — tool-side enforcement. Lives in tools.py. Each session carries a SessionGuardState with two sets: eligibility_checks_passed and returns_initiated. handle_initiate_return refuses with eligibility_not_verified unless the order is in the first set, and refuses already_initiated if it is in the second set. Even if the model ignores the system prompt entirely, it cannot start a return without going through the protocol. The error message is deliberately instructional — when the tool refuses, the model self-corrects on the next iteration of the tool-use loop. handle_lookup_order returns order_not_found (not a distinct auth error) on email mismatch to prevent enumeration.
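
The guard-state mechanics can be sketched as follows (a simplified illustration; the real `tools.py` handlers take different argument shapes, and the error payloads here are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class SessionGuardState:
    eligibility_checks_passed: set[str] = field(default_factory=set)
    returns_initiated: set[str] = field(default_factory=set)

def handle_initiate_return(state: SessionGuardState, order_id: str) -> dict:
    # Refuse with an instructional error so the model can self-correct
    # on the next iteration of the tool-use loop.
    if order_id not in state.eligibility_checks_passed:
        return {"error": "eligibility_not_verified",
                "message": "Call check_return_eligibility for this order first."}
    if order_id in state.returns_initiated:
        return {"error": "already_initiated",
                "message": "A return has already been started for this order."}
    state.returns_initiated.add(order_id)
    return {"status": "return_initiated", "order_id": order_id}
```

Because the check lives in code, not in the prompt, it holds even when the model ignores its instructions entirely.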

Layer 4 — output validation. Implemented in agent.validate_reply, run on every final assistant text reply before it leaves the server. Deterministic regex checks for: ungrounded BK- order IDs (mentioned but never returned by a tool this turn), ungrounded ISO dates, markdown leakage (**, __, leading # or bullets), and out-of-scope keyword engagement that does not also contain the refusal template. On any violation, the bad reply is dropped — replaced with SAFE_FALLBACK and never appended to history, so it cannot poison future turns. The validator is deliberately heuristic: it catches the cheap wins (fabricated IDs, made-up dates, formatting leaks) and trusts layers 1–3 for everything subtler. No second LLM call — that would compound cost and latency and add a new failure surface.
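
A minimal sketch of the validator's shape (the patterns, signature, and `SAFE_FALLBACK` wording here are illustrative, not the production values):

```python
import re

SAFE_FALLBACK = "Sorry, I wasn't able to answer that. Could you rephrase?"  # placeholder wording

ORDER_ID_RE = re.compile(r"\bBK-\d+\b")
ISO_DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
MARKDOWN_RE = re.compile(r"\*\*|__|^#|^[-*] ", re.MULTILINE)

def validate_reply(reply: str, tool_order_ids: set[str],
                   tool_dates: set[str]) -> tuple[str, list[str]]:
    """Return (reply, []) if clean, else (SAFE_FALLBACK, violation_codes)."""
    violations = []
    # Any order ID or date the tools did not return this turn is ungrounded.
    if set(ORDER_ID_RE.findall(reply)) - tool_order_ids:
        violations.append("ungrounded_order_id")
    if set(ISO_DATE_RE.findall(reply)) - tool_dates:
        violations.append("ungrounded_date")
    if MARKDOWN_RE.search(reply):
        violations.append("markdown_leakage")
    return (SAFE_FALLBACK, violations) if violations else (reply, [])
```

The caller drops the bad reply and appends nothing to history, so the hallucination never becomes context for the next turn.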

Production readiness

Bookly is running end-to-end, but a few things a team would add before scaling traffic are deliberately out of the current scope. In priority order:

Evals — three tiers.

  1. Tier 1, CI regression set. ~30 scripted scenarios covering the happy path, every refusal case, every tool failure mode, and a long-conversation drift test. Assertions target protocol (which tools were called, in which order, with which arguments) and Layer 4 violation codes, not exact wording. Deterministic via temperature 0 and a pinned model ID. Blocks merges.
  2. Tier 2, LLM-as-judge. A growing labeled dataset scored on grounding, refusal correctness, policy accuracy, tone, and clarifying-question quality. The judge itself is validated against a small golden dataset.
  3. Tier 3, online. Sample 15% of real conversations, run the Tier 2 judge asynchronously, alert on score regression. Flagged conversations feed back into the Tier 2 dataset.
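
A Tier 1 protocol assertion might look like this sketch, assuming a hypothetical trace format in which each tool call records its name, input, and result:

```python
def assert_protocol(tool_calls: list[dict]) -> None:
    """Fail if initiate_return was ever called without a prior passing
    check_return_eligibility for the same order (trace shape is assumed)."""
    verified: set[str] = set()
    for call in tool_calls:
        if call["name"] == "check_return_eligibility" and call["result"].get("eligible"):
            verified.add(call["input"]["order_id"])
        if call["name"] == "initiate_return":
            assert call["input"]["order_id"] in verified, "protocol violation"
```

Asserting on the tool-call sequence rather than exact wording is what keeps the regression set stable across model updates.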

Observability.

  • Per-turn structured trace indexed by session+turn, containing the full message history, tool calls with inputs/outputs, latency breakdown, token counts, validation result and violation codes, and whether the reply was appended to history. Without this you debug blind.
  • Metrics. Validation-failure rate by code, safe-fallback rate, refusal rate, eligibility-check-before-initiate_return compliance, per-tool error rate, p99 latency.
  • Alerts. Page on validation-failure spikes, safe-fallback spikes, tool-API errors, latency regressions.
  • Thumbs feedback wired to the trace ID, with low-rated turns auto-triaged into the Tier 2 dataset.

Tradeoffs explicitly chosen. Sessions are in-memory and would not survive a restart — fine for a single-node deployment, not for horizontal scale. The agent runs synchronously per request and does not stream — streaming would improve perceived latency but introduces a partial-validation problem (you cannot validate a reply you have not finished generating). The validator is heuristic and will miss semantic hallucinations — that is what the eval tiers are for.

The guardrails prevent bad outputs; the evals measure whether the guardrails are working; the observability tells you when they stop.