"""Generate Bookly.lit.md from a template plus the current source files. This script is invoked once to bootstrap the literate program. Edits after that should go into Bookly.lit.md directly, with `tangle.ts` regenerating the source files. See the reverse-sync hook in .claude/settings.local.json for the path where source-file edits feed back into the .lit.md. """ from __future__ import annotations import textwrap from pathlib import Path ROOT = Path(__file__).resolve().parent.parent def _read(path: str) -> str: return (ROOT / path).read_text(encoding="utf-8") def _chunk(language: str, name: str, file_path: str, body: str) -> str: # A chunk fence. The body is embedded verbatim -- every character of the # file must round-trip through tangling, so we never rewrap or reformat. if body.endswith("\n"): body = body[:-1] return f'```{language} {{chunk="{name}" file="{file_path}"}}\n{body}\n```' def main() -> None: config_py = _read("config.py") mock_data_py = _read("mock_data.py") tools_py = _read("tools.py") agent_py = _read("agent.py") server_py = _read("server.py") out = textwrap.dedent( """\ --- title: "Bookly" --- # Introduction Bookly is a customer-support chatbot for a bookstore. It handles three things: looking up orders, processing returns, and answering a small set of standard policy questions. Everything else it refuses, using a verbatim template. The interesting engineering is not the feature set. It is the guardrails. A chat agent wired to real tools can hallucinate order details, leak private information, skip verification steps, or wander off topic -- and the consequences land on real customers. Bookly defends against that with four independent layers, each of which assumes the previous layers have failed. This document is both the prose walkthrough and the source code. The code you see below is the code that runs. Tangling this file produces the Python source tree byte-for-byte; weaving it produces the HTML you are reading. 
# The four guardrail layers

Before anything else, it helps to see the layers laid out in one picture. Each layer is a separate defence, and a malicious or confused input has to defeat all of them to cause harm.

```mermaid
graph TD
    U[User message]
    L1["Layer 1: System prompt<br/>identity, critical_rules, scope,<br/>verbatim policy, refusal template"]
    L2["Layer 2: Runtime reminders<br/>injected every turn +<br/>long-conversation re-anchor"]
    M[Claude]
    T{Tool use?}
    L3["Layer 3: Tool-side enforcement<br/>input validation +<br/>protocol guard<br/>eligibility before return"]
    L4["Layer 4: Output validation<br/>regex grounding checks,<br/>markdown / off-topic / ID / date"]
    OK[Reply to user]
    BAD["Safe fallback,<br/>bad reply dropped from history"]
    U --> L1
    L1 --> L2
    L2 --> M
    M --> T
    T -- yes --> L3
    L3 --> M
    T -- no --> L4
    L4 -- ok --> OK
    L4 -- violations --> BAD
```

Layer 1 is the system prompt itself. It tells the model what Bookly is, what it can and cannot help with, what the return policy actually says (quoted verbatim, not paraphrased), and exactly which template to use when refusing. Layer 2 adds short reminder blocks on every turn so the model re-reads the non-negotiable rules at the highest-attention position, right before the user turn. Layer 3 lives in `tools.py`: the tool handlers refuse unsafe calls regardless of what the model decides. Layer 4 lives at the end of the agent loop and does a deterministic regex pass over the final reply, looking for things like fabricated order IDs, markdown leakage, and off-topic engagement.

# Request lifecycle

A single user message travels this path:

```mermaid
sequenceDiagram
    autonumber
    participant B as Browser
    participant N as nginx
    participant S as FastAPI
    participant A as agent.run_turn
    participant C as Claude
    participant TL as tools.dispatch_tool
    B->>N: POST /api/chat { message }
    N->>S: proxy_pass
    S->>S: security_headers middleware
    S->>S: resolve_session (cookie)
    S->>S: rate limit (ip + session)
    S->>A: run_turn(session_id, message)
    A->>A: SessionStore.get_or_create<br/>+ per-session lock
    A->>C: messages.create(tools, system, history)
    loop tool_use
        C-->>A: tool_use blocks
        A->>TL: dispatch_tool(name, args, state)
        TL-->>A: tool result
        A->>C: messages.create(history+tool_result)
    end
    C-->>A: final text
    A->>A: validate_reply (layer 4)
    A-->>S: reply text
    S-->>B: { reply }
```

# Module layout

Five Python files form the core. They depend on each other in one direction only -- there are no cycles.

```mermaid
graph LR
    MD["mock_data.py<br/>ORDERS, POLICIES, RETURN_POLICY"]
    C["config.py<br/>Settings"]
    T["tools.py<br/>schemas, handlers, dispatch"]
    A["agent.py<br/>SessionStore, run_turn, validate"]
    SV["server.py<br/>FastAPI, middleware, routes"]
    MD --> T
    MD --> A
    C --> T
    C --> A
    C --> SV
    T --> A
    A --> SV
```

The rest of this document visits each module in dependency order: configuration first, then the data fixtures the other modules read, then tools, then the agent loop, then the HTTP layer on top.

# Configuration

Every setting that might reasonably change between environments lives in one place. The two required values -- the Anthropic API key and the session-cookie signing secret -- are wrapped in `SecretStr` so an accidental `print(settings)` cannot leak them to a log. Everything else has a default that is safe for local development and reasonable for a small production deployment.

A few knobs are worth noticing:

- `max_tool_use_iterations` bounds the Layer-3 loop in `agent.py`. A model that keeps asking for tools will not burn API credit forever.
- `session_store_max_entries` and `session_idle_ttl_seconds` cap the in-memory `SessionStore`, so a trivial script that opens millions of sessions cannot OOM the process.
- `rate_limit_per_ip_per_minute` and `rate_limit_per_session_per_minute` feed the sliding-window limiter in `server.py`.
"""
    )
    out += _chunk("python", "config-py", "config.py", config_py) + "\n\n"
    out += textwrap.dedent(
        """\
# Data fixtures

Bookly does not talk to a real database. Four fixture orders are enough to cover the interesting scenarios: a delivered order that is still inside the 30-day return window, an in-flight order that has not been delivered yet, a processing order that has not shipped, and an old delivered order outside the return window. Sarah Chen owns two of the four, so the agent has to disambiguate when she says "my order".

The `RETURN_POLICY` dict is the single source of truth for policy facts. Two things read it: the system prompt (via `_format_return_policy_block` in `agent.py`, which renders it as the return-policy section the model must quote) and the `check_return_eligibility` handler (which enforces the window in code).
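The shape of that pattern can be sketched in a few lines. This is illustrative only -- the field names and helper functions are hypothetical, and only the 30-day window is taken from the real policy:

```python
from datetime import date, timedelta

# One policy dict feeds both consumers. Field names are hypothetical;
# the real dict lives in mock_data.py.
RETURN_POLICY = {'window_days': 30, 'condition': 'unused, in original packaging'}

def policy_prompt_block():
    # Rendered verbatim into the system prompt (Layer 1).
    return ('Returns are accepted within '
            + str(RETURN_POLICY['window_days']) + ' days of delivery.')

def within_window(delivered_on, today):
    # Enforced in code by the eligibility handler (Layer 3).
    return (today - delivered_on) <= timedelta(days=RETURN_POLICY['window_days'])

print(within_window(date(2024, 5, 1), date(2024, 5, 20)))  # True
print(within_window(date(2024, 1, 1), date(2024, 5, 20)))  # False
```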
Having one copy prevents the two from drifting apart.

`POLICIES` is a tiny FAQ keyed by topic. The `lookup_policy` tool returns one of these entries verbatim, and the system prompt instructs the model to quote the response without paraphrasing. This is a deliberate anti-hallucination pattern: the less the model has to generate, the less it can make up.

`RETURNS` is the only mutable state in this file. `initiate_return` writes a new RMA record to it on each successful return.
"""
    )
    out += _chunk("python", "mock-data-py", "mock_data.py", mock_data_py) + "\n\n"
    out += textwrap.dedent(
        """\
# Tools: Layer 3 enforcement

Four tools back the agent: `lookup_order`, `check_return_eligibility`, `initiate_return`, and `lookup_policy`. Each has an Anthropic-format schema (used in the `tools` argument to `messages.create`) and a handler function that takes a validated arg dict plus the per-session guard state and returns a dict that becomes the `tool_result` content sent back to the model.

The most important guardrail in the entire system lives in this file. `handle_initiate_return` refuses unless `check_return_eligibility` has already succeeded for the same order in the same session. This is enforced in code, not in the prompt -- if a model somehow decides to skip the eligibility check, the tool itself refuses. This is "Layer 3" in the stack: the model's last line of defence against itself.

A second guardrail is the privacy boundary in `handle_lookup_order`. When a caller supplies a `customer_email` and it does not match the email on the order, the handler returns the same `order_not_found` error as a missing order. This mirroring means an attacker cannot probe for which order IDs exist by watching response differences. The check uses `hmac.compare_digest` for a constant-time comparison, so response-time side channels cannot leak the correct email prefix either.

Input validation lives in `_require_*` helpers at the top of the file.
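A minimal validator in that style might look like this. The regex and the return shapes are assumptions for illustration; the real helpers and patterns live in `tools.py`:

```python
import re

ORDER_ID_RE = re.compile('^BK-[0-9]{4}$')  # assumed pattern for BK-NNNN IDs

def require_order_id(raw):
    # Strip control characters first, then pattern-check; anything
    # unexpected becomes a structured error the model can recover from.
    cleaned = ''.join(ch for ch in str(raw) if ord(ch) >= 32)
    if not ORDER_ID_RE.fullmatch(cleaned):
        return {'error': 'invalid_arguments'}
    return {'order_id': cleaned}

print(require_order_id('BK-1001'))     # {'order_id': 'BK-1001'}
print(require_order_id('DROP TABLE'))  # {'error': 'invalid_arguments'}
```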
Every string is control-character-stripped before length checks, so a malicious `\\x00` byte injected into a tool arg cannot sneak into the tool-result JSON and reappear in the next turn's prompt. Order IDs, emails, and policy topics are validated with tight regexes; unexpected input becomes a structured `invalid_arguments` error that the model can recover from on its next turn.

`TypedDict` argument shapes make the schema-to-handler contract visible to the type checker without losing runtime validation -- the model is an untrusted caller, so the runtime checks stay.
"""
    )
    out += _chunk("python", "tools-py", "tools.py", tools_py) + "\n\n"
    out += textwrap.dedent(
        """\
# Agent loop

This is the biggest file. It wires everything together: the system prompt, runtime reminders, output validation (Layer 4), the in-memory session store with per-session locking, the cached Anthropic client, and the actual tool-use loop that drives a turn end to end.

## System prompt

The prompt is structured with XML-style tags marking off its sections -- identity, the critical rules, scope, the verbatim return policy, the refusal template, and a few others. The critical rules are stated up front and repeated at the bottom (primacy plus recency). The return-policy section interpolates the `RETURN_POLICY` dict verbatim via `_format_return_policy_block`, so the prompt and the enforcement in `tools.py` cannot disagree.

Four few-shot examples are embedded directly in the prompt. Each one demonstrates a case that is easy to get wrong: missing order ID, quoting a policy verbatim, refusing an off-topic request, disambiguating between two orders.

## Runtime reminders

On every turn, `build_system_content` appends a short `CRITICAL_REMINDER` block to the system content. Once the turn count crosses `LONG_CONVERSATION_TURN_THRESHOLD`, a second `LONG_CONVERSATION_REMINDER` is added. The big `SYSTEM_PROMPT` block is the only one marked `cache_control: ephemeral` -- the reminders vary per turn, and we want them at the highest-attention position, not in the cached prefix.
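A sketch of how that assembly might look, with placeholder prompt text and an assumed threshold value (the block shapes follow the Anthropic messages API; the real constants live in `agent.py`):

```python
SYSTEM_PROMPT = 'You are Bookly, a bookstore support agent...'   # placeholder
CRITICAL_REMINDER = 'Reminder: follow the critical rules.'       # placeholder
LONG_CONVERSATION_REMINDER = 'Re-read your critical rules now.'  # placeholder
LONG_CONVERSATION_TURN_THRESHOLD = 10  # assumed value

def build_system_content(turn_count):
    # The big stable prompt carries the cache breakpoint; the short
    # reminders stay out of the cached prefix so they land last.
    blocks = [{'type': 'text', 'text': SYSTEM_PROMPT,
               'cache_control': {'type': 'ephemeral'}},
              {'type': 'text', 'text': CRITICAL_REMINDER}]
    if turn_count >= LONG_CONVERSATION_TURN_THRESHOLD:
        blocks.append({'type': 'text', 'text': LONG_CONVERSATION_REMINDER})
    return blocks

print(len(build_system_content(3)))   # 2 blocks early in a conversation
print(len(build_system_content(12)))  # 3 once the threshold is crossed
```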
## Layer 4 output validation

After the model produces its final reply, `validate_reply` runs four cheap deterministic checks: every `BK-NNNN` string in the reply must also appear in a tool result from this turn, every ISO date in the reply must appear in a tool result, the reply must not contain markdown, and if the reply contains off-topic engagement phrases it must also contain the refusal template. Violations are collected and returned as a frozen `ValidationResult`.

The off-topic patterns used to be loose substring matches on a keyword set. That produced false positives on plenty of legitimate support replies ("I'd recommend contacting..."). The current patterns use word boundaries so only the intended phrases trip them.

## Session store

`SessionStore` is a bounded in-memory LRU with an idle TTL. It stores `Session` objects (history, guard state, turn count) keyed by opaque server-issued session IDs. It also owns the per-session locks used to serialize concurrent turns for the same session: FastAPI runs the sync `chat` handler in a threadpool, and two simultaneous requests for the same session would otherwise corrupt the conversation history. The locks dict is itself protected by a class-level lock, so two threads trying to create the first lock for a session cannot race into two different lock instances.

Under the "single-process demo deployment" constraint this is enough. For multi-worker, the whole class would be swapped for a Redis-backed equivalent.

## The tool-use loop

`_run_tool_use_loop` drives the model until it stops asking for tools. It is bounded by `settings.max_tool_use_iterations`, so a runaway model cannot burn credit in an infinite loop. Each iteration serializes the assistant's content blocks into history, dispatches every requested tool, packs the results into a single `tool_result` user-role message, and calls Claude again.
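Stripped of serialization details and error handling, the control flow is roughly this -- a sketch under assumed block shapes, not the real implementation:

```python
def run_tool_use_loop(call_model, dispatch_tool, history, max_iterations):
    # Bounded: a model that never stops requesting tools hits the
    # iteration cap instead of looping forever.
    for _ in range(max_iterations):
        blocks = call_model(history)
        tool_calls = [b for b in blocks if b.get('type') == 'tool_use']
        if not tool_calls:
            return blocks  # final text -- handed to Layer 4 validation
        history.append({'role': 'assistant', 'content': blocks})
        results = [{'type': 'tool_result', 'tool_use_id': c['id'],
                    'content': dispatch_tool(c['name'], c['input'])}
                   for c in tool_calls]
        history.append({'role': 'user', 'content': results})
    raise RuntimeError('tool-use iteration limit exceeded')
```

Note that all tool results from one iteration travel back in a single user-role message, as described above.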
Before each call, `_with_last_message_cache_breakpoint` stamps the last message with `cache_control: ephemeral`, so prior turns do not need to be re-tokenized on every call. This turns the per-turn input-token cost from `O(turns^2)` into `O(turns)` across a session.

## run_turn

`run_turn` is the top-level entry point the server calls. It validates its inputs, acquires the per-session lock, appends the user message, runs the loop, and then either persists the final reply to history or -- on validation failure -- drops the bad reply and returns a safe fallback. Dropping a bad reply from history is important: it prevents a hallucinated claim from poisoning subsequent turns.

Warning logs never include the reply body. Session IDs and reply contents are logged only as short SHA-256 hashes for correlation, which keeps PII out of the log pipeline even under active incident response.
"""
    )
    out += _chunk("python", "agent-py", "agent.py", agent_py) + "\n\n"
    out += textwrap.dedent(
        """\
# HTTP surface

The FastAPI app exposes four routes: `GET /health`, `GET /` (redirects to `/static/index.html`), `POST /api/chat`, and `GET /architecture` (this very document). Everything else is deliberately missing -- the OpenAPI docs and redoc pages are disabled so the public surface stays as small as possible.

## Security headers

A middleware injects a strict Content-Security-Policy and friends on every response. The CSP is defence in depth: the chat UI in `static/chat.js` already renders model replies with `textContent` rather than `innerHTML`, so XSS is structurally impossible today. The CSP exists to catch any future regression that accidentally switches to `innerHTML`. The `/architecture` route overrides the middleware CSP with a more permissive one because pandoc's standalone HTML has inline styles.

## Sliding-window rate limiter

`SlidingWindowRateLimiter` keeps a deque of timestamps per key and evicts anything older than the window.
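A minimal version of that limiter, with an injectable clock for testing. The constructor parameters and internals here are assumptions; the real class lives in `server.py`:

```python
import time
from collections import defaultdict, deque

class SlidingWindowRateLimiter:
    # Per-key deque of timestamps; stale entries are evicted on each check.
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.hits = defaultdict(deque)

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[key]
        while q and now - q[0] > self.window:
            q.popleft()  # evict timestamps outside the window
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True

limiter = SlidingWindowRateLimiter(limit=2, window_seconds=60.0)
print(limiter.allow('ip:203.0.113.9', now=0.0))   # True
print(limiter.allow('ip:203.0.113.9', now=1.0))   # True
print(limiter.allow('ip:203.0.113.9', now=2.0))   # False
print(limiter.allow('ip:203.0.113.9', now=61.5))  # True again after eviction
```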
The `/api/chat` handler checks twice per call -- once with an `ip:` prefix, once with a `session:` prefix -- so a single attacker cannot exhaust the per-session budget by rotating cookies, and a legitimate user does not get locked out by a noisy neighbour on the same IP.

This is suitable for a single-process demo deployment. A multi-worker deployment would externalize the limiter to Redis.

## Session cookies

The client never chooses its own session ID. On the first request a new random ID is minted, HMAC-signed with `settings.session_secret`, and set in an HttpOnly, SameSite=Lax cookie. Subsequent requests carry the cookie; the server verifies the signature in constant time (`hmac.compare_digest`) and trusts nothing else. A leaked or guessed request body cannot hijack another user's conversation, because the session ID is not in the body at all.

## /api/chat

The handler resolves the session, checks both rate limits, then calls into `agent.run_turn`. The Anthropic exception hierarchy is caught explicitly so a rate-limit incident and a code bug cannot look identical to operators: `anthropic.RateLimitError` becomes 503, `APIConnectionError` becomes 503, `APIStatusError` becomes 502, a `ValueError` from the agent becomes 400, and anything else becomes 500.

## /architecture

This is where the woven literate program is served. The handler reads `static/architecture.html` (produced by pandoc from this file) and returns it with a relaxed CSP. If the file does not exist yet, the route 404s with a clear message rather than raising a 500.
"""
    )
    out += _chunk("python", "server-py", "server.py", server_py) + "\n"

    out_path = ROOT / "Bookly.lit.md"
    out_path.write_text(out, encoding="utf-8")
    print(f"wrote {out_path} ({len(out.splitlines())} lines)")


if __name__ == "__main__":
    main()