Bookly is a customer-support chatbot for a bookstore. It handles
three things: looking up orders, processing returns, and answering a
small set of standard policy questions. Everything else it refuses,
previous layers have failed.
code you see below is the code that runs. Tangling this file produces
the Python source tree byte-for-byte; weaving it produces the HTML you
are reading.
2 The four guardrail layers
Before anything else, it helps to see the layers laid out in one
picture. Each layer is a separate defence, and a malicious or confused
input has to defeat all of them to cause harm.
Layer 1 is the system prompt itself. It tells the model what Bookly
is, what it can and cannot help with, what the return policy actually
says (quoted verbatim, not paraphrased), and exactly which template to
of what the model decides. Layer 4 lives at the end of the agent loop
and does a deterministic regex pass over the final reply looking for
things like fabricated order IDs, markdown leakage, and off-topic
engagement.
3 Request lifecycle
A single user message travels this path:
4 Module layout
Five Python files form the core. They depend on each other in one
direction only – there are no cycles.
The rest of this document visits each module in dependency order:
configuration first, then the data fixtures they read, then tools, then
the agent loop, then the HTTP layer on top.
5 Configuration
Every setting that might reasonably change between environments lives
in one place. The two required values – the Anthropic API key and the
session-cookie signing secret – are wrapped in SecretStr so
limiter in server.py.
# and `session_secret` from environment / .env at runtime, but mypy sees them as
# required constructor arguments and has no way to know about that.
settings = Settings()  # type: ignore[call-arg]
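That closing fragment is easier to read with the whole shape of the settings block around it. Here is a minimal self-contained sketch, not the real config.py: SecretStr is a hand-rolled stand-in for pydantic's class of the same name, and only the two required secrets are shown.

```python
import os
from dataclasses import dataclass


class SecretStr:
    """Minimal stand-in for pydantic's SecretStr: masks the value in
    repr/str so it cannot leak through logging or tracebacks."""

    def __init__(self, value: str) -> None:
        self._value = value

    def get_secret_value(self) -> str:
        return self._value

    def __repr__(self) -> str:
        return "SecretStr('**********')"

    __str__ = __repr__


@dataclass
class Settings:
    anthropic_api_key: SecretStr
    session_secret: SecretStr


# The real module reads these from the environment / .env; the defaults
# here are placeholders for the sketch.
settings = Settings(
    anthropic_api_key=SecretStr(os.environ.get("ANTHROPIC_API_KEY", "sk-demo")),
    session_secret=SecretStr(os.environ.get("SESSION_SECRET", "change-me")),
)
```

The point of the wrapper is visible in one line: `repr(settings)` prints masked placeholders for both secrets, so an accidental log of the settings object reveals nothing.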
6 Data fixtures
Bookly does not talk to a real database. Four fixture orders are
enough to cover the interesting scenarios: a delivered order that is
still inside the 30-day return window, an in-flight order that has not
successful return.
# Mutated at runtime by `initiate_return`. Keyed by return_id.
RETURNS: dict[str, dict] = {}
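The ORDERS side of those fixtures might look like this. Field names and book titles are illustrative assumptions, not copied from data.py, and only two of the four scenarios are shown:

```python
from datetime import date, timedelta

# Illustrative fixtures in the spirit of the real data module.
ORDERS: dict[str, dict] = {
    "BK-1001": {  # delivered, still inside the 30-day return window
        "status": "delivered",
        "delivered_on": (date.today() - timedelta(days=5)).isoformat(),
        "items": ["The Left Hand of Darkness"],
    },
    "BK-1002": {  # in flight: shipped but not yet delivered
        "status": "shipped",
        "delivered_on": None,
        "items": ["A Wizard of Earthsea"],
    },
}
```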
7 Tools: Layer 3 enforcement
Four tools back the agent: lookup_order,
check_return_eligibility, initiate_return, and
lookup_policy. Each has an Anthropic-format schema (used in
the model is an untrusted caller, so the runtime checks stay.
# `_require_string` already stripped control characters, and the error
# messages themselves are constructed from field names, not user data.
return {"error": "invalid_arguments", "message": str(exc)}
8 Agent loop
This is the biggest file. It wires everything together: the system
prompt, runtime reminders, output validation (Layer 4), the in-memory
session store with per-session locking, the cached Anthropic client, and
the actual tool-use loop that drives a turn end to end.
8.1 System prompt
The prompt is structured with XML-style tags
(<identity>, <critical_rules>,
<scope>, <return_policy>,
enforcement in tools.py cannot disagree.
demonstrates a case that is easy to get wrong: missing order ID, quoting
a policy verbatim, refusing an off-topic request, disambiguating between
two orders.
8.2 Runtime reminders
On every turn, build_system_content appends a short
CRITICAL_REMINDER block to the system content. Once the
turn count crosses LONG_CONVERSATION_TURN_THRESHOLD, a
second LONG_CONVERSATION_REMINDER is added. The big
cache_control: ephemeral – the reminders vary per turn and
we want them at the highest-attention position, not in the cached
prefix.
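A hedged sketch of that assembly, with illustrative constants and simplified block shapes (the real function lives in agent.py and the threshold comes from config):

```python
LONG_CONVERSATION_TURN_THRESHOLD = 10  # illustrative value
CRITICAL_REMINDER = "Reminder: stay in scope and never invent order IDs."
LONG_CONVERSATION_REMINDER = "Long conversation: re-check <critical_rules> before replying."


def build_system_content(system_prompt: str, turn_count: int) -> list[dict]:
    # The big stable prompt is the cached prefix...
    blocks: list[dict] = [{
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"},
    }]
    # ...while the reminders vary per turn, so they stay out of the cached
    # prefix and land at the end, the highest-attention position.
    reminder = CRITICAL_REMINDER
    if turn_count >= LONG_CONVERSATION_TURN_THRESHOLD:
        reminder += "\n" + LONG_CONVERSATION_REMINDER
    blocks.append({"type": "text", "text": reminder})
    return blocks
```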
8.3 Layer 4 output validation
After the model produces its final reply, validate_reply
runs four cheap deterministic checks: every BK-NNNN string
in the reply must also appear in a tool result from this turn, every ISO
returned as a frozen ValidationResult.
keyword set. That false-positived on plenty of legitimate support
replies (“I’d recommend contacting…”). The current patterns use word
boundaries so only the intended phrases trip them.
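A minimal sketch of two of those checks under simplified signatures. The real validate_reply also verifies ISO dates, and its phrase list is longer than the illustrative patterns here:

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationResult:
    ok: bool
    reasons: tuple[str, ...] = ()


ORDER_ID_RE = re.compile(r"\bBK-\d{4}\b")
# Word boundaries keep replies like "I'd recommend contacting..." from
# tripping the off-topic check; these two phrases are illustrative.
OFF_TOPIC_RE = re.compile(r"\b(write code for|browse the web)\b", re.IGNORECASE)


def validate_reply(reply: str, tool_results_text: str) -> ValidationResult:
    reasons: list[str] = []
    # every BK-NNNN in the reply must be grounded in a tool result
    for order_id in set(ORDER_ID_RE.findall(reply)):
        if order_id not in tool_results_text:
            reasons.append(f"order id {order_id} not grounded in a tool result")
    if OFF_TOPIC_RE.search(reply):
        reasons.append("off-topic engagement")
    return ValidationResult(ok=not reasons, reasons=tuple(reasons))
```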
8.4 Session store
SessionStore is a bounded in-memory LRU with an idle
TTL. It stores Session objects (history, guard state, turn
count) keyed by opaque server-issued session IDs. It also owns the
two different lock instances.
Under the “single-process demo deployment” constraint this is enough.
For multi-worker, the whole class would get swapped for a Redis-backed
equivalent.
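The LRU-with-TTL shape can be sketched in a few dozen lines; capacity, TTL, and the session payload here are illustrative, not copied from agent.py:

```python
import threading
import time
from collections import OrderedDict


class SessionStore:
    """Bounded in-memory LRU with an idle TTL, in the spirit of the
    real class."""

    def __init__(self, max_sessions: int = 500, idle_ttl: float = 1800.0) -> None:
        self._max = max_sessions
        self._ttl = idle_ttl
        self._sessions: OrderedDict[str, dict] = OrderedDict()
        self._locks: dict[str, threading.Lock] = {}
        self._store_lock = threading.Lock()  # guards the two maps themselves

    def get_or_create(self, session_id: str) -> tuple[dict, threading.Lock]:
        with self._store_lock:
            now = time.monotonic()
            session = self._sessions.get(session_id)
            if session is not None and now - session["last_seen"] > self._ttl:
                session = None  # idle-expired: start fresh
            if session is None:
                session = {"history": [], "turn_count": 0, "last_seen": now}
            session["last_seen"] = now
            self._sessions[session_id] = session
            self._sessions.move_to_end(session_id)  # LRU touch
            while len(self._sessions) > self._max:
                evicted, _ = self._sessions.popitem(last=False)
                self._locks.pop(evicted, None)
            # one lock object per live session id, so two requests for the
            # same session can never hold two different lock instances
            lock = self._locks.setdefault(session_id, threading.Lock())
            return session, lock
```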
8.5 The tool-use loop
_run_tool_use_loop drives the model until it stops
asking for tools. It is bounded by
settings.max_tool_use_iterations so a runaway model cannot
with cache_control: ephemeral so prior turns do not need to
be re-tokenized on every call. This turns the total input-token cost
across a session from O(turns^2) into O(turns).
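The loop shape can be sketched with a stub "model" standing in for the Anthropic client. Message shapes are simplified, and the real loop also attaches cache_control to prior turns:

```python
def run_tool_use_loop(call_model, run_tool, messages: list[dict],
                      max_iterations: int = 5) -> str:
    """Drive the model until it stops asking for tools, or the budget runs out."""
    for _ in range(max_iterations):
        response = call_model(messages)
        if response["stop_reason"] != "tool_use":
            return response["text"]  # final reply: the model is done
        # Execute the requested tool and feed the result back, exactly one
        # tool round per iteration.
        result = run_tool(response["tool_name"], response["tool_input"])
        messages.append({"role": "assistant", "content": response})
        messages.append({"role": "user",
                         "content": [{"type": "tool_result", "content": result}]})
    raise RuntimeError("max_tool_use_iterations exceeded")  # runaway model
```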
8.6 run_turn
run_turn is the top-level entry point the server calls.
It validates its inputs, acquires the per-session lock, appends the user
message, runs the loop, and then either persists the final reply to
response.
)
session.turn_count += 1
return reply_text
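Those closing lines read better with the whole function shape around them. A hedged sketch with illustrative names, a duck-typed store, and the loop passed in as a callable:

```python
def run_turn(store, session_id: str, user_message: str, run_loop) -> str:
    """Validate input, lock the session, run the loop, persist the reply."""
    if not user_message or not user_message.strip():
        raise ValueError("empty message")
    session, lock = store.get_or_create(session_id)
    with lock:  # one turn at a time per session
        session["history"].append({"role": "user", "content": user_message})
        reply_text = run_loop(session["history"])
        session["history"].append({"role": "assistant", "content": reply_text})
        session["turn_count"] += 1
        return reply_text
```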
9 HTTP surface
The FastAPI app exposes four routes: GET /health,
GET / (redirects to /static/index.html),
POST /api/chat, and GET /architecture (this
very document). Everything else is deliberately missing – the OpenAPI
docs pages and the redoc pages are disabled so the public surface is as
small as possible.
9.1 Security headers
A middleware injects a strict Content-Security-Policy and friends on
every response. CSP is defense in depth: the chat UI in
static/chat.js already renders model replies with
regression that accidentally switches to innerHTML.
The /architecture route overrides the middleware CSP
with a more permissive one because pandoc’s standalone HTML has inline
styles.
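The header set and the per-route override can be sketched as plain data plus one function; the exact header values are assumptions, not copied from server.py:

```python
SECURITY_HEADERS = {
    "Content-Security-Policy": "default-src 'self'; script-src 'self'; object-src 'none'",
    "X-Content-Type-Options": "nosniff",
    "X-Frame-Options": "DENY",
    "Referrer-Policy": "no-referrer",
}
# pandoc's standalone HTML carries inline styles, so this one route needs
# a relaxed policy
RELAXED_CSP = "default-src 'self'; style-src 'self' 'unsafe-inline'"


def with_security_headers(headers: dict[str, str], path: str) -> dict[str, str]:
    merged = {**headers, **SECURITY_HEADERS}
    if path == "/architecture":
        merged["Content-Security-Policy"] = RELAXED_CSP
    return merged
```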
9.2 Sliding-window rate limiter
SlidingWindowRateLimiter keeps a deque of timestamps per
key and evicts anything older than the window. The
/api/chat handler checks twice per call – once with an
cookies, and a legitimate user does not get locked out by a noisy
neighbour on the same IP.
Suitable for a single-process demo deployment. A multi-worker
deployment would externalize this to Redis.
9.3 Session cookies
The client never chooses its own session ID. On the first request a
new random ID is minted, HMAC-signed with
settings.session_secret, and set in an HttpOnly,
verifies the signature in constant time
(hmac.compare_digest) and trusts nothing else. A leaked or
guessed request body cannot hijack another user’s conversation because
the session ID is not in the body at all.
9.4 /api/chat
The handler resolves the session, checks both rate limits, then calls
into agent.run_turn. The Anthropic exception hierarchy is
caught explicitly so a rate-limit incident and a code bug cannot look
identical to operators: anthropic.RateLimitError becomes
503, APIConnectionError becomes 503,
APIStatusError becomes 502, ValueError from
the agent becomes 400, anything else becomes 500.
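That mapping can be sketched as data. The real handler is a try/except chain over the anthropic exception classes; this version matches by class name so the sketch stays importable without the SDK installed:

```python
ERROR_STATUS = {
    "RateLimitError": 503,
    "APIConnectionError": 503,
    "APIStatusError": 502,
    "ValueError": 400,
}


def status_for(exc: BaseException) -> int:
    """Map an exception to an HTTP status, walking the MRO so subclasses
    of a mapped class get the same status."""
    for klass in type(exc).__mro__:
        if klass.__name__ in ERROR_STATUS:
            return ERROR_STATUS[klass.__name__]
    return 500  # anything else is an opaque server error
```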
9.5 /architecture
This is where the woven literate program is served. The handler reads
static/architecture.html (produced by pandoc from this
file) and returns it with a relaxed CSP. If the file does not exist yet,