bookly/scripts/build_litmd.py

"""Generate Bookly.lit.md from a template plus the current source files.

This script is invoked once to bootstrap the literate program. Edits after
that should go into Bookly.lit.md directly, with `tangle.ts` regenerating
the source files. See the reverse-sync hook in .claude/settings.local.json
for the path where source-file edits feed back into the .lit.md.
"""

from __future__ import annotations

import textwrap
from pathlib import Path

ROOT = Path(__file__).resolve().parent.parent


def _read(path: str) -> str:
    return (ROOT / path).read_text(encoding="utf-8")


def _chunk(language: str, name: str, file_path: str, body: str) -> str:
    # A chunk fence. The body is embedded verbatim -- every character of the
    # file must round-trip through tangling, so we never rewrap or reformat.
    if body.endswith("\n"):
        body = body[:-1]
    return f'```{language} {{chunk="{name}" file="{file_path}"}}\n{body}\n```'


def main() -> None:
    config_py = _read("config.py")
    mock_data_py = _read("mock_data.py")
    tools_py = _read("tools.py")
    agent_py = _read("agent.py")
    server_py = _read("server.py")

    out = textwrap.dedent(
        """\
        ---
        title: "Bookly"
        ---

        # Introduction

        Bookly is a customer-support chatbot for a bookstore. It handles three
        things: looking up orders, processing returns, and answering a small
        set of standard policy questions. Everything else it refuses, using a
        verbatim template.

        The interesting engineering is not the feature set. It is the
        guardrails. A chat agent wired to real tools can hallucinate order
        details, leak private information, skip verification steps, or wander
        off topic -- and the consequences land on real customers. Bookly
        defends against that with four independent layers, each of which
        assumes the previous layers have failed.

        This document is both the prose walkthrough and the source code. The
        code you see below is the code that runs. Tangling this file produces
        the Python source tree byte-for-byte; weaving it produces the HTML
        you are reading.

        # The four guardrail layers

        Before anything else, it helps to see the layers laid out in one
        picture. Each layer is a separate defence, and a malicious or
        confused input has to defeat all of them to cause harm.

        ```mermaid
        graph TD
          U[User message]
          L1[Layer 1: System prompt<br/>identity, critical_rules, scope,<br/>verbatim policy, refusal template]
          L2[Layer 2: Runtime reminders<br/>injected every turn +<br/>long-conversation re-anchor]
          M[Claude]
          T{Tool use?}
          L3[Layer 3: Tool-side enforcement<br/>input validation +<br/>protocol guard<br/>eligibility before return]
          L4[Layer 4: Output validation<br/>regex grounding checks,<br/>markdown / off-topic / ID / date]
          OK[Reply to user]
          BAD[Safe fallback,<br/>bad reply dropped from history]

          U --> L1
          L1 --> L2
          L2 --> M
          M --> T
          T -- yes --> L3
          L3 --> M
          T -- no --> L4
          L4 -- ok --> OK
          L4 -- violations --> BAD
        ```

        Layer 1 is the system prompt itself. It tells the model what Bookly
        is, what it can and cannot help with, what the return policy actually
        says (quoted verbatim, not paraphrased), and exactly which template
        to use when refusing. Layer 2 adds short reminder blocks on every
        turn so the model re-reads the non-negotiable rules at the
        highest-attention position right before the user turn. Layer 3 lives
        in `tools.py`: the tool handlers refuse unsafe calls regardless of
        what the model decides. Layer 4 lives at the end of the agent loop
        and does a deterministic regex pass over the final reply looking
        for things like fabricated order IDs, markdown leakage, and
        off-topic engagement.

        # Request lifecycle

        A single user message travels this path:

        ```mermaid
        sequenceDiagram
          autonumber
          participant B as Browser
          participant N as nginx
          participant S as FastAPI
          participant A as agent.run_turn
          participant C as Claude
          participant TL as tools.dispatch_tool

          B->>N: POST /api/chat { message }
          N->>S: proxy_pass
          S->>S: security_headers middleware
          S->>S: resolve_session (cookie)
          S->>S: rate limit (ip + session)
          S->>A: run_turn(session_id, message)
          A->>A: SessionStore.get_or_create<br/>+ per-session lock
          A->>C: messages.create(tools, system, history)
          loop tool_use
            C-->>A: tool_use blocks
            A->>TL: dispatch_tool(name, args, state)
            TL-->>A: tool result
            A->>C: messages.create(history+tool_result)
          end
          C-->>A: final text
          A->>A: validate_reply (layer 4)
          A-->>S: reply text
          S-->>B: { reply }
        ```

        # Module layout

        Five Python files form the core. They depend on each other in one
        direction only -- there are no cycles.

        ```mermaid
        graph LR
          MD[mock_data.py<br/>ORDERS, POLICIES, RETURN_POLICY]
          C[config.py<br/>Settings]
          T[tools.py<br/>schemas, handlers, dispatch]
          A[agent.py<br/>SessionStore, run_turn, validate]
          SV[server.py<br/>FastAPI, middleware, routes]

          MD --> T
          MD --> A
          C --> T
          C --> A
          C --> SV
          T --> A
          A --> SV
        ```

        The rest of this document visits each module in dependency order:
        configuration first, then the data fixtures they read, then tools,
        then the agent loop, then the HTTP layer on top.

        # Configuration

        Every setting that might reasonably change between environments
        lives in one place. The two required values -- the Anthropic API
        key and the session-cookie signing secret -- are wrapped in
        `SecretStr` so an accidental `print(settings)` cannot leak them to
        a log.

        Everything else has a default that is safe for local development
        and reasonable for a small production deployment. A few knobs are
        worth noticing:

        - `max_tool_use_iterations` bounds the Layer-3 loop in `agent.py`.
          A model that keeps asking for tools forever will not burn API
          credit forever.
        - `session_store_max_entries` and `session_idle_ttl_seconds` cap
          the in-memory `SessionStore`, so a trivial script that opens
          millions of sessions cannot OOM the process.
        - `rate_limit_per_ip_per_minute` and
          `rate_limit_per_session_per_minute` feed the sliding-window
          limiter in `server.py`.

        """
    )

    out += _chunk("python", "config-py", "config.py", config_py) + "\n\n"

    out += textwrap.dedent(
        """\
        # Data fixtures

        Bookly does not talk to a real database. Four fixture orders are
        enough to cover the interesting scenarios: a delivered order that
        is still inside the 30-day return window, an in-flight order that
        has not been delivered yet, a processing order that has not
        shipped, and an old delivered order outside the return window.
        Sarah Chen owns two of the four so the agent has to disambiguate
        when she says "my order".

        The `RETURN_POLICY` dict is the single source of truth for policy
        facts. Two things read it: the system prompt (via
        `_format_return_policy_block` in `agent.py`, which renders it as
        the `<return_policy>` section the model must quote) and the
        `check_return_eligibility` handler (which enforces the window in
        code). Having one copy prevents the two from drifting apart.

        `POLICIES` is a tiny FAQ keyed by topic. The `lookup_policy` tool
        returns one of these entries verbatim and the system prompt
        instructs the model to quote the response without paraphrasing.
        This is a deliberate anti-hallucination pattern: the less the
        model has to generate, the less it can make up.

        `RETURNS` is the only mutable state in this file. `initiate_return`
        writes a new RMA record to it on each successful return.

        """
    )

    out += _chunk("python", "mock-data-py", "mock_data.py", mock_data_py) + "\n\n"

    out += textwrap.dedent(
        """\
        # Tools: Layer 3 enforcement

        Four tools back the agent: `lookup_order`, `check_return_eligibility`,
        `initiate_return`, and `lookup_policy`. Each has an Anthropic-format
        schema (used in the `tools` argument to `messages.create`) and a
        handler function that takes a validated arg dict plus the
        per-session guard state and returns a dict that becomes the
        `tool_result` content sent back to the model.

        The most important guardrail in the entire system lives in this
        file. `handle_initiate_return` refuses unless
        `check_return_eligibility` has already succeeded for the same
        order in the same session. This is enforced in code, not in the
        prompt -- if a model somehow decides to skip the eligibility
        check, the tool itself refuses. This is "Layer 3" in the stack:
        the model's last line of defence against itself.

        A second guardrail is the privacy boundary in `handle_lookup_order`.
        When a caller supplies a `customer_email` and it does not match
        the email on the order, the handler returns the same
        `order_not_found` error as a missing order. This mirror means an
        attacker cannot probe for which order IDs exist by watching
        response differences. The check uses `hmac.compare_digest` for
        constant-time comparison so response-time side channels cannot
        leak the correct email prefix either.

        Input validation lives in `_require_*` helpers at the top of the
        file. Every string is control-character-stripped before length
        checks so a malicious `\\x00` byte injected into a tool arg cannot
        sneak into the tool result JSON and reappear in the next turn's
        prompt. Order IDs, emails, and policy topics are validated with
        tight regexes; unexpected input becomes a structured
        `invalid_arguments` error that the model can recover from on its
        next turn.

        `TypedDict` argument shapes make the schema-to-handler contract
        visible to the type checker without losing runtime validation --
        the model is an untrusted caller, so the runtime checks stay.

        """
    )

    out += _chunk("python", "tools-py", "tools.py", tools_py) + "\n\n"

    out += textwrap.dedent(
        """\
        # Agent loop

        This is the biggest file. It wires everything together: the system
        prompt, runtime reminders, output validation (Layer 4), the
        in-memory session store with per-session locking, the cached
        Anthropic client, and the actual tool-use loop that drives a turn
        end to end.

        ## System prompt

        The prompt is structured with XML-style tags (`<identity>`,
        `<critical_rules>`, `<scope>`, `<return_policy>`, `<tool_rules>`,
        `<tone>`, `<examples>`, `<reminders>`). The critical rules are
        stated up front and repeated at the bottom (primacy plus recency).
        The return policy section interpolates the `RETURN_POLICY` dict
        verbatim via `_format_return_policy_block`, so the prompt and the
        enforcement in `tools.py` cannot disagree.

        Four few-shot examples are embedded directly in the prompt. Each
        one demonstrates a case that is easy to get wrong: missing order
        ID, quoting a policy verbatim, refusing an off-topic request,
        disambiguating between two orders.

        ## Runtime reminders

        On every turn, `build_system_content` appends a short
        `CRITICAL_REMINDER` block to the system content. Once the turn
        count crosses `LONG_CONVERSATION_TURN_THRESHOLD`, a second
        `LONG_CONVERSATION_REMINDER` is added. The big `SYSTEM_PROMPT`
        block is the only one marked `cache_control: ephemeral` -- the
        reminders vary per turn and we want them at the
        highest-attention position, not in the cached prefix.

        ## Layer 4 output validation

        After the model produces its final reply, `validate_reply` runs
        four cheap deterministic checks: every `BK-NNNN` string in the
        reply must also appear in a tool result from this turn, every
        ISO date in the reply must appear in a tool result, the reply
        must not contain markdown, and if the reply contains off-topic
        engagement phrases it must also contain the refusal template.
        Violations are collected and returned as a frozen
        `ValidationResult`.

        The off-topic patterns used to be loose substring matches on a
        keyword set. That false-positived on plenty of legitimate support
        replies ("I'd recommend contacting..."). The current patterns
        use word boundaries so only the intended phrases trip them.

        ## Session store

        `SessionStore` is a bounded in-memory LRU with an idle TTL. It
        stores `Session` objects (history, guard state, turn count) keyed
        by opaque server-issued session IDs. It also owns the per-session
        locks used to serialize concurrent turns for the same session,
        since FastAPI runs the sync `chat` handler in a threadpool and
        two simultaneous requests for the same session would otherwise
        corrupt the conversation history.

        The locks-dict is itself protected by a class-level lock so two
        threads trying to create the first lock for a session cannot race
        into two different lock instances.

        Under the "single-process demo deployment" constraint this is
        enough. For multi-worker, the whole class would get swapped for
        a Redis-backed equivalent.

        ## The tool-use loop

        `_run_tool_use_loop` drives the model until it stops asking for
        tools. It is bounded by `settings.max_tool_use_iterations` so a
        runaway model cannot burn credit in an infinite loop. Each
        iteration serializes the assistant's content blocks into history,
        dispatches every requested tool, packs the results into a single
        `tool_result` user-role message, and calls Claude again. Before
        each call, `_with_last_message_cache_breakpoint` stamps the last
        message with `cache_control: ephemeral` so prior turns do not
        need to be re-tokenized on every call. This turns the per-turn
        input-token cost from `O(turns^2)` into `O(turns)` across a
        session.

        ## run_turn

        `run_turn` is the top-level entry point the server calls. It
        validates its inputs, acquires the per-session lock, appends the
        user message, runs the loop, and then either persists the final
        reply to history or -- on validation failure -- drops the bad
        reply and returns a safe fallback. Dropping a bad reply from
        history is important: it prevents a hallucinated claim from
        poisoning subsequent turns.

        Warning logs never include the reply body. Session IDs and reply
        contents are logged only as short SHA-256 hashes for correlation,
        which keeps PII out of the log pipeline even under active
        incident response.

        """
    )

    out += _chunk("python", "agent-py", "agent.py", agent_py) + "\n\n"

    out += textwrap.dedent(
        """\
        # HTTP surface

        The FastAPI app exposes four routes: `GET /health`, `GET /`
        (redirects to `/static/index.html`), `POST /api/chat`, and
        `GET /architecture` (this very document). Everything else is
        deliberately missing -- the OpenAPI docs pages and the redoc
        pages are disabled so the public surface is as small as possible.

        ## Security headers

        A middleware injects a strict Content-Security-Policy and
        friends on every response. CSP is defense in depth: the chat UI
        in `static/chat.js` already renders model replies with
        `textContent` rather than `innerHTML`, so XSS is structurally
        impossible today. The CSP exists to catch any future regression
        that accidentally switches to `innerHTML`.

        The `/architecture` route overrides the middleware CSP with a
        more permissive one because pandoc's standalone HTML has inline
        styles.

        ## Sliding-window rate limiter

        `SlidingWindowRateLimiter` keeps a deque of timestamps per key
        and evicts anything older than the window. The `/api/chat`
        handler checks twice per call -- once with an `ip:` prefix,
        once with a `session:` prefix -- so a single attacker cannot
        exhaust the per-session budget by rotating cookies, and a
        legitimate user does not get locked out by a noisy neighbour on
        the same IP.

        Suitable for a single-process demo deployment. A multi-worker
        deployment would externalize this to Redis.

        ## Session cookies

        The client never chooses its own session ID. On the first
        request a new random ID is minted, HMAC-signed with
        `settings.session_secret`, and set in an HttpOnly, SameSite=Lax
        cookie. Subsequent requests carry the cookie; the server
        verifies the signature in constant time
        (`hmac.compare_digest`) and trusts nothing else. A leaked or
        guessed request body cannot hijack another user's conversation
        because the session ID is not in the body at all.

        ## /api/chat

        The handler resolves the session, checks both rate limits,
        then calls into `agent.run_turn`. The Anthropic exception
        hierarchy is caught explicitly so a rate-limit incident and a
        code bug cannot look identical to operators:
        `anthropic.RateLimitError` becomes 503, `APIConnectionError`
        becomes 503, `APIStatusError` becomes 502, `ValueError` from
        the agent becomes 400, anything else becomes 500.

        ## /architecture

        This is where the woven literate program is served. The handler
        reads `static/architecture.html` (produced by pandoc from this
        file) and returns it with a relaxed CSP. If the file does not
        exist yet, the route 404s with a clear message rather than
        raising a 500.

        """
    )

    out += _chunk("python", "server-py", "server.py", server_py) + "\n"

    out_path = ROOT / "Bookly.lit.md"
    out_path.write_text(out, encoding="utf-8")
    print(f"wrote {out_path} ({len(out.splitlines())} lines)")


if __name__ == "__main__":
    main()