Anton, chapter 2: The first weekend

March 8, 2026 · 5 min read

I wake up Saturday morning and the first thing I want to fix is the parent. The classify-and-dispatch graph from yesterday does the job, but it asks the LLM to make routing decisions inside a state machine that's already trying to do the routing itself. Two layers fighting over the same job. I want one.

Agent with tools

The first commit of day two rips the parent out and rewrites it as an agent-with-tools: a single LLM with all the subgraphs exposed as tools, choosing what to call from natural language. The classify-then-route pattern stays useful for individual domain classifiers, but the parent is done with it. By breakfast the system feels lighter. The LLM at the top is doing what it's good at (picking the right tool for the job), and the framework underneath is doing what it's good at (holding everything else).
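A minimal Python sketch of the agent-with-tools shape, under loud assumptions: the tool names, the stubbed subgraphs, and the keyword-matching `choose_tool` are all hypothetical stand-ins. In the real system the choice comes from the LLM reading the tool descriptions, not from string matching.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Tool:
    name: str
    description: str            # what the model would read when picking a tool
    run: Callable[[str], str]   # the subgraph behind the tool


# Hypothetical subgraphs, stubbed so the example runs on its own.
def calendar_subgraph(request: str) -> str:
    return f"[calendar] handled: {request}"


def media_subgraph(request: str) -> str:
    return f"[media] handled: {request}"


TOOLS: Dict[str, Tool] = {
    tool.name: tool
    for tool in [
        Tool("calendar", "Create, find, or delete calendar events", calendar_subgraph),
        Tool("media", "Search for and download media", media_subgraph),
    ]
}


def choose_tool(message: str) -> str:
    """Stand-in for the LLM's tool choice (assumption: simple keyword match)."""
    text = message.lower()
    if "event" in text or "calendar" in text:
        return "calendar"
    return "media"


def parent_agent(message: str) -> str:
    """One decision point: pick a tool, then hand the request to it."""
    tool = TOOLS[choose_tool(message)]
    return tool.run(message)
```

The point of the shape is that routing happens in exactly one place: there is no separate classifier state in front of `parent_agent` competing with the tool choice.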

Family-grade polish

The next handful of commits are the small things that turn an assistant into something the family can actually use. Typing indicator and a startup message so people see acknowledgement before the LLM finishes thinking. Language matching: Anton replies in whatever language the current message is in, not the historical thread language. French and English are freely mixed in this house. The gws Google Workspace auth gets an onboarding guide so someone other than me can wire it up. Schedules become natural language: "remind me to X every morning" is just an Anton command now, not a separate syntax. Group chat support, daily summaries, conversation browsing. The honesty guidelines land too: Anton should never make things up to sound competent.

The first real browser-driven domain lands the same day: Doctolib login with 2FA. The interactive input queue lets Anton ask the user mid-flow for the SMS code, then resume where he left off. It's a small mechanism. It feels right immediately, in the way something does when it solves a class of problem you didn't quite know how to name yet.
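One way to picture the ask-then-resume mechanism is a generator that pauses at the question and continues when the reply arrives. This is an illustrative Python sketch, not Anton's implementation: `login_flow`, `run_with_replies`, and the numeric-code check are hypothetical, and a real flow would have to survive the wait between messages rather than live in one process.

```python
from collections import deque
from typing import Generator, Iterable, List, Tuple


def login_flow(username: str) -> Generator[str, str, str]:
    """Hypothetical login flow that needs a 2FA code halfway through."""
    # Step 1 (stubbed): credentials submitted, the site has sent an SMS.
    code = yield "Please send me the SMS code you just received."
    # Execution resumes here once the user's reply is delivered.
    if not code.isdigit():
        return "login failed: the code must be numeric"
    return f"logged in as {username}"


def run_with_replies(
    flow: Generator[str, str, str], replies: Iterable[str]
) -> Tuple[List[str], str]:
    """Drive a flow, answering each question from a queue of user replies."""
    questions: List[str] = []
    inbox = deque(replies)
    try:
        question = next(flow)            # run until the first question
        while True:
            questions.append(question)
            if not inbox:
                raise RuntimeError("flow is waiting for input but no reply is queued")
            question = flow.send(inbox.popleft())  # resume with the reply
    except StopIteration as done:
        return questions, done.value     # the flow's return value
```

The flow itself reads top to bottom like ordinary code; the pause for user input is just one more line in it.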

Saturday observability

Saturday night is observability. Three commits between 22:30 and midnight. A debug→issue pipeline so error traces auto-draft GitHub issues, because I'd rather have the issue file itself than wake up Sunday to no record of what broke. A /logs endpoint backed by an in-memory ring buffer instead of docker logs, because reaching into Docker every time I want to see what's happening is friction that adds up. And the trigger-file deploy protocol: instead of /update shelling out synchronously, a systemd watcher polls for a trigger file, runs the deploy, reports back. Decoupled, boring, reliable. The kind of plumbing that disappears the moment it works.
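The ring-buffer idea fits in a few lines. A minimal sketch in Python, assuming a standard-library logging handler backed by `collections.deque`; the class name, capacity, and `tail` method are hypothetical, and the HTTP layer that would serve `/logs` is left out.

```python
import logging
from collections import deque
from typing import List


class RingBufferHandler(logging.Handler):
    """Keep only the most recent `capacity` formatted log lines in memory."""

    def __init__(self, capacity: int = 1000) -> None:
        super().__init__()
        # A deque with maxlen silently drops the oldest entry when full.
        self.records: deque = deque(maxlen=capacity)

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(self.format(record))

    def tail(self, n: int = 100) -> List[str]:
        """Return the last n lines, oldest first."""
        if n <= 0:
            return []
        return list(self.records)[-n:]


# Demo: a buffer of 3 lines receiving 5 events keeps only the last 3.
handler = RingBufferHandler(capacity=3)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
log = logging.getLogger("anton.demo")
log.setLevel(logging.INFO)
log.propagate = False
log.addHandler(handler)
for i in range(5):
    log.info("event %d", i)
```

A `/logs` endpoint would then only need to return `handler.tail(n)`. The trade-off is that the buffer is lost on restart, which is acceptable for "what is happening right now" and not for an audit trail.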

The LCARS dashboard

Sunday morning is the LCARS dashboard. A Star Trek-themed Nuxt UI in one commit. Service probes, log viewer, conversation browser, mobile-first layout. The next nine commits are mostly the UI fixing itself: proxy semantics, env prefix, mobile layout, trace viewer, modal overlay parsing checkpoints, expandable drill-down. By lunch I can look at any conversation, drill into any agent run, and see what the LLM is thinking at each step. The trace viewer is what I'll lean on for the rest of the weekend. Without it, the next twelve hours don't happen.

Then knowledge surface expansion in two hours. Anton can grep his own source code now. He can do live web research with Grok. Research becomes its own subgraph with budget control, because I don't want it living inside other domain agents (research has its own concerns: budget, citation, fact-check). Web browsing and a document knowledge store. Google OAuth wired through the worker with a UI settings page. By Sunday afternoon Anton can read the web, read his own code, and remember what he reads.

Sunday evening: the quality test suite. 28 test cases across all domains. This is the inflection point where shipping changes stops being "did the WhatsApp message look right" and starts being "did the regression battery still pass." I should have done this on day one. I didn't, because day one was already too full. The gap between not having a test suite and having one is the gap between hoping and knowing.

The calendar saga

The next ten hours are the suite finding bugs and the bugs getting fixed. The big one is the calendar saga. Six commits chasing the same failure mode: the calendar agent keeps producing wrong answers when I ask it to do anything multi-step ("delete the event titled X"). Two root causes. First, the parent's full conversation history is being passed into the calendar agent, contaminating its context with everything else that's been said. Second, the LLM can't reliably chain a search-then-delete in one go: it does the search, returns the result, and stops. The fix on the first is a strict rule: don't pass conversationHistory to domain agents. The fix on the second is structural: don't rely on the LLM to chain multi-step tool calls; build composite skills (one findAndDeleteEvent instead of two separate tools). Both rules go into MEMORY.md the same evening. They're the kind you only learn by getting burned.
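Both rules can be sketched together. The Python below is an illustration under assumptions, not the project's code: the in-memory calendar, the exact-title match, and the refusal to delete when several events match are all hypothetical choices. What it does show is the shape of a composite skill: the search-then-delete chain lives in ordinary code, and the skill takes only the title, with no conversation history parameter at all.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Event:
    id: str
    title: str


class InMemoryCalendar:
    """Hypothetical stand-in for the real calendar backend."""

    def __init__(self, events: Iterable[Event]) -> None:
        self._events: Dict[str, Event] = {e.id: e for e in events}

    def search(self, title: str) -> List[Event]:
        needle = title.casefold()
        return [e for e in self._events.values() if e.title.casefold() == needle]

    def delete(self, event_id: str) -> None:
        del self._events[event_id]

    def titles(self) -> List[str]:
        return sorted(e.title for e in self._events.values())


def find_and_delete_event(calendar: InMemoryCalendar, title: str) -> str:
    """Composite skill: the model makes one call; the chaining is deterministic."""
    matches = calendar.search(title)
    if not matches:
        return f"No event titled '{title}' found."
    if len(matches) > 1:
        # Assumption: ambiguity is reported back instead of guessing.
        return f"Found {len(matches)} events titled '{title}'; not deleting without clarification."
    calendar.delete(matches[0].id)
    return f"Deleted '{matches[0].title}'."
```

With this shape the model cannot stop halfway: either the whole search-then-delete happens, or the skill returns a message explaining why it did not.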

Sunday night through Monday morning, the domains broaden. A wine collection lands as a typed table, the first user-facing collection. School messages domain. Group images and audio when @mentioned, with an "ingest cheap, process lazy" pattern: store the raw blob, only invoke vision or transcription when someone actually @-asks for it. Smarter media download flow with release preferences and stop/resume. Torrent searches routed through a Tor SOCKS proxy, the only network egress decision driven by operational caution rather than feature need.
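The "ingest cheap, process lazy" pattern is easy to show in miniature. A Python sketch, with assumptions flagged: `LazyMediaStore`, its method names, and the caching of the processed result are hypothetical, and the processors here are plain functions standing in for vision or transcription calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class StoredMedia:
    blob: bytes
    kind: str                        # e.g. "image" or "audio"
    processed: Optional[str] = None  # filled in on first request only


class LazyMediaStore:
    """Store raw blobs on arrival; run the expensive step only on demand."""

    def __init__(self, processors: Dict[str, Callable[[bytes], str]]) -> None:
        self._processors = processors
        self._items: Dict[str, StoredMedia] = {}
        self.processor_calls = 0     # counts expensive calls, for illustration

    def ingest(self, message_id: str, blob: bytes, kind: str) -> None:
        # Cheap path: no vision or transcription happens here.
        self._items[message_id] = StoredMedia(blob, kind)

    def describe(self, message_id: str) -> str:
        # Lazy path: process on first request, then reuse the result
        # (assumption: results are cached rather than recomputed).
        item = self._items[message_id]
        if item.processed is None:
            self.processor_calls += 1
            item.processed = self._processors[item.kind](item.blob)
        return item.processed
```

In a busy group chat most media is never asked about, so the expensive call count tracks questions rather than message volume.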

By Monday morning Anton has 73 quality tests, seven domains, a UI that drills into trace checkpoints, and a dev loop I can actually trust: trace viewer plus quality suite plus auto-issue pipeline plus ring-buffer logs plus LCARS. I can change anything and see what breaks.

The weekend's lesson, the one I'm taking into next week: write the tests before the bugs do. The calendar saga cost me an evening of debugging that the suite would have caught in seconds. From now on, every domain ships with regression coverage. Not because I want to be disciplined. Because I've now experienced the cost of not being.