2026-05-26 · StudioMeyer
Keeping nine AIs alive for sixty days without crashing
How LangGraph plans the rounds, Temporal keeps the whole thing alive through failures, and Langfuse tells us what actually happened.
A single Polis season is seven hundred and twenty months of game time. Each month involves nine citizens making five decisions each, plus a chronicler writing a short story at the end. That is forty-six calls a month, or about thirty-three thousand calls per season, and each call is a real conversation with a Claude model that might take ten seconds or might time out completely. If a single one of those calls crashes silently, the whole season can lose its rhythm.
Three pieces of software keep the whole thing alive and observable. LangGraph plans the rounds. Temporal keeps the run going through failures. Langfuse records everything so we can see what really happened.
LangGraph plans the rounds
LangGraph is a small library for writing workflows that pass through a graph of steps. Each step takes the current world state, does something with it, and passes a new state to the next step.
For Polis the graph is simple in shape. Decay happens first: pool resources shrink a bit and mood drifts. Then crisis or visitor events fire if the conditions are right. Then each of the nine citizens takes a turn, one after another, with their own personality and memory. Then the chronicler reviews what happened and assigns a score. Then the next month starts.
LangGraph lets us write each of these as a small function and wire them together as nodes in a graph. When the graph runs, it streams out the new state after every step, which is what lets the live feed on the website show what is happening as it happens.
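To make the shape concrete, here is a minimal sketch of a month as an ordered list of steps that emits the state after each one. It is plain TypeScript, not LangGraph's API and not the actual Polis engine: `WorldState`, `buildMonth`, `runMonth`, the field names, and the numbers are all assumptions made for illustration, and the citizen step is a stub where a real one would call a model.

```typescript
// Illustrative world state; the real Polis state is not shown in this post.
interface WorldState {
  month: number;
  pool: number;
  mood: number;
  log: string[];
  score: number | null;
}

// A step takes the current state and returns a new one.
interface Step {
  name: string;
  run: (state: WorldState) => WorldState;
}

// Decay: resources shrink a bit and mood drifts (amounts are made up).
const decay: Step = {
  name: "decay",
  run: (s) => ({ ...s, pool: s.pool - 5, mood: s.mood - 1 }),
};

// Events fire only if a condition holds (threshold is made up).
const events: Step = {
  name: "events",
  run: (s) =>
    s.pool < 50 ? { ...s, log: [...s.log, "crisis: pool is low"] } : s,
};

// Stub for a citizen turn; a real step would call an LLM here.
const citizenTurn = (citizen: string): Step => ({
  name: `citizen:${citizen}`,
  run: (s) => ({ ...s, log: [...s.log, `${citizen} acted`] }),
});

// The chronicler reviews the month and assigns a score (here: entries logged).
const chronicler: Step = {
  name: "chronicler",
  run: (s) => ({ ...s, score: s.log.length }),
};

// Decay, then events, then every citizen in order, then the chronicler.
function buildMonth(citizens: string[]): Step[] {
  return [decay, events, ...citizens.map(citizenTurn), chronicler];
}

// Runs the steps in order and reports the new state after every step,
// which is the hook a live feed could subscribe to.
function runMonth(
  steps: Step[],
  start: WorldState,
  onUpdate: (stepName: string, state: WorldState) => void
): WorldState {
  let state = start;
  for (const step of steps) {
    state = step.run(state);
    onUpdate(step.name, state);
  }
  return state;
}
```

Calling `runMonth(buildMonth(["ada", "bo"]), start, console.log)` with hypothetical citizen names would print one update per step. In the real system the graph library handles this wiring and streaming; the sketch only shows why a per-step state update is enough to drive a live view.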
Temporal keeps the run alive
A sixty-day simulation is a long time for anything to stay running. Servers reboot. Network connections drop. Claude calls occasionally take too long and get killed. If we just ran the LangGraph workflow as a plain Node script, any one of those would end the run.
Temporal is a workflow engine that makes long-running processes durable. Each Polis run becomes a Temporal workflow. The workflow does not directly call the LLMs. Instead it asks Temporal to run an activity, and the activity is what calls the LLMs. If the activity fails or times out, Temporal automatically retries it with backoff. If the server reboots mid-run, Temporal picks up where it left off when it comes back.
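"Retries with backoff" means each retry waits longer than the one before. The sketch below computes the wait before each retry for an exponential policy. The function and field names are hypothetical, and the numbers (one second initial interval, coefficient of two, cap at one hundred seconds) are what we understand Temporal's documented defaults to be, so treat them as an assumption rather than a description of the Polis configuration.

```typescript
// Hypothetical policy shape for illustration; not Temporal's own type.
interface BackoffPolicy {
  initialIntervalMs: number;   // wait before the first retry
  backoffCoefficient: number;  // multiplier applied on each further retry
  maximumIntervalMs: number;   // upper bound on any single wait
}

// Assumed values, see the paragraph above.
const assumedDefaults: BackoffPolicy = {
  initialIntervalMs: 1000,
  backoffCoefficient: 2,
  maximumIntervalMs: 100000,
};

// Wait before retry number `retryNumber` (1 = first retry).
function retryDelayMs(
  retryNumber: number,
  policy: BackoffPolicy = assumedDefaults
): number {
  if (Math.floor(retryNumber) !== retryNumber || retryNumber < 1) {
    throw new RangeError("retryNumber must be a positive integer");
  }
  const uncapped =
    policy.initialIntervalMs *
    Math.pow(policy.backoffCoefficient, retryNumber - 1);
  return Math.min(uncapped, policy.maximumIntervalMs);
}
```

Under these assumed values the waits double from one second upward until they hit the cap, so a flaky call is retried quickly at first and then progressively left alone.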
For Polis specifically, each run is wrapped as one big activity with a one-hour time limit and a five-minute heartbeat. The heartbeat is the activity telling Temporal "I am still alive, do not retry me". If five minutes go by without a heartbeat, Temporal assumes the activity is dead and starts a new attempt. Two attempts maximum, then it gives up and marks the run as failed. That last guardrail prevents us from burning compute on a truly broken run.
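The three numbers above interact, so a small simulation helps. The sketch below is plain TypeScript and does not use Temporal at all: `ActivityPolicy`, `SimulatedAttempt`, `judgeAttempt`, and `runWithRetries` are hypothetical names, and it assumes the one-hour limit applies to each attempt. In Temporal's TypeScript SDK the equivalent settings live in the activity options (for example `heartbeatTimeout` and `retry.maximumAttempts`).

```typescript
const MINUTE = 60000;

// Hypothetical policy shape mirroring the numbers in the text.
interface ActivityPolicy {
  timeLimitMs: number;        // assumed to apply per attempt
  heartbeatTimeoutMs: number; // longest allowed silence between heartbeats
  maxAttempts: number;
}

const polisPolicy: ActivityPolicy = {
  timeLimitMs: 60 * MINUTE,
  heartbeatTimeoutMs: 5 * MINUTE,
  maxAttempts: 2,
};

// One simulated attempt, with times in ms since the attempt started.
interface SimulatedAttempt {
  heartbeatsAtMs: number[];
  finishedAtMs: number | null; // null = never finishes
}

type AttemptOutcome = "completed" | "heartbeat_timeout" | "time_limit";

// Helper: heartbeats at stepMs, 2*stepMs, ... up to and including untilMs.
function heartbeatsEvery(stepMs: number, untilMs: number): number[] {
  const beats: number[] = [];
  for (let t = stepMs; t <= untilMs; t += stepMs) beats.push(t);
  return beats;
}

// Decides how a single attempt ends under the policy.
function judgeAttempt(a: SimulatedAttempt, p: ActivityPolicy): AttemptOutcome {
  const end = a.finishedAtMs === null ? Infinity : a.finishedAtMs;
  const cutoff = Math.min(end, p.timeLimitMs);
  let lastSignal = 0;
  const beats = a.heartbeatsAtMs.slice().sort((x, y) => x - y);
  for (const t of beats) {
    if (t >= cutoff) break;
    if (t - lastSignal > p.heartbeatTimeoutMs) return "heartbeat_timeout";
    lastSignal = t;
  }
  // Silence between the last heartbeat and the end of the attempt counts too.
  if (cutoff - lastSignal > p.heartbeatTimeoutMs) return "heartbeat_timeout";
  return end <= p.timeLimitMs ? "completed" : "time_limit";
}

// Tries attempts in order, stopping at success or at maxAttempts.
function runWithRetries(
  attempts: SimulatedAttempt[],
  p: ActivityPolicy
): { status: "completed" | "failed"; outcomes: AttemptOutcome[] } {
  const outcomes: AttemptOutcome[] = [];
  for (const attempt of attempts.slice(0, p.maxAttempts)) {
    const outcome = judgeAttempt(attempt, p);
    outcomes.push(outcome);
    if (outcome === "completed") return { status: "completed", outcomes };
  }
  return { status: "failed", outcomes };
}
```

An attempt that goes silent after eight minutes is judged dead, and a healthy second attempt rescues the run. Two dead attempts in a row mark the run as failed, even if a third attempt would have worked.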
Temporal also has a schedule API for triggering runs on a cadence. Instead of relying on system cron, the schedule itself lives inside Temporal and survives server reboots. Right now we run a new season manually, but the path to fully unattended is just flipping a switch.
Langfuse records everything
When something goes wrong in a thirty-three-thousand-call run, finding the exact bad call is practically impossible without tooling. Langfuse is the tool we use for that. Every single Claude call gets logged with its full prompt, full response, latency, token count, and cost. We can search by citizen, by month, or by error type. We can see exactly what a citizen was told and exactly what they decided to do.
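As a rough picture of what one logged call carries and how the searches work, here is a sketch in plain TypeScript. `CallRecord` and `findCalls` are hypothetical: the fields follow the list above, but this is not Langfuse's schema or query API, only an in-memory stand-in.

```typescript
// Hypothetical record shape based on the fields named in the text.
interface CallRecord {
  citizen: string;
  month: number;
  prompt: string;
  response: string | null; // null when the call failed
  latencyMs: number;
  tokens: number;
  costUsd: number;
  error: string | null;    // e.g. "timeout"; null when the call succeeded
}

// Only these three fields are searchable in this sketch.
type CallFilter = Partial<Pick<CallRecord, "citizen" | "month" | "error">>;

// Returns the records that match every field present in the filter.
function findCalls(records: CallRecord[], filter: CallFilter): CallRecord[] {
  return records.filter(
    (r) =>
      (filter.citizen === undefined || r.citizen === filter.citizen) &&
      (filter.month === undefined || r.month === filter.month) &&
      (filter.error === undefined || r.error === filter.error)
  );
}
```

With records like these, `findCalls(records, { month: 12, error: "timeout" })` narrows a whole season down to the handful of calls worth reading in full.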
This is also how we will eventually answer the research question of whether bigger models really live better lives. Langfuse holds all the data side by side, so comparing the three Opus citizens with the three Sonnet and three Haiku citizens is just a database query.
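The comparison itself is a group-by. The sketch below averages a per-citizen score by model tier in plain TypeScript. It assumes each citizen ends up with a single numeric score, which this post does not specify, and `CitizenScore` and `meanScoreByTier` are hypothetical names rather than anything Langfuse provides.

```typescript
type Tier = "opus" | "sonnet" | "haiku";

// Hypothetical row: one score per citizen, tagged with its model tier.
interface CitizenScore {
  citizen: string;
  tier: Tier;
  score: number;
}

// Mean score per tier; null for a tier with no rows.
function meanScoreByTier(rows: CitizenScore[]): Record<Tier, number | null> {
  const sums: Record<Tier, { total: number; count: number }> = {
    opus: { total: 0, count: 0 },
    sonnet: { total: 0, count: 0 },
    haiku: { total: 0, count: 0 },
  };
  for (const row of rows) {
    sums[row.tier].total += row.score;
    sums[row.tier].count += 1;
  }
  const mean = (tier: Tier): number | null =>
    sums[tier].count === 0 ? null : sums[tier].total / sums[tier].count;
  return { opus: mean("opus"), sonnet: mean("sonnet"), haiku: mean("haiku") };
}
```

A mean over three citizens per tier is a very small sample, so a real analysis would likely look at the per-month traces as well as the averages.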
Why three tools instead of one
Each tool does something the others cannot. LangGraph thinks at the level of game logic: which step comes after which. Temporal thinks at the level of survival: which retry policy applies when. Langfuse thinks at the level of observation: what the model actually said at three thirty-seven in the morning on day twelve. Trying to do all three with one tool would mean compromising on each.
The whole stack runs on a single small server in Germany. The Polis engine itself is a few hundred lines of TypeScript. The orchestration is what makes the small engine feel like a serious simulation that can actually run for two months without anyone watching.