# Chess 3-Layer Lab — Whitepaper

> **As of:** 2026-05-28 (Session 1222 Public-Deploy)
> Research design + methodology for the chess sub-experiment on meetmyagent.io.

## Question

Modern AI agents are stacked: a base model (the LLM), tools (web search, code execution, domain APIs), memory (semantic recall, episodic notes), and sometimes prompt evolution (self-improvement loops). Each layer is treated as a productivity multiplier, but the evidence for each layer's relative contribution is mostly anecdotal.

**Research question:** in a constrained, scorable domain, how much of the learning effect comes from each layer?

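We consider three layers on top of the base model (tools, memory, evolution) and run the experiment in chess. In the current matrix, tools and memory are held constant across all agents; the conditions vary the base-model tier and the evolution layer.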

## Design

Four agents play chess against each other in repeated round-robin tournaments. Each agent has identical access to the chess rules and identical observation channels. They differ only in the layer combination they get.

| Agent | Base model | Tools | Memory | Evolution |
|---|---|---|---|---|
| `chess-haiku-tm` | Claude Haiku 4.5 | yes | yes | no |
| `chess-sonnet-tm` | Claude Sonnet 4.6 | yes | yes | no |
| `chess-opus-tm` | Claude Opus 4.7 | yes | yes | no |
| `chess-sonnet-tmd` | Claude Sonnet 4.6 | yes | yes | **yes** (Darwin) |

Why this matrix:

- `haiku-tm` + `sonnet-tm` + `opus-tm` isolate the **scaling effect**: same conditions across three tiers of the same model family.
- `sonnet-tmd` vs `sonnet-tm` isolates the **evolution effect**: same model, same tools, same memory; the only difference is that `tmd` rewrites its reasoning prompts after each batch of games via [`darwin-agents`](https://www.npmjs.com/package/darwin-agents).

The static control is `sonnet-tm`. The premium-research extension on board 4 is a deliberate test of whether self-evolution alone closes the gap to `opus-tm`.
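The matrix can be written down as data, which makes the two comparisons checkable. The sketch below is illustrative only: `AgentConfig`, `differingFactors`, and `byId` are hypothetical names, not part of the lab's codebase.

```typescript
// Hypothetical encoding of the agent matrix from the table above.
type AgentConfig = {
  id: string;
  baseModel: string;
  tools: boolean;
  memory: boolean;
  evolution: boolean;
};

const AGENTS: AgentConfig[] = [
  { id: "chess-haiku-tm", baseModel: "Claude Haiku 4.5", tools: true, memory: true, evolution: false },
  { id: "chess-sonnet-tm", baseModel: "Claude Sonnet 4.6", tools: true, memory: true, evolution: false },
  { id: "chess-opus-tm", baseModel: "Claude Opus 4.7", tools: true, memory: true, evolution: false },
  { id: "chess-sonnet-tmd", baseModel: "Claude Sonnet 4.6", tools: true, memory: true, evolution: true },
];

type Factor = "baseModel" | "tools" | "memory" | "evolution";
const FACTORS: Factor[] = ["baseModel", "tools", "memory", "evolution"];

// Returns the factors on which two agents differ. A clean pairwise
// comparison differs on exactly one factor.
function differingFactors(a: AgentConfig, b: AgentConfig): Factor[] {
  return FACTORS.filter((f) => a[f] !== b[f]);
}

function byId(id: string): AgentConfig {
  const matches = AGENTS.filter((a) => a.id === id);
  if (matches.length !== 1) throw new Error(`unknown agent: ${id}`);
  return matches[0];
}

// Evolution comparison: the only differing factor is "evolution".
const evolutionPair = differingFactors(byId("chess-sonnet-tm"), byId("chess-sonnet-tmd"));
// Scaling comparison: the only differing factor is "baseModel".
const scalingPair = differingFactors(byId("chess-haiku-tm"), byId("chess-opus-tm"));
console.log(evolutionPair, scalingPair);
```

Under this encoding, `chess-sonnet-tmd` vs `chess-opus-tm` differs on two factors (`baseModel` and `evolution`), so that comparison tests whether one factor offsets the other and does not isolate either effect.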

## Pipeline (per ply)

Each agent runs a 9-station LangGraph pipeline with a fixed, deterministic station order before committing a move:

1. `observe` — read FEN, list legal moves, count material, list captured pieces (chess.js wrapper).
2. `recall_position` — tenant-isolated mcp-nex search for similar past positions stored by this agent only.
3. `recall_opponent` — same memory, filtered to opponent-tagged learnings.
4. `research` — Lichess Opening Explorer (only when the position is still in book, ply ≤ 15).
5. `plan` — LLM call: high-level strategic intent given the position.
6. `candidates` — LLM call: 3-5 candidate moves with one-line rationale each.
7. `verify` — chess.js legal-recheck; optional Cloud Eval hard-budget (max 3 calls per game).
8. `reflect` — LLM call: pick the strongest candidate and write reject-reasons for the rest.
9. `commit` — store the picked move + reasoning in `chess.moves` + `chess.stations`.

Every station is timed, costed, and trace-spanned in self-hosted Langfuse. Total cost per ply is ≈ $0.06 (Sonnet baseline). A game consisting of a 6-10 ply opening sequence costs ≈ $0.20-$0.40.
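A minimal sketch of the chain's shape, assuming each station is an async function from state to state. It uses plain TypeScript rather than LangGraph's API, the station bodies are stubs, and `PlyState`, `Station`, `StationTrace`, and `runPly` are hypothetical names introduced for illustration.

```typescript
// Plain-TypeScript stand-in for a linear station chain. This is NOT the
// LangGraph API; station bodies are stubs and all names are hypothetical.
type PlyState = {
  fen: string;
  notes: string[];
};

type Station = {
  name: string;
  run: (state: PlyState) => Promise<PlyState>;
};

type StationTrace = { station: string; durationMs: number };

// Stub station that only records that it was visited.
function stub(name: string): Station {
  return {
    name,
    run: async (state) => ({ ...state, notes: [...state.notes, name] }),
  };
}

// Station names and order taken from the numbered list above.
const STATIONS: Station[] = [
  "observe", "recall_position", "recall_opponent", "research", "plan",
  "candidates", "verify", "reflect", "commit",
].map((name) => stub(name));

// Runs every station in order and records wall-clock duration per station.
async function runPly(initial: PlyState): Promise<{ state: PlyState; trace: StationTrace[] }> {
  let state = initial;
  const trace: StationTrace[] = [];
  for (const station of STATIONS) {
    const start = Date.now();
    state = await station.run(state);
    trace.push({ station: station.name, durationMs: Date.now() - start });
  }
  return { state, trace };
}

const START_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1";
runPly({ fen: START_FEN, notes: [] }).then(({ trace }) => {
  console.log(trace.map((t) => t.station).join(" -> "));
});
```

Keeping the per-station trace separate from the state mirrors the split described above between operational data and observability data; how the real pipeline wires this into Langfuse is not shown here.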

## Memory

Each agent has its own slice of `nex_learnings` in `matthiasmeyer_db`:

- `project = 'chess-lab'`
- `tags` contains the agent's `agentId` (e.g. `chess-sonnet-tm`)

Reads are filtered by both, so no learnings leak across agents. The memory layer is **not** the source of truth for moves: operational state lives in `chess.*` (FEN, UCI, SAN, eval, token cost). Memory stores only abstract lessons after games end (e.g. "avoid pawn-storms against positional players").

Every recall and store is audited in `chess.memory_events` with `game_id`, `station_id`, `duration_ms`, and `hit_count`.
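The two-part read filter can be illustrated with an in-memory stand-in. This is a sketch under assumptions: the `Learning` row shape and the `visibleTo` helper are hypothetical, the real `nex_learnings` columns may differ, and the second and third sample rows are invented purely to show rows being excluded.

```typescript
// Hypothetical row shape; the real `nex_learnings` schema may differ.
type Learning = {
  project: string;
  tags: string[];
  lesson: string;
};

// A row is visible to an agent only if BOTH filters match:
// the project is 'chess-lab' and the tags contain the agent's own id.
function visibleTo(agentId: string, rows: Learning[]): Learning[] {
  return rows.filter((r) => r.project === "chess-lab" && r.tags.indexOf(agentId) !== -1);
}

// Illustrative rows only; none of these are real lab data.
const sampleRows: Learning[] = [
  { project: "chess-lab", tags: ["chess-sonnet-tm"], lesson: "avoid pawn-storms against positional players" },
  { project: "chess-lab", tags: ["chess-opus-tm"], lesson: "lesson owned by another agent" },
  { project: "other-project", tags: ["chess-sonnet-tm"], lesson: "row from a different project" },
];

const sonnetRows = visibleTo("chess-sonnet-tm", sampleRows);
console.log(sonnetRows.map((r) => r.lesson));
```

Requiring both conditions means a tag collision in another project, or another agent's row in the same project, is excluded by construction.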

## Cost discipline

- Subprocess `claude -p` only (Max Plan, zero direct API spend during the experiment).
- Hard cap: max 1 game per hour per board, max 24 games per day across all boards.
- Token-cost budget per move: $0.20 (warn at $0.15).
- All token usage logged to Langfuse with exact `costDetails.input` + `costDetails.output` split.

Live verification: DB-recorded `total_cost_usd` matches Langfuse-API `totalCost` to within $0.0001 (floating-point rounding only).
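The caps above can be expressed as two small checks. The sketch below is an assumption-laden illustration: it interprets "per hour" and "per day" as rolling windows, which the text does not specify, and `mayStartGame`, `moveBudgetStatus`, and `GameStart` are hypothetical names.

```typescript
// Hypothetical limiter for the caps listed above; all names are illustrative.
const HOUR_MS = 60 * 60 * 1000;
const DAY_MS = 24 * HOUR_MS;
const MAX_GAMES_PER_DAY = 24;
const MOVE_BUDGET_USD = 0.2;
const MOVE_WARN_USD = 0.15;

type GameStart = { board: number; startedAt: number }; // startedAt in epoch ms

// A new game may start only if this board started no game within the last
// hour AND fewer than 24 games started across all boards in the last 24 hours.
// Rolling windows are an assumption of this sketch.
function mayStartGame(board: number, now: number, history: GameStart[]): boolean {
  const lastDay = history.filter((g) => now - g.startedAt < DAY_MS);
  const boardBusy = lastDay.some((g) => g.board === board && now - g.startedAt < HOUR_MS);
  return !boardBusy && lastDay.length < MAX_GAMES_PER_DAY;
}

type BudgetStatus = "ok" | "warn" | "over";

// Classifies the accumulated cost of one move against the budget.
function moveBudgetStatus(costUsd: number): BudgetStatus {
  if (costUsd > MOVE_BUDGET_USD) return "over";
  if (costUsd >= MOVE_WARN_USD) return "warn";
  return "ok";
}

const now = Date.now();
const recent: GameStart[] = [{ board: 1, startedAt: now - 30 * 60 * 1000 }];
console.log(mayStartGame(1, now, recent)); // false: board 1 started a game 30 minutes ago
console.log(mayStartGame(2, now, recent)); // true
console.log(moveBudgetStatus(0.16)); // "warn"
```

With four boards, the per-board cap alone would allow up to 96 games per day, so the global cap of 24 is the binding limit whenever all boards are active.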

## Outcome metrics

Primary:
- **Win rate per agent** over 100+ games (with confidence intervals).
- **Generation lift** for the evolving agent: does generation N+1 beat generation N over 20 head-to-head test games?

Secondary:
- **Memory recall hit rate** per agent (share of queries that return at least one relevant hit).
- **Opening-book divergence** — at which ply does each agent leave Lichess book stats and play "its own" move?
- **Eval-bar trajectory** — average centipawn delta per move (Stockfish reference run weekly as ground-truth anchor).
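The text does not say which confidence interval is used for win rates. One common choice for a binomial proportion is the Wilson score interval, sketched below as an assumption; `wilsonInterval` is a hypothetical helper, draws are not modeled, and the 12-of-20 figure is an invented example, not a lab result.

```typescript
// Wilson score interval for a binomial proportion (z = 1.96 for roughly 95%).
// Whether the lab uses this interval is an assumption of this sketch.
function wilsonInterval(wins: number, games: number, z: number = 1.96): { low: number; high: number } {
  if (games <= 0) throw new Error("games must be positive");
  if (wins < 0 || wins > games) throw new Error("wins must be between 0 and games");
  const p = wins / games;
  const z2 = z * z;
  const denom = 1 + z2 / games;
  const center = (p + z2 / (2 * games)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / games + z2 / (4 * games * games))) / denom;
  return { low: Math.max(0, center - half), high: Math.min(1, center + half) };
}

// Hypothetical example: generation N+1 wins 12 of 20 head-to-head games.
const lift = wilsonInterval(12, 20);
console.log(lift.low < 0.5 && lift.high > 0.5); // true: the interval still contains 0.5
```

In this example the interval spans both sides of 0.5, which illustrates why a 20-game generation test can only detect fairly large lifts, while the 100+ game win-rate metric gives a visibly tighter interval at the same win fraction.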

## Calibration

On Sundays, the lab runs an out-of-distribution calibration against Stockfish at a fixed depth. This anchors the rating against an external scale and detects model drift over the experimental window.
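How the anchoring is computed is not specified here. One simple option, shown as an assumption, is a performance rating derived from the standard Elo expected-score formula against a single anchor opponent. `performanceRating` is a hypothetical helper, and `ANCHOR` is a placeholder value, not a measured rating for Stockfish at any depth.

```typescript
// Standard Elo expected score for player A against player B.
function expectedScore(ratingA: number, ratingB: number): number {
  return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
}

// Inverts the Elo formula: the rating at which the observed score fraction
// equals the expected score against one anchor opponent. The anchor rating
// must be supplied externally; this sketch does not know it.
function performanceRating(anchorRating: number, scoreFraction: number): number {
  if (scoreFraction <= 0 || scoreFraction >= 1) {
    throw new Error("score fraction must be strictly between 0 and 1");
  }
  return anchorRating + 400 * Math.log10(scoreFraction / (1 - scoreFraction));
}

// Hypothetical example: 5 points from 20 calibration games (draw = 0.5 points).
const ANCHOR = 2000; // placeholder, not a measured Stockfish rating
const estimate = performanceRating(ANCHOR, 5 / 20);
console.log(Math.round(estimate));
```

Comparing this estimate from week to week, with the anchor and depth held fixed, is one way drift would become visible; a score of exactly 0 or 1 has no finite estimate, which is why the helper rejects it.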

## Stack

- LangGraph 1.3 — StateGraph, linear chain, Studio-debuggable.
- Postgres LISTEN/NOTIFY — 5 channels (`chess_move`, `chess_reasoning`, `chess_game_end`, `chess_generation`, `chess_memory`).
- mcp-nex Memory — tenant-isolated per agent, project + tag filter.
- Langfuse — self-hosted, observability, trace + span + generation hierarchy.
- chess.js — engine, legal moves, FEN, PGN.
- Lichess APIs — Opening Explorer, Cloud Eval, Tablebase. All gratis, rate-limited 1s min interval.
- Claude CLI subprocess — Max Plan, zero direct API cost.
- darwin-agents v0.5.0-alpha.2 — GEPA-style reflective optimizer for board 4 (phase F backlog).
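The five LISTEN/NOTIFY channels match the count of named event types on the SSE live feed. Assuming a one-to-one mapping from channel name to SSE event name (the text does not confirm this), a frame could be formatted as below. `toSseFrame`, `isChannel`, and the payload fields are hypothetical.

```typescript
// Channel names come from the stack list above; mapping each channel
// one-to-one onto an SSE event name is an assumption of this sketch.
const CHANNELS = [
  "chess_move",
  "chess_reasoning",
  "chess_game_end",
  "chess_generation",
  "chess_memory",
] as const;
type Channel = (typeof CHANNELS)[number];

// Type guard for channel names arriving as plain strings.
function isChannel(name: string): name is Channel {
  return (CHANNELS as readonly string[]).indexOf(name) !== -1;
}

// Formats one Server-Sent Events frame: an `event:` line, a `data:` line,
// and a blank line that terminates the frame. JSON.stringify without an
// indent argument emits a single line, so the payload fits one data field.
function toSseFrame(channel: Channel, payload: Record<string, unknown>): string {
  return `event: ${channel}\ndata: ${JSON.stringify(payload)}\n\n`;
}

// Hypothetical payload; the real event schema is not documented here.
const frame = toSseFrame("chess_move", { gameId: "g1", uci: "e2e4" });
console.log(frame);
```

Named events let a browser client attach one `EventSource` listener per event type instead of switching on a field inside every message.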

## What is NOT in scope

- Engine-grade play. Even Opus 4.7 plays around 1500 Elo, well below the strongest classical engines.
- Tournament rules. We log result + termination but don't run formal time controls.
- Opening preparation. Each game starts from the standard initial position.
- Public LLM benchmarking. Other labs (Anthropic, OpenAI) publish their internal evals; this is independent build-in-public research.

## Limitations

- Only one model family (Claude). Cross-family comparison is open.
- The 9-station pipeline is opinionated. A different pipeline ordering could change the results.
- Lichess APIs are public, so book moves are essentially memorized opening theory rather than a true test of reasoning.
- The Darwin loop on board 4 is short (~20 generations target). Longer evolution windows may show different effects.

## Public artefacts

- This whitepaper: [/chess/WHITEPAPER.md](https://meetmyagent.io/chess/WHITEPAPER.md)
- Roadmap: [/chess/ROADMAP.md](https://meetmyagent.io/chess/ROADMAP.md)
- Live dashboard: [/chess](https://meetmyagent.io/chess)
- Game replays: `/chess/games/{game-id}`
- Agent stats: `/chess/agents/{agent-id}`
- JSON dashboard: `/chess/api/dashboard`
- SSE live feed: `/chess/api/stream` (5 named event types)
- Discovery: `/chess/llms.txt` + `/chess/.well-known/agents.json` + `/chess/.well-known/agent-card.json`

## Related work

- Stanford Smallville (Park et al. 2023) — generative agents that plan, remember, reflect.
- NVIDIA Voyager (Wang et al. 2023) — skill library and code-tool evolution in Minecraft.
- ETH GovSim (Piatti et al. 2024) — multi-agent commons dynamics.
- Tsinghua AgentVerse (Chen et al. 2023) — multi-agent collaboration framework.
- Altera Project Sid (2024) — large-scale civilizational simulation.
- GEPA (Agrawal et al. 2025) — reflective prompt optimization.

## Sibling experiment

- Polis Multi-Agent Society Simulation: [/polis](https://meetmyagent.io/polis). Nine citizens, sixty years each, full Tycoon-style life simulation with V3.5 real-life mechanics (career stages, addiction coping paths, cash shocks).

Both experiments share infrastructure (Postgres schema, mcp-nex memory, LangGraph, Langfuse) and live on the same brand-container domain (meetmyagent.io). Sub-paths are deliberately isolated to keep cross-experiment leakage out of the data.

## Operator

StudioMeyer / Matthias Meyer (Palma de Mallorca, Spain). Contact: matthias [at] studiomeyer.io.