TL;DR
I’ve built a blunder tutor — an open-source, self-hosted, free-to-use project that trains you on your own historical blunders: github, promo. Star, like, subscribe, give it a try!
```shell
docker run \
  -p 8000:8000 \
  -v $(pwd)/data:/app/data \
  ghcr.io/mrlokans/blunder-tutor:latest
```
Open http://localhost:8000, enter your Lichess or Chess.com username, and start training on your own blunders.
Intro
I suck at chess. Like, it seems that when your body turns 30, suddenly your neural network stops learning new patterns the way it used to when you were 20. It’s predictable but still annoying and frustrating.
I’ve picked up that hobby recently, less than half a year ago, having played it twice at age 7. This quickly became my new obsession, so I dove headfirst — no, it’s not the LLM writing, it’s the byproduct of LLM-generated slop force-learning human brain cells — into it, consuming an enormous amount of YouTube content along the way. I’ve been playing online on Chess.com and Lichess, hitting 850 and ~1150 Elo respectively. I bought the premium subscription on Chess.com, analyzed my games, and noticed that despite learning new tricks here and there I kept blundering in the simplest positions, over and over again, like the stupidest kid in the class. My dreams of becoming a chess prodigy were broken by the fact that I’m getting old and not a genius (:sad-emoji:).
Like every sane human being, I’ve decided I’m better than this and my journey should be unique, so I’ve spent a lot of time and money (Claude Max subscription is costly) to build a project I can share with the world — to help me and other folks become better at one specific aspect: not repeating your own mistakes. In the process I’ve forgotten a bit about playing chess and went full-throttle on taming the AI to build something good, mixing my two hobbies: chess and self-hosting.
Initial idea
So, how do you build such a project? You fetch a bunch of games, bulk-process them with a chess engine, extract position evaluations from the analysis, detect obvious blunders, persist that info, slap some UI on top, and you’re done. That was the initial plan and, not without caveats, it’s basically what happens to this day.
Claude Code was unholstered and the process started, following a basic progression: CLI first, some basic domain model, defining high-level architecture, integrating pieces together and building the basic web UI.
Building a working prototype was very easy, as it always is with AI-assisted coding. It took a few rounds of improvements to properly set up DI, drill down on component isolation, and attempt to enforce the Single Responsibility Principle, but in general the journey was smooth. The project is 99% written by the AI agent, in the sense that I didn’t write much of the code myself; I only orchestrated the agent(s) through a variety of approaches, making sure they didn’t produce too much slop and stayed under control. During this process I felt like a weird mix of architect, product manager and project manager, focusing on reviewing results, managing the roadmap, and balancing feature development with code reviews and refactoring storms.
Technical stack
- FastAPI for the backend. Claude seems to really love starting development with FastAPI. Personally, I’m not its greatest fan - I don’t really like its DI system and for me it’s not that versatile - but it’s obviously really popular among folks and over-represented in the training data. Still, why not.
- SQLite as the data storage. It’s battle-tested, suitable for embedding, and lets me ship a single artifact that’s ready to use. It has its flaws, specifically for highly parallel write workloads, but it works, and works fast.
- Vanilla JS and Vanilla CSS for the frontend. I really wanted to avoid the hassle of building the frontend and setting up various things like Vite or some other bundling; LLMs like to write their own implementations of things anyway. Likely I will try some lightweight library like Preact later on, if things get really messy. I also vendor all of the FE dependencies and bake them into the resulting image.
- Docker for packing things together. No comments: if you want to distribute something for self-hosting, you’d better ship a Docker image.
- Stockfish is the gold standard of computer analysis these days, so it felt only natural to go with it. I compile it and embed it into the resulting Docker image, so the end user doesn’t have to bother setting it up.
The Pipeline: From Raw PGN to Puzzle
The main data flow of the app is straightforward:
- Fetch games from Lichess/Chess.com APIs (or paste PGN directly)
- Analyze every position with Stockfish
- Classify each move (good / inaccuracy / mistake / blunder)
- Perform data enrichment for specific features:
  - tactical patterns
  - possible opening traps
  - a difficulty score for each blunder
- Filter out low-value blunders that aren’t really useful for human players
- Present the interesting ones as puzzles in a trainer UI
In practice, there are gotchas along the way.
The Analysis Pipeline
Analysis is structured as a step-based pipeline - each step can depend on outputs of previous steps, and steps can be skipped if they’ve already been completed. That last part is important: when I add a new analysis feature (and I keep adding them), existing games get backfilled without re-running the expensive Stockfish step.
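A minimal sketch of that skip-if-done behavior - the `Step` class, the `completed_steps` marker, and `run_pipeline` are illustrative names I made up, not the project’s actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    """One pipeline step: a name, a function over the analysis state,
    and the names of steps it depends on."""
    name: str
    run: Callable[[dict], None]
    depends_on: tuple[str, ...] = ()

def run_pipeline(steps: list[Step], analysis: dict) -> list[str]:
    """Run steps in order, skipping any step already marked complete.
    This is what lets new steps backfill old games without re-running
    the expensive Stockfish pass."""
    executed = []
    for step in steps:
        done = analysis.setdefault("completed_steps", set())
        if step.name in done:
            continue  # already computed in a previous run - skip
        for dep in step.depends_on:
            assert dep in done, f"missing dependency: {dep}"
        step.run(analysis)
        done.add(step.name)
        executed.append(step.name)
    return executed
```

Running the pipeline a second time with a newly added step executes only that step, which is the backfill property described above.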
The pipeline steps, in order:
- Stockfish - the expensive one. Evaluates every position in the game, producing centipawn evaluations and best lines.
- Move Quality - classifies each move based on eval swings and computes difficulty scores.
- Phase Detection - tags each move as opening/middlegame/endgame based on piece count and move number.
- ECO Classification - matches the opening sequence against the Encyclopedia of Chess Openings.
- Tactical Pattern Detection - identifies forks, pins, skewers, discovered attacks, etc. in blunder positions.
- Trap Detection - pattern-matches against a database of known opening traps (Scholar’s Mate, Fried Liver, etc.).
- Write - persists everything to SQLite.
The actual dependency graph between steps looks like this - topologically sorted at runtime so each step runs only after its dependencies complete:
```mermaid
graph TD
    SF["stockfish<br/>evals + best PVs"]
    ECO["eco<br/>opening classification"]
    PH["phase<br/>opening/middle/endgame"]
    MQ["move_quality<br/>classify + difficulty"]
    TAC["tactics<br/>fork/pin/skewer/..."]
    TR["traps<br/>known opening traps"]
    WR["write<br/>persist to SQLite"]
    SF --> MQ
    MQ --> TAC
    MQ --> WR
    PH --> WR
    ECO --> WR
    TAC --> WR
    TR --> WR
```
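The runtime ordering can be sketched with Python’s stdlib `graphlib`, using the step names and dependency edges from the diagram - a simplified sketch, not the project’s actual scheduler:

```python
from graphlib import TopologicalSorter

# Predecessor map: each key lists the steps that must finish first.
# Edges mirror the dependency graph above.
deps = {
    "move_quality": {"stockfish"},
    "tactics": {"move_quality"},
    "write": {"move_quality", "phase", "eco", "tactics", "traps"},
}

# static_order() yields every node with all its predecessors first.
order = list(TopologicalSorter(deps).static_order())
```

Since `write` depends (directly or transitively) on every other step, it always comes out last, and the expensive `stockfish` step always precedes `move_quality`.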
The Hard Part: What Counts as a Blunder?
If you want to train on your historic blunders, you need to detect them (duh?). And detecting one algorithmically in a naive fashion does not produce reliable results. What I originally wanted to do was basically get a list of move evaluations from the engine, filter out the swings above some threshold, and have the blunders served on a plate. It did not work well.
The Centipawn Problem
Stockfish gives you evaluations in centipawns (cp), 1/100 of a pawn of advantage. A pawn is worth ~100cp, a knight ~320cp, etc. The raw centipawn loss (eval_before - eval_after from the player’s perspective) seems like the obvious metric. If you had +150cp and after your move it’s -100cp, that’s a 250cp swing - clearly a blunder, right?
Well, consider this: you’re up a queen (eval: +900cp). You play a move that drops a pawn, going to +800cp. That’s a 100cp loss - technically a “mistake” by raw thresholds. But you’re still completely winning. Should this appear in your training puzzles? Probably not.
Or the opposite: you’re losing badly (eval: -500cp). You play a move that makes it -700cp. Sure, you lost another 200cp, but the game was already over. Drilling this position won’t teach you much.
Winning Chances: The Lichess Approach
The solution I landed on (following Lichess’s lead) is to classify moves based on winning chances rather than raw centipawns. The conversion uses a sigmoid function:
```
winning_chances(cp) = 2 / (1 + exp(-0.00368208 * clamp(cp, -1000, 1000))) - 1
```
This maps centipawn evaluations to a -1.0 to +1.0 scale where the extremes are compressed. Going from +200cp to -50cp (a 250cp swing) shifts winning chances by 0.44 (from 0.35 to -0.09) - a clear blunder. Going from +800cp to +550cp (also 250cp) only shifts winning chances by 0.13 (from 0.90 to 0.77) - barely an inaccuracy. The sigmoid captures the intuition that positions near equality are where accuracy matters most.
The thresholds become:
| Classification | Winning Chances Loss |
|---|---|
| Inaccuracy | ≥ 0.10 |
| Mistake | ≥ 0.20 |
| Blunder | ≥ 0.30 |
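Combined, the sigmoid and the thresholds fit in a few lines of Python. A sketch - the constant is the one from the formula above; the function names are mine:

```python
import math

def winning_chances(cp: float) -> float:
    """Lichess-style sigmoid mapping centipawns to a [-1, 1] scale."""
    cp = max(-1000.0, min(1000.0, cp))  # clamp extreme evaluations
    return 2.0 / (1.0 + math.exp(-0.00368208 * cp)) - 1.0

def classify(cp_before: float, cp_after: float) -> str:
    """Classify a move by winning-chances loss, per the table above."""
    loss = winning_chances(cp_before) - winning_chances(cp_after)
    if loss >= 0.30:
        return "blunder"
    if loss >= 0.20:
        return "mistake"
    if loss >= 0.10:
        return "inaccuracy"
    return "good"
```

This reproduces the examples from the text: the +200cp to -50cp swing classifies as a blunder, while the same 250cp drop from +800cp to +550cp is only an inaccuracy, and dropping a pawn while up a queen isn’t flagged at all.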
The interactive chart below lets you compare the two approaches. Drag the sliders to simulate a move’s evaluation swing and see how raw centipawn loss and the sigmoid winning chances classify it differently. Try the presets to see the edge cases the sigmoid handles well:
Mate Transitions
The winning-chances approach handles most cases, but mate situations need special treatment. Lichess’s Advice.scala handles four distinct mate transitions:
- Mate created: You didn’t have mate before, but now the opponent has forced mate against you. Severity depends on how good your position was before.
- Mate lost: You had a forced mate, and now you don’t. Always at least a mistake.
- Mate delayed: You had mate, and now the opponent has mate. Always a blunder, no excuses.
- Checkmate delivered: The position is checkmate - that’s a “good” move regardless, since, well, you won.
These transitions can’t be handled by the winning-chances sigmoid because mate scores live outside the centipawn scale entirely.
One tricky refinement I added later: dead-end position handling. When Stockfish already predicts a forced mate against you, every subsequent move was getting classified as a blunder (because you’re going from “mate in 12” to “mate in 11” - technically worsening your position). The fix, borrowed from Lichess’s logic, is to treat these dead-end positions specially: once mate is inevitable, only moves that dramatically change the mate distance get flagged. This alone cut false blunder counts by 30-40% in many games.
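A hedged sketch of the dead-end rule. The convention (negative mate score = the opponent mates in N) is common in engine tooling, but the “dramatic change” threshold of 5 here is an illustrative assumption, not the project’s exact constant:

```python
def classify_in_dead_end(mate_before: int, mate_after: int) -> str:
    """Once mate against us is inevitable, going from mate-in-12 to
    mate-in-11 is normal play and should not be flagged. Only a move
    that collapses the mate distance dramatically counts as a blunder.
    Negative values mean the opponent mates in abs(n) moves."""
    assert mate_before < 0 and mate_after < 0, "only applies to dead ends"
    shortened = abs(mate_before) - abs(mate_after)  # how much closer mate got
    return "blunder" if shortened >= 5 else "good"
```

Under this rule the mate-in-12 to mate-in-11 move from the text passes quietly, while a move that turns mate-in-12 into mate-in-2 still gets flagged.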
Not All Missed Mates Are Equal
Even after getting the classification right, I found that missing mate-in-1 and missing mate-in-15 were treated identically. That’s stupid. A mate-in-1 is something every beginner should spot. A mate-in-6+ is an engine-only find that no human below GM level would consistently see.
The fix: short mate misses (mate-in-1 through mate-in-5) stay classified as blunders and get boosted training weight - 2x for mate-in-1/2, 1.5x for mate-in-3/4/5. Long mate misses (mate-in-6+) get downgraded from blunder to mistake, because drilling engine-only sequences isn’t useful training for most players.
I also capped cp_loss at 1500 for all moves at analysis time. Before this, a single missed mate could show 11,354 cp_loss, which obliterated every average across the dashboard. Boring fix, dramatic improvement.
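Both rules fit in a few lines. The weights and the cap are the ones stated above; the helper names are mine:

```python
def mate_miss_training(mate_in: int) -> tuple[str, float]:
    """Classification and training weight for a missed mate-in-N."""
    if mate_in <= 2:
        return ("blunder", 2.0)   # mate-in-1/2: every beginner should spot these
    if mate_in <= 5:
        return ("blunder", 1.5)   # mate-in-3..5: still human-findable
    return ("mistake", 1.0)       # mate-in-6+: engine-only, downgraded

CP_LOSS_CAP = 1500  # keep one missed mate from obliterating dashboard averages

def capped_cp_loss(raw_cp_loss: float) -> float:
    return min(raw_cp_loss, CP_LOSS_CAP)
```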
The Hard Part: Engine Performance
Running Stockfish from Python sounds simple - spawn a process, send UCI commands, read responses. It is simple. For one game. For hundreds of games in bulk analysis, it becomes really slow.
The Engine Pool
The initial implementation spawned a fresh Stockfish process per game. This was wasteful - process startup has overhead, and you’re not utilizing multiple cores effectively. The EnginePool is a fixed-size pool of long-lived Stockfish processes with async workers consuming from a shared task queue.
Each engine gets configured with Threads (CPU cores per engine) and Hash (memory for the transposition table) based on available hardware. On an 8-core machine with the default pool size of 4, each engine gets 2 threads and 128MB hash.
The pool handles dead engine detection (Stockfish crashes under memory pressure - it happens), task timeouts for pathological positions, and graceful shutdown on app exit.
```mermaid
graph TD
    WC["WorkCoordinator<br/>submit() / drain() / shutdown()"]
    Q["asyncio.Queue<br/>task queue"]
    W1["Worker 1"]
    W2["Worker 2"]
    WN["Worker N"]
    SF1["Stockfish #1<br/>2 threads, 128MB"]
    SF2["Stockfish #2<br/>2 threads, 128MB"]
    SFN["Stockfish #N<br/>2 threads, 128MB"]
    GA["Game A"]
    GB["Game B"]
    GC["Game C"]
    WC --> Q
    Q --> W1
    Q --> W2
    Q --> WN
    W1 --- SF1
    W2 --- SF2
    WN --- SFN
    SF1 --> GA
    SF2 --> GB
    SFN --> GC
```
If an engine crashes, the worker detects it and spawns a replacement. If a task exceeds the timeout (default 300s), the engine is killed, respawned, and the task gets an error. On shutdown, all engines receive quit commands and workers are cancelled cleanly.
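A condensed asyncio sketch of the pool’s worker loop. This is a toy model, not the project’s implementation: plain strings stand in for long-lived Stockfish processes, and respawning a dead engine is stubbed out.

```python
import asyncio

class EnginePool:
    """Fixed-size pool: N workers, each bound to one engine, all
    consuming from a shared task queue."""

    def __init__(self, engines: list, timeout: float = 300.0):
        self.queue: asyncio.Queue = asyncio.Queue()
        self.engines = engines
        self.timeout = timeout
        self.workers: list[asyncio.Task] = []

    async def _worker(self, engine):
        while True:
            task = await self.queue.get()  # task is an `async def f(engine)`
            try:
                await asyncio.wait_for(task(engine), self.timeout)
            except Exception:
                engine = await self._respawn(engine)  # crash or timeout
            finally:
                self.queue.task_done()

    async def _respawn(self, engine):
        return engine  # real code kills the process and spawns a fresh one

    def start(self) -> None:
        self.workers = [asyncio.create_task(self._worker(e)) for e in self.engines]

    def submit(self, task) -> None:
        self.queue.put_nowait(task)

    async def drain(self) -> None:
        await self.queue.join()  # wait until every submitted task is done

    async def shutdown(self) -> None:
        for w in self.workers:
            w.cancel()
        await asyncio.gather(*self.workers, return_exceptions=True)
```

The same submit() / drain() / shutdown() shape is what the WorkCoordinator described below exposes on top of the pool.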
The WorkCoordinator wraps the pool with a simpler submit() / drain() / shutdown() interface:
```python
for game_id in game_ids:
    async def process_game(engine, *, _gid=game_id):
        await self.analyze_game(game_id=_gid, engine=engine)
    coordinator.submit(process_game)
await coordinator.drain()
```
I spent a good chunk of time exploring how Stockfish optimizations work, created a benchmark to check how various parameter changes affect the overall analysis throughput, and got around a 10x-15x improvement over the original implementation. It’s still comparatively slow, but manageable.
Explaining Blunders in Plain Language
Knowing a move was a blunder is step one. Knowing why it was a blunder - that’s where actual learning happens, and I really wanted my software to explain these whys to the end user, with descriptions like “You hung your bishop on e5 - it was undefended and the opponent’s knight can capture it” or “The best move Nf5 forks the king and queen, winning the queen.”

Each blunder gets two explanations: why your move was bad, and why the best move was good. Both are produced as I18nMessage(key, params) - translation keys with parameters, never raw text. resolve_explanation() then formats them through the TranslationManager for the active locale, so the chess logic doesn’t care about languages and translators don’t need to understand chess code.
```mermaid
graph LR
    POS["Blunder position<br/>(FEN + best move + PV)"]
    GEN["generate_explanation()<br/>python-chess analysis"]
    MSG["I18nMessage<br/>{key, params}"]
    RES["resolve_explanation()<br/>TranslationManager"]
    OUT["Localized text"]
    LOC["Locale files<br/>en.json, ru.json, ..."]
    POS --> GEN
    GEN --> MSG
    MSG --> RES
    LOC --> RES
    RES --> OUT
```
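A minimal sketch of the key-plus-params idea. The locale table here is illustrative - real translations live in the JSON locale files - and `resolve_explanation` is simplified down to a dictionary lookup:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class I18nMessage:
    """A translation key plus parameters - never raw text, so the chess
    logic stays language-agnostic."""
    key: str
    params: dict

# Illustrative stand-in for en.json; keys and templates are assumptions.
LOCALE_EN = {
    "explain.hanging_piece": "Your {piece} on {square} is undefended",
}

def resolve_explanation(msg: I18nMessage, locale: dict) -> str:
    """Sketch of the TranslationManager step: look up the key in the
    active locale and format in the parameters."""
    return locale[msg.key].format(**msg.params)
```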
Explaining the best move: PV-first design
The best-move explanation went through a major rewrite. The original version used only static pattern detection - forks, pins, hanging pieces - and tried to infer causality from the position. It kept getting things wrong: a pre-existing pin was blamed on the blunder, a defended queen was called “undefended”, a bishop sacrifice for a pawn was described as “captures the pawn with check” when the real point was winning the queen three moves later.
The fix was to make Stockfish’s principal variation (PV) the primary source of truth. The PV is the engine’s best-play sequence for both sides - it shows what concretely happens after the best move. _explain_best resolves in three phases:
- Immediate mate - if the best move is checkmate, say so. No PV needed.
- PV analysis - walk the engine’s best line (up to 5 half-moves), track every capture by both sides, compute the material-balance delta, and detect mate in the line. From the structured `PVAnalysis` result:
  - Mate in N: “Qf7+ leads to checkmate in 3 moves: Qf7+ Kh8 Qf8#”
  - Sacrifice combination: the player gives up material but ends with a net gain of ≥3 pawns. If a tactical pattern label is available, it enriches the message: “Bxh7+ wins the queen via discovered attack: Bxh7+ Kxh7 Ng5+ Kg8 Qxd5”. Otherwise: “wins the queen through a combination”.
  - Non-sacrifice material win: net gain of ≥1 pawn through a multi-move sequence.
  - Simple direct capture: the PV handler deliberately returns `None` so the static layer can describe it more concisely (“wins the rook”) without redundantly showing the one-move line.
- Static fallback - used when no PV is available or when the PV is uninformative (no material gain, no mate). This is the old template system, kept as a safety net. It checks, in order: checkmate, check + capture, check + tactical pattern, named tactical patterns (fork, pin, skewer, discovered attack, back-rank threat, hanging piece), retreat to safety, simple captures, null-move threat detection, centipawn-loss avoidance, and a bare fallback.
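The material-balance part of the PV walk reduces to a signed sum over captures. A toy sketch using the Bxh7+ line quoted above - the capture-list representation is mine, not the project’s data model:

```python
# Standard piece values in pawns.
PIECE_VALUES = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9}

def pv_material_delta(captures: list[tuple[str, str]]) -> int:
    """Net material gain for the mover over the PV.
    captures: (side, piece) per capture along the line, side 'us'/'them'."""
    delta = 0
    for side, piece in captures:
        delta += PIECE_VALUES[piece] if side == "us" else -PIECE_VALUES[piece]
    return delta
```

In the Bxh7+ Kxh7 Ng5+ Kg8 Qxd5 line, we capture a pawn, lose the bishop, then win the queen: 1 - 3 + 9 = +7 pawns, which clears the ≥3 threshold for a sacrifice combination.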
Explaining the blunder
The blunder-side explanation is simpler and still fully static. It examines the position after the blunder was played and checks, in order:
- Did the player miss an immediate mate?
- Did the moved piece land on an undefended square? (“Your bishop on e5 is undefended”)
- Did moving the piece expose another piece? (“Moving your knight exposed your queen on d1”)
- Did the player ignore an existing threat - a friendly piece already under profitable attack that the blunder doesn’t address?
- Was it a bad capture - trading a high-value piece for a low-value one on a defended square?
- Fallback: report the centipawn loss as pawn equivalents.
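The ordered-checks structure can be sketched as a simple rule chain: the first rule that matches the post-blunder position wins. The rule functions and context keys below are illustrative assumptions, not the project’s code:

```python
from typing import Callable, Optional

# A rule inspects a context dict describing the post-blunder position and
# returns a translation key if it applies, else None.
Rule = Callable[[dict], Optional[str]]

def missed_mate(ctx):
    return "explain.missed_mate" if ctx.get("missed_mate") else None

def hung_piece(ctx):
    return "explain.hanging_piece" if ctx.get("landed_undefended") else None

def exposed_piece(ctx):
    return "explain.exposed_piece" if ctx.get("exposed_piece") else None

def cp_fallback(ctx):
    return "explain.cp_loss"  # always matches: report loss in pawn equivalents

# Order matters: most specific explanation first, bare fallback last.
RULES: list[Rule] = [missed_mate, hung_piece, exposed_piece, cp_fallback]

def explain_blunder(ctx: dict) -> str:
    for rule in RULES:
        if (key := rule(ctx)) is not None:
            return key
    raise AssertionError("unreachable: cp_fallback always matches")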
Tactical Pattern Detection
Beyond explanations, the tactical analysis module classifies blunders into two categories:
- Missed tactic - the best move exploited a pattern you didn’t see (fork, pin, skewer, etc.)
- Allowed tactic - your blunder let the opponent execute a tactic against you
The detection is pure python-chess board manipulation - no engine needed. For forks, it checks if the best move attacks two or more pieces worth at least as much as the attacker. For pins, it uses the library’s is_pinned() for absolute pins and manually traces rays for relative pins. Skewers check if a sliding piece attack goes through a more-valuable piece to a less-valuable one behind it.
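The project does this with python-chess; the dependency-free sketch below shows just the core fork test on 0..63 square indices: does the attacker hit two or more enemy pieces worth at least as much as itself?

```python
# Piece values in pawns; the king counts as effectively infinite.
PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9, "K": 100}

# (file, rank) offsets of a knight's moves.
KNIGHT_OFFSETS = [(1, 2), (2, 1), (2, -1), (1, -2), (-1, -2), (-2, -1), (-2, 1), (-1, 2)]

def knight_attacks(square: int) -> list[int]:
    """Squares a knight on `square` (0=a1 .. 63=h8) attacks."""
    f, r = square % 8, square // 8
    return [nf + nr * 8 for df, dr in KNIGHT_OFFSETS
            if 0 <= (nf := f + df) < 8 and 0 <= (nr := r + dr) < 8]

def is_knight_fork(knight_sq: int, enemy_pieces: dict[int, str]) -> bool:
    """enemy_pieces maps square index -> piece letter ('K', 'Q', 'R', ...).
    A fork needs two or more targets worth at least a knight."""
    attacked = [p for sq, p in enemy_pieces.items() if sq in knight_attacks(knight_sq)]
    worthy = [p for p in attacked if PIECE_VALUES[p] >= PIECE_VALUES["N"]]
    return len(worthy) >= 2
```

A knight on c7 hitting the king on e8 and the rook on a8 is the classic case; the same knight attacking only the king plus a pawn is not a fork by this test.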
Discovered attacks required the most careful implementation. You need to compare the attacks of each piece before and after the move to find newly revealed lines. A discovered check (revealed attack on the king) is the most forcing and ranks highest.
This enrichment data feeds back into the trainer and dashboard - you can see patterns like “you keep missing knight forks” or “your blunders often allow discovered attacks,” and the weighted puzzle selection will serve you more of those patterns.
Opening Trap Detection
This one was fun. The trap detection system pattern-matches your games against a database of known opening traps - Scholar’s Mate, Fried Liver Attack, Fishing Pole Trap, and so on. Each trap definition includes multiple entry move orders (because the same trap can arise via transpositions), the critical mistake move, the refutation, and a recognition tip.
For each game, the detector classifies the match as:
- Entered - you reached the trap position but neither side triggered it
- Fell for - you (or your opponent) fell into the trap
- Executed - you successfully deployed the trap against your opponent
I haven’t yet spent enough time to make it work reliably (traps are detected in games, but detection of whether you fell for them still has errors).
Difficulty Scoring
Not all blunders are equally hard to find. A blunder where the best move is a flashy queen capture with check is easier to spot than one where the best move is a quiet knight retreat that prevents a tactic three moves later.
The difficulty score (0-100) considers:
- Best move type: Quiet moves (no capture, no check) score highest. Captures without check are medium. Checks are easiest.
- Legal move count: Fewer legal moves means less to calculate - positions with ≤3 legal moves are easier.
- CP loss magnitude: Very large cp loss with a quiet best move suggests a deep tactical idea that requires calculation.
This feeds into the trainer’s difficulty filter - beginners can start with easier blunders (obvious captures they missed) and work up to the subtle positional ones.
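A hedged sketch of how those three factors could combine into a 0-100 score - the individual weights below are illustrative assumptions, not the project’s actual constants:

```python
def difficulty_score(is_capture: bool, is_check: bool,
                     legal_moves: int, cp_loss: int) -> int:
    """Combine best-move type, legal-move count, and cp-loss magnitude
    into a 0-100 difficulty estimate (weights are illustrative)."""
    score = 0
    # Best move type: quiet moves are hardest to spot, checks easiest.
    if not is_capture and not is_check:
        score += 50
    elif is_capture and not is_check:
        score += 30
    else:
        score += 10
    # Fewer legal moves means less to calculate.
    score += 0 if legal_moves <= 3 else min(25, legal_moves)
    # Huge cp loss with a quiet best move hints at a deep tactical idea.
    if cp_loss >= 500 and not is_capture and not is_check:
        score += 25
    return min(score, 100)
```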
The Redesign: From Bootstrap Blob to Bauhaus
I’m not a UI/UX person - CSS scares me; that’s why I’ve been doing backend for 10+ years. AI-assisted coding changed that radically, and now I’m able to build something with at least a not-terrible UI.
After a spree of smacking features together, I stopped for a while, looked at the UI, and realized it was sloppy - like an undergraduate weekend project. Rounded corners everywhere. Box shadows on everything. A blue-gray-white palette that looked like every other vibe-coded app. A mess of styles, bad vertical and horizontal alignment. It worked, but it had zero personality - the kind of interface an AI generates when you say “make me a dashboard.”
The old trainer and dashboard:


For v2.0, I rebuilt the entire visual language around a Bauhaus-inspired design system. The thesis: chess is already a game of grids, geometry, and binary contrast. The UI should amplify that, not fight it with rounded-corner softness. Also I hate rounded corners. And liquid glass. Screw you, Apple, for that abomination of a design, you make me sick.
The new trainer and dashboard:


The redesign wasn’t just cosmetic. It forced me to think about every component in terms of affordance: if it looks clickable, it must be clickable. Informational elements (metadata tags, phase labels) have no borders and no hover states. Interactive elements always have visible borders. This was already sort of true in the old design, but the rounded corners and shadows muddied the distinction.
Was it worth spending a week on visual design for a self-hosted tool that maybe 3 people will ever use? Probably not from a rational perspective. But I spend hours looking at this interface while solving puzzles, and life’s too short for generic UIs.
What I’d Do Better
Explanation quality: The blunder explanations are decent for obvious cases (hanging pieces, simple forks) but struggle with quiet positional blunders. “Your move lost 250 centipawns” isn’t really an explanation. Explaining why a quiet move is bad requires understanding positional concepts (pawn structure, piece activity, king safety) that pure python-chess board analysis can’t capture.
This is actually a well-studied problem in chess AI research. The field has converged on four approaches, each with trade-offs:
- Rule-based pattern detection (what Blunder Tutor does today) - reliable for tactical motifs but blind to subtle positional factors. DecodeChess is the commercial state of the art here, explaining moves across five dimensions: threats, plans, piece functionality, good moves, and concepts.
- Neural sequence-to-sequence models - treat commentary as a translation task (board state → text). ChessCoach uses a Transformer decoder trained on human commentary. More natural output, but can hallucinate chess facts.
- LLM + engine grounding - the emerging dominant paradigm. Stockfish provides the analytical truth, an LLM generates the explanation. The key finding across the literature: LLMs hallucinate catastrophically when evaluating positions themselves, but do well when explaining pre-computed engine analysis. Projects like chessagine-mcp bridge Stockfish with Claude/GPT for this.
- Concept-guided generation - the current state of the art. Kim et al. (2024) extract “concept vectors” (king safety, center control, piece activity) from expert models, then use them to guide LLM commentary. Achieves human-level correctness.
A future version of Blunder Tutor might use the LLM + engine grounding approach - feed Stockfish’s PV line and positional features into an LLM to generate something like “after Nf5 Qe7 Nd6+ Kd8 Nxb7, White wins the bishop because the knight fork on d6 hits both the king and the undefended b7 bishop.” The hard part isn’t the LLM call - it’s ensuring the output is factually correct about what’s happening on the board, which is where concept extraction or symbolic verification (like Caïssa AI’s Prolog + Neo4j approach) comes in.
Test coverage for chess edge cases: The move classification code has a lot of edge cases (positions with multiple queens, underpromotion, 50-move-rule draws mid-analysis). I’ve hit most of them through real-game testing, but a comprehensive test suite with known-position fixtures would catch regressions faster.
Did I get better using this tool?
I don’t know yet 🌚. My ELO hasn’t moved much. I still hang pieces in time trouble and walk into knight forks I’ve seen a hundred times. The dashboard tells me my blunder rate in the middlegame dropped by a few percent, but I’m not sure if that’s the tool or just pattern recognition from playing more games.
What I did get is something I wasn’t expecting: I now see the board differently. After spending weeks building software that dissects positions — tracking which squares are defended, which pieces are pinned, what the engine’s best line actually does move by move — I started noticing those things in my own games. Not always. Not fast enough. But the patterns are there, somewhere between the code and the chess, slowly merging.
I also learned that building a tool to fix your weaknesses is a fantastic way to avoid actually sitting down and fixing your weaknesses. I’ve mass-analyzed 500+ games and solved maybe 40 puzzles from them. The cobbler’s children, as they say.
The project is open-source and free to use. If you try it and find it useful — or useless — I genuinely want to hear about it. And if you’re also an 850-rated adult who suspects their brain peaked at 30, welcome to the club. At least we have good tooling now.
References
Core libraries and tools used
- python-chess - the backbone of everything: move generation, PGN parsing, Stockfish communication (2,400+ stars)
- Stockfish - the chess engine, now with NNUE evaluation where 100cp = 50% win probability in engine self-play
- Chessground - Lichess’s chessboard UI component
- FastAPI - async Python web framework
- Alembic - database migrations for SQLAlchemy
- hyx - resilience library (retries with exponential backoff and jitter)
- APScheduler - background job scheduling
Move classification and evaluation
- Lichess source code (lila) - open-source chess server; `Advice.scala` has the move classification logic I borrowed
- Lichess winning chances formula - the sigmoid conversion from centipawns to winning probability
Chess explanation research
- Jhamtani et al. (ACL 2018) - “Learning to Generate Move-by-Move Commentary for Chess Games from Large-Scale Social Forum Data” - the foundational 298K move-commentary dataset from CMU
- Zang, Yu, Wan (ACL 2019) - jointly training a chess engine with commentary generation
- Lee, Wu, Dinan, Lewis (Meta AI, 2022) - Stockfish + BART hybrid for controllable chess commentary
- McGrath et al. (DeepMind, PNAS 2022) - “Acquisition of Chess Knowledge in AlphaZero” - proving neural engines learn human-recognizable concepts
- Kim, Goh, Hwang, Cho, Ok (2024) - concept-guided chess commentary, current state of the art for automated explanation quality
Open-source projects in this space
- ChessCoach - C++/Python engine with neural commentary decoder and Lichess bot integration (520 stars)
- OpenChess-Insights - open-source game review with move explanations, closest to Chess.com’s analysis
- chessagine-mcp - MCP server bridging Stockfish/Maia2 with Claude for natural language analysis
- DecodeChess - commercial XAI chess tutor explaining moves across five dimensions
- Maia Chess - predicting what humans at specific rating levels would play (1100–1900 Elo models)
- php-chess Tutor - rule-based positional assessment (center control, pawn structure, king safety)
- Chess Stalker - inspiration for several dashboard analytics features
