Deep Research Report

LLMs vs Traditional Chess Engines:
Can Language Models Play Chess?

A comprehensive analysis of how GPT-4, Claude, Gemini, and other large language models perform against Stockfish, Leela Chess Zero, and purpose-built chess AI -- with ELO ratings, benchmarks, and the science behind the gap.

Published March 2026 | 20+ Sources Analyzed | bbridgford.com

The Numbers at a Glance

The ELO gap between language models and traditional chess engines remains enormous -- but the story is more nuanced than "LLMs can't play chess."

  • Stockfish 17.1: 3,653 ELO (CCRL 40/15 rating, Mar 2026)
  • DeepMind Searchless Chess: 2,895 (270M param transformer, no search)
  • Best general-purpose LLM (gpt-3.5-turbo-instruct): ~1,800
  • Best reasoning LLM (o3): ~1,200 (per Magnus Carlsen's estimate)
  • o3 (low) vs Dragon engine: 758 (LLM Chess benchmark, Dec 2025)
  • Magnus Carlsen (human #1): 2,840 (FIDE rating, December 2025)
Key Insight: The best general-purpose LLM (gpt-3.5-turbo-instruct at ~1800 ELO) plays at roughly the level of a strong club player. The best traditional engine (Stockfish 17.1 at 3653 ELO) is nearly 1,900 rating points stronger -- a gap so large that the engine would be expected to win virtually 100% of games.
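That near-certainty follows directly from the Elo model's expected-score formula. A minimal sketch (the matchup itself is illustrative only, since the two ratings come from different measurement pools):

```python
# Expected score from the Elo model: E_A = 1 / (1 + 10^((R_B - R_A) / 400)).
def expected_score(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

# Equal ratings give an even game.
print(round(expected_score(1500, 1500), 2))  # 0.5

# The ~1,850-point gap quoted above: the LLM's expected score per game
# is on the order of 2e-5, i.e. the engine wins virtually always.
print(expected_score(1800, 3653))
```

At an 1,850-point deficit, the weaker side's expected score works out to roughly 2 points per 100,000 games.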

ELO Ratings: The Full Landscape

Placing LLMs, traditional engines, specialized chess transformers, and human players on the same scale reveals the true competitive picture.

  • Stockfish 17.1: 3,653
  • Leela Chess Zero: ~3,500
  • AlphaZero (2018): ~3,400+
  • DeepMind Searchless: 2,895
  • Magnus Carlsen: 2,840
  • ChessLLM (fine-tuned): 1,788
  • gpt-3.5-turbo-instruct: ~1,750
  • GPT-4 (chat): ~1,370
  • o3 (Carlsen est.): ~1,200
  • Grok 4 (est.): ~800
  • o3 (low) vs Dragon: 758
Rating Context: ELO ratings are always relative to their measurement pool. The ~1200 estimate from Magnus Carlsen for o3 at the Kaggle tournament and the 758 from the LLM Chess benchmark use different methodologies. The 1750 for gpt-3.5-turbo-instruct was measured against calibrated Stockfish levels. Direct cross-system comparisons are approximate.
Entity | Type | ELO (Approx.) | Source / Method | Year
Stockfish 17.1 | Engine | 3,653 | CCRL 40/15 | 2025
Leela Chess Zero (open-source, AlphaZero-style) | Engine | ~3,500 | TCEC / CCRL | 2025
AlphaZero (Google DeepMind) | Engine | ~3,400+ | vs Stockfish 8 (1,000 games) | 2018
DeepMind Searchless Chess (270M param transformer) | Specialized | 2,895 | Lichess blitz vs humans | 2024
DeepMind MCTS-MAV (LLM + external search) | Hybrid | GM-level | ICML 2025 paper | 2025
Magnus Carlsen | Human | 2,840 | FIDE Standard | 2025
ChessLLM (NAACL 2025 paper) | Fine-tuned LLM | 1,788 | vs Stockfish (10x sampling) | 2025
gpt-3.5-turbo-instruct (OpenAI) | LLM | ~1,750 | vs calibrated Stockfish | 2023
GPT-4 (OpenAI, chat) | LLM | ~1,370 | vs calibrated Stockfish | 2024
o3, medium (OpenAI reasoning) | LLM | ~1,200 | Carlsen estimate (Kaggle) | 2025
Grok 4 (xAI) | LLM | ~800 | Carlsen estimate (Kaggle) | 2025
o3, low (OpenAI reasoning) | LLM | 758 | LLM Chess vs Dragon | 2025
Chess-GPT, 50M (research model) | Fine-tuned LLM | ~1,300 | Emergent world model study | 2024

Why the Gap Exists: Architecture Matters

Traditional chess engines and LLMs solve the chess problem using fundamentally different computational strategies. Understanding this reveals why the gap is so large -- and why it may never fully close for general-purpose models.

Traditional Engines (Stockfish)

Core approach: Exhaustive tree search with alpha-beta pruning, guided by an evaluation function that combines hand-tuned heuristics with NNUE.

  • Evaluates 60+ million positions per second
  • Guaranteed legal move generation via internal board representation
  • Deterministic: same position always produces same analysis
  • Perfect knowledge of all chess rules encoded in software
  • Uses efficient bitboard representations (64-bit integers for board state)

Modern Stockfish combines classical search with NNUE (Efficiently Updatable Neural Networks) for position evaluation, achieving the best of both worlds.
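The search strategy described above can be sketched as negamax with alpha-beta pruning. Chess specifics (move generation, NNUE evaluation) are abstracted behind callbacks, and a trivial take-away game stands in for the real rules -- a sketch of the algorithm, not engine code:

```python
import math

def alphabeta(state, depth, alpha, beta, moves, apply_move, evaluate):
    """Negamax alpha-beta: best achievable score for the side to move."""
    legal = moves(state)
    if depth == 0 or not legal:
        return evaluate(state)
    best = -math.inf
    for m in legal:
        score = -alphabeta(apply_move(state, m), depth - 1,
                           -beta, -alpha, moves, apply_move, evaluate)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break  # cutoff: the opponent would never allow this line
    return best

# Toy game instead of chess: take 1 or 2 counters, taking the last wins.
# Positions that are multiples of 3 are losses for the side to move.
moves = lambda n: [m for m in (1, 2) if m <= n]
apply_move = lambda n, m: n - m
evaluate = lambda n: -1 if n == 0 else 0  # at n == 0 the mover has lost

print(alphabeta(3, 10, -math.inf, math.inf, moves, apply_move, evaluate))  # -1
print(alphabeta(4, 10, -math.inf, math.inf, moves, apply_move, evaluate))  # 1
```

Stockfish layers dozens of refinements (move ordering, transposition tables, null-move pruning) on top of this core loop, but the cutoff logic is the same.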

Neural Network Engines (AlphaZero / LC0)

Core approach: Deep neural network trained via self-play with Monte Carlo Tree Search (MCTS) at inference time.

  • Evaluates ~60,000 positions per second (1000x fewer than Stockfish)
  • Learns evaluation from scratch via reinforcement learning
  • Still uses legal move generation and explicit board representation
  • MCTS provides structured lookahead (not just pattern matching)
  • Produces more "human-like" creative play style

AlphaZero defeated Stockfish 8 with a score of +155 -6 =839 in 1,000 games (2018), despite searching 1000x fewer positions.
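The "structured lookahead" of MCTS comes from its child-selection rule. AlphaZero-style engines score candidate moves with a PUCT formula; a minimal sketch (the constant and the inputs below are illustrative, not tuned values):

```python
import math

def puct_score(q, prior, parent_visits, child_visits, c_puct=1.5):
    """AlphaZero-style selection: value estimate plus an exploration bonus
    that is large for high-prior, rarely visited children."""
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + exploration

# Early in the search, a promising but unvisited move outranks a
# well-explored one, steering rollouts toward the network's suggestions.
fresh = puct_score(q=0.0, prior=0.4, parent_visits=100, child_visits=0)
explored = puct_score(q=0.3, prior=0.4, parent_visits=100, child_visits=50)
print(fresh > explored)  # True
```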

Large Language Models (GPT, Claude, Gemini)

Core approach: Next-token prediction on text sequences. Chess moves are just another sequence of characters.

  • No internal board representation -- must infer state from move history text
  • No tree search -- no systematic lookahead of future positions
  • No legal move guarantee -- can and do produce illegal moves
  • Pattern matches against training data containing chess games
  • Tokenization artifacts: chess notation may be split at arbitrary boundaries

Despite these limitations, research shows LLMs do develop emergent internal representations of board state -- linear probes can decode piece positions from hidden activations with 99.2% accuracy.
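A linear probe in this sense is nothing more than an affine map applied to a frozen hidden activation. A toy sketch with hypothetical weights (in the cited work, the weights are fit by regression on activations harvested from real games):

```python
# A linear probe: one affine map per class over a hidden activation h.
def probe_logits(h, weights, biases):
    return [sum(w * x for w, x in zip(row, h)) + b
            for row, b in zip(weights, biases)]

def predict_piece(h, weights, biases, classes):
    logits = probe_logits(h, weights, biases)
    return classes[logits.index(max(logits))]

# Toy 3-dim activation and a 2-class probe ("empty" vs white pawn "P").
# These weights are made up purely for illustration.
classes = ["empty", "P"]
weights = [[1.0, 0.0, -1.0],
           [-1.0, 0.0, 1.0]]
biases = [0.0, 0.0]
print(predict_piece([0.2, 0.5, 0.9], weights, biases, classes))  # P
```

The striking point is that a map this simple suffices: if piece identity is linearly decodable from activations, the board state is represented explicitly, not just implied by the move text.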

Specialized Chess Transformers (DeepMind)

Core approach: Standard transformer architecture, purpose-trained on chess positions annotated with Stockfish evaluations.

  • 270M parameters trained on 10M games (15 billion data points)
  • Predicts action-values (win percentages) for board positions
  • Achieves Grandmaster level (2895 ELO) without any search
  • Represents a middle ground: transformer architecture but chess-specific training
  • Cannot match Stockfish's 3653 -- "perfect distillation is still beyond reach"

This work (NeurIPS 2024) demonstrates that transformers can encode strong chess knowledge, but only when purpose-built -- not as a side effect of general language training.

The Fundamental Problem: Chess engines know the rules and search for good moves. LLMs pattern-match text sequences and guess the next token. When an LLM plays chess, it's essentially autocompleting a story about chess -- not actually playing the game. The fact that it works at all is remarkable; the fact that it doesn't work well is inevitable.

Benchmarks, Tournaments, and Competitions

From controlled benchmarks to the first-ever AI chess tournament, here's what the data shows.

Kaggle Game Arena -- First AI Chess Tournament (August 2025)

Google and Kaggle organized the first major LLM chess tournament in August 2025. Eight leading models competed in a single-elimination bracket. On each turn a model got four attempts to produce a legal move; exhausting them forfeited the game.
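The four-attempt rule amounts to a simple retry loop; in this sketch, `propose_move` and `legal_moves` are hypothetical stand-ins for the model call and a rules engine:

```python
# Sketch of the tournament's per-move rule: up to four attempts to emit
# a legal move, otherwise the game is forfeited.
def play_one_move(board, propose_move, legal_moves, max_attempts=4):
    legal = set(legal_moves(board))
    for attempt in range(max_attempts):
        candidate = propose_move(board, attempt)
        if candidate in legal:
            return candidate
    return None  # forfeit by illegal move

# A model that only finds a legal move on its third try still survives.
attempts = iter(["Qxh9", "Ke9", "e4", "d4"])
move = play_one_move(
    board=None,
    propose_move=lambda board, i: next(attempts),
    legal_moves=lambda board: ["e4", "d4", "Nf3"],
)
print(move)  # e4
```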

Round | Match | Score | Notable
QF | o3 vs Kimi k2 | 4-0 | All games ended within 8 moves; Kimi couldn't make legal moves
QF | o4-mini vs DeepSeek R1 | 4-0 | DeepSeek struggled with move legality
QF | Gemini 2.5 Pro vs Claude 4 Opus | 4-0 | Only match with more checkmates than illegal-move forfeits
QF | Grok 4 vs Gemini 2.5 Flash | 4-0 | Grok showed the strongest overall play on Day 1
SF | o3 vs o4-mini | 4-0 | o3 dominant throughout
SF | Grok 4 vs Gemini 2.5 Pro | Grok advances | Decided on tiebreaks after close play
3rd place | Gemini 2.5 Pro vs o4-mini | 3.5-0.5 | Gemini takes bronze
Final | o3 vs Grok 4 | 4-0 | Grok collapsed, dropping pieces early in every game
"No, no chance [of beating my phone]. A gifted child who doesn't know how the pieces move."
-- Magnus Carlsen, estimating LLM chess skill at ~800-1200 ELO

Key Takeaway: Even the tournament winner (o3) plays at roughly Class D level (~1200 ELO). Many matches were decided by illegal move forfeiture rather than actual chess skill. GM Hikaru Nakamura noted o3 made "very few mistakes" but GM David Howell observed that Grok "crumbled under pressure."

LLM Chess Benchmark (Montgomery et al., December 2025)

The most comprehensive academic benchmark, testing 50+ models against both a random opponent and the Komodo Dragon 1 chess engine.

Key Results vs Random Opponent (30 games each):

Model | Win Rate | Checkmate Rate
o3 (medium) | 100.0% | N/A
o3 (low) | 96.3% | 92.7%
o4-mini (high) | 96.1% | 92.1%
o1 (medium) | 91.2% | 82.5%
Grok 3 Mini (high) | 86.4% | 72.7%
Non-reasoning avg | 0.7% | --

Move Quality (o4-mini vs GPT-4.1-mini):

Metric | o4-mini (medium) | GPT-4.1-mini
Blunder Rate | 4.2% | 31.3%
Mistake Rate | 1.1% | 8.7%
Best Moves Found | 19.5% | 4.1%

Critical Finding: 71.9% of non-reasoning model losses were due to instruction-following failures (unable to format valid moves), not chess knowledge. Reasoning models reduced this to 24.4%.

Illegal Move Rates: The Achilles Heel

The percentage of illegal moves is perhaps the most telling metric for understanding the LLM-chess gap:

Model | Illegal Move Rate | Games with Illegal Moves
gpt-3.5-turbo-instruct | 0.3% of moves | 16% of games
GPT-4 (chat) | 0.66% of moves | 32% of games
GPT-4o | 12.7% of moves | High
gpt-3.5-turbo (chat) | ~50% of moves | 93% of games
text-davinci-003 | Nearly all | 99% of games
Reasoning models (o1/o3) | <1% (with thinking) | <5% of games

Reasoning models achieve near-perfect legality because they use their "thinking budget" to write out the board state, enumerate candidate moves, verify legality, and self-correct before committing. This is computationally expensive but effective.

AlphaZero vs Stockfish (2017-2018) -- The Landmark

For historical context, AlphaZero's matches against Stockfish 8 remain the most famous neural network vs. engine competition:

  • Initial 100-game match (2017): AlphaZero won 28, lost 0, drew 72
  • Extended 1,000-game match (2018): AlphaZero won 155, lost 6, drew 839
  • AlphaZero searched ~60,000 positions/sec vs Stockfish's ~60 million -- 1000x fewer
  • AlphaZero learned chess from scratch in 4 hours of self-play
  • Its games featured stunning piece sacrifices for long-term strategic advantage

Important distinction: AlphaZero is a purpose-built chess engine with MCTS, not an LLM. It has perfect knowledge of chess rules and an internal board representation. It just uses a neural network for evaluation instead of hand-coded heuristics.

Key Research and Notable Experiments

Academic papers, open-source projects, and corporate research pushing the boundaries of what transformers can learn about chess.

DeepMind: Grandmaster-Level Chess Without Search (NeurIPS 2024)

Perhaps the most important result bridging LLMs and chess engines. DeepMind trained a 270M-parameter transformer on ChessBench -- 10 million Lichess games annotated with Stockfish 16 evaluations (15 billion data points).

  • Result: 2895 ELO on Lichess blitz (grandmaster level) with zero search at test time
  • Puzzle solving: Lichess Puzzle ELO of 2867
  • Trained to predict action-values (win percentages) for board positions
  • Demonstrated that Stockfish's search-based algorithm can be approximately distilled into a transformer
  • Key limitation: "Perfect distillation is still beyond reach" -- gap to actual Stockfish remains

This proves transformers can learn strong chess -- but only when purpose-built and trained on chess-specific annotated data, not as a side effect of general language modeling.

DeepMind: Mastering Board Games with External and Internal Planning (ICML 2025)

A follow-up that added search back into the transformer equation, achieving even stronger results:

  • External search: Transformer guides MCTS rollouts without external engine calls
  • Internal search: Model generates linearized trees of future positions in-context
  • Both approaches achieved Grandmaster-level performance on a human-like search budget
  • Pre-training method "minimizes hallucinations" -- highly accurate state prediction and legal moves
  • Tested across Chess, Fischer Random Chess, Connect Four, and Hex

This represents the most promising hybrid approach: a chess-trained transformer augmented with structured search.

Chess-GPT: Emergent World Models (Karvonen, 2024)

A landmark interpretability study showing that even a small (50M parameter) model trained on PGN chess notation develops sophisticated internal representations:

  • Linear probes achieved 99.2% accuracy classifying pieces on each square
  • The model develops a "my/their" perspective rather than absolute black/white encoding
  • 89% accuracy estimating player skill level (above/below certain ELO thresholds) from move patterns
  • Model learns rules including check, checkmate, castling, en passant, and pinned pieces
  • Intervening on internal activations can change the model's play style (e.g., simulating a stronger player)
  • 50M parameter model reached ~1300 ELO from 5M training games

Key Insight: LLMs don't just memorize move sequences -- they develop genuine internal representations of the board. But these representations are fragile and break under perturbation.

ChessLLM: Complete Games Enable Mastery (NAACL 2025)

Published in January 2025, this paper demonstrated that training on complete game sequences (rather than isolated positions) dramatically improves LLM chess ability:

  • Achieved 1788 ELO vs Stockfish when permitted 10x sampling
  • Long-round data supervision yields a 350 ELO improvement over short-round data
  • Trained on 20 billion tokens of complete chess games
  • Uses FEN (Forsyth-Edwards Notation) for position representation
  • Open-sourced code, model, and dataset
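FEN, which ChessLLM uses for position representation, packs the whole board into a single line of text; a minimal sketch of expanding its piece-placement field back into an 8x8 grid:

```python
# FEN's first field lists the board rank by rank (8th rank first),
# with digits counting runs of empty squares.
def parse_fen_board(fen: str):
    board = []
    for rank in fen.split()[0].split("/"):
        row = []
        for ch in rank:
            row.extend(["."] * int(ch) if ch.isdigit() else [ch])
        board.append(row)
    return board

start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
grid = parse_fen_board(start)
print("".join(grid[0]))  # rnbqkbnr
print("".join(grid[4]))  # ........
```

Because every position is a short self-contained string, FEN sidesteps the state-tracking problem of raw move lists: the model never has to replay the game history to know where the pieces stand.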

ChessGPT: Bridging Policy Learning and Language Modeling (NeurIPS 2023)

One of the earliest systematic attempts to make LLMs play chess, presented as a Datasets and Benchmarks paper at NeurIPS 2023:

  • Collected large-scale online chess game datasets with annotated Stockfish evaluations
  • Leveraged PGN metadata: ELO ratings for player strength, annotated evaluations for value learning
  • Introduced both ChessGPT (policy model) and ChessCLIP (state understanding model)
  • Open-sourced code, models, and datasets on GitHub

Strategic Reasoning Limitations (Multiple Papers, 2025)

Multiple 2025 papers investigated whether RL post-training can develop genuine strategic reasoning in LLMs through chess:

  • Finding: All models plateau far below expert levels despite RL fine-tuning
  • The limitation stems from deficits in pretrained models' internal chess understanding
  • RL "mainly amplifies existing capabilities" -- it cannot create chess knowledge from scratch
  • Models cannot reliably track game states or recognize elementary tactics
  • ChessArena benchmark: No model could beat Maia-1100 (human amateur level); some failed to beat a random player
  • Pre-trained domain knowledge is essential -- RL alone is insufficient

The gpt-3.5-turbo-instruct Anomaly

One of the most curious findings in LLM chess research is that OpenAI's gpt-3.5-turbo-instruct (released September 2023) plays significantly better chess than newer, more capable models:

  • ~1750-1800 ELO vs calibrated Stockfish -- strong club player level
  • Only 0.3% illegal move rate across 8,205 moves
  • GPT-4 (chat) scored only ~1370 ELO with 0.66% illegal moves
  • Later chat models (GPT-3.5-turbo chat) had 93% of games containing illegal moves

Why? The instruct model is a text completion model, not a chat model. RLHF training for chat may actually harm chess ability. As a pure next-token predictor, the instruct model more faithfully simulates the chess games in its training data. Chat-tuned models are optimized for helpful conversation, which conflicts with the narrow task of generating valid chess notation.

Gemini 3: Topping the Leaderboard (February 2026)

The most recent development: Google's Gemini 3 Pro and Gemini 3 Flash claimed the top ELO positions on the Kaggle Game Arena chess leaderboard as of February 2026.

  • Marked performance increase over the Gemini 2.5 generation
  • Gemini 3 topped all three game leaderboards (Chess, Poker, Werewolf)
  • Uses pattern recognition and strategic reasoning grounded in chess concepts (piece mobility, pawn structure, king safety)
  • Represents rapid model-over-model improvement in chess ability
  • Specific ELO scores not publicly disclosed, but surpassed o3's previous top position

Evolution Timeline

From "LLMs can't play chess at all" to "LLMs play at club level" in three years.

December 2017
AlphaZero Defeats Stockfish 8
DeepMind's self-taught neural network beats the world's strongest engine 28-0-72 in 100 games, proving neural approaches can master chess. Not an LLM, but the conceptual ancestor.
2022-2023
Early LLM Chess Attempts
Researchers begin testing GPT models on chess. Results are dire: most models produce illegal moves in the majority of games. text-davinci-003 manages only 1 legal game out of 73.
September 2023
gpt-3.5-turbo-instruct Surprises Everyone
OpenAI's completion model plays chess at ~1800 ELO with only 0.3% illegal moves. Grant Slatton's discovery goes viral: "I had previously reported that GPT cannot play chess, but it appears this was just the RLHF'd chat models."
November 2023
ChessGPT (NeurIPS 2023)
First systematic NeurIPS paper on bridging policy learning and language modeling for chess. Open-sources datasets and models.
January 2024
Chess-GPT World Models
Adam Karvonen shows that a 50M-parameter model trained on PGN strings develops emergent internal board representations (99.2% probe accuracy). A 50M model reaches ~1300 ELO.
February 2024
DeepMind: Searchless Chess
270M-parameter transformer achieves 2895 ELO on Lichess blitz -- grandmaster level -- without any search at test time. Published at NeurIPS 2024.
January 2025
ChessLLM Reaches 1788 ELO
NAACL-accepted paper shows training on complete games with 10x sampling achieves near-1800 ELO. Long-round training provides 350 ELO boost over short-round.
Mid 2025
Reasoning Models Break Through
o1, o3, and other reasoning models achieve near-perfect legal move rates via chain-of-thought verification. o3 (medium) achieves 100% win rate vs random opponents. But ELO vs engines remains ~758-1200.
May 2025
DeepMind: Internal + External Planning (ICML 2025)
MCTS-augmented transformer achieves grandmaster-level chess with human-like search budgets. Both internal (in-context tree generation) and external (MCTS guidance) search improve results.
August 2025
Kaggle Game Arena: First AI Chess Tournament
Eight LLMs compete. o3 sweeps the final 4-0 against Grok 4. Magnus Carlsen estimates skill levels at ~800-1200 ELO. Many games decided by illegal move forfeit.
February 2026
Gemini 3 Tops Chess Leaderboard
Google's Gemini 3 Pro and Flash claim top positions on the Kaggle Game Arena chess leaderboard, demonstrating continued rapid improvement in LLM chess ability.

The Verdict: Where Things Stand

Three distinct categories have emerged, each with different implications for the future of AI and chess.

Engines Traditional Chess Engines

Stockfish 17.1 (3653 ELO) remains the undisputed champion. It has won every major championship since 2020 and is roughly 800 ELO stronger than the best human who has ever lived. Its combination of NNUE evaluation with classical search is essentially unbeatable. Magnus Carlsen says he has "no chance" against his phone.

Hybrid Specialized Chess Transformers

DeepMind's purpose-built transformers (2895 ELO searchless; GM-level with MCTS) represent the most exciting frontier. They prove that transformer architecture can encode grandmaster-level chess knowledge. When augmented with MCTS, they approach engine-tier performance with human-scale search budgets. This is the approach most likely to eventually challenge Stockfish.

LLM General-Purpose Language Models

General-purpose LLMs play at 800-1800 ELO depending on the model and measurement methodology. The best (gpt-3.5-turbo-instruct) plays at strong club level. Reasoning models (o3, Gemini 3) show rapid improvement but still plateau far below expert level. The gap to Stockfish is ~1,800-2,800 ELO points -- an unbridgeable chasm with current architectures.

The Core Paradox: The model that plays chess best (gpt-3.5-turbo-instruct at ~1800 ELO) is one of the least capable at general intelligence tasks. The models that are most capable at reasoning, coding, and language (GPT-5, Claude, Gemini 3) play chess at a lower ELO. Chat-tuning, RLHF, and instruction-following training may actually harm chess ability by pulling models away from pure next-token prediction on chess notation.

Will LLMs ever match traditional engines?

Almost certainly not as general-purpose models. The architectural limitations are fundamental:

  • No guaranteed legal moves: Every move requires "hoping" the next token is valid notation for a legal move. Engines generate legal moves by definition.
  • No systematic search: Without exhaustive lookahead, LLMs cannot find tactical combinations that require seeing 10-20 moves ahead. This is where engines dominate.
  • No stable board representation: LLMs must reconstruct board state from text each time. Any error compounds. Engines maintain a perfect internal state.
  • Tokenization mismatch: Chess notation was not designed for transformer tokenization. Position encoding artifacts degrade performance.

However, specialized chess transformers with search (like DeepMind's MCTS-MAV) may eventually match engines. The key is combining the transformer's learned evaluation with a proper search algorithm -- essentially reinventing AlphaZero with a larger, pre-trained model.

What does chess tell us about LLM reasoning?

Chess is increasingly used as a benchmark for genuine reasoning (vs. pattern matching) because it requires:

  • State tracking: Maintaining an evolving board state across 40-80 moves
  • Planning: Evaluating consequences of moves several turns ahead
  • Constraint satisfaction: All moves must be legal within a complex ruleset
  • Adversarial reasoning: Predicting and countering an opponent's strategy

The fact that LLMs perform better at math and coding (where chain-of-thought works well) than at chess (where spatial reasoning and lookahead are essential) suggests that current "reasoning" capabilities are more about sequential logic than spatial-strategic intelligence.

The LLM Chess benchmark found a Pearson r = 0.686 correlation between chess and coding performance -- moderately positive, but with enough divergence to suggest that chess exercises a partly distinct reasoning capability.
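For reference, Pearson's r is just the covariance of the two score lists normalized by their standard deviations; a self-contained sketch on synthetic scores (not the benchmark's data):

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up per-model scores: broadly aligned with some outliers,
# giving a positive but imperfect correlation.
chess = [0.9, 0.7, 0.6, 0.4, 0.2, 0.1]
coding = [0.8, 0.5, 0.7, 0.5, 0.4, 0.1]
print(round(pearson_r(chess, coding), 2))
```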

Sources and References

All primary sources used in this research report, organized by category.

Academic Papers

NeurIPS 2024
"Amortized Planning with Large-Scale Transformers: A Case Study on Chess" (Grandmaster-Level Chess Without Search) -- Google DeepMind. 270M-parameter transformer achieving 2895 ELO.
ICML 2025
"Mastering Board Games by External and Internal Planning with Language Models" -- Google DeepMind. MCTS + transformer hybrid achieving GM-level play.
NAACL 2025
"Complete Chess Games Enable LLM Become A Chess Master" -- ChessLLM achieving 1788 ELO with supervised fine-tuning.
NeurIPS 2023
"ChessGPT: Bridging Policy Learning and Language Modeling" -- Early systematic work on LLM chess.
arXiv 2024
"Emergent World Models and Latent Variable Estimation in Chess-Playing Language Models" -- Karvonen et al. Board state probe accuracy, internal representations.

Benchmarks and Leaderboards

Leaderboard
LLM Chess Leaderboard (Maxim Saplin) -- ELO ratings vs Komodo Dragon with game duration and cost metrics.
Leaderboard
Dubesor AI Chess Leaderboard -- 251 models, Stockfish 17.1 analysis, accuracy metrics.
Benchmark
"Debunking the Chessboard: Confronting GPTs Against Chess Engines" -- Mathieu Acher. Detailed ELO and illegal move analysis.
Benchmark
chess_gpt_eval (GitHub) -- Open-source LLM chess evaluation framework.

Analysis and Commentary

Blog
"Chess-GPT's Internal World Model" -- Adam Karvonen. Detailed probe analysis.
Blog
"Something weird is happening with LLMs and chess" -- Dynomight. Analysis of instruct vs chat model chess divergence.
Blog
"Why LLMs Can't Play Chess" -- Nico Westerdale. Architectural limitation analysis.
Article
"A Guide to Comparing AI Models in 2026: What LLM Chess Reveals" -- EPAM. Chess as AI benchmark analysis.
LessWrong
"Chess as a case study in hidden capabilities in ChatGPT" -- Investigation of latent chess knowledge.
Reference
AlphaZero -- Wikipedia -- Historical AlphaZero vs Stockfish match data.
Reference
Leela Chess Zero -- Wikipedia -- LC0 architecture, training, and performance data.