Each LLM is given the same 1000 chess puzzles to solve. See puzzles.csv. Benchmarked on Mar 25, 2024.
Model | Solved | Solved % | Illegal Moves | Illegal Moves % | Adjusted Elo |
---|---|---|---|---|---|
gpt-4-turbo-preview | 229 | 22.9% | 163 | 16.3% | 1144 |
gpt-4 | 195 | 19.5% | 183 | 18.3% | 1047 |
claude-3-opus-20240229 | 72 | 7.2% | 464 | 46.4% | 521 |
claude-3-haiku-20240307 | 38 | 3.8% | 590 | 59.0% | 363 |
claude-3-sonnet-20240229 | 23 | 2.3% | 663 | 66.3% | 286 |
gpt-3.5-turbo | 23 | 2.3% | 683 | 68.3% | 269 |
claude-instant-1.2 | 10 | 1.0% | 707 | 70.7% | 245 |
mistral-large-latest | 4 | 0.4% | 813 | 81.3% | 149 |
mixtral-8x7b | 9 | 0.9% | 832 | 83.2% | 136 |
gemini-1.5-pro-latest* | FAIL | - | - | - | - |
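
For anyone reading the table: the percentage columns appear to be just the counts divided by the 1000 puzzles. A minimal sketch of that arithmetic, assuming that reading is right (the `pct` helper and the `rows` list are my own illustration, not part of the benchmark code):

```python
TOTAL_PUZZLES = 1000  # stated in the post: every model gets the same 1000 puzzles


def pct(count, total=TOTAL_PUZZLES):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# (model, solved, illegal_moves) copied from the first two rows of the table
rows = [
    ("gpt-4-turbo-preview", 229, 163),
    ("gpt-4", 195, 183),
]

for model, solved, illegal in rows:
    print(f"{model}: solved {pct(solved)}%, illegal moves {pct(illegal)}%")
```

The Adjusted Elo column is not reproduced here, since the post does not say how it is derived.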
Published by the CEO of Kagi!
I wonder how many of the ones they “solved” were solved only because they’d seen them discussed somewhere in the data set, considering the puzzles are apparently from a public resource.
Yeah, I don’t know why anyone knowledgeable would expect them to be good at chess. LLMs don’t generalise, reason or spot patterns, so unless they read a chess book where the problems came from…
Likely close to 100%. If you read the (rather good) article, a little further down they test whether the LLM can play an extremely simplistic “Connect 4” game they devise, as a way of isolating reasoning capabilities specifically.
It cannot.
Chess puzzles, in particular, are frequently shared and discussed in online chess spaces, so the LLM will have a significant amount of material to work with when it tries to predict the best response to give to the prompt.