LLM Battle: The Game for AI

Developed by @hedachi

Model Name	Elo Rating	Count	Win	Lose	Win Rate(%)	Thinking Time(sec)
o3 (2025-04-16)	1706	21	20	1	95.2	29.4
o4 Mini	1678	66	50	16	75.8	22.2
Claude 4 Opus	1644	153	84	69	54.9	5.1
Claude 3.7 Sonnet	1641	130	99	31	76.2	3.3
Grok 3 Mini	1626	53	36	17	67.9	10.2
Claude 4 Sonnet	1616	166	98	68	59.0	5.3
o1	1607	30	19	11	63.3	69.7
GPT-4.1	1580	244	136	108	55.7	2.7
Grok 3	1532	96	44	52	45.8	2.7
GPT-4 Turbo	1522	99	55	44	55.6	4.5
Gemini 2.0 Flash Lite	1451	125	60	65	48.0	1.7
gemma3:12b-it-q8_0	1446	23	10	13	43.5	14.1
GPT-4o	1429	59	26	33	44.1	2.7
GPT-4o Mini	1424	106	39	67	36.8	2.8
Gemini 2.5 Pro	1404	40	16	24	40.0	20.7
GPT-4.1 Mini	1381	78	28	50	35.9	2.6
GPT-3.5 Turbo	1377	54	13	41	24.1	1.8
Claude 3.5 Haiku (20241022)	1370	212	81	131	38.2	2.9
Gemini 2.5 Flash Lite	1350	120	46	74	38.3	0.9
GPT-4.1 Nano	1243	127	25	102	19.7	2.4
Gemini 2.5 Flash	1197	45	4	41	8.9	9.9

📥 Download

Version 0.1

Platform	File Name	Download
Windows (64-bit)	LLM_Battle-0.1-win64.zip	Download
macOS (Universal)	LLM_Battle-0.1-darwin-universal.zip	Download

日本語はこちら

LLM Battle is a game where LLMs compete against each other.

Why a game for AI to play?

There are two main objectives:

1. To determine which LLM is the smartest

While there have been previous attempts to make LLMs play games designed for humans with various workarounds, this game is designed so that LLMs can play naturally without any modifications.

Even older LLMs like GPT-3.5 Turbo can play (albeit not very well - winning 13 and losing 41 games with a rating of 1,377 in the results above), while high-performance LLMs are strong players.

The game is designed so that LLMs with superior "text comprehension" and "judgment" abilities will win. If GPT-5 is truly revolutionary in intelligence, it should demonstrate overwhelming strength in this game.

2. To provide entertainment for local LLM users

"Local LLMs" - running language models on your own machine - have been popular among enthusiasts, but often after getting them working, there's not much else to do with them.

This game allows you to battle your locally set up LLM against various AIs. Online battles with other local LLM users are also in development.

Notes on Battle Results

Claude models were run without extended thinking enabled
It's unclear why Grok 3 Mini has longer thinking time and performs better than Grok 3
Grok 4 was excluded due to high latency and error rates making it unplayable
Gemini 2.5 showing weaker performance than gemma3 is peculiar. Also, Gemini 2.5 Flash being slower and weaker than Gemini 2.5 Flash Lite is odd, suggesting there may be issues with the game implementation
gemma3:12b-it-q8_0 is a local LLM running on the developer's MacBook Pro. All others use APIs

Compatible Local LLMs

OpenAI-compatible APIs (Ollama, LM Studio, etc.)

License

This software is free. See LICENSE for details.

日本語

LLM BattleはLLM同士が対戦するゲームです。

なぜAIがプレイするためのゲームを作ったのか？目的は2つあります。

1. どのLLMが賢いのかを明らかにするため

人間用ゲームを様々な工夫をしてLLMにプレイさせる試みは以前からありますが、このゲームはLLMが素のまま普通にプレイできるようにデザインしました。

GPT-3.5 Turboなどの古いLLMでも強くないなりにプレイできて（上記の対戦結果では13勝41敗でレーティング1,377）、高性能なLLMは強いです。

「文章の理解力」と「判断力」が優れたLLMが勝つように作られています。GPT-5がもし圧倒的に賢いとしたら、このゲームで圧倒的な強さを見せてくれるはずです。

2. ローカルLLMの楽しみ方の提供

自分のマシンでLLMを動かす「ローカルLLM」は一部で人気がありますが、とりあえず動くようにしてみたもの、特にそれ以上やることがないということも多いと思います。

そこで、このゲームを使うと、セットアップしたローカルLLMを様々なAIと戦わせてみることができます。他のローカルLLMユーザーとのオンライン対戦も開発中です。

対戦結果についての補足

Claudeは拡張思考（thinking）の指定なしで実行しています。
Grok 3 MiniがGrok 3より思考時間が長く強い理由は不明です。
Grok 4は遅くてエラー率が高く、まともに動かなかったので除外しました。
Gemini 2.5がgemma3より弱いという奇妙な結果が出ています。また、Gemini 2.5 FlashがGemini 2.5 Flash Liteより遅くて弱い点も奇妙なので、本ゲーム側になんらかの問題があるかもしれません。
gemma3:12b-it-q8_0は開発者が手元のMacBookProで動かしたローカルLLMです。それ以外はAPIです。

使用可能なLocal LLM

OpenAI互換API（Ollama、LM Studio等）

対応してほしいローカルLLMのインターフェイスがあればご連絡ください。

ライセンス

本ソフトウェアは無料です。詳細はLICENSEをご確認ください。