Overview

In this work, I apply NLP methods to non-linguistic data, treating one as a metaphor for the other and looking for analogies between them.

Chess game notations are also a kind of text, and one can consider the records of moves or positions of pieces as words and statements in a certain language.

The games are recorded in a notation adapted from the PGN (portable game notation) format. An uppercase letter denotes a white piece, while a lowercase letter denotes a black piece. The letters are the usual English piece initials: K - king ♔, Q - queen ♕, R - rook ♖, N - knight ♘, B - bishop ♗, P - pawn ♙. Castling is denoted by the move of the king (the movement of the rook is implied by default).

An example of a chess game recorded in this notation looks like this:

Pe2e4 pc7c5 Ng1f3 pd7d6 Pd2d4 pc5d4 Nf3d4 ng8f6 Nb1c3 pa7a6 Bf1e2 pe7e6 Ke1g1 bf8e7 Pf2f4 ke8g8 Bc1e3 nb8c6 Kg1h1 bc8d7 Pa2a4 ra8c8 Nd4b3 nc6a5 Nb3d2 bd7c6 Be2d3 pd6d5 Pe4e5 pd5d4 Be3d4 qd8d4 Pe5f6 be7b4 Pf6g7 qd4g7 Nd2f3 bb4c3 Pb2c3 bc6d5 Qd1e2 rc8c3 Ra1e1 na5c6 Qe2d2 kg8h8 Re1e3 nc6b4 Nf3e5 nb4d3 Pc2d3 rf8c8 Re3g3 rc3c2 Qd2b4 rc2g2 Ne5f7 kh8g8 Nf7h6 kg8h8 Nh6f7 kh8g8 Nf7h6

etc.

Here, the first letter in each "word" denotes the piece and its color, the next two symbols give the square the piece moves from, and the last two symbols give the square it moves to. Thus, Pe2e4 is a move by a white pawn (the uppercase letter means the piece is white, and P stands for pawn) from square e2 to square e4.
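
The reading of a move "word" described above can be sketched in a few lines of Python. This is only an illustration: the names Move and parse_move are hypothetical and not part of this repository, and the sketch assumes every token has exactly five characters, which leaves out any special notation (for example, for promotions) that the collection may contain.

```python
import re
from typing import NamedTuple


class Move(NamedTuple):
    piece: str   # uppercase piece letter: K, Q, R, B, N or P
    color: str   # "white" or "black"
    source: str  # square the piece moves from, e.g. "e2"
    target: str  # square the piece moves to, e.g. "e4"


# One piece letter followed by two squares (file a-h, rank 1-8).
MOVE_RE = re.compile(r"([KQRBNPkqrbnp])([a-h][1-8])([a-h][1-8])")


def parse_move(token: str) -> Move:
    """Split a five-character move "word" into its parts."""
    match = MOVE_RE.fullmatch(token)
    if match is None:
        raise ValueError(f"unrecognized move token: {token!r}")
    letter, source, target = match.groups()
    # Uppercase letter = white piece, lowercase letter = black piece.
    color = "white" if letter.isupper() else "black"
    return Move(letter.upper(), color, source, target)


print(parse_move("Pe2e4"))
# Move(piece='P', color='white', source='e2', target='e4')
```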

The results are presented in the preprint (arXiv:2407.19600).

Data

I used a collection of records of 5,400,137 chess games. These are mainly games played at a high international level.

Based on this data, I prepared two types of models.

Type 1 is a model "based on moves." That is, a listing of a game such as Pe2e4 pc7c5 Ng1f3 pd7d6 Bf1b5 bc8d7 Bb5d7 qd8d7 and so on is taken and considered as a "sentence," with each move in it being a separate word.

Type 2 is a model "based on positions." Here, each move is considered as a "sentence" whose words are both the move itself (->Bc1h6) and all the squares occupied by pieces (Pa2 Pb2 Pc3 and so on). The position on the board is the context for the move, just as the surrounding words form the context for a word in a natural-language statement. Therefore, I prepared the source material differently for the models of the second type. Usually, when a vector model is built on a language corpus, the corpus is divided into sentences, and the vectorization captures only the context within one sentence. For chess statements, I broke the game records down into individual moves and obtained "sentences" of this kind:

ra8 nb8 bc8 qd8 ke8 bf8 ng8 rh8 pa7 pb7 pc7 pd7 pe7 pf7 pg7 ph7 Pa2 Pb2 Pc2 Pd2 Pe2 Pf2 Pg2 Ph2 Ra1 Nb1 Bc1 Qd1 Ke1 Bf1 Ng1 Rh1 ->Pe2e4
ra8 nb8 bc8 qd8 ke8 bf8 ng8 rh8 pa7 pb7 pc7 pd7 pe7 pf7 pg7 ph7 Pe4 Pa2 Pb2 Pc2 Pd2 Pf2 Pg2 Ph2 Ra1 Nb1 Bc1 Qd1 Ke1 Bf1 Ng1 Rh1 ->pc7c5
ra8 nb8 bc8 qd8 ke8 bf8 ng8 rh8 pa7 pb7 pd7 pe7 pf7 pg7 ph7 pc5 Pe4 Pa2 Pb2 Pc2 Pd2 Pf2 Pg2 Ph2 Ra1 Nb1 Bc1 Qd1 Ke1 Bf1 Ng1 Rh1 ->Ng1f3

etc.

Here, the position of the pieces on the board is recorded first (ra8 nb8 bc8...), followed by the move made by the player: ->Pe2e4. To avoid confusing the "words" that denote the position of a piece on the board with the "words" that denote a move, I added a prefix in the form of the symbol "->" to each move.
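
A minimal sketch of how such position "sentences" could be generated from a game record follows. It is not the code used to build the models: the function names (initial_board, position_words, apply_move, position_sentences) are hypothetical, and the word order (rank 8 down to rank 1, files a to h within a rank) is inferred from the examples above. The sketch assumes the game starts from the standard initial position and that castling is written as the king's move. Promotions are not handled, because their notation is not described here.

```python
FILES = "abcdefgh"


def initial_board():
    """Map each occupied square to its piece letter in the starting position."""
    back_rank = "rnbqkbnr"
    board = {}
    for file, piece in zip(FILES, back_rank):
        board[file + "8"] = piece
        board[file + "7"] = "p"
        board[file + "2"] = "P"
        board[file + "1"] = piece.upper()
    return board


def position_words(board):
    """List the pieces as words such as 'ra8', from rank 8 down to rank 1."""
    squares = sorted(board, key=lambda sq: (-int(sq[1]), sq[0]))
    return [board[sq] + sq for sq in squares]


def apply_move(board, move):
    """Update the board in place for a move such as 'Pe2e4'."""
    piece, src, dst = move[0], move[1:3], move[3:5]
    # En passant: a pawn changes file but lands on an empty square.
    if piece in "Pp" and src[0] != dst[0] and dst not in board:
        del board[dst[0] + src[1]]
    # Castling: the king jumps from the e-file to the g- or c-file.
    if piece in "Kk" and src[0] == "e" and dst[0] in "cg":
        rook_src, rook_dst = ("h", "f") if dst[0] == "g" else ("a", "d")
        board[rook_dst + src[1]] = board.pop(rook_src + src[1])
    del board[src]
    board[dst] = piece  # overwrites a captured piece, if any


def position_sentences(game):
    """Return one 'sentence' (list of words) per move of the game."""
    board = initial_board()
    sentences = []
    for move in game.split():
        sentences.append(position_words(board) + ["->" + move])
        apply_move(board, move)
    return sentences


for sentence in position_sentences("Pe2e4 pc7c5 Ng1f3"):
    print(" ".join(sentence))
```

For the first three moves of the example game, this prints the three "sentences" shown above.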

List of models in this repository:

moves_texts.model -- Type 1 model, all moves of all games in the collection (move = "word," game = "sentence").
lemmatized_moves_texts.model -- Type 1 model, all moves of all games (move = "word," game = "sentence"), but the square from which the piece moves is excluded from the move. This is a variant of "stemming."
white_moves.model -- Type 2 model, includes only white's moves along with positions (move and all pieces on the board = "words," position + move = "sentence").
black_moves.model -- Type 2 model, same as above but for black.
debut_moves.model -- Type 1 model, the "sentence" is the truncated segment of the beginning of the game up to the 12th move. It should reflect only the opening moves of the game.
debut_positions.model -- Type 2 model, only moves up to the 12th in the game along with their positions. Only white's moves are included; black's moves are excluded.
mittel_moves.model -- Type 1 model, only moves from the 13th to the 30th. Presumably the middlegame.
mittel_positions.model -- Type 2 model, moves along with positions from the 13th to the 30th. Only white.
endgame_moves.model -- Type 1 model, moves from the 31st to the end of the game.
endgame_positions.model -- Type 2 model for moves and positions, starting from the 31st in the game. Only white.
moves_pos.model -- Type 1 model, differs from moves_texts.model only in that each move carries a "part-of-speech tag." There are only two tags: _CAP and _N. Thus, each move can have two entries: Bc1h6_N and Bc1h6_CAP. Bc1h6_CAP means the bishop moves to square h6 with a capture, while Bc1h6_N is the same move without a capture. In this way, the linguistic idea of parts of speech is transferred to the text of chess games: there is one part of speech for a move with a capture and another for a move without a capture.
positions_pos.model -- Type 2 model with part-of-speech tags, so a move appears as ->Bc1h6_N or ->Bc1h6_CAP. Only white.
queens_moves.model -- Type 1 model, moves played while there is at least one queen on the board. I did not exclude cases when a queen reappears on the board in the endgame after a pawn promotion.
no_queens_moves.model -- Type 1 model, moves in the game from the moment both queens disappear from the board.
queens_positions.model -- Type 2 model, moves + their positions when there is at least one queen on the board. Only white.
no_queens_positions.model -- Type 2 model, moves + their positions when both queens have disappeared from the board. Only white.
positions_moves_pro.model -- A model combining types 1 and 2. This model includes the position of the move, the move itself, and three moves before and after it. Perhaps such a model will better reflect the player's strategy (aggressive or passive play, attacking or positional). Only white.
result_moves.model -- Type 1 model, includes only decisive games.
tied_moves.model -- Type 1 model, includes only drawn games.
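
Several of the variants listed above come down to simple transformations of a game's move list. The sketch below illustrates three of them: "stemming," capture tags, and the split into game phases. It is an illustration under stated assumptions, not the preprocessing actually used. The function names are hypothetical; the stemmed form (piece letter, hyphen, target square) follows the tokens queried in the Code section; keeping the letter's case in the stem is an assumption; and "the 12th move" is assumed to mean full moves (one white and one black move each).

```python
FILES = "abcdefgh"


def stem_move(move):
    """Drop the origin square, e.g. 'Bc1h6' -> 'B-h6'."""
    return f"{move[0]}-{move[3:5]}"


def tag_captures(moves):
    """Append _CAP or _N to every move of one game.

    Assumes the game starts from the standard initial position.
    Only square occupancy is tracked, which is enough to detect captures.
    """
    occupied = {file + rank for file in FILES for rank in "1278"}
    tagged = []
    for move in moves:
        piece, src, dst = move[0], move[1:3], move[3:5]
        # A pawn that changes file always captures (including en passant).
        pawn_takes = piece in "Pp" and src[0] != dst[0]
        is_capture = dst in occupied or pawn_takes
        tagged.append(move + ("_CAP" if is_capture else "_N"))
        if pawn_takes and dst not in occupied:
            occupied.discard(dst[0] + src[1])  # pawn taken en passant
        if piece in "Kk" and src[0] == "e" and dst[0] in "cg":
            # Castling: the rook moves as well.
            rook_src, rook_dst = ("h", "f") if dst[0] == "g" else ("a", "d")
            occupied.discard(rook_src + src[1])
            occupied.add(rook_dst + src[1])
        occupied.discard(src)
        occupied.add(dst)
    return tagged


def split_phases(moves, opening_end=12, middlegame_end=30):
    """Split a game into opening, middlegame and endgame by full-move number."""
    a, b = 2 * opening_end, 2 * middlegame_end
    return moves[:a], moves[a:b], moves[b:]


game = "Pe2e4 pc7c5 Ng1f3 pd7d6 Pd2d4 pc5d4 Nf3d4".split()
print([stem_move(m) for m in game])
print(tag_captures(game))
```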

Code

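import gensim

# Load the Type 1 model (move = "word", game = "sentence").
our_model = 'moves_texts.model'
model = gensim.models.Word2Vec.load(our_model)

# Moves to query; at most the first 20 are processed.
moves = ['Qh2h6']
for m in moves[:20]:
    print(m)
    # Ten nearest neighbours of the move with their similarity scores.
    for neighbour, score in model.wv.most_similar(positive=[m], topn=10):
        print(neighbour, score)
    print('\n')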

The output:

Qh2h6
Qh1h6 0.6717178225517273
Qh3h6 0.596304714679718
Qh4h6 0.5802868604660034
Qg5h6 0.5741832852363586
Qc1h6 0.5662646889686584
Qh2h8 0.5631784200668335
Qh7h6 0.557466447353363
Qh8h6 0.552560567855835
Rh2h6 0.5501664876937866
Qd2h6 0.5458180904388428

Stemmed model:

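import gensim

# Load the "stemmed" model, in which the origin square is dropped.
model = gensim.models.Word2Vec.load('lemmatized_moves_texts.model')

# Moves to query; at most the first 20 are processed.
moves = ['B-h7', 'N-f7']
for m in moves[:20]:
    print(m)
    # Ten nearest neighbours of the move with their similarity scores.
    for neighbour, score in model.wv.most_similar(positive=[m], topn=10):
        print(neighbour, score)
    print('\n')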

The output:

B-h7
P-h7 0.4696713387966156
B-b1 0.459762305021286
Q-h7 0.4278913736343384
B-g6 0.4004162847995758
B-c2 0.3931606113910675
N-h7 0.3862970173358917
R-h7 0.36262062191963196
Q-g6 0.36069101095199585
K-g6 0.3498668372631073
K-f7 0.33984100818634033


N-f7
N-h7 0.46309858560562134
N-b7 0.4223353862762451
N-g4 0.3992729187011719
N-e6 0.3939955234527588
R-f7 0.39156869053840637
N-e4 0.38958096504211426
N-e8 0.3769819736480713
Q-f7 0.37392428517341614
B-f7 0.37208080291748047
P-f7 0.3680253326892853
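
The second column in the outputs above is the score returned by most_similar, which in gensim is the cosine similarity between the vectors of the two "words." As a reminder of what that number measures, here is a small stand-alone sketch. The vectors are hypothetical three-dimensional examples, not taken from any of the models.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical vectors, for illustration only.
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction as a: similarity close to 1
c = [0.0, 0.0, 3.0]  # orthogonal to a: similarity 0

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```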

BibTeX entry and citation info

@misc{orekhov2024youshall,
      title={You shall know a piece by the company it keeps. Chess plays as a data for word2vec models}, 
      author={Boris Orekhov},
      year={2024},
      eprint={2407.19600},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}