Thank you for your interest. I did not look very closely at whether each model correctly identifies the tiles, but from the examples I reviewed manually, it doesn't seem to be a problem for models like o1 and o3.
Yeah, the result was a surprise to me as well. This problem has little coverage in the training corpus, yet it requires only fairly simple logic with good abstraction. The low accuracy suggests that the LLMs are probably still relying heavily on memorization rather than true logical reasoning.
I do think this work could easily be expanded into a paper. Unfortunately, I do not have enough time to do it myself. Happy to collaborate if you are interested.