MatTools: Benchmarking Large Language Models for Materials Science Tools
Abstract
MatTools evaluates large language models' proficiency in materials science by assessing code generation and execution based on physics-based computational tools.
Large language models (LLMs) are increasingly applied to materials science questions, including literature comprehension, property prediction, materials discovery and alloy design. At the same time, a wide range of physics-based computational approaches have been developed with which materials properties can be calculated. Here, we propose a benchmark to evaluate the proficiency of LLMs in answering materials science questions through the generation and safe execution of code based on such physics-based computational materials science packages. MatTools is built on two complementary components: a materials simulation tool question-answer (QA) benchmark and a real-world tool-usage benchmark. We designed an automated methodology to efficiently collect real-world materials science tool-use examples. The QA benchmark, derived from the pymatgen (Python Materials Genomics) codebase and documentation, comprises 69,225 QA pairs that assess the ability of an LLM to understand materials science tools. The real-world benchmark contains 49 tasks (138 subtasks) requiring the generation of functional Python code for materials property calculations. Our evaluation of diverse LLMs yields three key insights: (1) generalists outshine specialists; (2) AI knows AI; and (3) simpler is better. MatTools provides a standardized framework for assessing and improving LLM capabilities for materials science tool applications, facilitating the development of more effective AI systems for materials science and general scientific research.
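To illustrate the kind of pymatgen-based script the real-world benchmark asks an LLM to produce, the sketch below builds a simple crystal structure and queries a few basic properties. The structure and property queries are hypothetical examples chosen for illustration, not tasks taken from MatTools itself.

```python
from pymatgen.core import Lattice, Structure
from pymatgen.symmetry.analyzer import SpacegroupAnalyzer

# Hypothetical example: build a conventional rock-salt NaCl cell
# (illustrative only -- not an actual MatTools task).
lattice = Lattice.cubic(5.64)  # lattice parameter in angstroms
structure = Structure.from_spacegroup(
    "Fm-3m", lattice, ["Na", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]]
)

# Property queries of the kind a benchmark task might require.
analyzer = SpacegroupAnalyzer(structure)
print("Reduced formula:", structure.composition.reduced_formula)
print("Space group:", analyzer.get_space_group_symbol())
print("Density (g/cm^3):", structure.density)
```

A benchmark task of this kind can be checked by actually running the generated code, so functional correctness rather than textual similarity determines whether the LLM succeeds.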
Community
We are excited to share our latest work, "MatTools: Benchmarking Large Language Models for Materials Science Tools", which introduces two new benchmarks for comprehensively evaluating the ability of large language models (LLMs) to use materials science tools.
We believe this work can promote the development of intelligent materials computing!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Evaluating Multi-Hop Reasoning in Large Language Models: A Chemistry-Centric Case Study (2025)
- PHYSICS: Benchmarking Foundation Models on University-Level Physics Problem Solving (2025)
- ModiGen: A Large Language Model-Based Workflow for Multi-Task Modelica Code Generation (2025)
- MMCR: Benchmarking Cross-Source Reasoning in Scientific Papers (2025)
- 34 Examples of LLM Applications in Materials Science and Chemistry: Towards Automation, Assistants, Agents, and Accelerated Scientific Discovery (2025)
- SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers (2025)
- Benchmarking Retrieval-Augmented Generation for Chemistry (2025)