arxiv:2411.03562

Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level

Published on Nov 5
· Submitted by hba123 on Nov 7
#2 Paper of the day

Abstract

We introduce Agent K v1.0, an end-to-end autonomous data science agent designed to automate, optimise, and generalise across diverse data science tasks. Fully automated, Agent K v1.0 manages the entire data science life cycle by learning from experience. It leverages a highly flexible structured reasoning framework that lets it dynamically process memory in a nested structure, effectively learning from stored, accumulated experience to handle complex reasoning tasks. It optimises long- and short-term memory by selectively storing and retrieving key information, guiding future decisions based on environmental rewards. This iterative approach allows it to refine decisions without fine-tuning or backpropagation, achieving continuous improvement through experiential learning. We evaluate our agent's capabilities using Kaggle competitions as a case study. Following a fully automated protocol, Agent K v1.0 systematically addresses complex and multimodal data science tasks, employing Bayesian optimisation for hyperparameter tuning and feature engineering. Our new evaluation framework rigorously assesses Agent K v1.0's end-to-end capabilities to generate and send submissions starting from a Kaggle competition URL. Results demonstrate that Agent K v1.0 achieves a 92.5% success rate across tasks, spanning tabular, computer vision, NLP, and multimodal domains. When benchmarking against 5,856 human Kaggle competitors by calculating Elo-MMR scores for each, Agent K v1.0 ranks in the top 38%, demonstrating an overall skill level comparable to Expert-level users. Notably, its Elo-MMR score falls between the first and third quartiles of scores achieved by human Grandmasters. Furthermore, our results indicate that Agent K v1.0 has reached a performance level equivalent to Kaggle Grandmaster, with a record of 6 gold, 3 silver, and 7 bronze medals, as defined by Kaggle's progression system.

Community

Paper submitter

Agent K is the first end-to-end agent (i.e., autonomous from Kaggle URL to submissions that win competitions) to achieve an equivalent of Kaggle grandmaster level. Our agent codes the whole data science pipeline from a natural language description of the competition and raw data!

It does at least the following:

  1. Cleans and pre-processes the data automatically;
  2. Does feature engineering if needed automatically;
  3. Writes machine learning models that it thinks can solve the tasks automatically;
  4. Trains the models and optimises their hyperparameters with HEBO automatically;
  5. Writes Kaggle submission files and decides to upload them to Kaggle to get the score automatically.

It uses this score to improve its pipeline and submission automatically.
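As a rough illustration of the score-driven loop described above, here is a minimal sketch. It is not Agent K's code: `toy_score` is a hypothetical stand-in for a Kaggle leaderboard score, the two hyperparameters (`lr`, `depth`) are made up, and a simple random local search replaces HEBO so that the example runs with the standard library only.

```python
import random


def toy_score(params):
    """Hypothetical stand-in for a leaderboard score (higher is better)."""
    return -((params["lr"] - 0.1) ** 2) - ((params["depth"] - 6) ** 2) / 100.0


def propose(rng, best=None):
    """Propose hyperparameters: random at first, then a perturbation of the best so far.

    This is a plain random local search, used here in place of HEBO.
    """
    if best is None:
        return {"lr": rng.uniform(0.001, 0.5), "depth": rng.randint(2, 12)}
    lr = min(0.5, max(0.001, best["lr"] + rng.gauss(0, 0.05)))
    depth = min(12, max(2, best["depth"] + rng.choice([-1, 0, 1])))
    return {"lr": lr, "depth": depth}


def run_loop(n_rounds=30, seed=0):
    """Repeat: propose a configuration, score it, store the result, reuse the best."""
    rng = random.Random(seed)
    memory = []  # history of (params, score) pairs seen so far
    for _ in range(n_rounds):
        best = max(memory, key=lambda m: m[1])[0] if memory else None
        params = propose(rng, best)
        score = toy_score(params)  # in the real setting: submit and read the score
        memory.append((params, score))
    return memory


if __name__ == "__main__":
    history = run_loop()
    best_params, best_score = max(history, key=lambda m: m[1])
    print(best_params, best_score)
```

The only point of the sketch is the feedback structure: every score is stored, and the next proposal depends on what has been observed so far, with no gradient updates to any model.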
Regarding results, we win six gold, three silver, and seven bronze medals. We also score in the top 38% against Kagglers.

Since we win medals in all competition types, we make a fair comparison to human participants by awarding them extra medals if needed. Here, we also see that our Agent K is more likely to earn more medals than humans. The difference is particularly significant for bronze medals, where Agent K outperforms in 42% of match-ups and underperforms in only 23%. Similarly, for gold medals, the agent's winning rate of 14% is over twice its losing rate of 6%.
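For readers who want to see how such match-up percentages can be computed, here is a small sketch under one plausible reading: each match-up compares the agent's medal count of a given type with one human's count, and we report the share of match-ups the agent wins, loses, or ties. The function name and the human counts below are hypothetical and are not data from the paper.

```python
def matchup_rates(agent_count, human_counts):
    """Return (win, loss, tie) rates of the agent's medal count against each human's."""
    if not human_counts:
        raise ValueError("human_counts must not be empty")
    n = len(human_counts)
    wins = sum(agent_count > h for h in human_counts)
    losses = sum(agent_count < h for h in human_counts)
    ties = n - wins - losses
    return wins / n, losses / n, ties / n


if __name__ == "__main__":
    # Hypothetical bronze-medal counts for eight human competitors.
    humans = [3, 9, 7, 5, 10, 7, 2, 7]
    win, loss, tie = matchup_rates(7, humans)
    print(f"win={win:.1%} loss={loss:.1%} tie={tie:.1%}")
```

Under this reading the win and loss rates need not sum to 100%, because tied match-ups count for neither side, which is consistent with the 42% and 23% figures quoted above.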

@hba123 is the code for this paper available?

We are actively working on that :D
