Can AI Freelancers Compete? Benchmarking Earnings, Reliability, and Task Success at Scale
Abstract
A new benchmark evaluates large language models on freelance programming and data analysis tasks, providing insights into their performance and feasibility as autonomous agents.
This study explores Large Language Models (LLMs) as autonomous agents for real-world tasks, including freelance software development. We present a new benchmark that evaluates LLMs on freelance programming and data analysis tasks derived from economic data. The benchmark is built from synthetic tasks created from a Kaggle Freelancer dataset of job postings, with all job prices standardized to USD (median fixed-project price of roughly $250 and a mean of $306). Each task comes with structured input-output test cases and an estimated price tag, enabling automated correctness checking and a monetary valuation of performance. The approach is inspired by OpenAI's recent SWE-Lancer benchmark (1,400 real Upwork tasks worth $1M in total), but our framework simplifies evaluation by using programmatically testable tasks and predicted price values, making it highly scalable and repeatable. On this benchmark, we evaluate four modern LLMs: Claude 3.5 Haiku, GPT-4o-mini, Qwen 2.5, and Mistral. We report each model's accuracy (task success rate and test-case pass rate) and its total "freelance earnings" (the summed prices of the tasks it solves). Claude 3.5 Haiku performs best, earning approximately $1.52M, followed closely by GPT-4o-mini at $1.49M, then Qwen 2.5 ($1.33M) and Mistral ($0.70M). We analyze the distribution of errors per task and observe that the strongest models solve the most tasks and rarely fail completely on any project. We discuss the implications of these results for the feasibility of AI as a freelance developer, the advantages and limitations of our automated benchmark approach, and the gap between performance on structured tasks and the true complexity of real-world freelance jobs.
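
As a rough illustration of the scoring scheme described in the abstract (not the authors' actual evaluation harness), the sketch below computes the three reported metrics: task success rate, test-case pass rate, and total "freelance earnings" as the sum of prices of fully solved tasks. All class, field, and function names here are hypothetical assumptions for the example.

```python
# Minimal sketch of the benchmark's scoring, assuming each task carries
# structured test cases and an estimated USD price, and a task counts as
# "solved" only if every one of its test cases passes.
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    price_usd: float   # estimated price tag derived from the job posting
    tests_passed: int  # test cases the model's solution passed
    tests_total: int   # test cases defined for the task


def score(results: list[TaskResult]) -> dict[str, float]:
    # A task is solved only when all of its test cases pass.
    solved = [r for r in results if r.tests_total > 0 and r.tests_passed == r.tests_total]
    total_tests = sum(r.tests_total for r in results)
    passed_tests = sum(r.tests_passed for r in results)
    return {
        "task_success_rate": len(solved) / len(results) if results else 0.0,
        "test_pass_rate": passed_tests / total_tests if total_tests else 0.0,
        # "Freelance earnings": summed prices of solved tasks only.
        "earnings_usd": sum(r.price_usd for r in solved),
    }


if __name__ == "__main__":
    demo = [
        TaskResult("job-001", 250.0, 5, 5),  # fully solved -> earns $250
        TaskResult("job-002", 306.0, 3, 5),  # partial test credit, no earnings
    ]
    print(score(demo))
```

Under this scheme a model is rewarded both for breadth (how many tasks it solves) and for the market value of the tasks it solves, which is what allows the per-model earnings totals reported above to diverge from raw accuracy.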
Community
A freelance software benchmark with an economic scorecard for LLMs
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- EduBot -- Can LLMs Solve Personalized Learning and Programming Assignments? (2025)
- FindTheFlaws: Annotated Errors for Detecting Flawed Reasoning and Scalable Oversight Research (2025)
- Improving Assembly Code Performance with Large Language Models via Reinforcement Learning (2025)
- ZeroSumEval: Scaling LLM Evaluation with Inter-Model Competition (2025)
- PaperBench: Evaluating AI's Ability to Replicate AI Research (2025)
- Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory (2025)
- PROMPTEVALS: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines (2025)