new

Get trending papers in your email inbox!

Subscribe

byAK and the research community

Apr 24

StarCoder 2 and The Stack v2: The Next Generation

The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4x larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2- 15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder- 33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.

Does your data spark joy? Performance gains from domain upsampling at the end of training

Pretraining datasets for large language models (LLMs) have grown to trillions of tokens composed of large amounts of CommonCrawl (CC) web scrape along with smaller, domain-specific datasets. It is expensive to understand the impact of these domain-specific datasets on model capabilities as training at large FLOP scales is required to reveal significant changes to difficult and emergent benchmarks. Given the increasing cost of experimenting with pretraining data, how does one determine the optimal balance between the diversity in general web scrapes and the information density of domain specific data? In this work, we show how to leverage the smaller domain specific datasets by upsampling them relative to CC at the end of training to drive performance improvements on difficult benchmarks. This simple technique allows us to improve up to 6.90 pp on MMLU, 8.26 pp on GSM8K, and 6.17 pp on HumanEval relative to the base data mix for a 7B model trained for 1 trillion (T) tokens, thus rivaling Llama-2 (7B)x2014a model trained for twice as long. We experiment with ablating the duration of domain upsampling from 5% to 30% of training and find that 10% to 20% percent is optimal for navigating the tradeoff between general language modeling capabilities and targeted benchmarks. We also use domain upsampling to characterize at scale the utility of individual datasets for improving various benchmarks by removing them during this final phase of training. This tool opens up the ability to experiment with the impact of different pretraining datasets at scale, but at an order of magnitude lower cost compared to full pretraining runs.

Estimating global article processing charges paid to six publishers for open access between 2019 and 2023

This study presents estimates of the global expenditure on article processing charges (APCs) paid to six publishers for open access between 2019 and 2023. APCs are fees charged for publishing in some fully open access journals (gold) and in subscription journals to make individual articles open access (hybrid). There is currently no way to systematically track institutional, national or global expenses for open access publishing due to a lack of transparency in APC prices, what articles they are paid for, or who pays them. We therefore curated and used an open dataset of annual APC list prices from Elsevier, Frontiers, MDPI, PLOS, Springer Nature, and Wiley in combination with the number of open access articles from these publishers indexed by OpenAlex to estimate that, globally, a total of \8.349 billion (8.968 billion in 2023 US dollars) were spent on APCs between 2019 and 2023. We estimate that in 2023 MDPI (\681.6 million), Elsevier (582.8 million) and Springer Nature (\546.6) generated the most revenue with APCs. After adjusting for inflation, we also show that annual spending almost tripled from 910.3 million in 2019 to \$2.538 billion in 2023, that hybrid exceed gold fees, and that the median APCs paid are higher than the median listed fees for both gold and hybrid. Our approach addresses major limitations in previous efforts to estimate APCs paid and offers much needed insight into an otherwise opaque aspect of the business of scholarly publishing. We call upon publishers to be more transparent about OA fees.