Any plan to open source the dataset?

#8
by Gmc2 - opened

Thanks for the great work!
May I know if there are plans to open-source the datasets for pre-training or SFT? Thanks.

InternLM org

Thank you for your interest in our work!

The pre-training corpus for POLAR is extremely large (approximately 3.6T tokens), and it was derived from the InternLM pre-training corpus. Unfortunately, we currently have no plans to publicly release POLAR’s pre-training corpus.

However, I would like to emphasize that POLAR's pre-training data is relatively easy to reconstruct. You can take open-source corpora such as Common Crawl (CC), use publicly available LLMs such as Qwen to perform large-scale inference sampling following our methodology, and form positive and negative sample pairs from the results.
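
If it helps, here is a minimal sketch of that sampling step, assuming a POLAR-style pairing rule (two responses sampled from the same model form a positive pair; responses from different models for the same prompt form a negative pair). The Qwen checkpoints, prompts, and hyperparameters below are illustrative assumptions, not the official pipeline:

```python
# Sketch: sample responses from multiple open LLMs to build positive/negative
# pairs. Model choices and the pairing rule are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Two distinct checkpoints act as two distinct "policies" (assumption).
MODELS = ["Qwen/Qwen2.5-7B-Instruct", "Qwen/Qwen2.5-1.5B-Instruct"]

def sample_responses(model_name, prompts, n_samples=2, max_new_tokens=256):
    """Draw n_samples stochastic completions per prompt from one model."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16, device_map="auto"
    )
    out = {}
    for prompt in prompts:
        inputs = tok(prompt, return_tensors="pt").to(model.device)
        gen = model.generate(
            **inputs,
            do_sample=True,
            temperature=1.0,
            num_return_sequences=n_samples,
            max_new_tokens=max_new_tokens,
        )
        prompt_len = inputs["input_ids"].shape[1]
        # Strip the prompt tokens so only the completion remains.
        out[prompt] = [
            tok.decode(g[prompt_len:], skip_special_tokens=True) for g in gen
        ]
    return out

# In practice, prompts would be drawn at CC scale; one toy prompt here.
prompts = ["Explain why the sky is blue."]
per_model = {m: sample_responses(m, prompts) for m in MODELS}

for p in prompts:
    # Same policy, two samples -> positive pair.
    pos_pair = per_model[MODELS[0]][p][:2]
    # Different policies, same prompt -> negative pair.
    neg_pair = (per_model[MODELS[0]][p][0], per_model[MODELS[1]][p][0])
```

At the ~3.6T-token scale mentioned above you would of course run this through a batched inference engine rather than a per-prompt loop, but the pairing logic stays the same.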

Thanks for the info.

RowitZou changed discussion status to closed
