Any plan to open source the dataset?
#8 opened by Gmc2
Thanks for the great work!
May I know if there is a plan to open-source the datasets used for pre-training or SFT? Thanks.
Thank you for your interest in our work!
The pre-training corpus for POLAR is extremely large (approximately 3.6T tokens), and it was derived from the InternLM pre-training corpus. Unfortunately, we currently have no plans to publicly release POLAR’s pre-training corpus.
However, I would like to emphasize that POLAR's pre-training data is relatively easy to reproduce. You can take open-source corpora such as Common Crawl (CC) and use publicly available LLMs such as Qwen to perform large-scale inference sampling following our methodology, forming positive and negative sample pairs.
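For anyone curious what that sampling step could look like in practice, here is a minimal sketch, not the official POLAR pipeline. It assumes prompts live in a hypothetical local file `cc_prompts.txt` (standing in for a Common Crawl slice), uses illustrative model choices for the two "policies", and simplifies the pairing scheme: two completions from the same policy form a positive pair, a completion from a different policy serves as the negative.

```python
# Minimal sketch of sampling positive/negative pairs from different policies.
# NOT the official POLAR data pipeline; model names and file paths are
# illustrative assumptions only.
from transformers import AutoModelForCausalLM, AutoTokenizer

def sample_completions(model, tokenizer, prompt, n=2, max_new_tokens=128):
    """Draw n stochastic completions of `prompt` from one policy (model)."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        do_sample=True,            # stochastic decoding -> diverse samples
        temperature=1.0,
        num_return_sequences=n,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Strip the prompt tokens, keep only the generated continuation.
    gen = outputs[:, inputs["input_ids"].shape[1]:]
    return [tokenizer.decode(g, skip_special_tokens=True) for g in gen]

# Two distinct public policies (illustrative choices).
policy_a_name = "Qwen/Qwen2.5-7B"
policy_b_name = "internlm/internlm2_5-7b"

tok_a = AutoTokenizer.from_pretrained(policy_a_name)
pol_a = AutoModelForCausalLM.from_pretrained(policy_a_name, device_map="auto")
tok_b = AutoTokenizer.from_pretrained(policy_b_name)
pol_b = AutoModelForCausalLM.from_pretrained(policy_b_name, device_map="auto")

dataset = []
with open("cc_prompts.txt") as f:  # hypothetical prompt file
    for prompt in (line.strip() for line in f if line.strip()):
        ref, pos = sample_completions(pol_a, tok_a, prompt, n=2)
        (neg,) = sample_completions(pol_b, tok_b, prompt, n=1)
        dataset.append({"prompt": prompt, "reference": ref,
                        "positive": pos, "negative": neg})
```

At real scale you would swap the plain `generate` loop for a batched inference engine, but the pairing logic stays the same.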
Thanks for the info.
RowitZou changed discussion status to closed