Finetuning StarCoder with languages that are not present in The Stack
#98, opened by lazarantal
Hi,
I would like to finetune StarCoder for languages that are not present in The Stack dataset. How can I prepare a custom dataset and use it in the finetuning process? The different languages of the-stack dataset are stored as parquet files. How can I generate such files for additional languages?
Thanks,
Toni
I’d encourage you to take a look at StarCoder2. It was trained on over 600 languages. In any case, the fine-tuning process should be the same whether or not the language is included in the pretraining dataset.
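On the parquet question, here is a minimal sketch of one way to turn a folder of source files into a parquet file. The column names (`content`, `lang`, `path`) and the helper names `collect_source_files` and `write_parquet` are assumptions for illustration, chosen to loosely resemble The Stack; check which text column your finetuning script actually expects. Writing parquet with pandas also requires `pyarrow` or `fastparquet` to be installed.

```python
from pathlib import Path


def collect_source_files(root, extensions, lang):
    """Read every file under `root` whose suffix is in `extensions`.

    Returns a list of dicts, one per non-empty file. The keys are
    hypothetical column names, not a schema required by any script.
    """
    root = Path(root)
    records = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in extensions:
            # errors="replace" keeps files that contain stray non-UTF-8 bytes
            text = path.read_text(encoding="utf-8", errors="replace")
            if text.strip():  # skip empty or whitespace-only files
                records.append(
                    {
                        "content": text,
                        "lang": lang,
                        "path": path.relative_to(root).as_posix(),
                    }
                )
    return records


def write_parquet(records, out_path):
    """Write the collected records to a single parquet file."""
    # Imported here so the collection step works without pandas installed.
    import pandas as pd

    pd.DataFrame(records).to_parquet(out_path, index=False)
```

For example, `write_parquet(collect_source_files("my_repo", {".abc"}, "MyLang"), "train.parquet")` would produce a file that can then be loaded with `datasets.load_dataset("parquet", data_files="train.parquet")`. The directory, extension, and language name here are placeholders.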