Weak supervision on mc4

#37
by yuri-no - opened

Hello,

In the model card you say that the model was pre-trained on the mc4 dataset using (title, page content) pairs.
Since each example in the dataset has the attributes "url" and "text", did you use the URL as the title, or did you extract the title from each web page?

Thanks

Based on our manual check, the first line of the text field from the mc4 dataset (https://huggingface.co/datasets/mc4) is the web page title.
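
For anyone who wants to reproduce this kind of pairing, here is a minimal sketch of the split described above. It assumes the first line of `text` really is the title, which the manual check suggests but does not guarantee for every page. The helper name `split_title_and_body` and the sample record are made up for illustration and are not the authors' preprocessing code.

```python
from typing import Tuple


def split_title_and_body(text: str) -> Tuple[str, str]:
    """Split an mc4-style `text` field into (title, body).

    Assumption: the first line of `text` is the page title and
    everything after the first newline is the page content.
    """
    # partition() splits on the first newline only; if there is no
    # newline, the whole text becomes the title and the body is empty.
    title, _, body = text.strip().partition("\n")
    return title.strip(), body.strip()


# Hypothetical record with the same two fields mentioned in the question.
example = {
    "url": "https://example.com/post",
    "text": "How to bake bread\nStart with flour, water, and salt.\nLet the dough rest.",
}

title, body = split_title_and_body(example["text"])
print(title)
print(body)
```

Records whose text has no newline end up with an empty body, so it may be worth filtering those out before building (title, page content) pairs.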

yuri-no changed discussion status to closed
