Weak supervision on mc4
#37, opened by yuri-no
Hello,
In the model card you say that the model was pre-trained on the mc4 dataset using (title, page content) pairs.
Since the dataset's examples have the fields "url" and "text", did you use the URL as the title, or did you extract the title from each web page?
Thanks
Based on our manual check, the first line of the "text" field from the mc4 dataset (https://huggingface.co/datasets/mc4) is the web page title.
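For anyone who wants to reproduce this kind of pairing, here is a minimal sketch of the split. It assumes, as observed above, that the first line of "text" is the title and that the rest is the page content. The function name `split_title_and_body` and the sample record are hypothetical, and this is not necessarily the exact preprocessing used for the model.

```python
from typing import Optional, Tuple


def split_title_and_body(text: str) -> Optional[Tuple[str, str]]:
    """Split an mc4-style "text" field into a (title, body) pair.

    Assumption: the first line is the page title and the remaining
    lines are the page content. Returns None when either part is empty.
    """
    lines = text.strip().split("\n")
    title = lines[0].strip()
    body = "\n".join(lines[1:]).strip()
    if not title or not body:
        # No usable pair: the record has a single line or is empty.
        return None
    return title, body


# Hypothetical record shaped like an mc4 example ("url" and "text" fields).
example = {
    "url": "https://example.com/post",
    "text": "How to bake bread\nStart by mixing flour and water.\nLet the dough rest.",
}
pair = split_title_and_body(example["text"])
```

Records with only one line are skipped here, since they give no (title, page content) pair. Whether the first line really is the title should still be spot-checked per language.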
yuri-no changed discussion status to closed