Documents in the FineWeb dataset may exceed the max context length of this classifier
#6 opened by ZefanW
How are these over-length documents dealt with in fineweb-edu curation?
Samples are truncated to the model's context length. You can find the inference code here: https://github.com/huggingface/cosmopedia/blob/main/classification/run_edu_bert.py
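To make "truncated" concrete, here is a minimal toy sketch of the idea, not the code from the linked script. A whitespace split stands in for the real tokenizer, and `MAX_LENGTH = 512` and `truncate_tokens` are hypothetical names and values chosen for illustration. With a Hugging Face tokenizer, the same effect is typically obtained by passing `truncation=True` when tokenizing.

```python
# Toy illustration only: a whitespace split stands in for the real tokenizer.
MAX_LENGTH = 512  # assumed context length, for illustration only


def truncate_tokens(text: str, max_length: int = MAX_LENGTH) -> list[str]:
    """Keep the first max_length tokens; everything after them is dropped."""
    tokens = text.split()
    return tokens[:max_length]


# A document far longer than the assumed context length.
long_doc = " ".join(f"w{i}" for i in range(2000))
kept = truncate_tokens(long_doc)
print(len(kept))  # 512
```

In other words, under this scheme only the beginning of a long document contributes to its score; the remainder is never seen by the classifier.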