Use the DistilBertTokenizer for this DistilBERT-based model
Hello!
Pull Request overview
- Use the DistilBertTokenizer for this DistilBERT-based model
Details
I noticed that some of your DistilBERT-based models (not all, e.g. not https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill) are using the BERT tokenizer rather than the DistilBERT-specific one. This is tricky, as it means that return_token_type_ids=False is always required when tokenizing. This can be avoided simply by using the appropriate, matching tokenizer.
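For illustration, here is a minimal sketch of the difference, using distilbert-base-uncased as a stand-in for the affected models:

```python
from transformers import BertTokenizer, DistilBertTokenizer

text = "What is the weather like today?"

# The BERT tokenizer returns token_type_ids by default, which DistilBERT
# models do not accept, so callers have to pass return_token_type_ids=False.
bert_tok = BertTokenizer.from_pretrained("distilbert-base-uncased")
print(bert_tok(text).keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

# The DistilBERT tokenizer omits token_type_ids out of the box.
distil_tok = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
print(distil_tok(text).keys())
# dict_keys(['input_ids', 'attention_mask'])
```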
Feel free to run this code using revision="refs/pr/6" in AutoTokenizer, AutoModelForMaskedLM, etc. to test this before merging.
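Something along these lines should work (the repo id below is a placeholder for one of the affected models):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Replace with the id of one of the affected models; "refs/pr/6" points at
# this pull request's revision on the Hub.
repo_id = "opensearch-project/<affected-model>"

tokenizer = AutoTokenizer.from_pretrained(repo_id, revision="refs/pr/6")
model = AutoModelForMaskedLM.from_pretrained(repo_id, revision="refs/pr/6")

# With the DistilBERT tokenizer from this PR, return_token_type_ids=False
# is no longer needed; the encoding simply contains no token_type_ids.
inputs = tokenizer("What is the weather like today?", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)
```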
- Tom Aarsen
Thanks @tomaarsen for catching this!
Before merging the code, could you help me understand the difference between the BERT tokenizer and the DistilBERT tokenizer? Since we're using the same vocabulary, is the only difference that we don't need to set return_token_type_ids=False, as that's the default behavior for DistilBERT?
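For reference, this is the kind of check I have in mind (a sketch, again using distilbert-base-uncased as a stand-in):

```python
from transformers import BertTokenizer, DistilBertTokenizer

bert_tok = BertTokenizer.from_pretrained("distilbert-base-uncased")
distil_tok = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

text = "sparse retrieval with distilbert"
bert_enc = bert_tok(text, return_token_type_ids=False)
distil_enc = distil_tok(text)

# Same vocabulary, so the token ids match; the DistilBERT tokenizer just
# never produces token_type_ids in the first place.
print(bert_enc["input_ids"] == distil_enc["input_ids"])  # True
print("token_type_ids" in distil_enc)                    # False
```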