Use the DistilBertTokenizer for this DistilBERT-based model
#4
opened by tomaarsen (HF Staff)
Hello!
Preface
I'm a big fan of these sparse models! The inference-free queries in particular are very cool.
Pull Request overview
- Use the DistilBertTokenizer for this DistilBERT-based model
Details
I noticed that some of your DistilBERT-based models (though not all, e.g. not https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill) use the BERT tokenizer rather than the DistilBERT-specific one. This is tricky, as it means that `return_token_type_ids=False` is always required when tokenizing. This can be avoided simply by using the appropriate, matching tokenizer.
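For reference, a minimal sketch of the workaround that is currently required (the repository id below is a hypothetical placeholder, not this model's real id):

```python
# A minimal sketch of the mismatch. DistilBERT's forward() has no
# token_type_ids parameter, so the BERT tokenizer's default output
# breaks model(**features).
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "opensearch-project/<this-model>"  # hypothetical placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# With the BERT tokenizer, this output contains token_type_ids, and
# model(**features) raises a TypeError. The workaround is to suppress them:
features = tokenizer(
    "an example query", return_tensors="pt", return_token_type_ids=False
)
output = model(**features)
```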
Feel free to run this code using `revision="refs/pr/4"` in the `AutoTokenizer`, `AutoModelForMaskedLM`, etc. to test this before merging.
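For example, a small sketch of how such a test could look (again with a hypothetical placeholder repository id):

```python
# A sketch of testing this PR's tokenizer before merging.
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "opensearch-project/<this-model>"  # hypothetical placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id, revision="refs/pr/4")
model = AutoModelForMaskedLM.from_pretrained(model_id, revision="refs/pr/4")

# The DistilBertTokenizer emits no token_type_ids, so no workaround is needed.
features = tokenizer("an example query", return_tensors="pt")
output = model(**features)
```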
- Tom Aarsen
tomaarsen changed pull request status to open
zhichao-geng changed pull request status to merged