Use the DistilBertTokenizer for this DistilBERT-based model

#4
by tomaarsen - opened

Hello!

Preface

I'm a big fan of these sparse models! The inference-free queries are especially cool.

Pull Request overview

  • Use the DistilBertTokenizer for this DistilBERT-based model

Details

I noticed that some of your DistilBERT-based models (though not all, e.g. not https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill) are using the BERT tokenizer rather than the DistilBERT-specific one. This is tricky, as DistilBERT does not accept token_type_ids, so return_token_type_ids=False is always required when tokenizing. This can simply be avoided by using the appropriate, matching tokenizer.
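For instance, here is a minimal sketch of the current behaviour; the model id below is illustrative, not necessarily one of the affected repositories:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Illustrative model id; substitute any of the affected DistilBERT-based repos.
model_id = "opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# A BERT tokenizer emits token_type_ids, which DistilBERT's forward() does
# not accept, so this extra argument is currently always required:
features = tokenizer("a passage to encode", return_tensors="pt", return_token_type_ids=False)
output = model(**features)
```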

Feel free to test this before merging by passing revision="refs/pr/4" to AutoTokenizer, AutoModelForMaskedLM, etc., as in the sketch below.
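A minimal test sketch along those lines (again, the model id is illustrative):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Illustrative model id; load both artifacts from this PR's branch.
model_id = "opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill"

tokenizer = AutoTokenizer.from_pretrained(model_id, revision="refs/pr/4")
model = AutoModelForMaskedLM.from_pretrained(model_id, revision="refs/pr/4")

# With the matching DistilBERT tokenizer, no token_type_ids are produced,
# so the workaround is no longer needed:
features = tokenizer("a passage to encode", return_tensors="pt")
output = model(**features)
print(type(tokenizer).__name__)  # e.g. DistilBertTokenizerFast
```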

cc @arthurbresnu

  • Tom Aarsen
tomaarsen changed pull request status to open
zhichao-geng changed pull request status to merged