Use the DistilBertTokenizer for this DistilBERT-based model

#6
by tomaarsen (HF Staff) - opened

Hello!

Pull Request overview

  • Use the DistilBertTokenizer for this DistilBERT-based model

Details

I noticed that some of your DistilBERT-based models (not all, e.g. not https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill) are using the BERT tokenizer rather than the DistilBERT-specific one. This is tricky, as it means that return_token_type_ids=False is always required when tokenizing. This can be avoided simply by using the appropriate, matching tokenizer.
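
For illustration, a rough sketch of the current behavior (the model id below is a placeholder for this repository, not the actual id):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "opensearch-project/<this-model>"  # placeholder for this repository

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# With the BERT tokenizer config, the encoding includes token_type_ids,
# which DistilBertForMaskedLM.forward() does not accept:
features = tokenizer("What flowers bloom in spring?", return_tensors="pt")
# model(**features)  # TypeError: forward() got an unexpected keyword argument 'token_type_ids'

# Workaround that is currently required:
features = tokenizer(
    "What flowers bloom in spring?", return_tensors="pt", return_token_type_ids=False
)
output = model(**features)
```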

Feel free to run this code using revision="refs/pr/6" in the AutoTokenizer, AutoModelForMaskedLM, etc. to test this before merging.
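
A quick test could look roughly like this (sketch only; again, the model id is a placeholder for this repository):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "opensearch-project/<this-model>"  # placeholder for this repository

# Load the tokenizer from this PR's revision; the model itself is unchanged.
tokenizer = AutoTokenizer.from_pretrained(model_id, revision="refs/pr/6")
model = AutoModelForMaskedLM.from_pretrained(model_id)

# No return_token_type_ids=False workaround needed anymore:
features = tokenizer("What flowers bloom in spring?", return_tensors="pt")
output = model(**features)
```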

cc @arthurbresnu

  • Tom Aarsen
tomaarsen changed pull request status to open
zhichao-geng (opensearch-project org)

Thanks @tomaarsen for catching this!

Before merging the code, could you help me understand the difference between the BERT tokenizer and the DistilBERT tokenizer? Since we're using the same vocabulary, is the only difference that we no longer need to set return_token_type_ids=False, because that's the default behavior for DistilBERT?
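
(For reference, one quick way to see the difference, assuming a recent transformers version: the DistilBERT tokenizer simply does not list token_type_ids among its model inputs, so it never returns them by default, while the BERT tokenizer does.)

```python
from transformers import BertTokenizerFast, DistilBertTokenizerFast

print(BertTokenizerFast.model_input_names)
# ['input_ids', 'token_type_ids', 'attention_mask']
print(DistilBertTokenizerFast.model_input_names)
# ['input_ids', 'attention_mask']
```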

zhichao-geng changed pull request status to merged