Use the DistilBertTokenizer for this DistilBERT-based model
Hello!
Pull Request overview
- Use the DistilBertTokenizer for this DistilBERT-based model
Details
I noticed that some of your DistilBERT-based models (not all, e.g. not https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill) are using the BERT tokenizer rather than the DistilBERT-specific one. This is tricky, as it means that return_token_type_ids=False is always required when tokenizing. This can be avoided simply by using the appropriate, matching tokenizer.
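For illustration, here is a minimal sketch of the difference, using distilbert-base-uncased as a stand-in for the affected models:

```python
from transformers import BertTokenizer, DistilBertTokenizer

text = "What is the weather like today?"

# The BERT tokenizer returns token_type_ids by default, which DistilBERT
# models do not accept, so callers have to pass return_token_type_ids=False.
bert_tok = BertTokenizer.from_pretrained("distilbert-base-uncased")
print(bert_tok(text).keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

# The DistilBERT tokenizer omits token_type_ids out of the box.
distil_tok = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
print(distil_tok(text).keys())
# dict_keys(['input_ids', 'attention_mask'])
```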
Feel free to run this code using revision="refs/pr/6" in AutoTokenizer, AutoModelForMaskedLM, etc. to test this before merging.
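Something along these lines should work (the repo id below is a placeholder for one of the affected models):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Replace with the id of one of the affected models; "refs/pr/6" points at
# this pull request's revision on the Hub.
repo_id = "opensearch-project/<affected-model>"

tokenizer = AutoTokenizer.from_pretrained(repo_id, revision="refs/pr/6")
model = AutoModelForMaskedLM.from_pretrained(repo_id, revision="refs/pr/6")

# With the DistilBERT tokenizer from this PR, return_token_type_ids=False
# is no longer needed; the encoding simply contains no token_type_ids.
inputs = tokenizer("What is the weather like today?", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)
```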
- Tom Aarsen
Thanks @tomaarsen for catching this!
Before merging the code, could you help me understand the difference between the BERT tokenizer and the DistilBERT tokenizer? Since we're using the same vocabulary, is the only difference that we don't need to set return_token_type_ids=False, as that's the default behavior for DistilBERT?
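For reference, this is the kind of check I have in mind (a sketch, again using distilbert-base-uncased as a stand-in):

```python
from transformers import BertTokenizer, DistilBertTokenizer

bert_tok = BertTokenizer.from_pretrained("distilbert-base-uncased")
distil_tok = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

text = "sparse retrieval with distilbert"
bert_enc = bert_tok(text, return_token_type_ids=False)
distil_enc = distil_tok(text)

# Same vocabulary, so the token ids match; the DistilBERT tokenizer just
# never produces token_type_ids in the first place.
print(bert_enc["input_ids"] == distil_enc["input_ids"])  # True
print("token_type_ids" in distil_enc)                    # False
```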