Use the DistilBertTokenizer for this DistilBERT-based model

#4
by tomaarsen - opened

Hello!

Preface

I'm a big fan of these sparse models! The inference-free queries are especially cool.

Pull Request overview

  • Use the DistilBertTokenizer for this DistilBERT-based model

Details

I noticed that some of your DistilBERT-based models (though not all, e.g. not https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill) are using the BERT tokenizer rather than the DistilBERT-specific one. This is tricky, as DistilBERT does not accept token_type_ids, so return_token_type_ids=False is always required when tokenizing. This can simply be avoided by using the appropriate, matching tokenizer.
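For instance, here is a minimal sketch of the current behaviour; the model id below is illustrative, not necessarily one of the affected repositories:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Illustrative model id; substitute any of the affected DistilBERT-based repos.
model_id = "opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# A BERT tokenizer emits token_type_ids, which DistilBERT's forward() does
# not accept, so this extra argument is currently always required:
features = tokenizer("a passage to encode", return_tensors="pt", return_token_type_ids=False)
output = model(**features)
```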

Feel free to test this before merging by passing revision="refs/pr/4" to AutoTokenizer, AutoModelForMaskedLM, etc., as in the sketch below.
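A minimal test sketch along those lines (again, the model id is illustrative):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Illustrative model id; load both artifacts from this PR's branch.
model_id = "opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill"

tokenizer = AutoTokenizer.from_pretrained(model_id, revision="refs/pr/4")
model = AutoModelForMaskedLM.from_pretrained(model_id, revision="refs/pr/4")

# With the matching DistilBERT tokenizer, no token_type_ids are produced,
# so the workaround is no longer needed:
features = tokenizer("a passage to encode", return_tensors="pt")
output = model(**features)
print(type(tokenizer).__name__)  # e.g. DistilBertTokenizerFast
```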

cc @arthurbresnu

  • Tom Aarsen
tomaarsen changed pull request status to open
zhichao-geng changed pull request status to merged