Max Input Length Documentation
#1 opened by sondalex
Hi, the repository README mentions:
By default, input text longer than 128 word pieces is truncated.
However, the max_seq_length attribute from sentence_transformers returns 512:
from sentence_transformers import SentenceTransformer
model_st = SentenceTransformer('all-mpnet-base-v1')
model_st.max_seq_length
# 512
The same value is returned with the Hugging Face transformers approach:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v1')
tokenizer.model_max_length
# 512
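For completeness, the 512 reported above is the tokenizer's configured model_max_length; the hard architectural bound lives in the model config and can be inspected directly (a quick sketch on my side, using only standard transformers calls):

from transformers import AutoConfig

config = AutoConfig.from_pretrained('sentence-transformers/all-mpnet-base-v1')
config.max_position_embeddings
# the position-embedding size, i.e. the hard upper bound the architecture supports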
Shouldn't the README be updated from 128 to 512?
Output of pip freeze:
...
sentence-transformers==2.2.2
huggingface-hub==0.10.1
transformers==4.23.1
torch==1.12.1
...
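In case it's useful, here is a rough empirical check that should distinguish the two limits (my own sketch; it assumes each repeated word maps to roughly one word piece):

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-mpnet-base-v1')

# Two texts that share their first ~140 word pieces and diverge only after that.
# If inputs were truncated at 128 word pieces, the divergent tails would be cut
# off and the two embeddings would come out identical.
text_a = "word " * 200
text_b = "word " * 140 + "other " * 60

emb_a, emb_b = model.encode([text_a, text_b])
print(np.allclose(emb_a, emb_b))
# False would mean tokens beyond position 128 do affect the embedding.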
I have the same question! I'm looking to embed text up to the maximum sequence length of 512. Am I right in assuming it won't be truncated at 128, despite what the README says?
That's a great observation, thank you for posting this.
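For anyone who wants the limit to be explicit rather than relying on the default, sentence-transformers exposes max_seq_length as a settable attribute; a minimal sketch, assuming you want the full 512-token window reported above:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-mpnet-base-v1')
model.max_seq_length = 512  # enforce the desired truncation length at encode time
embedding = model.encode("a long document ...")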