Unexpected Leading sos/eos Tokens at Start of Generation

#105 opened by E1eMental

Issue Description

When testing Florence-2-large, I observed unexpected token generation behavior. Specifically:

  • Decoding starts by feeding decoder_start_token_id=2 (which is also the eos_token_id) to the decoder.
  • The model then generates three tokens with index 0 (the sos_token) before producing any meaningful tokens.
  • Notably, sos_token_id=0 and eos_token_id=2, yet generation starts with the eos_token_id rather than the sos_token_id (a minimal reproduction sketch follows this list).

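To make the observation concrete, here is a minimal sketch for inspecting the raw generated token IDs. It assumes local access to the public microsoft/Florence-2-large checkpoint, an example image file sample.jpg, and the <CAPTION> task prompt; these specifics are only illustrative, the point is the printed IDs.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "microsoft/Florence-2-large"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Token-id bookkeeping: 0 is <s> (what I call sos above), 2 is </s> (eos).
print("bos_token_id:", processor.tokenizer.bos_token_id)
print("eos_token_id:", processor.tokenizer.eos_token_id)

# Where decoder_start_token_id lives can differ between config versions,
# so check both the top-level config and the nested text config.
cfg = model.config
start_id = getattr(cfg, "decoder_start_token_id", None)
if start_id is None and hasattr(cfg, "text_config"):
    start_id = getattr(cfg.text_config, "decoder_start_token_id", None)
print("decoder_start_token_id:", start_id)

image = Image.open("sample.jpg")
inputs = processor(text="<CAPTION>", images=image, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=64,
    )

# The first few IDs show the decoder start token and the leading 0 tokens
# described above, before the meaningful output begins.
print(generated_ids[0][:8].tolist())
```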
This leads to two questions:

  1. Why does generation begin with the eos_token_id instead of the sos_token_id?
  2. Why are multiple leading tokens with index 0 (the sos_token_id) generated before the actual output?

Finetuning & Label Construction

While finetuning Florence-2 on a custom task (following the recipe at https://huggingface.co/blog/finetune-florence2), I discovered the following during debugging:

  • The target label sequences used during training start with a leading 0 (sos_token_id): [0, valuable_tokens, 2].
  • As a result, the model is forced to generate a "junk" 0 token at the start of each sequence, and the loss is computed on this token as well.
  • This might explain why the model always generates sos (0) tokens at the beginning during inference (see the sketch below).
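For reference, here is a minimal sketch of what I mean, based on how the blog's collate function tokenizes the target answers. The example answer string and the drop-the-leading-bos workaround at the end are mine, not part of the official recipe.

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-large", trust_remote_code=True
)
tok = processor.tokenizer

# Tokenize a target string the same way the blog's collate function does.
answers = ["a cat sitting on a mat"]  # illustrative target text
labels = tok(
    text=answers, return_tensors="pt", padding=True, return_token_type_ids=False
).input_ids
print(labels[0].tolist())  # starts with 0 (<s>) and ends with 2 (</s>)

# Possible workaround (not part of the official recipe): drop the leading <s>
# so the model is not trained to emit it as its first output token.
labels_no_bos = labels[:, 1:]
print(labels_no_bos[0].tolist())
```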

Questions & Request

  • Why is token 2 (eos_token_id) used as the start of the sequence for decoding, rather than 0 (sos_token_id)?
  • Is my assumption about the leading 0 token during training correct?
  • If so, could you retrain the model (or release a new checkpoint) without this issue, so that generation is not forced to start with a junk token?
