[Error?] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
#7 opened by flexai
When using an example from https://huggingface.co/distil-whisper/distil-large-v3#sequential-long-form, I receive the warning "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained."
Is this expected, or does it indicate an error in the setup on my end?
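For context, the loading step I use follows that section of the model card roughly like this (a minimal sketch, abridged from memory; exact arguments such as `max_new_tokens` are illustrative):

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# In my run, the warning is printed while the model/processor are loaded.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "distil-whisper/distil-large-v3",
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
model.to(device)
processor = AutoProcessor.from_pretrained("distil-whisper/distil-large-v3")

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)
```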
In addition to the loading example, I prepare the model locally during the Docker image build with the following function:
```python
def download_model():
    import os
    import transformers
    from huggingface_hub import snapshot_download

    # Ensure the cache folder exists (MODEL_CACHE_DIR is defined elsewhere in the module)
    os.makedirs(MODEL_CACHE_DIR, exist_ok=True)
    snapshot_download(
        repo_id="distil-whisper/distil-large-v3",
        allow_patterns=["model.safetensors", "*.json", "*.txt"],
        local_dir=MODEL_CACHE_DIR,
    )
    transformers.utils.move_cache()
```
Then, when loading, I pass MODEL_CACHE_DIR instead of the model ID string.
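Concretely, that loading step looks roughly like this (a minimal sketch; MODEL_CACHE_DIR simply stands in for the usual Hub model ID):

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

# Load from the locally downloaded snapshot instead of the Hub repo ID.
model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_CACHE_DIR)
processor = AutoProcessor.from_pretrained(MODEL_CACHE_DIR)
```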
Hey boss, I haven't run it since, so let's close this issue until further notice! Btw, thank you for the models; they're hugely valuable.
flexai changed discussion status to closed