Hacked-up version of the ai-hub-apps repo used to export this model:
python .\export.py --target-runtime onnx --device "Snapdragon X Elite CRD" --skip-profiling --skip-inferencing
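Once the export finishes, the resulting ONNX graphs can be sanity-checked with ONNX Runtime's QNN execution provider. This is a minimal sketch only, assuming an ONNX Runtime build with QNN support on Windows on Snapdragon; the file name below is a placeholder for whatever export.py actually produced.

import onnxruntime as ort

# Placeholder path -- substitute the encoder ONNX file produced by export.py.
encoder_path = "WhisperEncoderInf.onnx"

session = ort.InferenceSession(
    encoder_path,
    providers=["QNNExecutionProvider", "CPUExecutionProvider"],
    # backend_path points the QNN EP at the HTP (NPU) backend library.
    provider_options=[{"backend_path": "QnnHtp.dll"}, {}],
)

# Confirm the 128-mel input shape made it into the exported graph.
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)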
Patched: whisper/model.py
# The number of Mel features per audio context
# N_MELS = 80
# For Whisper V3 Turbo
N_MELS = 128
## Commented out for now, as we want to use this for Whisper V3 Turbo
# # Audio embedding length
# AUDIO_EMB_LEN = int(N_SAMPLES / N_MELS / 4)
# # Audio length per MEL feature
# MELS_AUDIO_LEN = AUDIO_EMB_LEN * 2
# Number of frames in the input mel spectrogram (e.g. 3000 for 30s audio at 160 hop_length).
# This corresponds to the 'n_frames' dimension of the mel spectrogram input to the Whisper AudioEncoder.
MELS_AUDIO_LEN = N_SAMPLES // HOP_LENGTH
# Length of the audio embedding from the encoder output (e.g. 1500).
# This corresponds to 'n_audio_ctx' in Whisper, which is MELS_AUDIO_LEN // 2
# due to the strided convolution in the encoder. This length is used for the
# cross-attention key/value cache from the encoder.
AUDIO_EMB_LEN = MELS_AUDIO_LEN // 2
WHISPER_VERSION = "large-v3-turbo"
# N_MELS_LARGE_V3_TURBO = 128
# DEFAULT_INPUT_SEQ_LEN = 3000
@CollectionModel.add_component(WhisperEncoderInf)
@CollectionModel.add_component(WhisperDecoderInf)
class WhisperV3Turbo(BaseWhisper):
    @classmethod
    def from_pretrained(cls):
        return super().from_pretrained(WHISPER_VERSION)
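As a quick sanity check on the constants in the patch above, using the standard Whisper framing parameters (16 kHz sample rate, hop length 160, 30-second chunks, so N_SAMPLES = 480000):

SAMPLE_RATE = 16000
HOP_LENGTH = 160
CHUNK_LENGTH = 30  # seconds of audio per mel spectrogram
N_SAMPLES = SAMPLE_RATE * CHUNK_LENGTH  # 480000

MELS_AUDIO_LEN = N_SAMPLES // HOP_LENGTH  # 3000 mel frames
AUDIO_EMB_LEN = MELS_AUDIO_LEN // 2       # 1500 encoder output positions

assert MELS_AUDIO_LEN == 3000
assert AUDIO_EMB_LEN == 1500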
You also need to patch this into the ai-hub library
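A hypothetical smoke test for the patched class is sketched below; the import path depends on where whisper/model.py lives in your patched ai-hub install, and the first run downloads the model weights.

# Assumption: the patched whisper/model.py is importable in your environment;
# adjust the import to match the package layout of your ai-hub checkout.
from whisper.model import WhisperV3Turbo

# Pulls the "large-v3-turbo" checkpoint via the base class, as in the patch above.
model = WhisperV3Turbo.from_pretrained()
print(model)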