stride_length_s is a tuple of the left and right stride lengths, in seconds.

If only one number is given, both sides get the same stride. By default, the stride on each side is 1/6th of chunk_length_s.

output = pipe("very_long_file.mp3", chunk_length_s=10, stride_length_s=(4, 2))
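To make the numbers in that call concrete, here is a rough sketch of how the chunk windows could be laid out. It is not the pipeline's actual code: `chunk_windows` is a hypothetical helper, and it assumes that consecutive chunks advance by chunk_length_s minus both strides and that the strided edges are dropped where chunks meet (the first chunk keeps its left edge, the last keeps its right edge).

```python
def chunk_windows(total_s, chunk_length_s, stride_length_s=None):
    """Hypothetical helper: list (start, end, keep_start, keep_end) per chunk, in seconds."""
    if stride_length_s is None:
        # Default described above: 1/6th of the chunk length on each side.
        stride_length_s = chunk_length_s / 6
    if isinstance(stride_length_s, (int, float)):
        # A single number means the same stride on both sides.
        stride_length_s = (stride_length_s, stride_length_s)
    left, right = stride_length_s

    # Assumption: each new chunk starts this far after the previous one.
    step = chunk_length_s - left - right
    if step <= 0:
        raise ValueError("chunk_length_s must be larger than the sum of both strides")

    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        is_first = start == 0.0
        is_last = end >= total_s
        # Drop the overlapping edges, except at the very start and end of the audio.
        keep_start = start if is_first else start + left
        keep_end = end if is_last else end - right
        windows.append((start, end, keep_start, keep_end))
        if is_last:
            break
        start += step
    return windows


for window in chunk_windows(20, chunk_length_s=10, stride_length_s=(4, 2)):
    print(window)
```

Under these assumptions, the kept regions of neighbouring chunks meet exactly, so every second of audio is covered once while each chunk still sees some context on both sides.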

disazoz

Apr 14

I will look into that, thanks.

neurlang

Owner Apr 14

Although I am not a Whisper expert, I tried something, and it looks like the approach from the second demo might work, but likely only for longer audio. In my test, one timestamp appeared (the start, with no end):

>>> prediction = pipe(sample.copy(), return_timestamps=True)
Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.
>>> print(prediction)
{'text': 'ˈmɪstɚ kwˈɪltɚ ˈɪz ðə əpˈɑsəl ˈʌv ðə ˈmɪtəl klˈæsɪz ˈænd wˈɪɹ glæd tə ˈwɛlkəm ˈhɪz gˈɑsbəl', 'chunks': [{'timestamp': (0.52, None), 'text': 'ˈmɪstɚ kwˈɪltɚ ˈɪz ðə əpˈɑsəl ˈʌv ðə ˈmɪtəl klˈæsɪz ˈænd wˈɪɹ glæd tə ˈwɛlkəm ˈhɪz gˈɑsbəl'}]}

More investigation is needed.
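Until that is sorted out, a possible workaround for the `None` end timestamp in the output above is to fill it in afterwards. This is only a sketch: `fill_missing_end` is a hypothetical helper, not part of transformers, and it assumes that a chunk without an end timestamp runs to the end of the clip and that the clip duration is known.

```python
def fill_missing_end(prediction, audio_duration_s):
    """Hypothetical helper: copy the chunks, replacing a None end timestamp."""
    fixed = []
    for chunk in prediction.get("chunks", []):
        start, end = chunk["timestamp"]
        if end is None:
            # Assumption: the open-ended chunk lasts until the end of the audio.
            end = audio_duration_s
        fixed.append({"timestamp": (start, end), "text": chunk["text"]})
    return fixed


# Shortened stand-in for the prediction printed above.
example = {
    "text": "ˈmɪstɚ kwˈɪltɚ",
    "chunks": [{"timestamp": (0.52, None), "text": "ˈmɪstɚ kwˈɪltɚ"}],
}
print(fill_missing_end(example, audio_duration_s=5.0))
```

If `sample` is a dict holding the raw samples and the sampling rate, the duration would be the number of samples divided by the sampling rate; that layout is an assumption about the data and is not shown in the thread.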