Is it possible to get word level or phoneme level timestamps?
#2, opened by disazoz
Hi.
Is it possible to get word level or phoneme level timestamps?
I don't know. It may or may not be possible; a Whisper expert would have to weigh in.
You could try tuning the chunking parameters, which might help. From the pipeline documentation:
stride_length_s is a tuple of the left and right stride lengths.
With only one number, both sides get the same stride. By default,
the stride length on one side is 1/6th of chunk_length_s.
output = pipe("very_long_file.mp3", chunk_length_s=10, stride_length_s=(4, 2))
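To get a feel for what those two numbers do, here is a minimal sketch of how a long file could be cut into overlapping windows. It is a simplified illustration of the idea, not the library's actual implementation; the function name `chunk_windows` and the 30-second duration are assumptions made up for this example.

```python
def chunk_windows(total_s, chunk_length_s, stride_length_s=None):
    """Return (start, end, left_stride, right_stride) tuples, all in seconds.

    Hypothetical helper: a simplified model of chunking with strides.
    """
    if stride_length_s is None:
        # Documented default: one side gets 1/6th of the chunk length.
        stride_length_s = chunk_length_s / 6
    if isinstance(stride_length_s, (int, float)):
        # A single number means the same stride on both sides.
        stride_length_s = (stride_length_s, stride_length_s)
    left, right = stride_length_s

    # Each new window advances by the part of the chunk that is not overlap.
    step = chunk_length_s - left - right
    if step <= 0:
        raise ValueError("chunk_length_s must be larger than the sum of both strides")

    windows = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_length_s, total_s)
        is_first = start == 0
        is_last = start + chunk_length_s >= total_s
        # The first window has no left context and the last has no right context.
        windows.append((start, end, 0 if is_first else left, 0 if is_last else right))
        if is_last:
            break
        start += step
    return windows


# Assumed 30 s file with the settings from the call above.
windows = chunk_windows(30, chunk_length_s=10, stride_length_s=(4, 2))
for start, end, left, right in windows:
    print(f"window {start:5.1f}-{end:5.1f} s, kept {start + left:5.1f}-{end - right:5.1f} s")
```

With chunk_length_s=10 and stride_length_s=(4, 2), each window only contributes 4 seconds of new audio, so larger strides mean more context per window but also more windows to run.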
I will look into that, thanks.
I am not a Whisper expert, but I tried something. In the second demo it looks like it might work, though probably only for longer audio. Only one timestamp appeared:
>>> prediction = pipe(sample.copy(), return_timestamps=True)
Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.
>>> print(prediction)
{'text': 'ˈmɪstɚ kwˈɪltɚ ˈɪz ðə əpˈɑsəl ˈʌv ðə ˈmɪtəl klˈæsɪz ˈænd wˈɪɹ glæd tə ˈwɛlkəm ˈhɪz gˈɑsbəl', 'chunks': [{'timestamp': (0.52, None), 'text': 'ˈmɪstɚ kwˈɪltɚ ˈɪz ðə əpˈɑsəl ˈʌv ðə ˈmɪtəl klˈæsɪz ˈænd wˈɪɹ glæd tə ˈwɛlkəm ˈhɪz gˈɑsbəl'}]}
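If the open-ended (0.52, None) timestamp gets in the way downstream, one possible workaround is to substitute the audio duration for the missing end. This is only a sketch: `fill_missing_end` is a hypothetical helper, the text is shortened from the output above, and the 6.0-second duration is an assumed value, not the real length of the sample.

```python
def fill_missing_end(chunks, audio_duration_s):
    """Return new chunk dicts where a None end timestamp becomes audio_duration_s.

    Hypothetical helper; the input list is left unmodified.
    """
    fixed = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        if end is None:
            # No closing timestamp was predicted, so fall back to the end of the audio.
            end = audio_duration_s
        fixed.append({**chunk, "timestamp": (start, end)})
    return fixed


# Same shape as the pipeline output above, with the text shortened for brevity.
prediction = {
    "text": "ˈmɪstɚ kwˈɪltɚ",
    "chunks": [{"timestamp": (0.52, None), "text": "ˈmɪstɚ kwˈɪltɚ"}],
}

AUDIO_DURATION_S = 6.0  # assumed value for illustration only

fixed_chunks = fill_missing_end(prediction["chunks"], AUDIO_DURATION_S)
print(fixed_chunks)
```

This only papers over the missing end of the last segment; it does not give word-level or phoneme-level timing.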
More investigation is needed.