Question Regarding Processing 48kHz Audio with mhubert-147

#1
by CJY - opened

Thank you for your excellent work on encapsulating the code for extracting discrete audio unit sequences with mhubert-147. I have been reviewing the sample code you provided, and I have a question regarding the required audio format for processing.

Is it necessary to use audio that is in 16kHz and int16 format for processing? My original audio is at 48kHz, and I’ve noticed that librosa reads audio as float32 with normalization by default, while the soundfile library does not support resampling. Moreover, I observed that when I resample the 48kHz audio to 16kHz using librosa and pydub separately, the resulting discrete audio unit sequences differ.

Could you please advise on the correct approach to handle 48kHz audio? Any recommendations or best practices for converting or resampling the audio to ensure consistency with the expected output would be greatly appreciated.

Thank you for your time and assistance.

balacoon org

> Is it necessary to use audio that is in 16kHz and int16 format for processing?

Yes, that's what HuBERT was trained with.

> I resample the 48kHz audio to 16kHz using librosa and pydub separately, the resulting discrete audio unit sequences differ.

That can happen. You can check which resampling algorithm each library uses and pick the one that suits you better. I usually resample with resampy.
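A minimal sketch of that route, assuming the 48kHz audio is already loaded as a mono float array in [-1, 1] (for example the way librosa returns it). The helper names `resample_to_16k` and `to_int16` are hypothetical and not part of the mhubert-147 wrapper; `resampy.resample(x, sr_orig, sr_new)` is the resampy call, and the scaling by 32767 is one common float-to-int16 convention rather than something this thread specifies.

```python
import numpy as np


def resample_to_16k(wave, sr_orig=48000, sr_new=16000):
    """Resample a float waveform with resampy (hypothetical helper)."""
    # Imported lazily so the int16 helper below works without resampy installed.
    import resampy

    return resampy.resample(wave, sr_orig, sr_new)


def to_int16(wave):
    """Convert a float waveform in [-1, 1] to int16 samples.

    Assumption: values are scaled by 32767 and out-of-range samples are clipped.
    """
    clipped = np.clip(wave, -1.0, 1.0)
    return np.round(clipped * 32767.0).astype(np.int16)


if __name__ == "__main__":
    # One second of a 440 Hz tone at 48kHz stands in for real audio here.
    t = np.arange(48000) / 48000.0
    wave_48k = 0.5 * np.sin(2 * np.pi * 440.0 * t).astype(np.float32)
    wave_16k = to_int16(resample_to_16k(wave_48k))
    print(wave_16k.dtype, wave_16k.shape)
```

Using the same resampler for every file keeps the extracted unit sequences comparable with each other, even if they differ from what another library would produce.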

Overall, I wouldn't be too concerned about the sequences differing. mHuBERT is trained without attention masks, so when you do batch processing and pad the audio sequences, you end up with quite different code sequences anyway.
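To make the padding point concrete, here is a rough sketch of what zero-padding a batch does to clip lengths. It assumes the HuBERT-base convolutional front end (kernel sizes 10, 3, 3, 3, 3, 2, 2 with strides 5, 2, 2, 2, 2, 2, 2), which this thread does not confirm for mhubert-147; `num_frames` and `pad_batch` are hypothetical helpers for illustration only.

```python
import numpy as np

# Assumption: (kernel, stride) pairs of the HuBERT-base feature extractor.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def num_frames(num_samples):
    """Number of output frames for a clip of `num_samples` samples."""
    n = num_samples
    for kernel, stride in CONV_LAYERS:
        n = (n - kernel) // stride + 1
    return n


def pad_batch(clips):
    """Right-pad 1-D clips with zeros to the length of the longest clip."""
    max_len = max(len(c) for c in clips)
    batch = np.zeros((len(clips), max_len), dtype=clips[0].dtype)
    for i, clip in enumerate(clips):
        batch[i, : len(clip)] = clip
    return batch


# Two dummy 16kHz clips of 1.0 s and 1.5 s.
clips = [np.ones(16000, dtype=np.int16), np.ones(24000, dtype=np.int16)]
batch = pad_batch(clips)

# Frames that cover real audio for each clip; the rest of a row is padding.
valid_frames = [num_frames(len(c)) for c in clips]
print(batch.shape, valid_frames)
```

Trimming each row's output to its `valid_frames` count only drops the frames that cover padding. Without attention masks, the codes for the real audio can still differ from an unpadded run, so processing clips one at a time is the simpler option when exact reproducibility matters.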
