Whisper-Podlodka-Turbo
Whisper-Podlodka-Turbo is a fine-tuned version of Whisper large-v3-turbo. The main goals of the fine-tuning are to improve the quality of speech recognition and speech translation for Russian and English, and to reduce the occurrence of hallucinations when processing non-speech audio signals.
Model Description
Whisper-Podlodka-Turbo is a fine-tuned version of Whisper-Large-V3-Turbo, optimized for high-quality Russian speech recognition with proper punctuation and capitalization, and with improved robustness to background noise.
Key Benefits
- 🎯 Improved Russian speech recognition quality compared to the base Whisper-Large-V3-Turbo model
- ✍️ Correct Russian punctuation and capitalization
- 🎧 Enhanced background noise resistance
- 🚫 Reduced number of hallucinations, especially in non-speech segments
Supported Tasks
- Automatic Speech Recognition (ASR):
  - 🇷🇺 Russian (primary focus)
  - 🇬🇧 English
- Speech Translation:
  - Russian ↔️ English
- Speech Language Detection (including non-speech detection)
Uses
Installation
Whisper-Podlodka-Turbo is supported in Hugging Face 🤗 Transformers. To run the model, first install the Transformers library. For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub, and 🤗 Accelerate to reduce the model loading time:
pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate
I also recommend using whisper-lid for initial spoken language detection, so this library is worth installing as well:
pip install --upgrade whisper-lid
Use Cases
Speech recognition
The model can be used with the `pipeline` class to transcribe audio in an arbitrary language:
import librosa # for loading sound from local file
from transformers import pipeline # for working with Whisper-Podlodka-Turbo
import wget # for downloading demo sound from its URL
from whisper_lid.whisper_lid import detect_language_in_speech # for spoken language detection
model_id = "bond005/whisper-podlodka-turbo" # the best Whisper model :-)
target_sampling_rate = 16_000 # Hz
asr = pipeline(model=model_id, device_map='auto', torch_dtype='auto')
# An example of speech recognition in Russian, spoken by a native speaker of this language
sound_ru_url = 'https://huggingface.co/bond005/whisper-podlodka-turbo/resolve/main/test_sound_ru.wav'
sound_ru_name = wget.download(sound_ru_url)
sound_ru = librosa.load(sound_ru_name, sr=target_sampling_rate, mono=True)[0]
print('Duration of sound with Russian speech = {0:.3f} seconds.'.format(
sound_ru.shape[0] / target_sampling_rate
))
detected_languages = detect_language_in_speech(
sound_ru,
asr.feature_extractor,
asr.tokenizer,
asr.model
)
print('Top-3 languages:')
lang_text_width = max([len(it[0]) for it in detected_languages])
for it in detected_languages[0:3]:
print(' {0:>{1}} {2:.4f}'.format(it[0], lang_text_width, it[1]))
recognition_result = asr(
sound_ru,
generate_kwargs={'task': 'transcribe', 'language': detected_languages[0][0]},
return_timestamps=False
)
print(recognition_result['text'] + '\n')
# An example of speech recognition in English, spoken with an accent by a non-native speaker of that language
sound_en_url = 'https://huggingface.co/bond005/whisper-podlodka-turbo/resolve/main/test_sound_en.wav'
sound_en_name = wget.download(sound_en_url)
sound_en = librosa.load(sound_en_name, sr=target_sampling_rate, mono=True)[0]
print('Duration of sound with English speech = {0:.3f} seconds.'.format(
sound_en.shape[0] / target_sampling_rate
))
detected_languages = detect_language_in_speech(
sound_en,
asr.feature_extractor,
asr.tokenizer,
asr.model
)
print('Top-3 languages:')
lang_text_width = max([len(it[0]) for it in detected_languages])
for it in detected_languages[0:3]:
print(' {0:>{1}} {2:.4f}'.format(it[0], lang_text_width, it[1]))
recognition_result = asr(
sound_en,
generate_kwargs={'task': 'transcribe', 'language': detected_languages[0][0]},
return_timestamps=False
)
print(recognition_result['text'] + '\n')
As a result, you will see text output like this:
Duration of sound with Russian speech = 29.947 seconds.
Top-3 languages:
russian 0.9568
english 0.0372
ukrainian 0.0013
Ну, виспер сам по себе. Что такое виспер? Виспер — это уже полноценное end-to-end нейросетевое решение с авторегрессионным декодером, то есть это не чистый энкодер, как Wave2Vec, это не просто текстовый сек-то-сек, энкодер-декодер, как T5, это полноценный алгоритм преобразования речи в текст, где энкодер учитывает, прежде всего, акустические фичи речи, ну и семантика тоже постепенно подмешивается, а декодер — это уже языковая модель, которая генерирует токен за токеном.
Duration of sound with English speech = 20.247 seconds.
Top-3 languages:
english 0.9526
russian 0.0311
polish 0.0006
Ensembling can help us to solve a well-known bias-variance trade-off. We can decrease variance on basis of large ensemble, large ensemble of different algorithms.
Speech recognition with timestamps
In addition to the usual recognition, the model can also provide timestamps for recognized speech fragments:
recognition_result = asr(
sound_ru,
generate_kwargs={'task': 'transcribe', 'language': 'russian'},
return_timestamps=True
)
print('Recognized chunks of Russian speech:')
for it in recognition_result['chunks']:
print(f' {it}')
recognition_result = asr(
sound_en,
generate_kwargs={'task': 'transcribe', 'language': 'english'},
return_timestamps=True
)
print('\nRecognized chunks of English speech:')
for it in recognition_result['chunks']:
print(f' {it}')
As a result, you will see text output like this:
Recognized chunks of Russian speech:
{'timestamp': (0.0, 4.8), 'text': 'Ну, виспер, сам по себе, что такое виспер. Виспер — это уже полноценное'}
{'timestamp': (4.8, 8.4), 'text': ' end-to-end нейросетевое решение с авторегрессионным декодером.'}
{'timestamp': (8.4, 10.88), 'text': ' То есть, это не чистый энкодер, как Wave2Vec.'}
{'timestamp': (10.88, 15.6), 'text': ' Это не просто текстовый сек-то-сек, энкодер-декодер, как T5.'}
{'timestamp': (15.6, 19.12), 'text': ' Это полноценный алгоритм преобразования речи в текст,'}
{'timestamp': (19.12, 23.54), 'text': ' где энкодер учитывает, прежде всего, акустические фичи речи,'}
{'timestamp': (23.54, 25.54), 'text': ' ну и семантика тоже постепенно подмешивается,'}
{'timestamp': (25.54, 29.94), 'text': ' а декодер — это уже языковая модель, которая генерирует токен за токеном.'}
Recognized chunks of English speech:
{'timestamp': (0.0, 8.08), 'text': 'Ensembling can help us to solve a well-known bias-variance trade-off.'}
{'timestamp': (8.96, 20.08), 'text': 'We can decrease variance on basis of large ensemble, large ensemble of different algorithms.'}
Long-form speech recognition
While previous examples demonstrate accurate transcription for audio segments under thirty seconds, practical applications often require processing extensive recordings ranging from several minutes to multiple hours. This necessitates specialized techniques like the sliding window approach to overcome memory constraints and preserve contextual coherence across the entire signal. The following example showcases the model's capability to handle such long-form audio, enabling accurate transcription of lectures, interviews, and meetings.
import nltk # for splitting long text into sentences
from whisper_lid.whisper_lid import detect_language_in_long_speech # for spoken language detection in long audio
nltk.download('punkt_tab')
long_sound_ru_url = 'https://huggingface.co/bond005/whisper-podlodka-turbo/resolve/main/test_sound_ru_longform.wav'
long_sound_ru_name = wget.download(long_sound_ru_url)
long_sound_ru = librosa.load(long_sound_ru_name, sr=target_sampling_rate, mono=True)[0]
print('Duration of long sound with Russian speech = {0:.3f} seconds.'.format(
long_sound_ru.shape[0] / target_sampling_rate
))
detected_languages, _ = detect_language_in_long_speech(
long_sound_ru,
asr.feature_extractor,
asr.tokenizer,
asr.model
)
print('\nTop-3 languages:')
lang_text_width = max([len(it[0]) for it in detected_languages])
for it in detected_languages[0:3]:
print(' {0:>{1}} {2:.4f}'.format(it[0], lang_text_width, it[1]))
recognition_result = asr(
long_sound_ru,
generate_kwargs={
'max_new_tokens': 410,
'num_beams': 5, # beam search width (higher values improve accuracy at the cost of increased computation)
'condition_on_prev_tokens': False,
'compression_ratio_threshold': 2.4, # used to detect and suppress repetitive loops (a common failure mode)
'temperature': (0.0, 0.2, 0.4, 0.6, 0.8, 1.0), # controls the randomness of token sampling during generation
'logprob_threshold': -1.0, # the threshold for the average log-probability of the generated tokens (provides a filter to exclude low-confidence, potentially erroneous transcriptions)
'no_speech_threshold': 0.6, # threshold for the probability of the `<|nospeech|>` token (segments with a probability above this threshold are considered silent and skipped)
'task': 'transcribe',
'language': detected_languages[0][0]
},
return_timestamps=True
)
print('\nRecognized text in the long audio, split into sentences:')
for it in map(lambda sent: sent.strip(), nltk.sent_tokenize(recognition_result['text'])):
print(f' {it}')
As a result, you will see text output like this:
Duration of long sound with Russian speech = 148.845 seconds.
Top-3 languages:
russian 0.9787
english 0.0186
ukrainian 0.0006
Recognized text in the long audio, split into sentences:
Здравствуйте, друзья!
Здравствуйте!
Я очень рад вас всех видеть здесь, в этом зале на Хайлооде.
Я Иван, как меня уже представили, и я люблю машинное обучение.
Я люблю машинное обучение с 2005 года, когда оно проникло в моё сердце ещё, когда я был студентом.
С 2006 года я работал в академической сфере, преподавал нейронные сети, искусственный интеллект, машинное обучение своим студентам.
С тринадцатого года я перешёл в IT-индустрию, работал в разных компаниях, занимаясь примерно всё тем же машинным обучением и искусством интеллектом.
И, наконец, в двадцать втором году мне всё это надоело, я решил обратно из IT-индустрии перейти в академическую сферу.
И сейчас я работаю в Новосибирском государственном университете, занимаюсь научными исследованиями, учу студентов, делаем всякие интересные штуки.
Ну, а в конце прошлого года я и мои ученики решили всё-таки не только фундаментальными исследованиями заниматься, и наука должна проносить пользу людям, и мы сделали маленький стартап под названием «Сибирские нейросети».
То есть, сибирское здоровье и сибирские нейросети теперь.
Всё, что я делал, оно связано одним, моей любовью к машинному обучению.
Мне это очень интересно было всегда.
Ну, а помимо машинного обучения, мне интересно участвовать в соревнованиях.
Научные соревнования — это не просто способ развлечься, это способ оценить по гамбургскому счету, насколько твой алгоритм хорош объективно в сравнении с другими.
Когда мы разрабатываем какую-то систему для заказчика, мы ориентируемся на его данные, эти данные закрыты, как правило, мы делаем какие-то кастомизации, которые обеспечивают качество, может, даже черепикинг иногда делаем, хотя это фу, но тем не менее.
А когда мы предлагаем наши решения, наш научный метод, наш алгоритм на соревновании, мы все в равных условиях, и мы действительно можем оценить, насколько хорошо то или иное решение себя показывает.
При этом самое главное, что такие соревнования дают материалы в виде открытых датасетов для дальнейшей воспроизводимости, в виде открытого кода, который, опять-таки, можно воспроизводить в науке, проблема воспроизводимости стоит остро, А подобные научные соревнования, я имею в виду не Кегл, а что-то более интересное и более такое системное, они позволяют действительно какое-то движение науки вперед осуществлять.
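The Results section below also reports a "simple chunking" baseline for long-form audio. With the Transformers pipeline, that mode can be reproduced via the `chunk_length_s` argument. The following is a minimal sketch (reusing the `asr` pipeline and `long_sound_ru` from the example above), not the exact evaluation script:
# Simple 30-second chunking, an alternative to the sequential long-form algorithm shown above.
chunked_result = asr(
    long_sound_ru,
    chunk_length_s=30,  # split the audio into 30-second windows processed independently
    generate_kwargs={'task': 'transcribe', 'language': 'russian'},
    return_timestamps=True
)
print(chunked_result['text'])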
Voice activity detection (speech/non-speech)
Along with special language tokens, the model can also return the special token `<|nospeech|>` if the input audio signal does not contain any speech (for details, see Section 2.3 of the Whisper paper). This capability forms the basis of the speech/non-speech classification algorithm, as demonstrated in the following example:
nonspeech_sound_url = 'https://huggingface.co/bond005/whisper-podlodka-turbo/resolve/main/test_sound_nonspeech.wav'
nonspeech_sound_name = wget.download(nonspeech_sound_url)
nonspeech_sound = librosa.load(nonspeech_sound_name, sr=target_sampling_rate, mono=True)[0]
print('Duration of sound without speech = {0:.3f} seconds.'.format(
nonspeech_sound.shape[0] / target_sampling_rate
))
detected_languages = detect_language_in_speech(
nonspeech_sound,
asr.feature_extractor,
asr.tokenizer,
asr.model
)
print('Top-3 languages:')
lang_text_width = max([len(it[0]) for it in detected_languages])
for it in detected_languages[0:3]:
print(' {0:>{1}} {2:.4f}'.format(it[0], lang_text_width, it[1]))
As a result, you will see text output like this:
Duration of sound without speech = 10.000 seconds.
Top-3 languages:
NO SPEECH 0.9957
lingala 0.0002
cantonese 0.0002
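If a simple binary speech/non-speech decision is needed, the probability assigned to the NO SPEECH entry returned by whisper-lid can be thresholded. Below is a minimal sketch; the 0.5 threshold is an illustrative assumption rather than a tuned value:
# detect_language_in_speech returns (label, probability) pairs sorted by probability,
# where "NO SPEECH" is a special label (see the output above).
nospeech_probability = dict(detected_languages).get('NO SPEECH', 0.0)
if nospeech_probability > 0.5:  # illustrative threshold, may need tuning
    print('The audio is classified as non-speech (P = {0:.4f}).'.format(nospeech_probability))
else:
    print('The audio is classified as speech (P = {0:.4f}).'.format(1.0 - nospeech_probability))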
Speech translation
In addition to the transcription task, the model also performs speech translation (although it translates better from Russian into English than from English into Russian):
print(f'Speech translation from Russian to English:')
recognition_result = asr(
sound_ru,
generate_kwargs={'task': 'translate', 'language': 'english'},
return_timestamps=False
)
print(recognition_result['text'] + '\n')
print(f'Speech translation from English to Russian:')
recognition_result = asr(
sound_en,
generate_kwargs={'task': 'translate', 'language': 'russian'},
return_timestamps=False
)
print(recognition_result['text'] + '\n')
As a result, you will see text output like this:
Speech translation from Russian to English:
Well, Visper, what is Visper? Visper is already a complete end-to-end neural network with an autoregressive decoder. That is, it's not a pure encoder like Wave2Vec, it's not just a text-to-seq encoder-decoder like T5, it's a complete algorithm for the transformation of speech into text, where the encoder considers, first of all, acoustic features of speech, well, and the semantics are also gradually moving, and the decoder is already a language model that generates token by token.
Speech translation from English to Russian:
Энсемблинг может помочь нам осуществлять хорошо известный торговый байз-вариант. Мы можем ограничить варианты на основе крупного энсембла, крупного энсембла разных алгоритмов.
As you can see, both translations contain some errors; however, the errors are more significant in the English-to-Russian direction.
Bias, Risks, and Limitations
- While improvements are observed for English and translation tasks, statistically significant advantages are confirmed only for Russian ASR
- The model's performance on code-switching speech (where speakers alternate between Russian and English within the same utterance) has not been specifically evaluated
- Inherits basic limitations of the Whisper architecture
Training Details
Training Data
The model was fine-tuned on a composite dataset including:
- Common Voice (Ru, En)
- Podlodka Speech (Ru)
- Taiga Speech (Ru, synthetic)
- Golos Farfield and Golos Crowd (Ru)
- Sova Rudevices (Ru)
- Audioset (non-speech audio)
Training Features
1. Data Augmentation:
   - Dynamic mixing of speech with background noise and music (a minimal mixing sketch is shown after this list)
   - Gradual reduction of the signal-to-noise ratio during training
2. Text Data Processing:
   - Russian text punctuation and capitalization restoration using bond005/ruT5-ASR-large (for speech sub-corpora without punctuated annotations)
   - Parallel Russian-English text generation using Qwen/Qwen2.5-14B-Instruct
   - Multi-stage validation of generated texts to minimize hallucinations using bond005/xlm-roberta-xl-hallucination-detector
3. Training Strategy:
   - Progressive increase in training example complexity
   - Balanced sampling between speech and non-speech data
   - Special handling of language tokens and no-speech detection (`<|nospeech|>`)
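The following is a minimal, hypothetical sketch of the noise-mixing step: it mixes a speech signal with a noise signal at a target signal-to-noise ratio. The real augmenter used in training is more elaborate (it also covers reverberation, music, and several noise categories); `sound_ru` and `target_sampling_rate` are reused from the usage examples above:
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix speech with noise at a target SNR in dB (simplified illustration)."""
    # Repeat or truncate the noise so that it matches the speech length.
    if noise.shape[0] < speech.shape[0]:
        noise = np.tile(noise, int(np.ceil(speech.shape[0] / noise.shape[0])))
    noise = noise[:speech.shape[0]]
    # Scale the noise so that 10 * log10(P_speech / P_noise) equals snr_db.
    speech_power = float(np.mean(speech ** 2)) + 1e-10
    noise_power = float(np.mean(noise ** 2)) + 1e-10
    noise_scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    mixed = speech + noise_scale * noise
    # Normalize to avoid clipping.
    peak = float(np.max(np.abs(mixed)))
    return mixed / peak if peak > 1.0 else mixed

# Example: mix the Russian test sound with synthetic white noise at 2 dB SNR.
white_noise = (np.random.randn(5 * target_sampling_rate) * 0.05).astype(np.float32)
noisy_sound_ru = mix_at_snr(sound_ru, white_noise, snr_db=2.0)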
Evaluation
The experimental evaluation focused on two main tasks:
- Russian speech recognition
- Speech activity detection (binary classification "speech/non-speech")
Testing was performed on publicly available Russian speech corpora. Speech recognition was conducted using the standard pipeline from the Hugging Face 🤗 Transformers library. Due to the limitations of this pipeline in language identification and non-speech detection (caused by a certain bug), the whisper-lid library was used for speech presence/absence detection in the signal.
Testing Data & Metrics
Testing Data
The quality of the Russian speech recognition task was tested on the test sub-sets of six different datasets:
- bond005/podlodka_speech
- rulibrispeech
- sberdevices_golos_farfield
- sberdevices_golos_crowd
- sova_rudevices
- common_voice_11_0
The quality of the long-form Russian speech recognition was tested on the dangrebenkin/long_audio_youtube_lectures dataset, developed by Daniel Grebenkin. This dataset contains seven long-form (20-40 minute) Russian audio recordings that were manually annotated. The audios cover a variety of topics and speaking styles; they are excerpts from Russian scientific lectures on various subjects: philology, mathematics, history, etc. All recordings were made in relatively quiet, lecture-hall-like acoustic environments. However, some natural background noises, such as the sound of chalk on a blackboard, are present.
The quality of the voice activity detection task was tested on test sub-sets of two different datasets:
- a noised version of Golos Crowd as a source of speech samples
- a filtered sub-set of the Audioset corpus as a source of non-speech samples
Noise was added using a special augmenter capable of simulating the superposition of five different types of acoustic noise (reverberation, speech-like sounds, music, household sounds, and pet sounds) at a given signal-to-noise ratio (in this case, a signal-to-noise ratio of 2 dB was used).
The quality of the robust Russian speech recognition task was tested on the test sub-set of the above-mentioned noised Golos Crowd.
Metrics
1. Modified WER (Word Error Rate) for Russian speech recognition quality (a simplified computation sketch is shown after this list):
   - Text normalization before WER calculation:
     - Unification of numeral representations (digits/words)
     - Standardization of foreign words (Cyrillic/Latin scripts)
     - Accounting for valid transliteration variants
   - Enables more accurate assessment of semantic recognition accuracy
   - The lower the WER, the better the speech recognition quality
2. F1-score for speech activity detection:
   - Binary classification "speech/non-speech"
   - Evaluation of non-speech segment detection accuracy using the `<|nospeech|>` token
   - The higher the F1-score, the better the voice activity detection quality
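Below is a simplified sketch of such a normalization-aware WER computation. It relies on the `jiwer` package and a toy normalization (lowercasing, punctuation removal, and a tiny, hypothetical digit-to-word substitution); the actual evaluation used a much richer rule set for numerals and transliteration variants:
import re
import jiwer  # pip install jiwer

def normalize_for_wer(text: str) -> str:
    """A toy text normalization applied before WER calculation (illustrative only)."""
    text = text.lower()
    # A tiny, hypothetical example of unifying numeral representations.
    for digit, word in {'1': 'один', '2': 'два', '3': 'три'}.items():
        text = re.sub(r'\b{0}\b'.format(digit), word, text)
    # Remove punctuation, since punctuation is not evaluated by WER.
    text = re.sub(r'[^\w\s]', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

reference = 'Привет! Сегодня 2 доклада.'
hypothesis = 'привет сегодня два доклада'
print('Modified WER = {0:.2%}'.format(
    jiwer.wer(normalize_for_wer(reference), normalize_for_wer(hypothesis))
))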
Generation Parameters
For experiments with short audio signals (under 30 seconds), we used standard greedy decoding (`num_beams=1`). For long-form audio, two approaches were tested: simple 30-second chunking and the sequential long-form algorithm.
For the sequential long-form mode, the implementation followed the strategy from Section 4.5 of the Whisper paper, with two key hyperparameter differences:
1. Beam Search: the paper implies the use of beam search for optimal performance, while our initial experiments for this task used greedy decoding (`num_beams=1`).
2. Compression Ratio Threshold: a key deviation was the use of a more conservative `compression_ratio_threshold` of 1.35 (compared to 2.4 in the paper). This lower threshold makes the repetition-detection algorithm significantly more aggressive, triggering fallback mechanisms (e.g., temperature rescoring) sooner to suppress repetitive outputs.
The parameters for voice activity detection (`no_speech_threshold=0.6`) and low-confidence detection (`logprob_threshold=-1.0`) were kept aligned with the paper's recommendations. Context conditioning between segments (`condition_on_prev_tokens`) was disabled for this experimental run.
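For reference, the sketch below reproduces a sequential long-form call corresponding to these settings (greedy decoding, `compression_ratio_threshold=1.35`, context conditioning disabled), reusing the `asr` pipeline and `long_sound_ru` from the usage examples; it illustrates the described configuration rather than the exact evaluation script:
longform_result = asr(
    long_sound_ru,
    generate_kwargs={
        'num_beams': 1,  # greedy decoding in this experimental run
        'condition_on_prev_tokens': False,  # context conditioning between segments disabled
        'compression_ratio_threshold': 1.35,  # more aggressive repetition suppression than the paper's 2.4
        'temperature': (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # temperature fallback schedule
        'logprob_threshold': -1.0,  # low-confidence filter
        'no_speech_threshold': 0.6,  # voice activity detection threshold
        'task': 'transcribe',
        'language': 'russian'
    },
    return_timestamps=True
)
print(longform_result['text'])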
Results
Automatic Speech Recognition (ASR)
Result (WER, %):
Dataset | bond005/whisper-podlodka-turbo | openai/whisper-large-v3-turbo |
---|---|---|
bond005/podlodka_speech | 8.17 | 8.33 |
rulibrispeech | 9.76 | 10.25 |
sberdevices_golos_farfield | 11.61 | 20.12 |
sberdevices_golos_crowd | 11.85 | 14.55 |
sova_rudevices | 15.35 | 17.70 |
common_voice_11_0 | 5.22 | 6.63 |
Long-form ASR
Result (WER, %):
Approach | bond005/whisper-podlodka-turbo | openai/whisper-large-v3-turbo |
---|---|---|
simple chunking | 11.66 | 15.98 |
sequential long-form algorithm | 7.84 | 9.59 |
Voice Activity Detection (VAD)
Result (F1):
bond005/whisper-podlodka-turbo | openai/whisper-large-v3-turbo |
---|---|
0.9235 | 0.8484 |
Robust ASR (SNR = 2 dB, speech-like noise, music, etc.)
Result (WER, %):
Dataset | bond005/whisper-podlodka-turbo | openai/whisper-large-v3-turbo |
---|---|---|
sberdevices_golos_crowd (noised) | 46.58 | 75.20 |
Citation
If you use this model in your work, please cite it as:
@misc{whisper-podlodka-turbo,
author = {Ivan Bondarenko},
title = {Whisper-Podlodka-Turbo: Enhanced Whisper Model for Russian ASR},
year = {2025},
publisher = {Hugging Face},
journal = {Hugging Face Model Hub},
howpublished = {\url{https://huggingface.co/bond005/whisper-podlodka-turbo}}
}