Seeking Help: Syncing AI-generated Dubbing Audio with Educational Videos Based on Subtitles
Hello everyone,
I'm working on an AI-powered dubbing pipeline for educational videos, and I’m facing a core challenge related to audio-video synchronization.
Here’s the process I’m following:
I extract English subtitles from the original video and translate them into another language (e.g., Persian).
Then I use a TTS model (currently OpenAI TTS) to generate speech for each subtitle segment.
I use the start_time and end_time metadata from the .vtt file to align each generated audio segment to the intended time window.
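For context, this is roughly how I pull the timing windows out of the .vtt file — a minimal, dependency-free sketch (the helper names are mine, not from any library), which just reads each cue's start/end and computes the target duration the TTS output has to fill:

```python
def vtt_ts_to_seconds(ts: str) -> float:
    """Convert a WebVTT timestamp ("HH:MM:SS.mmm" or "MM:SS.mmm") to seconds."""
    parts = ts.split(":")
    h, m, s = parts if len(parts) == 3 else ("0", *parts)
    return int(h) * 3600 + int(m) * 60 + float(s)

def cue_windows(vtt_text: str):
    """Return (start_s, end_s, duration_s) for every cue timing line in a .vtt."""
    windows = []
    for line in vtt_text.splitlines():
        if "-->" in line:  # WebVTT cue timing lines use "start --> end"
            start_raw, end_raw = (p.strip() for p in line.split("-->"))
            start = vtt_ts_to_seconds(start_raw)
            end = vtt_ts_to_seconds(end_raw.split()[0])  # drop cue settings, if any
            windows.append((start, end, end - start))
    return windows
```

Each tuple's third field is the slot the corresponding synthesized segment should occupy.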
The problem:
Despite passing the target duration to the TTS and even applying speed adjustments or padding/silence at the end, the synthesized audio often doesn’t match the original video’s pacing. As a result:
Sometimes the audio finishes too early or too late.
Speed adjustments degrade voice quality or cause unnatural rhythm.
The final merged audio doesn't sound natural or aligned, especially when multiple segments are involved.
I’m not dealing with lip-syncing, since most of the content is screen-recorded (no face or talking head on screen). But I do need each audio chunk to match its exact subtitle duration for a smooth learning experience.
What I’ve tried so far:
Calculating speech speed based on estimated word/character counts per segment.
Adjusting the playback speed of audio files with pydub.
Padding or trimming to fit exact time slots.
Parallel processing with duration hints for each TTS call.
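To make the "speed adjustment vs. padding" trade-off concrete, here's the kind of decision logic I've converged on — a sketch, where the 0.9–1.15 rate bounds are my own guesses at what a voice tolerates before sounding unnatural, not established values:

```python
def fit_plan(audio_s: float, slot_s: float, min_rate: float = 0.9, max_rate: float = 1.15):
    """Plan how to fit a synthesized clip of audio_s seconds into a slot_s slot.

    Returns (speed_rate, pad_s): speed_rate > 1 means speed up, < 1 slow down,
    clamped so the voice isn't over-stretched; pad_s is trailing silence to add
    (negative pad_s means the clip still overflows the slot even at max_rate).
    """
    raw_rate = audio_s / slot_s                  # rate needed for an exact fit
    rate = max(min_rate, min(max_rate, raw_rate))  # clamp to the "natural" range
    new_len = audio_s / rate                     # length after the clamped change
    return rate, slot_s - new_len
```

The point is to stop forcing an exact fit: clamp the rate, then absorb the leftover as silence, and only treat a negative pad (true overflow) as a failure that needs retranslation, re-synthesis, or borrowing time from the next gap.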
Still, none of these fully solves the problem: the generated voice ends up sounding either too compressed or too stretched.
What I'm looking for:
Any existing open-source projects, research papers, or tools that effectively handle this type of duration-aware TTS alignment for dubbing use cases.
Tips on how to integrate fine-grained prosody control into TTS generation.
Suggestions for improving duration prediction, or maybe aligning text-to-audio timing using tools like WhisperX, Aeneas, or custom alignment models.
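One half-formed idea I'd also welcome feedback on: since screen recordings usually have silence between cues, a long segment could borrow time from the following gap before any speed change is applied. A toy scheduler along those lines (the 0.1 s min_gap is an arbitrary assumption to tune):

```python
def schedule(cues, durs, min_gap: float = 0.1):
    """Place each synthesized clip at its cue's start time.

    cues: list of (start_s, end_s) windows from the .vtt.
    durs: synthesized duration of each clip, in seconds.
    A clip may run past its cue's end, up to min_gap seconds before the next
    cue starts. Returns (placed_start_s, overflow_s) per clip; overflow_s > 0
    means that many seconds still have to be removed by speeding up.
    """
    placements = []
    for i, ((start, _end), dur) in enumerate(zip(cues, durs)):
        # Latest allowed finish: just before the next cue, or unconstrained if last.
        hard_end = cues[i + 1][0] - min_gap if i + 1 < len(cues) else start + dur
        overflow = max(0.0, dur - (hard_end - start))
        placements.append((start, overflow))
    return placements
```

With this, only clips that overflow even after borrowing the gap would need a rate change, which should keep most segments untouched.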
I’d really appreciate any insight, example code, or ideas!
Thanks in advance 🙏
— An AI developer dubbing educational content