F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Paper: arXiv:2410.06885
This model is reshaped for MLX from the original weights and is designed for use with f5-tts-mlx.
F5-TTS is a non-autoregressive, zero-shot text-to-speech system that uses a flow-matching mel spectrogram generator built on a diffusion transformer (DiT).
You can listen to a sample here that was generated in ~11 seconds on an M3 Max MacBook Pro.
See F5-TTS for the original checkpoint.
pip install f5-tts-mlx
python -m f5_tts_mlx.generate --text "The quick brown fox jumped over the lazy dog."
If you want to use your own reference audio sample, make sure it's a mono, 24kHz wav file of around 5-10 seconds:
python -m f5_tts_mlx.generate \
--text "The quick brown fox jumped over the lazy dog." \
--ref-audio /path/to/audio.wav \
--ref-text "This is the caption for the reference audio."
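To check whether a file already meets the reference audio requirements above (mono, 24kHz, around 5-10 seconds), a small script using Python's standard `wave` module can help. This is a minimal sketch, not part of f5-tts-mlx: the function name `check_reference_audio` and the exact 5-10 second bounds are assumptions for illustration, and `wave` only reads uncompressed PCM wav files.

```python
import wave


def check_reference_audio(path, min_seconds=5.0, max_seconds=10.0):
    """Return a list of problems; an empty list means the file looks usable.

    Hypothetical helper, not part of f5-tts-mlx. The duration bounds are
    an assumption based on the "around 5-10 seconds" guidance.
    """
    problems = []

    # wave.open raises wave.Error if the file is not a PCM wav file.
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate

    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if sample_rate != 24000:
        problems.append(f"expected 24000 Hz, found {sample_rate} Hz")
    if not min_seconds <= duration <= max_seconds:
        problems.append(
            f"expected about {min_seconds:g}-{max_seconds:g} s, found {duration:.1f} s"
        )
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_audio.py /path/to/audio.wav
    for problem in check_reference_audio(sys.argv[1]):
        print(problem)
```

If the script prints nothing, the file should be suitable to pass as `--ref-audio`.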
You can convert an audio file to the correct format with ffmpeg like this:
ffmpeg -i /path/to/audio.wav -ac 1 -ar 24000 -sample_fmt s16 -t 10 /path/to/output_audio.wav
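If you need to convert several files, the same ffmpeg invocation can be driven from Python. The sketch below is illustrative only: the helper names `build_ffmpeg_command` and `convert_reference_audio` are hypothetical, and it assumes `ffmpeg` is installed and on your `PATH`. The flags are exactly the ones in the command above.

```python
import subprocess


def build_ffmpeg_command(src, dst, max_seconds=10):
    """Build the ffmpeg argument list for a mono, 24kHz, 16-bit wav file.

    Hypothetical helper; mirrors the ffmpeg command shown above.
    """
    return [
        "ffmpeg",
        "-i", str(src),            # input file
        "-ac", "1",                # downmix to mono
        "-ar", "24000",            # resample to 24kHz
        "-sample_fmt", "s16",      # 16-bit signed samples
        "-t", str(max_seconds),    # keep at most this many seconds
        str(dst),                  # output file
    ]


def convert_reference_audio(src, dst, max_seconds=10):
    """Run ffmpeg; raises subprocess.CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_command(src, dst, max_seconds), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.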
See here for more options to customize generation.
You can load a pretrained model from Python like this:
from f5_tts_mlx.generate import generate
audio = generate(text = "Hello world.", ...)