# HuBERT
## Pre-trained and fine-tuned (ASR) models
| Model | Pretraining Data | Finetuning Dataset | Checkpoint |
|---|---|---|---|
| HuBERT Base (~95M params) | [Librispeech](http://www.openslr.org/12) 960 hr | No finetuning (Pretrained Model) | [download](https://dl.fbaipublicfiles.com/hubert/hubert_base_ls960.pt) |
| HuBERT Large (~316M params) | [Libri-Light](https://github.com/facebookresearch/libri-light) 60k hr | No finetuning (Pretrained Model) | [download](https://dl.fbaipublicfiles.com/hubert/hubert_large_ll60k.pt) |
| HuBERT Extra Large (~1B params) | [Libri-Light](https://github.com/facebookresearch/libri-light) 60k hr | No finetuning (Pretrained Model) | [download](https://dl.fbaipublicfiles.com/hubert/hubert_xtralarge_ll60k.pt) |
| HuBERT Large | [Libri-Light](https://github.com/facebookresearch/libri-light) 60k hr | [Librispeech](http://www.openslr.org/12) 960 hr | [download](https://dl.fbaipublicfiles.com/hubert/hubert_large_ll60k_finetune_ls960.pt) |
| HuBERT Extra Large | [Libri-Light](https://github.com/facebookresearch/libri-light) 60k hr | [Librispeech](http://www.openslr.org/12) 960 hr | [download](https://dl.fbaipublicfiles.com/hubert/hubert_xtralarge_ll60k_finetune_ls960.pt) |
## Load a model
```python
import fairseq.checkpoint_utils

ckpt_path = "/path/to/the/checkpoint.pt"
# Returns a list of models together with the config and task stored in the checkpoint.
models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
model = models[0]
```
## Train a new model
### Data preparation
Follow the steps in `./simple_kmeans` to create:
- `{train,valid}.tsv` waveform list files
- `{train,valid}.km` frame-aligned pseudo label files

The `label_rate` is the same as the feature frame rate used for clustering,
which is 100Hz for MFCC features and 50Hz for HuBERT features by default.
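As a sanity check on the label rate, the number of labels for an utterance should be close to its duration multiplied by `label_rate`. The sketch below assumes 16 kHz audio and that the sample count of each utterance is known (for example from the waveform list). The function names `expected_num_labels` and `labels_match` and the tolerance `tol` are hypothetical and not part of fairseq.

```python
def expected_num_labels(num_samples: int, label_rate: float, sample_rate: int = 16000) -> float:
    """Approximate number of label frames for an utterance of num_samples samples."""
    return num_samples * label_rate / sample_rate


def labels_match(num_samples: int, km_line: str, label_rate: float,
                 sample_rate: int = 16000, tol: float = 2.0) -> bool:
    """Check that one line of space-separated labels has about one label per frame."""
    num_labels = len(km_line.split())
    return abs(num_labels - expected_num_labels(num_samples, label_rate, sample_rate)) <= tol


# One second of (assumed) 16 kHz audio at the two default label rates.
print(expected_num_labels(16000, 100))  # 100.0
print(expected_num_labels(16000, 50))   # 50.0
```

A large mismatch usually means the labels were produced at a different frame rate than the `label_rate` passed to training.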
### Pre-train a HuBERT model
Suppose `{train,valid}.tsv` are saved at `/path/to/data`, `{train,valid}.km`
are saved at `/path/to/labels`, and the label rate is 100Hz.

To train a base model (12 layer transformer), run:
```sh
$ python fairseq_cli/hydra_train.py \
  --config-dir /path/to/fairseq-py/examples/hubert/config/pretrain \
  --config-name hubert_base_librispeech \
  task.data=/path/to/data task.label_dir=/path/to/labels model.label_rate=100
```
### Fine-tune a HuBERT model with a CTC loss
Suppose `{train,valid}.tsv` are saved at `/path/to/data`, and their
corresponding character transcripts `{train,valid}.ltr` are saved at
`/path/to/trans`.

To fine-tune a pre-trained HuBERT model at `/path/to/checkpoint`, run:
```sh
$ python fairseq_cli/hydra_train.py \
  --config-dir /path/to/fairseq-py/examples/hubert/config/finetune \
  --config-name base_10h \
  task.data=/path/to/data task.label_dir=/path/to/trans \
  model.w2v_path=/path/to/checkpoint
```
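The `.ltr` files hold the character transcript of each utterance. The sketch below assumes the letter convention used by fairseq's wav2vec 2.0 data preparation: characters separated by spaces, with `|` marking word boundaries. Check it against your own label files before relying on it. The helper name `words_to_letters` is hypothetical.

```python
def words_to_letters(transcript: str) -> str:
    """Convert a word-level transcript into a space-separated letter sequence."""
    words = transcript.strip().split()
    # Join words with "|" so word boundaries survive the character split,
    # then add a trailing "|" to close the final word.
    return " ".join("|".join(words)) + " |"


print(words_to_letters("HELLO WORLD"))  # H E L L O | W O R L D |
```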
### Decode a HuBERT model
Suppose the `test.tsv` and `test.ltr` are the waveform list and transcripts of
the split to be decoded, saved at `/path/to/data`, and the fine-tuned model is
saved at `/path/to/checkpoint`. We support three decoding modes:
- Viterbi decoding: greedy decoding without a language model
- KenLM decoding: decoding with an arpa-format KenLM n-gram language model
- Fairseq-LM decoding: decoding with a Fairseq neural language model
#### Viterbi decoding
`task.normalize` needs to be consistent with the value used during fine-tuning.
Decoding results will be saved at
`/path/to/experiment/directory/decode/viterbi/test`.
```sh
$ python examples/speech_recognition/new/infer.py \
  --config-dir /path/to/fairseq-py/examples/hubert/config/decode \
  --config-name infer_viterbi \
  task.data=/path/to/data \
  task.normalize=[true|false] \
  decoding.exp_dir=/path/to/experiment/directory \
  common_eval.path=/path/to/checkpoint \
  dataset.gen_subset=test
```
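Because Viterbi decoding here is greedy and the model is fine-tuned with a CTC loss, the idea can be illustrated as: take the most likely label at each frame, merge consecutive repeats, and drop blanks. The sketch below shows only that collapse step on a list of per-frame label indices. It illustrates the idea and is not the code in `infer.py`; the blank index of 0 and the function name `greedy_ctc_collapse` are assumptions.

```python
from itertools import groupby
from typing import List, Sequence


def greedy_ctc_collapse(frame_labels: Sequence[int], blank: int = 0) -> List[int]:
    """Merge consecutive repeated labels, then remove blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


# Frame-wise argmax output for a short utterance (0 is the assumed blank).
print(greedy_ctc_collapse([0, 3, 3, 0, 3, 5, 5, 0]))  # [3, 3, 5]
```

The blank between the two runs of `3` is what allows a repeated label to survive the merge.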
#### KenLM / Fairseq-LM decoding
Suppose the pronunciation lexicon and the n-gram LM are saved at
`/path/to/lexicon` and `/path/to/arpa`, respectively. Decoding results will be
saved at `/path/to/experiment/directory/decode/kenlm/test`.
```sh
$ python examples/speech_recognition/new/infer.py \
  --config-dir /path/to/fairseq-py/examples/hubert/config/decode \
  --config-name infer_kenlm \
  task.data=/path/to/data \
  task.normalize=[true|false] \
  decoding.exp_dir=/path/to/experiment/directory \
  common_eval.path=/path/to/checkpoint \
  dataset.gen_subset=test \
  decoding.decoder.lexicon=/path/to/lexicon \
  decoding.decoder.lmpath=/path/to/arpa
```
The command above uses the default decoding hyperparameters, which can be found
in `examples/speech_recognition/hydra/decoder.py`. These parameters can be
configured from the command line. For example, to search with a beam size of
500, append `decoding.decoder.beam=500` to the command above.
Important parameters include:
- `decoding.decoder.beam`
- `decoding.decoder.beamthreshold`
- `decoding.decoder.lmweight`
- `decoding.decoder.wordscore`
- `decoding.decoder.silweight`
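
Since these parameters are plain command-line overrides, a small tuning sweep can be scripted by generating one set of override strings per combination. The sketch below only builds the strings and does not launch decoding. The grid values are made up for illustration, and the helper name `decoder_overrides` is hypothetical.

```python
from itertools import product
from typing import Dict, Iterator, List, Sequence


def decoder_overrides(grid: Dict[str, Sequence[float]]) -> Iterator[List[str]]:
    """Yield one list of `decoding.decoder.<name>=<value>` overrides per grid point."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield [f"decoding.decoder.{name}={value}" for name, value in zip(names, values)]


# Illustrative values only.
grid = {"lmweight": [1.0, 2.0], "wordscore": [-1.0, 0.0]}
for overrides in decoder_overrides(grid):
    print(" ".join(overrides))
```

Each printed line can be appended to the KenLM decoding command above.
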
To decode with a Fairseq LM, use `--config-name infer_fsqlm` instead, and
change the paths of the lexicon and LM accordingly.