wav2vec2-large-xlsr-53-842h-luxembourgish-14h-with-lm / create_lm_decoder.py

add create lm scripts

98591ec over 3 years ago

755 Bytes

	#!/usr/bin/env python3
	#
	# Created by lemswasabi on 24/05/2022.
	# Copyright © 2022 letzspek. All rights reserved.
	#

	from transformers import AutoProcessor
	from transformers import Wav2Vec2ProcessorWithLM
	from pyctcdecode import build_ctcdecoder

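	# Load the existing processor (feature extractor + tokenizer) from the current directory.
	processor = AutoProcessor.from_pretrained("./")
	# Order the tokens by id so that each label lines up with its index in the CTC logits.
	vocab_dict = processor.tokenizer.get_vocab()
	sorted_vocab_dict = {k.lower(): v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}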

	# Build a beam-search CTC decoder backed by the KenLM language model.
	decoder = build_ctcdecoder(
	    labels=list(sorted_vocab_dict.keys()),
	    kenlm_model_path="5gram_correct.arpa",
	)

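	# Bundle the feature extractor, tokenizer, and LM decoder into one processor.
	processor_with_lm = Wav2Vec2ProcessorWithLM(
	    feature_extractor=processor.feature_extractor,
	    tokenizer=processor.tokenizer,
	    decoder=decoder,
	)

	# Write the combined processor back to the current directory.
	processor_with_lm.save_pretrained("./")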
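
The label-ordering step in the script can be tried in isolation. The sketch below is not part of the repository: `ordered_labels` and `toy_vocab` are hypothetical names, and the toy vocabulary is made up for illustration. It sorts tokens by id and lowercases them as the script does, and additionally raises an error if lowercasing would merge two tokens, since a dict comprehension would silently drop one of them and leave the label list shorter than the model's output dimension.

```python
from typing import Dict, List


def ordered_labels(vocab: Dict[str, int]) -> List[str]:
    """Return lowercased tokens ordered by their integer id."""
    # Sort (token, id) pairs by id so position i holds the token with id i.
    items = sorted(vocab.items(), key=lambda item: item[1])
    labels = [token.lower() for token, _ in items]
    # Guard (an addition, not in the original script): detect case collisions.
    if len(set(labels)) != len(labels):
        raise ValueError("lowercasing merged two distinct tokens")
    return labels


# Hypothetical toy vocabulary, deliberately out of id order.
toy_vocab = {"<pad>": 0, "B": 2, "|": 3, "A": 1}
labels = ordered_labels(toy_vocab)
print(labels)  # ['<pad>', 'a', 'b', '|']
```

A list built this way could be passed as `labels` to `build_ctcdecoder` in place of `list(sorted_vocab_dict.keys())`, assuming the real vocabulary has no case collisions.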