	---
	license: apache-2.0
	language:
	- th
	- en
	base_model:
	- openai/whisper-large-v3
	pipeline_tag: automatic-speech-recognition
	library_name: transformers
	metrics:
	- wer
	---

	# Pathumma Whisper Large V3 (Th)

	## Model Description
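	Pathumma Whisper Large V3 (Th) is an automatic speech recognition model for Thai, based on `openai/whisper-large-v3`. Additional information is needed.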

	## Quickstart
	You can transcribe audio files with the Transformers [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) API, as in the following code snippet:
	```python
	import torch
	from transformers import pipeline

	# Use the GPU with bfloat16 when available; otherwise fall back to CPU with float32.
	device = "cuda" if torch.cuda.is_available() else "cpu"
	torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32

	lang = "th"
	task = "transcribe"

	pipe = pipeline(
	    task="automatic-speech-recognition",
	    model="nectec/Pathumma-whisper-th-large-v3",
	    torch_dtype=torch_dtype,
	    device=device,
	)
	# Force the decoder prompt to the chosen language and task.
	pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language=lang, task=task)

	# Replace "audio_path.wav" with the path to your own audio file.
	text = pipe("audio_path.wav")["text"]
	print(text)
	```
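As a variation on the snippet above, the language and task can also be passed per call through `generate_kwargs`, and long recordings can be split into chunks with `chunk_length_s`. The following is a minimal sketch, not the authors' recommended setup: `build_call_kwargs` and `transcribe` are hypothetical helper names, the 30-second chunk length is an assumed default, and the exact behavior may depend on your Transformers version.

```python
def build_call_kwargs(language="th", task="transcribe", chunk_length_s=30):
    """Build keyword arguments for one pipeline call (hypothetical helper)."""
    return {
        # Split long audio into windows of this many seconds (assumed value).
        "chunk_length_s": chunk_length_s,
        # Forwarded to the model's generate() method by the ASR pipeline.
        "generate_kwargs": {"language": language, "task": task},
    }


def transcribe(audio_path, model_id="nectec/Pathumma-whisper-th-large-v3"):
    """Load the model and transcribe a single audio file."""
    # Imported lazily so build_call_kwargs can be used without torch installed.
    import torch
    from transformers import pipeline

    use_cuda = torch.cuda.is_available()
    pipe = pipeline(
        task="automatic-speech-recognition",
        model=model_id,
        torch_dtype=torch.bfloat16 if use_cuda else torch.float32,
        device="cuda" if use_cuda else "cpu",
    )
    return pipe(audio_path, **build_call_kwargs())["text"]
```

Calling `transcribe("audio_path.wav")` downloads the model on first use and returns the transcript as a string.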

	<!-- ## Evaluation Performance
	WER calculated with newmm tokenizer for Thai word segmentation.
	| Model | CV18 (WER) | Gowejee (WER) | LOTUS-TRD (WER) | Thai Dialect (WER) | Elderly (WER) | Gigaspeech2 (WER) | Fleurs (WER) | Distant Meeting (WER) | Podcast (WER) |
	|:----------------------------------------|:----------:|:-------------:|:---------------:|:------------------:|:-------------:|:-----------------:|:------------:|:---------------------:|:-------------:|
	| whisper-large-v3 | 18.75 | 46.59 | 48.14 | 57.82 | 12.27 | 33.26 | 24.08 | 72.57 | 41.24 |
	| airesearch-wav2vec2-large-xlsr-53-th | 8.49 | 17.28 | 63.01 | 48.53 | 11.29 | 52.72 | 37.32 | 85.11 | 65.12 |
	| thonburian-whisper-th-large-v3-combined | 7.62 | 22.06 | 41.95 | 26.53 | 1.63 | 25.22 | 13.90 | 64.68 | 32.42 |
	| monsoon-whisper-medium-gigaspeech2 | 11.66 | 20.50 | 41.04 | 42.06 | 7.57 | 21.40 | 21.54 | 51.65 | 38.89 |
	| pathumma-whisper-th-large-v3 | 8.68 | 9.84 | 15.47 | 19.85 | 1.53 | 21.66 | 15.65 | 51.56 | 36.47 |

	Note: The other models were not specifically fine-tuned on dialect datasets, so their scores may be less representative of dialect performance. -->
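The `wer` metric listed in the metadata is a word-level measure, and Thai is written without spaces between words, so the text has to be segmented before errors can be counted. The sketch below shows one way to compute it: a plain edit-distance WER with a pluggable tokenizer. The function name `word_error_rate` is hypothetical, and any text normalization or whitespace handling applied in the authors' own evaluation is not specified here, so treat this as an illustration rather than their pipeline.

```python
def word_error_rate(reference, hypothesis, tokenize=str.split):
    """Return WER as a fraction: word-level edit distance / reference length.

    `tokenize` maps a string to a list of words. For Thai, one option
    (an assumption, requires the PyThaiNLP package) is:
        from pythainlp.tokenize import word_tokenize
        tokenize = lambda s: word_tokenize(s, engine="newmm")
    """
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one token")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Multiply the result by 100 to express it as a percentage.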

	## Limitations and Future Work
	Additional information is needed

	## Acknowledgements
	We thank the research teams behind open speech models, including AIResearch, BiodatLab, Looloo Technology, SCB 10X, and OpenAI. We are grateful to Dr. Titipat Achakulwisut of BiodatLab for the evaluation pipeline, and to ThaiSC (the NSTDA Supercomputer Centre) for providing LANTA, which was used for model training, fine-tuning, and evaluation.

	## Pathumma Audio Team
	Pattara Tipaksorn, Wayupuk Sommuang, Oatsada Chatthong, Kwanchiva Thangthai

	## Citation
	```
	@misc{tipaksorn2024PathummaWhisper,
	  title = { {Pathumma Whisper Large V3 (TH)} },
	  author = { Pattara Tipaksorn and Wayupuk Sommuang and Oatsada Chatthong and Kwanchiva Thangthai },
	  url = { https://huggingface.co/nectec/Pathumma-whisper-th-large-v3 },
	  publisher = { Hugging Face },
	  year = { 2024 },
	}
	```