	---
	base_model:
	- openai/whisper-large-v3
	datasets:
	- mozilla-foundation/common_voice_11_0
	language:
	- es
	license: openrail
	metrics:
	- accuracy
	pipeline_tag: audio-classification
	tags:
	- model_hub_mixin
	- pytorch_model_hub_mixin
	- speaker_dialect_classification
	library_name: transformers
	---

	# Whisper-Large v3 for Spanish Dialect Classification

	# Model Description
	This model implements the Spanish dialect classifier described in <a href="https://arxiv.org/abs/2508.01691"><strong>Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe</strong></a>.

	GitHub repository: https://github.com/tiantiaf0627/voxlect

	The included Spanish dialects are:
	```
	[
	"Andino-Pacífico",
	"Caribe and Central",
	"Chileno",
	"Mexican",
	"Peninsular",
	"Rioplatense",
	]
	```

	# How to use this model

	## Download repo
	```bash
	git clone git@github.com:tiantiaf0627/voxlect
	```
	## Install the package
	```bash
	conda create -n voxlect python=3.8
	conda activate voxlect
	cd voxlect
	pip install -e .
	```

	## Load the model
	```python
	# Load libraries
	import torch
	import torch.nn.functional as F
	from src.model.dialect.whisper_dialect import WhisperWrapper

	# Find device
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	# Load model from Hugging Face
	model = WhisperWrapper.from_pretrained("tiantiaf/voxlect-spanish-dialect-whisper-large-v3").to(device)
	model.eval()
	```
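
Before running prediction on your own recordings, it may help to confirm that a file matches the input the model was trained on: 16 kHz, mono, and between 3 and 15 seconds long. The sketch below uses only Python's standard `wave` module, so it handles uncompressed PCM WAV files only. The `check_wav` helper and the file name in the usage comment are hypothetical and not part of the Voxlect package.

```python
import wave

TARGET_RATE = 16000  # Hz, the sample rate this model card asks for
MIN_SECONDS = 3      # shorter clips were filtered out of the training data
MAX_SECONDS = 15     # longer clips were filtered out of the training data


def check_wav(path):
    """Return a list of problems found in a PCM WAV file; empty means it looks fine."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()

    duration = frames / rate
    problems = []
    if rate != TARGET_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {TARGET_RATE} Hz")
    if channels != 1:
        problems.append(f"{channels} channels found, expected mono")
    if duration < MIN_SECONDS:
        problems.append(f"{duration:.2f} s is shorter than {MIN_SECONDS} s")
    if duration > MAX_SECONDS:
        problems.append(f"{duration:.2f} s exceeds {MAX_SECONDS} s; truncate before inference")
    return problems


# Usage (hypothetical file name):
# for problem in check_wav("my_clip.wav"):
#     print(problem)
```

A file that fails the sample-rate or channel check needs resampling or downmixing with an audio tool of your choice before it is passed to the model.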

	## Prediction
	```python
	# Label List
	dialect_list = [
	"Andino-Pacífico",
	"Caribe and Central",
	"Chileno",
	"Mexican",
	"Peninsular",
	"Rioplatense",
	]

	# Load data; a tensor of zeros (1 second at 16 kHz) stands in for real audio here
	# Our training data filters out audio shorter than 3 seconds (unreliable predictions) and longer than 15 seconds (computational limits)
	# So prepare your audio as 16 kHz, mono, and at most 15 seconds long
	max_audio_length = 15 * 16000
	data = torch.zeros([1, 16000]).float().to(device)[:, :max_audio_length]
	logits, embeddings = model(data, return_feature=True)

	# Probability and output
	dialect_prob = F.softmax(logits, dim=1)
	print(dialect_list[torch.argmax(dialect_prob).detach().cpu().item()])
	```
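
The prediction snippet prints only the most likely label. When the full distribution is of interest, a minimal sketch like the one below ranks every dialect by probability. It works on a plain list of floats (for example, `logits[0].tolist()` from the call above). The `rank_dialects` helper and the example logits are illustrative assumptions, not part of the Voxlect package.

```python
import math

DIALECTS = [
    "Andino-Pacífico",
    "Caribe and Central",
    "Chileno",
    "Mexican",
    "Peninsular",
    "Rioplatense",
]


def rank_dialects(logit_row, labels=DIALECTS):
    """Return (label, probability) pairs sorted from most to least likely."""
    if len(logit_row) != len(labels):
        raise ValueError("expected one logit per dialect label")
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(logit_row)
    exps = [math.exp(value - peak) for value in logit_row]
    total = sum(exps)
    probs = [value / total for value in exps]
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


# Made-up logits standing in for `logits[0].tolist()`.
example_logits = [0.2, 1.5, -0.3, 2.1, 0.0, 0.7]
for label, prob in rank_dialects(example_logits):
    print(f"{label}: {prob:.3f}")
```

Looking at the gap between the top two probabilities can help flag clips where the prediction is uncertain.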

	Responsible Use: Users should respect the privacy and consent of the data subjects, and adhere to the relevant laws and regulations in their jurisdictions when using Voxlect.

	## If you have any questions, please contact: Tiantian Feng ([email protected])

	❌ Out-of-Scope Use
	- Clinical or diagnostic applications
	- Surveillance
	- Privacy-invasive applications
	- Commercial use

	#### If you like our work or use the models in your work, kindly cite the following. We appreciate your recognition!
	```
	@article{feng2025voxlect,
	title={Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe},
	author={Feng, Tiantian and Huang, Kevin and Xu, Anfeng and Shi, Xuan and Lertpetchpun, Thanathai and Lee, Jihwan and Lee, Yoonjeong and Byrd, Dani and Narayanan, Shrikanth},
	journal={arXiv preprint arXiv:2508.01691},
	year={2025}
	}
	```