---
base_model:
- openai/whisper-large-v3
datasets:
- mozilla-foundation/common_voice_11_0
language:
- es
license: openrail
metrics:
- accuracy
pipeline_tag: audio-classification
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- speaker_dialect_classification
library_name: transformers
---

# Whisper-Large v3 for Spanish Dialect Classification

# Model Description
This model is the implementation of Spanish dialect classification described in **Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe**.

GitHub repository: https://github.com/tiantiaf0627/voxlect

The included Spanish dialects are:
```
[
    "Andino-Pacífico",
    "Caribe and Central",
    "Chileno",
    "Mexican",
    "Penisular",
    "Rioplatense",
]
```

# How to use this model

## Download the repo
```bash
git clone git@github.com:tiantiaf0627/voxlect
```

## Install the package
```bash
conda create -n voxlect python=3.8
cd voxlect
pip install -e .
```

## Load the model
```python
# Load libraries
import torch
import torch.nn.functional as F
from src.model.dialect.whisper_dialect import WhisperWrapper

# Find device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model from Hugging Face
model = WhisperWrapper.from_pretrained("tiantiaf/voxlect-spanish-dialect-whisper-large-v3").to(device)
model.eval()
```

## Prediction
```python
# Label list
dialect_list = [
    "Andino-Pacífico",
    "Caribe and Central",
    "Chileno",
    "Mexican",
    "Penisular",
    "Rioplatense",
]

# Load data; zeros are used here only as a placeholder input.
# Our training data filters out audio shorter than 3 seconds (unreliable predictions)
# and longer than 15 seconds (computation limitations), so prepare your audio as
# mono-channel 16 kHz with a maximum length of 15 seconds.
max_audio_length = 15 * 16000
data = torch.zeros([1, 16000]).float().to(device)[:, :max_audio_length]
logits, embeddings = model(data, return_feature=True)

# Probability and output
dialect_prob = F.softmax(logits, dim=1)
print(dialect_list[torch.argmax(dialect_prob).detach().cpu().item()])
```

Responsible Use: Users should respect the privacy and consent of data subjects, and adhere to the relevant laws and regulations in their jurisdictions when using Voxlect.

## If you have any questions, please contact: Tiantian Feng (tiantiaf@usc.edu)

❌ **Out-of-Scope Use**
- Clinical or diagnostic applications
- Surveillance
- Privacy-invasive applications
- Commercial use

#### If you like our work or use the models in your work, kindly cite the following. We appreciate your recognition!
```bibtex
@article{feng2025voxlect,
  title={Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe},
  author={Feng, Tiantian and Huang, Kevin and Xu, Anfeng and Shi, Xuan and Lertpetchpun, Thanathai and Lee, Jihwan and Lee, Yoonjeong and Byrd, Dani and Narayanan, Shrikanth},
  journal={arXiv preprint arXiv:2508.01691},
  year={2025}
}
```
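
# End-to-End Example
The snippets above use a zero tensor as a placeholder input. The code below is a minimal sketch (not an official example) of running the classifier on a real recording; it assumes `torchaudio` is available, that `"speech.wav"` is a hypothetical path to your own file, and that `device`, `model`, and `dialect_list` are already defined as in the snippets above.

```python
import torch
import torch.nn.functional as F
import torchaudio  # assumption: torchaudio is installed alongside the repo dependencies

TARGET_SR = 16000                  # the model expects 16 kHz audio
MAX_AUDIO_LENGTH = 15 * TARGET_SR  # at most 15 seconds, per the note above

# Load the recording; torchaudio returns a (channels, samples) tensor and the sample rate
waveform, sr = torchaudio.load("speech.wav")  # hypothetical file path

# Down-mix to mono by averaging channels, keeping the (1, samples) shape
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# Resample to 16 kHz if needed
if sr != TARGET_SR:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=TARGET_SR)

# Truncate to the 15-second maximum and move to the model's device
waveform = waveform[:, :MAX_AUDIO_LENGTH].float().to(device)

# Run the classifier and report the top prediction with its probability
with torch.no_grad():
    logits, embeddings = model(waveform, return_feature=True)
dialect_prob = F.softmax(logits, dim=1)
top_prob, top_idx = dialect_prob.max(dim=1)
print(f"{dialect_list[top_idx.item()]} ({top_prob.item():.3f})")
```

Note that clips shorter than 3 seconds may yield unreliable predictions, per the training-data note above.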