Improve model card: Fix git clone typo, add citation, and enhance description

#1 opened by nielsr (HF Staff)

Files changed (1): README.md (+40, −20)
README.md CHANGED
@@ -5,6 +5,7 @@ datasets:
  - mozilla-foundation/common_voice_11_0
  language:
  - es
  license: openrail
  metrics:
  - accuracy
@@ -13,25 +14,25 @@ tags:
  - model_hub_mixin
  - pytorch_model_hub_mixin
  - speaker_dialect_classification
- library_name: transformers
  ---

  # Whisper-Large v3 for Spanish Dialect Classification

  # Model Description
- This model includes the implementation of Spanish dialect classification described in <a href="https://arxiv.org/abs/2508.01691"><strong>**Voxlect: A Speech Foundation Model Benchmark for Modeling Dialect and Regional Languages Around the Globe**</strong></a>

  Github repository: https://github.com/tiantiaf0627/voxlect

- The included Spanish dialects are:
  ```
  [
- "Andino-Pacífico",
- "Caribe and Central",
  "Chileno",
- "Mexican",
- "Penisular",
- "Rioplatense",
  ]
  ```

@@ -39,7 +40,7 @@ The included Spanish dialects are:

  ## Download repo
  ```bash
- git clone git@github.com:tiantiaf0627/voxvoxlect
  ```
  ## Install the package
  ```bash
@@ -67,31 +68,50 @@ model.eval()
  ```python
  # Label List
  dialect_list = [
- "Andino-Pacífico",
- "Caribe and Central",
  "Chileno",
- "Mexican",
- "Penisular",
- "Rioplatense",
  ]
-
  # Load data, here just zeros as the example
  # Our training data filters out audio shorter than 3 seconds (unreliable predictions) and longer than 15 seconds (computation limitation)
  # So you need to prepare your audio to a maximum of 15 seconds, 16kHz and mono channel
  max_audio_length = 15 * 16000
  data = torch.zeros([1, 16000]).float().to(device)[:, :max_audio_length]
  logits, embeddings = model(data, return_feature=True)
-
  # Probability and output
  dialect_prob = F.softmax(logits, dim=1)
  print(dialect_list[torch.argmax(dialect_prob).detach().cpu().item()])
  ```

- Responsible Use: Users should respect the privacy and consent of the data subjects, and adhere to the relevant laws and regulations in their jurisdictions when using Voxlect.
-
- ## If you have any questions, please contact: Tiantian Feng ([email protected])

  ❌ **Out-of-Scope Use**
  - Clinical or diagnostic applications
  - Surveillance
- - Privacy-invasive applications
  - mozilla-foundation/common_voice_11_0
  language:
  - es
+ library_name: transformers
  license: openrail
  metrics:
  - accuracy

  - model_hub_mixin
  - pytorch_model_hub_mixin
  - speaker_dialect_classification
  ---

  # Whisper-Large v3 for Spanish Dialect Classification

  # Model Description
+ This model, based on OpenAI's Whisper-Large v3, is fine-tuned for Spanish dialect classification. It is part of **Voxlect**, a benchmark for modeling dialects and regional languages worldwide with speech foundation models. Voxlect evaluates a wide range of languages and dialects, using over 2 million training utterances from 30 publicly available speech corpora. This model provides the Spanish dialect classifier, with the labels listed below.

+ Paper: [Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe](https://arxiv.org/abs/2508.01691)
  Github repository: https://github.com/tiantiaf0627/voxlect

+ The included Spanish dialects are:
  ```
  [
+ "Andino-Pacífico",
+ "Caribe and Central",
  "Chileno",
+ "Mexican",
+ "Penisular",
+ "Rioplatense",
  ]
  ```

  ## Download repo
  ```bash
+ git clone git@github.com:tiantiaf0627/voxlect.git
  ```
  ## Install the package
  ```bash

  ```python
  # Label List
  dialect_list = [
+ "Andino-Pacífico",
+ "Caribe and Central",
  "Chileno",
+ "Mexican",
+ "Penisular",
+ "Rioplatense",
  ]
+
  # Load data, here just zeros as the example
  # Our training data filters out audio shorter than 3 seconds (unreliable predictions) and longer than 15 seconds (computation limitation)
  # So you need to prepare your audio to a maximum of 15 seconds, 16kHz and mono channel
  max_audio_length = 15 * 16000
  data = torch.zeros([1, 16000]).float().to(device)[:, :max_audio_length]
  logits, embeddings = model(data, return_feature=True)
+
  # Probability and output
  dialect_prob = F.softmax(logits, dim=1)
  print(dialect_list[torch.argmax(dialect_prob).detach().cpu().item()])
  ```
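The snippet above feeds zeros to the model; real audio must first be downmixed to mono, resampled to 16 kHz, and truncated to 15 seconds, as the comments require. A minimal sketch of such a helper (hypothetical, not part of the voxlect API; `F.interpolate` is used only as a crude, dependency-free stand-in for a proper resampler such as torchaudio's):

```python
import torch
import torch.nn.functional as F

TARGET_SR = 16000   # the model expects 16 kHz input
MAX_SECONDS = 15    # longer clips were excluded from training

def prepare_audio(waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
    """Downmix to mono, resample to 16 kHz, and truncate to 15 s.

    `waveform` has shape [channels, samples]. Linear interpolation is a
    rough resampler; prefer torchaudio.functional.resample in practice.
    """
    if waveform.dim() == 2 and waveform.size(0) > 1:
        waveform = waveform.mean(dim=0, keepdim=True)  # stereo -> mono
    if sample_rate != TARGET_SR:
        new_len = int(waveform.size(-1) * TARGET_SR / sample_rate)
        # F.interpolate needs a 3D [batch, channel, length] tensor
        waveform = F.interpolate(
            waveform.unsqueeze(0), size=new_len,
            mode="linear", align_corners=False,
        ).squeeze(0)
    return waveform[:, : MAX_SECONDS * TARGET_SR].float()
```

With a helper like this, the inference call becomes `logits, embeddings = model(prepare_audio(wav, sr).to(device), return_feature=True)`.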

+ # Responsible Use
+ Users should respect the privacy and consent of the data subjects, and adhere to the relevant laws and regulations in their jurisdictions when using Voxlect.

  ❌ **Out-of-Scope Use**
  - Clinical or diagnostic applications
  - Surveillance
+ - Privacy-invasive applications
+
+ # Citation
+ If you find our work useful or use these models in your work, kindly cite the following. We appreciate your recognition!
+ ```bibtex
+ @article{feng2025voxlect,
+   title={Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe},
+   author={Feng, Tiantian and Huang, Kevin and Xu, Anfeng and Shi, Xuan and Lertpetchpun, Thanathai and Lee, Jihwan and Lee, Yoonjeong and Byrd, Dani and Narayanan, Shrikanth},
+   journal={arXiv preprint arXiv:2508.01691},
+   year={2025}
+ }
+
+ @article{feng2025vox,
+   title={Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits},
+   author={Feng, Tiantian and Lee, Jihwan and Xu, Anfeng and Lee, Yoonjeong and Lertpetchpun, Thanathai and Shi, Xuan and Wang, Helin and Thebaud, Thomas and Moro-Velazquez, Laureano and Byrd, Dani and others},
+   journal={arXiv preprint arXiv:2505.14648},
+   year={2025}
+ }
+ ```
+
+ ## Contact
+ If you have any questions, please contact: Tiantian Feng ([email protected])