Speech Recognition Community Event Version 2

non-profit

Activity Feed

AI & ML interests

Multi-Lingual Speech Recognition

Recent Activity

gagan3012 authored a paper 4 days ago

Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning

PereLluis13 authored a paper about 1 month ago

Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation

gigant authored a paper about 1 month ago

BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing

speech-recognition-community-v2's activity

nguyenvulebinh

authored a paper 3 months ago

MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models

Paper • 2411.18152 • Published Nov 27, 2024

FremyCompany

posted an update 5 months ago

🔀 Very cool demo of word-level alignment of paraphrased or cross-lingual sentences, from the new Fairly Multilingual ModernBERT embedding model:

Parallia/Fairly-Multilingual-ModernBERT-Token-Alignment

Rolv-Arild

authored a paper 5 months ago

The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective

Paper • 2412.09460 • Published Dec 12, 2024 • 8

versae

authored a paper 6 months ago

The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective

Paper • 2412.09460 • Published Dec 12, 2024 • 8

pere

authored 4 papers 8 months ago

Operationalizing a National Digital Library: The Case for a Norwegian Transformer Model

Paper • 2104.09617 • Published Apr 19, 2021

Boosting Norwegian Automatic Speech Recognition

Paper • 2307.01672 • Published Jul 4, 2023 • 1

Whispering in Norwegian: Navigating Orthographic and Dialectic Challenges

Paper • 2402.01917 • Published Feb 2, 2024

COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter

Paper • 2005.07503 • Published May 15, 2020

versae

authored a paper 9 months ago

Whispering in Norwegian: Navigating Orthographic and Dialectic Challenges

Paper • 2402.01917 • Published Feb 2, 2024

nguyenvulebinh

authored a paper 10 months ago

Convoifilter: A case study of doing cocktail party speech recognition

Paper • 2308.11380 • Published Aug 22, 2023 • 1

flozi00

posted an update 12 months ago

🌟 Progress in the German FineWeb-Edu reproduction 🌟

We're delighted to share the launch of our new Data Quality Classification Model, designed specifically for evaluating educational content in German. This tool uses advanced machine learning techniques to assess texts across all educational levels, from primary school to university.

🔍 Inspired by Hugging Face's FineWeb-Edu dataset, we've worked hard to refine data classification methods so that educators and learners can access top-quality resources.
We're excited about the future as we continue improving our models and expanding our datasets.

Access the model here: pL-Community/GermanEduScorer-Qwen2-1.5b

🙏 A huge thank you to David and Daryoush from Vago Solutions, and to Björn and Jan from Ellamind / DiscoResearch, for their expert insights throughout this project. Your support has been crucial.
This project was made possible by the support of PrimeLine AI.

2 replies

FremyCompany

posted an update about 1 year ago

Today, April 26, is the Day of the Tatar Language! 🌟
To celebrate, we release our new language model, Tweety Tatar 🐣

https://huggingface.co/Tweeties/tweety-tatar-base-7b-2024-v1

The model was converted from Mistral Instruct v0.2 using a novel technique called trans-tokenization. As a result, the model uses a brand-new tokenizer, fully tailored for the Tatar language.

We also release a model that can be finetuned for translation from English or Russian into Tatar, and that achieves performance similar to commercial offerings:

https://huggingface.co/Tweeties/tweety-tatar-hydra-base-7b-2024-v1

More details in our upcoming paper 👀
François REMY, Pieter Delobelle, Alfiya Khabibullina

Татар теле көне белән! (Happy Tatar Language Day!)

3 replies

manandey

authored a paper about 1 year ago

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

Paper • 2303.03915 • Published Mar 7, 2023 • 7

sanchit-gandhi

posted an update about 1 year ago

Why does returning timestamps help Whisper reduce hallucinations? 🧐

Empirically, most practitioners have found that setting return_timestamps=True helps reduce hallucinations, particularly when doing long-form evaluation with Transformers’ “chunked” algorithm.
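As a rough sketch of the setup described above, the snippet below runs Whisper through the Transformers `pipeline` with chunking enabled and `return_timestamps=True`. The checkpoint name `openai/whisper-small`, the 30-second chunk length, and the helper names `transcribe_with_timestamps` and `format_chunks` are assumptions chosen for illustration, not something specified in the post.

```python
def transcribe_with_timestamps(audio_path: str,
                               model_id: str = "openai/whisper-small") -> dict:
    """Transcribe a long audio file with segment-level timestamps.

    The model id and chunk length are illustrative assumptions.
    """
    # Imported lazily so that defining this function does not require
    # transformers to be installed or a model to be downloaded.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # enables the "chunked" long-form algorithm
    )
    # return_timestamps=True asks Whisper to predict segment-level timestamps.
    return asr(audio_path, return_timestamps=True)


def format_chunks(result: dict) -> list[str]:
    """Render each timestamped chunk of a pipeline result as one line."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        lines.append(f"[{start} -> {end}] {chunk['text'].strip()}")
    return lines
```

Calling `format_chunks(transcribe_with_timestamps("audio.mp3"))` (with a hypothetical file name) would list each segment with its predicted start and end time.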

But why does this work?

My interpretation is that forcing the model to predict timestamps works against hallucinations. Suppose you have the transcription:

The cat sat on the on the on the mat.

Here we have a repeated hallucination of “on the”. If we ask the model to predict timestamps, then each “on the” has to contribute to the overall segment-level timing, e.g.:

<|0.00|> The cat sat on the on the on the mat.<|5.02|>

However, it’s impossible to fit 3 copies of “on the” within the time allocated to the segment, so the probability of this hallucinatory sequence becomes lower, and the model instead predicts the correct transcription with the highest probability:

<|0.00|> The cat sat on the mat.<|5.02|>

In this sense, the end timestamp constraint is the opposite of the initial timestamp constraint described in Section 4.5 of the paper Robust Speech Recognition via Large-Scale Weak Supervision (2212.04356): it helps the model remove extra words at the end of the sequence (whereas the initial timestamp helps when the model ignores words at the start). The overall principle is the same in both cases: timestamps raise the probability of more realistic sequences.
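To make the timing intuition concrete, here is a toy sketch that parses the two timestamped segments from the example and compares the speaking rate each one implies. It does not model Whisper's actual probabilities. The helper `speaking_rate` and the regular expression are hypothetical, written only for this illustration.

```python
import re

# Matches Whisper-style timestamp tokens such as <|0.00|> or <|5.02|>.
TIMESTAMP = re.compile(r"<\|(\d+\.\d+)\|>")


def speaking_rate(segment: str) -> float:
    """Words per second implied by a segment with a start and end timestamp."""
    times = [float(t) for t in TIMESTAMP.findall(segment)]
    if len(times) != 2 or times[1] <= times[0]:
        raise ValueError("expected one start and one later end timestamp")
    # Drop the timestamp tokens and count whitespace-separated words.
    text = TIMESTAMP.sub("", segment)
    return len(text.split()) / (times[1] - times[0])


hallucinated = "<|0.00|> The cat sat on the on the on the mat.<|5.02|>"
correct = "<|0.00|> The cat sat on the mat.<|5.02|>"

# With the end timestamp fixed, the repeated words must be squeezed into
# the same duration, so the hallucinated segment implies a faster rate.
print(f"hallucinated: {speaking_rate(hallucinated):.2f} words/s")
print(f"correct:      {speaking_rate(correct):.2f} words/s")
```

With the end timestamp held at 5.02 seconds, the hallucinated segment has to pack 10 words into the span that the correct segment fills with 6, which is the mismatch the argument above relies on.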

Leaving it open to you: why do you think timestamps reduce Whisper hallucinations?

11 replies

anantoj

authored a paper about 1 year ago

NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural

Paper • 2403.01817 • Published Mar 4, 2024 • 4

manandey

authored a paper about 1 year ago

StarCoder 2 and The Stack v2: The Next Generation

Paper • 2402.19173 • Published Feb 29, 2024 • 146

FremyCompany

posted an update over 1 year ago

🔥 What's that biomedical model that got 170,763 downloads last month on HuggingFace?! Well, the paper is finally published! #BioLORD

📰 Read our article in the Journal of the American Medical Informatics Association:
https://academic.oup.com/jamia/advance-article/doi/10.1093/jamia/ocae029/7614965

📝 TLDR: BioLORD-2023 is a series of semantic language models for the biomedical domain, capable of representing clinical concepts and sentences in a semantic space aligned with human preferences. Our new multilingual version supports 50+ languages and is further finetuned on 7 European languages. These models were trained contrastively and through distillation, using a corpus that unifies, in the same latent space, the names of biomedical concepts and their descriptions. For concepts that didn't have a human-written description in UMLS, we use information contained in the SnomedCT knowledge graph and the capabilities of ChatGPT to generate synthetic data and improve our results.

🤗 Access our models from the HuggingFace hub, including the new 2023-C and 2023-S variants:
FremyCompany/BioLORD-2023
FremyCompany/BioLORD-2023-M
FremyCompany/BioLORD-2023-S
FremyCompany/BioLORD-2023-C