HiTZ zentroa

non-profit

https://www.hitz.eus/

hitz_zentroa

hitz-zentroa

AI & ML interests

Natural Language Processing, Signal Processing

Recent Activity

nperez updated a dataset 3 days ago

HiTZ/latxa-corpus-v2

nperez updated a dataset 3 days ago

HiTZ/latxa-corpus-v1.1

nineunaiz updated a dataset 5 days ago

HiTZ/safety-SeGa

Papers

MEGConformer: Conformer-Based MEG Decoder for Robust Speech and Phoneme Classification

Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages

HiTZ's collections (39)

Latxa Instruct

Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque

Paper • 2506.07597 • Published Jun 9, 2025
HiTZ/Latxa-Llama-3.1-8B-Instruct

Text Generation • 8B • Updated Dec 15, 2025 • 3.12k • 11
HiTZ/Latxa-Llama-3.1-70B-Instruct

Text Generation • 71B • Updated Jun 12, 2025 • 373 • 5
HiTZ/Latxa-Llama-3.1-70B-Instruct-FP8

Text Generation • 71B • Updated Jun 12, 2025 • 26 • 1

Latxa VL

Multilingual multimodal instruct models

HiTZ/Latxa-Qwen3-VL-2B-Instruct

Image-Text-to-Text • 2B • Updated Dec 15, 2025 • 300
HiTZ/Latxa-Qwen3-VL-4B-Instruct

Image-Text-to-Text • 4B • Updated Dec 15, 2025 • 424 • 3

ASR Datasets

Datasets for training and benchmarking ASR in Basque, Spanish, and bilingual Basque-Spanish

HiTZ/composite_corpus_eseu_v1.0

Viewer • Updated May 12, 2025 • 742k • 312 • 2
HiTZ/composite_corpus_eu_v2.1

Viewer • Updated Dec 19, 2024 • 407k • 41 • 2
HiTZ/composite_corpus_es_v1.0

Viewer • Updated May 12, 2025 • 526k • 127
HiTZ/benchmark_eseu_testsets

Updated Apr 19, 2025 • 53

Nvidia NeMo

Nvidia NeMo STT models

HiTZ/stt_eu_conformer_transducer_large

Automatic Speech Recognition • Updated Nov 28, 2025 • 17 • 2
HiTZ/stt_eu_conformer_ctc_large

Automatic Speech Recognition • Updated Nov 28, 2025 • 12 • 2
HiTZ/stt_eseu_conformer_transducer_large

Automatic Speech Recognition • Updated Nov 28, 2025 • 32
HiTZ/stt_eu_conformer_transducer_large_v2

Automatic Speech Recognition • Updated 5 days ago • 32 • 1

Whisper

Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages

Paper • 2503.23542 • Published Mar 30, 2025 • 9
HiTZ/whisper-lm-ngrams

Automatic Speech Recognition • Updated Apr 4, 2025
HiTZ/whisper-tiny-eu

Updated Dec 16, 2025 • 3
HiTZ/whisper-small-eu

Updated Dec 16, 2025 • 16

Multilingual TruthfulQA

Truth Knows No Language: Evaluating Truthfulness Beyond English

Paper • 2502.09387 • Published Feb 13, 2025 • 1
HiTZ/truthfulqa-multi

Viewer • Updated May 21, 2025 • 4.12k • 280 • 1
HiTZ/truthfulqa-multi-MT

Viewer • Updated May 22, 2025 • 4.12k • 5
HiTZ/truthful_judge

Viewer • Updated May 22, 2025 • 135k • 15

Ask2Transformers

Ask2Transformers models

Ask2Transformers: Zero-Shot Domain labelling with Pre-trained Language Models

Paper • 2101.02661 • Published Jan 7, 2021
Label Verbalization and Entailment for Effective Zero- and Few-Shot Relation Extraction

Paper • 2109.03659 • Published Sep 8, 2021
Textual Entailment for Event Argument Extraction: Zero- and Few-Shot with Multi-Source Learning

Paper • 2205.01376 • Published May 3, 2022
ZS4IE: A toolkit for Zero-Shot Information Extraction with simple Verbalizations

Paper • 2203.13602 • Published Mar 25, 2022 • 1

MATE

Vision-Language Models Struggle to Align Entities across Modalities

Paper • 2503.03854 • Published Mar 5, 2025 • 1
HiTZ/MATE

Viewer • Updated May 29, 2025 • 11k • 61

BERnaT

Basque Encoders for Representing Natural Textual Diversity

HiTZ/BERnaT-base

Fill-Mask • 0.1B • Updated about 1 month ago • 8 • 1
HiTZ/BERnaT-medium

Fill-Mask • 51.4M • Updated about 1 month ago • 10 • 1
HiTZ/BERnaT-large

Fill-Mask • 0.4B • Updated about 1 month ago • 17 • 1
HiTZ/BERnaT-base-NERC

Token Classification • 0.1B • Updated Mar 16, 2025

Lemmatization

On the Role of Morphological Information for Contextual Lemmatization

Paper • 2302.00407 • Published Feb 1, 2023
HiTZ/xlm-roberta-large-lemma-eu

Token Classification • Updated Jun 24, 2024 • 33
HiTZ/xlm-roberta-large-lemma-en

Token Classification • Updated Jun 24, 2024 • 1
HiTZ/xlm-roberta-large-lemma-tr

Token Classification • Updated Jun 24, 2024 • 1

Evaluation Datasets

Basque Evaluation Datasets

HiTZ/This-is-not-a-dataset

Viewer • Updated Feb 23, 2024 • 381k • 85 • 6
HiTZ/EusProficiency

Viewer • Updated Apr 1, 2024 • 5.17k • 509 • 2
HiTZ/EusReading

Viewer • Updated Apr 1, 2024 • 352 • 529 • 2
HiTZ/EusTrivia

Viewer • Updated Apr 1, 2024 • 1.72k • 512 • 1

Basque Encoders

Basque Encoder Language Models

ixa-ehu/roberta-eus-euscrawl-large-cased

Fill-Mask • 0.4B • Updated Sep 11, 2023 • 4 • 3
ixa-ehu/roberta-eus-euscrawl-base-cased

Fill-Mask • Updated Mar 16, 2022 • 15 • 2
ixa-ehu/roberta-eus-cc100-base-cased

Fill-Mask • 0.2B • Updated Sep 11, 2023 • 10 • 1
ixa-ehu/roberta-eus-mc4-base-cased

Fill-Mask • Updated Mar 16, 2022 • 11 • 1

Composite Corpus

HiTZ/composite_corpus_eseu_v1.0

Viewer • Updated May 12, 2025 • 742k • 312 • 2
HiTZ/composite_corpus_eu_v2.1

Viewer • Updated Dec 19, 2024 • 407k • 41 • 2
HiTZ/composite_corpus_es_v1.0

Viewer • Updated May 12, 2025 • 526k • 127

Lessons in Evaluation of Spanish Encoder-only Models

State-of-the-art encoder-only models for Spanish, from the paper "Lessons learned from the evaluation of Spanish Language Models".

HiTZ/xlm-roberta-large-xnli-es

Text Classification • Updated Mar 8, 2024
Lessons learned from the evaluation of Spanish Language Models

Paper • 2212.08390 • Published Dec 16, 2022

This is not a dataset

A Large Negation Benchmark to Challenge Large Language Models

This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models

Paper • 2310.15941 • Published Oct 24, 2023 • 6
HiTZ/This-is-not-a-dataset

Viewer • Updated Feb 23, 2024 • 381k • 85 • 6

CONAN-EUS: Counternarrative Generation in Basque and Spanish

Counternarrative Generation in Basque and Spanish

Basque and Spanish Counter Narrative Generation: Data Creation and Evaluation

Paper • 2403.09159 • Published Mar 14, 2024
HiTZ/CONAN-EUS

Viewer • Updated Mar 15, 2024 • 33.2k • 133
HiTZ/mt5-counter-narrative-eu

Text Generation • Updated Mar 15, 2024 • 5
HiTZ/mt5-counter-narrative-es

Text Generation • Updated Mar 15, 2024 • 8

BERTeus

Give your Text Representation Models some Love: the Case for Basque

Paper • 2004.00033 • Published Mar 31, 2020
ixa-ehu/berteus-base-cased

Feature Extraction • 0.1B • Updated Sep 11, 2023 • 35 • 5

Antidote Project

Data and models generated within the Antidote Project (https://univ-cotedazur.eu/antidote)

HiTZ@Antidote: Argumentation-driven Explainable Artificial Intelligence for Digital Medicine

Paper • 2306.06029 • Published Jun 9, 2023
Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain

Paper • 2404.07613 • Published Apr 11, 2024
HiTZ/casimedicos-exp

Viewer • Updated Mar 23, 2024 • 2.49k • 100 • 3
HiTZ/casimedicos-squad

Preview • Updated Apr 14, 2024 • 12 • 1

XNLIeu

XNLIeu: a dataset for cross-lingual NLI in Basque

Paper • 2404.06996 • Published Apr 10, 2024
HiTZ/xnli-eu

Viewer • Updated Jul 17, 2025 • 801k • 160

Medical MT

HiTZ/medical_enes-eu

Updated Jun 27, 2024 • 3
HiTZ/medical_en-eu

Updated Jun 27, 2024
HiTZ/medical_es-eu

Updated May 15, 2025 • 1

TTS

HiTZ/TTS-gl_brais

Text-to-Speech • Updated Dec 16, 2025 • 2
HiTZ/TTS-gl_sabela

Text-to-Speech • Updated Dec 16, 2025 • 2
HiTZ/TTS-eu_antton

Text-to-Speech • Updated Dec 16, 2025 • 2
HiTZ/TTS-eu_maider

Text-to-Speech • Updated Dec 16, 2025 • 11

Cap&Punct

MarianMT-based models for translation tasks

HiTZ/cap-punct-eu

Translation • 76.9M • Updated Jan 13 • 14
HiTZ/cap-punct-es

Translation • 76.9M • Updated Jan 13 • 6

Pyannote

Diarization models for VAD and Speaker Recognition

HiTZ/pyannote-segmentation-3.0-RTVE

Automatic Speech Recognition • Updated Nov 13, 2025 • 1

Speech Collection

STT models, diarization models, and datasets for training ASR in Spanish, Basque, and bilingual Basque-Spanish

Nvidia NeMo

Collection

Nvidia NeMo STT models • 5 items • Updated about 1 month ago
Whisper

Collection

30 items • Updated about 1 month ago
Pyannote

Collection

Diarization models for VAD and Speaker Recognition • 1 item • Updated about 1 month ago
ASR Datasets

Collection

Datasets for training and benchmarking ASR in Basque, Spanish, and bilingual Basque-Spanish • 4 items • Updated about 1 month ago

Latxa

Latxa: An Open Language Model and Evaluation Suite for Basque

Paper • 2403.20266 • Published Mar 29, 2024 • 4
HiTZ/latxa-7b-v1.2

Text Generation • Updated Jul 2, 2024 • 42 • 6
HiTZ/latxa-13b-v1.2

Text Generation • Updated Jul 2, 2024 • 6 • 2
HiTZ/latxa-70b-v1.2

Text Generation • Updated Jul 3, 2024 • 105

GoLLIE

GoLLIE is a large language model trained to follow annotation guidelines; it outperforms previous approaches on zero-shot information extraction (IE).

GoLLIE: Annotation Guidelines improve Zero-Shot Information-Extraction

Paper • 2310.03668 • Published Oct 5, 2023 • 1
HiTZ/GoLLIE-7B

Text Generation • Updated Oct 10, 2023 • 988 • 29
HiTZ/GoLLIE-13B

Text Generation • Updated Oct 20, 2023 • 122 • 7
HiTZ/GoLLIE-34B

Text Generation • Updated Oct 20, 2023 • 196 • 38

Metaphor Processing

Datasets and models for metaphor detection and interpretation via NLI in Spanish and English

Leveraging a New Spanish Corpus for Multilingual and Crosslingual Metaphor Detection

Paper • 2210.10358 • Published Oct 19, 2022
HiTZ/cometa

Viewer • Updated Apr 15, 2024 • 3.63k • 62
HiTZ/xlm-roberta-large-metaphor-detection-es

Token Classification • Updated Feb 26, 2024 • 1
HiTZ/mdeberta-base-metaphor-detection-es

Token Classification • Updated Feb 26, 2024 • 3

EusCrawl

Does Corpus Quality Really Matter for Low-Resource Languages?

Paper • 2203.08111 • Published Mar 15, 2022
HiTZ/euscrawl

Updated Feb 14, 2023 • 39 • 4
ixa-ehu/roberta-eus-euscrawl-large-cased

Fill-Mask • 0.4B • Updated Sep 11, 2023 • 4 • 3
ixa-ehu/roberta-eus-euscrawl-base-cased

Fill-Mask • Updated Mar 16, 2022 • 15 • 2

Alpaca LoRA MT

Alpaca LoRA MT models and dataset

HiTZ/alpaca-lora-7b-en-pt-es-ca-eu-gl-at

Updated Mar 24, 2023 • 1
HiTZ/alpaca-lora-13b-en-pt-es-ca-eu-gl-at

Updated Mar 25, 2023
HiTZ/alpaca-lora-30b-en-pt-es-ca-eu-gl-at

Updated Mar 25, 2023
HiTZ/alpaca-lora-65b-en-pt-es-ca

Updated Apr 2, 2023 • 2

Pretraining Datasets

Basque Pretraining Datasets

HiTZ/latxa-corpus-v1.1

Viewer • Updated 3 days ago • 4.13M • 114 • 1
HiTZ/euscrawl

Updated Feb 14, 2023 • 39 • 4
orai-nlp/ZelaiHandi

Viewer • Updated May 19, 2025 • 2.25M • 66 • 9

Instruction Datasets

Basque Instruction Datasets

HiTZ/alpaca_mt

Updated Apr 7, 2023 • 87 • 9
OpenAssistant/oasst1

Viewer • Updated May 2, 2023 • 88.8k • 10.5k • 1.48k
CohereLabs/aya_dataset

Viewer • Updated Apr 15, 2025 • 206k • 3.61k • 334
CohereLabs/aya_collection

Viewer • Updated Apr 15, 2025 • 514M • 2.91k • 229

OPT RM

OPT reward models

Training Language Models with Language Feedback at Scale

Paper • 2303.16755 • Published Mar 28, 2023 • 1
HiTZ/lmloss-opt-rm-1.3b

Text Generation • Updated Apr 7, 2023 • 3
HiTZ/rmloss-opt-rm-13b

Text Generation • Updated Apr 7, 2023 • 2

Medical-mT5

An open-source text-to-text multilingual model for the medical domain.

HiTZ/Medical-mT5-large

Text Generation • 1B • Updated Apr 12, 2024 • 167 • 23
HiTZ/Medical-mT5-xl

Text Generation • Updated Apr 12, 2024 • 19 • 4
HiTZ/Medical-mT5-large-multitask

Text Generation • 1B • Updated May 6, 2024 • 30
HiTZ/Medical-mT5-xl-multitask

Text Generation • 4B • Updated Apr 12, 2024 • 47 • 2

BasqueParl

A Bilingual Corpus of Basque Parliamentary Transcriptions

BasqueParl: A Bilingual Corpus of Basque Parliamentary Transcriptions

Paper • 2205.01506 • Published May 3, 2022
HiTZ/basqueparl

Viewer • Updated Mar 8, 2024 • 343k • 25 • 1

Speech to Text

Basque Speech to Text models

Demo Basque ASR

Transcribe speech from an audio file
HiTZ/stt_eu_conformer_ctc_large

Automatic Speech Recognition • Updated Nov 28, 2025 • 12 • 2
HiTZ/stt_eu_conformer_transducer_large

Automatic Speech Recognition • Updated Nov 28, 2025 • 17 • 2
Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages

Paper • 2503.23542 • Published Mar 30, 2025 • 9

EriBERTa

HiTZ/EriBERTa-base

Fill-Mask • 0.1B • Updated Jul 1, 2025 • 247 • 3
HiTZ/Multilingual-Medical-Corpus

Viewer • Updated Apr 12, 2024 • 67.4M • 846 • 43

IXAmBERT

Conversational Question Answering in Low Resource Scenarios: A Dataset and Case Study for Basque

ixa-ehu/ixambert-base-cased

Updated Jan 7, 2023 • 19 • 3
ixa-hitz/elkarhizketak

Updated Jan 18, 2024 • 18 • 1

Machine Translation

HiTZ/mt-hitz-en-eu

Updated Jun 17, 2024 • 54 • 3
HiTZ/mt-hitz-es-eu

Updated Jun 17, 2024 • 32
HiTZ/mt-hitz-eu-en

Updated Jun 25, 2024
HiTZ/mt-hitz-gl-eu

Updated Jun 17, 2024 • 1

Odesia Challenge 2024

IXA Submission for the 2024 ODESIA Challenge

HiTZ/Qwen2.5-14B-Instruct_ODESIA

Text Generation • 15B • Updated Feb 4, 2025
HiTZ/Hermes-3-Llama-3.1-8B_ODESIA

Text Generation • 8B • Updated Sep 18, 2024 • 2
HiTZ/gemma-2b-it_ODESIA

Text Generation • 3B • Updated Sep 20, 2024 • 3