	---
	language: cs
	license: cc-by-nc-sa-4.0
	tags:
	- Czech
	- GEC
	- GECCC dataset
	pipeline_tag: text-generation
	library_name: transformers
	base_model: google/byt5-base
	---

	# Model Card for byt5-base-geccc-mate

	The `byt5-base-geccc-mate` model is a sequence-to-sequence model for
	grammatical error correction (GEC) in Czech, described in the paper
	[Refining Czech GEC: Insights from a Multi-Experiment Approach](https://arxiv.org/abs/2506.22402).
	It is a version of [byt5-base](https://huggingface.co/google/byt5-base) finetuned using
	the MATE method and the [GECCC dataset](https://hdl.handle.net/11234/1-4861).

	## Model Description

	- Developed by: [Seznam.cz](https://seznam.cz) and [Charles University, MFF, ÚFAL](https://ufal.mff.cuni.cz/)
	- Language(s) (NLP): Czech
	- Model type: byte-level encoder-decoder Transformer model (ByT5 operates on UTF-8 bytes rather than subword tokens)
	- Finetuned from model: `google/byt5-base`
	- Finetuned on:
	  - first, synthetic errors generated by the MATE method (see [the paper](https://arxiv.org/abs/2506.22402))
	  - then, the [GECCC dataset](https://hdl.handle.net/11234/1-4861)
	- License: CC BY-NC-SA 4.0

	## Model Sources

	- Repository: https://github.com/ufal/tsd2025-gec
	- Paper: [Refining Czech GEC: Insights from a Multi-Experiment Approach](https://arxiv.org/abs/2506.22402)
	- Dataset: [GECCC dataset](https://hdl.handle.net/11234/1-4861)

	## Evaluation

	<div align="center">
	<img src="https://github.com/ufal/tsd2025-gec/blob/main/figures/bubble_chart.svg?raw=true" width="75%" alt="Performance bubblechart" />
	</div>

	| Model | Parameters | GECCC F-0.5 score | AKCES F-0.5 score |
	|:------|-----------:|:-----------------:|:-----------------:|
	| [byt5-small-geccc-mate](https://hf.co/ufal/byt5-small-geccc-mate) | 300M | 72.56 | |
	| [byt5-base-geccc-mate](https://hf.co/ufal/byt5-base-geccc-mate) | 582M | 75.15 | |
	| [byt5-large-geccc-mate](https://hf.co/ufal/byt5-large-geccc-mate) | 1275M | 77.01 | |
	| [byt5-large-akces-mate](https://hf.co/ufal/byt5-large-akces-mate) | 1275M | | 84.40 |
	| [transformer-base-geccc-mate](https://hf.co/ufal/transformer-base-geccc-mate) | 65M | 73.73 | |
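
For readers unfamiliar with the metric in the table: F-0.5 is the standard F-beta score with beta = 0.5, which weights precision more heavily than recall. The sketch below shows the general formula only. The function name `f_beta` and the example values are illustrative assumptions, not results from the paper, and how precision and recall are counted over corrections depends on the evaluation tooling, which is not described here.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Weighted harmonic mean of precision and recall (F-beta).

    Hypothetical helper for illustration; with beta < 1, precision
    counts for more than recall.
    """
    denom = beta**2 * precision + recall
    if denom == 0.0:
        # Convention: the score is 0 when both precision and recall are 0.
        return 0.0
    return (1 + beta**2) * precision * recall / denom


# Illustrative values only, not taken from the evaluation above.
print(f_beta(precision=0.8, recall=0.5))
```

The scores in the table appear to be reported as percentages, so a value returned by this function would be multiplied by 100 to be comparable.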

	## Uses

	The model can be used directly: it takes space-tokenized Czech text as input and produces grammar-corrected Czech text.

	## How to Get Started with the Model

	Use the code below to get started with the model. Note that the input must be space-tokenized, i.e., the tokens (as produced by the [UDPipe 1](https://ufal.mff.cuni.cz/udpipe/1) tokenizer [czech-pdt-ud-2.5-191206.udpipe](https://hdl.handle.net/11234/1-3131)) must be separated by spaces.

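	```python
	import transformers

	# Load the tokenizer and the finetuned sequence-to-sequence model.
	tokenizer = transformers.AutoTokenizer.from_pretrained("ufal/byt5-base-geccc-mate")
	model = transformers.AutoModelForSeq2SeqLM.from_pretrained("ufal/byt5-base-geccc-mate")

	# The input is already space-tokenized (note the space before the final period).
	# padding=True lets the same call handle several sentences of different lengths.
	batch = tokenizer(["Sveřepý šakali zavile vyly na býlí mesýc ."], padding=True, return_tensors="pt")
	# Generate corrections with beam search; **batch passes input_ids and attention_mask.
	outputs = model.generate(**batch, max_length=256, num_beams=4)

	print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
	```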
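
If raw, untokenized text is all that is available, the space-tokenization requirement can be roughly approximated as sketched below. This is a minimal sketch under stated assumptions: `space_tokenize` is a hypothetical helper that is not part of the model's repository, and it uses a simple regular expression instead of the UDPipe 1 tokenizer named above. Its output may differ from UDPipe's on abbreviations, numbers, URLs, and similar cases, so the UDPipe tokenizer should be preferred when matching the training setup matters.

```python
import re

# Hypothetical approximation, NOT the UDPipe 1 tokenizer: a token is either
# a run of word characters or a single non-space, non-word character.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def space_tokenize(text: str) -> str:
    """Return the text with words and punctuation separated by single spaces."""
    return " ".join(_TOKEN_RE.findall(text))


print(space_tokenize("Sveřepý šakali zavile vyly na býlí mesýc."))
```

The returned string can then be passed to the tokenizer call in place of the hard-coded example sentence.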

	## BibTeX Citation

	```
	@InProceedings{10.1007/978-3-032-02551-7_7,
	author="Pechman, Petr and Straka, Milan and Strakov{\'a}, Jana and N{\'a}plava, Jakub",
	editor="Ek{\v{s}}tein, Kamil and Konop{\'i}k, Miloslav and Pra{\v{z}}{\'a}k, Ond{\v{r}}ej and P{\'a}rtl, Franti{\v{s}}ek",
	title="Refining Czech GEC: Insights from a Multi-experiment Approach",
	booktitle="Text, Speech, and Dialogue",
	year="2026",
	publisher="Springer Nature Switzerland",
	address="Cham",
	pages="64--76",
	isbn="978-3-032-02551-7",
	doi="10.1007/978-3-032-02551-7_7"
	}
	```