---
title: ROUGE
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for
  evaluating automatic summarization and machine translation software in natural language processing.
  The metrics compare an automatically produced summary or translation against a reference (human-produced) summary or translation, or a set of such references.
  Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters.
  This metric is a wrapper around the Google Research reimplementation of ROUGE:
  https://github.com/google-research/google-research/tree/master/rouge
---
# Metric Card for ROUGE

## Metric Description

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference (human-produced) summary or translation, or a set of such references.

Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters.

This metric is a wrapper around the [Google Research reimplementation of ROUGE](https://github.com/google-research/google-research/tree/master/rouge).
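To build intuition for what the scores measure, here is a minimal sketch of the unigram-overlap idea behind ROUGE-1. The `rouge1_f1` helper below is illustrative only, not the library's implementation:

```python
# Illustrative sketch only -- not the library's implementation.
# ROUGE-1 F1 is the harmonic mean of unigram precision and recall.
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the cat sat", "the cat sat on the mat"))  # ≈ 0.667
```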
## How to Use

At minimum, this metric takes as input a list of predictions and a list of references:
```python
>>> import evaluate
>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello there", "general kenobi"]
>>> references = ["hello there", "general kenobi"]
>>> results = rouge.compute(predictions=predictions,
...                         references=references)
>>> print(results)
{'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}
```
One can also pass a custom tokenizer, which is especially useful for non-Latin languages:
```python
>>> results = rouge.compute(predictions=predictions,
...                         references=references,
...                         tokenizer=lambda x: x.split())
>>> print(results)
{'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}
```
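For instance, for text written without whitespace word boundaries, a character-level tokenizer is one option (the example strings below are illustrative):

```python
>>> # Illustrative: character-level tokenization for text without whitespace
>>> # word boundaries (e.g. Chinese); list(x) splits a string into characters.
>>> results = rouge.compute(predictions=["你好吗"],
...                         references=["你好"],
...                         tokenizer=lambda x: list(x))
```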
It can also deal with lists of references for each prediction:
```python
>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello there", "general kenobi"]
>>> references = [["hello", "there"], ["general kenobi", "general yoda"]]
>>> results = rouge.compute(predictions=predictions,
...                         references=references)
>>> print(results)
{'rouge1': 0.8333, 'rouge2': 0.5, 'rougeL': 0.8333, 'rougeLsum': 0.8333}
```
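With multiple references, the best-matching reference determines the score (per our reading of the underlying scorer, the maximum f-measure over the references for each ROUGE type), so an exact match with any one reference yields a perfect score:

```python
>>> # Illustrative: the prediction matches the second reference exactly,
>>> # so the best-over-references ROUGE-1 score is 1.0.
>>> results = rouge.compute(predictions=["general kenobi"],
...                         references=[["general yoda", "general kenobi"]])
>>> print(results["rouge1"])
1.0
```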
### Inputs
- **predictions** (`list`): list of predictions to score. Each prediction should be a string with tokens separated by spaces.
- **references** (`list` or `list[list]`): list of references, either one reference per prediction or a list of several references per prediction. Each reference should be a string with tokens separated by spaces.
- **rouge_types** (`list`): A list of rouge types to calculate. Defaults to `['rouge1', 'rouge2', 'rougeL', 'rougeLsum']`.
    - Valid rouge types:
        - `"rouge1"`: unigram (1-gram) based scoring
        - `"rouge2"`: bigram (2-gram) based scoring
        - `"rougeL"`: longest common subsequence based scoring
        - `"rougeLsum"`: like `"rougeL"`, but splits the text on `"\n"` and scores at the summary level
        - See [here](https://github.com/huggingface/datasets/issues/617) for more information
- **use_aggregator** (`boolean`): If `True`, returns aggregate scores instead of one score per prediction. Defaults to `True`.
- **use_stemmer** (`boolean`): If `True`, uses the Porter stemmer to strip word suffixes, as in the sketch after this list. Defaults to `False`.
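A brief, illustrative sketch of `use_stemmer` (the example strings are made up; with stemming, inflected forms reduce to a common stem and can match):

```python
>>> # Illustrative: with use_stemmer=True, "running" and "runs" both stem
>>> # to "run", so they count as a unigram match for rouge1.
>>> results = rouge.compute(predictions=["he was running"],
...                         references=["he runs"],
...                         use_stemmer=True)
```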
### Output Values
The output is a dictionary with one entry for each rouge type in the input list `rouge_types`. If `use_aggregator=False`, each dictionary entry is a list of scores, with one score for each sentence. E.g. if `rouge_types=['rouge1', 'rouge2']` and `use_aggregator=False`, the output is:
```python
{'rouge1': [0.6666666666666666, 1.0], 'rouge2': [0.0, 1.0]}
```
If `rouge_types=['rouge1', 'rouge2']` and `use_aggregator=True`, the output is of the following format:
```python
{'rouge1': 1.0, 'rouge2': 1.0}
```
The ROUGE values are in the range of 0 to 1.
#### Values from Popular Papers

### Examples
An example without aggregation:
```python
>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello goodbye", "ankh morpork"]
>>> references = ["goodbye", "general kenobi"]
>>> results = rouge.compute(predictions=predictions,
...                         references=references,
...                         use_aggregator=False)
>>> print(list(results.keys()))
['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
>>> print(results["rouge1"])
[0.5, 0.0]
```
The same example, but with aggregation:
```python
>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello goodbye", "ankh morpork"]
>>> references = ["goodbye", "general kenobi"]
>>> results = rouge.compute(predictions=predictions,
...                         references=references,
...                         use_aggregator=True)
>>> print(list(results.keys()))
['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
>>> print(results["rouge1"])
0.25
```
The same example, but only calculating `rouge1`:
```python
>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello goodbye", "ankh morpork"]
>>> references = ["goodbye", "general kenobi"]
>>> results = rouge.compute(predictions=predictions,
...                         references=references,
...                         rouge_types=['rouge1'],
...                         use_aggregator=True)
>>> print(list(results.keys()))
['rouge1']
>>> print(results["rouge1"])
0.25
```
## Limitations and Bias
See [Schluter (2017)](https://aclanthology.org/E17-2007/) for an in-depth discussion of many of ROUGE's limits.
## Citation
```bibtex
@inproceedings{lin-2004-rouge,
    title = "{ROUGE}: A Package for Automatic Evaluation of Summaries",
    author = "Lin, Chin-Yew",
    booktitle = "Text Summarization Branches Out",
    month = jul,
    year = "2004",
    address = "Barcelona, Spain",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W04-1013",
    pages = "74--81",
}
```
## Further References
- This metric is a wrapper around the [Google Research reimplementation of ROUGE](https://github.com/google-research/google-research/tree/master/rouge)