Initial commit with adapted deliverables from Clarin: http://hdl.handle.net/20.500.12537/301

	## A Universal Dependency parser built on top of a Transformer language model

	Score on pre-tokenized test data:

	```
	Metric     | Precision |    Recall |  F1 Score | AligndAcc
	-----------+-----------+-----------+-----------+-----------
	Tokens     |     99.70 |     99.77 |     99.73 |
	Sentences  |    100.00 |    100.00 |    100.00 |
	Words      |     99.62 |     99.61 |     99.61 |
	UPOS       |     96.99 |     96.97 |     96.98 |     97.36
	XPOS       |     93.65 |     93.64 |     93.65 |     94.01
	UFeats     |     91.31 |     91.29 |     91.30 |     91.65
	AllTags    |     86.86 |     86.85 |     86.86 |     87.19
	Lemmas     |     95.83 |     95.81 |     95.82 |     96.19
	UAS        |     89.01 |     89.00 |     89.00 |     89.35
	LAS        |     85.72 |     85.70 |     85.71 |     86.04
	CLAS       |     81.39 |     80.91 |     81.15 |     81.34
	MLAS       |     69.21 |     68.81 |     69.01 |     69.17
	BLEX       |     77.44 |     76.99 |     77.22 |     77.40
	```
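
The F1 Score column is the harmonic mean of the Precision and Recall columns. As a quick sanity check, the sketch below recomputes it for a few rows copied from the pre-tokenized table. The `f1_score` helper is illustrative and not part of the evaluation script; because the table values are already rounded, a recomputed figure may differ from the reported one in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the pre-tokenized table
rows = {
    "Tokens": (99.70, 99.77, 99.73),
    "LAS": (85.72, 85.70, 85.71),
    "CLAS": (81.39, 80.91, 81.15),
}

for metric, (p, r, reported) in rows.items():
    print(f"{metric:<6} recomputed={f1_score(p, r):.2f} reported={reported:.2f}")
```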


	Score on untokenized test data:

	```
	Metric     | Precision |    Recall |  F1 Score | AligndAcc
	-----------+-----------+-----------+-----------+-----------
	Tokens     |     99.50 |     99.66 |     99.58 |
	Sentences  |    100.00 |    100.00 |    100.00 |
	Words      |     99.42 |     99.50 |     99.46 |
	UPOS       |     96.80 |     96.88 |     96.84 |     97.37
	XPOS       |     93.48 |     93.56 |     93.52 |     94.03
	UFeats     |     91.13 |     91.20 |     91.16 |     91.66
	AllTags    |     86.71 |     86.78 |     86.75 |     87.22
	Lemmas     |     95.66 |     95.74 |     95.70 |     96.22
	UAS        |     88.76 |     88.83 |     88.80 |     89.28
	LAS        |     85.49 |     85.55 |     85.52 |     85.99
	CLAS       |     81.19 |     80.73 |     80.96 |     81.31
	MLAS       |     69.06 |     68.67 |     68.87 |     69.16
	BLEX       |     77.28 |     76.84 |     77.06 |     77.39
	```

	To use the model, you need to set up COMBO, which makes it possible to use word embeddings from a pre-trained transformer model ([electra-base-igc-is](https://huggingface.co/Icelandic-lt/electra-base-igc-is)).

	```bash
	git submodule update --init --recursive
	pip install -U pip setuptools wheel
	pip install --index-url https://pypi.clarin-pl.eu/simple combo==1.0.5
	```

	* For Python 3.9, you might need to install Cython:

	```bash
	pip install -U pip cython
	```

	* Then run the model as shown in `parse_file.py`.

	For more instructions, see the COMBO repository: https://gitlab.clarin-pl.eu/syntactic-tools/combo
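
For orientation, here is a minimal sketch of loading a COMBO model and printing its predictions, based on the usage shown in the COMBO documentation (`COMBO.from_pretrained` and the `tokens` attribute of a parsed sentence). It is not a copy of `parse_file.py`, which remains the reference for this repository. The helper names `load_parser`, `token_row` and `print_parse` are hypothetical, and passing raw text assumes COMBO's default tokenizer rather than the bundled Tokenizer.

```python
def load_parser(model_path: str):
    """Load a COMBO model from a local archive (requires the `combo` package)."""
    # Imported lazily so the formatting helper below works without COMBO installed.
    from combo.predict import COMBO
    return COMBO.from_pretrained(model_path)


def token_row(token) -> str:
    """Format one predicted token as a tab-separated row."""
    fields = (token.id, token.token, token.lemma, token.upostag, token.head, token.deprel)
    return "\t".join(str(field) for field in fields)


def print_parse(nlp, text: str) -> None:
    """Parse one sentence and print ID, form, lemma, UPOS, head and relation."""
    sentence = nlp(text)
    for token in sentence.tokens:
        print(token_row(token))
```

With COMBO installed, calling `print_parse(load_parser("path/to/model.tar.gz"), "Hún las bókina í gær.")` would print one row per token. The archive path is a placeholder: point it at the model file shipped with this repository.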

	The Tokenizer directory is a clone of [Miðeind's tokenizer](https://github.com/icelandic-lt/Tokenizer).

	The directory `transformer_models/` contains the pretrained model [electra-base-igc-is](https://huggingface.co/Icelandic-lt/electra-base-igc-is), trained by Jón Friðrik Daðason,
	which supplies the parser with contextual embeddings and attention.


	## License
	[Apache License 2.0](https://opensource.org/licenses/Apache-2.0)
