|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
tags: |
|
- Phrase Representation |
|
- String Matching |
|
- Fuzzy Join |
|
- Entity Retrieval |
|
- transformers |
|
- sentence-transformers |
|
--- |
|
## 🦪⚪ PEARL-small |
|
[Learning High-Quality and General-Purpose Phrase Representations](https://arxiv.org/pdf/2401.10407.pdf). <br> |
|
[Lihu Chen](https://chenlihu.com), [Gaël Varoquaux](https://gael-varoquaux.info/), [Fabian M. Suchanek](https://suchanek.name/). |
|
Accepted by EACL Findings 2024 <br> |
|
|
|
PEARL-small is a lightweight string embedding model designed for computing semantic similarity between strings. It produces high-quality embeddings for string matching, entity retrieval, entity clustering, fuzzy joins, and similar tasks.
|
<br> |
|
It differs from typical sentence embedders because it incorporates phrase type information and morphological features, |
|
allowing it to better capture variations in strings. |
|
The model is a variant of [E5-small](https://huggingface.co/intfloat/e5-small-v2) finetuned on our constructed context-free [dataset](https://zenodo.org/records/10676475) to yield better representations |
|
for phrases and strings. <br> |
|
|
|
|
|
🤗 [PEARL-small](https://huggingface.co/Lihuchen/pearl_small) 🤗 [PEARL-base](https://huggingface.co/Lihuchen/pearl_base) |
|
📐 [PEARL Benchmark](https://huggingface.co/datasets/Lihuchen/pearl_benchmark) 🏆 [PEARL Leaderboard](https://huggingface.co/spaces/Lihuchen/pearl_leaderboard) |
|
<br> |
|
|
|
|
|
| Model |Size|Avg| PPDB | PPDB filtered |Turney|BIRD|YAGO|UMLS|CoNLL|BC5CDR|AutoFJ| |
|
|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
|
| FastText |-| 40.3| 94.4 | 61.2 | 59.6 | 58.9 |16.9|14.5|3.0|0.2| 53.6| |
|
| Sentence-BERT |110M|50.1| 94.6 | 66.8 | 50.4 | 62.6 | 21.6|23.6|25.5|48.4| 57.2| |
|
| Phrase-BERT |110M|54.5| 96.8 | 68.7 | 57.2 | 68.8 |23.7|26.1|35.4| 59.5|66.9| |
|
| E5-small |34M|57.0| 96.0| 56.8|55.9| 63.1|43.3| 42.0|27.6| 53.7|74.8| |
|
|E5-base|110M| 61.1| 95.4|65.6|59.4|66.3| 47.3|44.0|32.0| 69.3|76.1| |
|
|PEARL-small|34M| 62.5| 97.0|70.2|57.9|68.1| 48.1|44.5|42.4|59.3|75.2| |
|
|PEARL-base|110M|64.8|97.3|72.2|59.7|72.6|50.7|45.8|39.3|69.4|77.1| |
|
|
|
Cost comparison of FastText and PEARL. The estimated memory is computed from the number of parameters stored as float16 (2 bytes per parameter). Inference speed is reported in milliseconds per 512 samples.
|
The FastText model here is `crawl-300d-2M-subword.bin`. |
|
| Model | Avg Score | Estimated Memory | GPU Speed | CPU Speed |
|
|-|-|-|-|-| |
|
|FastText|40.3|1200MB|-|57ms| |
|
|PEARL-small|62.5|68MB|42ms|446ms| |
|
|PEARL-base|64.8|220MB|89ms|1394ms| |
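
The memory figures follow directly from the parameter counts: at 2 bytes per float16 parameter, 34M parameters give roughly 68MB and 110M give roughly 220MB. A minimal sketch of that back-of-the-envelope calculation:

```python
# Rough memory estimate: parameters stored as float16 (2 bytes each).
def estimated_memory_mb(num_params: int) -> float:
    return num_params * 2 / 1e6  # bytes -> MB

print(estimated_memory_mb(34_000_000))   # ~68 MB  (PEARL-small)
print(estimated_memory_mb(110_000_000))  # ~220 MB (PEARL-base)
```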
|
|
|
## Usage |
|
|
|
### Sentence Transformers |
|
PEARL is integrated with the Sentence Transformers library (thanks to [Tom Aarsen](https://huggingface.co/tomaarsen) for the contribution) and can be used like so:
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer, util |
|
|
|
query_texts = ["The New York Times"] |
|
doc_texts = [ "NYTimes", "New York Post", "New York"] |
|
input_texts = query_texts + doc_texts |
|
|
|
model = SentenceTransformer("Lihuchen/pearl_small") |
|
embeddings = model.encode(input_texts) |
|
scores = util.cos_sim(embeddings[0], embeddings[1:]) * 100 |
|
print(scores.tolist()) |
|
# [[90.56318664550781, 79.65763854980469, 75.52056121826172]] |
|
``` |
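
The same embeddings can drive the fuzzy-join use case mentioned above. Below is a minimal sketch, in which each string in one table is matched to its most similar string in another table and low-scoring pairs are left unmatched; the two toy tables and the similarity threshold of 80 are illustrative assumptions, not part of the model.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Lihuchen/pearl_small")

# Two toy tables whose entries refer to overlapping entities (illustrative data).
left = ["The New York Times", "Apple Inc.", "Harvard University"]
right = ["NYTimes", "Harvard Univ.", "Apple", "New York Post"]

left_emb = model.encode(left)
right_emb = model.encode(right)

# Cosine similarity between every left/right pair, scaled to [0, 100].
scores = util.cos_sim(left_emb, right_emb) * 100

threshold = 80  # assumed cut-off; tune it for your data
for i, row in enumerate(scores):
    j = int(row.argmax())
    if row[j] >= threshold:
        print(f"{left[i]!r} -> {right[j]!r} ({row[j].item():.1f})")
    else:
        print(f"{left[i]!r} -> no match")
```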
|
|
|
### Transformers |
|
You can also use PEARL with the `transformers` library directly. Below is an example of entity retrieval; the code is adapted from E5.
|
|
|
```python |
|
import torch.nn.functional as F |
|
|
|
from torch import Tensor |
|
from transformers import AutoTokenizer, AutoModel |
|
|
|
|
|
def average_pool(last_hidden_states: Tensor, |
|
attention_mask: Tensor) -> Tensor: |
|
last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0) |
|
return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None] |
|
|
|
def encode_text(model, input_texts): |
|
# Tokenize the input texts |
|
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt') |
|
|
|
outputs = model(**batch_dict) |
|
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask']) |
|
|
|
return embeddings |
|
|
|
|
|
query_texts = ["The New York Times"] |
|
doc_texts = [ "NYTimes", "New York Post", "New York"] |
|
input_texts = query_texts + doc_texts |
|
|
|
tokenizer = AutoTokenizer.from_pretrained('Lihuchen/pearl_small') |
|
model = AutoModel.from_pretrained('Lihuchen/pearl_small') |
|
|
|
# encode |
|
embeddings = encode_text(model, input_texts) |
|
|
|
# calculate similarity |
|
embeddings = F.normalize(embeddings, p=2, dim=1) |
|
scores = (embeddings[:1] @ embeddings[1:].T) * 100 |
|
print(scores.tolist()) |
|
|
|
# expected outputs |
|
# [[90.56318664550781, 79.65763854980469, 75.52054595947266]] |
|
``` |
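
For the entity-clustering use case, one option is to group strings whose embeddings exceed a cosine-similarity threshold. The sketch below uses the Sentence Transformers interface for brevity, grouping mentions with `sentence_transformers.util.community_detection`; the toy mention list and the 0.8 threshold are assumptions to adapt for your data.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Lihuchen/pearl_small")

# Toy list of entity mentions (illustrative data).
mentions = [
    "The New York Times", "NYTimes", "N.Y. Times",
    "New York Post", "NY Post",
    "Harvard University", "Harvard Univ.",
]

embeddings = model.encode(mentions, convert_to_tensor=True)

# Group mentions whose pairwise cosine similarity exceeds the threshold.
clusters = util.community_detection(embeddings, threshold=0.8, min_community_size=1)
for cluster in clusters:
    print([mentions[i] for i in cluster])
```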
|
|
|
## Training and Evaluation |
|
Have a look at our code on [GitHub](https://github.com/tigerchen52/PEARL).
|
|
|
|
|
|
|
## Citation |
|
|
|
If you find our work useful, please give us a citation: |
|
|
|
```bibtex
|
@article{chen2024learning, |
|
title={Learning High-Quality and General-Purpose Phrase Representations}, |
|
author={Chen, Lihu and Varoquaux, Ga{\"e}l and Suchanek, Fabian M}, |
|
journal={arXiv preprint arXiv:2401.10407}, |
|
year={2024} |
|
} |
|
``` |