|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
tags: |
|
- Phrase Representation |
|
- String Matching |
|
- Fuzzy Join |
|
- Entity Retrieval |
|
- transformers |
|
- sentence-transformers |
|
--- |
|
## 🦪⚪ PEARL-small |
|
[Learning High-Quality and General-Purpose Phrase Representations](https://arxiv.org/pdf/2401.10407.pdf). <br> |
|
[Lihu Chen](https://chenlihu.com), [Gaël Varoquaux](https://gael-varoquaux.info/), [Fabian M. Suchanek](https://suchanek.name/). |
|
Accepted by EACL Findings 2024 <br> |
|
|
|
PEARL-small is a lightweight string embedding model designed for computing semantic similarity between strings. It produces high-quality embeddings for string matching, entity retrieval, entity clustering, fuzzy joins, and similar tasks.
|
<br> |
|
It differs from typical sentence embedders because it incorporates phrase type information and morphological features, |
|
allowing it to better capture variations in strings. |
|
The model is a variant of [E5-small](https://huggingface.co/intfloat/e5-small-v2) finetuned on our constructed context-free [dataset](https://zenodo.org/records/10676475) to yield better representations |
|
for phrases and strings. <br> |
|
|
|
|
|
🤗 [PEARL-small](https://huggingface.co/Lihuchen/pearl_small) 🤗 [PEARL-base](https://huggingface.co/Lihuchen/pearl_base) |
|
📐 [PEARL Benchmark](https://huggingface.co/datasets/Lihuchen/pearl_benchmark) 🏆 [PEARL Leaderboard](https://huggingface.co/spaces/Lihuchen/pearl_leaderboard) |
|
<br> |
|
|
|
|
|
| Model |Size|Avg| PPDB | PPDB filtered |Turney|BIRD|YAGO|UMLS|CoNLL|BC5CDR|AutoFJ| |
|
|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
|
| FastText |-| 40.3| 94.4 | 61.2 | 59.6 | 58.9 |16.9|14.5|3.0|0.2| 53.6| |
|
| Sentence-BERT |110M|50.1| 94.6 | 66.8 | 50.4 | 62.6 | 21.6|23.6|25.5|48.4| 57.2| |
|
| Phrase-BERT |110M|54.5| 96.8 | 68.7 | 57.2 | 68.8 |23.7|26.1|35.4| 59.5|66.9| |
|
| E5-small |34M|57.0| 96.0| 56.8|55.9| 63.1|43.3| 42.0|27.6| 53.7|74.8| |
|
|E5-base|110M| 61.1| 95.4|65.6|59.4|66.3| 47.3|44.0|32.0| 69.3|76.1| |
|
|PEARL-small|34M| 62.5| 97.0|70.2|57.9|68.1| 48.1|44.5|42.4|59.3|75.2| |
|
|PEARL-base|110M|64.8|97.3|72.2|59.7|72.6|50.7|45.8|39.3|69.4|77.1| |
|
|
|
Cost comparison of FastText and PEARL. The estimated memory is computed from the number of parameters stored as float16 (2 bytes per parameter). Inference speed is reported in milliseconds per 512 samples.
|
The FastText model here is `crawl-300d-2M-subword.bin`. |
|
| Model | Avg Score | Estimated Memory | GPU Speed | CPU Speed |
|
|-|-|-|-|-| |
|
|FastText|40.3|1200MB|-|57ms| |
|
|PEARL-small|62.5|68MB|42ms|446ms| |
|
|PEARL-base|64.8|220MB|89ms|1394ms| |
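
The memory figures follow directly from the parameter counts: at 2 bytes per float16 parameter, 34M parameters give roughly 68MB and 110M give roughly 220MB. A minimal sketch of that back-of-the-envelope calculation:

```python
# Rough memory estimate: parameters stored as float16 (2 bytes each).
def estimated_memory_mb(num_params: int) -> float:
    return num_params * 2 / 1e6  # bytes -> MB

print(estimated_memory_mb(34_000_000))   # ~68 MB  (PEARL-small)
print(estimated_memory_mb(110_000_000))  # ~220 MB (PEARL-base)
```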
|
|
|
## Usage |
|
|
|
### Sentence Transformers |
|
PEARL is integrated with the Sentence Transformers library (thanks to [Tom Aarsen](https://huggingface.co/tomaarsen) for the contribution) and can be used like so:
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer, util |
|
|
|
query_texts = ["The New York Times"] |
|
doc_texts = [ "NYTimes", "New York Post", "New York"] |
|
input_texts = query_texts + doc_texts |
|
|
|
model = SentenceTransformer("Lihuchen/pearl_small") |
|
embeddings = model.encode(input_texts) |
|
scores = util.cos_sim(embeddings[0], embeddings[1:]) * 100 |
|
print(scores.tolist()) |
|
# [[90.56318664550781, 79.65763854980469, 75.52056121826172]] |
|
``` |
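
The same embeddings can drive the fuzzy-join use case mentioned above. Below is a minimal sketch, in which each string in one table is matched to its most similar string in another table and low-scoring pairs are left unmatched; the two toy tables and the similarity threshold of 80 are illustrative assumptions, not part of the model.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Lihuchen/pearl_small")

# Two toy tables whose entries refer to overlapping entities (illustrative data).
left = ["The New York Times", "Apple Inc.", "Harvard University"]
right = ["NYTimes", "Harvard Univ.", "Apple", "New York Post"]

left_emb = model.encode(left)
right_emb = model.encode(right)

# Cosine similarity between every left/right pair, scaled to [0, 100].
scores = util.cos_sim(left_emb, right_emb) * 100

threshold = 80  # assumed cut-off; tune it for your data
for i, row in enumerate(scores):
    j = int(row.argmax())
    if row[j] >= threshold:
        print(f"{left[i]!r} -> {right[j]!r} ({row[j].item():.1f})")
    else:
        print(f"{left[i]!r} -> no match")
```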
|
|
|
### Transformers |
|
You can also use PEARL with the `transformers` library directly. Below is an example of entity retrieval; the code is adapted from E5.
|
|
|
```python |
|
import torch.nn.functional as F |
|
|
|
from torch import Tensor |
|
from transformers import AutoTokenizer, AutoModel |
|
|
|
|
|
def average_pool(last_hidden_states: Tensor, |
|
attention_mask: Tensor) -> Tensor: |
|
last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0) |
|
return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None] |
|
|
|
def encode_text(model, input_texts): |
|
# Tokenize the input texts |
|
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt') |
|
|
|
outputs = model(**batch_dict) |
|
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask']) |
|
|
|
return embeddings |
|
|
|
|
|
query_texts = ["The New York Times"] |
|
doc_texts = [ "NYTimes", "New York Post", "New York"] |
|
input_texts = query_texts + doc_texts |
|
|
|
tokenizer = AutoTokenizer.from_pretrained('Lihuchen/pearl_small') |
|
model = AutoModel.from_pretrained('Lihuchen/pearl_small') |
|
|
|
# encode |
|
embeddings = encode_text(model, input_texts) |
|
|
|
# calculate similarity |
|
embeddings = F.normalize(embeddings, p=2, dim=1) |
|
scores = (embeddings[:1] @ embeddings[1:].T) * 100 |
|
print(scores.tolist()) |
|
|
|
# expected outputs |
|
# [[90.56318664550781, 79.65763854980469, 75.52054595947266]] |
|
``` |
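
For the entity-clustering use case, one option is to group strings whose embeddings exceed a cosine-similarity threshold. The sketch below uses the Sentence Transformers interface for brevity, grouping mentions with `sentence_transformers.util.community_detection`; the toy mention list and the 0.8 threshold are assumptions to adapt for your data.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Lihuchen/pearl_small")

# Toy list of entity mentions (illustrative data).
mentions = [
    "The New York Times", "NYTimes", "N.Y. Times",
    "New York Post", "NY Post",
    "Harvard University", "Harvard Univ.",
]

embeddings = model.encode(mentions, convert_to_tensor=True)

# Group mentions whose pairwise cosine similarity exceeds the threshold.
clusters = util.community_detection(embeddings, threshold=0.8, min_community_size=1)
for cluster in clusters:
    print([mentions[i] for i in cluster])
```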
|
|
|
## Training and Evaluation |
|
Have a look at our code on [GitHub](https://github.com/tigerchen52/PEARL).
|
|
|
|
|
|
|
## Citation |
|
|
|
If you find our work useful, please give us a citation: |
|
|
|
```bibtex
|
@article{chen2024learning, |
|
title={Learning High-Quality and General-Purpose Phrase Representations}, |
|
author={Chen, Lihu and Varoquaux, Ga{\"e}l and Suchanek, Fabian M}, |
|
journal={arXiv preprint arXiv:2401.10407}, |
|
year={2024} |
|
} |
|
``` |