ember-v1

This model was trained on a large corpus of text pairs spanning many domains, including finance, science, medicine, and law. Training incorporated techniques from the RetroMAE and SetFit research papers.

Plans

The research paper will be published soon.
The v2 of the model is currently in development and will feature an extended maximum sequence length of 4,000 tokens.

Usage

Use with transformers:

import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

input_texts = [
    "This is an example sentence",
    "Each sentence is converted"
]

tokenizer = AutoTokenizer.from_pretrained("llmrails/ember-v1")
model = AutoModel.from_pretrained("llmrails/ember-v1")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

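# Run the model and mean-pool token embeddings over non-padding positions
outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings; once normalized, the dot product below is a cosine similarity
embeddings = F.normalize(embeddings, p=2, dim=1)
# Score the first text against the remaining texts, scaled by 100
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())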
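
To make the pooling step concrete, here is a minimal pure-Python sketch of the computation that average_pool performs, using nested lists in place of tensors. The function name masked_mean_pool and the toy values are illustrative assumptions, not part of the model's code.

```python
from typing import List


def masked_mean_pool(hidden: List[List[List[float]]],
                     mask: List[List[int]]) -> List[List[float]]:
    """Average token vectors per sequence, skipping positions where mask == 0."""
    pooled = []
    for token_vectors, token_mask in zip(hidden, mask):
        dim = len(token_vectors[0])
        total = [0.0] * dim
        for vec, keep in zip(token_vectors, token_mask):
            if keep:  # padding positions contribute nothing to the sum
                for i, value in enumerate(vec):
                    total[i] += value
        count = sum(token_mask)  # number of real (non-padding) tokens
        pooled.append([t / count for t in total])
    return pooled


# Toy batch: 2 sequences, 3 token positions, hidden size 2 (made-up values).
hidden = [
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],  # last position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],  # no padding
]
mask = [[1, 1, 0], [1, 1, 1]]

pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

The padded position in the first sequence is excluded from both the sum and the divisor, which is why the attention mask appears twice in average_pool.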

Use with sentence-transformers:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = [
    "This is an example sentence",
    "Each sentence is converted"
]

model = SentenceTransformer('llmrails/ember-v1')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
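
A common use of such embeddings is ranking candidate texts against a query. The sketch below shows one way this could look once embeddings are available as plain lists of floats; the helper names cosine and rank_by_similarity and the two-dimensional toy vectors are assumptions for illustration (real ember-v1 embeddings have 1024 dimensions).

```python
import math
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: List[float],
                       candidates: List[List[float]]) -> List[Tuple[int, float]]:
    """Return (candidate index, score) pairs, most similar first."""
    scored = [(i, cosine(query, c)) for i, c in enumerate(candidates)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for model.encode(...) outputs.
query = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

ranked = rank_by_similarity(query, candidates)
print([index for index, _ in ranked])
```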

Massive Text Embedding Benchmark (MTEB) Evaluation

Our model achieves state-of-the-art performance on the MTEB leaderboard.

Model Name	Dimension	Sequence Length	Average (56 datasets)
ember-v1	1024	512	63.54
bge-large-en-v1.5	1024	512	63.23
bge-base-en-v1.5	768	512	63.05
text-embedding-ada-002	1536	8191	60.99

Limitation

This model supports English text only. Inputs longer than 512 tokens are truncated.
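
One possible workaround for the 512-token limit, not prescribed by this model card, is to split a long input into windows that fit the limit, embed each window separately, and average the results. The sketch below covers only the splitting and averaging steps on plain lists; the function names, the reserved=2 allowance for special tokens, and the averaging strategy are all assumptions.

```python
from typing import List


def split_into_windows(token_ids: List[int],
                       max_length: int = 512,
                       reserved: int = 2) -> List[List[int]]:
    """Split token ids into consecutive windows that fit the model limit.

    `reserved` leaves room for special tokens added around each window
    (assumed here to be two, e.g. a start and an end token).
    """
    if max_length <= reserved:
        raise ValueError("max_length must be larger than reserved")
    window = max_length - reserved
    return [token_ids[i:i + window] for i in range(0, len(token_ids), window)]


def average_vectors(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors (one per window)."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# A made-up document of 1200 token ids.
token_ids = list(range(1200))
windows = split_into_windows(token_ids)
print([len(w) for w in windows])

# Toy per-window embeddings standing in for real model outputs.
combined = average_vectors([[1.0, 3.0], [3.0, 5.0]])
print(combined)
```

Averaging window embeddings weights every window equally regardless of its length; whether that is appropriate depends on the application.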

License

MIT

Citation

@misc{nur2024emberv1,
      title={ember-v1: SOTA embedding model}, 
      author={Enrike Nur and Anar Aliyev},
      year={2023},
}
