Granite-Embedding-30m-Sparse

Model Summary: Granite-Embedding-30m-Sparse is a 30M-parameter sparse biencoder embedding model from the Granite Experimental suite that generates high-quality text embeddings. For each input text, it produces a variable-length, bag-of-words-like dictionary that maps expansions of the sentence tokens to their corresponding weights. It is trained on a combination of open-source relevance-pair datasets with permissive, enterprise-friendly licenses and datasets collected and generated by IBM. The model maintains competitive scores on academic benchmarks such as BEIR and also performs well on many enterprise use cases. It was developed using retrieval-oriented pretraining, contrastive finetuning, and knowledge distillation for improved performance.

Supported Languages: English.

Intended use: The model is designed to produce, for a given text, a variable-length, bag-of-words-like dictionary containing expansions of the sentence tokens and their corresponding weights. This representation can be used for text similarity, retrieval, and search applications.
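
To make the bag-of-words-like representation concrete, here is a minimal sketch of how two such token-to-weight dictionaries could be compared with an inner product. The tokens and weights are made up for illustration and are not real model output, and `sparse_dot` is a hypothetical helper rather than part of any library.

```python
def sparse_dot(a: dict[str, float], b: dict[str, float]) -> float:
    """Inner product of two token -> weight dictionaries."""
    # Iterate over the smaller dictionary; only shared tokens contribute.
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b[token] for token, weight in a.items() if token in b)


# Hypothetical sparse representations (illustrative values only).
query = {"turing": 1.2, "born": 0.9, "birth": 0.4}
doc_a = {"turing": 1.0, "london": 0.8, "born": 0.5}
doc_b = {"ai": 1.1, "discipline": 0.7}

print(sparse_dot(query, doc_a))  # shares "turing" and "born" with the query
print(sparse_dot(query, doc_b))  # shares no tokens with the query
```

A document scores higher the more highly weighted tokens it shares with the query, which is why expansion tokens (related words that do not appear in the text itself) can help retrieval.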

Usage with Milvus: The model is compatible with the Milvus vector database and is easy to use:

First, install the pymilvus library together with its model extras, which provide the embedding functions used below:

pip install "pymilvus[model]"

The model can then be used to encode documents and queries, index the document embeddings in Milvus, and retrieve the best-matching document for each query:


from pymilvus import model
from pymilvus import MilvusClient, DataType

client = MilvusClient("./milvus_demo.db")

client.drop_collection(collection_name="my_sparse_collection")

schema = client.create_schema(
    auto_id=True,
    enable_dynamic_field=True,
)

schema.add_field(field_name="pk", datatype=DataType.VARCHAR, is_primary=True, max_length=100)
schema.add_field(field_name="id", datatype=DataType.VARCHAR, is_primary=False, max_length=100)
schema.add_field(field_name="embeddings", datatype=DataType.SPARSE_FLOAT_VECTOR)

index_params = client.prepare_index_params()

index_params.add_index(field_name="embeddings",
                               index_name="sparse_inverted_index",
                               index_type="SPARSE_INVERTED_INDEX",
                               metric_type="IP",
                               params={"drop_ratio_build": 0.2})
client.create_collection(
    collection_name="my_sparse_collection",
    schema=schema,
    index_params=index_params
)

embeddings_model = model.sparse.SpladeEmbeddingFunction(
    model_name="ibm-granite/granite-embedding-30m-sparse",
    device="cpu",
    batch_size=2,
    k_tokens_query=50,
    k_tokens_document=192
)

# Prepare documents to be ingested
docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]
doc_vector = [{"embeddings": doc_emb, "id": f"item_{i}"} for i, doc_emb in enumerate(embeddings_model.encode_documents(docs))]

client.insert(
    collection_name="my_sparse_collection",
    data=doc_vector
)

# Prepare search parameters
search_params = {
    "params": {"drop_ratio_search": 0.2},  # Additional optional search parameters
}

# Prepare the query vectors
queries = [
    "When was artificial intelligence founded",
    "Where was Turing born?",
]
# Use the query encoder so that k_tokens_query (not k_tokens_document) applies
query_vector = embeddings_model.encode_queries(queries)

res = client.search(
    collection_name="my_sparse_collection",
    data=query_vector,
    limit=1,  # top-k documents to return
    output_fields=["id"],
    search_params=search_params,
)

for r in res:
    print(r)
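
The `k_tokens_query` and `k_tokens_document` arguments above limit how many expansion tokens are kept per text. The sketch below illustrates the general idea of top-k pruning on a token-to-weight dictionary. It assumes that pruning keeps the k highest-weighted tokens; `keep_top_k` is a hypothetical helper, not the pymilvus implementation, and the weights are made up.

```python
def keep_top_k(weights: dict[str, float], k: int) -> dict[str, float]:
    """Keep only the k highest-weighted tokens of a sparse vector."""
    top = sorted(weights.items(), key=lambda item: item[1], reverse=True)[:k]
    return dict(top)


# Hypothetical expansion of a short query (illustrative values only).
expanded = {
    "turing": 1.4,
    "alan": 1.1,
    "born": 0.9,
    "london": 0.6,
    "birth": 0.3,
    "city": 0.1,
}

pruned = keep_top_k(expanded, k=3)
print(pruned)
```

A smaller k gives shorter vectors and faster search at the cost of dropping low-weight expansions, which is presumably why the example keeps fewer tokens for queries (50) than for documents (192).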

Evaluation:

Granite-Embedding-30m-Sparse is competitive in performance with naver/splade-v3-distilbert despite having fewer than half as many parameters (30M vs. 67M). We also compare the sparse model with its similarly sized dense counterpart, ibm-granite/granite-embedding-30m-english. The performance of the models on MTEB Retrieval (i.e., BEIR) is reported below. To maintain consistency with the results reported for naver/splade-v3-distilbert, we do not include CQADupstack and MS-MARCO in the table below.

| Model | Parameters (M) | Vocab Size | BEIR Retrieval (13) |
|---|---|---|---|
| naver/splade-v3-distilbert | 67 | 30522 | 50.0 |
| granite-embedding-30m-english | 30 | 50265 | 50.6 |
| granite-embedding-30m-sparse | 30 | 50265 | 50.8 |

Model Architecture: Granite-Embedding-30m-Sparse is based on an encoder-only, RoBERTa-like transformer architecture, trained internally at IBM Research.

| Model | granite-embedding-30m-sparse |
|---|---|
| Embedding size | 384 |
| Number of layers | 6 |
| Number of attention heads | 12 |
| Intermediate size | 1536 |
| Activation Function | GeLU |
| Vocabulary Size | 50265 |
| Max. Sequence Length | 512 |
| # Parameters | 30M |
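
As a rough sanity check, the figures in the table above can be combined into a back-of-the-envelope parameter count. This sketch assumes a standard BERT/RoBERTa-style layer layout (four attention projections, a two-layer feed-forward block, and two layer norms per layer, all with biases) and ignores position embeddings and any output head, so it is an approximation rather than the exact count.

```python
# Values taken from the architecture table.
vocab_size = 50265
hidden = 384
layers = 6
intermediate = 1536

# Token embedding matrix: one vector of size `hidden` per vocabulary entry.
token_embeddings = vocab_size * hidden

# Per-layer weights, assuming a standard transformer encoder layer.
attention = 4 * (hidden * hidden + hidden)  # Q, K, V and output projections
feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
layer_norms = 2 * (2 * hidden)  # scale and bias for two layer norms
per_layer = attention + feed_forward + layer_norms

total = token_embeddings + layers * per_layer

print(f"token embeddings: {token_embeddings / 1e6:.1f}M")
print(f"encoder layers:   {layers * per_layer / 1e6:.1f}M")
print(f"total (approx.):  {total / 1e6:.1f}M")
```

Under these assumptions the estimate lands close to the reported 30M, with the token embedding matrix accounting for close to two thirds of the parameters because of the large vocabulary.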

Training Data: Overall, the training data consists of four key sources: (1) unsupervised title-body paired data scraped from the web, (2) publicly available paired data with permissive, enterprise-friendly licenses, (3) IBM-internal paired data targeting specific technical domains, and (4) IBM-generated synthetic data. The data is listed below:

| Dataset | Num. Pairs |
|---|---|
| SPECTER citation triplets | 684,100 |
| Stack Exchange Duplicate questions (titles) | 304,525 |
| Stack Exchange Duplicate questions (bodies) | 250,519 |
| Stack Exchange Duplicate questions (titles+bodies) | 250,460 |
| Natural Questions (NQ) | 100,231 |
| SQuAD2.0 | 87,599 |
| PAQ (Question, Answer) pairs | 64,371,441 |
| Stack Exchange (Title, Answer) pairs | 4,067,139 |
| Stack Exchange (Title, Body) pairs | 23,978,013 |
| Stack Exchange (Title+Body, Answer) pairs | 187,195 |
| S2ORC Citation pairs (Titles) | 52,603,982 |
| S2ORC (Title, Abstract) | 41,769,185 |
| S2ORC (Citations, abstracts) | 52,603,982 |
| WikiAnswers Duplicate question pairs | 77,427,422 |
| SearchQA | 582,261 |
| HotpotQA | 85,000 |
| Fever | 109,810 |
| Arxiv | 2,358,545 |
| Wikipedia | 20,745,403 |
| PubMed | 20,000,000 |
| Miracl En Pairs | 9,016 |
| DBPedia Title-Body Pairs | 4,635,922 |
| Synthetic: Query-Wikipedia Passage | 1,879,093 |
| Synthetic: Fact Verification | 9,888 |
| IBM Internal Triples | 40,290 |
| IBM Internal Title-Body Pairs | 1,524,586 |

Notably, we do not use the popular MS-MARCO retrieval dataset in our training corpus due to its non-commercial license.

Infrastructure: We train Granite Embedding models on IBM's computing cluster, Cognitive Compute Cluster, which is outfitted with NVIDIA A100 80GB GPUs. This cluster provides a scalable and efficient infrastructure for training our models over multiple GPUs.

Ethical Considerations and Limitations: The data used to train the base language model was filtered to remove text containing hate, abuse, and profanity. Granite-Embedding-30m-Sparse is trained only on English text and has a context length of 512 tokens (longer texts are truncated to this length).
