Netizine/icis_e5_mistral_embeddings_instruct
🔗 HF Repo: https://huggingface.co/Netizine/icis_e5_mistral_embeddings_instruct
A 4-bit + LoRA-adapted Sentence-Transformer based on E5-Mistral-7B-Instruct, fine-tuned on ICIS commodity-news triplets. Ideal for embedding news headlines, press releases, regulations and supply-demand alerts in the chemicals, fertilizers & energy markets.
Model Details
- Base: intfloat/e5-mistral-7b-instruct
- Adapter: LoRA (r=8, α=16, dropout=0.05) on attention & MLP projections
- Quantization: 4-bit NF4 with double quantization & fp16 compute (see the configuration sketch after this list)
- Layers: 32 Transformer layers
- Embedding size: 4,096
- Training data: private ICIS dataset of triplets mined from supply-chain news & regulatory alerts
- Loss: Multiple Negatives Ranking (contrastive)
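The adapter and quantization settings above translate roughly into the following configuration. This is a minimal sketch: the target module names are an assumption based on the standard Mistral architecture, not read from the adapter; check the adapter's own config for the authoritative list.

import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 with double quantization and fp16 compute, per the list above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# LoRA on attention & MLP projections. The module names assume the
# standard Mistral layout and are illustrative.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)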
Installation
pip install sentence-transformers transformers accelerate bitsandbytes peft
Usage
Below are examples of encoding texts, first with Sentence Transformers and then directly with Transformers.
Sentence Transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Netizine/icis_e5_mistral_embeddings_instruct")
# if you want to restrict max length:
model.max_seq_length = 512
# Encode ICIS‐style texts
texts = [
"US ethylene spot prices up 20% on outage at Gulf Coast cracker",
"China government releases new fertilizer export quotas for Q4"
]
embeddings = model.encode(texts, convert_to_tensor=True, normalize_embeddings=True)
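Because the embeddings above are L2-normalized, cosine similarity reduces to a dot product. A short illustration (not part of the original example) using the library's cos_sim helper:

from sentence_transformers import util

# Pairwise cosine similarities between the encoded texts.
scores = util.cos_sim(embeddings, embeddings)
print(scores)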
Transformers
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Netizine/icis_e5_mistral_embeddings_instruct")
model = AutoModel.from_pretrained("Netizine/icis_e5_mistral_embeddings_instruct", device_map="auto")

def pool_last_hidden(hidden, mask):
    # Last-token pooling: select the hidden state of the final non-padding
    # token in each sequence (assumes right-padded inputs).
    lengths = mask.sum(dim=1) - 1
    return hidden[torch.arange(hidden.size(0)), lengths]

texts = [
    "US ethylene spot prices up 20% on outage at Gulf Coast cracker",
    "China government releases new fertilizer export quotas for Q4"
]

inputs = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

pooled_emb = pool_last_hidden(outputs.last_hidden_state, inputs.attention_mask)
embeddings = F.normalize(pooled_emb, p=2, dim=1)
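As in the Sentence Transformers path, the normalized embeddings can be scored with a plain matrix product:

# Pairwise cosine similarities (rows are already unit-norm).
scores = embeddings @ embeddings.T
print(scores)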
Supported Languages
This model is based on Mistral-7B-v0.1 and fine-tuned on the ICIS datasets. Since Mistral-7B-v0.1 was trained mainly on English data, we recommend using this model for English only.
Prompts & Config
Take a look at config_sentence_transformers.json for the built-in prompts:
"web_search_query": "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: ", "sts_query": "Instruct: Retrieve semantically similar text.\nQuery: ", "summarization_query": "Instruct: Given a news summary, retrieve other semantically similar summaries\nQuery: ",
Use via:
model.encode(
    ["spot prices surge"],
    prompt_name="web_search_query"
)
Evaluation
Fine-tuned on ICIS triplets, this model achieves perfect Recall@1 and MRR on held-out commodity-news triplets; exact figures are in the training summary below.
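For reference, Recall@1 and MRR over such triplets can be computed as below. This is a hypothetical sketch, not the original evaluation script: it assumes one positive per anchor, with all positives pooled as candidates.

import torch

def recall_at_1_and_mrr(anchors, positives):
    # anchors, positives: (N, d) L2-normalized embeddings, where
    # positives[i] is the correct match for anchors[i].
    scores = anchors @ positives.T
    # Rank of each true positive among all N candidates (1 = best;
    # ties count against the model).
    ranks = (scores >= scores.diag().unsqueeze(1)).sum(dim=1).float()
    recall_at_1 = (ranks == 1).float().mean().item()
    mrr = (1.0 / ranks).mean().item()
    return recall_at_1, mrr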
Training summary
- Duration: 16 h 08 m over 2 epochs (146,220 steps)
- Final avg training loss: 0.0094
- Held-out results: Recall@1 = 1.0000, MRR = 1.0000, mean rank = 1.00
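The Multiple Negatives Ranking loss named under Model Details corresponds to sentence-transformers' MultipleNegativesRankingLoss. Below is a minimal sketch of how such a fine-tune is wired up; the triplet is invented for illustration (the actual ICIS training data is private), and in practice the 4-bit + LoRA setup above would be applied rather than loading the base model in full precision.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")

# (query, positive, hard negative) triplet; other in-batch examples also
# serve as negatives under this loss.
train_examples = [
    InputExample(texts=[
        "Instruct: Retrieve semantically similar text.\nQuery: US ethylene spot prices up 20%",
        "Gulf Coast cracker outage lifts US ethylene spot prices",
        "EU carbon permit prices fall on mild weather",
    ]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=4)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=2)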
Limitations
- Max length: 512 tokens (performance may degrade beyond this)
- Domain: trained on English commodity/news text; not optimized for general web text
- Quantization: 4-bit inference may yield slight numeric variance vs. full precision
FAQ
1. Do I need to add instructions to the query?
Yes. The model was trained with instructions prepended to queries, so omitting them degrades performance. The task definition should be a one-sentence instruction that describes the task; see the sketch below.
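A minimal illustration of the expected query format, following the prompt templates above (the task wording here is an example, not prescribed by the card):

task = "Given a commodity-news query, retrieve relevant news passages"
query = "ammonia prices in Europe"

# The instruction is prepended to the query only; passages are encoded as-is.
instructed_query = f"Instruct: {task}\nQuery: {query}"
embedding = model.encode([instructed_query])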
Citation
If you use this model, please cite:
@misc{netizine2025icis_e5,
  title        = {ICIS E5-Mistral Embeddings (Instruct)},
  author       = {Netizine R\&D Team},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/Netizine/icis_e5_mistral_embeddings_instruct}}
}