DejanX13/Javne_Nabavke_embedding_1000
This is a sentence-transformers model fine-tuned for Serbian public procurement documents ("Javne Nabavke"). It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for clustering, semantic search, and document retrieval in the context of Serbian public procurement.
Model Description
This model has been fine-tuned on a dataset of 1000 Serbian public procurement documents to improve semantic understanding and retrieval performance for:
- Public procurement document analysis
- Tender document similarity matching
- Legal document search and retrieval
- Procurement process automation
- Serbian legal text understanding
The model is based on a multilingual transformer architecture and has been optimized for both Serbian and English text in the public procurement domain.
Usage (Sentence-Transformers)
Install sentence-transformers first:
pip install -U sentence-transformers
Then you can use the model like this:
from sentence_transformers import SentenceTransformer
# Example Serbian public procurement texts
sentences = [
    "Javni poziv za nabavku računarske opreme",
    "Tender za izgradnju javnih objekata",
    "Specifikacija tehničkih zahteva za softver",
]
model = SentenceTransformer('DejanX13/Javne_Nabavke_embedding_1000')
embeddings = model.encode(sentences)
print(embeddings)
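The base model named under Training Details, multilingual-e5-large, was trained with "query: " and "passage: " prefixes on its inputs. This card does not say whether the fine-tuning data kept those prefixes, so the sketch below is an assumption to test on your own data, not a documented requirement of this model. The helper name `add_e5_prefix` is hypothetical.

```python
def add_e5_prefix(texts, kind):
    """Prepend an E5-style role prefix ("query" or "passage") to each text."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return [f"{kind}: {text}" for text in texts]


queries = add_e5_prefix(["nabavka računarske opreme"], "query")
passages = add_e5_prefix(
    [
        "Javni poziv za nabavku računarske opreme",
        "Tender za izgradnju javnih objekata",
    ],
    "passage",
)
print(queries)
print(passages)

# With the model loaded as above (assumed usage, not confirmed by this card):
# query_embeddings = model.encode(queries)
# passage_embeddings = model.encode(passages)
```

Comparing retrieval quality with and without the prefixes on a small labelled sample is probably the quickest way to see which variant suits this fine-tune.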
Usage (LlamaIndex)
You can also use this model with LlamaIndex for document retrieval:
from llama_index.core import Document, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embedding_model = HuggingFaceEmbedding(
    model_name="DejanX13/Javne_Nabavke_embedding_1000",
    embed_batch_size=16,
)
# Use with VectorStoreIndex for document retrieval
documents = [Document(text="Your procurement document text here")]
index = VectorStoreIndex.from_documents(documents, embed_model=embedding_model)
Performance
This model has been evaluated on Serbian public procurement document retrieval tasks, where it improves on general-purpose multilingual models. Benchmark figures are not included in this card.
Training Details
The model was fine-tuned with the following parameters:
Base Model: multilingual-e5-large
Training Dataset: 1000 Serbian public procurement documents with query-document pairs
Training Parameters:
- Epochs: 2
- Batch Size: 5
- Learning Rate: 2e-05
- Loss Function: MultipleNegativesRankingLoss
- Evaluation Steps: 50
- Warmup Steps: 94
- Weight Decay: 0.01
- Max Gradient Norm: 1
- Optimizer: AdamW
DataLoader:
torch.utils.data.dataloader.DataLoader
of length 470 with parameters:
{'batch_size': 5, 'sampler': 'torch.utils.data.sampler.SequentialSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
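The reported numbers can be cross-checked with a little arithmetic. The sketch below assumes one optimizer step per batch (no gradient accumulation), which the card does not state explicitly.

```python
batches_per_epoch = 470  # DataLoader length reported above
batch_size = 5
epochs = 2
warmup_steps = 94

# Each batch holds at most batch_size pairs; the last one may be smaller.
max_training_pairs = batches_per_epoch * batch_size
min_training_pairs = (batches_per_epoch - 1) * batch_size + 1

# Assumption: one optimizer step per batch.
total_steps = batches_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

print(min_training_pairs, max_training_pairs)
print(total_steps)
print(f"{warmup_fraction:.0%}")
```

Under these assumptions the training set holds between 2346 and 2350 query-document pairs, training runs for 940 steps, and warmup covers the first 10% of them.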
Loss Function:
sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss
with parameters:
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
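To illustrate what this loss optimizes, here is a minimal pure-Python sketch of in-batch negatives ranking with cosine similarity and a scale of 20. It is a simplified illustration, not the sentence-transformers implementation; the function names and the toy vectors are hypothetical.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mnr_loss(query_vecs, doc_vecs, scale=20.0):
    """Mean cross-entropy where doc i is the positive for query i and
    every other doc in the batch acts as a negative."""
    losses = []
    for i, query in enumerate(query_vecs):
        logits = [scale * cos_sim(query, doc) for doc in doc_vecs]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy 2-dimensional "embeddings" for a batch of two pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[0.9, 0.1], [0.1, 0.9]]  # each doc matches its own query
swapped_docs = [[0.1, 0.9], [0.9, 0.1]]  # each doc matches the other query

print(mnr_loss(queries, aligned_docs))
print(mnr_loss(queries, swapped_docs))
```

The loss is close to zero when each query is most similar to its own document and large when the pairing is swapped. With a batch size of 5, each query sees only four in-batch negatives per step.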
Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
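The last two modules turn per-token vectors into one sentence vector: mean pooling over the non-padding tokens, then L2 normalization. A minimal sketch of those two steps on toy 2-dimensional vectors follows (the real model uses 1024 dimensions; the function names are hypothetical).

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def l2_normalize(vector):
    """Scale the vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


# Three token vectors of dimension 2; the last one is padding.
token_vectors = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

sentence_vector = l2_normalize(mean_pool(token_vectors, attention_mask))
print(sentence_vector)
```

Because the output has unit length, the dot product of two embeddings equals their cosine similarity.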
Use Cases
This model is particularly useful for:
- Document Retrieval: Finding relevant procurement documents based on queries
- Tender Matching: Matching suppliers with relevant tender opportunities
- Legal Document Analysis: Understanding legal requirements in procurement documents
- Compliance Checking: Identifying similar regulatory requirements across documents
- Procurement Automation: Building AI systems for procurement process automation
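For the retrieval and matching use cases, ranking usually reduces to scoring a query embedding against document embeddings and keeping the best matches. The sketch below assumes unit-length vectors, as the model's Normalize module produces, so a plain dot product serves as cosine similarity. The toy vectors stand in for real `model.encode(...)` output, and `top_k` is a hypothetical helper.

```python
def top_k(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k highest dot-product scores."""
    scores = [
        (idx, sum(q * d for q, d in zip(query_vec, doc_vec)))
        for idx, doc_vec in enumerate(doc_vecs)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical unit-length vectors standing in for model.encode(...) output.
doc_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query_vec = [0.8, 0.6]

results = top_k(query_vec, doc_vecs, k=2)
print(results)
```

For larger collections a vector index, such as the LlamaIndex setup shown earlier, is likely a better fit than this linear scan.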
Languages
- Primary: Serbian (sr)
- Secondary: English (en)
- Optimized for: Serbian public procurement terminology and legal language
Limitations
- Optimized specifically for the Serbian public procurement domain
- May not perform optimally on general-purpose text outside this domain
- Performance may vary on other Serbian text domains not related to public procurement
Citation
If you use this model in your research or applications, please cite:
@misc{javne_nabavke_embedding_1000,
author = {DejanX13},
title = {Javne_Nabavke_embedding_1000: Fine-tuned Embeddings for Serbian Public Procurement},
year = {2024},
publisher = {Hugging Face},
url = {https://huggingface.co/DejanX13/Javne_Nabavke_embedding_1000}
}
Contact
For questions or issues related to this model, please open an issue in the model repository or contact the author through Hugging Face.