SentenceTransformer based on microsoft/mpnet-base
This is a sentence-transformers model fine-tuned from microsoft/mpnet-base on the json dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: microsoft/mpnet-base
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
- Training Dataset: json
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
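The Pooling module above has only pooling_mode_mean_tokens enabled, so the sentence embedding is the average of the token embeddings. The following is a minimal sketch of that idea in plain Python, assuming padding positions are excluded through an attention mask, as is usual for masked mean pooling. The helper name mean_pool and the toy vectors are illustrative and are not part of the library.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask value is 1 (illustrative helper)."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    # Guard against an all-padding input to avoid dividing by zero.
    count = max(count, 1)
    return [total / count for total in totals]


# Toy input: three token slots with 2-dimensional vectors; the last slot is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)
```

In the real model each token vector has 768 dimensions, which is why the pooled output is also 768-dimensional.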
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("sahithkumar7/mpnet-base-smartbots-iter01")
# Run inference
sentences = [
'What was the most frequently identified pharmaceutical in the groundwater samples?',
'from one to five compounds. The most frequently identified pharmaceuticals, in decreasing order, were ciprofloxacin 43%\n(3/7), enrofloxacin, norfloxacin, trimethoprim, lincomycin (29% (2/7), abacavir and tetracycline 14% (1/7). The enzyme\ninhibitors, namely clavulanic acid and cilastatin, were detected once in an urban region located well. This catchment point\nshowed the most significant number of pharmaceuticals. West/Tejo and Centre were the regions with the most\nconsiderable number of substances in groundwater, accounting for 43%. All groundwater samples were contaminated by',
'Pharmacokinetic characteristics may represent key features in understanding antibiotics occurrence [62]. Most antibiotics\nare not completely metabolised in humans and animals; thus, a high percentage of the active substance (40-90%) is\nexcreted in urine/faeces in the unchanged form. These molecules are discharged into water and soil through wastewater,\nanimal manure, and sewage sludge, frequently used as fertilisers to agricultural lands. Also, it is expected that the\nhospital effluent will contribute partly to the pharmaceutical load in the wastewater treatment plant influence [63].',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
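For semantic search, the similarity scores can be used to rank passages against a query, for instance the question in the first position of the list above against the two passages that follow it. The sketch below shows the ranking step with plain Python lists standing in for real embeddings. The vectors are made up, and cosine and rank_passages are hypothetical helpers rather than library functions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_passages(query_vec, passage_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine(query_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up 3-dimensional vectors; embeddings from this model have 768 dimensions.
query = [1.0, 0.0, 1.0]
passages = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
ranking = rank_passages(query, passages)
print(ranking)
```

With real embeddings, the same ordering could be read from the first row of the similarity matrix.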
Evaluation
Metrics
Triplet
- Datasets: antibiotics_test and mpnet-base-smartbots/iter1
- Evaluated with TripletEvaluator
Metric | antibiotics_test | mpnet-base-smartbots/iter1 |
---|---|---|
cosine_accuracy | 0.75 | 0.9333 |
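The cosine_accuracy figure for a triplet evaluation is generally understood as the fraction of (anchor, positive, negative) triplets in which the anchor is more similar to the positive than to the negative under cosine similarity. The sketch below illustrates that computation under this assumption. It uses toy vectors and hypothetical helper names, and it does not reproduce the scores reported above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_cosine_accuracy(triplets):
    """Fraction of triplets where sim(anchor, positive) > sim(anchor, negative)."""
    correct = sum(
        1
        for anchor, positive, negative in triplets
        if cosine(anchor, positive) > cosine(anchor, negative)
    )
    return correct / len(triplets)


toy_triplets = [
    ([1.0, 0.0], [1.0, 0.1], [0.0, 1.0]),    # positive is closer: correct
    ([0.0, 1.0], [0.1, 1.0], [1.0, 0.0]),    # positive is closer: correct
    ([1.0, 0.0], [0.0, 1.0], [1.0, 0.2]),    # negative is closer: wrong
    ([1.0, 1.0], [-1.0, -1.0], [1.0, 0.9]),  # negative is closer: wrong
]
accuracy = triplet_cosine_accuracy(toy_triplets)
print(accuracy)
```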
Training Details
Training Dataset
json
- Dataset: json
- Size: 100 training samples
- Columns: anchor, positive, and negative
- Approximate statistics based on the first 100 samples:
Property | anchor | positive | negative |
---|---|---|---|
type | string | string | string |
details | min: 9 tokens; mean: 16.14 tokens; max: 33 tokens | min: 48 tokens; mean: 125.65 tokens; max: 218 tokens | min: 48 tokens; mean: 122.97 tokens; max: 211 tokens |
- Samples:
anchor | positive | negative |
---|---|---|
Which two macrolide antibiotics are frequently detected in surface water samples? | seems to undertake a similar fate in the environment. Nevertheless, due to stronger adsorption, with higher emergence in sediment, its occurrence in the surface water is lower [71]. The use of tetracyclines, mainly as medicated premix and oral solution for food-producing animals [72], and the very low bioavailability (e.g. in pig feed) [43] contribute to increasing its release into the environment. Regarding macrolides, erythromycin and clarithromycin exhibit a remarkable frequency of detection in surface water samples. The most | Nonetheless, besides the sorption capacity, these antibiotics have high solubility in water. Crucial routes for these substances into the environment are manure from animal production and sewage sludge from wastewater treatment plant (WWTP) used as fertilisers. Therefore, these substances have been evidenced in topsoil samples [68]. These quinolones and other antibiotics, for instance, norfloxacin and tetracycline, have been identified in groundwater samples despite being influenced by sorption processes. They were not readily degraded; instead, the input into groundwater |
What antimicrobial drugs were identified in the survey besides macrolides? | is one of the most frequently pharmaceutical in representative rivers [74,75]. The three macrolides identified in our detection survey are included since 2018 in the first 'watch list' [76]. Another group of antimicrobial drugs identified in our survey were sulfamethoxazole/trimethoprim and sulfamethazine. Sulfamethoxazole/trimethoprim are often used combined since the effectiveness of sulfonamides is enhanced. In the present study, the detection of both substances was comparable; however, trimethoprim was detected in groundwater. | upstream samples obtained in rural locations was demonstrated and could be attributed to a low efficiency in the urban wastewater treatment plants or due to agricultural pressure. The higher frequency of detection for most substances was observed in the Ave river and Ria Formosa, confirming that several effluents impact these water bodies from urban wastewater treatment plants and livestock production. Pharmacokinetic characteristics may represent key features in understanding antibiotics occurrence [62]. Most antibiotics |
How long was the observational period of the antibiotic survey in Portugal? | of antibiotics and their metabolites in surface- groundwater. It seeks to reflect the current demographic, spatial, drug consumption, and drug profile on an observational period of 3 years in Portugal. The greatest challenge of this survey data will be to promote the ecopharmacovigilance framework development shortly to implement measures for avoiding misuse/overuse of antibiotics and slow down emission and antibiotic resistance. 2. Results 2.1. Frequency of Detections: Antibiotics/Enzyme-Inhibitors and Abacavir in Surface-Groundwater | despite being influenced by sorption processes. They were not readily degraded; instead, the input into groundwater could be due to livestock farming pressure, namely by spreading manure in the soil or the possible sewage sludge application in the area. High clay and low sand content in soils can decrease the mobility of pharmaceuticals, which is attributed to clay intense exchange capacity. Thus, soil properties (e.g. particle composition) are a significant, influential |
- Loss: MultipleNegativesRankingLoss with these parameters: { "scale": 20.0, "similarity_fct": "cos_sim" }
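To make the loss parameters concrete: with "similarity_fct" set to cos_sim and "scale" set to 20.0, each anchor is scored against every candidate passage in the batch (all positives followed by all negatives), the cosine similarities are multiplied by the scale, and a softmax cross-entropy is taken with the anchor's own positive as the target. The sketch below follows that reading in plain Python. It is an illustration with toy vectors and a hypothetical mnr_loss helper, not the library implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnr_loss(anchors, positives, negatives, scale=20.0):
    """Mean cross-entropy where anchor i should select positives[i]."""
    candidates = positives + negatives
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over the scaled similarities.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum - logits[i]
    return total / len(anchors)


anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
negatives = [[-1.0, 0.2], [0.2, -1.0]]
well_separated = mnr_loss(anchors, positives, negatives)
# Swapping the positives makes each anchor prefer another anchor's passage.
mismatched = mnr_loss(anchors, positives[::-1], negatives)
print(well_separated, mismatched)
```

The loss is close to zero when every anchor is most similar to its own positive and grows large when another candidate in the batch scores higher.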
Evaluation Dataset
json
- Dataset: json
- Size: 100 evaluation samples
- Columns: anchor, positive, and negative
- Approximate statistics based on the first 100 samples:
Property | anchor | positive | negative |
---|---|---|---|
type | string | string | string |
details | min: 11 tokens; mean: 16.4 tokens; max: 25 tokens | min: 76 tokens; mean: 113.65 tokens; max: 148 tokens | min: 89 tokens; mean: 118.8 tokens; max: 162 tokens |
- Samples:
anchor | positive | negative |
---|---|---|
What percentage of unchanged excretion did the most significant number of detected substances show? | coefficients were not available for lincomycin, clavulanic acid and cilastatin. Physicochemical properties of detected pharmaceuticals. 1 Data retrieved from [16]; 2 Data retrieved from [17]; 3 Data retrieved from [18]; 4 Data retrieved from [19]; 5 Data retrieved from [20]; 6 Data retrieved from [21]; 7 Data retrieved from [22]; 8 Data retrieved from [23]; 9 Data retrieved from [24]; 10 Data retrieved from [25]; NA-not available. The most significant number of detected substances showed a percentage of unchanged excretion higher than 40%. | 1. Introduction Antibiotics are a critical component of human and veterinary modern medicine, developed to produce desirable or beneficial effects on infections induced by pathogens. Like most pharmaceuticals, antibiotics tend to be small organic polar compounds, generally ionisable, ordinarily subject to a metabolism or biotransformation process by the organism to be eliminated more efficiently [1,2]. The excretion of these compounds and their metabolites occurs mainly through urine, |
How many kilograms of abacavir were detected in Portugal in 2017? | Regarding the different regions, it has been concluded that North and West/Tejo were the regions with the higher consuming values. Both regions presented a significant value (33%) for the abacavir. For the detected antiviral abacavir, an amount of 1458 kg has been observed. Regarding antibiotics used in veterinary medicine, the regional amount was not available. Likewise, due to the reported missing quantity for sulfamethazine, the sulfonamides group has been matched. Consumption (Kg) of the detected pharmaceuticals in Portugal (2017). | 43% (3/7), enrofloxacin, norfloxacin, trimethoprim, lincomycin (29% (2/7), abacavir and tetracycline 14% (1/7). The enzyme inhibitors, namely clavulanic acid and cilastatin, were detected once in an urban region located well. This catchment point showed the most significant number of pharmaceuticals. West/Tejo and Centre were the regions with the most considerable number of substances in groundwater, accounting for 43%. All groundwater samples were contaminated by at least one antibiotic. Supplemental Tables S2 and S4 contain a detailed description of the |
What must marketing authorisation procedures for medicines include since 2006? | substances in passive samplers [7]. Since 2006, marketing authorisation procedures for both human and veterinary medicines must include an environmental risk assessment that comprises a prospective exposure assessment, underestimating the possible impact and the occurrence of antibiotics after years of consumption. Ultimately, the potential risk may not be correctly anticipated. It becomes urgent to generate new data, mainly to refine exposure assessments. As much as the specificities of each member state should be considered this issue has become one of the European | clarithromycin/erythromycin, tetracycline, sulfamethoxazole, and abacavir. In groundwater, enrofloxacin/ciprofloxacin, norfloxacin, trimethoprim, lincomycin, abacavir and tetracycline were recovered. Metabolites were not detected in water bodies. Noticeable was the detection of enzyme inhibitors, tazobactam and cilastatin, which are both for exclusive hospital use. The North region and Algarve (South) were the areas with the most significant frequency of substances in surface water. The relatively higher detection of substances downstream of the effluent discharge points compared with a |
- Loss: MultipleNegativesRankingLoss with these parameters: { "scale": 20.0, "similarity_fct": "cos_sim" }
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 10
- per_device_eval_batch_size: 10
- num_train_epochs: 1
- warmup_ratio: 0.1
- fp16: True
- batch_sampler: no_duplicates
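As a worked example of what these settings imply: 100 training samples at a batch size of 10 for one epoch give roughly 10 optimizer steps, and a warmup_ratio of 0.1 then corresponds to about one warmup step. The sketch below computes this and the resulting learning-rate curve, assuming a single device, no gradient accumulation, a sampler that yields every batch, and the common linear schedule (ramp up from 0 to the peak, then decay linearly to 0). The helper names are hypothetical; the peak of 5e-05 is the learning_rate listed in this card.

```python
import math


def total_steps(num_samples, batch_size, epochs):
    """Optimizer steps, assuming one device and no gradient accumulation."""
    return math.ceil(num_samples / batch_size) * epochs


def linear_lr(step, peak_lr, warmup_steps, num_steps):
    """Learning rate at a step under linear warmup followed by linear decay."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = (num_steps - step) / max(1, num_steps - warmup_steps)
    return peak_lr * max(0.0, remaining)


steps = total_steps(num_samples=100, batch_size=10, epochs=1)
warmup = math.ceil(0.1 * steps)
schedule = [linear_lr(s, 5e-05, warmup, steps) for s in range(steps)]
print(steps, warmup)
print(schedule)
```

Under these assumptions the peak learning rate is reached at the second step and decays for the rest of the epoch, so a run this short spends almost all of its time in the decay phase.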
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 10
- per_device_eval_batch_size: 10
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 1
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: True
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- tp_size: 0
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: proportional
Training Logs
Epoch | Step | antibiotics_test_cosine_accuracy | mpnet-base-smartbots/iter1_cosine_accuracy |
---|---|---|---|
-1 | -1 | 0.75 | 0.9333 |
Framework Versions
- Python: 3.11.11
- Sentence Transformers: 3.4.1
- Transformers: 4.51.3
- PyTorch: 2.6.0+cu124
- Accelerate: 1.5.2
- Datasets: 3.6.0
- Tokenizers: 0.21.1
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}