SentenceTransformer based on intfloat/multilingual-e5-base

This is a sentence-transformers model finetuned from intfloat/multilingual-e5-base on the mnlp_encoder_data dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: intfloat/multilingual-e5-base
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset: mnlp_encoder_data

Model Sources

  • Documentation: https://www.sbert.net
  • Repository: https://github.com/UKPLab/sentence-transformers
  • Hugging Face: https://huggingface.co/models?library=sentence-transformers

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
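
Because the final Normalize() module L2-normalizes the pooled token embeddings, every output vector has unit length, so the cosine similarity between two embeddings reduces to a plain dot product. A quick sanity check (a minimal sketch, assuming the model is loaded as in the Usage section below):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ngkan146/test-encoder-st")
embedding = model.encode(["a short test sentence"])[0]

# Normalize() guarantees unit-length vectors ...
print(np.linalg.norm(embedding))  # ~1.0
# ... so the dot product of two embeddings equals their cosine similarity.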

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("ngkan146/test-encoder-st")
# Run inference
sentences = [
    'What is the main purpose of chain coding in image segmentation?  \nA. To enhance the color depth of images  \nB. To compress binary images by tracing contours  \nC. To convert images into three-dimensional models  \nD. To increase the size of image files',
    'A chain code is a lossless compression based image segmentation method for binary images based upon tracing image contours. The basic principle of chain coding, like other contour codings, is to separately encode each connected component, or "blob", in the image.\n\nFor each such region, a point on the boundary is selected and its coordinates are transmitted. The encoder then moves along the boundary of the region and, at each step, transmits a symbol representing the direction of this movement.\n\nThis continues until the encoder returns to the starting position, at which point the blob has been completely described, and encoding continues with the next blob in the image.\n\nThis encoding method is particularly effective for images consisting of a reasonably small number of large connected components.\n\nVariations \nSome popular chain codes include:\n the Freeman Chain Code of Eight Directions (FCCE)\n Directional Freeman Chain Code of Eight Directions (DFCCE)\n Vertex Chain Code (VCC)\n Three OrThogonal symbol chain code (3OT)\n Unsigned Manhattan Chain Code (UMCC)\n Ant Colonies Chain Code (ACCC)\n Predator-Prey System Chain Code (PPSCC)\n Beaver Territories Chain Code (BTCC)\n Biological Reproduction Chain Code (BRCC)\n Agent-Based Modeling Chain Code (ABMCC)\n\nIn particular, FCCE, VCC, 3OT and DFCCE can be transformed from one to another\n\nA related blob encoding method is crack code. Algorithms exist to convert between chain code, crack code, and run-length encoding.\n\nA new trend of chain codes involve the utilization of biological behaviors. This started by the work of Mouring et al. who developed an algorithm that takes advantage of the pheromone of ants to track image information. An ant releases a pheromone when they find a piece of food. Other ants use the pheromone to track the food. In their algorithm, an image is transferred into a virtual environment that consists of food and paths according to the distribution of the pixels in the original image. Then, ants are distributed and their job is to move around while releasing pheromone when they encounter food items. This helps other ants identify information, and therefore, encode information.\n\nIn use \nRecently, the combination of move-to-front transform and adaptive run-length encoding accomplished efficient compression of the popular chain codes.\nChain codes also can be used to obtain high levels of compression for image documents, outperforming standards such as DjVu and JBIG2.',
    'Meripilus sumstinei, commonly known as the giant polypore or the black-staining polypore, is a species of fungus in the family Meripilaceae.\n\nTaxonomy \nOriginally described in 1905 by William Alphonso Murrill as Grifola sumstinei, the species was transferred to Meripilus in 1988.\n\nDescription \nThe cap of this polypore is  wide, with folds of flesh up to  thick. It has white to brownish concentric zones and tapers toward the base; the stipe is indistinct.\n\nDistribution and habitat \nIt is found in eastern North America from June to September. It grows in large clumps on the ground around hardwood (including oak) trunks, stumps, and logs.\n\nUses \nThe mushroom is edible.\n\nReferences',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
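
Since the card lists semantic search among the supported tasks, here is a hedged sketch of a small retrieval loop; the corpus and query strings below are invented for illustration. model.similarity defaults to cosine similarity, which matches the normalized embeddings this model produces.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ngkan146/test-encoder-st")

# Hypothetical corpus and query, for illustration only
corpus = [
    "A chain code compresses binary images by tracing the contours of connected components.",
    "The labor force in Japan numbered 65.9 million people in 2010.",
    "Meripilus sumstinei is a species of fungus in the family Meripilaceae.",
]
query = "How does chain coding compress binary images?"

corpus_embeddings = model.encode(corpus)
query_embedding = model.encode([query])

# Cosine similarity between the query and every corpus entry; shape [1, 3]
scores = model.similarity(query_embedding, corpus_embeddings)
best = scores.argmax().item()
print(corpus[best])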

Training Details

Training Dataset

mnlp_encoder_data

  • Dataset: mnlp_encoder_data at 39af5de
  • Size: 8,000 training samples
  • Columns: anchor, positive, and negative
  • Approximate statistics based on the first 1000 samples:
    anchor (string): min 23 tokens, mean 65.95 tokens, max 171 tokens
    positive (string): min 19 tokens, mean 413.21 tokens, max 512 tokens
    negative (string): min 14 tokens, mean 405.39 tokens, max 512 tokens
  • Samples:
    Sample 1
    anchor:
    What are the two key processes that relative nonlinearity depends on for maintaining species diversity?
    A. Species must differ in their resource consumption and reproductive rates.
    B. Species must differ in their responses to resource density and affect competition differently.
    C. Species must have identical growth rates and resource requirements.
    D. Species must compete for the same resources and have similar responses to competition.
    positive:
    Relative nonlinearity is a coexistence mechanism that maintains species diversity via differences in the response to and effect on variation in resource density or some other factor mediating competition. Relative nonlinearity depends on two processes: 1) species have to differ in the curvature of their responses to resource density and 2) the patterns of resource variation generated by each species must favor the relative growth of another species. In its most basic form, one species grows best under equilibrium competitive conditions and another performs better under variable competitive conditions. Like all coexistence mechanisms, relative nonlinearity maintains species diversity by concentrating intraspecific competition relative to interspecific competition. Because resource density can be variable, intraspecific competition is the reduction of per-capita growth rate under variable resources generated by conspecifics (i.e. individuals of the same species). Interspecific competitio...
    negative:
    Muellerella lichenicola is a species of lichenicolous fungus in the family Verrucariaceae. It was first formally described as a new species in 1826 by Søren Christian Sommerfelt, as Sphaeria lichenicola. David Leslie Hawksworth transferred it to the genus Muellerella in 1979.

    It has been reported growing on Caloplaca aurantia, Caloplaca saxicola and Physcia aipolia in Sicily, and on an unidentified crustose lichen in Iceland. In Mongolia, it has been reported growing on the thallus of a Biatora-lichen at elevation in the Bulgan district and on Aspicilia at elevation in the Altai district. In Victoria Land, Antarctica, it has been reported from multiple hosts, including members of the Teloschistaceae and Physciaceae.

    References

    Sample 2
    anchor:
    What was the unemployment rate in Japan in 2010?
    A. 3.1%
    B. 4.2%
    C. 5.1%
    D. 6.0%
    positive:
    The labor force in Japan numbered 65.9 million people in 2010, which was 59.6% of the population of 15 years old and older, and amongst them, 62.57 million people were employed, whereas 3.34 million people were unemployed which made the unemployment rate 5.1%. The structure of Japan's labor market experienced gradual change in the late 1980s and continued this trend throughout the 1990s. The structure of the labor market is affected by: 1) shrinking population, 2) replacement of postwar baby boom generation, 3) increasing numbers of women in the labor force, and 4) workers' rising education level. Also, an increase in the number of foreign nationals in the labor force is foreseen.

    As of 2019, Japan's unemployment rate was the lowest in the G7. Its employment rate for the working-age population (15-64) was the highest in the G7.

    By 2021 the size of the labor force changed to 68.60 million, a decrease of 0.08 million from the previous year. Viewing by sex, the male labor force was 38.0...
    negative:
    The Aircraft Classification Rating (ACR) - Pavement Classification Rating (PCR) method is a standardized international airport pavement rating system developed by ICAO in 2022. The method is scheduled to replace the ACN-PCN method as the official ICAO pavement rating system by November 28, 2024. The method uses similar concepts as the ACN-PCN method, however, the ACR-PCR method is based on layered elastic analysis, uses standard subgrade categories for both flexible and rigid pavement, and eliminates the use of alpha factor and layer equivalency factors.

    The method relies on the comparison of two numbers:

    The ACR, a number defined as two times the derived single wheel load (expressed in hundreds of kilograms) conveying the relative effect on an airplane of a given weight on a pavement structure for a specified standard subgrade strength;
    The PCR, a number (and series of letters) representing the pavement bearing strength (on the same scale as ACR) of a given pavement section (runwa...

    Sample 3
    anchor:
    What was the original name of WordMARC before it was changed due to a trademark conflict?
    A. MUSE
    B. WordPerfect
    C. Document Assembly
    D. Primeword
    positive:
    WordMARC Composer was a scientifically oriented word processor developed by MARC Software, an offshoot of MARC Analysis Research Corporation (which specialized in high end Finite Element Analysis software for mechanical engineering). It ran originally on minicomputers such as Prime and Digital Equipment Corporation VAX. When the IBM PC emerged as the platform of choice for word processing, WordMARC allowed users to easily move documents from a minicomputer (where they could be easily shared) to PCs.

    WordMARC was the creation of Pedro Marcal, who pioneered work in finite element analysis and needed a technical word processor that both supported complex notations and was capable of running on minicomputers and other high-end machines such as Alliant and AT&T.

    WordMARC was originally known as MUSE (MARC Universal Screen Editor), but the name was changed because of a trademark conflict with another company when the product was ported to the IBM PC.

    Features
    In comparison with WordPerf...
    negative:
    Parametric stereo (abbreviated as PS) is an audio compression algorithm used as an audio coding format for digital audio. It is considered an Audio Object Type of MPEG-4 Part 3 (MPEG-4 Audio) that serves to enhance the coding efficiency of low bandwidth stereo audio media. Parametric Stereo digitally codes a stereo audio signal by storing the audio as monaural alongside a small amount of extra information. This extra information (defined as "parametric overhead") describes how the monaural signal will behave across both stereo channels, which allows for the signal to exist in true stereo upon playback.

    History

    Background
    Advanced Audio Coding Low Complexity (AAC LC) combined with Spectral Band Replication (SBR) and Parametric Stereo (PS) was defined as HE-AAC v2. A HE-AAC v1 decoder will only give a mono output when decoding a HE-AAC v2 bitstream. Parametric Stereo performs sparse coding in the spatial domain, somewhat similar to what SBR does in the frequency domain. An AAC HE v2 b...
  • Loss: TripletLoss with these parameters:
    {
        "distance_metric": "TripletDistanceMetric.EUCLIDEAN",
        "triplet_margin": 5
    }
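
With these settings, the loss for a triplet (anchor a, positive p, negative n) is max(d(a, p) - d(a, n) + 5, 0), where d is the Euclidean distance: the anchor is pulled toward its positive and pushed away from its negative until the negative is at least a margin of 5 further away. A toy computation of that formula (a minimal sketch with hypothetical embedding values, unrelated to the model):

import torch
import torch.nn.functional as F

# Hypothetical 2-d embeddings for one (anchor, positive, negative) triplet
anchor = torch.tensor([1.0, 0.0])
positive = torch.tensor([0.9, 0.1])
negative = torch.tensor([-1.0, 0.0])

margin = 5.0
d_ap = torch.dist(anchor, positive)  # Euclidean distance anchor -> positive
d_an = torch.dist(anchor, negative)  # Euclidean distance anchor -> negative

# TripletLoss objective: hinge on (d_ap - d_an + margin)
loss = F.relu(d_ap - d_an + margin)
print(loss)  # nonzero until the negative is margin-further away than the positive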
    

Training Hyperparameters

Non-Default Hyperparameters

  • learning_rate: 2e-05
  • weight_decay: 0.01
  • num_train_epochs: 1
  • warmup_steps: 10
  • remove_unused_columns: False
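
Putting the dataset, the TripletLoss configuration, and the non-default hyperparameters above together, the training run can be reconstructed roughly as follows. This is a hedged sketch using the standard Sentence Transformers trainer API; the dataset repository id ngkan146/mnlp_encoder_data and the output directory are assumptions, since the card only names the dataset as mnlp_encoder_data.

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)

model = SentenceTransformer("intfloat/multilingual-e5-base")

# Dataset id is an assumption; the card only names "mnlp_encoder_data" at revision 39af5de
train_dataset = load_dataset("ngkan146/mnlp_encoder_data", split="train")

# TripletLoss with Euclidean distance and margin 5, as stated in the card
loss = losses.TripletLoss(
    model=model,
    distance_metric=losses.TripletDistanceMetric.EUCLIDEAN,
    triplet_margin=5,
)

# Non-default hyperparameters from this section
args = SentenceTransformerTrainingArguments(
    output_dir="test-encoder-st",  # hypothetical output directory
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=1,
    warmup_steps=10,
    remove_unused_columns=False,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()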

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 8
  • per_device_eval_batch_size: 8
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.01
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 10
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: False
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • tp_size: 0
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss
0.1 100 4.2263
0.2 200 3.9742
0.3 300 3.9605
0.4 400 3.9198
0.5 500 3.8953
0.6 600 3.8793
0.7 700 3.8918
0.8 800 3.8691
0.9 900 3.8747
1.0 1000 3.8523

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 4.1.0
  • Transformers: 4.51.3
  • PyTorch: 2.7.0+cu126
  • Accelerate: 1.7.0
  • Datasets: 3.6.0
  • Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

TripletLoss

@misc{hermans2017defense,
    title={In Defense of the Triplet Loss for Person Re-Identification},
    author={Alexander Hermans and Lucas Beyer and Bastian Leibe},
    year={2017},
    eprint={1703.07737},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}