SentenceTransformer based on Qwen/Qwen3-Embedding-0.6B

This is a sentence-transformers model finetuned from Qwen/Qwen3-Embedding-0.6B on the massive_triplet_v3 dataset. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Qwen/Qwen3-Embedding-0.6B
  • Maximum Sequence Length: 32768 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset: massive_triplet_v3

Model Sources

  • Documentation: Sentence Transformers Documentation (https://sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 32768, 'do_lower_case': False}) with Transformer model: Qwen3Model 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': True})
  (2): Normalize()
)
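
The printed stack (a Qwen3 transformer backbone, last-token pooling, L2 normalization) can also be assembled by hand. The following is a minimal sketch using the sentence-transformers models API, mirroring the configuration shown above:

from sentence_transformers import SentenceTransformer, models

# Hand-built equivalent of the architecture printed above (sketch):
# Qwen3 backbone, last-token pooling, and L2 normalization.
transformer = models.Transformer("Qwen/Qwen3-Embedding-0.6B", max_seq_length=32768)
pooling = models.Pooling(
    transformer.get_word_embedding_dimension(),  # 1024 for this backbone
    pooling_mode="lasttoken",
)
normalize = models.Normalize()

model = SentenceTransformer(modules=[transformer, pooling, normalize])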

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("CocoRoF/POLAR-Qwen3-0.6b-linq-gist")
# Run inference
sentences = [
    'create list of spiders that obeys the visible projects list, through use of the spider selection menu',
    "def create_spiders_list():\n    spiders_lst = [obj for obj in globals().values() if\n                   inspect.isclass(obj) and str(obj).split('.')[2] == 'spiders' and 'BaseSpider' not in str(obj)]\n    visible_projects = find_visible_projects()\n    spiders_dict = {i.split('.')[0]: [obj for obj in spiders_lst if i.split('.')[0] in str(obj)] for i in\n                    os.listdir('HousingPriceScraper/HousingPriceScraper/spiders/SpiderGroups')[:-1] if i.split('.')[0] in visible_projects}\n    if len(list(spiders_dict.keys())) > 0:\n        spiders_lst = select_spiders(spiders_dict)\n    else:\n        print('There are no visible projects, got to set_visible_projects to set defaults')\n        return False\n    return spiders_lst",
    'def instantiate_pipelines(settings, simulator_settings):\n    pipelines = []\n    # lock to manage race parallel processes race conditions \n    lock = Lock()\n\n    logger.info("\\nVALIDATING PIPELINES\\n")\n    for p_idx, pipeline_settings in enumerate(settings.runs):\n\n        # turn a pipeline off by specifying num_runs as 0\n        num_runs = pipeline_settings.get("num_runs", 0)\n\n        # start_idx determines the first dataset name\'s starting idx\n        start_idx = pipeline_settings.get("start_idx", 0)\n\n        if num_runs:\n            logger.info("Validating run: {}\\n".format(p_idx))\n        else:\n            logger.info("Skipping run: {}\\n".format(p_idx))\n            \n        for idx in range(start_idx, start_idx + num_runs):           \n            logger.info("Pipeline sub index: {}\\n".format(idx))\n            # class factory and instantiate pipeline object\n            Pipeline = pipeline_factory(pipeline_settings["pipeline_name"])\n            p = Pipeline(pipeline_settings, idx, simulator_settings)\n            \n            # give each pipeline an idependent logger\n            log_name = "dSim_{}".format(p.pipeline_settings["dataset_name"])\n            log_path = os.path.join(p.pipeline_settings["outdir"],\n                                    p.pipeline_settings["dataset_name"]+\'.log\')\n            fh = logging.FileHandler(log_path, mode=\'w\')\n            fh.setLevel(logging.DEBUG)\n            format = "%(asctime)-6s: %(name)s - %(levelname)s - %(message)s"\n            fmt = logging.Formatter(format)\n            fh.setFormatter(fmt)\n            local_logger = logging.getLogger(log_name)\n            local_logger.addHandler(fh)\n            logger.info("Init local logging: {}".format(log_path))\n            p.logger = local_logger\n\n            # pipeline/ dataset directory\n            p.pipeline_settings["lock"] = lock\n\n            # validate all submodules for each pipeline is ready (use local logger) \n            p.instantiate_modules()\n\n            # append to list of instantiated pipelines\n            pipelines.append(p)\n    return pipelines',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
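
Because the training configuration (see Training Hyperparameters below) applies an instruction prompt to queries and none to documents, retrieval use is asymmetric. A short sketch, assuming the training-time prompts were saved in the model configuration; if they were not, pass the instruction text explicitly via encode's prompt argument:

# Asymmetric retrieval sketch: prompt the query side only.
# Assumes the "query" prompt from training is stored in the model config;
# otherwise, pass the instruction string via `prompt=...` instead.
query = "create list of spiders that obeys the visible projects list"
documents = sentences[1:]  # reuse the two code snippets above as documents

query_emb = model.encode([query], prompt_name="query")
doc_embs = model.encode(documents)  # documents were trained without a prompt

scores = model.similarity(query_emb, doc_embs)
print(scores)  # higher score = more relevant document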

Training Details

Training Dataset

massive_triplet_v3

  • Dataset: massive_triplet_v3 at 51266de
  • Size: 500,000 training samples
  • Columns: query, positive, and negative
  • Approximate statistics based on the first 1000 samples:
    • query (string): min 6 tokens, mean 22.57 tokens, max 67 tokens
    • positive (string): min 8 tokens, mean 132.85 tokens, max 1160 tokens
    • negative (string): min 4 tokens, mean 122.89 tokens, max 1758 tokens
  • Samples:
    • Sample 1
      • query: 방학기간에 소외지역의 청소년을 대상으로 청춘누리 봉사단이 할 수 있는 캠프의 이름은 뭐야
      • positive:
        주요 수상기관 교육기부프로그램 개요
        4. 대학생 동아리 「청춘누리 봉사단」
        □ 청춘누리축제
        ◦ (참가대상) 전국 유치원, 초·중·고등학생
        ◦ (활동내역) 대학생들이 운영하는 교육기부활동을 청소년들이 직접 체험해봄으로써 학생들이 사고력, 창의력 향상을 도모하고 자신의 꿈을 펼칠 수 있는 장 마련
        ◦ (주요성과) 대학생들의 교육기부에 대한 전반적인 이해를 돕고 교육 기부 활동의 우수성 홍보
        □ 청춘누리봉사단과 함께하는 교육기부(쏙쏙캠프, 함성소리)
        ◦ (참가대상) 전국의 초·중학생
        ◦ (활동내역)
        - 쏙쏙캠프 : 방학을 이용하여 상대적으로 교육기부 혜택이 적은 소외 지역을 방문하여 창의력 체험, 진로체험 등을 제공, 배움의 기회 균등 및 꿈을 찾아주는 활동 전개
        - 함성소리 : 학기중 토요일마다 수도권에 있는 청소년 대상으로 꿈을 설계하고 지원하는 활동 전개
        ◦ (주요성과) 소외지역 청소년 대상 배움의 기회를 제공하고 대학생들의 봉사활동을 장려하여 많은 청소년 대상 멘토 활동 전개
      • negative:
        개도국에 IT나눔을 실천한 청년들과 아름다운 동행
        □ 미래창조과학부(장관 최문기)와 한국정보화진흥원(원장 장광수)은 12월 18일(수) 오후 2시 10분 과천과학관에서 「2013년도 월드프렌즈 IT봉사단 귀국보고대회」(이하, IT봉사단 귀국보고대회)를 개최하였다.
        o 정부는 2001년부터 현재까지 전 세계 70여개 개도국에 5,158명의 IT봉사단을 파견한 바 있으며, 「IT봉사단 귀국보고대회」는 매년 개도국에서 활동하고 온 봉사단원들이 서로의 경험을 공유하고 글로벌 역량을 배양하는 ‘소통'과 ‘협력‘의 장(場)으로 운영되고 있다.
        ※ 월드프렌즈(World Frends Korea, WFK) : 우리나라 해외봉사단사업 통합브랜드
        □ 이번 「IT봉사단 귀국보고대회」에는 30개국에 파견되었던 552명의 봉사단원 중 약 300여명의 봉사단원이 참석했으며, 윤종록 제2차관과 주한 외교사절(인도네시아 대사, 코스타리카 대사, 네팔 대사 등)이 참석해 세계의 오지를 누비고 온 봉사단원들을 격려했다.
        o 윤종록 제2차관은 IT봉사단원들에게“귀한경험을 활용하여 대한민국의 이름을 빛내는 사람이 되기를 바란다”는 당부와 함께“정부는 여러분과 같은 젊은이들이 세계를 무대로 능력을 마음껏 발휘할 수 있는 글로벌 플랫폼을 구축하는데 노력할 계획”이라고 덧붙였다.
    • Sample 2
      • query: Loads sensor filters from an Excel file. Both new style XLSX and oldstyle XLS formats are supported.
      • positive:
        def load_sensor_filters_excel(filename, normalise=False, sheet_names=None):
            sensor_filters = {}
            with pd.ExcelFile(filename) as excel_file:
                # default is all sheets
                if not sheet_names:
                    sheet_names = excel_file.sheet_names

                for sheet in sheet_names:
                    try:
                        dataframe = excel_file.parse(
                            sheet, index_col=0
                        )  # the sheet as a DataFrame
                        # OK, we have the data frame. Let's process it...
                        if not _validate_filter_dataframe(dataframe):
                            continue

                        if normalise:
                            dataframe = _normalise_dataframe(dataframe)

                        sensor_filters[sheet] = (
                            np.array(dataframe.index),
                            dataframe.values.transpose(),
                        )

                    except xlrd.biffh.XLRDError:
                        continue
                    # except xlrd.biffh.XLRDError as xlrd_error:
                    # TODO: log wa...
      • negative:
        def convert_csv(fname):

            # Make sure this is an Excel file.
            if (not is_excel_file(fname)):

                # Not Excel, so no sheets.
                return []

            # Run soffice in listening mode if it is not already running.
            run_soffice()

            # TODO: Make sure soffice is running in listening mode.
            #

            # Connect to the local LibreOffice server.
            context = connect(Socket(HOST, PORT))

            # Load the Excel sheet.
            component = get_component(fname, context)

            # Iterate on all the sheets in the spreadsheet.
            controller = component.getCurrentController()
            sheets = component.getSheets()
            enumeration = sheets.createEnumeration()
            r = []
            pos = 0
            if sheets.getCount() > 0:
                while enumeration.hasMoreElements():

                    # Move to next sheet.
                    sheet = enumeration.nextElement()
                    name = sheet.getName()
                    if (name.count(" ") > 10):
                        name = name.replace(" ", "")
                    name = fix_file_name(name)
                    ...
    • Sample 3
      • query: Create an additional feature to metadata by counting number of occurrences in data, for a specific element_type
      • positive:
        def create_count_features(metadata, element_type, data, grp_feat, res_feat, feature_suffix):
            feature_name = 'n_' + element_type + '_modif' + feature_suffix
            newfeature = (data.groupby([grp_feat])[res_feat]
                          .count()
                          .reset_index()
                          .fillna(0))
            newfeature.columns = [grp_feat, feature_name]
            metadata = pd.merge(metadata, newfeature, on=grp_feat, how="outer").fillna(0)
            return metadata
      • negative:
        def test(self):
            count = Counter()
            for example in self.testing_set:
                classification = self.classify(example.attributes)

                if example.CLASS and classification:
                    count['TP'] += 1
                elif not example.CLASS and classification:
                    count['FP'] += 1
                elif not example.CLASS and not classification:
                    count['TN'] += 1
                elif example.CLASS and not classification:
                    count['FN'] += 1
            return count
  • Loss: CachedGISTEmbedLoss with these parameters:
    {'guide': SentenceTransformer(
      (0): Transformer({'max_seq_length': 40960, 'do_lower_case': False}) with Transformer model: Qwen3Model 
      (1): Pooling({'word_embedding_dimension': 4096, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': True})
      (2): Normalize()
    ), 'temperature': 0.01}
    
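
CachedGISTEmbedLoss is an in-batch contrastive objective in which a guide model filters out in-batch negatives that it scores as closer to the query than the labeled positive, while gradient caching keeps memory usage manageable at large effective batch sizes. A minimal setup sketch; the guide checkpoint name below is a placeholder, since the card only reports the guide's architecture (a 4096-dimensional Qwen3 embedding model), not its identity:

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CachedGISTEmbedLoss

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

# Placeholder guide model: any strong embedding model with the architecture
# shown above could serve; the exact checkpoint used here is not stated.
guide = SentenceTransformer("Qwen/Qwen3-Embedding-8B")

# temperature matches the reported value; mini_batch_size is illustrative and
# only controls the chunk size used for embedding/gradient caching.
loss = CachedGISTEmbedLoss(model, guide, temperature=0.01, mini_batch_size=8)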

Training Hyperparameters

Non-Default Hyperparameters

  • overwrite_output_dir: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • gradient_accumulation_steps: 16
  • learning_rate: 2e-06
  • weight_decay: 0.01
  • adam_beta2: 0.99
  • adam_epsilon: 1e-07
  • max_grad_norm: 0.3
  • num_train_epochs: 1.0
  • warmup_ratio: 0.1
  • dataloader_num_workers: 16
  • hub_model_id: CocoRoF/POLAR-Qwen3-0.6b-linq-gist
  • prompts: ({'query': 'Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:', 'document': ''},)
  • batch_sampler: no_duplicates
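
Combined with the loss sketch above, a training run using these non-default hyperparameters would look roughly as follows. This is a sketch, not the exact training script: the output directory is illustrative and train_dataset stands for the massive_triplet_v3 split, loaded separately.

from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="outputs/POLAR-Qwen3-0.6b-linq-gist",  # illustrative path
    overwrite_output_dir=True,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=16,
    learning_rate=2e-6,
    weight_decay=0.01,
    adam_beta2=0.99,
    adam_epsilon=1e-7,
    max_grad_norm=0.3,
    num_train_epochs=1.0,
    warmup_ratio=0.1,
    dataloader_num_workers=16,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
    prompts={
        "query": "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:",
        "document": "",
    },
)

trainer = SentenceTransformerTrainer(
    model=model,                  # the model wrapped by the loss above
    args=args,
    train_dataset=train_dataset,  # massive_triplet_v3, loaded separately
    loss=loss,                    # CachedGISTEmbedLoss from the previous sketch
)
trainer.train()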

All Hyperparameters

  • overwrite_output_dir: True
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 16
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-06
  • weight_decay: 0.01
  • adam_beta1: 0.9
  • adam_beta2: 0.99
  • adam_epsilon: 1e-07
  • max_grad_norm: 0.3
  • num_train_epochs: 1.0
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: True
  • dataloader_num_workers: 16
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • tp_size: 0
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: CocoRoF/POLAR-Qwen3-0.6b-linq-gist
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: ({'query': 'Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:', 'document': ''},)
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss
0.0082 1 2.0699
0.0164 2 1.7826
0.0246 3 1.9799
0.0328 4 8.1569
0.0410 5 4.641
0.0492 6 4.847
0.0573 7 8.2247
0.0655 8 8.9525
0.0737 9 4.2975
0.0819 10 6.3088
0.0901 11 5.6983
0.0983 12 4.3867
0.1065 13 6.1817
0.1147 14 6.0226
0.1229 15 15.2869
0.1311 16 11.8965
0.1393 17 9.4219
0.1475 18 5.9216
0.1557 19 6.5436
0.1639 20 5.4599
0.1720 21 4.6468
0.1802 22 4.9366
0.1884 23 4.5267
0.1966 24 4.9044
0.2048 25 4.9682
0.2130 26 4.1537
0.2212 27 4.0729
0.2294 28 3.9093
0.2376 29 3.3863
0.2458 30 3.9228
0.2540 31 2.8689
0.2622 32 3.3243
0.2704 33 2.7494
0.2785 34 3.108
0.2867 35 3.1585
0.2949 36 3.2985
0.3031 37 2.7137
0.3113 38 2.8327
0.3195 39 2.7932
0.3277 40 3.038
0.3359 41 2.769
0.3441 42 2.7036
0.3523 43 3.1873
0.3605 44 2.5984
0.3687 45 2.6836
0.3769 46 3.0616
0.3850 47 2.87
0.3932 48 2.5225
0.4014 49 2.3775
0.4096 50 2.3407
0.4178 51 2.6313
0.4260 52 2.6966
0.4342 53 2.3673
0.4424 54 2.4391
0.4506 55 2.5654
0.4588 56 2.2967
0.4670 57 2.4656
0.4752 58 2.2497
0.4834 59 2.3793
0.4916 60 2.4427
0.4997 61 2.2327
0.5079 62 2.04
0.5161 63 2.2881
0.5243 64 2.0218
0.5325 65 2.3258
0.5407 66 2.1217
0.5489 67 1.9639
0.5571 68 2.1681
0.5653 69 2.1941
0.5735 70 2.1217
0.5817 71 2.1097
0.5899 72 2.1242
0.5981 73 1.9071
0.6062 74 1.8552
0.6144 75 1.8398
0.6226 76 1.9429
0.6308 77 1.6457
0.6390 78 1.656
0.6472 79 1.6597
0.6554 80 1.8188
0.6636 81 2.0348
0.6718 82 1.9511
0.6800 83 1.8009
0.6882 84 1.8279
0.6964 85 1.7993
0.7046 86 1.782
0.7127 87 1.6168
0.7209 88 1.7357
0.7291 89 1.5588
0.7373 90 1.6574
0.7455 91 1.7124
0.7537 92 1.7205
0.7619 93 1.7439
0.7701 94 1.4042
0.7783 95 1.547
0.7865 96 1.5815
0.7947 97 1.4141
0.8029 98 1.3568
0.8111 99 1.5084
0.8193 100 1.4027
0.8274 101 1.4902
0.8356 102 1.317
0.8438 103 1.8041
0.8520 104 1.4397
0.8602 105 1.3406
0.8684 106 1.5127
0.8766 107 1.2449
0.8848 108 1.4508
0.8930 109 1.4171
0.9012 110 1.626
0.9094 111 1.285
0.9176 112 1.2682
0.9258 113 1.5178
0.9339 114 1.3686
0.9421 115 1.227
0.9503 116 1.3685
0.9585 117 1.3253
0.9667 118 1.0893
0.9749 119 1.1753
0.9831 120 1.252
0.9913 121 1.2304
0.9995 122 1.1111

Framework Versions

  • Python: 3.11.11
  • Sentence Transformers: 3.4.1
  • Transformers: 4.51.0
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.6.0
  • Datasets: 3.3.2
  • Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}