ModernBERT-korean-large-preview / README.md

sigridjineth

metadata

language:
  - ko
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:1120235
  - loss:CachedMultipleNegativesRankingLoss
base_model: answerdotai/ModernBERT-large
widget:
  - source_sentence: 나, 가스불, 찻물, 올리다
    sentences:
      - 나는 가스불에 꽃을 넣은 찻물을 올렸다.
      - 과제수행 기간중에 연구 현장에 대해 정기점검을 실시하고, 과제 수행 종료 후에도 일정한 안전조치를 이행하도록 규정한다.
      - 고기, 상추, 밥, 나, 올리다
  - source_sentence: 파란색 데님 재킷을 입은 여성과 검은색 코트를 입은 여성이 일본 식당 앞에 서 있다.
    sentences:
      - >-
        복합 도금된 시편의 표면과 조성은 전계방출 주사전자현미경(field emission scanning electron
        microscopy,FESEM)과 에너지 분산형 X-선 분광기(energy dispersivespectroscopy, EDS)를
        이용하여 분석하였다.
      - 재킷을 입은 두 여자가 식당 밖에 서 있다.
      - 두 여자가 식당 밖에서 음식을 먹는다
  - source_sentence: 한 남자가 암벽을 오르고 다른 남자가 아래에 있다.
    sentences:
      - 남자가 암벽을 기어오르다
      - 담당 공무원들은 보호 관찰 대상자를 정기적으로 상담을 했다.
      - 한 남자가 암벽에 오른다.
  - source_sentence: 골목, 동네, 동, 나누다, 크다, 서
    sentences:
      - 큰 골목이 우리 동네를 동과 서로 나눠 놓았다.
      - 내 아내는 몸에 좋은 음식을 항상 만들어 주었다.
      - 골목, 많다, 공간, 놀이, 골목
  - source_sentence: 한 소녀가 자전거를 타고 있고 모든 사람들이 도시에서 그녀에게 달려들고 있다.
    sentences:
      - 소녀는 자전거를 탄다
      - 소녀가 자전거를 타고 있다.
      - >-
        그리고 특수한 소재의 광섬유를 이용한 온도센서는 감도가 고정되는 단점이 있고, 간섭계형 온도센서는 높은 감도의 장점을 가지지만,
        2차 코팅이 이루어지지 않은 광섬유 센서나 팁기반 광섬유 센서는 일반적으로 클래드를 제거하여 융착(splicing)을 하기 때문에
        취급상에 불편함과 파손되기 쉬운 단점을 가지고 있다.
datasets:
  - sigridjineth/korean_nli_dataset_reranker_v1
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - cosine_accuracy
model-index:
  - name: SentenceTransformer based on answerdotai/ModernBERT-large
    results:
      - task:
          type: triplet
          name: Triplet
        dataset:
          name: dev eval
          type: dev-eval
        metrics:
          - type: cosine_accuracy
            value: 0.877
            name: Cosine Accuracy

SentenceTransformer based on answerdotai/ModernBERT-large

This is a sentence-transformers model fine-tuned from answerdotai/ModernBERT-large on the korean_nli_dataset_reranker_v1 dataset. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and similar tasks.
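A minimal usage sketch follows, assuming the model is published under the repository id shown in the evaluation table of this card and that the installed Transformers release supports ModernBERT (the framework versions listed in this card were used for training). The helper names `load_model` and `rank_by_cosine` are illustrative and not part of any library.

```python
import numpy as np

# Repository id as it appears in the evaluation table of this card.
MODEL_ID = "sigridjineth/ModernBERT-korean-large-preview"


def load_model(model_id=MODEL_ID):
    """Load the model. The import is deferred so the helper below also works
    with any other object that exposes an `encode` method."""
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_id)


def rank_by_cosine(model, query, candidates):
    """Return (candidate, cosine score) pairs, most similar first.

    `model` is anything with an `encode(list_of_str)` method returning one
    vector per input, which SentenceTransformer.encode does by default.
    """
    vectors = np.asarray(model.encode([query] + list(candidates)), dtype=np.float64)
    # L2-normalise so that the dot product equals cosine similarity.
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = vectors[1:] @ vectors[0]
    order = np.argsort(-scores)
    return [(candidates[i], float(scores[i])) for i in order]
```

For example, `rank_by_cosine(load_model(), "한 남자가 암벽을 오르고 다른 남자가 아래에 있다.", ["남자가 암벽을 기어오르다", "담당 공무원들은 보호 관찰 대상자를 정기적으로 상담을 했다.", "한 남자가 암벽에 오른다."])` scores one of the widget queries against its three candidate sentences. No scores are quoted here because none are reported in this card.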

Model Details

Model Description

Model Type: Sentence Transformer
Base model: answerdotai/ModernBERT-large
Maximum Sequence Length: 8192 tokens
Output Dimensionality: 1024 dimensions
Similarity Function: Cosine Similarity
Training Dataset: korean_nli_dataset_reranker_v1
Language: ko

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
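The Pooling module above has only `pooling_mode_mean_tokens` enabled, so the sentence embedding is the average of the token embeddings over non-padding positions. The following NumPy sketch illustrates that operation on toy shapes; it is not the library's code, and the function name `mean_pool` is an assumption made for illustration.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over positions where attention_mask == 1.

    token_embeddings: array of shape (batch, seq_len, dim)
    attention_mask:   array of shape (batch, seq_len) holding 0 or 1
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for rows that are entirely padding.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# Toy input: one sequence of three tokens, the last of which is padding.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = mean_pool(tokens, mask)  # the padded token does not affect the mean
```

In the actual model the last dimension is 1024 and sequences may be up to 8192 tokens long; the toy shapes here are only for readability.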

Evaluation

Metrics

AutoRAG Retrieval

| Metric | sigridjineth/ModernBERT-korean-large-preview (241225) | Alibaba-NLP/gte-multilingual-base | answerdotai/ModernBERT-large |
|---|---|---|---|
| NDCG@10 | 0.72503 | 0.77108 | 0.0 |
| Recall@10 | 0.87719 | 0.93860 | 0.0 |
| Precision@1 | 0.57018 | 0.59649 | 0.0 |
| NDCG@100 | 0.74543 | 0.78411 | 0.01565 |
| Recall@100 | 0.98246 | 1.0 | 0.09649 |
| Recall@1000 | 1.0 | 1.0 | 1.0 |
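For readers unfamiliar with the metrics in this table, the sketch below computes Recall@k, Precision@k, and NDCG@k for a single query using the common binary-relevance definitions. It is an illustration only: the evaluation tool used for the numbers above may differ in details such as tie handling or averaging, and the function names and document ids are hypothetical.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking for one query whose only relevant document is "d1".
ranked = ["d3", "d1", "d7"]
relevant = {"d1"}
```

The reported figures would then be averages of such per-query values over the evaluation set.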

Triplet

Dataset: dev-eval
Evaluated with TripletEvaluator

| Metric | Value |
|---|---|
| cosine_accuracy | 0.877 |
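As a rough illustration of what cosine accuracy measures on triplets, the sketch below counts the fraction of (anchor, positive, negative) triplets in which the anchor is more cosine-similar to the positive than to the negative. It uses NumPy on hand-made vectors rather than TripletEvaluator itself, and the function names are assumptions made for illustration.

```python
import numpy as np


def cosine_sim_rows(a, b):
    """Row-wise cosine similarity between two arrays of shape (n, dim)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)


def triplet_cosine_accuracy(anchors, positives, negatives):
    """Share of triplets where sim(anchor, positive) > sim(anchor, negative)."""
    pos = cosine_sim_rows(anchors, positives)
    neg = cosine_sim_rows(anchors, negatives)
    return float(np.mean(pos > neg))


# Two toy triplets: the first is ranked correctly, the second is not.
anchors = np.array([[1.0, 0.0], [0.0, 1.0]])
positives = np.array([[1.0, 0.1], [1.0, 0.0]])
negatives = np.array([[0.0, 1.0], [0.0, 1.0]])
accuracy = triplet_cosine_accuracy(anchors, positives, negatives)
```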

Training Details

Training Dataset

Size: 1,120,235 training samples
Columns: query, positive, and negative

Approximate statistics based on the first 1000 samples:

| | query | positive | negative |
|---|---|---|---|
| type | string | string | string |
| details | min: 5 tokens, mean: 55.49 tokens, max: 476 tokens | min: 5 tokens, mean: 186.0 tokens, max: 1784 tokens | min: 9 tokens, mean: 120.54 tokens, max: 2383 tokens |

Samples:

| query | positive | negative |
|---|---|---|
| `양복을 입은 노인이 짐을 뒤로 끌고 간다.` | `양복을 입은 남자` | `옷을 입은 노인` |
| `한국의 제1위 서비스 수출 시장은 중국이니` | `중국은 세계 제2위의 서비스 교역국이자 우리나라의 제1위 서비스 수출 시장*으로서, 2016년 중국의 서비스교역 규모는 6,571억불로 미국(12,145억불)에 이어 세계 2위 중국 서비스산업의 GDP대비 비중은 2015년 50% 돌파, 서비스산업 성장률 98.3%) > GDP 성장률(6.9%) ** 2016년 서비스 분야 우리의 對中수출(206억불)은 對세계수출(949억불)의 22% ㅇ 네거티브 방식의 포괄적인 서비스 투자 개방 협정이 중국과 체결될 경우, 양국간 상호 서비스 시장 개방 수준을 높이고, 우리 투자 기업에 대한 실질적 보호를 한층 강화할 수 있을 것으로 기대된다.` | `우리나라에서 중국으로 수출되는 제품은 점점 계속 증가하고 있다.` |
| `아버지, 병원, 치료, 받다, 결심하다` | `너무나 아팠던 아버지는 병원에서 치료를 받기로 결심했다.` | `요즘, 아버지, 건강, 걱정` |

Loss: CachedMultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
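To make the two parameters concrete, the sketch below computes the value of a multiple-negatives ranking loss on (query, positive, negative) triplets with `scale = 20.0` and cosine similarity: each query's own positive is the target, and every other positive and every negative in the batch serves as a distractor. This is a NumPy illustration of the objective under those assumptions, not the library implementation; the cached variant is designed to reach the same objective while processing the batch in smaller chunks to save memory, which is not shown here. The function name `mnr_loss` is hypothetical.

```python
import numpy as np


def mnr_loss(anchors, positives, negatives, scale=20.0):
    """Cross-entropy over scaled cosine similarities with in-batch negatives.

    anchors, positives, negatives: arrays of shape (batch, dim).
    """

    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    a = normalize(anchors)
    # Candidates for every anchor: all positives followed by all negatives.
    candidates = normalize(np.concatenate([positives, negatives], axis=0))
    logits = scale * (a @ candidates.T)  # shape (batch, 2 * batch)
    # Numerically stable log-softmax along the candidate axis.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct candidate for anchor i is positive i, at column i.
    idx = np.arange(len(a))
    return float(-log_probs[idx, idx].mean())


# Toy batch where each anchor matches its own positive exactly.
well_separated = mnr_loss(np.eye(2), np.eye(2), -np.eye(2))
# Same batch with the positives swapped, so every target is a poor match.
mismatched = mnr_loss(np.eye(2), np.eye(2)[::-1], -np.eye(2))
```

A larger scale sharpens the softmax, so a small gap in cosine similarity between the positive and the distractors already drives the loss toward zero.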

Training Logs

| Epoch | Step | dev-eval_cosine_accuracy |
|---|---|---|
| 0 | 0 | 0.331 |
| 4.8783 | 170 | 0.877 |

Framework Versions

Python: 3.11.9
Sentence Transformers: 3.3.1
Transformers: 4.48.0.dev0
PyTorch: 2.3.0+cu121
Accelerate: 1.2.1
Datasets: 3.2.0
Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

CachedMultipleNegativesRankingLoss

@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}