---
tags:
- sentence-transformers
- sentence-similarity
- dataset_size:901028
- loss:CosineSimilarityLoss
base_model: Shuu12121/CodeModernBERT-Owl
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- pearson_cosine
- accuracy
- f1
model-index:
- name: SentenceTransformer based on Shuu12121/CodeModernBERT-Owl
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: val
      type: val
    metrics:
    - type: pearson_cosine
      value: 0.9481467499740959
      name: Training Pearson Cosine
    - type: accuracy
      value: 0.9900051996071408
      name: Test Accuracy
    - type: f1
      value: 0.963323498754483
      name: Test F1 Score
license: apache-2.0
datasets:
- google/code_x_glue_cc_clone_detection_big_clone_bench
---

# SentenceTransformer based on `Shuu12121/CodeModernBERT-Owl🦉`

This model is a SentenceTransformer fine-tuned from [`Shuu12121/CodeModernBERT-Owl🦉`](https://huggingface.co/Shuu12121/CodeModernBERT-Owl) on the [BigCloneBench](https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_big_clone_bench) dataset for **code clone detection**. It maps code snippets into a 768-dimensional dense vector space for semantic similarity tasks.


## 🎯 Distinctive Performance and Stability

This model reaches about **99% accuracy and an F1 score of about 0.96** on the BigCloneBench test split.  
A noteworthy characteristic is that **changing the similarity threshold has minimal impact on classification performance**: across thresholds from 0.5 to 0.98, accuracy moves by less than 0.3 points and F1 by about 1 point.  
This suggests that the model has learned to **clearly separate clones from non-clones**, with few pairs scoring in the ambiguous middle range.

| Threshold         | Accuracy          | F1 Score           |
|-------------------|-------------------|--------------------|
| 0.5               | 0.9900            | 0.9633             |
| 0.85              | 0.9903            | 0.9641             |
| 0.90              | 0.9902            | 0.9637             |
| 0.95              | 0.9887            | 0.9579             |
| 0.98              | 0.9879            | 0.9540             |

- **High Stability**: Across thresholds from 0.5 to 0.98, accuracy stays between 0.9879 and 0.9903 and F1 between 0.9540 and 0.9641.  
  _(This suggests that code pairs considered clones generally score between 0.9 and 1.0 in cosine similarity.)_

- **Tolerant of Threshold Tuning**: If the similarity threshold is adjusted for a different task or environment, the results on this benchmark indicate that performance should not degrade sharply.
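
A threshold sweep like the one in the table can be reproduced on any list of cosine scores and labels. The sketch below is a minimal, dependency-free illustration: the function names `accuracy_and_f1` and `sweep_thresholds` are hypothetical, and the toy scores are made up for the example rather than produced by the model.

```python
from typing import List, Sequence, Tuple


def accuracy_and_f1(scores: Sequence[float], labels: Sequence[int],
                    threshold: float) -> Tuple[float, float]:
    """Predict "clone" when score > threshold, then compute accuracy and F1."""
    preds = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    accuracy = correct / len(labels)
    # F1 = 2TP / (2TP + FP + FN); reported as 0.0 when the denominator is zero.
    denom = 2 * tp + fp + fn
    f1 = (2 * tp / denom) if denom else 0.0
    return accuracy, f1


def sweep_thresholds(scores: Sequence[float], labels: Sequence[int],
                     thresholds=(0.5, 0.85, 0.90, 0.95, 0.98)) -> List[Tuple[float, float, float]]:
    """Return (threshold, accuracy, f1) for each threshold."""
    return [(t, *accuracy_and_f1(scores, labels, t)) for t in thresholds]


# Toy data for illustration only (not model output).
toy_scores = [0.99, 0.97, 0.93, 0.40, 0.10, 0.05]
toy_labels = [1, 1, 1, 0, 0, 0]

for t, acc, f1 in sweep_thresholds(toy_scores, toy_labels):
    print(f"threshold={t:.2f}  accuracy={acc:.4f}  f1={f1:.4f}")
```

With real data, `scores` would be the per-pair cosine similarities and `labels` the 0/1 clone labels from the test split.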


## 📌 Model Overview

- **Architecture**: Sentence-BERT (SBERT)
- **Base Model**: `Shuu12121/CodeModernBERT-Owl`
- **Output Dimension**: 768
- **Max Sequence Length**: 2048 tokens
- **Pooling Method**: CLS token pooling
- **Similarity Function**: Cosine Similarity
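
To make the pooling and similarity settings concrete, here is a minimal sketch in plain Python. It assumes token embeddings are given as a list of vectors with the CLS position first. The helper names `cls_pool` and `cosine` are illustrative, and the toy vectors have 3 dimensions rather than the model's 768.

```python
import math
from typing import List, Sequence


def cls_pool(token_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """CLS pooling: the first token's vector represents the whole input."""
    return list(token_embeddings[0])


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy token embeddings; the first row of each stands for the CLS position.
snippet_a = [[1.0, 0.0, 1.0], [0.2, 0.9, 0.1]]
snippet_b = [[2.0, 0.0, 2.0], [0.7, 0.1, 0.3]]

# Only the CLS rows are compared; they point in the same direction here,
# so the similarity is (up to floating-point rounding) 1.
print(cosine(cls_pool(snippet_a), cls_pool(snippet_b)))
```

Because only direction matters for cosine similarity, scaling an embedding does not change the score.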

---

## 🏋️‍♂️ Training Configuration

- **Loss Function**: `CosineSimilarityLoss`
- **Epochs**: 1
- **Batch Size**: 32
- **Warmup Steps**: 3% of training steps
- **Evaluator**: `EmbeddingSimilarityEvaluator` (on validation)
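
As a rough illustration of these settings, the sketch below shows what `CosineSimilarityLoss` computes with its default configuration (mean squared error between the cosine similarity of a pair and its label) and how a 3% warmup translates into a step count. The example count of 901,028 comes from the `dataset_size` tag of this card. Keeping the final partial batch and truncating the warmup count are assumptions, and the helper names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cosine_similarity_loss(pairs, labels) -> float:
    """Mean squared error between cos(u, v) and the target label of each pair."""
    errors = [(cosine(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)


def warmup_schedule(num_examples: int, batch_size: int, epochs: int,
                    warmup_fraction: float) -> Tuple[int, int]:
    """Return (total_steps, warmup_steps).

    Assumes the last partial batch is kept and the warmup count is truncated.
    """
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    return total_steps, int(total_steps * warmup_fraction)


# A clone pair (label 1.0) with identical embeddings and a non-clone pair
# (label 0.0) with orthogonal embeddings both incur zero loss.
toy_pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
print(cosine_similarity_loss(toy_pairs, [1.0, 0.0]))

total, warmup = warmup_schedule(901_028, batch_size=32, epochs=1, warmup_fraction=0.03)
print(total, warmup)  # 28158 total steps, 844 warmup steps under these assumptions
```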

---

## 📊 Evaluation Metrics

| Metric                          | Score              |
|---------------------------------|--------------------|
| Pearson Cosine (Train)          | `0.9481`           |
| Accuracy (Test, threshold 0.90) | `0.9902`           |
| F1 Score (Test, threshold 0.90) | `0.9637`           |

---

## 📚 Dataset

- [Google BigCloneBench](https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_big_clone_bench)

---

## 🧪 How to Use

```python
from sentence_transformers import SentenceTransformer
from torch.nn.functional import cosine_similarity
import torch

# Load the fine-tuned model
model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl")

# Two code snippets to compare
code1 = "def add(a, b): return a + b"
code2 = "def sum(x, y): return x + y"

# Encode the code snippets
embeddings = model.encode([code1, code2], convert_to_tensor=True)

# Compute cosine similarity
similarity_score = cosine_similarity(embeddings[0].unsqueeze(0), embeddings[1].unsqueeze(0)).item()

# Print the result
print(f"Cosine Similarity: {similarity_score:.4f}")
if similarity_score >= 0.9:
    print("🟢 These code snippets are considered CLONES.")
else:
    print("🔴 These code snippets are NOT considered clones.")
```

## 🧪 How to Test

```python
# Requires: pip install -U sentence-transformers datasets scikit-learn

from sentence_transformers import SentenceTransformer
from datasets import load_dataset
import torch
from sklearn.metrics import accuracy_score, f1_score

# --- Load the dataset ---
ds_test = load_dataset("google/code_x_glue_cc_clone_detection_big_clone_bench", split="test")

model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl")
# Use the GPU when available; fall back to CPU otherwise
model.to("cuda" if torch.cuda.is_available() else "cpu")


test_sentences1 = ds_test["func1"]
test_sentences2 = ds_test["func2"]
test_labels = ds_test["label"]

batch_size = 256  # Adjust to fit GPU memory

print("Encoding sentences1...")

embeddings1 = model.encode(
    test_sentences1,
    convert_to_tensor=True,
    batch_size=batch_size,
    show_progress_bar=True
)

print("Encoding sentences2...")
embeddings2 = model.encode(
    test_sentences2,
    convert_to_tensor=True,
    batch_size=batch_size,
    show_progress_bar=True
)

print("Calculating cosine scores...")
cosine_scores = torch.nn.functional.cosine_similarity(embeddings1, embeddings2)

# Set the threshold (0.9 is used here)
threshold = 0.9
print(f"Using threshold: {threshold}")
predictions = (cosine_scores > threshold).long().cpu().numpy()

accuracy = accuracy_score(test_labels, predictions)
f1 = f1_score(test_labels, predictions)
print("Test Accuracy:", accuracy)
print("Test F1 Score:", f1)

```
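
If the same function appears in many test pairs (an assumption worth checking on your copy of the data), encoding each distinct snippet once can cut evaluation time. The sketch below shows the idea with a hypothetical `encode_unique` helper and a stand-in encoder, so it runs without downloading the model.

```python
from typing import Callable, Dict, List, Sequence


def encode_unique(texts: Sequence[str],
                  encode_fn: Callable[[List[str]], Sequence]) -> list:
    """Encode each distinct text once, then expand back to the original order."""
    index_of: Dict[str, int] = {}
    unique: List[str] = []
    for text in texts:
        if text not in index_of:
            index_of[text] = len(unique)
            unique.append(text)
    unique_embeddings = encode_fn(unique)
    return [unique_embeddings[index_of[text]] for text in texts]


# Stand-in encoder for illustration: records batch sizes and returns
# a one-dimensional "embedding" equal to the text length.
calls: List[int] = []


def fake_encode(batch: List[str]) -> List[List[float]]:
    calls.append(len(batch))
    return [[float(len(t))] for t in batch]


out = encode_unique(["a", "bb", "a", "a"], fake_encode)
print(out)    # [[1.0], [2.0], [1.0], [1.0]]
print(calls)  # [2] -> only two distinct snippets were encoded
```

With the real model, `encode_fn` could be something like `lambda batch: model.encode(batch, batch_size=256)`. The result is then a Python list of per-snippet vectors, which would need stacking before a batched cosine computation.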

## 🛠️ Model Architecture

```python
SentenceTransformer(
  (0): Transformer({'max_seq_length': 2048}) with model 'ModernBertModel'
  (1): Pooling({
        'word_embedding_dimension': 768,
        'pooling_mode_cls_token': True,
        ...
  })
)
```

---

## 📦 Dependencies

- Python: `3.11.11`
- sentence-transformers: `4.0.1`
- transformers: `4.50.3`
- torch: `2.6.0+cu124`
- datasets: `3.5.0`
- tokenizers: `0.21.1`
- flash-attn: ✅ Installed

### Install Required Libraries

```bash
pip install -U sentence-transformers "transformers>=4.48.0" flash-attn datasets
```
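
After installing, a quick check can confirm that the installed `transformers` meets the `4.48.0` minimum used in the command above. This is a simplified sketch: the helpers `parse_release`, `meets_minimum`, and `report` are hypothetical, and the comparison looks only at the leading numeric part of a version string, ignoring pre-release and local suffixes.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Tuple


def parse_release(v: str) -> Tuple[int, ...]:
    """Keep the leading numeric release part, e.g. '2.6.0+cu124' -> (2, 6, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", v)
    return tuple(int(p) for p in match.group(0).split(".")) if match else ()


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare release tuples after padding them with zeros to equal length."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def report(package: str, minimum: str) -> str:
    """Describe whether `package` is installed and at least `minimum`."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return f"{package}: not installed"
    status = "ok" if meets_minimum(installed, minimum) else f"older than {minimum}"
    return f"{package} {installed}: {status}"


print(report("transformers", "4.48.0"))
```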

---

## 🔐 Optional: Authentication

```python
from huggingface_hub import login
login("your_huggingface_token")

import wandb
wandb.login(key="your_wandb_token")
```

---

## 🧾 Citation

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "EMNLP 2019",
    url = "https://arxiv.org/abs/1908.10084"
}
```

---

## 🔓 License

Apache License 2.0