---
tags:
- sentence-transformers
- sentence-similarity
- dataset_size:901028
- loss:CosineSimilarityLoss
base_model: Shuu12121/CodeModernBERT-Owl
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- pearson_cosine
- accuracy
- f1
model-index:
- name: SentenceTransformer based on Shuu12121/CodeModernBERT-Owl
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: val
      type: val
    metrics:
    - type: pearson_cosine
      value: 0.9481467499740959
      name: Training Pearson Cosine
    - type: accuracy
      value: 0.9900051996071408
      name: Test Accuracy
    - type: f1
      value: 0.963323498754483
      name: Test F1 Score
license: apache-2.0
datasets:
- google/code_x_glue_cc_clone_detection_big_clone_bench
---

# SentenceTransformer based on `Shuu12121/CodeModernBERT-Owl🦉`

This model is a SentenceTransformer fine-tuned from [`Shuu12121/CodeModernBERT-Owl🦉`](https://huggingface.co/Shuu12121/CodeModernBERT-Owl) on the [BigCloneBench](https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_big_clone_bench) dataset for **code clone detection**. It maps code snippets into a 768-dimensional dense vector space for semantic similarity tasks.

## 🎯 Distinctive Performance and Stability

This model reaches roughly **0.99 accuracy and 0.96 F1** on the BigCloneBench test split. One particularly noteworthy characteristic is that **changing the similarity threshold has minimal impact on classification performance**. This indicates that the model has learned to **clearly separate clones from non-clones**, resulting in a **stable and reliable similarity score distribution**.

| Threshold | Accuracy | F1 Score |
|-----------|----------|----------|
| 0.50      | 0.9900   | 0.9633   |
| 0.85      | 0.9903   | 0.9641   |
| 0.90      | 0.9902   | 0.9637   |
| 0.95      | 0.9887   | 0.9579   |
| 0.98      | 0.9879   | 0.9540   |

- **High Stability**: Between thresholds of 0.85 and 0.98, accuracy stays within about 0.25 percentage points (0.9879 to 0.9903) and F1 within about 1 point (0.9540 to 0.9641).
  _(This suggests that code pairs considered clones generally score between 0.9 and 1.0 in cosine similarity.)_
- **Reliable in Real-World Applications**: Even if the similarity threshold is adjusted slightly for a different task or environment, performance stays consistent without significant degradation.

## 📌 Model Overview

- **Architecture**: Sentence-BERT (SBERT)
- **Base Model**: `Shuu12121/CodeModernBERT-Owl`
- **Output Dimension**: 768
- **Max Sequence Length**: 2048 tokens
- **Pooling Method**: CLS token pooling
- **Similarity Function**: Cosine Similarity

---

## 🏋️‍♂️ Training Configuration

- **Loss Function**: `CosineSimilarityLoss`
- **Epochs**: 1
- **Batch Size**: 32
- **Warmup Steps**: 3% of training steps
- **Evaluator**: `EmbeddingSimilarityEvaluator` (on validation)

---

## 📊 Evaluation Metrics

| Metric                          | Score    |
|---------------------------------|----------|
| Pearson Cosine (Train)          | `0.9481` |
| Accuracy (Test, threshold 0.90) | `0.9902` |
| F1 Score (Test, threshold 0.90) | `0.9637` |

---

## 📚 Dataset

- [Google BigCloneBench](https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_big_clone_bench)

---

## 🧪 How to Use

```python
from sentence_transformers import SentenceTransformer
from torch.nn.functional import cosine_similarity

# Load the fine-tuned model
model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl")

# Two code snippets to compare
code1 = "def add(a, b): return a + b"
code2 = "def sum(x, y): return x + y"

# Encode the code snippets
embeddings = model.encode([code1, code2], convert_to_tensor=True)

# Compute cosine similarity
similarity_score = cosine_similarity(embeddings[0].unsqueeze(0), embeddings[1].unsqueeze(0)).item()

# Print the result
print(f"Cosine Similarity: {similarity_score:.4f}")
if similarity_score >= 0.9:
    print("🟢 These code snippets are considered CLONES.")
else:
    print("🔴 These code snippets are NOT considered clones.")
```

## 🧪 How to Test

```python
# Notebook cell: install dependencies (drop the leading "!" in a regular shell)
!pip install -U sentence-transformers datasets scikit-learn

from sentence_transformers import SentenceTransformer
from datasets import load_dataset
import torch
from sklearn.metrics import accuracy_score, f1_score

# --- Load the dataset ---
ds_test = load_dataset("google/code_x_glue_cc_clone_detection_big_clone_bench", split="test")

# Load the model and move it to the GPU when one is available
model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

test_sentences1 = ds_test["func1"]
test_sentences2 = ds_test["func2"]
test_labels = ds_test["label"]

batch_size = 256  # adjust to fit your GPU memory

print("Encoding sentences1...")
embeddings1 = model.encode(
    test_sentences1,
    convert_to_tensor=True,
    batch_size=batch_size,
    show_progress_bar=True
)

print("Encoding sentences2...")
embeddings2 = model.encode(
    test_sentences2,
    convert_to_tensor=True,
    batch_size=batch_size,
    show_progress_bar=True
)

# Row-wise cosine similarity between each func1/func2 pair
print("Calculating cosine scores...")
cosine_scores = torch.nn.functional.cosine_similarity(embeddings1, embeddings2)

# Set the threshold (0.9 is used here)
threshold = 0.9
print(f"Using threshold: {threshold}")

# Pairs scoring above the threshold are predicted as clones
predictions = (cosine_scores > threshold).long().cpu().numpy()

accuracy = accuracy_score(test_labels, predictions)
f1 = f1_score(test_labels, predictions)

print("Test Accuracy:", accuracy)
print("Test F1 Score:", f1)
```

## 🛠️ Model Architecture

```python
SentenceTransformer(
  (0): Transformer({'max_seq_length': 2048}) with model 'ModernBertModel'
  (1): Pooling({
    'word_embedding_dimension': 768,
    'pooling_mode_cls_token': True,
    ...
  })
)
```

---

## 📦 Dependencies

- Python: `3.11.11`
- sentence-transformers: `4.0.1`
- transformers: `4.50.3`
- torch: `2.6.0+cu124`
- datasets: `3.5.0`
- tokenizers: `0.21.1`
- flash-attn: ✅ Installed

### Install Required Libraries

```bash
# Quote the version specifier so the shell does not treat ">" as a redirect
pip install -U sentence-transformers "transformers>=4.48.0" flash-attn datasets
```

---

## 🔐 Optional: Authentication

```python
from huggingface_hub import login
login("your_huggingface_token")

import wandb
wandb.login(key="your_wandb_token")
```

---

## 🧾 Citation

```bibtex
@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "EMNLP 2019",
  url = "https://arxiv.org/abs/1908.10084"
}
```

---

## 🔓 License

Apache License 2.0
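## 🧪 How to Sweep Thresholds

The threshold table near the top of this card can be approximated by scoring one set of cosine similarities at several cut-offs. The sketch below is a minimal, dependency-free illustration and not the script used to produce the reported numbers: the helper names `accuracy_and_f1` and `sweep_thresholds` and the toy scores are hypothetical. It assumes the same rule as the test script, where a pair is predicted as a clone when its score is strictly greater than the threshold.

```python
from typing import List, Sequence, Tuple


def accuracy_and_f1(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float]:
    """Predict clone (1) when score > threshold, then return (accuracy, F1)."""
    preds = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    accuracy = (tp + tn) / len(labels)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # define F1 as 0 when there are no positives at all
    return accuracy, f1


def sweep_thresholds(
    scores: Sequence[float], labels: Sequence[int], thresholds: Sequence[float]
) -> List[Tuple[float, float, float]]:
    """Return one (threshold, accuracy, F1) row per threshold."""
    return [(t, *accuracy_and_f1(scores, labels, t)) for t in thresholds]


# Hypothetical toy data, NOT real model output: three clone pairs, three non-clone pairs
toy_scores = [0.99, 0.97, 0.93, 0.40, 0.10, 0.05]
toy_labels = [1, 1, 1, 0, 0, 0]

for t, acc, f1 in sweep_thresholds(toy_scores, toy_labels, [0.5, 0.85, 0.90, 0.95, 0.98]):
    print(f"threshold={t:.2f}  accuracy={acc:.4f}  f1={f1:.4f}")
```

To sweep real predictions, pass `cosine_scores.cpu().tolist()` and `ds_test["label"]` from the test script in place of the toy lists.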