---
license: apache-2.0
datasets:
- Shuu12121/rust-codesearch-dataset-open
- Shuu12121/java-codesearch-dataset-open
- code-search-net/code_search_net
- google/code_x_glue_ct_code_to_text
language:
- en
pipeline_tag: sentence-similarity
tags:
- code
- code-search
- retrieval
- sentence-similarity
- bert
- transformers
- deep-learning
- machine-learning
- nlp
- programming
- multi-language
- rust
- python
- java
- javascript
- php
- ruby
- go
---
# **CodeModernBERT-Owl**
## **Overview**
### **🦉 CodeModernBERT-Owl: A High-Accuracy Code Search & Code Understanding Model**
**CodeModernBERT-Owl** is a **pretrained model** designed from scratch for **code search and code understanding tasks**.
Compared to previous versions such as **CodeHawks-ModernBERT** and **CodeMorph-ModernBERT**, this model **now supports Rust** and **improves search accuracy** in Python, PHP, Java, JavaScript, Go, and Ruby.
### **🛠 Key Features**
✅ **Supports long sequences up to 2048 tokens** (versus the 512-token limit of Microsoft's CodeBERT and GraphCodeBERT)
✅ **Optimized for code search, code understanding, and code clone detection**
✅ **Fine-tuned on GitHub open-source repositories (Java, Rust)**
✅ **Achieves the highest accuracy among the CodeHawks/CodeMorph series**
✅ **Multi-language support**: **Python, PHP, Java, JavaScript, Go, Ruby, and Rust**
---
## **📊 Model Parameters**
| Parameter | Value |
|-------------------------|------------|
| **vocab_size** | 50,004 |
| **hidden_size** | 768 |
| **num_hidden_layers** | 12 |
| **num_attention_heads**| 12 |
| **intermediate_size** | 3,072 |
| **max_position_embeddings** | 2,048 |
| **type_vocab_size** | 2 |
| **hidden_dropout_prob**| 0.1 |
| **attention_probs_dropout_prob** | 0.1 |
| **local_attention_window** | 128 |
| **rope_theta** | 160,000 |
| **local_attention_rope_theta** | 10,000 |
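
Two quantities follow directly from the table: the width of each attention head and the size of the token embedding matrix. The sketch below only does that arithmetic on values copied from the table; the helper names `head_dim` and `token_embedding_params` are illustrative and not part of any library.

```python
# Values copied from the parameter table above.
config = {
    "vocab_size": 50_004,
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "intermediate_size": 3_072,
    "max_position_embeddings": 2_048,
}


def head_dim(cfg):
    """Per-head width; the hidden size must split evenly across the heads."""
    if cfg["hidden_size"] % cfg["num_attention_heads"] != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    return cfg["hidden_size"] // cfg["num_attention_heads"]


def token_embedding_params(cfg):
    """Number of weights in the token embedding matrix (vocab_size x hidden_size)."""
    return cfg["vocab_size"] * cfg["hidden_size"]


print(head_dim(config))                # 64
print(token_embedding_params(config))  # 38403072
```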
---
## **💻 How to Use**
This model can be easily loaded using the **Hugging Face Transformers** library.
⚠️ **Requires transformers >= 4.48.0**
🔗 **[Colab Demo (replace the model name in the notebook with "Shuu12121/CodeModernBERT-Owl")](https://github.com/Shun0212/CodeBERTPretrained/blob/main/UseMyCodeMorph_ModernBERT.ipynb)**
### **Load the Model**
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Owl")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeModernBERT-Owl")
```
### **Get Code Embeddings**
```python
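import torch


def get_embedding(text, model, tokenizer, device=None):
    """Return the first-token embedding of `text`, shape (1, hidden_size)."""
    # Fall back to CPU when no GPU is available.
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device).eval()
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    # Drop token_type_ids if the tokenizer returns them; the encoder is called without them.
    if "token_type_ids" in inputs:
        inputs.pop("token_type_ids")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        # model.model is the base encoder inside the masked-LM wrapper.
        outputs = model.model(**inputs)
    # Use the hidden state of the first token as the embedding.
    return outputs.last_hidden_state[:, 0, :]


embedding = get_embedding("def my_function(): pass", model, tokenizer)
print(embedding.shape)  # expected: torch.Size([1, 768]), given hidden_size = 768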
```
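
Once embeddings are available, a code search reduces to ranking candidate snippets by their similarity to the query embedding. The following is a minimal sketch that assumes cosine similarity as the scoring function; this card does not state which similarity was used for the reported results. The names `cosine_similarity` and `rank_candidates` are hypothetical helpers, and the toy 3-dimensional vectors stand in for real 768-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as dissimilar to everything.
    return dot / (norm_a * norm_b)


def rank_candidates(query_vec, candidate_vecs):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, c) for c in candidate_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for real embeddings.
query = [1.0, 0.0, 1.0]
candidates = [[0.0, 1.0, 0.0], [1.0, 0.1, 0.9], [0.5, 0.5, 0.0]]
print(rank_candidates(query, candidates))  # [1, 2, 0]
```

To feed real embeddings into this sketch, a tensor returned by `get_embedding` can be converted with `embedding[0].cpu().tolist()`.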
---
## **🔍 Evaluation Results**
### **Dataset**
📌 **Tested on code_x_glue_ct_code_to_text with a candidate pool size of 100.**
📌 **Rust-specific evaluations were conducted using Shuu12121/rust-codesearch-dataset-open.**
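
A candidate pool of size 100 typically means that each query is ranked against its correct snippet plus 99 distractors. The sketch below shows one way such a pool could be drawn reproducibly. It is an assumption about the setup, not the evaluation script used for this card: the function name `build_candidate_pool`, the seed value `42`, and the uniform sampling of distractors are all illustrative.

```python
import random


def build_candidate_pool(correct_index, corpus_size, pool_size=100, seed=42):
    """Return `pool_size` corpus indices: the correct one plus random distractors."""
    if not 0 <= correct_index < corpus_size:
        raise ValueError("correct_index must be a valid corpus index")
    if pool_size > corpus_size:
        raise ValueError("pool_size cannot exceed corpus_size")
    rng = random.Random(seed)  # A fixed seed keeps the pool identical across runs.
    distractors = [i for i in range(corpus_size) if i != correct_index]
    pool = rng.sample(distractors, pool_size - 1) + [correct_index]
    rng.shuffle(pool)  # Avoid leaking the answer through its position.
    return pool


pool = build_candidate_pool(correct_index=7, corpus_size=1000)
print(len(pool), 7 in pool)  # 100 True
```

Using the same seed for every model keeps the distractor sets identical, so score differences reflect the models rather than the sampling.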
---
## **📈 Key Evaluation Metrics (Same Seed)**
| Language | **CodeModernBERT-Owl** | **CodeHawks-ModernBERT** | **Salesforce CodeT5+** | **Microsoft CodeBERT** | **GraphCodeBERT** |
|-----------|-----------------|----------------------|-----------------|------------------|------------------|
| **Python** | **0.8793** | 0.8551 | 0.8266 | 0.5243 | 0.5493 |
| **Java** | **0.8880** | 0.7971 | 0.8867 | 0.3134 | 0.5879 |
| **JavaScript** | **0.8423** | 0.7634 | 0.7628 | 0.2694 | 0.5051 |
| **PHP** | **0.9129** | 0.8578 | 0.9027 | 0.2642 | 0.6225 |
| **Ruby** | **0.8038** | 0.7469 | 0.7568 | 0.3318 | 0.5876 |
| **Go** | **0.9386** | 0.9043 | 0.8117 | 0.3262 | 0.4243 |
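✅ **Achieves the highest score among the compared models in all six languages.**
✅ **Raises Java accuracy from 0.7971 (CodeHawks-ModernBERT) to 0.8880 through additional fine-tuning on GitHub data.**
✅ **Outperforms previous models in every language, with the highest scores in Go (0.9386) and PHP (0.9129).**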
---
## **📊 Rust Performance (Custom Dataset)**
| Metric | **CodeModernBERT-Owl** |
|--------------|----------------|
| **MRR** | 0.7940 |
| **MAP** | 0.7940 |
| **R-Precision** | 0.7173 |
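
The identical MRR and MAP values are what one would expect if each query has exactly one relevant snippet, because average precision then collapses to the reciprocal rank. Under that assumption (it is inferred from the numbers, not stated in this card), the metrics can be sketched as follows. The function names are illustrative.

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """1 / (1-based position of the relevant item), or 0.0 if it is absent."""
    for position, item in enumerate(ranked_ids, start=1):
        if item == relevant_id:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rankings, relevant_ids):
    """Average reciprocal rank over all queries (equals MAP with one relevant item)."""
    total = sum(reciprocal_rank(r, rel) for r, rel in zip(rankings, relevant_ids))
    return total / len(rankings)


def r_precision_single(ranked_ids, relevant_id):
    """With R = 1 relevant item, R-Precision is 1.0 only if it is ranked first."""
    return 1.0 if ranked_ids and ranked_ids[0] == relevant_id else 0.0


# Three toy queries whose relevant item "a" lands at ranks 1, 2 and 3.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
relevant = ["a", "a", "a"]
print(round(mean_reciprocal_rank(rankings, relevant), 4))  # 0.6111
print(round(sum(r_precision_single(r, rel) for r, rel in zip(rankings, relevant)) / 3, 4))  # 0.3333
```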
### **📌 Evaluation Metrics by K**
| K | **Recall@K** | **Precision@K** | **NDCG@K** | **F1@K** | **Success Rate@K** | **Query Coverage@K** |
|----|-------------|---------------|------------|--------|-----------------|-----------------|
| **1** | 0.7173 | 0.7173 | 0.7173 | 0.7173 | 0.7173 | 0.7173 |
| **5** | 0.8913 | 0.7852 | 0.8118 | 0.8132 | 0.8913 | 0.8913 |
| **10** | 0.9333 | 0.7908 | 0.8254 | 0.8230 | 0.9333 | 0.9333 |
| **50** | 0.9887 | 0.7938 | 0.8383 | 0.8288 | 0.9887 | 0.9887 |
| **100** | 1.0000 | 0.7940 | 0.8401 | 0.8291 | 1.0000 | 1.0000 |
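
The Recall@K, Success Rate@K, and Query Coverage@K columns are identical, which is consistent with each query having a single relevant snippet. Assuming that setup, the sketch below computes Recall@K and NDCG@K from the 1-based rank of the relevant item for each query. It uses the standard textbook definitions and may differ from the script that produced the table; the helper names are hypothetical.

```python
import math


def recall_at_k(rank, k):
    """1.0 if the single relevant item appears within the top k, else 0.0."""
    return 1.0 if rank <= k else 0.0


def ndcg_at_k(rank, k):
    """NDCG with one relevant item: the ideal DCG is 1, so NDCG = 1 / log2(rank + 1)."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0


def mean_metric(metric, ranks, k):
    """Average a per-query metric over all queries."""
    return sum(metric(rank, k) for rank in ranks) / len(ranks)


# 1-based ranks of the relevant snippet for three toy queries.
ranks = [1, 3, 12]
print(round(mean_metric(recall_at_k, ranks, 10), 4))  # 0.6667
print(round(mean_metric(ndcg_at_k, ranks, 10), 4))    # 0.5
```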
---
## **🔁 Recommended Alternative Models**
### 1. **CodeSearch-ModernBERT-Owl🦉** (https://huggingface.co/Shuu12121/CodeSearch-ModernBERT-Owl)
If you need a model that is **more specialized for code search**, this model is highly recommended.
### 2. **CodeModernBERT-Snake🐍** (https://huggingface.co/Shuu12121/CodeModernBERT-Snake)
If you need a pretrained model that supports **longer sequences or a smaller model size**, this model is ideal.
- **Maximum Sequence Length:** 8192 tokens
- **Smaller Model Size:** ~75M parameters
### 3. **CodeSearch-ModernBERT-Snake🐍** (https://huggingface.co/Shuu12121/CodeSearch-ModernBERT-Snake)
If you need a model that combines **long sequence length and code search specialization**, this model is the best choice.
- **Maximum Sequence Length:** 8192 tokens
- **High Code Search Performance**
## **📝 Conclusion**
✅ **Top performance among the compared models in all six benchmarked languages**
✅ **Rust support successfully added through dataset augmentation**
✅ **Further performance improvements possible with better datasets**
---
## **📜 License**
📄 **Apache-2.0**
## **📧 Contact**
📩 **For any questions, please contact:**
📧 **[email protected]**