|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- code-search-net/code_search_net |
|
pipeline_tag: fill-mask |
|
tags: |
|
- code |
|
metrics: |
|
- code_eval |
|
new_version: Shuu12121/CodeHawks-ModernBERT |
|
--- |
|
|
|
# CodeMorph-ModernBERT |
|
|
|
## Overview
|
|
|
**CodeMorph-ModernBERT** is a pre-trained model built from scratch for code search and code understanding tasks. It was trained on the `code-search-net/code_search_net` dataset to strengthen its semantic understanding of code.

It supports a **maximum sequence length of 2048 tokens** (versus 512 for the earlier Microsoft models) and performs particularly well on Python code search.
|
- **Architecture**: ModernBERT-based
- **Purpose**: Code search / code understanding / code completion
- **Training data**: CodeSearchNet (all languages)
- **License**: Apache 2.0
|
|
|
## Key Features
|
|
|
- **Long sequence support**
  Handles sequences of up to 2048 tokens, making it suitable for long code and complex functions.

- **High code search performance**
  Uses a tokenizer built with SentencePiece on six programming languages, including Python, achieving substantially better retrieval accuracy than earlier models (see the tokenizer check after this list).

- **Trained from scratch for code**
  Pre-trained from scratch on the CodeSearchNet dataset, giving it a deep understanding of code-specific syntax and the relationship between code and comments.
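
As a quick sanity check of the tokenizer, the sketch below loads it and segments a small Python function. This is illustrative only; the exact token pieces depend on the published vocabulary.

```python
from transformers import AutoTokenizer

# Load the SentencePiece-based tokenizer shipped with the model.
tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeMorph-ModernBERT")

# Inspect how a short function is segmented into subword tokens.
print(tokenizer.tokenize("def add_numbers(a, b): return a + b"))
```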
|
|
|
|
|
## Model Parameters
|
|
|
The model is designed with the following parameters:
|
|
|
| Parameter | Value |
|
|-----------------------------------|--------------------| |
|
| **vocab_size** | 50000 | |
|
| **hidden_size** | 768 | |
|
| **num_hidden_layers** | 12 | |
|
| **num_attention_heads** | 12 | |
|
| **intermediate_size** | 3072 | |
|
| **max_position_embeddings** | 2048 | |
|
| **type_vocab_size** | 2 | |
|
| **hidden_dropout_prob** | 0.1 | |
|
| **attention_probs_dropout_prob** | 0.1 | |
|
| **local_attention_window** | 128 | |
|
| **rope_theta** | 160000 | |
|
| **local_attention_rope_theta** | 10000 | |
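
These values can be checked against the configuration published with the checkpoint. The sketch below is a minimal example; note that the field names stored in the config may differ slightly from the BERT-style names used in the table above.

```python
from transformers import AutoConfig

# Load the published config and inspect a few of the values listed above.
config = AutoConfig.from_pretrained("Shuu12121/CodeMorph-ModernBERT")
print(config.vocab_size)               # expected: 50000
print(config.hidden_size)              # expected: 768
print(config.max_position_embeddings)  # expected: 2048
```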
|
|
|
## How to Use the Model
|
|
|
The model can be loaded easily with the Hugging Face Transformers library. (*Note: Transformers version `4.48.0` or later is required.*)

- [A short working example is available here](https://github.com/Shun0212/CodeBERTPretrained/blob/main/UseMyCodeMorph_ModernBERT.ipynb)
|
|
|
### Loading the Model
|
```python |
|
from transformers import AutoModelForMaskedLM, AutoTokenizer |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeMorph-ModernBERT") |
|
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeMorph-ModernBERT") |
|
``` |
|
|
|
### Fill-Mask (Code Completion)
|
```python |
|
from transformers import pipeline |
|
|
|
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Use the tokenizer's own mask token in case it differs from the literal "[MASK]".
code = f"def add_numbers(a, b): return a + {tokenizer.mask_token}"
print(fill_mask(code))
|
``` |
|
|
|
### Obtaining Code Embeddings
|
```python |
|
import torch

def get_embedding(text, model, tokenizer, device="cuda"):
    # Tokenize; ModernBERT does not use token_type_ids, so drop them if present.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    inputs.pop("token_type_ids", None)
    # Move the inputs (and the model) to the target device; pass device="cpu" if no GPU.
    inputs = {k: v.to(device) for k, v in inputs.items()}
    model.to(device)
    with torch.no_grad():
        # `model.model` is the base encoder inside the AutoModelForMaskedLM wrapper.
        outputs = model.model(**inputs)
    # Use the hidden state of the first token as the sequence embedding.
    return outputs.last_hidden_state[:, 0, :]

embedding = get_embedding("def my_function(): pass", model, tokenizer)
print(embedding.shape)  # torch.Size([1, 768])
|
``` |
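
For code search, one simple way to use these embeddings (a sketch, not the exact protocol of the evaluation notebooks) is to embed a natural-language query and a pool of candidate functions, then rank the candidates by cosine similarity:

```python
import torch
import torch.nn.functional as F

query = "add two numbers"
candidates = [
    "def add_numbers(a, b): return a + b",
    "def multiply(a, b): return a * b",
]

# Embed the query and all candidates, then rank candidates by cosine similarity.
q = get_embedding(query, model, tokenizer)
c = torch.cat([get_embedding(s, model, tokenizer) for s in candidates])
scores = F.cosine_similarity(q, c)  # one score per candidate
print(candidates[int(scores.argmax())])
```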
|
|
|
## Dataset
|
|
|
This model was trained on the `code-search-net/code_search_net` dataset, which contains code snippets in multiple programming languages (Python, Java, JavaScript, and others) and is well suited to code search tasks.
|
|
|
## Evaluation Results
|
|
|
The model was evaluated on the Python subset of the `code_x_glue_ct_code_to_text` dataset; the main metrics are listed below. (MRR and MAP coincide throughout these tables, as expected when each query has exactly one relevant snippet.)
For experiment details, see [this notebook](https://colab.research.google.com/gist/Shun0212/474d9092deb60bd10523c3bef427d422/codemorph-modernbert-exp.ipynb?hl=ja).
|
|
|
| Metric | Score |
|
|-------|-------| |
|
| **MRR** (Mean Reciprocal Rank) | 0.8172 | |
|
| **MAP** (Mean Average Precision) | 0.8172 | |
|
| **R-Precision** | 0.7501 | |
|
| **Recall@10** | 0.9389 | |
|
| **Precision@10** | 0.8143 | |
|
| **NDCG@10** | 0.8445 | |
|
| **F1@10** | 0.8423 | |
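
For reference, the rank-based metrics reduce to simple formulas in the single-relevant-item setting assumed above (a minimal sketch; the linked notebook is authoritative for the exact definitions used):

```python
def mrr(ranks):
    # ranks: 1-based rank of the single relevant snippet for each query.
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranks, k=10):
    # Fraction of queries whose relevant snippet appears in the top k.
    return sum(r <= k for r in ranks) / len(ranks)

print(mrr([1, 2, 5]))          # ~0.567
print(recall_at_k([1, 2, 5]))  # 1.0
```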
|
|
|
## Comparison with Other Models
|
|
|
The table below compares CodeMorph-ModernBERT with other major code search models.
|
|
|
| Model | MRR | MAP | R-Precision |
|
|--------|------|------|-------------| |
|
| **CodeMorph-ModernBERT** | **0.8172** | **0.8172** | **0.7501** | |
|
| microsoft/graphcodebert-base | 0.5482 | 0.5482 | 0.4458 | |
|
| microsoft/codebert-base-mlm | 0.5243 | 0.5243 | 0.4378 | |
|
| Salesforce/codet5p-220m-py | 0.7512 | 0.7512 | 0.6617 | |
|
| Salesforce/codet5-large-ntp-py | 0.7846 | 0.7846 | 0.7067 | |
|
| Shuu12121/CodeMorph-BERT | 0.6851 | 0.6851 | 0.5934 | |
|
| Shuu12121/CodeMorph-BERTv2 | 0.6535 | 0.6535 | 0.5543 | |
|
|
|
|
|
## Code Search Model Evaluation Results (google/code_x_glue_tc_nl_code_search_adv Dataset, Test)
|
|
|
The table below summarizes the evaluation results of several code search models on the `google/code_x_glue_tc_nl_code_search_adv` dataset (Test split). The candidate pool size is 100 in every case.
The code for this additional experiment is [here](https://github.com/Shun0212/CodeBERTPretrained/blob/main/CodeMorph-ModernBERT-exp-2.ipynb).
|
|
|
| Model | MRR | MAP | R-Precision |
|
| :-------------------------------------- | :----- | :----- | :---------- | |
|
| Shuu12121/CodeMorph-ModernBERT | 0.6107 | 0.6107 | 0.5038 | |
|
| Salesforce/codet5p-220m-py | 0.5037 | 0.5037 | 0.3805 | |
|
| Salesforce/codet5-large-ntp-py | 0.4872 | 0.4872 | 0.3658 | |
|
| microsoft/graphcodebert-base | 0.3844 | 0.3844 | 0.2764 | |
|
| microsoft/codebert-base-mlm | 0.3766 | 0.3766 | 0.2683 | |
|
| Shuu12121/CodeMorph-BERTv2 | 0.3142 | 0.3142 | 0.2166 | |
|
| Shuu12121/CodeMorph-BERT | 0.2978 | 0.2978 | 0.1992 | |
|
|
|
Compared with the other CodeBERT and CodeT5 models, CodeMorph-ModernBERT achieves markedly higher retrieval accuracy. A sketch of the pool-based ranking protocol is shown below.
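
The pool-based evaluation pairs each query with its one matching function plus sampled distractors and measures the rank of the match. The sketch below assumes cosine similarity over `get_embedding` outputs; the exact scoring in the linked notebook may differ.

```python
import torch
import torch.nn.functional as F

def rank_of_match(query, match, distractors, model, tokenizer):
    # Candidate pool: the one matching snippet plus distractors (pool size 100).
    pool = [match] + distractors
    q = get_embedding(query, model, tokenizer)
    embs = torch.cat([get_embedding(c, model, tokenizer) for c in pool])
    scores = F.cosine_similarity(q, embs)
    order = scores.argsort(descending=True)
    return (order == 0).nonzero().item() + 1  # 1-based rank of the match
```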
|
|
|
|
|
|
|
## Evaluation Results Across Multiple Languages
|
|
|
CodeMorph-ModernBERT shows strong code search performance across multiple programming languages. The table below summarizes the key metrics (MRR, MAP, R-Precision) for each language.
Note that this experiment used a sample of 1,000 examples rather than the full data; see [this notebook](https://github.com/Shun0212/CodeBERTPretrained/blob/main/CodeMorphModernBERTvsCodeT5p.ipynb) for details.
|
|
|
| Language | MRR | MAP | R-Precision |
|
|--------------|--------|--------|-------------| |
|
| **Python** | 0.8098 | 0.8098 | 0.7520 | |
|
| **Java** | 0.6437 | 0.6437 | 0.5480 | |
|
| **JavaScript** | 0.5928 | 0.5928 | 0.4880 | |
|
| **PHP** | 0.7512 | 0.7512 | 0.6710 | |
|
| **Ruby** | 0.7188 | 0.7188 | 0.6310 | |
|
| **Go** | 0.5358 | 0.5358 | 0.4320 | |
|
|
|
Although the scores vary by language, CodeMorph-ModernBERT maintains high retrieval accuracy overall, with especially strong results on Python and PHP.
|
|
|
By contrast, Salesforce/codet5p-220m-bimodal outperforms CodeMorph-ModernBERT across the board on this benchmark:
|
| Language | MRR | MAP | R-Precision |
|
|----------------|--------|--------|-------------| |
|
| **Python** | 0.8322 | 0.8322 | 0.7660 | |
|
| **Java** | 0.8886 | 0.8886 | 0.8390 | |
|
| **JavaScript** | 0.7611 | 0.7611 | 0.6710 | |
|
| **PHP** | 0.8985 | 0.8985 | 0.8530 | |
|
| **Ruby** | 0.7635 | 0.7635 | 0.6740 | |
|
| **Go** | 0.8127 | 0.8127 | 0.7260 | |
|
|
|
|
|
However, on the separate `google/code_x_glue_tc_nl_code_search_adv` dataset (Test), CodeMorph-ModernBERT comes out ahead, as shown below. This suggests it may be the more advantageous model for harder tasks and for generalization within Python.
|
|
|
| Model | MRR | MAP | R-Precision |
|
| :-------------------------------------- | :----- | :----- | :---------- | |
|
| Shuu12121/CodeMorph-ModernBERT | 0.6107 | 0.6107 | 0.5038 | |
|
| Salesforce/codet5p-220m-bimodal | 0.5326 | 0.5326 | 0.4208 | |
|
|
|
|
|
## License
|
|
|
This model is released under the `Apache-2.0` license.
|
|
|
## Contact

If you have any questions about this model, please contact:
[email protected]
|
|
|