|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- code-search-net/code_search_net |
|
pipeline_tag: fill-mask |
|
tags: |
|
- code |
|
metrics: |
|
- code_eval |
|
new_version: Shuu12121/CodeHawks-ModernBERT |
|
--- |
|
|
|
# CodeMorph-ModernBERT |
|
|
|
## Overview
|
|
|
**CodeMorph-ModernBERT** is a pre-trained model built from scratch for code search and code understanding tasks. It was trained on the `code-search-net/code_search_net` dataset to strengthen its semantic understanding of code.

It supports a **maximum sequence length of 2048 tokens** (versus 512 for the earlier Microsoft models) and performs particularly well on Python code search.
|
- **Architecture**: ModernBERT-based
- **Purpose**: Code search / code understanding / code completion
- **Training data**: CodeSearchNet (all languages)
- **License**: Apache 2.0
|
|
|
## Key Features
|
|
|
- **Long sequence support**
  Handles sequences of up to 2048 tokens, making it suitable for long code and complex functions.

- **High code search performance**
  Uses a tokenizer built with SentencePiece on six programming languages, including Python, achieving substantially better retrieval accuracy than earlier models (see the tokenizer check after this list).

- **Trained from scratch for code**
  Pre-trained from scratch on the CodeSearchNet dataset, giving it a deep understanding of code-specific syntax and the relationship between code and comments.
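
As a quick sanity check of the tokenizer, the sketch below loads it and segments a small Python function. This is illustrative only; the exact token pieces depend on the published vocabulary.

```python
from transformers import AutoTokenizer

# Load the SentencePiece-based tokenizer shipped with the model.
tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeMorph-ModernBERT")

# Inspect how a short function is segmented into subword tokens.
print(tokenizer.tokenize("def add_numbers(a, b): return a + b"))
```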
|
|
|
|
|
## Model Parameters
|
|
|
The model is designed with the following parameters:
|
|
|
| Parameter | Value |
|
|-----------------------------------|--------------------| |
|
| **vocab_size** | 50000 | |
|
| **hidden_size** | 768 | |
|
| **num_hidden_layers** | 12 | |
|
| **num_attention_heads** | 12 | |
|
| **intermediate_size** | 3072 | |
|
| **max_position_embeddings** | 2048 | |
|
| **type_vocab_size** | 2 | |
|
| **hidden_dropout_prob** | 0.1 | |
|
| **attention_probs_dropout_prob** | 0.1 | |
|
| **local_attention_window** | 128 | |
|
| **rope_theta** | 160000 | |
|
| **local_attention_rope_theta** | 10000 | |
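
These values can be checked against the configuration published with the checkpoint. The sketch below is a minimal example; note that the field names stored in the config may differ slightly from the BERT-style names used in the table above.

```python
from transformers import AutoConfig

# Load the published config and inspect a few of the values listed above.
config = AutoConfig.from_pretrained("Shuu12121/CodeMorph-ModernBERT")
print(config.vocab_size)               # expected: 50000
print(config.hidden_size)              # expected: 768
print(config.max_position_embeddings)  # expected: 2048
```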
|
|
|
## How to Use the Model
|
|
|
The model can be loaded easily with the Hugging Face Transformers library. (*Note: Transformers version `4.48.0` or later is required.*)

- [A short working example is available here](https://github.com/Shun0212/CodeBERTPretrained/blob/main/UseMyCodeMorph_ModernBERT.ipynb)
|
|
|
### Loading the Model
|
```python |
|
from transformers import AutoModelForMaskedLM, AutoTokenizer |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeMorph-ModernBERT") |
|
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeMorph-ModernBERT") |
|
``` |
|
|
|
### Fill-Mask (Code Completion)
|
```python |
|
from transformers import pipeline |
|
|
|
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Use the tokenizer's own mask token in case it differs from the literal "[MASK]".
code = f"def add_numbers(a, b): return a + {tokenizer.mask_token}"
print(fill_mask(code))
|
``` |
|
|
|
### Obtaining Code Embeddings
|
```python |
|
import torch

def get_embedding(text, model, tokenizer, device="cuda"):
    # Tokenize; ModernBERT does not use token_type_ids, so drop them if present.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    inputs.pop("token_type_ids", None)
    # Move the inputs (and the model) to the target device; pass device="cpu" if no GPU.
    inputs = {k: v.to(device) for k, v in inputs.items()}
    model.to(device)
    with torch.no_grad():
        # `model.model` is the base encoder inside the AutoModelForMaskedLM wrapper.
        outputs = model.model(**inputs)
    # Use the hidden state of the first token as the sequence embedding.
    return outputs.last_hidden_state[:, 0, :]

embedding = get_embedding("def my_function(): pass", model, tokenizer)
print(embedding.shape)  # torch.Size([1, 768])
|
``` |
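
For code search, one simple way to use these embeddings (a sketch, not the exact protocol of the evaluation notebooks) is to embed a natural-language query and a pool of candidate functions, then rank the candidates by cosine similarity:

```python
import torch
import torch.nn.functional as F

query = "add two numbers"
candidates = [
    "def add_numbers(a, b): return a + b",
    "def multiply(a, b): return a * b",
]

# Embed the query and all candidates, then rank candidates by cosine similarity.
q = get_embedding(query, model, tokenizer)
c = torch.cat([get_embedding(s, model, tokenizer) for s in candidates])
scores = F.cosine_similarity(q, c)  # one score per candidate
print(candidates[int(scores.argmax())])
```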
|
|
|
## Dataset
|
|
|
This model was trained on the `code-search-net/code_search_net` dataset, which contains code snippets in multiple programming languages (Python, Java, JavaScript, and others) and is well suited to code search tasks.
|
|
|
## Evaluation Results
|
|
|
The model was evaluated on the Python subset of the `code_x_glue_ct_code_to_text` dataset; the main metrics are listed below. (MRR and MAP coincide throughout these tables, as expected when each query has exactly one relevant snippet.)
For experiment details, see [this notebook](https://colab.research.google.com/gist/Shun0212/474d9092deb60bd10523c3bef427d422/codemorph-modernbert-exp.ipynb?hl=ja).
|
|
|
| Metric | Score |
|
|-------|-------| |
|
| **MRR** (Mean Reciprocal Rank) | 0.8172 | |
|
| **MAP** (Mean Average Precision) | 0.8172 | |
|
| **R-Precision** | 0.7501 | |
|
| **Recall@10** | 0.9389 | |
|
| **Precision@10** | 0.8143 | |
|
| **NDCG@10** | 0.8445 | |
|
| **F1@10** | 0.8423 | |
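
For reference, the rank-based metrics reduce to simple formulas in the single-relevant-item setting assumed above (a minimal sketch; the linked notebook is authoritative for the exact definitions used):

```python
def mrr(ranks):
    # ranks: 1-based rank of the single relevant snippet for each query.
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranks, k=10):
    # Fraction of queries whose relevant snippet appears in the top k.
    return sum(r <= k for r in ranks) / len(ranks)

print(mrr([1, 2, 5]))          # ~0.567
print(recall_at_k([1, 2, 5]))  # 1.0
```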
|
|
|
## Comparison with Other Models
|
|
|
The table below compares CodeMorph-ModernBERT with other major code search models.
|
|
|
| Model | MRR | MAP | R-Precision |
|
|--------|------|------|-------------| |
|
| **CodeMorph-ModernBERT** | **0.8172** | **0.8172** | **0.7501** | |
|
| microsoft/graphcodebert-base | 0.5482 | 0.5482 | 0.4458 | |
|
| microsoft/codebert-base-mlm | 0.5243 | 0.5243 | 0.4378 | |
|
| Salesforce/codet5p-220m-py | 0.7512 | 0.7512 | 0.6617 | |
|
| Salesforce/codet5-large-ntp-py | 0.7846 | 0.7846 | 0.7067 | |
|
| Shuu12121/CodeMorph-BERT | 0.6851 | 0.6851 | 0.5934 | |
|
| Shuu12121/CodeMorph-BERTv2 | 0.6535 | 0.6535 | 0.5543 | |
|
|
|
|
|
## Code Search Model Evaluation Results (google/code_x_glue_tc_nl_code_search_adv Dataset, Test)
|
|
|
The table below summarizes the evaluation results of several code search models on the `google/code_x_glue_tc_nl_code_search_adv` dataset (Test split). The candidate pool size is 100 in every case.
The code for this additional experiment is [here](https://github.com/Shun0212/CodeBERTPretrained/blob/main/CodeMorph-ModernBERT-exp-2.ipynb).
|
|
|
| Model | MRR | MAP | R-Precision |
|
| :-------------------------------------- | :----- | :----- | :---------- | |
|
| Shuu12121/CodeMorph-ModernBERT | 0.6107 | 0.6107 | 0.5038 | |
|
| Salesforce/codet5p-220m-py | 0.5037 | 0.5037 | 0.3805 | |
|
| Salesforce/codet5-large-ntp-py | 0.4872 | 0.4872 | 0.3658 | |
|
| microsoft/graphcodebert-base | 0.3844 | 0.3844 | 0.2764 | |
|
| microsoft/codebert-base-mlm | 0.3766 | 0.3766 | 0.2683 | |
|
| Shuu12121/CodeMorph-BERTv2 | 0.3142 | 0.3142 | 0.2166 | |
|
| Shuu12121/CodeMorph-BERT | 0.2978 | 0.2978 | 0.1992 | |
|
|
|
Compared with the other CodeBERT and CodeT5 models, CodeMorph-ModernBERT achieves markedly higher retrieval accuracy. A sketch of the pool-based ranking protocol is shown below.
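
The pool-based evaluation pairs each query with its one matching function plus sampled distractors and measures the rank of the match. The sketch below assumes cosine similarity over `get_embedding` outputs; the exact scoring in the linked notebook may differ.

```python
import torch
import torch.nn.functional as F

def rank_of_match(query, match, distractors, model, tokenizer):
    # Candidate pool: the one matching snippet plus distractors (pool size 100).
    pool = [match] + distractors
    q = get_embedding(query, model, tokenizer)
    embs = torch.cat([get_embedding(c, model, tokenizer) for c in pool])
    scores = F.cosine_similarity(q, embs)
    order = scores.argsort(descending=True)
    return (order == 0).nonzero().item() + 1  # 1-based rank of the match
```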
|
|
|
|
|
|
|
## Evaluation Results Across Multiple Languages
|
|
|
CodeMorph-ModernBERT shows strong code search performance across multiple programming languages. The table below summarizes the key metrics (MRR, MAP, R-Precision) for each language.
Note that this experiment used a sample of 1,000 examples rather than the full data; see [this notebook](https://github.com/Shun0212/CodeBERTPretrained/blob/main/CodeMorphModernBERTvsCodeT5p.ipynb) for details.
|
|
|
| Language | MRR | MAP | R-Precision |
|
|--------------|--------|--------|-------------| |
|
| **Python** | 0.8098 | 0.8098 | 0.7520 | |
|
| **Java** | 0.6437 | 0.6437 | 0.5480 | |
|
| **JavaScript** | 0.5928 | 0.5928 | 0.4880 | |
|
| **PHP** | 0.7512 | 0.7512 | 0.6710 | |
|
| **Ruby** | 0.7188 | 0.7188 | 0.6310 | |
|
| **Go** | 0.5358 | 0.5358 | 0.4320 | |
|
|
|
Although the scores vary by language, CodeMorph-ModernBERT maintains high retrieval accuracy overall, with especially strong results on Python and PHP.
|
|
|
By contrast, Salesforce/codet5p-220m-bimodal outperforms CodeMorph-ModernBERT across the board on this benchmark:
|
| Language | MRR | MAP | R-Precision |
|
|----------------|--------|--------|-------------| |
|
| **Python** | 0.8322 | 0.8322 | 0.7660 | |
|
| **Java** | 0.8886 | 0.8886 | 0.8390 | |
|
| **JavaScript** | 0.7611 | 0.7611 | 0.6710 | |
|
| **PHP** | 0.8985 | 0.8985 | 0.8530 | |
|
| **Ruby** | 0.7635 | 0.7635 | 0.6740 | |
|
| **Go** | 0.8127 | 0.8127 | 0.7260 | |
|
|
|
|
|
However, on the separate `google/code_x_glue_tc_nl_code_search_adv` dataset (Test), CodeMorph-ModernBERT comes out ahead, as shown below. This suggests it may be the more advantageous model for harder tasks and for generalization within Python.
|
|
|
| Model | MRR | MAP | R-Precision |
|
| :-------------------------------------- | :----- | :----- | :---------- | |
|
| Shuu12121/CodeMorph-ModernBERT | 0.6107 | 0.6107 | 0.5038 | |
|
| Salesforce/codet5p-220m-bimodal | 0.5326 | 0.5326 | 0.4208 | |
|
|
|
|
|
## License
|
|
|
This model is released under the `Apache-2.0` license.
|
|
|
## Contact

If you have any questions about this model, please contact:
[email protected]
|
|
|