---
license: mit
language:
- en
inference: true
base_model:
- microsoft/codebert-base-mlm
pipeline_tag: fill-mask
tags:
- fill-mask
- smart-contract
- web3
- software-engineering
- embedding
- codebert
library_name: transformers
---
# SmartBERT V2 CodeBERT

## Overview
SmartBERT V2 CodeBERT is a pre-trained model, initialized from **[CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)**, designed to encode **Smart Contract** function-level code into embeddings.
- **Training Data:** Trained on **16,000** smart contracts.
- **Hardware:** Utilized 2 Nvidia A100 80G GPUs.
- **Training Duration:** More than 10 hours.
- **Evaluation Data:** Evaluated on **4,000** smart contracts.
## Preprocessing
All newline (`\n`) and tab (`\t`) characters in the function code were replaced with a single space to ensure consistency in the input data format.
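A minimal sketch of this normalization step (`normalize_code` is an illustrative name, not taken from the original preprocessing code):
```python
def normalize_code(code: str) -> str:
    """Replace each newline and tab in function code with a single space."""
    return code.replace('\n', ' ').replace('\t', ' ')

print(normalize_code("function foo()\n\texternal view returns (uint256);"))
# -> "function foo()  external view returns (uint256);"
```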
## Base Model
- **Base Model**: [CodeBERT-base-mlm](https://huggingface.co/microsoft/codebert-base-mlm)
## Training Setup
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,             # directory for checkpoints and logs
    overwrite_output_dir=True,
    num_train_epochs=20,
    per_device_train_batch_size=64,    # per GPU; training used 2x A100 80G
    save_steps=10000,
    save_total_limit=2,                # keep only the two most recent checkpoints
    evaluation_strategy="steps",
    eval_steps=10000,
    resume_from_checkpoint=checkpoint  # path to an existing checkpoint, or None
)
```
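These arguments drive a standard masked-language-modeling run with the `Trainer` API. A minimal sketch of how such a run could be wired up (the dataset variables and data-collator settings are assumptions for illustration, not taken from the original training code):
```python
from transformers import (RobertaTokenizer, RobertaForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer)

tokenizer = RobertaTokenizer.from_pretrained('microsoft/codebert-base-mlm')
model = RobertaForMaskedLM.from_pretrained('microsoft/codebert-base-mlm')

# Standard MLM objective: randomly mask a fraction of input tokens
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=training_args,           # the TrainingArguments defined above
    data_collator=data_collator,
    train_dataset=train_dataset,  # tokenized functions from the 16,000 training contracts (assumed)
    eval_dataset=eval_dataset,    # tokenized functions from the 4,000 evaluation contracts (assumed)
)
trainer.train()
```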
## How to Use
To train and deploy the SmartBERT V2 model for Web API services, please refer to our GitHub repository: [web3se-lab/SmartBERT](https://github.com/web3se-lab/SmartBERT).
Alternatively, use the `fill-mask` pipeline directly:
```python
from transformers import RobertaTokenizer, RobertaForMaskedLM, pipeline

model = RobertaForMaskedLM.from_pretrained('web3se/SmartBERT-v3')
tokenizer = RobertaTokenizer.from_pretrained('web3se/SmartBERT-v3')

# Solidity function signature with one token masked out
code_example = "function totalSupply() external view <mask> (uint256);"

fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
outputs = fill_mask(code_example)
print(outputs)  # candidate tokens with scores, e.g. 'returns'
```
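Beyond mask filling, the model can also be used as an encoder to produce function-level embeddings. A hedged sketch using mean pooling over the last hidden states (one common pooling choice, not necessarily the authors' exact procedure):
```python
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained('web3se/SmartBERT-v3')
encoder = RobertaModel.from_pretrained('web3se/SmartBERT-v3')  # MLM head dropped

code = "function totalSupply() external view returns (uint256);"
inputs = tokenizer(code, return_tensors='pt', truncation=True, max_length=512)

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)

# Mean-pool over non-padding tokens to get one vector per function
mask = inputs['attention_mask'].unsqueeze(-1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # torch.Size([1, 768])
```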
## Contributors
- [Youwei Huang](https://www.devil.ren)
- [Sen Fang](https://github.com/TomasAndersonFang)
## Citations
```tex
@article{huang2025smart,
  title={Smart Contract Intent Detection with Pre-trained Programming Language Model},
  author={Huang, Youwei and Li, Jianwen and Fang, Sen and Li, Yao and Yang, Peng and Hu, Bin},
  journal={arXiv preprint arXiv:2508.20086},
  year={2025}
}
```
## Sponsors
- [Institute of Intelligent Computing Technology, Suzhou, CAS](http://iict.ac.cn/)
- CAS Mino (中科劢诺)