eunJ committed · verified
Commit 1452df5 · 1 Parent(s): 5531463

Upload vulnerability detection model

Files changed (4)
  1. README.md +45 -25
  2. config.json +25 -4
  3. model.safetensors +2 -2
  4. tokenizer_config.json +0 -1
README.md CHANGED
@@ -1,34 +1,54 @@
  ---
  library_name: transformers
  tags:
- - Code
- - Vulnerability
- - Detection
- datasets:
- - DetectVul/bigvul
- language:
- - en
- base_model:
- - microsoft/codebert-base
  license: mit
- metrics:
- - accuracy
- - precision
- - f1
- - recall
  ---

- ## CodeBERT for Code Vulnerability Detection

- ## Model Summary
- This model is a fine-tuned version of **microsoft/codebert-base**, optimized for detecting vulnerabilities in code. It is trained on the **bigvul** dataset. The model takes in a code snippet and classifies it as either **benign (0)** or **vulnerable (1)**.

- ## Model Details

- - **Developed by:** Eun Jung
- - **Finetuned from:** `microsoft/codebert-base`
- - **Language(s):** English (for code comments & metadata), C/C++
- - **License:** MIT
- - **Task:** Code vulnerability detection
- - **Dataset Used:** `bigvul`
- - **Architecture:** Transformer-based sequence classification

  ---
+ language:
+ - code
  library_name: transformers
+ pipeline_tag: text-classification
  tags:
+ - code-analysis
+ - vulnerability-detection
+ - security
+ - cwe
  license: mit
+ base_model: microsoft/codebert-base
  ---

+ # CodeBERT Vulnerability Detector (Multi-class)

+ A multi-class classification model for detecting vulnerabilities in C/C++ code.

+ ## Model Information
+ - **Base model**: microsoft/codebert-base
+ - **Classes**: 4 (CWE-79, CWE-89, CWE-119, Other)
+ - **Input**: C/C++ source code text

+ ## Usage
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ # Load the model and tokenizer
+ model_name = "eunJ/codebert_vulnerability_detector_multi"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ # Analyze a code snippet
+ code = '''
+ char buffer[100];
+ gets(buffer);
+ '''
+
+ inputs = tokenizer(code, return_tensors="pt", max_length=512, truncation=True)
+ with torch.no_grad():
+     outputs = model(**inputs)
+ predictions = torch.softmax(outputs.logits, dim=-1)
+ predicted_class = torch.argmax(predictions)
+
+ print(f"Predicted class: {predicted_class.item()}")
+ ```
+
+ ## Class Labels
+ - 0: CWE-79 (Cross-site Scripting)
+ - 1: CWE-89 (SQL Injection)
+ - 2: CWE-119 (Buffer Overflow)
+ - 3: CWE-Other
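
Because config.json (below) ships only the generic id2label names LABEL_0 to LABEL_3, mapping a predicted index back to a CWE name has to happen in user code. A minimal sketch, assuming the index order listed in the README above:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Assumed mapping, following the label order documented in the README above.
CWE_NAMES = {0: "CWE-79", 1: "CWE-89", 2: "CWE-119", 3: "CWE-Other"}

model_name = "eunJ/codebert_vulnerability_detector_multi"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

code = "strcpy(dest, src);"
inputs = tokenizer(code, return_tensors="pt", max_length=512, truncation=True)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

idx = int(torch.argmax(probs))
print(f"{CWE_NAMES[idx]} (confidence: {probs[idx].item():.2f})")
```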
config.json CHANGED
@@ -1,19 +1,40 @@
  {
- "model_type": "roberta",
  "architectures": [
  "RobertaForSequenceClassification"
  ],
- "num_labels": 4,
  "id2label": {
  "0": "LABEL_0",
  "1": "LABEL_1",
  "2": "LABEL_2",
  "3": "LABEL_3"
  },
  "label2id": {
  "LABEL_0": 0,
  "LABEL_1": 1,
  "LABEL_2": 2,
  "LABEL_3": 3
- }
- }

  {
+ "_name_or_path": "microsoft/codebert-base",
  "architectures": [
  "RobertaForSequenceClassification"
  ],
+ "attention_probs_dropout_prob": 0.1,
+ "bos_token_id": 0,
+ "classifier_dropout": null,
+ "eos_token_id": 2,
+ "hidden_act": "gelu",
+ "hidden_dropout_prob": 0.1,
+ "hidden_size": 768,
  "id2label": {
  "0": "LABEL_0",
  "1": "LABEL_1",
  "2": "LABEL_2",
  "3": "LABEL_3"
  },
+ "initializer_range": 0.02,
+ "intermediate_size": 3072,
  "label2id": {
  "LABEL_0": 0,
  "LABEL_1": 1,
  "LABEL_2": 2,
  "LABEL_3": 3
+ },
+ "layer_norm_eps": 1e-05,
+ "max_position_embeddings": 514,
+ "model_type": "roberta",
+ "num_attention_heads": 12,
+ "num_hidden_layers": 12,
+ "output_past": true,
+ "pad_token_id": 1,
+ "position_embedding_type": "absolute",
+ "torch_dtype": "float32",
+ "transformers_version": "4.49.0",
+ "type_vocab_size": 1,
+ "use_cache": true,
+ "vocab_size": 50265
+ }
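
The expanded config now pins the full RoBERTa architecture (12 layers, hidden size 768, 514 position embeddings) instead of only the label maps. A quick sanity check, sketched under the assumption that the repo id from the README is the one to load:

```python
from transformers import AutoConfig

# Pull the published config from the Hub and inspect the fields added in this commit.
config = AutoConfig.from_pretrained("eunJ/codebert_vulnerability_detector_multi")

print(config.model_type)               # roberta
print(config.num_labels)               # 4, derived from id2label
print(config.max_position_embeddings)  # 514
print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads)  # 768 12 12
```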
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c8037175a0f7980967910a796119cf0250a6c0200f7568225f2e9aaeb43b9b68
- size 498633008

  version https://git-lfs.github.com/spec/v1
+ oid sha256:c601158bf8733adb819d956d6e3c418480f72fa7216b30463b1f5aa291ce2756
+ size 498618976
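
model.safetensors is stored through Git LFS, so the diff only touches the pointer file: the sha256 oid and byte size of the new weights. A local download can be checked against the pointer, a sketch assuming the file sits in the working directory:

```python
import hashlib
import os

# Hypothetical local path; point this at the downloaded model.safetensors.
path = "model.safetensors"

digest = hashlib.sha256()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        digest.update(chunk)

print(digest.hexdigest())      # should equal the oid recorded in the LFS pointer
print(os.path.getsize(path))   # should equal the recorded size, 498618976 bytes
```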
tokenizer_config.json CHANGED
@@ -45,7 +45,6 @@
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": false,
  "cls_token": "<s>",
- "do_lower_case": false,
  "eos_token": "</s>",
  "errors": "replace",
  "extra_special_tokens": {},

  "bos_token": "<s>",
  "clean_up_tokenization_spaces": false,
  "cls_token": "<s>",
  "eos_token": "</s>",
  "errors": "replace",
  "extra_special_tokens": {},