---
language:
- vi
- en
base_model:
- microsoft/phi-4
pipeline_tag: text-generation
tags:
- cybersecurity
- text-generation-inference
- transformers
license: mit
---

## Model Overview
|                         |                                                                               |
|-------------------------|-------------------------------------------------------------------------------|
| **Developers**          | Microsoft                                                                          |
| **Architecture**        | 14B parameters, dense decoder-only Transformer model                           |
| **Inputs**              | Text, best suited for prompts in the chat format                              |
| **Context length**      | 16K tokens                                                                     |
| **Outputs**             | Generated text in response to input                                           |
| **License**             | MIT                                                                           |
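
The specs above can be sanity-checked from the published configuration. A minimal sketch, assuming the repository exposes a standard `transformers` config with the usual `max_position_embeddings` field:

```python
from transformers import AutoConfig

# Load only the model configuration (no weights) to confirm the advertised specs.
config = AutoConfig.from_pretrained("viettelsecurity-ai/cyber-llm-14b")

print(config.model_type)               # expected: a dense decoder-only architecture (phi-4 family)
print(config.max_position_embeddings)  # expected: 16384 (the 16K-token context length)
```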

## Training Datasets
Our training data extends the data used to train the base model (`microsoft/phi-4`) and includes a wide variety of sources:

1. Publicly available blogs, papers, and references collected from https://github.com/PEASEC/cybersecurity_dataset.

2. Newly created synthetic, "textbook-like" data designed to teach cybersecurity concepts, generated with GPT-4o (see the sketch after this list).

3. Acquired academic books and Q&A datasets.
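
The card does not describe the synthetic-data pipeline itself. As a minimal sketch of how textbook-style data of this kind is typically produced with GPT-4o, the topic list, prompt wording, and output file below are illustrative assumptions, not the actual pipeline:

```python
# Illustrative sketch only: the real generation pipeline is not published.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical topic list; the actual curriculum used for training is not disclosed.
topics = ["URL phishing detection", "SQL injection basics", "YARA rule authoring"]

with open("synthetic_cyber_textbook.jsonl", "w", encoding="utf-8") as f:
    for topic in topics:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You write concise, textbook-style lessons on cybersecurity."},
                {"role": "user", "content": f"Write a short lesson with worked examples about: {topic}"},
            ],
        )
        lesson = response.choices[0].message.content
        # Store one JSON record per lesson for later preprocessing.
        f.write(json.dumps({"topic": topic, "text": lesson}) + "\n")
```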


## Usage

### Input Formats

Given the nature of the training data, `cyber-llm-14b` is best suited for prompts using the chat format as follows: 

```text
<|begin_of_text|><|start_header_id|>user<|end_header_id|>
Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Hey there! How are you?<|eot_id|><|start_header_id|>user<|end_header_id|>
I'm great thanks!<|eot_id|>
```
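
Rather than assembling these special tokens by hand, the tokenizer's chat template can build the prompt string. A minimal sketch, assuming the repository ships a chat template matching the format above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("viettelsecurity-ai/cyber-llm-14b")

messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there! How are you?"},
    {"role": "user", "content": "I'm great thanks!"},
]

# Render the conversation with the model's own template and append the
# header that cues the assistant's next turn.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```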

### With `transformers`

```python
import transformers

# Build a text-generation pipeline; "auto" picks an appropriate dtype and
# device_map spreads the 14B model across available devices.
pipeline = transformers.pipeline(
    "text-generation",
    model="viettelsecurity-ai/cyber-llm-14b",
    model_kwargs={"torch_dtype": "auto"},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a Tier 3 SOC analyst."},
    {"role": "user", "content": "What is URL phishing?"},
]

outputs = pipeline(messages, max_new_tokens=2048)
# The pipeline returns the full conversation; the last message is the assistant's reply.
print(outputs[0]["generated_text"][-1]["content"])
```
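
For more control over tokenization and decoding (for example when working near the 16K-token context limit), the model can also be driven directly. A minimal sketch using standard `transformers` APIs; the prompt and generation settings are illustrative defaults, not recommended values:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "viettelsecurity-ai/cyber-llm-14b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a Tier 3 SOC analyst."},
    {"role": "user", "content": "Explain how to triage a suspected phishing URL."},
]

# Render the chat template and move the token IDs to the model's device.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(inputs, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
reply = tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True)
print(reply)
```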