---
language:
- vi
- en
base_model:
- microsoft/phi-4
pipeline_tag: text-generation
tags:
- cybersecurity
- text-generation-inference
- transformers
license: mit
---

## Model Overview
|                         |                                                                               |
|-------------------------|-------------------------------------------------------------------------------|
| **Developers**          | Microsoft                                                                          |
| **Architecture**        | 14B parameters, dense decoder-only Transformer model                           |
| **Inputs**              | Text, best suited for prompts in the chat format                              |
| **Context length**      | 16K tokens                                                                     |
| **Outputs**             | Generated text in response to input                                           |
| **License**             | MIT                                                                           |
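
The specs above can be sanity-checked from the published configuration. A minimal sketch, assuming the repository exposes a standard `transformers` config with the usual `max_position_embeddings` field:

```python
from transformers import AutoConfig

# Load only the model configuration (no weights) to confirm the advertised specs.
config = AutoConfig.from_pretrained("viettelsecurity-ai/cyber-llm-14b")

print(config.model_type)               # expected: a dense decoder-only architecture (phi-4 family)
print(config.max_position_embeddings)  # expected: 16384 (the 16K-token context length)
```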

## Training Datasets
Our training data extends the data used to train the base model (`microsoft/phi-4`) and includes a wide variety of sources:

1. Publicly available blogs, papers, and references collected from https://github.com/PEASEC/cybersecurity_dataset.

2. Newly created synthetic, "textbook-like" data designed to teach cybersecurity concepts, generated with GPT-4o (see the sketch after this list).

3. Acquired academic books and Q&A datasets.
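
The card does not describe the synthetic-data pipeline itself. As a minimal sketch of how textbook-style data of this kind is typically produced with GPT-4o, the topic list, prompt wording, and output file below are illustrative assumptions, not the actual pipeline:

```python
# Illustrative sketch only: the real generation pipeline is not published.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical topic list; the actual curriculum used for training is not disclosed.
topics = ["URL phishing detection", "SQL injection basics", "YARA rule authoring"]

with open("synthetic_cyber_textbook.jsonl", "w", encoding="utf-8") as f:
    for topic in topics:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You write concise, textbook-style lessons on cybersecurity."},
                {"role": "user", "content": f"Write a short lesson with worked examples about: {topic}"},
            ],
        )
        lesson = response.choices[0].message.content
        # Store one JSON record per lesson for later preprocessing.
        f.write(json.dumps({"topic": topic, "text": lesson}) + "\n")
```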


## Usage

### Input Formats

Given the nature of the training data, `cyber-llm-14b` is best suited for prompts using the chat format as follows: 

```text
<|begin_of_text|><|start_header_id|>user<|end_header_id|>
Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Hey there! How are you?<|eot_id|><|start_header_id|>user<|end_header_id|>
I'm great thanks!<|eot_id|>
```
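
Rather than assembling these special tokens by hand, the tokenizer's chat template can build the prompt string. A minimal sketch, assuming the repository ships a chat template matching the format above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("viettelsecurity-ai/cyber-llm-14b")

messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there! How are you?"},
    {"role": "user", "content": "I'm great thanks!"},
]

# Render the conversation with the model's own template and append the
# header that cues the assistant's next turn.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```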

### With `transformers`

```python
import transformers

# Build a text-generation pipeline; "auto" picks an appropriate dtype and
# device_map spreads the 14B model across available devices.
pipeline = transformers.pipeline(
    "text-generation",
    model="viettelsecurity-ai/cyber-llm-14b",
    model_kwargs={"torch_dtype": "auto"},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a Tier 3 SOC analyst."},
    {"role": "user", "content": "What is URL phishing?"},
]

outputs = pipeline(messages, max_new_tokens=2048)
# The pipeline returns the full conversation; the last message is the assistant's reply.
print(outputs[0]["generated_text"][-1]["content"])
```
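
For more control over tokenization and decoding (for example when working near the 16K-token context limit), the model can also be driven directly. A minimal sketch using standard `transformers` APIs; the prompt and generation settings are illustrative defaults, not recommended values:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "viettelsecurity-ai/cyber-llm-14b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a Tier 3 SOC analyst."},
    {"role": "user", "content": "Explain how to triage a suspected phishing URL."},
]

# Render the chat template and move the token IDs to the model's device.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(inputs, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
reply = tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True)
print(reply)
```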