---
license: mit
language:
- en
tags:
- random-forest
- binary-classification
- prompt-injection
- security
datasets:
- imoxto/prompt_injection_cleaned_dataset-v2
- reshabhs/SPML_Chatbot_Prompt_Injection
- Harelix/Prompt-Injection-Mixed-Techniques-2024
- JasperLS/prompt-injections
- fka/awesome-chatgpt-prompts
- rubend18/ChatGPT-Jailbreak-Prompts
metrics:
- recall
- precision
- f1
- auc
---
# Model Description

Our trained Random Forest models identify malicious prompts from prompt embeddings derived with [OpenAI](https://huggingface.co/datasets/ahsanayub/malicious-prompts-openai-embeddings), [OctoAI](https://huggingface.co/datasets/ahsanayub/malicious-prompts-octoai-embeddings), and [MiniLM](https://huggingface.co/datasets/ahsanayub/malicious-prompts-minilm-embeddings). The models are trained on 373,598 benign and malicious prompts, split into 80% training and 20% test sets; stratified sampling keeps the proportion of malicious and benign labels equal across the splits.
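
The 80/20 stratified split described above can be sketched with scikit-learn's `train_test_split` (the toy data and variable names here are illustrative, not taken from the authors' code):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the embedding matrix and benign(0)/malicious(1) labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 384))
y = rng.integers(0, 2, size=1000)

# test_size=0.2 gives the 80/20 split; stratify=y keeps the benign/malicious
# ratio (nearly) identical in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 800 200
```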

Embeddings consist of fixed-length numerical representations. For example, OpenAI generates an embedding vector consisting of 1,536 floating-point numbers for each prompt. Similarly, the embedding datasets for OctoAI and MiniLM consist of 1,027 and 387 features, respectively.

## Model Evaluation

The binary classification performance of the embedding-based random forest models is shared below:
33 |
|
|
|
37 |
| OctoAI | 0.849 | 0.853 | 0.851 | 0.731 |
|
38 |
| MiniLM | 0.849 | 0.853 | 0.851 | 0.730 |
|
39 |
|
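
The metrics reported above can be computed with scikit-learn; this is a generic sketch on invented toy labels, not the authors' evaluation code:

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Toy ground-truth labels and model predictions (0 = benign, 1 = malicious).
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3/4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3/4
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
# Note: AUC on hard labels is a simplification; a real evaluation would
# pass the model's predict_proba scores instead of 0/1 predictions.
auc = roc_auc_score(y_true, y_pred)
print(precision, recall, f1, auc)
```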

## How To Use The Model

We have shared three versions of the random forest model in this repository, one per embedding model: `text-embedding-3-small` from OpenAI, the open-source `gte-large` hosted on OctoAI, and the well-known `all-MiniLM-L6-v2`. You therefore need to convert a prompt to its respective embedding before querying the corresponding model, which predicts `0` for benign and `1` for malicious.
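
As a minimal, self-contained sketch of that query interface: the synthetic embeddings and the freshly fitted forest below merely stand in for a real embedder and this repository's released models (whose filenames and loading details we do not assume here):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stand-in for real prompt embeddings: 384-dim vectors, the output size of
# all-MiniLM-L6-v2. In practice, encode real prompts with the matching
# embedding model and load the trained forest from this repository instead.
X = rng.normal(size=(200, 384))
y = (X[:, 0] > 0).astype(int)  # synthetic benign(0)/malicious(1) labels

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Embedding of one incoming prompt, queried the same way the released
# models are: predict() returns 0 for benign, 1 for malicious.
new_embedding = rng.normal(size=(1, 384))
pred = int(clf.predict(new_embedding)[0])
print("malicious" if pred == 1 else "benign")
```

The only requirement when using the released models is that the query embedding comes from the same embedding model (and hence has the same dimensionality) as the training features.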

## Citing This Work

Our implementation, along with the curated datasets used for evaluation, is available on [GitHub](https://github.com/AhsanAyub/malicious-prompt-detection). Additionally, if you use our implementation for scientific research, you are highly encouraged to cite [our paper](https://arxiv.org/abs/2410.22284).