---
library_name: sentence-transformers
tags:
- cross-encoder
- cyber
- cybersecurity
- code
license: mit
metrics:
- accuracy
- f1
- recall
- precision
base_model:
- google/canine-c
pipeline_tag: text-classification
datasets:
- Anvilogic/CE-Typosquat-Training-Dataset
---

# Typosquat CE detector

## Model Details

### Model Description

This model is a cross-encoder fine-tuned for binary classification to detect typosquatting domain names, built on the CANINE-c transformer model. Given a pair of domain names, it classifies whether one is a typographical variant (typosquat) of the other.

- **Developed by:** Anvilogic
- **Model type:** Cross-encoder binary classification
- **Maximum Sequence Length:** 512 tokens
- **Language(s) (NLP):** Multilingual
- **License:** MIT
- **Finetuned from model:** [google/CANINE-c](https://huggingface.co/google/canine-c)

## Usage

### Direct Usage (Sentence Transformers)

This model can be used directly in cybersecurity applications to identify malicious typosquatting domains by scoring a candidate domain's similarity to a legitimate one. To get started, load and test the model as follows:

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("Anvilogic/CE-typosquat-detect-Canine")

# Score a (legitimate, candidate) domain pair; a higher score
# indicates a likely typosquat.
result = model.predict([("example.com", "exarnple.com")])
print(result)
```

### Downstream Usage

This model can be combined with an embedding model to enhance typosquatting detection. First, an embedding model retrieves similar domains from a database of legitimate domains. Then, the cross-encoder labels these pairs, confirming whether a domain is a typosquat and identifying its likely target. For embedding, consider using [Anvilogic/Embedder-typosquat-detect](https://huggingface.co/Anvilogic/Embedder-typosquat-detect).

## Bias, Risks, and Limitations

Users are advised to treat this model as a supportive tool rather than a sole indicator of domain security. Regular updates may be needed to maintain its performance against new and evolving types of domain spoofing.

## Training Details

### Framework Versions

- Python: 3.10.14
- Sentence Transformers: 3.2.1
- Transformers: 4.46.2
- PyTorch: 2.2.2
- Tokenizers: 0.20.3

### Training Data

The model was fine-tuned on [Anvilogic/CE-Typosquat-Training-Dataset](https://huggingface.co/datasets/Anvilogic/CE-Typosquat-Training-Dataset), which contains pairs of domain names and their similarity labels. The dataset was filtered and converted to the Parquet format for efficient processing.

### Training Procedure

The model was optimized with binary cross-entropy loss over logits, `nn.BCEWithLogitsLoss()`.

#### Training Hyperparameters

- **Model Architecture:** Cross-encoder fine-tuned from [canine-c](https://huggingface.co/google/canine-c)
- **Batch Size:** 64
- **Epochs:** 3
- **Learning Rate:** 2e-5
- **Warmup Steps:** 100

## Evaluation

In the final evaluation after training, the model achieved the following metrics on the test set:

**CE Binary Classification Evaluator**

```
Accuracy          : 0.9740
F1 Score          : 0.9737
Precision         : 0.9836
Recall            : 0.9640
Average Precision : 0.9969
```

These results indicate strong performance in identifying typosquatting domains, with precision and recall high enough to make the model well suited for cybersecurity applications.
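Because training used `nn.BCEWithLogitsLoss()`, the network itself emits raw logits; in recent sentence-transformers versions, `CrossEncoder.predict` with a single output label applies a sigmoid by default, so scores arrive in the (0, 1) range. The sketch below shows how such a score can be mapped to a binary verdict. It is a minimal illustration, not part of the model's API: the helper names `sigmoid` and `label_pair` are hypothetical, and the 0.5 cutoff is an assumption that should be tuned on held-out data to trade precision against recall.

```python
import math

def sigmoid(logit: float) -> float:
    """Map a raw model logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

def label_pair(score: float, threshold: float = 0.5) -> str:
    """Turn a cross-encoder score into a binary typosquat verdict.

    The 0.5 threshold is an assumption, not a value from the model
    card; tune it on held-out pairs for your precision/recall target.
    """
    return "typosquat" if score >= threshold else "benign"

# A strongly positive logit maps close to 1.0 ...
assert sigmoid(4.0) > 0.98
# ... and a strongly negative one close to 0.0.
assert sigmoid(-4.0) < 0.02

print(label_pair(sigmoid(4.0)))   # typosquat
print(label_pair(sigmoid(-4.0)))  # benign
```

Raising the threshold above 0.5 favors precision (fewer false alarms on look-alike but benign domains), while lowering it favors recall, which matters when missed typosquats are costlier than extra analyst review.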