---
library_name: sentence-transformers
tags:
- cross-encoder
- cyber
- cybersecurity
- code
license: mit
metrics:
- accuracy
- f1
- recall
- precision
base_model:
- google/canine-c
pipeline_tag: text-classification
datasets:
- Anvilogic/CE-Typosquat-Training-Dataset
---

# Typosquat CE detector

## Model Details

### Model Description

This model is a cross-encoder fine-tuned for binary classification to detect typosquatting domain names, built on the CANINE-c transformer model. Given a pair of domain names, it classifies whether one is a typographical variant (typosquat) of the other.

- **Developed by:** Anvilogic
- **Model type:** Cross-encoder binary classification
- **Maximum Sequence Length:** 512 tokens
- **Language(s) (NLP):** Multilingual
- **License:** MIT
- **Finetuned from model:** [google/CANINE-c](https://huggingface.co/google/canine-c)

## Usage

### Direct Usage (Sentence Transformers)

This model can be used directly in cybersecurity applications to identify malicious typosquatting domains by scoring a candidate domain's similarity to a legitimate one. To get started, load and test the model as follows:

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("Anvilogic/CE-typosquat-detect-Canine")

# Score a (legitimate, candidate) domain pair; a higher score
# indicates a likely typosquat.
result = model.predict([("example.com", "exarnple.com")])
print(result)
```

### Downstream Usage

This model can be combined with an embedding model to enhance typosquatting detection. First, an embedding model retrieves similar domains from a database of legitimate domains. Then, the cross-encoder labels these pairs, confirming whether a domain is a typosquat and identifying its likely target. For embedding, consider using [Anvilogic/Embedder-typosquat-detect](https://huggingface.co/Anvilogic/Embedder-typosquat-detect).

## Bias, Risks, and Limitations

Users are advised to treat this model as a supportive tool rather than a sole indicator of domain security. Regular updates may be needed to maintain its performance against new and evolving types of domain spoofing.

## Training Details

### Framework Versions

- Python: 3.10.14
- Sentence Transformers: 3.2.1
- Transformers: 4.46.2
- PyTorch: 2.2.2
- Tokenizers: 0.20.3

### Training Data

The model was fine-tuned on [Anvilogic/CE-Typosquat-Training-Dataset](https://huggingface.co/datasets/Anvilogic/CE-Typosquat-Training-Dataset), which contains pairs of domain names and their similarity labels. The dataset was filtered and converted to the Parquet format for efficient processing.

### Training Procedure

The model was optimized with binary cross-entropy loss over logits, `nn.BCEWithLogitsLoss()`.

#### Training Hyperparameters

- **Model Architecture:** Cross-encoder fine-tuned from [canine-c](https://huggingface.co/google/canine-c)
- **Batch Size:** 64
- **Epochs:** 3
- **Learning Rate:** 2e-5
- **Warmup Steps:** 100

## Evaluation

In the final evaluation after training, the model achieved the following metrics on the test set:

**CE Binary Classification Evaluator**

```
Accuracy          : 0.9740
F1 Score          : 0.9737
Precision         : 0.9836
Recall            : 0.9640
Average Precision : 0.9969
```

These results indicate strong performance in identifying typosquatting domains, with precision and recall high enough to make the model well suited for cybersecurity applications.
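Because training used `nn.BCEWithLogitsLoss()`, the network itself emits raw logits; in recent sentence-transformers versions, `CrossEncoder.predict` with a single output label applies a sigmoid by default, so scores arrive in the (0, 1) range. The sketch below shows how such a score can be mapped to a binary verdict. It is a minimal illustration, not part of the model's API: the helper names `sigmoid` and `label_pair` are hypothetical, and the 0.5 cutoff is an assumption that should be tuned on held-out data to trade precision against recall.

```python
import math

def sigmoid(logit: float) -> float:
    """Map a raw model logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

def label_pair(score: float, threshold: float = 0.5) -> str:
    """Turn a cross-encoder score into a binary typosquat verdict.

    The 0.5 threshold is an assumption, not a value from the model
    card; tune it on held-out pairs for your precision/recall target.
    """
    return "typosquat" if score >= threshold else "benign"

# A strongly positive logit maps close to 1.0 ...
assert sigmoid(4.0) > 0.98
# ... and a strongly negative one close to 0.0.
assert sigmoid(-4.0) < 0.02

print(label_pair(sigmoid(4.0)))   # typosquat
print(label_pair(sigmoid(-4.0)))  # benign
```

Raising the threshold above 0.5 favors precision (fewer false alarms on look-alike but benign domains), while lowering it favors recall, which matters when missed typosquats are costlier than extra analyst review.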