---
library_name: transformers
tags:
- computer-vision
- image-classification
- vit
- deepfake-detection
- binary-classification
- pytorch
license: mit
metrics:
- accuracy
base_model:
- facebook/deit-base-distilled-patch16-224
pipeline_tag: image-classification
---

# Model Card for Virtus

Virtus is a fine-tuned Vision Transformer (ViT) model for binary image classification, trained to distinguish real images from deepfakes. It achieves **~99.2% accuracy** on the held-out test split of a balanced dataset of over 190,000 images.

## Model Details

### Model Description

Virtus is based on `facebook/deit-base-distilled-patch16-224` and was fine-tuned on a binary classification task using a large dataset of real and fake facial images. The training process involved class balancing, data augmentation, and evaluation using accuracy and F1 score.

- **Developed by:** [Agasta](https://github.com/Itz-Agasta)
- **Funded by:** None
- **Shared by:** Agasta
- **Model type:** Vision Transformer (ViT) for image classification
- **Language(s):** N/A (vision model)
- **License:** MIT
- **Finetuned from model:** [facebook/deit-base-distilled-patch16-224](https://huggingface.co/facebook/deit-base-distilled-patch16-224)

### Model Sources

- **Repository:** [https://huggingface.co/agasta/virtus](https://huggingface.co/agasta/virtus)

## Uses

### Direct Use

The model predicts whether an input image is real or a deepfake. It can be deployed in image analysis pipelines or integrated into applications that require media authenticity detection.

### Downstream Use

Virtus may be used in broader deepfake detection systems, educational tools for detecting synthetic media, or pre-screening systems for online platforms.

### Out-of-Scope Use

- Detection of deepfakes in videos or audio
- General object classification tasks outside the real/fake binary domain

## Bias, Risks, and Limitations

The dataset, while balanced, may still carry biases in facial features, lighting conditions, or demographics. The model is also not robust to non-standard input sizes or heavily occluded faces.

### Recommendations

- Use only on face images similar in nature to the training set.
- Do not use for critical or high-stakes decisions without human verification.
- Regularly re-evaluate performance with updated data.

## How to Get Started with the Model

```python
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import torch

# Load the fine-tuned model and its preprocessing configuration from the Hub.
model = AutoModelForImageClassification.from_pretrained("agasta/virtus")
processor = AutoImageProcessor.from_pretrained("agasta/virtus")

image = Image.open("path_to_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Run inference without tracking gradients.
with torch.no_grad():
    outputs = model(**inputs)

predicted_class = outputs.logits.argmax(-1).item()
print(model.config.id2label[predicted_class])  # e.g. "Real" or "Fake"
```

## Training Details

### Training Data

The dataset consisted of 190,335 self-collected real and deepfake face images, with RandomOverSampler used to balance the two classes. The data was split into 60% training and 40% testing, maintaining class stratification.
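The exact data-preparation code is not published with this card; the sketch below shows one way the balancing and split described above could be done with `imbalanced-learn` and scikit-learn. The file paths and the 0 = Real / 1 = Fake label encoding are illustrative assumptions.

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split

# Hypothetical inputs: one file path and one binary label per image
# (0 = Real, 1 = Fake); in practice there are 190,335 of these.
image_paths = np.array(["real/0001.jpg", "real/0002.jpg", "real/0003.jpg", "fake/0001.jpg"])
labels = np.array([0, 0, 0, 1])

# Balance the two classes by randomly duplicating minority-class samples.
# Oversampling row indices keeps the sampler's input numeric.
indices = np.arange(len(labels)).reshape(-1, 1)
indices_balanced, labels_balanced = RandomOverSampler(random_state=42).fit_resample(indices, labels)
paths_balanced = image_paths[indices_balanced.ravel()]

# 60/40 train/test split, stratified so both splits keep the balanced class ratio.
train_paths, test_paths, train_labels, test_labels = train_test_split(
    paths_balanced, labels_balanced,
    test_size=0.4, stratify=labels_balanced, random_state=42,
)
```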
### Training Procedure

#### Preprocessing

- Images resized to 224x224
- Augmentations: random rotation, sharpness adjustment, normalization

#### Training Hyperparameters

- **Epochs:** 2
- **Learning rate:** 1e-6
- **Train batch size:** 32
- **Eval batch size:** 8
- **Weight decay:** 0.02
- **Optimizer:** AdamW (via Trainer API)
- **Mixed precision:** Not used

## Evaluation

### Testing Data

The 40% test portion of the same stratified 60:40 split was used for evaluation.

### Metrics

- **Accuracy**
- **F1 Score (macro)**
- **Confusion matrix**
- **Classification report**

A minimal sketch of computing these metrics is included at the end of this card.

### Results

- **Accuracy:** 99.20%
- **F1 Score (macro):** 0.9920

## Environmental Impact

- **Hardware Type:** NVIDIA Tesla V100 (Kaggle Notebook GPU)
- **Hours used:** ~2.3
- **Cloud Provider:** Kaggle
- **Compute Region:** Unknown
- **Carbon Emitted:** Can be estimated with the [MLCO2 Calculator](https://mlco2.github.io/impact#compute)

## Technical Specifications

### Model Architecture and Objective

The model is a distilled Vision Transformer (DeiT) for image classification with a binary objective: classify images as Real or Fake.

### Compute Infrastructure

- **Hardware:** 1x NVIDIA Tesla V100 GPU
- **Software:** PyTorch, Hugging Face Transformers, Datasets, Accelerate

## Citation

**BibTeX:**

```bibtex
@misc{virtus2025,
  title={Virtus: Deepfake Detection using Vision Transformers},
  author={Agasta},
  year={2025},
  howpublished={\url{https://huggingface.co/agasta/virtus}},
}
```

**APA:**

Agasta. (2025). *Virtus: Deepfake Detection using Vision Transformers*. Hugging Face. https://huggingface.co/agasta/virtus

## Model Card Contact

For questions or feedback, reach out via [GitHub](https://github.com/Itz-Agasta), open an issue on the [model repository](https://github.com/Itz-Agasta/Lopt/tree/main/models/image), or email rupam.golui@proton.me.
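As noted under Metrics above, the reported figures can be reproduced from the model's test-split predictions with scikit-learn. This is a minimal sketch: `labels` and `predictions` stand in for the ground-truth class ids and the model's argmax outputs, and the 0 = Real / 1 = Fake encoding is an assumption.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
)

# Stand-ins for the test split's ground truth and model predictions
# (0 = Real, 1 = Fake are assumed class ids).
labels = [0, 0, 1, 1]
predictions = [0, 0, 1, 0]

print("Accuracy:", accuracy_score(labels, predictions))
print("Macro F1:", f1_score(labels, predictions, average="macro"))
print(confusion_matrix(labels, predictions))
print(classification_report(labels, predictions, target_names=["Real", "Fake"]))
```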