---
library_name: transformers
tags:
- computer-vision
- image-classification
- vit
- deepfake-detection
- binary-classification
- pytorch
license: mit
metrics:
- accuracy
base_model:
- facebook/deit-base-distilled-patch16-224
pipeline_tag: image-classification
---

# Model Card for Virtus

Virtus is a fine-tuned Vision Transformer (ViT) model for binary image classification, trained to distinguish real images from deepfakes. It achieves **~99.2% accuracy** on the held-out test split of a balanced dataset of over 190,000 images.

## Model Details

### Model Description

Virtus is based on `facebook/deit-base-distilled-patch16-224` and was fine-tuned on a binary classification task using a large dataset of real and fake facial images. The training process involved class balancing, data augmentation, and evaluation using accuracy and F1 score.

- **Developed by:** [Agasta](https://github.com/Itz-Agasta)
- **Funded by:** None
- **Shared by:** Agasta
- **Model type:** Vision Transformer (ViT) for image classification
- **Language(s):** N/A (vision model)
- **License:** MIT
- **Finetuned from model:** [facebook/deit-base-distilled-patch16-224](https://huggingface.co/facebook/deit-base-distilled-patch16-224)

### Model Sources

- **Repository:** [https://huggingface.co/agasta/virtus](https://huggingface.co/agasta/virtus)

## Uses

### Direct Use

The model predicts whether an input image is real or a deepfake. It can be deployed in image analysis pipelines or integrated into applications that require media authenticity detection.

### Downstream Use

Virtus may be used in broader deepfake detection systems, educational tools for detecting synthetic media, or pre-screening systems for online platforms.

### Out-of-Scope Use

- Detection of deepfakes in videos or audio
- General object classification tasks outside the real/fake binary domain

## Bias, Risks, and Limitations

The dataset, while balanced, may still carry biases in facial features, lighting conditions, or demographics. The model is also not robust to non-standard input sizes or heavily occluded faces.

### Recommendations

- Use only on face images similar in nature to the training set.
- Do not use for critical or high-stakes decisions without human verification.
- Regularly re-evaluate performance with updated data.

## How to Get Started with the Model

```python
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import torch

# Load the fine-tuned model and its preprocessing configuration from the Hub.
model = AutoModelForImageClassification.from_pretrained("agasta/virtus")
processor = AutoImageProcessor.from_pretrained("agasta/virtus")

image = Image.open("path_to_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Run inference without tracking gradients.
with torch.no_grad():
    outputs = model(**inputs)

predicted_class = outputs.logits.argmax(-1).item()
print(model.config.id2label[predicted_class])  # e.g. "Real" or "Fake"
```

## Training Details

### Training Data

The dataset consisted of 190,335 self-collected real and deepfake face images, with RandomOverSampler used to balance the two classes. The data was split into 60% training and 40% testing, maintaining class stratification.
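The exact data-preparation code is not published with this card; the sketch below shows one way the balancing and split described above could be done with `imbalanced-learn` and scikit-learn. The file paths and the 0 = Real / 1 = Fake label encoding are illustrative assumptions.

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split

# Hypothetical inputs: one file path and one binary label per image
# (0 = Real, 1 = Fake); in practice there are 190,335 of these.
image_paths = np.array(["real/0001.jpg", "real/0002.jpg", "real/0003.jpg", "fake/0001.jpg"])
labels = np.array([0, 0, 0, 1])

# Balance the two classes by randomly duplicating minority-class samples.
# Oversampling row indices keeps the sampler's input numeric.
indices = np.arange(len(labels)).reshape(-1, 1)
indices_balanced, labels_balanced = RandomOverSampler(random_state=42).fit_resample(indices, labels)
paths_balanced = image_paths[indices_balanced.ravel()]

# 60/40 train/test split, stratified so both splits keep the balanced class ratio.
train_paths, test_paths, train_labels, test_labels = train_test_split(
    paths_balanced, labels_balanced,
    test_size=0.4, stratify=labels_balanced, random_state=42,
)
```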
### Training Procedure

#### Preprocessing

- Images resized to 224x224
- Augmentations: random rotation, sharpness adjustment, normalization

#### Training Hyperparameters

- **Epochs:** 2
- **Learning rate:** 1e-6
- **Train batch size:** 32
- **Eval batch size:** 8
- **Weight decay:** 0.02
- **Optimizer:** AdamW (via Trainer API)
- **Mixed precision:** Not used

## Evaluation

### Testing Data

The 40% test portion of the same stratified 60:40 split was used for evaluation.

### Metrics

- **Accuracy**
- **F1 Score (macro)**
- **Confusion matrix**
- **Classification report**

A minimal sketch of computing these metrics is included at the end of this card.

### Results

- **Accuracy:** 99.20%
- **F1 Score (macro):** 0.9920

## Environmental Impact

- **Hardware Type:** NVIDIA Tesla V100 (Kaggle Notebook GPU)
- **Hours used:** ~2.3
- **Cloud Provider:** Kaggle
- **Compute Region:** Unknown
- **Carbon Emitted:** Can be estimated with the [MLCO2 Calculator](https://mlco2.github.io/impact#compute)

## Technical Specifications

### Model Architecture and Objective

The model is a distilled Vision Transformer (DeiT) for image classification with a binary objective: classify images as Real or Fake.

### Compute Infrastructure

- **Hardware:** 1x NVIDIA Tesla V100 GPU
- **Software:** PyTorch, Hugging Face Transformers, Datasets, Accelerate

## Citation

**BibTeX:**

```bibtex
@misc{virtus2025,
  title={Virtus: Deepfake Detection using Vision Transformers},
  author={Agasta},
  year={2025},
  howpublished={\url{https://huggingface.co/agasta/virtus}},
}
```

**APA:**

Agasta. (2025). *Virtus: Deepfake Detection using Vision Transformers*. Hugging Face. https://huggingface.co/agasta/virtus

## Model Card Contact

For questions or feedback, reach out via [GitHub](https://github.com/Itz-Agasta), open an issue on the [model repository](https://github.com/Itz-Agasta/Lopt/tree/main/models/image), or email rupam.golui@proton.me.
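As noted under Metrics above, the reported figures can be reproduced from the model's test-split predictions with scikit-learn. This is a minimal sketch: `labels` and `predictions` stand in for the ground-truth class ids and the model's argmax outputs, and the 0 = Real / 1 = Fake encoding is an assumption.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
)

# Stand-ins for the test split's ground truth and model predictions
# (0 = Real, 1 = Fake are assumed class ids).
labels = [0, 0, 1, 1]
predictions = [0, 0, 1, 0]

print("Accuracy:", accuracy_score(labels, predictions))
print("Macro F1:", f1_score(labels, predictions, average="macro"))
print(confusion_matrix(labels, predictions))
print(classification_report(labels, predictions, target_names=["Real", "Fake"]))
```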