|
--- |
|
library_name: transformers |
|
tags: |
|
- computer-vision |
|
- image-classification |
|
- vit |
|
- deepfake-detection |
|
- binary-classification |
|
- pytorch |
|
license: mit |
|
metrics: |
|
- accuracy
|
base_model: |
|
- facebook/deit-base-distilled-patch16-224 |
|
pipeline_tag: image-classification |
|
--- |
|
|
|
|
|
# Model Card for Virtus |
|
|
|
Virtus is a fine-tuned Vision Transformer (ViT) model for binary image classification, trained to distinguish real images from deepfakes. It achieves **~99.2% accuracy** on a held-out test split of a balanced dataset of over 190,000 face images.
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
Virtus is based on `facebook/deit-base-distilled-patch16-224` and was fine-tuned on a binary classification task using a large dataset of real and fake facial images. The training process involved class balancing, data augmentation, and evaluation using accuracy and F1 score. |
|
|
|
- **Developed by:** [Agasta](https://github.com/Itz-Agasta) |
|
- **Funded by:** None |
|
- **Shared by:** Agasta |
|
- **Model type:** Vision Transformer (ViT) for image classification |
|
- **Language(s):** N/A (vision model) |
|
- **License:** MIT |
|
- **Finetuned from model:** [facebook/deit-base-distilled-patch16-224](https://huggingface.co/facebook/deit-base-distilled-patch16-224) |
|
|
|
### Model Sources |
|
|
|
- **Repository:** [https://huggingface.co/agasta/virtus](https://huggingface.co/agasta/virtus) |
|
|
|
## Uses |
|
|
|
### Direct Use |
|
|
|
This model predicts whether an input image is real or a deepfake. It can be deployed in image analysis pipelines or integrated into applications that require media authenticity detection.
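For quick checks, the Transformers `pipeline` API wraps preprocessing and inference in one call. A minimal sketch (the exact label strings depend on the checkpoint's `id2label` mapping, so the output shown is illustrative):

```python
from transformers import pipeline

# Build an image-classification pipeline around the Virtus checkpoint.
classifier = pipeline("image-classification", model="agasta/virtus")

# Accepts a local file path, a URL, or a PIL image.
result = classifier("path_to_image.jpg")
print(result)  # e.g. [{'label': 'Fake', 'score': 0.99}, {'label': 'Real', 'score': 0.01}]
```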
|
|
|
### Downstream Use |
|
|
|
Virtus may be used in broader deepfake detection systems, educational tools for detecting synthetic media, or pre-screening systems for online platforms. |
|
|
|
### Out-of-Scope Use |
|
|
|
- Detection of deepfakes in videos or audio |
|
- General object classification tasks outside of the real/fake binary domain |
|
|
|
## Bias, Risks, and Limitations |
|
|
|
The dataset, while balanced, may still carry biases in facial features, lighting conditions, or demographics. The model is also not robust to non-standard input sizes or heavily occluded faces. |
|
|
|
### Recommendations |
|
|
|
- Use only on face images similar in nature to the training set. |
|
- Do not use for critical or high-stakes decisions without human verification. |
|
- Regularly re-evaluate performance with updated data. |
|
|
|
## How to Get Started with the Model |
|
|
|
```python |
|
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import torch

# Load the fine-tuned model and its image processor.
model = AutoModelForImageClassification.from_pretrained("agasta/virtus")
processor = AutoImageProcessor.from_pretrained("agasta/virtus")

# Preprocess a single image and run inference.
image = Image.open("path_to_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

predicted_class = outputs.logits.argmax(-1).item()
print(model.config.id2label[predicted_class])
|
``` |
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
The dataset consisted of 190,335 self-collected real and deepfake face images, with RandomOverSampler used to balance the two classes. The data was then split 60/40 into training and test sets with class stratification.
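A minimal sketch of this balancing and splitting step, assuming `imbalanced-learn` and scikit-learn; the file lists below are toy placeholders, not the actual dataset:

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real image paths and labels (0 = real, 1 = fake).
paths = np.array([f"img_{i}.jpg" for i in range(10)]).reshape(-1, 1)
labels = np.array([0] * 7 + [1] * 3)  # deliberately imbalanced

# Duplicate minority-class samples until both classes are equal in size.
ros = RandomOverSampler(random_state=42)
paths_bal, labels_bal = ros.fit_resample(paths, labels)

# 60/40 stratified train/test split, as described above.
train_paths, test_paths, train_labels, test_labels = train_test_split(
    paths_bal, labels_bal, test_size=0.4, stratify=labels_bal, random_state=42
)
```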
|
|
|
### Training Procedure |
|
|
|
#### Preprocessing |
|
- Images resized to 224×224

- Augmentations: random rotation, sharpness adjustment, and normalization (a possible implementation is sketched below)
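A possible torchvision pipeline matching the steps above; the rotation range, sharpness factor, and normalization statistics (ImageNet defaults here) are assumptions, not confirmed training values:

```python
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.Resize((224, 224)),                         # match the ViT input size
    transforms.RandomRotation(degrees=15),                 # assumed rotation range
    transforms.RandomAdjustSharpness(sharpness_factor=2),  # assumed sharpness factor
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],       # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])
```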
|
|
|
#### Training Hyperparameters |
|
|
|
- **Epochs:** 2 |
|
- **Learning rate:** 1e-6 |
|
- **Train batch size:** 32 |
|
- **Eval batch size:** 8 |
|
- **Weight decay:** 0.02 |
|
- **Optimizer:** AdamW (via Trainer API) |
|
- **Mixed precision:** Not used |
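A sketch of how these settings map onto `TrainingArguments`; values not listed above (such as `output_dir`) are placeholders:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="virtus-checkpoints",  # placeholder name
    num_train_epochs=2,
    learning_rate=1e-6,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=8,
    weight_decay=0.02,
    fp16=False,  # mixed precision not used
)
# The Trainer API uses AdamW as its default optimizer, so no explicit
# optimizer configuration is needed for the setup above.
```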
|
|
|
|
|
|
|
## Evaluation |
|
|
|
### Testing Data |
|
|
|
The held-out 40% portion of the stratified 60/40 split described above was used for evaluation.
|
|
|
### Metrics |
|
|
|
- **Accuracy** |
|
- **F1 Score (macro)** |
|
- **Confusion matrix** |
|
- **Classification report** |
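A minimal sketch of how these metrics could be computed with scikit-learn; `y_true` and `y_pred` are toy placeholders for the real evaluation outputs:

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
)

# Toy labels: 0 = Real, 1 = Fake.
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=["Real", "Fake"]))
```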
|
|
|
### Results |
|
|
|
- **Accuracy:** 99.20% |
|
- **F1 Score (macro):** 0.9920 |
|
|
|
## Environmental Impact |
|
|
|
- **Hardware Type:** NVIDIA Tesla V100 (Kaggle Notebook GPU) |
|
- **Hours used:** ~2.3
|
- **Cloud Provider:** Kaggle |
|
- **Compute Region:** Unknown |
|
- **Carbon Emitted:** Can be estimated via [MLCO2 Calculator](https://mlco2.github.io/impact#compute) |
|
|
|
## Technical Specifications |
|
|
|
### Model Architecture and Objective |
|
|
|
The model is a distilled Vision Transformer (DeiT) designed for image classification with a binary objective: classify images as Real or Fake. |
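A quick sanity check of the binary classification head, assuming the checkpoint exposes the standard config fields:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("agasta/virtus")
print(config.num_labels)  # expected: 2
print(config.id2label)    # mapping of class indices to Real/Fake labels
```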
|
|
|
### Compute Infrastructure |
|
|
|
- **Hardware:** 1x NVIDIA Tesla V100 GPU |
|
- **Software:** PyTorch, Hugging Face Transformers, Datasets, Accelerate |
|
|
|
## Citation |
|
|
|
**BibTeX:** |
|
```bibtex |
|
@misc{virtus2025,
  title        = {Virtus: Deepfake Detection using Vision Transformers},
  author       = {Agasta},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/agasta/virtus}},
}
|
``` |
|
|
|
**APA:** |
|
Agasta. (2025). *Virtus: Deepfake Detection using Vision Transformers*. Hugging Face. https://huggingface.co/agasta/virtus |
|
|
|
## Model Card Contact |
|
|
|
For questions or feedback, reach out via [GitHub](https://github.com/Itz-Agasta), open an issue on the [model repository](https://github.com/Itz-Agasta/Lopt/tree/main/models/image), or email [email protected].