	---
	license: apache-2.0
	pipeline_tag: image-feature-extraction
	tags:
	- pytorch
	- aesthetics
	metrics:
	- pearsonr
	- spearmanr
	- accuracy
	base_model:
	- facebook/dinov2-small
	---

	💫 Official implementation of Charm: The Missing Piece in ViT fine-tuning for Image Aesthetic Assessment

	> [Accepted at CVPR 2025](https://cvpr.thecvf.com/virtual/2025/poster/34423)<br>
	> [arXiv](https://arxiv.org/abs/2504.02522)<br>

	<div align="left">
	<a href="https://github.com/FBehrad/Charm">
	<img src="https://github.com/FBehrad/Charm/blob/main/Figures/MainFigure.jpg?raw=true" alt="Overall framework" width="400"/>
	</a>
	</div>

	We introduce Charm, a novel tokenization approach that simultaneously preserves Composition, High-resolution,
	Aspect Ratio, and Multi-scale information. By preserving this critical information, <em>Charm</em> works like a charm for image aesthetic and quality assessment 🌟.


	### Quick Inference

	* Step 1) Check our [GitHub Page](https://github.com/FBehrad/Charm/) and install the requirements.

	```setup
	pip install -r requirements.txt
	```
	___
	* Step 2) Install the Charm tokenizer.
	```setup
	pip install Charm-tokenizer
	```
	___
	* Step 3) Tokenization + Position embedding preparation

	<div align="center">
	<a href="https://github.com/FBehrad/Charm">
	<img src="https://github.com/FBehrad/Charm/blob/main/Figures/charm.gif?raw=true" alt="Charm tokenizer" width="700"/>
	</a>
	</div>


	The Charm tokenizer accepts the following input arguments:
	* patch_selection (str): The method for selecting important patches.
	  * Options: 'saliency', 'random', 'frequency', 'gradient', 'entropy', 'original'.
	* training_dataset (str): Sets the number of ViT input tokens to match a specific training dataset from the paper.
	  * Aesthetic assessment datasets: 'ava', 'aadb', 'tad66k', 'para', 'baid'.
	  * Quality assessment datasets: 'spaq', 'koniq10k'.
	* backbone (str): The ViT backbone model (default: 'facebook/dinov2-small' for all datasets except AVA, and 'facebook/dinov2-large' for AVA only).
	* factor (float): The downscaling factor for less important patches (default: 0.5).
	* scales (int): The number of scales used for multiscale processing (default: 2).
	* random_crop_size (tuple): Used for the 'original' patch selection strategy (default: (224, 224)).
	* downscale_shortest_edge (int): Used for the 'original' patch selection strategy (default: 256).
	* without_pad_or_dropping (bool): Whether to avoid padding or dropping patches (default: True).
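The defaults listed above can be collected in a small helper before constructing the tokenizer. The sketch below is only an illustration: `default_backbone` and `build_tokenizer_kwargs` are hypothetical names that are not part of the Charm-tokenizer package, and the values simply mirror the argument list above.

```python
# Hypothetical helpers (NOT part of Charm-tokenizer); they only mirror the
# options and defaults documented in the argument list above.
PATCH_SELECTION_OPTIONS = {
    "saliency", "random", "frequency", "gradient", "entropy", "original",
}
SUPPORTED_DATASETS = {"ava", "aadb", "tad66k", "para", "baid", "spaq", "koniq10k"}


def default_backbone(training_dataset: str) -> str:
    """Return the documented default backbone for a training dataset."""
    name = training_dataset.lower()
    if name not in SUPPORTED_DATASETS:
        raise ValueError(f"Unknown training_dataset: {training_dataset!r}")
    # AVA is the only dataset documented as using the large backbone.
    return "facebook/dinov2-large" if name == "ava" else "facebook/dinov2-small"


def build_tokenizer_kwargs(patch_selection: str, training_dataset: str) -> dict:
    """Assemble keyword arguments using the documented default values."""
    if patch_selection not in PATCH_SELECTION_OPTIONS:
        raise ValueError(f"Unknown patch_selection: {patch_selection!r}")
    return {
        "patch_selection": patch_selection,
        "training_dataset": training_dataset.lower(),
        "backbone": default_backbone(training_dataset),
        "factor": 0.5,                    # documented default
        "scales": 2,                      # documented default
        "without_pad_or_dropping": True,  # documented default
    }
```

The resulting dictionary could then be unpacked into the tokenizer constructor, assuming the constructor accepts these keyword names as listed above.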

	Note: Random patch selection during training helps avoid overfitting, but for consistent results during inference, use a fully deterministic patch selection approach.
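To illustrate why this matters, the toy sketch below contrasts a deterministic top-k selection with a random one. It is not Charm's implementation: the importance scores are made up, and `select_top_k` and `select_random` are hypothetical helpers used only to show that deterministic selection returns the same patches on every call.

```python
import random


def select_top_k(scores, k):
    """Deterministic: indices of the k highest scores (ties broken by index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:k])


def select_random(scores, k, rng):
    """Random: k indices sampled without replacement; varies with the RNG state."""
    return sorted(rng.sample(range(len(scores)), k))


# Made-up per-patch importance scores, for illustration only.
toy_scores = [0.1, 0.9, 0.4, 0.7, 0.2, 0.8]

# Two deterministic calls always agree.
first = select_top_k(toy_scores, 3)
second = select_top_k(toy_scores, 3)

# Random selection depends on the generator, so repeated runs can differ
# unless the seed is fixed.
sampled = select_random(toy_scores, 3, random.Random())
```

With a deterministic strategy, the same image yields the same selected patches and therefore the same prediction across runs.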

	The output is the preprocessed tokens, their corresponding positional embeddings, and a mask token that indicates which patches are in high resolution and which are in low resolution.

	```python
	from Charm_tokenizer.ImageProcessor import Charm_Tokenizer

	img_path = r"img.png"  # path to the input image

	charm_tokenizer = Charm_Tokenizer(
	    patch_selection='frequency',
	    training_dataset='tad66k',
	    backbone='facebook/dinov2-small',
	    without_pad_or_dropping=True,
	)
	# Returns the tokens, their positional embeddings, and the high/low-resolution mask
	tokens, pos_embed, mask_token = charm_tokenizer.preprocess(img_path)
	```
	___

	* Step 4) Predicting aesthetic/quality score

	* If training_dataset is set to 'spaq' or 'koniq10k', the model predicts an image quality score. For the other options ('ava', 'aadb', 'tad66k', 'para', 'baid'), it predicts an image aesthetic score.

	* Selecting a dataset with image resolutions similar to your input images can improve prediction accuracy.

	* For more details about the process, please refer to the [paper](https://cvpr.thecvf.com/virtual/2025/poster/34423).


	```python
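	from Charm_tokenizer.Backbone import backbone

	# Backbone for the 'tad66k' setting, matching the tokenizer configured in Step 3
	model = backbone(training_dataset='tad66k', device='cpu')
	# Predict the score from the tokenizer outputs of Step 3
	prediction = model.predict(tokens, pos_embed, mask_token)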
	```
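The rule in Step 4 for which kind of score is predicted can be written down as a small lookup. This is a minimal sketch: `predicted_score_type` is a hypothetical helper that is not part of the Charm-tokenizer package, and 'ava' is treated as an aesthetic dataset based on the argument list in Step 3.

```python
# Hypothetical helper (NOT part of Charm-tokenizer): maps a training dataset
# name to the kind of score the corresponding model predicts.
QUALITY_DATASETS = {"spaq", "koniq10k"}
AESTHETIC_DATASETS = {"ava", "aadb", "tad66k", "para", "baid"}


def predicted_score_type(training_dataset: str) -> str:
    """Return 'quality' or 'aesthetic' for a supported training dataset."""
    name = training_dataset.lower()
    if name in QUALITY_DATASETS:
        return "quality"
    if name in AESTHETIC_DATASETS:
        return "aesthetic"
    raise ValueError(f"Unknown training_dataset: {training_dataset!r}")


# Example: label a prediction with the kind of score it represents.
score_kind = predicted_score_type("tad66k")
```

A helper like this can be handy when reporting results, so that quality and aesthetic scores are not mixed up.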


	Note: For the training code, check our [GitHub Page](https://github.com/FBehrad/Charm/).