ImageGPT (small-sized model)

ImageGPT (iGPT) model pre-trained on ImageNet ILSVRC 2012 (14 million images, 21,843 classes) at resolution 32x32. It was introduced in the paper Generative Pretraining from Pixels by Chen et al. and first released in this repository. See also the official blog post.

Model description

The ImageGPT (iGPT) is a transformer decoder model (GPT-like) pretrained on a large collection of images in a self-supervised fashion, namely ImageNet-21k, at a resolution of 32x32 pixels.

The goal for the model is simply to predict the next pixel value, given the previous ones.

By pre-training the model, it learns an inner representation of images that can then be used to:

extract features useful for downstream tasks: one can either use ImageGPT to produce fixed image features, in order to train a linear model (like a sklearn logistic regression model or SVM). This is also referred to as "linear probing".
perform (un)conditional image generation.

Intended uses & limitations

You can use the raw model for either feature extractor or (un) conditional image generation.

How to use

Here is how to use this model as feature extractor:

from transformers import AutoFeatureExtractor
from onnxruntime import InferenceSession
from datasets import load_dataset

# load image
dataset = load_dataset("huggingface/cats-image")
image = dataset["test"]["image"][0]

# load model
feature_extractor = AutoFeatureExtractor.from_pretrained("openai/imagegpt-small")
session = InferenceSession("model/model.onnx")

# ONNX Runtime expects NumPy arrays as input
inputs = feature_extractor(image, return_tensors="np")
outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs))

Or you can use the model with classification head that returns logits

from transformers import AutoFeatureExtractor
from onnxruntime import InferenceSession
from datasets import load_dataset

# load image
dataset = load_dataset("huggingface/cats-image")
image = dataset["test"]["image"][0]

# load model
feature_extractor = AutoFeatureExtractor.from_pretrained("openai/imagegpt-small")
session = InferenceSession("model/model_classification.onnx")

# ONNX Runtime expects NumPy arrays as input
inputs = feature_extractor(image, return_tensors="np")
outputs = session.run(output_names=["logits"], input_feed=dict(inputs))

Original implementation

Follow this link to see the original implementation.

Training data

The ImageGPT model was pretrained on ImageNet-21k, a dataset consisting of 14 million images and 21k classes.

Training procedure

Preprocessing

Images are first resized/rescaled to the same resolution (32x32) and normalized across the RGB channels. Next, color-clustering is performed. This means that every pixel is turned into one of 512 possible cluster values. This way, one ends up with a sequence of 32x32 = 1024 pixel values, rather than 32x32x3 = 3072, which is prohibitively large for Transformer-based models.

Pretraining

Training details can be found in section 3.4 of v2 of the paper.

Evaluation results

For evaluation results on several image classification benchmarks, we refer to the original paper.

BibTeX entry and citation info

@InProceedings{pmlr-v119-chen20s,
  title = 	 {Generative Pretraining From Pixels},
  author =       {Chen, Mark and Radford, Alec and Child, Rewon and Wu, Jeffrey and Jun, Heewoo and Luan, David and Sutskever, Ilya},
  booktitle = 	 {Proceedings of the 37th International Conference on Machine Learning},
  pages = 	 {1691--1703},
  year = 	 {2020},
  editor = 	 {III, Hal DaumÃ© and Singh, Aarti},
  volume = 	 {119},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--18 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v119/chen20s/chen20s.pdf},
  url = 	 {https://proceedings.mlr.press/v119/chen20s.html
}

@inproceedings{deng2009imagenet,
  title={Imagenet: A large-scale hierarchical image database},
  author={Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li},
  booktitle={2009 IEEE conference on computer vision and pattern recognition},
  pages={248--255},
  year={2009},
  organization={Ieee}
}