arxiv:2410.01820

PixelBytes: Catching Unified Representation for Multimodal Generation

Published on Sep 16, 2024

Authors:

Fabien Furfaro

Abstract

PixelBytes integrates multimodal data into a cohesive representation, demonstrating that autoregressive models perform better than predictive models and diffusion models can be applied to control problems and parallelized generation.

AI-generated summary

This report presents PixelBytes, an approach for unified multimodal representation learning. Drawing inspiration from sequence models like Image Transformers, PixelCNN, and Mamba-Bytes, we explore integrating text, audio, action-state, and pixelated images (sprites) into a cohesive representation. We conducted experiments on a PixelBytes Pokemon dataset and an Optimal-Control dataset. Our investigation covered various model architectures, including Recurrent Neural Networks (RNNs), State Space Models (SSMs), and Attention-based models, with a focus on bidirectional processing and our PxBy embedding technique. We evaluated models based on data reduction strategies and autoregressive learning, specifically examining Long Short-Term Memory (LSTM) networks in predictive and autoregressive modes. Our results indicate that autoregressive models perform better than predictive models in this context. Additionally, we found that diffusion models can be applied to control problems and parallelized generation. PixelBytes aims to contribute to the development of foundation models for multimodal data processing and generation. The project's code, models, and datasets are available online.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2410.01820 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2410.01820 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2410.01820 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.