license: apache-2.0
pipeline_tag: text-to-3d
MeshGPT-alpha-preview
MeshGPT is a text-to-3D model built from two parts: an autoencoder (the tokenizer) and a transformer that generates the tokens.
The autoencoder translates 3D models into tokens, and its decoder can convert those tokens back into a 3D mesh.
To my knowledge, this autoencoder is the world's first published 3D model tokenizer (correct me if I'm wrong!).
Model Details
The autoencoder (tokenizer) is a relatively small model with 50M parameters. The transformer has 184M parameters, and its core is based on GPT-2 small.
Due to hardware constraints, it was trained with a codebook (vocabulary) size of 2048.
Developed & trained by: Me, with credit for the MeshGPT codebase to Phil Wang
Performance (generation speed):
CPU: 10 triangles/s
3060 GPU: 40 triangles/s
4090 GPU: 110 triangles/s
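As a rough worked example, the throughput figures above can be turned into approximate wall-clock times per mesh. The sketch assumes that generation time scales linearly with the triangle count, and it uses the 250-triangle cap from the training data as the mesh size. Real timings will vary with prompt, batch size and hardware.

```python
# Approximate generation throughput from the figures above (triangles per second).
THROUGHPUT = {"CPU": 10, "3060 GPU": 40, "4090 GPU": 110}


def estimated_seconds(num_triangles: int, triangles_per_second: float) -> float:
    """Estimate generation time, assuming a constant triangles/s rate (an assumption)."""
    return num_triangles / triangles_per_second


# 250 triangles is the maximum mesh size in the training set.
for device, rate in THROUGHPUT.items():
    print(f"{device}: ~{estimated_seconds(250, rate):.1f} s for a 250-triangle mesh")
```

Under these assumptions, a full-size mesh takes about 25 s on CPU and a little over 2 s on a 4090.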
Warning:
This model was created without any sponsors or rented GPU hardware, so what it can generate is very limited. It handles simple single objects such as 'chair' or 'table' well, but more complex objects require more training (see the Training dataset section).
There is also a problem with face orientation, since the triangle order was optimized for the model before training. This will be fixed in later versions.
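A face's orientation is set by the winding order of its three vertices: by the right-hand rule, the cross product of two edge vectors gives the face normal, and swapping two vertex indices flips it. The sketch below shows this general technique under a loudly stated assumption. It flips any face whose normal points toward the mesh centroid, a heuristic that is only reliable for convex, closed meshes. It is not the fix planned for later versions, and the helper names are hypothetical.

```python
Vec3 = tuple[float, float, float]


def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def face_normal(vertices: list[Vec3], face: tuple[int, int, int]) -> Vec3:
    """Unnormalized normal; its direction follows the winding order (right-hand rule)."""
    v0, v1, v2 = (vertices[i] for i in face)
    return cross(sub(v1, v0), sub(v2, v0))


def orient_outward(vertices: list[Vec3], faces: list[tuple[int, int, int]]) -> list[tuple[int, int, int]]:
    """Flip faces whose normal points toward the centroid (heuristic: convex, closed meshes only)."""
    n = len(vertices)
    centroid = tuple(sum(v[k] for v in vertices) / n for k in range(3))
    fixed = []
    for face in faces:
        v0, v1, v2 = (vertices[i] for i in face)
        face_center = tuple((v0[k] + v1[k] + v2[k]) / 3 for k in range(3))
        if dot(face_normal(vertices, face), sub(face_center, centroid)) < 0:
            face = (face[0], face[2], face[1])  # swapping two indices reverses the winding
        fixed.append(face)
    return fixed
```

For example, on a tetrahedron with vertices (0,0,0), (1,0,0), (0,1,0), (0,0,1), the inward-facing bottom face (0, 1, 2) is flipped to (0, 2, 1).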
Usage:
Install:
pip install git+https://github.com/MarcusLoppe/meshgpt-pytorch.git
import torch
from meshgpt_pytorch import (
    MeshAutoencoder,
    MeshTransformer,
    mesh_render
)

device = "cuda" if torch.cuda.is_available() else "cpu"
# Download the pretrained transformer from the Hugging Face Hub
transformer = MeshTransformer.from_pretrained("MarcusLoren/MeshGPT-preview").to(device)

# One generate() call per batch of text prompts; temperature = 0.0 gives deterministic output
prompt_batches = [
    ['sofa', 'bed', 'computer screen', 'bench', 'chair', 'table'],
    ['milk carton', 'door', 'shovel', 'heart', 'trash can', 'ladder'],
    ['hammer', 'pedestal', 'pickaxe', 'wooden cross', 'coffee bean', 'crowbar'],
    ['key', 'minecraft character', 'dragon head', 'open book', 'minecraft turtle', 'wooden table'],
    ['gun', 'ice cream cone', 'axe', 'helicopter', 'shotgun', 'plastic bottle'],
]
output = [transformer.generate(texts = prompts, temperature = 0.0) for prompts in prompt_batches]

# Combine all generated meshes and save them to ./render.obj
mesh_render.save_rendering('./render.obj', output)
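The `temperature = 0.0` argument above makes generation deterministic. The sketch below shows the general idea of temperature-scaled sampling over a list of logits, assuming that a temperature of 0 falls back to greedy (argmax) selection. It is a plain-Python illustration of the technique, not meshgpt-pytorch's internal sampling code, and `sample_token` is a hypothetical name.

```python
import math
import random


def sample_token(logits: list[float], temperature: float, rng: random.Random) -> int:
    """Pick a token index from logits using temperature scaling."""
    if temperature <= 0.0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max before exponentiating for numerical stability
    weights = [math.exp(x - m) for x in scaled]  # unnormalized softmax probabilities
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

Higher temperatures flatten the distribution and give more varied meshes for the same prompt. A temperature of 0 always returns the same tokens, which is why the example prompts above produce reproducible output.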
Expected output:
Random samples generated by text only:
Training dataset
I've only had access to the free-tier GPU on Kaggle, so this model was trained on only 4k models with at most 250 triangles each. The dataset contains a total of 800 text labels, so what the model can generate is limited. The 3D models were sourced from Objaverse, ShapeNet and ModelNet40.
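The triangle cap described above can be applied as a simple filter. This is a minimal sketch assuming each dataset entry is a hypothetical `(label, faces)` pair, where `faces` is a list of triangles. It keeps only meshes within the budget and reports the distinct text labels that remain. The real data pipeline is not shown here.

```python
MAX_TRIANGLES = 250  # training cap described above


def filter_dataset(entries, max_triangles: int = MAX_TRIANGLES):
    """Keep meshes within the triangle budget and return them with their distinct labels.

    `entries` is an iterable of hypothetical (label, faces) pairs.
    """
    kept = [(label, faces) for label, faces in entries if len(faces) <= max_triangles]
    labels = sorted({label for label, _ in kept})
    return kept, labels
```

Filtering this way keeps the token sequences short enough for limited GPU memory. The trade-off is that detailed objects drop out of the training set entirely.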
How it works:
MeshGPT uses an autoencoder that takes a 3D mesh (quads are supported by the codebase but not used in this model) and quantizes it into codebook entries that can be used as tokens. The second part of MeshGPT is the transformer, which is trained on the tokens generated by the autoencoder while cross-attending to a text embedding.
The final product is a tokenizer and a transformer that takes a text embedding as input and autoregressively generates a 3D model from it. The autoencoder then converts the tokens generated by the transformer back into a 3D mesh.
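The quantization step can be illustrated with plain nearest-neighbour vector quantization. Each continuous embedding is replaced by the index of the closest codebook vector, and that index is the token. This simplified sketch uses a tiny hypothetical 2-D codebook. The actual autoencoder learns a codebook of 2048 entries, and its quantizer may be more involved (for example, residual quantization).

```python
def squared_distance(a, b) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(embeddings, codebook) -> list[int]:
    """Map each embedding to the index of its nearest codebook vector (its token)."""
    return [
        min(range(len(codebook)), key=lambda i: squared_distance(e, codebook[i]))
        for e in embeddings
    ]


def dequantize(tokens, codebook) -> list:
    """Look tokens back up in the codebook, as a decoder would before reconstructing geometry."""
    return [codebook[t] for t in tokens]


codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # hypothetical 4-entry codebook
embeddings = [(0.9, 0.1), (0.2, 0.8), (0.1, -0.1)]
tokens = quantize(embeddings, codebook)
```

Here `tokens` is `[1, 2, 0]`. Any embedding close to a codebook vector is represented by that vector's index, which is what lets the transformer treat mesh geometry as a discrete vocabulary.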
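The autoregressive part can be sketched as a loop that repeatedly asks the model for the next token, conditioned on the text and all tokens so far, until an end-of-sequence token appears. In this sketch, `next_token` is a hypothetical stand-in for the transformer (it is not a meshgpt-pytorch function), and `EOS` and `max_len` are assumed values.

```python
from typing import Callable, Sequence

EOS = -1  # hypothetical end-of-sequence token id


def autoregressive_generate(
    text: str,
    next_token: Callable[[str, Sequence[int]], int],
    max_len: int = 64,
) -> list[int]:
    """Each new token is conditioned on the text and every previously generated token."""
    tokens: list[int] = []
    while len(tokens) < max_len:
        tok = next_token(text, tokens)
        if tok == EOS:
            break
        tokens.append(tok)
    return tokens


# Toy stand-in model: emits 0, 1, 2, 3, 4 and then stops.
def toy_next_token(text: str, prefix: Sequence[int]) -> int:
    return len(prefix) if len(prefix) < 5 else EOS
```

With the toy model, `autoregressive_generate("chair", toy_next_token)` returns `[0, 1, 2, 3, 4]`. In MeshGPT, the resulting token sequence is what gets handed to the autoencoder's decoder to rebuild the mesh.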
Credits
The idea for MeshGPT comes from the paper (https://arxiv.org/abs/2311.15475), but its authors didn't release any code or model.
Phil Wang (https://github.com/lucidrains) drew inspiration from the paper, made many improvements over the paper's implementation, and created the repo: https://github.com/lucidrains/meshgpt-pytorch
My goal has been to figure out how to train MeshGPT and bring it into reality.
Many thanks to K. S. Ernest, who helped me with the Gradio demo as well as with training the upcoming model on a larger dataset.
See my GitHub repo for a notebook on how to get started training your own MeshGPT: MarcusLoppe/meshgpt-pytorch