MultiModalHackVAE

A multi-modal Variational Autoencoder trained on NetHack game states for representation learning.

Model Description

This model is a MultiModalHackVAE that learns compact representations of NetHack game states by processing:

  • Game character grids (21×79)
  • Color information
  • Game statistics (blstats)
  • Message text
  • Bag of glyphs
  • Hero information (role, race, gender, alignment)

Model Details

  • Model Type: Multi-modal Variational Autoencoder
  • Framework: PyTorch
  • Dataset: NetHack Learning Dataset
  • Latent Dimensions: 96
  • Low-rank Dimensions: 0

Usage

from train import load_model_from_huggingface
import torch

# Load the model
model = load_model_from_huggingface("CatkinChen/nethack-vae")

# Example usage with synthetic data
batch_size = 1
game_chars = torch.randint(32, 127, (batch_size, 21, 79))
game_colors = torch.randint(0, 16, (batch_size, 21, 79))
blstats = torch.randn(batch_size, 27)
msg_tokens = torch.randint(0, 128, (batch_size, 256))
hero_info = torch.randint(0, 10, (batch_size, 4))

with torch.no_grad():
    output = model(
        glyph_chars=game_chars,
        glyph_colors=game_colors,
        blstats=blstats,
        msg_tokens=msg_tokens,
        hero_info=hero_info
    )
    latent_mean = output['mu']
    latent_logvar = output['logvar']
    lowrank_factors = output['lowrank_factors']
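
To draw a latent sample from the returned mean and log-variance, the standard VAE reparameterization trick applies (this is a generic VAE operation, not an API of this model; the placeholder tensors below stand in for the model outputs above):

```python
import torch

# Placeholders standing in for output['mu'] and output['logvar'];
# the model uses 96 latent dimensions.
latent_dim = 96
mu = torch.zeros(1, latent_dim)
logvar = torch.zeros(1, latent_dim)

# Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
std = torch.exp(0.5 * logvar)
eps = torch.randn_like(std)
z = mu + eps * std  # shape (1, 96)
```

For downstream representation learning, using `mu` directly as a deterministic embedding is also common.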

Training

This model was trained using adaptive loss weighting with:

  • Embedding warm-up for quick convergence
  • Gradual raw reconstruction focus
  • KL beta annealing for better latent structure
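
KL beta annealing typically ramps the weight on the KL term from 0 up to its full value so the decoder learns to reconstruct before the latent space is regularized. The exact schedule used for this model is not published; a minimal linear-warm-up sketch looks like:

```python
def kl_beta(step: int, warmup_steps: int = 10_000, beta_max: float = 1.0) -> float:
    """Linear KL beta annealing: ramp beta from 0 to beta_max over
    warmup_steps, then hold constant. Illustrative schedule only;
    warmup_steps and beta_max here are assumed values."""
    return beta_max * min(1.0, step / warmup_steps)

# Per-step loss would then combine as:
#   loss = reconstruction_loss + kl_beta(step) * kl_divergence
```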

Citation

If you use this model, please consider citing:

@misc{nethack-vae,
  title={MultiModalHackVAE: Multi-modal Variational Autoencoder for NetHack},
  author={Xu Chen},
  year={2025},
  url={https://huggingface.co/CatkinChen/nethack-vae}
}