This is a conditional U-Net model for music generation using mel spectrogram images. It was trained on the alppo/music dataset, which covers 5 genres. The model takes 512x512 images and 1x64 condition embeddings as input; the embeddings can be generated with my own variational autoencoder implementation.
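
To make the expected input shapes concrete, here is a minimal sketch that builds random placeholder arrays of the right size. It does not load or call the model. The helper name `make_dummy_inputs`, the batch dimension, the single channel, and the channels-first layout are all assumptions for illustration; only the 512x512 image size and the 1x64 condition shape come from the description above.

```python
import numpy as np

IMAGE_SIZE = 512     # spectrogram image height and width, as stated above
CONDITION_DIM = 64   # length of one condition embedding, as stated above


def make_dummy_inputs(batch_size=1, channels=1, seed=0):
    """Build random placeholder inputs with the shapes described above.

    Hypothetical helper: the channel count and the channels-first layout
    are assumptions, not something the model description confirms.
    """
    rng = np.random.default_rng(seed)
    # One 512x512 mel spectrogram image per batch item.
    spectrogram = rng.standard_normal(
        (batch_size, channels, IMAGE_SIZE, IMAGE_SIZE), dtype=np.float32
    )
    # One 1x64 condition embedding per batch item.
    condition = rng.standard_normal(
        (batch_size, 1, CONDITION_DIM), dtype=np.float32
    )
    return spectrogram, condition


spectrogram, condition = make_dummy_inputs()
print(spectrogram.shape, condition.shape)
```

In real use, the random condition array would be replaced by an embedding produced by the variational autoencoder, and the channel count should be checked against the actual model configuration.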