SEE-2-SOUND🔊: Zero-Shot Spatial Environment-to-Spatial Sound

Rishit Dagli1 · Shivesh Prakash1 · Rupert Wu1 · Houman Khosravani1,2,3

1University of Toronto    2Temerty Centre for Artificial Intelligence Research and Education in Medicine    3Sunnybrook Research Institute


This work presents SEE-2-SOUND, a method to generate spatial audio from images, animated images, and videos to accompany the visual content. Check out our website to view some results of this work.

(Teaser figure: generating spatial audio to accompany visual content.)

These checkpoints are meant to be used with our code: SEE-2-SOUND.

Installation

First, install the pip package and download these checkpoints (needs Git LFS):

pip install -e git+https://github.com/see2sound/see2sound.git#egg=see2sound
git clone https://huggingface.co/rishitdagli/see-2-sound
cd see-2-sound

See the full installation instructions, as well as tips on dependencies, in the repository README.
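Before moving on, it can help to verify that Git LFS actually fetched the weights rather than leaving pointer stubs behind. Below is a minimal sketch (not part of the official tooling); the checkpoint paths are assumptions read off the example config.yaml in the next section:

from pathlib import Path

# Checkpoint paths assumed from the example config.yaml below; adjust to your layout.
checkpoints = [
    "codi/codi_encoder.pth",
    "codi/codi_text.pth",
    "codi/codi_audio.pth",
    "codi/codi_video.pth",
    "sam/sam.pth",
    "depth/depth.pth",
]

for name in checkpoints:
    path = Path(name)
    if not path.exists():
        raise FileNotFoundError(f"missing checkpoint: {path}")
    # A Git LFS pointer file is a tiny text stub; real weights are far larger.
    if path.stat().st_size < 1024:
        raise RuntimeError(f"{path} looks like an LFS pointer; run `git lfs pull`")

print("All checkpoints look good.")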

Running the Models

Now, create a configuration file named config.yaml:

codi_encoder: 'codi/codi_encoder.pth'
codi_text: 'codi/codi_text.pth'
codi_audio: 'codi/codi_audio.pth'
codi_video: 'codi/codi_video.pth'

sam: 'sam/sam.pth'
# H, L, or B, in decreasing order of performance
sam_size: 'H'

depth: 'depth/depth.pth'
# L, B, or S, in decreasing order of performance
depth_size: 'L'

download: False

# Change to True if your GPU has less than 40 GB of VRAM
low_mem: False
fp16: False
gpu: True
steps: 500
num_audios: 3
prompt: ''
verbose: True
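Since a typo in config.yaml only surfaces once model setup fails, you may want to sanity-check the file first. A minimal sketch using PyYAML; the required keys and allowed size codes are assumptions read off the example above:

import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

# Keys mirrored from the example config above.
required = ["codi_encoder", "codi_text", "codi_audio", "codi_video", "sam", "depth"]
missing = [key for key in required if key not in cfg]
if missing:
    raise KeyError(f"config.yaml is missing keys: {missing}")

# Size codes taken from the comments in the example config.
assert cfg["sam_size"] in {"H", "L", "B"}, "sam_size must be H, L, or B"
assert cfg["depth_size"] in {"L", "B", "S"}, "depth_size must be L, B, or S"

print("Config looks sane.")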

Now we can run inference:

import see2sound

# Load the configuration and set up all model components.
config_file_path = "config.yaml"
model = see2sound.See2Sound(config_path=config_file_path)
model.setup()

# Generate spatial audio for an image and write it to disk.
model.run(path="test.png", output_path="test.wav")
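The same setup-then-run pattern extends naturally to batches, since the model only needs to be set up once. A sketch that processes a folder of images using only the API calls shown above; the "inputs" directory name is a placeholder:

from pathlib import Path

import see2sound

# Set up once; the loaded model can be reused across many inputs.
model = see2sound.See2Sound(config_path="config.yaml")
model.setup()

# "inputs" is a placeholder directory of PNG images.
for image in sorted(Path("inputs").glob("*.png")):
    # Write one spatial-audio file per input image.
    model.run(path=str(image), output_path=str(image.with_suffix(".wav")))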

More Information

Feel free to take a look at the full documentation for extra information and tips on running the model.


Evaluation results

All metrics are computed on the SEE-2-SOUND Evaluation Dataset.

  • AViTAR Marginal Scene Guidance, Mel-Frequency Cepstral Coefficient Dynamic Time Warping: 0.03 × 10^-3
  • AViTAR Marginal Scene Guidance, Zero Crossing Rate: 0.950
  • Chroma Feature: 0.770
  • AViTAR Marginal Scene Guidance, Spectral Score: 0.950