Spaces: AIGC-Audio/Make-An-Audio-3

3v324v23 committed on Jun 14, 2024

Commit

a84a65c

1 Parent(s): 28cda0c

Add code

This view is limited to 50 files because it contains too many changes.

Files changed (50)

.gitattributes +1 -0
README copy.md +107 -0
app.py +199 -0
audiocaps_test_struct.tsv +3 -0
data/audiocaps_test_struct.tsv +0 -0
data/musiccaps_test_16000_struct.tsv +0 -0
infer.sh +20 -0
ldm/__pycache__/util.cpython-38.pyc +0 -0
ldm/__pycache__/util.cpython-39.pyc +0 -0
ldm/data/__pycache__/joinaudiodataset_anylen.cpython-38.pyc +0 -0
ldm/data/__pycache__/joinaudiodataset_struct_sample_anylen.cpython-38.pyc +0 -0
ldm/data/joinaudiodataset_anylen.py +330 -0
ldm/data/joinaudiodataset_struct_sample_anylen.py +380 -0
ldm/data/tsv_dirs/full_data/V1_new/audiocaps_train_16000.tsv +3 -0
ldm/data/tsv_dirs/full_data/V2/MACS.tsv +3 -0
ldm/data/tsv_dirs/full_data/V2/WavText5K.tsv +3 -0
ldm/data/tsv_dirs/full_data/V2/adobe.tsv +3 -0
ldm/data/tsv_dirs/full_data/V2/audiostock.tsv +3 -0
ldm/data/tsv_dirs/full_data/V2/epidemic_sound.tsv +3 -0
ldm/data/tsv_dirs/full_data/caps_struct/audiocaps_train_16000_struct2.tsv +3 -0
ldm/data/txt_spec_dataset.py +171 -0
ldm/data/video_spec_maa2_dataset.py +837 -0
ldm/lr_scheduler.py +98 -0
ldm/models/__pycache__/autoencoder.cpython-38.pyc +0 -0
ldm/models/__pycache__/autoencoder.cpython-39.pyc +0 -0
ldm/models/__pycache__/autoencoder1d.cpython-38.pyc +0 -0
ldm/models/autoencoder.py +503 -0
ldm/models/autoencoder1d.py +517 -0
ldm/models/diffusion/__init__.py +0 -0
ldm/models/diffusion/__pycache__/__init__.cpython-38.pyc +0 -0
ldm/models/diffusion/__pycache__/__init__.cpython-39.pyc +0 -0
ldm/models/diffusion/__pycache__/cfm1_audio.cpython-38.pyc +0 -0
ldm/models/diffusion/__pycache__/cfm1_audio.cpython-39.pyc +0 -0
ldm/models/diffusion/__pycache__/ddim.cpython-38.pyc +0 -0
ldm/models/diffusion/__pycache__/ddim.cpython-39.pyc +0 -0
ldm/models/diffusion/__pycache__/ddpm.cpython-38.pyc +0 -0
ldm/models/diffusion/__pycache__/ddpm.cpython-39.pyc +0 -0
ldm/models/diffusion/__pycache__/ddpm_audio.cpython-38.pyc +0 -0
ldm/models/diffusion/__pycache__/ddpm_audio.cpython-39.pyc +0 -0
ldm/models/diffusion/__pycache__/plms.cpython-38.pyc +0 -0
ldm/models/diffusion/__pycache__/plms.cpython-39.pyc +0 -0
ldm/models/diffusion/audioldm.py +818 -0
ldm/models/diffusion/cfm1_audio.py +312 -0
ldm/models/diffusion/cfm1_audio_sampler.py +105 -0
ldm/models/diffusion/classifier.py +267 -0
ldm/models/diffusion/ddim.py +262 -0
ldm/models/diffusion/ddpm.py +1461 -0
ldm/models/diffusion/ddpm_audio.py +865 -0
ldm/models/diffusion/plms.py +236 -0
ldm/models/diffusion/transport/__init__.py +73 -0

.gitattributes CHANGED

@@ -32,4 +32,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.xz filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
+*.tsv filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

README copy.md ADDED

	@@ -0,0 +1,107 @@

+# Make-An-Audio 3: Transforming Text into Audio via Flow-based Large Diffusion Transformers
+PyTorch Implementation of [Lumina-t2x](https://arxiv.org/abs/2405.05945)
+We will release our implementation and pretrained models as open source in this repository soon.
+[![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2305.18474)
+[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-blue)](https://huggingface.co/spaces/AIGC-Audio/Lumina-Audio)
+[![GitHub Stars](https://img.shields.io/github/stars/Text-to-Audio/Make-An-Audio-3?style=social)](https://github.com/Text-to-Audio/Make-An-Audio-3)
+## Use pretrained model
+We provide our implementation and pretrained models as open source in this repository.
+Visit our [demo page](https://make-an-audio-2.github.io/) for audio samples.
+## Quick Start
+### Pretrained Models
+Simply download the weights from [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-blue)](https://huggingface.co/Alpha-VLLM/Lumina-T2Music).
+- Text Encoder: [FLAN-T5-Large](https://huggingface.co/google/flan-t5-large)
+- VAE: Make-An-Audio 2, finetuned from [Make an Audio](https://github.com/Text-to-Audio/Make-An-Audio)
+- Decoder: [Vocoder](https://github.com/NVIDIA/BigVGAN)
+- `Music` Checkpoints: [huggingface](https://huggingface.co/Alpha-VLLM/Lumina-T2Music), `Audio` Checkpoints: [huggingface]()
+### Generate audio/music from text
+```
+python3 scripts/txt2audio_for_2cap_flow.py \
+--outdir output_dir -r  checkpoints_last.ckpt  -b configs/txt2audio-cfm1-cfg-LargeDiT3.yaml --scale 3.0 \
+--vocoder-ckpt useful_ckpts/bigvnat --test-dataset audiocaps
+```
+### Generate audio/music from audiocaps or musiccaps test dataset
+- Remember to change `config["test_dataset"]` accordingly.
+```
+python3 scripts/txt2audio_for_2cap_flow.py \
+--outdir output_dir -r  checkpoints_last.ckpt  -b configs/txt2audio-cfm1-cfg-LargeDiT3.yaml --scale 3.0 \
+--vocoder-ckpt useful_ckpts/bigvnat --test-dataset testset
+```
+### Generate audio/music from video
+```
+python3 scripts/video2audio_flow.py \
+--outdir output_dir -r  checkpoints_last.ckpt  -b configs/txt2audio-cfm1-cfg-LargeDiT3.yaml --scale 3.0 \
+--vocoder-ckpt useful_ckpts/bigvnat --test-dataset vggsound
+```
+## Train
+### Data preparation
+- We can't provide dataset download links due to copyright issues. We provide the processing code to generate melspecs, count audio durations, and generate structured captions.
+- Before training, we need to collect the dataset information into a tsv file with the following columns: name (id of each audio), dataset (which dataset the audio belongs to), audio_path (the path of the .wav file), caption (the caption of the audio), mel_path (the path of the processed melspec file of each audio), and duration (the duration of the audio). We provide a tsv file of the audiocaps test set, audiocaps_test_struct.tsv, as a sample.
+- We provide a tsv file of the audiocaps test set, ./audiocaps_test_16000_struct.tsv, as a sample.
+### Generate the melspec file of audio
+Assume you already have a tsv file that links each caption to its audio_path, which means the tsv file has "name", "audio_path", "dataset" and "caption" columns.
+To get the melspec of each audio, run the following command, which will save the mels in ./processed:
+```
+python preprocess/mel_spec.py --tsv_path tmp.tsv --num_gpus 1 --max_duration 10
+```
+### Count audio duration
+To count the duration of each audio and save the duration information in the tsv file, run the following command:
+```
+python preprocess/add_duration.py --tsv_path tmp.tsv
+```
+### Generate structured captions from the original natural language captions
+First, get an API key from OpenAI (https://openai.com/blog/openai-api); a tutorial is available here (https://www.maisieai.com/help/how-to-get-an-openai-api-key-for-chatgpt). Then set the variable openai_key in preprocess/n2s_by_openai.py to your key. Run the following command to add structured captions; the tsv file with structured captions will be saved as {tsv_file_name}_struct.tsv:
+```
+python preprocess/n2s_by_openai.py --tsv_path tmp.tsv
+```
+### Place tsv files
+After generating the structured captions, put the tsv files with structured captions in ./data/main_spec_dir and the tsv files without structured captions in ./data/no_struct_dir.
+Then set data.params.main_spec_dir and data.params.main_spec_dir.other_spec_dir_path accordingly in the config file configs/text2audio-ConcatDiT-ae1dnat_Skl20d2_struct2MLPanylen.yaml.
+## Train variational autoencoder
+Assume we have processed several datasets and saved the .tsv files as tsv_dir/*.tsv. Replace data.params.spec_dir_path with tsv_dir in the config file. Then we can train the VAE with the following command. If your machine does not have 8 GPUs, change the --gpus argument to list the GPUs you have (--gpus 0,1,...,gpu_nums).
+```
+python main.py --base configs/research/autoencoder/autoencoder1d_kl20_natbig_r1_down2_disc2.yaml -t --gpus 0,1,2,3,4,5,6,7
+```
+## Train latent diffusion
+After training the VAE, replace model.params.first_stage_config.params.ckpt_path with your trained VAE checkpoint path in the config file.
+Run the following command to train the diffusion model:
+```
+python main.py --base configs/research/text2audio/text2audio-ConcatDiT-ae1dnat_Skl20d2_freezeFlananylen_drop.yaml -t  --gpus 0,1,2,3,4,5,6,7
+```
+## Evaluation
+Please refer to [Make-An-Audio](https://github.com/Text-to-Audio/Make-An-Audio?tab=readme-ov-file#evaluation)
+## Acknowledgements
+This implementation uses parts of the code from the following GitHub repos:
+[Make-An-Audio](https://github.com/Text-to-Audio/Make-An-Audio),
+[AudioLCM](https://github.com/Text-to-Audio/AudioLCM),
+[CLAP](https://github.com/LAION-AI/CLAP),
+as described in our code.
+## Citations ##
+If you find this code useful in your research, please consider citing:
+```bibtex
+```
+## Disclaimer ##
+Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.
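
The tsv manifest described in the README's data preparation section (one tab-separated row per audio clip with `name`, `dataset`, `audio_path`, `caption`, `mel_path` and `duration` columns) can be sketched as follows. This is a minimal illustration, not the repository's preprocessing code: the `write_manifest` helper, the example paths and the caption are hypothetical, and `duration` is assumed to be in seconds.

```python
import csv
import io

# Column names as listed in the README's data preparation section.
MANIFEST_COLUMNS = ["name", "dataset", "audio_path", "caption", "mel_path", "duration"]


def write_manifest(rows, handle):
    """Write rows as a tab-separated manifest (hypothetical helper)."""
    writer = csv.DictWriter(handle, fieldnames=MANIFEST_COLUMNS, delimiter="\t")
    writer.writeheader()
    for row in rows:
        missing = [c for c in MANIFEST_COLUMNS if c not in row]
        if missing:
            raise ValueError(f"missing manifest columns: {missing}")
        writer.writerow(row)


# Illustrative row only; the paths and caption are made up.
rows = [
    {
        "name": "clip_0001",
        "dataset": "audiocaps",
        "audio_path": "wavs/clip_0001.wav",
        "caption": "A dog barks twice",
        "mel_path": "processed/clip_0001_mel.npy",
        "duration": 10.0,  # assumed to be seconds
    }
]

buffer = io.StringIO()
write_manifest(rows, buffer)

# Read the manifest back to check the header and the row round-trip.
buffer.seek(0)
reader = csv.DictReader(buffer, delimiter="\t")
reloaded = list(reader)
header = list(reader.fieldnames)
```

The dataset classes in this commit load such files with `pd.read_csv(manifest, sep='\t')`, so any writer that produces a tab-separated file with a header row should be compatible.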

app.py ADDED

	@@ -0,0 +1,199 @@

+import spaces
+import argparse, os, sys, glob
+import pathlib
+directory = pathlib.Path(os.getcwd())
+print(directory)
+sys.path.append(str(directory))
+import torch
+import numpy as np
+from omegaconf import OmegaConf
+from ldm.util import instantiate_from_config
+from ldm.models.diffusion.ddim import DDIMSampler
+from ldm.models.diffusion.plms import PLMSSampler
+import pandas as pd
+from tqdm import tqdm
+import preprocess.n2s_by_openai as n2s
+from vocoder.bigvgan.models import VocoderBigVGAN
+import soundfile
+import torchaudio, math
+import gradio
+import gradio as gr
+def load_model_from_config(config, ckpt = None, verbose=True):
+    model = instantiate_from_config(config.model)
+    if ckpt:
+        print(f"Loading model from {ckpt}")
+        pl_sd = torch.load(ckpt, map_location="cpu")
+        sd = pl_sd["state_dict"]
+        m, u = model.load_state_dict(sd, strict=False)
+        if len(m) > 0 and verbose:
+            print("missing keys:")
+            print(m)
+        if len(u) > 0 and verbose:
+            print("unexpected keys:")
+            print(u)
+    else:
+        print("Note that no ckpt is loaded !!!")
+    model.cuda()
+    model.eval()
+    return model
+class GenSamples:
+    def __init__(self,opt, model,outpath,config, vocoder = None,save_mel = True,save_wav = True) -> None:
+        self.opt = opt
+        self.model = model
+        self.outpath = outpath
+        if save_wav:
+            assert vocoder is not None
+            self.vocoder = vocoder
+        self.save_mel = save_mel
+        self.save_wav = save_wav
+        self.channel_dim = self.model.channels
+        self.config = config
+    def gen_test_sample(self,prompt, mel_name = None,wav_name = None, gt=None, video=None):# prompt is {'ori_caption':'xxx','struct_caption':'xxx'}
+        uc = None
+        record_dicts = []
+        if self.opt['scale'] != 1.0:
+            try: # audiocaps
+                uc = self.model.get_learned_conditioning({'ori_caption': "",'struct_caption': ""})
+            except: # audioset
+                uc = self.model.get_learned_conditioning(prompt['ori_caption'])
+        for n in range(self.opt['n_iter']):
+            try: # audiocaps
+                c = self.model.get_learned_conditioning(prompt) # shape: [1,77,1280], i.e. not yet a sentence embedding, still one embedding per token
+            except: # audioset
+                c = self.model.get_learned_conditioning(prompt['ori_caption'])
+            if self.channel_dim>0:
+                shape = [self.channel_dim, self.opt['H'], self.opt['W']]  # (z_dim, 80//2^x, 848//2^x)
+            else:
+                shape = [1, self.opt['H'], self.opt['W']]
+            x0 = torch.randn(shape, device=self.model.device)
+            if self.opt['scale'] == 1: # w/o cfg
+                sample, _ = self.model.sample(c, 1, timesteps=self.opt['ddim_steps'], x_latent=x0)
+            else:  # cfg
+                sample, _ = self.model.sample_cfg(c, self.opt['scale'], uc, 1, timesteps=self.opt['ddim_steps'], x_latent=x0)
+            x_samples_ddim = self.model.decode_first_stage(sample)
+            for idx,spec in enumerate(x_samples_ddim):
+                spec = spec.squeeze(0).cpu().numpy()
+                print(spec[0])
+                record_dict = {'caption':prompt['ori_caption'][0]}
+                if self.save_mel:
+                    mel_path = os.path.join(self.outpath,mel_name+f'_{idx}.npy')
+                    np.save(mel_path,spec)
+                    record_dict['mel_path'] = mel_path
+                if self.save_wav:
+                    wav = self.vocoder.vocode(spec)
+                    wav_path = os.path.join(self.outpath,wav_name+f'_{idx}.wav')
+                    soundfile.write(wav_path, wav, self.opt['sample_rate'])
+                    record_dict['audio_path'] = wav_path
+                record_dicts.append(record_dict)
+        return record_dicts
+@spaces.GPU(enable_queue=True)
+def infer(ori_prompt, ddim_steps, scale, seed):
+    # np.random.seed(seed)
+    # torch.manual_seed(seed)
+    prompt = dict(ori_caption=ori_prompt,struct_caption=f'<{ori_prompt}& all>')
+    opt = {
+        'sample_rate': 16000,
+        'outdir': 'outputs/txt2music-samples',
+        'ddim_steps': ddim_steps,
+        'n_iter': 1,
+        'H': 20,
+        'W': 312,
+        'scale': scale,
+        'resume': 'useful_ckpts/music_generation/119.ckpt',
+        'base': 'configs/txt2music-cfm1-cfg-LargeDiT3.yaml',
+        'vocoder_ckpt': 'useful_ckpts/bigvnat',
+    }
+    config = OmegaConf.load(opt['base'])
+    model = load_model_from_config(config, opt['resume'])
+    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
+    model = model.to(device)
+    os.makedirs(opt['outdir'], exist_ok=True)
+    vocoder = VocoderBigVGAN(opt['vocoder_ckpt'],device)
+    generator = GenSamples(opt, model,opt['outdir'],config, vocoder,save_mel=False,save_wav=True)
+    with torch.no_grad():
+        with model.ema_scope():
+            wav_name = f'{prompt["ori_caption"].strip().replace(" ", "-")}'
+            generator.gen_test_sample(prompt,wav_name=wav_name)
+    file_path = os.path.join(opt['outdir'],wav_name+'_0.wav')
+    print(f"Your samples are ready and waiting for you here: \n{file_path} \nEnjoy.")
+    return file_path
+def my_inference_function(text_prompt, ddim_steps, scale, seed):
+    file_path = infer(text_prompt, ddim_steps, scale, seed)
+    return file_path
+with gr.Blocks() as demo:
+    with gr.Row():
+        gr.Markdown("## Make-An-Audio 3: Transforming Text into Audio via Flow-based Large Diffusion Transformers")
+    with gr.Row():
+        with gr.Column():
+            prompt = gr.Textbox(label="Prompt: Input your text here.        ")
+            run_button = gr.Button()
+            with gr.Accordion("Advanced options", open=False):
+                ddim_steps = gr.Slider(label="ddim_steps", minimum=1,
+                                       maximum=50, value=25, step=1)
+                scale = gr.Slider(
+                    label="Guidance Scale:(Large => more relevant to text but the quality may drop)", minimum=0.1, maximum=8.0, value=3.0, step=0.1
+                )
+                seed = gr.Slider(
+                    label="Seed:Change this value (any integer number) will lead to a different generation result.",
+                    minimum=0,
+                    maximum=2147483647,
+                    step=1,
+                    value=44,
+                )
+        with gr.Column():
+            outaudio = gr.Audio()
+    run_button.click(fn=my_inference_function, inputs=[
+                    prompt, ddim_steps, scale, seed], outputs=[outaudio])
+    with gr.Row():
+        with gr.Column():
+            gr.Examples(
+                        examples = [['An amateur recording features a steel drum playing in a higher register',25,5,55],
+                                    ['An instrumental song with a caribbean feel, happy mood, and featuring steel pan music, programmed percussion, and bass',25,5,55],
+                                    ['This musical piece features a playful and emotionally melodic male vocal accompanied by piano',25,5,55],
+                                    ['A eerie yet calming experimental electronic track featuring haunting synthesizer strings and pads',25,5,55],
+                                    ['A slow tempo pop instrumental piece featuring only acoustic guitar with fingerstyle and percussive strumming techniques',25,5,55]],
+                        inputs = [prompt, ddim_steps, scale, seed],
+                        outputs = [outaudio]
+                        )
+        with gr.Column():
+            pass
+demo.launch()
+# gradio_interface = gradio.Interface(
+#     fn = my_inference_function,
+#     inputs = "text",
+#     outputs = "audio"
+# )
+# gradio_interface.launch()
+# text_prompt = 'An amateur recording features a steel drum playing in a higher register'
+# # text_prompt = 'A slow tempo pop instrumental piece featuring only acoustic guitar with fingerstyle and percussive strumming techniques'
+# ddim_steps=25
+# scale=5.0
+# seed=55
+# my_inference_function(text_prompt, ddim_steps, scale, seed)
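
In `gen_test_sample`, a `scale` of 1 uses `self.model.sample` directly, while any other value passes the unconditional embedding `uc` to `self.model.sample_cfg`. The implementation of `sample_cfg` is not part of this view, so the sketch below only illustrates the usual classifier-free guidance rule under the assumption that the model follows it. The `guided_prediction` function and the toy values are hypothetical.

```python
def guided_prediction(cond, uncond, scale):
    """Blend conditional and unconditional predictions element-wise.

    Standard classifier-free guidance: start from the unconditional
    prediction and move `scale` times along the direction towards the
    conditional one.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy predictions standing in for the model outputs at one sampling step.
cond = [1.0, 2.0]
uncond = [0.0, 1.0]

no_guidance = guided_prediction(cond, uncond, 1.0)  # equals the conditional prediction
guided = guided_prediction(cond, uncond, 3.0)       # pushed further towards the prompt
```

With `scale = 1` the blend reduces to the conditional prediction, which matches the app skipping the guided sampler in that case.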

audiocaps_test_struct.tsv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:36d5f93b134ee6ed8c7e75adffca2e0a378fb683e67836abd78b50153659858b
+size 1306277

data/audiocaps_test_struct.tsv CHANGED

The diff for this file is too large to render.

data/musiccaps_test_16000_struct.tsv CHANGED

The diff for this file is too large to render.

infer.sh ADDED

	@@ -0,0 +1,20 @@

+# music prompt generation
+python3 scripts/txt2audio_for_2cap_flow.py \
+--outdir output_dir_text -r useful_ckpts/music_generation/119.ckpt  -b configs/txt2music-cfm1-cfg-LargeDiT3.yaml --scale 3.0 \
+--vocoder-ckpt useful_ckpts/bigvnat
+# music test dataset generation
+python3 scripts/txt2audio_for_2cap_flow.py \
+--outdir results/music/dataset -r useful_ckpts/music_generation/119.ckpt  -b configs/txt2music-cfm1-cfg-LargeDiT3.yaml --scale 3.0 \
+--vocoder-ckpt useful_ckpts/bigvnat --test-dataset testset
+# audio prompt generation
+python3 scripts/txt2audio_for_2cap_flow.py \
+--prompt 'A train running on a railroad track followed by a vehicle door closing and a man talking in the distance while a train horn honks and railroad crossing warning signals ring' \
+--outdir results/auido/text -r useful_ckpts/audio_generation/324.ckpt  -b configs/txt2audio-cfm1-cfg-LargeDiT3.yaml --scale 3.0 \
+--vocoder-ckpt useful_ckpts/bigvnat
+# audio test dataset generation
+python3 scripts/txt2audio_for_2cap_flow.py \
+--outdir results/auido/dataset -r useful_ckpts/audio_generation/324.ckpt  -b configs/txt2audio-cfm1-cfg-LargeDiT3.yaml --scale 3.0 \
+--vocoder-ckpt useful_ckpts/bigvnat --test-dataset testset

ldm/__pycache__/util.cpython-38.pyc ADDED

Binary file (5.1 kB).

ldm/__pycache__/util.cpython-39.pyc ADDED

Binary file (5.16 kB).

ldm/data/__pycache__/joinaudiodataset_anylen.cpython-38.pyc ADDED

Binary file (12.1 kB).

ldm/data/__pycache__/joinaudiodataset_struct_sample_anylen.cpython-38.pyc ADDED

Binary file (11.6 kB).

ldm/data/joinaudiodataset_anylen.py ADDED

	@@ -0,0 +1,330 @@

+import os
+import sys
+import math
+import numpy as np
+import torch
+from torch.utils.data.sampler import Sampler
+from torch.utils.data.distributed import DistributedSampler
+import torch.distributed
+from typing import TypeVar, Optional, Iterator,List
+import logging
+import pandas as pd
+import glob
+import torch.distributed as dist
+logger = logging.getLogger(f'main.{__name__}')
+sys.path.insert(0, '.')  # nopep8
+class JoinManifestSpecs(torch.utils.data.Dataset):
+    def __init__(self, split, spec_dir_path, mel_num=80,spec_crop_len=1248,mode='pad',pad_value=-5,drop=0,**kwargs):
+        super().__init__()
+        self.split = split
+        self.max_batch_len = spec_crop_len
+        self.min_batch_len = 64
+        self.mel_num = mel_num
+        self.min_factor = 4
+        self.drop = drop
+        self.pad_value = pad_value
+        assert mode in ['pad','tile']
+        self.collate_mode = mode
+        # print(f"################# self.collate_mode {self.collate_mode} ##################")
+        manifest_files = []
+        for dir_path in spec_dir_path.split(','):
+            manifest_files += glob.glob(f'{dir_path}/*.tsv')
+        df_list = [pd.read_csv(manifest,sep='\t') for manifest in manifest_files]
+        df = pd.concat(df_list,ignore_index=True)
+        if split == 'train':
+            self.dataset = df.iloc[100:]
+        elif split == 'valid' or split == 'val':
+            self.dataset = df.iloc[:100]
+        elif split == 'test':
+            df = self.add_name_num(df)
+            self.dataset = df
+        else:
+            raise ValueError(f'Unknown split {split}')
+        self.dataset.reset_index(inplace=True)
+        print('dataset len:', len(self.dataset))
+    def add_name_num(self,df):
+        """each file may have different caption, we add num to filename to identify each audio-caption pair"""
+        name_count_dict = {}
+        change = []
+        for t in df.itertuples():
+            name = getattr(t,'name')
+            if name in name_count_dict:
+                name_count_dict[name] += 1
+            else:
+                name_count_dict[name] = 0
+            change.append((t[0],name_count_dict[name]))
+        for t in change:
+            df.loc[t[0],'name'] = df.loc[t[0],'name'] + f'_{t[1]}'
+        return df
+    def ordered_indices(self):
+        index2dur = self.dataset[['duration']]
+        index2dur = index2dur.sort_values(by='duration')
+        return list(index2dur.index)
+    def __getitem__(self, idx):
+        item = {}
+        data = self.dataset.iloc[idx]
+        try:
+            spec = np.load(data['mel_path']) # mel spec [80, 624]
+        except:
+            mel_path = data['mel_path']
+            print(f'corrupted:{mel_path}')
+            spec = np.ones((self.mel_num,self.min_batch_len)).astype(np.float32)*self.pad_value
+        item['image'] = spec
+        p = np.random.uniform(0,1)
+        if p > self.drop:
+            item["caption"] = data['caption']
+        else:
+            item["caption"] = ""
+        if self.split == 'test':
+            item['f_name'] = data['name']
+        # item['f_name'] = data['mel_path']
+        return item
+    def collater(self,inputs):
+        to_dict = {}
+        for l in inputs:
+            for k,v in l.items():
+                if k in to_dict:
+                    to_dict[k].append(v)
+                else:
+                    to_dict[k] = [v]
+        if self.collate_mode == 'pad':
+            to_dict['image'] = collate_1d_or_2d(to_dict['image'],pad_idx=self.pad_value,min_len = self.min_batch_len,max_len=self.max_batch_len,min_factor=self.min_factor)
+        elif self.collate_mode == 'tile':
+            to_dict['image'] = collate_1d_or_2d_tile(to_dict['image'],min_len = self.min_batch_len,max_len=self.max_batch_len,min_factor=self.min_factor)
+        else:
+            raise NotImplementedError
+        return to_dict
+    def __len__(self):
+        return len(self.dataset)
+class JoinSpecsTrain(JoinManifestSpecs):
+    def __init__(self, specs_dataset_cfg):
+        super().__init__('train', **specs_dataset_cfg)
+class JoinSpecsValidation(JoinManifestSpecs):
+    def __init__(self, specs_dataset_cfg):
+        super().__init__('valid', **specs_dataset_cfg)
+class JoinSpecsTest(JoinManifestSpecs):
+    def __init__(self, specs_dataset_cfg):
+        super().__init__('test', **specs_dataset_cfg)
+class JoinSpecsDebug(JoinManifestSpecs):
+    def __init__(self, specs_dataset_cfg):
+        super().__init__('valid', **specs_dataset_cfg)
+        self.dataset = self.dataset.iloc[:37]
+class DDPIndexBatchSampler(Sampler):# group the indices of audios with similar lengths into the same batch to avoid excessive padding
+    def __init__(self, indices ,batch_size, num_replicas: Optional[int] = None,
+                 rank: Optional[int] = None, shuffle: bool = True,
+                 seed: int = 0, drop_last: bool = False) -> None:
+        if num_replicas is None:
+            if not dist.is_initialized():
+                # raise RuntimeError("Requires distributed package to be available")
+                print("Not in distributed mode")
+                num_replicas = 1
+            else:
+                num_replicas = dist.get_world_size()
+        if rank is None:
+            if not dist.is_initialized():
+                # raise RuntimeError("Requires distributed package to be available")
+                rank = 0
+            else:
+                rank = dist.get_rank()
+        if rank >= num_replicas or rank < 0:
+            raise ValueError(
+                "Invalid rank {}, rank should be in the interval"
+                " [0, {}]".format(rank, num_replicas - 1))
+        self.indices = indices
+        self.num_replicas = num_replicas
+        self.rank = rank
+        self.epoch = 0
+        self.drop_last = drop_last
+        self.batch_size = batch_size
+        self.batches = self.build_batches()
+        print(f"rank: {self.rank}, batches_num {len(self.batches)}")
+        # If the dataset length is evenly divisible by replicas, then there
+        # is no need to drop any data, since the dataset will be split equally.
+        if self.drop_last and len(self.batches) % self.num_replicas != 0:
+            self.batches = self.batches[:len(self.batches)//self.num_replicas*self.num_replicas]
+        if len(self.batches) > self.num_replicas:
+            self.batches = self.batches[self.rank::self.num_replicas]
+        else: # may happen in sanity checking
+            self.batches = [self.batches[0]]
+        print(f"after split batches_num {len(self.batches)}")
+        self.shuffle = shuffle
+        if self.shuffle:
+            self.batches = np.random.permutation(self.batches)
+        self.seed = seed
+    def build_batches(self):
+        batches,batch = [],[]
+        for index in self.indices:
+            batch.append(index)
+            if len(batch) == self.batch_size:
+                batches.append(batch)
+                batch = []
+        if not self.drop_last and len(batch) > 0:
+            batches.append(batch)
+        return batches
+    def __iter__(self) -> Iterator[List[int]]:
+        for batch in self.batches:
+            yield batch
+    def __len__(self) -> int:
+        return len(self.batches)
+    def set_epoch(self, epoch: int) -> None:
+        r"""
+        Sets the epoch for this sampler. When :attr:`shuffle=True`, this ensures all replicas
+        use a different random ordering for each epoch. Otherwise, the next iteration of this
+        sampler will yield the same ordering.
+        Args:
+            epoch (int): Epoch number.
+        """
+        self.epoch = epoch
+def collate_1d_or_2d(values, pad_idx=0, left_pad=False, shift_right=False,min_len = None, max_len=None,min_factor=None, shift_id=1):
+    if len(values[0].shape) == 1:
+        return collate_1d(values, pad_idx, left_pad, shift_right,min_len, max_len,min_factor, shift_id)
+    else:
+        return collate_2d(values, pad_idx, left_pad, shift_right,min_len,max_len,min_factor)
+def collate_1d(values, pad_idx=0, left_pad=False, shift_right=False,min_len=None, max_len=None,min_factor=None, shift_id=1):
+    """Convert a list of 1d tensors into a padded 2d tensor."""
+    size = max(v.size(0) for v in values)
+    if max_len:
+        size = min(size,max_len)
+    if min_len:
+        size = max(size,min_len)
+    if min_factor and (size % min_factor!=0):# size must be the multiple of min_factor
+        size += (min_factor - size % min_factor)
+    res = values[0].new(len(values), size).fill_(pad_idx)
+    def copy_tensor(src, dst):
+        assert dst.numel() == src.numel(), f"dst shape:{dst.shape} src shape:{src.shape}"
+        if shift_right:
+            dst[1:] = src[:-1]
+            dst[0] = shift_id
+        else:
+            dst.copy_(src)
+    for i, v in enumerate(values):
+        copy_tensor(v, res[i][size - len(v):] if left_pad else res[i][:len(v)])
+    return res
+def collate_2d(values, pad_idx=0, left_pad=False, shift_right=False, min_len=None,max_len=None,min_factor=None):
+    """Collate 2d for melspec,Convert a list of 2d tensors into a padded 3d tensor,pad in mel_length dimension.
+        values[0] shape: (melbins,mel_length)
+    """
+    size = max(v.shape[1] for v in values) # if max_len is None else max_len
+    if max_len:
+        size = min(size,max_len)
+    if min_len:
+        size = max(size,min_len)
+    if min_factor and (size % min_factor!=0):# size must be the multiple of min_factor
+        size += (min_factor - size % min_factor)
+    if isinstance(values,np.ndarray):
+        values = torch.FloatTensor(values)
+    if isinstance(values,list):
+        values = [torch.FloatTensor(v) for v in values]
+    res = torch.ones(len(values), values[0].shape[0],size).to(dtype=torch.float32)*pad_idx
+    def copy_tensor(src, dst):
+        assert dst.numel() == src.numel(), f"dst shape:{dst.shape} src shape:{src.shape}"
+        if shift_right:
+            dst[1:] = src[:-1]
+        else:
+            dst.copy_(src)
+    for i, v in enumerate(values):
+        copy_tensor(v[:,:size], res[i][:,size - v.shape[1]:] if left_pad else res[i][:,:v.shape[1]])
+    return res
+def collate_1d_or_2d_tile(values, shift_right=False,min_len = None, max_len=None,min_factor=None, shift_id=1):
+    if len(values[0].shape) == 1:
+        return collate_1d_tile(values, shift_right,min_len, max_len,min_factor, shift_id)
+    else:
+        return collate_2d_tile(values, shift_right,min_len,max_len,min_factor)
+def collate_1d_tile(values, shift_right=False,min_len=None, max_len=None,min_factor=None,shift_id=1):
+    """Convert a list of 1d tensors into a padded 2d tensor."""
+    size = max(v.size(0) for v in values)
+    if max_len:
+        size = min(size,max_len)
+    if min_len:
+        size = max(size,min_len)
+    if min_factor and (size%min_factor!=0):# size must be the multiple of min_factor
+        size += (min_factor - size % min_factor)
+    res = values[0].new(len(values), size)
+    def copy_tensor(src, dst):
+        assert dst.numel() == src.numel(), f"dst shape:{dst.shape} src shape:{src.shape}"
+        if shift_right:
+            dst[1:] = src[:-1]
+            dst[0] = shift_id
+        else:
+            dst.copy_(src)
+    for i, v in enumerate(values):
+        n_repeat = math.ceil((size + 1) / v.shape[0])
+        v = torch.tile(v,dims=(n_repeat,))[:size]  # repeat along the only dimension of the 1d tensor, then crop
+        copy_tensor(v, res[i])
+    return res
+def collate_2d_tile(values, shift_right=False, min_len=None,max_len=None,min_factor=None):
+    """Collate 2d for melspec,Convert a list of 2d tensors into a padded 3d tensor,pad in mel_length dimension. """
+    size = max(v.shape[1] for v in values) # if max_len is None else max_len
+    if max_len:
+        size = min(size,max_len)
+    if min_len:
+        size = max(size,min_len)
+    if min_factor and (size % min_factor!=0):# size must be the multiple of min_factor
+        size += (min_factor - size % min_factor)
+    if isinstance(values,np.ndarray):
+        values = torch.FloatTensor(values)
+    if isinstance(values,list):
+        values = [torch.FloatTensor(v) for v in values]
+    res = torch.zeros(len(values), values[0].shape[0],size).to(dtype=torch.float32)
+    def copy_tensor(src, dst):
+        assert dst.numel() == src.numel()
+        if shift_right:
+            dst[1:] = src[:-1]
+        else:
+            dst.copy_(src)
+    for i, v in enumerate(values):
+        n_repeat = math.ceil((size + 1) / v.shape[1])
+        v = torch.tile(v,dims=(1,n_repeat))[:,:size]
+        copy_tensor(v, res[i])
+    return res
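
The tile collation above can be sketched outside the diff as follows. This is a minimal NumPy illustration rather than the repository's code: `batch_length` and `tile_collate` are hypothetical names, and NumPy stands in for torch so the example runs on its own. It shows the two steps the functions share: pick a batch length (longest item, clamped by `max_len` and `min_len`, rounded up to a multiple of `min_factor`), then repeat each spectrogram along time until it fills that length.

```python
import math
import numpy as np

def batch_length(lengths, min_len=None, max_len=None, min_factor=None):
    """Longest item length, clamped by max_len/min_len, rounded up to a multiple of min_factor."""
    size = max(lengths)
    if max_len:
        size = min(size, max_len)
    if min_len:
        size = max(size, min_len)
    if min_factor and size % min_factor != 0:
        size += min_factor - size % min_factor
    return size

def tile_collate(specs, min_len=None, max_len=None, min_factor=None):
    """Stack [n_mels, T_i] arrays into [B, n_mels, T] by repeating each one along time."""
    size = batch_length([s.shape[1] for s in specs], min_len, max_len, min_factor)
    out = np.zeros((len(specs), specs[0].shape[0], size), dtype=np.float32)
    for i, s in enumerate(specs):
        n_repeat = math.ceil(size / s.shape[1])       # enough copies to cover `size`
        out[i] = np.tile(s, (1, n_repeat))[:, :size]  # repeat along time, then crop
    return out

a = np.arange(6, dtype=np.float32).reshape(2, 3)  # T = 3
b = np.ones((2, 5), dtype=np.float32)             # T = 5
batch = tile_collate([a, b], min_factor=4)
print(batch.shape)  # (2, 2, 8)
```

Here the longest item has 5 frames, which is rounded up to 8 so that the length is divisible by `min_factor=4`; the 3-frame item is repeated three times and cropped to 8 frames.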

ldm/data/joinaudiodataset_struct_sample_anylen.py ADDED

	@@ -0,0 +1,380 @@

+import os
+import sys
+import numpy as np
+import torch
+from typing import TypeVar, Optional, Iterator, List
+import logging
+import pandas as pd
+from ldm.data.joinaudiodataset_anylen import *
+import glob
+logger = logging.getLogger(f'main.{__name__}')
+sys.path.insert(0, '.')  # nopep8
+class JoinManifestSpecs(torch.utils.data.Dataset):
+    def __init__(self, split, main_spec_dir_path,other_spec_dir_path, mel_num=80,mode='pad', spec_crop_len=1248,pad_value=-5,drop=0,**kwargs):
+        super().__init__()
+        self.split = split
+        self.max_batch_len = spec_crop_len
+        self.min_batch_len = 64
+        self.min_factor = 4
+        self.mel_num = mel_num
+        self.drop = drop
+        self.pad_value = pad_value
+        assert mode in ['pad','tile']
+        self.collate_mode = mode
+        manifest_files = []
+        for dir_path in main_spec_dir_path.split(','):
+            manifest_files += glob.glob(f'{dir_path}/*.tsv')
+        df_list = [pd.read_csv(manifest,sep='\t') for manifest in manifest_files]
+        self.df_main = pd.concat(df_list,ignore_index=True)
+        # manifest_files = []
+        # for dir_path in other_spec_dir_path.split(','):
+        #     manifest_files += glob.glob(f'{dir_path}/*.tsv')
+        # df_list = [pd.read_csv(manifest,sep='\t') for manifest in manifest_files]
+        # self.df_other = pd.concat(df_list,ignore_index=True)
+        # self.df_other.reset_index(inplace=True)
+        if split == 'train':
+            self.dataset = self.df_main.iloc[100:]
+        elif split == 'valid' or split == 'val':
+            self.dataset = self.df_main.iloc[:100]
+        elif split == 'test':
+            self.df_main = self.add_name_num(self.df_main)
+            self.dataset = self.df_main
+        else:
+            raise ValueError(f'Unknown split {split}')
+        self.dataset.reset_index(inplace=True)
+        print('dataset len:', len(self.dataset),"drop_rate",self.drop)
+    def add_name_num(self,df):
+        """A file may have several captions, so append a running number to each file name to uniquely identify every audio-caption pair."""
+        name_count_dict = {}
+        change = []
+        for t in df.itertuples():
+            name = getattr(t,'name')
+            if name in name_count_dict:
+                name_count_dict[name] += 1
+            else:
+                name_count_dict[name] = 0
+            change.append((t[0],name_count_dict[name]))
+        for t in change:
+            df.loc[t[0],'name'] = str(df.loc[t[0],'name']) + f'_{t[1]}'
+        return df
+    def ordered_indices(self):
+        index2dur = self.dataset[['duration']].sort_values(by='duration')
+        # index2dur_other = self.df_other[['duration']].sort_values(by='duration')
+        # other_indices = list(index2dur_other.index)
+        offset = len(self.dataset)
+        # other_indices = [x + offset for x in other_indices]
+        return list(index2dur.index) # ,other_indices
+    def collater(self,inputs):
+        to_dict = {}
+        for l in inputs:
+            for k,v in l.items():
+                if k in to_dict:
+                    to_dict[k].append(v)
+                else:
+                    to_dict[k] = [v]
+        if self.collate_mode == 'pad':
+            to_dict['image'] = collate_1d_or_2d(to_dict['image'],pad_idx=self.pad_value,min_len = self.min_batch_len,max_len=self.max_batch_len,min_factor=self.min_factor)
+        elif self.collate_mode == 'tile':
+            to_dict['image'] = collate_1d_or_2d_tile(to_dict['image'],min_len = self.min_batch_len,max_len=self.max_batch_len,min_factor=self.min_factor)
+        else:
+            raise NotImplementedError
+        to_dict['caption'] = {'ori_caption':[c['ori_caption'] for c in to_dict['caption']],
+                              'struct_caption':[c['struct_caption'] for c in to_dict['caption']]}
+        return to_dict
+    def __getitem__(self, idx):
+        # if idx < len(self.dataset):
+        data = self.dataset.iloc[idx]
+        p = np.random.uniform(0,1)
+        if p > self.drop:
+            ori_caption = data['ori_cap']
+            struct_caption = data['caption']
+        else:
+            ori_caption = ""
+            struct_caption = ""
+            # else:
+            #     data = self.df_other.iloc[idx-len(self.dataset)]
+            #     p = np.random.uniform(0,1)
+            #     if p > self.drop:
+            #         ori_caption = data['caption']
+            #         struct_caption = f'<{ori_caption}& all>'
+            #     else:
+            #         ori_caption = ""
+            #         struct_caption = ""
+        item = {}
+        try:
+            if not os.path.exists(data['mel_path']):
+                mel_path = data['mel_path'].replace('/apdcephfs', '/apdcephfs_intern')
+            else:
+                mel_path = data['mel_path']
+            spec = np.load(mel_path)  # mel spec [80, T]
+            if spec.shape[1] > self.max_batch_len:
+                spec = spec[:, :self.max_batch_len]
+        except Exception:
+            mel_path = data['mel_path']
+            print(f'corrupted:{mel_path}')
+            spec = np.ones((self.mel_num,self.min_batch_len)).astype(np.float32)*self.pad_value
+        item['image'] = spec
+        item["caption"] = {"ori_caption":ori_caption,"struct_caption":struct_caption}
+        if self.split == 'test':
+            item['f_name'] = data['name']
+        return item
+    def __len__(self):
+        return len(self.dataset) # + len(self.df_other)
+class JoinSpecsTrain(JoinManifestSpecs):
+    def __init__(self, specs_dataset_cfg):
+        super().__init__('train', **specs_dataset_cfg)
+class JoinSpecsValidation(JoinManifestSpecs):
+    def __init__(self, specs_dataset_cfg):
+        super().__init__('valid', **specs_dataset_cfg)
+class JoinSpecsTest(JoinManifestSpecs):
+    def __init__(self, specs_dataset_cfg):
+        super().__init__('test', **specs_dataset_cfg)
+class TestManifest(torch.utils.data.Dataset):
+    def __init__(self, manifest, mel_num=80, mode='pad', spec_crop_len=1248, pad_value=-5, **kwargs):
+        super().__init__()
+        self.max_batch_len = spec_crop_len
+        self.min_batch_len = 64
+        self.min_factor = 4
+        self.mel_num = mel_num
+        self.pad_value = pad_value
+        assert mode in ['pad', 'tile']
+        self.collate_mode = mode
+        df_list = pd.read_csv(manifest, sep='\t')
+        self.df_main = pd.concat([df_list], ignore_index=True)
+        self.df_main = self.add_name_num(self.df_main)
+        self.dataset = self.df_main
+        self.dataset.reset_index(inplace=True)
+        print('dataset len:', len(self.dataset))
+    def add_name_num(self, df):
+        """A file may have several captions, so append a running number to each file name to uniquely identify every audio-caption pair."""
+        name_count_dict = {}
+        change = []
+        for t in df.itertuples():
+            name = getattr(t, 'name')
+            if name in name_count_dict:
+                name_count_dict[name] += 1
+            else:
+                name_count_dict[name] = 0
+            change.append((t[0], name_count_dict[name]))
+        for t in change:
+            df.loc[t[0], 'name'] = str(df.loc[t[0], 'name']) + f'_{t[1]}'
+        return df
+    def ordered_indices(self):
+        index2dur = self.dataset[['duration']].sort_values(by='duration')
+        return list(index2dur.index)  # ,other_indices
+    def collater(self, inputs):
+        to_dict = {}
+        for l in inputs:
+            for k, v in l.items():
+                if k in to_dict:
+                    to_dict[k].append(v)
+                else:
+                    to_dict[k] = [v]
+        if self.collate_mode == 'pad':
+            to_dict['image'] = collate_1d_or_2d(to_dict['image'], pad_idx=self.pad_value, min_len=self.min_batch_len,
+                                                max_len=self.max_batch_len, min_factor=self.min_factor)
+        elif self.collate_mode == 'tile':
+            to_dict['image'] = collate_1d_or_2d_tile(to_dict['image'], min_len=self.min_batch_len,
+                                                     max_len=self.max_batch_len, min_factor=self.min_factor)
+        else:
+            raise NotImplementedError
+        to_dict['caption'] = {'ori_caption': [c['ori_caption'] for c in to_dict['caption']],
+                              'struct_caption': [c['struct_caption'] for c in to_dict['caption']]}
+        return to_dict
+    def __getitem__(self, idx):
+        # if idx < len(self.dataset):
+        data = self.dataset.iloc[idx]
+        ori_caption = data['ori_cap']
+        struct_caption = data['caption']
+        item = {}
+        try:
+            if not os.path.exists(data['mel_path']):
+                mel_path = data['mel_path'].replace('/apdcephfs', '/apdcephfs_intern')
+            else:
+                mel_path = data['mel_path']
+            spec = np.load(mel_path)  # mel spec [80, T]
+            if spec.shape[1] > self.max_batch_len:
+                spec = spec[:, :self.max_batch_len]
+        except Exception:
+            mel_path = data['mel_path']
+            print(f'corrupted:{mel_path}')
+            spec = np.ones((self.mel_num, self.min_batch_len)).astype(np.float32) * self.pad_value
+        item['image'] = spec
+        item["caption"] = {"ori_caption": ori_caption, "struct_caption": struct_caption}
+        item['f_name'] = data['name']
+        return item
+    def __len__(self):
+        return len(self.dataset)  # + len(self.df_other)
+class DDPIndexBatchSampler(Sampler):# group the indices of similar-length audio into the same batch to avoid excessive padding
+    def __init__(self, main_indices,batch_size, num_replicas: Optional[int] = None,
+                 rank: Optional[int] = None, shuffle: bool = True,
+                 seed: int = 0, drop_last: bool = False) -> None:
+        if num_replicas is None:
+            if not dist.is_initialized():
+                # raise RuntimeError("Requires distributed package to be available")
+                print("Not in distributed mode")
+                num_replicas = 1
+            else:
+                num_replicas = dist.get_world_size()
+        if rank is None:
+            if not dist.is_initialized():
+                # raise RuntimeError("Requires distributed package to be available")
+                rank = 0
+            else:
+                rank = dist.get_rank()
+        if rank >= num_replicas or rank < 0:
+            raise ValueError(
+                "Invalid rank {}, rank should be in the interval"
+                " [0, {}]".format(rank, num_replicas - 1))
+        self.main_indices = main_indices
+        # self.other_indices = other_indices
+        # self.max_index = max(self.other_indices)
+        self.num_replicas = num_replicas
+        self.rank = rank
+        self.epoch = 0
+        self.drop_last = drop_last
+        self.batch_size = batch_size
+        self.shuffle = shuffle
+        self.batches = self.build_batches()
+        self.seed = seed
+    def set_epoch(self,epoch):
+        # print("!!!!!!!!!!!set epoch is called!!!!!!!!!!!!!!")
+        self.epoch = epoch
+        if self.shuffle:
+            np.random.seed(self.seed+self.epoch)
+            self.batches = self.build_batches()
+    def build_batches(self):
+        batches,batch = [],[]
+        for index in self.main_indices:
+            batch.append(index)
+            if len(batch) == self.batch_size:
+                batches.append(batch)
+                batch = []
+        if not self.drop_last and len(batch) > 0:
+            batches.append(batch)
+        # selected_others = np.random.choice(len(self.other_indices),len(batches),replace=False)
+        # for index in selected_others:
+        #     if index + self.batch_size > len(self.other_indices):
+        #         index = len(self.other_indices) - self.batch_size
+        #     batch = [self.other_indices[index + i] for i in range(self.batch_size)]
+        #     batches.append(batch)
+        self.batches = batches
+        if self.shuffle:
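+            self.batches = [self.batches[i] for i in np.random.permutation(len(self.batches))]  # permute batch positions: a shorter last batch makes the list ragged, which cannot be turned into an ndarray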
+        if self.rank == 0:
+            print(f"rank: {self.rank}, batches_num {len(self.batches)}")
+        if self.drop_last and len(self.batches) % self.num_replicas != 0:
+            self.batches = self.batches[:len(self.batches)//self.num_replicas*self.num_replicas]
+        if len(self.batches) >= self.num_replicas:
+            self.batches = self.batches[self.rank::self.num_replicas]
+        else: # may happen in sanity checking
+            self.batches = [self.batches[0]]
+        if self.rank == 0:
+            print(f"after split batches_num {len(self.batches)}")
+        return self.batches
+    def __iter__(self) -> Iterator[List[int]]:
+        print(f"len(self.batches):{len(self.batches)}")
+        for batch in self.batches:
+            yield batch
+    def __len__(self) -> int:
+        return len(self.batches)
+class JoinManifestSpecs_Caption(JoinManifestSpecs):
+    def collater(self, inputs):
+        to_dict = {}
+        for l in inputs:
+            for k, v in l.items():
+                if k in to_dict:
+                    to_dict[k].append(v)
+                else:
+                    to_dict[k] = [v]
+        if self.collate_mode == 'pad':
+            to_dict['image'] = collate_1d_or_2d(to_dict['image'], pad_idx=self.pad_value, min_len=self.min_batch_len,
+                                                max_len=self.max_batch_len, min_factor=self.min_factor)
+        elif self.collate_mode == 'tile':
+            to_dict['image'] = collate_1d_or_2d_tile(to_dict['image'], min_len=self.min_batch_len,
+                                                     max_len=self.max_batch_len, min_factor=self.min_factor)
+        else:
+            raise NotImplementedError
+        return to_dict
+    def __getitem__(self, idx):
+        # if idx < len(self.dataset):
+        data = self.dataset.iloc[idx]
+        p = np.random.uniform(0, 1)
+        if p > self.drop:
+            caption = data['ori_cap']
+        else:
+            caption = ""
+        item = {}
+        try:
+            if not os.path.exists(data['mel_path']):
+                mel_path = data['mel_path'].replace('/apdcephfs', '/apdcephfs_intern')
+            else:
+                mel_path = data['mel_path']
+            spec = np.load(mel_path)  # mel spec [80, T]
+            if spec.shape[1] > self.max_batch_len:
+                spec = spec[:, :self.max_batch_len]
+        except Exception:
+            mel_path = data['mel_path']
+            print(f'corrupted:{mel_path}')
+            spec = np.ones((self.mel_num, self.min_batch_len)).astype(np.float32) * self.pad_value
+        item['image'] = spec
+        item["caption"] = caption
+        if self.split == 'test':
+            item['f_name'] = data['name']
+        return item
+class JoinSpecsTrain_Caption(JoinManifestSpecs_Caption):
+    def __init__(self, specs_dataset_cfg):
+        super().__init__('train', **specs_dataset_cfg)
+class JoinSpecsValidation_Caption(JoinManifestSpecs_Caption):
+    def __init__(self, specs_dataset_cfg):
+        super().__init__('valid', **specs_dataset_cfg)
+class JoinSpecsTest_Caption(JoinManifestSpecs_Caption):
+    def __init__(self, specs_dataset_cfg):
+        super().__init__('test', **specs_dataset_cfg)
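
The batching scheme in `DDPIndexBatchSampler` (used together with `ordered_indices`) can be sketched in plain Python as below. This is a simplified illustration under stated assumptions, not the class itself: `build_rank_batches` is a hypothetical helper, it uses the standard-library `random.Random` instead of the global NumPy seed, and it omits the fallback the sampler applies when there are fewer batches than replicas.

```python
import random

def build_rank_batches(durations, batch_size, rank=0, num_replicas=1,
                       shuffle=True, seed=0, epoch=0, drop_last=False):
    """Return the batches of dataset indices that one rank would iterate over."""
    # Sort indices by duration so that neighbouring indices have similar lengths.
    order = sorted(range(len(durations)), key=lambda i: durations[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()  # discard the incomplete final batch
    if shuffle:
        # Shuffle whole batches, not single indices, so each batch stays length-homogeneous.
        random.Random(seed + epoch).shuffle(batches)
    if drop_last:
        # Keep a multiple of num_replicas so every rank gets the same number of batches.
        batches = batches[:len(batches) // num_replicas * num_replicas]
    return batches[rank::num_replicas]  # strided split across ranks

durations = [4.0, 1.0, 3.5, 1.2, 9.0, 8.5, 0.9, 3.9]
print(build_rank_batches(durations, batch_size=2, rank=0, num_replicas=2, shuffle=False))
# [[6, 1], [7, 0]]
```

Seeding with `seed + epoch` mirrors what `set_epoch` does: every rank builds the same shuffled list for a given epoch, so the strided split hands out disjoint batches.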

ldm/data/tsv_dirs/full_data/V1_new/audiocaps_train_16000.tsv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a34eeaf905d408e7faab9424f1742df3c1eb89e763c91ba355058b61e86c60b8
+size 8042145

ldm/data/tsv_dirs/full_data/V2/MACS.tsv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e7e993db5676570b42daf04a7836ad0cfdbef4d04b8a73f56a5828f864ee37f6
+size 6019546

ldm/data/tsv_dirs/full_data/V2/WavText5K.tsv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:617bc20b11d6206e8735153a850b16449c484f52286dee4d7f67ed4f26bfb221
+size 1145878

ldm/data/tsv_dirs/full_data/V2/adobe.tsv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:da973ea2f5e2440a832c40a022e33ef03aad24fbf2da7943ba5a77d43a7100d4
+size 2138832

ldm/data/tsv_dirs/full_data/V2/audiostock.tsv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:cafe0c81c72b3fa1574f98fa293e4036f69f1c4b8d8cd9cb369087076482e63a
+size 2028510

ldm/data/tsv_dirs/full_data/V2/epidemic_sound.tsv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:dc67e42c9defa98edfc2c6b23c731fafa4a22307fddfd1fb95ccfc00d0168951
+size 15062608

ldm/data/tsv_dirs/full_data/caps_struct/audiocaps_train_16000_struct2.tsv ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:565a506454c19ddd694cfb4b5c47a13f98e7966bce5617a7bbecec50c418257b
+size 10208584

ldm/data/txt_spec_dataset.py ADDED

	@@ -0,0 +1,171 @@

+import csv
+import os
+import pickle
+import sys
+import numpy as np
+import torch
+import random
+import math
+import librosa
+import pandas as pd
+from pathlib import Path
+class audio_spec_join_Dataset(torch.utils.data.Dataset):
+    # Only Load audio dataset: for training Stage1: Audio Npy Dataset
+    def __init__(self, split, dataset_name, spec_crop_len, drop=0.0):
+        super().__init__()
+        if split == "train":
+            self.split = "Train"
+        elif split == "valid" or split == 'test':
+            self.split = "Test"
+        # Default params:
+        self.min_duration = 2
+        self.spec_crop_len = spec_crop_len
+        self.drop = drop
+        print("Use Drop: {}".format(self.drop))
+        self.init_text2audio(dataset_name)
+        print('Split: {}  Total Sample Num: {}'.format(split, len(self.dataset)))
+        if os.path.exists('/apdcephfs_intern/share_1316500/nlphuang/data/video_to_audio/vggsound/cavp/empty_vid.npz'):
+            self.root = '/apdcephfs_intern'
+        else:
+            self.root = '/apdcephfs'
+    def init_text2audio(self, dataset):
+        with open(dataset) as f:
+            reader = csv.DictReader(
+                f,
+                delimiter="\t",
+                quotechar=None,
+                doublequote=False,
+                lineterminator="\n",
+                quoting=csv.QUOTE_NONE,
+            )
+            samples = [dict(e) for e in reader]
+        if self.split == 'Test':
+            samples = samples[:100]
+        self.dataset = samples
+        print('text2audio dataset len:', len(self.dataset))
+    def __len__(self):
+        return len(self.dataset)
+    def load_feat(self, spec_path):
+        try:
+            spec_raw = np.load(spec_path)  # mel spec [80, T]
+        except Exception:
+            print(f'corrupted mel:{spec_path}', flush=True)
+            spec_raw = np.zeros((80, self.spec_crop_len), dtype=np.float32) # [C, T]
+        spec_len = self.spec_crop_len
+        if spec_raw.shape[1] < spec_len:
+            spec_raw = np.tile(spec_raw, math.ceil(spec_len / spec_raw.shape[1]))
+        spec_raw = spec_raw[:, :int(spec_len)]
+        return spec_raw
+    def __getitem__(self, idx):
+        data_dict = {}
+        data = self.dataset[idx]
+        p = np.random.uniform(0, 1)
+        if p > self.drop:
+            caption = {"ori_caption": data['ori_cap'], "struct_caption": data['caption']}
+        else:
+            caption = {"ori_caption": "", "struct_caption": ""}
+        mel_path = data['mel_path'].replace('/apdcephfs', '/apdcephfs_intern') if self.root == '/apdcephfs_intern' else data['mel_path']
+        spec = self.load_feat(mel_path)
+        data_dict['caption'] = caption
+        data_dict['image'] = spec  # (80, 624)
+        return data_dict
+class spec_join_Dataset_Train(audio_spec_join_Dataset):
+    def __init__(self, dataset_cfg):
+        super().__init__(split='train', **dataset_cfg)
+class spec_join_Dataset_Valid(audio_spec_join_Dataset):
+    def __init__(self, dataset_cfg):
+        super().__init__(split='valid', **dataset_cfg)
+class spec_join_Dataset_Test(audio_spec_join_Dataset):
+    def __init__(self, dataset_cfg):
+        super().__init__(split='test', **dataset_cfg)
+class audio_spec_join_audioset_Dataset(audio_spec_join_Dataset):
+    # def __init__(self, split, dataset_name, root, spec_crop_len, drop=0.0):
+    #     super().__init__(split, dataset_name, spec_crop_len, drop)
+    #
+    #     self.data_dir = root
+        # MANIFEST_COLUMNS = ["name", "dataset", "ori_cap", "audio_path", "mel_path", "duration"]
+        # manifest = {c: [] for c in MANIFEST_COLUMNS}
+        # skip = 0
+        # if self.split != 'Train': return
+        # from preprocess.generate_manifest import save_df_to_tsv
+        # from tqdm import tqdm
+        # for idx in tqdm(range(len(self.dataset))):
+        #     item = self.dataset[idx]
+        #     mel_path = f'{self.data_dir}/{Path(item["name"])}_mel.npy'
+        #     try:
+        #         _ = np.load(mel_path)
+        #     except:
+        #         skip += 1
+        #         continue
+        #
+        #     manifest["name"].append(item['name'])
+        #     manifest["dataset"].append("audioset")
+        #     manifest["ori_cap"].append(item['ori_cap'])
+        #     manifest["duration"].append(item['duration'])
+        #     manifest["audio_path"].append(item['audio_path'])
+        #     manifest["mel_path"].append(mel_path)
+        #
+        # print(f"Writing manifest to {dataset_name.replace('audioset.tsv', 'audioset_new.tsv')}..., skip: {skip}")
+        # save_df_to_tsv(pd.DataFrame.from_dict(manifest), f"{dataset_name.replace('audioset.tsv', 'audioset_new.tsv')}")
+    def __getitem__(self, idx):
+        data_dict = {}
+        data = self.dataset[idx]
+        p = np.random.uniform(0, 1)
+        if p > self.drop:
+            caption = data['ori_cap']
+        else:
+            caption = ""
+        spec = self.load_feat(data['mel_path'])
+        data_dict['caption'] = caption
+        data_dict['image'] = spec  # (80, 624)
+        return data_dict
+class spec_join_Dataset_audioset_Train(audio_spec_join_audioset_Dataset):
+    def __init__(self, dataset_cfg):
+        super().__init__(split='train', **dataset_cfg)
+class spec_join_Dataset_audioset_Valid(audio_spec_join_audioset_Dataset):
+    def __init__(self, dataset_cfg):
+        super().__init__(split='valid', **dataset_cfg)
+class spec_join_Dataset_audioset_Test(audio_spec_join_audioset_Dataset):
+    def __init__(self, dataset_cfg):
+        super().__init__(split='test', **dataset_cfg)
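
The caption dropout used in the `__getitem__` methods above can be sketched in isolation as follows. `maybe_drop_caption` is a hypothetical helper, and the `rng` parameter is added here only to make the example reproducible. The source does not state why captions are dropped; training an unconditional branch for classifier-free guidance is a common reason, but that is an assumption.

```python
import random

def maybe_drop_caption(ori_caption, struct_caption, drop, rng=random):
    """Keep both captions with probability about 1 - drop, else return empty strings."""
    p = rng.uniform(0, 1)
    if p > drop:
        return {"ori_caption": ori_caption, "struct_caption": struct_caption}
    # Both captions are blanked together, so the pair is never half-conditioned.
    return {"ori_caption": "", "struct_caption": ""}

rng = random.Random(0)
samples = [maybe_drop_caption("a dog barks", "<dog barking& all>", drop=0.3, rng=rng)
           for _ in range(10000)]
dropped = sum(1 for s in samples if s["ori_caption"] == "")
print(dropped / len(samples))  # close to 0.3
```

With `drop=0` (the default in `JoinManifestSpecs`) captions are kept except in the negligible case where the draw is exactly 0, and with `drop=1.0` they are always blanked, because the draw is never greater than 1.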

ldm/data/video_spec_maa2_dataset.py ADDED

	@@ -0,0 +1,837 @@

+import csv
+import os
+import pickle
+import sys
+import numpy as np
+import torch
+import random
+import math
+import librosa
+class audio_video_spec_fullset_Dataset(torch.utils.data.Dataset):
+    # Only Load audio dataset: for training Stage1: Audio Npy Dataset
+    def __init__(self, split, dataset1, feat_type='clip', transforms=None, sr=22050, duration=10, truncate=220000, fps=21.5, drop=0.0, fix_frames=False, hop_len=256):
+        super().__init__()
+        if split == "train":
+            self.split = "Train"
+        elif split == "valid" or split == 'test':
+            self.split = "Test"
+        # Default params:
+        self.min_duration = 2
+        self.sr = sr                # 22050
+        self.duration = duration    # 10
+        self.truncate = truncate    # 220000
+        self.fps = fps
+        self.fix_frames = fix_frames
+        self.hop_len = hop_len
+        self.drop = drop
+        print("Fix Frames: {}".format(self.fix_frames))
+        print("Use Drop: {}".format(self.drop))
+        # Dataset1: (VGGSound)
+        assert dataset1.dataset_name == "VGGSound"
+        # spec_dir: spectrogram path
+        # feat_dir: CAVP feature path
+        # video_dir: video path
+        dataset1_spec_dir = os.path.join(dataset1.data_dir, "mel_maa2", "npy")
+        dataset1_feat_dir = os.path.join(dataset1.data_dir, "cavp")
+        dataset1_video_dir = os.path.join(dataset1.video_dir, "tmp_vid")
+        split_txt_path = dataset1.split_txt_path
+        with open(os.path.join(split_txt_path, '{}.txt'.format(self.split)), "r") as f:
+            data_list1 = f.readlines()
+            data_list1 = list(map(lambda x: x.strip(), data_list1))
+            spec_list1 = list(map(lambda x: os.path.join(dataset1_spec_dir, x) + "_mel.npy", data_list1))      # spec
+            feat_list1 = list(map(lambda x: os.path.join(dataset1_feat_dir, x) + ".npz",     data_list1))      # feat
+            video_list1 = list(map(lambda x: os.path.join(dataset1_video_dir, x) + "_new_fps_21.5_truncate_0_10.0.mp4",   data_list1))      # video
+        # Merge Data:
+        self.data_list = data_list1 if self.split != "Test" else data_list1[:200]
+        self.spec_list = spec_list1 if self.split != "Test" else spec_list1[:200]
+        self.feat_list = feat_list1 if self.split != "Test" else feat_list1[:200]
+        self.video_list = video_list1 if self.split != "Test" else video_list1[:200]
+        assert len(self.data_list) == len(self.spec_list) == len(self.feat_list) == len(self.video_list)
+        shuffle_idx = np.random.permutation(np.arange(len(self.data_list)))
+        self.data_list = [self.data_list[i] for i in shuffle_idx]
+        self.spec_list = [self.spec_list[i] for i in shuffle_idx]
+        self.feat_list = [self.feat_list[i] for i in shuffle_idx]
+        self.video_list = [self.video_list[i] for i in shuffle_idx]
+        print('Split: {}  Sample Num: {}'.format(split, len(self.data_list)))
+    def __len__(self):
+        return len(self.data_list)
+    def load_spec_and_feat(self, spec_path, video_feat_path):
+        """Load audio spec and video feat"""
+        try:
+            spec_raw = np.load(spec_path).astype(np.float32)                    # channel: 1
+        except Exception:
+            print(f"corrupted mel: {spec_path}", flush=True)
+            spec_raw = np.zeros((80, 625), dtype=np.float32) # [C, T]
+        p = np.random.uniform(0,1)
+        if p > self.drop:
+            try:
+                video_feat = np.load(video_feat_path)['feat'].astype(np.float32)
+            except Exception:
+                print(f"corrupted video: {video_feat_path}", flush=True)
+                video_feat = np.load(os.path.join(os.path.dirname(video_feat_path), 'empty_vid.npz'))['feat'].astype(np.float32)
+        else:
+            video_feat = np.load(os.path.join(os.path.dirname(video_feat_path), 'empty_vid.npz'))['feat'].astype(np.float32)
+        spec_len = self.sr * self.duration / self.hop_len
+        if spec_raw.shape[1] < spec_len:
+            spec_raw = np.tile(spec_raw, math.ceil(spec_len / spec_raw.shape[1]))
+        spec_raw = spec_raw[:, :int(spec_len)]
+        feat_len = self.fps * self.duration
+        if video_feat.shape[0] < feat_len:
+            video_feat = np.tile(video_feat, (math.ceil(feat_len / video_feat.shape[0]), 1))
+        video_feat = video_feat[:int(feat_len)]
+        return spec_raw, video_feat
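+    def mix_audio_and_feat(self, spec1=None, spec2=None, video_feat1=None, video_feat2=None, video_info_dict=None, mode='single'):
+        """Return the mixed spec and the mixed video feature."""
+        if video_info_dict is None:  # avoid sharing one mutable default dict across calls
+            video_info_dict = {}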
+        if mode == "single":
+            # spec1:
+            if not self.fix_frames:
+                start_idx = random.randint(0, self.sr * self.duration - self.truncate - 1)  # audio start
+            else:
+                start_idx = 0
+            start_frame = int(self.fps * start_idx / self.sr)
+            truncate_frame = int(self.fps * self.truncate / self.sr)
+            # Spec Start & Truncate:
+            spec_start = int(start_idx / self.hop_len)
+            spec_truncate = int(self.truncate / self.hop_len)
+            spec1 = spec1[:, spec_start : spec_start + spec_truncate]
+            video_feat1 = video_feat1[start_frame: start_frame + truncate_frame]
+            # info_dict:
+            video_info_dict['video_time1'] = str(start_frame) + '_' + str(start_frame+truncate_frame)   # Start frame, end frame
+            video_info_dict['video_time2'] = ""
+            return spec1, video_feat1, video_info_dict
+        elif mode == "concat":
+            total_spec_len = int(self.truncate / self.hop_len)
+            # Random truncate len:
+            spec1_truncate_len = random.randint(self.min_duration * self.sr // self.hop_len, total_spec_len - self.min_duration * self.sr // self.hop_len - 1)
+            spec2_truncate_len = total_spec_len - spec1_truncate_len
+            # Sample spec clip:
+            spec_start1 = random.randint(0, total_spec_len - spec1_truncate_len - 1)
+            spec_start2 = random.randint(0, total_spec_len - spec2_truncate_len - 1)
+            spec_end1, spec_end2 = spec_start1 + spec1_truncate_len, spec_start2 + spec2_truncate_len
+            # concat spec:
+            spec1, spec2 = spec1[:, spec_start1 : spec_end1], spec2[:, spec_start2 : spec_end2]
+            concat_audio_spec = np.concatenate([spec1, spec2], axis=1)
+            # Concat Video Feat:
+            start1_frame, truncate1_frame = int(self.fps * spec_start1 * self.hop_len / self.sr), int(self.fps * spec1_truncate_len * self.hop_len / self.sr)
+            start2_frame, truncate2_frame = int(self.fps * spec_start2 * self.hop_len / self.sr), int(self.fps * self.truncate / self.sr) - truncate1_frame
+            video_feat1, video_feat2 = video_feat1[start1_frame : start1_frame + truncate1_frame], video_feat2[start2_frame : start2_frame + truncate2_frame]
+            concat_video_feat = np.concatenate([video_feat1, video_feat2])
+            video_info_dict['video_time1'] = str(start1_frame) + '_' + str(start1_frame+truncate1_frame)   # Start frame, end frame
+            video_info_dict['video_time2'] = str(start2_frame) + '_' + str(start2_frame+truncate2_frame)
+            return concat_audio_spec, concat_video_feat, video_info_dict
+    def __getitem__(self, idx):
+        audio_name1 = self.data_list[idx]
+        spec_npy_path1 = self.spec_list[idx]
+        video_feat_path1 = self.feat_list[idx]
+        video_path1 = self.video_list[idx]
+        # select other video:
+        flag = False
+        if random.uniform(0, 1) < 0.5:
+            flag = True
+            random_idx = idx
+            while random_idx == idx:
+                random_idx = random.randint(0, len(self.data_list)-1)
+            audio_name2 = self.data_list[random_idx]
+            spec_npy_path2 = self.spec_list[random_idx]
+            video_feat_path2 = self.feat_list[random_idx]
+            video_path2 = self.video_list[random_idx]
+        # Load the Spec and Feat:
+        spec1, video_feat1 = self.load_spec_and_feat(spec_npy_path1, video_feat_path1)
+        if flag:
+            spec2, video_feat2 = self.load_spec_and_feat(spec_npy_path2, video_feat_path2)
+            video_info_dict = {'audio_name1':audio_name1, 'audio_name2': audio_name2, 'video_path1': video_path1, 'video_path2': video_path2}
+            mix_spec, mix_video_feat, mix_info = self.mix_audio_and_feat(spec1, spec2, video_feat1, video_feat2, video_info_dict, mode='concat')
+        else:
+            video_info_dict = {'audio_name1':audio_name1, 'audio_name2': "", 'video_path1': video_path1, 'video_path2': ""}
+            mix_spec, mix_video_feat, mix_info = self.mix_audio_and_feat(spec1=spec1, video_feat1=video_feat1, video_info_dict=video_info_dict, mode='single')
+        # print("mix spec shape:", mix_spec.shape)
+        # print("mix video feat:", mix_video_feat.shape)
+        data_dict = {}
+        # data_dict['mix_spec'] = mix_spec[None].repeat(3, axis=0) # TODO: change this, otherwise it cannot fit the MAA autoencoder
+        data_dict['mix_spec'] = mix_spec # (80, 512)
+        data_dict['mix_video_feat'] = mix_video_feat # (32, 512)
+        data_dict['mix_info_dict'] = mix_info
+        return data_dict
+class audio_video_spec_fullset_Dataset_Train(audio_video_spec_fullset_Dataset):
+    def __init__(self, dataset_cfg):
+        super().__init__(split='train', **dataset_cfg)
+class audio_video_spec_fullset_Dataset_Valid(audio_video_spec_fullset_Dataset):
+    def __init__(self, dataset_cfg):
+        super().__init__(split='valid', **dataset_cfg)
+class audio_video_spec_fullset_Dataset_Test(audio_video_spec_fullset_Dataset):
+    def __init__(self, dataset_cfg):
+        super().__init__(split='test', **dataset_cfg)
+class audio_video_spec_fullset_Dataset_inpaint(audio_video_spec_fullset_Dataset):
+    def __getitem__(self, idx):
+        audio_name1 = self.data_list[idx]
+        spec_npy_path1 = self.spec_list[idx]
+        video_feat_path1 = self.feat_list[idx]
+        video_path1 = self.video_list[idx]
+        # Load the Spec and Feat:
+        spec1, video_feat1 = self.load_spec_and_feat(spec_npy_path1, video_feat_path1)
+        video_info_dict = {'audio_name1': audio_name1, 'audio_name2': "", 'video_path1': video_path1, 'video_path2': ""}
+        mix_spec, mix_masked_spec, mix_video_feat, mix_info = self.mix_audio_and_feat(spec1=spec1, video_feat1=video_feat1, video_info_dict=video_info_dict)
+        # print("mix spec shape:", mix_spec.shape)
+        # print("mix video feat:", mix_video_feat.shape)
+        data_dict = {}
+        # data_dict['mix_spec'] = mix_spec[None].repeat(3, axis=0) # TODO: change this, otherwise it cannot fit the MAA autoencoder
+        data_dict['mix_spec'] = mix_spec  # (80, 512)
+        data_dict['hybrid_feat'] = {'mix_video_feat': mix_video_feat, 'mix_spec': mix_masked_spec}  # (32, 512)
+        data_dict['mix_info_dict'] = mix_info
+        return data_dict
+    def mix_audio_and_feat(self, spec1=None, video_feat1=None, video_info_dict={}):
+        """ Return Mix Spec and Mix video feat"""
+        # spec1:
+        if not self.fix_frames:
+            start_idx = random.randint(0, self.sr * self.duration - self.truncate - 1)  # audio start
+        else:
+            start_idx = 0
+        start_frame = int(self.fps * start_idx / self.sr)
+        truncate_frame = int(self.fps * self.truncate / self.sr)
+        # Spec Start & Truncate:
+        spec_start = int(start_idx / self.hop_len)
+        spec_truncate = int(self.truncate / self.hop_len)
+        spec1 = spec1[:, spec_start: spec_start + spec_truncate]
+        video_feat1 = video_feat1[start_frame: start_frame + truncate_frame]
+        # Start masking frames:
+        masked_spec = random.randint(1, int(spec_truncate * 0.5 // 16)) * 16  # a multiple of 16 frames; masks at most 50%
+        masked_truncate = int(masked_spec * self.hop_len)
+        masked_frame = int(self.fps * masked_truncate / self.sr)
+        start_masked_idx = random.randint(0, self.truncate - masked_truncate - 1)
+        start_masked_frame = int(self.fps * start_masked_idx / self.sr)
+        start_masked_spec = int(start_masked_idx / self.hop_len)
+        masked_spec1 = np.zeros((80, spec_truncate)).astype(np.float32)
+        masked_spec1[:] = spec1[:]
+        masked_spec1[:, start_masked_spec:start_masked_spec+masked_spec] = np.zeros((80, masked_spec))
+        video_feat1[start_masked_frame:start_masked_frame+masked_frame, :] = np.zeros((masked_frame, 512))
+        # info_dict:
+        video_info_dict['video_time1'] = str(start_frame) + '_' + str(start_frame + truncate_frame)  # Start frame, end frame
+        video_info_dict['video_time2'] = ""
+        return spec1, masked_spec1, video_feat1, video_info_dict
+class audio_video_spec_fullset_Dataset_inpaint_Train(audio_video_spec_fullset_Dataset_inpaint):
+    def __init__(self, dataset_cfg):
+        super().__init__(split='train', **dataset_cfg)
+class audio_video_spec_fullset_Dataset_inpaint_Valid(audio_video_spec_fullset_Dataset_inpaint):
+    def __init__(self, dataset_cfg):
+        super().__init__(split='valid', **dataset_cfg)
+class audio_video_spec_fullset_Dataset_inpaint_Test(audio_video_spec_fullset_Dataset_inpaint):
+    def __init__(self, dataset_cfg):
+        super().__init__(split='test', **dataset_cfg)
+class audio_Dataset(torch.utils.data.Dataset):
+    # Audio-only dataset for Stage 1 training: loads raw waveforms
+    def __init__(self, split, dataset1, sr=22050, duration=10, truncate=220000, debug_num=False, fix_frames=False, hop_len=256):
+        super().__init__()
+        if split == "train":
+            self.split = "Train"
+        elif split == "valid" or split == 'test':
+            self.split = "Test"
+        # Default params:
+        self.min_duration = 2
+        self.sr = sr                # 22050
+        self.duration = duration    # 10
+        self.truncate = truncate    # 220000
+        self.fix_frames = fix_frames
+        self.hop_len = hop_len
+        print("Fix Frames: {}".format(self.fix_frames))
+        # Dataset1: (VGGSound)
+        assert dataset1.dataset_name == "VGGSound"
+        # spec_dir: spectrogram path
+        # dataset1_spec_dir = os.path.join(dataset1.data_dir, "codec")
+        dataset1_wav_dir = os.path.join(dataset1.wav_dir, "wav")
+        split_txt_path = dataset1.split_txt_path
+        with open(os.path.join(split_txt_path, '{}.txt'.format(self.split)), "r") as f:
+            data_list1 = f.readlines()
+            data_list1 = list(map(lambda x: x.strip(), data_list1))
+            wav_list1 = list(map(lambda x: os.path.join(dataset1_wav_dir, x) + ".wav", data_list1))  # wav
+        # Merge Data:
+        self.data_list = data_list1
+        self.wav_list = wav_list1
+        assert len(self.data_list) == len(self.wav_list)
+        shuffle_idx = np.random.permutation(np.arange(len(self.data_list)))
+        self.data_list = [self.data_list[i] for i in shuffle_idx]
+        self.wav_list = [self.wav_list[i] for i in shuffle_idx]
+        if debug_num:
+            self.data_list = self.data_list[:debug_num]
+            self.wav_list = self.wav_list[:debug_num]
+        print('Split: {}  Sample Num: {}'.format(split, len(self.data_list)))
+    def __len__(self):
+        return len(self.data_list)
+    def load_spec_and_feat(self, wav_path):
+        """Load the raw waveform (no video features are used by this dataset)"""
+        try:
+            wav_raw, sr = librosa.load(wav_path, sr=self.sr)                   # channel: 1
+        except Exception:
+            print(f"corrupted wav: {wav_path}", flush=True)
+            wav_raw = np.zeros((160000,), dtype=np.float32) # [T]
+        wav_len = self.sr * self.duration
+        if wav_raw.shape[0] < wav_len:
+            wav_raw = np.tile(wav_raw, math.ceil(wav_len / wav_raw.shape[0]))
+        wav_raw = wav_raw[:int(wav_len)]
+        return wav_raw
+    def mix_audio_and_feat(self, wav_raw1=None, video_info_dict={}, mode='single'):
+        """ Return Mix Spec and Mix video feat"""
+        if mode == "single":
+            # spec1:
+            if not self.fix_frames:
+                start_idx = random.randint(0, self.sr * self.duration - self.truncate - 1)  # audio start
+            else:
+                start_idx = 0
+            wav_start = start_idx
+            wav_truncate = self.truncate
+            wav_raw1 = wav_raw1[wav_start: wav_start + wav_truncate]
+            return wav_raw1, video_info_dict
+        elif mode == "concat":
+            total_spec_len = int(self.truncate / self.hop_len)
+            # Random truncate length:
+            spec1_truncate_len = random.randint(self.min_duration * self.sr // self.hop_len, total_spec_len - self.min_duration * self.sr // self.hop_len - 1)
+            spec2_truncate_len = total_spec_len - spec1_truncate_len
+            # Sample spec clip:
+            spec_start1 = random.randint(0, total_spec_len - spec1_truncate_len - 1)
+            spec_start2 = random.randint(0, total_spec_len - spec2_truncate_len - 1)
+            spec_end1, spec_end2 = spec_start1 + spec1_truncate_len, spec_start2 + spec2_truncate_len
+            # concat spec: not implemented for raw waveforms; only the info dict is returned
+            return video_info_dict
+    def __getitem__(self, idx):
+        audio_name1 = self.data_list[idx]
+        wav_path1 = self.wav_list[idx]
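+        # select other video:
+        # NOTE: the condition below is never true, so the concat branch is disabled;
+        # that branch still references self.spec_list, which this class does not define.
+        flag = False
+        if random.uniform(0, 1) < -1: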
+            flag = True
+            random_idx = idx
+            while random_idx == idx:
+                random_idx = random.randint(0, len(self.data_list)-1)
+            audio_name2 = self.data_list[random_idx]
+            spec_npy_path2 = self.spec_list[random_idx]
+            wav_path2 = self.wav_list[random_idx]
+        # Load the Spec and Feat:
+        wav_raw1 = self.load_spec_and_feat(wav_path1)
+        if flag:
+            spec2, video_feat2 = self.load_spec_and_feat(spec_npy_path2, wav_path2)
+            video_info_dict = {'audio_name1':audio_name1, 'audio_name2': audio_name2}
+            mix_spec, mix_video_feat, mix_info = self.mix_audio_and_feat(video_info_dict, mode='concat')
+        else:
+            video_info_dict = {'audio_name1':audio_name1, 'audio_name2': ""}
+            mix_wav, mix_info = self.mix_audio_and_feat(wav_raw1=wav_raw1, video_info_dict=video_info_dict, mode='single')
+        data_dict = {}
+        data_dict['mix_wav'] = mix_wav  # (131072,)
+        data_dict['mix_info_dict'] = mix_info
+        return data_dict
+class audio_Dataset_Train(audio_Dataset):
+    def __init__(self, dataset_cfg):
+        super().__init__(split='train', **dataset_cfg)
+class audio_Dataset_Test(audio_Dataset):
+    def __init__(self, dataset_cfg):
+        super().__init__(split='test', **dataset_cfg)
+class audio_Dataset_Valid(audio_Dataset):
+    def __init__(self, dataset_cfg):
+        super().__init__(split='valid', **dataset_cfg)
+class video_codec_Dataset(torch.utils.data.Dataset):
+    # Loads raw waveforms and CAVP video features (VGGSound)
+    def __init__(self, split, dataset1, sr=22050, duration=10, truncate=220000, fps=21.5, debug_num=False, fix_frames=False, hop_len=256):
+        super().__init__()
+        if split == "train":
+            self.split = "Train"
+        elif split == "valid" or split == 'test':
+            self.split = "Test"
+        # Default params:
+        self.min_duration = 2
+        self.fps = fps
+        self.sr = sr                # 22050
+        self.duration = duration    # 10
+        self.truncate = truncate    # 220000
+        self.fix_frames = fix_frames
+        self.hop_len = hop_len
+        print("Fix Frames: {}".format(self.fix_frames))
+        # Dataset1: (VGGSound)
+        assert dataset1.dataset_name == "VGGSound"
+        # spec_dir: spectrogram path
+        # dataset1_spec_dir = os.path.join(dataset1.data_dir, "codec")
+        dataset1_feat_dir = os.path.join(dataset1.data_dir, "cavp")
+        dataset1_wav_dir = os.path.join(dataset1.wav_dir, "wav")
+        split_txt_path = dataset1.split_txt_path
+        with open(os.path.join(split_txt_path, '{}.txt'.format(self.split)), "r") as f:
+            data_list1 = f.readlines()
+            data_list1 = list(map(lambda x: x.strip(), data_list1))
+            wav_list1 = list(map(lambda x: os.path.join(dataset1_wav_dir, x) + ".wav", data_list1))  # wav
+            feat_list1 = list(map(lambda x: os.path.join(dataset1_feat_dir, x) + ".npz", data_list1))  # feat
+        # Merge Data:
+        self.data_list = data_list1
+        self.wav_list = wav_list1
+        self.feat_list = feat_list1
+        assert len(self.data_list) == len(self.wav_list)
+        shuffle_idx = np.random.permutation(np.arange(len(self.data_list)))
+        self.data_list = [self.data_list[i] for i in shuffle_idx]
+        self.wav_list = [self.wav_list[i] for i in shuffle_idx]
+        self.feat_list = [self.feat_list[i] for i in shuffle_idx]
+        if debug_num:
+            self.data_list = self.data_list[:debug_num]
+            self.wav_list = self.wav_list[:debug_num]
+            self.feat_list = self.feat_list[:debug_num]
+        print('Split: {}  Sample Num: {}'.format(split, len(self.data_list)))
+    def __len__(self):
+        return len(self.data_list)
+    def load_spec_and_feat(self, wav_path, video_feat_path):
+        """Load the raw waveform and the video features"""
+        try:
+            wav_raw, sr = librosa.load(wav_path, sr=self.sr)                   # channel: 1
+        except Exception:
+            print(f"corrupted wav: {wav_path}", flush=True)
+            wav_raw = np.zeros((160000,), dtype=np.float32) # [T]
+        try:
+            video_feat = np.load(video_feat_path)['feat'].astype(np.float32)
+        except Exception:
+            print(f"corrupted video: {video_feat_path}", flush=True)
+            video_feat = np.load(os.path.join(os.path.dirname(video_feat_path), 'empty_vid.npz'))['feat'].astype(np.float32)
+        wav_len = self.sr * self.duration
+        if wav_raw.shape[0] < wav_len:
+            wav_raw = np.tile(wav_raw, math.ceil(wav_len / wav_raw.shape[0]))
+        wav_raw = wav_raw[:int(wav_len)]
+        feat_len = self.fps * self.duration
+        if video_feat.shape[0] < feat_len:
+            video_feat = np.tile(video_feat, (math.ceil(feat_len / video_feat.shape[0]), 1))
+        video_feat = video_feat[:int(feat_len)]
+        return wav_raw, video_feat
+    def mix_audio_and_feat(self, wav_raw1=None, video_feat1=None, video_info_dict={}, mode='single'):
+        """ Return Mix Spec and Mix video feat"""
+        if mode == "single":
+            # spec1:
+            if not self.fix_frames:
+                start_idx = random.randint(0, self.sr * self.duration - self.truncate - 1)  # audio start
+            else:
+                start_idx = 0
+            wav_start = start_idx
+            wav_truncate = self.truncate
+            wav_raw1 = wav_raw1[wav_start: wav_start + wav_truncate]
+            start_frame = int(self.fps * start_idx / self.sr)
+            truncate_frame = int(self.fps * self.truncate / self.sr)
+            video_feat1 = video_feat1[start_frame: start_frame + truncate_frame]
+            # info_dict:
+            video_info_dict['video_time1'] = str(start_frame) + '_' + str(start_frame+truncate_frame)   # Start frame, end frame
+            video_info_dict['video_time2'] = ""
+            return wav_raw1, video_feat1, video_info_dict
+        elif mode == "concat":
+            total_spec_len = int(self.truncate / self.hop_len)
+            # Random truncate length:
+            spec1_truncate_len = random.randint(self.min_duration * self.sr // self.hop_len, total_spec_len - self.min_duration * self.sr // self.hop_len - 1)
+            spec2_truncate_len = total_spec_len - spec1_truncate_len
+            # Sample spec clip:
+            spec_start1 = random.randint(0, total_spec_len - spec1_truncate_len - 1)
+            spec_start2 = random.randint(0, total_spec_len - spec2_truncate_len - 1)
+            spec_end1, spec_end2 = spec_start1 + spec1_truncate_len, spec_start2 + spec2_truncate_len
+            # concat spec: not implemented for raw waveforms; only the info dict is returned
+            return video_info_dict
+    def __getitem__(self, idx):
+        audio_name1 = self.data_list[idx]
+        wav_path1 = self.wav_list[idx]
+        video_feat_path1 = self.feat_list[idx]
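+        # select other video:
+        # NOTE: the condition below is never true, so the concat branch is disabled;
+        # concat mode is not implemented for waveforms in this class.
+        flag = False
+        if random.uniform(0, 1) < -1: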
+            flag = True
+            random_idx = idx
+            while random_idx == idx:
+                random_idx = random.randint(0, len(self.data_list)-1)
+            audio_name2 = self.data_list[random_idx]
+            wav_path2 = self.wav_list[random_idx]
+            video_feat_path2 = self.feat_list[random_idx]
+        # Load the Spec and Feat:
+        wav_raw1, video_feat1 = self.load_spec_and_feat(wav_path1, video_feat_path1)
+        if flag:
+            wav_raw2, video_feat2 = self.load_spec_and_feat(wav_path2, video_feat_path2)
+            video_info_dict = {'audio_name1':audio_name1, 'audio_name2': audio_name2}
+            mix_spec, mix_video_feat, mix_info = self.mix_audio_and_feat(video_info_dict, mode='concat')
+        else:
+            video_info_dict = {'audio_name1':audio_name1, 'audio_name2': ""}
+            mix_wav, mix_video_feat, mix_info = self.mix_audio_and_feat(wav_raw1=wav_raw1, video_feat1=video_feat1, video_info_dict=video_info_dict, mode='single')
+        data_dict = {}
+        data_dict['mix_wav'] = mix_wav  # (131072,)
+        data_dict['mix_video_feat'] = mix_video_feat # (32, 512)
+        data_dict['mix_info_dict'] = mix_info
+        return data_dict
+class video_codec_Dataset_Train(video_codec_Dataset):
+    def __init__(self, dataset_cfg):
+        super().__init__(split='train', **dataset_cfg)
+class video_codec_Dataset_Test(video_codec_Dataset):
+    def __init__(self, dataset_cfg):
+        super().__init__(split='test', **dataset_cfg)
+class video_codec_Dataset_Valid(video_codec_Dataset):
+    def __init__(self, dataset_cfg):
+        super().__init__(split='valid', **dataset_cfg)
+class audio_video_spec_fullset_Audioset_Dataset(torch.utils.data.Dataset):
+    # Loads mel spectrograms and CAVP video features from VGGSound, plus Audioset for the train split
+    def __init__(self, split, dataset1, dataset2, sr=22050, duration=10, truncate=220000,
+                 fps=21.5, drop=0.0, fix_frames=False, hop_len=256):
+        super().__init__()
+        if split == "train":
+            self.split = "Train"
+        elif split == "valid" or split == 'test':
+            self.split = "Test"
+        # Default params:
+        self.min_duration = 2
+        self.sr = sr  # 22050
+        self.duration = duration  # 10
+        self.truncate = truncate  # 220000
+        self.fps = fps
+        self.fix_frames = fix_frames
+        self.hop_len = hop_len
+        self.drop = drop
+        print("Fix Frames: {}".format(self.fix_frames))
+        print("Use Drop: {}".format(self.drop))
+        # Dataset1: (VGGSound)
+        assert dataset1.dataset_name == "VGGSound"
+        assert dataset2.dataset_name == "Audioset"
+        # spec_dir: spectrogram path
+        # feat_dir: CAVP feature path
+        # video_dir: video path
+        dataset1_spec_dir = os.path.join(dataset1.data_dir, "mel_maa2", "npy")
+        dataset1_feat_dir = os.path.join(dataset1.data_dir, "cavp")
+        split_txt_path = dataset1.split_txt_path
+        with open(os.path.join(split_txt_path, '{}.txt'.format(self.split)), "r") as f:
+            data_list1 = f.readlines()
+            data_list1 = list(map(lambda x: x.strip(), data_list1))
+            spec_list1 = list(map(lambda x: os.path.join(dataset1_spec_dir, x) + "_mel.npy", data_list1))  # spec
+            feat_list1 = list(map(lambda x: os.path.join(dataset1_feat_dir, x) + ".npz", data_list1))  # feat
+        if split == "train":
+            dataset2_spec_dir = os.path.join(dataset2.data_dir, "mel")
+            dataset2_feat_dir = os.path.join(dataset2.data_dir, "cavp_renamed")
+            split_txt_path = dataset2.split_txt_path
+            with open(os.path.join(split_txt_path, '{}.txt'.format(self.split)), "r") as f:
+                data_list2 = f.readlines()
+                data_list2 = list(map(lambda x: x.strip(), data_list2))
+                spec_list2 = list(map(lambda x: os.path.join(dataset2_spec_dir, f'Y{x}') + "_mel.npy", data_list2))  # spec
+                feat_list2 = list(map(lambda x: os.path.join(dataset2_feat_dir, x) + ".npz", data_list2))  # feat
+            data_list1 += data_list2
+            spec_list1 += spec_list2
+            feat_list1 += feat_list2
+        # Merge Data:
+        self.data_list = data_list1 if self.split != "Test" else data_list1[:200]
+        self.spec_list = spec_list1 if self.split != "Test" else spec_list1[:200]
+        self.feat_list = feat_list1 if self.split != "Test" else feat_list1[:200]
+        assert len(self.data_list) == len(self.spec_list) == len(self.feat_list)
+        shuffle_idx = np.random.permutation(np.arange(len(self.data_list)))
+        self.data_list = [self.data_list[i] for i in shuffle_idx]
+        self.spec_list = [self.spec_list[i] for i in shuffle_idx]
+        self.feat_list = [self.feat_list[i] for i in shuffle_idx]
+        print('Split: {}  Sample Num: {}'.format(split, len(self.data_list)))
+        # self.check(self.spec_list)
+    def __len__(self):
+        return len(self.data_list)
+    def check(self, feat_list):
+        from tqdm import tqdm
+        for spec_path in tqdm(feat_list):
+            mel = np.load(spec_path).astype(np.float32)
+            if mel.shape[0] != 80:
+                import ipdb
+                ipdb.set_trace()
+    def load_spec_and_feat(self, spec_path, video_feat_path):
+        """Load audio spec and video feat"""
+        spec_raw = np.load(spec_path).astype(np.float32)  # channel: 1
+        if spec_raw.shape[0] != 80:
+            print(f"corrupted mel: {spec_path}", flush=True)
+            spec_raw = np.zeros((80, 625), dtype=np.float32)  # [C, T]
+        p = np.random.uniform(0, 1)
+        if p > self.drop:
+            try:
+                video_feat = np.load(video_feat_path)['feat'].astype(np.float32)
+            except Exception:
+                print(f"corrupted video: {video_feat_path}", flush=True)
+                video_feat = np.load(os.path.join(os.path.dirname(video_feat_path), 'empty_vid.npz'))['feat'].astype(np.float32)
+        else:
+            video_feat = np.load(os.path.join(os.path.dirname(video_feat_path), 'empty_vid.npz'))['feat'].astype(np.float32)
+        spec_len = self.sr * self.duration / self.hop_len
+        if spec_raw.shape[1] < spec_len:
+            spec_raw = np.tile(spec_raw, math.ceil(spec_len / spec_raw.shape[1]))
+        spec_raw = spec_raw[:, :int(spec_len)]
+        feat_len = self.fps * self.duration
+        if video_feat.shape[0] < feat_len:
+            video_feat = np.tile(video_feat, (math.ceil(feat_len / video_feat.shape[0]), 1))
+        video_feat = video_feat[:int(feat_len)]
+        return spec_raw, video_feat
+    def mix_audio_and_feat(self, spec1=None, spec2=None, video_feat1=None, video_feat2=None, video_info_dict={},
+                           mode='single'):
+        """ Return Mix Spec and Mix video feat"""
+        if mode == "single":
+            # spec1:
+            if not self.fix_frames:
+                start_idx = random.randint(0, self.sr * self.duration - self.truncate - 1)  # audio start
+            else:
+                start_idx = 0
+            start_frame = int(self.fps * start_idx / self.sr)
+            truncate_frame = int(self.fps * self.truncate / self.sr)
+            # Spec Start & Truncate:
+            spec_start = int(start_idx / self.hop_len)
+            spec_truncate = int(self.truncate / self.hop_len)
+            spec1 = spec1[:, spec_start: spec_start + spec_truncate]
+            video_feat1 = video_feat1[start_frame: start_frame + truncate_frame]
+            # info_dict:
+            video_info_dict['video_time1'] = str(start_frame) + '_' + str(
+                start_frame + truncate_frame)  # Start frame, end frame
+            video_info_dict['video_time2'] = ""
+            return spec1, video_feat1, video_info_dict
+        elif mode == "concat":
+            total_spec_len = int(self.truncate / self.hop_len)
+            # Random truncate length:
+            spec1_truncate_len = random.randint(self.min_duration * self.sr // self.hop_len,
+                                                total_spec_len - self.min_duration * self.sr // self.hop_len - 1)
+            spec2_truncate_len = total_spec_len - spec1_truncate_len
+            # Sample spec clip:
+            spec_start1 = random.randint(0, total_spec_len - spec1_truncate_len - 1)
+            spec_start2 = random.randint(0, total_spec_len - spec2_truncate_len - 1)
+            spec_end1, spec_end2 = spec_start1 + spec1_truncate_len, spec_start2 + spec2_truncate_len
+            # concat spec:
+            spec1, spec2 = spec1[:, spec_start1: spec_end1], spec2[:, spec_start2: spec_end2]
+            concat_audio_spec = np.concatenate([spec1, spec2], axis=1)
+            # Concat Video Feat:
+            start1_frame, truncate1_frame = int(self.fps * spec_start1 * self.hop_len / self.sr), int(
+                self.fps * spec1_truncate_len * self.hop_len / self.sr)
+            start2_frame, truncate2_frame = int(self.fps * spec_start2 * self.hop_len / self.sr), int(
+                self.fps * self.truncate / self.sr) - truncate1_frame
+            video_feat1, video_feat2 = video_feat1[start1_frame: start1_frame + truncate1_frame], video_feat2[
+                                                                                                  start2_frame: start2_frame + truncate2_frame]
+            concat_video_feat = np.concatenate([video_feat1, video_feat2])
+            video_info_dict['video_time1'] = str(start1_frame) + '_' + str(
+                start1_frame + truncate1_frame)  # Start frame, end frame
+            video_info_dict['video_time2'] = str(start2_frame) + '_' + str(start2_frame + truncate2_frame)
+            return concat_audio_spec, concat_video_feat, video_info_dict
+    def __getitem__(self, idx):
+        audio_name1 = self.data_list[idx]
+        spec_npy_path1 = self.spec_list[idx]
+        video_feat_path1 = self.feat_list[idx]
+        # select other video:
+        # NOTE: the condition below is never true, so concat mixing is currently disabled.
+        flag = False
+        if random.uniform(0, 1) < -1:
+            flag = True
+            random_idx = idx
+            while random_idx == idx:
+                random_idx = random.randint(0, len(self.data_list) - 1)
+            audio_name2 = self.data_list[random_idx]
+            spec_npy_path2 = self.spec_list[random_idx]
+            video_feat_path2 = self.feat_list[random_idx]
+        # Load the Spec and Feat:
+        spec1, video_feat1 = self.load_spec_and_feat(spec_npy_path1, video_feat_path1)
+        if flag:
+            spec2, video_feat2 = self.load_spec_and_feat(spec_npy_path2, video_feat_path2)
+            video_info_dict = {'audio_name1': audio_name1, 'audio_name2': audio_name2}
+            mix_spec, mix_video_feat, mix_info = self.mix_audio_and_feat(spec1, spec2, video_feat1, video_feat2, video_info_dict, mode='concat')
+        else:
+            video_info_dict = {'audio_name1': audio_name1, 'audio_name2': ""}
+            mix_spec, mix_video_feat, mix_info = self.mix_audio_and_feat(spec1=spec1, video_feat1=video_feat1, video_info_dict=video_info_dict, mode='single')
+        # print("mix spec shape:", mix_spec.shape)
+        # print("mix video feat:", mix_video_feat.shape)
+        data_dict = {}
+        data_dict['mix_spec'] = mix_spec  # (80, 512)
+        data_dict['mix_video_feat'] = mix_video_feat  # (32, 512)
+        data_dict['mix_info_dict'] = mix_info
+        return data_dict
+class audio_video_spec_fullset_Audioset_Train(audio_video_spec_fullset_Audioset_Dataset):
+    def __init__(self, dataset_cfg):
+        super().__init__(split='train', **dataset_cfg)
+class audio_video_spec_fullset_Audioset_Valid(audio_video_spec_fullset_Audioset_Dataset):
+    def __init__(self, dataset_cfg):
+        super().__init__(split='valid', **dataset_cfg)
+class audio_video_spec_fullset_Audioset_Test(audio_video_spec_fullset_Audioset_Dataset):
+    def __init__(self, dataset_cfg):
+        super().__init__(split='test', **dataset_cfg)
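
The dataset classes above keep three timelines aligned: audio samples, spectrogram columns (one per `hop_len` samples), and video-feature rows (`fps` per second). The sketch below restates the index arithmetic from the `single` branch of `mix_audio_and_feat` as a standalone function. The name `clip_bounds` is hypothetical, and the values in the usage lines (`sr=16000`, `fps=4`, `truncate=131072`) are illustrative assumptions chosen to be consistent with the `(80, 512)` and `(32, 512)` shape comments; they are not the defaults in the constructor signatures.

```python
def clip_bounds(start_idx, truncate, sr=22050, fps=21.5, hop_len=256):
    """Map an audio-sample window to spectrogram columns and video-feature rows.

    start_idx and truncate are measured in audio samples, as in the dataset code.
    Returns ((spec_start, spec_end), (frame_start, frame_end)).
    """
    # One spectrogram column covers hop_len audio samples.
    spec_start = int(start_idx / hop_len)
    spec_len = int(truncate / hop_len)
    # One video-feature row covers sr / fps audio samples.
    frame_start = int(fps * start_idx / sr)
    frame_len = int(fps * truncate / sr)
    return (spec_start, spec_start + spec_len), (frame_start, frame_start + frame_len)


# Illustrative (assumed) settings: 16 kHz audio, 4 feature rows per second.
print(clip_bounds(0, 131072, sr=16000, fps=4, hop_len=256))      # ((0, 512), (0, 32))
print(clip_bounds(16000, 131072, sr=16000, fps=4, hop_len=256))  # ((62, 574), (4, 36))
```

Because both conversions truncate with `int(...)`, the spectrogram and video windows can drift by up to one column or row relative to the exact sample position.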

ldm/lr_scheduler.py ADDED

	@@ -0,0 +1,98 @@

+import numpy as np
+class LambdaWarmUpCosineScheduler:
+    """
+    note: use with a base_lr of 1.0
+    """
+    def __init__(self, warm_up_steps, lr_min, lr_max, lr_start, max_decay_steps, verbosity_interval=0):
+        self.lr_warm_up_steps = warm_up_steps
+        self.lr_start = lr_start
+        self.lr_min = lr_min
+        self.lr_max = lr_max
+        self.lr_max_decay_steps = max_decay_steps
+        self.last_lr = 0.
+        self.verbosity_interval = verbosity_interval
+    def schedule(self, n, **kwargs):
+        if self.verbosity_interval > 0:
+            if n % self.verbosity_interval == 0: print(f"current step: {n}, recent lr-multiplier: {self.last_lr}")
+        if n < self.lr_warm_up_steps:
+            lr = (self.lr_max - self.lr_start) / self.lr_warm_up_steps * n + self.lr_start
+            self.last_lr = lr
+            return lr
+        else:
+            t = (n - self.lr_warm_up_steps) / (self.lr_max_decay_steps - self.lr_warm_up_steps)
+            t = min(t, 1.0)
+            lr = self.lr_min + 0.5 * (self.lr_max - self.lr_min) * (
+                    1 + np.cos(t * np.pi))
+            self.last_lr = lr
+            return lr
+    def __call__(self, n, **kwargs):
+        return self.schedule(n,**kwargs)
+class LambdaWarmUpCosineScheduler2:
+    """
+    supports repeated iterations, configurable via lists
+    note: use with a base_lr of 1.0.
+    """
+    def __init__(self, warm_up_steps, f_min, f_max, f_start, cycle_lengths, verbosity_interval=0):
+        assert len(warm_up_steps) == len(f_min) == len(f_max) == len(f_start) == len(cycle_lengths)
+        self.lr_warm_up_steps = warm_up_steps
+        self.f_start = f_start
+        self.f_min = f_min
+        self.f_max = f_max
+        self.cycle_lengths = cycle_lengths
+        self.cum_cycles = np.cumsum([0] + list(self.cycle_lengths))
+        self.last_f = 0.
+        self.verbosity_interval = verbosity_interval
+    def find_in_interval(self, n):
+        interval = 0
+        for cl in self.cum_cycles[1:]:
+            if n <= cl:
+                return interval
+            interval += 1
+    def schedule(self, n, **kwargs):
+        cycle = self.find_in_interval(n)
+        n = n - self.cum_cycles[cycle]
+        if self.verbosity_interval > 0:
+            if n % self.verbosity_interval == 0: print(f"current step: {n}, recent lr-multiplier: {self.last_f}, "
+                                                       f"current cycle {cycle}")
+        if n < self.lr_warm_up_steps[cycle]:
+            f = (self.f_max[cycle] - self.f_start[cycle]) / self.lr_warm_up_steps[cycle] * n + self.f_start[cycle]
+            self.last_f = f
+            return f
+        else:
+            t = (n - self.lr_warm_up_steps[cycle]) / (self.cycle_lengths[cycle] - self.lr_warm_up_steps[cycle])
+            t = min(t, 1.0)
+            f = self.f_min[cycle] + 0.5 * (self.f_max[cycle] - self.f_min[cycle]) * (
+                    1 + np.cos(t * np.pi))
+            self.last_f = f
+            return f
+    def __call__(self, n, **kwargs):
+        return self.schedule(n, **kwargs)
+class LambdaLinearScheduler(LambdaWarmUpCosineScheduler2):
+    def schedule(self, n, **kwargs):
+        cycle = self.find_in_interval(n)
+        n = n - self.cum_cycles[cycle]
+        if self.verbosity_interval > 0:
+            if n % self.verbosity_interval == 0: print(f"current step: {n}, recent lr-multiplier: {self.last_f}, "
+                                                       f"current cycle {cycle}")
+        if n < self.lr_warm_up_steps[cycle]:
+            f = (self.f_max[cycle] - self.f_start[cycle]) / self.lr_warm_up_steps[cycle] * n + self.f_start[cycle]
+            self.last_f = f
+            return f
+        else:
+            f = self.f_min[cycle] + (self.f_max[cycle] - self.f_min[cycle]) * (self.cycle_lengths[cycle] - n) / (self.cycle_lengths[cycle])
+            self.last_f = f
+            return f
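
The schedulers above return a multiplier rather than an absolute learning rate, which is why the docstrings say to use a `base_lr` of 1.0; such a callable is typically passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`. The sketch below re-implements the warm-up and cosine-decay formula of `LambdaWarmUpCosineScheduler.schedule` as a plain function so the shape of the curve can be checked without PyTorch. The function name `warmup_cosine` and the numeric settings are hypothetical, chosen only for illustration.

```python
import math


def warmup_cosine(n, warm_up_steps, lr_start, lr_max, lr_min, max_decay_steps):
    """Learning-rate multiplier at step n: linear warm-up, then cosine decay."""
    if n < warm_up_steps:
        # Linear ramp from lr_start to lr_max over warm_up_steps steps.
        return (lr_max - lr_start) / warm_up_steps * n + lr_start
    # Fraction of the decay phase completed, clamped to 1.0 after max_decay_steps.
    t = min((n - warm_up_steps) / (max_decay_steps - warm_up_steps), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(t * math.pi))


# Assumed settings for illustration only.
cfg = dict(warm_up_steps=100, lr_start=0.0, lr_max=1.0, lr_min=0.1, max_decay_steps=1000)
for step in (0, 50, 100, 550, 1000, 5000):
    print(step, round(warmup_cosine(step, **cfg), 4))
```

With these settings the multiplier rises linearly to `lr_max` at step 100, reaches the midpoint between `lr_max` and `lr_min` halfway through the decay phase, and stays at `lr_min` once `max_decay_steps` has passed.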

ldm/models/__pycache__/autoencoder.cpython-38.pyc ADDED

Binary file (15.5 kB).

ldm/models/__pycache__/autoencoder.cpython-39.pyc ADDED

Binary file (15.5 kB).

ldm/models/__pycache__/autoencoder1d.cpython-38.pyc ADDED

Binary file (13.4 kB).

ldm/models/autoencoder.py ADDED

	@@ -0,0 +1,503 @@

+import os
+import torch
+import pytorch_lightning as pl
+import torch.nn.functional as F
+from contextlib import contextmanager
+from taming.modules.vqvae.quantize import VectorQuantizer2 as VectorQuantizer
+from packaging import version
+import numpy as np
+from ldm.modules.diffusionmodules.model import Encoder, Decoder
+from ldm.modules.distributions.distributions import DiagonalGaussianDistribution
+from torch.optim.lr_scheduler import LambdaLR
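+from ldm.util import instantiate_from_config
+from ldm.modules.ema import LitEma  # assumed module path (as in upstream latent-diffusion); VQModel needs it when use_ema=True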
+class VQModel(pl.LightningModule):
+    def __init__(self,
+                 ddconfig,
+                 lossconfig,
+                 n_embed,
+                 embed_dim,
+                 ckpt_path=None,
+                 ignore_keys=[],
+                 image_key="image",
+                 colorize_nlabels=None,
+                 monitor=None,
+                 batch_resize_range=None,
+                 scheduler_config=None,
+                 lr_g_factor=1.0,
+                 remap=None,
+                 sane_index_shape=False, # tell vector quantizer to return indices as bhw
+                 use_ema=False
+                 ):
+        super().__init__()
+        self.embed_dim = embed_dim
+        self.n_embed = n_embed
+        self.image_key = image_key
+        self.encoder = Encoder(**ddconfig)
+        self.decoder = Decoder(**ddconfig)
+        self.loss = instantiate_from_config(lossconfig)
+        self.quantize = VectorQuantizer(n_embed, embed_dim, beta=0.25,
+                                        remap=remap,
+                                        sane_index_shape=sane_index_shape)
+        self.quant_conv = torch.nn.Conv2d(ddconfig["z_channels"], embed_dim, 1)
+        self.post_quant_conv = torch.nn.Conv2d(embed_dim, ddconfig["z_channels"], 1)
+        if colorize_nlabels is not None:
+            assert type(colorize_nlabels)==int
+            self.register_buffer("colorize", torch.randn(3, colorize_nlabels, 1, 1))
+        if monitor is not None:
+            self.monitor = monitor
+        self.batch_resize_range = batch_resize_range
+        if self.batch_resize_range is not None:
+            print(f"{self.__class__.__name__}: Using per-batch resizing in range {batch_resize_range}.")
+        self.use_ema = use_ema
+        if self.use_ema:
+            self.model_ema = LitEma(self)
+            print(f"Keeping EMAs of {len(list(self.model_ema.buffers()))}.")
+        if ckpt_path is not None:
+            self.init_from_ckpt(ckpt_path, ignore_keys=ignore_keys)
+        self.scheduler_config = scheduler_config
+        self.lr_g_factor = lr_g_factor
+    @contextmanager
+    def ema_scope(self, context=None):
+        if self.use_ema:
+            self.model_ema.store(self.parameters())
+            self.model_ema.copy_to(self)
+            if context is not None:
+                print(f"{context}: Switched to EMA weights")
+        try:
+            yield None
+        finally:
+            if self.use_ema:
+                self.model_ema.restore(self.parameters())
+                if context is not None:
+                    print(f"{context}: Restored training weights")
+    def init_from_ckpt(self, path, ignore_keys=list()):
+        sd = torch.load(path, map_location="cpu")["state_dict"]
+        keys = list(sd.keys())
+        for k in keys:
+            for ik in ignore_keys:
+                if k.startswith(ik):
+                    print("Deleting key {} from state_dict.".format(k))
+                    del sd[k]
+        missing, unexpected = self.load_state_dict(sd, strict=False)
+        print(f"Restored from {path} with {len(missing)} missing and {len(unexpected)} unexpected keys")
+        if len(missing) > 0:
+            print(f"Missing Keys: {missing}")
+        if len(unexpected) > 0:
+            print(f"Unexpected Keys: {unexpected}")
+    def on_train_batch_end(self, *args, **kwargs):
+        if self.use_ema:
+            self.model_ema(self)
+    def encode(self, x):
+        h = self.encoder(x)
+        h = self.quant_conv(h)
+        quant, emb_loss, info = self.quantize(h)
+        return quant, emb_loss, info
+    def encode_to_prequant(self, x):
+        h = self.encoder(x)
+        h = self.quant_conv(h)
+        return h
+    def decode(self, quant):
+        quant = self.post_quant_conv(quant)
+        dec = self.decoder(quant)
+        return dec
+    def decode_code(self, code_b):
+        quant_b = self.quantize.embed_code(code_b)
+        dec = self.decode(quant_b)
+        return dec
+    def forward(self, input, return_pred_indices=False):
+        quant, diff, (_,_,ind) = self.encode(input)
+        dec = self.decode(quant)
+        if return_pred_indices:
+            return dec, diff, ind
+        return dec, diff
+    def get_input(self, batch, k):
+        x = batch[k]
+        if len(x.shape) == 3:
+            x = x[..., None]
+        x = x.permute(0, 3, 1, 2).to(memory_format=torch.contiguous_format).float()
+        if self.batch_resize_range is not None:
+            lower_size = self.batch_resize_range[0]
+            upper_size = self.batch_resize_range[1]
+            if self.global_step <= 4:
+                # do the first few batches with max size to avoid later oom
+                new_resize = upper_size
+            else:
+                new_resize = np.random.choice(np.arange(lower_size, upper_size+16, 16))
+            if new_resize != x.shape[2]:
+                x = F.interpolate(x, size=new_resize, mode="bicubic")
+            x = x.detach()
+        return x
+    def training_step(self, batch, batch_idx, optimizer_idx):
+        # https://github.com/pytorch/pytorch/issues/37142
+        # try not to fool the heuristics
+        x = self.get_input(batch, self.image_key)
+        xrec, qloss, ind = self(x, return_pred_indices=True)
+        if optimizer_idx == 0:
+            # autoencode
+            aeloss, log_dict_ae = self.loss(qloss, x, xrec, optimizer_idx, self.global_step,
+                                            last_layer=self.get_last_layer(), split="train",
+                                            predicted_indices=ind)
+            self.log_dict(log_dict_ae, prog_bar=False, logger=True, on_step=True, on_epoch=True)
+            return aeloss
+        if optimizer_idx == 1:
+            # discriminator
+            discloss, log_dict_disc = self.loss(qloss, x, xrec, optimizer_idx, self.global_step,
+                                            last_layer=self.get_last_layer(), split="train")
+            self.log_dict(log_dict_disc, prog_bar=False, logger=True, on_step=True, on_epoch=True)
+            return discloss
+    def validation_step(self, batch, batch_idx):
+        log_dict = self._validation_step(batch, batch_idx)
+        with self.ema_scope():
+            log_dict_ema = self._validation_step(batch, batch_idx, suffix="_ema")
+        return log_dict
+    def _validation_step(self, batch, batch_idx, suffix=""):
+        x = self.get_input(batch, self.image_key)
+        xrec, qloss, ind = self(x, return_pred_indices=True)
+        aeloss, log_dict_ae = self.loss(qloss, x, xrec, 0,
+                                        self.global_step,
+                                        last_layer=self.get_last_layer(),
+                                        split="val"+suffix,
+                                        predicted_indices=ind
+                                        )
+        discloss, log_dict_disc = self.loss(qloss, x, xrec, 1,
+                                            self.global_step,
+                                            last_layer=self.get_last_layer(),
+                                            split="val"+suffix,
+                                            predicted_indices=ind
+                                            )
+        rec_loss = log_dict_ae[f"val{suffix}/rec_loss"]
+        self.log(f"val{suffix}/rec_loss", rec_loss,
+                   prog_bar=True, logger=True, on_step=False, on_epoch=True, sync_dist=True)
+        self.log(f"val{suffix}/aeloss", aeloss,
+                   prog_bar=True, logger=True, on_step=False, on_epoch=True, sync_dist=True)
+        if version.parse(pl.__version__) >= version.parse('1.4.0'):
+            del log_dict_ae[f"val{suffix}/rec_loss"]
+        self.log_dict(log_dict_ae)
+        self.log_dict(log_dict_disc)
+        return self.log_dict
+    def test_step(self, batch, batch_idx):
+        x = self.get_input(batch, self.image_key)
+        xrec, qloss, ind = self(x, return_pred_indices=True)
+        reconstructions = (xrec + 1)/2 # to mel scale
+        test_ckpt_path = os.path.basename(self.trainer.tested_ckpt_path)
+        savedir = os.path.join(self.trainer.log_dir,f'output_imgs_{test_ckpt_path}','fake_class')
+        if not os.path.exists(savedir):
+            os.makedirs(savedir)
+        file_names = batch['f_name']
+        # print(f"reconstructions.shape:{reconstructions.shape}",file_names)
+        reconstructions = reconstructions.cpu().numpy().squeeze(1) # squeeze channel dim
+        for b in range(reconstructions.shape[0]):
+            vname_num_split_index = file_names[b].rfind('_')# file_names[b]:video_name+'_'+num
+            v_n,num = file_names[b][:vname_num_split_index],file_names[b][vname_num_split_index+1:]
+            save_img_path = os.path.join(savedir,f'{v_n}_sample_{num}.npy')
+            np.save(save_img_path,reconstructions[b])
+        return None
+    def configure_optimizers(self):
+        lr_d = self.learning_rate
+        lr_g = self.lr_g_factor*self.learning_rate
+        print("lr_d", lr_d)
+        print("lr_g", lr_g)
+        opt_ae = torch.optim.Adam(list(self.encoder.parameters())+
+                                  list(self.decoder.parameters())+
+                                  list(self.quantize.parameters())+
+                                  list(self.quant_conv.parameters())+
+                                  list(self.post_quant_conv.parameters()),
+                                  lr=lr_g, betas=(0.5, 0.9))
+        opt_disc = torch.optim.Adam(self.loss.discriminator.parameters(),
+                                    lr=lr_d, betas=(0.5, 0.9))
+        if self.scheduler_config is not None:
+            scheduler = instantiate_from_config(self.scheduler_config)
+            print("Setting up LambdaLR scheduler...")
+            scheduler = [
+                {
+                    'scheduler': LambdaLR(opt_ae, lr_lambda=scheduler.schedule),
+                    'interval': 'step',
+                    'frequency': 1
+                },
+                {
+                    'scheduler': LambdaLR(opt_disc, lr_lambda=scheduler.schedule),
+                    'interval': 'step',
+                    'frequency': 1
+                },
+            ]
+            return [opt_ae, opt_disc], scheduler
+        return [opt_ae, opt_disc], []
+    def get_last_layer(self):
+        return self.decoder.conv_out.weight
+    def log_images(self, batch, only_inputs=False, plot_ema=False, **kwargs):
+        log = dict()
+        x = self.get_input(batch, self.image_key)
+        x = x.to(self.device)
+        if only_inputs:
+            log["inputs"] = x
+            return log
+        xrec, _ = self(x)
+        if x.shape[1] > 3:
+            # colorize with random projection
+            assert xrec.shape[1] > 3
+            x = self.to_rgb(x)
+            xrec = self.to_rgb(xrec)
+        log["inputs"] = x
+        log["reconstructions"] = xrec
+        if plot_ema:
+            with self.ema_scope():
+                xrec_ema, _ = self(x)
+                if x.shape[1] > 3: xrec_ema = self.to_rgb(xrec_ema)
+                log["reconstructions_ema"] = xrec_ema
+        return log
+    def to_rgb(self, x):
+        assert self.image_key == "segmentation"
+        if not hasattr(self, "colorize"):
+            self.register_buffer("colorize", torch.randn(3, x.shape[1], 1, 1).to(x))
+        x = F.conv2d(x, weight=self.colorize)
+        x = 2.*(x-x.min())/(x.max()-x.min()) - 1.
+        return x
+class VQModelInterface(VQModel):
+    def __init__(self, embed_dim, *args, **kwargs):
+        super().__init__(embed_dim=embed_dim, *args, **kwargs)
+        self.embed_dim = embed_dim
+    def encode(self, x):  # VQModel quantizes inside encode(); VQModelInterface defers quantization to decode()
+        h = self.encoder(x)
+        h = self.quant_conv(h)
+        return h
+    def decode(self, h, force_not_quantize=False):
+        # also go through quantization layer
+        if not force_not_quantize:
+            quant, emb_loss, info = self.quantize(h)
+        else:
+            quant = h
+        quant = self.post_quant_conv(quant)
+        dec = self.decoder(quant)
+        return dec
+class AutoencoderKL(pl.LightningModule):
+    def __init__(self,
+                 ddconfig,
+                 lossconfig,
+                 embed_dim,
+                 ckpt_path=None,
+                 ignore_keys=[],
+                 image_key="image",
+                 colorize_nlabels=None,
+                 monitor=None,
+                 ):
+        super().__init__()
+        self.to_1d = False
+        print(f"to_1d is {self.to_1d} in AUTOENCODER")
+        self.image_key = image_key
+        self.encoder = Encoder(**ddconfig)
+        self.decoder = Decoder(**ddconfig)
+        self.loss = instantiate_from_config(lossconfig)
+        assert ddconfig["double_z"]
+        self.quant_conv = torch.nn.Conv2d(2*ddconfig["z_channels"], 2*embed_dim, 1)
+        self.post_quant_conv = torch.nn.Conv2d(embed_dim, ddconfig["z_channels"], 1)
+        self.embed_dim = embed_dim
+        if colorize_nlabels is not None:
+            assert type(colorize_nlabels)==int
+            self.register_buffer("colorize", torch.randn(3, colorize_nlabels, 1, 1))
+        if monitor is not None:
+            self.monitor = monitor
+        if ckpt_path is not None:
+            self.init_from_ckpt(ckpt_path, ignore_keys=ignore_keys)
+        # self.automatic_optimization = False # hjw for debug
+    def init_from_ckpt(self, path, ignore_keys=list()):
+        sd = torch.load(path, map_location="cpu")["state_dict"]
+        keys = list(sd.keys())
+        for k in keys:
+            for ik in ignore_keys:
+                if k.startswith(ik):
+                    print("Deleting key {} from state_dict.".format(k))
+                    del sd[k]
+        self.load_state_dict(sd, strict=False)
+        print(f"Restored from {path}")
+    def encode(self, x):
+        if self.to_1d and len(x.shape)==3:
+            x = x.unsqueeze(1)
+        h = self.encoder(x)
+        moments = self.quant_conv(h)
+        if self.to_1d:
+            b,c,h,w = moments.shape
+            moments = moments.reshape(b,c*h,w)
+        posterior = DiagonalGaussianDistribution(moments)
+        return posterior
+    def decode(self, z):
+        if self.to_1d:
+            b,c_h,w = z.shape
+            c = self.post_quant_conv.in_channels
+            z = z.reshape(b,c,-1,w)
+        z = self.post_quant_conv(z)
+        dec = self.decoder(z)
+        return dec
+    def forward(self, input, sample_posterior=True):
+        posterior = self.encode(input)
+        if sample_posterior:
+            z = posterior.sample()
+        else:
+            z = posterior.mode()
+        dec = self.decode(z)
+        return dec, posterior
+    def get_input(self, batch, k):
+        x = batch[k]
+        if len(x.shape) == 3:
+            x = x[..., None]
+        x = x.permute(0, 3, 1, 2).to(memory_format=torch.contiguous_format).float()
+        return x
+    def training_step(self, batch, batch_idx, optimizer_idx):
+        inputs = self.get_input(batch, self.image_key)
+        reconstructions, posterior = self(inputs)
+        if optimizer_idx == 0:
+            # train encoder+decoder+logvar
+            aeloss, log_dict_ae = self.loss(inputs, reconstructions, posterior, optimizer_idx, self.global_step,
+                                            last_layer=self.get_last_layer(), split="train")
+            self.log("aeloss", aeloss, prog_bar=True, logger=True, on_step=True, on_epoch=True)
+            self.log_dict(log_dict_ae, prog_bar=False, logger=True, on_step=True, on_epoch=False)
+            # print(optimizer_idx,log_dict_ae)
+            return aeloss
+        if optimizer_idx == 1:
+            # train the discriminator
+            discloss, log_dict_disc = self.loss(inputs, reconstructions, posterior, optimizer_idx, self.global_step,
+                                                last_layer=self.get_last_layer(), split="train")
+            self.log("discloss", discloss, prog_bar=True, logger=True, on_step=True, on_epoch=True)
+            self.log_dict(log_dict_disc, prog_bar=False, logger=True, on_step=True, on_epoch=False)
+            # print(optimizer_idx,log_dict_disc)
+            return discloss
+    def validation_step(self, batch, batch_idx):
+        inputs = self.get_input(batch, self.image_key)
+        reconstructions, posterior = self(inputs)
+        aeloss, log_dict_ae = self.loss(inputs, reconstructions, posterior, 0, self.global_step,
+                                        last_layer=self.get_last_layer(), split="val")
+        discloss, log_dict_disc = self.loss(inputs, reconstructions, posterior, 1, self.global_step,
+                                            last_layer=self.get_last_layer(), split="val")
+        self.log("val/rec_loss", log_dict_ae["val/rec_loss"])
+        self.log_dict(log_dict_ae)
+        self.log_dict(log_dict_disc)
+        return self.log_dict
+    def test_step(self, batch, batch_idx):
+        inputs = self.get_input(batch, self.image_key)# inputs shape:(b,mel_len,T)
+        reconstructions, posterior = self(inputs)# reconstructions:(b,mel_len,T)
+        mse_loss = torch.nn.functional.mse_loss(reconstructions,inputs)
+        self.log('test/mse_loss',mse_loss)
+        test_ckpt_path = os.path.basename(self.trainer.tested_ckpt_path)
+        savedir = os.path.join(self.trainer.log_dir,f'output_imgs_{test_ckpt_path}','fake_class')
+        if batch_idx == 0:
+            print(f"save_path is: {savedir}")
+        if not os.path.exists(savedir):
+            os.makedirs(savedir)
+            print(f"save_path is: {savedir}")
+        file_names = batch['f_name']
+        # print(f"reconstructions.shape:{reconstructions.shape}",file_names)
+        # reconstructions = (reconstructions + 1)/2 # to mel scale
+        reconstructions = reconstructions.cpu().numpy().squeeze(1) # squeeze channel dim
+        for b in range(reconstructions.shape[0]):
+            vname_num_split_index = file_names[b].rfind('_')# file_names[b]:video_name+'_'+num
+            v_n,num = file_names[b][:vname_num_split_index],file_names[b][vname_num_split_index+1:]
+            save_img_path = os.path.join(savedir, f'{v_n}.npy') # f'{v_n}_sample_{num}.npy'   f'{v_n}.npy'
+            np.save(save_img_path,reconstructions[b])
+        return None
+    def configure_optimizers(self):
+        lr = self.learning_rate
+        opt_ae = torch.optim.Adam(list(self.encoder.parameters())+
+                                  list(self.decoder.parameters())+
+                                  list(self.quant_conv.parameters())+
+                                  list(self.post_quant_conv.parameters()),
+                                  lr=lr, betas=(0.5, 0.9))
+        opt_disc = torch.optim.Adam(self.loss.discriminator.parameters(),
+                                    lr=lr, betas=(0.5, 0.9))
+        return [opt_ae, opt_disc], []
+    def get_last_layer(self):
+        return self.decoder.conv_out.weight
+    @torch.no_grad()
+    def log_images(self, batch, only_inputs=False,save_dir = 'mel_result_ae13_26_debug/fake_class', **kwargs): # called from on_validation_batch_end in main.py
+        log = dict()
+        x = self.get_input(batch, self.image_key)
+        x = x.to(self.device)
+        if not only_inputs:
+            xrec, posterior = self(x)
+            if x.shape[1] > 3:
+                # colorize with random projection
+                assert xrec.shape[1] > 3
+                x = self.to_rgb(x)
+                xrec = self.to_rgb(xrec)
+            log["samples"] = self.decode(torch.randn_like(posterior.sample()))
+            log["reconstructions"] = xrec
+        log["inputs"] = x
+        return log
+    def to_rgb(self, x):
+        assert self.image_key == "segmentation"
+        if not hasattr(self, "colorize"):
+            self.register_buffer("colorize", torch.randn(3, x.shape[1], 1, 1).to(x))
+        x = F.conv2d(x, weight=self.colorize)
+        x = 2.*(x-x.min())/(x.max()-x.min()) - 1.
+        return x
+class IdentityFirstStage(torch.nn.Module):
+    def __init__(self, *args, vq_interface=False, **kwargs):
+        super().__init__()
+        self.vq_interface = vq_interface  # TODO: Should be true by default but check to not break older stuff
+    def encode(self, x, *args, **kwargs):
+        return x
+    def decode(self, x, *args, **kwargs):
+        return x
+    def quantize(self, x, *args, **kwargs):
+        if self.vq_interface:
+            return x, None, [None, None, None]
+        return x
+    def forward(self, x, *args, **kwargs):
+        return x
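
Both `test_step` methods in this file recover the clip name by splitting `f_name` at its last underscore, on the assumption (stated in the inline comment) that names look like `video_name + '_' + num`. The sketch below isolates that step; the helper name `split_clip_name` is hypothetical and does not exist in the repository.

```python
def split_clip_name(f_name):
    """Split 'video_name_num' at the last underscore, mirroring test_step."""
    idx = f_name.rfind('_')  # -1 when the name contains no underscore
    return f_name[:idx], f_name[idx + 1:]


# A well-formed name splits into (video_name, num).
print(split_clip_name("dog_bark_3"))

# A name without an underscore loses its last character in the first part,
# because rfind returns -1 and f_name[:-1] drops one character.
print(split_clip_name("clip"))
```

Two things follow from this behavior. Names without an underscore are silently truncated, so a guard on `idx == -1` may be worth adding. Also, `AutoencoderKL.test_step` saves to `f'{v_n}.npy'` without the `num` suffix, so several segments of the same clip would appear to overwrite one another.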

ldm/models/autoencoder1d.py ADDED Viewed

	@@ -0,0 +1,517 @@

+"""
+Difference from autoencoder.py: autoencoder.py maps (B,1,80,T) -> (B,C,80/8,T/8), whereas this VAE maps (B,80,T) -> (B,80/downsample_c,T/downsample_t).
+"""
+import os
+import torch
+import torch.nn as nn
+import pytorch_lightning as pl
+import torch.nn.functional as F
+from contextlib import contextmanager
+from packaging import version
+import numpy as np
+from ldm.modules.distributions.distributions import DiagonalGaussianDistribution
+from torch.optim.lr_scheduler import LambdaLR
+from ldm.util import instantiate_from_config
+class AutoencoderKL(pl.LightningModule):
+    def __init__(self,
+                 embed_dim,
+                 ddconfig,
+                 lossconfig,
+                 ckpt_path=None,
+                 ignore_keys=[],
+                 image_key="image",
+                 monitor=None,
+                 ):
+        super().__init__()
+        self.image_key = image_key
+        self.encoder = Encoder1D(**ddconfig)
+        self.decoder = Decoder1D(**ddconfig)
+        self.loss = instantiate_from_config(lossconfig)
+        assert ddconfig["double_z"]
+        self.quant_conv = torch.nn.Conv1d(2*ddconfig["z_channels"], 2*embed_dim, 1)
+        self.post_quant_conv = torch.nn.Conv1d(embed_dim, ddconfig["z_channels"], 1)
+        self.embed_dim = embed_dim
+        if monitor is not None:
+            self.monitor = monitor
+        if ckpt_path is not None:
+            self.init_from_ckpt(ckpt_path, ignore_keys=ignore_keys)
+    def init_from_ckpt(self, path, ignore_keys=list()):
+        sd = torch.load(path, map_location="cpu")["state_dict"]
+        keys = list(sd.keys())
+        for k in keys:
+            for ik in ignore_keys:
+                if k.startswith(ik):
+                    print("Deleting key {} from state_dict.".format(k))
+                    del sd[k]
+        self.load_state_dict(sd, strict=False)
+        print(f"AutoencoderKL Restored from {path} Done")
+    def encode(self, x):
+        h = self.encoder(x)
+        moments = self.quant_conv(h)
+        posterior = DiagonalGaussianDistribution(moments)
+        return posterior
+    def decode(self, z):
+        z = self.post_quant_conv(z)
+        dec = self.decoder(z)
+        return dec
+    def forward(self, input, sample_posterior=True):
+        posterior = self.encode(input)
+        if sample_posterior:
+            z = posterior.sample()
+        else:
+            z = posterior.mode()
+        dec = self.decode(z)
+        return dec, posterior
+    def get_input(self, batch, k):
+        x = batch[k]
+        assert len(x.shape) == 3
+        x = x.to(memory_format=torch.contiguous_format).float()
+        return x
+    def training_step(self, batch, batch_idx, optimizer_idx):
+        inputs = self.get_input(batch, self.image_key)
+        # print(inputs.shape)
+        reconstructions, posterior = self(inputs)
+        if optimizer_idx == 0:
+            # train encoder+decoder+logvar
+            aeloss, log_dict_ae = self.loss(inputs, reconstructions, posterior, optimizer_idx, self.global_step,
+                                            last_layer=self.get_last_layer(), split="train")
+            self.log("aeloss", aeloss, prog_bar=True, logger=True, on_step=True, on_epoch=True)
+            self.log_dict(log_dict_ae, prog_bar=False, logger=True, on_step=True, on_epoch=False)
+            return aeloss
+        if optimizer_idx == 1:
+            # train the discriminator
+            discloss, log_dict_disc = self.loss(inputs, reconstructions, posterior, optimizer_idx, self.global_step,
+                                                last_layer=self.get_last_layer(), split="train")
+            self.log("discloss", discloss, prog_bar=True, logger=True, on_step=True, on_epoch=True)
+            self.log_dict(log_dict_disc, prog_bar=False, logger=True, on_step=True, on_epoch=False)
+            return discloss
+    def validation_step(self, batch, batch_idx):
+        inputs = self.get_input(batch, self.image_key)
+        reconstructions, posterior = self(inputs)
+        aeloss, log_dict_ae = self.loss(inputs, reconstructions, posterior, 0, self.global_step,
+                                        last_layer=self.get_last_layer(), split="val")
+        discloss, log_dict_disc = self.loss(inputs, reconstructions, posterior, 1, self.global_step,
+                                            last_layer=self.get_last_layer(), split="val")
+        self.log("val/rec_loss", log_dict_ae["val/rec_loss"])
+        self.log_dict(log_dict_ae)
+        self.log_dict(log_dict_disc)
+        return self.log_dict
+    def test_step(self, batch, batch_idx):
+        inputs = self.get_input(batch, self.image_key)# inputs shape:(b,mel_len,T)
+        reconstructions, posterior = self(inputs)# reconstructions:(b,mel_len,T)
+        mse_loss = torch.nn.functional.mse_loss(reconstructions,inputs)
+        self.log('test/mse_loss',mse_loss)
+        test_ckpt_path = os.path.basename(self.trainer.tested_ckpt_path)
+        savedir = os.path.join(self.trainer.log_dir,f'output_imgs_{test_ckpt_path}','fake_class')
+        if batch_idx == 0:
+            print(f"save_path is: {savedir}")
+        if not os.path.exists(savedir):
+            os.makedirs(savedir)
+            print(f"save_path is: {savedir}")
+        file_names = batch['f_name']
+        # print(f"reconstructions.shape:{reconstructions.shape}",file_names)
+        # reconstructions = (reconstructions + 1)/2 # to mel scale
+        reconstructions = reconstructions.cpu().numpy() # already (b,mel_len,T); no channel dim to squeeze
+        for b in range(reconstructions.shape[0]):
+            vname_num_split_index = file_names[b].rfind('_')# file_names[b]:video_name+'_'+num
+            v_n,num = file_names[b][:vname_num_split_index],file_names[b][vname_num_split_index+1:]
+            save_img_path = os.path.join(savedir, f'{v_n}.npy') # f'{v_n}_sample_{num}.npy'   f'{v_n}.npy'
+            np.save(save_img_path,reconstructions[b])
+        return None
+    def configure_optimizers(self):
+        lr = self.learning_rate
+        opt_ae = torch.optim.Adam(list(self.encoder.parameters())+
+                                  list(self.decoder.parameters())+
+                                  list(self.quant_conv.parameters())+
+                                  list(self.post_quant_conv.parameters()),
+                                  lr=lr, betas=(0.5, 0.9))
+        opt_disc = torch.optim.Adam(self.loss.discriminator.parameters(),
+                                    lr=lr, betas=(0.5, 0.9))
+        return [opt_ae, opt_disc], []
+    def get_last_layer(self):
+        return self.decoder.conv_out.weight
+    @torch.no_grad()
+    def log_images(self, batch, only_inputs=False, **kwargs):
+        log = dict()
+        x = self.get_input(batch, self.image_key)
+        x = x.to(self.device)
+        if not only_inputs:
+            xrec, posterior = self(x)
+            log["samples"] = self.decode(torch.randn_like(posterior.sample())).unsqueeze(1) # (b,1,H,W)
+            log["reconstructions"] = xrec.unsqueeze(1)
+        log["inputs"] = x.unsqueeze(1)
+        return log
+def Normalize(in_channels, num_groups=32):
+    return torch.nn.GroupNorm(num_groups=num_groups, num_channels=in_channels, eps=1e-6, affine=True)
+def nonlinearity(x):
+    # swish
+    return x*torch.sigmoid(x)
+class ResnetBlock1D(nn.Module):
+    def __init__(self, *, in_channels, out_channels=None, conv_shortcut=False,
+                 dropout, temb_channels=512,kernel_size = 3):
+        super().__init__()
+        self.in_channels = in_channels
+        out_channels = in_channels if out_channels is None else out_channels
+        self.out_channels = out_channels
+        self.use_conv_shortcut = conv_shortcut
+        self.norm1 = Normalize(in_channels)
+        self.conv1 = torch.nn.Conv1d(in_channels,
+                                     out_channels,
+                                     kernel_size=kernel_size,
+                                     stride=1,
+                                     padding=kernel_size//2)
+        if temb_channels > 0:
+            self.temb_proj = torch.nn.Linear(temb_channels,
+                                             out_channels)
+        self.norm2 = Normalize(out_channels)
+        self.dropout = torch.nn.Dropout(dropout)
+        self.conv2 = torch.nn.Conv1d(out_channels,
+                                     out_channels,
+                                     kernel_size=kernel_size,
+                                     stride=1,
+                                     padding=kernel_size//2)
+        if self.in_channels != self.out_channels:
+            if self.use_conv_shortcut:
+                self.conv_shortcut = torch.nn.Conv1d(in_channels,
+                                                     out_channels,
+                                                     kernel_size=kernel_size,
+                                                     stride=1,
+                                                     padding=kernel_size//2)
+            else:
+                self.nin_shortcut = torch.nn.Conv1d(in_channels,
+                                                    out_channels,
+                                                    kernel_size=1,
+                                                    stride=1,
+                                                    padding=0)
+    def forward(self, x, temb):
+        h = x
+        h = self.norm1(h)
+        h = nonlinearity(h)
+        h = self.conv1(h)
+        if temb is not None:
+            h = h + self.temb_proj(nonlinearity(temb))[:,:,None]  # (b,c) -> (b,c,1) to broadcast over t
+        h = self.norm2(h)
+        h = nonlinearity(h)
+        h = self.dropout(h)
+        h = self.conv2(h)
+        if self.in_channels != self.out_channels:
+            if self.use_conv_shortcut:
+                x = self.conv_shortcut(x)
+            else:
+                x = self.nin_shortcut(x)
+        return x+h
+class AttnBlock1D(nn.Module):
+    def __init__(self, in_channels):
+        super().__init__()
+        self.in_channels = in_channels
+        self.norm = Normalize(in_channels)
+        self.q = torch.nn.Conv1d(in_channels,
+                                 in_channels,
+                                 kernel_size=1)
+        self.k = torch.nn.Conv1d(in_channels,
+                                 in_channels,
+                                 kernel_size=1)
+        self.v = torch.nn.Conv1d(in_channels,
+                                 in_channels,
+                                 kernel_size=1)
+        self.proj_out = torch.nn.Conv1d(in_channels,
+                                        in_channels,
+                                        kernel_size=1)
+    def forward(self, x):
+        h_ = x
+        h_ = self.norm(h_)
+        q = self.q(h_)
+        k = self.k(h_)
+        v = self.v(h_)
+        # compute attention
+        b,c,t = q.shape         # Conv1d output is (batch, channels, time)
+        q = q.permute(0,2,1)   # b,t,c
+        w_ = torch.bmm(q,k)     # b,t,t   w[b,i,j]=sum_c q[b,i,c]k[b,c,j]
+        # if still 2d attn (q:b,hw,c ,k:b,c,hw -> w_:b,hw,hw)
+        w_ = w_ * (int(c)**(-0.5))  # scale by the channel dimension
+        w_ = torch.nn.functional.softmax(w_, dim=2)
+        # attend to values
+        w_ = w_.permute(0,2,1)   # b,t,t (first t of k, second of q)
+        h_ = torch.bmm(v,w_)     # b,c,t (t of q) h_[b,c,j] = sum_i v[b,c,i] w_[b,i,j]
+        h_ = self.proj_out(h_)
+        return x+h_
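Review note: the two `bmm` calls in `AttnBlock1D.forward` amount to ordinary single-head self-attention over the time axis, scaled by the inverse square root of the channel count. The sketch below restates that index arithmetic in plain Python for one batch item, assuming channels-first inputs; `attention_1d` is a hypothetical helper written for illustration and is not part of this commit.

```python
import math

def attention_1d(q, k, v):
    """Single-head attention over time; q, k, v are nested lists of shape [c][t]."""
    c, t = len(q), len(q[0])
    scale = c ** -0.5  # same role as the channel-based scaling in the diff
    out = [[0.0] * t for _ in range(c)]
    for i in range(t):  # i indexes the query position
        # w[i][j] = scale * sum_c q[c][i] * k[c][j]
        scores = [scale * sum(q[ch][i] * k[ch][j] for ch in range(c))
                  for j in range(t)]
        # numerically stable softmax over key positions j (dim=2 in the diff)
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # out[c][i] = sum_j v[c][j] * w[i][j]
        for ch in range(c):
            out[ch][i] = sum(v[ch][j] * weights[j] for j in range(t))
    return out

q = [[1.0, 2.0, 3.0], [0.5, 0.0, -1.0]]
v = [[2.0, 2.0, 2.0], [-1.0, -1.0, -1.0]]  # constant along time
print(attention_1d(q, q, v))
```

Because each row of attention weights sums to 1, a value tensor that is constant along time comes back unchanged, which is a quick sanity check for the index order.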
+class Upsample1D(nn.Module):
+    def __init__(self, in_channels, with_conv):
+        super().__init__()
+        self.with_conv = with_conv
+        if self.with_conv:
+            self.conv = torch.nn.Conv1d(in_channels,
+                                        in_channels,
+                                        kernel_size=3,
+                                        stride=1,
+                                        padding=1)
+    def forward(self, x):
+        x = torch.nn.functional.interpolate(x, scale_factor=2.0, mode="nearest") # support 3D tensor(B,C,T)
+        if self.with_conv:
+            x = self.conv(x)
+        return x
+class Downsample1D(nn.Module):
+    def __init__(self, in_channels, with_conv):
+        super().__init__()
+        self.with_conv = with_conv
+        if self.with_conv:
+            # no asymmetric padding in torch conv, must do it ourselves
+            self.conv = torch.nn.Conv1d(in_channels,
+                                        in_channels,
+                                        kernel_size=3,
+                                        stride=2,
+                                        padding=0)
+    def forward(self, x):
+        if self.with_conv:
+            pad = (0,1)
+            x = torch.nn.functional.pad(x, pad, mode="constant", value=0)
+            x = self.conv(x)
+        else:
+            x = torch.nn.functional.avg_pool1d(x, kernel_size=2, stride=2)
+        return x
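Review note: `Downsample1D` pads a single zero on the right before a stride-2, kernel-3 convolution. A minimal sketch of the resulting length arithmetic follows, assuming the standard convolution output-length formula with dilation 1; `conv1d_out_len` and `downsample_len` are hypothetical helpers, not functions from this commit.

```python
def conv1d_out_len(length, kernel_size, stride, padding_total):
    """Standard output length of a 1D convolution or pooling window (dilation 1)."""
    return (length + padding_total - kernel_size) // stride + 1

def downsample_len(length, with_conv=True):
    """Length after one Downsample1D-style step (illustrative only)."""
    if with_conv:
        # pad=(0, 1): nothing on the left, one zero on the right
        return conv1d_out_len(length, kernel_size=3, stride=2, padding_total=1)
    # avg_pool1d(kernel_size=2, stride=2) branch
    return conv1d_out_len(length, kernel_size=2, stride=2, padding_total=0)

for n in (1024, 1000, 7):
    print(n, downsample_len(n, True), downsample_len(n, False))
```

Under these assumptions both branches give `length // 2` for any input of length 2 or more, so each entry in `down_layers` halves the time axis, consistent with the `2**len(down_layers)` rate printed by the encoder.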
+class Encoder1D(nn.Module):
+    def __init__(self, *, ch, out_ch, ch_mult=(1,2,4,8), num_res_blocks,
+                 attn_layers = [],down_layers = [], dropout=0.0, resamp_with_conv=True, in_channels,
+                 z_channels, double_z=True,kernel_size=3, **ignore_kwargs):
+        """out_ch is only used by the decoder; it is unused here."""
+        super().__init__()
+        self.ch = ch
+        self.temb_ch = 0
+        self.num_layers = len(ch_mult)
+        self.num_res_blocks = num_res_blocks
+        self.in_channels = in_channels
+        print(f"downsample rate is {2**len(down_layers)}")
+        self.down_layers = down_layers
+        self.attn_layers = attn_layers
+        self.conv_in = torch.nn.Conv1d(in_channels,
+                                       self.ch,
+                                       kernel_size=kernel_size,
+                                       stride=1,
+                                       padding=kernel_size//2)
+        in_ch_mult = (1,)+tuple(ch_mult)
+        self.in_ch_mult = in_ch_mult
+        # downsampling
+        self.down = nn.ModuleList()
+        for i_level in range(self.num_layers):
+            block = nn.ModuleList()
+            attn = nn.ModuleList()
+            block_in = ch*in_ch_mult[i_level]
+            block_out = ch*ch_mult[i_level]
+            for i_block in range(self.num_res_blocks):
+                block.append(ResnetBlock1D(in_channels=block_in,
+                                         out_channels=block_out,
+                                         temb_channels=self.temb_ch,
+                                         dropout=dropout,
+                                         kernel_size=kernel_size))
+                block_in = block_out
+                if i_level in attn_layers:
+                    # print(f"add attn in layer:{i_level}")
+                    attn.append(AttnBlock1D(block_in))
+            down = nn.Module()
+            down.block = block
+            down.attn = attn
+            if i_level in down_layers:
+                down.downsample = Downsample1D(block_in, resamp_with_conv)
+            self.down.append(down)
+        # middle
+        self.mid = nn.Module()
+        self.mid.block_1 = ResnetBlock1D(in_channels=block_in,
+                                       out_channels=block_in,
+                                       temb_channels=self.temb_ch,
+                                       dropout=dropout,
+                                       kernel_size=kernel_size)
+        self.mid.attn_1 = AttnBlock1D(block_in)
+        self.mid.block_2 = ResnetBlock1D(in_channels=block_in,
+                                       out_channels=block_in,
+                                       temb_channels=self.temb_ch,
+                                       dropout=dropout,
+                                       kernel_size=kernel_size)
+        # end
+        self.norm_out = Normalize(block_in)# GroupNorm
+        self.conv_out = torch.nn.Conv1d(block_in,
+                                        2*z_channels if double_z else z_channels,
+                                        kernel_size=kernel_size,
+                                        stride=1,
+                                        padding=kernel_size//2)
+    def forward(self, x):
+        # timestep embedding
+        temb = None
+        # downsampling
+        hs = [self.conv_in(x)]
+        for i_level in range(self.num_layers):
+            for i_block in range(self.num_res_blocks):
+                h = self.down[i_level].block[i_block](hs[-1], temb)
+                if len(self.down[i_level].attn) > 0:
+                    h = self.down[i_level].attn[i_block](h)
+                hs.append(h)
+            if i_level in self.down_layers:
+                hs.append(self.down[i_level].downsample(hs[-1]))
+        # middle
+        h = hs[-1]
+        h = self.mid.block_1(h, temb)
+        h = self.mid.attn_1(h)
+        h = self.mid.block_2(h, temb)
+        # end
+        h = self.norm_out(h)
+        h = nonlinearity(h)
+        h = self.conv_out(h)
+        return h
+class Decoder1D(nn.Module):
+    def __init__(self, *, ch, out_ch, ch_mult=(1,2,4,8), num_res_blocks,
+                 attn_layers = [],down_layers = [], dropout=0.0,kernel_size=3, resamp_with_conv=True, in_channels,
+                z_channels, give_pre_end=False, tanh_out=False, **ignorekwargs):
+        super().__init__()
+        self.ch = ch
+        self.temb_ch = 0
+        self.num_layers = len(ch_mult)
+        self.num_res_blocks = num_res_blocks
+        self.in_channels = in_channels
+        self.give_pre_end = give_pre_end
+        self.tanh_out = tanh_out
+        self.down_layers = [i+1 for i in down_layers] # each downlayer add one
+        print(f"upsample rate is {2**len(down_layers)}")
+        # compute in_ch_mult, block_in and curr_res at lowest res
+        in_ch_mult = (1,)+tuple(ch_mult)
+        block_in = ch*ch_mult[self.num_layers-1]
+        # z to block_in
+        self.conv_in = torch.nn.Conv1d(z_channels,
+                                       block_in,
+                                       kernel_size=kernel_size,
+                                       stride=1,
+                                       padding=kernel_size//2)
+        # middle
+        self.mid = nn.Module()
+        self.mid.block_1 = ResnetBlock1D(in_channels=block_in,
+                                       out_channels=block_in,
+                                       temb_channels=self.temb_ch,
+                                       dropout=dropout)
+        self.mid.attn_1 = AttnBlock1D(block_in)
+        self.mid.block_2 = ResnetBlock1D(in_channels=block_in,
+                                       out_channels=block_in,
+                                       temb_channels=self.temb_ch,
+                                       dropout=dropout)
+        # upsampling
+        self.up = nn.ModuleList()
+        for i_level in reversed(range(self.num_layers)):
+            block = nn.ModuleList()
+            attn = nn.ModuleList()
+            block_out = ch*ch_mult[i_level]
+            for i_block in range(self.num_res_blocks+1):
+                block.append(ResnetBlock1D(in_channels=block_in,
+                                         out_channels=block_out,
+                                         temb_channels=self.temb_ch,
+                                         dropout=dropout))
+                block_in = block_out
+                if i_level in attn_layers:
+                    # print(f"add attn in layer:{i_level}")
+                    attn.append(AttnBlock1D(block_in))
+            up = nn.Module()
+            up.block = block
+            up.attn = attn
+            if i_level in self.down_layers:
+                up.upsample = Upsample1D(block_in, resamp_with_conv)
+            self.up.insert(0, up) # prepend to get consistent order
+        # end
+        self.norm_out = Normalize(block_in)
+        self.conv_out = torch.nn.Conv1d(block_in,
+                                        out_ch,
+                                        kernel_size=kernel_size,
+                                        stride=1,
+                                        padding=kernel_size//2)
+    def forward(self, z):
+        #assert z.shape[1:] == self.z_shape[1:]
+        self.last_z_shape = z.shape
+        # timestep embedding
+        temb = None
+        # z to block_in
+        h = self.conv_in(z)
+        # middle
+        h = self.mid.block_1(h, temb)
+        h = self.mid.attn_1(h)
+        h = self.mid.block_2(h, temb)
+        # upsampling
+        for i_level in reversed(range(self.num_layers)):
+            for i_block in range(self.num_res_blocks+1):
+                h = self.up[i_level].block[i_block](h, temb)
+                if len(self.up[i_level].attn) > 0:
+                    h = self.up[i_level].attn[i_block](h)
+            if i_level in self.down_layers:
+                h = self.up[i_level].upsample(h)
+        # end
+        if self.give_pre_end:
+            return h
+        h = self.norm_out(h)
+        h = nonlinearity(h)
+        h = self.conv_out(h)
+        if self.tanh_out:
+            h = torch.tanh(h)
+        return h

ldm/models/diffusion/__init__.py ADDED

File without changes

ldm/models/diffusion/__pycache__/__init__.cpython-38.pyc ADDED

Binary file (177 Bytes).

ldm/models/diffusion/__pycache__/__init__.cpython-39.pyc ADDED

Binary file (177 Bytes).

ldm/models/diffusion/__pycache__/cfm1_audio.cpython-38.pyc ADDED

Binary file (11 kB).

ldm/models/diffusion/__pycache__/cfm1_audio.cpython-39.pyc ADDED

Binary file (11 kB).

ldm/models/diffusion/__pycache__/ddim.cpython-38.pyc ADDED

Binary file (7.62 kB).

ldm/models/diffusion/__pycache__/ddim.cpython-39.pyc ADDED

Binary file (7.56 kB).

ldm/models/diffusion/__pycache__/ddpm.cpython-38.pyc ADDED

Binary file (44.4 kB).

ldm/models/diffusion/__pycache__/ddpm.cpython-39.pyc ADDED

Binary file (44.3 kB).

ldm/models/diffusion/__pycache__/ddpm_audio.cpython-38.pyc ADDED

Binary file (25.9 kB).

ldm/models/diffusion/__pycache__/ddpm_audio.cpython-39.pyc ADDED

Binary file (25.9 kB).

ldm/models/diffusion/__pycache__/plms.cpython-38.pyc ADDED

Binary file (7.38 kB).

ldm/models/diffusion/__pycache__/plms.cpython-39.pyc ADDED

Binary file (7.31 kB).

ldm/models/diffusion/audioldm.py ADDED

	@@ -0,0 +1,818 @@

+import os
+import torch
+import numpy as np
+from tqdm import tqdm
+from audioldm.utils import default, instantiate_from_config, save_wave
+from audioldm.latent_diffusion.ddpm import DDPM
+from audioldm.variational_autoencoder.distributions import DiagonalGaussianDistribution
+from audioldm.latent_diffusion.util import noise_like
+from audioldm.latent_diffusion.ddim import DDIMSampler
+from einops import rearrange  # used by decode_first_stage
+def disabled_train(self, mode=True):
+    """Overwrite model.train with this function to make sure train/eval mode
+    does not change anymore."""
+    return self
+class LatentDiffusion(DDPM):
+    """main class"""
+    def __init__(
+        self,
+        device="cuda",
+        first_stage_config=None,
+        cond_stage_config=None,
+        num_timesteps_cond=None,
+        cond_stage_key="image",
+        cond_stage_trainable=False,
+        concat_mode=True,
+        cond_stage_forward=None,
+        conditioning_key=None,
+        scale_factor=1.0,
+        scale_by_std=False,
+        base_learning_rate=None,
+        *args,
+        **kwargs,
+    ):
+        self.device = device
+        self.learning_rate = base_learning_rate
+        self.num_timesteps_cond = default(num_timesteps_cond, 1)
+        self.scale_by_std = scale_by_std
+        assert self.num_timesteps_cond <= kwargs["timesteps"]
+        # for backwards compatibility after implementation of DiffusionWrapper
+        if conditioning_key is None:
+            conditioning_key = "concat" if concat_mode else "crossattn"
+        if cond_stage_config == "__is_unconditional__":
+            conditioning_key = None
+        ckpt_path = kwargs.pop("ckpt_path", None)
+        ignore_keys = kwargs.pop("ignore_keys", [])
+        super().__init__(conditioning_key=conditioning_key, *args, **kwargs)
+        self.concat_mode = concat_mode
+        self.cond_stage_trainable = cond_stage_trainable
+        self.cond_stage_key = cond_stage_key
+        self.cond_stage_key_orig = cond_stage_key
+        try:
+            self.num_downs = len(first_stage_config.params.ddconfig.ch_mult) - 1
+        except Exception:  # config missing or lacking ddconfig.ch_mult
+            self.num_downs = 0
+        if not scale_by_std:
+            self.scale_factor = scale_factor
+        else:
+            self.register_buffer("scale_factor", torch.tensor(scale_factor))
+        self.instantiate_first_stage(first_stage_config)
+        self.instantiate_cond_stage(cond_stage_config)
+        self.cond_stage_forward = cond_stage_forward
+        self.clip_denoised = False
+    def make_cond_schedule(
+        self,
+    ):
+        self.cond_ids = torch.full(
+            size=(self.num_timesteps,),
+            fill_value=self.num_timesteps - 1,
+            dtype=torch.long,
+        )
+        ids = torch.round(
+            torch.linspace(0, self.num_timesteps - 1, self.num_timesteps_cond)
+        ).long()
+        self.cond_ids[: self.num_timesteps_cond] = ids
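Review note: `make_cond_schedule` fills `cond_ids` with the last timestep index and then overwrites the first `num_timesteps_cond` entries with evenly spaced, rounded indices. The sketch below mirrors that logic with plain lists instead of tensors; `make_cond_ids` is a hypothetical stand-in, and its float arithmetic may round differently from `torch.linspace` in edge cases.

```python
def make_cond_ids(num_timesteps, num_timesteps_cond):
    """List-based analogue of the cond_ids construction (illustrative only)."""
    # Default every entry to the final timestep index.
    cond_ids = [num_timesteps - 1] * num_timesteps
    if num_timesteps_cond == 1:
        ids = [0]  # a one-point linspace yields just the start value
    else:
        step = (num_timesteps - 1) / (num_timesteps_cond - 1)
        ids = [round(i * step) for i in range(num_timesteps_cond)]
    # Overwrite the leading entries with the evenly spaced indices.
    cond_ids[:num_timesteps_cond] = ids
    return cond_ids

print(make_cond_ids(10, 4))  # → [0, 3, 6, 9, 9, 9, 9, 9, 9, 9]
```

In the diff this schedule is only built when `num_timesteps_cond > 1`, as `register_schedule` shows.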
+    def register_schedule(
+        self,
+        given_betas=None,
+        beta_schedule="linear",
+        timesteps=1000,
+        linear_start=1e-4,
+        linear_end=2e-2,
+        cosine_s=8e-3,
+    ):
+        super().register_schedule(
+            given_betas, beta_schedule, timesteps, linear_start, linear_end, cosine_s
+        )
+        self.shorten_cond_schedule = self.num_timesteps_cond > 1
+        if self.shorten_cond_schedule:
+            self.make_cond_schedule()
+    def instantiate_first_stage(self, config):
+        model = instantiate_from_config(config)
+        self.first_stage_model = model.eval()
+        self.first_stage_model.train = disabled_train
+        for param in self.first_stage_model.parameters():
+            param.requires_grad = False
+    def instantiate_cond_stage(self, config):
+        if not self.cond_stage_trainable:
+            if config == "__is_first_stage__":
+                print("Using first stage also as cond stage.")
+                self.cond_stage_model = self.first_stage_model
+            elif config == "__is_unconditional__":
+                print(f"Training {self.__class__.__name__} as an unconditional model.")
+                self.cond_stage_model = None
+                # self.be_unconditional = True
+            else:
+                model = instantiate_from_config(config)
+                self.cond_stage_model = model.eval()
+                self.cond_stage_model.train = disabled_train
+                for param in self.cond_stage_model.parameters():
+                    param.requires_grad = False
+        else:
+            assert config != "__is_first_stage__"
+            assert config != "__is_unconditional__"
+            model = instantiate_from_config(config)
+            self.cond_stage_model = model
+        self.cond_stage_model = self.cond_stage_model.to(self.device)
+    def get_first_stage_encoding(self, encoder_posterior):
+        if isinstance(encoder_posterior, DiagonalGaussianDistribution):
+            z = encoder_posterior.sample()
+        elif isinstance(encoder_posterior, torch.Tensor):
+            z = encoder_posterior
+        else:
+            raise NotImplementedError(
+                f"encoder_posterior of type '{type(encoder_posterior)}' not yet implemented"
+            )
+        return self.scale_factor * z
+    def get_learned_conditioning(self, c):
+        if self.cond_stage_forward is None:
+            if hasattr(self.cond_stage_model, "encode") and callable(
+                self.cond_stage_model.encode
+            ):
+                c = self.cond_stage_model.encode(c)
+                if isinstance(c, DiagonalGaussianDistribution):
+                    c = c.mode()
+            else:
+                # Text input is list
+                if isinstance(c, list) and len(c) == 1:
+                    c = self.cond_stage_model([c[0], c[0]])
+                    c = c[0:1] # (2,1,512) -> (1,1,512)
+                else:
+                    c = self.cond_stage_model(c)
+        else:
+            assert hasattr(self.cond_stage_model, self.cond_stage_forward)
+            c = getattr(self.cond_stage_model, self.cond_stage_forward)(c)
+        return c
+    @torch.no_grad()
+    def get_input(
+        self,
+        batch,
+        k,
+        return_first_stage_encode=True,
+        return_first_stage_outputs=False,
+        force_c_encode=False,
+        cond_key=None,
+        return_original_cond=False,
+        bs=None,
+    ):
+        x = super().get_input(batch, k)# shape(b,1,T=1024,melbins=64)
+        if bs is not None:
+            x = x[:bs]
+        x = x.to(self.device)
+        if return_first_stage_encode:
+            encoder_posterior = self.encode_first_stage(x)
+            z = self.get_first_stage_encoding(encoder_posterior).detach()# z:(b,8,256,16): length compressed 4x, width compressed 4x, channel dim increased 8x, so there is hardly any net compression
+        else:
+            z = None
+        if self.model.conditioning_key is not None:
+            if cond_key is None:
+                cond_key = self.cond_stage_key
+            if cond_key != self.first_stage_key:
+                if cond_key in ["caption", "coordinates_bbox"]:
+                    xc = batch[cond_key]
+                elif cond_key == "class_label":
+                    xc = batch
+                else:
+                    # [bs, 1, 527]
+                    xc = super().get_input(batch, cond_key)
+                    if type(xc) == torch.Tensor:
+                        xc = xc.to(self.device)
+            else:
+                xc = x
+            if not self.cond_stage_trainable or force_c_encode:
+                if isinstance(xc, dict) or isinstance(xc, list):
+                    c = self.get_learned_conditioning(xc)
+                else:
+                    c = self.get_learned_conditioning(xc.to(self.device))
+            else:
+                c = xc
+            if bs is not None:
+                c = c[:bs]
+        else:
+            c = None
+            xc = None
+            if self.use_positional_encodings:
+                pos_x, pos_y = self.compute_latent_shifts(batch)
+                c = {"pos_x": pos_x, "pos_y": pos_y}
+        out = [z, c]# z:(b,8,256,16)
+        if return_first_stage_outputs:
+            xrec = self.decode_first_stage(z)
+            out.extend([x, xrec])
+        if return_original_cond:
+            out.append(xc)
+        return out
+    @torch.no_grad()
+    def decode_first_stage(self, z, predict_cids=False, force_not_quantize=False):
+        if predict_cids:
+            if z.dim() == 4:
+                z = torch.argmax(z.exp(), dim=1).long()
+            z = self.first_stage_model.quantize.get_codebook_entry(z, shape=None)
+            z = rearrange(z, "b h w c -> b c h w").contiguous()
+        z = 1.0 / self.scale_factor * z
+        return self.first_stage_model.decode(z)
+    def mel_spectrogram_to_waveform(self, mel):
+        # Mel: [bs, 1, t-steps, fbins]
+        if len(mel.size()) == 4:
+            mel = mel.squeeze(1)
+        mel = mel.permute(0, 2, 1)
+        waveform = self.first_stage_model.vocoder(mel)
+        waveform = waveform.cpu().detach().numpy()
+        return waveform
+    @torch.no_grad()
+    def encode_first_stage(self, x):
+        return self.first_stage_model.encode(x)
+    def apply_model(self, x_noisy, t, cond, return_ids=False):
+        if isinstance(cond, dict):
+            # hybrid case, cond is expected to be a dict
+            pass
+        else:
+            if not isinstance(cond, list):
+                cond = [cond]
+            if self.model.conditioning_key == "concat":
+                key = "c_concat"
+            elif self.model.conditioning_key == "crossattn":
+                key = "c_crossattn"
+            else:
+                key = "c_film"
+            cond = {key: cond}
+        x_recon = self.model(x_noisy, t, **cond)
+        if isinstance(x_recon, tuple) and not return_ids:
+            return x_recon[0]
+        else:
+            return x_recon
+    def p_mean_variance(
+        self,
+        x,
+        c,
+        t,
+        clip_denoised: bool,
+        return_codebook_ids=False,
+        quantize_denoised=False,
+        return_x0=False,
+        score_corrector=None,
+        corrector_kwargs=None,
+    ):
+        t_in = t
+        model_out = self.apply_model(x, t_in, c, return_ids=return_codebook_ids)
+        if score_corrector is not None:
+            assert self.parameterization == "eps"
+            model_out = score_corrector.modify_score(
+                self, model_out, x, t, c, **corrector_kwargs
+            )
+        if return_codebook_ids:
+            model_out, logits = model_out
+        if self.parameterization == "eps":
+            x_recon = self.predict_start_from_noise(x, t=t, noise=model_out)
+        elif self.parameterization == "x0":
+            x_recon = model_out
+        else:
+            raise NotImplementedError()
+        if clip_denoised:
+            x_recon.clamp_(-1.0, 1.0)
+        if quantize_denoised:
+            x_recon, _, [_, _, indices] = self.first_stage_model.quantize(x_recon)
+        model_mean, posterior_variance, posterior_log_variance = self.q_posterior(
+            x_start=x_recon, x_t=x, t=t
+        )
+        if return_codebook_ids:
+            return model_mean, posterior_variance, posterior_log_variance, logits
+        elif return_x0:
+            return model_mean, posterior_variance, posterior_log_variance, x_recon
+        else:
+            return model_mean, posterior_variance, posterior_log_variance
+    @torch.no_grad()
+    def p_sample(
+        self,
+        x,
+        c,
+        t,
+        clip_denoised=False,
+        repeat_noise=False,
+        return_codebook_ids=False,
+        quantize_denoised=False,
+        return_x0=False,
+        temperature=1.0,
+        noise_dropout=0.0,
+        score_corrector=None,
+        corrector_kwargs=None,
+    ):
+        b, *_, device = *x.shape, x.device
+        outputs = self.p_mean_variance(
+            x=x,
+            c=c,
+            t=t,
+            clip_denoised=clip_denoised,
+            return_codebook_ids=return_codebook_ids,
+            quantize_denoised=quantize_denoised,
+            return_x0=return_x0,
+            score_corrector=score_corrector,
+            corrector_kwargs=corrector_kwargs,
+        )
+        if return_codebook_ids:
+            raise DeprecationWarning("Support dropped.")
+            model_mean, _, model_log_variance, logits = outputs
+        elif return_x0:
+            model_mean, _, model_log_variance, x0 = outputs
+        else:
+            model_mean, _, model_log_variance = outputs
+        noise = noise_like(x.shape, device, repeat_noise) * temperature
+        if noise_dropout > 0.0:
+            noise = torch.nn.functional.dropout(noise, p=noise_dropout)
+        # no noise when t == 0
+        nonzero_mask = (
+            (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1))).contiguous()
+        )
+        if return_codebook_ids:
+            return model_mean + nonzero_mask * (
+                0.5 * model_log_variance
+            ).exp() * noise, logits.argmax(dim=1)
+        if return_x0:
+            return (
+                model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise,
+                x0,
+            )
+        else:
+            return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise
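Review note: every return path of `p_sample` applies the same update, `model_mean + nonzero_mask * exp(0.5 * model_log_variance) * noise`, with the mask switching the noise off at `t == 0`. A scalar sketch of that single step follows, assuming a standard-normal noise draw; `reverse_step` is a hypothetical name and the code operates on floats rather than tensors.

```python
import math
import random

def reverse_step(model_mean, model_log_variance, t, rng, temperature=1.0):
    """One ancestral sampling step on a scalar (illustrative only)."""
    noise = rng.gauss(0.0, 1.0) * temperature
    # No noise is added at the final step t == 0.
    nonzero_mask = 0.0 if t == 0 else 1.0
    # exp(0.5 * log_variance) is the posterior standard deviation.
    return model_mean + nonzero_mask * math.exp(0.5 * model_log_variance) * noise

rng = random.Random(0)
print(reverse_step(0.3, -2.0, t=5, rng=rng))  # mean plus scaled noise
print(reverse_step(0.3, -2.0, t=0, rng=rng))  # exactly the mean
```

Working with the log-variance and exponentiating half of it avoids taking a square root of very small variances near the end of the schedule.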
+    @torch.no_grad()
+    def progressive_denoising(
+        self,
+        cond,
+        shape,
+        verbose=True,
+        callback=None,
+        quantize_denoised=False,
+        img_callback=None,
+        mask=None,
+        x0=None,
+        temperature=1.0,
+        noise_dropout=0.0,
+        score_corrector=None,
+        corrector_kwargs=None,
+        batch_size=None,
+        x_T=None,
+        start_T=None,
+        log_every_t=None,
+    ):
+        if not log_every_t:
+            log_every_t = self.log_every_t
+        timesteps = self.num_timesteps
+        if batch_size is not None:
+            b = batch_size
+            shape = [batch_size] + list(shape)
+        else:
+            b = batch_size = shape[0]
+        if x_T is None:
+            img = torch.randn(shape, device=self.device)
+        else:
+            img = x_T
+        intermediates = []
+        if cond is not None:
+            if isinstance(cond, dict):
+                cond = {
+                    key: cond[key][:batch_size]
+                    if not isinstance(cond[key], list)
+                    else list(map(lambda x: x[:batch_size], cond[key]))
+                    for key in cond
+                }
+            else:
+                cond = (
+                    [c[:batch_size] for c in cond]
+                    if isinstance(cond, list)
+                    else cond[:batch_size]
+                )
+        if start_T is not None:
+            timesteps = min(timesteps, start_T)
+        iterator = (
+            tqdm(
+                reversed(range(0, timesteps)),
+                desc="Progressive Generation",
+                total=timesteps,
+            )
+            if verbose
+            else reversed(range(0, timesteps))
+        )
+        if isinstance(temperature, float):
+            temperature = [temperature] * timesteps
+        for i in iterator:
+            ts = torch.full((b,), i, device=self.device, dtype=torch.long)
+            if self.shorten_cond_schedule:
+                assert self.model.conditioning_key != "hybrid"
+                tc = self.cond_ids[ts].to(cond.device)
+                cond = self.q_sample(x_start=cond, t=tc, noise=torch.randn_like(cond))
+            img, x0_partial = self.p_sample(
+                img,
+                cond,
+                ts,
+                clip_denoised=self.clip_denoised,
+                quantize_denoised=quantize_denoised,
+                return_x0=True,
+                temperature=temperature[i],
+                noise_dropout=noise_dropout,
+                score_corrector=score_corrector,
+                corrector_kwargs=corrector_kwargs,
+            )
+            if mask is not None:
+                assert x0 is not None
+                img_orig = self.q_sample(x0, ts)
+                img = img_orig * mask + (1.0 - mask) * img
+            if i % log_every_t == 0 or i == timesteps - 1:
+                intermediates.append(x0_partial)
+            if callback:
+                callback(i)
+            if img_callback:
+                img_callback(img, i)
+        return img, intermediates
+    @torch.no_grad()
+    def p_sample_loop(
+        self,
+        cond,
+        shape,
+        return_intermediates=False,
+        x_T=None,
+        verbose=True,
+        callback=None,
+        timesteps=None,
+        quantize_denoised=False,
+        mask=None,
+        x0=None,
+        img_callback=None,
+        start_T=None,
+        log_every_t=None,
+    ):
+        if not log_every_t:
+            log_every_t = self.log_every_t
+        device = self.betas.device
+        b = shape[0]
+        if x_T is None:
+            img = torch.randn(shape, device=device)
+        else:
+            img = x_T
+        intermediates = [img]
+        if timesteps is None:
+            timesteps = self.num_timesteps
+        if start_T is not None:
+            timesteps = min(timesteps, start_T)
+        iterator = (
+            tqdm(reversed(range(0, timesteps)), desc="Sampling t", total=timesteps)
+            if verbose
+            else reversed(range(0, timesteps))
+        )
+        if mask is not None:
+            assert x0 is not None
+            assert x0.shape[2:3] == mask.shape[2:3]  # spatial size has to match
+        for i in iterator:
+            ts = torch.full((b,), i, device=device, dtype=torch.long)
+            if self.shorten_cond_schedule:
+                assert self.model.conditioning_key != "hybrid"
+                tc = self.cond_ids[ts].to(cond.device)
+                cond = self.q_sample(x_start=cond, t=tc, noise=torch.randn_like(cond))
+            img = self.p_sample(
+                img,
+                cond,
+                ts,
+                clip_denoised=self.clip_denoised,
+                quantize_denoised=quantize_denoised,
+            )
+            if mask is not None:
+                img_orig = self.q_sample(x0, ts)
+                img = img_orig * mask + (1.0 - mask) * img
+            if i % log_every_t == 0 or i == timesteps - 1:
+                intermediates.append(img)
+            if callback:
+                callback(i)
+            if img_callback:
+                img_callback(img, i)
+        if return_intermediates:
+            return img, intermediates
+        return img
+    @torch.no_grad()
+    def sample(
+        self,
+        cond,
+        batch_size=16,
+        return_intermediates=False,
+        x_T=None,
+        verbose=True,
+        timesteps=None,
+        quantize_denoised=False,
+        mask=None,
+        x0=None,
+        shape=None,
+        **kwargs,
+    ):
+        if shape is None:
+            shape = (batch_size, self.channels, self.latent_t_size, self.latent_f_size)
+        if cond is not None:
+            if isinstance(cond, dict):
+                cond = {
+                    key: cond[key][:batch_size]
+                    if not isinstance(cond[key], list)
+                    else list(map(lambda x: x[:batch_size], cond[key]))
+                    for key in cond
+                }
+            else:
+                cond = (
+                    [c[:batch_size] for c in cond]
+                    if isinstance(cond, list)
+                    else cond[:batch_size]
+                )
+        return self.p_sample_loop(
+            cond,
+            shape,
+            return_intermediates=return_intermediates,
+            x_T=x_T,
+            verbose=verbose,
+            timesteps=timesteps,
+            quantize_denoised=quantize_denoised,
+            mask=mask,
+            x0=x0,
+            **kwargs,
+        )
+    @torch.no_grad()
+    def sample_log(
+        self,
+        cond,
+        batch_size,
+        ddim,
+        ddim_steps,
+        unconditional_guidance_scale=1.0,
+        unconditional_conditioning=None,
+        use_plms=False,
+        mask=None,
+        **kwargs,
+    ):
+        if mask is not None:
+            shape = (self.channels, mask.size()[-2], mask.size()[-1])
+        else:
+            shape = (self.channels, self.latent_t_size, self.latent_f_size)
+        intermediate = None
+        if ddim and not use_plms:
+            # print("Use ddim sampler")
+            ddim_sampler = DDIMSampler(self)
+            samples, intermediates = ddim_sampler.sample(
+                ddim_steps,
+                batch_size,
+                shape,
+                cond,
+                verbose=False,
+                unconditional_guidance_scale=unconditional_guidance_scale,
+                unconditional_conditioning=unconditional_conditioning,
+                mask=mask,
+                **kwargs,
+            )
+        else:
+            # print("Use DDPM sampler")
+            samples, intermediates = self.sample(
+                cond=cond,
+                batch_size=batch_size,
+                return_intermediates=True,
+                unconditional_guidance_scale=unconditional_guidance_scale,
+                mask=mask,
+                unconditional_conditioning=unconditional_conditioning,
+                **kwargs,
+            )
+        return samples, intermediates
+    @torch.no_grad()
+    def generate_sample(
+        self,
+        batchs,
+        ddim_steps=200,
+        ddim_eta=1.0,
+        x_T=None,
+        n_candidate_gen_per_text=1,
+        unconditional_guidance_scale=1.0,
+        unconditional_conditioning=None,
+        name="waveform",
+        use_plms=False,
+        save=False,
+        **kwargs,
+    ):
+        # Generate n_candidate_gen_per_text times and select the best
+        # Batch: audio, text, fnames
+        assert x_T is None
+        try:
+            batchs = iter(batchs)
+        except TypeError:
+            raise ValueError("The first input argument should be an iterable object")
+        if use_plms:
+            assert ddim_steps is not None
+        use_ddim = ddim_steps is not None
+        # waveform_save_path = os.path.join(self.get_log_dir(), name)
+        # os.makedirs(waveform_save_path, exist_ok=True)
+        # print("Waveform save path: ", waveform_save_path)
+        with self.ema_scope("Generate"):
+            for batch in batchs:
+                z, c = self.get_input(
+                    batch,
+                    self.first_stage_key,
+                    cond_key=self.cond_stage_key,
+                    return_first_stage_outputs=False,
+                    force_c_encode=True,
+                    return_original_cond=False,
+                    bs=None,
+                )
+                text = super().get_input(batch, "text")
+                # Generate multiple samples
+                batch_size = z.shape[0] * n_candidate_gen_per_text
+                c = torch.cat([c] * n_candidate_gen_per_text, dim=0)
+                text = text * n_candidate_gen_per_text
+                if unconditional_guidance_scale != 1.0:
+                    unconditional_conditioning = (
+                        self.cond_stage_model.get_unconditional_condition(batch_size)
+                    )
+                samples, _ = self.sample_log(
+                    cond=c,
+                    batch_size=batch_size,
+                    x_T=x_T,
+                    ddim=use_ddim,
+                    ddim_steps=ddim_steps,
+                    eta=ddim_eta,
+                    unconditional_guidance_scale=unconditional_guidance_scale,
+                    unconditional_conditioning=unconditional_conditioning,
+                    use_plms=use_plms,
+                )
+                if torch.max(torch.abs(samples)) > 1e2:
+                    samples = torch.clip(samples, min=-10, max=10)
+                mel = self.decode_first_stage(samples)
+                waveform = self.mel_spectrogram_to_waveform(mel)
+                if waveform.shape[0] > 1:
+                    similarity = self.cond_stage_model.cos_similarity(
+                        torch.FloatTensor(waveform).squeeze(1), text
+                    )
+                    best_index = []
+                    for i in range(z.shape[0]):
+                        candidates = similarity[i :: z.shape[0]]
+                        max_index = torch.argmax(candidates).item()
+                        best_index.append(i + max_index * z.shape[0])
+                    waveform = waveform[best_index]
+                    # print("Similarity between generated audio and text", similarity)
+                    # print("Choose the following indexes:", best_index)
+        return waveform
+    @torch.no_grad()
+    def generate_sample_masked(
+        self,
+        batchs,
+        ddim_steps=200,
+        ddim_eta=1.0,
+        x_T=None,
+        n_candidate_gen_per_text=1,
+        unconditional_guidance_scale=1.0,
+        unconditional_conditioning=None,
+        name="waveform",
+        use_plms=False,
+        time_mask_ratio_start_and_end=(0.25, 0.75),
+        freq_mask_ratio_start_and_end=(0.75, 1.0),
+        save=False,
+        **kwargs,
+    ):
+        # Generate n_candidate_gen_per_text times and select the best
+        # Batch: audio, text, fnames
+        assert x_T is None
+        try:
+            batchs = iter(batchs)
+        except TypeError:
+            raise ValueError("The first input argument should be an iterable object")
+        if use_plms:
+            assert ddim_steps is not None
+        use_ddim = ddim_steps is not None
+        # waveform_save_path = os.path.join(self.get_log_dir(), name)
+        # os.makedirs(waveform_save_path, exist_ok=True)
+        # print("Waveform save path: ", waveform_save_path)
+        with self.ema_scope("Generate"):
+            for batch in batchs:
+                z, c = self.get_input(
+                    batch,
+                    self.first_stage_key,
+                    cond_key=self.cond_stage_key,
+                    return_first_stage_outputs=False,
+                    force_c_encode=True,
+                    return_original_cond=False,
+                    bs=None,
+                )
+                text = super().get_input(batch, "text")
+                # Generate multiple samples
+                batch_size = z.shape[0] * n_candidate_gen_per_text
+                _, h, w = z.shape[0], z.shape[2], z.shape[3]
+                mask = torch.ones(batch_size, h, w).to(self.device)
+                mask[:, int(h * time_mask_ratio_start_and_end[0]) : int(h * time_mask_ratio_start_and_end[1]), :] = 0
+                mask[:, :, int(w * freq_mask_ratio_start_and_end[0]) : int(w * freq_mask_ratio_start_and_end[1])] = 0
+                mask = mask[:, None, ...]
+                c = torch.cat([c] * n_candidate_gen_per_text, dim=0)
+                text = text * n_candidate_gen_per_text
+                if unconditional_guidance_scale != 1.0:
+                    unconditional_conditioning = (
+                        self.cond_stage_model.get_unconditional_condition(batch_size)
+                    )
+                samples, _ = self.sample_log(
+                    cond=c,
+                    batch_size=batch_size,
+                    x_T=x_T,
+                    ddim=use_ddim,
+                    ddim_steps=ddim_steps,
+                    eta=ddim_eta,
+                    unconditional_guidance_scale=unconditional_guidance_scale,
+                    unconditional_conditioning=unconditional_conditioning,
+                    use_plms=use_plms,
+                    mask=mask,
+                    x0=torch.cat([z] * n_candidate_gen_per_text),
+                )
+                mel = self.decode_first_stage(samples)
+                waveform = self.mel_spectrogram_to_waveform(mel)
+                if waveform.shape[0] > 1:
+                    similarity = self.cond_stage_model.cos_similarity(
+                        torch.FloatTensor(waveform).squeeze(1), text
+                    )
+                    best_index = []
+                    for i in range(z.shape[0]):
+                        candidates = similarity[i :: z.shape[0]]
+                        max_index = torch.argmax(candidates).item()
+                        best_index.append(i + max_index * z.shape[0])
+                    waveform = waveform[best_index]
+                    # print("Similarity between generated audio and text", similarity)
+                    # print("Choose the following indexes:", best_index)
+        return waveform
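
Note on the candidate selection in `generate_sample` and `generate_sample_masked`: because the conditioning is repeated with `torch.cat([c] * n_candidate_gen_per_text, dim=0)`, candidate `k` for prompt `i` sits at flat index `i + k * B`, where `B = z.shape[0]`. The pure-Python sketch below reproduces only that index arithmetic. `pick_best` and the scores are hypothetical stand-ins for `cos_similarity` output, not part of the repository.

```python
def pick_best(similarity, num_prompts):
    """Return, for each prompt, the flat index of its highest-scoring candidate.

    `similarity` is a flat list laid out as
    [candidate 0 for every prompt, candidate 1 for every prompt, ...].
    """
    best_index = []
    for i in range(num_prompts):
        # Strided slice: the scores of every candidate generated for prompt i.
        candidates = similarity[i::num_prompts]
        max_index = max(range(len(candidates)), key=candidates.__getitem__)
        # Map the per-prompt winner back to its position in the flat batch.
        best_index.append(i + max_index * num_prompts)
    return best_index


# Two prompts, three candidates each (made-up scores).
scores = [0.1, 0.9, 0.4, 0.2, 0.3, 0.8]
best = pick_best(scores, num_prompts=2)
```

Indexing the generated waveforms with `best` keeps one waveform per prompt, which is what `waveform[best_index]` does in the methods above.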

ldm/models/diffusion/cfm1_audio.py ADDED

	@@ -0,0 +1,312 @@

+import os
+from pytorch_memlab import LineProfiler,profile
+import torch
+import torch.nn as nn
+import numpy as np
+import pytorch_lightning as pl
+from torch.optim.lr_scheduler import LambdaLR
+from einops import rearrange, repeat
+from contextlib import contextmanager
+from functools import partial
+from tqdm import tqdm
+from ldm.modules.diffusionmodules.util import make_ddim_sampling_parameters, make_ddim_timesteps
+from torchvision.utils import make_grid
+try:
+    from pytorch_lightning.utilities.distributed import rank_zero_only
+except ImportError:
+    from pytorch_lightning.utilities import rank_zero_only  # fallback for newer pytorch_lightning releases
+from torchdyn.core import NeuralODE
+from ldm.util import log_txt_as_img, exists, default, ismap, isimage, mean_flat, count_params, instantiate_from_config
+from ldm.models.diffusion.ddpm_audio import LatentDiffusion_audio, disabled_train
+from ldm.modules.diffusionmodules.util import make_beta_schedule, extract_into_tensor, noise_like
+from omegaconf import ListConfig
+__conditioning_keys__ = {'concat': 'c_concat',
+                         'crossattn': 'c_crossattn',
+                         'adm': 'y'}
+class CFM(LatentDiffusion_audio):
+    def __init__(self, **kwargs):
+        super(CFM, self).__init__(**kwargs)
+        self.sigma_min = 1e-4
+    def p_losses(self, x_start, cond, t, noise=None):
+        x1 = x_start
+        x0 = default(noise, lambda: torch.randn_like(x_start))
+        ut = x1 - (1 - self.sigma_min) * x0  # regression target; unrelated to the gradient of ut
+        # Reshape t to (B, 1, ..., 1) so that it broadcasts over both 3-D and 4-D latents
+        t_unsqueeze = t.reshape(-1, *([1] * (x1.dim() - 1))).float() / self.num_timesteps
+        x_noisy = t_unsqueeze * x1 + (1. - (1 - self.sigma_min) * t_unsqueeze) * x0
+        model_output = self.apply_model(x_noisy, t, cond)
+        loss_dict = {}
+        prefix = 'train' if self.training else 'val'
+        target = ut
+        mean_dims = list(range(1,len(target.shape)))
+        loss_simple = self.get_loss(model_output, target, mean=False).mean(dim=mean_dims)
+        loss_dict.update({f'{prefix}/loss_simple': loss_simple.mean()})
+        loss = loss_simple
+        loss = self.l_simple_weight * loss.mean()
+        loss_dict.update({f'{prefix}/loss': loss})
+        return loss, loss_dict
+    @torch.no_grad()
+    def sample(self, cond, batch_size=16, timesteps=None, shape=None, x_latent=None, t_start=None, **kwargs):
+        if shape is None:
+            if self.channels > 0:
+                shape = (batch_size, self.channels, self.mel_dim, self.mel_length)
+            else:
+                shape = (batch_size, self.mel_dim, self.mel_length)
+        if cond is not None:
+            if isinstance(cond, dict):
+                cond = {key: cond[key][:batch_size] if not isinstance(cond[key], list) else
+                list(map(lambda x: x[:batch_size], cond[key])) for key in cond}
+            else:
+                cond = [c[:batch_size] for c in cond] if isinstance(cond, list) else cond[:batch_size]
+        neural_ode = NeuralODE(self.ode_wrapper(cond), solver='euler', sensitivity="adjoint", atol=1e-4, rtol=1e-4)
+        t_span = torch.linspace(0, 1, 25 if timesteps is None else timesteps)
+        if t_start is not None:
+            t_span = t_span[t_start:]
+        x0 = torch.randn(shape, device=self.device) if x_latent is None else x_latent
+        eval_points, traj = neural_ode(x0, t_span)
+        return traj[-1], traj
+    def ode_wrapper(self, cond):
+        # Wrap the model as a vector field with forward(t, x, args), the form handed to NeuralODE
+        return Wrapper(self, cond)
+    @torch.no_grad()
+    def sample_cfg(self, cond, unconditional_guidance_scale, unconditional_conditioning, batch_size=16, timesteps=None, shape=None, x_latent=None, t_start=None, **kwargs):
+        if shape is None:
+            if self.channels > 0:
+                shape = (batch_size, self.channels, self.mel_dim, self.mel_length)
+            else:
+                shape = (batch_size, self.mel_dim, self.mel_length)
+        if cond is not None:
+            if isinstance(cond, dict):
+                cond = {key: cond[key][:batch_size] if not isinstance(cond[key], list) else
+                list(map(lambda x: x[:batch_size], cond[key])) for key in cond}
+            else:
+                cond = [c[:batch_size] for c in cond] if isinstance(cond, list) else cond[:batch_size]
+        neural_ode = NeuralODE(self.ode_wrapper_cfg(cond, unconditional_guidance_scale, unconditional_conditioning), solver='euler', sensitivity="adjoint", atol=1e-4, rtol=1e-4)
+        t_span = torch.linspace(0, 1, 25 if timesteps is None else timesteps)
+        if t_start is not None:
+            t_span = t_span[t_start:]
+        x0 = torch.randn(shape, device=self.device) if x_latent is None else x_latent
+        eval_points, traj = neural_ode(x0, t_span)
+        return traj[-1], traj
+    def ode_wrapper_cfg(self, cond, unconditional_guidance_scale, unconditional_conditioning):
+        # Wrap the model as a classifier-free-guided vector field with forward(t, x, args) for NeuralODE
+        return Wrapper_cfg(self, cond, unconditional_guidance_scale, unconditional_conditioning)
+    @torch.no_grad()
+    def stochastic_encode(self, x0, t, use_original_steps=False, noise=None):
+        # fast, but does not allow for exact reconstruction
+        # t serves as an index to gather the correct alphas
+        # if use_original_steps:
+        #     sqrt_alphas_cumprod = self.sqrt_alphas_cumprod
+        #     sqrt_one_minus_alphas_cumprod = self.sqrt_one_minus_alphas_cumprod
+        # else:
+        sqrt_alphas_cumprod = torch.sqrt(self.ddim_alphas)
+        sqrt_one_minus_alphas_cumprod = self.ddim_sqrt_one_minus_alphas
+        if noise is None:
+            noise = torch.randn_like(x0)
+        return (extract_into_tensor(sqrt_alphas_cumprod, t, x0.shape) * x0 +
+                extract_into_tensor(sqrt_one_minus_alphas_cumprod, t, x0.shape) * noise)
+class Wrapper(nn.Module):
+    def __init__(self, net, cond):
+        super(Wrapper, self).__init__()
+        self.net = net
+        self.cond = cond
+    def forward(self, t, x, args):
+        t = torch.tensor([t * 1000] * x.shape[0], device=t.device).long()
+        return self.net.apply_model(x, t, self.cond)
+class Wrapper_cfg(nn.Module):
+    def __init__(self, net, cond, unconditional_guidance_scale, unconditional_conditioning):
+        super(Wrapper_cfg, self).__init__()
+        self.net = net
+        self.cond = cond
+        self.unconditional_conditioning = unconditional_conditioning
+        self.unconditional_guidance_scale = unconditional_guidance_scale
+    def forward(self, t, x, args):
+        x_in = torch.cat([x] * 2)
+        t = torch.tensor([t * 1000] * x.shape[0], device=t.device).long()
+        t_in = torch.cat([t] * 2)
+        c_in = torch.cat([self.unconditional_conditioning, self.cond])  # c/uc shape [b, seq_len=77, dim=1024], c_in shape [b*2, seq_len, dim]
+        e_t_uncond, e_t = self.net.apply_model(x_in, t_in, c_in).chunk(2)
+        e_t = e_t_uncond + self.unconditional_guidance_scale * (e_t - e_t_uncond)
+        return e_t
+class CFM_inpaint(CFM):
+    @torch.no_grad()
+    def get_input(self, batch, k, return_first_stage_outputs=False, force_c_encode=False,
+                  cond_key=None, return_original_cond=False, bs=None):
+        x = batch[k]
+        if self.channels > 0:  # use 4d input
+            if len(x.shape) == 3:
+                x = x[..., None]
+            x = rearrange(x, 'b h w c -> b c h w')
+        x = x.to(memory_format=torch.contiguous_format).float()
+        if bs is not None:
+            x = x[:bs]
+        x = x.to(self.device)
+        encoder_posterior = self.encode_first_stage(x)
+        z = self.get_first_stage_encoding(encoder_posterior).detach()
+        if self.model.conditioning_key is not None:
+            if cond_key is None:
+                cond_key = self.cond_stage_key
+            if cond_key != self.first_stage_key:
+                if cond_key in ['caption', 'coordinates_bbox', 'hybrid_feat']:
+                    xc = batch[cond_key]
+                elif cond_key == 'class_label':
+                    xc = batch
+                else:
+                    xc = super().get_input(batch, cond_key).to(self.device)
+            else:
+                xc = x
+            ##### Testing #######
+            spec = xc['mix_spec'].to(self.device)
+            encoder_posterior = self.encode_first_stage(spec)
+            z_spec = self.get_first_stage_encoding(encoder_posterior).detach()
+            c = {"mix_spec": z_spec, "mix_video_feat": xc['mix_video_feat']}
+            ##### Testing #######
+            if bs is not None:
+                c = {"mix_spec": c["mix_spec"][:bs], "mix_video_feat": c['mix_video_feat'][:bs]}
+            # Testing #
+            if cond_key == 'masked_image':
+                mask = super().get_input(batch, "mask")
+                cc = torch.nn.functional.interpolate(mask, size=c.shape[-2:]) # [B, 1, 10, 106]
+                c = torch.cat((c, cc), dim=1) # [B, 5, 10, 106]
+            # Testing #
+            if self.use_positional_encodings:
+                pos_x, pos_y = self.compute_latent_shifts(batch)
+                ckey = __conditioning_keys__[self.model.conditioning_key]
+                c = {ckey: c, 'pos_x': pos_x, 'pos_y': pos_y}
+        else:
+            c = None
+            xc = None
+            if self.use_positional_encodings:
+                pos_x, pos_y = self.compute_latent_shifts(batch)
+                c = {'pos_x': pos_x, 'pos_y': pos_y}
+        out = [z, c]
+        if return_first_stage_outputs:
+            xrec = self.decode_first_stage(z)
+            out.extend([x, xrec])
+        if return_original_cond:
+            out.append(xc)
+        return out
+    def apply_model(self, x_noisy, t, cond, return_ids=False):
+        if isinstance(cond, dict):
+            # hybrid case, cond is expected to be a dict
+            key = 'c_concat' if self.model.conditioning_key == 'concat' else 'c_crossattn'
+            cond = {key: cond}
+        else:
+            if not isinstance(cond, list):
+                cond = [cond]
+            if self.model.conditioning_key == "concat":
+                key = "c_concat"
+            elif self.model.conditioning_key in ("crossattn", "hybrid_inpaint"):
+                key = "c_crossattn"
+            else:
+                key = "c_film"
+            cond = {key: cond}
+        x_recon = self.model(x_noisy, t, **cond)
+        if isinstance(x_recon, tuple) and not return_ids:
+            return x_recon[0]
+        else:
+            return x_recon
+    @torch.no_grad()
+    def log_images(self, batch, N=8, n_row=4, sample=True, ddim_steps=200, ddim_eta=1., return_keys=None,
+                   quantize_denoised=True, inpaint=False, plot_denoise_rows=False, plot_progressive_rows=True,
+                   plot_diffusion_rows=True, **kwargs):
+        log = dict()
+        z, c, x, xrec, xc = self.get_input(batch, self.first_stage_key,
+                                           return_first_stage_outputs=True,
+                                           force_c_encode=True,
+                                           return_original_cond=True,
+                                           bs=N)  # z is the latent, c is the condition embedding, xc is the list of conditions (captions)
+        N = min(x.shape[0], N)
+        n_row = min(x.shape[0], n_row)
+        log["inputs"] = x if len(x.shape)==4 else x.unsqueeze(1)
+        log["reconstruction"] = xrec if len(xrec.shape)==4 else xrec.unsqueeze(1)
+        if self.model.conditioning_key is not None:
+            if hasattr(self.cond_stage_model, "decode") and self.cond_stage_key != "masked_image":
+                xc = self.cond_stage_model.decode(c)
+                log["conditioning"] = xc
+            elif self.cond_stage_key == "masked_image":
+                log["mask"] = c[:, -1, :, :][:, None, :, :]
+                xc = self.cond_stage_model.decode(c[:, :self.cond_stage_model.embed_dim, :, :])
+                log["conditioning"] = xc
+            elif self.cond_stage_key in ["caption"]:
+                pass
+                # xc = log_txt_as_img((256, 256), batch["caption"])
+                # log["conditioning"] = xc
+            elif self.cond_stage_key == 'class_label':
+                xc = log_txt_as_img((x.shape[2], x.shape[3]), batch["human_label"])
+                log['conditioning'] = xc
+            elif isimage(xc):
+                log["conditioning"] = xc
+        if plot_diffusion_rows:
+            # get diffusion row
+            diffusion_row = list()
+            z_start = z[:n_row]
+            for t in range(self.num_timesteps):
+                if t % self.log_every_t == 0 or t == self.num_timesteps - 1:
+                    t = repeat(torch.tensor([t]), '1 -> b', b=n_row)
+                    t = t.to(self.device).long()
+                    noise = torch.randn_like(z_start)
+                    z_noisy = self.q_sample(x_start=z_start, t=t, noise=noise)
+                    diffusion_row.append(self.decode_first_stage(z_noisy))
+            if len(diffusion_row[0].shape) == 3:
+                diffusion_row = [x.unsqueeze(1) for x in diffusion_row]
+            diffusion_row = torch.stack(diffusion_row)  # n_log_step, n_row, C, H, W
+            diffusion_grid = rearrange(diffusion_row, 'n b c h w -> b n c h w')
+            diffusion_grid = rearrange(diffusion_grid, 'b n c h w -> (b n) c h w')
+            diffusion_grid = make_grid(diffusion_grid, nrow=diffusion_row.shape[0])
+            log["diffusion_row"] = diffusion_grid
+        if return_keys:
+            if np.intersect1d(list(log.keys()), return_keys).shape[0] == 0:
+                return log
+            else:
+                return {key: log[key] for key in return_keys}
+        return log
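
Note on the flow-matching objective in this file: `CFM.p_losses` builds a straight-line path between noise `x0` and data `x1` and regresses the model onto the path's constant velocity `ut`. `CFM.sample` then integrates the learned velocity with an Euler solver over `torch.linspace(0, 1, 25)`. The scalar sketch below assumes a perfect velocity predictor. The helper names `path_point`, `target_velocity` and `euler_integrate` are hypothetical. It only illustrates that Euler integration of `ut` starting at `x0` lands on the path's endpoint `x1 + sigma_min * x0`.

```python
SIGMA_MIN = 1e-4  # same value CFM.__init__ assigns to self.sigma_min


def path_point(x0, x1, t, sigma_min=SIGMA_MIN):
    """Point on the conditional path at time t in [0, 1] (x_noisy in p_losses)."""
    return t * x1 + (1.0 - (1.0 - sigma_min) * t) * x0


def target_velocity(x0, x1, sigma_min=SIGMA_MIN):
    """Regression target ut in p_losses; it does not depend on t."""
    return x1 - (1.0 - sigma_min) * x0


def euler_integrate(velocity_fn, x_start, num_points=25):
    """Fixed-step Euler over num_points evenly spaced times in [0, 1]."""
    x = x_start
    dt = 1.0 / (num_points - 1)
    for step in range(num_points - 1):
        t = step * dt
        x = x + dt * velocity_fn(t, x)
    return x


x0, x1 = -0.7, 1.3  # made-up noise and data samples
x_end = euler_integrate(lambda t, x: target_velocity(x0, x1), x0)
```

Because the target velocity is constant along the path, even a coarse Euler grid reaches the endpoint in this idealised case; with a learned model the number of steps trades speed against accuracy.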

ldm/models/diffusion/cfm1_audio_sampler.py ADDED

	@@ -0,0 +1,105 @@

+import os
+from pytorch_memlab import LineProfiler,profile
+import torch
+import torch.nn as nn
+import numpy as np
+import pytorch_lightning as pl
+from torch.optim.lr_scheduler import LambdaLR
+from einops import rearrange, repeat
+from contextlib import contextmanager
+from functools import partial
+from tqdm import tqdm
+from ldm.modules.diffusionmodules.util import make_ddim_sampling_parameters, make_ddim_timesteps
+from torchvision.utils import make_grid
+try:
+    from pytorch_lightning.utilities.distributed import rank_zero_only
+except ImportError:
+    from pytorch_lightning.utilities import rank_zero_only  # fallback for newer pytorch_lightning releases
+from torchdyn.core import NeuralODE
+from ldm.models.diffusion.cfm1_audio import Wrapper, Wrapper_cfg
+from ldm.modules.diffusionmodules.util import make_beta_schedule, extract_into_tensor, noise_like
+from omegaconf import ListConfig
+from ldm.util import log_txt_as_img, exists, default
+class CFMSampler(object):
+    def __init__(self, model, num_timesteps, schedule="linear", **kwargs):
+        super().__init__()
+        self.model = model
+        self.ddpm_num_timesteps = model.num_timesteps
+        self.num_timesteps = num_timesteps
+        self.schedule = schedule
+    def register_buffer(self, name, attr):
+        if type(attr) == torch.Tensor:
+            if attr.device != torch.device("cuda"):
+                attr = attr.to(torch.device("cuda"))
+        setattr(self, name, attr)
+    def stochastic_encode(self, x_start, t, noise=None):
+        x1 = x_start
+        x0 = default(noise, lambda: torch.randn_like(x_start))
+        t_unsqueeze = 1 - t.unsqueeze(1).unsqueeze(1).float() / self.num_timesteps
+        x_noisy = t_unsqueeze * x1 + (1. - (1 - self.model.sigma_min) * t_unsqueeze) * x0
+        return x_noisy
+    @torch.no_grad()
+    def sample(self, cond, batch_size=16, timesteps=None, shape=None, x_latent=None, t_start=None, **kwargs):
+        if shape is None:
+            if self.model.channels > 0:
+                shape = (batch_size, self.model.channels, self.model.mel_dim, self.model.mel_length)
+            else:
+                shape = (batch_size, self.model.mel_dim, self.model.mel_length)
+        # if cond is not None:
+        #     if isinstance(cond, dict):
+        #         cond = {key: cond[key][:batch_size] if not isinstance(cond[key], list) else
+        #         list(map(lambda x: x[:batch_size], cond[key])) for key in cond}
+        #     else:
+        #         cond = [c[:batch_size] for c in cond] if isinstance(cond, list) else cond[:batch_size]
+        neural_ode = NeuralODE(self.ode_wrapper(cond), solver='euler', sensitivity="adjoint", atol=1e-4, rtol=1e-4)
+        t_span = torch.linspace(0, 1, 25 if timesteps is None else timesteps)
+        if t_start is not None:
+            t_span = t_span[t_start:]
+        x0 = torch.randn(shape, device=self.model.device) if x_latent is None else x_latent
+        eval_points, traj = neural_ode(x0, t_span)
+        return traj[-1], traj
+    def ode_wrapper(self, cond):
+        # Wrap the model as a vector field with forward(t, x, args), the form handed to NeuralODE
+        return Wrapper(self.model, cond)
+    @torch.no_grad()
+    def sample_cfg(self, cond, unconditional_guidance_scale, unconditional_conditioning, batch_size=16, timesteps=None, shape=None, x_latent=None, t_start=None, **kwargs):
+        if shape is None:
+            if self.model.channels > 0:
+                shape = (batch_size, self.model.channels, self.model.mel_dim, self.model.mel_length)
+            else:
+                shape = (batch_size, self.model.mel_dim, self.model.mel_length)
+        # if cond is not None:
+            # if isinstance(cond, dict):
+            #     cond = {key: cond[key][:batch_size] if not isinstance(cond[key], list) else
+            #     list(map(lambda x: x[:batch_size], cond[key])) for key in cond}
+            # else:
+            #     cond = [c[:batch_size] for c in cond] if isinstance(cond, list) else cond[:batch_size]
+        neural_ode = NeuralODE(self.ode_wrapper_cfg(cond, unconditional_guidance_scale, unconditional_conditioning), solver='euler', sensitivity="adjoint", atol=1e-4, rtol=1e-4)
+        t_span = torch.linspace(0, 1, 25 if timesteps is None else timesteps)
+        if t_start is not None:
+            t_span = t_span[t_start:]
+        x0 = torch.randn(shape, device=self.model.device) if x_latent is None else x_latent
+        eval_points, traj = neural_ode(x0, t_span)
+        return traj[-1], traj
+    def ode_wrapper_cfg(self, cond, unconditional_guidance_scale, unconditional_conditioning):
+        # Wrap the model as a classifier-free-guided vector field with forward(t, x, args) for NeuralODE
+        return Wrapper_cfg(self.model, cond, unconditional_guidance_scale, unconditional_conditioning)
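
Note on `sample_cfg`: it delegates to `Wrapper_cfg`, which evaluates the model on the unconditional and conditional inputs in one doubled batch and mixes the two estimates as `e_t_uncond + scale * (e_t - e_t_uncond)`. The sketch below applies that mixing rule to plain lists. `guided` is a hypothetical helper and the numbers are made up.

```python
def guided(v_uncond, v_cond, scale):
    """Classifier-free guidance mix: v_uncond + scale * (v_cond - v_uncond)."""
    return [u + scale * (c - u) for u, c in zip(v_uncond, v_cond)]


v_uncond = [0.0, 1.0, -2.0]
v_cond = [1.0, 3.0, -1.0]

same_as_cond = guided(v_uncond, v_cond, scale=1.0)    # scale 1 recovers the conditional estimate
same_as_uncond = guided(v_uncond, v_cond, scale=0.0)  # scale 0 recovers the unconditional estimate
extrapolated = guided(v_uncond, v_cond, scale=3.0)    # scale > 1 pushes past the conditional estimate
```

This also shows why the generation methods only build `unconditional_conditioning` when `unconditional_guidance_scale != 1.0`: at a scale of 1 the unconditional branch cancels out.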

ldm/models/diffusion/classifier.py ADDED

	@@ -0,0 +1,267 @@

+import os
+import torch
+import pytorch_lightning as pl
+from omegaconf import OmegaConf
+from torch.nn import functional as F
+from torch.optim import AdamW
+from torch.optim.lr_scheduler import LambdaLR
+from copy import deepcopy
+from einops import rearrange
+from glob import glob
+from natsort import natsorted
+from ldm.modules.diffusionmodules.openaimodel import EncoderUNetModel, UNetModel
+from ldm.util import log_txt_as_img, default, ismap, instantiate_from_config
+__models__ = {
+    'class_label': EncoderUNetModel,
+    'segmentation': UNetModel
+}
+def disabled_train(self, mode=True):
+    """Overwrite model.train with this function to make sure train/eval mode
+    does not change anymore."""
+    return self
+class NoisyLatentImageClassifier(pl.LightningModule):
+    def __init__(self,
+                 diffusion_path,
+                 num_classes,
+                 ckpt_path=None,
+                 pool='attention',
+                 label_key=None,
+                 diffusion_ckpt_path=None,
+                 scheduler_config=None,
+                 weight_decay=1.e-2,
+                 log_steps=10,
+                 monitor='val/loss',
+                 *args,
+                 **kwargs):
+        super().__init__(*args, **kwargs)
+        self.num_classes = num_classes
+        # get latest config of diffusion model
+        diffusion_config = natsorted(glob(os.path.join(diffusion_path, 'configs', '*-project.yaml')))[-1]
+        self.diffusion_config = OmegaConf.load(diffusion_config).model
+        self.diffusion_config.params.ckpt_path = diffusion_ckpt_path
+        self.load_diffusion()
+        self.monitor = monitor
+        self.numd = self.diffusion_model.first_stage_model.encoder.num_resolutions - 1
+        self.log_time_interval = self.diffusion_model.num_timesteps // log_steps
+        self.log_steps = log_steps
+        self.label_key = label_key if not hasattr(self.diffusion_model, 'cond_stage_key') \
+            else self.diffusion_model.cond_stage_key
+        assert self.label_key is not None, 'label_key neither in diffusion model nor in model.params'
+        if self.label_key not in __models__:
+            raise NotImplementedError()
+        self.load_classifier(ckpt_path, pool)
+        self.scheduler_config = scheduler_config
+        self.use_scheduler = self.scheduler_config is not None
+        self.weight_decay = weight_decay
+    def init_from_ckpt(self, path, ignore_keys=list(), only_model=False):
+        sd = torch.load(path, map_location="cpu")
+        if "state_dict" in sd:
+            sd = sd["state_dict"]
+        keys = list(sd.keys())
+        for k in keys:
+            for ik in ignore_keys:
+                if k.startswith(ik):
+                    print("Deleting key {} from state_dict.".format(k))
+                    del sd[k]
+        missing, unexpected = self.load_state_dict(sd, strict=False) if not only_model else self.model.load_state_dict(
+            sd, strict=False)
+        print(f"Restored from {path} with {len(missing)} missing and {len(unexpected)} unexpected keys")
+        if len(missing) > 0:
+            print(f"Missing Keys: {missing}")
+        if len(unexpected) > 0:
+            print(f"Unexpected Keys: {unexpected}")
+    def load_diffusion(self):
+        model = instantiate_from_config(self.diffusion_config)
+        self.diffusion_model = model.eval()
+        self.diffusion_model.train = disabled_train
+        for param in self.diffusion_model.parameters():
+            param.requires_grad = False
+    def load_classifier(self, ckpt_path, pool):
+        model_config = deepcopy(self.diffusion_config.params.unet_config.params)
+        model_config.in_channels = self.diffusion_config.params.unet_config.params.out_channels
+        model_config.out_channels = self.num_classes
+        if self.label_key == 'class_label':
+            model_config.pool = pool
+        self.model = __models__[self.label_key](**model_config)
+        if ckpt_path is not None:
+            print('#####################################################################')
+            print(f'load from ckpt "{ckpt_path}"')
+            print('#####################################################################')
+            self.init_from_ckpt(ckpt_path)
+    @torch.no_grad()
+    def get_x_noisy(self, x, t, noise=None):
+        noise = default(noise, lambda: torch.randn_like(x))
+        continuous_sqrt_alpha_cumprod = None
+        if self.diffusion_model.use_continuous_noise:
+            continuous_sqrt_alpha_cumprod = self.diffusion_model.sample_continuous_noise_level(x.shape[0], t + 1)
+            # todo: make sure t+1 is correct here
+        return self.diffusion_model.q_sample(x_start=x, t=t, noise=noise,
+                                             continuous_sqrt_alpha_cumprod=continuous_sqrt_alpha_cumprod)
+    def forward(self, x_noisy, t, *args, **kwargs):
+        return self.model(x_noisy, t)
+    @torch.no_grad()
+    def get_input(self, batch, k):
+        x = batch[k]
+        if len(x.shape) == 3:
+            x = x[..., None]
+        x = rearrange(x, 'b h w c -> b c h w')
+        x = x.to(memory_format=torch.contiguous_format).float()
+        return x
+    @torch.no_grad()
+    def get_conditioning(self, batch, k=None):
+        if k is None:
+            k = self.label_key
+        assert k is not None, 'Needs to provide label key'
+        targets = batch[k].to(self.device)
+        if self.label_key == 'segmentation':
+            targets = rearrange(targets, 'b h w c -> b c h w')
+            for down in range(self.numd):
+                h, w = targets.shape[-2:]
+                targets = F.interpolate(targets, size=(h // 2, w // 2), mode='nearest')
+            # targets = rearrange(targets,'b c h w -> b h w c')
+        return targets
+    def compute_top_k(self, logits, labels, k, reduction="mean"):
+        _, top_ks = torch.topk(logits, k, dim=1)
+        if reduction == "mean":
+            return (top_ks == labels[:, None]).float().sum(dim=-1).mean().item()
+        elif reduction == "none":
+            return (top_ks == labels[:, None]).float().sum(dim=-1)
+    def on_train_epoch_start(self):
+        # save some memory
+        self.diffusion_model.model.to('cpu')
+    @torch.no_grad()
+    def write_logs(self, loss, logits, targets):
+        log_prefix = 'train' if self.training else 'val'
+        log = {}
+        log[f"{log_prefix}/loss"] = loss.mean()
+        log[f"{log_prefix}/acc@1"] = self.compute_top_k(
+            logits, targets, k=1, reduction="mean"
+        )
+        log[f"{log_prefix}/acc@5"] = self.compute_top_k(
+            logits, targets, k=5, reduction="mean"
+        )
+        self.log_dict(log, prog_bar=False, logger=True, on_step=self.training, on_epoch=True)
+        self.log('loss', log[f"{log_prefix}/loss"], prog_bar=True, logger=False)
+        self.log('global_step', self.global_step, logger=False, on_epoch=False, prog_bar=True)
+        lr = self.optimizers().param_groups[0]['lr']
+        self.log('lr_abs', lr, on_step=True, logger=True, on_epoch=False, prog_bar=True)
+    def shared_step(self, batch, t=None):
+        x, *_ = self.diffusion_model.get_input(batch, k=self.diffusion_model.first_stage_key)
+        targets = self.get_conditioning(batch)
+        if targets.dim() == 4:
+            targets = targets.argmax(dim=1)
+        if t is None:
+            t = torch.randint(0, self.diffusion_model.num_timesteps, (x.shape[0],), device=self.device).long()
+        else:
+            t = torch.full(size=(x.shape[0],), fill_value=t, device=self.device).long()
+        x_noisy = self.get_x_noisy(x, t)
+        logits = self(x_noisy, t)
+        loss = F.cross_entropy(logits, targets, reduction='none')
+        self.write_logs(loss.detach(), logits.detach(), targets.detach())
+        loss = loss.mean()
+        return loss, logits, x_noisy, targets
+    def training_step(self, batch, batch_idx):
+        loss, *_ = self.shared_step(batch)
+        return loss
+    def reset_noise_accs(self):
+        self.noisy_acc = {t: {'acc@1': [], 'acc@5': []} for t in
+                          range(0, self.diffusion_model.num_timesteps, self.diffusion_model.log_every_t)}
+    def on_validation_start(self):
+        self.reset_noise_accs()
+    @torch.no_grad()
+    def validation_step(self, batch, batch_idx):
+        loss, *_ = self.shared_step(batch)
+        for t in self.noisy_acc:
+            _, logits, _, targets = self.shared_step(batch, t)
+            self.noisy_acc[t]['acc@1'].append(self.compute_top_k(logits, targets, k=1, reduction='mean'))
+            self.noisy_acc[t]['acc@5'].append(self.compute_top_k(logits, targets, k=5, reduction='mean'))
+        return loss
+    def configure_optimizers(self):
+        optimizer = AdamW(self.model.parameters(), lr=self.learning_rate, weight_decay=self.weight_decay)
+        if self.use_scheduler:
+            scheduler = instantiate_from_config(self.scheduler_config)
+            print("Setting up LambdaLR scheduler...")
+            scheduler = [
+                {
+                    'scheduler': LambdaLR(optimizer, lr_lambda=scheduler.schedule),
+                    'interval': 'step',
+                    'frequency': 1
+                }]
+            return [optimizer], scheduler
+        return optimizer
+    @torch.no_grad()
+    def log_images(self, batch, N=8, *args, **kwargs):
+        log = dict()
+        x = self.get_input(batch, self.diffusion_model.first_stage_key)
+        log['inputs'] = x
+        y = self.get_conditioning(batch)
+        if self.label_key == 'class_label':
+            y = log_txt_as_img((x.shape[2], x.shape[3]), batch["human_label"])
+            log['labels'] = y
+        if ismap(y):
+            log['labels'] = self.diffusion_model.to_rgb(y)
+            for step in range(self.log_steps):
+                current_time = step * self.log_time_interval
+                _, logits, x_noisy, _ = self.shared_step(batch, t=current_time)
+                log[f'inputs@t{current_time}'] = x_noisy
+                pred = F.one_hot(logits.argmax(dim=1), num_classes=self.num_classes)
+                pred = rearrange(pred, 'b h w c -> b c h w')
+                log[f'pred@t{current_time}'] = self.diffusion_model.to_rgb(pred)
+        for key in log:
+            log[key] = log[key][:N]
+        return log
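
Note on `compute_top_k` above: it counts a sample as correct when its label is among the `k` highest-scoring logits, then averages over the batch. A minimal sketch of the same metric in plain Python (no torch, so it runs anywhere) is given here; the helper name `top_k_accuracy` and the toy logits are illustrative assumptions, not part of this commit.

```python
def top_k_accuracy(logits, labels, k):
    """Fraction of rows whose label is among the k largest logits.

    Hypothetical stand-in for compute_top_k(..., reduction="mean"),
    using nested lists instead of tensors.
    """
    hits = 0
    for row, label in zip(logits, labels):
        # indices of the k largest scores in this row
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += int(label in top)
    return hits / len(labels)


# toy batch: 3 samples, 3 classes (made-up numbers)
logits = [[0.1, 2.0, 0.3],
          [1.5, 0.2, 0.9],
          [0.0, 0.4, 0.1]]
labels = [1, 2, 0]

acc1 = top_k_accuracy(logits, labels, k=1)
acc2 = top_k_accuracy(logits, labels, k=2)
print(acc1, acc2)
```

In the classifier above the same quantity is logged as `acc@1` and `acc@5`, once per sampled timestep during validation.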

ldm/models/diffusion/ddim.py ADDED

	@@ -0,0 +1,262 @@

+"""SAMPLING ONLY."""
+import torch
+import numpy as np
+from tqdm import tqdm
+from functools import partial
+from ldm.modules.diffusionmodules.util import make_ddim_sampling_parameters, make_ddim_timesteps, noise_like, \
+    extract_into_tensor
+class DDIMSampler(object):
+    def __init__(self, model, schedule="linear", **kwargs):
+        super().__init__()
+        self.model = model
+        self.ddpm_num_timesteps = model.num_timesteps
+        self.schedule = schedule
+    def register_buffer(self, name, attr):
+        if type(attr) == torch.Tensor:
+            if attr.device != torch.device("cuda"):
+                attr = attr.to(torch.device("cuda"))
+        setattr(self, name, attr)
+    def make_schedule(self, ddim_num_steps, ddim_discretize="uniform", ddim_eta=0., verbose=True):
+        self.ddim_timesteps = make_ddim_timesteps(ddim_discr_method=ddim_discretize, num_ddim_timesteps=ddim_num_steps,
+                                                  num_ddpm_timesteps=self.ddpm_num_timesteps,verbose=verbose)
+        alphas_cumprod = self.model.alphas_cumprod
+        assert alphas_cumprod.shape[0] == self.ddpm_num_timesteps, 'alphas have to be defined for each timestep'
+        to_torch = lambda x: x.clone().detach().to(torch.float32).to(self.model.device)
+        self.register_buffer('betas', to_torch(self.model.betas))
+        self.register_buffer('alphas_cumprod', to_torch(alphas_cumprod))
+        self.register_buffer('alphas_cumprod_prev', to_torch(self.model.alphas_cumprod_prev))
+        # calculations for diffusion q(x_t | x_{t-1}) and others
+        self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod.cpu())))
+        self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod.cpu())))
+        self.register_buffer('log_one_minus_alphas_cumprod', to_torch(np.log(1. - alphas_cumprod.cpu())))
+        self.register_buffer('sqrt_recip_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu())))
+        self.register_buffer('sqrt_recipm1_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu() - 1)))
+        # ddim sampling parameters
+        ddim_sigmas, ddim_alphas, ddim_alphas_prev = make_ddim_sampling_parameters(alphacums=alphas_cumprod.cpu(),
+                                                                                   ddim_timesteps=self.ddim_timesteps,
+                                                                                   eta=ddim_eta,verbose=verbose)
+        self.register_buffer('ddim_sigmas', ddim_sigmas)
+        self.register_buffer('ddim_alphas', ddim_alphas)
+        self.register_buffer('ddim_alphas_prev', ddim_alphas_prev)
+        self.register_buffer('ddim_sqrt_one_minus_alphas', np.sqrt(1. - ddim_alphas))
+        sigmas_for_original_sampling_steps = ddim_eta * torch.sqrt(
+            (1 - self.alphas_cumprod_prev) / (1 - self.alphas_cumprod) * (
+                        1 - self.alphas_cumprod / self.alphas_cumprod_prev))
+        self.register_buffer('ddim_sigmas_for_original_num_steps', sigmas_for_original_sampling_steps)
+    @torch.no_grad()
+    def sample(self,
+               S,
+               batch_size,
+               shape,
+               conditioning=None,
+               callback=None,
+               normals_sequence=None,
+               img_callback=None,
+               quantize_x0=False,
+               eta=0.,
+               mask=None,
+               x0=None,
+               temperature=1.,
+               noise_dropout=0.,
+               score_corrector=None,
+               corrector_kwargs=None,
+               verbose=True,
+               x_T=None,
+               log_every_t=100,
+               unconditional_guidance_scale=1.,
+               unconditional_conditioning=None,
+               # this has to come in the same format as the conditioning, e.g. as encoded tokens, ...
+               **kwargs
+               ):
+        if conditioning is not None:
+            if isinstance(conditioning, dict):
+                ctmp = conditioning[list(conditioning.keys())[0]]
+                while isinstance(ctmp, list): ctmp = ctmp[0]
+                cbs = ctmp.shape[0]
+                if cbs != batch_size:
+                    print(f"Warning: Got {cbs} conditionings but batch-size is {batch_size}")
+            else:
+                if conditioning.shape[0] != batch_size:
+                    print(f"Warning: Got {conditioning.shape[0]} conditionings but batch-size is {batch_size}")
+        self.make_schedule(ddim_num_steps=S, ddim_eta=eta, verbose=verbose)
+        # sampling
+        if len(shape)==3:
+            C, H, W = shape
+            size = (batch_size, C, H, W)
+        else:
+            C, T = shape
+            size = (batch_size, C, T)
+        # print(f'Data shape for DDIM sampling is {size}, eta {eta}')
+        samples, intermediates = self.ddim_sampling(conditioning, size,
+                                                    callback=callback,
+                                                    img_callback=img_callback,
+                                                    quantize_denoised=quantize_x0,
+                                                    mask=mask, x0=x0,
+                                                    ddim_use_original_steps=False,
+                                                    noise_dropout=noise_dropout,
+                                                    temperature=temperature,
+                                                    score_corrector=score_corrector,
+                                                    corrector_kwargs=corrector_kwargs,
+                                                    x_T=x_T,
+                                                    log_every_t=log_every_t,
+                                                    unconditional_guidance_scale=unconditional_guidance_scale,
+                                                    unconditional_conditioning=unconditional_conditioning,
+                                                    )
+        return samples, intermediates
+    @torch.no_grad()
+    def ddim_sampling(self, cond, shape,
+                      x_T=None, ddim_use_original_steps=False,
+                      callback=None, timesteps=None, quantize_denoised=False,
+                      mask=None, x0=None, img_callback=None, log_every_t=100,
+                      temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None,
+                      unconditional_guidance_scale=1., unconditional_conditioning=None,):
+        device = self.model.betas.device
+        b = shape[0]
+        if x_T is None:
+            img = torch.randn(shape, device=device)
+        else:
+            img = x_T
+        if timesteps is None:
+            timesteps = self.ddpm_num_timesteps if ddim_use_original_steps else self.ddim_timesteps
+        elif timesteps is not None and not ddim_use_original_steps:
+            subset_end = int(min(timesteps / self.ddim_timesteps.shape[0], 1) * self.ddim_timesteps.shape[0]) - 1
+            timesteps = self.ddim_timesteps[:subset_end]
+        intermediates = {'x_inter': [img], 'pred_x0': [img]}
+        time_range = reversed(range(0,timesteps)) if ddim_use_original_steps else np.flip(timesteps)
+        total_steps = timesteps if ddim_use_original_steps else timesteps.shape[0]
+        # iterator = tqdm(time_range, desc='DDIM Sampler', total=total_steps)
+        for i, step in enumerate(time_range):
+            index = total_steps - i - 1
+            ts = torch.full((b,), step, device=device, dtype=torch.long)
+            if mask is not None:
+                assert x0 is not None
+                img_orig = self.model.q_sample(x0, ts)  # TODO: deterministic forward pass?
+                img = img_orig * mask + (1. - mask) * img
+            outs = self.p_sample_ddim(img, cond, ts, index=index, use_original_steps=ddim_use_original_steps,
+                                      quantize_denoised=quantize_denoised, temperature=temperature,
+                                      noise_dropout=noise_dropout, score_corrector=score_corrector,
+                                      corrector_kwargs=corrector_kwargs,
+                                      unconditional_guidance_scale=unconditional_guidance_scale,
+                                      unconditional_conditioning=unconditional_conditioning)
+            img, pred_x0 = outs
+            if callback: callback(i)
+            if img_callback: img_callback(pred_x0, i)
+            if index % log_every_t == 0 or index == total_steps - 1:
+                intermediates['x_inter'].append(img)
+                intermediates['pred_x0'].append(pred_x0)
+        return img, intermediates
+    @torch.no_grad()
+    def p_sample_ddim(self, x, c, t, index, repeat_noise=False, use_original_steps=False, quantize_denoised=False,
+                      temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None,
+                      unconditional_guidance_scale=1., unconditional_conditioning=None):
+        b, *_, device = *x.shape, x.device
+        if unconditional_conditioning is None or unconditional_guidance_scale == 1.:
+            e_t = self.model.apply_model(x, t, c)
+        else:
+            x_in = torch.cat([x] * 2)
+            t_in = torch.cat([t] * 2)
+            if isinstance(c, dict):
+                assert isinstance(unconditional_conditioning, dict)
+                c_in = dict()
+                for k in c:
+                    if isinstance(c[k], list):
+                        c_in[k] = [torch.cat([
+                            unconditional_conditioning[k][i],
+                            c[k][i]]) for i in range(len(c[k]))]
+                    else:
+                        c_in[k] = torch.cat([
+                            unconditional_conditioning[k],
+                            c[k]])
+            elif isinstance(c, list):
+                c_in = list()
+                assert isinstance(unconditional_conditioning, list)
+                for i in range(len(c)):
+                    c_in.append(torch.cat([unconditional_conditioning[i], c[i]]))
+            else:
+                c_in = torch.cat([unconditional_conditioning, c])  # c/uc shape [b, seq_len=77, dim=1024]; c_in shape [b*2, seq_len, dim]
+            e_t_uncond, e_t = self.model.apply_model(x_in, t_in, c_in).chunk(2)
+            e_t = e_t_uncond + unconditional_guidance_scale * (e_t - e_t_uncond)
+        if score_corrector is not None:
+            assert self.model.parameterization == "eps"
+            e_t = score_corrector.modify_score(self.model, e_t, x, t, c, **corrector_kwargs)
+        alphas = self.model.alphas_cumprod if use_original_steps else self.ddim_alphas
+        alphas_prev = self.model.alphas_cumprod_prev if use_original_steps else self.ddim_alphas_prev
+        sqrt_one_minus_alphas = self.model.sqrt_one_minus_alphas_cumprod if use_original_steps else self.ddim_sqrt_one_minus_alphas
+        # the original-step sigmas are registered on the sampler in make_schedule, not on the model
+        sigmas = self.ddim_sigmas_for_original_num_steps if use_original_steps else self.ddim_sigmas
+        # select parameters corresponding to the currently considered timestep
+        full_shape = (b,) + tuple([1 for dim in range(len(x.shape)-1)])
+        a_t = torch.full(full_shape, alphas[index], device=device)
+        a_prev = torch.full(full_shape, alphas_prev[index], device=device)
+        sigma_t = torch.full(full_shape, sigmas[index], device=device)
+        sqrt_one_minus_at = torch.full(full_shape, sqrt_one_minus_alphas[index],device=device)
+        # current prediction for x_0
+        pred_x0 = (x - sqrt_one_minus_at * e_t) / a_t.sqrt()
+        if quantize_denoised:
+            pred_x0, _, *_ = self.model.first_stage_model.quantize(pred_x0)
+        # direction pointing to x_t
+        dir_xt = (1. - a_prev - sigma_t**2).sqrt() * e_t
+        noise = sigma_t * noise_like(x.shape, device, repeat_noise) * temperature
+        if noise_dropout > 0.:
+            noise = torch.nn.functional.dropout(noise, p=noise_dropout)
+        x_prev = a_prev.sqrt() * pred_x0 + dir_xt + noise
+        return x_prev, pred_x0
+    @torch.no_grad()
+    def stochastic_encode(self, x0, t, use_original_steps=False, noise=None):
+        # fast, but does not allow for exact reconstruction
+        # t serves as an index to gather the correct alphas
+        if use_original_steps:
+            sqrt_alphas_cumprod = self.sqrt_alphas_cumprod
+            sqrt_one_minus_alphas_cumprod = self.sqrt_one_minus_alphas_cumprod
+        else:
+            sqrt_alphas_cumprod = torch.sqrt(self.ddim_alphas)
+            sqrt_one_minus_alphas_cumprod = self.ddim_sqrt_one_minus_alphas
+        if noise is None:
+            noise = torch.randn_like(x0)
+        return (extract_into_tensor(sqrt_alphas_cumprod, t, x0.shape) * x0 +
+                extract_into_tensor(sqrt_one_minus_alphas_cumprod, t, x0.shape) * noise)
+    @torch.no_grad()
+    def decode(self, x_latent, cond, t_start, unconditional_guidance_scale=1.0, unconditional_conditioning=None,
+               use_original_steps=False):
+        timesteps = np.arange(self.ddpm_num_timesteps) if use_original_steps else self.ddim_timesteps
+        timesteps = timesteps[:t_start]
+        time_range = np.flip(timesteps)
+        total_steps = timesteps.shape[0]
+        x_dec = x_latent
+        for i, step in enumerate(time_range):
+            index = total_steps - i - 1
+            ts = torch.full((x_latent.shape[0],), step, device=x_latent.device, dtype=torch.long)
+            x_dec, _ = self.p_sample_ddim(x_dec, cond, ts, index=index, use_original_steps=use_original_steps,
+                                          unconditional_guidance_scale=unconditional_guidance_scale,
+                                          unconditional_conditioning=unconditional_conditioning)
+        return x_dec
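
Note on `p_sample_ddim` above: each step first mixes the conditional and unconditional noise predictions (classifier-free guidance), then forms a prediction of `x_0`, a "direction pointing to x_t" term, and optional noise scaled by `sigma_t`. The sketch below replays that arithmetic on scalars in plain Python. The helper names, the toy values, and the evenly spaced beta schedule are assumptions for illustration; the repo's `make_beta_schedule` may space its betas differently.

```python
import math


def linear_alphas_cumprod(num_steps=1000, linear_start=1e-4, linear_end=2e-2):
    """Cumulative product of (1 - beta_t) for evenly spaced betas (assumed schedule)."""
    out, prod = [], 1.0
    for i in range(num_steps):
        beta = linear_start + (linear_end - linear_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out


def guided_eps(e_uncond, e_cond, scale):
    """Classifier-free guidance: e_uncond + scale * (e_cond - e_uncond)."""
    return e_uncond + scale * (e_cond - e_uncond)


def ddim_step(x_t, e_t, a_t, a_prev, sigma_t=0.0, noise=0.0):
    """One DDIM update on scalars, mirroring the formulas in p_sample_ddim."""
    pred_x0 = (x_t - math.sqrt(1.0 - a_t) * e_t) / math.sqrt(a_t)
    dir_xt = math.sqrt(1.0 - a_prev - sigma_t ** 2) * e_t
    x_prev = math.sqrt(a_prev) * pred_x0 + dir_xt + sigma_t * noise
    return x_prev, pred_x0


alphas_cumprod = linear_alphas_cumprod()
x0, eps = 0.7, -1.3                      # toy clean sample and noise
a_t, a_prev = alphas_cumprod[500], alphas_cumprod[480]

# forward process q(x_t | x_0), as in stochastic_encode
x_t = math.sqrt(a_t) * x0 + math.sqrt(1.0 - a_t) * eps

# with the true noise as the prediction and eta = 0 (sigma_t = 0),
# the step recovers x0 and lands on the less noisy latent at a_prev
x_prev, pred_x0 = ddim_step(x_t, eps, a_t, a_prev)
expected_prev = math.sqrt(a_prev) * x0 + math.sqrt(1.0 - a_prev) * eps
print(pred_x0, x_prev, expected_prev)
```

With `scale == 1` the guided prediction equals the conditional one, which is why the sampler above skips the doubled batch in that case.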

ldm/models/diffusion/ddpm.py ADDED

	@@ -0,0 +1,1461 @@

+"""
+wild mixture of
+https://github.com/lucidrains/denoising-diffusion-pytorch/blob/7706bdfc6f527f58d33f84b7b522e61e6e3164b3/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py
+https://github.com/openai/improved-diffusion/blob/e94489283bb876ac1477d5dd7709bbbd2d9902ce/improved_diffusion/gaussian_diffusion.py
+https://github.com/CompVis/taming-transformers
+-- merci
+"""
+import torch
+import torch.nn as nn
+import numpy as np
+import pytorch_lightning as pl
+from torch.optim.lr_scheduler import LambdaLR
+from einops import rearrange, repeat
+from contextlib import contextmanager
+from functools import partial
+from tqdm import tqdm
+from torchvision.utils import make_grid
+try:
+    from pytorch_lightning.utilities.distributed import rank_zero_only
+except ImportError:
+    # newer pytorch_lightning releases expose rank_zero_only from pytorch_lightning.utilities
+    from pytorch_lightning.utilities import rank_zero_only
+from ldm.util import log_txt_as_img, exists, default, ismap, isimage, mean_flat, count_params, instantiate_from_config
+from ldm.modules.ema import LitEma
+from ldm.modules.distributions.distributions import normal_kl, DiagonalGaussianDistribution
+from ldm.models.autoencoder import VQModelInterface, IdentityFirstStage, AutoencoderKL
+from ldm.modules.diffusionmodules.util import make_beta_schedule, extract_into_tensor, noise_like
+from ldm.models.diffusion.ddim import DDIMSampler
+__conditioning_keys__ = {'concat': 'c_concat',
+                         'crossattn': 'c_crossattn',
+                         'adm': 'y'}
+def disabled_train(self, mode=True):
+    """Overwrite model.train with this function to make sure train/eval mode
+    does not change anymore."""
+    return self
+def uniform_on_device(r1, r2, shape, device):
+    return (r1 - r2) * torch.rand(*shape, device=device) + r2
+class DDPM(pl.LightningModule):
+    # classic DDPM with Gaussian diffusion, in image space
+    def __init__(self,
+                 unet_config,
+                 timesteps=1000,
+                 beta_schedule="linear",
+                 loss_type="l2",
+                 ckpt_path=None,
+                 ignore_keys=[],
+                 load_only_unet=False,
+                 monitor="val/loss",
+                 use_ema=True,
+                 first_stage_key="image",
+                 image_size=256,
+                 channels=3,
+                 log_every_t=100,
+                 clip_denoised=True,
+                 linear_start=1e-4,
+                 linear_end=2e-2,
+                 cosine_s=8e-3,
+                 given_betas=None,
+                 original_elbo_weight=0.,
+                 v_posterior=0.,  # weight for choosing posterior variance as sigma = (1-v) * beta_tilde + v * beta
+                 l_simple_weight=1.,
+                 conditioning_key=None,
+                 parameterization="eps",  # all config files use "eps"
+                 scheduler_config=None,
+                 use_positional_encodings=False,
+                 learn_logvar=False,
+                 logvar_init=0.,
+                 ):
+        super().__init__()
+        assert parameterization in ["eps", "x0"], 'currently only supporting "eps" and "x0"'
+        self.parameterization = parameterization
+        print(f"{self.__class__.__name__}: Running in {self.parameterization}-prediction mode")
+        self.cond_stage_model = None
+        self.clip_denoised = clip_denoised
+        self.log_every_t = log_every_t
+        self.first_stage_key = first_stage_key
+        self.image_size = image_size  # try conv?
+        self.channels = channels
+        self.use_positional_encodings = use_positional_encodings
+        self.model = DiffusionWrapper(unet_config, conditioning_key)
+        count_params(self.model, verbose=True)
+        self.use_ema = use_ema
+        if self.use_ema:
+            self.model_ema = LitEma(self.model)
+            print(f"Keeping EMAs of {len(list(self.model_ema.buffers()))}.")
+        self.use_scheduler = scheduler_config is not None
+        if self.use_scheduler:
+            self.scheduler_config = scheduler_config
+        self.v_posterior = v_posterior
+        self.original_elbo_weight = original_elbo_weight
+        self.l_simple_weight = l_simple_weight
+        if monitor is not None:
+            self.monitor = monitor
+        if ckpt_path is not None:
+            self.init_from_ckpt(ckpt_path, ignore_keys=ignore_keys, only_model=load_only_unet)
+        self.register_schedule(given_betas=given_betas, beta_schedule=beta_schedule, timesteps=timesteps,
+                               linear_start=linear_start, linear_end=linear_end, cosine_s=cosine_s)
+        self.loss_type = loss_type
+        self.learn_logvar = learn_logvar
+        self.logvar = torch.full(fill_value=logvar_init, size=(self.num_timesteps,))
+        if self.learn_logvar:
+            self.logvar = nn.Parameter(self.logvar, requires_grad=True)
+    def register_schedule(self, given_betas=None, beta_schedule="linear", timesteps=1000,
+                          linear_start=1e-4, linear_end=2e-2, cosine_s=8e-3):
+        if exists(given_betas):
+            betas = given_betas
+        else:
+            betas = make_beta_schedule(beta_schedule, timesteps, linear_start=linear_start, linear_end=linear_end,
+                                       cosine_s=cosine_s)
+        alphas = 1. - betas
+        alphas_cumprod = np.cumprod(alphas, axis=0)
+        alphas_cumprod_prev = np.append(1., alphas_cumprod[:-1])
+        timesteps, = betas.shape
+        self.num_timesteps = int(timesteps)
+        self.linear_start = linear_start
+        self.linear_end = linear_end
+        assert alphas_cumprod.shape[0] == self.num_timesteps, 'alphas have to be defined for each timestep'
+        to_torch = partial(torch.tensor, dtype=torch.float32)
+        self.register_buffer('betas', to_torch(betas))
+        self.register_buffer('alphas_cumprod', to_torch(alphas_cumprod))
+        self.register_buffer('alphas_cumprod_prev', to_torch(alphas_cumprod_prev))
+        # calculations for diffusion q(x_t | x_{t-1}) and others
+        self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod)))
+        self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod)))
+        self.register_buffer('log_one_minus_alphas_cumprod', to_torch(np.log(1. - alphas_cumprod)))
+        self.register_buffer('sqrt_recip_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod)))
+        self.register_buffer('sqrt_recipm1_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod - 1)))
+        # calculations for posterior q(x_{t-1} | x_t, x_0)
+        posterior_variance = (1 - self.v_posterior) * betas * (1. - alphas_cumprod_prev) / (
+                    1. - alphas_cumprod) + self.v_posterior * betas
+        # above: equal to 1. / (1. / (1. - alpha_cumprod_tm1) + alpha_t / beta_t)
+        self.register_buffer('posterior_variance', to_torch(posterior_variance))
+        # below: log calculation clipped because the posterior variance is 0 at the beginning of the diffusion chain
+        self.register_buffer('posterior_log_variance_clipped', to_torch(np.log(np.maximum(posterior_variance, 1e-20))))
+        self.register_buffer('posterior_mean_coef1', to_torch(
+            betas * np.sqrt(alphas_cumprod_prev) / (1. - alphas_cumprod)))
+        self.register_buffer('posterior_mean_coef2', to_torch(
+            (1. - alphas_cumprod_prev) * np.sqrt(alphas) / (1. - alphas_cumprod)))
+        if self.parameterization == "eps":
+            lvlb_weights = self.betas ** 2 / (
+                        2 * self.posterior_variance * to_torch(alphas) * (1 - self.alphas_cumprod))
+        elif self.parameterization == "x0":
+            lvlb_weights = 0.5 * np.sqrt(torch.Tensor(alphas_cumprod)) / (2. * 1 - torch.Tensor(alphas_cumprod))
+        else:
+            raise NotImplementedError("mu not supported")
+        # TODO how to choose this term
+        lvlb_weights[0] = lvlb_weights[1]
+        self.register_buffer('lvlb_weights', lvlb_weights, persistent=False)
+        # fail if any weight is NaN (the previous .all() only failed when every weight was NaN)
+        assert not torch.isnan(self.lvlb_weights).any()
+    @contextmanager
+    def ema_scope(self, context=None):
+        if self.use_ema:
+            self.model_ema.store(self.model.parameters())
+            self.model_ema.copy_to(self.model)
+            if context is not None:
+                print(f"{context}: Switched to EMA weights")
+        try:
+            yield None
+        finally:
+            if self.use_ema:
+                self.model_ema.restore(self.model.parameters())
+                if context is not None:
+                    print(f"{context}: Restored training weights")
+    def init_from_ckpt(self, path, ignore_keys=list(), only_model=False):
+        sd = torch.load(path, map_location="cpu")
+        if "state_dict" in list(sd.keys()):
+            sd = sd["state_dict"]
+        keys = list(sd.keys())
+        for k in keys:
+            for ik in ignore_keys:
+                if k.startswith(ik):
+                    print("Deleting key {} from state_dict.".format(k))
+                    del sd[k]
+        missing, unexpected = self.load_state_dict(sd, strict=False) if not only_model else self.model.load_state_dict(
+            sd, strict=False)
+        print(f"Restored from {path} with {len(missing)} missing and {len(unexpected)} unexpected keys")
+        if len(missing) > 0:
+            print(f"Missing Keys: {missing}")
+        if len(unexpected) > 0:
+            print(f"Unexpected Keys: {unexpected}")
+    def q_mean_variance(self, x_start, t):
+        """
+        Get the distribution q(x_t | x_0).
+        :param x_start: the [N x C x ...] tensor of noiseless inputs.
+        :param t: the number of diffusion steps (minus 1). Here, 0 means one step.
+        :return: A tuple (mean, variance, log_variance), all of x_start's shape.
+        """
+        mean = (extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start)
+        variance = extract_into_tensor(1.0 - self.alphas_cumprod, t, x_start.shape)
+        log_variance = extract_into_tensor(self.log_one_minus_alphas_cumprod, t, x_start.shape)
+        return mean, variance, log_variance
+    def predict_start_from_noise(self, x_t, t, noise):
+        return (
+                extract_into_tensor(self.sqrt_recip_alphas_cumprod, t, x_t.shape) * x_t -
+                extract_into_tensor(self.sqrt_recipm1_alphas_cumprod, t, x_t.shape) * noise
+        )
+    def q_posterior(self, x_start, x_t, t):
+        posterior_mean = (
+                extract_into_tensor(self.posterior_mean_coef1, t, x_t.shape) * x_start +
+                extract_into_tensor(self.posterior_mean_coef2, t, x_t.shape) * x_t
+        )
+        posterior_variance = extract_into_tensor(self.posterior_variance, t, x_t.shape)
+        posterior_log_variance_clipped = extract_into_tensor(self.posterior_log_variance_clipped, t, x_t.shape)
+        return posterior_mean, posterior_variance, posterior_log_variance_clipped
+    def p_mean_variance(self, x, t, clip_denoised: bool):
+        model_out = self.model(x, t)
+        if self.parameterization == "eps":
+            x_recon = self.predict_start_from_noise(x, t=t, noise=model_out)
+        elif self.parameterization == "x0":
+            x_recon = model_out
+        if clip_denoised:
+            x_recon.clamp_(-1., 1.)
+        model_mean, posterior_variance, posterior_log_variance = self.q_posterior(x_start=x_recon, x_t=x, t=t)
+        return model_mean, posterior_variance, posterior_log_variance
+    @torch.no_grad()
+    def p_sample(self, x, t, clip_denoised=True, repeat_noise=False):
+        b, *_, device = *x.shape, x.device
+        model_mean, _, model_log_variance = self.p_mean_variance(x=x, t=t, clip_denoised=clip_denoised)
+        noise = noise_like(x.shape, device, repeat_noise)
+        # no noise when t == 0
+        nonzero_mask = (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1)))
+        return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise
+    @torch.no_grad()
+    def p_sample_loop(self, shape, return_intermediates=False):
+        device = self.betas.device
+        b = shape[0]
+        img = torch.randn(shape, device=device)
+        intermediates = [img]
+        for i in tqdm(reversed(range(0, self.num_timesteps)), desc='Sampling t', total=self.num_timesteps):
+            img = self.p_sample(img, torch.full((b,), i, device=device, dtype=torch.long),
+                                clip_denoised=self.clip_denoised)
+            if i % self.log_every_t == 0 or i == self.num_timesteps - 1:
+                intermediates.append(img)
+        if return_intermediates:
+            return img, intermediates
+        return img
+    @torch.no_grad()
+    def sample(self, batch_size=16, return_intermediates=False):
+        image_size = self.image_size
+        channels = self.channels
+        return self.p_sample_loop((batch_size, channels, image_size, image_size),
+                                  return_intermediates=return_intermediates)
+    def q_sample(self, x_start, t, noise=None):
+        noise = default(noise, lambda: torch.randn_like(x_start))
+        return (extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start +
+                extract_into_tensor(self.sqrt_one_minus_alphas_cumprod, t, x_start.shape) * noise)
+    def get_loss(self, pred, target, mean=True):
+        if self.loss_type == 'l1':
+            loss = (target - pred).abs()
+            if mean:
+                loss = loss.mean()
+        elif self.loss_type == 'l2':
+            if mean:
+                loss = torch.nn.functional.mse_loss(target, pred)
+            else:
+                loss = torch.nn.functional.mse_loss(target, pred, reduction='none')
+        else:
+            raise NotImplementedError(f"unknown loss type '{self.loss_type}'")
+        return loss
+    def p_losses(self, x_start, t, noise=None):
+        noise = default(noise, lambda: torch.randn_like(x_start))
+        x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise)
+        model_out = self.model(x_noisy, t)
+        loss_dict = {}
+        if self.parameterization == "eps":
+            target = noise
+        elif self.parameterization == "x0":
+            target = x_start
+        else:
+            raise NotImplementedError(f"Parameterization {self.parameterization} not yet supported")
+        loss = self.get_loss(model_out, target, mean=False).mean(dim=[1, 2, 3])
+        log_prefix = 'train' if self.training else 'val'
+        loss_dict.update({f'{log_prefix}/loss_simple': loss.mean()})
+        loss_simple = loss.mean() * self.l_simple_weight
+        loss_vlb = (self.lvlb_weights[t] * loss).mean()
+        loss_dict.update({f'{log_prefix}/loss_vlb': loss_vlb})
+        loss = loss_simple + self.original_elbo_weight * loss_vlb
+        loss_dict.update({f'{log_prefix}/loss': loss})
+        return loss, loss_dict
+    def forward(self, x, *args, **kwargs):
+        # b, c, h, w, device, img_size, = *x.shape, x.device, self.image_size
+        # assert h == img_size and w == img_size, f'height and width of image must be {img_size}'
+        t = torch.randint(0, self.num_timesteps, (x.shape[0],), device=self.device).long()
+        return self.p_losses(x, t, *args, **kwargs)
+    def get_input(self, batch, k):
+        x = batch[k]
+        if self.channels > 0:  # use 4d input
+            if len(x.shape) == 3:
+                x = x[..., None]
+            x = rearrange(x, 'b h w c -> b c h w')
+        x = x.to(memory_format=torch.contiguous_format).float()
+        return x
+    def shared_step(self, batch):
+        x = self.get_input(batch, self.first_stage_key)
+        loss, loss_dict = self(x)
+        return loss, loss_dict
+    def training_step(self, batch, batch_idx):
+        loss, loss_dict = self.shared_step(batch)
+        self.log_dict(loss_dict, prog_bar=True,
+                      logger=True, on_step=True, on_epoch=True)
+        self.log('epoch', float(self.trainer.current_epoch))
+        self.log("global_step", self.global_step,
+                 prog_bar=True, logger=True, on_step=True, on_epoch=False)
+        if self.use_scheduler:
+            lr = self.optimizers().param_groups[0]['lr']
+            self.log('lr_abs', lr, prog_bar=True, logger=True, on_step=True, on_epoch=False)
+        return loss
+    @torch.no_grad()
+    def validation_step(self, batch, batch_idx):
+        _, loss_dict_no_ema = self.shared_step(batch)
+        with self.ema_scope():
+            _, loss_dict_ema = self.shared_step(batch)
+            loss_dict_ema = {key + '_ema': loss_dict_ema[key] for key in loss_dict_ema}
+        self.log_dict(loss_dict_no_ema, prog_bar=False, logger=True, on_step=False, on_epoch=True,sync_dist=True)
+        self.log_dict(loss_dict_ema, prog_bar=False, logger=True, on_step=False, on_epoch=True,sync_dist=True)
+    def on_train_batch_end(self, *args, **kwargs):
+        if self.use_ema:
+            self.model_ema(self.model)
+    def _get_rows_from_list(self, samples):
+        n_imgs_per_row = len(samples)
+        denoise_grid = rearrange(samples, 'n b c h w -> b n c h w')
+        denoise_grid = rearrange(denoise_grid, 'b n c h w -> (b n) c h w')
+        denoise_grid = make_grid(denoise_grid, nrow=n_imgs_per_row)
+        return denoise_grid
+    @torch.no_grad()
+    def log_images(self, batch, N=8, n_row=2, sample=True, return_keys=None, **kwargs):
+        log = dict()
+        x = self.get_input(batch, self.first_stage_key)
+        N = min(x.shape[0], N)
+        n_row = min(x.shape[0], n_row)
+        x = x.to(self.device)[:N]
+        log["inputs"] = x
+        # get diffusion row
+        diffusion_row = list()
+        x_start = x[:n_row]
+        for t in range(self.num_timesteps):
+            if t % self.log_every_t == 0 or t == self.num_timesteps - 1:
+                t = repeat(torch.tensor([t]), '1 -> b', b=n_row)
+                t = t.to(self.device).long()
+                noise = torch.randn_like(x_start)
+                x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise)
+                diffusion_row.append(x_noisy)
+        log["diffusion_row"] = self._get_rows_from_list(diffusion_row)
+        if sample:
+            # get denoise row
+            with self.ema_scope("Plotting"):
+                samples, denoise_row = self.sample(batch_size=N, return_intermediates=True)
+            log["samples"] = samples
+            log["denoise_row"] = self._get_rows_from_list(denoise_row)
+        if return_keys:
+            if np.intersect1d(list(log.keys()), return_keys).shape[0] == 0:
+                return log
+            else:
+                return {key: log[key] for key in return_keys}
+        return log
+    def configure_optimizers(self):
+        lr = self.learning_rate
+        params = list(self.model.parameters())
+        if self.learn_logvar:
+            params = params + [self.logvar]
+        opt = torch.optim.AdamW(params, lr=lr)
+        return opt
+class LatentDiffusion(DDPM):
+    """Main class: DDPM operating on the latents of a frozen first-stage model, with an optional conditioning stage."""
+    def __init__(self,
+                 first_stage_config,
+                 cond_stage_config,
+                 num_timesteps_cond=None,
+                 cond_stage_key="image",# 'caption' for txt2image, 'masked_image' for inpainting
+                 cond_stage_trainable=False,
+                 concat_mode=True,# true for inpainting
+                 cond_stage_forward=None,
+                 conditioning_key=None, # 'crossattn' for txt2image, None for inpainting
+                 scale_factor=1.0,
+                 scale_by_std=False,
+                 *args, **kwargs):
+        self.num_timesteps_cond = default(num_timesteps_cond, 1)
+        self.scale_by_std = scale_by_std
+        assert self.num_timesteps_cond <= kwargs['timesteps']
+        # for backwards compatibility after implementation of DiffusionWrapper
+        if conditioning_key is None:
+            conditioning_key = 'concat' if concat_mode else 'crossattn'
+        if cond_stage_config == '__is_unconditional__':
+            conditioning_key = None
+        ckpt_path = kwargs.pop("ckpt_path", None)
+        ignore_keys = kwargs.pop("ignore_keys", [])
+        super().__init__(conditioning_key=conditioning_key, *args, **kwargs)
+        self.concat_mode = concat_mode
+        self.cond_stage_trainable = cond_stage_trainable
+        self.cond_stage_key = cond_stage_key
+        try:
+            self.num_downs = len(first_stage_config.params.ddconfig.ch_mult) - 1
+        except Exception:  # config has no ddconfig.ch_mult; assume no downsampling
+            self.num_downs = 0
+        if not scale_by_std:
+            self.scale_factor = scale_factor
+        else:
+            self.register_buffer('scale_factor', torch.tensor(scale_factor))
+        self.instantiate_first_stage(first_stage_config)
+        self.instantiate_cond_stage(cond_stage_config)
+        self.cond_stage_forward = cond_stage_forward
+        self.clip_denoised = False
+        self.bbox_tokenizer = None
+        self.restarted_from_ckpt = False
+        if ckpt_path is not None:
+            self.init_from_ckpt(ckpt_path, ignore_keys)
+            self.restarted_from_ckpt = True
+    def make_cond_schedule(self, ):
+        self.cond_ids = torch.full(size=(self.num_timesteps,), fill_value=self.num_timesteps - 1, dtype=torch.long)
+        ids = torch.round(torch.linspace(0, self.num_timesteps - 1, self.num_timesteps_cond)).long()
+        self.cond_ids[:self.num_timesteps_cond] = ids
+    @rank_zero_only
+    @torch.no_grad()
+    def on_train_batch_start(self, batch, batch_idx, dataloader_idx):
+        # only for very first batch
+        if self.scale_by_std and self.current_epoch == 0 and self.global_step == 0 and batch_idx == 0 and not self.restarted_from_ckpt:
+            assert self.scale_factor == 1., 'rather not use custom rescaling and std-rescaling simultaneously'
+            # set rescale weight to 1./std of encodings
+            print("### USING STD-RESCALING ###")
+            x = super().get_input(batch, self.first_stage_key)
+            x = x.to(self.device)
+            encoder_posterior = self.encode_first_stage(x)
+            z = self.get_first_stage_encoding(encoder_posterior).detach()
+            del self.scale_factor
+            self.register_buffer('scale_factor', 1. / z.flatten().std())
+            print(f"setting self.scale_factor to {self.scale_factor}")
+            print("### USING STD-RESCALING ###")
+    def register_schedule(self,
+                          given_betas=None, beta_schedule="linear", timesteps=1000,
+                          linear_start=1e-4, linear_end=2e-2, cosine_s=8e-3):
+        super().register_schedule(given_betas, beta_schedule, timesteps, linear_start, linear_end, cosine_s)
+        self.shorten_cond_schedule = self.num_timesteps_cond > 1
+        if self.shorten_cond_schedule:
+            self.make_cond_schedule()
+    def instantiate_first_stage(self, config):
+        model = instantiate_from_config(config)
+        self.first_stage_model = model.eval()
+        self.first_stage_model.train = disabled_train
+        for param in self.first_stage_model.parameters():
+            param.requires_grad = False
+    def instantiate_cond_stage(self, config):
+        if not self.cond_stage_trainable:
+            if config == "__is_first_stage__":# inpaint
+                print("Using first stage also as cond stage.")
+                self.cond_stage_model = self.first_stage_model
+            elif config == "__is_unconditional__":
+                print(f"Training {self.__class__.__name__} as an unconditional model.")
+                self.cond_stage_model = None
+                # self.be_unconditional = True
+            else:
+                model = instantiate_from_config(config)
+                self.cond_stage_model = model.eval()
+                self.cond_stage_model.train = disabled_train
+                for param in self.cond_stage_model.parameters():
+                    param.requires_grad = False
+        else:
+            assert config != '__is_first_stage__'
+            assert config != '__is_unconditional__'
+            model = instantiate_from_config(config)
+            self.cond_stage_model = model
+    def _get_denoise_row_from_list(self, samples, desc='', force_no_decoder_quantization=False):
+        denoise_row = []
+        for zd in tqdm(samples, desc=desc):
+            denoise_row.append(self.decode_first_stage(zd.to(self.device),
+                                                            force_not_quantize=force_no_decoder_quantization))
+        n_imgs_per_row = len(denoise_row)
+        denoise_row = torch.stack(denoise_row)  # n_log_step, n_row, C, H, W
+        denoise_grid = rearrange(denoise_row, 'n b c h w -> b n c h w')
+        denoise_grid = rearrange(denoise_grid, 'b n c h w -> (b n) c h w')
+        denoise_grid = make_grid(denoise_grid, nrow=n_imgs_per_row)
+        return denoise_grid
+    def get_first_stage_encoding(self, encoder_posterior):
+        if isinstance(encoder_posterior, DiagonalGaussianDistribution):
+            z = encoder_posterior.sample()
+        elif isinstance(encoder_posterior, torch.Tensor):
+            z = encoder_posterior
+        else:
+            raise NotImplementedError(f"encoder_posterior of type '{type(encoder_posterior)}' not yet implemented")
+        return self.scale_factor * z
+    def get_learned_conditioning(self, c):
+        if self.cond_stage_forward is None:
+            if hasattr(self.cond_stage_model, 'encode') and callable(self.cond_stage_model.encode):
+                c = self.cond_stage_model.encode(c)
+                if isinstance(c, DiagonalGaussianDistribution):
+                    c = c.mode()
+            else:
+                c = self.cond_stage_model(c)
+        else:
+            assert hasattr(self.cond_stage_model, self.cond_stage_forward)
+            c = getattr(self.cond_stage_model, self.cond_stage_forward)(c)
+        return c
+    def meshgrid(self, h, w):
+        y = torch.arange(0, h).view(h, 1, 1).repeat(1, w, 1)
+        x = torch.arange(0, w).view(1, w, 1).repeat(h, 1, 1)
+        arr = torch.cat([y, x], dim=-1)
+        return arr
+    def delta_border(self, h, w):
+        """
+        :param h: height
+        :param w: width
+        :return: normalized distance to image border,
+         with min distance = 0 at border and max dist = 0.5 at image center
+        """
+        lower_right_corner = torch.tensor([h - 1, w - 1]).view(1, 1, 2)
+        arr = self.meshgrid(h, w) / lower_right_corner
+        dist_left_up = torch.min(arr, dim=-1, keepdims=True)[0]
+        dist_right_down = torch.min(1 - arr, dim=-1, keepdims=True)[0]
+        edge_dist = torch.min(torch.cat([dist_left_up, dist_right_down], dim=-1), dim=-1)[0]
+        return edge_dist
+    def get_weighting(self, h, w, Ly, Lx, device):
+        weighting = self.delta_border(h, w)
+        weighting = torch.clip(weighting, self.split_input_params["clip_min_weight"],
+                               self.split_input_params["clip_max_weight"], )
+        weighting = weighting.view(1, h * w, 1).repeat(1, 1, Ly * Lx).to(device)
+        if self.split_input_params["tie_braker"]:
+            L_weighting = self.delta_border(Ly, Lx)
+            L_weighting = torch.clip(L_weighting,
+                                     self.split_input_params["clip_min_tie_weight"],
+                                     self.split_input_params["clip_max_tie_weight"])
+            L_weighting = L_weighting.view(1, 1, Ly * Lx).to(device)
+            weighting = weighting * L_weighting
+        return weighting
+    def get_fold_unfold(self, x, kernel_size, stride, uf=1, df=1):  # todo load once not every time, shorten code
+        """
+        :param x: img of size (bs, c, h, w)
+        :return: (fold, unfold, normalization, weighting) used to split x into overlapping
+         crops of size kernel_size and to stitch the processed crops back together
+        """
+        bs, nc, h, w = x.shape
+        # number of crops in image
+        Ly = (h - kernel_size[0]) // stride[0] + 1
+        Lx = (w - kernel_size[1]) // stride[1] + 1
+        if uf == 1 and df == 1:
+            fold_params = dict(kernel_size=kernel_size, dilation=1, padding=0, stride=stride)
+            unfold = torch.nn.Unfold(**fold_params)
+            fold = torch.nn.Fold(output_size=x.shape[2:], **fold_params)
+            weighting = self.get_weighting(kernel_size[0], kernel_size[1], Ly, Lx, x.device).to(x.dtype)
+            normalization = fold(weighting).view(1, 1, h, w)  # normalizes the overlap
+            weighting = weighting.view((1, 1, kernel_size[0], kernel_size[1], Ly * Lx))
+        elif uf > 1 and df == 1:
+            fold_params = dict(kernel_size=kernel_size, dilation=1, padding=0, stride=stride)
+            unfold = torch.nn.Unfold(**fold_params)
+            fold_params2 = dict(kernel_size=(kernel_size[0] * uf, kernel_size[1] * uf),
+                                dilation=1, padding=0,
+                                stride=(stride[0] * uf, stride[1] * uf))
+            fold = torch.nn.Fold(output_size=(x.shape[2] * uf, x.shape[3] * uf), **fold_params2)
+            weighting = self.get_weighting(kernel_size[0] * uf, kernel_size[1] * uf, Ly, Lx, x.device).to(x.dtype)
+            normalization = fold(weighting).view(1, 1, h * uf, w * uf)  # normalizes the overlap
+            weighting = weighting.view((1, 1, kernel_size[0] * uf, kernel_size[1] * uf, Ly * Lx))
+        elif df > 1 and uf == 1:
+            fold_params = dict(kernel_size=kernel_size, dilation=1, padding=0, stride=stride)
+            unfold = torch.nn.Unfold(**fold_params)
+            fold_params2 = dict(kernel_size=(kernel_size[0] // df, kernel_size[1] // df),
+                                dilation=1, padding=0,
+                                stride=(stride[0] // df, stride[1] // df))
+            fold = torch.nn.Fold(output_size=(x.shape[2] // df, x.shape[3] // df), **fold_params2)
+            weighting = self.get_weighting(kernel_size[0] // df, kernel_size[1] // df, Ly, Lx, x.device).to(x.dtype)
+            normalization = fold(weighting).view(1, 1, h // df, w // df)  # normalizes the overlap
+            weighting = weighting.view((1, 1, kernel_size[0] // df, kernel_size[1] // df, Ly * Lx))
+        else:
+            raise NotImplementedError
+        return fold, unfold, normalization, weighting
+    @torch.no_grad()
+    def get_input(self, batch, k, return_first_stage_outputs=False, force_c_encode=False,
+                  cond_key=None, return_original_cond=False, bs=None):
+        x = super().get_input(batch, k)
+        if bs is not None:
+            x = x[:bs]
+        x = x.to(self.device)
+        encoder_posterior = self.encode_first_stage(x)
+        z = self.get_first_stage_encoding(encoder_posterior).detach()
+        if self.model.conditioning_key is not None:
+            if cond_key is None:
+                cond_key = self.cond_stage_key
+            if cond_key != self.first_stage_key:  # cond_key is not the image; for inpainting it is the masked image
+                if cond_key in ['caption', 'coordinates_bbox']:
+                    xc = batch[cond_key]
+                elif cond_key == 'class_label':
+                    xc = batch
+                else:
+                    xc = super().get_input(batch, cond_key).to(self.device)
+            else:
+                xc = x
+            if not self.cond_stage_trainable or force_c_encode:
+                if isinstance(xc, (dict, list)):
+                    # import pudb; pudb.set_trace()
+                    c = self.get_learned_conditioning(xc)
+                else:
+                    c = self.get_learned_conditioning(xc.to(self.device))
+            else:
+                c = xc
+            if bs is not None:
+                c = c[:bs]
+            if self.use_positional_encodings:
+                pos_x, pos_y = self.compute_latent_shifts(batch)
+                ckey = __conditioning_keys__[self.model.conditioning_key]
+                c = {ckey: c, 'pos_x': pos_x, 'pos_y': pos_y}
+        else:
+            c = None
+            xc = None
+            if self.use_positional_encodings:
+                pos_x, pos_y = self.compute_latent_shifts(batch)
+                c = {'pos_x': pos_x, 'pos_y': pos_y}
+        out = [z, c]
+        if return_first_stage_outputs:
+            xrec = self.decode_first_stage(z)
+            out.extend([x, xrec])
+        if return_original_cond:
+            out.append(xc)
+        return out
+    @torch.no_grad()
+    def decode_first_stage(self, z, predict_cids=False, force_not_quantize=False):
+        if predict_cids:
+            if z.dim() == 4:
+                z = torch.argmax(z.exp(), dim=1).long()
+            z = self.first_stage_model.quantize.get_codebook_entry(z, shape=None)
+            z = rearrange(z, 'b h w c -> b c h w').contiguous()
+        z = 1. / self.scale_factor * z
+        if hasattr(self, "split_input_params"):
+            if self.split_input_params["patch_distributed_vq"]:
+                ks = self.split_input_params["ks"]  # eg. (128, 128)
+                stride = self.split_input_params["stride"]  # eg. (64, 64)
+                uf = self.split_input_params["vqf"]
+                bs, nc, h, w = z.shape
+                if ks[0] > h or ks[1] > w:
+                    ks = (min(ks[0], h), min(ks[1], w))
+                    print("reducing Kernel")
+                if stride[0] > h or stride[1] > w:
+                    stride = (min(stride[0], h), min(stride[1], w))
+                    print("reducing stride")
+                fold, unfold, normalization, weighting = self.get_fold_unfold(z, ks, stride, uf=uf)
+                z = unfold(z)  # (bn, nc * prod(**ks), L)
+                # 1. Reshape to img shape
+                z = z.view((z.shape[0], -1, ks[0], ks[1], z.shape[-1]))  # (bn, nc, ks[0], ks[1], L )
+                # 2. apply model loop over last dim
+                if isinstance(self.first_stage_model, VQModelInterface):
+                    output_list = [self.first_stage_model.decode(z[:, :, :, :, i],
+                                                                 force_not_quantize=predict_cids or force_not_quantize)
+                                   for i in range(z.shape[-1])]
+                else:
+                    output_list = [self.first_stage_model.decode(z[:, :, :, :, i])
+                                   for i in range(z.shape[-1])]
+                o = torch.stack(output_list, axis=-1)  # # (bn, nc, ks[0], ks[1], L)
+                o = o * weighting
+                # Reverse 1. reshape to img shape
+                o = o.view((o.shape[0], -1, o.shape[-1]))  # (bn, nc * ks[0] * ks[1], L)
+                # stitch crops together
+                decoded = fold(o)
+                decoded = decoded / normalization  # norm is shape (1, 1, h, w)
+                return decoded
+            else:
+                if isinstance(self.first_stage_model, VQModelInterface):
+                    return self.first_stage_model.decode(z, force_not_quantize=predict_cids or force_not_quantize)
+                else:
+                    return self.first_stage_model.decode(z)
+        else:
+            if isinstance(self.first_stage_model, VQModelInterface):
+                return self.first_stage_model.decode(z, force_not_quantize=predict_cids or force_not_quantize)
+            else:
+                return self.first_stage_model.decode(z)
+    # same as decode_first_stage, but without the @torch.no_grad() decorator so gradients can flow
+    def differentiable_decode_first_stage(self, z, predict_cids=False, force_not_quantize=False):
+        if predict_cids:
+            if z.dim() == 4:
+                z = torch.argmax(z.exp(), dim=1).long()
+            z = self.first_stage_model.quantize.get_codebook_entry(z, shape=None)
+            z = rearrange(z, 'b h w c -> b c h w').contiguous()
+        z = 1. / self.scale_factor * z
+        if hasattr(self, "split_input_params"):
+            if self.split_input_params["patch_distributed_vq"]:
+                ks = self.split_input_params["ks"]  # eg. (128, 128)
+                stride = self.split_input_params["stride"]  # eg. (64, 64)
+                uf = self.split_input_params["vqf"]
+                bs, nc, h, w = z.shape
+                if ks[0] > h or ks[1] > w:
+                    ks = (min(ks[0], h), min(ks[1], w))
+                    print("reducing Kernel")
+                if stride[0] > h or stride[1] > w:
+                    stride = (min(stride[0], h), min(stride[1], w))
+                    print("reducing stride")
+                fold, unfold, normalization, weighting = self.get_fold_unfold(z, ks, stride, uf=uf)
+                z = unfold(z)  # (bn, nc * prod(**ks), L)
+                # 1. Reshape to img shape
+                z = z.view((z.shape[0], -1, ks[0], ks[1], z.shape[-1]))  # (bn, nc, ks[0], ks[1], L )
+                # 2. apply model loop over last dim
+                if isinstance(self.first_stage_model, VQModelInterface):
+                    output_list = [self.first_stage_model.decode(z[:, :, :, :, i],
+                                                                 force_not_quantize=predict_cids or force_not_quantize)
+                                   for i in range(z.shape[-1])]
+                else:
+                    output_list = [self.first_stage_model.decode(z[:, :, :, :, i])
+                                   for i in range(z.shape[-1])]
+                o = torch.stack(output_list, axis=-1)  # # (bn, nc, ks[0], ks[1], L)
+                o = o * weighting
+                # Reverse 1. reshape to img shape
+                o = o.view((o.shape[0], -1, o.shape[-1]))  # (bn, nc * ks[0] * ks[1], L)
+                # stitch crops together
+                decoded = fold(o)
+                decoded = decoded / normalization  # norm is shape (1, 1, h, w)
+                return decoded
+            else:
+                if isinstance(self.first_stage_model, VQModelInterface):
+                    return self.first_stage_model.decode(z, force_not_quantize=predict_cids or force_not_quantize)
+                else:
+                    return self.first_stage_model.decode(z)
+        else:
+            if isinstance(self.first_stage_model, VQModelInterface):
+                return self.first_stage_model.decode(z, force_not_quantize=predict_cids or force_not_quantize)
+            else:
+                return self.first_stage_model.decode(z)
+    @torch.no_grad()
+    def encode_first_stage(self, x):
+        if hasattr(self, "split_input_params"):
+            if self.split_input_params["patch_distributed_vq"]:
+                ks = self.split_input_params["ks"]  # eg. (128, 128)
+                stride = self.split_input_params["stride"]  # eg. (64, 64)
+                df = self.split_input_params["vqf"]
+                self.split_input_params['original_image_size'] = x.shape[-2:]
+                bs, nc, h, w = x.shape
+                if ks[0] > h or ks[1] > w:
+                    ks = (min(ks[0], h), min(ks[1], w))
+                    print("reducing Kernel")
+                if stride[0] > h or stride[1] > w:
+                    stride = (min(stride[0], h), min(stride[1], w))
+                    print("reducing stride")
+                fold, unfold, normalization, weighting = self.get_fold_unfold(x, ks, stride, df=df)
+                z = unfold(x)  # (bn, nc * prod(**ks), L)
+                # Reshape to img shape
+                z = z.view((z.shape[0], -1, ks[0], ks[1], z.shape[-1]))  # (bn, nc, ks[0], ks[1], L )
+                output_list = [self.first_stage_model.encode(z[:, :, :, :, i])
+                               for i in range(z.shape[-1])]
+                o = torch.stack(output_list, axis=-1)
+                o = o * weighting
+                # Reverse reshape to img shape
+                o = o.view((o.shape[0], -1, o.shape[-1]))  # (bn, nc * ks[0] * ks[1], L)
+                # stitch crops together
+                decoded = fold(o)
+                decoded = decoded / normalization
+                return decoded
+            else:
+                return self.first_stage_model.encode(x)
+        else:
+            return self.first_stage_model.encode(x)
+    def shared_step(self, batch, **kwargs):
+        x, c = self.get_input(batch, self.first_stage_key)
+        loss = self(x, c)
+        return loss
+    def forward(self, x, c, *args, **kwargs):
+        t = torch.randint(0, self.num_timesteps, (x.shape[0],), device=self.device).long()
+        if self.model.conditioning_key is not None:
+            assert c is not None
+            if self.cond_stage_trainable:  # true when conditioning on text
+                c = self.get_learned_conditioning(c) # c: string list -> [B, T, Context_dim]
+            if self.shorten_cond_schedule:  # TODO: drop this option
+                tc = self.cond_ids[t].to(self.device)
+                c = self.q_sample(x_start=c, t=tc, noise=torch.randn_like(c.float()))
+        return self.p_losses(x, c, t, *args, **kwargs)
+    def _rescale_annotations(self, bboxes, crop_coordinates):  # TODO: move to dataset
+        def rescale_bbox(bbox):
+            x0 = clamp((bbox[0] - crop_coordinates[0]) / crop_coordinates[2])
+            y0 = clamp((bbox[1] - crop_coordinates[1]) / crop_coordinates[3])
+            w = min(bbox[2] / crop_coordinates[2], 1 - x0)
+            h = min(bbox[3] / crop_coordinates[3], 1 - y0)
+            return x0, y0, w, h
+        return [rescale_bbox(b) for b in bboxes]
+    def apply_model(self, x_noisy, t, cond, return_ids=False):
+        if isinstance(cond, dict):
+            # hybrid case, cond is expected to be a dict
+            pass
+        else:
+            if not isinstance(cond, list):
+                cond = [cond]
+            key = 'c_concat' if self.model.conditioning_key == 'concat' else 'c_crossattn'
+            cond = {key: cond}
+        if hasattr(self, "split_input_params"):
+            assert len(cond) == 1  # todo can only deal with one conditioning atm
+            assert not return_ids
+            ks = self.split_input_params["ks"]  # eg. (128, 128)
+            stride = self.split_input_params["stride"]  # eg. (64, 64)
+            h, w = x_noisy.shape[-2:]
+            fold, unfold, normalization, weighting = self.get_fold_unfold(x_noisy, ks, stride)
+            z = unfold(x_noisy)  # (bn, nc * prod(**ks), L)
+            # Reshape to img shape
+            z = z.view((z.shape[0], -1, ks[0], ks[1], z.shape[-1]))  # (bn, nc, ks[0], ks[1], L )
+            z_list = [z[:, :, :, :, i] for i in range(z.shape[-1])]
+            if self.cond_stage_key in ["image", "LR_image", "segmentation",
+                                       'bbox_img'] and self.model.conditioning_key:  # todo check for completeness
+                c_key = next(iter(cond.keys()))  # get key
+                c = next(iter(cond.values()))  # get value
+                assert (len(c) == 1)  # todo extend to list with more than one elem
+                c = c[0]  # get element
+                c = unfold(c)
+                c = c.view((c.shape[0], -1, ks[0], ks[1], c.shape[-1]))  # (bn, nc, ks[0], ks[1], L )
+                cond_list = [{c_key: [c[:, :, :, :, i]]} for i in range(c.shape[-1])]
+            elif self.cond_stage_key == 'coordinates_bbox':
+                assert 'original_image_size' in self.split_input_params, 'BoundingBoxRescaling is missing original_image_size'
+                # assuming padding of unfold is always 0 and its dilation is always 1
+                n_patches_per_row = int((w - ks[0]) / stride[0] + 1)
+                full_img_h, full_img_w = self.split_input_params['original_image_size']
+                # as we are operating on latents, we need the factor from the original image size to the
+                # spatial latent size to properly rescale the crops for regenerating the bbox annotations
+                num_downs = self.first_stage_model.encoder.num_resolutions - 1
+                rescale_latent = 2 ** (num_downs)
+                # get top left positions of patches as conforming for the bbox tokenizer, therefore we
+                # need to rescale the tl patch coordinates to be in between (0,1)
+                tl_patch_coordinates = [(rescale_latent * stride[0] * (patch_nr % n_patches_per_row) / full_img_w,
+                                         rescale_latent * stride[1] * (patch_nr // n_patches_per_row) / full_img_h)
+                                        for patch_nr in range(z.shape[-1])]
+                # patch_limits are tl_coord, width and height coordinates as (x_tl, y_tl, h, w)
+                patch_limits = [(x_tl, y_tl,
+                                 rescale_latent * ks[0] / full_img_w,
+                                 rescale_latent * ks[1] / full_img_h) for x_tl, y_tl in tl_patch_coordinates]
+                # patch_values = [(np.arange(x_tl,min(x_tl+ks, 1.)),np.arange(y_tl,min(y_tl+ks, 1.))) for x_tl, y_tl in tl_patch_coordinates]
+                # tokenize crop coordinates for the bounding boxes of the respective patches
+                patch_limits_tknzd = [torch.LongTensor(self.bbox_tokenizer._crop_encoder(bbox))[None].to(self.device)
+                                      for bbox in patch_limits]  # list of length l with tensors of shape (1, 2)
+                print(patch_limits_tknzd[0].shape)
+                # cut tknzd crop position from conditioning
+                assert isinstance(cond, dict), 'cond must be dict to be fed into model'
+                cut_cond = cond['c_crossattn'][0][..., :-2].to(self.device)
+                print(cut_cond.shape)
+                adapted_cond = torch.stack([torch.cat([cut_cond, p], dim=1) for p in patch_limits_tknzd])
+                adapted_cond = rearrange(adapted_cond, 'l b n -> (l b) n')
+                print(adapted_cond.shape)
+                adapted_cond = self.get_learned_conditioning(adapted_cond)
+                print(adapted_cond.shape)
+                adapted_cond = rearrange(adapted_cond, '(l b) n d -> l b n d', l=z.shape[-1])
+                print(adapted_cond.shape)
+                cond_list = [{'c_crossattn': [e]} for e in adapted_cond]
+            else:
+                cond_list = [cond for i in range(z.shape[-1])]  # Todo make this more efficient
+            # apply model by loop over crops
+            output_list = [self.model(z_list[i], t, **cond_list[i]) for i in range(z.shape[-1])]
+            assert not isinstance(output_list[0],
+                                  tuple)  # todo cant deal with multiple model outputs check this never happens
+            o = torch.stack(output_list, axis=-1)
+            o = o * weighting
+            # Reverse reshape to img shape
+            o = o.view((o.shape[0], -1, o.shape[-1]))  # (bn, nc * ks[0] * ks[1], L)
+            # stitch crops together
+            x_recon = fold(o) / normalization
+        else:
+            x_recon = self.model(x_noisy, t, **cond)
+        if isinstance(x_recon, tuple) and not return_ids:
+            return x_recon[0]
+        else:
+            return x_recon
+    def _predict_eps_from_xstart(self, x_t, t, pred_xstart):
+        return (extract_into_tensor(self.sqrt_recip_alphas_cumprod, t, x_t.shape) * x_t - pred_xstart) / \
+               extract_into_tensor(self.sqrt_recipm1_alphas_cumprod, t, x_t.shape)
+    def _prior_bpd(self, x_start):
+        """
+        Get the prior KL term for the variational lower-bound, measured in
+        bits-per-dim.
+        This term can't be optimized, as it only depends on the encoder.
+        :param x_start: the [N x C x ...] tensor of inputs.
+        :return: a batch of [N] KL values (in bits), one per batch element.
+        """
+        batch_size = x_start.shape[0]
+        t = torch.tensor([self.num_timesteps - 1] * batch_size, device=x_start.device)
+        qt_mean, _, qt_log_variance = self.q_mean_variance(x_start, t)
+        kl_prior = normal_kl(mean1=qt_mean, logvar1=qt_log_variance, mean2=0.0, logvar2=0.0)
+        return mean_flat(kl_prior) / np.log(2.0)
+    def p_losses(self, x_start, cond, t, noise=None):
+        noise = default(noise, lambda: torch.randn_like(x_start))
+        x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise)
+        model_output = self.apply_model(x_noisy, t, cond)
+        loss_dict = {}
+        prefix = 'train' if self.training else 'val'
+        if self.parameterization == "x0":
+            target = x_start
+        elif self.parameterization == "eps":
+            target = noise
+        else:
+            raise NotImplementedError()
+        loss_simple = self.get_loss(model_output, target, mean=False).mean([1, 2, 3])
+        loss_dict.update({f'{prefix}/loss_simple': loss_simple.mean()})
+        logvar_t = self.logvar[t].to(self.device)
+        loss = loss_simple / torch.exp(logvar_t) + logvar_t
+        # loss = loss_simple / torch.exp(self.logvar) + self.logvar
+        if self.learn_logvar:
+            loss_dict.update({f'{prefix}/loss_gamma': loss.mean()})
+            loss_dict.update({'logvar': self.logvar.data.mean()})
+        loss = self.l_simple_weight * loss.mean()
+        loss_vlb = self.get_loss(model_output, target, mean=False).mean(dim=(1, 2, 3))
+        loss_vlb = (self.lvlb_weights[t] * loss_vlb).mean()
+        loss_dict.update({f'{prefix}/loss_vlb': loss_vlb})
+        loss += (self.original_elbo_weight * loss_vlb)
+        loss_dict.update({f'{prefix}/loss': loss})
+        return loss, loss_dict
+    def p_mean_variance(self, x, c, t, clip_denoised: bool, return_codebook_ids=False, quantize_denoised=False,
+                        return_x0=False, score_corrector=None, corrector_kwargs=None):
+        t_in = t
+        model_out = self.apply_model(x, t_in, c, return_ids=return_codebook_ids)
+        if score_corrector is not None:
+            assert self.parameterization == "eps"
+            model_out = score_corrector.modify_score(self, model_out, x, t, c, **corrector_kwargs)
+        if return_codebook_ids:
+            model_out, logits = model_out
+        if self.parameterization == "eps":
+            x_recon = self.predict_start_from_noise(x, t=t, noise=model_out)
+        elif self.parameterization == "x0":
+            x_recon = model_out
+        else:
+            raise NotImplementedError()
+        if clip_denoised:
+            x_recon.clamp_(-1., 1.)
+        if quantize_denoised:
+            x_recon, _, [_, _, indices] = self.first_stage_model.quantize(x_recon)
+        model_mean, posterior_variance, posterior_log_variance = self.q_posterior(x_start=x_recon, x_t=x, t=t)
+        if return_codebook_ids:
+            return model_mean, posterior_variance, posterior_log_variance, logits
+        elif return_x0:
+            return model_mean, posterior_variance, posterior_log_variance, x_recon
+        else:
+            return model_mean, posterior_variance, posterior_log_variance
+    @torch.no_grad()
+    def p_sample(self, x, c, t, clip_denoised=False, repeat_noise=False,
+                 return_codebook_ids=False, quantize_denoised=False, return_x0=False,
+                 temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None):
+        b, *_, device = *x.shape, x.device
+        outputs = self.p_mean_variance(x=x, c=c, t=t, clip_denoised=clip_denoised,
+                                       return_codebook_ids=return_codebook_ids,
+                                       quantize_denoised=quantize_denoised,
+                                       return_x0=return_x0,
+                                       score_corrector=score_corrector, corrector_kwargs=corrector_kwargs)
+        if return_codebook_ids:
+            raise DeprecationWarning("Support dropped.")
+            model_mean, _, model_log_variance, logits = outputs
+        elif return_x0:
+            model_mean, _, model_log_variance, x0 = outputs
+        else:
+            model_mean, _, model_log_variance = outputs
+        noise = noise_like(x.shape, device, repeat_noise) * temperature
+        if noise_dropout > 0.:
+            noise = torch.nn.functional.dropout(noise, p=noise_dropout)
+        # no noise when t == 0
+        nonzero_mask = (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1)))
+        if return_codebook_ids:
+            return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise, logits.argmax(dim=1)
+        if return_x0:
+            return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise, x0
+        else:
+            return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise
+    @torch.no_grad()
+    def progressive_denoising(self, cond, shape, verbose=True, callback=None, quantize_denoised=False,
+                              img_callback=None, mask=None, x0=None, temperature=1., noise_dropout=0.,
+                              score_corrector=None, corrector_kwargs=None, batch_size=None, x_T=None, start_T=None,
+                              log_every_t=None):
+        if not log_every_t:
+            log_every_t = self.log_every_t # 100
+        timesteps = self.num_timesteps
+        if batch_size is not None:
+            b = batch_size
+            shape = [batch_size] + list(shape)
+        else:
+            b = batch_size = shape[0]
+        if x_T is None:
+            img = torch.randn(shape, device=self.device)
+        else:
+            img = x_T
+        intermediates = []
+        if cond is not None:
+            if isinstance(cond, dict):
+                cond = {key: cond[key][:batch_size] if not isinstance(cond[key], list) else
+                list(map(lambda x: x[:batch_size], cond[key])) for key in cond}
+            else:
+                cond = [c[:batch_size] for c in cond] if isinstance(cond, list) else cond[:batch_size]
+        if start_T is not None:
+            timesteps = min(timesteps, start_T)
+        iterator = tqdm(reversed(range(0, timesteps)), desc='Progressive Generation',
+                        total=timesteps) if verbose else reversed(
+            range(0, timesteps))
+        if isinstance(temperature, (int, float)):
+            temperature = [temperature] * timesteps
+        for i in iterator:
+            ts = torch.full((b,), i, device=self.device, dtype=torch.long)
+            if self.shorten_cond_schedule:
+                assert self.model.conditioning_key != 'hybrid'
+                tc = self.cond_ids[ts].to(cond.device)
+                cond = self.q_sample(x_start=cond, t=tc, noise=torch.randn_like(cond))
+            img, x0_partial = self.p_sample(img, cond, ts,
+                                            clip_denoised=self.clip_denoised,
+                                            quantize_denoised=quantize_denoised, return_x0=True,
+                                            temperature=temperature[i], noise_dropout=noise_dropout,
+                                            score_corrector=score_corrector, corrector_kwargs=corrector_kwargs)
+            if mask is not None:
+                assert x0 is not None
+                img_orig = self.q_sample(x0, ts)
+                img = img_orig * mask + (1. - mask) * img
+            if i % log_every_t == 0 or i == timesteps - 1:
+                intermediates.append(x0_partial)
+            if callback: callback(i)
+            if img_callback: img_callback(img, i)
+        return img, intermediates
+    @torch.no_grad()
+    def p_sample_loop(self, cond, shape, return_intermediates=False,
+                      x_T=None, verbose=True, callback=None, timesteps=None, quantize_denoised=False,
+                      mask=None, x0=None, img_callback=None, start_T=None,
+                      log_every_t=None):
+        if not log_every_t:
+            log_every_t = self.log_every_t
+        device = self.betas.device
+        b = shape[0]
+        if x_T is None:
+            img = torch.randn(shape, device=device)
+        else:
+            img = x_T
+        intermediates = [img]
+        if timesteps is None:
+            timesteps = self.num_timesteps
+        if start_T is not None:
+            timesteps = min(timesteps, start_T)
+        iterator = tqdm(reversed(range(0, timesteps)), desc='Sampling t', total=timesteps) if verbose else reversed(
+            range(0, timesteps))
+        if mask is not None:
+            assert x0 is not None
+            assert x0.shape[2:] == mask.shape[2:]  # spatial size has to match
+        for i in iterator:
+            ts = torch.full((b,), i, device=device, dtype=torch.long)
+            if self.shorten_cond_schedule:
+                assert self.model.conditioning_key != 'hybrid'
+                tc = self.cond_ids[ts].to(cond.device)
+                cond = self.q_sample(x_start=cond, t=tc, noise=torch.randn_like(cond))
+            img = self.p_sample(img, cond, ts,
+                                clip_denoised=self.clip_denoised,
+                                quantize_denoised=quantize_denoised)
+            if mask is not None:
+                img_orig = self.q_sample(x0, ts)
+                img = img_orig * mask + (1. - mask) * img
+            if i % log_every_t == 0 or i == timesteps - 1:
+                intermediates.append(img)
+            if callback: callback(i)
+            if img_callback: img_callback(img, i)
+        if return_intermediates:
+            return img, intermediates
+        return img
+    @torch.no_grad()
+    def sample(self, cond, batch_size=16, return_intermediates=False, x_T=None,
+               verbose=True, timesteps=None, quantize_denoised=False,
+               mask=None, x0=None, shape=None,**kwargs):
+        if shape is None:
+            shape = (batch_size, self.channels, self.image_size, self.image_size)
+        if cond is not None:
+            if isinstance(cond, dict):
+                cond = {key: cond[key][:batch_size] if not isinstance(cond[key], list) else
+                list(map(lambda x: x[:batch_size], cond[key])) for key in cond}
+            else:
+                cond = [c[:batch_size] for c in cond] if isinstance(cond, list) else cond[:batch_size]
+        return self.p_sample_loop(cond,
+                                  shape,
+                                  return_intermediates=return_intermediates, x_T=x_T,
+                                  verbose=verbose, timesteps=timesteps, quantize_denoised=quantize_denoised,
+                                  mask=mask, x0=x0)
+    @torch.no_grad()
+    def sample_log(self, cond, batch_size, ddim, ddim_steps, **kwargs):
+        if ddim:
+            ddim_sampler = DDIMSampler(self)
+            shape = (self.channels, self.image_size, self.image_size)
+            samples, intermediates = ddim_sampler.sample(ddim_steps, batch_size,
+                                                         shape, cond, verbose=False, **kwargs)
+        else:
+            samples, intermediates = self.sample(cond=cond, batch_size=batch_size,
+                                                 return_intermediates=True, **kwargs)
+        return samples, intermediates
+    @torch.no_grad()
+    def log_images(self, batch, N=8, n_row=4, sample=True, ddim_steps=200, ddim_eta=1., return_keys=None,
+                   quantize_denoised=True, inpaint=True, plot_denoise_rows=False, plot_progressive_rows=True,
+                   plot_diffusion_rows=True, **kwargs):
+        use_ddim = ddim_steps is not None
+        log = dict()
+        z, c, x, xrec, xc = self.get_input(batch, self.first_stage_key,
+                                           return_first_stage_outputs=True,
+                                           force_c_encode=True,
+                                           return_original_cond=True,
+                                           bs=N)
+        N = min(x.shape[0], N)
+        n_row = min(x.shape[0], n_row)
+        log["inputs"] = x
+        log["reconstruction"] = xrec
+        if self.model.conditioning_key is not None:
+            if hasattr(self.cond_stage_model, "decode"):
+                xc = self.cond_stage_model.decode(c)
+                log["conditioning"] = xc
+            elif self.cond_stage_key in ["caption"]:
+                xc = log_txt_as_img((x.shape[2], x.shape[3]), batch["caption"])
+                log["conditioning"] = xc
+            elif self.cond_stage_key == 'class_label':
+                xc = log_txt_as_img((x.shape[2], x.shape[3]), batch["human_label"])
+                log['conditioning'] = xc
+            elif isimage(xc):
+                log["conditioning"] = xc
+            if ismap(xc):
+                log["original_conditioning"] = self.to_rgb(xc)
+        if plot_diffusion_rows:
+            # get diffusion row
+            diffusion_row = list()
+            z_start = z[:n_row]
+            for t in range(self.num_timesteps):
+                if t % self.log_every_t == 0 or t == self.num_timesteps - 1:
+                    t = repeat(torch.tensor([t]), '1 -> b', b=n_row)
+                    t = t.to(self.device).long()
+                    noise = torch.randn_like(z_start)
+                    z_noisy = self.q_sample(x_start=z_start, t=t, noise=noise)
+                    diffusion_row.append(self.decode_first_stage(z_noisy))
+            diffusion_row = torch.stack(diffusion_row)  # n_log_step, n_row, C, H, W
+            diffusion_grid = rearrange(diffusion_row, 'n b c h w -> b n c h w')
+            diffusion_grid = rearrange(diffusion_grid, 'b n c h w -> (b n) c h w')
+            diffusion_grid = make_grid(diffusion_grid, nrow=diffusion_row.shape[0])
+            log["diffusion_row"] = diffusion_grid
+        if sample:
+            # get denoise row
+            with self.ema_scope("Plotting"):
+                samples, z_denoise_row = self.sample_log(cond=c,batch_size=N,ddim=use_ddim,
+                                                         ddim_steps=ddim_steps,eta=ddim_eta)
+                # samples, z_denoise_row = self.sample(cond=c, batch_size=N, return_intermediates=True)
+            x_samples = self.decode_first_stage(samples)
+            log["samples"] = x_samples
+            if plot_denoise_rows:
+                denoise_grid = self._get_denoise_row_from_list(z_denoise_row)
+                log["denoise_row"] = denoise_grid
+            if quantize_denoised and not isinstance(self.first_stage_model, AutoencoderKL) and not isinstance(
+                    self.first_stage_model, IdentityFirstStage):
+                # also display when quantizing x0 while sampling
+                with self.ema_scope("Plotting Quantized Denoised"):
+                    samples, z_denoise_row = self.sample_log(cond=c,batch_size=N,ddim=use_ddim,
+                                                             ddim_steps=ddim_steps,eta=ddim_eta,
+                                                             quantize_denoised=True)
+                    # samples, z_denoise_row = self.sample(cond=c, batch_size=N, return_intermediates=True,
+                    #                                      quantize_denoised=True)
+                x_samples = self.decode_first_stage(samples.to(self.device))
+                log["samples_x0_quantized"] = x_samples
+            if inpaint:
+                # make a simple center square
+                b, h, w = z.shape[0], z.shape[2], z.shape[3]
+                mask = torch.ones(N, h, w).to(self.device)
+                # zeros will be filled in
+                mask[:, h // 4:3 * h // 4, w // 4:3 * w // 4] = 0.
+                mask = mask[:, None, ...]
+                with self.ema_scope("Plotting Inpaint"):
+                    samples, _ = self.sample_log(cond=c,batch_size=N,ddim=use_ddim, eta=ddim_eta,
+                                                ddim_steps=ddim_steps, x0=z[:N], mask=mask)
+                x_samples = self.decode_first_stage(samples.to(self.device))
+                log["samples_inpainting"] = x_samples
+                log["mask"] = mask
+                # outpaint: invert the mask so the border is generated and the center square is kept
+                mask = 1. - mask
+                with self.ema_scope("Plotting Outpaint"):
+                    samples, _ = self.sample_log(cond=c, batch_size=N, ddim=use_ddim,eta=ddim_eta,
+                                                ddim_steps=ddim_steps, x0=z[:N], mask=mask)
+                x_samples = self.decode_first_stage(samples.to(self.device))
+                log["samples_outpainting"] = x_samples
+        if plot_progressive_rows:
+            with self.ema_scope("Plotting Progressives"):
+                img, progressives = self.progressive_denoising(c,
+                                                               shape=(self.channels, self.image_size, self.image_size),
+                                                               batch_size=N)
+            prog_row = self._get_denoise_row_from_list(progressives, desc="Progressive Generation")
+            log["progressive_row"] = prog_row
+        if return_keys:
+            if np.intersect1d(list(log.keys()), return_keys).shape[0] == 0:
+                return log
+            else:
+                return {key: log[key] for key in return_keys}
+        return log
+    def configure_optimizers(self):
+        lr = self.learning_rate
+        params = list(self.model.parameters())
+        if self.cond_stage_trainable:
+            print(f"{self.__class__.__name__}: Also optimizing conditioner params!")
+            params = params + list(self.cond_stage_model.parameters())
+        if self.learn_logvar:
+            print('Diffusion model optimizing logvar')
+            params.append(self.logvar)
+        opt = torch.optim.AdamW(params, lr=lr)
+        if self.use_scheduler:
+            assert 'target' in self.scheduler_config
+            scheduler = instantiate_from_config(self.scheduler_config)
+            print("Setting up LambdaLR scheduler...")
+            scheduler = [
+                {
+                    'scheduler': LambdaLR(opt, lr_lambda=scheduler.schedule),
+                    'interval': 'step',
+                    'frequency': 1
+                }]
+            return [opt], scheduler
+        return opt
+    @torch.no_grad()
+    def to_rgb(self, x):
+        x = x.float()
+        if not hasattr(self, "colorize"):
+            self.colorize = torch.randn(3, x.shape[1], 1, 1).to(x)
+        x = nn.functional.conv2d(x, weight=self.colorize)
+        x = 2. * (x - x.min()) / (x.max() - x.min()) - 1.
+        return x
+class DiffusionWrapper(pl.LightningModule):
+    def __init__(self, diff_model_config, conditioning_key):
+        super().__init__()
+        self.diffusion_model = instantiate_from_config(diff_model_config)
+        self.conditioning_key = conditioning_key # 'crossattn' for txt2image, concat for inpainting
+        assert self.conditioning_key in [None, 'concat', 'crossattn', 'hybrid', 'adm', 'film', 'hybrid_inpaint']
+    def forward(self, x, t, c_concat: list = None, c_crossattn: list = None, c_film: list = None):
+        """param x: tensor with shape:[B,C,mel_len,T]"""
+        x = x.contiguous()
+        t = t.contiguous()
+        if self.conditioning_key is None:
+            out = self.diffusion_model(x, t)
+        elif self.conditioning_key == 'concat':
+            xc = torch.cat([x] + c_concat, dim=1)# channel dim,x shape (b,3,64,64) c_concat shape(b,4,64,64)
+            out = self.diffusion_model(xc, t)
+        elif self.conditioning_key == 'crossattn':
+            if isinstance(c_crossattn,list):
+                cc = torch.cat(c_crossattn, 1)# [b,seq_len,dim]
+            else:
+                cc = c_crossattn
+            out = self.diffusion_model(x, t, context=cc)
+        elif self.conditioning_key == 'hybrid':# not implemented in the LatentDiffusion
+            xc = torch.cat([x] + c_concat, dim=1)
+            cc = torch.cat(c_crossattn, 1)
+            out = self.diffusion_model(xc, t, context=cc)
+        elif self.conditioning_key == 'hybrid_inpaint': # special
+            cc = c_crossattn
+            out = self.diffusion_model(x, t, context=cc)
+        elif self.conditioning_key == "film":  # the condition is assumed to be a single global token; it passes through a linear layer and is added to the time embedding (FiLM)
+            cc = c_film[0].squeeze(1).contiguous()  # only has one token, shape (b,context_dim)
+            out = self.diffusion_model(x, t, y=cc)
+        elif self.conditioning_key == 'adm':
+            cc = c_crossattn[0]
+            out = self.diffusion_model(x, t, y=cc)
+        else:
+            raise NotImplementedError()
+        return out
+class Layout2ImgDiffusion(LatentDiffusion):
+    # TODO: move all layout-specific hacks to this class
+    def __init__(self, cond_stage_key, *args, **kwargs):
+        assert cond_stage_key == 'coordinates_bbox', 'Layout2ImgDiffusion only for cond_stage_key="coordinates_bbox"'
+        super().__init__(cond_stage_key=cond_stage_key, *args, **kwargs)
+    def log_images(self, batch, N=8, *args, **kwargs):
+        logs = super().log_images(batch=batch, N=N, *args, **kwargs)
+        key = 'train' if self.training else 'validation'
+        dset = self.trainer.datamodule.datasets[key]
+        mapper = dset.conditional_builders[self.cond_stage_key]
+        bbox_imgs = []
+        map_fn = lambda catno: dset.get_textual_label(dset.get_category_id(catno))
+        for tknzd_bbox in batch[self.cond_stage_key][:N]:
+            bboximg = mapper.plot(tknzd_bbox.detach().cpu(), map_fn, (256, 256))
+            bbox_imgs.append(bboximg)
+        cond_img = torch.stack(bbox_imgs, dim=0)
+        logs['bbox_image'] = cond_img
+        return logs
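
The split-input branch of `apply_model` above cuts the latent into overlapping crops with `unfold`, runs the model on each crop, and stitches the outputs back with `fold(o) / normalization`. The following is a minimal pure-Python sketch of that overlap-and-normalize idea in one dimension. It assumes uniform weights (the source multiplies by a `weighting` tensor first) and an identity "model"; `split_into_patches` and `stitch_patches` are hypothetical helper names and are not part of this repository.

```python
def split_into_patches(signal, ks, stride):
    """Return overlapping windows of length ks, taken every `stride` samples."""
    # Same patch count as an unfold with padding 0 and dilation 1.
    n_patches = (len(signal) - ks) // stride + 1
    return [signal[i * stride:i * stride + ks] for i in range(n_patches)]


def stitch_patches(patches, length, stride):
    """Overlap-add the patches, then divide by the per-sample coverage count."""
    total = [0.0] * length
    count = [0] * length
    for p, patch in enumerate(patches):
        start = p * stride
        for offset, value in enumerate(patch):
            total[start + offset] += value
            count[start + offset] += 1
    # Samples covered by no patch stay at zero; this sketch assumes full coverage.
    return [t / c if c else 0.0 for t, c in zip(total, count)]


signal = [float(v) for v in range(8)]
patches = split_into_patches(signal, ks=4, stride=2)
# An identity "model" leaves each patch unchanged before stitching.
restored = stitch_patches(patches, len(signal), stride=2)
```

Dividing by the coverage count plays the role of `normalization` in the source: without it, samples that fall inside two crops would be counted twice.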

ldm/models/diffusion/ddpm_audio.py ADDED

	@@ -0,0 +1,865 @@

+import os
+from pytorch_memlab import LineProfiler,profile
+import torch
+import torch.nn as nn
+import numpy as np
+import pytorch_lightning as pl
+from torch.optim.lr_scheduler import LambdaLR
+from einops import rearrange, repeat
+from contextlib import contextmanager
+from functools import partial
+from tqdm import tqdm
+from torchvision.utils import make_grid
+try:
+    from pytorch_lightning.utilities.distributed import rank_zero_only
+except ImportError:
+    from pytorch_lightning.utilities import rank_zero_only  # fallback import path (torch2 environments)
+from ldm.util import log_txt_as_img, exists, default, ismap, isimage, mean_flat, count_params, instantiate_from_config
+from ldm.modules.ema import LitEma
+from ldm.modules.distributions.distributions import normal_kl, DiagonalGaussianDistribution
+from ldm.models.autoencoder import VQModelInterface, IdentityFirstStage, AutoencoderKL
+from ldm.modules.diffusionmodules.util import make_beta_schedule, extract_into_tensor, noise_like
+from ldm.models.diffusion.ddim import DDIMSampler
+from ldm.models.diffusion.ddpm import DDPM, disabled_train
+from omegaconf import ListConfig
+__conditioning_keys__ = {'concat': 'c_concat',
+                         'crossattn': 'c_crossattn',
+                         'adm': 'y'}
+class LatentDiffusion_audio(DDPM):
+    """main class"""
+    def __init__(self,
+                 first_stage_config,
+                 cond_stage_config,
+                 num_timesteps_cond=None,
+                 mel_dim=80,
+                 mel_length=848,
+                 cond_stage_key="image",
+                 cond_stage_trainable=False,
+                 concat_mode=True,
+                 cond_stage_forward=None,
+                 conditioning_key=None,
+                 scale_factor=1.0,
+                 scale_by_std=False,
+                 *args, **kwargs):
+        self.num_timesteps_cond = default(num_timesteps_cond, 1)
+        self.scale_by_std = scale_by_std
+        assert self.num_timesteps_cond <= kwargs['timesteps']
+        # for backwards compatibility after implementation of DiffusionWrapper
+        if conditioning_key is None:
+            conditioning_key = 'concat' if concat_mode else 'crossattn'
+        if cond_stage_config == '__is_unconditional__':
+            conditioning_key = None
+        ckpt_path = kwargs.pop("ckpt_path", None)
+        ignore_keys = kwargs.pop("ignore_keys", [])
+        super().__init__(conditioning_key=conditioning_key, *args, **kwargs)
+        self.concat_mode = concat_mode
+        self.mel_dim = mel_dim
+        self.mel_length = mel_length
+        self.cond_stage_trainable = cond_stage_trainable
+        self.cond_stage_key = cond_stage_key
+        try:
+            self.num_downs = len(first_stage_config.params.ddconfig.ch_mult) - 1
+        except Exception:
+            self.num_downs = 0
+        if not scale_by_std:
+            self.scale_factor = scale_factor
+        else:
+            self.register_buffer('scale_factor', torch.tensor(scale_factor))
+        self.instantiate_first_stage(first_stage_config)
+        self.instantiate_cond_stage(cond_stage_config)
+        self.cond_stage_forward = cond_stage_forward
+        self.clip_denoised = False
+        self.bbox_tokenizer = None
+        self.restarted_from_ckpt = False
+        if ckpt_path is not None:
+            self.init_from_ckpt(ckpt_path, ignore_keys)
+            self.restarted_from_ckpt = True
+    def make_cond_schedule(self):
+        self.cond_ids = torch.full(size=(self.num_timesteps,), fill_value=self.num_timesteps - 1, dtype=torch.long)
+        ids = torch.round(torch.linspace(0, self.num_timesteps - 1, self.num_timesteps_cond)).long()
+        self.cond_ids[:self.num_timesteps_cond] = ids
+    @rank_zero_only
+    @torch.no_grad()
+    def on_train_batch_start(self, batch, batch_idx):
+        # only for very first batch
+        if self.scale_by_std and self.current_epoch == 0 and self.global_step == 0 and batch_idx == 0 and not self.restarted_from_ckpt:
+            assert self.scale_factor == 1., 'rather not use custom rescaling and std-rescaling simultaneously'
+            # set rescale weight to 1./std of encodings
+            print("### USING STD-RESCALING ###")
+            x = super().get_input(batch, self.first_stage_key)
+            x = x.to(self.device)
+            encoder_posterior = self.encode_first_stage(x)
+            z = self.get_first_stage_encoding(encoder_posterior).detach()# get latent
+            del self.scale_factor
+            self.register_buffer('scale_factor', 1. / z.flatten().std())  # 1 / latent.std; get_first_stage_encoding returns self.scale_factor * latent
+            print(f"setting self.scale_factor to {self.scale_factor}")
+            print("### USING STD-RESCALING ###")
+    # def on_train_epoch_start(self):
+    #     print("!!!!!!!!!!!!!!!!!!!!!!!!!!on_train_epoch_strat",self.trainer.train_dataloader.batch_sampler,hasattr(self.trainer.train_dataloader.batch_sampler,'set_epoch'))
+    #     if hasattr(self.trainer.train_dataloader.batch_sampler,'set_epoch'):
+    #         self.trainer.train_dataloader.batch_sampler.set_epoch(self.current_epoch)
+    #     return super().on_train_epoch_start()
+    def register_schedule(self,
+                          given_betas=None, beta_schedule="linear", timesteps=1000,
+                          linear_start=1e-4, linear_end=2e-2, cosine_s=8e-3):
+        super().register_schedule(given_betas, beta_schedule, timesteps, linear_start, linear_end, cosine_s)
+        self.shorten_cond_schedule = self.num_timesteps_cond > 1
+        if self.shorten_cond_schedule:
+            self.make_cond_schedule()
+    def instantiate_first_stage(self, config):
+        model = instantiate_from_config(config)
+        self.first_stage_model = model.eval()
+        self.first_stage_model.train = disabled_train
+        for param in self.first_stage_model.parameters():
+            param.requires_grad = False
+    def instantiate_cond_stage(self, config):
+        if not self.cond_stage_trainable:
+            if config == "__is_first_stage__":
+                print("Using first stage also as cond stage.")
+                self.cond_stage_model = self.first_stage_model
+            elif config == "__is_unconditional__":
+                print(f"Training {self.__class__.__name__} as an unconditional model.")
+                self.cond_stage_model = None
+            else:
+                model = instantiate_from_config(config)
+                self.cond_stage_model = model.eval()
+                self.cond_stage_model.train = disabled_train
+                for param in self.cond_stage_model.parameters():
+                    param.requires_grad = False
+        else:
+            assert config != '__is_first_stage__'
+            assert config != '__is_unconditional__'
+            model = instantiate_from_config(config)
+            self.cond_stage_model = model
+    def _get_denoise_row_from_list(self, samples, desc='', force_no_decoder_quantization=False):
+        denoise_row = []
+        for zd in tqdm(samples, desc=desc):
+            denoise_row.append(self.decode_first_stage(zd.to(self.device),
+                                                            force_not_quantize=force_no_decoder_quantization))
+        n_imgs_per_row = len(denoise_row)
+        if len(denoise_row[0].shape) == 3:
+            denoise_row = [x.unsqueeze(1) for x in denoise_row]
+        denoise_row = torch.stack(denoise_row)  # n_log_step, n_row, C, H, W
+        denoise_grid = rearrange(denoise_row, 'n b c h w -> b n c h w')
+        denoise_grid = rearrange(denoise_grid, 'b n c h w -> (b n) c h w')
+        denoise_grid = make_grid(denoise_grid, nrow=n_imgs_per_row)
+        return denoise_grid
+    def get_first_stage_encoding(self, encoder_posterior):
+        if isinstance(encoder_posterior, DiagonalGaussianDistribution):
+            z = encoder_posterior.sample()
+        elif isinstance(encoder_posterior, torch.Tensor):
+            z = encoder_posterior
+        else:
+            raise NotImplementedError(f"encoder_posterior of type '{type(encoder_posterior)}' not yet implemented")
+        return self.scale_factor * z
+    #@profile
+    def get_learned_conditioning(self, c):
+        if self.cond_stage_forward is None:
+            if hasattr(self.cond_stage_model, 'encode') and callable(self.cond_stage_model.encode):
+                c = self.cond_stage_model.encode(c)
+                if isinstance(c, DiagonalGaussianDistribution):
+                    c = c.mode()
+            else:
+                c = self.cond_stage_model(c)
+        else:
+            assert hasattr(self.cond_stage_model, self.cond_stage_forward)
+            c = getattr(self.cond_stage_model, self.cond_stage_forward)(c)
+        return c
+    @torch.no_grad()
+    def get_unconditional_conditioning(self, batch_size, null_label=None):
+        if null_label is not None:
+            xc = null_label
+            if isinstance(xc, ListConfig):
+                xc = list(xc)
+            if isinstance(xc, dict) or isinstance(xc, list):
+                c = self.get_learned_conditioning(xc)
+            else:
+                if hasattr(xc, "to"):
+                    xc = xc.to(self.device)
+                c = self.get_learned_conditioning(xc)
+        else:
+            if self.cond_stage_key in ["class_label", "cls"]:
+                xc = self.cond_stage_model.get_unconditional_conditioning(batch_size, device=self.device)
+                return self.get_learned_conditioning(xc)
+            else:
+                raise NotImplementedError("todo")
+        if isinstance(c, list):  # in case the encoder gives us a list
+            for i in range(len(c)):
+                c[i] = repeat(c[i], '1 ... -> b ...', b=batch_size).to(self.device)
+        else:
+            c = repeat(c, '1 ... -> b ...', b=batch_size).to(self.device)
+        return c
+    def meshgrid(self, h, w):
+        y = torch.arange(0, h).view(h, 1, 1).repeat(1, w, 1)
+        x = torch.arange(0, w).view(1, w, 1).repeat(h, 1, 1)
+        arr = torch.cat([y, x], dim=-1)
+        return arr
+    def delta_border(self, h, w):
+        """
+        :param h: height
+        :param w: width
+        :return: normalized distance to image border,
+         with min distance = 0 at border and max dist = 0.5 at image center
+        """
+        lower_right_corner = torch.tensor([h - 1, w - 1]).view(1, 1, 2)
+        arr = self.meshgrid(h, w) / lower_right_corner
+        dist_left_up = torch.min(arr, dim=-1, keepdims=True)[0]
+        dist_right_down = torch.min(1 - arr, dim=-1, keepdims=True)[0]
+        edge_dist = torch.min(torch.cat([dist_left_up, dist_right_down], dim=-1), dim=-1)[0]
+        return edge_dist
+    def get_weighting(self, h, w, Ly, Lx, device):
+        weighting = self.delta_border(h, w)
+        weighting = torch.clip(weighting, self.split_input_params["clip_min_weight"],
+                               self.split_input_params["clip_max_weight"], )
+        weighting = weighting.view(1, h * w, 1).repeat(1, 1, Ly * Lx).to(device)
+        if self.split_input_params["tie_braker"]:
+            L_weighting = self.delta_border(Ly, Lx)
+            L_weighting = torch.clip(L_weighting,
+                                     self.split_input_params["clip_min_tie_weight"],
+                                     self.split_input_params["clip_max_tie_weight"])
+            L_weighting = L_weighting.view(1, 1, Ly * Lx).to(device)
+            weighting = weighting * L_weighting
+        return weighting
+    def get_fold_unfold(self, x, kernel_size, stride, uf=1, df=1):  # todo load once not every time, shorten code
+        """
+        :param x: img of size (bs, c, h, w)
+        :return: n img crops of size (n, bs, c, kernel_size[0], kernel_size[1])
+        """
+        bs, nc, h, w = x.shape
+        # number of crops in image
+        Ly = (h - kernel_size[0]) // stride[0] + 1
+        Lx = (w - kernel_size[1]) // stride[1] + 1
+        if uf == 1 and df == 1:
+            fold_params = dict(kernel_size=kernel_size, dilation=1, padding=0, stride=stride)
+            unfold = torch.nn.Unfold(**fold_params)
+            fold = torch.nn.Fold(output_size=x.shape[2:], **fold_params)
+            weighting = self.get_weighting(kernel_size[0], kernel_size[1], Ly, Lx, x.device).to(x.dtype)
+            normalization = fold(weighting).view(1, 1, h, w)  # normalizes the overlap
+            weighting = weighting.view((1, 1, kernel_size[0], kernel_size[1], Ly * Lx))
+        elif uf > 1 and df == 1:
+            fold_params = dict(kernel_size=kernel_size, dilation=1, padding=0, stride=stride)
+            unfold = torch.nn.Unfold(**fold_params)
+            fold_params2 = dict(kernel_size=(kernel_size[0] * uf, kernel_size[1] * uf),
+                                dilation=1, padding=0,
+                                stride=(stride[0] * uf, stride[1] * uf))
+            fold = torch.nn.Fold(output_size=(x.shape[2] * uf, x.shape[3] * uf), **fold_params2)
+            weighting = self.get_weighting(kernel_size[0] * uf, kernel_size[1] * uf, Ly, Lx, x.device).to(x.dtype)
+            normalization = fold(weighting).view(1, 1, h * uf, w * uf)  # normalizes the overlap
+            weighting = weighting.view((1, 1, kernel_size[0] * uf, kernel_size[1] * uf, Ly * Lx))
+        elif df > 1 and uf == 1:
+            fold_params = dict(kernel_size=kernel_size, dilation=1, padding=0, stride=stride)
+            unfold = torch.nn.Unfold(**fold_params)
+            fold_params2 = dict(kernel_size=(kernel_size[0] // df, kernel_size[1] // df),
+                                dilation=1, padding=0,
+                                stride=(stride[0] // df, stride[1] // df))
+            fold = torch.nn.Fold(output_size=(x.shape[2] // df, x.shape[3] // df), **fold_params2)
+            weighting = self.get_weighting(kernel_size[0] // df, kernel_size[1] // df, Ly, Lx, x.device).to(x.dtype)
+            normalization = fold(weighting).view(1, 1, h // df, w // df)  # normalizes the overlap
+            weighting = weighting.view((1, 1, kernel_size[0] // df, kernel_size[1] // df, Ly * Lx))
+        else:
+            raise NotImplementedError
+        return fold, unfold, normalization, weighting
+    @torch.no_grad()
+    def get_input(self, batch, k, return_first_stage_outputs=False, force_c_encode=False,
+                  cond_key=None, return_original_cond=False, bs=None):
+        x = super().get_input(batch, k)
+        if bs is not None:
+            x = x[:bs]
+        x = x.to(self.device)
+        encoder_posterior = self.encode_first_stage(x)
+        z = self.get_first_stage_encoding(encoder_posterior).detach()
+        if self.model.conditioning_key is not None:
+            if cond_key is None:
+                cond_key = self.cond_stage_key
+            if cond_key != self.first_stage_key:
+                if cond_key in ['caption', 'coordinates_bbox', 'hybrid_feat']:
+                    xc = batch[cond_key]
+                elif cond_key == 'class_label':
+                    xc = batch
+                else:
+                    xc = super().get_input(batch, cond_key).to(self.device)
+            else:
+                xc = x
+            if not self.cond_stage_trainable or force_c_encode: # False
+                if isinstance(xc, dict) or isinstance(xc, list):
+                    # import pudb; pudb.set_trace()
+                    c = self.get_learned_conditioning(xc)
+                else:
+                    c = self.get_learned_conditioning(xc.to(self.device))
+            else:
+                c = xc
+            if bs is not None:
+                c = c[:bs]
+            # Testing #
+            if cond_key == 'masked_image':
+                mask = super().get_input(batch, "mask")
+                cc = torch.nn.functional.interpolate(mask, size=c.shape[-2:]) # [B, 1, 10, 106]
+                c = torch.cat((c, cc), dim=1) # [B, 5, 10, 106]
+            # Testing #
+            if self.use_positional_encodings:
+                pos_x, pos_y = self.compute_latent_shifts(batch)
+                ckey = __conditioning_keys__[self.model.conditioning_key]
+                c = {ckey: c, 'pos_x': pos_x, 'pos_y': pos_y}
+        else:
+            c = None
+            xc = None
+            if self.use_positional_encodings:
+                pos_x, pos_y = self.compute_latent_shifts(batch)
+                c = {'pos_x': pos_x, 'pos_y': pos_y}
+        out = [z, c]
+        if return_first_stage_outputs:
+            xrec = self.decode_first_stage(z)
+            out.extend([x, xrec])
+        if return_original_cond:
+            out.append(xc)
+        return out
+    @torch.no_grad()
+    def decode_first_stage(self, z, predict_cids=False, force_not_quantize=False):
+        if predict_cids:
+            if z.dim() == 4:
+                z = torch.argmax(z.exp(), dim=1).long()
+            z = self.first_stage_model.quantize.get_codebook_entry(z, shape=None)
+            z = rearrange(z, 'b h w c -> b c h w').contiguous()
+        z = 1. / self.scale_factor * z
+        if isinstance(self.first_stage_model, VQModelInterface):
+            return self.first_stage_model.decode(z, force_not_quantize=predict_cids or force_not_quantize)
+        else:
+            return self.first_stage_model.decode(z)
+    # same as above but without decorator
+    def differentiable_decode_first_stage(self, z, predict_cids=False, force_not_quantize=False):
+        if predict_cids:
+            if z.dim() == 4:
+                z = torch.argmax(z.exp(), dim=1).long()
+            z = self.first_stage_model.quantize.get_codebook_entry(z, shape=None)
+            z = rearrange(z, 'b h w c -> b c h w').contiguous()
+        z = 1. / self.scale_factor * z
+        if isinstance(self.first_stage_model, VQModelInterface):
+            return self.first_stage_model.decode(z, force_not_quantize=predict_cids or force_not_quantize)
+        else:
+            return self.first_stage_model.decode(z)
+    @torch.no_grad()
+    def encode_first_stage(self, x):
+        return self.first_stage_model.encode(x)
+    def shared_step(self, batch, **kwargs):
+        x, c = self.get_input(batch, self.first_stage_key)
+        loss = self(x, c)
+        return loss
+    def test_step(self,batch,batch_idx):
+        cond = batch[self.cond_stage_key] #  * self.test_repeat
+        cond = self.get_learned_conditioning(cond) # c: string -> [B, T, Context_dim]
+        batch_size = len(cond)
+        enc_emb = self.sample(cond,batch_size,timesteps=self.num_timesteps)# shape = [batch_size,self.channels,self.mel_dim,self.mel_length]
+        xrec = self.decode_first_stage(enc_emb)
+        # reconstructions = (xrec + 1)/2 # to mel scale
+        # test_ckpt_path = os.path.basename(self.trainer.tested_ckpt_path)
+        # savedir = os.path.join(self.trainer.log_dir,f'output_imgs_{test_ckpt_path}','fake_class')
+        # if not os.path.exists(savedir):
+        #     os.makedirs(savedir)
+        # file_names = batch['f_name']
+        # nfiles = len(file_names)
+        # reconstructions = reconstructions.cpu().numpy().squeeze(1) # squeeze channel dim
+        # for k in range(reconstructions.shape[0]):
+        #     b,repeat = k % nfiles, k // nfiles
+        #     vname_num_split_index = file_names[b].rfind('_')# file_names[b]:video_name+'_'+num
+        #     v_n,num = file_names[b][:vname_num_split_index],file_names[b][vname_num_split_index+1:]
+        #     save_img_path = os.path.join(savedir,f'{v_n}_sample_{num}_{repeat}.npy')# the num_th caption, the repeat_th repetition
+        #     np.save(save_img_path,reconstructions[b])
+        return None
+    def forward(self, x, c, *args, **kwargs):
+        '''
+        video to audio:
+        x (latent): [B, 256 (time), 20]  c (video feat): [B, 32 (time), 512]
+        '''
+        t = torch.randint(0, self.num_timesteps, (x.shape[0],), device=self.device).long() # [B]
+        if self.model.conditioning_key is not None:
+            assert c is not None
+            if self.cond_stage_trainable:
+                c = self.get_learned_conditioning(c) # c: string -> [B, T, Context_dim]
+            if self.shorten_cond_schedule:  # TODO: drop this option
+                tc = self.cond_ids[t].to(self.device)
+                c = self.q_sample(x_start=c, t=tc, noise=torch.randn_like(c.float()))
+        return self.p_losses(x, c, t, *args, **kwargs)
+    def apply_model(self, x_noisy, t, cond, return_ids=False):
+        if isinstance(cond, dict):
+            # hybrid case, cond is expected to be a dict
+            key = 'c_concat' if self.model.conditioning_key == 'concat' else 'c_crossattn'
+            cond = {key: cond}
+        else:
+            if not isinstance(cond, list):
+                cond = [cond]
+            if self.model.conditioning_key == "concat":
+                key = "c_concat"
+            elif self.model.conditioning_key == "crossattn":
+                key = "c_crossattn"
+            else:
+                key = "c_film"
+            cond = {key: cond}
+        x_recon = self.model(x_noisy, t, **cond)
+        if isinstance(x_recon, tuple) and not return_ids:
+            return x_recon[0]
+        else:
+            return x_recon
+    def _predict_eps_from_xstart(self, x_t, t, pred_xstart):
+        return (extract_into_tensor(self.sqrt_recip_alphas_cumprod, t, x_t.shape) * x_t - pred_xstart) / \
+               extract_into_tensor(self.sqrt_recipm1_alphas_cumprod, t, x_t.shape)
+    def _prior_bpd(self, x_start):
+        """
+        Get the prior KL term for the variational lower-bound, measured in
+        bits-per-dim.
+        This term can't be optimized, as it only depends on the encoder.
+        :param x_start: the [N x C x ...] tensor of inputs.
+        :return: a batch of [N] KL values (in bits), one per batch element.
+        """
+        batch_size = x_start.shape[0]
+        t = torch.tensor([self.num_timesteps - 1] * batch_size, device=x_start.device)
+        qt_mean, _, qt_log_variance = self.q_mean_variance(x_start, t)
+        kl_prior = normal_kl(mean1=qt_mean, logvar1=qt_log_variance, mean2=0.0, logvar2=0.0)
+        return mean_flat(kl_prior) / np.log(2.0)
+    def p_losses(self, x_start, cond, t, noise=None):
+        noise = default(noise, lambda: torch.randn_like(x_start))
+        x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise)
+        model_output = self.apply_model(x_noisy, t, cond)
+        loss_dict = {}
+        prefix = 'train' if self.training else 'val'
+        if self.parameterization == "x0":
+            target = x_start
+        elif self.parameterization == "eps":
+            target = noise
+        else:
+            raise NotImplementedError()
+        mean_dims = list(range(1,len(target.shape)))
+        loss_simple = self.get_loss(model_output, target, mean=False).mean(dim=mean_dims)
+        loss_dict.update({f'{prefix}/loss_simple': loss_simple.mean()})
+        logvar_t = self.logvar[t.to(self.logvar.device)].to(self.device)
+        loss = loss_simple / torch.exp(logvar_t) + logvar_t
+        # loss = loss_simple / torch.exp(self.logvar) + self.logvar
+        if self.learn_logvar:
+            loss_dict.update({f'{prefix}/loss_gamma': loss.mean()})
+            loss_dict.update({'logvar': self.logvar.data.mean()})
+        loss = self.l_simple_weight * loss.mean()
+        loss_vlb = self.get_loss(model_output, target, mean=False).mean(dim=mean_dims)
+        loss_vlb = (self.lvlb_weights[t] * loss_vlb).mean()
+        loss_dict.update({f'{prefix}/loss_vlb': loss_vlb})
+        loss += (self.original_elbo_weight * loss_vlb)
+        loss_dict.update({f'{prefix}/loss': loss})
+        return loss, loss_dict
+    def p_mean_variance(self, x, c, t, clip_denoised: bool, return_codebook_ids=False, quantize_denoised=False,
+                        return_x0=False, score_corrector=None, corrector_kwargs=None):
+        t_in = t
+        model_out = self.apply_model(x, t_in, c, return_ids=return_codebook_ids)
+        if score_corrector is not None:
+            assert self.parameterization == "eps"
+            model_out = score_corrector.modify_score(self, model_out, x, t, c, **corrector_kwargs)
+        if return_codebook_ids:
+            model_out, logits = model_out
+        if self.parameterization == "eps":
+            x_recon = self.predict_start_from_noise(x, t=t, noise=model_out)
+        elif self.parameterization == "x0":
+            x_recon = model_out
+        else:
+            raise NotImplementedError()
+        if clip_denoised:
+            x_recon.clamp_(-1., 1.)
+        if quantize_denoised:
+            x_recon, _, [_, _, indices] = self.first_stage_model.quantize(x_recon)
+        model_mean, posterior_variance, posterior_log_variance = self.q_posterior(x_start=x_recon, x_t=x, t=t)
+        if return_codebook_ids:
+            return model_mean, posterior_variance, posterior_log_variance, logits
+        elif return_x0:
+            return model_mean, posterior_variance, posterior_log_variance, x_recon
+        else:
+            return model_mean, posterior_variance, posterior_log_variance
+    @torch.no_grad()
+    def p_sample(self, x, c, t, clip_denoised=False, repeat_noise=False,
+                 return_codebook_ids=False, quantize_denoised=False, return_x0=False,
+                 temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None):
+        b, *_, device = *x.shape, x.device
+        outputs = self.p_mean_variance(x=x, c=c, t=t, clip_denoised=clip_denoised,
+                                       return_codebook_ids=return_codebook_ids,
+                                       quantize_denoised=quantize_denoised,
+                                       return_x0=return_x0,
+                                       score_corrector=score_corrector, corrector_kwargs=corrector_kwargs)
+        if return_codebook_ids:
+            raise DeprecationWarning("Support dropped.")
+            model_mean, _, model_log_variance, logits = outputs
+        elif return_x0:
+            model_mean, _, model_log_variance, x0 = outputs
+        else:
+            model_mean, _, model_log_variance = outputs
+        noise = noise_like(x.shape, device, repeat_noise) * temperature
+        if noise_dropout > 0.:
+            noise = torch.nn.functional.dropout(noise, p=noise_dropout)
+        # no noise when t == 0
+        nonzero_mask = (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1)))
+        if return_codebook_ids:
+            return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise, logits.argmax(dim=1)
+        if return_x0:
+            return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise, x0
+        else:
+            return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise
+    @torch.no_grad()
+    def progressive_denoising(self, cond, shape, verbose=True, callback=None, quantize_denoised=False,
+                              img_callback=None, mask=None, x0=None, temperature=1., noise_dropout=0.,
+                              score_corrector=None, corrector_kwargs=None, batch_size=None, x_T=None, start_T=None,
+                              log_every_t=None):
+        if not log_every_t:
+            log_every_t = self.log_every_t
+        timesteps = self.num_timesteps
+        if batch_size is not None:
+            b = batch_size
+            shape = [batch_size] + list(shape)
+        else:
+            b = batch_size = shape[0]
+        if x_T is None:
+            img = torch.randn(shape, device=self.device)
+        else:
+            img = x_T
+        intermediates = []
+        if cond is not None:
+            if isinstance(cond, dict):
+                cond = {key: cond[key][:batch_size] if not isinstance(cond[key], list) else
+                list(map(lambda x: x[:batch_size], cond[key])) for key in cond}
+            else:
+                cond = [c[:batch_size] for c in cond] if isinstance(cond, list) else cond[:batch_size]
+        if start_T is not None:
+            timesteps = min(timesteps, start_T)
+        iterator = tqdm(reversed(range(0, timesteps)), desc='Progressive Generation',
+                        total=timesteps) if verbose else reversed(
+            range(0, timesteps))
+        if isinstance(temperature, float):
+            temperature = [temperature] * timesteps
+        for i in iterator:
+            ts = torch.full((b,), i, device=self.device, dtype=torch.long)
+            if self.shorten_cond_schedule:
+                assert self.model.conditioning_key != 'hybrid'
+                tc = self.cond_ids[ts].to(cond.device)
+                cond = self.q_sample(x_start=cond, t=tc, noise=torch.randn_like(cond))
+            img, x0_partial = self.p_sample(img, cond, ts,
+                                            clip_denoised=self.clip_denoised,
+                                            quantize_denoised=quantize_denoised, return_x0=True,
+                                            temperature=temperature[i], noise_dropout=noise_dropout,
+                                            score_corrector=score_corrector, corrector_kwargs=corrector_kwargs)
+            if mask is not None:
+                assert x0 is not None
+                img_orig = self.q_sample(x0, ts)
+                img = img_orig * mask + (1. - mask) * img
+            if i % log_every_t == 0 or i == timesteps - 1:
+                intermediates.append(x0_partial)
+            if callback: callback(i)
+            if img_callback: img_callback(img, i)
+        return img, intermediates
+    @torch.no_grad()
+    def p_sample_loop(self, cond, shape, return_intermediates=False,
+                      x_T=None, verbose=True, callback=None, timesteps=None, quantize_denoised=False,
+                      mask=None, x0=None, img_callback=None, start_T=None,
+                      log_every_t=None):
+        if not log_every_t:
+            log_every_t = self.log_every_t
+        device = self.betas.device
+        b = shape[0]
+        if x_T is None:
+            img = torch.randn(shape, device=device)
+        else:
+            img = x_T
+        intermediates = [img]
+        if timesteps is None:
+            timesteps = self.num_timesteps
+        if start_T is not None:
+            timesteps = min(timesteps, start_T)
+        iterator = tqdm(reversed(range(0, timesteps)), desc='Sampling t', total=timesteps) if verbose else reversed(
+            range(0, timesteps))
+        if mask is not None:
+            assert x0 is not None
+            assert x0.shape[2:3] == mask.shape[2:3]  # spatial size has to match
+        for i in iterator:
+            ts = torch.full((b,), i, device=device, dtype=torch.long) # num
+            if self.shorten_cond_schedule: # False
+                assert self.model.conditioning_key != 'hybrid'
+                tc = self.cond_ids[ts].to(cond.device)
+                cond = self.q_sample(x_start=cond, t=tc, noise=torch.randn_like(cond))
+            img = self.p_sample(img, cond, ts,
+                                clip_denoised=self.clip_denoised, # False
+                                quantize_denoised=quantize_denoised) # False
+            if mask is not None: # False
+                img_orig = self.q_sample(x0, ts)
+                img = img_orig * mask + (1. - mask) * img
+            if i % log_every_t == 0 or i == timesteps - 1:
+                intermediates.append(img)
+            if callback: callback(i)
+            if img_callback: img_callback(img, i)
+        if return_intermediates:
+            return img, intermediates
+        return img
+    @torch.no_grad()
+    def sample(self, cond, batch_size=16, return_intermediates=False, x_T=None,
+               verbose=True, timesteps=None, quantize_denoised=False,
+               mask=None, x0=None, shape=None,**kwargs):
+        if shape is None:
+            if self.channels > 0:
+                shape = (batch_size, self.channels, self.mel_dim, self.mel_length)
+            else:
+                shape = (batch_size, self.mel_dim, self.mel_length)
+        if cond is not None:
+            if isinstance(cond, dict):
+                cond = {key: cond[key][:batch_size] if not isinstance(cond[key], list) else
+                list(map(lambda x: x[:batch_size], cond[key])) for key in cond}
+            else:
+                cond = [c[:batch_size] for c in cond] if isinstance(cond, list) else cond[:batch_size]
+        return self.p_sample_loop(cond,
+                                  shape,
+                                  return_intermediates=return_intermediates, x_T=x_T,
+                                  verbose=verbose, timesteps=timesteps, quantize_denoised=quantize_denoised,
+                                  mask=mask, x0=x0)
+    @torch.no_grad()
+    def sample_log(self,cond,batch_size,ddim, ddim_steps,**kwargs):
+        if ddim:
+            ddim_sampler = DDIMSampler(self)
+            shape = (self.channels, self.mel_dim, self.mel_length) if self.channels > 0 else (self.mel_dim, self.mel_length)
+            samples, intermediates = ddim_sampler.sample(ddim_steps,batch_size,
+                                                        shape,cond,verbose=False,**kwargs)
+        else:
+            samples, intermediates = self.sample(cond=cond, batch_size=batch_size,
+                                                 return_intermediates=True,**kwargs)
+        return samples, intermediates
+    @torch.no_grad()
+    def log_images(self, batch, N=8, n_row=4, sample=True, ddim_steps=200, ddim_eta=1., return_keys=None,
+                   quantize_denoised=True, inpaint=False, plot_denoise_rows=False, plot_progressive_rows=True,
+                   plot_diffusion_rows=True, **kwargs):
+        use_ddim = ddim_steps is not None
+        log = dict()
+        z, c, x, xrec, xc = self.get_input(batch, self.first_stage_key,
+                                           return_first_stage_outputs=True,
+                                           force_c_encode=True,
+                                           return_original_cond=True,
+                                           bs=N)  # z: latent, c: condition embedding, xc: original condition (e.g. caption list)
+        N = min(x.shape[0], N)
+        n_row = min(x.shape[0], n_row)
+        log["inputs"] = x if len(x.shape)==4 else x.unsqueeze(1)
+        log["reconstruction"] = xrec if len(xrec.shape)==4 else xrec.unsqueeze(1)
+        if self.model.conditioning_key is not None:
+            if hasattr(self.cond_stage_model, "decode") and self.cond_stage_key != "masked_image":
+                xc = self.cond_stage_model.decode(c)
+                log["conditioning"] = xc
+            elif self.cond_stage_key == "masked_image":
+                log["mask"] = c[:, -1, :, :][:, None, :, :]
+                xc = self.cond_stage_model.decode(c[:, :self.cond_stage_model.embed_dim, :, :])
+                log["conditioning"] = xc
+            elif self.cond_stage_key in ["caption"]:
+                pass
+                # xc = log_txt_as_img((256, 256), batch["caption"])
+                # log["conditioning"] = xc
+            elif self.cond_stage_key == 'class_label':
+                xc = log_txt_as_img((x.shape[2], x.shape[3]), batch["human_label"])
+                log['conditioning'] = xc
+            elif isimage(xc):
+                log["conditioning"] = xc
+        if plot_diffusion_rows:
+            # get diffusion row
+            diffusion_row = list()
+            z_start = z[:n_row]
+            for t in range(self.num_timesteps):
+                if t % self.log_every_t == 0 or t == self.num_timesteps - 1:
+                    t = repeat(torch.tensor([t]), '1 -> b', b=n_row)
+                    t = t.to(self.device).long()
+                    noise = torch.randn_like(z_start)
+                    z_noisy = self.q_sample(x_start=z_start, t=t, noise=noise)
+                    diffusion_row.append(self.decode_first_stage(z_noisy))
+            if len(diffusion_row[0].shape) == 3:
+                diffusion_row = [x.unsqueeze(1) for x in diffusion_row]
+            diffusion_row = torch.stack(diffusion_row)  # n_log_step, n_row, C, H, W
+            diffusion_grid = rearrange(diffusion_row, 'n b c h w -> b n c h w')
+            diffusion_grid = rearrange(diffusion_grid, 'b n c h w -> (b n) c h w')
+            diffusion_grid = make_grid(diffusion_grid, nrow=diffusion_row.shape[0])
+            log["diffusion_row"] = diffusion_grid
+        if sample:
+            # get denoise row
+            with self.ema_scope("Plotting"):
+                samples, z_denoise_row = self.sample_log(cond=c,batch_size=N,ddim=use_ddim,
+                                                         ddim_steps=ddim_steps,eta=ddim_eta)
+                # samples, z_denoise_row = self.sample(cond=c, batch_size=N, return_intermediates=True)
+            x_samples = self.decode_first_stage(samples)
+            log["samples"] = x_samples if len(x_samples.shape)==4 else x_samples.unsqueeze(1)
+            if plot_denoise_rows:
+                denoise_grid = self._get_denoise_row_from_list(z_denoise_row)
+                log["denoise_row"] = denoise_grid
+            if quantize_denoised and not isinstance(self.first_stage_model, AutoencoderKL) and not isinstance(
+                    self.first_stage_model, IdentityFirstStage):
+                # also display when quantizing x0 while sampling
+                with self.ema_scope("Plotting Quantized Denoised"):
+                    samples, z_denoise_row = self.sample_log(cond=c,batch_size=N,ddim=use_ddim,
+                                                             ddim_steps=ddim_steps,eta=ddim_eta,
+                                                             quantize_denoised=True)
+                    # samples, z_denoise_row = self.sample(cond=c, batch_size=N, return_intermediates=True,
+                    #                                      quantize_denoised=True)
+                x_samples = self.decode_first_stage(samples.to(self.device))
+                log["samples_x0_quantized"] = x_samples if len(x_samples.shape)==4 else x_samples.unsqueeze(1)
+            if inpaint:
+                # make a simple center square
+                b, h, w = z.shape[0], z.shape[2], z.shape[3]
+                mask = torch.ones(N, h, w).to(self.device)
+                # zeros will be filled in
+                mask[:, h // 4:3 * h // 4, w // 4:3 * w // 4] = 0.
+                mask = mask[:, None, ...]
+                with self.ema_scope("Plotting Inpaint"):
+                    samples, _ = self.sample_log(cond=c,batch_size=N,ddim=use_ddim, eta=ddim_eta,
+                                                ddim_steps=ddim_steps, x0=z[:N], mask=mask)
+                x_samples = self.decode_first_stage(samples.to(self.device))
+                log["samples_inpainting"] = x_samples
+                log["mask_inpainting"] = mask
+                # outpaint
+                mask = 1 - mask
+                with self.ema_scope("Plotting Outpaint"):
+                    samples, _ = self.sample_log(cond=c, batch_size=N, ddim=use_ddim,eta=ddim_eta,
+                                                ddim_steps=ddim_steps, x0=z[:N], mask=mask)
+                x_samples = self.decode_first_stage(samples.to(self.device))
+                log["samples_outpainting"] = x_samples
+                log["mask_outpainting"] = mask
+        if plot_progressive_rows:
+            with self.ema_scope("Plotting Progressives"):
+                shape = (self.channels, self.mel_dim, self.mel_length) if self.channels > 0 else (self.mel_dim, self.mel_length)
+                img, progressives = self.progressive_denoising(c,
+                                                               shape=shape,
+                                                               batch_size=N)
+            prog_row = self._get_denoise_row_from_list(progressives, desc="Progressive Generation")
+            log["progressive_row"] = prog_row
+        if return_keys:
+            if np.intersect1d(list(log.keys()), return_keys).shape[0] == 0:
+                return log
+            else:
+                return {key: log[key] for key in return_keys}
+        return log
+    def configure_optimizers(self):
+        lr = self.learning_rate
+        params = list(self.model.parameters())
+        if self.cond_stage_trainable:
+            print(f"{self.__class__.__name__}: Also optimizing conditioner params!")
+            params = params + list(self.cond_stage_model.parameters())
+        if self.learn_logvar:
+            print('Diffusion model optimizing logvar')
+            params.append(self.logvar)
+        opt = torch.optim.AdamW(params, lr=lr)
+        if self.use_scheduler:
+            assert 'target' in self.scheduler_config
+            scheduler = instantiate_from_config(self.scheduler_config)
+            print("Setting up LambdaLR scheduler...")
+            scheduler = [
+                {
+                    'scheduler': LambdaLR(opt, lr_lambda=scheduler.schedule),
+                    'interval': 'step',
+                    'frequency': 1
+                }]
+            return [opt], scheduler
+        return opt

ldm/models/diffusion/plms.py ADDED

	@@ -0,0 +1,236 @@

+"""SAMPLING ONLY."""
+import torch
+import numpy as np
+from tqdm import tqdm
+from functools import partial
+from ldm.modules.diffusionmodules.util import make_ddim_sampling_parameters, make_ddim_timesteps, noise_like
+class PLMSSampler(object):
+    def __init__(self, model, schedule="linear", **kwargs):
+        super().__init__()
+        self.model = model
+        self.ddpm_num_timesteps = model.num_timesteps
+        self.schedule = schedule
+    def register_buffer(self, name, attr):
+        if isinstance(attr, torch.Tensor):
+            # move buffers to the model's device instead of assuming CUDA is available
+            attr = attr.to(self.model.device)
+        setattr(self, name, attr)
+    def make_schedule(self, ddim_num_steps, ddim_discretize="uniform", ddim_eta=0., verbose=True):
+        if ddim_eta != 0:
+            raise ValueError('ddim_eta must be 0 for PLMS')
+        self.ddim_timesteps = make_ddim_timesteps(ddim_discr_method=ddim_discretize, num_ddim_timesteps=ddim_num_steps,
+                                                  num_ddpm_timesteps=self.ddpm_num_timesteps,verbose=verbose)
+        alphas_cumprod = self.model.alphas_cumprod
+        assert alphas_cumprod.shape[0] == self.ddpm_num_timesteps, 'alphas have to be defined for each timestep'
+        to_torch = lambda x: x.clone().detach().to(torch.float32).to(self.model.device)
+        self.register_buffer('betas', to_torch(self.model.betas))
+        self.register_buffer('alphas_cumprod', to_torch(alphas_cumprod))
+        self.register_buffer('alphas_cumprod_prev', to_torch(self.model.alphas_cumprod_prev))
+        # calculations for diffusion q(x_t | x_{t-1}) and others
+        self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod.cpu())))
+        self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod.cpu())))
+        self.register_buffer('log_one_minus_alphas_cumprod', to_torch(np.log(1. - alphas_cumprod.cpu())))
+        self.register_buffer('sqrt_recip_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu())))
+        self.register_buffer('sqrt_recipm1_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu() - 1)))
+        # ddim sampling parameters
+        ddim_sigmas, ddim_alphas, ddim_alphas_prev = make_ddim_sampling_parameters(alphacums=alphas_cumprod.cpu(),
+                                                                                   ddim_timesteps=self.ddim_timesteps,
+                                                                                   eta=ddim_eta,verbose=verbose)
+        self.register_buffer('ddim_sigmas', ddim_sigmas)
+        self.register_buffer('ddim_alphas', ddim_alphas)
+        self.register_buffer('ddim_alphas_prev', ddim_alphas_prev)
+        self.register_buffer('ddim_sqrt_one_minus_alphas', np.sqrt(1. - ddim_alphas))
+        sigmas_for_original_sampling_steps = ddim_eta * torch.sqrt(
+            (1 - self.alphas_cumprod_prev) / (1 - self.alphas_cumprod) * (
+                        1 - self.alphas_cumprod / self.alphas_cumprod_prev))
+        self.register_buffer('ddim_sigmas_for_original_num_steps', sigmas_for_original_sampling_steps)
+    @torch.no_grad()
+    def sample(self,
+               S,
+               batch_size,
+               shape,
+               conditioning=None,
+               callback=None,
+               normals_sequence=None,
+               img_callback=None,
+               quantize_x0=False,
+               eta=0.,
+               mask=None,
+               x0=None,
+               temperature=1.,
+               noise_dropout=0.,
+               score_corrector=None,
+               corrector_kwargs=None,
+               verbose=True,
+               x_T=None,
+               log_every_t=100,
+               unconditional_guidance_scale=1.,
+               unconditional_conditioning=None,
+               # unconditional_conditioning has to come in the same format as the conditioning, e.g. as encoded tokens
+               **kwargs
+               ):
+        if conditioning is not None:
+            if isinstance(conditioning, dict):
+                cbs = conditioning[list(conditioning.keys())[0]].shape[0]
+                if cbs != batch_size:
+                    print(f"Warning: Got {cbs} conditionings but batch-size is {batch_size}")
+            else:
+                if conditioning.shape[0] != batch_size:
+                    print(f"Warning: Got {conditioning.shape[0]} conditionings but batch-size is {batch_size}")
+        self.make_schedule(ddim_num_steps=S, ddim_eta=eta, verbose=verbose)
+        # sampling
+        C, H, W = shape
+        size = (batch_size, C, H, W)
+        print(f'Data shape for PLMS sampling is {size}')
+        samples, intermediates = self.plms_sampling(conditioning, size,
+                                                    callback=callback,
+                                                    img_callback=img_callback,
+                                                    quantize_denoised=quantize_x0,
+                                                    mask=mask, x0=x0,
+                                                    ddim_use_original_steps=False,
+                                                    noise_dropout=noise_dropout,
+                                                    temperature=temperature,
+                                                    score_corrector=score_corrector,
+                                                    corrector_kwargs=corrector_kwargs,
+                                                    x_T=x_T,
+                                                    log_every_t=log_every_t,
+                                                    unconditional_guidance_scale=unconditional_guidance_scale,
+                                                    unconditional_conditioning=unconditional_conditioning,
+                                                    )
+        return samples, intermediates
+    @torch.no_grad()
+    def plms_sampling(self, cond, shape,
+                      x_T=None, ddim_use_original_steps=False,
+                      callback=None, timesteps=None, quantize_denoised=False,
+                      mask=None, x0=None, img_callback=None, log_every_t=100,
+                      temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None,
+                      unconditional_guidance_scale=1., unconditional_conditioning=None,):
+        device = self.model.betas.device
+        b = shape[0]
+        if x_T is None:
+            img = torch.randn(shape, device=device)
+        else:
+            img = x_T
+        if timesteps is None:
+            timesteps = self.ddpm_num_timesteps if ddim_use_original_steps else self.ddim_timesteps
+        elif timesteps is not None and not ddim_use_original_steps:
+            subset_end = int(min(timesteps / self.ddim_timesteps.shape[0], 1) * self.ddim_timesteps.shape[0]) - 1
+            timesteps = self.ddim_timesteps[:subset_end]
+        intermediates = {'x_inter': [img], 'pred_x0': [img]}
+        time_range = list(reversed(range(0,timesteps))) if ddim_use_original_steps else np.flip(timesteps)
+        total_steps = timesteps if ddim_use_original_steps else timesteps.shape[0]
+        print(f"Running PLMS Sampling with {total_steps} timesteps")
+        iterator = tqdm(time_range, desc='PLMS Sampler', total=total_steps)
+        old_eps = []
+        for i, step in enumerate(iterator):
+            index = total_steps - i - 1
+            ts = torch.full((b,), step, device=device, dtype=torch.long)
+            ts_next = torch.full((b,), time_range[min(i + 1, len(time_range) - 1)], device=device, dtype=torch.long)
+            if mask is not None:
+                assert x0 is not None
+                img_orig = self.model.q_sample(x0, ts)  # TODO: deterministic forward pass?
+                img = img_orig * mask + (1. - mask) * img
+            outs = self.p_sample_plms(img, cond, ts, index=index, use_original_steps=ddim_use_original_steps,
+                                      quantize_denoised=quantize_denoised, temperature=temperature,
+                                      noise_dropout=noise_dropout, score_corrector=score_corrector,
+                                      corrector_kwargs=corrector_kwargs,
+                                      unconditional_guidance_scale=unconditional_guidance_scale,
+                                      unconditional_conditioning=unconditional_conditioning,
+                                      old_eps=old_eps, t_next=ts_next)
+            img, pred_x0, e_t = outs
+            old_eps.append(e_t)
+            if len(old_eps) >= 4:
+                old_eps.pop(0)
+            if callback: callback(i)
+            if img_callback: img_callback(pred_x0, i)
+            if index % log_every_t == 0 or index == total_steps - 1:
+                intermediates['x_inter'].append(img)
+                intermediates['pred_x0'].append(pred_x0)
+        return img, intermediates
+    @torch.no_grad()
+    def p_sample_plms(self, x, c, t, index, repeat_noise=False, use_original_steps=False, quantize_denoised=False,
+                      temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None,
+                      unconditional_guidance_scale=1., unconditional_conditioning=None, old_eps=None, t_next=None):
+        b, *_, device = *x.shape, x.device
+        def get_model_output(x, t):
+            if unconditional_conditioning is None or unconditional_guidance_scale == 1.:
+                e_t = self.model.apply_model(x, t, c)
+            else:
+                x_in = torch.cat([x] * 2)
+                t_in = torch.cat([t] * 2)
+                c_in = torch.cat([unconditional_conditioning, c])
+                e_t_uncond, e_t = self.model.apply_model(x_in, t_in, c_in).chunk(2)
+                e_t = e_t_uncond + unconditional_guidance_scale * (e_t - e_t_uncond)
+            if score_corrector is not None:
+                assert self.model.parameterization == "eps"
+                e_t = score_corrector.modify_score(self.model, e_t, x, t, c, **corrector_kwargs)
+            return e_t
+        alphas = self.model.alphas_cumprod if use_original_steps else self.ddim_alphas
+        alphas_prev = self.model.alphas_cumprod_prev if use_original_steps else self.ddim_alphas_prev
+        sqrt_one_minus_alphas = self.model.sqrt_one_minus_alphas_cumprod if use_original_steps else self.ddim_sqrt_one_minus_alphas
+        # ddim_sigmas_for_original_num_steps is registered on the sampler in make_schedule, not on the model
+        sigmas = self.ddim_sigmas_for_original_num_steps if use_original_steps else self.ddim_sigmas
+        def get_x_prev_and_pred_x0(e_t, index):
+            # select parameters corresponding to the currently considered timestep
+            a_t = torch.full((b, 1, 1, 1), alphas[index], device=device)
+            a_prev = torch.full((b, 1, 1, 1), alphas_prev[index], device=device)
+            sigma_t = torch.full((b, 1, 1, 1), sigmas[index], device=device)
+            sqrt_one_minus_at = torch.full((b, 1, 1, 1), sqrt_one_minus_alphas[index],device=device)
+            # current prediction for x_0
+            pred_x0 = (x - sqrt_one_minus_at * e_t) / a_t.sqrt()
+            if quantize_denoised:
+                pred_x0, _, *_ = self.model.first_stage_model.quantize(pred_x0)
+            # direction pointing to x_t
+            dir_xt = (1. - a_prev - sigma_t**2).sqrt() * e_t
+            noise = sigma_t * noise_like(x.shape, device, repeat_noise) * temperature
+            if noise_dropout > 0.:
+                noise = torch.nn.functional.dropout(noise, p=noise_dropout)
+            x_prev = a_prev.sqrt() * pred_x0 + dir_xt + noise
+            return x_prev, pred_x0
+        e_t = get_model_output(x, t)
+        if len(old_eps) == 0:
+            # Pseudo Improved Euler (2nd order)
+            x_prev, pred_x0 = get_x_prev_and_pred_x0(e_t, index)
+            e_t_next = get_model_output(x_prev, t_next)
+            e_t_prime = (e_t + e_t_next) / 2
+        elif len(old_eps) == 1:
+            # 2nd order Pseudo Linear Multistep (Adams-Bashforth)
+            e_t_prime = (3 * e_t - old_eps[-1]) / 2
+        elif len(old_eps) == 2:
+            # 3rd order Pseudo Linear Multistep (Adams-Bashforth)
+            e_t_prime = (23 * e_t - 16 * old_eps[-1] + 5 * old_eps[-2]) / 12
+        elif len(old_eps) >= 3:
+            # 4th order Pseudo Linear Multistep (Adams-Bashforth)
+            e_t_prime = (55 * e_t - 59 * old_eps[-1] + 37 * old_eps[-2] - 9 * old_eps[-3]) / 24
+        x_prev, pred_x0 = get_x_prev_and_pred_x0(e_t_prime, index)
+        return x_prev, pred_x0, e_t
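
The multistep branches at the end of `p_sample_plms` can be read in isolation as a weighted blend of the current noise estimate with up to three earlier ones. The sketch below is illustrative only: `combine_eps` is a hypothetical helper that is not part of `plms.py`, it works on plain numbers rather than tensors, and it does not model the improved-Euler warm-up step, which needs a second model evaluation.

```python
# Illustrative sketch: `combine_eps` is a hypothetical helper, not part of plms.py.
def combine_eps(e_t, old_eps):
    """Blend the current estimate e_t with previously stored estimates.

    Uses the same Adams-Bashforth weights as the branches in p_sample_plms.
    With an empty history the real sampler performs an improved-Euler step
    (two model calls); here e_t is simply returned unchanged.
    """
    if len(old_eps) == 0:
        return e_t
    if len(old_eps) == 1:
        # 2nd order
        return (3 * e_t - old_eps[-1]) / 2
    if len(old_eps) == 2:
        # 3rd order
        return (23 * e_t - 16 * old_eps[-1] + 5 * old_eps[-2]) / 12
    # 4th order, using the three most recent stored estimates
    return (55 * e_t - 59 * old_eps[-1] + 37 * old_eps[-2] - 9 * old_eps[-3]) / 24
```

In every branch the weights sum to one, so a constant sequence of estimates is returned unchanged; the higher orders only differ in how they extrapolate when the estimates vary between steps.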

ldm/models/diffusion/transport/__init__.py ADDED

	@@ -0,0 +1,73 @@

+from .transport import Transport, ModelType, WeightType, PathType, SNRType, Sampler
+def create_transport(
+    path_type='Linear',
+    prediction="velocity",
+    loss_weight=None,
+    train_eps=None,
+    sample_eps=None,
+    snr_type="uniform"
+):
+    """function for creating Transport object
+    **Note**: model prediction defaults to velocity
+    Args:
+    - path_type: type of path to use ("Linear", "GVP", or "VP"); defaults to "Linear"
+    - prediction: model prediction target ("velocity", "score", or "noise"); any other value falls back to velocity
+    - loss_weight: loss weighting ("velocity" or "likelihood"); any other value, including None, disables weighting
+    - train_eps: small epsilon for avoiding instability during training
+    - sample_eps: small epsilon for avoiding instability during sampling
+    - snr_type: "uniform" or "lognorm", mapped to SNRType; any other value raises ValueError
+    """
+    if prediction == "noise":
+        model_type = ModelType.NOISE
+    elif prediction == "score":
+        model_type = ModelType.SCORE
+    else:
+        model_type = ModelType.VELOCITY
+    if loss_weight == "velocity":
+        loss_type = WeightType.VELOCITY
+    elif loss_weight == "likelihood":
+        loss_type = WeightType.LIKELIHOOD
+    else:
+        loss_type = WeightType.NONE
+    if snr_type == "lognorm":
+        snr_type = SNRType.LOGNORM
+    elif snr_type == "uniform":
+        snr_type = SNRType.UNIFORM
+    else:
+        raise ValueError(f"Invalid snr type {snr_type}")
+    path_choice = {
+        "Linear": PathType.LINEAR,
+        "GVP": PathType.GVP,
+        "VP": PathType.VP,
+    }
+    path_type = path_choice[path_type]
+    if (path_type in [PathType.VP]):
+        train_eps = 1e-5 if train_eps is None else train_eps
+        sample_eps = 1e-3 if sample_eps is None else sample_eps
+    elif (path_type in [PathType.GVP, PathType.LINEAR] and model_type != ModelType.VELOCITY):
+        train_eps = 1e-3 if train_eps is None else train_eps
+        sample_eps = 1e-3 if sample_eps is None else sample_eps
+    else: # velocity & [GVP, LINEAR] is stable everywhere
+        train_eps = 0
+        sample_eps = 0
+    # create flow state
+    state = Transport(
+        model_type=model_type,
+        path_type=path_type,
+        loss_type=loss_type,
+        train_eps=train_eps,
+        sample_eps=sample_eps,
+        snr_type=snr_type
+    )
+    return state
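
The epsilon selection in `create_transport` can be sketched on its own, without the `Transport` enums. This is a minimal illustration under stated assumptions: `resolve_eps` is a hypothetical name, path and prediction types are passed as plain strings, and each default is applied only when its own argument is `None`.

```python
# Illustrative sketch: `resolve_eps` is a hypothetical helper, not part of the transport package.
def resolve_eps(path_type, prediction, train_eps=None, sample_eps=None):
    """Return (train_eps, sample_eps) following the branching in create_transport."""
    # As in create_transport, any prediction other than "noise" or "score" means velocity.
    predicts_velocity = prediction not in ("noise", "score")
    if path_type == "VP":
        train_eps = 1e-5 if train_eps is None else train_eps
        sample_eps = 1e-3 if sample_eps is None else sample_eps
    elif path_type in ("GVP", "Linear") and not predicts_velocity:
        train_eps = 1e-3 if train_eps is None else train_eps
        sample_eps = 1e-3 if sample_eps is None else sample_eps
    else:
        # velocity prediction on GVP or Linear paths: both epsilons are forced to 0,
        # overriding any values the caller supplied
        train_eps, sample_eps = 0, 0
    return train_eps, sample_eps


print(resolve_eps("Linear", "velocity"))
print(resolve_eps("VP", "noise", sample_eps=0.01))
```

Note that the final branch discards caller-supplied values, which matches the `else` branch in `create_transport`.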