kiuuiro committed
Commit d542179 · verified · 1 Parent(s): e16c301

Upload 20 files
LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2024 Neo
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
README.md CHANGED
@@ -1,3 +1,148 @@
1
- ---
2
- license: mit
3
- ---
1
+ # Python RVC Inference
2
+
3
+ > [!NOTE]
4
+ > This project is still under development.
5
+
6
+ ![PyPI](https://img.shields.io/pypi/v/rvc-inferpy?logo=pypi&logoColor=white)
7
+ ![GitHub forks](https://img.shields.io/github/forks/TheNeodev/rvc_inferpy?style=flat) [![GitHub Stars](https://img.shields.io/github/stars/TheNeodev/rvc_inferpy?style=flat&logo=github&label=Star&color=blue)](https://github.com/TheNeodev/rvc_inferpy/stargazers)
8
+
9
+ `rvc_inferpy` is a Python library for performing audio inference with RVC (Retrieval-based Voice Conversion). It provides a simple command-line interface (CLI) and can be integrated into Python projects for audio processing with customizable parameters.
10
+
11
+ ## Installation
12
+
13
+ You can install the package using `pip`:
14
+
15
+ ```bash
16
+ pip install rvc-inferpy
17
+ ```
18
+
19
+ ## Usage
20
+
21
+ ### Command Line Interface (CLI)
22
+
23
+ You can interact with `rvc_inferpy` through the command line. To view the available options and how to use the tool, run:
24
+
25
+ ```bash
26
+ rvc-cli -h
27
+ ```
28
+
29
+ Here’s a breakdown of the full command-line options:
30
+
31
+ ```bash
32
+ usage: rvc-cli [-h] [--model_name MODEL_NAME] [--audio_path AUDIO_PATH]
33
+ [--f0_change F0_CHANGE] [--f0_method F0_METHOD]
34
+ [--min_pitch MIN_PITCH] [--max_pitch MAX_PITCH]
35
+ [--crepe_hop_length CREPE_HOP_LENGTH] [--index_rate INDEX_RATE]
36
+ [--filter_radius FILTER_RADIUS] [--rms_mix_rate RMS_MIX_RATE]
37
+ [--protect PROTECT] [--split_infer] [--min_silence MIN_SILENCE]
38
+ [--silence_threshold SILENCE_THRESHOLD] [--seek_step SEEK_STEP]
39
+ [--keep_silence KEEP_SILENCE] [--do_formant] [--quefrency QUEFRENCY]
40
+ [--timbre TIMBRE] [--f0_autotune] [--audio_format AUDIO_FORMAT]
41
+ [--resample_sr RESAMPLE_SR]
42
+ ```
43
+
44
+ ### Command-Line Options:
45
+ - `-h, --help`: Show help message and exit.
46
+ - `--model_name MODEL_NAME`: Name or path of the model.
47
+ - `--audio_path AUDIO_PATH`: Path to the input audio file.
48
+ - `--f0_change F0_CHANGE`: Pitch change factor.
49
+ - `--f0_method F0_METHOD`: Method for F0 estimation (e.g., "crepe").
50
+ - `--min_pitch MIN_PITCH`: Minimum pitch value.
51
+ - `--max_pitch MAX_PITCH`: Maximum pitch value.
52
+ - `--crepe_hop_length CREPE_HOP_LENGTH`: Crepe hop length.
53
+ - `--index_rate INDEX_RATE`: Index rate.
54
+ - `--filter_radius FILTER_RADIUS`: Filter radius.
55
+ - `--rms_mix_rate RMS_MIX_RATE`: RMS mix rate.
56
+ - `--protect PROTECT`: Protect factor to avoid distortion.
57
+ - `--split_infer`: Enable split inference.
58
+ - `--min_silence MIN_SILENCE`: Minimum silence duration (in milliseconds).
59
+ - `--silence_threshold SILENCE_THRESHOLD`: Silence threshold in dB.
60
+ - `--seek_step SEEK_STEP`: Step size for silence detection.
61
+ - `--keep_silence KEEP_SILENCE`: Duration of silence to keep around segments (in milliseconds).
62
+ - `--do_formant`: Enable formant processing.
63
+ - `--quefrency QUEFRENCY`: Quefrency adjustment.
64
+ - `--timbre TIMBRE`: Timbre adjustment factor.
65
+ - `--f0_autotune`: Enable automatic F0 tuning.
66
+ - `--audio_format AUDIO_FORMAT`: Desired output audio format (e.g., "wav", "mp3").
67
+ - `--resample_sr RESAMPLE_SR`: Resample sample rate.
68
+
69
+
70
+ ### Example Command:
71
+
72
+ ```bash
73
+ rvc-cli --model_name "model_name_here" --audio_path "path_to_audio.wav" --f0_change 0 --f0_method "crepe" --min_pitch 50 --max_pitch 800
74
+ ```
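+
+ If you prefer to drive the CLI from a script, here is a minimal sketch (assuming `rvc-cli` is on your `PATH` after installation; the model and audio names are placeholders):
+
+ ```python
+ import subprocess
+
+ # Placeholder model/audio names; adjust to your setup.
+ result = subprocess.run(
+     [
+         "rvc-cli",
+         "--model_name", "model_name_here",
+         "--audio_path", "path_to_audio.wav",
+         "--f0_method", "crepe",
+         "--f0_change", "0",
+     ],
+     capture_output=True,
+     text=True,
+ )
+ print(result.stdout)
+ ```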
75
+
76
+ ### As a Dependency in a Python Project
77
+
78
+ You can also use `rvc_inferpy` directly in your Python projects. Here's an example:
79
+
80
+ ```python
81
+ from rvc_inferpy import infer_audio
82
+
83
+ inferred_audio = infer_audio(
84
+ model_name="model_name_here", # Name or path to the RVC model
85
+ audio_path="path_to_audio.wav", # Path to the input audio file
86
+ f0_change=0, # Change in fundamental frequency
87
+ f0_method="crepe", # F0 extraction method ("crepe", "dio", etc.)
88
+ crepe_hop_length=128, # Hop length for Crepe
89
+ index_rate=1.0, # Index rate for model inference
90
+ filter_radius=3, # Radius for smoothing filters
91
+ rms_mix_rate=0.75, # Mixing rate for RMS
92
+ protect=0.33, # Protect voiceless consonants/breaths to reduce artifacts
93
+ split_infer=True, # Whether to split audio for inference
94
+ min_silence=500, # Minimum silence duration for splitting (ms)
95
+ silence_threshold=-40, # Silence threshold in dB
96
+ seek_step=10, # Seek step in milliseconds
97
+ keep_silence=100, # Silence to keep around segments (ms)
98
+ quefrency=0.0, # Cepstrum quefrency adjustment
99
+ timbre=1.0, # Timbre adjustment factor
100
+ f0_autotune=False, # Enable or disable F0 autotuning
101
+
102
+ )
103
+ ```
104
+
105
+ The `infer_audio` function returns the path of the processed output audio file generated with the provided parameters.
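+
+ For example, a minimal sketch (the model and file names below are placeholders) that runs inference and copies the result:
+
+ ```python
+ import os
+ import shutil
+ from rvc_inferpy import infer_audio
+
+ # Placeholder model/audio names; adjust to your setup.
+ result = infer_audio(model_name="model_name_here", audio_path="path_to_audio.wav")
+ if result and os.path.exists(str(result)):
+     shutil.copy(result, "converted.wav")
+     print(f"Converted audio copied to converted.wav (source: {result})")
+ else:
+     print(f"Inference did not produce an output file: {result}")
+ ```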
106
+
107
+
108
+
109
+
110
+ > [!TIP]
111
+ > Ensure that you place your models in the `models/{model_name}` folder.
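+
+ A quick way to sanity-check the expected layout (this sketch mirrors how `rvc_inferpy/infer.py` looks up `.pth` and `.index` files):
+
+ ```python
+ import os
+
+ model_name = "model_name_here"  # placeholder
+ model_dir = os.path.join(os.getcwd(), "models", model_name)
+
+ # infer_audio() expects a .pth weight file (and optionally a .index file) here.
+ files = os.listdir(model_dir)
+ print("weights:", [f for f in files if f.endswith(".pth")])
+ print("index:", [f for f in files if f.endswith(".index")])
+ ```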
112
+
113
+
114
+
115
+
116
+
117
+ ## Terms of Use
118
+
119
+ Using the converted voice for the following purposes is prohibited:
120
+
121
+ * Criticizing or attacking individuals.
122
+
123
+ * Advocating for or opposing specific political positions, religions, or ideologies.
124
+
125
+ * Publicly posting strongly explicit content without appropriate zoning or content warnings.
126
+
127
+ * Selling voice models or generated voice clips.
128
+
129
+ * Impersonating the original owner of the voice with malicious intent to harm others.
130
+
131
+ * Fraudulent purposes such as identity theft or scam phone calls.
132
+
133
+ ## Disclaimer
134
+
135
+ I am not liable for any direct, indirect, consequential, incidental, or special damages arising out of or in any way connected with the use/misuse or inability to use this software.
136
+
137
+
138
+
139
+ ## Credits
140
+
141
+ - **IAHispano's Applio**: Base of this project.
142
+ - **RVC-Project**: Original RVC repository.
143
+
144
+ ## License
145
+
146
+ This project is licensed under the [MIT License](https://github.com/TheNeodev/rvc_inferpy/tree/main?tab=MIT-1-ov-file).
147
+
148
+
pyproject.toml ADDED
@@ -0,0 +1,43 @@
1
+ [build-system]
2
+ requires = ["setuptools", "wheel"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "rvc_inferpy" # Name of your package
7
+ version = "0.6.1" # Package version
8
+ description = "Easy tools for RVC Inference"
9
+ readme = "README.md"
10
+ license = {text = "MIT"} # Use this if you are specifying the license directly
11
+ authors = [
12
+ {name = "TheNeoDev", email = "[email protected]"}
13
+ ]
14
+ dependencies = [
15
+ "av",
16
+ "ffmpeg-python>=0.2.0",
17
+ "faiss-cpu==1.7.3",
18
+ "praat-parselmouth==0.4.2",
19
+ "pyworld==0.3.4",
20
+ "resampy==0.4.2",
21
+ "fairseq==0.12.2",
22
+ "pydub==0.25.1",
23
+ "einops",
24
+ "local_attention",
25
+ "torchcrepe==0.0.23",
26
+ "torchfcpe",
27
+ "audio-separator[gpu]==0.28.5",
28
+ "yt-dlp",
29
+ "gradio==4.44.0",
30
+ "gtts",
31
+ "edge-tts",
32
+
33
+ ]
34
+
35
+ [project.scripts]
36
+ rvc-cli = "rvc_inferpy.cli:infer_audio_cli" # Entry point for CLI
37
+
38
+ [project.optional-dependencies]
39
+ gpu = ["audio-separator[gpu]"]
40
+ cpu = ["audio-separator[cpu]"]
41
+
42
+ [tool.poetry.dev-dependencies]  # NOTE: not read by the setuptools build backend declared above; dev-only
43
+ librosa = ">=0.9.1,<0.11"
rvc_inferpy/__init__.py ADDED
@@ -0,0 +1 @@
1
+ from .infer import infer_audio  # only infer_audio is defined in infer.py in this upload
rvc_inferpy/cli.py ADDED
@@ -0,0 +1,180 @@
1
+ # Command-line entry point for rvc_inferpy
2
+
3
+ import argparse
4
+ import os
5
+ import shutil
6
+ import gc
7
+ from rvc_inferpy.modules import VC
8
+ from rvc_inferpy.infer import Configs, get_model
9
+ from rvc_inferpy.split_audio import (
10
+ split_silence_nonsilent,
11
+ adjust_audio_lengths,
12
+ combine_silence_nonsilent,
13
+ )
14
+
15
+
16
+ def infer_audio_cli():
17
+ parser = argparse.ArgumentParser(description="RVC INFERPY CLI VER.")
18
+ parser.add_argument("--model_name", type=str, help="Name of the model.")
19
+ parser.add_argument("--audio_path", type=str, help="Path to the input audio file.")
20
+ parser.add_argument(
21
+ "--f0_change", type=float, default=0, help="Pitch change factor."
22
+ )
23
+ parser.add_argument(
24
+ "--f0_method", type=str, default="rmvpe+", help="Method for F0 estimation."
25
+ )
26
+ parser.add_argument(
27
+ "--min_pitch", type=str, default="50", help="Minimum pitch value."
28
+ )
29
+ parser.add_argument(
30
+ "--max_pitch", type=str, default="1100", help="Maximum pitch value."
31
+ )
32
+ parser.add_argument(
33
+ "--crepe_hop_length", type=int, default=128, help="Crepe hop length."
34
+ )
35
+ parser.add_argument("--index_rate", type=float, default=0.75, help="Index rate.")
36
+ parser.add_argument("--filter_radius", type=int, default=3, help="Filter radius.")
37
+ parser.add_argument(
38
+ "--rms_mix_rate", type=float, default=0.25, help="RMS mix rate."
39
+ )
40
+ parser.add_argument("--protect", type=float, default=0.33, help="Protect factor.")
41
+ parser.add_argument(
42
+ "--split_infer", action="store_true", help="Enable split inference."
43
+ )
44
+ parser.add_argument(
45
+ "--min_silence", type=int, default=500, help="Minimum silence duration."
46
+ )
47
+ parser.add_argument(
48
+ "--silence_threshold", type=float, default=-50, help="Silence threshold (dB)."
49
+ )
50
+ parser.add_argument(
51
+ "--seek_step", type=int, default=1, help="Seek step for silence detection."
52
+ )
53
+ parser.add_argument(
54
+ "--keep_silence", type=int, default=100, help="Silence retention duration."
55
+ )
56
+ parser.add_argument(
57
+ "--do_formant", action="store_true", help="Enable formant processing."
58
+ )
59
+ parser.add_argument(
60
+ "--quefrency", type=float, default=0, help="Quefrency adjustment value."
61
+ )
62
+ parser.add_argument(
63
+ "--timbre", type=float, default=1, help="Timbre adjustment factor."
64
+ )
65
+ parser.add_argument(
66
+ "--f0_autotune", action="store_true", help="Enable F0 autotuning."
67
+ )
68
+ parser.add_argument(
69
+ "--audio_format", type=str, default="wav", help="Output audio format."
70
+ )
71
+ parser.add_argument(
72
+ "--resample_sr", type=int, default=0, help="Resample sample rate."
73
+ )
74
+ parser.add_argument(
75
+ "--hubert_model_path",
76
+ type=str,
77
+ default="hubert_base.pt",
78
+ help="Path to Hubert model.",
79
+ )
80
+ parser.add_argument(
81
+ "--rmvpe_model_path", type=str, default="rmvpe.pt", help="Path to RMVPE model."
82
+ )
83
+ parser.add_argument(
84
+ "--fcpe_model_path", type=str, default="fcpe.pt", help="Path to FCPE model."
85
+ )
86
+ args = parser.parse_args()
87
+
88
+ os.environ["rmvpe_model_path"] = args.rmvpe_model_path
89
+ os.environ["fcpe_model_path"] = args.fcpe_model_path
90
+ configs = Configs("cuda:0", True)
91
+ vc = VC(configs)
92
+ pth_path, index_path = get_model(args.model_name)
93
+ vc_data = vc.get_vc(pth_path, args.protect, 0.5)
94
+
95
+ if args.split_infer:
96
+ inferred_files = []
97
+ temp_dir = os.path.join(os.getcwd(), "seperate", "temp")
98
+ os.makedirs(temp_dir, exist_ok=True)
99
+ print("Splitting audio into silence and nonsilent segments.")
100
+ silence_files, nonsilent_files = split_silence_nonsilent(
101
+ args.audio_path,
102
+ args.min_silence,
103
+ args.silence_threshold,
104
+ args.seek_step,
105
+ args.keep_silence,
106
+ )
107
+ for i, nonsilent_file in enumerate(nonsilent_files):
108
+ print(f"Processing nonsilent audio {i+1}/{len(nonsilent_files)}")
109
+ inference_info, audio_data, output_path = vc.vc_single(
110
+ 0,
111
+ nonsilent_file,
112
+ args.f0_change,
113
+ args.f0_method,
114
+ index_path,
115
+ index_path,
116
+ args.index_rate,
117
+ args.filter_radius,
118
+ args.resample_sr,
119
+ args.rms_mix_rate,
120
+ args.protect,
121
+ args.audio_format,
122
+ args.crepe_hop_length,
123
+ args.do_formant,
124
+ args.quefrency,
125
+ args.timbre,
126
+ args.min_pitch,
127
+ args.max_pitch,
128
+ args.f0_autotune,
129
+ args.hubert_model_path,
130
+ )
131
+ if inference_info[0] == "Success.":
132
+ print("Inference ran successfully.")
133
+ print(inference_info[1])
134
+ else:
135
+ print(f"Error: {inference_info[0]}")
136
+ return
137
+ inferred_files.append(output_path)
138
+
139
+ adjusted_inferred_files = adjust_audio_lengths(nonsilent_files, inferred_files)
140
+ output_path = combine_silence_nonsilent(
141
+ silence_files, adjusted_inferred_files, args.keep_silence, output_path
142
+ )
143
+ shutil.rmtree(temp_dir)
144
+ else:
145
+ inference_info, audio_data, output_path = vc.vc_single(
146
+ 0,
147
+ args.audio_path,
148
+ args.f0_change,
149
+ args.f0_method,
150
+ index_path,
151
+ index_path,
152
+ args.index_rate,
153
+ args.filter_radius,
154
+ args.resample_sr,
155
+ args.rms_mix_rate,
156
+ args.protect,
157
+ args.audio_format,
158
+ args.crepe_hop_length,
159
+ args.do_formant,
160
+ args.quefrency,
161
+ args.timbre,
162
+ args.min_pitch,
163
+ args.max_pitch,
164
+ args.f0_autotune,
165
+ args.hubert_model_path,
166
+ )
167
+ if inference_info[0] == "Success.":
168
+ print("Inference ran successfully.")
169
+ print(inference_info[1])
170
+ else:
171
+ print(f"Error: {inference_info[0]}")
172
+ return
173
+
174
+ del configs, vc
175
+ gc.collect()
176
+ print(f"Output saved to: {output_path}")
177
+
178
+
179
+ if __name__ == "__main__":
180
+ infer_audio_cli()
rvc_inferpy/infer.py ADDED
@@ -0,0 +1,224 @@
1
+ import os, sys
2
+ import shutil
3
+ import gc
4
+ import torch
5
+ from multiprocessing import cpu_count
6
+ from rvc_inferpy.modules import VC
7
+ from rvc_inferpy.split_audio import (
8
+ split_silence_nonsilent,
9
+ adjust_audio_lengths,
10
+ combine_silence_nonsilent,
11
+ )
12
+ from pathlib import Path
13
+ import requests
14
+
15
+
16
+ class Configs:
17
+ def __init__(self, device, is_half):
18
+ self.device = device
19
+ self.is_half = is_half
20
+ self.n_cpu = 0
21
+ self.gpu_name = None
22
+ self.gpu_mem = None
23
+ self.x_pad, self.x_query, self.x_center, self.x_max = self.device_config()
24
+
25
+ def device_config(self) -> tuple:
26
+ if torch.cuda.is_available():
27
+ i_device = int(self.device.split(":")[-1])
28
+ self.gpu_name = torch.cuda.get_device_name(i_device)
29
+ elif torch.backends.mps.is_available():
30
+ print("No supported N-card found, use MPS for inference")
31
+ self.device = "mps"
32
+ else:
33
+ print("No supported N-card found, use CPU for inference")
34
+ self.device = "cpu"
35
+
36
+ if self.n_cpu == 0:
37
+ self.n_cpu = cpu_count()
38
+
39
+ if self.is_half:
40
+ # 6G memory config
41
+ x_pad = 3
42
+ x_query = 10
43
+ x_center = 60
44
+ x_max = 65
45
+ else:
46
+ # 5G memory config
47
+ x_pad = 1
48
+ x_query = 6
49
+ x_center = 38
50
+ x_max = 41
51
+
52
+ if self.gpu_mem != None and self.gpu_mem <= 4:
53
+ x_pad = 1
54
+ x_query = 5
55
+ x_center = 30
56
+ x_max = 32
57
+
58
+ return x_pad, x_query, x_center, x_max
59
+
60
+
61
+ def get_model(voice_model):
62
+ model_dir = os.path.join(os.getcwd(), "models", voice_model)
63
+ model_filename, index_filename = None, None
64
+ for file in os.listdir(model_dir):
65
+ ext = os.path.splitext(file)[1]
66
+ if ext == ".pth":
67
+ model_filename = file
68
+ if ext == ".index":
69
+ index_filename = file
70
+
71
+ if model_filename is None:
72
+ print(f"No model file exists in {model_dir}.")
73
+ return None, None
74
+
75
+ return os.path.join(model_dir, model_filename), (
76
+ os.path.join(model_dir, index_filename) if index_filename else ""
77
+ )
78
+
79
+
80
+ BASE_DIR = Path(os.getcwd())
81
+ sys.path.append(str(BASE_DIR))
82
+
83
+
84
+
85
+
86
+ def infer_audio(
87
+ model_name,
88
+ audio_path,
89
+ f0_change=0,
90
+ f0_method="rmvpe+",
91
+ min_pitch="50",
92
+ max_pitch="1100",
93
+ crepe_hop_length=128,
94
+ index_rate=0.75,
95
+ filter_radius=3,
96
+ rms_mix_rate=0.25,
97
+ protect=0.33,
98
+ split_infer=False,
99
+ min_silence=500,
100
+ silence_threshold=-50,
101
+ seek_step=1,
102
+ keep_silence=100,
103
+ do_formant=False,
104
+ quefrency=0,
105
+ timbre=1,
106
+ f0_autotune=False,
107
+ audio_format="wav",
108
+ resample_sr=0,
109
+ hubert_model_path="hubert_base.pt",
110
+ rmvpe_model_path="rmvpe.pt",
111
+ fcpe_model_path="fcpe.pt",
112
+ ):
113
+ os.environ["rmvpe_model_path"] = rmvpe_model_path
114
+ os.environ["fcpe_model_path"] = fcpe_model_path
115
+ configs = Configs("cuda:0", True)
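+ # Requests CUDA with half precision; device_config() falls back to MPS or CPU when no NVIDIA GPU is available.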
116
+ vc = VC(configs)
117
+ pth_path, index_path = get_model(model_name)
118
+ vc_data = vc.get_vc(pth_path, protect, 0.5)
119
+
120
+ if split_infer:
121
+ inferred_files = []
122
+ temp_dir = os.path.join(os.getcwd(), "seperate", "temp")
123
+ os.makedirs(temp_dir, exist_ok=True)
124
+ print("Splitting audio into silence and nonsilent segments.")
125
+ silence_files, nonsilent_files = split_silence_nonsilent(
126
+ audio_path, min_silence, silence_threshold, seek_step, keep_silence
127
+ )
128
+ print(
129
+ f"Total silence segments: {len(silence_files)}.\nTotal nonsilent segments: {len(nonsilent_files)}."
130
+ )
131
+ for i, nonsilent_file in enumerate(nonsilent_files):
132
+ print(f"Inferring nonsilent audio {i+1}")
133
+ inference_info, audio_data, output_path = vc.vc_single(
134
+ 0,
135
+ nonsilent_file,
136
+ f0_change,
137
+ f0_method,
138
+ index_path,
139
+ index_path,
140
+ index_rate,
141
+ filter_radius,
142
+ resample_sr,
143
+ rms_mix_rate,
144
+ protect,
145
+ audio_format,
146
+ crepe_hop_length,
147
+ do_formant,
148
+ quefrency,
149
+ timbre,
150
+ min_pitch,
151
+ max_pitch,
152
+ f0_autotune,
153
+ hubert_model_path,
154
+ )
155
+ if inference_info[0] == "Success.":
156
+ print("Inference ran successfully.")
157
+ print(inference_info[1])
158
+ print(
159
+ "Times:\nnpy: %.2fs f0: %.2fs infer: %.2fs\nTotal time: %.2fs"
160
+ % (*inference_info[2],)
161
+ )
162
+ else:
163
+ print(f"An error occurred while processing.\n{inference_info[0]}")
164
+ return None
165
+ inferred_files.append(output_path)
166
+ print("Adjusting inferred audio lengths.")
167
+ adjusted_inferred_files = adjust_audio_lengths(nonsilent_files, inferred_files)
168
+ print("Combining silence and inferred audios.")
169
+ output_count = 1
170
+ while True:
171
+ output_path = os.path.join(
172
+ os.getcwd(),
173
+ "output",
174
+ f"{os.path.splitext(os.path.basename(audio_path))[0]}{model_name}{f0_method.capitalize()}_{output_count}.{audio_format}",
175
+ )
176
+ if not os.path.exists(output_path):
177
+ break
178
+ output_count += 1
179
+ output_path = combine_silence_nonsilent(
180
+ silence_files, adjusted_inferred_files, keep_silence, output_path
181
+ )
182
+ [shutil.move(inferred_file, temp_dir) for inferred_file in inferred_files]
183
+ shutil.rmtree(temp_dir)
184
+ else:
185
+ inference_info, audio_data, output_path = vc.vc_single(
186
+ 0,
187
+ audio_path,
188
+ f0_change,
189
+ f0_method,
190
+ index_path,
191
+ index_path,
192
+ index_rate,
193
+ filter_radius,
194
+ resample_sr,
195
+ rms_mix_rate,
196
+ protect,
197
+ audio_format,
198
+ crepe_hop_length,
199
+ do_formant,
200
+ quefrency,
201
+ timbre,
202
+ min_pitch,
203
+ max_pitch,
204
+ f0_autotune,
205
+ hubert_model_path,
206
+ )
207
+ if inference_info[0] == "Success.":
208
+ print("Inference ran successfully.")
209
+ print(inference_info[1])
210
+ print(
211
+ "Times:\nnpy: %.2fs f0: %.2fs infer: %.2fs\nTotal time: %.2fs"
212
+ % (*inference_info[2],)
213
+ )
214
+ else:
215
+ print(f"An error occurred while processing.\n{inference_info[0]}")
216
+ del configs, vc
217
+ gc.collect()
218
+ return inference_info[0]
219
+
220
+ del configs, vc
221
+ gc.collect()
222
+ return output_path
223
+
224
+
rvc_inferpy/infer_list/__init__.py ADDED
@@ -0,0 +1 @@
1
+
rvc_inferpy/infer_list/audio.py ADDED
@@ -0,0 +1,93 @@
1
+ import numpy as np
2
+ import av
3
+ import ffmpeg
4
+ import os
5
+ import traceback
6
+ import sys
7
+ import subprocess
8
+
9
+ platform_stft_mapping = {
10
+ "linux": os.path.join(os.getcwd(), "stftpitchshift"),
11
+ "darwin": os.path.join(os.getcwd(), "stftpitchshift"),
12
+ "win32": os.path.join(os.getcwd(), "stftpitchshift.exe"),
13
+ }
14
+
15
+ stft = platform_stft_mapping.get(sys.platform)
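+ # Formant shifting shells out to the external stftpitchshift binary, expected in the current working directory.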
16
+
17
+
18
+ def wav2(i, o, format):
19
+ inp = av.open(i, "rb")
20
+ if format == "m4a":
21
+ format = "mp4"
22
+ out = av.open(o, "wb", format=format)
23
+ if format == "ogg":
24
+ format = "libvorbis"
25
+ if format == "mp4":
26
+ format = "aac"
27
+
28
+ ostream = out.add_stream(format)
29
+
30
+ for frame in inp.decode(audio=0):
31
+ for p in ostream.encode(frame):
32
+ out.mux(p)
33
+
34
+ for p in ostream.encode(None):
35
+ out.mux(p)
36
+
37
+ out.close()
38
+ inp.close()
39
+
40
+
41
+ def load_audio(file, sr, DoFormant=False, Quefrency=1.0, Timbre=1.0):
42
+ formanted = False
43
+ file = file.strip(' \n"')
44
+ if not os.path.exists(file):
45
+ raise RuntimeError(f"Audio path does not exist: {file}")
46
+
47
+ try:
48
+ if DoFormant:
49
+ print("Starting formant shift. Please wait as this process takes a while.")
50
+ formanted_file = f"{os.path.splitext(os.path.basename(file))[0]}_formanted{os.path.splitext(os.path.basename(file))[1]}"
51
+ command = (
52
+ f'{stft} -i "{file}" -q "{Quefrency}" '
53
+ f'-t "{Timbre}" -o "{formanted_file}"'
54
+ )
55
+ subprocess.run(command, shell=True)
56
+ file = formanted_file
57
+ print(f"Formanted {file}\n")
58
+
59
+ # https://github.com/openai/whisper/blob/main/whisper/audio.py#L26
60
+ # This launches a subprocess to decode audio while down-mixing and resampling as necessary.
61
+ # Requires the ffmpeg CLI and `ffmpeg-python` package to be installed.
62
+ file = (
63
+ file.strip(" ").strip('"').strip("\n").strip('"').strip(" ")
64
+ )  # Strip stray spaces, quotes, and newlines picked up when copying paths
65
+ out, _ = (
66
+ ffmpeg.input(file, threads=0)
67
+ .output("-", format="f32le", acodec="pcm_f32le", ac=1, ar=sr)
68
+ .run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True)
69
+ )
70
+
71
+ return np.frombuffer(out, np.float32).flatten()
72
+
73
+ except Exception as e:
74
+ raise RuntimeError(f"Failed to load audio: {e}")
75
+
76
+
77
+ def check_audio_duration(file):
78
+ try:
79
+ file = file.strip(" ").strip('"').strip("\n").strip('"').strip(" ")
80
+
81
+ probe = ffmpeg.probe(file)
82
+
83
+ duration = float(probe["streams"][0]["duration"])
84
+
85
+ if duration < 0.76:
86
+ print(
87
+ f"Audio file, {file.split('/')[-1]}, under ~0.76s detected - file is too short. Target at least 1-2s for best results."
88
+ )
89
+ return False
90
+
91
+ return True
92
+ except Exception as e:
93
+ raise RuntimeError(f"Failed to check audio duration: {e}")
rvc_inferpy/infer_list/fcpe.py ADDED
@@ -0,0 +1,1065 @@
1
+ from typing import Union
2
+
3
+ import torch.nn.functional as F
4
+ import numpy as np
5
+ import torch
6
+ import torch.nn as nn
7
+ from torch.nn.utils import weight_norm
8
+ from torchaudio.transforms import Resample
9
+ import os
10
+ import librosa
11
+ import soundfile as sf
12
+ import torch.utils.data
13
+ from librosa.filters import mel as librosa_mel_fn
14
+ import math
15
+ from functools import partial
16
+
17
+ from einops import rearrange, repeat
18
+ from local_attention import LocalAttention
19
+ from torch import nn
20
+
21
+ os.environ["LRU_CACHE_CAPACITY"] = "3"
22
+
23
+
24
+ def load_wav_to_torch(full_path, target_sr=None, return_empty_on_exception=False):
25
+ sampling_rate = None
26
+ try:
27
+ data, sampling_rate = sf.read(full_path, always_2d=True)  # load with soundfile
28
+ except Exception as ex:
29
+ print(f"'{full_path}' failed to load.\nException:")
30
+ print(ex)
31
+ if return_empty_on_exception:
32
+ return [], sampling_rate or target_sr or 48000
33
+ else:
34
+ raise Exception(ex)
35
+
36
+ if len(data.shape) > 1:
37
+ data = data[:, 0]
38
+ assert (
39
+ len(data) > 2
40
+ ) # check duration of audio file is > 2 samples (because otherwise the slice operation was on the wrong dimension)
41
+
42
+ if np.issubdtype(data.dtype, np.integer): # if audio data is type int
43
+ max_mag = -np.iinfo(
44
+ data.dtype
45
+ ).min # maximum magnitude = min possible value of intXX
46
+ else: # if audio data is type fp32
47
+ max_mag = max(np.amax(data), -np.amin(data))
48
+ max_mag = (
49
+ (2**31) + 1
50
+ if max_mag > (2**15)
51
+ else ((2**15) + 1 if max_mag > 1.01 else 1.0)
52
+ ) # data should be either 16-bit INT, 32-bit INT or [-1 to 1] float32
53
+
54
+ data = torch.FloatTensor(data.astype(np.float32)) / max_mag
55
+
56
+ if (
57
+ torch.isinf(data) | torch.isnan(data)
58
+ ).any() and return_empty_on_exception: # resample will crash with inf/NaN inputs. return_empty_on_exception will return empty arr instead of except
59
+ return [], sampling_rate or target_sr or 48000
60
+ if target_sr is not None and sampling_rate != target_sr:
61
+ data = torch.from_numpy(
62
+ librosa.core.resample(
63
+ data.numpy(), orig_sr=sampling_rate, target_sr=target_sr
64
+ )
65
+ )
66
+ sampling_rate = target_sr
67
+
68
+ return data, sampling_rate
69
+
70
+
71
+ def dynamic_range_compression(x, C=1, clip_val=1e-5):
72
+ return np.log(np.clip(x, a_min=clip_val, a_max=None) * C)
73
+
74
+
75
+ def dynamic_range_decompression(x, C=1):
76
+ return np.exp(x) / C
77
+
78
+
79
+ def dynamic_range_compression_torch(x, C=1, clip_val=1e-5):
80
+ return torch.log(torch.clamp(x, min=clip_val) * C)
81
+
82
+
83
+ def dynamic_range_decompression_torch(x, C=1):
84
+ return torch.exp(x) / C
85
+
86
+
87
+ class STFT:
88
+ def __init__(
89
+ self,
90
+ sr=22050,
91
+ n_mels=80,
92
+ n_fft=1024,
93
+ win_size=1024,
94
+ hop_length=256,
95
+ fmin=20,
96
+ fmax=11025,
97
+ clip_val=1e-5,
98
+ ):
99
+ self.target_sr = sr
100
+
101
+ self.n_mels = n_mels
102
+ self.n_fft = n_fft
103
+ self.win_size = win_size
104
+ self.hop_length = hop_length
105
+ self.fmin = fmin
106
+ self.fmax = fmax
107
+ self.clip_val = clip_val
108
+ self.mel_basis = {}
109
+ self.hann_window = {}
110
+
111
+ def get_mel(self, y, keyshift=0, speed=1, center=False, train=False):
112
+ sampling_rate = self.target_sr
113
+ n_mels = self.n_mels
114
+ n_fft = self.n_fft
115
+ win_size = self.win_size
116
+ hop_length = self.hop_length
117
+ fmin = self.fmin
118
+ fmax = self.fmax
119
+ clip_val = self.clip_val
120
+
121
+ factor = 2 ** (keyshift / 12)
122
+ n_fft_new = int(np.round(n_fft * factor))
123
+ win_size_new = int(np.round(win_size * factor))
124
+ hop_length_new = int(np.round(hop_length * speed))
125
+ if not train:
126
+ mel_basis = self.mel_basis
127
+ hann_window = self.hann_window
128
+ else:
129
+ mel_basis = {}
130
+ hann_window = {}
131
+
132
+ if torch.min(y) < -1.0:
133
+ print("min value is ", torch.min(y))
134
+ if torch.max(y) > 1.0:
135
+ print("max value is ", torch.max(y))
136
+
137
+ mel_basis_key = str(fmax) + "_" + str(y.device)
138
+ if mel_basis_key not in mel_basis:
139
+ mel = librosa_mel_fn(
140
+ sr=sampling_rate, n_fft=n_fft, n_mels=n_mels, fmin=fmin, fmax=fmax
141
+ )
142
+ mel_basis[mel_basis_key] = torch.from_numpy(mel).float().to(y.device)
143
+
144
+ keyshift_key = str(keyshift) + "_" + str(y.device)
145
+ if keyshift_key not in hann_window:
146
+ hann_window[keyshift_key] = torch.hann_window(win_size_new).to(y.device)
147
+
148
+ pad_left = (win_size_new - hop_length_new) // 2
149
+ pad_right = max(
150
+ (win_size_new - hop_length_new + 1) // 2,
151
+ win_size_new - y.size(-1) - pad_left,
152
+ )
153
+ if pad_right < y.size(-1):
154
+ mode = "reflect"
155
+ else:
156
+ mode = "constant"
157
+ y = torch.nn.functional.pad(y.unsqueeze(1), (pad_left, pad_right), mode=mode)
158
+ y = y.squeeze(1)
159
+
160
+ spec = torch.stft(
161
+ y,
162
+ n_fft_new,
163
+ hop_length=hop_length_new,
164
+ win_length=win_size_new,
165
+ window=hann_window[keyshift_key],
166
+ center=center,
167
+ pad_mode="reflect",
168
+ normalized=False,
169
+ onesided=True,
170
+ return_complex=True,
171
+ )
172
+ spec = torch.sqrt(spec.real.pow(2) + spec.imag.pow(2) + (1e-9))
173
+ if keyshift != 0:
174
+ size = n_fft // 2 + 1
175
+ resize = spec.size(1)
176
+ if resize < size:
177
+ spec = F.pad(spec, (0, 0, 0, size - resize))
178
+ spec = spec[:, :size, :] * win_size / win_size_new
179
+ spec = torch.matmul(mel_basis[mel_basis_key], spec)
180
+ spec = dynamic_range_compression_torch(spec, clip_val=clip_val)
181
+ return spec
182
+
183
+ def __call__(self, audiopath):
184
+ audio, sr = load_wav_to_torch(audiopath, target_sr=self.target_sr)
185
+ spect = self.get_mel(audio.unsqueeze(0)).squeeze(0)
186
+ return spect
187
+
188
+
189
+ stft = STFT()
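+ # Module-level helper instance using the default 22.05 kHz mel configuration defined above.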
190
+
191
+ # import fast_transformers.causal_product.causal_product_cuda
192
+
193
+
194
+ def softmax_kernel(
195
+ data, *, projection_matrix, is_query, normalize_data=True, eps=1e-4, device=None
196
+ ):
197
+ b, h, *_ = data.shape
198
+ # (batch size, head, length, model_dim)
199
+
200
+ # normalize model dim
201
+ data_normalizer = (data.shape[-1] ** -0.25) if normalize_data else 1.0
202
+
203
+ # ratio normalizes by the number of random features (projection_matrix.shape[0])
204
+
205
+ ratio = projection_matrix.shape[0] ** -0.5
206
+
207
+ projection = repeat(projection_matrix, "j d -> b h j d", b=b, h=h)
208
+ projection = projection.type_as(data)
209
+
210
+ # data_dash = w^T x
211
+ data_dash = torch.einsum("...id,...jd->...ij", (data_normalizer * data), projection)
212
+
213
+ # diag_data = D**2
214
+ diag_data = data**2
215
+ diag_data = torch.sum(diag_data, dim=-1)
216
+ diag_data = (diag_data / 2.0) * (data_normalizer**2)
217
+ diag_data = diag_data.unsqueeze(dim=-1)
218
+
219
+ # print ()
220
+ if is_query:
221
+ data_dash = ratio * (
222
+ torch.exp(
223
+ data_dash
224
+ - diag_data
225
+ - torch.max(data_dash, dim=-1, keepdim=True).values
226
+ )
227
+ + eps
228
+ )
229
+ else:
230
+ data_dash = ratio * (
231
+ torch.exp(data_dash - diag_data + eps)
232
+ ) # - torch.max(data_dash)) + eps)
233
+
234
+ return data_dash.type_as(data)
235
+
236
+
237
+ def orthogonal_matrix_chunk(cols, qr_uniform_q=False, device=None):
238
+ unstructured_block = torch.randn((cols, cols), device=device)
239
+ q, r = torch.linalg.qr(unstructured_block.cpu(), mode="reduced")
240
+ q, r = map(lambda t: t.to(device), (q, r))
241
+
242
+ # proposed by @Parskatt
243
+ # to make sure Q is uniform https://arxiv.org/pdf/math-ph/0609050.pdf
244
+ if qr_uniform_q:
245
+ d = torch.diag(r, 0)
246
+ q *= d.sign()
247
+ return q.t()
248
+
249
+
250
+ def exists(val):
251
+ return val is not None
252
+
253
+
254
+ def empty(tensor):
255
+ return tensor.numel() == 0
256
+
257
+
258
+ def default(val, d):
259
+ return val if exists(val) else d
260
+
261
+
262
+ def cast_tuple(val):
263
+ return (val,) if not isinstance(val, tuple) else val
264
+
265
+
266
+ class PCmer(nn.Module):
267
+ """The encoder that is used in the Transformer model."""
268
+
269
+ def __init__(
270
+ self,
271
+ num_layers,
272
+ num_heads,
273
+ dim_model,
274
+ dim_keys,
275
+ dim_values,
276
+ residual_dropout,
277
+ attention_dropout,
278
+ ):
279
+ super().__init__()
280
+ self.num_layers = num_layers
281
+ self.num_heads = num_heads
282
+ self.dim_model = dim_model
283
+ self.dim_values = dim_values
284
+ self.dim_keys = dim_keys
285
+ self.residual_dropout = residual_dropout
286
+ self.attention_dropout = attention_dropout
287
+
288
+ self._layers = nn.ModuleList([_EncoderLayer(self) for _ in range(num_layers)])
289
+
290
+ # METHODS ########################################################################################################
291
+
292
+ def forward(self, phone, mask=None):
293
+
294
+ # apply all layers to the input
295
+ for i, layer in enumerate(self._layers):
296
+ phone = layer(phone, mask)
297
+ # provide the final sequence
298
+ return phone
299
+
300
+
301
+ # ==================================================================================================================== #
302
+ # CLASS _ E N C O D E R L A Y E R #
303
+ # ==================================================================================================================== #
304
+
305
+
306
+ class _EncoderLayer(nn.Module):
307
+ """One layer of the encoder.
308
+
309
+ Attributes:
310
+ attn: (:class:`mha.MultiHeadAttention`): The attention mechanism that is used to read the input sequence.
311
+ feed_forward (:class:`ffl.FeedForwardLayer`): The feed-forward layer on top of the attention mechanism.
312
+ """
313
+
314
+ def __init__(self, parent: PCmer):
315
+ """Creates a new instance of ``_EncoderLayer``.
316
+
317
+ Args:
318
+ parent (Encoder): The encoder that the layer is created for.
319
+ """
320
+ super().__init__()
321
+
322
+ self.conformer = ConformerConvModule(parent.dim_model)
323
+ self.norm = nn.LayerNorm(parent.dim_model)
324
+ self.dropout = nn.Dropout(parent.residual_dropout)
325
+
326
+ # selfatt -> fastatt: performer!
327
+ self.attn = SelfAttention(
328
+ dim=parent.dim_model, heads=parent.num_heads, causal=False
329
+ )
330
+
331
+ # METHODS ########################################################################################################
332
+
333
+ def forward(self, phone, mask=None):
334
+
335
+ # compute attention sub-layer
336
+ phone = phone + (self.attn(self.norm(phone), mask=mask))
337
+
338
+ phone = phone + (self.conformer(phone))
339
+
340
+ return phone
341
+
342
+
343
+ def calc_same_padding(kernel_size):
344
+ pad = kernel_size // 2
345
+ return (pad, pad - (kernel_size + 1) % 2)
346
+
347
+
348
+ # helper classes
349
+
350
+
351
+ class Swish(nn.Module):
352
+ def forward(self, x):
353
+ return x * x.sigmoid()
354
+
355
+
356
+ class Transpose(nn.Module):
357
+ def __init__(self, dims):
358
+ super().__init__()
359
+ assert len(dims) == 2, "dims must be a tuple of two dimensions"
360
+ self.dims = dims
361
+
362
+ def forward(self, x):
363
+ return x.transpose(*self.dims)
364
+
365
+
366
+ class GLU(nn.Module):
367
+ def __init__(self, dim):
368
+ super().__init__()
369
+ self.dim = dim
370
+
371
+ def forward(self, x):
372
+ out, gate = x.chunk(2, dim=self.dim)
373
+ return out * gate.sigmoid()
374
+
375
+
376
+ class DepthWiseConv1d(nn.Module):
377
+ def __init__(self, chan_in, chan_out, kernel_size, padding):
378
+ super().__init__()
379
+ self.padding = padding
380
+ self.conv = nn.Conv1d(chan_in, chan_out, kernel_size, groups=chan_in)
381
+
382
+ def forward(self, x):
383
+ x = F.pad(x, self.padding)
384
+ return self.conv(x)
385
+
386
+
387
+ class ConformerConvModule(nn.Module):
388
+ def __init__(
389
+ self, dim, causal=False, expansion_factor=2, kernel_size=31, dropout=0.0
390
+ ):
391
+ super().__init__()
392
+
393
+ inner_dim = dim * expansion_factor
394
+ padding = calc_same_padding(kernel_size) if not causal else (kernel_size - 1, 0)
395
+
396
+ self.net = nn.Sequential(
397
+ nn.LayerNorm(dim),
398
+ Transpose((1, 2)),
399
+ nn.Conv1d(dim, inner_dim * 2, 1),
400
+ GLU(dim=1),
401
+ DepthWiseConv1d(
402
+ inner_dim, inner_dim, kernel_size=kernel_size, padding=padding
403
+ ),
404
+ # nn.BatchNorm1d(inner_dim) if not causal else nn.Identity(),
405
+ Swish(),
406
+ nn.Conv1d(inner_dim, dim, 1),
407
+ Transpose((1, 2)),
408
+ nn.Dropout(dropout),
409
+ )
410
+
411
+ def forward(self, x):
412
+ return self.net(x)
413
+
414
+
415
+ def linear_attention(q, k, v):
416
+ if v is None:
417
+ # print (k.size(), q.size())
418
+ out = torch.einsum("...ed,...nd->...ne", k, q)
419
+ return out
420
+
421
+ else:
422
+ k_cumsum = k.sum(dim=-2)
423
+ # k_cumsum = k.sum(dim = -2)
424
+ D_inv = 1.0 / (torch.einsum("...nd,...d->...n", q, k_cumsum.type_as(q)) + 1e-8)
425
+
426
+ context = torch.einsum("...nd,...ne->...de", k, v)
427
+ # print ("TRUEEE: ", context.size(), q.size(), D_inv.size())
428
+ out = torch.einsum("...de,...nd,...n->...ne", context, q, D_inv)
429
+ return out
430
+
431
+
432
+ def gaussian_orthogonal_random_matrix(
433
+ nb_rows, nb_columns, scaling=0, qr_uniform_q=False, device=None
434
+ ):
435
+ nb_full_blocks = int(nb_rows / nb_columns)
436
+ # print (nb_full_blocks)
437
+ block_list = []
438
+
439
+ for _ in range(nb_full_blocks):
440
+ q = orthogonal_matrix_chunk(
441
+ nb_columns, qr_uniform_q=qr_uniform_q, device=device
442
+ )
443
+ block_list.append(q)
444
+ # block_list[n] is a orthogonal matrix ... (model_dim * model_dim)
445
+ # print (block_list[0].size(), torch.einsum('...nd,...nd->...n', block_list[0], torch.roll(block_list[0],1,1)))
446
+ # print (nb_rows, nb_full_blocks, nb_columns)
447
+ remaining_rows = nb_rows - nb_full_blocks * nb_columns
448
+ # print (remaining_rows)
449
+ if remaining_rows > 0:
450
+ q = orthogonal_matrix_chunk(
451
+ nb_columns, qr_uniform_q=qr_uniform_q, device=device
452
+ )
453
+ # print (q[:remaining_rows].size())
454
+ block_list.append(q[:remaining_rows])
455
+
456
+ final_matrix = torch.cat(block_list)
457
+
458
+ if scaling == 0:
459
+ multiplier = torch.randn((nb_rows, nb_columns), device=device).norm(dim=1)
460
+ elif scaling == 1:
461
+ multiplier = math.sqrt((float(nb_columns))) * torch.ones(
462
+ (nb_rows,), device=device
463
+ )
464
+ else:
465
+ raise ValueError(f"Invalid scaling {scaling}")
466
+
467
+ return torch.diag(multiplier) @ final_matrix
468
+
469
+
470
+ class FastAttention(nn.Module):
471
+ def __init__(
472
+ self,
473
+ dim_heads,
474
+ nb_features=None,
475
+ ortho_scaling=0,
476
+ causal=False,
477
+ generalized_attention=False,
478
+ kernel_fn=nn.ReLU(),
479
+ qr_uniform_q=False,
480
+ no_projection=False,
481
+ ):
482
+ super().__init__()
483
+ nb_features = default(nb_features, int(dim_heads * math.log(dim_heads)))
484
+
485
+ self.dim_heads = dim_heads
486
+ self.nb_features = nb_features
487
+ self.ortho_scaling = ortho_scaling
488
+
489
+ self.create_projection = partial(
490
+ gaussian_orthogonal_random_matrix,
491
+ nb_rows=self.nb_features,
492
+ nb_columns=dim_heads,
493
+ scaling=ortho_scaling,
494
+ qr_uniform_q=qr_uniform_q,
495
+ )
496
+ projection_matrix = self.create_projection()
497
+ self.register_buffer("projection_matrix", projection_matrix)
498
+
499
+ self.generalized_attention = generalized_attention
500
+ self.kernel_fn = kernel_fn
501
+
502
+ # if this is turned on, no projection will be used
503
+ # queries and keys will be softmax-ed as in the original efficient attention paper
504
+ self.no_projection = no_projection
505
+
506
+ self.causal = causal
507
+
508
+ @torch.no_grad()
509
+ def redraw_projection_matrix(self):
510
+ projections = self.create_projection()
511
+ self.projection_matrix.copy_(projections)
512
+ del projections
513
+
514
+ def forward(self, q, k, v):
515
+ device = q.device
516
+
517
+ if self.no_projection:
518
+ q = q.softmax(dim=-1)
519
+ k = torch.exp(k) if self.causal else k.softmax(dim=-2)
520
+ else:
521
+ create_kernel = partial(
522
+ softmax_kernel, projection_matrix=self.projection_matrix, device=device
523
+ )
524
+
525
+ q = create_kernel(q, is_query=True)
526
+ k = create_kernel(k, is_query=False)
527
+
528
+ attn_fn = linear_attention if not self.causal else self.causal_linear_fn
529
+ if v is None:
530
+ out = attn_fn(q, k, None)
531
+ return out
532
+ else:
533
+ out = attn_fn(q, k, v)
534
+ return out
535
+
536
+
537
+ class SelfAttention(nn.Module):
538
+ def __init__(
539
+ self,
540
+ dim,
541
+ causal=False,
542
+ heads=8,
543
+ dim_head=64,
544
+ local_heads=0,
545
+ local_window_size=256,
546
+ nb_features=None,
547
+ feature_redraw_interval=1000,
548
+ generalized_attention=False,
549
+ kernel_fn=nn.ReLU(),
550
+ qr_uniform_q=False,
551
+ dropout=0.0,
552
+ no_projection=False,
553
+ ):
554
+ super().__init__()
555
+ assert dim % heads == 0, "dimension must be divisible by number of heads"
556
+ dim_head = default(dim_head, dim // heads)
557
+ inner_dim = dim_head * heads
558
+ self.fast_attention = FastAttention(
559
+ dim_head,
560
+ nb_features,
561
+ causal=causal,
562
+ generalized_attention=generalized_attention,
563
+ kernel_fn=kernel_fn,
564
+ qr_uniform_q=qr_uniform_q,
565
+ no_projection=no_projection,
566
+ )
567
+
568
+ self.heads = heads
569
+ self.global_heads = heads - local_heads
570
+ self.local_attn = (
571
+ LocalAttention(
572
+ window_size=local_window_size,
573
+ causal=causal,
574
+ autopad=True,
575
+ dropout=dropout,
576
+ look_forward=int(not causal),
577
+ rel_pos_emb_config=(dim_head, local_heads),
578
+ )
579
+ if local_heads > 0
580
+ else None
581
+ )
582
+
583
+ # print (heads, nb_features, dim_head)
584
+ # name_embedding = torch.zeros(110, heads, dim_head, dim_head)
585
+ # self.name_embedding = nn.Parameter(name_embedding, requires_grad=True)
586
+
587
+ self.to_q = nn.Linear(dim, inner_dim)
588
+ self.to_k = nn.Linear(dim, inner_dim)
589
+ self.to_v = nn.Linear(dim, inner_dim)
590
+ self.to_out = nn.Linear(inner_dim, dim)
591
+ self.dropout = nn.Dropout(dropout)
592
+
593
+ @torch.no_grad()
594
+ def redraw_projection_matrix(self):
595
+ self.fast_attention.redraw_projection_matrix()
596
+ # torch.nn.init.zeros_(self.name_embedding)
597
+ # print (torch.sum(self.name_embedding))
598
+
599
+ def forward(
600
+ self,
601
+ x,
602
+ context=None,
603
+ mask=None,
604
+ context_mask=None,
605
+ name=None,
606
+ inference=False,
607
+ **kwargs,
608
+ ):
609
+ _, _, _, h, gh = *x.shape, self.heads, self.global_heads
610
+
611
+ cross_attend = exists(context)
612
+
613
+ context = default(context, x)
614
+ context_mask = default(context_mask, mask) if not cross_attend else context_mask
615
+ # print (torch.sum(self.name_embedding))
616
+ q, k, v = self.to_q(x), self.to_k(context), self.to_v(context)
617
+
618
+ q, k, v = map(lambda t: rearrange(t, "b n (h d) -> b h n d", h=h), (q, k, v))
619
+ (q, lq), (k, lk), (v, lv) = map(lambda t: (t[:, :gh], t[:, gh:]), (q, k, v))
620
+
621
+ attn_outs = []
622
+ # print (name)
623
+ # print (self.name_embedding[name].size())
624
+ if not empty(q):
625
+ if exists(context_mask):
626
+ global_mask = context_mask[:, None, :, None]
627
+ v.masked_fill_(~global_mask, 0.0)
628
+ if cross_attend:
629
+ pass
630
+ # print (torch.sum(self.name_embedding))
631
+ # out = self.fast_attention(q,self.name_embedding[name],None)
632
+ # print (torch.sum(self.name_embedding[...,-1:]))
633
+ else:
634
+ out = self.fast_attention(q, k, v)
635
+ attn_outs.append(out)
636
+
637
+ if not empty(lq):
638
+ assert (
639
+ not cross_attend
640
+ ), "local attention is not compatible with cross attention"
641
+ out = self.local_attn(lq, lk, lv, input_mask=mask)
642
+ attn_outs.append(out)
643
+
644
+ out = torch.cat(attn_outs, dim=1)
645
+ out = rearrange(out, "b h n d -> b n (h d)")
646
+ out = self.to_out(out)
647
+ return self.dropout(out)
648
+
649
+
650
+ def l2_regularization(model, l2_alpha):
651
+ l2_loss = []
652
+ for module in model.modules():
653
+ if type(module) is nn.Conv2d:
654
+ l2_loss.append((module.weight**2).sum() / 2.0)
655
+ return l2_alpha * sum(l2_loss)
656
+
657
+
658
+ class FCPEModel(nn.Module):
659
+ def __init__(
660
+ self,
661
+ input_channel=128,
662
+ out_dims=360,
663
+ n_layers=12,
664
+ n_chans=512,
665
+ use_siren=False,
666
+ use_full=False,
667
+ loss_mse_scale=10,
668
+ loss_l2_regularization=False,
669
+ loss_l2_regularization_scale=1,
670
+ loss_grad1_mse=False,
671
+ loss_grad1_mse_scale=1,
672
+ f0_max=1975.5,
673
+ f0_min=32.70,
674
+ confidence=False,
675
+ threshold=0.05,
676
+ use_input_conv=True,
677
+ ):
678
+ super().__init__()
679
+ if use_siren is True:
680
+ raise ValueError("Siren is not supported yet.")
681
+ if use_full is True:
682
+ raise ValueError("Full model is not supported yet.")
683
+
684
+ self.loss_mse_scale = loss_mse_scale if (loss_mse_scale is not None) else 10
685
+ self.loss_l2_regularization = (
686
+ loss_l2_regularization if (loss_l2_regularization is not None) else False
687
+ )
688
+ self.loss_l2_regularization_scale = (
689
+ loss_l2_regularization_scale
690
+ if (loss_l2_regularization_scale is not None)
691
+ else 1
692
+ )
693
+ self.loss_grad1_mse = loss_grad1_mse if (loss_grad1_mse is not None) else False
694
+ self.loss_grad1_mse_scale = (
695
+ loss_grad1_mse_scale if (loss_grad1_mse_scale is not None) else 1
696
+ )
697
+ self.f0_max = f0_max if (f0_max is not None) else 1975.5
698
+ self.f0_min = f0_min if (f0_min is not None) else 32.70
699
+ self.confidence = confidence if (confidence is not None) else False
700
+ self.threshold = threshold if (threshold is not None) else 0.05
701
+ self.use_input_conv = use_input_conv if (use_input_conv is not None) else True
702
+
703
+ self.cent_table_b = torch.Tensor(
704
+ np.linspace(
705
+ self.f0_to_cent(torch.Tensor([f0_min]))[0],
706
+ self.f0_to_cent(torch.Tensor([f0_max]))[0],
707
+ out_dims,
708
+ )
709
+ )
710
+ self.register_buffer("cent_table", self.cent_table_b)
711
+
712
+ # conv in stack
713
+ _leaky = nn.LeakyReLU()
714
+ self.stack = nn.Sequential(
715
+ nn.Conv1d(input_channel, n_chans, 3, 1, 1),
716
+ nn.GroupNorm(4, n_chans),
717
+ _leaky,
718
+ nn.Conv1d(n_chans, n_chans, 3, 1, 1),
719
+ )
720
+
721
+ # transformer
722
+ self.decoder = PCmer(
723
+ num_layers=n_layers,
724
+ num_heads=8,
725
+ dim_model=n_chans,
726
+ dim_keys=n_chans,
727
+ dim_values=n_chans,
728
+ residual_dropout=0.1,
729
+ attention_dropout=0.1,
730
+ )
731
+ self.norm = nn.LayerNorm(n_chans)
732
+
733
+ # out
734
+ self.n_out = out_dims
735
+ self.dense_out = weight_norm(nn.Linear(n_chans, self.n_out))
736
+
737
+ def forward(
738
+ self, mel, infer=True, gt_f0=None, return_hz_f0=False, cdecoder="local_argmax"
739
+ ):
740
+ """
741
+ input:
742
+ B x n_frames x n_unit
743
+ return:
744
+ dict of B x n_frames x feat
745
+ """
746
+ if cdecoder == "argmax":
747
+ self.cdecoder = self.cents_decoder
748
+ elif cdecoder == "local_argmax":
749
+ self.cdecoder = self.cents_local_decoder
750
+ if self.use_input_conv:
751
+ x = self.stack(mel.transpose(1, 2)).transpose(1, 2)
752
+ else:
753
+ x = mel
754
+ x = self.decoder(x)
755
+ x = self.norm(x)
756
+ x = self.dense_out(x) # [B,N,D]
757
+ x = torch.sigmoid(x)
758
+ if not infer:
759
+ gt_cent_f0 = self.f0_to_cent(gt_f0) # mel f0 #[B,N,1]
760
+ gt_cent_f0 = self.gaussian_blurred_cent(gt_cent_f0) # #[B,N,out_dim]
761
+ loss_all = self.loss_mse_scale * F.binary_cross_entropy(
762
+ x, gt_cent_f0
763
+ ) # bce loss
764
+ # l2 regularization
765
+ if self.loss_l2_regularization:
766
+ loss_all = loss_all + l2_regularization(
767
+ model=self, l2_alpha=self.loss_l2_regularization_scale
768
+ )
769
+ x = loss_all
770
+ if infer:
771
+ x = self.cdecoder(x)
772
+ x = self.cent_to_f0(x)
773
+ if not return_hz_f0:
774
+ x = (1 + x / 700).log()
775
+ return x
776
+
777
+ def cents_decoder(self, y, mask=True):
778
+ B, N, _ = y.size()
779
+ ci = self.cent_table[None, None, :].expand(B, N, -1)
780
+ rtn = torch.sum(ci * y, dim=-1, keepdim=True) / torch.sum(
781
+ y, dim=-1, keepdim=True
782
+ ) # cents: [B,N,1]
783
+ if mask:
784
+ confident = torch.max(y, dim=-1, keepdim=True)[0]
785
+ confident_mask = torch.ones_like(confident)
786
+ confident_mask[confident <= self.threshold] = float("-INF")
787
+ rtn = rtn * confident_mask
788
+ if self.confidence:
789
+ return rtn, confident
790
+ else:
791
+ return rtn
792
+
793
+ def cents_local_decoder(self, y, mask=True):
794
+ B, N, _ = y.size()
795
+ ci = self.cent_table[None, None, :].expand(B, N, -1)
796
+ confident, max_index = torch.max(y, dim=-1, keepdim=True)
797
+ local_argmax_index = torch.arange(0, 9).to(max_index.device) + (max_index - 4)
798
+ local_argmax_index[local_argmax_index < 0] = 0
799
+ local_argmax_index[local_argmax_index >= self.n_out] = self.n_out - 1
800
+ ci_l = torch.gather(ci, -1, local_argmax_index)
801
+ y_l = torch.gather(y, -1, local_argmax_index)
802
+ rtn = torch.sum(ci_l * y_l, dim=-1, keepdim=True) / torch.sum(
803
+ y_l, dim=-1, keepdim=True
804
+ ) # cents: [B,N,1]
805
+ if mask:
806
+ confident_mask = torch.ones_like(confident)
807
+ confident_mask[confident <= self.threshold] = float("-INF")
808
+ rtn = rtn * confident_mask
809
+ if self.confidence:
810
+ return rtn, confident
811
+ else:
812
+ return rtn
813
+
814
+ def cent_to_f0(self, cent):
815
+ return 10.0 * 2 ** (cent / 1200.0)
816
+
817
+ def f0_to_cent(self, f0):
818
+ return 1200.0 * torch.log2(f0 / 10.0)
819
+
820
+ def gaussian_blurred_cent(self, cents): # cents: [B,N,1]
821
+ mask = (cents > 0.1) & (cents < (1200.0 * np.log2(self.f0_max / 10.0)))
822
+ B, N, _ = cents.size()
823
+ ci = self.cent_table[None, None, :].expand(B, N, -1)
824
+ return torch.exp(-torch.square(ci - cents) / 1250) * mask.float()
825
+
826
+
827
+ class FCPEInfer:
828
+ def __init__(self, model_path, device=None, dtype=torch.float32):
829
+ if device is None:
830
+ device = "cuda" if torch.cuda.is_available() else "cpu"
831
+ self.device = device
832
+ ckpt = torch.load(model_path, map_location=torch.device(self.device))
833
+ self.args = DotDict(ckpt["config"])
834
+ self.dtype = dtype
835
+ model = FCPEModel(
836
+ input_channel=self.args.model.input_channel,
837
+ out_dims=self.args.model.out_dims,
838
+ n_layers=self.args.model.n_layers,
839
+ n_chans=self.args.model.n_chans,
840
+ use_siren=self.args.model.use_siren,
841
+ use_full=self.args.model.use_full,
842
+ loss_mse_scale=self.args.loss.loss_mse_scale,
843
+ loss_l2_regularization=self.args.loss.loss_l2_regularization,
844
+ loss_l2_regularization_scale=self.args.loss.loss_l2_regularization_scale,
845
+ loss_grad1_mse=self.args.loss.loss_grad1_mse,
846
+ loss_grad1_mse_scale=self.args.loss.loss_grad1_mse_scale,
847
+ f0_max=self.args.model.f0_max,
848
+ f0_min=self.args.model.f0_min,
849
+ confidence=self.args.model.confidence,
850
+ )
851
+ model.to(self.device).to(self.dtype)
852
+ model.load_state_dict(ckpt["model"])
853
+ model.eval()
854
+ self.model = model
855
+ self.wav2mel = Wav2Mel(self.args, dtype=self.dtype, device=self.device)
856
+
857
+ @torch.no_grad()
858
+ def __call__(self, audio, sr, threshold=0.05):
859
+ self.model.threshold = threshold
860
+ audio = audio[None, :]
861
+ mel = self.wav2mel(audio=audio, sample_rate=sr).to(self.dtype)
862
+ f0 = self.model(mel=mel, infer=True, return_hz_f0=True)
863
+ return f0
864
+
865
+
866
+ class Wav2Mel:
867
+
868
+ def __init__(self, args, device=None, dtype=torch.float32):
869
+ # self.args = args
870
+ self.sampling_rate = args.mel.sampling_rate
871
+ self.hop_size = args.mel.hop_size
872
+ if device is None:
873
+ device = "cuda" if torch.cuda.is_available() else "cpu"
874
+ self.device = device
875
+ self.dtype = dtype
876
+ self.stft = STFT(
877
+ args.mel.sampling_rate,
878
+ args.mel.num_mels,
879
+ args.mel.n_fft,
880
+ args.mel.win_size,
881
+ args.mel.hop_size,
882
+ args.mel.fmin,
883
+ args.mel.fmax,
884
+ )
885
+ self.resample_kernel = {}
886
+
887
+ def extract_nvstft(self, audio, keyshift=0, train=False):
888
+ mel = self.stft.get_mel(audio, keyshift=keyshift, train=train).transpose(
889
+ 1, 2
890
+ ) # B, n_frames, bins
891
+ return mel
892
+
893
+ def extract_mel(self, audio, sample_rate, keyshift=0, train=False):
894
+ audio = audio.to(self.dtype).to(self.device)
895
+ # resample
896
+ if sample_rate == self.sampling_rate:
897
+ audio_res = audio
898
+ else:
899
+ key_str = str(sample_rate)
900
+ if key_str not in self.resample_kernel:
901
+ self.resample_kernel[key_str] = Resample(
902
+ sample_rate, self.sampling_rate, lowpass_filter_width=128
903
+ )
904
+ self.resample_kernel[key_str] = (
905
+ self.resample_kernel[key_str].to(self.dtype).to(self.device)
906
+ )
907
+ audio_res = self.resample_kernel[key_str](audio)
908
+
909
+ # extract
910
+ mel = self.extract_nvstft(
911
+ audio_res, keyshift=keyshift, train=train
912
+ ) # B, n_frames, bins
913
+ n_frames = int(audio.shape[1] // self.hop_size) + 1
914
+ if n_frames > int(mel.shape[1]):
915
+ mel = torch.cat((mel, mel[:, -1:, :]), 1)
916
+ if n_frames < int(mel.shape[1]):
917
+ mel = mel[:, :n_frames, :]
918
+ return mel
919
+
920
+ def __call__(self, audio, sample_rate, keyshift=0, train=False):
921
+ return self.extract_mel(audio, sample_rate, keyshift=keyshift, train=train)
922
+
923
+
924
+ class DotDict(dict):
925
+ def __getattr__(*args):
926
+ val = dict.get(*args)
927
+ return DotDict(val) if type(val) is dict else val
928
+
929
+ __setattr__ = dict.__setitem__
930
+ __delattr__ = dict.__delitem__
931
+
932
+
933
+ class F0Predictor(object):
934
+ def compute_f0(self, wav, p_len):
935
+ """
936
+ input: wav:[signal_length]
937
+ p_len:int
938
+ output: f0:[signal_length//hop_length]
939
+ """
940
+ pass
941
+
942
+ def compute_f0_uv(self, wav, p_len):
943
+ """
944
+ input: wav:[signal_length]
945
+ p_len:int
946
+ output: f0:[signal_length//hop_length],uv:[signal_length//hop_length]
947
+ """
948
+ pass
949
+
950
+
951
+ class FCPE(F0Predictor):
952
+ def __init__(
953
+ self,
954
+ model_path,
955
+ hop_length=512,
956
+ f0_min=50,
957
+ f0_max=1100,
958
+ dtype=torch.float32,
959
+ device=None,
960
+ sampling_rate=44100,
961
+ threshold=0.05,
962
+ ):
963
+ self.fcpe = FCPEInfer(model_path, device=device, dtype=dtype)
964
+ self.hop_length = hop_length
965
+ self.f0_min = f0_min
966
+ self.f0_max = f0_max
967
+ if device is None:
968
+ self.device = "cuda" if torch.cuda.is_available() else "cpu"
969
+ else:
970
+ self.device = device
971
+ self.threshold = threshold
972
+ self.sampling_rate = sampling_rate
973
+ self.dtype = dtype
974
+ self.name = "fcpe"
975
+
976
+ def repeat_expand(
977
+ self,
978
+ content: Union[torch.Tensor, np.ndarray],
979
+ target_len: int,
980
+ mode: str = "nearest",
981
+ ):
982
+ ndim = content.ndim
983
+
984
+ if content.ndim == 1:
985
+ content = content[None, None]
986
+ elif content.ndim == 2:
987
+ content = content[None]
988
+
989
+ assert content.ndim == 3
990
+
991
+ is_np = isinstance(content, np.ndarray)
992
+ if is_np:
993
+ content = torch.from_numpy(content)
994
+
995
+ results = torch.nn.functional.interpolate(content, size=target_len, mode=mode)
996
+
997
+ if is_np:
998
+ results = results.numpy()
999
+
1000
+ if ndim == 1:
1001
+ return results[0, 0]
1002
+ elif ndim == 2:
1003
+ return results[0]
1004
+
1005
+ def post_process(self, x, sampling_rate, f0, pad_to):
1006
+ if isinstance(f0, np.ndarray):
1007
+ f0 = torch.from_numpy(f0).float().to(x.device)
1008
+
1009
+ if pad_to is None:
1010
+ return f0
1011
+
1012
+ f0 = self.repeat_expand(f0, pad_to)
1013
+
1014
+ vuv_vector = torch.zeros_like(f0)
1015
+ vuv_vector[f0 > 0.0] = 1.0
1016
+ vuv_vector[f0 <= 0.0] = 0.0
1017
+
1018
+ # drop the zero-frequency (unvoiced) points, then interpolate linearly over them
1019
+ nzindex = torch.nonzero(f0).squeeze()
1020
+ f0 = torch.index_select(f0, dim=0, index=nzindex).cpu().numpy()
1021
+ time_org = self.hop_length / sampling_rate * nzindex.cpu().numpy()
1022
+ time_frame = np.arange(pad_to) * self.hop_length / sampling_rate
1023
+
1024
+ vuv_vector = F.interpolate(vuv_vector[None, None, :], size=pad_to)[0][0]
1025
+
1026
+ if f0.shape[0] <= 0:
1027
+ return (
1028
+ torch.zeros(pad_to, dtype=torch.float, device=x.device).cpu().numpy(),
1029
+ vuv_vector.cpu().numpy(),
1030
+ )
1031
+ if f0.shape[0] == 1:
1032
+ return (
1033
+ torch.ones(pad_to, dtype=torch.float, device=x.device) * f0[0]
1034
+ ).cpu().numpy(), vuv_vector.cpu().numpy()
1035
+
1036
+ # this could probably be rewritten with torch
1037
+ f0 = np.interp(time_frame, time_org, f0, left=f0[0], right=f0[-1])
1038
+ # vuv_vector = np.ceil(scipy.ndimage.zoom(vuv_vector,pad_to/len(vuv_vector),order = 0))
1039
+
1040
+ return f0, vuv_vector.cpu().numpy()
1041
+
1042
+ def compute_f0(self, wav, p_len=None):
1043
+ x = torch.FloatTensor(wav).to(self.dtype).to(self.device)
1044
+ if p_len is None:
1045
+ print("fcpe p_len is None")
1046
+ p_len = x.shape[0] // self.hop_length
1047
+ # else:
1048
+ # assert abs(p_len - x.shape[0] // self.hop_length) < 4, "pad length error"
1049
+ f0 = self.fcpe(x, sr=self.sampling_rate, threshold=self.threshold)[0, :, 0]
1050
+ if torch.all(f0 == 0):
1051
+ rtn = f0.cpu().numpy() if p_len is None else np.zeros(p_len)
1052
+ return rtn, rtn
1053
+ return self.post_process(x, self.sampling_rate, f0, p_len)[0]
1054
+
1055
+ def compute_f0_uv(self, wav, p_len=None):
1056
+ x = torch.FloatTensor(wav).to(self.dtype).to(self.device)
1057
+ if p_len is None:
1058
+ p_len = x.shape[0] // self.hop_length
1059
+ # else:
1060
+ # assert abs(p_len - x.shape[0] // self.hop_length) < 4, "pad length error"
1061
+ f0 = self.fcpe(x, sr=self.sampling_rate, threshold=self.threshold)[0, :, 0]
1062
+ if torch.all(f0 == 0):
1063
+ rtn = f0.cpu().numpy() if p_len is None else np.zeros(p_len)
1064
+ return rtn, rtn
1065
+ return self.post_process(x, self.sampling_rate, f0, p_len)
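For orientation, a minimal usage sketch of the `FCPE` pitch predictor defined above. This is not part of the uploaded files; the import path, the checkpoint name `fcpe.pt`, and the audio file `voice.wav` are placeholders/assumptions.

```python
# Hypothetical usage of the FCPE f0 predictor above; paths are placeholders.
import librosa

# Import path is an assumption -- the filename of the module above is not shown in this diff.
from rvc_inferpy.infer_list.fcpe import FCPE

wav, _ = librosa.load("voice.wav", sr=44100)   # mono float32 signal at 44.1 kHz
predictor = FCPE("fcpe.pt", device="cpu")      # defaults: hop_length=512, sampling_rate=44100

# One f0 value (Hz) and one voiced/unvoiced flag per hop of 512 samples.
f0, uv = predictor.compute_f0_uv(wav, p_len=len(wav) // 512)
print(f0.shape, uv.shape)
```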
rvc_inferpy/infer_list/packs/__init__.py ADDED
@@ -0,0 +1 @@
1
+
rvc_inferpy/infer_list/packs/attentions.py ADDED
@@ -0,0 +1,414 @@
1
+ import math
2
+ import torch
3
+ from torch import nn
4
+ from torch.nn import functional as F
5
+
6
+ from rvc_inferpy.infer_list.packs import commons
7
+ from rvc_inferpy.infer_list.packs.modules import LayerNorm
8
+
9
+
10
+ class Encoder(nn.Module):
11
+ def __init__(
12
+ self,
13
+ hidden_channels,
14
+ filter_channels,
15
+ n_heads,
16
+ n_layers,
17
+ kernel_size=1,
18
+ p_dropout=0.0,
19
+ window_size=10,
20
+ **kwargs
21
+ ):
22
+ super().__init__()
23
+ self.hidden_channels = hidden_channels
24
+ self.filter_channels = filter_channels
25
+ self.n_heads = n_heads
26
+ self.n_layers = n_layers
27
+ self.kernel_size = kernel_size
28
+ self.p_dropout = p_dropout
29
+ self.window_size = window_size
30
+
31
+ self.drop = nn.Dropout(p_dropout)
32
+ self.attn_layers = nn.ModuleList()
33
+ self.norm_layers_1 = nn.ModuleList()
34
+ self.ffn_layers = nn.ModuleList()
35
+ self.norm_layers_2 = nn.ModuleList()
36
+ for i in range(self.n_layers):
37
+ self.attn_layers.append(
38
+ MultiHeadAttention(
39
+ hidden_channels,
40
+ hidden_channels,
41
+ n_heads,
42
+ p_dropout=p_dropout,
43
+ window_size=window_size,
44
+ )
45
+ )
46
+ self.norm_layers_1.append(LayerNorm(hidden_channels))
47
+ self.ffn_layers.append(
48
+ FFN(
49
+ hidden_channels,
50
+ hidden_channels,
51
+ filter_channels,
52
+ kernel_size,
53
+ p_dropout=p_dropout,
54
+ )
55
+ )
56
+ self.norm_layers_2.append(LayerNorm(hidden_channels))
57
+
58
+ def forward(self, x, x_mask):
59
+ attn_mask = x_mask.unsqueeze(2) * x_mask.unsqueeze(-1)
60
+ x = x * x_mask
61
+ for i in range(self.n_layers):
62
+ y = self.attn_layers[i](x, x, attn_mask)
63
+ y = self.drop(y)
64
+ x = self.norm_layers_1[i](x + y)
65
+
66
+ y = self.ffn_layers[i](x, x_mask)
67
+ y = self.drop(y)
68
+ x = self.norm_layers_2[i](x + y)
69
+ x = x * x_mask
70
+ return x
71
+
72
+
73
+ class Decoder(nn.Module):
74
+ def __init__(
75
+ self,
76
+ hidden_channels,
77
+ filter_channels,
78
+ n_heads,
79
+ n_layers,
80
+ kernel_size=1,
81
+ p_dropout=0.0,
82
+ proximal_bias=False,
83
+ proximal_init=True,
84
+ **kwargs
85
+ ):
86
+ super().__init__()
87
+ self.hidden_channels = hidden_channels
88
+ self.filter_channels = filter_channels
89
+ self.n_heads = n_heads
90
+ self.n_layers = n_layers
91
+ self.kernel_size = kernel_size
92
+ self.p_dropout = p_dropout
93
+ self.proximal_bias = proximal_bias
94
+ self.proximal_init = proximal_init
95
+
96
+ self.drop = nn.Dropout(p_dropout)
97
+ self.self_attn_layers = nn.ModuleList()
98
+ self.norm_layers_0 = nn.ModuleList()
99
+ self.encdec_attn_layers = nn.ModuleList()
100
+ self.norm_layers_1 = nn.ModuleList()
101
+ self.ffn_layers = nn.ModuleList()
102
+ self.norm_layers_2 = nn.ModuleList()
103
+ for i in range(self.n_layers):
104
+ self.self_attn_layers.append(
105
+ MultiHeadAttention(
106
+ hidden_channels,
107
+ hidden_channels,
108
+ n_heads,
109
+ p_dropout=p_dropout,
110
+ proximal_bias=proximal_bias,
111
+ proximal_init=proximal_init,
112
+ )
113
+ )
114
+ self.norm_layers_0.append(LayerNorm(hidden_channels))
115
+ self.encdec_attn_layers.append(
116
+ MultiHeadAttention(
117
+ hidden_channels, hidden_channels, n_heads, p_dropout=p_dropout
118
+ )
119
+ )
120
+ self.norm_layers_1.append(LayerNorm(hidden_channels))
121
+ self.ffn_layers.append(
122
+ FFN(
123
+ hidden_channels,
124
+ hidden_channels,
125
+ filter_channels,
126
+ kernel_size,
127
+ p_dropout=p_dropout,
128
+ causal=True,
129
+ )
130
+ )
131
+ self.norm_layers_2.append(LayerNorm(hidden_channels))
132
+
133
+ def forward(self, x, x_mask, h, h_mask):
134
+ """
135
+ x: decoder input
136
+ h: encoder output
137
+ """
138
+ self_attn_mask = commons.subsequent_mask(x_mask.size(2)).to(
139
+ device=x.device, dtype=x.dtype
140
+ )
141
+ encdec_attn_mask = h_mask.unsqueeze(2) * x_mask.unsqueeze(-1)
142
+ x = x * x_mask
143
+ for i in range(self.n_layers):
144
+ y = self.self_attn_layers[i](x, x, self_attn_mask)
145
+ y = self.drop(y)
146
+ x = self.norm_layers_0[i](x + y)
147
+
148
+ y = self.encdec_attn_layers[i](x, h, encdec_attn_mask)
149
+ y = self.drop(y)
150
+ x = self.norm_layers_1[i](x + y)
151
+
152
+ y = self.ffn_layers[i](x, x_mask)
153
+ y = self.drop(y)
154
+ x = self.norm_layers_2[i](x + y)
155
+ x = x * x_mask
156
+ return x
157
+
158
+
159
+ class MultiHeadAttention(nn.Module):
160
+ def __init__(
161
+ self,
162
+ channels,
163
+ out_channels,
164
+ n_heads,
165
+ p_dropout=0.0,
166
+ window_size=None,
167
+ heads_share=True,
168
+ block_length=None,
169
+ proximal_bias=False,
170
+ proximal_init=False,
171
+ ):
172
+ super().__init__()
173
+ assert channels % n_heads == 0
174
+
175
+ self.channels = channels
176
+ self.out_channels = out_channels
177
+ self.n_heads = n_heads
178
+ self.p_dropout = p_dropout
179
+ self.window_size = window_size
180
+ self.heads_share = heads_share
181
+ self.block_length = block_length
182
+ self.proximal_bias = proximal_bias
183
+ self.proximal_init = proximal_init
184
+ self.attn = None
185
+
186
+ self.k_channels = channels // n_heads
187
+ self.conv_q = nn.Conv1d(channels, channels, 1)
188
+ self.conv_k = nn.Conv1d(channels, channels, 1)
189
+ self.conv_v = nn.Conv1d(channels, channels, 1)
190
+ self.conv_o = nn.Conv1d(channels, out_channels, 1)
191
+ self.drop = nn.Dropout(p_dropout)
192
+
193
+ if window_size is not None:
194
+ n_heads_rel = 1 if heads_share else n_heads
195
+ rel_stddev = self.k_channels**-0.5
196
+ self.emb_rel_k = nn.Parameter(
197
+ torch.randn(n_heads_rel, window_size * 2 + 1, self.k_channels)
198
+ * rel_stddev
199
+ )
200
+ self.emb_rel_v = nn.Parameter(
201
+ torch.randn(n_heads_rel, window_size * 2 + 1, self.k_channels)
202
+ * rel_stddev
203
+ )
204
+
205
+ nn.init.xavier_uniform_(self.conv_q.weight)
206
+ nn.init.xavier_uniform_(self.conv_k.weight)
207
+ nn.init.xavier_uniform_(self.conv_v.weight)
208
+ if proximal_init:
209
+ with torch.no_grad():
210
+ self.conv_k.weight.copy_(self.conv_q.weight)
211
+ self.conv_k.bias.copy_(self.conv_q.bias)
212
+
213
+ def forward(self, x, c, attn_mask=None):
214
+ q = self.conv_q(x)
215
+ k = self.conv_k(c)
216
+ v = self.conv_v(c)
217
+
218
+ x, self.attn = self.attention(q, k, v, mask=attn_mask)
219
+
220
+ x = self.conv_o(x)
221
+ return x
222
+
223
+ def attention(self, query, key, value, mask=None):
224
+ # reshape [b, d, t] -> [b, n_h, t, d_k]
225
+ b, d, t_s, t_t = (*key.size(), query.size(2))
226
+ query = query.view(b, self.n_heads, self.k_channels, t_t).transpose(2, 3)
227
+ key = key.view(b, self.n_heads, self.k_channels, t_s).transpose(2, 3)
228
+ value = value.view(b, self.n_heads, self.k_channels, t_s).transpose(2, 3)
229
+
230
+ scores = torch.matmul(query / math.sqrt(self.k_channels), key.transpose(-2, -1))
231
+ if self.window_size is not None:
232
+ assert (
233
+ t_s == t_t
234
+ ), "Relative attention is only available for self-attention."
235
+ key_relative_embeddings = self._get_relative_embeddings(self.emb_rel_k, t_s)
236
+ rel_logits = self._matmul_with_relative_keys(
237
+ query / math.sqrt(self.k_channels), key_relative_embeddings
238
+ )
239
+ scores_local = self._relative_position_to_absolute_position(rel_logits)
240
+ scores = scores + scores_local
241
+ if self.proximal_bias:
242
+ assert t_s == t_t, "Proximal bias is only available for self-attention."
243
+ scores = scores + self._attention_bias_proximal(t_s).to(
244
+ device=scores.device, dtype=scores.dtype
245
+ )
246
+ if mask is not None:
247
+ scores = scores.masked_fill(mask == 0, -1e4)
248
+ if self.block_length is not None:
249
+ assert (
250
+ t_s == t_t
251
+ ), "Local attention is only available for self-attention."
252
+ block_mask = (
253
+ torch.ones_like(scores)
254
+ .triu(-self.block_length)
255
+ .tril(self.block_length)
256
+ )
257
+ scores = scores.masked_fill(block_mask == 0, -1e4)
258
+ p_attn = F.softmax(scores, dim=-1) # [b, n_h, t_t, t_s]
259
+ p_attn = self.drop(p_attn)
260
+ output = torch.matmul(p_attn, value)
261
+ if self.window_size is not None:
262
+ relative_weights = self._absolute_position_to_relative_position(p_attn)
263
+ value_relative_embeddings = self._get_relative_embeddings(
264
+ self.emb_rel_v, t_s
265
+ )
266
+ output = output + self._matmul_with_relative_values(
267
+ relative_weights, value_relative_embeddings
268
+ )
269
+ output = (
270
+ output.transpose(2, 3).contiguous().view(b, d, t_t)
271
+ ) # [b, n_h, t_t, d_k] -> [b, d, t_t]
272
+ return output, p_attn
273
+
274
+ def _matmul_with_relative_values(self, x, y):
275
+ """
276
+ x: [b, h, l, m]
277
+ y: [h or 1, m, d]
278
+ ret: [b, h, l, d]
279
+ """
280
+ ret = torch.matmul(x, y.unsqueeze(0))
281
+ return ret
282
+
283
+ def _matmul_with_relative_keys(self, x, y):
284
+ """
285
+ x: [b, h, l, d]
286
+ y: [h or 1, m, d]
287
+ ret: [b, h, l, m]
288
+ """
289
+ ret = torch.matmul(x, y.unsqueeze(0).transpose(-2, -1))
290
+ return ret
291
+
292
+ def _get_relative_embeddings(self, relative_embeddings, length):
293
+ max_relative_position = 2 * self.window_size + 1
294
+ # Pad first before slice to avoid using cond ops.
295
+ pad_length = max(length - (self.window_size + 1), 0)
296
+ slice_start_position = max((self.window_size + 1) - length, 0)
297
+ slice_end_position = slice_start_position + 2 * length - 1
298
+ if pad_length > 0:
299
+ padded_relative_embeddings = F.pad(
300
+ relative_embeddings,
301
+ commons.convert_pad_shape([[0, 0], [pad_length, pad_length], [0, 0]]),
302
+ )
303
+ else:
304
+ padded_relative_embeddings = relative_embeddings
305
+ used_relative_embeddings = padded_relative_embeddings[
306
+ :, slice_start_position:slice_end_position
307
+ ]
308
+ return used_relative_embeddings
309
+
310
+ def _relative_position_to_absolute_position(self, x):
311
+ """
312
+ x: [b, h, l, 2*l-1]
313
+ ret: [b, h, l, l]
314
+ """
315
+ batch, heads, length, _ = x.size()
316
+ # Concat columns of pad to shift from relative to absolute indexing.
317
+ x = F.pad(x, commons.convert_pad_shape([[0, 0], [0, 0], [0, 0], [0, 1]]))
318
+
319
+ # Concat extra elements so as to add up to shape (len+1, 2*len-1).
320
+ x_flat = x.view([batch, heads, length * 2 * length])
321
+ x_flat = F.pad(
322
+ x_flat, commons.convert_pad_shape([[0, 0], [0, 0], [0, length - 1]])
323
+ )
324
+
325
+ # Reshape and slice out the padded elements.
326
+ x_final = x_flat.view([batch, heads, length + 1, 2 * length - 1])[
327
+ :, :, :length, length - 1 :
328
+ ]
329
+ return x_final
330
+
331
+ def _absolute_position_to_relative_position(self, x):
332
+ """
333
+ x: [b, h, l, l]
334
+ ret: [b, h, l, 2*l-1]
335
+ """
336
+ batch, heads, length, _ = x.size()
337
+ # pad along the column
338
+ x = F.pad(
339
+ x, commons.convert_pad_shape([[0, 0], [0, 0], [0, 0], [0, length - 1]])
340
+ )
341
+ x_flat = x.view([batch, heads, length**2 + length * (length - 1)])
342
+ # add 0's in the beginning that will skew the elements after reshape
343
+ x_flat = F.pad(x_flat, commons.convert_pad_shape([[0, 0], [0, 0], [length, 0]]))
344
+ x_final = x_flat.view([batch, heads, length, 2 * length])[:, :, :, 1:]
345
+ return x_final
346
+
347
+ def _attention_bias_proximal(self, length):
348
+ """Bias for self-attention to encourage attention to close positions.
349
+ Args:
350
+ length: an integer scalar.
351
+ Returns:
352
+ a Tensor with shape [1, 1, length, length]
353
+ """
354
+ r = torch.arange(length, dtype=torch.float32)
355
+ diff = torch.unsqueeze(r, 0) - torch.unsqueeze(r, 1)
356
+ return torch.unsqueeze(torch.unsqueeze(-torch.log1p(torch.abs(diff)), 0), 0)
357
+
358
+
359
+ class FFN(nn.Module):
360
+ def __init__(
361
+ self,
362
+ in_channels,
363
+ out_channels,
364
+ filter_channels,
365
+ kernel_size,
366
+ p_dropout=0.0,
367
+ activation=None,
368
+ causal=False,
369
+ ):
370
+ super().__init__()
371
+ self.in_channels = in_channels
372
+ self.out_channels = out_channels
373
+ self.filter_channels = filter_channels
374
+ self.kernel_size = kernel_size
375
+ self.p_dropout = p_dropout
376
+ self.activation = activation
377
+ self.causal = causal
378
+
379
+ if causal:
380
+ self.padding = self._causal_padding
381
+ else:
382
+ self.padding = self._same_padding
383
+
384
+ self.conv_1 = nn.Conv1d(in_channels, filter_channels, kernel_size)
385
+ self.conv_2 = nn.Conv1d(filter_channels, out_channels, kernel_size)
386
+ self.drop = nn.Dropout(p_dropout)
387
+
388
+ def forward(self, x, x_mask):
389
+ x = self.conv_1(self.padding(x * x_mask))
390
+ if self.activation == "gelu":
391
+ x = x * torch.sigmoid(1.702 * x)
392
+ else:
393
+ x = torch.relu(x)
394
+ x = self.drop(x)
395
+ x = self.conv_2(self.padding(x * x_mask))
396
+ return x * x_mask
397
+
398
+ def _causal_padding(self, x):
399
+ if self.kernel_size == 1:
400
+ return x
401
+ pad_l = self.kernel_size - 1
402
+ pad_r = 0
403
+ padding = [[0, 0], [0, 0], [pad_l, pad_r]]
404
+ x = F.pad(x, commons.convert_pad_shape(padding))
405
+ return x
406
+
407
+ def _same_padding(self, x):
408
+ if self.kernel_size == 1:
409
+ return x
410
+ pad_l = (self.kernel_size - 1) // 2
411
+ pad_r = self.kernel_size // 2
412
+ padding = [[0, 0], [0, 0], [pad_l, pad_r]]
413
+ x = F.pad(x, commons.convert_pad_shape(padding))
414
+ return x
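For orientation, a small shape-check sketch of the relative-attention `Encoder` above (not part of the upload). It assumes the `rvc-inferpy` package is installed so that `packs.modules` (not shown in this section) resolves.

```python
# Illustrative shape check for the Encoder defined above.
import torch

from rvc_inferpy.infer_list.packs.attentions import Encoder
from rvc_inferpy.infer_list.packs.commons import sequence_mask

enc = Encoder(
    hidden_channels=192, filter_channels=768, n_heads=2,
    n_layers=6, kernel_size=3, p_dropout=0.1,
)
x = torch.randn(2, 192, 100)                  # [batch, hidden_channels, frames]
lengths = torch.tensor([100, 80])             # valid frames per item
x_mask = sequence_mask(lengths, x.size(2)).unsqueeze(1).to(x.dtype)  # [batch, 1, frames]
print(enc(x, x_mask).shape)                   # torch.Size([2, 192, 100])
```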
rvc_inferpy/infer_list/packs/commons.py ADDED
@@ -0,0 +1,164 @@
1
+ import math
2
+ import torch
3
+ from torch.nn import functional as F
4
+
5
+
6
+ def init_weights(m, mean=0.0, std=0.01):
7
+ classname = m.__class__.__name__
8
+ if classname.find("Conv") != -1:
9
+ m.weight.data.normal_(mean, std)
10
+
11
+
12
+ def get_padding(kernel_size, dilation=1):
13
+ return int((kernel_size * dilation - dilation) / 2)
14
+
15
+
16
+ def convert_pad_shape(pad_shape):
17
+ l = pad_shape[::-1]
18
+ pad_shape = [item for sublist in l for item in sublist]
19
+ return pad_shape
20
+
21
+
22
+ def kl_divergence(m_p, logs_p, m_q, logs_q):
23
+ """KL(P||Q)"""
24
+ kl = (logs_q - logs_p) - 0.5
25
+ kl += (
26
+ 0.5 * (torch.exp(2.0 * logs_p) + ((m_p - m_q) ** 2)) * torch.exp(-2.0 * logs_q)
27
+ )
28
+ return kl
29
+
30
+
31
+ def rand_gumbel(shape):
32
+ """Sample from the Gumbel distribution, protect from overflows."""
33
+ uniform_samples = torch.rand(shape) * 0.99998 + 0.00001
34
+ return -torch.log(-torch.log(uniform_samples))
35
+
36
+
37
+ def rand_gumbel_like(x):
38
+ g = rand_gumbel(x.size()).to(dtype=x.dtype, device=x.device)
39
+ return g
40
+
41
+
42
+ def slice_segments(x, ids_str, segment_size=4):
43
+ ret = torch.zeros_like(x[:, :, :segment_size])
44
+ for i in range(x.size(0)):
45
+ idx_str = ids_str[i]
46
+ idx_end = idx_str + segment_size
47
+ ret[i] = x[i, :, idx_str:idx_end]
48
+ return ret
49
+
50
+
51
+ def slice_segments2(x, ids_str, segment_size=4):
52
+ ret = torch.zeros_like(x[:, :segment_size])
53
+ for i in range(x.size(0)):
54
+ idx_str = ids_str[i]
55
+ idx_end = idx_str + segment_size
56
+ ret[i] = x[i, idx_str:idx_end]
57
+ return ret
58
+
59
+
60
+ def rand_slice_segments(x, x_lengths=None, segment_size=4):
61
+ b, d, t = x.size()
62
+ if x_lengths is None:
63
+ x_lengths = t
64
+ ids_str_max = x_lengths - segment_size + 1
65
+ ids_str = (torch.rand([b]).to(device=x.device) * ids_str_max).to(dtype=torch.long)
66
+ ret = slice_segments(x, ids_str, segment_size)
67
+ return ret, ids_str
68
+
69
+
70
+ def get_timing_signal_1d(length, channels, min_timescale=1.0, max_timescale=1.0e4):
71
+ position = torch.arange(length, dtype=torch.float)
72
+ num_timescales = channels // 2
73
+ log_timescale_increment = math.log(float(max_timescale) / float(min_timescale)) / (
74
+ num_timescales - 1
75
+ )
76
+ inv_timescales = min_timescale * torch.exp(
77
+ torch.arange(num_timescales, dtype=torch.float) * -log_timescale_increment
78
+ )
79
+ scaled_time = position.unsqueeze(0) * inv_timescales.unsqueeze(1)
80
+ signal = torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], 0)
81
+ signal = F.pad(signal, [0, 0, 0, channels % 2])
82
+ signal = signal.view(1, channels, length)
83
+ return signal
84
+
85
+
86
+ def add_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4):
87
+ b, channels, length = x.size()
88
+ signal = get_timing_signal_1d(length, channels, min_timescale, max_timescale)
89
+ return x + signal.to(dtype=x.dtype, device=x.device)
90
+
91
+
92
+ def cat_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4, axis=1):
93
+ b, channels, length = x.size()
94
+ signal = get_timing_signal_1d(length, channels, min_timescale, max_timescale)
95
+ return torch.cat([x, signal.to(dtype=x.dtype, device=x.device)], axis)
96
+
97
+
98
+ def subsequent_mask(length):
99
+ mask = torch.tril(torch.ones(length, length)).unsqueeze(0).unsqueeze(0)
100
+ return mask
101
+
102
+
103
+ @torch.jit.script
104
+ def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels):
105
+ n_channels_int = n_channels[0]
106
+ in_act = input_a + input_b
107
+ t_act = torch.tanh(in_act[:, :n_channels_int, :])
108
+ s_act = torch.sigmoid(in_act[:, n_channels_int:, :])
109
+ acts = t_act * s_act
110
+ return acts
111
+
112
+
113
+ def convert_pad_shape(pad_shape):
114
+ l = pad_shape[::-1]
115
+ pad_shape = [item for sublist in l for item in sublist]
116
+ return pad_shape
117
+
118
+
119
+ def shift_1d(x):
120
+ x = F.pad(x, convert_pad_shape([[0, 0], [0, 0], [1, 0]]))[:, :, :-1]
121
+ return x
122
+
123
+
124
+ def sequence_mask(length, max_length=None):
125
+ if max_length is None:
126
+ max_length = length.max()
127
+ x = torch.arange(max_length, dtype=length.dtype, device=length.device)
128
+ return x.unsqueeze(0) < length.unsqueeze(1)
129
+
130
+
131
+ def generate_path(duration, mask):
132
+ """
133
+ duration: [b, 1, t_x]
134
+ mask: [b, 1, t_y, t_x]
135
+ """
136
+ device = duration.device
137
+
138
+ b, _, t_y, t_x = mask.shape
139
+ cum_duration = torch.cumsum(duration, -1)
140
+
141
+ cum_duration_flat = cum_duration.view(b * t_x)
142
+ path = sequence_mask(cum_duration_flat, t_y).to(mask.dtype)
143
+ path = path.view(b, t_x, t_y)
144
+ path = path - F.pad(path, convert_pad_shape([[0, 0], [1, 0], [0, 0]]))[:, :-1]
145
+ path = path.unsqueeze(1).transpose(2, 3) * mask
146
+ return path
147
+
148
+
149
+ def clip_grad_value_(parameters, clip_value, norm_type=2):
150
+ if isinstance(parameters, torch.Tensor):
151
+ parameters = [parameters]
152
+ parameters = list(filter(lambda p: p.grad is not None, parameters))
153
+ norm_type = float(norm_type)
154
+ if clip_value is not None:
155
+ clip_value = float(clip_value)
156
+
157
+ total_norm = 0
158
+ for p in parameters:
159
+ param_norm = p.grad.data.norm(norm_type)
160
+ total_norm += param_norm.item() ** norm_type
161
+ if clip_value is not None:
162
+ p.grad.data.clamp_(min=-clip_value, max=clip_value)
163
+ total_norm = total_norm ** (1.0 / norm_type)
164
+ return total_norm
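A quick illustration of two of the helpers above, `sequence_mask` and `convert_pad_shape` (illustrative only; not part of the upload).

```python
import torch
from torch.nn import functional as F

from rvc_inferpy.infer_list.packs.commons import convert_pad_shape, sequence_mask

# sequence_mask: boolean [batch, max_len] mask built from per-item lengths.
print(sequence_mask(torch.tensor([3, 1]), 4))
# tensor([[ True,  True,  True, False],
#         [ True, False, False, False]])

# convert_pad_shape: nested [[before, after], ...] spec (outermost dim first),
# flattened into the reversed flat list that F.pad expects (last dim first).
pad_spec = convert_pad_shape([[0, 0], [0, 0], [2, 1]])   # -> [2, 1, 0, 0, 0, 0]
x = torch.zeros(1, 4, 10)
print(F.pad(x, pad_spec).shape)                          # torch.Size([1, 4, 13])
```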
rvc_inferpy/infer_list/packs/models.py ADDED
@@ -0,0 +1,1182 @@
1
+ import math
2
+ import logging
3
+
4
+ logger = logging.getLogger(__name__)
5
+
6
+ import numpy as np
7
+ import torch
8
+ from torch import nn
9
+ from torch.nn import Conv1d, Conv2d, ConvTranspose1d
10
+ from torch.nn import functional as F
11
+ from torch.nn.utils import remove_weight_norm, spectral_norm, weight_norm
12
+
13
+ from rvc_inferpy.infer_list.packs import attentions, commons, modules
14
+ from rvc_inferpy.infer_list.packs.commons import get_padding, init_weights
15
+
16
+ has_xpu = bool(hasattr(torch, "xpu") and torch.xpu.is_available())
17
+
18
+
19
+ class TextEncoder256(nn.Module):
20
+ def __init__(
21
+ self,
22
+ out_channels,
23
+ hidden_channels,
24
+ filter_channels,
25
+ n_heads,
26
+ n_layers,
27
+ kernel_size,
28
+ p_dropout,
29
+ f0=True,
30
+ ):
31
+ super().__init__()
32
+ self.out_channels = out_channels
33
+ self.hidden_channels = hidden_channels
34
+ self.filter_channels = filter_channels
35
+ self.n_heads = n_heads
36
+ self.n_layers = n_layers
37
+ self.kernel_size = kernel_size
38
+ self.p_dropout = p_dropout
39
+ self.emb_phone = nn.Linear(256, hidden_channels)
40
+ self.lrelu = nn.LeakyReLU(0.1, inplace=True)
41
+ if f0:
42
+ self.emb_pitch = nn.Embedding(256, hidden_channels) # pitch 256
43
+ self.encoder = attentions.Encoder(
44
+ hidden_channels, filter_channels, n_heads, n_layers, kernel_size, p_dropout
45
+ )
46
+ self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
47
+
48
+ def forward(self, phone, pitch, lengths):
49
+ if pitch is None:
50
+ x = self.emb_phone(phone)
51
+ else:
52
+ x = self.emb_phone(phone) + self.emb_pitch(pitch)
53
+ x = x * math.sqrt(self.hidden_channels) # [b, t, h]
54
+ x = self.lrelu(x)
55
+ x = torch.transpose(x, 1, -1) # [b, h, t]
56
+ x_mask = torch.unsqueeze(commons.sequence_mask(lengths, x.size(2)), 1).to(
57
+ x.dtype
58
+ )
59
+ x = self.encoder(x * x_mask, x_mask)
60
+ stats = self.proj(x) * x_mask
61
+
62
+ m, logs = torch.split(stats, self.out_channels, dim=1)
63
+ return m, logs, x_mask
64
+
65
+
66
+ class TextEncoder768(nn.Module):
67
+ def __init__(
68
+ self,
69
+ out_channels,
70
+ hidden_channels,
71
+ filter_channels,
72
+ n_heads,
73
+ n_layers,
74
+ kernel_size,
75
+ p_dropout,
76
+ f0=True,
77
+ ):
78
+ super().__init__()
79
+ self.out_channels = out_channels
80
+ self.hidden_channels = hidden_channels
81
+ self.filter_channels = filter_channels
82
+ self.n_heads = n_heads
83
+ self.n_layers = n_layers
84
+ self.kernel_size = kernel_size
85
+ self.p_dropout = p_dropout
86
+ self.emb_phone = nn.Linear(768, hidden_channels)
87
+ self.lrelu = nn.LeakyReLU(0.1, inplace=True)
88
+ if f0:
89
+ self.emb_pitch = nn.Embedding(256, hidden_channels) # pitch 256
90
+ self.encoder = attentions.Encoder(
91
+ hidden_channels, filter_channels, n_heads, n_layers, kernel_size, p_dropout
92
+ )
93
+ self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
94
+
95
+ def forward(self, phone, pitch, lengths):
96
+ if pitch is None:
97
+ x = self.emb_phone(phone)
98
+ else:
99
+ x = self.emb_phone(phone) + self.emb_pitch(pitch)
100
+ x = x * math.sqrt(self.hidden_channels) # [b, t, h]
101
+ x = self.lrelu(x)
102
+ x = torch.transpose(x, 1, -1) # [b, h, t]
103
+ x_mask = torch.unsqueeze(commons.sequence_mask(lengths, x.size(2)), 1).to(
104
+ x.dtype
105
+ )
106
+ x = self.encoder(x * x_mask, x_mask)
107
+ stats = self.proj(x) * x_mask
108
+
109
+ m, logs = torch.split(stats, self.out_channels, dim=1)
110
+ return m, logs, x_mask
111
+
112
+
113
+ class ResidualCouplingBlock(nn.Module):
114
+ def __init__(
115
+ self,
116
+ channels,
117
+ hidden_channels,
118
+ kernel_size,
119
+ dilation_rate,
120
+ n_layers,
121
+ n_flows=4,
122
+ gin_channels=0,
123
+ ):
124
+ super().__init__()
125
+ self.channels = channels
126
+ self.hidden_channels = hidden_channels
127
+ self.kernel_size = kernel_size
128
+ self.dilation_rate = dilation_rate
129
+ self.n_layers = n_layers
130
+ self.n_flows = n_flows
131
+ self.gin_channels = gin_channels
132
+
133
+ self.flows = nn.ModuleList()
134
+ for i in range(n_flows):
135
+ self.flows.append(
136
+ modules.ResidualCouplingLayer(
137
+ channels,
138
+ hidden_channels,
139
+ kernel_size,
140
+ dilation_rate,
141
+ n_layers,
142
+ gin_channels=gin_channels,
143
+ mean_only=True,
144
+ )
145
+ )
146
+ self.flows.append(modules.Flip())
147
+
148
+ def forward(self, x, x_mask, g=None, reverse=False):
149
+ if not reverse:
150
+ for flow in self.flows:
151
+ x, _ = flow(x, x_mask, g=g, reverse=reverse)
152
+ else:
153
+ for flow in reversed(self.flows):
154
+ x = flow(x, x_mask, g=g, reverse=reverse)
155
+ return x
156
+
157
+ def remove_weight_norm(self):
158
+ for i in range(self.n_flows):
159
+ self.flows[i * 2].remove_weight_norm()
160
+
161
+
162
+ class PosteriorEncoder(nn.Module):
163
+ def __init__(
164
+ self,
165
+ in_channels,
166
+ out_channels,
167
+ hidden_channels,
168
+ kernel_size,
169
+ dilation_rate,
170
+ n_layers,
171
+ gin_channels=0,
172
+ ):
173
+ super().__init__()
174
+ self.in_channels = in_channels
175
+ self.out_channels = out_channels
176
+ self.hidden_channels = hidden_channels
177
+ self.kernel_size = kernel_size
178
+ self.dilation_rate = dilation_rate
179
+ self.n_layers = n_layers
180
+ self.gin_channels = gin_channels
181
+
182
+ self.pre = nn.Conv1d(in_channels, hidden_channels, 1)
183
+ self.enc = modules.WN(
184
+ hidden_channels,
185
+ kernel_size,
186
+ dilation_rate,
187
+ n_layers,
188
+ gin_channels=gin_channels,
189
+ )
190
+ self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
191
+
192
+ def forward(self, x, x_lengths, g=None):
193
+ x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to(
194
+ x.dtype
195
+ )
196
+ x = self.pre(x) * x_mask
197
+ x = self.enc(x, x_mask, g=g)
198
+ stats = self.proj(x) * x_mask
199
+ m, logs = torch.split(stats, self.out_channels, dim=1)
200
+ z = (m + torch.randn_like(m) * torch.exp(logs)) * x_mask
201
+ return z, m, logs, x_mask
202
+
203
+ def remove_weight_norm(self):
204
+ self.enc.remove_weight_norm()
205
+
206
+
207
+ class Generator(torch.nn.Module):
208
+ def __init__(
209
+ self,
210
+ initial_channel,
211
+ resblock,
212
+ resblock_kernel_sizes,
213
+ resblock_dilation_sizes,
214
+ upsample_rates,
215
+ upsample_initial_channel,
216
+ upsample_kernel_sizes,
217
+ gin_channels=0,
218
+ ):
219
+ super(Generator, self).__init__()
220
+ self.num_kernels = len(resblock_kernel_sizes)
221
+ self.num_upsamples = len(upsample_rates)
222
+ self.conv_pre = Conv1d(
223
+ initial_channel, upsample_initial_channel, 7, 1, padding=3
224
+ )
225
+ resblock = modules.ResBlock1 if resblock == "1" else modules.ResBlock2
226
+
227
+ self.ups = nn.ModuleList()
228
+ for i, (u, k) in enumerate(zip(upsample_rates, upsample_kernel_sizes)):
229
+ self.ups.append(
230
+ weight_norm(
231
+ ConvTranspose1d(
232
+ upsample_initial_channel // (2**i),
233
+ upsample_initial_channel // (2 ** (i + 1)),
234
+ k,
235
+ u,
236
+ padding=(k - u) // 2,
237
+ )
238
+ )
239
+ )
240
+
241
+ self.resblocks = nn.ModuleList()
242
+ for i in range(len(self.ups)):
243
+ ch = upsample_initial_channel // (2 ** (i + 1))
244
+ for j, (k, d) in enumerate(
245
+ zip(resblock_kernel_sizes, resblock_dilation_sizes)
246
+ ):
247
+ self.resblocks.append(resblock(ch, k, d))
248
+
249
+ self.conv_post = Conv1d(ch, 1, 7, 1, padding=3, bias=False)
250
+ self.ups.apply(init_weights)
251
+
252
+ if gin_channels != 0:
253
+ self.cond = nn.Conv1d(gin_channels, upsample_initial_channel, 1)
254
+
255
+ def forward(self, x, g=None):
256
+ x = self.conv_pre(x)
257
+ if g is not None:
258
+ x = x + self.cond(g)
259
+
260
+ for i in range(self.num_upsamples):
261
+ x = F.leaky_relu(x, modules.LRELU_SLOPE)
262
+ x = self.ups[i](x)
263
+ xs = None
264
+ for j in range(self.num_kernels):
265
+ if xs is None:
266
+ xs = self.resblocks[i * self.num_kernels + j](x)
267
+ else:
268
+ xs += self.resblocks[i * self.num_kernels + j](x)
269
+ x = xs / self.num_kernels
270
+ x = F.leaky_relu(x)
271
+ x = self.conv_post(x)
272
+ x = torch.tanh(x)
273
+
274
+ return x
275
+
276
+ def remove_weight_norm(self):
277
+ for l in self.ups:
278
+ remove_weight_norm(l)
279
+ for l in self.resblocks:
280
+ l.remove_weight_norm()
281
+
282
+
283
+ class SineGen(torch.nn.Module):
284
+ """Definition of sine generator
285
+ SineGen(samp_rate, harmonic_num = 0,
286
+ sine_amp = 0.1, noise_std = 0.003,
287
+ voiced_threshold = 0,
288
+ flag_for_pulse=False)
289
+ samp_rate: sampling rate in Hz
290
+ harmonic_num: number of harmonic overtones (default 0)
291
+ sine_amp: amplitude of sine waveform (default 0.1)
292
+ noise_std: std of Gaussian noise (default 0.003)
293
+ voiced_threshold: F0 threshold for U/V classification (default 0)
294
+ flag_for_pulse: this SineGen is used inside PulseGen (default False)
295
+ Note: when flag_for_pulse is True, the first time step of a voiced
296
+ segment is always sin(np.pi) or cos(0)
297
+ """
298
+
299
+ def __init__(
300
+ self,
301
+ samp_rate,
302
+ harmonic_num=0,
303
+ sine_amp=0.1,
304
+ noise_std=0.003,
305
+ voiced_threshold=0,
306
+ flag_for_pulse=False,
307
+ ):
308
+ super(SineGen, self).__init__()
309
+ self.sine_amp = sine_amp
310
+ self.noise_std = noise_std
311
+ self.harmonic_num = harmonic_num
312
+ self.dim = self.harmonic_num + 1
313
+ self.sampling_rate = samp_rate
314
+ self.voiced_threshold = voiced_threshold
315
+
316
+ def _f02uv(self, f0):
317
+ # generate uv signal
318
+ uv = torch.ones_like(f0)
319
+ uv = uv * (f0 > self.voiced_threshold)
320
+ if uv.device.type == "privateuseone": # for DirectML
321
+ uv = uv.float()
322
+ return uv
323
+
324
+ def forward(self, f0, upp):
325
+ """sine_tensor, uv = forward(f0)
326
+ input F0: tensor(batchsize=1, length, dim=1)
327
+ f0 for unvoiced steps should be 0
328
+ output sine_tensor: tensor(batchsize=1, length, dim)
329
+ output uv: tensor(batchsize=1, length, 1)
330
+ """
331
+ with torch.no_grad():
332
+ f0 = f0[:, None].transpose(1, 2)
333
+ f0_buf = torch.zeros(f0.shape[0], f0.shape[1], self.dim, device=f0.device)
334
+ # fundamental component
335
+ f0_buf[:, :, 0] = f0[:, :, 0]
336
+ for idx in np.arange(self.harmonic_num):
337
+ f0_buf[:, :, idx + 1] = f0_buf[:, :, 0] * (
338
+ idx + 2
339
+ ) # idx + 2: the (idx+1)-th overtone, (idx+2)-th harmonic
340
+ rad_values = (
341
+ f0_buf / self.sampling_rate
342
+ ) % 1 # the % 1 means the n_har multiplications cannot be optimized in post-processing
343
+ rand_ini = torch.rand(
344
+ f0_buf.shape[0], f0_buf.shape[2], device=f0_buf.device
345
+ )
346
+ rand_ini[:, 0] = 0
347
+ rad_values[:, 0, :] = rad_values[:, 0, :] + rand_ini
348
+ tmp_over_one = torch.cumsum(
349
+ rad_values, 1
350
+ ) # a % 1 here would mean the following cumsum could no longer be optimized
351
+ tmp_over_one *= upp
352
+ tmp_over_one = F.interpolate(
353
+ tmp_over_one.transpose(2, 1),
354
+ scale_factor=upp,
355
+ mode="linear",
356
+ align_corners=True,
357
+ ).transpose(2, 1)
358
+ rad_values = F.interpolate(
359
+ rad_values.transpose(2, 1), scale_factor=upp, mode="nearest"
360
+ ).transpose(
361
+ 2, 1
362
+ ) #######
363
+ tmp_over_one %= 1
364
+ tmp_over_one_idx = (tmp_over_one[:, 1:, :] - tmp_over_one[:, :-1, :]) < 0
365
+ cumsum_shift = torch.zeros_like(rad_values)
366
+ cumsum_shift[:, 1:, :] = tmp_over_one_idx * -1.0
367
+ sine_waves = torch.sin(
368
+ torch.cumsum(rad_values + cumsum_shift, dim=1) * 2 * np.pi
369
+ )
370
+ sine_waves = sine_waves * self.sine_amp
371
+ uv = self._f02uv(f0)
372
+ uv = F.interpolate(
373
+ uv.transpose(2, 1), scale_factor=upp, mode="nearest"
374
+ ).transpose(2, 1)
375
+ noise_amp = uv * self.noise_std + (1 - uv) * self.sine_amp / 3
376
+ noise = noise_amp * torch.randn_like(sine_waves)
377
+ sine_waves = sine_waves * uv + noise
378
+ return sine_waves, uv, noise
379
+
380
+
381
+ class SourceModuleHnNSF(torch.nn.Module):
382
+ """SourceModule for hn-nsf
383
+ SourceModule(sampling_rate, harmonic_num=0, sine_amp=0.1,
384
+ add_noise_std=0.003, voiced_threshod=0)
385
+ sampling_rate: sampling_rate in Hz
386
+ harmonic_num: number of harmonic above F0 (default: 0)
387
+ sine_amp: amplitude of sine source signal (default: 0.1)
388
+ add_noise_std: std of additive Gaussian noise (default: 0.003)
389
+ note that the amplitude of noise in unvoiced frames is decided
390
+ by sine_amp
391
+ voiced_threshold: threshold to set U/V given F0 (default: 0)
392
+ Sine_source, noise_source = SourceModuleHnNSF(F0_sampled)
393
+ F0_sampled (batchsize, length, 1)
394
+ Sine_source (batchsize, length, 1)
395
+ noise_source (batchsize, length, 1)
396
+ uv (batchsize, length, 1)
397
+ """
398
+
399
+ def __init__(
400
+ self,
401
+ sampling_rate,
402
+ harmonic_num=0,
403
+ sine_amp=0.1,
404
+ add_noise_std=0.003,
405
+ voiced_threshod=0,
406
+ is_half=True,
407
+ ):
408
+ super(SourceModuleHnNSF, self).__init__()
409
+
410
+ self.sine_amp = sine_amp
411
+ self.noise_std = add_noise_std
412
+ self.is_half = is_half
413
+ # to produce sine waveforms
414
+ self.l_sin_gen = SineGen(
415
+ sampling_rate, harmonic_num, sine_amp, add_noise_std, voiced_threshod
416
+ )
417
+
418
+ # to merge source harmonics into a single excitation
419
+ self.l_linear = torch.nn.Linear(harmonic_num + 1, 1)
420
+ self.l_tanh = torch.nn.Tanh()
421
+
422
+ def forward(self, x, upp=None):
423
+ if not hasattr(self, "ddtype"):
424
+ self.ddtype = self.l_linear.weight.dtype
425
+ sine_wavs, uv, _ = self.l_sin_gen(x, upp)
426
+ # print(x.dtype,sine_wavs.dtype,self.l_linear.weight.dtype)
427
+ # if self.is_half:
428
+ # sine_wavs = sine_wavs.half()
429
+ # sine_merge = self.l_tanh(self.l_linear(sine_wavs.to(x)))
430
+ # print(sine_wavs.dtype,self.ddtype)
431
+ if sine_wavs.dtype != self.ddtype:
432
+ sine_wavs = sine_wavs.to(self.ddtype)
433
+ sine_merge = self.l_tanh(self.l_linear(sine_wavs))
434
+ return sine_merge, None, None # noise, uv
435
+
436
+
437
+ class GeneratorNSF(torch.nn.Module):
438
+ def __init__(
439
+ self,
440
+ initial_channel,
441
+ resblock,
442
+ resblock_kernel_sizes,
443
+ resblock_dilation_sizes,
444
+ upsample_rates,
445
+ upsample_initial_channel,
446
+ upsample_kernel_sizes,
447
+ gin_channels,
448
+ sr,
449
+ is_half=False,
450
+ ):
451
+ super(GeneratorNSF, self).__init__()
452
+ self.num_kernels = len(resblock_kernel_sizes)
453
+ self.num_upsamples = len(upsample_rates)
454
+
455
+ self.f0_upsamp = torch.nn.Upsample(scale_factor=np.prod(upsample_rates))
456
+ self.m_source = SourceModuleHnNSF(
457
+ sampling_rate=sr, harmonic_num=0, is_half=is_half
458
+ )
459
+ self.noise_convs = nn.ModuleList()
460
+ self.conv_pre = Conv1d(
461
+ initial_channel, upsample_initial_channel, 7, 1, padding=3
462
+ )
463
+ resblock = modules.ResBlock1 if resblock == "1" else modules.ResBlock2
464
+
465
+ self.ups = nn.ModuleList()
466
+ for i, (u, k) in enumerate(zip(upsample_rates, upsample_kernel_sizes)):
467
+ c_cur = upsample_initial_channel // (2 ** (i + 1))
468
+ self.ups.append(
469
+ weight_norm(
470
+ ConvTranspose1d(
471
+ upsample_initial_channel // (2**i),
472
+ upsample_initial_channel // (2 ** (i + 1)),
473
+ k,
474
+ u,
475
+ padding=(k - u) // 2,
476
+ )
477
+ )
478
+ )
479
+ if i + 1 < len(upsample_rates):
480
+ stride_f0 = np.prod(upsample_rates[i + 1 :])
481
+ self.noise_convs.append(
482
+ Conv1d(
483
+ 1,
484
+ c_cur,
485
+ kernel_size=stride_f0 * 2,
486
+ stride=stride_f0,
487
+ padding=stride_f0 // 2,
488
+ )
489
+ )
490
+ else:
491
+ self.noise_convs.append(Conv1d(1, c_cur, kernel_size=1))
492
+
493
+ self.resblocks = nn.ModuleList()
494
+ for i in range(len(self.ups)):
495
+ ch = upsample_initial_channel // (2 ** (i + 1))
496
+ for j, (k, d) in enumerate(
497
+ zip(resblock_kernel_sizes, resblock_dilation_sizes)
498
+ ):
499
+ self.resblocks.append(resblock(ch, k, d))
500
+
501
+ self.conv_post = Conv1d(ch, 1, 7, 1, padding=3, bias=False)
502
+ self.ups.apply(init_weights)
503
+
504
+ if gin_channels != 0:
505
+ self.cond = nn.Conv1d(gin_channels, upsample_initial_channel, 1)
506
+
507
+ self.upp = np.prod(upsample_rates)
508
+
509
+ def forward(self, x, f0, g=None):
510
+ har_source, noi_source, uv = self.m_source(f0, self.upp)
511
+ har_source = har_source.transpose(1, 2)
512
+ x = self.conv_pre(x)
513
+ if g is not None:
514
+ x = x + self.cond(g)
515
+
516
+ for i in range(self.num_upsamples):
517
+ x = F.leaky_relu(x, modules.LRELU_SLOPE)
518
+ x = self.ups[i](x)
519
+ x_source = self.noise_convs[i](har_source)
520
+ x = x + x_source
521
+ xs = None
522
+ for j in range(self.num_kernels):
523
+ if xs is None:
524
+ xs = self.resblocks[i * self.num_kernels + j](x)
525
+ else:
526
+ xs += self.resblocks[i * self.num_kernels + j](x)
527
+ x = xs / self.num_kernels
528
+ x = F.leaky_relu(x)
529
+ x = self.conv_post(x)
530
+ x = torch.tanh(x)
531
+ return x
532
+
533
+ def remove_weight_norm(self):
534
+ for l in self.ups:
535
+ remove_weight_norm(l)
536
+ for l in self.resblocks:
537
+ l.remove_weight_norm()
538
+
539
+
540
+ sr2sr = {
541
+ "32k": 32000,
542
+ "40k": 40000,
543
+ "48k": 48000,
544
+ }
545
+
546
+
547
+ class SynthesizerTrnMs256NSFsid(nn.Module):
548
+ def __init__(
549
+ self,
550
+ spec_channels,
551
+ segment_size,
552
+ inter_channels,
553
+ hidden_channels,
554
+ filter_channels,
555
+ n_heads,
556
+ n_layers,
557
+ kernel_size,
558
+ p_dropout,
559
+ resblock,
560
+ resblock_kernel_sizes,
561
+ resblock_dilation_sizes,
562
+ upsample_rates,
563
+ upsample_initial_channel,
564
+ upsample_kernel_sizes,
565
+ spk_embed_dim,
566
+ gin_channels,
567
+ sr,
568
+ **kwargs
569
+ ):
570
+ super().__init__()
571
+ if isinstance(sr, str):
572
+ sr = sr2sr[sr]
573
+ self.spec_channels = spec_channels
574
+ self.inter_channels = inter_channels
575
+ self.hidden_channels = hidden_channels
576
+ self.filter_channels = filter_channels
577
+ self.n_heads = n_heads
578
+ self.n_layers = n_layers
579
+ self.kernel_size = kernel_size
580
+ self.p_dropout = p_dropout
581
+ self.resblock = resblock
582
+ self.resblock_kernel_sizes = resblock_kernel_sizes
583
+ self.resblock_dilation_sizes = resblock_dilation_sizes
584
+ self.upsample_rates = upsample_rates
585
+ self.upsample_initial_channel = upsample_initial_channel
586
+ self.upsample_kernel_sizes = upsample_kernel_sizes
587
+ self.segment_size = segment_size
588
+ self.gin_channels = gin_channels
589
+ # self.hop_length = hop_length#
590
+ self.spk_embed_dim = spk_embed_dim
591
+ self.enc_p = TextEncoder256(
592
+ inter_channels,
593
+ hidden_channels,
594
+ filter_channels,
595
+ n_heads,
596
+ n_layers,
597
+ kernel_size,
598
+ p_dropout,
599
+ )
600
+ self.dec = GeneratorNSF(
601
+ inter_channels,
602
+ resblock,
603
+ resblock_kernel_sizes,
604
+ resblock_dilation_sizes,
605
+ upsample_rates,
606
+ upsample_initial_channel,
607
+ upsample_kernel_sizes,
608
+ gin_channels=gin_channels,
609
+ sr=sr,
610
+ is_half=kwargs["is_half"],
611
+ )
612
+ self.enc_q = PosteriorEncoder(
613
+ spec_channels,
614
+ inter_channels,
615
+ hidden_channels,
616
+ 5,
617
+ 1,
618
+ 16,
619
+ gin_channels=gin_channels,
620
+ )
621
+ self.flow = ResidualCouplingBlock(
622
+ inter_channels, hidden_channels, 5, 1, 3, gin_channels=gin_channels
623
+ )
624
+ self.emb_g = nn.Embedding(self.spk_embed_dim, gin_channels)
625
+ logger.debug(
626
+ "gin_channels: "
627
+ + str(gin_channels)
628
+ + ", self.spk_embed_dim: "
629
+ + str(self.spk_embed_dim)
630
+ )
631
+
632
+ def remove_weight_norm(self):
633
+ self.dec.remove_weight_norm()
634
+ self.flow.remove_weight_norm()
635
+ self.enc_q.remove_weight_norm()
636
+
637
+ def forward(
638
+ self, phone, phone_lengths, pitch, pitchf, y, y_lengths, ds
639
+ ): # ds is the speaker id, shape [bs, 1]
640
+ # print(1,pitch.shape)#[bs,t]
641
+ g = self.emb_g(ds).unsqueeze(-1) # [b, 256, 1]; the trailing 1 is the time axis, broadcast over t
642
+ m_p, logs_p, x_mask = self.enc_p(phone, pitch, phone_lengths)
643
+ z, m_q, logs_q, y_mask = self.enc_q(y, y_lengths, g=g)
644
+ z_p = self.flow(z, y_mask, g=g)
645
+ z_slice, ids_slice = commons.rand_slice_segments(
646
+ z, y_lengths, self.segment_size
647
+ )
648
+ # print(-1,pitchf.shape,ids_slice,self.segment_size,self.hop_length,self.segment_size//self.hop_length)
649
+ pitchf = commons.slice_segments2(pitchf, ids_slice, self.segment_size)
650
+ # print(-2,pitchf.shape,z_slice.shape)
651
+ o = self.dec(z_slice, pitchf, g=g)
652
+ return o, ids_slice, x_mask, y_mask, (z, z_p, m_p, logs_p, m_q, logs_q)
653
+
654
+ def infer(self, phone, phone_lengths, pitch, nsff0, sid, rate=None):
655
+ g = self.emb_g(sid).unsqueeze(-1)
656
+ m_p, logs_p, x_mask = self.enc_p(phone, pitch, phone_lengths)
657
+ z_p = (m_p + torch.exp(logs_p) * torch.randn_like(m_p) * 0.66666) * x_mask
658
+ if rate:
659
+ head = int(z_p.shape[2] * rate)
660
+ z_p = z_p[:, :, -head:]
661
+ x_mask = x_mask[:, :, -head:]
662
+ nsff0 = nsff0[:, -head:]
663
+ z = self.flow(z_p, x_mask, g=g, reverse=True)
664
+ o = self.dec(z * x_mask, nsff0, g=g)
665
+ return o, x_mask, (z, z_p, m_p, logs_p)
666
+
667
+
668
+ class SynthesizerTrnMs768NSFsid(nn.Module):
669
+ def __init__(
670
+ self,
671
+ spec_channels,
672
+ segment_size,
673
+ inter_channels,
674
+ hidden_channels,
675
+ filter_channels,
676
+ n_heads,
677
+ n_layers,
678
+ kernel_size,
679
+ p_dropout,
680
+ resblock,
681
+ resblock_kernel_sizes,
682
+ resblock_dilation_sizes,
683
+ upsample_rates,
684
+ upsample_initial_channel,
685
+ upsample_kernel_sizes,
686
+ spk_embed_dim,
687
+ gin_channels,
688
+ sr,
689
+ **kwargs
690
+ ):
691
+ super().__init__()
692
+ if isinstance(sr, str):
693
+ sr = sr2sr[sr]
694
+ self.spec_channels = spec_channels
695
+ self.inter_channels = inter_channels
696
+ self.hidden_channels = hidden_channels
697
+ self.filter_channels = filter_channels
698
+ self.n_heads = n_heads
699
+ self.n_layers = n_layers
700
+ self.kernel_size = kernel_size
701
+ self.p_dropout = p_dropout
702
+ self.resblock = resblock
703
+ self.resblock_kernel_sizes = resblock_kernel_sizes
704
+ self.resblock_dilation_sizes = resblock_dilation_sizes
705
+ self.upsample_rates = upsample_rates
706
+ self.upsample_initial_channel = upsample_initial_channel
707
+ self.upsample_kernel_sizes = upsample_kernel_sizes
708
+ self.segment_size = segment_size
709
+ self.gin_channels = gin_channels
710
+ # self.hop_length = hop_length#
711
+ self.spk_embed_dim = spk_embed_dim
712
+ self.enc_p = TextEncoder768(
713
+ inter_channels,
714
+ hidden_channels,
715
+ filter_channels,
716
+ n_heads,
717
+ n_layers,
718
+ kernel_size,
719
+ p_dropout,
720
+ )
721
+ self.dec = GeneratorNSF(
722
+ inter_channels,
723
+ resblock,
724
+ resblock_kernel_sizes,
725
+ resblock_dilation_sizes,
726
+ upsample_rates,
727
+ upsample_initial_channel,
728
+ upsample_kernel_sizes,
729
+ gin_channels=gin_channels,
730
+ sr=sr,
731
+ is_half=kwargs["is_half"],
732
+ )
733
+ self.enc_q = PosteriorEncoder(
734
+ spec_channels,
735
+ inter_channels,
736
+ hidden_channels,
737
+ 5,
738
+ 1,
739
+ 16,
740
+ gin_channels=gin_channels,
741
+ )
742
+ self.flow = ResidualCouplingBlock(
743
+ inter_channels, hidden_channels, 5, 1, 3, gin_channels=gin_channels
744
+ )
745
+ self.emb_g = nn.Embedding(self.spk_embed_dim, gin_channels)
746
+ logger.debug(
747
+ "gin_channels: "
748
+ + str(gin_channels)
749
+ + ", self.spk_embed_dim: "
750
+ + str(self.spk_embed_dim)
751
+ )
752
+
753
+ def remove_weight_norm(self):
754
+ self.dec.remove_weight_norm()
755
+ self.flow.remove_weight_norm()
756
+ self.enc_q.remove_weight_norm()
757
+
758
+ def forward(
759
+ self, phone, phone_lengths, pitch, pitchf, y, y_lengths, ds
760
+ ): # ds is the speaker id, shape [bs, 1]
761
+ # print(1,pitch.shape)#[bs,t]
762
+ g = self.emb_g(ds).unsqueeze(-1) # [b, 256, 1]; the trailing 1 is the time axis, broadcast over t
763
+ m_p, logs_p, x_mask = self.enc_p(phone, pitch, phone_lengths)
764
+ z, m_q, logs_q, y_mask = self.enc_q(y, y_lengths, g=g)
765
+ z_p = self.flow(z, y_mask, g=g)
766
+ z_slice, ids_slice = commons.rand_slice_segments(
767
+ z, y_lengths, self.segment_size
768
+ )
769
+ # print(-1,pitchf.shape,ids_slice,self.segment_size,self.hop_length,self.segment_size//self.hop_length)
770
+ pitchf = commons.slice_segments2(pitchf, ids_slice, self.segment_size)
771
+ # print(-2,pitchf.shape,z_slice.shape)
772
+ o = self.dec(z_slice, pitchf, g=g)
773
+ return o, ids_slice, x_mask, y_mask, (z, z_p, m_p, logs_p, m_q, logs_q)
774
+
775
+ def infer(self, phone, phone_lengths, pitch, nsff0, sid, rate=None):
776
+ g = self.emb_g(sid).unsqueeze(-1)
777
+ m_p, logs_p, x_mask = self.enc_p(phone, pitch, phone_lengths)
778
+ z_p = (m_p + torch.exp(logs_p) * torch.randn_like(m_p) * 0.66666) * x_mask
779
+ if rate:
780
+ head = int(z_p.shape[2] * rate)
781
+ z_p = z_p[:, :, -head:]
782
+ x_mask = x_mask[:, :, -head:]
783
+ nsff0 = nsff0[:, -head:]
784
+ z = self.flow(z_p, x_mask, g=g, reverse=True)
785
+ o = self.dec(z * x_mask, nsff0, g=g)
786
+ return o, x_mask, (z, z_p, m_p, logs_p)
787
+
788
+
789
+ class SynthesizerTrnMs256NSFsid_nono(nn.Module):
790
+ def __init__(
791
+ self,
792
+ spec_channels,
793
+ segment_size,
794
+ inter_channels,
795
+ hidden_channels,
796
+ filter_channels,
797
+ n_heads,
798
+ n_layers,
799
+ kernel_size,
800
+ p_dropout,
801
+ resblock,
802
+ resblock_kernel_sizes,
803
+ resblock_dilation_sizes,
804
+ upsample_rates,
805
+ upsample_initial_channel,
806
+ upsample_kernel_sizes,
807
+ spk_embed_dim,
808
+ gin_channels,
809
+ sr=None,
810
+ **kwargs
811
+ ):
812
+ super().__init__()
813
+ self.spec_channels = spec_channels
814
+ self.inter_channels = inter_channels
815
+ self.hidden_channels = hidden_channels
816
+ self.filter_channels = filter_channels
817
+ self.n_heads = n_heads
818
+ self.n_layers = n_layers
819
+ self.kernel_size = kernel_size
820
+ self.p_dropout = p_dropout
821
+ self.resblock = resblock
822
+ self.resblock_kernel_sizes = resblock_kernel_sizes
823
+ self.resblock_dilation_sizes = resblock_dilation_sizes
824
+ self.upsample_rates = upsample_rates
825
+ self.upsample_initial_channel = upsample_initial_channel
826
+ self.upsample_kernel_sizes = upsample_kernel_sizes
827
+ self.segment_size = segment_size
828
+ self.gin_channels = gin_channels
829
+ # self.hop_length = hop_length#
830
+ self.spk_embed_dim = spk_embed_dim
831
+ self.enc_p = TextEncoder256(
832
+ inter_channels,
833
+ hidden_channels,
834
+ filter_channels,
835
+ n_heads,
836
+ n_layers,
837
+ kernel_size,
838
+ p_dropout,
839
+ f0=False,
840
+ )
841
+ self.dec = Generator(
842
+ inter_channels,
843
+ resblock,
844
+ resblock_kernel_sizes,
845
+ resblock_dilation_sizes,
846
+ upsample_rates,
847
+ upsample_initial_channel,
848
+ upsample_kernel_sizes,
849
+ gin_channels=gin_channels,
850
+ )
851
+ self.enc_q = PosteriorEncoder(
852
+ spec_channels,
853
+ inter_channels,
854
+ hidden_channels,
855
+ 5,
856
+ 1,
857
+ 16,
858
+ gin_channels=gin_channels,
859
+ )
860
+ self.flow = ResidualCouplingBlock(
861
+ inter_channels, hidden_channels, 5, 1, 3, gin_channels=gin_channels
862
+ )
863
+ self.emb_g = nn.Embedding(self.spk_embed_dim, gin_channels)
864
+ logger.debug(
865
+ "gin_channels: "
866
+ + str(gin_channels)
867
+ + ", self.spk_embed_dim: "
868
+ + str(self.spk_embed_dim)
869
+ )
870
+
871
+ def remove_weight_norm(self):
872
+ self.dec.remove_weight_norm()
873
+ self.flow.remove_weight_norm()
874
+ self.enc_q.remove_weight_norm()
875
+
876
+ def forward(self, phone, phone_lengths, y, y_lengths, ds): # ds is the speaker id, shape [bs, 1]
877
+ g = self.emb_g(ds).unsqueeze(-1) # [b, 256, 1]; the trailing 1 is the time axis, broadcast over t
878
+ m_p, logs_p, x_mask = self.enc_p(phone, None, phone_lengths)
879
+ z, m_q, logs_q, y_mask = self.enc_q(y, y_lengths, g=g)
880
+ z_p = self.flow(z, y_mask, g=g)
881
+ z_slice, ids_slice = commons.rand_slice_segments(
882
+ z, y_lengths, self.segment_size
883
+ )
884
+ o = self.dec(z_slice, g=g)
885
+ return o, ids_slice, x_mask, y_mask, (z, z_p, m_p, logs_p, m_q, logs_q)
886
+
887
+ def infer(self, phone, phone_lengths, sid, rate=None):
888
+ g = self.emb_g(sid).unsqueeze(-1)
889
+ m_p, logs_p, x_mask = self.enc_p(phone, None, phone_lengths)
890
+ z_p = (m_p + torch.exp(logs_p) * torch.randn_like(m_p) * 0.66666) * x_mask
891
+ if rate:
892
+ head = int(z_p.shape[2] * rate)
893
+ z_p = z_p[:, :, -head:]
894
+ x_mask = x_mask[:, :, -head:]
895
+ z = self.flow(z_p, x_mask, g=g, reverse=True)
896
+ o = self.dec(z * x_mask, g=g)
897
+ return o, x_mask, (z, z_p, m_p, logs_p)
898
+
899
+
900
+ class SynthesizerTrnMs768NSFsid_nono(nn.Module):
901
+ def __init__(
902
+ self,
903
+ spec_channels,
904
+ segment_size,
905
+ inter_channels,
906
+ hidden_channels,
907
+ filter_channels,
908
+ n_heads,
909
+ n_layers,
910
+ kernel_size,
911
+ p_dropout,
912
+ resblock,
913
+ resblock_kernel_sizes,
914
+ resblock_dilation_sizes,
915
+ upsample_rates,
916
+ upsample_initial_channel,
917
+ upsample_kernel_sizes,
918
+ spk_embed_dim,
919
+ gin_channels,
920
+ sr=None,
921
+ **kwargs
922
+ ):
923
+ super().__init__()
924
+ self.spec_channels = spec_channels
925
+ self.inter_channels = inter_channels
926
+ self.hidden_channels = hidden_channels
927
+ self.filter_channels = filter_channels
928
+ self.n_heads = n_heads
929
+ self.n_layers = n_layers
930
+ self.kernel_size = kernel_size
931
+ self.p_dropout = p_dropout
932
+ self.resblock = resblock
933
+ self.resblock_kernel_sizes = resblock_kernel_sizes
934
+ self.resblock_dilation_sizes = resblock_dilation_sizes
935
+ self.upsample_rates = upsample_rates
936
+ self.upsample_initial_channel = upsample_initial_channel
937
+ self.upsample_kernel_sizes = upsample_kernel_sizes
938
+ self.segment_size = segment_size
939
+ self.gin_channels = gin_channels
940
+ # self.hop_length = hop_length#
941
+ self.spk_embed_dim = spk_embed_dim
942
+ self.enc_p = TextEncoder768(
943
+ inter_channels,
944
+ hidden_channels,
945
+ filter_channels,
946
+ n_heads,
947
+ n_layers,
948
+ kernel_size,
949
+ p_dropout,
950
+ f0=False,
951
+ )
952
+ self.dec = Generator(
953
+ inter_channels,
954
+ resblock,
955
+ resblock_kernel_sizes,
956
+ resblock_dilation_sizes,
957
+ upsample_rates,
958
+ upsample_initial_channel,
959
+ upsample_kernel_sizes,
960
+ gin_channels=gin_channels,
961
+ )
962
+ self.enc_q = PosteriorEncoder(
963
+ spec_channels,
964
+ inter_channels,
965
+ hidden_channels,
966
+ 5,
967
+ 1,
968
+ 16,
969
+ gin_channels=gin_channels,
970
+ )
971
+ self.flow = ResidualCouplingBlock(
972
+ inter_channels, hidden_channels, 5, 1, 3, gin_channels=gin_channels
973
+ )
974
+ self.emb_g = nn.Embedding(self.spk_embed_dim, gin_channels)
975
+ logger.debug(
976
+ "gin_channels: "
977
+ + str(gin_channels)
978
+ + ", self.spk_embed_dim: "
979
+ + str(self.spk_embed_dim)
980
+ )
981
+
982
+ def remove_weight_norm(self):
983
+ self.dec.remove_weight_norm()
984
+ self.flow.remove_weight_norm()
985
+ self.enc_q.remove_weight_norm()
986
+
987
+ def forward(self, phone, phone_lengths, y, y_lengths, ds):  # ds is the speaker id, shape [bs, 1]
988
+ g = self.emb_g(ds).unsqueeze(-1)  # [b, 256, 1]; the trailing 1 is the time axis (broadcast over t)
989
+ m_p, logs_p, x_mask = self.enc_p(phone, None, phone_lengths)
990
+ z, m_q, logs_q, y_mask = self.enc_q(y, y_lengths, g=g)
991
+ z_p = self.flow(z, y_mask, g=g)
992
+ z_slice, ids_slice = commons.rand_slice_segments(
993
+ z, y_lengths, self.segment_size
994
+ )
995
+ o = self.dec(z_slice, g=g)
996
+ return o, ids_slice, x_mask, y_mask, (z, z_p, m_p, logs_p, m_q, logs_q)
997
+
998
+ def infer(self, phone, phone_lengths, sid, rate=None):
999
+ g = self.emb_g(sid).unsqueeze(-1)
1000
+ m_p, logs_p, x_mask = self.enc_p(phone, None, phone_lengths)
1001
+ z_p = (m_p + torch.exp(logs_p) * torch.randn_like(m_p) * 0.66666) * x_mask
1002
+ if rate:
1003
+ head = int(z_p.shape[2] * rate)
1004
+ z_p = z_p[:, :, -head:]
1005
+ x_mask = x_mask[:, :, -head:]
1006
+ z = self.flow(z_p, x_mask, g=g, reverse=True)
1007
+ o = self.dec(z * x_mask, g=g)
1008
+ return o, x_mask, (z, z_p, m_p, logs_p)
1009
+
1010
+
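Note the `rate` argument of `infer()`: when it is set, only the trailing fraction of the latent frames is passed through the flow and the decoder, which is how partial/streaming inference avoids re-synthesizing audio that was already produced. A tiny sketch of just that slicing step (the tensor sizes below are hypothetical):

```python
import torch

# Illustration of the `rate` slicing used in infer(); shapes are made up.
z_p = torch.randn(1, 192, 100)   # (batch, inter_channels, frames)
x_mask = torch.ones(1, 1, 100)
rate = 0.5
head = int(z_p.shape[2] * rate)
z_p, x_mask = z_p[:, :, -head:], x_mask[:, :, -head:]
print(z_p.shape)  # torch.Size([1, 192, 50]) -- only the last half is decoded
```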
1011
+ class MultiPeriodDiscriminator(torch.nn.Module):
1012
+ def __init__(self, use_spectral_norm=False):
1013
+ super(MultiPeriodDiscriminator, self).__init__()
1014
+ periods = [2, 3, 5, 7, 11, 17]
1015
+ # periods = [3, 5, 7, 11, 17, 23, 37]
1016
+
1017
+ discs = [DiscriminatorS(use_spectral_norm=use_spectral_norm)]
1018
+ discs = discs + [
1019
+ DiscriminatorP(i, use_spectral_norm=use_spectral_norm) for i in periods
1020
+ ]
1021
+ self.discriminators = nn.ModuleList(discs)
1022
+
1023
+ def forward(self, y, y_hat):
1024
+ y_d_rs = [] #
1025
+ y_d_gs = []
1026
+ fmap_rs = []
1027
+ fmap_gs = []
1028
+ for i, d in enumerate(self.discriminators):
1029
+ y_d_r, fmap_r = d(y)
1030
+ y_d_g, fmap_g = d(y_hat)
1031
+ # for j in range(len(fmap_r)):
1032
+ # print(i,j,y.shape,y_hat.shape,fmap_r[j].shape,fmap_g[j].shape)
1033
+ y_d_rs.append(y_d_r)
1034
+ y_d_gs.append(y_d_g)
1035
+ fmap_rs.append(fmap_r)
1036
+ fmap_gs.append(fmap_g)
1037
+
1038
+ return y_d_rs, y_d_gs, fmap_rs, fmap_gs
1039
+
1040
+
1041
+ class MultiPeriodDiscriminatorV2(torch.nn.Module):
1042
+ def __init__(self, use_spectral_norm=False):
1043
+ super(MultiPeriodDiscriminatorV2, self).__init__()
1044
+ # periods = [2, 3, 5, 7, 11, 17]
1045
+ periods = [2, 3, 5, 7, 11, 17, 23, 37]
1046
+
1047
+ discs = [DiscriminatorS(use_spectral_norm=use_spectral_norm)]
1048
+ discs = discs + [
1049
+ DiscriminatorP(i, use_spectral_norm=use_spectral_norm) for i in periods
1050
+ ]
1051
+ self.discriminators = nn.ModuleList(discs)
1052
+
1053
+ def forward(self, y, y_hat):
1054
+ y_d_rs = [] #
1055
+ y_d_gs = []
1056
+ fmap_rs = []
1057
+ fmap_gs = []
1058
+ for i, d in enumerate(self.discriminators):
1059
+ y_d_r, fmap_r = d(y)
1060
+ y_d_g, fmap_g = d(y_hat)
1061
+ # for j in range(len(fmap_r)):
1062
+ # print(i,j,y.shape,y_hat.shape,fmap_r[j].shape,fmap_g[j].shape)
1063
+ y_d_rs.append(y_d_r)
1064
+ y_d_gs.append(y_d_g)
1065
+ fmap_rs.append(fmap_r)
1066
+ fmap_gs.append(fmap_g)
1067
+
1068
+ return y_d_rs, y_d_gs, fmap_rs, fmap_gs
1069
+
1070
+
1071
+ class DiscriminatorS(torch.nn.Module):
1072
+ def __init__(self, use_spectral_norm=False):
1073
+ super(DiscriminatorS, self).__init__()
1074
+ norm_f = weight_norm if use_spectral_norm == False else spectral_norm
1075
+ self.convs = nn.ModuleList(
1076
+ [
1077
+ norm_f(Conv1d(1, 16, 15, 1, padding=7)),
1078
+ norm_f(Conv1d(16, 64, 41, 4, groups=4, padding=20)),
1079
+ norm_f(Conv1d(64, 256, 41, 4, groups=16, padding=20)),
1080
+ norm_f(Conv1d(256, 1024, 41, 4, groups=64, padding=20)),
1081
+ norm_f(Conv1d(1024, 1024, 41, 4, groups=256, padding=20)),
1082
+ norm_f(Conv1d(1024, 1024, 5, 1, padding=2)),
1083
+ ]
1084
+ )
1085
+ self.conv_post = norm_f(Conv1d(1024, 1, 3, 1, padding=1))
1086
+
1087
+ def forward(self, x):
1088
+ fmap = []
1089
+
1090
+ for l in self.convs:
1091
+ x = l(x)
1092
+ x = F.leaky_relu(x, modules.LRELU_SLOPE)
1093
+ fmap.append(x)
1094
+ x = self.conv_post(x)
1095
+ fmap.append(x)
1096
+ x = torch.flatten(x, 1, -1)
1097
+
1098
+ return x, fmap
1099
+
1100
+
1101
+ class DiscriminatorP(torch.nn.Module):
1102
+ def __init__(self, period, kernel_size=5, stride=3, use_spectral_norm=False):
1103
+ super(DiscriminatorP, self).__init__()
1104
+ self.period = period
1105
+ self.use_spectral_norm = use_spectral_norm
1106
+ norm_f = weight_norm if use_spectral_norm == False else spectral_norm
1107
+ self.convs = nn.ModuleList(
1108
+ [
1109
+ norm_f(
1110
+ Conv2d(
1111
+ 1,
1112
+ 32,
1113
+ (kernel_size, 1),
1114
+ (stride, 1),
1115
+ padding=(get_padding(kernel_size, 1), 0),
1116
+ )
1117
+ ),
1118
+ norm_f(
1119
+ Conv2d(
1120
+ 32,
1121
+ 128,
1122
+ (kernel_size, 1),
1123
+ (stride, 1),
1124
+ padding=(get_padding(kernel_size, 1), 0),
1125
+ )
1126
+ ),
1127
+ norm_f(
1128
+ Conv2d(
1129
+ 128,
1130
+ 512,
1131
+ (kernel_size, 1),
1132
+ (stride, 1),
1133
+ padding=(get_padding(kernel_size, 1), 0),
1134
+ )
1135
+ ),
1136
+ norm_f(
1137
+ Conv2d(
1138
+ 512,
1139
+ 1024,
1140
+ (kernel_size, 1),
1141
+ (stride, 1),
1142
+ padding=(get_padding(kernel_size, 1), 0),
1143
+ )
1144
+ ),
1145
+ norm_f(
1146
+ Conv2d(
1147
+ 1024,
1148
+ 1024,
1149
+ (kernel_size, 1),
1150
+ 1,
1151
+ padding=(get_padding(kernel_size, 1), 0),
1152
+ )
1153
+ ),
1154
+ ]
1155
+ )
1156
+ self.conv_post = norm_f(Conv2d(1024, 1, (3, 1), 1, padding=(1, 0)))
1157
+
1158
+ def forward(self, x):
1159
+ fmap = []
1160
+
1161
+ # 1d to 2d
1162
+ b, c, t = x.shape
1163
+ if t % self.period != 0: # pad first
1164
+ n_pad = self.period - (t % self.period)
1165
+ if has_xpu and x.dtype == torch.bfloat16:
1166
+ x = F.pad(x.to(dtype=torch.float16), (0, n_pad), "reflect").to(
1167
+ dtype=torch.bfloat16
1168
+ )
1169
+ else:
1170
+ x = F.pad(x, (0, n_pad), "reflect")
1171
+ t = t + n_pad
1172
+ x = x.view(b, c, t // self.period, self.period)
1173
+
1174
+ for l in self.convs:
1175
+ x = l(x)
1176
+ x = F.leaky_relu(x, modules.LRELU_SLOPE)
1177
+ fmap.append(x)
1178
+ x = self.conv_post(x)
1179
+ fmap.append(x)
1180
+ x = torch.flatten(x, 1, -1)
1181
+
1182
+ return x, fmap
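The discriminators above close out `models.py`. As a quick shape check, here is a minimal sketch of how `MultiPeriodDiscriminatorV2` is exercised during training; the batch size and sample count are arbitrary, only the import path is taken from this repository:

```python
import torch

from rvc_inferpy.infer_list.packs.models import MultiPeriodDiscriminatorV2

disc = MultiPeriodDiscriminatorV2(use_spectral_norm=False)
real = torch.randn(2, 1, 8192)  # (batch, 1, samples) waveforms
fake = torch.randn(2, 1, 8192)

y_d_rs, y_d_gs, fmap_rs, fmap_gs = disc(real, fake)
# One score tensor and one feature-map list per sub-discriminator:
# DiscriminatorS plus one DiscriminatorP for each period 2, 3, 5, 7, 11, 17, 23, 37.
print(len(y_d_rs), len(fmap_rs))  # 9 9
```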
rvc_inferpy/infer_list/packs/modules.py ADDED
@@ -0,0 +1,519 @@
1
+ import math
2
+ import torch
3
+ from torch import nn
4
+ from torch.nn import Conv1d
5
+ from torch.nn import functional as F
6
+ from torch.nn.utils import remove_weight_norm, weight_norm
7
+
8
+ from rvc_inferpy.infer_list.packs import commons
9
+ from rvc_inferpy.infer_list.packs.commons import get_padding, init_weights
10
+ from rvc_inferpy.infer_list.packs.transforms import (
11
+ piecewise_rational_quadratic_transform,
12
+ )
13
+
14
+ LRELU_SLOPE = 0.1
15
+
16
+
17
+ class LayerNorm(nn.Module):
18
+ def __init__(self, channels, eps=1e-5):
19
+ super().__init__()
20
+ self.channels = channels
21
+ self.eps = eps
22
+
23
+ self.gamma = nn.Parameter(torch.ones(channels))
24
+ self.beta = nn.Parameter(torch.zeros(channels))
25
+
26
+ def forward(self, x):
27
+ x = x.transpose(1, -1)
28
+ x = F.layer_norm(x, (self.channels,), self.gamma, self.beta, self.eps)
29
+ return x.transpose(1, -1)
30
+
31
+
32
+ class ConvReluNorm(nn.Module):
33
+ def __init__(
34
+ self,
35
+ in_channels,
36
+ hidden_channels,
37
+ out_channels,
38
+ kernel_size,
39
+ n_layers,
40
+ p_dropout,
41
+ ):
42
+ super().__init__()
43
+ self.in_channels = in_channels
44
+ self.hidden_channels = hidden_channels
45
+ self.out_channels = out_channels
46
+ self.kernel_size = kernel_size
47
+ self.n_layers = n_layers
48
+ self.p_dropout = p_dropout
49
+ assert n_layers > 1, "Number of layers should be larger than 1."
50
+
51
+ self.conv_layers = nn.ModuleList()
52
+ self.norm_layers = nn.ModuleList()
53
+ self.conv_layers.append(
54
+ nn.Conv1d(
55
+ in_channels, hidden_channels, kernel_size, padding=kernel_size // 2
56
+ )
57
+ )
58
+ self.norm_layers.append(LayerNorm(hidden_channels))
59
+ self.relu_drop = nn.Sequential(nn.ReLU(), nn.Dropout(p_dropout))
60
+ for _ in range(n_layers - 1):
61
+ self.conv_layers.append(
62
+ nn.Conv1d(
63
+ hidden_channels,
64
+ hidden_channels,
65
+ kernel_size,
66
+ padding=kernel_size // 2,
67
+ )
68
+ )
69
+ self.norm_layers.append(LayerNorm(hidden_channels))
70
+ self.proj = nn.Conv1d(hidden_channels, out_channels, 1)
71
+ self.proj.weight.data.zero_()
72
+ self.proj.bias.data.zero_()
73
+
74
+ def forward(self, x, x_mask):
75
+ x_org = x
76
+ for i in range(self.n_layers):
77
+ x = self.conv_layers[i](x * x_mask)
78
+ x = self.norm_layers[i](x)
79
+ x = self.relu_drop(x)
80
+ x = x_org + self.proj(x)
81
+ return x * x_mask
82
+
83
+
84
+ class DDSConv(nn.Module):
85
+ """
86
+ Dilated and Depth-Separable Convolution
87
+ """
88
+
89
+ def __init__(self, channels, kernel_size, n_layers, p_dropout=0.0):
90
+ super().__init__()
91
+ self.channels = channels
92
+ self.kernel_size = kernel_size
93
+ self.n_layers = n_layers
94
+ self.p_dropout = p_dropout
95
+
96
+ self.drop = nn.Dropout(p_dropout)
97
+ self.convs_sep = nn.ModuleList()
98
+ self.convs_1x1 = nn.ModuleList()
99
+ self.norms_1 = nn.ModuleList()
100
+ self.norms_2 = nn.ModuleList()
101
+ for i in range(n_layers):
102
+ dilation = kernel_size**i
103
+ padding = (kernel_size * dilation - dilation) // 2
104
+ self.convs_sep.append(
105
+ nn.Conv1d(
106
+ channels,
107
+ channels,
108
+ kernel_size,
109
+ groups=channels,
110
+ dilation=dilation,
111
+ padding=padding,
112
+ )
113
+ )
114
+ self.convs_1x1.append(nn.Conv1d(channels, channels, 1))
115
+ self.norms_1.append(LayerNorm(channels))
116
+ self.norms_2.append(LayerNorm(channels))
117
+
118
+ def forward(self, x, x_mask, g=None):
119
+ if g is not None:
120
+ x = x + g
121
+ for i in range(self.n_layers):
122
+ y = self.convs_sep[i](x * x_mask)
123
+ y = self.norms_1[i](y)
124
+ y = F.gelu(y)
125
+ y = self.convs_1x1[i](y)
126
+ y = self.norms_2[i](y)
127
+ y = F.gelu(y)
128
+ y = self.drop(y)
129
+ x = x + y
130
+ return x * x_mask
131
+
132
+
133
+ class WN(torch.nn.Module):
134
+ def __init__(
135
+ self,
136
+ hidden_channels,
137
+ kernel_size,
138
+ dilation_rate,
139
+ n_layers,
140
+ gin_channels=0,
141
+ p_dropout=0,
142
+ ):
143
+ super(WN, self).__init__()
144
+ assert kernel_size % 2 == 1
145
+ self.hidden_channels = hidden_channels
146
+ self.kernel_size = (kernel_size,)
147
+ self.dilation_rate = dilation_rate
148
+ self.n_layers = n_layers
149
+ self.gin_channels = gin_channels
150
+ self.p_dropout = p_dropout
151
+
152
+ self.in_layers = torch.nn.ModuleList()
153
+ self.res_skip_layers = torch.nn.ModuleList()
154
+ self.drop = nn.Dropout(p_dropout)
155
+
156
+ if gin_channels != 0:
157
+ cond_layer = torch.nn.Conv1d(
158
+ gin_channels, 2 * hidden_channels * n_layers, 1
159
+ )
160
+ self.cond_layer = torch.nn.utils.weight_norm(cond_layer, name="weight")
161
+
162
+ for i in range(n_layers):
163
+ dilation = dilation_rate**i
164
+ padding = int((kernel_size * dilation - dilation) / 2)
165
+ in_layer = torch.nn.Conv1d(
166
+ hidden_channels,
167
+ 2 * hidden_channels,
168
+ kernel_size,
169
+ dilation=dilation,
170
+ padding=padding,
171
+ )
172
+ in_layer = torch.nn.utils.weight_norm(in_layer, name="weight")
173
+ self.in_layers.append(in_layer)
174
+
175
+ # last one is not necessary
176
+ if i < n_layers - 1:
177
+ res_skip_channels = 2 * hidden_channels
178
+ else:
179
+ res_skip_channels = hidden_channels
180
+
181
+ res_skip_layer = torch.nn.Conv1d(hidden_channels, res_skip_channels, 1)
182
+ res_skip_layer = torch.nn.utils.weight_norm(res_skip_layer, name="weight")
183
+ self.res_skip_layers.append(res_skip_layer)
184
+
185
+ def forward(self, x, x_mask, g=None, **kwargs):
186
+ output = torch.zeros_like(x)
187
+ n_channels_tensor = torch.IntTensor([self.hidden_channels])
188
+
189
+ if g is not None:
190
+ g = self.cond_layer(g)
191
+
192
+ for i in range(self.n_layers):
193
+ x_in = self.in_layers[i](x)
194
+ if g is not None:
195
+ cond_offset = i * 2 * self.hidden_channels
196
+ g_l = g[:, cond_offset : cond_offset + 2 * self.hidden_channels, :]
197
+ else:
198
+ g_l = torch.zeros_like(x_in)
199
+
200
+ acts = commons.fused_add_tanh_sigmoid_multiply(x_in, g_l, n_channels_tensor)
201
+ acts = self.drop(acts)
202
+
203
+ res_skip_acts = self.res_skip_layers[i](acts)
204
+ if i < self.n_layers - 1:
205
+ res_acts = res_skip_acts[:, : self.hidden_channels, :]
206
+ x = (x + res_acts) * x_mask
207
+ output = output + res_skip_acts[:, self.hidden_channels :, :]
208
+ else:
209
+ output = output + res_skip_acts
210
+ return output * x_mask
211
+
212
+ def remove_weight_norm(self):
213
+ if self.gin_channels != 0:
214
+ torch.nn.utils.remove_weight_norm(self.cond_layer)
215
+ for l in self.in_layers:
216
+ torch.nn.utils.remove_weight_norm(l)
217
+ for l in self.res_skip_layers:
218
+ torch.nn.utils.remove_weight_norm(l)
219
+
220
+
221
+ class ResBlock1(torch.nn.Module):
222
+ def __init__(self, channels, kernel_size=3, dilation=(1, 3, 5)):
223
+ super(ResBlock1, self).__init__()
224
+ self.convs1 = nn.ModuleList(
225
+ [
226
+ weight_norm(
227
+ Conv1d(
228
+ channels,
229
+ channels,
230
+ kernel_size,
231
+ 1,
232
+ dilation=dilation[0],
233
+ padding=get_padding(kernel_size, dilation[0]),
234
+ )
235
+ ),
236
+ weight_norm(
237
+ Conv1d(
238
+ channels,
239
+ channels,
240
+ kernel_size,
241
+ 1,
242
+ dilation=dilation[1],
243
+ padding=get_padding(kernel_size, dilation[1]),
244
+ )
245
+ ),
246
+ weight_norm(
247
+ Conv1d(
248
+ channels,
249
+ channels,
250
+ kernel_size,
251
+ 1,
252
+ dilation=dilation[2],
253
+ padding=get_padding(kernel_size, dilation[2]),
254
+ )
255
+ ),
256
+ ]
257
+ )
258
+ self.convs1.apply(init_weights)
259
+
260
+ self.convs2 = nn.ModuleList(
261
+ [
262
+ weight_norm(
263
+ Conv1d(
264
+ channels,
265
+ channels,
266
+ kernel_size,
267
+ 1,
268
+ dilation=1,
269
+ padding=get_padding(kernel_size, 1),
270
+ )
271
+ ),
272
+ weight_norm(
273
+ Conv1d(
274
+ channels,
275
+ channels,
276
+ kernel_size,
277
+ 1,
278
+ dilation=1,
279
+ padding=get_padding(kernel_size, 1),
280
+ )
281
+ ),
282
+ weight_norm(
283
+ Conv1d(
284
+ channels,
285
+ channels,
286
+ kernel_size,
287
+ 1,
288
+ dilation=1,
289
+ padding=get_padding(kernel_size, 1),
290
+ )
291
+ ),
292
+ ]
293
+ )
294
+ self.convs2.apply(init_weights)
295
+
296
+ def forward(self, x, x_mask=None):
297
+ for c1, c2 in zip(self.convs1, self.convs2):
298
+ xt = F.leaky_relu(x, LRELU_SLOPE)
299
+ if x_mask is not None:
300
+ xt = xt * x_mask
301
+ xt = c1(xt)
302
+ xt = F.leaky_relu(xt, LRELU_SLOPE)
303
+ if x_mask is not None:
304
+ xt = xt * x_mask
305
+ xt = c2(xt)
306
+ x = xt + x
307
+ if x_mask is not None:
308
+ x = x * x_mask
309
+ return x
310
+
311
+ def remove_weight_norm(self):
312
+ for l in self.convs1:
313
+ remove_weight_norm(l)
314
+ for l in self.convs2:
315
+ remove_weight_norm(l)
316
+
317
+
318
+ class ResBlock2(torch.nn.Module):
319
+ def __init__(self, channels, kernel_size=3, dilation=(1, 3)):
320
+ super(ResBlock2, self).__init__()
321
+ self.convs = nn.ModuleList(
322
+ [
323
+ weight_norm(
324
+ Conv1d(
325
+ channels,
326
+ channels,
327
+ kernel_size,
328
+ 1,
329
+ dilation=dilation[0],
330
+ padding=get_padding(kernel_size, dilation[0]),
331
+ )
332
+ ),
333
+ weight_norm(
334
+ Conv1d(
335
+ channels,
336
+ channels,
337
+ kernel_size,
338
+ 1,
339
+ dilation=dilation[1],
340
+ padding=get_padding(kernel_size, dilation[1]),
341
+ )
342
+ ),
343
+ ]
344
+ )
345
+ self.convs.apply(init_weights)
346
+
347
+ def forward(self, x, x_mask=None):
348
+ for c in self.convs:
349
+ xt = F.leaky_relu(x, LRELU_SLOPE)
350
+ if x_mask is not None:
351
+ xt = xt * x_mask
352
+ xt = c(xt)
353
+ x = xt + x
354
+ if x_mask is not None:
355
+ x = x * x_mask
356
+ return x
357
+
358
+ def remove_weight_norm(self):
359
+ for l in self.convs:
360
+ remove_weight_norm(l)
361
+
362
+
363
+ class Log(nn.Module):
364
+ def forward(self, x, x_mask, reverse=False, **kwargs):
365
+ if not reverse:
366
+ y = torch.log(torch.clamp_min(x, 1e-5)) * x_mask
367
+ logdet = torch.sum(-y, [1, 2])
368
+ return y, logdet
369
+ else:
370
+ x = torch.exp(x) * x_mask
371
+ return x
372
+
373
+
374
+ class Flip(nn.Module):
375
+ def forward(self, x, *args, reverse=False, **kwargs):
376
+ x = torch.flip(x, [1])
377
+ if not reverse:
378
+ logdet = torch.zeros(x.size(0)).to(dtype=x.dtype, device=x.device)
379
+ return x, logdet
380
+ else:
381
+ return x
382
+
383
+
384
+ class ElementwiseAffine(nn.Module):
385
+ def __init__(self, channels):
386
+ super().__init__()
387
+ self.channels = channels
388
+ self.m = nn.Parameter(torch.zeros(channels, 1))
389
+ self.logs = nn.Parameter(torch.zeros(channels, 1))
390
+
391
+ def forward(self, x, x_mask, reverse=False, **kwargs):
392
+ if not reverse:
393
+ y = self.m + torch.exp(self.logs) * x
394
+ y = y * x_mask
395
+ logdet = torch.sum(self.logs * x_mask, [1, 2])
396
+ return y, logdet
397
+ else:
398
+ x = (x - self.m) * torch.exp(-self.logs) * x_mask
399
+ return x
400
+
401
+
402
+ class ResidualCouplingLayer(nn.Module):
403
+ def __init__(
404
+ self,
405
+ channels,
406
+ hidden_channels,
407
+ kernel_size,
408
+ dilation_rate,
409
+ n_layers,
410
+ p_dropout=0,
411
+ gin_channels=0,
412
+ mean_only=False,
413
+ ):
414
+ assert channels % 2 == 0, "channels should be divisible by 2"
415
+ super().__init__()
416
+ self.channels = channels
417
+ self.hidden_channels = hidden_channels
418
+ self.kernel_size = kernel_size
419
+ self.dilation_rate = dilation_rate
420
+ self.n_layers = n_layers
421
+ self.half_channels = channels // 2
422
+ self.mean_only = mean_only
423
+
424
+ self.pre = nn.Conv1d(self.half_channels, hidden_channels, 1)
425
+ self.enc = WN(
426
+ hidden_channels,
427
+ kernel_size,
428
+ dilation_rate,
429
+ n_layers,
430
+ p_dropout=p_dropout,
431
+ gin_channels=gin_channels,
432
+ )
433
+ self.post = nn.Conv1d(hidden_channels, self.half_channels * (2 - mean_only), 1)
434
+ self.post.weight.data.zero_()
435
+ self.post.bias.data.zero_()
436
+
437
+ def forward(self, x, x_mask, g=None, reverse=False):
438
+ x0, x1 = torch.split(x, [self.half_channels] * 2, 1)
439
+ h = self.pre(x0) * x_mask
440
+ h = self.enc(h, x_mask, g=g)
441
+ stats = self.post(h) * x_mask
442
+ if not self.mean_only:
443
+ m, logs = torch.split(stats, [self.half_channels] * 2, 1)
444
+ else:
445
+ m = stats
446
+ logs = torch.zeros_like(m)
447
+
448
+ if not reverse:
449
+ x1 = m + x1 * torch.exp(logs) * x_mask
450
+ x = torch.cat([x0, x1], 1)
451
+ logdet = torch.sum(logs, [1, 2])
452
+ return x, logdet
453
+ else:
454
+ x1 = (x1 - m) * torch.exp(-logs) * x_mask
455
+ x = torch.cat([x0, x1], 1)
456
+ return x
457
+
458
+ def remove_weight_norm(self):
459
+ self.enc.remove_weight_norm()
460
+
461
+
462
+ class ConvFlow(nn.Module):
463
+ def __init__(
464
+ self,
465
+ in_channels,
466
+ filter_channels,
467
+ kernel_size,
468
+ n_layers,
469
+ num_bins=10,
470
+ tail_bound=5.0,
471
+ ):
472
+ super().__init__()
473
+ self.in_channels = in_channels
474
+ self.filter_channels = filter_channels
475
+ self.kernel_size = kernel_size
476
+ self.n_layers = n_layers
477
+ self.num_bins = num_bins
478
+ self.tail_bound = tail_bound
479
+ self.half_channels = in_channels // 2
480
+
481
+ self.pre = nn.Conv1d(self.half_channels, filter_channels, 1)
482
+ self.convs = DDSConv(filter_channels, kernel_size, n_layers, p_dropout=0.0)
483
+ self.proj = nn.Conv1d(
484
+ filter_channels, self.half_channels * (num_bins * 3 - 1), 1
485
+ )
486
+ self.proj.weight.data.zero_()
487
+ self.proj.bias.data.zero_()
488
+
489
+ def forward(self, x, x_mask, g=None, reverse=False):
490
+ x0, x1 = torch.split(x, [self.half_channels] * 2, 1)
491
+ h = self.pre(x0)
492
+ h = self.convs(h, x_mask, g=g)
493
+ h = self.proj(h) * x_mask
494
+
495
+ b, c, t = x0.shape
496
+ h = h.reshape(b, c, -1, t).permute(0, 1, 3, 2) # [b, cx?, t] -> [b, c, t, ?]
497
+
498
+ unnormalized_widths = h[..., : self.num_bins] / math.sqrt(self.filter_channels)
499
+ unnormalized_heights = h[..., self.num_bins : 2 * self.num_bins] / math.sqrt(
500
+ self.filter_channels
501
+ )
502
+ unnormalized_derivatives = h[..., 2 * self.num_bins :]
503
+
504
+ x1, logabsdet = piecewise_rational_quadratic_transform(
505
+ x1,
506
+ unnormalized_widths,
507
+ unnormalized_heights,
508
+ unnormalized_derivatives,
509
+ inverse=reverse,
510
+ tails="linear",
511
+ tail_bound=self.tail_bound,
512
+ )
513
+
514
+ x = torch.cat([x0, x1], 1) * x_mask
515
+ logdet = torch.sum(logabsdet * x_mask, [1, 2])
516
+ if not reverse:
517
+ return x, logdet
518
+ else:
519
+ return x
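`modules.py` only defines building blocks, so the most useful sanity check is the property the flow layers are built around: a coupling layer can be run forward and then inverted exactly. A minimal sketch (the channel counts and frame length here are hypothetical, chosen only to exercise the layer):

```python
import torch

from rvc_inferpy.infer_list.packs.modules import ResidualCouplingLayer

layer = ResidualCouplingLayer(
    channels=4, hidden_channels=8, kernel_size=5,
    dilation_rate=1, n_layers=2, mean_only=True,
)
x = torch.randn(1, 4, 10)       # (batch, channels, frames)
x_mask = torch.ones(1, 1, 10)   # all frames valid

y, logdet = layer(x, x_mask)                # forward pass of one flow step
x_rec = layer(y, x_mask, reverse=True)      # inverse pass
print(torch.allclose(x, x_rec, atol=1e-5))  # True: the coupling layer is invertible
```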
rvc_inferpy/infer_list/packs/transforms.py ADDED
@@ -0,0 +1,207 @@
1
+ import numpy as np
2
+ import torch
3
+ from torch.nn import functional as F
4
+
5
+ DEFAULT_MIN_BIN_WIDTH = 1e-3
6
+ DEFAULT_MIN_BIN_HEIGHT = 1e-3
7
+ DEFAULT_MIN_DERIVATIVE = 1e-3
8
+
9
+
10
+ def piecewise_rational_quadratic_transform(
11
+ inputs,
12
+ unnormalized_widths,
13
+ unnormalized_heights,
14
+ unnormalized_derivatives,
15
+ inverse=False,
16
+ tails=None,
17
+ tail_bound=1.0,
18
+ min_bin_width=DEFAULT_MIN_BIN_WIDTH,
19
+ min_bin_height=DEFAULT_MIN_BIN_HEIGHT,
20
+ min_derivative=DEFAULT_MIN_DERIVATIVE,
21
+ ):
22
+ if tails is None:
23
+ spline_fn = rational_quadratic_spline
24
+ spline_kwargs = {}
25
+ else:
26
+ spline_fn = unconstrained_rational_quadratic_spline
27
+ spline_kwargs = {"tails": tails, "tail_bound": tail_bound}
28
+
29
+ outputs, logabsdet = spline_fn(
30
+ inputs=inputs,
31
+ unnormalized_widths=unnormalized_widths,
32
+ unnormalized_heights=unnormalized_heights,
33
+ unnormalized_derivatives=unnormalized_derivatives,
34
+ inverse=inverse,
35
+ min_bin_width=min_bin_width,
36
+ min_bin_height=min_bin_height,
37
+ min_derivative=min_derivative,
38
+ **spline_kwargs
39
+ )
40
+ return outputs, logabsdet
41
+
42
+
43
+ def searchsorted(bin_locations, inputs, eps=1e-6):
44
+ bin_locations[..., -1] += eps
45
+ return torch.sum(inputs[..., None] >= bin_locations, dim=-1) - 1
46
+
47
+
48
+ def unconstrained_rational_quadratic_spline(
49
+ inputs,
50
+ unnormalized_widths,
51
+ unnormalized_heights,
52
+ unnormalized_derivatives,
53
+ inverse=False,
54
+ tails="linear",
55
+ tail_bound=1.0,
56
+ min_bin_width=DEFAULT_MIN_BIN_WIDTH,
57
+ min_bin_height=DEFAULT_MIN_BIN_HEIGHT,
58
+ min_derivative=DEFAULT_MIN_DERIVATIVE,
59
+ ):
60
+ inside_interval_mask = (inputs >= -tail_bound) & (inputs <= tail_bound)
61
+ outside_interval_mask = ~inside_interval_mask
62
+
63
+ outputs = torch.zeros_like(inputs)
64
+ logabsdet = torch.zeros_like(inputs)
65
+
66
+ if tails == "linear":
67
+ unnormalized_derivatives = F.pad(unnormalized_derivatives, pad=(1, 1))
68
+ constant = np.log(np.exp(1 - min_derivative) - 1)
69
+ unnormalized_derivatives[..., 0] = constant
70
+ unnormalized_derivatives[..., -1] = constant
71
+
72
+ outputs[outside_interval_mask] = inputs[outside_interval_mask]
73
+ logabsdet[outside_interval_mask] = 0
74
+ else:
75
+ raise RuntimeError("{} tails are not implemented.".format(tails))
76
+
77
+ (
78
+ outputs[inside_interval_mask],
79
+ logabsdet[inside_interval_mask],
80
+ ) = rational_quadratic_spline(
81
+ inputs=inputs[inside_interval_mask],
82
+ unnormalized_widths=unnormalized_widths[inside_interval_mask, :],
83
+ unnormalized_heights=unnormalized_heights[inside_interval_mask, :],
84
+ unnormalized_derivatives=unnormalized_derivatives[inside_interval_mask, :],
85
+ inverse=inverse,
86
+ left=-tail_bound,
87
+ right=tail_bound,
88
+ bottom=-tail_bound,
89
+ top=tail_bound,
90
+ min_bin_width=min_bin_width,
91
+ min_bin_height=min_bin_height,
92
+ min_derivative=min_derivative,
93
+ )
94
+
95
+ return outputs, logabsdet
96
+
97
+
98
+ def rational_quadratic_spline(
99
+ inputs,
100
+ unnormalized_widths,
101
+ unnormalized_heights,
102
+ unnormalized_derivatives,
103
+ inverse=False,
104
+ left=0.0,
105
+ right=1.0,
106
+ bottom=0.0,
107
+ top=1.0,
108
+ min_bin_width=DEFAULT_MIN_BIN_WIDTH,
109
+ min_bin_height=DEFAULT_MIN_BIN_HEIGHT,
110
+ min_derivative=DEFAULT_MIN_DERIVATIVE,
111
+ ):
112
+ if torch.min(inputs) < left or torch.max(inputs) > right:
113
+ raise ValueError("Input to a transform is not within its domain")
114
+
115
+ num_bins = unnormalized_widths.shape[-1]
116
+
117
+ if min_bin_width * num_bins > 1.0:
118
+ raise ValueError("Minimal bin width too large for the number of bins")
119
+ if min_bin_height * num_bins > 1.0:
120
+ raise ValueError("Minimal bin height too large for the number of bins")
121
+
122
+ widths = F.softmax(unnormalized_widths, dim=-1)
123
+ widths = min_bin_width + (1 - min_bin_width * num_bins) * widths
124
+ cumwidths = torch.cumsum(widths, dim=-1)
125
+ cumwidths = F.pad(cumwidths, pad=(1, 0), mode="constant", value=0.0)
126
+ cumwidths = (right - left) * cumwidths + left
127
+ cumwidths[..., 0] = left
128
+ cumwidths[..., -1] = right
129
+ widths = cumwidths[..., 1:] - cumwidths[..., :-1]
130
+
131
+ derivatives = min_derivative + F.softplus(unnormalized_derivatives)
132
+
133
+ heights = F.softmax(unnormalized_heights, dim=-1)
134
+ heights = min_bin_height + (1 - min_bin_height * num_bins) * heights
135
+ cumheights = torch.cumsum(heights, dim=-1)
136
+ cumheights = F.pad(cumheights, pad=(1, 0), mode="constant", value=0.0)
137
+ cumheights = (top - bottom) * cumheights + bottom
138
+ cumheights[..., 0] = bottom
139
+ cumheights[..., -1] = top
140
+ heights = cumheights[..., 1:] - cumheights[..., :-1]
141
+
142
+ if inverse:
143
+ bin_idx = searchsorted(cumheights, inputs)[..., None]
144
+ else:
145
+ bin_idx = searchsorted(cumwidths, inputs)[..., None]
146
+
147
+ input_cumwidths = cumwidths.gather(-1, bin_idx)[..., 0]
148
+ input_bin_widths = widths.gather(-1, bin_idx)[..., 0]
149
+
150
+ input_cumheights = cumheights.gather(-1, bin_idx)[..., 0]
151
+ delta = heights / widths
152
+ input_delta = delta.gather(-1, bin_idx)[..., 0]
153
+
154
+ input_derivatives = derivatives.gather(-1, bin_idx)[..., 0]
155
+ input_derivatives_plus_one = derivatives[..., 1:].gather(-1, bin_idx)[..., 0]
156
+
157
+ input_heights = heights.gather(-1, bin_idx)[..., 0]
158
+
159
+ if inverse:
160
+ a = (inputs - input_cumheights) * (
161
+ input_derivatives + input_derivatives_plus_one - 2 * input_delta
162
+ ) + input_heights * (input_delta - input_derivatives)
163
+ b = input_heights * input_derivatives - (inputs - input_cumheights) * (
164
+ input_derivatives + input_derivatives_plus_one - 2 * input_delta
165
+ )
166
+ c = -input_delta * (inputs - input_cumheights)
167
+
168
+ discriminant = b.pow(2) - 4 * a * c
169
+ assert (discriminant >= 0).all()
170
+
171
+ root = (2 * c) / (-b - torch.sqrt(discriminant))
172
+ outputs = root * input_bin_widths + input_cumwidths
173
+
174
+ theta_one_minus_theta = root * (1 - root)
175
+ denominator = input_delta + (
176
+ (input_derivatives + input_derivatives_plus_one - 2 * input_delta)
177
+ * theta_one_minus_theta
178
+ )
179
+ derivative_numerator = input_delta.pow(2) * (
180
+ input_derivatives_plus_one * root.pow(2)
181
+ + 2 * input_delta * theta_one_minus_theta
182
+ + input_derivatives * (1 - root).pow(2)
183
+ )
184
+ logabsdet = torch.log(derivative_numerator) - 2 * torch.log(denominator)
185
+
186
+ return outputs, -logabsdet
187
+ else:
188
+ theta = (inputs - input_cumwidths) / input_bin_widths
189
+ theta_one_minus_theta = theta * (1 - theta)
190
+
191
+ numerator = input_heights * (
192
+ input_delta * theta.pow(2) + input_derivatives * theta_one_minus_theta
193
+ )
194
+ denominator = input_delta + (
195
+ (input_derivatives + input_derivatives_plus_one - 2 * input_delta)
196
+ * theta_one_minus_theta
197
+ )
198
+ outputs = input_cumheights + numerator / denominator
199
+
200
+ derivative_numerator = input_delta.pow(2) * (
201
+ input_derivatives_plus_one * theta.pow(2)
202
+ + 2 * input_delta * theta_one_minus_theta
203
+ + input_derivatives * (1 - theta).pow(2)
204
+ )
205
+ logabsdet = torch.log(derivative_numerator) - 2 * torch.log(denominator)
206
+
207
+ return outputs, logabsdet
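The spline helpers are easiest to understand through their defining property: the transform and its inverse compose to the identity, and their log-determinants cancel. A minimal sketch (tensor shapes are illustrative; `tails="linear"` mirrors how `ConvFlow` calls this helper):

```python
import torch

from rvc_inferpy.infer_list.packs.transforms import (
    piecewise_rational_quadratic_transform,
)

num_bins = 10
x = torch.rand(4, 6) * 2 - 1            # values inside the tail bound [-1, 1]
w = torch.randn(4, 6, num_bins)         # unnormalized bin widths
h = torch.randn(4, 6, num_bins)         # unnormalized bin heights
d = torch.randn(4, 6, num_bins - 1)     # unnormalized inner-knot derivatives

y, logabsdet = piecewise_rational_quadratic_transform(
    x, w, h, d, inverse=False, tails="linear", tail_bound=1.0
)
y = y.clamp(-1.0, 1.0)                  # guard against float round-off at the interval edge
x_rec, inv_logabsdet = piecewise_rational_quadratic_transform(
    y, w, h, d, inverse=True, tails="linear", tail_bound=1.0
)
print(torch.allclose(x, x_rec, atol=1e-4))                   # True: the spline is invertible
print(torch.allclose(logabsdet, -inv_logabsdet, atol=1e-4))  # log-dets of the two directions negate
```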
rvc_inferpy/infer_list/rmvpe.py ADDED
@@ -0,0 +1,713 @@
1
+ import os
2
+
3
+ import numpy as np
4
+ import torch
5
+
6
+ try:
7
+ # Fix "Torch not compiled with CUDA enabled"
8
+ import intel_extension_for_pytorch as ipex # pylint: disable=import-error, unused-import
9
+
10
+ if torch.xpu.is_available():
11
+ from lib.infer.modules.ipex import ipex_init
12
+
13
+ ipex_init()
14
+ except Exception:
15
+ pass
16
+ import torch.nn as nn
17
+ import torch.nn.functional as F
18
+ from librosa.util import normalize, pad_center, tiny
19
+ from scipy.signal import get_window
20
+
21
+ import logging
22
+
23
+ logger = logging.getLogger(__name__)
24
+
25
+
26
+ ### STFT code adapted from https://github.com/pseeth/torch-stft/blob/master/torch_stft/util.py
27
+ def window_sumsquare(
28
+ window,
29
+ n_frames,
30
+ hop_length=200,
31
+ win_length=800,
32
+ n_fft=800,
33
+ dtype=np.float32,
34
+ norm=None,
35
+ ):
36
+ """
37
+ # from librosa 0.6
38
+ Compute the sum-square envelope of a window function at a given hop length.
39
+ This is used to estimate modulation effects induced by windowing
40
+ observations in short-time fourier transforms.
41
+ Parameters
42
+ ----------
43
+ window : string, tuple, number, callable, or list-like
44
+ Window specification, as in `get_window`
45
+ n_frames : int > 0
46
+ The number of analysis frames
47
+ hop_length : int > 0
48
+ The number of samples to advance between frames
49
+ win_length : [optional]
50
+ The length of the window function. By default, this matches `n_fft`.
51
+ n_fft : int > 0
52
+ The length of each analysis frame.
53
+ dtype : np.dtype
54
+ The data type of the output
55
+ Returns
56
+ -------
57
+ wss : np.ndarray, shape=`(n_fft + hop_length * (n_frames - 1))`
58
+ The sum-squared envelope of the window function
59
+ """
60
+ if win_length is None:
61
+ win_length = n_fft
62
+
63
+ n = n_fft + hop_length * (n_frames - 1)
64
+ x = np.zeros(n, dtype=dtype)
65
+
66
+ # Compute the squared window at the desired length
67
+ win_sq = get_window(window, win_length, fftbins=True)
68
+ win_sq = normalize(win_sq, norm=norm) ** 2
69
+ win_sq = pad_center(win_sq, size=n_fft)
70
+
71
+ # Fill the envelope
72
+ for i in range(n_frames):
73
+ sample = i * hop_length
74
+ x[sample : min(n, sample + n_fft)] += win_sq[: max(0, min(n_fft, n - sample))]
75
+ return x
76
+
77
+
78
+ class STFT(torch.nn.Module):
79
+ def __init__(
80
+ self, filter_length=1024, hop_length=512, win_length=None, window="hann"
81
+ ):
82
+ """
83
+ This module implements an STFT using 1D convolution and 1D transpose convolutions.
84
+ This is a bit tricky, so some overlap-add configurations probably won't work, since
85
+ matching the sizes before and after the transform is hard in general. Right now,
86
+ this code should work with hop lengths that are half the filter length (50% overlap
87
+ between frames).
88
+
89
+ Keyword Arguments:
90
+ filter_length {int} -- Length of filters used (default: {1024})
91
+ hop_length {int} -- Hop length of STFT (restrict to 50% overlap between frames) (default: {512})
92
+ win_length {[type]} -- Length of the window function applied to each frame (if not specified, it
93
+ equals the filter length). (default: {None})
94
+ window {str} -- Type of window to use (options are bartlett, hann, hamming, blackman, blackmanharris)
95
+ (default: {'hann'})
96
+ """
97
+ super(STFT, self).__init__()
98
+ self.filter_length = filter_length
99
+ self.hop_length = hop_length
100
+ self.win_length = win_length if win_length else filter_length
101
+ self.window = window
102
+ self.forward_transform = None
103
+ self.pad_amount = int(self.filter_length / 2)
104
+ # scale = self.filter_length / self.hop_length
105
+ fourier_basis = np.fft.fft(np.eye(self.filter_length))
106
+
107
+ cutoff = int((self.filter_length / 2 + 1))
108
+ fourier_basis = np.vstack(
109
+ [np.real(fourier_basis[:cutoff, :]), np.imag(fourier_basis[:cutoff, :])]
110
+ )
111
+ forward_basis = torch.FloatTensor(fourier_basis)
112
+ inverse_basis = torch.FloatTensor(np.linalg.pinv(fourier_basis))
113
+
114
+ assert filter_length >= self.win_length
115
+ # get window and zero center pad it to filter_length
116
+ fft_window = get_window(window, self.win_length, fftbins=True)
117
+ fft_window = pad_center(fft_window, size=filter_length)
118
+ fft_window = torch.from_numpy(fft_window).float()
119
+
120
+ # window the bases
121
+ forward_basis *= fft_window
122
+ inverse_basis = (inverse_basis.T * fft_window).T
123
+
124
+ self.register_buffer("forward_basis", forward_basis.float())
125
+ self.register_buffer("inverse_basis", inverse_basis.float())
126
+ self.register_buffer("fft_window", fft_window.float())
127
+
128
+ def transform(self, input_data, return_phase=False):
129
+ """Take input data (audio) to STFT domain.
130
+
131
+ Arguments:
132
+ input_data {tensor} -- Tensor of floats, with shape (num_batch, num_samples)
133
+
134
+ Returns:
135
+ magnitude {tensor} -- Magnitude of STFT with shape (num_batch,
136
+ num_frequencies, num_frames)
137
+ phase {tensor} -- Phase of STFT with shape (num_batch,
138
+ num_frequencies, num_frames)
139
+ """
140
+ # num_batches = input_data.shape[0]
141
+ # num_samples = input_data.shape[-1]
142
+
143
+ # self.num_samples = num_samples
144
+
145
+ # similar to librosa, reflect-pad the input
146
+ # input_data = input_data.view(num_batches, 1, num_samples)
147
+ # print(1234,input_data.shape)
148
+ input_data = F.pad(
149
+ input_data,
150
+ (self.pad_amount, self.pad_amount),
151
+ mode="reflect",
152
+ )
153
+
154
+ forward_transform = input_data.unfold(
155
+ 1, self.filter_length, self.hop_length
156
+ ).permute(0, 2, 1)
157
+ forward_transform = torch.matmul(self.forward_basis, forward_transform)
158
+
159
+ cutoff = int((self.filter_length / 2) + 1)
160
+ real_part = forward_transform[:, :cutoff, :]
161
+ imag_part = forward_transform[:, cutoff:, :]
162
+
163
+ magnitude = torch.sqrt(real_part**2 + imag_part**2)
164
+ # phase = torch.atan2(imag_part.data, real_part.data)
165
+
166
+ if return_phase:
167
+ phase = torch.atan2(imag_part.data, real_part.data)
168
+ return magnitude, phase
169
+ else:
170
+ return magnitude
171
+
172
+ def inverse(self, magnitude, phase):
173
+ """Call the inverse STFT (iSTFT), given magnitude and phase tensors produced
174
+ by the ```transform``` function.
175
+
176
+ Arguments:
177
+ magnitude {tensor} -- Magnitude of STFT with shape (num_batch,
178
+ num_frequencies, num_frames)
179
+ phase {tensor} -- Phase of STFT with shape (num_batch,
180
+ num_frequencies, num_frames)
181
+
182
+ Returns:
183
+ inverse_transform {tensor} -- Reconstructed audio given magnitude and phase. Of
184
+ shape (num_batch, num_samples)
185
+ """
186
+ cat = torch.cat(
187
+ [magnitude * torch.cos(phase), magnitude * torch.sin(phase)], dim=1
188
+ )
189
+
190
+ fold = torch.nn.Fold(
191
+ output_size=(1, (cat.size(-1) - 1) * self.hop_length + self.filter_length),
192
+ kernel_size=(1, self.filter_length),
193
+ stride=(1, self.hop_length),
194
+ )
195
+ inverse_transform = torch.matmul(self.inverse_basis, cat)
196
+ inverse_transform = fold(inverse_transform)[
197
+ :, 0, 0, self.pad_amount : -self.pad_amount
198
+ ]
199
+ window_square_sum = (
200
+ self.fft_window.pow(2).repeat(cat.size(-1), 1).T.unsqueeze(0)
201
+ )
202
+ window_square_sum = fold(window_square_sum)[
203
+ :, 0, 0, self.pad_amount : -self.pad_amount
204
+ ]
205
+ inverse_transform /= window_square_sum
206
+
207
+ return inverse_transform
208
+
209
+ def forward(self, input_data):
210
+ """Take input data (audio) to STFT domain and then back to audio.
211
+
212
+ Arguments:
213
+ input_data {tensor} -- Tensor of floats, with shape (num_batch, num_samples)
214
+
215
+ Returns:
216
+ reconstruction {tensor} -- Reconstructed audio given magnitude and phase. Of
217
+ shape (num_batch, num_samples)
218
+ """
219
+ self.magnitude, self.phase = self.transform(input_data, return_phase=True)
220
+ reconstruction = self.inverse(self.magnitude, self.phase)
221
+ return reconstruction
222
+
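As a quick sanity check on the convolution-based STFT above, a minimal round-trip sketch (the signal length and frame settings are arbitrary; the inverse only approximately reconstructs the input, truncated to whole frames):

```python
import torch

from rvc_inferpy.infer_list.rmvpe import STFT

stft = STFT(filter_length=1024, hop_length=512, win_length=1024, window="hann")
audio = torch.randn(1, 16000)                     # (batch, samples)
mag, phase = stft.transform(audio, return_phase=True)
recon = stft.inverse(mag, phase)                  # roughly reconstructs the input
print(mag.shape, recon.shape)                     # (1, 513, frames), (1, samples_truncated)
```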
223
+
224
+ from time import time as ttime
225
+
226
+
227
+ class BiGRU(nn.Module):
228
+ def __init__(self, input_features, hidden_features, num_layers):
229
+ super(BiGRU, self).__init__()
230
+ self.gru = nn.GRU(
231
+ input_features,
232
+ hidden_features,
233
+ num_layers=num_layers,
234
+ batch_first=True,
235
+ bidirectional=True,
236
+ )
237
+
238
+ def forward(self, x):
239
+ return self.gru(x)[0]
240
+
241
+
242
+ class ConvBlockRes(nn.Module):
243
+ def __init__(self, in_channels, out_channels, momentum=0.01):
244
+ super(ConvBlockRes, self).__init__()
245
+ self.conv = nn.Sequential(
246
+ nn.Conv2d(
247
+ in_channels=in_channels,
248
+ out_channels=out_channels,
249
+ kernel_size=(3, 3),
250
+ stride=(1, 1),
251
+ padding=(1, 1),
252
+ bias=False,
253
+ ),
254
+ nn.BatchNorm2d(out_channels, momentum=momentum),
255
+ nn.ReLU(),
256
+ nn.Conv2d(
257
+ in_channels=out_channels,
258
+ out_channels=out_channels,
259
+ kernel_size=(3, 3),
260
+ stride=(1, 1),
261
+ padding=(1, 1),
262
+ bias=False,
263
+ ),
264
+ nn.BatchNorm2d(out_channels, momentum=momentum),
265
+ nn.ReLU(),
266
+ )
267
+ if in_channels != out_channels:
268
+ self.shortcut = nn.Conv2d(in_channels, out_channels, (1, 1))
269
+ self.is_shortcut = True
270
+ else:
271
+ self.is_shortcut = False
272
+
273
+ def forward(self, x):
274
+ if self.is_shortcut:
275
+ return self.conv(x) + self.shortcut(x)
276
+ else:
277
+ return self.conv(x) + x
278
+
279
+
280
+ class Encoder(nn.Module):
281
+ def __init__(
282
+ self,
283
+ in_channels,
284
+ in_size,
285
+ n_encoders,
286
+ kernel_size,
287
+ n_blocks,
288
+ out_channels=16,
289
+ momentum=0.01,
290
+ ):
291
+ super(Encoder, self).__init__()
292
+ self.n_encoders = n_encoders
293
+ self.bn = nn.BatchNorm2d(in_channels, momentum=momentum)
294
+ self.layers = nn.ModuleList()
295
+ self.latent_channels = []
296
+ for i in range(self.n_encoders):
297
+ self.layers.append(
298
+ ResEncoderBlock(
299
+ in_channels, out_channels, kernel_size, n_blocks, momentum=momentum
300
+ )
301
+ )
302
+ self.latent_channels.append([out_channels, in_size])
303
+ in_channels = out_channels
304
+ out_channels *= 2
305
+ in_size //= 2
306
+ self.out_size = in_size
307
+ self.out_channel = out_channels
308
+
309
+ def forward(self, x):
310
+ concat_tensors = []
311
+ x = self.bn(x)
312
+ for i in range(self.n_encoders):
313
+ _, x = self.layers[i](x)
314
+ concat_tensors.append(_)
315
+ return x, concat_tensors
316
+
317
+
318
+ class ResEncoderBlock(nn.Module):
319
+ def __init__(
320
+ self, in_channels, out_channels, kernel_size, n_blocks=1, momentum=0.01
321
+ ):
322
+ super(ResEncoderBlock, self).__init__()
323
+ self.n_blocks = n_blocks
324
+ self.conv = nn.ModuleList()
325
+ self.conv.append(ConvBlockRes(in_channels, out_channels, momentum))
326
+ for i in range(n_blocks - 1):
327
+ self.conv.append(ConvBlockRes(out_channels, out_channels, momentum))
328
+ self.kernel_size = kernel_size
329
+ if self.kernel_size is not None:
330
+ self.pool = nn.AvgPool2d(kernel_size=kernel_size)
331
+
332
+ def forward(self, x):
333
+ for i in range(self.n_blocks):
334
+ x = self.conv[i](x)
335
+ if self.kernel_size is not None:
336
+ return x, self.pool(x)
337
+ else:
338
+ return x
339
+
340
+
341
+ class Intermediate(nn.Module): #
342
+ def __init__(self, in_channels, out_channels, n_inters, n_blocks, momentum=0.01):
343
+ super(Intermediate, self).__init__()
344
+ self.n_inters = n_inters
345
+ self.layers = nn.ModuleList()
346
+ self.layers.append(
347
+ ResEncoderBlock(in_channels, out_channels, None, n_blocks, momentum)
348
+ )
349
+ for i in range(self.n_inters - 1):
350
+ self.layers.append(
351
+ ResEncoderBlock(out_channels, out_channels, None, n_blocks, momentum)
352
+ )
353
+
354
+ def forward(self, x):
355
+ for i in range(self.n_inters):
356
+ x = self.layers[i](x)
357
+ return x
358
+
359
+
360
+ class ResDecoderBlock(nn.Module):
361
+ def __init__(self, in_channels, out_channels, stride, n_blocks=1, momentum=0.01):
362
+ super(ResDecoderBlock, self).__init__()
363
+ out_padding = (0, 1) if stride == (1, 2) else (1, 1)
364
+ self.n_blocks = n_blocks
365
+ self.conv1 = nn.Sequential(
366
+ nn.ConvTranspose2d(
367
+ in_channels=in_channels,
368
+ out_channels=out_channels,
369
+ kernel_size=(3, 3),
370
+ stride=stride,
371
+ padding=(1, 1),
372
+ output_padding=out_padding,
373
+ bias=False,
374
+ ),
375
+ nn.BatchNorm2d(out_channels, momentum=momentum),
376
+ nn.ReLU(),
377
+ )
378
+ self.conv2 = nn.ModuleList()
379
+ self.conv2.append(ConvBlockRes(out_channels * 2, out_channels, momentum))
380
+ for i in range(n_blocks - 1):
381
+ self.conv2.append(ConvBlockRes(out_channels, out_channels, momentum))
382
+
383
+ def forward(self, x, concat_tensor):
384
+ x = self.conv1(x)
385
+ x = torch.cat((x, concat_tensor), dim=1)
386
+ for i in range(self.n_blocks):
387
+ x = self.conv2[i](x)
388
+ return x
389
+
390
+
391
+ class Decoder(nn.Module):
392
+ def __init__(self, in_channels, n_decoders, stride, n_blocks, momentum=0.01):
393
+ super(Decoder, self).__init__()
394
+ self.layers = nn.ModuleList()
395
+ self.n_decoders = n_decoders
396
+ for i in range(self.n_decoders):
397
+ out_channels = in_channels // 2
398
+ self.layers.append(
399
+ ResDecoderBlock(in_channels, out_channels, stride, n_blocks, momentum)
400
+ )
401
+ in_channels = out_channels
402
+
403
+ def forward(self, x, concat_tensors):
404
+ for i in range(self.n_decoders):
405
+ x = self.layers[i](x, concat_tensors[-1 - i])
406
+ return x
407
+
408
+
409
+ class DeepUnet(nn.Module):
410
+ def __init__(
411
+ self,
412
+ kernel_size,
413
+ n_blocks,
414
+ en_de_layers=5,
415
+ inter_layers=4,
416
+ in_channels=1,
417
+ en_out_channels=16,
418
+ ):
419
+ super(DeepUnet, self).__init__()
420
+ self.encoder = Encoder(
421
+ in_channels, 128, en_de_layers, kernel_size, n_blocks, en_out_channels
422
+ )
423
+ self.intermediate = Intermediate(
424
+ self.encoder.out_channel // 2,
425
+ self.encoder.out_channel,
426
+ inter_layers,
427
+ n_blocks,
428
+ )
429
+ self.decoder = Decoder(
430
+ self.encoder.out_channel, en_de_layers, kernel_size, n_blocks
431
+ )
432
+
433
+ def forward(self, x):
434
+ x, concat_tensors = self.encoder(x)
435
+ x = self.intermediate(x)
436
+ x = self.decoder(x, concat_tensors)
437
+ return x
438
+
439
+
440
+ class E2E(nn.Module):
441
+ def __init__(
442
+ self,
443
+ n_blocks,
444
+ n_gru,
445
+ kernel_size,
446
+ en_de_layers=5,
447
+ inter_layers=4,
448
+ in_channels=1,
449
+ en_out_channels=16,
450
+ ):
451
+ super(E2E, self).__init__()
452
+ self.unet = DeepUnet(
453
+ kernel_size,
454
+ n_blocks,
455
+ en_de_layers,
456
+ inter_layers,
457
+ in_channels,
458
+ en_out_channels,
459
+ )
460
+ self.cnn = nn.Conv2d(en_out_channels, 3, (3, 3), padding=(1, 1))
461
+ if n_gru:
462
+ self.fc = nn.Sequential(
463
+ BiGRU(3 * 128, 256, n_gru),
464
+ nn.Linear(512, 360),
465
+ nn.Dropout(0.25),
466
+ nn.Sigmoid(),
467
+ )
468
+ else:
469
+ self.fc = nn.Sequential(
470
+ nn.Linear(3 * 128, 360), nn.Dropout(0.25), nn.Sigmoid()  # 128 mel bins, 360 pitch classes
471
+ )
472
+
473
+ def forward(self, mel):
474
+ # print(mel.shape)
475
+ mel = mel.transpose(-1, -2).unsqueeze(1)
476
+ x = self.cnn(self.unet(mel)).transpose(1, 2).flatten(-2)
477
+ x = self.fc(x)
478
+ # print(x.shape)
479
+ return x
480
+
481
+
482
+ from librosa.filters import mel
483
+
484
+
485
+ class MelSpectrogram(torch.nn.Module):
486
+ def __init__(
487
+ self,
488
+ is_half,
489
+ n_mel_channels,
490
+ sampling_rate,
491
+ win_length,
492
+ hop_length,
493
+ n_fft=None,
494
+ mel_fmin=0,
495
+ mel_fmax=None,
496
+ clamp=1e-5,
497
+ ):
498
+ super().__init__()
499
+ n_fft = win_length if n_fft is None else n_fft
500
+ self.hann_window = {}
501
+ mel_basis = mel(
502
+ sr=sampling_rate,
503
+ n_fft=n_fft,
504
+ n_mels=n_mel_channels,
505
+ fmin=mel_fmin,
506
+ fmax=mel_fmax,
507
+ htk=True,
508
+ )
509
+ mel_basis = torch.from_numpy(mel_basis).float()
510
+ self.register_buffer("mel_basis", mel_basis)
511
+ self.n_fft = win_length if n_fft is None else n_fft
512
+ self.hop_length = hop_length
513
+ self.win_length = win_length
514
+ self.sampling_rate = sampling_rate
515
+ self.n_mel_channels = n_mel_channels
516
+ self.clamp = clamp
517
+ self.is_half = is_half
518
+
519
+ def forward(self, audio, keyshift=0, speed=1, center=True):
520
+ factor = 2 ** (keyshift / 12)
521
+ n_fft_new = int(np.round(self.n_fft * factor))
522
+ win_length_new = int(np.round(self.win_length * factor))
523
+ hop_length_new = int(np.round(self.hop_length * speed))
524
+ keyshift_key = str(keyshift) + "_" + str(audio.device)
525
+ if keyshift_key not in self.hann_window:
526
+ self.hann_window[keyshift_key] = torch.hann_window(win_length_new).to(
527
+ # "cpu"if(audio.device.type=="privateuseone") else audio.device
528
+ audio.device
529
+ )
530
+ if "privateuseone" in str(audio.device):
531
+ if not hasattr(self, "stft"):
532
+ self.stft = STFT(
533
+ filter_length=n_fft_new,
534
+ hop_length=hop_length_new,
535
+ win_length=win_length_new,
536
+ window="hann",
537
+ ).to(audio.device)
538
+ magnitude = self.stft.transform(audio)
539
+ else:
540
+ fft = torch.stft(
541
+ audio,
542
+ n_fft=n_fft_new,
543
+ hop_length=hop_length_new,
544
+ win_length=win_length_new,
545
+ window=self.hann_window[keyshift_key],
546
+ center=center,
547
+ return_complex=True,
548
+ )
549
+ magnitude = torch.sqrt(fft.real.pow(2) + fft.imag.pow(2))
550
+ # if (audio.device.type == "privateuseone"):
551
+ # magnitude=magnitude.to(audio.device)
552
+ if keyshift != 0:
553
+ size = self.n_fft // 2 + 1
554
+ resize = magnitude.size(1)
555
+ if resize < size:
556
+ magnitude = F.pad(magnitude, (0, 0, 0, size - resize))
557
+ magnitude = magnitude[:, :size, :] * self.win_length / win_length_new
558
+ mel_output = torch.matmul(self.mel_basis, magnitude)
559
+ if self.is_half == True:
560
+ mel_output = mel_output.half()
561
+ log_mel_spec = torch.log(torch.clamp(mel_output, min=self.clamp))
562
+ # print(log_mel_spec.device.type)
563
+ return log_mel_spec
564
+
565
+
566
+ class RMVPE:
567
+ def __init__(self, model_path, is_half, device=None):
568
+ self.resample_kernel = {}
569
+ self.resample_kernel = {}
570
+ self.is_half = is_half
571
+ if device is None:
572
+ device = "cuda" if torch.cuda.is_available() else "cpu"
573
+ self.device = device
574
+ self.mel_extractor = MelSpectrogram(
575
+ is_half, 128, 16000, 1024, 160, None, 30, 8000
576
+ ).to(device)
577
+ if "privateuseone" in str(device):
578
+ import onnxruntime as ort
579
+
580
+ ort_session = ort.InferenceSession(
581
+ "%s/rmvpe.onnx" % os.environ["rmvpe_root"],
582
+ providers=["DmlExecutionProvider"],
583
+ )
584
+ self.model = ort_session
585
+ else:
586
+ model = E2E(4, 1, (2, 2))
587
+ ckpt = torch.load(model_path, map_location="cpu")
588
+ model.load_state_dict(ckpt)
589
+ model.eval()
590
+ if is_half == True:
591
+ model = model.half()
592
+ self.model = model
593
+ self.model = self.model.to(device)
594
+ cents_mapping = 20 * np.arange(360) + 1997.3794084376191
595
+ self.cents_mapping = np.pad(cents_mapping, (4, 4)) # 368
596
+
597
+ def mel2hidden(self, mel):
598
+ with torch.no_grad():
599
+ n_frames = mel.shape[-1]
600
+ n_pad = 32 * ((n_frames - 1) // 32 + 1) - n_frames
601
+ if n_pad > 0:
602
+ mel = F.pad(mel, (0, n_pad), mode="constant")
603
+ if "privateuseone" in str(self.device):
604
+ onnx_input_name = self.model.get_inputs()[0].name
605
+ onnx_outputs_names = self.model.get_outputs()[0].name
606
+ hidden = self.model.run(
607
+ [onnx_outputs_names],
608
+ input_feed={onnx_input_name: mel.cpu().numpy()},
609
+ )[0]
610
+ else:
611
+ hidden = self.model(mel)
612
+ return hidden[:, :n_frames]
613
+
614
+ def decode(self, hidden, thred=0.03):
615
+ cents_pred = self.to_local_average_cents(hidden, thred=thred)
616
+ f0 = 10 * (2 ** (cents_pred / 1200))
617
+ f0[f0 == 10] = 0
618
+ # f0 = np.array([10 * (2 ** (cent_pred / 1200)) if cent_pred else 0 for cent_pred in cents_pred])
619
+ return f0
620
+
621
+ def infer_from_audio(self, audio, thred=0.03):
622
+ # torch.cuda.synchronize()
623
+ t0 = ttime()
624
+ mel = self.mel_extractor(
625
+ torch.from_numpy(audio).float().to(self.device).unsqueeze(0), center=True
626
+ )
627
+ # print(123123123,mel.device.type)
628
+ # torch.cuda.synchronize()
629
+ t1 = ttime()
630
+ hidden = self.mel2hidden(mel)
631
+ # torch.cuda.synchronize()
632
+ t2 = ttime()
633
+ # print(234234,hidden.device.type)
634
+ if "privateuseone" not in str(self.device):
635
+ hidden = hidden.squeeze(0).cpu().numpy()
636
+ else:
637
+ hidden = hidden[0]
638
+ if self.is_half == True:
639
+ hidden = hidden.astype("float32")
640
+
641
+ f0 = self.decode(hidden, thred=thred)
642
+ # torch.cuda.synchronize()
643
+ t3 = ttime()
644
+ # print("hmvpe:%s\t%s\t%s\t%s"%(t1-t0,t2-t1,t3-t2,t3-t0))
645
+ return f0
646
+
647
+ def infer_from_audio_with_pitch(self, audio, thred=0.03, f0_min=50, f0_max=1100):
648
+ t0 = ttime()
649
+ audio = torch.from_numpy(audio).float().to(self.device).unsqueeze(0)
650
+ mel = self.mel_extractor(audio, center=True)
651
+ t1 = ttime()
652
+ hidden = self.mel2hidden(mel)
653
+ t2 = ttime()
654
+ if "privateuseone" not in str(self.device):
655
+ hidden = hidden.squeeze(0).cpu().numpy()
656
+ else:
657
+ hidden = hidden[0]
658
+ if self.is_half == True:
659
+ hidden = hidden.astype("float32")
660
+ f0 = self.decode(hidden, thred=thred)
661
+ f0[(f0 < f0_min) | (f0 > f0_max)] = 0
662
+ t3 = ttime()
663
+ return f0
664
+
665
+ def to_local_average_cents(self, salience, thred=0.05):
666
+ # t0 = ttime()
667
+ center = np.argmax(salience, axis=1)  # (n_frames,) index of the peak bin
668
+ salience = np.pad(salience, ((0, 0), (4, 4)))  # (n_frames, 368)
669
+ # t1 = ttime()
670
+ center += 4
671
+ todo_salience = []
672
+ todo_cents_mapping = []
673
+ starts = center - 4
674
+ ends = center + 5
675
+ for idx in range(salience.shape[0]):
676
+ todo_salience.append(salience[:, starts[idx] : ends[idx]][idx])
677
+ todo_cents_mapping.append(self.cents_mapping[starts[idx] : ends[idx]])
678
+ # t2 = ttime()
679
+ todo_salience = np.array(todo_salience)  # (n_frames, 9)
680
+ todo_cents_mapping = np.array(todo_cents_mapping)  # (n_frames, 9)
681
+ product_sum = np.sum(todo_salience * todo_cents_mapping, 1)
682
+ weight_sum = np.sum(todo_salience, 1)  # (n_frames,)
683
+ divided = product_sum / weight_sum  # (n_frames,)
684
+ # t3 = ttime()
685
+ maxx = np.max(salience, axis=1)  # (n_frames,)
686
+ divided[maxx <= thred] = 0
687
+ # t4 = ttime()
688
+ # print("decode:%s\t%s\t%s\t%s" % (t1 - t0, t2 - t1, t3 - t2, t4 - t3))
689
+ return divided
690
+
691
+
692
+ if __name__ == "__main__":
693
+ import librosa
694
+ import soundfile as sf
695
+
696
+ audio, sampling_rate = sf.read(r"C:\Users\liujing04\Desktop\Z\冬之花clip1.wav")
697
+ if len(audio.shape) > 1:
698
+ audio = librosa.to_mono(audio.transpose(1, 0))
699
+ audio_bak = audio.copy()
700
+ if sampling_rate != 16000:
701
+ audio = librosa.resample(audio, orig_sr=sampling_rate, target_sr=16000)
702
+ model_path = r"D:\BaiduNetdiskDownload\RVC-beta-v2-0727AMD_realtime\rmvpe.pt"
703
+ thred = 0.03 # 0.01
704
+ device = "cuda" if torch.cuda.is_available() else "cpu"
705
+ rmvpe = RMVPE(model_path, is_half=False, device=device)
706
+ t0 = ttime()
707
+ f0 = rmvpe.infer_from_audio(audio, thred=thred)
708
+ # f0 = rmvpe.infer_from_audio(audio, thred=thred)
709
+ # f0 = rmvpe.infer_from_audio(audio, thred=thred)
710
+ # f0 = rmvpe.infer_from_audio(audio, thred=thred)
711
+ # f0 = rmvpe.infer_from_audio(audio, thred=thred)
712
+ t1 = ttime()
713
+ logger.info("%s %.2f", f0.shape, t1 - t0)
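For reference, the cents-to-Hz mapping used by `decode()` above implies the frequency range covered by the 360 salience bins; this short sketch only re-uses the constants already defined in the class:

```python
import numpy as np

# RMVPE's 360 output bins are spaced 20 cents apart.
cents_mapping = 20 * np.arange(360) + 1997.3794084376191
hz = 10 * (2 ** (cents_mapping / 1200))
print(round(hz[0], 1), round(hz[-1], 1))  # about 31.7 Hz up to about 2005.5 Hz
```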
rvc_inferpy/inferclass.py ADDED
@@ -0,0 +1,72 @@
1
+ from rvc_inferpy.infer import infer_audio
2
+
3
+
4
+ class infernew:
5
+ def __init__(
6
+ self,
7
+ model_name,
8
+ sound_path,
9
+ f0_change=0,
10
+ f0_method="rmvpe",
11
+ min_pitch=50,
12
+ max_pitch=800,
13
+ crepe_hop_length=128,
14
+ index_rate=1.0,
15
+ filter_radius=3,
16
+ rms_mix_rate=0.75,
17
+ protect=0.33,
18
+ split_infer=True,
19
+ min_silence=0.5,
20
+ silence_threshold=-40,
21
+ seek_step=10,
22
+ keep_silence=0.1,
23
+ quefrency=0.0,
24
+ timbre=1.0,
25
+ f0_autotune=False,
26
+ output_format="wav",
27
+ ):
28
+ self.model_name = model_name
29
+ self.sound_path = sound_path
30
+ self.f0_change = f0_change
31
+ self.f0_method = f0_method
32
+ self.min_pitch = min_pitch
33
+ self.max_pitch = max_pitch
34
+ self.crepe_hop_length = crepe_hop_length
35
+ self.index_rate = index_rate
36
+ self.filter_radius = filter_radius
37
+ self.rms_mix_rate = rms_mix_rate
38
+ self.protect = protect
39
+ self.split_infer = split_infer
40
+ self.min_silence = min_silence
41
+ self.silence_threshold = silence_threshold
42
+ self.seek_step = seek_step
43
+ self.keep_silence = keep_silence
44
+ self.quefrency = quefrency
45
+ self.timbre = timbre
46
+ self.f0_autotune = f0_autotune
47
+ self.output_format = output_format
48
+
49
+ def run_inference(self):
50
+ inferred_audio = infer_audio(
51
+ MODEL_NAME=self.model_name,
52
+ SOUND_PATH=self.sound_path,
53
+ F0_CHANGE=self.f0_change,
54
+ F0_METHOD=self.f0_method,
55
+ MIN_PITCH=self.min_pitch,
56
+ MAX_PITCH=self.max_pitch,
57
+ CREPE_HOP_LENGTH=self.crepe_hop_length,
58
+ INDEX_RATE=self.index_rate,
59
+ FILTER_RADIUS=self.filter_radius,
60
+ RMS_MIX_RATE=self.rms_mix_rate,
61
+ PROTECT=self.protect,
62
+ SPLIT_INFER=self.split_infer,
63
+ MIN_SILENCE=self.min_silence,
64
+ SILENCE_THRESHOLD=self.silence_threshold,
65
+ SEEK_STEP=self.seek_step,
66
+ KEEP_SILENCE=self.keep_silence,
67
+ QUEFRENCY=self.quefrency,
68
+ TIMBRE=self.timbre,
69
+ F0_AUTOTUNE=self.f0_autotune,
70
+ OUTPUT_FORMAT=self.output_format,
71
+ )
72
+ return inferred_audio
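`infernew` is a thin convenience wrapper: it stores the settings and forwards them to `infer_audio`. A minimal usage sketch (the model name and audio path below are placeholders, not files shipped with the package):

```python
from rvc_inferpy.inferclass import infernew

session = infernew(
    model_name="MyVoice",     # placeholder: name of a trained RVC model
    sound_path="input.wav",   # placeholder: path to the audio to convert
    f0_method="rmvpe",
)
converted = session.run_inference()  # delegates to infer_audio() with the stored settings
```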
rvc_inferpy/modules.py ADDED
@@ -0,0 +1,676 @@
1
+ import os, sys
2
+ import traceback
3
+ import logging
4
+
5
+ now_dir = os.getcwd()
6
+ sys.path.append(now_dir)
7
+ logger = logging.getLogger(__name__)
8
+ import numpy as np
9
+ import soundfile as sf
10
+ import torch
11
+ from io import BytesIO
12
+ from rvc_inferpy.infer_list.audio import load_audio
13
+ from rvc_inferpy.infer_list.audio import wav2
14
+ from rvc_inferpy.infer_list.packs.models import (
15
+ SynthesizerTrnMs256NSFsid,
16
+ SynthesizerTrnMs256NSFsid_nono,
17
+ SynthesizerTrnMs768NSFsid,
18
+ SynthesizerTrnMs768NSFsid_nono,
19
+ )
20
+ from rvc_inferpy.pipeline import Pipeline
21
+ import time
22
+ import glob
23
+ from shutil import move
+ from urllib.parse import urlparse  # required by load_file_from_url and friendly_name below
24
+
25
+
26
+
27
+ BASE_DOWNLOAD_LINK = "https://huggingface.co/theNeofr/rvc-base/resolve/main"
28
+ BASE_MODELS = [
29
+ "hubert_base.pt",
30
+ "rmvpe.pt",
31
+ "fcpe.pt"
32
+ ]
33
+ BASE_DIR = "."
34
+
35
+
36
+ def load_file_from_url(
37
+ url: str,
38
+ model_dir: str,
39
+ file_name: str | None = None,
40
+ overwrite: bool = False,
41
+ progress: bool = True,
42
+ ) -> str:
43
+ """Download a file from `url` into `model_dir`,
44
+ reusing an existing file when possible.
45
+
46
+ Returns the path to the downloaded file.
47
+ """
48
+ os.makedirs(model_dir, exist_ok=True)
49
+ if not file_name:
50
+ parts = urlparse(url)
51
+ file_name = os.path.basename(parts.path)
52
+ cached_file = os.path.abspath(os.path.join(model_dir, file_name))
53
+
54
+ # Overwrite
55
+ if os.path.exists(cached_file):
56
+ if overwrite or os.path.getsize(cached_file) == 0:
57
+ os.remove(cached_file)
58
+
59
+ # Download
60
+ if not os.path.exists(cached_file):
61
+ logger.info(f'Downloading: "{url}" to {cached_file}\n')
62
+ from torch.hub import download_url_to_file
63
+
64
+ download_url_to_file(url, cached_file, progress=progress)
65
+ else:
66
+ logger.debug(cached_file)
67
+
68
+ return cached_file
69
+
70
+
71
+ def friendly_name(file: str):
72
+ if file.startswith("http"):
73
+ file = urlparse(file).path
74
+
75
+ file = os.path.basename(file)
76
+ model_name, extension = os.path.splitext(file)
77
+ return model_name, extension
78
+
79
+
80
+ def download_manager(
81
+ url: str,
82
+ path: str,
83
+ extension: str = "",
84
+ overwrite: bool = False,
85
+ progress: bool = True,
86
+ ):
87
+ url = url.strip()
88
+
89
+ name, ext = friendly_name(url)
90
+ name += ext if not extension else f".{extension}"
91
+
92
+ if url.startswith("http"):
93
+ filename = load_file_from_url(
94
+ url=url,
95
+ model_dir=path,
96
+ file_name=name,
97
+ overwrite=overwrite,
98
+ progress=progress,
99
+ )
100
+ else:
101
+ filename = path
102
+
103
+ return filename
104
+
105
+
106
+
107
+
108
+
109
+
110
+
111
+ sup_audioext = {
112
+ "wav",
113
+ "mp3",
114
+ "flac",
115
+ "ogg",
116
+ "opus",
117
+ "m4a",
118
+ "mp4",
119
+ "aac",
120
+ "alac",
121
+ "wma",
122
+ "aiff",
123
+ "webm",
124
+ "ac3",
125
+ }
126
+
127
+
128
+ def note_to_hz(note_name):
129
+ try:
130
+ SEMITONES = {
131
+ "C": -9,
132
+ "C#": -8,
133
+ "D": -7,
134
+ "D#": -6,
135
+ "E": -5,
136
+ "F": -4,
137
+ "F#": -3,
138
+ "G": -2,
139
+ "G#": -1,
140
+ "A": 0,
141
+ "A#": 1,
142
+ "B": 2,
143
+ }
144
+ pitch_class, octave = note_name[:-1], int(note_name[-1])
145
+ semitone = SEMITONES[pitch_class]
146
+ note_number = 12 * (octave - 4) + semitone
147
+ frequency = 440.0 * (2.0 ** (1.0 / 12)) ** note_number
148
+ return frequency
149
+ except:
150
+ return None
151
+
152
+
+
157
+
158
+ def load_hubert(config, hubert_path=None):
159
+ from fairseq import checkpoint_utils
160
+
161
+ if hubert_path is None:
162
+ hubert_path = ""
163
+ if not os.path.exists(hubert_path):
164
+ for id_model in BASE_MODELS:
165
+ download_manager(
166
+ os.path.join(BASE_DOWNLOAD_LINK, id_model), BASE_DIR
167
+ )
168
+ hubert_path = "hubert_base.pt"
169
+
170
+ models, _, _ = checkpoint_utils.load_model_ensemble_and_task(
171
+ [hubert_path],
172
+ suffix="",
173
+ )
174
+ hubert_model = models[0]
175
+ hubert_model = hubert_model.to(config.device)
176
+ if config.is_half:
177
+ hubert_model = hubert_model.half()
178
+ else:
179
+ hubert_model = hubert_model.float()
180
+ hubert_model.eval()
181
+
182
+ return hubert_model
183
+
184
+ class VC:
185
+ def __init__(self, config):
186
+ self.n_spk = None
187
+ self.tgt_sr = None
188
+ self.net_g = None
189
+ self.pipeline = None
190
+ self.cpt = None
191
+ self.version = None
192
+ self.if_f0 = None
193
+ self.version = None
194
+ self.hubert_model = None
195
+
196
+ self.config = config
197
+
198
+ def get_vc(self, sid, *to_return_protect):
199
+ logger.info("Get sid: " + sid)
200
+
201
+ to_return_protect0 = {
202
+ "visible": self.if_f0 != 0,
203
+ "value": (
204
+ to_return_protect[0] if self.if_f0 != 0 and to_return_protect else 0.5
205
+ ),
206
+ "__type__": "update",
207
+ }
208
+ to_return_protect1 = {
209
+ "visible": self.if_f0 != 0,
210
+ "value": (
211
+ to_return_protect[1] if self.if_f0 != 0 and to_return_protect else 0.33
212
+ ),
213
+ "__type__": "update",
214
+ }
215
+
216
+ if sid == "" or sid == []:
217
+ if (
218
+ self.hubert_model is not None
219
+ ):  # polling case: check whether sid switched from a loaded model to no model
220
+ logger.info("Clean model cache")
221
+ del (
222
+ self.net_g,
223
+ self.n_spk,
224
+ self.vc,
225
+ self.hubert_model,
226
+ self.tgt_sr,
227
+ ) # ,cpt
228
+ self.hubert_model = self.net_g = self.n_spk = self.vc = (
229
+ self.hubert_model
230
+ ) = self.tgt_sr = None
231
+ if torch.cuda.is_available():
232
+ torch.cuda.empty_cache()
233
+ ### without this juggling the cleanup below would be incomplete
234
+ self.if_f0 = self.cpt.get("f0", 1)
235
+ self.version = self.cpt.get("version", "v1")
236
+ if self.version == "v1":
237
+ if self.if_f0 == 1:
238
+ self.net_g = SynthesizerTrnMs256NSFsid(
239
+ *self.cpt["config"], is_half=self.config.is_half
240
+ )
241
+ else:
242
+ self.net_g = SynthesizerTrnMs256NSFsid_nono(*self.cpt["config"])
243
+ elif self.version == "v2":
244
+ if self.if_f0 == 1:
245
+ self.net_g = SynthesizerTrnMs768NSFsid(
246
+ *self.cpt["config"], is_half=self.config.is_half
247
+ )
248
+ else:
249
+ self.net_g = SynthesizerTrnMs768NSFsid_nono(*self.cpt["config"])
250
+ del self.net_g, self.cpt
251
+ if torch.cuda.is_available():
252
+ torch.cuda.empty_cache()
253
+ return (
254
+ {"visible": False, "__type__": "update"},
255
+ {
256
+ "visible": True,
257
+ "value": to_return_protect0,
258
+ "__type__": "update",
259
+ },
260
+ {
261
+ "visible": True,
262
+ "value": to_return_protect1,
263
+ "__type__": "update",
264
+ },
265
+ "",
266
+ "",
267
+ )
268
+ # person = f'{os.getenv("weight_root")}/{sid}'
269
+ person = f"{sid}"
270
+ # logger.info(f"Loading: {person}")
271
+ logger.info(f"Loading...")
272
+ self.cpt = torch.load(person, map_location="cpu")
273
+ self.tgt_sr = self.cpt["config"][-1]
274
+ self.cpt["config"][-3] = self.cpt["weight"]["emb_g.weight"].shape[0] # n_spk
275
+ self.if_f0 = self.cpt.get("f0", 1)
276
+ self.version = self.cpt.get("version", "v1")
277
+
278
+ synthesizer_class = {
279
+ ("v1", 1): SynthesizerTrnMs256NSFsid,
280
+ ("v1", 0): SynthesizerTrnMs256NSFsid_nono,
281
+ ("v2", 1): SynthesizerTrnMs768NSFsid,
282
+ ("v2", 0): SynthesizerTrnMs768NSFsid_nono,
283
+ }
284
+
285
+ self.net_g = synthesizer_class.get(
286
+ (self.version, self.if_f0), SynthesizerTrnMs256NSFsid
287
+ )(*self.cpt["config"], is_half=self.config.is_half)
288
+
289
+ del self.net_g.enc_q
290
+
291
+ self.net_g.load_state_dict(self.cpt["weight"], strict=False)
292
+ self.net_g.eval().to(self.config.device)
293
+ if self.config.is_half:
294
+ self.net_g = self.net_g.half()
295
+ else:
296
+ self.net_g = self.net_g.float()
297
+
298
+ self.pipeline = Pipeline(self.tgt_sr, self.config)
299
+ n_spk = self.cpt["config"][-3]
300
+ # index = {"value": get_index_path_from_model(sid), "__type__": "update"}
301
+ # logger.info("Select index: " + index["value"])
302
+
303
+ return (
304
+ (
305
+ {"visible": False, "maximum": n_spk, "__type__": "update"},
306
+ to_return_protect0,
307
+ to_return_protect1,
308
+ )
309
+ if to_return_protect
310
+ else {"visible": False, "maximum": n_spk, "__type__": "update"}
311
+ )
312
+
313
+ def vc_single_dont_save(
314
+ self,
315
+ sid,
316
+ input_audio_path1,
317
+ f0_up_key,
318
+ f0_method,
319
+ file_index,
320
+ file_index2,
321
+ index_rate,
322
+ filter_radius,
323
+ resample_sr,
324
+ rms_mix_rate,
325
+ protect,
326
+ crepe_hop_length,
327
+ do_formant,
328
+ quefrency,
329
+ timbre,
330
+ f0_min,
331
+ f0_max,
332
+ f0_autotune,
333
+ hubert_model_path="hubert_base.pt",
334
+ ):
335
+ """
336
+ Performs inference without saving
337
+
338
+ Parameters:
339
+ - sid (int)
340
+ - input_audio_path1 (str)
341
+ - f0_up_key (int)
342
+ - f0_method (str)
343
+ - file_index (str)
344
+ - file_index2 (str)
345
+ - index_rate (float)
346
+ - filter_radius (int)
347
+ - resample_sr (int)
348
+ - rms_mix_rate (float)
349
+ - protect (float)
350
+ - crepe_hop_length (int)
351
+ - do_formant (bool)
352
+ - quefrency (float)
353
+ - timbre (float)
354
+ - f0_min (str)
355
+ - f0_max (str)
356
+ - f0_autotune (bool)
357
+ - hubert_model_path (str)
358
+
359
+ Returns:
360
+ Tuple(Tuple(status, index_info, times), Tuple(sr, data)):
361
+ - Tuple(status, index_info, times):
362
+ - status (str): either "Success." or an error
363
+ - index_info (str): index path if used
364
+ - times (list): [npy_time, f0_time, infer_time, total_time]
365
+ - Tuple(sr, data): Audio data results.
366
+ """
367
+ global total_time
368
+ total_time = 0
369
+ start_time = time.time()
370
+
371
+ if not input_audio_path1:
372
+ return "You need to upload an audio", None
373
+
374
+ if not os.path.exists(input_audio_path1):
375
+ return "Audio was not properly selected or doesn't exist", None
376
+
377
+ f0_up_key = int(f0_up_key)
378
+ if not f0_min.isdigit():
379
+ f0_min = note_to_hz(f0_min)
380
+ if f0_min:
381
+ print(f"Converted Min pitch: freq - {f0_min}")
382
+ else:
383
+ f0_min = 50
384
+ print("Invalid minimum pitch note. Defaulting to 50hz.")
385
+ else:
386
+ f0_min = float(f0_min)
387
+ if not f0_max.isdigit():
388
+ f0_max = note_to_hz(f0_max)
389
+ if f0_max:
390
+ print(f"Converted Max pitch: freq - {f0_max}")
391
+ else:
392
+ f0_max = 1100
393
+ print("Invalid maximum pitch note. Defaulting to 1100hz.")
394
+ else:
395
+ f0_max = float(f0_max)
396
+
397
+ try:
398
+ print(f"Attempting to load {input_audio_path1}....")
399
+ audio = load_audio(
400
+ file=input_audio_path1,
401
+ sr=16000,
402
+ DoFormant=do_formant,
403
+ Quefrency=quefrency,
404
+ Timbre=timbre,
405
+ )
406
+
407
+ audio_max = np.abs(audio).max() / 0.95
408
+ if audio_max > 1:
409
+ audio /= audio_max
410
+ times = [0, 0, 0]
411
+
412
+ if self.hubert_model is None:
413
+ self.hubert_model = load_hubert(self.config, hubert_model_path)
414
+
415
+ try:
416
+ self.if_f0 = self.cpt.get("f0", 1)
417
+ except NameError:
418
+ message = "Model was not properly selected"
419
+ print(message)
420
+ return message, None
421
+
422
+ if file_index and not file_index == "" and isinstance(file_index, str):
423
+ file_index = (
424
+ file_index.strip(" ")
425
+ .strip('"')
426
+ .strip("\n")
427
+ .strip('"')
428
+ .strip(" ")
429
+ .replace("trained", "added")
430
+ )
431
+ elif file_index2:
432
+ file_index = file_index2
433
+ else:
434
+ file_index = ""
435
+
436
+ audio_opt = self.pipeline.pipeline(
437
+ self.hubert_model,
438
+ self.net_g,
439
+ sid,
440
+ audio,
441
+ input_audio_path1,
442
+ times,
443
+ f0_up_key,
444
+ f0_method,
445
+ file_index,
446
+ index_rate,
447
+ self.if_f0,
448
+ filter_radius,
449
+ self.tgt_sr,
450
+ resample_sr,
451
+ rms_mix_rate,
452
+ self.version,
453
+ protect,
454
+ crepe_hop_length,
455
+ f0_autotune,
456
+ f0_min=f0_min,
457
+ f0_max=f0_max,
458
+ )
459
+
460
+ if self.tgt_sr != resample_sr >= 16000:
461
+ tgt_sr = resample_sr
462
+ else:
463
+ tgt_sr = self.tgt_sr
464
+ index_info = (
465
+ "Index: %s." % file_index
466
+ if isinstance(file_index, str) and os.path.exists(file_index)
467
+ else "Index not used."
468
+ )
469
+ end_time = time.time()
470
+ total_time = end_time - start_time
471
+ times.append(total_time)
472
+ return (
473
+ ("Success.", index_info, times),
474
+ (tgt_sr, audio_opt),
475
+ )
476
+ except:
477
+ info = traceback.format_exc()
478
+ logger.warning(info)
479
+ return ((info, None, [None, None, None, None]), (None, None))
480
+
481
+ def vc_single(
482
+ self,
483
+ sid,
484
+ input_audio_path1,
485
+ f0_up_key,
486
+ f0_method,
487
+ file_index,
488
+ file_index2,
489
+ index_rate,
490
+ filter_radius,
491
+ resample_sr,
492
+ rms_mix_rate,
493
+ protect,
494
+ format1,
495
+ crepe_hop_length,
496
+ do_formant,
497
+ quefrency,
498
+ timbre,
499
+ f0_min,
500
+ f0_max,
501
+ f0_autotune,
502
+ hubert_model_path="hubert_base.pt",
503
+ ):
504
+ """
505
+ Performs inference with saving
506
+
507
+ Parameters:
508
+ - sid (int)
509
+ - input_audio_path1 (str)
510
+ - f0_up_key (int)
511
+ - f0_method (str)
512
+ - file_index (str)
513
+ - file_index2 (str)
514
+ - index_rate (float)
515
+ - filter_radius (int)
516
+ - resample_sr (int)
517
+ - rms_mix_rate (float)
518
+ - protect (float)
519
+ - format1 (str)
520
+ - crepe_hop_length (int)
521
+ - do_formant (bool)
522
+ - quefrency (float)
523
+ - timbre (float)
524
+ - f0_min (str)
525
+ - f0_max (str)
526
+ - f0_autotune (bool)
527
+ - hubert_model_path (str)
528
+
529
+ Returns:
530
+ Tuple(Tuple(status, index_info, times), Tuple(sr, data), output_path):
531
+ - Tuple(status, index_info, times):
532
+ - status (str): either "Success." or an error
533
+ - index_info (str): index path if used
534
+ - times (list): [npy_time, f0_time, infer_time, total_time]
535
+ - Tuple(sr, data): Audio data results.
536
+ - output_path (str): Audio results path
537
+ """
538
+ global total_time
539
+ total_time = 0
540
+ start_time = time.time()
541
+
542
+ if not input_audio_path1:
543
+ return "You need to upload an audio", None, None
544
+
545
+ if not os.path.exists(input_audio_path1):
546
+ return "Audio was not properly selected or doesn't exist", None, None
547
+
548
+ f0_up_key = int(f0_up_key)
549
+ if not f0_min.isdigit():
550
+ f0_min = note_to_hz(f0_min)
551
+ if f0_min:
552
+ print(f"Converted Min pitch: freq - {f0_min}")
553
+ else:
554
+ f0_min = 50
555
+ print("Invalid minimum pitch note. Defaulting to 50hz.")
556
+ else:
557
+ f0_min = float(f0_min)
558
+ if not f0_max.isdigit():
559
+ f0_max = note_to_hz(f0_max)
560
+ if f0_max:
561
+ print(f"Converted Max pitch: freq - {f0_max}")
562
+ else:
563
+ f0_max = 1100
564
+ print("Invalid maximum pitch note. Defaulting to 1100hz.")
565
+ else:
566
+ f0_max = float(f0_max)
567
+
568
+ try:
569
+ print(f"Attempting to load {input_audio_path1}...")
570
+ audio = load_audio(
571
+ file=input_audio_path1,
572
+ sr=16000,
573
+ DoFormant=do_formant,
574
+ Quefrency=quefrency,
575
+ Timbre=timbre,
576
+ )
577
+
578
+ audio_max = np.abs(audio).max() / 0.95
579
+ if audio_max > 1:
580
+ audio /= audio_max
581
+ times = [0, 0, 0]
582
+
583
+ if self.hubert_model is None:
584
+ self.hubert_model = load_hubert(self.config, hubert_model_path)
585
+
586
+ try:
587
+ self.if_f0 = self.cpt.get("f0", 1)
588
+ except NameError:
589
+ message = "Model was not properly selected"
590
+ print(message)
591
+ return message, None, None
592
+ if file_index and not file_index == "" and isinstance(file_index, str):
593
+ file_index = (
594
+ file_index.strip(" ")
595
+ .strip('"')
596
+ .strip("\n")
597
+ .strip('"')
598
+ .strip(" ")
599
+ .replace("trained", "added")
600
+ )
601
+ elif file_index2:
602
+ file_index = file_index2
603
+ else:
604
+ file_index = ""
605
+
606
+ audio_opt = self.pipeline.pipeline(
607
+ self.hubert_model,
608
+ self.net_g,
609
+ sid,
610
+ audio,
611
+ input_audio_path1,
612
+ times,
613
+ f0_up_key,
614
+ f0_method,
615
+ file_index,
616
+ index_rate,
617
+ self.if_f0,
618
+ filter_radius,
619
+ self.tgt_sr,
620
+ resample_sr,
621
+ rms_mix_rate,
622
+ self.version,
623
+ protect,
624
+ crepe_hop_length,
625
+ f0_autotune,
626
+ f0_min=f0_min,
627
+ f0_max=f0_max,
628
+ )
629
+
630
+ if self.tgt_sr != resample_sr >= 16000:
631
+ tgt_sr = resample_sr
632
+ else:
633
+ tgt_sr = self.tgt_sr
634
+ index_info = (
635
+ "Index: %s." % file_index
636
+ if isinstance(file_index, str) and os.path.exists(file_index)
637
+ else "Index not used."
638
+ )
639
+
640
+ opt_root = os.path.join(os.getcwd(), "output")
641
+ os.makedirs(opt_root, exist_ok=True)
642
+ output_count = 1
643
+
644
+ while True:
645
+ opt_filename = f"{os.path.splitext(os.path.basename(input_audio_path1))[0]}{os.path.basename(os.path.dirname(file_index))}{f0_method.capitalize()}_{output_count}.{format1}"
646
+ current_output_path = os.path.join(opt_root, opt_filename)
647
+ if not os.path.exists(current_output_path):
648
+ break
649
+ output_count += 1
650
+ try:
651
+ if format1 in ["wav", "flac"]:
652
+ sf.write(
653
+ current_output_path,
654
+ audio_opt,
655
+ tgt_sr,
656
+ )
657
+ else:
658
+ with BytesIO() as wavf:
659
+ sf.write(wavf, audio_opt, tgt_sr, format="wav")
660
+ wavf.seek(0, 0)
661
+ with open(current_output_path, "wb") as outf:
662
+ wav2(wavf, outf, format1)
663
+ except:
664
+ info = traceback.format_exc()
665
+ end_time = time.time()
666
+ total_time = end_time - start_time
667
+ times.append(total_time)
668
+ return (
669
+ ("Success.", index_info, times),
670
+ (tgt_sr, audio_opt),
671
+ current_output_path,
672
+ )
673
+ except:
674
+ info = traceback.format_exc()
675
+ logger.warning(info)
676
+ return ((info, None, [None, None, None, None]), (None, None), None)
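
As a quick sanity check of the `note_to_hz` helper above (equal temperament relative to A4 = 440 Hz), the mapping it performs can be traced by hand; the printed values below are approximate.

```python
from rvc_inferpy.modules import note_to_hz

# "C5": semitone offset -9, octave 5 -> note_number = 12 * (5 - 4) - 9 = 3
# frequency = 440 * 2 ** (3 / 12) ≈ 523.25 Hz
print(round(note_to_hz("A4"), 2))  # 440.0
print(round(note_to_hz("C5"), 2))  # 523.25
print(note_to_hz("bogus"))         # None -- invalid note names fall through to None
```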
rvc_inferpy/pipeline.py ADDED
@@ -0,0 +1,917 @@
1
+ import os
2
+ import sys
3
+ import gc
4
+ import traceback
5
+ import logging
6
+
7
+ logger = logging.getLogger(__name__)
8
+
9
+ from functools import lru_cache
10
+ from time import time as ttime
11
+ from torch import Tensor
12
+ import faiss
13
+ import librosa
14
+ import numpy as np
15
+ import parselmouth
16
+ import pyworld
17
+ import torch.nn.functional as F
18
+ from scipy import signal
19
+ from tqdm import tqdm
20
+
21
+ import random
22
+
23
+ now_dir = os.getcwd()
24
+ sys.path.append(now_dir)
25
+ import re
26
+ from functools import partial
27
+
28
+ bh, ah = signal.butter(N=5, Wn=48, btype="high", fs=16000)
29
+
30
+ input_audio_path2wav = {}
31
+ import torchcrepe # Fork Feature. Crepe algo for training and preprocess
32
+ from torchfcpe import spawn_bundled_infer_model
33
+ import torch
34
+ from rvc_inferpy.infer_list.rmvpe import RMVPE
35
+ from rvc_inferpy.infer_list.fcpe import FCPE
36
+
37
+
38
+ @lru_cache
39
+ def cache_harvest_f0(input_audio_path, fs, f0max, f0min, frame_period):
40
+ audio = input_audio_path2wav[input_audio_path]
41
+ f0, t = pyworld.harvest(
42
+ audio,
43
+ fs=fs,
44
+ f0_ceil=f0max,
45
+ f0_floor=f0min,
46
+ frame_period=frame_period,
47
+ )
48
+ f0 = pyworld.stonemask(audio, f0, t, fs)
49
+ return f0
50
+
51
+
52
+ def change_rms(data1, sr1, data2, sr2, rate):  # 1 is the input audio, 2 is the output audio, rate is the share of 2
53
+ # print(data1.max(),data2.max())
54
+ rms1 = librosa.feature.rms(
55
+ y=data1, frame_length=sr1 // 2 * 2, hop_length=sr1 // 2
56
+ )  # one point every half second
57
+ rms2 = librosa.feature.rms(y=data2, frame_length=sr2 // 2 * 2, hop_length=sr2 // 2)
58
+ rms1 = torch.from_numpy(rms1)
59
+ rms1 = F.interpolate(
60
+ rms1.unsqueeze(0), size=data2.shape[0], mode="linear"
61
+ ).squeeze()
62
+ rms2 = torch.from_numpy(rms2)
63
+ rms2 = F.interpolate(
64
+ rms2.unsqueeze(0), size=data2.shape[0], mode="linear"
65
+ ).squeeze()
66
+ rms2 = torch.max(rms2, torch.zeros_like(rms2) + 1e-6)
67
+ data2 *= (
68
+ torch.pow(rms1, torch.tensor(1 - rate))
69
+ * torch.pow(rms2, torch.tensor(rate - 1))
70
+ ).numpy()
71
+ return data2
72
+
73
+
74
+ class Pipeline(object):
75
+ def __init__(self, tgt_sr, config):
76
+ self.x_pad, self.x_query, self.x_center, self.x_max, self.is_half = (
77
+ config.x_pad,
78
+ config.x_query,
79
+ config.x_center,
80
+ config.x_max,
81
+ config.is_half,
82
+ )
83
+ self.sr = 16000  # HuBERT input sample rate
84
+ self.window = 160  # samples per frame
85
+ self.t_pad = self.sr * self.x_pad  # padding before and after each segment
86
+ self.t_pad_tgt = tgt_sr * self.x_pad
87
+ self.t_pad2 = self.t_pad * 2
88
+ self.t_query = self.sr * self.x_query  # query window around each cut point
89
+ self.t_center = self.sr * self.x_center  # interval between cut-point queries
90
+ self.t_max = self.sr * self.x_max  # duration threshold below which no cut-point search is needed
91
+ self.device = config.device
92
+ self.model_rmvpe = RMVPE(
93
+ os.environ["rmvpe_model_path"], is_half=self.is_half, device=self.device
94
+ )
95
+
96
+ self.note_dict = [
97
+ 65.41,
98
+ 69.30,
99
+ 73.42,
100
+ 77.78,
101
+ 82.41,
102
+ 87.31,
103
+ 92.50,
104
+ 98.00,
105
+ 103.83,
106
+ 110.00,
107
+ 116.54,
108
+ 123.47,
109
+ 130.81,
110
+ 138.59,
111
+ 146.83,
112
+ 155.56,
113
+ 164.81,
114
+ 174.61,
115
+ 185.00,
116
+ 196.00,
117
+ 207.65,
118
+ 220.00,
119
+ 233.08,
120
+ 246.94,
121
+ 261.63,
122
+ 277.18,
123
+ 293.66,
124
+ 311.13,
125
+ 329.63,
126
+ 349.23,
127
+ 369.99,
128
+ 392.00,
129
+ 415.30,
130
+ 440.00,
131
+ 466.16,
132
+ 493.88,
133
+ 523.25,
134
+ 554.37,
135
+ 587.33,
136
+ 622.25,
137
+ 659.25,
138
+ 698.46,
139
+ 739.99,
140
+ 783.99,
141
+ 830.61,
142
+ 880.00,
143
+ 932.33,
144
+ 987.77,
145
+ 1046.50,
146
+ 1108.73,
147
+ 1174.66,
148
+ 1244.51,
149
+ 1318.51,
150
+ 1396.91,
151
+ 1479.98,
152
+ 1567.98,
153
+ 1661.22,
154
+ 1760.00,
155
+ 1864.66,
156
+ 1975.53,
157
+ 2093.00,
158
+ 2217.46,
159
+ 2349.32,
160
+ 2489.02,
161
+ 2637.02,
162
+ 2793.83,
163
+ 2959.96,
164
+ 3135.96,
165
+ 3322.44,
166
+ 3520.00,
167
+ 3729.31,
168
+ 3951.07,
169
+ ]
170
+
171
+ # Fork Feature: Get the best torch device to use for f0 algorithms that require a torch device. Will return the type (torch.device)
172
+ def get_optimal_torch_device(self, index: int = 0) -> torch.device:
173
+ if torch.cuda.is_available():
174
+ return torch.device(
175
+ f"cuda:{index % torch.cuda.device_count()}"
176
+ ) # Very fast
177
+ elif torch.backends.mps.is_available():
178
+ return torch.device("mps")
179
+ return torch.device("cpu")
180
+
181
+ # Fork Feature: Compute f0 with the crepe method
182
+ def get_f0_crepe_computation(
183
+ self,
184
+ x,
185
+ f0_min,
186
+ f0_max,
187
+ p_len,
188
+ *args, # 512 before. Hop length changes the speed that the voice jumps to a different dramatic pitch. Lower hop lengths means more pitch accuracy but longer inference time.
189
+ **kwargs, # Either use crepe-tiny "tiny" or crepe "full". Default is full
190
+ ):
191
+ x = x.astype(
192
+ np.float32
193
+ ) # fixes the F.conv2D exception. We needed to convert double to float.
194
+ x /= np.quantile(np.abs(x), 0.999)
195
+ torch_device = self.get_optimal_torch_device()
196
+ audio = torch.from_numpy(x).to(torch_device, copy=True)
197
+ audio = torch.unsqueeze(audio, dim=0)
198
+ if audio.ndim == 2 and audio.shape[0] > 1:
199
+ audio = torch.mean(audio, dim=0, keepdim=True).detach()
200
+ audio = audio.detach()
201
+ hop_length = kwargs.get("crepe_hop_length", 160)
202
+ model = kwargs.get("model", "full")
203
+ print("Initiating prediction with a crepe_hop_length of: " + str(hop_length))
204
+ pitch: Tensor = torchcrepe.predict(
205
+ audio,
206
+ self.sr,
207
+ hop_length,
208
+ f0_min,
209
+ f0_max,
210
+ model,
211
+ batch_size=hop_length * 2,
212
+ device=torch_device,
213
+ pad=True,
214
+ )
215
+ p_len = p_len or x.shape[0] // hop_length
216
+ # Resize the pitch for final f0
217
+ source = np.array(pitch.squeeze(0).cpu().float().numpy())
218
+ source[source < 0.001] = np.nan
219
+ target = np.interp(
220
+ np.arange(0, len(source) * p_len, len(source)) / p_len,
221
+ np.arange(0, len(source)),
222
+ source,
223
+ )
224
+ f0 = np.nan_to_num(target)
225
+ return f0 # Resized f0
226
+
227
+ def get_f0_official_crepe_computation(self, x, f0_min, f0_max, *args, **kwargs):
228
+ # Pick a batch size that doesn't cause memory errors on your gpu
229
+ batch_size = 512
230
+ # Compute pitch using first gpu
231
+ audio = torch.tensor(np.copy(x))[None].float()
232
+ model = kwargs.get("model", "full")
233
+ f0, pd = torchcrepe.predict(
234
+ audio,
235
+ self.sr,
236
+ self.window,
237
+ f0_min,
238
+ f0_max,
239
+ model,
240
+ batch_size=batch_size,
241
+ device=self.device,
242
+ return_periodicity=True,
243
+ )
244
+ pd = torchcrepe.filter.median(pd, 3)
245
+ f0 = torchcrepe.filter.mean(f0, 3)
246
+ f0[pd < 0.1] = 0
247
+ f0 = f0[0].cpu().numpy()
248
+ return f0
249
+
250
+ # Fork Feature: Compute pYIN f0 method
251
+ def get_f0_pyin_computation(self, x, f0_min, f0_max):
252
+ y, sr = librosa.load(x, sr=self.sr, mono=True)
253
+ f0, _, _ = librosa.pyin(y, fmin=f0_min, fmax=f0_max, sr=self.sr)
254
+ f0 = f0[1:] # Get rid of extra first frame
255
+ return f0
256
+
257
+ def get_rmvpe(self, x, *args, **kwargs):
258
+ if not hasattr(self, "model_rmvpe"):
259
+ from lib.infer.infer_libs.rmvpe import RMVPE
260
+
261
+ logger.info(f"Loading rmvpe model, {os.environ['rmvpe_model_path']}")
262
+ self.model_rmvpe = RMVPE(
263
+ os.environ["rmvpe_model_path"],
264
+ is_half=self.is_half,
265
+ device=self.device,
266
+ )
267
+ f0 = self.model_rmvpe.infer_from_audio(x, thred=0.03)
268
+
269
+ if "privateuseone" in str(self.device): # clean ortruntime memory
270
+ del self.model_rmvpe.model
271
+ del self.model_rmvpe
272
+ logger.info("Cleaning ortruntime memory")
273
+
274
+ return f0
275
+
276
+ def get_pitch_dependant_rmvpe(self, x, f0_min=1, f0_max=40000, *args, **kwargs):
277
+ if not hasattr(self, "model_rmvpe"):
278
+ from lib.infer.infer_libs.rmvpe import RMVPE
279
+
280
+ logger.info(f"Loading rmvpe model, {os.environ['rmvpe_model_path']}")
281
+ self.model_rmvpe = RMVPE(
282
+ os.environ["rmvpe_model_path"],
283
+ is_half=self.is_half,
284
+ device=self.device,
285
+ )
286
+ f0 = self.model_rmvpe.infer_from_audio_with_pitch(
287
+ x, thred=0.03, f0_min=f0_min, f0_max=f0_max
288
+ )
289
+ if "privateuseone" in str(self.device): # clean ortruntime memory
290
+ del self.model_rmvpe.model
291
+ del self.model_rmvpe
292
+ logger.info("Cleaning ortruntime memory")
293
+
294
+ return f0
295
+
296
+ def get_fcpe(self, x, f0_min, f0_max, p_len, *args, **kwargs):
297
+ self.model_fcpe = FCPE(
298
+ os.environ["fcpe_model_path"],
299
+ f0_min=f0_min,
300
+ f0_max=f0_max,
301
+ dtype=torch.float32,
302
+ device=self.device,
303
+ sampling_rate=self.sr,
304
+ threshold=0.03,
305
+ )
306
+ f0 = self.model_fcpe.compute_f0(x, p_len=p_len)
307
+ del self.model_fcpe
308
+ gc.collect()
309
+ return f0
310
+
311
+ def get_torchfcpe(self, x, sr, f0_min, f0_max, p_len, *args, **kwargs):
312
+ self.model_torchfcpe = spawn_bundled_infer_model(device=self.device)
313
+ f0 = self.model_torchfcpe.infer(
314
+ torch.from_numpy(x).float().unsqueeze(0).unsqueeze(-1).to(self.device),
315
+ sr=sr,
316
+ decoder_mode="local_argmax",
317
+ threshold=0.006,
318
+ f0_min=f0_min,
319
+ f0_max=f0_max,
320
+ output_interp_target_length=p_len,
321
+ )
322
+ return f0.squeeze().cpu().numpy()
323
+
324
+ def autotune_f0(self, f0):
325
+ autotuned_f0 = []
326
+ for freq in f0:
327
+ closest_notes = [
328
+ x
329
+ for x in self.note_dict
330
+ if abs(x - freq) == min(abs(n - freq) for n in self.note_dict)
331
+ ]
332
+ autotuned_f0.append(random.choice(closest_notes))
333
+ return np.array(autotuned_f0, np.float64)
334
+
335
+ # Fork Feature: Acquire median hybrid f0 estimation calculation
336
+ def get_f0_hybrid_computation(
337
+ self,
338
+ methods_str,
339
+ input_audio_path,
340
+ x,
341
+ f0_min,
342
+ f0_max,
343
+ p_len,
344
+ filter_radius,
345
+ crepe_hop_length,
346
+ time_step,
347
+ ):
348
+ # Get various f0 methods from input to use in the computation stack
349
+ methods_str = re.search(r"hybrid\[(.+)\]", methods_str)
350
+ if methods_str: # Ensure a match was found
351
+ methods = [method.strip() for method in methods_str.group(1).split("+")]
352
+ f0_computation_stack = []
353
+
354
+ print("Calculating f0 pitch estimations for methods: %s" % str(methods))
355
+ x = x.astype(np.float32)
356
+ x /= np.quantile(np.abs(x), 0.999)
357
+ # Get f0 calculations for all methods specified
358
+ for method in methods:
359
+ f0 = None
360
+ if method == "pm":
361
+ f0 = (
362
+ parselmouth.Sound(x, self.sr)
363
+ .to_pitch_ac(
364
+ time_step=time_step / 1000,
365
+ voicing_threshold=0.6,
366
+ pitch_floor=f0_min,
367
+ pitch_ceiling=f0_max,
368
+ )
369
+ .selected_array["frequency"]
370
+ )
371
+ pad_size = (p_len - len(f0) + 1) // 2
372
+ if pad_size > 0 or p_len - len(f0) - pad_size > 0:
373
+ f0 = np.pad(
374
+ f0, [[pad_size, p_len - len(f0) - pad_size]], mode="constant"
375
+ )
376
+ elif method == "crepe":
377
+ f0 = self.get_f0_official_crepe_computation(
378
+ x, f0_min, f0_max, model="full"
379
+ )
380
+ f0 = f0[1:]
381
+ elif method == "crepe-tiny":
382
+ f0 = self.get_f0_official_crepe_computation(
383
+ x, f0_min, f0_max, model="tiny"
384
+ )
385
+ f0 = f0[1:] # Get rid of extra first frame
386
+ elif method == "mangio-crepe":
387
+ f0 = self.get_f0_crepe_computation(
388
+ x, f0_min, f0_max, p_len, crepe_hop_length=crepe_hop_length
389
+ )
390
+ elif method == "mangio-crepe-tiny":
391
+ f0 = self.get_f0_crepe_computation(
392
+ x,
393
+ f0_min,
394
+ f0_max,
395
+ p_len,
396
+ crepe_hop_length=crepe_hop_length,
397
+ model="tiny",
398
+ )
399
+ elif method == "harvest":
400
+ input_audio_path2wav[input_audio_path] = x.astype(np.double)
401
+ f0 = cache_harvest_f0(input_audio_path, self.sr, f0_max, f0_min, 10)
402
+ if filter_radius > 2:
403
+ f0 = signal.medfilt(f0, 3)
404
+ elif method == "dio":
405
+ f0, t = pyworld.dio(
406
+ x.astype(np.double),
407
+ fs=self.sr,
408
+ f0_ceil=f0_max,
409
+ f0_floor=f0_min,
410
+ frame_period=10,
411
+ )
412
+ f0 = pyworld.stonemask(x.astype(np.double), f0, t, self.sr)
413
+ f0 = signal.medfilt(f0, 3)
414
+ f0 = f0[1:]
415
+ elif method == "rmvpe":
416
+ f0 = self.get_rmvpe(x)
417
+ f0 = f0[1:]
418
+ elif method == "fcpe_legacy":
419
+ f0 = self.get_fcpe(x, f0_min=f0_min, f0_max=f0_max, p_len=p_len)
420
+ elif method == "fcpe":
421
+ f0 = self.get_torchfcpe(x, self.sr, f0_min, f0_max, p_len)
422
+ elif method == "pyin":
423
+ f0 = self.get_f0_pyin_computation(input_audio_path, f0_min, f0_max)
424
+ # Push method to the stack
425
+ f0_computation_stack.append(f0)
426
+
427
+ for fc in f0_computation_stack:
428
+ print(len(fc))
429
+
430
+ print("Calculating hybrid median f0 from the stack of: %s" % str(methods))
431
+ f0_median_hybrid = None
432
+ if len(f0_computation_stack) == 1:
433
+ f0_median_hybrid = f0_computation_stack[0]
434
+ else:
435
+ f0_median_hybrid = np.nanmedian(f0_computation_stack, axis=0)
436
+ return f0_median_hybrid
437
+
438
+ def get_f0(
439
+ self,
440
+ input_audio_path,
441
+ x,
442
+ p_len,
443
+ f0_up_key,
444
+ f0_method,
445
+ filter_radius,
446
+ crepe_hop_length,
447
+ f0_autotune,
448
+ inp_f0=None,
449
+ f0_min=50,
450
+ f0_max=1100,
451
+ ):
452
+ global input_audio_path2wav
453
+ time_step = self.window / self.sr * 1000
454
+ f0_min = f0_min
455
+ f0_max = f0_max
456
+ f0_mel_min = 1127 * np.log(1 + f0_min / 700)
457
+ f0_mel_max = 1127 * np.log(1 + f0_max / 700)
458
+
459
+ if f0_method == "pm":
460
+ f0 = (
461
+ parselmouth.Sound(x, self.sr)
462
+ .to_pitch_ac(
463
+ time_step=time_step / 1000,
464
+ voicing_threshold=0.6,
465
+ pitch_floor=f0_min,
466
+ pitch_ceiling=f0_max,
467
+ )
468
+ .selected_array["frequency"]
469
+ )
470
+ pad_size = (p_len - len(f0) + 1) // 2
471
+ if pad_size > 0 or p_len - len(f0) - pad_size > 0:
472
+ f0 = np.pad(
473
+ f0, [[pad_size, p_len - len(f0) - pad_size]], mode="constant"
474
+ )
475
+ elif f0_method == "harvest":
476
+ input_audio_path2wav[input_audio_path] = x.astype(np.double)
477
+ f0 = cache_harvest_f0(input_audio_path, self.sr, f0_max, f0_min, 10)
478
+ if filter_radius > 2:
479
+ f0 = signal.medfilt(f0, 3)
480
+ elif f0_method == "dio": # Potentially Buggy?
481
+ f0, t = pyworld.dio(
482
+ x.astype(np.double),
483
+ fs=self.sr,
484
+ f0_ceil=f0_max,
485
+ f0_floor=f0_min,
486
+ frame_period=10,
487
+ )
488
+ f0 = pyworld.stonemask(x.astype(np.double), f0, t, self.sr)
489
+ f0 = signal.medfilt(f0, 3)
490
+ elif f0_method == "crepe":
491
+ model = "full"
492
+ # Pick a batch size that doesn't cause memory errors on your gpu
493
+ batch_size = 512
494
+ # Compute pitch using first gpu
495
+ audio = torch.tensor(np.copy(x))[None].float()
496
+ f0, pd = torchcrepe.predict(
497
+ audio,
498
+ self.sr,
499
+ self.window,
500
+ f0_min,
501
+ f0_max,
502
+ model,
503
+ batch_size=batch_size,
504
+ device=self.device,
505
+ return_periodicity=True,
506
+ )
507
+ pd = torchcrepe.filter.median(pd, 3)
508
+ f0 = torchcrepe.filter.mean(f0, 3)
509
+ f0[pd < 0.1] = 0
510
+ f0 = f0[0].cpu().numpy()
511
+ elif f0_method == "crepe-tiny":
512
+ f0 = self.get_f0_official_crepe_computation(x, f0_min, f0_max, model="tiny")
513
+ elif f0_method == "mangio-crepe":
514
+ f0 = self.get_f0_crepe_computation(
515
+ x, f0_min, f0_max, p_len, crepe_hop_length=crepe_hop_length
516
+ )
517
+ elif f0_method == "mangio-crepe-tiny":
518
+ f0 = self.get_f0_crepe_computation(
519
+ x,
520
+ f0_min,
521
+ f0_max,
522
+ p_len,
523
+ crepe_hop_length=crepe_hop_length,
524
+ model="tiny",
525
+ )
526
+ elif f0_method == "rmvpe":
527
+ if not hasattr(self, "model_rmvpe"):
528
+ from lib.infer.infer_libs.rmvpe import RMVPE
529
+
530
+ logger.info(f"Loading rmvpe model, {os.environ['rmvpe_model_path']}")
531
+ self.model_rmvpe = RMVPE(
532
+ os.environ["rmvpe_model_path"],
533
+ is_half=self.is_half,
534
+ device=self.device,
535
+ )
536
+ f0 = self.model_rmvpe.infer_from_audio(x, thred=0.03)
537
+
538
+ if "privateuseone" in str(self.device): # clean ortruntime memory
539
+ del self.model_rmvpe.model
540
+ del self.model_rmvpe
541
+ logger.info("Cleaning ortruntime memory")
542
+ elif f0_method == "rmvpe+":
543
+ params = {
544
+ "x": x,
545
+ "p_len": p_len,
546
+ "f0_up_key": f0_up_key,
547
+ "f0_min": f0_min,
548
+ "f0_max": f0_max,
549
+ "time_step": time_step,
550
+ "filter_radius": filter_radius,
551
+ "crepe_hop_length": crepe_hop_length,
552
+ "model": "full",
553
+ }
554
+ f0 = self.get_pitch_dependant_rmvpe(**params)
555
+ elif f0_method == "pyin":
556
+ f0 = self.get_f0_pyin_computation(input_audio_path, f0_min, f0_max)
557
+ elif f0_method == "fcpe_legacy":
558
+ f0 = self.get_fcpe(x, f0_min=f0_min, f0_max=f0_max, p_len=p_len)
559
+ elif f0_method == "fcpe":
560
+ f0 = self.get_torchfcpe(x, self.sr, f0_min, f0_max, p_len)
561
+ elif "hybrid" in f0_method:
562
+ # Perform hybrid median pitch estimation
563
+ input_audio_path2wav[input_audio_path] = x.astype(np.double)
564
+ f0 = self.get_f0_hybrid_computation(
565
+ f0_method,
566
+ input_audio_path,
567
+ x,
568
+ f0_min,
569
+ f0_max,
570
+ p_len,
571
+ filter_radius,
572
+ crepe_hop_length,
573
+ time_step,
574
+ )
575
+ # print("Autotune:", f0_autotune)
576
+ if f0_autotune == True:
577
+ print("Autotune:", f0_autotune)
578
+ f0 = self.autotune_f0(f0)
579
+
580
+ f0 *= pow(2, f0_up_key / 12)
581
+ # with open("test.txt","w")as f:f.write("\n".join([str(i)for i in f0.tolist()]))
582
+ tf0 = self.sr // self.window  # f0 points per second
583
+ if inp_f0 is not None:
584
+ delta_t = np.round(
585
+ (inp_f0[:, 0].max() - inp_f0[:, 0].min()) * tf0 + 1
586
+ ).astype("int16")
587
+ replace_f0 = np.interp(
588
+ list(range(delta_t)), inp_f0[:, 0] * 100, inp_f0[:, 1]
589
+ )
590
+ shape = f0[self.x_pad * tf0 : self.x_pad * tf0 + len(replace_f0)].shape[0]
591
+ f0[self.x_pad * tf0 : self.x_pad * tf0 + len(replace_f0)] = replace_f0[
592
+ :shape
593
+ ]
594
+ # with open("test_opt.txt","w")as f:f.write("\n".join([str(i)for i in f0.tolist()]))
595
+ f0bak = f0.copy()
596
+ f0_mel = 1127 * np.log(1 + f0 / 700)
597
+ f0_mel[f0_mel > 0] = (f0_mel[f0_mel > 0] - f0_mel_min) * 254 / (
598
+ f0_mel_max - f0_mel_min
599
+ ) + 1
600
+ f0_mel[f0_mel <= 1] = 1
601
+ f0_mel[f0_mel > 255] = 255
602
+ f0_coarse = np.rint(f0_mel).astype(np.int32)
603
+ return f0_coarse, f0bak # 1-0
604
+
605
+ def vc(
606
+ self,
607
+ model,
608
+ net_g,
609
+ sid,
610
+ audio0,
611
+ pitch,
612
+ pitchf,
613
+ times,
614
+ index,
615
+ big_npy,
616
+ index_rate,
617
+ version,
618
+ protect,
619
+ ): # ,file_index,file_big_npy
620
+ feats = torch.from_numpy(audio0)
621
+ if self.is_half:
622
+ feats = feats.half()
623
+ else:
624
+ feats = feats.float()
625
+ if feats.dim() == 2: # double channels
626
+ feats = feats.mean(-1)
627
+ assert feats.dim() == 1, feats.dim()
628
+ feats = feats.view(1, -1)
629
+ padding_mask = torch.BoolTensor(feats.shape).to(self.device).fill_(False)
630
+
631
+ inputs = {
632
+ "source": feats.to(self.device),
633
+ "padding_mask": padding_mask,
634
+ "output_layer": 9 if version == "v1" else 12,
635
+ }
636
+ t0 = ttime()
637
+ with torch.no_grad():
638
+ logits = model.extract_features(**inputs)
639
+ feats = model.final_proj(logits[0]) if version == "v1" else logits[0]
640
+ if protect < 0.5 and pitch is not None and pitchf is not None:
641
+ feats0 = feats.clone()
642
+ if (
643
+ not isinstance(index, type(None))
644
+ and not isinstance(big_npy, type(None))
645
+ and index_rate != 0
646
+ ):
647
+ npy = feats[0].cpu().numpy()
648
+ if self.is_half:
649
+ npy = npy.astype("float32")
650
+
651
+ # _, I = index.search(npy, 1)
652
+ # npy = big_npy[I.squeeze()]
653
+
654
+ score, ix = index.search(npy, k=8)
655
+ weight = np.square(1 / score)
656
+ weight /= weight.sum(axis=1, keepdims=True)
657
+ npy = np.sum(big_npy[ix] * np.expand_dims(weight, axis=2), axis=1)
658
+
659
+ if self.is_half:
660
+ npy = npy.astype("float16")
661
+ feats = (
662
+ torch.from_numpy(npy).unsqueeze(0).to(self.device) * index_rate
663
+ + (1 - index_rate) * feats
664
+ )
665
+
666
+ feats = F.interpolate(feats.permute(0, 2, 1), scale_factor=2).permute(0, 2, 1)
667
+ if protect < 0.5 and pitch is not None and pitchf is not None:
668
+ feats0 = F.interpolate(feats0.permute(0, 2, 1), scale_factor=2).permute(
669
+ 0, 2, 1
670
+ )
671
+ t1 = ttime()
672
+ p_len = audio0.shape[0] // self.window
673
+ if feats.shape[1] < p_len:
674
+ p_len = feats.shape[1]
675
+ if pitch is not None and pitchf is not None:
676
+ pitch = pitch[:, :p_len]
677
+ pitchf = pitchf[:, :p_len]
678
+
679
+ if protect < 0.5 and pitch is not None and pitchf is not None:
680
+ pitchff = pitchf.clone()
681
+ pitchff[pitchf > 0] = 1
682
+ pitchff[pitchf < 1] = protect
683
+ pitchff = pitchff.unsqueeze(-1)
684
+ feats = feats * pitchff + feats0 * (1 - pitchff)
685
+ feats = feats.to(feats0.dtype)
686
+ p_len = torch.tensor([p_len], device=self.device).long()
687
+ with torch.no_grad():
688
+ hasp = pitch is not None and pitchf is not None
689
+ arg = (feats, p_len, pitch, pitchf, sid) if hasp else (feats, p_len, sid)
690
+ audio1 = (net_g.infer(*arg)[0][0, 0]).data.cpu().float().numpy()
691
+ del hasp, arg
692
+ del feats, p_len, padding_mask
693
+ if torch.cuda.is_available():
694
+ torch.cuda.empty_cache()
695
+ t2 = ttime()
696
+ times[0] += t1 - t0
697
+ times[2] += t2 - t1
698
+ return audio1
699
+
700
+ def process_t(
701
+ self,
702
+ t,
703
+ s,
704
+ window,
705
+ audio_pad,
706
+ pitch,
707
+ pitchf,
708
+ times,
709
+ index,
710
+ big_npy,
711
+ index_rate,
712
+ version,
713
+ protect,
714
+ t_pad_tgt,
715
+ if_f0,
716
+ sid,
717
+ model,
718
+ net_g,
719
+ ):
720
+ t = t // window * window
721
+ if if_f0 == 1:
722
+ return self.vc(
723
+ model,
724
+ net_g,
725
+ sid,
726
+ audio_pad[s : t + t_pad_tgt + window],
727
+ pitch[:, s // window : (t + t_pad_tgt) // window],
728
+ pitchf[:, s // window : (t + t_pad_tgt) // window],
729
+ times,
730
+ index,
731
+ big_npy,
732
+ index_rate,
733
+ version,
734
+ protect,
735
+ )[t_pad_tgt:-t_pad_tgt]
736
+ else:
737
+ return self.vc(
738
+ model,
739
+ net_g,
740
+ sid,
741
+ audio_pad[s : t + t_pad_tgt + window],
742
+ None,
743
+ None,
744
+ times,
745
+ index,
746
+ big_npy,
747
+ index_rate,
748
+ version,
749
+ protect,
750
+ )[t_pad_tgt:-t_pad_tgt]
751
+
752
+ def pipeline(
753
+ self,
754
+ model,
755
+ net_g,
756
+ sid,
757
+ audio,
758
+ input_audio_path,
759
+ times,
760
+ f0_up_key,
761
+ f0_method,
762
+ file_index,
763
+ index_rate,
764
+ if_f0,
765
+ filter_radius,
766
+ tgt_sr,
767
+ resample_sr,
768
+ rms_mix_rate,
769
+ version,
770
+ protect,
771
+ crepe_hop_length,
772
+ f0_autotune,
773
+ f0_min=50,
774
+ f0_max=1100,
775
+ ):
776
+ if (
777
+ file_index != ""
778
+ and isinstance(file_index, str)
779
+ # and file_big_npy != ""
780
+ # and os.path.exists(file_big_npy) == True
781
+ and os.path.exists(file_index)
782
+ and index_rate != 0
783
+ ):
784
+ try:
785
+ index = faiss.read_index(file_index)
786
+ # big_npy = np.load(file_big_npy)
787
+ big_npy = index.reconstruct_n(0, index.ntotal)
788
+ except:
789
+ traceback.print_exc()
790
+ index = big_npy = None
791
+ else:
792
+ index = big_npy = None
793
+ audio = signal.filtfilt(bh, ah, audio)
794
+ audio_pad = np.pad(audio, (self.window // 2, self.window // 2), mode="reflect")
795
+ opt_ts = []
796
+ if audio_pad.shape[0] > self.t_max:
797
+ audio_sum = np.zeros_like(audio)
798
+ for i in range(self.window):
799
+ audio_sum += audio_pad[i : i - self.window]
800
+ for t in range(self.t_center, audio.shape[0], self.t_center):
801
+ opt_ts.append(
802
+ t
803
+ - self.t_query
804
+ + np.where(
805
+ np.abs(audio_sum[t - self.t_query : t + self.t_query])
806
+ == np.abs(audio_sum[t - self.t_query : t + self.t_query]).min()
807
+ )[0][0]
808
+ )
809
+ s = 0
810
+ audio_opt = []
811
+ t = None
812
+ t1 = ttime()
813
+ audio_pad = np.pad(audio, (self.t_pad, self.t_pad), mode="reflect")
814
+ p_len = audio_pad.shape[0] // self.window
815
+ inp_f0 = None
816
+
817
+ sid = torch.tensor(sid, device=self.device).unsqueeze(0).long()
818
+ pitch, pitchf = None, None
819
+ if if_f0:
820
+ pitch, pitchf = self.get_f0(
821
+ input_audio_path,
822
+ audio_pad,
823
+ p_len,
824
+ f0_up_key,
825
+ f0_method,
826
+ filter_radius,
827
+ crepe_hop_length,
828
+ f0_autotune,
829
+ inp_f0,
830
+ f0_min,
831
+ f0_max,
832
+ )
833
+ pitch = pitch[:p_len]
834
+ pitchf = pitchf[:p_len]
835
+ if "mps" not in str(self.device) or "xpu" not in str(self.device):
836
+ pitchf = pitchf.astype(np.float32)
837
+ pitch = torch.tensor(pitch, device=self.device).unsqueeze(0).long()
838
+ pitchf = torch.tensor(pitchf, device=self.device).unsqueeze(0).float()
839
+ t2 = ttime()
840
+ times[1] += t2 - t1
841
+
842
+ with tqdm(total=len(opt_ts), desc="Processing", unit="window") as pbar:
843
+ for i, t in enumerate(opt_ts):
844
+ t = t // self.window * self.window
845
+ start = s
846
+ end = t + self.t_pad2 + self.window
847
+ audio_slice = audio_pad[start:end]
848
+ pitch_slice = (
849
+ pitch[:, start // self.window : end // self.window]
850
+ if if_f0
851
+ else None
852
+ )
853
+ pitchf_slice = (
854
+ pitchf[:, start // self.window : end // self.window]
855
+ if if_f0
856
+ else None
857
+ )
858
+ audio_opt.append(
859
+ self.vc(
860
+ model,
861
+ net_g,
862
+ sid,
863
+ audio_slice,
864
+ pitch_slice,
865
+ pitchf_slice,
866
+ times,
867
+ index,
868
+ big_npy,
869
+ index_rate,
870
+ version,
871
+ protect,
872
+ )[self.t_pad_tgt : -self.t_pad_tgt]
873
+ )
874
+ s = t
875
+ pbar.update(1)
876
+ pbar.refresh()
877
+
878
+ audio_slice = audio_pad[t:]
879
+ pitch_slice = pitch[:, t // self.window :] if if_f0 and t is not None else pitch
880
+ pitchf_slice = (
881
+ pitchf[:, t // self.window :] if if_f0 and t is not None else pitchf
882
+ )
883
+ audio_opt.append(
884
+ self.vc(
885
+ model,
886
+ net_g,
887
+ sid,
888
+ audio_slice,
889
+ pitch_slice,
890
+ pitchf_slice,
891
+ times,
892
+ index,
893
+ big_npy,
894
+ index_rate,
895
+ version,
896
+ protect,
897
+ )[self.t_pad_tgt : -self.t_pad_tgt]
898
+ )
899
+
900
+ audio_opt = np.concatenate(audio_opt)
901
+ if rms_mix_rate != 1:
902
+ audio_opt = change_rms(audio, 16000, audio_opt, tgt_sr, rms_mix_rate)
903
+ if tgt_sr != resample_sr >= 16000:
904
+ audio_opt = librosa.resample(
905
+ audio_opt, orig_sr=tgt_sr, target_sr=resample_sr
906
+ )
907
+ audio_max = np.abs(audio_opt).max() / 0.99
908
+ max_int16 = 32768
909
+ if audio_max > 1:
910
+ max_int16 /= audio_max
911
+ audio_opt = (audio_opt * max_int16).astype(np.int16)
912
+ del pitch, pitchf, sid
913
+ if torch.cuda.is_available():
914
+ torch.cuda.empty_cache()
915
+
916
+ print("Returning completed audio...")
917
+ return audio_opt
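
The hybrid f0 path in `get_f0_hybrid_computation` above expects the method string in the form `hybrid[method1+method2+...]`; a small standalone sketch of the parsing step it applies (the method names are just examples):

```python
import re

methods_str = "hybrid[rmvpe+fcpe]"  # example; any of the supported method names can appear
match = re.search(r"hybrid\[(.+)\]", methods_str)
if match:
    methods = [m.strip() for m in match.group(1).split("+")]
    print(methods)  # ['rmvpe', 'fcpe']
```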
rvc_inferpy/split_audio.py ADDED
@@ -0,0 +1,142 @@
1
+ import os
2
+ from pydub import AudioSegment
3
+ from pydub.silence import detect_silence, detect_nonsilent
4
+
5
+ SEPERATE_DIR = os.path.join(os.getcwd(), "seperate")
6
+ TEMP_DIR = os.path.join(SEPERATE_DIR, "temp")
7
+ cache = {}
8
+
9
+ os.makedirs(SEPERATE_DIR, exist_ok=True)
10
+ os.makedirs(TEMP_DIR, exist_ok=True)
11
+
12
+
13
+ def cache_result(func):
14
+ def wrapper(*args, **kwargs):
15
+ key = (args, frozenset(kwargs.items()))
16
+ if key in cache:
17
+ return cache[key]
18
+ else:
19
+ result = func(*args, **kwargs)
20
+ cache[key] = result
21
+ return result
22
+
23
+ return wrapper
24
+
25
+
26
+ def get_non_silent(
27
+ audio_name, audio, min_silence, silence_thresh, seek_step, keep_silence
28
+ ):
29
+ """
30
+ Function to get non-silent parts of the audio.
31
+ """
32
+ nonsilent_ranges = detect_nonsilent(
33
+ audio,
34
+ min_silence_len=min_silence,
35
+ silence_thresh=silence_thresh,
36
+ seek_step=seek_step,
37
+ )
38
+ nonsilent_files = []
39
+ for index, range in enumerate(nonsilent_ranges):
40
+ nonsilent_name = os.path.join(
41
+ SEPERATE_DIR,
42
+ f"{audio_name}_min{min_silence}_t{silence_thresh}_ss{seek_step}_ks{keep_silence}",
43
+ f"nonsilent{index}-{audio_name}.wav",
44
+ )
45
+ start, end = range[0] - keep_silence, range[1] + keep_silence
46
+ audio[start:end].export(nonsilent_name, format="wav")
47
+ nonsilent_files.append(nonsilent_name)
48
+ return nonsilent_files
49
+
50
+
51
+ def get_silence(
52
+ audio_name, audio, min_silence, silence_thresh, seek_step, keep_silence
53
+ ):
54
+ """
55
+ Function to get silent parts of the audio.
56
+ """
57
+ silence_ranges = detect_silence(
58
+ audio,
59
+ min_silence_len=min_silence,
60
+ silence_thresh=silence_thresh,
61
+ seek_step=seek_step,
62
+ )
63
+ silence_files = []
64
+ for index, range in enumerate(silence_ranges):
65
+ silence_name = os.path.join(
66
+ SEPERATE_DIR,
67
+ f"{audio_name}_min{min_silence}_t{silence_thresh}_ss{seek_step}_ks{keep_silence}",
68
+ f"silence{index}-{audio_name}.wav",
69
+ )
70
+ start, end = range[0] + keep_silence, range[1] - keep_silence
71
+ audio[start:end].export(silence_name, format="wav")
72
+ silence_files.append(silence_name)
73
+ return silence_files
74
+
75
+
76
+ @cache_result
77
+ def split_silence_nonsilent(
78
+ input_path, min_silence=500, silence_thresh=-40, seek_step=1, keep_silence=100
79
+ ):
80
+ """
81
+ Function to split the audio into silent and non-silent parts.
82
+ """
83
+ audio_name = os.path.splitext(os.path.basename(input_path))[0]
84
+ os.makedirs(
85
+ os.path.join(
86
+ SEPERATE_DIR,
87
+ f"{audio_name}_min{min_silence}_t{silence_thresh}_ss{seek_step}_ks{keep_silence}",
88
+ ),
89
+ exist_ok=True,
90
+ )
91
+ audio = (
92
+ AudioSegment.silent(duration=1000)
93
+ + AudioSegment.from_file(input_path)
94
+ + AudioSegment.silent(duration=1000)
95
+ )
96
+ silence_files = get_silence(
97
+ audio_name, audio, min_silence, silence_thresh, seek_step, keep_silence
98
+ )
99
+ nonsilent_files = get_non_silent(
100
+ audio_name, audio, min_silence, silence_thresh, seek_step, keep_silence
101
+ )
102
+ return silence_files, nonsilent_files
103
+
104
+
105
+ def adjust_audio_lengths(original_audios, inferred_audios):
106
+ """
107
+ Function to adjust the lengths of the inferred audio files list to match the original audio files length.
108
+ """
109
+ adjusted_audios = []
110
+ for original_audio, inferred_audio in zip(original_audios, inferred_audios):
111
+ audio_1 = AudioSegment.from_file(original_audio)
112
+ audio_2 = AudioSegment.from_file(inferred_audio)
113
+
114
+ if len(audio_1) > len(audio_2):
115
+ audio_2 += AudioSegment.silent(duration=len(audio_1) - len(audio_2))
116
+ else:
117
+ audio_2 = audio_2[: len(audio_1)]
118
+
119
+ adjusted_file = os.path.join(
120
+ TEMP_DIR, f"adjusted-{os.path.basename(inferred_audio)}"
121
+ )
122
+ audio_2.export(adjusted_file, format="wav")
123
+ adjusted_audios.append(adjusted_file)
124
+
125
+ return adjusted_audios
126
+
127
+
128
+ def combine_silence_nonsilent(silence_files, nonsilent_files, keep_silence, output):
129
+ """
130
+ Function to combine the silent and non-silent parts of the audio.
131
+ """
132
+ combined = AudioSegment.empty()
133
+ for silence, nonsilent in zip(silence_files, nonsilent_files):
134
+ combined += AudioSegment.from_wav(silence) + AudioSegment.from_wav(nonsilent)
135
+ combined += AudioSegment.from_wav(silence_files[-1])
136
+ combined = (
137
+ AudioSegment.silent(duration=keep_silence)
138
+ + combined[1000:-1000]
139
+ + AudioSegment.silent(duration=keep_silence)
140
+ )
141
+ combined.export(output, format="wav")
142
+ return output
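
A minimal sketch of how the helpers in split_audio.py fit together: split an input file into silent and non-silent chunks, then stitch them back in order. In real use the non-silent chunks would be voice-converted and passed through adjust_audio_lengths() before recombining; "input.wav" here is a placeholder path.

```python
from rvc_inferpy.split_audio import split_silence_nonsilent, combine_silence_nonsilent

# Placeholder input; requires audio readable by pydub/ffmpeg.
silence_files, nonsilent_files = split_silence_nonsilent(
    "input.wav", min_silence=500, silence_thresh=-40, seek_step=1, keep_silence=100
)

# Re-joining the untouched chunks roughly reproduces the original audio.
out = combine_silence_nonsilent(
    silence_files, nonsilent_files, keep_silence=100, output="rejoined.wav"
)
print(out)  # path to the rejoined file
```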