Spaces: sbapan41/Quantum_Dubbing

sbapan41 committed 15 days ago

Commit 924e8cc (verified) · 1 parent: 09b7c9f

Upload 11 files

Files changed (11)

LICENSE +201 -0
README.md +11 -8
SoniTranslate_Colab.ipynb +124 -0
app.py +2 -0
app_rvc.py +0 -0
packages.txt +3 -0
pre-requirements.txt +17 -0
requirements.txt +19 -0
requirements_xtts.txt +58 -0
vci_pipeline.py +454 -0
voice_main.py +732 -0

LICENSE ADDED

	@@ -0,0 +1,201 @@

+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+   1. Definitions.
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+   END OF TERMS AND CONDITIONS
+   APPENDIX: How to apply the Apache License to your work.
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "[]"
+      replaced with your own identifying information. (Don't include
+      the brackets!)  The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+   Copyright [yyyy] [name of copyright owner]
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+       http://www.apache.org/licenses/LICENSE-2.0
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.

README.md CHANGED

@@ -1,13 +1,16 @@
 ---
-title: Quantum Dubbing
-emoji: 🏆
-colorFrom: gray
-colorTo: blue
+title: Quantum_Dubbing (Quantum_Dubbing)
+emoji: 🌍
+colorFrom: blue
+colorTo: green
 sdk: gradio
-sdk_version: 5.23.1
-app_file: app.py
-pinned: false
-license: apache-2.0
+sdk_version: 4.31.3
+app_file: app_rvc.py
+pinned: true
+license: mit
+short_description: Video Dubbing with Open Source Projects
+preload_from_hub:
+  - Systran/faster-whisper-large-v3
 ---
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

SoniTranslate_Colab.ipynb ADDED

	@@ -0,0 +1,124 @@

+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": [],
+      "gpuType": "T4",
+      "include_colab_link": true
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    },
+    "accelerator": "GPU"
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "view-in-github",
+        "colab_type": "text"
+      },
+      "source": [
+        "<a href=\"https://colab.research.google.com/github/R3gm/SoniTranslate/blob/main/SoniTranslate_Colab.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# SoniTranslate\n",
+        "\n",
+        "| Description | Link |\n",
+        "| ----------- | ---- |\n",
+        "| 🎉 Repository | [![GitHub Repository](https://img.shields.io/badge/GitHub-Repository-black?style=flat-square&logo=github)](https://github.com/R3gm/SoniTranslate/) |\n",
+        "| 🚀 Online Demo in HF | [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/r3gm/SoniTranslate_translate_audio_of_a_video_content) |\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "8lw0EgLex-YZ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "LUgwm0rfx0_J",
+        "cellView": "form"
+      },
+      "outputs": [],
+      "source": [
+        "# @title Install requirements for SoniTranslate\n",
+        "!git clone https://github.com/r3gm/SoniTranslate.git\n",
+        "%cd SoniTranslate\n",
+        "\n",
+        "!apt install git-lfs\n",
+        "!git lfs install\n",
+        "\n",
+        "!sed -i 's|git+https://github.com/R3gm/whisperX.git@cuda_11_8|git+https://github.com/R3gm/whisperX.git@cuda_12_x|' requirements_base.txt\n",
+        "!pip install -q -r requirements_base.txt\n",
+        "!pip install -q -r requirements_extra.txt\n",
+        "!pip install -q ort-nightly-gpu --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ort-cuda-12-nightly/pypi/simple/\n",
+        "\n",
+        "Install_PIPER_TTS = True # @param {type:\"boolean\"}\n",
+        "\n",
+        "if Install_PIPER_TTS:\n",
+        "    !pip install -q piper-tts==1.2.0\n",
+        "\n",
+        "Install_Coqui_XTTS = True # @param {type:\"boolean\"}\n",
+        "\n",
+        "if Install_Coqui_XTTS:\n",
+        "    !pip install -q -r requirements_xtts.txt\n",
+        "    !pip install -q TTS==0.21.1  --no-deps"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "One important step is to accept the license agreement for using Pyannote. You need to have an account on Hugging Face and `accept the license to use the models`: https://huggingface.co/pyannote/speaker-diarization and https://huggingface.co/pyannote/segmentation\n",
+        "\n",
+        "\n",
+        "\n",
+        "\n",
+        "Get your KEY TOKEN here: https://hf.co/settings/tokens"
+      ],
+      "metadata": {
+        "id": "LTaTstXPXNg2"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "#@markdown # `RUN THE WEB APP`\n",
+        "YOUR_HF_TOKEN = \"\" #@param {type:'string'}\n",
+        "%env YOUR_HF_TOKEN={YOUR_HF_TOKEN}\n",
+        "theme = \"Taithrah/Minimal\" # @param [\"Taithrah/Minimal\", \"aliabid94/new-theme\", \"gstaff/xkcd\", \"ParityError/LimeFace\", \"abidlabs/pakistan\", \"rottenlittlecreature/Moon_Goblin\", \"ysharma/llamas\", \"gradio/dracula_revamped\"]\n",
+        "interface_language = \"english\" # @param ['arabic', 'azerbaijani', 'chinese_zh_cn', 'english', 'french', 'german', 'hindi', 'indonesian', 'italian', 'japanese', 'korean', 'marathi', 'polish', 'portuguese', 'russian', 'spanish', 'swedish', 'turkish', 'ukrainian', 'vietnamese']\n",
+        "verbosity_level = \"info\" # @param [\"debug\", \"info\", \"warning\", \"error\", \"critical\"]\n",
+        "\n",
+        "\n",
+        "%cd /content/SoniTranslate\n",
+        "!python app_rvc.py --theme {theme} --verbosity_level {verbosity_level} --language {interface_language} --public_url"
+      ],
+      "metadata": {
+        "id": "XkhXfaFw4R4J",
+        "cellView": "form"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Open the `public URL` when it appears"
+      ],
+      "metadata": {
+        "id": "KJW3KrhZJh0u"
+      }
+    }
+  ]
+}

app.py ADDED

	@@ -0,0 +1,2 @@


+import os
+os.system("python app_rvc.py --language french --theme aliabid94/new-theme")
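
app.py hands the launch command to `os.system` as one shell string, so a failing child process goes unnoticed unless the return value is checked. A minimal sketch of an equivalent launcher built on `subprocess.run` with an argument list is shown here. The helper names `build_command` and `launch` are hypothetical and not part of this commit, and using `sys.executable` instead of the literal `python` is an assumption about the desired interpreter.

```python
import subprocess
import sys


def build_command(language="french", theme="aliabid94/new-theme"):
    """Return the argv list for the app_rvc.py launch (hypothetical helper)."""
    # sys.executable keeps the child process on the parent's interpreter;
    # the committed app.py uses the literal "python" instead.
    return [sys.executable, "app_rvc.py", "--language", language, "--theme", theme]


def launch():
    """Run app_rvc.py and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(), check=True)
```

Passing a list avoids shell quoting issues if a theme or language name ever contains spaces, and `check=True` surfaces a crash of app_rvc.py as an exception.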

app_rvc.py ADDED

The diff for this file is too large to render. See raw diff

packages.txt ADDED

	@@ -0,0 +1,3 @@

+git-lfs
+aria2 -y
+ffmpeg

pre-requirements.txt ADDED

	@@ -0,0 +1,17 @@

+pip==23.1.2
+--extra-index-url https://download.pytorch.org/whl/cu121
+torch==2.2.0  # +cu121
+torchvision  # <=0.17.0+cu121
+torchaudio  # <=2.2.0+cu121
+ctranslate2<=4.4.0
+yt-dlp
+gradio==4.19.2
+pydub==0.25.1
+edge_tts==6.1.7
+deep_translator==1.11.4
+git+https://github.com/R3gm/[email protected]
+git+https://github.com/R3gm/whisperX.git@cuda_12_x
+nest_asyncio
+gTTS
+gradio_client==0.10.1
+IPython

requirements.txt ADDED

	@@ -0,0 +1,19 @@

+praat-parselmouth>=0.4.3
+pyworld==0.3.2
+faiss-cpu==1.7.3
+torchcrepe==0.0.20
+ffmpeg-python>=0.2.0
+fairseq==0.12.2
+gdown
+rarfile
+transformers
+accelerate
+optimum
+sentencepiece
+srt
+git+https://github.com/R3gm/openvoice_package.git@lite
+openai==1.14.3
+tiktoken==0.6.0
+# Documents
+pypdf==4.2.0
+python-docx

requirements_xtts.txt ADDED

	@@ -0,0 +1,58 @@

+# core deps
+numpy==1.23.5
+cython>=0.29.30
+scipy>=1.11.2
+torch
+torchaudio
+soundfile
+librosa
+scikit-learn
+numba
+inflect>=5.6.0
+tqdm>=4.64.1
+anyascii>=0.3.0
+pyyaml>=6.0
+fsspec>=2023.6.0 # <= 2023.9.1 makes aux tests fail
+aiohttp>=3.8.1
+packaging>=23.1
+# deps for examples
+flask>=2.0.1
+# deps for inference
+pysbd>=0.3.4
+# deps for notebooks
+umap-learn>=0.5.1
+pandas
+# deps for training
+matplotlib
+# coqui stack
+trainer>=0.0.32
+# config management
+coqpit>=0.0.16
+# chinese g2p deps
+jieba
+pypinyin
+# korean
+hangul_romanize
+# gruut+supported langs
+gruut[de,es,fr]==2.2.3
+# deps for korean
+jamo
+nltk
+g2pkk>=0.1.1
+# deps for bangla
+bangla
+bnnumerizer
+bnunicodenormalizer
+#deps for tortoise
+einops>=0.6.0
+transformers
+#deps for bark
+encodec>=0.1.1
+# deps for XTTS
+unidecode>=1.3.2
+num2words
+spacy[ja]>=3
+# after this
+# pip install -r requirements_xtts.txt
+# pip install TTS==0.21.1  --no-deps

vci_pipeline.py ADDED

	@@ -0,0 +1,454 @@

+import numpy as np, parselmouth, torch, pdb, sys
+from time import time as ttime
+import torch.nn.functional as F
+import scipy.signal as signal
+import pyworld, os, traceback, faiss, librosa, torchcrepe
+from scipy import signal
+from functools import lru_cache
+from quantum_dubbing.logging_setup import logger
+now_dir = os.getcwd()
+sys.path.append(now_dir)
+bh, ah = signal.butter(N=5, Wn=48, btype="high", fs=16000)
+input_audio_path2wav = {}
+@lru_cache
+def cache_harvest_f0(input_audio_path, fs, f0max, f0min, frame_period):
+    audio = input_audio_path2wav[input_audio_path]
+    f0, t = pyworld.harvest(
+        audio,
+        fs=fs,
+        f0_ceil=f0max,
+        f0_floor=f0min,
+        frame_period=frame_period,
+    )
+    f0 = pyworld.stonemask(audio, f0, t, fs)
+    return f0
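
Note on `cache_harvest_f0`: the function is memoized on the path and the numeric parameters, while the waveform itself is fetched from the module-level `input_audio_path2wav` dict (NumPy arrays are not hashable, so they cannot be cache keys). One consequence is that a second call with the same path returns the cached result even if the stored audio has changed. The sketch below illustrates that behavior with stand-in names (`audio_by_path`, `cached_feature`) and a trivial computation in place of `pyworld.harvest`; it is an illustration under those assumptions, not code from this commit.

```python
from functools import lru_cache

audio_by_path = {}      # stand-in for input_audio_path2wav
calls = {"count": 0}    # counts how often the body actually runs


@lru_cache(maxsize=None)
def cached_feature(path, scale):
    """The cache key is (path, scale); the audio is looked up by path."""
    calls["count"] += 1
    return tuple(sample * scale for sample in audio_by_path[path])


audio_by_path["a.wav"] = [1.0, 2.0]
first = cached_feature("a.wav", 2)

# Replacing the audio under the same path does not invalidate the cache.
audio_by_path["a.wav"] = [5.0, 6.0]
stale = cached_feature("a.wav", 2)

# Explicit invalidation forces a recomputation from the new audio.
cached_feature.cache_clear()
fresh = cached_feature("a.wav", 2)
```

If the same path can be reused for different audio, clearing the cache or adding a content hash to the key are two possible ways to avoid stale results.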
+def change_rms(data1, sr1, data2, sr2, rate):  # 1 is the input audio, 2 is the output audio, rate is the proportion of 2
+    # print(data1.max(),data2.max())
+    rms1 = librosa.feature.rms(
+        y=data1, frame_length=sr1 // 2 * 2, hop_length=sr1 // 2
+    )  # one dot every half second
+    rms2 = librosa.feature.rms(y=data2, frame_length=sr2 // 2 * 2, hop_length=sr2 // 2)
+    rms1 = torch.from_numpy(rms1)
+    rms1 = F.interpolate(
+        rms1.unsqueeze(0), size=data2.shape[0], mode="linear"
+    ).squeeze()
+    rms2 = torch.from_numpy(rms2)
+    rms2 = F.interpolate(
+        rms2.unsqueeze(0), size=data2.shape[0], mode="linear"
+    ).squeeze()
+    rms2 = torch.max(rms2, torch.zeros_like(rms2) + 1e-6)
+    data2 *= (
+        torch.pow(rms1, torch.tensor(1 - rate))
+        * torch.pow(rms2, torch.tensor(rate - 1))
+    ).numpy()
+    return data2
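
Note on `change_rms`: the output is multiplied by `rms1 ** (1 - rate) * rms2 ** (rate - 1)`, where `rms1` is the input envelope and `rms2` the output envelope. A scalar sketch of that gain makes the two extremes easy to check. The function name `rms_gain` and the `eps` parameter are hypothetical; the committed code applies the same expression per sample on interpolated envelopes.

```python
def rms_gain(rms_in, rms_out, rate, eps=1e-6):
    """Gain applied to the output: rms_in**(1 - rate) * rms_out**(rate - 1)."""
    rms_out = max(rms_out, eps)  # mirrors the 1e-6 floor applied to rms2
    return rms_in ** (1 - rate) * rms_out ** (rate - 1)


# rate = 1 leaves the output loudness untouched (gain of 1).
unchanged = rms_gain(0.5, 0.25, 1.0)

# rate = 0 rescales the output to the input loudness (gain = rms_in / rms_out).
matched = rms_gain(0.5, 0.25, 0.0)
```

Intermediate values of `rate` interpolate between the two envelopes geometrically rather than linearly.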
+class VC(object):
+    def __init__(self, tgt_sr, config):
+        self.x_pad, self.x_query, self.x_center, self.x_max, self.is_half = (
+            config.x_pad,
+            config.x_query,
+            config.x_center,
+            config.x_max,
+            config.is_half,
+        )
+        self.sr = 16000  # hubert input sampling rate
+        self.window = 160  # points per frame
+        self.t_pad = self.sr * self.x_pad  # Pad time before and after each bar
+        self.t_pad_tgt = tgt_sr * self.x_pad
+        self.t_pad2 = self.t_pad * 2
+        self.t_query = self.sr * self.x_query  # Query time before and after the cut point
+        self.t_center = self.sr * self.x_center  # Query point cut position
+        self.t_max = self.sr * self.x_max  # Query-free duration threshold
+        self.device = config.device
+    def get_f0(
+        self,
+        input_audio_path,
+        x,
+        p_len,
+        f0_up_key,
+        f0_method,
+        filter_radius,
+        inp_f0=None,
+    ):
+        global input_audio_path2wav
+        time_step = self.window / self.sr * 1000
+        f0_min = 50
+        f0_max = 1100
+        f0_mel_min = 1127 * np.log(1 + f0_min / 700)
+        f0_mel_max = 1127 * np.log(1 + f0_max / 700)
+        if f0_method == "pm":
+            f0 = (
+                parselmouth.Sound(x, self.sr)
+                .to_pitch_ac(
+                    time_step=time_step / 1000,
+                    voicing_threshold=0.6,
+                    pitch_floor=f0_min,
+                    pitch_ceiling=f0_max,
+                )
+                .selected_array["frequency"]
+            )
+            pad_size = (p_len - len(f0) + 1) // 2
+            if pad_size > 0 or p_len - len(f0) - pad_size > 0:
+                f0 = np.pad(
+                    f0, [[pad_size, p_len - len(f0) - pad_size]], mode="constant"
+                )
+        elif f0_method == "harvest":
+            input_audio_path2wav[input_audio_path] = x.astype(np.double)
+            f0 = cache_harvest_f0(input_audio_path, self.sr, f0_max, f0_min, 10)
+            if filter_radius > 2:
+                f0 = signal.medfilt(f0, 3)
+        elif f0_method == "crepe":
+            model = "full"
+            # Pick a batch size that doesn't cause memory errors on your gpu
+            batch_size = 512
+            # Compute pitch using first gpu
+            audio = torch.tensor(np.copy(x))[None].float()
+            f0, pd = torchcrepe.predict(
+                audio,
+                self.sr,
+                self.window,
+                f0_min,
+                f0_max,
+                model,
+                batch_size=batch_size,
+                device=self.device,
+                return_periodicity=True,
+            )
+            pd = torchcrepe.filter.median(pd, 3)
+            f0 = torchcrepe.filter.mean(f0, 3)
+            f0[pd < 0.1] = 0
+            f0 = f0[0].cpu().numpy()
+        elif "rmvpe" in f0_method:
+            if hasattr(self, "model_rmvpe") == False:
+                from lib.rmvpe import RMVPE
+                logger.info("Loading vocal pitch estimator model")
+                self.model_rmvpe = RMVPE(
+                    "rmvpe.pt", is_half=self.is_half, device=self.device
+                )
+            thred = 0.03
+            if "+" in f0_method:
+                f0 = self.model_rmvpe.pitch_based_audio_inference(x, thred, f0_min, f0_max)
+            else:
+                f0 = self.model_rmvpe.infer_from_audio(x, thred)
+        f0 *= pow(2, f0_up_key / 12)
+        # with open("test.txt","w")as f:f.write("\n".join([str(i)for i in f0.tolist()]))
+        tf0 = self.sr // self.window  # f0 points per second
+        if inp_f0 is not None:
+            delta_t = np.round(
+                (inp_f0[:, 0].max() - inp_f0[:, 0].min()) * tf0 + 1
+            ).astype("int16")
+            replace_f0 = np.interp(
+                list(range(delta_t)), inp_f0[:, 0] * 100, inp_f0[:, 1]
+            )
+            shape = f0[self.x_pad * tf0 : self.x_pad * tf0 + len(replace_f0)].shape[0]
+            f0[self.x_pad * tf0 : self.x_pad * tf0 + len(replace_f0)] = replace_f0[
+                :shape
+            ]
+        # with open("test_opt.txt","w")as f:f.write("\n".join([str(i)for i in f0.tolist()]))
+        f0bak = f0.copy()
+        f0_mel = 1127 * np.log(1 + f0 / 700)
+        f0_mel[f0_mel > 0] = (f0_mel[f0_mel > 0] - f0_mel_min) * 254 / (
+            f0_mel_max - f0_mel_min
+        ) + 1
+        f0_mel[f0_mel <= 1] = 1
+        f0_mel[f0_mel > 255] = 255
+        try:
+            f0_coarse = np.rint(f0_mel).astype(np.int)
+        except: # noqa
+            f0_coarse = np.rint(f0_mel).astype(int)
+        return f0_coarse, f0bak  # 1-0
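
Note on the end of `get_f0`: the pitch is first transposed by `f0_up_key` semitones, then mapped to the mel scale and quantized into integer bins 1 to 255, with bin 1 reserved for unvoiced frames and anything at or below `f0_min`. The scalar sketch below restates that mapping without NumPy. The function names are hypothetical; the constants are the ones used in the diff.

```python
import math

F0_MIN, F0_MAX = 50.0, 1100.0
MEL_MIN = 1127 * math.log(1 + F0_MIN / 700)
MEL_MAX = 1127 * math.log(1 + F0_MAX / 700)


def shift_semitones(f0_hz, semitones):
    """Transpose a frequency; 12 semitones doubles it."""
    return f0_hz * 2 ** (semitones / 12)


def f0_to_coarse(f0_hz):
    """Map a frequency in Hz to a coarse pitch bin in the range 1..255."""
    mel = 1127 * math.log(1 + f0_hz / 700)
    if mel > 0:
        # Rescale [MEL_MIN, MEL_MAX] linearly onto [1, 255].
        mel = (mel - MEL_MIN) * 254 / (MEL_MAX - MEL_MIN) + 1
    mel = min(max(mel, 1), 255)  # clamp, as the two masked assignments do
    return round(mel)
```

The diff's `try`/`except` around `astype(np.int)` exists because the `np.int` alias was removed in NumPy 1.24; the built-in `int` works on all versions.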
+    def vc(
+        self,
+        model,
+        net_g,
+        sid,
+        audio0,
+        pitch,
+        pitchf,
+        times,
+        index,
+        big_npy,
+        index_rate,
+        version,
+        protect,
+    ):  # ,file_index,file_big_npy
+        feats = torch.from_numpy(audio0)
+        if self.is_half:
+            feats = feats.half()
+        else:
+            feats = feats.float()
+        if feats.dim() == 2:  # double channels
+            feats = feats.mean(-1)
+        assert feats.dim() == 1, feats.dim()
+        feats = feats.view(1, -1)
+        padding_mask = torch.BoolTensor(feats.shape).to(self.device).fill_(False)
+        inputs = {
+            "source": feats.to(self.device),
+            "padding_mask": padding_mask,
+            "output_layer": 9 if version == "v1" else 12,
+        }
+        t0 = ttime()
+        with torch.no_grad():
+            logits = model.extract_features(**inputs)
+            feats = model.final_proj(logits[0]) if version == "v1" else logits[0]
+        if protect < 0.5 and pitch != None and pitchf != None:
+            feats0 = feats.clone()
+        if (
+            isinstance(index, type(None)) == False
+            and isinstance(big_npy, type(None)) == False
+            and index_rate != 0
+        ):
+            npy = feats[0].cpu().numpy()
+            if self.is_half:
+                npy = npy.astype("float32")
+            # _, I = index.search(npy, 1)
+            # npy = big_npy[I.squeeze()]
+            score, ix = index.search(npy, k=8)
+            weight = np.square(1 / score)
+            weight /= weight.sum(axis=1, keepdims=True)
+            npy = np.sum(big_npy[ix] * np.expand_dims(weight, axis=2), axis=1)
+            if self.is_half:
+                npy = npy.astype("float16")
+            feats = (
+                torch.from_numpy(npy).unsqueeze(0).to(self.device) * index_rate
+                + (1 - index_rate) * feats
+            )
+        feats = F.interpolate(feats.permute(0, 2, 1), scale_factor=2).permute(0, 2, 1)
+        if protect < 0.5 and pitch != None and pitchf != None:
+            feats0 = F.interpolate(feats0.permute(0, 2, 1), scale_factor=2).permute(
+                0, 2, 1
+            )
+        t1 = ttime()
+        p_len = audio0.shape[0] // self.window
+        if feats.shape[1] < p_len:
+            p_len = feats.shape[1]
+            if pitch != None and pitchf != None:
+                pitch = pitch[:, :p_len]
+                pitchf = pitchf[:, :p_len]
+        if protect < 0.5 and pitch != None and pitchf != None:
+            pitchff = pitchf.clone()
+            pitchff[pitchf > 0] = 1
+            pitchff[pitchf < 1] = protect
+            pitchff = pitchff.unsqueeze(-1)
+            feats = feats * pitchff + feats0 * (1 - pitchff)
+            feats = feats.to(feats0.dtype)
+        p_len = torch.tensor([p_len], device=self.device).long()
+        with torch.no_grad():
+            if pitch != None and pitchf != None:
+                audio1 = (
+                    (net_g.infer(feats, p_len, pitch, pitchf, sid)[0][0, 0])
+                    .data.cpu()
+                    .float()
+                    .numpy()
+                )
+            else:
+                audio1 = (
+                    (net_g.infer(feats, p_len, sid)[0][0, 0]).data.cpu().float().numpy()
+                )
+        del feats, p_len, padding_mask
+        if torch.cuda.is_available():
+            torch.cuda.empty_cache()
+        t2 = ttime()
+        times[0] += t1 - t0
+        times[2] += t2 - t1
+        return audio1
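
Note on the retrieval step in `vc`: each feature frame is replaced by a weighted average of its 8 nearest index entries, with weights proportional to the inverse square of the score returned by `index.search`, and the result is blended with the original frame by `index_rate`. The pure-Python sketch below shows the same arithmetic for a single frame. The function name `blend_with_index` is hypothetical, and the sketch assumes every score is strictly positive, since a zero score would divide by zero.

```python
def blend_with_index(feat, neighbors, scores, index_rate):
    """Blend one feature frame with its retrieved neighbors.

    feat: list of floats; neighbors: k vectors of the same length;
    scores: k positive scores as returned by the index search.
    """
    weights = [(1.0 / s) ** 2 for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]  # normalize so weights sum to 1
    retrieved = [
        sum(w * vec[d] for w, vec in zip(weights, neighbors))
        for d in range(len(feat))
    ]
    # index_rate = 0 keeps the original frame; index_rate = 1 uses only the index.
    return [index_rate * r + (1 - index_rate) * f for r, f in zip(retrieved, feat)]


half = blend_with_index([0.0, 0.0], [[1.0, 1.0], [3.0, 3.0]], [1.0, 1.0], 0.5)
```

With equal scores the neighbors are averaged evenly; a neighbor with half the score of another receives four times its weight.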
+    def pipeline(
+        self,
+        model,
+        net_g,
+        sid,
+        audio,
+        input_audio_path,
+        times,
+        f0_up_key,
+        f0_method,
+        file_index,
+        # file_big_npy,
+        index_rate,
+        if_f0,
+        filter_radius,
+        tgt_sr,
+        resample_sr,
+        rms_mix_rate,
+        version,
+        protect,
+        f0_file=None,
+    ):
+        if (
+            file_index != ""
+            # and file_big_npy != ""
+            # and os.path.exists(file_big_npy) == True
+            and os.path.exists(file_index) == True
+            and index_rate != 0
+        ):
+            try:
+                index = faiss.read_index(file_index)
+                # big_npy = np.load(file_big_npy)
+                big_npy = index.reconstruct_n(0, index.ntotal)
+            except:
+                traceback.print_exc()
+                index = big_npy = None
+        else:
+            index = big_npy = None
+            logger.warning("File index Not found, set None")
+        audio = signal.filtfilt(bh, ah, audio)
+        audio_pad = np.pad(audio, (self.window // 2, self.window // 2), mode="reflect")
+        opt_ts = []
+        if audio_pad.shape[0] > self.t_max:
+            audio_sum = np.zeros_like(audio)
+            for i in range(self.window):
+                audio_sum += audio_pad[i : i - self.window]
+            for t in range(self.t_center, audio.shape[0], self.t_center):
+                opt_ts.append(
+                    t
+                    - self.t_query
+                    + np.where(
+                        np.abs(audio_sum[t - self.t_query : t + self.t_query])
+                        == np.abs(audio_sum[t - self.t_query : t + self.t_query]).min()
+                    )[0][0]
+                )
+        s = 0
+        audio_opt = []
+        t = None
+        t1 = ttime()
+        audio_pad = np.pad(audio, (self.t_pad, self.t_pad), mode="reflect")
+        p_len = audio_pad.shape[0] // self.window
+        inp_f0 = None
+        if hasattr(f0_file, "name") == True:
+            try:
+                with open(f0_file.name, "r") as f:
+                    lines = f.read().strip("\n").split("\n")
+                inp_f0 = []
+                for line in lines:
+                    inp_f0.append([float(i) for i in line.split(",")])
+                inp_f0 = np.array(inp_f0, dtype="float32")
+            except Exception:
+                traceback.print_exc()
+        sid = torch.tensor(sid, device=self.device).unsqueeze(0).long()
+        pitch, pitchf = None, None
+        if if_f0 == 1:
+            pitch, pitchf = self.get_f0(
+                input_audio_path,
+                audio_pad,
+                p_len,
+                f0_up_key,
+                f0_method,
+                filter_radius,
+                inp_f0,
+            )
+            pitch = pitch[:p_len]
+            pitchf = pitchf[:p_len]
+            if self.device == "mps":
+                pitchf = pitchf.astype(np.float32)
+            pitch = torch.tensor(pitch, device=self.device).unsqueeze(0).long()
+            pitchf = torch.tensor(pitchf, device=self.device).unsqueeze(0).float()
+        t2 = ttime()
+        times[1] += t2 - t1
+        for t in opt_ts:
+            t = t // self.window * self.window
+            if if_f0 == 1:
+                audio_opt.append(
+                    self.vc(
+                        model,
+                        net_g,
+                        sid,
+                        audio_pad[s : t + self.t_pad2 + self.window],
+                        pitch[:, s // self.window : (t + self.t_pad2) // self.window],
+                        pitchf[:, s // self.window : (t + self.t_pad2) // self.window],
+                        times,
+                        index,
+                        big_npy,
+                        index_rate,
+                        version,
+                        protect,
+                    )[self.t_pad_tgt : -self.t_pad_tgt]
+                )
+            else:
+                audio_opt.append(
+                    self.vc(
+                        model,
+                        net_g,
+                        sid,
+                        audio_pad[s : t + self.t_pad2 + self.window],
+                        None,
+                        None,
+                        times,
+                        index,
+                        big_npy,
+                        index_rate,
+                        version,
+                        protect,
+                    )[self.t_pad_tgt : -self.t_pad_tgt]
+                )
+            s = t
+        if if_f0 == 1:
+            audio_opt.append(
+                self.vc(
+                    model,
+                    net_g,
+                    sid,
+                    audio_pad[t:],
+                    pitch[:, t // self.window :] if t is not None else pitch,
+                    pitchf[:, t // self.window :] if t is not None else pitchf,
+                    times,
+                    index,
+                    big_npy,
+                    index_rate,
+                    version,
+                    protect,
+                )[self.t_pad_tgt : -self.t_pad_tgt]
+            )
+        else:
+            audio_opt.append(
+                self.vc(
+                    model,
+                    net_g,
+                    sid,
+                    audio_pad[t:],
+                    None,
+                    None,
+                    times,
+                    index,
+                    big_npy,
+                    index_rate,
+                    version,
+                    protect,
+                )[self.t_pad_tgt : -self.t_pad_tgt]
+            )
+        audio_opt = np.concatenate(audio_opt)
+        if rms_mix_rate != 1:
+            audio_opt = change_rms(audio, 16000, audio_opt, tgt_sr, rms_mix_rate)
+        if resample_sr >= 16000 and tgt_sr != resample_sr:
+            audio_opt = librosa.resample(
+                audio_opt, orig_sr=tgt_sr, target_sr=resample_sr
+            )
+        audio_max = np.abs(audio_opt).max() / 0.99
+        max_int16 = 32768
+        if audio_max > 1:
+            max_int16 /= audio_max
+        audio_opt = (audio_opt * max_int16).astype(np.int16)
+        del pitch, pitchf, sid
+        if torch.cuda.is_available():
+            torch.cuda.empty_cache()
+        return audio_opt
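
Note on the chunking step in the pipeline above: long inputs are split near every multiple of `t_center`, at the sample where a sliding-window sum of the signal is closest to zero, so that cuts land in quiet regions. The following is a minimal pure-Python sketch of that search, not the code from `vci_pipeline.py`. The helper names `reflect_pad` and `find_split_points` are hypothetical, and plain lists stand in for the NumPy arrays used in the actual file.

```python
def reflect_pad(samples, n):
    """Mirror n samples at each end without repeating the edge sample.

    Intended to behave like numpy.pad(..., mode="reflect") for a 1-D list.
    """
    if n == 0:
        return list(samples)
    return samples[n:0:-1] + list(samples) + samples[-2:-n - 2:-1]


def find_split_points(audio, window, t_center, t_query):
    """Near each multiple of t_center, pick the quietest windowed position."""
    padded = reflect_pad(audio, window // 2)
    # audio_sum[j] adds up `window` consecutive padded samples starting at j.
    audio_sum = [sum(padded[j:j + window]) for j in range(len(audio))]
    points = []
    for t in range(t_center, len(audio), t_center):
        segment = [abs(v) for v in audio_sum[t - t_query:t + t_query]]
        # index() returns the first minimum, like np.where(...)[0][0].
        points.append(t - t_query + segment.index(min(segment)))
    return points


# Toy signal: constant level with a silent gap at samples 22..27.
audio = [1.0] * 50
for i in range(22, 28):
    audio[i] = 0.0
print(find_split_points(audio, window=4, t_center=20, t_query=5))
```

In the toy signal the first split point moves from 20 into the silent gap, while the second stays at the start of its search range because every candidate there is equally loud.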

voice_main.py ADDED

	@@ -0,0 +1,732 @@

+from quantum_dubbing.logging_setup import logger
+import torch
+import gc
+import numpy as np
+import os
+import shutil
+import warnings
+import threading
+from tqdm import tqdm
+from lib.infer_pack.models import (
+    SynthesizerTrnMs256NSFsid,
+    SynthesizerTrnMs256NSFsid_nono,
+    SynthesizerTrnMs768NSFsid,
+    SynthesizerTrnMs768NSFsid_nono,
+)
+from lib.audio import load_audio
+import soundfile as sf
+import edge_tts
+import asyncio
+from quantum_dubbing.utils import remove_directory_contents, create_directories
+from scipy import signal
+from time import time as ttime
+import faiss
+from vci_pipeline import VC, change_rms, bh, ah
+import librosa
+warnings.filterwarnings("ignore")
+class Config:
+    def __init__(self, only_cpu=False):
+        self.device = "cuda:0"
+        self.is_half = True
+        self.n_cpu = 0
+        self.gpu_name = None
+        self.gpu_mem = None
+        (
+            self.x_pad,
+            self.x_query,
+            self.x_center,
+            self.x_max
+        ) = self.device_config(only_cpu)
+    def device_config(self, only_cpu) -> tuple:
+        if torch.cuda.is_available() and not only_cpu:
+            i_device = int(self.device.split(":")[-1])
+            self.gpu_name = torch.cuda.get_device_name(i_device)
+            if (
+                ("16" in self.gpu_name and "V100" not in self.gpu_name.upper())
+                or "P40" in self.gpu_name.upper()
+                or "1060" in self.gpu_name
+                or "1070" in self.gpu_name
+                or "1080" in self.gpu_name
+            ):
+                logger.info(
+                    "16/10 Series GPUs and P40 detected: "
+                    "forcing single precision."
+                )
+                self.is_half = False
+            else:
+                self.gpu_name = None
+            self.gpu_mem = int(
+                torch.cuda.get_device_properties(i_device).total_memory
+                / 1024
+                / 1024
+                / 1024
+                + 0.4
+            )
+        elif torch.backends.mps.is_available() and not only_cpu:
+            logger.info("No supported NVIDIA GPU found, using MPS for inference")
+            self.device = "mps"
+        else:
+            logger.info("No supported NVIDIA GPU found, using CPU for inference")
+            self.device = "cpu"
+            self.is_half = False
+        if self.n_cpu == 0:
+            self.n_cpu = os.cpu_count()
+        if self.is_half:
+            # 6GB VRAM configuration
+            x_pad = 3
+            x_query = 10
+            x_center = 60
+            x_max = 65
+        else:
+            # 5GB VRAM configuration
+            x_pad = 1
+            x_query = 6
+            x_center = 38
+            x_max = 41
+        if self.gpu_mem is not None and self.gpu_mem <= 4:
+            x_pad = 1
+            x_query = 5
+            x_center = 30
+            x_max = 32
+        logger.info(
+            f"Config: Device is {self.device}, "
+            f"half precision is {self.is_half}"
+        )
+        return x_pad, x_query, x_center, x_max
+BASE_DOWNLOAD_LINK = "https://huggingface.co/r3gm/sonitranslate_voice_models/resolve/main/"
+BASE_MODELS = [
+    "hubert_base.pt",
+    "rmvpe.pt"
+]
+BASE_DIR = "."
+def load_hu_bert(config):
+    from fairseq import checkpoint_utils
+    from quantum_dubbing.utils import download_manager
+    for id_model in BASE_MODELS:
+        download_manager(
+            BASE_DOWNLOAD_LINK + id_model, BASE_DIR
+        )
+    models, _, _ = checkpoint_utils.load_model_ensemble_and_task(
+        ["hubert_base.pt"],
+        suffix="",
+    )
+    hubert_model = models[0]
+    hubert_model = hubert_model.to(config.device)
+    if config.is_half:
+        hubert_model = hubert_model.half()
+    else:
+        hubert_model = hubert_model.float()
+    hubert_model.eval()
+    return hubert_model
+def load_trained_model(model_path, config):
+    if not model_path:
+        raise ValueError("No model found")
+    logger.info("Loading %s" % model_path)
+    cpt = torch.load(model_path, map_location="cpu")
+    tgt_sr = cpt["config"][-1]
+    cpt["config"][-3] = cpt["weight"]["emb_g.weight"].shape[0]  # n_spk
+    if_f0 = cpt.get("f0", 1)
+    if if_f0 == 0:
+        # TODO: decide whether `protect` should be forced to 0.5
+        # for models trained without f0
+        pass
+    version = cpt.get("version", "v1")
+    if version == "v1":
+        if if_f0 == 1:
+            net_g = SynthesizerTrnMs256NSFsid(
+                *cpt["config"], is_half=config.is_half
+            )
+        else:
+            net_g = SynthesizerTrnMs256NSFsid_nono(*cpt["config"])
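+    elif version == "v2":
+        if if_f0 == 1:
+            net_g = SynthesizerTrnMs768NSFsid(
+                *cpt["config"], is_half=config.is_half
+            )
+        else:
+            net_g = SynthesizerTrnMs768NSFsid_nono(*cpt["config"])
+    else:
+        # Without this branch an unknown version fails later with a NameError
+        raise ValueError(f"Unsupported model version: {version}")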
+    del net_g.enc_q
+    net_g.load_state_dict(cpt["weight"], strict=False)
+    net_g.eval().to(config.device)
+    if config.is_half:
+        net_g = net_g.half()
+    else:
+        net_g = net_g.float()
+    vc = VC(tgt_sr, config)
+    n_spk = cpt["config"][-3]
+    return n_spk, tgt_sr, net_g, vc, cpt, version
+class ClassVoices:
+    def __init__(self, only_cpu=False):
+        self.model_config = {}
+        self.config = None
+        self.only_cpu = only_cpu
+    def apply_conf(
+        self,
+        tag="base_model",
+        file_model="",
+        pitch_algo="pm",
+        pitch_lvl=0,
+        file_index="",
+        index_influence=0.66,
+        respiration_median_filtering=3,
+        envelope_ratio=0.25,
+        consonant_breath_protection=0.33,
+        resample_sr=0,
+        file_pitch_algo="",
+    ):
+        if not file_model:
+            raise ValueError("Model not found")
+        if file_index is None:
+            file_index = ""
+        if file_pitch_algo is None:
+            file_pitch_algo = ""
+        if not self.config:
+            self.config = Config(self.only_cpu)
+            self.hu_bert_model = None
+            self.model_pitch_estimator = None
+        self.model_config[tag] = {
+            "file_model": file_model,
+            "pitch_algo": pitch_algo,
+            "pitch_lvl": pitch_lvl,  # integer; cast with int() in infer()
+            "file_index": file_index,
+            "index_influence": index_influence,
+            "respiration_median_filtering": respiration_median_filtering,
+            "envelope_ratio": envelope_ratio,
+            "consonant_breath_protection": consonant_breath_protection,
+            "resample_sr": resample_sr,
+            "file_pitch_algo": file_pitch_algo,
+        }
+        return f"CONFIGURATION APPLIED FOR {tag}: {file_model}"
+    def infer(
+        self,
+        task_id,
+        params,
+        # load model
+        n_spk,
+        tgt_sr,
+        net_g,
+        pipe,
+        cpt,
+        version,
+        if_f0,
+        # load index
+        index_rate,
+        index,
+        big_npy,
+        # load f0 file
+        inp_f0,
+        # audio file
+        input_audio_path,
+        overwrite,
+    ):
+        f0_method = params["pitch_algo"]
+        f0_up_key = params["pitch_lvl"]
+        filter_radius = params["respiration_median_filtering"]
+        resample_sr = params["resample_sr"]
+        rms_mix_rate = params["envelope_ratio"]
+        protect = params["consonant_breath_protection"]
+        if not os.path.exists(input_audio_path):
+            raise ValueError(
+                "The audio file was not found or is not "
+                f"a valid file: {input_audio_path}"
+            )
+        f0_up_key = int(f0_up_key)
+        audio = load_audio(input_audio_path, 16000)
+        # Normalize audio
+        audio_max = np.abs(audio).max() / 0.95
+        if audio_max > 1:
+            audio /= audio_max
+        times = [0, 0, 0]
+        # Filter the audio, pad it, compute sliding-window sums, and pick
+        # the quietest sample near each t_center as a chunk boundary
+        audio = signal.filtfilt(bh, ah, audio)
+        audio_pad = np.pad(
+            audio, (pipe.window // 2, pipe.window // 2), mode="reflect"
+        )
+        opt_ts = []
+        if audio_pad.shape[0] > pipe.t_max:
+            audio_sum = np.zeros_like(audio)
+            for i in range(pipe.window):
+                audio_sum += audio_pad[i:i - pipe.window]
+            for t in range(pipe.t_center, audio.shape[0], pipe.t_center):
+                opt_ts.append(
+                    t
+                    - pipe.t_query
+                    + np.where(
+                        np.abs(audio_sum[t - pipe.t_query: t + pipe.t_query])
+                        == np.abs(audio_sum[t - pipe.t_query: t + pipe.t_query]).min()
+                    )[0][0]
+                )
+        s = 0
+        audio_opt = []
+        t = None
+        t1 = ttime()
+        sid_value = 0
+        sid = torch.tensor(sid_value, device=pipe.device).unsqueeze(0).long()
+        # Pads audio symmetrically, calculates length divided by window size.
+        audio_pad = np.pad(audio, (pipe.t_pad, pipe.t_pad), mode="reflect")
+        p_len = audio_pad.shape[0] // pipe.window
+        # Estimates pitch from audio signal
+        pitch, pitchf = None, None
+        if if_f0 == 1:
+            pitch, pitchf = pipe.get_f0(
+                input_audio_path,
+                audio_pad,
+                p_len,
+                f0_up_key,
+                f0_method,
+                filter_radius,
+                inp_f0,
+            )
+            pitch = pitch[:p_len]
+            pitchf = pitchf[:p_len]
+            if pipe.device == "mps":
+                pitchf = pitchf.astype(np.float32)
+            pitch = torch.tensor(
+                pitch, device=pipe.device
+            ).unsqueeze(0).long()
+            pitchf = torch.tensor(
+                pitchf, device=pipe.device
+            ).unsqueeze(0).float()
+        t2 = ttime()
+        times[1] += t2 - t1
+        for t in opt_ts:
+            t = t // pipe.window * pipe.window
+            if if_f0 == 1:
+                pitch_slice = pitch[
+                    :, s // pipe.window: (t + pipe.t_pad2) // pipe.window
+                ]
+                pitchf_slice = pitchf[
+                    :, s // pipe.window: (t + pipe.t_pad2) // pipe.window
+                ]
+            else:
+                pitch_slice = None
+                pitchf_slice = None
+            audio_slice = audio_pad[s:t + pipe.t_pad2 + pipe.window]
+            audio_opt.append(
+                pipe.vc(
+                    self.hu_bert_model,
+                    net_g,
+                    sid,
+                    audio_slice,
+                    pitch_slice,
+                    pitchf_slice,
+                    times,
+                    index,
+                    big_npy,
+                    index_rate,
+                    version,
+                    protect,
+                )[pipe.t_pad_tgt:-pipe.t_pad_tgt]
+            )
+            s = t
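+        if if_f0 == 1:
+            pitch_end_slice = pitch[
+                :, t // pipe.window:
+            ] if t is not None else pitch
+            pitchf_end_slice = pitchf[
+                :, t // pipe.window:
+            ] if t is not None else pitchf
+        else:
+            # Models without f0 have no pitch tensors to slice
+            pitch_end_slice = None
+            pitchf_end_slice = None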
+        audio_opt.append(
+            pipe.vc(
+                self.hu_bert_model,
+                net_g,
+                sid,
+                audio_pad[t:],
+                pitch_end_slice,
+                pitchf_end_slice,
+                times,
+                index,
+                big_npy,
+                index_rate,
+                version,
+                protect,
+            )[pipe.t_pad_tgt:-pipe.t_pad_tgt]
+        )
+        audio_opt = np.concatenate(audio_opt)
+        if rms_mix_rate != 1:
+            audio_opt = change_rms(
+                audio, 16000, audio_opt, tgt_sr, rms_mix_rate
+            )
+        if resample_sr >= 16000 and tgt_sr != resample_sr:
+            audio_opt = librosa.resample(
+                audio_opt, orig_sr=tgt_sr, target_sr=resample_sr
+            )
+        audio_max = np.abs(audio_opt).max() / 0.99
+        max_int16 = 32768
+        if audio_max > 1:
+            max_int16 /= audio_max
+        audio_opt = (audio_opt * max_int16).astype(np.int16)
+        del pitch, pitchf, sid
+        if torch.cuda.is_available():
+            torch.cuda.empty_cache()
+        if resample_sr >= 16000 and tgt_sr != resample_sr:
+            final_sr = resample_sr
+        else:
+            final_sr = tgt_sr
+        """
+        "Success.\n %s\nTime:\n npy:%ss, f0:%ss, infer:%ss" % (
+            times[0],
+            times[1],
+            times[2],
+        ), (final_sr, audio_opt)
+        """
+        if overwrite:
+            output_audio_path = input_audio_path  # Overwrite
+        else:
+            basename = os.path.basename(input_audio_path)
+            dirname = os.path.dirname(input_audio_path)
+            # splitext keeps dotted stems intact and handles missing extensions
+            stem, ext = os.path.splitext(basename)
+            new_basename = f"{stem}_edited{ext}"
+            new_path = os.path.join(dirname, new_basename)
+            logger.info(str(new_path))
+            output_audio_path = new_path
+        # Save file
+        sf.write(
+            file=output_audio_path,
+            samplerate=final_sr,
+            data=audio_opt
+        )
+        self.model_config[task_id]["result"].append(output_audio_path)
+        self.output_list.append(output_audio_path)
+    def make_test(
+        self,
+        tts_text,
+        tts_voice,
+        model_path,
+        index_path,
+        transpose,
+        f0_method,
+    ):
+        folder_test = "test"
+        tag = "test_edge"
+        tts_file = "test/test.wav"
+        tts_edited = "test/test_edited.wav"
+        create_directories(folder_test)
+        remove_directory_contents(folder_test)
+        if "SET_LIMIT" == os.getenv("DEMO"):
+            if len(tts_text) > 60:
+                tts_text = tts_text[:60]
+                logger.warning("DEMO; limit to 60 characters")
+        try:
+            asyncio.run(edge_tts.Communicate(
+                tts_text, "-".join(tts_voice.split('-')[:-1])
+            ).save(tts_file))
+        except Exception as e:
+            raise ValueError(
+                "No audio was received. Please choose a TTS voice "
+                f"other than {tts_voice}. Error: {e}"
+            ) from e
+        shutil.copy(tts_file, tts_edited)
+        self.apply_conf(
+            tag=tag,
+            file_model=model_path,
+            pitch_algo=f0_method,
+            pitch_lvl=transpose,
+            file_index=index_path,
+            index_influence=0.66,
+            respiration_median_filtering=3,
+            envelope_ratio=0.25,
+            consonant_breath_protection=0.33,
+        )
+        self(
+            audio_files=tts_edited,
+            tag_list=tag,
+            overwrite=True
+        )
+        return tts_edited, tts_file
+    def run_threads(self, threads):
+        # Start threads
+        for thread in threads:
+            thread.start()
+        # Wait for all threads to finish
+        for thread in threads:
+            thread.join()
+        gc.collect()
+        torch.cuda.empty_cache()
+    def unload_models(self):
+        self.hu_bert_model = None
+        self.model_pitch_estimator = None
+        gc.collect()
+        torch.cuda.empty_cache()
+    def __call__(
+        self,
+        audio_files=None,
+        tag_list=None,
+        overwrite=False,
+        parallel_workers=1,
+    ):
+        logger.info(f"Parallel workers: {str(parallel_workers)}")
+        self.output_list = []
+        if not self.model_config:
+            raise ValueError("No model has been configured for inference")
+        if isinstance(audio_files, str):
+            audio_files = [audio_files]
+        if isinstance(tag_list, str):
+            tag_list = [tag_list]
+        if not audio_files:
+            raise ValueError("No audio found to convert")
+        if not tag_list:
+            tag_list = [list(self.model_config.keys())[-1]] * len(audio_files)
+        if len(audio_files) > len(tag_list):
+            logger.info("Extend tag list to match audio files")
+            extend_number = len(audio_files) - len(tag_list)
+            tag_list.extend([tag_list[0]] * extend_number)
+        if len(audio_files) < len(tag_list):
+            logger.info("Truncating tag list to match audio files")
+            tag_list = tag_list[:len(audio_files)]
+        tag_file_pairs = list(zip(tag_list, audio_files))
+        sorted_tag_file = sorted(tag_file_pairs, key=lambda x: x[0])
+        # Base params
+        if not self.hu_bert_model:
+            self.hu_bert_model = load_hu_bert(self.config)
+        cache_params = None
+        threads = []
+        progress_bar = tqdm(total=len(tag_list), desc="Progress")
+        for i, (id_tag, input_audio_path) in enumerate(sorted_tag_file):
+            if id_tag not in self.model_config.keys():
+                logger.info(
+                    f"No configured model for {id_tag} with {input_audio_path}"
+                )
+                continue
+            if (
+                len(threads) >= parallel_workers
+                or (cache_params != id_tag and cache_params is not None)
+            ):
+                self.run_threads(threads)
+                progress_bar.update(len(threads))
+                threads = []
+            if cache_params != id_tag:
+                self.model_config[id_tag]["result"] = []
+                # Unload previous
+                (
+                    n_spk,
+                    tgt_sr,
+                    net_g,
+                    pipe,
+                    cpt,
+                    version,
+                    if_f0,
+                    index_rate,
+                    index,
+                    big_npy,
+                    inp_f0,
+                ) = [None] * 11
+                gc.collect()
+                torch.cuda.empty_cache()
+                # Model params
+                params = self.model_config[id_tag]
+                model_path = params["file_model"]
+                f0_method = params["pitch_algo"]
+                file_index = params["file_index"]
+                index_rate = params["index_influence"]
+                f0_file = params["file_pitch_algo"]
+                # Load model
+                (
+                    n_spk,
+                    tgt_sr,
+                    net_g,
+                    pipe,
+                    cpt,
+                    version
+                ) = load_trained_model(model_path, self.config)
+                if_f0 = cpt.get("f0", 1)  # pitch data
+                # Load index
+                if os.path.exists(file_index) and index_rate != 0:
+                    try:
+                        index = faiss.read_index(file_index)
+                        big_npy = index.reconstruct_n(0, index.ntotal)
+                    except Exception as error:
+                        logger.error(f"Index: {str(error)}")
+                        index_rate = 0
+                        index = big_npy = None
+                else:
+                    logger.warning("Index file not found or index influence is 0")
+                    index_rate = 0
+                    index = big_npy = None
+                # Load f0 file
+                inp_f0 = None
+                if os.path.exists(f0_file):
+                    try:
+                        with open(f0_file, "r") as f:
+                            lines = f.read().strip("\n").split("\n")
+                        inp_f0 = []
+                        for line in lines:
+                            inp_f0.append([float(i) for i in line.split(",")])
+                        inp_f0 = np.array(inp_f0, dtype="float32")
+                    except Exception as error:
+                        logger.error(f"f0 file: {str(error)}")
+                if "rmvpe" in f0_method:
+                    if not self.model_pitch_estimator:
+                        from lib.rmvpe import RMVPE
+                        logger.info("Loading vocal pitch estimator model")
+                        self.model_pitch_estimator = RMVPE(
+                            "rmvpe.pt",
+                            is_half=self.config.is_half,
+                            device=self.config.device
+                        )
+                    pipe.model_rmvpe = self.model_pitch_estimator
+                cache_params = id_tag
+            # self.infer(
+            #     id_tag,
+            #     params,
+            #     # load model
+            #     n_spk,
+            #     tgt_sr,
+            #     net_g,
+            #     pipe,
+            #     cpt,
+            #     version,
+            #     if_f0,
+            #     # load index
+            #     index_rate,
+            #     index,
+            #     big_npy,
+            #     # load f0 file
+            #     inp_f0,
+            #     # output file
+            #     input_audio_path,
+            #     overwrite,
+            # )
+            thread = threading.Thread(
+                target=self.infer,
+                args=(
+                    id_tag,
+                    params,
+                    # loaded model
+                    n_spk,
+                    tgt_sr,
+                    net_g,
+                    pipe,
+                    cpt,
+                    version,
+                    if_f0,
+                    # loaded index
+                    index_rate,
+                    index,
+                    big_npy,
+                    # loaded f0 file
+                    inp_f0,
+                    # audio file
+                    input_audio_path,
+                    overwrite,
+                )
+            )
+            threads.append(thread)
+        # Run last
+        if threads:
+            self.run_threads(threads)
+        progress_bar.update(len(threads))
+        progress_bar.close()
+        final_result = []
+        valid_tags = set(tag_list)
+        for tag in valid_tags:
+            if (
+                tag in self.model_config.keys()
+                and "result" in self.model_config[tag].keys()
+            ):
+                final_result.extend(self.model_config[tag]["result"])
+        return final_result
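
Note on how `__call__` pairs tags with audio files: single strings are wrapped into lists, a short tag list is padded with its first tag, a long one is truncated, and the pairs are sorted by tag so that each model is loaded only once. The following is a minimal standalone sketch of that pairing, not the code from `voice_main.py`. The function name `align_tags` and the `fallback_tag` parameter are hypothetical (the original falls back to the most recently configured tag), and unlike the original this sketch copies the tag list instead of extending the caller's list in place.

```python
def align_tags(audio_files, tag_list, fallback_tag):
    """Return (tag, file) pairs sorted by tag, one pair per audio file."""
    if isinstance(audio_files, str):
        audio_files = [audio_files]
    if isinstance(tag_list, str):
        tag_list = [tag_list]
    if not audio_files:
        raise ValueError("No audio found to convert")
    # Copy the tags so the caller's list is never modified.
    tags = list(tag_list) if tag_list else [fallback_tag] * len(audio_files)
    if len(tags) < len(audio_files):
        # Pad with the first tag, as the original does.
        tags += [tags[0]] * (len(audio_files) - len(tags))
    tags = tags[:len(audio_files)]
    # sorted() is stable, so files keep their order within one tag.
    return sorted(zip(tags, audio_files), key=lambda pair: pair[0])


pairs = align_tags(["a.wav", "b.wav", "c.wav"], ["t2", "t1"], "base_model")
print(pairs)
```

Sorting by tag groups all files for one model together, which is what lets the loop reload weights only when the tag changes.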