Output audio cuts out periodically with popping dropouts #154

nukora · 2023-04-09T06:54:38Z

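Hello.
I am using RVC, and the output audio is periodically interrupted by short, popping dropouts.

The audio below was recorded through the following chain:
microphone → audio interface → VC Client → VoiceMeeter Banana → Audacity
(The output is much the same when I skip VoiceMeeter and send it directly to the speakers.)
【Sample: reading a passage aloud】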
https://user-images.githubusercontent.com/15606184/230758825-264a8bcc-46d4-4569-8ed9-f96265501f4a.mp4

【Sample: sustained sound】
https://user-images.githubusercontent.com/15606184/230758838-3499e5cc-a8bc-4b11-9f0d-917e0b7e6e0e.mp4

This does not happen when I use the output record feature under Device Setting in VC Client.
【Sample: reading a passage aloud】
https://user-images.githubusercontent.com/15606184/230758850-55b386c6-2591-499f-ba04-95975834084e.mp4

【Sample: sustained sound】
https://user-images.githubusercontent.com/15606184/230758856-105353f6-99d3-4176-baf7-0a942cf38587.mp4

Is there a way to fix this?

Thank you in advance.

【Environment】
How I run it:
Prebuilt binary v.1.5.1.15b win ONNX(cpu,cuda), PyTorch(cpu,cuda)
start_http_RVC.bat

OS: Windows 11 Pro 22H2
CPU: i9-10850K
GPU: GeForce RTX 3080
RAM: 32GB

frodo821 · 2023-04-09T18:55:51Z

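@nukora Increasing the buffer (Input Chunk Num) under Converter Setting resolves this. A longer buffer also means more lag, though, so it is a trade-off. Reducing Extra Data Length lowers the lag somewhat.
The root cause is that processing of the next chunk does not start until the previous chunk has finished. If preprocessing/inference/postprocessing can be parallelized (or pipelined) with something like multiprocessing, the problem might be solved (or, with pipelining, mitigated).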

voice-changer/server/voice_changer/VoiceChanger.py

Lines 162 to 260 in fb1be8b

    
    def on_request(self, receivedData: any):
        processing_sampling_rate = self.voiceChanger.get_processing_sampling_rate()

        print_convert_processing(f"------------ Convert processing.... ------------")
        # Pre-processing
        with Timer("pre-process") as t:

            with Timer("pre-process") as t1:

                if self.settings.inputSampleRate != processing_sampling_rate:
                    newData = resampy.resample(receivedData, self.settings.inputSampleRate, processing_sampling_rate)
                else:
                    newData = receivedData
            # print("t1::::", t1.secs)
            inputSize = newData.shape[0]
            crossfadeSize = min(self.settings.crossFadeOverlapSize, inputSize)

            print_convert_processing(
                f" Input data size: {receivedData.shape[0]}/{self.settings.inputSampleRate}hz {inputSize}/{processing_sampling_rate}hz")
            print_convert_processing(
                f" Crossfade data size: crossfade:{crossfadeSize}, crossfade setting:{self.settings.crossFadeOverlapSize}, input size:{inputSize}")

            print_convert_processing(f" Convert data size of {inputSize + crossfadeSize} (+ extra size)")
            print_convert_processing(f"         will be cropped:{-1 * (inputSize + crossfadeSize)}, {-1 * (crossfadeSize)}")

            self._generate_strength(crossfadeSize)
            with Timer("pre-process") as t2:
                data = self.voiceChanger.generate_input(newData, inputSize, crossfadeSize)
            # print("t2::::", t2.secs)
        preprocess_time = t.secs

        # Conversion
        with Timer("main-process") as t:
            try:
                # Inference
                audio = self.voiceChanger.inference(data)

                if hasattr(self, 'np_prev_audio1') == True:
                    np.set_printoptions(threshold=10000)
                    # Tail of the previous output that overlaps the current one
                    prev_overlap_start = -1 * crossfadeSize
                    prev_overlap = self.np_prev_audio1[prev_overlap_start:]
                    # Matching region of the current output
                    cur_overlap_start = -1 * (inputSize + crossfadeSize)
                    cur_overlap_end = -1 * inputSize
                    cur_overlap = audio[cur_overlap_start:cur_overlap_end]
                    print_convert_processing(
                        f" audio:{audio.shape}, prev_overlap:{prev_overlap.shape}, self.np_prev_strength:{self.np_prev_strength.shape}")
                    powered_prev = prev_overlap * self.np_prev_strength
                    print_convert_processing(
                        f" audio:{audio.shape}, cur_overlap:{cur_overlap.shape}, self.np_cur_strength:{self.np_cur_strength.shape}")
                    print_convert_processing(f" cur_overlap_strt:{cur_overlap_start}, cur_overlap_end{cur_overlap_end}")
                    powered_cur = cur_overlap * self.np_cur_strength
                    # Crossfade the two overlaps, then append the rest of the current output
                    powered_result = powered_prev + powered_cur

                    cur = audio[-1 * inputSize:-1 * crossfadeSize]
                    result = np.concatenate([powered_result, cur], axis=0)
                    print_convert_processing(
                        f" overlap:{crossfadeSize}, current:{cur.shape[0]}, result:{result.shape[0]}... result should be same as input")
                    if cur.shape[0] != result.shape[0]:
                        print_convert_processing(f" current and result should be same as input")

                else:
                    # First call: no previous output to crossfade with
                    result = np.zeros(4096).astype(np.int16)
                self.np_prev_audio1 = audio

            except Exception as e:
                print("VC PROCESSING!!!! EXCEPTION!!!", e)
                print(traceback.format_exc())
                if hasattr(self, "np_prev_audio1"):
                    del self.np_prev_audio1
                return np.zeros(1).astype(np.int16), [0, 0, 0]
        mainprocess_time = t.secs

        # Post-processing
        with Timer("post-process") as t:
            result = result.astype(np.int16)
            if self.settings.inputSampleRate != processing_sampling_rate:
                outputData = resampy.resample(result, processing_sampling_rate, self.settings.inputSampleRate).astype(np.int16)
            else:
                outputData = result
            # outputData = result

            print_convert_processing(
                f" Output data size of {result.shape[0]}/{processing_sampling_rate}hz {outputData.shape[0]}/{self.settings.inputSampleRate}hz")

            if self.settings.recordIO == 1:
                self.ioRecorder.writeInput(receivedData)
                self.ioRecorder.writeOutput(outputData.tobytes())

            # if receivedData.shape[0] != outputData.shape[0]:
            #     print(f"Padding, in:{receivedData.shape[0]} out:{outputData.shape[0]}")
            #     outputData = pad_array(outputData, receivedData.shape[0])
            #     # print_convert_processing(
            #     #     f" Padded!, Output data size of {result.shape[0]}/{processing_sampling_rate}hz {outputData.shape[0]}/{self.settings.inputSampleRate}hz")
        postprocess_time = t.secs

        print_convert_processing(f" [fin] Input/Output size:{receivedData.shape[0]},{outputData.shape[0]}")
        perf = [preprocess_time, mainprocess_time, postprocess_time]
        return outputData, perf
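
For readers following the overlap arithmetic in the excerpt above, here is a small standalone sketch of the crossfade step. It is an illustration only: `_generate_strength` is not shown in the excerpt, so the cos²/sin² fade curves in `crossfade_weights` are an assumption, the function names are hypothetical, and plain Python lists stand in for the NumPy arrays so the sketch runs without dependencies. It assumes `crossfade_size` is at least 1.

```python
import math


def crossfade_weights(size):
    """Assumed fade curves: cos^2 fades out, sin^2 fades in; they sum to 1."""
    prev_strength = [math.cos(i / size * math.pi / 2) ** 2 for i in range(size)]
    cur_strength = [math.sin(i / size * math.pi / 2) ** 2 for i in range(size)]
    return prev_strength, cur_strength


def blend_overlap(prev_audio, audio, input_size, crossfade_size):
    """Mirror the slicing used in on_request on plain lists of samples."""
    prev_strength, cur_strength = crossfade_weights(crossfade_size)
    # Tail of the previous output
    prev_overlap = prev_audio[-crossfade_size:]
    # Region of the current output that sits just before the newest input
    cur_overlap = audio[-(input_size + crossfade_size):-input_size]
    blended = [
        p * ps + c * cs
        for p, ps, c, cs in zip(prev_overlap, prev_strength, cur_overlap, cur_strength)
    ]
    # Remainder of the current output, taken without fading
    rest = audio[-input_size:-crossfade_size]
    return blended + rest


out = blend_overlap([0.0] * 12, [1.0] * 12, input_size=8, crossfade_size=4)
print(len(out))  # same length as input_size
```

With these slices the result always has `input_size` samples: `crossfade_size` blended samples followed by `input_size - crossfade_size` unblended ones.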

w-okada · 2023-04-09T22:07:38Z

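It originally ran with multiprocessing, but either the Windows native build (pyinstaller) or Colab support, I forget which, would not work properly unless it was single-threaded. If Colab support turns out to be the cause, I am thinking it may be about time to drop Colab and go back to multiprocessing.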

frodo821 · 2023-04-10T00:49:42Z

@w-okada It is probably pyinstaller. I think it should work if you call multiprocessing.freeze_support() at some suitable point before using any multiprocessing features.

Reference: https://qiita.com/npkk/items/cc4c46181c06ff41bdf3
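
A minimal sketch of where that call usually goes, assuming a script that is later frozen into a Windows executable. `convert_chunk` and `main` are hypothetical names for illustration and are not part of the voice-changer code.

```python
import multiprocessing


def convert_chunk(chunk):
    """Hypothetical stand-in for voice conversion: halves every sample."""
    return [sample // 2 for sample in chunk]


def main():
    chunks = [[100, 200], [300, 400]]
    # Worker processes are only created inside main(), after freeze_support()
    with multiprocessing.Pool(processes=2) as pool:
        converted = pool.map(convert_chunk, chunks)
    print(converted)


if __name__ == "__main__":
    # Call this straight after the __main__ guard, before any other
    # multiprocessing use. It only has an effect in a frozen Windows executable.
    multiprocessing.freeze_support()
    main()
```

Outside a frozen Windows executable the call does nothing, so it can stay in place for ordinary runs too.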

nukora · 2023-04-10T03:17:29Z

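I tried increasing Input Chunk Num. The interval between silences got longer, but silence is still inserted periodically.
Is this also something multiprocessing would fix?

(I do not understand the internals, so I may be off the mark, but for now I am posting what I observe.)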

【Input Chunk Num = 512】
https://user-images.githubusercontent.com/15606184/230818433-1cb7de89-ae56-45a3-a3d7-92fa1cb76d2b.mp4

【Input Chunk Num = 1024】
https://user-images.githubusercontent.com/15606184/230818449-de75a541-54a0-4f1d-b371-c066f1f970b9.mp4

frodo821 · 2023-04-10T03:56:14Z

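@w-okada For now I have changed just the conversion step to run under multiprocessing.

[WIP] Changed voice conversion to be processed with multiprocessing #158

To solve this problem, I suspect the playback side also needs either to be split into its own process or to run in the same process as the conversion. I am too sleepy to go on, so I am going to bed for now.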

frodo821 · 2023-04-10T04:16:42Z

@nukora

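> I tried increasing Input Chunk Num. The interval between silences got longer, but silence is still inserted periodically.

In the current code, each buffer is converted, and conversion of the next buffer does not start until playback of that buffer has finished. As a result, the audio cuts out from the moment one buffer finishes playing until the next buffer is ready to play. Increasing Input Chunk Num makes each buffer larger, so each buffer plays for longer. In other words, as you observed, the interval between silences gets longer.
My guess is that with multiprocessing the next buffer can be converted while the current one is playing, so the silences would no longer occur.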
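
The overlap described in that guess can be sketched as follows. This is not the project's code: `convert` and `play` are hypothetical stand-ins with made-up timings, and a worker thread with a bounded queue is used here instead of multiprocessing to keep the example short.

```python
import queue
import threading
import time


def convert(chunk):
    """Hypothetical stand-in for voice conversion (takes about 10 ms)."""
    time.sleep(0.01)
    return chunk.upper()


def play(chunk, played):
    """Hypothetical stand-in for playback (takes about 20 ms)."""
    time.sleep(0.02)
    played.append(chunk)


def run_pipeline(chunks):
    """Convert the next chunk in a worker thread while the current one plays."""
    ready = queue.Queue(maxsize=2)  # converted chunks waiting to be played
    played = []

    def producer():
        for chunk in chunks:
            ready.put(convert(chunk))
        ready.put(None)  # sentinel: nothing more to play

    worker = threading.Thread(target=producer)
    worker.start()
    while True:
        item = ready.get()
        if item is None:
            break
        play(item, played)
    worker.join()
    return played


print(run_pipeline(["a", "b", "c"]))
```

Because conversion here is faster than playback, a converted chunk is normally already waiting in the queue when the previous one finishes playing.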

nukora · 2023-04-10T16:41:28Z

I see.
Thank you for the detailed explanation.
In my environment the seams between buffers are very noticeable, so I would appreciate it if this could be addressed!

w-okada · 2023-04-11T01:00:17Z

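I am not sure of the cause, but logically the design rests on this idea:
as long as converting the current audio takes sufficiently less time than playing the previous audio, conversion never falls behind and no problem occurs.
If this idea is correct, this may not be a problem that pipelining can solve.
I am attaching a slide I made to explain this elsewhere.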
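
The condition stated above can be written as a small calculation. This is only a sketch of that reasoning: the function names are hypothetical, and the chunk size in samples is passed in directly because the thread does not say how Input Chunk Num maps to samples.

```python
def chunk_duration_sec(num_samples, sample_rate_hz):
    """Playback time of one chunk."""
    return num_samples / sample_rate_hz


def gap_per_chunk_sec(convert_time_sec, num_samples, sample_rate_hz):
    """Silence per chunk when conversion overlaps playback of the previous chunk.

    Returns 0.0 when conversion finishes before the previous chunk stops playing.
    """
    playback = chunk_duration_sec(num_samples, sample_rate_hz)
    return max(0.0, convert_time_sec - playback)


# One second of audio at 48 kHz
print(gap_per_chunk_sec(0.3, 48000, 48000))  # conversion keeps up
print(gap_per_chunk_sec(1.5, 48000, 48000))  # conversion falls behind
```

Under this model a gap appears only when conversion takes longer than the chunk it overlaps, which is why a larger buffer would not help if conversion is already fast enough.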

w-okada · 2023-04-11T01:03:57Z

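Oh, as a premise: buffering and playback are done in the browser, and conversion is done on the server side.
So in that sense it has been running as multiple processes from the start.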

w-okada · 2023-04-11T01:10:34Z

Also, I do not know whether it is the same phenomenon, but there was this report as well.

nukora · 2023-04-11T09:11:57Z

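> Also, I do not know whether it is the same phenomenon, but there was this report as well.

That was exactly the cause.
After I set the sampling rate to 48 kHz in the audio interface settings, the problem no longer occurs.

In my environment, it stops working properly at rates either above or below 48 kHz.

Thank you for pointing me to it!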

w-okada closed this as completed Apr 11, 2023

This was referenced Apr 11, 2023

[WIP] Changed voice conversion to be processed with multiprocessing #158

Closed

Could you please finally explain input chunk num, extra datalength and so on!? #179

Closed

Is sola not working? #182

Closed

w-okada mentioned this issue Jun 14, 2023

Extra Data length meaning #315

Closed
