Reverse PR 0430 #6

Merged · 6 commits · Apr 30, 2018
2 changes: 2 additions & 0 deletions .gitignore
@@ -10,6 +10,8 @@ log
generated
data
text
datasets
testout

# Created by https://www.gitignore.io

57 changes: 52 additions & 5 deletions README.md
@@ -1,3 +1,5 @@
![alt text](assets/banner.jpg)

# Deepvoice3_pytorch

[![PyPI](https://img.shields.io/pypi/v/deepvoice3_pytorch.svg)](https://pypi.python.org/pypi/deepvoice3_pytorch)
@@ -21,8 +23,8 @@ A notebook supposed to be executed on https://colab.research.google.com is avail
- Convolutional sequence-to-sequence model with attention for text-to-speech synthesis
- Multi-speaker and single speaker versions of DeepVoice3
- Audio samples and pre-trained models
- Preprocessor for [LJSpeech (en)](https://keithito.com/LJ-Speech-Dataset/), [JSUT (jp)](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and [VCTK](http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html) datasets
- Language-dependent frontend text processor for English and Japanese
- Preprocessor for [LJSpeech (en)](https://keithito.com/LJ-Speech-Dataset/), [JSUT (jp)](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and [VCTK](http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html) datasets, as well as [carpedm20/multi-speaker-tacotron-tensorflow](https://github.com/carpedm20/multi-Speaker-tacotron-tensorflow) compatible custom dataset (in JSON format)
- Language-dependent frontend text processor for English and Japanese

### Samples

@@ -102,7 +104,7 @@ python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljs
- LJSpeech (en): https://keithito.com/LJ-Speech-Dataset/
- VCTK (en): http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html
- JSUT (jp): https://sites.google.com/site/shinnosuketakamichi/publication/jsut
- NIKL (ko): http://www.korean.go.kr/front/board/boardStandardView.do?board_id=4&mn_id=17&b_seq=464
- NIKL (ko) (**needs a Korean cellphone number to access**): http://www.korean.go.kr/front/board/boardStandardView.do?board_id=4&mn_id=17&b_seq=464

### 1. Preprocessing

@@ -128,6 +130,47 @@ python preprocess.py --preset=presets/deepvoice3_ljspeech.json ljspeech ~/data/L

When this is done, you will see extracted features (mel-spectrograms and linear spectrograms) in `./data/ljspeech`.

#### 1-1. Building a custom dataset (using json_meta)
Building your own dataset with metadata in JSON format (compatible with [carpedm20/multi-speaker-tacotron-tensorflow](https://github.com/carpedm20/multi-Speaker-tacotron-tensorflow)) is currently supported.
Usage:

```
python preprocess.py json_meta ${list-of-JSON-metadata-paths} ${out_dir} --preset=<json>
```
You may need to modify a pre-existing preset JSON file, especially `n_speakers`. For English multi-speaker training, start with `presets/deepvoice3_vctk.json`.

Assuming you have dataset A (Speaker A) and dataset B (Speaker B), described in the JSON metadata files `./datasets/datasetA/alignment.json` and `./datasets/datasetB/alignment.json` respectively, you can preprocess the data by:

```
python preprocess.py json_meta "./datasets/datasetA/alignment.json,./datasets/datasetB/alignment.json" "./datasets/processed_A+B" --preset=(path to preset json file)
```
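For illustration only, here is one assumed shape of such a metadata file, a plain mapping from audio paths to transcripts. The exact schema is defined by the carpedm20 tooling, so the paths and layout below are hypothetical:

```
# Illustrative sketch only: an assumed alignment.json layout (audio path ->
# transcript).  Check the carpedm20 repository for the exact schema produced
# by its alignment/recognition scripts.
import json

example_metadata = {
    "./datasets/datasetA/audio/utt_0001.wav": "First transcript sentence.",
    "./datasets/datasetA/audio/utt_0002.wav": "Second transcript sentence.",
}

with open("./datasets/datasetA/alignment.json", "w", encoding="utf-8") as f:
    json.dump(example_metadata, f, ensure_ascii=False, indent=2)
```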

#### 1-2. Preprocessing custom English datasets with long silences (based on [vctk_preprocess](vctk_preprocess/))

Some datasets, especially automatically generated ones, may include long silences and undesirable leading/trailing noise that undermine the char-level seq2seq model
(e.g. VCTK, although this is covered by vctk_preprocess).

To deal with the problem, `gentle_web_align.py` will
- **Prepare phoneme alignments for all utterances**
- Cut silences during preprocessing

`gentle_web_align.py` uses [Gentle](https://github.com/lowerquality/gentle), a Kaldi-based speech-text alignment tool. It accesses a web-served Gentle application, aligns the given sound segments with their transcripts, and converts the result into HTK-style label files to be processed by `preprocess.py`. Gentle can be run on Linux/Mac/Windows (via Docker).
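For reference, a minimal sketch of the label conversion performed by `write_hts_label()` in `gentle_web_align.py`: Gentle reports timings in seconds, and each HTK-style label line stores start/end times in 100 ns units followed by the phone symbol. The helper below is hypothetical, not part of the repository.

```
# Minimal sketch of one HTK-style label line, mirroring write_hts_label()
# in gentle_web_align.py: times in seconds are scaled to 100 ns units.
# hts_label_line() itself is a hypothetical helper, not repository code.
def hts_label_line(start_sec, end_sec, phone):
    return "{} {} {}".format(int(start_sec * 1e7), int(end_sec * 1e7), phone)

print(hts_label_line(0.00, 0.25, "silB"))  # -> "0 2500000 silB"
```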

Preliminary results show that while the HTK/festival/merlin-based method in `vctk_preprocess/prepare_vctk_labels.py` works better on VCTK, Gentle is more stable on audio clips with ambient noise (e.g. movie excerpts).

Usage:
(Assuming Gentle is running at `localhost:8567`; this is the default when not specified.)
1. When sound files and transcript files are saved in separate folders (e.g. sound files in `datasetA/wavs` and transcripts in `datasetA/txts`):
```
python gentle_web_align.py -w "datasetA/wavs/*.wav" -t "datasetA/txts/*.txt" --server_addr=localhost --port=8567
```

2. When sound files and transcript files are saved in a nested structure (e.g. `datasetB/speakerN/blahblah.wav` and `datasetB/speakerN/blahblah.txt`):
```
python gentle_web_align.py --nested-directories="datasetB" --server_addr=localhost --port=8567
```
**Once you have a phoneme alignment for each utterance, you can extract features by running `preprocess.py`.**

### 2. Training

Usage:
@@ -139,7 +182,7 @@ python train.py --data-root=${data-root} --preset=<json> --hparams="parameters y
Suppose you build a DeepVoice3-style model using the LJSpeech dataset; you can then train your model by:

```
python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljspeech/
python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljspeech/
```

Model checkpoints (.pth) and alignments (.png) are saved in the `./checkpoints` directory every 10000 steps by default.
@@ -247,11 +290,15 @@ From my experience, it can get reasonable speech quality very quickly rather tha
There are two important options used above:

- `--restore-parts=<N>`: It specifies where to load model parameters from. The differences from `--checkpoint=<N>` are: 1) `--restore-parts=<N>` ignores all invalid parameters, while `--checkpoint=<N>` doesn't; 2) `--restore-parts=<N>` tells the trainer to start from step 0, while `--checkpoint=<N>` tells the trainer to continue from the last step. `--checkpoint=<N>` should be fine if you are continuing to train exactly the same model, while `--restore-parts=<N>` is useful if you want to customize your model architecture and take advantage of a pre-trained model.
- `--speaker-id=<N>`: It specifies which speaker's data is used for training. This should only be specified when using a multi-speaker dataset. For VCTK, speaker ids are automatically assigned incrementally (0, 1, ..., 107) according to the `speaker_info.txt` in the dataset.

If you are training a multi-speaker model, speaker adaptation will only work **when `n_speakers` is identical**.

## Acknowledgements

Part of the code was adapted from the following projects:

- https://github.com/keithito/tacotron
- https://github.com/facebookresearch/fairseq-py

Banner and logo created by [@jraulhernandezi](https://github.com/jraulhernandezi) ([#76](https://github.com/r9y9/deepvoice3_pytorch/issues/76))
Binary file added assets/banner.jpg
2 changes: 1 addition & 1 deletion deepvoice3_pytorch/conv.py
@@ -40,7 +40,7 @@ def incremental_forward(self, input):
self.input_buffer[:, :-1, :] = self.input_buffer[:, 1:, :].clone()
# append next input
self.input_buffer[:, -1, :] = input[:, -1, :]
input = torch.Tensor(self.input_buffer)
input = self.input_buffer.clone()
if dilation > 1:
input = input[:, 0::dilation, :].contiguous()
output = F.linear(input.view(bsz, -1), weight, self.bias)
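(Aside, not part of the diff.) My reading of this change: `self.input_buffer.clone()` copies the buffer while preserving its dtype and device and allocating fresh storage, whereas the legacy `torch.Tensor(...)` constructor goes through the default tensor type. A minimal sketch under that assumption:

```
# Illustrative sketch: .clone() preserves the source buffer's dtype/device and
# allocates independent storage.
import torch

buf = torch.randn(1, 4, 8)                  # stand-in for self.input_buffer
copied = buf.clone()
assert copied.dtype == buf.dtype and copied.device == buf.device
assert torch.equal(copied, buf)             # same values...
assert copied.data_ptr() != buf.data_ptr()  # ...but independent storage
```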
5 changes: 3 additions & 2 deletions docs/config.toml
@@ -1,11 +1,12 @@
baseURL = "https://r9y9.github.io/deepvoice3_pytorch/"
languageCode = "ja-jp"
title = "An open source implementation of DeepVoice 3: 2000-Speaker Neural Text-to-Speech"
title = "An open source implementation of Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning"
author = "Ryuichi YAMAMOTO"

[params]
author = "Ryuichi YAMAMOTO"
logo = "/images/r9y9.jpg"
project = "deepvoice3_pytorch"
logo = "/images/512logotipo.png"
twitter = "r9y9"
github = "r9y9"
analytics = "UA-44433856-1"
2 changes: 1 addition & 1 deletion docs/content/index.md
@@ -12,7 +12,7 @@ type = "index"

- Github: https://github.com/r9y9/deepvoice3_pytorch

This page provides audio samples for the open source implementation of DeepVoice3. Samples from single speaker and multi-speaker models follow.
This page provides audio samples for the open source implementation of [Deep Voice 3](https://arxiv.org/abs/1710.07654). Samples from single speaker and multi-speaker models follow.

## Single speaker

5 changes: 0 additions & 5 deletions docs/layouts/partials/footer.html
@@ -3,11 +3,6 @@
<div class="hr"></div>
<address>
<div class="avatar-bottom">
<a href="/">
{{ with .Site.Params.logo }}
<img src="{{ . }}">
{{ end }}
</a>
</div>

<div class="copyright">Copyright &copy;
12 changes: 6 additions & 6 deletions docs/layouts/partials/header.html
@@ -6,11 +6,11 @@
<meta name="viewport" content="width=device-width, initial-scale=1">
<link href="https://fonts.googleapis.com/css?family=Roboto:300,400,700" rel="stylesheet" type="text/css">
<link rel="stylesheet" href="//cdnjs.cloudflare.com/ajax/libs/highlight.js/8.4/styles/github.min.css">
<link rel="stylesheet" href="/css/normalize.css">
<link rel="stylesheet" href="/css/skeleton.css">
<link rel="stylesheet" href="/css/custom.css">
<link rel="alternate" href="/index.xml" type="application/rss+xml" title="{{ .Site.Title }}">
<link rel="shortcut icon" href="/favicon.png" type="image/x-icon" />
<link rel="stylesheet" href="/{{ .Site.Params.Project }}/css/normalize.css">
<link rel="stylesheet" href="/{{ .Site.Params.Project }}/css/skeleton.css">
<link rel="stylesheet" href="/{{ .Site.Params.Project }}/css/custom.css">
<link rel="alternate" href="/{{ .Site.Params.Project }}/index.xml" type="application/rss+xml" title="{{ .Site.Title }}">
<link rel="shortcut icon" href="/{{ .Site.Params.Project }}/favicon.png" type="image/x-icon" />
<title>{{ $isHomePage := eq .Title .Site.Title }}{{ .Title }}{{ if eq $isHomePage false }} - {{ .Site.Title }}{{ end }}</title>
</head>
<body>
@@ -19,7 +19,7 @@

<header role="banner">
<div class="header-logo">
<a href="/"><img src="{{ .Site.Params.logo }}" width="70" height="70"></a>
<a href="https://github.com/r9y9/deepvoice3_pytorch"><img src="/{{ .Site.Params.Project }}/{{ .Site.Params.logo }}" width="140" height="140"></a>
</div>
{{ if eq $isHomePage true }}<h1 class="site-title">{{ .Site.Title }}</h1>{{ end }}
</header>
9 changes: 0 additions & 9 deletions docs/static/css/custom.css
@@ -62,15 +62,6 @@ main {
max-width: 700px;
}

.header-logo img {
border-radius: 50%;
border: 2px solid #E1E1E1;
}

.header-logo img:hover {
border-color: #F1F1F1;
}

.site-title {
margin-top: 2rem;
}
Binary file modified docs/static/favicon.png
Binary file added docs/static/images/512logotipo.png
Binary file removed docs/static/images/r9y9.jpg
153 changes: 153 additions & 0 deletions gentle_web_align.py
@@ -0,0 +1,153 @@
# -*- coding: utf-8 -*-
"""
Created on Sat Apr 21 09:06:37 2018
Phoneme alignment and conversion to HTK-style label files using web-served Gentle.
This works on any type of English dataset.
Unlike prepare_htk_alignments_vctk.py, this is Python 3 and Windows (with Docker) compatible.
Preliminary results show that Gentle has better performance with noisy datasets
(e.g. movie-extracted audio clips).
*This work was derived from vctk_preprocess/prepare_htk_alignments_vctk.py
@author: engiecat(github)

usage:
    gentle_web_align.py (-w wav_pattern) (-t text_pattern) [options]
    gentle_web_align.py (--nested-directories=<main_directory>) [options]

options:
    -w <wav_pattern> --wav_pattern=<wav_pattern>  Pattern of wav files to be aligned
    -t <txt_pattern> --txt_pattern=<txt_pattern>  Pattern of txt transcript files to be aligned (same name required)
    --nested-directories=<main_directory>         Process every wav/txt file in the subfolders of the given folder
    --server_addr=<server_addr>                   Server address that serves Gentle. [default: localhost]
    --port=<port>                                 Server port that serves Gentle. [default: 8567]
    --max_unalign=<max_unalign>                   Maximum threshold for unalignment occurrence (0.0 ~ 1.0) [default: 0.3]
    --skip-already-done                           Skip if there is a pre-existing .lab file
    -h --help                                     Show this help message and exit
"""

from docopt import docopt
from glob import glob
from tqdm import tqdm
import os.path
import requests
import numpy as np


def write_hts_label(labels, lab_path):
    lab = ""
    for s, e, l in labels:
        s, e = float(s) * 1e7, float(e) * 1e7
        s, e = int(s), int(e)
        lab += "{} {} {}\n".format(s, e, l)
    print(lab)
    with open(lab_path, "w", encoding='utf-8') as f:
        f.write(lab)


def json2hts(data):
    emit_bos = False
    emit_eos = False

    phone_start = 0
    phone_end = None
    labels = []
    failure_count = 0

    for word in data["words"]:
        case = word["case"]
        if case != "success":
            failure_count += 1  # instead of failing everything,
            # raise RuntimeError("Alignment failed")
            continue
        start = float(word["start"])
        word_end = float(word["end"])

        if not emit_bos:
            labels.append((phone_start, start, "silB"))
            emit_bos = True

        phone_start = start
        phone_end = None
        for phone in word["phones"]:
            ph = str(phone["phone"][:-2])
            duration = float(phone["duration"])
            phone_end = phone_start + duration
            labels.append((phone_start, phone_end, ph))
            phone_start += duration
        assert np.allclose(phone_end, word_end)
    if not emit_eos:
        labels.append((phone_start, phone_end, "silE"))
        emit_eos = True
    unalign_ratio = float(failure_count) / len(data['words'])
    return unalign_ratio, labels


def gentle_request(wav_path, txt_path, server_addr, port, debug=False):
    print('\n')
    response = None
    wav_name = os.path.basename(wav_path)
    txt_name = os.path.basename(txt_path)
    if os.path.splitext(wav_name)[0] != os.path.splitext(txt_name)[0]:
        print(' [!] wav name and transcript name do not match - exiting...')
        return response
    with open(txt_path, 'r', encoding='utf-8-sig') as txt_file:
        print('Transcript - ' + ''.join(txt_file.readlines()))
    with open(wav_path, 'rb') as wav_file, open(txt_path, 'rb') as txt_file:
        params = (('async', 'false'),)
        files = {'audio': (wav_name, wav_file),
                 'transcript': (txt_name, txt_file),
                 }
        server_path = 'http://' + server_addr + ':' + str(port) + '/transcriptions'
        response = requests.post(server_path, params=params, files=files)
        if response.status_code != 200:
            print(' [!] External server({}) returned bad response({})'.format(server_path, response.status_code))
    if debug:
        print('Response')
        print(response.json())
    return response


if __name__ == '__main__':
    arguments = docopt(__doc__)
    server_addr = arguments['--server_addr']
    port = int(arguments['--port'])
    max_unalign = float(arguments['--max_unalign'])
    if arguments['--nested-directories'] is None:
        wav_paths = sorted(glob(arguments['--wav_pattern']))
        txt_paths = sorted(glob(arguments['--txt_pattern']))
    else:
        # if this is a multi-foldered environment
        # (e.g. DATASET/speaker1/blahblah.wav)
        wav_paths = []
        txt_paths = []
        topdir = arguments['--nested-directories']
        subdirs = [f for f in os.listdir(topdir) if os.path.isdir(os.path.join(topdir, f))]
        for subdir in subdirs:
            wav_pattern_subdir = os.path.join(topdir, subdir, '*.wav')
            txt_pattern_subdir = os.path.join(topdir, subdir, '*.txt')
            wav_paths.extend(sorted(glob(wav_pattern_subdir)))
            txt_paths.extend(sorted(glob(txt_pattern_subdir)))

    t = tqdm(range(len(wav_paths)))
    for idx in t:
        try:
            t.set_description("Align via Gentle")
            wav_path = wav_paths[idx]
            txt_path = txt_paths[idx]
            lab_path = os.path.splitext(wav_path)[0] + '.lab'
            if os.path.exists(lab_path) and arguments['--skip-already-done']:
                print('[!] skipping because of pre-existing .lab file - {}'.format(lab_path))
                continue
            res = gentle_request(wav_path, txt_path, server_addr, port)
            unalign_ratio, lab = json2hts(res.json())
            print('[*] Unaligned Ratio - {}'.format(unalign_ratio))
            if unalign_ratio > max_unalign:
                print('[!] skipping this due to bad alignment')
                continue
            write_hts_label(lab, lab_path)
        except Exception:
            # if something happens, skip it
            import traceback
            tb = traceback.format_exc()
            print('[!] ERROR while processing {}'.format(wav_paths[idx]))
            print('[!] StackTrace - ')
            print(tb)
8 changes: 8 additions & 0 deletions hparams.py
@@ -125,6 +125,14 @@
    # Forced garbage collection probability
    # Use only when MemoryError continues in Windows (Disabled by default)
    #gc_probability = 0.001,

    # json_meta mode only
    # 0: "use all",
    # 1: "ignore only unmatched_alignment",
    # 2: "fully ignore recognition",
    ignore_recognition_level = 2,
    min_text=20,  # when dealing with a non-dedicated speech dataset (e.g. movie excerpts), setting min_text above 15 is desirable; adjust per dataset
    process_only_htk_aligned = False,  # if True, data without a phoneme alignment file (.lab) will be ignored
)
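As a rough illustration of how the json_meta-only options above are meant to gate samples during preprocessing (my reading of the comments, not code from `preprocess.py`; `keep_sample()` is hypothetical, and the `ignore_recognition_level` handling is omitted because it depends on the json_meta implementation):

```
# Illustrative sketch only: how min_text and process_only_htk_aligned are
# intended to filter samples.  keep_sample() is a hypothetical helper.
import os

def keep_sample(text, wav_path, min_text=20, process_only_htk_aligned=False):
    if process_only_htk_aligned and not os.path.exists(
            os.path.splitext(wav_path)[0] + ".lab"):
        return False  # skip data without a phoneme alignment (.lab) file
    if len(text) < min_text:
        return False  # skip very short transcripts (noisy for the seq2seq model)
    return True
```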

