Reverse PR 0430 #6

Merged · 6 commits · Apr 30, 2018
2 changes: 2 additions & 0 deletions .gitignore
@@ -10,6 +10,8 @@ log
generated
data
text
datasets
testout

# Created by https://www.gitignore.io

57 changes: 52 additions & 5 deletions README.md
@@ -1,3 +1,5 @@
![alt text](assets/banner.jpg)

# Deepvoice3_pytorch

[![PyPI](https://img.shields.io/pypi/v/deepvoice3_pytorch.svg)](https://pypi.python.org/pypi/deepvoice3_pytorch)
@@ -21,8 +23,8 @@ A notebook supposed to be executed on https://colab.research.google.com is avail
- Convolutional sequence-to-sequence model with attention for text-to-speech synthesis
- Multi-speaker and single speaker versions of DeepVoice3
- Audio samples and pre-trained models
- Preprocessor for [LJSpeech (en)](https://keithito.com/LJ-Speech-Dataset/), [JSUT (jp)](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and [VCTK](http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html) datasets
- Language-dependent frontend text processor for English and Japanese
- Preprocessor for [LJSpeech (en)](https://keithito.com/LJ-Speech-Dataset/), [JSUT (jp)](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and [VCTK](http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html) datasets, as well as [carpedm20/multi-speaker-tacotron-tensorflow](https://github.com/carpedm20/multi-Speaker-tacotron-tensorflow) compatible custom dataset (in JSON format)
- Language-dependent frontend text processor for English and Japanese

### Samples

@@ -102,7 +104,7 @@ python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljs
- LJSpeech (en): https://keithito.com/LJ-Speech-Dataset/
- VCTK (en): http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html
- JSUT (jp): https://sites.google.com/site/shinnosuketakamichi/publication/jsut
- NIKL (ko): http://www.korean.go.kr/front/board/boardStandardView.do?board_id=4&mn_id=17&b_seq=464
- NIKL (ko) (**needs a Korean cellphone number to access**): http://www.korean.go.kr/front/board/boardStandardView.do?board_id=4&mn_id=17&b_seq=464

### 1. Preprocessing

@@ -128,6 +130,47 @@ python preprocess.py --preset=presets/deepvoice3_ljspeech.json ljspeech ~/data/L

When this is done, you will see extracted features (mel-spectrograms and linear spectrograms) in `./data/ljspeech`.

#### 1-1. Building a custom dataset (using json_meta)
Building your own dataset with metadata in JSON format (compatible with [carpedm20/multi-speaker-tacotron-tensorflow](https://github.com/carpedm20/multi-Speaker-tacotron-tensorflow)) is currently supported.
Usage:

```
python preprocess.py json_meta ${list-of-JSON-metadata-paths} ${out_dir} --preset=<json>
```
You may need to modify a pre-existing preset JSON file, especially `n_speakers`. For English multi-speaker training, start with `presets/deepvoice3_vctk.json`.

Assuming you have dataset A (Speaker A) and dataset B (Speaker B), described in the JSON metadata files `./datasets/datasetA/alignment.json` and `./datasets/datasetB/alignment.json` respectively, you can preprocess the data by:

```
python preprocess.py json_meta "./datasets/datasetA/alignment.json,./datasets/datasetB/alignment.json" "./datasets/processed_A+B" --preset=(path to preset json file)
```
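For illustration only, here is one assumed shape of such a metadata file, a plain mapping from audio paths to transcripts. The exact schema is defined by the carpedm20 tooling, so the paths and layout below are hypothetical:

```
# Illustrative sketch only: an assumed alignment.json layout (audio path ->
# transcript).  Check the carpedm20 repository for the exact schema produced
# by its alignment/recognition scripts.
import json

example_metadata = {
    "./datasets/datasetA/audio/utt_0001.wav": "First transcript sentence.",
    "./datasets/datasetA/audio/utt_0002.wav": "Second transcript sentence.",
}

with open("./datasets/datasetA/alignment.json", "w", encoding="utf-8") as f:
    json.dump(example_metadata, f, ensure_ascii=False, indent=2)
```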

#### 1-2. Preprocessing custom English datasets with long silences (based on [vctk_preprocess](vctk_preprocess/))

Some datasets, especially automatically generated ones, may include long silences and undesirable leading/trailing noise that undermine the char-level seq2seq model
(e.g. VCTK, although this is covered by vctk_preprocess).

To deal with the problem, `gentle_web_align.py` will
- **Prepare phoneme alignments for all utterances**
- Cut silences during preprocessing

`gentle_web_align.py` uses [Gentle](https://github.com/lowerquality/gentle), a Kaldi-based speech-text alignment tool. It accesses a web-served Gentle application, aligns the given sound segments with their transcripts, and converts the result into HTK-style label files to be processed by `preprocess.py`. Gentle can be run on Linux/Mac/Windows (via Docker).
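For reference, a minimal sketch of the label conversion performed by `write_hts_label()` in `gentle_web_align.py`: Gentle reports timings in seconds, and each HTK-style label line stores start/end times in 100 ns units followed by the phone symbol. The helper below is hypothetical, not part of the repository.

```
# Minimal sketch of one HTK-style label line, mirroring write_hts_label()
# in gentle_web_align.py: times in seconds are scaled to 100 ns units.
# hts_label_line() itself is a hypothetical helper, not repository code.
def hts_label_line(start_sec, end_sec, phone):
    return "{} {} {}".format(int(start_sec * 1e7), int(end_sec * 1e7), phone)

print(hts_label_line(0.00, 0.25, "silB"))  # -> "0 2500000 silB"
```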

Preliminary results show that while the HTK/festival/merlin-based method in `vctk_preprocess/prepare_vctk_labels.py` works better on VCTK, Gentle is more stable on audio clips with ambient noise (e.g. movie excerpts).

Usage:
(Assuming Gentle is running at `localhost:8567`; this is the default when not specified.)
1. When sound files and transcript files are saved in separate folders (e.g. sound files in `datasetA/wavs` and transcripts in `datasetA/txts`):
```
python gentle_web_align.py -w "datasetA/wavs/*.wav" -t "datasetA/txts/*.txt" --server_addr=localhost --port=8567
```

2. When sound files and transcript files are saved in a nested structure (e.g. `datasetB/speakerN/blahblah.wav` and `datasetB/speakerN/blahblah.txt`):
```
python gentle_web_align.py --nested-directories="datasetB" --server_addr=localhost --port=8567
```
**Once you have a phoneme alignment for each utterance, you can extract features by running `preprocess.py`.**

### 2. Training

Usage:
@@ -139,7 +182,7 @@ python train.py --data-root=${data-root} --preset=<json> --hparams="parameters y
Suppose you build a DeepVoice3-style model using the LJSpeech dataset; you can then train your model by:

```
python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljspeech/
python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljspeech/
```

Model checkpoints (.pth) and alignments (.png) are saved in the `./checkpoints` directory every 10000 steps by default.
@@ -247,11 +290,15 @@ From my experience, it can get reasonable speech quality very quickly rather tha
There are two important options used above:

- `--restore-parts=<N>`: It specifies where to load model parameters from. The differences from `--checkpoint=<N>` are: 1) `--restore-parts=<N>` ignores all invalid parameters, while `--checkpoint=<N>` doesn't; 2) `--restore-parts=<N>` tells the trainer to start from step 0, while `--checkpoint=<N>` tells the trainer to continue from the last step. `--checkpoint=<N>` should be fine if you are continuing to train exactly the same model, while `--restore-parts=<N>` is useful if you want to customize your model architecture and take advantage of a pre-trained model.
- `--speaker-id=<N>`: It specifies which speaker's data is used for training. This should only be specified when using a multi-speaker dataset. For VCTK, speaker ids are automatically assigned incrementally (0, 1, ..., 107) according to the `speaker_info.txt` in the dataset.

If you are training a multi-speaker model, speaker adaptation will only work **when `n_speakers` is identical**.

## Acknowledgements

Part of the code was adapted from the following projects:

- https://github.com/keithito/tacotron
- https://github.com/facebookresearch/fairseq-py

Banner and logo created by [@jraulhernandezi](https://github.com/jraulhernandezi) ([#76](https://github.com/r9y9/deepvoice3_pytorch/issues/76))
Binary file added assets/banner.jpg
2 changes: 1 addition & 1 deletion deepvoice3_pytorch/conv.py
@@ -40,7 +40,7 @@ def incremental_forward(self, input):
self.input_buffer[:, :-1, :] = self.input_buffer[:, 1:, :].clone()
# append next input
self.input_buffer[:, -1, :] = input[:, -1, :]
input = torch.Tensor(self.input_buffer)
input = self.input_buffer.clone()
if dilation > 1:
input = input[:, 0::dilation, :].contiguous()
output = F.linear(input.view(bsz, -1), weight, self.bias)
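(Aside, not part of the diff.) My reading of this change: `self.input_buffer.clone()` copies the buffer while preserving its dtype and device and allocating fresh storage, whereas the legacy `torch.Tensor(...)` constructor goes through the default tensor type. A minimal sketch under that assumption:

```
# Illustrative sketch: .clone() preserves the source buffer's dtype/device and
# allocates independent storage.
import torch

buf = torch.randn(1, 4, 8)                  # stand-in for self.input_buffer
copied = buf.clone()
assert copied.dtype == buf.dtype and copied.device == buf.device
assert torch.equal(copied, buf)             # same values...
assert copied.data_ptr() != buf.data_ptr()  # ...but independent storage
```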
5 changes: 3 additions & 2 deletions docs/config.toml
@@ -1,11 +1,12 @@
baseURL = "https://r9y9.github.io/deepvoice3_pytorch/"
languageCode = "ja-jp"
title = "An open source implementation of DeepVoice 3: 2000-Speaker Neural Text-to-Speech"
title = "An open source implementation of Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning"
author = "Ryuichi YAMAMOTO"

[params]
author = "Ryuichi YAMAMOTO"
logo = "/images/r9y9.jpg"
project = "deepvoice3_pytorch"
logo = "/images/512logotipo.png"
twitter = "r9y9"
github = "r9y9"
analytics = "UA-44433856-1"
2 changes: 1 addition & 1 deletion docs/content/index.md
@@ -12,7 +12,7 @@ type = "index"

- Github: https://github.com/r9y9/deepvoice3_pytorch

This page provides audio samples for the open source implementation of DeepVoice3. Samples from single speaker and multi-speaker models follow.
This page provides audio samples for the open source implementation of [Deep Voice 3](https://arxiv.org/abs/1710.07654). Samples from single speaker and multi-speaker models follow.

## Single speaker

5 changes: 0 additions & 5 deletions docs/layouts/partials/footer.html
@@ -3,11 +3,6 @@
<div class="hr"></div>
<address>
<div class="avatar-bottom">
<a href="/">
{{ with .Site.Params.logo }}
<img src="{{ . }}">
{{ end }}
</a>
</div>

<div class="copyright">Copyright &copy;
12 changes: 6 additions & 6 deletions docs/layouts/partials/header.html
@@ -6,11 +6,11 @@
<meta name="viewport" content="width=device-width, initial-scale=1">
<link href="https://fonts.googleapis.com/css?family=Roboto:300,400,700" rel="stylesheet" type="text/css">
<link rel="stylesheet" href="//cdnjs.cloudflare.com/ajax/libs/highlight.js/8.4/styles/github.min.css">
<link rel="stylesheet" href="/css/normalize.css">
<link rel="stylesheet" href="/css/skeleton.css">
<link rel="stylesheet" href="/css/custom.css">
<link rel="alternate" href="/index.xml" type="application/rss+xml" title="{{ .Site.Title }}">
<link rel="shortcut icon" href="/favicon.png" type="image/x-icon" />
<link rel="stylesheet" href="/{{ .Site.Params.Project }}/css/normalize.css">
<link rel="stylesheet" href="/{{ .Site.Params.Project }}/css/skeleton.css">
<link rel="stylesheet" href="/{{ .Site.Params.Project }}/css/custom.css">
<link rel="alternate" href="/{{ .Site.Params.Project }}/index.xml" type="application/rss+xml" title="{{ .Site.Title }}">
<link rel="shortcut icon" href="/{{ .Site.Params.Project }}/favicon.png" type="image/x-icon" />
<title>{{ $isHomePage := eq .Title .Site.Title }}{{ .Title }}{{ if eq $isHomePage false }} - {{ .Site.Title }}{{ end }}</title>
</head>
<body>
@@ -19,7 +19,7 @@

<header role="banner">
<div class="header-logo">
<a href="/"><img src="{{ .Site.Params.logo }}" width="70" height="70"></a>
<a href="https://github.com/r9y9/deepvoice3_pytorch"><img src="/{{ .Site.Params.Project }}/{{ .Site.Params.logo }}" width="140" height="140"></a>
</div>
{{ if eq $isHomePage true }}<h1 class="site-title">{{ .Site.Title }}</h1>{{ end }}
</header>
9 changes: 0 additions & 9 deletions docs/static/css/custom.css
@@ -62,15 +62,6 @@ main {
max-width: 700px;
}

.header-logo img {
border-radius: 50%;
border: 2px solid #E1E1E1;
}

.header-logo img:hover {
border-color: #F1F1F1;
}

.site-title {
margin-top: 2rem;
}
Binary file modified docs/static/favicon.png
Binary file added docs/static/images/512logotipo.png
Binary file removed docs/static/images/r9y9.jpg
153 changes: 153 additions & 0 deletions gentle_web_align.py
@@ -0,0 +1,153 @@
# -*- coding: utf-8 -*-
"""
Created on Sat Apr 21 09:06:37 2018
Phoneme alignment and conversion to HTK-style label files using web-served Gentle.
This works on any type of English dataset.
Unlike prepare_htk_alignments_vctk.py, this is Python 3 and Windows (with Docker) compatible.
Preliminary results show that Gentle has better performance with noisy datasets
(e.g. movie-extracted audio clips).
*This work was derived from vctk_preprocess/prepare_htk_alignments_vctk.py
@author: engiecat(github)

usage:
    gentle_web_align.py (-w wav_pattern) (-t text_pattern) [options]
    gentle_web_align.py (--nested-directories=<main_directory>) [options]

options:
    -w <wav_pattern> --wav_pattern=<wav_pattern>  Pattern of wav files to be aligned
    -t <txt_pattern> --txt_pattern=<txt_pattern>  Pattern of txt transcript files to be aligned (same name required)
    --nested-directories=<main_directory>         Process every wav/txt file in the subfolders of the given folder
    --server_addr=<server_addr>                   Server address that serves Gentle. [default: localhost]
    --port=<port>                                 Server port that serves Gentle. [default: 8567]
    --max_unalign=<max_unalign>                   Maximum threshold for unalignment occurrence (0.0 ~ 1.0) [default: 0.3]
    --skip-already-done                           Skip if there is a pre-existing .lab file
    -h --help                                     Show this help message and exit
"""

from docopt import docopt
from glob import glob
from tqdm import tqdm
import os.path
import requests
import numpy as np


def write_hts_label(labels, lab_path):
    lab = ""
    for s, e, l in labels:
        s, e = float(s) * 1e7, float(e) * 1e7
        s, e = int(s), int(e)
        lab += "{} {} {}\n".format(s, e, l)
    print(lab)
    with open(lab_path, "w", encoding='utf-8') as f:
        f.write(lab)


def json2hts(data):
    emit_bos = False
    emit_eos = False

    phone_start = 0
    phone_end = None
    labels = []
    failure_count = 0

    for word in data["words"]:
        case = word["case"]
        if case != "success":
            failure_count += 1  # instead of failing everything,
            # raise RuntimeError("Alignment failed")
            continue
        start = float(word["start"])
        word_end = float(word["end"])

        if not emit_bos:
            labels.append((phone_start, start, "silB"))
            emit_bos = True

        phone_start = start
        phone_end = None
        for phone in word["phones"]:
            ph = str(phone["phone"][:-2])
            duration = float(phone["duration"])
            phone_end = phone_start + duration
            labels.append((phone_start, phone_end, ph))
            phone_start += duration
        assert np.allclose(phone_end, word_end)
    if not emit_eos:
        labels.append((phone_start, phone_end, "silE"))
        emit_eos = True
    unalign_ratio = float(failure_count) / len(data['words'])
    return unalign_ratio, labels


def gentle_request(wav_path, txt_path, server_addr, port, debug=False):
    print('\n')
    response = None
    wav_name = os.path.basename(wav_path)
    txt_name = os.path.basename(txt_path)
    if os.path.splitext(wav_name)[0] != os.path.splitext(txt_name)[0]:
        print(' [!] wav name and transcript name do not match - exiting...')
        return response
    with open(txt_path, 'r', encoding='utf-8-sig') as txt_file:
        print('Transcript - ' + ''.join(txt_file.readlines()))
    with open(wav_path, 'rb') as wav_file, open(txt_path, 'rb') as txt_file:
        params = (('async', 'false'),)
        files = {'audio': (wav_name, wav_file),
                 'transcript': (txt_name, txt_file),
                 }
        server_path = 'http://' + server_addr + ':' + str(port) + '/transcriptions'
        response = requests.post(server_path, params=params, files=files)
        if response.status_code != 200:
            print(' [!] External server({}) returned bad response({})'.format(server_path, response.status_code))
    if debug:
        print('Response')
        print(response.json())
    return response


if __name__ == '__main__':
    arguments = docopt(__doc__)
    server_addr = arguments['--server_addr']
    port = int(arguments['--port'])
    max_unalign = float(arguments['--max_unalign'])
    if arguments['--nested-directories'] is None:
        wav_paths = sorted(glob(arguments['--wav_pattern']))
        txt_paths = sorted(glob(arguments['--txt_pattern']))
    else:
        # if this is a multi-foldered environment
        # (e.g. DATASET/speaker1/blahblah.wav)
        wav_paths = []
        txt_paths = []
        topdir = arguments['--nested-directories']
        subdirs = [f for f in os.listdir(topdir) if os.path.isdir(os.path.join(topdir, f))]
        for subdir in subdirs:
            wav_pattern_subdir = os.path.join(topdir, subdir, '*.wav')
            txt_pattern_subdir = os.path.join(topdir, subdir, '*.txt')
            wav_paths.extend(sorted(glob(wav_pattern_subdir)))
            txt_paths.extend(sorted(glob(txt_pattern_subdir)))

    t = tqdm(range(len(wav_paths)))
    for idx in t:
        try:
            t.set_description("Align via Gentle")
            wav_path = wav_paths[idx]
            txt_path = txt_paths[idx]
            lab_path = os.path.splitext(wav_path)[0] + '.lab'
            if os.path.exists(lab_path) and arguments['--skip-already-done']:
                print('[!] skipping because of pre-existing .lab file - {}'.format(lab_path))
                continue
            res = gentle_request(wav_path, txt_path, server_addr, port)
            unalign_ratio, lab = json2hts(res.json())
            print('[*] Unaligned Ratio - {}'.format(unalign_ratio))
            if unalign_ratio > max_unalign:
                print('[!] skipping this due to bad alignment')
                continue
            write_hts_label(lab, lab_path)
        except Exception:
            # if something happens, skip it
            import traceback
            tb = traceback.format_exc()
            print('[!] ERROR while processing {}'.format(wav_paths[idx]))
            print('[!] StackTrace - ')
            print(tb)
8 changes: 8 additions & 0 deletions hparams.py
@@ -125,6 +125,14 @@
    # Forced garbage collection probability
    # Use only when MemoryError continues in Windows (Disabled by default)
    #gc_probability = 0.001,

    # json_meta mode only
    # 0: "use all",
    # 1: "ignore only unmatched_alignment",
    # 2: "fully ignore recognition",
    ignore_recognition_level = 2,
    min_text=20,  # when dealing with a non-dedicated speech dataset (e.g. movie excerpts), setting min_text above 15 is desirable; adjust per dataset
    process_only_htk_aligned = False,  # if True, data without a phoneme alignment file (.lab) will be ignored
)
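As a rough illustration of how the json_meta-only options above are meant to gate samples during preprocessing (my reading of the comments, not code from `preprocess.py`; `keep_sample()` is hypothetical, and the `ignore_recognition_level` handling is omitted because it depends on the json_meta implementation):

```
# Illustrative sketch only: how min_text and process_only_htk_aligned are
# intended to filter samples.  keep_sample() is a hypothetical helper.
import os

def keep_sample(text, wav_path, min_text=20, process_only_htk_aligned=False):
    if process_only_htk_aligned and not os.path.exists(
            os.path.splitext(wav_path)[0] + ".lab"):
        return False  # skip data without a phoneme alignment (.lab) file
    if len(text) < min_text:
        return False  # skip very short transcripts (noisy for the seq2seq model)
    return True
```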

