ZihaoZhao/speech-to-text-wavenet

 
 

Speech-to-Text-WaveNet2 : End-to-end sentence level English speech recognition using DeepMind's WaveNet

A TensorFlow implementation of speech recognition based on DeepMind's WaveNet: A Generative Model for Raw Audio (hereafter "the Paper").

The architecture is shown in the following figure.

(Some images are cropped from [WaveNet: A Generative Model for Raw Audio](https://arxiv.org/abs/1609.03499) and [Neural Machine Translation in Linear Time](https://arxiv.org/abs/1610.10099))
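
As a rough illustration of the dilated causal convolution described in the Paper, the following pure-Python sketch applies one such layer to a list of samples and computes the receptive field of a stack of layers. The function names, the kernel size, and the dilation values are assumptions chosen for illustration; they are not taken from this repository's model code.

```python
from typing import List, Sequence


def dilated_causal_conv(x: Sequence[float], weights: Sequence[float],
                        dilation: int) -> List[float]:
    """1-D causal convolution: out[t] uses only x[t], x[t-d], x[t-2d], ..."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation  # look back k * dilation steps
            if idx >= 0:  # samples before the start count as zero padding
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Number of input samples that can influence a single output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical stack: kernel size 2, dilation doubling at every layer.
print(receptive_field(2, [1, 2, 4, 8]))  # 16
```

Doubling the dilation per layer is what lets the receptive field grow exponentially with depth while the number of weights grows only linearly.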

Version

Current Version : 2.1.0.0

  • demo
  • test
  • train
  • train model

Dependencies

  1. tensorflow >= 1.12.0
  2. librosa
  3. glog
  4. nltk

If you have problems with the librosa library, try installing ffmpeg with the following commands (Ubuntu 14.04).


sudo add-apt-repository ppa:mc3man/trusty-media
sudo apt-get update
sudo apt-get dist-upgrade -y
sudo apt-get -y install ffmpeg
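
Before running any of the scripts, it may help to confirm that the dependencies listed above are importable and that ffmpeg is on the PATH. The helper below is a minimal sketch using only the standard library; the function name is hypothetical, and it does not verify the tensorflow >= 1.12.0 version requirement.

```python
import importlib.util
import shutil
from typing import Dict, Iterable


def check_dependencies(
    modules: Iterable[str] = ("tensorflow", "librosa", "glog", "nltk"),
) -> Dict[str, bool]:
    """Return {module_name: True/False} depending on whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
    # ffmpeg is an external program, so look for it on the PATH instead.
    print(f"ffmpeg: {'ok' if shutil.which('ffmpeg') else 'MISSING'}")
```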

Dataset

Audio was augmented using the scheme in Tom Ko et al.'s paper. (Thanks to @migvel for the information.)
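
The augmentation code itself is not shown here. Assuming the scheme referred to is the speed perturbation from Ko et al., "Audio Augmentation for Speech Recognition" (which resamples each utterance at speed factors such as 0.9, 1.0, and 1.1), the idea can be sketched in pure Python with linear interpolation. The function name and the interpolation method are illustrative choices, not this repository's implementation.

```python
from typing import List, Sequence


def speed_perturb(samples: Sequence[float], factor: float) -> List[float]:
    """Resample so the audio plays `factor` times faster (factor > 1 shortens it)."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    n_out = max(1, int(round(len(samples) / factor)))
    out = []
    for i in range(n_out):
        pos = i * factor  # fractional position in the original signal
        lo = int(pos)
        if lo >= len(samples) - 1:
            out.append(float(samples[-1]))  # clamp at the last sample
            continue
        frac = pos - lo
        # Linear interpolation between the two neighbouring samples.
        out.append((1 - frac) * samples[lo] + frac * samples[lo + 1])
    return out


# Three copies of one utterance at the assumed speed factors.
utterance = [float(i) for i in range(100)]
augmented = [speed_perturb(utterance, f) for f in (0.9, 1.0, 1.1)]
```

A production pipeline would normally use a proper resampler with anti-aliasing rather than linear interpolation.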

Usage

Execute

python ***.py --help

to get help on using any of the scripts (***.py).

Create dataset

  1. Download and extract the dataset (only VCTK is supported for now; others are coming soon).
  2. Assuming the VCTK dataset is located at /zhzhao/VCTK, execute
python tools/create_tf_record.py -input_dir='/zhzhao/VCTK'

to create the TFRecord files for training or testing.
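
To check the extracted dataset before converting it, the audio files can be paired with their transcripts. The sketch below assumes the commonly distributed VCTK layout, with audio under wav48/<speaker>/ and transcripts under txt/<speaker>/; that layout, the function name, and the parameters are assumptions, and this is not how tools/create_tf_record.py is known to work.

```python
from pathlib import Path
from typing import List, Tuple


def list_utterances(root: str, wav_subdir: str = "wav48",
                    txt_subdir: str = "txt") -> List[Tuple[Path, str]]:
    """Return (wav_path, transcript) pairs for utterances that have both files."""
    root_path = Path(root)
    pairs = []
    for wav in sorted((root_path / wav_subdir).glob("*/*.wav")):
        # Assumed naming: txt/<speaker>/<utterance>.txt mirrors the wav path.
        txt = root_path / txt_subdir / wav.parent.name / (wav.stem + ".txt")
        if txt.is_file():
            pairs.append((wav, txt.read_text(encoding="utf-8").strip()))
    return pairs


if __name__ == "__main__":
    found = list_utterances("/zhzhao/VCTK")
    print(f"{len(found)} utterances with transcripts")
```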

Train

  1. Rename config/config.json.example to config/english-28.json
  2. Execute
python train.py

to train the model.
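
The configuration step can also be scripted. The following is a minimal sketch that copies the example file rather than renaming it (so the example stays in place) and checks that the result parses as JSON. The function name is hypothetical, and the keys inside the config file are not assumed.

```python
import json
import shutil
from pathlib import Path


def prepare_config(config_dir: str = "config",
                   example: str = "config.json.example",
                   target: str = "english-28.json") -> dict:
    """Copy the example config to the target name (if missing) and load it."""
    src = Path(config_dir) / example
    dst = Path(config_dir) / target
    if not dst.exists():
        shutil.copyfile(src, dst)  # keep the example file untouched
    with dst.open(encoding="utf-8") as f:
        return json.load(f)  # raises ValueError if the file is not valid JSON


if __name__ == "__main__":
    config = prepare_config()
    print(sorted(config))  # list the top-level keys
```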

Test

Execute

python test.py

to evaluate the model.

Demo

  1. Download the pretrained model (buriburisuri model) and extract it to the 'release' directory.

  2. Execute


python demo.py -input_path 

to transcribe a speech wave file into an English sentence. The result is printed to the console.

For example, try the following command.


python demo.py -input_path=data/demo.wav -ckpt_dir=release/buriburisuri

The result will be as follows:


please scool stella

The ground truth is as follows:


PLEASE SCOOL STELLA

Because there is no language model, the output may differ from the ground truth in capitalization and punctuation, and some words may be misspelled.
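
One way to quantify such differences is a character error rate based on edit distance. The sketch below is a standard-library illustration and is not the metric test.py is known to report; the function names and the ignore_case option are assumptions. Ignoring case, the demo result above matches its ground truth exactly.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def char_error_rate(hypothesis: str, reference: str,
                    ignore_case: bool = True) -> float:
    """Edit distance divided by the reference length."""
    if ignore_case:
        hypothesis, reference = hypothesis.lower(), reference.lower()
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(hypothesis, reference) / len(reference)


print(char_error_rate("please scool stella", "PLEASE SCOOL STELLA"))  # 0.0
```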

Pretrained models

  1. buriburisuri model: model converted from https://github.com/buriburisuri/speech-to-text-wavenet.

Future works

  1. try tokenizing the English labels with nltk
  2. train with all punctuation
  3. add attention layer

Other resources

  1. buriburisuri's speech-to-text-wavenet
  2. ibab's WaveNet(speech synthesis) tensorflow implementation
  3. tomlepaine's Fast WaveNet(speech synthesis) tensorflow implementation

Namju's other repositories

  1. SugarTensor
  2. EBGAN tensorflow implementation
  3. Timeseries gan tensorflow implementation
  4. Supervised InfoGAN tensorflow implementation
  5. AC-GAN tensorflow implementation
  6. SRGAN tensorflow implementation
  7. ByteNet-Fast Neural Machine Translation

Citation

If you find this code useful please cite us in your work:


Kim and Park. Speech-to-Text-WaveNet. 2016. GitHub repository. https://github.com/buriburisuri/.

Authors

Namju Kim at KakaoBrain Corp.

Kyubyong Park at KakaoBrain Corp.
