
A Lightweight End-to-End Speech Synthesis System with Pitch Prediction

Journal of Nanjing Normal University (Engineering and Technology Edition) [ISSN: 1006-6977 / CN: 61-1281/TN]

Issue:
2023, No. 4
Page:
37-42
Research Field:
Computer Science and Technology
Publishing date:

Info

Title:
A Lightweight End-to-End Speech Synthesis System with Pitch Prediction
Author(s):
Liang Ting1, Askar Hamdulla1, Liu Huang2, Xu Ying2
(1. School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China)
(2. Shanghai GERZZ Interactive Information Technology Co., Ltd., Shanghai 200000, China)
Keywords:
end-to-end speech synthesis; prosodic prediction; ISTFT; VAE; flow; sub-band
CLC number:
TP391.1
DOI:
10.3969/j.issn.1672-1292.2023.04.005
Abstract:
This paper proposes a lightweight end-to-end speech synthesis model with pitch prediction. The model is based on VITS, an end-to-end speech generation model that combines a VAE-based posterior encoder with a normalizing-flow-based prior encoder and an adversarial decoder, and three improvements are made so that the synthesized speech is more rhythmical and more stable while inference is more efficient. Firstly, to improve pronunciation accuracy and the naturalness of speech, a length regulator and a frame prior network are introduced to obtain the frame-level mean and variance of the acoustic features, modeling the rich acoustic variation in speech, and a phone predictor with a CTC loss is added to improve pronunciation stability. Secondly, the ground-truth phoneme durations are used to align text and frames in the model, and an F0 predictor is added to enhance the sense of rhythm of the speech. Thirdly, the decoder in the original VITS model is replaced with multi-band generation and the inverse short-time Fourier transform (iSTFT), which effectively improves the inference speed of the model. Experiments show that, compared with the original VITS, the proposed model improves naturalness and expressiveness by 5% in MOS (mean opinion score) and speeds up inference by a factor of 3 in RTF (real-time factor).
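
To make two of the components above concrete, the following minimal PyTorch sketch illustrates (a) a length regulator that expands phoneme-level hidden states to frame level using ground-truth durations, and (b) a decoder head that predicts a complex spectrogram per sub-band and inverts it with the iSTFT. All class and parameter names here (LengthRegulator, ISTFTHead, n_fft, sub_bands, etc.) are illustrative assumptions, not identifiers from the paper or the VITS codebase, and the final sub-band recombination is left as a placeholder for a proper PQMF synthesis filter bank.

    import torch
    import torch.nn as nn

    class LengthRegulator(nn.Module):
        """Expands phoneme-level hidden states to frame level using durations."""
        def forward(self, x, durations):
            # x: (batch, n_phonemes, channels); durations: (batch, n_phonemes),
            # integer frame counts per phoneme.
            expanded = [torch.repeat_interleave(seq, dur, dim=0)
                        for seq, dur in zip(x, durations)]
            # Pad to the longest frame sequence in the batch.
            return nn.utils.rnn.pad_sequence(expanded, batch_first=True)

    class ISTFTHead(nn.Module):
        """Predicts log-magnitude and phase per sub-band, then applies iSTFT."""
        def __init__(self, channels, n_fft=16, hop_length=4, sub_bands=4):
            super().__init__()
            self.n_fft, self.hop_length, self.sub_bands = n_fft, hop_length, sub_bands
            n_bins = n_fft // 2 + 1
            # One (log-magnitude, phase) pair per frequency bin per sub-band.
            self.proj = nn.Conv1d(channels, sub_bands * n_bins * 2,
                                  kernel_size=7, padding=3)
            self.register_buffer("window", torch.hann_window(n_fft))

        def forward(self, h):
            # h: (batch, channels, n_frames) from the upstream decoder stack.
            b, _, t = h.shape
            x = self.proj(h).view(b, self.sub_bands, 2, -1, t)
            mag = torch.exp(x[:, :, 0]).reshape(b * self.sub_bands, -1, t)
            phase = x[:, :, 1].reshape(b * self.sub_bands, -1, t)
            spec = torch.polar(mag, phase)  # complex sub-band spectrograms
            wav = torch.istft(spec, self.n_fft, hop_length=self.hop_length,
                              window=self.window)
            # (batch, sub_bands, samples); a real system would now merge the
            # sub-bands with a PQMF synthesis filter bank (placeholder here).
            return wav.view(b, self.sub_bands, -1)

A quick shape check with toy inputs:

    lr, head = LengthRegulator(), ISTFTHead(channels=192)
    phon = torch.randn(1, 5, 192)            # 5 phoneme-level states
    dur = torch.tensor([[3, 2, 4, 1, 2]])    # ground-truth frame counts
    frames = lr(phon, dur)                   # (1, 12, 192)
    wav = head(frames.transpose(1, 2))       # (1, 4, samples)

Because the head only runs a cheap convolution plus a per-sub-band iSTFT instead of a deep learned upsampling stack, the decoder does far less work per output sample, which is the source of the RTF gain the abstract reports.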

References:

[1] REN Y, RUAN Y J, TAN X, et al. FastSpeech: fast, robust and controllable text to speech[C]//33rd Conference on Neural Information Processing Systems. Vancouver, Canada, 2019.
[2] WANG Y, SKERRY-RYAN R J, STANTON D, et al. Tacotron: towards end-to-end speech synthesis[J/OL]. arXiv preprint arXiv:1703.10135, 2017.
[3] SHEN J, PANG R, WEISS R J, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions[C]//2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Calgary, AB, Canada: IEEE, 2018.
[4] REN Y, HU C X, TAN X, et al. FastSpeech 2: fast and high-quality end-to-end text to speech[J/OL]. arXiv preprint arXiv:2006.04558, 2020.
[5] DONAHUE J, DIELEMAN S, BIŃKOWSKI M, et al. End-to-end adversarial text-to-speech[J/OL]. arXiv preprint arXiv:2006.03575, 2020.
[6] CONG J, YANG S, XIE L, et al. Glow-WaveGAN: learning speech representations from GAN-based variational auto-encoder for high fidelity flow-based speech synthesis[J/OL]. arXiv preprint arXiv:2106.10831, 2021.
[7] REZENDE D J, MOHAMED S. Variational inference with normalizing flows[J/OL]. arXiv preprint arXiv:1505.05770, 2015.
[8] KINGMA D P, WELLING M. Auto-encoding variational Bayes[J/OL]. arXiv preprint arXiv:1312.6114, 2013.
[9] YANG G, YANG S, LIU K, et al. Multi-band MelGAN: faster waveform generation for high-quality text-to-speech[J/OL]. arXiv preprint arXiv:2005.05106, 2020.
[10] YU C, LU H, HU N, et al. DurIAN: duration informed attention network for speech synthesis[J/OL]. arXiv preprint arXiv:1909.01700, 2019.
[11] CUI Y, WANG X, HE L, et al. An efficient sub-band linear prediction for LPCNet-based neural synthesis[C]//Interspeech 2020. Shanghai, China, 2020: 3555-3559.
[12] ZHANG Y M, CONG J, XUE H Y, et al. VISinger: variational inference with adversarial learning for end-to-end singing voice synthesis[J/OL]. arXiv preprint arXiv:2110.08813, 2021.
[13] JU Y, KIM I, YANG H, et al. TriniTTS: pitch-controllable end-to-end TTS without external aligner[C]//Interspeech 2022. Incheon, Korea, 2022: 16-20.
[14] KAWAMURA M, SHIRAHATA Y, YAMAMOTO R, et al. Lightweight and high-fidelity end-to-end text-to-speech with multi-band generation and inverse short-time Fourier transform[J/OL]. arXiv preprint arXiv:2210.15975, 2022.
