 Liang Ting,Askar Hamdulla,Liu Huang,et al.A Lightweight End-to-End Speech Synthesis System with Pitch Prediction[J].Journal of Nanjing Normal University(Engineering and Technology),2023,23(04):037-42.[doi:10.3969/j.issn.1672-1292.2023.04.005]

A Lightweight End-to-End Speech Synthesis System with Pitch Prediction

Journal of Nanjing Normal University (Engineering and Technology) [ISSN:1006-6977/CN:61-1281/TN]

Volume: 23
Issue: No. 4, 2023
Pages: 37-42
Column: Computer Science and Technology
Publication date: 2023-12-15

Article Info

Title:
A Lightweight End-to-End Speech Synthesis System with Pitch Prediction
Article ID:
1672-1292(2023)04-0037-06
作者:
梁 婷1艾斯卡尔·艾木都拉1刘 煌2徐 颖2
(1.新疆大学信息科学与工程学院,新疆 乌鲁木齐 830046)
(2.上海格子互动信息技术有限公司,上海 200000)
Author(s):
Liang Ting1Askar Hamdulla1Liu Huang2Xu Ying2
(1.School of Information Science and Engineering,Xinjiang University,Wulumuqi 830046,China)
(2.Shanghai GERZZ Interactive Information Technology Co.,Ltd,Shanghai 200000,China)
Keywords:
end-to-end speech synthesis; prosodic prediction; ISTFT; VAE; flow; sub-band
CLC number:
TP391.1
DOI:
10.3969/j.issn.1672-1292.2023.04.005
Document code:
A
Abstract:
This paper proposes a lightweight, fully end-to-end speech synthesis model with controllable pitch. The model is based on VITS, a fully end-to-end speech synthesis model that couples a VAE-based posterior encoder with a normalizing-flow-based prior encoder and an adversarial decoder, and makes three improvements so that the synthesized speech is more rhythmic, pronunciation is more stable, and inference is faster. First, a length regulator and a frame prior network are introduced to obtain fine-grained frame-level means and variances of the acoustic features, modeling the rich acoustic variation in speech, and a phoneme predictor with a CTC loss is added to improve the stability of pronunciation. Second, ground-truth phoneme durations are used to align text with audio frames, and an F0 predictor is added to enhance the prosody of the speech. Third, the decoder of the original VITS is replaced with multi-band generation and an inverse short-time Fourier transform, which effectively improves inference speed. MOS (mean opinion score) tests and RTF (real-time factor) are used as the subjective and objective evaluation criteria, respectively. Experiments show that the proposed model improves naturalness and expressiveness by at least 5% in MOS and is 3 times faster in inference than the original VITS in terms of RTF.
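The abstract describes the three modifications only in prose. The sketch below, written in PyTorch purely for illustration and not taken from the authors' code, shows two of them under stated assumptions: how ground-truth phoneme durations can expand phoneme-level encoder outputs to frame level for the frame prior network and F0 predictor, and how a decoder that predicts magnitude and phase spectra can reconstruct a waveform with an inverse STFT instead of a large upsampling network. All function names, tensor shapes, and hyperparameters here are hypothetical.

import torch

def length_regulate(phoneme_hidden: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    # phoneme_hidden: (num_phonemes, hidden_dim) text-encoder outputs
    # durations: (num_phonemes,) integer frame counts per phoneme
    #            (ground truth during training, predicted at inference)
    # Repeating each phoneme vector durations[i] times yields a frame-aligned
    # sequence for the frame prior network and the F0 predictor.
    return torch.repeat_interleave(phoneme_hidden, durations, dim=0)

def istft_decode(magnitude: torch.Tensor, phase: torch.Tensor,
                 n_fft: int = 16, hop_length: int = 4) -> torch.Tensor:
    # magnitude, phase: (n_fft // 2 + 1, num_frames) spectra predicted by a
    # small convolutional decoder; only a single band is shown here.
    spec = magnitude * torch.exp(1j * phase)  # complex spectrogram
    window = torch.hann_window(n_fft)
    return torch.istft(spec, n_fft=n_fft, hop_length=hop_length,
                       win_length=n_fft, window=window)

if __name__ == "__main__":
    hidden = torch.randn(4, 192)              # 4 phonemes, 192-dim hidden states
    dur = torch.tensor([3, 5, 2, 4])          # frames per phoneme, sums to 14
    frames = length_regulate(hidden, dur)     # -> (14, 192)
    mag = torch.rand(9, frames.size(0))       # 9 = 16 // 2 + 1 frequency bins
    ph = torch.rand(9, frames.size(0)) * 3.14
    wav = istft_decode(mag, ph)               # short waveform snippet
    print(frames.shape, wav.shape)

In the full multi-band setting each sub-band would receive its own magnitude/phase pair and the per-band waveforms would be merged by a synthesis filter bank (e.g. PQMF); the real-time factor used as the objective metric is simply the synthesis wall-clock time divided by the duration of the generated audio.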

References:

[1]REN Y,RUAN Y J,TAN X,et al. FastSpeech:Fast,robust and controllable text to speech[C]//33rd Conference on Neural Information Processing Systems. Vancouver,Canada,2019.
[2]WANG Y,SKERRY-RYAN R J,STANTON D,et al. Tacotron:Towards end-to-end speech synthesis[J/OL]. arXiv Preprint arXiv:1703.10135,2017.
[3]SHEN J,PANG R,WEISS R J,et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions[C]//2018 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP). Calgary,AB,Canada:IEEE,2018.
[4]REN Y,HU C X,TAN X,et al. FastSpeech 2:Fast and high-quality end-to-end text to speech[J/OL]. arXiv Preprint arXiv:2006.04558,2020.
[5]DONAHUE J,DIELEMAN S,BIŃKOWSKI M,et al. End-to-end adversarial text-to-speech[J/OL]. arXiv Preprint arXiv:2006.03575,2020.
[6]CONG J,YANG S,XIE L,et al. Glow-WaveGAN:Learning speech representations from GAN based variational auto-encoder for high fidelity flow-based speech synthesis[J/OL]. arXiv Preprint arXiv:2106.10831,2021.
[7]REZENDE D J,MOHAMED S. Variational inference with normalizing flows[J/OL]. arXiv Preprint arXiv:1505.05770,2015.
[8]KINGMA D P,WELLING M. Auto-encoding variational bayes[J/OL]. arXiv Preprint arXiv:1312.6114,2013.
[9]YANG G,YANG S,LIU K,et al. Multi-band MelGAN:Faster waveform generation for high-quality text-to-speech[J/OL]. arXiv Preprint arXiv:2005.05106,2021.
[10]YU C,LU H,HU N,et al. DurIAN:Duration informed attention network for speech synthesis[J/OL]. arXiv Preprint arXiv:1909.01700,2019.
[11]CUI Y,WANG X,HE L,et al. An efficient sub-band linear prediction for LPCNet-based neural synthesis[C]//Interspeech 2020. Shanghai,China,2020:3555-3559.
[12]ZHANG Y M,CONG J,XUE H Y,et al. VISinger:Variational inference with adversarial learning for end-to-end singing voice synthesis[J/OL]. arXiv Preprint arXiv:2110.08813,2021.
[13]JU Y,KIM I,YANG H,et al. TriniTTS:Pitch-controllable end-to-end TTS without external aligner[C]//Interspeech 2022. Incheon,Korea,2022:16-20.
[14]KAWAMURA M,SHIRAHATA Y,YAMAMOTO R,et al. Lightweight and high-fidelity end-to-end text-to-speech with multi-band generation and inverse short-time fourier transform[J/OL]. arXiv Preprint arXiv:2210.15975,2022.

Memo:
Received: 2023-04-24.
Corresponding author: Askar Hamdulla, Ph.D., professor. Main research interests: speech synthesis, natural language processing, and speech recognition. E-mail: askar@xju.edu.cn
Last Update: 2023-12-15