References:
[1]REN Y,RUAN Y J,TAN X,et al. FastSpeech:Fast,robust and controllable text to speech[C]//33rd Conference on Neural Information Processing Systems. Vancouver,Canada,2019.
[2]WANG Y,SKERRY-RYAN R J,STANTON D,et al. Tacotron:Towards end-to-end speech synthesis[J/OL]. arXiv Preprint arXiv:1703.10135,2017.
[3]SHEN J,PANG R,WEISS R J,et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions[C]//2018 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP). Calgary,AB,Canada:IEEE,2018.
[4]REN Y,HU C X,TAN X,et al. FastSpeech 2:Fast and high-quality end-to-end text to speech[J/OL]. arXiv Preprint arXiv:2006.04558,2020.
[5]DONAHUE J,DIELEMAN S,BIŃKOWSKI M,et al. End-to-end adversarial text-to-speech[J/OL]. arXiv Preprint arXiv:2006.03575,2020.
[6]CONG J,YANG S,XIE L,et al. Glow-WaveGAN:Learning speech representations from GAN-based variational auto-encoder for high fidelity flow-based speech synthesis[J/OL]. arXiv Preprint arXiv:2106.10831,2021.
[7]REZENDE D J,MOHAMED S. Variational inference with normalizing flows[J/OL]. arXiv Preprint arXiv:1505.05770,2015.
[8]KINGMA D P,WELLING M. Auto-encoding variational Bayes[J/OL]. arXiv Preprint arXiv:1312.6114,2013.
[9]YANG G,YANG S,LIU K,et al. Multi-band MelGAN:Faster waveform generation for high-quality text-to-speech[J/OL]. arXiv Preprint arXiv:2005.05106,2020.
[10]YU C,LU H,HU N,et al. DurIAN:Duration informed attention network for speech synthesis[J/OL]. arXiv Preprint arXiv:1909.01700,2019.
[11]CUI Y,WANG X,HE L,et al. An efficient sub-band linear prediction for LPCNet-based neural synthesis[C]//Interspeech 2020. Shanghai,China,2020:3555-3559.
[12]ZHANG Y M,CONG J,XUE H Y,et al. VISinger:Variational inference with adversarial learning for end-to-end singing voice synthesis[J/OL]. arXiv Preprint arXiv:2110.08813,2021.
[13]JU Y,KIM I,YANG H,et al. TriniTTS:Pitch-controllable end-to-end TTS without external aligner[C]//Interspeech 2022. Incheon,Korea,2022:16-20.
[14]KAWAMURA M,SHIRAHATA Y,YAMAMOTO R,et al. Lightweight and high-fidelity end-to-end text-to-speech with multi-band generation and inverse short-time Fourier transform[J/OL]. arXiv Preprint arXiv:2210.15975,2022.