Abstract: Recent advancements in speech synthesis have leveraged GAN-based networks like HiFi-GAN and BigVGAN to produce high-fidelity waveforms from mel-spectrograms. However, these networks are computationally expensive and parameter-heavy. iSTFTNet addresses these limitations by integrating the inverse short-time Fourier transform (iSTFT) into the network, achieving both speed and parameter efficiency. In this paper, we introduce an extension to iSTFTNet, termed HiFTNet, which incorporates a harmonic-plus-noise source filter in the time-frequency domain that uses a sinusoidal source derived from the fundamental frequency (F0), inferred via a pre-trained F0 estimation network, for fast inference speed. Subjective evaluations on LJSpeech show that our model significantly outperforms both iSTFTNet and HiFi-GAN, achieving ground-truth-level performance. HiFTNet also outperforms BigVGAN-base on LibriTTS for unseen speakers and achieves comparable performance to BigVGAN while being four times faster with only 1/6 of the parameters. Our work sets a new benchmark for efficient, high-quality neural vocoding, paving the way for real-time applications that demand high-quality speech synthesis.
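To illustrate the harmonic-plus-noise source idea mentioned in the abstract, the following is a minimal, self-contained sketch (not the authors' implementation) of how a sinusoidal excitation can be built from a frame-level F0 contour. All function names, constants, and hyperparameters here (e.g. `num_harmonics`, `noise_std`) are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch of an NSF-style harmonic-plus-noise excitation from F0.
# Not the HiFTNet code; names and constants are assumptions for illustration.
import numpy as np

def harmonic_plus_noise_source(f0_frames, hop_length=256, sample_rate=22050,
                               num_harmonics=8, noise_std=0.003):
    """Upsample frame-level F0 to the sample rate and synthesize a sum of harmonics.

    f0_frames: 1-D array of F0 values in Hz, one per frame (0 for unvoiced frames).
    Returns a sample-level excitation signal (harmonics plus Gaussian noise).
    """
    # Upsample F0 from frame rate to sample rate by simple repetition.
    f0 = np.repeat(f0_frames, hop_length).astype(np.float64)
    voiced = (f0 > 0).astype(np.float64)

    # Instantaneous phase: cumulative sum of per-sample angular increments.
    phase = 2.0 * np.pi * np.cumsum(f0 / sample_rate)

    # Sum of harmonics, masked to voiced regions.
    excitation = np.zeros_like(f0)
    for k in range(1, num_harmonics + 1):
        excitation += np.sin(k * phase)
    excitation = (excitation / num_harmonics) * voiced

    # Small noise component everywhere; it dominates in unvoiced regions.
    excitation += noise_std * np.random.randn(len(f0))
    return excitation

# Example: a ~1-second utterance at 100 Hz with a short unvoiced gap.
frames = np.full(22050 // 256, 100.0)
frames[30:40] = 0.0
source = harmonic_plus_noise_source(frames)
```

In HiFTNet this kind of source signal conditions the time-frequency-domain filter before the iSTFT stage; the sketch above only shows the excitation construction itself.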
This page contains a set of audio samples in support of the paper. Some examples are randomly selected directly from the sets we used for evaluation.
All utterances were unseen during training, and the results are uncurated (NOT cherry-picked) unless otherwise specified.
It is highly recommended to listen to the audio samples with headphones. For more samples, please refer to the surveys used for our CMOS evaluation here and here.
The following audio samples are sourced from the BigVGAN demo page to demonstrate that HiFTNet is as robust as BigVGAN for zero-shot waveform synthesis when trained on the LibriTTS dataset. It performs well under various out-of-distribution (OOD) conditions, such as noisy speech, unseen languages, and in-the-wild YouTube audio.