IBM’s AI generates high-quality voices from 5 minutes of talking

On Oct 3, 2019

Training powerful text-to-speech models requires sufficiently powerful hardware. A recent study published by OpenAI drives the point home: it found that since 2012, the amount of compute used in the largest AI training runs grew by more than 300,000 times. In pursuit of less demanding models, researchers at IBM developed a new lightweight and modular method for speech synthesis. They say it can synthesize high-quality speech in real time by learning different aspects of a speaker’s voice, which makes it possible to adapt to new speaking styles and voices with small amounts of data.

“Recent advances in deep learning are dramatically improving the development of Text-to-Speech (TTS) systems through more effective and efficient learning of voice and speaking styles of speakers and more natural generation of high-quality output speech,” wrote IBM researchers Zvi Kons, Slava Shechtman, and Alex Sorin in a blog post accompanying a preprint paper presented at Interspeech 2019. “Yet, to produce this high-quality speech, most TTS systems depend on large and complex neural network models that are difficult to train and do not allow real-time speech synthesis, even when leveraging GPUs. In order to address these challenges, our … team has developed a new method for neural speech synthesis based on a modular architecture.”

The IBM team’s system consists of three interconnected parts: a prosody feature predictor, an acoustic feature predictor, and a neural vocoder. The prosody predictor learns the duration, pitch, and energy of speech samples, with the goal of better representing a speaker’s style. The acoustic feature predictor creates representations of the speaker’s voice from the training or adaptation data, while the vocoder generates speech samples from those acoustic features.
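
To make the division of labor concrete, the three stages can be pictured as functions chained together. The sketch below is only an illustration of that modular layout, not IBM's implementation: every name in it (`Prosody`, `predict_prosody`, `predict_acoustic_features`, `vocode`, `synthesize`) is hypothetical, the hand-written rules stand in for trained neural networks, and a plain sine generator stands in for the neural vocoder.

```python
import math
from dataclasses import dataclass
from typing import List

VOWELS = {"a", "e", "i", "o", "u"}


@dataclass
class Prosody:
    """Per-unit prosody: the three quantities named in the article."""
    duration_frames: int  # how many acoustic frames the unit spans
    pitch_hz: float
    energy: float


def predict_prosody(units: List[str]) -> List[Prosody]:
    # Stage 1 (hypothetical stand-in for a trained prosody network):
    # vowels are made longer, higher, and louder than other units.
    result = []
    for unit in units:
        if unit.lower() in VOWELS:
            result.append(Prosody(duration_frames=3, pitch_hz=120.0, energy=1.0))
        else:
            result.append(Prosody(duration_frames=1, pitch_hz=100.0, energy=0.5))
    return result


def predict_acoustic_features(
    units: List[str], prosody: List[Prosody]
) -> List[List[float]]:
    # Stage 2 (stand-in for the acoustic feature predictor): expand each
    # unit into one feature vector per frame. A real system would emit
    # spectral features; here each frame is just [pitch, energy].
    frames: List[List[float]] = []
    for _unit, p in zip(units, prosody):
        frames.extend([p.pitch_hz, p.energy] for _ in range(p.duration_frames))
    return frames


def vocode(
    frames: List[List[float]],
    samples_per_frame: int = 80,
    sample_rate: int = 16000,
) -> List[float]:
    # Stage 3 (stand-in for the neural vocoder): turn frames into audio
    # samples with a sine oscillator whose phase carries across frames.
    samples: List[float] = []
    phase = 0.0
    for pitch_hz, energy in frames:
        for _ in range(samples_per_frame):
            phase += 2.0 * math.pi * pitch_hz / sample_rate
            samples.append(energy * math.sin(phase))
    return samples


def synthesize(units: List[str]) -> List[float]:
    # The full pipeline: prosody -> acoustic features -> waveform.
    prosody = predict_prosody(units)
    frames = predict_acoustic_features(units, prosody)
    return vocode(frames)


if __name__ == "__main__":
    audio = synthesize(list("hello"))
    print(len(audio), "samples generated")
```

Because each stage only consumes the previous stage's output, any one of them could in principle be retrained or swapped without touching the others, which is the property a modular design is meant to offer.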

All components work together to adapt the synthesized voice to a target speaker via retraining on a small amount of data from that speaker. In a test in which volunteers were asked to listen to and rate the quality of pairs of synthesized and natural voice samples, the team reports that the model maintained high quality and similarity to the original speaker for voices trained on as little as five minutes of speech.