Emotion Speech Synthesis is now possible by combining the Emotion Processing Unit II with WaveNet: A generative model for raw audio from DeepMind.
“WaveNet, a deep generative model of raw audio waveforms. We show that WaveNets are able to generate speech which mimics any human voice and which sounds more natural than the best existing Text-to-Speech systems, reducing the gap with human performance by over 50%….”
“Similarly, we could provide additional inputs to the model, such as emotions or accents, to make the speech even more diverse and interesting.”
The EPU's real-time emotion synthesis output, which represents the emotional state of the AI or robot at any moment, can be fed to the neural network as a live input, allowing machines to produce speech that finally sounds convincingly human.
Combining the EPU, Natural Language Generation (NLG) and WaveNet is the pathway to the next generation of AI and the Advanced Affective Agent (AAA). Each intent or thought will carry an emotional property, derived in real time from the emotional state of the EPU, allowing the system to dynamically generate both the response and the emotional voice.
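The pipeline described above can be sketched in code. Everything below is a hypothetical illustration: the `EmotionalState` class, the template-based `generate_response` NLG stub, and the `synthesize` stand-in for an emotion-conditioned WaveNet-style synthesizer are all assumptions, not the EPU's or DeepMind's actual APIs.

```python
from dataclasses import dataclass

# Hypothetical representation of the emotional state an EPU-like
# component might expose in real time.
@dataclass
class EmotionalState:
    joy: float
    sadness: float
    anger: float

    def as_vector(self) -> list:
        # Flatten the state into a conditioning vector for the synthesizer.
        return [self.joy, self.sadness, self.anger]

def generate_response(intent: str, state: EmotionalState) -> str:
    # Placeholder NLG: choose wording based on the dominant emotion.
    dominant = max(("joy", "sadness", "anger"),
                   key=lambda name: getattr(state, name))
    templates = {
        "joy": f"Great news: {intent}!",
        "sadness": f"I'm sorry to say: {intent}.",
        "anger": f"Listen carefully: {intent}.",
    }
    return templates[dominant]

def synthesize(text: str, state: EmotionalState) -> dict:
    # Stand-in for a WaveNet-style model that, per the quote above,
    # would take an emotion vector as an additional conditioning input
    # alongside its linguistic features.
    return {"text": text, "emotion_conditioning": state.as_vector()}

# Example: a joyful state shapes both the wording and the voice conditioning.
state = EmotionalState(joy=0.8, sadness=0.1, anger=0.05)
utterance = synthesize(generate_response("the task is complete", state), state)
print(utterance)
```

The key design point is that the same emotional state conditions both stages: the NLG step selects the wording, and the synthesis step receives the emotion vector as an extra input, so the response text and the vocal delivery stay consistent.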