At its annual Microsoft Ignite 2018 event, Microsoft announced several upgrades to Microsoft Cognitive Services, now also known as Azure Cognitive Services. The one that caught my attention is the new Neural Text-to-Speech service.

The new Neural Text-to-Speech (Neural TTS) service marks a milestone for text-to-speech synthesis. It uses deep neural networks to make computer voices sound almost like human ones. With human-like natural prosody and clear articulation of words, Neural TTS significantly narrows the gap in interaction between humans and AI. It can make interactions with chatbots and virtual assistants more natural and engaging, convert digital texts such as e-books into audiobooks, and enhance in-car navigation systems.

Over the past two years, teams at Microsoft have achieved significant breakthroughs, including human parity in conversational speech recognition and human parity in machine translation.

By using deep neural networks, the new service overcomes the limitations of legacy systems in two areas: matching the patterns of stress and intonation in spoken language (prosody), and synthesizing the speech into a digital voice. With the neural approach, prosody prediction and voice synthesis happen simultaneously, which results in a more fluid and natural-sounding voice.

The service is in preview and is currently hosted on Azure Kubernetes Service for scalability and reliability. It offers two pre-built neural text-to-speech voices in English: Jessa and Guy. More languages can be expected over the coming weeks and months, as well as customization services in 49 languages for customers who want to build branded voices optimized for their specific needs.
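For readers who want a feel for what a request to a service like this might contain, here is a minimal sketch that builds an SSML (Speech Synthesis Markup Language) document selecting one of the two voices. It only constructs the XML text and does not call any endpoint. The function name `build_ssml` and the voice identifiers `en-US-JessaNeural` and `en-US-GuyNeural` are my assumptions for illustration; the announcement itself only names the voices as Jessa and Guy, so check the preview documentation for the exact values.

```python
import xml.etree.ElementTree as ET

# Assumed voice identifiers (hypothetical; the post only says "Jessa" and "Guy").
JESSA = "en-US-JessaNeural"
GUY = "en-US-GuyNeural"


def build_ssml(text, voice_name=JESSA, lang="en-US"):
    """Return an SSML string asking for `text` to be spoken by `voice_name`."""
    # Root <speak> element with the SSML version and document language.
    speak = ET.Element("speak", {"version": "1.0", "xml:lang": lang})
    # <voice> selects which voice should read the enclosed text.
    voice = ET.SubElement(speak, "voice", {"name": voice_name})
    # ElementTree escapes characters such as & and < when serializing.
    voice.text = text
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("This is easy enough if you have an unfinished attic.", GUY))
```

Building the payload with an XML library instead of string concatenation means special characters in the input text are escaped automatically.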

The following examples show how closely the new service can match a real recorded human voice. I could hardly tell whether the audio came from a human or from Neural TTS, until the second example below, which has a very slight computer-like quality.

Sentence | Human Recording | Neural TTS
The third type, a logarithm of the unsigned fold change, is undoubtedly the most tractable. | Here | Here
As the name suggests, the original submarines came from Yugoslavia. | Here | Here
This is easy enough if you have an unfinished attic directly above the bathroom. | Here | Here

I am really amazed by the power of neural networks to close the gap between human and computer interaction.

If you are interested in joining the preview to get hands-on experience, you can request early access here. Let us know your thoughts below!

Source: Microsoft Blog