Microsoft has pushed its Custom Neural Voice service to general availability, although you’ll have to ask the company nicely if you want to use the vaguely unsettling text-to-speech service.
Unsettling because, unlike the usual text-to-speech we’ve come to know and love over the years, which requires a substantial amount of data (10,000 lines or more, according to Microsoft) to sound fluent, Custom Neural Voice requires far less in terms of training audio. The result is disturbingly human-like.
“This new technology allows companies to spend a tenth of the effort traditionally needed to prepare training data,” explained Microsoft, which will come as a delight to out-of-work actors looking to do some voiceover jobs on the side (it probably won’t).
There is also a real risk of abuse, hence the GA gates not being entirely thrown open.
Microsoft’s own code of conduct for the technology warns against using “photo realistic avatars with synthetic voices to represent real people” and against “using a synthetic voice with contents without editorial control.” Sensible guidelines when choosing a use case, but unlikely to put off a determined miscreant.
As for the technology itself, three components are at play: Text Analyzer, Neural Acoustic Model, and Neural Vocoder. The trio take inputted text, convert it to a phoneme (a basic unit of sound) sequence, pass that through the model to predict acoustic features before finally spitting out audible speech.
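For the curious, the three-stage flow described above can be sketched in a few lines of Python. This is a toy illustration only, not Microsoft's implementation or API: the function names, the two-word lexicon, the made-up pitch rule, and the sine-wave "vocoder" are all assumptions standing in for what are, in the real service, trained neural networks.

```python
import math
from typing import List, Tuple

# Hypothetical two-word pronunciation table (an assumption for illustration).
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def text_analyzer(text: str) -> List[str]:
    """Stage 1: turn input text into a phoneme sequence."""
    phonemes: List[str] = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        # Unknown words map to a placeholder phoneme in this toy version.
        phonemes.extend(TOY_LEXICON.get(word, ["UNK"]))
    return phonemes


def acoustic_model(phonemes: List[str]) -> List[Tuple[float, float]]:
    """Stage 2: predict acoustic features, here (pitch in Hz, duration in s).

    A real acoustic model is a trained network; this stand-in derives an
    arbitrary pitch from the phoneme's characters.
    """
    features = []
    for phoneme in phonemes:
        pitch_hz = 120.0 + (sum(ord(c) for c in phoneme) % 80)
        features.append((pitch_hz, 0.05))
    return features


def vocoder(features: List[Tuple[float, float]], sample_rate: int = 16000) -> List[float]:
    """Stage 3: render the features as audio samples (plain sine tones here)."""
    samples: List[float] = []
    for pitch_hz, duration_s in features:
        n_samples = round(duration_s * sample_rate)
        for i in range(n_samples):
            samples.append(math.sin(2 * math.pi * pitch_hz * i / sample_rate))
    return samples


def synthesize(text: str) -> List[float]:
    """Chain the three stages: text -> phonemes -> features -> waveform."""
    return vocoder(acoustic_model(text_analyzer(text)))
```

Calling `synthesize("Hello, world!")` returns a list of floating-point samples between -1 and 1. The output would sound like a string of beeps rather than a voice, which is rather the point: the human-like quality comes from the trained models, not from the plumbing.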
The Neural Model itself is trained using neural networks and actual voice recordings. Those recordings are where things get sticky, and “Microsoft requires every customer to obtain explicit written permission from the voice talent before creating a voice model.” Verification is also performed.
After all, once that model is up to snuff, the voice could say all manner of things. Microsoft also insists the use of a synthetic voice be disclosed to users, which could make some of the relentlessly perky chatbot-style use cases presented potentially awkward.
Adopters have included AT&T, which had a voiceover artist churn out 2,000 phrases and lines in order to voice cartoon character Bugs Bunny with Custom Neural Voice. At least in that instance one knows that Bugs is a fictional character. ®