The last few years have been dominated by new AI technologies and by the question of how they will change the way we use computers and work in many fields. Microsoft continues to develop such solutions with VALL-E, a new artificial intelligence model that can reproduce almost any human voice speaking English from a sample of just three seconds of speech.
VALL-E AI currently works only in English
In short, you give the software a sample of your voice, either from an existing recording or from a new one made on the spot in a few seconds. You then enter a text, and VALL-E reads it back in your own voice. This could speed up work in video and audio production, especially at TV and radio stations and in online content creation.
Writing a text, recording it as audio and editing it into a finished piece is a lengthy process. With a solution like this, all you need is the finished text, and the voice is generated in seconds. What’s more, by using AI tools such as GPT-3 to edit the text, you can cut production time even further. Of course, these technologies are still in their infancy, and using them in real-life situations is not yet advisable, as the results are not yet perfect.
Microsoft calls this AI a “neural codec language model” and it is built on EnCodec technology. The software analyses the sounds a person makes when speaking and uses the results to reproduce the voice as accurately as possible. VALL-E was trained on 60,000 hours of audio from 7,000 different speakers, taken from the LibriVox sound library, which consists of free audiobooks. As a result, the closer the source voice is to the voices in that library, the better the output.
The AI can preserve the speaker’s vocal timbre and emotional tone, but it can also change them, based on variables.
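To give a rough idea of what the “codec” part of a neural codec language model means, the sketch below turns a few audio values into discrete tokens and back again, refining the result in two stages. It is a toy illustration only and not Microsoft’s or EnCodec’s code: a real codec uses a learned neural network, while here the codebook values, the `encode` and `decode` helpers and the sample numbers are all made up for the example.

```python
# Toy illustration only: a real neural codec learns its encoder, decoder and
# codebooks from data. Here each "frame" is a single number and the codebooks
# are hand-picked, hypothetical values.
CODEBOOKS = [
    [-0.5, 0.0, 0.5],  # coarse stage
    [-0.2, 0.0, 0.2],  # finer stage, applied to what the coarse stage missed
]


def nearest(codebook, value):
    """Return the index of the codebook entry closest to value."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))


def encode(frames):
    """Turn each frame into one token index per codebook stage."""
    tokens = []
    for x in frames:
        residual = x
        codes = []
        for codebook in CODEBOOKS:
            idx = nearest(codebook, residual)
            codes.append(idx)
            residual -= codebook[idx]  # pass the leftover error to the next stage
        tokens.append(tuple(codes))
    return tokens


def decode(tokens):
    """Rebuild an approximation of each frame by summing the chosen entries."""
    return [
        sum(codebook[i] for codebook, i in zip(CODEBOOKS, codes))
        for codes in tokens
    ]


frames = [0.45, -0.3, 0.15]
tokens = encode(frames)
approx = decode(tokens)
print(tokens)
print(approx)
```

Once speech is expressed as sequences of tokens like these, a language model can in principle be trained to predict them from a text and a short voice sample, in much the same way that text models predict words.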
Microsoft says it can detect whether a recording is made with its AI
Of course, this technology once again raises the question of misuse. For this reason, Microsoft will keep VALL-E closed-source, so it can only be used in ways the company allows. On top of that, it is already possible to identify AI-generated recordings:
“Since VALL-E can synthesize speech that preserves the speaker’s identity, it could come with risks in using it in undesirable ways, such as tricking voice-based security systems or mimicking a particular speaker. To avoid such risks, it is also possible to create a detection model to check whether or not an audio clip was made with VALL-E. We will also put the Microsoft AI Principles into practice when developing new models in the future,” say the researchers behind the project.