After a recent voice-synthesis service capable of imitating any voice was hijacked to spread racist and homophobic messages, AI voice-generation technologies are set to be constrained by a fingerprint, ensuring that any simulated voice can be told apart from the original.
AI-generated speech can serve all sorts of legitimate purposes, from dubbing movies into the audience's native language to translating an actor's lines in that actor's own voice (with permission, of course). Problems arise when synthesis that imitates a particular person's voice is put to malicious use, such as fabricating statements or opinions of politicians and celebrities.
Before it is too late, the creators of these technologies are prepared to introduce a fingerprinting system for digital productions: any recording containing AI-synthesized speech would carry an inaudible “watermark”, allowing it to be distinguished from fully authentic material without calling in an expert to spot the telltale cues of voice synthesis. Crucially for the commercial viability of speech-synthesis services, the fingerprint would not perceptibly affect recording quality.
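A minimal sketch of how such an inaudible fingerprint might work, assuming a toy spread-spectrum scheme (all function names and parameters here are hypothetical illustrations, not any vendor's actual method): a key-seeded, very low-amplitude noise sequence is mixed into the audio, and a detector checks for its presence by correlating the recording against the same sequence.

```python
import math
import random

def embed_watermark(samples, key, strength=0.002):
    # Add a key-seeded, low-amplitude noise sequence (hypothetical scheme);
    # at this strength the perturbation sits far below audible levels.
    rng = random.Random(key)
    return [s + strength * rng.uniform(-1.0, 1.0) for s in samples]

def detect_watermark(samples, key):
    # Correlate against the same key-seeded sequence; a clearly positive
    # score indicates the mark is present -- no human expert required.
    rng = random.Random(key)
    return sum(s * rng.uniform(-1.0, 1.0) for s in samples)

# Toy "audio": a quiet low-frequency tone.
clean = [0.1 * math.sin(0.01 * n) for n in range(20000)]
marked = embed_watermark(clean, key=1234)

# The marked recording correlates with the key's sequence; the clean
# recording does not.
print(detect_watermark(marked, key=1234) > detect_watermark(clean, key=1234))
# prints True
```

The key point the toy captures: detection is a mechanical correlation test, so anyone holding the key can verify a recording without audible artifacts or forensic expertise.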
Unfortunately, the proposed solution does nothing against perpetrators determined to credibly fake a particular person's voice, such as state actors, clandestine services hired to discredit individuals, or large-scale disinformation campaigns.
Another problem with the watermarking system is that the indicators can easily be “lost” by recompressing the audio file with a lower-quality codec, for example to mimic an intercepted phone call. The industry is therefore unlikely to offer a truly effective “antidote” to digital disinformation, whether it fakes the video content, the soundtrack, or both.
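The recompression weakness can be sketched with the same kind of toy scheme (again, all names are hypothetical and the “codec” is simulated, not a real one): a crude moving-average filter stands in for a low-bitrate codec, stripping the rapidly varying component that carries the mark and collapsing the detector's score.

```python
import math
import random

def spread_mark(samples, key, strength=0.005):
    # Hypothetical spread-spectrum mark: add a key-seeded +/-1 sequence.
    rng = random.Random(key)
    signs = [rng.choice((-1, 1)) for _ in samples]
    marked = [s + strength * sg for s, sg in zip(samples, signs)]
    return marked, signs

def score(samples, signs):
    # Correlation detector: a large positive score means the mark is present.
    return sum(s * sg for s, sg in zip(samples, signs))

def lossy_recompress(samples, window=8):
    # Crude stand-in for a low-quality codec: a moving average that
    # discards the high-frequency content carrying the mark while
    # leaving the audible low-frequency signal largely intact.
    out = []
    for n in range(len(samples)):
        chunk = samples[max(0, n - window + 1): n + 1]
        out.append(sum(chunk) / len(chunk))
    return out

clean = [0.1 * math.sin(0.01 * n) for n in range(20000)]
marked, signs = spread_mark(clean, key=42)
degraded = lossy_recompress(marked)

# The detector's score drops sharply after the simulated recompression.
print(score(clean, signs), score(marked, signs), score(degraded, signs))
```

Real audio watermarks are engineered to survive some recompression, but aggressive low-bitrate codecs still erode them, which is exactly what makes the “intercepted phone call” disguise attractive to an attacker.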