The Era of Voice and Image Cloning: Can You Believe What You Hear and See?

Over the past few years, we have learned to live with fake news, which has been largely limited to articles on fake news sites.

“Fake news” isn’t really a big problem. Statements can be fact-checked, and authors can be asked whether they ever made this or that statement. Audio statements (interviews) and images are used as proof that a specific news item is false or “fake”.

But what if the audio and video used to disprove a statement are themselves fake?

Let’s start with audio

For decades, researchers in the medical field have recorded the voices of seriously ill patients, so that one day, when the patient may be completely paralyzed, or has simply lost the ability to speak, synthesizers (text-to-speech systems) could be used to reproduce not a canned computer voice, but the actual voice of the speaker. 20 years ago that meant recording thousands upon thousands of words and full sentences; these days it can be done with 1-2 hours of recording, and the rest of the speech is synthesized with amazing accuracy.

These efforts are admirable and have helped increase the quality of life of patients. But what started as a way of helping the speech-impaired is now becoming a major threat to democracy.

It is now possible to take relatively short sections of recorded speech (a dialog in a movie, a recorded comment, a public speech) and use them to recreate the voice of the speaker in its entirety in mere minutes, or alter it to match any other speech pattern.

If you want to try this out online, visit the French startup, and prepare to be amazed. Every major tech company, from Baidu to Google, and hundreds of startups are working on similar technologies.

The software breaks down your speech into minute segments and assigns certain attributes to each of them; the resulting library of segments can be rearranged in any pattern necessary. Modulation is possible after recording, which makes it possible, for example, to make your voice sound like that of another gender or age group. Add to that some basic sound engineering and you can effectively fake any voice in any setting.
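To make the segment-and-rearrange idea concrete, here is a minimal toy sketch in Python. It is not how any particular product works (real systems rely on far more sophisticated models); it simply assumes that a "recording" is a list of samples, uses sine tones as stand-ins for speech sounds, and uses hypothetical helper names (`tone`, `build_library`, `rearrange`, `change_pitch`) invented for this illustration.

```python
import math

SAMPLE_RATE = 8000  # samples per second; an assumed value for this toy example


def tone(freq_hz, n_samples):
    """Generate a sine tone that stands in for a short piece of recorded speech."""
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n_samples)]


def build_library(recording, labels, segment_len):
    """Cut a recording into equal-length segments and index each one by its label."""
    library = {}
    for idx, label in enumerate(labels):
        start = idx * segment_len
        library[label] = recording[start:start + segment_len]
    return library


def rearrange(library, order):
    """Concatenate stored segments in a new order to form a new 'utterance'."""
    output = []
    for label in order:
        output.extend(library[label])
    return output


def change_pitch(samples, factor):
    """Naive modulation by resampling: factor > 1 raises the pitch and shortens the clip."""
    n_out = int(len(samples) / factor)
    return [samples[min(int(i * factor), len(samples) - 1)] for i in range(n_out)]


# Three 100 ms "sounds" recorded one after another.
segment_len = 800
recording = tone(200, segment_len) + tone(300, segment_len) + tone(400, segment_len)
library = build_library(recording, ["a", "b", "c"], segment_len)

# Play the sounds in an order the speaker never used, then raise the pitch.
new_utterance = rearrange(library, ["c", "a", "c", "b"])
higher_voice = change_pitch(new_utterance, 2.0)
print(len(new_utterance), len(higher_voice))
```

The point of the sketch is only that once speech is stored as labeled segments, producing something the speaker never said is a matter of lookup and concatenation, and changing how the voice sounds is a matter of post-processing.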

More about Artificial Intelligence and synthetic images in the New Yorker

Now add an image

If a fake audio recording is not enough to prove the “veracity” of your fake news, we might need to add an image. The potential to fake images has been around forever, by staging a scene, using props and makeup. But that is so last century.

We can now take any video clip of a person speaking and put words in his or her mouth. Instead of a lengthy description, just watch this clip on YouTube, courtesy of the University of Erlangen-Nuremberg, the Max-Planck Institute for Informatics, and Stanford University:

The video shows how we can take real footage of a person speaking and effectively put words into the speaker’s mouth. While not perfect yet, the potential for abuse is evident.

Impressed? You should be. Scared? You better be.

By combining these – and no doubt a number of other emerging – technologies, we now have the potential to seriously disrupt, influence, alter, and pervert media, television, and political systems on a huge scale. It would take hours to prove that a clip of the US Fed Chair announcing a rate hike, or of the North Korean leader announcing the launch of a nuclear missile, is indeed fake. The impact of such “footage” on financial markets, electorates, and political and commercial decision-making processes is frightening.

Not to forget the judiciary. We have learned in the past decades that eyewitness testimonies (i.e., human memory) cannot be trusted. Very soon, it will be hard to believe not just anything you read or remember, but anything you hear and see too. Fake surveillance footage could put thousands of innocent people behind bars, or exonerate convicted criminals.

Goodbye Mr. Gosling

On a side note, actors should be worried about their careers too. Combining these technologies with advanced computer graphics means that within 3-5 years it will no longer be necessary to shoot any movie on set with real actors at all. Every movement, every utterance, can be created convincingly by computers.

Are you sure film studios will pay Hollywood stars seven-figure sums per film if they can just create a fake celebrity, complete with all the paparazzi footage and late-night talk show appearances, all under the control of the studio bosses? The truth will only come out if the computer-generated superstar fails to show up at the Oscars.

Abuses and the solution

The potential for abuse here is evident and mind-boggling. You could leave fake voice messages to implicate a friend in a tryst or business scandal, or you could create havoc by making political figures appear to make outrageous statements. Soon the public will not just distrust what they read, but everything they see and hear on digital media. This technology, if left unchecked, threatens the very foundations of our technological society.

The solution is complicated. Neither watermarks nor restrictions on cloning the voice of famous personalities are legally feasible. When the Internet first came into being, some scientists warned that opening up to the public without verification of identity would ultimately spell disaster.

I think we have reached that point. Very soon, governments will come to the realization that any kind of digital interaction should be registered and linked to an individual’s true identity. You would then no longer be able to surf the net, send an image, like a Facebook post, or upload a video without a verifiable, possibly blockchain-based, record of who and where you are.

China has been doing this for years now. Barring any ground-breaking technological solution, it looks like the rest of the world will have to follow suit.
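What such a verifiable record might look like can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any existing government or blockchain system: each record links an identity string to the SHA-256 hash of a piece of content and to the hash of the previous record, so that tampering with any earlier entry becomes detectable. The function and field names (`append_record`, `verify_chain`, `identity`, `content_sha256`, `prev_hash`) are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record (an assumption)


def record_hash(body):
    """Hash a record body deterministically by serializing it with sorted keys."""
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_record(chain, identity, content):
    """Append a record linking an identity to the hash of some content (bytes)."""
    body = {
        "identity": identity,
        "content_sha256": hashlib.sha256(content).hexdigest(),
        "prev_hash": chain[-1]["hash"] if chain else GENESIS,
    }
    chain.append({**body, "hash": record_hash(body)})
    return chain


def verify_chain(chain):
    """Return True only if every record is intact and correctly linked to its predecessor."""
    prev = GENESIS
    for entry in chain:
        body = {key: entry[key] for key in ("identity", "content_sha256", "prev_hash")}
        if entry["prev_hash"] != prev or entry["hash"] != record_hash(body):
            return False
        prev = entry["hash"]
    return True


chain = []
append_record(chain, "alice@example.org", b"video clip bytes")
append_record(chain, "bob@example.org", b"voice message bytes")
print(verify_chain(chain))
```

A sketch like this only shows that the record itself can be made tamper-evident; it says nothing about how the identity would be established in the first place, which is the hard part.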

Published by Dr Martin Hiesboeck

Futurist, Marketer, Policy Advisor for Companies and Government. Head of Blockchain and Crypto Research at Uphold and CEO of Alpine Blockchain Consultants. Zurich - London - New York - Taipei

3 thoughts on “The Era of Voice and Image Cloning: Can You Believe What You Hear and See?”

  1. The ethical implications of voice cloning technology are vast. It’s a technology that can expand creative possibilities, expand employment possibilities, and benefit people with speech problems. Only a tiny part of its potential to help us has been realized so far. But there are also harmful potential applications, which are important to anticipate and guard against ahead of time. Ideally, the authorities, along with social media platforms and the companies in this industry will begin regularizing the creation and use of synthetic media.

