The voice synthesis industry has incredible potential for doing good; with modern technology, we can bring human, expressive, emotional voices to people who have otherwise lost the ability to speak.
However, I’ve always been mindful of the potential for misuse, especially in today’s era of deepfake technology. I had my first brush with this back in 2012, when our organisation was contacted by a group of people who, we suspected, had somewhat misguided intentions.
Let me explain. We’d been working on a voice synthesiser for Roger Ebert, the American film critic who sadly lost part of his jaw after battling cancer. Although Roger passed away in 2013, we’d been using his previously recorded speech to ‘clone’ his voice for any situation. We’d also been practising this with US Presidents, because recordings made by the federal government sit in the public domain.
Then we got an enquiry that made us think twice. An organisation wanted to buy Obama’s voice, and we were concerned that they might use it to make him appear to say something offensive. In fact, when we raised this concern with the potential buyer, rather than denying it, they simply told us that this wouldn’t be illegal. Needless to say, we didn’t sell them anything.
More recently, we’ve seen a huge surge in concern around deepfakes: videos, images and recordings of people that appear believable but are simply digital re-creations. And the concern is not exaggerated; Google demonstrated its Duplex software a few years ago, convincingly booking a hairdresser appointment with an unknowing, live recipient, complete with human pauses and an ‘uh hmm’.
In fact, with the right input and enough time, today’s technology could plausibly produce a synthetic voice that sounds completely human. The more interesting questions, though, are not whether we can, but when and in what circumstances we should create synthetic voices, and how we can safeguard against misuse.
Bad Actors
We’ve seen both speculation about and actual attempts at criminal activity that misuses synthetic voice. The first is fraud; as the Obama example suggests, by copying a voice you can make it say whatever you want. This could include the equivalent of email ‘spear-phishing’ attacks, where hackers send an email that appears to come from the CEO, requesting a money transfer. With synthetic voice, this wouldn’t have to be an email – it could be a phone call, which is instantly more plausible.
Fraud is just the tip of the iceberg; by cloning the voice of a prominent person like Obama, Trump or UK Prime Minister Boris Johnson, a criminal could call a national newspaper and give a false statement. In most cases this is both fraud and an infringement of personal rights, since in many jurisdictions you have rights over how your voice is used. Of course, voice cloning isn’t always illegal; sometimes it’s practical but still sits in a moral grey area. For example, synthetic voices can replace cold callers selling products to people. This was very common during the PPI (payment protection insurance) claims period a few years ago, though the synthetic voices being used were often extraordinarily bad!
Furthermore, it’s important to understand a very practical point about voice cloning. Copying a person’s voice today is difficult and requires a great deal of ‘clean’ data (i.e. recordings without background noise). We work in a small soundproof recording studio, and it can take hours and hours of careful work to produce a good voice. However, the technology is improving, and the fake voices used by criminals are likely to improve in naturalness and quality.
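To make the ‘clean data’ point concrete, here is a minimal sketch of one way a recording could be screened before it enters a training set: estimate a rough signal-to-noise ratio and reject anything too noisy. This is an illustrative heuristic, not our actual pipeline; the file name, frame size and thresholds are all assumptions.

```python
# Rough SNR screening for candidate training audio (illustrative only).
# Assumes a mono, 16-bit PCM WAV file; thresholds are arbitrary examples.
import wave

import numpy as np

def rough_snr_db(path: str, frame_ms: int = 50) -> float:
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        raw = wf.readframes(wf.getnframes())
    samples = np.frombuffer(raw, dtype=np.int16).astype(np.float64)

    # Slice the signal into short frames and measure loudness per frame.
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1)) + 1e-9  # avoid log(0)

    # Quietest frames approximate the noise floor; loudest, the speech peaks.
    noise_floor = np.percentile(rms, 10)
    speech_level = np.percentile(rms, 95)
    return 20.0 * np.log10(speech_level / noise_floor)

if rough_snr_db("take_001.wav") > 30.0:  # ~30 dB: a comfortable margin
    print("clean enough to consider for training")
```

A production pipeline would do far more – silence trimming, transcript alignment, reverberation checks – but even a simple check like this weeds out obviously noisy takes.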
Pseudo-human Voices
Bad sales calls that use synthetic voice may be morally dubious, but there are many positive uses for synthetic voices that don’t even sound human! For example, there are many ‘robotic’ voices that can be pleasant and even fun to listen to. Most of us remember the cute voice of Chappie in Neill Blomkamp’s film of the same name, or Johnny 5 in the 80s classic Short Circuit. Neither of these robots has a natural-sounding voice – and both were voiced by people – but neither ran the risk of being confused with a real person.
This is an important point for future synthetic voice creators to remember: synthetic voices can be more varied and fun than a ‘normal’ human voice, without being creepy or seeming to impersonate someone. Similarly, pseudo-human voices can be gender neutral or as culturally varied as needed, avoiding many of the assumptions we make when we speak to a genuine human.
Human Speech Synthesis
However, there is a critical need for human-like synthetic voices. After all, consider people who have lost their voices through illness or accident. We have worked with the likes of Peter Scott-Morgan, a roboticist with motor neurone disease who managed to clone his voice before losing it completely, and Jamie Dupree, an American newscaster who lost his voice to a rare condition called tongue dystonia. We’ve managed to give these people – and more – their voices back digitally, where conventional medicine and surgery could no longer help.
Now consider the learning difficulties of children with dyslexia. Studies suggest that using speech synthesis technology to read along with dyslexic children has a significant impact on both their learning experience and their reading ability. Given that an estimated 1 in 10 students have dyslexia, this technology could help hundreds of thousands of children perform better in exams every year. The Scottish government has been especially quick to recognise how this technology helps in teaching environments, and commissioned text-to-speech voices for Scottish schoolchildren with communication difficulties – these were made available earlier this year.
It’s not just medical conditions or learning difficulties where human-like speech synthesis can help; synthetic voices can also help to test and develop video games. This is a new, emerging area where synthetic voice fits a broad spectrum of uses. For example, synthetic voice can generate placeholder speech during development, instead of paying a voice actor for a scene or piece of dialogue that may be cut from the end product. Similarly, voice cloning can remove the repetitive and painful process of recording all of a character’s script. A synthetic voice can even carry a character by itself, or produce a unique pseudo-human voice, somewhere between natural and artificial, for robots, demons and trolls.
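As a concrete illustration of the placeholder-speech idea, here is a minimal sketch using the open-source pyttsx3 library (an assumption on my part – any offline text-to-speech engine would do); the line IDs and dialogue are invented for the example.

```python
# Generate temporary placeholder audio for game dialogue (illustrative).
# Real actor recordings can later replace these files one-for-one,
# without touching the game's asset references.
import pyttsx3

placeholder_lines = {
    "npc_guard_01": "Halt! The east gate is closed after dark.",
    "npc_merchant_03": "Fresh supplies, traveller? Best prices in town.",
}

engine = pyttsx3.init()
engine.setProperty("rate", 160)  # slightly slower than default, for clarity

for line_id, text in placeholder_lines.items():
    engine.save_to_file(text, f"{line_id}.wav")  # one file per script line

engine.runAndWait()  # processes the whole queue in one pass
```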
Finally, there are myriad other applications for synthetic voice where a human would find the task repetitive or boring. For example, in language learning systems like Duolingo, synthetic voices can not only be given regional accents easily; they also take a thankless, tedious task off the hands of a voice actor or teacher.
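To show how easily a regional accent can be swapped in, here is a small sketch, again with pyttsx3; the voices installed vary by operating system, so the British English tags below are assumptions about the local voice set.

```python
# Pick a British English voice if one is installed (illustrative).
import pyttsx3

engine = pyttsx3.init()
for voice in engine.getProperty("voices"):
    # Voice metadata differs across platforms; check both fields loosely.
    if "en_GB" in str(voice.languages) or "en-GB" in voice.id:
        engine.setProperty("voice", voice.id)
        break

engine.say("Hello, and welcome to today's lesson.")
engine.runAndWait()
```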
Into the Future
Synthetic voice technology is evolving daily. From changes in algorithms to improvements in how we process and edit voice itself, it is becoming quicker, easier and more efficient to create an artificial voice. However, as with almost any technology, there are ways for criminals and pranksters to misuse these developments – and plenty of people working to prevent that harm. With the benefits that synthetic voice can bring to the health and wellbeing sector, the gaming industry and many others, it’s important that we continue to explore how to use voice responsibly, so that it has a bright future.