There is a new documentary, Roadrunner: A Film About Anthony Bourdain, about the food and travel writer who died by suicide in 2018. At one point, the film has him reading an email he sent to a friend. Why would he read an email aloud? Well, he didn’t. The filmmakers used AI to synthesize a voice that closely resembled his, a technology that could be used to make any text seem to emanate from him. (I first learned about this technology when Marcus Ranum had a post on it back in 2016.)
I posted at that time that this new audio technology, coupled with the ability to make visual deepfakes of people, would open the floodgates to all manner of abuse, since people with ill intentions could produce ‘evidence’ that made it appear as if someone had said something that they did not.
That kind of abuse has not occurred yet (as far as I know), but the revelation that this AI technology was used by the documentary filmmaker Morgan Neville to make it appear as if Bourdain was actually reading the email has generated some controversy.
There were a total of three lines of dialogue that Neville wanted Bourdain to narrate, the filmmaker explained in an interview. Because he was unable to find audio of Bourdain speaking those words, he contacted a software company and provided it with about a dozen hours of recordings, from which it created an AI model of Bourdain’s voice.
Although Neville described his use of AI technology as a “modern storytelling technique”, critics voiced concerns on social media over the unannounced use of a “deepfake” voice to say sentences that Bourdain never spoke.
Sean Burns, a film critic for Boston’s WBUR, denounced the film-makers, writing: “When I wrote my review I was not aware that the film-makers had used an AI to deepfake Bourdain’s voice … I feel like this tells you all you need to know about the ethics of the people behind this project.”
I am not sure why what Neville did is being seen as so objectionable. We have long had actors read the words of dead people in films, TV, and radio. Why would having a computer read the words be any worse? Is it that the striking accuracy of the voice reproduction might make people think that Bourdain had actually spoken the words, and hence that there was deception? If Neville had revealed the use of AI earlier, critics might have been mollified.
This article looks more closely at the ethical issues involved.
“We have pretty strong policies around what can be done on our platform,” said Zohaib Ahmed, founder and CEO of Resemble AI, a Toronto company that sells a custom AI voice generator service. “When you’re creating a voice clone, it requires consent from whoever’s voice it is.”
Ahmed said the rare occasions where he’s allowed some posthumous voice cloning were for academic research, including a project working with the voice of Winston Churchill, who died in 1965.
Ahmed said a more common commercial use is to edit a TV ad recorded by real voice actors and then customize it to a region by adding a local reference. It’s also used to dub anime movies and other videos, by taking a voice in one language and making it speak a different language, he said.
He compared it to past innovations in the entertainment industry, from stunt actors to greenscreen technology.
Just seconds or minutes of recorded human speech can help teach an AI system to generate its own synthetic speech, though getting it to capture the clarity and rhythm of Anthony Bourdain’s voice probably took a lot more training, said Rupal Patel, a professor at Northeastern University who runs another voice-generating company, VocaliD, that focuses on customer service chatbots.
“If you wanted it to speak really like him, you’d need a lot, maybe 90 minutes of good, clean data,” she said. “You’re building an algorithm that learns to speak like Bourdain spoke.”
There is one problem that the technology cannot solve. The same words can convey quite different meanings depending on pitch, cadence, emphasis, inflection, and pauses. I recall once seeing a video of a famous actor (Ian McKellen? Orson Welles?) deliver the same lines from Shakespeare (?) in different ways to suggest quite different meanings. When we read written words, we can interpret them in different ways, but the AI settles on just one reading, presumably arbitrarily.
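To be fair, synthesis systems do expose some of these controls to a human operator: SSML (Speech Synthesis Markup Language), a W3C standard supported by many text-to-speech engines, lets a caller specify rate, pitch, and emphasis for a given passage. A minimal sketch, assuming a hypothetical helper and sample sentence of my own (whether a cloned voice like the Bourdain model accepts such markup is a separate question):

```python
def with_prosody(text, rate="medium", pitch="medium", emphasis_word=""):
    """Wrap `text` in SSML prosody tags; optionally emphasize one word.

    The tag names (<speak>, <prosody>, <emphasis>) come from the W3C
    SSML specification; this function just builds the markup string.
    """
    if emphasis_word and emphasis_word in text:
        # Mark only the first occurrence of the word for strong emphasis.
        text = text.replace(
            emphasis_word,
            f'<emphasis level="strong">{emphasis_word}</emphasis>', 1)
    return (f'<speak><prosody rate="{rate}" pitch="{pitch}">'
            f'{text}</prosody></speak>')

line = "I am not angry with you"
# Three markups of the same words, each suggesting a different delivery:
neutral = with_prosody(line)
clipped = with_prosody(line, rate="fast", pitch="low")   # terse, cold
pointed = with_prosody(line, emphasis_word="you")        # "...with YOU"
```

The point stands, though: markup like this still requires a human to choose the reading, which is exactly the interpretive judgment an automated narration pipeline does not supply.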
For this technology to work, the AI software needs a fair amount of recorded audio of the speaker in order to reproduce the voice accurately, so we would not be able to hear the voice of Abraham Lincoln, for example. But I think this technology is going to be widely used in documentaries about dead people who leave behind a cache of voice recordings.
The special effects in films nowadays show actors doing things that they did not actually do and audiences know that and seem to accept it. Soon people will be heard saying things that they did not actually say. As long as we are aware that it is not real, I think we can eventually expect the same level of acceptance.