
Generative AI Insights Q&A: Margarita Grubina, vice president of business growth at Respeecher, talks to TLG.com about how their AI voice cloning technology is being used to transform the screen sector, and how the future of AI and machine learning looks for the industry.
Respeecher was founded in 2018 by an American, Grant Reaber, and two Ukrainians, Alex Serdiuk and Dmytro Bielievtsov, with the mission to create AI technology to clone human speech and swap voices. Their target market was filmmakers, TV producers, game developers, advertisers, podcasters and content creators.
Seven years on, they have raised millions of dollars and their tech has been used on everything from synthetic voices for Luke Skywalker and Darth Vader to recreations of the voices of departed legends like Edith Piaf, Elvis Presley and Wilt Chamberlain.
Hi Margarita, can you tell me about the set-up of Respeecher and your AI tech?
Of course. So we have the three co-founders, who met seven years ago with the idea of developing proprietary synthetic speech technology that would be taken up by Hollywood sound engineers.
Their fantasy was turned into a sci-fi reality when we got to work with Lucasfilm on the Disney+ series The Mandalorian in 2019, synthesizing the young voice of Luke Skywalker. Three years later we did the same for The Book of Boba Fett, and we have worked on several other projects since then.
How does the tech work?
We’ve always used speech-to-speech technology because it gives you the most authentic emotion, intonation and performance: it takes one human actor and converts their voice into another. This is done with AI models trained on reference data from the target speaker - the voice we want to clone - provided by the client (a US studio, production company, etc.). We usually ask for around 30 minutes of recordings, feed them into the AI model, and our team of sound engineers and machine learning engineers trains it manually, working to match the target’s emotional range and improve the recording quality. It can be particularly challenging when the archive data is old and poorly recorded, so we explain the risks to the providers.
Our models deliver high-quality voice cloning and are quite complex; they can take up to two weeks to train. There are instant voice cloning models out there where you can feed in a minute of audio and get a clone back, but those results wouldn’t hold up in Hollywood movies or theatres set up with Dolby Atmos. Our models take more time to train, but the results are realistic, and clients are often stunned by what we can do. We always work with them until they’re happy with the models.
As well as speech-to-speech, we’ve also developed text-to-speech, real-time and accent conversion solutions. We can even do cross-lingual voice conversion. So, if we cloned your voice and then I spoke Ukrainian into the microphone, it would sound like you speaking Ukrainian.
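[Editor’s note: to give a flavour of what speech-to-speech conversion involves under the hood, here is a deliberately simplified, hypothetical sketch in PyTorch. It is not Respeecher’s proprietary system; the mel-spectrogram features, the tiny recurrent network and the parallel training data are all assumptions chosen purely to illustrate the general idea of mapping one speaker’s voice onto another while preserving timing and content.]

```python
# Illustrative sketch only: a toy parallel voice-conversion trainer in PyTorch.
# The feature representation, network size and training scheme are assumptions,
# not Respeecher's method; real systems use far more data, non-parallel training
# tricks and a neural vocoder to render audio.
import torch
import torch.nn as nn

N_MELS = 80   # assumed mel-spectrogram bins used as the acoustic features
HIDDEN = 256

class ToyVoiceConverter(nn.Module):
    """Frame-by-frame mapping from source-speaker mels to target-speaker mels."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(N_MELS, HIDDEN, num_layers=2, batch_first=True)
        self.proj = nn.Linear(HIDDEN, N_MELS)

    def forward(self, source_mels):           # (batch, frames, N_MELS)
        hidden, _ = self.rnn(source_mels)
        return self.proj(hidden)               # same number of frames: timing is preserved

# Stand-ins for reference data (e.g. ~30 minutes of target-speaker recordings,
# aligned with a source performance). Random tensors are used only so the sketch runs;
# in practice the mels would be extracted from real audio.
source_batch = torch.randn(4, 200, N_MELS)
target_batch = torch.randn(4, 200, N_MELS)

model = ToyVoiceConverter()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(100):                        # real training runs far longer
    optimizer.zero_grad()
    predicted = model(source_batch)
    loss = loss_fn(predicted, target_batch)    # pull the output toward the target voice
    loss.backward()
    optimizer.step()

# At inference, the source actor's performance is fed through the trained model,
# and a vocoder (not shown) would turn the converted mels back into audio.
converted = model(torch.randn(1, 500, N_MELS))
```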
Can you explain how this tech worked specifically on The Mandalorian and other recent projects?
Yeah, sure. So for The Mandalorian, we did the voice of a young Luke Skywalker at the end of season 2. Instead of using a different actor or extensive de-aging techniques, we trained our models on old interviews and voice material of Mark Hamill from previous Star Wars films and created a younger-sounding voice.
More recently, for the film The Brutalist we had to make Adrien Brody and Felicity Jones’s characters sound authentically Hungarian, which is one of the hardest languages to get right in terms of accent. Instead of finding a different voice, we used the actors’ original speech but refined their Hungarian-language performance using the accent conversion solution I mentioned. In other cases, we’ve needed to change the voices.
For the movie Emilia Pérez, it was a similar case of enhancing the actors’ capabilities, so we used our technology to refine and perfect specific singing notes and sounds in Karla Sofía Gascón’s voice. Since our involvement began early in production, we had to adapt our work as some songs evolved, making refinements to align with the director’s and composer’s vision.
How else is your tech being used in the entertainment space?
We can also bring back actors’ voices that are no longer available, for example if they’ve passed away [as we’ve done with Richard Nixon’s voice for the In Event of Moon Disaster documentary, Elvis Presley’s voice for the Metaphysic finale of America’s Got Talent, and Edith Piaf’s voice for the Warner Music biopic].
Sometimes we’ve been called in when a young actor ages during filming (goes through puberty) and their voice changes, so we’ve had to bring back the earlier voice.
We’ve also worked with voice talents to create different voice clones for different characters. The credited actors no longer have to perform really challenging voices themselves: we can use the voice talents and our tech to do it instead, and make it sound as if the actors are doing it (they just need to record some data for us). This can be done for multiple languages and accents too.
There is also a lot of room for automation with text-to-speech in film promotional material, including trailers.
How does your AI tech fit alongside the de-aging AI technology that’s becoming prevalent in films like The Irishman and Here?
Our technology is compatible with many visual providers, whether it’s facial movements or full-scale de-aging techniques. Our voice conversion doesn’t change the content and the timings stay the same, so you can apply our voice tech before the visual effects or even after.
Impressive. How else is AI tech like yours being used?
It’s being integrated into films, games, TV shows and even theme parks. But there are also fascinating cases in other sectors, like cybersecurity. For example, we’re testing cloning a company CEO’s voice to train staff to be ready for a cyberattack. There is healthcare as well, giving people who’ve lost their voices the chance to speak again. We’re partnering with a company to develop tech that helps people who have undergone a laryngectomy communicate using an electrolarynx.
Where do you see this AI technology going? What are the possibilities for the future, particularly in the screen sector?
This is an interesting question. From my perspective, first of all, we’re obviously very happy that the industry is already adopting solutions like ours in the right way. After the SAG strikes, they came up with rules on how you can apply the technology, and I think that is great. This space definitely should be regulated, and I’m happy to see that the European Union, the US and other countries have come up with new regulations on AI.
An increasing number of film, TV and commercial companies are exploring how to use AI in their work, but whereas a couple of years ago it was about how to adapt it to a particular project, now the main question is how do we use it responsibly and make it work long-term?
They’re exploring how to implement AI solutions in their workflows and establishing which work well and which don’t. But I don’t think the industry is ready to switch everything over to AI. For example, everyone is talking about automatically dubbing scenes using AI (uploading something and getting the dubbed version right away). That is already possible, but it’s not yet ready for really high-quality content.
Plus, I don’t think film and TV companies can fully integrate AI into their established workflows just yet. They still need humans to help manage it. But we will see more people in the industry who know how to work with AI, including sound engineers. Equally, more AI professionals will come into the industry to help with audio recordings and with the output from AI models.
And should AI solutions (and the professionals) be integrated from the start of a project?
It depends on the project and the work required. For us, sometimes our services are only used in post-production; other times AI voices can be created in pre-production. I’ve already worked on projects where the creatives say, ‘We want to achieve this, how do we implement it?’ - for example, a biopic of someone very famous where they want to bring that person’s voice to the story.
Finally, and importantly, what about the ethical side of AI? It’s a hot talking point. Is AI taking over and replacing people’s jobs?
AI is here to stay. I think it will be used more in the coming years for sure, but I don’t see how it can take over. Even with the AI-generated audio files we produce, we still need sound professionals to work with them.
On the visual side of things, we (human professionals) need to know precisely how a scene works and how it is supposed to look. AI doesn’t know this yet. For example, if you go to an AI visual provider and ask it to create a picture for you, then type ‘please change this small detail’ (like making a red book blue instead), it will regenerate the whole thing completely. So it’s not aligned with the workflows already in place just yet.
AI cannot make a full Hollywood movie (yet). It’s not like you can go on ChatGPT and say ‘make me a full movie’. You need people, stories and vision to understand how everything works. I don’t see AI replacing everyone anytime soon.