What makes speech translation unique — and how understanding it can break down the most challenging language barriers

People don’t speak the same way that they write. Nor do they experience spoken conversations in the same way that they experience reading an email or an article. Our ability to understand one another in the moment of speaking — bringing together all kinds of verbal and non-verbal communication to instantly grasp what someone means — is the high-wire act of human expression. When you take a step back and consider what happens in a conversation, it’s amazing just how much information is conveyed so quickly. 

When you set yourself the task of translating spoken interactions as they happen, as DeepL has done with our new DeepL Voice solution, you uncover all kinds of fascinating insights into what makes translating spoken language different to translating text. In this post, I’ll share some of those insights, and explain how we’re using them to transform the experience of meetings and conversations.

The challenges of translating real-time speech and how we overcame them

Instant, conversational communication is fundamentally human, and it’s extremely difficult for technology to replicate — even a technology as advanced as AI. If you want to create solutions for businesses that can help people follow and participate in conversations in multiple languages, you have to start with a deep understanding of the challenges involved.

Those challenges include replicating the human skill of anticipating what people are saying before they finish saying it. When you’re translating speech in live situations, you also need to anticipate how someone’s words can best be expressed in another language. Crucially, though, you need to do this before you know for certain how the original sentence will end, so as to avoid lengthy time lags. The risk is that what seems to be an accurate translation of the first few words turns out to be inaccurate once the speaker completes their sentence.

When we set about developing DeepL Voice, we knew that high-quality live translation of speech couldn’t be achieved through technology alone. It depends on a deep interest in, and understanding of, the different ways that language works. So we brought together experts in the linguistics of spoken conversation, and leveraged DeepL’s powerful contextual understanding of how different languages work. We also partnered with businesses to explore their priorities and the experience of speech translation that creates the most value for them.

The massive difference a second can make

One of the first things we learned is that timing is everything when it comes to real-time translation of a meeting or a conversation. If you can get close to the speed of speech, displaying the translation of a sentence by the time a speaker has finished it, then you can greatly improve how inclusive those meetings are.

As Christine Aubry, international coordinator for the global patisserie manufacturer Brioche Pasquier, explained at DeepL Dialogues, faster translations switch people’s mode from passive to active participation. Rather than struggling to keep up with what others are saying in another language, they feel fully up to speed. Like a native speaker, they have the opportunity to interject, shape the conversation and actively participate. A second or so makes a huge difference.

Speed is, therefore, a top priority when translating real-time speech. But speed has to be balanced against other priorities that also have a big impact on people’s experience. Translations must be as accurate as possible to avoid misunderstandings and confusion. And where possible, translations must minimize the “flickering” that occurs when previously translated text has to be corrected because the meaning changed. The lower the rate of this flickering, the easier it is for someone to follow a conversation in a natural way.
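To make the idea of a flicker rate concrete, here is a minimal Python sketch of one way it could be measured. It is an illustration only, not a description of how DeepL Voice works: the function names and the word-level definition (the share of already displayed words that a later caption update rewrites) are assumptions made for this example.

```python
def common_prefix_len(a, b):
    """Count how many leading items two word lists share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def flicker_count(previous, updated):
    """Number of already displayed words that the update rewrites or removes."""
    prev_words = previous.split()
    new_words = updated.split()
    return len(prev_words) - common_prefix_len(prev_words, new_words)


def flicker_rate(captions):
    """Rewritten words divided by displayed words, over successive caption updates."""
    rewritten = 0
    shown = 0
    for previous, updated in zip(captions, captions[1:]):
        rewritten += flicker_count(previous, updated)
        shown += len(previous.split())
    return rewritten / shown if shown else 0.0


# Hypothetical sequence of caption updates for one sentence.
updates = ["I think", "I think we", "I believe we should"]
print(flicker_rate(updates))
```

In this hypothetical sequence the first update only appends a word, while the second replaces "think we" after the caption has already been shown, so two of the five displayed words had to be rewritten.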

How language changes when people are talking, not typing

To translate live speech accurately, it’s important to understand the many differences between the patterns of written language and the rhythms of speech. For instance, the way people speak is far more individual and less consistent than the way that they write. They employ distinct turns of phrase and colloquialisms that can stem both from regional dialects and from their particular personality or self-image. In addition, people construct and correct sentences as they’re speaking, leading to disfluencies where one grammatically incorrect term is instantly followed by another, more correct one. Reproducing these literally in translation isn’t helpful to someone trying to understand the meaning.

Throughout conversations, people also utter short affirmations — such as “uh-huh” — to reassure speakers that they understand or agree with what they’re saying. These help the flow of the conversation itself, but clutter translations for people trying to follow in another language. It’s helpful to filter these elements of spoken language out of a translation.
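As a rough illustration of this kind of filtering, the Python sketch below drops utterances that consist only of short affirmations and strips filler words from the rest. The word lists and function names are hypothetical and deliberately tiny. A real system would need language-specific, context-aware rules, since a word like "right" can also be a genuine answer.

```python
import re

# Hypothetical, English-only word lists used purely for illustration.
BACKCHANNELS = {"uh-huh", "mm-hmm", "mhm", "yeah", "right", "okay"}
FILLERS = {"uh", "um", "er", "erm"}


def _normalize(token):
    """Lowercase a token and strip punctuation, keeping internal hyphens."""
    return re.sub(r"[^\w-]", "", token.lower())


def is_backchannel(utterance):
    """True if every word in the utterance is a short affirmation."""
    tokens = [t for t in (_normalize(t) for t in utterance.split()) if t]
    return bool(tokens) and all(t in BACKCHANNELS for t in tokens)


def strip_fillers(utterance):
    """Remove filler words while leaving the remaining words untouched."""
    kept = [t for t in utterance.split() if _normalize(t) not in FILLERS]
    return " ".join(kept)


def clean_for_translation(utterances):
    """Keep only the utterances, minus fillers, that carry content to translate."""
    cleaned = []
    for utterance in utterances:
        if is_backchannel(utterance):
            continue
        text = strip_fillers(utterance)
        if text:
            cleaned.append(text)
    return cleaned


print(clean_for_translation(["Uh-huh.", "So, um, the report is ready", "Yeah, right."]))
```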

Optimizing for real-time translation

The challenge gets even more interesting when you consider that a real-time translation platform isn’t translating complete sentences. It needs to translate a sentence while it’s being spoken, when the final meaning of that sentence isn’t yet clear. This requires us to optimize translations in a slightly different way. We don’t just want the most accurate translation, but an accurate translation that is flexible enough to incorporate new information that might change the direction of what’s being said.

Here’s an example: Imagine that we’re translating a virtual meeting in which one of the participants is speaking English, and one of the other participants is following what they’re saying with captions in German. Our English speaker interrupts the conversation to say, “I found it.” Now, if we assume this to be a complete sentence, the best possible German translation would be, “Ich habe es gefunden.” However, since this is live speech, we can’t be certain if the sentence is complete or not.

A better option, in this case, could be to use a translation like “Ich fand es” instead. Why? Because when the English speaker goes on to say, “I found it frustrating,” the “Ich fand es” translation is perfectly positioned to simply add the word “frustrierend”. If the first three words were translated as “Ich habe es gefunden,” the entire translation would need to be revised. That’s the type of major “flicker” that gets in the way of intuitively following a conversation, and which DeepL aims to minimize wherever possible.
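The trade-off in this example can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not DeepL’s method: it assumes we already have a short list of candidate partial translations and a list of full sentences the speaker might end up with, and it picks the candidate whose worst-case number of rewritten words is lowest. The names `rewrites_needed` and `most_stable` are made up for this example.

```python
def rewrites_needed(displayed, final):
    """Count displayed words that must change to reach the final translation."""
    shown, target = displayed.split(), final.split()
    kept = 0
    for a, b in zip(shown, target):
        if a != b:
            break
        kept += 1
    return len(shown) - kept


def most_stable(candidates, possible_finals):
    """Pick the candidate with the smallest worst-case rewrite count."""
    return min(
        candidates,
        key=lambda c: max(rewrites_needed(c, f) for f in possible_finals),
    )


# Candidate captions for the partial input "I found it".
candidates = ["Ich habe es gefunden", "Ich fand es"]
# Hypothetical final translations, depending on how the sentence ends.
possible_finals = ["Ich habe es gefunden", "Ich fand es frustrierend"]

print(most_stable(candidates, possible_finals))
```

Here “Ich habe es gefunden” would need three of its four words rewritten if the speaker continues with “frustrating”, while “Ich fand es” needs at most two rewrites in either case, so the sketch prefers it.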

Accurate, real-time speech translation involves a wide range of such contextual judgments that are best made when technology is guided by human expertise. That expertise includes insights into where different languages are likely to position the verbs that are crucial to a sentence’s meaning. If they come early in the sentence (as in French and Spanish), it’s possible to display a translation more quickly than when they come at the end. All of this helps a system to pause just long enough to be accurate, but not so long as to delay understanding unnecessarily.

Finding the sweet spot through language-specific understanding

This combination of human linguistic expertise with highly accurate translation is already enabling DeepL Voice to make a big difference to the experience of meetings and conversations for international businesses. These include NEC Corporation, which became the first company to fully deploy DeepL Voice, just a few weeks after our official launch.

The excitement around DeepL Voice reflects the fact that this is a groundbreaking moment for speech translation. The ability to decode and translate what people are saying, while they are saying it, multiplies the value that we can create for international businesses. It transforms the way that teams can collaborate, builds stronger relationships and ensures that different ideas and perspectives are always included. 

The advances we’ve made so far are already making a major difference to the way that organizations operate. There’s much more to come!
