How Descript enables multilingual video dubbing at scale
TL;DR
Descript uses OpenAI's reasoning models to optimize video dubbing for both semantic fidelity and duration adherence, enabling scalable multilingual translation. This approach increased exports of dubbed videos by 15% and improved pacing accuracy by 13-43 percentage points in the first month.
Key Takeaways
- Descript redesigned its translation pipeline with OpenAI models to optimize for semantic meaning and timing constraints simultaneously, not sequentially.
- Reliable syllable counting and language-specific pacing assumptions are critical for natural-sounding dubbing, with GPT-5 series models improving consistency in these tasks.
- The new system increased duration adherence from 40-60% to 73-83% across languages while maintaining high semantic fidelity (85.5% rated 4 or 5 out of 5).
- Dubbing requires balancing trade-offs between timing and meaning, with Descript automating evaluations to continuously improve model performance and scalability.
- Future improvements aim to make the pipeline more multimodal by incorporating audio and video data to preserve nonverbal speech characteristics like tone and emphasis.
Descript is an AI-native video editor built around a simple idea: if you can edit text, you should be able to edit video. Since Descript’s early days, AI has powered every aspect of the product: transcription, editing, audio cleanup, and increasingly complex creative workflows. They’ve built on OpenAI for years, using Whisper for transcription and GPT series models inside their co-editor Underlord.
Translation quickly emerged as a high-impact use case. Traditionally, translating video has been slow and expensive, requiring language experts to manage projects, produce rote translations, handle quality control, and generate corresponding audio. LLMs dramatically compress that workflow, making high-quality translation at scale possible.
Captions and dubbing both require semantic fidelity: the translation must preserve the original meaning. But duration adherence plays a different role in each. For captions, it's a nice-to-have. For dubbing, it's critical, because if translated speech runs too long or too short, it will sound unnatural even if the meaning is correct.
To address this, Descript redesigned its translation pipeline using OpenAI reasoning models to optimize for semantic fidelity and duration adherence during generation, not after. In the first 30 days after rollout, exports of translated videos with dubbing increased 15%, and duration adherence improved by 13 to 43 percentage points, depending on the language.
“Dubbing is an increasingly popular use case for Descript, so we’re building ways to do it in batch for companies that want to translate and lip-sync entire libraries,” said Laura Burkhauser, CEO.
Where dubbing started to break down
Translation was one of Descript’s earliest and most requested features. They started with captions-only translation, which worked well—but many users wanted to go further and have spoken audio (dubbing) in the target language.
However, one issue kept surfacing: dubbed audio didn’t always sound right. “Probably the number one complaint we heard was that the pace of the speech was unnatural in the translated language,” said Aleks Mistratov, Head of AI Product at Descript.
The problem came down to the fact that different languages take different amounts of time to express the same idea. Descript observed, for instance, that on average German is a “longer” language than English. To fit into fixed video segments, translated speech often had to be artificially sped up or slowed down. “You’d end up with something that sounded like chipmunks, or a sleepy giant,” Mistratov explained.
| English | German |
| --- | --- |
| “Please review the safety guidelines before operating the machine.” | “Bitte überprüfen Sie die Sicherheitsrichtlinien, bevor Sie die Maschine bedienen.” |
| Syllables: 18 | Syllables: 24 (33% increase) |
In this case, the German audio would either have to be sped up unnaturally, or the translation would need to be rewritten to fit the time budget.
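To make the counting problem concrete, here is a minimal sketch of a rule-based English syllable estimator using vowel groups. This is purely illustrative: Descript relies on the model itself for counting, in part because heuristics like this break down across languages and edge cases.

```python
import re

def count_syllables(word: str) -> int:
    """Rough English syllable estimate: count vowel groups,
    then drop a trailing silent 'e' ("machine" -> 2, not 3)."""
    w = word.lower()
    count = len(re.findall(r"[aeiouy]+", w))
    if w.endswith("e") and not w.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def sentence_syllables(sentence: str) -> int:
    """Sum the per-word estimates over all alphabetic tokens."""
    return sum(count_syllables(w) for w in re.findall(r"[a-zA-Z]+", sentence))
```

Heuristics like this misfire on words such as "guidelines" or "safety", which is exactly why consistent model-based counting mattered for the pipeline.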
Users were left with two options: manually retime the audio segment by segment, or rewrite the translation itself to make it fit. Both approaches required deep timeline edits and, often, near-native fluency in the target language. It was tedious for creators, and became a blocker to scaling the feature to large enterprise localization projects.
Optimizing translations for timing, not just meaning
The team had a clear theory of what it would take to make dubbing work. The system would need to not only optimize for semantic meaning, but also be aware of timing constraints. When translating from English into German, for example, the model would need to understand how to use fewer words or simplify the concept, so the dubbed audio would remain natural.
Earlier approaches optimized semantic fidelity first and attempted to correct timing afterward. The translations were often semantically correct, but they routinely missed the duration constraints, and the overall quality still wasn’t good enough.
“We ran incremental tests, not even generating anything, just asking the model to output the number of syllables in a chunk of text,” Mistratov said. “Earlier models simply weren’t good at that.”
Reliable syllable counting turned out to be critical. If the model could not consistently calculate syllables, it could not reliably target a specific duration window.
GPT‑5 series models brought a level of reasoning consistency that earlier models lacked, especially on tasks like syllable counting and constraint tracking. With that improvement, Descript redesigned its translation and dubbing pipeline.
First, Descript’s system breaks the transcript into chunks, guided by sentence boundaries, natural pauses, and speaking patterns in the original recording. Each chunk maintains semantic continuity, but is small enough to reason about as a timing unit.
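A simplified sketch of that chunking step, assuming word-level timestamps from transcription (the `Word` type, pause threshold, and function names here are illustrative assumptions, not Descript's actual code):

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float

def chunk_transcript(words: list[Word], pause_threshold: float = 0.6) -> list[list[Word]]:
    """Split a word-timed transcript into chunks at sentence boundaries
    or long pauses, so each chunk is a self-contained timing unit."""
    chunks, current = [], []
    for i, w in enumerate(words):
        current.append(w)
        ends_sentence = w.text.rstrip().endswith((".", "!", "?"))
        gap = (words[i + 1].start - w.end) if i + 1 < len(words) else 0.0
        if ends_sentence or gap >= pause_threshold:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks
```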
From there, the model calculates the number of syllables in the chunk. Using language-specific speaking-rate assumptions, the system estimates how many syllables the translated chunk should target to preserve natural pacing (“duration adherence”). The prompt asks the model to optimize for both duration adherence and meaning preservation. Surrounding chunks are passed in as context so that the model maintains semantic coherence across segments.
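The syllable-budgeting step can be sketched as follows. The speaking rates and tolerance below are made-up placeholder values; the real pipeline would use empirically measured, language-specific rates.

```python
# Illustrative syllables-per-second rates (placeholder values, not
# Descript's measured figures).
SPEAKING_RATE = {"en": 4.0, "de": 3.8, "es": 4.6, "ja": 5.2}

def target_syllables(chunk_duration_s: float, target_lang: str,
                     tolerance: float = 0.15) -> tuple[int, int]:
    """Return a (min, max) syllable budget for a translated chunk so the
    dubbed audio fits the original segment at a natural speaking rate."""
    center = chunk_duration_s * SPEAKING_RATE[target_lang]
    return round(center * (1 - tolerance)), round(center * (1 + tolerance))
```

The prompt would then instruct the model to produce a translation whose syllable count lands inside this budget while preserving meaning.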
The team evaluated multiple configurations to balance duration adherence, semantic fidelity, latency, and cost. The selected setup delivered strong constraint-following at production speed, enabling high-volume translation without manual retiming. The result is a translation pipeline where pacing is treated as a first-class variable instead of something corrected after the fact.
Defining and measuring natural pacing
To develop the acceptance criteria for evals, the team ran listening tests: they generated translated audio samples and adjusted the playback speed in small increments, asking users to rate when speech became unnatural.
“Anything that was slowed down by 10%, or sped up by 20%, generally still sounded natural,” Mistratov said. Beyond this range, speech became too distorted.
Earlier systems performed poorly by that measure: depending on the language, only 40% to 60% of segments fell within the acceptable pacing window. With the redesigned pipeline, adherence rose to between 73% and 83%, depending on the language.
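As a sketch, the pacing window from the listening tests (up to 10% slowdown, up to 20% speedup) translates into a simple per-segment check; the function names here are illustrative.

```python
def duration_adherence(natural_s: float, slot_s: float,
                       max_slowdown: float = 0.10,
                       max_speedup: float = 0.20) -> bool:
    """True if the synthesized audio fits its timeline slot without
    stretching playback beyond the listener-tested naturalness window."""
    factor = natural_s / slot_s  # playback speed needed to fit the slot
    return (1 - max_slowdown) <= factor <= (1 + max_speedup)

def adherence_rate(pairs: list[tuple[float, float]]) -> float:
    """Fraction of (natural, slot) duration pairs inside the window."""
    return sum(duration_adherence(n, s) for n, s in pairs) / len(pairs)
```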
The team also evaluated semantic fidelity using a separate model-as-judge rating on a scale ranging from 1 (“completely different”) to 5 (“semantically equivalent”). For dubbing, they decided to accept a lower semantic threshold than for caption-only translation, where duration constraints are irrelevant. Even with that tradeoff, 85.5% of segments were rated a four or five out of five for semantic adherence.
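A model-as-judge evaluation of this shape can be sketched as below. The prompt wording and helper names are hypothetical; only the 1-5 scale and its endpoint labels come from the source.

```python
import re

JUDGE_PROMPT = """Rate how well the translation preserves the source meaning,
from 1 (completely different) to 5 (semantically equivalent).
Source: {source}
Translation: {translation}
Answer with a single integer."""

def build_judge_prompt(source: str, translation: str) -> str:
    """Fill the rating prompt sent to the judge model."""
    return JUDGE_PROMPT.format(source=source, translation=translation)

def parse_rating(model_output: str) -> int:
    """Extract the 1-5 rating from the judge model's reply."""
    m = re.search(r"[1-5]", model_output)
    if not m:
        raise ValueError(f"no rating found in: {model_output!r}")
    return int(m.group())
```

Because the rating is a bare integer on a fixed scale, results can be aggregated automatically, which is what lets Descript rerun the same benchmark against new models and prompts.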
The result was a system that could balance two competing constraints—timing and meaning—with measurable confidence. And because both metrics were automated, Descript is able to continuously evaluate new model releases and prompt variations against the same benchmarks.
Unlocking large-scale video localization
As translation moves from single videos to large content libraries, Descript is building more control into how translations are tuned, including the ability to prioritize stricter semantic fidelity when needed.
Translation inside Descript is only one layer of a broader multimodal system. Translated text feeds into speech generation, which then drives lip sync and final video rendering.
Improvements at the text layer make natural pacing possible, but the overall experience also depends on how well the audio model preserves tone, cadence, and nonverbal characteristics of speech. That’s where the team sees the next frontier.
“A lot of what's going to improve translation output is making the pipeline more multimodal: incorporating audio, video, and text together when deciding how to translate,” said Mistratov. “That should better maintain the nonverbal characteristics of speech, like tone and emphasis, and preserve even more of the original delivery.”
For Descript, stronger reasoning models made the complexity of dubbing tractable. By crossing the threshold where models could reliably balance tradeoffs between pacing and meaning, translation became something the team could systematically improve, and deploy at scale.