Voice Generator: Text-to-Speech, Dubbing, Lip Sync, and Voice Cloning in One Module

One module for every voice and audio task — from script to screen-ready output

Vít Bilinec

Founder & CEO · February 13, 2026 · 8 min read

The Voice Generator is a complete audio production toolkit built into Moduvo. Whether you need a quick voiceover, a multi-character narration, a dubbed version of your video in another language, or lip-synced output — it all happens inside one module.

Here's a detailed look at everything the Voice Generator can do.


Text-to-Speech

At its core, the Voice Generator converts text into natural-sounding speech. You get access to 67+ AI voices across two providers — a standard set for everyday use and a premium set with enhanced realism.

Advanced voice settings

The premium provider gives you fine-grained control over how your audio sounds:

  • Speed — Adjust playback speed from 0.7x to 1.2x
  • Stability — Controls how consistent the voice sounds across the audio
  • Clarity — Adjusts the crispness and enunciation of speech
  • Style — Adds expressiveness and emotional variation
  • Speaker Boost — Enhances the voice presence in the mix

You can also choose from 4 built-in presets — Default, Narration, Conversational, and Dramatic — to quickly match the tone you need.
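
As a rough illustration of how these controls and presets could fit together, here is a minimal Python sketch. The class name VoiceSettings, the 0 to 1 scales, and every numeric preset value are assumptions made for the example. Only the 0.7x to 1.2x speed range and the four preset names come from the feature description above.

```python
from dataclasses import dataclass, replace

SPEED_MIN, SPEED_MAX = 0.7, 1.2  # speed range stated in the article


@dataclass(frozen=True)
class VoiceSettings:
    """Hypothetical container for the premium voice controls."""
    speed: float = 1.0
    stability: float = 0.5   # assumed 0..1 scale
    clarity: float = 0.75    # assumed 0..1 scale
    style: float = 0.0       # assumed 0..1 scale
    speaker_boost: bool = True

    def __post_init__(self) -> None:
        # Reject values outside the documented speed range.
        if not SPEED_MIN <= self.speed <= SPEED_MAX:
            raise ValueError(f"speed must be between {SPEED_MIN} and {SPEED_MAX}")
        for name in ("stability", "clarity", "style"):
            if not 0.0 <= getattr(self, name) <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1")


# Preset names come from the article; the numbers are placeholders.
PRESETS = {
    "Default": VoiceSettings(),
    "Narration": VoiceSettings(speed=0.95, stability=0.7, style=0.1),
    "Conversational": VoiceSettings(speed=1.05, stability=0.4, style=0.3),
    "Dramatic": VoiceSettings(speed=0.9, stability=0.3, style=0.7),
}

# Start from a preset and override a single control.
custom = replace(PRESETS["Narration"], speed=1.1)
```

Modelling a preset as a complete settings object means that picking a preset and then nudging one slider is a single copy-and-override step.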

Pronunciation toolbar

For precise control over how text is spoken, the pronunciation toolbar lets you insert:

  • Short pauses — Brief breaks between words or phrases
  • Long pauses — Paragraph-level breaks for dramatic effect
  • Emphasis — Stress specific words for impact
  • Spell-out — Force the AI to spell out abbreviations letter by letter
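
For readers who think in markup, these four controls correspond closely to elements of the W3C SSML standard. The sketch below shows that correspondence in Python. It is an assumption that SSML is a useful mental model here: the article does not say how the toolbar stores its markers, and the pause lengths are placeholders.

```python
from xml.sax.saxutils import escape


def short_pause(ms: int = 300) -> str:
    # Brief break between words or phrases (duration is a placeholder).
    return f'<break time="{ms}ms"/>'


def long_pause(ms: int = 1000) -> str:
    # Paragraph-level break (duration is a placeholder).
    return f'<break time="{ms}ms"/>'


def emphasis(text: str) -> str:
    # Stress a word or phrase.
    return f'<emphasis level="strong">{escape(text)}</emphasis>'


def spell_out(text: str) -> str:
    # Read an abbreviation letter by letter.
    return f'<say-as interpret-as="characters">{escape(text)}</say-as>'


script = (
    "<speak>Welcome to the " + spell_out("FAQ") + "." + long_pause()
    + "This part is " + emphasis("important") + "." + short_pause()
    + "Thanks for listening.</speak>"
)
```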

Text import

You don't have to type or paste everything manually. Import text directly from .txt and .docx files to speed up your workflow.

Cost and duration estimator

Before generating, you can see a real-time estimate of the output duration (based on ~150 words per minute) and the credit cost — so there are no surprises.
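
The duration side of that estimate is simple arithmetic, sketched below in Python. The 150 words-per-minute rate comes from the description above. The credit formula and its credits_per_1000_chars parameter are purely hypothetical, since the article does not state how credits are priced.

```python
import math

WORDS_PER_MINUTE = 150  # speaking rate stated in the article


def estimate_duration_seconds(text: str) -> float:
    """Estimate spoken duration from a whitespace-separated word count."""
    words = len(text.split())
    return words / WORDS_PER_MINUTE * 60


def format_duration(seconds: float) -> str:
    """Format seconds as m:ss for display."""
    total = round(seconds)
    return f"{total // 60}:{total % 60:02d}"


def estimate_credits(text: str, credits_per_1000_chars: float) -> int:
    """Hypothetical pricing model: a flat rate per 1,000 characters, rounded up."""
    return math.ceil(len(text) / 1000 * credits_per_1000_chars)


sample = " ".join(["word"] * 300)
print(format_duration(estimate_duration_seconds(sample)))  # prints 2:00
```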


Long Form Mode — Multiple voices in one audio

Long Form Mode is designed for scripts, dialogues, narrations, and any content that involves more than one voice.

How it works

  • Add as many paragraphs as you need to your script
  • Assign a different voice to each paragraph — perfect for multi-character dialogue, interview simulations, or narrated stories
  • Set per-paragraph speed overrides so each voice can have its own pacing
  • Drag and drop paragraphs to reorder them instantly
  • Import long text and split it into paragraphs automatically
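
The automatic split in the last step can be pictured with a short Python sketch. It assumes that paragraphs are separated by one or more blank lines; the article does not specify the rule Long Form Mode actually applies.

```python
import re
from typing import List


def split_into_paragraphs(text: str) -> List[str]:
    """Split imported text on blank lines (assumed paragraph delimiter)."""
    normalized = text.replace("\r\n", "\n").strip()
    if not normalized:
        return []
    parts = re.split(r"\n\s*\n", normalized)
    return [part.strip() for part in parts if part.strip()]


imported = "Host: Welcome back.\n\nGuest: Glad to be here.\n\n\nHost: Let's begin."
paragraphs = split_into_paragraphs(imported)
```

Each resulting paragraph can then be given its own voice and speed override.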

Seamless stitching

When you generate, all paragraphs are processed individually and then automatically combined into one final audio file. Segments that share the same voice are stitched together seamlessly for a natural flow — no manual editing required.
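
A minimal Python sketch of that generate-then-combine flow follows. Several things are assumed for illustration: the Paragraph type, the synthesize callback standing in for the real speech engine, and the reading of "share the same voice" as consecutive paragraphs with the same voice. Joining raw bytes is only valid for headerless audio such as raw PCM; real files would need a proper audio library.

```python
from itertools import groupby
from typing import Callable, List, NamedTuple, Tuple


class Paragraph(NamedTuple):
    voice: str
    text: str
    speed: float = 1.0  # per-paragraph speed override


def merge_same_voice_runs(
    paragraphs: List[Paragraph],
    synthesize: Callable[[Paragraph], bytes],  # hypothetical TTS call
) -> List[Tuple[str, bytes]]:
    """Render each paragraph on its own, then join consecutive same-voice clips."""
    clips = [(p.voice, synthesize(p)) for p in paragraphs]
    return [
        (voice, b"".join(audio for _, audio in run))
        for voice, run in groupby(clips, key=lambda clip: clip[0])
    ]


def combine(runs: List[Tuple[str, bytes]]) -> bytes:
    """Concatenate the merged runs into one final track."""
    return b"".join(audio for _, audio in runs)


def fake_tts(p: Paragraph) -> bytes:
    # Stand-in for a speech engine: returns the text as bytes.
    return p.text.encode()


script = [
    Paragraph("Anna", "a"),
    Paragraph("Anna", "b"),
    Paragraph("Ben", "c"),
    Paragraph("Anna", "d"),
]
runs = merge_same_voice_runs(script, fake_tts)
final_audio = combine(runs)
```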

Full toolbar access

Everything from the standard TTS mode is available in Long Form too — pronunciation toolbar, text import, voice previews, and the cost estimator.


Voice Dubbing — Multi-language, multi-speaker

Voice Dubbing takes an existing audio or video file and creates a dubbed version in a different language.

Automatic detection

The system automatically identifies:

  • Individual speakers in the source material — no manual tagging needed
  • The source language — so you don't have to specify it

29 supported languages

Dub your content into any of these languages:

English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese (Mandarin), Japanese, Korean, Hindi, Indonesian, Filipino, Malay, Ukrainian, Greek, Romanian, Danish, Finnish, Bulgarian, Croatian, Slovak, Swedish, and Tamil.

Voice cloning in dubbing

Enable the voice cloning option to preserve the original speakers' voices in the dubbed output. The AI recreates each speaker's vocal characteristics in the target language — maintaining identity across translations.

Smart audio extraction

For large video files (100MB+), the system automatically extracts just the audio track before uploading. This reduces the file size by approximately 90% (for example, a 764MB video becomes ~80MB of audio), which means faster uploads and faster processing.
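
The arithmetic behind that figure is easy to check. In the Python sketch below, the function names are hypothetical, and "100MB+" is assumed to mean 100MB or larger.

```python
EXTRACTION_THRESHOLD_MB = 100  # "100MB+" from the article, assumed inclusive


def should_extract_audio(size_mb: float) -> bool:
    """Decide whether to strip the audio track before uploading."""
    return size_mb >= EXTRACTION_THRESHOLD_MB


def reduction_percent(original_mb: float, extracted_mb: float) -> float:
    """Percentage by which the upload shrinks after extraction."""
    return (1 - extracted_mb / original_mb) * 100


# The article's example: a 764MB video becomes about 80MB of audio.
print(round(reduction_percent(764, 80), 1))  # prints 89.5
```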

Additional options

  • High Quality Mode for maximum audio fidelity
  • Background audio preservation — keep or remove background sounds and music

Custom Voice Cloning — My Voices

The My Voices tab lets you create and manage your own custom AI voices.

How it works

  1. Upload voice samples in the dedicated cloning interface
  2. The system creates a custom voice profile based on your recordings
  3. Your cloned voice appears in the voice selector across both standard TTS and Long Form Mode

Voice management

  • Rename custom voices for easy identification
  • Delete voices you no longer need
  • Use your cloned voices alongside the 67+ built-in voices

This is ideal for brand consistency, personal narration, or any scenario where you need a specific voice that isn't available in the default library.


Split Media

Split Media separates any video file into its individual audio and video components.

Why it's useful

  • Prepare files for Lip Sync — extract the audio track, replace it, then sync
  • Extract clean audio from video recordings for editing or transcription
  • Isolate the video to add new audio or music later

How it works

  • Upload a video file (up to 500MB)
  • Processing happens entirely in your browser — no server upload needed
  • Download the separated audio (WAV) and muted video (MP4) independently

The browser-based approach means your files never leave your machine during processing, so there is no upload wait and your media stays private.
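
To make the operation concrete, the Python sketch below builds the two FFmpeg commands that would perform an equivalent split on the command line. This is not how the module works internally (it runs in the browser, and its implementation is not described); the helper name and the output file names are assumptions.

```python
from pathlib import Path
from typing import List, Tuple


def split_commands(source: str) -> Tuple[List[str], List[str]]:
    """Build FFmpeg argument lists for a WAV audio track and a muted MP4."""
    src = Path(source)
    audio_out = src.with_suffix(".wav")
    video_out = src.with_name(src.stem + "_muted.mp4")
    # -vn drops the video stream; pcm_s16le writes uncompressed 16-bit WAV audio.
    extract_audio = ["ffmpeg", "-i", str(src), "-vn", "-c:a", "pcm_s16le", str(audio_out)]
    # -an drops the audio stream; "copy" keeps the video stream without re-encoding,
    # which only works when the source codec is allowed in an MP4 container.
    mute_video = ["ffmpeg", "-i", str(src), "-an", "-c:v", "copy", str(video_out)]
    return extract_audio, mute_video


audio_cmd, video_cmd = split_commands("interview.mp4")
```

The lists can be passed to subprocess.run on a machine with FFmpeg installed; the sketch only constructs them.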


Lip Sync

Lip Sync generates a new video where the speaker's lip movements match replacement audio.

Use cases

  • Replace dialogue in a video with a different language
  • Fix audio issues while keeping the original video
  • Create localized versions of talking-head content

How it works

  1. Upload your source video (the visual you want to keep)
  2. Upload the replacement audio (what you want the speaker to say)
  3. Choose your quality tier
  4. The system generates a new video with synchronized lip movements

Large file support

Lip Sync handles files up to 1GB using resumable uploads. If your connection drops mid-upload, it picks up where it left off — no need to start over.
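
The idea behind resuming can be sketched in a few lines of Python. The sketch assumes the server can report how many bytes it has already received, so the client only sends what is missing. The function name, the chunk size, and that offset-reporting behavior are all assumptions; the article does not describe the upload protocol.

```python
from typing import Iterator, Tuple


def remaining_chunks(
    file_size: int, confirmed_offset: int, chunk_size: int
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) byte ranges still to be sent, with end exclusive."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= confirmed_offset <= file_size:
        raise ValueError("confirmed_offset must lie within the file")
    start = confirmed_offset
    while start < file_size:
        end = min(start + chunk_size, file_size)
        yield start, end
        start = end


# A 10-byte file in 4-byte chunks, where the server already holds the first 4 bytes.
todo = list(remaining_chunks(file_size=10, confirmed_offset=4, chunk_size=4))
# todo == [(4, 8), (8, 10)]
```

After a dropped connection, the client would ask for the confirmed offset again and call the same function, which is why nothing already uploaded has to be resent.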


Getting started

The Voice Generator is available as a module in your Moduvo workspace. Activate it from the Module Shop, and all six capabilities — Text-to-Speech, Long Form Mode, Voice Dubbing, Custom Voice Cloning, Split Media, and Lip Sync — are immediately available in one unified interface.

Each feature works independently, so you can use exactly what you need without setting up or configuring the others.

Ready to add voice to your workflow?

Get started with the Voice Generator and explore everything it can do for your team.

Available on Enterprise and Custom plans.