Voice Generator: Text-to-Speech, Dubbing, Lip Sync, and Voice Cloning in One Module
One module for every voice and audio task — from script to screen-ready output

Vít Bilinec
Founder & CEO · February 13, 2026 · 8 min read

Voice Generator: Everything you need for professional audio and voice production
The Voice Generator is a complete audio production toolkit built into Moduvo. Whether you need a quick voiceover, a multi-character narration, a dubbed version of your video in another language, or lip-synced output — it all happens inside one module.
Here's a detailed look at everything the Voice Generator can do.
Text-to-Speech
At its core, the Voice Generator converts text into natural-sounding speech. You get access to 67+ AI voices across two providers — a standard set for everyday use and a premium set with enhanced realism.
Advanced voice settings
The premium provider gives you fine-grained control over how your audio sounds:
- Speed — Adjust playback speed from 0.7x to 1.2x
- Stability — Controls how consistent the voice sounds across the audio
- Clarity — Adjusts the crispness and enunciation of speech
- Style — Adds expressiveness and emotional variation
- Speaker Boost — Enhances the voice presence in the mix
You can also choose from 4 built-in presets — Default, Narration, Conversational, and Dramatic — to quickly match the tone you need.
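To make the ranges above concrete, here is a minimal sketch of how these settings could be modeled. The 0.7x to 1.2x speed range comes from this article. The field names, the 0 to 1 scale for the other sliders, and the default values are assumptions for illustration, not Moduvo's actual API or preset values.

```python
from dataclasses import dataclass

SPEED_MIN, SPEED_MAX = 0.7, 1.2  # speed range stated in the article


@dataclass
class VoiceSettings:
    # Hypothetical model: names mirror the UI labels, the 0..1 scale
    # and the defaults below are placeholders, not documented values.
    speed: float = 1.0
    stability: float = 0.5
    clarity: float = 0.75
    style: float = 0.0
    speaker_boost: bool = True

    def __post_init__(self) -> None:
        # Clamp speed into the supported range instead of rejecting it.
        self.speed = min(max(self.speed, SPEED_MIN), SPEED_MAX)
        # Reject slider values outside the assumed 0..1 scale.
        for name in ("stability", "clarity", "style"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")
```

Clamping the speed rather than raising an error is a design choice in this sketch: a value typed slightly out of range still produces usable audio.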
Pronunciation toolbar
For precise control over how text is spoken, the pronunciation toolbar lets you insert:
- Short pauses — Brief breaks between words or phrases
- Long pauses — Paragraph-level breaks for dramatic effect
- Emphasis — Stress specific words for impact
- Spell-out — Force the AI to spell out abbreviations letter by letter
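This article does not document the markup the toolbar inserts. As a rough illustration only, the four controls map naturally onto SSML, the W3C speech markup standard. The sketch below assumes SSML tags and picks pause lengths (300ms and 1s) arbitrarily; the helper names are hypothetical.

```python
from xml.sax.saxutils import escape

# Assumed pause lengths; the article does not state the real durations.
SHORT_PAUSE = "300ms"
LONG_PAUSE = "1s"


def short_pause() -> str:
    """Brief break between words or phrases."""
    return f'<break time="{SHORT_PAUSE}"/>'


def long_pause() -> str:
    """Paragraph-level break."""
    return f'<break time="{LONG_PAUSE}"/>'


def emphasis(text: str) -> str:
    """Stress a word or phrase."""
    return f'<emphasis level="strong">{escape(text)}</emphasis>'


def spell_out(text: str) -> str:
    """Read an abbreviation letter by letter."""
    return f'<say-as interpret-as="characters">{escape(text)}</say-as>'


line = f"Welcome to {spell_out('TTS')}.{short_pause()} This is {emphasis('important')}."
```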
Text import
You don't have to type or paste everything manually. Import text directly from .txt and .docx files to speed up your workflow.
Cost and duration estimator
Before generating, you can see a real-time estimate of the output duration (based on ~150 words per minute) and the credit cost — so there are no surprises.
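The duration side of the estimate is simple arithmetic, sketched below using the 150 words per minute figure from this article. The credit calculation is a placeholder: the per-1,000-character rate is a made-up parameter, since actual pricing is not covered here.

```python
import math

WORDS_PER_MINUTE = 150  # speaking rate stated in the article


def estimate_duration_seconds(text: str) -> float:
    """Estimate spoken duration from the word count."""
    words = len(text.split())
    return words / WORDS_PER_MINUTE * 60


def estimate_credits(text: str, credits_per_1000_chars: int) -> int:
    """Hypothetical cost model: a flat rate per 1,000 characters, rounded up."""
    return math.ceil(len(text) * credits_per_1000_chars / 1000)
```

Under these assumptions, a 300-word script comes out at 120 seconds, or two minutes of audio.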
Long Form Mode — Multiple voices in one audio
Long Form Mode is designed for scripts, dialogues, narrations, and any content that involves more than one voice.
How it works
- Add as many paragraphs as you need to your script
- Assign a different voice to each paragraph — perfect for multi-character dialogue, interview simulations, or narrated stories
- Set per-paragraph speed overrides so each voice can have its own pacing
- Drag and drop paragraphs to reorder them instantly
- Import long text and split it into paragraphs automatically
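How the automatic split decides where one paragraph ends is not specified in this article. A reasonable guess is that blank lines act as the boundary, which a short sketch can show. The function name and the whitespace cleanup are assumptions.

```python
import re


def split_into_paragraphs(text: str) -> list[str]:
    """Split imported text on blank lines and tidy each paragraph.

    Assumes one or more blank lines separate paragraphs; single line
    breaks inside a paragraph are collapsed into spaces.
    """
    blocks = re.split(r"\n\s*\n", text.strip())
    return [" ".join(block.split()) for block in blocks if block.strip()]
```

Each returned string would then become one paragraph in the script, ready to be assigned a voice.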
Seamless stitching
When you generate, all paragraphs are processed individually and then automatically combined into one final audio file. Segments that share the same voice are stitched together seamlessly for a natural flow — no manual editing required.
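One way to picture the stitching step is as a grouping pass over the script. The sketch below assumes that "segments that share the same voice" means neighboring paragraphs with the same voice, and it only plans the groups; it does not touch audio. The function name and the data shape are hypothetical.

```python
from itertools import groupby


def plan_stitching(paragraphs: list[tuple[str, str]]) -> list[tuple[str, list[int]]]:
    """Group consecutive paragraphs that use the same voice.

    `paragraphs` is a list of (voice, text) pairs in script order.
    Returns (voice, paragraph_indices) runs, still in script order.
    """
    runs = []
    # groupby only merges neighbors, so a voice that returns later
    # in the script starts a new run.
    for voice, group in groupby(enumerate(paragraphs), key=lambda item: item[1][0]):
        runs.append((voice, [index for index, _ in group]))
    return runs


script = [("anna", "Hi."), ("anna", "How are you?"), ("ben", "Fine."), ("anna", "Good.")]
plan = plan_stitching(script)
```

Here the first two paragraphs form one run for the same voice, and the final audio would be the runs concatenated in order.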
Full toolbar access
Everything from the standard TTS mode is available in Long Form too — pronunciation toolbar, text import, voice previews, and the cost estimator.
Voice Dubbing — Multi-language, multi-speaker
Voice Dubbing takes an existing audio or video file and creates a dubbed version in a different language.
Automatic detection
The system automatically identifies:
- Individual speakers in the source material — no manual tagging needed
- The source language — so you don't have to specify it
29 supported languages
Dub your content into any of these languages:
English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese (Mandarin), Japanese, Korean, Hindi, Indonesian, Filipino, Malay, Ukrainian, Greek, Romanian, Danish, Finnish, Bulgarian, Croatian, Slovak, Swedish, and Tamil.
Voice cloning in dubbing
Enable the voice cloning option to preserve the original speakers' voices in the dubbed output. The AI recreates each speaker's vocal characteristics in the target language — maintaining identity across translations.
Smart audio extraction
For large video files (100MB+), the system automatically extracts just the audio track before uploading. This reduces the file size by approximately 90% — for example, a 764MB video becomes ~80MB of audio — resulting in faster uploads and better processing performance.
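The numbers above are easy to check. This sketch uses the 100MB threshold and the 764MB example from this article. Treating "100MB+" as "100MB or more" is an assumption, and the function names are illustrative.

```python
EXTRACTION_THRESHOLD_MB = 100  # "100MB+" from the article


def should_extract_audio(size_mb: float) -> bool:
    """Decide whether to upload only the audio track."""
    return size_mb >= EXTRACTION_THRESHOLD_MB


def reduction_percent(original_mb: float, extracted_mb: float) -> float:
    """Percentage by which the upload shrinks."""
    return (1 - extracted_mb / original_mb) * 100


saving = round(reduction_percent(764, 80), 1)  # the article's example
```

For the 764MB example, the saving works out to 89.5%, in line with the approximate 90% figure.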
Additional options
- High Quality Mode for maximum audio fidelity
- Background audio preservation — keep or remove background sounds and music
Custom Voice Cloning — My Voices
The My Voices tab lets you create and manage your own custom AI voices.
How it works
- Upload voice samples in the dedicated cloning interface
- The system creates a custom voice profile based on your recordings
- Your cloned voice appears in the voice selector across both standard TTS and Long Form Mode
Voice management
- Rename custom voices for easy identification
- Delete voices you no longer need
- Use your cloned voices alongside the 67+ built-in voices
This is ideal for brand consistency, personal narration, or any scenario where you need a specific voice that isn't available in the default library.
Split Media
Split Media separates any video file into its individual audio and video components.
Why it's useful
- Prepare files for Lip Sync — extract the audio track, replace it, then sync
- Extract clean audio from video recordings for editing or transcription
- Isolate the video to add new audio or music later
How it works
- Upload a video file (up to 500MB)
- Processing happens entirely in your browser — no server upload needed
- Download the separated audio (WAV) and muted video (MP4) independently
Because processing runs in your browser, your files never leave your machine. That skips the upload step, which makes it faster, and keeps your media private.
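This article does not say which tool does the splitting inside the browser. For readers who want the same result on the command line, the sketch below builds the equivalent ffmpeg commands: `-vn` drops the video stream, `-an` drops the audio stream, and `-c:v copy` keeps the video without re-encoding. The output file names are placeholders.

```python
def split_commands(
    source: str,
    audio_out: str = "audio.wav",
    video_out: str = "video_muted.mp4",
) -> tuple[list[str], list[str]]:
    """Build ffmpeg argument lists for a WAV audio track and a muted MP4."""
    # Audio only: drop video, encode as 16-bit PCM for a WAV container.
    extract_audio = ["ffmpeg", "-i", source, "-vn", "-c:a", "pcm_s16le", audio_out]
    # Video only: drop audio, copy the video stream as-is.
    mute_video = ["ffmpeg", "-i", source, "-an", "-c:v", "copy", video_out]
    return extract_audio, mute_video


audio_cmd, video_cmd = split_commands("input.mp4")
```

With ffmpeg installed, each list can be passed to `subprocess.run(cmd, check=True)`.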
Lip Sync
Lip Sync generates a new video where the speaker's lip movements match replacement audio.
Use cases
- Replace a video's dialogue with audio in a different language
- Fix audio issues while keeping the original video
- Create localized versions of talking-head content
How it works
- Upload your source video (the visual you want to keep)
- Upload the replacement audio (what you want the speaker to say)
- Choose your quality tier
- The system generates a new video with synchronized lip movements
Large file support
Lip Sync handles files up to 1GB using resumable uploads. If your connection drops mid-upload, it picks up where it left off — no need to start over.
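The general idea behind resumable uploads can be shown with a small simulation. This is not Moduvo's protocol, which is not described here. It is a generic sketch in which the server remembers how many bytes it has, and the client asks for that offset before sending the next chunk. The class and function names are hypothetical.

```python
class FakeUploadServer:
    """In-memory stand-in for a server that tracks received bytes."""

    def __init__(self, fail_after_chunks=None):
        self.received = bytearray()
        self._chunks_until_failure = fail_after_chunks

    def offset(self) -> int:
        return len(self.received)

    def append(self, offset: int, chunk: bytes) -> None:
        if offset != len(self.received):
            raise ValueError("offset mismatch")
        if self._chunks_until_failure == 0:
            self._chunks_until_failure = None  # fail only once
            raise ConnectionError("simulated connection drop")
        if self._chunks_until_failure is not None:
            self._chunks_until_failure -= 1
        self.received.extend(chunk)


def upload(data: bytes, server: FakeUploadServer, chunk_size: int) -> None:
    """Send chunks starting from whatever the server already has."""
    offset = server.offset()
    while offset < len(data):
        server.append(offset, data[offset:offset + chunk_size])
        offset = server.offset()


def upload_with_resume(data, server, chunk_size, max_attempts=3) -> bool:
    """Retry after a drop; each attempt resumes at the server's offset."""
    for _ in range(max_attempts):
        try:
            upload(data, server, chunk_size)
            return True
        except ConnectionError:
            continue
    return False


data = bytes(range(10))
server = FakeUploadServer(fail_after_chunks=2)
ok = upload_with_resume(data, server, chunk_size=3)
```

In this run the connection drops after two chunks (6 bytes), and the second attempt sends only the remaining 4 bytes instead of starting over.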
Getting started
The Voice Generator is available as a module in your Moduvo workspace. Activate it from the Module Shop, and all six capabilities — Text-to-Speech, Long Form Mode, Voice Dubbing, Custom Voice Cloning, Split Media, and Lip Sync — are immediately available in one unified interface.
Each feature works independently, so you can use exactly what you need without setting up or configuring the others.


