AI Voice Nodes

AI Voice Nodes convert text to speech (TTS), clone voices from audio samples, and transform speech between voices (STS — Speech-to-Speech). They support three input types: a script (text), a reference voice (audio for cloning), and a performance (audio/video for speech transformation and dubbing).

Inputs & Outputs

Port	Direction	Type	Description
input	In	Text	Script/text from Text Node — the words to speak
reference	In	Audio	Voice sample from Audio Node — the voice to clone
performance	In	Audio/Video	Source audio or video for speech-to-speech transformation or dubbing
output	Out	Audio	Generated speech audio
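
The port table above can be read as a small type-compatibility rule: each input port accepts only certain media types. The following is a minimal sketch of that rule in Python. The names `Port`, `AI_VOICE_PORTS`, and `can_connect` are hypothetical and chosen for illustration; they are not part of any product API.

```python
from dataclasses import dataclass

# Hypothetical model of the ports listed in the table above.
# All names here are illustrative assumptions, not a product API.
@dataclass(frozen=True)
class Port:
    name: str
    direction: str        # "in" or "out"
    accepts: frozenset    # media types the port carries

AI_VOICE_PORTS = {
    "input": Port("input", "in", frozenset({"text"})),
    "reference": Port("reference", "in", frozenset({"audio"})),
    "performance": Port("performance", "in", frozenset({"audio", "video"})),
    "output": Port("output", "out", frozenset({"audio"})),
}

def can_connect(source_type: str, port_name: str) -> bool:
    """Return True if a source of the given media type may feed the named input port."""
    port = AI_VOICE_PORTS.get(port_name)
    if port is None or port.direction != "in":
        return False
    return source_type in port.accepts
```

For example, `can_connect("video", "performance")` is true under this model, while `can_connect("video", "reference")` is false because the reference port takes audio only.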

Inspector Controls

Voice Selection — Dropdown to choose from preset voices or use a cloned voice (when Audio Node is connected as reference).
Generation Mode — TTS (Text-to-Speech from script) or STS (Speech-to-Speech from performance audio).
Language — Target language for generation.
Speed — Adjust speech speed (slower for narration, faster for energetic content).

Generation Modes

Text-to-Speech (TTS)

Connect a Text Node with the script. The AI Voice Node generates speech using the selected voice or cloned voice.

Best for: narration, voiceovers, audiobooks, accessibility

Speech-to-Speech (STS)

Connect an Audio or Video Node as performance. The AI Voice Node transforms the speech into a different voice.

Best for: dubbing, voice acting, translating spoken content

Voice Cloning

Connect an Audio Node as reference (voice sample) and a Text Node as script. The AI Voice Node generates speech in the cloned voice.

Best for: brand voices, character consistency, personalized content
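
Each mode described above needs a different set of connected ports. As a rough illustration, the sketch below checks which required ports are still unconnected for a chosen mode. The mode keys and the `missing_ports` helper are hypothetical names used only for this example.

```python
# Required input ports per mode, as described in the sections above.
# The mode keys and function name are illustrative assumptions.
REQUIRED_PORTS = {
    "tts": {"input"},                         # script only
    "sts": {"performance"},                   # source audio or video
    "voice_cloning": {"input", "reference"},  # script plus voice sample
}

def missing_ports(mode: str, connected: set[str]) -> set[str]:
    """Return the required ports for `mode` that are not in `connected`."""
    try:
        required = REQUIRED_PORTS[mode]
    except KeyError:
        raise ValueError(f"unknown mode: {mode!r}") from None
    return required - connected
```

With only a Text Node connected, `missing_ports("voice_cloning", {"input"})` reports that the reference port still needs a voice sample.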

How to Use

Add an AI Voice Node to the canvas
For TTS or voice cloning, connect a Text Node (your script) to the input port
(Optional) Connect an Audio Node to the reference port for voice cloning
(Optional) Connect an Audio/Video Node to the performance port for STS
Select a preset voice, or keep the cloned voice if a reference is connected
Click Generate
Download the resulting audio file

Workflow Examples

Narrated Video: Text Node (“Welcome to our documentary about ocean life…”) → AI Voice Node (generates narration) + Text Node → Scene Node (generates matching visuals with audio enabled)

Video Dubbing: Scene Node (original video in English) → AI Voice Node (performance port — transforms speech to French)

Character Voice: Audio Node (10s sample of a voice) → AI Voice Node (reference port) + Text Node (character dialogue) → generates dialogue in the cloned voice

Tips

For voice cloning, provide 10-30 seconds of clean speech — no background noise, music, or multiple speakers
STS quality depends on the input audio quality: a clear, well-recorded source produces better results
Use TTS when you have a script, STS when you have existing audio to transform
For video dubbing, connect the video to the performance port — the AI matches lip movements
Keep scripts under 500 words per generation for best quality
Test with short samples before generating long narrations
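
The 500-word tip above can be applied before pasting text into a Text Node. The following is a minimal sketch that splits a long script into segments, preferring sentence boundaries. The function name `split_script` is hypothetical, and treating 500 as a hard upper limit is an assumption made for this example.

```python
import re

def split_script(script: str, max_words: int = 500) -> list[str]:
    """Split a script into segments of at most `max_words` words.

    Breaks fall at sentence ends (., !, ?) where possible. A single
    sentence longer than the limit is cut by word count instead.
    """
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    segments: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        words = sentence.split()
        # Oversized sentence: flush what we have, then cut by word count.
        while len(words) > max_words:
            if current:
                segments.append(" ".join(current))
                current = []
            segments.append(" ".join(words[:max_words]))
            words = words[max_words:]
        # Start a new segment if this sentence would overflow the current one.
        if current and len(current) + len(words) > max_words:
            segments.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        segments.append(" ".join(current))
    return segments
```

Each returned segment can then be generated separately, which also fits the advice to test with short samples first.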

Troubleshooting

Voice quality poor: Check that the reference audio sample is clean (no noise, single speaker, 10-30s).
Wrong language: Make sure the Language setting matches your script. Some voices may not support all languages.
Generation too slow: Long scripts take longer. Split into shorter segments if needed.
Audio clipping: Reduce the speed setting or break text into shorter paragraphs.
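
The 10-30s reference length mentioned under "Voice quality poor" can be checked before uploading a sample. The following is a minimal sketch using Python's standard `wave` module, so it assumes the sample is an uncompressed WAV file. The function names are hypothetical, and the check covers duration only, not noise or the number of speakers.

```python
import wave

def reference_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def reference_length_ok(path: str, min_s: float = 10.0, max_s: float = 30.0) -> bool:
    """Check that the sample length falls within the suggested 10-30 second range."""
    return min_s <= reference_duration_seconds(path) <= max_s
```

A sample that fails this check is worth trimming or re-recording before looking for other causes of poor voice quality.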