SOLUTIONS

TTS training data your vocoder will actually thank you for.

Text-to-speech models need clean, expressive, phonetically rich recordings from speakers who understand they are contributing voice training data. We collect exactly that — 48 kHz / 24-bit masters, full prosodic range, ARPABET-balanced — and we keep cloning rights on a separate licence.

48 kHz / 24-bit WAV · Word-level aligned transcripts · Verified consent · Commercial training rights
48 kHz / 24-bit · Studio masters
5–50 hrs · Per single-speaker voice
39 · ARPABET phonemes covered
−23 LUFS · EBU R128 broadcast loudness
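For a quick sanity check against a loudness target, a plain RMS level in dBFS is a rough first-pass proxy. True EBU R128 integrated loudness (LUFS) applies K-weighting and gating per ITU-R BS.1770, so a dedicated meter (for example the pyloudnorm library) is needed for the real −23 LUFS measurement; this sketch only shows the basic level arithmetic:

```python
import math

def rms_dbfs(samples):
    """Root-mean-square level in dBFS for float samples in [-1.0, 1.0].

    NOTE: rough level check only. EBU R128 loudness (LUFS) applies
    K-weighting and gating per ITU-R BS.1770; use a dedicated meter
    for the real -23 LUFS target.
    """
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms)

# Full-scale 1 kHz sine at 48 kHz: RMS is 1/sqrt(2), about -3.01 dBFS.
sine = [math.sin(2 * math.pi * 1000 * n / 48000) for n in range(48000)]
print(round(rms_dbfs(sine), 2))  # → -3.01
```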
§ 01 — What you get

Built for the work.

Studio-grade capture

48 kHz / 24-bit WAV in treated rooms on Shure SM7B, Rode NT1, or MKH 416 chains. Consistent gain, minimal post.

Expressive prosody

Natural phrasing, emotion, and pace — drawn from real shows, not flat read-aloud. Prosodic markers tagged.

Phonetic balance

Coverage across all 39 ARPABET phonemes, with optional balanced shards for low-resource locales and new voices.
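As an illustration of what phoneme-coverage tracking involves, the sketch below checks which of the 39 stress-stripped ARPABET phonemes a set of pronunciations exercises. The CMUdict-style phoneme-list input is a hypothetical format for illustration, not our delivery schema:

```python
# The 39 ARPABET phonemes (CMUdict inventory, stress digits stripped).
ARPABET = {
    "AA", "AE", "AH", "AO", "AW", "AY", "B", "CH", "D", "DH", "EH", "ER",
    "EY", "F", "G", "HH", "IH", "IY", "JH", "K", "L", "M", "N", "NG",
    "OW", "OY", "P", "R", "S", "SH", "T", "TH", "UH", "UW", "V", "W",
    "Y", "Z", "ZH",
}

def phoneme_coverage(pronunciations):
    """pronunciations: iterable of phoneme lists, e.g. [["HH", "AH0", "L", "OW1"]].
    Stress digits (AH0, AH1, ...) are stripped before counting."""
    seen = {p.rstrip("012") for pron in pronunciations for p in pron}
    return len(seen & ARPABET) / len(ARPABET)

sample = [["HH", "AH0", "L", "OW1"], ["W", "ER1", "L", "D"]]
print(f"{phoneme_coverage(sample):.1%}")  # → 17.9%
```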

Per-speaker volume

Hours per speaker are tracked and tunable for single-speaker model targets: 5 hrs for a basic neural voice, 20 hrs for an expressive voice, 50 hrs for a production voice.

Consent without cloning

Speakers consent to general TTS training. Cloning rights live on a separate Voice Cloning licence with named consent.

Custom voices

Recruit specific speakers through our creator network — accent, age, register, language, on spec.

§ 02 — Per-speaker spec

Sizing your voice.

Voice tier · Hours / speaker · Phoneme coverage · Best for
Clone seed · 5–10 min · Sparse · XTTS, OpenVoice, instant clone
Few-shot voice · 30 min – 2 hrs · Partial · StyleTTS 2 adaptation
Single-speaker neural · 5–20 hrs · Full ARPABET · VITS, FastSpeech 2 from scratch
Expressive voice · 20–50 hrs · Full + emotion sweeps · Multi-style narration, audiobook TTS
Production-grade · 50–100 hrs · Full + diphone balanced · Flagship assistant voices
Multi-speaker base · 2,400+ speakers, 350+ hrs total · Cross-speaker · VITS multi, XTTS, USM-style universal TTS
Vocoder-only training · 100 hrs of any speech · N/A · HiFi-GAN, BigVGAN, WaveGlow
§ 03 — Voice characteristics

What we capture beyond the words.

Speaker variety

2,400+ speakers across age, gender, register, dialect, and recording chain. Mix on demand.

Emotional range

Laughter, anger, excitement, sarcasm, whispers — natural expression from real conversations, not acted samples.

Pronunciation tagging

ARPABET and IPA lexicon coverage with proper nouns, brand names, and locale edge cases on request.

Prosody markers

Sentence stress, phrase boundaries, pitch accents, and pause types annotated for prosody-aware models.
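As a sketch of what prosody annotation can look like, here is an illustrative JSON fragment using ToBI-style pitch-accent labels. The field names and label set are assumptions for illustration, not the exact delivery schema:

```json
{
  "utterance": "We ship on Friday, right?",
  "words": [
    {"w": "We",     "stress": false, "pitch_accent": null},
    {"w": "ship",   "stress": true,  "pitch_accent": "H*"},
    {"w": "on",     "stress": false, "pitch_accent": null},
    {"w": "Friday", "stress": true,  "pitch_accent": "L+H*",
     "boundary": "intermediate-phrase", "pause_after_ms": 220},
    {"w": "right",  "stress": true,  "pitch_accent": "L*",
     "boundary": "intonational-phrase", "pause_after_ms": 0}
  ]
}
```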

Multi-lingual reads

Code-switching captured natively — speakers reading EN-US/ES-LATAM, EN-GB/Hindi, etc., in the same session.

Custom voice commission

We recruit, record, and deliver in 6–8 weeks. Voice Cloning licence available with named consent on every speaker.

§ 04 — How engagement works

From email to first manifest.

01

Sample request

Tell us the model and voice tier. We return a 30-minute representative sample with audio, alignment, and phoneme stats within 48 hours.

02

Mutual NDA

Standard one-page mutual. Most labs sign within a day.

03

MSA + data licence

Perpetual commercial training licence, named contact for life, written speaker release on every voice.

04

First delivery

Pilot shard with 48 kHz / 24-bit WAV, transcripts, ARPABET alignment, prosody tags, and consent receipts.

05

Manifest & provenance

Per-file lineage: speaker ID, mic chain, room, consent version, jurisdiction, SHA-256. EU AI Act ready.

06

Ongoing delivery

Monthly increments, custom voice commissions, locale expansion, written revocation SLA.
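The per-file lineage record from step 05 can be pictured as a small JSON object carrying a content hash alongside consent fields. The field names below are illustrative assumptions, not the exact manifest schema:

```python
import hashlib
import json

def manifest_entry(path, audio_bytes, speaker_id, consent_version):
    # Illustrative provenance record: content-addressed by SHA-256 so a
    # delivered file can be verified against its manifest line.
    return {
        "path": path,
        "sha256": hashlib.sha256(audio_bytes).hexdigest(),
        "speaker_id": speaker_id,
        "consent_version": consent_version,
    }

entry = manifest_entry(
    "shard0/spk_0042_0001.wav",  # hypothetical path
    b"RIFF...wav bytes here",    # placeholder for real file contents
    "spk_0042",
    "v3",
)
print(json.dumps(entry, indent=2))
```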

§ 05 — FAQ

Common questions.

What is TTS training data?

Recorded speech paired with transcripts used to train text-to-speech models. Audio quality, prosodic variety, phonetic balance, and per-speaker volume drive synthesis quality far more than raw hour count.
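Word-level alignment is what makes that audio-transcript pairing usable for training: each word carries start and end times that map to sample offsets at the audio rate. A minimal sketch, assuming a hypothetical JSON timing format (the real schema may differ):

```python
import json

# Hypothetical word-level alignment (times in seconds); illustrates only
# the time-to-sample arithmetic at a 48 kHz rate.
alignment_json = """
[
  {"word": "hello", "start": 0.12, "end": 0.48},
  {"word": "world", "start": 0.55, "end": 1.02}
]
"""

SAMPLE_RATE = 48_000

def to_sample_spans(entries, rate=SAMPLE_RATE):
    """Convert second-based word timings to integer sample offsets."""
    return [
        (e["word"], round(e["start"] * rate), round(e["end"] * rate))
        for e in entries
    ]

spans = to_sample_spans(json.loads(alignment_json))
print(spans)  # → [('hello', 5760, 23040), ('world', 26400, 48960)]
```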

Which TTS architectures is this for?

Tacotron 2, FastSpeech 2, VITS, StyleTTS 2, XTTS, Glow-TTS, and any neural vocoder pipeline (HiFi-GAN, BigVGAN, WaveGlow). We deliver 22.05 kHz, 24 kHz, or 48 kHz mastered to your training spec.
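Delivering at 22.05, 24, or 48 kHz implies rate conversion somewhere in the chain. The sketch below is a naive linear-interpolation resampler that shows the arithmetic only; a real mastering chain would use a polyphase or windowed-sinc filter to avoid aliasing:

```python
def resample_linear(samples, src_rate, dst_rate):
    """Naive linear-interpolation resampler (illustrative, not anti-aliased)."""
    if not samples:
        return []
    ratio = src_rate / dst_rate
    out_len = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(out_len):
        pos = i * ratio
        j = int(pos)
        frac = pos - j
        nxt = samples[min(j + 1, len(samples) - 1)]
        out.append(samples[j] * (1 - frac) + nxt * frac)
    return out

# One second at 48 kHz halves to one second at 24 kHz.
one_second_48k = [0.0] * 48_000
print(len(resample_linear(one_second_48k, 48_000, 24_000)))  # → 24000
```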

How much audio per speaker?

Most catalogue speakers offer 5–50+ hours. Single-speaker neural TTS typically wants 5–20 hours of clean studio audio. Expressive multi-style voices want 20–50 hours. Few-shot adaptation can work with 30 minutes.

Is the data phonetically balanced?

Catalogue data is naturally balanced across all 39 ARPABET phonemes from real conversational range. We also commission phonetically balanced custom recordings with diphone and triphone coverage tracking.
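Diphone coverage tracking reduces to counting distinct adjacent phoneme pairs across a shard. A minimal sketch over hypothetical stress-stripped pronunciations:

```python
def diphones(pron):
    """pron: list of stress-stripped ARPABET phonemes for one utterance."""
    return {(a, b) for a, b in zip(pron, pron[1:])}

def diphone_inventory(utterances):
    """Union of diphones seen across a shard's utterances."""
    seen = set()
    for pron in utterances:
        seen |= diphones(pron)
    return seen

utts = [["HH", "AH", "L", "OW"], ["L", "OW", "K", "AH", "L"]]
inv = diphone_inventory(utts)
print(len(inv))  # → 5  (L-OW appears in both utterances, counted once)
```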

What about emotional and prosodic range?

Real podcast audio is naturally expressive — laughter, anger, excitement, sarcasm, whispers. We tag prosodic markers and can deliver emotion-balanced shards for StyleTTS, XTTS, and VITS-style models.

Do you provide a pronunciation lexicon?

Yes. ARPABET and IPA lexicon coverage on request, including proper nouns, brand names, and locale-specific edge cases.

Does the licence allow voice cloning?

Standard TTS catalogue data is licensed for general TTS training, not for cloning a specific speaker. Voice cloning requires our separate Voice Cloning licence with explicit named consent — see /solutions/voice-ai-training-data.html.

Can I commission a custom voice?

Yes. We recruit and record specific speakers through our creator network in 6–8 weeks, with full Voice Cloning licence in writing if cloning rights are needed.

Can I get a sample?

Yes. Email partnerships@aipodcast.io and we will send a 30-minute representative sample with audio, transcripts, phonetic alignment, and metadata within 48 hours of NDA.

Want a representative sample?

30 minutes of audio + transcripts + metadata, delivered within 48 hours of NDA.