TTS training data your vocoder will actually thank you for.
Text-to-speech models need clean, expressive, phonetically rich recordings from speakers who understand they are contributing voice training data. We collect exactly that — 48 kHz / 24-bit masters, full prosodic range, ARPABET-balanced — and we keep cloning rights on a separate licence.
Built for the work.
Studio-grade capture
48 kHz / 24-bit WAV in treated rooms on Shure SM7B, Rode NT1, or MKH 416 chains. Consistent gain, minimal post.
Expressive prosody
Natural phrasing, emotion, and pace — drawn from real shows, not flat read-aloud. Prosodic markers tagged.
Phonetic balance
Coverage across all 39 ARPABET phonemes, with optional balanced shards for low-resource locales and new voices.
Per-speaker volume
Hours per speaker are tracked and tunable for single-speaker model targets: 5 hrs for a cloned voice, 20 hrs for an expressive voice, 50 hrs for a production voice.
Consent without cloning
Speakers consent to general TTS training. Cloning rights live on a separate Voice Cloning licence with named consent.
Custom voices
Recruit specific speakers through our creator network — accent, age, register, language, on spec.
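For teams that want to verify the capture spec on arrival, here is a minimal sketch of a format check using only Python's standard `wave` module. The function name `check_master` and the constants are illustrative assumptions, not part of any delivery tooling. It assumes plain PCM WAV files; masters written with the WAVE_FORMAT_EXTENSIBLE header may not open with `wave` on Python versions before 3.12.

```python
import wave

EXPECTED_RATE_HZ = 48_000   # 48 kHz, per the capture spec above
EXPECTED_SAMPLE_WIDTH = 3   # 24-bit PCM is 3 bytes per sample


def check_master(path):
    """Return a list of problems found in a WAV master (empty if none).

    Hypothetical helper: checks sample rate, bit depth, and non-empty audio.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        if rate != EXPECTED_RATE_HZ:
            problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE_HZ} Hz")
        if width != EXPECTED_SAMPLE_WIDTH:
            problems.append(f"bit depth is {width * 8}-bit, expected 24-bit")
        if wav.getnframes() == 0:
            problems.append("file contains no audio frames")
    return problems
```

Running `check_master` over every file in a pilot shard and collecting the non-empty results gives a quick acceptance report before training starts.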
What we capture beyond the words.
Speaker variety
2,400+ speakers across age, gender, register, dialect, and recording chain. Mix on demand.
Emotional range
Laughter, anger, excitement, sarcasm, whispers — natural expression from real conversations, not acted samples.
Pronunciation tagging
ARPABET and IPA lexicon coverage with proper nouns, brand names, and locale edge cases on request.
Prosody markers
Sentence stress, phrase boundaries, pitch accents, and pause types annotated for prosody-aware models.
Multilingual reads
Code-switching captured natively: speakers read EN-US/ES-LATAM, EN-GB/HI-IN, and other language pairs in the same session.
Custom voice commission
We recruit, record, and deliver in 6–8 weeks. Voice Cloning licence available with named consent on every speaker.
From email to first manifest.
Sample request
Tell us the model and voice tier. We return a 30-minute representative sample with audio, alignment, and phoneme stats within 48 hours.
Mutual NDA
Standard one-page mutual. Most labs sign within a day.
MSA + data licence
Perpetual commercial training licence, named contact for life, written speaker release on every voice.
First delivery
Pilot shard with 48 kHz / 24-bit WAV, transcripts, ARPABET alignment, prosody tags, and consent receipts.
Manifest & provenance
Per-file lineage: speaker ID, mic chain, room, consent version, jurisdiction, SHA-256. EU AI Act ready.
Ongoing delivery
Monthly increments, custom voice commissions, locale expansion, written revocation SLA.
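To illustrate the per-file lineage described under "Manifest & provenance", here is a rough sketch of what one manifest record and its integrity check could look like. The `ManifestEntry` class, its field names, and the helper functions are hypothetical, chosen to mirror the lineage items listed above; the actual manifest schema may differ.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class ManifestEntry:
    # Illustrative field names mirroring the lineage items listed above.
    file_name: str
    speaker_id: str
    mic_chain: str
    room: str
    consent_version: str
    jurisdiction: str
    sha256: str


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large WAV masters never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(entry, path):
    """Return True if the file on disk matches the hash recorded in the manifest."""
    return sha256_of(path) == entry.sha256


def to_json_line(entry):
    """Serialise one record as a JSON line (one possible manifest layout)."""
    return json.dumps(asdict(entry), sort_keys=True)
```

Checking every delivered file with `verify` before ingestion confirms that the audio being trained on is the audio the consent record refers to.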
Common questions.
What is TTS training data?
Recorded speech paired with transcripts used to train text-to-speech models. Audio quality, prosodic variety, phonetic balance, and per-speaker volume drive synthesis quality far more than raw hour count.
Which TTS architectures is this for?
Tacotron 2, FastSpeech 2, VITS, StyleTTS 2, XTTS, Glow-TTS, and any neural vocoder pipeline (HiFi-GAN, BigVGAN, WaveGlow). We deliver at 22.05 kHz, 24 kHz, or 48 kHz, mastered to your training spec.
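If you take 48 kHz masters and downsample in-house, the conversion to each listed rate is a simple rational ratio. The sketch below computes it; `resample_ratio` is a hypothetical helper name, and the choice of resampler is left to your pipeline.

```python
from fractions import Fraction

MASTER_RATE_HZ = 48_000  # rate of the masters described above


def resample_ratio(target_rate_hz):
    """Return (up, down) in lowest terms so that target = master * up / down."""
    ratio = Fraction(target_rate_hz, MASTER_RATE_HZ)
    return ratio.numerator, ratio.denominator


for rate in (22_050, 24_000, 48_000):
    up, down = resample_ratio(rate)
    print(f"{rate} Hz: up={up}, down={down}")
# 22050 Hz: up=147, down=320
# 24000 Hz: up=1, down=2
# 48000 Hz: up=1, down=1
```

The `(up, down)` pair is the form a polyphase resampler such as SciPy's `scipy.signal.resample_poly(x, up, down)` expects.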
How much audio per speaker?
Most catalogue speakers offer 5–50+ hours. Single-speaker neural TTS typically wants 5–20 hours of clean studio audio. Expressive multi-style voices want 20–50. Few-shot adaptation can work with 30 minutes.
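As a rough way to apply these ranges, the sketch below totals hours per speaker from clip durations and maps each total to the use cases it could plausibly support. The thresholds are the lower bounds quoted in the answer above; the function names and tier labels are illustrative assumptions, not a fixed product taxonomy.

```python
from collections import defaultdict


def hours_per_speaker(clips):
    """Sum clip durations per speaker.

    `clips` is an iterable of (speaker_id, duration_seconds) pairs.
    """
    totals = defaultdict(float)
    for speaker_id, seconds in clips:
        totals[speaker_id] += seconds / 3600.0
    return dict(totals)


def voice_tiers(hours):
    """Return the use cases a speaker's clean-audio hours could plausibly support."""
    tiers = []
    if hours >= 0.5:   # few-shot adaptation can work with 30 minutes
        tiers.append("few-shot adaptation")
    if hours >= 5:     # single-speaker neural TTS typically wants 5-20 hours
        tiers.append("single-speaker TTS")
    if hours >= 20:    # expressive multi-style voices want 20-50 hours
        tiers.append("expressive multi-style")
    return tiers
```

For example, a speaker with 6 hours of clean audio would come back as suitable for few-shot adaptation and single-speaker TTS, but not yet for an expressive multi-style voice.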
Is the data phonetically balanced?
Catalogue data covers all 39 ARPABET phonemes, with the natural balance of real conversational speech. We also commission phonetically balanced custom recordings with diphone and triphone coverage tracking.
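A minimal sketch of how phoneme and diphone coverage could be measured from ARPABET alignments follows. It assumes the 39-phoneme set used by CMUdict and that each utterance arrives as a list of ARPABET symbols, optionally carrying stress digits; the `coverage` function is a hypothetical helper, not our internal tooling.

```python
from collections import Counter

# The 39-phoneme ARPABET set as used by CMUdict (stress digits removed).
ARPABET_39 = frozenset("""
AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG
OW OY P R S SH T TH UH UW V W Y Z ZH
""".split())


def strip_stress(symbol):
    """Drop a trailing stress digit, e.g. 'AH0' becomes 'AH'."""
    return symbol.rstrip("012")


def coverage(utterances):
    """Count phonemes and within-utterance diphones; list phonemes never seen."""
    phones = Counter()
    diphones = Counter()
    for utterance in utterances:
        seq = [strip_stress(p) for p in utterance]
        phones.update(seq)
        diphones.update(zip(seq, seq[1:]))  # adjacent pairs only
    missing = sorted(ARPABET_39 - set(phones))
    return phones, diphones, missing
```

An empty `missing` list means every phoneme appears at least once; the `phones` and `diphones` counters show how evenly the shard is distributed.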
What about emotional and prosodic range?
Real podcast audio is naturally expressive — laughter, anger, excitement, sarcasm, whispers. We tag prosodic markers and can deliver emotion-balanced shards for StyleTTS, XTTS, and VITS-style models.
Do you provide a pronunciation lexicon?
Yes. ARPABET and IPA lexicon coverage on request, including proper nouns, brand names, and locale-specific edge cases.
Does the licence allow voice cloning?
Standard TTS catalogue data is licensed for general TTS training, not for cloning a specific speaker. Voice cloning requires our separate Voice Cloning licence with explicit named consent — see /solutions/voice-ai-training-data.html.
Can I commission a custom voice?
Yes. We recruit and record specific speakers through our creator network in 6–8 weeks, with a full Voice Cloning licence in writing if cloning rights are needed.
Can I get a sample?
Yes. Email partnerships@aipodcast.io and we will send a 30-minute representative sample with audio, transcripts, phonetic alignment, and metadata within 48 hours of NDA.
Want a representative sample?
30 minutes of audio + transcripts + metadata, delivered within 48 hours of NDA.