SOLUTIONS

TTS training data your vocoder will actually thank you for.

Text-to-speech models need clean, expressive, phonetically rich recordings from speakers who understand they are contributing voice training data. We collect exactly that — 48 kHz / 24-bit masters, full prosodic range, ARPABET-balanced — and we keep cloning rights on a separate licence.

48 kHz / 24-bit WAV · Word-level aligned transcripts · Verified consent · Commercial training rights
48 kHz / 24-bit · Studio masters
5–50 hrs · Per single-speaker voice
39 · ARPABET phonemes covered
−23 LUFS · EBU R128 broadcast loudness
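For a quick sanity check against a loudness target, a plain RMS level in dBFS is a rough first-pass proxy. True EBU R128 integrated loudness (LUFS) applies K-weighting and gating per ITU-R BS.1770, so a dedicated meter (for example the pyloudnorm library) is needed for the real −23 LUFS measurement; this sketch only shows the basic level arithmetic:

```python
import math

def rms_dbfs(samples):
    """Root-mean-square level in dBFS for float samples in [-1.0, 1.0].

    NOTE: rough level check only. EBU R128 loudness (LUFS) applies
    K-weighting and gating per ITU-R BS.1770; use a dedicated meter
    for the real -23 LUFS target.
    """
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms)

# Full-scale 1 kHz sine at 48 kHz: RMS is 1/sqrt(2), about -3.01 dBFS.
sine = [math.sin(2 * math.pi * 1000 * n / 48000) for n in range(48000)]
print(round(rms_dbfs(sine), 2))  # → -3.01
```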
§ 01 — What you get

Built for the work.

Studio-grade capture

48 kHz / 24-bit WAV in treated rooms on Shure SM7B, Rode NT1, or MKH 416 chains. Consistent gain, minimal post.

Expressive prosody

Natural phrasing, emotion, and pace — drawn from real shows, not flat read-aloud. Prosodic markers tagged.

Phonetic balance

Coverage across all 39 ARPABET phonemes, with optional balanced shards for low-resource locales and new voices.
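As an illustration of what phoneme-coverage tracking involves, the sketch below checks which of the 39 stress-stripped ARPABET phonemes a set of pronunciations exercises. The CMUdict-style phoneme-list input is a hypothetical format for illustration, not our delivery schema:

```python
# The 39 ARPABET phonemes (CMUdict inventory, stress digits stripped).
ARPABET = {
    "AA", "AE", "AH", "AO", "AW", "AY", "B", "CH", "D", "DH", "EH", "ER",
    "EY", "F", "G", "HH", "IH", "IY", "JH", "K", "L", "M", "N", "NG",
    "OW", "OY", "P", "R", "S", "SH", "T", "TH", "UH", "UW", "V", "W",
    "Y", "Z", "ZH",
}

def phoneme_coverage(pronunciations):
    """pronunciations: iterable of phoneme lists, e.g. [["HH", "AH0", "L", "OW1"]].
    Stress digits (AH0, AH1, ...) are stripped before counting."""
    seen = {p.rstrip("012") for pron in pronunciations for p in pron}
    return len(seen & ARPABET) / len(ARPABET)

sample = [["HH", "AH0", "L", "OW1"], ["W", "ER1", "L", "D"]]
print(f"{phoneme_coverage(sample):.1%}")  # → 17.9%
```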

Per-speaker volume

Hours per speaker are tracked and tunable for single-speaker model targets: 5 hrs for a basic neural voice, 20 hrs for an expressive voice, 50 hrs for a production voice.

Consent without cloning

Speakers consent to general TTS training. Cloning rights live on a separate Voice Cloning licence with named consent.

Custom voices

Recruit specific speakers through our creator network — accent, age, register, language, on spec.

§ 02 — Per-speaker spec

Sizing your voice.

Voice tier · Hours / speaker · Phoneme coverage · Best for
Clone seed · 5–10 min · Sparse · XTTS, OpenVoice, instant clone
Few-shot voice · 30 min – 2 hrs · Partial · StyleTTS 2 adaptation
Single-speaker neural · 5–20 hrs · Full ARPABET · VITS, FastSpeech 2 from scratch
Expressive voice · 20–50 hrs · Full + emotion sweeps · Multi-style narration, audiobook TTS
Production-grade · 50–100 hrs · Full + diphone balanced · Flagship assistant voices
Multi-speaker base · 2,400+ speakers, 350+ hrs total · Cross-speaker · VITS multi, XTTS, USM-style universal TTS
Vocoder-only training · 100 hrs of any speech · N/A · HiFi-GAN, BigVGAN, WaveGlow
§ 03 — Voice characteristics

What we capture beyond the words.

Speaker variety

2,400+ speakers across age, gender, register, dialect, and recording chain. Mix on demand.

Emotional range

Laughter, anger, excitement, sarcasm, whispers — natural expression from real conversations, not acted samples.

Pronunciation tagging

ARPABET and IPA lexicon coverage with proper nouns, brand names, and locale edge cases on request.

Prosody markers

Sentence stress, phrase boundaries, pitch accents, and pause types annotated for prosody-aware models.
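As a sketch of what prosody annotation can look like, here is an illustrative JSON fragment using ToBI-style pitch-accent labels. The field names and label set are assumptions for illustration, not the exact delivery schema:

```json
{
  "utterance": "We ship on Friday, right?",
  "words": [
    {"w": "We",     "stress": false, "pitch_accent": null},
    {"w": "ship",   "stress": true,  "pitch_accent": "H*"},
    {"w": "on",     "stress": false, "pitch_accent": null},
    {"w": "Friday", "stress": true,  "pitch_accent": "L+H*",
     "boundary": "intermediate-phrase", "pause_after_ms": 220},
    {"w": "right",  "stress": true,  "pitch_accent": "L*",
     "boundary": "intonational-phrase", "pause_after_ms": 0}
  ]
}
```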

Multi-lingual reads

Code-switching captured natively — speakers reading EN-US/ES-LATAM, EN-GB/Hindi, etc., in the same session.

Custom voice commission

We recruit, record, and deliver in 6–8 weeks. Voice Cloning licence available with named consent on every speaker.

§ 04 — How engagement works

From email to first manifest.

01

Sample request

Tell us the model and voice tier. We return a 30-minute representative sample with audio, alignment, and phoneme stats within 48 hours.

02

Mutual NDA

Standard one-page mutual. Most labs sign within a day.

03

MSA + data licence

Perpetual commercial training licence, named contact for life, written speaker release on every voice.

04

First delivery

Pilot shard with 48 kHz / 24-bit WAV, transcripts, ARPABET alignment, prosody tags, and consent receipts.

05

Manifest & provenance

Per-file lineage: speaker ID, mic chain, room, consent version, jurisdiction, SHA-256. EU AI Act ready.

06

Ongoing delivery

Monthly increments, custom voice commissions, locale expansion, written revocation SLA.
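The per-file lineage record from step 05 can be pictured as a small JSON object carrying a content hash alongside consent fields. The field names below are illustrative assumptions, not the exact manifest schema:

```python
import hashlib
import json

def manifest_entry(path, audio_bytes, speaker_id, consent_version):
    # Illustrative provenance record: content-addressed by SHA-256 so a
    # delivered file can be verified against its manifest line.
    return {
        "path": path,
        "sha256": hashlib.sha256(audio_bytes).hexdigest(),
        "speaker_id": speaker_id,
        "consent_version": consent_version,
    }

entry = manifest_entry(
    "shard0/spk_0042_0001.wav",  # hypothetical path
    b"RIFF...wav bytes here",    # placeholder for real file contents
    "spk_0042",
    "v3",
)
print(json.dumps(entry, indent=2))
```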

§ 05 — FAQ

Common questions.

What is TTS training data?

Recorded speech paired with transcripts used to train text-to-speech models. Audio quality, prosodic variety, phonetic balance, and per-speaker volume drive synthesis quality far more than raw hour count.
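Word-level alignment is what makes that audio-transcript pairing usable for training: each word carries start and end times that map to sample offsets at the audio rate. A minimal sketch, assuming a hypothetical JSON timing format (the real schema may differ):

```python
import json

# Hypothetical word-level alignment (times in seconds); illustrates only
# the time-to-sample arithmetic at a 48 kHz rate.
alignment_json = """
[
  {"word": "hello", "start": 0.12, "end": 0.48},
  {"word": "world", "start": 0.55, "end": 1.02}
]
"""

SAMPLE_RATE = 48_000

def to_sample_spans(entries, rate=SAMPLE_RATE):
    """Convert second-based word timings to integer sample offsets."""
    return [
        (e["word"], round(e["start"] * rate), round(e["end"] * rate))
        for e in entries
    ]

spans = to_sample_spans(json.loads(alignment_json))
print(spans)  # → [('hello', 5760, 23040), ('world', 26400, 48960)]
```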

Which TTS architectures is this for?

Tacotron 2, FastSpeech 2, VITS, StyleTTS 2, XTTS, Glow-TTS, and any neural vocoder pipeline (HiFi-GAN, BigVGAN, WaveGlow). We deliver 22.05 kHz, 24 kHz, or 48 kHz mastered to your training spec.
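Delivering at 22.05, 24, or 48 kHz implies rate conversion somewhere in the chain. The sketch below is a naive linear-interpolation resampler that shows the arithmetic only; a real mastering chain would use a polyphase or windowed-sinc filter to avoid aliasing:

```python
def resample_linear(samples, src_rate, dst_rate):
    """Naive linear-interpolation resampler (illustrative, not anti-aliased)."""
    if not samples:
        return []
    ratio = src_rate / dst_rate
    out_len = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(out_len):
        pos = i * ratio
        j = int(pos)
        frac = pos - j
        nxt = samples[min(j + 1, len(samples) - 1)]
        out.append(samples[j] * (1 - frac) + nxt * frac)
    return out

# One second at 48 kHz halves to one second at 24 kHz.
one_second_48k = [0.0] * 48_000
print(len(resample_linear(one_second_48k, 48_000, 24_000)))  # → 24000
```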

How much audio per speaker?

Most catalogue speakers offer 5–50+ hours. Single-speaker neural TTS typically wants 5–20 hours of clean studio audio. Expressive multi-style voices want 20–50 hours. Few-shot adaptation can work with 30 minutes.

Is the data phonetically balanced?

Catalogue data is naturally balanced across all 39 ARPABET phonemes from real conversational range. We also commission phonetically balanced custom recordings with diphone and triphone coverage tracking.
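Diphone coverage tracking reduces to counting distinct adjacent phoneme pairs across a shard. A minimal sketch over hypothetical stress-stripped pronunciations:

```python
def diphones(pron):
    """pron: list of stress-stripped ARPABET phonemes for one utterance."""
    return {(a, b) for a, b in zip(pron, pron[1:])}

def diphone_inventory(utterances):
    """Union of diphones seen across a shard's utterances."""
    seen = set()
    for pron in utterances:
        seen |= diphones(pron)
    return seen

utts = [["HH", "AH", "L", "OW"], ["L", "OW", "K", "AH", "L"]]
inv = diphone_inventory(utts)
print(len(inv))  # → 5  (L-OW appears in both utterances, counted once)
```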

What about emotional and prosodic range?

Real podcast audio is naturally expressive — laughter, anger, excitement, sarcasm, whispers. We tag prosodic markers and can deliver emotion-balanced shards for StyleTTS, XTTS, and VITS-style models.

Do you provide a pronunciation lexicon?

Yes. ARPABET and IPA lexicon coverage on request, including proper nouns, brand names, and locale-specific edge cases.

Does the licence allow voice cloning?

Standard TTS catalogue data is licensed for general TTS training, not for cloning a specific speaker. Voice cloning requires our separate Voice Cloning licence with explicit named consent — see /solutions/voice-ai-training-data.html.

Can I commission a custom voice?

Yes. We recruit and record specific speakers through our creator network in 6–8 weeks, with full Voice Cloning licence in writing if cloning rights are needed.

Can I get a sample?

Yes. Email partnerships@aipodcast.io and we will send a 30-minute representative sample with audio, transcripts, phonetic alignment, and metadata within 48 hours of NDA.

Want a representative sample?

30 minutes of audio + transcripts + metadata, delivered within 48 hours of NDA.