SOLUTIONS

ASR training data that actually transcribes the room.

Automatic speech recognition models live or die on transcript quality and acoustic diversity. We deliver both — multi-speaker conversational audio with word-aligned transcripts, real-room acoustics, and rich metadata, all consented for training.

48 kHz / 24-bit WAV · Word-level aligned transcripts · Verified consent · Commercial training rights
Conversational audio sourced through our studio network
Written speaker consent on every recording
Word-aligned transcripts on request
48 kHz / 24-bit WAV masters
§ 01 — What you get

Built for the work.

Word-level alignment

Sub-100ms time-aligned transcripts in JSON, CTM, TextGrid, SRT or VTT. Casing and punctuation preserved.

Speaker diarization

Clean speaker turns with stable per-speaker IDs across long-form interviews, panels, and round-tables. RTTM included.

Acoustic diversity

Shure SM7B, Rode NT1, MKH 416, Zoom H6, AirPods, and PSTN call legs. Treated rooms and bad bedrooms in the same shard.

Long-form context

Hours of continuous conversation, not 10-second clips — the gap most ASR datasets leave wide open.

Domain coverage

Tech, business, health, finance, entertainment, sports, politics, science. Specify the mix you need by hours.

Transcription QA

Machine-aligned by default, human-reviewed on request.

§ 02 — By model architecture

How much do you actually need?

Model · Recommended hours · Format we ship · Notes
Whisper-large fine-tune · 50–200 hrs in-domain · 16 kHz mono FLAC + JSONL · Word timestamps, language tag, prompt field
Whisper from scratch · 680k+ hrs (don’t) · Manifest only · Use the open weights, fine-tune instead
Conformer / NeMo · 300–1,000 hrs · 16 kHz WAV + NeMo manifest · Char and BPE tokenizer ready
USM-style universal · 2,000+ hrs across locales · 48 kHz WAV + parquet · Multilingual sharding by locale
wav2vec2 / HuBERT · 500+ hrs unlabeled + 10–100 hrs labeled · 16 kHz WAV + TSV · Self-supervised pretrain split included
Streaming / RNN-T · 200–500 hrs low-latency · 16 kHz WAV + force-aligned CTM · Chunked, no future context leakage
Diarization (pyannote) · 100+ hrs multi-speaker · RTTM + 48 kHz WAV · Overlapped speech labeled
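
The Conformer / NeMo row above ships a JSON-lines manifest. As a minimal sketch (file names and utterance text are illustrative), NeMo's ASR recipes expect one JSON object per line with at least an audio path, duration in seconds, and transcript:

{"audio_filepath": "shard_001/ep014_seg_0042.wav", "duration": 11.73, "text": "so the latency budget is really the whole story here"}
{"audio_filepath": "shard_001/ep014_seg_0043.wav", "duration": 8.21, "text": "right and that is before you add the decoder"}

Because each line is independently parseable, training pipelines can shard and shuffle the manifest without loading it whole.
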
§ 03 — Use cases

Where this data ends up in production.

Call-center transcription

Multi-speaker, narrowband-friendly audio for agent-customer ASR with disfluencies and crosstalk preserved.

Meeting notes & summaries

Long-form panel and round-table audio that mirrors Zoom, Meet, and in-room conference acoustics.

Podcast indexing

Episode-length audio with chapter-aware transcripts — train search and recommendation against the medium itself.

Video captioning

Broadcast-loudness audio normalized to EBU R128, ready for caption pipelines and live-event ASR (see the loudness note after these use cases).

Accessibility

WCAG-grade caption training data with named-entity tags, speaker labels, and non-speech event annotation.

Voice agents

Conversational turn-taking and interruption data for full-duplex agent stacks built on Whisper or USM.
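
A note on the EBU R128 loudness mentioned under video captioning: the standard targets an integrated loudness of -23 LUFS, and if you need to re-normalize audio on your side, ffmpeg's loudnorm filter is one common way to do it, e.g. ffmpeg -i in.wav -af loudnorm=I=-23:TP=-2:LRA=7 out.wav (single-pass shown; a two-pass run with measured values is more accurate).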

§ 04 — How engagement works

From email to first manifest.

01

Sample request

Tell us the model, target locales, and hours. We return a 30-minute representative sample with audio, alignment, and diarization within 48 hours.

02

Mutual NDA

Standard one-page mutual NDA. Quick to review.

03

MSA + data licence

Licence terms negotiated per project, a named contact for the life of the engagement, and a written speaker release on every voice in the shard.

04

First delivery

Pilot shard (typically 10–25 hrs) with full manifest, written speaker releases, alignment, diarization, and consent receipts.

05

Manifest & provenance

Per-recording details: speaker name, recording date, and consent version. Audit-ready out of the box.

06

Ongoing delivery

Monthly increments, locale expansion, and a named human on Slack, not a ticket queue.

§ 05 — FAQ

Common questions.

What is ASR training data?

Paired speech audio and text transcripts used to train automatic speech recognition models. Quality, diversity, and alignment accuracy determine downstream WER far more than raw hour count.

Which models is this data designed for?

Whisper-large fine-tunes, Conformer, USM, NVIDIA NeMo, wav2vec2, HuBERT, and any encoder-decoder ASR architecture. We deliver in the manifest formats those training pipelines expect — JSONL for Whisper, NeMo manifest for NeMo, TSV for fairseq.
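
As a rough sketch of those formats (keys, paths, and values below are illustrative, and the exact Whisper JSONL schema depends on your fine-tuning pipeline): a Whisper-style JSONL line pairs an audio path with its transcript and language tag,

{"audio": "calls/agent_0192.flac", "text": "Thanks for calling, how can I help you today?", "language": "en"}

while the fairseq-style TSV used for wav2vec2 / HuBERT pretraining lists a root directory on the first line, then one relative path and sample count per line:

/data/asr_shard_07
podcasts/ep_0331.wav	28819200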

How many hours do I need to fine-tune Whisper?

For domain adaptation, 50–200 hours of in-domain audio is the typical sweet spot. For a new low-resource locale from scratch, plan on 500+ hours. We will help you size it on the sample call.

Do you provide diarization?

Yes. Every multi-speaker file ships with stable per-speaker IDs in RTTM and JSON, including overlap regions. You can drop it straight into pyannote training.
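
For reference, RTTM is a plain-text format with one SPEAKER record per segment: file ID, channel, onset and duration in seconds, then the speaker label. Two illustrative lines (file and speaker IDs are made up), the second overlapping the tail of the first:

SPEAKER ep014 1 84.310 6.220 <NA> <NA> spk_02 <NA> <NA>
SPEAKER ep014 1 88.970 2.840 <NA> <NA> spk_01 <NA> <NA>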

What transcript format do you use?

Word-level JSON with start/end timestamps, casing, and punctuation by default. CTM, TextGrid, SRT, VTT, and custom formats on request. Transcripts include disfluencies — we do not silently clean them out.
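
As an illustration only (field names here are representative, not a fixed schema), a single word entry in that JSON might look like:

{"word": "Budget,", "start": 12.84, "end": 13.11, "speaker": "spk_02"}

with casing and punctuation carried on the word itself rather than stripped out.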

How accurate are the transcripts?

Machine-aligned with sub-100ms word-boundary accuracy on clean audio. Human review of benchmark sets is available on request.

What about long-form context and real-room acoustics?

This is the gap most ASR datasets leave open. We deliberately ship 20–90 minute continuous conversations and mix studio, untreated room, remote-guest, and field recordings so the model does not collapse the moment it leaves a clean clip.

Is this data legally safe to train on?

Yes. Written consent from every speaker we record, licence terms negotiated per project, full per-speaker provenance, and a named human contact on every deal. We are the only supplier with that whole stack.

Can I get a sample?

Yes. Email partnerships@aipodcast.io and we will send a 30-minute representative sample with audio, alignment, diarization, and metadata after a quick scoping call.

Want a representative sample?

30 minutes of audio + transcripts + metadata, delivered after a quick scoping call.