Text to Speech and Voice

TTS pipeline: text normalisation → phoneme conversion → acoustic model → vocoder
Vocoders: WaveNet, WaveRNN, WaveGlow, HiFi-GAN, neural source-filter models
Acoustic models: Tacotron 1/2, FastSpeech 1/2 (non-autoregressive, duration prediction)
Modern TTS: VITS (end-to-end), VALL-E (codec language model), StyleTTS
Prosody modelling: pitch, duration, energy, style embeddings
Voice conversion: speaker embeddings, disentangled representations
Voice cloning: few-shot and zero-shot approaches
Voice activity detection (VAD) and acoustic activity detection
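
As a rough illustration of the last topic, voice activity detection in its simplest form can be sketched as a per-frame energy threshold. The function names (frame_energies, energy_vad), the 160-sample frame length (10 ms at an assumed 16 kHz sample rate), and the 0.1 threshold are all illustrative assumptions, not taken from any particular system. Practical detectors typically add smoothing and use statistical or neural models rather than a fixed threshold.

```python
import math


def frame_energies(samples, frame_len):
    """Return the RMS energy of each non-overlapping, full-length frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies


def energy_vad(samples, frame_len=160, threshold=0.1):
    """Mark a frame as active (True) when its RMS energy exceeds the threshold.

    frame_len and threshold are hypothetical defaults chosen for this toy example.
    """
    return [energy > threshold for energy in frame_energies(samples, frame_len)]


# Toy signal at an assumed 16 kHz sample rate:
# two silent frames, two frames of a 440 Hz tone, one silent frame.
sample_rate = 16000
silence = [0.0] * 160
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(320)]
signal = silence * 2 + tone + silence

decisions = energy_vad(signal)
print(decisions)  # [False, False, True, True, False]
```

A fixed threshold like this only works when the noise floor is known and stable; it is shown here because it makes the framing and decision steps easy to see.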

Maths, CS & AI Compendium