Text-to-Speech and Voice

  • TTS pipeline: text normalisation → grapheme-to-phoneme (G2P) conversion → acoustic model → vocoder
  • Vocoders: WaveNet, WaveRNN, WaveGlow, HiFi-GAN, neural source-filter models
  • Acoustic models: Tacotron 1/2 (autoregressive, attention-based), FastSpeech 1/2 (non-autoregressive, duration prediction)
  • Modern TTS: VITS (end-to-end), VALL-E (codec language model), StyleTTS
  • Prosody modelling: pitch, duration, energy, style embeddings
  • Voice conversion: speaker embeddings, disentangled representations
  • Voice cloning: few-shot and zero-shot approaches
  • Voice activity detection (VAD) and acoustic activity detection
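
To make the last topic concrete, here is a minimal sketch of energy-based voice activity detection, which is about the simplest VAD there is: split the signal into short frames, compute each frame's RMS energy, and mark frames above a threshold as active. The function names, the 10 ms frame length, the 16 kHz sample rate, and the 0.1 threshold are all illustrative assumptions, not taken from any particular system. Practical VADs usually rely on statistical or neural classifiers and add smoothing, since a fixed threshold is fragile in noise.

```python
import math

def frame_rms(samples, frame_len):
    """Split samples into non-overlapping frames and return each frame's RMS energy."""
    rms = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return rms

def energy_vad(samples, frame_len=160, threshold=0.1):
    """Return one boolean per frame: True where RMS energy exceeds the threshold."""
    return [r > threshold for r in frame_rms(samples, frame_len)]

def active_segments(flags):
    """Merge runs of active frames into (start_frame, end_frame_exclusive) pairs."""
    segments = []
    start = None
    for i, active in enumerate(flags):
        if active and start is None:
            start = i
        elif not active and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(flags)))
    return segments

# Synthetic test signal at an assumed 16 kHz sample rate:
# 50 ms of silence, 100 ms of a 440 Hz tone, 50 ms of silence.
sr = 16000
silence = [0.0] * 800
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(1600)]
signal = silence + tone + silence

flags = energy_vad(signal, frame_len=160, threshold=0.1)  # 10 ms frames
segments = active_segments(flags)
print(segments)  # → [(5, 15)]
```

With 10 ms frames, the result says that frames 5 to 14 (50 ms to 150 ms) are active, which matches where the tone was placed. A real signal would need a threshold chosen relative to the noise floor, plus some hangover time so that short pauses inside speech are not cut.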