Automatic Speech Recognition

  • ASR pipeline: audio → features → acoustic model → decoder → text
  • Traditional ASR: GMM-HMM, WFST decoding
  • End-to-end ASR: CTC loss and decoding, RNN-Transducer (RNN-T)
  • Attention-based ASR: Listen, Attend and Spell (LAS)
  • Modern architectures: Whisper, Conformer (convolution + attention), wav2vec 2.0 (self-supervised pretraining)
  • Language model integration: shallow fusion, deep fusion, rescoring
  • Streaming vs offline ASR: chunked attention, lookahead, latency constraints
  • Evaluation: word error rate (WER), character error rate (CER)
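
To make the evaluation item concrete, here is a minimal sketch of WER and CER computed as Levenshtein edit distance (substitutions + deletions + insertions) divided by the reference length. The function names (edit_distance, wer, cer) are illustrative and not taken from any particular toolkit. The sketch assumes whitespace tokenization for words, counts spaces as characters for CER, and applies no text normalization (casing, punctuation), which real evaluations usually do first.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis is i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (token in ref missing from hyp)
                curr[j - 1] + 1,     # insertion (extra token in hyp)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length.

    Assumption: spaces count as characters; some setups strip them first.
    """
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    print(f"WER = {wer(ref, hyp):.3f}")
    print(f"CER = {cer(ref, hyp):.3f}")
```

Because insertions are counted against the reference length, both metrics can exceed 1.0 when the hypothesis is much longer than the reference.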