Image and Video Tokenisation

  • Why tokenise images: bridging continuous pixels and discrete language model vocabularies
  • VQ-VAE: vector quantisation, codebook learning, commitment loss
  • VQ-GAN: combining VQ-VAE with adversarial training for higher fidelity
  • Residual quantisation and multi-scale codebooks
  • Image tokenisers: DALL-E tokeniser, LlamaGen, Cosmos tokeniser
  • Video tokenisation: temporal compression, 3D VQ-VAE, causal video tokenisers
  • Continuous vs discrete tokens: when to quantise and when to project
  • Applications: autoregressive image generation, unified vision-language tokens
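
As a concrete anchor for the VQ-VAE item above, here is a minimal sketch of the quantisation step: each continuous latent vector is replaced by the index of its nearest codebook entry, and that index is the discrete token. The names `quantise`, `vq_losses` and `beta`, the toy 2-D codebook, and the choice to average over vectors are assumptions made for illustration, not taken from any particular tokeniser. The sketch computes loss values only; in real training the codebook and commitment terms differ in where the gradient is stopped, which plain Python cannot express.

```python
# Dependency-free sketch of vector quantisation as used in a VQ-VAE.
# All names here are hypothetical and chosen for illustration only.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantise(latents, codebook):
    """Map each latent vector to its nearest codebook entry.

    Returns (token_ids, quantised_vectors).
    """
    token_ids = [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]
    quantised = [codebook[k] for k in token_ids]
    return token_ids, quantised


def vq_losses(latents, quantised, beta=0.25):
    """Values of the codebook term and the commitment term.

    Both are the mean squared distance between encoder outputs and their
    selected codes; the commitment term is scaled by beta (assumed 0.25).
    In training, the codebook term would update only the codebook and the
    commitment term only the encoder, via stop-gradients not modelled here.
    """
    mse = sum(squared_distance(z, q) for z, q in zip(latents, quantised)) / len(latents)
    return mse, beta * mse


# Toy example: three codes in 2-D, four encoder outputs.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7], [0.6, 0.6]]

token_ids, quantised = quantise(latents, codebook)
codebook_term, commitment_term = vq_losses(latents, quantised)

print(token_ids)  # [0, 1, 2, 1]
```

The token list is what an autoregressive model would consume; the decoder only ever sees the looked-up vectors in `quantised`, not the original latents.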