Vision Transformers and Generation

  • Vision Transformer (ViT): patch embedding, class token, position embeddings
  • ViT variants and hierarchical architectures: DeiT (data-efficient training with distillation), Swin Transformer (shifted windows, hierarchical), PVT (pyramid feature maps)
  • Self-supervised visual learning: contrastive (SimCLR, MoCo), self-distillation without negative pairs (BYOL, DINO), masked image modelling (MAE, BEiT)
  • Image generation: GANs (generator, discriminator, mode collapse, training tricks, StyleGAN), VAEs
  • Diffusion models: forward/reverse process, DDPM, DDIM, score-based models, classifier-free guidance, latent diffusion (Stable Diffusion)
  • Flow matching: continuous normalising flows, optimal transport, rectified flows
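
As a concrete illustration of the ViT input pipeline listed above (patch embedding, class token, position embeddings), here is a minimal NumPy sketch. The function names patchify and vit_embed, the 32x32 image, the patch size of 8 and the embedding width of 16 are all assumptions chosen for illustration. The weights are random rather than learned, so this shows only the tensor shapes, not a trained model.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, C) -> (gh, gw, p, p, C) -> (gh * gw, p * p * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def vit_embed(image, patch_size, w_proj, cls_token, pos_embed):
    """Linear patch projection, prepended class token, additive position embeddings."""
    tokens = patchify(image, patch_size) @ w_proj                  # (N, D)
    tokens = np.concatenate([cls_token[None, :], tokens], axis=0)  # (N + 1, D)
    return tokens + pos_embed

# Hypothetical sizes and randomly initialised parameters (illustration only).
rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))
patch_size, dim = 8, 16
num_patches = (32 // patch_size) ** 2
w_proj = rng.normal(size=(patch_size * patch_size * 3, dim)) * 0.02
cls_token = np.zeros(dim)
pos_embed = rng.normal(size=(num_patches + 1, dim)) * 0.02

tokens = vit_embed(image, patch_size, w_proj, cls_token, pos_embed)
print(tokens.shape)  # (17, 16): 16 patch tokens plus 1 class token
```

The resulting token sequence is what a standard Transformer encoder would consume. In a trained model the projection, class token and position embeddings would be learned parameters.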
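
The forward process of a DDPM can be sampled in closed form: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, where abar_t is the cumulative product of (1 - beta_s) and eps is standard Gaussian noise. The sketch below assumes a linear beta schedule from 1e-4 to 0.02 over 1000 steps. The helper names linear_beta_schedule and q_sample and the 4x4 toy "image" are hypothetical, and the code covers only the noising step, not training or the reverse process.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1 .. beta_T (assumed schedule)."""
    return np.linspace(beta_start, beta_end, num_steps)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I)."""
    noise = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

betas = linear_beta_schedule(1000)
alpha_bar = np.cumprod(1.0 - betas)   # abar_t = prod_{s <= t} (1 - beta_s)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 4))   # toy stand-in for an image

x_early, _ = q_sample(x0, 0, alpha_bar, rng)            # almost identical to x0
x_late, noise_late = q_sample(x0, 999, alpha_bar, rng)  # almost pure noise
```

A noise-prediction network would typically be trained to recover the returned noise from x_t and t. That part is omitted here.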