Vision-Language Models

  • Visual question answering (VQA): task formulation, datasets
  • Image captioning: Show-and-Tell, attention-based captioning
  • Architecture patterns: dual encoder, fusion encoder, encoder-decoder (dual-encoder sketch after this list)
  • Flamingo: interleaving visual and text tokens, few-shot multimodal learning (gated cross-attention sketch below)
  • LLaVA and visual instruction tuning: projecting vision features into LLM space (projector sketch below)
  • PaLI, Qwen-VL, InternVL: scaling vision-language models
  • Grounding and referring: pointing, bounding box prediction from language (coordinate-token sketch below)
  • OCR-free document understanding: Donut, Pix2Struct
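
To make the dual-encoder pattern concrete, here is a minimal PyTorch sketch: separate image and text towers project into a shared embedding space, trained with a CLIP-style symmetric contrastive loss. The encoder internals, dimensions, and initial temperature are illustrative assumptions, not any specific model's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Sketch of a CLIP-style dual encoder: two towers, shared space."""

    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 img_dim: int, txt_dim: int, embed_dim: int = 512):
        super().__init__()
        self.image_encoder = image_encoder   # e.g. a ViT (assumed)
        self.text_encoder = text_encoder     # e.g. a text transformer (assumed)
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)
        # Learnable temperature, initialized as in CLIP: ln(1/0.07).
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, images, texts):
        img = F.normalize(self.img_proj(self.image_encoder(images)), dim=-1)
        txt = F.normalize(self.txt_proj(self.text_encoder(texts)), dim=-1)
        # Similarity matrix (batch, batch); diagonal entries are matched pairs.
        logits = self.logit_scale.exp() * img @ txt.t()
        labels = torch.arange(img.size(0), device=img.device)
        # Symmetric InfoNCE loss over image->text and text->image directions.
        return (F.cross_entropy(logits, labels) +
                F.cross_entropy(logits.t(), labels)) / 2
```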
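Flamingo's key mechanism inserts gated cross-attention layers into a frozen language model so that text tokens can attend to visual features; a tanh gate initialized at zero means the frozen LM's behavior is unchanged at the start of training. A rough sketch of one such layer, with assumed dimensions:

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Sketch of a Flamingo-style gated cross-attention layer."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # Zero-initialized gates: the block starts as an identity mapping.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffw_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden, visual_tokens):
        # text_hidden: (batch, text_len, dim); visual_tokens: (batch, vis_len, dim)
        attn_out, _ = self.attn(text_hidden, visual_tokens, visual_tokens)
        x = text_hidden + self.attn_gate.tanh() * attn_out
        x = x + self.ffw_gate.tanh() * self.ffw(x)
        return x
```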
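LLaVA's approach is simpler: a small projector (a linear layer in LLaVA, a two-layer MLP in LLaVA-1.5) maps frozen vision-encoder patch features into the LLM's token embedding space, and the projected tokens are concatenated with the text embeddings. A sketch under assumed dimensions (1024-d CLIP ViT features, a 4096-d LLM):

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Two-layer MLP projector in the style of LLaVA-1.5 (dims assumed)."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(vision_dim, llm_dim), nn.GELU(),
                                 nn.Linear(llm_dim, llm_dim))

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from a frozen
        # vision encoder. Output lives in the LLM's token embedding space.
        return self.mlp(patch_features)

# Projected visual tokens are prepended to the embedded text prompt and fed
# to the LLM as one sequence (shapes are illustrative):
projector = VisionToLLMProjector()
visual = projector(torch.randn(1, 576, 1024))        # 576 patches -> LLM space
text_embeds = torch.randn(1, 32, 4096)               # embedded prompt tokens
llm_input = torch.cat([visual, text_embeds], dim=1)  # (1, 608, 4096)
```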
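For grounding, one common way to predict bounding boxes from language is to quantize coordinates into discrete location tokens in the output vocabulary, in the style of Pix2Seq; several VLMs use variants of this scheme. A minimal encoding sketch (the `<loc_*>` token strings and bin count are illustrative assumptions):

```python
def box_to_tokens(box, image_w, image_h, num_bins=1000):
    """Quantize a pixel-space box (x1, y1, x2, y2) into discrete location
    tokens, Pix2Seq-style. Token naming here is hypothetical."""
    x1, y1, x2, y2 = box
    norm = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    bins = [min(int(v * num_bins), num_bins - 1) for v in norm]
    return [f"<loc_{b}>" for b in bins]

# A box in a 640x480 image becomes four vocabulary tokens:
print(box_to_tokens((64, 48, 320, 240), 640, 480))
# ['<loc_100>', '<loc_100>', '<loc_500>', '<loc_500>']
```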