Vision-Language Models

Visual question answering (VQA): task formulation, datasets
Image captioning: Show and Tell, attention-based captioning
Architecture patterns: dual encoder, fusion encoder, encoder-decoder
Flamingo: interleaving visual and text tokens, few-shot multimodal learning
LLaVA and visual instruction tuning: projecting vision features into LLM space
PaLI, Qwen-VL, InternVL: scaling vision-language models
Grounding and referring: pointing, bounding box prediction from language
OCR-free document understanding: Donut, Pix2Struct
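
Illustrative sketch for the LLaVA item above (projecting vision features into LLM space): the idea is that patch features from a vision encoder are mapped into the language model's embedding dimension, then placed in the same input sequence as the text token embeddings. The toy below assumes a single linear projection with random weights and tiny dimensions. The names make_projection, project_patches, and build_input_sequence, and the sizes D_VISION and D_LLM, are hypothetical and chosen only for illustration; a real system would use a trained projector and a tensor library.

```python
import random

def make_projection(d_vision, d_llm, seed=0):
    """Build a random (d_vision x d_llm) weight matrix (stand-in for a trained projector)."""
    rng = random.Random(seed)
    scale = 1.0 / d_vision ** 0.5
    return [[rng.uniform(-scale, scale) for _ in range(d_llm)]
            for _ in range(d_vision)]

def project_patches(patch_features, weight):
    """Map each patch feature of length d_vision to a vector of length d_llm."""
    d_llm = len(weight[0])
    return [
        [sum(x * weight[i][j] for i, x in enumerate(patch)) for j in range(d_llm)]
        for patch in patch_features
    ]

def build_input_sequence(visual_tokens, text_embeddings):
    """Place projected visual tokens before the text token embeddings."""
    return visual_tokens + text_embeddings

# Toy sizes (assumptions): 3 image patches, 2 text tokens.
D_VISION, D_LLM = 4, 6
patches = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
]
text_embeddings = [[0.0] * D_LLM for _ in range(2)]

W = make_projection(D_VISION, D_LLM)
visual_tokens = project_patches(patches, W)
sequence = build_input_sequence(visual_tokens, text_embeddings)

# Every element of the combined sequence now has the LLM embedding size.
print(len(sequence), len(sequence[0]))
```

After projection, the visual tokens and text embeddings share one dimension, so the language model can attend over both without architectural changes; only the projector has to bridge the two feature spaces.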