- Visual question answering (VQA): task formulation, datasets
- Image captioning: Show and Tell, attention-based captioning (Show, Attend and Tell)
- Architecture patterns: dual encoder, fusion encoder, encoder-decoder
- Flamingo: interleaving visual and text tokens, few-shot multimodal learning
- LLaVA and visual instruction tuning: projecting vision features into LLM space
- PaLI, Qwen-VL, InternVL: scaling vision-language models
- Grounding and referring expressions: pointing, bounding-box prediction from language
- OCR-free document understanding: Donut, Pix2Struct
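
As a rough illustration of the "projecting vision features into LLM space" item above, the sketch below maps patch features from a vision encoder into the LLM's embedding width and splices them between text embeddings. It assumes a single linear projection for brevity (real systems may use an MLP or another connector), and every name, dimension, and value here (`linear_project`, `build_llm_input`, `d_vision`, `d_llm`, the random features) is hypothetical and not taken from any particular model's code.

```python
import random

def linear_project(features, weight, bias):
    """Map each vision feature vector (length d_vision) to length d_llm.

    weight has shape d_llm x d_vision; bias has length d_llm.
    """
    return [
        [sum(x * w for x, w in zip(vec, row)) + b for row, b in zip(weight, bias)]
        for vec in features
    ]

def build_llm_input(text_prefix, visual_tokens, text_suffix):
    """Concatenate embeddings in the order: prompt text, image tokens, question text."""
    return text_prefix + visual_tokens + text_suffix

random.seed(0)
d_vision, d_llm, num_patches = 4, 6, 3  # toy sizes, assumed for illustration

# Stand-in for the output of a frozen vision encoder: one vector per image patch.
patch_features = [
    [random.gauss(0, 1) for _ in range(d_vision)] for _ in range(num_patches)
]

# Projection parameters; in training these would be learned.
weight = [[random.gauss(0, 0.02) for _ in range(d_vision)] for _ in range(d_llm)]
bias = [0.0] * d_llm

visual_tokens = linear_project(patch_features, weight, bias)

# Stand-ins for text token embeddings already in LLM space.
prefix = [[0.0] * d_llm for _ in range(2)]
suffix = [[0.0] * d_llm for _ in range(5)]

sequence = build_llm_input(prefix, visual_tokens, suffix)
print(len(sequence), len(sequence[0]))  # prints "10 6"
```

After projection the image patches have the same width as the text embeddings, so the language model can process the combined sequence without architectural changes. That is the main appeal of this pattern compared with adding cross-attention layers.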