- What is multimodal learning: combining vision, language, audio, and other modalities
- Early vs late fusion: feature-level vs decision-level combination
- Joint embedding spaces: learning shared representations across modalities
- Contrastive learning: CLIP (image-text contrastive), ALIGN, SigLIP
- Loss functions: InfoNCE, NT-Xent, contrastive loss with temperature
- Image-text retrieval and zero-shot classification: both via nearest neighbors in the shared embedding space
- Audio-visual correspondence: learning from paired audio and video
- Evaluation: zero-shot benchmarks, retrieval metrics (recall@k)
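The early vs late fusion distinction above can be sketched in a few lines. This is a hedged, minimal illustration (the `classifier` callable and the weighting scheme are placeholders, not a specific model): early fusion combines raw features before a single decision, late fusion combines per-modality decisions.

```python
def early_fusion(img_feats, txt_feats, classifier):
    # Feature-level fusion: concatenate modality features,
    # then run one classifier over the joint feature vector.
    return classifier(img_feats + txt_feats)

def late_fusion(img_score, txt_score, w=0.5):
    # Decision-level fusion: each modality produces its own
    # prediction score; combine them (here, a weighted average).
    return w * img_score + (1 - w) * txt_score
```

In practice the fusion weight `w` can be tuned on validation data, and early fusion may use a learned projection rather than plain concatenation.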
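The contrastive loss with temperature can be made concrete with a toy, CLIP-style symmetric InfoNCE over a batch similarity matrix. This is a pure-Python sketch for clarity (real implementations operate on tensors); the matrix layout, with matched image-text pairs on the diagonal, is the standard convention.

```python
import math

def infonce_loss(sim, temperature=0.07):
    """Symmetric InfoNCE over sim[i][j] = similarity(image_i, text_j).

    Matched pairs sit on the diagonal; each row/column is treated as
    a softmax classification problem whose correct class is its index.
    """
    n = len(sim)

    def cross_entropy(logits, target):
        # Cross-entropy of softmax(logits / temperature) vs. target index,
        # computed with the max-subtraction trick for numerical stability.
        scaled = [s / temperature for s in logits]
        m = max(scaled)
        log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
        return log_z - scaled[target]

    # Image-to-text direction (rows) and text-to-image direction (columns).
    i2t = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    t2i = sum(cross_entropy([sim[j][i] for j in range(n)], i)
              for i in range(n)) / n
    return (i2t + t2i) / 2
```

A lower temperature sharpens the softmax, so well-separated pairs drive the loss toward zero faster; CLIP learns the temperature as a parameter rather than fixing it.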
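Zero-shot classification via embeddings follows directly: embed each class name as a text prompt, then pick the class whose text embedding is most similar to the image embedding. The sketch below assumes embeddings are given as plain lists; the cosine-similarity-plus-argmax logic is the core idea, while any actual encoder is out of scope here.

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class whose text embedding is
    nearest (by cosine similarity) to the image embedding."""
    sims = [cosine(image_emb, e) for e in class_text_embs]
    return max(range(len(sims)), key=sims.__getitem__)
```

In CLIP-style evaluation the class prompts are usually templated (e.g. "a photo of a {class}") before being encoded, which noticeably affects accuracy.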
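For the retrieval metrics bullet, recall@k is simple enough to spell out: the fraction of queries whose ground-truth match appears among the top-k retrieved items. A minimal sketch, assuming each query has exactly one relevant item (the usual setup for paired image-text benchmarks):

```python
def recall_at_k(ranked_lists, relevant, k):
    """Fraction of queries whose single relevant item id appears in
    the top-k of that query's ranked retrieval list.

    ranked_lists[i] is the ordered list of item ids retrieved for
    query i; relevant[i] is the id of its ground-truth match.
    """
    hits = sum(1 for ranks, rel in zip(ranked_lists, relevant)
               if rel in ranks[:k])
    return hits / len(ranked_lists)
```

Benchmarks typically report recall@1, recall@5, and recall@10 in both directions (image-to-text and text-to-image).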