Multimodal Representations

  • What is multimodal learning: combining vision, language, audio, and other modalities
  • Early vs late fusion: feature-level vs decision-level combination
  • Joint embedding spaces: learning shared representations across modalities
  • Contrastive learning: CLIP (image-text contrastive), ALIGN, SigLIP
  • Loss functions: InfoNCE, NT-Xent, contrastive loss with temperature
  • Image-text retrieval and zero-shot classification via shared embeddings
  • Audio-visual correspondence: learning from paired audio and video
  • Evaluation: zero-shot benchmarks, retrieval metrics (recall@k)
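
The early vs late fusion distinction can be sketched in a few lines. This is a toy illustration, not a real model: the feature vectors and the weight matrices (`W_early`, `W_img`, `W_aud`) are random stand-ins for learned encoders and classifier heads.

```python
import numpy as np

rng = np.random.default_rng(0)
image_feat = rng.normal(size=8)   # stand-in for an image encoder's output
audio_feat = rng.normal(size=8)   # stand-in for an audio encoder's output

# Early fusion: combine at the feature level, then classify jointly.
W_early = rng.normal(size=(3, 16))            # 3 classes, 16 fused features
early_logits = W_early @ np.concatenate([image_feat, audio_feat])

# Late fusion: classify each modality separately, then combine decisions
# (here by averaging the per-modality logits).
W_img = rng.normal(size=(3, 8))
W_aud = rng.normal(size=(3, 8))
late_logits = (W_img @ image_feat + W_aud @ audio_feat) / 2

print(early_logits.shape, late_logits.shape)  # both are (3,)
```

Early fusion lets the classifier model cross-modal feature interactions; late fusion keeps the modalities independent until the decision, which is simpler and tolerates a missing modality more gracefully.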
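
The CLIP-style symmetric InfoNCE objective from the bullets above can be written out directly: normalize both embedding sets, take a temperature-scaled similarity matrix, and apply cross-entropy in both directions with the matching pairs on the diagonal. A minimal NumPy sketch with random embeddings standing in for encoder outputs:

```python
import numpy as np

def clip_infonce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (image, text) embeddings."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature            # (N, N) cosine similarities
    labels = np.arange(len(logits))               # matching pairs on the diagonal

    def ce(l):
        # Cross-entropy with the diagonal entries as targets.
        l = l - l.max(axis=1, keepdims=True)      # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return (ce(logits) + ce(logits.T)) / 2

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 16))
# Perfectly aligned pairs (identical embeddings) give a near-zero loss.
print(clip_infonce(emb, emb))
```

The temperature controls how sharply the softmax concentrates on the hardest negatives; CLIP learns it as a parameter rather than fixing it.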
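
Zero-shot classification via embeddings works by embedding a text prompt per class (e.g. "a photo of a cat") and picking the class whose text embedding is closest to the image embedding. A sketch with hand-made toy vectors standing in for real encoder outputs:

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class text embedding most similar to the image."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    sims = txt @ img                      # cosine similarity per class
    return int(np.argmax(sims))

# Toy embeddings standing in for prompts like "a photo of a cat" / "... a dog".
cat_txt = np.array([1.0, 0.0, 0.1])
dog_txt = np.array([0.0, 1.0, 0.1])
image = np.array([0.9, 0.1, 0.2])         # an image near the "cat" direction

print(zero_shot_classify(image, np.stack([cat_txt, dog_txt])))  # → 0 (cat)
```

No classifier head is trained: the class set can be changed at inference time just by writing new prompts, which is what makes the zero-shot benchmarks in the last bullet possible.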
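
The retrieval metric recall@k counts the fraction of queries whose true match appears among the top-k ranked candidates. A minimal sketch over a hand-made similarity matrix, with the true match for query i placed at column i:

```python
import numpy as np

def recall_at_k(sim, k):
    """Fraction of queries whose true match (index i) ranks in the top-k."""
    topk = np.argsort(-sim, axis=1)[:, :k]     # highest-similarity indices first
    hits = [i in topk[i] for i in range(len(sim))]
    return float(np.mean(hits))

# Rows: image queries; columns: candidate captions; true match on the diagonal.
sim = np.array([[0.9, 0.2, 0.1],
                [0.3, 0.5, 0.8],   # true caption only ranked 2nd here
                [0.2, 0.1, 0.7]])

print(round(recall_at_k(sim, 1), 3))  # → 0.667 (2 of 3 matches ranked first)
print(recall_at_k(sim, 2))            # → 1.0
```

Retrieval benchmarks typically report R@1, R@5, and R@10 in both directions (image-to-text and text-to-image, i.e. rows vs columns of the same similarity matrix).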