Unified Multimodal Architectures

  • The case for unification: one model, many modalities, shared weights
  • Any-to-any models: CoDi, NExT-GPT, Gemini, GPT-4o architecture patterns
  • Modality-specific encoders and decoders with shared transformer backbone
  • Multimodal tokenisation: interleaving text, image, audio tokens in one sequence
  • Training recipes: staged pretraining, modality-specific warm-up, joint fine-tuning
  • Multimodal chain-of-thought reasoning
  • Multimodal agents: tool use, grounding actions in visual context
  • Benchmarks: MMMU, MMBench, SEED-Bench, multimodal evaluation suites
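The "multimodal tokenisation" bullet above can be sketched in a few lines. This is a toy illustration, not any specific model's scheme: the tokeniser stubs, id ranges, and special boundary tokens below are all assumptions. The pattern it shows is the real one, though: modality-specific encoders emit discrete tokens, and non-text spans are wrapped in boundary markers so the shared backbone sees one interleaved sequence.

```python
# Toy sketch of interleaving text, image, and audio tokens in one
# sequence for a shared transformer backbone. All ids are illustrative.

# Special boundary tokens marking where each modality's span begins/ends.
SPECIALS = {
    "<img>": 50000, "</img>": 50001,
    "<aud>": 50002, "</aud>": 50003,
}

def encode_text(text: str) -> list[int]:
    """Stand-in text tokeniser: one id per character (toy assumption)."""
    return [ord(c) % 5000 for c in text]

def encode_image(patches: list[int]) -> list[int]:
    """Stand-in image tokeniser: pretend a VQ encoder maps patches
    to codebook ids in a separate id range."""
    return [10000 + (p % 1024) for p in patches]

def encode_audio(frames: list[int]) -> list[int]:
    """Stand-in audio tokeniser: pretend a neural codec maps frames
    to ids in yet another range."""
    return [20000 + (f % 1024) for f in frames]

def interleave(segments: list[tuple[str, object]]) -> list[int]:
    """Flatten (modality, payload) segments into one token sequence,
    wrapping non-text spans in boundary tokens so the backbone can
    tell modalities apart."""
    seq: list[int] = []
    for modality, payload in segments:
        if modality == "text":
            seq += encode_text(payload)
        elif modality == "image":
            seq += [SPECIALS["<img>"]] + encode_image(payload) + [SPECIALS["</img>"]]
        elif modality == "audio":
            seq += [SPECIALS["<aud>"]] + encode_audio(payload) + [SPECIALS["</aud>"]]
        else:
            raise ValueError(f"unknown modality: {modality}")
    return seq

tokens = interleave([
    ("text", "describe:"),
    ("image", [3, 14, 159]),
    ("text", "and the sound"),
    ("audio", [2, 71]),
])
```

In a real any-to-any model the same sequence format is used on the output side: the backbone predicts image or audio tokens between boundary markers, and the modality-specific decoder turns them back into pixels or waveforms.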
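The staged training recipe bullet (modality-specific warm-up, then joint fine-tuning) can be made concrete as a schedule. The stage names, module names, datasets, and step counts below are illustrative assumptions, not a published recipe; the pattern they encode is the common one: first train only the new modality adapters with the backbone frozen, then unfreeze everything for joint training.

```python
# Hedged sketch of a staged multimodal training schedule.
# Module names, data mixes, and step counts are invented for illustration.

STAGES = [
    {   # Stage 1: modality-specific warm-up — train only the new
        # encoder/decoder adapters so they learn to map into the
        # backbone's token space without disturbing its weights.
        "name": "modality_warmup",
        "trainable": ["image_adapter", "audio_adapter"],
        "frozen": ["backbone", "text_embeddings"],
        "data": ["paired image-text", "paired audio-text"],
        "steps": 50_000,
    },
    {   # Stage 2: joint multimodal pretraining on interleaved data,
        # with the backbone unfrozen (typically at a lower learning rate).
        "name": "joint_pretrain",
        "trainable": ["backbone", "image_adapter", "audio_adapter"],
        "frozen": [],
        "data": ["interleaved multimodal documents"],
        "steps": 200_000,
    },
    {   # Stage 3: joint instruction fine-tuning across all modalities.
        "name": "joint_finetune",
        "trainable": ["backbone", "image_adapter", "audio_adapter"],
        "frozen": [],
        "data": ["multimodal instruction data"],
        "steps": 20_000,
    },
]

def modules_to_train(stage_name: str) -> list[str]:
    """Look up which modules receive gradients in a given stage."""
    stage = next(s for s in STAGES if s["name"] == stage_name)
    return stage["trainable"]
```

A training loop would iterate over `STAGES`, freezing and unfreezing parameter groups per stage; separating the schedule from the loop keeps the recipe easy to inspect and modify.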