- The case for unification: one model, many modalities, shared weights
- Any-to-any models: CoDi, NExT-GPT, Gemini, GPT-4o architecture patterns
- Modality-specific encoders and decoders with shared transformer backbone
- Multimodal tokenisation: interleaving text, image, audio tokens in one sequence
- Training recipes: staged pretraining, modality-specific warm-up, joint fine-tuning
- Multimodal chain-of-thought reasoning
- Multimodal agents: tool use, grounding actions in visual context
- Benchmarks: MMMU, MMBench, SEED-Bench, and other multimodal evaluation suites
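The "modality-specific encoders with a shared transformer backbone" pattern above can be sketched in miniature. This is a toy illustration, not any particular model's implementation: each encoder maps its raw input into a common embedding width so that a single backbone (stubbed here as a pass-through) can process one concatenated sequence. All class names and dimensions are hypothetical.

```python
# Toy sketch of modality-specific encoders feeding a shared backbone.
# Names (TextEncoder, ImageEncoder, SharedBackbone) and D_MODEL are
# illustrative, not taken from any real library.

D_MODEL = 8  # shared embedding width all encoders project into

class TextEncoder:
    def encode(self, token_ids):
        # one D_MODEL-dim vector per token (stub: broadcast the id)
        return [[float(t)] * D_MODEL for t in token_ids]

class ImageEncoder:
    def encode(self, patches):
        # one D_MODEL-dim vector per patch (stub: mean pixel value)
        return [[sum(p) / len(p)] * D_MODEL for p in patches]

class SharedBackbone:
    def forward(self, embeddings):
        # stand-in for the shared transformer layers: identity pass
        return embeddings

def run(text_ids, image_patches):
    text_emb = TextEncoder().encode(text_ids)
    img_emb = ImageEncoder().encode(image_patches)
    # concatenate along the sequence axis: one sequence, one backbone
    return SharedBackbone().forward(text_emb + img_emb)

out = run([1, 2], [[0.0, 1.0], [1.0, 1.0]])
```

The key design point is that the backbone never sees raw pixels or token ids, only same-width embeddings, which is what lets one set of shared weights serve many modalities.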
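The multimodal tokenisation bullet can likewise be made concrete. A minimal sketch, assuming a shared id space where image codebook tokens are shifted above the text vocabulary and wrapped in begin/end-of-image sentinels; the specific ids and names (`BOI`, `EOI`, `IMG_OFFSET`) are invented for illustration, not drawn from any real tokenizer.

```python
# Sketch of interleaving text and image tokens into one sequence.
# TEXT_VOCAB, BOI, EOI, and IMG_OFFSET are hypothetical constants.

TEXT_VOCAB = 32_000          # assumed text vocabulary size
BOI, EOI = 32_000, 32_001    # begin-/end-of-image sentinel ids
IMG_OFFSET = 32_002          # image codebook ids start here

def interleave(segments):
    """Flatten (kind, ids) segments into one token sequence.

    kind is "text" (ids already in [0, TEXT_VOCAB)) or "image"
    (codebook indices, shifted into the shared id space and
    wrapped in BOI/EOI sentinels).
    """
    seq = []
    for kind, ids in segments:
        if kind == "text":
            seq.extend(ids)
        elif kind == "image":
            seq.append(BOI)
            seq.extend(IMG_OFFSET + i for i in ids)
            seq.append(EOI)
        else:
            raise ValueError(f"unknown modality: {kind}")
    return seq

# "A photo of" (3 text tokens) + 4 image patch tokens + "a cat"
tokens = interleave([
    ("text", [17, 942, 8]),
    ("image", [5, 120, 7, 33]),
    ("text", [4, 301]),
])
```

Once everything lives in one id space, the same next-token objective covers all modalities, which is the point of this tokenisation scheme.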
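The staged training recipe (pretraining, modality-specific warm-up, joint fine-tuning) can be summarised as a freeze/unfreeze schedule. This is a hypothetical schedule for illustration; real recipes vary per model, and the stage and module names below are assumptions.

```python
# Hypothetical staged recipe: warm up each new modality adapter with
# the backbone frozen, then unfreeze everything for joint fine-tuning.
# Stage and module names are illustrative.

STAGES = [
    {"name": "text_pretrain",  "train": {"backbone"}},
    {"name": "vision_warmup",  "train": {"image_encoder"}},
    {"name": "audio_warmup",   "train": {"audio_encoder"}},
    {"name": "joint_finetune",
     "train": {"backbone", "image_encoder", "audio_encoder"}},
]

def trainable(stage, module):
    """Whether a module's weights are updated in a given stage."""
    return module in stage["train"]
```

Freezing the backbone during warm-up lets a new encoder align to the existing embedding space without disturbing weights already learned from text.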