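- Text-to-image: DALL-E (autoregressive), Stable Diffusion (latent diffusion conditioned on a CLIP text encoder, with classifier-free guidance), Imagen (cascaded pixel-space diffusion), Parti (autoregressive)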
- Text-to-video: Make-A-Video, VideoPoet, Sora-style temporal diffusion, Wan
- Text-to-audio: MusicLM, MusicGen; AudioLM (audio language modelling without text conditioning, the basis for MusicLM)
- Image-to-text generation: captioning as conditional generation
- Video-audio co-generation: joint temporal modelling
- Instruction-following generation: InstructPix2Pix, editing by description
- Consistency and alignment: text-image alignment (CLIPScore); image quality and diversity (FID, IS)
- Ethical considerations: deepfakes, bias in generation, content filtering
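
The evaluation metrics named in the list can be made concrete with a small sketch. It assumes embeddings and features have already been extracted elsewhere (CLIP image and text embeddings for CLIPScore, Inception pooling features for FID); the function names `clip_score` and `frechet_distance` are illustrative, not taken from any library. The weight `w = 2.5` follows the original CLIPScore definition, and the Fréchet distance uses the closed form for two Gaussians fitted to the feature sets.

```python
import numpy as np


def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore-style alignment: w * max(cos(image, text), 0).

    Assumes both inputs are 1-D embeddings from the same CLIP model.
    """
    image_emb = np.asarray(image_emb, dtype=np.float64)
    text_emb = np.asarray(text_emb, dtype=np.float64)
    cos = image_emb @ text_emb / (
        np.linalg.norm(image_emb) * np.linalg.norm(text_emb)
    )
    return w * max(float(cos), 0.0)


def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two feature sets.

    Inputs are (n_samples, n_features) arrays; for FID these would be
    Inception features, which is an assumption of this sketch.
    """
    feats_real = np.asarray(feats_real, dtype=np.float64)
    feats_gen = np.asarray(feats_gen, dtype=np.float64)
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Tr((cov_r cov_g)^(1/2)) equals the sum of square roots of the
    # eigenvalues of cov_r @ cov_g; clip tiny negatives from round-off.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)
```

Higher CLIPScore indicates closer text-image agreement, while lower Fréchet distance indicates that generated features are distributed more like real ones. In practice FID is sensitive to sample count and to the feature extractor, so scores are only comparable under a fixed protocol.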