Cross-Modal Generation

  • Text-to-image: DALL-E (autoregressive), Stable Diffusion (latent diffusion conditioned on a CLIP text encoder), Imagen, Parti
  • Text-to-video: Make-A-Video, VideoPoet, Sora-style temporal diffusion, Wan
  • Text-to-audio: MusicLM, MusicGen; AudioLM (audio continuation without text conditioning, the basis for MusicLM)
  • Image-to-text generation: captioning as conditional generation
  • Video-audio co-generation: joint temporal modelling
  • Instruction-following generation: InstructPix2Pix, editing by description
  • Consistency and alignment: text-image alignment (CLIPScore); image quality and diversity (FID, IS)
  • Ethical considerations: deepfakes, bias in generation, content filtering
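
The evaluation metrics named in the outline can be illustrated with a minimal sketch. It assumes the embeddings and features have already been extracted (by CLIP for CLIPScore, by an Inception network for FID) and are passed in as plain lists of floats. The function names are hypothetical. The weight w = 2.5 follows the usual CLIPScore definition, w * max(cos(image, text), 0). The Fréchet distance here is a simplified variant that assumes diagonal covariances; the full FID uses full covariance matrices and a matrix square root.

```python
import math
from statistics import fmean, pstdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore for one image-caption pair: w * max(cos, 0).

    image_emb and text_emb are assumed to be CLIP embeddings computed
    elsewhere; this function only does the scoring step.
    """
    return w * max(cosine(image_emb, text_emb), 0.0)


def frechet_distance_diag(real_feats, gen_feats):
    """Frechet distance between two feature sets, diagonal covariances only.

    Simplification (an assumption of this sketch): each feature dimension
    is treated as an independent Gaussian, so the distance reduces to
    sum over dimensions of (mean gap)^2 + (std gap)^2.
    """
    total = 0.0
    for d in range(len(real_feats[0])):
        real_col = [f[d] for f in real_feats]
        gen_col = [f[d] for f in gen_feats]
        mean_gap = fmean(real_col) - fmean(gen_col)
        std_gap = pstdev(real_col) - pstdev(gen_col)
        total += mean_gap ** 2 + std_gap ** 2
    return total


if __name__ == "__main__":
    # Toy vectors standing in for real embeddings and features.
    print(clip_score([1.0, 0.0], [1.0, 0.0]))
    print(frechet_distance_diag([[0, 0], [2, 2]], [[1, 1], [3, 3]]))
```

A higher CLIPScore indicates closer text-image agreement, while a lower Fréchet distance indicates that generated features are distributed more like the real ones. In practice both are averaged or computed over large sample sets, not single pairs.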