VideoPrism: A foundational visual encoder for video understanding
Google Research - YouTube
Efficient Video-Text Learning with Iterative Co-tokenization
End-to-end Generative Pre-training for Multimodal Video Captioning
VDTTS: Visually-Driven Text-To-Speech