DSI++: Updating Transformer Memory with New Documents
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Scaling Vision Transformers to 22 Billion Parameters
End-to-end Generative Pre-training for Multimodal Video Captioning
Multimodal Bottleneck Transformer (MBT): A New Model for Modality Fusion