Capabilities of GPT-5 on Multimodal Medical Reasoning
Scaling Language-Free Visual Representation Learning
Multimodal Large Language Models: A Survey
MMaDA: Multimodal Large Diffusion Language Models
UniVG-R1: Reasoning Guided Universal Visual Grounding with...
AMIE gains vision: A research AI agent for multimodal diagnostic dialogue
Data-Efficient Multimodal Fusion on a Single GPU
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Guiding Instruction-based Image Editing via Multimodal Large Language Models
Ferret: Refer and Ground Anything Anywhere at Any Granularity
NExT-GPT: Any-to-Any Multimodal LLM
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
A transformer-based representation-learning model with unified processing of multimodal input for clinical diagnostics
Gong, Y., Rouditchenko, A., Liu, A. H., Harwath, D., Karlinsky, L., Kuehne, H., & Glass, J. (2022). Contrastive audio-visual masked autoencoder. arXiv preprint arXiv:2210.07839.
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace
Stable Bias: Analyzing Societal Representations in Diffusion Models
ViperGPT: Visual Inference via Python Execution for Reasoning
Erasing Concepts from Diffusion Models
ChatGPT is on the horizon: Could a large language model be all we need for Intelligent Transportation?