OpenAI CLIP Explained | Multi-modal ML
OpenAI's CLIP explained simply and intuitively with visuals and code. Language models (LMs) can not rely on language alone. That is the idea behind the "Experience Grounds Language" paper, that proposes a framework to measure LMs' current and future progress. A key idea is that, beyond a certain threshold LMs need other forms of data, such as visual input.
The next step beyond well-known language models; BERT, GPT-3, and T5 is "World Scope 3". In World Scope 3, we move from large text-only datasets to large multi-modal datasets. That is, datasets containing information from multiple forms of media, like *both* images and text.
The world, both digital and real, is multi-modal. We perceive the world as an orchestra of language, imagery, video, smell, touch, and more. This chaotic ensemble produces an inner state, our "model" of the outside world.
AI must move in the same direction. Even specialist models that focus on language or vision must, at some point, have input from the other modalities. How can a model fully understand the concept of the word "person" without *seeing* a person?
OpenAI's Contrastive Learning In Pretraining (CLIP) is a world scope three model. It can comprehend concepts in both text and image and even connect concepts between the two modalities. In this video we will learn about multi-modality, how CLIP works, and how to use CLIP for different use cases like encoding, classification, and object detection.
🌲 Pinecone article:
https://pinecone.io/learn/clip/
🤖 70% Discount on the NLP With Transformers in Python course:
https://bit.ly/3DFvvY5
🎉 Subscribe for Article and Video Updates!
https://jamescalam.medium.com/subscribe
https://medium.com/@jamescalam/membership
👾 Discord:
https://discord.gg/c5QtDB9RAP