Thanks to KiwiCo for sponsoring today’s video! Go to https://www.kiwico.com/welchlabs and use code WELCHLABS for 50% off your first monthly club crate or for 20% off your first Panda Crate!
MLA/DeepSeek Poster at 17:12 (Free shipping for a limited time with code DEEPSEEK):
https://www.welchlabs.com/resources/mladeepseek-attention-poster-13x19
Limited edition MLA Poster and Signed Book:
https://www.welchlabs.com/resources/deepseek-bundle-mla-poster-and-signed-book-limited-run
Imaginary Numbers book is back in stock!
https://www.welchlabs.com/resources/imaginary-numbers-book
Special Thanks to Patrons https://www.patreon.com/c/welchlabs
Juan Benet, Ross Hanson, Yan Babitski, AJ Englehardt, Alvin Khaled, Eduardo Barraza, Hitoshi Yamauchi, Jaewon Jung, Mrgoodlight, Shinichi Hayashi, Sid Sarasvati, Dominic Beaumont, Shannon Prater, Ubiquity Ventures, Matias Forti, Brian Henry, Tim Palade, Petar Vecutin, Nicolas baumann, Jason Singh, Robert Riley, vornska, Barry Silverman, Jake Ehrlich
References
DeepSeek-V2 paper: https://arxiv.org/pdf/2405.04434
DeepSeek-R1 paper: https://arxiv.org/abs/2501.12948
Great Article by Ege Erdil: https://epoch.ai/gradient-updates/how-has-deepseek-improved-the-transformer-architecture
GPT-2 Visualization: https://github.com/TransformerLensOrg/TransformerLens
Manim Animations: https://github.com/stephencwelch/manim_videos
Technical Notes
1. Note that the DeepSeek-V2 paper claims a KV cache size reduction of 93.3%. They don’t publish their exact methodology, but as far as I can tell it’s something like this: start with the DeepSeek-V2 hyperparameters here: https://huggingface.co/deepseek-ai/DeepSeek-V2/blob/main/configuration_deepseek.py. num_hidden_layers=30, num_attention_heads=32, v_head_dim=128. If DeepSeek-V2 were implemented with traditional MHA, the KV cache size would be 2*32*128*30*2=491,520 B/token. With MLA and a 576-element cache per layer, the total cache size is 576*30*2=34,560 B/token. The percent reduction in KV cache size is then (491,520-34,560)/491,520≈93.0%. The numbers I present in this video follow the same approach but use the DeepSeek-V3/R1 architecture: https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/config.json. num_hidden_layers=61, num_attention_heads=128, v_head_dim=128. So a traditional MHA cache would be 2*128*128*61*2 = 3,997,696 B/token. MLA reduces this to 576*61*2=70,272 B/token. For the DeepSeek-V3/R1 architecture, MLA therefore reduces the KV cache size by a factor of 3,997,696/70,272 ≈ 56.9X. (See the short calculation sketch after these notes.)
2. I claim a couple of times that MLA allows DeepSeek to generate tokens more than 6x faster than a vanilla transformer. The DeepSeek-V2 paper claims a slightly less than 6x throughput improvement with MLA, but since the V3/R1 architecture is heavier, we expect a larger lift, which is why I claim “more than 6x faster than a vanilla transformer” - in reality it’s probably significantly more than 6x for the V3/R1 architecture.
3. In all attention patterns and walkthroughs, we’re ignoring the |beginning of sentence| token. “The American flag is red, white, and” actually maps to 10 tokens if we include this starting token, and many attention patterns do assign high values to this token.
4. We’re ignoring bias terms in the matrix equations.
5. We’re ignoring positional embeddings. These are fascinating; see the DeepSeek papers and RoPE (rotary position embeddings).
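To make the arithmetic in note 1 easy to check, here is a short Python sketch (my own, not from the papers) that reproduces the cache-size numbers above, assuming 2 bytes per cached element:

```python
# KV cache size per token, assuming fp16/bf16 (2 bytes per element).
def mha_cache_bytes_per_token(num_layers, num_heads, head_dim, bytes_per_el=2):
    # Traditional MHA caches both keys and values for every head in every layer.
    return 2 * num_heads * head_dim * num_layers * bytes_per_el

def mla_cache_bytes_per_token(num_layers, latent_dim=576, bytes_per_el=2):
    # MLA caches a 512-dim compressed KV latent plus a 64-dim decoupled RoPE key
    # (576 elements total) per layer.
    return latent_dim * num_layers * bytes_per_el

# DeepSeek-V2: 30 layers, 32 heads, 128-dim value heads
v2_mha = mha_cache_bytes_per_token(30, 32, 128)   # 491,520 B/token
v2_mla = mla_cache_bytes_per_token(30)            # 34,560 B/token
print(f"V2 reduction: {(v2_mha - v2_mla) / v2_mha:.1%}")        # ~93.0%

# DeepSeek-V3/R1: 61 layers, 128 heads, 128-dim value heads
v3_mha = mha_cache_bytes_per_token(61, 128, 128)  # 3,997,696 B/token
v3_mla = mla_cache_bytes_per_token(61)            # 70,272 B/token
print(f"V3/R1 compression factor: {v3_mha / v3_mla:.1f}x")      # ~56.9x
```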
How to Build an In-N-Out Agent with OpenAI Agents SDK
In this video, I take a deep dive into the OpenAI Agents SDK and how it can be used to build a fast-food agent.
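As a taste of what’s covered, here is a minimal sketch of an agent with one tool, assuming the openai-agents package is installed and an OPENAI_API_KEY is set (the menu text and names are illustrative, not the exact code from the Colab):

```python
from agents import Agent, Runner, function_tool

@function_tool
def get_menu() -> str:
    """Return the In-N-Out menu."""
    return "Hamburger, Cheeseburger, Double-Double, Fries, Shakes"

agent = Agent(
    name="In-N-Out Agent",
    instructions="You are a friendly In-N-Out order taker. Use the menu tool when asked about food.",
    tools=[get_menu],
)

# Run the agent synchronously on a single user message.
result = Runner.run_sync(agent, "What can I order?")
print(result.final_output)
```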
Colab: https://dripl.ink/MZw2R
For more tutorials on using LLMs and building agents, check out my Patreon
Patreon: https://www.patreon.com/SamWitteveen
Twitter: https://x.com/Sam_Witteveen
🕵️ Interested in building LLM Agents? Fill out the form below
Building LLM Agents Form: https://drp.li/dIMes
👨💻Github:
https://github.com/samwit/llm-tutorials
⏱️Time Stamps:
00:00 Intro
00:11 Creating an In-N-Out Agent (Colab Demo)
00:40 In-N-Out Burger Agent
04:35 Streaming runs
05:40 Adding Tools
08:20 Websearch Tool
09:45 Agents as Tools
12:21 Giving it a Chat Memory
How to connect an LLM to Zotero for a private, local research assistant – fast, no code
Learn how to chat with your Zotero database using a private, local LLM with no coding required: Llama, DeepSeek, or any LLM you want!
Please like and subscribe to help support the channel. @LearnMetaAnalysis
Ollama - https://ollama.com/
Docker - https://www.docker.com/
Open WebUI Quickstart - https://docs.openwebui.com/getting-started/quick-start
Zotero - https://www.zotero.org/
Zotero Directory Information - https://www.zotero.org/support/zotero_data
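If you do want to script part of the sync yourself (entirely optional, the video itself is click-only), here is a minimal Python sketch that lists the PDF attachments in a Zotero data directory so they can be added to an Open WebUI knowledge collection. The default path is an assumption; see the Zotero data directory docs linked above for where yours actually lives.

```python
from pathlib import Path

# Zotero keeps one subfolder per attachment under <data directory>/storage.
zotero_storage = Path.home() / "Zotero" / "storage"  # default location on many systems

pdfs = sorted(zotero_storage.glob("*/*.pdf"))
print(f"Found {len(pdfs)} PDF attachments")
for pdf in pdfs[:5]:
    print(pdf.name)
```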
Tutorials and how-to guides:
Getting started with Open WebUI: https://youtu.be/gm_1VUg3L24
Conventional meta-analysis: https://www.youtube.com/playlist?list=PLXa5cTEormkEbYpBIgikgE0y9QR7QIgzs
Three-level meta-analysis: https://www.youtube.com/playlist?list=PLXa5cTEormkHwRmu_TJXa7fSb6-WBXXoJ
Three-level meta-analysis with correlated and hierarchical effects and robust variance estimation: https://www.youtube.com/playlist?list=PLXa5cTEormkEGenfcnp9X5dQUhmm7f9Jp
Want free point and click (no coding required) meta-analysis software? Check out Simple Meta-Analysis: https://learnmeta-analysis.com/pages/simple-meta-analysis-software
Tired of manually extracting data for systematic review and meta-analysis? Check out AI-Assisted Data Extraction, a free package for R! https://youtu.be/HuWXbe7hgFc
Free ebook on meta-analysis in R (no download required): https://noah-schroeder.github.io/reviewbook/
Visit our website at https://learnmeta-analysis.com/
0:00 What we’re building
1:40 Requirements
7:05 Sync Zotero database
10:13 Custom model
12:13 It works!
17:26 Changing LLM
18:54 Updating knowledge database
You HAVE to Try Agentic RAG with DeepSeek R1 (Insane Results)
DeepSeek R1 - the latest and greatest open source reasoning LLM - has taken the world by storm, and a lot of content creators are doing a great job covering its implications and strengths/weaknesses. What I haven’t seen a lot of, though, is actually using R1 in agentic workflows to truly leverage its power. So that’s what I’m showing you in this video - we’ll be using the power of R1 to make a simple but super effective agentic RAG setup. We’ll be using Smolagents by HuggingFace to create our agent - it’s the simplest agent framework out there, and many of you have been asking me to try it out.
This agentic RAG setup centers around the idea that reasoning LLMs like R1 are extremely powerful but quite slow. Because of this, a lot of people are starting to experiment with combining the raw power of a model like R1 with a more lightweight and fast LLM that drives the primary conversation/agent flow. Think of it as giving the agent R1 as a tool to use when it needs more reasoning power, at the cost of a slower response (and higher cost). That’s what we’ll be doing here - creating an agent that has an R1-driven RAG tool to extract in-depth insights from a knowledgebase.
The example in this video is meant to be an introduction to these kinds of reasoning agentic flows. That’s why I keep it simple with Smolagents and a local knowledgebase. But I’m planning on expanding this much further soon with a more robust but still similar flow built with Pydantic AI and LangGraph!
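Here is a minimal sketch of that pattern - my own simplified version, not the code from the repo linked below, and the model tags are illustrative. A small, fast local model drives a Smolagents CodeAgent, and a tool hands harder questions (with any retrieved context) to DeepSeek R1 running on Ollama:

```python
import requests
from smolagents import CodeAgent, LiteLLMModel, tool

@tool
def deep_reasoning(question: str) -> str:
    """Ask the local DeepSeek R1 model (via Ollama) to reason through a hard question.

    Args:
        question: The question, with any retrieved knowledge-base passages pasted in.
    """
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "deepseek-r1:7b", "prompt": question, "stream": False},
        timeout=600,
    )
    return resp.json()["response"]

# A lighter, faster model drives the conversation and decides when to call the tool.
driver = LiteLLMModel(model_id="ollama_chat/llama3.1:8b",
                      api_base="http://localhost:11434")
agent = CodeAgent(tools=[deep_reasoning], model=driver)

print(agent.run("Use the deep_reasoning tool to explain the trade-offs of this agent design."))
```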
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Community Voting period of the oTTomator Hackathon is open! Head on over to the Live Agent Studio now and test out the submissions and vote for your favorite agents. There are so many incredible projects to try out!
https://studio.ottomator.ai
All the code covered in this video + instructions to run it can be found here:
https://github.com/coleam00/ottomator-agents/tree/main/r1-distill-rag
SmolAgents:
https://huggingface.co/docs/smolagents/en/index
R1 on Ollama:
https://ollama.com/library/deepseek-r1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
00:00 - Why R1 for Agentic RAG?
01:56 - Overview of our Agent
03:33 - SmolAgents - Our Ticket to Fast Agents
06:07 - Building our Agentic RAG Agent with R1
14:17 - Creating our Local Knowledgebase w/ Chroma DB
15:45 - Getting our Local LLMs Set Up with Ollama
19:15 - R1 Agentic RAG Demo
21:42 - Outro
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Join me as I push the limits of what is possible with AI. I'll be uploading videos at least two times a week - Sundays and Wednesdays at 7:00 PM CDT!
Learn how to build a VS Code Extension from scratch. In this fun tutorial, we integrate DeepSeek R1 directly into our editor to build a custom AI assistant.
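The extension built in the video is TypeScript, but under the hood it simply calls the local Ollama server. Here is a minimal Python sketch of that same chat request (the model tag and prompt are illustrative):

```python
import requests

# Send one chat message to a locally running Ollama server and print the reply.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "deepseek-r1:7b",
        "messages": [{"role": "user", "content": "Explain closures in JavaScript."}],
        "stream": False,
    },
    timeout=600,
)
print(resp.json()["message"]["content"])
```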
Go Deeper https://fireship.io/courses
Related Content:
VS Code Extension Template https://code.visualstudio.com/api/get-started/your-first-extension
Ollama DeepSeek R1 https://ollama.com/library/deepseek-r1
DeepSeek R1 First Look https://youtu.be/-2k1rcRzsLA
DeepSeek Fallout https://youtu.be/Nl7aCUsWykg
Discussions:
Hacker News (65 points, 4 comments), Reddit r/MachineLearning (29 points, 3 comments)
Translations: Arabic, Chinese (Simplified) 1, Chinese (Simplified) 2, French 1, French 2, Italian, Japanese, Korean, Persian, Russian, Spanish 1, Spanish 2, Vietnamese
Watch: MIT’s Deep Learning State of the Art lecture referencing this post
Featured in courses at Stanford, Harvard, MIT, Princeton, CMU and others
In the previous post, we looked at Attention – a ubiquitous method in modern deep learning models. Attention is a concept that helped improve the performance of neural machine translation applications. In this post, we will look at The Transformer – a model that uses attention to boost the speed with which these models can be trained. The Transformer outperforms the Google Neural Machine Translation model in specific tasks. The biggest benefit, however, comes from how The Transformer lends itself to parallelization. It is in fact Google Cloud’s recommendation to use The Transformer as a reference model for their Cloud TPU offering. So let’s try to break the model apart and look at how it functions.
The Transformer was proposed in the paper Attention is All You Need. A TensorFlow implementation of it is available as a part of the Tensor2Tensor package. Harvard’s NLP group created a guide annotating the paper with PyTorch implementation. In this post, we will attempt to oversimplify things a bit and introduce the concepts one by one to hopefully make it easier to understand for people without in-depth knowledge of the subject matter.
2020 Update: I’ve created a “Narrated Transformer” video which is a gentler approach to the topic:
A High-Level Look
Let’s begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language, and output its translation in another.
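If you want to poke at such a black box yourself, here is a minimal sketch that runs a small pretrained Transformer translation model via the Hugging Face pipeline API (t5-small is just an illustrative choice, not the model from the paper):

```python
from transformers import pipeline

# Load a small pretrained English-to-French translation model.
translator = pipeline("translation_en_to_fr", model="t5-small")

# The "black box": English sentence in, French translation out.
result = translator("The Transformer is a model that uses attention.")
print(result[0]["translation_text"])
```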