AI/ML

AI/ML

2386 bookmarks
Custom sorting
Proximal Policy Optimization (PPO) for LLMs Explained Intuitively
Proximal Policy Optimization (PPO) for LLMs Explained Intuitively
In this video, I break down Proximal Policy Optimization (PPO) from first principles, without assuming prior knowledge of Reinforcement Learning. By the end, you’ll understand the core RL building blocks that led to PPO, including: 🔵 Policy Gradient 🔵 Actor-Critic Models 🔵 The Value Function 🔵 The Generalized Advantage Estimate In the LLM world, PPO was used to train reasoning models like OpenAI's o1/o3, and presumably Claude 3.7, Grok 3, etc. It’s the backbone of Reinforcement Learning with Human Feedback (RLHF) -- which helps align AI models with human preferences and Reinforcement Learning with Verifiable Rewards (RLVR), which gives LLMs reasoning abilities. Papers: - PPO paper: https://arxiv.org/pdf/1707.06347 - GAE paper: https://arxiv.org/pdf/1506.02438 - TRPO paper: https://arxiv.org/pdf/1502.05477 Well-written blogposts: - https://danieltakeshi.github.io/2017/04/02/notes-on-the-generalized-advantage-estimation-paper/ - https://huggingface.co/blog/NormalUhr/rlhf-pipeline - https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/ Implementations: - (Original) OpenAI Baseslines: https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/ppo2 - Hugging Face: https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py - Hugging Face docs: https://huggingface.co/docs/trl/main/en/ppo_trainer Mother of all RL books (Barto & Sutton): http://incompleteideas.net/book/RLbook2020.pdf 00:00 Intro 01:21 RL for LLMs 05:53 Policy Gradient 09:23 The Value Function 12:14 Generalized Advantage Estimate 17:17 End-to-end Training Algorithm 18:23 Importance Sampling 20:02 PPO Clipping 21:36 Outro Special thanks to Anish Tondwalkar for discussing some of these concepts with me. Note: At 21:10, A_t should have been inside the min. Thanks @t.w.7065 for catching this.
·youtube.com·
Proximal Policy Optimization (PPO) for LLMs Explained Intuitively
awwaiid/gremllm
awwaiid/gremllm
Delightfully cursed Python library by Brock Wilcox, built on top of LLM: from gremllm import Gremllm counter = Gremllm("counter") counter.value = 5 counter.increment() print(counter.value) # 6? print(counter.to_roman_numerals()) # VI? You …
·simonwillison.net·
awwaiid/gremllm
Guest Post: How I Scanned all of GitHub’s “Oops Commits” for Leaked Secrets ◆ Truffle Security Co.
Guest Post: How I Scanned all of GitHub’s “Oops Commits” for Leaked Secrets ◆ Truffle Security Co.
GitHub Archive logs every public commit, even the ones developers try to delete. Force pushes often cover up mistakes like leaked credentials by rewriting Git history. GitHub keeps these dangling commits, from what we can tell, forever. In the archive, they show up as “zero-commit” PushEvents.
·trufflesecurity.com·
Guest Post: How I Scanned all of GitHub’s “Oops Commits” for Leaked Secrets ◆ Truffle Security Co.
Identify, solve, verify
Identify, solve, verify
The more time I spend using LLMs for code, the less I worry for my career - even as their coding capabilities continue to improve. Using LLMs as part of …
·simonwillison.net·
Identify, solve, verify
Exploiting the IKKO Activebuds "AI powered" earbuds, running DOOM, stealing their OpenAI API key and customer data.
Exploiting the IKKO Activebuds "AI powered" earbuds, running DOOM, stealing their OpenAI API key and customer data.
So my journey with these earbuds started after I saw them on this Mrwhosetheboss video about pointless tech. This device seems to be also popular on TikTok. My suspicions were confirmed, this runs android. So of course i went ahead and bought them. 245 euros later... and they finally arrived!
·blog.mgdproductions.com·
Exploiting the IKKO Activebuds "AI powered" earbuds, running DOOM, stealing their OpenAI API key and customer data.
Blah, Blah, Blah, Blah Ginger
Blah, Blah, Blah, Blah Ginger
In one of my favorite Far Side cartoons, Gary Larson tells us our dog’s vocabulary is rather limited; in fact, they really only know or hear their names. While this cartoon always makes me la…
·desertdemocrat.wordpress.com·
Blah, Blah, Blah, Blah Ginger
LLM Hacking Defense: Strategies for Secure AI
LLM Hacking Defense: Strategies for Secure AI
Ready to become a certified z/OS v3.x Administrator? Register now and use code IBMTechYT20 for 20% off of your exam → https://ibm.biz/BdnNJp Learn more about Guardium AI Security here → https://ibm.biz/Bdn7PF How do you secure large language models from hacking and prompt injection? 🔐 Jeff Crume explains LLM risks like data leaks, jailbreaks, and malicious prompts. Learn how policy engines, proxies, and defense-in-depth can protect generative AI systems from advanced threats. 🚀 AI news moves fast. Sign up for a monthly newsletter for AI updates from IBM → https://ibm.biz/BdnNJh #llm #secureai #aihacking #aicybersecurity
·youtube.com·
LLM Hacking Defense: Strategies for Secure AI