Bay.Area.AI: DSPy: Prompt Optimization for LM Programs, Michael Ryan
ai.bythebay.io, Nov 2025, Oakland, full-stack AI conference. DSPy: Prompt Optimization for LM Programs. Michael Ryan, Stanford.

It has never been easier to build amazing LLM-powered applications. Unfortunately, engineering reliable and trustworthy LLMs remains challenging. Instead, practitioners should build LM Programs composed of several composable calls to LLMs, which can be rigorously tested, audited, and optimized like other software systems. In this talk I will introduce the idea of LM Programs in DSPy, the library for programming, not prompting, LMs. I will demonstrate how the LM Program abstraction allows the creation of automatic optimizers that can optimize both the prompts and the weights in an LM Program. I will conclude with an introduction to MIPROv2, our latest and highest-performing prompt optimization algorithm for LM Programs.

Michael Ryan is a master's student at Stanford University working on optimization for Language Model Programs in DSPy and on personalizing language models. His work has been recognized with a Best Social Impact award at ACL 2024 and an honorable mention for outstanding paper at ACL 2023. Michael co-led the creation of the MIPRO and MIPROv2 optimizers, DSPy's most performant optimizers for Language Model Programs. His prior work has showcased unintended cultural and global biases expressed in popular LLMs. He is currently a research intern at Snowflake.
·youtube.com·
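For reference, a minimal sketch of what the LM Program plus MIPROv2 workflow described in the talk might look like in DSPy. The model name, metric, and one-example training set are placeholder assumptions, not taken from the talk; check the DSPy docs for current optimizer arguments.

```python
import dspy

# Configure the underlying LM (model name is an assumption).
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# A signature declares input/output behaviour instead of a hand-written prompt.
class AnswerQuestion(dspy.Signature):
    """Answer the question concisely."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

# An LM Program: a composable module built around LM calls.
program = dspy.ChainOfThought(AnswerQuestion)

# A metric for the optimizer to maximize (exact match, purely illustrative).
def exact_match(example, prediction, trace=None):
    return example.answer.strip().lower() == prediction.answer.strip().lower()

# Placeholder training set; a real one would have many more examples.
trainset = [dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question")]

# MIPROv2 searches over instructions and few-shot demos for the program's modules.
optimizer = dspy.MIPROv2(metric=exact_match, auto="light")
optimized_program = optimizer.compile(program, trainset=trainset)
```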
LLM as a Judge
Learn what LLM as a Judge is, how it works, its benefits, challenges, and best practices for automated evaluations in AI applications.
·programmatic-website.vercel.app·
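The pattern in a nutshell, as a minimal sketch: a separate "judge" model scores a candidate answer against a rubric so grading can be automated. The OpenAI client, model name, and 1-to-5 rubric here are my own assumptions, not taken from the linked article.

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are a strict evaluator.
Question: {question}
Candidate answer: {answer}
Score the answer from 1 (wrong) to 5 (fully correct). Reply with the number only."""

def judge(question: str, answer: str) -> int:
    # Ask the judge model for a single numeric score.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())

print(judge("What is the capital of France?", "Paris"))
```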
Pydantic Evals
A brand-new package from the Pydantic AI team which directly tackles what I consider to be the single hardest problem in AI engineering: building evals to determine if your LLM-based system is working correctly and getting better over time.
·simonwillison.net·
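A rough sketch of the workflow as I understand it from the docs: declare Cases, attach evaluators, then run the system under test over the Dataset. The package is young and its API is evolving, so verify the class and method names against the current Pydantic Evals documentation.

```python
from dataclasses import dataclass

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext

@dataclass
class ExactMatch(Evaluator):
    # Custom evaluator: full score only when output equals the expected output.
    def evaluate(self, ctx: EvaluatorContext[str, str]) -> float:
        return 1.0 if ctx.output == ctx.expected_output else 0.0

dataset = Dataset(
    cases=[
        Case(
            name="capital_of_france",
            inputs="What is the capital of France?",
            expected_output="Paris",
        )
    ],
    evaluators=[ExactMatch()],
)

async def answer(question: str) -> str:
    # Stand-in for the LLM-based system under test.
    return "Paris"

report = dataset.evaluate_sync(answer)
report.print(include_input=True, include_output=True)
```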
yet-another-applied-llm-benchmark
Nicholas Carlini introduced this personal LLM benchmark suite [back in February](https://nicholas.carlini.com/writing/2024/my-benchmark-for-large-language-models.html) as a collection of over 100 automated tests he runs against new LLM models to evaluate their performance against …
·simonwillison.net·
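The repo defines its own dataflow DSL for these tests; the sketch below is only a generic, self-contained illustration of the same idea (prompt the model for code, extract it, run it, assert on the output), with my own helper names and an assumed model, not Carlini's actual API.

```python
import re
import subprocess

from openai import OpenAI

client = OpenAI()

def llm_run(prompt: str) -> str:
    # Ask the model under test to write a program.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model under test
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def extract_code(text: str) -> str:
    # Pull the first fenced code block out of the reply, falling back to the raw text.
    match = re.search(r"```(?:python)?\n(.*?)```", text, re.DOTALL)
    return match.group(1) if match else text

def python_run(code: str) -> str:
    # Execute the generated program and capture its stdout.
    proc = subprocess.run(["python", "-c", code], capture_output=True, text=True, timeout=30)
    return proc.stdout

# One automated "test": generate -> extract -> execute -> substring check.
reply = llm_run('Write a Python program that prints "hello world".')
assert "hello world" in python_run(extract_code(reply)).lower()
```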
Introduction | Ragas
Ragas is a framework that helps you evaluate your Retrieval-Augmented Generation (RAG) pipelines. RAG denotes a class of LLM applications that use external data to augment the LLM's context. There are existing tools and frameworks that help you build these pipelines, but evaluating them and quantifying your pipeline's performance can be hard. This is where Ragas (RAG Assessment) comes in.
·docs.ragas.io·
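A small sketch of the classic Ragas evaluation loop: collect question / answer / retrieved-contexts / ground-truth rows, then score them with built-in metrics. The column names and metric choices follow older Ragas examples and may differ in newer releases, so treat them as assumptions.

```python
from datasets import Dataset  # Hugging Face datasets
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# One evaluation row from a hypothetical RAG pipeline.
rows = {
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["Paris is the capital and most populous city of France."]],
    "ground_truth": ["Paris"],
}

# Score the pipeline output; each metric itself calls an LLM behind the scenes.
result = evaluate(Dataset.from_dict(rows), metrics=[faithfulness, answer_relevancy])
print(result)
```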