Evaluation Guidebook - a Hugging Face Space by OpenEvals
This application displays the evolution of benchmark scores for large language models over time. It shows the top scores achieved by models on various benchmarks and provides insights into the progress of model performance.
Bay.Area.AI: DSPy: Prompt Optimization for LM Programs, Michael Ryan
ai.bythebay.io, Nov 2025, Oakland: a full-stack AI conference.
DSPy: Prompt Optimization for LM Programs
Michael Ryan, Stanford
It has never been easier to build amazing LLM-powered applications. Unfortunately, engineering reliable and trustworthy LLM systems remains challenging. Instead, practitioners should build LM Programs composed of several composable LLM calls that can be rigorously tested, audited, and optimized like other software systems. In this talk I will introduce the idea of LM Programs in DSPy, the library for Programming, not Prompting, LMs. I will demonstrate how the LM Program abstraction allows the creation of automatic optimizers that can optimize both the prompts and the weights in an LM Program. I will conclude with an introduction to MIPROv2, our latest and highest-performing prompt optimization algorithm for LM Programs.
Michael Ryan is a master's student at Stanford University working on optimization for Language Model Programs in DSPy and Personalizing Language Models. His work has been recognized with a Best Social Impact award at ACL 2024 and an honorable mention for outstanding paper at ACL 2023. Michael co-led the creation of the MIPRO and MIPROv2 optimizers, DSPy's most performant optimizers for Language Model Programs. His prior work has showcased unintended cultural and global biases expressed in popular LLMs. He is currently a research intern at Snowflake.
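For readers new to DSPy, here is a minimal sketch of what an LM Program and a MIPROv2 optimization pass might look like. The model name, training example, and metric are placeholders I made up, and argument names can vary between DSPy versions, so treat this as an illustration rather than a definitive recipe.

```python
import dspy
from dspy.teleprompt import MIPROv2

# Configure a language model (placeholder model name).
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# An LM Program: a module built from declarative signatures rather than
# hand-written prompt strings.
class QA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.answer = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        return self.answer(question=question)

# A few labeled examples and a simple metric to optimize against
# (a real run would use a larger trainset).
trainset = [
    dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
]

def exact_match(example, prediction, trace=None):
    return example.answer.strip().lower() == prediction.answer.strip().lower()

# MIPROv2 searches over instructions and few-shot demonstrations for each
# predictor in the program, guided by the metric.
optimizer = MIPROv2(metric=exact_match, auto="light")
optimized_qa = optimizer.compile(QA(), trainset=trainset)
```

The point of the abstraction is that `QA()` stays plain Python, while the optimizer, not the developer, is responsible for finding good prompts for each call inside it.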
Lena Shakurova - Making LLMs reliable - A practical framework | PyData London 25
www.pydata.org. Making LLMs reliable: A practical framework for production. LLM outputs are non-deterministic, making it difficult to ensure reliability in production.
Pydantic Evals. A brand new package from the Pydantic AI team which directly tackles what I consider to be the single hardest problem in AI engineering: building evals to determine if your LLM-based system is working correctly and getting better over time.
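To give a sense of the shape of the API, here is a minimal sketch based on my reading of the package's docs; the `Case`/`Dataset` names and `evaluate_sync` call may differ in your installed version, and the task function here is a trivial stand-in.

```python
from pydantic_evals import Case, Dataset

# A single test case: an input plus the output we hope the system produces.
cases = [
    Case(
        name="capital_of_france",
        inputs="What is the capital of France?",
        expected_output="Paris",
    ),
]

dataset = Dataset(cases=cases)

# The task under test: stand-in for your real LLM-backed function.
async def answer_question(question: str) -> str:
    return "Paris"

# Run every case against the task and print a summary report.
report = dataset.evaluate_sync(answer_question)
report.print()
```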
Nicholas Carlini introduced this personal LLM benchmark suite [back in February](https://nicholas.carlini.com/writing/2024/my-benchmark-for-large-language-models.html) as a collection of over 100 automated tests he runs against new LLMs to evaluate their performance against …
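This is not Carlini's actual framework (his suite uses its own dataflow-style pipeline), but the underlying pattern is easy to sketch: send a fixed prompt to a model, then check the response programmatically rather than by eye. The `run_llm` helper below is a hypothetical placeholder.

```python
import subprocess

def run_llm(prompt: str) -> str:
    """Placeholder for whatever client you use to call the model."""
    raise NotImplementedError

def test_hello_world() -> bool:
    # Ask for a program, execute it, and assert on its output.
    code = run_llm(
        'Write a Python program that prints "hello world". Reply with only the code.'
    )
    result = subprocess.run(
        ["python", "-c", code], capture_output=True, text=True, timeout=30
    )
    return "hello world" in result.stdout.lower()
```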
Ragas is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. RAG denotes a class of LLM applications that use external data to augment the LLM's context. There are existing tools and frameworks that help you build these pipelines, but evaluating them and quantifying your pipeline's performance can be hard. This is where Ragas (RAG Assessment) comes in.
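A minimal sketch of how a Ragas evaluation typically looks, based on the classic `ragas.evaluate` API; column names and metric imports have changed between Ragas versions, so treat this as illustrative rather than exact. The question, answer, and context here are invented examples.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Each row pairs a question with the generated answer and the retrieved contexts.
eval_data = Dataset.from_dict({
    "question": ["Who wrote The Hobbit?"],
    "answer": ["The Hobbit was written by J.R.R. Tolkien."],
    "contexts": [["The Hobbit is a 1937 fantasy novel by J.R.R. Tolkien."]],
})

# Scores faithfulness (is the answer grounded in the contexts?) and
# answer relevancy (does the answer address the question?).
result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
print(result)
```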