The Impact of Internal Variability on Benchmarking Deep Learning Climate Emulators
SciArena: An Open Evaluation Platform for Foundation Models in...
Reasoning models paper
Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge
Agent-as-a-Judge: Evaluate Agents with Agents
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
AutoEval Done Right: Using Synthetic Data for Model Evaluation
Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation