Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge #Meta #Large Language Models #Reasoning #Evaluation #Planning #Paper #PDF ·arxiv.org·Feb 1, 2025
Agent-as-a-Judge: Evaluate Agents with Agents #Agents #Evaluation #Meta #Paper #PDF ·arxiv.org·Dec 14, 2024
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models #Evaluation #Ranking #Large Language Models #Paper #PDF ·arxiv.org·May 3, 2024
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models #Large Language Models #Evaluation #Peer Review #Paper #PDF #Cohere ·arxiv.org·Apr 30, 2024
AutoEval Done Right: Using Synthetic Data for Model Evaluation #Evaluation #Automation #GPT-4 #Paper #PDF #Machine Learning #Synthetic Data ·arxiv.org·Mar 14, 2024
Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation #Large Language Models #RLHF #Evaluation #Paper #PDF #Cohere ·arxiv.org·Oct 27, 2023