Found 9 bookmarks

The Impact of Internal Variability on Benchmarking Deep Learning Climate Emulators

#Weather #Emulation #Deep Learning #Evaluation #Paper #PDF

·agupubs.onlinelibrary.wiley.com·Aug 27, 2025

SciArena: An Open Evaluation Platform for Foundation Models in...

#Foundation Models #Evaluation #Science #Literature Review #Opensource #AI2 #Paper #PDF

·arxiv.org·Jul 2, 2025

Reasoning models paper

#Large Language Models #Evaluation #Chain of Thought #Anthropic #Paper #PDF

·assets.anthropic.com·Apr 4, 2025

Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge

#Meta #Large Language Models #Reasoning #Evaluation #Planning #Paper #PDF

·arxiv.org·Feb 1, 2025

Agent-as-a-Judge: Evaluate Agents with Agents

#Agents #Evaluation #Meta #Paper #PDF

·arxiv.org·Dec 14, 2024

Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

#Evaluation #Ranking #Large Language Models #Paper #PDF

·arxiv.org·May 3, 2024

Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

#Large Language Models #Evaluation #Peer Review #Paper #PDF #Cohere

·arxiv.org·Apr 30, 2024

AutoEval Done Right: Using Synthetic Data for Model Evaluation

#Evaluation #Automation #GPT-4 #Paper #PDF #Machine Learning #Synthetic Data

·arxiv.org·Mar 14, 2024

Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation

#Large Language Models #RLHF #Evaluation #Paper #PDF #Cohere

·arxiv.org·Oct 27, 2023
