Large Language Models' Expert-level Global History Knowledge Benchmark (HiST-LLM) (NeurIPS poster)
Asai, A. et al. (2024). OpenScholar: Synthesizing Scientific Literature with Retrieval-Augmented LMs.
LibMoE: A Library for Comprehensive Benchmarking Mixture of Experts in Large Language Models
DafnyBench: A Benchmark for Formal Software Verification
A Careful Examination of Large Language Model Performance on Grade School Arithmetic
OpenEQA: From word models to world models
OpenEQA combines challenging open-vocabulary questions with the ability to answer in natural language. The result is a straightforward benchmark that requires a strong understanding of the environment, and one that poses a considerable challenge to current foundation models. We hope this work motivates additional research into helping AI understand and communicate about the world it sees.
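Because answers are free-form natural language, a benchmark like this cannot be scored by exact-match string comparison; OpenEQA instead has a judge model rate each predicted answer against a human reference. The following is a minimal Python sketch of such an LLM-judged scoring loop, assuming a 1-to-5 rating scale normalized to a 0-100 aggregate; the function names, prompt wording, and dataset layout are hypothetical stand-ins for illustration, not the benchmark's actual API.

```python
# Hypothetical sketch of an LLM-judged scoring loop for open-vocabulary QA.
# `agent` and `llm` are assumed to be callables: str -> str.
import json

def llm_match_score(question: str, reference: str, prediction: str, llm) -> int:
    """Ask a judge LLM to rate the candidate answer against the human
    reference on a 1-5 scale (1 = wrong, 5 = semantically equivalent)."""
    prompt = (
        "Rate how well the candidate answer matches the reference answer "
        "on a scale of 1 (wrong) to 5 (equivalent). Reply with a number only.\n"
        f"Question: {question}\nReference: {reference}\nCandidate: {prediction}"
    )
    return int(llm(prompt).strip())

def evaluate(dataset_path: str, agent, llm) -> float:
    """Average normalized judge score over a QA set, as a 0-100 percentage."""
    with open(dataset_path) as f:
        items = json.load(f)  # assumed layout: [{"question": ..., "answer": ...}, ...]
    scores = []
    for item in items:
        prediction = agent(item["question"])  # free-form natural-language answer
        rating = llm_match_score(item["question"], item["answer"], prediction, llm)
        scores.append((rating - 1) / 4)  # map the 1-5 rating onto 0-1
    return 100.0 * sum(scores) / len(scores)
```

Normalizing the 1-5 rating to a 0-1 range before averaging gives an aggregate score that reads as a percentage, which makes models with very different answer styles directly comparable.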
An In-depth Look at Gemini's Language Abilities
Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis