ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems
Scaling Laws For Scalable Oversight
(A security aspect contrasting Compton might be that tactical versions are initiated to have controlled chain reactions and then vanish, not unlike Houdini or a locked Roomba mystery, so there may be a forensic science to it. Also relate this to the prior paper on MAIM's version of MAD and to articles on quantum hacks.)
From Tools to Teammates: Evaluating LLMs in Multi-Session Coding...
Asai, A., et al. (2024). OpenScholar: Synthesizing Scientific Literature with Retrieval-Augmented LMs.
LIBMoE: A Library for comprehensive benchmarking Mixture of...
DafnyBench: A Benchmark for Formal Software Verification
A Careful Examination of Large Language Model Performance on Grade School Arithmetic
OpenEQA: From word models to world models
OpenEQA combines challenging open-vocabulary questions with the ability to answer in natural language. This results in a straightforward benchmark that demonstrates a strong understanding of the environment—and poses a considerable challenge to current foundational models. We hope this work motivates additional research into helping AI understand and communicate about the world it sees.
An In-depth Look at Gemini's Language Abilities
Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis