Test Information Space

Found 10 bookmarks

ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems

#Benchmark #AGI #Paper #PDF

·arxiv.org·May 22, 2025

Scaling Laws For Scalable Oversight

(A security aspect contrasting Compton might be that tactical versions are initiated to have controlled chain reactions and then vanish, not unlike Houdini or a locked Roomba mystery, so there may be a forensic science to it. Also relate this to the prior paper on MAIM's version of MAD and to articles on quantum hacks.)

#Safety #Paper #PDF #Benchmark

·arxiv.org·May 10, 2025

From Tools to Teammates: Evaluating LLMs in Multi-Session Coding...

#Large Language Models #Conversational AI #Benchmark #Paper #PDF

·arxiv.org·Mar 3, 2025

Asai, A. et al. (2024). OpenScholar: Synthesizing Scientific Literature with Retrieval-Augmented LMs.

#RAG #Large Language Models #Allen Institute #Paper #PDF #Literature Review #Benchmark #Search

·openscholar.allen.ai·Nov 21, 2024

LIBMoE: A Library for comprehensive benchmarking Mixture of...

#Mixture of Experts #Benchmark #Large Language Models #Paper #PDF

·arxiv.org·Nov 6, 2024

DafnyBench: A Benchmark for Formal Software Verification

#AI #Verification #Paper #PDF #Benchmark #Software Engineering #Machine Learning #Programming Languages

·arxiv.org·Jun 14, 2024

A Careful Examination of Large Language Model Performance on Grade School Arithmetic

#Large Language Models #Mathematics #Reasoning #Benchmark #Paper #PDF

·arxiv.org·May 2, 2024

OpenEQA: From word models to world models

OpenEQA combines challenging open-vocabulary questions with the ability to answer in natural language. This results in a straightforward benchmark that demonstrates a strong understanding of the environment—and poses a considerable challenge to current foundational models. We hope this work motivates additional research into helping AI understand and communicate about the world it sees.

#Meta #Questions and Answers #Benchmark #Blog #Paper #PDF

·ai.meta.com·Apr 12, 2024

An In-depth Look at Gemini's Language Abilities

#Gemini #Paper #PDF #Benchmark

·arxiv.org·Dec 21, 2023

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

#Text-to-Image #Benchmark #Preferences #Prompt Engineering #Paper #PDF

·arxiv.org·Jul 5, 2023
