Large Language Models' Expert-level Global History Knowledge Benchmark (HiST-LLM) (NeurIPS poster)
Asai, A. et al. (2024). OpenScholar: Synthesizing Scientific Literature with Retrieval-Augmented LMs.
LibMoE: A Library for Comprehensive Benchmarking Mixture of Experts in Large Language Models
DafnyBench: A Benchmark for Formal Software Verification
A Careful Examination of Large Language Model Performance on Grade School Arithmetic
OpenEQA: From word models to world models
OpenEQA combines challenging open-vocabulary questions with the ability to answer in natural language. The result is a straightforward benchmark that requires a strong understanding of the environment, and one that poses a considerable challenge to current foundation models. We hope this work motivates additional research into helping AI understand and communicate about the world it sees.
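Because answers are free-form natural language, a benchmark like this cannot be scored by exact-match string comparison; OpenEQA instead has a judge model rate each predicted answer against a human reference. The following is a minimal Python sketch of such an LLM-judged scoring loop, assuming a 1-to-5 rating scale normalized to a 0-100 aggregate; the function names, prompt wording, and dataset layout are hypothetical stand-ins for illustration, not the benchmark's actual API.

```python
# Hypothetical sketch of an LLM-judged scoring loop for open-vocabulary QA.
# `agent` and `llm` are assumed to be callables: str -> str.
import json

def llm_match_score(question: str, reference: str, prediction: str, llm) -> int:
    """Ask a judge LLM to rate the candidate answer against the human
    reference on a 1-5 scale (1 = wrong, 5 = semantically equivalent)."""
    prompt = (
        "Rate how well the candidate answer matches the reference answer "
        "on a scale of 1 (wrong) to 5 (equivalent). Reply with a number only.\n"
        f"Question: {question}\nReference: {reference}\nCandidate: {prediction}"
    )
    return int(llm(prompt).strip())

def evaluate(dataset_path: str, agent, llm) -> float:
    """Average normalized judge score over a QA set, as a 0-100 percentage."""
    with open(dataset_path) as f:
        items = json.load(f)  # assumed layout: [{"question": ..., "answer": ...}, ...]
    scores = []
    for item in items:
        prediction = agent(item["question"])  # free-form natural-language answer
        rating = llm_match_score(item["question"], item["answer"], prediction, llm)
        scores.append((rating - 1) / 4)  # map the 1-5 rating onto 0-1
    return 100.0 * sum(scores) / len(scores)
```

Normalizing the 1-5 rating to a 0-1 range before averaging gives an aggregate score that reads as a percentage, which makes models with very different answer styles directly comparable.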
An In-depth Look at Gemini's Language Abilities
Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis