From Tools to Teammates: Evaluating LLMs in Multi-Session Coding... · #Large Language Models #Conversational AI #Benchmark #Paper #PDF · arxiv.org · Mar 3, 2025
NeurIPS Poster Large Language Models' Expert-level Global History Knowledge Benchmark (HiST-LLM) · #Benchmark #Large Language Models #History #Paper · nips.cc · Jan 19, 2025
FACTS Grounding: A new benchmark for evaluating the factuality of large language models · #Benchmark #Large Language Models #Fact-checking · deepmind.google · Dec 18, 2024
Asai, A. and others. (2024). OPENSCHOLAR: SYNTHESIZING SCIENTIFIC LITERATURE WITH RETRIEVAL-AUGMENTED LMS. · #RAG #Large Language Models #Allen Institute #Paper #PDF #Literature Review #Benchmark #Search · openscholar.allen.ai · Nov 21, 2024
A Benchmark for Long-Form Medical Question Answering · #Medical #Large Language Models #Questions and Answers #Benchmark · arxiv.org · Nov 18, 2024
LIBMoE: A Library for comprehensive benchmarking Mixture of... · #Mixture of Experts #Benchmark #Large Language Models #Paper #PDF · arxiv.org · Nov 6, 2024
OpenAI o1 Results on ARC-AGI-Pub · #Benchmark #Large Language Models #OpenAI · arcprize.org · Sep 15, 2024
Wolfram LLM Benchmarking Project · #Benchmark #Large Language Models #Wolfram · wolfram.com · Jul 19, 2024
A Careful Examination of Large Language Model Performance on Grade School Arithmetic · #Large Language Models #Mathematics #Reasoning #Benchmark #Paper #PDF · arxiv.org · May 2, 2024
Announcing a Benchmark for AI Safety · #Benchmark #Large Language Models · spectrum.ieee.org · Apr 16, 2024
[Own work] VALSE 💃: Benchmark for Vision and Language Models Centered on Linguistic Phenomena · #Computer Vision #Large Language Models #Research #Benchmark · youtube.com · May 9, 2022