Search Test Information Space

Found 33 bookmarks

Artificial Analysis on X: "xAI gave us early access to Grok 4 - and the results are in. Grok 4 is now the leading AI model. We have run our full suite of benchmarks and Grok 4 achieves an Artificial Analysis Intelligence Index of 73, ahead of OpenAI o3 at 70, Google Gemini 2.5 Pro at 70, Anthropic Claude https://t.co/Vc9781SIzd" / X

#Grok #Benchmark

·x.com·Jul 10, 2025

Vaibhav (VB) Srivastav on X: "MASSIVE release from Baidu - Ernie 4.5 VLMs & LLMs, Models beat DeepSeek v3, Qwen 235B and competitive to OpenAI O1 (for VLM) - Apache 2.0 licensed 💥 https://t.co/wDsNgEz9SK" / X

(Phew! Not claiming AGI quite yet. Are we asking the right questions about any new civilization?)

#ERNIE #Baidu #Benchmark

·x.com·Jun 30, 2025

Inside the Secret Meeting Where Mathematicians Struggled to Outsmart AI

#Mathematics #OpenAI #Benchmark

·scientificamerican.com·Jun 13, 2025

ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems

#Benchmark #AGI #Paper #PDF

·arxiv.org·May 22, 2025

Introducing HealthBench | OpenAI

#Benchmark #OpenAI #Health

·openai.com·May 12, 2025

Scaling Laws For Scalable Oversight

(A security aspect contrasting Compton might be that tactical versions are initiated to have controlled chain reactions and then vanish, also not unlike Houdini, or a locked Roomba mystery, so there may be a forensic science. Also relate to prior paper on MAIM's version of MAD and articles on quantum hacks.)

#Safety #Paper #PDF #Benchmark

·arxiv.org·May 10, 2025

AMD sets new supercomputer record, runs CFD simulation over 25x faster on Instinct MI250X GPUs

#Supercomputing #Benchmark #AMD

·tomshardware.com·Apr 13, 2025

From Tools to Teammates: Evaluating LLMs in Multi-Session Coding...

#Large Language Models #Conversational AI #Benchmark #Paper #PDF

·arxiv.org·Mar 3, 2025

DARPA is Planning to Expand Quantum Benchmarking Initiative

#Quantum Computing #Benchmark #DARPA

·thequantuminsider.com·Feb 23, 2025

NeurIPS Poster Large Language Models' Expert-level Global History Knowledge Benchmark (HiST-LLM)

#Benchmark #Large Language Models #History #Paper

·nips.cc·Jan 19, 2025

OpenAI o3 Breakthrough High Score on ARC-AGI-Pub

#AGI #Benchmark #OpenAI

·arcprize.org·Dec 20, 2024

FACTS Grounding: A new benchmark for evaluating the factuality of large language models

#Benchmark #Large Language Models #Fact-checking

·deepmind.google·Dec 18, 2024

AILuminate - MLCommons

#Benchmark #Safety #Risk

·mlcommons.org·Dec 4, 2024

Asai, A. and others. (2024). OPENSCHOLAR: SYNTHESIZING SCIENTIFIC LITERATURE WITH RETRIEVAL-AUGMENTED LMS.

#RAG #Large Language Models #Allen Institute #Paper #PDF #Literature Review #Benchmark #Search

·openscholar.allen.ai·Nov 21, 2024

A Benchmark for Long-Form Medical Question Answering

#Medical #Large Language Models #Questions and Answers #Benchmark

·arxiv.org·Nov 18, 2024

LIBMoE: A Library for comprehensive benchmarking Mixture of...

#Mixture of Experts #Benchmark #Large Language Models #Paper #PDF

·arxiv.org·Nov 6, 2024

Introducing SimpleQA | OpenAI

#Benchmark #OpenAI #Questions and Answers

·openai.com·Oct 30, 2024

OpenAI o1 Results on ARC-AGI-Pub

#Benchmark #Large Language Models #OpenAI

·arcprize.org·Sep 15, 2024

Wolfram LLM Benchmarking Project

#Benchmark #Large Language Models #Wolfram

·wolfram.com·Jul 19, 2024

DafnyBench: A Benchmark for Formal Software Verification

#AI #Verification #Paper #PDF #Benchmark #Software Engineering #Machine Learning #Programming Languages

·arxiv.org·Jun 14, 2024

Nvidia Conquers Latest AI Tests

#Performance #Benchmark #Nvidia

·spectrum.ieee.org·Jun 12, 2024

A Careful Examination of Large Language Model Performance on Grade School Arithmetic

#Large Language Models #Mathematics #Reasoning #Benchmark #Paper #PDF

·arxiv.org·May 2, 2024

Department of Commerce Announces New Actions to Implement President Biden’s Executive Order on AI

#NIST #Generative AI #Government #Standard #Benchmark

·commerce.gov·Apr 30, 2024

Announcing a Benchmark for AI Safety

#Benchmark #Large Language Models

·spectrum.ieee.org·Apr 16, 2024

OpenEQA: From word models to world models

OpenEQA combines challenging open-vocabulary questions with the ability to answer in natural language. This results in a straightforward benchmark that demonstrates a strong understanding of the environment—and poses a considerable challenge to current foundational models. We hope this work motivates additional research into helping AI understand and communicate about the world it sees.

#Meta #Questions and Answers #Benchmark #Blog #Paper #PDF

·ai.meta.com·Apr 12, 2024

Benchmarking the leading AI chat experience | You.com

In February 2024, You.com conducted a benchmarking study to evaluate the performance of its AI chat experience compared to competitors. You.com partnered with an independent vendor, Invisible Technologies, where independent evaluators rated responses from eight AI models, including free and paid offerings, across five criteria using a set of 120 representative user queries.

YouPro Modes, the premium offerings from You.com, outperformed ChatGPT 4 and Perplexity Pro in overall user preference. YouPro Modes also scored higher on comprehensiveness, factual accuracy, and faithfulness to the prompt’s intent. You.com’s free Smart Mode was the top-performing free model, beating ChatGPT 3.5 and Perplexity in overall user preference as well as accuracy and clarity.

#Benchmark #You com #Chatbot

·about.you.com·Apr 12, 2024

An In-depth Look at Gemini's Language Abilities

#Gemini #Paper #PDF #Benchmark

·arxiv.org·Dec 21, 2023
