AI Testing & Benchmarking

16 bookmarks
Arcada
Distilling the human experience. The future is already here, just unevenly distributed.
·arcada.dev·
ARC Prize
ARC Prize is a $1,000,000+ nonprofit, public competition to beat and open source a solution to the ARC-AGI benchmark.
·arcprize.org·
Langfuse
Traces, evals, prompt management and metrics to debug and improve your LLM application. Integrates with Langchain, OpenAI, LlamaIndex, LiteLLM, and more.
·langfuse.com·
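To make the bookmark concrete, here is a minimal, hedged sketch of decorator-based tracing with the Langfuse Python SDK. The function and its contents are placeholders, the sketch assumes LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY are set in the environment, and the import path for `observe` differs between SDK v2 and v3.

```python
# Minimal sketch: trace a function with Langfuse's @observe decorator.
# Assumes LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY are set in the environment.
try:
    from langfuse import observe             # SDK v3 import path
except ImportError:
    from langfuse.decorators import observe  # SDK v2 import path

@observe()  # records a trace with inputs, output, and timing for this call
def answer(question: str) -> str:
    # Placeholder for a real LLM call; nested @observe-d calls show up as spans.
    return f"(stub answer to: {question})"

if __name__ == "__main__":
    print(answer("What does Langfuse record?"))
```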
Terminal-Bench
Terminal-Bench is a collection of tasks and an evaluation harness that helps agent builders quantify their agents' terminal mastery.
·tbench.ai·
LMArena
Find the best AI for you. Compare answers across top AI models, share your feedback and power our public leaderboard.
·lmarena.ai·
Alpha Arena
AI Trading Benchmark. The first benchmark designed to measure AI's investing abilities. Watch AI models trade with real capital.
·nof1.ai·
Artificial Analysis
Comparison and analysis of AI models and API hosting providers. Independent benchmarks across key performance metrics including quality, price, output speed & latency.
·artificialanalysis.ai·
METR
METR is a research nonprofit which develops and runs cutting-edge tests of the capabilities of general-purpose AI systems.
·metr.org·
Giskard
A holistic testing platform for AI models, covering quality, security, and compliance for everything from tabular ML models to LLMs, so teams can gain control over AI risks.
·giskard.ai·
LiveBench
A benchmark for LLMs designed with test set contamination and objective evaluation in mind.
·livebench.ai·
Chatbot Arena
An open-source research project developed by members from LMSYS and UC Berkeley SkyLab. Its mission is to build an open crowdsourced platform to collect human feedback and evaluate LLMs under real-world scenarios.
·chat.lmsys.org·
HELM
Holistic Evaluation of Language Models (HELM) is a living benchmark that aims to improve the transparency of language models.
·crfm.stanford.edu·
Neptune AI
ML metadata store. Manage all your model-building metadata in a single place. Track experiments, register models, integrate with any MLOps tool stack.
·neptune.ai·
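As a rough illustration of the logging workflow, a minimal sketch using the neptune client's 1.x API; the project path and logged values are placeholders, and NEPTUNE_API_TOKEN is assumed to be set in the environment.

```python
# Minimal sketch: log experiment metadata with the neptune client (1.x API).
# "my-workspace/my-project" is a placeholder; NEPTUNE_API_TOKEN is read from the env.
import neptune

run = neptune.init_run(project="my-workspace/my-project")

run["parameters"] = {"model": "demo", "lr": 3e-4}   # dict assignment creates a namespace
for step in range(5):
    run["train/loss"].append(1.0 / (step + 1))      # append() builds a metric series

run.stop()
```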
BigCode
BigCode is an open scientific collaboration working on the open and responsible development of large language models for code.
·bigcode-project.org·
Weights & Biases
WandB is a central dashboard for keeping track of your hyperparameters, system metrics, and predictions so you can compare models live and share your findings.
·wandb.ai·
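For a sense of that workflow, a minimal, hedged sketch of tracking a run with the wandb Python client; the project name and metrics are made up, and it assumes you have already run `wandb login`.

```python
# Minimal sketch: track a (fake) training run with Weights & Biases.
# Assumes `wandb login` has been run; project name and metrics are placeholders.
import random
import wandb

run = wandb.init(project="llm-eval-demo", config={"lr": 1e-4, "epochs": 3})

for epoch in range(run.config.epochs):
    # Each log call adds a point to the live charts in the W&B dashboard.
    wandb.log({"epoch": epoch, "loss": 1.0 / (epoch + 1) + 0.01 * random.random()})

run.finish()
```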
SuperGLUE Benchmark
SuperGLUE is a benchmark styled after the original GLUE benchmark, with a set of more difficult language understanding tasks, improved resources, and a new public leaderboard.
·super.gluebenchmark.com·