AI Testing & Benchmarking

9 bookmarks
METR
METR is a research nonprofit which develops and runs cutting-edge tests of the capabilities of general-purpose AI systems.
·metr.org·
Giskard
A holistic testing platform for AI models, covering quality, security, and compliance for everything from tabular ML models to LLMs.
·giskard.ai·
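Giskard ships a Python library that can automatically scan a wrapped model and dataset for issues such as robustness and performance bias. A minimal sketch, assuming a scikit-learn-style classifier and Giskard's scan API; the column names and toy data are placeholders:

```python
# Minimal sketch: scan a simple classifier with Giskard.
# Assumes `giskard`, `pandas`, and `scikit-learn` are installed; data is illustrative.
import pandas as pd
import giskard
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "age": [22, 35, 58, 44, 29, 61],
    "income": [28_000, 52_000, 90_000, 61_000, 33_000, 75_000],
    "label": [0, 1, 1, 1, 0, 1],
})
clf = LogisticRegression().fit(df[["age", "income"]], df["label"])

# Wrap the model and data so Giskard can probe them for issues.
model = giskard.Model(
    model=lambda d: clf.predict_proba(d[["age", "income"]]),
    model_type="classification",
    classification_labels=[0, 1],
)
dataset = giskard.Dataset(df, target="label")

report = giskard.scan(model, dataset)   # run the automated scan
report.to_html("giskard_scan.html")     # export the findings as a report
```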
LiveBench
A benchmark for LLMs designed to resist test-set contamination and to rely on objective evaluation.
·livebench.ai·
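LiveBench publishes its regularly refreshed questions publicly, so they can be pulled for local evaluation. A heavily hedged sketch: the dataset path "livebench/reasoning" and the split name are assumptions about how the categories are hosted on the Hugging Face Hub, so check the project page for the exact identifiers.

```python
# Sketch: fetch LiveBench questions from the Hugging Face Hub.
# "livebench/reasoning" and split="test" are assumed identifiers, not confirmed.
from datasets import load_dataset

questions = load_dataset("livebench/reasoning", split="test")
print(len(questions), questions[0].keys())
```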
Chatbot Arena
An open-source research project from LMSYS and UC Berkeley SkyLab: an open, crowdsourced platform that collects human feedback from pairwise model comparisons to evaluate LLMs under real-world scenarios.
·chat.lmsys.org·
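The leaderboard is built by turning pairwise human votes into ratings (Chatbot Arena uses a Bradley-Terry model for its published rankings). The following is an illustrative Elo-style sketch of that idea, not the project's actual pipeline; the model names and K-factor are placeholders.

```python
# Illustrative only: converting pairwise votes into Elo-style ratings,
# similar in spirit to Chatbot Arena's leaderboard (which fits a Bradley-Terry model).
from collections import defaultdict

K = 32  # update step size (placeholder)

def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_ratings(battles):
    """battles: iterable of (model_a, model_b, winner) with winner in {'a', 'b'}."""
    ratings = defaultdict(lambda: 1000.0)
    for model_a, model_b, winner in battles:
        score_a = 1.0 if winner == "a" else 0.0
        e_a = expected(ratings[model_a], ratings[model_b])
        ratings[model_a] += K * (score_a - e_a)
        ratings[model_b] += K * ((1.0 - score_a) - (1.0 - e_a))
    return dict(ratings)

print(update_ratings([
    ("model-x", "model-y", "a"),
    ("model-y", "model-z", "a"),
    ("model-x", "model-z", "a"),
]))
```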
HELM
Holistic Evaluation of Language Models (HELM) is a living benchmark that aims to improve the transparency of language models.
·crfm.stanford.edu·
Neptune AI
An ML metadata store for managing model-building metadata in one place: experiment tracking, a model registry, and integrations with the wider MLOps tool stack.
·neptune.ai·
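A minimal experiment-tracking sketch with the neptune Python client; the project name and metric values are placeholders, and the API token is read from the environment.

```python
# Minimal sketch of experiment tracking with the `neptune` Python client.
# "workspace/ai-testing" is a placeholder project; NEPTUNE_API_TOKEN must be set.
import neptune

run = neptune.init_run(project="workspace/ai-testing")

run["parameters"] = {"lr": 3e-4, "batch_size": 32, "model": "resnet18"}

for epoch in range(3):
    train_loss = 1.0 / (epoch + 1)        # stand-in for a real metric
    run["train/loss"].append(train_loss)  # appended values become a time series

run["eval/accuracy"] = 0.87               # single values are logged as fields
run.stop()
```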
BigCode
BigCode is an open scientific collaboration working on the open and responsible development of large language models for code.
·bigcode-project.org·
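BigCode's models are released on the Hugging Face Hub, so they can be tried with the standard transformers API. A minimal generation sketch, assuming the "bigcode/starcoder2-3b" checkpoint (downloading it needs sufficient disk and memory, and some BigCode checkpoints require accepting a license first):

```python
# Minimal sketch: generate code with a BigCode checkpoint via transformers.
# "bigcode/starcoder2-3b" is one of BigCode's published models on the Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```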
Weights & Biases
WandB provides a central dashboard for tracking hyperparameters, system metrics, and predictions, so you can compare models live and share your findings.
·wandb.ai·
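A minimal tracking sketch with the wandb client; the project name and logged values are placeholders, and it assumes you have authenticated with `wandb login` (or set WANDB_API_KEY).

```python
# Minimal sketch of experiment tracking with Weights & Biases.
# "ai-testing-demo" is a placeholder project name.
import wandb

run = wandb.init(project="ai-testing-demo", config={"lr": 3e-4, "batch_size": 32})

for step in range(10):
    loss = 1.0 / (step + 1)                     # stand-in for a real training metric
    wandb.log({"train/loss": loss}, step=step)  # metrics appear live on the dashboard

run.summary["best_loss"] = 0.1                  # summary values show on the run page
wandb.finish()
```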
SuperGLUE Benchmark
SuperGLUE is a benchmark styled after the original GLUE benchmark, with a set of more difficult language-understanding tasks, improved resources, and a new public leaderboard.
·super.gluebenchmark.com·
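The SuperGLUE tasks are available through the Hugging Face `datasets` library; a short sketch loading the BoolQ task (other configs include "cb", "copa", "rte", "wic", and "wsc"):

```python
# Sketch: load one SuperGLUE task (BoolQ) with Hugging Face `datasets`.
# Depending on your `datasets` version, trust_remote_code=True may be required.
from datasets import load_dataset

boolq = load_dataset("super_glue", "boolq")
print(boolq)                # train/validation/test splits
print(boolq["train"][0])    # a single question/passage/label example
```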