AI Testing & Benchmarking

16 bookmarks
Arcada
Distilling the human experience. The future is already here, just unevenly distributed.
·arcada.dev·
ARC Prize
ARC Prize is a $1,000,000+ nonprofit, public competition to beat and open source a solution to the ARC-AGI benchmark.
·arcprize.org·
Langfuse
Traces, evals, prompt management and metrics to debug and improve your LLM application. Integrates with Langchain, OpenAI, LlamaIndex, LiteLLM, and more.
·langfuse.com·
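To make the bookmark concrete, here is a minimal, hedged sketch of decorator-based tracing with the Langfuse Python SDK. The function and its contents are placeholders, the sketch assumes LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY are set in the environment, and the import path for `observe` differs between SDK v2 and v3.

```python
# Minimal sketch: trace a function with Langfuse's @observe decorator.
# Assumes LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY are set in the environment.
try:
    from langfuse import observe             # SDK v3 import path
except ImportError:
    from langfuse.decorators import observe  # SDK v2 import path

@observe()  # records a trace with inputs, output, and timing for this call
def answer(question: str) -> str:
    # Placeholder for a real LLM call; nested @observe-d calls show up as spans.
    return f"(stub answer to: {question})"

if __name__ == "__main__":
    print(answer("What does Langfuse record?"))
```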
Terminal-Bench
Terminal-Bench is a collection of tasks and an evaluation harness that helps agent builders quantify their agents' terminal mastery.
·tbench.ai·
LMArena
Find the best AI for you. Compare answers across top AI models, share your feedback and power our public leaderboard.
·lmarena.ai·
Alpha Arena
AI Trading Benchmark. The first benchmark designed to measure AI's investing abilities. Watch AI models trade with real capital.
·nof1.ai·
Artificial Analysis
Comparison and analysis of AI models and API hosting providers. Independent benchmarks across key performance metrics including quality, price, output speed & latency.
·artificialanalysis.ai·
METR
METR is a research nonprofit which develops and runs cutting-edge tests of the capabilities of general-purpose AI systems.
·metr.org·
Giskard
A holistic testing platform for AI models, covering quality, security, and compliance for everything from tabular ML models to LLMs, so teams can gain control over AI risks.
·giskard.ai·
LiveBench
A benchmark for LLMs designed with test set contamination and objective evaluation in mind.
·livebench.ai·
Chatbot Arena
An open-source research project developed by members from LMSYS and UC Berkeley SkyLab. Its mission is to build an open crowdsourced platform to collect human feedback and evaluate LLMs under real-world scenarios.
·chat.lmsys.org·
HELM
Holistic Evaluation of Language Models (HELM) is a living benchmark that aims to improve the transparency of language models.
·crfm.stanford.edu·
Neptune AI
ML metadata store. Manage all your model-building metadata in a single place. Track experiments, register models, integrate with any MLOps tool stack.
·neptune.ai·
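As a rough illustration of the logging workflow, a minimal sketch using the neptune client's 1.x API; the project path and logged values are placeholders, and NEPTUNE_API_TOKEN is assumed to be set in the environment.

```python
# Minimal sketch: log experiment metadata with the neptune client (1.x API).
# "my-workspace/my-project" is a placeholder; NEPTUNE_API_TOKEN is read from the env.
import neptune

run = neptune.init_run(project="my-workspace/my-project")

run["parameters"] = {"model": "demo", "lr": 3e-4}   # dict assignment creates a namespace
for step in range(5):
    run["train/loss"].append(1.0 / (step + 1))      # append() builds a metric series

run.stop()
```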
BigCode
BigCode is an open scientific collaboration working on the open and responsible development of large language models for code.
·bigcode-project.org·
Weights & Biases
WandB is a central dashboard for keeping track of your hyperparameters, system metrics, and predictions so you can compare models live and share your findings.
·wandb.ai·
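For a sense of that workflow, a minimal, hedged sketch of tracking a run with the wandb Python client; the project name and metrics are made up, and it assumes you have already run `wandb login`.

```python
# Minimal sketch: track a (fake) training run with Weights & Biases.
# Assumes `wandb login` has been run; project name and metrics are placeholders.
import random
import wandb

run = wandb.init(project="llm-eval-demo", config={"lr": 1e-4, "epochs": 3})

for epoch in range(run.config.epochs):
    # Each log call adds a point to the live charts in the W&B dashboard.
    wandb.log({"epoch": epoch, "loss": 1.0 / (epoch + 1) + 0.01 * random.random()})

run.finish()
```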
SuperGLUE Benchmark
SuperGLUE is a benchmark styled after the original GLUE benchmark, with a set of more difficult language understanding tasks, improved resources, and a new public leaderboard.
·super.gluebenchmark.com·