Traces, evals, prompt management and metrics to debug and improve your LLM application. Integrates with LangChain, OpenAI, LlamaIndex, LiteLLM, and more.
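This description matches the Langfuse SDK; assuming that, the sketch below shows its documented drop-in OpenAI client, which records a trace for each call. The model name and prompt are illustrative, and the LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and OPENAI_API_KEY environment variables must be set beforehand.

```python
# Minimal tracing sketch, assuming the Langfuse Python SDK is installed.
# The drop-in client mirrors the OpenAI SDK and logs each call as a trace.
from langfuse.openai import openai  # drop-in replacement for the OpenAI SDK

completion = openai.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": "Summarize what tracing gives you."}],
)
print(completion.choices[0].message.content)
```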
Comparison and analysis of AI models and API hosting providers. Independent benchmarks across key performance metrics including quality, price, output speed & latency.
A testing platform for AI models. Gain control over AI risks with holistic Quality, Security & Compliance testing for all AI models, from tabular ML models to LLMs.
An open-source research project developed by members of LMSYS and UC Berkeley SkyLab, with the mission of building an open, crowdsourced platform to collect human feedback and evaluate LLMs under real-world scenarios.
BigCode is an open scientific collaboration working on the open and responsible development of large language models for code.
WandB is a central dashboard to keep track of your hyperparameters, system metrics, and predictions so you can compare models live and share your findings.
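A minimal sketch of logging a run to the W&B dashboard with the wandb Python client; the project name, hyperparameters, and metric values here are illustrative placeholders, not real results.

```python
import wandb

# Start a run and record its hyperparameters (shown in the dashboard config panel).
run = wandb.init(project="llm-eval-demo", config={"lr": 1e-4, "batch_size": 32})

for step in range(10):
    # In practice these values would come from your training or eval loop.
    wandb.log({"loss": 1.0 / (step + 1), "accuracy": 0.5 + 0.05 * step}, step=step)

run.finish()
```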
SuperGLUE is a new benchmark styled after the original GLUE benchmark, with a set of more difficult language understanding tasks, improved resources, and a new public leaderboard.
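As a hedged sketch, the SuperGLUE tasks can be loaded through the Hugging Face datasets library; the task choice (BoolQ) and field names below follow that hub dataset, and depending on your datasets version the script-based loader may require trust_remote_code=True.

```python
from datasets import load_dataset

# Load one SuperGLUE task (BoolQ) from the Hugging Face Hub; other task
# configs include "cb", "copa", "rte", "wic", and "wsc".
boolq = load_dataset("super_glue", "boolq", split="validation")

example = boolq[0]
print(example["question"], example["passage"][:80], example["label"])
```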