Found 20 bookmarks

LLM SVG Generation Benchmark

#benchmark #visualization #vision #art #cartoon

·gally.net·Nov 25, 2025

LLM SVG Generation Benchmark

Here's a delightful project by Tom Gally, inspired by my pelican SVG benchmark. He asked Claude to help create more prompts of the form Generate an SVG of [A] [doing] …

#benchmark #art #visualization #prompt

·simonwillison.net·Nov 25, 2025

Top 10 Open-Source LLMs (Nov 2025): Llama 4, Qwen 3 and DeepSeek R1

A Blog post by Daya Shankar on Hugging Face

#benchmark

·huggingface.co·Nov 16, 2025

GitHub - T3-Content/SnitchBench

Contribute to T3-Content/SnitchBench development by creating an account on GitHub.

#benchmark #security

·github.com·Jul 30, 2025

Evaluating Chunking Strategies for Retrieval | Chroma Research

#RAG #embedding #search #benchmark

·research.trychroma.com·Jul 9, 2025

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

We interact with computers on an everyday basis, be it in everyday life or work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks...

#benchmark

·arxiv.org·Jul 1, 2025

AbsenceBench: Language Models Can’t Tell What’s Missing

Here's another interesting result to file under the

#benchmark

·simonwillison.net·Jun 21, 2025

LLM-as-a-Judge: A Practical Guide

How to Scale LLM Evaluations Beyond Manual Review

#testing #benchmark #agent

·towardsdatascience.com·Jun 20, 2025

Improving Automatic Evaluation of Large Language Models (LLMs) in Biomedical Relation Extraction via LLMs-as-the-Judge

#testing #benchmark

·arxiv.org·Jun 20, 2025

Do LLM Evaluators Prefer Themselves for a Reason?

#benchmark

·arxiv.org·Jun 20, 2025

Apple’s M3 Ultra Mac Studio Misses the Mark for LLM Inference | by Bi…

archived 14 Apr 2025 13:16:31 UTC

#hardware #local model #benchmark

·archive.ph·May 29, 2025

14-Minute Wait?! $10K Mac Studio Crawls with DeepSeek 671B + llama.cpp

We took a closer look at how the top-tier M3 Ultra fares when running the colossal DeepSeek V3 671B parameter model using the popular llama.cpp inference engine. The results paint a picture of…

#mac #hardware #benchmark #local model

·hardware-corner.net·May 29, 2025

Reports of LLMs mastering math have been greatly exaggerated

What happens when you minimize the chance of data leakage?

#benchmark

·garymarcus.substack.com·Apr 6, 2025

Therapeutics Data Commons

Artificial intelligence foundation for therapeutic science

#science #data #benchmark

·tdcommons.ai·Apr 2, 2025

DeepSeek-R1 vs Claude 3.5 Sonnet (new) - Detailed Performance & Feature Comparison

Discover how DeepSeek's DeepSeek-R1 and Anthropic's Claude 3.5 Sonnet (new) stack up in performance, features, and applications. Read our detailed comparison to find out which AI model best suits your needs.

#benchmark

·docsbot.ai·Mar 3, 2025

R1+Sonnet set SOTA on aider’s polyglot benchmark

R1+Sonnet has set a new SOTA on the aider polyglot benchmark. At 14X less cost compared to o1.

#agent #code #benchmark

·aider.chat·Mar 3, 2025

Aider LLM Leaderboards

Quantitative benchmarks of LLM code editing skill.

#benchmark

·aider.chat·Mar 3, 2025

Wolfram LLM Benchmarking Project

Results from Wolfram's ongoing tracking of LLM performance. The benchmark is based on a Wolfram Language code generation task.

#benchmark

·wolfram.com·Mar 3, 2025

Kagi LLM Benchmarking Project | Kagi's Docs

Kagi Search Help

#benchmark

·help.kagi.com·Mar 3, 2025

vectara/hallucination-leaderboard: Leaderboard Comparing LLM Performance at Producing Hallucinations when Summarizing Short Documents

Leaderboard comparing LLM performance at producing hallucinations when summarizing short documents.

#benchmark #safety

·github.com·Nov 19, 2023
