I'm still on the hunt for good options for running evaluations against prompts. ChainForge offers an interesting approach, calling itself "an open-source visual programming environment for prompt engineering". The interface …
Nicholas Carlini introduced this personal LLM benchmark suite [back in February](https://nicholas.carlini.com/writing/2024/my-benchmark-for-large-language-models.html) as a collection of over 100 automated tests he runs against new LLMs to evaluate their performance against …
MIT licensed document extraction Python library from the Deep Search team at IBM, who released [Docling v2](https://ds4sd.github.io/docling/v2/#changes-in-docling-v2) on October 16th. Here's the [Docling Technical Report](https://arxiv.org/abs/2408.09869) paper from August, which provides …
Creating a LLM-as-a-Judge that drives business results
Hamel Husain's sequel to [Your AI product needs evals](https://hamel.dev/blog/posts/evals/). This is _packed_ with hard-won actionable advice. Hamel warns against using scores on a 1-5 scale, instead promoting an alternative he …
Find out how it all started and connect with our vision. Beginnings: the idea for OmniBridge was seeded in 2014 when co-founder Adam Munder, a profoundly deaf software engineer at Intel, began developing a system to track information informally passed between fellow engineers. A few years later, Adam began working with a small team of […]
In TED Talk, Deaf engineer debuts AI model that transcribes sign language to text in seconds
Adam Munder is a software engineer. Since 2015, he’s been working to bridge the gap between sign language and spoken word. Now, a decade later, he brought it to the TED stage.
Run a prompt to generate and execute jq programs using llm-jq
llm-jq is a brand new plugin for LLM which lets you pipe JSON directly into the llm jq command along with a human-language description of how you’d like to manipulate …
Introducing the analysis tool in Claude.ai \ Anthropic
We’re introducing a new built-in feature for Claude.ai, the analysis tool, that enables Claude to write and run code. With the analysis tool, Claude can process data, conduct analysis, and produce real-time insights.
Last week I was helping a friend of mine get one of his new apps off the ground. I can’t speak much about it at the moment, other than that, like most apps nowadays, it has some AI sprinkled over …
Initial explorations of Anthropic’s new Computer Use capability
Two big announcements from Anthropic today: a new Claude 3.5 Sonnet model and a new API mode that they are calling computer use. (They also pre-announced 3.5 Haiku, but that’s …
I now use Claude every day, multiple times a day, both in my work and personal life. This is a relatively new phenomenon: I basically never used Claude until 3.5 Sonnet came out. I had tried Claude before that, mostly out of a sense of duty, but I hadn't found him [1] particularly helpful. But 3.5 was a tipping point where Claude finally became smart enough to be worth the trouble of using. So what do I use Claude for?
New release of my [files-to-prompt tool](https://simonwillison.net/2024/Apr/8/files-to-prompt/) adding an option for filtering just for files with a specific extension. The following command will output Claude XML-style markup for all Python and …
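To make the idea of extension filtering plus XML-style output concrete, here is a minimal Python sketch. It is not the files-to-prompt implementation: the function name `files_to_cxml` is hypothetical, the example extensions are arbitrary, and the `<documents>`/`<document>` tag layout is an assumption modeled on the document-wrapping convention commonly used in Claude prompts.

```python
from pathlib import Path


def files_to_cxml(root, extensions):
    """Collect files under `root` with one of the given extensions and wrap
    them in XML-style markup. The tag layout here is an assumption, not the
    tool's confirmed output format."""
    # Normalize "py" and ".py" to the same form: ".py"
    wanted = {"." + ext.lstrip(".") for ext in extensions}

    # Walk the tree recursively and keep only regular files that match
    paths = sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix in wanted
    )

    parts = ["<documents>"]
    for index, path in enumerate(paths, start=1):
        parts.append(f'<document index="{index}">')
        parts.append(f"<source>{path.as_posix()}</source>")
        parts.append("<document_content>")
        parts.append(path.read_text(encoding="utf-8"))
        parts.append("</document_content>")
        parts.append("</document>")
    parts.append("</documents>")
    return "\n".join(parts)
```

Calling `print(files_to_cxml(".", ["py", "md"]))` would emit one `<document>` block per matching file under the current directory, which can then be pasted or piped into a prompt. File contents are inserted verbatim, so this sketch does no escaping.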
Apple Intelligence makes a lot of sense when you get out of the AI bubble. Plus, the cool technical details Apple shared about their language models "thinking different."
Super neat demo by David Winterbottom, who wrapped my [LLM](https://llm.datasette.io/) and [files-to-prompt](https://github.com/simonw/files-to-prompt) tools in [a short Bash script](https://gist.github.com/codeinthehole/d12af317a76b43423b111fd6d508c4fc) that can be fed a file full of Python unit tests and …
GSM-Symbolic: Understanding the Limitations of Mathematical...
Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the...