AI/ML

1847 bookmarks
Abacus.AI - CodeLLM
AI-powered code editor that helps you write, review, and refactor code faster.
·codellm.abacus.ai·
CrewAI
·crewai.com·
DrSadiqfareed/Full-Page-Handwriting-Recognition: An implementation of a full-page handwriting recognition system using convolutional neural networks and transformers. This project tackles the complex task of recognizing handwritten text without segmentation.
An implementation of a full-page handwriting recognition system using convolutional neural networks and transformers. This project tackles the complex task of recognizing handwritten text without segmentation.
·github.com·
Solved with Windsurf
🚀 Discover how I built a powerful Ollama Model Manager in Rust (with zero Rust experience!) using Windsurf AI. See how this tool helps you track and manage ...
·youtube.com·
Windsurf Editor by Codeium
Tomorrow's editor, today. Windsurf Editor is the first AI agent-powered IDE that keeps developers in the flow. Available today on Mac, Windows, and Linux.
·codeium.com·
Why LLMs still have problems with OCR | Hacker News
A lot of problems jump out at me with this article, particularly the explanation of multi-modal LLMs. I'll say that I _do_ agree with the thrust of the article: don't trust LLMs. But they probably should have argued legitimate issues with VLM-based OCR, rather than try to claim that VLMs are somehow fundamentally flawed.

> LLMs process images through high-dimensional embeddings, essentially creating abstract representations that prioritize semantic understanding over precise character recognition.

This isn't true. CLIP and its derivatives don't prioritize semantic understanding. They are trained contrastively, which (very roughly speaking) means they need to be able to differentiate similar images. If two images are just white with a few words, the only way to differentiate them is to include the text in the embedding. Pretrained CLIP models do tend to be a bit lossy in this department, but not by as much as you would think, considering they boil an entire image down to something on the order of 768 floats.

> Each step in this pipeline optimizes for semantic meaning while discarding precise visual information.

Again, that doesn't make any sense. It's a bit foolhardy to even say _what_ the models do, given that not even the most brilliant ML researchers know. But as a broad _hypothesis_, the CLIP pipeline is optimizing for pairing images with captions amongst a large number of possibilities. Which, again, requires the models to surface all kinds of information from the image, and oftentimes requires surfacing specific text from the image. How else would it differentiate PowerPoint slides? Math problems in images? Etc.

> Fixed patch sizes may split individual characters

This doesn't matter; we know from empirical evidence. But even if it _did_, there are plenty of vision models that use overlapping patches.

> Position embeddings lose fine-grained spatial relationships

This isn't true. The model is fully aware of the position of pixels within patches, and the position embedding merely tells it the position of the patches themselves within the image. Therefore it can derive the absolute position of every pixel, if it needs to. In fact, we have proof they can and do.

> losing the ability to have human-in-the-loop evaluations, confidence scores, and bounding box outputs.

You get confidence scores for free, because the model is explicitly trained to provide cosine similarity scores. OWLv2 is a CLIP-based open-vocabulary bounding box model (from Google, makers of Gemini). It's finetuned from a standard, pretrained CLIP model; nothing really special about the vision architecture, just that it gets finetuned to output bounding boxes. And it beats the pants off YOLO while being open-vocabulary to boot. So not only are CLIP-like models capable of outputting bounding boxes, but OWLv2 was trained with human-in-the-loop processes and outputs confidence scores. Oh, and there's Florence, which is a VLM trained on bounding boxes.

> Favor common words over exact transcription

Nothing about LLMs indicates that. In fact, pretrained LLMs favor exact transcription.

> "Correct" perceived errors in the source document

Which OCR systems need to do to be useful for many applications. I get the argument that LLMs are a black box in this regard, which is a legitimate criticism, but correcting mistakes is not fundamentally the issue. It's better to say that LLMs _blindly_ correct issues. Whereas, perhaps, one could say a traditional OCR system can report "this is my exact transcription, I corrected it to this" and have various knobs to tweak thresholds. But there's no reason VLMs can't do that too.

> Merge or reorder information based on learned patterns

LLMs are perfectly capable of regurgitating data verbatim. That's perhaps the first thing they learn to do to get loss down, and it's what all long-context models are benchmarked against.

> Produce different outputs for the same input due to sampling

You can turn off sampling, and then they are deterministic. Or you can output the logits to the user, which effectively gives you confidence scores on the transcription. And a well-trained LLM for this task isn't really "probabilistic" in the sense that its outputs are completely different each time. If it's trained and prompted specifically to transcribe a document, that's what it's going to do. Any variations in output at that point are a result of real vagaries in the document, the vision, or the user request. If a user wants consistency, they merely need to ask for it, or the VLM needs to be trained better. In either case, these models are _capable_ of it. It's most important to note here that, outside of pretrained LLMs, all LLMs that users interact with are reinforcement trained. So while they were trained on next-token prediction during _pretraining_, they get trained to seek reward in production. That vastly trims the logits and focuses the model explicitly on performing tasks. Well trained, produc…
·news.ycombinator.com·
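The determinism and confidence-score points in the comment above are easy to demonstrate with standard tooling. A minimal sketch, assuming the Hugging Face transformers library; the model name and prompt are placeholders, not anything from the linked thread. Greedy decoding (do_sample=False) makes the output repeatable, and the generation scores can be turned into per-token probabilities that serve as a rough confidence signal:

    # Sketch: deterministic decoding plus per-token "confidence" from logits.
    # Model name and prompt are illustrative placeholders.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # any small causal LM works here
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    prompt = "Transcribe exactly: 'Invoice #4021, total $1,337.00'"
    inputs = tokenizer(prompt, return_tensors="pt")

    # do_sample=False -> greedy decoding, so repeated calls give identical output.
    out = model.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=False,
        output_scores=True,
        return_dict_in_generate=True,
    )

    # Convert the raw generation scores into per-token log-probabilities;
    # exponentiating gives a rough confidence value for each generated token.
    transition_scores = model.compute_transition_scores(
        out.sequences, out.scores, normalize_logits=True
    )
    gen_tokens = out.sequences[0, inputs["input_ids"].shape[1]:]
    for tok, score in zip(gen_tokens, transition_scores[0]):
        print(f"{tokenizer.decode(tok)!r}: p={torch.exp(score).item():.3f}")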
DeepSeek R1 With Ollama
This post explores the use of Ollama, a framework for running large language models locally, in conjunction with pre-trained models such as DeepSeek R1.
·daehnhardt.com·
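As a quick companion to the post, a minimal sketch of talking to a locally pulled DeepSeek R1 model through the ollama Python client; the deepseek-r1:7b tag and the prompt are assumptions, and any pulled tag works the same way:

    # Sketch: chatting with a locally running DeepSeek R1 model via Ollama.
    # Assumes `ollama serve` is running and the model has been pulled,
    # e.g. with `ollama pull deepseek-r1:7b` (tag chosen for illustration).
    import ollama

    response = ollama.chat(
        model="deepseek-r1:7b",
        messages=[
            {"role": "user", "content": "Explain test-time reasoning in two sentences."},
        ],
    )

    # R1-style models emit their chain of thought in <think> tags before the answer.
    print(response["message"]["content"])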
You HAVE to Try Agentic RAG with DeepSeek R1 (Insane Results)
DeepSeek R1 - the latest and greatest open-source reasoning LLM - has taken the world by storm, and a lot of content creators are doing a great job covering its implications and strengths/weaknesses. What I haven't seen much of, though, is actually using R1 in agentic workflows to truly leverage its power. So that's what I'm showing you in this video - we'll be using the power of R1 to make a simple but super effective agentic RAG setup. We'll be using Smolagents by HuggingFace to create our agent - it's the simplest agent framework out there, and many of you have been asking me to try it out.

This agentic RAG setup centers around the idea that reasoning LLMs like R1 are extremely powerful but quite slow. Because of this, a lot of people are starting to experiment with combining the raw power of a model like R1 with a more lightweight and fast LLM to drive the primary conversation/agent flow. Think of it as basically giving the agent R1 as a tool to use when it needs more reasoning power, at the cost of a slower response (and higher costs). That's what we'll be doing here - creating an agent that has an R1-driven RAG tool to extract in-depth insights from a knowledgebase.

The example in this video is meant to be an introduction to this kind of reasoning agentic flow. That's why I keep it simple with Smolagents and a local knowledgebase. But I'm planning on expanding this much further soon with a much more robust but still similar flow built with Pydantic AI and LangGraph!

The Community Voting period of the oTTomator Hackathon is open! Head on over to the Live Agent Studio now, test out the submissions, and vote for your favorite agents. There are so many incredible projects to try out! https://studio.ottomator.ai

All the code covered in this video + instructions to run it can be found here: https://github.com/coleam00/ottomator-agents/tree/main/r1-distill-rag
SmolAgents: https://huggingface.co/docs/smolagents/en/index
R1 on Ollama: https://ollama.com/library/deepseek-r1

00:00 - Why R1 for Agentic RAG?
01:56 - Overview of our Agent
03:33 - SmolAgents - Our Ticket to Fast Agents
06:07 - Building our Agentic RAG Agent with R1
14:17 - Creating our Local Knowledgebase w/ Chroma DB
15:45 - Getting our Local LLMs Set Up with Ollama
19:15 - R1 Agentic RAG Demo
21:42 - Outro

Join me as I push the limits of what is possible with AI. I'll be uploading videos at least two times a week - Sundays and Wednesdays at 7:00 PM CDT!
·youtube.com·
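The fast-driver-plus-R1-tool pattern described above can be sketched with Smolagents. This is not the code from the linked repo; it is a minimal illustration in which the model tags, the Chroma collection name, and the embedding defaults are all assumptions:

    # Sketch of the "R1 as a reasoning tool" pattern: a lightweight model drives the
    # agent loop and calls a slower DeepSeek R1 tool when deep reasoning is needed.
    # Model tags, the collection name, and paths are illustrative, not from the repo.
    import chromadb
    import ollama
    from smolagents import CodeAgent, LiteLLMModel, tool

    chroma = chromadb.PersistentClient(path="./knowledgebase")
    # Assumes documents were already added to this (hypothetical) collection.
    collection = chroma.get_or_create_collection("docs")


    @tool
    def reason_over_docs(question: str) -> str:
        """Retrieve relevant chunks from the local knowledgebase and have DeepSeek R1
        reason over them.

        Args:
            question: The question that needs slow, careful reasoning over the docs.
        """
        results = collection.query(query_texts=[question], n_results=4)
        context = "\n\n".join(results["documents"][0])
        answer = ollama.chat(
            model="deepseek-r1:14b",  # the slow, strong reasoner
            messages=[{"role": "user",
                       "content": f"Context:\n{context}\n\nQuestion: {question}"}],
        )
        return answer["message"]["content"]


    # A small, fast model drives the conversation and decides when to call the tool.
    driver = LiteLLMModel(model_id="ollama_chat/qwen2.5:7b",
                          api_base="http://localhost:11434")
    agent = CodeAgent(tools=[reason_over_docs], model=driver)

    print(agent.run("What does the knowledgebase say about refund policies?"))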
S1: The $6 R1 Competitor?
Tim Kellogg shares his notes on a new paper, [s1: Simple test-time scaling](https://arxiv.org/abs/2501.19393), which describes an inference-scaling model fine-tuned on top of Qwen2.5-32B-Instruct for just $6 - the cost for …
·simonwillison.net·
Run DeepSeek-R1 Dynamic 1.58-bit
DeepSeek-R1 is the most powerful open-source reasoning model, performing on par with OpenAI's o1 model. Run the 1.58-bit Dynamic GGUF version by Unsloth.
·unsloth.ai·
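For reference, a minimal sketch of loading such a quantized GGUF through the llama-cpp-python bindings; the file name, context size, and GPU-offload numbers are placeholders, and the real 1.58-bit quant is split across several shards and needs substantial memory:

    # Sketch: loading a dynamic-quantized DeepSeek-R1 GGUF with llama-cpp-python.
    # The model path and parameters are illustrative; point model_path at the first
    # shard of the downloaded split GGUF and llama.cpp should pick up the rest.
    from llama_cpp import Llama

    llm = Llama(
        model_path="DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",  # hypothetical local path
        n_ctx=4096,        # context window; raise it if you have the memory
        n_gpu_layers=20,   # offload some layers to GPU, 0 for CPU-only
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user",
                   "content": "Briefly, why is 1.58-bit quantization interesting?"}],
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])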