AI/ML

2260 bookmarks

comfyanonymous/ComfyUI: The most powerful and modular diffusion model GUI, api and backend with a graph/nodes interface.

The most powerful and modular diffusion model GUI, api and backend with a graph/nodes interface. - comfyanonymous/ComfyUI

#IDE #low code

·github.com·Feb 17, 2025

DeepRAG: Thinking to Retrieval Step by Step for Large Language Models

·arxiv.org·Feb 17, 2025

Understanding Reasoning LLMs

Methods and Strategies for Building and Refining Reasoning Models

#learn

·magazine.sebastianraschka.com·Feb 17, 2025

jina-ai/node-DeepResearch: Keep searching, reading webpages, reasoning until it finds the answer (or exceeding the token budget)

Keep searching, reading webpages, reasoning until it finds the answer (or exceeding the token budget) - jina-ai/node-DeepResearch

#search #agent #local model

·github.com·Feb 15, 2025

DrSadiqfareed/Full-Page-Handwriting-Recognition: An implementation of a full-page handwriting recognition system using convolutional neural networks and transformers. This project tackles the complex task of recognizing handwritten text without segmentation.

An implementation of a full-page handwriting recognition system using convolutional neural networks and transformers. This project tackles the complex task of recognizing handwritten text without s...

#OCR #vision #image #code

·github.com·Feb 15, 2025

Handwritten Digit Recognition with TensorFlow and OpenCV

In this blog post, we will explore the fascinating world of handwritten digit recognition using TensorFlow and OpenCV. Handwritten digit…

#vision #OCR

·medium.com·Feb 15, 2025

Hello from Transformer Lab | Transformer Lab

Documentation for LLM Toolkit, Transformer Lab

#model training #fine tuning #transformers #local model

·transformerlab.ai·Feb 14, 2025

Solved with Windsurf

🚀 Discover how I built a powerful Ollama Model Manager in Rust (with zero Rust experience!) using Windsurf AI. See how this tool helps you track and manage ...

·youtube.com·Feb 14, 2025

Windsurf Editor by Codeium

Tomorrow's editor, today. Windsurf Editor is the first AI agent-powered IDE that keeps developers in the flow. Available today on Mac, Windows, and Linux.

#IDE #code #devops

·codeium.com·Feb 14, 2025

Guide to Optical Character Recognition (OCR) in 2025

Optical Character Recognition helps perceive the characters of a text within the images like printed books, photos, or documents. Explore top 17 OCR vendors.

#OCR #image #vision

·research.aimultiple.com·Feb 9, 2025

junhoyeo/BetterOCR: 🔍 Better text detection by combining multiple OCR engines (EasyOCR, Tesseract, and Pororo) with 🧠 LLM.

🔍 Better text detection by combining multiple OCR engines (EasyOCR, Tesseract, and Pororo) with 🧠 LLM. - junhoyeo/BetterOCR

#OCR #image #vision

·github.com·Feb 9, 2025

plastic-plant/florence-2: Let's play with Florence-2 vision model.

Let's play with Florence-2 vision model. Contribute to plastic-plant/florence-2 development by creating an account on GitHub.

#image #model training #vision #OCR

·github.com·Feb 9, 2025

Why LLMs still have problems with OCR | Hacker News

A lot of problems jump out to me with this article, particularly with the explanation of multi-modal LLMs. I'll say that I _do_ agree with the thrust of the article. Don't trust LLMs. But they probably should have argued legitimate issues with VLM based OCR, rather than try to talk about how VLMs are somehow fundamentally flawed or something.

> LLMs process images through high-dimensional embeddings, essentially creating abstract representations that prioritize semantic understanding over precise character recognition.

This isn't true. CLIP and its derivatives don't prioritize semantic understanding. They are trained contrastively, which (very roughly speaking) means they need to be able to differentiate similar images. If two images are just white with a few words, the only way to differentiate them is to include the text in the embedding.

Pretrained CLIP models do tend to be a bit lossy in this department, but not by as much as you would think considering they boil an entire image down to something on the order of 768 floats.

> Each step in this pipeline optimizes for semantic meaning while discarding precise visual information.

Again, that ... doesn't make any sense. It's a bit foolhardy to even say _what_ the models do, given that not even the most brilliant ML researchers know. But in broad _hypothesis_, the CLIP pipeline is optimizing being able to pair images with captions amongst a large number of possibilities. Which, again, requires them to surface all kinds of information from the image, and often times requires surfacing specific text from the image. How else would it differentiate powerpoint slides? Math problems in images? Etc.

> Fixed patch sizes may split individual characters

This doesn't matter. We know from empirical evidence. But even if it _did_, there's plenty of vision models that use overlapping patches.

> Position embeddings lose fine-grained spatial relationships

This isn't true. The model is fully aware of the position of pixels within patches, and the position embedding is merely to tell it the position of the patches themselves within the image. Therefore it can derive the absolute position of every pixel, if it needs to. In fact, we have proof they can and do.

> losing the ability to have human-in-the-loop evaluations, confidence scores, and bounding box outputs.

You get confidence scores for free because the model is explicitly trained to provide cosine similarity scores.

OWLv2 is a CLIP based open vocabulary bounding box model (from Google, makers of Gemini). It's finetuned from a standard, pretrained CLIP model. Nothing really special about the vision architecture; just that it gets finetuned to output bounding boxes. And it beats the pants off YOLO while being open vocabulary to boot. So not only are CLIP-like models capable of outputting bounding boxes, but OWLv2 was trained with human-in-the-loop processes and outputs confidence scores.

Oh and there's Florence, which is a VLM trained on bounding boxes.

> Favor common words over exact transcription

Nothing about LLMs indicates that. In fact, pretrained LLMs favor exact transcription.

> "Correct" perceived errors in the source document

Which OCR systems need to do to be useful for many applications. I get the argument that LLMs are a blackbox in this regard, which is a legitimate criticism, but correcting mistakes is not fundamentally the issue. It's better to say that LLMs _blindly_ correct issues. Whereas, perhaps, one could say a traditional OCR system can report "this is my exact transcription, I corrected it to this" and have various knobs to tweak thresholds. But there's no reason VLMs can't do that too.

> Merge or reorder information based on learned patterns

LLMs are perfectly capable of regurgitating data verbatim. That's perhaps the first thing they learn to do to get loss down. That's what all long context models are benchmarked against.

> Produce different outputs for the same input due to sampling

You can turn off sampling, and then they are deterministic. Or you can output the logits to the user, which gives you effectively confidence scores on its transcription.

And a well trained LLM for this task isn't really "probabilistic" in the sense that its outputs are completely different each time. If it's trained and prompted specifically to transcribe a document, that's what it's going to do. Any variations in output at that point are a result of real vagaries either in the document, vision, or the user request.

If a user wants consistency, they merely need to ask for it. Or the VLM needs to be trained better. In either case, these models are _capable_ of it.

It's most important to note here that, outside of pretrained LLMs, all LLMs that users interact with are Reinforcement trained. So while they were next token prediction trained during _pretraining_, they get trained to seek reward in production. That vastly trims the logits and focuses the model explicitly on performing tasks. Well trained, produc

#OCR #image #model training #vision #fine tuning

·news.ycombinator.com·Feb 9, 2025
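
The comment's point about turning off sampling and reading the logits can be illustrated with a toy decoder. This is a minimal sketch rather than any particular model's API: the three-token vocabulary, the hand-written `STEP_LOGITS`, and the helper names `greedy_decode` and `sampled_decode` are all assumptions made up for illustration.

```python
import math
import random


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(step_logits, vocab):
    """Sampling off: take the argmax token at every step.

    Returns the tokens plus the probability of each chosen token,
    which can serve as a rough per-token confidence score.
    """
    tokens, confidences = [], []
    for logits in step_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(vocab[best])
        confidences.append(probs[best])
    return tokens, confidences


def sampled_decode(step_logits, vocab, rng):
    """Sampling on: draw each token from the predicted distribution."""
    return [
        rng.choices(vocab, weights=softmax(logits), k=1)[0]
        for logits in step_logits
    ]


# Hypothetical logits for two character positions in a transcription.
VOCAB = ["0", "O", "8"]
STEP_LOGITS = [
    [4.0, 0.5, 0.1],  # the model is fairly sure this glyph is "0"
    [1.0, 0.9, 0.2],  # "0" and "O" are nearly tied here
]

tokens, confidences = greedy_decode(STEP_LOGITS, VOCAB)
print(tokens, [round(c, 2) for c in confidences])
print(sampled_decode(STEP_LOGITS, VOCAB, random.Random()))
```

Greedy decoding returns the same tokens on every call, while the sampled variant can flip the ambiguous second character between runs. The low probability on that second token is the kind of signal a pipeline could use to flag a character for human review.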

Pulse AI Blog - Why LLMs Suck at OCR

#OCR #model training #vision #image

·runpulse.com·Feb 9, 2025

Train your own R1 reasoning model locally (GRPO)

You can now reproduce your own DeepSeek-R1 reasoning model with Unsloth 100% locally. Using GRPO. Open-source, free and beginner friendly.

#model training #local model #fine tuning

·unsloth.ai·Feb 9, 2025

The DeepSeek Series: A Technical Overview

An overview of the papers describing the evolution of DeepSeek

#fine tuning #learn

·martinfowler.com·Feb 9, 2025

Transformer - Spreadsheet

Make your own AI by hand ✍️ exercises

#learn

·byhand.ai·Feb 9, 2025

DeepSeek R1 With Ollama

This post explores the use of Ollama, a state-of-the-art language modelling framework, in conjunction with pre-trained models such as DeepSeek R1.

#local model

·daehnhardt.com·Feb 7, 2025

You HAVE to Try Agentic RAG with DeepSeek R1 (Insane Results)

DeepSeek R1 - the latest and greatest open source reasoning LLM - has taken the world by storm and a lot of content creators are doing a great job covering its implications and strengths/weaknesses. What I haven’t seen a lot of though is actually using R1 in agentic workflows to truly leverage its power. So that’s what I’m showing you in this video - we’ll be using the power of R1 to make a simple but super effective agentic RAG setup.

We’ll be using Smolagents by HuggingFace to create our agent - it’s the simplest agent framework out there and many of you have been asking me to try it out.

This agentic RAG setup centers around the idea that reasoning LLMs like R1 are extremely powerful but quite slow. Because of this, a lot of people are starting to experiment with combining the raw power of a model like R1 with a more lightweight and fast LLM to drive the primary conversation/agent flow. Think of basically giving R1 as a tool for an agent to use when it needs more reasoning power at the cost of a slower response (and higher costs). That’s what we’ll be doing here - creating an agent that has an R1 driven RAG tool to extract in depth insights from a knowledgebase.

The example in this video is meant to be an introduction to these kinds of reasoning agentic flows. That’s why I keep it simple with Smolagents and a local knowledgebase. But I’m planning on expanding this much further soon with a much more robust but still similar flow built with Pydantic AI and LangGraph!

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Community Voting period of the oTTomator Hackathon is open! Head on over to the Live Agent Studio now and test out the submissions and vote for your favorite agents. There are so many incredible projects to try out! https://studio.ottomator.ai

All the code covered in this video + instructions to run it can be found here: https://github.com/coleam00/ottomator-agents/tree/main/r1-distill-rag

SmolAgents: https://huggingface.co/docs/smolagents/en/index

R1 on Ollama: https://ollama.com/library/deepseek-r1

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

00:00 - Why R1 for Agentic RAG?
01:56 - Overview of our Agent
03:33 - SmolAgents - Our Ticket to Fast Agents
06:07 - Building our Agentic RAG Agent with R1
14:17 - Creating our Local Knowledgebase w/ Chroma DB
15:45 - Getting our Local LLMs Set Up with Ollama
19:15 - R1 Agentic RAG Demo
21:42 - Outro

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Join me as I push the limits of what is possible with AI. I'll be uploading videos at least two times a week - Sundays and Wednesdays at 7:00 PM CDT!

Deep Dive into LLMs like ChatGPT

#RAG #agent #tutorial

·youtube.com·Feb 6, 2025
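
The pattern the video description outlines, a fast model driving the conversation and handing hard questions to a slower reasoning model behind a RAG tool, might look roughly like the sketch below. It does not use Smolagents, Ollama, or Chroma DB. Every name here (`KNOWLEDGE_BASE`, `retrieve`, `slow_reasoner`, `fast_model`, `needs_tool`, `agent`) is a hypothetical stand-in, the two "models" are stub functions, and retrieval is naive word overlap instead of embeddings.

```python
import re

# Hypothetical local knowledge base; a real setup would use a vector store.
KNOWLEDGE_BASE = [
    "Ollama serves local models over an HTTP API.",
    "Chroma DB stores document embeddings for retrieval.",
    "Reasoning models trade latency for deeper analysis.",
]


def words(text):
    """Lowercase a string and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, k=2):
    """Return the k documents sharing the most words with the query."""
    query_words = set(words(query))
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(query_words & set(words(doc))),
        reverse=True,
    )
    return ranked[:k]


def slow_reasoner(question, passages):
    """Stub for the slow reasoning model (R1 in the video)."""
    return f"Reasoned answer to {question!r} using {len(passages)} passages."


def fast_model(question):
    """Stub for the lightweight model that handles simple turns itself."""
    return f"Quick reply to {question!r}."


def needs_tool(question):
    """Assumed routing rule: explanatory questions go to the RAG tool."""
    tokens = words(question)
    return bool(tokens) and tokens[0] in {"why", "how", "explain", "compare"}


def rag_tool(question):
    """Retrieve supporting passages, then let the slow model reason over them."""
    return slow_reasoner(question, retrieve(question))


def agent(question):
    """Answer directly when possible; pay the slow-model cost only when needed."""
    if needs_tool(question):
        return "tool", rag_tool(question)
    return "direct", fast_model(question)


print(agent("hi"))
print(agent("How does Chroma DB store document embeddings?"))
```

In a real agent the routing decision would be made by the fast LLM itself rather than a keyword rule. The rule is only a placeholder to keep the control flow visible.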

S1: The $6 R1 Competitor?

#local model #model training #fine tuning

·timkellogg.me·Feb 6, 2025

S1: The $6 R1 Competitor?

Tim Kellogg shares his notes on a new paper, [s1: Simple test-time scaling](https://arxiv.org/abs/2501.19393), which describes an inference-scaling model fine-tuned on top of Qwen2.5-32B-Instruct for just $6 - the cost for …

#model training #fine tuning #local model

·simonwillison.net·Feb 6, 2025

mlx-community/DeepSeek-R1-Distill-Llama-70B-8bit · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

·huggingface.co·Feb 6, 2025

Run DeepSeek-R1 Dynamic 1.58-bit

DeepSeek R-1 is the most powerful open-source reasoning model that performs on par with OpenAI's o1 model. Run the 1.58-bit Dynamic GGUF version by Unsloth.

#local model #mac #m1

·unsloth.ai·Feb 5, 2025

Understanding DeepSeek R1 | Christian B. B. Houmann

A detailed look at the DeepSeek-R1 model, and how to run it locally.

#local model

·bagerbach.com·Feb 5, 2025

o3-mini is really good at writing internal documentation

I wanted to refresh my knowledge of how the Datasette permissions system works today. I already have [extensive hand-written documentation](https://docs.datasette.io/en/latest/authentication.html) for that, but I thought it would be interesting to …

#documentation

·simonwillison.net·Feb 5, 2025

permissions.md

GitHub Gist: instantly share code, notes, and snippets.

#documentation

·gist.github.com·Feb 5, 2025

irthomasthomas/llm-model-gateway: OpenAI-compatible API server for simonw's llm cli

OpenAI-compatible API server for simonw's llm cli. Contribute to irthomasthomas/llm-model-gateway development by creating an account on GitHub.

#cli #api

·github.com·Feb 5, 2025

Msty as LM Studio alternative

Msty is the perfect alternative to LM Studio. Msty offers a powerful and intuitive interface that makes it easy to get started, even for beginners. Say goodbye to complexities and embrace the simplicity with Msty. With innovative features like Folders, Vapor Mode, and Workspaces, Msty makes you more productive than you ever got with LM Studio.

#mac #app #local model

·msty.app·Feb 4, 2025

DeepSeek Debates: Chinese Leadership On Cost, True Training Cost, Closed Model Margin Impacts

The DeepSeek Narrative Takes the World by Storm DeepSeek took the world by storm. For the last week, DeepSeek has been the only topic that anyone in the world wants to talk about. As it currently s…

#model training #politics #fine tuning

·semianalysis.com·Feb 3, 2025

AI Markets Were Deceived To Believe In DeepSeek's Low Training Costs; They Are Actually 400 Times Higher Than The Reported Figure

The controversy around DeepSeek's costs for training their R1 model shook up the markets, but it seems like there was a lot of deception.

#security #model training #politics

·wccftech.com·Feb 3, 2025
