LLM SVG Generation Benchmark
Has Google Quietly Solved Two of AI’s Oldest Problems?
A mysterious new model currently in testing on Google’s AI Studio is nearly perfect at automated handwriting recognition, but it is also showing signs of spontaneous, abstract, symbolic reasoning.
Nano Banana can be prompt engineered for extremely nuanced AI image generation
Max Woolf provides an exceptional deep dive into Google's Nano Banana, aka Gemini 2.5 Flash Image, still the best available image-manipulation LLM tool three months after its initial …
Awesome-Nano-Banana-images/README_en.md at main · PicoTrex/Awesome-Nano-Banana-images
A curated collection of fun and creative examples generated with Nano Banana 🍌, a Gemini-2.5-flash-image-based model. This repository showcases diverse AI-generated visuals and prompts, highlighting t...
Testing VLMs and LLMs for robotics w/ the Jetson Thor devkit
Exploring the Jetson Thor devkit w/ some local LLMs and VLMs.
More info on the Jetson Thor Devkit: https://nvda.ws/45xIU4B
Neural Networks from Scratch book: h...
rednote-hilab/dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model
Multilingual Document Layout Parsing in a Single Vision-Language Model - rednote-hilab/dots.ocr
How Docling turns documents into usable AI data
Wanting to use your personal or organizational data in AI workflows, but it's stuck in PDFs and other document formats? Docling is here to help. Docling is an ...
What Is Docling? Transforming Unstructured Data for RAG and AI
Ready to become a certified Architect - Cloud Pak for Data? Register now and use code IBMTechYT20 for 20% off of your exam → https://ibm.biz/BdeXNR
Learn more...
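As a concrete picture of the workflow the two Docling items above describe, here is a minimal sketch based on my recollection of Docling's documented Python API; treat the exact names as assumptions rather than confirmed details and check the docs linked above.

```python
# Minimal Docling sketch (names recalled from Docling's docs; may differ between versions).
# Assumes `pip install docling` and a PDF on disk (a URL should also work).
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")      # parse the document's layout and content
print(result.document.export_to_markdown())   # LLM/RAG-ready Markdown output
```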
Google’s new AI model creates video game worlds in real time
Google is investing a lot into AI world models.
reducto/RolmOCR · Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
Contextualizing ancient texts with generative neural networks - Nature
Aeneas, a generative neural network trained on ancient texts, helps historians contextualize inscriptions and perform epigraphic tasks, offering an improved starting point for historical research.
But how do AI videos actually work? | Guest video by @WelchLabsVideo
Diffusion models, CLIP, and the math of turning text into images
Welch Labs Book: https://www.welchlabs.com/resources/imaginary-numbers-book
Sections
0:00 - Intro
3:37 - CLIP
6:25 - Shared Embedding Space
8:16 - Diffusion Models & DDPM
11:44 - Learning Vector Fields
22:00 - DDIM
25:25 - Dall E 2
26:37 - Conditioning
30:02 - Guidance
33:39 - Negative Prompts
34:27 - Outro
35:32 - About guest videos + Grant’s Reaction
Special Thanks to:
Jonathan Ho - Jonathan is the author of the DDPM paper and the classifier-free guidance paper.
https://arxiv.org/pdf/2006.11239
https://arxiv.org/pdf/2207.12598
Preetum Nakkiran - Preetum has an excellent introductory diffusion tutorial:
https://arxiv.org/pdf/2406.08929
Chenyang Yuan - Many of the animations in this video were implemented using manim and Chenyang’s smalldiffusion library: https://github.com/yuanchenyang/smalldiffusion
Chenyang also has a terrific tutorial and an MIT course on diffusion models:
https://www.chenyang.co/diffusion.html
https://www.practical-diffusion.org/
Other References
All of Sander Dieleman’s diffusion blog posts are fantastic: https://sander.ai/
CLIP Paper: https://arxiv.org/pdf/2103.00020
DDIM Paper: https://arxiv.org/pdf/2010.02502
Score-Based Generative Modeling: https://arxiv.org/pdf/2011.13456
Wan2.1: https://github.com/Wan-Video/Wan2.1
Stable Diffusion: https://huggingface.co/stabilityai/stable-diffusion-2
Midjourney: https://www.midjourney.com/
Veo: https://deepmind.google/models/veo/
DallE 2 paper: https://cdn.openai.com/papers/dall-e-2.pdf
Code for this video: https://github.com/stephencwelch/manim_videos/tree/master/_2025/sora
Written by: Stephen Welch, with very helpful feedback from Grant Sanderson
Produced by: Stephen Welch, Sam Baskin, and Pranav Gundu
Technical Notes
The noise videos in the opening have been passed through a VAE (actually, the diffusion process happens in a compressed “latent” space), which acts very much like a video compressor - this is why the noise videos don’t look like pure salt-and-pepper noise.
6:15 CLIP: Although directly minimizing cosine similarity would push our vectors 180 degrees apart on a single batch, in practice CLIP needs to maximize the uniformity of concepts over the hypersphere it operates on. For this reason, we animated these vectors as orthogonal-ish. See: https://proceedings.mlr.press/v119/wang20k/wang20k.pdf
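For reference, the uniformity objective from the Wang & Isola paper linked above can be written as follows (my transcription of their formula, not something shown in the video). Minimizing it spreads embeddings roughly uniformly over the hypersphere, which is why orthogonal-ish vectors are the better mental picture than antipodal ones.

```latex
% Uniformity objective from Wang & Isola (2020); f maps inputs to the unit hypersphere,
% t > 0 is a temperature, and x, y are i.i.d. samples from the data distribution.
\mathcal{L}_{\mathrm{uniform}}(f; t) \;=\; \log \, \mathbb{E}_{x, y \sim p_{\mathrm{data}}}
  \left[ e^{-t \, \lVert f(x) - f(y) \rVert_2^2} \right]
```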
Per Chenyang Yuan: at 10:15, the blurry image that results when removing random noise in DDPM is probably due to a mismatch in noise levels when calling the denoiser. When the denoiser is called on x_{t-1} during DDPM sampling, it is expected to have a certain noise level (let's call it sigma_{t-1}). If you generate x_{t-1} from x_t without adding noise, then the noise present in x_{t-1} is always smaller than sigma_{t-1}. This causes the denoiser to remove too much noise, thus pointing towards the mean of the dataset.
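To make that mismatch concrete, here is the standard DDPM ancestral sampling step in the epsilon-prediction notation of the DDPM paper linked above (a sketch for reference, not code from the video; the note above describes the same thing in sigma-based terms). Skipping the final σ_t z term is exactly the "no added noise" case, so x_{t-1} carries less noise than the denoiser expects at the next call.

```latex
% DDPM ancestral sampling step (Algorithm 2 of the DDPM paper linked above):
% eps_theta is the learned denoiser, alpha_t and \bar{alpha}_t come from the noise schedule,
% and z is fresh Gaussian noise added at every step except the last.
x_{t-1} \;=\; \frac{1}{\sqrt{\alpha_t}}
  \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_\theta(x_t, t) \right)
  + \sigma_t z, \qquad z \sim \mathcal{N}(0, I)
```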
The text conditioning input to stable diffusion is not the 512-dim text embedding vector, but the output of the layer before that, [with dimension 77x512](https://stackoverflow.com/a/79243065)
For the vectors at 31:40 - Some implementations use f(x, t, cat) + alpha(f(x, t, cat) - f(x, t)), and others use f(x, t) + alpha(f(x, t, cat) - f(x, t)), where an alpha value of 1 corresponds to no guidance. I chose the second format here to keep things simpler.
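A quick check (my own algebra, not from the video) that the two formats differ only by shifting alpha by one, which is why alpha = 0 in the first format and alpha = 1 in the second both reduce to the plain conditional prediction, i.e. no guidance:

```latex
% Expanding the second format with guidance scale (1 + alpha) recovers the first:
f(x, t) + (1 + \alpha)\,\bigl(f(x, t, \mathrm{cat}) - f(x, t)\bigr)
  \;=\; f(x, t, \mathrm{cat}) + \alpha\,\bigl(f(x, t, \mathrm{cat}) - f(x, t)\bigr)
```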
At 30:30, the unconditional t=1 vector field looks a bit different from what it did at the 17:15 mark. This is the result of different models trained for different parts of the video, and likely a result of different random initializations.
Premium Beat Music ID: EEDYZ3FP44YX8OWT
MonoQwen-Vision, the first visual document reranker - LightOn
We introduce MonoQwen2-VL-v0.1, the first visual document reranker, designed to enhance the quality of retrieved visual documents and take these pipelines to the next level. Reranking a small number of candidates with MonoQwen2-VL-v0.1 achieves top results on the ViDoRe leaderboard.
Introducing Gemma 3n: The developer guide
Learn how to build with Gemma 3n, which features a mobile-first architecture, MatFormer technology, Per-Layer Embeddings, and new audio and vision encoders.
Introducing Gemma 3n: The developer guide
Extremely consequential new open weights model release from Google today: Multimodal by design: Gemma 3n natively supports image, audio, video, and text inputs and text outputs. Optimized for on-device: Engineered …
How OpenElections Uses LLMs – Derek Willis
The OpenElections project collects detailed election data for the USA, all the way down to the precinct level. This is a surprisingly hard problem: while county and state-level results are …
nanonets/Nanonets-OCR-s · Hugging Face
Introducing the unified multi-modal MLX engine architecture in LM Studio
Leveraging `mlx-lm` and `mlx-vlm` to achieve unified multi-modal LLM inference in LM Studio's `mlx-engine`.
ollama-ocr
OCR package using Ollama vision language models.
Passing Images to a Vision-Language Model in Ollama | by Manyi | Apr,…
Trying out llama.cpp’s new vision support
This llama.cpp server vision support via libmtmd pull request—via Hacker News—was merged earlier today. The PR finally adds full support for vision models to the excellent llama.cpp project. It’s documented …
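For a sense of what the merged vision support enables, here is a hedged sketch of calling a locally running `llama-server` through its OpenAI-compatible endpoint. It assumes you started the server yourself with a vision model and its `--mmproj` projector file, that it listens on port 8080, and that your build accepts OpenAI-style `image_url` content parts; all of those are assumptions about your setup rather than details from the post.

```python
import base64
import requests

# Assumes llama-server is already running locally with a vision model, e.g.:
#   llama-server -m model.gguf --mmproj mmproj.gguf --port 8080
# (exact flags depend on your llama.cpp version -- check `llama-server --help`)

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image."},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```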
Peekaboo MCP – lightning-fast macOS screenshots for AI agents | Peter Steinberger
Turn your blind AI into a visual debugger with instant screenshot capture and analysis
Agentic Document Extraction: 17x Faster, Smarter, with LLM-Ready Outputs
Agentic Document Extraction just got faster! We've improved the median document processing time from 135 seconds to 8 seconds!
Agentic Document Extraction sees documents visually and uses an iterative workflow to accurately extract text, figures, form fields, charts, and more to create an LLM-ready output.
You can use our SDK to parse complex documents and get the extracted content in Markdown and JSON. You can then feed the output to an LLM, RAG application, or other downstream apps (a usage sketch follows the links below).
You can also use our Playground to test out Agentic Document Extraction.
Try out Agentic Document Extraction:
- Playground: https://va.landing.ai/demo/doc-extraction
- Library: https://github.com/landing-ai/agentic-doc
Learn more: https://landing.ai/agentic-document-extraction
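For the SDK mentioned above, a hypothetical sketch: the `agentic_doc.parse` entry point, the `.markdown` attribute, and the `VISION_AGENT_API_KEY` variable are my best-guess recollections of the library's README, not confirmed details, so check the GitHub repo linked above before relying on them.

```python
# Hypothetical agentic-doc usage sketch; function and attribute names are assumptions.
# Assumes a LandingAI API key exported in the environment (e.g. VISION_AGENT_API_KEY).
from agentic_doc.parse import parse  # assumed entry point

results = parse("invoice.pdf")       # assumed: accepts a path and returns parsed documents
doc = results[0]
print(doc.markdown)                  # assumed: Markdown rendering of the extracted content
# The parsed output can then be fed to an LLM or a RAG pipeline.
```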
The best open source OCR models
AI-Powered Handwriting Recognition with ML Techniques
AI-Powered Handwriting Recognition with Machine Learning Techniques.
Private Local LlamaOCR with a User-Friendly Streamlit Front-End
Optical Character Recognition (OCR) is a powerful tool for extracting text from images, and with the rise of multimodal AI models, it's now easier than ever to implement locally. In this guide, we'll show you how to build a professional OCR application using Llama 3.2-Vision, Ollama for the backend, and Streamlit for the front end.
Prerequisites
Before we start, ensure you have the following:
1. Python 3.10 or higher installed.
2. Anaconda (Optional)
3. Ollama installed for local model hosting. Downl
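Along the lines of the guide described above (not the article's exact code), here is a minimal sketch assuming Ollama is running locally with the llama3.2-vision model pulled and that the `ollama` and `streamlit` packages are installed.

```python
# Minimal local OCR app sketch: Llama 3.2-Vision via Ollama, Streamlit front end.
# Assumes `ollama pull llama3.2-vision` has been run and the Ollama server is up.
import ollama
import streamlit as st

st.title("Local OCR with Llama 3.2-Vision")

uploaded = st.file_uploader("Upload an image", type=["png", "jpg", "jpeg"])
if uploaded is not None:
    st.image(uploaded, caption="Input image")
    if st.button("Extract text"):
        response = ollama.chat(
            model="llama3.2-vision",
            messages=[{
                "role": "user",
                "content": "Transcribe all text in this image as Markdown.",
                "images": [uploaded.read()],  # raw image bytes are accepted
            }],
        )
        st.markdown(response["message"]["content"])
```

Save it as app.py and launch it with `streamlit run app.py`, then upload an image to get a Markdown transcription back.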
Handwritten Text Recognition using OCR
In this article, we carry out handwritten text recognition using OCR. We fine-tune the TrOCR model on the GNHK dataset.
Raycast AI as Translator
A compelling use case for AI: a Japanese-to-English translator that gives me a translation, a breakdown of the Chinese characters in a Japanese phrase, and the ability to ask follow-up questions.
Way Enough - Local VLMs Have Improved