Found 25 bookmarks

SmolDocling - The SmolOCR Solution?

In this video I look at SmolDocling and how it compares to the other OCR solutions that are out there, both open and proprietary. Blog: https://huggingface.c...

#OCR #vision #image

·youtube.com·Mar 18, 2025

Gemma 3 - The NEW Gemma Family Members Have Arrived!!!

In this video, I look at the release of the new Gemma 3 models, which come in four different flavors: a 1B, a 4B, a 12B, and the new Big 27B parameter model. Demo: https://huggingface.co/spaces/huggingface-projects/gemma-3-12b-it Blog: https://blog.google/technology/developers/gemma-3/?linkId=sam_witteveen Model Weights: https://huggingface.co/collections/google/gemma-3-release-67c6c6f89c4f76621268bb6d For more tutorials on using LLMs and building agents, check out my Patreon Patreon: https://www.patreon.com/SamWitteveen Twitter: https://x.com/Sam_Witteveen 🕵️ Interested in building LLM Agents? Fill out the form below Building LLM Agents Form: https://drp.li/dIMes 👨‍💻Github: https://github.com/samwit/llm-tutorials ⏱️Time Stamps:

#vision #OCR

·youtube.com·Mar 12, 2025

Mistral OCR - Multimodal & Multilingual OCR

In this video, I look at the latest release from Mistral AI, which is their Mistral OCR model. I look at how it works and how it compares to other models, as well as how you can get started using it with code. Colab: https://dripl.ink/Sr4Uk Blog: https://mistral.ai/news/mistral-ocr For more tutorials on using LLMs and building agents, check out my Patreon Patreon: https://www.patreon.com/SamWitteveen Twitter: https://x.com/Sam_Witteveen 🕵️ Interested in building LLM Agents? Fill out the form below Building LLM Agents Form: https://drp.li/dIMes 👨‍💻Github: https://github.com/samwit/llm-tutorials ⏱️Time Stamps: 00:00 Intro 00:17 Other models 00:35 Mistral OCR Blog 05:45 Mistral OCR Demo 13:47 Mistral OCR Batch inference

#OCR #vision

·youtube.com·Mar 7, 2025

olmOCR - The Open OCR System

In this video, I look at olmOCR, the OpenOCR system from Allen AI. Colab: https://dripl.ink/HpaK4 Blog: https://olmocr.allenai.org/blog macOS ver: https://jonathansoma.com/words/olmocr-on-macos-with-lm-studio.html For more tutorials on using LLMs and building agents, check out my Patreon Patreon: https://www.patreon.com/SamWitteveen Twitter: https://x.com/Sam_Witteveen 🕵️ Interested in building LLM Agents? Fill out the form below Building LLM Agents Form: https://drp.li/dIMes 👨‍💻Github: https://github.com/samwit/llm-tutorials ⏱️Time Stamps: 00:00 Intro 00:31 Allen AI Blog 01:20 olmOCR Blog 02:08 olmOCR Hugging Face 04:52 olmOCR GitHub 05:41 Demo 05:59 Running olmOCR on macOS with LM Studio

#OCR #local model #vision

·youtube.com·Mar 2, 2025

microsoft/Florence-2-large · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

#image #OCR #vision

·huggingface.co·Feb 17, 2025

DrSadiqfareed/Full-Page-Handwriting-Recognition: An implementation of a full-page handwriting recognition system using convolutional neural networks and transformers. This project tackles the complex task of recognizing handwritten text without segmentation.

An implementation of a full-page handwriting recognition system using convolutional neural networks and transformers. This project tackles the complex task of recognizing handwritten text without s...

#OCR #vision #image #code

·github.com·Feb 15, 2025

Handwritten Digit Recognition with TensorFlow and OpenCV

In this blog post, we will explore the fascinating world of handwritten digit recognition using TensorFlow and OpenCV. Handwritten digit…

#vision #OCR

·medium.com·Feb 15, 2025

Guide to Optical Character Recognition (OCR) in 2025

Optical Character Recognition helps perceive the characters of a text within the images like printed books, photos, or documents. Explore top 17 OCR vendors.

#OCR #image #vision

·research.aimultiple.com·Feb 9, 2025

junhoyeo/BetterOCR: 🔍 Better text detection by combining multiple OCR engines (EasyOCR, Tesseract, and Pororo) with 🧠 LLM.

🔍 Better text detection by combining multiple OCR engines (EasyOCR, Tesseract, and Pororo) with 🧠 LLM. - junhoyeo/BetterOCR

#OCR #image #vision

·github.com·Feb 9, 2025

plastic-plant/florence-2: Let's play with Florence-2 vision model.

Let's play with Florence-2 vision model. Contribute to plastic-plant/florence-2 development by creating an account on GitHub.

#image #model training #vision #OCR

·github.com·Feb 9, 2025

Why LLMs still have problems with OCR | Hacker News

A lot of problems jump out to me with this article, particularly with the explanation of multi-modal LLMs. I'll say that I _do_ agree with the thrust of the article. Don't trust LLMs. But they probably should have argued legitimate issues with VLM based OCR, rather than try to talk about how VLMs are somehow fundamentally flawed or something.
> LLMs process images through high-dimensional embeddings, essentially creating abstract representations that prioritize semantic understanding over precise character recognition.
This isn't true. CLIP and its derivatives don't prioritize semantic understanding. They are trained contrastively, which (very roughly speaking) means they need to be able to differentiate similar images. If two images are just white with a few words, the only way to differentiate them is to include the text in the embedding.
Pretrained CLIP models do tend to be a bit lossy in this department, but not by as much as you would think considering they boil an entire image down to something on the order of 768 floats.
> Each step in this pipeline optimizes for semantic meaning while discarding precise visual information.
Again, that ... doesn't make any sense. It's a bit foolhardy to even say _what_ the models do, given that not even the most brilliant ML researchers know. But in broad _hypothesis_, the CLIP pipeline is optimizing being able to pair images with captions amongst a large number of possibilities. Which, again, requires them to surface all kinds of information from the image, and often times requires surfacing specific text from the image. How else would it differentiate powerpoint slides? Math problems in images? Etc.
> Fixed patch sizes may split individual characters
This doesn't matter. We know from empirical evidence. But even if it _did_, there's plenty of vision models that use overlapping patches.
> Position embeddings lose fine-grained spatial relationships
This isn't true. The model is fully aware of the position of pixels within patches, and the position embedding is merely to tell it the position of the patches themselves within the image. Therefore it can derive the absolute position of every pixel, if it needs to. In fact, we have proof they can and do.
> losing the ability to have human-in-the-loop evaluations, confidence scores, and bounding box outputs.
You get confidence scores for free because the model is explicitly trained to provide cosine similarity scores.
OWLv2 is a CLIP based open vocabulary bounding box model (from Google, makers of Gemini). It's finetuned from a standard, pretrained CLIP model. Nothing really special about the vision architecture; just that it gets finetuned to output bounding boxes. And it beats the pants off YOLO while being open vocabulary to boot. So not only are CLIP-like models capable of outputting bounding boxes, but OWLv2 was trained with human-in-the-loop processes and outputs confidence scores.
Oh and there's Florence, which is a VLM trained on bounding boxes.
> Favor common words over exact transcription
Nothing about LLMs indicates that. In fact, pretrained LLMs favor exact transcription.
> "Correct" perceived errors in the source document
Which OCR systems need to do to be useful for many applications. I get the argument that LLMs are a blackbox in this regard, which is a legitimate criticism, but correcting mistakes is not fundamentally the issue. It's better to say that LLMs _blindly_ correct issues. Whereas, perhaps, one could say a traditional OCR system can report "this is my exact transcription, I corrected it to this" and have various knobs to tweak thresholds. But there's no reason VLMs can't do that too.
> Merge or reorder information based on learned patterns
LLMs are perfectly capable of regurgitating data verbatim. That's perhaps the first thing they learn to do to get loss down. That's what all long context models are benchmarked against.
> Produce different outputs for the same input due to sampling
You can turn off sampling, and then they are deterministic. Or you can output the logits to the user, which gives you effectively confidence scores on its transcription.
And a well trained LLM for this task isn't really "probabilistic" in the sense that its outputs are completely different each time. If it's trained and prompted specifically to transcribe a document, that's what it's going to do. Any variations in output at that point are a result of real vagaries either in the document, vision, or the user request.
If a user wants consistency, they merely need to ask for it. Or the VLM needs to be trained better. In either case, these models are _capable_ of it.
It's most important to note here that, outside of pretrained LLMs, all LLMs that users interact with are Reinforcement trained. So while they were next token prediction trained during _pretraining_, they get trained to seek reward in production. That vastly trims the logits and focuses the model explicitly on performing tasks. Well trained, produc

#OCR #image #model training #vision #fine tuning

·news.ycombinator.com·Feb 9, 2025
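
The comment in this entry says that turning off sampling makes decoding deterministic, and that exposing the logits gives a per-token confidence score. A minimal sketch of that idea in plain Python follows. The vocabulary and logit values are hypothetical toy data, not output from any real model, and `greedy_decode` is an illustrative helper rather than any library's API.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(step_logits, vocab):
    """Pick the highest-probability token at each step (no sampling).

    Returns a list of (token, confidence) pairs, where confidence is the
    softmax probability of the chosen token.
    """
    out = []
    for logits in step_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        out.append((vocab[best], probs[best]))
    return out


# Hypothetical toy data: 3 decoding steps over a 4-token vocabulary of
# visually confusable characters.
VOCAB = ["0", "O", "1", "l"]
STEP_LOGITS = [
    [4.0, 0.5, 0.1, 0.1],  # clearly "0"
    [1.2, 1.0, 0.1, 0.1],  # ambiguous between "0" and "O"
    [0.1, 0.1, 3.0, 2.8],  # ambiguous between "1" and "l"
]

result = greedy_decode(STEP_LOGITS, VOCAB)
for token, conf in result:
    print(f"{token!r}: {conf:.2f}")
```

Because the token choice is an argmax rather than a random draw, repeated runs on the same logits give the same transcription, and low-confidence positions (the ambiguous steps above) are the ones a human reviewer would want flagged.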

Pulse AI Blog - Why LLMs Suck at OCR

#OCR #model training #vision #image

·runpulse.com·Feb 9, 2025

4. Super Models

We strive to regularly add new elements and new technologies to enhance the power of Transkribus. One of these technological elements is the Super Models for text recognition, which are the most advanced models we can offer so far.

#OCR #image

·help.transkribus.org·Feb 3, 2025

OCR4all | Setup guide, user guide, developer documentation and more.

Guides, documentation and more

#OCR #vision #image

·ocr4all.org·Feb 2, 2025

emcf/thepipe: Extract clean data from anywhere, powered by vision-language models ⚡

Extract clean data from anywhere, powered by vision-language models ⚡ - emcf/thepipe

#vision #OCR

·github.com·Feb 2, 2025

microsoft/trocr-base-handwritten · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

#image #OCR

·huggingface.co·Feb 2, 2025

JaidedAI/EasyOCR: Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc.

Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc. - JaidedAI/EasyOCR

#OCR #image #vision #python

·github.com·Feb 2, 2025

LlamaOCR - Building your Own Private OCR System - YouTube

The video demonstrates LlamaOCR, an OCR tool leveraging the Llama 3.2 visual model. It focuses on the tool's ability to convert images and scanned documents into structured Markdown, preserving the original formatting of elements like tables, lists, and spreadsheets. The video covers practical usage examples, offering tutorials and code snippets in both JavaScript and Python within a Colab environment. For more tutorials on using LLMs and building agents, check out my Patreon Patreon: https://www.patreon.com/SamWitteveen Twitter: https://twitter.com/Sam_Witteveen Colab: https://drp.li/WpdNm 🕵️ Interested in building LLM Agents? Fill out the form below Building LLM Agents Form: https://drp.li/dIMes ⏱️Time Stamps: 00:00 LlamaOCR Project 00:56 Demo Using their Site 02:43 Colab Demo 04:40 Together.AI Docs 06:06 Pricing 09:16 Python OCR Version 11:20 Thai OCR Project 16:30 Patreon

#vision #image #OCR

·youtube.com·Nov 19, 2024

Home - Docling

#text sanitization #OCR

·ds4sd.github.io·Nov 3, 2024

Reader-LM: Small Language Models for Cleaning and Converting HTML to Markdown

Reader-LM-0.5B and Reader-LM-1.5B are two novel small language models inspired by Jina Reader, designed to convert raw, noisy HTML from the open web into clean markdown.

#OCR #scraping

·jina.ai·Sep 13, 2024

Running OCR against PDFs and images directly in your browser

I attended the Story Discovery At Scale data journalism conference at Stanford this week. One of the perennial hot topics at any journalism conference concerns data extraction: how can we …

#scraping #OCR

·simonwillison.net·Mar 31, 2024

GitHub - freedmand/textra: A command-line application to convert images, PDFs, and audio files to text using Apple's APIs

A command-line application to convert images, PDFs, and audio files to text using Apple's APIs - GitHub - freedmand/textra: A command-line application to convert images, PDFs, and audio fil...

#OCR #scraping

·github.com·Sep 20, 2023

Jaided AI - Distribute the benefits of AI to the world

#OCR #text #image

·jaided.ai·Jun 19, 2023

GitHub - JaidedAI/EasyOCR: Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc.

Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc.

#text #OCR #image

·github.com·Jun 19, 2023

clovaai/donut: Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022

#image #OCR

·github.com·Jun 1, 2023
