Why LLMs still have problems with OCR | Hacker News
A lot of problems jump out at me with this article, particularly with the explanation of multi-modal LLMs. I'll say that I _do_ agree with the thrust of the article. Don't trust LLMs. But they probably should have argued legitimate issues with VLM-based OCR, rather than trying to argue that VLMs are somehow fundamentally flawed.

> LLMs process images through high-dimensional embeddings, essentially creating abstract representations that prioritize semantic understanding over precise character recognition.

This isn't true. CLIP and its derivatives don't prioritize semantic understanding. They are trained contrastively, which (very roughly speaking) means they need to be able to differentiate similar images. If two images are just white with a few words, the only way to differentiate them is to include the text in the embedding. Pretrained CLIP models do tend to be a bit lossy in this department, but not by as much as you would think, considering they boil an entire image down to something on the order of 768 floats.

> Each step in this pipeline optimizes for semantic meaning while discarding precise visual information.

Again, that... doesn't make any sense. It's a bit foolhardy to even say _what_ the models do, given that not even the most brilliant ML researchers know. But the broad _hypothesis_ is that the CLIP pipeline optimizes for pairing images with captions among a large number of possibilities. That, again, requires surfacing all kinds of information from the image, and oftentimes requires surfacing specific text from the image. How else would it differentiate PowerPoint slides? Math problems in images? Etc.

> Fixed patch sizes may split individual characters

This doesn't matter; we know from empirical evidence. But even if it _did_, there are plenty of vision models that use overlapping patches.

> Position embeddings lose fine-grained spatial relationships

This isn't true. The model is fully aware of the position of pixels within patches, and the position embedding merely tells it the position of the patches themselves within the image. Therefore it can derive the absolute position of every pixel if it needs to. In fact, we have proof that they can and do.

> losing the ability to have human-in-the-loop evaluations, confidence scores, and bounding box outputs.

You get confidence scores for free, because the model is explicitly trained to provide cosine similarity scores. OWLv2 is a CLIP-based open-vocabulary bounding-box model (from Google, makers of Gemini). It's finetuned from a standard, pretrained CLIP model; nothing is really special about the vision architecture, it just gets finetuned to output bounding boxes. And it beats the pants off YOLO while being open-vocabulary to boot. So not only are CLIP-like models capable of outputting bounding boxes, but OWLv2 was trained with human-in-the-loop processes and outputs confidence scores. Oh, and there's Florence, which is a VLM trained on bounding boxes.

> Favor common words over exact transcription

Nothing about LLMs indicates that. In fact, pretrained LLMs favor exact transcription.

> "Correct" perceived errors in the source document

Which OCR systems need to do to be useful for many applications. I get the argument that LLMs are a black box in this regard, which is a legitimate criticism, but correcting mistakes is not fundamentally the issue. It's better to say that LLMs _blindly_ correct issues, whereas a traditional OCR system can report "this is my exact transcription, I corrected it to this" and expose various knobs to tweak thresholds. But there's no reason VLMs can't do that too.

> Merge or reorder information based on learned patterns

LLMs are perfectly capable of regurgitating data verbatim. That's perhaps the first thing they learn to do to get loss down, and it's what all long-context models are benchmarked against.

> Produce different outputs for the same input due to sampling

You can turn off sampling, and then they are deterministic. Or you can surface the logits to the user, which effectively gives you confidence scores on the transcription. And a well-trained LLM for this task isn't really "probabilistic" in the sense that its outputs are completely different each time. If it's trained and prompted specifically to transcribe a document, that's what it's going to do. Any variations in output at that point are a result of real vagaries in the document, the vision, or the user request. If a user wants consistency, they merely need to ask for it, or the VLM needs to be trained better. In either case, these models are _capable_ of it.

It's most important to note here that, outside of pretrained LLMs, all LLMs that users interact with are reinforcement-trained. So while they were trained on next-token prediction during _pretraining_, they get trained to seek reward in production. That vastly trims the logits and focuses the model explicitly on performing tasks. Well trained, produc
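To make the cosine-similarity point concrete, here is a minimal sketch of scoring an image against candidate transcriptions with a pretrained CLIP model via Hugging Face transformers. The checkpoint name, file name, and candidate strings are illustrative assumptions, not anything from the article or the comment; the point is that the image-text similarity CLIP is trained to produce doubles as a confidence score over competing readings.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint and file name; swap in whatever you actually use.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("slide.png")  # e.g. a mostly-white slide containing a few words
candidates = [
    "a slide that reads 'Quarterly Revenue 2024'",
    "a slide that reads 'Quarterly Reverie 2024'",
]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_image holds scaled cosine similarities between the image embedding
# and each text embedding; softmax turns them into relative confidences.
probs = out.logits_per_image.softmax(dim=-1)
for text, p in zip(candidates, probs[0].tolist()):
    print(f"{p:.3f}  {text}")
```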
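The sampling and logit points can be sketched the same way: with do_sample=False, generation is greedy and repeatable, and the per-step scores that generate can return convert into per-token probabilities. The checkpoint and prompt below are placeholder assumptions; a real OCR setup would use a vision-language model with an image input rather than a text-only prompt.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder text-only model; a VLM would be used for actual image transcription.
name = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "Repeat the following text exactly: Invoice #10423, total $1,284.00\n"
inputs = tok(prompt, return_tensors="pt")

# do_sample=False means greedy decoding: the same input always yields the same output.
out = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=32,
    output_scores=True,
    return_dict_in_generate=True,
)

# Per-token log-probabilities derived from the logits act as confidence scores
# on the transcription.
scores = model.compute_transition_scores(out.sequences, out.scores, normalize_logits=True)
new_tokens = out.sequences[0, inputs["input_ids"].shape[1]:]
for tok_id, logprob in zip(new_tokens, scores[0]):
    print(f"{tok.decode(int(tok_id))!r}: p={logprob.exp().item():.3f}")
```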
·news.ycombinator.com·
S1: The $6 R1 Competitor?
Tim Kellogg shares his notes on a new paper, [s1: Simple test-time scaling](https://arxiv.org/abs/2501.19393), which describes an inference-scaling model fine-tuned on top of Qwen2.5-32B-Instruct for just $6 - the cost for …
·simonwillison.net·
4. The Ollama Course - Using the CLI
Welcome back to the Ollama course! In this video, we dive deep into the command line interface (CLI) of Ollama, exploring all the powerful options and comman...
·youtube.com·
AI Hallucinations: Why Large Language Models Make Things Up (And How to Fix It) - kapa.ai - Instant AI answers to technical questions
Kapa.ai turns your knowledge base into a reliable and production-ready LLM-powered AI assistant that answers technical questions instantly. Trusted by 100+ startups and enterprises incl. OpenAI, Docker, Mapbox, Mixpanel and NextJS.
·kapa.ai·
You Should Probably Pay Attention to Tokenizers
Last week I was helping a friend of mine to get one of his new apps off the ground. I can’t speak much about it at the moment, other than like most apps nowadays it has some AI sprinkled over …
·cybernetist.com·
txtai
txtai is an all-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows
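A minimal sketch of what that looks like in practice, assuming a recent txtai release; the model name and example documents are illustrative choices, not from the project blurb:

```python
from txtai import Embeddings

# Build an embeddings index; the model choice here is an illustrative default.
embeddings = Embeddings(path="sentence-transformers/all-MiniLM-L6-v2", content=True)
embeddings.index([
    "US tops 5 million confirmed virus cases",
    "Canada's last fully intact ice shelf has suddenly collapsed",
])

# Semantic search matches on meaning rather than exact keywords.
print(embeddings.search("natural disasters in the Arctic", limit=1))
```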
·neuml.github.io·
Florence 2 - The Best Small VLM Out There?
There is a new VLM on the scene and it comes with a dataset of 5 billion labels. The new model can do a variety of old-world tasks like bounding boxes and segmentation along with newer LLM-style captioning etc.

Paper: https://arxiv.org/pdf/2311.06242
HF Spaces Demo: https://huggingface.co/spaces/gokaygokay/Florence-2
Colab: https://drp.li/fGyMm

🕵️ Interested in building LLM Agents? Fill out the form below
Building LLM Agents Form: https://drp.li/dIMes

👨‍💻 Github:
https://github.com/samwit/langchain-tutorials (updated)
https://github.com/samwit/llm-tutorials

⏱️ Time Stamps:
00:00 Intro
00:13 Florence-2 Paper
02:19 Florence-2 Architecture
03:20 Florence-2 Detailed Image Captioning
03:41 Florence-2 Visual Grounding
04:09 Florence-2 Dense Region Caption
04:24 Florence-2 Open Vocab Detection
06:01 Hugging Face Spaces Demo
10:41 Colab Florence-2 Large Sample Usage
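A hedged sketch of running Florence-2 for one of those tasks (object detection) via transformers, following the pattern from the model card; the checkpoint, image URL, and task prompt are assumptions chosen for illustration, and trust_remote_code=True is needed because the model ships custom code:

```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Illustrative checkpoint; Florence-2 loads custom modeling code from the Hub.
name = "microsoft/Florence-2-base"
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(name, trust_remote_code=True)

# Illustrative test image.
image = Image.open(requests.get(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
    stream=True,
).raw)

# Task prompts select the behaviour: "<OD>" for boxes, "<CAPTION>" or
# "<DETAILED_CAPTION>" for captioning, "<OCR>" for text transcription.
task = "<OD>"
inputs = processor(text=task, images=image, return_tensors="pt")

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# The processor converts the raw output string into labels and bounding boxes.
result = processor.post_process_generation(text, task=task, image_size=(image.width, image.height))
print(result)
```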
·youtube.com·