Found 92 bookmarks
Custom sorting
qwen-image-mps
qwen-image-mps
Ivan Fioravanti built this Python CLI script for running the Qwen/Qwen-Image image generation model on an Apple silicon Mac, optionally using the Qwen-Image-Lightning LoRA to dramatically speed up generation. Ivan …
·simonwillison.net·
qwen-image-mps
But how do AI videos actually work? | Guest video by @WelchLabsVideo
But how do AI videos actually work? | Guest video by @WelchLabsVideo
Diffusion models, CLIP, and the math of turning text into images Welch Labs Book: https://www.welchlabs.com/resources/imaginary-numbers-book Sections 0:00 - Intro 3:37 - CLIP 6:25 - Shared Embedding Space 8:16 - Diffusion Models & DDPM 11:44 - Learning Vector Fields 22:00 - DDIM 25:25 Dall E 2 26:37 - Conditioning 30:02 - Guidance 33:39 - Negative Prompts 34:27 - Outro 35:32 - About guest videos + Grant’s Reaction Special Thanks to: Jonathan Ho - Jonathan is the Author of the DDPM paper and the Classifier Free Guidance Paper. https://arxiv.org/pdf/2006.11239 https://arxiv.org/pdf/2207.12598 Preetum Nakkiran - Preetum has an excellent introductory diffusion tutorial: https://arxiv.org/pdf/2406.08929 Chenyang Yuan - Many of the animations in this video were implemented using manim and Chenyang’s smalldiffusion library: https://github.com/yuanchenyang/smalldiffusion Cheyang also has a terrific tutorial and MIT course on diffusion models https://www.chenyang.co/diffusion.html https://www.practical-diffusion.org/ Other References All of Sander Dieleman’s diffusion blog posts are fantastic: https://sander.ai/ CLIP Paper: https://arxiv.org/pdf/2103.00020 DDIM Paper: https://arxiv.org/pdf/2010.02502 Score-Based Generative Modeling: https://arxiv.org/pdf/2011.13456 Wan2.1: https://github.com/Wan-Video/Wan2.1 Stable Diffusion: https://huggingface.co/stabilityai/stable-diffusion-2 Midjourney: https://www.midjourney.com/ Veo: https://deepmind.google/models/veo/ DallE 2 paper: https://cdn.openai.com/papers/dall-e-2.pdf Code for this video: https://github.com/stephencwelch/manim_videos/tree/master/_2025/sora Written by: Stephen Welch, with very helpful feedback from Grant Sanderson Produced by: Stephen Welch, Sam Baskin, and Pranav Gundu Technical Notes The noise videos in the opening have been passed through a VAE (actually, diffusion process happens in a compressed “latent” space), which acts very much like a video compressor - this is why the noise videos don’t look like pure salt and pepper. 6:15 CLIP: Although directly minimizing cosine similarity would push our vectors 180 degrees apart on a single batch, overall in practice, we need CLIP to maximize the uniformity of concepts over the hypersphere it's operating on. For this reason, we animated these vectors as orthogonal-ish. See: https://proceedings.mlr.press/v119/wang20k/wang20k.pdf Per Chenyang Yuan: at 10:15, the blurry image that results when removing random noise in DDPM is probably due to a mismatch in noise levels when calling the denoiser. When the denoiser is called on x_{t-1} during DDPM sampling, it is expected to have a certain noise level (let's call it sigma_{t-1}). If you generate x_{t-1} from x_t without adding noise, then the noise present in x_{t-1} is always smaller than sigma_{t-1}. This causes the denoiser to remove too much noise, thus pointing towards the mean of the dataset. The text conditioning input to stable diffusion is not the 512-dim text embedding vector, but the output of the layer before that, [with dimension 77x512](https://stackoverflow.com/a/79243065) For the vectors at 31:40 - Some implementations use f(x, t, cat) + alpha(f(x, t, cat) - f(x, t)), and some that do f(x, t) + alpha(f(x, t, cat) - f(x, t)), where an alpha value of 1 corresponds to no guidance. I chose the second format here to keep things simpler. At 30:30, the unconditional t=1 vector field looks a bit different from what it did at the 17:15 mark. This is the result of different models trained for different parts of the video, and likely a result of different random initializations. Premium Beat Music ID: EEDYZ3FP44YX8OWT
·youtube.com·
But how do AI videos actually work? | Guest video by @WelchLabsVideo
Trying out llama.cpp’s new vision support
Trying out llama.cpp’s new vision support
This llama.cpp server vision support via libmtmd pull request—via Hacker News—was merged earlier today. The PR finally adds full support for vision models to the excellent llama.cpp project. It’s documented …
·simonwillison.net·
Trying out llama.cpp’s new vision support
SmolDocling - The SmolOCR Solution?
SmolDocling - The SmolOCR Solution?
In this video I look at SmolDocling and how it compares to the other OCR solutions that are out there, both open and proprietary. Blog: https://huggingface.c...
·youtube.com·
SmolDocling - The SmolOCR Solution?
DrSadiqfareed/Full-Page-Handwriting-Recognition: An implementation of a full-page handwriting recognition system using convolutional neural networks and transformers. This project tackles the complex task of recognizing handwritten text without segmentation.
DrSadiqfareed/Full-Page-Handwriting-Recognition: An implementation of a full-page handwriting recognition system using convolutional neural networks and transformers. This project tackles the complex task of recognizing handwritten text without segmentation.
An implementation of a full-page handwriting recognition system using convolutional neural networks and transformers. This project tackles the complex task of recognizing handwritten text without s...
·github.com·
DrSadiqfareed/Full-Page-Handwriting-Recognition: An implementation of a full-page handwriting recognition system using convolutional neural networks and transformers. This project tackles the complex task of recognizing handwritten text without segmentation.
Why LLMs still have problems with OCR | Hacker News
Why LLMs still have problems with OCR | Hacker News
A lot of problems jump out to me with this article, particularly with the explanation of multi-modal LLMs. I'll say that I _do_ agree with the thrust of the article. Don't trust LLMs. But they probably should have argued legitimate issues with VLM based OCR, rather than try to talk about how VLMs are somehow fundamentally flawed or something.> LLMs process images through high-dimensional embeddings, essentially creating abstract representations that prioritize semantic understanding over precise character recognition.This isn't true. CLIP and its derivatives don't prioritize semantic understanding. They are trained contrastively, which (very roughly speaking) means they need to be able to differentiate similar images. If two images are just white with a few words, the only way to differentiate them is to include the text in the embedding.Pretrained CLIP models do tend to be a bit lossy in this department, but not by as much as you would think considering they boil an entire image down to something on the order of 768 floats.> Each step in this pipeline optimizes for semantic meaning while discarding precise visual information.Again, that ... doesn't make any sense. It's a bit foolhardy to even say _what_ the models do, given that not even the most brilliant ML researchers know. But in broad _hypothesis_, the CLIP pipeline is optimizing being able to pair images with captions amongst a large number of possibilities. Which, again, requires them to surface all kinds of information from the image, and often times requires surfacing specific text from the image. How else would it differentiate powerpoint slides? Math problems in images? Etc.> Fixed patch sizes may split individual charactersThis doesn't matter. We know from empirical evidence. But even if it _did_, there's plenty of vision models that use overlapping patches.> Position embeddings lose fine-grained spatial relationshipsThis isn't true. The model is fully aware of the position of pixels within patches, and the position embedding is merely to tell it the position of the patches themselves within the image. Therefore it can derive the absolute position of every pixel, if it needs to. In fact, we have proof they can and do.> losing the ability to have human-in-the-loop evaluations, confidence scores, and bounding box outputs.You get confidence scores for free because the model is explicitly trained to provide cosine similarity scores.OWLv2 is a CLIP based open vocabulary bounding box model (from Google, makers of Gemini). It's finetuned from a standard, pretrained CLIP model. Nothing really special about the vision architecture; just that it gets finetuned to output bounding boxes. And it beats the pants off YOLO while being open vocabulary to boot. So not only are CLIP-like models capable of outputting bounding boxes, but OWLv2 was trained with human-in-the-loop processes and outputs confidence scores.Oh and there's Florence, which is a VLM trained on bounding boxes.> Favor common words over exact transcriptionNothing about LLMs indicates that. In fact, pretrained LLMs favor exact transcription.> "Correct" perceived errors in the source documentWhich OCR systems need to do to be useful for many applications. I get the argument that LLMs are a blackbox in this regard, which is a legitimate criticism, but correcting mistakes is not fundamentally the issue. It's better to say that LLMs _blindly_ correct issues. Whereas, perhaps, one could say a traditional OCR system can report "this is my exact transcription, I corrected it to this" and have various knobs to tweak thresholds. But there's no reason VLMs can't do that too.> Merge or reorder information based on learned patternsLLMs are perfectly capable of regurgitating data verbatim. That's perhaps the first thing they learn to do to get loss down. That's what all long context models are benchmarked against.> Produce different outputs for the same input due to samplingYou can turn off sampling, and then they are deterministic. Or you can output the logits to the user, which gives you effectively confidence scores on its transcription.And a well trained LLM for this task isn't really "probabilistic" in the sense that its outputs are completely different each time. If it's trained and prompted specifically to transcribe a document, that's what it's going to do. Any variations in output at that point are a result of real vagaries either in the document, vision, or the user request.If a user wants consistency, they merely need to ask for it. Or the VLM needs to be trained better. In either case, these models are _capable_ of it.It's most important to note here that, outside of pretrained LLMs, all LLMs that users interact with are Reinforcement trained. So while they were next token prediction trained during _pretraining_, they get trained to seek reward in production. That vastly trims the logits and focuses the model explicitly on performing tasks. Well trained, produc
·news.ycombinator.com·
Why LLMs still have problems with OCR | Hacker News
4. Super Models
4. Super Models
We strive to regularly add new elements and new technologies for our to enhance the power of Transkribus. One of these technological elements are the Super Models for text recognition, which are the most advanced models we can offer so far.
·help.transkribus.org·
4. Super Models
JaidedAI/EasyOCR: Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc.
JaidedAI/EasyOCR: Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc.
Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc. - JaidedAI/EasyOCR
·github.com·
JaidedAI/EasyOCR: Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc.
LlamaOCR - Building your Own Private OCR System - YouTube
LlamaOCR - Building your Own Private OCR System - YouTube
The video demonstrates LlamaOCR, an OCR tool leveraging the Llama 3.2 visual model. It focuses on the tool's ability to convert images and scanned documents into structured Markdown, preserving the original formatting of elements like tables, lists, and spreadsheets. The video covers practical usage examples, offering tutorials and code snippets in both JavaScript and Python within a Colab environment. For more tutorials on using LLMs and building agents, check out my Patreon Patreon: https://www.patreon.com/SamWitteveen Twitter: https://twitter.com/Sam_Witteveen Colab: https://drp.li/WpdNm 🕵️ Interested in building LLM Agents? Fill out the form below Building LLM Agents Form: https://drp.li/dIMes ⏱️Time Stamps: 00:00 LlamaOCR Project 00:56 Demo Using their Site 02:43 Colab Demo 04:40 Together.AI Docs 06:06 Pricing 09:16 Python OCR Version 11:20 Thai OCR Project 16:30 Patreon
·youtube.com·
LlamaOCR - Building your Own Private OCR System - YouTube
Recraft V3
Recraft V3
Recraft are a generative AI design tool startup based out of London who released their v3 model a few weeks ago. It's currently sat at the top of the [Artificial …
·simonwillison.net·
Recraft V3