LLM SVG Generation Benchmark
Has Google Quietly Solved Two of AI’s Oldest Problems?
A mysterious new model currently in testing on Google’s AI Studio is nearly perfect at automated handwriting recognition, but it is also showing signs of spontaneous, abstract, symbolic reasoning.
Nano Banana can be prompt engineered for extremely nuanced AI image generation
Max Woolf provides an exceptional deep dive into Google's Nano Banana, aka Gemini 2.5 Flash Image, still the best available image-manipulation LLM tool three months after its initial …
Awesome-Nano-Banana-images/README_en.md at main · PicoTrex/Awesome-Nano-Banana-images
A curated collection of fun and creative examples generated with Nano Banana 🍌, a Gemini-2.5-flash-image-based model. This repository showcases diverse AI-generated visuals and prompts, highlighting t...
Testing VLMs and LLMs for robotics w/ the Jetson Thor devkit
Exploring the Jetson Thor devkit w/ some local LLMs and VLMs.
More info on the Jetson Thor Devkit: https://nvda.ws/45xIU4B
Neural Networks from Scratch book: h...
rednote-hilab/dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model
Multilingual Document Layout Parsing in a Single Vision-Language Model - rednote-hilab/dots.ocr
How Docling turns documents into usable AI data
Wanting to use your personal or organizational data in AI workflows, but it's stuck in PDFs and other document formats? Docling is here to help. Docling is an ...
What Is Docling? Transforming Unstructured Data for RAG and AI
Ready to become a certified Architect - Cloud Pak for Data? Register now and use code IBMTechYT20 for 20% off of your exam → https://ibm.biz/BdeXNR
Learn more...
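As a concrete picture of the workflow the two Docling items above describe, here is a minimal sketch based on my recollection of Docling's documented Python API; treat the exact names as assumptions rather than confirmed details and check the docs linked above.

```python
# Minimal Docling sketch (names recalled from Docling's docs; may differ between versions).
# Assumes `pip install docling` and a PDF on disk (a URL should also work).
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")      # parse the document's layout and content
print(result.document.export_to_markdown())   # LLM/RAG-ready Markdown output
```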
Google’s new AI model creates video game worlds in real time
Google is investing a lot into AI world models.
reducto/RolmOCR · Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
Contextualizing ancient texts with generative neural networks - Nature
Aeneas, a generative neural network trained on ancient texts, helps historians contextualize inscriptions and perform epigraphic tasks, offering an improved starting point for historical research.
But how do AI videos actually work? | Guest video by @WelchLabsVideo
Diffusion models, CLIP, and the math of turning text into images
Welch Labs Book: https://www.welchlabs.com/resources/imaginary-numbers-book
Sections
0:00 - Intro
3:37 - CLIP
6:25 - Shared Embedding Space
8:16 - Diffusion Models & DDPM
11:44 - Learning Vector Fields
22:00 - DDIM
25:25 - Dall E 2
26:37 - Conditioning
30:02 - Guidance
33:39 - Negative Prompts
34:27 - Outro
35:32 - About guest videos + Grant’s Reaction
Special Thanks to:
Jonathan Ho - Jonathan is the author of the DDPM paper and the classifier-free guidance paper.
https://arxiv.org/pdf/2006.11239
https://arxiv.org/pdf/2207.12598
Preetum Nakkiran - Preetum has an excellent introductory diffusion tutorial:
https://arxiv.org/pdf/2406.08929
Chenyang Yuan - Many of the animations in this video were implemented using manim and Chenyang’s smalldiffusion library: https://github.com/yuanchenyang/smalldiffusion
Chenyang also has a terrific tutorial and an MIT course on diffusion models:
https://www.chenyang.co/diffusion.html
https://www.practical-diffusion.org/
Other References
All of Sander Dieleman’s diffusion blog posts are fantastic: https://sander.ai/
CLIP Paper: https://arxiv.org/pdf/2103.00020
DDIM Paper: https://arxiv.org/pdf/2010.02502
Score-Based Generative Modeling: https://arxiv.org/pdf/2011.13456
Wan2.1: https://github.com/Wan-Video/Wan2.1
Stable Diffusion: https://huggingface.co/stabilityai/stable-diffusion-2
Midjourney: https://www.midjourney.com/
Veo: https://deepmind.google/models/veo/
DallE 2 paper: https://cdn.openai.com/papers/dall-e-2.pdf
Code for this video: https://github.com/stephencwelch/manim_videos/tree/master/_2025/sora
Written by: Stephen Welch, with very helpful feedback from Grant Sanderson
Produced by: Stephen Welch, Sam Baskin, and Pranav Gundu
Technical Notes
The noise videos in the opening have been passed through a VAE (actually, the diffusion process happens in a compressed “latent” space), which acts very much like a video compressor - this is why the noise videos don’t look like pure salt-and-pepper noise.
6:15 CLIP: Although directly minimizing cosine similarity would push our vectors 180 degrees apart on a single batch, in practice CLIP needs to maximize the uniformity of concepts over the hypersphere it operates on. For this reason, we animated these vectors as orthogonal-ish. See: https://proceedings.mlr.press/v119/wang20k/wang20k.pdf
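For reference, the uniformity objective from the Wang & Isola paper linked above can be written as follows (my transcription of their formula, not something shown in the video). Minimizing it spreads embeddings roughly uniformly over the hypersphere, which is why orthogonal-ish vectors are the better mental picture than antipodal ones.

```latex
% Uniformity objective from Wang & Isola (2020); f maps inputs to the unit hypersphere,
% t > 0 is a temperature, and x, y are i.i.d. samples from the data distribution.
\mathcal{L}_{\mathrm{uniform}}(f; t) \;=\; \log \, \mathbb{E}_{x, y \sim p_{\mathrm{data}}}
  \left[ e^{-t \, \lVert f(x) - f(y) \rVert_2^2} \right]
```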
Per Chenyang Yuan: at 10:15, the blurry image that results when removing random noise in DDPM is probably due to a mismatch in noise levels when calling the denoiser. When the denoiser is called on x_{t-1} during DDPM sampling, it is expected to have a certain noise level (let's call it sigma_{t-1}). If you generate x_{t-1} from x_t without adding noise, then the noise present in x_{t-1} is always smaller than sigma_{t-1}. This causes the denoiser to remove too much noise, thus pointing towards the mean of the dataset.
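To make that mismatch concrete, here is the standard DDPM ancestral sampling step in the epsilon-prediction notation of the DDPM paper linked above (a sketch for reference, not code from the video; the note above describes the same thing in sigma-based terms). Skipping the final σ_t z term is exactly the "no added noise" case, so x_{t-1} carries less noise than the denoiser expects at the next call.

```latex
% DDPM ancestral sampling step (Algorithm 2 of the DDPM paper linked above):
% eps_theta is the learned denoiser, alpha_t and \bar{alpha}_t come from the noise schedule,
% and z is fresh Gaussian noise added at every step except the last.
x_{t-1} \;=\; \frac{1}{\sqrt{\alpha_t}}
  \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_\theta(x_t, t) \right)
  + \sigma_t z, \qquad z \sim \mathcal{N}(0, I)
```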
The text conditioning input to stable diffusion is not the 512-dim text embedding vector, but the output of the layer before that, [with dimension 77x512](https://stackoverflow.com/a/79243065)
For the vectors at 31:40 - Some implementations use f(x, t, cat) + alpha(f(x, t, cat) - f(x, t)), and others use f(x, t) + alpha(f(x, t, cat) - f(x, t)), where an alpha value of 1 corresponds to no guidance. I chose the second format here to keep things simpler.
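A quick check (my own algebra, not from the video) that the two formats differ only by shifting alpha by one, which is why alpha = 0 in the first format and alpha = 1 in the second both reduce to the plain conditional prediction, i.e. no guidance:

```latex
% Expanding the second format with guidance scale (1 + alpha) recovers the first:
f(x, t) + (1 + \alpha)\,\bigl(f(x, t, \mathrm{cat}) - f(x, t)\bigr)
  \;=\; f(x, t, \mathrm{cat}) + \alpha\,\bigl(f(x, t, \mathrm{cat}) - f(x, t)\bigr)
```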
At 30:30, the unconditional t=1 vector field looks a bit different from what it did at the 17:15 mark. This is the result of different models trained for different parts of the video, and likely a result of different random initializations.
Premium Beat Music ID: EEDYZ3FP44YX8OWT
MonoQwen-Vision, the first visual document reranker - LightOn
We introduce MonoQwen2-VL-v0.1, the first visual document reranker, designed to enhance the quality of retrieved visual documents and take these pipelines to the next level. Reranking a small number of candidates with MonoQwen2-VL-v0.1 achieves top results on the ViDoRe leaderboard.
Introducing Gemma 3n: The developer guide
Learn how to build with Gemma 3n, which features a mobile-first architecture, MatFormer technology, Per-Layer Embeddings, and new audio and vision encoders.
Introducing Gemma 3n: The developer guide
Extremely consequential new open weights model release from Google today: Multimodal by design: Gemma 3n natively supports image, audio, video, and text inputs and text outputs. Optimized for on-device: Engineered …
How OpenElections Uses LLMs – Derek Willis
The OpenElections project collects detailed election data for the USA, all the way down to the precinct level. This is a surprisingly hard problem: while county and state-level results are …
nanonets/Nanonets-OCR-s · Hugging Face
Introducing the unified multi-modal MLX engine architecture in LM Studio
Leveraging `mlx-lm` and `mlx-vlm` to achieve unified multi-modal LLM inference in LM Studio's `mlx-engine`.
ollama-ocr
OCR package using Ollama vision language models.
Passing Images to a Vision-Language Model in Ollama | by Manyi | Apr,…
Trying out llama.cpp’s new vision support
This llama.cpp server vision support via libmtmd pull request—via Hacker News—was merged earlier today. The PR finally adds full support for vision models to the excellent llama.cpp project. It’s documented …
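For a sense of what the merged vision support enables, here is a hedged sketch of calling a locally running `llama-server` through its OpenAI-compatible endpoint. It assumes you started the server yourself with a vision model and its `--mmproj` projector file, that it listens on port 8080, and that your build accepts OpenAI-style `image_url` content parts; all of those are assumptions about your setup rather than details from the post.

```python
import base64
import requests

# Assumes llama-server is already running locally with a vision model, e.g.:
#   llama-server -m model.gguf --mmproj mmproj.gguf --port 8080
# (exact flags depend on your llama.cpp version -- check `llama-server --help`)

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image."},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```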
Peekaboo MCP – lightning-fast macOS screenshots for AI agents | Peter Steinberger
Turn your blind AI into a visual debugger with instant screenshot capture and analysis
Agentic Document Extraction: 17x Faster, Smarter, with LLM-Ready Outputs
Agentic Document Extraction just got faster! We've improved the median document processing time from 135 seconds to 8 seconds!
Agentic Document Extraction sees documents visually and uses an iterative workflow to accurately extract text, figures, form fields, charts, and more to create an LLM-ready output.
You can use our SDK to parse complex documents and get the extracted content in Markdown and JSON. You can then feed the output to an LLM, RAG application, or other downstream apps (a usage sketch follows the links below).
You can also use our Playground to test out Agentic Document Extraction.
Try out Agentic Document Extraction:
- Playground: https://va.landing.ai/demo/doc-extraction
- Library: https://github.com/landing-ai/agentic-doc
Learn more: https://landing.ai/agentic-document-extraction
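For the SDK mentioned above, a hypothetical sketch: the `agentic_doc.parse` entry point, the `.markdown` attribute, and the `VISION_AGENT_API_KEY` variable are my best-guess recollections of the library's README, not confirmed details, so check the GitHub repo linked above before relying on them.

```python
# Hypothetical agentic-doc usage sketch; function and attribute names are assumptions.
# Assumes a LandingAI API key exported in the environment (e.g. VISION_AGENT_API_KEY).
from agentic_doc.parse import parse  # assumed entry point

results = parse("invoice.pdf")       # assumed: accepts a path and returns parsed documents
doc = results[0]
print(doc.markdown)                  # assumed: Markdown rendering of the extracted content
# The parsed output can then be fed to an LLM or a RAG pipeline.
```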
The best open source OCR models
AI-Powered Handwriting Recognition with ML Techniques
AI-Powered Handwriting Recognition with Machine Learning Techniques.
Private Local LlamaOCR with a User-Friendly Streamlit Front-End
Optical Character Recognition (OCR) is a powerful tool for extracting text from images, and with the rise of multimodal AI models, it's now easier than ever to implement locally. In this guide, we'll show you how to build a professional OCR application using Llama 3.2-Vision, Ollama for the backend, and Streamlit for the front end.
Prerequisites
Before we start, ensure you have the following:
1. Python 3.10 or higher installed.
2. Anaconda (Optional)
3. Ollama installed for local model hosting. Downl
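Along the lines of the guide described above (not the article's exact code), here is a minimal sketch assuming Ollama is running locally with the llama3.2-vision model pulled and that the `ollama` and `streamlit` packages are installed.

```python
# Minimal local OCR app sketch: Llama 3.2-Vision via Ollama, Streamlit front end.
# Assumes `ollama pull llama3.2-vision` has been run and the Ollama server is up.
import ollama
import streamlit as st

st.title("Local OCR with Llama 3.2-Vision")

uploaded = st.file_uploader("Upload an image", type=["png", "jpg", "jpeg"])
if uploaded is not None:
    st.image(uploaded, caption="Input image")
    if st.button("Extract text"):
        response = ollama.chat(
            model="llama3.2-vision",
            messages=[{
                "role": "user",
                "content": "Transcribe all text in this image as Markdown.",
                "images": [uploaded.read()],  # raw image bytes are accepted
            }],
        )
        st.markdown(response["message"]["content"])
```

Save it as app.py and launch it with `streamlit run app.py`, then upload an image to get a Markdown transcription back.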
Handwritten Text Recognition using OCR
In this article, we carry out handwritten text recognition using OCR. We fine-tune the TrOCR model on the GNHK dataset.
Raycast AI as Translator
A compelling use case for AI: a Japanese-to-English translator that gives me a translation, a breakdown of the Chinese characters in a Japanese phrase, and the ability to ask follow-up questions.
Way Enough - Local VLMs Have Improved