A lot of problems jump out at me in this article, particularly the explanation of multimodal LLMs. I'll say that I _do_ agree with the thrust of the article: don't trust LLMs. But it should have argued the legitimate issues with VLM-based OCR rather than claim that VLMs are somehow fundamentally flawed.

> LLMs process images through high-dimensional embeddings, essentially creating abstract representations that prioritize semantic understanding over precise character recognition.

This isn't true. CLIP and its derivatives don't prioritize semantic understanding. They are trained contrastively, which (very roughly speaking) means they need to be able to differentiate similar images. If two images are just white with a few words on them, the only way to differentiate them is to include the text in the embedding.

Pretrained CLIP models do tend to be a bit lossy in this department, but not by as much as you would think, considering they boil an entire image down to something on the order of 768 floats.

> Each step in this pipeline optimizes for semantic meaning while discarding precise visual information.

Again, that doesn't make sense. It's a bit foolhardy to even say _what_ these models do, given that not even the most brilliant ML researchers know. But the broad _hypothesis_ is that the CLIP pipeline optimizes for pairing images with their captions amongst a large number of possibilities. Which, again, requires the model to surface all kinds of information from the image, and often requires surfacing specific text from the image. How else would it differentiate PowerPoint slides? Math problems in images? Etc.

> Fixed patch sizes may split individual characters

This doesn't matter; we know that from empirical evidence. And even if it _did_, there are plenty of vision models that use overlapping patches.

> Position embeddings lose fine-grained spatial relationships

This isn't true. The model is fully aware of the position of pixels within each patch, and the position embedding merely tells it where the patches themselves sit within the image. It can therefore derive the absolute position of every pixel if it needs to. In fact, we have proof that models can and do.

> losing the ability to have human-in-the-loop evaluations, confidence scores, and bounding box outputs.

You get confidence scores for free, because the model is explicitly trained to produce cosine similarity scores.

OWLv2 is a CLIP-based open-vocabulary bounding box model (from Google, makers of Gemini). It's finetuned from a standard, pretrained CLIP model. There's nothing really special about the vision architecture; it just gets finetuned to output bounding boxes. And it beats the pants off YOLO while being open vocabulary to boot. So not only are CLIP-like models capable of outputting bounding boxes, but OWLv2 was trained with human-in-the-loop processes and it outputs confidence scores.

Oh, and there's Florence, a VLM trained to output bounding boxes.
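To make the cosine-similarity point concrete, here's a minimal sketch of the scoring CLIP is contrastively trained for, using the Hugging Face `transformers` API. The image path and candidate captions are made-up placeholders; note that this base checkpoint compresses the image to 512 floats (the ~768 figure above corresponds to the larger ViT-L variants).

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scanned_page.png")  # hypothetical document image
candidates = [
    "a slide that says 'Quarterly Results 2023'",
    "a slide that says 'Quarterly Results 2028'",
]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image are scaled cosine similarities between the image embedding
# (512 floats for this checkpoint) and each candidate text embedding.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```

The two captions above differ only in the printed text, so the only way the model can score them differently is if the text made it into the embedding. And those similarity scores are effectively the "confidence score" the article claims you lose.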
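And a similar sketch for the OWLv2 point: bounding boxes plus confidence scores out of a CLIP-derived, open-vocabulary detector. The checkpoint name and post-processing call are from memory of the `transformers` API, and the image and queries are invented, so treat it as a sketch rather than gospel.

```python
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

image = Image.open("invoice.png")  # hypothetical document image
queries = [["a handwritten signature", "a company logo", "a table of line items"]]

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Turn raw outputs into (box, score, label) triples in image coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.2, target_sizes=target_sizes
)[0]

for box, score, label in zip(results["boxes"], results["scores"], results["labels"]):
    print(queries[0][int(label)], f"{score:.2f}", [round(v, 1) for v in box.tolist()])
```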
> Favor common words over exact transcription

Nothing about LLMs indicates that. In fact, pretrained LLMs favor exact transcription.

> "Correct" perceived errors in the source document

Which OCR systems need to do to be useful for many applications. I get the argument that LLMs are a black box in this regard, which is a legitimate criticism, but correcting mistakes is not fundamentally the issue. It's better to say that LLMs _blindly_ correct issues, whereas a traditional OCR system can report "this is my exact transcription, and here is my correction" and expose various knobs for tweaking thresholds. But there's no reason VLMs can't do that too.

> Merge or reorder information based on learned patterns

LLMs are perfectly capable of regurgitating data verbatim. That's perhaps the first thing they learn to do to get the loss down, and it's what all long-context models are benchmarked against.

> Produce different outputs for the same input due to sampling

You can turn off sampling, and then they are deterministic. Or you can surface the logits to the user, which effectively gives you confidence scores on the transcription (sketched at the end of this comment).

A well trained LLM for this task isn't really "probabilistic" in the sense that its outputs are completely different each time. If it's trained and prompted specifically to transcribe a document, that's what it's going to do. Any variation in output at that point is the result of real vagaries in the document, the vision encoder, or the user request.

If a user wants consistency, they merely need to ask for it. Or the VLM needs to be trained better. In either case, these models are _capable_ of it.

It's most important to note here that, outside of pretrained base models, all the LLMs that users interact with are reinforcement trained. So while they were trained on next-token prediction during _pretraining_, the models that reach production have been trained to seek reward. That vastly trims the logits and focuses the model explicitly on performing tasks. Well trained, production models reflect that focus.
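Here's the decoding sketch referenced above: with sampling turned off, generation is greedy and deterministic, and the per-token probabilities are right there if you want to surface them as confidence scores. I'm using a tiny GPT-2 checkpoint purely so it runs anywhere; the mechanics are identical for a production VLM's language decoder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Transcribe the following exactly: Invoice #10472, total $1,294.50\n"
inputs = tokenizer(prompt, return_tensors="pt")

# do_sample=False means greedy decoding: the same input always yields the same output.
out = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,
    return_dict_in_generate=True,
    output_scores=True,
)

# Per-token "confidence": the probability the model assigned to each emitted token.
transition_scores = model.compute_transition_scores(
    out.sequences, out.scores, normalize_logits=True
)
generated = out.sequences[0, inputs["input_ids"].shape[1]:]
for tok, logprob in zip(generated, transition_scores[0]):
    print(repr(tokenizer.decode(int(tok))), f"p={torch.exp(logprob).item():.3f}")
```

A low per-token probability on, say, a digit is exactly the kind of signal a human-in-the-loop pipeline can flag for review, which is the thing the article says VLM-based OCR cannot provide.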