Reader-LM: Small Language Models for Cleaning and Converting HTML to Markdown
Reader-LM-0.5B and Reader-LM-1.5B are two novel small language models inspired by Jina Reader, designed to convert raw, noisy HTML from the open web into clean Markdown.
Running OCR against PDFs and images directly in your browser
I attended the Story Discovery At Scale data journalism conference at Stanford this week. One of the perennial hot topics at any journalism conference concerns data extraction: how can we …
GitHub - freedmand/textra: A command-line application to convert images, PDFs, and audio files to text using Apple's APIs
A command-line application to convert images, PDFs, and audio files to text using Apple's APIs