Telegram: Contact @nexus_search
📀 AI Dataset libstc3 📀

What?
- More than 600,000 full texts of fiction and non-fiction books
- More than 8,000,000 full texts of scholarly publications, magazines, and manuals
- More than 5,000,000 US patents
- 164,000,000 metadata records

Download
https://twitter.com/ultranymous/status/1811146035860849115

Format
Zstd-compressed file, JSON lines, one per book/publication, 356GB
- abstract, content - description and content in Markdown format
- issued_at - time the object itself was issued (not the record)
- metadata - ISBNs, publishers, series, etc.
- id - identifier in external systems, if applicable (e.g. DOI)
Other fields should be self-descriptive.

Comments:
July 2024 edition of the LibSTC dataset. Previous version may be found here. This dataset contains a much larger amount of high-quality texts (extracted with GROBID + Nougat OCR) suitable for training models.

Reposting and seeding are welcome!
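
For anyone who wants to peek at the records without unpacking all 356GB, here is a minimal Python sketch of streaming a Zstd-compressed JSON lines file. It assumes the third-party `zstandard` package is installed. Only the field names (`abstract`, `content`, `issued_at`, `metadata`, `id`) come from the post; the helper names are hypothetical, and the exact types and nesting of each field are not specified above, so the code treats them as opaque.

```python
import io
import json
from typing import Any, Dict, Iterable, Iterator

# Field names as listed in the post; their exact types are an assumption.
DOCUMENTED_FIELDS = ("abstract", "content", "issued_at", "metadata", "id")


def open_zstd_lines(path: str) -> Iterator[str]:
    """Yield decoded text lines from a Zstd file, decompressing as it reads."""
    import zstandard  # third-party: pip install zstandard

    with open(path, "rb") as raw:
        reader = zstandard.ZstdDecompressor().stream_reader(raw)
        with io.TextIOWrapper(reader, encoding="utf-8") as text:
            for line in text:
                yield line


def iter_records(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Parse JSON lines: one record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(record: Dict[str, Any]) -> Dict[str, Any]:
    """Report which documented fields are set and the length of `content`."""
    content = record.get("content")
    return {
        "fields_present": [f for f in DOCUMENTED_FIELDS if record.get(f) is not None],
        # Assumes `content` is a string when present; anything else counts as 0.
        "content_chars": len(content) if isinstance(content, str) else 0,
    }
```

To look at the first few records, something like `for r in itertools.islice(iter_records(open_zstd_lines(path)), 5): print(summarize(r))` should work, where `path` points at the downloaded file (its actual name is not given in the post). Reading line by line keeps memory use small, since the file is never fully decompressed at once.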