Tools to Improve Training Data - Talking Language AI Ep#2
Vincent Warmerdam builds a lot of NLP tools (https://github.com/koaning). Many of them target the scikit-learn ecosystem, and labeling is a recurring theme. A recent focus of his tool stack is improving training data. In this video, Vincent and Jay discuss a few of these tools and show how they work together.
These tools are discussed in the video:
- Human-learn: a toolkit to build human-based scikit-learn components
- Doubtlab: a toolkit to help find doubtful labels in data
- Embetter: a library that makes it very easy to use embeddings in scikit-learn
- Bulk: a library that uses embeddings to make bulk labeling possible
The talk includes live demos of each tool and shows how some simple tricks can go a long way.
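To give a feel for the "doubtful labels" idea mentioned above, here is a minimal plain-Python sketch. It deliberately does not use doubtlab's API: the function name `find_doubtful`, the `min_confidence` threshold, and the toy data are all assumptions made for illustration. It flags rows where a model's prediction disagrees with the given label or where the model is not very confident, and puts rows with more reasons for doubt first.

```python
# Illustrative sketch only: NOT doubtlab's API. All names here are hypothetical.

def find_doubtful(labels, probas, classes, min_confidence=0.6):
    """Return row indices whose label deserves a second look, most doubtful first.

    labels  -- the (possibly wrong) label assigned to each row
    probas  -- per-row class probabilities from some already-trained model
    classes -- class names, in the same order as the probability columns
    """
    scored = []
    for i, (label, row) in enumerate(zip(labels, probas)):
        best = max(range(len(row)), key=row.__getitem__)
        reasons = 0
        if classes[best] != label:      # reason 1: model disagrees with the label
            reasons += 1
        if row[best] < min_confidence:  # reason 2: model is unsure about this row
            reasons += 1
        if reasons:
            scored.append((reasons, i))
    # More reasons for doubt first; ties keep the original row order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored]


classes = ["neg", "pos"]
labels = ["pos", "neg", "pos", "neg"]
probas = [[0.10, 0.90], [0.45, 0.55], [0.80, 0.20], [0.95, 0.05]]
doubtful_rows = find_doubtful(labels, probas, classes)
print(doubtful_rows)
```

The rows returned are candidates for manual review, not confirmed mistakes; the threshold is a knob to tune per dataset.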
===
Join the Cohere Discord: https://discord.gg/co-mmunity
Discussion thread for this episode (feel free to ask questions):
https://discord.com/channels/954421988141711382/1042163984817721527
Vincent on Twitter: https://twitter.com/fishnets88
human-learn: Natural Intelligence is still a pretty good idea.
https://koaning.github.io/human-learn/index.html
https://github.com/koaning/human-learn/
doubtlab: Doubt your data, find bad labels.
https://koaning.github.io/doubtlab/
https://github.com/koaning/doubtlab
embetter: just a bunch of useful embeddings
https://github.com/koaning/embetter
bulk: A Simple Bulk Labelling Tool
https://github.com/koaning/bulk
Calmcode: Code. Simply. Clearly. Calmly.
Video tutorials for modern ideas and open source tools.
https://calmcode.io/
About The Speaker:
Vincent has worked as an engineer, consultant, researcher, team lead, and educator. He currently works as a Machine Learning Engineer at Explosion, the company behind spaCy and Prodi.gy. In addition to his work at Explosion, he maintains many scikit-learn-related plugins as well as a popular learning resource at calmcode.io. He's also a frequent conference speaker, where he defends common sense over hype in ML.
===
Contents
0:00 Introduction
3:06 Tools for Data Quality
9:18 human-learn: Natural Intelligence is still a pretty good idea.
12:28 human-learn: demo
27:11 doubtlab: Doubt your data, find bad labels.
42:35 embetter: just a bunch of useful embeddings
46:16 embetter: demo
58:10 bulk: A Simple Bulk Labelling Tool
1:00:20 bulk demo: exploring text data
1:10:47 bulk demo: exploring images
1:16:20 Why use the scikit-learn API? What are the benefits and limitations?
1:17:22 Programmer productivity tips
Steamship Fellowship for Language AI at Writing Atlas - Airtable
Steamship is building a platform to put Language AI in the hands of all developers. Plympton is partnering with Steamship on a fellowship program to explore the intersection of tech & literature.
You’ll join a two-month-long fellowship in which we’ll coach you through building new features for Writing Atlas. We’ve planned each feature for a smooth development experience, and the results will be something you can show off in production.
bb15 art space is seeking proposals for its 2023-2024 program around the open-ended theme of friction. Friction is against sameness, the continuum, the immediate. An emerging friction defeats expectations, questions the agreement and is itchy. Frictioning implies resistance, going against a certain motion. bb15 is looking for projects that bring tension, distortion and counter-current into the flow of things.
Future Art Ecosystems 3: Art x Decentralised Tech Launch - Serpentine Galleries
A presentation of Future Art Ecosystems 3 (FAE3) – Serpentine Arts Technologies’ annual strategic briefing. FAE3 provides new learnings around decentralised technologies.
We need to talk about the problems with generative AI...
Stable Diffusion and DALLE 2 are disrupting our media landscape and there’s a need for critical discussion. Intellectual property, automation, deception and ...
Case studies DFF supports strategic litigation to advance digital rights in Europe, by providing financial support for strategic court cases and catalysing collaboration between those working to advance digital rights. Strategic litigation – litigation with broad impact and which can bring about legislative or policy change – has proven to be…
Activists in the internet health movement will gather at MozFest to move the needle in tech, art, social responsibility, ethics, and AI. Will you join us in March 2023?
While machine learning—computer programming designed for taxonomic patterning—offers useful insights into racism and racist behavior, a gap is present in the relationship between machine learning and its connection to the racial history of science and the Black lived experience.
We develop visions and projects with the goal to create more equitable futures. We do research, build networks and shape narratives. Superrr is playful, visionary and feminist.
A radical pedagogy notebook that parses the “experiments in study, collective learning, and unlearning” at JUNGE AKADEMIE’s AI Anarchies Autumn School.
GitHub - neefrehman/millzbot: A GPT-2 bot trained on my bosses tweets, and a guide to making your own
“Is 65% spoon-ness a spoon?” “Something truly smart might rather do something unexpected.” Simone Rebaudengo is exploring the way we are living and interacting...
“Lager networks might have a slight conscious.” Maya Indira Ganesh is a cultural scientist who explores the poetics and policies of AI and metaphors, asking...