Tools to Improve Training Data - Talking Language AI Ep#2
Vincent Warmerdam builds a lot of NLP tools (https://github.com/koaning). Many of them target the scikit-learn ecosystem, and labeling is a recurring theme. A recent focus of his tool stack is improving training data. In this video, Vincent and Jay discuss a few of these tools and show how they work together.
These tools are discussed in the video:
- Human-learn: a toolkit to build human-based scikit-learn components
- Doubtlab: a toolkit to help find doubtful labels in data
- Embetter: a library that makes it very easy to use embeddings in scikit-learn
- Bulk: a tool that uses embeddings to make bulk labeling possible
The talk includes live demos of each tool and shows how a few simple tricks can go a long way.
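For readers who want a feel for the "doubtful labels" idea before watching, here is a minimal sketch using plain scikit-learn. It does not use the doubtlab API; the function name `find_doubtful_labels`, the 0.5 threshold, and the synthetic data are all assumptions made for illustration. The idea is simply to flag rows where a cross-validated model gives low probability to the label the row was assigned.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict


def find_doubtful_labels(X, y, threshold=0.5):
    """Hypothetical helper: indices of rows whose assigned label looks doubtful.

    A row is flagged when the out-of-fold probability for its assigned
    label is below `threshold`. Most doubtful rows come first.
    """
    model = LogisticRegression(max_iter=1000)
    # Out-of-fold probabilities, so no row is scored by a model that saw it.
    proba = cross_val_predict(model, X, y, cv=5, method="predict_proba")
    classes = np.unique(y)
    # Column of `proba` that corresponds to each row's assigned label.
    label_idx = np.searchsorted(classes, y)
    confidence = proba[np.arange(len(y)), label_idx]
    doubtful = np.where(confidence < threshold)[0]
    return doubtful[np.argsort(confidence[doubtful])]


# Synthetic data (an assumption for the demo): clean labels, then flip a few.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_clusters_per_class=1, class_sep=2.0, flip_y=0.0, random_state=0,
)
rng = np.random.default_rng(0)
flipped = rng.choice(len(y), size=15, replace=False)
y_noisy = y.copy()
y_noisy[flipped] = 1 - y_noisy[flipped]

doubtful = find_doubtful_labels(X, y_noisy)
found = sorted(set(doubtful.tolist()) & set(flipped.tolist()))
print(f"flagged {len(doubtful)} rows, {len(found)} of them deliberately flipped")
```

The flagged rows are candidates for a human to review, not confirmed mistakes; a low-confidence row may just be a hard example.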
===
Join the Cohere Discord: https://discord.gg/co-mmunity
Discussion thread for this episode (feel free to ask questions):
https://discord.com/channels/954421988141711382/1042163984817721527
Vincent on Twitter: https://twitter.com/fishnets88
human-learn: Natural Intelligence is still a pretty good idea.
https://koaning.github.io/human-learn/index.html
https://github.com/koaning/human-learn/
doubtlab: Doubt your data, find bad labels.
https://koaning.github.io/doubtlab/
https://github.com/koaning/doubtlab
embetter: just a bunch of useful embeddings
https://github.com/koaning/embetter
bulk: A Simple Bulk Labelling Tool
https://github.com/koaning/bulk
Calmcode: Code. Simply. Clearly. Calmly.
Video tutorials for modern ideas and open source tools.
https://calmcode.io/
About The Speaker:
Vincent has worked as an engineer, consultant, researcher, team lead, and educator. Currently, he works as a Machine Learning Engineer at Explosion, the company behind spaCy and Prodi.gy. In addition to his work at Explosion, he maintains many scikit-learn-related plugins as well as a popular learning resource at calmcode.io. He's also a frequent speaker at conferences, where he defends common sense over the hype in ML.
===
Contents
0:00 Introduction
3:06 Tools for Data Quality
9:18 human-learn: Natural Intelligence is still a pretty good idea.
12:28 human-learn: demo
27:11 doubtlab: Doubt your data, find bad labels.
42:35 embetter: just a bunch of useful embeddings
46:16 embetter: demo
58:10 bulk: A Simple Bulk Labelling Tool
1:00:20 bulk demo: exploring text data
1:10:47 bulk demo: exploring images
1:16:20 Why use the scikit-learn API? What are the benefits and limitations?
1:17:22 Programmer productivity tips