Training Data for the Price of a Sandwich: Common Crawl’s Impact on Generative AI
Common Crawl does not contain the “entire web,” nor a representative sample of it. Despite its size, there are important limitations on how much of the web is covered. The crawling process is almost entirely automated and prioritizes pages on domains that are frequently linked to, which makes domains related to digitally marginalized communities less likely to be included. Language and regional coverage are strongly skewed toward English-language content. Moreover, a growing number of relevant domains, such as Facebook and the New York Times, block Common Crawl from crawling most (or all) of their pages.

When used as a source for AI training, Common Crawl should be handled with care, but such care is often lacking. Because of Common Crawl’s deliberate lack of curation, AI builders do not use it directly as training data for their models. Instead, they choose from a variety of filtered Common Crawl versions to train their LLMs. However, there is little reflection among AI builders about the limitations and biases of Common Crawl’s archive. Popular filtered versions are especially problematic when used to train LLMs for end-user products because the filtering techniques behind them are simplistic, often focused on removing pornography or boilerplate text such as the names of navigation menu items, while leaving many other types of problematic content untouched; the sketch at the end of this section illustrates why.

Common Crawl and AI builders share responsibility for making generative AI more trustworthy. While Common Crawl was never primarily about providing AI training data, it now positions itself as an important building block for LLM development, yet it continues to provide a source that AI builders need to filter before model training. Both groups can help make generative AI more trustworthy in their own ways. Common Crawl should better highlight the limitations and biases of its data and be more transparent and inclusive about its governance. It could also promote transparency around generative AI by requiring AI builders to attribute their use of Common Crawl. AI builders should put more effort into filtering out more types of problematic content and better account for the various cultural contexts in which their generative AI products are deployed. There is also a need for industry standards and best practices for end-user products to reduce potential harms when using Common Crawl or similar sources for training data. In addition, AI builders should create or support dedicated intermediaries tasked with filtering Common Crawl in transparent, accountable, and continuously updated ways. Long term, there should be less reliance on sources like Common Crawl and a greater emphasis on training generative AI on datasets created and curated by people in equitable and transparent ways.
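To make the point about simplistic filtering concrete, the following is a minimal, hypothetical sketch of the kind of keyword-blocklist and boilerplate heuristics that filtered Common Crawl derivatives often rely on. The blocklist terms, thresholds, and function names are illustrative assumptions rather than any specific pipeline’s code; the point is that such surface-level checks cannot catch misinformation, subtle hate speech, or biased framing.

```python
# Illustrative sketch only: a simplistic blocklist-and-boilerplate filter of the
# kind often applied to Common Crawl text. The blocklist terms, the five-word
# threshold, and the function names are hypothetical placeholders.

BLOCKLIST = {"pornography", "xxx"}  # placeholder terms; real lists are much longer


def keep_line(line: str) -> bool:
    """Drop likely boilerplate: short lines without sentence-ending punctuation,
    such as navigation menu items ('Home', 'About', 'Contact')."""
    line = line.strip()
    return len(line.split()) >= 5 and line.endswith((".", "!", "?", "\""))


def filter_document(text: str) -> str | None:
    """Return cleaned text, or None if the document is rejected outright."""
    # Reject the whole document if any blocklisted term appears anywhere.
    lowered = text.lower()
    if any(term in lowered for term in BLOCKLIST):
        return None
    # Otherwise keep only lines that pass the boilerplate heuristic.
    kept = [line for line in text.splitlines() if keep_line(line)]
    return "\n".join(kept) if kept else None


if __name__ == "__main__":
    sample = (
        "Home\n"
        "About\n"
        "This article repeats a misleading medical claim that no keyword list will catch.\n"
    )
    # The navigation lines are dropped, but the misleading claim passes untouched,
    # because the filter checks surface features of the text, not its meaning.
    print(filter_document(sample))
```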