AI & the Web: Understanding and managing the impact of Machine Learning models on the Web
This document proposes an analysis of the systemic impact of AI systems, and in particular ones based on Machine Learning models, on the Web, and the role that Web standardization may play in managing that impact.
It creates a systemic risk for content consumers: they may no longer be able to distinguish or discover authoritative or curated content in a sea of credible (but either accidentally or willfully wrong) generated content.
We do not know of any solution that could guarantee (e.g., through cryptography) that a given piece of content was or was not generated (partially or entirely) by AI systems.
A well-known issue with relying operationally on Machine Learning models is that they will integrate and possibly strengthen any bias ("systematic difference in treatment of certain objects, people or groups in comparison to others" [[ISO/IEC-22989]]) in the data that was used during their training.
Models trained on untriaged or partially triaged content from the Web are bound to include personally identifiable information (PII). The same is true for models trained on data that users have chosen to share (for public consumption or not) with service providers. These models can often be made to retrieve and share that information with any user who knows how to ask, which breaks expectations of privacy for those whose personal information was collected, and is likely to be in breach of privacy regulations in a number of jurisdictions.
A number of Machine Learning models have significantly lowered the cost of generating credible textual, audio, and video (real-time or recorded) impersonations of real persons. This creates significant risks of scaling up phishing and other types of fraud, and thus makes it much harder to establish trust in online interactions. If users no longer feel safe in their digitally-mediated interactions, the Web will no longer be able to play its role as a platform for these interactions.
Training and running Machine Learning models can prove very resource-intensive, in particular in terms of power and water consumption.
Some of the largest and most visible Machine Learning models are known or assumed to have been trained with materials crawled from the Web, without the explicit consent of their creators or publishers.