Search Saved

Found 3 bookmarks

Custom sorting

The AI trust crisis

The AI trust crisis 14th December 2023 Dropbox added some new AI features. In the past couple of days these have attracted a firestorm of criticism. Benj Edwards rounds it up in Dropbox spooks users with new AI features that send data to OpenAI when used. The key issue here is that people are worried that their private files on Dropbox are being passed to OpenAI to use as training data for their models—a claim that is strenuously denied by Dropbox. As far as I can tell, Dropbox built some sensible features—summarize on demand, “chat with your data” via Retrieval Augmented Generation—and did a moderately OK job of communicating how they work... but when it comes to data privacy and AI, a “moderately OK job” is a failing grade. Especially if you hold as much of people’s private data as Dropbox does! Two details in particular seem really important. Dropbox have an AI principles document which includes this: Customer trust and the privacy of their data are our foundation. We will not use customer data to train AI models without consent. They also have a checkbox in their settings that looks like this: Update: Some time between me publishing this article and four hours later, that link stopped working. I took that screenshot on my own account. It’s toggled “on”—but I never turned it on myself. Does that mean I’m marked as “consenting” to having my data used to train AI models? I don’t think so: I think this is a combination of confusing wording and the eternal vagueness of what the term “consent” means in a world where everyone agrees to the terms and conditions of everything without reading them. But a LOT of people have come to the conclusion that this means their private data—which they pay Dropbox to protect—is now being funneled into the OpenAI training abyss. People don’t believe OpenAI # Here’s copy from that Dropbox preference box, talking about their “third-party partners”—in this case OpenAI: Your data is never used to train their internal models, and is deleted from third-party servers within 30 days. It’s increasing clear to me like people simply don’t believe OpenAI when they’re told that data won’t be used for training. What’s really going on here is something deeper then: AI is facing a crisis of trust. I quipped on Twitter: “OpenAI are training on every piece of data they see, even when they say they aren’t” is the new “Facebook are showing you ads based on overhearing everything you say through your phone’s microphone” Here’s what I meant by that. Facebook don’t spy on you through your microphone # Have you heard the one about Facebook spying on you through your phone’s microphone and showing you ads based on what you’re talking about? This theory has been floating around for years. From a technical perspective it should be easy to disprove: Mobile phone operating systems don’t allow apps to invisibly access the microphone. Privacy researchers can audit communications between devices and Facebook to confirm if this is happening. Running high quality voice recognition like this at scale is extremely expensive—I had a conversation with a friend who works on server-based machine learning at Apple a few years ago who found the entire idea laughable. The non-technical reasons are even stronger: Facebook say they aren’t doing this. The risk to their reputation if they are caught in a lie is astronomical. As with many conspiracy theories, too many people would have to be “in the loop” and not blow the whistle. Facebook don’t need to do this: there are much, much cheaper and more effective ways to target ads at you than spying through your microphone. These methods have been working incredibly well for years. Facebook gets to show us thousands of ads a year. 99% of those don’t correlate in the slightest to anything we have said out loud. If you keep rolling the dice long enough, eventually a coincidence will strike. Here’s the thing though: none of these arguments matter. If you’ve ever experienced Facebook showing you an ad for something that you were talking about out-loud about moments earlier, you’ve already dismissed everything I just said. You have personally experienced anecdotal evidence which overrides all of my arguments here.

One consistent theme I’ve seen in conversations about this issue is that people are much more comfortable trusting their data to local models that run on their own devices than models hosted in the cloud. The good news is that local models are consistently both increasing in quality and shrinking in size.

#misinfo #conspiracies #privacy #tech #companies/OpenAI #companies/meta #data #ethics

·simonwillison.net·Oct 10, 2024

The AI trust crisis

The $2 Per Hour Workers Who Made ChatGPT Safer

The story of the workers who made ChatGPT possible offers a glimpse into the conditions in this little-known part of the AI industry, which nevertheless plays an essential role in the effort to make AI systems safe for public consumption. “Despite the foundational role played by these data enrichment professionals, a growing body of research reveals the precarious working conditions these workers face,” says the Partnership on AI, a coalition of AI organizations to which OpenAI belongs. “This may be the result of efforts to hide AI’s dependence on this large labor force when celebrating the efficiency gains of technology. Out of sight is also out of mind.”

This reminds me of [[On the Social Media Ideology - Journal 75 September 2016 - e-flux]]:<br>> Platforms are not stages; they bring together and synthesize (multimedia) data, yes, but what is lacking here is the (curatorial) element of human labor. That’s why there is no media in social media. The platforms operate because of their software, automated procedures, algorithms, and filters, not because of their large staff of editors and designers. Their lack of employees is what makes current debates in terms of racism, anti-Semitism, and jihadism so timely, as social media platforms are currently forced by politicians to employ editors who will have to do the all-too-human monitoring work (filtering out ancient ideologies that refuse to disappear).

Computer-generated text, images, video, and audio will transform the way countless industries do business, the most bullish investors believe, boosting efficiency everywhere from the creative arts, to law, to computer programming. But the working conditions of data labelers reveal a darker part of that picture: that for all its glamor, AI often relies on hidden human labor in the Global South that can often be damaging and exploitative. These invisible workers remain on the margins even as their work contributes to billion-dollar industries.

One Sama worker tasked with reading and labeling text for OpenAI told TIME he suffered from recurring visions after reading a graphic description of a man having sex with a dog in the presence of a young child. “That was torture,” he said. “You will read a number of statements like that all through the week. By the time it gets to Friday, you are disturbed from thinking through that picture.” The work’s traumatic nature eventually led Sama to cancel all its work for OpenAI in February 2022, eight months earlier than planned.

In the day-to-day work of data labeling in Kenya, sometimes edge cases would pop up that showed the difficulty of teaching a machine to understand nuance. One day in early March last year, a Sama employee was at work reading an explicit story about Batman’s sidekick, Robin, being raped in a villain’s lair. (An online search for the text reveals that it originated from an online erotica site, where it is accompanied by explicit sexual imagery.) The beginning of the story makes clear that the sex is nonconsensual. But later—after a graphically detailed description of penetration—Robin begins to reciprocate. The Sama employee tasked with labeling the text appeared confused by Robin’s ambiguous consent, and asked OpenAI researchers for clarification about how to label the text, according to documents seen by TIME. Should the passage be labeled as sexual violence, she asked, or not? OpenAI’s reply, if it ever came, is not logged in the document; the company declined to comment. The Sama employee did not respond to a request for an interview.

In February, according to one billing document reviewed by TIME, Sama delivered OpenAI a sample batch of 1,400 images. Some of those images were categorized as “C4”—OpenAI’s internal label denoting child sexual abuse—according to the document. Also included in the batch were “C3” images (including bestiality, rape, and sexual slavery,) and “V3” images depicting graphic detail of death, violence or serious physical injury, according to the billing document.

I haven't finished watching [[Severance]] yet but this labeling system reminds me of the way they have to process and filter data that is obfuscated as meaningless numbers. In the show, employees have to "sense" whether the numbers are "bad," which they can, somehow, and sort it into the trash bin.

But the need for humans to label data for AI systems remains, at least for now. “They’re impressive, but ChatGPT and other generative models are not magic – they rely on massive supply chains of human labor and scraped data, much of which is unattributed and used without consent,” Andrew Strait, an AI ethicist, recently wrote on Twitter. “These are serious, foundational problems that I do not see OpenAI addressing.”

#business #tech #ai #ethics #identities #companies/OpenAI #companies/meta

·time.com·Feb 6, 2023

The $2 Per Hour Workers Who Made ChatGPT Safer

We Need to Talk About How Good A.I. Is Getting

#ai #research #linguistics #tech #companies/OpenAI #companies/meta

·nytimes.com·Aug 25, 2022

We Need to Talk About How Good A.I. Is Getting