Search AI/ML

Found 32 bookmarks

Custom sorting

The Art of AI Domination: Remote Controlling ChatGPT ZombAI Instances

Hey ChatGPT! How to build a botnet with compromised ChatGPT instances! AI botnet vulnerability

#security #safety

·embracethered.com·Jan 7, 2025

The Art of AI Domination: Remote Controlling ChatGPT ZombAI Instances

APpaREnTLy THiS iS hoW yoU JaIlBreAk AI

Anthropic created an AI jailbreaking algorithm that keeps tweaking prompts until it gets a harmful response.

#security #safety

·404media.co·Dec 19, 2024

APpaREnTLy THiS iS hoW yoU JaIlBreAk AI

The Beginner's Guide to Visual Prompt Injections: Invisibility Cloaks, Cannibalistic Adverts, and Robot Women | Lakera – Protecting AI teams that disrupt the world.

Learn about visual prompt injections, their appearance, and top defense strategies against these attacks.

#security #safety

·lakera.ai·Nov 15, 2024

The Beginner's Guide to Visual Prompt Injections: Invisibility Cloaks, Cannibalistic Adverts, and Robot Women | Lakera – Protecting AI teams that disrupt the world.

Ted Benson

#safety #security #audio #voice

·edwardbenson.com·Oct 8, 2024

Ted Benson

Hacker plants false memories in ChatGPT to steal user data in perpetuity

Emails, documents, and other untrusted content can plant malicious memories.

#safety #security

·arstechnica.com·Sep 25, 2024

Hacker plants false memories in ChatGPT to steal user data in perpetuity

The dangers of AI agents unfurling hyperlinks and what to do about it · Embrace The Red

Automatically unfurling hyperlinks can lead to data exfiltration. This post shows how to mitigate this threat in Slack Apps

#safety #security

·embracethered.com·Aug 21, 2024

The dangers of AI agents unfurling hyperlinks and what to do about it · Embrace The Red

SQL injection-like attack on LLMs with special tokens

Andrej Karpathy explains something that's been confusing me for the best part of a year: The decision by LLM tokenizers to parse special tokens in the input string (``, …

#safety #security

·simonwillison.net·Aug 21, 2024

SQL injection-like attack on LLMs with special tokens

MIT releases comprehensive database of AI risks

Researchers at MIT have released the AI Risk Repository, a comprehensive database that can help organizations identify and mitigate AI risks.

#safety #security

·venturebeat.com·Aug 14, 2024

MIT releases comprehensive database of AI risks

Mapping the misuse of generative AI

New research analyzes the misuse of multimodal generative AI today, in order to help build safer and more responsible technologies

#security #safety

·deepmind.google·Aug 12, 2024

Mapping the misuse of generative AI

GPT-4o System Card

There are some fascinating new details in this lengthy report outlining the safety work carried out prior to the release of GPT-4o. A few highlights that stood out to me. …

#safety #security

·simonwillison.net·Aug 9, 2024

GPT-4o System Card

The Rise of Large-Language-Model Optimization - Schneier on Security

#security #safety

·schneier.com·Apr 25, 2024

The Rise of Large-Language-Model Optimization - Schneier on Security

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

By far the most detailed paper on prompt injection I've seen yet from OpenAI, published a few days ago and with six credited authors: Eric Wallace, Kai Xiao, Reimar Leike, …

#security #safety

·simonwillison.net·Apr 23, 2024

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

AI bots hallucinate software packages and devs download them

Simply look out for libraries imagined by ML and make them real, with actual malicious code. No wait, don't do that

#safety #security

·theregister.com·Mar 30, 2024

AI bots hallucinate software packages and devs download them

Researchers use ASCII art to elicit harmful responses from 5 major AI chatbots

LLMs are trained to block harmful responses. Old-school images can override those rules.

#safety #security

·arstechnica.com·Mar 18, 2024

Researchers use ASCII art to elicit harmful responses from 5 major AI chatbots

Who Am I? Conditional Prompt Injection Attacks with Microsoft Copilot · Embrace The Red

Conditional Instructions open a powerful way for adversaries to target individual and delay detonation of malicious payloads for when certain conditions are met

#safety #security #prompt

·embracethered.com·Mar 4, 2024

Who Am I? Conditional Prompt Injection Attacks with Microsoft Copilot · Embrace The Red

Video: ASCII Smuggling and Hidden Prompt Instructions · Embrace The Red

ASCII Smuggling - Crafting Invisible Text and Decoding Hidden Secrets (with LLMs)

#security #safety

·embracethered.com·Feb 13, 2024

Video: ASCII Smuggling and Hidden Prompt Instructions · Embrace The Red

Hidden Prompt Injections with Anthropic Claude · Embrace The Red

Hidden Prompt Injections with Anthropic Claude

#security #safety

·embracethered.com·Feb 8, 2024

Hidden Prompt Injections with Anthropic Claude · Embrace The Red

AI poisoning could turn open models into destructive “sleeper agents,” says Anthropic

Trained LLMs that seem normal can generate vulnerable code given different triggers.

#safety #security

·arstechnica.com·Jan 17, 2024

AI poisoning could turn open models into destructive “sleeper agents,” says Anthropic

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.

#safety #security

·arxiv.org·Jan 16, 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Extracting Training Data from ChatGPT

#safety #security

·not-just-memorization.github.io·Nov 30, 2023

Extracting Training Data from ChatGPT

Google Researchers’ Attack Prompts ChatGPT to Reveal Its Training Data

ChatGPT is full of sensitive private information and spits out verbatim text from CNN, Goodreads, WordPress blogs, fandom wikis, Terms of Service agreements, Stack Overflow source code, Wikipedia pages, news blogs, random internet comments, and much more.

#security #safety

·404media.co·Nov 29, 2023

Google Researchers’ Attack Prompts ChatGPT to Reveal Its Training Data

Ekoparty 2023 - HACK THE PLANET - Prompt Injections in the Wild, real-world exploits and mitigations.

This talk sheds light on emerging attack techniques in AI applications such as Prompt Injections, a vulnerability at the very core of LLM Agents, Automated Tool Request Forgery, Data Exfiltration, and more.

#safety #security

·embracethered.com·Nov 29, 2023

Ekoparty 2023 - HACK THE PLANET - Prompt Injections in the Wild, real-world exploits and mitigations.

Prompt injection explained, November 2023 edition

A neat thing about podcast appearances is that, thanks to Whisper transcriptions, I can often repurpose parts of them as written content for my blog. One of the areas Nikita …

#safety #security

·simonwillison.net·Nov 27, 2023

Prompt injection explained, November 2023 edition

Microsoft’s Use Of ‘AI’ In Journalism Has Been An Irresponsible Mess

We’ve noted repeatedly how early attempts to integrate “AI” into journalism have proven to be a comical mess, resulting in no shortage of shoddy product, dangerous falsehoods, and…

#safety #security

·techdirt.com·Nov 6, 2023

Microsoft’s Use Of ‘AI’ In Journalism Has Been An Irresponsible Mess

GPT-4 Vision Prompt Injection

In this article, we explore what prompt injection is and the techniques people have been using to perform prompt injection attacks on GPT-4.

#safety #security

·blog.roboflow.com·Oct 20, 2023

GPT-4 Vision Prompt Injection

Compromising LLMs: The Advent of AI Malware - Black Hat USA 2023 | Briefings Schedule

#safety #security

·blackhat.com·Aug 18, 2023

Compromising LLMs: The Advent of AI Malware - Black Hat USA 2023 | Briefings Schedule

What happens when thousands of hackers try to break AI chatbots

In a Jeopardy-style game at the annual Def Con hacking convention in Las Vegas, hackers tried to get chatbots from OpenAI, Google and Meta to create misinformation and share harmful content.

#safety #security

·npr.org·Aug 16, 2023

What happens when thousands of hackers try to break AI chatbots

The Need for Trustworthy AI - Schneier on Security

#safety #security

·schneier.com·Aug 3, 2023

The Need for Trustworthy AI - Schneier on Security

PoisonGPT: How we hid a lobotomized LLM on Hugging Face to spread fake news

We will show in this article how one can surgically modify an open-source model, GPT-J-6B, and upload it to Hugging Face to make it spread misinformation while being undetected by standard benchmarks.

#safety #security #model training

·blog.mithrilsecurity.io·Jul 10, 2023

PoisonGPT: How we hid a lobotomized LLM on Hugging Face to spread fake news

Gandalf | Lakera – Test your prompting skills to make Gandalf reveal secret information.

Trick Gandalf into revealing information and experience the limitations of large language models firsthand.

#security #safety

·gandalf.lakera.ai·Jul 1, 2023

Gandalf | Lakera – Test your prompting skills to make Gandalf reveal secret information.