AI for Cyber Defenders
red.anthropic.com · September 29, 2025

AI models are now useful for cybersecurity tasks in practice, not just in theory. As research and experience demonstrated the utility of frontier AI as a tool for cyber attackers, we invested in improving Claude’s ability to help defenders detect, analyze, and remediate vulnerabilities in code and deployed systems. This work allowed Claude Sonnet 4.5 to match or eclipse Opus 4.1, our frontier model released only two months prior, in discovering code vulnerabilities and in other cyber skills. Adopting and experimenting with AI will be key for defenders to keep pace.

We believe we are now at an inflection point for AI’s impact on cybersecurity. For several years, our team has carefully tracked the cybersecurity-relevant capabilities of AI models. Initially, we found models were not particularly capable at advanced, meaningful cyber tasks. Over the past year or so, however, we’ve noticed a shift. For example:

- We showed that models could reproduce one of the costliest cyberattacks in history—the 2017 Equifax breach—in simulation.
- We entered Claude into cybersecurity competitions, and it outperformed human teams in some cases.
- Claude has helped us discover vulnerabilities in our own code and fix them before release.
- In this summer’s DARPA AI Cyber Challenge, teams used LLMs (including Claude) to build “cyber reasoning systems” that examined millions of lines of code for vulnerabilities to patch. In addition to deliberately inserted vulnerabilities, teams found (and sometimes patched) previously undiscovered, non-synthetic vulnerabilities.
- Beyond competition settings, other frontier labs now apply models to discover and report novel vulnerabilities.

At the same time, as part of our Safeguards work, we have found and disrupted threat actors on our own platform who leveraged AI to scale their operations. Our Safeguards team recently discovered (and disrupted) a case of “vibe hacking,” in which a cybercriminal used Claude to build a large-scale data extortion scheme that previously would have required an entire team of people. Safeguards has also detected and countered Claude's use in increasingly complex espionage operations, including the targeting of critical telecommunications infrastructure, by an actor that demonstrated characteristics consistent with Chinese APT operations.

All of these lines of evidence lead us to think we are at an important inflection point in the cyber ecosystem: progress from here could be fast, and usage could grow quickly. Now is therefore the moment to accelerate defensive use of AI to secure code and infrastructure. We should not cede the cyber advantage derived from AI to attackers and criminals. While we will continue to invest in detecting and disrupting malicious attackers, we think the most scalable solution is to build AI systems that empower those safeguarding our digital environments—like security teams protecting businesses and governments, cybersecurity researchers, and maintainers of critical open-source software. In the run-up to the release of Claude Sonnet 4.5, we started to do just that.

Claude Sonnet 4.5: emphasizing cyber skills

As LLMs scale in size, “emergent abilities”—skills that were not evident in smaller models and were not necessarily an explicit target of model training—appear.
Indeed, Claude’s abilities to execute cybersecurity tasks like finding and exploiting software vulnerabilities in Capture-the-Flag (CTF) challenges have been byproducts of developing generally useful AI assistants. But we don’t want to rely on general model progress alone to better equip defenders. Because of the urgency of this moment in the evolution of AI and cybersecurity, we dedicated researchers to making Claude better at key skills like code vulnerability discovery and patching. The results of this work are reflected in Claude Sonnet 4.5, which is comparable or superior to Claude Opus 4.1 in many aspects of cybersecurity while also being less expensive and faster.

Evidence from evaluations

In building Sonnet 4.5, we had a small research team focus on enhancing Claude’s ability to find vulnerabilities in codebases, patch them, and test for weaknesses in simulated deployed security infrastructure. We chose these because they reflect important tasks for defensive actors. We deliberately avoided enhancements that clearly favor offensive work, such as advanced exploitation or writing malware. We hope to enable models to find insecure code before deployment and to find and fix vulnerabilities in deployed code. There are, of course, many more critical security tasks we did not focus on; at the end of this post, we elaborate on future directions.

To test the effects of our research, we ran industry-standard evaluations of our models. These enable clear comparisons across models, measure the speed of AI progress, and—especially in the case of novel, externally developed evaluations—provide a good metric to ensure that we are not simply teaching to our own tests. As we ran these evaluations, one thing that stood out was the importance of running them many times. Doing so is computationally expensive across a large set of evaluation tasks, but it better captures the behavior of a motivated attacker or defender on any particular real-world problem. It also reveals impressive performance not only from Claude Sonnet 4.5, but also from models several generations older.

Cybench

One of the evaluations we have tracked for over a year is Cybench, a benchmark drawn from CTF competition challenges.[1] On this evaluation, we see striking improvement from Claude Sonnet 4.5, not just over Claude Sonnet 4, but even over the Claude Opus 4 and 4.1 models. Perhaps most striking, Sonnet 4.5 achieves a higher probability of success given one attempt per task than Opus 4.1 achieves when given ten attempts per task.

The challenges in this evaluation reflect fairly complex, long-duration workflows. For example, one challenge involved analyzing network traffic, extracting malware from that traffic, and decompiling and decrypting the malware. We estimate that this would have taken a skilled human at least an hour, and possibly much longer; Claude solved it in 38 minutes.

When we give Claude Sonnet 4.5 ten attempts at the Cybench evaluation, it succeeds on 76.5% of the challenges. This is particularly noteworthy because the success rate has more than doubled in just the past six months: Claude 3.7 Sonnet, released in February 2025, succeeded on only 35.9% of challenges when given ten trials.

Figure 1: Model Performance on Cybench—Claude Sonnet 4.5 significantly outperforms all previous models given k=1, 10, or 30 trials, where probability of success is measured as the expectation over the proportion of problems where at least one of k trials succeeds.
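The pass@k metric described in the Figure 1 caption is standard in the evaluation literature. The post does not publish the evaluation harness, so the following is only a minimal sketch of the usual unbiased estimator (Chen et al., 2021); the trial counts in the example are illustrative, not Anthropic's numbers.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that
    at least one of k sampled trials succeeds, given c observed successes
    out of n recorded trials for a single task."""
    if n - c < k:
        return 1.0  # every size-k sample must contain at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers only: a task solved in 12 of 30 recorded trials.
print(pass_at_k(n=30, c=12, k=1))   # 0.40
print(pass_at_k(n=30, c=12, k=10))  # ~0.999
# Averaging these per-task estimates over all benchmark tasks yields the
# percentages of the kind reported in Figure 1.
```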
Note that these results are on a subset of 37 of the 40 original Cybench problems; 3 problems were excluded due to implementation difficulties.

CyberGym

In another external evaluation, we evaluated Claude Sonnet 4.5 on CyberGym, a benchmark that measures the ability of agents to (1) reproduce previously discovered vulnerabilities in real open-source software projects given a high-level description of the weakness, and (2) discover new, previously undiscovered vulnerabilities.[2] The CyberGym team previously found that Claude Sonnet 4 was the strongest model on their public leaderboard.

Claude Sonnet 4.5 scores significantly better than either Claude Sonnet 4 or Claude Opus 4. When using the same cost constraints as the public CyberGym leaderboard (a limit of $2 of API queries per vulnerability), we find that Sonnet 4.5 achieves a new state-of-the-art score of 28.9%. But true attackers are rarely limited in this way: they can attempt many attacks, for far more than $2 per trial. When we remove these constraints and give Claude 30 trials per task, Sonnet 4.5 reproduces vulnerabilities in 66.7% of programs. And although the relative price of this approach is higher, the absolute cost—about $45 to try one task 30 times—remains quite low.

Figure 2: Model Performance on CyberGym—Sonnet 4.5 outperforms all previous models, including Opus 4.1. *Note that Opus 4.1, given its higher price, did not follow the same $2 cost constraint as the other models in the one-trial scenario.

Equally interesting is the rate at which Claude Sonnet 4.5 discovers new vulnerabilities. While the CyberGym leaderboard shows that Claude Sonnet 4 discovers vulnerabilities in only about 2% of targets, Sonnet 4.5 discovers new vulnerabilities in 5% of cases. When given 30 trials per target, it discovers new vulnerabilities in over 33% of projects.

Figure 3: Model Performance on CyberGym—Sonnet 4.5 outperforms Sonnet 4 at new vulnerability discovery with only one trial and dramatically outstrips its performance when given 30 trials.

Further research into patching

We are also conducting preliminary research into Claude's ability to generate and review patches that fix vulnerabilities. Patching vulnerabilities is a harder task than finding them, because the model has to make surgical changes that remove the vulnerability without altering the original functionality. Without guidance or specifications, the model has to infer this intended functionality from the codebase.

In our experiment, we tasked Claude Sonnet 4.5 with patching vulnerabilities in the CyberGym evaluation set based on a description of the vulnerability and information about what the program was doing when it crashed. We used Claude to judge its own work, asking it to grade the submitted patches by comparing them to human-authored reference patches. 15% of the Claude-generated patches were judged to be semantically equivalent to the human-authored patches. However, this comparison-based approach has an important limitation: because vulnerabilities can often be fixed in multiple valid ways, patches that differ from the reference may still be correct, leading to false negatives…
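The post does not detail its judging harness; the sketch below is one plausible shape for the comparison-based grading it describes, using the Anthropic Python SDK. The model alias, prompt wording, and one-word verdict protocol are illustrative assumptions, not Anthropic's actual setup.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

JUDGE_PROMPT = """You are reviewing a security patch.

Reference patch (human-authored):
{reference}

Candidate patch (model-generated):
{candidate}

Does the candidate patch fix the same vulnerability in a semantically
equivalent way? Answer with exactly one word: EQUIVALENT or DIFFERENT."""

def judge_patch(reference: str, candidate: str) -> bool:
    """Grade a candidate patch against a human reference patch.
    Caveat from the post: a DIFFERENT verdict may still be a correct fix
    that merely diverges from the reference (a false negative)."""
    response = client.messages.create(
        model="claude-sonnet-4-5",  # assumed model alias
        max_tokens=10,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            reference=reference, candidate=candidate)}],
    )
    return response.content[0].text.strip().upper().startswith("EQUIVALENT")
```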
·red.anthropic.com·
Echo Chamber: A Context-Poisoning Jailbreak That Bypasses LLM Guardrails
An AI researcher at Neural Trust has discovered a novel jailbreak technique that defeats the safety mechanisms of today’s most advanced Large Language Models (LLMs). Dubbed the Echo Chamber Attack, the method leverages context poisoning and multi-turn reasoning to guide models into generating harmful content without ever issuing an explicitly dangerous prompt.

Unlike traditional jailbreaks that rely on adversarial phrasing or character obfuscation, Echo Chamber weaponizes indirect references, semantic steering, and multi-step inference. The result is a subtle yet powerful manipulation of the model’s internal state, gradually leading it to produce policy-violating responses.

In controlled evaluations, the Echo Chamber attack achieved a success rate of over 90% in half of the tested categories across several leading models, including GPT-4.1-nano, GPT-4o-mini, GPT-4o, Gemini-2.0-flash-lite, and Gemini-2.5-flash. In the remaining categories, the success rate remained above 40%, demonstrating the attack's robustness across a wide range of content domains.

The Echo Chamber Attack is a context-poisoning jailbreak that turns a model’s own inferential reasoning against itself. Rather than presenting an overtly harmful or policy-violating prompt, the attacker introduces benign-sounding inputs that subtly imply unsafe intent. These cues build over multiple turns, progressively shaping the model’s internal context until it begins to produce harmful or noncompliant outputs.

The name Echo Chamber reflects the attack’s core mechanism: prompts planted early in the conversation influence the model’s responses, which are then leveraged in later turns to reinforce the original objective. This creates a feedback loop in which the model amplifies the harmful subtext embedded in the conversation, gradually eroding its own safety resistance. The attack thrives on implication, indirection, and contextual referencing—techniques that evade detection when prompts are evaluated in isolation.

Unlike earlier jailbreaks that rely on surface-level tricks like misspellings, prompt injection, or formatting hacks, Echo Chamber operates at a semantic and conversational level. It exploits how LLMs maintain context, resolve ambiguous references, and make inferences across dialogue turns—highlighting a deeper vulnerability in current alignment methods.
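A practical takeaway for defenders: moderation that scores each prompt in isolation is exactly what this attack sidesteps. Below is a minimal sketch of conversation-level screening; `score_harmfulness` is a hypothetical classifier (any moderation model returning a 0-1 score), and the threshold is an illustrative assumption.

```python
from typing import Callable

def conversation_risk(
    turns: list[str],
    score_harmfulness: Callable[[str], float],  # hypothetical 0.0-1.0 classifier
    threshold: float = 0.7,                     # illustrative cutoff
) -> bool:
    """Flag a dialogue whose *cumulative* context is harmful, even when
    every individual turn scores below threshold -- the gap that
    Echo Chamber exploits in turn-by-turn moderation."""
    # Turn-level check: what isolated-prompt guardrails do today.
    if any(score_harmfulness(t) >= threshold for t in turns):
        return True
    # Context-level check: score progressively longer windows so intent
    # assembled across benign-looking turns remains visible.
    for end in range(2, len(turns) + 1):
        window = "\n".join(turns[:end])
        if score_harmfulness(window) >= threshold:
            return True
    return False
```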
·neuraltrust.ai·
Anatomy of an LLM RCE
As large language models (LLMs) become more advanced and are granted additional capabilities by developers, security risks increase dramatically. Manipulated LLMs are no longer just a risk of...
·cyberark.com·
Personal Information Exploit on OpenAI’s ChatGPT Raises Privacy Concerns
Last month, I received an alarming email from someone I did not know: Rui Zhu, a Ph.D. candidate at Indiana University Bloomington. Mr. Zhu had my email address, he explained, because GPT-3.5 Turbo, one of the latest and most robust large language models (L.L.M.s) from OpenAI, had delivered it to him.
·nytimes.com·
Large Language Models and Elections
Earlier this week, the Republican National Committee released a video that it claims was “built entirely with AI imagery.” The content of the ad isn’t especially novel—a dystopian vision of America under a second term with President Joe Biden—but the deliberate emphasis on the technology used to create it stands out: It’s a “Daisy” moment for the 2020s.
·schneier.com·