Constitutional Classifiers: Defending against Universal Jailbreaks... · #Anthropic #Classification #Safety #Large Language Models #Paper #PDF · arxiv.org · Feb 3, 2025
Simple probes can catch sleeper agents \ Anthropic · #Training #Large Language Models #Anthropic #Paper #Classification #Cybersecurity · anthropic.com · Apr 24, 2024
Who's Harry Potter? Approximate Unlearning in LLMs · #Machine Learning #Large Language Models #Paper #PDF #Fine-Tuning #Microsoft #Classification · arxiv.org · Dec 27, 2023
DNA-GPT: Divergent N-Gram Analysis for Training-Free Detection of GPT-Generated Text · #Large Language Models #Classification #Paper #PDF · arxiv.org · Jun 12, 2023