Constitutional Classifiers: Defending against Universal Jailbreaks...
Simple probes can catch sleeper agents (Anthropic)
Who's Harry Potter? Approximate Unlearning in LLMs
DNA-GPT: Divergent N-Gram Analysis for Training-Free Detection of GPT-Generated Text