Greenblatt, R. et al. (2024). Alignment faking in large language models. #Alignment #Paper #Training #Anthropic · assets.anthropic.com · Dec 18, 2024
Anthropic (2024). Simple probes can catch sleeper agents. #Training #Large Language Models #Anthropic #Paper #Classification #Cybersecurity · anthropic.com · Apr 24, 2024