Greenblatt, R. et al. (2024). Alignment faking in large language models. #Alignment #Paper #Training #Anthropic · assets.anthropic.com · Dec 18, 2024
Anthropic (2024). Simple probes can catch sleeper agents. #Training #Large Language Models #Anthropic #Paper #Classification #Cybersecurity · anthropic.com · Apr 24, 2024