Search Test Information Space

Found 2 bookmarks

Forecasting rare language model behaviors \ Anthropic

#Alignment #Risk #Forecasting #Scale #Anthropic #Paper #PDF #Blog

·anthropic.com·Feb 25, 2025

Greenblatt, R. et al. (2024). Alignment faking in large language models.

#Alignment #Paper #Training #Anthropic

·assets.anthropic.com·Dec 18, 2024
