Forecasting rare language model behaviors. Anthropic · anthropic.com · Feb 25, 2025 · #Alignment #Risk #Forecasting #Scale #Anthropic #Paper #PDF #Blog
Greenblatt, R. et al. (2024). Alignment faking in large language models. · assets.anthropic.com · Dec 18, 2024 · #Alignment #Paper #Training #Anthropic