Forecasting rare language model behaviors \ Anthropic · #Alignment #Risk #Forecasting #Scale #Anthropic #Paper #PDF #Blog · anthropic.com · Feb 25, 2025
Greenblatt, R. et al. (2024). Alignment faking in large language models. · #Alignment #Paper #Training #Anthropic · assets.anthropic.com · Dec 18, 2024
Beyond Preferences in AI Alignment · #AI #Preferences #Alignment #Paper #PDF · arxiv.org · Sep 8, 2024
Reframing superintelligence fhi tr 2019 1 · Drexler, K. E. (2019). Reframing superintelligence. Future of Humanity Institute. · #CAIS #Alignment #Paper #PDF · fhi.ox.ac.uk · Dec 15, 2023
Weak to strong generalization · #OpenAI #Alignment #Paper #PDF · cdn.openai.com · Dec 15, 2023
LIMA: Less Is More for Alignment · #Machine Learning #Alignment #Paper #PDF #Meta · arxiv.org · May 23, 2023
Using the Veil of Ignorance to align AI systems with principles of justice | Proceedings of the National Academy of Sciences · #DeepMind #Alignment #Paper · pnas.org · Apr 25, 2023
Researching Alignment Research: Unsupervised Analysis · #Value Alignment #Alignment #AI #Paper #PDF · arxiv.org · Apr 21, 2023