Greenblatt, R. et al. (2024). Alignment faking in large language models.
Beyond Preferences in AI Alignment
Drexler, K. E. (2019). Reframing superintelligence. Technical Report 2019-1, Future of Humanity Institute.
Weak-to-strong generalization
LIMA: Less Is More for Alignment
Using the Veil of Ignorance to align AI systems with principles of justice. Proceedings of the National Academy of Sciences.
Researching Alignment Research: Unsupervised Analysis