Search Test Information Space

Found 8 bookmarks

Forecasting rare language model behaviors \ Anthropic

·anthropic.com·Feb 25, 2025

Greenblatt, R. et al. (2024). Alignment faking in large language models.

·assets.anthropic.com·Dec 18, 2024

Beyond Preferences in AI Alignment

·arxiv.org·Sep 8, 2024

Reframing Superintelligence (FHI-TR-2019-1)

Drexler, K. E. (2019). Reframing superintelligence. Future of Humanity Institute.

·fhi.ox.ac.uk·Dec 15, 2023

Weak-to-strong generalization

·cdn.openai.com·Dec 15, 2023

LIMA: Less Is More for Alignment

·arxiv.org·May 23, 2023

Using the Veil of Ignorance to align AI systems with principles of justice | Proceedings of the National Academy of Sciences

·pnas.org·Apr 25, 2023

Researching Alignment Research: Unsupervised Analysis

·arxiv.org·Apr 21, 2023