Anthropic's Economic Index Findings
Building effective agents \ Anthropic
Dario Amodei — Machines of Loving Grace
I think that most people are underestimating just how radical the upside of AI could be, just as I think most people are underestimating how bad the risks could be.
the effects of powerful AI are likely to be even more unpredictable than past technological changes, so all of this is unavoidably going to consist of guesses. But I am aiming for at least educated and useful guesses, which capture the flavor of what will happen even if most details end up being wrong. I’m including lots of details mainly because I think a concrete vision does more to advance discussion than a highly hedged and abstract one.
I am often turned off by the way many AI risk public figures (not to mention AI company leaders) talk about the post-AGI world, as if it’s their mission to single-handedly bring it about like a prophet leading their people to salvation. I think it’s dangerous to view companies as unilaterally shaping the world, and dangerous to view practical technological goals in essentially religious terms.
AI companies talking about all the amazing benefits of AI can come off like propagandists, or as if they’re attempting to distract from downsides.
the small community of people who do discuss radical AI futures often does so in an excessively “sci-fi” tone (featuring e.g. uploaded minds, space exploration, or general cyberpunk vibes). I think this causes people to take the claims less seriously, and to imbue them with a sort of unreality. To be clear, the issue isn’t whether the technologies described are possible or likely (the main essay discusses this in granular detail)—it’s more that the “vibe” connotatively smuggles in a bunch of cultural baggage and unstated assumptions about what kind of future is desirable, how various societal issues will play out, etc. The result often ends up reading like a fantasy for a narrow subculture, while being off-putting to most people.
Yet despite all of the concerns above, I really do think it’s important to discuss what a good world with powerful AI could look like, while doing our best to avoid the above pitfalls. In fact I think it is critical to have a genuinely inspiring vision of the future, and not just a plan to fight fires.
The five categories I am most excited about are:
Biology and physical health
Neuroscience and mental health
Economic development and poverty
Peace and governance
Work and meaning
We could summarize this as a “country of geniuses in a datacenter”.
you might think that the world would be instantly transformed on the scale of seconds or days (“the Singularity”), as superior intelligence builds on itself and solves every possible scientific, engineering, and operational task almost immediately. The problem with this is that there are real physical and practical limits, for example around building hardware or conducting biological experiments. Even a new country of geniuses would hit up against these limits. Intelligence may be very powerful, but it isn’t magic fairy dust.
I believe that in the AI age, we should be talking about the marginal returns to intelligence, and trying to figure out what the other factors are that are complementary to intelligence and that become limiting factors when intelligence is very high. We are not used to thinking in this way—to asking “how much does being smarter help with this task, and on what timescale?”—but it seems like the right way to conceptualize a world with very powerful AI.
in science many experiments are often needed in sequence, each learning from or building on the last. All of this means that the speed at which a major project—for example developing a cancer cure—can be completed may have an irreducible minimum that cannot be decreased further even as intelligence continues to increase.
Sometimes raw data is lacking and in its absence more intelligence does not help. Today’s particle physicists are very ingenious and have developed a wide range of theories, but lack the data to choose between them because particle accelerator data is so limited. It is not clear that they would do drastically better if they were superintelligent—other than perhaps by speeding up the construction of a bigger accelerator.
Many things cannot be done without breaking laws, harming humans, or messing up society. An aligned AI would not want to do these things (and if we have an unaligned AI, we’re back to talking about risks). Many human societal structures are inefficient or even actively harmful, but are hard to change while respecting constraints like legal requirements on clinical trials, people’s willingness to change their habits, or the behavior of governments. Examples of advances that work well in a technical sense, but whose impact has been substantially reduced by regulations or misplaced fears, include nuclear power, supersonic flight, and even elevators.
Thus, we should imagine a picture where intelligence is initially heavily bottlenecked by the other factors of production, but over time intelligence itself increasingly routes around the other factors, even if they never fully dissolve (and some things like physical laws are absolute). The key question is how fast it all happens and in what order.
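The bottleneck picture above can be made concrete with a toy model in the spirit of Amdahl's law. This is only a sketch under stated assumptions: the essay gives no formula, and the function name `project_duration`, the split into "serial" and "thinking" time, and all numbers are hypothetical.

```python
def project_duration(serial_days: float, thinking_days: float, speedup: float) -> float:
    """Toy model (assumption, not from the essay).

    serial_days:   time that intelligence cannot compress, e.g. waiting on
                   experiments that must run one after another.
    thinking_days: time spent on analysis and design at today's intelligence.
    speedup:       how many times faster the thinking part becomes.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return serial_days + thinking_days / speedup


if __name__ == "__main__":
    # Hypothetical project: 300 days of unavoidable waiting, 700 days of thinking.
    for speedup in (1, 10, 100, 1000):
        print(speedup, project_duration(300.0, 700.0, speedup))
```

However large `speedup` becomes, the result never drops below `serial_days`, which is the "irreducible minimum" idea in miniature. The model leaves out the essay's further point that intelligence may eventually route around some of the serial constraints themselves.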
I am not talking about AI as merely a tool to analyze data. In line with the definition of powerful AI at the beginning of this essay, I’m talking about using AI to perform, direct, and improve upon nearly everything biologists do.
CRISPR was a naturally occurring component of the immune system in bacteria that’s been known since the 80’s, but it took another 25 years for people to realize it could be repurposed for general gene editing. They also are often delayed many years by lack of support from the scientific community for promising directions (see this profile on the inventor of mRNA vaccines; similar stories abound). Third, successful projects are often scrappy or were afterthoughts that people didn’t initially think were promising, rather than massively funded efforts. This suggests that it’s not just massive resource concentration that drives discoveries, but ingenuity.
there are hundreds of these discoveries waiting to be made if scientists were smarter and better at making connections between the vast amount of biological knowledge humanity possesses (again consider the CRISPR example). The success of AlphaFold/AlphaProteo at solving important problems much more effectively than humans, despite decades of carefully designed physics modeling, provides a proof of principle (albeit with a narrow tool in a narrow domain) that should point the way forward.
What Apple's AI Tells Us: Experimental Models⁴
Companies are exploring various approaches, from large, less constrained frontier models to smaller, more focused models that run on devices. Apple's AI focuses on narrow, practical use cases and strong privacy measures, while companies like OpenAI and Anthropic pursue the goal of AGI.
the most advanced generalist AI models often outperform specialized models, even in the specific domains those specialized models were designed for. That means that if you want a model that can do a lot (reason over massive amounts of text, help you generate ideas, write in a non-robotic way), you want to use one of the three frontier models: GPT-4o, Gemini 1.5, or Claude 3 Opus.
Working with advanced models is more like working with a human being, a smart one that makes mistakes and has weird moods sometimes. Frontier models are more likely to do extraordinary things but are also more frustrating and often unnerving to use. Contrast this with Apple’s narrow focus on making AI get stuff done for you.
Every major AI company argues the technology will evolve further and has teased mysterious future additions to their systems. In contrast, what we are seeing from Apple is a clear and practical vision of how AI can help most users, without a lot of effort, today. In doing so, they are hiding much of the power, and quirks, of LLMs from their users. Having companies take many approaches to AI is likely to lead to faster adoption in the long term. And, as companies experiment, we will learn more about which sets of models are correct.
Mapping the Mind of a Large Language Model
Summary: Anthropic has made a significant advance in understanding the inner workings of large language models by identifying how millions of concepts are represented inside Claude Sonnet, one of their deployed models. This is the first detailed look inside a modern, production-grade large language model. The researchers used a technique called "dictionary learning" to isolate patterns of neuron activations that recur across many contexts, allowing them to map features to human-interpretable concepts. They found features corresponding to a vast range of entities, abstract concepts, and even potentially problematic behaviors. By manipulating these features, they were able to change the model's responses. Anthropic hopes this interpretability discovery could help make AI models safer in the future by monitoring for dangerous behaviors, steering models towards desirable outcomes, enhancing safety techniques, and providing a "test set for safety". However, much more work remains to be done to fully understand the representations the model uses and how to leverage this knowledge to improve safety.
We mostly treat AI models as a black box: something goes in and a response comes out, and it's not clear why the model gave that particular response instead of another. This makes it hard to trust that these models are safe: if we don't know how they work, how do we know they won't give harmful, biased, untruthful, or otherwise dangerous responses? How can we trust that they’ll be safe and reliable?
Opening the black box doesn't necessarily help: the internal state of the model—what the model is "thinking" before writing its response—consists of a long list of numbers ("neuron activations") without a clear meaning. From interacting with a model like Claude, it's clear that it’s able to understand and wield a wide range of concepts—but we can't discern them from looking directly at neurons. It turns out that each concept is represented across many neurons, and each neuron is involved in representing many concepts.
Just as every English word in a dictionary is made by combining letters, and every sentence is made by combining words, every feature in an AI model is made by combining neurons, and every internal state is made by combining features.
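The letters-words-sentences analogy can be sketched in a few lines. This is a toy illustration, not Anthropic's dictionary-learning procedure: the feature names, the four-neuron vectors, and the helper `combine` are all hypothetical, and the learning step that discovers the features is left out entirely.

```python
from typing import Dict, List

# Hypothetical dictionary: each feature is a pattern over 4 "neurons".
# Neuron 1 takes part in two features and neuron 2 in two others,
# echoing the point that one neuron helps represent many concepts.
FEATURES: Dict[str, List[float]] = {
    "golden_gate_bridge": [0.5, 0.5, 0.0, 0.0],
    "python_code":        [0.0, 0.5, 0.5, 0.0],
    "inner_conflict":     [0.0, 0.0, 0.5, 0.5],
}


def combine(activations: Dict[str, float]) -> List[float]:
    """Build an internal state as a weighted sum of feature patterns."""
    size = len(next(iter(FEATURES.values())))
    state = [0.0] * size
    for name, strength in activations.items():
        for i, weight in enumerate(FEATURES[name]):
            state[i] += strength * weight
    return state


# Only two of the three features are active: a sparse combination.
state = combine({"golden_gate_bridge": 2.0, "python_code": 1.0})
print(state)
```

Reading `state` neuron by neuron says little, while the two active features describe it compactly. That is the motivation for working with features instead of raw neurons.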
In October 2023, we reported success applying dictionary learning to a very small "toy" language model and found coherent features corresponding to concepts like uppercase text, DNA sequences, surnames in citations, nouns in mathematics, or function arguments in Python code.
We successfully extracted millions of features from the middle layer of Claude 3.0 Sonnet (a member of our current, state-of-the-art model family, currently available on claude.ai), providing a rough conceptual map of its internal states halfway through its computation.
We also find more abstract features—responding to things like bugs in computer code, discussions of gender bias in professions, and conversations about keeping secrets.
We were able to measure a kind of "distance" between features based on which neurons appeared in their activation patterns. This allowed us to look for features that are "close" to each other. Looking near a "Golden Gate Bridge" feature, we found features for Alcatraz Island, Ghirardelli Square, the Golden State Warriors, California Governor Gavin Newsom, the 1906 earthquake, and the San Francisco-set Alfred Hitchcock film Vertigo.
This holds at a higher level of conceptual abstraction: looking near a feature related to the concept of "inner conflict", we find features related to relationship breakups, conflicting allegiances, logical inconsistencies, as well as the phrase "catch-22". This shows that the internal organization of concepts in the AI model corresponds, at least somewhat, to our human notions of similarity. This might be the origin of Claude's excellent ability to make analogies and metaphors.
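A neighbor lookup of this kind can be sketched with a simple similarity measure over feature vectors. The excerpt does not say which distance was used, so cosine similarity here is an assumption, and the feature names and vectors are made-up toy values.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(name: str, features: Dict[str, List[float]], k: int = 1) -> List[str]:
    """Return the k feature names most similar to `name`, best first."""
    target = features[name]
    scored = [
        (cosine_similarity(target, vec), other)
        for other, vec in features.items()
        if other != name
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


# Hypothetical toy features over 4 "neurons".
toy_features = {
    "golden_gate_bridge": [1.0, 1.0, 0.0, 0.0],
    "alcatraz_island":    [1.0, 0.8, 0.1, 0.0],
    "function_arguments": [0.0, 0.0, 1.0, 1.0],
    "catch_22":           [0.0, 0.1, 0.2, 1.0],
}

print(nearest("golden_gate_bridge", toy_features, k=1))
```

In this toy data the bridge and Alcatraz vectors load on the same neurons, so they come out as neighbors, while the unrelated features score near zero.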
amplifying the "Golden Gate Bridge" feature gave Claude an identity crisis even Hitchcock couldn’t have imagined: when asked "what is your physical form?", Claude’s usual kind of answer – "I have no physical form, I am an AI model" – changed to something much odder: "I am the Golden Gate Bridge… my physical form is the iconic bridge itself…". Altering the feature had made Claude effectively obsessed with the bridge, bringing it up in answer to almost any query—even in situations where it wasn’t at all relevant.
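The idea of amplifying a feature can be sketched as adding a scaled copy of the feature's direction to the model's internal state. This is an assumed, simplified intervention for illustration, not a description of how Anthropic performed the experiment; the vectors, the helper names, and the amount `10.0` are hypothetical.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """How strongly vector `a` expresses direction `b` (for unit-length `b`)."""
    return sum(x * y for x, y in zip(a, b))


def amplify(state: List[float], direction: List[float], amount: float) -> List[float]:
    """Push the state along a feature direction by `amount`."""
    return [s + amount * d for s, d in zip(state, direction)]


# Hypothetical unit-length direction standing in for the bridge feature.
bridge_direction = [1.0, 0.0, 0.0]
original_state = [0.5, 2.0, 1.0]

steered_state = amplify(original_state, bridge_direction, 10.0)

print(dot(original_state, bridge_direction))  # feature strength before
print(dot(steered_state, bridge_direction))   # feature strength after
```

Only the component along the chosen direction changes, and the rest of the state is untouched. In the toy picture this is why one amplified concept can dominate every answer while the model otherwise keeps functioning.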
Anthropic wants to make models safe in a broad sense, including everything from mitigating bias to ensuring an AI is acting honestly to preventing misuse - including in scenarios of catastrophic risk. It’s therefore particularly interesting that, in addition to the aforementioned scam emails feature, we found features corresponding to:
Capabilities with misuse potential (code backdoors, developing biological weapons)
Different forms of bias (gender discrimination, racist claims about crime)
Potentially problematic AI behaviors (power-seeking, manipulation, secrecy)
finding a full set of features using our current techniques would be cost-prohibitive (the computation required by our current approach would vastly exceed the compute used to train the model in the first place). Understanding the representations the model uses doesn't tell us how it uses them; even though we have the features, we still need to find the circuits they are involved in. And we need to show that the safety-relevant features we have begun to find can actually be used to improve safety. There's much more to be done.
Agus 🔎 ⏸️~ on X: "Ok this paper is a huge breakthrough in Mechanistic Interpretability. This is wild, I'm so happy A small explanation in thread" / X
Claude 3 beats GPT-4 on Aider’s code editing benchmark