GenAI Just Got A Little Less Opaque

Yesterday, the AI startup Anthropic published a paper detailing the successful interpretation of the inner workings of a large language model (LLM). LLMs are notoriously opaque: Their size, complexity, and numeric representation of human language have hitherto defied explanation, which has made it all but impossible to understand why inputs lead to outputs.

Anthropic used a technique called dictionary learning, leveraging a sparse autoencoder to isolate specific concepts within its Claude 3 Sonnet model. The technique allowed them to extract millions of features, including specific entities like the Golden Gate Bridge as well as more abstract ideas such as gender bias. They were then able to map the proximity of related concepts such as “inner conflict” and “Catch-22.” Most importantly, they were able to activate and suppress features to change model behavior.
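For readers who want a feel for the mechanics, here is a toy sketch of the general idea in Python. It is not Anthropic's implementation: The class name, dimensions, loss weighting, and random (untrained) weights are all assumptions made for illustration. A sparse autoencoder rewrites a model's internal activation vector as a combination of a few "features" from a large dictionary, and steering amounts to forcing one of those features to a chosen value before handing the activations back to the model.

```python
import numpy as np


class SparseAutoencoder:
    """Toy dictionary-learning model (hypothetical, untrained).

    It approximates an activation vector x as decode(encode(x)),
    where the encoded features are non-negative and, once trained
    with an L1 penalty, mostly zero.
    """

    def __init__(self, d_model, n_features, seed=0):
        rng = np.random.default_rng(seed)
        # Random weights stand in for weights that would be learned.
        self.W_enc = rng.normal(0.0, 0.1, (d_model, n_features))
        self.b_enc = np.zeros(n_features)
        self.W_dec = rng.normal(0.0, 0.1, (n_features, d_model))
        self.b_dec = np.zeros(d_model)

    def encode(self, x):
        # ReLU keeps feature activations non-negative.
        return np.maximum((x - self.b_dec) @ self.W_enc + self.b_enc, 0.0)

    def decode(self, f):
        # Each row of W_dec is one feature's direction in activation space.
        return f @ self.W_dec + self.b_dec

    def loss(self, x, l1_coeff=1e-3):
        # Reconstruction error plus an L1 penalty that encourages sparsity.
        f = self.encode(x)
        mse = np.mean(np.sum((x - self.decode(f)) ** 2, axis=-1))
        l1 = np.mean(np.sum(np.abs(f), axis=-1))
        return mse + l1_coeff * l1

    def steer(self, x, feature_idx, value):
        # Clamp one feature to `value`, keeping whatever the dictionary
        # fails to explain (the reconstruction error) unchanged.
        f = self.encode(x)
        error = x - self.decode(f)
        f_clamped = f.copy()
        f_clamped[..., feature_idx] = value
        return self.decode(f_clamped) + error
```

In this sketch, calling `steer(x, i, 0.0)` suppresses feature `i` and a large positive value activates it. In both cases the activations shift only along that feature's decoder direction, which is what makes the intervention targeted.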

So it’s time to bust out the champagne, because we’ve solved genAI explainability, right?

Not quite. Anthropic only identified a small subset of the features within the model. Interpreting the entire model would be too costly — “the computation required by our current approach would vastly exceed the compute used to train the model in the first place,” Anthropic admitted. The paper is essentially proof that explainability is possible, given adequate resources.

It is time to marshal those resources. Investing in generative AI explainability is necessary for the future success of AI because:

There is no alignment without explainability. Over the last six months, I have been conducting research on AI alignment with Brian Hopkins and Enza Iannopollo. We have found that the limitations of current AI approaches make misalignment inevitable. AI misalignment could create catastrophic consequences for businesses and society. Full model explainability would enable us to tweak the very DNA of LLMs, bringing them into alignment with business and societal needs.
Opacity precludes insight. The world is enamored of LLMs’ ability to produce novel text, audio, images, video, and code. But we are currently ignorant of the patterns that the models learned about humanity to produce those outputs. The training of LLMs could be considered the largest sociological study of humanity in the history of the world. Unfortunately, without explainability, we have no way of interpreting the study’s results.
Transparency is AI’s most powerful trust lever. The opacity of AI has created a significant trust gap that only transparency can fully bridge. Until we can explain exactly how a prompt leads to a response, there will be skepticism among consumers, regulators, and business stakeholders alike. Anthropic’s research is a step in the right direction. It is not the bridge itself, but it does show how the bridge may be constructed with adequate time and resources.

Explainability of predictive AI was a vexing issue a decade ago. Now it is largely solved, thanks to the hard work of diligent researchers responding to the demands of industry. It’s time to make similar demands of generative AI. The success of AI depends on it.

If you’d like to discuss explainability further, please feel free to schedule an inquiry or guidance session.
