DeepMind Gemma Scope goes under the hood of language models

Large language models (LLMs) have become very good at generating text and code, translating languages, and writing different kinds of creative content. However, the inner workings of these models are hard to understand, even for the researchers who train them. 

This lack of interpretability poses challenges to using LLMs in critical applications that have a low tolerance for mistakes and require transparency. To address this challenge, Google DeepMind has released Gemma Scope, a new set of tools that sheds light on the decision-making process of Gemma 2 models.

Gemma Scope builds on top of JumpReLU sparse autoencoders (SAEs), a deep learning architecture that DeepMind recently proposed.

Understanding LLM activations with sparse autoencoders

When an LLM receives an input, it processes it through a complex network of artificial neurons. The values emitted by these neurons, known as “activations,” represent the model’s understanding of the input and guide its response. 

By studying these activations, researchers can gain insights into how LLMs process information and make decisions. Ideally, we should be able to understand which neurons correspond to which concepts. 

However, interpreting these activations is a major challenge because LLMs have billions of neurons, and each inference produces a massive jumble of activation values at each layer of the model. Each concept can trigger millions of activations in different LLM layers, and each neuron might activate across various concepts.
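
For readers curious how researchers capture these activations in practice, here is a minimal sketch using a PyTorch forward hook with the Hugging Face Transformers library. The model name and layer index are illustrative choices, not something prescribed by Gemma Scope:

```python
# Minimal sketch: capturing the activations of one transformer layer.
# The model name and layer index are illustrative, not prescribed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-2b"  # any causal LM with accessible layers
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

captured = {}

def hook(module, inputs, output):
    # For most decoder layers, output[0] is the hidden-state tensor
    # of shape (batch, sequence_length, hidden_size).
    captured["activations"] = output[0].detach()

# Attach the hook to one layer (index 12 is arbitrary here).
handle = model.model.layers[12].register_forward_hook(hook)

tokens = tokenizer("The Eiffel Tower is in", return_tensors="pt")
with torch.no_grad():
    model(**tokens)
handle.remove()

print(captured["activations"].shape)  # e.g. torch.Size([1, 6, 2304])
```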

One of the leading methods for interpreting LLM activations is to use sparse autoencoders (SAEs), part of a line of research sometimes referred to as “mechanistic interpretability.” An SAE is typically trained on the activations of a single layer in a deep learning model. 

The SAE tries to represent the input activations with a small number of active features drawn from a much larger learned set, and then to reconstruct the original activations from those features. By doing this repeatedly over many examples, the SAE learns to decompose the dense activations into a more interpretable form, making it easier to understand which features in the input are activating different parts of the LLM.
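
As a concrete illustration, the core of a sparse autoencoder fits in a few lines of PyTorch. This is a generic sketch of the idea described above, not DeepMind's implementation; the dimensions and the L1 sparsity penalty are illustrative:

```python
# Generic sparse autoencoder sketch (not DeepMind's implementation).
# It maps dense LLM activations to a wider, mostly-zero feature vector
# and then tries to reconstruct the original activations from it.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # dense -> features
        self.decoder = nn.Linear(d_features, d_model)  # features -> dense

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse features
        reconstruction = self.decoder(features)
        return features, reconstruction

sae = SparseAutoencoder(d_model=2304, d_features=16384)
acts = torch.randn(8, 2304)  # stand-in for a batch of LLM activations
features, recon = sae(acts)

# Training objective: reconstruct faithfully while keeping features sparse.
# The L1 term pushes most feature values to zero; its weight is tunable.
loss = ((recon - acts) ** 2).mean() + 1e-3 * features.abs().mean()
```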

Gemma Scope

Previous research on SAEs mostly focused on studying tiny language models or a single layer in larger models. However, DeepMind’s Gemma Scope takes a more comprehensive approach by providing SAEs for every layer and sublayer of its Gemma 2 2B and 9B models. 

Gemma Scope comprises more than 400 SAEs, which collectively represent more than 30 million learned features from the Gemma 2 models. This will allow researchers to study how different features evolve and interact across different layers of the LLM, providing a much richer understanding of the model’s decision-making process.

“This tool will enable researchers to study how features evolve throughout the model and interact and compose to make more complex features,” DeepMind says in a blog post.

Gemma Scope uses DeepMind’s new JumpReLU SAE architecture. Previous SAE architectures used the rectified linear unit (ReLU) function to enforce sparsity: ReLU zeroes out all activation values below a fixed cutoff, which helps surface the most important features. However, the hard cutoff also makes it difficult to estimate the strength of those features, because any value that falls below it is flattened to zero.

JumpReLU addresses this limitation by enabling the SAE to learn a different activation threshold for each feature. This small change makes it easier for the SAE to strike a balance between detecting which features are present and estimating their strength. JumpReLU also helps keep sparsity low while increasing the reconstruction fidelity, which is one of the endemic challenges of SAEs.
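
In code, the core idea is a small change to the activation function. The sketch below shows the forward pass only; DeepMind's paper also trains the thresholds with straight-through gradient estimators, which is omitted here, and the shapes are illustrative:

```python
# JumpReLU sketch: a learnable threshold per feature (forward pass only).
# DeepMind's paper also trains the thresholds via straight-through
# gradient estimators, which is omitted from this sketch.
import torch
import torch.nn as nn

class JumpReLU(nn.Module):
    def __init__(self, d_features: int):
        super().__init__()
        # One learnable threshold per feature, parameterized in log space
        # so it stays positive; initialized near zero.
        self.log_threshold = nn.Parameter(torch.full((d_features,), -4.0))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        threshold = self.log_threshold.exp()
        # Unlike ReLU, values above the threshold keep their full
        # magnitude; values at or below it are zeroed out entirely.
        return z * (z > threshold)

jump = JumpReLU(d_features=16384)
pre_acts = torch.randn(8, 16384)
sparse_feats = jump(pre_acts)
```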

Toward more robust and transparent LLMs

DeepMind has released Gemma Scope on Hugging Face, making it publicly available for researchers to use. 
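
As a rough illustration of what using the release might look like, the sketch below downloads one SAE's weights from the Hugging Face Hub. The repository name, file path, and weight format here are assumptions about the release layout, so check the official Gemma Scope pages for the exact structure:

```python
# Hedged sketch of loading one Gemma Scope SAE from the Hugging Face Hub.
# The repo_id and filename below are assumptions about the release layout;
# consult the official Gemma Scope repositories for the exact paths.
import numpy as np
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="google/gemma-scope-2b-pt-res",                   # assumed repo name
    filename="layer_20/width_16k/average_l0_71/params.npz",   # assumed path
)
params = np.load(path)
print(list(params.keys()))  # e.g. encoder/decoder weights and thresholds
```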

“We hope today’s release enables more ambitious interpretability research,” DeepMind says. “Further research has the potential to help the field build more robust systems, develop better safeguards against model hallucinations, and protect against risks from autonomous AI agents like deception or manipulation.”

As LLMs continue to advance and become more widely adopted in enterprise applications, AI labs are racing to provide tools that can help them better understand and control the behavior of these models.

SAEs such as the suite of models provided in Gemma Scope have emerged as one of the most promising directions of research. They can help develop techniques to discover and block unwanted behavior in LLMs, such as generating harmful or biased content. The release of Gemma Scope can support work in several areas, including detecting and fixing LLM jailbreaks, steering model behavior, red-teaming SAEs themselves, and discovering interesting features of language models, such as how they learn specific tasks. 

Anthropic and OpenAI are also working on their own SAE research and have published several papers in recent months. At the same time, scientists are exploring non-mechanistic techniques that can help better understand the inner workings of LLMs. One example is a recent technique developed by OpenAI, which pairs two models so that one verifies the other’s responses through a gamified process that encourages answers that are both verifiable and legible.
