An Attempt at Bridging Human Cognition and AI: Integrating Titan with Transformer²
Table of Contents
- Why do I want to integrate Titan and Transformer²?
- Review
- Integrating Titan and Transformer²
- Conclusion
- So, what’s next?
- Related Works
Why do I want to integrate Titan and Transformer²?
Have you ever left a meeting remembering only the key decision points, while the finer details faded away? This common human experience reflects how our brains adeptly compress and encode vast amounts of information—retaining only the most salient details and discarding what’s less essential. Inspired by this natural efficiency, we’re exploring how similar strategies can enhance artificial intelligence.
Consider how you recall meetings: you might forget every conversation twist, but remember the crucial conclusions. Our brains prioritize and compress memories, focusing on fundamental insights while efficiently filtering out noise. This remarkable ability points to an underlying mechanism where the most critical patterns are retained, and extraneous data is minimized.
To achieve this, I leverage Singular Value Decomposition (SVD) within Titan’s architecture to mimic the brain’s natural ability to distill and prioritize information.
Using SVD in Titan’s memory module allows the model to:
- Compress Information: Just as the brain forms compressed memories, SVD captures dominant patterns in data, reducing complexity.
- Enhance Efficiency: By working with low-rank approximations, the model operates faster and requires less memory, similar to the brain’s ability to recall information without processing every detail.

By integrating these principles, I aim to bridge the gap between artificial and biological intelligence, creating models that are both efficient and cognitively inspired. The toy sketch below illustrates the compression idea.
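To make the compression claim concrete, here is a minimal PyTorch sketch (sizes and names are illustrative, not from either paper) showing that a rank-16 truncation of a 512×512 matrix keeps the dominant structure with roughly 6% of the parameters:

```python
import torch

torch.manual_seed(0)
d, r = 512, 16

# A toy "memory" matrix: dominant low-rank structure plus small noise.
W = torch.randn(d, r) @ torch.randn(r, d) + 0.01 * torch.randn(d, d)

# Truncated SVD: keep only the r most dominant singular directions.
U, s, Vh = torch.linalg.svd(W, full_matrices=False)
W_hat = U[:, :r] @ torch.diag(s[:r]) @ Vh[:r, :]

params_full = d * d                     # 262,144 entries
params_lowrank = 2 * d * r + r          # 16,400 entries
rel_err = (torch.linalg.norm(W - W_hat) / torch.linalg.norm(W)).item()
print(f"kept {params_lowrank / params_full:.1%} of the parameters")
print(f"relative reconstruction error: {rel_err:.4f}")
```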
Review
Titan’s native Attention mechanism
Why do I call it a native attention mechanism? Take a look:
At its core, Titan introduces a novel way to update a long-term memory module based on “surprise”—a signal derived from how unexpected new information is compared to past experiences. For an input token ( x_t ), Titan projects it into a key–value pair:

$$ k_t = x_t W_K, \qquad v_t = x_t W_V $$
and defines an associative memory loss:

$$ \ell(M_{t-1}; x_t) = \left\| M_{t-1}(k_t) - v_t \right\|_2^2 $$
Using this loss, Titan updates a “surprise” momentum ( S_t ) with forgetting:

$$ S_t = \eta_t\, S_{t-1} - \lambda_t\, \nabla \ell(M_{t-1}; x_t), $$

where ( \eta_t ) decays past surprise and ( \lambda_t ) scales the momentary surprise,
and then updates the memory with a forgetting gate ( \alpha_t \in [0, 1] ):

$$ M_t = (1 - \alpha_t)\, M_{t-1} + S_t $$
This mechanism allows Titan to selectively encode surprising information, reminiscent of how humans remember remarkable events while forgetting mundane details over time.
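As a minimal sketch of this loop, assume a linear memory that acts as ( M(k_t) = M k_t ) and fixed scalar gates ( \eta, \lambda, \alpha ) (Titan makes these data-dependent and learned; all names here are illustrative):

```python
import torch

d = 64
W_K = torch.randn(d, d) / d**0.5   # key projection
W_V = torch.randn(d, d) / d**0.5   # value projection
M = torch.zeros(d, d)              # linear long-term memory: M(k) = M @ k
S = torch.zeros(d, d)              # surprise momentum

eta, lam, alpha = 0.9, 0.1, 0.01   # fixed gates; Titan learns these per token

def titan_step(x, M, S):
    k, v = W_K @ x, W_V @ x
    err = M @ k - v                     # residual of the associative recall
    grad = 2 * torch.outer(err, k)      # gradient of ||M k - v||^2 w.r.t. M
    S = eta * S - lam * grad            # surprise momentum with forgetting
    M = (1 - alpha) * M + S             # memory update with forgetting gate
    return M, S

for _ in range(10):                     # stream of tokens
    M, S = titan_step(torch.randn(d), M, S)
```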
Transformer²’s Singular Value Decomposition (SVD)
Titan’s memory module often involves large weight matrices, which can be computationally expensive. By applying Singular Value Decomposition (SVD), we can factorize these matrices into low-rank components. For a weight matrix ( W \in \mathbb{R}^{d \times d} ), a rank-( r ) truncated SVD gives:

$$ W \approx U \Sigma V^\top, $$
where
- ( U \in \mathbb{R}^{d \times r} ),
- ( \Sigma \in \mathbb{R}^{r \times r} ), diagonal, holding the top ( r ) singular values (written ( \Sigma ) rather than ( S ) to avoid a clash with the surprise momentum ( S_t )),
- ( V \in \mathbb{R}^{d \times r} ),
with ( r ) being the rank of the decomposition.
This low-rank factorization reduces the number of parameters and computational load while preserving the most significant information in the weight matrix. When applied to Titan, the key memory parameters become the set

$$ \theta_t = \{ U_t, \Sigma_t, V_t \}, $$

which evolves over time as the model learns.
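A small sketch of this factorization, assuming the memory is stored as a dense matrix (the helper names here are mine, not from either paper):

```python
import torch

def factorize_memory(W: torch.Tensor, r: int):
    """Split a dense memory matrix into trainable factors
    theta = {U, Sigma, V} with W ~ U @ diag(Sigma) @ V.T."""
    U, sigma, Vh = torch.linalg.svd(W, full_matrices=False)
    # Keep the top-r directions; store Sigma as a vector of singular values.
    return (torch.nn.Parameter(U[:, :r].clone()),
            torch.nn.Parameter(sigma[:r].clone()),
            torch.nn.Parameter(Vh[:r, :].T.clone()))

def reconstruct(U, Sigma, V):
    return U @ torch.diag(Sigma) @ V.T

theta = factorize_memory(torch.randn(256, 256), r=32)
print(reconstruct(*theta).shape)   # torch.Size([256, 256])
```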
Integrating Titan and Transformer²
By combining Titan’s memory mechanism with Transformer², I create a framework where the model learns to adjust its low-rank memory parameters based on rewards. Here’s how this integration works mathematically:
State, Action, and Transitions
At time ( t ), the state is defined as:

$$ s_t = (\theta_t, M_t), $$

where ( \theta_t = \{U_t, \Sigma_t, V_t\} ) are the decomposed memory parameters and ( M_t ) is the current memory.
The action ( a_t ) corresponds to adjustments of these factors:

$$ a_t = (\Delta U_t, \Delta \Sigma_t, \Delta V_t). $$
The transition updates the parameters:

$$ \theta_{t+1} = \theta_t + a_t, $$

which implies

$$ U_{t+1} = U_t + \Delta U_t, \qquad \Sigma_{t+1} = \Sigma_t + \Delta \Sigma_t, \qquad V_{t+1} = V_t + \Delta V_t. $$
Simultaneously, the memory updates via the surprise mechanism:

$$ M_{t+1} = (1 - \alpha_{t+1})\, M_t + S_{t+1}. $$
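Putting the transition together in code (a sketch under the same assumptions as before; the key/value pair would come from the current token):

```python
import torch

def apply_action(theta, action):
    """theta_{t+1} = theta_t + a_t, factor by factor."""
    return tuple(p + dp for p, dp in zip(theta, action))

def surprise_update(M, S, k, v, eta=0.9, lam=0.1, alpha=0.01):
    """Titan-style memory transition from the review section."""
    err = M @ k - v
    S = eta * S - lam * 2 * torch.outer(err, k)   # surprise momentum
    M = (1 - alpha) * M + S                       # forgetting gate
    return M, S

# One full step s_t -> s_{t+1}:
d, r = 64, 8
theta = (torch.randn(d, r), torch.randn(r), torch.randn(d, r))   # U, Sigma, V
action = tuple(0.01 * torch.randn_like(p) for p in theta)        # deltas
M, S = torch.zeros(d, d), torch.zeros(d, d)

theta = apply_action(theta, action)
k, v = torch.randn(d), torch.randn(d)   # key/value from the current token
M, S = surprise_update(M, S, k, v)
```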
Policy and Reward
The policy ( \pi_{\phi}(a \mid s) ) decides how to adjust the low-rank factors given the current state. The objective is to maximize the expected cumulative reward:

$$ J(\phi) = \mathbb{E}_{\pi_\phi}\!\left[ \sum_{t=0}^{T} \gamma^t\, r_t \right], $$

where ( r_t ) is the reward at time ( t ), reflecting how well the model leverages its memory to perform on tasks, and ( \gamma \in [0, 1) ) is a discount factor.
Policy Gradient Update
Using the policy gradient theorem, the update for the policy parameters ( \phi ) is:

$$ \nabla_\phi J(\phi) = \mathbb{E}_{\pi_\phi}\!\left[ \sum_{t=0}^{T} \nabla_\phi \log \pi_\phi(a_t \mid s_t)\, G_t \right], $$

with the discounted return

$$ G_t = \sum_{k=t}^{T} \gamma^{\,k-t}\, r_k. $$

The policy update rule becomes:

$$ \phi \leftarrow \phi + \beta\, \nabla_\phi J(\phi), $$

where ( \beta ) is the policy learning rate.
Actions sampled from this policy then update the SVD factors, aligning the model’s memory adjustments with long-term rewards.
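As a toy end-to-end sketch, here is a REINFORCE-style update for a diagonal Gaussian policy over the flattened factor adjustments. The reward is a placeholder; in a real experiment it would score task performance achieved with the adjusted memory:

```python
import torch

d, r = 16, 4
n = 2 * d * r + r                           # flattened sizes of the deltas

# Diagonal Gaussian policy over the action vector (a toy stand-in).
mean = torch.zeros(n, requires_grad=True)
log_std = torch.zeros(n, requires_grad=True)
opt = torch.optim.Adam([mean, log_std], lr=1e-2)
gamma = 0.99

# Roll out an episode: sample actions, record log-probs and rewards.
log_probs, rewards = [], []
for _ in range(10):
    dist = torch.distributions.Normal(mean, log_std.exp())
    a = dist.sample()
    log_probs.append(dist.log_prob(a).sum())
    rewards.append(-a.pow(2).mean())        # placeholder reward signal

# G_t = sum_{k >= t} gamma^(k - t) r_k, computed backwards.
returns, G = [], torch.tensor(0.0)
for rwd in reversed(rewards):
    G = rwd + gamma * G
    returns.append(G)
returns.reverse()

# REINFORCE: ascend E[ sum_t grad log pi(a_t | s_t) * G_t ].
loss = -sum(lp * R for lp, R in zip(log_probs, returns))
opt.zero_grad()
loss.backward()
opt.step()
```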
Conclusion
By theoretically integrating Titan’s advanced memory mechanisms with Transformer², I can obtain a model that mimics human-like memory—compressing information efficiently, focusing on surprising events, and adapting its long-term memory based on feedback.
So, what’s next?
I cannot run experiments yet because I don’t have stable GPU resources. If I manage to secure the necessary compute someday, I will run the experiments and write up the results in a sequel post.
Related Works
Paper: