An Attempt at Bridging Human Cognition and AI: Integrating Titan with Transformer²
Table of Contents
- Why do I want to integrate Titan and Transformer²?
- Review
- Integrating Titan and Transformer²
- Conclusion
- So, what’s next?
- Related Works
Why do I want to integrate Titan and Transformer²?
Have you ever left a meeting remembering only the key decision points, while the finer details faded away? This common human experience reflects how our brains adeptly compress and encode vast amounts of information—retaining only the most salient details and discarding what’s less essential. Inspired by this natural efficiency, we’re exploring how similar strategies can enhance artificial intelligence.
Consider how you recall meetings: you might forget every conversation twist, but remember the crucial conclusions. Our brains prioritize and compress memories, focusing on fundamental insights while efficiently filtering out noise. This remarkable ability points to an underlying mechanism where the most critical patterns are retained, and extraneous data is minimized.
To achieve this, I leverage Singular Value Decomposition (SVD) within Titan’s architecture to mimic the brain’s natural ability to distill and prioritize information.
Using SVD in Titan’s memory module allows the model to:
- Compress Information: Just as the brain forms compressed memories, SVD captures dominant patterns in data, reducing complexity.
- Enhance Efficiency: By working with low-rank approximations, the model operates faster and requires less memory, similar to the brain’s ability to recall information without processing every detail.

By integrating these principles, I aim to bridge the gap between artificial and biological intelligence, creating models that are both efficient and cognitively inspired. The toy sketch below illustrates the compression idea.
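To make the compression claim concrete, here is a minimal PyTorch sketch (sizes and names are illustrative, not from either paper) showing that a rank-16 truncation of a 512×512 matrix keeps the dominant structure with roughly 6% of the parameters:

```python
import torch

torch.manual_seed(0)
d, r = 512, 16

# A toy "memory" matrix: dominant low-rank structure plus small noise.
W = torch.randn(d, r) @ torch.randn(r, d) + 0.01 * torch.randn(d, d)

# Truncated SVD: keep only the r most dominant singular directions.
U, s, Vh = torch.linalg.svd(W, full_matrices=False)
W_hat = U[:, :r] @ torch.diag(s[:r]) @ Vh[:r, :]

params_full = d * d                     # 262,144 entries
params_lowrank = 2 * d * r + r          # 16,400 entries
rel_err = (torch.linalg.norm(W - W_hat) / torch.linalg.norm(W)).item()
print(f"kept {params_lowrank / params_full:.1%} of the parameters")
print(f"relative reconstruction error: {rel_err:.4f}")
```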
Review
Titan’s native Attention mechanism
Why do I call it a native attention mechanism? Take a look:
At its core, Titan introduces a novel way to update a long-term memory module based on “surprise”—a signal derived from how unexpected new information is compared to past experiences. For an input token ( x_t ), Titan projects it into a key–value pair:

$$ k_t = x_t W_K, \qquad v_t = x_t W_V $$
and defines an associative memory loss:

$$ \ell(M_{t-1}; x_t) = \left\| M_{t-1}(k_t) - v_t \right\|_2^2 $$
Using this loss, Titan updates a “surprise” momentum ( S_t ) with forgetting:

$$ S_t = \eta_t\, S_{t-1} - \lambda_t\, \nabla \ell(M_{t-1}; x_t), $$

where ( \eta_t ) decays past surprise and ( \lambda_t ) scales the momentary surprise,
and then updates the memory with a forgetting gate ( \alpha_t \in [0, 1] ):

$$ M_t = (1 - \alpha_t)\, M_{t-1} + S_t $$
This mechanism allows Titan to selectively encode surprising information, reminiscent of how humans remember remarkable events while forgetting mundane details over time.
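As a minimal sketch of this loop, assume a linear memory that acts as ( M(k_t) = M k_t ) and fixed scalar gates ( \eta, \lambda, \alpha ) (Titan makes these data-dependent and learned; all names here are illustrative):

```python
import torch

d = 64
W_K = torch.randn(d, d) / d**0.5   # key projection
W_V = torch.randn(d, d) / d**0.5   # value projection
M = torch.zeros(d, d)              # linear long-term memory: M(k) = M @ k
S = torch.zeros(d, d)              # surprise momentum

eta, lam, alpha = 0.9, 0.1, 0.01   # fixed gates; Titan learns these per token

def titan_step(x, M, S):
    k, v = W_K @ x, W_V @ x
    err = M @ k - v                     # residual of the associative recall
    grad = 2 * torch.outer(err, k)      # gradient of ||M k - v||^2 w.r.t. M
    S = eta * S - lam * grad            # surprise momentum with forgetting
    M = (1 - alpha) * M + S             # memory update with forgetting gate
    return M, S

for _ in range(10):                     # stream of tokens
    M, S = titan_step(torch.randn(d), M, S)
```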
Transformer²’s Singular Value Decomposition (SVD)
Titan’s memory module often involves large weight matrices, which can be computationally expensive. By applying Singular Value Decomposition (SVD), we can factorize these matrices into low-rank components. For a weight matrix ( W \in \mathbb{R}^{d \times d} ), a rank-( r ) truncated SVD gives:

$$ W \approx U \Sigma V^\top, $$
where
- ( U \in \mathbb{R}^{d \times r} ),
- ( \Sigma \in \mathbb{R}^{r \times r} ), diagonal, holding the top ( r ) singular values (written ( \Sigma ) rather than ( S ) to avoid a clash with the surprise momentum ( S_t )),
- ( V \in \mathbb{R}^{d \times r} ),
with ( r ) being the rank of the decomposition.
This low-rank factorization reduces the number of parameters and computational load while preserving the most significant information in the weight matrix. When applied to Titan, the key memory parameters become the set

$$ \theta_t = \{ U_t, \Sigma_t, V_t \}, $$

which evolves over time as the model learns.
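A small sketch of this factorization, assuming the memory is stored as a dense matrix (the helper names here are mine, not from either paper):

```python
import torch

def factorize_memory(W: torch.Tensor, r: int):
    """Split a dense memory matrix into trainable factors
    theta = {U, Sigma, V} with W ~ U @ diag(Sigma) @ V.T."""
    U, sigma, Vh = torch.linalg.svd(W, full_matrices=False)
    # Keep the top-r directions; store Sigma as a vector of singular values.
    return (torch.nn.Parameter(U[:, :r].clone()),
            torch.nn.Parameter(sigma[:r].clone()),
            torch.nn.Parameter(Vh[:r, :].T.clone()))

def reconstruct(U, Sigma, V):
    return U @ torch.diag(Sigma) @ V.T

theta = factorize_memory(torch.randn(256, 256), r=32)
print(reconstruct(*theta).shape)   # torch.Size([256, 256])
```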
Integrating Titan and Transformer²
By combining Titan’s memory mechanism with Transformer², I create a framework where the model learns to adjust its low-rank memory parameters based on rewards. Here’s how this integration works mathematically:
State, Action, and Transitions
At time ( t ), the state is defined as:

$$ s_t = (\theta_t, M_t), $$

where ( \theta_t = \{U_t, \Sigma_t, V_t\} ) are the decomposed memory parameters and ( M_t ) is the current memory.
The action ( a_t ) corresponds to adjustments of these factors:

$$ a_t = (\Delta U_t, \Delta \Sigma_t, \Delta V_t). $$
The transition updates the parameters:

$$ \theta_{t+1} = \theta_t + a_t, $$

which implies

$$ U_{t+1} = U_t + \Delta U_t, \qquad \Sigma_{t+1} = \Sigma_t + \Delta \Sigma_t, \qquad V_{t+1} = V_t + \Delta V_t. $$
Simultaneously, the memory updates via the surprise mechanism:

$$ M_{t+1} = (1 - \alpha_{t+1})\, M_t + S_{t+1}. $$
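Putting the transition together in code (a sketch under the same assumptions as before; the key/value pair would come from the current token):

```python
import torch

def apply_action(theta, action):
    """theta_{t+1} = theta_t + a_t, factor by factor."""
    return tuple(p + dp for p, dp in zip(theta, action))

def surprise_update(M, S, k, v, eta=0.9, lam=0.1, alpha=0.01):
    """Titan-style memory transition from the review section."""
    err = M @ k - v
    S = eta * S - lam * 2 * torch.outer(err, k)   # surprise momentum
    M = (1 - alpha) * M + S                       # forgetting gate
    return M, S

# One full step s_t -> s_{t+1}:
d, r = 64, 8
theta = (torch.randn(d, r), torch.randn(r), torch.randn(d, r))   # U, Sigma, V
action = tuple(0.01 * torch.randn_like(p) for p in theta)        # deltas
M, S = torch.zeros(d, d), torch.zeros(d, d)

theta = apply_action(theta, action)
k, v = torch.randn(d), torch.randn(d)   # key/value from the current token
M, S = surprise_update(M, S, k, v)
```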
Policy and Reward
The policy ( \pi_{\phi}(a \mid s) ) decides how to adjust the low-rank factors given the current state. The objective is to maximize the expected cumulative reward:

$$ J(\phi) = \mathbb{E}_{\pi_\phi}\!\left[ \sum_{t=0}^{T} \gamma^t\, r_t \right], $$

where ( r_t ) is the reward at time ( t ), reflecting how well the model leverages its memory to perform on tasks, and ( \gamma \in [0, 1) ) is a discount factor.
Policy Gradient Update
Using the policy gradient theorem, the update for the policy parameters ( \phi ) is:

$$ \nabla_\phi J(\phi) = \mathbb{E}_{\pi_\phi}\!\left[ \sum_{t=0}^{T} \nabla_\phi \log \pi_\phi(a_t \mid s_t)\, G_t \right], $$

with the discounted return

$$ G_t = \sum_{k=t}^{T} \gamma^{\,k-t}\, r_k. $$

The policy update rule becomes:

$$ \phi \leftarrow \phi + \beta\, \nabla_\phi J(\phi), $$

where ( \beta ) is the policy learning rate.
Actions sampled from this policy then update the SVD factors, aligning the model’s memory adjustments with long-term rewards.
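As a toy end-to-end sketch, here is a REINFORCE-style update for a diagonal Gaussian policy over the flattened factor adjustments. The reward is a placeholder; in a real experiment it would score task performance achieved with the adjusted memory:

```python
import torch

d, r = 16, 4
n = 2 * d * r + r                           # flattened sizes of the deltas

# Diagonal Gaussian policy over the action vector (a toy stand-in).
mean = torch.zeros(n, requires_grad=True)
log_std = torch.zeros(n, requires_grad=True)
opt = torch.optim.Adam([mean, log_std], lr=1e-2)
gamma = 0.99

# Roll out an episode: sample actions, record log-probs and rewards.
log_probs, rewards = [], []
for _ in range(10):
    dist = torch.distributions.Normal(mean, log_std.exp())
    a = dist.sample()
    log_probs.append(dist.log_prob(a).sum())
    rewards.append(-a.pow(2).mean())        # placeholder reward signal

# G_t = sum_{k >= t} gamma^(k - t) r_k, computed backwards.
returns, G = [], torch.tensor(0.0)
for rwd in reversed(rewards):
    G = rwd + gamma * G
    returns.append(G)
returns.reverse()

# REINFORCE: ascend E[ sum_t grad log pi(a_t | s_t) * G_t ].
loss = -sum(lp * R for lp, R in zip(log_probs, returns))
opt.zero_grad()
loss.backward()
opt.step()
```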
Conclusion
By theoretically integrating Titan’s advanced memory mechanisms with Transformer², I can obtain a model that mimics human-like memory—compressing information efficiently, focusing on surprising events, and adapting its long-term memory based on feedback.
So, what’s next?
I cannot run experiments yet because I don’t have stable GPU resources. If I manage to secure the necessary compute someday, I will run the experiments and write up the results in a sequel post.
Related Works
Paper: