Official PyTorch implementation of Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention.
Ali Hatamizadeh, Yejin Choi, and Jan Kautz.
Linear attention compresses an unbounded KV cache into a fixed-size recurrent state. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. Prior delta-rule models (Gated DeltaNet, Kimi Delta Attention) tie erasing and writing to a single scalar gate — even though they act on different axes of the state.
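To make the editing problem concrete, here is a toy NumPy sketch (not taken from this repository) that binds a value to a key and then rebinds the same key. A purely additive state returns a mix of both values, while a delta-rule update with a single scalar gate first erases what the key currently retrieves. The function names and the `(d_k, d_v)` state layout are illustrative assumptions.

```python
import numpy as np

def additive_update(S, k, v):
    """Plain linear-attention write: S <- S + k v^T (nothing is erased)."""
    return S + np.outer(k, v)

def delta_update(S, k, v, beta=1.0):
    """Scalar-gated delta rule: S <- (I - beta k k^T) S + beta k v^T."""
    old = k @ S                          # value currently retrieved by k, shape (d_v,)
    return S + beta * np.outer(k, v - old)

k = np.array([1.0, 0.0, 0.0])            # unit-norm key
v_old = np.array([1.0, 2.0])
v_new = np.array([-3.0, 5.0])

S_add = additive_update(additive_update(np.zeros((3, 2)), k, v_old), k, v_new)
S_delta = delta_update(delta_update(np.zeros((3, 2)), k, v_old), k, v_new)

read_add = k @ S_add                     # mixes both bindings: v_old + v_new
read_delta = k @ S_delta                 # keeps only the latest binding: v_new
```

In this sketch the single `beta` controls both how much is erased and how much is written, which is the coupling the text describes.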
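Gated DeltaNet-2 decouples these two roles: a channel-wise erase gate b_t acts on the key axis of the state, and a channel-wise write gate w_t acts on the value axis.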
Given an erase gate b_t ∈ [0,1]^{d_k}, a write gate w_t ∈ [0,1]^{d_v}, and channel-wise decay D_t = Diag(α_t), the recurrent state evolves as:
Compared with KDA, the right factor of the rank-one erase becomes channel-selective on the key axis, and the write term becomes channel-selective on the value axis. The two decisions no longer share a single scalar.
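The description above can be sketched as a single recurrent step. The update form used here, `S_t = (I - k_t (b_t * k_t)^T) D_t S_{t-1} + k_t (w_t * v_t)^T` with a state of shape `(d_k, d_v)`, is an assumption reconstructed from the prose (gated right factor in the erase, gated value in the write). It is not copied from the paper or from the repository's kernels, and `gdn2_step` is a hypothetical name. Treat the paper's equation as authoritative.

```python
import numpy as np

def gdn2_step(S, k, v, alpha, b, w):
    """One step of the ASSUMED recurrence
        S_t = (I - k_t (b_t * k_t)^T) Diag(alpha_t) S_{t-1} + k_t (w_t * v_t)^T
    with S of shape (d_k, d_v). Reconstructed from the README prose only."""
    S = alpha[:, None] * S               # channel-wise decay along the key axis
    retrieved = (b * k) @ S              # read-out through the gated key, shape (d_v,)
    S = S - np.outer(k, retrieved)       # rank-one erase: left factor k, right factor b * k
    return S + np.outer(k, w * v)        # write, gated per value channel

d_k, d_v = 4, 3
rng = np.random.default_rng(0)
S0 = rng.normal(size=(d_k, d_v))
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)                   # unit-norm key
v = rng.normal(size=d_v)
alpha = np.ones(d_k)                     # no decay, to isolate the two gates

# Fully open gates: the key's old binding is replaced by v.
S_full = gdn2_step(S0, k, v, alpha, np.ones(d_k), np.ones(d_v))

# Write gate closed on the last value channel: that channel receives no new content.
w = np.array([1.0, 1.0, 0.0])
S_masked = gdn2_step(S0, k, v, alpha, np.ones(d_k), w)
```

With a unit-norm key and no decay, reading `k @ S_full` returns `v`, while `k @ S_masked` returns `w * v`. The erase gate `b` plays the analogous role on the key axis by choosing which key channels contribute to the read-out that gets subtracted.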
We train all models at 1.3B parameters on 100B tokens of FineWeb-Edu, matched in recurrent state size, and compare against Mamba-2, Gated DeltaNet, KDA, and Mamba-3 (SISO and MIMO).
Gated DeltaNet-2 achieves the best average across both recurrent-only and hybrid settings:
| Model | | | | Avg. |
|---|---|---|---|---|
| Recurrent | | | | |
| Mamba-2 | 16.79 | 12.38 | 45.24 | 51.82 |
| Gated DeltaNet | 16.40 | 11.89 | 49.62 | 52.07 |
| KDA | 16.81 | 11.68 | 48.13 | 52.28 |
| Mamba-3 (MIMO) | 16.45 | 11.66 | 47.82 | 52.39 |
| Gated DeltaNet-2 | 15.90 | 11.41 | 48.09 | 53.11 |
| Hybrid (+ SWA) | | | | |
| Transformer | 19.22 | 13.72 | 48.32 | 50.86 |
| Gated DeltaNet | 16.00 | 10.82 | 48.71 | 52.25 |
| KDA | 16.01 | 10.66 | 49.21 | 52.68 |
| Mamba-3 (MIMO) | 15.81 | 10.92 | 49.82 | 52.72 |
| Gated DeltaNet-2 | 15.62 | 10.43 | 50.90 | 53.97 |
Gated DeltaNet-2 is strongest where memory editing matters most — particularly the interference-heavy multi-key needle-in-a-haystack settings:
| Model | | | |
|---|---|---|---|
| Recurrent | | | |
| Gated DeltaNet | 87.2 | 54.2 | 27.8 |
| KDA | 89.0 | 63.2 | 28.0 |
| Mamba-3 (MIMO) | 64.2 | 72.4 | 18.0 |
| Gated DeltaNet-2 | 93.0 | 89.8 | 37.8 |
| Hybrid | | | |
| Gated DeltaNet | 57.3 | 91.2 | 44.8 |
| KDA | 56.0 | 93.4 | 40.4 |
| Mamba-3 (MIMO) | 53.0 | 98.4 | 46.6 |
| Gated DeltaNet-2 | 57.9 | 99.0 | 48.0 |
Across SWDE, SQuAD, FDA, TriviaQA, NQ, and DROP, Gated DeltaNet-2 leads the recurrent and hybrid frontier:
| | | | | | Gated DeltaNet-2 |
|---|---|---|---|---|---|
| Recurrent avg. | 26.84 | 28.09 | 28.67 | 28.35 | 29.88 |
| Hybrid avg. | 39.74 | 39.11 | 40.14 | 40.11 | 42.28 |
When training the hybrid 1.3B model on a single H100, Gated DeltaNet-2 throughput stays nearly flat as sequence length grows, with only a small constant overhead over KDA for the added channel-wise gates.
| Model | Decay | Erase gate | Write gate |
|---|---|---|---|
| Mamba-2 | scalar | none | scalar |
| Gated DeltaNet | scalar | scalar β_t | scalar β_t |
| KDA | channel-wise | scalar β_t | scalar β_t |
| Gated DeltaNet-2 | channel-wise | channel-wise b_t | channel-wise w_t |
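One way to read the table is that the earlier models are special cases in which the gates are tied. The sketch below assumes the general update `S <- (I - k (b * k)^T) Diag(alpha) S + k (w * v)^T` (a reconstruction from this README's prose, not the repository's kernel) and checks numerically that setting both gates to one shared scalar `beta` gives the KDA-style step. The names `gated_delta_step` and `kda_step` are hypothetical.

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, b, w):
    """ASSUMED general form: S <- (I - k (b*k)^T) Diag(alpha) S + k (w*v)^T."""
    S = alpha[:, None] * S                               # channel-wise decay
    return S - np.outer(k, (b * k) @ S) + np.outer(k, w * v)

def kda_step(S, k, v, alpha, beta):
    """KDA-style step: channel-wise decay, one scalar beta for erase and write."""
    d_k = k.shape[0]
    decayed = alpha[:, None] * S
    return (np.eye(d_k) - beta * np.outer(k, k)) @ decayed + beta * np.outer(k, v)

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S = rng.normal(size=(d_k, d_v))
k, v = rng.normal(size=d_k), rng.normal(size=d_v)
alpha = rng.uniform(0.5, 1.0, size=d_k)
beta = 0.7

# Tying both gates to the same scalar recovers the KDA-style row of the table.
S_tied = gated_delta_step(S, k, v, alpha, np.full(d_k, beta), np.full(d_v, beta))
S_kda = kda_step(S, k, v, alpha, beta)

# A zero erase gate leaves only decay and write, as in the row with no erase gate.
S_no_erase = gated_delta_step(S, k, v, alpha, np.zeros(d_k), np.full(d_v, beta))
```

Under these assumptions `S_tied` and `S_kda` agree up to floating-point error. Making `alpha` constant across channels would likewise give the scalar-decay rows.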
Ablations confirm both gates contribute, with the erase gate b_t accounting for most of the gain — consistent with its role in selectively protecting or revising key-side associations in the recurrent state.
Launch your training with our streamlined command:
💡 Pro Tip: Add `--interactive_job --debug` for interactive debugging sessions!
We train 1.3B-parameter models on 100B tokens of FineWeb-Edu with:
Copyright © 2026, NVIDIA Corporation. All rights reserved.
Licensed under the NVIDIA Source Code License-NC. See LICENSE for details.
Built on the shoulders of giants:
If you find this work useful, please consider citing:
If you find this work useful, please consider:
Join us in pushing the boundaries of linear attention! 🚀