2026_ULTRA-HSTU

1. ULTRA-HSTU [2026]

《Bending the Scaling Law Curve in Large-Scale Recommendation Systems》

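Learning from user interaction history with sequence models has become a cornerstone of large-scale recommendation systems. Recent advances in large language models have revealed promising scaling laws, sparking a wave of research into long-sequence modeling and deeper architectures for recommendation. However, many recent methods rely heavily on cross-attention to ease the quadratic-complexity bottleneck of sequential modeling, and cross-attention can limit the representational power that self-attention provides. This paper presents ULTRA-HSTU, a new sequential recommendation model developed through end-to-end model and system co-design. With innovations in input sequences, sparse attention, and model topology, ULTRA-HSTU substantially improves both model quality and efficiency. Comprehensive benchmarks show large gains in scaling efficiency: compared with conventional models, training scaling is more than 5x faster and inference scaling is 21x faster, with better recommendation quality. The approach is fully deployed at scale, serves billions of users daily, and delivers 4% to 8% gains in consumption and engagement in real production.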
In recent years, Transformer-based sequential modeling has become the new paradigm driving large-scale recommendation research in the era of scaled GPU computation. Traditional deep-learning recommendation models (DLRM) focus on feature interactions ("Deep & cross network for ad click predictions") and depend on carefully hand-crafted features. These models are effective, but they do not scale up efficiently when compute is added to strengthen feature interactions or to stack more layers ("Actions speak louder than words: trillion-parameter sequential transducers for generative recommendations", "Scaling Transformers for Discriminative Recommendation via Generative Pretraining"). In contrast, Transformer-based sequential modeling learns end to end from raw user behavior sequences, captures both long-term preferences and short-term intent ("Behavior sequence transformer for e-commerce recommendation in alibaba"), and follows favorable scaling laws: model quality keeps improving with longer sequences, more compute in the attention layers, and deeper stacks of attention layers.
An important line of work in this area is Hierarchical Sequential Transduction Units (HSTU) ("Actions speak louder than words: trillion-parameter sequential transducers for generative recommendations"). HSTU proposes a customized transformer-style architecture that learns user interests efficiently and directly from raw sequence data, and it is the first transformer-like model designed for recommendation to show good scaling behavior. Since then the sequential modeling paradigm has been widely adopted and extended by leading industry practitioners, including Douyin ("Make it long, keep it fast: End-to-end 10k-sequence modeling at billion scale on douyin", "Longer: Scaling up long sequence modeling in industrial recommenders"), Meituan ("Mtgr: Industrial-scale generative recommendation framework in meituan"), Alibaba ("Scaling Transformers for Discriminative Recommendation via Generative Pretraining"), Xiaohongshu ("Towards Large-scale Generative Ranking"), Meta ("Actions speak louder than words: trillion-parameter sequential transducers for generative recommendations"), and LinkedIn ("Efficient user history modeling with amortized inference for deep learning recommendation models"), each contributing its own architectural innovations. This broad adoption across major platforms confirms the effectiveness and impact of sequential modeling in large-scale recommendation.
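However, transformer-based models, HSTU included, are limited by the $O(L^2)$ computational complexity of self-attention, where $L$ is the length of the user history sequence. For user histories on the order of $O(10k)$ to $O(100k)$ events, this quadratic scaling quickly becomes impractical, especially when serving billions of recommendation requests per day (each of which must be served with sub-millisecond latency). To ease the quadratic bottleneck, earlier solutions from leading companies mainly use cross-attention ("Longer: Scaling up long sequence modeling in industrial recommenders", "Make it long, keep it fast: End-to-end 10k-sequence modeling at billion scale on douyin"), where only the ranking candidates or truncated user histories act as queries, instead of self-attention over the complete user history. Some methods are further restricted to shallow architectures with only 2 to 4 attention layers ("Make it long, keep it fast: End-to-end 10k-sequence modeling at billion scale on douyin"). These strategies differ fundamentally from large language model (LLM) practice. They cut computational complexity substantially, but they may give up the benefits of strong self-attention and deeper model architectures. Experiments (see Table 1 and Table 5) show that self-attention still outperforms cross-attention in industrial settings, especially when layers are stacked or computation is scaled up. This finding is also the key difference between this paper and earlier solutions: rather than abandoning self-attention, the paper takes inspiration from DeepSeek-V2 ("Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model") and uses model and system co-optimizations to exploit self-attention efficiently.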
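To optimize scaling efficiency for ultra-long user history modeling, the paper proposes ULTRA-HSTU, the next-generation HSTU model, which combines a full set of fine-grained model and system optimizations inspired by DeepSeek-V2. Scaling efficiency is formally defined as the slope of a linear regression fitted between model performance and computational cost. Under a fixed input sequence configuration, the new model achieves more than 21x the inference scaling efficiency and 5x the training scaling efficiency of the original HSTU architecture. This effectively bends the scaling curve of recommendation systems, so model quality improves faster as compute grows (see Figure 1).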
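To validate the approach, ULTRA-HSTU (18 self-attention layers, user behavior sequences of length 16k, trained on hundreds of H100 GPUs) is deployed in a large-scale production environment serving billions of users. The model delivers 4% to 8% gains in consumption and engagement and a 0.217% lift in the top-line metric, confirming both the scaling potential of sequential modeling in recommendation and the effectiveness of the approach. To the authors' knowledge, ULTRA-HSTU is one of the largest sequence models deployed in industry, with substantially improved scaling efficiency. The technical contributions are: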
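- Input sequence optimizations: two complementary designs optimize the input sequence processing of the original HSTU.
  - First, item representations and action representations are fused in the sequence design, and this simplified design is strengthened with heterogeneous action encodings.
  - Second, to reduce the inefficiency caused by sequence-length imbalance across nodes in synchronous distributed training, the paper proposes Load-Balanced Stochastic Length (LBSL), which bounds the per-node compute load during stochastic length sampling, reduces stragglers, and raises training throughput by 15%.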
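- Model-system co-design for highly efficient attention: an end-to-end model-system co-design removes the usual quadratic cost and kernel overhead, making the self-attention in HSTU usable for ultra-long user interaction history modeling in production.
  - On the model side, the paper proposes semi-local attention (SLA), a sparse attention tailored to user behavior sequences with $O((K_1+K_2)\times L)$ complexity (where $K_1, K_2$ are the local and global window sizes). Experiments show that SLA improves inference scaling efficiency by more than 5x over the baseline.
  - On the system side, SLA is paired with hardware-aware, fine-grained optimizations that remove practical bottlenecks and raise hardware utilization in both training and inference.
    - The paper co-designs a mixed-precision framework for recommendation covering 16/8/4-bit formats: most operations stay in BF16 for stability, the core matrix multiplications (GEMM) are accelerated with FP8, and inference communication volume is reduced with INT4 embedding quantization.
    - The paper extends the ideas of FlashAttention V3 ("Flashattention-3: Fast and accurate attention with asynchrony and low-precision") to build custom SLA kernels that support HSTU's SiLU-based attention and non-standard masks, tuned for heterogeneous GPU architectures (NVIDIA H100 and AMD MI300) to sustain high GPU utilization.
    - The paper also introduces memory optimizations that substantially reduce high-bandwidth memory (HBM) usage at almost no efficiency cost, enabling ultra-long sequence training.
    Compared with the same model without system optimizations, these co-designed components give a 70% training throughput gain and a 50% inference throughput gain. Note: to maximize end-to-end performance, the paper focuses on end-to-end throughput for a fixed number of samples (how fast training/inference completes) rather than on GPU utilization alone.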
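- Dynamic-topology model design: the scalability of recommendation models is not limited to sequence length; vertical scaling by stacking more layers brings extra gains, especially with residual connections. However, HSTU with SLA still costs $O(DL)$ (where $D$ is the model depth). Based on the insight that different user signals have different predictive value, the paper proposes two new topology designs that concentrate computation on the most important signals:
  - 1) Attention Truncation: the first $N_1$ layers process the full sequence; a shorter, high-value segment is then selected, and only that segment passes through another $N_2$ layers.
  - 2) Mixture of Transducers (MoT): heterogeneous behavioral signals are treated as multiple sequences, processed by independent transducers whose representations are then fused. This allocates capacity and compute to high-value signals in a targeted way instead of forcing all signals to compete within one timeline.
  Experiments show that both topology designs substantially improve the performance-efficiency trade-off and further amplify scaling.
  Compared with HSTU, ULTRA-HSTU does not change the overall direction; it mainly optimizes efficiency in the details.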

1.1 Related Work

Traditional industrial recommendation models usually follow the deep learning recommendation model framework ("Deep learning recommendation model for personalization and recommendation systems", "Software-hardware co-design for fast and scalable training of deep learning recommendation models") and focus on modeling user and item feature interactions. In recent years the training paradigm of industrial recommendation models has shifted: instead of relying on cross-user-item features, most recent progress is driven by learning from user interaction histories.
- Deep Interest Network (DIN) ("Deep interest network for click-through rate prediction") is one of the classic short-sequence learning methods.
- SASRec ("Self-attentive sequential recommendation") is an implementation of the standard Transformer for recommendation.
- HSTU ("Actions speak louder than words: trillion-parameter sequential transducers for generative recommendations") was proposed later and outperforms standard transformer-based models on recommendation with target-aware predictions. By learning implicit and explicit information from the raw user interaction history, HSTU removes the dependence on hand-crafted user-item features and shows a good scaling law.
This paper builds on the HSTU line of work and further optimizes scaling behavior to obtain a better model at lower compute cost. The most closely related studies focus on the training and inference efficiency of sequence models.
- After the breakthrough of native sparse attention (NSA) ("Native sparse attention: Hardware-aligned and natively trainable sparse attention"), linear sparse attention became a research focus for deploying scalable large models ("Longformer: The long-document transformer").
- Beyond sparse attention, Stacked Target-to-History Cross Attention (STCA) ("Make it long, keep it fast: End-to-end 10k-sequence modeling at billion scale on douyin") runs attention with only the ranking targets as queries. This reduces model complexity substantially, but the simplified attention (no self-attention) lowers quality. Although STCA reaches linear complexity, it introduces costly pre-attention projections to recover quality, which adds considerable compute, and it captures information less well on shorter sequences (see Table 3).

1.2 Background

As shown in Figure 2(a), a typical recommendation system takes input features and uses a multi-task classification model $\mathcal M$ to score, for each prediction task $y_k$ and each candidate $x_j$, $\hat y_k = \mathcal M(\mathbf X, x_j)$, then ranks the candidates by the predicted scores, where $\mathbf X$ denotes the user's input features. The model is trained by minimizing the cross-entropy loss $L$ between the predictions $\hat y_k$ and the ground truth labels $y_k$. In generative ranking, $\mathbf X$ is modeled as a sequence of embeddings, and attention layers learn the probabilities from the sequential embeddings.
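Input: as shown in Figure 2(a), the recommendation system uses a feature preprocessor to convert the different input features into a sequence of embeddings, including:
- User interaction history (UIH) sequence: the sequence of items a given user has interacted with, together with the corresponding actions (e.g. like, comment, video view complete) and context (e.g. timestamps). Raw item IDs (and their multimodal representations) and action types are mapped by embedding table lookups into $d$-dimensional embeddings. For user $i$, the UIH is $\mathbf X_i=\{\mathbf I_i,\mathbf A_i\}$, where the UIH consists of item embeddings $\mathbf I_i=\{\mathbf{\vec e}_{i,j}\}_{j=1}^{L_i}\in \mathbb R^{L_i\times d}$ and action embeddings $\mathbf A_i=\{\mathbf{\vec a}_{i,j}\}_{j=1}^{L_i}\in \mathbb R^{L_i\times d}$.
  - $L_i$ is the total UIH length of user $i$.
  - $\mathbf{\vec e}_{i,j}\in \mathbb R^d$ is the embedding of the item at position $j$ of user $i$'s UIH.
  - $\mathbf{\vec a}_{i,j}\in \mathbb R^d$ is the embedding of the action at position $j$ of user $i$'s UIH.
- Non-sequential features: user-side features (e.g. country, user language) and item-side features (e.g. sparse features such as the raw item ID; dense features such as the item's click-through rate).
  - User-side features can be summarized as context embeddings and placed at the start of the sequential UIH ("Actions speak louder than words: trillion-parameter sequential transducers for generative recommendations").
  - Item-side features can be summarized as item embeddings and inserted into the sequence as target-side embeddings ("Onetrans: Unified feature interaction and sequence modeling with one transformer in industrial recommender").
    "Inserted into the sequence" means, for example, appended to the end of the UIH, or interacting with the history sequence through cross-attention.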
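Model: given a sequence of embeddings, modern recommendation systems use a transformer-style model. A typical architecture is Hierarchical Sequential Transduction Units (HSTU), which clearly outperforms the plain Transformer in recommendation through the following changes:
$\begin{aligned} \text{normalization:}\quad & \mathbf X = \text{Norm}(\mathbf Z) \\ \text{pre-attention:}\quad & \mathbf U, \mathbf Q, \mathbf K, \mathbf V = \phi_1(f_1(\mathbf X)) \\ \text{attention:}\quad & \mathbf A = \left(\phi_2(\mathbf Q \mathbf K^\top) \odot \mathbf M\right) \mathbf V \\ \text{post-attention:}\quad & \mathbf Y = f_2(\text{Norm}(\mathbf A) \odot \mathbf U) \\ \text{residual connection:}\quad & \mathbf Z = \mathbf Y + \mathbf Z \end{aligned}$
where:
- $\odot$ is element-wise multiplication, and $\phi_1,\phi_2$ are both SiLU activations.
- $f_1, f_2$ are MLPs used for the pre-attention and post-attention projections.
- $\mathbf M$ is the causal mask, which preserves the temporal order of the items in the sequence.
- The input embeddings $\mathbf Z$ are normalized before the subsequent operations, and each layer is connected to the previous layer's output through standard residual connections.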
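The attention step of the layer above can be sketched in a few lines. This is a minimal single-head illustration in plain Python, not the paper's kernel: it covers only $\mathbf A = (\phi_2(\mathbf Q\mathbf K^\top)\odot \mathbf M)\mathbf V$ and leaves out the normalization, the gate $\mathbf U$, and the projections $f_1, f_2$. The function names and the list-of-rows layout are assumptions made for the example.

```python
import math

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def hstu_attention(Q, K, V, M):
    """Pointwise attention A = (silu(Q K^T) * M) V on nested lists.

    Q, K, V are L x d lists of rows; M is an L x L 0/1 mask where
    M[i][j] = 1 lets query row i read key/value row j.
    Unlike softmax attention, the weights are not normalized per row.
    """
    L, d = len(Q), len(V[0])
    out = []
    for i in range(L):
        row = [0.0] * d
        for j in range(L):
            if M[i][j]:
                # Pointwise SiLU of the query-key dot product.
                w = silu(sum(q * k for q, k in zip(Q[i], K[j])))
                for c in range(d):
                    row[c] += w * V[j][c]
        out.append(row)
    return out
```

Because the weights come from a pointwise SiLU rather than a softmax, a masked-out position simply contributes nothing, and no row-wise renormalization is needed when the mask changes.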
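Although HSTU shows good scaling laws in recommendation, the paper argues that the scaling curve can be improved further by borrowing the model and system co-design approach of DeepSeek-V2 from the LLM field. It therefore proposes ULTRA-HSTU on top of the original HSTU; the ideas below also carry over to other attention architectures for sequential recommendation.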

1.3 ULTRA-HSTU

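To improve the scaling curve of vanilla HSTU, the paper makes major changes in three core directions:
- Input sequence optimization: shorten the effective sequence length at the source.
- Recommender-specific sparse attention computation: reach linear complexity.
- Dynamic topological design: achieve good depth scaling without paying the full-sequence cost in every layer.
Beyond the reduction in theoretical complexity, ULTRA-HSTU uses model and hardware co-design to improve efficiency in large-scale distributed training and inference for recommendation. Overall, ULTRA-HSTU achieves more than 21x the inference scaling efficiency and 5x the training scaling efficiency of vanilla HSTU. Figure 2 shows the overall architecture; the components are described below.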

1.3.1 Input Sequences Optimization

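First, the paper proposes an efficient action encoding that halves the effective input sequence length and makes attention computation 4x more efficient (attention cost is quadratic in length). Recall that vanilla HSTU interleaves items and actions, so the sequence input of user $i$ is $\{I_{i,1}, a_{i,1},I_{i,2}, a_{i,2},\cdots, I_{i,L_i}, a_{i,L_i}\}$. This design supports both the retrieval stage and the ranking stage, but it makes the ranking-stage sequence twice as long as the actual UIH. Directly merging actions with items could leak the action information of the candidates being predicted, so the paper sets the action embeddings of all candidates to be ranked to $a_{i,j} = \mathbf 0_d$.
The paper explores several ways of combining the action embedding with the item embedding and uses addition: the sequential input of user $i$ is $\mathbf X_i=\{\mathbf{\vec x}_{i,j}\}_{j=1}^{L_i}$ with $\mathbf{\vec x}_{i,j} = \mathbf{\vec e}_{i,j} + \mathbf{\vec a}_{i,j}$. The authors hypothesize that this makes it easier for gradients to flow through the action encodings. Importantly, the design cuts the sequence length to half of the UIH in vanilla HSTU without losing model quality, so ULTRA-HSTU keeps its scalability while achieving clear quality gains.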
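The difference between the interleaved layout and the fused layout is easy to see on a toy input. The sketch below is an illustration under stated assumptions: embeddings are plain Python lists, and the helper names and the `is_candidate` flag are invented for the example.

```python
def interleaved_sequence(items, actions):
    """Vanilla-HSTU-style layout: item and action tokens alternate (length 2L)."""
    seq = []
    for e, a in zip(items, actions):
        seq.extend([e, a])
    return seq

def fused_sequence(items, actions, is_candidate):
    """Fused layout: one token per event, x_j = e_j + a_j (length L).

    Candidate positions use a zero action embedding, so the action that is
    being predicted cannot leak into the model input.
    """
    seq = []
    for e, a, cand in zip(items, actions, is_candidate):
        if cand:
            a = [0.0] * len(e)
        seq.append([ei + ai for ei, ai in zip(e, a)])
    return seq
```

With three events, `interleaved_sequence` yields six tokens and `fused_sequence` yields three, which is where the halved sequence length comes from.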
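The paper further designs the Load-Balanced Stochastic Length (LBSL) algorithm, which raises training throughput by 15%. Stochastic Length (SL), proposed in HSTU, randomly subsamples history sequences during training to length $L^{\alpha/2}$ with $\alpha\in (1,2]$; SL lowers the attention complexity from $O(L^2)$ to $O(L^\alpha)$, and it has been verified that the model generalizes to the full sequence length at inference. Similar ideas are used in other papers ("Make it long, keep it fast: End-to-end 10k-sequence modeling at billion scale on douyin").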
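A quick calculation shows how the SL exponent translates into sequence length and attention cost. This is only arithmetic on the $L^{\alpha/2}$ rule; the function names are made up for the example, and real SL also decides which events to keep, which is not modeled here.

```python
def sl_length(L, alpha):
    """Training length under Stochastic Length: L ** (alpha / 2), rounded."""
    return round(L ** (alpha / 2))

def attention_cost_ratio(L, alpha):
    """Quadratic attention cost at the SL length relative to the full length L."""
    return (sl_length(L, alpha) / L) ** 2
```

With `alpha = 2` nothing is truncated, while with `alpha` close to 1 a 16384-event history shrinks to roughly its square root, so the quadratic attention term becomes roughly linear in the original length.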
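However, in distributed training each node samples independently, so the input and output load (measured by the sum of user sequence lengths) differs markedly across nodes. Such load imbalance sharply reduces the efficiency of a synchronous distributed training framework. The paper proposes LBSL as an extension of SL that bounds the per-rank load of each batch, measured as $\sum_{u\in \text{rank}} n_u^\gamma$ (where $n_u$ is the sequence length of user $u$ and $\gamma$ is the exponent of the HSTU compute cost, $\gamma \in (1,2)$). Load balance is defined as the ratio of the maximum node load to the minimum node load across the whole job.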
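LBSL runs in three phases:
- Estimate the target load $\bar\ell$ during warm-up.
  For the first $T_\text{warm}$ steps, standard SL is used and each rank records its load $\ell_r = \sum_{u\in \text{rank}} n_u^\gamma$. After warm-up, all ranks compute the average global target load with an all-reduce:
  $\bar\ell = \frac{\sum_{r = 1}^{R} \ell_{r}}{R \times T_\text{warm}}$
  where $\ell_r$ is the load of rank $r$ accumulated over the warm-up steps and $R$ is the number of ranks.
- Sample under the target load $\bar\ell$: the weights $p_u$ keep SL's bias toward leaving short sequences unsampled, using weighted sampling-without-replacement plus a greedy fill.
  Steps:
  - Each rank computes $p_u=\text{SLWEIGHT}(n_u;\alpha)$ for every sample in its batch (as in standard SL, short sequences get a higher weight, i.e. they are more likely to be kept untruncated).
  - The samples in the batch are ordered by a weighted random permutation without replacement according to $p_u$.
  - Following that order, samples are greedily added to the set $\mathcal U_r$ of sequences kept at full length ($n_u$), such that:
    $\sum_{u \in \mathcal U_{r}} (n_{u}^{\gamma} - \ell_\text{SL}^{\gamma}) \leq \bar\ell - b \times \ell_\text{SL}^{\gamma}$
    The left side is the cumulative extra load. $\ell_\text{SL}$ is the length that SL truncates to, and $b$ is the batch size. The right side is the upper bound on the extra load the rank can absorb.
  - Samples in $\mathcal U_r$ are not truncated at all (they keep their full length); the remaining samples are truncated as in standard SL. The load of each rank therefore stays at or below $\bar\ell$, and short sequences are more likely to be left untruncated, which preserves the original sampling bias of SL.
- Recompute the target load $\bar\ell$ at a configurable interval to track slow drift in the production length distribution.
  Every $T_\text{recal}$ steps, an all-reduce recomputes $\bar\ell$ to adapt to slow changes in the sequence length distribution during training.
When recalibrated every batch, LBSL has the same average load as standard SL but redistributes sampling across nodes (high-load nodes sample more, low-load nodes sample less), which reduces stragglers without hurting quality. See Algorithm 1 in the appendix.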
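The per-rank sampling step of LBSL (weighted permutation followed by a greedy fill under a load budget) can be sketched as follows. This is an illustrative reading of the description, not Algorithm 1 from the paper: the inverse-length weight stands in for SLWEIGHT, whose exact form is not given here, sequences shorter than the SL length are counted at their own length in the baseline load, and the warm-up and all-reduce phases are replaced by a `target_load` argument.

```python
import random

def lbsl_select(lengths, target_load, sl_len, gamma, rng):
    """Choose the training length of each sequence on one rank.

    lengths: raw sequence lengths n_u in this rank's batch.
    target_load: global target load (assumed to come from the warm-up phase).
    sl_len: length that truncated sequences are cut to.
    gamma: exponent used to convert a length into compute load.
    """
    # Hypothetical stand-in for SLWEIGHT: shorter sequences weigh more.
    weights = [1.0 / n for n in lengths]
    # Weighted random permutation without replacement: draw an exponential
    # variate with rate w_u per sample and visit samples in ascending order.
    keys = [rng.expovariate(w) for w in weights]
    order = sorted(range(len(lengths)), key=lambda u: keys[u])

    # Baseline: every long sequence truncated to sl_len.
    out = [min(n, sl_len) for n in lengths]
    budget = target_load - sum(n ** gamma for n in out)

    # Greedy fill: keep a sequence at full length if its extra load still fits.
    for u in order:
        extra = lengths[u] ** gamma - out[u] ** gamma
        if 0 < extra <= budget:
            out[u] = lengths[u]
            budget -= extra
    return out
```

A call such as `lbsl_select([100, 4000, 8000, 200, 16000], 400000.0, 1000, 1.5, random.Random(0))` keeps the short sequences untouched and restores a long sequence to full length only while the rank's total load stays within the target.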

1.3.2 Model-System Co-Design for Efficiency

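Design of Semi-Local Attention: the paper proposes a new sparse attention mechanism, Semi-Local Attention (SLA), which makes attention computation linear in sequence length and improves the inference scaling efficiency of ULTRA-HSTU by more than 5x. Recall the attention formulation $\mathbf A = \left(\phi_2\left(\mathbf Q\mathbf K^\top\right)\odot \mathbf M\right) \mathbf V$: vanilla HSTU computes a full causal self-attention mask, which incurs quadratic cost as the sequence length scales:
$\mathbf A(\mathbf X) = \left(\phi_2\left(\mathbf Q(\mathbf X) \mathbf K(\mathbf X)^\top\right) \odot \mathbf M\right) \mathbf V(\mathbf X)$
where:
- $\mathbf M\in \mathbb R^{L\times L}$ is the causal attention mask, with $M_{i,j} = 1$ when $j\le L-i$.
  Here position 1 is assumed to be the most recent moment and position $L$ the earliest.
  - When query $i$ is recent ($i$ is small), the range of keys $j$ the query can attend to is large.
  - When query $i$ is old ($i$ is large), the range of keys $j$ the query can attend to is small.
- $\phi_2(\cdot)$ is the SiLU activation used in HSTU.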
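In large-scale recommendation systems the UIH length quickly grows beyond 10k, which makes full attention impossible to deploy in real ranking scenarios. Inspired by the sparse and dynamic attention patterns inherent in both LLMs ("Native sparse attention: Hardware-aligned and natively trainable sparse attention") and recommendation systems, the paper proposes semi-local attention with a local window of size $K_1$ and a global window of size $K_2$.
- The local window controls the length of the local pattern included in the attention mask.
- The global window focuses on the latest UIH attention patterns and captures the user's long-term interests.
The final attention mask of semi-local attention is defined as follows (see Figure 3):
$M_{i, j} = \begin{cases} 1 & \text{if } L - K_{1} \leq i + j \leq L \\ 1 & \text{if } j \leq K_{2} \text{ and } j \leq L - i \\ 0 & \text{otherwise} \end{cases}$
The attention complexity is $O((K_1 + K_2)\times L)$, so model efficiency improves greatly once $L$ exceeds 10k. Unlike DeepSeek's native sparse attention (NSA) ("Native sparse attention: Hardware-aligned and natively trainable sparse attention"), which uses only a local window, the experiments in this paper show that both the local window and the global window are indispensable. This matters especially in recommendation, where long-term user behavior is critical.
In Figure 3, the lower-left corner is (0, 0):
- $i$ runs along time from large to small (descending).
- $j$ runs along time from large to small (descending).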
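The mask definition can be checked directly by building it for a small $L$. The sketch below applies the two conditions literally with 1-indexed positions and makes no claim about how the paper's kernels represent the mask; `sla_mask` and `nonzeros` are names chosen for the example.

```python
def sla_mask(L, k1, k2):
    """Semi-local attention mask with 1-indexed positions i, j in [1, L].

    M[i][j] = 1 if L - k1 <= i + j <= L      (local band)
           or if j <= k2 and j <= L - i      (global window)
    Row 0 and column 0 are unused padding so indices match the formula.
    """
    M = [[0] * (L + 1) for _ in range(L + 1)]
    for i in range(1, L + 1):
        for j in range(1, L + 1):
            in_local = L - k1 <= i + j <= L
            in_global = j <= k2 and j <= L - i
            M[i][j] = 1 if (in_local or in_global) else 0
    return M

def nonzeros(M):
    """Number of attended (query, key) pairs."""
    return sum(map(sum, M))
```

Each row holds at most $K_1 + 1 + K_2$ ones, so the total number of attended pairs grows linearly in $L$ for fixed windows, while choosing $K_1 = L$ recovers the full causal mask.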
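System optimizations:
- Mixed-Precision Training and Inference: large-scale recommendation models are bottlenecked by both dense computation (General Matrix Multiplications, GEMM) and data movement (especially embedding lookup and host-to-device transfer in serving).
  - To make ULTRA-HSTU efficient end to end, the paper co-designs a mixed-precision framework for recommendation covering 16/8/4-bit formats: most operations stay in BF16 for stability, the core GEMMs are accelerated with FP8, and inference communication traffic is reduced with INT4 embedding quantization. Offline and online experiments both show that this mixed-precision stack gives a 10% training throughput gain and a 40% serving throughput gain while preserving model accuracy.
    The paper develops a custom FP8 stack for HSTU (see Figure 4) that addresses two practical bottlenecks at once:
    - Raising Tensor Core utilization on NVIDIA H100 to increase the measured TFLOP/s of dense computation.
    - Reducing the memory-bandwidth overhead of FP8 quantization/scaling.
    Each HSTU layer contains two GEMMs:
    - A pre-attention projection, which maps the input embeddings $\mathbf X$ to the four tensors $\mathbf U, \mathbf V, \mathbf Q,\mathbf K$.
    - A post-attention projection, which maps the normalized and gated attention output to the layer output.
    Both GEMMs run in FP8 and all other operations stay in BF16, which raises throughput without losing numerical robustness. Simply switching the GEMMs to FP8 is not efficient: naive FP8 pipelines need extra scaling/quantization and layout preparation steps that cancel the expected speedup.
    To make FP8 accelerate training/inference end to end, the paper develops fused kernels that merge the row-wise scaling computation and quantization with the preceding layer-normalization kernels (those in $\mathbf X = \text{Norm}(\mathbf Z)$ and $\mathbf Y=f_2(\text{Norm}(\mathbf A)\odot \mathbf U)$), which removes extra memory passes and lowers the quantization overhead.
    The paper further develops a high-performance Triton FP8 GEMM kernel for the post-attention projection. In this path the projection output is added to the residual tensor ($\mathbf Z = \mathbf Y + \mathbf Z$), so the residual accumulation is fused directly into the end of the GEMM. PyTorch GEMM kernels usually assume a one-dimensional bias vector and cannot support this efficiently. The Triton FP8 kernel natively supports a two-dimensional bias and uses persistent scheduling, the Tensor Memory Accelerator (TMA), warp specialization, and epilogue pipelining to sustain high throughput with moderate register pressure.
    Besides the FP8 GEMMs, the mixed-precision framework adds 4-bit quantization for embedding movement at serving time. See Appendix D.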
- Efficient SLA Kernels for Heterogeneous Hardware: the attention operation is the bottleneck of HSTU. HSTU implements its kernel in Triton ("Triton: an intermediate language and compiler for tiled neural network computations") based on the FlashAttention V2 algorithm ("Flashattention-2: Faster attention with better parallelism and work partitioning"). This paper adopts the FlashAttention V3 algorithm design ("Flashattention-3: Fast and accurate attention with asynchrony and low-precision"), aggressively overlaps data movement with computation, and customizes the kernel for HSTU's non-standard attention (pointwise SiLU activation and SLA masking). The design is implemented on both NVIDIA H100 and AMD MI300x to support heterogeneous serving, and both platforms see a 2x speedup over the FlashAttention V2 baseline.
  - On NVIDIA H100, a family of CUDA kernels is implemented for full and semi-local HSTU attention using FlashAttention V3 style pipelining.
  - On AMD MI300x, a similar kernel is implemented with Composable Kernel (https://github.com/ROCm/composable_kernel). Because MI300x lacks the H100 features that FlashAttention V3 relies on (such as TMA and warp-specialized async execution), the paper introduces MI300x-native optimizations: XCD-aware scheduling to exploit the 8-chiplet topology, LDS layouts that reduce shared-memory bank conflicts, and scheduling barriers that explicitly interleave VMEM/MFMA, for a 2x speedup over the Triton kernel baseline.
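- Memory Saving with Minimal Overhead: standard attention implementations put heavy pressure on GPU memory in the forward pass, which becomes the main bottleneck for ultra-long sequence training. The paper designs the following optimizations to save memory while keeping training efficient:
  - Selective activation rematerialization specific to ULTRA-HSTU: saving six large forward tensors is skipped, and they are rebuilt in the backward pass with minimal recompute. The normed $\mathbf X$ is recovered by reusing the saved layer-norm statistics, the GEMM is rerun to obtain $\mathbf U, \mathbf Q, \mathbf K, \mathbf V$, and the fused gated normalization kernel is rerun for $\mathbf Y$. This is far lighter than generic checkpointing and costs only 5% overhead compared with a baseline without any activation recomputation. See Code 1 in the appendix.
  - Gradient concatenation of $d\mathbf U$, $d\mathbf Q$, $d\mathbf K$, $d\mathbf V$, which reduces memory traffic and kernel overhead in the backward pass.
    Overall, ULTRA-HSTU uses about 67% less memory per layer with no efficiency drop. With embedding size 512, batch size 256, sequence length 3k, and BF16, the technique lowers per-layer HBM usage from 7GB to 2.3GB.
  - End-to-end training on fully jagged tensors, with no padding to dense tensors, which greatly reduces memory usage. Efficiency benchmarks are in Appendix E.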
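The memory argument for jagged tensors can be illustrated with a values-plus-offsets layout, a common way to store variable-length sequences without padding. The sketch below is a generic illustration in plain Python; it is not the paper's tensor format, and the helper names are assumptions.

```python
def to_jagged(seqs):
    """Pack variable-length sequences into flat values plus offsets.

    Sequence k occupies values[offsets[k]:offsets[k + 1]].
    """
    values, offsets = [], [0]
    for s in seqs:
        values.extend(s)
        offsets.append(len(values))
    return values, offsets

def padded_size(seqs):
    """Number of elements a dense, padded batch would hold."""
    return len(seqs) * max(len(s) for s in seqs)
```

For a batch such as `[[1, 2, 3], [4], [5, 6]]`, the jagged form stores 6 elements where a padded batch stores 9; the gap widens when a few very long histories share a batch with many short ones.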

1.3.3 Dynamic Topological Design

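Besides scaling sequence length, depth scaling also matters. Stacking ULTRA-HSTU layers with SLA costs $O(DL)$, where $D$ is the model depth and $L$ the sequence length; when scaling to 10k-length sequences, stacking more layers still incurs a large training, memory, and inference cost even with linear sparse attention. At the same time, large-scale recommendation systems must still process millions of requests within milliseconds. A natural question follows: when stacking more layers, does every layer really need to attend over the full sequence? Motivated by this, the paper proposes two efficient topology designs that further improve the scaling law of ULTRA-HSTU.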
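Attention Truncation: motivated by the importance of the most recent interaction history, the model first runs $N_1$ HSTU layers on the full sequence of length $L$, then selects a segment of length $L^\prime$ from the full sequence and passes the selected UIH segment through another $N_2$ HSTU layers. Options for selecting the UIH segment include:
- 1) Truncating to the most recent UIH of length $L^\prime$.
- 2) Applying Stochastic Length (SL) at the first layer to sample a sequence of length $L^\prime$.
- 3) Deriving a sequence of length $L^\prime$ from the full-sequence output of the first $N_1$ layers.
In practice, simply truncating to the most recent UIH segment works best (see Figure 2(d)).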
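A toy cost model makes the saving from attention truncation concrete. The sketch below counts attention cost only and assumes each SLA layer costs a fixed window size times the sequence length it processes; projections and other per-token work are ignored, and the function name and arguments are invented for the example.

```python
def attention_cost(n_full, n_trunc, L, L_trunc, window):
    """Attention-only cost of n_full layers on length L plus n_trunc layers on L_trunc.

    Each layer is assumed to cost window * sequence_length, i.e. linear
    sparse attention with a fixed number of attended keys per query.
    """
    return window * (n_full * L + n_trunc * L_trunc)
```

Under these assumptions, 3 full-sequence layers followed by 6 layers on a 512-length segment cost `attention_cost(3, 6, 3072, 512, w) / attention_cost(9, 0, 3072, 512, w)`, about 44% of the attention cost of 9 full-sequence layers at length 3072.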
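Mixture of Transducers: recommendation models inherently handle multiple input sequences, because user engagement signals from different sources and of different types are usually logged separately. Aggregating all user signals into a single input sequence and processing it with one unified encoder compresses heterogeneous user interactions into one timeline, dilutes sparse, high-value engagements among dense, implicit signals, and forces all signals to compete for limited sequence capacity.
To address this, the paper proposes the Mixture of Transducers (MoT) paradigm. MoT processes multiple distinct input sequences with independent transducers and then fuses the learned user embeddings. This lets the model capture different types of user behavior over different time spans and produce a finer-grained, more effective representation of diverse and sparse engagement patterns. Crucially, MoT allows compute to be allocated flexibly across input sequences. For example, the model can give high-value sequences deeper layers and greater capacity while reducing the resources spent on well-understood or less critical sequences. This targeted allocation of the compute budget focuses model capacity on the most meaningful user interactions and improves the overall trade-off between recommendation quality and efficiency.
Note that the paper targets generative ranking, which is discriminative. Multiple sequences can therefore be processed in parallel to obtain the learned user embedding.
For generative retrieval, which is generative, there can only be a single sequence, but different transducers would process different segments of that sequence.
Both topology designs give a clearly better model quality and cost trade-off than vanilla HSTU. Attention Truncation and MoT are compatible and can be combined in the same model. In practice, the choice of topology depends on which metric the system cares about most (efficiency or model quality). The experiments below use attention truncation because it is simple and offers a strong trade-off between model quality and efficiency. The MoT study is left to Appendix A.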

1.4 Experimental Results

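Model quality is measured throughout by normalized entropy (NE), defined as the model's cross-entropy divided by the cross-entropy of a predictor that uses only the mean frequency of positive labels ("Practical lessons from predicting clicks on ads at facebook"). Formally:
$\text{NE} = \frac{- \frac{1}{N} \sum_{i = 1}^{N} [y_{i} \log p_{i} + (1 - y_{i}) \log (1 - p_{i})]}{- p \log p - (1 - p) \log (1 - p)}$
where $N$ is the number of samples, $y_i$ is the label of sample $i$, $p_i$ is the predicted probability for sample $i$, and $p=\sum_{i=1}^N y_i/N$ is the frequency of positive samples.
Lower NE means better model quality. The paper measures NE improvements on consumption tasks (e.g. video view complete) and engagement tasks (e.g. share), denoted C-NE and E-NE. NE is chosen to follow the original HSTU paper and internal best practice ("Practical lessons from predicting clicks on ads at facebook"). Experience and experiments show that metrics such as AUC move in the same direction as NE (better or worse) and by a similar magnitude; AUC is not reported for lack of space.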
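The NE definition maps directly to code. The sketch below is a plain-Python illustration of the formula and assumes binary labels, predictions strictly between 0 and 1, and a batch that contains both classes; the function name is chosen for the example.

```python
import math

def normalized_entropy(labels, preds):
    """Cross-entropy of the predictions divided by that of the base-rate predictor."""
    n = len(labels)
    ce = -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(labels, preds)
    ) / n
    base = sum(labels) / n  # frequency of positive labels
    baseline = -(base * math.log(base) + (1 - base) * math.log(1 - base))
    return ce / baseline
```

A model that always predicts the positive rate gets NE equal to 1, so values below 1 measure how much the model improves on that trivial predictor.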
The model is compared against several strong baselines, grouped by their ability to model short-range or long-range user behavior:
- Short-sequence methods: DIN ("Deep interest network for click-through rate prediction"), SASRec ("Self-attentive sequential recommendation").
- Long-sequence methods: vanilla HSTU ("Actions speak louder than words: trillion-parameter sequential transducers for generative recommendations"), STCA ("Make it long, keep it fast: End-to-end 10k-sequence modeling at billion scale on douyin").
- An internally optimized Transformer: extra projections and normalization are added to stabilize training and avoid the unexpected metric regressions of the classic Transformer in recommendation.

1.4.1 Industrial Dataset Benchmark

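Dataset: the paper first reports model quality on an industrial production dataset from an internal large-scale recommendation system. The dataset is a subset of online user interaction histories with more than 6 billion samples in total; each sample is an ultra-long user interaction sequence of 3072 to 16384 events. To keep temporal consistency and avoid leaking future data, the data is split by time: the first 85% is used for training and the remaining 15% for evaluation.
Note: all experiments on the industrial dataset use LBSL. For example, when the raw sequence length at inference is 16384, the training sequence length after LBSL is about 4400. On HSTU, LBSL shows a negligible NE difference compared with training on the full sequence length while training much faster. This is also why the inference FLOP per example in this paper is higher than the training FLOP per example. Table 6 (appendix) compares training and inference sequence lengths in detail.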
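Overall Performance: Table 1 shows the results of all methods with a maximum sequence length of 3072. Model depth and parameters are tuned so that the FLOPs of all methods roughly match. The results show:
- ULTRA-HSTU clearly outperforms all other methods.
- STCA, which relies heavily on cross-attention to reach linear complexity, performs far worse than ULTRA-HSTU because it lacks self-attention.
Note: a positive ΔNE means worse model quality. Empirically, an improvement of 0.03% to 0.05% is considered significant and can bring large gains in online metrics.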
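Scaling Law: to analyze scaling behavior, the input sequence designs of ULTRA-HSTU and vanilla HSTU are fixed, and C-NE and TFLOP are reported. The embedding dimension is fixed at $d=512$, the number of modeling layers varies from 6 to 18, and the sequence length is $L\in \{3072, 8192, 16384\}$. Table 2 reports detailed model quality and TFLOP values. As the sequence length and the number of layers grow, both the efficiency and the C-NE of ULTRA-HSTU improve clearly.
Figure 1 shows the linear regression of quality gain against compute cost for ULTRA-HSTU and vanilla HSTU. Comparing the slopes of the fitted linear functions, the scaling efficiency of ULTRA-HSTU improves dramatically: training scaling efficiency rises 5.3x and inference scaling efficiency rises 21.4x.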
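Since scaling efficiency is defined as the slope of a linear fit of quality gain against compute cost, the comparison between two models reduces to a ratio of two least-squares slopes. The sketch below shows that calculation in plain Python; the function names are chosen for the example, and any input data would have to come from measurements such as those in Table 2.

```python
def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line y = a + b * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def scaling_efficiency_ratio(cost_a, gain_a, cost_b, gain_b):
    """How many times steeper model A's gain-versus-cost line is than model B's."""
    return ols_slope(cost_a, gain_a) / ols_slope(cost_b, gain_b)
```

A ratio of 4, for instance, means that each additional unit of compute buys model A four times the quality gain it buys model B over the fitted range.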

1.4.2 Open-Source Dataset

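ULTRA-HSTU, STCA, and similar methods are designed for industrial recommendation where user histories reach tens of thousands of interactions. To show that the method generalizes beyond ultra-long sequences, the paper evaluates on the public KuaiRand benchmark (sequence length 256, i.e. short sequences). Table 3 shows that even in the short-sequence setting the method achieves the best NE at the lowest training and inference compute cost. STCA adapts poorly to short sequences because of its expensive pre-attention computation.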

1.4.3 Ablation Study on Scaling

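The paper first summarizes the impact of the input sequence optimizations, then fixes the input sequence designs of vanilla HSTU and ULTRA-HSTU and ablates the scaling efficiency of SLA and attention truncation. The scaling-law analysis method is described in Appendix F.
Input sequence optimization: removing the item-action interleaving from the input sequence design halves the sequence length; at UIH length 3072, training FLOP drops by 32.5% and inference FLOP by 63.5%. In addition, the heterogeneous construction of action embeddings brings a 0.45% C-NE gain over the baseline. With a world size of 512, LBSL gives a 15% speedup, which shows its effectiveness in accelerating large-scale sequence model training.
Semi-Local Attention (SLA): Figure 8 plots C-NE against total FLOP with SLA on and off. Models with SLA scale clearly better than vanilla HSTU, with 2.7x the training scaling efficiency and 5.1x the inference scaling efficiency.
For SLA, the local window $K_1$ and the global window $K_2$ are ablated, in contrast to NSA. With $K_1=0$ (global window only), C-NE regresses by 0.03%; with $K_2=0$ (local window only), C-NE regresses by 0.35%.
Dynamic Topological Design: stacking more layers in vanilla HSTU improves quality markedly, but at a high training and inference cost. Figure 8 (right) compares the scaling curve of stacking HSTU layers with Attention Truncation (AT) against purely stacking HSTU layers on full sequences. Setup: the inference sequence length is 3072 (about 1110 in training after SL); the first $n_1$ layers use the full sequence and the next $n_2$ layers use a truncated segment of length 512, with $n_1 \in \{3,6,9,12\}, n_2\in \{0,3,6,9\}$. The longer the sequence, the larger the benefit of attention truncation. With LBSL enabled in training, the efficiency savings of attention truncation are modest at a sequence length of about 1110, but at the inference length of 3072 the benefit is large (inference scaling efficiency improves 3.4x).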

1.4.4 Online A/B Testing

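The paper validates ULTRA-HSTU through several rigorous 30-day online A/B tests on a large-scale production video platform that serves billions of users daily. Three types of online metrics are reported:
- Online consumption metrics (C-metric): watch time, video completion, and so on.
- Online engagement metrics (E-metric): likes, comments, shares, and so on.
- Online top-line metrics (Top-line): visits, daily active users, and so on.
The existing production model is upgraded from vanilla HSTU to ULTRA-HSTU. Table 4 shows significant gains in key metrics. ULTRA-HSTU brings a 4.11% gain in the online consumption metric and 2% to 8% gains in engagement metrics (depending on engagement type). Most notably, the key top-line metrics improve significantly, which is a strong indicator of overall platform health. In Meta's systems:
- Single-digit percentage gains in engagement and consumption are already considered major breakthroughs.
- Gains of 0.05% in Top-line 1 and 0.01% in Top-line 2 are considered highly significant.
Overall, the results strongly support the effectiveness and potential of ULTRA-HSTU. To the authors' knowledge, this is the largest model tested on the recommendation platform and one of the highest-impact launches of recent years.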

1.5 Conclusion

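The paper presents ULTRA-HSTU, an end-to-end model and system co-design that substantially improves the scaling efficiency of sequence modeling for recommendation. Its contributions are:
- Core finding: self-attention still outperforms cross-attention, and scaling up computation in the attention layers and the sequence length keeps improving model quality.
- Core technical innovations: several modeling and system co-optimizations, including LBSL for input processing, semi-local attention, heterogeneous hardware kernel optimization with mixed-precision training/inference, and dynamic model topological design, together achieving 5x training scaling efficiency and 21x inference scaling efficiency.
- Industry experience: ULTRA-HSTU (18 self-attention layers, 16k user sequences, trained on hundreds of H100 GPUs) is deployed in large-scale production with strong results, showing the promise of scaling up sequence models in recommendation and the effectiveness of the proposed innovations.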

2. Appendix

2.1 A. Selection of Topological Designs

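Both Attention Truncation and Mixture of Transducers (MoT) give a clearly better model quality and cost trade-off than vanilla HSTU. The two designs are compatible and can be combined. This section details the strengths of each. In practice, the choice of topological design depends on which metric the system cares about most (efficiency or model quality).
Mixture of Transducers (MoT): MoT brings clear normalized entropy (NE) gains on engagement tasks (see E-NE in Table 8) together with sizable training/inference FLOP savings. By decoupling heterogeneous signals into dedicated modules, MoT eases signal competition, i.e. the competition that arises when diverse input signals are confined to a single module of limited length.
Concretely, two dedicated HSTU modules are used: one for engagement events and one for consumption events, denoted E-seq and C-seq in Table 8. Each module processes a shorter sequence than the single-HSTU setup yet obtains a richer signal representation through careful sequence composition. For example, the dedicated engagement module, despite its shorter sequence, captures a much richer engagement history because it no longer competes with dense consumption signals. Efficiency is improved further by tailoring the compute allocation of each module: shorter sequences make attention lighter and bring clear FLOP savings in both training and inference.
Attention Truncation (AT): Table 9 gives additional data comparing stacks of vanilla HSTU layers with stacks that include attention truncation layers. The observations are:
- As the model gets deeper, vanilla HSTU improves NE clearly on both consumption and engagement tasks, but at the cost of a heavier inference FLOP burden.
- With attention truncation, the trade-off between model quality and compute cost is clearly better. For example, comparing a 3-layer vanilla HSTU followed by 6 attention truncation layers with a plain 6-layer vanilla HSTU, attention truncation saves 3% training FLOP and 38% inference FLOP with equal C-NE and better E-NE.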

2.2 B. Diminishing Return of Depth Scaling in Cross-Attention

Table 5 shows that self-attention is stronger than cross-attention for model depth scaling. At a sequence length of about 3072, cross-attention saturates after 9 stacked layers, whereas self-attention keeps improving steadily as layers are added.

2.3 C. Algorithm of Load-Balanced Stochastic Length

The detailed steps of Load-Balanced Stochastic Length (LBSL) are given in Algorithm 1. Table 6 lists the training and inference sequence lengths for different raw sequence lengths when LBSL is enabled.

2.4 D. Other System Optimizations

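Mixed-precision serving: in model serving, sparse embedding features dominate the host-to-device transfer time for long sequences. The embedding tensors are therefore quantized to INT4 and kept in quantized form along the embedding lookup and transfer path, which reduces the transfer volume and eases the communication bottleneck. In addition, grouped INT4 quantization with group-specific scaling factors is used; compared with a single scale per row it lowers the quality loss considerably while still giving a sizable throughput gain.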
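The benefit of group-specific scales over a single scale per row can be seen in a small quantization sketch. The code below uses symmetric rounding to integer codes in [-8, 7] as an assumption made for illustration; the actual quantization scheme, group size, and storage layout used in serving are not specified here, and the function names are invented for the example.

```python
def quantize_int4_grouped(row, group_size):
    """Symmetric per-group INT4 quantization: returns (codes, scales)."""
    codes, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        amax = max(abs(v) for v in group)
        scale = amax / 7.0 if amax > 0 else 1.0
        scales.append(scale)
        # Clamp to the signed 4-bit range after rounding.
        codes.extend(max(-8, min(7, round(v / scale))) for v in group)
    return codes, scales

def dequantize_int4_grouped(codes, scales, group_size):
    """Map integer codes back to floats using each group's scale."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]
```

When a row mixes small and large values, one scale per row rounds the small values to zero, while per-group scales keep their relative precision; setting `group_size` to the row length reproduces the single-scale case for comparison.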

2.5 E. Efficiency Benchmarks

This section gives benchmark results for the system optimizations described in the main text.

2.5.1 Mixed Precision Benchmarks

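FP8 precision efficiency: the performance impact of FP8 on the matrix multiplications (GEMM) in the pre-attention and post-attention blocks is evaluated, with speedups shown in Table 10. A key difference between the two blocks is the GEMM bias format: the pre-attention GEMM uses a one-dimensional bias and the post-attention GEMM uses a two-dimensional bias. Since Torch already supports FP8 GEMM with a one-dimensional bias, the Torch kernel is used directly for the pre-attention GEMM. In contrast, Torch FP8 GEMM does not natively support the two-dimensional bias needed by post-attention, so a custom Triton FP8 GEMM kernel with native 2D-bias fusion is used; its kernel-level efficiency is shown in Table 7.
Table 7 considers $\mathbf D = \mathbf A\mathbf B + \mathbf C$ with $\mathbf A\in \mathbb R^{m\times k}, \mathbf B\in \mathbb R^{k\times n}, \mathbf C\in \mathbb R^{m\times n}$ over shapes $(m,k,n)$ in which jagged, variable-length sequences determine $m$, which dominates compute cost in practice. The results show that fusing the two-dimensional bias directly into the FP8 GEMM (Bias-Fused FP8) is up to 1.75x faster than Torch FP8 GEMM followed by a separate addition (Bias-Split FP8), which motivated the Triton implementation for the post-attention module.
Finally, Table 10 reports the speedup of the whole pre-attention and post-attention modules when switching from BF16 to FP8. Clear end-to-end gains are observed, from two sources:
- Faster FP8 GEMM kernels (including the Triton kernel with two-dimensional bias designed for post-attention).
- Extra savings from fusing quantization into the kernels that precede the GEMM; compared with running quantization as a separate step, this reduces extra memory traffic and kernel launch overhead.
Int4 quantization efficiency: Table 11 shows the impact of Int4 sparse embedding quantization on model serving efficiency. Applying 4-bit quantization to sparse embeddings lowers host-to-device data-transfer latency by about 40% and raises peak queries per second (QPS) by more than 20%. The difference in online model accuracy after applying 4-bit quantization is negligible.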

2.5.2 Attention Kernel Benchmarks

Attention-kernel efficiency benchmarks (Figure 6) compare the optimized implementation with the FlashAttention-V2 baseline on NVIDIA H100 and AMD MI300 GPUs. Two settings are evaluated: semi-local attention (SLA) and causal attention.
- On H100, for causal attention, the ULTRA implementation sustains more than 520 TFLOP/s at 16K sequence length, a 1.64x speedup over the baseline. For SLA, the kernel consistently achieves higher throughput across batch sizes and sequence lengths, with up to a 2.5x speedup.
- On MI300, Figure 6 reports forward-pass kernel performance. The ULTRA kernel is up to 1.51x faster than the FlashAttention-V2-based implementation. Relative to ULTRA on H100, ULTRA on MI300 reaches a throughput ratio of up to 0.92x at small batch sizes and 16K sequence length. These results highlight the targeted optimizations made to provide efficient AMD inference for large-scale models.

2.6 F. Scaling Laws for ULTRA-HSTU

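This section analyzes the compute scaling laws of the proposed method. Normalized entropy (NE) is assumed to follow a power law in compute:
$L(C) = \alpha C^{-\beta}$
This form assumes that as the computational budget $C\rightarrow \infty$, the NE $L(C)\rightarrow 0$. In general this leads to a systematic underestimate of the exponent of the true scaling law, by the factor:
$\hat\beta = \beta^{*} \left(1 - \frac{L_{\infty}}{L}\right)$
where $\hat\beta$ is the estimated exponent, $\beta^*$ is the true exponent, and $L_\infty$ is the irreducible error. The assumption is adopted because the fit of $\alpha$ and $\beta$ stays linear in log-log space and no irreducible error term has to be estimated. As shown below, it also guarantees that the estimate of the scaling improvement is conservative.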
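The underestimation factor can be derived in one step, under the assumption (not spelled out in these notes) that the true law has an additive irreducible term, $L(C) = L_\infty + \alpha C^{-\beta^*}$. The exponent estimated by a log-log fit is the local slope of $\log L$ against $\log C$:
$\hat\beta = -\frac{d \log L}{d \log C} = \frac{\beta^{*} \alpha C^{-\beta^{*}}}{L} = \beta^{*}\, \frac{L - L_\infty}{L} = \beta^{*} \left(1 - \frac{L_\infty}{L}\right)$
Since $0 \lt L_\infty \lt L$, the factor lies strictly between 0 and 1, so the fitted exponent is always smaller than the true one, and the gap shrinks as the loss moves further above the irreducible error.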
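Consider the estimated scaling ratio between two models:
$\frac{\hat\beta_{1}}{\hat\beta_{2}} = \frac{\beta_{1}^{*} \left(1 - \frac{L_{\infty}}{L_{1}}\right)}{\beta_{2}^{*} \left(1 - \frac{L_{\infty}}{L_{2}}\right)} = \frac{\beta_{1}^{*}}{\beta_{2}^{*}} R$
where $R$ is the correction factor of the scaling ratio estimate.
The scaling ratio ${\hat \beta_1}/{\hat \beta_2}$ expresses how much steeper the scaling curve of model 1 is than that of model 2. When $L_1\lt L_2$, we have $R \lt 1$. So when model 1 scales better and also has the lower loss (which always holds in the compute range examined here), the estimate of the scaling ratio is conservative.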
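Interpreting scaling law exponent improvements: improvements in scaling law exponents may look small at first, but their effect grows rapidly with the compute budget. As an example, take two models with scaling laws:
$L_{1}(C) = \alpha C^{-\beta_{1}}, \quad L_{2}(C) = \alpha C^{-\beta_{2}}$
where $\beta_1 = k\beta_2$ with improvement factor $k\gt 1$.
For model 2 to reach the loss that model 1 achieves with compute $C$, it needs:
$C_{2} = C^{k}$
This means the compute required by the baseline grows polynomially in $C$, with a degree equal to the scaling ratio $k$. For example, a 2x improvement in the scaling law exponent means the baseline model needs quadratically more compute to match the improved model.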
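The relation $C_2 = C^k$ can be checked numerically. The sketch below assumes both models share the same prefactor $\alpha$ and that compute is expressed in normalized, dimensionless units greater than 1; the function names and the example numbers are illustrative only.

```python
def power_law_loss(C, alpha, beta):
    """Loss predicted by L(C) = alpha * C ** (-beta)."""
    return alpha * C ** (-beta)

def matching_compute(C, k):
    """Compute the baseline needs to match a model whose exponent is k times larger."""
    return C ** k
```

With `k = 2` and a budget of 100 units, the baseline needs 10000 units: `power_law_loss(100, 2.0, 0.2)` and `power_law_loss(10000, 2.0, 0.1)` give the same loss.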
Overall ULTRA-HSTU scaling performance: Figure 7 plots the fitted compute scaling laws of ULTRA-HSTU and HSTU against training FLOP and inference FLOP. Compared with HSTU, ULTRA-HSTU improves the scaling law exponent by 2.08x for training FLOP and by 4.59x for inference FLOP.
Semi-Local attention scaling performance: next, the contribution of Semi-Local Attention (SLA) to the overall scaling improvement is analyzed in isolation. Figure 8 shows how SLA scales with training FLOP and inference FLOP. The method improves the scaling law exponent by 1.39x for training FLOP and by 1.69x for inference FLOP.
Attention truncation scaling performance: finally, the scaling behavior of attention truncation is analyzed. Although the improvement in the scaling law exponent for training FLOP is small, Figure 8 shows a 1.8x improvement in the scaling law exponent for inference FLOP.