2025_Climber

1. Climber [2025]

"Climber: Toward Efficient Scaling Laws for Large Recommendation Models"

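Transformer-based generative models have achieved remarkable success across many domains and exhibit a variety of scaling-law behaviors. Our extensive experiments show, however, that applying Transformers to recommendation systems still faces persistent challenges:
- (1): The Transformer architecture is a poor fit for recommendation-specific characteristics such as multi-source data heterogeneity, so its scaling behavior remains unsatisfactory as compute grows;
- (2): Online inference latency is tightly constrained (tens of milliseconds), and the constraint tightens further as user behavior sequences lengthen and compute demand grows.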
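To address this, we propose Climber, an efficient recommendation framework with two synergistic components: a model architecture that supports efficient scaling, and co-designed acceleration techniques. The model rests on two core innovations:
- (1): Multi-scale sequence extraction, which reduces time complexity by a constant factor and enables efficient scaling of sequence length.
  In practice this just splits the user behavior sequence into typed sub-sequences such as a "click sequence" and a "share sequence". There is no real technical depth here.
- (2): Dynamic temperature modulation, which adapts attention distributions to multi-scenario and multi-behavior patterns.
  Put simply, the position embedding and the temperature coefficient are configured separately per scenario and per sub-sequence. Again, no real technical depth.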
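Aided by the acceleration techniques, Climber achieves a 5.15x throughput gain without performance degradation by using "single user, multiple item" batched processing and memory-efficient Key-Value caching.
Comprehensive offline experiments on multiple datasets confirm that Climber exhibits a more favorable scaling curve. To our knowledge, this is the first publicly reported framework in which controlled model scaling drives continuous growth of an online metric (12.19% overall) without prohibitive resource costs. Climber has been deployed on Netease Cloud Music, one of China's largest music streaming platforms, where it serves tens of millions of users daily.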
Scaling laws were first explored in language models ("Training compute-optimal large language models", "Scaling laws for neural language models"), where they establish a predictable relationship between model performance and key factors such as model size and training data volume. For example, "Scaling laws for neural language models" shows that Transformer-based language models ("Attention is all you need") follow power-law improvements in perplexity as model parameters and token counts increase. Similar trends appear in vision models ("Reproducible scaling laws for contrastive language-image learning", "Scaling vision transformers") and multimodal models ("Qwen-vl: A frontier large vision-language model with versatile abilities", "Llama: Open and efficient foundation language models"): the scale of the model and the diversity of the data correlate directly with downstream task performance.
Generative recommendation has emerged as the most promising new paradigm for achieving scaling laws in recommendation systems. We believe its practical implementation has to proceed in stages; the core goal of the current stage is to adapt the Transformer architecture to recommendation systems so that effective scaling laws can be established.
- Recent studies ("Understanding scaling laws for recommendation models", "Scaling New Frontiers: Insights into Large Recommendation Models") have validated scaling laws in recommendation systems and offer useful guidance for model design and resource allocation.
- The HSTU model ("Actions speak louder than words: Trillion-parameter sequential transducers for generative recommendations") models long-term user behavior sequences with hierarchical self-attention and outperforms the conventional Transformer.
- Similarly, the MARM model ("MARM: Unlocking the Future of Recommendation Systems through Memory Augmentation and Scalable Complexity") introduces memory augmentation to reduce computational complexity, achieving multi-layer sequence modeling at low inference cost.
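These methods, however, do not fully resolve the inherent incompatibility between the Transformer architecture and recommendation-specific features. Scaling the model by adding resources remains possible, but it is inefficient for real-world industrial deployment. Moreover, in conventional recommendation systems the interplay among key scaling factors such as sequence length, model depth, and heterogeneous user behaviors is underexplored, which leads to poor resource allocation and diminishing returns from scaling.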
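Inspired by the DeepSeek series ("Deepseek llm: Scaling open-source language models with longtermism", "Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model", "Deepseek-v3 technical report"), which markedly improved the development efficiency of large language models (LLMs) and lowered compute costs, we ask: how can a recommendation model be scaled up efficiently at lower cost? To gain insight, we conduct an industrial-scale analysis of two mainstream models, the Deep Learning Recommendation Model (DLRM; "Software-hardware co-design for fast and scalable training of deep learning recommendation models") and the Transformer. Figure 1(a) shows the scaling curves of DLRM* and the Transformer, together with an oracle curve that represents better scaling behavior: a higher starting point and a steeper slope. Figure 1(b) shows AUC curves for different combinations of sequence length and layer count, and introduces the notion of a "performance interval", the range of AUC variation at equivalent floating-point operations (FLOPs). Our study nevertheless finds the following problems when Transformers are applied to recommendation systems: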
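- Performance degradation of Transformers under FLOPs constraints: as Figure 1(a) shows, the two scaling curves cross at about $10^{8.2}$ FLOPs. With that value as the boundary, the comparison between DLRM and the Transformer goes in opposite directions:
  - Above $10^{8.2}$ FLOPs, the Transformer outperforms conventional architectures such as DLRM.
  - Below $10^{8.2}$ FLOPs, the Transformer underperforms DLRM.
  This highlights the need for a more efficient model. The oracle curve in Figure 1(a) is such a more efficient scaling curve: its intercept and slope both exceed those of the other curves, so the model still performs better when FLOPs are limited.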
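- Incompatibility between the Transformer and recommendation-specific features: unlike the continuous syntactic sequences of natural language processing (NLP), recommendation systems handle fragmented user behaviors spread across multiple scenarios. Because the Transformer struggles to prioritize relevant behaviors among sparse multi-source patterns, its attention distributions become disordered.
  In addition, multi-scenario recommendation faces distributional discrepancies: users behave differently in different scenarios, yet existing methods treat the scenario as an auxiliary feature rather than an explicit distribution controller. This incompatibility makes the Transformer less efficient than specialized architectures such as DLRM when compute is constrained.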
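- Effect of factor combinations on model performance at equivalent FLOPs: in recommendation systems, factors such as sequence length and layer number strongly affect FLOPs, and different factor combinations at the same FLOPs yield different performance. As Figure 1(b) shows, at $10^9$ FLOPs the performance interval across combinations is close to 1%. Current research lacks a comprehensive analysis of how factor combinations affect recommendation model performance, which hinders efficient scaling.
These challenges show that the Transformer has an efficiency problem in recommendation systems, and that the problem worsens when features and model capacity are scaled to handle longer sequences, leading to quadratic compute demand and tighter latency constraints ("TWIN: TWo-stage interest network for lifelong user behavior modeling in CTR prediction at kuaishou", "Twin v2: Scaling ultra-long user behavior sequence modeling for enhanced ctr prediction at kuaishou").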
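Building on these insights, we propose Climber, a new framework that rethinks the scaling paradigm for recommendation systems. At its core it combines two complementary innovations: a Transformer-based model architecture designed for recommendation scenarios, and co-designed acceleration techniques.
- The model introduces multi-scale sequence extraction, which changes how a recommendation system processes user behaviors by decomposing user behavior sequences into smaller, fine-grained sub-sequences. This reduces computational complexity and also models user interests in different scenarios more precisely. The model further incorporates dynamic temperature modulation, which adaptively adjusts attention scores to reflect the differing importance of behaviors and scenarios.
- On the engineering side, we introduce a unified acceleration technique that converts the traditional "single user, single item" sample organization into a "single user, multiple items" form that matches real online requests.
With these techniques and an encoder-level KV cache, forward propagation becomes markedly more efficient during both training and inference. Finally, we study Climber's scalability and the effect of factor combinations on AUC at equivalent FLOPs, which yields new insight into sensible resource allocation and into the key factors for scaling a model quickly.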
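The main contributions of the paper are:
- An industrial-scale study of scaling laws in recommendation systems that explicitly quantifies the effect of factor combinations at equivalent FLOPs. The analysis shows that balanced scaling (alternately expanding sequence length and model depth) improves both offline and online metrics.
- A new Transformer variant, Climber, which resolves the scaling dilemma in recommendation systems through multi-scale extraction and adaptive temperature modulation. To our knowledge, it is the first method to achieve sustainable scaling: a 12.19% online metric gain, the largest annual improvement in our production system.
- Unified acceleration techniques that improve training and inference efficiency through "single user, multiple items" batched processing and a block-parallel KV cache. After deployment on Netease Cloud Music, these techniques deliver a 5.15x training speedup and online inference 14.38x faster than DLRM, which allows 100x model scaling without additional compute.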
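The paper is poorly written overall:
- First, the points it makes amount to a few small engineering tricks. These are also the only parts worth borrowing for a real recommendation system.
- Second, many technical details are left out; the writing is too rough.
- Third, the experiments are not rigorous and are not very convincing.
- Finally, the conclusion talks about generative recommendation, while the body of the paper describes discriminative recommendation.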

1.1 Related Work

Wukong ("Wukong: Towards a Scaling Law for Large-Scale Recommendation") explores parameter scaling in retrieval models but relies on strong assumptions about feature engineering.
This is wrong: Wukong studies the scaling of ranking models.
HSTU ("Actions speak louder than words: Trillion-parameter sequential transducers for generative recommendations") reformulates recommendation as a sequential transduction task and, through hierarchical attention and stochastic sequence sampling, reaches trillion-parameter models with linear computational scaling. HSTU focuses on generative modeling, however, and falls short in bridging to traditional feature-based DLRMs.
Meanwhile, MARM ("MARM: Unlocking the Future of Recommendation Systems through Memory Augmentation and Scalable Complexity") proposes caching intermediate attention results, which reduces inference complexity from $O(n^2d)$ to $O(nd)$ and validates cache size as a new scaling dimension. MARM's caching strategy assumes static user patterns, though, and ignores changes in real-time behavior.
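Techniques for reducing compute cost are widely used.
- In NLP, KV caching avoids redundant attention computation during autoregressive inference.
- MARM adapts this idea to recommendation systems: by storing historical attention outputs, it achieves multi-layer target-attention with very little FLOPs overhead.
- Similarly, HSTU introduces Stochastic Length, which algorithmically sparsifies long sequences and cuts training cost by 80% without degrading quality.
- "Scaling Laws for Online Advertisement Retrieval" proposes $R/R^*$, an eCPM-aware offline metric for estimating online revenue scaling laws at low cost.
Together these studies underline the importance of efficiency strategies tailored to recommendation-specific constraints, such as high-cardinality features and millisecond-level latency requirements.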

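1.2 Method

To scale efficiently, we propose a Transformer variant designed for recommendation systems. It supports scaling along three key dimensions: multi-scale sequence processing, multi-scenario adaptation, and multi-interest modeling. We also describe how the model is deployed.

1.2.1 Model Architecture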

Overall: To address the computational complexity and scaling challenges of recommendation systems, we design the model from the perspective of the recommendation scenario. The model builds recommendation characteristics into the Transformer architecture and offers resource-aware scalability. It scales up along three axes: multi-scale sequence, multi-scenario, and multi-interest. It contains three modules: Multi-scale Sequence Extraction (MSE), the Adaptive Transformer Layer (ATL), and Bit-wise Gating Fusion (BGF). Specifically:
- MSE generates multi-scale sequences from the user lifecycle sequence; these represent different types of sub-sequences.
- Each sub-sequence is processed by its own block of stacked ATLs to extract interest.
- We also extend the time span of important sub-sequences to cover the user's entire lifecycle.
- ATL uses an adaptive temperature coefficient to adjust the attention distribution across scenarios.
- Finally, BGF aggregates the representations from the adaptive Transformer blocks through a bit-wise gating mechanism and produces a unified output, achieving multi-interest fusion across the multi-scale sequences. Figure 2 shows the detailed workflow.
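Multi-Scale Sequence Scale-Up: We propose multi-scale sequence extraction (MSE) for multi-scale sequence scale-up. The method reorganizes user sequences according to different strategies and can be written as:
$\begin{matrix} \mathcal S = \{x_{1}, x_{2}, \dots, x_{n_{s}}\} \\ \mathcal S_{k} = \text{MSE}(\mathcal S, a_{k}) = \{x_{1}^{a_{k}}, x_{2}^{a_{k}}, \dots, x_{n_{k}}^{a_{k}}\} \end{matrix}$
where:
- $\mathcal S$ is the user lifecycle sequence, $x_i\in \mathcal X$ is the item ID of the $i$-th item, and $\mathcal X$ is the entire item set.
- $a_k$ is the $k$-th extraction strategy, $\mathcal S_k$ is the sub-sequence extracted from the user lifecycle sequence $\mathcal S$ under $a_k$, and $x_j^{a_k}$ is its $j$-th item.
- $n_s$ and $n_k$ are the sequence lengths of $\mathcal S$ and $\mathcal S_k$.
With $N_b$ extraction strategies, the total sub-sequence length is $\sum_{k=1}^{N_b} n_k = n$, and typically $n \ll n_s$, because an extraction strategy usually keeps only the user's positive behaviors. The Transformer's time complexity therefore drops from $O\left(n_s^2d\right)$ to $O\left(n^2d\right)$.
Each $\mathcal S_k$ is processed by its own Transformer block at cost $O\left(n_k^2d\right)$. Assume every extraction strategy yields the same length, $n_k = n / N_b$.
- If the blocks run serially, the total cost is $O\left(n^2d / N_b\right)$; even with $N_b=2$ this gives a significant training acceleration.
- If the blocks run in parallel, the cost is $O\left(\max(n_k)^2d\right)$, so training stays efficient as $n_k$ grows, which supports progressive multi-scale sequence scale-up.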
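Our extraction strategies include business-driven sequences (such as click/like/share), model-filtered sequences, and others. In short, MSE reduces computational complexity and converts the user lifecycle sequence into multi-scale sequences, improving the efficiency and scalability of the recommendation system.
Put simply, the raw user behavior sequence is split into a click sequence, a purchase sequence, a like sequence, and so on.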
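The splitting step and its effect on attention cost can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the toy event log, and the behavior-equality predicates are all hypothetical, and the cost function keeps only the $n^2 d$ term.

```python
def multi_scale_extract(lifecycle, strategies):
    """Split one lifecycle sequence into sub-sequences, one per strategy.

    lifecycle: list of (item_id, behavior) tuples in chronological order.
    strategies: dict mapping a strategy name to a predicate over behavior.
    """
    subsequences = {name: [] for name in strategies}
    for item_id, behavior in lifecycle:
        for name, keep in strategies.items():
            if keep(behavior):
                subsequences[name].append(item_id)
    return subsequences


def attention_cost(length, d):
    """Self-attention cost over one sequence, up to a constant: n^2 * d."""
    return length * length * d


# Hypothetical toy log: most raw events are impressions that no strategy keeps.
lifecycle = [(i, "impression") for i in range(12)]
lifecycle += [(100, "click"), (101, "like"), (102, "click"),
              (103, "share"), (104, "like"), (105, "click")]

# Hypothetical business-driven strategies, one per positive behavior type.
strategies = {
    "click": lambda b: b == "click",
    "like": lambda b: b == "like",
    "share": lambda b: b == "share",
}

subs = multi_scale_extract(lifecycle, strategies)
d = 8
full_cost = attention_cost(len(lifecycle), d)              # one long sequence
split_cost = sum(attention_cost(len(s), d) for s in subs.values())
print(subs, full_cost, split_cost)
```

The saving in this toy case comes from two sources that the text distinguishes: dropping non-positive events ($n \ll n_s$) and running attention per sub-sequence rather than over the concatenation.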
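Multi-Scenario Scale-Up: The softmax activation normalizes attention scores. In Transformer attention, $\frac{\mathbf Q \mathbf K^\top}{\sqrt{d_k}}$ is passed through softmax and multiplied by $\mathbf V$, where $\sqrt{d_k}$ acts as a temperature that keeps the variance of the attention matrix consistent with that of $\mathbf Q$ and $\mathbf K$ ("Distilling the Knowledge in a Neural Network"). In our setting, however, multi-scale sequences are generated from the user lifecycle sequence under different extraction strategies, and each sequence has its own Transformer block. A single $\sqrt{d_k}$ cannot meet the varied needs of all the Transformer blocks that handle multi-scale sequences across scenarios ("Determining the optimal temperature parameter for Softmax function in reinforcement learning").
To further refine the attention distribution inside each Transformer block, we introduce an adaptive temperature coefficient for every layer of every block. We call this the Adaptive Transformer Layer (ATL), defined as:
$\begin{matrix} \mathbf Q, \mathbf K, \mathbf V = f_{QKV}(X(\mathcal S_{k})) \\ R(\mathcal S_{k}) = \mathbf Q \mathbf K^{\top} + f_{b}^{p,t}(a_{k}, r) \\ A(\mathcal S_{k}) = \text{Softmax}\left(\frac{R(\mathcal S_{k})}{f_{tc}(a_{k}, r)}\right) \\ Y(\mathcal S_{k}) = f_\text{FFN}(A(\mathcal S_{k}) \mathbf V) \end{matrix}$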
where:
- $X(\mathcal S_k)\in \mathbb R^{s\times d}$ is the layer input, $s$ is the sequence length, and $d$ is the feature dimension.
- $f_{QKV}(X(\mathcal S_k))$ derives the query, key, and value matrices from the input $X(\mathcal S_k)$.
- $R(\mathcal S_k) \in \mathbb R^{h\times s\times s}$ is the raw attention matrix, and $h$ is the number of heads.
- $f_b^{p,t}(a_k, r)$ is the relative attention bias that combines position ($p$) and time ($t$) information ("Exploring the limits of transfer learning with a unified text-to-text transformer", "Actions speak louder than words: Trillion-parameter sequential transducers for generative recommendations"), where $r$ denotes the recommendation scenario and $a_k$ the extraction strategy.
- $A(\mathcal S_k)\in \mathbb R^{h\times s\times s}$ is the normalized attention matrix.
- $f_{tc}(a_k, r)$ is the function that produces the temperature coefficient.
  What form does $f_{tc}(a_k, r)$ take? The paper does not say. My guess is that it is simply a lookup function, similar to an embedding lookup.
- $Y(\mathcal S_k)\in \mathbb R^{s\times d}$ is the layer output.
- $f_\text{FFN}(\cdot)$ is the feed-forward network (FFN).
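Compared with a conventional Transformer layer, we introduce an adaptive temperature coefficient and adjust the relative attention bias from the perspective of the recommendation scenario: both the extraction strategy $a_k$ and the recommendation scenario $r$ affect the relative attention bias and the temperature coefficient.
The approach is motivated by the multi-scenario and multi-behavior nature of recommendation systems. In contrast to the fixed temperature coefficient in HSTU ("Actions speak louder than words: Trillion-parameter sequential transducers for generative recommendations"), adaptive attention weighting addresses the limitation of a fixed $\sqrt{d_k}$ in capturing the inherent differences among diverse behaviors and scenarios.
Put simply, the position embedding and the temperature coefficient are configured separately per scenario and per sub-sequence.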
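The temperature-scaled softmax in the ATL equations can be sketched for a single query row. The paper does not specify the form of $f_{tc}$, so the table lookup below follows the guess made in these notes and is an assumption; the scenario names and temperature values are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed form of f_tc: one temperature per (strategy, scenario) pair.
# In a trained model these would presumably be learned values.
TEMPERATURE = {
    ("click", "home_feed"): 4.0,
    ("click", "daily_mix"): 1.0,
}


def adaptive_attention_row(scores, bias, strategy, scenario):
    """Softmax((q k^T + bias) / f_tc(a_k, r)) for one query row."""
    temperature = TEMPERATURE[(strategy, scenario)]
    return softmax([(s + b) / temperature for s, b in zip(scores, bias)])


scores = [2.0, 1.0, 0.0]   # raw q.k scores against three history items
bias = [0.0, 0.0, 0.0]     # relative attention bias, zeroed for clarity
flat = adaptive_attention_row(scores, bias, "click", "home_feed")
sharp = adaptive_attention_row(scores, bias, "click", "daily_mix")
print(flat, sharp)
```

A larger temperature flattens the distribution and a smaller one sharpens it, which is the only degree of freedom the per-scenario coefficient adds in this sketch.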
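Multi-Interest Scale-Up: Once the user lifecycle sequence is split into multi-scale sequences, however, each block sees only its own sub-sequence. A bit-wise gating fusion module therefore combines the information that the $N_b$ blocks extract from the $N_b$ sub-sequences, achieving multi-interest scale-up. Specifically:
- Each sub-sequence passes through its adaptive Transformer block to produce an output vector $E(\mathcal S_k)\in \mathbb R^d$.
- The $N_b$ output vectors are concatenated into $E(\mathcal S) \in \mathbb{R}^{N_b \times d}$.
- The concatenated vectors are then processed by a new ATL, followed by a sigmoid activation that implements bit-level gating.
- Finally, the vectors pass through the subsequent network to produce the final output score.
The bit-wise gating fusion module can be written as:
$\begin{matrix} E(\mathcal S) = \{E(\mathcal S_{1}), E(\mathcal S_{2}), \dots, E(\mathcal S_{N_{b}})\} \in \mathbb R^{N_{b} \times d} \\ G(\mathcal S) = \text{ATL}(E(\mathcal S)) \in \mathbb R^{N_{b} \times d} \\ Y(\mathcal S) = G(\mathcal S) \odot \sigma(f_\text{gate}(G(\mathcal S))) \in \mathbb R^{N_{b} \times d} \end{matrix}$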
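where:
- $\sigma(\cdot)$ is the sigmoid activation.
- $f_\text{gate}(\cdot)$ is a squeeze-and-excitation module ("Squeeze-and-excitation networks") that keeps the input and output dimensions the same: for $G(\mathcal S) \in \mathbb{R}^{N_b \times d}$, $f_\text{gate}(G(\mathcal S)) \in \mathbb{R}^{N_b \times d}$ is the bit-wise attention matrix.
- $Y(\mathcal S)$ is the output of the fusion module.
The Adaptive Transformer Layer (ATL) from Multi-Scenario Scale-Up is reused here to compute similarity between blocks, so that the interests carried by different sub-sequences can interact. Unlike that layer, the ATL in the fusion function has no relative attention bias, and its temperature coefficient is determined by the recommendation scenario $r$.
The attention over the $N_b$ blocks inside this ATL can be viewed as field-wise interaction, which promotes information exchange at the feature level. Our method strengthens multi-interest fusion by adding bit-wise interaction, which lets the model capture precise relationships between sequences and markedly improves its ability to understand multi-scale sequences. The attention has complexity $O\left(N_b^2d\right)$; but in the fusion step the number of extraction strategies $N_b$ is far smaller than the feature vector dimension $d$, so the computational cost of the bit-wise gating fusion module is relatively low.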
How is $Y(\mathcal S)$ used, and what is the loss function? The authors explain neither.
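The gating line $Y = G \odot \sigma(f_\text{gate}(G))$ can be sketched element by element. The sketch skips the ATL step and takes $G$ as given, and `toy_gate_fn` is a hypothetical stand-in for $f_\text{gate}$: the paper names a squeeze-and-excitation module, whose exact configuration is not described, so a fixed element-wise scaling is used here purely to show the shapes and the gating operation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bitwise_gate(G, gate_fn):
    """Y = G * sigmoid(gate_fn(G)), applied element by element.

    G: list of N_b rows, each a list of d floats.
    gate_fn: maps G to a matrix of gate logits with the same shape.
    """
    logits = gate_fn(G)
    return [[g * sigmoid(z) for g, z in zip(g_row, z_row)]
            for g_row, z_row in zip(G, logits)]


def toy_gate_fn(G):
    # Hypothetical stand-in for f_gate (NOT squeeze-and-excitation):
    # a fixed element-wise scaling that preserves the N_b x d shape.
    return [[2.0 * g for g in row] for row in G]


# N_b = 2 blocks, d = 3 features per block.
G = [[1.0, -1.0, 0.0],
     [0.5, 2.0, -3.0]]
Y = bitwise_gate(G, toy_gate_fn)
print(Y)
```

Because the gate lies in (0, 1) for every individual element, each feature of each block is scaled independently, which is what distinguishes bit-wise gating from a single weight per block.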

1.2.2 Deployment

Our accelerated deployment has two stages: offline training and online serving.
- In offline training, user interaction logs are recorded in a "single user, single item" pattern; these raw logs are then compressed and archived in a "single user, multiple items" pattern, where:
  - "single user, single item" denotes a log entry that records one atomic interaction (such as a click or purchase) between a user and an item.
  - "single user, multiple items" aggregates a user's historical interactions into one record containing multiple items, which allows batch processing of user-item feature computation. This pattern uses full-visible masks between each candidate item and the entire history, and diagonal masks among candidate items to achieve inter-item isolation. The compression sharply reduces the number of samples and gives a 5.15x speedup in model training.
- In online serving, inspired by M-FALCON, the system first generates multi-layered key-value (KV) cache vectors from user features, then fetches candidate item features from the feature server, and finally computes attention-based interactions between the item features and the cached KV representations.
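The masking rule for the "single user, multiple items" pattern can be made concrete with a small sketch. The layout (history positions first, candidates after) and the choice to let history positions see the whole history are assumptions; the source only states that candidates see the full history and are isolated from one another.

```python
def build_mask(history_len, num_candidates):
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed layout: positions [0, history_len) hold the user history and
    positions [history_len, history_len + num_candidates) hold candidates.
    """
    total = history_len + num_candidates
    mask = []
    for i in range(total):
        row = []
        for j in range(total):
            sees_history = j < history_len   # full-visible over the history
            sees_self = i == j               # diagonal among candidates
            row.append(sees_history or sees_self)
        mask.append(row)
    return mask


mask = build_mask(history_len=3, num_candidates=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Since no history position attends to a candidate, the history representations do not depend on which candidates are in the request, which is what makes them cacheable at serving time.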
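Notably, the model uses the "single user, multiple items" data pattern in both offline training and online serving, and therefore uses the KV cache to accelerate user feature computation. We also implement operator fusion, merging consecutive operations (such as embedding lookup and attention layers) into unified computational kernels to cut frequent global memory accesses, and integrate FlashAttention's matrix tiling to improve memory utilization ("Flashattention: Fast and memory-efficient exact attention with io-awareness"). These designs raise computational efficiency through cache utilization while preserving prediction accuracy, and thereby deliver personalized recommendations that improve user satisfaction.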
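The idea of computing the user-side keys and values once and reusing them for every candidate can be sketched with single-head, single-layer attention. This is a simplification made for illustration: the paper describes a multi-layer cache, and the class name, projection matrices, and toy vectors below are all hypothetical.

```python
import math


def project(vectors, weight):
    """Multiply each row vector by a (d_in x d_out) weight matrix."""
    return [[sum(v[i] * weight[i][j] for i in range(len(v)))
             for j in range(len(weight[0]))] for v in vectors]


def attend(query, keys, values):
    """Single-head dot-product attention for one query vector."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]


class UserKVCache:
    """Holds the history's keys and values so they are projected only once."""

    def __init__(self, history, w_k, w_v):
        self.keys = project(history, w_k)
        self.values = project(history, w_v)

    def score(self, candidate):
        return attend(candidate, self.keys, self.values)


history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three history item vectors
w_k = [[1.0, 0.0], [0.0, 1.0]]
w_v = [[0.5, 0.0], [0.0, 2.0]]
candidates = [[1.0, 0.0], [0.0, 1.0]]

cache = UserKVCache(history, w_k, w_v)            # one projection per request
cached = [cache.score(c) for c in candidates]
# Baseline: re-project the history for every candidate.
recomputed = [attend(c, project(history, w_k), project(history, w_v))
              for c in candidates]
print(cached == recomputed)
```

The outputs match because the history projections do not depend on the candidate; the cache only removes the repeated work, which grows with the number of candidates per request.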

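1.3 Experiments

This section describes offline and online experiments on real industrial data that evaluate the proposed method and answer four research questions:
- RQ1: How does Climber perform in offline evaluation compared with SOTA models?
- RQ2: How does Climber show better scalability than DLRM and the Transformer?
- RQ3: Given the effect of different factor combinations on AUC at equivalent FLOPs, how should resources be allocated to scale up the model?
- RQ4: How does Climber perform in an industrial system?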
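Datasets: To validate the method in recommendation systems, we build a dataset whose main feature is the real user behavior sequence, and use it to predict a user's different behaviors toward candidate items from historical interactions. The important user behaviors include full-play, like, share, and comment. We also evaluate the model on three recommendation datasets (Spotify, 30Music, and Amazon-Book). Table 1 lists the number of users, items, and interactions for the four processed datasets. To protect data privacy we report statistics for the recommendation scenario only and apply special processing to the industrial dataset, so the volumes in the table are lower than the actual ones. Even so, the industrial dataset is clearly far larger than the others, and this volume gives the scaling experiments a solid foundation.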
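Baseline methods:
- DLRM: uses lifelong user behavior sequences and complex feature interactions, and is deployed in our online system. The sequence length is fixed at 2000 in the experiments.
- DIN: models the interactions between a user's historical behaviors and the target item through target attention in order to capture user interests. The sequence length is set to 1000.
- TWIN: aligns the General Search Unit (GSU) and the Exact Search Unit (ESU) to make long-term user behavior modeling more consistent. The GSU and ESU sequence lengths are set to 2000 and 1000 respectively.
- Transformer: a Transformer-based sequence model. A one-stage Transformer encoder processes behavior sequences of fixed length 2000.
- HSTU: reorganizes item-action pairs chronologically, which turns the feature-level sequence into a time series, and processes it with the HSTU model to predict target item-specific user actions. The sequence length is fixed at 2000.
- Climber series: the earlier methods restrict the sequence to a fixed time window (window size 2000) and apply no behavior filtering. The new method uses multi-scale sequence extraction with business-logic-driven strategies to extend behavior sequences over the whole user lifecycle while cutting the sequence length to 200. The model keeps 2 layers, matching the baseline configuration. In the industrial setting, the Climber-large variant is extended to 12 layers and a sequence length of 800.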

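1.3.1 Overall Performance (RQ1)

Performance comparison: As Table 2 shows, Climber achieves the best performance on all four recommendation datasets. Spotify and 30Music belong to the same domain as our industrial dataset (music recommendation), whereas Amazon-Book differs considerably from our setting. The model still performs well on that dataset, which suggests it can adapt to diverse applications.
We next compare the AUC gains of Climber over the other methods on the Industrial dataset:
- 1): As our main online model, DLRM outperforms DIN and TWIN, because in addition to the lifelong sequence and attention mechanism it includes several feature interaction structures.
- 2): The Transformer computes the similarity between all historical items and the target item in a single stage, and improves AUC over TWIN by 0.134%.
- 3): HSTU adds several enhancements to the Transformer and improves AUC over it by 0.036%, but those enhancements also increase computational complexity.
- 4): Climber reduces computational complexity through sequence extraction and adjusts the attention distribution across scenarios and behaviors through adaptive temperature coefficients, improving AUC over DLRM by 0.170%. Climber-large, by scaling the model up, achieves a 2.21% AUC gain, the largest offline gain of the past year.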
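Ablation study: To assess the contribution of each Climber component, we run comprehensive experiments on multiple datasets. For clarity we focus on the ablation on the Industrial dataset and compare the Transformer with the Climber series.
- Climber (-ATL, -BGF): introducing MSE converts the user lifecycle sequence into multi-scale sub-sequence blocks and improves AUC over the Transformer by 0.085%.
- Climber (-BGF): further introducing the adaptive temperature coefficient, which dynamically adjusts the attention distribution, improves AUC by 0.134%.
- Finally, introducing BGF improves AUC by 0.195%. This module integrates the user interests represented by the different sub-sequences, which highlights the importance of interest fusion in recommendation systems.
In short, offline evaluation on multiple datasets shows that the model is both strong and adaptable.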

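1.3.2 Scalability (RQ2)

When discussing model scalability, we measure compute in FLOPs and approximate it as $C \propto s \times l$, where $s$ is the sequence length and $l$ is the number of layers. At large scale, even as sequences grow longer, the quadratic complexity of attention still accounts for only a small share of the model's total FLOPs, so $C \propto s \times l$ keeps only the part that is linear in $s$. The FLOPs can be computed and verified with dedicated TensorFlow tooling.
"The quadratic complexity of attention still accounts for only a small share of the model's total FLOPs": this statement is clearly problematic. In practice the quadratic term accounts for a considerable share.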
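The scaling curves of DLRM, the Transformer, and Climber are shown in Figure 3(a).
- Above $10^9$ FLOPs, the Transformer outperforms DLRM; but between $10^7$ and $10^8$ FLOPs it is markedly less efficient than DLRM.
- Compared with the Transformer, Climber shows a more favorable scaling curve, with a higher starting point and a steeper slope. Below $10^{7.5}$ FLOPs Climber is still weaker than DLRM, but the crossover point moves left, so Climber reaches the performance jump more efficiently than the Transformer.
FLOPs are determined by $l$ and $s$. For Climber, Figure 3(b) and Figure 3(c) show how AUC relates to $l$ and $s$:
- With $s$ fixed, AUC grows with $l$ following a power law.
- With $l$ fixed, AUC shows a similar growth trend as $s$ increases.
The proposed Climber therefore exhibits scaling curves with respect to FLOPs, $s$, and $l$, and its scaling curve is more efficient than the Transformer's.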

1.3.3 Efficient Allocation (RQ3)

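Figure 3(b, c) shows that expanding either $s$ or $l$ improves the model's AUC, but the priority between the two factors has not been examined. Table 3 reports the model AUC for different combinations of $l$ and $s$ at equivalent FLOPs, that is, the testing AUC of different $l$ and $s$ under a fixed FLOPs budget. Since $C \propto s \times l$, the product of layer count and sequence length stays constant at equivalent FLOPs.
- At $4.11 \times 10^8$ FLOPs, the $(400s\times 4l)$ model performs best (AUC=0.8301).
- At $1.01 \times 10^9$ FLOPs, the $(400s \times 8l)$ model performs best (AUC=0.8335).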
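We observe that scaling only one factor is not the most effective way to scale up; $l$ and $s$ should be expanded together. Take the $(400s\times 4l)$ model: to increase FLOPs by 4x, the candidates are $(1600s\times 4l)$, $(800s\times 8l)$, and $(400s\times 16l)$. In Table 3 the best of these is $(800s\times 8l)$: by expanding both factors jointly, the model's AUC rises from 0.8301 to 0.8382. This also guides online resource allocation: in a real recommendation system each iteration usually scales up only one factor, so online scaling should alternate between $s$ and $l$.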
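Under the approximation $C \propto s \times l$ used above, the candidate configurations for a given budget multiple can be enumerated mechanically. The helper below is a hypothetical illustration that only considers power-of-two multipliers; it says nothing about which candidate gives the best AUC, which the text takes from Table 3.

```python
def equivalent_configs(base_s, base_l, budget_factor):
    """List (s, l) pairs with s * l == budget_factor * base_s * base_l.

    Only power-of-two multipliers on each factor are considered, and the
    FLOPs model is the proportionality C ~ s * l from the text.
    """
    configs = []
    s_mult = 1
    while s_mult <= budget_factor:
        if budget_factor % s_mult == 0:
            l_mult = budget_factor // s_mult
            configs.append((base_s * s_mult, base_l * l_mult))
        s_mult *= 2
    return configs


configs = equivalent_configs(base_s=400, base_l=4, budget_factor=4)
print(configs)
```

For the $(400s\times 4l)$ starting point and a 4x budget this yields exactly the three candidates discussed above, with the balanced one in the middle.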

1.3.4 Online A/B Test (RQ4)

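Table 4 summarizes the online A/B test results for the proposed Climber. Simply by expanding $s$ and $l$, the model exhibits an online scaling curve between the metric and FLOPs.
- The Climber model at $5.82 \times 10^6$ FLOPs shows a negative effect. When Climber (6x) matches the FLOPs of DLRM (6x), the effect is only slightly negative, which indicates that Climber is less efficient than DLRM at low FLOPs.
- When Climber (479x) reaches $2.79 \times 10^9$ FLOPs, the online metric gain is 7.78%.
- Finally, when Climber (620x) reaches $3.61 \times 10^9$ FLOPs, the online metric gain is 12.19%.
Judging from the online experiments, Climber's gains seem to come from the larger FLOPs budget rather than from a better architecture.
In online inference, Climber's latency is markedly lower across sequence lengths and layer counts: each request is 2.92x to 14.38x faster than DLRM. The acceleration comes from our acceleration techniques and uses an inference budget comparable to DLRM's, which lets us deploy a model 100x more complex. To our knowledge, Climber is the first recommendation model to exhibit both offline and online scaling curves while keeping resources balanced, and it achieves a 12.19% metric gain, the largest improvement of the past year.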

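1.4 Conclusion

We propose Climber, an efficient scaling framework that combines a Transformer variant designed for recommendation scenarios with co-designed acceleration techniques. By addressing multi-scale sequences, multi-scenario, and multi-interest problems, the model reduces computational complexity and breaks the scaling dilemma in recommendation systems. This combination gives the model better scalability than DLRM and the Transformer in offline evaluation. The acceleration techniques, which use the "single user, multiple items" sample format and an encoder-level KV cache, make it possible to deploy a model 100x more complex without excessive additional compute. Climber exhibits an online scaling curve and achieves a 12.19% online metric gain. In sum, this work adapts the Transformer architecture to recommendation systems under resource constraints and lays the groundwork for the next stage of the generative recommendation paradigm. In future work we will explore more generative techniques in recommendation systems to keep unlocking the potential of scaling.
This paper is not generative recommendation at all; it is discriminative recommendation.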