2025_OneTrans

I. OneTrans [2025]

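"OneTrans: Unified Feature Interaction and Sequence Modeling with One Transformer in Industrial Recommender"

In recommender systems, scaling up the feature-interaction module (e.g., Wukong, RankMixer) or the user-behavior sequence module (e.g., LONGER) has achieved notable success. However, these efforts usually proceed along separate tracks, which hinders bidirectional information exchange and prevents unified optimization and scaling. This paper proposes OneTrans, a unified Transformer backbone that performs user-behavior sequence modeling and feature interaction at the same time. OneTrans uses a unified tokenizer to convert sequential attributes and non-sequential attributes into a single token sequence. The stacked OneTrans blocks share parameters across sequential tokens while assigning token-specific parameters to non-sequential tokens. Through causal attention and cross-request KV caching, OneTrans supports precomputation and caching of intermediate representations, which substantially reduces the computational cost of both training and inference. Experiments on industrial-scale datasets show that OneTrans scales efficiently as parameters increase, consistently outperforms strong baselines, and achieves a 5.68% lift in per-user GMV in online A/B tests.
Recommender systems play a central role in information services such as e-commerce, streaming media, and social networks. Industrial recommender systems usually adopt a cascaded ranking architecture.
- First, the recall stage selects a few hundred candidates from a corpus of billions of items.
- Then, the ranking stage (usually consisting of coarse ranking and fine ranking) scores each candidate and returns the top-k items.
This paper focuses on the ranking stage. For the ranking task, mainstream approaches iterate on two separate modules:
- Sequence modeling: encodes user multi-behavior sequences into candidate-aware representations using local attention or Transformer encoders.
- Feature interaction: learns high-order crosses among non-sequential features (e.g., user profile, item profile, and context) through factorization, explicit cross networks, or attention over feature groups.
As shown in Figure 1(a), these approaches usually encode user behaviors into a compressed sequence representation, concatenate it with the non-sequential features, and then apply a feature-interaction module to learn high-order interactions. We call this design the encode-then-interaction pipeline.
In fact, LONGER ("LONGER: Scaling Up Long Sequence Modeling in Industrial Recommenders") already follows the approach in Figure 1(b).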

The success of large language models (LLMs) shows that scaling up a model (e.g., parameter count, training data) yields predictable performance gains ("Scaling laws for neural language models"), which has inspired similar research in recommender systems ("LONGER: Scaling Up Long Sequence Modeling in Industrial Recommenders", "Wukong: Towards a scaling law for large-scale recommendation", "RankMixer: Scaling Up Ranking Models in Industrial Recommenders").
- For feature interaction, Wukong stacks Factorization Machine blocks with linear compression to capture high-order feature interactions and establish scaling laws, while RankMixer achieves good scaling through hardware-friendly token mixing with token-specific feed-forward networks (FFNs).
- For sequence modeling, LONGER applies a causal Transformer to long user histories and shows that increasing depth and width yields monotonic improvements.
Although these methods are effective in practice, separating sequence modeling and feature interaction into independent modules has two main limitations:
- First, the encode-then-interaction pipeline restricts bidirectional information flow and limits how much static/context features can shape the sequence representations ("Interformer: Towards effective heterogeneous interaction learning for click-through rate prediction").
- Second, module separation fragments execution and increases latency, whereas a single Transformer-style backbone can reuse LLM optimization techniques (e.g., KV caching, memory-efficient attention, and mixed precision) and therefore scale more effectively ("Hiformer: Heterogeneous Feature Interactions Learning with Transformers for Recommender Systems").
This paper proposes OneTrans, a new architectural paradigm whose unified Transformer backbone jointly performs user-behavior sequence modeling and feature interaction. As shown in Figure 1(b), OneTrans supports bidirectional information exchange within the unified backbone. It uses a unified tokenizer to convert sequential features (diverse behavior sequences) and non-sequential features (static user/item and contextual features) into a single token sequence, which is then processed by a pyramid of stacked OneTrans blocks. A OneTrans block is a Transformer variant tailored to industrial recommender systems. To accommodate the diverse token sources in recommender systems (unlike LLMs, which contain text-only tokens), each OneTrans block uses a mixed parameterization similar to HiFormer. Specifically, all sequential tokens (from sequential features) share a single set of Q/K/V and FFN weights, while each non-sequential token (from non-sequential features) receives token-specific parameters to preserve its distinct semantics.
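Unlike the conventional encode-then-interaction framework, OneTrans removes the architectural barrier between sequential and non-sequential features through a unified causal Transformer backbone. This design aligns recommender scaling with LLM practice: the whole model can be scaled by adjusting the depth and width of the backbone, while seamlessly inheriting mature LLM optimizations such as FlashAttention and mixed-precision training. In particular, cross-candidate KV caching and cross-request KV caching reduce the time complexity for sessions with C candidates from $O(C)$ to $O(1)$, which makes large-scale OneTrans deployment feasible.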
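In summary, the paper makes four main contributions:
- Unified framework: OneTrans, a single Transformer backbone for ranking, equipped with a unified tokenizer (which encodes sequential and non-sequential features into one token sequence) and a unified Transformer block (which jointly performs sequence modeling and feature interaction).
  This idea was already proposed in LONGER.
- Customization for recommenders: to bridge the gap between LLMs and recommendation tasks, OneTrans introduces a mixed parameterization that assigns token-specific parameters to the diverse non-sequential tokens while sharing parameters across all sequential tokens.
  This idea is borrowed from Hiformer.
- Efficient training and serving: efficiency is improved by a pyramid strategy that progressively prunes sequential tokens and by cross-request KV caching (which reuses user-side computations across candidates). In addition, LLM optimizations such as FlashAttention, mixed-precision training, and half-precision inference further reduce memory footprint and computation.
- Scaling and deployment: OneTrans shows near log-linear performance gains as model size grows, providing empirical evidence of a scaling law on real production data. When deployed online, it delivers significant lifts in business KPIs while maintaining industrial-grade latency.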

1.1 Related Work

Early recommender systems such as DIN and its session-aware variant DSIN use local attention to learn candidate-conditioned summaries of user histories, but they compress behaviors into fixed-length vectors per candidate, which limits long-range dependency modeling ("Deep Interest Evolution Network for Click-Through Rate Prediction").
Self-attentive methods such as SASRec, BERT4Rec, and BST remove this bottleneck by letting each position attend to the full history, and improve sample efficiency through bidirectional masking.
More recently, as scaling laws in recommender systems have been studied in greater depth, LONGER has pushed sequence modeling to industrial scale with efficient attention and a serving-friendly design in order to handle ultra-long behavioral histories.
However, in mainstream pipelines these sequence encoders are usually separated from the feature-interaction stack, which leads to late fusion with static contextual features rather than joint optimization ("Interformer: Towards effective heterogeneous interaction learning for click-through rate prediction").
LONGER is jointly optimized.
On the feature-interaction side, early recommender systems relied on hand-crafted cross-features or automatic multiplicative interaction layers. Classic models such as Wide&Deep, FM/DeepFM, and DCN/DCNv2 provide efficient low-order or bounded-degree interactions.
However, recent scaling studies ("Wukong: Towards a scaling law for large-scale recommendation") find that once a model has stacked enough cross layers, adding more layers no longer helps: performance plateaus instead of continuing to improve. To overcome the rigidity of preset cross forms, attention-based methods learn high-order interactions automatically. AutoInt learns relations of arbitrary order, and HiFormer introduces group-specific projections to better capture heterogeneous, asymmetric interactions.
Scaling up is increasingly applied to feature-interaction modules:
- Large-scale systems such as Wukong achieve predictable gains by stacking FM-style interaction blocks with linear compression.
- RankMixer achieves good scaling under strict latency budgets through parallel token mixing and sparse MoE.
However, these interaction modules usually follow the encode-then-interaction paradigm, which pushes interactions into a separate stage and hinders unified optimization with user sequence modeling ("Interformer: Towards effective heterogeneous interaction learning for click-through rate prediction").
To date, progress in recommender systems has largely advanced along two separate tracks: sequence modeling and feature interaction. InterFormer ("Interformer: Towards effective heterogeneous interaction learning for click-through rate prediction") attempts to bridge this gap with a summary-based bidirectional cross architecture that enables mutual signal exchange between the two components. It still keeps them as separate modules, however, and the cross architecture introduces architectural complexity and fragmented execution. Without a unified backbone for joint modeling and optimization, the system is hard to scale effectively as a whole.

1.2 Method

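Before describing the method in detail, we briefly state the task setting. For each user $u$, the recall stage returns a candidate set (usually containing hundreds of candidate items). The ranking model then predicts a score for each candidate item $i$:
$\hat y_{u,i} = f\left(i \mid \mathcal{NS}, \mathcal S; \Theta\right)$
where:
- $\mathcal{NS}$ is the set of non-sequential features from the user, the candidate item, and the context.
- $\mathcal S$ is the set of the user's historical behavior sequences.
- $\Theta$ denotes the trainable parameters.
Common task predictions include click-through rate (CTR) and post-click conversion rate (CVR):
$\begin{matrix} \text{CTR}_{u,i} = p\left(\text{click} = 1 \mid \mathcal{NS}, \mathcal S; \Theta\right) \\ \text{CVR}_{u,i} = p\left(\text{conv} = 1 \mid \mathcal{NS}, \mathcal S; \Theta\right) \end{matrix}$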

1.2.1 OneTrans Framework Overview

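As shown in Figure 2(a), OneTrans uses a unified tokenizer to map the sequential features $\mathcal S$ to S-tokens and the non-sequential features $\mathcal{NS}$ to NS-tokens. A pyramid-stacked Transformer then jointly processes this unified token sequence in a single computation graph. We denote the initial token sequence as:
$\mathbf X^{(0)} = \left[\text{S-tokens}; \text{NS-tokens}\right] \in \mathbb R^{(L_\text{S} + L_\text{NS}) \times d}$
This token sequence consists of $L_\text{S}$ S-tokens and $L_\text{NS}$ NS-tokens, and every token has dimensionality $d$. Note that learnable [SEP] tokens are inserted among the S-tokens to separate different types of user-behavior sequences.
Note: the [SEP] tokens that separate different behavior types belong to the timestamp-agnostic scheme. According to the paper, a timestamp-aware scheme can be used instead, in which case there is no [SEP] token and a sequence-type indicator is used.
As shown in Figure 2(b), each OneTrans block progressively refines the token states through the following steps:
$\begin{matrix} \mathbf Z^{(n)} = \text{MixedMHA}\left(\text{Norm}\left(\mathbf X^{(n-1)}\right)\right) + \mathbf X^{(n-1)} \\ \mathbf X^{(n)} = \text{MixedFFN}\left(\text{Norm}\left(\mathbf Z^{(n)}\right)\right) + \mathbf Z^{(n)} \end{matrix}$
Here MixedMHA (Mixed Multi-Head Attention) and MixedFFN (Mixed Feed-Forward Network) follow a mixed parameterization strategy (see Figure 2(c)):
- In the attention layer (and the feed-forward layers), weights are shared across sequential tokens.
- In the attention layer (and the feed-forward layers), non-sequential tokens are assigned their own parameters.
Note: RMSNorm is used here as a pre-norm: it only normalizes the inputs of MixedMHA and MixedFFN and does not affect the residual path.
The unified causal mask imposes autoregressive constraints, so each position attends only to preceding tokens. In particular, NS-tokens are allowed to attend to the entire history of S-tokens, which enables comprehensive cross-token interaction.
By stacking such blocks and applying pyramid-style tail truncation to the S-tokens, the model progressively distills compact high-order information into the NS-tokens. The final token states are then passed to task-specific heads for prediction.
In Figure 2(c), the user-behavior sequence is arranged in reverse chronological order: earlier positions hold the most recent engagements and later positions hold the oldest ones. Each position attends only to itself and the positions after it (see the LONGER paper).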
The union of the NS-tokens and $k_l$ S-tokens serves as the queries, while the keys/values cover all tokens.
By unifying non-sequential and sequential features into one token sequence and modeling it with a causal Transformer, OneTrans departs from the conventional encode-then-interaction pipeline. This unified design naturally supports:
- (i): intra-sequence interactions within each behavior sequence.
- (ii): cross-sequence interactions across multiple sequences.
- (iii): multi-source feature interactions among item features, user features, and contextual features.
- (iv): sequence-feature interactions.
All of these are handled within a single Transformer stack.
This unified formulation lets us seamlessly inherit mature LLM engineering optimizations, including KV caching and memory-efficient attention, which substantially reduces inference latency. We believe this unified formulation is well suited to addressing multi-sequence and cross-domain recommendation within a single, scalable architecture. The specific design is described next.

1.2.2 Features and Tokenization

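To construct the initial token sequence $\mathbf X^{(0)}$, OneTrans first applies a feature preprocessing pipeline that maps all raw feature inputs to embedding vectors. These embedding vectors are then partitioned into:
- (i): a multi-behavior sequential subset.
- (ii): a non-sequential subset that represents user, item, or context features.
A separate tokenizer is applied to each subset.
Non-Sequential Tokenization: non-sequential features $\mathcal{NS}$ include numerical inputs (e.g., price, CTR) and categorical inputs (e.g., user ID, item category). All features are bucketized or one-hot encoded and then embedded. Because industrial systems usually involve hundreds of features of varying importance, there are two options for controlling the number of non-sequential tokens $L_\text{NS}$:
- Group-wise Tokenizer (aligned with RankMixer): features are manually partitioned into semantic groups $\left\{\mathbf{ g}_1,\cdots,\mathbf{ g}_{L_\text{NS}}\right\}$. The features of each group are concatenated and then fed into a group-specific MLP:
  $\text{NS-tokens} = \left[\text{MLP}_1\left(\text{concat}\left(\mathbf g_1\right)\right), \cdots, \text{MLP}_{L_\text{NS}}\left(\text{concat}\left(\mathbf g_{L_\text{NS}}\right)\right)\right]$
  That is, the embeddings within a group are concatenated first and then projected. How the groups are chosen is the critical decision here.
- Auto-Split Tokenizer: alternatively, all features are concatenated, projected once through a single MLP, and then split:
  $\text{NS-tokens} = \text{split}\left(\text{MLP}\left(\text{concat}\left(\mathcal{NS}\right)\right), L_\text{NS}\right)$
  By using a single dense projection, the Auto-Split Tokenizer reduces kernel-launch overhead compared with the Group-wise Tokenizer.
In the end, non-sequential tokenization produces $L_\text{NS}$ non-sequential tokens, each of dimensionality $d$.
Either tokenizer limits the number of NS-tokens to $L_\text{NS}$, which lowers the attention complexity. The paper uses the Auto-Split Tokenizer in its experiments.
LONGER also groups the S-tokens with a group-wise tokenizer, which is how it supports ultra-long sequence modeling.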
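The two non-sequential tokenizers above can be compared in a minimal NumPy sketch. The two-layer ReLU `mlp`, the dimensions, and the three-way grouping are illustrative assumptions; the notes do not specify the MLP architecture or how groups are chosen.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, w2):
    """Two-layer ReLU MLP; a stand-in for the unspecified MLP (assumption)."""
    return np.maximum(x @ w1, 0.0) @ w2

def group_wise_tokenize(groups, params):
    """One MLP per semantic group: concatenate the group's embeddings, project to d."""
    return np.stack([mlp(np.concatenate(g), w1, w2)
                     for g, (w1, w2) in zip(groups, params)])

def auto_split_tokenize(features, w1, w2, num_tokens):
    """One MLP over all concatenated features, then split the output into tokens."""
    out = mlp(np.concatenate(features), w1, w2)   # shape (num_tokens * d,)
    return out.reshape(num_tokens, -1)            # shape (num_tokens, d)

d, hidden, emb = 8, 16, 4
features = [rng.normal(size=emb) for _ in range(6)]    # six feature embeddings
groups = [features[:2], features[2:5], features[5:]]   # hypothetical manual grouping
params = [(rng.normal(size=(len(g) * emb, hidden)), rng.normal(size=(hidden, d)))
          for g in groups]
ns_group = group_wise_tokenize(groups, params)         # three MLP calls

L_ns = 3
w1 = rng.normal(size=(len(features) * emb, hidden))
w2 = rng.normal(size=(hidden, L_ns * d))
ns_auto = auto_split_tokenize(features, w1, w2, L_ns)  # one MLP call
```

Both variants return an $L_\text{NS} \times d$ matrix; the auto-split version needs a single dense projection, which is the kernel-launch saving mentioned above.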
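Sequential Tokenization: OneTrans accepts multi-behavior sequences, denoted as:
$\mathcal S = \left\{\mathbf S_1, \cdots, \mathbf S_n\right\},\quad \mathbf S_i = \left[\mathbf{\vec e}_{i,1}, \cdots, \mathbf{\vec e}_{i,L_i}\right]$
where:
- $\mathbf S_i$ is the $i$-th behavior sequence, consisting of $L_i$ event embeddings $\mathbf{\vec e}_{i,1},\cdots, \mathbf{\vec e}_{i,L_i}$.
  Each sequence represents a different behavior type, e.g., a click sequence or a conversion sequence.
- $\mathbf{\vec e}_{i,j}$ is the embedding of the $j$-th event in the $i$-th sequence, formed by concatenating the item ID with its side information (e.g., item category and item price).
Multi-behavior sequences may differ in their raw dimensionality, so for each sequence $\mathbf S_i$ a sequence-specific $\text{MLP}_i$ projects every event embedding $\mathbf{\vec e}_{i,j}$ to the common dimensionality $d$:
$\tilde{\mathbf S}_i = \left[\text{MLP}_i\left(\mathbf{\vec e}_{i,1}\right), \cdots, \text{MLP}_i\left(\mathbf{\vec e}_{i,L_i}\right)\right] \in \mathbb R^{L_i \times d}$
The aligned sequences $\tilde {\mathbf S}_i$ are merged into a single token sequence by one of the following two rules:
- 1) Timestamp-aware: interleave all events by time and add sequence-type indicators.
  Sequence-type indicators work like a position embedding: a sequence-type embedding is introduced and added to every token embedding.
  Note: for the OneTrans block, the sequence is sorted in increasing time order, which is the opposite of LONGER.
- 2) Timestamp-agnostic: concatenate the sequences by event impact (e.g., purchase -> add-to-cart -> click) and insert learnable [SEP] tokens between sequences.
  The most important sequence is placed on the left, because causal masking lets high-intent signals guide and filter the subsequent low-intent behaviors.
In the latter case, behaviors with higher user intent are placed at the front of the sequence. Ablation results show that when timestamps are available, the timestamp-aware rule outperforms the timestamp-agnostic one.
Formally:
$\text{S-tokens} = \text{Merge}\left(\tilde{\mathbf S}_1, \cdots, \tilde{\mathbf S}_n\right) \in \mathbb R^{L_\text{S} \times d},\quad L_\text{S} = \left(\sum_{i=1}^{n} L_i\right) + L_\text{SEP}$
For the timestamp-aware rule $L_\text{SEP} = 0$; for the timestamp-agnostic rule $L_\text{SEP} \gt 0$.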
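The two merge rules can be illustrated with a small NumPy sketch. The function names, the per-event `timestamps` arrays, and the choice of exactly one [SEP] between consecutive sequences (so $L_\text{SEP} = n - 1$) are assumptions made for illustration; the notes only state that [SEP] tokens are inserted between sequences.

```python
import numpy as np

def merge_timestamp_aware(seqs, timestamps, type_emb):
    """Interleave all events by time and add a sequence-type embedding (L_SEP = 0)."""
    tokens = np.concatenate([s + type_emb[i] for i, s in enumerate(seqs)])
    order = np.argsort(np.concatenate(timestamps), kind="stable")
    return tokens[order]

def merge_timestamp_agnostic(seqs, sep):
    """Concatenate sequences in the given intent order with [SEP] between them."""
    parts = []
    for i, s in enumerate(seqs):
        if i > 0:
            parts.append(sep[None, :])   # one learnable [SEP] row (assumed placement)
        parts.append(s)
    return np.concatenate(parts)

d = 4
clicks = np.ones((3, d))                 # toy click events, already projected to d
orders = 2 * np.ones((2, d))             # toy purchase events
ts = [np.array([1, 4, 5]), np.array([2, 3])]
type_emb = np.zeros((2, d))              # zero type embeddings keep the toy readable

aware = merge_timestamp_aware([clicks, orders], ts, type_emb)
agnostic = merge_timestamp_agnostic([orders, clicks], sep=np.zeros(d))
```

In the toy example `aware` has 5 rows ordered click, order, order, click, click, while `agnostic` has 6 rows with the [SEP] row after the two purchase events.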

1.2.3 OneTrans Block

As shown in Figure 2(b), each OneTrans block is a pre-norm causal Transformer applied to a normalized token sequence that consists of $L_\text S$ sequential S-tokens followed by $L_\text{NS}$ non-sequential NS-tokens. Inspired by findings on heterogeneous feature groups ("Hiformer: Heterogeneous Feature Interactions Learning with Transformers for Recommender Systems"), the Transformer is lightly modified to support a mixed parameter scheme (see Figure 2(c)). Specifically, the homogeneous S-tokens share one set of parameters, while the heterogeneous NS-tokens, which come from different sources/semantics, receive token-specific parameters.
Unlike LLM inputs, the token sequence in a recommender system combines sequential S-tokens with diverse NS-tokens whose value ranges and statistics differ markedly. Post-norm setups can suffer from attention collapse and training instability because of these differences. To avoid this, RMSNorm is applied to all tokens as a pre-norm, which aligns scales across token types and stabilizes optimization.
RMSNorm is a simplified variant of LayerNorm. For an input vector $\mathbf {\vec a}\in \mathbb R^n$, LayerNorm $\bar{\mathbf{\vec a}} = \text{LayerNorm}\left(\mathbf{\vec a}\right)$ is defined as:
$\begin{matrix} \mu = \frac{1}{n} \sum_{i=1}^{n} a_i,\quad \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(a_i - \mu\right)^2} \\ \bar a_i = \frac{a_i - \mu}{\sigma} g_i + b_i \end{matrix}$
where $\mathbf{\vec g}, \mathbf{\vec b}\in \mathbb R^n$ are the learnable gain and bias, and $\bar a_i, g_i, b_i$ are the $i$-th elements of $\bar{\mathbf{\vec a}}, \mathbf{\vec g}, \mathbf{\vec b}$.
RMSNorm is defined as:
$\bar a_i = \frac{a_i}{\sqrt{\frac{1}{n} \sum_{j=1}^{n} a_j^2}} g_i$
RMSNorm drops the mean subtraction (i.e., there is no centering) and keeps only the rescaling, so it is cheaper to compute. In practice RMSNorm performs on par with LayerNorm, which is why most mainstream LLMs today use RMSNorm.
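The two definitions above translate directly into NumPy. This is a minimal sketch; the small `eps` inside the square root is an assumption added for numerical stability and does not appear in the formulas.

```python
import numpy as np

def layer_norm(a, g, b, eps=1e-6):
    """LayerNorm: center by the mean, scale by the standard deviation, then gain and bias."""
    mu = a.mean(axis=-1, keepdims=True)
    sigma = np.sqrt(((a - mu) ** 2).mean(axis=-1, keepdims=True) + eps)
    return (a - mu) / sigma * g + b

def rms_norm(a, g, eps=1e-6):
    """RMSNorm: no centering, scale by the root mean square, then gain only."""
    rms = np.sqrt((a ** 2).mean(axis=-1, keepdims=True) + eps)
    return a / rms * g

a = np.array([1.0, -1.0, 2.0, -2.0])     # zero-mean input
g, b = np.ones(4), np.zeros(4)
same = np.allclose(layer_norm(a, g, b), rms_norm(a, g))
```

For a zero-mean input with unit gain and zero bias the two coincide, which is one way to see that RMSNorm only removes the centering step. Once the input is shifted, LayerNorm stays unchanged and RMSNorm does not.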
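Mixed (shared/token-specific) Causal Attention: OneTrans uses standard multi-head attention (MHA) with a causal attention mask; the only change is how Q/K/V are parameterized. Let $\mathbf {\vec x}_i\in \mathbb R^d$ be the $i$-th token. To compute Q/K/V, a shared projection is used for the S-tokens ($i\le L_\text S$), and $L_\text{NS}$ token-specific projections are used for the NS-tokens ($i\gt L_\text S$):
$\left(\mathbf{\vec q}_i, \mathbf{\vec k}_i, \mathbf{\vec v}_i\right) = \left(\mathbf W_i^Q \mathbf{\vec x}_i, \mathbf W_i^K \mathbf{\vec x}_i, \mathbf W_i^V \mathbf{\vec x}_i\right)$
where $\mathbf W_i^\Psi$ ($\Psi\in \{Q,K,V\}$) follows the mixed parameterization scheme:
$\mathbf W_i^\Psi = \begin{cases} \mathbf W_\text{S}^\Psi, & i \le L_\text{S} \quad \text{(shared for S-tokens)} \\ \mathbf W_{\text{NS}, i}^\Psi, & i \gt L_\text{S} \quad \text{(token-specific for NS-tokens)} \end{cases}$
Attention uses a standard causal mask, and the NS-tokens are placed after the S-tokens. As a result:
- (1) S-side: each S-token attends only to earlier $\mathcal S$ positions, i.e., positions of events that occurred before it (in Figure 2a these appear as the tokens to the right of that token).
  - For timestamp-aware sequences, every event is conditioned on its history.
  - For timestamp-agnostic sequences (ordered by intent, e.g., purchase -> add-to-cart -> click/impression), causal masking lets high-intent signals guide and filter the subsequent low-intent behaviors.
  Note: for the OneTrans block, the sequence is sorted in increasing time order, which is the opposite of LONGER.
- (2) NS-side: each NS-token attends to the entire $\mathcal S$ history (in effect a target-attention aggregation of the sequence evidence) and to the preceding NS-tokens, which increases token-level interaction diversity.
  The open question is how the NS-tokens are ordered among themselves; the paper does not say in detail. Could a non-causal mask be used among NS-tokens so that they attend to each other? An ablation study could answer this.
- (3) Pyramid support: on both the S-side and the NS-side, causal masking progressively concentrates information in later positions, which naturally supports a pyramid schedule (pruning tokens layer by layer), described in detail later.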
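The mixed Q/K/V parameterization with a causal mask can be sketched as follows. This is a single-head NumPy simplification under stated assumptions: the row-vector convention `x @ W`, the `params` dictionary layout, and the function names are illustrative, and multi-head splitting and the output projection are omitted.

```python
import numpy as np

def mixed_projection(x, w_shared, w_ns, L_s):
    """Shared matrix for S-tokens (rows < L_s), one matrix per NS-token."""
    out_s = x[:L_s] @ w_shared                         # (L_s, d)
    out_ns = np.einsum("td,tde->te", x[L_s:], w_ns)    # (L_ns, d)
    return np.concatenate([out_s, out_ns])

def mixed_causal_attention(x, params, L_s):
    """Single-head causal attention with mixed Q/K/V parameterization."""
    q, k, v = (mixed_projection(x, *params[name], L_s) for name in ("Q", "K", "V"))
    L, d = x.shape
    scores = q @ k.T / np.sqrt(d)
    mask = np.tril(np.ones((L, L), dtype=bool))        # position i sees j <= i
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
L_s, L_ns, d = 5, 2, 4
x = rng.normal(size=(L_s + L_ns, d))                   # S-tokens first, then NS-tokens
params = {n: (rng.normal(size=(d, d)), rng.normal(size=(L_ns, d, d)))
          for n in ("Q", "K", "V")}
out, weights = mixed_causal_attention(x, params, L_s)
```

Because the NS-tokens sit at the end of the sequence, the lower-triangular mask alone gives them access to every S-token, while no S-token can see an NS-token.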
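Mixed (shared/token-specific) FFN: similarly, the feed-forward network (FFN) follows the same parameterization strategy. NS-tokens use token-specific FFNs and S-tokens use one shared FFN:
$\text{MixedFFN}\left(\mathbf{\vec x}_i\right) = \mathbf W_i^2 \phi\left(\mathbf W_i^1 \mathbf{\vec x}_i\right)$
where $\mathbf W_i^\Psi$ ($\Psi\in \{1,2\}$) follows the mixed parameterization scheme:
$\mathbf W_i^\Psi = \begin{cases} \mathbf W_\text{S}^\Psi, & i \le L_\text{S} \quad \text{(shared for S-tokens)} \\ \mathbf W_{\text{NS}, i}^\Psi, & i \gt L_\text{S} \quad \text{(token-specific for NS-tokens)} \end{cases}$
In summary, compared with a standard causal Transformer, OneTrans changes only the parameterization:
- NS-tokens use token-specific QKV and FFN.
- S-tokens share one set of parameters.
Each sequence corresponds to a single causal mask, which lets the NS-tokens aggregate the complete behavior history while keeping efficient Transformer-style computation.
How should "each sequence corresponds to a single causal mask" be understood? The reader's interpretation is:
- A causal mask is applied on the S-tokens, so that each event can attend only to the events before it, as well as to all NS-tokens.
- Is a non-causal mask used among the NS-tokens so that they can attend to each other? An ablation study could answer this.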
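A minimal NumPy sketch of the MixedFFN formula follows. ReLU as the activation $\phi$, the row-vector convention, and the weight shapes are assumptions; the notes do not name the activation.

```python
import numpy as np

def mixed_ffn(x, w1_s, w2_s, w1_ns, w2_ns, L_s):
    """Shared two-layer FFN for S-tokens, one FFN per NS-token."""
    relu = lambda z: np.maximum(z, 0.0)                # phi; ReLU is an assumption
    out_s = relu(x[:L_s] @ w1_s) @ w2_s                # (L_s, d), shared weights
    hidden_ns = relu(np.einsum("td,tdh->th", x[L_s:], w1_ns))
    out_ns = np.einsum("th,thd->td", hidden_ns, w2_ns) # (L_ns, d), per-token weights
    return np.concatenate([out_s, out_ns])

rng = np.random.default_rng(0)
L_s, L_ns, d, h = 3, 2, 4, 8
x = rng.normal(size=(L_s + L_ns, d))
out = mixed_ffn(x,
                rng.normal(size=(d, h)), rng.normal(size=(h, d)),
                rng.normal(size=(L_ns, d, h)), rng.normal(size=(L_ns, h, d)),
                L_s)
```

The batched `einsum` keeps the token-specific FFNs in one call instead of a Python loop over NS-tokens, which is one plausible way to keep the per-token parameters hardware friendly.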

1.2.4 Pyramid Stack

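As described in the previous sections, causal masking concentrates information in later positions. Exploiting this recency structure, a pyramid schedule is used: at each OneTrans block layer, only a subset of the most recent S-tokens generates queries, while the keys/values are still computed over the full sequence; the query set shrinks with depth.
This borrows the idea from LONGER.
Let $\mathbf X = \left\{\mathbf{\vec x}_i\right\}_{i=1}^L$ be the input token list, and let $\mathcal Q = \left\{L-L^\prime + 1,\cdots, L\right\}$ be the tail index set, with $L^\prime \le L$. Following Mixed (shared/token-specific) Causal Attention, the queries are modified to:
$\mathbf{\vec q}_i = \mathbf W_i^Q \mathbf{\vec x}_i,\quad i \in \mathcal Q$
The keys and values are still computed over the full sequence $\{1,\cdots,L\}$. After attention, only the outputs for $i\in \mathcal Q$ are kept, which reduces the token length to $L^\prime$ and forms a pyramidal hierarchy across layers.
This design brings two benefits:
- (i) Progressive distillation: long behavioral histories are funneled into a small number of tail queries, which focuses model capacity on the most informative events and consolidates information into the NS-tokens.
- (ii) Compute efficiency: the attention cost becomes $O\left(LL^\prime d\right)$ and the FFN cost scales linearly with $L^\prime$. Shrinking the query set directly reduces FLOPs and activation memory.
Define the ratio $\alpha = \frac{L^\prime}{L}$; it gives the shrink ratio between consecutive layers of the pyramid.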
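The tail-query rule can be sketched in NumPy as below. For brevity this sketch uses one shared projection per layer for all tokens (no mixed parameterization) and a single head; the shrink schedule `(6, 4, 2)` is a made-up example, not the schedule used in the paper.

```python
import numpy as np

def pyramid_attention(x, wq, wk, wv, num_queries):
    """Causal attention in which only the last `num_queries` tokens issue queries."""
    L, d = x.shape
    q_idx = np.arange(L - num_queries, L)              # tail index set Q
    q = x[q_idx] @ wq                                  # (L', d)
    k, v = x @ wk, x @ wv                              # keys/values over the full sequence
    scores = q @ k.T / np.sqrt(d)                      # (L', L): cost O(L * L' * d)
    causal = np.arange(L)[None, :] <= q_idx[:, None]   # query i sees keys j <= i
    scores = np.where(causal, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v                                       # (L', d): next layer sees L' tokens

rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=(8, d))
for keep in (6, 4, 2):                                 # hypothetical shrink schedule
    wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
    x = pyramid_attention(x, wq, wk, wv, keep)
```

Within one layer, the outputs for the kept tail positions are identical to what full causal attention would produce for those positions; the saving comes from never computing the other rows.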

1.2.5 Training and Deployment Optimization

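Cross Request KV Caching: in industrial recommender systems, the samples from the same request are processed contiguously during both training and serving: their S-tokens are identical across all candidates, while their NS-tokens differ per candidate item. Exploiting this structure, the widely used KV caching is integrated into OneTrans, forming a unified two-stage paradigm:
- Stage I (S-side, once per request): process all S-tokens with causal masking, and cache their key/value pairs and attention outputs. This stage runs once per request.
- Stage II (NS-side, per candidate): for each candidate, compute its NS-tokens and perform cross-attention against the cached S-side keys/values, followed by the token-specific FFN layers. In particular, candidate-specific sequences (e.g., SIM, "Search-based user interest modeling with lifelong sequential behavior data for click-through rate prediction") are pre-aggregated into NS-tokens via pooling, because they cannot reuse the shared S-side cache.
KV caching amortizes the S-side computation across candidates, keeps the per-candidate work lightweight, and removes redundant computation, which significantly improves throughput.
Because user behavioral sequences are append-only, KV caching is extended across requests: each new request reuses the previous cache and computes incremental keys/values only for the newly added behaviors. This reduces the per-request sequence computation from $O(L)$ to $O(\Delta L)$, where $\Delta L$ is the number of new behaviors since the last request.
Note: the cross-request setting requires careful handling of the position embedding, because positions keep growing.
In addition, if the sequence uses a time embedding, the previous cache becomes invalid when a new request arrives, because the time gap between each historical behavior and the current request has changed.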
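The append-only reuse can be sketched with a toy cache. This is a single-layer, single-head simplification under stated assumptions: the class and function names are hypothetical, keys/values are plain linear projections of the S-tokens, and attention among NS-tokens, deeper layers, and position or time embeddings are all omitted.

```python
import numpy as np

class SideKVCache:
    """Append-only cache of S-side keys/values for one user (hypothetical helper)."""
    def __init__(self, wk, wv):
        self.wk, self.wv = wk, wv
        self.k = np.zeros((0, wk.shape[1]))
        self.v = np.zeros((0, wv.shape[1]))

    def append(self, new_s_tokens):
        """Project only the newly arrived behaviors: O(delta L) work per request."""
        self.k = np.concatenate([self.k, new_s_tokens @ self.wk])
        self.v = np.concatenate([self.v, new_s_tokens @ self.wv])

def score_candidate(ns_query, cache):
    """Stage II: one candidate's NS query attends over the cached S-side keys/values."""
    scores = cache.k @ ns_query / np.sqrt(ns_query.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ cache.v

rng = np.random.default_rng(0)
d = 4
wk, wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))
history = rng.normal(size=(7, d))

cache = SideKVCache(wk, wv)
cache.append(history[:5])                # first request: 5 behaviors
cache.append(history[5:])                # next request: only the 2 new behaviors
outputs = [score_candidate(rng.normal(size=d), cache) for _ in range(3)]  # 3 candidates
```

The cache built incrementally matches a full recomputation over the 7 behaviors, and all three candidates reuse it, which mirrors the amortization described above.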
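Unified LLM Optimizations: FlashAttention-2 is used to reduce attention I/O and the quadratic activation footprint of vanilla attention through tiling and kernel fusion, which gives lower memory usage and higher throughput in both training and inference.
To further relieve memory pressure, mixed-precision training (BF16/FP16) is combined with activation recomputation, i.e., selected forward activations are discarded during the forward pass and recomputed during the backward pass. This combination saves a large amount of memory at the cost of a small amount of extra computation, and supports larger batches and deeper models without architectural changes.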

1.3 Experiments

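Through offline evaluation and online tests, the paper aims to answer the following research questions (RQs):
- RQ1: Unified stack vs. encode-then-interaction: under comparable compute, does a single Transformer stack bring consistent performance gains?
- RQ2: Which design choices matter? Ablations on the input layer (e.g., tokenizer, sequence fusion) and the OneTrans block (e.g., parameter sharing, attention type, pyramid stacking) evaluate the importance of each design choice for performance and efficiency.
- RQ3: System efficiency: do pyramid stacking, cross-request KV caching, FlashAttention-2, and mixed precision with recomputation reduce FLOPs/memory and latency under the same OneTrans graph?
- RQ4: Scaling law: when length (token sequence length), width ($d_\text{model}$), and depth (number of layers) are increased, does loss/performance follow the expected log-linear trend?
- RQ5: Online A/B tests: under production latency constraints, does deploying OneTrans online deliver significant lifts in key business metrics (e.g., order/u, GMV/u)?
Dataset: for offline evaluation, OneTrans is evaluated on production logs from a large-scale industrial ranking scenario, in strict compliance with privacy requirements (all personally identifiable information is anonymized and hashed). The data is split chronologically, and all features are snapshotted at impression time to prevent temporal leakage and ensure online-offline consistency. Labels (e.g., clicks and orders) are aggregated within fixed windows aligned with production settings. Table 1 summarizes the dataset statistics.
Tasks and metrics: two binary-classification ranking tasks are evaluated, CTR and CVR. Performance is measured by AUC and UAUC (impression-weighted user-level AUC).
- Next-batch evaluation: the data is processed chronologically. For each mini-batch: (i) record the predictions in eval mode; then (ii) train on the same mini-batch.
  That is, evaluation happens during training.
  AUC and UAUC are computed daily from that day's predictions and finally macro-averaged over days.
  Should evaluation start only after N days of training? If evaluation starts from the very first mini-batch, the model is clearly not yet trained, and the results at that point are meaningless.
  Moreover, by the second epoch the model would already have seen the samples of each batch. The reader therefore suspects that the paper trains for only one epoch, so that every batch is unseen at evaluation time.
- Efficiency metrics: the parameter count (model parameters excluding sparse embeddings) and TFLOPs (training compute at batch size 2048, in TFLOPs) are reported.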
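The notes describe UAUC only as an impression-weighted user-level AUC. One common reading, assumed here, is to compute AUC per user, skip users whose labels are all positive or all negative, and average with each user's impression count as the weight. The function names are illustrative, and the paper's exact definition may differ.

```python
import numpy as np

def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

def uauc(user_ids, labels, scores):
    """Impression-weighted mean of per-user AUCs; single-class users are skipped."""
    total, weight = 0.0, 0
    for u in np.unique(user_ids):
        m = user_ids == u
        if labels[m].min() == labels[m].max():
            continue                     # AUC undefined for this user (assumption: skip)
        total += m.sum() * auc(labels[m], scores[m])
        weight += m.sum()
    return total / weight

users = np.array([1, 1, 1, 2, 2, 3])
labels = np.array([1, 0, 0, 1, 0, 1])
scores = np.array([0.9, 0.2, 0.4, 0.3, 0.6, 0.5])
print(uauc(users, labels, scores))       # user 1: AUC 1.0 (3 imps), user 2: AUC 0.0 (2 imps) -> 0.6
```

In this toy example the global AUC is 6/9, while UAUC is 0.6, which shows that the two metrics can diverge when per-user ranking quality varies.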
Baselines: industry-standard model combinations are built as baselines using the same features and matched compute budgets. Under the encode-then-interaction paradigm, starting from the widely used production baseline DCNv2+DIN, the feature-interaction module is progressively strengthened: DCNv2 -> Wukong -> HiFormer -> RankMixer. With RankMixer fixed, the sequence-modeling module is then varied: StackDIN -> Transformer -> LONGER.
Hyperparameter Settings: two settings are reported:
- $\text{OneTrans}_\text S$ uses 6 stacked OneTrans blocks with width $d=256$ and $H=4$ heads, targeting about 100M parameters.
- $\text{OneTrans}_\text L$ scales to 8 layers with width $d=384$ and $H=4$ heads.
In addition:
- Inputs are processed by a unified tokenizer: multi-behavior sequences are fused in the timestamp-aware manner, and non-sequential features are tokenized with Auto-Split.
- The pyramid schedule linearly reduces the number of tokens from 1190 to 12.
Optimization and infrastructure:
- A dual-optimizer strategy without weight decay is used:
  - Sparse embeddings are optimized with Adagrad ($\beta_1=0.1, \beta_2=1.0$).
  - Dense parameters are optimized with RMSPropV2, lr=0.005, momentum=0.99999.
- During training, the per-GPU batch size is 2048; the gradient clipping threshold is 90 for dense layers and 120 for sparse layers to ensure stable optimization.
- For online inference, the per-GPU batch size is set to a smaller 100 to balance throughput and latency.
- Training runs on 16 H100 GPUs with data-parallel all-reduce.

1.3.1 RQ1: Performance Evaluation

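All comparisons are made against DCNv2+DIN, the pre-scaling production baseline in this scenario (Table 2).
- Under the encode-then-interaction paradigm, scaling either component independently helps: upgrading the feature-interaction module (DCNv2 -> Wukong -> HiFormer -> RankMixer) or the sequence-modeling module (StackDIN -> Transformer -> LONGER) consistently improves CTR AUC/UAUC and CVR AUC.
  In this system, a gain above +0.1% on these metrics is considered meaningful, and a gain above +0.3% usually corresponds to a statistically significant effect in online A/B tests. CVR UAUC should be interpreted with caution, however, because per-user sample sizes are smaller and variance is higher.
- With the unified design, $\text{OneTrans}_\text S$ improves over the baseline by +1.13%/+1.77% on CTR AUC/UAUC and by +0.90%/+1.66% on CVR AUC/UAUC. At a comparable parameter scale, it also outperforms RankMixer+Transformer, whose training FLOPs are similar (2.64T vs 2.51T), which demonstrates the advantage of unified modeling.
  Scaling up to $\text{OneTrans}_\text L$ achieves the best overall gains: +1.53%/+2.79% on CTR AUC/UAUC and +1.14%/+3.23% on CVR AUC/UAUC, showing predictable improvement as model capacity grows.
In summary, unifying sequence modeling and feature interaction in a single Transformer yields more reliable and more compute-efficient improvements than scaling either component independently.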

1.3.2 RQ2: Design Choices via Ablation Study

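Ablation experiments on the $\text{OneTrans}_\text S$ model quantify the contribution of the key design choices. The full results are summarized in Table 3. The following variants are evaluated:
- Input variants:
  - i): replace the Auto-Split Tokenizer with the Group-wise Tokenizer (row 1).
  - ii): use the timestamp-agnostic fusion strategy instead of timestamp-aware sequence fusion (row 2).
  - iii): remove the [SEP] tokens in timestamp-agnostic fusion (row 3).
- OneTrans block variants:
  - i): all tokens share one set of Q/K/V and FFN parameters, with no separate parameters for NS-tokens (row 4).
  - ii): replace causal attention with full attention (row 5).
  - iii): disable the pyramid stack and keep the full token sequence at every layer (row 6).
In summary, the ablations show:
- 1): The Auto-Split Tokenizer outperforms manually grouping non-sequential features into tokens, which indicates that non-sequential tokens built automatically by the model are more effective than human-defined feature grouping.
- 2): When timestamps are available, timestamp-aware fusion outperforms intent-based ordering, which indicates that temporal order should take priority over event impact.
- 3): Under timestamp-agnostic fusion, the learnable [SEP] tokens help the model distinguish sequences.
- 4): Assigning token-specific parameters to NS-tokens brings a clear gain over sharing one set of parameters across all tokens, which shows that modeling non-sequential features with individualized projections gives better feature discrimination.
- 5): Causal attention and full attention achieve similar results, which indicates that letting tokens attend to future positions is not critical in this setting. It is worth emphasizing that full attention rules out standard optimizations such as KV caching.
  The advantage of causal attention is that it supports KV caching.
- 6): Keeping the full token list at every layer brings no benefit: OneTrans effectively summarizes information into a small tail of tokens, so the pyramid design can safely prune queries to save computation.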

1.3.3 RQ3: System Efficiency

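To quantify the optimizations from the Training and Deployment Optimization section, they are ablated on an unoptimized $\text{OneTrans}_\text S$ baseline, and the training/inference metrics are reported in Table 5.
The unoptimized $\text{OneTrans}_\text S$ has a training runtime of 407 ms and a peak training memory of 53.13 GB; its p99 inference latency is 54.00 ms and its inference memory is 1.70 GB. Here p99 is the 99th-percentile tail latency, a standard service-level objective (SLO) metric for high-availability online services. The gap between the training and inference numbers reflects different operating conditions: offline training uses larger per-device batches, while online inference distributes micro-batches across multiple machines for stability.
As the table shows:
- 1): The pyramid stack yields substantial savings by compressing long behavioral histories into compact query sets: training time drops by 28.7%, training memory by 42.6%, inference latency by 8.4%, and inference memory by 6.9%.
- 2): Cross-request KV caching removes redundant sequence-side computation, reducing runtime/latency by about 30% and memory by about 50% in both training and serving.
- 3): FlashAttention mainly benefits training, reducing runtime by about 50% and activation memory by about 58%. The inference gains are moderate (latency and memory each drop by about 11-12%), because attention dominates the compute cost in training, where batch sizes are larger and the backward pass is needed.
- 4): Mixed precision with recomputation brings the largest serving gains: p99 latency improves by about 69% and inference memory drops by about 30%, because inference can run end to end in low precision. By contrast, training must keep full-precision optimizer states and gradient accumulators; even so, training runtime and memory still improve by about 32% and 49% respectively.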
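These results show how effective the LLM optimizations are. Building on the ablations on $\text{OneTrans}_\text S$, the model is scaled up to $\text{OneTrans}_\text L$; with these optimizations, $\text{OneTrans}_\text L$ reaches online efficiency comparable to the DCNv2+DIN baseline (which is much smaller than $\text{OneTrans}_\text L$) (Table 4). This again shows that recasting the recommender system as a unified Transformer backbone allows LLM optimizations to be adopted seamlessly, unlocking effective scaling that conventional encode-then-interaction architectures could not previously achieve.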

1.3.4 RQ4: Scaling-Law Validation

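The scaling laws of OneTrans are studied along three dimensions:
- 1): length, i.e., the input token sequence length.
- 2): depth, i.e., the number of stacked blocks.
- 3): width, i.e., the hidden-state dimensionality.
As shown in Figure 3(a), increasing length brings the largest gain, because it introduces more behavioral evidence. Between depth and width there is a clear trade-off:
- Increasing depth usually yields larger gains than increasing width alone, because deeper stacks extract higher-order interactions and richer abstractions.
- However, deeper models also add serial computation, whereas increasing width parallelizes better.
The choice of depth and width should therefore balance performance gains against system efficiency under the target hardware budget.
The scaling-law behavior is analyzed further by increasing the width and depth of OneTrans together. For comparison, the RankMixer+Transformer baseline is also scaled up to 1B parameters on the RankMixer side, and the UAUC gain ($\Delta \text{UAUC}$) is plotted against training FLOPs. As shown in Figure 3(b), both OneTrans and RankMixer follow a clear log-linear trend, but OneTrans has a steeper slope. A possible reason is that RankMixer-driven scaling lacks a unified backbone, and its MoE-based expansion mainly increases the hidden dimension of the FFN.
Together, these results indicate that OneTrans is more efficient in both parameters and compute, and offers a better performance-compute trade-off for industrial deployment.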

1.3.5 RQ5: Online A/B Tests

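The business impact of OneTrans is evaluated in two large-scale industrial scenarios:
- (i): Feeds, i.e., home feeds.
- (ii): Mall, the overall setting that includes Feeds and other sub-scenarios.
Traffic is split at the user/account level through hashing and user-level randomization. Both the control and the treatment models are trained on the past 1.5 years of production data and then deployed, to ensure a fair comparison.
The previous production baseline (RankMixer+Transformer) serves as the control (about 100M neural-network parameters) and does not use sequence KV caching. The treatment deploys $\text{OneTrans}_\text L$ with the serving optimizations described in the Training and Deployment Optimization section (a 33x increase in parameters).
The reported metrics are user-level order/u and gmv/u as relative changes ($\Delta$ %) against the RankMixer+Transformer control (with two-sided 95% confidence intervals based on a user-level stratified bootstrap), and end-to-end latency, measured as the relative change in p99 per-impression time ($\Delta$ %; lower is better).
As shown in Table 6, $\text{OneTrans}_\text L$ achieves consistent gains:
- In the Feeds scenario, order/u increases by 4.3510%, gmv/u increases by 5.6848%, and latency decreases by 3.91%.
- In the Mall scenario, order/u increases by 2.5772%, gmv/u increases by 3.6696%, and latency decreases by 3.26%.
This shows that, compared with a strong non-unified baseline, the unified modeling framework improves business metrics while also reducing serving time.
User Active Days also increased by 0.7478%, and cold-start product order/u rose markedly by 13.59%, which highlights the strong generalization ability of the proposed model.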

1.4 Conclusion

This paper proposes OneTrans, a unified Transformer backbone for personalized ranking that replaces the conventional encode-then-interaction architecture.
- A unified tokenizer converts sequential and non-sequential attributes into one token sequence.
- A unified Transformer block jointly performs sequence modeling and feature interaction by sharing parameters across homogeneous (sequential) tokens and assigning token-specific parameters to heterogeneous (non-sequential) tokens.
- To make the unified stack run efficiently at scale, a pyramid schedule (which progressively prunes sequential tokens) and cross-request KV caching (which reuses user-side computation) are used. The design also benefits from LLM-style systems optimizations (e.g., FlashAttention, mixed precision).
Large-scale evaluations show that OneTrans exhibits near log-linear performance gains as width/depth increase, and delivers statistically significant lifts in business metrics while maintaining production-grade latency. The authors believe this unified design offers a practical way to scale up recommender systems while reusing the system optimizations that have driven recent LLM progress.