2025_OneRecV2_TechReport

I. OneRec-V2 Technical Report [2025]

《OneRec-V2 Technical Report》

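Recent breakthroughs in generative AI have fundamentally changed recommender systems by enabling end-to-end generation. OneRec, an industrial-scale generative recommendation framework, reformulates recommendation as an autoregressive generation task, which allows direct optimization of the final objective and achieves high Model FLOPs Utilization (MFU). Although OneRec-V1 delivered significant empirical gains in real-world deployment, two key challenges still limit its scalability and performance:
- (1): Computational allocation in the encoder-decoder architecture is inefficient: 97.66% of the resources are consumed by sequence encoding (i.e., context encoding) rather than generation, which limits model scalability.
- (2): Reinforcement learning that relies solely on reward models has limitations, including inefficient sampling and reward hacking caused by proxy reward signals.
To address these challenges, we propose OneRec-V2, with the following core features:
- 1) Lazy Decoder-Only Architecture: a streamlined decoder-only design that removes the encoder bottleneck and simplifies cross-attention, reducing total computation by 94% and training resources by 90% (see the right panel of Figure 1). This efficiency allows the model to scale to 8B parameters, with the convergence loss closely following an empirical scaling law. As the model grows, the loss decreases smoothly and predictably, in line with the fitted scaling law (see the left panel of Figure 1 and Figure 6).
- 2) Preference Alignment with Real-World User Interactions: a framework driven by user feedback, consisting of:
  - (i) Duration-Aware Reward Shaping, to mitigate video duration bias;
  - and (ii) Adaptive Ratio Clipping, to stabilize policy optimization.
  The framework exploits real-world feedback to better align with user preferences and significantly improves App Stay Time.
Extensive A/B tests on Kuaishou/Kuaishou Lite validate the effectiveness of OneRec-V2: App Stay Time improves by 0.467%/0.741% while multi-objective recommendations stay balanced, with no seesaw effect. This work advances the scalability of generative recommendation and its alignment with real-world feedback, an important step toward end-to-end recommender systems.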

1.1 Introduction

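Generative AI has triggered paradigm shifts in many fields ("Gpt-4 technical report", "Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning", "Qwen3 technical report"). Traditional cascaded recommendation architectures keep evolving, but they remain constrained by fundamental bottlenecks: their inherent multi-stage design fragments computation and leads to inconsistent optimization objectives. Generative recommendation changes this paradigm by recasting recommendation as an end-to-end sequence generation problem. This unified approach allows direct optimization of the final objective, achieves high Model FLOPs Utilization (MFU), and brings recommender systems closer to the large foundation model communities.
Although OneRec-V1 ("Onerec technical report") achieved notable success in industrial deployment, there is still room to improve its scalability and performance:
- (1): Computational allocation in the encoder-decoder architecture is inefficient. OneRec-V1 uses an encoder-decoder framework: user historical interaction sequences are processed by the encoder and then consumed by the decoder through cross-attention. Although the OneRec-V1 decoder has more parameters than the encoder, the computational load is concentrated in the encoder, because the encoder must process very long user interaction sequences while the decoder input is much shorter. As shown in the Design Principles section, with a context length of 512, context encoding in OneRec-V1 consumes 97.66% of total FLOPs, while target item generation in the decoder accounts for only 2.34%. This imbalance is a scalability problem: most of the computational budget is spent on sequence encoding rather than on the generation process, even though generation is what determines recommendation decisions. Under a fixed computational budget, this imbalanced resource distribution may limit how effectively the model can scale up to larger architectures.
  The root cause: the encoder and decoder see input sequences of different lengths, and the encoder input is tens to hundreds of times longer.
- (2): Reinforcement learning that relies solely on reward models has limitations. OneRec-V1 demonstrated that reward-model-based reinforcement learning is effective for policy optimization, but the approach has two inherent challenges.
  - First, limited sampling efficiency: reward-model-based methods need extra computation for online generation and scoring, so sampling is restricted to a small fraction of users as an approximation of global behavior.
  - Second, the risk of reward hacking: the policy may learn to exploit specific patterns or biases in the reward model that do not translate into real performance gains.
  Integrating real user feedback addresses these inherent issues and better aligns the policy with user preferences, yielding better results. In addition, the large-scale deployment of OneRec offers a key opportunity: self-improvement through policy optimization within a continuous feedback loop.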
In this work we propose OneRec-V2, which addresses these fundamental limitations through a lazy decoder architecture and preference alignment with real-world user interactions. As shown in Figure 2, our core contributions are:
- 1) Lazy Decoder-Only Architecture: we propose a streamlined decoder-only architecture that removes the computational bottleneck of traditional encoder-decoder designs. By removing the encoder and simplifying cross-attention (dropping the K/V projection layers), the lazy decoder reduces computation by 94% and actual training resources by 90%, while supporting 16x more model parameters (from 0.5B to 8B) under the same computational budget. As shown in Figure 1, this architecture not only makes decoder-only transformers practical and efficient for industrial recommender systems, but also scales well: across a wide range of model sizes, the convergence loss closely follows the theoretical scaling law proposed in "Training compute-optimal large language models". This provides empirical and theoretical guidance for the future development of large generative recommendation models.
- 2) Preference Alignment with Real-World User Interactions: we introduce a comprehensive post-training framework that directly uses real-world user feedback signals to address the fundamental challenges of reward modeling in generative recommender systems:
  - (i) Duration-Aware Reward Shaping: accounts for video length variations to mitigate the inherent bias in raw watch time signals, so that reward signals reflect content quality rather than duration alone.
  - (ii) Adaptive Ratio Clipping: effectively reduces training variance during policy optimization while preserving convergence guarantees.
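  Our experiments show a significant improvement in App Stay Time. Notably, online performance improves further when OneRec's own traffic distribution patterns are incorporated, which indicates better alignment between model optimization and real-world user behavior distributions.
  "Incorporating OneRec's own traffic distribution patterns" means that, during the reinforcement learning stage of OneRec-V2, training uses not only the traditional offline samples (generated by other recommender systems) but also the samples that OneRec-V2 itself produced while serving online, together with the corresponding user feedback. This data contains:
  1. The real recommendation distribution: which items were exposed, clicked, watched, liked, and so on when OneRec served users.
  2. User feedback signals: actual user behavior on content recommended by OneRec-V2, such as playing time, likes, comments, and dislikes.
  3. Traffic patterns: how recommended content is distributed across user groups, time periods, and scenarios.
  By incorporating these patterns, the model can:
  - Align better: the optimization target moves closer to the actual user behavior distribution.
  - Avoid bias: less performance degradation caused by a mismatch between training data and the online distribution.
  - Self-iterate: the system can keep learning from its own recommendation results.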
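Extensive online A/B testing on the Kuaishou/Kuaishou Lite apps, which have 400 million daily active users, shows that OneRec-V2 significantly outperforms OneRec-V1: App Stay Time increases by 0.467% and 0.741% respectively, while multiple recommendation objectives remain balanced with no seesaw effects.
The rest of the paper is organized as follows:
- We first describe the OneRec-V2 architecture and the empirical pre-training results.
- Next, we present the post-training method.
- Then, we report a comprehensive evaluation through online A/B testing.
- Finally, we discuss current limitations and potential directions for future research.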

1.2 Lazy Decoder-Only Architecture

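This section introduces the lazy-decoder-based architecture.
- First, we describe the evolution and design rationale of the OneRec model architecture.
- Then, we present the lazy decoder-only architecture of OneRec-V2, which achieves a lower generation task loss while significantly reducing computational complexity and memory consumption.
- Finally, we report comprehensive empirical results that validate the lazy decoder-only design and explore scaling laws for generative recommender systems.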

1.2.1 Design Principles

Autoregressive models have become the dominant paradigm in modern natural language processing, powering SOTA large language models (LLMs) such as GPT ("Language models are few-shot learners", "Language models are unsupervised multitask learners") and LLaMA ("Llama: Open and efficient foundation language models", "Llama 2: Open foundation and fine-tuned chat models"). They show remarkable scalability ("Training compute-optimal large language models", "Scaling laws for neural language models"), and their success comes from a simple, elegant design: a single unified architecture that processes sequences autoregressively. Combined with large-scale pre-training ("Bert: Pre-training of deep bidirectional transformers for language understanding", "Exploring the limits of transfer learning with a unified text-to-text transformer"), transformer-based autoregressive models have become the de facto standard for generative AI applications.
To adapt these architectures to recommender systems, the first step is to construct the data documents (docs) used for autoregressive training. Traditionally, training samples in recommender systems are organized as chronologically ordered impression events. Combined with the standard Next Token Prediction objective, however, this produces redundancy, as shown in Figure 3.a. One way to avoid the redundancy is a user-centric organization in which each training sample contains the user's complete interaction history, as shown in Figure 3.b. This approach carries the risk of temporal data leakage ("A critical study on data leakage in recommender system offline evaluation") and of popularity bias. A large body of work ("An adaptive boosting technique to mitigate popularity bias in recommender system", "Fair multi-stakeholder news recommender system with hypergraph ranking", "It is different when items are older: Debiasing recommendations when selection bias and user preferences are dynamic", "A survey on popularity bias in recommender systems (2023)", "Popularity bias in dynamic recommendation") is devoted to mitigating these problems.
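The "Naive Impression Organization" in Figure 3.a is redundant because the behavior sequence of one user is split into several overlapping training samples, so the model repeatedly learns the same sequence fragments, which wastes computation and training time.
For example, in Figure 3.a the task is to predict the next item from the history:
- In sample A, the model tries to predict the item after A (possibly without a label, learning only a representation).
- In sample A -> B, the model predicts B from A, then predicts the next item from A and B.
- In sample A -> B -> C, the model again predicts B from A and C from A and B.
In the User-Centric Organization of Figure 3.b:
- Temporal data leakage: suppose User-1 is trained first, so the model has already seen the pattern B -> C. When User-2 is trained, the pattern B -> C at its first step has therefore already been seen, even though for User-1 this pattern occurred after t3, i.e., it is a pattern from the future.
  In Figure 3.c, patterns inside the sequence are not predicted, which prevents this kind of leakage across time.
- Popularity bias: popular items appear repeatedly, so the model over-learns their representations and transition patterns and neglects long-tail items.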
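To address these problems, we organize the data chronologically but compute the training loss only on the newest impressed item, as shown in Figure 3.c, where the gray items are excluded from next token prediction. Because former and newest impressed items play different roles, we chose an Encoder-Decoder architecture in the earlier OneRec-V1 ("Onerec technical report"). Table 1 gives a preliminary analysis of the computation, which falls into two distinct categories: context encoding and target decoding.
- Context Encoding: the computation that processes and transforms user context features, specifically:
  - (i): the context transformation performed in the encoder.
  - and (ii): the context projection in the decoder's cross-attention.
- Target Decoding: the computation in the decoder that processes and transforms the semantic tokens of the target item, specifically:
  - (i): self-attention, which captures dependencies among semantic tokens.
  - (ii): the feed-forward network (FFN), which applies non-linear transformations.
  - and (iii): the query and output transformations in cross-attention.
According to Table 1, for the same number of parameters the Encoder-Decoder architecture saves nearly half of the computation compared with the classic Decoder-Only architecture. Both architectures are nevertheless computationally inefficient: most of the computation is spent on tokens that do not contribute directly to the loss. With context length $N=512$ (OneRec-V1), less than 3% of total FLOPs is used for loss computation, and this share shrinks further as the context length grows. The detailed analysis is in Appendix B. To concentrate computation entirely on the semantic tokens of the target item, and thereby scale efficiently to larger models, we propose the Lazy Decoder-Only Architecture.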

1.2.2 Overall Architecture

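This section introduces our new architecture (Figure 4), which redesigns generative recommenders through two key innovations:
- First, we propose a lazy decoder-only architecture that differs both from traditional encoder-decoder models and from naive decoder-only approaches. Our design treats the context as static conditioning information that is accessed only through cross-attention, which removes redundant computation while preserving the model's ability to capture complex user-item interactions.
- Second, we introduce a highly efficient lazy cross-attention mechanism without key-value projections. Combined with Grouped Query Attention (GQA) ("Gqa: Training generalized multi-query transformer models from multi-head checkpoints"), this design significantly reduces the memory footprint and handles extensive user histories efficiently.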

a. Context Processor

To integrate heterogeneous, multi-modal user behavioral signals effectively, we design a unified module called the Context Processor, which plugs directly into the downstream attention-based decoder blocks.
Specifically, heterogeneous inputs such as the user profile and user behavior are concatenated into one unified sequence, called the context. Every item in the context is mapped to the same dimension:
$d_\text{context} = \mathcal S_\text{kv} \times L_\text{kv} \times G_\text{kv} \times d_\text{head}$
That is, every token in the context token sequence has dimension $d_\text{context}$.
Here:
- $d_\text{head}$ is the attention head dimension.
- $G_\text{kv}$ is the number of key-value head groups.
- $\mathcal S_\text{kv}$ is the key-value split coefficient, which is either 1 or 2.
  - When $\mathcal S_\text{kv} = 1$, key and value share the same context.
  - When $\mathcal S_\text{kv} = 2$, key and value use different contexts.
  The experiments use $\mathcal S_\text{kv} = 1$, because tied key-value projections reduce the model's memory footprint while keeping comparable performance.
- $L_\text{kv}$ is the number of key-value layers.
In other words, there are $L_\text{kv}$ layers, each layer has $G_\text{kv}$ heads, and each head holds a context representation of embedding size $d_\text{head}$. All of these context representations are concatenated.
- When each head holds one context representation, it is used for both key and value ($\mathcal S_\text{kv}= 1$).
- When each head holds two context representations, they are used for key and value respectively ($\mathcal S_\text{kv}= 2$).
The context representation is then converted into layer-specific key-value pairs for the attention mechanism. We split the context tensor along the feature dimension into $L_\text{kv}$ groups of key-value pairs:
$\text{Context} = [\mathbf C_{0}, \mathbf C_{1}, \dots, \mathbf C_{\mathcal S_\text{kv} \times L_\text{kv} - 1}]$
where each $\mathbf C_{i}\in \mathbb R^{G_\text{kv}\times d_\text{head}}$ for $i \in \{0, \dots, \mathcal S_\text{kv}\times L_\text{kv}-1\}$. The sequential dimension is omitted here for simplicity.
$\mathbf C_{l \times \mathcal S_\text{kv}}\in \mathbb R^{G_\text{kv}\times d_\text{head}}$ is the (key) context of layer $l$.
For each layer $l\in \{0,1,\cdots,L_\text{kv} - 1\}$, we compute normalized key-value pairs:
$\mathbf k_{l} = \text{RMSNorm}_{k, l} (\mathbf C_{l \times \mathcal S_\text{kv}})$
$\mathbf v_{l} = \begin{cases} \text{RMSNorm}_{v, l} (\mathbf C_{l \times \mathcal S_\text{kv} + 1}), & \text{if } \mathcal S_\text{kv} = 2 \text{ (separated key-value)} \\ \mathbf k_{l}, & \text{if } \mathcal S_\text{kv} = 1 \text{ (shared key-value)} \end{cases}$
The final output of the Context Processor is $\{(\mathbf k_0, \mathbf v_0), \cdots, (\mathbf k_{L_\text{kv} - 1},\mathbf v_{L_\text{kv} - 1})\}$.
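The split-and-normalize step described above can be sketched for a single context token as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `rms_norm` omits the learned per-layer gain, normalization is applied over the whole $G_\text{kv}\times d_\text{head}$ chunk, and the function names are hypothetical.

```python
import math

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain (assumption: the gain is omitted)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def split_context(token, s_kv, l_kv, g_kv, d_head):
    """Split one context token of size s_kv*l_kv*g_kv*d_head into l_kv (k, v) pairs."""
    chunk = g_kv * d_head
    assert len(token) == s_kv * l_kv * chunk
    # C_0 ... C_{s_kv*l_kv-1}, each flattened to g_kv*d_head values
    chunks = [token[i * chunk:(i + 1) * chunk] for i in range(s_kv * l_kv)]
    pairs = []
    for l in range(l_kv):
        k = rms_norm(chunks[l * s_kv])
        # s_kv == 2: separate value chunk; s_kv == 1: value is tied to the key
        v = rms_norm(chunks[l * s_kv + 1]) if s_kv == 2 else k
        pairs.append((k, v))
    return pairs
```

With `s_kv=1` the returned key and value are the same object, which mirrors the tied key-value setting; with `s_kv=2` each layer consumes two adjacent chunks.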

b. Lazy Decoder Block

Tokenizer: for each target item, we use the semantic tokenizer (the same as in OneRec-V1) to generate 3 semantic IDs that capture different facets of the item. During training, we take the first 2 IDs and prepend a beginning-of-sequence (BOS) token to form the input sequence. These token indices are then mapped through embedding tables to obtain the initial hidden representation:
$h^{(0)} = Embed ([BOS, s^{1}, s^{2}]) \in R^{3 \times d_{model}}$
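Why are the first 2 IDs provided during training? This is a form of "partial teacher forcing". During training:
- Although the first 2 IDs are given as input, the model still has to predict them, and the loss is computed on them.
- Predicting the second semantic ID uses the first semantic ID; predicting the third uses the first 2 IDs. At every step, the previous label is used to predict the next one.
  If training used only BOS, the model would have to feed its own generated semantic ID into the next step, which is what online inference does.
This differs from online inference, but it makes training efficient: the losses at all positions are computed in parallel in a single forward pass.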
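The input/target layout implied by this teacher-forcing scheme can be sketched as below. The `BOS` placeholder and the function name are hypothetical; the only facts taken from the text are that the input is `[BOS, s1, s2]` and that all three semantic IDs are predicted.

```python
BOS = "<BOS>"  # hypothetical placeholder for the beginning-of-sequence token

def build_training_pair(semantic_ids):
    """Return (inputs, targets) for one target item with 3 semantic IDs."""
    assert len(semantic_ids) == 3
    inputs = [BOS] + list(semantic_ids[:2])   # [BOS, s1, s2]
    targets = list(semantic_ids)              # position t predicts s_{t+1}
    return inputs, targets
```

Position 0 (BOS) predicts `s1`, position 1 predicts `s2`, and position 2 predicts `s3`, so one forward pass yields all three loss terms.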
Block Structure: the lazy decoder consists of $N_\text{layer}$ stacked transformer blocks. Each block contains three main components: a cross-attention module, a self-attention module, and a feed-forward module. For layer $l$, the transformation is defined as:
$\mathbf h_\text{cross}^{(l)} = \mathbf h^{(l - 1)} + \text{CrossAttn}(\text{RMSNorm}(\mathbf h^{(l - 1)}), \mathbf k_{l_\text{kv}}, \mathbf v_{l_\text{kv}})$
$\mathbf h_\text{self}^{(l)} = \mathbf h_\text{cross}^{(l)} + \text{CausalSelfAttn}(\text{RMSNorm}(\mathbf h_\text{cross}^{(l)}))$
$\mathbf h^{(l)} = \mathbf h_\text{self}^{(l)} + \text{FFN}^{(l)}(\text{RMSNorm}(\mathbf h_\text{self}^{(l)}))$
where RMSNorm is root mean square layer normalization, used for training stability.
To increase model capacity while keeping computation efficient, we adopt a hybrid architecture that replaces dense feed-forward networks with Mixture-of-Experts (MoE) modules. Following DeepSeek-V3 ("Deepseek-v3 technical report"), we use an auxiliary-loss-free load balancing strategy to keep the experts well utilized.
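Lazy Cross-Attention: KV-Sharing: to improve parameter and compute efficiency, multiple lazy decoder blocks share the same set of key-value pairs from the context processor. For block $l$, we determine the corresponding key-value index:
$l_\text{kv} = \left\lfloor \frac{l \times L_\text{kv}}{N_\text{layer}} \right\rfloor$
where $N_\text{layer}$ is the total number of lazy decoder blocks.
Note the difference between $L_\text{kv}$ and $N_\text{layer}$:
- $L_\text{kv}$ is the number of layer-specific key-value pairs.
- $N_\text{layer}$ is the number of decoder layers.
For example, let $L_\text{kv} = 2$ and $N_\text{layer}=6$. Then layers 1/2/3 share the first set of layer-specific key-value pairs, and layers 4/5/6 share the second set.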
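The index mapping can be checked with a few lines of code. This sketch assumes blocks are indexed from 0 (so "layers 1/2/3" in the example correspond to indices 0/1/2); the function name is illustrative.

```python
def kv_index(block, l_kv, n_layer):
    """Index of the shared key-value pair used by a 0-based decoder block."""
    assert 0 <= block < n_layer
    return (block * l_kv) // n_layer

# Blocks 0-2 map to pair 0 and blocks 3-5 map to pair 1.
mapping = [kv_index(l, l_kv=2, n_layer=6) for l in range(6)]
```

With `l_kv=1`, every block maps to index 0, i.e., all blocks share a single key-value pair.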
This design ensures that every $N_\text{layer}/L_\text{kv}$ consecutive blocks share the same contextual representations $(\mathbf k_{l_\text{kv}}, \mathbf v_{l_\text{kv}})$, where $\mathbf k_{l_\text{kv}}, \mathbf v_{l_\text{kv}} \in \mathbb{R}^{(N_{s}+T_\text{short}+T_\text{long}) \times G_\text{kv}\times d_\text{head}}$.
Memory can be reduced further with a unified key-value representation, $\mathbf v_{l}=\mathbf k_{l}$. This relies on the observation that tied key-value projections reduce the model's memory footprint while keeping comparable performance.
Lazy Cross-Attention: Grouped Query Attention: the query projection produces $H_{q}=d_\text{model}/d_\text{head}$ attention heads, while the key-value pairs have only $G_\text{kv}$ head groups, with $G_\text{kv} < H_{q}$. This design significantly reduces the memory footprint of the context representations and the memory access required during attention computation, so it scales efficiently to longer contexts and larger batch sizes.
In Cross-Attention, the query comes from the semantic IDs of the target item, and the key/value come from the layer-specific key-value pairs. This yields three representations, one per semantic ID position. Here the key/value have no projection, while the query has one (the Q Linear Layer in the figure).
In Causal Self-Attention, these three representations exchange information, which captures the relationships among the semantic IDs. Here key, value, and query all have projections.
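As a rough illustration of lazy cross-attention with grouped heads, the sketch below attends from already-projected queries to context keys/values that are used as-is (no K/V projection). It is a pure-Python toy under stated assumptions: the query and output projections are left out, `q` is assumed to be already split into heads, and each query head `h` reads key-value group `h // (H_q // G_kv)`.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def lazy_cross_attention(q, k, v):
    """q: [T][H_q][d] projected queries; k, v: [N][G_kv][d], used without projection."""
    h_q, g_kv, d = len(q[0]), len(k[0]), len(q[0][0])
    assert h_q % g_kv == 0
    heads_per_group = h_q // g_kv
    out = []
    for t in range(len(q)):
        row = []
        for h in range(h_q):
            g = h // heads_per_group  # query heads in one group share a k/v head
            scores = [sum(a * b for a, b in zip(q[t][h], k[n][g])) / math.sqrt(d)
                      for n in range(len(k))]
            w = softmax(scores)
            row.append([sum(w[n] * v[n][g][j] for n in range(len(k)))
                        for j in range(d)])
        out.append(row)
    return out
```

Because no causal mask is needed between the target tokens and the context, every target position can attend to the full context; causality is enforced only in the separate self-attention step.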
Output Layer: the final hidden representation from the last decoder block passes through a position-specific RMSNorm and Linear layer to produce the prediction for each semantic ID. The training objective is to maximize the likelihood of the semantic IDs of the target item, $\left[s^{1}, s^{2}, s^{3}\right]$.

1.2.3 Empirical Results

To validate the lazy decoder-only architecture, we run a comprehensive empirical evaluation along several dimensions. We systematically compare our method with classic architectures, study the impact of the key architectural innovations, and explore the scaling behavior of dense and sparse model variants. All experiments use streaming training on Kuaishou impression data from August 10 to 14, 2025, with the same sampling ratio and global batch size. Unless stated otherwise, we set $L_\text{kv}=1$, $\mathcal S_\text{kv}=1$, $d_\text{head}=d_\text{model}/N_\text{head}$, $G_\text{kv}=N_\text{head}$, and a context length of $(N_{s}+T_\text{short}+T_\text{long})\approx 512$. For online deployment, we use a 1B model with a longer long-term user behavior sequence, giving $(N_{s}+T_\text{short}+T_\text{long})\approx 3000$.
Note: $L_\text{kv} = 1$ means that all layers share a single set of layer-specific key-value pairs.

a. Architecture Comparison

We compare three architectural paradigms for generative recommendation: the encoder-decoder architecture (OneRec-V1), the naive decoder-only architecture, and our lazy decoder-only architecture. For each model, we evaluate the average generation loss across the three semantic tokens:
$\mathcal L_\text{Gen} = - \frac{1}{3} \sum_{i = 1}^{3} \log p (s^{i} \mid \text{BOS}, s^{< i}, \text{Context})$
where:
- $s^i$ is the $i$-th semantic ID of the target item, and $s^{<i}$ is the sequence formed by its first $i-1$ semantic IDs.
- BOS is the begin-of-sentence token.
- Context is the output of the context processor, which includes user static and behavioral features.
This loss differs from OneRec-V1: we use the average over the three tokens, whereas V1 uses their sum.
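A minimal sketch of this loss for one target item, assuming the three per-token probabilities of the correct semantic IDs are already available (the function name and input format are illustrative):

```python
import math

def generation_loss(token_probs):
    """Average negative log-likelihood over the 3 semantic tokens of one item."""
    assert len(token_probs) == 3
    return -sum(math.log(p) for p in token_probs) / 3
```

The summed variant used in OneRec-V1 is simply three times this value, so losses reported under the two conventions are not directly comparable.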
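Table 2 and Figure 5 show the computational requirements and convergence performance at different model scales. Despite requiring far fewer FLOPs and much less activation memory, our lazy decoder-only architecture achieves a loss comparable to the traditional approaches.
Notes:
- Depending on the specific model configurations, the parameter counts here are approximately 0.1B/0.5B/1B.
- The FLOPs and activation counts in this table are computed from the specific model configurations and are therefore more precise than the approximate estimates in Table 1.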

b. Key-Value Sharing

Our context processor controls the overall context dimensions through two parameters, $L_\text{kv}$ and $\mathcal S_\text{kv}$.
- $L_\text{kv}$ determines the number of distinct context representations across layers: every $N_\text{layer}/L_\text{kv}$ consecutive decoder blocks share the same key-value pairs.
- $\mathcal S_\text{kv}$ further controls whether keys and values share the same representation ($\mathcal S_\text{kv}=1$) or use separate projections ($\mathcal S_\text{kv}=2$).
This design reduces compute cost and activation memory while keeping generative task performance comparable. We run ablations on a 1B dense lazy decoder model ($N_\text{layer}=18$) to study the effect of these design choices.
Table 3 and Figure 11a show that aggressive key-value sharing maintains a competitive loss throughout training, which validates our efficient context processing strategy.
Note: the caption of Figure 11 is wrong: (a) is Grouped query attention and (b) is Key-value sharing.

c. Grouped Query Attention

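Grouped Query Attention (GQA) shares key-value heads across multiple query heads. In our lazy decoder architecture, this optimization reduces the activation memory and the memory access bottleneck of cross-attention, which raises training throughput with minimal impact on model quality. On a 1B-parameter dense lazy decoder model with 14 attention heads, we study the effect of the number of key-value head groups $G_\text{kv} \in\{1,2,7\}$.
The results in Table 4 and Figure 11b show that GQA with different numbers of groups performs almost identically to full attention while significantly reducing memory requirements.
Standard GQA groups the query heads, and the query heads within a group share the same key/value heads. For example, 8 query heads split into 2 groups means each group of 4 query heads shares 1 key head and 1 value head.
The paper describes GQA as grouping the key/value heads, which is equivalent to standard GQA: the query heads within a group share the same key/value heads.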

d. Model Scaling

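We run comprehensive scaling experiments on the lazy decoder-only architecture, covering both dense and sparse configurations, to understand the compute-performance trade-off at different model scales.
Dense Model Scaling: we explore the scaling behavior of dense lazy decoder models from 0.1B to 8B parameters. Table 5 lists the architectural hyperparameters and convergence performance of each model configuration.
Sparse Mixture-of-Experts: for more efficient scaling, we study a Mixture-of-Experts (MoE) variant that replaces dense feed-forward networks with sparse expert routing. Our MoE configuration uses 53 routed experts and 1 shared expert, for 4B total parameters (0.5B active per token). The model uses top-3 expert routing per token, with an MoE intermediate size of 1408. The sparse model keeps the same base architecture as the 0.5B dense model and replaces the feed-forward layers after the first 2 lazy decoder blocks with MoE layers.
This is similar in spirit to PLE: several specific experts plus one shared expert.
Why not apply MoE layers to all lazy decoder blocks?
- Training stability: making every block MoE tends to cause:
  - Routing instability: early in training the router may not have converged, which leads to unbalanced expert load.
  - Exploding/vanishing gradients: deep MoE stacks aggravate gradient propagation problems.
- Compute efficiency: although MoE activates few parameters, it still has hidden costs:
  - Routing: every token needs a score for every expert.
  - Inter-expert communication: gather/scatter operations are required.
  - Load balancing: it needs additional optimization.
Results and Analysis: Figure 7 shows the training dynamics of the different model configurations. Our experiments reveal several key insights into the scaling behavior of the lazy decoder architecture in recommender systems. Figure 12 additionally shows how the loss of models of different sizes decreases as the training budget grows.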
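Our empirical results are reasonably consistent with the theoretical scaling law. The general Chinchilla scaling law ("Training compute-optimal large language models") is:
$\hat{L} (N, D) \triangleq E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$
where $N$ is the number of model parameters, $D$ is the number of training tokens, and $E$ is a constant.
In our setting $D$ is fixed, so the term $\frac{B}{D^{\beta}}$ can be absorbed into $E$. For fixed data, the scaling law therefore simplifies to:
$\hat{L} (N) \triangleq E^\prime + \frac{A}{N^{\alpha}}$
where $E^\prime = E + \frac{B}{D^\beta}$. The fitted parameters are $E=3.13$, $A=3660$, $\alpha=0.489$, as shown in Figure 6.
Question: since the fit is for fixed data, should this read $E^\prime = 3.13$?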
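The fitted curve can be evaluated directly. The sketch below plugs the reported constants into the fixed-data form, treating 3.13 as the constant term; whether that constant is $E$ or $E^\prime$ is exactly the open question above, and the function name is illustrative.

```python
def predicted_loss(n_params, e=3.13, a=3660.0, alpha=0.489):
    """Fixed-data scaling law: constant term plus A / N^alpha."""
    return e + a / n_params ** alpha

# Predicted convergence loss for a few dense model sizes (parameters).
predictions = {n: predicted_loss(n) for n in (0.5e9, 2e9, 8e9)}
```

Under these constants the fit gives roughly 3.33 at 0.5B and roughly 3.23 at 2B parameters, which is in line with the reported MoE result (3.22, i.e., 0.11 below the 0.5B dense model and slightly better than the 2B dense model).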
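The MoE variant with 4B total parameters (0.5B activated) reaches a convergence loss of 3.22, better than the 2B dense model, while keeping compute requirements comparable to the 0.5B dense baseline. Compared with the 0.5B dense model, the MoE variant lowers the loss by 0.11, which demonstrates the effectiveness of sparse architectures for recommendation tasks.
These results show that:
- Our lazy decoder architecture scales effectively.
- The MoE variant offers a particularly attractive trade-off for industrial recommender systems, where compute efficiency directly affects serving cost and latency.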

1.3 Preference Alignment with Real-World User Interactions

This section introduces the post-training stage of OneRec-V2. The Supervised Fine-Tuning (SFT) stage is the same as in OneRec-V1: it uses streaming exposure data for online training with the $\mathcal L_\text{Gen}$ loss, consistent with the loss used during pretraining. Its main purpose is to capture real-time interest changes of users while keeping the model from drifting too far from the pretrained model.
- In OneRec-V1, the Reinforcement Learning (RL) stage is based only on a reward model.
- In OneRec-V2, we introduce reinforcement learning that uses user feedback signals as rewards.

1.3.1 Reinforcement Learning with User Feedback Signals

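Defining rewards from user feedback avoids reward hacking and requires no extra model computation. It still faces challenges, however, such as how to combine multiple objectives and the sparsity of positive labels. In short-video recommendation, the playing time of each video is the densest feedback signal and is closely related to the most important online metrics (such as App Stay Time and Lifetime over 7 days, LT7). We therefore design a simple but effective reward based on playing time.
The design of the reward signal is critical: it replaces the reward model and must satisfy the condition that a higher reward score corresponds to better online metrics.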

a. Duration-Aware Reward Shaping

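Although video playing time is a useful indicator of user satisfaction, it is inherently biased by the duration of the video. To address this bias, we propose the Duration-Aware Reward Shaping mechanism shown in Figure 8. It normalizes playing time by comparing it with each user's historical videos of comparable duration.
Note: each historical video has two times, the length of the video itself (duration) and how long the user watched it (playtime). The bucketing here is done by duration.
Because video duration follows a long-tailed distribution, we use a logarithmic strategy to partition the historical videos into buckets. This groups durations into exponentially widening intervals and yields more balanced and meaningful peer groups. We define a mapping $\mathcal F(d)$ that maps a duration $d$ to a discrete bucket index $b \in B$. Formally, the bucketing function is:
$\mathcal F (d) = \lfloor \log_{\beta} (d + \epsilon) \rfloor$
where:
- $\beta$ is a configurable logarithm base that controls the bucket granularity.
- $\epsilon$ is a small constant (e.g., $10^{-6}$) that ensures numerical stability for extremely short durations.
$\mathcal F(\cdot)$ is the same for the whole dataset and is determined only by $\beta$ and $\epsilon$.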
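Let $H_{u}= \{(d_{k}, p_{k})\}_{k=1}^N$ be the historical interaction sequence of user $u$, where $d_{k}$ is the video duration and $p_{k}$ is the observed playing time. For each duration bucket $b$, we define the empirical distribution of playing times as:
$P_{u, b} = \{p_{j} \mid (d_{j}, p_{j}) \in H_{u}, \mathcal F (d_{j}) = b\}$
That is, the distribution of user $u$'s playing times within duration bucket $b$.
For a target video $i$ with duration $d_{i}$ and playing time $p_{i}$, we first determine its bucket $b = \mathcal F(d_{i})$. The duration-normalized engagement score is then the empirical percentile rank of $p_{i}$ within $P_{u,b}$:
$q_{i} = \frac{| \{p_{j} \in P_{u, b} \mid p_{j} \leq p_{i}\} |}{| P_{u, b} |}$
The score $q_i$ answers the question: among the videos user $u$ has watched historically with a comparable duration, how high does the playing time of the target video rank (1 means it is the longest)?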
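The bucketing and percentile steps can be sketched as follows. The base `beta=2.0` is a hypothetical choice (the text only says it is configurable), and returning `None` for an empty bucket is an assumption of this sketch, since the text does not say how that case is handled.

```python
import math

def bucket(duration, beta=2.0, eps=1e-6):
    """Logarithmic duration bucket: floor(log_beta(duration + eps))."""
    return math.floor(math.log(duration + eps, beta))

def engagement_score(history, d_i, p_i, beta=2.0, eps=1e-6):
    """Percentile rank of p_i among the user's playtimes in the same duration bucket.

    history: list of (duration, playtime) pairs for one user.
    """
    b = bucket(d_i, beta, eps)
    peers = [p for d, p in history if bucket(d, beta, eps) == b]
    if not peers:
        return None  # assumption: no peers means no score
    return sum(1 for p in peers if p <= p_i) / len(peers)
```

For example, with a history of three 10 to 15 second videos watched for 5, 12, and 3 seconds, an 11-second target watched for 6 seconds beats two of its three peers.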
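We define the advantages from this score. Within each batch, we sort the $q_{i}$ in descending order and take the value at the 25% position (the top-quartile boundary) as the threshold $\tau_{B}$; samples with $q_i > \tau_B$ are treated as positives with $A_i = +1$. Samples with explicit negative feedback such as "dislike" ($\text{neg}_{i}=1$) are treated as negatives with $A_{i}=-1$. All remaining samples get $A_{i}=0$. Note that we assign the advantage values directly, without normalization, because our definitions of positive and negative samples are already strict; further normalization could make the optimization inconsistent and hurt performance. Formally:
$A_{i} = \begin{cases} + 1, & q_{i} > \tau_{B} \text{ and } \text{neg}_{i} = 0 \\ - 1, & \text{neg}_{i} = 1 \\ 0, & \text{otherwise} \end{cases}$
This strategy selects high-quality positive samples and incorporates direct negative signals, which yields more accurate user preference signals.
$\tau_B$ is batch-specific and differs from batch to batch. Given a batch, the samples in the top 25% by $q_i$ are used as positives.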
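A compact sketch of the advantage assignment for one batch is shown below. How the threshold is read off the sorted scores is an assumption here: the sketch uses the score at index `int(n * 0.25)` of the descending order, so that with distinct scores exactly the top quarter satisfies the strict inequality.

```python
def assign_advantages(scores, negs, top_frac=0.25):
    """Map per-sample scores q_i and dislike flags neg_i to advantages in {+1, 0, -1}."""
    assert len(scores) == len(negs) and scores
    ranked = sorted(scores, reverse=True)
    idx = min(int(len(ranked) * top_frac), len(ranked) - 1)
    tau = ranked[idx]  # batch-specific threshold (assumed indexing)
    advantages = []
    for q, neg in zip(scores, negs):
        if neg == 1:
            advantages.append(-1)   # explicit negative feedback wins
        elif q > tau:
            advantages.append(+1)   # top-quartile engagement
        else:
            advantages.append(0)
    return advantages
```

A sample that is both in the top quartile and disliked receives $-1$, matching the case order in the formula.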

b. Reinforcement Learning

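Gradient-Bounded Policy Optimization (GBPO): the effectiveness and stability of reinforcement learning has been a major research focus in the large language model (LLM) community in recent years. A key challenge is to strengthen exploration for better performance while keeping gradients stable. This section introduces our newly proposed reinforcement learning method, GBPO:
$\mathcal J_\text{GBPO} (\theta) = - \mathbb E_{u \sim P (U), \{o_{i}\}_{i = 1}^{G} \sim \pi_{\theta_\text{old}}} \left[\frac{1}{G} \sum_{i = 1}^{G} \frac{\pi_{\theta} (o_{i} \mid u)}{\pi^\prime_{\theta_\text{old}} (o_{i} \mid u)} \times A_{i}\right]$
$\pi^\prime_{\theta_\text{old}} (o_{i} \mid u) = \begin{cases} \max (\pi_{\theta_\text{old}}, \text{sg} (\pi_{\theta})), & A_{i} \geq 0 \\ \max (\pi_{\theta_\text{old}}, 1 - \text{sg} (\pi_{\theta})), & A_{i} < 0 \end{cases}$
where $\text{sg}(\cdot)$ denotes stop-gradient.
As the formula shows, GBPO removes the clipping operation on the ratio and instead introduces a dynamic bound on $\pi_{\theta_\text{old}}$. Overall, GBPO has two main advantages:
- Full Sample Utilization: the gradients of all samples are kept, which encourages more diverse exploration.
- Bounded Gradient Stabilization: the RL gradient is bounded by the gradient of the Binary Cross-Entropy (BCE) loss, which makes RL training more stable.
Traditional policy gradient methods such as PPO use clipping to discard "abnormal" samples (policy ratio too large or too small), which has the following problems:
- Wasted data: many samples are discarded.
- Suppressed exploration: policy diversity is limited.
- Insufficient stability: gradients can explode, especially on negative samples.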
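The goal of GBPO is to use all samples while keeping gradients stable. GBPO does not clip the ratio $\frac{\pi_\theta}{\pi_{\theta_\text{old}}}$; instead it replaces the denominator with the dynamically bounded $\pi^\prime_{\theta_\text{old}}$.
- When $A_i\gt 0$: $\pi^\prime_{\theta_\text{old}}=\max(\pi_{\theta_\text{old}}, \pi_\theta)$, where the $\pi_\theta$ in the denominator does not back-propagate gradients.
  When the policy has already increased the probability ($\pi_\theta\gt \pi_{\theta_\text{old}}$):
  - PPO: $\frac{\pi_\theta}{\pi_{\theta_\text{old}}}\gt 1$, so the sample may be clipped.
  - GBPO: $\pi^\prime_{\theta_\text{old}}=\pi_\theta$, so $\frac{\pi_\theta}{\pi^\prime_{\theta_\text{old}}}=1$, and the $\pi_\theta$ in the denominator does not back-propagate gradients.
  So whenever $\pi_\theta\gt \pi_{\theta_\text{old}}$, GBPO holds $\frac{\pi_\theta}{\pi^\prime_{\theta_\text{old}}}$ at 1 and the sample keeps contributing a gradient instead of being clipped away.
- When $A_i\lt 0$: $\pi^\prime_{\theta_\text{old}}=\max(\pi_{\theta_\text{old}}, 1-\pi_\theta)$, where the $\pi_\theta$ in the denominator also does not back-propagate gradients.
  When the policy has already decreased the probability ($\pi_\theta\lt \pi_{\theta_\text{old}}$):
  - PPO: $\frac{\pi_\theta}{\pi_{\theta_\text{old}}}\lt 1$, so the sample may be clipped.
  - GBPO: $\pi^\prime_{\theta_\text{old}}=\max(\pi_{\theta_\text{old}},1-\pi_\theta)$. As $\pi_\theta$ becomes small, $1-\pi_\theta$ approaches 1, so the denominator of $\frac{\pi_\theta}{\pi^\prime_{\theta_\text{old}}}$ is bounded from below, which prevents an overly large ratio from causing gradient explosion.
  GBPO can therefore keep pushing $\pi_\theta$ toward zero in a stable way.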
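The per-sample behavior described in these cases can be reproduced numerically. The sketch below works with scalar probabilities and returns both the loss term and its derivative with respect to $\pi_\theta$, treating the denominator as a constant (which is what the stop-gradient implies). Function names are illustrative, and this is not the paper's training code.

```python
def gbpo_denominator(pi_theta, pi_old, advantage):
    """Dynamically bounded pi'_old; the pi_theta inside is treated as a constant."""
    if advantage >= 0:
        return max(pi_old, pi_theta)
    return max(pi_old, 1.0 - pi_theta)

def gbpo_term(pi_theta, pi_old, advantage):
    """Return (loss term, d loss / d pi_theta) for one sample of the GBPO objective."""
    denom = gbpo_denominator(pi_theta, pi_old, advantage)
    loss = -advantage * pi_theta / denom
    grad = -advantage / denom  # the denominator carries no gradient
    return loss, grad
```

For a negative sample with `pi_theta = pi_old = 0.01`, an unbounded ratio would scale the gradient by `1 / 0.01 = 100`, while the bounded denominator `max(0.01, 0.99)` keeps the scale near 1.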
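Existing Clipping-based Work: before describing GBPO in detail, we briefly review existing reinforcement learning methods for large language models (LLMs).
- GRPO/PPO ("Proximal policy optimization algorithms", "Deepseek-math: Pushing the limits of mathematical reasoning in open language models") use clipping to discard samples whose policy ratios are too large or too small, which keeps training from being too aggressive.
- DAPO ("Dapo: An open-source llm reinforcement learning system at scale") relaxes the sample restrictions with a higher clip threshold, in particular admitting more low-probability or high-entropy tokens, which improves reinforcement learning performance and increases diversity.
These studies show that relaxing clipping constraints to admit more samples encourages more diverse exploration and improves performance.
However, these methods do not fully account for gradient stability. For negative samples in particular, a policy ratio with no upper bound can easily cause gradient explosion and collapse model performance.
- Dual-clip ("Mastering complex control in moba games with deep reinforcement learning") truncates the policy ratio of negative samples with an upper bound. This improves stability but discards too many negative samples and slows convergence.
- ECPO ("Onerec technical report") bounds $\pi_\text{old}$ from below so that the ratio $\pi_\theta / \pi_\text{old}$ is clipped early, which alleviates gradient explosion on negative samples. This strategy keeps a larger share of the training samples while improving optimization stability.
- Similarly, CISPO ("Minimax-m1: Scaling test-time compute efficiently with lightning attention") and GPPO ("Klear-reasoner: Advancing reasoning capability via gradient-preserving clipping policy optimization") use related techniques to keep the ratio within a reasonable range while preserving the gradient signals of more samples.
In OneRec V1, we used ECPO (Early Clipped GRPO), formally defined as:
$\mathcal J_\text{ECPO} (\theta) = - \mathbb E_{u \sim P (U), \{o_{i}\}_{i = 1}^{G} \sim \pi_{\theta_\text{old}}} \left[\frac{1}{G} \sum_{i = 1}^{G} \min \left(\frac{\pi_{\theta} (o_{i} \mid u)}{\pi^\prime_{\theta_\text{old}} (o_{i} \mid u)} A_{i}, \text{clip} \left(\frac{\pi_{\theta} (o_{i} \mid u)}{\pi^\prime_{\theta_\text{old}} (o_{i} \mid u)}, 1 - \epsilon, 1 + \epsilon\right) A_{i}\right)\right]$
$\pi^\prime_{\theta_\text{old}} (o_{i} \mid u) = \max \left(\frac{\text{sg} (\pi_{\theta} (o_{i} \mid u))}{1 + \epsilon + \delta}, \pi_{\theta_\text{old}} (o_{i} \mid u)\right), \quad \delta > 0$
Note: in the OneRec V1 paper, $\mathcal J_\text{ECPO}(\theta)$ has no leading minus sign.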
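Gradient Analysis: the exposure samples include samples generated by OneRec and samples generated by the traditional pipeline.
- For exposure samples from OneRec, the generation probability recorded at sampling time is used as $\pi_\text{old}$.
- For exposure samples generated by the traditional pipeline, the pipeline provides no generation probability, so $\pi_\text{old}$ is simplified to OneRec's current generation probability: $\pi_\text{old}=\text{sg}(\pi_\theta)$.
For exposure samples generated by the traditional pipeline, the policy ratio is therefore always 1. In traditional reinforcement learning methods, samples with ratio 1 are considered stable and are used for training without truncation. In practice, however, such samples can still cause gradient explosion, driven by negative samples, as shown in Figure 9.
For a specific token $i$ with ratio 1, we have:
$\mathcal J_\text{ECPO}^{i} (\theta) = - A_{i} \times \frac{\pi_{\theta}}{\text{sg} (\pi_{\theta})}$
$\frac{\partial \mathcal J_\text{ECPO}^{i} (\theta)}{\partial \theta} = - A_{i} \times \frac{1}{\pi_{\theta}} \times \frac{\partial \pi_{\theta}}{\partial \theta}$
In policy gradient methods, the gradient of the objective can be written as:
$\nabla_{\theta} J (\theta) = \mathbb E [A (s, a) \times \nabla_{\theta} \log \pi_{\theta} (a \mid s)]$
where $A(s,a)$ is the advantage function and $\pi_\theta(a\mid s)$ is the policy. By the chain rule:
$\nabla_{\theta} \log \pi_{\theta} (a \mid s) = \frac{1}{\pi_{\theta} (a \mid s)} \times \nabla_{\theta} \pi_{\theta} (a \mid s)$
So the smaller the current token probability $\pi_\theta$, the larger the gradient.
- For positive samples, a smaller $\pi_\theta$ means more room to boost, so a large gradient is reasonable.
- For negative samples, a smaller $\pi_\theta$ means less room to suppress; if the gradient is too large, the model easily overfits or even collapses.
This shows that traditional clipping methods cannot fully solve the problem of unstable RL gradients, because they cannot prevent gradient explosion when ratio = 1.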
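The BCE loss also penalizes negative samples, but its gradient is far more stable than that of the RL loss:
$\mathcal L_\text{BCE} (y, p_{\theta}) = - [y \times \log (p_{\theta}) + (1 - y) \times \log (1 - p_{\theta})]$
$\frac{\partial \mathcal L_\text{BCE}}{\partial \theta} = \begin{cases} - \frac{1}{p_{\theta}} \times \frac{\partial p_{\theta}}{\partial \theta}, & y = 1 \\ \frac{1}{1 - p_{\theta}} \times \frac{\partial p_{\theta}}{\partial \theta}, & y = 0 \end{cases}$
For negative samples, the smaller the current model probability, the smaller the gradient that suppresses them, which keeps the model stable. Based on this observation, we propose GBPO, which bounds the RL gradients with the more stable gradient of the BCE loss. Figure 10 illustrates these differences.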
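To make the contrast concrete, the sketch below computes the factor that multiplies $\partial p_\theta / \partial \theta$ for a negative sample under the two losses, using only the formulas above: $1/p_\theta$ for the ratio-1 RL term and $1/(1-p_\theta)$ for BCE with $y=0$. The function name is illustrative.

```python
def negative_sample_gradient_scale(p):
    """Return (RL scale, BCE scale) of the gradient on a negative sample with probability p."""
    assert 0.0 < p < 1.0
    rl_scale = 1.0 / p            # grows without bound as p -> 0
    bce_scale = 1.0 / (1.0 - p)   # approaches 1 as p -> 0
    return rl_scale, bce_scale

# Scales for progressively smaller probabilities.
scales = [negative_sample_gradient_scale(p) for p in (0.5, 0.1, 0.01)]
```

At `p = 0.5` both scales equal 2; as `p` shrinks, the RL scale grows toward infinity while the BCE scale settles near 1.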

c. Experiment

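Experiment Settings: this section validates the user feedback signals defined above. For fast validation, all experiments in this section use a 0.5B model with context length = 512. The baseline is OneRec-V1. In the OneRec-V1 experimental setup, the online traffic assigned to the experimental group was only a small fraction of total traffic, so the training samples came almost entirely from the traditional recommendation pipeline. In the LLM field, prior work has shown that training on self-generated samples can lead to self-improvement ("Skywork open reasoner 1 technical report"). Now that OneRec serves 25% of total traffic, we have enough data to test this hypothesis in our setting. We therefore compare two experimental groups:
- w/o OneRec Samples: reinforcement learning uses only samples generated by the traditional recommendation pipeline, the same sample source as OneRec-V1.
- w/ OneRec Samples: samples generated by the OneRec pipeline are included, among them samples generated by the experimental group of the current model. In other words, this setting introduces on-policy reinforcement learning.
  The second group still contains samples generated by the traditional recommendation pipeline; it is a mixture of samples from both sources.
As described above, positive samples for reinforcement learning are the videos in the top 25% by duration-aware reward score, and negative samples are determined by explicit negative feedback (such as "dislike" actions). Note that the total number of training samples is kept roughly the same in both groups. The reinforcement learning loss is GBPO ($\mathcal J_\text{GBPO}(\theta)$). All results are shown in Table 6.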
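Result Analysis: Table 6 supports the following observations:
- When only traditional pipeline samples are used (i.e., the same sample source as OneRec-V1), user-feedback-based reinforcement learning significantly improves duration-related metrics (such as App Stay Time and Watch Time) at the cost of some other metrics (such as Video View). This shows that our duration-aware reward is indeed closely tied to App Stay Time.
- After samples from the OneRec pipeline are included, almost all metrics improve significantly; in particular, Video View turns from negative to positive. This shows that user-feedback-based reinforcement learning enables self-iterative optimization and makes full use of user feedback signals to improve user experience.
In essence, this is a question of off-policy versus on-policy learning.
- w/o OneRec Samples: reinforcement learning is off-policy. The training data always comes from the traditional recommendation pipeline (a fixed distribution that is independent of the current policy). The problem: the training data distribution does not match the current policy distribution, which causes distribution shift and makes policy optimization inefficient.
- w/ OneRec Samples: reinforcement learning mixes off-policy and on-policy data. The training data includes online interactions of the current policy.
Why not rely on OneRec samples alone? Although off-policy learning is less efficient, the traditional samples are still valuable:
- Basic learning during cold start: in the initial deployment stage, when OneRec has few samples of its own, traditional samples provide basic user preference signals and avoid purely random exploration.
- Exploration diversity: the traditional pipeline may recommend from regions that OneRec under-explores, which provides extra information.
- Safety: relying entirely on online samples may be too risky, and traditional samples add some stability.
Note: these samples have the form (sample features, real reward); some come from the traditional recommendation pipeline (off-policy) and some from OneRec generation (on-policy).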

1.3.2 User Feedback Signals versus Reward Model

a. Limitation of Reward Model

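In this section we compare the reward-model-based reinforcement learning of OneRec-V1 with reinforcement learning driven by user feedback signals. Although OneRec-V1 demonstrated the effectiveness of reinforcement learning through extensive experiments, its performance is limited by a low sampling probability: because of resource constraints, on-policy roll-outs can be performed for only a small fraction of users (1%). In addition, the reward model is susceptible to reward hacking. User feedback signals reflect real user preferences directly and therefore reduce the risk of reward hacking. Before OneRec was fully deployed, however, large-scale real user feedback on generated samples was not available. With full deployment, these signals can now be used more effectively for precise self-iterative optimization. The previous section demonstrated the effectiveness of the proposed duration-aware feedback signals. We now compare the performance of user feedback against the reward model.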

b. Experiment

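Experiment Settings: we compare three experimental groups, called Reward Model, User Feedback Signals, and Hybrid. The model settings are the same as in the experiments of Reinforcement Learning with User Feedback Signals. The evaluation metrics are the same as before and include duration-based metrics and interaction-based metrics. App Stay Time is the most important metric, and the others serve as reference values for user experience. The results in Table 7 are the relative performance of each group with respect to OneRec-V1.
- Reward Model: uses reward-model-based reinforcement learning. The main difference from OneRec-V1 is the architecture of the pretrained generative model: OneRec-V1 uses the Encoder-Decoder architecture, while OneRec-V2 uses the proposed Lazy Decoder architecture.
- User Feedback Signals: uses user-feedback-based reinforcement learning with self-generated samples, identical to the "w/ OneRec Samples" setting in the previous section.
- Hybrid: uses both the reward model and user feedback signals. The two types of samples are independent: the former are obtained through the model's own rollout sampling, and the latter are samples previously exposed to users.
  How Hybrid is trained: during training, a policy gradient algorithm such as GBPO processes a mixed batch containing both types of samples without distinguishing between them.
  - Forward pass: for each sample, compute the probability that the current policy $\pi_\theta$ assigns to the recommended action.
  - Reward/advantage computation:
    - Reward Model samples: $A_i$ is computed from the reward model's score.
    - User Feedback samples: $A_i$ is computed with the duration-aware percentile method and related rules.
  - Gradient computation and update: a single loss function (such as GBPO in the paper) is used, the gradients of all samples are summed or averaged, and one parameter update is performed.
  This is a form of "multi-task" learning: through one shared set of parameters, the model learns two related tasks at once, maximizing the reward model's score and maximizing the reward derived from real feedback.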
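Results Analysis: Table 7 supports the following observations.
- In the reward-model setting, OneRec-V2 significantly outperforms OneRec-V1, which further confirms the advantage of the Lazy Decoder architecture.
- Whether based on the reward model or on user feedback, reinforcement learning improves both duration metrics and interaction metrics. The reward model, however, tends to favor interaction metrics, while real user feedback tends to favor App Stay Time. The reason is that the reward model's output is a fusion of multiple recommendation objectives, whereas the rewards we define from user feedback are computed mainly from video playing time. This also shows that different reward definitions lead to different model preferences, consistent with the conclusion in OneRec-V1.
- When the two are combined (Hybrid), the gains on duration and interaction metrics are smaller than those of each individual strategy, but the performance loss is minimal and the balance between App Stay Time and interaction metrics improves. This is because the gains from the two strategies partly overlap. Although combining them does not produce a perfect additive effect, the two strategies complement each other. This also highlights the importance of diversified reward signals. In future work we will further study the diversity and accuracy of reward signals.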

1.4 Online A/B Test

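We deployed OneRec-V2 in Kuaishou's two main short-video scenarios: the main Kuaishou feed and the Kuaishou Lite feed. These two scenarios carry the highest traffic on the platform and serve 400 million daily active users. The evaluation used an experimental group with 5% of traffic over a one-week observation period. The model is the 1B-parameter version with a context length of 3000 and beam size = 512. Online inference runs on L20 GPUs and achieves a latency of 36 ms and a Model FLOPs Utilization (MFU) of 62%. To reduce system complexity, this version incorporates only User Feedback Signals. Our primary evaluation metrics are App Stay Time, which measures total user engagement duration, and LT7, which captures 7-day user lifetime retention.
As shown in Table 8, OneRec-V2 achieves significant improvements on both platforms. In addition, OneRec-V2 shows significant gains on all user interaction metrics, including likes, follows, comments, and other engagement behaviors. This indicates that it can steer multi-task recommendation systems toward a more balanced state while effectively mitigating the seesaw effect between competing objectives.
To further validate our findings, we ran an additional caching-disabled experiment: in a separate 1% experimental group, all traffic requests OneRec-V2 (detailed results are in Appendix D). This comprehensive evaluation confirms significant improvements in user engagement metrics: interaction metrics such as likes, follows, comments, and forwards increase by 9.6% to 29.2% across the two platforms. Although these results show that OneRec-V2 is strong at driving user engagement, they also reveal important ecosystem-level concerns, including a marked drop in cold-start video views (44.7% on Kuaishou and 36.7% on Kuaishou Lite) and an increase in cluster density.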

1.5 Conclusion, Limitations, and Future Directions

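In this paper, we build on OneRec-V1 to propose OneRec-V2 and examine the design of its scaling and reward systems in depth.
- On scaling, we found that although the OneRec-V1 model allocates a large number of parameters to the decoder through mixture-of-experts (MoE), the context encoding process consumes most of the computation because of the difference in sequence length, which hinders further scalability and performance gains. We therefore rethought the model architecture and proposed the Lazy Decoder-Only architecture, which shifts computation to the decoding stage and allows the model to scale further (currently up to 8B parameters).
- In addition, we developed a method that effectively uses real user feedback to align with user preferences. Unlike V1, which relied solely on a reward model for alignment, we incorporate real user feedback signals and use an innovative design to link short-term video watching time with long-term satisfaction. Training with GBPO is also highly stable.
Rigorous A/B experiments demonstrate the effectiveness of the framework. However, our system still has room for improvement, for example:
- Scaling: as model parameters scale from 0.1B to 8B, we observe a continuous decrease in loss that closely matches the empirical scaling law proposed in 《Training compute-optimal large language models》, as shown in Figure 6. This validates the chosen architecture and suggests that further scaling and architectural innovation may deliver better performance.
- Reward System: we added real user feedback to the reward system, which proved effective. However, our current solution relies on rules to connect short-term and long-term returns rather than letting the model optimize long-term value directly. We will work in this direction so that the model can achieve self-reinforcement toward long-term value.
Beyond the gains in video recommendation on the Kuaishou platform, OneRec-V2 has been deployed in multiple business scenarios with significant returns (for example, 《Oneloc: Geo-aware generative recommender systems for local life service》). We believe that the system can be further refined through iteration, validation, and optimization by more researchers and engineers.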

二、Appendix

2.1 Computational Complexity of Different Architecture (Appendix B)

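Preliminary: In a real recommendation system, multiple items are exposed at the same time. A key optimization is common context compression: when $k$ item recommendations are exposed to the same user, the shared context (user profile, historical behaviors) needs to be processed only once and can be reused across all target items. This reduces the context length per item from $N$ to $N/k$ tokens; on KuaiShou, $k=5$.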
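As a rough illustration of common context compression, the sketch below counts the tokens processed per exposed item with and without sharing the context. The function name and the example value $N=512$ are chosen here for illustration; the target length of 3 tokens per item is taken from the target-decoding terms used later in this appendix.

```python
def tokens_per_item(context_len: int, k: int, target_len: int = 3,
                    share_context: bool = True) -> float:
    """Average number of tokens processed per exposed item."""
    if share_context:
        # The context is processed once and reused by all k target items.
        total = context_len + k * target_len
    else:
        # Every item re-processes the full context.
        total = k * (context_len + target_len)
    return total / k

N, K = 512, 5
shared = tokens_per_item(N, K)
unshared = tokens_per_item(N, K, share_context=False)
print(f"shared: {shared:.1f} tokens/item, unshared: {unshared:.1f} tokens/item")
```

With sharing, the per-item cost is $N/k + 3$ tokens, which is the sequence length that appears in the per-architecture FLOPs estimates.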
The main computational components of a transformer block (《Attention is all you need》) are: (1) feed-forward networks (FFNs), (2) attention projections ($\mathbf W_q, \mathbf W_k, \mathbf W_v, \mathbf W_o$), and (3) attention score computation. Their computational complexities are:
$\begin{matrix} \text{FFN}: & O(L \times d_\text{model} \times d_\text{ff}) \simeq O(L \times 4 d_\text{model}^{2}) \\ \text{Attention Projections}: & O(L \times 4 d_\text{model}^{2}) \\ \text{Attention Scores}: & O(L^{2} \times d_\text{model}) \end{matrix}$
Here $L$ is the sequence length in tokens and $d_\text{model}$ is the model's hidden dimension. Notably, the cost of both the FFNs and the attention projections can be written as $O(L \times D)$, where $D$ is the number of parameters in the corresponding module.
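Encoder-Decoder Architecture: We analyze the computational requirements of an encoder-decoder model whose encoder and decoder each have 0.5B parameters. When training with a compressed context length of $N/5$, the floating-point operations (FLOPs) break down as follows: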
$\begin{matrix} \text{Context Transformation (Encoder)}: & 6 \times 0.5\text{B} \times \frac{N}{5} = 0.6N \text{ GFLOPs} \\ \text{Context Projection (Cross-Attention)}: & 6 \times 0.05\text{B} \times \frac{N}{5} = 0.06N \text{ GFLOPs} \\ \text{Context Decoding}: & 0.6N + 0.06N = 0.66N \text{ GFLOPs} \\ \text{Target Decoding}: & 6 \times 0.45\text{B} \times 3 = 8.1 \text{ GFLOPs} \\ \text{Total Computation}: & 0.66N + 8.1 \text{ GFLOPs} \end{matrix}$
The coefficient 6 accounts for both the multiply-accumulate operation (a factor of 2) and the forward-backward pass ratio (a factor of 3). The cross-attention context projection ($\mathbf W_k, \mathbf W_v$) resides in the decoder and accounts for about 10% of decoder parameters (0.05B).
Attention scores are ignored here. With 9 encoder and 9 decoder layers and $d_\text{model}=1792$, the attention score computations are:
- Encoder: $6 \times 9 \times \left(\frac{N}{5}\right)^{2} \times 1792 \approx 3.8 N^{2} \text{ KFLOPs}$.
- Decoder: $6 \times 9 \times 3 \times N \times 1792 \approx 290N \text{ KFLOPs}$.
When $N=512$, these values are more than an order of magnitude smaller than those of the FFNs and attention projections.
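The encoder-decoder estimate can be checked numerically with a short script. This is a sketch that simply re-evaluates the formulas in this section; the function names and the dictionary layout are illustrative choices, and the default arguments are the values quoted in the text (0.5B encoder, 0.5B decoder, 10% of decoder parameters in the cross-attention context projection, 3 target tokens, 9 layers, $d_\text{model}=1792$).

```python
GIGA = 1e9

def linear_gflops(n: int, k: int = 5, enc_params: float = 0.5e9,
                  dec_params: float = 0.5e9, cross_kv_frac: float = 0.10,
                  target_len: int = 3) -> dict:
    """Training FLOPs of FFNs and attention projections per item, in GFLOPs."""
    ctx = n / k                                       # compressed context length
    encoder = 6 * enc_params * ctx                    # context transformation
    cross_kv = 6 * cross_kv_frac * dec_params * ctx   # W_k, W_v in cross-attention
    target = 6 * (1 - cross_kv_frac) * dec_params * target_len
    return {"context": (encoder + cross_kv) / GIGA, "target": target / GIGA}

def attention_score_gflops(n: int, k: int = 5, layers: int = 9,
                           d_model: int = 1792, target_len: int = 3) -> dict:
    """Attention-score FLOPs, following the two formulas in the text."""
    encoder = 6 * layers * (n / k) ** 2 * d_model       # about 3.87e3 * N^2 FLOPs
    decoder = 6 * layers * target_len * n * d_model     # about 2.90e5 * N FLOPs
    return {"encoder": encoder / GIGA, "decoder": decoder / GIGA}

N = 512
lin = linear_gflops(N)
att = attention_score_gflops(N)
linear_total = lin["context"] + lin["target"]
attention_total = att["encoder"] + att["decoder"]
context_share = lin["context"] / linear_total

print(f"context: {lin['context']:.2f} GFLOPs, target: {lin['target']:.2f} GFLOPs")
print(f"attention scores: {attention_total:.2f} GFLOPs")
print(f"context share of linear compute: {context_share:.2%}")
```

At $N=512$ the context terms account for about 97.66% of the FFN and projection compute, which matches the share quoted for context encoding in the abstract, while the attention-score terms add only about 1.2 GFLOPs on top of roughly 346 GFLOPs.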
Naive Decoder-Only Architecture: For a decoder-only model with 1B parameters, causal attention masking requires processing all $N/5+3$ tokens:
$\begin{matrix} \text{Context Decoding}: & 6 \times 1\text{B} \times \frac{N}{5} = 1.2N \text{ GFLOPs} \\ \text{Target Decoding}: & 6 \times 1\text{B} \times 3 = 18 \text{ GFLOPs} \\ \text{Total Computation}: & 1.2N + 18 \text{ GFLOPs} \end{matrix}$
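To make the comparison concrete, the following sketch evaluates the naive decoder-only total next to the encoder-decoder total from the earlier breakdown ($0.66N + 8.1$ GFLOPs). The function names are illustrative, and $N=512$ is used only because it is the example value in this appendix; attention-score FLOPs are left out for both architectures.

```python
def encoder_decoder_gflops(n: float) -> float:
    """Total from the encoder-decoder analysis: 0.66 N + 8.1 GFLOPs."""
    return 0.66 * n + 8.1

def naive_decoder_only_gflops(n: float, params: float = 1e9, k: int = 5,
                              target_len: int = 3) -> float:
    """All parameters process all N/k + target_len tokens under causal masking."""
    return 6 * params * (n / k + target_len) / 1e9

N = 512
enc_dec = encoder_decoder_gflops(N)
naive = naive_decoder_only_gflops(N)
ratio = naive / enc_dec

print(f"encoder-decoder:    {enc_dec:.2f} GFLOPs")
print(f"naive decoder-only: {naive:.2f} GFLOPs")
print(f"ratio:              {ratio:.2f}x")
```

Under these assumptions the naive decoder-only model costs roughly 1.8 times as much as the encoder-decoder model at the same total parameter count, because every parameter, not only the encoder half, is applied to each context token.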

2.2 Empirical Results (Appendix C)

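We ran experiments to study the relationship among model size, compute budget, and training loss for OneRec-V2 models. Figure 12 shows smoothed generative training loss curves for models of different sizes as a function of total compute (measured in FLOPs). Specifically, larger models need more compute to reach the same loss value, but they also converge to lower loss, consistent with observations for large language models.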

2.3 Online Performance with Caching Disabled (Appendix D)

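As described in the Online A/B Test section, our experimental group receives 5% of traffic, of which the 25% degraded traffic is served by OneRec-V2. For a stricter comparison, we allocated an additional 1% experimental group with caching disabled (in this group, all traffic requests OneRec-V2). The results are shown in Table 9.
When all traffic requests OneRec-V2, we observe significant improvements in key engagement metrics, including watch time and user interaction indicators. Specifically, interaction metrics such as likes, follows, comments, and forwards increase by 9.6% to 29.2% across platforms. However, some ecosystem-level metrics show concerning trends. Notably, cold-start video views drop sharply (by 44.7% on Kuaishou and 36.7% on Kuaishou Lite), while cluster density rises significantly (by 11.7% and 7.9%, respectively). This presents a key challenge that needs careful consideration in future work.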
