2024_TWIN-V2

一、TWIN V2 [2024]

《TWIN V2: Scaling Ultra-Long User Behavior Sequence Modeling for Enhanced CTR Prediction at Kuaishou》

在大型推荐系统中，建模长期用户兴趣对于 CTR prediction 任务的重要性正逐渐引起研究人员和从业者的关注。现有的研究工作，如 SIM 和 TWIN，出于效率考虑，通常采用两阶段方法来建模长期用户行为序列：
- 第一阶段使用 search-based 的机制，即 General Search Unit: GSU ，从长序列中快速检索与 target item 相关的 a subset of sequences 。
- 而第二阶段使用 Exact Search Unit: ESU 对检索到的结果计算 interest scores 。
$10^6$ 的规模，目前尚无有效的解决方案来全面建模如此广泛的用户兴趣。为了解决这个问题，我们推出了 TWIN-V2 ，这是 TWIN 的增强版，其中采用分而治之的方法来压缩 life-cycle behaviors 并发现更准确更多样化的用户兴趣。
具体来说，离线阶段，我们通过 hierarchical clustering 方法将 life-cycle behaviors 中具有相似特性的 items 归为一个 clusterclusters $10^5$ 量级的行为序列压缩到可控的长度，从而用于 GSU retrieval 的 online inference 。Cluster-aware target attention 可以提取用户的全面的、多方面的长期兴趣，从而使最终的推荐结果更加精准和多样化。
在数十亿规模的工业级数据集上的大量离线实验、以及线上 A/B test ，证明了 TWIN-V2 的有效性。在高效的部署框架下，TWIN-V2 已成功部署到快手主流量中，从而服务于数亿日活用户。
CTR prediction 对于互联网应用至关重要。例如，快手（中国最大的短视频分享平台之一）已将 CTR prediction 作为其 ranking system 的核心组件。最近，人们投入了大量精力来对 CTR prediction 中的用户长期历史行为进行建模。由于生命周期中的 behaviors 的长度很长，modeling life-cycle user behaviors 是一项具有挑战性的任务。尽管现有研究不断努力延长 historical behavior modeling 的长度，但尚未开发出可以对用户整个生命周期进行建模的方法，从而覆盖 application 中高达 1 million behaviors 。现代工业系统采用了两阶段方法，以便在可控的推理时间内尽可能多地利用用户历史记录。具体来说：
- 在第一阶段，通用搜索单元（General Search Unit: GSU ）模块用于过滤长期历史行为，选择与 target item 相关的 items 。
- 在第二阶段，精确搜索单元（Exact Search Unit: ESU ）模块对 filtered items 进行进一步处理，通过 target attention 提取用户的长期兴趣。
GSU 的粗粒度的 pre-filtering 使系统能够在 online inference 过程中以更快的速度建模更长的历史行为。许多方法尝试了不同的 GSU 结构来提高 pre-filtering 的准确率，例如 ETA、SDIM 和 TWIN。尽管现有的第一阶段的 GSU 很有效，但它们通常存在长度限制，无法对整个生命周期的 user behaviors 进行建模。例如：
- SIM 、ETASDIM $10^3$ 的历史行为。
- TWIN $10^4$ $10^5$ 的水平。
在部署过程中，TWIN 使用最近的 10,000 behaviors 作为 GSU 的输入。不幸的是，这 10,000 behaviors 仅涵盖了用户在快手 App 中最近三到四个月的历史记录，无法涵盖 user behavior 的整个生命周期。如 Figure 1 所示，我们分析了快手 App 中过去三年的 user behaviormedium and high user groups $10^4$ $10^6$ 个视频，占 App 使用时长的大部分（60% + 35% = 95% ）。因此，modeling the full life-cycle behaviors 可能会增强用户体验并提高平台的商业收益。
为了解决这个问题，我们提出了 TWIN-V2 ，这是 TWIN 的增强版，使其能够对 user behavior 的全生命周期进行建模。 TWIN-V2 采用分而治之的方法，将完整的生命周期历史（full life-cycle history）分解为不同的 clusters ，并使用这些 clusters 来建模用户的长期兴趣。具体来说，我们将模型分为两个部分：离线和在线。
- 在离线部分，我们使用层次聚类（hierarchical clustering ）将 life-cycle behaviors 中的相似 items 聚合到 clusters 中，并从每个 cluster 中提取特征以形成一个虚拟的 item 。
- 在在线部分，我们利用 cluster-aware target attention 从而基于 clustered behaviors 来提取长期兴趣。在这里，attention scores 根据相应的 cluster size 被重新加权。
大量实验表明，TWIN-V2 在各种类型用户上提升了性能，从而带来更准确和多样化的推荐结果。本文主要贡献总结如下：
- 我们提出 TWIN-V2 从极长的 user behavior 中捕获用户兴趣，成功地延长了 user modeling across the life cycle 的最大长度。
- hierarchical clustering 和 cluster-aware target attention 提升了 long sequence modeling 的效率，建模了更准确、更多样化的用户兴趣。
- 大量离线实验和在线实验验证了 TWIN-V2 在超长用户行为序列的扩展方面的有效性。
- 我们分享了 TWIN-V2 的部署实践，包括离线处理和在线服务。TWIN-V2 已经服务于快手每日约 4 亿活跃用户的主要流量。
TWIN V2 vs TWIN V1 的在线指标提升，要比 TWIN V1 vs SIM 低得多。这意味着 TWIN V2 上线之后，系统的商业增益没有那么大。
本文旨在将 long-term history modeling 扩展到生命周期级别，同时保持效率。Table 1 总结了 TWIN-V2 与之前工作的比较。
TWIN V2 比 TWIN 多了一步：先对用户行为序列进行聚类，并对每个 cluster 构造一个 virtual item 来代表这个 cluster。然后将这些 virtual items 馈入 TWIN 中。
但是，TWIN V2 的效果依赖于一个训练好的 recommendation model 。我们需要从这个 recommendation model 中获取 item embedding 从而进行聚类、以及 virtual item 的构造。然而，论文没有提到这个 recommendation model 是如何创建和训练的，在实验部分也没有说这个 recommendation model 的性能对于 TWIN V2 效果的影响。
从 “TWIN-V2 的部署” 这一章节来看，我们只需要提供一个初始的 recommendation model，然后在训练过程中可以用上一轮的 TWIN-V2 模型作为这个 recommendation model 从而优化下一轮的 TWIN-V2。从这个角度来看， recommendation model 似乎不是特别重要？
此外，读者猜测初始的 recommendation model 来自于训练好的 TWIN-V1 模型。

1.1 方法

本节详细阐述了所提出的模型，详细介绍了从整体工作流到 deployment framework 的整个过程。符号总结如下。

1.1.1 基础知识

CTR predictionitem $\mathbf{\vec x}_k\in \mathbb R^{d_0}$ $k$ feature representation $y_k\in \{0,1\}$ $k$ 个交互的 label 。CTR prediction 的过程可以写成：
${\hat{y}}_{k} = σ (f ({\vec{x}}_{k}))$
其中：
- $\sigma(\cdot)$ 为 sigmoid 函数。
- $f$ CTR $f:\mathbb R^{d_0}\rightarrow \mathbb R$ 。
- $\hat y_k$ 为 predicted probability 。
$f$ 由二元交叉熵损伤函数来训练：
$L = - \frac{1}{| D |} \sum_{k = 1}^{| D |} [y_{k} \log ({\hat{y}}_{k}) + (1 - y_{k}) \log (1 - {\hat{y}}_{k})]$
$\mathcal D$ 为训练集。

1.1.2 整体工作流

Figure 2 展示了 TWIN-V2 的整体框架，包括离线和在线部分。本文重点介绍 life-cycle behavior modeling 。因此我们强调这一部分并省略其他部分。由于整个生命周期的 behavior 对于 online inferring 来说太长，我们首先在离线阶段对 behaviors 进行压缩。User behavior 通常包括许多相似的视频，因为用户经常浏览他们喜欢的主题。因此，我们使用 hierarchical clustering 将用户历史中具有相似兴趣的 items 聚合到 clusters 中。Online inferring 使用这些 clusters 及其 representation vectors 来捕获用户的长期兴趣。在 online inferring 过程中，我们采用两阶段方法进行life-cycle modeling ：
- 首先，我们使用 GSU 从 clustered behaviors 中检索与 target item 相关的 top 100 clusters 。
- 然后，我们使用 ESU 从这些 clusters 中提取长期兴趣。
在 GSU 和 ESU 中，我们实现了 cluster-aware target attention ，并调整不同 clusters 的权重。

1.1.3 Life-Cycle User Modeling

考虑到用户在生命周期中的历史行为具有超长序列，我们首先采用分而治之的方法压缩这些 behaviors ，然后从 compressed behaviors 中提取用户的长期兴趣。

a. Hierarchical Clustering Over Life Cycle

$u$ items $\mathcal S = [s_1,s_2,\cdots,s_T]$ $T$ life-cycle behaviors $s_j$ $j$ item $\mathcal S$ 中观看许多 similar itemsNBA $\mathcal S$ 中的 similar items 聚合成 clusters ，用一个 clustersimilar items $T$ 的大小。
$h_\text{comp}$ $T$ behavior sequence $\mathcal S$ $\hat T$ $\mathcal C$ ：
$C = h_{comp} (S)$
$\hat T \ll T$ $\mathcal C$ clustered behaviors $\mathcal C = [c_1,c_2,\cdots,c_{\hat T}]$ $c_i$ clustered items $i$ 个集合。
hierarchical clustering $h_\text{comp}$ ，如 Algorithm 1 所示。
- historical behaviors $u$ 的 playing time ratiohistorical items $M$ $j$ 个 historical itemplaying time ratio $p_j = \frac{\text{playing time}}{\text{video duration}}$ 。
- 在第二步中，我们递归地对每组的 life-cycle historical behaviors 进行聚类，直到每个 clusteritems $\gamma$ $\gamma$ 的 k-meansbehaviors $\mathcal S$ embeddings $\mathbf K_\mathcal S$ $\mathbf K_\mathcal S$ recommendation model $\gamma$ 是控制 maximum cluster size 的超参数。
  pre-trained $\mathbf K_\mathcal S$ 。这个 pre-trained 模型的质量会影响 TWIN-V2 的效果。
接下来，我们将详细说明 Algorithm 1 背后的原理：
- Item Grouping：在短视频场景中，视频的完播率（playing completion ratio）可以看作是用户对视频感兴趣程度的指标。我们首先按完播率对 behaviors 进行分组，以确保 final clustering results 具有相对一致的 playing time ratio 。如果不这样做，将导致每个 cluster 内的分布不平衡，因为k-meansitem embeddings $\text{getGroup}(p_j)$ playing time ratio $M$ 是一个常数，在我们的实践中设置为 5 。
  在电商场景下，我们可以按照 action type 对 behaviors 进行分组，如 click, add_cart, purchase 。
- Recursively Adaptive Clustering：我们选择 hierarchical clustering 而不是 single-step clustering ，因为用户历史是个性化的，因此设置通用的 clustering iterations 数量是不切实际的。
  - number of clusters $\delta$ 等于 items 数量的 0.3 次方），允许 clustering 适应不同规模的 behaviors 。
  - item size $\gamma$ clustering $\gamma$ 设置为 20 。
  单次迭代逻辑：
  - $Q$ group $m$ $Q$ Item Grouping $M$ 个 group 。
  - group $m$ item $\gamma$ group $m$ 直接作为一个 cluster。当前迭代结束，进入下一轮迭代。
  - group $m$ item $\gamma$ group $m$ 进行 Kmeans 聚类，clustergroup $m$ 中 item 数量的 0.3cluster $Q$ 中，从而便于将来迭代。
  这种做法使得每个 clusteritem size $\gamma$ 。
  cluster size $\gamma/2$ $\hat T$ $T/10$ （即，cluster 的数量），大大缩短了 life-cycle behaviorsitem embeddings $\mathbf K_\mathcal S$ 是从 recommendation model 中获得的，这意味着 clustering 是由协同过滤信号所指导的。
  因此，我们通过将 similar itemsclusters $\mathcal S = [s_1,s_2,\cdots,s_T]$ $\mathcal C = [c_1,c_2,\cdots,c_{\hat T}]$ 。

b. Extracting Cluster Representation

clusters $\mathcal C$ 后，我们需要从 items of each cluster 中抽取特征。为了最大限度地减少计算和存储开销，我们使用一个 virtual item 来表示 features of each cluster 。我们将 item features 分为两类（numerical and categorical ），并使用不同的方法抽取它们。
- Numerical features 通常使用标量来表示，例如视频时长和用户播放视频的时间。
- Categorical features 通常使用 one-hot (or multi-hot) 向量来表示，例如 video ID 和 author ID 。
item $v$ ，其特征可以写成：
${\vec{x}}_{v} = [\underset{categorical (one-hot)}{\underset{⏟}{{\vec{x}}_{c, 1}^{(v)}; {\vec{x}}_{c, 2}^{(v)}; \dots; {\vec{x}}_{c, N_{1}}^{(v)}}}; \underset{numerical (scaler)}{\underset{⏟}{x_{s, 1}^{(v)}; x_{s, 2}^{(v)}; \dots; x_{s, N_{2}}^{(v)}}}]$
$\mathbf{\vec x}_{1:N_1}^{(v)}$ categorical features $\mathbf{\vec x}_{1:N_2}^{(v)}$ numerical features $\mathcal C$ cluster $c_i$ ：
- 我们计算其包含的所有 itemsnumerical features $c_i$ 的 numerical feature representation ：
  ${\vec{c}}_{1 : N_{2}}^{(i)} = \frac{1}{| c_{i} |} \sum_{v \in c_{i}} {\vec{x}}_{1 : N_{2}}^{(v)}$
- 对于 categorical featurescluster $c_i$ 的质心最近的 itemcluster $c_i$ ：
  $\begin{matrix} {\vec{c}}_{1 : N_{1}}^{(i)} = {\vec{x}}_{1 : N_{1}}^{(i)}, \\ v = \arg min_{v \in c_{i}} {‖ {\vec{k}}_{v} - {\vec{k}}_{centroid} ‖}_{2}^{2} \end{matrix}$
  $\mathbf{\vec k}_v, \mathbf{\vec k}_\text{centroid}\in \mathbb R^d$ item $v$ 和质心的 embeddingsembeddings $\mathbf K_\mathcal S$ 中查找。
  注意：这是将距离质心最近的 item 的 categorical feature 作为这个 cluster 的 categorical feature 。
cluster $c_i$ aggregated, virtual item feature $\mathbf{\vec c}_i = \left[\mathbf{\vec c}_{1:N_1}^{(i)};\mathbf{\vec c}_{1:N_2}^{(i)}\right]$ 来表示。经过 embedding layerscluster $\mathbf{\vec c}_i$ embedding $\mathbf{\vec k}_i\in \mathbb R^d$ 。GSU 和 ESU 模块根据这些 embedded vectors 估计每个 cluster 与 target item 之间的相关性。
$\mathbf{\vec c}_i = \left[\mathbf{\vec c}_{1:N_1}^{(i)};\mathbf{\vec c}_{1:N_2}^{(i)}\right]$ 就是一个 “虚拟” 的 item （即，virtual item），作为这个 cluster 的代表。

c. Cluster-aware Target Attention

遵从 TWIN ，我们同时在 ESU 和 GSUclusters $\mathcal C$ 的 representations 作为输入。
clustered behaviors $\mathcal C = [c_1,c_2,\cdots,c_{\hat T}]$ ，我们通过 embedding layerrepresentation vectors $\mathbf K\in \mathbb R^{\hat T\times d}$ $\mathbf{\vec k}_i\in \mathbb R^d$ $i$ cluster $\mathbf{\vec c}_i$ 的 embedding 向量。
numerical feature $\mathbf{\vec c}_{1:N_2}^{(i)}$ ，需要先做 bucktize ，然后再做 embedding 。
然后我们使用 target attention 机制来衡量 target item 与 historical behaviors 的相关性。我们首先应用 TWIN 中的 “Behavior Feature Splits and Linear Projection” 技术来提升 target attention 的效率，详细信息见论文附录 A.2item embeddings $\mathbf{\vec q}\in \mathbb R^H$ target item $\mathbf K_h\in \mathbb R^{\hat T\times H}, \mathbf K_c\in \mathbb R^{\hat T\times C}$ 分别是 clustered behaviors 的固有 embeddingembedding $\mathbf K = [\mathbf K_h, \mathbf K_c]$ 。我们可以计算 target item 与 clustered behaviorsrelevance scores $\boldsymbol{\alpha}\in \mathbb R^{\hat T}$ ：
$α = \frac{(K_{h} W^{h}) {({\vec{q}}^{⊤} W^{q})}^{⊤}}{\sqrt{d_{k}}} + (K_{c} W^{c}) β$
$\mathbf W^h, \mathbf W^q, \mathbf W^c$ $d_k$ $\boldsymbol{\beta}$ 是可学习的交叉特征权重。
由于 clustersitem counts $\boldsymbol{\alpha}$ 无法充分表示 clusters 与 target item 之间的关系。假设两个 clusters 具有相同的相关性（与 target item 的相关性），则具有更多 items 的 cluster 应该更重要，因为 items 越多意味着用户偏好越强。因此，我们根据 cluster size 调整 relevance scores：
$α^{'} = α + \ln n$
$\boldsymbol{n}\in \mathbb N^{\hat T}$ $\mathcal C$ clusters $n_i$ cluster $c_i$ 的 cluster size 。
GSU $\boldsymbol{\alpha}^\prime$ clustered behaviors $\hat T$ ）中选择 relevance scores 最高的 top 100 clusters 。然后，将这 100 clusters 被输入到 ESU ，在那里根据 relevance scores 对它们进行聚合，从而得到用户长期兴趣的 representation ：
$Attention ({\vec{q}}^{⊤} W^{q}, K_{h} W^{h}, K_{c} W^{c}, K W^{v}) = Softmax {(α^{'})}^{⊤} (K W^{v})$
$\mathbf W^v$ 是一个投影矩阵。
ESU $\hat T = 100$ $\text{Softmax}\left(\boldsymbol{\alpha}^\prime\right)$ 可以写成：
$Softmax (α^{'}) = [\frac{n_{1} \times \exp α_{1}}{\sum_{j = 1}^{\hat{T}} n_{j} \times \exp α_{j}}, \dots, \frac{n_{\hat{T}} \times \exp α_{\hat{T}}}{\sum_{j = 1}^{\hat{T}} n_{j} \times \exp α_{j}}]$
其中每个 clusterrelevance score $\alpha_i$ cluster size $n_i$ 重新加权。
我们使用具有 4 个头的多头注意力来获得最终的长期兴趣：
$\begin{matrix} Long-term Interests = Concat ({head}_{1}, \dots, {head}_{4}) W^{o} \\ {head}_{a} = Attention ({\vec{q}}^{⊤} W_{a}^{q}, K_{h} W_{a}^{h}, K_{c} W_{a}^{c}, K W_{a}^{v}), a \in {1, 2, 3, 4} \end{matrix}$
$\mathbf W^o$ 为一个投影矩阵。
备注：利用从 GSU 中的 hierarchical clustering 中得出的 behaviors ，使模型能够容纳更长的历史行为，从而在我们的系统中实现 life-cycle level modeling 。与 TWIN 相比，GSU 的 life-cycle behaviors 的 input 涵盖了更广泛的历史兴趣，从而提供了更全面、更准确的用户兴趣建模。
此外，虽然现有的方法使用 top 100 retrieved behaviors 作为 ESU 的输入，但 TWIN-V2 使用 100 clusters 。这些 clusters 涵盖了超过 100 behaviors ，使ESU 能够对更广泛的 behaviors 进行建模。
所有这些优势带来了更加准确和多样化的推荐结果，这在实验章节中得到了验证。

1.1.4 TWIN-V2 的部署

我们将部署 TWIN-V2 的实践分为两个部分：在线和离线，如 Figure 3 所示。
在 Figure 3 中，我们以单个用户为例进行说明。当系统收到用户的 request 时：
- 它首先从离线处理好的 user life-cycle behaviors 中提取 behavioral features。
- 然后，使用 GSU 和ESU ，将 behavioral features 建模长期兴趣的 input 从而用于 CTR 模型以进行预测。
我们使用一个 nearline trainer 来进行 real-time model training ，该 trainer 使用实时用户交互数据（数据在 8 分钟的时间窗口内）增量地更新模型参数。
对于 life-cycle behaviors ，我们应用 hierarchical clustering 和feature extraction 来处理它们。这是通过周期性的 offline processing 和 updates 进行的。
Offline Processing ：离线处理旨在压缩用户的整个 life-cycle behaviors 。考虑到快手的用户数量是十亿级的，我们会定期压缩他们的 life-cycle behaviors 。hierarchical clustering runner 每 2 周执行一次全面更新。
我们将 hierarchical clusteringmaximum cluster size $\gamma$ 设置为 20 ，导致平均 cluster size 约为 10 。通过 cluster representation extraction ，每个 cluster 被聚合为一个虚拟 item 。在我们的实践中，historical behavior 被压缩为原始长度的 10% ，从而将存储成本降低了 90% 。
我们为 embedding server 采用了 TWIN 中提出的 inherent feature projector ，每 15 分钟从最新的 CTR 模型刷新一次 projector 的参数。
Online Serving：用户向系统发送 request 后，离线系统将数据特征发送到 GSU ，GSUbehavior-target relevance scores $\boldsymbol{\alpha}^\prime$ 。然后，ESU 选择并聚合 top 100 clustered behaviors 作为长期兴趣从而用于 CTR 模型的预测。为了确保 online inference 的效率，我们保留了 TWINprecomputing and caching $\mathbf K_h \mathbf W_a^h$ ）。 TWIN-V2 是 TWIN 的增量部署：TWIN 对最近几个月的 behaviors 进行建模，而 TWIN-V2 对整个生命周期的 behaviors 进行建模，使得它们相互补充。

1.2 实验

在本节中，我们通过进行大量的离线实验和在线实验来验证 TWIN-V2 的有效性。
数据集：由于 TWIN-V2user behaviors $10^4$ 的公开数据集。因此，我们从快手 App 中提取了连续五天的用户交互数据作为训练集和测试集。样本由点击日志构成，并以 click 为标签。为了捕获用户的 life-cycle 历史，我们进一步追溯了每个用户的过去所有行为，每个用户的最大历史长度为 100,000 。Table 2 展示了该工业数据集的基本统计数据。每天的前 23 个小时用作训练集，而最后一个小时用作测试集。
这种测试集会不会有偏？因为最后一个小时的交互数据的分布可能与前 23 个小时不同。
baseline：遵循常见做法（《Sampling Is All You Need on Modeling Long-Term User Behaviors for CTR Prediction》、《TWIN: TWo-Stage Interest Network for Lifelong User Behavior Modeling in CTR Prediction at Kuaishou》），鉴于我们的重点是建模极长的用户历史记录，我们将 TWIN-V2 与以下 SOTA baselines 进行比较：
- (1) Avg-Pooling ：一种利用均值池化的简单方法。
- (2) DIN：它为推荐算法引入了 target attention 。
- (3) SIM Hard：Hard-search 是指 GSU 根据 category 过滤长期历史。
- (4) SIM Soft：Soft search 涉及通过计算 target item 和历史记录中的 items 之间的向量点积来选择 top-k items 。
- (5) ETA：它采用局部敏感哈希（Locality Sensitive Hashing: LSH ）来促进端到端的训练。
- (6) SDIM：GSU 通过收集与 target item 共享相同哈希签名（hashing signature）的 behaviors 来聚合长期历史。
- (7) SIM Cluster：由于SIM Hard 依赖于 annotated categories ，我们通过 video embeddings 的聚类对其进行了改进。GSU 检索与 target item 位于相同 cluster 中的历史行为。在这里，我们将所有 items 聚类为 1,000 groups 。
- (8) SIM Cluster+：这是 SIM Cluster 的高级变体，其 cluster 数量增加到 10,000 个。
- (9) TWIN：它为 GSU 和 ESU 阶段提出了一种有效的 target attention 。
为了确保公平比较，除了 long-term interest modeling 外，我们在模型的所有方面都保持一致。
评估指标和评估设置：
- 我们使用一天中连续 23 个小时的样本作为训练数据，并使用接下来一个小时的样本进行测试。我们连续 5 天评估所有算法并报告 5 天的平均性能。
  相当于将算法重复 5 次并报告平均性能。
- 至于评估指标，我们采用了广泛使用的 AUC 和 Group AUC: GAUC 。我们计算了连续 5 天这两个指标的均值和标准差。
  GAUC 对所有用户的 AUC 进行加权平均，权重设置为该用户的样本数。GAUC 消除了用户之间的 bias ，以更精细和公平的粒度评估模型性能。
实现细节：
- 对于所有模型，我们都使用 industrial context 中相同的 item and user features ，包括 video ID, author ID and user ID 等大量属性。
- TWIN-V2 将单个用户历史长度限制为最多 100,000 items ，从而对 life-cycle user behavior 进行聚类。如前文所述，TWIN-V2 经验性地将历史行为压缩为原始大小的约 10% ，从而导致 GSU 中的 maximum input length 约为 10,000 。
  对于其他两阶段模型，输入 GSU 的历史行为最大长度限制为 10,000 behaviors 。
  对于 DIN 和 Avg-Pooling ，recent history 的最大长度为 100 ，因为它们不是为 long-term behaviors 而设计的。
  Avg-Pooling 事实上可以支持任意长度的用户行为序列。为什么这里截断为 100 ？
- 所有两阶段模型都利用 GSU 来检索 top 100 historical items 并将它们输入 ESU 从而用于 interest modeling 。
- batch size 设置为 8192 。Adam 的学习率为 5e-6 。

1.2.1 整体性能

从 Table 3 的结果中，我们得出以下观察结果：
- TWIN-V2 的表现远超其他基线。现有文献（《Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate》、《DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-Scale Learning to Rank Systems》）表明，AUC 提高 0.001 对 CTR prediction 具有重要意义，足以产生 online benefits 。TWIN-V2 比表现第二好的 TWIN 模型有所改进，AUC 提高了 0.0013 ，GAUC 提高了 0.0024 。这些改进验证了 TWIN-V2 的有效性。
- TWIN-V2 在 GAUC 上的相对改进高于 AUC 上的相对改进。GAUC 提供了更细粒度的模型指标。GAUC 的较大改进表明 TWIN-V2 在各种类型的用户中都实现了提升，而不仅仅是在整体样本上表现更好，而且它在特定子集上也表现出色。由于 TWIN-V2 包含更长的行为序列，我们相信它在高度活跃的用户群体中取得了更大的进步，这一假设在附录 A.3 中得到了验证（如 Figure 5 所示）。

1.2.2 消融研究

我们进行了一项消融研究，以调查核心模块在 TWIN-V2 中的作用以及我们设计背后的原理。
不同 Hierarchical Clustering 方法的比较：我们的 hierarchical clustering 方法有两个主要特点：
- 1)clustering number $\delta$ 来适应各种各样的 sizes of behaviors 。
- 2)：k-means clustering 是非均匀的，最终导致不同大小的 clusters 。
$\delta$ cluster number $\delta = 2$ ，记作 “Binary”cluster sizes $\delta = 2$ 时强制 balanced k-means clustering 结果的变体，记作 “Balanced&Binary” 。在这个变体中，每次 k-means iteration 之后，对于 larger cluster 中的一部分 item ，如果它们比另一部分 item 更靠近另一个 cluster 的质心，则这些 items 会被移动到另一个 cluster 中，以确保最终两个 cluster 的大小相等。 Table 4 报告了对 TWIN-V2 和这两个变体的统计分析。
- 在这三种方法中，我们的方法实现了最高的 cluster accuracy 。这表明我们的方法创建的 clusters 中，items 具有更紧密匹配的 representation vectors ，从而导致每个 cluster 内的 items 具有更高的相似度。
  cluster accuracy：每个 item 和它所属的 cluster 的质心进行 cos 相似度计算，然后取平均。
- $\delta$ 的效率。
- 此外，我们还报告了这些方法在测试集上的性能，如 Figure 4 (left) 。
这些结果验证了自适应方法（adaptive method ）可以在每个 cluster 内分配更相似的 items ，并与其他方法相比加快了 hierarchical clustering 的过程。
Cluster-aware Target Attention 的有效性：在提出的 cluster-aware target attention 中，与 TWIN 相比，主要区别在于使用 clustered behavior 作为输入，并根据 cluster size 重新加权 attention scores 。我们分别从 GSU 和 ESU 中删除了 cluster size reweighting 部分，以评估其对性能的影响。具体来说，我们分别将 GSUESU $\boldsymbol{\alpha}^\prime$ $\boldsymbol{\alpha}$ 。 Figure 4 (right) 显示了实验结果。省略 reweighting 会导致性能下降，这验证了通过 cluster size 来调整 attention scores 的有效性。

1.2.3 在线实验

我们进行了在线 A/B test ，以验证 TWIN-V2 在我们工业级系统中的性能。Table 5 显示了 TWIN-V2 在快手三个代表性场景（Featured-Video Tab, Discovery Tab, and Slide Tab）中在观看时长（Watch Time ）和推荐结果多样性方面的相对改进。
- 观看时长衡量用户花费的总时间。
- 多样性是指模型推荐结果的多样性，比如视频类型的丰富性。
  多样性指标公式s是什么？这里没给出来。
从结果来看，TWIN-V2 可以更好地建模用户兴趣，从而提升观看时长。此外，通过建模更长的历史行为，TWIN-V2 可以发现更多样化的用户兴趣，从而带来更加多样化的推荐结果。