2022_PinnerFormer

I. PinnerFormer [2022]

《PinnerFormer: Sequence Modeling for User Representation at Pinterest》

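Over the past few years, sequential models have become increasingly popular for powering personalized recommendation systems. These approaches traditionally model a user's actions on a website as a sequence in order to predict the user's next action. Although simple in theory, such models are challenging to deploy in production: they typically need streaming infrastructure to reflect the latest user activity, and may need to manage mutable data to encode the user's hidden state. Here we introduce PinnerFormer, a user representation trained with a sequence model over a user's recent actions to predict the user's future long-term engagement. Unlike prior approaches, we adapt the model to batch infrastructure through a new dense all-action loss, which models long-term future actions rather than next action prediction. We show that this significantly narrows the performance gap between batch user embeddings (generated once a day) and realtime user embeddings (regenerated after every user action). We describe our design decisions through extensive offline experiments and ablation studies, and validate the approach in A/B experiments, which show that PinnerFormer significantly improves user retention and engagement on Pinterest compared with our previous user representation. PinnerFormer has been deployed in production since Fall 2021.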
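PinnerFormer differs from a conventional recommendation task:
- Its objective is not the next engagement but a set of long-term future engagements, which characterize the user's long-term interests.
- Inference mainly runs offline at a daily cadence. Online serving only needs a KNN lookup; there is no realtime inference.
- The user embedding produced by PinnerFormer can be used as a user feature by other downstream tasks.
Why this design?
- Realtime inference requires robust infrastructure; PinnerFormer does not.
- Realtime inference is more expensive because it must run on every request; PinnerFormer only needs to run once a day.
- Embeddings from realtime inference cannot easily be shared; the embeddings PinnerFormer computes offline can be reused by other tasks.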
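More than 400M users visit Pinterest every month to discover ideas and inspiration from our corpus of billions of Pins. A Pin starts with an image and usually also includes text, a web link, and a board that connects the single Pin to a user-curated collection. Inspiration is at the core of Pinterest, and it is mainly enabled by our search and recommendation systems, which let users find content through:
- (a) Homefeed, our personalized recommendation product.
- (b) Related Pins, recommendations related to a query Pin.
- (c) Search, recommendations related to a user text query.
Users give feedback through interactions such as saving a Pin to a board (Repin), clicking through to the underlying link, zooming in on a Pin (close-up), hiding irrelevant content, and so on. To fulfill our mission of bringing everyone the inspiration to create a life they love, we need to personalize content to users' interests and context, and to account for the feedback users give throughout their journey on Pinterest. In other words, we need a strong user representation.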
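Learning user embeddings has become an increasingly popular way to improve recommendations. Such embeddings have been used to power ranking and candidate generation in industry, and to provide personalized recommendations at YouTube, Google Play, Airbnb search, JD.com search, Alibaba, and elsewhere. Beyond work on learning personalized embeddings, another line of work uses sequential information directly to build ranking models, personalizing recommendations based on a user's most recent engagement.
User behavior on a website is inherently sequential: actions can be ordered by the time at which they occur, which naturally suggests sequence modeling. Various methods have been proposed to predict future engagement from a user's historical interaction sequence. Recent work applies deep learning models, including RNNs and Transformers, to this kind of sequential recommendation with promising results. Sequence models have traditionally focused on the realtime setting, aiming to predict the user's next action or next engagement from all actions up to that point in time.
In practice, deploying existing sequence modeling methods in web-scale applications faces two key challenges: (a) computational cost and (b) infrastructure complexity. Existing methods fall roughly into two categories, stateless and stateful models. Stateless models can be computationally expensive, because the embedding must be recomputed from scratch after every user action. Stateful models require robust, reliable streaming infrastructure to handle errors or data corruption that may affect the model state of a particular user.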
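Here we introduce PinnerFormer, an end-to-end learned user representation that has been deployed in production at Pinterest. As in prior work on sequential user modeling, PinnerFormer learns a representation directly from a user's past Pin engagement. We propose a dense all action loss, which lets our embeddings capture a user's long-term interests rather than only predicting the next action. This allows the embeddings to be computed offline in batch and greatly simplifies the infrastructure.
We also address the infrastructure complexity specific to Pinterest: dozens of ranking models could benefit from personalization, but developing a custom solution for each one does not scale. Rather than producing one user embedding per model (which would add complexity), we invest in a single high-quality user embedding that can serve many downstream tasks. Although task-specific performance may be sacrificed in some cases, the reduction in complexity makes a shared embedding favorable for most use cases.
We evaluate PinnerFormer offline and in online A/B experiments. Offline, we show that this training objective reduces the performance gap between a model inferred daily and a model inferred in realtime by nearly 50%, and reflects users' long-term interests better than other approaches. We then show the utility of PinnerFormer as a feature: used in several ranking models across different domains, it brings significant online gains.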

1.1 Design Choices

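We first discuss the key design choices of PinnerFormer.
Design Choice 1: single embedding vs. multiple embeddings per user. Most methods for generating user representations produce a single embedding, but some learn a fixed or variable number of user embeddings. In PinnerSage, our previous user representation, we allowed a variable (and potentially large) number of embeddings so that the model could explicitly represent each user's diverse interests.
Multiple embeddings let a model capture user interests more explicitly and work well for retrieval, but they cause problems in downstream models. Storing 20 or more 256-dimensional float16 embeddings per row of training data does not scale, especially when a dataset has billions of rows, as ranking datasets do. It also raises training and inference cost: processing more than 5000 floating-point numbers can add non-negligible latency, especially when they are transformed before aggregation. At training time, larger examples (that is, more feature data) also increase data loading time. To avoid these issues, when PinnerSage is used in ranking models we usually take a weighted aggregation of a user's embeddings as the final user representation. Because we want PinnerFormer to be easy to use as a feature, we produce a single embedding that captures the user's interests. In offline evaluation, we show that this single embedding reflects users' long-term interests better than PinnerSage while requiring only a small fraction of the storage.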
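As a rough worked example of the storage argument above (the row count of $10^{9}$ is an illustrative assumption standing in for "billions of rows", and float16 is taken as 2 bytes per value):

$20 \times 256 = 5120 \text{ values per row}, \quad 5120 \times 2 \text{ bytes} = 10240 \text{ bytes} \approx 10 \text{ KB per row}$

$10240 \text{ bytes} \times 10^{9} \text{ rows} \approx 10.24 \text{ TB}$

$\text{single embedding: } 256 \times 2 \text{ bytes} = 512 \text{ bytes per row}, \quad 512 \text{ bytes} \times 10^{9} \text{ rows} \approx 0.51 \text{ TB}$

Under these assumptions a single 256-dimensional embedding needs about 1/20 of the storage of 20 embeddings, before any compression or columnar encoding is considered.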
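Design Choice 2: realtime inference vs. offline inference. Most prior work on sequential user modeling focuses on models that run in realtime or near realtime. In practice this leads to at least one of the following:
- High computational cost: for every user action, the system must fetch all events in the user's history at that moment and frequently run a potentially complex model.
- High infrastructure complexity: the user's hidden state or embedding can be updated incrementally, but this requires a robust system to recover and warm up the model's state whenever data is corrupted.
On Pinterest a user may take tens or hundreds of actions in a day, so a model that updates a user's embedding at most once a day needs only a small fraction of the compute of a realtime model of the same size. In offline evaluation we show that our loss formulation greatly narrows the gap between a realtime model and a daily-inferred model, and in A/B experiments we show that PinnerFormer substantially improves downstream ranking models.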

1.2 Our Approach: PinnerFormer

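In this section we introduce PinnerFormer, which has been used in production at Pinterest since Fall 2021, describing our model (shown in Figure 1) and how it is deployed.
We start with a corpus of Pins $\mathcal{P} = \{P_{1},P_{2},\cdots, P_{N}\}$, where $N$ is on the order of billions, and a set of users $\mathcal{U} = \{U_{1},U_{2},\cdots \}$ with $|\mathcal{U}| > \text{500M}$. For each pin $P_i$ in the corpus we have a PinSage embedding $\mathbf{\vec p}_i\in \mathbb{R}^{256}$, which is an aggregation of visual, text, and engagement information for pin $P_{i}$. For each user we have the sequence of actions the user has taken on the site, $\mathcal{A}_{U} = \{A_{1},A_{2},\cdots ,A_{S}\}$, ordered ascending by timestamp, where $S$ is the length of the action sequence. In this work we restrict this sequence to the user's engagements with pins over the past year, including Pin saves, clicks, reactions, and comments. Under this assumption, an action can be represented by a PinSage embedding together with some metadata about the action. In practice $S$ can be very large, so we compute a user's embedding from the user's $M$ most recent actions.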
Where does the PinSage embedding come from? It is the pin embedding pretrained by the GNN in 《Graph convolutional neural networks for web-scale recommender systems》.
Note: the earlier statement that PinnerSage "allowed a variable (and potentially large) number of embeddings" refers to multiple embeddings per user. The PinSage embedding here is an embedding of a pin, and each pin has exactly one.
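Given these definitions, we aim to learn a user representation $f\colon \mathcal{U}\mapsto \mathbb{R}^{d}$ that is compatible with a pin representation $g \colon \mathcal{P}\mapsto \mathbb{R}^{d}$. We learn $f$ and $g$ jointly, using the action sequence $\mathcal{A}_{U}$ as the only input features and restricting it to the latest $M$ actions.
A user's full action sequence may contain many types of actions. Some are good (for example, a long click), and some are neutral or negative (for example, a hide or a short click). In this work we focus on learning representations that predict positive engagement, which we define as a pin save ("Repin"), a pin close-up lasting more than 10 seconds ("Closeup"), or a long click (more than 10 seconds) through to the pin's underlying link ("Click"). We only treat engagement on Homefeed as positive: on surfaces such as Search or Related Pins the query provides significant context, whereas on Homefeed the user provides the main context.
If the action sequence is the feature, how are different actions distinguished? The authors introduce an action type embedding.
Our primary goal is to learn a model that predicts a user's positive future engagement over the 14-day window after the embedding is generated, rather than the traditional sequence modeling task in which the embedding predicts only the next action. In other words, we want to learn embeddings $\mathbf{\vec u}_{i}$ and $\mathbf{\vec p}_{i}$ such that, if $d\left(\mathbf{\vec u}_{k},\mathbf{\vec p}_{i}\right)< d(\mathbf{\vec u}_{k},\mathbf{\vec p}_{j})$, then the user $k$ represented by $\mathbf{\vec u}_{k}$ is more likely to positively engage with $\mathbf{\vec p}_i$ than with $\mathbf{\vec p}_{j}$ in the 14 days after $\mathbf{\vec u}_{k}$ is generated. We choose the 14-day range for tractability, and assume that the actions a user takes over two weeks are sufficiently representative of the user's long-term interests. Figure 1 illustrates the PinnerFormer architecture; we describe each component in more detail below.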

1.2.1 Feature Encoding

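For each action in a user's sequence we have a PinSage embedding (256 dimensions; 《Graph convolutional neural networks for web-scale recommender systems》) and metadata features: action type, surface, timestamp, and action duration. We use small learnable embedding tables to encode the two categorical features, action type and surface, and drop sequence elements whose value for either feature is out of vocabulary. We encode action duration as a single scalar, log(duration).
That is, if the action type or the surface type is OOV, the whole action is dropped. Note: dropping an entire action can break the continuity of the sequence, but the paper considers this rare enough to have little effect.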
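To represent when an action occurred, we use two derived values in addition to the raw absolute timestamp: the time since the latest action, and the time gap between actions. For each of these time features we follow common practice and encode it with sine and cosine transformations of various periods, in a manner similar to Time2vec (《Time2vec: Learning a vector representation of time》), but with $P$ fixed periods and a logarithmic transformation of time. This yields $2P + 1$ features for each time value ($2P$ periodic features plus one logarithmic feature).
In total there are three groups of time features: the raw absolute timestamp, the time since the latest action, and the time gap between actions.
All features are concatenated into a single vector of dimension $D_{\text{in}}$. The representation of action $A_{i}$ is denoted $\mathbf{\vec a}_{i} \in \mathbb{R}^{D_{\text{in}}}$.
Note that the features are concatenated, not summed. Each action therefore contains: the PinSage embedding, the action type embedding, the surface embedding, log(duration), and the $2P+1$ features of each time value.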

1.2.2 Model Architecture

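In PinnerFormer we model the user's action sequence with a Transformer architecture.
- We use PreNorm residual connections, applying Layer Normalization before each block, because this has been shown to improve training stability.
  PostNorm: output = LayerNorm(SubLayer(input) + input).
  PreNorm: output = SubLayer(LayerNorm(input)) + input.
  Why does PreNorm improve training stability? Mainly because it avoids large rescaling of gradients on the residual path, which matters most in deep Transformers.
  - The problem with PostNorm: LayerNorm sits after the residual addition, so the signal on the residual path is rescaled by LayerNorm again and again as it passes through successive sublayers, and gradients can become very small or very large at some positions. Training becomes more sensitive to learning rate and initialization, and deep networks are prone to vanishing or exploding gradients.
  - The benefit of PreNorm: LayerNorm sits before the sublayer, so each sublayer sees a normalized input, while the residual connection passes the unnormalized input through unchanged (an identity mapping). Gradients can flow back along the residual path unimpeded, without being compressed by LayerNorm. Both experiments and theory indicate that PreNorm allows deeper Transformers to be trained and is more robust to the choice of learning rate.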
- We first build the input matrix $\mathbf A = \left(\mathbf{\vec a}_{T} ,\cdots, \mathbf{\vec a}_{T - M + 1}\right)^\top \in \mathbb{R}^{M \times D_{\text{in}}}$ from the $M$ actions leading up to action $A_{T + 1}$, and use it as the user's sequence.
  We then project these to the Transformer's hidden dimension, add a fully learnable positional encoding, and apply a standard Transformer made of alternating feedforward network (FFN) blocks and multi-head self attention (MHSA) blocks.
  Note that the rows of the input matrix are ordered by descending timestamp: the most recent action is in the first position.
- The Transformer output at every position is passed through a small MLP and $L_{2}$-normalized, producing a set of embeddings $\mathbf E = \left(\mathbf{\vec e}_{1}, \cdots ,\mathbf{\vec e}_{M}\right)^\top \in \mathbb{R}^{M \times D}$, where $D$ is the final embedding dimension.
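To represent a pin, we learn an MLP that takes only the PinSage embedding as input, and we $L_{2}$-normalize its output embedding. We found that representing both users and pins with $L_{2}$-normalized embeddings gives stable training without sacrificing offline performance.
The key ingredient here is the PinSage embedding: its quality affects the performance of the model.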
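The difference between the PostNorm and PreNorm formulas given earlier in this section can be sketched on a single vector. This is a minimal illustration in plain Python, not the model's implementation: `layer_norm` omits the learnable scale and bias, and `toy_sublayer` is a hypothetical stand-in for an MHSA or FFN block.

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize to zero mean and unit variance (no learnable scale or bias here).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_norm_block(x, sublayer):
    # PostNorm: output = LayerNorm(SubLayer(input) + input)
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(y, x)])

def pre_norm_block(x, sublayer):
    # PreNorm: output = SubLayer(LayerNorm(input)) + input
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(y, x)]

def toy_sublayer(x):
    # Hypothetical stand-in for an attention or feedforward sublayer.
    return [0.5 * v for v in x]

x = [1.0, 2.0, 4.0]
pre = pre_norm_block(x, toy_sublayer)
post = post_norm_block(x, toy_sublayer)
```

In the PreNorm variant the input `x` reaches the output through an untouched addition, so a sublayer that outputs zeros leaves `x` unchanged; in the PostNorm variant the residual sum is always renormalized.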

1.2.3 Metric Learning

To train our representation we need pairs $\left\{\left(\mathbf{\vec u}_{1}, \mathbf{\vec p}_{1}\right), \cdots , \left(\mathbf{\vec u}_{B}, \mathbf{\vec p}_{B}\right)\right\}$ of user embeddings and target pin embeddings, where both users and pins may be repeated. In this work we choose not to use explicit negative examples; that is, we have no loss terms for negative engagement such as hides. Several questions arise when designing the model:
- (a) How do we select these pairs?
- (b) For a given $\left(\mathbf{\vec u}_{i}, \mathbf{\vec p}_{i}\right)$ pair, how do we select negative examples?
- (c) Given a $\left(\mathbf{\vec u}_{i}, \mathbf{\vec p}_{i}\right)$ pair and a set of negative examples, how do we compute the loss?
We first describe (b) and (c), and then elaborate on (a).
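Negative Selection: we consider two sources of negative examples, in-batch negatives and random negatives.
- When selecting in-batch negatives for a given user, we use all positive examples in the batch as that user's negatives, masking out pins with which the user has a positive engagement. This is efficient and simple, but if implemented naively it can demote popular pins, because pins with high engagement are more likely to appear as negatives than pins with low engagement. Another drawback is that the distribution of in-batch negatives differs from the true underlying distribution of pins used for retrieval, which creates a discrepancy between training and serving.
- The second source is uniform sampling from the corpus of all pins. Using only these negatives can lead to model collapse, because the negatives may be too easy.
- The third option combines random negatives and in-batch negatives, merging the two pools to exploit the distinct properties of each.
  In the paper, the authors cap the number of in-batch negatives at 5000 and fix the number of random negatives at 8192.
In practice a larger negative pool improves the quality of the learned embeddings, so during training we gather negative examples across all GPUs and choose the largest negative pool that fits comfortably in GPU memory.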
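Loss Function: after selecting a source of negative examples, for a given pair of user and positive embeddings $\left(\mathbf{\vec u}_{i}, \mathbf{\vec p}_{i}\right)$ we obtain a set of negative embeddings $\left\{\mathbf{\vec n}_1,\cdots ,\mathbf{\vec n}_N\right\}$. We compute a loss for each pair and then take a weighted average such that every user in the batch on a given GPU receives equal weight.
The loss we found to work best is sampled softmax with a $\log Q$ correction, in which each logit is corrected according to the probability that the given negative appears in the batch. We also use a temperature $\tau \in [0.01,\infty)$. Let $s\left(\mathbf{\vec u},\mathbf{\vec p}\right) = \left<\mathbf{\vec u},\mathbf{\vec p}\right> / \tau$. The sampled softmax loss without sample probability correction is then defined as: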
$L ({\vec{u}}_{i}, {\vec{p}}_{i}) = - \log (\frac{\exp (s ({\vec{u}}_{i}, {\vec{p}}_{i}))}{\exp (s ({\vec{u}}_{i}, {\vec{p}}_{i})) + \sum_{j = 1}^{N} \exp (s ({\vec{u}}_{i}, {\vec{n}}_{j}))})$
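Here $\left<\cdot, \cdot\right>$ is the inner product, which equals cosine similarity because both embeddings are $L_2$-normalized.
When negatives are not uniformly distributed, a correction term $Q_i(v) = \text{Prob}(\text{Pin } v \text { in batch} \mid \text{User } U_i \text{ in batch})$ should be applied for a pair $(u_i, v)$ to correct the sampling bias, where $v$ may be a positive or a negative example. The softmax loss with sample probability correction for a single pair is defined as: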
$L ({\vec{u}}_{i}, {\vec{p}}_{i}) = - \log (\frac{\exp (s ({\vec{u}}_{i}, {\vec{p}}_{i}) - \log Q_{i} (p_{i}))}{\exp (s ({\vec{u}}_{i}, {\vec{p}}_{i}) - \log Q_{i} (p_{i})) + \sum_{j = 1}^{N} \exp (s ({\vec{u}}_{i}, {\vec{n}}_{j}) - \log Q_{i} (n_{j}))})$
For simplicity, we approximate $Q$ with a count-min sketch (《An improved data stream summary: the count-min sketch and its applications》).
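The two loss definitions above can be sketched for a single pair as follows. This is a minimal plain-Python illustration under stated assumptions, not the production loss: embeddings are assumed to be already $L_2$-normalized lists of floats, the temperature is passed in as a fixed `tau`, and the $\log Q$ values are supplied by the caller (`log_q_pos` and `log_q_negs` are hypothetical parameter names). Leaving the correction terms at zero gives the uncorrected loss.

```python
import math

def sampled_softmax_loss(u, p, negatives, tau=0.1, log_q_pos=0.0, log_q_negs=None):
    """Sampled softmax loss for one (user, positive) pair, with optional logQ correction."""
    def s(a, b):
        # s(u, p) = <u, p> / tau
        return sum(x * y for x, y in zip(a, b)) / tau

    if log_q_negs is None:
        log_q_negs = [0.0] * len(negatives)

    pos_logit = s(u, p) - log_q_pos
    neg_logits = [s(u, n) - lq for n, lq in zip(negatives, log_q_negs)]

    # Numerically stable log of the denominator (log-sum-exp over all logits).
    logits = [pos_logit] + neg_logits
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - pos_logit

u = [1.0, 0.0]
p = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
plain = sampled_softmax_loss(u, p, negatives, tau=1.0)
uniform_q = sampled_softmax_loss(
    u, p, negatives, tau=1.0,
    log_q_pos=math.log(0.5), log_q_negs=[math.log(0.5)] * 2,
)
```

When every candidate has the same $Q$, the correction shifts all logits by the same constant and the loss is unchanged, which matches the later observation that the correction does not help when all negatives are sampled uniformly.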
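A count-min sketch uses $D$ hash functions, each mapping into its own row of $W$ buckets. For each pin ID $v$ that appears in a batch:
- Pass the ID through every hash function in turn and increment by 1 the count of the bucket that each hash function selects.
- To query a pin ID $p_i$, estimate its frequency as $\text{freq}_{p_i} = \min \{\text{count}[j][h_j(p_i)]\}_{j=1}^{D}$, the minimum count over the buckets that $p_i$ maps to.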
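The update and query steps above can be sketched as a small class. This is an illustrative version only: the hash functions are derived from `hashlib.blake2b` salted with the row index, and the default `depth` and `width` are arbitrary choices, not values from the paper.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counter with `depth` hash rows of `width` buckets each."""

    def __init__(self, depth=4, width=1024):
        self.depth = depth
        self.width = width
        self.total = 0
        self.counts = [[0] * width for _ in range(depth)]

    def _bucket(self, row, item):
        # One hash function per row, obtained by salting the item with the row index.
        digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        # Increment the bucket selected by every hash function.
        self.total += count
        for row in range(self.depth):
            self.counts[row][self._bucket(row, item)] += count

    def estimate(self, item):
        # The minimum over rows is the least-contaminated estimate; it never undercounts.
        return min(self.counts[row][self._bucket(row, item)] for row in range(self.depth))

cms = CountMinSketch()
for _ in range(5):
    cms.add("pin_a")
cms.add("pin_b")
```

Collisions can only inflate a bucket, so `estimate` is an upper bound on the true count. Turning the counts into the probability $Q$ (for example, dividing by the number of observed batches) is a further step that this sketch leaves to the caller.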

1.2.4 Training Objective

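Given our loss function, we now address how to select the pairs $\left(\mathbf{\vec u}_{i}, \mathbf{\vec p}_{i}\right)$. Our model should be able to predict three forms of positive engagement: Repins, Closeups, and Clicks. Each of these actions has value, but instead of learning task-specific heads as is common in the multi-task learning literature, we train a single embedding in a multi-task fashion, directly learning an embedding that retrieves the different types of positive engagement effectively. We do not explicitly assign different weights to different engagement types in the loss. The four training objectives we consider are described below and depicted in Figure 2.
- Next Action: use $\mathbf{\vec e}_1$ (the embedding corresponding to the most recent action $A_T$) to predict $A_{T+1}$.
- SASRec: predict the next action at every historical action position. Note that a negative action cannot be a prediction target.
- All Action: use $\mathbf{\vec e}_1$ (the embedding corresponding to action $A_T$) to predict every positive action in the K Day Future window.
- Dense All Action: randomly select a set of historical positions and, for each position, predict one positive action chosen independently at random from the K Day Future window.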
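Next Action Prediction: the naive objective for a sequence modeling task is next action prediction, in which we predict $A_{T + 1}$ from the user sequence $\{A_T,A_{T - 1},\cdots ,A_{T - M + 1}\}$ (when $A_{T + 1}$ is a positive engagement). This objective is intuitive for a realtime sequence model, because in an online setting $A_{T}$ is always the user's most recent action. SASRec extends this simple training objective by predicting the next action at every step rather than only the most recent positive action. In our experiments we modify this slightly and only let positive actions contribute to the model's loss.
That is, in SASRec a negative action cannot serve as the next action.
Unlike these traditional objectives, our goal is not to predict the user's next immediate action; instead, we infer user embeddings daily and aim to capture the user's long-term interests. To this end we introduce two alternative training objectives.
All Action Prediction: because we do not want to predict only the next action, we construct a naïve training objective that uses the final user embedding $\mathbf{\vec e}_1$ to predict all actions the user takes over the next $K$ days. Suppose the user has positive engagements at $T + 3$, $T + 8$, and $T + 12$, all within $K$ days of $T$. From the user sequence $\{A_T,A_{T - 1},\cdots ,A_{T - M + 1}\}$ we then aim to predict all 3 actions: $A_{T + 3}$, $A_{T + 8}$, and $A_{T + 12}$. This objective forces the model to learn long-term interests instead of focusing only on the next action, which should reduce the staleness introduced by daily offline inference. For computational tractability, we randomly sample up to 32 actions per user within the $K$-day window.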
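Dense All Action Prediction: to further increase the signal each batch provides, we take inspiration from SASRec and modify the all action prediction objective.
- Instead of using only the most recent user embedding $\mathbf{\vec e}_{1}$ to predict actions over the $K$-day window, we select a set of random indices $\{s_{i}\}$ and, for each $\mathbf{\vec e}_{s_{i}}$, predict one positive action chosen at random from all positive actions over the next $K$ days.
- $\mathbf{\vec e}_{1}$ is still used as the final user representation. To make sure this approach learns from the ordering of the data, we apply causal masking to the Transformer's self-attention blocks, so each action can attend only to past or current actions and never to future ones. We observe that this masking significantly improves the model's performance on this task.
- Rather than predicting all future positive actions, each $\mathbf{\vec e}_{s_{i}}$ predicts a single positive action.
Note: the Transformer's self-attention blocks use causal masking.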
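The four objectives differ only in which (source position, target action) pairs they emit, which can be sketched as follows. This is an illustrative reading, not the authors' code: the timeline is a chronological list of hypothetical `(timestamp_in_days, is_positive)` tuples, indices are chronological (so index `t` plays the role of $A_T$), and the $K$-day window is assumed to start at $A_T$ for every source position.

```python
import random

def next_action_pairs(timeline, t):
    # The output at A_T predicts A_{T+1}, but only when A_{T+1} is positive.
    if t + 1 < len(timeline) and timeline[t + 1][1]:
        return [(t, t + 1)]
    return []

def sasrec_pairs(timeline, t, m):
    # Every position in the input window predicts its own next action (positives only).
    start = max(0, t - m + 1)
    return [(i, i + 1) for i in range(start, t + 1)
            if i + 1 < len(timeline) and timeline[i + 1][1]]

def window_positives(timeline, t, k_days):
    # Indices of positive actions that fall within k_days after A_T.
    now = timeline[t][0]
    return [j for j in range(t + 1, len(timeline))
            if timeline[j][1] and timeline[j][0] - now <= k_days]

def all_action_pairs(timeline, t, k_days, max_labels=32, rng=random):
    # The output at A_T predicts every (sampled) positive action in the window.
    positives = window_positives(timeline, t, k_days)
    if len(positives) > max_labels:
        positives = rng.sample(positives, max_labels)
    return [(t, j) for j in positives]

def dense_all_action_pairs(timeline, t, m, k_days, num_positions, rng=random):
    # Random source positions; each one predicts a single random positive from the window.
    positives = window_positives(timeline, t, k_days)
    if not positives:
        return []
    start = max(0, t - m + 1)
    positions = rng.sample(range(start, t + 1), min(num_positions, t - start + 1))
    return [(s, rng.choice(positives)) for s in positions]

timeline = [(0, True), (1, False), (2, True), (3, True),
            (4, False), (6, True), (11, True), (40, True)]
pairs_next = next_action_pairs(timeline, t=3)
pairs_sasrec = sasrec_pairs(timeline, t=3, m=3)
pairs_all = all_action_pairs(timeline, t=3, k_days=14)
pairs_dense = dense_all_action_pairs(timeline, t=3, m=3, k_days=14,
                                     num_positions=2, rng=random.Random(0))
```

In this toy timeline the next action after index 3 is negative, so next action prediction yields no training pair, while the all action variants still obtain labels from the two positives inside the 14-day window.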

1.2.5 Dataset Design

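Given a user's timeline $\mathcal{A}_{U} = \{A_{1},\cdots ,A_{S}\}$ and a maximum sequence length $M$, we can construct up to $S - M - 1$ training examples of length exactly $M$ (assuming all actions are positive). For example, the sequence $\{A_{5},\cdots ,A_{5 + M - 1}\}$ together with the positive engagements $\{A_{5 + M},A_{7 + M}\}$ can be extracted from the full timeline $\mathcal{A}_{U}$.
A straightforward way to store this data is to materialize, for each user, all relevant sequences of length $M$ (or shorter) together with the set of future positive engagements for each sequence. This becomes a problem when experimenting with sampling strategies, because the training data has to be regenerated whenever a sampling hyperparameter changes, which is time consuming. To be more efficient, we instead store each user's sequence as a single row in the dataset and sample dynamically during training. The clear benefit is that custom sampling is possible at training time; the cost is that the training data is shuffled less thoroughly.
With this layout only user-level shuffling is possible, not sample-level shuffling.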
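Specifically, we used this strategy to tune several hyperparameters, all of which significantly affect overall model performance:
- The maximum sequence length $M$.
- The fraction of possible user sequences to sample from a user's timeline.
  That is, the number of sliding-window subsequences of length $M$ sampled for a user, as a percentage of all possible sliding windows for that user.
  This controls relative density: users contribute samples in proportion to the length of their history, so long-sequence users are not oversampled but are not ignored either.
- The maximum number of sequences sampled per user.
  This controls the absolute count: it stops a handful of extremely active users (for example, hundreds of interactions a day, tens of thousands a year) from producing a flood of samples that would unbalance the training data or slow training down.
- The maximum number of positive examples sampled as labels for each sequence.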
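How the sampling hyperparameters above might interact when sequences are drawn on the fly from one stored row per user can be sketched as follows. The function and parameter names (`sample_sequence_ends`, `fraction`, `max_sequences`, `max_labels`) are hypothetical, and the exact window-counting convention is an assumption made for this illustration.

```python
import random

def sample_sequence_ends(num_actions, max_len, fraction, max_sequences, rng=random):
    """Choose end indices T of length-`max_len` windows from one user's timeline."""
    # A window ending at T must be full length and leave at least one later action as a label.
    candidates = list(range(max_len - 1, num_actions - 1))
    if not candidates:
        return []
    by_fraction = max(1, round(fraction * len(candidates)))  # relative density
    target = min(max_sequences, by_fraction, len(candidates))  # absolute cap
    return sorted(rng.sample(candidates, target))

def window(timeline, end, max_len):
    # The `max_len` actions ending at index `end` (inclusive), in chronological order.
    return timeline[max(0, end - max_len + 1): end + 1]

def sample_labels(future_positives, max_labels, rng=random):
    # Cap the number of positive labels used for one sequence.
    if len(future_positives) <= max_labels:
        return list(future_positives)
    return rng.sample(future_positives, max_labels)

rng = random.Random(0)
capped = sample_sequence_ends(100, max_len=16, fraction=0.1, max_sequences=5, rng=rng)
uncapped = sample_sequence_ends(100, max_len=16, fraction=0.1, max_sequences=50, rng=rng)
```

Because these choices are made while reading each user row, changing `fraction` or `max_sequences` only changes the sampler, not the stored dataset.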

1.2.6 Model Serving

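Because PinnerFormer inference targets the offline, batch setting, we infer the model in a daily, incremental workflow, shown in Figure 3. The workflow generates new embeddings for users who engaged with any pin during the past day, merges them with the previous day's embeddings, and uploads the result to a key-value feature store for online serving. Since we only generate new embeddings for users who were active in the past day, and inference runs offline without latency constraints, we can use as large a model as possible, which increases the amount of information the embedding can capture. If the input features are corrupted (for example, by a logging error), we can simply rerun inference for all affected users (every user whose embedding has been updated since the corruption), and provided the upstream data has been fixed, the next day's data will be correct.
The premise of this "incremental update" is that no feature depends on the timestamp at which inference runs.
Pin embeddings are cheap to compute, requiring only a small MLP transformation of existing features, so we regenerate them from scratch every day and then compile an HNSW graph (《Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs》) that can be queried online with the user embeddings stored in the feature store.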

1.3 Experiments and Results

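Here we first compare PinnerFormer with baselines, run ablation studies, and use offline experiments to explore the performance gap between realtime inference and daily inference. We then show significant improvements over PinnerSage in A/B experiments.
Offline evaluation metrics: our primary evaluation metric is Recall@10.
We evaluate on a set of users over the 2 weeks that follow the end of training. Let $t$ be the end of the training period. At time $t$ we compute embeddings for all users in the evaluation set, and then measure how well these embeddings (computed at $t$) retrieve, from a set of candidate pins, all pins each user engages with over the following 2 weeks:
- The candidate pins are 1 million random pins.
- The 2 weeks are the interval between $t$ and $t + 14d$.
Given a set of users $\mathcal{U}$, a set of positively engaged Pins $\mathcal{P}_U$ for each user $U$, and a random corpus $\mathcal{N}$ of 1 million pins, Recall@k $\text{(R@k)}$ is defined as:
$\begin{matrix} \text{Recall@}k (U) = \frac{1}{| \mathcal{P}_{U} |} \sum_{P \in \mathcal{P}_{U}} \mathbb{1} \left\{ \left| \left\{ N \in \mathcal{N} \mid d (U, P) \geq d (U, N) \right\} \right| < k \right\} \\ \text{Recall@}k = \frac{1}{| \mathcal{U} |} \sum_{U \in \mathcal{U}} \text{Recall@}k (U) \end{matrix}$
Here the user-pin distance $d(U,P)$ is the Euclidean distance between the user embedding and the pin embedding.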
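The Recall@k definition above can be written out directly as a brute-force computation. This is a small sketch for clarity rather than an efficient evaluator: embeddings are plain lists of floats, and every positive is compared against the whole corpus.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recall_at_k_user(user, positives, corpus, k):
    """Fraction of a user's positives that have fewer than k corpus pins at least as close."""
    hits = 0
    for p in positives:
        d_pos = euclidean(user, p)
        # |{N in corpus : d(U, P) >= d(U, N)}|
        at_least_as_close = sum(1 for n in corpus if euclidean(user, n) <= d_pos)
        hits += at_least_as_close < k
    return hits / len(positives)

def recall_at_k(users, positives_by_user, corpus, k):
    # Average the per-user recall so every user counts equally.
    per_user = [recall_at_k_user(u, ps, corpus, k)
                for u, ps in zip(users, positives_by_user)]
    return sum(per_user) / len(per_user)

user = [1.0, 0.0]
positives = [[1.0, 0.0], [0.5, 0.5]]
corpus = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
r1 = recall_at_k([user], [positives], corpus, k=1)
r2 = recall_at_k([user], [positives], corpus, k=2)
```

In this toy case the second positive has exactly one corpus pin closer to the user, so it counts as a hit for k=2 but not for k=1.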
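We also look at two diversity measures:
- (a) The entropy of the distribution of Interests (about 350 unique topics of pins) associated with the top 50 results retrieved from the 1 million pins ("Interest Entropy@50").
- (b) The fraction of the 1 million pins that accounts for 90% of the top 10 retrieved results across users ("P90 Coverage@10").
  Suppose there are 100 users and we take the top 10 for each, for 1000 results in total (repeats allowed). If 900 of those results are the same Pin "A", then covering 90% (900 results) takes only 1 unique pin, so P90 Coverage@10 = 1 / 1,000,000 = 0.000001.
  A larger value means the retrieved results are more spread out (high diversity, not concentrated on a few popular pins); a smaller value means they are more concentrated (the model tends to return the same popular content).
The former measures the diversity of results retrieved for a particular user, while the latter represents the global diversity of results across all users. Both are useful to observe: a trivial baseline that recommends popular pins regardless of the user may do well on (a), but its value for (b) will be very close to 0.0.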
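Both diversity measures can be sketched in a few lines. The helper names are hypothetical, the entropy uses the natural logarithm (the notes do not state a base), and `interest_entropy` takes one interest label per retrieved result, which is a simplifying assumption about how Interests are attached to pins.

```python
import math
from collections import Counter

def p90_coverage(retrieved_lists, index_size, mass=0.9):
    """Fraction of the index needed to account for `mass` of all retrieved results."""
    counts = Counter(pin for results in retrieved_lists for pin in results)
    total = sum(counts.values())
    covered, unique = 0, 0
    # Greedily take the most frequently retrieved pins until `mass` is covered.
    for _, c in counts.most_common():
        covered += c
        unique += 1
        if covered / total >= mass:
            break
    return unique / index_size

def interest_entropy(interests):
    """Shannon entropy (in nats) of the interest labels of one user's retrieved results."""
    counts = Counter(interests)
    total = len(interests)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# The worked example from the text: 100 users, top 10 each, 900 results are pin "A".
retrieved = [["A"] * 10 for _ in range(90)]
retrieved += [[f"pin_{u}_{i}" for i in range(10)] for u in range(10)]
coverage = p90_coverage(retrieved, index_size=1_000_000)
```

Here `coverage` reproduces the value from the worked example, since a single pin already accounts for 90% of the 1000 retrieved results.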

1.3.1 Offline Results

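In this section we first compare PinnerFormer with baselines and then study which aspects of the model lead to its performance.
Comparison with baselines. In offline evaluation we compare with the baseline PinnerSage (our previous, multi-embedding user representation) using an oracle evaluation of recall: given a number of embeddings $c$ and a given positive, we pick, among the top-c PinnerSage embeddings, the one closest to that positive as the user's representation. This establishes an approximate upper bound on how well $c$ embeddings can predict engagement.
To compute the diversity metrics we do not use the oracle. Instead, we order results with round-robin blending: given a set of user embeddings (sorted by weight), each with its own retrieved results, we take the first result from the first user embedding, the second result from the second user embedding, and so on, returning to the first user embedding once every user embedding has contributed a result.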
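One way to read the round-robin blending described above is as a cycle over the per-embedding result lists, where each embedding contributes its next unused result per pass. The sketch below follows that reading; skipping pins that another embedding already contributed is an added assumption, not something the text specifies.

```python
def round_robin_blend(result_lists, limit):
    """Interleave per-embedding result lists, one result per embedding per pass."""
    blended, seen = [], set()
    cursors = [0] * len(result_lists)
    while len(blended) < limit:
        progressed = False
        for i, results in enumerate(result_lists):
            # Skip results that another embedding has already contributed.
            while cursors[i] < len(results) and results[cursors[i]] in seen:
                cursors[i] += 1
            if cursors[i] < len(results):
                pin = results[cursors[i]]
                cursors[i] += 1
                blended.append(pin)
                seen.add(pin)
                progressed = True
                if len(blended) == limit:
                    break
        if not progressed:
            break  # every list is exhausted
    return blended

# Result lists are assumed to be ordered by embedding weight, then by retrieval rank.
blended = round_robin_blend([["a", "b", "c"], ["b", "d"], ["e"]], limit=5)
```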
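Table 1 compares PinnerFormer and PinnerSage, evaluated as described above.
- Even with an oracle evaluation of PinnerSage using the top 5 or 20 clusters, the single PinnerFormer embedding is better than PinnerSage at retrieving content the user is likely to engage with over 14 days.
- Increasing the number of clusters makes the results retrieved for a given user more diverse. This is one respect in which PinnerSage beats PinnerFormer when enough clusters are used.
- PinnerSage also retrieves more unique candidates from the index, but as Table 4 shows, some PinnerFormer variants reach a comparable level of unique candidates while keeping higher engagement metrics.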
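Daily vs Realtime Inference: to quantify how much performance drops as inference becomes less frequent, we compare two models that differ only in their training objective:
- A model trained with the SASRec training objective, which directly predicts the next action. We modify it to use the sampled softmax loss and give the loss on $\mathbf{\vec e}_1$ a weight equal to that of the other positions, because we found this improves performance.
  See the appendix for how SASRec is modified.
- PinnerFormer, trained with the dense all action prediction objective over a 28-day window.
We evaluate these models on the evaluation window $(t,t + 14d]$ at three different frequencies:
- Once: at time $t$ we compute a single embedding and use it to predict all positive actions in $(t,t + 14d]$.
- Daily: we use the user embedding computed at $x - 1d$ to predict positive actions in $(x,x + 1d]$, for $x\in \{t,t + 1d,\cdots ,t + 13d\}$. The one-day gap accounts for the delay between an action becoming available in offline logs and its embedding being uploaded to the feature store.
- Realtime: we update embeddings after every action. For each positive action, the embedding is computed from the $M$ actions preceding that positive action (over all positive actions in $(t,t + 14d]$).
Note that the daily and realtime settings differ from our primary evaluation. Here, given a user's embedding at some point in time, we measure the ability to predict the specific action the user will take, whereas our primary evaluation measures how well the embedding captures the user's long-term interests.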
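Serving the realtime model in production is not realistic, because it would raise inference cost well beyond that of a batch model: some users take tens or hundreds of actions a day, which amounts to many times the cost of the offline model even with shorter sequences. We expect this realtime baseline to outperform an offline, daily-computed model, so it helps quantify the opportunity cost of avoiding the realtime setting.
In Table 2 we also see that PinnerFormer's performance improves as inference becomes more frequent, from once at the start of evaluation, to daily, to realtime. Surprisingly, even in realtime mode PinnerFormer outperforms the model trained to predict only the next engaged item.
This experiment also provides evidence that the dense all action prediction objective has the intended effect of making the model less sensitive to short-term fluctuations and instead learning the user's more stable interests. Moving from realtime inference to daily inference, and from daily inference to inference only once, the model trained with the dense all action objective loses less performance (-8.3%), while the model trained with the next action prediction task loses more (-13.9%).
A non-trivial gap remains between realtime and daily inference, but given the improvement over our baseline PinnerSage and the high cost and infrastructure complexity of inferring PinnerFormer in realtime, we consider this an acceptable trade-off.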
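Training Objective Selection:
- In Table 3 we observe that the next action training objective yields lower Recall@10 but higher retrieved index coverage.
  - The lower Recall@10 can be explained by the all action prediction tasks being better aligned with the evaluation objective than next action prediction.
  - We believe next action prediction shows higher index coverage because predicting actions over a longer horizon is harder than predicting only the next action, so the learned Pin embeddings may lean toward retrieving popular content rather than content directly related to recent actions.
- For all action prediction, training on a 28-day future window gives better results than training on a 14-day window, even though evaluation is fixed to a 14-day window. We believe this is because each user sequence has more labels, which can improve training efficiency.
- The dense all action loss beats all action prediction on both Recall@10 and global diversity. The key difference between the two losses is that in all action prediction the gradients of all of a user's positives flow back through the same user embedding, which produces a stronger averaging effect, whereas in the dense all action loss the gradients flow through different Transformer outputs and are only averaged once they reach the Transformer.
- We also tried summing the losses computed from different training objectives, but no such configuration beat any single-objective model.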
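Sampled Softmax: Table 4 compares the softmax loss under different settings.
- In all cases the presence of in-batch negatives increases the diversity of retrieved results, but in-batch negatives alone give lower Recall@10 than mixed negatives.
- When we train with random negatives only, the model appears to collapse and retrieves very similar results for all users: when retrieving 10 Pins for each of 100,000 users from 1 million pins, only 1000 Pins account for 90% of the retrieved results. This suggests the model fails to learn the details of user interests, since it retrieves very similar content for most users.
- Overall, sample probability correction does not improve Recall@10 with random negatives. This is expected, because in that case all negatives should be equally likely to appear in the batch and sampling is unbiased.
- When in-batch negatives are included (alone or combined with random negatives), enabling sample probability correction increases recall and reduces global diversity. Given the large Recall@10 difference between in-batch negatives and mixed negatives, we choose mixed negatives with sample probability correction as our loss function, even though mixed negatives add slightly more complexity.
  Global diversity is measured by P90 Coverage@10.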
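Multi-task Learning: here we measure the performance difference between single-task and multi-task learning. For each of the 3 positive action types we train a model to predict that single action type (10s Closeup, 10s Click, Repin), and then we train one model to predict any of the 3 action types. Table 5 shows the results:
- When we train for a specific action type, treating only that action type as a positive label, we maximize Recall@10 for that action type.
- When we train on all 3 action types together, we maximize overall Recall@10 but do slightly worse on each individual task than the single-task setting. On every task-specific evaluation the multi-task model ranks second, so we choose the multi-task training objective as a trade-off between objectives, which ensures the final embedding is not strongly biased toward any one task.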
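Feature Ablations: Table 6 shows the effect of each feature on final model performance.
- The two features that contribute most to the final embedding are the timestamp and the PinSage embedding.
- Without the PinSage embedding the model cannot understand the content behind a user's actions. This shows up as lower Recall@10 and very low global diversity, which indicates that we retrieve a very similar set of results for all users.
- Removing any feature has a negative effect, so we include all features in PinnerFormer.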
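Sequence Length: Figure 4 shows the effect of sequence length on model performance. As the sequence length doubles, up to about 32, both Recall@10 and global diversity increase fairly consistently, but the gains diminish as the sequence gets longer. In this work we do not examine sequences longer than 256, because such models would require sacrifices in batch size or training resources.
- A smaller batch size makes comparison with shorter-sequence models impossible, because the negative pool used to learn the embeddings changes, and training takes longer.
- Using more machines (16 GPUs/2 machines for sequence length 512, 32 GPUs/4 machines for sequence length 1024) allows longer-sequence models to be trained, but reduces the number of training runs that can execute in parallel.
When fewer models can be trained in parallel, tuning modeling decisions becomes slower, so for PinnerFormer we choose a sequence length of 256 in the final model.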

1.3.2 Ranking A/B Experiments

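To better understand how PinnerFormer performs online, we ran several A/B experiments that use it as a feature in ranking models.
Homefeed: our first comparison is in Pinterest's Homefeed ranking model, which helps determine the order in which content is shown to users on Homefeed. Previously this model used a weighted average of the user's top k PinnerSage embeddings as a feature. In the treatment group we replace this PinnerSage aggregation with the single PinnerFormer embedding. For a fair comparison, the control and treatment ranking models are trained on data from the same date range.
Table 7 shows the main results of this experiment. PinnerFormer significantly improves Homefeed engagement and increases the number of daily and weekly active users. In the months after the experiment launched, none of the improved metrics decayed.
Ad: to validate the embedding outside the use case it was explicitly trained for, we also ran an A/B experiment that adds PinnerFormer to the Ads ranking models (without replacing PinnerSage). Each major surface (Homefeed, Related Pins, and Search) has a dedicated model that determines the order in which ads are shown to users, so we ran the experiment on each model separately. Overall, we see a significant lift in ad engagement on every surface, including clickthrough rate (CTR) and long clickthrough rate (gCTR), as shown in Table 8.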

1.4 Conclusion

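In this work we presented PinnerFormer, a single end-to-end learned embedding designed for inference in an offline setting that captures a user's interests over a horizon of multiple days.
Unlike other work that models users from their past actions, we do not focus directly on predicting the user's next engagement; instead we apply a novel loss function to capture the user's interests over a horizon of multiple days. We show that this training objective narrows the performance gap between a model inferred in realtime and a model inferred once a day. Through detailed experiments we also show how each component of the model contributes to overall performance, and demonstrate the effectiveness of multi-task learning and sampled softmax.
In the future we plan to study PinnerFormer's performance as a candidate generator in more depth, and to include actions other than pin engagement in the user action sequence, in order to build a more comprehensive user representation.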

II. Appendix (information for reproducibility)

2.1 Timestamp Encoding

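In addition to the raw timestamp, we use 2 derived values to represent time:
- For each action, the difference between the latest timestamp in the sequence and the action's timestamp.
- The time gap between every two consecutive actions in the sequence, with the last gap set to zero.
To encode a timestamp we modify Time2vec to use fixed periods and a logarithmic transformation of the raw time values. Given a timestamp $t$ and $P$ periods $\{p_1, p_2, \cdots , p_P\}$, we obtain $2P + 1$ features $r_1, \cdots , r_{2P+1}$:
$\begin{matrix} r (t)_{2 i - 1} = \cos \left(\frac{2 \pi t}{p_{i}} + \phi_{2 i - 1}\right), \quad r (t)_{2 i} = \sin \left(\frac{2 \pi t}{p_{i}} + \phi_{2 i}\right), \quad i = 1, \dots, P \\ r (t)_{2 P + 1} = \log (t) \end{matrix}$
where $\vec\phi=(\phi_1,\cdots,\phi_{2P})\in \mathbb R^{2P}$ is a learnable vector.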
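For absolute timestamps we use $P_{\text{abs}}$ periods of practical significance: 0.25 hours, 0.5 hours, 0.75 hours, 1 hour, 2 hours, 4 hours, 8 hours, 16 hours, 1 day, 7 days, 28 days, and 365 days.
We use $P_{\text{rel}} = 32$ periods evenly spaced on a logarithmic scale, ranging from one second to four weeks, to encode the relative time difference features. This assumes that it matters more for the model to distinguish short durations (for example, 10 seconds vs. 1 minute) than long durations (for example, 10 days vs. 11 days).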
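To summarize, for each action $A_i$ (with timestamp $t_i$), define:
- $t_i$ (the raw time value, for example a Unix timestamp).
- $\Delta_\text{latest} = t_\text{latest} - t_i$, where $t_\text{latest}$ is the timestamp of the last action in the user's sequence.
- $\Delta_\text{gap} = t_i - t_{i-1}$. For the first action in the sequence this value can be set to 0; after the last action (that is, at inference time) it is also set to 0.
Time2Vec encoding: each time value $\tau$ (one of $t_i$, $\Delta_\text{latest}$, $\Delta_\text{gap}$) is mapped to a $(2P+1)$-dimensional vector:
$\begin{matrix} r_{2 i - 1} = \cos \left(\frac{2 \pi \tau}{p_{i}} + \phi_{2 i - 1}\right), \quad r_{2 i} = \sin \left(\frac{2 \pi \tau}{p_{i}} + \phi_{2 i}\right), \quad i = 1, \dots, P \\ r_{2 P + 1} = \log (\tau) \end{matrix}$
where $\{\phi_i\}_{i=1}^{2P}$ are learnable phases and $\{p_i\}_{i=1}^{P}$ are the prespecified periods.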
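The concrete period sets and feature dimensions are:
- Absolute timestamp (25 dimensions): uses $P_\text{abs} = 12$ periods: 0.25h, 0.5h, 0.75h, 1h, 2h, 4h, 8h, 16h, 1d, 7d, 28d, 365d. Each period yields one cos and one sin feature, which together with $\log t_i$ gives $25 = 12 \times 2 + 1$ features.
- Time since the latest action (65 dimensions): uses $P_\text{rel}= 32$ periods, evenly spaced on a log scale from 1 second to 4 weeks:
  $p_{i} \in \text{logspace} (1 \text{ second}, 4 \text{ weeks}), \quad i = 1, \dots, 32$
  which gives $65 = 32\times 2 + 1$ features.
- Time gap between actions (65 dimensions): uses the same $P_\text{rel} =32$ periods, which gives $65 = 32\times 2 + 1$ features.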
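The encoding and the feature counts listed above can be checked with a short sketch. Several details are assumptions made for this illustration: time values are in seconds, the phases are passed in as a plain list (in the model they are learnable), and $\tau$ is clamped to at least 1 second before the logarithm so that a zero time difference does not produce $\log 0$.

```python
import math

HOUR = 3600.0
# The 12 absolute-time periods listed above, converted to seconds.
ABS_PERIODS = [h * HOUR for h in
               (0.25, 0.5, 0.75, 1, 2, 4, 8, 16, 24, 24 * 7, 24 * 28, 24 * 365)]

def log_spaced_periods(low, high, count):
    # `count` periods evenly spaced on a log scale between `low` and `high` (inclusive).
    step = (math.log(high) - math.log(low)) / (count - 1)
    return [math.exp(math.log(low) + i * step) for i in range(count)]

# 32 relative-time periods from 1 second to 4 weeks.
REL_PERIODS = log_spaced_periods(1.0, 4 * 7 * 24 * HOUR, 32)

def encode_time(tau, periods, phases):
    """Return 2P + 1 features: a cos/sin pair per period plus one log feature."""
    feats = []
    for i, p in enumerate(periods):
        angle = 2.0 * math.pi * tau / p
        feats.append(math.cos(angle + phases[2 * i]))      # r_{2i-1}
        feats.append(math.sin(angle + phases[2 * i + 1]))  # r_{2i}
    feats.append(math.log(max(tau, 1.0)))  # assumption: clamp to avoid log(0)
    return feats

abs_feats = encode_time(1.7e9, ABS_PERIODS, [0.0] * 24)
rel_feats = encode_time(30.0, REL_PERIODS, [0.0] * 64)
```

With three encoded time values per action, the time features contribute 25 + 65 + 65 = 155 dimensions to the concatenated input vector.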

2.2 Model Architecture

Here we describe the Transformer architecture we use in more detail.
- We build the input matrix $\mathbf A = \left(\mathbf{\vec a}_{T}, \cdots, \mathbf{\vec a}_{T - M + 1}\right)^\top \in \mathbb{R}^{M \times D_{\text{in}}}$ from the vector representations of the $M$ actions leading up to action $A_{T + 1}$.
- We project $\mathbf A$ to the hidden dimension $H$ with a learnable matrix $\mathbf W\in \mathbb{R}^{D_{\text{in}}\times H}$ and add a positional encoding $\text{PE}\in \mathbb{R}^{M\times H}$. This gives the Transformer input $\mathbf V^{(0)} = \mathbf A\mathbf W + \text{PE}\in \mathbb{R}^{M\times H}$.
- We then apply a standard Transformer made of alternating 2-layer feedforward network (FFN) blocks and multi-head self attention (MHSA) blocks, where the hidden dimension of the FFN is four times the Transformer hidden dimension, that is, $4H$. In every MHSA block we apply masking so that a given output can attend only to the current or previous sequence elements (a causal mask).
  The architecture can be written as:
  $\begin{matrix} \mathbf U^{(l)} = \mathbf V^{(l - 1)} + \text{MHSA} \left(\text{LayerNorm} \left(\mathbf V^{(l - 1)}\right)\right) \\ \mathbf V^{(l)} = \mathbf U^{(l)} + \text{FFN} \left(\text{LayerNorm} \left(\mathbf U^{(l)}\right)\right), \quad l = 1, \dots, L \end{matrix}$
  As described in the main text, this is PreNorm.
- After the inputs are transformed, the final hidden state $\mathbf V^{(L)}\in \mathbb{R}^{M\times H}$ is fed into a 2-layer MLP and the output is $L_{2}$-normalized.
  The output MLP after the Transformer is defined as:
  $\mathbf W_{2}^{\top} \text{GELU} \left(\mathbf W_{1}^{\top} \text{LayerNorm} (\mathbf{\vec v}) + \mathbf{\vec b}_{1}\right) + \mathbf{\vec b}_{2}$
  where:
  - $\mathbf W_1\in \mathbb R^{H\times 4H}, \mathbf W_2\in \mathbb R^{4H\times D}, \mathbf{\vec b}_1\in \mathbb R^{4H}, \mathbf{\vec b}_2\in \mathbb R^{D}$, with $H$ the transformer hidden dimension and $D$ the embedding dimension.
  - $\mathbf{\vec v}\in \mathbb R^{H}$ is a single embedding, the transformer output at one position.
  This gives the embeddings $\mathbf E = \left(\mathbf{\vec e}_{1}, \cdots ,\mathbf{\vec e}_{M}\right)^\top \in \mathbb{R}^{M \times D}$, where $D$ is the final embedding dimension. We take $\mathbf{\vec e}_1$, the first row of $\mathbf E$ and the most recent output, as the final user embedding.
  Both the final user embedding ($\mathbf{\vec e}_1$) and the pin representation (the pin tower output) are $L_2$-normalized.
- If a user has fewer than $M$ engagements, the input sequence is padded to length $M$, and the padded positions can be masked in the attention and loss computations, much as they are handled in language modeling tasks.

2.3 Mixed Negative Sampling Masking

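Figure 5 shows an example of mixed negative sampling with masking. GPU 1 holds two user embeddings for user $U_{1}$, and GPU 2 holds one embedding each for users $U_{2}$ and $U_{3}$. $U_{1}$ has positives $P_{1}$ and $P_{2}$, $U_{2}$ has positive $P_{3}$, and $U_{3}$ has positive $P_{4}$; $N_{1}$ and $N_{2}$ are random negatives. When computing the loss for positive $P_{2}$ (respectively $P_{1}$), $P_{1}$ (respectively $P_{2}$) is masked, because both are positive examples of $U_{1}$. All four positives appear in both processes because they are synchronized across GPUs before the loss computation. Because each GPU's final loss computation gives equal weight to the users on that GPU, $U_{1}$ ends up with twice the weight of $U_{2}$ or $U_{3}$. In practice a batch contains many users, so even if not perfectly uniform, the weights are almost uniform across all GPUs.
In our experiments we cap the number of in-batch negatives at 5000 and fix the number of random negatives at 8192.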
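The masking rule in the Figure 5 example can be sketched as a boolean matrix over the shared candidate pool. The function name and the dictionary of engaged pins are hypothetical; the sketch assumes the positives have already been gathered across GPUs and ignores the cap on in-batch negatives.

```python
def build_negative_mask(user_ids, positive_pin_ids, engaged_by_user, random_negative_ids):
    """Return (candidates, mask); mask[i][j] is True if candidate j may be a negative for pair i."""
    # Candidate pool: every positive in the (synchronized) batch plus the random negatives.
    candidates = list(positive_pin_ids) + list(random_negative_ids)
    mask = []
    for user in user_ids:
        engaged = engaged_by_user[user]
        # A pin the user positively engaged with must never be used as that user's negative.
        mask.append([pin not in engaged for pin in candidates])
    return candidates, mask

# The Figure 5 setup: U1 has two pairs, U2 and U3 have one pair each.
user_ids = ["U1", "U1", "U2", "U3"]
positive_pin_ids = ["P1", "P2", "P3", "P4"]
engaged_by_user = {"U1": {"P1", "P2"}, "U2": {"P3"}, "U3": {"P4"}}
candidates, mask = build_negative_mask(user_ids, positive_pin_ids, engaged_by_user, ["N1", "N2"])
```

Each pair's own positive is masked out of the negative pool as well, since it enters the loss separately as the positive logit.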

2.4 Architecture Ablations

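Here we show the effect of changing the model's hyperparameters.
Sequence Selection: we thoroughly explored removing weaker engagement from a user's history in order to produce better embeddings, but sparsifying user sequences did not bring any significant positive result.
Embedding Dimension: Figure 6 shows how changing the dimension of the final embedding affects overall performance.
- Recall@10 shows diminishing returns as the embedding dimension grows, especially beyond an embedding size of 128.
- At smaller dimensions the embedding tends to retrieve similar results for most users, which may indicate some degree of memorization of popularity. Each user's retrieved results can be more diverse at smaller dimensions, but because this sacrifices significant Recall@10, it is not a good trade-off.
We choose a 256-dimensional embedding because it gives good offline metrics and matches the size of most existing embedding features used in Pinterest ranking models. The small gain from increasing the embedding to 1024 dimensions is not worth quadrupling the storage cost for most downstream use cases.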
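Transformer Architecture: Table 9 shows the effect of model capacity on final performance.
- Larger models improve recall, whether capacity is added through more layers or a larger hidden size.
- Changing the number of heads used by multi-head self attention makes no significant difference, so we fix it at head = 8.
Modifications to SASRec: in the original paper the SASRec model is trained on a binary cross entropy task without any sample probability correction. We make two modifications:
- (a) We give the loss on $\mathbf{\vec e}_1$ (the latest user embedding) a weight equal to that of the other positions.
  In fact, in SASRec the loss at every position has weight $1/M$, where $M$ is the sequence length.
- (b) We replace binary cross entropy with sampled softmax.
Table 10 shows that our modifications significantly improve recall.
Sampled softmax is closer to a list-wise objective, because the positive competes against all sampled negatives within a single softmax, whereas BCE is a point-wise objective that scores each item independently. Sampled softmax is therefore a better match for the retrieval task.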