2023_VQ-Rec

1. VQ-Rec [2023]

"Learning Vector-Quantized Item Representation for Transferable Sequential Recommenders"

Recently, the generality of natural language text has been leveraged to develop transferable recommender systems. The core idea is to use a pre-trained language model (PLM) to encode item text into item representations. Despite its promising transferability, the binding between item text and item representations may be too tight, which leads to potential problems such as over-emphasizing the effect of text features and exaggerating the negative impact of the domain gap.
To address this issue, the paper proposes VQ-Rec, a novel approach to learning Vector-Quantized item representations for transferable sequential Recommenders. The key novelty is a new item representation scheme:
- First map the item text into a vector of discrete indices (called the item code).
- Then use these indices to look up the code embedding table and derive the item representation.
The scheme can be denoted as "text => code => representation". Based on this scheme, the paper further proposes an enhanced contrastive pre-training approach that uses semi-synthetic and mixed-domain code representations as hard negatives. In addition, it designs a cross-domain fine-tuning method based on a differentiable permutation-based network. Extensive experiments on six public benchmark datasets demonstrate the effectiveness of the approach in both cross-domain and cross-platform settings. Code and pre-trained models are available at https://github.com/RUCAIBox/VQ-Rec.
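When transferring to a new domain, both the PQ centroids and the embedding table need to be updated (because the meaning of each code index changes), so as to bridge the cross-domain semantic gap (i.e., code-embedding alignment). For the embedding table update, the paper reuses the existing embedding table by permuting it, rather than retraining the embedding table from scratch.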
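Sequential recommender systems have been widely deployed on various application platforms to recommend items of interest to users. Typically, the task is formulated as a sequence prediction problem: infer the items a user is likely to interact with next, based on the user's historical interaction sequences. Although different sequential recommenders share a similar task formulation, it is still difficult to transfer a well-trained recommender to a new recommendation scenario. For example, a new domain with specific interaction characteristics may require training a recommender from scratch, which is time-consuming and may suffer from the cold-start problem. It is therefore important to develop transferable sequential recommenders that can quickly adapt to new domains or scenarios.
To this end, early studies in recommender systems mainly relied on cross-domain recommendation methods to transfer knowledge learned from existing domains to a new domain. These studies usually assume shared information (e.g., overlapping users/items or common features) in order to learn cross-domain mappings. In practice, however, users or items in different domains (especially in the cross-platform setting) overlap only partially or not at all, which makes effective cross-domain transfer hard to achieve. Moreover, previous content-based transfer methods usually design specific approaches tailored to the data format of the shared features, and are hard to apply across recommendation scenarios.
As a more recent approach, several studies leverage the generality of natural language text (i.e., the titles and descriptions of items, called item text) to bridge the domain gap in recommender systems. The core idea is to use the text encodings learned by pre-trained language models (PLMs) as universal item representations. Based on such item representations, sequential recommenders pre-trained on interaction data from a mixture of multiple domains have shown promising transferability. This paradigm can be denoted as "text => representation". Despite its effectiveness, the paper argues that the binding between item text and item representations in existing methods is too tight, which leads to two potential problems:
- First, since these methods use text encodings directly to derive item representations (without item IDs), text semantics directly influence the recommendation model. The recommender may over-emphasize the effect of text features (e.g., generate recommendations that are very similar in text) and neglect the sequential characteristics reflected in the interaction data.
- Second, text encodings from different domains (with different distributions and semantics) are not naturally aligned in a unified semantic space. The domain gap in the text encodings may degrade performance during multi-domain pre-training, and the tight binding between text encodings and item representations further exaggerates the negative impact of this domain gap.
To address these problems, the paper introduces intermediate discrete item indices (called codes) into the item representation scheme, so as to loosen the strong binding between item text and item representations. The scheme can be denoted as "text => code => representation". Instead of mapping text encodings directly into item representations, the paper uses a two-step item representation scheme. For a given item:
- First map the item text into a vector of discrete indices (i.e., the item code).
- Then aggregate the embeddings that correspond to the item code to obtain the item representation.
This representation scheme has two main merits:
- First, item text is mainly used to generate discrete codes, which reduces its direct influence on the recommendation model while still injecting useful text semantics.
- Second, the two mapping steps can be learned or tuned for downstream domains or tasks, which makes the scheme more flexible when adapting to new recommendation scenarios.
To realize this approach, two key challenges must be addressed:
- (i): How to learn discrete item codes that are sufficiently distinguishable for accurate recommendation.
- (ii): How to effectively pre-train and adapt the item representations, given that distributions and semantics vary greatly across domains.
To this end, the paper proposes VQ-Rec, a novel approach to learning Vector-Quantized item representations for transferable sequential Recommenders. Unlike existing transferable recommenders based on PLM encodings, VQ-Rec maps each item into a discrete $D$-dimensional code that serves as the indices for embedding lookup. To obtain semantically rich and distinguishable item codes, the paper uses optimized product quantization (OPQ) to discretize the text encodings of items. In this way, the discrete codes preserve textual semantics and are distributed more uniformly over the item set, so they are more distinguishable. Since the representation scheme does not modify the underlying backbone (i.e., the Transformer), it is broadly applicable to various sequential architectures.
- To capture transferable patterns based on item codes, the paper pre-trains the recommender on a mixture of multiple domains with contrastive learning, and uses both mixed-domain code representations and semi-synthetic code representations as hard negatives to enhance contrastive training.
- To transfer the pre-trained model to a downstream domain, the paper proposes a differentiable permutation-based network to learn the code-embedding alignment, and then further updates the code embedding table to fit the downstream domain. This fine-tuning is highly parameter-efficient, since only the parameters related to item representations are tuned.
  Here the authors do not retrain the code embedding table from scratch; they permute it instead.
In the experiments, the paper evaluates on six benchmark datasets covering both cross-domain and cross-platform scenarios. The results show that the approach has strong transferability. In particular, inductive recommenders based purely on item text can recommend new items without retraining, and can also achieve better performance on known items.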

1.1 Related Work

Sequential recommendation: sequential recommendation aims to predict the next interacted items based on historical interaction sequences.
Early work follows the Markov Chain assumption ("Factorizing personalized Markov chains for next-basket recommendation"), while recent studies mainly focus on designing different neural network models, including Recurrent Neural Networks (RNN) ("Session-based Recommendations with Recurrent Neural Networks", "Neural Attentive Session-based Recommendation"), Convolutional Neural Networks (CNN) ("Personalized Top-N Sequential Recommendation via Convolutional Sequence Embedding"), Transformers ("Locker: Locally Constrained Self-Attentive Sequential Recommendation", "CORE: Simple and Effective Session-based Recommendation within Consistent Representation Space", "Self-Attentive Sequential Recommendation", "BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer"), Graph Neural Networks (GNN) ("Sequential Recommendation with Graph Neural Networks", "Session-Based Recommendation with Graph Neural Networks"), and Multilayer Perceptrons (MLP) ("Filter-enhanced MLP is All You Need for Sequential Recommendation"). However, most of these methods rely on item IDs or attributes defined within a specific domain, and can hardly leverage behavior sequences from other domains or platforms.
More recently, some studies have used textual or visual features as transferable item representations ("Zero-Shot Recommender Systems", "Towards Universal Sequence Representation Learning for Recommender Systems", "ID-Agnostic User Behavior Pre-training for Sequential Recommendation", "TransRec: Learning Transferable Recommendation from Mixture-of-Modality Feedback"). In addition, several studies build a unified model on top of a PLM to solve multiple recommendation-related tasks ("M6-Rec: Generative Pretrained Language Models are Open-Ended Recommender Systems", "Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5)"). This paper builds on these studies but has a different focus: it introduces discrete codes to decouple the binding between text encodings and item representations, and enhances representation capacity with specially designed pre-training and fine-tuning strategies.
Transfer learning in recommender systems: to alleviate the data sparsity and cold-start problems that are widespread in recommender systems, researchers have explored transferring knowledge from other domains, markets, or platforms. Existing methods mainly rely on information shared between source and target domains, such as common users, common items, or common attributes.
In recent years, pre-trained language models (PLMs) have been shown to be a general semantic bridge that connects different tasks or domains, and several studies encode the text associated with items by PLMs as universal item representations ("Zero-Shot Recommender Systems", "Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5)", "Towards Universal Sequence Representation Learning for Recommender Systems"). By pre-training on universal item representations, the fused knowledge can be transferred to downstream domains without overlapping users or items. However, these studies usually enforce a tight binding between the PLM text encodings and the final item representations, which may over-emphasize the effect of text features. In contrast, this paper proposes a new two-step representation scheme based on discrete codes, which has stronger representation capacity and can improve cross-domain recommendation performance.
Sparse representation in recommender systems: learning sparse representations is a widely adopted way to represent data objects in machine learning, and has inspired a line of related research such as product quantization, multi-way compact embedding, and semantic hashing. Unlike continuous representations, sparse representations aim to capture the most salient representation dimensions with a sparse scheme. In particular, discrete representations have also been applied to recommender systems, where existing studies mainly aim to develop memory- and time-efficient recommendation algorithms for building large-scale recommender systems. Unlike these studies, this paper aims to exploit the generality of text semantics and learn transferable item representations through pre-training.

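1.2 Method

This section describes VQ-Rec, the proposed transferable sequential recommendation approach based on vector-quantized item indices.

1.2.1 Method Overview

Task formulation: the paper considers a sequential recommendation setting in which multi-domain interaction data is available as training (or pre-training) data. Formally, the interaction data of a user in some domain is an interaction sequence $\mathbf s = \{i_1, i_2, \cdots, i_n\}$ (in chronological order), where each interacted item $i$ is associated with a unique item ID and text data (e.g., a title or description, i.e., the item text). Since a user may interact with items from multiple domains, multiple interaction sequences can be generated for each user. Considering the large semantic gap between different domains ("Towards Universal Sequence Representation Learning for Recommender Systems"), the paper does not merge the multiple interaction sequences of a user into a single sequence, but keeps one interaction sequence per domain. Note that the approach does not directly use item IDs when deriving item representations. The goal is to pre-train a transferable sequential recommender that can effectively adapt to new domains (domains unseen in the training data).
Solution overview: to build the sequential recommender, the paper adopts the mainstream Transformer architecture as the backbone. It is built on the self-attention mechanism and takes item embeddings and positional embeddings as input at each time step. Unlike previous related work ("Towards Universal Sequence Representation Learning for Recommender Systems"), the paper does not add any extra components (such as adaptors) to the Transformer architecture; instead, it feeds the backbone with transferable item representations.
The key novelty of the approach is a new item representation scheme for sequential recommenders:
- First map the item text into a vector of discrete indices (called the item code).
- Then use these indices to look up the code embedding table and derive the item representation.
The scheme can be denoted as "text => code => representation", and it breaks the tight binding between item text and item representations. To learn and transfer such item representations, the paper further proposes specific strategies for contrastive recommender pre-training and cross-domain recommender fine-tuning.
The overall framework of VQ-Rec is shown in Figure 1. The paper argues that developing transferable recommenders involves three key questions:
- (i): How to represent items with vector-quantized code representations.
- (ii): How to train recommenders based on the new representation scheme.
- (iii): How to transfer the pre-trained recommender to new domains.
The overall framework has two stages, pre-training and fine-tuning:
- pre-training: train the code embedding table $\mathbf E$ and the sequence encoder. In addition, a PQ is trained independently.
- fine-tuning:
  - First, independently train a new PQ on the new domain.
  - Then, fix the code embedding table $\mathbf E$ and the sequence encoder, and learn the permutation matrices $\mathbf \Pi$.
  - Finally, fix $\mathbf \Pi$ and the sequence encoder, and fine-tune the code embedding table $\mathbf E$.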

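1.2.2 Vector-Quantized Item Representation

As mentioned above, the paper proposes a two-step item representation scheme:
- (i): Encode the item text with a PLM and map it into a vector of discrete codes.
- (ii): Use the discrete codes to look up the code embedding table and derive the item representation.
The representation scheme is described in detail next.
Vector-Quantized Code Learning: this part studies how to map item text into a discrete code. To leverage the generality of natural language text, the descriptive text of items is first encoded into text encodings by a PLM; then the mapping between text encodings and discrete codes is built with optimized product quantization. The procedure is as follows:
- (1): PLM-based item text => text encodings. The paper uses the widely used BERT model to encode the item text. Given an item $i$ with item text $t_i$ consisting of words $\{w_1, \cdots, w_c\}$, a special token $\text{[CLS]}$ is first prepended, and the extended text is fed into BERT to obtain the text encoding of item $i$:
  $\mathbf{\vec x}_i = \text{BERT}\left(\left[\text{[CLS]}; w_1; \cdots; w_c\right]\right) \in \mathbb{R}^{d_W}$
  where $\mathbf{\vec x}_i$ is the final hidden vector corresponding to the input token $\text{[CLS]}$, and $[\cdot;\cdot]$ denotes concatenation.
  Note that the parameters of the BERT encoder are fixed in this paper.
  Why BERT? Would other PLMs work as well? The authors do not elaborate.
- (2): PQ-based text encodings => discrete codes. For item $i$, product quantization (PQ) ("Product Quantization for Nearest Neighbor Search") is used to map the text encoding $\mathbf{\vec x}_i$ into a vector of discrete codes.
  PQ defines $D$ vector sets, each containing $M$ centroid embeddings of dimension $d_W/D$. Let $\mathbf{\vec a}_{k,j} \in \mathbb{R}^{d_W/D}$ denote the $j$-th centroid embedding of the $k$-th vector set. Given the text encoding $\mathbf{\vec x}_i$, PQ first splits it into $D$ sub-vectors $\mathbf{\vec x}_i = \left[\mathbf{\vec x}_{i,1}; \cdots ; \mathbf{\vec x}_{i,D}\right]$. Then, for each sub-vector, the index of the nearest centroid embedding in the corresponding set is selected, and these indices compose the discrete code (i.e., a vector of indices). For the $k$-th sub-vector of item $i$, the selected index is formally defined as:
  $c_{i,k} = \arg\min_{1 \le j \le M} \left\|\mathbf{\vec x}_{i,k} - \mathbf{\vec a}_{k,j}\right\|^2 \in \{1, 2, \cdots, M\}$
  where $c_{i,k}$ is the $k$-th dimension of the discrete code representation of item $i$.
  To learn the PQ centroid embeddings (i.e., $\left\{\mathbf{\vec a}_{k,j}\right\}$), the paper adopts the mainstream optimized product quantization (OPQ) method ("Optimized product quantization") and optimizes the centroid embeddings on the text encodings of items from all training domains. Once the centroid embeddings are learned, the index of each dimension can be assigned independently with the formula above, which gives the discrete code representation of item $i$: $\mathbf{\vec c}_i = (c_{i,1}, \cdots, c_{i,D})\in \mathbb N_+^D$.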
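- The problem with PQ and the advantage of OPQ: plain PQ assumes that the sub-spaces are mutually independent, but dimensions of real data may be correlated, so splitting the vector directly can cause a large quantization error. The core idea of OPQ is to apply an orthogonal transform (a rotation) to the original vector space before quantization, so that the transformed sub-spaces are closer to independent and the quantization error is reduced.
- How PQ learns $\left\{\mathbf{\vec a}_{k,j}\right\}$: run k-means independently in each sub-space to obtain $M$ centroids.
- How OPQ learns $\left\{\mathbf{\vec a}_{k,j}\right\}$:
  - Introduce an orthogonal matrix $\mathbf R\in \mathbb R^{d_W\times d_W}$ and rotate the original vector: $\mathbf{\vec y} = \mathbf R\mathbf{\vec x}$.
  - Training objective: minimize the reconstruction error:
    $\min_{\mathbf R, \left\{\mathbf{\vec a}_{k,j}\right\}} \sum_i \left\|\mathbf{\vec x}_i - \mathbf R^\top \text{Quantize}\left(\mathbf R \mathbf{\vec x}_i\right)\right\|^2, \quad \text{s.t. } \mathbf R^\top \mathbf R = \mathbf I$
    where Quantize denotes the PQ quantization operation (similar to k-means).
  - Alternating optimization: fix $\mathbf R$ and update the centroids with k-means; then fix the centroids and update $\mathbf R$ (the rotation update is an orthogonal Procrustes problem, which is typically solved via SVD).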
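The index assignment $c_{i,k} = \arg\min_j \|\mathbf{\vec x}_{i,k} - \mathbf{\vec a}_{k,j}\|^2$ can be sketched in a few lines. This is a minimal illustration that assumes the centroids are already learned and skips the OPQ rotation; the function name `pq_encode` and the toy centroids are hypothetical and not taken from the paper or its code. It uses only the standard library so that it runs without dependencies.

```python
def pq_encode(x, centroids):
    """Map a text encoding x (length d_W) to D discrete indices.

    centroids[k][j] is the j-th centroid of the k-th sub-space, each of
    length d_W / D. Returned indices are 1-based, matching {1, ..., M}.
    """
    D = len(centroids)
    sub_dim = len(x) // D
    code = []
    for k in range(D):
        sub = x[k * sub_dim:(k + 1) * sub_dim]  # k-th sub-vector x_{i,k}
        dists = [sum((s - a) ** 2 for s, a in zip(sub, c)) for c in centroids[k]]
        code.append(dists.index(min(dists)) + 1)  # nearest centroid, 1-based
    return code


# Toy setting (assumed values): d_W = 4, D = 2 sub-spaces, M = 2 centroids each.
toy_centroids = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
toy_code = pq_encode([0.9, 1.1, 0.8, 0.1], toy_centroids)
print(toy_code)
```

Each sub-vector is quantized independently, which is why the code has exactly $D$ entries and each entry ranges over $M$ values.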
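Code Embedding Lookup as Item Representations: given the learned discrete item codes, item representations can be derived directly by embedding lookup.
Since item codes are represented as $D$ discrete indices $(c_{i,1}, \cdots, c_{i,D})$, we set up $D$ code embedding matrices for the item codes (together called the code embedding table). For the $k$-th dimension of the code representation, the discrete codes of all items share a common code embedding matrix $\mathbf E^{(k)} \in \mathbb{R}^{M \times d_V}$, where $d_V$ is the dimension of the final item representations. By embedding lookup, we obtain the code embeddings of item $i$ as $\left\{\mathbf{\vec e}_{1,c_{i,1}}, \cdots, \mathbf{\vec e}_{D,c_{i,D}}\right\}$, where $\mathbf{\vec e}_{k,c_{i,k}} \in \mathbb{R}^{d_V}$ is the $c_{i,k}$-th row of $\mathbf E^{(k)}$.
Why use $D$ different embedding tables instead of sharing one?
- Required by the PQ structure: product quantization is, in essence, independent quantization over multiple sub-spaces.
- Semantic independence: different dimensions encode different aspects of the semantics.
Note that the code embedding $\mathbf{\vec e}_{k,j} \in \mathbb{R}^{d_V}$ is different from the corresponding PQ centroid embedding $\mathbf{\vec a}_{k,j} \in \mathbb{R}^{d_W/D}$. For pre-training, $\mathbf E^{(k)}$ is randomly initialized.
After obtaining the code embeddings of item $i$, we aggregate them to derive the final item representation:
$\mathbf{\vec v}_i = \text{Pool}\left(\left[\mathbf{\vec e}_{1,c_{i,1}}; \cdots; \mathbf{\vec e}_{D,c_{i,D}}\right]\right) \in \mathbb{R}^{d_V}$
where $\mathbf{\vec v}_i$ is the item representation of item $i$, and $\text{Pool}(\cdot): \mathbb{R}^{D \times d_V} \to \mathbb{R}^{d_V}$ is the mean pooling function.
Other aggregation functions (e.g., attention-based aggregation) could also be considered.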
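The lookup-then-pool step can be sketched as follows. The helper name `emb_pool` and the tiny tables are hypothetical; in practice $\mathbf E^{(k)}$ would be learned parameters in a deep learning framework, whereas plain lists are used here only to make the arithmetic visible.

```python
def emb_pool(code, tables):
    """Look up one row per code dimension and mean-pool the rows.

    tables[k] plays the role of E^(k): M rows, each of length d_V.
    code[k] is a 1-based index into tables[k].
    """
    D = len(code)
    rows = [tables[k][code[k] - 1] for k in range(D)]  # e_{k, c_{i,k}}
    d_v = len(rows[0])
    return [sum(row[t] for row in rows) / D for t in range(d_v)]


# Toy tables (assumed values): D = 2, M = 2, d_V = 3.
toy_tables = [
    [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]],
    [[0.0, 0.0, 4.0], [2.0, 2.0, 2.0]],
]
v_item = emb_pool([2, 1], toy_tables)
print(v_item)
```

Two items that share some, but not all, code indices end up with related but distinct representations, which is what makes the pooled lookup useful for similar items.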
Representation Distinguishability vs. Code Uniformity: for accurate recommendation, item representations should be distinguishable over a large candidate space, especially for items with similar text encodings. In this approach, collisions between the discrete codes of any two items (i.e., two items assigned the same code) should be avoided as much as possible. "Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval" shows that the collision probability is minimized when vectors are quantized uniformly over all possible discrete codes. Ideally, therefore, the discrete codes should be uniformly distributed, i.e., the code $c_{i,k}$ takes each value in $\{1,2,\cdots,M\}$ with equal probability. According to previous empirical results ("K-means clustering versus validation measures: a data distribution perspective", "Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance", "Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval"), the technique used to train OPQ (i.e., K-means) tends to produce clusters of relatively uniform sizes. This suggests that OPQ can generate highly distinguishable discrete code representations for items.

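1.2.3 Contrastive Recommender Pre-training

Compared with direct text mapping methods ("Zero-Shot Recommender Systems", "Towards Universal Sequence Representation Learning for Recommender Systems"), the proposed item representation is harder to optimize, because it involves discrete codes and a two-step representation mapping.
The sequential encoder architecture is introduced first, followed by the proposed contrastive pre-training task. To improve the training of sequential recommenders, the paper uses both mixed-domain negatives and semi-synthetic negatives.
Self-attentive Sequence Encoding: given a sequence of item representations derived from the vector-quantized item codes, a sequential encoder is used to obtain the sequence representation. Following SASRec, the paper adopts the mainstream self-attentive Transformer architecture. Specifically, a typical Transformer encoder consists of stacked multi-head self-attention layers (denoted $\text{Attn}(\cdot)$) and multilayer perceptron networks (denoted $\text{FFN}(\cdot)$). The index representations $\mathbf{\vec v}_j$ and the absolute position embeddings $\mathbf{\vec p}_j$ are summed as the input of the Transformer encoder at position $j$. The update process is formally defined as:
$\begin{matrix} \mathbf{\vec f}_j^0 = \mathbf{\vec v}_j + \mathbf{\vec p}_j \\ \mathbf F^{l+1} = \text{FFN}\left(\text{Attn}\left(\mathbf F^l\right)\right), \quad l \in \{0, 1, \cdots, L-1\} \end{matrix}$
where $\mathbf F^l = \left[\mathbf{\vec f}_1^l;\cdots;\mathbf{\vec f}_n^l\right]\in \mathbb R^{n\times d_V}$ denotes the hidden states at the $l$-th layer, $n$ is the sequence length, and $L$ is the number of encoder layers.
The final hidden state at the $n$-th position, $\mathbf{\vec f}_n^L$, is used as the sequence representation.
Enhanced Contrastive Pre-training: to optimize the sequential recommender, a common approach is to use a batch-level optimization objective based on contrastive learning. Specifically, consider a batch of $B$ training instances, each consisting of a sequential context (i.e., the historical interaction sequence) and the ground-truth next item. They are encoded into representations $\left\{\langle\mathbf{\vec s}_1, \mathbf{\vec v}_1\rangle, \cdots, \langle\mathbf{\vec s}_B, \mathbf{\vec v}_B\rangle\right\}$, where:
- $\mathbf{\vec s}_j$ is the $j$-th normalized sequence representation (i.e., the final hidden state at the last position of that sequence).
- $\mathbf{\vec v}_j$ is the normalized representation of the positive item paired with $\mathbf{\vec s}_j$ (i.e., the final item representation obtained by pooling the code embeddings of the positive item).
A key step in contrastive training is to sample negatives to contrast against the positives, which is usually done by random sampling. However, random negative sampling does not work well in this approach, for two main reasons:
- First, since the representation scheme involves discrete codes, the space of possible discrete indices is enormous (with $D$ dimensions and $M$ values per dimension, there are $M^D$ combinations), while the set of discrete indices observed in the training data is much smaller. This causes a representation sparsity problem.
- Second, when learning on multi-domain training data, the domain gap needs to be alleviated effectively.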
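To get a feel for the sparsity in the first point, here is a rough count. It assumes the values $M = 32$ and $D = 256$ as they are listed in the implementation details of these notes; other settings change the number but not the conclusion.

$M^D = 32^{256} = \left(2^5\right)^{256} = 2^{1280} \approx 10^{385}$

Any realistic item catalogue covers only a vanishing fraction of this space, so almost all index combinations never appear as true item codes during training.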
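To address these two issues, the paper designs two kinds of enhanced negatives accordingly.
- Semi-synthetic negatives: besides the item indices that exist in the training set, the paper also synthesizes augmented item indices as negatives to alleviate the representation sparsity problem. However, fully-synthesized indices may be too far away from the ground-truth items in the sparse code representation space to provide useful guidance for contrastive learning ("Hard negative mixing for contrastive learning"). Therefore, the paper generates semi-synthetic codes based on true item codes and uses them as hard negatives.
  Given a true item code $\mathbf{\vec c}_i$, each index is randomly replaced with probability $\rho \in (0,1)$, while the remaining indices are kept unchanged. The item representation derived from the semi-synthetic code is then defined as:
  $\tilde{\mathbf{\vec v}}_i = \text{EmbPool}\left(G\left(\mathbf{\vec c}_i\right)\right)$
  where:
  - $\tilde{\mathbf{\vec v}}_i$ is the representation of the semi-synthetic hard negative.
  - $\text{EmbPool}(\cdot)$ is the embedding lookup and aggregation operation described in the previous section.
  - $G(\cdot)$ is a point-wise generation function:
    $G\left(c_{i,j}\right) = \begin{cases} \text{Uni}\left(\{1, 2, \cdots, M\}\right), & X = 1 \\ c_{i,j}, & X = 0 \end{cases}$
    where $X \sim \text{Bernoulli}(\rho)$, and $\text{Uni}(\cdot)$ samples an index uniformly from the input set.
  Note that uniform sampling ensures that the codes of the semi-synthetic indices follow a distribution similar to that of true items; as discussed in the previous section, such codes are highly distinguishable.
  Could non-uniform sampling be considered here, e.g., sampling according to the frequency observed in the training set? Worth a try.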
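The point-wise generator $G(\cdot)$ can be sketched as below. The function name `semi_synthetic_code` and the toy code are hypothetical; the only ingredients taken from the text are the replacement probability $\rho$ and the uniform draw over $\{1, \cdots, M\}$.

```python
import random


def semi_synthetic_code(code, M, rho, rng):
    """Replace each index with probability rho by a uniform draw from {1..M}.

    Indices that are not selected for replacement are kept unchanged. A
    replaced index may, by chance, be re-drawn as its original value.
    """
    return [rng.randint(1, M) if rng.random() < rho else c for c in code]


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
true_code = [3, 1, 4, 1, 5, 9, 2, 6]
negative_code = semi_synthetic_code(true_code, M=16, rho=0.75, rng=rng)
print(negative_code)
```

Feeding `negative_code` through the same lookup-and-pool operation as a true code would give the hard negative representation $\tilde{\mathbf{\vec v}}_i$.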
- Mixed-domain negatives: unlike previous next-item prediction models that use in-domain negatives ("Session-based Recommendations with Recurrent Neural Networks", "Self-Attentive Sequential Recommendation"), the paper uses mixed-domain items as negatives to enhance multi-domain fusion during pre-training. For efficiency, in-batch items are used directly as negatives (i.e., the $B-1$ ground-truth items paired with the other sequential contexts in the batch). Since a batch is constructed by sampling from multiple domains, in-batch sampling naturally produces mixed-domain negatives.
Combining the two kinds of negatives, the pre-training objective can be formulated as:
$\mathcal L = - \frac{1}{B} \sum_{j=1}^{B} \log \frac{\exp\left(\mathbf{\vec s}_j \cdot \mathbf{\vec v}_j / \tau\right)}{\underbrace{\exp\left(\mathbf{\vec s}_j \cdot \tilde{\mathbf{\vec v}}_j / \tau\right)}_{\text{semi-synthetic}} + \underbrace{\sum_{j'=1}^{B} \exp\left(\mathbf{\vec s}_j \cdot \mathbf{\vec v}_{j'} / \tau\right)}_{\text{mixed-domain}}}$
where $\tau$ is a temperature hyperparameter.
To simplify the notation, the mixed-domain term sums over all $B$ in-batch items, i.e., 1 positive and $B-1$ negatives.
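The objective can be written out directly for a small batch. This is a plain-Python sketch under the assumption that all representations are already normalized; the names `pretrain_loss`, `S`, `V`, and `V_tilde` are hypothetical, and a real implementation would use batched tensor operations instead of loops.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pretrain_loss(S, V, V_tilde, tau):
    """Contrastive loss with one semi-synthetic negative per instance plus
    in-batch (mixed-domain) items in the denominator.

    S[j]: sequence representation, V[j]: its positive item representation,
    V_tilde[j]: representation of the semi-synthetic negative of V[j].
    """
    B = len(S)
    total = 0.0
    for j in range(B):
        pos = math.exp(dot(S[j], V[j]) / tau)
        semi = math.exp(dot(S[j], V_tilde[j]) / tau)
        mixed = sum(math.exp(dot(S[j], V[jp]) / tau) for jp in range(B))
        total += -math.log(pos / (semi + mixed))
    return total / B


# Toy batch (assumed values): B = 2, unit-norm 2-d vectors.
S = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
V_tilde = [[0.0, 1.0], [1.0, 0.0]]
aligned = pretrain_loss(S, V, V_tilde, tau=1.0)
swapped = pretrain_loss(S, V[::-1], V_tilde, tau=1.0)
print(aligned, swapped)
```

The loss is lower when each sequence representation is closest to its own positive item than when the positives are swapped, which is the behaviour the objective is meant to reward.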

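1.2.4 Cross-domain Recommender Fine-tuning

In the pre-training stage, the parameters of the code embedding matrices and the Transformer are optimized, while the BERT encoder is kept fixed and the PQ is learned independently of pre-training. This section discusses how to fine-tune in a cross-domain or cross-platform setting. In the fine-tuning stage, the Transformer sequence encoder is fixed (it is transferred across domains), and only the parameters related to item representations are optimized, which gives parameter-efficient fine-tuning.
To further leverage the knowledge learned by the pre-trained recommender, the paper considers reusing the learned code representation scheme (i.e., the $M \times D$ PQ indices) and the code embedding table (i.e., the $D$ embedding matrices). Based on these two aspects, fine-tuning is decomposed into two stages: fine-tuning the code-embedding alignment and fine-tuning the code embedding table.
Fine-tuning for Code-Embedding Alignment: to transfer the item representation scheme, a straightforward approach is to directly reuse the discrete index set and the corresponding embeddings. However, this simple approach ignores the large semantic gap between domains and transfers poorly to downstream domains. The paper's solution is to reuse only the discrete index set and rebuild the mapping from indices to embeddings.
There are two steps:
- Step 1: retrain new PQ centroids, then obtain the item code of each item in the new domain.
- Step 2: the embeddings of the item codes are not retrained here; instead, the pre-trained item code embeddings are permuted and assigned to the item codes of the new domain.
  Why permute rather than retrain? The purpose is knowledge transfer, which improves results when data in the new domain is scarce.
Permutation-based code-embedding alignment: to generate item codes for the downstream domain, new PQ centroids are retrained to capture domain-specific semantic characteristics. By sharing the code set configuration (i.e., the number of sets $D$ and the number of centroids per set $M$), the text encodings are mapped into discrete indices with the new centroids, $\hat c_{i,k} = \arg\min_{1\le j\le M}\left\|\mathbf{\vec x}_{i,k} -\mathbf{\vec a}_{k,j}\right\|^2\in \{1,2,\cdots,M\}$, which gives the new code $\hat{\mathbf{\vec c}}_i \in \mathbb{N}_+^D$ of a new item $i$. Since all discrete indices and code embeddings are kept, a permutation-based approach is used to re-learn the mapping (i.e., a new lookup scheme) that associates indices with code embeddings. For the $k$-th dimension of the discrete indices, the embedding alignment is formulated as a matrix $\mathbf \Pi_k \in \{0,1\}^{M \times M}$. To enforce a bijection alignment, $\mathbf \Pi_k$ should be a permutation matrix, i.e., each row and each column contains exactly one 1 and the rest are 0. Formally, the code embedding table of the new domain, $\hat{\mathbf E}^{(k)}$, is defined as:
$\hat{\mathbf E}^{(k)} = \mathbf \Pi_k \mathbf E^{(k)}$
Could a linear combination of the pre-trained embeddings be used instead of a permutation? A linear combination is more expressive and is worth trying.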
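The effect of $\hat{\mathbf E}^{(k)} = \mathbf \Pi_k \mathbf E^{(k)}$ is easy to see on a toy table: the rows are reordered and nothing else changes. The helper names `is_permutation` and `permute_table` are hypothetical and only serve this illustration.

```python
def is_permutation(P):
    """True if every row and every column of the 0/1 matrix P sums to 1."""
    M = len(P)
    entries_ok = all(v in (0, 1) for row in P for v in row)
    rows_ok = all(sum(row) == 1 for row in P)
    cols_ok = all(sum(P[r][c] for r in range(M)) == 1 for c in range(M))
    return entries_ok and rows_ok and cols_ok


def permute_table(P, E):
    """Compute P @ E for an M x M matrix P and an M x d_V table E."""
    M, d_v = len(E), len(E[0])
    return [[sum(P[r][c] * E[c][t] for c in range(M)) for t in range(d_v)]
            for r in range(M)]


# Toy example (assumed values): M = 3 code embeddings of dimension d_V = 2.
Pi = [[0, 1, 0],
      [0, 0, 1],
      [1, 0, 0]]
E = [[1, 1], [2, 2], [3, 3]]
E_hat = permute_table(Pi, E)
print(E_hat)
```

Index 1 in the new domain now points at the embedding that index 2 used in pre-training, and so on; the set of embeddings is preserved, only the index-to-embedding assignment changes.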
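Alignment optimization: to learn the code-embedding alignment matrices $\mathbf \Pi_k$, the corresponding parameters are optimized with a conventional next-item prediction objective. Given the sequential context, the next item is predicted according to the following probability:
$P\left(i_{t+1} \mid i_1, \cdots, i_t\right) = \text{Softmax}\left(\hat{\mathbf{\vec s}} \cdot \hat{\mathbf{\vec v}}_{i_{t+1}}\right)$
where:
- $\hat{\mathbf{\vec s}}$ is the output of the pre-trained sequential encoder that takes $\hat{\mathbf{\vec v}}_{i_{1}},\cdots,\hat{\mathbf{\vec v}}_{i_{t}}$ as input.
- The item representations $\hat{\mathbf{\vec v}}_i$ are derived from the code $\hat{\mathbf{\vec c}}_i$ by the same lookup and pooling as before, $\hat{\mathbf{\vec v}}_i = \text{EmbPool}\left(\hat{\mathbf{\vec c}}_i\right)$, using the permuted tables $\hat{\mathbf E}^{(k)}$.
To make the learning of the permutation matrices $\mathbf \Pi_k$ differentiable, the paper is inspired by Birkhoff's theorem ("Sparse sinkhorn attention"): any doubly stochastic matrix can be regarded as a convex combination of permutation matrices. Based on this idea, doubly stochastic matrices are used to simulate permutation matrices. Specifically:
- First, randomly initialize a parameter matrix $\mathbf \Theta_k \in \mathbb{R}^{M \times M}$ for each dimension.
- Then convert it into a doubly stochastic matrix with the Gumbel-Sinkhorn algorithm ("Learning Latent Permutations with Gumbel-Sinkhorn Networks").
  This step yields a differentiable approximation of a permutation matrix.
  Doubly stochastic matrix: all entries are non-negative; every row sums to 1.0; every column sums to 1.0.
The next-item probability $P(i_{t+1}\mid i_1,\cdots,i_t)$ is optimized with respect to the parameters $\{\mathbf \Theta_1, \cdots, \mathbf \Theta_D\}$, while the remaining parameters of the pre-trained recommender are kept fixed.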
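The relaxation from a free parameter matrix to an (approximately) doubly stochastic one can be sketched as follows: add Gumbel noise, divide by a temperature, then alternate row and column normalization in log space. This is a standard-library illustration rather than the authors' implementation; the function name, the default arguments, and the toy matrix are assumptions, and in training this would run on framework tensors so that gradients reach $\mathbf \Theta_k$.

```python
import math
import random


def _logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def gumbel_sinkhorn(theta, tau=1.0, n_iters=3, noise=True, seed=0):
    """Relax an M x M parameter matrix into a near doubly stochastic matrix."""
    rng = random.Random(seed)
    M = len(theta)

    def gumbel():
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        return -math.log(-math.log(u))

    log_p = [[(theta[r][c] + (gumbel() if noise else 0.0)) / tau
              for c in range(M)] for r in range(M)]
    for _ in range(n_iters):
        # Row normalization in log space.
        row_lse = [_logsumexp(row) for row in log_p]
        log_p = [[log_p[r][c] - row_lse[r] for c in range(M)] for r in range(M)]
        # Column normalization in log space.
        col_lse = [_logsumexp([log_p[r][c] for r in range(M)]) for c in range(M)]
        log_p = [[log_p[r][c] - col_lse[c] for c in range(M)] for r in range(M)]
    return [[math.exp(v) for v in row] for row in log_p]


theta = [[1.0, 0.2, -0.5], [0.3, 2.0, 0.1], [-1.0, 0.4, 1.5]]
soft = gumbel_sinkhorn(theta, tau=1.0, n_iters=50, noise=False)
sharp = gumbel_sinkhorn(theta, tau=0.1, n_iters=50, noise=False)
hard_assignment = [row.index(max(row)) for row in sharp]
print(hard_assignment)
```

Lower temperatures push the result closer to a hard permutation, while the noise term (disabled above for determinism) lets training explore different assignments.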
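Fine-tuning the Code Embedding Table: after the code-embedding alignment, the permuted code embedding table is further fine-tuned to increase its representation capacity for fitting the downstream domain. Specifically, based on the probability $P(i_{t+1}\mid i_1,\cdots,i_t)$, the next-item prediction loss is used to optimize the parameters $\hat{\mathbf E}^{(1)}, \cdots, \hat{\mathbf E}^{(D)}$, while the parameters $\{\mathbf \Theta_1, \cdots, \mathbf \Theta_D\}$ are kept fixed. The fine-tuned VQ-Rec does not rely on item IDs and can be applied in an inductive setting, i.e., it can recommend new items without retraining the model. When a new item appears, its item text is encoded into discrete indices, and the corresponding item representation is obtained by embedding lookup.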

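1.2.5 Discussion

This section highlights the merits of VQ-Rec in three aspects:
- Capacity: by leveraging the generality of text semantics, the learned discrete codes and code embeddings can effectively capture transferable patterns across domains. Unlike existing related studies ("Zero-Shot Recommender Systems", "Towards Universal Sequence Representation Learning for Recommender Systems"), the approach does not map text encodings directly into item representations, so it avoids over-emphasizing text similarity and is more robust to noise in the text data. In addition, as discussed in the previous section, the approach can generate highly distinguishable code representations.
- Flexibility: another difference from previous studies is that the approach does not modify the underlying sequence encoder (i.e., the Transformer) architecture, not even with minor changes such as adding adaptors. In addition, the text encoder and the PQ are independent of the underlying sequence encoder in terms of optimization. These decoupled designs make the approach easy to extend to various PLMs, sequential encoders, and discrete coding methods.
- Efficiency: the approach achieves efficient model training and utilization through three specific designs:
  - (1): A fixed text encoder.
  - (2): Independently learned discrete codes.
  - (3): A fixed sequence encoder during fine-tuning.
  In addition, compared with existing methods such as UniSRec ("Towards Universal Sequence Representation Learning for Recommender Systems"), VQ-Rec has better time complexity when deriving transferable item representations ($O(d_V D)$ versus $O(d_W d_V D)$). As shown in the previous section, VQ-Rec does not need any matrix multiplication of the kind required by previous adaptor-based methods.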

1.3 Experiments

This section verifies the effectiveness and transferability of VQ-Rec through experiments.
Datasets: public benchmark datasets are used to evaluate the transferable recommender.
- The pre-training data consists of five domains from the Amazon review datasets (Food, Home, CD, Kindle, and Movies).
- The pre-trained model is then transferred to five downstream cross-domain datasets (Scientific, Pantry, Instruments, Arts, and Office, all from Amazon) and one cross-platform dataset (Online Retail, a UK-based online retail platform).
Following "Towards Universal Sequence Representation Learning for Recommender Systems", users and items with fewer than 5 interactions are filtered out. Then, within each sub-dataset, interactions are grouped by user and sorted chronologically. For the descriptive item text, the Amazon sub-datasets concatenate the title, categories, and brand fields, and the Online Retail dataset uses the Description field. Item text is truncated to a length of 512. Statistics of the preprocessed datasets are shown in Table 1.
Baselines: the proposed approach is compared with the following baseline methods:
- SASRec: uses a self-attentive model to capture item correlations. Two versions are implemented:
  - (1): With conventional ID embeddings.
  - (2): With fine-tuned BERT representations of the item text as the basic item representations.
- BERT4Rec: uses a bi-directional self-attentive model with a cloze objective for sequence modeling.
- FDSA: models the item sequence and the feature sequence with separate self-attentive sequential models.
- S3-Rec: captures feature-item correlations in the pre-training stage with mutual information maximization objectives.
- RecGURU ("RecGURU: Adversarial Learning of Generalized User Representations for Cross-Domain Recommendation"): proposes an adversarial learning paradigm that pre-trains user representations with an auto-encoder.
- ZESRec ("Zero-Shot Recommender Systems"): encodes item text with a PLM as the basic item representations. For a fair comparison, ZESRec is pre-trained on the same datasets as VQ-Rec.
- UniSRec ("Towards Universal Sequence Representation Learning for Recommender Systems"): equips textual item representations with an MoE-enhanced adaptor for domain fusion and adaptation. Item-sequence and sequence-sequence contrastive learning tasks are designed to pre-train transferable sequence representations.
For the proposed approach, one VQ-Rec model is first pre-trained on the mixture of item sequences from the pre-training datasets, and the pre-trained model is then fine-tuned on each downstream dataset.
Evaluation metrics: following previous studies, two widely used ranking metrics are adopted, Recall@K and NDCG@K ($K\in \{10,50\}$). The datasets are split with the leave-one-out strategy, i.e., the latest interacted item is used as test data and the second-to-last item as validation data. For evaluation, the ground-truth item of each sequence is ranked against all other items, and the average score over all test users is reported.
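The leave-one-out protocol and the two metrics reduce to very little code when each test case has a single ground-truth item. The function names below are hypothetical, and the sketch assumes the usual single-target definitions: Recall@K is a hit indicator, and NDCG@K is $1/\log_2(\text{rank}+1)$ for a hit and 0 otherwise.

```python
import math


def leave_one_out(sequence):
    """Split one chronological sequence into train part, validation item, test item."""
    return sequence[:-2], sequence[-2], sequence[-1]


def rank_of_target(scores, target):
    """1-based rank of the target item among all candidate items."""
    return 1 + sum(1 for i, s in enumerate(scores)
                   if i != target and s > scores[target])


def recall_at_k(rank, k):
    return 1.0 if rank <= k else 0.0


def ndcg_at_k(rank, k):
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0


# Toy example (assumed values): 4 candidate items, ground truth is item 2.
train_part, valid_item, test_item = leave_one_out([7, 3, 5, 0, 2])
scores = [0.1, 0.9, 0.5, 0.7]
rank = rank_of_target(scores, test_item)
print(rank, recall_at_k(rank, 3), ndcg_at_k(rank, 3))
```

Averaging these per-sequence values over all test users gives the reported Recall@K and NDCG@K.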
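Implementation details:
- The models are implemented with the Faiss ANNS library and RecBole.
- $(M=32) \times (D=256)$ PQ indices are used as the code representation scheme.
- VQ-Rec is pre-trained for 300 epochs, with temperature $\tau=0.07$ and semi-synthetic ratio $\rho=0.75$. The number of iterations of the Gumbel-Sinkhorn algorithm is set to 3.
- Results of the main baselines are taken directly from "Towards Universal Sequence Representation Learning for Recommender Systems".
- For the other models, the best results are obtained by hyperparameter search.
- The batch size is 2048. The learning rate is tuned in $\{0.0003, 0.001, 0.003\}$, and the number of permutation learning epochs is tuned in $\{3,5,10\}$.
- The model with the highest NDCG@10 on the validation set is selected for evaluation on the test set.
- Early stopping is used with a patience of 10 epochs.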

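1.3.1 Overall Performance

VQ-Rec is compared with the baselines on six benchmark datasets; the results are shown in Table 2.
- For the baselines:
  - Text-based models (i.e., SASRec(T), ZESRec, and UniSRec) outperform the other methods on small-scale datasets (e.g., Scientific and Pantry). On these datasets, where interactions are insufficient to train a strong ID-based recommender, text-based methods can benefit from text characteristics.
  - Models that incorporate item IDs (i.e., SASRec and BERT4Rec) perform better on datasets with more interactions (e.g., Arts, Office, and Online Retail), which suggests that over-emphasizing text similarity may lead to sub-optimal results.
- The proposed VQ-Rec achieves the best or second-best performance on all datasets.
  - On small-scale datasets, the results show that the proposed discrete indices preserve text semantics and give proper recommendations.
  - On large-scale datasets, VQ-Rec can be trained well and captures sequential characteristics.
  Note that VQ-Rec can be applied in an inductive setting and can recommend new items without retraining the model. The results also show that, with a carefully designed encoding scheme and large-scale pre-training, inductive recommenders can outperform conventional transductive models.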

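1.3.2 Ablation Study

The impact of each proposed component on the final performance is analyzed. Table 3 shows the performance of the default method and its six variants on three representative datasets: a small-scale dataset (Scientific), a large-scale dataset (Office), and a cross-platform dataset (Online Retail).
- (1) w/o Pre-training: without pre-training on multiple domains, this variant performs worse than VQ-Rec on all datasets. The results show that VQ-Rec can learn general sequential patterns of discrete codes and transfer them to downstream domains or platforms.
- (2) w/o Semi-synthetic NS: when the semi-synthetic negative samples are removed from the pre-training loss, VQ-Rec may suffer from the sparsity issue, which leads to sub-optimal results.
- (3) w/o Fine-tuning: without fine-tuning the pre-trained model, performance drops sharply, which further shows that transferring to semantically different domains is challenging.
- (4) Reuse PQ Index Set: the PQ centroids from pre-training are used directly to encode the downstream item indices. The large semantic gap makes the indices follow a long-tailed distribution. Because distinguishability is reduced, this variant performs worse.
- (5) w/o Code-Emb Alignment: this variant removes the permutation matrices $\mathbf\Pi_k$ (which align the pre-trained embeddings with the downstream codes). The results show that the permutation-based alignment network generally improves performance.
- (6) Random Code: the pre-trained embeddings are randomly assigned to downstream items. This variant generally performs worse than the default method, which shows that the learned vector-quantized codes preserve text characteristics. Note that on the Online Retail dataset this variant performs slightly better, mainly because the item text of this dataset is relatively short (only 27.8 words on average). The results suggest that informative item text is essential for deploying such text-based recommenders.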

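1.3.3 Further Analysis

Capacity with respect to the amount of training data: to verify whether the proposed discrete code embeddings have better capacity, a scenario with an increasing amount of training data is simulated. Specifically, different proportions of the training interactions (20% to 100%) are used to train the model, and performance is reported on the test set. The results are shown in Figure 2.
As the amount of training data increases, the performance of VQ-Rec improves consistently and stays above that of the compared methods. The results show that VQ-Rec has better capacity to fit the training sequences.
Transferability with respect to the downstream dataset: this part reports the relative transferring improvements on each downstream dataset, i.e., the performance with pre-training compared with the performance without pre-training. The results are shown in Figure 3.
It can be seen that:
- Because of the capacity issue of PLM-based item representations, UniSRec may suffer from negative transfer on several datasets (Arts, Office, and Online Retail).
- In contrast, VQ-Rec benefits from pre-training on all six experimental datasets, with a maximum improvement of more than 20%.
The results show that the proposed techniques help recommenders transfer to semantically different downstream scenarios.
Cold-start item analysis: one motivation for developing transferable recommenders is to alleviate the cold-start recommendation problem. The test data is split into groups according to the popularity of the ground-truth items; the results are shown in Figure 4.
- Recommenders that map text directly into textual item representations perform well on cold-start groups (e.g., [0,5) on Office and [0,20) on Online Retail), but their performance drops on popular groups.
- In contrast, VQ-Rec outperforms SASRec on all groups, especially on the cold-start groups.
The results suggest that the recommendation of these long-tail items can benefit from the proposed pre-training technique.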

1.4 Conclusion

The paper proposes VQ-Rec, an approach that learns vector-quantized item representations for transferable sequential recommenders. Unlike existing methods that map PLM text encodings directly into item representations, it establishes a two-step item representation scheme: text encodings are first mapped into discrete codes, and item representations are then derived by embedding lookup.
To pre-train the approach on multi-domain interaction data, mixed-domain code representations and semi-synthetic code representations are used as hard negatives. The paper further proposes a permutation-based network that learns a domain-specific code-embedding alignment and adapts effectively to downstream domains. Extensive experiments on six transferring benchmarks demonstrate the effectiveness of VQ-Rec.