2024_LLMRec

1. LLMRec [2024]

"Large Language Models are Zero-Shot Rankers for Recommender Systems"

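Recently, large language models (LLMs) such as GPT-4 have demonstrated impressive general-purpose task-solving abilities, including the potential to approach recommendation tasks. Along this line of research, this work investigates the capacity of LLMs to act as the ranking model of a recommender system. We first formalize the recommendation problem as a conditional ranking task, treating sequential interaction histories as conditions and the items retrieved by other candidate generation models as candidates. To solve the ranking task with LLMs, we carefully design the prompting template and conduct extensive experiments on two widely used datasets. We show that LLMs have promising zero-shot ranking abilities, but:
- (1): they struggle to perceive the order of historical interactions.
- (2): they can be biased by popularity or by item positions in the prompts.
We demonstrate that these issues can be alleviated with specially designed prompting and bootstrapping strategies. Equipped with these insights, zero-shot LLMs can even challenge conventional recommendation models when the candidates to be ranked are retrieved by multiple candidate generators. The code and processed datasets are available at https://github.com/RUCAIBox/LLMRank.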
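In the recommender systems literature, most existing models are trained on user behavior data from a specific domain or scenario, which has two major issues.
- First, it is difficult to capture user preferences by modeling only historical behaviors (e.g., clicked item sequences), which limits the expressive power needed to model more complicated but explicit user interests (e.g., intentions expressed in natural language).
- Second, these models are essentially "narrow experts" that lack the more comprehensive knowledge needed for complicated recommendation tasks that rely on background or commonsense knowledge ("survey on knowledge graph-based recommender systems").
To improve recommendation performance and interactivity, a growing body of work explores the use of pre-trained language models (PLMs) in recommender systems. These approaches aim to explicitly capture user preferences in natural language ("Recommendation as language processing (RLP): A unified pretrain, personalized prompt & predict paradigm (P5)") or to transfer rich world knowledge from text corpora ("Towards universal sequence representation learning for recommender systems", "Learning vector-quantized item representation for transferable sequential recommenders"). Despite their effectiveness, thoroughly fine-tuning the recommendation models on task-specific data is still required, which makes them less capable of solving diverse recommendation tasks ("Towards universal sequence representation learning for recommender systems"). More recently, LLMs have shown great potential as zero-shot task solvers ("Finetuned language models are zero-shot learners", "Multitask prompted training enables zero-shot task generalization"). Indeed, there are some preliminary attempts to use LLMs for recommendation tasks. These studies mainly discuss the possibility of building recommender systems with LLMs. Although promising, an insufficient understanding of the new characteristics that arise when making recommendations with LLMs may hinder the development of this new paradigm.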
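In this paper, we conduct empirical studies to investigate the factors that determine the capacity of LLMs as recommendation models. Recommender systems are typically developed as a pipeline ("Deep neural networks for youtube recommendations") consisting of candidate generation (retrieving relevant items) and ranking (placing relevant items at higher positions). This work focuses on the ranking stage, because LLMs are more expensive to run on a large-scale candidate set. In addition, ranking performance is sensitive to the retrieved candidate items, which makes it better suited to examining subtle differences in the recommendation abilities of LLMs.
To carry out this study, we first formalize the recommendation process of LLMs as a conditional ranking task. Given prompts that include sequential historical interactions as "conditions", LLMs are instructed to rank a set of "candidates" (e.g., items retrieved by candidate generation models) according to their intrinsic knowledge. We then run controlled experiments that systematically study the empirical performance of LLMs as rankers by designing specific configurations for the "conditions" and the "candidates" respectively. Overall, we attempt to answer the following key questions:
- Which factors affect the zero-shot ranking performance of LLMs?
- What data or knowledge do LLMs rely on for recommendation?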
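Our empirical experiments are conducted on two public recommender system datasets. The results lead to several key findings that may shed light on how to develop LLMs into powerful ranking models for recommender systems. We summarize the key findings as follows:
- LLMs struggle to perceive the order of the given sequential interaction histories. Specially designed promptings can trigger LLMs to perceive the order, which improves ranking performance.
- LLMs suffer from position bias and popularity bias while ranking, which can be alleviated by bootstrapping or specially designed prompting strategies.
- LLMs outperform existing zero-shot recommendation methods and show promising zero-shot ranking abilities, especially on candidates retrieved by multiple candidate generation models with different practical strategies.
The latency of LLMs is too high for online inference in recommender systems, so they can only be applied offline. The ranking stage usually has to run online, whereas the retrieval stage can run offline. As a result, LLMs are usually used only in the retrieval stage.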

1.1 Related Work

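Transfer learning for recommender systems: Because recommender systems are mostly trained on data collected from a single source, researchers have tried to transfer knowledge from other domains, markets, or platforms. Typical transfer learning methods for recommender systems rely on anchors, including shared users/items or representations from a shared space. However, these anchors are usually sparse across different scenarios, which makes transfer between recommender systems difficult.
More recently, studies have aimed to transfer the knowledge stored in language models to recommendation tasks through fine-tuning or prompting. In this paper, we conduct zero-shot recommendation experiments to examine the potential of transferring knowledge from LLMs.
Large language models for recommender systems: The design of recommendation models, sequential recommendation models in particular, has long been inspired by the design of language models, from word2vec to recent neural networks. In recent years, with the development of pre-trained language models (PLMs), researchers have tried to transfer the knowledge stored in PLMs to recommendation models, either by representing items with their text features or by representing behavior sequences in natural language.
Recently, LLMs have been shown to have superior language understanding and generation abilities. Studies have made recommender systems more interactive by integrating LLMs with conventional recommendation models or by fine-tuning with specially designed instructions. Early explorations also suggest that LLMs have zero-shot recommendation abilities. Although effective to some extent, few works have explored what determines the recommendation performance of LLMs.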

1.2 General Framework for LLMs as Rankers

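To investigate the recommendation abilities of LLMs, we first formalize the recommendation process as a conditional ranking task. We then describe a general framework that adapts LLMs to solve the recommendation task.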

1.2.1 Problem Formulation

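Given the historical interactions $\mathcal{H} = \{i_1, i_2, \cdots , i_n\}$ of a user (in chronological order of interaction time) as conditions, the task is to rank the candidate items $\mathcal{C} = \{i_j\}_{j = 1}^m$ so that the items the user is interested in are placed at higher positions. In practice, the candidate items are usually retrieved by candidate generation models from the whole item set $\mathcal{I}$ (with $m \ll |\mathcal{I}|$) ("Deep neural networks for youtube recommendations"). We further assume that each item $i$ is associated with a descriptive text $t_i$, following "Towards universal sequence representation learning for recommender systems".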

1.2.2 Ranking with LLMs Using Natural Language Instructions

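Following the instruction-following paradigm ("Finetuned language models are zero-shot learners"), we use LLMs as ranking models to solve the task above. Specifically, for each user we first construct two natural language patterns that contain the sequential interaction histories $\mathcal{H}$ (conditions) and the retrieved candidate items $\mathcal{C}$ (candidates), respectively. These patterns are then filled into a natural language template $T$ to form the final instruction. In this way, we expect LLMs to understand the instruction and output the ranking results as the instruction suggests. The overall framework of our ranking approach is shown in Figure 1. Next, we describe the detailed instruction design of our approach.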

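Sequential historical interactions: To investigate whether LLMs can capture user preferences from historical user behaviors, we include the sequential historical interactions $\mathcal{H}$ in the instruction as input to the LLM. To let LLMs perceive the sequential nature of the historical interactions, this paper proposes three ways of constructing the instruction: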

Sequential prompting: Arrange the historical interactions in chronological order. This approach has also been used in prior work ("Uncovering chatgpt’s capabilities in recommender systems"). Example:


"I’ve watched the following movies in the past in order: ’0. Multiplicity’, ’1.Jurassic Park’, ..."

Recency-focused prompting: In addition to the sequential interaction records, add an extra sentence that emphasizes the most recent interaction. Example:


"I’ve watched the following movies in the past in order: ’0. Multiplicity’, ’1. Jurassic Park’, ... . Note that my most recently watched movie is Dead Presidents. ..."

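In-context learning (ICL): In-context learning is a prominent prompting approach for LLMs to solve various tasks ("A survey of large language models"); its core idea is to include demonstration examples in the prompt. For personalized recommendation, directly introducing examples from other users tends to introduce noise, because different users have different preferences. We therefore build demonstration examples by augmenting the input interaction sequence itself: a prefix of the input interaction sequence is paired with its corresponding successor to form an example. Example: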
```
"If I’ve watched the following movies in the past in order: ’0. Multiplicity’, ’1. Jurassic Park’, ..., then you should recommend Dead Presidents to me and now that I’ve watched Dead Presidents, then ..."
```
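Note: the demonstration examples constructed here differ in format from the final examples used in the "Ranking with large language models" part below. The ICL examples here serve only to augment the input interaction sequence.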
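
The three ways of encoding the history can be sketched as small string builders. This is only a minimal sketch, not the authors' code: the function names are hypothetical, the wording simply follows the example prompts quoted above, and the history is assumed to contain at least two titles.

```python
from typing import List


def numbered(titles: List[str]) -> str:
    """Format titles as "'0. A', '1. B', ..." like the example prompts."""
    return ", ".join(f"'{i}. {title}'" for i, title in enumerate(titles))


def sequential_prompt(history: List[str]) -> str:
    """Sequential prompting: list the interactions in chronological order."""
    return f"I've watched the following movies in the past in order: {numbered(history)}."


def recency_focused_prompt(history: List[str]) -> str:
    """Recency-focused prompting: add a sentence about the latest interaction."""
    return (
        f"{sequential_prompt(history)} "
        f"Note that my most recently watched movie is {history[-1]}."
    )


def icl_prompt(history: List[str]) -> str:
    """ICL: pair the prefix of the sequence with its successor as a demonstration."""
    prefix, successor = history[:-1], history[-1]
    return (
        f"If I've watched the following movies in the past in order: {numbered(prefix)}, "
        f"then you should recommend {successor} to me and now that I've watched {successor}, then"
    )
```

Each builder returns only the history pattern; the candidate pattern and the ranking request are appended afterwards.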

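Retrieved candidate items: The candidate items to be ranked are typically first retrieved by candidate generation models ("Deep neural networks for youtube recommendations"). In this study we use a small candidate set and keep 20 candidate items (i.e., $m=20$) for ranking. To rank the candidates with an LLM, we arrange the candidate set $\mathcal C$ sequentially. Example: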
```
"Now there are 20 candidate movies that I can watch next: ’0. Sister Act’, ’1. Sunset Blvd’, ..."
```
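Following the classic candidate generation approach ("Deep neural networks for youtube recommendations"), the retrieved candidate items have no fixed order. We therefore use different orderings of the candidate items in the prompts to further examine whether the ranking results of LLMs are affected by the order of the candidates (i.e., position bias), and how bootstrapping can alleviate this bias.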

Ranking with large language models: Existing studies show that LLMs can follow natural language instructions to solve diverse tasks in a zero-shot setting ("Finetuned language models are zero-shot learners", "A survey of large language models"). To rank with LLMs, we fill the patterns described above into the instruction template $T$. An example instruction template:


"[pattern that contains sequential historical interactions H] [pattern that contains retrieved candidate items C] Please rank these movies by measuring the possibilities that I would like to watch next most, according to my watching history."

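Parsing the output of LLMs: The output of an LLM is natural-language text. We parse it with a heuristic text-matching method and ground the recommendation results on the specified item set. Specifically, we use an efficient substring matching algorithm such as KMP to match the LLM outputs against the text of the candidate items. We found that LLMs occasionally generate items outside the candidate set; for GPT-3.5, such deviations account for only 3% of cases. To handle this, one can either reprocess the anomalous results or simply treat out-of-candidate items as incorrect recommendations.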
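
The instruction assembly and the output parsing described above can be sketched together. This is a hedged illustration rather than the released implementation: `candidate_pattern`, `build_instruction`, and `parse_ranking` are hypothetical names, and Python's built-in `str.find` stands in for a hand-written KMP matcher.

```python
from typing import List


def candidate_pattern(candidates: List[str]) -> str:
    """Number the candidates from 0, as in the example prompt."""
    listed = ", ".join(f"'{i}. {title}'" for i, title in enumerate(candidates))
    return f"Now there are {len(candidates)} candidate movies that I can watch next: {listed}."


def build_instruction(history_pattern: str, candidates: List[str]) -> str:
    """Fill the history and candidate patterns into the instruction template T."""
    return (
        f"{history_pattern} {candidate_pattern(candidates)} "
        "Please rank these movies by measuring the possibilities that I would like "
        "to watch next most, according to my watching history."
    )


def parse_ranking(output: str, candidates: List[str]) -> List[str]:
    """Order candidates by where their title first appears in the LLM output."""
    hits = []
    for title in candidates:
        position = output.find(title)  # -1 if the title never appears
        if position != -1:
            hits.append((position, title))
    # Titles the LLM invented are never matched, so they are dropped implicitly
    return [title for _, title in sorted(hits)]
```

Plain substring matching is fragile when one title is contained in another (for example "Alien" and "Aliens"), so a real parser would likely need extra disambiguation.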

1.3 Empirical Studies

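Datasets: The experiments are conducted on two widely used public recommender system datasets:
- (1): the movie rating dataset MovieLens-1M (ML-1M for short), where user ratings are treated as interactions.
- (2): a category named Games from the Amazon Review dataset, where reviews are treated as interactions.
We filter out users and items with fewer than five interactions. We then sort each user's interactions by timestamp, oldest first, to construct the historical interaction sequences. Movie/product titles are used as the descriptive text of an item. We use item titles in this study for two reasons:
- (1): to determine whether LLMs can make recommendations from their intrinsic world knowledge when given only minimal information.
- (2): to save computational resources. Exploring how LLMs can use richer textual features for recommendation is left as a focus of our future work.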
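Evaluation and implementation details:
- Following existing work ("Self-attentive sequential recommendation", "Towards universal sequence representation learning for recommender systems"), we use leave-one-out evaluation.
- For each historical interaction sequence, the last item is used as the ground-truth item in the test set. The second-to-last item is used for the validation set (for training the baseline methods).
- We adopt the widely used metric NDCG@K (N@K for short) to evaluate the ranking results over the given $m$ candidates, where $K\leq m$.
- To ease reproduction of this work, our experiments are run with the popular open-source recommendation library RecBole ("Towards a unified, comprehensive and efficient framework for recommendation algorithms").
- Historical interaction sequences are truncated to a length of at most 50.
- We evaluate the LLM-based methods on all users of the ML-1M dataset and, by default, on 6,000 randomly sampled users of the Games dataset. Unless otherwise specified, the evaluated LLM is accessed by calling OpenAI's API gpt-3.5-turbo. The temperature hyperparameter for calling the LLM is set to 0.2.
- All reported results are averaged over at least three repeated runs to reduce the effect of randomness.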
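
Under leave-one-out evaluation there is a single ground-truth item per user, so the ideal DCG is 1 and NDCG@K reduces to $1/\log_2(r+1)$, where $r$ is the 1-based rank of the ground-truth item (and 0 if it falls outside the top $K$). A minimal sketch, with `ndcg_at_k` as a hypothetical helper name:

```python
import math
from typing import List


def ndcg_at_k(ranked: List[str], ground_truth: str, k: int) -> float:
    """NDCG@K for a ranking with exactly one relevant (ground-truth) item."""
    top_k = ranked[:k]
    if ground_truth not in top_k:
        return 0.0
    rank = top_k.index(ground_truth)  # 0-based position in the ranked list
    return 1.0 / math.log2(rank + 2)  # equals 1 / log2(r + 1) with 1-based r
```

The reported N@K values would then be the average of this quantity over all evaluated users.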

1.3.1 Can LLMs Understand Prompts that Involve Sequential Historical User Behaviors?

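In LLM-based methods, historical interactions are naturally arranged as an ordered sequence. By designing different configurations of the historical interactions $\mathcal{H}$, we aim to examine whether LLMs can leverage these historical user behaviors and perceive their sequential nature to make accurate recommendations.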
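LLMs struggle to perceive the order of the given historical user behaviors. In this section, we examine whether LLMs can understand prompts that contain ordered historical interactions and give personalized recommendations. The task is to rank a candidate set of 20 items that contains one ground-truth item and 19 randomly sampled negatives. By analyzing the historical behaviors, the items of interest should be ranked at higher positions.
The candidate generator here is simply a rule: retrieve the ground-truth item together with 19 randomly sampled negatives.
We compare the ranking results of three LLM-based methods:
- (a) Ours: rank as described in the text above. The historical user behaviors are encoded into the prompts with the "sequential prompting" strategy.
- (b) Random order: the historical user behaviors are randomly shuffled before being fed to the model.
- (c) Fake history: all items in the original historical behaviors are replaced with randomly sampled items to form fake historical behaviors.
Figure 2(a) shows that our method performs better than the variant with fake historical behaviors. However, our method and the random-order variant perform similarly, which indicates that LLMs are not sensitive to the order of the given historical user interactions.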
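In addition, in Figure 2(b) we vary the number of latest historical user behaviors ($|\mathcal{H}|$) used to construct the prompt from 5 to 50. The results show that increasing the number of historical user behaviors does not improve ranking performance and in fact hurts it. We speculate that this happens because LLMs have difficulty understanding the order and instead consider all historical behaviors equally. Too many historical user behaviors (e.g., $|\mathcal{H}| = 50$) may therefore overwhelm the LLM and cause performance to drop. In contrast, a relatively small $|\mathcal{H}|$ lets the LLM concentrate on the most recently interacted items, which yields better recommendation performance.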
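Triggering LLMs to perceive the interaction order: Based on the observations above, LLMs struggle to perceive the order in interaction histories under the default prompting strategy. We therefore aim to elicit their order-perceiving abilities by proposing two alternative prompting strategies that emphasize the recently interacted items. The proposed strategies are described in detail in the text above. Table 2 shows that both recency-focused prompting and in-context learning generally improve the ranking performance of LLMs, although the best strategy may vary across datasets. These results can be summarized as the following key observation:
- Observation 1: LLMs struggle to perceive the order of the given sequential interaction histories. Specially designed promptings can trigger LLMs to perceive the order of historical user behaviors, which improves ranking performance.
The zero-shot performance of LLMs is about the same as that of the Pop method, i.e., recommending popular items.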

1.3.2 Do LLMs Suffer from Biases While Ranking?

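Biases and debiasing methods in conventional recommender systems have been widely studied ("Bias and debias in recommender system: A survey and future directions"). For LLM-based recommendation models, both the input and the output are natural language text, which inevitably introduces new biases. In this section, we discuss two kinds of biases that LLM-based recommendation models suffer from, and how to alleviate them.
The order of candidates affects the ranking results of LLMs: For conventional ranking methods, the order of the retrieved candidates usually does not affect the ranking results ("Self-attentive sequential recommendation", "Session-based recommendations with recurrent neural networks"). For the LLM-based approach described in the text above, however, the candidates are arranged sequentially and filled into the prompt. It has been shown that for NLP tasks LLMs are generally sensitive to the order of examples in the prompt ("Calibrate before use: Improving few-shot performance of language models", "Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity"). We therefore also run experiments to examine whether the order of candidates affects the ranking performance of LLMs. We use the same experimental setting as in Section 1.3.1. The only difference is that we control the order of the candidates in the prompts so that the ground-truth items appear at specific positions. We vary the position of the ground-truth items over {0,5,10,15,19}; the results are shown in Figure 3(a).
Performance changes when the ground-truth items appear at different positions. In particular, ranking performance drops significantly when the ground-truth items appear at the last few positions. The results indicate that LLM-based rankers are affected by the order of candidates, i.e., position bias, which may not affect conventional recommendation models.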
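Alleviating position bias via bootstrapping: A simple strategy to alleviate position bias is to bootstrap the ranking process. We rank the candidate set repeatedly for $B$ times, randomly shuffling the order of the candidates in each round, so that a candidate may appear at different positions. We then merge the results of each round to derive the final ranking. In Figure 3(b), we follow the setting of Section 1.3.1 and apply the bootstrapping strategy to our method. Each candidate set is ranked 3 times. Bootstrapping improves ranking performance on both datasets.
"Merge the results of each round": how exactly? The paper does not say. One option is average voting: return the candidate item that the LLM recommends most often.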
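
Since the merging rule is left open here, the following is only one possible reading: shuffle the candidates in each round, rank them, and order the items by their summed rank across rounds (a Borda-style aggregation). The names `bootstrap_rank` and `rank_fn` are hypothetical; `rank_fn` stands in for one prompt-plus-parse call to the LLM.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List


def bootstrap_rank(
    candidates: List[str],
    rank_fn: Callable[[List[str]], List[str]],
    rounds: int = 3,
    seed: int = 0,
) -> List[str]:
    """Rank shuffled copies of the candidates and merge by summed rank (assumed rule)."""
    rng = random.Random(seed)
    worst_rank = len(candidates)
    total: Dict[str, float] = defaultdict(float)
    for _ in range(rounds):
        shuffled = candidates[:]
        rng.shuffle(shuffled)  # a candidate can land at a different position each round
        ranked = rank_fn(shuffled)
        position = {item: r for r, item in enumerate(ranked)}
        for item in candidates:
            # Candidates the ranker omitted receive the worst possible rank
            total[item] += position.get(item, worst_rank)
    # Lower summed rank is better; ties keep the original candidate order
    return sorted(candidates, key=lambda item: total[item])
```

Summed rank uses the whole ranked list from every round, whereas a pure vote on the top item would only decide the first position.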
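The popularity of candidates affects the ranking results of LLMs: For popular items, the associated text may also appear frequently in the pre-training corpora of LLMs. For example, a best-selling book is widely discussed on the web. We therefore examine whether the ranking results are affected by the popularity of candidates. However, directly measuring the popularity of item text is difficult. Here we assume that text popularity can be measured indirectly by item frequency in a recommendation dataset. In Figure 3(c), we report the item popularity score (measured by the normalized item frequency in the training set) at each position of the ranked item lists. Popular items tend to be ranked at higher positions.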
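
A rough sketch of how such a per-position popularity curve could be computed. The normalization is not spelled out above, so dividing by the maximum frequency is an assumption, and both function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def popularity_scores(train_sequences: List[List[str]]) -> Dict[str, float]:
    """Item frequency in the training set, scaled to [0, 1] by the max frequency (assumed)."""
    counts = Counter(item for seq in train_sequences for item in seq)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {item: count / max_count for item, count in counts.items()}


def mean_popularity_by_position(
    rankings: List[List[str]], scores: Dict[str, float]
) -> List[float]:
    """Average popularity score at each rank position; assumes equal-length rankings."""
    # zip(*rankings) groups the items that share the same rank position across users
    return [
        sum(scores.get(item, 0.0) for item in column) / len(column)
        for column in zip(*rankings)
    ]
```

A curve that decreases from the first position to the last would correspond to the pattern described for Figure 3(c).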
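Making LLMs focus on historical interactions helps reduce popularity bias: We hypothesize that if LLMs focus on the historical interactions, they may give more personalized rather than more popular recommendations.
- From Figure 2(b), we know that LLMs make better use of historical interactions when fewer of them are used.
- In Figure 3(d), we compare the popularity scores of the best-ranked items as the number of historical interactions varies. As $|\mathcal{H}|$ decreases, the popularity score decreases as well.
This suggests that the effect of popularity bias can be reduced when LLMs focus more on the historical interactions. From the experiments above, we can draw the following conclusion:
- Observation 2: LLMs suffer from position bias and popularity bias while ranking, which can be alleviated by bootstrapping or specially designed prompting strategies.
Using fewer historical interactions (i.e., a smaller $|\mathcal H|$) can effectively reduce popularity bias. Alternatively, use recency-focused prompting or in-context learning.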

1.3.3 How Well Can LLMs Rank Candidates in a Zero-Shot Setting?

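We further evaluate the LLM-based methods on candidates with hard negatives that are retrieved by different strategies, to further investigate what the ranking of LLMs depends on. We then present the ranking performance of different methods on candidates retrieved by multiple candidate generation models, to simulate a more practical and difficult setting.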
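LLMs have promising zero-shot ranking abilities: In Table 2, we run experiments that compare the ranking abilities of LLM-based methods with existing methods. We follow the same setting as in Section 1.3.1, where $|\mathcal{C}| = 20$ and the candidate items are randomly retrieved. We include three conventional models that are trained on the training set: Pop (recommending according to item popularity), BPRMF ("BPR: bayesian personalized ranking from implicit feedback"), and SASRec ("Self-attentive sequential recommendation").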
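We also evaluate three zero-shot recommendation methods that are not trained on the target datasets: BM25 ("The probabilistic relevance framework: BM25 and beyond"), which ranks by the textual similarity between the candidates and the historical interactions, UniSRec ("Towards universal sequence representation learning for recommender systems"), and VQ-Rec ("Learning vector-quantized item representation for transferable sequential recommenders"). For UniSRec and VQ-Rec, we use their publicly available pre-trained models. We do not include ZESRec ("Zero-shot recommender systems") because no pre-trained model has been released. In addition, we compare the zero-shot ranking performance of different LLMs in Table 3. For the LLM-based rankers, we use the "Recency-Focused" prompting strategy.
Table 2 and Table 3 show that LLMs with more parameters generally perform better. The best LLM-based methods outperform existing zero-shot recommendation methods by a large margin, showing promising zero-shot ranking abilities. We highlight that zero-shot recommendation on the ML-1M dataset is difficult, because it is hard to measure the similarity between movies from the similarity of their titles alone. LLMs, however, can use their intrinsic knowledge to measure the similarity between movies and make recommendations. We emphasize that the goal of evaluating zero-shot recommendation methods is not to surpass conventional models. The goal is to demonstrate the strong recommendation capabilities of pre-trained base models, which can be further adapted and transferred to downstream scenarios.
LLMs rank candidates based on item popularity, text features, and user behaviors: To further investigate how LLMs rank the given candidates, we evaluate LLMs on candidates retrieved by different candidate generation methods. These candidates can be viewed as hard negatives for the ground-truth items and can be used to measure the ability of LLMs to rank specific categories of items. We consider two kinds of strategies for retrieving candidates:
- (1): content-based methods such as BM25 ("The probabilistic relevance framework: BM25 and beyond") and BERT ("Bert: Pre-training of deep bidirectional transformers for language understanding"), which retrieve candidates based on the similarity of text features.
- (2): interaction-based methods, including Pop, BPRMF ("BPR: bayesian personalized ranking from implicit feedback"), GRU4Rec ("Session-based recommendations with recurrent neural networks"), and SASRec ("Self-attentive sequential recommendation"), which retrieve items using neural networks trained on user-item interactions.
Given the candidates, we compare the ranking performance of the LLM-based model (ours) with that of representative methods.
Figure 4 shows that the ranking performance of the LLM-based method varies across candidate sets and across datasets.
- (1): On ML-1M, the LLM-based method cannot rank well on candidate sets that contain popular items (e.g., Pop and BPRMF), which indicates that on the ML-1M dataset the LLM-based method relies heavily on item popularity to recommend items.
- (2): On Games, our method performs similarly on popular candidates and on textually similar candidates, which indicates that item popularity and text features contribute similarly to the ranking ability of LLMs.
- (3): On both datasets, the performance of our method is affected by hard negatives (retrieved by interaction-based candidate generation models), but not as severely as interaction-based rankers such as SASRec.
These results suggest that the LLM-based method does not rank by a single aspect only, but makes use of item popularity, text features, and even user behaviors. The weight with which these three aspects affect ranking performance may also differ across datasets.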
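LLMs can effectively rank candidates retrieved by multiple candidate generation models: In real-world recommender systems ("Deep neural networks for youtube recommendations"), the items to be ranked are usually retrieved by multiple candidate generation models. We therefore also run experiments in a more practical and more difficult setting. We use the seven candidate generation models mentioned above to retrieve items. The top-3 best items retrieved by each candidate generation model are merged into a candidate set of 21 items in total. As a more practical setting, we do not add the ground-truth item to each candidate set. Note that the experiments here are conducted under the implicit preference setup ("A revisiting study of appropriate offline evaluation for top-n recommendation algorithms"), which means that the retrieved items may contain implicit positive instances (not explicitly labeled as positives). A more faithful evaluation may require human studies, which we intend to explore in future work. For our method, drawing on the findings of Sections 1.3.1 and 1.3.2, we use the recency-focused prompting strategy to encode $|\mathcal{H}| = 5$ sequential historical interactions into the prompts, and use the bootstrapping strategy to rank repeatedly for 3 rounds.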
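
The multi-generator candidate set can be sketched as a simple merge of each generator's top-3 list. The helper name `merge_candidates` is hypothetical, and removing duplicates is an assumption: the description above only states that the top-3 items of each model are merged into 21 items.

```python
from typing import Dict, List


def merge_candidates(retrieved: Dict[str, List[str]], top_n: int = 3) -> List[str]:
    """Merge the top-n items of every candidate generator into one candidate set."""
    merged: List[str] = []
    seen = set()
    for generator_name, items in retrieved.items():
        for item in items[:top_n]:
            # Assumption: an item returned by several generators is kept only once
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged
```

With seven generators and no overlap this yields 7 x 3 = 21 candidates; with overlapping retrievals the deduplicated set would be smaller.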
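Table 4 shows that on most metrics the LLM-based model (ours) achieves the second-best performance among the compared recommendation models. The results show that the LLM-based zero-shot ranker even outperforms the conventional recommendation models Pop and BPRMF that have been trained on the target datasets, which further demonstrates the strong zero-shot ranking abilities of LLMs. We hypothesize that LLMs can use their intrinsic world knowledge to rank candidates by jointly considering item popularity, text features, and user behaviors. In contrast, existing models (as narrow experts) may lack the ability to rank items in complex settings. These findings can be summarized as:
- Observation 3: LLMs have promising zero-shot ranking abilities, especially on candidates retrieved by multiple candidate generation models with different practical strategies.
An artificially constructed candidate set like this may not be very convincing.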

1.4 Conclusion

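In this work, we investigated the capacity of LLMs to act as zero-shot ranking models for recommender systems. To rank with LLMs, we constructed natural language prompts that contain historical interactions, candidates, and instruction templates. We then proposed several specially designed prompting strategies to trigger the ability of LLMs to perceive the order of sequential behaviors. We also introduced bootstrapping and prompting strategies to alleviate the position bias and popularity bias that LLM-based ranking models may suffer from.
Extensive empirical studies show that LLMs have promising zero-shot ranking abilities. The empirical studies demonstrate the strong potential of transferring knowledge from LLMs into powerful recommendation models. We aim to shed light on several promising directions for further improving the ranking abilities of LLMs, including:
- (1): better perceiving the order of sequential historical interactions.
- (2): alleviating position bias and popularity bias.
For future work, we consider developing technical approaches to address the key challenges above when deploying LLMs as recommendation models. We would also like to develop LLM-based recommendation models that can be efficiently tuned on downstream user behaviors for effective personalized recommendation.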

1.5 Limitations

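In most experiments of this paper, ChatGPT is used as the primary target LLM for evaluation. However, as a closed-source commercial service, ChatGPT may integrate additional techniques on top of its core large language model to improve performance. Although open-source LLMs such as LLaMA 2 and Mistral are available, they show a significant performance gap compared with ChatGPT (e.g., LLaMA-2-70B-Chat versus ChatGPT in Table 3). This gap makes it difficult to evaluate the emergent abilities of LLMs on recommendation tasks using open-source models alone. We should also note that the observations may be biased by the specific prompts and datasets.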