2021_AutoDis

1. AutoDis [2021]

《An Embedding Learning Framework for Numerical Features in CTR Prediction》

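As shown in the figure below, most existing deep CTR models follow the Embedding & Feature Interaction (FI) paradigm. Because feature interaction matters so much for CTR prediction, most work focuses on designing the network architecture of the FI module to better capture explicit or implicit feature interactions. Although it has received little attention in the literature, the embedding module is also a key component of deep CTR models, for two reasons:
- The embedding module is the cornerstone of the subsequent FI module and directly affects its effectiveness.
- Most of the parameters of a deep CTR model are concentrated in the embedding module, so it naturally has a large influence on prediction performance.
The neglect of the embedding module by the research community motivates the in-depth study in 《An Embedding Learning Framework for Numerical Features in CTR Prediction》.
The embedding module usually works as a look-up table, mapping each categorical field feature of the input data to a latent embedding space with learnable parameters. Unfortunately, this categorization strategy cannot handle numerical features, because a numerical field (such as height) may take infinitely many feature values.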
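In practice, existing representation methods for numerical features fall into three categories (the red dashed box in the figure below):
- No Embedding: use the raw feature value or a transformation of it directly, without learning an embedding.
- Field Embedding: learn a single field embedding for each numerical field.
- Discretization: convert numerical features into categorical features through various heuristic discretization strategies, then assign embeddings.
However, the first two categories may perform poorly because of the low capacity of their representations. The last category is also sub-optimal, because heuristic discretization rules are not optimized toward the final objective of the CTR model. In addition, hard discretization-based methods suffer from the Similar value But Dis-similar embedding (SBD) and Dis-similar value But Same embedding (DBS) problems, which are discussed in detail later.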
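To address the limitations of existing methods, the paper 《An Embedding Learning Framework for Numerical Features in CTR Prediction》 proposes AutoDis, an automatic end-to-end embedding learning framework for numerical features based on soft discretization. AutoDis consists of three core modules: meta-embedding, automatic discretization, and aggregation. Together they provide high model capacity, end-to-end training, and unique representations. Specifically:
- First, the paper designs a set of meta-embeddings for each numerical field. These meta-embeddings are shared by all feature values within the field and learn global knowledge from the perspective of the field, with a controllable number of embedding parameters.
- Then, a differentiable automatic discretization module performs soft discretization and captures the correlation between each numerical feature and the field-specific meta-embeddings.
- Finally, an aggregation function is used to learn a unique Continuous-But-Different representation.
To the authors' knowledge, AutoDis is the first end-to-end soft discretization framework for numerical feature embedding that can be optimized jointly with the final objective of a deep CTR model.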
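Main contributions of the paper:
- The paper proposes AutoDis, a pluggable embedding learning framework for numerical features. It has high model capacity and generates unique representations in an end-to-end manner with a manageable number of parameters.
- In AutoDis, the paper designs meta-embeddings for each numerical field to learn globally shared knowledge. A differentiable automatic discretization captures the correlation between numerical features and meta-embeddings, and an aggregation step learns a unique Continuous-But-Different representation for each feature.
- Comprehensive experiments on two public datasets and one industrial dataset show that AutoDis outperforms existing representation methods for numerical features. AutoDis is also compatible with various popular deep CTR models and improves their recommendation performance. An online A/B test on a mainstream advertising platform shows that AutoDis improves CTR and eCPM over the commercial baseline by 2.1% and 2.7% respectively.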
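Related work:
- Embedding: As the cornerstone of deep CTR models, the embedding module has a major impact on model performance because it holds most of the model parameters. Here we briefly review embedding look-up based research in the recommendation domain.
  Existing research mainly focuses on designing adaptable embedding algorithms that assign variable-length embeddings or multi-embeddings to different features, such as Mixed-dimension (《Mixed dimension embeddings with application to memory-efficient recommendation systems》), NIS (《Neural Input Search for Large Scale Recommendation Models》), and AutoEmb (《AutoEmb: Automated Embedding Dimensionality Search in Streaming Recommendations》).
  Another line of research studies embedding compression to reduce the memory footprint on web-scale datasets.
  However, these methods can only be applied, in a look-up table manner, to categorical features or to numerical features after discretization. Few studies pay attention to embedding learning for numerical features, which is vital in industrial deep CTR models.
- Feature Interaction: According to how explicit and implicit feature interactions are combined, existing CTR models fall into two categories: stacked structure and parallel structure.
  - Stacked structure: first model explicit feature interactions over the embeddings, then stack a DNN to extract high-level implicit feature interactions. Representative models include FNN, IPNN, PIN, FiBiNET, FGCNN, AutoGroup, DIN, and DIEN.
  - Parallel structure: use two parallel networks to capture explicit and implicit feature interaction signals respectively, and fuse the information in the output layer. Representative models include Wide & Deep, DeepFM, AutoFIS, DCN, xDeepFM, and AutoInt.
    In these models, implicit feature interactions are extracted by a DNN. For explicit feature interactions, Wide & Deep uses handcrafted cross features, DeepFM and AutoFIS use an FM structure, DCN uses a cross network, xDeepFM uses a Compressed Interaction Network (CIN), and AutoInt uses a multi-head self-attention network.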

1.1 Model

1.1.1 Preliminaries

Suppose the dataset used to train a CTR model consists of $Q$ instances $(\mathbf{\vec x}, y)$, where $y$ is the label and $\mathbf{\vec x}$ is a multi-field data record containing $M$ categorical fields and $N$ numerical fields:
$\mathbf{\vec x} = \left[\underbrace{x_1, x_2, \cdots, x_N}_{\text{scalars}}; \underbrace{\mathbf{\vec x}_{N+1}, \mathbf{\vec x}_{N+2}, \cdots, \mathbf{\vec x}_{N+M}}_{\text{one-hot vectors}}\right]$
where $x_j$ is the scalar feature value of the $j$-th numerical field, and $\mathbf{\vec x}_{N+i}$ is the one-hot vector of the feature value of the $i$-th categorical field.
For the $i$-th categorical field, the feature embedding is obtained by an embedding look-up operation:
$\mathbf{\vec e}_{N+i} = \mathbf E_{N+i}\, \mathbf{\vec x}_{N+i}$
where $\mathbf E_{N+i}\in \mathbb R^{v_{N+i}\times d}$ is the embedding matrix of the $i$-th categorical field, $d$ is the embedding size, and $v_{N+i}$ is the vocabulary size of the $i$-th categorical field.
The representation of the categorical fields is therefore $\left[\mathbf{\vec e}_{N+1},\mathbf{\vec e}_{N+2},\cdots,\mathbf{\vec e}_{N+M} \right]\in \mathbb R^{M\times d}$.
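The look-up above can be sketched in a few lines of plain Python. This is only an illustration: the matrix `E`, its size, and the helper names `one_hot`, `embed_by_matmul`, and `embed_by_lookup` are hypothetical. It shows that multiplying a one-hot vector by the embedding matrix selects one row, which is why the operation is implemented as a table look-up.

```python
# Hypothetical toy embedding matrix for one categorical field:
# vocabulary size v = 4, embedding size d = 3.
E = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1.0 at `index`."""
    return [1.0 if k == index else 0.0 for k in range(size)]


def embed_by_matmul(E, x_one_hot):
    """Compute sum_k x[k] * E[k], i.e. the one-hot vector applied to E."""
    d = len(E[0])
    return [sum(x_one_hot[k] * E[k][c] for k in range(len(E))) for c in range(d)]


def embed_by_lookup(E, index):
    """Equivalent row look-up: pick the row of E for the active category."""
    return list(E[index])


x = one_hot(2, len(E))
# Both routes give the same embedding for category 2.
assert embed_by_matmul(E, x) == embed_by_lookup(E, 2)
```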
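For numerical fields, existing representation methods can be summarized into three categories: No Embedding, Field Embedding, and Discretization.
- No Embedding: use the raw value, or a transformation of it, directly without learning an embedding. For example, Wide & Deep (Google Play) and DMT (JD.com) feed the raw or transformed feature values to the model, and YouTube DNN uses several transformations (square and square root) of the normalized feature value $\tilde x_{j}$:
  $\mathbf{\vec e}_\text{YouTube} = \left[\tilde x_1^2, \tilde x_1, \sqrt{\tilde x_1}, \tilde x_2^2, \tilde x_2, \sqrt{\tilde x_2}, \cdots, \tilde x_N^2, \tilde x_N, \sqrt{\tilde x_N}\right] \in \mathbb R^{3N}$
  In Facebook's DLRM, a multi-layer perceptron models all numerical features together:
  $\mathbf{\vec e}_\text{DLRM} = \text{DNN}\left([x_1, x_2, \cdots, x_N]\right)$
  where the DNN has the structure 512-256-d, and $\mathbf{\vec e}_\text{DLRM}\in \mathbb R^d$.
  Intuitively, these No Embedding methods can hardly capture the informative knowledge of a numerical field because their capacity is too low.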
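- Field Embedding: the common approach in academia. All numerical features within a field share one uniform field embedding, which is multiplied by the feature value:
  $\mathbf{\vec e}_\text{FE} = \left[x_1 \mathbf{\vec e}_1, x_2 \mathbf{\vec e}_2, \cdots, x_N \mathbf{\vec e}_N\right] \in \mathbb R^{N\times d}$
  where $\mathbf{\vec e}_{j} \in \mathbb R^d$ is the uniform embedding vector of the $j$-th numerical field.
  However, the representation capacity of Field Embedding is limited, both because there is a single shared field-specific embedding and because different feature values within the field are related only by linear scaling.
  For example, consider two feature values $x=1$ and $x=10000$ of the same field. Under Field Embedding:
  $(1 \times \mathbf{\vec e}) \cdot (1 \times \mathbf{\vec e}) < (1 \times \mathbf{\vec e}) \cdot (10000 \times \mathbf{\vec e})$
  That is, measured by inner product, two users with incomes of 1 and 10000 come out as more similar than two users who both have an income of 1. This is inappropriate in practice.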
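- Discretization: a popular way to handle numerical features in industrial recommender systems is Discretization, i.e. converting numerical features into categorical features. For a feature $x_{j}$ of the $j$-th numerical field, the feature embedding $\mathbf{\vec e}_{j}$ is obtained in two stages, discretization and embedding look-up:
  $\mathbf{\vec e}_j = \mathbf E_j\, d_j(x_j)$
  where $\mathbf E_{j}\in \mathbb R^{H_{j}\times d}$ is the embedding matrix of the $j$-th numerical field, $H_{j}$ is the number of buckets, and $d_{j}(\cdot)$ is the manually designed discretization function of the $j$-th numerical field, which maps each feature value of that field into one of the $H_{j}$ buckets.
  Specifically, three discretization functions are widely used:
  - Equal Distance/Frequency Discretization (EDD/EFD): EDD partitions the feature values into $H_{j}$ equal-width buckets. Suppose the value range of the $j$-th numerical field is $\left[x^\min_{j}, x_{j}^\max\right]$, so the interval width is $w_{j} = \left(x_{j}^\max - x_{j}^\min\right)/H_{j}$. The EDD discretization function $d_{j}^\text{EDD}(\cdot)$ then gives the discretized result $\hat x_{j}$:
    $\hat x_j = d_j^\text{EDD}(x_j) = \text{floor}\left(\frac{x_j - x_j^\min}{w_j}\right)$
    EFD instead divides the range $\left[x^\min_{j}, x_{j}^\max\right]$ into several buckets such that each bucket contains the same number of samples.
  - Logarithm Discretization (LD): the champion of the Criteo advertiser prediction competition on Kaggle used the logarithm and the floor operation to convert numerical features into categorical form. The discretized result $\hat x_{j}$ is given by the LD discretization function $d_{j}^\text{LD}(\cdot)$:
    $\hat x_j = d_j^\text{LD}(x_j) = \text{floor}\left(\log(x_j)^2\right)$
  - Tree-based Discretization (TD): besides deep learning models, tree-based models (such as GBDT) are widely used in recommendation because they handle numerical features effectively. Many tree-based methods are therefore used to discretize numerical features.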
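  Although Discretization is widely used in industry, it still has three limitations (as shown in the figure below):
  - Two-Phase Problem (TPP): the discretization process is determined by heuristic rules or by another model, so it cannot be optimized together with the final objective of the CTR prediction task, which leads to sub-optimal performance.
  - Similar value But Dis-similar embedding (SBD): these discretization strategies may split similar feature values (values near a boundary) into two different buckets, so their resulting embeddings differ markedly.
    For example, a common discretization of the Age field assigns [18,40] to teenagers and [41,65] to the middle-aged, so the values 40 and 41 receive markedly different embeddings.
  - Dis-similar value But Same embedding (DBS): existing discretization strategies may group clearly different values into the same bucket, producing indistinguishable embeddings. In the same Age example, all values between 18 and 40 fall into one bucket and are assigned the same embedding, yet people aged 18 and 40 may have very different characteristics. Discretization-based strategies cannot effectively describe the continuity of numerical feature changes.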
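The hard discretization functions above, and the SBD and DBS problems they cause, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `edd`, `ld`, and `bucketize` are hypothetical, clamping the EDD index so that the maximum value lands in the last bucket is an assumption not stated in the text, and `ld` reads $\log(x_j)^2$ as the square of the natural logarithm.

```python
import bisect
import math


def edd(x, x_min, x_max, num_buckets):
    """Equal Distance Discretization: floor((x - x_min) / w)."""
    width = (x_max - x_min) / num_buckets
    index = math.floor((x - x_min) / width)
    # Assumption: clamp so that x == x_max falls into the last bucket.
    return min(max(index, 0), num_buckets - 1)


def ld(x):
    """Logarithm Discretization: floor(log(x)^2), defined here for x > 0."""
    return math.floor(math.log(x) ** 2)


def bucketize(x, boundaries):
    """Hard discretization given sorted lower bounds of buckets 1, 2, ..."""
    return bisect.bisect_right(boundaries, x)


# Age example from the text: [18, 40] and [41, 65] are two buckets.
age_boundaries = [41]

# SBD: 40 and 41 are similar values but get different bucket ids.
print(bucketize(40, age_boundaries), bucketize(41, age_boundaries))
# DBS: 18 and 40 are dissimilar values but share one bucket id.
print(bucketize(18, age_boundaries), bucketize(40, age_boundaries))
```

Because each bucket id indexes exactly one row of the embedding matrix, the bucket ids printed above translate directly into "different embedding" (SBD) or "same embedding" (DBS).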
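In summary, the table below compares AutoDis with existing representation methods along three aspects. These methods either struggle to capture informative knowledge because of their low capacity, or require expert feature engineering, and both can degrade overall performance. We therefore propose the AutoDis framework. To our knowledge, it is the first numerical feature embedding learning framework that combines high model capacity, end-to-end training, and unique representations.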
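The low-capacity argument against Field Embedding can be checked numerically. The sketch below uses a hypothetical three-dimensional field embedding `e_income` and hypothetical helper names. It reproduces the inner-product inequality from the income example and also shows that all embeddings of a field are collinear (cosine similarity 1 for positive values), so a value can only change the length of the vector, never its direction.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def field_embedding(x, e):
    """Field Embedding: scale the shared field embedding e by the value x."""
    return [x * c for c in e]


e_income = [0.5, -0.2, 0.1]  # hypothetical shared embedding of one field

u1 = field_embedding(1.0, e_income)      # user with income 1
u2 = field_embedding(1.0, e_income)      # another user with income 1
u3 = field_embedding(10000.0, e_income)  # user with income 10000

# Inner product rates (1, 10000) as more similar than (1, 1).
print(dot(u1, u2) < dot(u1, u3))
# Direction is identical for every positive value of the field.
print(cosine(u1, u3))
```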

1.1.2 AutoDis

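The AutoDis framework is shown in Figure 3. It serves as a pluggable embedding framework for numerical features and is compatible with existing deep CTR models. AutoDis contains three core modules: meta-embedding, automatic discretization, and aggregation.
Viewed as a network, AutoDis maps a scalar $x$ through: a DNN layer (hidden size $H$, relu activation) -> a DNN layer with skip-connection (hidden size $H$, softmax activation) -> a DNN layer (hidden size $d$, no activation).
For the $j$-th numerical field, AutoDis learns a unique representation for each numerical feature $x_{j}$ (as shown in Figure 4):
$\mathbf{\vec e}_j = f\left(d_j^\text{Auto}(x_j), \mathbf{ME}_j\right)$
where:
- $\mathbf {ME}_{j}$ is the meta-embedding matrix of the $j$-th numerical field.
- $d_{j}^\text{Auto}(\cdot)$ is the automatic discretization function of the $j$-th numerical field.
- $f(\cdot)$ is the aggregation function.
  All numerical fields use the same aggregation function.
Finally, the embeddings of the categorical features and the numerical features are concatenated and fed into a deep CTR model for prediction:
$\hat y = \text{CTR}\left(\underbrace{\mathbf{\vec e}_1, \mathbf{\vec e}_2, \cdots, \mathbf{\vec e}_N}_{\text{numerical embeddings}}; \underbrace{\mathbf{\vec e}_{N+1}, \mathbf{\vec e}_{N+2}, \cdots, \mathbf{\vec e}_{N+M}}_{\text{categorical embeddings}}\right)$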
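Meta-Embeddings: for the $j$-th numerical field, we design a set of meta-embeddings $\mathbf {ME}_{j}\in \mathbb R^{H_{j}\times d}$. $\mathbf {ME}_{j}$ is shared by all feature values within the field, and $H_{j}$ is the number of meta-embeddings. Each meta-embedding can be viewed as a sub-space of the latent space, used to improve expressiveness and capacity. By combining these meta-embeddings, the learned embedding is more informative than that of Field Embedding. The number of embedding parameters is determined by $H_{j}$, so model complexity is highly controllable, which makes the method scalable.
$H_j$ may differ across fields $j$: some fields warrant a larger $H_j$, while for others $H_j$ can take a smaller value.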
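Automatic Discretization: to capture the correlation between the value of a numerical feature and the designed meta-embeddings, we propose a differentiable automatic discretization module $d_{j}^\text{Auto}(\cdot)$. With $d_{j}^\text{Auto}(\cdot)$, each feature of the $j$-th numerical field is discretized into $H_{j}$ buckets, and each bucket embedding corresponds to one of the meta-embeddings above.
Specifically, a two-layer network with a skip-connection discretizes the numerical field feature $x_{j}$ into $H_{j}$ buckets:
$\begin{matrix} \mathbf{\vec h}_j = \text{Leaky-ReLU}\left(\mathbf{\vec w}_j x_j\right) \\ \tilde{\mathbf{\vec x}}_j = \mathbf W_j \mathbf{\vec h}_j + \alpha \mathbf{\vec h}_j \end{matrix}$
where:
- $\mathbf{\vec w}_j\in \mathbb R^{H_j}, \mathbf W_j\in \mathbb R^{H_j\times H_j}$ are the learnable parameters of the automatic discretization network for the $j$-th numerical feature field.
- Leaky-ReLU is the activation function, and $\alpha$ is the control factor of the skip-connection.
The projection result is $\tilde{\mathbf{\vec x}}_j = \left[\tilde x_{j,1},\tilde x_{j,2},\cdots,\tilde x_{j,H_j}\right]$, where $\tilde x_{j,h}$ is the projection output of the numerical field feature $x_j$ on the $h$-th bucket. Finally, the correlation between the numerical field feature $x_j$ and the meta-embeddings $\mathbf {ME}_j$ is normalized by a Softmax:
$\hat x_{j,h} = \frac{\exp\left(\tilde x_{j,h}/\tau\right)}{\sum_{k=1}^{H_j} \exp\left(\tilde x_{j,k}/\tau\right)}$
where $\tau$ is the temperature coefficient that controls the discretization distribution. The automatic discretization function $d_{j}^\text{Auto}(\cdot)$ thus yields the discretized result $\hat{\mathbf{\vec x}}_j\in \mathbb R^{H_j}$:
$\hat{\mathbf{\vec x}}_j = d_j^\text{Auto}(x_j) = \left[\hat x_{j,1}, \hat x_{j,2}, \cdots, \hat x_{j,H_j}\right]$
$\hat{\mathbf{\vec x}}_j$ is a vector whose entry $\hat x_{j,h}$ is the probability that the numerical field feature $x_j$ is discretized into the $h$-th bucket, i.e. the correlation between $x_j$ and the $h$-th meta-embedding (the bucket embedding). This can be understood as soft discretization. Unlike the hard discretization described earlier, soft discretization does not assign a feature value to one fixed bucket, which addresses the SBD and DBS problems. Moreover, because soft discretization is differentiable, AutoDis can be trained end-to-end and optimized toward the final objective.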
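The forward pass of the automatic discretization module can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the parameter values `w` and `W`, the Leaky-ReLU slope of 0.01, and the function names are all hypothetical, and the softmax subtracts the maximum logit purely for numerical stability (it does not change the result).

```python
import math


def leaky_relu(v, negative_slope=0.01):
    """Element-wise Leaky-ReLU (slope value is an assumption)."""
    return [c if c > 0 else negative_slope * c for c in v]


def softmax(v, tau=1.0):
    """Softmax with temperature tau over a list of logits."""
    scaled = [c / tau for c in v]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(c - m) for c in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def auto_discretize(x, w, W, alpha, tau):
    """Soft discretization of a scalar x into len(w) buckets."""
    h = leaky_relu([w_k * x for w_k in w])  # h = Leaky-ReLU(w * x)
    n = len(h)
    # x_tilde = W h + alpha * h  (skip-connection)
    x_tilde = [sum(W[r][c] * h[c] for c in range(n)) + alpha * h[r] for r in range(n)]
    return softmax(x_tilde, tau)


# Hypothetical parameters for a field with H = 3 buckets.
w = [1.0, -0.5, 0.3]
W = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2], [-0.3, 0.1, 0.5]]

p_a = auto_discretize(0.50, w, W, alpha=0.5, tau=1.0)
p_b = auto_discretize(0.51, w, W, alpha=0.5, tau=1.0)
p_c = auto_discretize(0.90, w, W, alpha=0.5, tau=1.0)


def l1(p, q):
    """L1 distance between two bucket distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Nearby values get nearby bucket distributions; distant values differ more.
print(l1(p_a, p_b) < l1(p_a, p_c))
```

Since every step is a smooth function of `x`, close inputs produce close bucket distributions, which is the property hard discretization lacks at bucket boundaries.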
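When $\tau\rightarrow \infty$, the discretization distribution tends to a uniform distribution; when $\tau\rightarrow 0$, it tends to a one-hot distribution. The temperature coefficient $\tau$ therefore plays an important role in the automatic discretization distribution. In addition, the feature distribution differs across fields, so it is necessary to learn a different $\tau$ for different features. Specifically, we propose a temperature coefficient adaptive network (a two-layer network) that takes both the global field statistics feature and the local input feature into account:
$\tau_{x_j} = \text{Sigmoid}\left(\mathbf W_j^{(2)}\, \text{Leaky-ReLU}\left(\mathbf W_j^{(1)} \left[\bar{\mathbf{\vec n}}_j \,||\, x_j\right]\right)\right)$
where:
- $\bar{\mathbf{\vec n}}_j$ is the global statistics feature of the $j$-th numerical field (for example, the sampled Cumulative Distribution Function, CDF), $x_j$ is the local input feature, and $(\cdot ||\cdot)$ is the concatenation operation.
- $\mathbf W_j^{(1)}, \mathbf W^{(2)}_j$ are the weight parameters to be learned.
The output $\tau_{x_j}$ is rescaled from the range $(0,1)$ to $(\tau-\epsilon, \tau+\epsilon)$, where $\tau$ is a global hyperparameter.
What is $\epsilon$? The authors do not say. This note takes it to be a hyperparameter.
The design of this temperature coefficient adaptive network is not very clean, and it is more complex to implement. Judging from the conclusions of the experiments, tuning a global temperature would serve the same purpose.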
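A minimal sketch of the temperature coefficient adaptive network follows. Several details are assumptions, because the text leaves them open: the linear rescale `tau - eps + 2 * eps * s` from $(0,1)$ to $(\tau-\epsilon, \tau+\epsilon)$, the treatment of `eps` as a hyperparameter, the Leaky-ReLU slope, and all weights and statistics values, which are hypothetical.

```python
import math


def leaky_relu(v, negative_slope=0.01):
    """Element-wise Leaky-ReLU (slope value is an assumption)."""
    return [c if c > 0 else negative_slope * c for c in v]


def sigmoid(z):
    """Logistic sigmoid for a scalar of moderate magnitude."""
    return 1.0 / (1.0 + math.exp(-z))


def matvec(M, v):
    """Matrix-vector product with M given as a list of rows."""
    return [sum(m * c for m, c in zip(row, v)) for row in M]


def adaptive_temperature(field_stats, x, W1, W2, tau, eps):
    """Feature-specific temperature from global stats and the local value x."""
    features = list(field_stats) + [x]         # [n_bar || x]
    hidden = leaky_relu(matvec(W1, features))  # first layer
    s = sigmoid(matvec(W2, hidden)[0])         # second layer, output in (0, 1)
    # Assumption: linear rescale from (0, 1) to (tau - eps, tau + eps).
    return (tau - eps) + 2.0 * eps * s


# Hypothetical inputs: three sampled CDF values as global field statistics.
field_stats = [0.1, 0.5, 0.9]
W1 = [[0.3, -0.2, 0.1, 0.4], [-0.1, 0.2, 0.5, -0.3]]  # 2 hidden units, 4 inputs
W2 = [[0.7, -0.6]]                                     # 1 output, 2 hidden units

tau_low = adaptive_temperature(field_stats, 0.25, W1, W2, tau=1.0, eps=0.5)
tau_high = adaptive_temperature(field_stats, 0.75, W1, W2, tau=1.0, eps=0.5)
print(tau_low, tau_high)
```

The resulting value would replace the global $\tau$ in the Softmax of the automatic discretization step for that particular feature value.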
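Aggregation Function: after obtaining the correlations between the feature value and the meta-embeddings, the embedding is generated by applying an aggregation operation $f(\cdot)$ over the meta-embeddings, using the correlations as selection probabilities. We consider the following aggregation functions $f(\cdot)$:
- Max-Pooling: select the most relevant meta-embedding (the one with the highest probability $\hat x_{j,h}$):
  $\mathbf{\vec e}_j = \mathbf{ME}_{j,k}, \quad k = \arg\max_{1\le h\le H_j} \hat x_{j,h}$
  where $k$ is the index of the meta-embedding with the largest $\hat x_{j,h}$, and $\mathbf {ME}_{j,k}$ is the $k$-th meta-embedding in $\mathbf {ME}_j$.
  However, this hard selection strategy degrades AutoDis into a hard discretization method, which brings back the SBD and DBS problems described above.
- Top-K-Sum: sum the top-K meta-embeddings with the highest correlation $\hat x_{j,h}$:
  $\mathbf{\vec e}_j = \sum_{l=1}^{K} \mathbf{ME}_{j,k_l}, \quad k_l = \underset{1\le h\le H_j,\ l\text{-th}}{\arg\max}\ \hat x_{j,h}$
  where $k_l$ is the index of the meta-embedding with the $l$-th ($1\le l\le K$) largest $\hat x_{j,h}$, and $\mathbf {ME}_{j,k_l}$ is the $k_l$-th meta-embedding in $\mathbf {ME}_j$.
  However, Top-K-Sum has two limitations:
  - Although the number of embeddings that can be generated grows from $H_j$ to $C_{H_j}^K$ compared with Max-Pooling, it still cannot fundamentally solve the DBS problem.
  - The learned embedding $\mathbf{\vec e}_j$ does not take the correlation values into account.
- Weighted-Average: make full use of the entire set of meta-embeddings and of the correlation between each meta-embedding and the feature value:
  $\mathbf{\vec e}_j = \sum_{h=1}^{H_j} \hat x_{j,h}\, \mathbf{ME}_{j,h}$
  With this weighted aggregation strategy, relevant meta-embeddings contribute more to an informative embedding, while irrelevant ones are largely ignored. The strategy also guarantees that each feature learns a unique representation and that the learned embeddings are Continuous-But-Different: the closer two feature values are, the more similar their embeddings.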
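The three aggregation functions can be compared side by side in a short sketch. The meta-embedding matrix `ME` and the two probability vectors are hypothetical toy values, and the function names are illustrative. The example shows that two different discretization distributions with the same arg-max collapse to one embedding under Max-Pooling but stay distinct under Weighted-Average.

```python
def max_pooling(probs, ME):
    """Return the meta-embedding with the highest probability."""
    k = max(range(len(probs)), key=lambda h: probs[h])
    return list(ME[k])


def top_k_sum(probs, ME, K):
    """Sum the K meta-embeddings with the highest probabilities."""
    top = sorted(range(len(probs)), key=lambda h: probs[h], reverse=True)[:K]
    d = len(ME[0])
    return [sum(ME[k][c] for k in top) for c in range(d)]


def weighted_average(probs, ME):
    """Probability-weighted sum over all meta-embeddings."""
    d = len(ME[0])
    return [sum(p * row[c] for p, row in zip(probs, ME)) for c in range(d)]


# Hypothetical meta-embeddings: H = 3 buckets, d = 2.
ME = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

probs_a = [0.5, 0.125, 0.375]   # soft discretization of one feature value
probs_b = [0.75, 0.125, 0.125]  # a different value with the same arg-max

print(max_pooling(probs_a, ME), max_pooling(probs_b, ME))            # identical
print(top_k_sum(probs_a, ME, K=2))                                   # ignores weights
print(weighted_average(probs_a, ME), weighted_average(probs_b, ME))  # distinct
```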
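Training: AutoDis is trained end-to-end, jointly with the final objective of the specific deep CTR model. The loss function is the widely used LogLoss with a regularization term:
$L = -\frac{1}{Q} \sum_{i=1}^{Q} \left[y_i \log \hat y_i + (1 - y_i) \log (1 - \hat y_i)\right] + \lambda \|\Theta\|_2$
where $y_i$ and $\hat y_i$ are the ground truth label and the predicted value of the $i$-th instance, $Q$ is the total number of training instances, and $\lambda$ is the L2 regularization coefficient. $\Theta=\{\Theta_\text{Cat-Emb},\Theta_\text{AutoDis},\Theta_\text{CTR}\}$ denotes, respectively, the feature embedding parameters of the categorical fields, the meta-embedding and automatic discretization parameters, and the parameters of the deep CTR model.
In preprocessing, numerical feature values are scaled into $[0, 1]$: the value $x_j$ of the $j$-th numerical field is normalized as:
$x_j \leftarrow \frac{x_j - x_j^\min}{x_j^\max - x_j^\min}$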
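The normalization and the training objective can be written out as a short sketch. The function names and the `eps` clipping constant are hypothetical additions. The regularizer is implemented as the plain L2 norm, following the formula as written; many implementations use the squared norm instead, which is an assumption the reader should check against the code base in use.

```python
import math


def min_max_normalize(x, x_min, x_max):
    """Scale a feature value into [0, 1] using the field's min and max."""
    return (x - x_min) / (x_max - x_min)


def log_loss_with_l2(y_true, y_pred, params, lam, eps=1e-12):
    """Mean LogLoss over Q instances plus lam * ||params||_2."""
    q = len(y_true)
    log_likelihood = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        log_likelihood += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    l2_norm = math.sqrt(sum(t * t for t in params))
    return -log_likelihood / q + lam * l2_norm


# Hypothetical toy values.
print(min_max_normalize(5.0, 0.0, 10.0))
print(log_loss_with_l2([1, 0], [0.5, 0.5], params=[3.0, 4.0], lam=0.1))
```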

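1.2 Experiments

Datasets:
- Criteo: released in the Criteo Display Advertising Challenge 2013; contains 13 numerical feature fields.
- AutoML: released in the AutoML for Lifelong Machine Learning Challenge at NeurIPS 2018; contains 23 numerical feature fields.
- Industrial dataset: sampled from a mainstream online advertising platform; contains 41 numerical feature fields.
The statistics of the datasets are shown in the table below.
Evaluation metrics: AUC and LogLoss.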
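For reference, AUC can be computed directly from its pairwise definition: the probability that a randomly chosen positive instance is scored above a randomly chosen negative one, with ties counted as one half. The sketch below is a simple quadratic-time illustration with a hypothetical function name and toy inputs, not the evaluation code used in the paper.

```python
def auc(y_true, y_score):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive-negative pairs are ordered correctly.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```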
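All experiments are repeated 5 times with different random seeds. A two-tailed unpaired t-test is used to detect significant differences between AutoDis and the best baseline.
Baseline methods: to demonstrate the effectiveness of AutoDis, we compare it with three categories of representation learning methods for numerical features: No Embedding (YouTube, DLRM), Field Embedding, FE (DeepFM), and Discretization (EDD, as in IPNN; LD; and TD, as in DeepGBM).
In addition, to verify the compatibility of the AutoDis framework with embedding-based deep CTR models, we apply AutoDis to six representative models: FNN, Wide & Deep, DeepFM, DCN, IPNN, and xDeepFM.
Implementation:
- All models are optimized with mini-batch Adam, with the learning rate searched over {10e-6, 5e-5, ... , 10e-2}.
- The embedding size is set to 80 on Criteo and 70 on AutoML.
- The hidden layers of the deep CTR models are fixed to 1024-512-256-128 by default, and the explicit feature interaction components in DCN and xDeepFM (CrossNet and CIN) are set to 3 layers.
- The L2 regularization coefficient is searched over {10e-6, 5e-5, ... , 10e-3}.
- For AutoDis, the number of meta-embeddings per numerical field is 20 on Criteo and 40 on AutoML.
- The skip-connection control factor is searched over {0, 0.1, ... , 1}, and the number of neurons in the temperature coefficient adaptive network is set to 64.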
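Comparison with other representation learning methods: we run the different numerical feature representation learning methods on the three datasets, with DeepFM as the deep CTR model. The results are shown in the table below. We can see that:
- AutoDis outperforms all baselines on all datasets, showing its superiority and robustness across different numerical features.
- The No Embedding and Field Embedding methods perform worse than Discretization and AutoDis. Both categories suffer from low capacity and limited expressiveness.
  The gap between No Embedding and Field Embedding is small, basically within 0.1%.
- Compared with the three existing Discretization methods, which convert numerical features into categorical form, AutoDis improves AUC over the best baseline by 0.17%, 0.23%, and 0.22% respectively on the three datasets.
Comparison across different CTR models: AutoDis is a general framework and can be viewed as a plug-in that improves the performance of various deep CTR models. To demonstrate its compatibility, we apply AutoDis to a series of popular models; the results are shown in the table below. Compared with the Field Embedding representation method, the AutoDis framework significantly improves the prediction performance of these models. Because numerical feature discretization and embedding learning are optimized toward the final objective of these CTR models, the resulting representations are informative and performance improves.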
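Online A/B Testing: we run an online experiment on a mainstream advertising platform to verify the performance of AutoDis. The platform has millions of daily active users interacting with ads and generates tens of millions of user log events per day. For the control group, numerical features are discretized by hybrid manually-designed rules (such as EDD and TD). The experimental group uses AutoDis to discretize all numerical features and learn their embeddings automatically. Deploying AutoDis into an existing CTR model is convenient and requires almost no engineering work on the online serving system.
AutoDis achieves a 0.2% improvement in offline AUC. Relative to the control group, the experimental group improves online CTR and eCPM by 2.1% and 2.7% respectively (statistically significant), which brings substantial commercial profit.
With AutoDis integrated, existing numerical features no longer need any discretization rules, and introducing more numerical features in the future becomes more efficient because no handcrafted rules have to be explored.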
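Embedding Analysis: to better understand the Continuous-But-Different embeddings learned by AutoDis, we carry out a macro-analysis of the embeddings and a micro-analysis of soft discretization.
- Macro-analysis of embeddings: the figure below visualizes the embeddings derived by DeepFM-AutoDis and DeepFM-EDD for the 3rd numerical field of the Criteo dataset. We randomly select 250 embeddings and project them into a two-dimensional space with t-SNE. Nodes with similar colors have similar values. We can see that:
  - AutoDis learns a unique embedding for each feature. Moreover, similar numerical features (with similar colors) are represented by closely related embeddings (with similar positions in the two-dimensional space), which illustrates the Continuous-But-Different property of the embeddings.
  - EDD, however, learns the same embedding for all features in a bucket and completely different embeddings across buckets, which leads to step-wise "unsmooth" embeddings and hence to worse task performance.
- Micro-analysis of soft discretization: we examine the Softmax distribution in the soft discretization process of DeepFM-AutoDis. From the 8th numerical field of the Criteo dataset we select an adjacent feature-pair (feature values 1 and 2) and a distant feature-pair (feature values 1 and 10), and visualize their discretization distributions in the figure below. The adjacent feature-pair has similar Softmax distributions, while the distant feature-pair has different distributions.
  This property helps ensure that similar feature values learn similar embeddings through AutoDis, which preserves the continuity of the embeddings.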
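Numerical Fields Analysis: to evaluate the effect of DeepFM-AutoDis on each numerical field, on the Criteo dataset we take all 26 categorical fields as the base features and cumulatively add the 13 numerical fields one at a time. The figure below shows the prediction performance when numerical fields are added in the original order of the dataset and in a random order. We can see that:
- AutoDis improves performance even with a single numerical field.
- AutoDis yields a cumulative improvement over multiple numerical fields.
- Compared with existing methods, the improvement achieved by AutoDis is more significant and more stable.
Model Complexity: to quantify the space and time complexity of AutoDis, we compare its number of model parameters and batch inference time with those of the EDD discretization method on DeepFM. The results are shown in the table below. We can see that:
- Compared with EDD, the increase in model parameters introduced by AutoDis is negligible.
- The computational efficiency of AutoDis is also comparable.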
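Ablation study and hyperparameter sensitivity:
- Aggregation strategy: the figure below summarizes the prediction performance of DeepFM-AutoDis with the Max-Pooling, Top-K-Sum, and Weighted-Average aggregation strategies. The Weighted-Average strategy achieves the best performance. The reason is that, compared with the other strategies, Weighted-Average makes full use of the meta-embeddings and their corresponding correlations, and overcomes the DBS and SBD problems.
- Temperature coefficient: to demonstrate the effectiveness of the temperature coefficient adaptive network in generating a feature-specific $\tau_{x_j}$, we compare $\tau_{x_j}$ with different global temperatures $\tau$. As shown in the figure below:
  - On the Criteo and AutoML datasets, the best global temperatures are around 1e-5 and 4e-3 respectively.
  - However, the temperature coefficient adaptive network achieves the best result (red dashed line), because it adaptively adjusts the temperature coefficient for different features based on the global field statistics feature and the local input feature, and so obtains a more suitable discretization distribution.
- Number of meta-embeddings: the experimental results are shown in the figure below. We can see that:
  - Increasing the number of meta-embeddings substantially improves performance at first, because the meta-embeddings carry richer information.
  - However, using too many meta-embeddings not only increases the computational burden but also hurts performance.
  Considering both prediction quality and training efficiency, we set the number of meta-embeddings to 20 for Criteo and 40 for AutoML.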