2026_UniMixer

I. UniMixer [2026]

"UniMixer: A Unified Architecture for Scaling Laws in Recommendation Systems"

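In recent years, scaling laws for recommendation models have attracted growing attention. They describe how recommender performance relates to parameter count and compute (FLOPs). Three architectures currently dominate the scaling of recommendation models: attention-based methods, TokenMixer-based methods, and factorization-machine-based methods, which differ fundamentally in design philosophy and structure. This paper proposes UniMixer, a unified scaling architecture for recommender systems that aims to improve scaling efficiency and to build a theoretical framework unifying the mainstream scaling blocks. By converting the rule-based TokenMixer into an equivalent parameterized structure, we construct a general parameterized feature mixing module, so that the token mixing pattern can be optimized and learned during training. The general parameterized token mixing also removes TokenMixer's constraint that the number of heads must equal the number of tokens. In addition, we establish a unified design framework for scaling modules in recommender systems, connecting attention-based, TokenMixer-based, and factorization-machine-based methods. To further improve scaling ROI, we design a lightweight UniMixing module, UniMixing-Lite, which substantially compresses model parameters and compute cost while significantly improving model performance. The scaling curves are shown in the figure below. Extensive offline and online experiments verify the strong scaling ability of UniMixer.
Large language models (LLMs) exhibit a striking phenomenon: performance improves steadily as model size, data volume, and compute increase, which is known as scaling laws. The pronounced performance scaling of LLMs has inspired the recommender-system community to explore scaling frameworks suited to recommendation tasks. Recently, researchers have designed scaling modules and stacked them in multiple layers to increase the complexity of ranking models, thereby establishing scaling laws between model performance and model scale or compute cost (e.g., parameters, FLOPs).
Recommender systems predict user behavior from a large number of multi-field user and item features, show users the most relevant content, and increase users' positive engagements with the recommendations. These multi-field features typically include categorical features and dense features, have more dynamic embedding representations, and capture information from multiple perspectives. Unlike natural language processing (NLP), where all tokens share a unified embedding space, the feature space of recommendation tasks is inherently heterogeneous. Learning heterogeneous feature interactions is therefore the fundamental difference between recommendation and NLP.
Given the great success of Transformers in LLMs, a natural idea is to modify the Transformer block to fit recommendation tasks, because directly using the Transformer block as the fundamental block for recommender scaling laws is usually infeasible. To solve the heterogeneous feature interaction problem, the mainstream scaling architectures for recommendation models fall into three categories: attention-based methods, TokenMixer-based methods, and factorization-machine-based methods.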
- Attention-based methods build token-specific query, key, and value projections for each input token. Examples include HiFormer ("Hiformer: Heterogeneous feature interactions learning with transformers for recommender systems"), FAT ("From scaling to structured expressivity: Rethinking transformers for ctr prediction"), and HHFT ("Hhft: Hierarchical heterogeneous feature transformer for recommendation systems").
- Unlike attention-based methods, TokenMixer-based methods realize heterogeneous feature interactions through rule-based token mixing operations, which avoids computing inner-product similarity between two heterogeneous semantic spaces. Examples include RankMixer ("Rankmixer: Scaling up ranking models in industrial recommenders") and TokenMixer-Large ("Tokenmixer-large: Scaling up large ranking models in industrial recommenders").
- Factorization-machine-based methods introduce a Factorization Machine (FM) module to model the feature interactions among the input embeddings of each layer. Examples include Wukong ("Wukong: Towards a scaling law for large-scale recommendation") and Kunlun ("Kunlun: Establishing scaling laws for massive-scale recommendation systems through unified architecture design").
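These frameworks are built on entirely different scaling blocks, yet all of them can scale up model performance. This raises a fundamental question: can we build a unified scaling module for recommender systems that combines the strengths of the existing mainstream scaling components?
To connect these scaling modules, we first establish a parameterized formulation of the rule-based TokenMixer operation. By further optimizing the computation pipeline, we derive a UniMixing module with lower compute cost. Based on this design and the experimental results, we propose a unified theoretical framework that integrates the mainstream scaling modules for recommender systems. In addition, we design a lightweight UniMixing module that combines the strengths of the existing mainstream scaling blocks and achieves the best parameter efficiency and compute efficiency. We hope this unified architecture helps the recommender-system field reach its own "attention moment".
The main contributions of this paper are:
- An equivalent parameterization of the rule-based TokenMixer that reveals its feature interaction pattern.
- A unified scaling framework, UniMixer, that clarifies the differences and connections among attention-based, TokenMixer-based, and FM-based methods. By optimizing the computation pipeline, UniMixer significantly reduces the computational complexity and GPU memory consumption of training and inference.
- A lightweight UniMixing module, UniMixing-Lite, that further reduces model parameters and compute cost. It exploits the strengths of both attention-based and TokenMixer-based architectures and achieves better scaling efficiency.
- Extensive offline and online experiments showing that UniMixer scales well.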

1.1 Related Work

There are currently three main modeling paradigms for establishing scaling laws in large-scale recommender systems: attention-based methods, TokenMixer-based methods, and FM-based methods.
Attention-Based Framework: In recent years, the recommender-system community has adapted Transformers to CTR prediction. The core challenge of this paradigm is to bridge the gap between the heterogeneity of the token sequence and the sequential compositionality assumed in language modeling.
- To this end, "Hiformer: Heterogeneous feature interactions learning with transformers for recommender systems" proposes a heterogeneous attention layer to address the heterogeneous feature interaction problem, and designs HiFormer, which flattens heterogeneous tokens into a single vector representation to model high-order interactions explicitly.
- Field-Aware Transformers (FAT) inject prior knowledge of field-aware interaction into the attention mechanism through factorized contextual alignment and cross-field modulation, and further establish an empirical scaling law for CTR prediction ("From scaling to structured expressivity: Rethinking transformers for ctr prediction").
- HHFT verifies these scaling properties by alternating heterogeneous Transformer blocks (which preserve domain-specific semantics) with HiFormer blocks (which learn high-order interactions) ("Hhft: Hierarchical heterogeneous feature transformer for recommendation systems").
- In dynamic user behavior modeling, methods such as HSTU V1/V2, MARM, OneTrans, Climber, Hyformer, and LLaTTE use attention to capture long-range temporal dependencies.
These methods highlight the potential of unifying feature interaction and sequential behavior modeling to obtain more robust scaling laws.
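TokenMixer-Based Framework: Although attention is highly expressive for feature interaction, the quadratic complexity of attention score computation is expensive. Inspired by the success of MLP-Mixer in computer vision ("Mlp-mixer: An all-mlp architecture for vision"), industrial recommender systems have shifted toward token-mixing architectures, producing advanced models such as RankMixer ("Rankmixer: Scaling up ranking models in industrial recommenders"), Lemur ("Lemur: Large scale end-to-end multimodal recommendation"), and TokenMixer-Large ("Tokenmixer-large: Scaling up large ranking models in industrial recommenders").
- For example, RankMixer replaces dynamic attention with a static, parameter-free token-mixing operation and achieves competitive CTR prediction performance at comparable FLOPs.
- Building on this, TokenMixer-Large introduces auxiliary residual connections and customized loss functions, extends the architecture to 13 billion parameters, and shows good scaling laws across various model dimensions.
Nevertheless, the design of current token-mixing operators still relies heavily on empirical rules and lacks a rigorous theoretical bridge to traditional FM-based or attention-based methods.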
FM-Based Framework: The pioneering FM-based method uses low-order pairwise modeling to realize feature interactions in recommender systems ("Factorization machines"). It was later generalized by Field-aware FMs ("Field-aware factorization machines for ctr prediction"), which capture field-specific and context-sensitive interactions. These models are highly interpretable and efficient, but are inherently limited to low-order interactions.
To address this limitation, neural extensions such as DeepFM ("Deepfm: a factorization-machine based neural network for ctr prediction"), AutoInt ("Autoint: Automatic feature interaction learning via self-attentive neural networks"), and DCN ("Deep & cross network for ad click predictions", "Dcn v2: Improved deep & cross network and practical lessons for web-scale ctr prediction") incorporate MLPs or transformer attention to capture high-order interactions.
More recently, Wukong ("Wukong: Towards a scaling law for large-scale recommendation") shows good scaling properties by stacking FM-style interaction blocks with linear compression.
However, the reliance of FM-based methods on explicit low-order interactions still limits the performance gains as parameters and FLOPs grow, in contrast to the predictive scaling laws observed in LLMs.

1.2 Preliminaries

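Consider a class of discriminative recommendation tasks, such as rating prediction, click-through rate (CTR) prediction, and post-click conversion rate (CVR) prediction. The dataset is $\mathcal D=\left\{\left(\mathbf {\vec x}_{1}, y_{1}\right), \cdots,\left(\mathbf {\vec x}_{i}, y_{i}\right), \cdots,\left(\mathbf {\vec x}_{N}, y_{N}\right)\right\}$, where:
- $\mathbf{\vec x}_{i}=\left[x_{i}^{(1)}, x_{i}^{(2)}, \cdots, x_{i}^{(F)}\right] \in \mathbb R^F$ is the feature vector of the $i$-th sample, consisting of $F$ feature fields. The features can be written as $\mathbf{\vec x} = \left\{\mathbf{\vec x}^{C}, \mathbf{\vec x}^{D}\right\}$, i.e., categorical features $\mathbf{\vec x}^C$ and dense features $\mathbf{\vec x}^{D}$, where $|C|$ and $|D|$ denote the number of categorical features and the number of dense features, respectively.
- $y_i$ is the label of the $i$-th sample: $y_{i} \in \{0,1\}$ corresponds to a classification problem, and $y_{i} \in \mathbb{R}$ corresponds to a regression problem.
- $N$ is the number of samples.
For CTR and CVR prediction, the core goal is to build a model that predicts the probability of a click or conversion, $\text{Pr}\left(y_{i}=1 \mid \mathbf{\vec x}_{i}\right)$. The embedding representations learned in recommender systems are more dynamic. Unlike the input tokens of language models, the feature spaces in recommender systems are inherently heterogeneous. Directly transferring the Transformer architecture used by large language models to recommendation modeling is therefore not appropriate.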
So far, scaling laws in recommendation have mainly been realized through three kinds of foundational blocks and their variants.
Heterogeneous Attention Layer: Heterogeneous-attention-based architectures typically use field-specific query, key, and value projections to realize heterogeneous feature interaction. Given input hidden states $\mathbf X=\left[\mathbf{\vec x}_{1} ; \cdots ; \mathbf{\vec x}_{T}\right] \in \mathbb{R}^{T \times D}$, the heterogeneous attention layer is formulated as:
$\begin{matrix} Q_{h} = [\begin{matrix} W_{Q}^{1 h} {\vec{x}}_{1} \\ ⋮ \\ W_{Q}^{T h} {\vec{x}}_{T} \end{matrix}] \in R^{T \times d}, K_{h} = [\begin{matrix} W_{K}^{1 h} {\vec{x}}_{1} \\ ⋮ \\ W_{K}^{T h} {\vec{x}}_{T} \end{matrix}] \in R^{T \times d}, V_{h} = [\begin{matrix} W_{V}^{1 h} {\vec{x}}_{1} \\ ⋮ \\ W_{V}^{T h} {\vec{x}}_{T} \end{matrix}] \in R^{T \times d} \end{matrix}$
where $\mathbf W_{Q}^{i h},\mathbf W_{K}^{i h}, \mathbf W_{V}^{i h} \in \mathbb{R}^{D\times d}$ are the token-specific weights of the query, key, and value projections, respectively.
$T$ is the number of heterogeneous tokens, and $h$ is the head index.
The output of the multi-head heterogeneous attention layer is computed as:
$O_{h} = softmax (\frac{Q_{h} K_{h}^{⊤}}{\sqrt{d}}) V_{h} \in R^{T \times d}$
The outputs of the multi-head heterogeneous attention are concatenated and then passed through a linear projection so that the output dimension aligns with the input $\mathbf X$.
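The single-head computation above can be sketched in NumPy as follows. This is only an illustrative sketch: the function name `heterogeneous_attention_head`, the random weights, and the toy sizes are assumptions made here, and the projection is applied in row-vector form ($\mathbf{\vec x}_t \mathbf W^{t h}$) so that it matches the stated weight shape $D\times d$.

```python
import numpy as np

def heterogeneous_attention_head(X, W_Q, W_K, W_V):
    """One head of heterogeneous attention (illustrative sketch).

    X: (T, D) input hidden states.
    W_Q, W_K, W_V: (T, D, d) token-specific projection weights.
    """
    # Row t of Q/K/V is produced by the projection that belongs to token t only.
    Q = np.einsum("td,tde->te", X, W_Q)
    K = np.einsum("td,tde->te", X, W_K)
    V = np.einsum("td,tde->te", X, W_V)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V, A

rng = np.random.default_rng(0)
T, D, d = 4, 8, 2  # toy sizes, chosen only for the example
X = rng.normal(size=(T, D))
W_Q, W_K, W_V = (rng.normal(size=(T, D, d)) for _ in range(3))
O_h, A_h = heterogeneous_attention_head(X, W_Q, W_K, W_V)
```

The only difference from standard self-attention in this sketch is the extra leading axis of size $T$ on the weights, which gives each token its own projection.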
TokenMixer: The TokenMixer-based framework realizes feature interaction through a parameter-free, rule-based mixing operation. Given input hidden states $\mathbf X=\left[\mathbf{\vec x}_{1} ; \cdots ; \mathbf{\vec x}_{T}\right] \in \mathbb{R}^{T \times D}$, TokenMixer first splits each input token $\mathbf{\vec x}_t$ into $H$ heads:
$[{\vec{x}}_{t}^{(1)} ∣ {\vec{x}}_{t}^{(2)} ∣ \dots ∣ {\vec{x}}_{t}^{(H)}] = SplitHead ({\vec{x}}_{t})$
where $\mathbf{\vec x}_t^{(h)}\in \mathbb R^{D/H}$ is the $h$-th head of token $\mathbf{\vec x}_t$.
The $h$-th mixed token $\mathbf{\vec s}_h$ is then:
${\vec{s}}_{h} = concat ({\vec{x}}_{1}^{(h)}, {\vec{x}}_{2}^{(h)}, \dots, {\vec{x}}_{T}^{(h)}) \in R^{(T D / H)}$
The output of TokenMixer can be formulated as:
$\begin{matrix} S = [\begin{matrix} {\vec{s}}_{1} \\ ⋮ \\ {\vec{s}}_{H} \end{matrix}] \in R^{H \times (T D / H)} \end{matrix}$
When $H = T$, $\mathbf X$ and $\mathbf S$ have the same dimensions.
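As a small illustration of the split-and-regroup rule above, the following NumPy sketch implements TokenMixer with a reshape and a transpose. The function name `token_mixer` is chosen here for illustration; the toy input is the $2 \times 6$ example used in Appendix A with $x_i = i$.

```python
import numpy as np

def token_mixer(X: np.ndarray, H: int) -> np.ndarray:
    """Rule-based TokenMixer: split every token into H heads, regroup by head."""
    T, D = X.shape
    assert D % H == 0, "D must be divisible by the number of heads H"
    heads = X.reshape(T, H, D // H)   # heads[t, h] is the h-th head of token t
    S = heads.transpose(1, 0, 2)      # S[h] collects head h of tokens 1..T
    return S.reshape(H, T * D // H)

X = np.arange(1, 13).reshape(2, 6)    # x_1 .. x_12 laid out as in Appendix A
S = token_mixer(X, H=2)
# S == [[1, 2, 3, 7, 8, 9], [4, 5, 6, 10, 11, 12]]
```

No parameters are involved: the mixing pattern is fixed entirely by $T$, $D$, and $H$.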
Wukong: Wukong-based models concatenate the output of a Factorization Machine Block (FMB) with the output of a linear projection layer (LCB in the equation below) to enrich the interaction component:
$\begin{matrix} FMB (X) = reshape (MLP (LN (flatten (FM (X))))), FM (X) = X X^{⊤} Y \\ LCB (X) = W X \end{matrix}$
where:
- $\mathbf W\in \mathbb R^{n\times T}, \mathbf Y\in \mathbb R^{T\times r}$; $\mathbf Y$ is a projection matrix that reduces the memory requirement of the interaction matrix $\mathbf X \mathbf X^{\top}$.
- $\text{LN}(\cdot)$ is Layer Normalization.
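To make the role of $\mathbf Y$ concrete, the sketch below evaluates $\text{FM}(\mathbf X) = \mathbf X \mathbf X^{\top} \mathbf Y$ in two orders. It only illustrates matrix associativity under toy, randomly generated inputs; it is not taken from the Wukong implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, r = 6, 16, 3               # toy sizes, chosen only for the example
X = rng.normal(size=(T, D))
Y = rng.normal(size=(T, r))

fm_naive = (X @ X.T) @ Y         # materializes the T x T interaction matrix
fm_cheap = X @ (X.T @ Y)         # only needs a D x r intermediate

same = np.allclose(fm_naive, fm_cheap)
```

Both orders give the same $T \times r$ result; the second one never stores $\mathbf X \mathbf X^{\top}$, which matters when the number of tokens $T$ is large.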
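This paper focuses on establishing a unified structural foundation for recommender systems that combines the strengths of current scaling blocks and further improves scaling ROI.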

1.3 UniMixer

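This paper builds a unified module for recommender-system scaling, the UniMixer block, which integrates the mainstream scaling modules (attention-based, TokenMixer-based, and Wukong-based) under one theoretical framework. As shown in Figure 2, the architecture consists of feature tokenization followed by $M$ UniMixer blocks with Siamese norm and Sparse-Pertoken MoE. By parameterizing the rule-based TokenMixer, we connect attention-based, TokenMixer-based, and Wukong-based methods, so that the proposed UniMixer inherits the strengths of all three. In addition, we develop a lightweight UniMixing module that further compresses model parameters and compute cost while significantly improving model performance.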

1.3.1 Feature Tokenization

According to the semantic categories of the input feature fields, the input features $\mathbf{\vec x}$ are partitioned into $N$ disjoint feature domains:
$\vec{x} = [\underset{User Profile}{\underset{⏟}{x_{U}^{(1)}, \dots, x_{U}^{(n_{U})}}}, \underset{Item Features}{\underset{⏟}{x_{I}^{(1)}, \dots, x_{I}^{(n_{I})}}}, \underset{Behavior Sequence}{\underset{⏟}{x_{B}^{(1)}, \dots, x_{B}^{(n_{B})}}}, \underset{Query Features}{\underset{⏟}{x_{Q}^{(1)}, \dots, x_{Q}^{(n_{Q})}}}, \dots]$
Each feature domain is converted by embedding layers into embedding vectors of different dimensions:
${\vec{e}}_{n} = Embedding ({\vec{x}}_{n}) \in R^{d_{domain}}$
where $\mathbf{\vec x}_n$ is the one-hot embedding of a feature within the feature domain, and $d_\text{domain}$ is the embedding dimension of that feature domain.
All obtained feature domain embeddings are concatenated into a single embedding $\mathbf{\vec e}=\left[\mathbf{\vec e}_{1}, \mathbf{\vec e}_{2}, \cdots, \mathbf{\vec e}_{N}\right]$. As in RankMixer, the embedding $\mathbf{\vec e}$ is evenly split into a suitable number of blocks, and each block is projected into a token embedding by a token-specific linear layer:
${\vec{x}}_{i} = W_{i}^{proj} {\vec{e}}_{(d \times i) : (d \times i + d)} + {\vec{b}}_{i}^{proj} \in R^{D}$
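where $\mathbf W_i^\text{proj}\in \mathbb R^{D\times d}, \mathbf{\vec b}_i^\text{proj}\in \mathbb R^D$ are the projection weight and bias of the $i$-th block, and $d$ is the dimension of each block.
The token embeddings $\mathbf{\vec x}_i$ are stacked to form the input hidden states $\mathbf X \in \mathbb{R}^{T \times D}$.
- Note: what would happen if the features were not organized by feature domains when forming $\mathbf{\vec e}$, but arranged randomly? This could be verified experimentally.
- Note: the step "evenly split the embedding $\mathbf{\vec e}$ into a suitable number of blocks" is essentially a sparsify operation; see SSRNet ("Beyond Dense Connectivity: Explicit Sparsity for Scalable Recommendation").
- Note: typically $T\ll D$, i.e., the number of tokens $T$ is much smaller than the token dimension $D$. This is also why the embedding is split into blocks instead of using the original feature-level embeddings.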
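The block-wise projection in this subsection can be sketched as follows. The helper name `tokenize`, the toy sizes, and the random parameters are assumptions made for illustration only.

```python
import numpy as np

def tokenize(e, W_proj, b_proj):
    """Split e into T blocks of size d and project block i with its own linear layer.

    e: (T*d,) concatenated domain embeddings.
    W_proj: (T, D, d) token-specific weights; b_proj: (T, D) token-specific biases.
    Returns X: (T, D) input hidden states.
    """
    T, D, d = W_proj.shape
    blocks = e.reshape(T, d)                      # block i is e[d*i : d*i + d]
    return np.einsum("tkd,td->tk", W_proj, blocks) + b_proj

rng = np.random.default_rng(0)
T, D, d = 4, 8, 5                                 # toy sizes for the example
e = rng.normal(size=T * d)
W_proj = rng.normal(size=(T, D, d))
b_proj = rng.normal(size=(T, D))
X = tokenize(e, W_proj, b_proj)
```

Each row of `X` depends only on its own block of `e`, which is what makes the projection token-specific.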

1.3.2 UniMixer Block

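Heterogeneous Feature Interactions: As described in the Preliminaries, heterogeneous attention addresses feature interaction between two heterogeneous semantic spaces by using token-specific query, key, and value weights. However, regarding the attention pattern obtained from inner-product similarity and its diagonally dominant prior, the attention weights $\mathbf Q_h\mathbf K_h^\top$ produced with $\mathbf W_{Q}^{h},\mathbf W_{K}^{h}$ tend to be dominated by the input token values $\mathbf X$, which easily concentrates the attention weights on a few tokens, as shown in Figure 3(a).
Figure 3(a) shows that the attention weights of heterogeneous attention are sharp and sparse. This puts gradient backpropagation at risk and makes the query and key weights hard to train, or even causes training to stall (see rows 10 and 15 of the heterogeneous attention weights in Figure 3(a)). Moreover, with large-scale heterogeneous feature inputs, this kind of attention pattern may make feature interactions converge to one another: the attention scores become very small and indistinct, which may produce noisy signals and mask the key feature interaction patterns.
On the other hand, the parameter-free, rule-based TokenMixer realizes heterogeneous feature interactions with a fixed pattern, and the constraint $T=H$ further limits the choice of heterogeneous feature interaction patterns. A closer analysis of the TokenMixer operation yields some interesting findings that make it possible to parameterize TokenMixer. As shown in Figure 3(b), TokenMixer is equivalent to the product of a permutation matrix $\mathbf W^\text{perm}$ and the flattened input embedding $\text{flatten}(\mathbf X) \in \mathbb{R}^{(TD)}$: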
$TokenMixer (X) = reshape (W^{perm} flatten (X))$
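where $\mathbf W^\text{perm}\in \mathbb R^{(TD)\times (TD)}$ is a large permutation matrix; Appendix A gives a concrete numerical example.
Using $\mathbf W^\text{perm}$ directly to parameterize the rule-based TokenMixer would require $O\left(T^{2} D^{2}\right)$ parameters and $O\left(T^2 D^2\right)$ computation. To parameterize TokenMixer efficiently, we examine the key properties of $\mathbf W^\text{perm}$:
- Kronecker structure: $\mathbf W^\text{perm}$ can be factorized as a Kronecker product $\mathbf W^\text{perm}= \mathbf G\otimes\mathbf I$, where $\mathbf I\in \mathbb R^{\frac{D}{T}\times \frac{D}{T}}$ is the identity matrix of size $\frac{D}{T}$, $\mathbf G\in \mathbb R^{T^2\times T^2}$ is a permutation matrix of size $T^2$, and $\otimes$ denotes the Kronecker product.
  Note: $\mathbf A\otimes \mathbf B$ multiplies each element of $\mathbf A$ by the whole matrix $\mathbf B$ and tiles the results by position into a new, larger matrix. For example: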
  $\begin{matrix} A = [\begin{matrix} a_{1, 1} & a_{1, 2} \\ a_{2, 1} & a_{2, 2} \end{matrix}], B = [\begin{matrix} b_{1, 1} & b_{1, 2} \\ b_{2, 1} & b_{2, 2} \end{matrix}], A \otimes B = [\begin{matrix} a_{1, 1} B & a_{1, 2} B \\ a_{2, 1} B & a_{2, 2} B \end{matrix}] \end{matrix}$
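- Double stochasticity: every row and every column of $\mathbf W^\text{perm}$ sums to 1.
- Sparsity: every row and every column of the permutation matrix has exactly one nonzero element.
- Symmetry: when $T = H$, $\mathbf W^\text{perm}$ is symmetric, i.e., $\mathbf W^\text{perm}= \left(\mathbf W^\text{perm}\right)^\top$; otherwise $\mathbf W^\text{perm}$ is asymmetric.
Here we assume $D \ge T$ and that $D$ is divisible by $T$, where $T$ is the number of tokens in the input hidden states and $D$ is the dimension of the input hidden states.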
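The properties listed above can be checked numerically on the Appendix A example ($T = H = 2$, $D = 6$). The sketch below builds $\mathbf W^\text{perm}$ from the TokenMixer index permutation; the variable names are chosen here for illustration.

```python
import numpy as np

T, D = 2, 6
L = T * D

# TokenMixer with H = T, written as an index permutation of flatten(X).
idx = np.arange(L).reshape(T, T, D // T).transpose(1, 0, 2).reshape(-1)
W_perm = np.eye(L)[idx]                      # row k selects input element idx[k]

# Kronecker factors: a T^2 x T^2 permutation G and a (D/T) x (D/T) identity.
g_idx = np.arange(T * T).reshape(T, T).T.reshape(-1)
G = np.eye(T * T)[g_idx]
I_small = np.eye(D // T)

is_kron = np.array_equal(W_perm, np.kron(G, I_small))
is_doubly_stochastic = bool((W_perm.sum(axis=0) == 1).all() and (W_perm.sum(axis=1) == 1).all())
is_symmetric = np.array_equal(W_perm, W_perm.T)
mixed = W_perm @ np.arange(1, L + 1)         # same ordering as Appendix A
```

For this toy case `G` is the $4 \times 4$ matrix that swaps the two middle entries, which matches the global mixing matrix shown in Appendix A.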
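Based on these properties, TokenMixer can be parameterized with two small matrices $\mathbf G$ and $\mathbf I$, which reduces the parameter count of token mixing to $O\left(T^{4}+\left(\frac{D}{T}\right)^{2}\right)$, a cost determined by $T$ and $D$. Parameterizing TokenMixer nevertheless faces three challenges:
- Reconstructing $\mathbf W^\text{perm}$ from $\mathbf G$ and $\mathbf I$ materializes an intermediate variable of size $[TD, TD]$, which places very high demands on GPU memory.
  Note: the intermediate variable is $\mathbf W^\text{perm} = \mathbf G\otimes \mathbf I$ itself.
- How to ensure that the learned parameters satisfy double stochasticity, sparsity, and symmetry.
- How to design a unified recommendation scaling module that combines the strengths of the existing scaling modules and achieves better scaling efficiency for recommender systems.
Unified Token Mixing Module: Inspired by Figure 3, we propose a unified token mixing module. Instead of tying the design to $T$ and $D$, we define the block num and block size of the permutation matrix: with block size $B$, the block num is $(L // B)^{2}$, where $L$ is the length of the input embedding $\mathbf{\vec e}=\left[\mathbf{\vec e}_{1}, \mathbf{\vec e}_{2}, \cdots, \mathbf{\vec e}_{N}\right]$ and must be divisible by the block size $B$.
Note: $L = T\times D$.
We replace $\mathbf G$ with parameterized weights $\mathbf W_{G}\in \mathbb R^{(L//B)\times (L//B)}$. To model heterogeneous feature interactions, each column of $\mathbf W_G$ is assigned a distinct parameterized weight $\mathbf W_{B}^{i}\in \mathbb R^{B\times B}$ that is shared across the $L//B$ rows. This gives each block its own feature interaction pattern. $\mathbf W_{G}$ and $\mathbf W_{B}^{i}$ are then used to reconstruct $\mathbf W^\text{perm }$, as follows: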
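$\text{UniMixing}(\mathbf X) = \text{reshape}\left(\left(\mathbf W_{G} \otimes \left\{\mathbf W_{B}^{i}\right\}_{i=1}^{L//B}\right) \text{flatten}(\mathbf X), 1, L\right)$
where $\otimes$ denotes the generalized Kronecker product.
Note: $\mathbf W_G\otimes\left\{\mathbf W_B^i\right\}_{i=1}^{L//B}$ differs from the classical Kronecker product. Concretely, it means:
$\mathbf W_{G} \otimes \left\{\mathbf W_{B}^{i}\right\}_{i=1}^{L//B} = \begin{bmatrix} W_{G,1,1} \mathbf W_{B}^{1} & W_{G,1,2} \mathbf W_{B}^{2} & \cdots & W_{G,1,L//B} \mathbf W_{B}^{L//B} \\ W_{G,2,1} \mathbf W_{B}^{1} & W_{G,2,2} \mathbf W_{B}^{2} & \cdots & W_{G,2,L//B} \mathbf W_{B}^{L//B} \\ \vdots & \vdots & \ddots & \vdots \\ W_{G,L//B,1} \mathbf W_{B}^{1} & W_{G,L//B,2} \mathbf W_{B}^{2} & \cdots & W_{G,L//B,L//B} \mathbf W_{B}^{L//B} \end{bmatrix}$
That is, all blocks in the $i$-th column share the same $\mathbf W_B^i$.
Note: alternatively, every entry of $\mathbf W_G$ could be given its own $\mathbf W_B^{i,j}$, but that would cause a parameter explosion and overfitting.
As derived in Appendix B, we optimize the computation pipeline of $\text{UniMixing}(\mathbf X)$, which significantly reduces compute cost and GPU memory requirements: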
- First, split the embedding vector $\text{flatten}(\mathbf X)$ into $L // B$ block-wise vectors of size $B$:
  $[{\vec{x}}_{1} ∣ {\vec{x}}_{2} ∣ \dots ∣ {\vec{x}}_{\frac{L}{B}}] = Split (flatten (X), \frac{L}{B}) \in R^{B \times (L / / B)}$
- Next, multiply the block weights $\mathbf W_{B}^{i}\in \mathbb R^{B\times B}$ with the corresponding block-wise vectors $\mathbf{\vec x}_{i}\in \mathbb R^{B}$ to obtain the local feature interaction vector:
  $\begin{matrix} H = [{\vec{x}}_{1} W_{B}^{1} ∣ {\vec{x}}_{2} W_{B}^{2} ∣ \dots ∣ {\vec{x}}_{L / / B} W_{B}^{L / / B}] \in R^{B \times (L / / B)} \\ reshape (H, L / / B, B) = [\begin{matrix} {\vec{x}}_{1} W_{B}^{1} \\ {\vec{x}}_{2} W_{B}^{2} \\ ⋮ \\ {\vec{x}}_{L / / B} W_{B}^{L / / B} \end{matrix}] \in R^{(L / / B) \times B} \end{matrix}$
- Finally, the output of the UniMixing module is:
  $UniMixing (X) = reshape (W_{G} reshape (H, L / / B, B), 1, L)$
  Note: $\text{reshape}(\mathbf H, L//B, B)\in \mathbb R^{(L//B)\times B}$ and $\mathbf W_G\in \mathbb R^{(L//B)\times (L//B)}$, so the two can be multiplied as matrices.
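The equivalence between the dense generalized Kronecker form and the three-step pipeline above can be checked numerically. The sketch below uses random toy weights and follows the transpose convention of Appendix B (each block is mapped as $\mathbf W_B^{j} \mathbf{\vec x}_j^\top$, i.e., $\mathbf{\vec x}_j (\mathbf W_B^{j})^\top$ in row form); the names `dense_out` and `fast_out` are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
L, B = 12, 3                       # toy sizes; L must be divisible by B
n = L // B                         # number of blocks
W_G = rng.normal(size=(n, n))      # global mixing weights
W_B = rng.normal(size=(n, B, B))   # one local mixing weight per block
x = rng.normal(size=L)             # flatten(X)

# Dense form: materialize the L x L matrix whose block (a, j) is W_G[a, j] * W_B[j].
M = np.block([[W_G[a, j] * W_B[j] for j in range(n)] for a in range(n)])
dense_out = M @ x

# Optimized pipeline: split -> local block-wise mixing -> global mixing.
blocks = x.reshape(n, B)
H = np.einsum("jk,jik->ji", blocks, W_B)   # H[j] = W_B[j] @ blocks[j]
fast_out = (W_G @ H).reshape(L)

match = np.allclose(dense_out, fast_out)
```

The dense form stores $L^2$ entries, while the pipeline only needs $\mathbf W_G$ (size $(L//B)^2$) and the $L//B$ local matrices (size $B^2$ each).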
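Because the reconstructed matrix $\mathbf W^\text{perm }$ is never materialized, the computational complexity drops from $O\left(L^{2}\right)$ to $O\left(L^{2} / B+L B\right)$. The block size $B$ controls the interaction scale: $\mathbf W_{B}^{i}$ mixes features within a block, and $\mathbf W_{G}$ mixes features across blocks. Since the module only requires the length $L$ of the embedding inputs to be divisible by $B$, it removes the constraint $T=H$. Compared with the TokenMixer operation, the UniMixing module offers more diverse local and global feature mixing patterns and interaction scales, with the added advantage that it is learnable and optimizable.
To ensure that the learned permutation matrix is doubly stochastic, we apply Sinkhorn-Knopp iteration: an exponent operator first makes all elements of $\mathbf W_{G}$ and $\mathbf W_{B}^{i}$ positive, and the rows and columns are then alternately rescaled so that they sum to 1. In addition, $(\mathbf W_{G}+\mathbf W_{G}^{\top}) / 2$ and $(\mathbf W_{B}^{i}+\mathbf W_{B}^{i \top}) / 2$ impose the symmetry constraint on the parameter matrices. The final constrained weights are: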
$\begin{matrix} {\tilde{W}}_{G} = \frac{W_{G} + W_{G}^{⊤}}{2}, {\tilde{W}}_{B}^{i} = \frac{W_{B}^{i} + W_{B}^{i ⊤}}{2} \\ {\overset{―}{W}}_{G} = Sinkhorn-Knopp (\frac{{\tilde{W}}_{G}}{τ}), {\overset{―}{W}}_{B}^{i} = Sinkhorn-Knopp (\frac{{\tilde{W}}_{B}^{i}}{τ}) \end{matrix}$
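Note: when $\tau$ is large (e.g., 1.0), the division leaves the original differences between elements largely unchanged, so the resulting doubly stochastic matrix stays comparatively smooth. When $\tau$ is small (e.g., 0.05), the division amplifies the original differences between elements; after exponentiation, large elements become larger and small elements become smaller, so the final doubly stochastic matrix becomes sharp: a few elements are close to 1 and the rest are close to 0. In the limit ($\tau\rightarrow 0$), the matrix approaches a hard permutation matrix (exactly one 1 in each row and each column, 0 elsewhere).
To obtain a sparse mixing matrix, $\tau$ therefore needs to take a small value.
Here $\tau$ is the temperature coefficient.
Sinkhorn-Knopp iteration (also called the Sinkhorn scaling algorithm) is a classical numerical method for turning any positive matrix into a doubly stochastic matrix. A doubly stochastic matrix is a square matrix whose elements are all non-negative (usually positive), whose rows each sum to 1, and whose columns each sum to 1.
Given a matrix $\mathbf A\in \mathbb R^{n\times n}, a_{i,j}\gt 0$, Sinkhorn-Knopp iteration alternately rescales rows and columns so that the matrix approaches a doubly stochastic matrix:
- Repeat until convergence (or for a fixed number of iterations):
  - Row normalization: divide each row by its row sum, so that every row sums to 1.
  - Column normalization: divide each column by its column sum, so that every column sums to 1.
The iteration converges to a doubly stochastic matrix obtained by rescaling the rows and columns of $\mathbf A$ (when $\mathbf A$ is positive and connected). In practice, an exponent operator (exp) is usually applied to the raw matrix elements first to guarantee positivity, and the scaling iteration above is applied afterwards.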
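The procedure described above (symmetrize, divide by the temperature, exponentiate, then alternate row and column normalization) can be sketched as follows. This is a minimal sketch, not the paper's implementation: the function name `sinkhorn_knopp`, the fixed iteration count, and the max-subtraction used for numerical stability are assumptions made here.

```python
import numpy as np

def sinkhorn_knopp(W, tau=1.0, n_iters=50):
    """Map a real square matrix to an (approximately) doubly stochastic one."""
    # exp(.) makes every entry positive. Subtracting the max only rescales the
    # whole matrix by a constant, which the first normalization removes.
    A = np.exp((W - W.max()) / tau)
    for _ in range(n_iters):
        A = A / A.sum(axis=1, keepdims=True)   # row normalization
        A = A / A.sum(axis=0, keepdims=True)   # column normalization
    return A

rng = np.random.default_rng(0)
W = rng.uniform(0.0, 1.0, size=(6, 6))
W_sym = (W + W.T) / 2                          # symmetry constraint
P = sinkhorn_knopp(W_sym, tau=1.0)
row_sums, col_sums = P.sum(axis=1), P.sum(axis=0)
```

Lowering `tau` (for example to 0.05) sharpens the input to the exponent and pushes the result toward a permutation-like matrix, at the price of slower convergence of the iteration.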
The output of the UniMixing block is then processed with a residual connection and a normalization module:
$O = RMSNorm (X + UniMixing (X))$
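A Unified Perspective of Heterogeneous Feature Interaction: Compare the value matrix $\mathbf V_{h}$ of heterogeneous attention with $\text{reshape}(\mathbf H, L//B, B)$ in UniMixing. When the number of blocks $L // B$ equals the number of tokens $T$, and $\mathbf W_{V}^{i h}$ plays the role of $\mathbf W_{B}^{i}$, we have $\text{reshape}(\mathbf H, L//B, B)=\mathbf V_{h}$. In other words, under the condition $\mathbf W_{V}^{i}= \mathbf W_{B}^{i}$, the local interaction projection of UniMixer is equivalent to the value projection of the heterogeneous attention layer, and $\mathbf W_{G}$ plays the role of the attention weights, except that $\mathbf W_G$ must satisfy double stochasticity, sparsity, and symmetry.
The feature interaction of Wukong is built on FM, and the expression of $\text{FMB}(\mathbf X)$ can be rewritten as: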
$FMB (X) = reshape (MLP (LN (flatten ((X I) {(X I)}^{⊤} Y))))$
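where $\mathbf I$ is the identity matrix. The feature interaction $(\mathbf X \mathbf I)(\mathbf X \mathbf I)^{\top} \mathbf Y$ can be viewed as a special case of attention: when $\mathbf W_{Q}=\mathbf I$ and $\mathbf W_{K}=\mathbf I$, and the value is independent of the hidden state input $\mathbf X$ (i.e., $\mathbf V_{h}=\mathbf W_{V}=\mathbf Y$), the attention mechanism degenerates into the FM module. Therefore, attention-based, TokenMixer-based, and Wukong-based architectures can be unified within a single theoretical framework: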
$\begin{matrix} UniMixing (X) = reshape (\underset{Global Mixing Pattern}{\underset{⏟}{G (X, W_{G})}} \underset{Local Mixing Pattern}{\underset{⏟}{[\begin{matrix} {\vec{x}}_{1} W_{B}^{1} \\ ⋮ \\ {\vec{x}}_{L / / B} W_{B}^{L / / B} \end{matrix}]}}, 1, L) \end{matrix}$
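where $G(\mathbf X, \mathbf W_G)$ is the heterogeneous feature interaction projection, which measures the token-to-token/block-to-block interaction strength.
Under the single-head attention setting (written in the form of $\text{UniMixing}(\mathbf X)$), the differences among the methods are summarized in Table 1. For self-attention, heterogeneous attention, and FM, the global mixing pattern $G(\mathbf X, \mathbf W_{G})$ is obtained by computing the inner-product similarity between two tokens, whereas the global mixing pattern of TokenMixer is independent of the input token embeddings.
UniMixing-Lite: As shown in Figure 3, the smaller the block size $B$, the more local interaction parameter matrices $\mathbf W_{B}^{i}$ there are ($L//B$ of them) and the larger the global interaction parameter matrix $\mathbf W_{G}$ becomes, which leads to redundant local interaction patterns. Meanwhile, a larger global interaction matrix is inefficient in terms of reducing the number of parameters. Building on the UniMixing block, we therefore design a lightweight UniMixing module, UniMixing-Lite, which further reduces module parameters and compute cost and improves the scaling efficiency of the model.
To address the redundancy of local interaction patterns, we introduce a basis-composed module: each block-specific local mixing weight $\mathbf W_{B}^{i}$ is composed from a set of shared basis matrices $\{\mathbf Z_{\ell}\}_{\ell=1}^{b}$, and these basis matrices are combined using block-specific weight vectors $\left\{\vec \omega^{i}\right\}_{i=1}^{L // B}$, where $b$ is the number of basis local mixing weights and $\vec \omega^{i}=\left[\omega_{1}^{i}, \cdots, \omega_{b}^{i}\right]$. In addition, the global interaction parameter $\mathbf W_{G}$ uses a low-rank approximation to further improve efficiency. The UniMixing-Lite module can then be expressed as: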
$\begin{matrix} UniMixing-Lite (X) = reshape (W_{r} reshape ([{\vec{x}}_{1} W_{B}^{* 1} ∣ \dots ∣ {\vec{x}}_{\frac{L}{B}} W_{B}^{* \frac{L}{B}}], \frac{L}{B}, B), 1, L) \\ O = RMSNorm (X + UniMixing-Lite (X)) \end{matrix}$
where:
- $\mathbf W_r = \text{Sinkhorn-Knopp}(\mathbf A_G\mathbf B_G)$, with $\mathbf A_G\in \mathbb R^{(L//B)\times r}, \mathbf B_G\in \mathbb R^{r\times (L//B)}$, and $r$ is the rank of the low-rank approximation of $\mathbf W_G$.
- $\mathbf W_B^{*i} = \text{Sinkhorn-Knopp}\left(\sum_{\ell=1}^b \omega_\ell^i \mathbf Z_\ell \right)$.
The UniMixing-Lite module retains both the low-parameter global interaction pattern of TokenMixer and the local interaction ability of attention for heterogeneous features, so it can exploit the strengths of attention-based and token-mixer-based methods at the same time.
Note: in essence, UniMixing-Lite simplifies both $\mathbf W_G$ and $\mathbf W_B^i$.
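A rough sketch of the two simplifications follows: local weights composed from shared bases, and a low-rank global matrix. The sizes $L = 768$, $B = 6$, and $r = 16$ follow the visualization setup described in Section 1.4.3; the number of bases `b = 4` and the random initial values are assumptions made only for this example, and the Sinkhorn-Knopp step is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
L, B, r = 768, 6, 16
b = 4                                   # hypothetical number of basis matrices
n = L // B                              # 128 blocks

Z = rng.normal(size=(b, B, B))          # shared basis matrices Z_1 .. Z_b
omega = rng.normal(size=(n, b))         # block-specific weight vectors
A_G = rng.normal(size=(n, r))
B_G = rng.normal(size=(r, n))

# Local weights: W_B_raw[i] = sum_l omega[i, l] * Z[l]  (before Sinkhorn-Knopp).
W_B_raw = np.einsum("il,lpq->ipq", omega, Z)
# Global weights: rank-r product A_G @ B_G  (before Sinkhorn-Knopp).
W_G_raw = A_G @ B_G

params_full = n * n + n * B * B         # dense W_G plus one B x B matrix per block
params_lite = 2 * n * r + b * B * B + n * b
```

With these sizes the mixing parameters shrink from `params_full` to `params_lite`; the savings grow as $B$ gets smaller, because both $n^2$ and the number of local matrices increase.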
Pertoken SwiGLU: After the UniMixing block, similar to TokenMixer-Large, we introduce a pertoken SwiGLU to model the feature heterogeneity across tokens. For each input token $\mathbf{\vec x}_{i}$, the SwiGLU is:
$pSwiGLU ({\vec{o}}_{i}) = W_{down}^{i} ((W_{up}^{i} {\vec{o}}_{i} + {\vec{b}}_{up}^{i}) ⊙ Swish (W_{gate}^{i} {\vec{o}}_{i} + {\vec{b}}_{gate}^{i})) + {\vec{b}}_{down}^{i}$
where:
- $\mathbf W_\text{up}^i, \mathbf W_\text{gate}^i\in \mathbb R^{B\times (nB)}$ and $\mathbf W_\text{down}^i\in \mathbb R^{(nB)\times B}$ are the weights, $\mathbf{\vec b}_\text{up}^i, \mathbf{\vec b}_\text{gate}^i\in \mathbb R^{(nB)}$ and $\mathbf{\vec b}_\text{down}^i\in \mathbb R^B$ are the biases, and $n$ is a hyperparameter.
- $\mathbf{\vec o}_i$ is the UniMixing output of the $i$-th token.
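The per-token SwiGLU can be sketched as below. The sketch applies the weights in row-vector form ($\mathbf{\vec o}_i \mathbf W$), which is what the stated shapes imply; the function names, the toy sizes, and the random parameters are assumptions made for illustration.

```python
import numpy as np

def swish(z):
    return z / (1.0 + np.exp(-z))       # z * sigmoid(z)

def per_token_swiglu(O, W_up, b_up, W_gate, b_gate, W_down, b_down):
    """O: (T, d). W_up, W_gate: (T, d, h). W_down: (T, h, d). Biases match outputs."""
    up = np.einsum("td,tdh->th", O, W_up) + b_up
    gate = swish(np.einsum("td,tdh->th", O, W_gate) + b_gate)
    return np.einsum("th,thd->td", up * gate, W_down) + b_down

rng = np.random.default_rng(0)
T, d, n = 4, 6, 2                       # toy sizes; hidden width is n * d
h = n * d
O = rng.normal(size=(T, d))
W_up, W_gate = rng.normal(size=(T, d, h)), rng.normal(size=(T, d, h))
W_down = rng.normal(size=(T, h, d))
b_up, b_gate, b_down = np.zeros((T, h)), np.zeros((T, h)), np.zeros((T, d))
out = per_token_swiglu(O, W_up, b_up, W_gate, b_gate, W_down, b_down)
```

Every token indexes its own slice of the weight tensors, so no parameters are shared across tokens.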

1.3.3 SiameseNorm

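The current RankMixer architecture lacks a design dedicated to deep architectures, which shows mainly in the limited benefit of scaling model depth. TokenMixer-Large tries to address this by adding interval residuals and an auxiliary loss inside the block, but this does not reach the root of the problem. To obtain training stability and performance gains as model depth increases, we introduce SiameseNorm into the UniMixer architecture, as shown in Figure 2. As described in related work ("Siamesenorm: Breaking the barrier to reconciling pre/post-norm"), SiameseNorm reconciles pre-normalization (Pre-Norm) and post-normalization (Post-Norm) by introducing two coupled streams at each layer, $\overline {\mathbf X}_{i}$ and $\overline {\mathbf Y}_{i}$. Both streams are initialized with the input embeddings, $\overline {\mathbf X}_{0}=\overline {\mathbf Y}_{0}=\mathbf X$. For the $\ell$-th block, SiameseNorm performs the following update: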
$\begin{matrix} {\tilde{Y}}_{ℓ} = RMSNorm ({\overset{―}{Y}}_{ℓ}), O_{ℓ} = UniMixer ({\overset{―}{X}}_{ℓ} + {\tilde{Y}}_{ℓ}) \\ {\overset{―}{X}}_{ℓ + 1} = RMSNorm ({\overset{―}{X}}_{ℓ} + O_{ℓ}), {\overset{―}{Y}}_{ℓ + 1} = {\overset{―}{Y}}_{ℓ} + O_{ℓ} \end{matrix}$
Note: the $\overline{\mathbf Y}_i$ stream follows pre-norm, whereas the $\overline{\mathbf X}_i$ stream follows post-norm.
After the $M$-th UniMixer block, the two streams $\overline{\mathbf X}_{\ell}$ and $\overline{\mathbf Y}_{\ell}$ are merged to produce the final representation:
$X_{output} = {\overset{―}{X}}_{M} + RMSNorm ({\overset{―}{Y}}_{M})$
According to the offline experiments, replacing PostNorm with SiameseNorm brings an AUC gain of 0.027%. It is therefore not a core design.
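The two-stream update above can be sketched as a short loop. This is only a sketch: `rms_norm` is written without a learnable gain, and the blocks passed in are placeholder callables standing in for UniMixer blocks.

```python
import numpy as np

def rms_norm(X, eps=1e-6):
    return X / np.sqrt(np.mean(X ** 2, axis=-1, keepdims=True) + eps)

def siamese_forward(X, blocks):
    """Run a stack of blocks with the SiameseNorm update rule."""
    X_bar, Y_bar = X, X                         # both streams start from the input
    for block in blocks:
        O = block(X_bar + rms_norm(Y_bar))      # block sees post-norm + normalized pre-norm stream
        X_bar = rms_norm(X_bar + O)             # post-norm stream
        Y_bar = Y_bar + O                       # pre-norm stream (unnormalized residual)
    return X_bar + rms_norm(Y_bar)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
# Placeholder blocks; a real model would use UniMixer blocks here.
out = siamese_forward(X, blocks=[np.tanh, np.tanh])
```

The $\overline{\mathbf Y}$ stream keeps an unnormalized residual path through the depth of the network, while the $\overline{\mathbf X}$ stream is renormalized after every block.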

1.3.4 UniMixer Training Strategy

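To make the parameter matrices $\mathbf W_{G}$ and $\mathbf W_{B}^{i}$ sparse, we introduce a temperature coefficient that controls their degree of sparsity. However, the lower the temperature, the sparser the weights; this also makes the gradients sparse, weak, or even unstable, which makes training difficult and can trap the optimization in a local optimum. On the other hand, our experiments show that the sparsity of the weight parameters has a significant positive effect on model performance, as shown in Table 3. This sparsity is therefore indispensable.
We therefore adopt linear temperature annealing: the temperature starts at $\tau=1.0$ and decreases linearly over the training iterations to $\tau = 0.05$:
$\tau_{j} = \max \left\{\tau_\text{start} - \frac{j}{J} \times (\tau_\text{start} - \tau_\text{end}), \tau_\text{end}\right\}$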
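where:
- $\tau_j$ is the temperature at the $j$-th iteration, and $\tau_\text{start}$ and $\tau_\text{end}$ are the initial and final temperatures, respectively.
- $J$ is the number of iterations over which the temperature is annealed.
Alternatively, two models can be trained in sequence (first at $\tau=1.0$, then at $\tau=0.05$), using the high-temperature model's weights as the initialization to retrain the low-temperature model.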
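The annealing schedule is a one-line function. The sketch below uses the start and end temperatures quoted in the text as defaults; the function name and the value `J = 1000` in the usage lines are chosen here for illustration.

```python
def annealed_temperature(j, J, tau_start=1.0, tau_end=0.05):
    """Linear temperature annealing, clamped at tau_end after J iterations."""
    return max(tau_start - (j / J) * (tau_start - tau_end), tau_end)

# Temperatures at the start, the midpoint, and well past the annealing range.
schedule = [annealed_temperature(j, J=1000) for j in (0, 500, 2000)]
```

The `max` clamp keeps the temperature at `tau_end` for every iteration beyond `J`.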
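Note: according to the experiments section, the training strategy has the largest effect on UniMixer's performance. It is well known that both the training strategy and the model architecture of a DNN strongly affect its final performance. So does UniMixer's strong performance come from its training strategy or from its architecture?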

1.4 Experiments

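This section runs extensive experiments that compare the proposed UniMixer architecture with SOTA methods and answer the following questions:
- Q1: Is the scaling efficiency of the UniMixer architecture better than that of SOTA architectures?
- Q2: How does the performance of the proposed method change under different settings of the global and local mixing patterns?
- Q3: Does the lightweight module UniMixing-Lite further improve scaling efficiency?
- Q4: After deployment to a real online system, do UniMixer/UniMixing-Lite improve business metrics in A/B tests?
Datasets and evaluation metrics: We use real training logs from Kuaishou's advertising delivery scenario to model user retention and run offline and online evaluations. The dataset contains over 0.7 billion user samples collected over one year and covers hundreds of heterogeneous features, including numerical features, ID features, cross features, and sequence features. The binary label (User Retention = 1/0) indicates whether a user returns to the Kuaishou application on the day after the user's first activation. For scaling evaluation, model performance is measured with the industry-standard AUC and UAUC (User-Level AUC), and model efficiency is measured with the number of dense parameters, FLOPs, and MFU.
Baselines and experimental details: We compare the 2-block/4-block UniMixer/UniMixing-Lite architectures with the following representative SOTA frameworks, grouped by modeling paradigm:
- Attention-based architectures: Heterogeneous Attention ("Hiformer: Heterogeneous feature interactions learning with transformers for recommender systems"), HiFormer ("Hiformer: Heterogeneous feature interactions learning with transformers for recommender systems"), and FAT ("From scaling to structured expressivity: Rethinking transformers for ctr prediction"), which use field-specific query, key, and value projections to realize heterogeneous feature interaction.
- TokenMixer-based frameworks: RankMixer ("Rankmixer: Scaling up ranking models in industrial recommenders") and TokenMixer-Large ("Tokenmixer-large: Scaling up large ranking models in industrial recommenders"), which use a rule-based token mixing operation to realize feature interaction.
- FM-based frameworks: Wukong ("Wukong: Towards a scaling law for large-scale recommendation"), which concatenates the outputs of an FMB and a linear projection layer to enrich the interaction component.
All experiments run on a hybrid distributed training framework with 40 GPUs. All models use the same optimizer hyperparameters: both the dense and sparse parts are optimized with Adam at a learning rate of 0.001.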

1.4.1 Performance Comparison (for Q1)

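We compare SOTA scaling architectures with about 100 million parameters against UniMixer and UniMixing-Lite to explore their scaling laws. The heterogeneous attention architecture serves as the base model. Table 2 reports the main results of our models and the SOTA models:
- With a smaller parameter budget and lower compute cost, UniMixer and UniMixing-Lite significantly outperform the other SOTA models on multiple metrics.
  Note: TokenMixer-Large performs worse than RankMixer here? That is somewhat odd.
In addition, in this advertising delivery scenario, RankMixer outperforms all other SOTA models apart from UniMixer/UniMixing-Lite. We therefore choose this strongest SOTA model for the scaling-law comparison with UniMixer/UniMixing-Lite. All models are trained on the same dataset with the same hyperparameters; their scaling curves over parameter count and FLOPs are shown in Figure 4:
- As the number of parameters/FLOPs increases, the AUC of all three models follows a clear power-law trend.
- UniMixing-Lite achieves the best scaling efficiency, with a steeper improvement slope.
According to Figure 4, the fitted scaling laws between AUC and Parameters/FLOPs for RankMixer, UniMixer, and UniMixing-Lite are: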
$\begin{matrix} \Delta \text{AUC}_\text{RankMixer} = 0.002718 \times \text{Params}^{0.116043}, & \Delta \text{AUC}_\text{RankMixer} = 0.002022 \times \text{FLOPs}^{0.116635} \\ \Delta \text{AUC}_\text{UniMixer} = 0.003032 \times \text{Params}^{0.131973}, & \Delta \text{AUC}_\text{UniMixer} = 0.002058 \times \text{FLOPs}^{0.125702} \\ \Delta \text{AUC}_\text{UniMixing-Lite} = 0.003767 \times \text{Params}^{0.141903}, & \Delta \text{AUC}_\text{UniMixing-Lite} = 0.002338 \times \text{FLOPs}^{0.135327} \end{matrix}$
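Of the two constants in a scaling law, the scaling exponent has the largest effect on performance growth and is the dominant factor in scaling efficiency. UniMixing-Lite shows the strongest scaling efficiency: it has the largest scaling exponent and the largest scaling coefficient for both parameter count and FLOPs, which indicates that it benefits the most from increased model capacity.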
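To see why the exponent dominates, note that under a power law $\Delta\text{AUC} = c \cdot \text{Params}^{\alpha}$, scaling the parameters by a factor $k$ multiplies the gain by $k^{\alpha}$, independent of $c$. The sketch below evaluates this for the parameter-based fits quoted above; the helper names and the 10x factor are chosen only for illustration.

```python
# (coefficient, exponent) of the parameter-based fits quoted above.
fits = {
    "RankMixer": (0.002718, 0.116043),
    "UniMixer": (0.003032, 0.131973),
    "UniMixing-Lite": (0.003767, 0.141903),
}

def delta_auc(coef, exponent, params):
    """Power-law gain: coef * params ** exponent."""
    return coef * params ** exponent

def gain_multiplier(exponent, scale):
    """Factor by which the gain grows when params are multiplied by `scale`."""
    return scale ** exponent

# Relative gain from a 10x increase in parameters, per model.
multipliers = {name: gain_multiplier(alpha, 10.0) for name, (c, alpha) in fits.items()}
```

A larger exponent gives a larger multiplier for the same increase in model size, which is the sense in which the exponent controls scaling efficiency.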

1.4.2 Ablation Studies (for Q2)

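To explore the properties of the global and local mixing weights and the contribution of each UniMixer module to the AUC gains, we run ablation experiments on several UniMixer variants and measure their AUC change relative to the full UniMixer model. All variants are trained under similar settings. The results are shown in Table 3: removing any module or violating the parameter constraints degrades performance, and the low temperature coefficient and model warm-up have the most significant effect on overall performance.
Note: model warm-up is the linear temperature annealing described in the main text.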

1.4.3 Performance of the UniMixing-Lite Module (for Q3)

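According to the scaling curves in Figure 4, UniMixing-Lite achieves the best scaling efficiency. We further study the effect of the number of basis matrices $\left\{\mathbf Z_{\ell}\right\}_{\ell=1}^{b}$ (i.e., $b$), the rank $r$ (the rank of $\mathbf A_G, \mathbf B_G$), and the number of UniMixer blocks. As shown in Table 4:
- As the number of bases $b$ and the rank $r$ of $\mathbf A_{G}, \mathbf B_{G}$ increase, model performance improves accordingly.
- Jointly increasing $b$ and $r$ yields a higher AUC gain.
Note: for RankMixer, adding more layers actually makes the results worse.
To examine how the low-rank product $\mathbf A_{G}\mathbf B_{G}$, the Sinkhorn-Knopp operation, and the basis matrices $\left\{\mathbf Z_{\ell}\right\}_{\ell=1}^{b}$ affect the reconstructed global and local mixing matrices, we visualize, for the 2-block UniMixing-Lite at $\tau=1$ and $\tau=0.05$, the reconstructed global matrix $\overline{\mathbf W}_{G}$ and the local mixing matrices $\overline {\mathbf W}_{B}^{i}$ of the first UniMixer block, as shown in Figure 5. The input embedding dimension is 768 and the block size is $B = 6$, so $\overline {\mathbf W}_{G} \in \mathbb{R}^{128 \times 128}$, $\overline {\mathbf W}_{B}^{i} \in \mathbb{R}^{6 \times 6}$, $\mathbf A_{G} \in \mathbb{R}^{128 \times 16}$, and $\mathbf B_{G} \in \mathbb{R}^{16 \times 128}$.
Figure 5 shows that, although the module uses a low-rank approximation and basis matrices, the Sinkhorn-Knopp operation still keeps the matrices close to full rank. Moreover, comparing Figure 5(a)(b) with (c)(d), the global and local mixing matrices under a lower temperature coefficient have sparser interaction distributions, and the sparsity of $\overline {\mathbf W}_{G}$ and $\overline {\mathbf W}_{B}^{i}$ substantially improves model performance.
On the other hand, Table 4 shows that as the depth of UniMixer increases, the proposed model keeps a clear scaling-up trend, whereas the performance of RankMixer drops with depth. The scaling curves of UniMixing-Lite with 2 blocks and 4 blocks are shown in Figure 6, indicating that scaling along depth is more efficient than scaling along width.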

1.4.4 Online A/B Test Results (for Q4)

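To verify the online performance of the proposed UniMixer architecture, we deploy UniMixer and UniMixing-Lite to multiple advertising delivery scenarios at Kuaishou. In the online A/B tests, user engagement is measured by Cumulative Active Days (CAD) within a 30-day observation window (excluding the installation day, i.e., day 0). Across multiple scenarios, the D1-D30 CAD improves by more than 15% on average.
Note: there are no detailed figures or tables to back this up. What is the base model? Was it actually launched? None of this is explained.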

1.5 Conclusion

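This paper establishes a unified scaling framework for scaling laws in recommender systems. It connects attention-based, TokenMixer-based, and FM-based methods and makes it possible to combine their respective strengths. The resulting scaling laws show that, compared with SOTA architectures, UniMixing-Lite achieves the best parameter efficiency and compute efficiency. The architecture has been deployed to multiple scenarios at Kuaishou, with significant offline and online gains.
Instead of viewing the existing scaling modules in recommender systems (such as Heterogeneous Attention, TokenMixer, and Wukong) in isolation, this work builds a unified theoretical framework that guides the scaling design of recommender systems. We believe this unified architecture can help the recommender-system field reach its own "attention moment". The unified module, UniMixer, is a fundamental block designed specifically for recommendation, and its applicability can be extended to user behavior sequence modeling and generative recommendation tasks.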

II. Appendix

2.1 Appendix A: Numerical Example of the TokenMixer Equivalent Transformation

Consider an input hidden state $\mathbf X \in \mathbb{R}^{2 \times 6}$, where each $x_i$ is a scalar:
$\begin{matrix} X = [\begin{matrix} x_{1} & x_{2} & x_{3} & x_{4} & x_{5} & x_{6} \\ x_{7} & x_{8} & x_{9} & x_{10} & x_{11} & x_{12} \end{matrix}] \end{matrix}$
After the TokenMixer operation, the input hidden state $\mathbf X$ is transformed into:
$\begin{matrix} TokenMixer (X) = [\begin{matrix} x_{1} & x_{2} & x_{3} & x_{7} & x_{8} & x_{9} \\ x_{4} & x_{5} & x_{6} & x_{10} & x_{11} & x_{12} \end{matrix}] \end{matrix}$
The output of TokenMixer can be flattened into a vector:
$flatten (TokenMixer (X)) = {[x_{1}, x_{2}, x_{3}, x_{7}, x_{8}, x_{9}, x_{4}, x_{5}, x_{6}, x_{10}, x_{11}, x_{12}]}^{⊤}$
Multiplying $\text{flatten}(\mathbf X)$ by a permutation matrix in $\mathbb R^{12\times 12}$ yields $\text{flatten}(\text{TokenMixer}(\mathbf X))$, which can be written as:
$\begin{matrix} \underset{W^{perm}}{\underset{⏟}{[\begin{matrix} 1 & 0 & 0 & . & 0 & 0 & 0 & . & 0 & 0 & 0 & . & 0 & 0 & 0 \\ 0 & 1 & 0 & . & 0 & 0 & 0 & . & 0 & 0 & 0 & . & 0 & 0 & 0 \\ 0 & 0 & 1 & . & 0 & 0 & 0 & . & 0 & 0 & 0 & . & 0 & 0 & 0 \\ . & . & . & + & . & . & . & + & . & . & . & + & . & . & . \\ 0 & 0 & 0 & . & 0 & 0 & 0 & . & 1 & 0 & 0 & . & 0 & 0 & 0 \\ 0 & 0 & 0 & . & 0 & 0 & 0 & . & 0 & 1 & 0 & . & 0 & 0 & 0 \\ 0 & 0 & 0 & . & 0 & 0 & 0 & . & 0 & 0 & 1 & . & 0 & 0 & 0 \\ . & . & . & + & . & . & . & + & . & . & . & + & . & . & . \\ 0 & 0 & 0 & . & 1 & 0 & 0 & . & 0 & 0 & 0 & . & 0 & 0 & 0 \\ 0 & 0 & 0 & . & 0 & 1 & 0 & . & 0 & 0 & 0 & . & 0 & 0 & 0 \\ 0 & 0 & 0 & . & 0 & 0 & 1 & . & 0 & 0 & 0 & . & 0 & 0 & 0 \\ . & . & . & + & . & . & . & + & . & . & . & + & . & . & . \\ 0 & 0 & 0 & . & 0 & 0 & 0 & . & 0 & 0 & 0 & . & 1 & 0 & 0 \\ 0 & 0 & 0 & . & 0 & 0 & 0 & . & 0 & 0 & 0 & . & 0 & 1 & 0 \\ 0 & 0 & 0 & . & 0 & 0 & 0 & . & 0 & 0 & 0 & . & 0 & 0 & 1 \end{matrix}]}} \underset{flatten (X)}{\underset{⏟}{[\begin{matrix} x_{1} \\ x_{2} \\ x_{3} \\ x_{4} \\ x_{5} \\ x_{6} \\ x_{7} \\ x_{8} \\ x_{9} \\ x_{10} \\ x_{11} \\ x_{12} \end{matrix}]}} = \underset{flatten (TokenMixer (X))}{\underset{⏟}{[\begin{matrix} x_{1} \\ x_{2} \\ x_{3} \\ x_{7} \\ x_{8} \\ x_{9} \\ x_{4} \\ x_{5} \\ x_{6} \\ x_{10} \\ x_{11} \\ x_{12} \end{matrix}]}} \end{matrix}$
The TokenMixer permutation matrix $\mathbf W^{\text{perm}} \in \mathbb{R}^{12 \times 12}$ can be equivalently factorized as the Kronecker product of the following two small matrices:
$\begin{matrix} W^{perm} = \underset{Global Mixing Matrix}{\underset{⏟}{[\begin{matrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{matrix}]}} \otimes \underset{Local Mixing Matrix}{\underset{⏟}{[\begin{matrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{matrix}]}} \end{matrix}$

2.2 Appendix B: Computation Pipeline Optimization of the UniMixing Module

Let $\mathbf W_G \in \mathbb{R}^{(L // B) \times (L // B)}$ and $\mathbf W_B^i \in \mathbb{R}^{B \times B}$ be written as:
$\begin{matrix} W_{G} = [\begin{matrix} w_{(1, 1)}^{G} & \dots & w_{(1, L / / B)}^{G} \\ ⋮ & ⋱ & ⋮ \\ w_{(L / / B, 1)}^{G} & \dots & w_{(L / / B, L / / B)}^{G} \end{matrix}], W_{B}^{i} = [\begin{matrix} v_{(1, 1)}^{i} & \dots & v_{(1, B)}^{i} \\ ⋮ & ⋱ & ⋮ \\ v_{(B, 1)}^{i} & \dots & v_{(B, B)}^{i} \end{matrix}] \end{matrix}$
where $w_{(i,j)}^{G}$ and $v_{(i,j)}^{i}$ are scalars.
According to $\left[\mathbf{\vec x}_1\mid \mathbf{\vec x}_2\mid \cdots \mid \mathbf{\vec x}_{L//B}\right] = \text{Split}\left(\text{flatten}(\mathbf X), \frac{L}{B}\right)$, $\text{flatten}(\mathbf X)$ is split into $L//B$ vectors and rewritten as:
$flatten (X) = {[{\vec{x}}_{1} ∣ {\vec{x}}_{2} ∣ \dots ∣ {\vec{x}}_{L / / B}]}^{⊤}$
where $\mathbf{\vec x}_i\in \mathbb R^{1\times B}$ is a row vector of length $B$.
In the UniMixing formula $\text{UniMixing}(\mathbf X) = \text{reshape}\left(\left(\mathbf W_G\otimes\left\{\mathbf W_B^i\right\}_{i=1}^{L//B}\right)\text{flatten}(\mathbf X),1,L\right)$, the term $\left(\mathbf W_G\otimes\left\{\mathbf W_B^i\right\}_{i=1}^{L//B}\right)\text{flatten}(\mathbf X)$ can be rewritten as:
$\begin{matrix} (W_{G} \otimes {W_{B}^{i}}_{i = 1}^{L / / B}) flatten (X) = [\begin{matrix} w_{(1, 1)}^{G} W_{B}^{1} & \dots & w_{(1, L / / B)}^{G} W_{B}^{L / / B} \\ ⋮ & ⋱ & ⋮ \\ w_{(L / / B, 1)}^{G} W_{B}^{1} & \dots & w_{(L / / B, L / / B)}^{G} W_{B}^{L / / B} \end{matrix}] [\begin{matrix} {\vec{x}}_{1}^{⊤} \\ ⋮ \\ {\vec{x}}_{L / B}^{⊤} \end{matrix}] \\ = [\begin{matrix} w_{(1, 1)}^{G} W_{B}^{1} {\vec{x}}_{1}^{⊤} + \dots + w_{(1, L / / B)}^{G} W_{B}^{L / / B} {\vec{x}}_{L / B}^{⊤} \\ ⋮ \\ w_{(L / / B, 1)}^{G} W_{B}^{1} {\vec{x}}_{1}^{⊤} + \dots + w_{(L / / B, L / / B)}^{G} W_{B}^{L / / B} {\vec{x}}_{L / / B}^{⊤} \end{matrix}] \in R^{L \times 1} \end{matrix}$
On the other hand, we obtain the following expression:
$\begin{matrix} W_{G} reshape ([{\vec{x}}_{1} {(W_{B}^{1})}^{⊤} ∣ \dots ∣ {\vec{x}}_{L / / B} {(W_{B}^{L / / B})}^{⊤}], \frac{L}{B}, B) \\ = [\begin{matrix} w_{(1, 1)}^{G} & \dots & w_{(1, L / / B)}^{G} \\ ⋮ & ⋱ & ⋮ \\ w_{(L / / B, 1)}^{G} & \dots & w_{(L / / B, L / / B)}^{G} \end{matrix}] [\begin{matrix} {\vec{x}}_{1} {(W_{B}^{1})}^{⊤} \\ ⋮ \\ {\vec{x}}_{L / / B} {(W_{B}^{L / / B})}^{⊤} \end{matrix}] \\ = [\begin{matrix} w_{(1, 1)}^{G} {\vec{x}}_{1} {(W_{B}^{1})}^{⊤} + \dots + w_{(1, L / / B)}^{G} {\vec{x}}_{L / / B} {(W_{B}^{L / / B})}^{⊤} \\ ⋮ \\ w_{(L / / B, 1)}^{G} {\vec{x}}_{1} {(W_{B}^{1})}^{⊤} + \dots + w_{(L / / B, L / / B)}^{G} {\vec{x}}_{L / / B} {(W_{B}^{L / / B})}^{⊤} \end{matrix}] \in R^{(L / / B) \times B} \end{matrix}$
The elements of the expressions above satisfy:
$w_{(i, 1)}^{G} W_{B}^{1} {\vec{x}}_{1}^{⊤} + \dots + w_{(i, L / / B)}^{G} W_{B}^{L / / B} {\vec{x}}_{L / / B}^{⊤} = {(w_{(i, 1)}^{G} {\vec{x}}_{1} {(W_{B}^{1})}^{⊤} + \dots + w_{(i, L / / B)}^{G} {\vec{x}}_{L / / B} {(W_{B}^{L / / B})}^{⊤})}^{⊤}$
Therefore:
$(W_{G} \otimes {W_{B}^{i}}_{i = 1}^{L / / B}) flatten (X) = reshape (W_{G} reshape ([{\vec{x}}_{1} {(W_{B}^{1})}^{⊤} ∣ \dots ∣ {\vec{x}}_{L / / B} {(W_{B}^{L / / B})}^{⊤}], \frac{L}{B}, B), L, 1)$
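Since $\mathbf W_B^i$ and $\left(\mathbf W_B^{i}\right)^\top$ are both learnable parameters, transposing the parameter does not affect the model. The UniMixing module after computation pipeline optimization can therefore be written as: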
$UniMixing (X) = reshape (W_{G} reshape ([{\vec{x}}_{1} {(W_{B}^{1})}^{⊤} ∣ \dots ∣ {\vec{x}}_{L / / B} {(W_{B}^{L / / B})}^{⊤}], \frac{L}{B}, B), 1, L)$