Can Attention Enable MLPs To Catch Up With CNNs?

In the first week of May 2021, researchers from four institutions (Google, Tsinghua University, Oxford University and Facebook) shared their latest work [16, 7, 12, 17] on arXiv.org almost simultaneously. Each proposed a new learning architecture consisting mainly of linear layers, and claimed it to be comparable, or even superior, to convolution-based models. This sparked immediate discussion and debate in both academic and industrial communities as to whether MLPs are sufficient, with many thinking that learning architectures are returning to MLPs. Is this true? In this perspective, we give a brief history of learning architectures, including multilayer perceptrons (MLPs), convolutional neural networks (CNNs) and transformers. We then examine what the four newly proposed architectures have in common. Finally, we give our views on challenges and directions for new learning architectures, hoping to inspire future research.


Learning architectures for visual tasks
Multilayer perceptrons (MLPs) [15] consist of an input layer and an output layer, possibly with multiple hidden layers in between. Layers are typically fully connected with linear transformations and activation functions. MLPs were the basis for neural networks before deep convolutional neural networks (DCNNs) took over, and greatly improved the power of computers to handle problems of classification and regression. However, MLPs are computationally costly and prone to overfitting, due to their large numbers of parameters. MLPs are also poor at capturing local structures in the input, since the linear transformations between layers always take the output from the previous layer as a whole. However, we note that the capabilities of MLPs were not fully explored when they were proposed, both because of limited computer performance, and unavailability of massive data for training.
To learn local structures in the input while maintaining computational efficiency, convolutional neural networks (CNNs) were proposed. In 1998, LeCun et al. presented LeNet [11], which greatly improved the accuracy of handwritten digit recognition using a five-layer convolutional neural network. Later, AlexNet [10] led to wide acceptance of CNNs in the research community: it was much larger than previous CNNs like LeNet, and beat all other competitors by a significant margin in the ImageNet Large Scale Visual Recognition Challenge in 2012. Since then, many more models with ever deeper architectures have been developed, some providing more accurate results than humans in a number of domains, resulting in profound paradigm changes in scientific research as well as in engineering and commercial applications.
Putting aside the advances in computing power and amounts of training data, the key to the success of CNNs lies in the inductive bias they introduce: they assume that information has spatial locality, and can thus reduce the number of network parameters by making use of a sliding convolution with shared weights. However, the side-effect of this approach is that the receptive fields of CNNs are limited, making CNNs less able to learn long-range dependencies. To enlarge the receptive field, a larger convolutional kernel is required, or other special strategies must be employed, such as dilated convolutions [20]. Note that composing a large kernel from several small kernels is not a suitable approach for enlarging the receptive field of CNNs [13].
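As a rough illustration of the two points above (weight sharing reduces the number of parameters, and dilation enlarges the receptive field), the following Python sketch counts weights and computes a theoretical receptive field. The function names and the example sizes (a 32 × 32 RGB input, 16 output channels, 3 × 3 kernels) are assumptions chosen for illustration and are not taken from the cited works.

```python
def dense_weights(h, w, c_in, c_out):
    """Weights of a fully connected layer mapping an h*w*c_in input
    to an h*w*c_out output (bias ignored)."""
    return (h * w * c_in) * (h * w * c_out)


def conv_weights(k, c_in, c_out):
    """Weights of a k x k convolution whose kernel is shared across
    all spatial positions (bias ignored)."""
    return k * k * c_in * c_out


def receptive_field(kernel_sizes, dilations):
    """Theoretical receptive field along one axis of a stack of
    stride-1 convolutions with the given kernel sizes and dilations."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf


print(dense_weights(32, 32, 3, 16))            # 50331648
print(conv_weights(3, 3, 16))                  # 432
print(receptive_field([3, 3, 3], [1, 1, 1]))   # 7
print(receptive_field([3, 3, 3], [1, 2, 4]))   # 15
```

The value returned by `receptive_field` is only the theoretical extent of the input that can influence one output position; it does not measure how strongly distant pixels actually contribute.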
Recently, the Transformer neural network architecture was proposed [19] for sequential data, with great success in natural language processing [14,4], and more recently, vision [5,3,6,21,18]. At the core of Transformer is the attention mechanism, which can readily learn long-range dependencies between any two positions in the input data in the form of an attention map. However, this additional freedom and reduced inductive bias mean that effectively training Transformer-based architectures requires huge amounts of data. For best results, such models should first be pre-trained on a very large dataset, as is done for GPT-3 [2] and ViT [5].

Four Recent Architectures
To avoid the drawbacks of the aforementioned learning architectures, and to achieve better results at lower computational cost, four architectures were very recently proposed almost simultaneously [16,7,12,17]. Their common aim is to take full advantage of linear layers. We briefly summarize these architectures below; also see

MLP-Mixer
MLP-Mixer [16] takes $S$ non-overlapping image patches of resolution $P \times P$ as input. Each patch is first projected to a $C$-dimensional embedding via a shared-weight linear layer, so the input image is represented as a matrix $X \in \mathbb{R}^{S \times C}$. Next, $X$ is fed into a sequence of identical mixer layers, each composed of a token-mixing MLP block and a channel-mixing MLP block, which mix information from all patches and from all channels, respectively. We may express the computation as:

$$U_{*,i} = X_{*,i} + f_2(\sigma(f_1(\mathrm{LayerNorm}(X)_{*,i}))), \quad i = 1, \dots, C \tag{1}$$
$$Y_{j,*} = U_{j,*} + f_4(\sigma(f_3(\mathrm{LayerNorm}(U)_{j,*}))), \quad j = 1, \dots, S \tag{2}$$

where $f_1, \dots, f_4$ are linear layers, and $\sigma$ denotes the GELU (nonlinear) activation [9]. Layer normalization [1] is employed. $U \in \mathbb{R}^{S \times C}$ is the intermediate matrix after per-channel feature aggregation: a shared-weight mapping $\mathbb{R}^{S} \to \mathbb{R}^{S}$ applied to each column vector of $X$. Similarly, two linear transformations are performed per patch (on each row of $U$), giving the output $Y$.
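The mixer layer described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation of [16]: the hidden widths `D_S` and `D_C`, the random weights, the omission of biases and of the learnable LayerNorm parameters, and the tanh approximation of GELU are all assumptions made for brevity.

```python
import numpy as np


def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def layer_norm(x, eps=1e-6):
    # normalize each row (token) over the channel axis;
    # learnable scale and shift are omitted in this sketch
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def mixer_layer(X, W1, W2, W3, W4):
    """X: (S, C). W1: (D_S, S) and W2: (S, D_S) mix tokens;
    W3: (C, D_C) and W4: (D_C, C) mix channels."""
    # token mixing: the same weights act on every column (channel) of X
    U = X + W2 @ gelu(W1 @ layer_norm(X))
    # channel mixing: the same weights act on every row (patch) of U
    Y = U + gelu(layer_norm(U) @ W3) @ W4
    return Y


rng = np.random.default_rng(0)
S, C, D_S, D_C = 196, 64, 32, 128          # illustrative sizes only
X = rng.standard_normal((S, C))
W1 = rng.standard_normal((D_S, S)) * 0.02
W2 = rng.standard_normal((S, D_S)) * 0.02
W3 = rng.standard_normal((C, D_C)) * 0.02
W4 = rng.standard_normal((D_C, C)) * 0.02
Y = mixer_layer(X, W1, W2, W3, W4)
print(Y.shape)                              # (196, 64)
```

Multiplying on the left mixes information across patches, while multiplying on the right mixes information across channels; the residual additions keep the output the same shape as the input, so layers can be stacked.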

External Attention
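External attention [7] reveals the relation between self-attention and linear layers. It first simplifies self-attention as in Eq. 4, where $F \in \mathbb{R}^{N \times d}$ is the input feature map:

$$A = \mathrm{softmax}(F F^{T}) \tag{3}$$
$$F_{out} = A F \tag{4}$$

Then an external memory unit $M \in \mathbb{R}^{S \times d}$ is introduced to replace $F$-to-$F$ attention by $F$-to-$M$ attention as below:

$$A = \mathrm{Norm}(F M^{T}) \tag{5}$$
$$F_{out} = A M \tag{6}$$

Here $\mathrm{Norm}$ denotes the normalization applied to the attention map (softmax and L1 normalization). Finally, like self-attention, it uses two different memory units $M_k$ and $M_v$ as the key and the value to increase the capability of the network. The overall computation of external attention is as below:

$$A = \mathrm{Norm}(F M_k^{T}) \tag{7}$$
$$F_{out} = A M_v \tag{8}$$

Because $F M_k^{T}$ is a matrix multiplication, it is linear in $F$, so Eq. 8 can be written as

$$F_{out} = \mathrm{Linear}(\mathrm{Norm}(\mathrm{Linear}(F))) \tag{9}$$

The final output is then obtained by adding an identity mapping as below:

$$F_{out} = \mathrm{Linear}(\mathrm{Norm}(\mathrm{Linear}(F))) + F \tag{10}$$

Based on external attention, Guo et al. [7] also design a multi-head external attention and obtain an all-MLP architecture named EAMLP.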

Feed-forward-only Model
The feed-forward-only model [12] replaces the attention layers in Transformer [19] by simple feed-forward layers on the token dimension. Within each linear block, it first applies linear layers on the channel dimension and then applies linear layers on the token dimension. Given an input $X \in \mathbb{R}^{N \times C}$, the computation in detail can be expressed as:

ResMLP
ResMLP [17] also aggregates information separately in a per-patch and a per-channel manner, and can be formulated as follows:

A major difference of ResMLP is that it uses an affine transformation in the role of a normalization layer. This affine transformation is parameterized by two learnable vectors, $\alpha$ and $\beta$, which scale and shift the input component-wise:

$$\mathrm{Aff}_{\alpha,\beta}(x) = \alpha \odot x + \beta$$

Note that no statistics of the input are used in the above, so the transformation can be folded into the adjacent linear layers during inference for further speed.
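The claim that the affine transformation can be integrated into a linear layer follows from simple algebra, which the NumPy sketch below checks numerically. The function names, shapes and random values are assumptions for illustration; the linear layer is taken to act on the channel axis as `x @ W + b`.

```python
import numpy as np


def affine(x, alpha, beta):
    # component-wise scale and shift along the channel axis;
    # no statistics of x are computed
    return alpha * x + beta


def fold_affine_into_linear(W, b, alpha, beta):
    """Return (W_f, b_f) such that affine(x, alpha, beta) @ W + b == x @ W_f + b_f."""
    W_folded = alpha[:, None] * W     # equals diag(alpha) @ W
    b_folded = beta @ W + b
    return W_folded, b_folded


rng = np.random.default_rng(0)
N, C, D = 10, 16, 24                  # illustrative sizes only
x = rng.standard_normal((N, C))
alpha = rng.standard_normal(C)
beta = rng.standard_normal(C)
W = rng.standard_normal((C, D))
b = rng.standard_normal(D)

W_f, b_f = fold_affine_into_linear(W, b, alpha, beta)
same = np.allclose(affine(x, alpha, beta) @ W + b, x @ W_f + b_f)
print(same)                           # True
```

Layer normalization cannot be folded in this way, because its mean and variance depend on each input.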

Common Themes
We now examine the above approaches, to see what they have in common.

Long distance interactions
As in self-attention, interactions between different patches are taken into account by these four methods. MLP-Mixer, ResMLP and the Feed-forward-only model use linear layers acting on the token dimension to allow different patches to communicate with each other. External attention adopts softmax and L1 normalization to perform a similar role. Unlike CNNs, these models can consider long distance interactions between patches and automatically select suitable and irregular receptive fields.

Local semantic information
Unlike individual words in natural language, single pixels carry very little semantic information, and their interactions with other pixels are not directly informative. It is thus important to extract meaningful information before using MLPs. MLP-Mixer, ResMLP and the Feed-forward-only model divide the image into 16 × 16 local patches to obtain semantic information. External attention adopts a T2T module [21] or a CNN backbone to provide rich semantics before passing information to linear layers.
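The patch-based input used by the first three models can be sketched as follows. This is a minimal NumPy illustration, assuming a 224 × 224 RGB image, a patch size of 16, and a hypothetical random projection matrix `W_embed` standing in for the learned shared-weight linear layer.

```python
import numpy as np


def image_to_patches(img, P):
    """Split an (H, W, channels) image into S = (H/P) * (W/P)
    non-overlapping P x P patches, each flattened to a vector."""
    H, W, ch = img.shape
    assert H % P == 0 and W % P == 0, "image size must be a multiple of P"
    patches = img.reshape(H // P, P, W // P, P, ch)
    patches = patches.transpose(0, 2, 1, 3, 4)     # (H/P, W/P, P, P, ch)
    return patches.reshape(-1, P * P * ch)          # (S, P*P*ch)


rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))
patches = image_to_patches(img, 16)
print(patches.shape)                                # (196, 768)

C = 64                                              # illustrative embedding width
W_embed = rng.standard_normal((16 * 16 * 3, C)) * 0.02   # hypothetical learned weights
X = patches @ W_embed                               # one token per patch
print(X.shape)                                      # (196, 64)
```

Each row of `X` then summarizes a whole 16 × 16 neighbourhood rather than a single pixel, which is what gives the subsequent linear layers something meaningful to mix.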

Residual connections
Residual connections [8] alleviate the problem of vanishing gradients and stabilize the training process, so they are commonly used in deep convolutional neural networks. They also benefit architectures based around linear layers and are adopted by all the above models.

Reduced inductive bias
Localised processing in CNNs results in inductive bias, which can decrease accuracy when the training data is sufficient. The recently introduced architectures use linear layers on single tokens independently, or process all tokens equally, resulting in lower inductive bias than CNNs.

Challenges and future directions
These promising, recently introduced architectures have simple network structures and high inference throughput. However, on ImageNet, their results are currently 5-10% less accurate than those provided by the best CNNs or Transformer networks. They also do not significantly outperform light-weight networks in the trade-off between accuracy and speed. Thus additional research is needed if the potential of such architectures is to be realised.
We suggest possible directions for future work below, and make other observations about these architectures:
• All linear layers process image patches in a direct or indirect manner to extract local features, thereby reducing computational cost. Dividing images into non-overlapping patches again introduces inductive bias. On one hand, CNNs capture local structure extremely well, but lack the ability to handle long range interactions. On the other hand, these four architectures provide a good way to process long range interactions. It seems natural to try to combine the advantages of both architectures.
• One main goal of these four methods is to avoid the use of the self-attention mechanism. The successful configurations used for this purpose in Transformer could be employed in these linear architectures. For example, Transformer can use multi-head attention, and a similar multi-head mechanism could be employed by these methods to improve model capability.
• Residual connections play a key role in all these methods, indicating that the network structure is crucial. Because these new architectures are simpler than CNNs, better backbones are needed.
• Due to the simplicity of these new architectures, they can easily tackle irregular data structures, including point clouds, graphs, etc., used in various applications. Furthermore, this flexibility promises the ability to make cross-modal models, with a unified network backbone for all modes of data.
• An additional benefit is that all computations are matrix multiplications, which can be highly optimized in deep learning frameworks and readily performed on hardware. This simplicity can promote deployment in industry and commerce, and also reduce energy consumption.

Conclusions
Overall, the new architectures separately apply linear layers in the element (token) dimension and channel dimension to learn long range interactions between any two positions in the feature matrix, while traditional MLPs mix these two dimensions together as a long vector, with too much freedom for effective learning. We conclude that the new architectures do not simply reuse traditional MLPs, but are a significant advance over them.
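The difference in freedom between the two designs can be made concrete by counting weights. The sketch below compares a single fully connected layer acting on the flattened feature matrix with a pair of linear layers acting separately on the token and channel dimensions. The sizes S = 196 and C = 512 are illustrative assumptions, and biases and hidden expansions are ignored.

```python
def flat_mlp_weights(S, C):
    # one fully connected layer on the flattened S*C vector,
    # producing an output of the same size
    return (S * C) ** 2


def separated_weights(S, C):
    # one linear layer over tokens (S x S) plus one over channels (C x C),
    # each shared across the other dimension
    return S * S + C * C


S, C = 196, 512
print(flat_mlp_weights(S, C))      # 10070523904
print(separated_weights(S, C))     # 300560
```

Under these assumptions, separating the two dimensions reduces the number of weights by more than four orders of magnitude, while any two positions in the feature matrix can still interact after one token-mixing and one channel-mixing step.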