Connecting the Dots in Self-Supervised Learning: A Brief Survey for Beginners

The artificial intelligence (AI) community has recently made tremendous progress in developing self-supervised learning (SSL) algorithms that can learn high-quality data representations from massive amounts of unlabeled data. These methods have brought great results even to fields outside of AI. Thanks to the joint efforts of researchers in various areas, new SSL methods come out daily. However, such a sheer number of publications makes it difficult for beginners to see clearly how the subject progresses. This survey bridges this gap by carefully selecting a small portion of papers that we believe are milestones or essential work. We see these works as the “dots” of SSL and connect them through how they evolve. Hopefully, by viewing the connections of these dots, readers will have a high-level picture of the development of SSL across multiple disciplines including natural language processing, computer vision, graph learning, audio processing, and protein learning.


Introduction
"You can't connect the dots looking forward; you can only connect them looking backwards. So you have to trust that the dots will somehow connect in your future." -Steve Jobs
In the last few years, the artificial intelligence (AI) community has witnessed a boom in self-supervised learning (SSL), a class of algorithms that can learn meaningful representations without manually labeled data. (In the remainder of this manuscript, we use the terms "representation" and "embedding" interchangeably.) These methods have significantly improved the performance of a variety of AI-related tasks [1][2][3]. Research fields like natural language processing (NLP) [4], computer vision (CV) [5,6], and speech recognition [7] have all witnessed breakthroughs through the use of self-supervised methods. With the rapid growth of computational power, modern neural architectures endowed with self-supervised algorithms can even outperform supervised models trained with over a million labeled samples [6]. Given its advantages in representation learning, SSL has become a popular research topic. Recently, thousands of papers [1,2] have been published each year, and such a massive number of publications makes it difficult for researchers, especially newcomers, to find genuinely inspiring articles and gain an overall picture of how SSL evolves.
In addition to the many related publications, SSL is intrinsically interdisciplinary. That is, innovations in SSL can appear in any of its application fields because of its wide usage. Researchers these days often get ideas from related fields. For example, both context prediction [8] and wav2vec [9] were inspired by the famous Word2Vec [10] algorithm. Similarly, Mockingjay [11] and MAE [6] are the audio and visual versions of BERT [4], respectively. Therefore, this interdisciplinary integration requires researchers to keep track of papers across all related research fields.
©The Author(s) 2022
Usually, surveys are good resources for beginners to learn a particular field quickly and comprehensively. However, due to the tendency to include more papers, these surveys themselves are becoming lengthy and difficult to digest. In fact, when these surveys list their main contributions, "comprehensive" and "detailed" are often keywords [2,12]. In addition, because new papers come out daily, some methods listed in these surveys quickly become outdated and lose their value as references. To make our survey easy to understand, we select a handful of milestone papers and important work from each field. We call these papers "dots", and connect them instead of listing or categorizing them. By connecting these dots, we clarify how SSL evolves and how different research fields inspire one another.
Our criterion for selecting these dots is that they must be among the top-cited papers in their fields. We first identify the well-known work on feature engineering for each field. We then restrict ourselves to selecting papers published after 2013, because deep learning, the activator of representation learning, has been popular since 2013. Before then, training deep neural networks (DNNs) with graphics processing units (GPUs) was not common, and researchers spent much of their effort on reducing computational complexity. Hence, they paid less attention to the SSL algorithm itself. Table 1 gives an overview of the work presented in this survey.
In this brief survey, we review and connect the work from NLP, CV, graph learning, audio processing, and protein learning. By looking at the links between these SSL methods from different fields, we can see the following. 1) Supervised learning contributes significantly to the development of SSL. Major neural network architectures like residual networks and Transformers, which resulted from supervised learning research, are essential to SSL. 2) SSL methods in NLP like Word2Vec [10] and BERT [4] inspired most SSL methods in other fields. 3) Gains in hardware are the main driving force of SSL methods, as these methods are computationally demanding.
We structure the rest of this paper as follows. From Section 2 to Section 6, we present the recent advances of SSL in NLP, CV, graph learning, audio processing, and protein learning, respectively. In Section 7 and Section 8, we discuss the existing survey articles and the future trends of SSL. We conclude our article in Section 9.
Self-Supervised Learning in Natural Language Processing
Language data, i.e., text, is typically a sequence of word tokens and is easily accessible through the Internet. However, annotating labels for text is very expensive. For example, on Google AI Platform, assigning a class label to a piece of text with 50 words is about 4x as expensive as classifying an image. This motivates researchers to invent effective SSL methods that learn language representations without using text labels. The progress of SSL in NLP is shown in Fig.1.
After a few decades of research efforts, several key ingredients for SSL have been identified. Among the most important are global modeling and large scale in terms of both network and data. However, all of these factors demand large computational resources. In this section, we will see how researchers in the field, constrained by computational resources, gradually brought together and improved the above ingredients.
The journey is not without peril. One great temptation that researchers must resist is optimizing for a specific task. Although such improvements can be useful at the time they are made, they contribute little, and sometimes even negatively, towards the elusive final goal of solving language understanding problems. Many researchers went astray along this journey.

Early Attempts on Global Modeling
In the deep learning era, let us begin our account with the seminal work proposed by Collobert et al. in 2011 [15]. Prior to this work, the majority of state-of-the-art systems were built upon task-specific features. These systems addressed each task independently with linear models on top of features that contained a large body of manually designed linguistic knowledge. Different from these systems, Collobert et al. learned contextualized intermediate representations through language modeling (LM), i.e., estimating the acceptability of a word given the previous words in a sentence [15]. Through this classical pretext task in language learning, they pre-trained the full network, built on convolutional neural networks (CNNs), and fine-tuned it on multiple downstream benchmarks. This visionary design includes both full-network pre-training and long contextualized embeddings. However, without GPUs, this approach was still too expensive to compute. Early attempts at global modeling were thus limited by computational constraints.

Falling Back into Local Modeling
As the global modeling in [15] was computationally demanding, this method was not as popular as another local, shallow, and lightweight method called Word2Vec [10], which appeared two years later in 2013. In order to train on a larger dataset, Mikolov et al. [10] shortened the input window size to 5 and only used a one-layer MLP as the network. Specifically, they used a linear network to predict either the middle word from its four neighboring words (CBOW) or the four neighboring words from the middle word (skip-gram). This trade-off among data size, network size, and context size was a huge success at that time. In fact, it was so successful that follow-up researchers even called into question "the importance of the full neural network structure for learning useful word representation". However, although good word embeddings helped to improve the performance of language tasks in general, they also encouraged researchers to design task-specific networks on top of pre-trained embeddings. A number of follow-up studies went astray in that direction and were later proved to be meaningless.
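As a minimal sketch of how the two Word2Vec objectives turn raw text into training examples, the following Python snippet enumerates the prediction pairs for a window of five words (two neighbors on each side of the middle word). The function names and the toy sentence are illustrative assumptions and are not taken from [10]; learning the embeddings from these pairs is omitted.

```python
def cbow_examples(tokens, half_window=2):
    """Return (context_words, middle_word) pairs: CBOW predicts the middle word."""
    examples = []
    for i, middle in enumerate(tokens):
        left = tokens[max(0, i - half_window):i]
        right = tokens[i + 1:i + 1 + half_window]
        context = left + right
        if context:  # skip degenerate one-word inputs
            examples.append((context, middle))
    return examples


def skipgram_examples(tokens, half_window=2):
    """Return (middle_word, neighbor) pairs: skip-gram predicts each neighbor."""
    return [(middle, neighbor)
            for context, middle in cbow_examples(tokens, half_window)
            for neighbor in context]


# Toy sentence (hypothetical data, for illustration only).
sentence = "the quick brown fox jumps".split()
cbow = cbow_examples(sentence)
skipgram = skipgram_examples(sentence)
```

Words near the sentence boundary simply receive fewer neighbors in this sketch; how boundaries are handled in practice is a design choice that is left open here.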

Global Modeling Regaining Its Strength
It was not until 2015 that Dai and Le revisited the full-network pre-training strategy in the SA-LSTM model [16]. They used both LM and a sequence autoencoder, which reconstructs the input sequence from the hidden states, as pre-training objectives and showed that the sequence autoencoder was the better choice. They also observed that a well pre-trained model significantly improved the performance of multiple downstream tasks. However, their networks were still shallow and were only pre-trained on a relatively small target dataset, which limited the power of pre-training. Full-network pre-training had still not become popular.

Full Prosperity of Global Modeling
Finally, in 2018, the NLP community met its year of wonder. In February, Peters et al. introduced a new type of deep contextualized word representation called ELMo [17]. Compared with CBOW and skip-gram, ELMo was pre-trained on a larger network, and used a bidirectional language model and a much longer context. It pre-trained a fixed contextualized word embedding through a 2-layer bidirectional LSTM. At roughly the same time, a similar method called ULMFiT was proposed by Howard and Ruder [18]. Instead of using a fixed word embedding, they proposed to fine-tune the deep pre-trained network and further increased the depth of the network. In June, Radford et al. proposed pre-training a Transformer decoder, which is a typical architecture of modern language models [25]. This new method was called GPT, and it had a much larger network than previous approaches. Its ability to model contexts was also stronger than that of the LSTM or CNN methods. Later that year, Devlin et al. proposed the now well-known BERT model [4]. In BERT, pre-training is realized by a new pretext task, namely, the masked language model (MLM), i.e., masking some of the words in the input and recovering them through the network. BERT also further increases the data scale by including both the book corpus and the Wikipedia dataset. The simplicity of the model, its significant performance improvements, and the easy-to-use toolkits made BERT extremely popular.
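To make the MLM pretext task concrete, the sketch below masks a random subset of tokens and records the original words that the network would be trained to recover. It is a simplified illustration under stated assumptions: the helper name `mask_tokens`, the mask ratio, and the toy sentence are hypothetical, and refinements used in practice (such as occasionally keeping or replacing a selected token instead of masking it) are left out.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_ratio=0.15, seed=0):
    """Mask a random subset of tokens.

    Returns the corrupted input and a dict mapping each masked position
    to the original token, which serves as the prediction target.
    The mask ratio is a free parameter of this sketch.
    """
    rng = random.Random(seed)
    num_masked = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), num_masked))
    inputs = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        inputs[pos] = MASK
    return inputs, targets


# Toy sentence (hypothetical data, for illustration only).
tokens = "the cat sat on the mat and looked at me".split()
inputs, targets = mask_tokens(tokens, mask_ratio=0.3, seed=1)
```

No labels are needed: the targets come from the text itself, which is what makes the task self-supervised.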
On top of BERT, there are several variants including RoBERTa [19], XLNet [20], ALBERT [21], ELECTRA [22], DeBERTa [23], T5 [24], etc. Most of these models increase the data size, increase the model size, or design a better objective function. For example, compared with BERT, RoBERTa is trained on larger datasets with longer training time; ALBERT is a much wider network (although its parameter count is smaller because parameters are shared across layers); ELECTRA replaces the MLM task with the task of predicting whether each token is generated or not, so that the loss is computed at all token positions; DeBERTa combines the LM and MLM objectives; and T5 converts all text-based language problems into a text-to-text format and trains a much larger model (up to 11 billion parameters).
Although the above methods gain significant performance improvements over BERT, they still follow the pre-training and fine-tuning scheme, in which labeled data is needed. In contrast, GPT-3 [26], proposed by Brown et al., shows that when the pre-trained model is large enough, fine-tuning is no longer necessary: instructions containing basic information about the task suffice to generate appealing text. This progress brought by large networks took SSL in language to a brand new level.

Epilogue in NLP
Looking back, improved network architectures like the Transformer [82] and bigger datasets have fueled a revolution in SSL. The Transformer [82], although initially invented for machine translation, a typical supervised task, enabled us to create much larger and deeper networks in SSL and is currently the main architecture for most SSL tasks. Because SSL with the Transformer can continuously benefit from larger architectures and larger quantities of data, one of the biggest trends for SSL in NLP has been the ever-increasing model size [83].
Also, the development of SSL in NLP has a great influence on other research fields. For example, the Transformer architecture is also preferable as a feature extractor in the CV field. In the following Sections 3-6, we will briefly introduce the important dots of SSL in other research fields.

Self-Supervised Learning in Computer Vision
Learning discriminative representations of visual data (e.g., image embeddings or video embeddings) in a self-supervised fashion has been considered an important problem in the CV community. In CV applications, the input data is an image or a sequence of video frames, composed of well-structured discrete RGB values. However, labeling a large number of images is expensive, and the cost increases dramatically for pixel-level prediction tasks, e.g., semantic segmentation. For example, annotating pixel-level labels for a single 2048 × 1024 image takes more than 1.5 hours [86]. To avoid this reliance on human effort, SSL becomes a useful tool for building a pre-trained feature extractor.
Looking back at the progress of SSL for visual data, its development follows the same path from local modeling to global modeling (see the progress diagram in Fig.2). However, different from NLP, the local modeling in CV only happens in the data part. For the network part, pre-training is performed on the whole network from the beginning. We conjecture that this is because of the efficient computation of CNNs and the high accuracy bar set by supervised pre-training on ImageNet [87].

Traditional Feature Engineering
The traditional feature engineering for visual data creates image descriptors. The scale-invariant feature transform (SIFT) [27] was proposed by Lowe in 2004. SIFT is invariant to image transformations (e.g., scaling or rotation); hence, it can perform reliable matching between different views of an object. The histogram of oriented gradients (HOG) improves on SIFT by counting occurrences of gradient orientations in localized portions of images [28]. Beyond the SIFT descriptor, a fast algorithm called SURF was later invented [29]. Its feature descriptor is based on the sum of the Haar wavelet responses around the point of interest. It leverages the multi-resolution pyramid technique to realize the blurring effect and to guarantee the scale-invariant property of the interest point.

Local Modeling on Patches
Early work on learning highly discriminative visual representations leveraged the local cues within images. Exemplar CNN samples a set of 32 × 32 patches from the same image and applies various data transformations to each patch [30]. These patches from one image are grouped into one category, such that a network can be trained to discriminate between a set of categories. Because a feature extractor also needs to understand visual concepts, the Counting method in [31] defines a counting rule as the pretext task, which trains the network to recognize visual primitives, e.g., noses and eyes, by correctly predicting the counting relationship.
To make the network understand both scenes and objects, another famous work utilized the spatial positions of patches within one image as labels and modeled SSL as the task of predicting the spatial relationship between patches [8]. To do so, a pair of patches is sampled per image and fed to a network, which is required to predict the relative position of the two patches and thereby learns more inherent visual information. Following the intuition that doing a complex task well requires more knowledge, a jigsaw puzzle game, in which the network is trained to place shuffled patches back in their original positions, was further proposed as a pretext task in SSL [32]. In [32], all patches are shuffled and independently fed into an encoder, such that the encoder can jointly learn the feature embeddings of patches and the associated spatial arrangement. More challenging settings were also investigated in [88,89].
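The relative-position pretext task can be sketched in a few lines: a center patch and one of its neighbors are cropped from the same image, and the index of the neighbor's direction becomes the classification label. The layout below (eight neighbors on a 3 × 3 grid, images stored as nested lists, and the helper names) is an assumption made for illustration; it also omits any safeguards against trivial shortcuts that a real implementation would need.

```python
import random

# (row, column) offsets of the eight neighbors around a center patch.
# The index into this list is the classification label (assumed layout).
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                    (0, -1),           (0, 1),
                    (1, -1),  (1, 0),  (1, 1)]


def crop(image, top, left, size):
    """Cut a size x size patch out of an image stored as a list of rows."""
    return [row[left:left + size] for row in image[top:top + size]]


def sample_patch_pair(image, patch_size, rng):
    """Sample (center_patch, neighbor_patch, label) from one image."""
    height, width = len(image), len(image[0])
    # Leave room for a full neighbor patch on every side of the center.
    top = rng.randrange(patch_size, height - 2 * patch_size + 1)
    left = rng.randrange(patch_size, width - 2 * patch_size + 1)
    label = rng.randrange(len(NEIGHBOR_OFFSETS))
    d_row, d_col = NEIGHBOR_OFFSETS[label]
    center = crop(image, top, left, patch_size)
    neighbor = crop(image, top + d_row * patch_size,
                    left + d_col * patch_size, patch_size)
    return center, neighbor, label


# Toy 12 x 12 "image" whose pixel value encodes its own position.
image = [[r * 12 + c for c in range(12)] for r in range(12)]
center, neighbor, label = sample_patch_pair(image, 4, random.Random(0))
```

A network would receive the two patches and be trained to output `label`; the label is free because it is derived from where the patches were cropped.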
The initial development using local features has shown positive results for SSL. Increasing the receptive field of images becomes a possible way for further study.
In [33], unsupervised tracking is performed as a pretext task. That is, the visual tracker provides a query patch and a positive patch from the same video and samples a negative patch from other videos, such that the patch features can be optimized by the triplet loss [84]. In contrast to tracking moving objects, estimating the motion of the camera in videos (a.k.a. ego-motion) was later considered in [34], where the objective is to synthesize a target view using depth and pose features. These video-based methods can learn robust features, but the features are difficult to apply to image tasks because of the domain gap between videos and images.
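The triplet objective mentioned above can be illustrated with a generic margin-based form: the query should be closer to the positive patch than to the negative patch by at least a margin. This is only a sketch; the squared Euclidean distance, the margin value, and the feature vectors are assumptions chosen for illustration and need not match the exact formulation used in [33].

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(query, positive, negative, margin=1.0):
    """Generic margin-based triplet loss (distance and margin are assumed).

    The loss is zero once the negative is farther from the query than
    the positive by at least `margin`.
    """
    return max(0.0,
               squared_distance(query, positive)
               - squared_distance(query, negative)
               + margin)


# Hypothetical 2-D features: the first triplet is already well separated,
# the second one violates the margin and is penalized.
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 0.0])
violated = triplet_loss([0.0, 0.0], [2.0, 0.0], [1.0, 0.0])
```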
SSL for visual data in its early stage was achieved by defining complex pretext tasks, and most of these pretext tasks used the local features of an image or video. Even though this line of work achieved considerable improvement, it has difficulty encoding holistic representations. However, objects within an image or video are well-structured, and such structural information indeed affects the representation power. This issue can be addressed by discriminative modeling or generative modeling, which learns the global representation of images.

Discriminative Global Modeling from Augmented Data
In discriminative modeling, the basic idea relies on Noise Contrastive Estimation (NCE) [85]. In NCE, a positive pair only contrasts with one negative pair, which is similar to the triplet loss [84]. A more general formulation is called InfoNCE [56], where a positive pair contrasts with many negative pairs. In the NCE or InfoNCE framework, two main groups of approaches are studied to realize SSL, i.e., mutual information estimation and the contrastive learning scheme. In the following, we briefly introduce these two types of methods.
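As an illustration of how one positive pair contrasts with many negative pairs, the sketch below computes an InfoNCE-style loss as a softmax cross-entropy in which the positive pair plays the role of the correct class. The dot-product similarity, the temperature value, and the function names are assumptions made for this example rather than details fixed by [56].

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one query (similarity and temperature assumed).

    The positive pair is treated as the correct class among
    1 + len(negatives) candidates.
    """
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over all candidates.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_sum - logits[0]


# Hypothetical unit-length features.
query = [1.0, 0.0]
easy = info_nce(query, [1.0, 0.0], [[-1.0, 0.0], [0.0, 1.0]])
hard = info_nce(query, [1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]])
```

When the negatives are indistinguishable from the positive, as in the `hard` case, the loss equals the logarithm of the number of candidates; it shrinks towards zero as the negatives become less similar to the query.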

Mutual Information Estimation
Methods using mutual information (MI) achieve SSL by jointly estimating and maximizing MI, and MI can also be represented by the NCE value. Intuitively, maximizing the MI of two variables aligns the associated distributions. In the CV field, the variables can be modeled as different views of images. The seminal work, Deep InfoMax (DIM), models the variables as global context features and local region features [35]. That is, maximizing the MI between global features and local features forces the network to encode information that is consistent between the global and local features of images.
Better ways to model the variables for MI were explored in the following work [36,37]. Augmented Multiscale DIM (AM-DIM) applies various augmentation techniques to the context and region features of the same image, thereby forcing the deep network to learn a high-level image representation that is robust against diverse data transformations [36]. In [37], contrastive multiview coding (CMC) calculates the MI value between global features, where the features are encoded from different views of the same image. This setting enables networks to learn the view-invariant factors of images. Even though CMC also optimizes MI as the objective, it differs fundamentally from DIM and AM-DIM in that CMC considers the global-to-global MI, while DIM and AM-DIM optimize the global-to-local MI.

Contrastive Learning Scheme
Learning with a contrastive scheme is also a natural idea in supervised representation learning [84,90,91] and has been studied extensively in recent years for SSL. In [38], He et al. developed MoCo, which applies two encoders to the same image, leading to a positive pair. MoCo also proposes a momentum contrastive scheme, which significantly enlarges the number of negative pairs. Despite its effectiveness, creating positive pairs without data augmentation makes it easy for the encoder to identify positive pairs. This issue is addressed by another seminal work, SimCLR, proposed by Chen et al. [39]. SimCLR establishes a general framework for SSL using a contrastive scheme [39]. Similar to CMC [37], SimCLR adopts 10 data augmentation techniques, and each positive pair is constructed by applying two random augmentations to the same image. More importantly, the authors also conducted empirical experiments to study the correct usage of the contrastive loss. Specifically, they observed that a large batch size, non-linear projection heads, deeper networks, and more training steps are essential for good practice with the contrastive loss. MoCo v2 confirms the effectiveness of these training techniques by integrating them into the MoCo framework [40].
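The bookkeeping behind this kind of batch-wise contrastive scheme can be sketched as follows: each image in a batch yields two augmented views, the two views of the same image form the positive pair, and the views of all other images in the batch act as negatives. The indexing convention (first views stored before second views) and the function name are assumptions made for this illustration; the augmentations and the encoder are omitted.

```python
def contrastive_pairs(batch_size):
    """Map each view index to (positive_index, negative_indices).

    Assumed layout: indices 0..N-1 hold the first augmented view of each
    image and indices N..2N-1 hold the second view of the same images.
    """
    total = 2 * batch_size
    pairs = {}
    for i in range(total):
        positive = (i + batch_size) % total  # the other view of the same image
        negatives = [j for j in range(total) if j not in (i, positive)]
        pairs[i] = (positive, negatives)
    return pairs


pairs = contrastive_pairs(batch_size=4)
```

With N images per batch, every view has one positive and 2(N - 1) negatives, which is one way to see why a larger batch size supplies more negative pairs.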
In contrast to the work in [39,40], which adopts more negative pairs in the InfoNCE loss, both BYOL and SimSiam avoid collapsed solutions during optimization even without using negative pairs [41,92]. It is observed in BYOL [92] that using a static key encoder (referred to as the target encoder) can avoid collapse because the static network is not trained. Based on this observation, BYOL trains a query encoder (referred to as the online encoder) as in common practice and iteratively updates the key encoder with a moving average of the query network. The same idea also occurs in SimSiam [41], where the two encoders are identical and a projection head is added to one of the encoders, creating two views of features.
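The moving-average update of the target encoder can be written in one line. The sketch below treats the parameters as flat lists of floats and uses an assumed decay value; both are simplifications for illustration, not settings taken from [92].

```python
def moving_average_update(target_params, online_params, decay=0.99):
    """Move each target parameter slightly towards its online counterpart.

    No gradient flows into the target encoder; it only changes through
    this update. The decay value is an assumption of this sketch.
    """
    return [decay * t + (1.0 - decay) * o
            for t, o in zip(target_params, online_params)]


# Hypothetical parameters: the target slowly tracks the online encoder.
target = [1.0, 0.0]
online = [0.0, 1.0]
updated = moving_average_update(target, online, decay=0.9)

tracked = target
for _ in range(200):
    tracked = moving_average_update(tracked, online, decay=0.9)
```

A decay close to 1 keeps the target nearly static between steps, while repeated updates still let it follow the online encoder over time.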
The success of the Transformer architecture (e.g., BERT) in the NLP field suggests using the Transformer as an alternative building block of the backbone network, which is verified in the Vision Transformer (ViT) [93]. DINO further bridges the gap between ViT and SSL, i.e., it trains a ViT in a self-supervised manner, and reveals that the Transformer architecture can learn class-specific semantic information [42]. DINO follows the form of self-distillation, which contains a teacher network and a student network, and optimizes a cross-entropy loss calculated between the features from the student and the centered features from the teacher.
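A rough sketch of such a self-distillation objective is given below: the teacher output is centered and turned into a sharp probability distribution, and the student is trained to match it under a cross-entropy loss. The temperatures, the plain-list representation, and the function names are assumptions for illustration; how the center is maintained during training is not shown.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [v / temperature for v in logits]
    top = max(scaled)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scaled))
    return [s - log_norm for s in scaled]


def distillation_loss(student_logits, teacher_logits, center,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between centered teacher and student outputs.

    Temperature values are assumptions of this sketch. The teacher is
    treated as a fixed target, so only the student would be updated.
    """
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = [math.exp(lp) for lp in log_softmax(centered, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))


# Hypothetical 3-dimensional outputs with a zero center.
center = [0.0, 0.0, 0.0]
agree = distillation_loss([2.0, 0.0, 0.0], [2.0, 0.0, 0.0], center)
disagree = distillation_loss([0.0, 2.0, 0.0], [2.0, 0.0, 0.0], center)
```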
Discriminative modeling has indeed made significant progress as a pre-training technique, and its recognition accuracy on ImageNet is very close to that of supervised learning. However, because all of these approaches are built upon the concept of distinguishing the augmented data from all other data, the pretext task is generally less difficult than generative tasks. Further improvement comes from adopting the generative ideas from NLP.

Generative Modeling Through Recovering Missing Image Patches
Inspired by the significant progress of generative modeling in NLP, one can also consider adopting such models (e.g., GPT or MLM) as candidates to learn image representations.
In [5], the image is downsampled and flattened into a 1D sequence, which is then fed to a generative model, i.e., GPT, to realize the pixel generation objective. Although iGPT only generates low-resolution images, it shows its potential by achieving SOTA performance compared with its competitors in low-resolution representation learning. Recovering masked image regions, which mimics the pipeline of MLM, is also studied in BEiT [44]. Training BEiT consists of two steps. First, an auto-encoder is applied to tokenize the patch features. Then masked image modeling (MIM) is used as the pre-training task, which trains the network to predict the masked visual tokens. A simpler yet effective method, MAE, proposed by He et al., further simplifies the training paradigm by training an asymmetric auto-encoder to reconstruct the masked patches [6]. Appealingly, the trained auto-encoder can recover images with only 25% of the patches visible. Recent work, termed MaskFeat, shows that predicting in a feature space (i.e., HOG features) works much better than predicting in pixel space [45].
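The masking step of such a masked auto-encoder can be sketched as a random split of patch indices into a visible set, which the encoder sees, and a masked set, which the decoder must reconstruct. The function name and the patch count (196, as would result from a 14 × 14 grid) are assumptions chosen for illustration; the 25% visible ratio follows the text above.

```python
import random


def random_masking(num_patches, visible_ratio=0.25, seed=0):
    """Randomly split patch indices into visible and masked sets."""
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_visible = int(num_patches * visible_ratio)
    visible = sorted(order[:num_visible])  # fed to the encoder
    masked = sorted(order[num_visible:])   # reconstruction targets
    return visible, masked


# Hypothetical 14 x 14 grid of patches.
visible, masked = random_masking(196)
```

Because the encoder only processes the visible quarter of the patches in this setup, most of the input never passes through it, which is one reason an asymmetric design can be efficient.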

Epilogue in CV
With the large-scale application of the Transformer in the field of CV, the developments of SSL in NLP and CV are converging. Although the development of SSL in CV bears the imprint of NLP, CV has also begun to feed back into the development of NLP. For example, SimCSE [94] uses dropout as minimal data augmentation for sentence embedding and applies contrastive learning on top of pre-trained models like BERT [4] or RoBERTa [19]. The resulting pre-trained model significantly outperforms the original models.

Self-Supervised Learning in Graph Learning
Graph data is represented by a set of nodes, with linked nodes being related. Unlike other data formats, a graph can model many kinds of structured data, e.g., social networks, molecules, and knowledge graphs. Addressing problems with graph data is not easy, and the emergence of graph neural networks (GNNs) has made the solutions more flexible and easier. That is, once the input data is modeled as a graph, GNNs provide a powerful framework for the tasks at hand, e.g., node prediction, edge prediction, or graph prediction. The recent trend also shows promising results from employing SSL on GNNs for pre-training. Driven by its various applications, SSL on graphs has progressed from local modeling for node-level tasks to global modeling for graph-level tasks, as shown in Fig.3.

Traditional Feature Engineering
An early solution for learning graph embeddings uses random walks to traverse the graph and aggregates the connected node representations. This is known as DeepWalk [46], and it learns the node representations by leveraging the skip-gram model from Word2Vec [10].
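The walk-generation step can be sketched as follows: starting from a node, the walk repeatedly jumps to a uniformly chosen neighbor, and the resulting node sequence is then treated like a sentence for skip-gram training. The adjacency-list format, the walk length, and the toy graph are assumptions made for this illustration; the skip-gram training itself is omitted.

```python
import random


def random_walk(adjacency, start, length, rng):
    """Generate one uniform random walk of at most `length` nodes."""
    walk = [start]
    while len(walk) < length:
        neighbors = adjacency[walk[-1]]
        if not neighbors:  # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk


# Hypothetical undirected graph stored as adjacency lists.
adjacency = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}
rng = random.Random(0)
walks = [random_walk(adjacency, node, 5, rng) for node in adjacency]
```

Each walk plays the role of a sentence and each node the role of a word, so nodes that co-occur within a window along the walks end up with similar embeddings.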

Local Modeling as a Way of Embedding Nodes
In dealing with SSL for graph data, local modeling is a straightforward first choice, as in the NLP and CV fields. Similar to Word2Vec in NLP, node2vec was proposed for graph data [47]. Using the network neighborhoods of nodes as the supervision signal, node2vec establishes node representations that preserve the connection relationships between the graph space and the embedding space. As in the CV field, graph learning also includes discriminative and generative modeling.

Discriminative Modeling Maximizing the Similarity of Different Views
The discriminative modeling of SSL for graph-structured data also closely follows the progress on visual data, where the main categories are MI estimation and the contrastive learning scheme.
The graph counterpart of DIM, termed Deep Graph Infomax (DGI), was developed in [48]. In DGI, a graph convolutional network (GCN) is trained to learn node representations with the InfoNCE objective, thereby maximizing the MI between the local patch representations and the global graph representation. In practice, the local patch representation is the high-level node feature, aggregated from the node and its neighborhood, and the global graph representation is summarized by a readout function over node features. A similar idea was extended to InfoGraph [49], where a graph isomorphism network (GIN) is trained to encode the graph representations.
Further improvement derives from the success of the contrastive scheme [50,51]. A simple attempt is made in GRACE [50], where the representation of each node is optimized by maximizing the agreement between two graph views, which are constructed by removing edges to neighbors and masking node features. In [51], a method named CMVR investigates a new way to create different views of each graph. Given a raw graph, another view is created by graph diffusion. The original graph and the augmented graph are fed to two separate GCN encoders, respectively, to obtain both node features and graph features, which are then optimized by contrasting the node representations from one view with the graph representations of the other view. In contrast to establishing various views of graphs, GROVER defines a pretext task that predicts the contextual properties of nodes and edges and adopts the Transformer to jointly learn representations for graphs [52].
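The two view-construction operations described for GRACE, removing edges and masking node features, can be sketched as below. The drop and mask probabilities, the edge-list and nested-list data formats, and the choice to mask whole feature dimensions for all nodes are assumptions of this illustration, not confirmed details of [50].

```python
import random


def make_view(edges, features, edge_drop=0.2, feature_mask=0.2, seed=0):
    """Create one augmented graph view.

    edges:    list of (source, target) tuples
    features: list of per-node feature lists, all of the same length
    Each edge is dropped independently, and each feature dimension is
    zeroed out for every node with the given probabilities (assumed).
    """
    rng = random.Random(seed)
    kept_edges = [edge for edge in edges if rng.random() >= edge_drop]
    dim = len(features[0])
    masked_dims = {d for d in range(dim) if rng.random() < feature_mask}
    masked_features = [[0.0 if d in masked_dims else value
                        for d, value in enumerate(row)]
                       for row in features]
    return kept_edges, masked_features


# Hypothetical 3-node graph with 4-dimensional node features.
edges = [(0, 1), (1, 2), (0, 2)]
features = [[1.0, 2.0, 3.0, 4.0],
            [5.0, 6.0, 7.0, 8.0],
            [9.0, 1.0, 2.0, 3.0]]
view_a = make_view(edges, features, seed=1)
view_b = make_view(edges, features, seed=2)
```

Calling the function twice with different seeds yields two views of the same graph; the encoder is then trained so that the two embeddings of each node agree.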

Generative Modeling via Generating Graph Components
Generative modeling on graph data relies on two pipelines, i.e., generative-adversarial and autoregressive [53,54]. Under the framework of GANs, GraphGAN is composed of two networks, i.e., a generator and a discriminator [53]. For each node, the generator aims to learn the underlying connectivity of all nodes and generates fake samples accordingly. The discriminator is then trained to tell the connectivity of true pairs from that of fake pairs.
Generative pre-training is realized in GPT-GNN [54]. To this end, self-supervised attributed graph generation is defined as the pre-training pretext task. Through the generative process for both node attributes and edges, the network can capture the inherent dependency of the underlying graphs, thereby producing powerful representations.

Epilogue in Graph Learning
As suggested by the above SSL methods, research in both NLP and CV is the source of ideas for SSL in graph learning, even though the format of graph data is significantly different from that of text and images. This indeed shows the importance of interdisciplinary research. We believe that generative modeling over the Transformer architecture [82] would be an important direction to explore.

Self-Supervised Learning in Audio Processing
Audio data is a time sequence that is continuous in both time and amplitude. To facilitate analysis, the audio signal is normally split into clips with durations varying from hundreds of milliseconds to several seconds, depending on the task at hand. According to the frequency spread, the audio signal is sampled in time at a rate of, e.g., 16 kHz. Assuming that the signal is stationary (with invariant frequency components) within one frame, each sampled audio clip is further split into frames with a constant frame length, e.g., 10 milliseconds. The raw audio samples can be directly fed to a neural network as input, or alternatively, a feature vector can be extracted for each frame in the frequency domain, e.g., the log-Mel (log-magnitude in Mel-frequency) feature. With this feature representation, an audio clip is represented as a matrix with axes of frequencies and time frames, which is called a spectrogram. Audio units, such as speech phones, sound events, and music notes, have varying lengths and normally occupy multiple frames. Applications on audio data include clip-level tasks and frame-level tasks. SSL pre-training for audio has developed rapidly since 2019, and many ideas are inspired by the NLP and CV fields. Audio frames have strong temporal dependencies, including short-term dependencies due to the signal smoothness within audio units, and long-term dependencies between audio units that reflect the semantic information. Therefore, SSL for audio mainly focuses on discriminative modeling and generative modeling of the contextual/global embedding. (Refer to Fig.4 for its progress diagram.)
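The framing arithmetic above can be checked with a short sketch: at a 16 kHz sampling rate, a 10-millisecond frame contains 160 samples, so a one-second clip yields 100 frames. The function name and the use of non-overlapping frames are simplifying assumptions; practical front ends often use overlapping windows instead.

```python
def split_into_frames(samples, sample_rate=16000, frame_ms=10):
    """Split a clip into non-overlapping frames of constant length.

    Trailing samples that do not fill a whole frame are discarded
    (an assumption of this sketch).
    """
    frame_length = sample_rate * frame_ms // 1000
    num_frames = len(samples) // frame_length
    return [samples[i * frame_length:(i + 1) * frame_length]
            for i in range(num_frames)]


# One second of silence as a stand-in for real audio samples.
clip = [0.0] * 16000
frames = split_into_frames(clip)
```

Each frame would then either be passed on as raw samples or be converted to a per-frame feature vector such as the log-Mel feature.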

Traditional Feature Engineering
In traditional feature engineering, audio can be represented by Mel-spectrograms [55], which are calculated from the log-magnitude spectrum. Owing to the properties of spectral features, this representation preserves both the frequency resolution and the amplitude of a signal.

Discriminative Modeling via Contrastive Scheme
Discriminative modeling of audio data minimizes a pretext classification loss. In [56], targeting the task of future frame prediction, CPC aims to correctly classify the positive frames (the future k frames) against a set of negative frames (other frames in the audio). Pre-training for a downstream task, i.e., speech recognition, is realized by wav2vec [57], where the MI between the speech context embedding and the future frame embeddings is maximized. Extending the idea of SimCLR [39], some methods, e.g., COLA [58], CLAR [59], and CLMR [60], propose to create positive samples in the contrastive objective for clip-level feature learning. Similar to SimCLR, CLAR applies various data augmentations to the same audio clip, leading to a positive pair [59]. The follow-up work, CLMR [60], uses the same strategy for music data. In COLA, by contrast, the positive pair is defined as two segments of the same audio recording [58]. Note that in both cases, for an anchor sample, all other audio clips in a mini-batch are selected as negative samples. Considering that negative samples in audio data may be similar to the anchor sample in some scenarios, BYOL-A [43], the audio version of BYOL [92], removes the negative pairs from contrastive learning.
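To make the contrastive classification objective concrete, the following is a minimal sketch of an InfoNCE-style loss with in-batch negatives. It is not the exact loss of any cited method: the cosine similarity, the temperature value, and the toy two-dimensional embeddings are assumptions chosen for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a mini-batch.

    positives[i] is the positive for anchors[i]; every other row of
    `positives` in the batch acts as a negative for that anchor.
    """
    anchors = [normalize(a) for a in anchors]
    positives = [normalize(p) for p in positives]
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        m = max(logits)  # subtract the maximum for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]  # cross-entropy with label i
    return total / len(anchors)


# Toy embeddings (assumption): correctly paired vs. deliberately swapped.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 3.0], [2.0, 0.0]])
```

The loss is small when each anchor is most similar to its own positive (`aligned`) and large when the pairing is swapped (`swapped`), which is the behavior a pretext classification loss should have.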
Research efforts were also made to leverage the powerful Transformer architecture as a feature extractor for audio data. However, this raises the issue that, unlike words in a text, which are discrete tokens, audio frames are real-valued vectors. In vq-wav2vec [61] and wav2vec 2.0 [62], the real-valued hidden units of audio frames are clustered via either Gumbel-Softmax or online k-means algorithms, so as to assign a discrete token to each audio frame. With these discrete tokens, the BERT [4] model can be used directly for SSL of audio data. The follow-up work, HuBERT, applies an offline clustering method to produce the discrete tokens [7].
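The tokenization step can be illustrated with the assignment half of k-means: each real-valued frame is mapped to the index of its nearest codebook vector. The sketch below assumes a fixed, hand-picked codebook and toy frames; learning the codebook (online or offline) and the Gumbel-Softmax alternative are not shown.

```python
def nearest_codeword(frame, codebook):
    """Index of the codebook vector closest to `frame` (squared Euclidean)."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))


def tokenize(frames, codebook):
    """Map a sequence of real-valued frames to a sequence of discrete tokens."""
    return [nearest_codeword(f, codebook) for f in frames]


# Hypothetical codebook with three entries and four toy frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.9], [1.1, 0.8]]
tokens = tokenize(frames, codebook)  # one integer id per frame
```

Once every frame carries an integer id, the sequence can be treated like a sentence of word tokens and fed to a masked-prediction model.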

Generative Modeling via Audio Reconstruction
The development of SSL in the NLP and CV fields also feeds many ideas into generative modeling for audio data. Early studies on audio SSL adopt the classic denoising autoencoder [95,96], embedding the input into a bottleneck hidden representation and then reconstructing the input from that representation. APC [63] and VQ-APC [64] follow the line of autoregressive learning used for LM. Unlike CPC [56] and wav2vec [57], which use a contrastive classification loss, APC and VQ-APC directly predict the input features of future frames and use the L1 loss between the true feature and the predicted one. Mockingjay mimics BERT to predict the masked input feature of a frame conditioned on both past and future frames, and also uses the L1 loss between the true feature and the predicted one [11]. TERA extends Mockingjay by not only masking frames but also masking frequencies and contaminating the spectrogram with noise [65].
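The two reconstruction targets above differ only in what the model sees and what it must output. The sketch below shows the data preparation and the L1 loss; the helper names are hypothetical, zero-masking is a simplification of the actual masking policies, and the "copy the current frame" predictor merely stands in for a trained network.

```python
def l1_loss(predicted, target):
    """Mean absolute error between two equally shaped sequences of frames."""
    diffs = [abs(p - t)
             for p_frame, t_frame in zip(predicted, target)
             for p, t in zip(p_frame, t_frame)]
    return sum(diffs) / len(diffs)


def future_targets(frames, shift):
    """Autoregressive pairs: the frame at time t should predict frame t + shift."""
    return frames[:-shift], frames[shift:]


def mask_frames(frames, masked_positions):
    """Masked-reconstruction input: selected frames are replaced by zeros."""
    return [[0.0] * len(f) if i in masked_positions else list(f)
            for i, f in enumerate(frames)]


frames = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0]]

# Autoregressive setting: score a trivial predictor that copies its input.
inputs, targets = future_targets(frames, shift=1)
baseline_loss = l1_loss(inputs, targets)

# Masked setting: the model would reconstruct frame 1 from its neighbors.
masked_input = mask_frames(frames, {1})
```

In the autoregressive case only past frames are visible, while in the masked case both past and future frames surround the hidden position; the L1 loss is computed in the same way in both.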

Multi-Task Modeling as Joint Discriminative and Generative Training
Both discriminative modeling and generative modeling boost the representation power of SSL via multi-task training. PASE [66] and its improvement PASE+ [67] jointly train a model for regression and discriminative tasks. To better preserve meaningful information in the latent space, wav2vec-C [9] was developed to reconstruct the audio signal from the latent space, in conjunction with the training target of contrastive loss in wav2vec 2.0. Splitting the audio spectrogram into patches, SSAST learns audio representations supervised by contrastive loss and generative loss in the BERT model [97] .

Epilogue in Audio Processing
Given the fact that the audio data can be processed either by a sequence of frames or a spectrogram, the ideas from both the NLP and CV fields promote the development of the SSL for audio data, again showing the necessity and potential of interdisciplinary research. Although audio data is continuous by nature, currently the superior performance is still achieved with discriminative learning by constructing a pretext classification task. Generative learning alone, or combined with discriminative learning, has not yet been very successfully developed.

Self-Supervised Learning in Protein Learning
A protein sequence is an ordered chain of amino acids, drawn from 20 common types and several uncommon ones [72]. The evolutionary process selects proteins with suitable functions, thereby biasing the protein distribution, and this distribution results in special dependencies among the amino acids in a protein [98]. This dependency property can be used to define pretext tasks for SSL.
The protein structure results from the complicated physical and chemical interactions among amino acids, such that a protein with a specific function folds into a specific shape in space. In the protein learning community, the multiple sequence alignment (MSA) is a useful tool to identify the dependencies of the protein [99] . The MSA consists of a group of aligned homologous sequences, which includes the co-evolution pattern, and such co-evolution patterns can indicate dependencies. Recently, the language models built on top of protein sequences have also encoded the dependencies of amino acids within a protein sequence. (Refer to Fig.5 for an illustration of concepts in protein data.) Therefore, the protein structure, MSA, as well as the protein sequence, can be used to identify the dependencies in protein, and the mapping function can be learned by SSL. Its progress is suggested in Fig.6.

Traditional Feature Engineering
The initial method, namely Protein Coding Features (PCF), represents a protein through its sequence of amino acid residues [68]. That is, each amino acid is encoded as a one-hot vector in which only the element corresponding to that amino acid is non-zero. The MSA information [99] can also be leveraged to represent protein features. For example, the position-specific scoring matrix (PSSM) for homologous sequences calculates the substitution log-likelihood of the occurrence of each amino acid at each position [69].
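The one-hot encoding can be sketched as follows. The alphabet string lists the 20 standard one-letter amino acid codes in alphabetical order; that ordering and the function name are our own choices, and uncommon residues are deliberately not handled in this sketch.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard one-letter codes


def one_hot_encode(sequence):
    """Encode a protein sequence as a list of 20-dimensional one-hot vectors."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[index[residue]] = 1  # raises KeyError for uncommon residues
        encoded.append(row)
    return encoded


encoded = one_hot_encode("MKV")  # 3 residues -> 3 rows of length 20
```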

Local Modeling and Discriminative Modeling by Using Amino Acids
The protein data is a sequence of amino acids, which is similar to the language data. That said, creating discriminative protein representation can follow the success in the NLP field, such that the methodology includes learning the local amino acids feature and the global protein context feature. Inspired by Word2Vec [10] , its local modeling is studied in ProtVec [70] . ProtVec groups every three contiguous amino acids of the protein as a word and employs the Word2Vec technique to train a context-independent embedding of the protein. Following the development of representation learning in NLP and CV fields, the global modeling of protein data is also studied. In the discriminative modeling, the contrastive scheme benefits the protein data to establish its representation via optimizing the InfoNCE objective in CPCProt [71] . Following the framework of the CPC contrastive model [56] , CPCProt maximizes the MI between embeddings of a protein fragment (i.e., local feature) and its context (i.e., global feature). The only difference here is that CPCProt replaces the image patch with a certain number of amino acids.
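Grouping three contiguous amino acids into a word can be sketched as below. We assume here that the sequence is read as non-overlapping 3-grams from each of the three possible starting offsets, which is one common way to build such words; the function names and the toy sequence are hypothetical, and the Word2Vec training that would consume these words is not shown.

```python
def three_grams(sequence, offset=0):
    """Non-overlapping 3-residue words, starting at the given offset."""
    return [sequence[i:i + 3] for i in range(offset, len(sequence) - 2, 3)]


def all_reading_frames(sequence):
    """One list of words per starting offset (0, 1, and 2)."""
    return [three_grams(sequence, offset) for offset in range(3)]


sentences = all_reading_frames("MKVLAAGI")
```

Each resulting list plays the role of a sentence, and each 3-gram plays the role of a word whose embedding is learned.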

Generative Modeling by Treating Each Amino Acid as a Word
In protein learning, generative modeling dominates the community, probably attributed to the fact that it is difficult to define a positive sample for an anchor protein. In generative modeling, many ideas are derived from the NLP field. Following the format of language data, protein, a sequence of amino acids, can be modeled as a sequence of tokens and the sequence contains the long-range dependencies of protein.
UniRep [73] takes a single amino acid as a word and uses the multiplicative LSTM [100] to train a generative model in an auto-regressive manner [101]. Subsequent work improves the representation power by adopting bi-LSTM models, such as SeqVec [74], UDSMProt [75], PLUS [76], and P-ELMO [102]. The parallel computation enabled by the Transformer is also studied in TAPE [72], and more variants of Transformer-type models are investigated in ESM [77] and ProtTrans [78].
Inspired by the MLM protocol in NLP pre-training, He et al. proposed the Pairwise Masked Language Model (PMLM) and empirically showed that a pre-trained model incorporating PMLM is particularly good at capturing co-evolutionary dependencies [79]. This improvement is mainly attributed to the fact that directly modeling the joint probability of a pair of masked tokens is more fine-grained than taking the product of the probabilities of the individual masked tokens, as in the conventional masked language model.
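The contrast between the two objectives can be written out explicitly. The notation below is our own and is not taken from [79]: x_i and x_j denote two masked amino acids, and x_{\setminus\{i,j\}} denotes the remaining, unmasked positions.

```latex
% Conventional MLM: the two masked positions are predicted independently,
% so the implied joint distribution factorizes.
p_{\mathrm{MLM}}(x_i, x_j \mid x_{\setminus\{i,j\}})
  = p(x_i \mid x_{\setminus\{i,j\}}) \, p(x_j \mid x_{\setminus\{i,j\}})

% Pairwise MLM: the pair is predicted jointly, so in general
p_{\mathrm{PMLM}}(x_i, x_j \mid x_{\setminus\{i,j\}})
  \neq p(x_i \mid x_{\setminus\{i,j\}}) \, p(x_j \mid x_{\setminus\{i,j\}})
```

Under this reading, and counting only the 20 common amino acids, the pairwise head distributes probability over 20 x 20 = 400 joint outcomes rather than over two separate sets of 20, which is what allows it to express that some residue pairs co-occur more often than their individual frequencies would suggest.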
Considering the structure information in the model, a de-noising task is defined in HJRSS [80] as a pretext task for pre-training, where a network is trained to recover both the token and the structure from the masked tokens and the disturbed structure.
To capture more dependencies among amino acids, one can also resort to MSA information. To this end, an MSA Transformer is trained under the MLM protocol [81]. In the MSA Transformer, dependencies among all amino acids across the sequences of an MSA are captured by axial attention.
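Attention along the two axes of an MSA can be sketched as attention within each aligned sequence (rows) followed by attention across sequences at each aligned position (columns). The sketch below is a strong simplification and not the architecture of [81]: it uses unparameterized dot-product attention with no learned projections, heads, or tied weights, and the toy embeddings are assumptions.

```python
import math


def self_attention(vectors):
    """Unparameterized softmax self-attention over a list of vectors."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        m = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        outputs.append([sum(w * v[d] for w, v in zip(weights, vectors)) / z
                        for d in range(dim)])
    return outputs


def row_attention(msa):
    """Attend along each aligned sequence (over positions)."""
    return [self_attention(row) for row in msa]


def column_attention(msa):
    """Attend along each aligned position (over sequences)."""
    n_rows, n_cols = len(msa), len(msa[0])
    columns = [self_attention([msa[r][c] for r in range(n_rows)])
               for c in range(n_cols)]
    return [[columns[c][r] for c in range(n_cols)] for r in range(n_rows)]


def axial_attention(msa):
    return column_attention(row_attention(msa))


# Toy MSA embedding (assumption): 2 sequences x 3 positions x 2 features.
msa = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
       [[0.5, 0.5], [0.0, 1.0], [1.0, 0.0]]]
mixed = axial_attention(msa)
```

After one row pass and one column pass, every position has received information from every other position in the alignment, without ever computing attention over all rows and columns jointly.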

Epilogue in Protein Learning
As we can see, SSL methods for protein modeling largely follow the development of SSL in NLP. This is easy to understand as proteins are the language of nature and they are also one-dimensional sequential data. However, different from human language, proteins have structures and MSA. How to leverage this additional information would be much more interesting than simply applying SSL methods in language modeling to protein modeling. We believe that MSA transformer [81] is just a beginning, and we are looking forward to more exciting breakthroughs.

Related Work
SSL has been the choice of learning representations for various formats of data in the learning community, and their research progress has been extensively summarized in a number of survey papers [1,3,12,103] . In this section, we will briefly introduce the existing survey work for inter-disciplines, i.e., NLP, CV, graph learning, protein learning and audio processing.
The work in [1] examines the methodologies of SSL and groups them into three categories: generative SSL, contrastive SSL, and generative-contrastive SSL. Following this categorization, SSL in the applications of CV, NLP, and graph learning is considered in that survey. Employing SSL as a means of model pre-training, a recent work [12] established a hands-on guide for understanding, using, and developing pre-trained models (PTM) on various NLP tasks. Another important component of PTM, network models, is reviewed in [3], where the training objectives, model architectures, over-parameterization issue, etc., are thoroughly introduced for BERT-like architectures. A unified framework using contrastive learning as the objective for representations is surveyed in [104]. A new promising paradigm, dubbed prompt learning, is systematically studied in [105]. The success of the Transformer in NLP has also inspired researchers in CV to develop a better visual feature extractor, i.e., the Vision Transformer (ViT), and its progress is reported in the latest manuscripts [106][107][108].
The progress of SSL on graph-structured data is also studied in many articles [103,109,110,111]. The survey in [109] comprehensively studies the mainstream learning settings in graph neural networks (GNNs), i.e., supervised learning, self-supervised learning, and semi-supervised learning. In [110], Xie et al. summarized SSL in GNNs and split the methodologies into two groups, namely, the contrastive model and the predictive model. The superiority of SSL in GNNs is justified in [111], which shows that SSL brings better generalization and robustness to GNNs. The training methods w.r.t. different pretext tasks on graph-structured data are also empirically evaluated in [103].
Endowing the capacity of identifying the protein sequences with optimized properties to AI tools also gains increasing interest in the biological field, and the learning methodologies using deep neural networks are also extensively surveyed [112,113] . Targeting the goal to generate protein sequences, the articles [114,115] summarize the methods of generative models.
Remark 1. In contrast to existing work, our survey aims to condense the existing surveys and mainly focuses on the milestone work in SSL, thereby connecting the dots. Specifically, the differences can be summarized as follows.
• Second, existing surveys comprehensively presented the papers, making them lengthy and difficult to digest. In contrast, our survey only selected a handful of milestone papers and important work from each field, making our survey easy to understand and the development path clear.
• Third, instead of merely listing or categorizing the papers in existing surveys, our article also connected the main ideas via inter-disciplines, such that readers can understand how SSL evolves and how different research fields inspire each other.

Discussions and Future Directions
In this section, we would like to discuss the main challenges and potential solutions for SSL.
Network Architecture and Knowledge Transfer. Recent studies have shown that Transformer-type architectures consistently improve SSL in different fields. However, the success of Transformer-type architectures relies on a very large number of model parameters. For example, GPT-3 has up to 175 billion parameters for language understanding [26], and DeepNet has up to 3.8 billion parameters for vision tasks [120]. Deploying such large models on mobile devices is therefore not easy. It is thus necessary to develop efficient architectures, e.g., via neural architecture search (NAS), or algorithms, e.g., knowledge distillation and network pruning, to leverage the knowledge from large models. In addition, to address the issue of out-of-date knowledge in models [121,122], it is useful to develop self-supervised continual learning algorithms that enable a model to learn knowledge in a lifelong manner.
Pre-Training Tasks. Recent advances in SSL have converged on generative modeling, e.g., GPT-3 [26] in NLP, MAE [6] in CV, and GPT-GNN [54] in graph learning, and have achieved considerable success. Nevertheless, the SOTA pre-training strategies require either deeper architectures or large-scale data, resulting in expensive training costs. To mitigate this issue, one can investigate more efficient pre-training tasks, like ELECTRA [22]. Another promising avenue for improving model efficiency is to align models with user intent, e.g., InstructGPT [123], such that the aligned model uses fewer parameters while reaching performance on par with large-scale models.

Conclusions
Self-supervised learning (SSL) is an important step on the road to improving the understanding of AI machines. The research community has made numerous efforts to push the boundary of development, recorded in hundreds of publications. It is not easy for researchers, especially beginners, to follow and understand the progress in their own subjects. In this brief article, we built a path through the important dots in SSL development on various data, i.e., text, image, and graph. This not only shows the progress of SSL in each subject but also clearly explores how development in one subject interacts with another, e.g., the Transformer architecture invented in the NLP field inspired the development of ViT in the CV field, and the contrastive learning pipeline in the CV field can also be extended to the graph and audio learning fields. Beyond the high-level picture of SSL, we also believe that the development of individual subjects can be inspired by other subjects, and that cross-subject research is a useful way to produce impactful work.

Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.