1 Introduction

Neural decoding of action, the accurate prediction of behavior from brain activity, is a fundamental challenge in neuroscience with important applications in the development of robust brain-machine interfaces (Ahmed et al., 2021; Spampinato et al., 2017; Palazzo et al., 2018, 2021). Recent technological advances have enabled simultaneous recordings of neural activity and behavioral data in experimental animals and humans (Dombeck et al., 2007; Seelig et al., 2010; Chen et al., 2018; Pandarinath et al., 2018; Ecker et al., 2010; Topalovic et al., 2020; Urai et al., 2021). Nevertheless, our understanding of the complex relationship between behavior and neural activity remains limited.

Fig. 1

Our Motion Capture and Two-Photon (MC2P) Dataset. A tethered fly (Drosophila melanogaster) is recorded using six multi-view infrared cameras and a two-photon microscope. The resulting dataset includes the following. A 2D poses extracted from different views (only three are shown), calculated on grayscale images. B 3D poses triangulated from the 2D views. C Synchronized, registered, and denoised single-channel fluorescence calcium imaging data acquired using a two-photon microscope. Shown are color-coded activity patterns for populations of descending neurons from the brain, which carry action information (red is active, blue is inactive). D Annotations of eight different behaviors (four shown here) for eight subjects. E Manual neural segmentation was performed to extract neural activity traces for each neuron. We will release MC2P publicly. Example videos of selected actions and multi-modal data are in the Supplementary Material

A major reason is that it is difficult to obtain many recordings from mammals and a few subjects are typically not enough to perform meaningful analyses (Pei et al., 2021). This is less of a problem when studying the fly Drosophila melanogaster, for which long neural and behavioral datasets can be obtained for many individual animals (Fig. 1). Nevertheless, current supervised approaches for performing neural decoding  (Nakagome et al., 2020; Glaser et al., 2020) still do not generalize well across subjects because each nervous system is unique (Fig. 2). This creates a significant domain-gap that necessitates tedious and difficult manual labeling of actions. Furthermore, a different model must be trained for each individual subject, requiring more annotation and overwhelming the resources of most laboratories.

Another problem is that experimental neural imaging data often has unique temporal and spatial properties. The slow decay time of fluorescence signals introduces temporal artifacts. Thus, neural imaging frames include information about an animal’s previous behavioral state. This complicates decoding and requires specific handling that standard machine learning algorithms do not provide.

To address these challenges, we propose to learn neural action representations—embeddings of behavioral states within neural activity patterns—in a self-supervised fashion. To this end, we leverage the recent development of computer vision approaches for automated, markerless 3D pose estimation (Günel et al., 2019; Nath et al., 2019) to provide the required supervisory signals without human intervention. We first show that using contrastive learning to generate latent vectors by maximizing the mutual information of simultaneously recorded neural and behavioral data modalities is not sufficient to overcome the domain gap between subjects and to generalize to unlabeled subjects at test time (Fig. 3B). To address this problem, we introduce two sets of techniques:

1. To close the domain gap between subjects, we leverage 3D pose information. Specifically, we use pose data to find sequences of similar actions between a source subject and multiple target subjects. Given these sequences, we mix and replace the neural or behavioral data of the source subject with data composed from multiple target subjects. To make this possible, we propose a new Mixup strategy that merges selected samples from multiple target animals, effectively hiding identity information. This allows us to train our decoder to ignore subject identity and close the domain gap.

2. To mitigate the impact on neural images of slowly decaying calcium signals from past actions, we add simulated, randomized versions of this effect to our training neural images in the form of a temporally exponentially decaying random action. This trains our decoder to learn the necessary invariance and to ignore the real decay in neural calcium imaging data. Similarly, to make the neural encoders robust to imaging noise resulting from low spatial resolution, we blend randomly sampled neural sequences into the training sequences to replicate this noise.

The combination of these techniques allowed us to bridge the domain gap across subjects in an unsupervised manner (Fig. 3D), making it possible to perform action recognition on unlabeled subjects better than earlier techniques, including those requiring supervision (Glaser et al., 2020; Batty et al., 2019; Kostas et al., 2021). To test the generalization capacity of neural decoding algorithms, we recorded and use the MC2P dataset, which we will make publicly available (Aymanns et al., 2022). It includes two-photon microscope recordings of multiple spontaneously behaving Drosophila, together with associated behavioral data and action labels.

Fig. 2

Domain gap between nervous systems across subjects. Neural imaging data from four different animals, one in each corner. Images differ in terms of total brightness, the location of observed neurons, the number of visible neurons, and the shape and size of axons

Finally, to demonstrate that our technique generalizes beyond this one dataset, we tested it on two additional ones. The first features neural ECoG recordings and 2D pose data from epileptic patients (Peterson, 2021; Singh et al., 2021); the second is the well-known H36M dataset (Ionescu et al., 2014), in which we treat the multiple camera views as independent domains. In all of the datasets, our ultimate goal is to interpret neural or video data so that one can generate latent representations that are useful for action recognition. Our method markedly improves across-subject action recognition in all datasets.

We hope our work will inspire the use and development of more general self-supervised neural feature extraction algorithms in neuroscience. These approaches promise to accelerate our understanding of how neural dynamics give rise to complex animal behaviors and can enable more robust neural decoding algorithms to be used in brain-machine interfaces.

Fig. 3

t-SNE plots of the neural data. Each color denotes a different fly. The red dots are embeddings of the same action label in two different subjects. A Raw neural data. B SimCLR (Chen et al., 2020) representation. C Domain adaptation using a two-layer MLP discriminator and a Gradient Reversal Layer. D Ours. The identity of the animals is discarded and the semantic structure is preserved better than with previous methods, as the same actions are positioned similarly irrespective of subject identity

2 Related Work

2.1 Neural Action Decoding

The ability to infer behavioral intentions from neural data, or neural decoding of behavior, is essential for the development of effective brain-machine interfaces and for closed-loop experimentation (Wen et al., 2021; Lau et al., 2021). Neural decoders can be used to increase the mobility of patients with disabilities (Collinger et al., 2018; Ganzer et al., 2020), or neuromuscular diseases (Utsumi et al., 2018), and can expand our understanding of how the nervous system works (Sani et al., 2018). However, most neural decoding methods require manual annotations of training data that are both tedious to acquire and error prone (Glaser et al., 2020; Lacourse et al., 2020; Segalin et al., 2021).

Existing self-supervised neural decoding methods (Wang et al., 2018; Kostas et al., 2021; Mohsenvand et al., 2020; Peterson et al., 2021) cannot be applied to new subjects for which no action labels are available. A potential solution would be to use domain adaptation techniques to treat each new subject as a new domain. However, existing domain adaptation studies of neural decoding (Li et al., 2020; Farshchian et al., 2018) have focused on gradual domain shifts associated with slow changes in sensor measurements rather than the challenge of generalizing across individual subjects. In contrast to these methods, our approach is self-supervised and can generalize to unlabeled subjects at test time, without requiring action labels for new individuals.

2.2 Action Recognition

Contrastive learning has been extensively used on human motion sequences to perform action recognition using 3D pose data (Liu et al., 2020; Su et al., 2020; Lin et al., 2020) and video-based action understanding (Pan et al., 2021; Dave et al., 2021). Similarly, supervised and unsupervised action recognition approaches have been used on animal datasets on pose or RGB modalities  (Sun et al., 2021; Eyjolfsdottir, 2017; Eyjolfsdottir et al., 2014, 2017; Bohnslav et al., 2021; Wiltschko et al., 2015). However, a barrier to using these tools in neuroscience is that the statistics of our neural data—the locations and sizes of cells—and behavioral data—body part lengths and limb ranges of motion—can be very different from animal to animal, creating a large domain gap.

In theory, there are multimodal domain adaptation methods for action recognition that could deal with this gap (Munro & Damen, 2020; Chen et al., 2019; Xu et al., 2021). However, they assume supervision in the form of labeled source data. In most laboratory settings, where large amounts of data are collected and resources are limited, this is an impractical solution.

2.3 Representation Learning

Most efforts to derive a low dimensional representation of neural activity have used recurrent models (Nassar et al., 2019; Linderman et al., 2019, 2017), variational autoencoders (Gao et al., 2016; Pandarinath et al., 2018), and dynamical systems (Abbaspourazad et al., 2021; Shenoy & Kao, 2021). Video and pose data have previously been used to segment and cluster temporally related behavioral information (Sun et al., 2021; Segalin et al., 2020; Overman et al., 2021; Pereira et al., 2020; Johnson et al., 2020).

By contrast, relatively few approaches have been developed to extract behavioral representations from neural imaging data (Batty et al., 2019; Sani et al., 2021; Glaser et al., 2020). Most have focused on identifying relatively simple relationships between the two modalities using supervised methods, such as correlation analysis, generalized linear models (Robie et al., 2017; Musall et al., 2019; Stringer et al., 2019), or regression (Batty et al., 2019). We instead jointly model the motion capture and neural modalities to fully extract behavioral information from neural data using a self-supervised learning technique.

2.4 Pose Estimation

In order to utilize the advances in large behavioral recordings, recent efforts have made it possible to perform markerless predictions of 2D poses on animals using mostly deep learning (Pereira et al., 2020; Wu et al., 2020; Bala et al., 2020; Graving et al., 2019; Li et al., 2020). 2D animal poses can be converted into 3D animal poses using multi-view stereo systems or using lifting methods (Karashchuk et al., 2021; Günel et al., 2019; Gosztolai et al., 2021; Pedersen et al., 2020). Similarly, multi-animal tracking and pose estimation can be achieved using deep learning (Koger et al., 2022; Walter & Couzin, 2021). At the same time, realistic animal models have been built for downstream applications such as extracting 3D shape and texture from images, for animals such as mice, zebras, and elephants (Kulkarni et al., 2020; Sanakoyeu et al., 2020; Lobato-Rios et al., 2021; Bolaños et al., 2021).

2.5 Mixup Training

Mixup regularization was first proposed as a way to learn continuous latent spaces and to improve generalization for supervised learning (Berthelot et al., 2019; Verma et al., 2019; Zhang et al., 2018). Several previous studies have used a Mixup strategy to generate new positive pairs in contrastive learning (Shen et al., 2022; Lee et al., 2021). Mixup has rarely been used for domain adaptation. Recent examples include temporal background mixing (Sahoo et al., 2021), prediction smoothing across domains (Mao et al., 2019), and training better discriminators on uni-modal datasets (Sahoo et al., 2020). Our Mixup strategy can be regarded as a multi-modal extension of the previous approaches (Sahoo et al., 2021; Zhang et al., 2018) where per-frame feature-level stochastic Mixup between domains was performed to explore shared space and to hide identity information. Unlike these approaches, we explicitly condition the sampling procedure on the input data. We demonstrate that this approach helps to learn domain-invariant neural features.

In this work, we propose a new action recognition system by learning joint neural-behavioral representations using multi-modal pre-training. We learn these joint representations together with a novel set of augmentation strategies. Our method performs action classification without requiring action labels in the target domain. We show that our method outperforms previous neural action decoding work on three different datasets. We hope our method will accelerate our understanding of how neural dynamics give rise to complex animal behaviors and can enable more robust neural decoding algorithms to be used in brain-machine interfaces.

3 Approach

Our ultimate goal is to interpret neural data so that, given a neural image, one can generate latent representations that are useful for downstream tasks. This is challenging due to the wide domain-gap in neural representations between different subjects (Fig. 2). Hence, we aim to leverage self-supervised learning techniques to derive rich features that, once trained, could be used on downstream tasks including action recognition to predict the behaviors of unlabeled subjects.

Our data is composed of a set of neural images synchronized with behavioral data, where we do not know where each action starts and ends. We leverage contrastive learning to generate latent vectors from both modalities such that their mutual information is maximized and they therefore describe the same underlying action. However, this is insufficient to address the domain gap between subjects (Fig. 3B). To do so, we implement an across-domain mixing strategy: we replace the original pose or neural data of an animal with a mix drawn from other animals in the same dataset that show a high degree of 3D pose similarity at each given instance in time. Unlike behavioral data, neural data has unique properties: neural calcium data contains information about previous actions because it decays slowly across time, and it has limited spatial resolution. To teach our model invariance to these artifacts, we propose two data augmentation techniques: (i) neural calcium augmentation, in which, given a sequence of neural data, we add an exponentially decaying neural snapshot to the sequence, imitating the decaying impact of previous actions, and (ii) neural noise augmentation, in which, to make the model more robust to noise, we blend a sequence of neural data with another randomly sampled neural sequence using a mixing coefficient.

Together, these augmentations enable a self-supervised approach to (i) bridge the domain gap between subjects allowing testing on unlabeled ones, and (ii) imitate the temporal and spatial properties of neural data, diversifying it and making it more robust to noise. In the following section, we describe these steps in more detail.

Fig. 4

Our approach to learning an effective representation of behaviors. First, we sample a synchronized set of behavioral and neural frames, \((\textbf{b}_{\textbf{i}}, \textbf{n}_\textbf{i})\). Then, we augment these data using randomly sampled augmentation functions \(t_b\) and \(t_n\). Encoders \(f_b\) and \(f_n\) generate intermediate representations \(\textbf{h}^b\) and \(\textbf{h}^n\), which are then projected into \(\textbf{z}_b\) and \(\textbf{z}_n\) by two separate projection heads \(g_b\) and \(g_n\). For the behavioral modality, we first apply a frame-wise MLP before \(f_b\) and then apply mixing on \(\textbf{m}^{b}_\textbf{i}\). For the neural modality, we apply mixing on \({\textbf {h}}_{{\textbf {i}}}^{n}\) without an MLP, since mixing cannot be done at the frame level. We maximize the similarity between the two projections using an InfoNCE loss. At test time, the red branch and \({\textbf {h}}_{{\textbf {i}}}^{n}\) are used for neural decoding

3.1 Problem Definition

We assume a paired set of data \(\mathcal {D}_{s}=\left\{ \left( \textbf{b}_{\textbf{i}}^{s}, \textbf{n}_{\textbf{i}}^{s}\right) \right\} _{i=1}^{n_{s}}\), where \(\textbf{b}^{s}_\textbf{i}\) and \(\textbf{n}^{s}_\textbf{i}\) represent behavioral and neural information respectively, with \(n_s\) being the number of samples for subject \(s\in \mathcal {S}\). We quantify behavioral information \(\textbf{b}^{s}_\textbf{i}\) as a set of 3D poses \(\textbf{b}^{s}_k\) for each frame \(k\in \textbf{i}\) taken of subject s, and neural information \(\textbf{n}^{s}_\textbf{i}\) as a set of two-photon microscope images \(\textbf{n}^{s}_k\), for all frames \( k \in \textbf{i}\), capturing the activity of neurons. The data is captured such that the two modalities are always synchronized (paired) without human intervention, and therefore describe the same set of events. Our goal is to learn, without supervision, a parameterized image encoder function \(f_n\) that maps a set of neural images \(\textbf{n}^{s}_\textbf{i}\) to a low-dimensional representation. We aim for our learned representation to be representative of the underlying action label, while being agnostic to both modality and subject identity. We assume that we are not given action labels during pre-training. Also note that we do not know at which point in the captured data an action starts and ends; we simply have a series of unknown actions performed by different subjects.

3.2 Contrastive Representation Learning

For each input pair \(\left( \textbf{b}_{\textbf{i}}^{s}, \textbf{n}_{\textbf{i}}^{s}\right) \), we first draw a random augmented version \(( \tilde{\textbf{b}}_{\textbf{i}}^{s}, \tilde{\textbf{n}}_{\textbf{i}}^{s} ) \) using sampled transformation functions \(t_{b} \sim \mathcal {T}_b\) and \(t_{n} \sim \mathcal {T}_n\), where \(\mathcal {T}_b\) and \(\mathcal {T}_n\) represent families of stochastic augmentation functions for behavioral and neural data, respectively, which are described in the following sections. Next, the encoder functions \(f_b\) and \(f_n\) transform the input data into low-dimensional vectors \(\textbf{h}_b\) and \(\textbf{h}_n\), followed by non-linear projection functions \(g_b\) and \(g_n\), which further transform the data into the vectors \(\textbf{z}_b\) and \(\textbf{z}_n\). For the behavioral modality, in order to facilitate mixing, we first transform the augmented input data \(\tilde{\textbf{b}}_{\textbf{i}}^{s}\) into \(\textbf{m}_{\textbf{i}}^{s}\) using a shallow frame-wise MLP, as shown in Fig. 4. For the neural modality, we instead directly apply mixing to \(\textbf{h}_n\), since frame-level mixing is not possible. We give the details of the mixing strategy in the next sections. During training, we sample a minibatch of N input pairs \(\left( \textbf{b}_{\textbf{i}}^{s}, \textbf{n}_{\textbf{i}}^{s}\right) \), and train with the loss function

$$\begin{aligned} \mathcal {L}^{b\rightarrow n}_{NCE} = - \sum _{i=1}^{N} \log \frac{\exp \left( \left\langle \textbf{z}^{i}_{b}, \textbf{z}^{i}_{n}\right\rangle / \tau \right) }{\sum _{k=1}^{N} \exp \left( \left\langle \textbf{z}^{i}_{b}, \textbf{z}^{k}_{n}\right\rangle / \tau \right) } \end{aligned}$$
(1)

where \(\left\langle \textbf{z}^{i}_{b}, \textbf{z}^{i}_{n}\right\rangle \) is the cosine similarity between the behavioral and neural modalities and \(\tau \in \mathbb {R}^{+}\) is the temperature parameter. Intuitively, the loss function measures the classification accuracy of an N-class classifier that tries to predict \(\textbf{z}^{i}_{n}\) given the true pair \(\textbf{z}^{i}_{b}\). To symmetrize the loss function with respect to the negative samples, we also define

$$\begin{aligned} \mathcal {L}^{n\rightarrow b}_{NCE} = - \sum _{i=1}^{N} \log \frac{\exp \left( \left\langle \textbf{z}^{i}_{b}, \textbf{z}^{i}_{n} \right\rangle / \tau \right) }{\sum _{k=1}^{N} \exp \left( \left\langle \textbf{z}^{k}_{b}, \textbf{z}^{i}_{n} \right\rangle / \tau \right) }. \end{aligned}$$
(2)

We take the combined loss function to be \(\mathcal {L}_{NCE} = \mathcal {L}^{b\rightarrow n}_{NCE} + \mathcal {L}^{n\rightarrow b}_{NCE}\), as in Zhang et al. (2020), Yuan et al. (2021). The loss function maximizes the mutual information between two modalities (van den Oord et al., 2019). Although standard contrastive learning bridges the gap between different modalities, it does not bridge the gap between different subjects (Fig. 3B). This is a fundamental challenge that we address in this work through augmentations described in the following section, which are part of the neural and behavioral family of augmentations \(\mathcal {T}_n\) and \(\mathcal {T}_b\).
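For concreteness, this symmetric objective can be sketched in PyTorch as follows; this is a minimal illustration rather than the exact implementation, and the function name, explicit normalization, and batch averaging (instead of summation) are assumptions.

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(z_b: torch.Tensor, z_n: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Computes L_NCE = L^{b->n} + L^{n->b} for N paired projections of shape (N, d)."""
    # Cosine similarities between every behavioral/neural pair in the minibatch.
    z_b = F.normalize(z_b, dim=1)
    z_n = F.normalize(z_n, dim=1)
    logits = z_b @ z_n.t() / tau  # entry (i, k) = <z_b^i, z_n^k> / tau
    targets = torch.arange(z_b.size(0), device=z_b.device)
    loss_b2n = F.cross_entropy(logits, targets)      # Eq. 1: negatives over neural samples
    loss_n2b = F.cross_entropy(logits.t(), targets)  # Eq. 2: negatives over behavioral samples
    return loss_b2n + loss_n2b
```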

3.2.1 Mixup Strategy

Given a set of consecutive 3D poses \(\textbf{b}^{s}_\textbf{i}\) and their features \(\textbf{m}^{s}_\textbf{i}\) calculated by a shallow MLP from augmented \(\tilde{\textbf{b}}_{\textbf{i}}^{s}\), for each \(k\in \textbf{i}\), we stochastically replace \(\textbf{m}^{s}_k\) with a mix of its two pose neighbors, sampled from two subjects, in the set of domains \(\mathcal {D}_{\mathcal {S}}\), where \(\mathcal {S}\) is the set of all animals. To get one of the neighbors, we first uniformly sample a domain \(\hat{s} \in \mathcal {S}\) and define a probability distribution \(\textbf{P}^{\hat{s}}_{\textbf{b}^{s}_k}\) over the domain \(\mathcal {D}_{\hat{s}}\) with respect to single 3D pose \(\textbf{b}^{s}_k\),

$$\begin{aligned} \textbf{P}^{\hat{s}}_{\textbf{b}^{s}_k} (\textbf{b}^{\hat{s}}_l) = \frac{\exp ( - \Vert \textbf{b}^{\hat{s}}_l - \textbf{b}^{s}_k \Vert _{2})}{\sum _{ \textbf{b}^{\hat{s}}_m \in \mathcal {D}_{\hat{s}} } \exp ( - \Vert \textbf{b}^{\hat{s}}_m - \textbf{b}^{s}_k \Vert _{2})}. \end{aligned}$$
(3)

We then sample from the above distribution and pass the result through the shallow MLP, which yields \(\textbf{m}^{\hat{s}}_l \sim \textbf{P}^{\hat{s}}_{\textbf{b}^{s}_k}\). Notice that, although the distribution is conditioned on the 3D pose \(\textbf{b}^s\), we sample back 3D pose features \(\textbf{m}^s\). In practice, we calculate the distribution \(\textbf{P}\) only over the approximate \(\textbf{N}\) nearest neighbors of \(\textbf{b}^{s}_k\) in order to speed up the implementation. We empirically set \(\textbf{N}\) to 128. Given two samples \(\textbf{m}^{\hat{s}}_l\) and \(\textbf{m}^{\overline{s}}_j\) drawn from the above distribution for independent domains, we then return the mixed version

$$\begin{aligned} \tilde{\textbf{m}}^{s}_k = \lambda \textbf{m}^{\hat{s}}_l +(1-\lambda ) \textbf{m}^{\overline{s}}_j \; . \end{aligned}$$
(4)

We sample the mixing coefficient \(\lambda \) from the Beta distribution \(\lambda \sim {\text {Beta}}(\alpha , \beta ).\) Our Mixup strategy removes the identity information in the behavioral data without perturbing it to the extent that semantic action information is lost. Since each behavioral sample \(\textbf{m}^{s}_\textbf{i}\) is composed of a set of 3D pose features, and each 3D pose feature \(\textbf{m}^{s}_k, \forall k \in \textbf{i}\) is replaced with a feature of a random domain, the transformed sample \(\tilde{\textbf{m}}_{\textbf{i}}^{s}\) is now composed of multiple domains. This forces the behavioral encoding function \(f_{b}\) to leave identity information out, therefore generalizing across multiple domains (Fig. 5).

Our Mixup augmentation is similar to the synonym replacement augmentation used in natural language processing (Wei & Zou, 2019), where randomly selected words in a sentence are replaced by their synonyms, thereby changing the syntactic form of the sentence without altering its semantics. Instead, we randomly replace each 3D pose in a motion sequence. To the best of our knowledge, we are the first to use a frame-wise mixing strategy in the context of time-series analysis or for domain adaptation that is conditioned on the input.
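The behavioral Mixup of Eqs. 3 and 4 can be sketched as follows, assuming NumPy and PyTorch; pose_bank, mlp, and the default hyperparameters are illustrative placeholders, and every frame is replaced here, whereas in practice replacement is stochastic.

```python
import numpy as np
import torch

def mixup_pose_features(poses_src, pose_bank, mlp, alpha=1.0, beta=1.0, n_neighbors=128):
    """poses_src: (T, D) flattened 3D poses of one source window.
    pose_bank: dict mapping target domain id -> (M, D) array of its 3D poses.
    mlp: the shallow frame-wise pose MLP. Returns (T, F) mixed pose features."""
    mixed = []
    domains = list(pose_bank.keys())
    for pose_k in poses_src:  # frame-wise replacement
        donor_feats = []
        for dom in np.random.choice(domains, size=2, replace=False):
            bank = pose_bank[dom]
            dists = np.linalg.norm(bank - pose_k, axis=1)
            nn_idx = np.argsort(dists)[:n_neighbors]         # restrict Eq. 3 to N neighbors
            probs = np.exp(-dists[nn_idx])
            probs /= probs.sum()
            donor = bank[np.random.choice(nn_idx, p=probs)]  # sample a neighboring pose
            donor_feats.append(mlp(torch.as_tensor(donor, dtype=torch.float32)))
        lam = np.random.beta(alpha, beta)                    # mixing coefficient (Eq. 4)
        mixed.append(lam * donor_feats[0] + (1.0 - lam) * donor_feats[1])
    return torch.stack(mixed)
```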

To keep mixing symmetric, we also mix the neural modality. To mix a set of neural features \(\textbf{h}_{\textbf{i}}^{s}\), we take its behavioral pair \(\textbf{b}_{\textbf{i}}^{s}\) and search for similar sets of poses in other domains, with the assumption that similar sets of poses describe the same action. Therefore, once similar behavioral data is found, the corresponding neural data can be mixed. Note that, unlike behavior mixing, we do not calculate the distribution on an individual 3D pose \(\textbf{b}^{s}_k\), but instead on the whole set of behavioral data \(\textbf{b}_{\textbf{i}}^{s}\), because similarity in a single pose does not necessarily imply similar actions and similar neural data. More formally, given the behavioral-neural pair \(\left( \textbf{b}_{\textbf{i}}^{s}, \textbf{n}_{\textbf{i}}^{s}\right) \), we mix the neural modality features \(\textbf{h}_{\textbf{i}}^{s}\) by sampling two new neural features \(\textbf{h}^{\overline{s}}_\textbf{j}\) and \(\textbf{h}^{\hat{s}}_\textbf{l}\) from distinct animals \(\hat{s}\) and \(\overline{s}\), using the probability distribution

$$\begin{aligned} \textbf{P}^{\hat{s}}_{\textbf{n}^{s}_\textbf{i}} (\textbf{b}^{\hat{s}}_\textbf{j}) = \frac{\exp ( - \Vert \textbf{b}^{\hat{s}}_\textbf{j} - \textbf{b}^{s}_\textbf{i} \Vert _{2})}{\sum _{ \textbf{b}^{\hat{s}}_\textbf{m} \in \mathcal {D}_{\hat{s}} } \exp ( - \Vert \textbf{b}^{\hat{s}}_\textbf{m} - \textbf{b}^{s}_\textbf{i} \Vert _{2})}, \end{aligned}$$
(5)

and then we return

$$\begin{aligned} \tilde{\textbf{h}}^{s}_{\textbf{i}} = \lambda \textbf{h}^{\hat{s}}_{\textbf{l}} +(1-\lambda ) \textbf{h}^{\overline{s}}_{\textbf{j}} \; . \end{aligned}$$
(6)

As before, we sample the mixing coefficient \(\lambda \) from the Beta distribution \(\lambda \sim {\text {Beta}}(\alpha , \beta )\). This yields a new mixed neural feature \( \tilde{\textbf{h}}_{\textbf{i}}^{s}\), in which the augmented neural data comes from two different subjects in \(\mathcal {S}\).
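The neural counterpart (Eqs. 5 and 6) follows the same pattern, but similarity is computed over whole behavioral windows and the mixing is applied to the window-level neural features; the sketch below makes the same assumptions about data layout as the one above.

```python
import numpy as np

def mixup_neural_features(b_win_src, b_win_bank, h_bank, alpha=1.0, beta=1.0):
    """b_win_src: (T*D,) flattened behavioral window of the source sample.
    b_win_bank: dict domain -> (M, T*D) flattened behavioral windows.
    h_bank: dict domain -> (M, F) neural features h of the same windows."""
    donors = []
    for dom in np.random.choice(list(b_win_bank.keys()), size=2, replace=False):
        dists = np.linalg.norm(b_win_bank[dom] - b_win_src, axis=1)
        probs = np.exp(-dists)
        probs /= probs.sum()                         # Eq. 5 over whole windows
        idx = np.random.choice(len(probs), p=probs)
        donors.append(h_bank[dom][idx])
    lam = np.random.beta(alpha, beta)
    return lam * donors[0] + (1.0 - lam) * donors[1]  # Eq. 6
```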

Fig. 5

Our Mixup strategy. Each 3D pose is processed by a pose-wise MLP to generate 3D pose features. Then, each 3D pose feature in the motion sequence of Domain 2 is randomly replaced with a mix of two of its neighbors, drawn from the set of domains \(\hat{s}\) \(\in \) \(\mathcal {S}\), which includes Domains 1 and 3. The Mixup augmentation hides identity information, while keeping pose changes in the sequence minimal

3.2.2 Neural Calcium Augmentation

Our neural data was obtained using two-photon microscopy and fluorescence calcium imaging. The resulting images are an indirect readout of the underlying neural activity and have temporal properties that differ from the true neural activity. For example, calcium signals from a neuron change much more slowly than the neuron's actual firing rate. Consequently, a single neural image \(\textbf{n}_t\) includes decaying information about neural activity from the recent past, and thus carries information about previous behaviors. This makes it harder to decode the current behavioral state.

We aimed to prevent this overlap of ongoing and previous actions. Specifically, we wanted to teach our network to be invariant with respect to past behavioral information by augmenting the set of possible past actions. To do this, we generated new data \(\tilde{\textbf{n}}^{s}_\textbf{i}\), that included previous neural activity \(\textbf{n}^{s}_k\). To mimic calcium indicator decay dynamics, given a neural data sample \(\textbf{n}^{s}_\textbf{i}\) of multiple frames, we sample a new neural frame \(\textbf{n}^{s}_k\) from the same domain, where \(k \notin \textbf{i}\). We then convolve \(\textbf{n}^{s}_k\) with the temporally decaying calcium convolutional kernel \(\mathcal {K}\), therefore creating a set of images from a single frame \(\textbf{n}^{s}_k\), which we then add back to the original data sample \(\textbf{n}^{s}_\textbf{i}\). This results in \(\tilde{\textbf{n}}^{s}_\textbf{i} = \textbf{n}^{s}_\textbf{i} + \mathcal {K} * \textbf{n}^{s}_k\) where \(*\) denotes the convolutional operation. In the Supplementary Material, we explain calcium dynamics and our calculation of the kernel \(\mathcal {K}\) in more detail.
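A simplified sketch of this augmentation follows, assuming NumPy; the exponential kernel and its decay constant are illustrative stand-ins for the calcium kernel \(\mathcal {K}\) described in the Supplementary Material, and the decaying copy of the extra frame is simply added from the start of the window.

```python
import numpy as np

def calcium_augment(n_seq, n_extra, tau_decay=8.0):
    """n_seq: (T, H, W) neural window n^s_i; n_extra: (H, W) frame n^s_k from the same animal."""
    t = np.arange(n_seq.shape[0])
    kernel = np.exp(-t / tau_decay)                       # decaying calcium response
    decayed_past = kernel[:, None, None] * n_extra[None]  # (T, H, W) decaying copy of the frame
    return n_seq + decayed_past
```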

3.2.3 Neural Noise Augmentation

Two-photon microscopy images often include multiple neural signals combined within a single pixel. This is due to the fact that multiple axons can be present in a small tissue volume that is below the spatial resolution of the microscope. To mimic this noise-adding effect, given a neural image sequence \(\textbf{n}^s_{{\textbf {i}}}\), we randomly sample a set of frames \(\textbf{n}^{\hat{s}}_{{\textbf {k}}}\) from a random domain \(\hat{s}\). We then return the blend of these two videos, \(\tilde{\textbf{n}}^s_{{\textbf {i}}} = \textbf{n}^s_{{\textbf {i}}} + \alpha \textbf{n}^{\hat{s}}_{{\textbf {k}}}\), to mix and hide the behavioral information. Unlike the CutMix (Yun et al., 2019) augmentations used for supervised training, we apply the augmentation in an unsupervised setup to make the model more robust to noise. We sample a single random \(\alpha \) for the entire set of samples in \(\textbf{n}^s_{{\textbf {i}}}\).
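A minimal sketch of this blending step; the range from which \(\alpha \) is drawn is an assumption, as only a single random \(\alpha \) per window is specified.

```python
import numpy as np

def neural_noise_augment(n_seq, n_other, alpha_max=0.3):
    """n_seq, n_other: (T, H, W) neural windows from two different animals."""
    alpha = np.random.uniform(0.0, alpha_max)  # one coefficient for the whole window
    return n_seq + alpha * n_other
```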

4 Experiments

We test our method on three datasets. In this section, we describe these datasets, the set of baselines against which we compare our model, and finally the quantitative comparison of all models.

4.1 Datasets

We ran most of our experiments on a large dataset of fly neural and behavioral recordings that we acquired and describe below, which we call MC2P. To demonstrate our method's ability to generalize, we also adapted it to run on another multimodal dataset that features neural ECoG recordings and markerless motion capture (Peterson, 2021; Singh et al., 2021), as well as the well-known H36M human motion dataset (Ionescu et al., 2014).

4.1.1 MC2P

Since there was no available neural-behavioral dataset with a rich variety of spontaneous behaviors from multiple individuals, we acquired our own dataset, which we name the Motion Capture and Two-photon dataset (MC2P). We will release this dataset publicly. MC2P features data acquired from tethered, behaving adult flies (Drosophila melanogaster) (Fig. 1). It includes:

1. Infrared video sequences of the fly acquired using six synchronized and calibrated infrared cameras forming a ring with the animal at its center. The images are \(480\times 960\) pixels in size and recorded at 100 fps.

2. Neural activity imaging obtained from the axons of descending neurons that pass from the brain to the fly's ventral nerve cord (motor system) and drive actions. The neural images are \(480\times 736\) pixels in size and recorded at 16 fps using a two-photon microscope (Chen et al., 2018) that measures calcium influx, a proxy for the neurons' actual firing rates.

We recorded 40 animals over 364 trials, resulting in 20.7 hours of recordings with 7,480,000 behavioral images and 1,197,025 neural images. We provide additional details and examples in the Supplementary Material. We give an example video of synchronized behavioral and neural modalities in Supplementary Videos 1 and 2.

To obtain quantitative behavioral data from video sequences, we extracted 3D poses expressed in terms of the 3D coordinates of 38 keypoints (Günel et al., 2019). We provide an example of detected poses and motion capture in Supplementary Videos 3 and 4. For validation purposes, we manually annotated a subset of frames using eight behavioral labels: forward walking, pushing, hindleg grooming, abdominal grooming, rest, foreleg grooming, antenna grooming, and eye grooming. We provide an example of behavioral annotations in Supplementary Video 5. To keep the experiments consistent, we always paired 32 frames of neural data with 8 frames of behavioral data.

4.1.2 ECoG Dataset

(Peterson, 2021; Singh et al., 2021): This dataset was recorded from epilepsy patients over a period of 7-9 days. Each patient had 90 electrodes implanted under their skull. The data comprises human neural Electrocorticography (ECoG) recordings and markerless motion capture of upper-body 2D poses. The dataset is labeled to indicate periods of voluntary spontaneous motion or rest. As with two-photon imaging in flies, ECoG recordings show a significant domain gap across individual subjects. We applied our multi-modal contrastive learning approach to the ECoG and 2D pose data, along with the mixing augmentation. We then applied our across-subject benchmark, in which we perform action recognition on a new subject without known action labels (Fig. 6).

Fig. 6

Domain Gap in the H3.6M dataset. Similar to the domain gap across nervous systems, RGB images show a significant domain gap when the camera angle changes across individuals. We guide action recognition across cameras in RGB images using 3D poses and behavioral mixing

4.1.3 H3.6M

H3.6M is a multi-view motion capture dataset that is not inherently multimodal. However, to test our approach in a very different context than the other two cases, we treated the videos acquired by different camera angles as belonging to separate domains. Since videos are tied to 3D poses, we used these two modalities and applied mixing augmentation together with multimodal contrastive learning to reduce the domain gap across individuals. Then, we evaluated the learned representations by performing action recognition on a camera angle that we do not have action labels for. This simulates our across-subject benchmark used in the MC2P dataset. For each experiment we selected three actions, which can be classified without examining large window sizes. We give additional details in the Supplementary Material.

4.2 Baselines

We evaluated our method against two supervised baselines, Neural Linear and Neural MLP, which directly predict action labels from neural data using a cross-entropy loss, without any self-supervised pretraining. We do not use any post-processing or smoothing after any of our baselines. We also compared our approach to three methods that regress behavioral data from neural data, a common neural decoding technique: a recent neural decoding algorithm, BehaveNet (Batty et al., 2019), and two other regression baselines with recurrent and convolutional architectures, Regression (Recurrent) and Regression (Convolution). In addition, we compare our approach to recent self-supervised representation learning methods, including SeqCLR (Mohsenvand et al., 2020) and SimCLR (Chen et al., 2020). We also combine the convolutional regression-based method (Reg. (Conv)) or the self-supervised learning algorithm SimCLR with the common domain adaptation techniques Gradient Reversal Layer (GRL) (Ganin & Lempitsky, 2015) or Maximum Mean Discrepancy (MMD) (Gretton et al., 2006), which yields four domain adaptation models. Finally, we apply MM-SADA (Munro & Damen, 2020), a recent multi-modal domain adaptation network for action recognition, to the MC2P dataset. For all of these methods, we used the same backbone architecture, which we describe in more detail in the Supplementary Material. We describe the baselines in more detail in the following:

4.2.1 Supervised

A feedforward network trained on neural data with manually annotated action labels using a cross-entropy loss. We discarded datapoints that did not have associated behavioral labels. For the MLP baseline, we trained a simple three-layer MLP with a hidden layer size of 128 neurons with ReLU activation and without batch normalization.

4.2.2 Regression (Convolutional)

A fully-convolutional feedforward network trained with an MSE loss on the behavioral reconstruction task, given the set of neural images. To keep the architectures consistent with the other methods, the average pooling is followed by a projection layer, whose output is used as the final representation of this model.

4.2.3 Regression (Recurrent)

This is similar to the model above, but the last projection network is replaced with a two-layer GRU module. The GRU module takes as input the fixed representation of the neural images. At each time step, the GRU module predicts a single 3D pose, with a total of eight steps to predict the eight poses associated with an input neural image. This model is trained with an MSE loss. We take the input of the GRU module as the final representation of the neural encoder.
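A sketch of such a recurrent decoder in PyTorch; the feature width and the way the fixed neural representation is fed at every step are assumptions, while the two GRU layers, eight prediction steps, and 38 3D keypoints follow the text.

```python
import torch.nn as nn

class RecurrentPoseDecoder(nn.Module):
    """Predicts eight consecutive 3D poses from a fixed neural representation."""
    def __init__(self, feat_dim=128, pose_dim=38 * 3, steps=8):
        super().__init__()
        self.gru = nn.GRU(feat_dim, feat_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(feat_dim, pose_dim)
        self.steps = steps

    def forward(self, h_n):                              # h_n: (B, feat_dim)
        inp = h_n.unsqueeze(1).repeat(1, self.steps, 1)  # same representation at each step
        out, _ = self.gru(inp)
        return self.head(out)                            # (B, steps, pose_dim)
```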

4.2.4 BehaveNet

This uses a discrete autoregressive hidden Markov model (ARHMM) to decompose 3D motion information into discrete “behavioral syllables” (Batty et al., 2019). As in the regression baseline, the neural information is used to predict the posterior probability of observing each discrete syllable. Unlike the original method, we used 3D poses instead of RGB videos as targets. We skipped compressing the behavioral data using a convolutional autoencoder because, unlike RGB videos, 3D poses are already low-dimensional.

4.2.5 SimCLR

We trained the original SimCLR module without the calcium imaging data and mixing augmentations (Chen et al., 2020). As in our approach, we took the features before the projection layer as the final representation.

4.2.6 Gradient Reversal Layer (GRL)

Together with the contrastive loss, we trained a two-layer MLP domain discriminator per modality, \(D_{b}\) and \(D_{n}\), which estimates the domain of the neural and behavioral representations (Ganin & Lempitsky, 2015). Discriminators were trained by minimizing

$$\begin{aligned} \mathcal {L}_{D}=\sum \nolimits _{x \in \{\textbf{b}, \textbf{n}\}}-d \log \left( D_{m}\left( f_{m}(x)\right) \right) \; \end{aligned}$$
(7)

where d is the one-hot identity vector. The Gradient Reversal Layer is inserted before the projection layer. Given the reversed gradients, the neural and behavioral encoders \(f_n\) and \(f_{b}\) learn to fool the discriminator and output representations that are invariant across domains, hence acting as a domain adaptation module. We kept the hyperparameters of the discriminator the same as in previous work (Munro & Damen, 2020). We froze the weights of the discriminator for the first 10 epochs and trained only with \(\mathcal {L}_{NCE}\). We then trained the network using both loss functions, \(\mathcal {L}_{NCE} + \lambda _{D} \mathcal {L}_{D}\), for the remainder of training. We set the hyperparameter \(\lambda _{D}\) to 10 empirically.
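For reference, the gradient reversal operation can be implemented as a custom autograd function; this is the standard construction of Ganin & Lempitsky (2015), not code from the original implementation.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The reversed (negated, scaled) gradient flows back into the encoder.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)
```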

4.2.7 MM-SADA

A recent multi-modal domain adaptation model for action recognition that minimizes a cross-entropy loss on target labels, an adversarial loss for domain adaptation, and contrastive losses to maximize consistency between multiple modalities (Munro & Damen, 2020). As we do not assume any action labels during the contrastive training phase, we removed the cross-entropy loss.

4.2.8 SeqCLR

This approach learns a uni-modal self-supervised contrastive model (Mohsenvand et al., 2020). Hence, we only apply it to the neural imaging data, without using the behavioral modality. As this method was previously applied to Electroencephalography (EEG) datasets, we removed EEG-specific augmentations.

4.2.9 Maximum Mean Discrepancy (MMD)

We replaced the adversarial loss in the GRL baseline with a statistical test that minimizes the distributional discrepancy between different domains (Gretton et al., 2006). Similar to previous work, we applied MMD independently on both modalities, on the representations before the projection layer (Munro & Damen, 2020; Kang et al., 2020). As for the GRL baseline, we first trained for 10 epochs using only the contrastive loss, and trained using the combined losses \(\mathcal {L}_{NCE} + \lambda _{MMD} \mathcal {L}_{MMD}\) for the remainder. We set the hyperparameter \(\lambda _{MMD}\) to 1 empirically. For the domain adaptation methods GRL and MMD, we reformulated the denominator of the contrastive loss function. Given a domain function dom, which gives the domain of a data sample, we replaced one side of \(\mathcal {L}_{NCE}\) in Eq. 1 with

$$\begin{aligned} \log \frac{\exp \left( \left\langle \textbf{z}^{i}_{b}, \textbf{z}^{i}_{n}\right\rangle / \tau \right) }{\sum _{k=1}^{N} \textbf{1}_{[dom(i) = dom(k)]} \exp \left( \left\langle \textbf{z}^{i}_{b}, \textbf{z}^{k}_{n}\right\rangle / \tau \right) }, \end{aligned}$$
(8)

where selective negative sampling prevents the formation of trivial negative pairs across domains, therefore making it easier to merge multiple domains. Negative pairs formed during contrastive learning try to push away inter-domain pairs, whereas domain adaptation methods try to merge multiple domains to close the domain gap. We found that the training of contrastive and domain adaptation losses together could be quite unstable, unless the above changes were made to the contrastive loss function.
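One direction of the reformulated loss in Eq. 8 can be sketched as follows, assuming a tensor of per-sample subject identities; the names and tensor layout are illustrative.

```python
import torch
import torch.nn.functional as F

def masked_info_nce(z_b, z_n, domain_ids, tau=0.1):
    """Negatives in the denominator are restricted to samples from the same domain (Eq. 8)."""
    z_b, z_n = F.normalize(z_b, dim=1), F.normalize(z_n, dim=1)
    logits = z_b @ z_n.t() / tau                               # (N, N) similarity matrix
    same_domain = domain_ids[:, None] == domain_ids[None, :]   # 1_{dom(i) = dom(k)}
    logits = logits.masked_fill(~same_domain, float('-inf'))   # drop cross-domain negatives
    targets = torch.arange(z_b.size(0), device=z_b.device)     # positives stay on the diagonal
    return F.cross_entropy(logits, targets)
```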

4.3 Benchmarks

Since our goal is to create useful representations of neural images in a self-supervised way, we focused on single- and across-subject action recognition. Specifically, we trained our neural decoder \(f_{n}\), along with the baselines, without using any action labels. Then, freezing the neural encoder parameters, we trained a linear model on the encoded features, an evaluation protocol widely used in the field (Chen et al., 2020; Lin et al., 2020; He et al., 2020; Dave et al., 2021). We used either half or all of the action labels. We give the specifics of the train-test split in the Supplementary Material.
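In practice, this evaluation amounts to freezing the encoder and fitting a linear classifier on its outputs; a sketch with scikit-learn, where the feature and label arrays are assumed to be precomputed from the frozen \(f_{n}\).

```python
from sklearn.linear_model import LogisticRegression

def linear_probe(train_feats, train_labels, test_feats, test_labels):
    """train/test_feats: (num_windows, F) frozen encoder outputs; labels: action annotations."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)  # classification accuracy
```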

4.3.1 Single-Subject Action Recognition

For each subject, we trained and tested a simple linear classifier independently on the learned representations to predict action labels. We assume that we are given action labels on the subject we are testing. In Table 1 we report aggregated results.

4.3.2 Across-Subject Action Recognition

We trained linear classifiers on N-1 subjects simultaneously and tested on the left-out one. Therefore, we assume we do not have action labels for the target subject. We repeated the experiment for each individual and report the mean accuracy in Tables 1 and 2.
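The across-subject benchmark can be read as a leave-one-subject-out loop over the frozen representations; a sketch under the assumption that features and labels are stored per subject.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def across_subject_accuracy(feats_by_subject, labels_by_subject):
    """feats_by_subject, labels_by_subject: dicts keyed by subject id."""
    scores = []
    for held_out in feats_by_subject:
        train_ids = [s for s in feats_by_subject if s != held_out]
        train_x = np.concatenate([feats_by_subject[s] for s in train_ids])
        train_y = np.concatenate([labels_by_subject[s] for s in train_ids])
        clf = LogisticRegression(max_iter=1000).fit(train_x, train_y)
        scores.append(clf.score(feats_by_subject[held_out], labels_by_subject[held_out]))
    return float(np.mean(scores))
```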

4.3.3 Identity Recognition

As a sanity check, we attempted to classify subject identity from the learned representations, again using a linear classifier, to test their domain invariance. If the learned representations are domain (subject) invariant, we expect the linear classifier to be unable to detect the domain of the representations, resulting in low identity recognition accuracy. Identity recognition results are reported in Tables 1 and 2.

Table 1 Action recognition accuracy on MC2P dataset
Table 2 Action recognition accuracy on H36M and ECoG dataset

5 Results

5.1 Single-Subject Action Recognition on MC2P

For the single-subject baseline, joint modeling of a common latent space outperformed supervised models by a large margin, even when the linear classifier was trained on the action labels of the tested animal. Our mixing and neural augmentations provided an accuracy boost compared with a simple contrastive learning method, SimCLR (Chen et al., 2020). Although regression-based methods can extract behavioral information from the neural data, they do not produce discriminative features. When combined with the proposed set of augmentations, our method performs better than previous neural decoding models because it extracts richer features thanks to a better self-supervised pretraining step. Domain adaptation techniques do not result in a significant difference in the single-subject baseline; the domain gap within a single animal is smaller than that between animals.

5.2 Across-Subject Action Recognition on MC2P

We show that supervised models do not generalize across animals, because each nervous system is unique. Without the proposed augmentations, the contrastive method SimCLR performed worse than convolutional and recurrent regression-based methods, including the current state-of-the-art BehaveNet (Batty et al., 2019), due to the large domain gap between animals in the latent embeddings (Fig. 3B). Although the domain adaptation methods MMD (Maximum Mean Discrepancy) and GRL (Gradient Reversal Layer) close the domain gap when used with contrastive learning, they do not position semantically similar points near one another (Fig. 3C). As a result, domain adaptation-based methods do not yield significant improvements in the across-subject action recognition task. Although regression-based methods suffer less from the domain gap problem, they do not produce representations that are as discriminative as those of contrastive learning-based methods. Our proposed set of augmentations and strategies closes the domain gap while improving the action recognition baseline for self-supervised methods, for both single-subject and across-subject tasks (Fig. 3D).

5.3 Action Recognition on ECoG Motion versus Rest

As shown at the bottom of Table 2, our approach significantly lowers the identity information in ECoG embeddings, while significantly increasing across-subject action recognition accuracy compared to the regression and multi-modal SimCLR baselines. Low supervised accuracy confirms a strong domain gap across individuals. Note that uni-modal contrastive modeling of ECoG recordings (SimCLR (ECoG)) does not yield strong across-subject action classification accuracy because uni-modal modeling cannot deal with the large domain gap in the learned representations.

5.4 Human Action Recognition on H3.6M

We observe in Table 2 that, as with the previous datasets, the low performance of the supervised baseline and of the uni-modal modeling of RGB images (SimCLR (RGB)) is due to the domain gap in the across-subject benchmark. This observation is confirmed by the high identity recognition accuracy of these models. Our mixing strategy yields strong improvements over the regression and multi-modal contrastive (SimCLR) baselines. As with the previous datasets, uni-modal contrastive training cannot generalize across subjects, due to the large domain gap.

5.5 Ablation Study

Table 3 Ablation on effects of different augmentations
Table 4 Ablation on neural preprocessing

We compare the individual contributions of different augmentations proposed in our method. We report these results in Table 3. We observe that all augmentations contribute to single- and across-subject benchmarks. Our mixing augmentation strongly affects the across-subject benchmark, while at the same time greatly decreasing the domain gap, as quantified by the identity recognition result. Other augmentations have minimal effects on the domain gap, as they only slightly affect the identity recognition benchmark (Tables 4 and 5).

We compare our neural augmentations to standard neural preprocessing approaches commonly used in neuroscience, using the state-of-the-art neural preprocessing library CaImAn (Giovannucci et al., 2019). CaImAn requires tuning 25 parameters for spike inference, and running a single set of parameters took 5 hours of processing. As shown in Table 4, this algorithm produced worse results, likely due to errors in ROI detection and spike inference. Because our method can be run on the raw data, without requiring ROI detection or spike inference, it removes the burden of an extensive hyperparameter search and unnecessarily long computation times. We therefore believe that our augmentations and model are more general and useful for the community. Lastly, we performed an ablation experiment on mixing individual modalities and report the results in Table 5. Mixing both modalities yields the best scores. However, when mixed alone, the behavioral modality performs better because it hides subject identity more effectively, since it is mixed at the pose level instead of the window level.

Table 5 Ablation on mixing of different modalities on MC2P dataset

6 Conclusion

We have introduced a self-supervised neural action representation framework for neural imaging and behavioral videography data. We extended previous methods by incorporating a new mixing-based domain adaptation technique, which we have shown to be useful on three very different multimodal datasets, together with a set of domain-specific neural augmentations. Two of these datasets are publicly available. We created the third dataset, which we call MC2P, by recording video and neural data for Drosophila melanogaster, and will release it publicly to speed up the development of self-supervised methods in neuroscience (Aymanns et al., 2022). We hope our work will help the development of effective brain-machine interfaces and neural decoding algorithms. In future work, we plan to disentangle remaining long-term, non-behavioral information that has a global effect on neural data, such as hunger or thirst, and to test our method on different neural recording modalities. As a potential negative impact, if neural data were acquired without consent, our method could be used to extract private information.