1 Introduction

The recent increases in mental health conditions worldwide have made the prevention and treatment of mental disorders a global health priority. Multiple factors have contributed to the dramatic rise in mental illness, including social media pressure, increased adoption of electronic media (Electronic Screen Syndrome), increasingly divisive news, mounting performance pressures (educational, career, financial, etc.), household breakdown, and, most recently, the COVID-19 pandemic [1, 2]. According to the World Health Organization (WHO), depression is the leading cause of disability worldwide, affecting more than 264 million people [3]. Similarly, statistics published by the National Institute of Mental Health (NIMH) in 2020 indicate that an estimated 52.9 million adults aged 18 or older in the United States were diagnosed with a mental illness, accounting for 21% of the adult population [4]. Moreover, suicide is the world's second leading cause of death among people aged 15 to 24 [5]. It is estimated that nearly 800,000 people die by suicide each year, which translates to one death every 40 seconds. People affected by mental health disorders frequently face significant human rights violations, discrimination, and stigma.

Mental health promotion is gaining momentum as societies become more aware of the risks associated with mental illnesses. Many technology-based applications and techniques have surfaced in response to the need for more effective mental health prevention, awareness, patient monitoring, and disease diagnosis. Digital mental health platforms powered by artificial intelligence algorithms are also growing popular for diagnosing and treating various psychiatric disorders.

The main challenge encountered by almost all artificial intelligence (AI)-powered algorithms is the scarcity of data in general and of good-quality data in particular. As with any sensitive and private health problem, the data are subject to strict privacy policies, restrictive data-sharing rules, and ethical constraints. This is especially relevant in the mental health domain, where patient information is deeply personal and sensitive due to the highly stigmatized nature of the illness. Even when datasets are made publicly available, they are usually limited in size, which limits the performance of current techniques. Data availability will always cap the potential of both machine learning (ML) and deep learning (DL) models if they are trained in the traditional way, referred to as centralized learning. In centralized learning, a single model is trained on data collected from different sources and compiled into one dataset. The model is then tested and deployed in a computer program or a web/mobile application to be used by psychiatrists to support their decision-making processes. There is a need for another robust approach that serves clinical psychiatric practice (evaluating, diagnosing, and building decision-support systems for patients with mental disorders) while also prioritizing the privacy of patients and of their data collected from multiple sources such as hospitals, clinics, wearable devices, and even social media.

A collaborative learning approach referred to as federated learning (FL) was introduced in 2016 by a team at Google Research [6]. FL follows a client-server approach that trains a centralized model on decentralized data so the data never leave the client side. While federated learning was initially designed for other domains, it quickly gained attention in the healthcare and medical fields through its capability to handle data privacy and governance by training models collaboratively without exchanging data. It provides a consensus solution without moving patient data beyond the firewalls of the healthcare institution in which they reside [7].

Several systematic reviews have addressed the use of federated learning in the health domain [8,9,10,11,12]. However, to the best of our knowledge, none of them specifically investigated the use of FL techniques in the mental health domain. This systematic literature review (SLR) aims to bridge this gap and provide an in-depth background on federated learning (FL) and its current state-of-the-art techniques as applied in the mental health field. The main research question of this work is therefore:

MRQ: To what extent has federated learning been exploited in mental health state detection?

Through our systematic review, we also answer the following sub-research questions.

  • RQ1: What mental disorders were explored?

  • RQ2: What data types were most used with FL and mental illness?

  • RQ3: Which countries have contributed to this research direction?

  • RQ4: What is the most commonly used FL algorithm in the context of mental illness?

The rest of the paper is organized as follows: Section 2 provides a background on the federated learning paradigm. A detailed description of the systematic review methodology employed is given in Section 3. The findings gathered from the selected papers are highlighted in Section 4. Section 5 provides a list of the challenges and limitations that researchers are currently facing in this area. Section 6 concludes with a critical discussion and suggestions to pave the way for developing FL-based applications and systems in mental health.

2 Background

Privacy-enhancing technologies (PETs) aim to prevent data leaks while balancing privacy and usability. Federated learning is one such PET, with the primary goal of protecting the privacy of clients' data. The more clients are guaranteed the security of their data, the more data become available for training, and the more generalizable the model can be. Unlike traditional centralized model training, where data are brought to a server on one machine or in a data center, models are sent to the clients to be trained on their on-device data. The key components of an FL system are discussed below.

Fig. 1

FedAvg algorithm explanation, where \(W_{i}\) is the model shared by the server, \(W_{i+1}^k\) is client k's update on the shared model, \(n_k\) is client k's local data size, n is the total data size, and \(W_{i+1}\) is the updated model calculated at the server

2.1 Aggregation methodologies

A preliminary step of a federated learning system is to aggregate the results of each client's model to realize a more powerful, generalized model. This step is performed by a coordinating centralized server responsible for all client communications. The first aggregation algorithm was titled FederatedAveraging (FedAvg) [6]. In FedAvg, the coordinating server first sends an identical initial model \(W_{i}\) to each of the K participating clients. Each client trains the model locally on its data for a predetermined number of epochs using the stochastic gradient descent (SGD) optimizer. The encrypted trained model results (weights and parameters) \(W_{i+1}^k\) are sent back to the coordinating server, which calculates the new updated model \(W_{i+1}\) to be shared once more until the learning phase ends. The server updates the model weights by averaging each model's results weighted by its share of the data (weighted average), as shown in Fig. 1. To ensure privacy, a secure aggregation protocol was developed that allows the server to decrypt the average update only if a predetermined number of users have participated and sent their results [13]. Sharing model parameters throughout the network requires secure communication between the clients and the centralized server to avoid problems such as model poisoning. Various techniques, including homomorphic encryption (HE) [14], secure multi-party computation (SMPC) [15], and differential privacy (DP) [16], have been used to compute the defined FL functionality privately.
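As a concrete illustration, the following is a minimal NumPy sketch of the server-side weighted-average step; the function and variable names are illustrative and not taken from [6]:

```python
import numpy as np

def fedavg_aggregate(client_weights, client_sizes):
    """Server-side FedAvg step: W_{i+1} = sum_k (n_k / n) * W_{i+1}^k.

    client_weights: one list of np.ndarray layers per client (the W_{i+1}^k);
    client_sizes:   the local data sizes n_k. Returns the new global model."""
    n = sum(client_sizes)                       # total data size n
    num_layers = len(client_weights[0])
    return [
        sum((nk / n) * w[layer]                 # weighted average per layer
            for w, nk in zip(client_weights, client_sizes))
        for layer in range(num_layers)
    ]
```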

2.2 Data distribution

Data can have two possible distributions in a federated learning system: independent and identically distributed (IID) and non-IID. Given the nature of the non-IID data in a federated learning system, many statistical challenges can be encountered, such as [10]:

  • Quantity Skew. Quantity skew arises when the amount of data differs across clients, i.e., some clients hold far more records than others (imbalanced data volumes).

  • Label Skew. Label skew arises when the class distribution differs across clients, i.e., one client holds far more records of a certain class than of the others. For example, a big hospital may have much more data about depression than a small medical center.

  • Feature Skew. Feature skew arises when clients do not share the same set of features. For example, when two hospitals report data about a certain disease, there will be a large overlap in the reported features owing to the nature of the disease itself. However, some features may differ because the machines used to produce the measurements, such as MRI scanners, may not come from the same manufacturer.

Data are considered IID when they are balanced, label distributions are nearly identical across clients, and all clients share the same features. Fortunately, quantity skew can be mitigated by data augmentation and feature skew by data imputation techniques. Label skew is what federated learning is designed for: it can learn from any data source, no matter how small. A sketch of how such partitions can be simulated is given below.
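The following sketch simulates both distributions on a labeled dataset: an IID split deals shuffled records out evenly, while a Dirichlet-based split produces label skew. The Dirichlet scheme is a common simulation device in the FL literature, used here as an illustrative assumption rather than one prescribed by the reviewed papers:

```python
import numpy as np

rng = np.random.default_rng(0)

def iid_partition(labels, num_clients):
    """IID: shuffle all record indices and deal them out evenly."""
    return np.array_split(rng.permutation(len(labels)), num_clients)

def label_skew_partition(labels, num_clients, alpha=0.5):
    """Label skew: a Dirichlet prior gives each client an uneven class
    mix; smaller alpha means stronger skew."""
    shards = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx_c = rng.permutation(np.where(labels == c)[0])
        props = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx_c)).astype(int)
        for shard, part in zip(shards, np.split(idx_c, cuts)):
            shard.extend(part.tolist())
    return [np.array(s) for s in shards]
```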

2.3 Data partitioning

Data partitioning in federated learning comes in three types: horizontal FL (HFL), vertical FL (VFL), and federated transfer learning (FTL). The three differ in the data each client's model is trained on. In HFL, every local dataset shares the same feature set, i.e., each client trains on the same features for different patients. In VFL, each client holds a different feature set for the same patients; for example, two healthcare facilities can hold different data (features and labels) for the same patient. Lastly, in FTL, the clients share neither the same feature set nor the same patients: a model pre-trained on a similar dataset at one client is used to solve a different problem at another client. HFL is the data partitioning scheme most frequently explored by researchers, as illustrated by the toy example below.
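The toy sketch below contrasts the horizontal and vertical schemes on a hypothetical patient-by-feature matrix (FTL is omitted, as it shares neither axis):

```python
import numpy as np

# Toy record matrix: rows are patients, columns are features.
X = np.arange(20).reshape(5, 4)          # 5 patients, 4 features

# HFL: clients share the feature axis but hold different patients.
hfl_client_a, hfl_client_b = X[:3, :], X[3:, :]

# VFL: clients share the patient axis but hold different features.
vfl_client_a, vfl_client_b = X[:, :2], X[:, 2:]
```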

3 Methods

This research employs the Systematic review Methodology Blending Active Learning and Snowballing (SYMBALS) [17]. SYMBALS not only follows authoritative systematic review guidelines but also combines existing methods into a quick and accessible technique [18,19,20]. Its stages are explained in the following subsections.

3.1 Database search

Database searching is at the core of all systematic review methodologies. This step constructs a set of all possible relevant publications from different sources. To ensure comprehensive review coverage, we include six databases in our search: Science Direct, Springer, ACM Digital Library, PubMed Central, IEEE Xplore, and Wiley Online Library. The used search query was:

(“federated learning” OR “multi-party computing” OR “multiparty computing”) AND (“mental health” OR “psychiatry” OR “psychology”)

Since FL was first proposed in 2016, no FL-based medical research exists from before that year; the retrieval time range was therefore 2016 to July 2023. In Springer, Computer Science was chosen as the discipline, and articles and conference papers were selected as the content types. Books were excluded from Wiley Online Library. The query string returned a total of 418 papers; after removing duplicates based on their titles, the final paper set included 402 papers. Table 1 shows the total number of publications returned by each database.

Table 1 Number of papers returned from each database

3.2 Screening using active learning

This is a fundamental step in the SYMBALS methodology, as it accelerates the screening process without sacrificing accuracy. Machine learning is applied in the title and abstract screening step to spare researchers from manually labeling papers; SYMBALS uses the ASReview tool [21] for this purpose. This is very important when the original paper set is large; however, since the total number of non-duplicated papers retrieved was 402, we decided to perform this step manually to further ensure the validity of the selected papers. The decision to include or exclude a paper was based on the following criteria:

  • Inclusion criteria:

    • I1: The paper must describe the use of an FL technique in training an AI model.

    • I2: The paper must address a mental disorder.

  • Exclusion criteria:

    • E1: The paper discusses a technique for securely sharing the training parameters during the FL process.

    • E2: The paper discusses a fully decentralized implementation of federated learning, such as blockchain.

    • E3: The paper does not address a mental disorder.

    • E4: The paper does not explain the FL algorithm used.

    • E5: Local data are shared with the server, even if encrypted or anonymized.

    • E6: Irrelevant papers inaccurately returned by the query.

E6 accounted for the greatest share of the reduction, excluding 251 papers for varying reasons. E3 excluded 74 articles, such as those addressing mood detection, emotion recognition, stress monitoring, and loneliness detection [22, 23, 24]. E4 accounted for 30 papers in which the researchers did not specify the federated learning algorithm or how the data were handled in the federated setting, such as [25]. Three papers, such as [26] and [27], were excluded by E5 because the data left the clients' side, violating the FL concept. After the screening phase, 384 papers had been excluded, leaving 18 papers for the next step. Table 2 shows the number of papers excluded by each criterion.

Table 2 Number of papers excluded by each criterion
Fig. 2

SYMBALS steps

Fig. 3

Visualization of federated learning applications and relevant data types in mental health research

3.3 Backward snowballing

Unlike other SLR techniques that rely only on active learning, SYMBALS complements the output of the previous step with a backward snowballing step. Snowballing ensures the inclusion of relevant papers that may have been missed because their database was not considered or they were not covered by the search query. Starting from the set of selected papers, a researcher finds additional relevant papers by consulting the reference list of each paper, a process called backward snowballing. Other SLRs employ forward snowballing, in which the citations of the papers are inspected to add more relevant papers. However, the authors of SYMBALS argue that older papers will generally constitute the largest group of relevant papers not yet included; it is more efficient to examine references rather than citations, based on the observation that databases generally have excellent coverage of recent peer-reviewed research. Because the output of the previous step is relatively small, no extra stopping criterion needed to be defined for this step. One additional paper was added through backward snowballing, increasing the total number of papers to 19.

The three subsequent SYMBALS steps are designed to ensure the quality of the included papers, prepare data extraction sheets, and validate the search results.

3.4 Quality assessment

This is an optional step proposed by SYMBALS for a large number of inclusions. Since all the included papers were manually selected and their number is relatively small, this step was skipped in our systematic review process.

3.5 Data extraction and synthesis

Data extraction was performed to provide a numerical analysis of the reviewed literature and to support the description of individual approaches in the following section. The following data were extracted from each paper:

  • D1: Title and publication year.

  • D2: Mental illness type.

  • D3: Data type and dataset description.

  • D4: Federated learning algorithm.

  • D5: Whether the FL model was implemented or simulated.

  • D6: Whether the used model was based on traditional machine learning or deep learning (DL).

  • D7: Description of the used AI model.

  • D8: Performance measures and their results.

It is worth noting that in D5, an actual implementation of FL means working with different clients' data, sending the model to be trained on their end, and aggregating the results. An FL simulation, on the other hand, is when models are not sent to be trained on users' devices, or when data come from the same distribution but are divided locally to mimic the FL flow.

During the data extraction phase, we discovered that three pairs of papers were duplicated in terms of their contribution, i.e., in each pair, both papers describe the same model in two different publications [28,29,30,31,32,33]. To avoid redundancy, only one paper from each pair was considered, leaving 16 papers to be reviewed.

Fig. 4

Quantitative analysis

3.6 Validation

This is the last step of the SYMBALS methodology; its main purpose is to validate the acquired set of papers. A set of 40 papers resulting from the search query was re-assessed by a different author who did not contribute to the screening process. After reviewing the inclusion and exclusion criteria, this author made the same decisions and arrived at the same labeling results as the original author.

A visual representation of the SYMBALS review process applied in our SLR is shown in Fig. 2.

4 Results

In this section, we provide answers to the research questions introduced previously, based on reading and analyzing the selected paper set. Figure 3 summarizes the mental disorders, data types, and techniques explored in the published research that used FL. Important research insights and a quantitative analysis of the reviewed literature are introduced first, followed by a detailed description of each paper in the final selected set.

4.1 Quantitative analysis

MRQ: To what extent has federated learning been exploited in mental health state detection? Based on our systematic review, conducted with the SYMBALS methodology, sixteen papers applied the federated learning concept in the mental health domain. While the concept of FL was introduced in 2016, the first published research merging FL and mental health applications appeared in 2019. Since then, the publication rate has increased, particularly in the last two years, as shown in Fig. 4a. Figure 4b shows the distribution of the papers included in this review among the different search engines. Most papers were found in IEEE Xplore. Only four papers were found in PubMed, a medical literature repository, which indicates a lack of exposure to FL in the mental health domain.

RQ1: What mental disorders were explored? Seven conditions were covered: depression, schizophrenia, violent incident detection, suicidal ideation, obsessive-compulsive disorder (OCD), bipolar disorder, and attention deficit hyperactivity disorder (ADHD), leaving room for many other illnesses to benefit from FL. As illustrated in Fig. 4c, most of the research (10 publications) targeted depression, whereas schizophrenia was the second most addressed mental disorder.

RQ2: What data types were most used with FL and mental illness? The reviewed studies used medical data of diverse types: textual data such as electronic health records; tabular data such as patient information (e.g., age and gender) and sensor readings; image data such as patient scans (ultrasound, CT, MRI); and audio data such as patients' recordings. Textual and tabular sensor data were used equally often in the reviewed papers, as shown in Fig. 4d. This outcome is not surprising, as the nature of mental illness and the spread of social networks have produced a huge pool of textual data for researchers to work on. Tabular sensor data can also be obtained from various sources such as smartphones, wristbands, and wearable devices. It is important to emphasize that most of the datasets used in the literature were collected by the authors and not made publicly available, such as clinical data collected from hospitals and social media posts collected from Twitter, Reddit, and Weibo. Table 3 gives details on the publicly available datasets used to experiment with FL in mental health research.

Table 3 Publicly available datasets used in FL research

RQ3: Which countries have contributed to this research direction? Researchers from eleven countries explored federated learning to develop more robust, generalized models while preserving the privacy of mental health data. Figure 4e lists all these countries, considering every author affiliation in the resulting papers. The United States of America and China each hold an equal share of four publications.

RQ4: What is the most commonly used FL algorithm in the context of mental illness? FedAvg was used in more than 75% of the papers addressing mental illness. This is expected, as FL is still in its infancy and FedAvg was the first algorithm introduced and remains the most commonly used in other domains. In Fig. 4f, we provide an overview of the underlying machine learning models that the papers employed to evaluate their proposed FL frameworks. Recurrent neural networks (RNNs) are the most commonly used models, followed by convolutional neural networks (CNNs). Less explored are decision trees, multi-layer perceptrons (MLP), Deep and Cross Networks (DCN), and BERT-based models.

4.2 Reviews for federated learning in mental health

In this section, we introduce paper-specific details and findings. For a better comparison among the reviewed literature, papers are segmented by the model employed for learning, i.e., traditional machine learning (ML) or deep learning (DL), followed by a subdivision based on the used data type.

4.2.1 Traditional machine learning based classifiers

Four papers employing traditional machine-learning techniques are discussed in this section.

Tabular data

In [39], depression was detected using sensor data collected from the ActiGraph wristband [34]. For each patient, the quantity, duration, and strength of movements were recorded every minute. The authors proposed a new data augmentation approach to tackle the imbalance problem in the collected data: for every minute of the day with a missing data sample, the set of records representing the same patient at the same time on other days was extracted from the dataset, and a random one was selected to complete the patient's vector for that day. The data were then fed to a Privacy-Preserving Distributed Extremely Randomized Trees (PPD-ERT) [40] algorithm based on decision trees. PPD-ERT guarantees data privacy by letting data holders keep their data locally. A mediator server initializes and shares global and personal random seeds among the data holders and, from the aggregated results, selects the best candidate node at each step to ensure the same tree is built at each client. The proposed augmentation approach led to better classifiers, with up to 7.9% higher F1-score, 8.2% higher accuracy, and a 0.169 higher Matthews correlation coefficient. The authors continued working on the same model and introduced [41], an extension of this work. A sketch of the augmentation step is given below.
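The following is a minimal sketch of the described augmentation step, assuming per-minute actigraphy arrays with NaNs marking missing samples; the array layout and function name are illustrative assumptions, not taken from [39]:

```python
import numpy as np

def fill_missing_minutes(day, history, rng=np.random.default_rng()):
    """Complete one day of per-minute actigraphy for a single patient.

    day:     (1440, F) array with NaN rows where a minute is missing.
    history: (num_days, 1440, F) array of the patient's other days.
    For each missing minute, the same minute-of-day is sampled at random
    from a day on which it was recorded, as described in [39]."""
    filled = day.copy()
    for minute in range(filled.shape[0]):
        if np.isnan(filled[minute]).any():
            candidates = history[:, minute, :]   # same minute, other days
            ok = ~np.isnan(candidates).any(axis=1)
            if ok.any():                         # leave as NaN if no donor day
                filled[minute] = candidates[rng.choice(np.where(ok)[0])]
    return filled
```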

In [42], gender differences in negative symptom severity in schizophrenia were studied using the Collaborative Informatics and Neuroimaging Suite Toolkit for Anonymous Computation (COINSTAC) software platform. The authors used data collected by the Function Biomedical Informatics Research Network (FBIRN). COINSTAC [43] is open-source software that enables federated or decentralized data analysis by sharing analysis pipelines and communicating partial results, updated models, and other features. R scripts were written to read clinical and demographic data, calculate five-factor and two-factor model scores from each client's SANS (Schedule for the Assessment of Negative Symptoms) items, and regress these scores against gender. The five-factor model yields scores for avolition, anhedonia, alogia, blunted affect, and asociality. The two-factor model yields scores for Motivation/Apathy (MAP), a weighted combination of avolition, anhedonia, and asociality, and Expressiveness (EXP), a weighted combination of alogia and blunted affect. The SANS and gender data were stored in a standardized CSV file at each site; the spreadsheets could be located in any directory on the local system, as the user identifies the required files during data mapping. Data were collected from seven different institutions, and a simulation of seven clients was created. The results showed that males had significantly more severe total negative symptoms than females (P < 0.05); on closer inspection, men with schizophrenia had a higher EXP factor score than women.

Textual data

In [44], violence risk among psychiatric patients was predicted from Dutch clinical notes using natural language processing and federated learning techniques. Each data point corresponded to a patient's admission period and contained the concatenation of clinical notes from up to 28 days, including the first day of admission. Data points were labeled by whether or not a violent incident occurred (positive/negative outcome) during the 27 days following the first day of admission. The authors used Doc2Vec [45] to extract a 300-dimensional feature vector from the input text. The vector was then fed to a feed-forward neural network with one hidden layer with a ReLU activation function and one output layer with a single neuron and a sigmoid activation function to classify the output. Four models were trained and compared: two local models, one federated model, and one data-centralized model. The collected data were split between two institutions, A and B, where each client trained only on its share of the data; in the data-centralized approach, the model was trained on the full dataset. FedAvg was used to aggregate the models trained by the two local clients. The results indicated that the federated model outperformed the local models and performed similarly to the data-centralized model. A sketch of this pipeline is given below.
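The following sketch outlines the pipeline with gensim's Doc2Vec and a small Keras classifier. The hidden-layer width, training epochs, and the `notes`/`y` variables are illustrative assumptions, as they stand in for details and private data not reproduced here:

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from tensorflow import keras

# `notes` (list of clinical-note strings) and `y` (0/1 incident labels)
# are hypothetical stand-ins for the paper's private Dutch dataset.
docs = [TaggedDocument(note.lower().split(), [i]) for i, note in enumerate(notes)]
d2v = Doc2Vec(vector_size=300, min_count=2, epochs=40)  # 300-dim vectors as in [44]
d2v.build_vocab(docs)
d2v.train(docs, total_examples=d2v.corpus_count, epochs=d2v.epochs)
X = np.array([d2v.infer_vector(doc.words) for doc in docs])

# One hidden ReLU layer, one sigmoid output neuron, as described above.
clf = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(300,)),  # width assumed
    keras.layers.Dense(1, activation="sigmoid"),
])
clf.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
clf.fit(X, np.array(y), epochs=10, validation_split=0.2)
```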

4.2.2 Deep learning based classifiers

Twelve papers are included in this section.

Tabular data

In [46], the problem of depression detection was addressed using data collected by BiAffect, a mobile application with a special input keyboard. Three types of metadata were collected: alphanumeric characters, special characters, and accelerometer values. To ensure users' privacy, instead of the alphanumeric characters themselves, only the duration of each keypress, the time elapsed since the last keypress, and the horizontal and vertical distances from the last key were collected. The DeepMood [47] model was used as a classifier in its three variations, which differ mainly in the data fusion stage: the first used a multi-view machine layer, the second a factorization layer, and the third a conventional fully connected layer. Five experiments were conducted on each model: (1) local training, where each client had only its share of the data; (2) traditional centralized training; (3) a federated model using the FedAvg algorithm; (4) Institutional Incremental Learning (IIL), where each client sent its model to the next one after completing its training until all had trained once; and (5) Cyclic Institutional Incremental Learning (CIIL), which repeated the IIL process for a predetermined number of cycles.

Two data distribution scenarios were considered, IID and non-IID, and the testing accuracy was reported for each experiment. For IID, multiple numbers of clients were considered (4, 8, 12, 16, and 24), and different numbers of data points held by each party were also tested (100, 500, 1000, 1500, 2000, and 3000). For both IID and non-IID, the FL model achieved the second-highest accuracy in most experiments, after the centralized learning model, while preserving data privacy.

In [48], Obsessive-Compulsive Disorder (OCD) was detected using the OPPORTUNITY Dataset for Human Activity Recognition from Wearable, Object, and Ambient Sensors [35]. The authors used readings from the accelerometer and gyroscope sensors only. To simulate the repeated actions performed by OCD patients, a specific set of activities with a particular number of repetitions was assigned to each subject. The baseline model is a two-layer bidirectional Long Short-Term Memory network with a fully connected output layer and dropout between layers. For personalization, the last dropout and fully connected layers were trained individually on the local data, whereas the rest of the model was subject to the FedAvg algorithm; a sketch of this split is shown below. Four experiments were designed to test the model's performance: (1) traditional centralized training on the full dataset; (2) local training, where each client trained on its local data only; (3) federated learning using the FedAvg algorithm on a simulation of four clients without personalization; and (4) FL with personalization using three different personalization schemes. The results showed that FL and federated personalized learning outperformed both centralized and local model training.
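A minimal PyTorch sketch of this personalization split follows: only parameters of the shared base are averaged with FedAvg, while each client's head stays local. Layer sizes and names are illustrative assumptions, not those of [48]:

```python
import torch
import torch.nn as nn

class OCDNet(nn.Module):
    """Simplified stand-in for the paper's model: a shared bidirectional
    LSTM base (federated) plus a personal classification head (local)."""
    def __init__(self, in_dim=6, hidden=64):
        super().__init__()
        self.base = nn.LSTM(in_dim, hidden, num_layers=2,
                            bidirectional=True, dropout=0.5, batch_first=True)
        self.head = nn.Linear(2 * hidden, 2)  # kept on-device, never averaged

    def forward(self, x):
        out, _ = self.base(x)          # x: (batch, time, sensors)
        return self.head(out[:, -1])   # classify from the last time step

def aggregate_shared(global_model, client_models, sizes):
    """FedAvg over the shared base only; each client keeps its own head."""
    n = sum(sizes)
    state = global_model.state_dict()
    for name in state:
        if name.startswith("base."):   # personal "head." parameters skipped
            state[name] = sum((nk / n) * m.state_dict()[name]
                              for m, nk in zip(client_models, sizes))
    global_model.load_state_dict(state)
```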

In [49], Lee et al. used tabular data extracted from the electronic health records of five hospitals in South Korea to apply a real-world horizontal federated learning setting that can detect bipolar transitions in patients with depression. The team tackled the challenges of a federated real-world environment through four stages: standardized feature extraction, federated feature selection, FL, and cross-site evaluation. For standardized feature extraction, the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) [50] was used as the data format. The patient data in each hospital's electronic medical records were anonymized, standardized using OMOP CDM, and stored safely within each organization. The extract, transform, and load processes for OMOP CDM were carried out by a trustworthy broker, who ensured that only data from which personal information had already been removed were used. The authors added the second stage, federated feature selection, due to the lack of powerful computational resources at the contributing hospitals. LightGBM [51] was used for this phase, as it trains quickly even on CPUs and has proven its feasibility on medical tabular datasets. Only features present in all of the internal datasets were considered, and an early stopping criterion halted the search if performance did not increase by more than 2% in three consecutive searches; only 100 out of 21,042 features were selected to train the FL model. In the third stage, the FL process, the authors used federated averaging for weight aggregation and applied differentially private stochastic gradient descent (DP-SGD) during the local update to ensure differential privacy (a sketch of the DP-SGD step is given below). A Deep and Cross Network (DCN) [52] was trained on data from four hospitals, while the fifth hospital's data were kept for validation. The model was trained for five FL rounds, with five local epochs per round. Lastly, in the cross-site evaluation stage, the federated model and the four local models were compared on each hospital's internal and external validation datasets, using mean AUC as the evaluation metric. The reported mean AUC of the federated model was 0.726 across all test datasets, while the local models trained on each hospital's data had mean AUCs of 0.642, 0.662, 0.707, and 0.692, respectively, indicating that the federated model generalizes better than any local model.
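The following is a minimal sketch of the core DP-SGD step applied during a local update: each per-sample gradient is clipped to a fixed L2 norm and Gaussian noise is added before the averaged update. Hyperparameter values are illustrative, not those of [49]:

```python
import torch

def dp_sgd_step(model, loss_fn, batch, lr=0.01, clip=1.0, sigma=1.0):
    """One DP-SGD step: clip every per-sample gradient to L2 norm `clip`,
    add Gaussian noise, then apply the averaged update to the model."""
    xs, ys = batch
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(xs, ys):                    # per-sample gradients
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip / (norm + 1e-6), max=1.0)
        for s, g in zip(summed, grads):
            s.add_(g * scale)                   # accumulate clipped gradient
    with torch.no_grad():
        for p, s in zip(model.parameters(), summed):
            noise = torch.normal(0.0, sigma * clip, size=s.shape)
            p.add_(-(lr / len(xs)) * (s + noise))  # noisy averaged step
```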

Textual data

In [30], posts collected from both Reddit and Twitter were used to address the problem of detecting suicidal ideation on social media. For text classification, the authors trained two local data-preserving deep learning models: a CNN and an LSTM. A new optimization algorithm, average difference descent for learning with data protection (AvgDiffLDP), was proposed for aggregating the locally trained models at the centralized server. AvgDiffLDP used the gradient of the average differences between the server's parameters at the previous time stamp and the updated users' parameters at the current time stamp. The updated model parameters were then sent back to the local users/clients and trained using stochastic gradient descent. The authors conducted three experiments: SimpleLDP, AvgDiffLDP, and centralized NonLDP, with the collected data distributed among users in the first two. In SimpleLDP, separate local data-preserving models were trained for each user on different devices without sharing data or parameters. In AvgDiffLDP, multiple users were trained locally using the newly proposed optimization algorithm. In centralized NonLDP, the entire dataset was used to train one centralized model on the server. Average testing accuracy and the average area under the receiver operating characteristic curve (AUC) were reported. The LSTM results were slightly better than the CNN's. Even though the centralized model performed better than AvgDiffLDP, the proposed model preserved data privacy, which is critical when dealing with such sensitive data.

In [53], Italian text sentences from the ANDROIDS project were used to predict depression. The authors trained a Long Short-Term Memory (LSTM) neural network for text classification. Two experiments were conducted: one centralized and one federated with three simulated clients, using FedAvg to aggregate each client's trained model parameters. The federated model's architecture had four layers: an embedding layer, a bidirectional LSTM, an LSTM, and a dense layer. Categorical cross-entropy was used as the loss function, while the Adam algorithm was adopted to train the model. The testing accuracy reported for each experiment showed that the centralized model outperformed the federated one.

In [33], Li et al. proposed a CNN Asynchronous Federated optimization (CAFed) depression detection system. The system adopted a text-based convolutional neural network model (Text-CNN) for detecting depression from Weibo posts. The team collected data from 900 users throughout an entire year. The proposed model consisted of the following layers: (1) Embedding layer where the Weibo vector was formed using one user’s data. (2) Convolution layer where various filter sizes were used with ReLU activation function to get the feature maps. (3) Max pooling layer to get the most important features and create the final feature vector. (4) Dropout layer to avoid overfitting. (5) A fully connected layer and an output layer with a sigmoid activation function to classify the output.

The proposed CAFed algorithm started out like FedAvg, except that when updating the model, CAFed updated the global model instantly after receiving local updates from any client. To preserve privacy, Gaussian white noise with a mean of 0 and a variance of 1 was added at the server to adjust the global values and keep each device's contribution hidden. The authors compared CAFed to FedAvg and found that CAFed converged faster: FedAvg waited for all ten user devices in the experiment to respond in each epoch, whereas CAFed required only one device's response to proceed to the next epoch. Furthermore, FedAvg incurred more communications than CAFed in each global epoch. Overall, CAFed converged faster than FedAvg for the same communication overhead. A sketch of such an asynchronous server loop is given below.
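The following minimal sketch contrasts this with synchronous FedAvg. The mixing coefficient and the exact placement of the noise are illustrative assumptions; the paper specifies only that the global model is updated upon any client's response and that zero-mean, unit-variance Gaussian noise is added at the server:

```python
import torch

def cafed_server_loop(global_model, update_queue, rounds, mix=0.5, noise_std=1.0):
    """CAFed-style asynchronous aggregation: the server updates the global
    model as soon as ANY client responds, instead of waiting for all clients
    as FedAvg does. `update_queue` is a queue.Queue of client state_dicts;
    `mix` (how far to move toward each update) is an assumed hyperparameter."""
    for _ in range(rounds):
        client_state = update_queue.get()      # blocks until one client replies
        g_state = global_model.state_dict()
        for name, g in g_state.items():
            if not g.is_floating_point():
                continue                       # skip integer buffers
            # Gaussian noise hides the responding device's exact contribution.
            noisy = client_state[name] + noise_std * torch.randn_like(g)
            g_state[name] = (1 - mix) * g + mix * noisy
        global_model.load_state_dict(g_state)
        # the refreshed model would be sent back to the responding client here
```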

In [54], Ahmed et al. proposed a hyper-graph attention-based federated learning model for detecting depressive symptoms from text collected from patients using the standard PHQ-9 questionnaire. Data were collected from different internet forums and questionnaire websites. Two feature extraction approaches were used: the first relied on an emotional lexicon, while the second used a structure-aware graph model. For vectorization, both models used 300-dimensional GloVe vectors. The embedding method was used to convert text into node vectors in the lexicon of nine symptoms, and the structure embedding model then used the hyper-graph to extract word-based node patterns. The text was then labeled using the trained embedding, depending on the question. Two models were built to classify the extracted features: the baseline was a feed-forward neural network with three hidden layers of 30, 20, and 10 units with ReLU activations and a final nine-unit sigmoid output layer; the other was a recurrent neural network with long short-term memory (LSTM) units and an attention position layer. Compared directly to the baseline, the LSTM network achieved a relatively high level of performance. To apply federated learning, a global initial model was sent to six clients, each of which trained a local model on its part of the dataset; the FedAvg algorithm was used to update the global model parameters. Based on the validation loss, each client could choose whether to use the updated global model or the best iteration of its local model. The proposed system achieved a 0.86 ROC score.

In [55], Basu et al. used data scraped from Twitter to address the problem of detecting depression and sexual harassment. The team investigated the effects of differential privacy (DP) on training contextualized BERT-based [56] language models in both a centralized and an FL setting. They used four natural language processing (NLP) models: BERT, ALBERT [57], RoBERTa [58], and DistilBERT [59]. Four experiments were carried out: a baseline NLP model, a DP NLP model, an FL NLP model, and an FL+DP NLP model. The team tried both IID and non-IID data distributions for the federated learning setting in an HFL data partitioning scheme, using the FedAvg algorithm to aggregate a simulation of ten clients. The reported results were as follows: with differentially private training, smaller networks such as ALBERT and DistilBERT exhibited a more gradual degradation than larger models like BERT and RoBERTa. Utility degradation was higher in the non-IID FL setting, the typical scenario in medical applications, than in the IID arrangement, indicating the need for training methods adapted to such setups. Finally, when the size of the training dataset was limited, the impact of differential privacy on utility was more deleterious than when a larger amount of data was available.

Image data

In [60], ResNet-18 was adapted to detect patients with depression from their structural brain MRI (3D T1-weighted) scans. Data were collected from 23 different sites, but as the sites' datasets were limited in size, the data were partitioned among five clients where the local models were trained. Encrypted gradients from the clients were weighted and aggregated at the centralized server to produce the updated global gradients at the end of each epoch, and the updated model was then redistributed to the clients to continue training. The average accuracy of five-fold cross-validation was reported: the federated models outperformed the local models by 0.2\(\sim \)4.33% for each of the five groups.

In [61], Federated Multi-Task Learning for Joint Diagnosis (FMTLJD) used MRI scans to diagnose three mental disorders: schizophrenia (SCZ), attention-deficit/hyperactivity disorder (ADHD), and autism. The data were aggregated from three publicly available databases: the Center for Biomedical Research Excellence (COBRE, for SCZ) [36], the ADHD-200 Competition (ADHD-200, for ADHD) [37], and the Autism Brain Imaging Data Exchange I (ABIDE, for ASD) [62]. The authors proposed a federated contrastive learning-based feature extractor (FCLFE) that used the Pearson correlation coefficient (PCC) to compute brain functional connectivity features. A Gaussian noise augmentation step was added to reduce the risk of overfitting, and the augmentation output was fed into a multi-layer perceptron (MLP) with non-linear transformations to extract a higher level of abstraction. To train the extracted features of each dataset, a federated multi-gate mixture-of-experts classifier (FMMoE) was proposed. The classifier consists of expert networks and gating networks: given multiple task inputs, the expert networks, built by group stacking of neural networks, learn the various feature representations, while the gating networks learn an optimal mixture pattern by assembling the experts with different learned weights. An MLP constructed on each task's MMoE output acted as a tower network to refine the task-specific representation and make predictions. To simulate the federated learning process, the data were divided among four clients, and FedAvg was used to aggregate the local models and update the shared one. Modifying the minibatch SGD optimization process, differentially private stochastic gradient descent (DP-SGD) [63] was applied to the private local datasets of the client models to ensure the privacy of distributed data processing. Four scenarios were created to evaluate the proposed model: non-federated (centralized) and federated modes, each with and without multi-task learning. The results were mixed: the centralized MTLJD model outperformed the federated one overall, but the FMTLJD model exceeded the centralized model on the ABIDE and ADHD-200 databases. This also demonstrated that, besides lowering the risk of privacy leakage, FMTLJD enabled reliable diagnostic detection competitive with the ideal scenario of gathering all multi-site data for training.

Audio data

In [64], English audio recordings from clinical interviews were used for depression detection. The data are available online through the DAIC-WOZ dataset [38, 65]. A convolutional neural network (CNN) model was proposed to classify the extracted audio features. The authors used Mel-frequency cepstral coefficient (MFCC) features, generating 13-dimensional MFCCs from each speech segment using 26 filters from the mel filter bank with a window size of 25 ms and a step size of 10 ms. All MFCC coefficients were normalized to prevent their wide value range from hampering training (a sketch of this front-end follows below). The proposed CNN consisted of three convolution layers with 32, 64, and 128 filters of size 3x3, each followed by a ReLU activation function. A max-pooling layer of size 2x2 reduced the dimensionality of the output feature maps. The output features were then routed to two fully connected layers with 64 and 32 hidden units, respectively, followed by a dropout layer (dropout rate 0.1); each fully connected layer was activated by the ReLU function. Finally, a neuron with sigmoid activation predicted whether a person was depressed. An SGD optimizer trained the model with binary cross-entropy loss. Three experiments compared the performance of FedAvg with the centralized baseline. Centralized learning achieved the best results among the three, with 96.8%, 93.7%, and 92.3% for accuracy, precision, and recall, respectively. The two FL approaches, IID and non-IID, were trained multiple times, each with a different number of clients (8, 56, and 189). In the IID scenario, the more clients contributed to the learning process, the lower the model accuracy, as the amount of data each held decreased. The non-IID scenario produced lower results than the centralized and IID scenarios. Such performance degradation was expected, because data heterogeneity across clients caused the computed local model updates to drift in different directions, resulting in suboptimal server updates. A large number of clients with more distinct data distributions may make global model convergence more difficult.
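The feature front-end described above can be sketched with librosa as follows; the per-coefficient z-scoring is our assumption of how "all MFCC coefficients were normalized":

```python
import librosa
import numpy as np

def extract_mfcc(path):
    """13-dim MFCCs from 26 mel filters, 25 ms windows, 10 ms steps,
    normalized per coefficient, matching the front-end described above."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13, n_mels=26,
        n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    # z-score each coefficient over time to tame its wide value range
    mfcc = (mfcc - mfcc.mean(axis=1, keepdims=True)) \
           / (mfcc.std(axis=1, keepdims=True) + 1e-8)
    return mfcc.T  # shape: (frames, 13)
```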

Suhas et al. [28] also addressed the problem of depression detection using speech analysis. They used a subset of the clinical audio recordings available online through the DAIC-WOZ dataset [38] to ensure a balanced data distribution across classes and genders. Two classification tasks were considered: depression detection and depression severity. The scipy.signal.spectrogram function was used to extract log-spectrogram features from an overlapping window with a duration of 1 s and a shift of 0.1 s. The spectrogram images aided in modeling both the temporal and harmonic structures of the audio signals, resulting in better classification performance than existing methods. GoogLeNet, MobileNetV2, and ResNet-18 were used to classify the input spectrograms via transfer learning. Three training scenarios were designed: one data-centralized and two federated learning frameworks, using the FedAvg algorithm and federated matched averaging (FedMA) [66]. FedMA was designed for modern neural network architectures such as CNNs and LSTMs; it updates the global model parameters layer-wise by matching and averaging hidden elements (filters for CNNs, neurons for deep feed-forward networks) with similar feature extraction signatures. Five-fold cross-validation accuracy was reported to compare the models. Across folds, the centralized approach outperformed the federated methods by 6-10%, with the best average five-fold accuracy of 0.934 versus 0.91 for the federated scheme. The centralized approach was approximately 1.55-2.19x faster than the federated schemes, with ResNet-18 being the fastest in both the centralized (155 s) and federated (327 s and 340 s, respectively) settings. The FL models still outperformed previous work using the same dataset and allowed a robust assessment of depression with only a 4-6% accuracy loss compared to the centralized approach. TensorFlow Lite was used to develop a mobile application that determines whether speech contains depression symptoms and, if so, how severe they are. The FL models were energy-efficient, with low inference latency and a small memory footprint.

Table 4 Summary of the reviewed articles

5 Challenges and limitations

Applying federated learning in mental health faces a number of challenges that can sometimes lead to limitations. In this section, we discuss limitations observed while analyzing the reviewed literature. Deploying a real-world FL setting faces the following challenges:

  • Privacy Leakage and Patient Consent. In the real world, FL necessitates using personal health data that require regulatory compliance and user acceptance. The latter will not be achieved unless patients have complete confidence that their privacy is protected by the federated learning application. Not all of the reviewed papers considered using a privacy-preserving algorithm such as differential privacy to secure their models [44, 48, 60].

  • Data Heterogeneity. When working with data collected from different sources, it is common to encounter inconsistencies or discrepancies in the types of data fields available. This variance across sources limits what the model can learn during training. In such cases, models usually rely on the data fields that overlap across sources, leaving out important information that could help better identify, diagnose, or treat the mental disorder [49].

  • Data Unification. The nature of clinical data necessitates creating a unified process for gathering data from various sources, ensuring a coherent view that can be utilized efficiently and effectively. This process requires time and resources, and hence complicates and slows down research while also limiting its transparency, interoperability, reproducibility, and scalability.

  • Computational Power. Hospitals and psychiatric clinics do not always have the powerful computational resources needed for the local training step of the AI model. The speed of training is limited by the slowest resource that sends its local update; this was one of the limitations of [49], as the authors had to rely on the hospitals' available CPUs.

  • Communication Overhead and Network Stability. Sharing the model between the centralized server and multiple clients over numerous FL rounds results in communication overhead and hence creates a bottleneck for the system. It also requires a stable, secure network connection through which users can upload their updated, locally trained models.

6 Conclusion and discussion

This systematic review highlighted the previous attempts to use federated learning with mental health applications. It followed the SYMBALS methodology to conduct the SLR and answer the main and sub-research questions. Table 4 summarizes the sixteen papers that were selected for review after applying all the inclusion and exclusion criteria.

Besides answering the research questions and providing quantitative analysis, this review explained each included paper in detail in terms of the learning model used (traditional machine learning or deep learning), the feature extraction methods, the federated learning algorithm used, the data type and its distribution among clients in a simulated or real-world environment, and the addressed mental disorder. Our findings indicate that the published research shows high potential, but a considerable gap still needs to be filled through the coming research directions.

The first observation is that relatively little published research exists in this specific direction. This is surprising, given that mental health is one of the fields best placed to benefit from the privacy-preserving trait of federated learning, and that FL has now been widely adopted for almost five years. Researchers are encouraged to explore and conduct more research in this area.

Secondly, only one of the published papers applied a real-world federated learning scenario in which models are sent back to users' devices and trained on their local data; until now, FL has predominantly been employed in simulated environments. To assess the proposed systems fairly, we use the Technology Readiness Level (TRL) assessment, a widely accepted metrics-based process that evaluates the maturity of technologies under development. It rates technologies on a scale from 1 to 9, where 1 is the lowest level of readiness and 9 indicates that the technology is implemented in its final form in an actual application. Mapping the reviewed papers onto the TRL guidelines and constraints, only one paper scored between 7 and 9; the rest received scores between 4 and 6 (prototype level), as none of them was demonstrated in an actual operational environment. The main differences among the papers were how the simulation was conducted, the corresponding experimental settings, and the number of clients in each case.

Thirdly, none of the research in the mental health domain explores vertical FL or federated transfer learning. Current research focuses on horizontal federated learning, where each local dataset used to train each client's model has the same features, i.e., each client trains on the same feature set for different patients. In some of the reviewed literature, the initial model seed shared by the centralized server with the clients was a pre-trained deep learning model, i.e., transfer learning was used. However, this setting does not follow the definition of FTL given in Section 2, where one client transfers its trained model to another to fine-tune it on a similar problem; rather, it follows horizontal partitioning, where all clients contribute to training one model that addresses one problem with datasets sharing the same features. Judging from the performance levels and findings of the papers, there is still room for exploring potential enhancements using the other FL techniques, i.e., VFL and FTL. Both approaches could benefit mental health applications, as each patient can present varying symptoms (features), given a robust global model. FTL can be particularly useful when one client has a relatively small set of labeled data on which no model can generalize well when trained solely on that set; in FTL, such a client can exploit a model trained at another client with a larger, somewhat similar dataset. VFL, on the other hand, is important because it enhances the characterization of samples by incorporating features from different sources, boosting the model's capabilities. More research should address validating the efficiency of FL with this type of sensitive data.

Federated learning, in general, remains an emerging area of research. The reviewed literature shows huge potential for the use of FL, specifically for mental health applications across different types of data. As future directions, researchers are encouraged to develop new machine learning and deep learning techniques that follow the FL approach with better efficiency and accuracy, as there is still considerable room for improvement in real-world settings, and to explore the potential of the different federated learning types. Future research should also focus on bridging the gap by deploying robust privacy-preserving algorithms, creating a unified system for data collection from different institutions, and ensuring that participating hospitals have stable network connectivity and sufficiently powerful hardware. Such improvements in model performance while preserving patient privacy can be the key to increased accessibility of personalized mental health care.