1 Introduction

The World Health Organization (WHO) has described mental health as a state of wellbeing in which individuals can realize their abilities, cope with the everyday stresses of life, work productively, and contribute to their community (WHO, 2018). According to recent estimates (Ritchie, 2018; WHO, 2020), approximately one billion individuals suffer from mental health disorders (e.g., depression, anxiety, bipolar disorder), costing the global economy trillions of dollars in disability payments and lost productivity. However, a significant proportion of individuals with mental illnesses do not receive the treatment or quality of care they need, often due to resource shortages (Docrat et al., 2019; Petersen et al., 2019; Wainberg et al., 2017; Wang & Cheung, 2011). For example, many countries have less than one psychiatrist for every 100,000 people (Hanna et al., 2018; Jenkins et al., 2010). Moreover, certain tools and methods used by mental health professionals to make care-related decisions (e.g., formulating accurate diagnoses) are inadequate (Kilbourne et al., 2018; Wang & Cheung, 2011). Additionally, recent pandemics (e.g., COVID-19) and epidemics (e.g., opioid overdose) have exacerbated the mental health crisis worldwide (Johnson et al., 2021; Khanal et al., 2020; Ransing et al., 2020). Thus, there is a need for innovative tools that can help mental health professionals (e.g., psychiatrists, counselors) make more efficient and accurate diagnostic decisions (Casado-Lumbreras et al., 2012; Perkins et al., 2018; Thieme et al., 2020).

The primary diagnostic guidelines used by mental health professionals to detect and classify mental disorders according to the type, intensity, and duration of symptoms are the Diagnostic and Statistical Manual of Mental Disorders (DSM) and the International Classification of Diseases (ICD) (American Psychiatric Association, 2013; Fabiano & Haslam, 2020; Fellmeth et al., 2021). These tools, despite their frequent use, have certain limitations. For instance, the criteria used to identify diseases are common to many diagnoses; thus, diagnostic groups cannot be clearly separated from one another (Lin et al., 2006; Ogasawara et al., 2017; Park & Kim, 2020; Sleep et al., 2021). Moreover, these tools do not consider additional factors such as demographic and biochemical data, information obtained during a patient interview, family history of mental illness, or the individual’s response to medications. They also examine patients according to a binary structure (i.e., patient vs. non-patient), resulting in misdiagnoses and inappropriate treatment plans (Garcia-Zattera et al., 2010; Pechenizkiy et al., 2006; Sleep et al., 2021).

Due to these limitations, mental health professionals often use other diagnostic guidelines, particularly the Symptom Checklist 90-Revised (SCL-90-R), which has become the most common assessment tool used around the world (Akhavan, Abiri, & Shairi, 2020b; Hildenbrand et al., 2015; Li et al., 2018). This tool includes 90 questions that are employed to assess 10 primary mental disorders: somatization, obsessive-compulsive disorder, interpersonal sensitivity, depression, anxiety, hostility, phobic anxiety, paranoid ideation, psychoticism, and additional items (i.e., sleep, appetite, and feelings of guilt) (Gallardo-Pujol & Pereda, 2013).

Despite the availability of various diagnostic guidelines and tools, many mental health professionals still struggle to diagnose patients accurately and efficiently. Consequently, recent literature has called for more research on the use of cutting-edge technologies (e.g., artificial intelligence [AI]) to develop decision support systems (DSSs) that can help mental health professionals make evidence-based treatment decisions and guide policymakers in digital mental health implementation (Balcombe & Leo, 2021; D’Alfonso, 2020). This study responds precisely to this call and addresses the challenges regarding the guidelines and tools used to diagnose mental disorders. It mainly aims to build a mental health assessment tool and develop an AI-based DSS that enables mental health professionals to accurately diagnose mental disorders.

The remainder of this paper is organized as follows. First, Sect. 2 summarizes the relevant literature on standard mental health assessment tools, the use of AI to detect mental disorders, and the ethical guidelines for designing such AI-based tools. Next, Sect. 3 explains the research framework and applies it to a real-world dataset to demonstrate how AI can be used to build a mental health assessment tool and develop an efficient DSS that accurately diagnoses mental disorders. Finally, Sect. 4 discusses the implications of the proposed DSS, while Sect. 5 offers concluding remarks.

2 Literature Review

2.1 Assessment tools for mental disorder diagnosis

One of the primary tools used to assess mental health is the SCL-90-R. The instrument features 90 questions scored on a five-point scale ranging from 0 to 4 to denote symptom occurrence (e.g., 0 for symptoms that never occur and 4 for symptoms that occur extremely frequently) (Derogatis, 2017). The SCL-90-R can be used to diagnose various mental disorders, as provided in Table 1 (Derogatis, 2017; Holi, 2003; Prinz et al., 2013).
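To make the scoring concrete, the sketch below shows how responses on the 0–4 scale roll up into dimension scores (each dimension is commonly scored as the mean of its item responses, and a global index averages all items). The item-to-dimension mapping and the six-item instrument here are toy illustrations, not the official SCL-90-R key.

```python
# Illustrative SCL-90-R-style scoring: each symptom dimension is the mean of
# its item responses (0-4); a global severity index averages all items.
# The mapping below is a hypothetical example, not the official scoring key.
def score_checklist(responses, dimension_items):
    """responses: {item_number: score 0-4}; dimension_items: {dimension: [items]}.
    Returns (per-dimension mean scores, global index)."""
    scores = {
        dim: sum(responses[i] for i in items) / len(items)
        for dim, items in dimension_items.items()
    }
    global_index = sum(responses.values()) / len(responses)
    return scores, global_index

# Toy six-item instrument with two hypothetical dimensions:
responses = {1: 2, 2: 3, 3: 1, 4: 0, 5: 4, 6: 2}
mapping = {"depression": [1, 2, 3], "anxiety": [4, 5, 6]}
dim_scores, gsi = score_checklist(responses, mapping)
print(dim_scores["depression"])  # mean of items 1-3 -> 2.0
print(gsi)                       # mean of all six items -> 2.0
```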

Table 1 Mental Disorders Diagnosable using the SCL-90-R (Excluding “Other”)

The SCL-90-R has gained the attention of many researchers, becoming the most widespread mental disorder diagnostic tool used around the world (Barker-Collo, 2003; Bernet et al., 2015; Chen et al., 2020; Kim & Jang, 2016; Olsen et al., 2004; Rytilä-Manninen et al., 2016; Schmitz et al., 2000; Sereda & Dembitskyi, 2016; Urbán et al., 2016). However, the SCL-90-R is lengthy and requires a considerable amount of time to answer all questions. Thus, various studies have proposed using statistical techniques to reduce the number of questions (Akhavan, Abiri, & Shairi, 2020a; Imperatori et al., 2020; Lundqvist & Schröder, 2021). For example, Derogatis (2017) developed the BSI-18, which features 18 questions and is used to diagnose somatization, anxiety, and depression. Prinz et al., (2013) created the SCL-14 to diagnose somatization, phobic anxiety, and depression. However, because they feature fewer questions, these tools cannot diagnose all 10 mental health disorders that the original SCL-90-R can detect.

Despite its effectiveness, the SCL-90-R has some limitations. For example, the SCL-90-R includes several sets of questions, each reserved for a specific mental disorder. For instance, 13 questions are reserved for depression, while nine address anxiety. When mental health professionals interpret responses to these questions, they do not account for interactions among different groups of questions, meaning that each set of questions only linearly contributes to the diagnosis of a specific mental disorder. Another issue with the SCL-90-R is its length. Answering all of the questions on the SCL-90-R requires substantial time and can be exhausting, drastically reducing participation and completion rates (Galesic & Bosnjak, 2009). Despite efforts to shorten it, the best version for diagnosing the same mental disorders with fewer questions remains the BSI-53, which features 53 questions (Derogatis & Spencer, 1993). Reducing the number of questions on a diagnostic tool, however, brings up further challenges. For example, the SCL-14 can only assess somatization, phobic anxiety, and depression, while the SCL-25 can only diagnose depression and anxiety. Some scales (e.g., the SCL-27 and SCL-27-Plus) include altered questions that are drastically different from those on the original SCL-90-R and are used to assess various sets of disorders (Hardt & Gerbershagen, 2001; Jochen & Hardt, 2008).

Due to the limitations in the past literature, various researchers have emphasized the urgent need to reduce the length of the SCL-90-R (Kruyen et al., 2013) without compromising on the number of disorders that it can diagnose (Graham et al., 2019; Hao et al., 2013; Luxton, 2016; Nie et al., 2012). This study responds to this call and attempts to reduce the number of questions used to assess and detect all 10 mental disorders through the use of AI.

2.2 DSSs for mental disorder diagnosis

Several studies have designed and deployed DSSs for the diagnosis and treatment of mental disorders. For instance, the Computerized Texas Medication Algorithm Project is used to support diagnosis, treatment, follow-up, and preventive care decisions related to major depressive disorders and can be incorporated into a clinical setting (Trivedi et al., 2004). A clinical DSS called the SADDESQ, constructed by Razzouk et al., (2006), can be used to diagnose schizophrenia spectrum disorders, based on variables such as symptoms of psychosis and the number and duration of seizures. The Sequenced Treatment Alternatives to Relieve Depression is another DSS that assists doctors with determining the optimal dose and timing of medications by considering changes in symptoms and the medications’ side effects (Sinyor et al., 2010).

In addition to the above-mentioned studies, there have also been efforts to design AI-based DSSs to produce more accurate diagnoses of mental disorders. For instance, Mueller et al., (2011) detected attention-deficit/hyperactivity disorder (ADHD) using a DSS that employs a support vector machine (SVM) model constructed using individuals’ responses to the BSI tool. Nie et al., (2012) diagnosed mental disorders using an SVM model built via the SCL-90-R tool. Similarly, Hao et al., (2013) determined individuals’ mental health through SVM and artificial neural network (ANN) models constructed by combining an SCL-90-R response dataset with social media blogs. Chekroud et al., (2017) proposed a DSS that can cluster depression symptoms. A software platform created by Rovini et al., (2018) can be used to help clinicians diagnose Parkinson’s disease early on by evaluating various non-motor symptoms. Stewart et al., (2020) proposed a DSS based on tree-based learning such as decision trees (DTs) and random forest (RF) to identify children at the highest risk of suicide and self-harm. Chen et al., (2020) utilized a DSS built through deep learning to screen and score dementia patients. Zhang et al., (2020) employed biological markers and genetic data to propose a deep learning framework for recognizing and diagnosing mental disorders early on.

Despite these previous efforts to develop AI-based DSSs to detect and diagnose mental disorders, there are still various gaps in the literature. First, most diagnostic tools are administered on paper instead of digitally, which requires individuals to visit the clinic in person. Also, mental health professionals must expend a significant amount of effort to manually analyze the answers and derive an accurate diagnosis. Additionally, mental health professionals do not investigate the relationships among the various variables used to assess mental health. Many previous studies focus on a particular set of disorders instead of assessing multiple disorders at the same time. Moreover, these studies did not consider ethical design elements when creating AI-based DSSs. To the best of our knowledge, ours is the first study that aims to detect 10 disorders using the same set of variables.

2.3 Ethical issues related to AI and mental disease diagnosis

Several researchers have indicated that many information systems (IS) researchers do not consider practitioners’ needs when designing DSSs (Dennehy et al., 2021). Meanwhile, these practitioners mostly rely on vendors and consultants rather than IS researchers to solve IS-related problems (Dennehy et al., 2021). In addition, many studies have emphasized that IS researchers should integrate ethical guidelines obtained from IS practitioners when creating AI-based DSSs, particularly in healthcare, where patients’ health and wellbeing are at stake (Fosso Wamba & Queiroz, 2021). Therefore, there is a growing need to forge alliances and improve collaboration between IS researchers and practitioners when designing ethical and socially responsible AI-based DSSs.

The primary problem with AI-based DSSs is that they can be unfair and biased in their decision-making, specifically if they are developed via Blackbox algorithms and trained on various variables (Akter et al., 2021; Parra et al., 2021; Tsamados et al., 2021). There is evidence in the literature that AI-based DSSs can perpetuate and exacerbate gender and racial biases and discriminate against some members of society more than others (Gupta et al., 2021; Mittelstadt et al., 2016; Mittelstadt & Floridi, 2016). These studies noted that AI solutions are value-laden and have biases that are “specified by developers and configured by users with desired outcomes in mind that privilege some values and interests over others.” Additionally, most AI algorithms are not transparent and are difficult to explain (Buhmann & Fieseler, 2021). AI-based DSSs can yield unintended negative consequences if certain variables and features (e.g., gender and race) are utilized in the training dataset (Crawford & Calo, 2016; Mittelstadt et al., 2016). For example, HireVue, a recruiting-technology firm, designed an AI-based DSS to screen job applicants. According to AI researchers, however, the HireVue application penalized nonnative speakers, visibly nervous interviewees, and anyone else who did not fit its model for look and speech. Another example is COMPAS, an AI application designed to assess recidivism risk; its predictions assigned a higher probability of recidivism to Black and brown men than to all other persons (Kirkpatrick et al., 2017). Thus, there have been proposals to create guidelines for integrating ethics into the process of AI-based DSS design to prevent misuse and potential biases, specifically in the healthcare field (Borgesius, 2020; Floridi & Cowls, 2021; Johnson et al., 2021).

Gerke et al., (2020) and Chandler et al., (2020) proposed a theoretical framework for the ethical implementation of AI models in the healthcare industry. Their framework has three major principles: fairness, trustworthiness, and transparency. Fairness (i.e., an AI system without biases) particularly concerns data collection and variable types when training AI models. Biases result from the datasets themselves or from how researchers choose and analyze the data (Price, 2019). For instance, including variables such as race, gender, and insurance payer type as a proxy for socioeconomic status when training an AI-based DSS in the healthcare field introduces biases (Chen et al., 2019). Trustworthiness indicates users’ confidence in AI systems and has three core elements (IBM, 2020). First, the purpose of AI is to augment human intelligence and support decision-making, not completely replace it. Second, creators of AI-based DSSs own the data and the insights; thus, they are liable for the decisions that the AI makes. Third, AI systems must be transparent and explainable. Transparency refers to knowledge of the internal structure of AI-based DSSs; in other words, transparency indicates that the predictions of the AI model used within the DSS can be properly explained (Fosso Wamba et al., 2021; Fosso Wamba & Queiroz, 2021). Transparency allows practitioners to verify that the AI models have been thoroughly tested and make sense, and to understand why particular decisions are made. Issues related to the trustworthiness of AI-based DSSs arise with the use of “Blackbox” algorithms because users cannot provide a logical explanation of how the algorithm arrived at its given output (Schönberger, 2019). Therefore, explainable, Whitebox AI algorithms are recommended over complex Blackbox models. However, there are cases (e.g., image data) when Blackbox models need to be used.
In such cases, a post-hoc model analysis process (e.g., SHAP, LIME) is needed to further understand the decisions made by these Blackbox models.
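To illustrate the idea behind such post-hoc analyses, the sketch below implements a LIME-style local surrogate from scratch (it does not use the actual LIME or SHAP libraries): it perturbs an instance, queries the Blackbox predictor, and fits a distance-weighted linear model whose coefficients approximate each feature’s local contribution to the prediction. All names and the stand-in Blackbox model are illustrative.

```python
import numpy as np

# Minimal LIME-style local surrogate (illustrative): perturb an instance,
# query the black-box predictor, and fit a distance-weighted linear model
# whose coefficients approximate local feature effects.
def local_surrogate(predict_fn, x, n_samples=500, scale=0.5, seed=0):
    rng = np.random.default_rng(seed)
    Z = x + rng.normal(0.0, scale, size=(n_samples, x.shape[0]))  # perturbations
    y = predict_fn(Z)                                             # black-box outputs
    w = np.exp(-np.sum((Z - x) ** 2, axis=1) / (2 * scale ** 2))  # locality weights
    A = np.hstack([np.ones((n_samples, 1)), Z])                   # intercept + features
    W = np.diag(w)
    coef = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)              # weighted least squares
    return coef[1:]                                               # per-feature local effects

# Stand-in black box: only feature 0 actually matters.
black_box = lambda Z: 3.0 * Z[:, 0] + 0.0 * Z[:, 1]
effects = local_surrogate(black_box, np.array([1.0, 1.0]))
print(effects)  # approximately [3.0, 0.0]: feature 0 drives the local prediction
```

Production libraries such as SHAP and LIME add sampling schemes, feature masking, and theoretical guarantees on top of this basic idea.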

As many studies have stated, there is a need for a methodology that integrates ethics into the design of AI-based DSSs from the start of a project. For instance, Morley et al., (2019) emphasized that the ethical challenges raised by implementing AI in healthcare settings should be tackled proactively rather than reactively. This brings up the concept of “Ethics by Design,” which is concerned with the algorithms and tools needed to endow AI-based DSSs with the ability to reason about the ethical aspects of their decisions and to ensure that their behavior remains within given moral bounds (D’Aquin et al., 2018; Dignum et al., 2018; Iphofen & Kritikos, 2019).

In this study, the “Ethics by Design” approach was utilized to create an AI-based DSS that can diagnose mental disorders. Throughout the design process, a close cooperative relationship was established between the IS researchers designing the AI-based DSS and the practitioners using it (i.e., mental health professionals) to ensure greater impact and address concerns regarding the social and responsible use of AI, as discussed by Dennehy et al., (2021) and Morley et al., (2019). Furthermore, various procedures were embedded in the design of the DSS to ensure the fairness, transparency, and trustworthiness of the proposed DSS, as discussed in the methodology section.

3 Methodology and application

3.1 Phase I: data collection and preprocessing

We developed a web portal called Psikometrist that digitally collects participants’ responses to the SCL-90-R and saves them in a database. Mental health professionals such as counselors, psychologists, and psychiatrists seeking to use this platform must register and obtain preauthorization. After registering and obtaining access, these mental health professionals can administer the SCL-90-R test on the Psikometrist platform by sending an encrypted link to the participant. The questions appearing via the platform have time thresholds, establishing the minimum amount of time necessary to read and answer them. The platform compares the time that patients spend answering an SCL-90-R question with the minimum threshold and excludes patients who answer in less time than that. Using the encrypted link, participants can securely access the SCL-90-R questions. Before answering the questions, participants or their guardians sign an informed consent form indicating that their answers to the SCL-90-R questions can be anonymously used to develop a decision support system. Additionally, patients are informed that they can skip or refuse to answer questions. After data collection, personal identifiers (e.g., name, date of birth, address) are masked, remain confidential in the system, and are visible only to the mental health professional who administered the test, satisfying the ethical AI guidelines mentioned in Sect. 2.3. The system generates a unique identifier for each participant and uses it instead of personal identification to ensure anonymity. This ensures that the information collected from patients is kept strictly confidential. Additionally, participants are informed that the data collected will be utilized to improve their mental health diagnosis. Each participant’s responses are tabulated and stored in the system, and respondents receive a copy of their tabulated responses.
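The response-time screen described above can be sketched as a simple validity check; the function, question identifiers, and threshold values below are illustrative, not the actual Psikometrist implementation.

```python
# Sketch of the response-time screen: a participant whose time on any question
# falls below that question's minimum reading threshold is flagged for
# exclusion. Question IDs and thresholds are hypothetical.
def is_valid_response(answer_times, min_times):
    """answer_times / min_times: dicts mapping question_id -> seconds."""
    return all(answer_times[q] >= min_times[q] for q in min_times)

min_times = {"q1": 3.0, "q2": 4.0}
print(is_valid_response({"q1": 5.2, "q2": 6.1}, min_times))  # True
print(is_valid_response({"q1": 1.0, "q2": 6.1}, min_times))  # False: q1 answered too fast
```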
Following this procedure, more than 6,000 participants have taken the SCL-90-R test since 2019. We established a close cooperative relationship with three mental health experts (i.e., psychiatrists) who plan to use the proposed DSS to ensure greater impact and address concerns regarding the social and responsible use of AI, as discussed by Dennehy et al., (2021) and Morley et al., (2019). These three mental health experts (i.e., psychiatrists) evaluated the participants’ responses to identify potential mental disorders. To integrate ethics into our AI solution and ensure that the predictions of mental disorders remain within given moral bounds, we removed variables related to the participants’ demographic information (e.g., race, gender, and insurer information). We only considered their responses to the SCL-90-R test. Figure 1 shows the registration GUI for Psikometrist, which was created as a part of this research.

Fig. 1
figure 1

GUI of the Psikometrist platform

3.2 Phase II: variable selection and creation through NEPAR

We utilized a social network analysis technique called the Networked Pattern Recognition (NEPAR) algorithm to reduce the number of SCL-90-R questions and identify similarities among individuals taking the test (i.e., participants). Through NEPAR, we calculated similarities among the questions and participants, built undirected network graphs, and extracted three centrality measures (i.e., closeness, degree, and betweenness centralities). We then compared these centrality measures to decrease the number of questions from 90 to 28, without reducing the number of disorders the SCL-90-R could detect. Additionally, we applied NEPAR to compute similarities among participants by obtaining three centrality measures, namely closeness, betweenness, and degree centrality. The NEPAR algorithm extracts similarities and relationships among variables. It creates a network diagram with nodes and links, using the responses given by the participants to the SCL-90-R questions. For more information about NEPAR, see Khan & Tutun (2021) and Tutun et al., (2017). It is important to note that the NEPAR algorithm is transparent and explainable as it generates a network diagram providing insights regarding the observations and their relationships in the dataset. Thus, it is possible for the creator of the AI-based DSSs to easily identify issues regarding fairness and bias. We developed two NEPAR models, NEPAR-Q and NEPAR-P. The Q and P stand for questions and participants, respectively. The nodes in the NEPAR-Q model indicate the questions, while the nodes in the NEPAR-P model represent the participants. The links in both models depict similarities among the nodes. NEPAR-Q was used in the present research to reduce the number of questions on the SCL-90-R (i.e., variable selection), while NEPAR-P was utilized for feature engineering (i.e., variable creation) because it generates and incorporates similarities among patients as new variables. 
The NEPAR algorithm employs the similarity measures of closeness, betweenness, and degree centrality to compute the abovementioned links. We used all of these similarity measures and combined them, since choosing one over another might have caused bias in either model.
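For illustration, the snippet below computes two of the centrality measures NEPAR relies on (degree and closeness) on a toy undirected similarity graph; betweenness is computed analogously (e.g., via Brandes’ algorithm). This is a sketch of the measures themselves, not of the NEPAR algorithm, and the adjacency list is a placeholder rather than the actual question network.

```python
from collections import deque

# Degree and closeness centrality on a small undirected graph, given as an
# adjacency list {node: [neighbors]}. Both are normalized to [0, 1].
def degree_centrality(adj):
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}

def closeness_centrality(adj):
    n = len(adj)
    result = {}
    for src in adj:                       # BFS shortest paths from each node
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        # (n-1) / sum of distances; 0 if the graph is disconnected from src
        result[src] = (n - 1) / sum(dist.values()) if len(dist) == n else 0.0
    return result

# Star graph: the hub is maximally central on both measures.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(degree_centrality(adj)[0])     # 1.0
print(closeness_centrality(adj)[0])  # 1.0
```

Libraries such as NetworkX provide all three measures (including betweenness) out of the box.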

For variable selection through the NEPAR-Q model, three mental health experts determined the threshold values for each similarity measure, based on their experience. These threshold values were 36% for betweenness centrality and 39% for closeness and degree centrality. The threshold values were then used to identify the questions to be removed from the SCL-90-R. Next, the remaining questions for each similarity measure were combined, leading to an assessment tool comprised of 28 questions in total. Figure 2 represents the network diagrams based on three centrality measures obtained from the NEPAR-Q model. The red nodes in the figure represent questions with a similarity value above the threshold, and the green nodes are questions with similarity values below the threshold. The intersection of the red nodes in the three graphs represents the remaining 28 questions.
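The expert-threshold selection rule can be sketched as follows. The threshold values mirror those reported above (36% for betweenness, 39% for closeness and degree), while the centrality scores and question labels are illustrative.

```python
# Sketch of the threshold-based selection: for each centrality measure, keep
# questions scoring above the expert-set threshold, then retain only the
# intersection across all measures. Scores below are hypothetical.
def select_questions(centralities, thresholds):
    """centralities: {measure: {question: score}}; thresholds: {measure: cutoff}."""
    kept = None
    for measure, scores in centralities.items():
        above = {q for q, s in scores.items() if s > thresholds[measure]}
        kept = above if kept is None else kept & above
    return kept

centralities = {
    "betweenness": {"q1": 0.50, "q2": 0.10, "q3": 0.40},
    "closeness":   {"q1": 0.45, "q2": 0.60, "q3": 0.42},
    "degree":      {"q1": 0.70, "q2": 0.20, "q3": 0.38},
}
thresholds = {"betweenness": 0.36, "closeness": 0.39, "degree": 0.39}
print(select_questions(centralities, thresholds))  # {'q1'}: above threshold on all three
```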

Figure 2.a shows the questions (nodes) located further away from the densely populated nodes, with fewer interconnections. These nodes (seen in the blue circles) were 1: Somatization, 5: Interpersonal Sensitivity, 9: Obsessive-Compulsive, 20: Depression, and 45: Obsessive-Compulsive. Therefore, these questions were likely the weakest and could be removed from the SCL-90-R, since they provided information that could be obtained from other questions.

Fig. 2
figure 2

Centrality graphs

The top 10 questions (seen in the orange circle) conveying the most information on the SCL-90-R are indicated as augmented nodes in Fig. 2.b. These top 10 questions were 56: Somatization, 78: Anxiety, 41: Interpersonal Sensitivity, 58: Somatization, 33: Anxiety, 80: Anxiety, 7: Psychoticism, 51: Obsessive-Compulsive, 10: Obsessive-Compulsive, and 43: Paranoid Thoughts. Additionally, Fig. 2.b shows that some questions were not necessary and could be predicted using other questions, as seen in the yellow lines of Fig. 2.b. For example, we could infer that "Question 6: Feeling critical of others" was highly correlated with "Question 37: Feeling that people are unfriendly or dislike you." Moreover, the response to "Question 33: Feeling fearful" was highly correlated with the response to "Question 57: Feeling tense or keyed up." Based on the threshold values and a comparison of the centrality values in the non-directional graphs, the SCL-90-R was thus reduced to a tool called the Symptom Checklist 28-Artificial Intelligence (SCL-28-AI), which features 28 questions. This tool’s questions and corresponding mental disorders are given in Table 2.

Table 2 New SCL-28-AI Obtained by Applying NEPAR-Q to the SCL-90-R

Figure 3 displays the network graph of participants (nodes) obtained using NEPAR-P and degree centrality. It is important to note that the other centrality measures (i.e., closeness and betweenness) provided similar graphs for various sets of communities. As seen in Fig. 3, there were five communities of individuals related to one another via links. This underscores that the similarities among participants provide critical information and should be used in the AI models as new variables to improve diagnostic accuracy. Figures 3.a and 3.b show that there are strong connections among the nodes, and these connections can provide new information about the participants.

Fig. 3
figure 3

Network graph of individuals obtained through NEPAR-P, using the degree centrality measure

3.3 Model training

In this phase, we built various AI and machine learning (ML) models using two different training sets: one including only participants’ responses to the 28 questions and one with their responses plus the three centrality measures. These AI and ML models were designed to determine the probability of participants’ having a particular type of mental disorder (e.g., depression, anxiety). The prediction models were determined based on our preliminary analysis and included ridge logistic regression (R-LR), lasso logistic regression (L-LR), RF, and SVM. To ensure that the AI-based DSS is explainable and transparent, we selected three Whitebox, explainable AI algorithms (i.e., R-LR, L-LR, and RF) and compared their performance measures to a Blackbox algorithm (i.e., SVM).

Logistic regression (LR) is a statistical approach that explains the relationship between multiple input variables and one output variable. The input variables can be of any type (numerical or categorical), while the output variable must be categorical. The class probabilities (i.e., the likelihood of having specific mental disorders) are predicted based on the relationships among the input and output variables. In L-LR, a penalty function is added to the traditional multinomial LR such that the coefficients of unimportant variables are set to zero (Hastie et al., 2017; James et al., 2013; Johnson, Albizri, Harfouche, et al., 2021; Tutun et al., 2022). Conversely, R-LR uses a different penalty term that shrinks the coefficients of insignificant variables close to zero instead of making them exactly zero.
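The contrast between the two penalties can be demonstrated on synthetic data: the L1 (lasso) update is a soft-thresholding step that drives irrelevant coefficients exactly to zero, while the L2 (ridge) update only shrinks them multiplicatively. This is a minimal proximal-gradient sketch under a simulated binary outcome, not the study’s actual models.

```python
import numpy as np

# Synthetic binary-outcome data: only the first two of six features matter.
rng = np.random.default_rng(0)
n, p = 400, 6
X = rng.normal(size=(n, p))
w_true = np.array([2.0, -2.0, 0.0, 0.0, 0.0, 0.0])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

def fit(penalty, lam=0.1, lr=0.1, iters=3000):
    """Proximal gradient descent for penalized logistic regression."""
    w = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (1 / (1 + np.exp(-X @ w)) - y) / n
        w = w - lr * grad
        if penalty == "l1":   # lasso proximal step: soft-thresholding -> exact zeros
            w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
        else:                 # ridge proximal step: multiplicative shrinkage
            w = w / (1 + lr * lam)
    return w

w_lasso, w_ridge = fit("l1"), fit("l2")
print("lasso zero coefficients:", int(np.sum(w_lasso == 0.0)))  # noise features zeroed
print("ridge zero coefficients:", int(np.sum(w_ridge == 0.0)))  # small but never exactly zero
```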

RF is a tree-based ensemble algorithm comprised of many DTs represented by “if-then” rules. The RF algorithm uses the DT algorithm to build numerous uncorrelated DTs by sampling observations with replacement from the training dataset (Hastie et al., 2017; James et al., 2013). Then, the individual DTs are combined using a function such as simple averaging or majority voting (Johnson et al., 2020, 2021; Simsek et al., 2020). Because RF combines multiple uncorrelated DTs built from samples of the training set, it provides lower model variance and better accuracy rates, making it a very robust approach.

SVM is a supervised learning algorithm mainly used for classification. SVM uses quadratic programming to find hyperplanes that can optimally separate classes with the largest gap possible (Hastie et al., 2009; Simsek et al., 2020). It is important to note that SVM can utilize various kernel functions to classify datasets that are not linearly separable. This allows SVM to efficiently operate in high-dimensional space at high accuracy rates. For each of these algorithms, we used four different datasets, as provided below:

  1. (1)

    Without NEPAR-Q and NEPAR-P: This dataset only included participants’ responses to the total 90 questions on the SCL-90-R. This set did not make use of NEPAR-Q or NEPAR-P.

  2. (2)

    Without NEPAR-Q and with NEPAR-P: This dataset included participants’ responses to the total 90 questions on the SCL-90-R, in addition to three similarity features (i.e., closeness, betweenness, and degree similarity) obtained through NEPAR-P. Each similarity feature represented the average similarity of a particular participant to the remaining participants.

  3. (3)

    With NEPAR-Q and without NEPAR-P: This dataset included participants’ responses to the 28 questions on the SCL-28-AI. This set did not make use of NEPAR-P.

  4. (4)

    With NEPAR-Q and NEPAR-P: This dataset included participants’ responses to the 28 questions on the SCL-28-AI, in addition to the three similarity features obtained through NEPAR-P.
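Assembling the four training sets above can be sketched as follows, assuming a response matrix of participants by questions, the indices of the 28 retained questions, and the three NEPAR-P similarity features per participant. All arrays below are toy placeholders, not the study’s data.

```python
import numpy as np

# Toy stand-ins for the real inputs: 100 participants, 90 SCL-90-R answers
# (0-4), a hypothetical set of 28 retained question indices, and the three
# NEPAR-P similarity features (closeness, betweenness, degree) per participant.
rng = np.random.default_rng(1)
responses_90 = rng.integers(0, 5, size=(100, 90))
scl28_idx = np.arange(28)          # placeholder for the actual SCL-28-AI indices
nepar_p = rng.random((100, 3))

datasets = {
    "90_without_neparp": responses_90,
    "90_with_neparp": np.hstack([responses_90, nepar_p]),
    "28_without_neparp": responses_90[:, scl28_idx],
    "28_with_neparp": np.hstack([responses_90[:, scl28_idx], nepar_p]),
}
for name, X in datasets.items():
    print(name, X.shape)  # feature counts: 90, 93, 28, and 31 respectively
```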

We divided the datasets into training and test sets. The training set was used for model building, while the test set was used to assess the models’ performance. During model training, we performed cross-validation to tune the model hyperparameters and prevent overfitting. Next, we ran the data preprocessing pipeline for the test set and computed the three additional features. We then predicted the mental disorders of the participants within the test set and calculated the performance measures. Table 3 provides the macro-averages of the performance measures, indicating their overall ability to detect mental disorders.
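The evaluation protocol described above (hold out a test set, then cross-validate on the training portion only) can be sketched as follows; the sizes, split fraction, and fold count are illustrative.

```python
import numpy as np

# Minimal sketch of the evaluation protocol: split off a held-out test set,
# then generate k disjoint validation folds from the training portion for
# hyperparameter tuning, so the test set never influences model selection.
def train_test_split_idx(n, test_frac=0.2, seed=0):
    idx = np.random.default_rng(seed).permutation(n)
    cut = int(n * (1 - test_frac))
    return idx[:cut], idx[cut:]

def kfold(train_idx, k=5):
    folds = np.array_split(train_idx, k)
    for i in range(k):
        val = folds[i]
        fit = np.concatenate([folds[j] for j in range(k) if j != i])
        yield fit, val   # fit on k-1 folds, validate on the held-out fold

train_idx, test_idx = train_test_split_idx(1000)
print(len(train_idx), len(test_idx))  # 800 200
for fit_idx, val_idx in kfold(train_idx):
    assert set(fit_idx).isdisjoint(val_idx)            # no leakage into validation
    assert len(fit_idx) + len(val_idx) == len(train_idx)
```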

Table 3 Macro-averages of Performance Measures by Model

Reducing the number of questions from 90 to 28 led to an approximately 9% decrease in performance measures. However, the AI models using the 28-question version (i.e., with NEPAR-Q) could still diagnose all 10 mental disorders. This substantially contributes to the literature, since previous studies using fewer than 30 questions could only diagnose three to four mental disorders (Derogatis & Fitzpatrick, 2004; Prinz et al., 2013). Thus, mental health professionals can now use the SCL-28-AI instead of the SCL-90-R without compromising the number of mental disorders diagnosed and with only a minor reduction in diagnostic accuracy. Furthermore, because the SCL-28-AI has fewer questions, it is much faster to complete and should yield a better response rate. Moreover, the performance measures obtained via NEPAR-P were higher than those obtained without it, indicating that including participants’ similarities as additional features improved mental disorder diagnosis. The L-LR model with NEPAR-P yielded the highest performance measures; therefore, it was selected as the final model and deployed as part of the DSS.

Figure 4 shows each model’s accuracy, sensitivity, and specificity with NEPAR-Q, at the mental disorder level. As seen from the plots in Fig. 4, the models with NEPAR-P (i.e., the three additional features accounting for similarities among participants) outperformed the models without NEPAR-P (seen in the yellow circles). These models were powerful for diagnosing anxiety, depression, hostility, obsessive-compulsive disorder, and somatization. The models built both with and without NEPAR-P had some minor issues with diagnosing PAR and ADI, mainly because the number of individuals with these two disorders is relatively low. We believe that the DSS will perform significantly better once we collect more data and increase the sample size of individuals with these two disorders.

Fig. 4
figure 4

Performance of the AI and ML models with NEPAR-Q, by mental disorder
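Per-disorder sensitivity and specificity, as reported in Fig. 4, can be computed by treating each disorder as a one-vs-rest binary problem. The sketch below uses scikit-learn's `multilabel_confusion_matrix` on hypothetical data; the disorder codes and labels are placeholders, not the study's actual results.

```python
# Sketch: per-disorder sensitivity and specificity from one-vs-rest
# confusion matrices. Data and disorder codes are hypothetical.
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

disorders = ["ANX", "DEP", "HOS", "OCD", "SOM", "PAR", "ADI"]  # placeholder codes
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(200, len(disorders)))  # stand-in labels
y_pred = rng.integers(0, 2, size=(200, len(disorders)))  # stand-in predictions

# Each 2x2 matrix is laid out as [[TN, FP], [FN, TP]] for one disorder.
for name, m in zip(disorders, multilabel_confusion_matrix(y_true, y_pred)):
    tn, fp, fn, tp = m.ravel()
    sensitivity = tp / (tp + fn)   # true-positive rate (recall on the disorder)
    specificity = tn / (tn + fp)   # true-negative rate
    print(f"{name}: sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

Looking at sensitivity and specificity per disorder, rather than only overall accuracy, is what reveals the weaker performance on low-prevalence classes such as PAR and ADI.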

4 Discussion, Implications, and Future Research

This study was conducted in response to several calls for research on the use of cutting-edge technologies to develop decision support systems (DSSs) that help mental health professionals make evidence-based, ethical treatment decisions and guide policymakers in digital mental health implementation (Balcombe & Leo, 2021; D’Alfonso, 2020). Because many mental health professionals use cumbersome standardized assessment tools such as the SCL-90-R for mental disorder diagnosis, previous studies emphasized the importance of developing less complex diagnostic tools with fewer questions (Kruyen et al., 2013). To this end, this study focused on techniques to reduce the length of the SCL-90-R for faster completion times and better completion rates. Various studies have also stated the need for DSSs that diagnose mental disorders automatically and accurately (Chekroud et al., 2017; Rovini et al., 2018; Stewart et al., 2020; Zhang et al., 2020). Therefore, this study explored how to utilize AI and ML to develop a DSS for mental health diagnosis. The study has various implications and makes several contributions to the pertinent literature and practice, as follows.

4.1 Theoretical implications

How an ethics framework is implemented in an AI-based healthcare application is not widely reported in the literature. Thus, there is a need for examples of AI implementations that satisfy the three principles of ethical AI, particularly in healthcare. As mentioned in Sect. 2.3, these principles are fairness, transparency, and trustworthiness. Accordingly, this study applied the techniques described in Fig. 5 throughout the phases of AI-based DSS creation, contributing to the theory of developing ethical AI solutions.

First, the study emphasized the importance of establishing close cooperation between the creators of AI-based DSSs and practitioners. Second, this close collaboration ensured that the practitioners (i.e., mental health professionals) were included in the design process and that the proposed DSS was used to aid their decision making and augment their capabilities, not replace them. The decisions made by the proposed DSS were examined by the mental health practitioners before model deployment to address issues regarding DSS fairness and liability.

Fig. 5
figure 5

Core elements of designing ethical AI solutions

Additionally, this study used transparent algorithms. Most AI algorithms used to develop DSSs, such as ANNs, are not transparent and may have internal biases (Buhmann & Fieseler, 2021). For example, recommendations and decisions made by a black-box DSS can yield unintended adverse consequences when certain variables and features (e.g., gender and race) are used in the training dataset (Crawford & Calo, 2016; Mittelstadt et al., 2016). For instance, Nie et al. (2012) and Hao et al. (2013) used SVM and ANN models to diagnose mental disorders. Unfortunately, these models are black boxes and provide no information about their internal structure. Therefore, mental health professionals using such a black-box DSS do not know how particular factors and variables affect the final diagnosis. Thus, this study made a unique contribution by using explainable and transparent AI models to diagnose mental disorders.

Various steps were undertaken to ensure that the proposed DSS did not generate biased predictions. For instance, several variables (e.g., gender and race) were removed from the analysis to ensure that the predictions were not biased toward a certain group of people. Furthermore, the study used the NEPAR algorithm for feature selection because it is an interpretable algorithm that shows how features are related to one another, thus reducing model bias and making the model transparent. Finally, three different functions were applied to calculate similarities among questions and participants, because integrating different functions improved the robustness and reliability of the NEPAR models.
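The idea of integrating several similarity functions can be sketched as follows. The paper does not specify which three functions were used, so the cosine, Pearson-correlation, and Gaussian (RBF) similarities below are hypothetical stand-ins chosen for illustration.

```python
# Hypothetical sketch: combining three similarity functions between two
# participants' answer vectors. The study's actual three functions are
# not specified here; these are illustrative choices.
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson_sim(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def rbf_sim(a, b, gamma=0.1):
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))

def combined_similarity(a, b):
    # Averaging several scores makes the measure less sensitive to the
    # quirks of any single similarity function.
    return (cosine_sim(a, b) + pearson_sim(a, b) + rbf_sim(a, b)) / 3.0

# Two participants' (hypothetical) answers to four questions.
p1 = np.array([1.0, 2.0, 3.0, 4.0])
p2 = np.array([1.0, 2.0, 2.5, 4.5])
print(f"combined similarity: {combined_similarity(p1, p2):.3f}")
```

Each function captures a different notion of closeness (angle, linear co-movement, and Euclidean distance), so their combination is more robust than any one alone.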

4.2 Practical implications

Various studies have proposed statistical techniques to shorten the SCL-90-R test (Akhavan, Abiri, & Shairi, 2020a; Imperatori et al., 2020; Lundqvist & Schröder, 2021). For example, Derogatis (2017) developed the BSI-18, which features 18 questions and is used to diagnose somatization, anxiety, and depression. Prinz et al. (2013) created the SCL-14 to diagnose somatization, phobic anxiety, and depression. However, because they feature fewer questions, these shorter tools cannot diagnose all ten mental disorders that the original SCL-90-R can detect. This study addresses this major issue by creating a new symptom checklist called the SCL-28-AI. Unlike the tools proposed in several previous studies (Barker-Collo, 2003; Bernet et al., 2015; Chen et al., 2020; Kim & Jang, 2016; Olsen et al., 2004; Rytilä-Manninen et al., 2016; Schmitz et al., 2000; Sereda & Dembitskyi, 2016; Urbán et al., 2016), the SCL-28-AI can diagnose all the mental disorders that the SCL-90-R was designed to diagnose, but with fewer questions. Therefore, it should provide significantly higher participation and completion rates.

Previous studies have not considered the similarities among patients, that is, the fact that patients who are similar in terms of demographics and other metrics may have similar mental disorders (Akhavan, Abiri, & Shairi, 2020a; Imperatori et al., 2020; Lundqvist & Schröder, 2021). Using NEPAR, this study utilizes the similarities among participants as an input to determine their mental disorders. Another important practical implication of this study is that it eliminated the manual analysis of participants’ answers to the SCL-90-R test. Because the DSS is used to diagnose disorders, mental health professionals can devote their valuable time to developing treatment plans and effective interventions to improve their patients’ wellbeing instead of manually analyzing test results (Chekroud et al., 2017; Rovini et al., 2018; Stewart et al., 2020; Zhang et al., 2020). This is the first study to use the NEPAR algorithm along with major AI algorithms to develop a transparent DSS. Additionally, the proposed DSS can generate the likelihood of a person having a particular mental disorder instead of a binary decision (i.e., whether a disorder is present), which can help mental health professionals determine the severity of the diagnosis.
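Reporting a likelihood rather than a binary label can be sketched with a probabilistic classifier's predicted probabilities. The model, data, and severity thresholds below are illustrative assumptions, not the study's deployed configuration.

```python
# Sketch: outputting a disorder likelihood (and a derived severity band)
# instead of a hard yes/no label. Model, data, and thresholds are
# illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for 28-question responses with a binary disorder label.
X, y = make_classification(n_samples=300, n_features=28, random_state=1)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Probability that the disorder is present for one participant.
proba = clf.predict_proba(X[:1])[0, 1]

# Hypothetical severity banding a clinician might apply to the likelihood.
if proba >= 0.75:
    severity = "high"
elif proba >= 0.5:
    severity = "moderate"
else:
    severity = "low"
print(f"likelihood={proba:.2f} severity={severity}")
```

The clinician sees a graded score rather than a forced binary verdict, which supports judging severity and prioritizing follow-up.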

This research also has economic and social implications. For example, accurate diagnosis of mental disorders through the proposed DSS can reduce overall healthcare costs arising from misdiagnosis, overdiagnosis, and unnecessary treatment. Additionally, faster and more accurate diagnosis through the proposed DSS can help patients start their treatment earlier, improving their overall quality of life. Finally, mental health professionals can increase their panel size (i.e., the number of patients they see) because administering the test and diagnosing mental disorders are easier with the proposed DSS. This can give a greater share of the population access to mental health services.

4.3 Limitations and Future Research

This study is not without limitations. The proposed research was implemented using a dataset obtained from a web portal called Psikometrist, which is currently used by a specific group of mental health professionals. Future research should obtain data from different mental health professionals and apply the framework to other datasets collected from larger populations. The dataset contained approximately 6,000 observations; future studies can apply this framework to larger datasets with more variables to increase diagnostic accuracy and reduce model bias. Such datasets can include historical variables related to patient demographics, genetic data, medications used, and so on, and can be used to explore the impact of bias on the final diagnosis. In addition to traditional ML algorithms, future studies can consider more complex models (e.g., deep learning) with post-hoc explainability techniques to explore whether the results can be improved.

5 Conclusions

The mental health crisis has worsened in recent years, with severe impacts on the personal wellbeing and financial situation of many individuals. Given the lack of adequate tools to help mental health professionals, this study was motivated by the urgent need to develop innovative tools that can help professionals make better clinical diagnostic decisions. Our study developed a DSS, called Psikometrist, that can replace traditional paper-based examinations, decreasing the possibility of missing data and significantly reducing the cost and time required of patients and mental health professionals. The findings show how AI-based tools can be utilized to efficiently detect and diagnose various mental disorders. In addition, the study discussed the ethical challenges faced during AI implementation in healthcare settings and outlined the ethical guidelines that should be integrated.