
1 Introduction

1.1 What is ADHD and How is It Currently Diagnosed?

ADHD (attention deficit hyperactivity disorder) is the most common childhood neurodevelopmental disorder, with a prevalence of about 5–7.2% in children and adolescents and 2.5–6.7% in adults [1]. The relative number of ADHD diagnoses has been increasing rapidly in Western countries: for instance, in the US, ADHD prevalence rose from 6.1% to 10.2% between 1997 and 2016 [2]. While the causes of the ‘ADHD epidemic’ remain partially unclear, the increase in referrals to psychiatric care has resulted in a global healthcare crisis in which resources do not match the needs [3]. At the same time, various new opportunities for technology to assist in related healthcare solutions have emerged. One promising avenue is the use of motion sensor technology in ADHD diagnostics.

Of the two broad ADHD symptom domains, inattention and hyperactivity/impulsivity, the latter directly concerns physical movements of the body and could therefore potentially be quantified objectively with motion sensors. Hyperactivity/impulsivity symptoms include, for example, fidgeting or tapping with hands or feet, squirming in the seat, leaving the seat when remaining seated would be appropriate, and running about in situations where one is expected not to do so. Other technological solutions, such as virtual reality [4], exist for detecting inattention, but it is good to keep in mind that the symptom domains are often highly correlated, and capturing a single domain reliably may hence provide valuable information even on the broader scale. While large-scale initiatives have recently been made to improve the precision of psychiatric diagnostics [5, 6], current ADHD diagnostics is still far from precise quantitative measurement and relies instead on subjective evaluations gathered with structured interviews and symptom screening questionnaires. Subjective evaluations are sensitive to various biases [7] that depend on the awareness and reliability of the informants, their interpretation of the questions, and their ability to scale the outcomes against others. In addition, longitudinal data provided by technologies with high sampling rates has benefits over sparse evaluations of one aspect of life made over a period of months. Hence, some limitations of current diagnostics could potentially be tackled with objective, quantitative sensor-based methods solidly grounded in the biobehavioral reality on which so many other fields of medicine rely [8]. Could a machine assess hyperactive-impulsive patterns better than a human, and what would be required to fulfil the high medical standards?

1.2 Movement Sensors in ADHD Assessment

Movement sensors, such as accelerometers and gyroscopes, have been popular in actigraphs employed for research purposes for several decades [9]. During the past ten years, related solutions have also become common in consumer devices (e.g., mobile phones, smart watches, and other wearables), which has facilitated related technological developments even further. By gathering information about linear acceleration along multiple axes, microelectromechanical accelerometers can reliably detect gross body movements and also provide a signal for more detailed analysis of movement patterns and trajectories. Gyroscopes, in turn, measure the angular velocity (rotation) of the moving object, which allows a more comprehensive interpretation of the motion signals. A combination of these two sensor types hence gives the most precise picture of the movement features. How well a movement signal can be characterized also depends on the number of sensors and the sampling rate (e.g., 1–100 Hz), which in turn considerably affect battery consumption and sensitivity to measurement noise [10, 11]. The choice of sensor types, temporal resolution, and number of sensors depends on the need. For instance, detecting the overall activity level or even the quality of sleep does not require high-precision signals, while capturing movement signals that reflect natural human behavior in its richness (e.g., sports) has different hardware requirements [12]. For detecting and interpreting the type of hyperactivity in ADHD, such measurement standards remain to be carefully investigated.
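To make the notion of a gross movement signal concrete, the following minimal sketch computes a crude movement-intensity index from tri-axial accelerometer samples. It is an illustration only and not the method of any reviewed study: the function names, the units (g), and the sample values are assumptions made for the example.

```python
import math


def vector_magnitude(x, y, z):
    """Euclidean norm of one tri-axial acceleration sample (in g)."""
    return math.sqrt(x * x + y * y + z * z)


def mean_dynamic_acceleration(samples, gravity=1.0):
    """Mean absolute deviation of the acceleration magnitude from gravity.

    `samples` is a list of (x, y, z) tuples in units of g. A sensor at rest
    measures roughly 1 g whatever its orientation, so the deviation from
    1 g serves as a simple (assumed) index of movement intensity.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    deviations = [abs(vector_magnitude(x, y, z) - gravity) for x, y, z in samples]
    return sum(deviations) / len(deviations)


# Hypothetical samples: a sensor lying still vs. a sensor being moved.
still = [(0.0, 0.0, 1.0)] * 4
moving = [(0.0, 0.0, 1.0), (0.6, 0.0, 0.8), (0.0, 1.2, 0.9), (0.3, 0.4, 0.0)]

print(mean_dynamic_acceleration(still))
print(mean_dynamic_acceleration(moving))
```

Note that the second hypothetical sample in `moving` has a magnitude of about 1 g although the orientation of the sensor has changed. A magnitude-based index of this kind cannot see such slow rotations, which is one reason why gyroscope data can complement accelerometer data.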

Precise modelling of human movements requires not only high-quality input data but also benefits from advanced computational methods [5]. Machine learning methods such as convolutional neural networks (CNN) can detect regularities in movement trajectories signaling, for instance, different postures, movement types, or activities [13]. Such methods should be able to detect all possible variants within any single interpretable movement category at the individual level, which is a big challenge in studying heterogeneous disorders like ADHD. Yet another issue to tackle is the context in which the movements take place. Context should be carefully considered when examining ADHD symptoms, as the symptoms essentially relate to whether a movement is appropriate in a specific context rather than whether the movement is appropriate as such. Here, determining the movement in relation to other individuals (inter-individual differences) and deriving the changes in the movement patterns of the same person in different contexts (intra-individual differences) come into play. More specifically, like humans, machine learning algorithms may learn to identify certain types of movements in a particular context (e.g., fidgeting with hands or feet), but this becomes clinically interpretable only when the system has first characterized whether the movement signals maladaptive behavior in the specific measurement context (e.g., observed during a school class when one should stay still and concentrate). One approach that helps here is supervised learning: when reference data in which the classification has already been done is available and the training sample is representative and large enough, such methods provide powerful means [14]. Alternatively, when predefined category information is not available, it is possible to use unsupervised learning, where the algorithm categorizes the data according to the statistical regularities in the input [14]. This method can be powerful in detecting, for instance, inter-individual differences. Along with manual annotation, both of these approaches could provide higher interpretability than the analysis of gross movement levels, which may also vary between individuals with vs. without ADHD [15].
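The difference between the two learning approaches can be sketched with a toy example. The code below is a simplified illustration under stated assumptions, not an implementation from the reviewed studies: the two features per time window (movement intensity and fidget rate), their values, and the labels are all hypothetical, and a nearest-centroid classifier and two-cluster k-means are used only as simple stand-ins for supervised and unsupervised methods.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest_centroid_fit(features, labels):
    """Supervised step: average the feature vectors of each annotated class."""
    sums, counts = {}, {}
    for vec, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def nearest_centroid_predict(centroids, vec):
    """Assign a new window to the class with the closest centroid."""
    return min(centroids, key=lambda lab: sq_dist(centroids[lab], vec))


def two_means(features, iterations=20):
    """Unsupervised step: split unlabeled windows into two clusters (k-means, k=2)."""
    centers = [list(min(features)), list(max(features))]
    assignment = []
    for _ in range(iterations):
        assignment = [
            0 if sq_dist(v, centers[0]) <= sq_dist(v, centers[1]) else 1
            for v in features
        ]
        for k in (0, 1):
            members = [v for v, a in zip(features, assignment) if a == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hypothetical per-window features: [movement intensity, fidget rate].
windows = [[0.05, 0.1], [0.07, 0.2], [0.06, 0.15],
           [0.40, 2.0], [0.45, 2.4], [0.38, 1.9]]
annotations = ["still", "still", "still", "fidgeting", "fidgeting", "fidgeting"]

centroids = nearest_centroid_fit(windows, annotations)   # needs annotations
predicted = nearest_centroid_predict(centroids, [0.42, 2.2])
clusters = two_means(windows)                             # needs no annotations
print(predicted, clusters)
```

The supervised variant returns an interpretable label because it was trained on annotated reference data, whereas the unsupervised variant only returns cluster indices whose meaning must be established afterwards.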

1.3 The Present Study

This critical review examines the existing literature concerning the clinical utility of motion sensors measuring bodily movements in ADHD diagnostics, especially in terms of a) research quality (e.g., interpretability of the signals, contextual control) and b) diagnostic standards (e.g., representativeness of the study samples, observations in different contexts, length of the measurements, or test-retest reliability). We hope that this paper raises questions that help to improve sensor-based research of ADHD and can guide the development of future health care applications. As the current research in this field has used highly heterogeneous methods, it is important to raise questions that would allow building high clinical quality standards. Moreover, it is currently difficult for clinicians to evaluate the readiness of the technology, as such critical analysis is lacking. One important caveat here is how to interpret the performance of the methods. For example, some studies have reported very high classification accuracies (e.g., >98%), and if such studies considered all the relevant clinical aspects, one could easily argue that the method is ready for clinical use. However, it should be borne in mind that detection accuracies are highly dependent on the difficulty of the specific classification problem, which in this case largely arises from sample characteristics. From the clinical point of view, the algorithm should be able to identify the status of every single individual who comes to the assessment. For this reason, population-wide representativeness of the training and testing samples is of utmost importance. In most cases, individuals with ADHD also have other problems. Indeed, challenges in clinical assessment especially concern evaluating the severity of the problems near the diagnostic threshold and ruling out other possible problems that may overlap with ADHD (e.g., autism [16], learning disabilities [17], as well as mood, anxiety, and conduct disorders [18, 19]). Finally, it is worth noting that this review will not cover in detail comparisons between motion sensors and other methods employing machine learning, as these were the focus of another recent review [5]. We will also focus on the more recent and technologically advanced studies with higher clinical potential, as older studies with standard methods have been carefully meta-analyzed by De Crescenzo and colleagues [15].

2 Methods

2.1 Study Selection

Two researchers (JB and JS) independently conducted the search and selection of the studies. To find the relevant studies, we employed PubMed and Scopus as the primary search engines. PubMed’s comprehensive coverage of biomedical and life sciences publications provided a strong foundation for identifying relevant studies. Scopus, with its multidisciplinary scope and strong representation in the life sciences, complemented the search by potentially capturing studies not included in PubMed. For both search engines, we used the search string: ADHD AND (movement OR motion) AND sensor AND diagnostics. The initial abstract selection was based on whether the abstracts concerned the detection or diagnostics of ADHD. Studies whose primary focus was on brain signals or other aspects associated with objective diagnostics (e.g., task performance) were excluded, as were studies not published in international peer-reviewed English-language journals or not reporting quantitative results. We also excluded studies examining eye movements, as they are likely to reflect different aspects of ADHD symptoms (i.e., shifting and focusing of attention) than the bodily movements captured by the sensors. Scopus returned 13 studies, of which six were eligible for the present purposes, while PubMed returned 15 hits, of which seven were eligible. Of the seven eligible studies found in PubMed, five were also returned by Scopus, leaving eight unique articles in total. Additional search words (hyperactivity, accelometer, accelerometer, gyroscope, IMU, wearable-sensors) and Google Scholar were used to complement the search procedure, and previous meta-analyses and reviews were examined to identify further eligible studies. Altogether, 25 studies were selected for more careful inspection (see Table 1).

2.2 Study Participants

Of the eligible studies, 23 had children and 2 had adults as participants (Table 1). The average age was 9 years in the pediatric studies and 33 years in the adult studies. On average, about 80% of the participants were male, approximately reflecting the typical gender distribution of ADHD [20]. Individual studies varied in the information provided about ADHD subtyping and the examination of comorbid symptoms in the clinical group, in the methodological standards for verifying that the controls did not have psychiatric or neurodevelopmental issues, and in how well the controls represented the general population (e.g., distribution of socio-economic status and education). However, in most cases participants with neurological or psychiatric disorders (other than ADHD in the clinical groups) had been excluded from the samples in the original studies.

2.3 Sensor Data Collection

The measurement devices have been actigraphs, smart watches, and VR controllers, some of which contain only an accelerometer and some also a gyroscope. The sensors have typically been placed on the hands (either a wrist monitor or a hand controller), and sometimes also on the ankles or waist (Table 1). The wrist measurement was sometimes taken from the dominant and sometimes from the non-dominant hand, a choice that depends on the situation. For instance, during a school class the dominant hand may be used more for writing, drawing, and other such activities, and the movements of the non-dominant hand could therefore give information about ‘irrelevant’ movements. In experimental tasks that are performed with the dominant hand, the motivation for the sensor placement might be different, although in both cases data could be collected from both hands, if only for cross-validation. A few studies have used sensors simultaneously on multiple body parts, including both hand and leg [13, 21]. The sampling rates of the devices typically range between 1 and 30 Hz.
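The practical meaning of these sampling rates can be illustrated with the Nyquist criterion: a signal sampled at a given rate can only represent movement frequencies up to half that rate, and faster movements are folded (aliased) to lower apparent frequencies. The sketch below is a worked example under assumptions; in particular, the 4 Hz repetitive movement is a hypothetical value chosen for illustration, not a figure taken from the reviewed studies.

```python
def nyquist_frequency(sampling_rate_hz):
    """Highest movement frequency that can be represented without aliasing."""
    return sampling_rate_hz / 2.0


def apparent_frequency(signal_hz, sampling_rate_hz):
    """Frequency at which a sinusoid of `signal_hz` appears after sampling."""
    folded = signal_hz % sampling_rate_hz
    return min(folded, sampling_rate_hz - folded)


# Hypothetical 4 Hz repetitive movement (e.g., rapid tapping) observed at
# three sampling rates from the range discussed in the text.
for rate in (1.0, 5.0, 30.0):
    print(rate, nyquist_frequency(rate), apparent_frequency(4.0, rate))
```

Under these assumptions, a 4 Hz movement is preserved at 30 Hz but would appear as a 1 Hz movement at 5 Hz, so the choice of sampling rate constrains which movement types can be distinguished.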

2.4 Experimental Designs

Experimental designs in ADHD studies collecting motion sensor data can be roughly divided into naturalistic studies, where the data is collected at home and/or school, and laboratory studies, where typically a specific task is presented (Table 1). The design also influences the duration of the measurement: naturalistic data can be collected over several days (on average 18 h/day) and, with a few sessions, could potentially fulfil the criteria concerning the persistence of the symptoms. Laboratory measurements with experimental tasks, in turn, typically last from tens of minutes to at most a few hours (on average 60 min). The main trade-off in the selection of the experimental design is between the sampling distribution and representativeness of the situations in which the symptoms are manifested (naturalistic designs) and contextual control with a measurement situation that can be carefully interpreted and more reliably compared between individuals (laboratory tasks). For example, a person may be able to inhibit hyperactivity during a laboratory task lasting a few minutes, as such behaviors are generally considered inappropriate and the situation is new to the participant. On the other hand, data collected in the classroom or at home could be affected by numerous potential confounds related to what exactly the activities on the measurement days were (what kind of teaching was arranged and how, what the child’s situation at home or school is, etc.). At the moment, empirical evidence from single studies demonstrating the comparative benefits is limited. Average classification accuracies have been 95% in naturalistic studies and 86% in studies with laboratory tasks. Hybrid paradigms that attempt to combine the benefits of naturalistic and laboratory designs are also becoming increasingly common. Such paradigms, in which motion sensor data is collected in a naturalistic situation emulated in a virtual reality laboratory task, have been developed for classroom [22] and home situations [23, 24]. A virtual classroom task commonly used in ADHD studies is a variant of the continuous performance test (CPT; see Introduction section [24]), which is one of the most widely used experimental tasks in this domain overall. Finally, at least two studies have divided the measurement period into multiple real-world and experimental measurement sessions, which gives information about the influence of the measurement context [21, 25]. O’Mahony and colleagues collected sensor data when the participants were 1) in the waiting room with their parents, 2) in the waiting room with a supervisor, 3) with the psychiatrist in her/his office, 4) with the psychiatrist and a parent, and 5) performing an experimental task [21]. Miyahara and colleagues collected movement data from rather young children for about two hours while they performed multiple types of neuropsychological or computerized cognitive tasks [25].

2.5 Analysis Methods

The studies reviewed have used various analytical techniques to interpret the measurement results collected by the sensors (Table 1). These techniques include machine learning and deep learning algorithms, as well as traditional statistical methods. The choice of method generally depended on the design and main objective of the study, as well as the nature of the data to be analyzed. Many of the reviewed studies used statistical tests, especially analysis of variance (ANOVA). Statistical tests are considered useful for hypothesis-driven research, as they allow testing the significance of differences between groups (ANOVA), means (t-tests), and proportions (chi-square tests), as well as of correlations (Pearson or Spearman correlation tests). ANOVAs were typically used to examine the significance of group differences in factors related to overall activity or changes in activity during measurement periods. Similarly, the studies focusing on classifying the group status of single individuals often used various statistical tests to evaluate which features should be used in the process. While these tests are powerful for hypothesis testing, they come with limitations. One major limitation, observed in the study by Miyahara et al. [25], is that each test has its assumptions. For instance, ANOVA assumes homogeneity of variances and normal distributions in the populations being compared. These assumptions can be restrictive and are not met by all data sets, and their violation can lead to inaccurate results. Another limitation, considering the aim of our study, concerns the classification of an individual, as traditional statistical tests are mainly capable of describing differences and relations between features. However, there are classification methods that rely on some of these tests, such as discriminant analysis, which was used in some of the studies [24, 25]. Discriminant analysis offers a rather simple and efficient classification method, but it is limited by the assumptions of the tests included. Similarly, machine learning and deep learning algorithms have their own requirements for the input data, but as there is a wide range of classification algorithms for different data types, violating the assumptions can usually be avoided. The reviewed studies used different basic machine learning classification methods, such as support vector machines (SVM), logistic regression, and decision trees, as well as more advanced deep learning methods such as CNNs. These methods were tailored to different types of input data and often provided accurate classification results. Of all the methods, CNN offers a rather different approach to the analysis, as it uses image data as input. The accuracy of a CNN can also be affected by the number of convolutional layers. CNNs were used in only two of the studies reviewed [13, 26]. Amado-Caballero et al. [26] further experimented with different combinations of convolutional layers and input window sizes to find the highest accuracy. Implementing these methods usually requires expertise in the field, rather large data sets, and more computing power than traditional statistical methods.
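The role of the input window size can be illustrated with a generic windowing step that cuts a continuous sensor recording into fixed-length segments before classification. This is a minimal sketch and not the preprocessing used in [13] or [26]: the function name, the window length, and the step size are assumptions chosen for the example.

```python
def sliding_windows(signal, window_size, step):
    """Cut a 1-D signal into fixed-length, possibly overlapping windows.

    Windows that would run past the end of the signal are dropped, so every
    returned window has exactly `window_size` samples.
    """
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    last_start = len(signal) - window_size
    return [signal[i:i + window_size] for i in range(0, last_start + 1, step)]


# Hypothetical recording of 10 samples, cut into windows of 4 samples
# with 50% overlap (step of 2 samples).
recording = list(range(10))
windows = sliding_windows(recording, window_size=4, step=2)
print(windows)
```

Longer windows give each classification more context but yield fewer training examples from the same recording, which is one reason why the window size is worth tuning together with the network architecture.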

3 Results

The published results have consistently reported group differences in movement between participants with ADHD and neurotypical controls (Table 1): every individual study reported significant group differences in the motion sensor data. Classification accuracies for detecting the group status of single participants range from around 70% to 99%. More specifically, many studies report acceptable to excellent discrimination rates (70–90%), and a few report outstanding classification accuracies (>90%, or even around 98–99%). The studies reported an average sensitivity of 87% and an average specificity of 86%. Both values are, as expected, close to the corresponding classification accuracies. Some studies reported results separately for multiple experimental situations or for alternative analytical methods. For example, O’Mahony et al. [21] reported multiple accuracies, each obtained from a different experimental situation. The accuracies in these situations ranged from 81% to 93%, and the final accuracy (95%) was obtained by combining the data from all situations. Similarly, Kam et al. [27] reported results for two models that used different situations (class, class + recess) as well as differently implemented decision trees. These models differed by 1–2 percentage points in discrimination accuracy. More dramatic differences were reported by Amado-Caballero et al. [26], with accuracies ranging from 56% up to 99%. For each study, Table 1 presents the highest accuracy achieved, if reported. Otherwise, the features with the most significant group differences are presented.
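Because the reviewed samples are mostly balanced case-control groups, it can help to see how the same sensitivity and specificity translate into predictive values at different base rates. The sketch below is a worked illustration under strong assumptions: it takes the average sensitivity (87%) and specificity (86%) reported above, assumes optimistically that they would carry over unchanged to an unselected population, and uses a prevalence of 7%, a value chosen from within the 5–7.2% range cited in the Introduction. The function name is hypothetical.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Expected accuracy, PPV, and NPV for a given base rate of the condition."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    return {
        "accuracy": true_pos + true_neg,
        "ppv": true_pos / (true_pos + false_pos),
        "npv": true_neg / (true_neg + false_neg),
    }


# Balanced case-control sample (50% ADHD), as in many of the reviewed designs.
balanced = predictive_values(0.87, 0.86, prevalence=0.50)
# Assumed unselected population with 7% prevalence.
population = predictive_values(0.87, 0.86, prevalence=0.07)

print(round(balanced["ppv"], 2), round(population["ppv"], 2))
```

Under these assumptions the overall accuracy stays at about 86% in both scenarios, but the positive predictive value drops from about 0.86 in the balanced sample to about 0.32 at 7% prevalence, meaning that roughly two out of three positive classifications would be false positives.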

Table 1. Summary of included studies divided between naturalistic designs and laboratory tasks. Both categories are in ascending order by mean age of the participants.

4 Discussion

The present critical review identified 25 studies examining the role of motion sensors in the clinical assessment of ADHD. Fifteen of these studies, which focused on overall daytime activity levels rather than on the detection of ADHD symptoms, are not discussed here in detail for three reasons: their research questions do not allow a comprehensive discussion of the clinical interpretability and utility of the findings, they are not methodologically comparable to the newer studies, and a meta-analysis of these older studies already exists [15]. Overall, the reported classification accuracies or AUCs for identifying single participant status vary widely across individual studies (Table 1), which could be due to several factors (e.g., different analysis methods, sample characteristics or measurement solutions, variability in the measurement context). Such heterogeneity in this emerging field should be carefully considered, and one important step toward advancing the clinical use of these methods would be to establish generally accepted research standards for the field. This paper raises some of the critical questions for improving sensor-based research of ADHD, aiming to contribute to this path toward future health care applications. Besides varying experimental designs and research methods, clinical interpretation of the findings is limited by participant samples that rarely represent the true variability in the population. In particular, the samples lack demonstrated cases of attention deficits below the diagnostic threshold (i.e., the groups may have included only those with a diagnosis or individuals with no attention deficits whatsoever) and individuals with other neurodevelopmental disorders (e.g., learning disabilities, autism spectrum disorder, conduct disorder, mood and anxiety disorders). Finally, the work toward a detailed understanding of motion sensor signals as part of the manifestation of ADHD symptoms is still underway. It would be critical to carefully benchmark or cross-validate the sensor-based methods against other assessment methods and to determine which individuals with ADHD can or cannot be detected from the accelerometer data. Research addressing these topics is likely to determine how broadly and for which purposes sensor-based methods could be used in the clinic. In most cases, the challenges ahead are such that, at least in principle, they can be solved with the existing methods by running more extensive high-quality studies employing current technologies (e.g., large-scale multi-center studies) along with other benchmarking methods and detailed contextual descriptions. In the following sections, we go through these research quality issues and clinical aspects in more detail.

4.1 Critical Analysis of the Research Quality in Sensor-Based ADHD Studies

Evaluating the quality of research findings here involves several issues: (1) the number and location of the sensors in the measurement concept, (2) the sensor type and sampling rate, (3) the measurement situation and contextual control (e.g., naturalistic vs. experimental), and (4) various choices made in the data analysis (e.g., simple frequentist statistics vs. advanced machine learning methods). The existing studies have mostly utilized a single sensor worn on the hand, leg, or waist (Table 1). Although a single sensor is probably able to detect the overall level of activity, it may not detect all the relevant hyperactivity symptoms, which are essentially characterized by specific types of bodily movements when the participant is otherwise still [45]. For instance, in measurements in a school class, movements of the hands and legs (fidgeting), torso rotations (e.g., talking to another student), leaving the seat (interruptions of the learning situation), and other such distinct behaviors signal very different issues, and detecting such movement patterns could significantly improve clinical interpretability. For such an analysis, including at least four sensors would be critical. Inaccuracies in the detection of the symptoms could also relate to the sampling rate. Some studies have collected data at a 5 Hz sampling rate [28], which could potentially limit the detection of high-frequency movement signals. Overall, however, it is likely that the sampling rates of the existing commercially available sensors are sufficient for a detailed enough movement analysis and that the bottlenecks lie in other factors related to sensor placement and data analytics [21, 27]. Some data loss has occurred in the existing studies, but such problems are unlikely to be a key factor from the methods development side: many sufficiently reliable measurement solutions are available, the data is generally exceptionally rich compared with many other methods, and a data loss of a few percent could easily be tolerated in measurements that may last for several days. We suggest that a more important factor would be to obtain more detailed data on the measurement situation to help improve the interpretability of the findings. Apparently, the accuracy of symptom detection may vary a lot according to the measurement situation, even within the same group of participants in a single study [25, 39]. It would also be important to acquire reference data on certain types of bodily movements to teach the algorithm to identify the corresponding movement patterns (supervised learning), to improve the interpretability of complex machine learning and deep learning methods that tend to be ‘black boxes’, and to share the algorithms for evaluation and testing across datasets to increase the transparency of the research.

4.2 Evaluation of the Clinical Utility of the Sensor-Based Diagnostics

Most of the studies conducted so far have been pilot studies with small samples that do not represent the variability of attention deficits or psychiatric and neurodevelopmental disorders in the population. However, a few refreshing examples with two clinical populations [33, 35] or impressive sample sizes [34] have already been published, indicating that this research field is moving in a direction where sufficient representativeness of the normative and clinical data may eventually be reached. Measurement durations vary widely (Table 1), but as so many factors differ between the individual studies (sensor solutions, analytical methods, experimental paradigms), the number of data points is apparently not a critical factor in achieving excellent classification accuracy. Several studies have obtained outstanding classification accuracy already in short single-session measurements. Hence, at least in those cases where the two populations clearly differ (no suspicion of other disorders, comorbid symptoms, or close cases near the diagnostic threshold), there is no reason to assume that a relatively short measurement would not be sufficient. The amount of data might become an issue when extending to other types of populations and trying to fulfil the more stringent clinical criteria. These criteria require confirming the prolonged presence of the symptoms (min. 6 months) and their manifestation in multiple environments (e.g., home and school). In addition, the influence of the timing of the measurement (e.g., summer holiday vs. stressful school or work period) would be highly important to consider.

Another factor that should be accounted for, and that has already been examined to some extent, is the time of day, which could influence participants with and without ADHD differently and should be considered in short measurements. Especially the mid-range activity periods have been suggested to contribute to detecting ADHD [13, 27]. Due to day-to-day differences in measurement outcomes, multiple measurement days might be the best option to ensure reliable and representative results [25, 29]. It has been noted that during a multiple-session study, activity levels in participants with ADHD may not change over the study, whereas in the control group activity levels decreased after the first measurement day [25]. This could be due to adjustment to study participation and normalization of behavior after the beginning of the study [25]. Finally, the possible role of medication in the clinical reference samples should be further examined. It is difficult to obtain medication-naïve reference samples, and recommendations for the washout period and knowledge of the history of drug use vary. Generally, the history of stimulant use is not expected to be a major confounding factor, as most of the drugs in use have limited aftereffects and their effect is relatively short-lasting.

Factors that are increasingly examined in ADHD studies are the influences of age and gender on the manifestation of the symptoms. Among the published studies, only a few have been conducted in adults [22, 44]. In general, hyperactivity/impulsivity symptoms are less often observed in adults, and the symptoms may be milder. In a similar vein, symptoms in females typically fall more in the inattention domain, and detecting females with ADHD from motion sensor data might therefore be more difficult. Based on the currently available data, it is difficult to make detailed inferences about the role of age and gender in detection accuracy. However, a recent review that included data on gender and age differences in ADHD symptoms in a cohort of 1,326 children and adolescents revealed significant negative associations between female gender and total, inattentive, and hyperactive/impulsive ADHD symptoms, whereas age was significantly associated only with hyperactive/impulsive symptoms [46]. Until these issues are carefully examined, e.g., by providing reference samples and sensitivity/specificity values for different subgroups, these methods should be applied with caution. Although hyperactivity/impulsivity is quite strongly correlated with inattention, especially in children and in boys, the limited available data could lead to underdetection of individuals with particular types of symptom patterns. While the gender/sex bias of ADHD may result in underrepresentation of females in research, such factors could be taken into account, e.g., by prescreening gender-balanced groups from a larger sample. With prescreening, the possible influence of many other potential confounding factors (e.g., socioeconomic status, general abilities, academic performance) could also be controlled. In many of the published studies, even the absence of neurodevelopmental disorders in control group participants has not been carefully examined. Researchers have raised the point that in larger representative samples with detailed background information many of these factors could be accounted for [27].
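Reporting sensitivity and specificity separately for subgroups, as suggested above, is computationally simple once individual-level classification results are available. The following sketch shows one possible way to do this. The record format, the function name, and especially the records themselves are made up purely for illustration and do not represent findings from any study.

```python
from collections import defaultdict


def subgroup_metrics(records):
    """Sensitivity and specificity per subgroup.

    `records` is an iterable of (subgroup, has_adhd, predicted_adhd) tuples.
    A metric is None when the subgroup has no cases (or no controls).
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
    for subgroup, has_adhd, predicted_adhd in records:
        if has_adhd:
            key = "tp" if predicted_adhd else "fn"
        else:
            key = "fp" if predicted_adhd else "tn"
        counts[subgroup][key] += 1

    results = {}
    for subgroup, c in counts.items():
        cases = c["tp"] + c["fn"]
        controls = c["tn"] + c["fp"]
        results[subgroup] = {
            "sensitivity": c["tp"] / cases if cases else None,
            "specificity": c["tn"] / controls if controls else None,
        }
    return results


# Made-up toy records, NOT real data: (subgroup, has ADHD, classified as ADHD).
toy_records = [
    ("boys", True, True), ("boys", True, True), ("boys", True, True),
    ("boys", True, False), ("boys", False, False), ("boys", False, False),
    ("girls", True, True), ("girls", True, False),
    ("girls", False, False), ("girls", False, True),
]
results = subgroup_metrics(toy_records)
print(results)
```

With small subgroups, as in this toy example, the estimates are very imprecise, which underlines the need for sufficiently large and balanced reference samples before subgroup values are interpreted.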

While the current literature still offers only a limited window into which aspects of ADHD sensor-based methods detect and what the obtained results reflect, it is good to keep in mind that the potential applications of these methods go well beyond examination of the core symptoms. Studies reporting several other use cases have already been conducted. More specifically, motion sensors could be useful in detecting aggressive episodes, which are commonly observed because ADHD has high comorbidity with conduct disorders [47]. Motion sensors could also help in detecting comorbid coordination disorder [48] and neurological soft signs [49]. Finally, they have considerable potential for monitoring treatment, such as examining the effects of stimulants (including detailed data on the type of stimulant and individual dosing) [50], or even quality of life [51]. Together, these promising opportunities paint a rather positive picture of the potential clinical applications of sensor-based methods.

Several aspects arising from the critical analysis of the published studies could be considered in planning future research. To control for contextual effects, researchers could consider measuring multiple participants simultaneously in the same adult-supervised situation, which could help in interpreting the data. With such a setting, it might also be possible to obtain further information on other clinically relevant aspects, such as hyperactive behavior during social interaction in group situations. Besides contextual control, the number of participants within a study should generally be larger. Based on the reported studies, the role of measurement duration in the accuracy of detecting ADHD remains unclear, as the shorter studies have been stringently controlled laboratory studies, while the longer ones have been naturalistic studies that include various potential confounding factors arising from the naturalistic measurement context. This could be addressed, for instance, by developing experimental designs with broadly comparable naturalistic and laboratory conditions (e.g., virtual vs. real school class or home situation). Due to the complementary nature of the different technological advances aiming at objective diagnostics, integrative solutions that combine input data from multiple sources could give the best results. Large-scale data pools containing rich questionnaire, interview, neuropsychological, virtual reality, motion sensor, and biosensor data, on which advanced computational analyses can be performed, could help clinicians to obtain reliable results. According to the present results, the potential of standard smart watches, rings, or even mobile phone sensor signals for diagnostics could also be examined further. For example, smart watches may contain biosensors that could complement the data provided by the motion sensors [5]. In one potential scenario, smart watch users could download a medical app that provides data to healthcare service providers, who could then take the data into account in the diagnostic process.

4.3 Conclusions

This article critically evaluated the research quality and clinical utility of studies employing motion sensor data for ADHD assessment, discussing the current state of research in this field as well as the needs for future improvements. The motivation for this branch of research relates to the need to develop objective assessment methods that can record the manifestation of symptoms over long periods in everyday life with relatively little effort from the participants. These features also make motion sensors unobtrusive, so that participants are less likely to pay attention to their presence and change their behavior in ways that might bias the results. Such methods could be relatively inexpensive, easy to use, and cost-effective, potentially saving limited healthcare resources and improving the quality of the assessment. Motion sensors hence provide multiple potential benefits compared with current diagnostic methods such as interviews and questionnaires. Despite these promising features, this branch of research is still at an early stage with respect to large-scale clinical use. The majority of the studies covered in this review have limited sample sizes and unrepresentative populations, and the performance of these methods in differential diagnostics in particular remains largely unclear. The methods used in the studies are heterogeneous, which makes rigorous quantitative assessment of the factors contributing to the clinical value of these methods difficult. Quality standards, some of which were introduced in the present manuscript, should be kept high to meet medical regulation criteria, and sufficiently large studies with representative samples need to be conducted to replicate the promising results. In particular, the measurement concept (naturalistic vs. laboratory-based, and which type of measurement sessions/tasks), the annotation of the context, the supervision of the classifiers, and other such factors influencing the performance of the computational methods should be carefully examined. If these criteria are met, motion sensor research may provide methods that complement ADHD diagnostics already in the near future.