Background

Mobile health (mHealth) apps have transformed healthcare delivery, with over 350,000 direct-to-consumer (DTC) apps available worldwide [1]. These apps provide individuals with what promises to be convenient, accessible, and personalised health tools. Broadly, they can be classified into two main categories: wellness management and health condition management [1, 2]. Wellness management apps often use behaviour change techniques to track and promote healthy habits related to fitness, diet, sleep, or other lifestyle factors [1,2,3]. In comparison, health condition management apps are designed to support individuals in self-managing specific, often chronic conditions, such as diabetes, hypertension, or substance addictions [1, 2, 4]. These apps encompass a range of functionalities including tools for self-diagnosis and monitoring, clinical decision support tools for information and guidance from healthcare providers, and therapeutic apps for delivering treatment interventions directly to the user [1, 5, 6]. Thus, individuals looking to turn to technology to manage their health are exposed to a wealth of options.

The mHealth app industry has the potential to enhance population health outcomes and address access disparities [7, 8]. However, there is a concomitant need for rigorous evaluation and evidence-based approaches. Regulatory bodies and initiatives, such as the US Food and Drug Administration (FDA) Digital Health Center [9, 10], the European Commission eHealth policies [11], and the German Federal Institute for Drugs and Medical Devices (Bundesinstitut für Arzneimittel und Medizinprodukte, BfArM) [12], all play a crucial role in ensuring the safety and real-world effectiveness of mHealth apps and ultimately protecting consumers [13]. While the majority of wellness apps are deemed low risk, health condition management apps are often categorised as medical devices, leading to more stringent regulations and lengthy oversight, which can result in significant delays to the commercialisation of innovative health products [13, 14]. Hence, app developers either find themselves outside of the regulatory landscape, with little guidance on how to conduct evaluations, or under the obligation to embark on an expensive and long regulatory journey [15, 16].

The overall body of clinical evidence on mHealth app effectiveness has been growing, with 1,500 studies published between 2016 and 2021 [1]. Nevertheless, evidence on DTC health apps is sparse and inconclusive regarding safety and effectiveness [8, 17, 18]. Attempts have been made in recent years by regulatory bodies to streamline the path to commercialisation under their oversight [19]. Further, standardised approaches and reporting guidelines now exist to enhance research efficiency and quality for app developers and scientists evaluating digital health products [20,21,22,23]. However, conducting mHealth app research poses challenges such as lengthy study timelines, complex interventions with multiple features, changing app iterations, and a lack of in-house research expertise and funding [8, 16, 20]. Further, app developers lack resources on the state of the art in the field to guide research design choices appropriate for the stage of their product. A wide variety of study designs exist to evaluate digital health products [21, 24, 25], but it is not always clear which would be optimal in different contexts, with a range of different frameworks available [22]. Additionally, alternative approaches are being considered, like micro-randomised trials, but uptake may be slow [20, 26,27,28].

The aim of the current scoping review is to summarise the research designs and study characteristics used to evaluate and validate DTC mHealth apps currently on the market, at different company financing stages. Specifically, we focused on evaluations assessing health efficacy or effectiveness in improving health outcomes. By shedding light on the available research methods and evidence base, this scoping review seeks to understand what methods are commonly used, including identifying possible gaps and inconsistencies, thus contributing to the development of a solid framework for evaluating mHealth apps and promoting evidence-based practice and guidance in digital health.

Methods

We followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) [29] to guide the literature search (Supplementary Table 1).

Search strategy

A structured search was conducted on January 2, 2023 in the databases MEDLINE via PubMed and EMBASE. The search terms “mHealth” and “evaluation methods” were used and enriched with synonyms, truncations, and Medical Subject Headings (MeSH) (Supplementary Table 2).

Inclusion and exclusion criteria

Only full-text primary research studies published in English and in peer-reviewed journals between January 2017 and January 2023 were included. The choice of January 2017 as the starting point for the search was driven by key developments in the digital health sector, as there was a significant increase in digital health apps available to consumers in 2017 [1]. Additionally, the 5-year time frame was selected to reflect the contemporary landscape of mHealth app development and evaluation. Studies were included if they evaluated stand-alone mHealth apps targeted at the general public, measured health outcomes, and the evaluated app was downloadable via the Google Play store (with >50K global downloads). The download cut-off was applied to ensure relevance to the public. No download cut-off criteria were applied to iOS store downloads as these data are not publicly available.

Evaluations lacking a health efficacy or effectiveness (including cost-effectiveness) evaluation component were excluded. mHealth apps that relied solely on text messages or phone calls as their primary behaviour change method were excluded. Finally, studies were excluded if the apps were solely a reminder service (including medication, treatment adherence, and appointment), electronic patient portal, or cloud-based personal health record app (Supplementary Table 3).
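The criteria above can be read as a single eligibility check per study. The following is a minimal sketch in Python, not the screening tool used in this review: the record fields, the constant names, and the use of the search date as the end of the publication window are all assumptions made for illustration, and the sketch encodes only the Google Play threshold because no cut-off was applied to iOS downloads.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumed publication window: January 2017 up to the search date (January 2, 2023).
WINDOW_START = date(2017, 1, 1)
WINDOW_END = date(2023, 1, 2)
MIN_GOOGLE_PLAY_DOWNLOADS = 50_000  # the ">50K global downloads" cut-off


@dataclass
class StudyRecord:
    """Hypothetical screening record; every field name is illustrative."""
    published: date
    in_english: bool
    peer_reviewed_primary_research: bool
    standalone_public_app: bool
    measures_health_outcomes: bool
    evaluates_efficacy_or_effectiveness: bool
    google_play_downloads: Optional[int]  # None if the app is not on Google Play
    messaging_or_calls_only: bool
    reminder_portal_or_record_only: bool


def is_eligible(study: StudyRecord) -> bool:
    """Return True when a record meets the inclusion and no exclusion criterion."""
    included = (
        WINDOW_START <= study.published <= WINDOW_END
        and study.in_english
        and study.peer_reviewed_primary_research
        and study.standalone_public_app
        and study.measures_health_outcomes
        and study.google_play_downloads is not None
        and study.google_play_downloads > MIN_GOOGLE_PLAY_DOWNLOADS
    )
    excluded = (
        not study.evaluates_efficacy_or_effectiveness
        or study.messaging_or_calls_only
        or study.reminder_portal_or_record_only
    )
    return included and not excluded
```

Keeping the inclusion and exclusion conditions in two separate expressions mirrors the two paragraphs of criteria, which makes it easier to audit the check against Supplementary Table 3.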

Data management and selection process

The literature search results were transferred to Zotero reference management software for de-duplication. Three reviewers (CP, KP, VN) pilot-screened the same random sample of 5% of studies at the title and abstract stage to ensure consistency. The remaining studies were randomly distributed between the reviewers. Studies designated a ‘maybe’ by any reviewer were subsequently checked by all three reviewers collectively.
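The pilot sample and the random distribution of the remaining records could be sketched as follows. This is an illustrative reconstruction rather than the procedure actually used: the function name, the fixed seed, and the round-robin allocation after shuffling are assumptions, and only the 5% pilot fraction and the three reviewers come from the text.

```python
import random


def allocate_for_screening(record_ids, reviewers, pilot_fraction=0.05, seed=0):
    """Split records into a shared pilot sample and per-reviewer workloads.

    All reviewers screen the pilot sample; each remaining record goes to
    exactly one reviewer. The seed only makes the sketch reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(record_ids)
    rng.shuffle(shuffled)

    n_pilot = round(len(shuffled) * pilot_fraction)
    pilot = shuffled[:n_pilot]
    remaining = shuffled[n_pilot:]

    # Dealing the shuffled records out in turn gives a random allocation
    # with workloads that differ by at most one record.
    allocation = {reviewer: [] for reviewer in reviewers}
    for index, record in enumerate(remaining):
        allocation[reviewers[index % len(reviewers)]].append(record)
    return pilot, allocation


# Usage with placeholder IDs; the record count here is illustrative only.
pilot, allocation = allocate_for_screening(range(1000), ["CP", "KP", "VN"])
```

The same function would cover the full-text stage by passing `pilot_fraction=0.10`.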

For full-text screening, 10% of full-text studies were screened by all three reviewers and the results were discussed to ensure consistency and reliability. The remaining studies were randomly distributed between the reviewers. At the full-text screening stage, all identified app titles were searched in the Google Play store to confirm their eligibility.

Data extraction and analysis

A data extraction form was pilot-tested on five randomly selected studies and completed for all included studies (Supplementary Table 4).

The extracted mHealth app characteristics include: app name, number of global Google Play App store downloads, financing stage of company at publication date, current number of employees, founding or launch date of app, mHealth app category (wellness management versus health condition management), mHealth app sub-category, companion device, and regulatory status (when applicable). Global Google Play app store download data were extracted on August 4, 2023; the number of downloads at the time of study publication is therefore unknown for all included studies. Access to historical app download records from Google Play is typically restricted, making it challenging to obtain precise download figures for the study's publication date. Company financing stage at publication date, current number of employees, and founding or launch date of app were extracted using Crunchbase Database, a company providing business information about private and public companies (www.crunchbase.com). We were able to determine the company financing stage at study publication date; however, the number of employees at the time of study publication is unknown for all included studies.

Apps were dichotomised as wellness management or health condition management. Wellness management apps were further classified into sub-categories based on the health outcome aim of the app. These included diet and nutrition, exercise and fitness, mental health, sleep, children’s health, oral hygiene, and skin health. Health condition management apps were further classified into sub-categories based on the aim of the app. These included diagnostics, clinical decision support tools, or therapeutics. Apps with companion medical devices by the same or an alternative company were recorded as well as the FDA regulatory status of the device [10].

The extracted study characteristics include: country study was conducted in, potential or stated conflict of interest from the team that conducted the study (including external research groups with and without conflicts of interest, internal team, and mixed teams; these were determined based on declared conflicts of interest, author affiliations, author contributions, acknowledgements, and study funding), sample target age group, sample target health condition and/or population, study purpose (focus on health outcomes), study design (as reported by authors), sample size (defined as sample allocated to intervention and when applicable, split by intervention and control group), study intervention length (in months), retention rate (defined as % of study participants who started the intervention or control method and remained until the defined end of the study intervention period), type of intervention (behaviour change technique), methods for controls (when applicable), health outcome measurement instruments, inferential statistical techniques (and related health outcome measured), study results, and distribution of study demographics at baseline (mean (sd) of age and frequency (%) of sex and ethnicity).
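The retention rate definition above reduces to a simple ratio. A minimal sketch, assuming integer participant counts and a hypothetical function name:

```python
def retention_rate(started: int, completed: int) -> float:
    """Percentage of participants who started the intervention or control
    method and remained until the defined end of the intervention period."""
    if started <= 0:
        raise ValueError("started must be a positive count")
    if not 0 <= completed <= started:
        raise ValueError("completed must lie between 0 and started")
    return 100.0 * completed / started


# A study that starts with 52 participants and retains all of them.
full_retention = retention_rate(started=52, completed=52)
```

Using the number who started, rather than the number enrolled or randomised, as the denominator follows the definition given in the extraction form.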

Results

We found 2799 articles, of which 47 were included in the review (Fig. 1) [30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76].

Fig. 1 PRISMA flow chart. Search and study selection process for this review

mHealth app characteristics by financing stage

The financing stage of the app companies at study publication date (Fig. 2) included 16% early stage startups (Pre-seed, Seed and Series A; 6/38), 29% scale-ups (Series B-F; 11/38), 39% acquired or public (15/38), and 16% of apps developed by universities or government groups (6/38). Companies where financing data was not found (9/47) were excluded from this categorisation.

Fig. 2 Infographic of mHealth DTC app evaluation methodology. Grouped by financing stage of company at study publish date

We found a significant association (p<0.001) between the financing stage and the team conducting the study (Fisher’s exact test; Supplementary Table 5), with early-stage startup (n=6) and scale-up (n=11) research both being conducted by external groups (no declared conflict of interest; n=3 and n=5, respectively) and mixed teams (internal employees and external collaborators; n=3 and n=6, respectively); acquired or public companies (n=15) being exclusively researched by external groups; and university or government-developed app research (n=6) being conducted mainly by internal teams (n=3), with some efforts led by external groups (n=2) and mixed teams (n=1).
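For readers who want to see how such a test works on a table larger than 2x2, the sketch below implements an exact test (the Fisher-Freeman-Halton extension of Fisher's exact test) by enumerating every table with the observed margins. It is a dependency-free illustration, not the software used for this review. The contingency table is assembled from the counts reported in the paragraph above; its layout is an assumption, since Supplementary Table 5 is not reproduced here.

```python
from fractions import Fraction
from math import factorial


def _rows(total, caps):
    """Yield every split of `total` across columns without exceeding `caps`."""
    if len(caps) == 1:
        if total <= caps[0]:
            yield (total,)
        return
    for first in range(min(total, caps[0]) + 1):
        for rest in _rows(total - first, caps[1:]):
            yield (first,) + rest


def _tables(row_sums, col_left):
    """Yield every table with the given row sums and remaining column sums."""
    if len(row_sums) == 1:
        yield (tuple(col_left),)  # the last row is fixed by the margins
        return
    for row in _rows(row_sums[0], col_left):
        remaining = tuple(c - x for c, x in zip(col_left, row))
        for tail in _tables(row_sums[1:], remaining):
            yield (row,) + tail


def _weight(table):
    """Table probability up to a constant that depends only on the margins."""
    denominator = 1
    for row in table:
        for cell in row:
            denominator *= factorial(cell)
    return Fraction(1, denominator)


def fisher_freeman_halton(table):
    """Exact p-value: total probability of tables no more likely than observed."""
    rows = tuple(tuple(r) for r in table)
    row_sums = [sum(r) for r in rows]
    col_sums = tuple(sum(c) for c in zip(*rows))
    observed = _weight(rows)
    total = Fraction(0)
    extreme = Fraction(0)
    for candidate in _tables(row_sums, col_sums):
        w = _weight(candidate)
        total += w
        if w <= observed:
            extreme += w
    return float(extreme / total)


# Rows: early-stage startup, scale-up, acquired or public, university or government.
# Columns: external group, mixed team, internal team (assumed layout).
TEAM_BY_STAGE = [
    [3, 3, 0],
    [5, 6, 0],
    [15, 0, 0],
    [2, 1, 3],
]
p_value = fisher_freeman_halton(TEAM_BY_STAGE)
```

Exact fractions are used so that ties with the observed table are compared without floating-point tolerance. Full enumeration is practical here because the table is small (38 studies); larger tables would usually call for a Monte Carlo approximation.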

No association was found between study design and financing stage using Fisher’s exact test (p=0.55; Supplementary Table 6). However, early stage startups were more apt to use pilot (n=2) or full-scale RCTs (n=2); scale-ups showed a higher inclination towards pre-post studies (n=5) and full-scale RCTs (n=4); acquired or public companies employed more diverse study design choices, including pilot (n=6) and full-scale (n=3) RCTs, pre-post studies (n=4), and alternative designs (including a micro-randomised trial and a non-randomised open label controlled trial; n=2); and university or government-developed app research teams tended to use pre-post studies (n=3), full-scale RCTs (n=2), and a 2x2 randomised, mixed factorial design (n=1).

mHealth app categories

Most apps (94%, 44/47) fell under the category of wellness management. Among the wellness management apps, 32% (14/44) targeted three or more health outcome sub-categories. The most prevalent subcategories included diet and nutrition (55%, 24/44), exercise and fitness (55%, 24/44), mental health (52%, 23/44), and sleep (41%, 18/44; Fig. 3 and Supplementary Table 7). Of the few health condition management apps (6%, 3/47), all were therapeutic tools designed for self-monitoring specific health conditions (including diabetes, hypertension, and allergic rhinitis; Fig. 3 and Supplementary Table 7).

Fig. 3 Infographic of mHealth DTC app evaluation methodology characteristics

Nearly one-third of the wellness management apps (32%, 14/44) were associated with companion devices offered by the same or alternative companies. Of these, eleven apps were integrated with smartwatches, two apps were linked to smart scales, and one app was linked to a monitoring system for urine excretion. Only the smart scale devices have received FDA 510(k) clearance. Two of the three health condition management apps were accompanied by companion devices with FDA 510(k) clearance. One of these devices is a capillary glucose reader catering to individuals with diabetes [65] and the other is a blood pressure monitor targeting individuals with hypertension [66].

Numerous wellness management studies focused on distinct populations to explore various health outcomes. Twenty-seven percent focused on overweight and obese individuals to evaluate dietary and physical activity improvements (12/44), 16% involved cancer patients, examining enhancements in diet, exercise, mental health, and sleep (7/44), and 11% targeted women with conditions such as breast cancer and postpartum depression (5/44). However, the largest single group of wellness management apps did not have a target condition or disease in mind and were interested in improving wellness health outcomes in the general population (36%, 16/44).

Study characteristics of the included studies

A variety of research designs were employed in the evaluation of all included mHealth apps (Table 1 and Fig. 3), with the majority (64%, 30/47) being RCTs. Among these, 57% were full-scale RCTs (17/30), characterised by medium-sized sample groups (median 107, range 28-1573), moderate intervention durations (median 2.5 months, range 0.3-24.0 months), and relatively high retention rates (mean 79.6%, SD 18.5). Pilot RCTs (37%, 11/30) had smaller samples (median 54, range 25-142), longer intervention durations (median 4.5 months, range 1.4-12.0 months), and higher retention rates (mean 86.3%, SD 9.7). Full-scale and pilot RCTs employed many control methods, including standard care, waitlist (delayed access to treatment), partial access to treatment, alternative treatments, or no treatment. Novel RCT approaches constituted a minor portion (7%, 2/30). A micro-randomised trial featured a large sample size of 1565 participants over a six-month study period, using a partial treatment control group, though the retention rate was not reported. The mixed factorial (2x2) study involved a smaller sample of 52 participants for a one-week study period, using an alternative treatment control method, and achieved a 100% retention rate.

Table 1 Frequency table of study designs and associated study characteristics

Pre-post studies accounted for 32% (15/47), split between non-pilot (40%, 6/15) and pilot (60%, 9/15) studies. The non-pilot pre-post studies featured larger sample sizes (median 129, range 61-416) and longer study durations (median 2.76 months, range 0.69-12.0 months), but had lower retention rates (mean 68.3%, SD 22.3). In comparison, the pilot pre-post studies had smaller sample sizes (median 27, range 8-90) and shorter durations (median 1.8 months, range 1.0-2.8 months), and exhibited higher retention rates (mean 84.6%, SD 16.0). The majority of pre-post studies used a before/after single group design (87%, 13/15), and only two used a non-randomised comparative design (with intervention and control groups).

Finally, the two non-randomised open label trials (4%, 2/47) had sample sizes of 19 and 75, study intervention lengths of 1.8 and 2.76 months, and retention rates of 45.0% and 87.0%.

Participant demographics

The studies were conducted in 15 countries (Fig. 3, Supplementary Table 8). The majority (62%, 29/47) of the studies were conducted in the USA. Other countries were represented by one or two studies and no global or multi-country studies were found.

The majority of the studies (72.3%, 34/47) targeted adults aged 18 years and older, 10.6% focused on children under 18 years of age (5/47), and the remaining studies (17.0%; 8/47) focused on adults aged 40 years and older (Fig. 3, Supplementary Table 9). Eight studies were gender/sex-specific, with five of them exclusively researching female participants in the context of breast cancer, pre- and post-partum depression, and premenstrual syndrome. Conversely, three studies solely included male participants, focusing on esophageal cancer and obesity. The remaining studies exhibited a wide range in the proportion of female and male participants at baseline, varying from 21% to 95% and 5% to 78%, respectively (Fig. 4). Overall, 77% (36/47) of studies included a majority of female participants. Notably, only one study reported inclusion of individuals outside of a sex or gender binary [73].

Fig. 4 Reported sex distribution at baseline

Ethnicity was reported by 57% (27/47) of included studies (Figs. 3 and 5). Sixty-seven percent of these studies reported a majority of White/Caucasian participants (18/27). Two studies conducted in the USA targeted Hispanic/Latin adults [35, 71], one study conducted in the USA researched an underserved community with 95% Black/African descent participants [66], and one study conducted in Singapore reported all Asian/Asian descent participants (92% Chinese, 0.6% Malay, 4.5% Indian, 2.9% Other) [72]. Excluding the studies that targeted specific ethnicities, the median (range) representation of each ethnic group among included studies was: 62% (4%-98%) White/Caucasian, 7% (0%-50%) Black/African descent, 0.4% (0%-17%) Asian/Asian descent, 7% (0%-48%) Hispanic/Latin, 9% (0%-60%) Biracial/Multiracial, and 0.3% (0%-14%) Indigenous Groups (Fig. 5).

Fig. 5 Reported ethnicity distribution from included studies at baseline

Measurement tools for evaluating mHealth apps

Various measurement tools were used to assess health outcomes (Supplementary Table 10). Five commonly employed measurement tools were identified: the Short Form Health Survey (SF-12 or SF-36) [77] for measuring health-related quality of life (7/47), the Patient-Reported Outcomes Measurement Information System (PROMIS) [78] for evaluating physical, mental and social health (6/47), the Perceived Stress Scale (PSS) [79] for measuring individual stress levels (6/47), the Five Facet Mindfulness Questionnaire Short Form (FFMQ-SF) [80] for assessing the five vital elements of mindfulness (4/47), and the Hospital Anxiety and Depression Scale (HADS) [81] for measuring anxiety and depression among patients in hospital settings (4/47). These measurement tools were employed to evaluate wellness apps and health promotion apps.

Discussion

To our knowledge, this scoping review is the first attempt to summarise the financial stages, study characteristics, and evaluation methods of popular mHealth apps currently on the market. This is particularly important given the varying guidance available to app developers outside of, or in the phases leading up to, regulatory oversight.

We found that most of the studies were conducted by companies at later stages of financing. The finding that scale-ups and acquired/public companies are better positioned to invest in evidence generation is unsurprising. While governmental and international funding schemes exist to support start-ups and small businesses in their evaluation efforts, internal expertise or ability to leverage university collaborations to obtain such funding may be lacking [15]. Further, even though educational resources on the need for iterative evaluations of health products are available, awareness of their implications among app developers is still limited [7, 8, 18]. Working groups and committees including industry, academia and government representatives are needed to build action plans guiding early-stage companies through evaluating their products at cost.

While the majority of wellness apps (with the exception of smart scale companion devices) did not have FDA clearance, a subset of health condition management apps obtained FDA clearance along with associated medical devices. The lack of clear regulatory pathways and guidance for informational apps for health management and tracking explains this finding [10]. Health condition management apps face more stringent regulations and lengthy oversight, which often result in significant delays to commercialisation [13, 14]. Future research should assess regulatory challenges faced by health condition management apps and provide recommendations on how to implement better strategies for expediting clearance, including which research designs are suitable to satisfy regulatory concerns.

The prevalence of full-scale RCTs, even among early-stage startups, is encouraging. Alongside full-scale RCTs, pilot RCTs and pre/post studies were also frequent evaluation methods. Even though the reviewed studies often had small sample sizes and short duration, this finding suggests that RCTs may be more feasible for app developers than previously thought. We also observed some novel study designs: a factorial RCT (assessing multiple interventions simultaneously) and a micro-randomised trial (involving frequent, small-scale randomisations within individuals' daily lives). Alternative study designs offer promise as substitutes to traditional approaches, potentially allowing app developers to attain robust evidence at reduced costs, shorter timeframes and increased flexibility [20, 28]. While pre/post designs do not provide the same robust level of evidence for causality as RCTs, they are a simpler and cheaper design and can be appropriate where evidence requirements are not as stringent or as part of an iterative series of evaluations building up to an RCT. There was a notable gap in economic evaluations, which are often valued by healthcare services in their commissioning decisions. Future research should focus on barriers to adoption of novel designs, as well as shed light on the lack of health economics considerations in app evaluations.

The majority of studies in this review, regardless of study design, had study durations of less than 3 months, utilised wait-list controls, and maintained participant retention rates above 65%. Short study durations limit understanding of sustained health effects and highlight the need for longer follow-up periods. Wait-list controls also preclude long-term follow-up, because the control group eventually receives the intervention. However, finding or developing active controls with comparable characteristics can be a significant challenge for companies, particularly when faced with resource constraints [82]. Nevertheless, the wide adoption of wait-list controls may limit the interpretability of the results. Furthermore, participant retention over the course of a study is a critical factor that influences the validity of results, yet limited resources for participant compensation can impact retention. Industry-wide efforts to establish unified strategies for participant recruitment and retention, with potential support from innovative engagement approaches, such as gamification and personalised interventions, could prove instrumental in mitigating attrition rates.

Our findings reveal a distribution of study conductors and highlight the importance of fostering robust academic-industry partnerships [15]. While external research groups may provide more impartial assessments, the absence of an internal team may limit the integration of research findings into the company's development process. A mixed approach, involving both internal teams and external collaborators, can strike a balance between impartial evaluation and in-house expertise, promoting a culture of evidence-based practice within companies. Future studies should investigate effective models for fostering collaboration between internal and external research groups, optimising the integration of study findings within organisations, and further enhancing evidence-based practices.

A lack of ethnic diversity and limited representation of children, seniors, and non-binary individuals in mHealth app research was revealed. Achieving ethical approval and meaningful engagement of these underrepresented groups is a pivotal but challenging step towards enhancing inclusivity and equity of mHealth app access and research. Future initiatives should prioritise development of culturally sensitive recruitment strategies to address the underrepresentation of ethnic minorities and non-binary individuals, allowing researchers to generalise study findings to diverse populations. Ideally, target users, patient groups, or other stakeholders should be included in research design as early as possible.

This review highlights a notable lack of specificity and standardisation in the measurement tools and outcomes employed across studies, due to the diverse range of health outcomes addressed by mHealth apps. This variability in instruments makes it challenging to compare findings and draw overarching conclusions. The absence of standardisation poses difficulties in aggregating evidence and establishing clear benchmarks for app developers. Collaboration between researchers, regulatory bodies, and industry stakeholders plays a valuable role in developing a set of standardised tools for evaluating specific health outcomes. These tools should be readily accessible to app developers, enabling them to design interventions aligned with accepted measurement standards.

This review has some limitations. Firstly, the review omitted formative evaluations, potentially missing early-stage app development insights. The emphasis on efficacy and effectiveness evaluations excluded exploratory observational studies, qualitative, implementation, and user experience insights. Secondly, the omission of mHealth apps designed for healthcare providers or facilitating interactions between individuals and healthcare providers resulted in the underrepresentation of health condition management apps and the research methodologies commonly employed to evaluate them. This deliberate exclusion, while aimed at maintaining focus on DTC apps, is the reason for the relatively few condition management apps identified in this review. The restriction to apps with over 50K Google Play downloads excluded smaller-scale apps, potentially overlooking innovative evaluation approaches. Language bias may exist due to the English language restriction, with a predominance of US-based studies limiting the transferability of findings to diverse global healthcare contexts. It is crucial to acknowledge that varying regulatory requirements exist among different countries, necessitating localised evidence. Furthermore, the complexities associated with the transference of digital health solutions from one nation to another [83] underscore the need for caution when assuming that successful outcomes in the US readily apply to other global healthcare settings. Lastly, we do not know which evaluations have been conducted but not published in the academic literature.

Conclusions

In conclusion, the field of digital health is rapidly expanding, posing significant challenges to app developers, regulators and end users. Academics and industry leaders are calling for streamlining and simplifying of current processes to avoid exerting too much pressure on companies who are innovating but cannot afford full-scale evaluations. We note that RCTs are feasible for a range of company types, but that most evaluations are on fairly small samples and with short follow-ups. Attempts to summarise the available evidence, such as this scoping review, together with detailed how-to guidance [21], represent a first step towards helping companies set realistic and feasible plans for ongoing product evaluations.