Introduction

Eight out of ten internet users access medical information online [1]. The YouTube platform, in particular, allows users to create medical content without any obligation to post verified information [2, 3]. In 2007, Keelan et al. first examined the quality of immunization-related online videos [4]. Many subsequent studies have further assessed the reliability of medical videos on YouTube; presently, the search term “YouTube” returns more than 1,500 publications on PubMed and Scopus (accessed on 17 Jan 2021) [1]. However, a standardized tool for evaluating medical health videos is lacking. Most previous studies used novel, topic-specific scoring systems based on the literature and the authors’ own knowledge [5]. The generalizability of these scoring systems is poor, their results are difficult to reproduce, and their validity and reliability have not been adequately assessed.

A variety of tools to evaluate the accuracy of medical information are available, including the DISCERN instrument, the Health on the Net (HON) code, the Journal of the American Medical Association (JAMA) evaluation system, the brief DISCERN instrument, the global quality score (GQS), and the video power index (VPI); medical videos can also be evaluated subjectively [1, 5, 6]. The HON Foundation devised eight principles for websites to abide by, called the HONcode [7]. HONcode certification is available for a fee, but the quality of the medical information itself is not rated. Furthermore, the validity and reliability of this system for YouTube videos have not been confirmed. The JAMA scoring system was created to evaluate medical information on websites [8] but has not been validated for videos. The DISCERN instrument was created nearly 20 years ago for application to “written information about treatment choices” [9, 10]. Again, however, this instrument has not been validated for medical videos. In addition, the second part of the DISCERN questionnaire focuses on treatment information; thus, applying it to videos that exclude treatment information yields misleading results [10]. The VPI, a measure of audience approval, is calculated as the number of likes a video has received divided by the total number of likes and dislikes [11]. Although frequently used, this score is not suitable for evaluating the quality and reliability of medical videos. Given the lack of suitable instruments, we developed the Medical Quality Video Evaluation Tool (MQ-VET), a reliable instrument for use by patients and healthcare professionals.
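For illustration only, the like-ratio calculation described above can be written as a short function; the function name and the example counts below are ours and are not part of the cited VPI definition.

```python
def video_power_index(likes: int, dislikes: int) -> float:
    """Audience-approval ratio: likes divided by the total of likes and dislikes."""
    total = likes + dislikes
    if total == 0:
        return 0.0  # no ratings available, so no approval signal
    return likes / total


# Hypothetical example: a video with 420 likes and 80 dislikes
print(video_power_index(420, 80))  # 0.84
```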

Materials and methods

Instrument development and item generation

Our original questionnaire included 42 novel items based on published evaluations of medical video quality. All questionnaires used in YouTube-related articles, as well as the questions used in subjective evaluations, were examined by both authors [1, 5, 12–14]. Candidate items were rated by the authors (0 points, not applicable; 10 points, highly applicable). Duplicate questions and those scoring below the average were excluded, leaving 28 questions.

Participants

Videos were evaluated by 25 medical and 25 non-medical participants, all of whom had obtained sufficient scores on a recognized national English language test and were fluent in English. The questionnaire items were rated by the participants in terms of quality and relevance (0–10 points per item). The face and content validity of the questionnaire were also evaluated using the same 10-point rating system. After excluding items with a score < 7, the final MQ-VET included 19 items. Ten unique videos, each the first result returned for a popular topic in a different medical field using YouTube’s default search settings, and therefore differing in uploader and medical topic, were evaluated using a 5-point Likert scale (Table 1). The DISCERN instrument was also applied to each video to assess concurrent validity.
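As a minimal sketch of the threshold-based item reduction described above (the ratings, item labels, and rater counts below are simulated placeholders, not study data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated 0-10 quality/relevance ratings: 28 candidate items x 50 raters
ratings = pd.DataFrame(
    rng.integers(0, 11, size=(28, 50)),
    index=[f"item_{i + 1}" for i in range(28)],
    columns=[f"rater_{j + 1}" for j in range(50)],
)

# Mean rating per item across all raters
item_means = ratings.mean(axis=1)

# Keep items whose mean rating is at least 7, mirroring the exclusion of items scoring < 7
retained = item_means[item_means >= 7].index.tolist()
print(f"{len(retained)} of {len(ratings)} items retained")
```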

Table 1 General information on the evaluated YouTube videos (updated on 12.02.2021)

Statistical analysis

Statistical analyses were performed using SPSS ver. 20.0 (IBM, Armonk, NY, USA). Data distribution was examined using the Shapiro–Wilk test and histograms. Continuous variables are expressed as means ± standard deviation (SD) with ranges, and categorical variables are expressed as numbers and percentages.
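The distributional checks described above can also be reproduced outside SPSS; a minimal sketch in Python with SciPy, using a hypothetical score vector:

```python
import numpy as np
from scipy import stats

# Hypothetical vector of total MQ-VET scores (one value per rating)
scores = np.array([62, 55, 71, 48, 66, 59, 73, 50, 64, 58], dtype=float)

# Shapiro-Wilk test for normality (p > 0.05 suggests no evidence of non-normality)
w_stat, p_value = stats.shapiro(scores)
print(f"Shapiro-Wilk W = {w_stat:.3f}, p = {p_value:.3f}")

# Descriptive statistics: mean +/- SD with range
print(f"mean = {scores.mean():.2f} +/- {scores.std(ddof=1):.2f}, "
      f"range = {scores.min():.0f}-{scores.max():.0f}")
```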

For the item analysis, kurtosis values, inter-item correlations (IIC), and item-total correlations were calculated. Exploratory factor analysis (EFA) was conducted to verify construct validity. The Kaiser–Meyer–Olkin (KMO) measure of sampling adequacy and Bartlett’s test of sphericity were used to check whether the data were suitable for EFA. In general, KMO values between 0.8 and 1 indicate that the sampling is adequate, whereas values below 0.6 indicate that it is not [15]. Factors were extracted using principal component analysis with varimax rotation.
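A sketch of an equivalent EFA workflow in Python using the factor_analyzer package (the analysis reported here was run in SPSS; the data frame of item scores below is simulated):

```python
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# Simulated ratings: 500 observations (50 raters x 10 videos) x 19 items on a 1-5 scale
rng = np.random.default_rng(1)
items = pd.DataFrame(rng.integers(1, 6, size=(500, 19)),
                     columns=[f"q{i + 1}" for i in range(19)])

# Sampling adequacy: KMO between 0.8 and 1 is adequate, below 0.6 is not
_, kmo_total = calculate_kmo(items)
chi2, p = calculate_bartlett_sphericity(items)
print(f"KMO = {kmo_total:.2f}, Bartlett chi2 = {chi2:.2f}, p = {p:.4f}")

# Principal-component extraction with varimax rotation, four factors retained
fa = FactorAnalyzer(n_factors=4, method="principal", rotation="varimax")
fa.fit(items)
loadings = pd.DataFrame(fa.loadings_, index=items.columns)
print(loadings.round(2))
```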

Reliability was assessed using Cronbach’s alpha. Concurrent validity was assessed using Spearman’s correlation coefficients between the MQ-VET and DISCERN scores.
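Both statistics have standard open-source implementations; a minimal sketch (the pingouin package and the simulated scores below are ours and were not part of the original SPSS analysis):

```python
import numpy as np
import pandas as pd
import pingouin as pg
from scipy import stats

rng = np.random.default_rng(2)

# Simulated 5-point item scores for one MQ-VET factor (one row per rating)
factor_items = pd.DataFrame(rng.integers(1, 6, size=(500, 5)),
                            columns=[f"q{i + 1}" for i in range(5)])

# Internal consistency of the factor
alpha, ci = pg.cronbach_alpha(data=factor_items)
print(f"Cronbach's alpha = {alpha:.2f} (95% CI {ci[0]:.2f}-{ci[1]:.2f})")

# Concurrent validity: Spearman correlation between simulated total scores
mqvet_total = rng.normal(60, 8, size=500)
discern_total = mqvet_total + rng.normal(0, 6, size=500)
rho, p = stats.spearmanr(mqvet_total, discern_total)
print(f"Spearman rho = {rho:.3f}, p = {p:.4f}")
```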

After the study was completed, a post hoc power analysis was performed using G*Power version 3.1.9.2 (Heinrich-Heine-Universität Düsseldorf, Düsseldorf, Germany). Using the exact test family (correlation: bivariate normal model), the post hoc power calculated from the correlation between the MQ-VET and DISCERN scores was 0.81 (Table 2).
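The exact bivariate-normal calculation above was done in G*Power; as a rough cross-check, the power of a test of a correlation coefficient can be approximated with the Fisher z transformation (the correlation and sample size below are placeholders, not the study values):

```python
from math import atanh, sqrt

from scipy.stats import norm


def correlation_power(r: float, n: int, alpha: float = 0.05) -> float:
    """Approximate two-sided power of a test of H0: rho = 0 via Fisher's z."""
    z_r = atanh(r)                    # Fisher z-transformed effect size
    se = 1.0 / sqrt(n - 3)            # standard error of z under bivariate normality
    z_crit = norm.ppf(1 - alpha / 2)  # two-sided critical value
    shift = z_r / se
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


# Placeholder values; the study used G*Power's exact bivariate-normal model
print(f"approximate power = {correlation_power(r=0.56, n=30):.2f}")
```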

Table 2 MQ-VET and DISCERN scores of the YouTube videos on different topics

Results

The mean age of the participants was 30.98 ± 4.38 years (range: 25–42 years). The professions of the participants were as follows: doctor (44%, n = 22), pharmacist (6%, n = 3), academic/teacher (20%, n = 10), and engineer (24%, n = 12); profession data were missing in three cases. Twenty-three participants (46%) held a bachelor’s degree, 6 (12%) a master’s degree, and 21 (42%) a doctorate.

Exploratory factor analysis

The Kaiser–Meyer–Olkin (KMO) value for the final MQ-VET (19 items) was 0.83, and Bartlett’s test statistic was χ² = 3920.72 (p < 0.001); thus, the data were suitable for factor analysis. The initial exploratory factor analysis (EFA) yielded five factors. The component correlation matrix was orthogonal, and varimax rotation with Kaiser normalization was applied. The factor loadings of the final EFA are displayed in Table 3. Ultimately, our questionnaire included 15 items across four factors (5, 4, 3, and 3 items for factors 1–4, respectively).

Table 3 Factor loadings from exploratory factor analysis

Concurrent validity

The correlations between the final version of the MQ-VET and the DISCERN questionnaire were used to assess concurrent validity; the scores of both questionnaires are shown in Table 2. The first part of the MQ-VET correlated significantly with all sections of DISCERN: Sect. 1 (rho = 0.617, p < 0.001), Sect. 2 (rho = 0.508, p < 0.001), Sect. 3 (rho = 0.436, p < 0.001), and the DISCERN total score (rho = 0.640, p < 0.001). The second part of the MQ-VET also correlated significantly, although more weakly, with all sections of DISCERN: Sect. 1 (rho = 0.456, p < 0.001), Sect. 2 (rho = 0.167, p < 0.001), Sect. 3 (rho = 0.123, p = 0.006), and the total score (rho = 0.326, p < 0.001). The third part of the MQ-VET showed a significant but weak correlation only with Sect. 1 of DISCERN (rho = 0.228, p < 0.001); it was not significantly correlated with Sect. 2, Sect. 3, or the total score (p = 0.975, p = 0.578, and p = 0.18, respectively). The fourth part of the MQ-VET correlated significantly with all DISCERN scores: Sect. 1 (rho = 0.510, p < 0.001), Sect. 2 (rho = 0.231, p < 0.001), Sect. 3 (rho = 0.205, p < 0.001), and the total score (rho = 0.395, p < 0.001). The total MQ-VET score correlated significantly with all DISCERN scores: Sect. 1 (rho = 0.654, p < 0.001), Sect. 2 (rho = 0.377, p < 0.001), Sect. 3 (rho = 0.320, p < 0.001), and the total score (rho = 0.564, p < 0.001).
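As a sketch of how the correlation matrix reported above could be tabulated (the data frames, column names, and simulated scores below are hypothetical, not the study data):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(3)

# Simulated per-rating subscale scores: one row per rating, one column per subscale
mqvet = pd.DataFrame(rng.normal(size=(500, 5)),
                     columns=["part1", "part2", "part3", "part4", "total"])
discern = pd.DataFrame(rng.normal(size=(500, 4)),
                       columns=["sect1", "sect2", "sect3", "total"])

# Spearman rho between every MQ-VET part and every DISCERN section
table = pd.DataFrame(index=mqvet.columns, columns=discern.columns, dtype=float)
for part in mqvet.columns:
    for section in discern.columns:
        rho, _p = stats.spearmanr(mqvet[part], discern[section])
        table.loc[part, section] = rho
print(table.round(3))
```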

Reliability

Regarding internal consistency, the Cronbach’s alpha values were 0.81, 0.78, 0.75, and 0.73 for factors 1–4, respectively. The Cronbach’s alpha coefficient for the overall MQ-VET questionnaire was 0.72.

Discussion

Collectively, our results confirmed the validity and reliability of the MQ-VET questionnaire. Although previous publications have discussed the quality of medical videos on YouTube [16–18], standardized assessment tools were not used; typically, de novo questionnaires were devised based on the literature and the authors’ own knowledge [1, 5]. Several tools exist for evaluating written online information [5], but their applicability to videos is not known [7]. The DISCERN questionnaire is designed to evaluate treatment options [10] and is therefore inappropriate for analyzing videos lacking treatment information. The JAMA questionnaire, the GQS, and the HONcode were likewise created to evaluate medical websites and written information on the internet. The VPI was designed specifically for videos, but it assesses popularity rather than quality and content, and popularity-based scores change over time, which impairs repeatability [11].

The MQ-VET resolves the aforementioned issues, and its validity and reliability have been demonstrated across a variety of medical topics. In addition, the MQ-VET was designed for use by both medical professionals and the general population. Evaluation of additional medical topics by more reviewers will provide further support for the MQ-VET, and translation into other languages will increase its utility. This study was limited by the small number of participants and videos, and by the lack of a test–retest reliability assessment of the MQ-VET. We believe these limitations can be addressed in future studies (Table 4).

Table 4 Final version of the Medical Quality Video Evaluation Tool

In conclusion, we have developed a questionnaire to evaluate the quality of online medical videos posted by both medical professionals and members of the general public. We believe that this tool will help standardize evaluations of online videos.