Quality of patient-reported outcome measures for primary dysmenorrhea: a systematic review

Purpose To conduct a systematic review of the quality of patient-reported outcome measures (PROMs) for primary dysmenorrhea (PDys) using the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) methodology, and to derive recommendations for use of the PROMs. Methods We searched PubMed and Web of Science for studies reporting on the development and/or validation of any PROMs for women with PDys. Applying the COSMIN Risk of Bias Checklist, we assessed the methodological quality of each included study. We further evaluated the quality of measurement properties per PROM and study according to the criteria for good measurement properties, and graded the evidence. Based on the overall evidence, we derived recommendations for the use of the included PROMs. Results Data from seven studies reporting on four PROMs addressing different outcomes were included. Among those, the Adolescent Dysmenorrhic Self-Care Scale (ADSCS) and the on-menses version of the Dysmenorrhea Symptom Interference Scale (DSI) can be recommended for use. The Exercise of Self-Care Agency Scale (ESCAS) and the Dysmenorrhea Daily Diary (DysDD) have the potential to be recommended for use, but require further validation. The off-menses version of the DSI cannot be recommended for use. Conclusions The ADSCS can be recommended for the assessment of self-care behavior in PDys. Regarding measures of impact, the on-menses version of the DSI is a suitable tool. Covering the broadest spectrum of outcomes, the DysDD is promising for use in medical care and research, encouraging further investigations. Further validation studies are indicated for all included PROMs. Supplementary Information The online version contains supplementary material available at 10.1007/s11136-023-03517-8.


Plain English summary
Primary dysmenorrhea (PDys), defined as menstrual pain in the absence of pelvic pathology, is among the most common gynecological conditions among women of reproductive age. To assess patient-reported outcomes (PROs) related to PDys, several disease-specific patient-reported outcome measures (PROMs) are applied. An evaluation of the quality of PROMs for PDys using a standardized methodology is currently not available, but would help researchers and clinicians to select the most suitable instrument. We aimed (a) to conduct a systematic review of the quality of PROMs for PDys using the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) methodology, and (b) to derive recommendations for their use in research and patient care. Data from seven studies reporting on four PROMs focusing on various outcomes were included. Among the identified instruments, the Adolescent Dysmenorrhic Self-Care Scale (ADSCS), which measures self-care behavior, and the on-menses version of the Dysmenorrhea Symptom Interference Scale (DSI), which assesses the impact of PDys on physical activities, sleep, daily activities, work, leisure and social activities, and mood, can be recommended for use. The Dysmenorrhea Daily Diary (DysDD), which assesses menstrual bleeding, pelvic pain, use of rescue medication, and impact of pelvic pain/cramps on daily life, does not currently fulfill the COSMIN criteria for a recommendation. However, as the tool captures the broadest spectrum of outcomes, it appears promising for use in research and patient care, and further investigations are encouraged. The off-menses version of the DSI cannot be recommended for use.

Background
Primary dysmenorrhea (PDys), defined as menstrual pain in the absence of any organic cause [1], is among the most common gynecological conditions among women of reproductive age [2]. The prevalence of PDys ranges from 45 to 95% among menstruating women, of whom up to 29% experience severe pain [3]. The burden of PDys is substantial, with a negative impact on physical and mental health, physical activity, school and work productivity, sleep, and health-related quality of life [4]. Treatment commonly involves drugs, medicinal plants, and acupressure [5]. Evaluating the efficacy of these interventions from the patients' perspective is critical, and patient-reported outcome measures (PROMs) are suitable tools for this purpose [6]. When selecting an instrument, the construct of interest and the quality of measurement properties of available tools should be taken into account. The COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) methodology [7] provides a comprehensive framework for assessing the methodological quality of single studies on measurement properties of PROMs, and for evaluating the quality of the measurement properties themselves. The COSMIN methodology was developed specifically to guide the selection of PROMs in research and clinical practice, in an international Delphi study involving experts with backgrounds in epidemiology, statistics, psychology, and clinical medicine [8]. COSMIN provides a methodological approach including detailed, standardized, and transparent criteria, and practical tools for selecting the most appropriate instrument [9].
A systematic review of disease-specific PROMs for PDys and an assessment of the quality of their psychometric properties is currently not available, but would facilitate the selection of the most appropriate instrument for researchers and clinicians. Using the COSMIN methodology, we pursued the following aims:
1. To conduct a systematic review of the quality of existing disease-specific PROMs for PDys, i.e.,
   i. to evaluate the quality of development and/or validation studies
   ii. to evaluate the psychometric properties of the identified PROMs including aspects of interpretability and feasibility
   iii. to grade the evidence
2. To derive recommendations for use of the identified PROMs in research and patient care.

Protocol and registration
The present systematic review was conducted following the recommendations of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses Protocols (PRISMA-P) statement [10] and the COSMIN guideline and manual for systematic reviews of PROMs [7,11]. The protocol has been registered in the International Prospective Register of Systematic Reviews (PROSPERO) (CRD42022358458).

Eligible studies
Inclusion and exclusion criteria are displayed in Table 1.

Study selection
Following deduplication of the records in Citavi 6, we screened titles and abstracts using Rayyan [12]. To assess initial eligibility, two reviewers independently evaluated titles and abstracts according to the inclusion and exclusion criteria. For articles considered eligible at this stage, the full texts were retrieved and likewise evaluated independently by two reviewers according to the predefined criteria. In case of any disagreement, consensus was reached within the research team.

Evaluation of measurement properties
All measurement properties were evaluated according to the COSMIN manual (based on [7,11,13]) in three substeps as outlined below.

Assessment of the methodological quality of the included studies
The methodological quality of each single study on a measurement property was evaluated using the COSMIN Risk of Bias checklist [11], independently by two reviewers with a background in psychology and experience in applying the COSMIN methodology. The COSMIN Risk of Bias checklist consists of 10 boxes, each encompassing the standards needed to assess the quality of a study on a specific measurement property (Appendix 2). Content validity is considered the most important measurement property, and the available evidence from content validity studies and the PROM development study was considered for the evaluation of content validity. The assessment is based on five items on relevance, one item on comprehensiveness, and four items on comprehensibility. Content validity is also rated by the reviewers themselves, and their ratings are considered in addition to the evidence from the literature. However, if no content validity studies are available, or only content validity studies of inadequate quality, and the PROM development is of inadequate quality, the rating of the reviewers determines the overall ratings [13]. The methodological quality of the studies was rated on a four-point rating scale as either very good, adequate, doubtful, or inadequate. The overall quality of a study was determined by the lowest rating of any standard in the box ("worst score counts") [11].
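The "worst score counts" principle can be illustrated with a minimal Python sketch. The rating labels follow the four-point scale described above; the function name and the example ratings are hypothetical and are not part of the COSMIN tools.

```python
from enum import IntEnum


class Rating(IntEnum):
    # Ordered from worst to best so that min() returns the worst rating.
    INADEQUATE = 0
    DOUBTFUL = 1
    ADEQUATE = 2
    VERY_GOOD = 3


def overall_box_rating(standard_ratings):
    """Return the lowest rating among the standards of one checklist box."""
    if not standard_ratings:
        raise ValueError("a box needs at least one rated standard")
    return min(standard_ratings)


# Hypothetical ratings for the standards of a single box.
example = [Rating.VERY_GOOD, Rating.ADEQUATE, Rating.DOUBTFUL, Rating.VERY_GOOD]
print(overall_box_rating(example).name)  # DOUBTFUL
```

A single 'doubtful' standard is thus enough to cap the rating of the whole box, regardless of how many standards were rated 'very good'.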

Assessment of the quality of measurement properties
The quality of measurement properties was assessed by one reviewer, and a second reviewer evaluated 20% of the included data for quality assurance purposes. The result of each single study on a measurement property was evaluated against the criteria for good measurement properties, and rated as either sufficient (+), insufficient (−), or indeterminate (?) (Appendix 3). We further summarized the quality of the evidence per measurement property per PROM, and the summarized results were also rated against the criteria for good measurement properties. Additionally, we extracted data on interpretability and feasibility of the PROMs. These aspects are not formally evaluated by the COSMIN tools, but are viewed as important considerations for the practical use of a measurement instrument (see [14] for details).

Grading the evidence
The quality of evidence of the summarized results was graded using the Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach [14]. In case of concerns regarding the trustworthiness of a result, the quality of evidence is downgraded per measurement property per PROM. Downgrading was possible due to risk of bias (methodological quality of studies assessed by the RoB checklist), inconsistency (unexplained inconsistency of results across studies), imprecision (total sample size of available studies), and/or indirectness (evidence from different populations than the population of interest). The quality of evidence was rated as either high, moderate, low, or very low. We did not grade the quality of evidence if an overall rating was indeterminate or inconsistent. To generate recommendations for use of the identified PROMs, we categorized each instrument as follows [7]:
A. PROMs with evidence for sufficient content validity (any level) and at least low-quality evidence for sufficient internal consistency.
B. PROMs categorized not in A or C.
C. PROMs with high-quality evidence for an insufficient measurement property.
PROMs of category A can be recommended for use, while PROMs of category B have the potential to be recommended for use, but require further validation. PROMs of category C should not be recommended for use.
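The categorization rule can be sketched as a small decision function. This is an illustrative Python sketch only: the function and parameter names are hypothetical, and checking category C before category A is an assumption about precedence that the category definitions above do not state explicitly.

```python
from typing import Optional

# Evidence levels that count as "at least low-quality" evidence.
AT_LEAST_LOW = ("low", "moderate", "high")


def recommendation_category(
    content_validity: str,                 # "+", "-" or "?"
    internal_consistency: str,             # "+", "-" or "?"
    internal_consistency_evidence: Optional[str],  # GRADE level or None if not graded
    high_quality_insufficient_property: bool,
) -> str:
    """Assign category A, B or C following the definitions given above."""
    # Assumption: high-quality evidence for an insufficient property overrides A.
    if high_quality_insufficient_property:
        return "C"
    meets_a = (
        content_validity == "+"
        and internal_consistency == "+"
        and internal_consistency_evidence in AT_LEAST_LOW
    )
    return "A" if meets_a else "B"


# Hypothetical inputs, not the ratings reported in this review.
print(recommendation_category("+", "+", "low", False))   # A
print(recommendation_category("+", "?", None, False))    # B
print(recommendation_category("+", "+", "high", True))   # C
```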

Literature search
The results of our literature search are displayed in Fig. 1.
For data extraction, we included seven studies reporting on four different PROMs. Two studies reported on the Dysmenorrhea Daily Diary (DysDD) [15,16], and one study each reported on the Exercise of Self-Care Agency Scale (ESCAS) [17], the Adolescent Dysmenorrhic Self-Care Scale (ADSCS) [18], and the Dysmenorrhea Symptom Interference Scale (DSI) [19]. The studies on the ESCAS and the ADSCS referred to the respective development study [20,21], which we retrieved and considered for evaluation of the content validity of these instruments.
Additionally, we considered a review of self-reported pain and symptom measures for PDys [22], and evaluated the included tools regarding eligibility. The identified instruments did not meet our predefined criteria and were excluded (Appendix 4). The update of our literature search did not yield new eligible studies.

Characteristics of the included PROMs and study populations
Details of the included PROMs and study populations are presented in Tables 2 and 3. The purpose of the ESCAS and ADSCS is to assess self-care behavior using 43 and 35 items, respectively, which are rated on a 5-point (ESCAS) and 6-point (ADSCS) Likert scale. The DSI measures the impact of PDys on physical activities, sleep, daily activities, work, leisure and social activities, and mood. The instrument comprises nine items, which are rated on a five-point Likert scale, with one version each for on-menses and off-menses using different recall periods (last 24 h vs. last menstrual period). The DysDD is conceptualized as a daily diary aiming to assess menstrual bleeding, pelvic pain, use of rescue medication, and impact of pelvic pain or cramps on daily life using 10 items, which are scored independently on different scale formats.
The sample sizes of the included studies ranged from 24 to 686 patients, and the overall age range was 13-49 years.

Information on interpretability and feasibility
No data regarding interpretability and feasibility were reported for the ESCAS. For the on-menses and off-menses versions of the DSI, distribution-based minimal important difference (MID) estimates ranging from 0.27 to 0.36 were reported. Further, the anchor-based estimate was 0.28 for minimally important improvement and 0.18 for minimally important worsening. For the DysDD, data on the distribution of scores in the study population, missing data, and data on MID were provided. Within the framework of the development study [15], preliminary quantitative analyses were conducted showing a good distribution of responses with no major ceiling or floor effects and all response options utilized. Subsequent validation analyses [16] revealed that the items showed a good distribution of responses across response options at baseline. Furthermore, the majority of responses on day −1 (the day before the initiation of menstrual bleeding) and on day 3 were concentrated at the lower end of the scale, whereas the responses on days 1 and 2 were grouped toward the higher end of the scale. At treatment cycle 2, the response distributions were comparable with baseline scores, with a general trend to show slightly lower scores, which was accompanied by lower mean scores for rescue medication items. All items of the DysDD showed floor effects at day −1, and the majority of items (items 3, 6, 8, and 9) then showed ceiling effects over days 1-2. The item assessing impact on sleep (item 10) did not show any ceiling effects, but floor effects on days −1, 1, and 3.
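For readers unfamiliar with floor and ceiling effects, the following Python sketch shows one common way to flag them. The 15% threshold is an assumption borrowed from a widely used convention; the included studies may have applied a different criterion, and the scores below are invented for illustration.

```python
def floor_ceiling_effects(scores, min_score, max_score, threshold=0.15):
    """Flag floor/ceiling effects for one item.

    An effect is flagged when the share of respondents at the lowest
    (floor) or highest (ceiling) possible score exceeds the threshold.
    The default of 0.15 is an assumed convention, not a value from the studies.
    """
    if not scores:
        raise ValueError("no scores provided")
    n = len(scores)
    floor_share = sum(s == min_score for s in scores) / n
    ceiling_share = sum(s == max_score for s in scores) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "floor_effect": floor_share > threshold,
        "ceiling_effect": ceiling_share > threshold,
    }


# Invented responses on a 0-10 scale, e.g. on a day with little pain.
result = floor_ceiling_effects([0, 0, 0, 1, 2, 3, 5, 10, 0, 4], 0, 10)
print(result["floor_effect"], result["ceiling_effect"])  # True False
```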
Analyses of missing data revealed that four participants (17%) missed one or more days of completing the DysDD during the pilot test. In the validation study, only women with complete data were included, and missing data were not imputed or carried forward for validation analyses. With respect to MID, analyses indicate that a change of three points on the pelvic pain score (score range 0-10) can be considered clinically meaningful. For all included PROMs, no data were available regarding scores and change scores for relevant subgroups or regarding response shift.
Concerning feasibility, no study reported difficulties regarding patients' comprehension or the administration of the PROM. Pretesting the ADSCS showed that it took 5-10 min to complete the questionnaire. The DysDD was administered as an eDiary using a hand-held, electronic, touch-screen device. In the pilot test, participants found the format and functionality of the eDiary device easy to use and to incorporate into their daily lives [15]. Information on access to all identified PROMs is given in Appendix 5.

Measurement properties of instruments
When evaluating the quality of the included studies using the COSMIN Risk of Bias checklist, the reviewers had a mean agreement of 81.4% across all studies. Major disagreements were resolved by discussion with a third reviewer having expertise with the COSMIN methodology.
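As a sketch of how such an agreement figure might be obtained, the Python example below computes simple percent agreement per study and averages it across studies. This is an assumption about the calculation, which is not detailed in the text; the function names and ratings are hypothetical.

```python
def percent_agreement(ratings_a, ratings_b):
    """Percentage of standards on which two reviewers gave the same rating."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("both reviewers must rate the same standards")
    if not ratings_a:
        raise ValueError("no ratings provided")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * matches / len(ratings_a)


def mean_agreement(per_study_agreement):
    """Unweighted mean of the per-study agreement percentages."""
    return sum(per_study_agreement) / len(per_study_agreement)


# Invented ratings of five standards in one study.
reviewer_1 = ["very good", "adequate", "doubtful", "inadequate", "adequate"]
reviewer_2 = ["very good", "adequate", "adequate", "inadequate", "adequate"]
print(percent_agreement(reviewer_1, reviewer_2))  # 80.0
```

Percent agreement does not correct for chance agreement; a chance-corrected statistic such as Cohen's kappa would be an alternative, but is not what the text reports.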

Evaluation of content validity
The overall ratings of the PROM development and content validity studies are displayed in Appendix 6. The development study of the ESCAS [20] was rated 'inadequate' since the instrument was not developed for the target population. The content validity study [17] received a 'doubtful' rating because detailed information about different aspects of the procedure was not provided. The development study of the ADSCS [21] was rated 'doubtful' due to methodological weaknesses regarding the collection and analysis of qualitative data for PROM design, and due to methodological weaknesses of the pilot test. Likewise, the content validity study of the ADSCS [18] was rated 'doubtful' because details of the methodological approach were not described. The development study of the DSI [19] [...] population was not involved in the design of the instrument. Due to methodological shortcomings when asking patients about relevance, the content validity study of the DSI [19] was rated 'doubtful.' The development study of the DysDD [15] received a 'doubtful' rating since the qualification of the interviewers was not described. For the DysDD, a content validity study was not performed. The overall content validity rating per PROM and the evaluation of the quality of evidence are displayed in Appendix 7.
The content validity of the ESCAS was rated 'indeterminate,' and we therefore did not assess the quality of evidence. The ADSCS and the DSI showed sufficient content validity, and the quality of evidence was rated 'moderate' since at least one content validity study of doubtful quality was available for each instrument. The DysDD also showed sufficient content validity, but the quality of evidence was rated 'low' because only a PROM development study of 'doubtful' quality was available, and a content validity study was not performed.
As we found no high-quality evidence for insufficient content validity of any PROM, we subsequently assessed the remaining measurement properties of each PROM.

Evaluation of the remaining measurement properties
The results of the evaluation of the quality of studies on measurement properties and the rating of the methodological quality of the instruments are displayed in Table 4. Based on the five validation studies available in total, the methodological quality of 26 single studies on measurement properties was evaluated. No study analyzed cross-cultural validity/measurement invariance, measurement error, or criterion validity. Regarding the ADSCS, it is important to note that the development study resulted in a 40-item questionnaire, for which structural validity, internal consistency, and hypotheses testing were assessed [21]. Evaluating these measurement properties showed sufficient structural validity, but insufficient internal consistency, and sufficient construct validity (data not shown). In the validation study [18], the instrument was revised, resulting in a 35-item version, for which we analyzed and report the psychometric properties. The summarized results per PROM and measurement property are depicted in Table 5.

Recommendation
The ADSCS and the on-menses version of the DSI were placed into category A (Table 6).The ESCAS and the DysDD were placed into category B, and the DSI off-menses version was placed into category C.

Discussion
This systematic review provides a synthesized evaluation of the quality of PROMs for PDys applying the COSMIN methodology. Among the four identified instruments, the ADSCS and the on-menses version of the DSI can be recommended for use in future research (COSMIN category A). We further found that the ESCAS and the DysDD have the potential to be recommended, but require further validation (COSMIN category B). The off-menses version of the DSI cannot be recommended for use (COSMIN category C). The identified PROMs address different outcomes, which is of importance for their application in research and clinical care.
The classification of a PROM into a recommendation category according to the COSMIN methodology is based on the evaluation of content validity and internal consistency. Although the ADSCS and the on-menses version of the DSI meet the requirements for a recommendation according to these criteria, significant evidence gaps remain. All included PROMs show substantial conceptual and methodological flaws, which need to be discussed.
The ESCAS was developed to measure a person's exercise of self-care agency based on Orem's self-care deficit nursing theory [23]. Most importantly, the ESCAS is a generic instrument for the assessment of self-care ability, and it was not designed for use in women with PDys. Due to methodological weaknesses of the validation study, which was performed in adolescent girls with PDys [17], we could not determine the content validity, and structural validity and internal consistency could likewise not be evaluated since the required data were not reported. Extending their work on the ESCAS, Wong and colleagues have translated and validated the ADSCS [18], which also aims to assess self-care behavior of adolescent girls with PDys. The development of the ADSCS involved a sample of the target population, and a cognitive interview study was also performed [21]. Data from the subsequent translation and validation study [18] showed sufficient content validity and sufficient internal consistency, indicating that the instrument can be recommended for use. However, since patients were not asked about comprehensiveness in the development phase, and relevance and comprehensiveness were not assessed from the patients' perspective, further content validity assessments are indicated.
Applying the COSMIN criteria further suggests that the ESCAS has the potential to be recommended for use. Nevertheless, in view of its substantial methodological weaknesses and the availability of the ADSCS measuring the same construct with sufficient measurement properties, we advise against further validation of the ESCAS and consider the ADSCS the preferred measure of self-care behavior in PDys.
The DSI, which measures the impact of PDys on various outcomes, is available as an on-menses version with a 24-h recall period, and as an off-menses version referring to the last menstrual period [19]. We found sufficient content validity and sufficient internal consistency of both versions, indicating that the instrument can potentially be recommended for use. Concerning aspects of feasibility, data on MID are available, which is important for the application of the instrument by researchers and clinicians. Notably, the DSI was developed based solely on research literature, and a sample representing the target population was not involved in the design of the instrument. As patient participation is considered a major quality criterion for PROM development [24], the DSI is of insufficient quality in this regard. Moreover, construct validity is a concern for the off-menses version. Construct validity was determined by examining correlations of symptom interference with menstrual pain severity, perceived stress, and sleep disturbance referring to the last 24 h and to the last menstrual period for the on- and off-menses versions, respectively. For the off-menses version, the observed correlations were not in accordance with the predefined hypotheses, which might be related to recall bias resulting from a potentially too long recall period. Consequently, the construct validity of the DSI off-menses version was rated 'insufficient,' and this version cannot be recommended for use.
Capturing the broadest spectrum of outcomes, the DysDD [16] was found to have sufficient content validity. Meeting the scientific and regulatory requirements for PROM development [25], the instrument was developed based on thorough concept elicitation and comprehensive qualitative assessments in the target population, and a cognitive interview study was also performed. However, as data from content validity studies are not available, the content validity of the DysDD was rated solely by the reviewers, resulting in low quality of evidence. These findings indicate that studies on the content validity of the DysDD are highly recommended. Another shortcoming of the DysDD concerns the lack of data regarding structural validity and internal consistency. Furthermore, while intra-cycle (within menstrual cycle) reliability was sufficient, we found insufficient inter-cycle (between menstrual cycles) reliability. Concerning inter-cycle reliability, it might be argued that the 60 days between baseline and treatment cycle 2 may have been too long, and that the results for intra-cycle reliability can be considered more indicative of the true reliability. For this reason, we decided not to consider the insufficient reliability between menstrual cycles when deriving a recommendation for use, but stress that sufficient reliability of the DysDD is only given when administered within the menstrual cycle. Regarding aspects of feasibility, the DysDD was administered as an eDiary in the validation study, suggesting that the tool has the potential to be used by physicians in daily practice and by researchers in studies involving women with PDys. Underlining the usefulness of the DysDD, data on MID indicate that a change of three points in the pelvic pain score can be considered clinically meaningful.
Taken together, our evaluation revealed that the ADSCS can be recommended as a PROM for the assessment of self-care behavior of adolescent girls with PDys, but requires further content validity assessments. Regarding measures of impact, the on-menses version of the DSI can be recommended for use, while the DysDD does not currently fulfill the COSMIN criteria for a recommendation. However, given the intensive work on scale development and testing during the PROM design phase and the broad spectrum of outcomes covered, the DysDD appears promising for use in medical care and research, encouraging further investigations. Overall, the insufficient construct validity of the DSI off-menses version and the insufficient inter-cycle reliability [...] patients' perspective, which is relevant for both patient care and research. For this purpose, the DysDD including daily assessments might also be a suitable tool, but requires further validation.

Strengths and limitations
Strengths of the present work encompass the application of an established comprehensive and sensitive search filter, which was not restricted by publication year or language. To capture all potentially relevant outcomes, our search strategy included any PROMs for women with PDys. Our literature search was carried out in the two major databases PubMed and Web of Science, and we additionally searched the reference lists of the included studies for relevant articles. Moreover, we contacted the authors of the included studies to obtain further information on research activities regarding PROMs for PDys. Notably, due to the methodology applied in the present systematic review, only PROMs for which validation studies were available could be considered. A limitation may arise from the fact that we did not search all reference lists of relevant full texts for further eligible studies, and that further databases such as Scopus, Embase, or PsycINFO were not considered. However, in the biomedical field, PubMed is considered the leading database [26].

Conclusions
We identified four PROMs for use in women with PDys focusing on various outcomes. According to COSMIN criteria, the ADSCS can be recommended for the assessment of self-care behavior of adolescent girls with PDys.
To measure the impact of PDys symptoms on women's daily activities, the on-menses version of the DSI can be recommended. Although both instruments showed sufficient content validity, major shortcomings concern the deficient patient involvement in the content validity study of the ADSCS, and the lack of patient engagement in the design of the DSI, indicating the need for further content validity studies. Applying the criteria of the FDA for the evaluation of PROMs, which require patient involvement in the item generation phase [25], the DSI would not be accepted as a measure for endpoints in clinical trials. The DysDD has the potential to be recommended for use, but further validation studies assessing content validity and structural validity are required.

Fig. 1
Adapted Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 flow diagram

Table 1
Inclusion and exclusion criteria

Table 2
Characteristics of the included instruments

Table 4
Quality of studies on measurement properties and methodological rating of the instruments
a No study has analyzed cross-cultural validity/measurement invariance, measurement error, or criterion validity
b Rating: (+) sufficient, (−) insufficient, (?) indeterminate
ADSCS Adolescent Dysmenorrhic Self-Care Scale, ESCAS Exercise of Self-Care Agency Scale, DSI Dysmenorrhea Symptom Interference Scale, DysDD Dysmenorrhea Daily Diary, PROM patient-reported outcome measure

[...] reported outcome measures. All other authors declare that they have no conflict of interest.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.