Introduction

Chronic prostatitis/chronic pelvic pain syndrome (CP/CPPS) is a common disorder among men [1]. It is defined as chronic pelvic pain not caused by other identifiable pathologies and is often characterized by urogenital pain, lower urinary tract symptoms, psychological issues, and sexual dysfunction [2, 3]. Men of all ages and races may experience prostatitis, with a worldwide prevalence of 2% to 10% [4]. CP/CPPS causes morbidity through both its symptoms and the associated impairment in health-related quality of life, which illustrates the importance of patient-centered outcomes. Moreover, CP/CPPS is a poorly defined clinical entity and is therefore prone to misdiagnosis, mistreatment, and mismanagement [5]. The lack of a systematized and universally accepted outcome measure has led to inconsistent and vague results in CP/CPPS studies, made patient evaluation a challenge, and hindered research and clinical efforts to help patients with CP/CPPS. In response, the National Institutes of Health (NIH) Chronic Prostatitis Collaborative Research Network (Litwin and co-workers) developed the NIH Chronic Prostatitis Symptom Index (NIH-CPSI) in 1999 to accurately assess the extent of CPPS, to objectively measure symptoms in natural history studies, and to assess outcome parameters in clinical trials [6].

The NIH-CPSI, a self-administered questionnaire, has nine items divided into three domains: pain or discomfort (total score ranging from 0 to 21), urinary symptoms (total score ranging from 0 to 10), and impact on quality of life (QOL) (total score ranging from 0 to 12) [6]. It is used as a tool for the diagnosis and follow-up of CP/CPPS. In previous studies, the NIH-CPSI was shown to be reliable, valid, and responsive to change [7,8,9,10]. Pain scores of perineal or ejaculatory discomfort ≥ 8 are good predictors of moderate to severe CP/CPPS [11]. The scale has also been used by English speakers with different cultural backgrounds, such as Australian, Malaysian, and Spanish, and was found to have good concurrent validity and discriminant validity [12,13,14].
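The three domain ranges imply a total score between 0 and 43. The following is a minimal sketch of that arithmetic, assuming the domain scores have already been computed; the function name and the range table are illustrative only, and the item-level scoring rules of the questionnaire are not reproduced here.

```python
# Domain score ranges as stated in the text: (minimum, maximum).
DOMAIN_RANGES = {
    "pain": (0, 21),
    "urinary": (0, 10),
    "qol": (0, 12),
}


def nih_cpsi_total(pain, urinary, qol):
    """Sum the three domain scores after checking that each is in range.

    Hypothetical helper: it only validates and adds domain scores.
    """
    scores = {"pain": pain, "urinary": urinary, "qol": qol}
    for name, value in scores.items():
        low, high = DOMAIN_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name} score {value} is outside {low}-{high}")
    return sum(scores.values())


# The maximum possible total is 21 + 10 + 12.
max_total = nih_cpsi_total(21, 10, 12)
```

Rejecting out-of-range domain scores, rather than clipping them, is a design choice of this sketch so that data-entry errors surface early.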

The NIH-CPSI was originally written in English but has since been translated into many other languages, including Arabic and Chinese. Because of cultural differences, a simple translation of the original version of a questionnaire does not guarantee similar measurement properties, and rough translations may lead to construct bias, method bias, and item bias, all of which affect the validity of cross-cultural comparisons [15, 16]. Whether the cross-cultural adaptations of the NIH-CPSI have reliability and validity similar to those of the original edition is still uncertain. Therefore, a systematic review of the quality of the cross-cultural adaptations of the NIH-CPSI is necessary.

Materials and methods

Study selection

Articles were searched in PubMed, Embase, CINAHL, and SciELO from the inception of each database to September 2020. The search terms included “National Institutes of Health Chronic Prostatitis Symptom Index”, “NIH-CPSI”, “NIH Chronic Prostatitis Symptom Index”, “cross-cultural”, “equivalence”, “translation”, “validation”, and “adaptation”. Additional comprehensive hand searching of journals, reference lists, conference papers, and textbooks related to the NIH-CPSI was performed. There was no language restriction.

Inclusion and exclusion criteria

The following studies were included:

  (1) Studies related to the development of a cross-cultural adaptation of the NIH-CPSI;

  (2) Studies reporting the process of cross-cultural adaptation;

  (3) Studies on the quality assessment of at least one measurement property of a cross-cultural adaptation;

  (4) Other validation studies from different English-speaking societies.

Authors were contacted by email to request publications whose full text was not freely available. Studies not reporting the detailed adaptation process were excluded. Two reviewers independently checked the potentially relevant studies against the inclusion and exclusion criteria and selected eligible studies. Any disagreement was resolved through discussion with a third reviewer.

Data extraction and quality assessment

The language, population, publication year, and other related information about the studies were extracted by two independent reviewers using a predefined form. A third reviewer then verified the information.

Quality assessment was performed independently by two reviewers. The results were adopted only if the weighted kappa (κ) exceeded 0.75. Any disagreement was resolved by consensus; if consensus could not be reached, a third reviewer decided the result.
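The agreement check can be illustrated with a short sketch of a weighted kappa. The text does not state the weighting scheme or the rating categories, so the linear weights, the ordered categories 0, 1, 2, and the ratings below are all assumptions made for illustration.

```python
def weighted_kappa(ratings_a, ratings_b, categories):
    """Linear-weighted Cohen's kappa for two raters on ordered categories.

    Assumption: linear disagreement weights |i - j| / (k - 1).
    """
    k = len(categories)
    n = len(ratings_a)
    if k < 2 or n == 0 or n != len(ratings_b):
        raise ValueError("need two equally long rating lists and >= 2 categories")
    index = {category: i for i, category in enumerate(categories)}

    # Observed joint proportions of (rater A category, rater B category).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Weighted observed and chance-expected disagreement.
    d_obs = 0.0
    d_exp = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)
            d_obs += weight * observed[i][j]
            d_exp += weight * row[i] * col[j]
    if d_exp == 0.0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - d_obs / d_exp


# Hypothetical ratings of six items by two reviewers.
reviewer_1 = [2, 1, 0, 2, 1, 0]
reviewer_2 = [2, 1, 0, 2, 0, 0]
kappa = weighted_kappa(reviewer_1, reviewer_2, categories=[0, 1, 2])
results_adopted = kappa > 0.75  # threshold stated in the text
```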

The translation and cross-cultural adaptation methods of each study were classified according to the Guidelines for the Process of Cross-Cultural Adaptation of Self-Report Measures [17]. First, independent initial translations should be performed by a translator who is familiar with the field of medicine and by another translator with a non-medical background. These two independently performed translations (T1 and T2) should be synthesized (T1-2). Next, two different translators who are native English speakers and unfamiliar with the outcome measurement tool should independently provide a back translation into English (B1 and B2). Then, an expert committee made up of methodologists, health professionals, language professionals, and translators should review the original questionnaire as well as each translation (T1, T2, T1-2, B1, and B2). This committee agrees on any changes needed to adapt the tool and creates a prefinal version of the questionnaire. The prefinal version should then be tested with 30 to 40 patients from the target setting. This phase is important for assessing the comprehensibility, acceptability, and emotional impact of the questionnaire items and for detecting items that are confusing or misunderstood. Finally, the final version of the questionnaire is appraised again by the expert committee, which should unanimously approve it. These procedures are described in more detail in Additional file 1: Table S1.

The measurement properties were assessed according to the Quality Criteria for Psychometric Properties of Health Status Questionnaire, which focus on the assessment of psychometric properties [18]. The evaluation in this study covered content validity, construct validity, internal consistency, criterion validity, concurrent validity, discriminant validity, agreement, reliability, responsiveness, ceiling and floor effects, and interpretability. Content validity received a positive rating when the measurement aim, the concept to be measured, the item selection, and the involvement of the target population were clearly described. Internal consistency received a positive rating when factor analysis was applied, Cronbach’s α was between 0.70 and 0.95, and the sample size was greater than 7 times the number of items with a minimum of 100 subjects. Although CP is a common complaint, no gold standard exists; therefore, all the adaptations lacked criterion validity. Reliability refers to the extent to which patients can be distinguished from each other despite measurement error (relative measurement error). Generally, an intraclass correlation coefficient (ICC) of > 0.7 is recommended as a minimum standard for reliability [18]. These procedures are described in more detail in Additional file 2: Table S2.
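The internal consistency criterion can be sketched in code. The example below computes Cronbach's α with the standard formula and then applies the rating rule described above; the function names and the sample data are hypothetical, and the handling of the exact boundary values is an assumption.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha; item_scores holds one list of item scores per respondent."""
    n_items = len(item_scores[0])
    columns = list(zip(*item_scores))
    # Sum of the sample variances of the individual items.
    item_variance_sum = sum(variance(column) for column in columns)
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in item_scores])
    return n_items / (n_items - 1) * (1 - item_variance_sum / total_variance)


def internal_consistency_positive(alpha, n_subjects, n_items, factor_analysis_done):
    """Positive rating rule as described in the text (boundaries assumed)."""
    adequate_sample = n_subjects > 7 * n_items and n_subjects >= 100
    return factor_analysis_done and adequate_sample and 0.70 <= alpha <= 0.95


# Hypothetical data: three respondents, three perfectly consistent items.
alpha_demo = cronbach_alpha([[1, 1, 1], [2, 2, 2], [3, 3, 3]])

# Hypothetical study: nine items, 120 subjects, factor analysis performed.
rating = internal_consistency_positive(0.85, n_subjects=120, n_items=9,
                                       factor_analysis_done=True)
```

With nine items, 7 times the number of items is 63, so the minimum of 100 subjects is the binding requirement in this sketch.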

Results

A total of 132 studies were identified in the initial search. Of these, 56 publications were excluded as duplicates, and 35 were not relevant. Further, 12 studies were intervention trials, and 4 were reviews. The remaining 25 studies were identified as potentially relevant publications after screening of titles and abstracts (Fig. 1).

Fig. 1 Flowchart of the search process of this review

After full-text selection, two studies were excluded because they were follow-up research on the original NIH-CPSI that did not test measurement properties [19, 20], and two because they were reviews of the NIH-CPSI [7, 21]. Among the 21 studies included, 16 cross-cultural adaptations of the NIH-CPSI covered 15 different languages/cultures [13, 14, 22,23,24,25,26,27,28,29,30,31,32,33,34,35]; in addition, four studies of the original version in the US, one in Australia, one in Spanish, and one in Malaysian were included [7, 9,10,11,12,13,14]. There were two adaptations in Japanese [31, 32], and two studies were performed on the German adaptation (German (1,2)) [28, 29] (Table 1).
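The selection counts can be checked with simple arithmetic. The sketch below uses only numbers and reference labels quoted in the text; treating the bracketed reference numbers as sets, so that references cited in both groups are counted once, is an interpretation made here and is not stated by the authors.

```python
# Counts copied from the text.
identified = 132
excluded_at_screening = {
    "duplicates": 56,
    "not relevant": 35,
    "intervention trials": 12,
    "reviews": 4,
}
potentially_relevant = identified - sum(excluded_at_screening.values())

excluded_at_full_text = {"measurement properties not tested": 2, "reviews": 2}
included = potentially_relevant - sum(excluded_at_full_text.values())

# Reference numbers cited for the adaptations and for the original-version studies.
adaptation_refs = {13, 14} | set(range(22, 36))
original_refs = {7} | set(range(9, 15))
# References 13 and 14 appear in both groups, so the union counts them once.
distinct_refs = adaptation_refs | original_refs

print(potentially_relevant, included, len(distinct_refs))  # 25 21 21
```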

Table 1 Description of cross-cultural adaptations for the National Institutes of Health Chronic Prostatitis Symptom Index

The sample size of the studies on validity ranged from 30 to 434, but none of them reported a sample size calculation. The Original-American, Chinese-Malaysian, Japanese (a), Japanese (b), and Malaysian adaptations enrolled patients with CP/CPPS, patients with benign prostatic hyperplasia (BPH), and healthy controls [9, 13, 31, 32]. The Arabic-Egyptian, Chinese-Mainland, Estonian, Finnish, French-Canadian, Italian, Persian, and Turkish adaptations enrolled patients with CP/CPPS and healthy controls, but not patients with BPH [22, 23, 25,26,27, 30, 33, 35]. The Original-American, Danish, German (1,2), Portuguese-Brazilian, and Spanish adaptations included only patients with CP/CPPS [7, 11, 14, 24, 28, 29, 34]. More than half of the adaptations included consecutive patients [14, 22, 23, 26,27,28,29, 33, 34].

Quality assessment of the cross-cultural adaptations of the NIH-CPSI

The adaptation process was assessed by two independent reviewers, who achieved a κ of 0.876. Whenever the reviewers disagreed, a consensus was reached (100% of occasions) (Table 2).

Table 2 Quality assessment of the process for cross-cultural adaptations of the National Institutes of Health Chronic Prostatitis Symptom Index

Most adaptations reported forward translation. Only three adaptations completely met the requirement that the forward translation be completed by one translator with a medical background and another with no medical background [22, 30, 34]. Both translators of the Chinese-Mainland, Estonian, Finnish, French-Canadian, Japanese (a), and Japanese (b) adaptations had a medical background and were aware of the concepts being examined in the questionnaire [23, 25,26,27, 31, 32]. In contrast, none of the translators of the Persian and Spanish adaptations were familiar with the NIH-CPSI [14, 33]. The Chinese-Malaysian, Danish, German (1,2), Malaysian, and Turkish adaptations did not explain the specific background of the translators [13, 24, 28, 29, 35].

Most of the adaptations introduced the synthesis stage of translation, and met the requirements of the synthesis process.

In this review, only the Danish, Portuguese-Brazilian, and Turkish adaptations completed back translation in full [24, 34, 35]. Most of the adaptations had only one back translator [13, 14, 22, 25,26,27,28,29,30, 32, 33]. In addition, the Chinese-Mainland adaptation did not report whether the back translators were native English speakers [23], and the Japanese (a) adaptation provided no information about back translation [31].

Only the Finnish, Italian, Portuguese-Brazilian, and Spanish adaptations met the standard for the composition of the expert committee [14, 26, 34]. The German (1,2), Persian, and Turkish adaptations did not explain the specific composition of this committee [28, 29, 33, 35]. The Arabic-Egyptian adaptation enrolled only language professionals, and the Estonian adaptation only clinicians [22, 25]; the Danish adaptation did not enroll methodologists [24]. No information was found in the other adaptations [13, 23, 27, 31, 32].

The final stage of the adaptation process was the pretest. Only three adaptations met the requirements [14, 25, 34]. Too few patients were enrolled to test the prefinal versions of the French-Canadian, Danish, and Turkish adaptations [24, 27, 35]. The Finnish and Italian adaptations did not report the sample size of patients [26, 30]. The others lacked information about this process [13, 22, 23, 28, 29, 31,32,33]. All adaptations submitted a final version.

Methodology used for property measurement

The κ between the two reviewers was 0.869. The methodological quality and the measurement properties of the studies are provided in Table 3. All studies clearly described the content validity in the development of the questionnaire.

Table 3 The measurement properties of the National Institutes of Health Chronic Prostatitis Symptom Index adaptations according to the Quality Criteria for Psychometric Properties of Health Status Questionnaire

The original American version was found to have adequate content validity, construct validity, internal consistency, test-retest reliability, responsiveness, discriminant validity, and interpretability [7, 9,10,11]. Ceiling effects (20.70%) were reported for the pain dimension of the Original-American NIH-CPSI [9]. For the original version in Australia, only interpretability was checked, and it was good [12]. The Original-Malaysian version reported unsatisfactory internal consistency, whereas the Original-Spanish version reported the opposite; however, because fewer than 50 patients were included, the results were not convincing.

Construct validity was assessed in 10 studies, and five of them met the standard [9, 25, 26, 28, 29, 35]. The Original-American, Estonian, Finnish, and German (1,2) versions used Pearson’s r correlation [9, 25, 26, 29], and the Turkish version applied Spearman’s r correlation [35]. A negative rating was given by the reviewers for the construct validity of the Original-Spanish, French-Canadian, Japanese (a), Persian, and Spanish adaptations because their sample sizes were smaller than 100 [14, 27, 31, 33].

Internal consistency was analyzed in most of the adaptations, but only four of them met the standard [9, 25, 28, 29, 35]. A negative rating was given by the reviewers for the internal consistency of the Arabic-Egyptian, French-Canadian, Italian, Japanese (b), Portuguese-Brazilian, Malaysian, Persian, and Spanish adaptations because their sample sizes were smaller than 100 [13, 14, 22, 27, 30, 32,33,34]. The Chinese-Mainland and Chinese-Malaysian adaptations did not fully meet the criteria for internal consistency because factor analysis was missing [13, 23, 28]. No information was available on the internal consistency of the Finnish and Danish adaptations [24, 26]. Only half of the adaptations reported a Cronbach’s α of more than 0.70 [14, 22, 25, 27, 32,33,34,35]. The details are shown in Table 4.

Table 4 Summary of the measurement properties of the cross-cultural adaptations of the National Institutes of Health Chronic Prostatitis Symptom Index

The Finnish, Italian, and Turkish adaptations showed a good correlation with the visual analogue scale or the International Prostate Symptom Score, and the Original-American version showed a good correlation with the American Urological Association symptom index [9, 26, 30, 35]. A positive rating for concurrent validity was given only to adaptations enrolling at least 50 patients; the Chinese-Malaysian, French-Canadian, Japanese (a), Malaysian, and Spanish adaptations did not have 50 patients [13, 14, 27, 31]. The others did not report concurrent validity.

Only two publications reported discriminant validity following the guidelines [13]. The discriminant validity between the CP/CPPS group and each of the control groups was assessed by calculating the area under the receiver operating characteristic curve (AUC). The Original-Malaysian, Chinese-Malaysian, and Malaysian NIH-CPSI reported that the AUC for CP/CPPS versus healthy individuals was more than 0.80. For CP/CPPS versus BPH, the AUC of the Original-American NIH-CPSI was 0.67, whereas the Original-Malaysian and Malaysian versions reported good discriminant validity, with an AUC of more than 0.75, except for voiding [11, 13]. Other studies reported only the P value of the difference between CPPS and controls or BPH [13, 22, 23, 25,26,27,28,29,30,31,32, 35].
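For readers unfamiliar with the AUC, the following sketch computes it as the proportion of case-control pairs in which the case scores higher, with ties counted as one half. This pairwise formulation is one standard way to obtain the AUC and is not necessarily the procedure used in the included studies; the scores are invented for illustration.

```python
def auc(case_scores, control_scores):
    """AUC as the probability that a random case outscores a random control."""
    if not case_scores or not control_scores:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5  # ties contribute half a win
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical total scores for a CP/CPPS group and a control group.
cpps_scores = [25, 18, 30, 22, 12]
control_scores = [5, 9, 14, 3, 7]
discrimination = auc(cpps_scores, control_scores)
```

An AUC of 0.5 corresponds to no discrimination between the groups and an AUC of 1.0 to perfect separation.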

Only nine adaptations reported test-retest reliability, and only the Original-American, Arabic-Egyptian, Chinese-Mainland, Danish, Italian, Persian, and Turkish adaptations met the criterion [22,23,24, 30, 33, 35]. The sample size for reliability should be at least 50 patients, whereas the Chinese-Malaysian, French-Canadian, Japanese (b), Malaysian, and Portuguese-Brazilian adaptations enrolled fewer than 50 patients [13, 27, 32, 34].

Most adaptations reported interpretability, except for the Danish, Portuguese-Brazilian, and Spanish adaptations [14, 24, 34]. Only the Danish adaptation reported agreement [24].

Other measurements such as responsiveness, and floor and ceiling effects were not reported in any of the adaptations.

Discussion

Summary of evidence

The objective of this study was to assess the cross-cultural adaptation procedures and the measurement properties of each adaptation of the NIH-CPSI. Back translation was the weakest step in the cross-cultural adaptation process of the NIH-CPSI, mainly because most of the adaptations used only one back translator. Internal consistency was analyzed in most of the adaptations, but only a few of them met the standard. Only 11 adaptations reported test-retest reliability, and only the Arabic-Egyptian, Chinese-Mainland, Danish, Italian, Persian, and Turkish adaptations met the criterion. Most adaptations reported interpretability, but only the Danish adaptation reported agreement. Several other measurement properties, including responsiveness and floor and ceiling effects, were not reported.

The overall quality of the NIH-CPSI cross-cultural adaptations was unsatisfactory. Only the Italian, Portuguese-Brazilian, and Spanish adaptations showed better quality than the other adaptations in the assessment of the adaptation process [14, 30, 34]. Only the Turkish adaptation completed half of the measurement properties [35]. Many standards have been developed to measure the cross-cultural reliability of questionnaires, such as the guidelines for the process of cross-cultural adaptation of self-report measures in 2000 [17], the Consensus-based Standards for the selection of health status Measurement Instruments (COSMIN) checklist in 2016 [36, 37], the Patient-Reported Outcomes in 2005 [38], and the Scientific Advisory Committee of the Medical Outcome Trust checklist in 1996 [39]. However, the Danish adaptation in 2019 and the Persian adaptation in 2020 showed very little improvement in the methodological quality of the cross-cultural adaptation of the NIH-CPSI [24, 33, 35].

Sample size for the future cross-cultural adaptations of the NIH-CPSI

Many adaptations did not pretest the prefinal version or did not enroll enough patients, although pretesting is an important step in adaptation. Ideally, 30 to 40 participants should be included in pretesting [17]. Patients with CP/CPPS differ from the translators and the expert committee; some of them do not have a high level of education, which makes pretesting necessary.

The sample size for the assessment of the measurement properties was also important. Among the adaptations that reported each property, the sample size was inadequate in 9 of 14 for internal consistency (which requires more than 100 patients), in 5 of 8 for construct validity, and in 5 of 11 for reliability. The sample size of the studies on validity ranged from 30 to 259, but none of them reported a sample size calculation. This was the most notable drawback for the measurement properties. Overall, at least 100 patients should be included for internal consistency and validity, and at least 50 patients for reliability, agreement, and responsiveness [18]. Thus, 30 to 40 participants should be included in the pretesting process, and a sample size of at least 100 patients should be used to assess the measurement properties of the NIH-CPSI.

Best practice for evaluating the construct validity of the NIH-CPSI

The construct validity of the NIH-CPSI has been tested in most of the publications, but the methods are not uniform. Construct validity is best estimated using the multi-trait multi-method matrix [40]. In some cases, researchers have used either latent variable modeling or the Pearson product-moment correlation based on Fisher’s Z transformation [41, 42]. An internally consistent scale is established through principal component analysis or exploratory factor analysis, followed by confirmatory factor analysis (CFA). Because a clear hypothesis exists that the factor structure comprises pain or discomfort, urinary symptoms, and QOL, CFA should be used [43, 44]. Robust maximum likelihood can be used to estimate the CFA model. The fit of the model is assessed by combining the following fit indices: comparative fit index (CFI), Tucker-Lewis index (TLI), standardized root mean square residual (SRMR), and root mean square error of approximation (RMSEA). Pre-determined cut-off values are used to assess the fit (CFI and TLI > 0.95 for good fit and > 0.90 for acceptable fit; SRMR < 0.08 for good fit and < 0.12 for acceptable fit; and RMSEA < 0.06 for good fit and < 0.10 for acceptable fit) [45]. Construct validity can be undermined by low or weak correlations with other tests that are intended to measure the same construct. The critical values for Pearson’s or Spearman’s r correlations are as follows: high, r > 0.50; moderate, 0.50 > r > 0.30; and low, 0.30 > r > 0.25 [46]. The critical value for a significant factor loading is > 0.40 [46]. According to the guidelines, the sample size for CFA should be more than 100 and at least 7 times the number of items [47]. The CFA can be conducted using the Analysis of Moment Structures (AMOS) program or the lavaan package in R statistical software. Pearson’s r correlations can be computed using SPSS, SAS, or R statistical software.
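The cut-off values quoted above can be expressed as a small classification sketch. It assumes that all four indices must meet a threshold for the model to receive that label and that the correlation thresholds apply to the magnitude of r; the text does not specify either point, and the function names are hypothetical. The fit indices themselves would come from CFA software and are not computed here.

```python
def classify_fit(cfi, tli, srmr, rmsea):
    """Label a CFA model using the cut-offs quoted in the text.

    Assumption: every index must satisfy the threshold for a given label.
    """
    if cfi > 0.95 and tli > 0.95 and srmr < 0.08 and rmsea < 0.06:
        return "good"
    if cfi > 0.90 and tli > 0.90 and srmr < 0.12 and rmsea < 0.10:
        return "acceptable"
    return "poor"


def correlation_strength(r):
    """Label a Pearson or Spearman correlation by its magnitude.

    Assumption: thresholds apply to |r|; exact boundary values fall
    into the lower category.
    """
    magnitude = abs(r)
    if magnitude > 0.50:
        return "high"
    if magnitude > 0.30:
        return "moderate"
    if magnitude > 0.25:
        return "low"
    return "below the low threshold"


# Hypothetical fit indices and correlation taken from an imagined analysis.
fit_label = classify_fit(cfi=0.97, tli=0.96, srmr=0.05, rmsea=0.04)
strength_label = correlation_strength(0.42)
```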

Limitations of this review

The major English-language databases were searched, and reference lists were also searched by hand. Nonetheless, it cannot be guaranteed that all cross-cultural adaptations of the NIH-CPSI were found, although it is important for a systematic review to assess all original studies that report cross-cultural translations of the NIH-CPSI. In addition, the design of this systematic review was defined a priori, but the protocol was not registered.

Conclusions

The overall quality of the NIH-CPSI cross-cultural adaptations was lower than expected. Only the Portuguese-Brazilian, Italian, and Spanish adaptations completed more than half of the steps of the cross-cultural adaptation process. Also, only the Italian and Turkish adaptations completed half of the measurement properties. Future studies should choose an adequate sample size and test responsiveness and floor and ceiling effects. Moreover, the assessment of the other psychometric properties should follow the guidelines.

What is new?

  1. The overall quality of the NIH-CPSI cross-cultural adaptations is lower than expected.

  2. Only the Portuguese-Brazilian and Spanish adaptations showed better quality than the other adaptations in the assessment of the adaptation process.

  3. Only the Italian and Turkish adaptations completed half of the measurement properties.

  4. Many standards have been developed to measure the cross-cultural reliability of questionnaires; however, the Danish adaptation in 2019 and the Persian adaptation in 2020 show that there has been very little improvement in the cross-cultural adaptation of the NIH-CPSI.