Introduction

Sexual health is a state of physical, emotional, mental, and social well-being in relation to sexuality; it is not merely the absence of disease, dysfunction, or infirmity [1]. Pelvic floor disorders (PFD), including urinary (UI) and anal incontinence (AI) and pelvic organ prolapse (POP), are common presenting gynecological complaints and adversely affect quality of life, including sexual health [2, 3]. Up to 60 % of sexually active women attending urogynecology clinics report sexual dysfunction [3]. Despite the prevalence, only a minority of urogynecologists consistently screen patients for sexual complaints. Lack of time, uncertainty about therapeutic options, and older age of the patient have been cited as potential reasons for failing to address sexual function as part of routine gynecological history [4].

Questionnaires play an integral role in the evaluation of female sexual function. A number of general sexual function questionnaires have been developed and utilized in the evaluation of women with PFD, including the Female Sexual Function Index (FSFI) [5], the Profile of Female Sexual Function (PFSF) [6], and the McCoy Female Sexuality Questionnaire (MFSQ) [7]. While these measures are validated, they were not developed to focus on PFD and its impact on sexual health, and because they are not condition-specific for PFD, they may not be sensitive enough to detect differences in sexual function due to PFD. The Pelvic Organ Prolapse/Urinary Incontinence Sexual Questionnaire (PISQ) and its short form version, the PISQ-12, are the only current validated condition-specific female sexual function questionnaires purposively developed to assess sexual function in women with UI and/or POP [8, 9].

Questionnaires utilized to assess sexual health among women with PFDs share common limitations [10]. None of the questionnaires has been validated in a population of women with AI, nor have they been developed within a framework that intentionally included women without a partner or those who did not consider themselves to be sexually active. Excluding these women may lead to underestimation of the impact of PFD on sexual function, since women with severe PFD may elect to become sexually inactive. A validated sexual measure that evaluates not only the impact of PFDs on sexual function but also PFD impact on sexual activity is needed. The questionnaire should have proven responsiveness in evaluating treatment outcomes as well as validity and reliability. This multicenter international study, sponsored by the International Urogynecological Association (IUGA), was designed to address the above limitations in current sexual function measures. We aimed to build upon prior work and establish the validity, reliability, and responsiveness of a revised PISQ, the PISQ-IUGA-Revised (PISQ-IR).

Materials and methods

Twenty-three international experts in urogynecology, female sexual function, and survey design and validation convened to evaluate existing sexual function questionnaires and elucidate shortcomings. Three fundamental domains were identified: sexual inactivity; sexual response; and quality, satisfaction, and desire. The PISQ-12 was selected as the foundation for developing a new condition-specific instrument. In addition to the 12 items from the PISQ-12, 48 additional items were adopted, revised, or developed for consideration. The basic construct (face) validity of this 60-item pool was evaluated using 31 cognitive interviews [11, 12] conducted at three sites: University of Minnesota (n = 16), University of New Mexico (n = 5), and at the Cleveland Clinic, Florida (n = 10). Based on the results of the cognitive interviews, 18 items were eliminated and 42 items were selected for inclusion in the final item pool.

To increase the diversity of the population used to develop the questionnaire, 12 sites from across the USA and 5 sites from the UK participated in the validation study. Institutional Review Board approval was obtained from all sites, and all women gave written consent. To participate, women had to be 18 years or older, not pregnant, able to read, write, and understand English, and seeking treatment for UI and/or AI and/or POP. Exclusion criteria were a diagnosis of vulvodynia, painful bladder syndrome, or chronic pelvic pain, defined as pelvic pain for greater than 6 months. Since this study evaluated both sexual activity status and sexual function, women did not need to be sexually active to participate.

Women were given a survey packet and asked to complete it at home or in the clinic. Because of cost and limited follow-up capabilities, the test-retest packets were distributed only in the USA. To increase response rates, the study followed the principles of the tailored design method [13] in conducting follow-up mailings. To establish the responsiveness of the new questionnaire, a portion of the women who completed the baseline validation survey were mailed a follow-up questionnaire 4-6 months after initial enrollment (Fig. 1).

Fig. 1 Study recruitment flow sheet

After written consent, women underwent a physical examination, including the Pelvic Organ Prolapse Quantification Scale (POPQ) [14], muscle strength using the Oxford grading scale [15] and muscle tone. If the woman had these measures assessed within the past month without treatment, she did not undergo repeat examination. Baseline characteristics and past medical history were collected. If the subject was scheduled for surgical treatment, the anticipated surgery date was recorded. Clinicians indicated one or more PFD diagnoses based on assessment of the physical examination findings, history, and any other clinical data available. Definitions conformed to IUGA/International Continence Society (ICS) recommendations (Fig. 1 and Table 1).

Table 1 Enrollment and follow-up

In addition to the 42-item validation version of the questionnaire, women also completed the Incontinence Severity Index (ISI) [16], a single question evaluating prolapse and its bother (question #35) from the Epidemiology of Prolapse and Incontinence Questionnaire (EPIQ) [17], the Pelvic Floor Distress Inventory-20 (PFDI-20) [18-20], and the FSFI [5, 21]. The first step in the analysis was to describe the basic distribution of patient and clinical characteristics. Comparisons were made between the UK and US populations using chi-square or t tests as appropriate. The test-retest reliability of each item was assessed using Student’s t tests.

Basic psychometric analytic tools [22-25] were used to guide the evaluation of the instrument. Briefly, we evaluated item distribution and test-retest reliability and then progressed into bi- and multivariate evaluations. In the bi- and multivariate analysis, correlational and factor analysis were the primary statistical tools used. Two independent investigators conducted analyses, using both principal components analysis (PCA) and principal factor analysis (PFA) methods that included orthogonal (varimax) and oblique (promax, Harris-Kaiser) rotations [22-24]. Both analytic methods are commonly used to evaluate the interrelatedness of questions to determine if particular items form coherent and valid subscales.

Factor analysis is a common method used in scale development for instruments such as the PISQ-IR. When a group of items within a scale demonstrates strong relationships to one another, factor analysis identifies this grouping as a factor or subscale. Eigenvalue statistics are used to evaluate the strength of these relationships. A high eigenvalue indicates that an underlying subscale exists; low values indicate that the underlying subscale might not exist. For example, if POP/UI/AI uniquely contributes to sexual life, then some items which deal with condition-specific impacts (e.g., "I feel sexually inferior because of my incontinence and/or prolapse") should emerge as a subscale. If the condition-specific subscale does not emerge, or if it has a low eigenvalue, then it is assumed that POP/UI/AI does not have a unique role relative to sexual life. If a subscale’s eigenvalue meets the threshold for retention, the next step in the analysis is to evaluate how items relate to a particular factor using factor loading scores. A factor loading score represents the strength of the relationship of an item with a particular subscale. In factor analysis each item has a loading score for all candidate subscales. High factor loading scores indicate that an item is strongly associated with that factor; low scores indicate that an item is not strongly associated with that particular factor. Standard criteria for selection and retention of factors (eigenvalues greater than 1) and items within factors (factor loading scores) were used [22-24]. The majority of items were retained in a subscale if the item had a factor loading of at least 0.60 and did not load at greater than 0.40 on other factors. For some items, an alternative criterion was used: the loading of the item on the primary factor had to be at least 0.20 greater than the item's loading on any other factor. If an item met either the 0.60/0.40 criterion or the 0.20 difference criterion, it was retained as part of a subscale [22-24]. Scale development was iterative and included the full study team, who discussed whether items in each scale were coherent, made sense, and were clinically useful [26].
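The retention rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SAS code used in the study: the function names, the use of absolute loadings, and the example loading values are all hypothetical, and the sketch applies both item criteria to every item, whereas the study used the 0.20 difference criterion only for some items.

```python
from typing import List, Optional


def retained_factors(eigenvalues: List[float], threshold: float = 1.0) -> List[int]:
    """Return indices of factors whose eigenvalue exceeds the threshold (> 1)."""
    return [i for i, ev in enumerate(eigenvalues) if ev > threshold]


def assign_item(loadings: List[float],
                primary_min: float = 0.60,
                cross_max: float = 0.40,
                min_gap: float = 0.20) -> Optional[int]:
    """Return the index of the factor an item is retained on, or None.

    An item is retained if its highest loading is at least `primary_min`
    with no other loading above `cross_max` (the 0.60/0.40 criterion), or
    if its highest loading exceeds every other loading by at least
    `min_gap` (the 0.20 difference criterion).
    Assumption: loadings are compared by absolute value.
    """
    abs_loadings = [abs(x) for x in loadings]
    primary = max(range(len(abs_loadings)), key=abs_loadings.__getitem__)
    others = [v for i, v in enumerate(abs_loadings) if i != primary]
    highest_other = max(others) if others else 0.0

    meets_main = abs_loadings[primary] >= primary_min and highest_other <= cross_max
    meets_gap = abs_loadings[primary] - highest_other >= min_gap
    return primary if (meets_main or meets_gap) else None


# Hypothetical loadings for three items on three candidate factors.
print(retained_factors([3.2, 1.4, 0.8]))   # factors 0 and 1 pass eigenvalue > 1
print(assign_item([0.72, 0.15, 0.10]))     # meets the 0.60/0.40 criterion
print(assign_item([0.10, 0.55, 0.30]))     # meets only the 0.20 difference criterion
print(assign_item([0.50, 0.45, 0.10]))     # meets neither criterion
```

In practice the numeric rules were only a starting point; as noted above, the study team also reviewed each candidate subscale for coherence and clinical usefulness.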

In the criterion validity and responsiveness evaluation, correlations served as the primary evaluation technique, but regression was also used to evaluate how our new measure correlated with other established measures of PFD and sexual function. We used the POPQ, Oxford grading and pelvic floor tone (clinical exam measures), the PFDI-20, ISI, EPIQ question #35 (self-reported indicators of condition including severity and impact), and the FSFI (self-reported measure of sexual function). Since the value of criterion analyses was based upon a priori established comparisons, five experts determined which comparisons would be made. Subscales which had condition-specific or related items were compared against all clinical exams and self-reported indicators. For FSFI comparisons, the evaluation focused on the PISQ-IR and FSFI subscales that the reviewers felt had sufficient overlap to be measuring conceptually similar constructs.

Our responsiveness evaluation was based on comparison of the change scores in our new questionnaire to change scores in the PFDI-20, ISI, EPIQ question #35, and FSFI following surgical treatment using correlations. We compared the change scores for women who had surgery during the course of the study with those who did not have surgery (difference of difference test using Student’s t).
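As a rough illustration of the change-score comparison, the sketch below computes change scores for two measures and their Pearson correlation. It is not the study's SAS analysis: the function names and the data are hypothetical, and it assumes both measures have already been coded so that a positive change means improvement.

```python
from math import sqrt
from typing import List, Sequence


def change_scores(baseline: Sequence[float], follow_up: Sequence[float]) -> List[float]:
    """Follow-up minus baseline for each respondent (paired by position)."""
    return [f - b for b, f in zip(baseline, follow_up)]


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical subscale scores for four respondents (not study data).
subscale_change = change_scores([2.0, 3.0, 2.5, 1.5], [3.0, 3.5, 3.5, 1.5])
other_change = change_scores([10.0, 20.0, 15.0, 5.0], [12.0, 21.0, 17.0, 5.0])
r = pearson_r(subscale_change, other_change)
```

A positive `r` here would indicate that respondents who improved more on one measure also tended to improve more on the other, which is the pattern the responsiveness evaluation looks for.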

Sample size was determined based on the sample size needed to conduct the psychometric analysis, including responsiveness. The basic rule of thumb in psychometric analysis is that ten subjects per item are needed for analyses [23]. For sexually active women we included 32 items, requiring a sample size of at least 320; for women who were not sexually active we included 14 items, requiring a sample of at least 140. To evaluate the responsiveness of the instrument, we estimated that we would need 350 respondents who completed both the baseline as well as a 4- to 6-month follow-up survey. This is based on using an alpha of .05, a targeted power of .80, and an assumed change of 20 % in the score between baseline and follow-up. Adjusting for expected response rates, the targeted enrollment for the study was 850 subjects (600 US and 250 UK subjects). All analyses were conducted using SAS© (v. 9.12). This trial was registered with ClinicalTrials.gov, NCT00952406.

Results

A total of 877 women were enrolled in the study, and 589 returned a completed baseline survey (67 % response rate); 200 of 288 women (70 %) provided data after surgery, and 147 women (54 %) completed the test-retest survey. The majority of women were married, middle-aged (55 ± 12.1 years), and Caucasian. As expected, most women had UI and/or POP; 11 % of the population had AI. Sixty-eight percent of the study population reported sexual activity. A number of differences between the UK and the US populations were noted (Table 2).

Table 2 Clinical and demographic characteristics of study population

The univariate distributions of responses were acceptable for all items except two, Q12c (shame) and Q12d (fear), which had skewed distributions with ceiling effects (Appendix A). Further evaluation of these items demonstrated that the ceiling effect was driven by the POP-only respondents, but the items demonstrated particular sensitivity in women with AI. As a result of this sensitivity, the decision was made to retain the items for the rest of the psychometric evaluation. For an evaluation of other issues, such as item nonresponse in relationship to the scaling of the PISQ-IR, see the companion to this article [27].

The test-retest evaluation demonstrated that 3 of the 42 items had significantly different responses between the test and retest administrations (paired t test, all p < 0.05). We reran the analysis using a correction for the number of tests to determine whether or not the observed difference was secondary to chance (step-down Bonferroni). When corrected for the number of tests none of the items demonstrated significant differences between test and retest. We then evaluated the mean and maximum difference in responses for these three items and the number of respondents who selected a response option that was at least two categories different between test and retest. We found only a few women fell into this category. Finally, we identified and eliminated all respondents who had surgery between the test and retest administrations, assuming that a change in scores following surgery would be expected; when these cases were removed, none of the items demonstrated a significant difference between test and retest. Based on this analysis, all items were retained.
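The multiplicity correction can be sketched as follows. The step-down Bonferroni adjustment is commonly known as the Holm procedure; the implementation below is a minimal illustration rather than the SAS routine used in the study, and the function name and all p-values are hypothetical.

```python
from typing import List, Sequence


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Return Holm (step-down Bonferroni) adjusted p-values in input order."""
    m = len(p_values)
    # Process raw p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease down the list.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical scenario mirroring the text: 42 item-level tests, three of
# which have raw p < 0.05. The raw p-values are invented for illustration.
raw = [0.02, 0.03, 0.04] + [0.5] * 39
adjusted = holm_adjust(raw)
significant_after_correction = [p for p in adjusted if p < 0.05]
```

With 42 tests, even the smallest raw p-value in this invented example is multiplied by 42, so none of the adjusted values falls below 0.05, which illustrates how three nominally significant items can lose significance after correction.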

Psychometric analysis

Our original study goal was to pool the data for both sexually active and inactive women for the quality, satisfaction, and desire construct of the questionnaire. During the psychometric analysis, it became clear that the underlying structure and nature of this construct was distinct between groups and pooling was not possible. Thus, the results for the sexually active and inactive groups are presented separately.

Sexually inactive

Two subscales emerged in each of two domains. In the sexual inactivity domain one subscale captured the contribution of PFDs and personal health [not sexually active–condition-specific (NSA-CS), three items], and the other captured partner and personal interest as to why a person is not sexually active [NSA-partner-related (NSA-PR), two items] (see Table 3). The quality and satisfaction domain is composed of two subscales: a global rating of sexual quality [NSA–global quality (NSA-GQ), four items] and a condition-specific subscale [NSA-condition impact (NSA-CI), three items]. Each of the subscales within the two domains demonstrated sound psychometric properties and the items, as demonstrated by their factor loadings, revealed a conceptual grouping that conformed to the underlying assumptions used when developing the instrument (Table 3).

Table 3 Final factor loadings and internal consistency for scales among women who report sexual inactivity

Sexually active

For the sexual response domain three subscales emerged that included sexual arousal and orgasm [sexually active-arousal, orgasm (SA-AO), four items], partner-related issues [SA-partner-related (SA-PR), three items], and condition-specific issues [SA-condition-specific (SA-CS), three items] (see Table 4). The quality, satisfaction, and desire domain included one subscale on global quality [SA-global quality (SA-GQ), four items], one on sexual desire [SA-desire (SA-D), three items], and a condition impact subscale [SA-condition impact (SA-CI), four items]. Two items did not meet the 0.60/0.40 criterion, but each met the 0.20 difference criterion and was retained in its subscale.

Table 4 Final factor loadings and internal consistency for scales measuring function among women who report sexual activity

Criterion validity

Sexually inactive

None of the physical exam measures demonstrated significant correlations with the two condition-specific factors (NSA-CS, NSA-CI) for women who reported sexual inactivity. However, the PFDI-20, ISI, and EPIQ question #35 did demonstrate significant correlations in the anticipated direction.

Sexually active

Both condition-specific factors (SA-CS and SA-CI) in both domains demonstrated significant correlations in the anticipated direction with PFDI-20 and ISI scores. EPIQ question #35 and the POPQ correlated with the quality, satisfaction, and desire condition-specific factor (SA-CI) as anticipated. As shown in Table 5, all comparisons with FSFI subscales demonstrate significant correlations in the direction predicted. This criterion analysis demonstrates that the subscales correlate with external criteria in a manner that was predicted for those who are sexually active. This correlation was found for physical exam measures, condition-specific validated questionnaires as well as the FSFI. For women with AI, strong correlations were found with the FSFI in all factors (Table 5).

Table 5 Criterion validity: scale correlations with other measures

Responsiveness

Sexually inactive

The condition impact factor in the quality and satisfaction domain (NSA-CI) demonstrated significant correlation with the EPIQ question #35 and ISI, but not with other measures used to test responsiveness (Table 6).

Table 6 Responsiveness evaluation: change scores correlations

Sexually active

Improvements in the PFDI-20, ISI, and EPIQ question #35 scores were positively correlated with improvement in both condition-specific subscales (SA-CI and SA-CS) in both domains (Table 6). In addition, all comparisons with the FSFI show a significant positive correlation. The two condition-specific subscales show significant differences in scores between women who did and did not have surgery.

Figure 2 illustrates the subscales that emerged from the item pool across three basic domains: sexual response, quality/satisfaction/desire, and sexual inactivity. The sexual response dimension applies only to individuals who are sexually active and has three subscales; the sexual inactivity dimension applies only to those who are not active and has two subscales. There is overlap for the subscales assessing quality and satisfaction between those who are sexually active and those who are not, but the desire subscale (SA-D) emerged as a coherent subscale only in those who are sexually active.

Fig. 2 Final conceptual framework for domains and scales for sexually active and inactive populations resulting from the factor analysis

Questionnaire scoring

Different scoring approaches for all subscales were evaluated, including summated, variants of magnitude, and mean scoring [22, 28, 29]. We recommend that one of two methods be used to score the PISQ-IR: mean calculation or a transformed sum. To calculate a score for a subscale, the respondent must have answered more than one half of the items in that subscale. Missing values should not be imputed. Appendix B gives detailed instructions on how to calculate mean scores. In brief, mean subscale scores are calculated by summing the valid responses to items in the subscale and then dividing by the number of items with valid responses. A transformed sum can also be used to score the PISQ-IR; transformed sums demonstrate more accuracy than mean calculation and are further described elsewhere [27]. For sexually inactive women four separate subscale scores are calculated, and for sexually active women six separate subscale scores. Total scores are not reported, since the subscales emerged as distinct in the psychometric analyses.
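The mean scoring rule summarized above can be sketched as follows. This is a minimal illustration of the stated rule only: the function name and example responses are hypothetical, missing answers are represented as `None`, and any item recoding specified in Appendix B is assumed to have been applied beforehand.

```python
from typing import Optional, Sequence


def mean_subscale_score(responses: Sequence[Optional[float]]) -> Optional[float]:
    """Mean of the valid responses in one subscale, or None if not scorable.

    A subscale is scored only if more than one half of its items were
    answered; missing responses (None) are ignored rather than imputed.
    """
    valid = [r for r in responses if r is not None]
    # "More than one half" means strictly greater than len(responses) / 2.
    if 2 * len(valid) <= len(responses):
        return None
    return sum(valid) / len(valid)


# Hypothetical item responses (not study data).
print(mean_subscale_score([4, None, 2]))        # 2 of 3 answered: scored
print(mean_subscale_score([4, None, None, 2]))  # 2 of 4 answered: not scored
print(mean_subscale_score([3, None]))           # 1 of 2 answered: not scored
```

Under this rule a two-item subscale requires both items to be answered, while a three- or four-item subscale tolerates one missing response.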

Discussion

We describe the validation and reliability testing of a condition-specific measure of sexual function for women with PFDs and its responsiveness to change. The PISQ-IR improves upon prior sexual function questionnaires used in women with PFDs because it evaluates the effect of PFDs on sexual inactivity and includes validation in women with AI. Although the absolute numbers of women with AI were low, significant findings and correlations were noted. We included a broad population of women with pelvic floor dysfunction in both the USA and UK. Although we did find some baseline differences between groups, we feel that the diversity between the US and UK populations strengthens the generalizability of the PISQ-IR. Finally, this measure is ideally poised for international translation into a variety of languages because the items tested in this validation study were initially vetted by an international panel. Ultimately, we hope that this work provides the groundwork for global validation of the PISQ-IR in a variety of languages and cultural contexts.

We acknowledge that patient input into this new measure was relatively limited with 31 cognitive interviews. We did, however, reach saturation in the interviews and believe that we had adequate patient input for the new measure. The psychometric analysis of the PISQ-IR supported the framework that guided the development of the questionnaire (Fig. 2). For each of the domains, multi-item subscales emerged which demonstrated sound psychometric properties. The items in each subscale are internally consistent and form a coherent grouping that is differentiated from the content of the other subscales and between domains. Scaling demonstrated a robust measure in which specific subscales emerged, which assessed core dimensions that are related to general sexual function as well as the impact of PFDs on sexual function. Furthermore, these subscales captured the underlying intent of assessing multiple aspects for each domain including both condition-specific issues and non-condition-specific issues. Initially, the items in the quality, satisfaction, and desire domain were developed to be relevant to both sexually active and inactive women. The psychometric analysis partially supported this assumption, but identified some differences between groups and therefore separate subscales were developed for sexually active and inactive women. Since the domains were distinct for sexually active and inactive women, scores of women who change sexual activity status over time cannot be assessed. For all women, sexual activity status is arguably the most important measure of sexual function, and change of status is an important marker of improved or deteriorated sexual health and should be reported in all studies of sexual function as a separate outcome.

In addition to not being able to measure changes in sexual activity status, the PISQ-IR scoring does not support a single summary score. This is because on psychometric analysis the domains and subscales in the measure emerged as distinct. In order to create a summary score, more patient input is needed to accurately weight the relative importance of various items to respondents. Our new measure represents a significant departure from the original PISQ, although many of the items are similar. For this reason we do not feel that the PISQ-IR scores are comparable with the original questionnaire’s scores. Finally, the minimally important difference (MID) for the PISQ-IR has not been established; testing of the instrument in women with pelvic floor dysfunction before and after interventions will be needed to determine the MID.

We found that in the criterion validity testing our new scale correlated significantly in the anticipated direction with other self-report measures. However, the correlations with physical exam measures, including the POPQ, Oxford grading scale, and assessment of pelvic floor muscle tone, were not consistent. We feel that this is representative of how women experience functional disorders which are imprecisely tied to physical exam findings.

We found that the PISQ-IR was responsive to change as measured by correlations with other self-report questionnaires and with changes in PISQ-IR scores following surgery for women who reported sexual activity. While we did not test responsiveness to nonsurgical management, interventions that result in changes similar to surgery are likely to result in similar responsiveness. Responsiveness in condition-specific scales is often evaluated post hoc, in which the scale is developed and subsequent research is conducted to determine if the scale is responsive. The integration of the responsiveness into the design of this study had significant impact on scale development. When reducing an item pool through psychometric evaluation, items often demonstrated very similar properties. Responsiveness data was used to inform the choice between these items, and we were able to select items that demonstrated greater responsiveness. In addition, establishing responsiveness to change at the time of initial publication allows more confidence in the initial use of the questionnaire. We had aimed to include data from 350 women in our responsiveness evaluations and, despite recruiting the anticipated number of women from the UK and USA, were only able to obtain follow-up data from 200 women. Nonetheless, we were able to demonstrate significant correlations with changes in other self-report measures as well as changes in scores following surgery. In addition, we did not include responsiveness evaluation of women who underwent conservative treatment of pelvic floor dysfunction, since most women who chose conservative management underwent a variety of interventions of varied efficacy. It is possible, but unlikely, that women who respond to nonsurgical treatments and have similar responses in improvement of function will not demonstrate the responsiveness we found among women who underwent surgery.

The responsiveness of the domains evaluating women who reported sexual activity was more robust than the responsiveness of the questionnaire in the sexually inactive group. It was recognized at the start of this project that responsiveness in sexually inactive women would be difficult to evaluate. Nonetheless, we did see that the sexually inactive measures were responsive to change in scores on some of the condition-specific questionnaires.

In conclusion, we have presented the initial validation, reliability, and responsiveness data for a condition-specific measure of sexual function in women with PFD that also assesses the impact of PFD on sexual inactivity and includes the evaluation of women with AI. Importantly, this measure is now available for international validation in a variety of languages and cultural contexts.