Physical Activity Questionnaires for Pregnancy: A Systematic Review of Measurement Properties

Background: In order to assess physical activity (PA) during pregnancy, it is important to choose the instrument with the best measurement properties.
Objectives: To systematically summarize, appraise, and compare the measurement properties of all self-administered questionnaires assessing PA in pregnancy.
Methods: We searched PubMed, Embase, and SPORTDiscus with the following inclusion criteria: (i) the study reported at least one measurement property (reliability, criterion validity, construct validity, responsiveness) of a self-administered questionnaire; (ii) the questionnaire intended to measure PA; (iii) the questionnaire was evaluated in healthy pregnant women; and (iv) the study was published in English. We evaluated results, quality of individual studies, and quality of evidence using a standardized checklist (Quality Assessment of Physical Activity Questionnaires [QAPAQ]) and the GRADE (Grading of Recommendation, Assessment, Development, and Evaluation) approach.
Results: Seventeen articles, reporting 18 studies of 11 different PA questionnaires (17 versions), were included. Most questionnaire versions showed insufficient measurement properties. Only the French and Turkish versions of the Pregnancy Physical Activity Questionnaire (PPAQ) showed both sufficient reliability and construct validity. However, all versions of the PPAQ pooled together showed insufficient construct validity. The quality of individual studies was usually high for reliability but varied considerably for construct validity. Overall, the quality of evidence was very low to moderate.
Conclusions: We recommend the PPAQ to assess PA in pregnancy, although the pooled results revealed insufficient construct validity. The lack of appropriate standards in data collection and processing criteria for objective devices in measuring PA during pregnancy attenuates the quality of evidence. Therefore, research on the validity of comparison instruments in pregnancy followed by consensus on validation reference criteria and standards of PA measurement is needed.
Electronic supplementary material: The online version of this article (10.1007/s40279-018-0961-x) contains supplementary material, which is available to authorized users.


Key Points
There was high-quality evidence that the Pregnancy Physical Activity Questionnaire (PPAQ) has sufficient reliability in assessing total physical activity (PA) and vigorous PA (VPA) in pregnancy. However, the questionnaire revealed insufficient construct validity in assessing these scores, but the evidence for this was of low-to-moderate quality.
The Australian Women's Activity Study (AWAS), Leisure-Time Exercise Questionnaire (LTEQ), Leisure-Time Physical Activity Questionnaire (LTPAQ), and Recent Physical Activity Questionnaire (RPAQ) showed both insufficient reliability and construct validity when assessing either total PA, moderate-to-vigorous PA, or VPA in pregnancy. This assessment was based on very low-to-moderate quality evidence.
Most importantly, we need more high-quality evidence regarding the validity of objective measures of PA in pregnancy, such as accelerometers, and standards in data collection and processing criteria of these devices. Only then will we be able to guarantee adequate and comparable estimations of the validity of a PA questionnaire in pregnancy.

Introduction
Physical activity (PA) plays a pivotal role in the improvement and maintenance of physical and mental health [1]. In pregnancy, regular PA can have various health benefits for mother and fetus, such as reduced symptoms of depression [2] and lower risks for excessive gestational weight gain [3], gestational diabetes mellitus [4], lower birth weight [5], preterm birth [3], and pre-eclampsia [6]. There is even evidence that PA during pregnancy may improve cardiac and neurobehavioral maturation of the offspring [7], which is in harmony with the premise of fetal programming [8]. Therefore, the American College of Obstetricians and Gynecologists [9] recommends that pregnant women, in the absence of medical or obstetric complications, participate in moderate-intensity activities for at least 20-30 min per day on most or all days of the week.
Research on PA in pregnancy has grown steadily in recent years. To provide solid evidence-based recommendations, to determine the health benefits of PA, the effectiveness of PA interventions, and dose-response relationships between PA and health outcomes, and to assess global trends of PA over time, adequate measurement of PA in pregnancy is essential. In particular, a measurement instrument should provide reliable and valid estimates of PA in this target population.
Questionnaires are a commonly used, inexpensive, and acceptable method to determine PA levels. Because of different study purposes, populations, settings, or unsatisfactory pre-existing questionnaires, many PA questionnaires have been developed, which introduces complexity when choosing the right questionnaire for one's study purpose. Moreover, using different questionnaires hinders the comparability of PA levels across studies and countries, especially if the questionnaires vary in their measurement quality. Therefore, an overview of measurement properties of PA questionnaires for use in pregnancy is helpful to select the best qualified questionnaire. A critical appraisal of the methodological quality of these validation studies and the overall evidence is essential for drawing unbiased conclusions about measurement properties.
Although the measurement properties of PA questionnaires have been systematically reviewed for non-pregnant populations [10][11][12], there is still a lack of knowledge addressing this issue in pregnancy. The purpose of this systematic review was to critically appraise, compare, and summarize the measurement properties (reliability, criterion validity, construct validity, responsiveness) of all available self-administered questionnaires measuring PA in pregnancy, taking the methodological quality of these studies as well as the quality of evidence into account.

Literature Search
We performed a systematic literature search using a priori defined eligibility criteria in the databases PubMed, Embase (using the 'Embase only' filter), and SPORTDiscus. The search strategy included (variations of) the terms 'physical activity', 'measurement properties' [13], 'questionnaire' and 'pregnancy' (see Electronic Supplementary Material Appendix S1 for the full search strategy). Publication types such as interviews, case reports, or biographies were excluded. This search strategy was adapted for Embase and SPORTDiscus following their individual search guidelines. Additional studies were identified by searching references of the retrieved articles. The search was performed on 17 July 2017.

Eligibility Criteria
The eligibility criteria were based on the previous series of reviews on PA questionnaires [10][11][12], and adapted to our target population. The following inclusion criteria were used: (i) The aim of the study was to evaluate one or more of the following measurement properties of a self-administered questionnaire: reliability, criterion validity, construct validity, or responsiveness. (ii) The aim of the questionnaire was to measure PA, which was defined as any bodily movement produced by skeletal muscles that resulted in energy expenditure (EE) above resting level [14]. (iii) The study was performed in healthy pregnant women, irrespective of the population for which the questionnaire was originally developed (e.g., pregnant women, general population, adolescents). (iv) The article had to be published in English.
Since different modes of data collection likely cause heterogeneity in effect estimates and data quality [15], the aim of this review was to provide evidence-based recommendations only for self-administered PA questionnaires. Consequently, we excluded PA interviews (face-to-face, telephone), diaries, interview-administered questionnaires, questionnaires measuring physical functioning, and questionnaires (questions) asking about sweating. All studies performed in patients (e.g., pregnant women with gestational diabetes) were excluded. There were no limitations concerning the mean age or body mass index of the study populations.
Finally, measurement properties regarding the internal structure (structural validity, internal consistency, cross-cultural validity/measurement invariance), development, and content validity of the PA questionnaires were not assessed in this review. The evaluation of internal structure (e.g., using Cronbach's alpha) is relevant for constructs consisting of reflective indicators [16]. These indicators are manifestations of the construct and, thus, should be highly correlated with each other. In contrast, PA is represented by causal or composite indicators, which can independently contribute to PA. The evaluation of content validity would require the inclusion of studies of the development and translations of the questionnaire as well as studies focusing on content validity and expert opinions. Therefore, a single but comprehensive evaluation of content validity of (all available) PA questionnaires should be performed in a future review.

Selection of Articles
Two researchers independently performed abstract selection, selection of full-text articles, data extraction, and quality assessment. Disagreements were discussed and resolved. Full-text articles were retrieved if the abstracts fulfilled the inclusion criteria or if the abstract did not contain measurement properties, but these were likely to be presented in the full-text article.

Data Extraction
We used a standardized extraction form, based on the QAPAQ (Quality Assessment of Physical Activity Questionnaire) checklist [17], to obtain the required information to evaluate the methodological quality and results of each individual study. The QAPAQ checklist was developed for PA questionnaires and is based on the COSMIN (COnsensus-based Standards for the selection of health Measurement INstruments) checklist for assessing the methodological quality of studies of measurement properties of patient-reported outcome measures (PROMs) [18] and a list of criteria for sufficient measurement properties [19].
To provide a description of the PA questionnaire, the following information was collected: (i) target population of the questionnaire; (ii) dimension(s) of PA (e.g., habitual, EE); (iii) setting (e.g., household, sports); (iv) recall period; (v) number of questions; (vi) parameters of PA (e.g., frequency, duration, intensity); (vii) number and type of scores which can be calculated (e.g., total EE, minutes of activity per day). To assess the methodological quality and results of each individual study, we extracted information regarding study population, sample size, time intervals, data analysis, and results of the measurement properties.

Content Validity
Content validity is the degree to which the questionnaire encompasses all relevant aspects and dimensions of the intended construct. Since there is no statistical criterion (e.g., numerical value) for content validity, we evaluated content validity for all included questionnaires using the extracted qualitative attributes. Based on previous systematic reviews [11], the following two criteria were assessed: (i) if the questionnaire aims to measure total PA, it should incorporate activities in all settings (home, recreation, sports, transport, work); (ii) the questionnaire should measure at least frequency and duration of PA together with a recall period of at least 1 week.
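The two criteria can be read as a simple rule. The following Python sketch is illustrative only: the field names (`aims_total_pa`, `settings`, `parameters`, `recall_days`) and the example records are hypothetical and are not taken from the extraction form used in this review.

```python
# Illustrative check of content-validity criteria (i) and (ii).
# All field names and example records are hypothetical.
REQUIRED_SETTINGS = {"home", "recreation", "sports", "transport", "work"}
REQUIRED_PARAMETERS = {"frequency", "duration"}
MIN_RECALL_DAYS = 7  # "a recall period of at least 1 week"


def meets_content_criteria(q: dict) -> bool:
    """Return True if a questionnaire description satisfies both criteria."""
    # (i) applies only to questionnaires that aim to measure total PA
    covers_settings = (not q["aims_total_pa"]) or REQUIRED_SETTINGS <= set(q["settings"])
    # (ii) at least frequency and duration, with a recall period of >= 1 week
    has_parameters = REQUIRED_PARAMETERS <= set(q["parameters"])
    has_recall = q["recall_days"] >= MIN_RECALL_DAYS
    return covers_settings and has_parameters and has_recall


# Hypothetical questionnaire aiming at total PA but missing the work setting
example_total = {
    "aims_total_pa": True,
    "settings": ["home", "recreation", "sports", "transport"],
    "parameters": ["frequency", "duration", "intensity"],
    "recall_days": 7,
}
# Hypothetical leisure-time questionnaire (criterion (i) does not apply)
example_leisure = {
    "aims_total_pa": False,
    "settings": ["recreation", "sports"],
    "parameters": ["frequency", "duration"],
    "recall_days": 30,
}
```

In this sketch a leisure-time questionnaire is not penalized for omitting settings, because criterion (i) is conditional on the aim of measuring total PA.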

Reliability
Reliability is the extent to which the scores for participants who have not changed are the same for repeated measurements under several conditions (i.e., free from measurement error) [20]. We considered parameters of reliability (Pearson/Spearman correlation, intraclass correlation coefficient [ICC], kappa, concordance) and of measurement error (limits of agreement [LOA], smallest detectable change [SDC], coefficient of variation [CV]) [17].
To ensure that a measurement detects clinically important changes accurately (beyond measurement error), a definition of minimal important change (MIC) of PA is required. Currently, there is no consensus about the MIC of PA in pregnancy, but a change in frequency of two sessions per week or a change in moderate PA or moderate-to-vigorous PA (MVPA) of 30 min (≥ 90 MET [metabolic equivalent of task] min) per week can be seen as important for both the individual and the clinician. According to this definition, the PA questionnaire should be able to reliably measure changes of ± 20% of currently recommended PA guidelines (i.e., 150 min of MVPA). Only when the LOA or SDC are smaller than the MIC can one be confident that changes as large as the MIC reflect true changes (e.g., statistically significant) in individual people that cannot be attributed to measurement error. Consequently, measurement error was rated using an MIC for frequency of 2 and an MIC for duration/intensity of 30 min (90 MET min) per week. It is important to note that these considerations about MIC were made irrespective of individual differences such as fitness, physical capacity, and body composition.
Furthermore, for a CV (i.e., standard deviation in relation to the mean), a maximum value of 15% was considered acceptable, which indicates that every observed PA score could vary on average ± 15% of the mean score (or 95% of the observed PA scores were between ± 1.96 × 15% of the mean). Finally, we considered ICC, kappa, and concordance coefficients of ≥ 0.70 or Pearson/Spearman correlation coefficients of ≥ 0.80 as sufficient [17].
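To make the comparison of measurement error with the MIC concrete, the following Python sketch computes the half-width of the 95% limits of agreement from a within-subject standard deviation s, using the relation LOA = d ± 1.96 × s × √2 quoted in this article's table footnote, and compares it with the MIC of 30 min per week. Comparing the half-width (rather than the full LOA range) with the MIC is an assumption of this sketch, and the input values used in the usage note are hypothetical.

```python
import math

MIC_MINUTES_PER_WEEK = 30.0         # MIC for duration used in this review
GUIDELINE_MINUTES_PER_WEEK = 150.0  # recommended MVPA per week


def loa_half_width(within_subject_sd: float) -> float:
    """Half-width of the 95% limits of agreement: 1.96 * s * sqrt(2)."""
    return 1.96 * within_subject_sd * math.sqrt(2)


def error_smaller_than_mic(within_subject_sd: float,
                           mic: float = MIC_MINUTES_PER_WEEK) -> bool:
    """True if the LOA half-width is smaller than the MIC (assumed comparison)."""
    return loa_half_width(within_subject_sd) < mic


# The MIC expressed as a share of the guideline: 30 / 150 = 0.2, i.e. 20%
mic_share = MIC_MINUTES_PER_WEEK / GUIDELINE_MINUTES_PER_WEEK
```

For a hypothetical within-subject standard deviation of 20 min per week, the half-width is about 55 min per week, which exceeds the MIC, so a change of 30 min per week could not be distinguished from measurement error in an individual.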
Based on QAPAQ [17], each result received either a positive (sufficient), negative (insufficient), or indeterminate rating. The result was sufficient (+) if ICC/kappa/concordance was ≥ 0.70 or Pearson/Spearman ≥ 0.80 or MIC > LOA/SDC or CV ≤ 15%, and otherwise insufficient (-). If no such coefficient was reported, the rating of the result was indeterminate (?).
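The rating rule for a single reported result can be sketched as follows. This is an illustrative Python rendering of the thresholds stated above, not code used in the review; the function name and the `kind` labels are hypothetical.

```python
from typing import Optional


def rate_reliability_result(kind: str, value: Optional[float],
                            mic: Optional[float] = None) -> str:
    """Rate one reliability or measurement-error result as '+', '-' or '?'."""
    if value is None:
        return "?"  # no coefficient reported: indeterminate
    if kind in {"icc", "kappa", "concordance"}:
        return "+" if value >= 0.70 else "-"
    if kind in {"pearson", "spearman"}:
        return "+" if value >= 0.80 else "-"
    if kind in {"loa", "sdc"}:
        if mic is None:
            raise ValueError("an MIC is required to rate LOA or SDC")
        return "+" if mic > value else "-"  # sufficient only if MIC > LOA/SDC
    if kind == "cv":
        return "+" if value <= 15.0 else "-"  # CV in percent
    raise ValueError(f"unknown kind: {kind}")
```

For example, `rate_reliability_result("pearson", 0.75)` yields an insufficient rating, whereas an ICC of the same size would be rated sufficient, reflecting the stricter cut point applied to correlation coefficients.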

Construct and Criterion Validity
Construct validity is the degree of agreement between the questionnaire and comparable measures of PA, whereas criterion validity is the degree of agreement between the questionnaire and the gold standard of measuring PA. Although doubly-labeled water (DLW) and the respiratory chamber can be considered as the gold standard for measuring EE, there is no gold standard for the assessment of PA. Consequently, all comparisons to other instruments were considered as evidence for construct validity in our review.
Based on QAPAQ [17] and the series of previous systematic reviews [10][11][12], a priori defined correlations were considered as sufficient (Table 1). The result was sufficient (+) if the correlation was equal to or above the defined cut points, and otherwise insufficient (-). If no correlation coefficient or comparable measure was reported, the rating of the result was indeterminate (?).

Responsiveness
Responsiveness can be considered as an aspect of validity and is the degree to which an instrument detects changes over time in the construct [21,22]. In this case, it is the ability of the questionnaire to detect changes in PA in a longitudinal setting (validity of change score rather than single score). We applied the same approach as for construct validity to rate responsiveness, except that the change in scores of the questionnaire was compared with the change in scores of other instruments such as accelerometers.

Quality of Individual Studies
Evaluation of the methodological quality of the included studies was based on the QAPAQ checklist [17], the series of previous reviews [10][11][12], as well as the recently updated COSMIN checklist [23]. For the assessment of the quality of all individual studies, we assigned one of three different levels of quality (1: very good, 2: adequate, 3: doubtful) for each outcome (PA score) and measurement property. If an individual study had any substantial flaws in the design or analysis, the quality was inadequate (level 4).
To evaluate the methodological quality of studies of reliability and measurement error, we considered ICC, kappa, and concordance as adequate measures of reliability, and LOA, SDC, and CV as adequate measures of measurement error. We considered Pearson and Spearman correlation coefficients as less adequate since they neglect systematic errors between measurements [24]. However, Pearson and Spearman correlations are widely used in validation studies and, thus, were not omitted from our review. To ensure that the measured construct did not change over time, an adequate time interval between test and retest should be defined. For pregnancy, we considered a time interval from 2 days to 2 weeks as adequate to ensure that PA did not change over time (e.g., between the second and third trimesters) [2]. If there have been no substantial flaws in the design or analysis (level 4), we assigned one of the following levels of quality for each PA score reported in an individual study for the assessment of reliability and measurement error:
• Level 1: an adequate time interval between test and retest (2 days-2 weeks) and reporting of ICC, LOA, SDC, SEM, CV, kappa, or concordance.
• Level 2: an inadequate time interval between test and retest (> 2 weeks) and reporting of ICC, LOA, SDC, […]
To evaluate the methodological quality of studies of construct validity and responsiveness, it is important to formulate a priori hypotheses about the expected direction and magnitude of the results, which guarantees unbiased conclusions. Since this criterion was rarely met previously [10][11][12] and a study may still provide unbiased coefficients without these hypotheses, we did not rate the quality of these studies as inadequate but stated how many studies formulated such an a priori hypothesis. We further applied our own criteria in order to compare all results with the same set of hypotheses.
Depending on the type of comparison, we assigned three different levels of quality for the assessment of construct validity and responsiveness (Table 1). Higher levels of quality (level 1 or 2) were assigned if the questionnaire was evaluated against objective measures of PA (e.g., accelerometer), depending on how the objective data were used. More specifically, the more similar the compared constructs were, the higher the level of quality. For example, the comparison of moderate PA from the questionnaire with moderate PA from the accelerometer is currently the optimal approach (level 1), whereas a comparison with total counts (including light, moderate, and vigorous PA [VPA]) is less optimal (level 2). We assigned level 3 of quality when the questionnaire was compared with measures less similar to the construct, such as pedometers, questionnaires, diaries, and interviews, or if different intensity levels were compared against each other (e.g., light PA estimated from the questionnaire compared with MVPA estimated from the accelerometer).
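The level assignment described in this paragraph can be sketched as a small decision rule. The Python function below is a simplification based only on the examples given in the text (the full criteria are in Table 1, which is not reproduced here); the function name and the string labels for comparators and scores are hypothetical.

```python
def construct_validity_level(comparator: str,
                             comparator_score: str,
                             questionnaire_score: str) -> int:
    """Assign a quality level (1-3) to one questionnaire-vs-comparator comparison.

    Simplified reading of the text: only an accelerometer is treated as an
    objective measure eligible for levels 1 and 2 (an assumption of this sketch).
    """
    if comparator != "accelerometer":
        return 3  # e.g. pedometer, questionnaire, diary, interview
    if comparator_score == "total_counts":
        return 2  # total counts mix light, moderate, and vigorous PA
    if comparator_score == questionnaire_score:
        return 1  # same construct on both sides, e.g. moderate vs moderate PA
    return 3      # different intensity levels compared against each other
```

For instance, `construct_validity_level("accelerometer", "mvpa", "light")` returns 3, matching the example of light PA from the questionnaire compared with MVPA from the accelerometer.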

Quality of Evidence
We evaluated the quality of the body of evidence using the state-of-the-art GRADE (Grading of Recommendation, Assessment, Development, and Evaluation) approach [25]. Since this assessment should be outcome-specific, we evaluated the quality of evidence for each questionnaire version (including different language versions) and measurement property (reliability, measurement error, construct validity, responsiveness) for three outcomes (total PA, MVPA, and VPA) separately. In addition, we pooled the evidence from individual studies when there was more than one study of the same questionnaire available. In particular, we applied a modified GRADE approach to grade the body of evidence [26]. For each outcome (PA score), the quality of evidence could be high, moderate, low, or very low depending on the assessment of four factors (risk of bias [methodological quality of the individual study], imprecision, inconsistency, indirectness). At the beginning, the quality of evidence for each outcome was high, but could be downgraded if there were any serious shortcomings in these factors. Currently, there are no guidelines for upgrading due to very good measurement properties.
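The downgrading logic can be illustrated with a short Python sketch. It assumes that the downgrades for the four factors simply add up and that the result cannot fall below "very low"; the function and argument names are hypothetical and not part of the GRADE guidance.

```python
GRADE_LEVELS = ["high", "moderate", "low", "very low"]


def grade_quality(risk_of_bias: int = 0, imprecision: int = 0,
                  inconsistency: int = 0, indirectness: int = 0) -> str:
    """Start at 'high' and downgrade by the summed number of levels per factor."""
    downgrades = [risk_of_bias, imprecision, inconsistency, indirectness]
    if any(d < 0 for d in downgrades):
        raise ValueError("downgrades must be non-negative")
    # Clamp at the lowest category; there is no upgrading in this approach.
    index = min(sum(downgrades), len(GRADE_LEVELS) - 1)
    return GRADE_LEVELS[index]
```

For example, one level for risk of bias plus one level for imprecision gives `grade_quality(risk_of_bias=1, imprecision=1)`, which evaluates to "low".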
Regarding risk of bias, high-quality evidence (no downgrading) was available when most individual studies had very good quality (level 1). When most individual studies were of doubtful quality (level 3) or only one study of adequate (level 2) or very good quality was available, we downgraded the quality of evidence by one level (e.g., from moderate to low). When only one individual study of doubtful quality or multiple studies of inadequate quality (level 4) were available, we downgraded by two levels. Moreover, we downgraded by three levels if there was only one individual study of inadequate quality available. To evaluate imprecision, we determined the optimal information size (OIS) to ensure a sufficient precision in the estimation of adequate effect sizes. Assuming that ICC = 0.7, a sample size of n ≥ 45 would be required to obtain a 95% confidence interval (CI) with a maximum width of 0.30 (i.e., ± 0.15; calculated using Stata 12.1, StataCorp, College Station, TX, USA) [27]. Likewise, assuming r = 0.40, a sample size of n ≥ 123 would be required to obtain a 95% CI with the same width [28]. Serious imprecision was present if the total sample size did not meet these criteria (i.e., 45 for reliability and 123 for construct validity and responsiveness), and we downgraded the quality of evidence by one level. We downgraded the quality of evidence by two levels (very serious imprecision) when the total sample size was n < 12 for reliability or n < 32 for construct validity and responsiveness (95% CI width of ± 0.30). Because publication bias is difficult to assess in studies of measurement properties (e.g., lack of registries), we did not downgrade due to this factor. Finally, we downgraded by one or two levels in the presence of unexplained inconsistency (differences in results [i.e., sufficient, insufficient]) or indirectness (differences in populations, interventions, outcomes, indirect comparisons).
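The sample-size thresholds for a correlation coefficient can be checked with a short calculation. The Python sketch below assumes a 95% CI based on the Fisher z-transformation; the text does not state which method was used, so this is an assumption, and the function names are hypothetical. The ICC thresholds (45 and 12) depend on the ICC model and are not covered here.

```python
import math
from statistics import NormalDist


def ci_width_for_r(r: float, n: int, level: float = 0.95) -> float:
    """Width of the Fisher-z confidence interval for a correlation r at sample size n."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    z = math.atanh(r)                 # Fisher z-transformation of r
    half = z_crit / math.sqrt(n - 3)  # standard error of z is 1 / sqrt(n - 3)
    return math.tanh(z + half) - math.tanh(z - half)


def min_n_for_width(r: float, max_width: float, level: float = 0.95) -> int:
    """Smallest n for which the CI width does not exceed max_width."""
    n = 4  # smallest n with a defined standard error
    while ci_width_for_r(r, n, level) > max_width:
        n += 1
    return n
```

Under this assumption, `min_n_for_width(0.40, 0.30)` gives 123 and `min_n_for_width(0.40, 0.60)` gives 32, which agrees with the two cut points stated above for construct validity and responsiveness.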

Literature Search
The literature search resulted in 1,719 hits. Of these, 27 articles were selected based on titles and abstracts. After reading the full texts, ten articles were excluded because of the absence of measurement properties (n = 5) [29][30][31][32][33] or the use of a diary/record (n = 3) [34][35][36] or an interview (n = 2) [37,38]. Finally, 17 articles [39-55] on 11 different PA questionnaires (17 versions) [39,44,56-63] were included (Fig. 1). Overall, these 17 articles reported 18 studies of measurement properties. It should be noted that the studies describing the development of the short and long form of the International Physical Activity Questionnaire (IPAQ) [59] share the same reference in order to avoid any misconceptions. All results are presented for questionnaires developed for the pregnant and non-pregnant population separately only to improve readability.
Table 2 shows a summary of all included articles and questionnaires in combination with evaluated measurement properties and study populations. Construct validity was assessed for all questionnaires, whereas reliability (parameters of reliability and measurement error) was assessed for six questionnaires (11 versions) and responsiveness for two questionnaires. In most studies, an accelerometer was used as a comparison measure. Eight studies [42-46, 49, 51, 55] assessed the measurement properties of the Pregnancy Physical Activity Questionnaire (PPAQ) [44] or adaptations of this questionnaire (e.g., Japanese version). Another study [48] evaluated the long form of the IPAQ, whereas two studies (of reliability and construct validity), reported in one article [52], evaluated the short form of the IPAQ (IPAQ-SF). One study [39] used a strongly modified version of the IPAQ measuring leisure time (LT) PA (LTPA) in pregnancy. One article [40] reported one study evaluating two questionnaires, namely the Australian Women's Activity Study (AWAS) [60] and the Recent Physical Activity Questionnaire (RPAQ) [57].

Description of Questionnaires
A detailed description of the questionnaires is shown in Table 3. Of the 11 questionnaires, four were developed to assess PA in pregnant women [39,44,62,63], whereas five were developed for adults [56,57,59,61], one for adults and adolescents [58], and one for women with young children [60].
Of the seven questionnaires that were developed for the non-pregnant population, six (Activity Questionnaire for Adolescents and Adults […] [63]. PAPQ and PPAQ aim to measure the construct (total) PA, whereas LTPAQ and Q1 of MoBa aim to measure LTPA or recreational exercise during pregnancy. The LTPAQ was based on the IPAQ but was strongly modified to provide a better discrimination between the structured (LT excluding household) and unstructured (household) features of PA. Parameters of duration, frequency, and intensity of PA are assessed by all questionnaires except Q1 of MoBa. Scores for total PA, time spent in light PA, moderate PA, VPA, and SB can be calculated for the PAPQ, PPAQ, and LTPAQ. For Q1 of MoBa, only a total PA score can be calculated. All four questionnaires use minutes per week or MET min/week to calculate PA scores.
Finally, all questionnaires that assigned MET intensities for activities use compendium-based information about intensities for different activities [64]. These MET intensities are based on the general population, including men and non-pregnant women. In contrast, the PPAQ uses pregnancy-specific MET intensities whenever possible, such as for walking and light-to-moderate intense household activities [44].

Content Validity
A comprehensive evaluation of the content validity of PA questionnaires during pregnancy was not part of this review. No included study assessed content validity using a formal methodological approach, but some provided information on content validity. During the development of the PPAQ, one study [44] used 24-h recalls to select both prevalent and discriminatory activities of pregnant women. The findings of the study showed that watching television, standing or slowly walking at work while carrying light/moderate loads, and childcare were the most relevant activities. Another study [54] discussed the content validity of the GPAQ theoretically in the context of previous research and expert opinions. Their conclusion was that the GPAQ includes important settings (e.g., work, transport, leisure) and scores (frequency, duration, intensity) of PA, but that including pregnancy-specific activities (and settings) such as caregiving might result in better content validity. Furthermore, one study [39] of the LTPAQ strongly modified the IPAQ to provide a better discrimination between the structured (LT excluding household) and unstructured (household) features of PA. They excluded occupational PA and used the degree of breathlessness (none, some, strong) instead of light, moderate, and vigorous to describe the intensity of activities, which may result in a better understanding for some women.
Finally, studies of adaptations of the PPAQ [43,45,49,51,55] included expert opinions and pilot studies to assess content validity and, consequently, items were modified and/or deleted during their cross-cultural validation process. According to our criterion (i) (see Sect. 2.5.1), of those questionnaires that aim to measure total PA, AQuAA, AWAS, GPAQ, IPAQ, IPAQ-SF, PAPQ, and PPAQ cover all relevant settings of PA. The RPAQ does not collect information on household-related activities [57] since the authors showed in a previous study [65] that these activities were inversely correlated with objectively measured PA. Therefore, they only included a few activities such as stair-climbing at home, mowing the lawn, watering the lawn or garden, or home maintenance. The IPAQ-SF aims to cover all settings of PA, but domain-specific scores cannot be obtained. The LTEQ, LTPAQ, and Q1 of MoBa were developed to collect specific information about LT/recreational exercise and LTPA rather than total PA. According to criterion (ii) (see Sect. 2.5.1), all included questionnaires assess frequency and duration of PA except LTEQ and Q1 of MoBa and no questionnaire uses a recall period of less than 1 week. In sum, the AQuAA, AWAS, GPAQ, IPAQ, IPAQ-SF, LTPAQ, PAPQ, and PPAQ provided sufficient content validity for the assessment of PA during pregnancy, whereas LTEQ, Q1 of MoBa, and RPAQ did not.

Reliability
The results for reliability (parameters of reliability and measurement error) of ten studies of six questionnaires (11 versions) are summarized in Table 4. Of the questionnaires developed for the non-pregnant population, the IPAQ-SF [52] showed sufficient reliability for all estimates of PA, the LTEQ [53] for strenuous LT exercise but not for total, mild, and moderate LT exercise, and the RPAQ [40] showed sufficient reliability for moderate PA but insufficient reliability for all other estimates of PA. The AWAS [40] showed insufficient reliability (ICC < 0.70).
Of the questionnaires developed for the pregnant population, parameters of reliability and measurement error were only assessed for (versions of) the PPAQ and LTPAQ. In sum, studies of the English [44], Turkish [45], and Vietnamese versions [51] of the PPAQ showed sufficient reliability. The Chinese version [55] showed sufficient reliability for all PA scores except moderate PA, VPA, and sports/exercise. The French version of the PPAQ [43] showed sufficient reliability for all scores except for transportational PA and, likewise, the Japanese version [49] for all scores except for transportational PA, sports/exercise, and occupational PA (1-week interval only). Although three studies [39,49,51] assessed measurement error, only one study reported LOA or CV for repeated measurements. In particular, the results for the LTPAQ [39] were insufficient because of large LOA (MIC frequency/duration < LOA/SDC) and CV. These values indicate large measurement errors and hamper a reliable detection of MIC of PA (e.g., two sessions or 30 min of MVPA per week) [17].

Construct and Criterion Validity
The results for construct validity are summarized in Table 5. Of the 11 different questionnaires, construct validity was mostly assessed by validation against accelerometers and less often against pedometers, logbooks, or other PA questionnaires.
Of the seven questionnaires developed for the non-pregnant population, the AQuAA [50], AWAS [40], GPAQ [54], IPAQ [48], IPAQ-SF [52], and LTEQ [53] showed insufficient construct validity because of low coefficients or large disagreements (e.g., wide LOA). The RPAQ [40] showed a sufficient correlation with PA estimates from the accelerometer for total active time (r ≥ 0.50) but not for total physical activity energy expenditure (PAEE) and other estimates of PA.
Of the four questionnaires developed for the pregnant population, the LTPAQ [39] showed insufficient construct validity. The ratings for the PAPQ [47] were insufficient for light and moderate PA but sufficient for VPA. However, the LOA indicated large disagreement between PAPQ and accelerometry in assessing VPA. The results of studies of the construct validity of (versions of) the PPAQ were predominantly insufficient, such as for the Vietnamese [51], Japanese [49], English [44,46], Chinese [55], and bilingual [46] versions of the questionnaire. Likewise, the second study [42] of the English version revealed insufficient construct validity for all scores except for LT-MVPA. The Turkish version of the PPAQ [45] showed sufficient validity for the assessment of total PA due to a high correlation with the pedometer but insufficient ratings for all other estimates.
The French version of the PPAQ [43] received sufficient ratings for total, light, and moderate PA, household/caregiving and occupational but insufficient ratings for sports/exercise, vigorous, and transportational PA. Finally, Q1 of MoBa [41] showed insufficient construct validity. There was a low correlation (r < 0.50) between sum of weekly exercise estimated from the questionnaire and VPA estimated from the accelerometer.

Table footnotes: AWAS Australian Women's Activity Study, CV coefficient of variation, d change in the mean, EE energy expenditure, ICC intraclass correlation coefficient, ICC 1wk intraclass correlation coefficient for 1-week interval, ICC 2wks intraclass correlation coefficient for 2-week interval, IPAQ-SF International Physical Activity Questionnaire (short-form), κ kappa coefficient, LOA limits of agreement, LT leisure time, LTEQ Leisure-Time Exercise Questionnaire, LTPAQ Leisure Time Physical Activity Questionnaire (modified from IPAQ), MVPA moderate-to-vigorous physical activity, n occup sample size for occupational physical activity, PA physical activity, PPAQ Pregnancy Physical Activity Questionnaire, r Pearson correlation coefficient, RPAQ Recent Physical Activity Questionnaire
a As described in Sect. 2.6, the quality of the individual study was evaluated per questionnaire and PA score using four levels (1: very good, 2: adequate, 3: doubtful, 4: inadequate). Additionally, the reported results were rated (i.e., sufficient [+], insufficient [-]) as described in Sect. 2.5.2
b LOA = d ± 1.96 × s × √2, where s = within-subject standard deviation (typical error) [88]

Responsiveness
Only two studies examined responsiveness for two questionnaires (see Table 5). The AQuAA [50] showed insufficient responsiveness. Similarly, the GPAQ [54] showed insufficient responsiveness because of large disagreements (large LOA) between the questionnaire and accelerometer. Moreover, the GPAQ showed both systematic (difference in intercepts) and proportional differences (difference in slopes) regarding the change in MVPA between 14-18 and 29-33 weeks of gestation, as indicated by Passing-Bablok regression [54].
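To make the distinction between systematic and proportional differences concrete, the following is a minimal sketch of Passing-Bablok point estimates in Python. It is not the analysis used in [54]: it omits the confidence intervals that are normally needed to judge whether the intercept differs from 0 (systematic difference) or the slope differs from 1 (proportional difference), and the data are hypothetical.

```python
import math
from itertools import combinations
from statistics import median


def passing_bablok(x, y):
    """Return (slope, intercept) point estimates of a Passing-Bablok fit of y on x."""
    slopes = []
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x2 - x1, y2 - y1
        if dx == 0 and dy == 0:
            continue  # identical points carry no slope information
        if dx == 0:
            slopes.append(math.copysign(math.inf, dy))  # vertical pair
            continue
        s = dy / dx
        if s == -1:
            continue  # slopes of exactly -1 are discarded by the method
        slopes.append(s)
    slopes.sort()
    n = len(slopes)
    k = sum(1 for s in slopes if s < -1)  # offset that shifts the median
    if n % 2 == 1:
        slope = slopes[(n - 1) // 2 + k]
    else:
        slope = 0.5 * (slopes[n // 2 - 1 + k] + slopes[n // 2 + k])
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept


# Hypothetical change in MVPA (min/day): accelerometer (x) vs questionnaire (y)
accel_change = [10.0, 20.0, 30.0, 40.0, 50.0]
quest_change = [2.0 * v + 5.0 for v in accel_change]
slope, intercept = passing_bablok(accel_change, quest_change)
```

In this constructed example the questionnaire change is twice the accelerometer change plus a constant, so the fit shows both a proportional difference (slope away from 1) and a systematic difference (intercept away from 0).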

Quality of Individual Studies
Regarding the assessment of reliability of each PA score, nine studies [39, 40, 43-45, 49, 51, 52, 55] of AWAS, IPAQ-SF, LTPAQ, PPAQ, and RPAQ were at the highest level of quality (level 1) and one study [53] of the LTEQ at level 3 because of use of Pearson correlations and an inadequate time interval between test and retest. Regarding construct validity, six studies [40,41,47,50,52,54] of AQuAA, AWAS, GPAQ, IPAQ-SF, PAPQ, and Q1 of MoBa were at the highest level of quality (level 1), four studies [40,43,44,55] of PPAQ and RPAQ at levels 1 and 2, one study [42] of PPAQ at levels 1 and 3, and six studies [39,45,46,49,51,53] of LTEQ, LTPAQ, and PPAQ at level 3 (see Table 5). The quality of one study of the IPAQ was level 1, level 2, or level 3 depending on the evaluated PA score [48]. Different levels of quality were assigned due to comparisons with either objective (e.g., accelerometer, pedometer) or subjective (e.g., logbook, questionnaire) measures of PA or comparisons between different intensity levels. For example, a lower level of quality was assigned if light PA measured by the questionnaire was compared with MVPA measured by the accelerometer (e.g., Japanese version of the PPAQ) [49] or if PA measured by the questionnaire was compared with pedometer-measured daily steps (e.g., LTEQ) [53]. Furthermore, the quality for the assessment of total PA was often of level 2 because total PAEE estimated from the questionnaires was compared against accelerometer-estimated total counts. Responsiveness was evaluated in two studies [50,54] for two questionnaires (AQuAA, GPAQ). The quality of these studies was rated as level 1.
Finally, almost none of the studies formulated a priori hypotheses about expected results for construct validity or responsiveness. Only two studies [50,52] of the AQuAA and IPAQ-SF considered a minimum correlation of r = 0.5 as an adequate agreement between PA questionnaire and accelerometer.

Quality of Evidence
Table 6 summarizes the overall results (i.e., sufficient/insufficient measurement properties) and quality of evidence (GRADE) for three PA scores: total PA, MVPA, and VPA (per questionnaire and measurement property). None of the questionnaires provided evidence for all the relevant measurement properties (i.e., reliability [parameters of reliability or measurement error], construct validity, responsiveness). Only for the AWAS, IPAQ-SF, LTEQ, LTPAQ, PPAQ (i.e., Chinese, English, French, Japanese, Turkish, Vietnamese versions), and RPAQ were both reliability and construct validity assessed. Because there was usually only one study per questionnaire version and PA score available (except PPAQ), inconsistency could not be evaluated for these studies. With reference to the eligibility criteria and the checklist for methodological quality, we identified no serious indirectness and therefore did not downgrade the quality of evidence for any of the PA scores due to this factor.
Overall and irrespective of the reported results (i.e., sufficient/insufficient measurement properties), the quality of the body of evidence was limited and ranged from very low to moderate. There was no high-quality evidence indicating that any of the included questionnaires had sufficient measurement properties in assessing total PA, MVPA, or VPA. Only the Turkish and French versions of the PPAQ showed both sufficient reliability and construct validity when assessing total PA (but not MVPA and VPA), but these results were based on low-to-moderate quality evidence.
Although different language versions of a questionnaire should initially be treated separately [26], one may consider pooling the results (i.e., body of evidence) of the different versions of the PPAQ. When doing so, there was high-quality evidence (no serious risk of bias, no serious imprecision, no serious inconsistency, no serious indirectness) that the PPAQ had sufficient reliability in assessing total PA and VPA. We did not consider downgrading the quality of evidence for VPA because most of the results were sufficient (four of five studies). The exception was the Chinese version, whose insufficient result may have occurred because most women did not engage in these activities, as suggested by the authors [55].
The results for construct validity of the PPAQ were inconsistent for total PA (i.e., two studies showed sufficient and five studies insufficient results) and consistently insufficient for VPA (see Table 6). When pooling these results, the PPAQ showed insufficient validity in assessing total PA, which was based on low-quality evidence (serious risk of bias, serious inconsistency, no serious imprecision, no serious indirectness). Similarly, there was moderate-quality evidence that the PPAQ showed insufficient construct validity in assessing VPA.
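The two pooled PPAQ ratings reported above (high quality with no serious concerns, low quality with two serious concerns) are consistent with a simple downgrading rule, sketched below in Python. The sketch assumes that the rating starts at "high" and drops exactly one level per factor judged "serious"; it does not model "very serious" concerns or any other nuance of the GRADE approach, and the function name is hypothetical.

```python
GRADE_LEVELS = ["high", "moderate", "low", "very low"]


def grade_quality(risk_of_bias, inconsistency, imprecision, indirectness):
    """Return a GRADE-style quality label.

    Each argument is True when that factor was judged 'serious'.
    Assumption: one level is subtracted per serious factor, starting from 'high'.
    """
    downgrades = sum([risk_of_bias, inconsistency, imprecision, indirectness])
    return GRADE_LEVELS[min(downgrades, len(GRADE_LEVELS) - 1)]


# Pooled PPAQ reliability for total PA: no serious concerns
reliability_total_pa = grade_quality(False, False, False, False)

# Pooled PPAQ construct validity for total PA: serious risk of bias and inconsistency
validity_total_pa = grade_quality(True, True, False, False)
```

With these inputs the sketch reproduces the ratings stated in the text for pooled PPAQ reliability and pooled PPAQ construct validity for total PA.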

Discussion
In contrast to the considerable evidence concerning measurement properties of PA questionnaires in adults [11], youth [10], and elderly people [12], little information is available about the quality of PA questionnaires in pregnancy. This article provides an overview of the measurement properties of all self-administered questionnaires assessing PA in pregnancy. In contrast to other reviews [66], the quality of individual studies as well as the overall quality of evidence was evaluated.
The findings show that the quality of evidence of measurement properties for self-administered PA questionnaires assessing PA in pregnancy is currently low to moderate. Most PA questionnaires showed insufficient measurement properties. Only two studies assessed responsiveness for two questionnaires (AQuAA, GPAQ) and, thus, no questionnaire demonstrated sufficiency for all relevant measurement properties (i.e., content validity, reliability, construct validity, responsiveness). Of those questionnaires for which evidence for both reliability and construct validity was available, only a few showed consistent results. Based on low-to-moderate quality evidence, only the Turkish and French versions of the PPAQ showed sufficient reliability and construct validity in assessing total PA. When considering all versions together, the PPAQ showed sufficient reliability in assessing total PA and VPA, based on high-quality evidence. However, based on low-to-moderate quality evidence, the questionnaire showed insufficient construct validity in assessing these PA scores. Furthermore, the pooled results of the PPAQ were consistently sufficient for reliability, but inconsistent for construct validity (i.e., sufficient in some studies and insufficient in others). Although there was limited high-quality evidence, we currently recommend the PPAQ, irrespective of language, to assess PA during pregnancy. The PPAQ showed sufficient content validity and was the only included questionnaire with versions showing both sufficient reliability and validity.
Construct validity was assessed for all (versions of) questionnaires and most of them were compared with objective measures of PA such as accelerometers or pedometers. However, the methodological quality of these individual studies varied substantially. No study used DLW. Although this technique can safely be applied in pregnancy [67], it does not represent maternal EE because DLW crosses the placenta. For many PA scores, comparisons were made with a different level of intensity in accelerometer data, which led to a lower quality of the individual study. For example, time spent in light activities does not necessarily correlate with time spent in moderate or vigorous activities. Furthermore, sometimes (total) PA was compared with pedometer-estimated daily steps. Because pedometers are not able to capture duration, frequency, and intensity of PA [68], the quality of these individual studies was considered doubtful. Only a few studies reported statistics such as LOA to assess absolute validity, rather than relative validity evaluated with Spearman or Pearson correlations. Reliability was assessed for six questionnaires (11 versions) and the methodological quality of these individual studies was usually high. Most studies used ICC or LOA and adequate time intervals between test and retest. Finally, only two studies of very good quality assessed responsiveness, the ability of a questionnaire to detect changes in PA over time. Especially in pregnancy, a period in which PA usually changes profoundly [2], a questionnaire with sufficient responsiveness is needed to capture these changes.
During pregnancy, a precise focus on content validity such as the choice of recall periods, activities or relevant settings of PA is needed. First, the intensity, type, and duration of PA can change with the ongoing pregnancy [2]. For example, light activities become more frequent, especially during the second and third trimesters. Activities can become more intense throughout pregnancy because of increased fatigue [2] and energy requirements [69]. For example, carrying loads can be experienced as more exhausting in late compared to early pregnancy, and walking up the stairs will objectively require more energy with increasing body weight. Furthermore, work-related PA might be more important in early pregnancy compared to the second and/or third trimester due to maternity leave. Similarly, household and caregiving activities become more important, especially when assessing PA in combination with parity. These pregnancy-related changes should be considered when assessing PA during pregnancy. Questionnaires with sufficient content validity (AQuAA, AWAS, GPAQ, IPAQ, IPAQ-SF, LTPAQ, PAPQ, PPAQ), based on our elementary criteria, may need to be further appraised with respect to these considerations.
In pregnancy, the EE needed for some activities increases, especially in the second and third trimesters [69,70], and the intensity of activities may be different [2,71]. Many PA questionnaires use compendium-based information about MET intensities of different activities [64], which are based on the adult non-pregnant population. Pregnancy-specific MET intensities are scarce and may only be available for light and moderate household PA [72]. Such intensities are applied in, for example, the PPAQ. The lack of pregnancy-specific MET intensities together with the application of intensities from the non-pregnant population can be a source of bias when assessing total PA or PAEE. This could be the reason that for the RPAQ, a low correlation was shown for total PAEE, but a high correlation for total active time. However, more studies would be needed to test this hypothesis.
The present findings also revealed heterogeneity in the study design and analysis. This could result in a serious bias (e.g., risk of bias, inconsistency) and hampers the comparability of findings across (included) studies and countries. For example, accelerometers have been widely used to assess construct validity in this review. Although these devices can provide accurate information about duration, frequency, and intensity of PA under free-living conditions [73], there are currently no standards for accelerometer data collection and processing [74-76], including during pregnancy. Consequently, we observed large heterogeneity in data collection and processing criteria (Table 5). In contrast to the placement of the accelerometer (most women wore the device on their waist or hip), the included studies differed considerably in epoch length (i.e., 5 s to 10 min), registration period (3-14 days), and the definition of a valid week (e.g., 3 of 4 days, 4 of 8 days, 10 of 14 days). Furthermore, not all studies reported processing criteria, including the definition of filters and sampling frequency, which were reported least often. Since different decision rules for accelerometer data could impact PA outcomes [76], the reporting of these would increase transparency, comparability between studies and countries, and allow assessment of potential risks of bias.
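How the definition of a valid week changes which recordings enter an analysis can be sketched as follows. This is an illustration only: the 10-hour minimum wear time per valid day is a hypothetical threshold that is not taken from the included studies, and the function name and wear-time data are invented.

```python
def is_valid_week(wear_hours_per_day, min_valid_days, min_wear_hours=10.0):
    """Return True if enough recorded days meet the daily wear-time threshold.

    min_wear_hours is a hypothetical decision rule, not a value from the review.
    """
    valid_days = sum(1 for hours in wear_hours_per_day if hours >= min_wear_hours)
    return valid_days >= min_valid_days


# Hypothetical 4-day recording: daily accelerometer wear time in hours
recording = [12.0, 11.0, 9.0, 13.0]

included_lenient = is_valid_week(recording, min_valid_days=3)  # "3 of 4 days" rule
included_strict = is_valid_week(recording, min_valid_days=4)   # stricter rule
```

The same participant is retained under one rule and excluded under the other, which is why unreported decision rules make results hard to compare across studies.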
Most importantly, we observed large heterogeneity in applied cut points [77-81] used to classify the intensity of PA into light, moderate, and vigorous. These cut points were usually developed for non-pregnant populations. For example, cut points for moderate PA in this review varied substantially between 191 [79] and 1952 [78] counts per minute, which will affect estimates of both PA and construct validity [82]. The influence of using different cut points on construct validity was demonstrated by two studies included in this review [49,50]. Because there are currently no validated cut points available for pregnant women, it is unclear which cut points provide the best comparison for assessing construct validity. Not only are pregnancy-specific cut points lacking, but little is known in general about the reliability and validity of accelerometers in pregnancy [83]. Changes in body girth, gait, and monitor tilt can affect the accuracy and the ability to detect certain movements [84].
All things considered, objective devices such as accelerometers and pedometers are likely to provide sufficient reliability, whilst construct validity may be limited due to technical shortcomings, non-wearing time, participant interference with the results, and application of (different) cut points [85]. Lower construct validity of comparison measures clearly limits the quality of evidence for the validity of PA questionnaires. This is one of the greatest challenges for reviews on measurement properties of PA questionnaires, such as for the present review. Because of these shortcomings, future (validation) studies should report their decision rules in detail and attempt to develop guidelines for the optimal use of accelerometer data in the target population (e.g., pregnancy). To this end, two recent reviews emphasized the importance of such standards, as well as critically scrutinizing the validity of accelerometers and attempting to provide age-specific practical considerations for choosing the most appropriate method [85,86].
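The effect of cut-point choice described above can be illustrated with a small Python sketch that applies the two extreme lower thresholds for moderate PA mentioned in the text (191 and 1952 counts per minute) to the same data. The epoch values are hypothetical, 1-min epochs are assumed, and because the upper (vigorous) thresholds are not given in this section, the sketch counts minutes at moderate intensity or higher.

```python
def minutes_at_or_above(counts_per_minute, cut_point):
    """Count 1-min epochs whose activity count reaches the given cut point."""
    return sum(1 for count in counts_per_minute if count >= cut_point)


# Hypothetical accelerometer output: one value per 1-min epoch
epochs = [50, 150, 250, 800, 1500, 2100, 3000, 120, 400, 1990]

minutes_low_cut = minutes_at_or_above(epochs, 191)    # lower threshold reported in [79]
minutes_high_cut = minutes_at_or_above(epochs, 1952)  # lower threshold reported in [78]
```

Here the identical recording yields 7 min with the lower cut point and 3 min with the higher one, so the accelerometer estimate against which a questionnaire is validated depends strongly on a choice that has not been validated for pregnant women.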

Recommendations for Choosing a Questionnaire
The choice of the right questionnaire depends on the study purpose. Depending on that purpose, different settings (e.g., work, recreation), dimensions of PA (e.g., PAEE, total PA), or recall periods (e.g., last week, typical week) might become more important. In addition to previous recommendations for the selection of PA questionnaires [17], we recommend the following criteria for use in pregnancy:
(i) When assessing total PA, the questionnaire should cover all relevant settings of PA (work, home, transport, recreation, sports), but should especially focus on household/caregiving.
(ii) The questionnaire should measure at least duration and frequency of PA and should include a large range of light and moderate activities. Lower intensity activities become more prevalent during pregnancy, especially in the second and third trimesters. This will ensure sufficient content validity as well as discrimination of pregnant women regarding the level (e.g., time) engaged in these activities. For example, during the development of the PPAQ, light activities such as slowly walking at work while carrying light/moderate loads and childcare were among the most discriminatory activities [44]. In general, identifying relevant activities for the target population should precede the selection of questions used.
(iii) The recall period of the questionnaire should be the last week (or last seven days), a typical week in a specific trimester, or the current trimester but should not extend over more than one trimester as PA during pregnancy varies [2].
(iv) Because pregnancy-specific MET intensities for different activities are lacking and energy cost changes during pregnancy, we further recommend using total time when assessing total PA instead of assigning activities different MET intensities from the non-pregnant population.
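The rationale for criterion (iv) can be shown with a short Python sketch comparing a time-based score with a MET-weighted score. All activity names, weekly durations, and MET values below are hypothetical; in particular, the "pregnancy-adjusted" MET table is invented purely to show the sensitivity of the weighted score and does not represent published values.

```python
def total_time_hours(activity_hours):
    """Total reported activity time in hours per week."""
    return sum(activity_hours.values())


def met_hours(activity_hours, met_table):
    """MET-weighted score (MET-h/week) using the supplied intensity table."""
    return sum(hours * met_table[name] for name, hours in activity_hours.items())


# Hypothetical questionnaire responses (hours per week)
reported = {"walking": 3.0, "childcare": 5.0, "swimming": 2.0}

# Hypothetical intensity tables (MET values are illustrative only)
met_non_pregnant = {"walking": 3.0, "childcare": 2.5, "swimming": 6.0}
met_pregnancy_adjusted = {"walking": 3.5, "childcare": 3.0, "swimming": 6.0}

time_score = total_time_hours(reported)
score_non_pregnant = met_hours(reported, met_non_pregnant)
score_adjusted = met_hours(reported, met_pregnancy_adjusted)
```

The time-based score is the same whichever MET table is used, whereas the MET-weighted score changes with the assumed intensities, which is the source of bias the recommendation tries to avoid.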
In general, we recommend using a questionnaire that has been evaluated in the target population and provides (consistent) results with sufficient content validity, reliability, construct validity, and responsiveness, based on high-quality evidence. If a questionnaire does not provide sufficient content validity, evaluation of further measurement properties is irrelevant. In our opinion, (versions of) the PPAQ may currently be the best choice to assess self-reported PA during pregnancy. However, some language versions of the PPAQ showed insufficient measurement properties, and, in fact, sufficient measurement properties for one language version do not guarantee the same quality for other language versions and target populations. We cautiously advise against using AWAS, LTEQ, LTPAQ, RPAQ, and Q1 of MoBa (at least for some PA scores) because of insufficient content validity and/or both insufficient reliability and validity. However, our findings concerning the measurement properties of all included questionnaires were based on very low-to-moderate quality evidence.

Limitations and Strengths of this Review
Whenever a study presented multiple PA scores for construct validity and responsiveness, we tried to integrate all of them into our tables. However, if an individual study used both different cut points and average counts, we integrated coefficients with higher quality (Table 1), usually average counts. Furthermore, we did not apply any restrictions concerning certain pregnancy characteristics such as parity or pregnancy body mass index (BMI). For example, study populations in this review consisted of both normal-weight and overweight/obese pregnant women. Whether this heterogeneity influenced the results is unclear and difficult to assess because of the low number of studies. However, in our review, this may have been a problem only for inter- and not intra-questionnaire comparisons.
Another problem was the observed heterogeneity in data collection and processing criteria of objective measures such as accelerometers and pedometers. Unfortunately, these criteria likely impact both PA and validation outcomes. We were unable to define particular criteria and comparison measures as a preferable 'gold standard'. Although we tried to incorporate the use of accelerometer data and the similarity between constructs into our quality assessment, we did not evaluate the application of different decision rules such as registration period, epoch length, filter, valid wear time, and cut points. In theory, VPA estimated from the questionnaire should be compared with VPA measured by accelerometry but the use of different cut points influences this association. These limitations are of major concern for this systematic review. Since the results of the validity of a questionnaire strongly depend on the validity of the comparison measure, we recommend that all readers bear in mind the importance of standards when using objective measures of PA during pregnancy and interpret the presented results carefully.
Lastly, we tried to use state-of-the-art methodology for our quality and result rating. The assessment was based on our experience, a series of previously published systematic reviews [10-12], a standardized quality checklist for PA questionnaires [17] as well as the COSMIN [23,26] and GRADE [25] guidelines. Researchers in the field are invited to discuss these findings in the light of their own expertise, possibly assigning different criteria (e.g., MIC of PA during pregnancy), levels of quality, and result ratings.

Recommendations for Further Research
We recommend further studies assessing the quality of those questionnaires that provide sufficient content validity but limited high-quality evidence of sufficient measurement properties. Furthermore, future studies should include responsiveness in their assessment. In this review, most questionnaires were in the English language, but a questionnaire should always be evaluated in the target population and language. We observed large heterogeneity in data collection and processing criteria. We strongly recommend that future studies be designed to develop standards for accelerometer use and analysis, in particular during pregnancy. Although little is known about the validity of accelerometers in our target population, we currently recommend the use of omniaxial devices that capture all directions of movement and the use of total (or averaged) counts, which are independent of any cut points. Finally, since lower validity of (objective) comparison measures hinders the accurate estimation of the validity of a PA questionnaire, we strongly recommend research on the validity of accelerometers during pregnancy before evaluating measurement properties of PA questionnaires.

Conclusions
Evidence concerning the measurement properties of self-administered PA questionnaires in pregnancy is currently limited and mostly of lower quality (i.e., very low to moderate). No questionnaire showed sufficient content validity, construct validity, reliability, and responsiveness. Some versions of the PPAQ showed sufficient measurement properties, based on low-to-moderate quality evidence. Overall (i.e., when pooling the results of all versions), the PPAQ showed sufficient reliability in assessing total PA and VPA, based on high-quality evidence. However, based on low-to-moderate quality evidence, the questionnaire revealed insufficient construct validity in assessing these PA scores. Only after the development of guidelines for the most appropriate use of accelerometer data during pregnancy will we be able to provide recommendations for PA questionnaires based on high-quality evidence.