Background

Physical activity is defined as "any bodily movement produced by the skeletal muscle that results in energy expenditure (EE)" [1]. Inactivity is known to be associated with an increased risk for many chronic diseases including: coronary artery disease, stroke, hypertension, colon cancer, breast cancer, Type 2 diabetes, and osteoporosis [2], as well as premature death. The economic burden of physical inactivity in Canada has been estimated to be $2.1 billion [2]. Physical activity levels are often monitored to assess the health behaviours of the population and their association with health status including mortality and morbidity rates. Accurate assessment of physical activity is required to identify current levels and changes within the population, and to assess the effectiveness of interventions designed to increase activity levels.

Data collection at the population level often involves self-report (subjective) measures of physical activity through the use of questionnaires, diaries/logs, surveys, and interviews. These measures are frequently used due to their practicality, low cost, low participant burden, and general acceptance [3]. Although self-reports are useful for gaining insight into the physical activity levels of populations, they have the capacity to over- or underestimate true physical activity energy expenditure and rates of inactivity. Self-report methods are often fraught with issues of recall and response bias (e.g. social desirability, inaccurate memory) and are unable to capture the absolute level of physical activity.

As self-report methods possess several limitations in terms of their reliability and validity [4], objective or direct measures of physical activity are commonly used to increase precision and accuracy and to validate the self-report measures. Direct measures are believed to offer more precise estimates of energy expenditure and remove many of the issues of recall and response bias. Direct measures consist of calorimetry (i.e., doubly labeled water, indirect, direct), physiologic markers (i.e., cardiorespiratory fitness, biomarkers), motion sensors and monitors (i.e., accelerometers, pedometers, heart rate monitors), and direct observation. Despite the advantages of using direct methods, these types of measures are often time- and cost-intensive and intrusive, rendering them difficult to apply in large epidemiologic settings. These measures also require specialized training and the physical proximity of the participant for data collection. In addition, direct measures each possess their own limitations and no single "gold standard" exists for measuring physical activity or assessing validity [3].

The appropriate method for measuring physical activity at various levels depends on factors such as the number of individuals to be monitored, the time period of measurements and available finances [5]. Many previous studies have examined the reliability and validity of various self-report and direct methods for assessing physical activity. Results from these studies have been conflicting. To our knowledge no attempt has been made to synthesize the literature to determine the validity of physical activity measures in adult populations.

The primary objective of this study was to perform a systematic review to compare self-report versus direct measures for assessing physical activity in observational and experimental studies of adult populations. The results from this systematic review provide a comprehensive summary of past research and a comparison between physical activity levels based on direct versus self-report measures in adult populations.

Methods

Study criteria

The review sought to identify all studies (observational or experimental) that presented a comparison of self-report and direct measurement results to reveal differences in physical activity levels based on measurement in adult populations (18 years and over). Studies that examined only a self-report measure or only a direct measure, but not both, were not included in the review. All study designs were eligible (e.g. retrospective, prospective, case control, randomized controlled trial, etc.) and both published (peer-reviewed) and unpublished literature were examined.

Only studies involving adult populations with a mean age of 18 years and older were considered. Abstracts and titles were examined for their mention of adult populations (using adult$.tw.), but the search relied mostly on the subject headings for adult age groups (exp adult/). This systematic review was conducted simultaneously with a systematic review of the same focus in child populations (mean age < 19 years). A separate pediatric review was carried out as a result of differences in measurement methodologies and hypothesized cognitive and recall abilities between adults and children [6].

The eligible self-report measures of physical activity included: diaries or logs; questionnaires; surveys; and recall interviews. Proxy-reports were excluded because they present issues of reliability due to the potential heterogeneity of reporters (e.g., spouse, trainer, coach, parent, caregiver). The eligible direct measures of physical activity included: doubly-labeled water (DLW), indirect or direct calorimetry, accelerometry, pedometry, heart rate monitoring (HRM), global positioning systems, and direct observation. Although no language restrictions were imposed in the search, only English language articles were included in the review. Abstracts were included if they provided sufficient details to meet inclusion criteria.

Search strategy

The following electronic bibliographic databases were searched using a comprehensive search strategy to identify relevant studies reporting the use of both self-report and direct measures for assessing individual physical activity levels: Ovid MEDLINE(R) (1950 to April Week 4 2007); Ovid EMBASE (1980 to 2007 Week 18); Ovid CINAHL (1982 to April Week 4 2007); Ovid PsycINFO (1806 to April Week 1 2007); SPORTDiscus (1830 to April 2007); Physical Education Index (1970 to April 2007); Dissertations and Theses (1861 to April 2007); and Ovid MEDLINE(R) Daily Update (May 4, 2007). The search strategy is illustrated using the MEDLINE search as an example (Table 1) and was modified according to the indexing systems of the other databases. The Ovid interface was used to search MEDLINE, EMBASE, CINAHL, and PsycINFO; EBSCOhost was used to search SPORTDiscus; Scholar's Portal was used to search Physical Education Index; and ProQuest was used to search Dissertations and Theses. Grey literature (non-peer reviewed works) included published abstracts and conference proceedings, published lists of theses and dissertations, and government reports. Knowledgeable researchers in the field were solicited for key studies of interest. The bibliographies of key studies selected for the review were examined to identify further studies.

Table 1 Medline search strategy

Two independent reviewers screened the titles and abstracts of all studies to identify potentially-relevant articles. Duplicates were manually removed. The full texts of all studies that met the inclusion criteria were then obtained and reviewed. When disagreements between reviewers occurred, consensus was achieved through discussion and/or with a third reviewer.

Standardized data abstraction forms were completed by one reviewer and verified by two others. Information was extracted on the type of study design, participant characteristics, sample size, and methods of physical activity measurement (self-report and direct measures employed, units of measurement, duration of direct measure, length of recall, and length of time between the self-report and directly measured estimates). Reviewers were not blinded to the authors or journals when extracting data.

Risk of bias assessment

The Downs and Black [7] checklist was used to assess the risk of bias. The Downs and Black instrument was recommended for assessing risk of bias in observational studies in a recent systematic review [8] and other assessments [9] and was employed in this review to assess study quality including reporting, external validity, and internal validity (bias). The Downs and Black checklist consists of 27 items with a maximum count of 32 points. A modified version of the checklist was employed with items that were not relevant to the objectives of this review removed. The adapted checklist consisted of 15 items, including items 1–4, 6, 7, 9–13, 16–18, and 20 from the original list, with a maximum possible count of 15 points (higher scores indicate superior quality). The risk of bias assessment was carried out by two independent assessors and when disagreements between assessors occurred, consensus was achieved through discussion.
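The scoring of the adapted checklist can be illustrated with a minimal sketch. It assumes that each retained item is scored 0 (not met) or 1 (met), which is consistent with 15 items giving a maximum count of 15. The grouping of items into reporting, external validity and internal validity sections is inferred from the section maxima reported in the Results (8, 3 and 4), and the cut-off of 12 is the median split reported there; the function names and the example study are hypothetical.

```python
# Items retained from the original 27-item checklist, as listed in the text.
RETAINED_ITEMS = (1, 2, 3, 4, 6, 7, 9, 10, 11, 12, 13, 16, 17, 18, 20)

# Assumed grouping of the retained items into checklist sections
# (inferred from the section maxima of 8, 3 and 4 given in the Results).
SECTIONS = {
    "reporting": (1, 2, 3, 4, 6, 7, 9, 10),
    "external_validity": (11, 12, 13),
    "internal_validity": (16, 17, 18, 20),
}


def score_study(item_scores):
    """Return per-section counts and the total count for one study.

    item_scores maps each retained item number to 0 (not met) or 1 (met).
    """
    for item in RETAINED_ITEMS:
        if item_scores.get(item) not in (0, 1):
            raise ValueError(f"item {item} must be scored 0 or 1")
    counts = {
        name: sum(item_scores[i] for i in items)
        for name, items in SECTIONS.items()
    }
    counts["total"] = sum(counts.values())
    return counts


def is_lower_quality(total, cutoff=12):
    """Flag a study as lower quality when its total falls below the cut-off."""
    return total < cutoff


# Hypothetical study that meets every item except two external validity items.
example = {item: 1 for item in RETAINED_ITEMS}
example[11] = 0
example[12] = 0
result = score_study(example)
print(result, is_lower_quality(result["total"]))
```

Keeping the section counts separate from the total mirrors the way the results are reported by section as well as overall.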

Data synthesis

Percent mean difference was used as the main outcome of this analysis; it was calculated using the formula: [(self-report mean – direct mean)/direct mean] × 100. Only studies with units of measurement that were the same for both the self-report and direct measures were used to calculate percent mean differences. Units were converted where possible. These studies were included in the direct comparison analyses. Forest plots (graphical displays of the percent mean differences across the individual studies) were constructed to present overall trends in agreement of physical activity by direct measure and gender. As most studies did not employ the same units of measurement (e.g. kcal/week, MET/day, MET-min/day) and did not report a measure of variance (e.g. standard deviations or standard errors), pooled estimates and confidence intervals were not calculated.
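As an illustration of the outcome measure, the following minimal sketch computes a percent mean difference from a pair of reported means. The function name and the example values are hypothetical and are not taken from any reviewed study; both means are assumed to be expressed in the same units.

```python
def percent_mean_difference(self_report_mean, direct_mean):
    """Return (self-report mean - direct mean) / direct mean, as a percentage."""
    if direct_mean == 0:
        raise ValueError("direct mean must be non-zero")
    return (self_report_mean - direct_mean) / direct_mean * 100.0


# Hypothetical means in kcal/day (illustrative values only).
examples = {
    "self-report higher": (600.0, 400.0),
    "self-report lower": (300.0, 400.0),
}
for label, (self_report, direct) in examples.items():
    print(label, percent_mean_difference(self_report, direct))
```

A positive result means the self-report estimate exceeds the direct estimate; a negative result means it falls below it.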

Results

Description of studies

The preliminary search of electronic bibliographic databases, reference lists and grey literature identified 4,463 citations (see Figure 1). Of these, 1,638 were identified in MEDLINE, 1,306 in EMBASE, 732 in CINAHL, 218 in PsycINFO, 133 in SPORTDiscus, 34 in Physical Education Index, 3 in MEDLINE Daily Update, and 399 from Dissertations and Theses. After a preliminary title and abstract review, 296 full text articles were retrieved for a detailed assessment. Of these, 173 met the criteria for study inclusion. One hundred and forty-eight of these studies reported correlation statistics [10–157]. Seventy-four studies contained comparable data, meaning the self-report and direct measurements were reported using the same units [11, 15, 17, 19, 20, 23, 32, 33, 44, 48, 53, 56–59, 65, 73–77, 80, 88, 90, 92, 94, 100, 102, 105, 111, 114, 116, 119–121, 128, 131, 134, 135, 138–140, 143, 148, 151, 153, 154, 158–183]. These studies were included in the direct comparison analyses and their characteristics are described in Table 2. Common reasons for excluding studies included: populations with mean ages less than 18 years, the absence of directly measured and self-report data on the same population, non-English language, duplicate reporting of data, and the absence of comparable units between measures or the absence of a direct comparison.

Table 2 Study and participant characteristics for studies with directly comparable data
Figure 1

Results of the literature search.

Data abstraction identified three articles and two dissertations that analyzed and reported duplicate data in multiple papers [184–188]. Authors of suspected duplications were contacted and in cases where several publications reported the same analyses from the same data source, only one study per data source/analysis was retained in order to avoid double counting. Studies were retained based on the most pertinent and most recent data, as well as the largest sample size. Studies included were published over a 24-year period from 1983 to 2007. All studies were written in English. Nineteen of the studies used randomized controlled trial designs [22, 24, 26, 28, 30, 50, 53, 61, 84, 91, 124, 148, 149, 163, 165, 171, 181] and all others used observational designs (e.g. case control, cross-sectional, longitudinal). All included studies were published as journal articles except for 19 dissertations [16, 24, 30, 34, 38, 45, 49, 61, 64, 69, 71, 73, 74, 78, 99, 107, 117, 163, 171].

Participants in the studies ranged from 10 to 101 years of age. Although the focus of the review was on those aged 18 and over, studies that had a range of ages less than 18 years were not excluded as long as the mean age of the sample was over 18 years. Sample sizes ranged from a low of six [21] to a high of 2,721 in Craig et al.'s work that assessed the validity of the International Physical Activity Questionnaire (IPAQ) [35]. There were a greater number of studies reporting on female-only data than studies reporting on male-only data.

A total of five direct measures were used in the assessment of physical activity: accelerometers, DLW, indirect calorimetry, HRM, and pedometers. Of the studies included in the synthesis of directly comparable data (Table 2), accelerometers were the most frequently used direct measure and indirect calorimetry was the least used. A variety of self-report measures were employed, but the seven-day physical activity recall (7-day PAR) [189] was the most cited. Over half of the studies reported that the self-report and directly assessed physical activity levels were measured over the same length of time (e.g. seven days) and over the same period of time (i.e. no time lag between measurements). A considerable number of studies took both measurements over the same period of time but not for the same length of time (e.g. self-report over seven days, directly measured over three days). Eleven of the studies in Table 2 lacked any mention of time [59, 131, 135, 138, 143, 159, 160, 164, 177, 178, 183].

Risk of bias assessment

Risk of bias was assessed for all included studies (n = 173) including those reporting only correlation data. The range of items met on the modified Downs and Black tool was 8 to 15 (maximum possible count was 15) with a mean of 11.7 ± 1.2. Results of the risk of bias assessment indicated that 38% (65/173) of the studies had lower quality (based on a median split count of < 12/15). All studies were given maximum points for describing study objectives. All but one study scored maximum points for describing the main outcomes to be measured and the interventions used (including comparison methods between measures). Although most studies carried out some sort of significance testing on results, most did not report the actual probability values associated with the estimates or their associated measures of random variability (e.g. standard error or confidence intervals). Most studies obtained a high number of items on the reporting section (maximum count of 8) with a mean of 6.9 ± 0.9.

The external validity section of the risk of bias assessment had a maximum count of three and consisted of reporting on the representativeness of the subjects and the testing conditions. Almost all of the studies (166/173) reported that the staff, places and facilities where the participants were tested were representative of the testing conditions that would be expected by most individuals (e.g. real-life and free-living situations). However, 87% (151/173) of the studies did not report on the representativeness of the subjects asked to participate in the study and 95% (165/173) of the studies failed to report on the representativeness of those who were prepared to participate (enrolled) compared to the entire population from which they were recruited (received a score of 0). As a result, the external validity ratings of most studies were poor with a mean of 1.1 ± 0.5.

In order to obtain the maximum number of items (four) in the internal validity section, studies must have reported whether any of the results of the study were based on "data dredging", whether the analyses adjusted for any time lag between the two measurements or different lengths of follow-up, whether the statistical tests used to assess the main outcomes were appropriate, and whether the main outcome measures were accurate (valid and reliable). Internal validity item counts were generally high with the majority of studies having obtained a four.

A qualitative analysis was conducted on the top seven (scores of 14 and 15 out of 15) and lowest seven studies (8 and 9 out of 15) based on scores from the risk of bias assessment. No conclusive patterns were identified from this analysis. The results from the accelerometer studies were further examined, as this was the only group of studies with a good distribution of low and high quality studies based on the accelerometer median split of bias scores. Findings from this analysis did not identify any clear patterns in the differences in agreement between physical activity measured by self-report compared to accelerometer when grouped by low and high quality.

Data synthesis

One hundred and forty-eight studies [10, 11, 13–157, 190] reported correlation statistics between self-report and direct measurements of physical activity. Figure 2 is a plot of all extracted correlations and shows that overall, there is no clear trend in the degree of correlation between self-reported and directly measured physical activity, regardless of the direct method employed. Overall, correlations were low-to-moderate with a mean of 0.37 (SD = 0.25) and a range of -0.71 to 0.98. Mean correlations were higher in studies reporting results for males-only (r = 0.47) versus studies reporting results for females-only (r = 0.36), but with very similar ranges (males: -0.17 to 0.93 vs. females: -0.17 to 0.95).

Figure 2

Scatter plot of all correlation coefficients between direct measures and self-report measures.

Seventy-four studies contained comparable data on the measurement of physical activity based on self-report and directly measured values. Table 2 describes these studies and their subcomponents. Percent mean differences were calculated for all of these studies and are presented as forest plots in Figures 3 to 8. Negative values indicate that self-report estimates were lower than the amount of physical activity assessed by direct methods, while positive values indicate that self-report estimates were higher. Sixty percent of the percent mean differences indicated that self-reported physical activity estimates were higher than those measured by direct methods.

Figure 3

Forest plot of percent mean differences between accelerometers and self-report measures from studies reporting combined results for males and females (excluding outliers ≥ 400%).

Figure 4

Forest plot of percent mean differences between accelerometers and self-report measures from studies reporting results for males only (excluding outliers ≥ 400%).

Figure 5

Forest plot of percent mean differences between accelerometers and self-report measures from studies reporting results for females only (excluding outliers ≥ 400%).

Figure 6

Forest plot of percent mean differences between doubly-labeled water, heart rate monitoring, pedometers, and indirect calorimetry and self-report measures from studies reporting combined results for males and females (excluding outliers ≥ 400%). Cal – calorimetry, DLW – doubly labeled water, HRM – heart rate monitor, Ped – pedometer.

Figure 7

Forest plot of percent mean differences between doubly-labeled water, heart rate monitoring, pedometers, and indirect calorimetry and self-report measures from studies reporting results for males only (excluding outliers ≥ 400%). Cal – calorimetry, DLW – doubly labeled water, HRM – heart rate monitor, Ped – pedometer.

Figure 8

Forest plot of percent mean differences between doubly-labeled water, heart rate monitoring, pedometers, and indirect calorimetry and self-report measures from studies reporting results for females only (excluding outliers ≥ 400%). Cal – calorimetry, DLW – doubly labeled water, HRM – heart rate monitor, Ped – pedometer.

Studies with extreme percent mean differences (≥ 400%) were removed from the forest plots for clarity [11, 139, 151, 181]. All outlying data were from studies where physical activity was categorized by level of exertion (e.g. easy, moderate, vigorous) and outliers represent physical activity data categorized as vigorous or of high energy expenditure. While not all data categorized as vigorous had percent mean differences ≥ 400%, a pattern emerged whereby percent mean differences between the self-report and direct measures were larger for vigorous levels of physical activity than for light or moderate activities [11, 44, 56, 134, 139, 151, 175, 181, 182].

Percent mean differences were examined separately for the five different direct measures. Accelerometers were the most used direct measure. Self-report measures of physical activity were generally higher than those directly measured by accelerometers (Figures 3 to 5). Studies reporting data for males and females combined (n = 58) had a mean percent difference of 44% (range: -78% to 500%), with similar findings for the male-only data (n = 32) (mean: 44%, range: -100% to 425%). However, female-only data (n = 60) identified that, on average, females self-reported higher levels of physical activity compared to accelerometers with a mean percent difference of 138% (range: -100% to 4024%).

The second-most common direct measure employed was DLW and comparable data with self-report measures are presented in Figures 6 to 8. Studies reporting on combined male and female data (n = 6) indicated that self-report measures of physical activity were lower when compared to DLW measures with a mean percent difference of -9% and a range of -1% to -26%. Results for male-only (n = 16) and female-only (n = 23) data were less distinct with percent mean differences and ranges of -4.5% (-78% to 37%) and 7% (-58% to 113%), respectively.

A greater number of HRM and self-report comparisons were observed for studies with both male and female participants (n = 11) or female-only populations (n = 13) versus male-only populations (n = 3). Female-only results showed a general trend toward higher levels of self-reported physical activity (mean 11%, range: -5% to 45%), while the male-only (mean -9%, range: -24% to 5%) and combined (mean -2%, range: -21% to 67%) data had a greater number of studies with lower self-reported physical activity levels when compared to results of HRM.

Pedometers and indirect calorimetry were the least commonly used direct measures for studies with comparable data. There were a total of eight comparisons from four studies for pedometers and 15 from two studies for indirect calorimetry (Figures 6 to 8) making it difficult to draw conclusions with regard to patterns of agreement between the self-report and direct measures. However, seven [19, 75, 76, 167] of the eight pedometer comparisons reported higher levels of physical activity by self-report when compared to the pedometer results. The eighth comparison [19] which involved female-only data saw no difference between the two measures. The indirect calorimetry results were less straightforward and presented no obvious patterns in agreement.

Subgroups were qualitatively examined to assess whether any differences existed in the degree of agreement between self-reported and directly measured physical activity. No clear patterns emerged within studies reporting on elderly (range or mean ≥ 65 years) populations [23, 73, 77, 92, 105, 116, 174] or within studies reporting on different time lags and periods of measurement. Few studies with comparable data reported exclusively on overweight/obese populations, but amongst those captured, the majority of studies reported higher levels of physical activity by self-report compared to the direct measures [139, 143, 148, 163–165, 172]. However, it was not possible to compare the overweight/obese percent mean differences to those reported in general populations.

Meta-analyses were not possible due to the substantial heterogeneity in units of reporting for physical activity measured by the various self-report and direct methods across the studies, and the significant lack of data with comparable units across measures. As a result, we were unable to determine the sensitivity of the values and the associated measures of error for the studies. Overall effect sizes to summarize the magnitude of discrepancy across the various measures of physical activity could therefore not be calculated.

Discussion

To the authors' knowledge this review represents the most comprehensive attempt to examine the relationship between self-report and directly measured estimates of adult physical activity in the international literature. Risk of bias was assessed and identified that just over one third of the studies had lower quality based on their description of the methods and external and internal validity. Overall, no clear trends emerged in the over- or underreporting of physical activity by self-report compared to direct methods. However, some results suggest that patterns in the agreement between self-report and direct measures of physical activity may exist, but they are likely to differ depending on the direct methods used for comparison and the sex of the population sampled. Interestingly, among studies that categorized physical activity by level of exertion (e.g. light, moderate, vigorous), the mean percent differences between the self-report and direct measures increased with the higher categories of intensity (i.e. vigorous physical activity). These larger differences may reflect a problem with self-report measures attempting to capture higher levels of physical activity, or problems with participant interpretation and recall.

Many of the studies tested the relationship between self-report and direct measures by using a correlation coefficient, but this approach is limited: correlation measures only the strength of the relationship between two variables; it cannot assess the level of agreement between them and it ignores any bias in the data [191]. A more useful approach, the Bland-Altman method, provides a means for assessing the level of agreement between self-report and direct measures by deriving the mean difference between the two measures and the limits of agreement. If the two measures possess good agreement and measure the same parameter of physical activity, then the cheaper and less invasive self-report methods may be valid substitutes for direct methods.
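A minimal sketch of the Bland-Altman calculation is given here for illustration. It assumes paired, individual-level measurements in the same units, which this review did not have access to, and it uses the conventional limits of agreement of the mean difference ± 1.96 standard deviations of the differences, which presumes the differences are approximately normally distributed. The function name and the data are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(self_report, direct):
    """Return (mean difference, lower limit, upper limit) of agreement.

    Both inputs are paired measurements on the same participants,
    expressed in the same units.
    """
    if len(self_report) != len(direct):
        raise ValueError("inputs must be paired and of equal length")
    if len(self_report) < 2:
        raise ValueError("at least two pairs are required")
    # Per-participant difference: self-report minus direct measure.
    diffs = [s - d for s, d in zip(self_report, direct)]
    bias = mean(diffs)       # systematic over- or under-reporting
    spread = stdev(diffs)    # sample standard deviation of the differences
    return bias, bias - 1.96 * spread, bias + 1.96 * spread


# Hypothetical paired data in minutes/day (illustrative values only).
self_report = [45.0, 60.0, 30.0, 90.0, 20.0]
direct = [30.0, 55.0, 35.0, 60.0, 25.0]
bias, lower, upper = bland_altman(self_report, direct)
print(f"mean difference: {bias:.1f}, limits of agreement: {lower:.1f} to {upper:.1f}")
```

Unlike a correlation coefficient, the mean difference exposes systematic bias, and the width of the limits shows how far individual self-reports may deviate from the direct measure.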

A meta-analysis would have allowed us to estimate the overall effect sizes for each of the direct measures and undertake a sensitivity analysis to further understand the degree of bias in the studies. Unfortunately, inconsistent methods and reporting among the studies included made such an analysis methodologically inappropriate. Further research in this area would benefit from greater consistency in the units of reporting and the methods used to facilitate comparisons. For instance, many studies did not report results using the same units, so estimates of agreement between the self-report and direct measures could not be computed. There was also an inconsistency in the number of days measured and the time lag between the self-report and direct measures. It is recommended that authors present their results using the same units for both measures (e.g. minutes/day, kcal/day), that the two measurements assess physical activity for and over the same time period, and that all relevant data including a mean and measurement of variance (i.e. standard deviation, standard error) be included in all reports.

Adhering to consistent reporting criteria would increase the comparability of results across studies and enable the calculation of overall effect sizes. At the population level, over- or underestimation of physical activity prevalence has important implications as these data are used to monitor physical activity trends, determine spending for research and physical activity interventions and programming, and to estimate physical inactivity-related risks of disease. Future studies may wish to refer to the updated Compendium of Physical Activities [192] which provides a coding scheme to classify physical activity by rate of energy expenditure. The Compendium offers a means to increase the comparability of results between self-report and direct measures, as well as across studies.

A lack of a clear trend amongst the differences between the self-report methods for assessing physical activity and the more robust direct methods is of concern, especially when trying to establish whether the measures could be used interchangeably. There are several possible explanations for the lack of a clear trend in the data. Many self-report instruments (such as the 7-day PAR) may not have the ability to account for activities of less than 10 minutes in duration or those with a level of exertion lower than brisk walking [193], whereas some of the direct methods (such as DLW) may capture all forms of physical movement. However, it is important to recognize that other direct measures such as accelerometers are unable to capture certain types of activities such as swimming and activities involving the use of upper extremities. Our findings demonstrate the inherent difficulty self-report measures possess when trying to accurately capture data at various levels of exertion. Compared to direct measures, self-report methods appear to estimate greater amounts of higher intensity (i.e. vigorous) physical activities than in the low-to-moderate levels.

Just as some self-report measures are unable to capture all forms of activity, some direct measures may capture non-physical activity. For instance, the DLW technique provides an accurate assessment of total energy expenditure, but it captures not only physical activity but all forms of energy expenditure, including resting energy expenditure and the thermogenic effect of food. DLW is therefore expected to overestimate physical activity unless corrections are made. These and other measurement errors may inflate the between-individual variability in the energy expended in physical activity [194]. Finally, direct methods may be too sensitive to small errors derived from the various calibration methods employed and the equations used to define and categorize physical activity.
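One way such a correction might look is sketched below: physical activity energy expenditure is approximated by subtracting resting energy expenditure and the thermogenic effect of food from total energy expenditure. The 10% figure used for the thermogenic effect of food is a common convention and is an assumption here, not a value taken from the reviewed studies; the function name and example values are likewise hypothetical.

```python
def activity_energy_expenditure(tee_kcal, ree_kcal, tef_fraction=0.10):
    """Approximate physical activity energy expenditure in kcal/day.

    tee_kcal: total energy expenditure (e.g. from DLW)
    ree_kcal: resting energy expenditure
    tef_fraction: assumed share of total expenditure attributed to the
        thermogenic effect of food (0.10 is a convention, not a measurement)
    """
    if not 0 <= tef_fraction < 1:
        raise ValueError("tef_fraction must be in [0, 1)")
    if ree_kcal > tee_kcal:
        raise ValueError("resting expenditure cannot exceed total expenditure")
    return tee_kcal - ree_kcal - tef_fraction * tee_kcal


# Hypothetical participant (illustrative values only).
paee = activity_energy_expenditure(tee_kcal=2500.0, ree_kcal=1500.0)
print(f"estimated activity energy expenditure: {paee:.0f} kcal/day")
```

Without a correction of this kind, the full total would be compared against a self-report instrument that captures only activity.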

It is important to take into account all of these factors when comparing self-report and direct measures of physical activity. In specific circumstances (e.g. at different levels of activity) these two methods may not be comparable as they are not able to capture the same parameters of physical activity. Self-report measures may not be able to accurately capture all levels of activity, but they may be able to capture how difficult an individual perceives an activity to be and the type of activity that is undertaken (e.g. leisure, work, transportation). Direct measures, on the other hand, may be better able to capture some of the information not captured by self-report methods (e.g. incidental daily movement and lower intensity activities), but they also possess their own limitations, such as the inability to capture arm movements and various types of physical activity (e.g. swimming).

Concerns regarding the discrepancy between self-reported and directly measured physical activity were recently reported by Troiano and colleagues, who examined data from the 2003–2004 National Health and Nutrition Examination Survey (NHANES), which contained the first direct measurements of physical activity in a nationally representative U.S. sample [195]. They compared self-reported estimates of adherence to physical activity recommendations with those directly measured by accelerometer. Their findings identified that self-reported adherence estimates were much higher than those measured by accelerometer. The authors hypothesize that the overestimation may result from respondents misclassifying sedentary or light activity as moderate, or from underestimation of activity duration by the accelerometers.

Other factors, such as those related to the population under study, may influence the ability of self-report and direct methods to capture the same measurement. For example, our findings show that in studies with a focus on overweight/obese individuals, self-reported physical activity was overestimated in all cases except for DLW studies involving combined male/female and male-only data. Our results differed from those reported by Irwin, Ainsworth and Conway (2001) [58]. Their study consisted of 24 males and used DLW to compare energy expenditure estimates with those obtained by physical activity record and the 7-day PAR. The investigators observed an overestimation of energy expenditure in participants with higher body fat using the physical activity record, but not the 7-day PAR. A comparison of the same sample by body mass index (BMI) identified that those with a BMI ≥ 25 kg/m² overestimated energy expenditure from both physical activity records and the 7-day PAR. Consistent with the trends within our accelerometer data, a recent study (published after our search) of 154 subjects compared a physical activity questionnaire to accelerometry data and identified that the accuracy of the physical activity questionnaire was higher for males than for females and for those with a lower BMI [196]. It is likely that a response bias exists due to social desirability and that it influences the degree of over-reporting of physical activity by overweight/obese individuals. Future research and synthesis are needed to identify whether such a bias does in fact exist and, if so, whether it differs by gender and to what extent.

This review had limitations that should be considered when examining the results. First, the sample was limited to studies that included directly comparable data between self-report and direct measures (same units for both measures) or a comparison by way of correlation. Access to primary data from each study was not feasible; therefore, we relied upon reported comparisons and the means of measured physical activity. This reduced the number of studies with reported measures of physical activity by self-report and direct methods and limited our ability to accurately assess the degree of agreement between the two measures. However, when possible we converted non-comparable units to increase the number of studies used. Second, the review did not assess the agreement between proxy-reported physical activity and direct measures. Proxy-report data are less prevalent but are an important means of assessing physical activity in sub-populations, such as those who are chronically ill, disabled, or elderly, who are unable to self-report on their own physical activity levels. Further research is required to assess the validity of proxy-report measures of physical activity when compared to direct methods. Finally, this review did not distinguish between differences in study protocols related to calibration, cut-points, or collection of the measurements, or other population-specific characteristics.

Conclusion

In conclusion, this review provides an objective summary of the difference in physical activity levels assessed via self-report methods compared to directly measured physical activity. The results may assist researchers considering the use of self-report or direct measurement methods and serve as a note of caution that self-reported and directly measured physical activity can differ greatly. Overall, there were no clear trends in the degree to which physical activity measured by self-report and by direct measures differed. The strength of trends differed by the direct method employed and by the gender of the population sampled. One-third of the studies were of poor quality, with most studies having failed to report actual probabilities, measures of variability for estimates, or the representativeness of their samples. The costs and benefits of direct measurement need to be considered in any study in order to determine whether the added resources required for personnel training and laboratory analyses justify the possible increase in the precision of results. At this time, it is not possible to draw any definitive conclusions concerning the validity of self-report measurements compared to various direct methods, but caution should be exercised when comparing studies across methods.