Background

In obstetrics and gynaecology there has been rapid growth in the development of new tests and in primary studies of their accuracy. These studies compare the result of an index test against an accepted reference standard [1]. The accuracy of the index test is usually expressed as sensitivity and specificity, or as other measures such as the diagnostic odds ratio (DOR), likelihood ratio (LR) or area under a receiver operating characteristic (ROC) curve [2]. These allow clinicians to judge the usefulness and suitability of testing in clinical practice. It is imperative that such studies are reported with transparency, allowing the detection of any potential bias that may invalidate the results [3-5]. Guidelines for the reporting of other study types have been widely accepted, e.g. CONSORT [6] for randomised controlled trials. A format for reporting evaluations of tests, the Standards for Reporting of Diagnostic Accuracy (STARD) [7], was introduced in 2003.

The objective of the STARD initiative is to improve the reporting of test accuracy studies, allowing detection of potential bias in a study and judgement on the applicability of the index test results. One benefit of the STARD initiative is the development of a consistent reporting format across all types of tests. The STARD group identified 33 previously published checklists for diagnostic research. From an initial 75-item checklist, a consensus meeting formulated a 25-item list that could be applied to accuracy studies. This list was designed to help readers judge the studies and to act as a study design tool for authors. Items were specifically chosen on the basis of evidence supporting their ability to show variations in measures of diagnostic accuracy [7]. The checklist is supplemented by a flow diagram, which aids assessment of the study population and the recruitment method, and indicates the numbers receiving the index test, those excluded and those compared with the reference standard at different stages of the study. STARD should allow a reader to critically appraise the study design, analysis and results.

Previous studies have examined the impact of STARD in specific clinical areas [8-12], with varying outcomes; the overall quality of reporting was generally found to be poor. Smidt et al. studied reporting quality pre- and post-STARD in twelve general medical journals and found that the mean number of STARD items reported was 11.9 (3.5-19.5) pre-STARD publication and 13.6 (4.0-21.0) post-publication. Coppus et al. found that the mean compliance for articles published in Fertility and Sterility and Human Reproduction was 12.1 (6.5-20). There is no published research examining the impact of STARD in obstetrics.

This study aims to assess the reporting quality of test accuracy studies in obstetrics and gynaecology, to evaluate the impact of the STARD statement, and to compare reporting quality between the two specialties.

Methods

We developed a protocol to assess the impact of STARD on studies included in ten systematic reviews performed over the period 2004-2007. The studies covered the time period 1977-2007. We included reviews of minimally invasive and non-invasive tests to determine lymph node status in gynaecological cancers [13-15], and reviews of Down's serum screening markers and of uterine artery Doppler to predict small-for-gestational-age infants in obstetrics [16, 17]. These systematic reviews were selected because they were all performed by the authors according to prospective protocols and recommended methodology, with prospective assessment of reporting quality using the STARD checklist, so that uniform assessment could be ensured. The STARD checklist was applied to each of the studies included in all the reviews, with each reporting item recorded as present, absent, unclear or not applicable (additional file 1). All studies were assessed in duplicate by TJS and RKM; where there was disagreement, consensus was achieved following assessment by a third reviewer (KSK). Where several tests had been applied to the same patients, the results including the largest number of patients were used; where there was no difference, one index test was selected at random. This ensured that patients were only included once.

We addressed the following questions: has the introduction of STARD improved reporting quality? Does study size correlate with reporting quality? Is there a geographical pattern to reporting quality? The percentage compliance of studies with STARD items was compared between the two specialties, before and after the introduction of STARD and over time, using the unpaired t test to assess the effect of STARD on the reporting quality of studies. As STARD was published in 2003, we assumed that all studies published before 2004 were published without the benefit of this directive.
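The pre- versus post-STARD comparison described above can be sketched as a two-sample (unpaired) t test; the compliance values below are hypothetical, chosen only to illustrate the calculation, not the study's actual data.

```python
# Sketch of an unpaired t test on percentage compliance with STARD items.
# All values are hypothetical, for illustration only.
from scipy import stats

# Percentage compliance per study, before and after STARD publication
pre_stard = [41.4, 48.3, 37.9, 55.2, 44.8, 51.7]
post_stard = [58.6, 62.1, 55.2, 65.5, 60.0, 68.9]

# Two-sample t test comparing mean compliance between the two periods
t_stat, p_value = stats.ttest_ind(pre_stard, post_stard)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A small p value here would, as in the study, indicate a significant difference in mean compliance between studies published before and after the initiative.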

We examined the relationship between sample size and compliance with STARD using Spearman's rank correlation coefficient (rho). The Kruskal-Wallis test was used to investigate any relationship between geographical distribution and reporting quality. The country of origin of a study was determined by the country of the corresponding author. Where a significant result was found, pairwise comparisons were made using the Conover-Inman procedure. Countries were grouped according to the number of articles published and the mean journal impact factor, adjusted for gross domestic product and population, based on a previous publication [18]. Where there was a large disparity in the number of studies per geographical area, some studies were regrouped to avoid large differences in group size and potentially spurious results. For the obstetric reviews the geographical areas were Oceania, USA, Canada, Asia, Japan, Africa, Eastern Europe and Western Europe; for the gynaecology studies there were no studies from Oceania or Canada, but Latin America was added.
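The two analyses above can be illustrated in a few lines; the sample sizes, compliance values and geographical groupings below are hypothetical, standing in for the study's data.

```python
# Illustrative sketch of the correlation and geographical analyses.
# All values and groupings are hypothetical.
from scipy import stats

# Spearman's rank correlation between study sample size and STARD compliance
sample_sizes = [25, 40, 60, 110, 150, 300, 450, 800]
compliance = [38.0, 45.0, 41.0, 52.0, 55.0, 60.0, 58.0, 66.0]
rho, p_corr = stats.spearmanr(sample_sizes, compliance)

# Kruskal-Wallis test of compliance across geographical groups
western_europe = [40.0, 52.0, 48.0]
asia = [45.0, 50.0, 43.0]
usa = [55.0, 47.0, 51.0]
h_stat, p_kw = stats.kruskal(western_europe, asia, usa)

print(f"rho = {rho:.2f} (p = {p_corr:.4f}); H = {h_stat:.2f} (p = {p_kw:.2f})")
```

In the study, a significant Kruskal-Wallis result would then trigger pairwise Conover-Inman comparisons between groups.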

In the initial analysis, reporting items coded as unclear or not applicable were excluded. Because it was uncertain whether items coded as unclear represented methodological failure, a sensitivity analysis was performed for all of the above comparisons in which this code was instead counted with the not-reported group. Similarly, a sensitivity analysis was performed to assess the effect of items coded as not applicable: these were initially excluded from the analysis and then counted as if reported, so as not to penalise studies with a larger number of not applicable items, which would otherwise appear to have lower compliance with STARD.

Results

A total of 300 studies (195 obstetric and 105 gynaecological) were identified and included in this analysis. 82% (160/195) of the obstetric and 83.8% (88/105) of the gynaecological studies were published prior to the STARD initiative. The overall percentage compliance with individual reporting items, and the percentage compliance pre- and post-STARD publication, are shown in table 1 for gynaecology and table 2 for obstetrics. The included obstetric studies reported adequately more than 50% of the time for 62.1% (18/29) of the items assessed in this review, and the gynaecological studies for 51.7% (15/29). Items where reporting was uniformly poor (below 50% in both obstetric and gynaecological studies) were participant sampling, description of the technique of the reference standard, description of the expertise of those performing the index test and reference standard, blinding of those interpreting the reference standard to the results of the index test, assessment of test reproducibility, tabulation of results and description of adverse events.

Table 1 Percentage compliance with individual STARD criteria for included diagnostic accuracy studies in gynaecology
Table 2 Percentage compliance with individual STARD criteria for included diagnostic accuracy studies in obstetrics

Compliance with STARD was greater for obstetric than for gynaecological studies (p < 0.0001). There was a significant improvement in the reporting quality of obstetric studies after the introduction of STARD (p = 0.0004). Two obstetric studies used a STARD flow diagram following the publication of STARD. Although there was also an improvement in mean compliance in gynaecological studies, this did not reach significance (p = 0.08). Tables 1 and 2 also show the mean differences in percentage compliance pre- and post-STARD publication. Figure 1 shows the trend in compliance with the STARD criteria over time. Analysis of the correlation between sample size and compliance with STARD revealed a positive correlation in both obstetrics (rho = 0.37, p < 0.0001) and gynaecology (rho = 0.24, p = 0.0123). Investigation into the relationship between geographical area of publication and compliance with STARD showed no relationship for either obstetrics or gynaecology (Kruskal-Wallis 5.05, p = 0.65 and 6.79, p = 0.24 respectively) (table 3). Sensitivity analysis showed no significant difference in any of the results.

Figure 1

Bar chart showing mean percentage compliance of studies with the STARD criteria; the line shows the trend over time.

Table 3 Mean percentage compliance of studies with STARD according to geographical area of publication

Discussion

Overall, the reporting of the studies included in this review was poor, with obstetric studies demonstrating better reporting than gynaecological studies. In both specialties geographical origin had no effect on reporting quality; however, study size showed a positive correlation. There has been a trend of improvement in reporting quality, more so in obstetrics than in gynaecology, but there remains significant room for improvement.

Compliance with STARD was poor in many of the studies in this review, and in many studies it was unclear whether the study complied with a given reporting item. This lack of clarity could potentially affect our inferences, but it is well known from other fields that unclear reporting is associated with bias [19]. Although the studies spanned both obstetrics and gynaecology, they were limited to a subset of conditions within these fields. It is likely that these results can be translated across obstetrics and gynaecology; however, care should be taken as to the generalisability of this study.

We compared our results with those from similar studies in other subject areas. Within reproductive medicine, the reporting of individual items still showed wide variation post-STARD publication [8]. In medical journals there was an improvement post-STARD in the reporting of the calculation of test reproducibility, the distribution of disease severity, variability in accuracy between subgroups and the use of a flow diagram [9]. In obstetrics and gynaecology there was an improvement in describing participant sampling/recruitment, description/blinding of the reference standard, reporting of the characteristics of the study population, the distribution of disease severity and variability in accuracy. There was no significant improvement in the use of a flow diagram. In gynaecology, however, some items showed poorer reporting post-STARD, such as description of the cut-off of the index test and blinding. Thus no particular items of the STARD checklist appear to have been poorly adopted or interpreted by authors; rather, authors have been slow to adopt the checklist, its publication being still relatively recent. As more journals adopt the STARD statement and more authors make use of it at the planning and data collection stages of their research, there will hopefully be a considerable improvement in reporting quality across all subject areas.

Poor reporting of a study does not necessarily indicate poor methodological quality. Accurate reporting is, however, necessary to allow transparency and to ensure that results are interpreted correctly. Application of the STARD checklist may help prevent the implementation of unnecessary or inaccurate tests, which can lead to unnecessary financial expenditure and potentially serious consequences for patients.

Conclusion

The reporting quality of papers in obstetrics and gynaecology is improving. This may be due to initiatives such as the STARD checklist, as well as to a historical increase in authors' awareness of the need to report studies accurately. There remains, however, considerable scope for further improvement.