Buder and colleagues [1] undertook a systematic review and meta-analysis to evaluate whether non-surgical management was supported in children with congenital non-refluxing primary megaureter. They concluded, based on eight studies, that the high prevalence of spontaneous resolution and the low pooled prevalence of surgical intervention support a non-surgical approach. While it is reassuring to find current practice supported by a review of available data, this study raises more questions than it answers, first among them: “Wait, what?”.

To be clear, Buder and colleagues undertook a well-designed and well-executed study. They scoured multiple databases for peer-reviewed publications, reviewed conference proceedings, and even checked clinical trial registries, covering records published between 1947 and 2022. They registered their systematic review prospectively in PROSPERO, and the data were reported in accordance with current best practices (the PRISMA statement and the Cochrane Handbook for Systematic Reviews) [2, 3]. Seventy-six records were reviewed in full text, and only eight studies were included in the final analysis. As over 28,000 records were identified through the initial search, those eight studies ultimately represented fewer than 0.03% of the records.

In rare conditions for which large-scale, randomized controlled trials are not feasible or practical, meta-analyses offer the opportunity to combine the results of multiple smaller studies. The increased power associated with an effectively larger sample size can characterize the magnitude of an observed effect more accurately than any of the smaller studies alone [4]. Similarly, systematic reviews can be useful for proactively informing the direction of future research [5]. In both cases, the quality of the studies selected and of the data analysis performed determines the quality of the conclusions drawn. While both meta-analyses and systematic reviews have the potential to include a majority (if not all) of the published data on a given topic, the aggregate sample size may still be relatively small. Consequently, the included patients may differ in ways that are not readily evident even through careful review of a manuscript or use of thoughtful selection criteria.
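As a rough illustration of how pooling shrinks uncertainty, the sketch below combines single-institution proportions using simple fixed-effect, inverse-variance weighting on the logit scale. The study counts and outcomes are invented, and the pooling method is chosen only for simplicity; Buder and colleagues’ actual approach may well have differed (random-effects models are common in this setting).

```python
import math

# Hypothetical single-institution series: (events, n), e.g., patients managed
# without surgery in each study. Numbers are invented for illustration only.
studies = [(14, 20), (22, 30), (9, 15), (30, 41)]

def logit(p):
    return math.log(p / (1 - p))

# Fixed-effect, inverse-variance pooling on the logit scale.
# var(logit(p_hat)) is approximated by 1/events + 1/(n - events).
weights, estimates = [], []
for events, n in studies:
    p = events / n
    var = 1 / events + 1 / (n - events)
    weights.append(1 / var)
    estimates.append(logit(p))

pooled_logit = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))
pooled_p = 1 / (1 + math.exp(-pooled_logit))

# The pooled standard error is smaller than that of any single series,
# which is the "increased power" described above.
print(f"pooled prevalence approx. {pooled_p:.2f} (SE on logit scale {pooled_se:.2f})")
```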

It is by no means unusual that systematic reviews and meta-analyses report on relatively few manuscripts: a review of recent publications in PubMed finds 103 studies included in one systematic review [6], 25 in another [7], and 19 in a third [8]. This fivefold range illustrates two points: a meta-analysis including 103 studies is fairly large despite the low absolute number of manuscripts, and the number of publications on less common conditions varies widely. Stringent inclusion criteria further decrease the number of eligible publications: Buder’s group identified over 28,000 records that were culled to the eight included in their analysis. What is most noteworthy about Buder and colleagues’ work is the quality of the data included in the meta-analysis: not only did almost half of the eight included studies have a high risk of bias, but the data reported in the eight studies, though ostensibly covering the same variables and outcomes, had substantially different levels of quality and detail. Furthermore, all eight studies reported retrospectively reviewed data from a single institution. In other words, each of the eight studies provided a curated presentation of a single institution’s data.

Why is this a problem? Meta-analyses and systematic reviews will, by their very nature, include studies that have different inclusion criteria, sample composition, and study design. Buder and colleagues should be commended for reviewing conference proceedings and clinical trial registries to minimize one common pitfall of systematic reviews: publication bias. However, the eight studies included in the current systematic review, while often including the same endpoints and measurement tools, varied in how measurements were taken and reported. Differential kidney function and urinary drainage during follow-up were both reported inconsistently, and often only qualitatively, in the included studies. “Resolution” of non-refluxing primary megaureter was reported in almost two-thirds of patients in seven (not eight) studies, though Buder et al. found that the criteria for what constituted “resolution” were neither uniform across studies nor even documented in most of the included manuscripts. Simply put, the published studies tended to report interpretations of the primary data rather than the primary data themselves. Moreover, the lack of agreement on nomenclature and classification raises a concern: if different groups of authors cannot even agree on an operational definition for a clinical outcome, are these studies too heterogeneous to analyze together?
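Heterogeneity of this kind is conventionally quantified with Cochran’s Q and the I² statistic. The short sketch below shows how the calculation works, using invented per-study estimates and variances rather than anything reported by Buder and colleagues.

```python
# Hypothetical per-study estimates (e.g., logit-transformed resolution
# proportions) and their variances; values are invented for illustration only.
estimates = [0.85, 1.10, 0.40, 1.60, 0.95]
variances = [0.10, 0.08, 0.15, 0.20, 0.12]

weights = [1 / v for v in variances]
pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)

# Cochran's Q: weighted squared deviations of study estimates from the pooled value.
q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
df = len(estimates) - 1

# I^2: the share of total variation attributable to between-study heterogeneity
# rather than chance; higher values argue against pooling the studies.
i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

print(f"Q = {q:.2f} on {df} df, I^2 = {i_squared:.0f}%")
```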

The current meta-analysis highlights one of the most maddening truths of research: the findings are only as good as the underlying data. Peer-reviewed manuscripts reflect only those papers that are selected for publication, and presentation of an abstract at a conference is no guarantee that a manuscript will follow: almost two-thirds of conference abstracts were not published within 2–5 years (ironically, these data are derived from meta-analyses) [9,10,11]. Studies in which the outcome is the decision to proceed with surgical intervention have been heavily criticized, since without clear prospective criteria for considering surgery, surgeon preference and/or patient-specific nuances may play a disproportionate role [12, 13]. There is no expectation that every study included in a meta-analysis would have exactly the same design, subject group, and endpoints. However, each added layer of heterogeneity among studies further decreases the generalizability of the results of the meta-analysis. At some point, often unrecognized by the authors, the groups in different studies become too dissimilar to analyze together, and the well-intentioned recommendation of the meta-analysis is left unsupported by the data.

For rare diseases, the above effects are magnified, since uncommon conditions are often the subject of meta-analyses. According to the National Organization for Rare Disorders (NORD), rare diseases are those that affect fewer than 200,000 Americans, and 25–30 million Americans are living with a rare disease at any given time [14]. While this is an enormous number of people living with a rare disease, not all of them are, of course, living with the same rare disease: NORD notes over 7,000 known rare diseases [14]. Rare conditions therefore require collaborative analysis of data from patients at geographically different sites and at different periods in time, making it less likely that data will be collected in precisely the same way across studies.

Wilms tumor is the paradigm for successful progress in the management of a rare disease. Dismal clinical outcomes prompted the creation of the National Wilms Tumor Study (later the Children’s Oncology Group), the model for inter-institutional collaboration in data collection and analysis. This collaborative work has generated treatment protocols that share best practices for the diagnosis and management of Wilms tumor and, more importantly, has facilitated identification of prognostic factors that direct tailored medical and surgical intervention, as well as the development of clinical trials to support further progress [15].

Critical to the success of the collaborative Wilms tumor research is the collection of the same primary data, in the same way, for every included patient. Clinical data (e.g., operative notes), imaging tests, and surgical specimens are evaluated by institutional and central reviewers to minimize errors and variability in how data are reported and recorded [16, 17]. One example of the importance of recording primary rather than interpreted data is kidney function. Kidney function is assessed by reviewing serum creatinine, which allows calculation of creatinine clearance or estimated glomerular filtration rate using accepted equations. Advances in medicine, such as the recent exclusion of race from the calculation of glomerular filtration rate at many institutions [18, 19], still permit the primary data to be used to assess kidney function with the new equation; had only the calculated glomerular filtration rate been recorded, such recalculation would not be possible.
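To make the point concrete, the sketch below recomputes an estimated GFR from stored serum creatinine using both the 2009 CKD-EPI equation (which included a race coefficient) and the race-free 2021 CKD-EPI equation. The equations are used purely for illustration (pediatric practice more often relies on the bedside Schwartz formula), the patient values are invented, and the constants should be checked against the original publications before any real use.

```python
# Illustrative only: recomputing eGFR from stored (primary) serum creatinine.
# Patient values are invented; equation constants are transcribed from the
# published CKD-EPI 2009 and 2021 equations and should be verified before use.

def egfr_ckd_epi_2009(scr_mg_dl: float, age: int, female: bool, black: bool) -> float:
    """2009 CKD-EPI creatinine equation (includes a race coefficient)."""
    kappa = 0.7 if female else 0.9
    alpha = -0.329 if female else -0.411
    gfr = (141
           * min(scr_mg_dl / kappa, 1) ** alpha
           * max(scr_mg_dl / kappa, 1) ** -1.209
           * 0.993 ** age)
    if female:
        gfr *= 1.018
    if black:
        gfr *= 1.159
    return gfr

def egfr_ckd_epi_2021(scr_mg_dl: float, age: int, female: bool) -> float:
    """2021 race-free CKD-EPI creatinine equation."""
    kappa = 0.7 if female else 0.9
    alpha = -0.241 if female else -0.302
    gfr = (142
           * min(scr_mg_dl / kappa, 1) ** alpha
           * max(scr_mg_dl / kappa, 1) ** -1.200
           * 0.9938 ** age)
    if female:
        gfr *= 1.012
    return gfr

# Because the record stored creatinine (primary data), the estimate can be
# recomputed whenever the accepted equation changes; a stored eGFR alone could not be.
scr, age, female = 1.1, 17, True
print(f"2009 equation: {egfr_ckd_epi_2009(scr, age, female, black=False):.0f} mL/min/1.73m^2")
print(f"2021 equation: {egfr_ckd_epi_2021(scr, age, female):.0f} mL/min/1.73m^2")
```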

In contrast, meta-analyses are often composed of collections of manuscripts and other available data that were intended for use by their authors and readers only, not for deliberate inclusion in a larger analysis [20]. The authors of those manuscripts should not be faulted, as their intent was almost certainly to share their clinical findings in what they believed to be the most clear and succinct way. Similarly, journal reviewers and editors clearly valued the information and its presentation when recommending the manuscripts for publication. However, intent and impact often differ, and the impact of eight different author groups each presenting data in their preferred way is that there are eight distinct studies rather than eight institutions cohesively assessing a similar clinical question.

Research is, at its core, the intent to gather and analyze data in a way that generates generalizable findings and allows those analyzing and interpreting the data to conclude with a recommendation based on their findings [21]. With this in mind, the extreme variation in how endpoints are defined and reported in the eight studies included in this meta-analysis raises concern that the data may not be generalizable. In reporting their single-institution, retrospective experiences, have the authors of those eight studies simply reported their own data points? Or are these eight institutions independently and consistently reporting findings that would unquestionably apply to larger cohorts of children across the globe? We will never know, because the lack of consistency in how the data were collected, interpreted, and recorded (in particular, the subjective, qualitative, and general assessment of differential kidney function and kidney unit excretion) leaves the comparability of the eight groups uncertain. While patient characteristics and study design may vary somewhat among studies, the absence of shared operational definitions for quantifying differential kidney function, urinary drainage, or resolution is both surprising and worrisome.

Inconsistency in what should presumably be readily consistent data has been reported for many clinical findings in medicine. Medical students and residents may hear a pulmonary wheeze or a cardiac murmur that has disappeared by the time the attending examines the patient, and repeated blood pressure measurements are typically similar, rather than identical. Within urology, there is significant inter- (and intra-) rater variability in the grading of vesicoureteral reflux on voiding cystourethrography, in the assessment of bladder characteristics and function on urodynamic studies, and even in how a voiding cystourethrogram is performed and reported [22,23,24,25]. In any test that relies on collection as well as interpretation of data, there is the potential for error and variation at each step. For example, a voiding cystourethrogram or urodynamic study may show vesicoureteral reflux on one cycle but not on the next (variation in the data generated by the test), and two clinicians reviewing the same images from the test may grade the reflux differently (variation in data interpretation) [26]. When data are presented only in a refined, interpreted form, nuances are lost. Some of these nuances may be important, but specific data can only be made general, not vice versa. The generalizability of results, as Kukull and Ganguli note [27], is predicated on the ability of the researcher to cull relevant from irrelevant information, which is easier said than done. When quantitative data are reported qualitatively, and only for some participants, teasing apart the data to identify which kidney units have altered differential function or drainage is akin to attempting to identify the top students in a class when the only data provided are whether or not each student passed a single test.
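Inter-rater variability of this kind is commonly summarized with an agreement statistic such as Cohen’s kappa. The minimal, unweighted sketch below uses invented reflux grades for two hypothetical readers; the cited studies’ actual methods may differ, and a weighted kappa is often preferred for ordinal grades.

```python
from collections import Counter

# Hypothetical VUR grades (1-5) assigned to the same 12 studies by two readers;
# values are invented purely to illustrate the agreement calculation.
reader_a = [2, 3, 3, 1, 4, 2, 5, 3, 2, 4, 1, 3]
reader_b = [2, 2, 3, 1, 3, 2, 5, 4, 2, 4, 2, 3]

n = len(reader_a)
observed = sum(a == b for a, b in zip(reader_a, reader_b)) / n

# Expected agreement by chance, from each reader's marginal grade frequencies.
counts_a, counts_b = Counter(reader_a), Counter(reader_b)
expected = sum(counts_a[g] * counts_b[g] for g in set(reader_a) | set(reader_b)) / n**2

# Kappa rescales observed agreement so that 0 means chance-level and 1 means perfect.
kappa = (observed - expected) / (1 - expected)
print(f"observed agreement {observed:.2f}, Cohen's kappa {kappa:.2f}")
```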

As authors, reviewers, and editors, we can, and we must, do better. No study is perfect: biases and confounders abound. However, it is incumbent upon every researcher to include the highest quality data possible. In some cases, this may mean including primary data rather than interpretations, which can feel clumsy. Reviewers and editors must hold authors accountable: are the data provided in a manuscript sufficiently granular? In reviewing a single manuscript, a lapse in quality or a large proportion of missing data may not be noticeable or may be easily rationalized. However, Buder and colleagues have nicely illustrated that, when multiple studies have missing data and/or high levels of bias, the ability to draw meaningful conclusions from an analysis of those studies together is extremely limited. Uncommon conditions with limited publications derived from retrospective, incomplete, and non-primary data are at particular risk. The consequence is that individual manuscripts, and not larger-scale analyses, drive clinical practice. Apparent “evidence-based practice” may in fact be an echo chamber of the published experiences of a rarefied few, rather than a considered and thoughtful analysis of aggregate data.