Population level (level 1) screening for autism spectrum disorder (ASD) has been the subject of numerous papers, particularly since the American Academy of Pediatrics published a policy statement more than a decade ago (Council on Children with Disabilities 2006). The most commonly studied tool is the Modified Checklist for Autism in Toddlers (M-CHAT; Robins et al. 1999), and its revision, the M-CHAT-revised, with follow-up (M-CHAT-R/F; Robins et al. 2009). However, the variety of screening tools for prospective identification of early signs of autism has encouraged the publication of different systematic reviews (Daniels et al. 2014; McPheeters et al. 2016). See Table 1 for the tools included in the current meta-analysis, and references for more information about each tool.

Table 1 Details of sample characteristics and individual outcomes such as studies show

The U.S. Preventive Services Task Force (USPSTF; Siu and Preventive Services Task Force 2016) concluded that there was insufficient evidence to provide a recommendation regarding universal toddler screening for ASD. At the same time they emphasized the potential of the M-CHAT as a universal screening tool, as evidenced by empirical results (R. Canal-Bedia, personal communication, May 9, 2016). Hence, it is necessary to perform a systematic study of the psychometric data available in different studies.

The meta-analysis is an important resource to summarize—in quantitative terms—the accuracy of diagnostic test, providing a higher level of evidence; for this reason, the current study conducted a meta-analysis to review empirical data from the studies and tools used since the first ASD population screening was performed in England (Baron-Cohen et al. 1996).

In this kind of study, the reference test may be imperfect because a gold standard is not available in practice. We have used the Bayesian Hierarchical Model (HSROC; Rutter and Gatsonis 2001) to carry out the meta-analysis. The model is robust in adjusting for the imperfect nature of the reference standard of autism tools, in a bivariate meta-analysis of diagnostic test sensitivity and specificity and others psychometric parameters. Another bivariate model was proposed by Reitsmaet al. (2005) in which it is assumed that the vector of (logit(sensitivity), logit(specificity)) follows a bivariate normal distribution. However, Harbord and Whiting (2009) showed that the likelihood functions of both the HSROC and bivariate models are algebraically equivalent, and yield identical pooled sensitivity and specificity. Dendukuri et al. (2012) have demonstrated the usefulness of HSROC model, when no gold standard test is available.

Therefore, in this study, we used a Bayesian meta-analysis, and the main aim was to evaluate the accuracy of the different screening tools. The second objective was to calculate the pooled psychometric properties associated with different studies to evaluate the tools effectiveness and support their recommendation internationally (R. Canal-Bedia, personal communication, May 9, 2016).

Methods

The preferred reporting items for systematic reviews and meta-analyses (PRISMA) (Moher et al. 2009) has guided this systematic review.

Criteria for Selection of Studies

Included papers focused on the screening and diagnosis of ASD and other developmental disorders in the general population, also known as level 1 screening. In cases where studies had duplicated data, only the most complete one was selected in order to avoid an unrealistic increase in the homogeneity between studies, and emphasis was placed on studies validating screening tools, which were often the most complete samples. Therefore, we excluded studies focused on tools that were not designed to screen for ASD, screening studies not applied to the general population (level 1), and all those that did not provide sufficient data to construct a 2 × 2 contingency table of screening × diagnosis (such as those without confirmatory diagnoses), or had a low quality rating in the quality assessment.

Literature Search

A systematic literature search identified studies that reported tools and procedures used for the early detection of ASD. The articles were obtained from CINHAL, ERIC, PsycINFO, PubMed and WOS databases using several combinations of the relevant keywords and Medical Subject Heading (MeSH), which include the categories of terms suggested by Daniels et al. (2014). All articles published between January 1992 and April 2015 were considered eligible. Only articles published in the English language and reporting an age range of screening from 14 to 36 months were included. The search strategy for PubMed is described (see Appendix 1). An additional search was conducted for grey literature captured on other search engines such as Google Scholar; we also searched the reference lists of included articles and any relevant review articles identified through the search and the ‘related articles’ function in PubMed. In addition, when searching the grey literature, we took into account the reference lists of primary studies and review papers, and contacted the experts to locate significant but as yet unpublished studies.

Assessment of Methodological Quality

Two reviewers conducted quality assessment of the included studies with the QUADAS-2 Tool (Quality Assessment of Diagnostic Accuracy Studies-2) (Whiting et al. 2004). Any discrepancies were referred to a third reviewer. QUADAS is a validated quality checklist (Deeks 2001; Whiting 2011; Whiting et al. 2006) composed of 14 items which encompass the most important sources of bias and variations observed in diagnostic accuracy studies. The studies were classified according to whether they had low or high risk for bias and their applicability was graded as low or high.

Data Extraction

The following data items were extracted from each study using a data collection form: first author and year of publication; size and characteristics of the study population; raw cell values [true positive (TP), true negative (TN), false positive (FP), false negative (FN); and psychometric properties, specifically sensitivity (Se), specificity (Sp), positive and negative predictive values (PPV, NPV), positive and negative likelihood ratio values (LR+; LR−), and diagnostic odds ratio (DOR)]. See Appendix 2 for definitions of bio-statistical terms. Psychometric properties which were not provided in the studies were calculated based on raw cell values. Clarification was requested from the authors via e-mail when we observed discrepancies between the data reported and the data calculated. Details of the search and results are shown (see Tables 1, 2).

Table 2 Details of individual diagnostic outcomes such as studies show

Data Synthesis and Statistical Analysis

We calculated the pooled Se, Sp, LR+, LR−, PPV, NPV and DOR for the included studies. Separate pooling of sensitivity and specificity may lead to biased results because different thresholds were used in different studies (Deeks 2001; Moses et al. 1993). Therefore, we used the Hierarchical Summary Receiver Operating Characteristic Model (HSROC) (Rutter and Gatsonis 2001) to estimate the diagnostic accuracy parameters and to generate a summary receiver operating characteristic curve with HSROC, [an R package available from CRAN (Schiller and Dendukuri 2015)]. The model is robust for including studies with different reference standards and potential negative correlation in paired measures (Se/Sp) across studies (Trikalinos et al. 2012). This kind of analysis models the variation in diagnostic accuracy and cut-off values, and identifies sources of heterogeneity, which is a common feature among diagnostic or screening test accuracy reviews.

The model has been called a “Hierarchical Model” owing to the fact that it takes into account statistical distributions at two levels. At the first level, within-study variability in sensitivity and specificity is examined. At the second level, between-study variability is examined (Macaskill 2004). The main goal of the model is to estimate an SROC curve across different thresholds.

The estimation from the model requires Markov Chain Monte Carlo (MCMC) simulation (Rutter and Gatsonis 2001). To carry out this Bayesian estimation we specified the prior distributions over the set of unknown parameters with a similar assumption made by Higgins et al. (2003). This process was used in order to obtain posterior predictions of the Se and Sp. According to Harbord and Whiting (2009), the true estimate of Se and Sp in each study could be found by empirical Bayes estimates, although we acknowledge that many of the included studies were limited in their ability to confirm that negative cases were in fact true negatives.

In order to establish whether there was inconsistency and heterogeneity in the meta-analysis, we summarized the test performance characteristics using a forest plot with the corresponding Higgins I2 index (Higgins and Thompson 2002) and assessed heterogeneity by visual inspection of the SROC plots and using Cochran’s Q test (p > 0.1) (Cochran 1954). Summary DORs were estimated by random DerSimonian–Laird effect model (DerSimonian and Laird 1986) following the recommendations of Macaskill et al. (2010) because I2 was greater than 50% and Q test was < 0.1. Since variability of results among different studies was confirmed, an investigation of heterogeneity was necessary and subgroup analyses were used. The Egger’s test (Song et al. 2002) was calculated for assessing publication bias using STATA 12.0.

Finally, we obtained a crosshair plot and ROC ellipses plot to summarize the confidence intervals of Se and FP cases in each study with the R-package (Doebler 2015) using meta-analysis of diagnostic accuracy (MADA), LR+, LR−, PPV, NPV and DOR were calculated using SAS for Windows, version 9.4 (Cary, NC).

Results

Study Selection

The initial literature search identified 1883 studies. Six hundred and sixty-seven duplicate records were eliminated to obtain 1216 non-duplicated articles, 1114 of which were excluded after title and abstract screening through the application of inclusion/exclusion criteria, and 87 were excluded after full text screening or methodological quality assessment and data extraction (see Supplemental Table 1). One additional study that qualified for inclusion was identified from the search of grey literature. Finally, 14 studies: (Baird et al. 2000; Barbaro and Dissanayake 2010; Canal-Bedia et al. 2011; Chlebowski et al. 2013; Dereu et al. 2010; Honda et al. 2005; Inada et al. 2011; Kamio et al. 2014; Miller et al. 2011; Nygren et al. 2012; Robins et al. 2014; Stenberg et al. 2014; Wiggins et al. 2014; Baranek 2015) were eligible for inclusion in our review. We present the flow chart showing the selection process in Fig. 1.

Fig. 1
figure 1

Study selection flow chart following PRISMA guidelines

Methodological Quality of the Included Studies

We used the QUADAS-2 tool for study of quality assessment and K coefficient to examine inter-rater agreement for our initial overall quality score, and resolved any item discrepancies through discussion. The agreement between judges’ kappa values was 0.643 (CI 95%; p < 0.01). In Fig. 2, we summarize the results of the methodological quality for all 20 studies included in this assessment: (Baird 2000; Barbaro 2010; Canal-Bedia et al. 2011; Chlebowski 2013; Dereu 2010; Dietz 2006; Honda 2005, 2009; Inada 2011; Kamio 2014; Kleinman 2008; Miller 2011; Nygren et al. 2012; Pierce 2011; Robins 2008, 2014; Stenberg 2014; VanDenHeuvel 2007; Wetherby 2008; Wiggins et al. 2014).

Fig. 2
figure 2

Methodological quality graph depicting the cumulative findings of the methodological quality analysis

As Fig. 2 shows, two bar graphs report the assessment of risk of bias and applicability. The percentage of studies rated as unclear, high, or low is observed across X-axes at intervals of 20%. The concerns regarding applicability include three domains: patient selection, index test, and reference standard. The risk of bias dimension is comprised of four domains: patient selection, index test, reference standard, and flow and timing. Across a majority of studies, concern about applicability of the reference standard was assessed as low, the index test was assessed as unclear, and patient selection was assessed as having low concerns. Regarding risk or bias, the majority of the studies demonstrated high risk of bias for flow and timing; the index test was rated as unclear risk, the reference standard was generally rated as low risk, and patient selection was rated as low risk.

During this process we excluded the following studies: Honda (2009), Pierce (2011), Robins (2008), VanDeHeuvel (2007), Wetherby (2008). In supplemental materials (see supplemental Table 1) we show the list of papers excluded during analysis of quality and data extraction processes.

Characteristics of the Included Studies

One hundred and two full text articles were assessed for eligibility, 14 (13.72%) of which were included in the quantitative synthesis. Some articles evaluated more than one index test (Inada et al. 2011; Nygren et al. 2012; Wiggins et al. 2014) and this is why we present a meta-analysis on 18 sets of psychometric values, 35.71% of which came from the USA, 35.71% from Europe, 21.42% from Japan and 7.14% from Australia. The sample includes 191,803 toddlers. The interval of age range is between 16.7 and 29 months. Sex data was available for 158,965 toddlers, of whom 73,431 (46.19%) were female.

The studies presented great variability in terms of the data reported. Twelve of 14 studies (66.6%) showed all the primary outcomes required to populate 2 × 2 contingency tables. Data pertaining to Se were presented in 77.7% of studies, Sp in 55.5%, PPV in 77.7%, NPV in 44.4%, and LR+ and LR− in 22.2% of studies. The main characteristics and the clinical outcomes, as shown in included studies are presented (see Tables 1, 2).

Diagnostic Accuracy of Screening Tools

The accuracy of screening tools was evaluated in 14 studies that assessed the test characteristics of various screening tools (18 in all). The pooled Se was 0.72 (95% CI 0.61–0.81) and the Sp was 0.98 (95% CI 0.97–0.99). The positive likelihood ratio (LR+) was 131.27 (95% CI 50.40–344.48) and the negative likelihood ratio (LR−) was 0.22 (95% CI 0.13–0.45). The diagnostic odds ratio (DOR) was 596.09 (95% CI 174.32–2038.34). The positive predictive value (PPV) was 97.78 (95% CI 97.71–97.84) and the negative predictive value (NPV) was 93.13 (95% CI 93.02–93.24). The above is summarized in Table 3, while the corresponding HSROC plot is presented in Fig. 3. The Se of each individual study varied between 0.22 and 0.95 whereas the Sp ranged from 0.81 to 0.99 (see Table 4).

Table 3 Parameters estimated between studies (point estimate = median) both for the entire meta-analysis and for the sub-analysis of nine studies
Fig. 3
figure 3

ROC ellipses plot with confidence regions, which describe the uncertainty of the pair of sensitivity and false positive rate. The size of the circles indicates the weight of each study. Studies indicated by study number (see Table 1)

Table 4 Estimates of diagnostic precision and outcomes in single studies

Exploration of Heterogeneity

A considerable degree of heterogeneity in sensitivities was observed (Q = 337.62, df = 17.00, p < 0.001) and specificities (Q = 30901.50, df = 17.00, p < 0.001). The heterogeneity in test accuracy between studies may be due to differences in cut-offs utilized in different studies, among other factors (Doebler et al. 2012). To delve deeper into the understanding of these results, we evaluated the confidence intervals which describe the relationship between the psychometric properties. The ROC ellipse plots of the confidence intervals in Fig. 3 shows the studies responsible for high levels of heterogeneity, how cut-off values vary, and how they demonstrate moderate negative correlations between sensitivities and False Positive rates (rs = − 0.355), that is, if Se tends to decrease when FP rate increases.

According to this analysis, study 18 (Baranek 2015), study 14 (Dereu et al. 2010), studies 12 and 13 (Inada et al. 2011) and study 15 (Miller et al. 2011) show the largest confidence intervals both for Se and FP rate, and study 4 (Baird et al. 2000), study 10 (Canal-Bedia et al. 2011), study 7 (Kamio et al. 2014) and study 8 (Stenberg et al. 2014) indicate large confidence intervals only in Se.

The SROC curve summarizes the relationship between Se and (1 − Sp) across studies, taking into account the between-study heterogeneity. We constructed a SROC curve using all studies selected; see Fig. 3. It is worth noting that it is a significant graphical tool for understanding how the diagnostic accuracy of the different test depends on the different cut-off (Doebler et al. 2012).

As Fig. 4 shows, the prediction region covers a larger range of Se than Sp. This may be due to the fact that most of the studies had a considerably larger number of participants with screen negative results compared to screen positive results, leading to greater sampling variability when we estimated Se vs. Sp. The figure also demonstrates an asymmetry of the test performance measures towards a higher Sp with higher variability of Se, providing indirect proof of some threshold variability. The figure also shows how when the threshold is increased then Se is decreased but Sp is increased.

Fig. 4
figure 4

Hierarchical summary receiver operating characteristic curve (HSROC) plot shows test accuracy (using all studies selected). According to Schiller and Dendukuri (2015) individual studies are represented by round circles. The size of the circles is proportional to the number of patients included in the study, the height of ovals indicates the number of affected individuals and the width indicates the number of non-affected individuals. The filled red circle is the pooled sensitivity and specificity across the studies taking into account the between-study heterogeneity. The blue dotted-curve defines the 95% prediction region. The red dot-dashed-curve marks the boundary of the 95% credible region for the pooled estimates

The posterior predictive value of Se was 0.71 (95% CI 0.22–1) with a standard error of 0.23 and that of Sp was 0.98 (95% CI 0.81–1) with a standard error of 0.07.

Subgroup of Analysis

A large degree of heterogeneity was observed. Heterogeneity may be due to different factors (Macaskill et al. 2010; Trikalinos et al. 2012). In order to investigate the source of heterogeneity in the current sample, we followed recommendations of these authors and conducted analyses using a subgroup of studies. The new meta-analysis excluded the following studies, based on graphical analysis and the Cochran Q test (p > 0.1): Study 4 (Baird et al. 2000), Study 7 (Kamio et al. 2014), Study 8 (Stenberg et al. 2014), Study 10 (Canal-Bedia et al. 2011), Studies 12 and 13 (Inada et al. 2011), Study 14 (Dereu et al. 2010), Study 15 (Miller et al. 2011), and Study 18 (Baranek 2015).

Regarding the estimations between study parameters, subgroup analysis demonstrated that Se was increased because the pooled sensitivity was 0.77 (95% CI 0.69–0.84), and the Sp was 0.99 (95% CI 0.97–0.99). The posterior predictive p-value of Se was 0.81 (95% CI 0.39–1) and Sp, 0.97 (95% CI 0.76–1, SD = 0.08).

Parameters estimated between studies by HSROC model are shown in Table 3, which demonstrates how the parameters estimated for the subgroup of analysis are higher results than those obtained for the first meta-analysis. For example, it is of note that standard deviation in the cut-off and standard deviation of the difference in means between studies are decreased.

The estimates for individual studies were grouped by parameters and are shown in Table 5.

Table 5 Estimates of diagnostic precision and outcomes in single studies for the sub-analysis of nine studies

Figure 5 shows how the prediction region covers a larger range of Se than Sp although this is less than in the first meta-analysis. The figure also shows less asymmetry of the test performance and therefore less heterogeneity. This means that the range, which includes the measurements for Se and Sp is lower than the one shown in Fig. 4.

Fig. 5
figure 5

Hierarchical summary receiver operating characteristic curve (HSROC) plot show test accuracy (using subgroup of studies)

Publication Bias

The estimated Egger bias coefficient was 3.21 (95% CI − 0.49 to 6.92) with a standard error of 1.5, giving a p-value of 0.08. The test thus suggests evidence that results are not biased by the presence of small-study effects.

Discussion

Interest in early detection of ASD is increasing, due to the growing evidence that early intervention improves prognosis. Low-risk screening, as part of pediatric primary care, for example, is one of the most widely studied strategies to promote early detection.

Consequently, the information reported from systematic reviews of screening accuracy is valuable, both for research and practice. Different systematic reviews, such as the ones carried out by Daniels et al. (2014) and McPheeters et al. (2016), have represented an important advance with regard to traditional or narrative reviews, which were characterized by a lack of systematization. However, a meta-analysis is a systematic review which also uses statistical methods to analyze the results of the included studies. It is accepted that data from systematic reviews with meta-analyses adds value since the statistical analysis used converts the results of primary studies into a measure of integrated quantitative evidence. This is beneficial both to the scientific community and to the clinicians who use the tools in such meta-analyses.

Meta-analysis of screening studies is a complex but critical approach to examining evidence across measures and scoring thresholds in different populations (Gatsonis and Paliwal 2006). We employed a Bayesian Hierarchical Model (Rutter and Gatsonis 2001), which is robust in adjusting for the imperfect nature of the reference standard of autism tools, in a bivariate meta-analysis of diagnostic test sensitivity and specificity and others psychometric parameters. This kind of meta-analysis statistically compares the accuracy of different diagnostic screening tests and describes how test accuracy varies. Therefore, it is more likely to lead to a ‘gold standard’ than other types of reviews which can be influenced by biases associated with the publication of single studies.

The HSROC model was used to estimate the screening accuracy parameters and a summary in each study as functions of an underlying bivariate normal model. This model has been recommended when there is no standard cut-off to define a positive result (Bronsvoort et al. 2010; Dukic and Gatsonis 2003; Macaskill 2004) in order to allow the meta-analytic assessment of heterogeneity between studies while taking into consideration both within- and between-study variability. Furthermore, it is also optimally suited when more information is available, for example, when the studies have reported results from more than one modality (Rutter and Gatsonis 2001) like our case. The advantages of the model have been discussed (Gatsonis and Paliwal 2006; Leeflang et al. 2013; Macaskill 2004; Rutter and Gatsonis 2001) and support its selection in this meta-analysis.

This review included 14 studies that assessed the test characteristics of various screening tools (18 in all) for detecting autism and a subgroup of analysis retaining nine studies that demonstrated lower heterogeneity. Initial findings of the overall meta-analysis show that tools which are used in level 1 ASD screening are accurate at detecting the presence of ASD [pooled sensitivity was 0.72 (95% CI 0.61–0.81)] and highly accurate at detecting a lack of presence of ASD [pooled of specificity was 0.98 (95% CI 0.97–0.99)]. But more importantly, we demonstrate the tools’ performance in identifying autism, DOR 596.09 (95% CI 174.32–2038.34). The clinical utility of the level 1 screening tools reviewed in this study is clear because the pooled positive likelihood ratio (LR+) was 131.27 (95% CI 50.40–344.48) and the negative likelihood ratio (LR−) was 0.22 (95% CI 0.13–0.45). LR+ > 1 indicates the results are associated with the disease. Although those findings are informative to clinicians, it is important to understand the limitations of the last assertion because the accuracy of a LR depends upon the quality of the studies that generated the pooled of sensitivity and specificity, therefore data must be interpreted with caution. Finally, the pooled of positive predictive value (PPV) was 97.78 (95% CI 97.71–97.84) and the negative predictive value (NPV) was 93.13 (95% CI 93.02–93.24).

A limitation of this meta-analysis comes from the methodological limitations of the included studies; 55% of the included studies were assessed to have high risk or unclear risk of bias in the quality analysis with QUADAS, particularly in the domains of flow and timing, and in the index test. We recommend that future screening studies include a flowchart with information about the method of recruitment of patients, sample, order of test execution, follow up and other details related to the process to improve replicability and to better inform readers about potential bias.

The second concern is about the heterogeneity of the psychometric data in the included studies. In this respect, according to Doebler et al. (2012), in diagnostic meta-analysis the observed sensitivities and specificities can vary across primary studies and heterogeneity should be assumed in results of this kind of meta-analysis (Macaskill et al. 2010). This assertion has been acknowledged in this work and justifies the choice of the model HSROC, which is a more robust model for addressing heterogeneity compared to some of the other meta-analysis models.

Following the recommendations of Macaskill et al. (2010) and Trikalinos et al. (2012) we conducted a subgroup of analyses to assess the pooled Se and Sp without those studies driving heterogeneity in analyses. The pooled of sensitivity and specificity were improved by the exclusion of these studies. Consequently, the parameters estimated for this set of studies suggested a good performance for ruling out and ruling in ASD since the prior pooled Se was 0.77 (95% CI 0.69–0.84, SD = 0.03), Sp was 0.99 (95% CI 0.97–0.99; SD ≤ 0.01), the posterior predictive p-value of Se was 0.81 (95% CI 0.39–1, SD = 0.18), and high specificity was maintained, 0.97 (95% CI 0.76–1, SD = 0.08). The previous data from the posterior predictive p-values of Se and Sp are very important because the true estimate of Se and Sp in each study could be found by empirical Bayes estimates (Harbord and Whiting 2009).

One important aspect to bear in mind is that only about 66.6% of all studies showed all the primary outcomes required to populate 2 × 2 contingency tables. Data pertaining to the Se were presented in 77.7% of studies, Sp in 55.5%, PPV in 77.7%, NPV in 44.4%, LR+ and LR− in 22.2% of studies. This leads us to recommend that authors of screening studies include sufficient detail to calculate all psychometric properties to improve the quality of systematic reviews and future meta-analyses. It also would be valuable for authors of future studies to reflect on the question of why there is such a low percentage of primary studies that do provide those data. Some authors use caution in presenting psychometric properties when the negative cases cannot be confirmed to be true negatives. Although this is a notable limitation of cross-sectional screening studies, given that confirmatory evaluations are prohibitive in very large samples, it is likely that the number of truly negative cases greatly outnumbers those cases that will later be identified as false negatives, suggesting that interpreting the TN cell of the 2 × 2 matrix to be “presumed TN” is a reasonable assertion. Looking further at the omission of specific psychometric values, there is a remarkably low percentage of studies that include LR+ and LR−, as well as a number that do not report NPV. LR+ and LR− may not have been commonly included given that they were not emphasized in the American Academy of Pediatrics’ policy statement that highlighted the psychometric properties of Se and Sp. The reduced emphasis on NPV may be due to the fact that predictive value is affected by baserate of the disorder in the sample being studied (such as PPV and NPV may vary dramatically across sampling strategies), whereas Se and Sp are not influenced by base rate. We recommend that future studies report comprehensive psychometrics, in order to promote understanding of the findings. In addition, it is often difficult to ascertain characteristics of the study, study cohort, and technical aspects (Gatsonis and Paliwal 2006). In future studies, a unified approach is necessary in presenting results of screening research to avoid the inconsistency and heterogeneity observed.

The present results suggested improved screening accuracy when meta-analysis was restricted to a subset of studies with reduced heterogeneity (see Table 3 for a comparison of parameters for the complete meta-analysis and the subgroup meta-analysis). The subgroup findings add specific knowledge for clinicians and researchers regarding each tool used for toddler ASD screening.

We have estimated parameters for each study in both meta-analyses (see Tables 4, 5). The results from subgroup analysis suggest that the Se of each individual study varied between 0.78 and 0.88. In those tables we also reported other important data, which could be a particular contribution for the clinicians in this field of study, such as the different cut-off points or the ‘accuracy parameter’, which measures the difference between TP and FP in each study and the prevalence. With respect to prevalence, we can say that it was estimated at or near 1% depending on the studies.

Finally, in the light of the results obtained by computing the summary measures with and without studies (shown as outliers Tables 3, 4, 5) we suggest that the tools used in Level 1 screening are adequate to detect ASD in the 14–36 age range. Thus, we confirm -in quantitative terms- the finding of the USPSTF that screening detects ASD.

Conclusion

A systemic review and meta-analysis of screening tools to detect ASD in toddlers determined that these measures detect ASD with high Se and Sp. Studies were restricted to low-risk samples in children younger than 3 years old, in order to evaluate the use of these screening tools in primary pediatric care. Given that children who start ASD-specific early intervention before age three have improved outcomes compared to children who go untreated prior to preschool, it is essential to disseminate strategies to improve the identification of the children in need of intervention as young as possible. Consistent with the recommendation of the American Academy of Pediatrics (Johnson et al. 2007) results of the current study show the validity of low-risk screening to identify ASD in children under 3 years old.