Key points

  • Large heterogeneity/variability in studies hampers successful clinical implementation of DWI metrics.

  • Technical variability was encountered in, e.g., b-value combinations, ROIs, and models.

  • Clinical heterogeneity was observed (e.g., scan-moment during treatment, tumor type differentiation, and NST-protocol).

  • Multi-disciplinary consensus/cooperation is required for proper clinical study design.

  • Quality control and standardization are essential for clinical and technical validation.

Introduction

Women with breast cancer are increasingly treated with neoadjuvant systemic therapy (NST) [1]. The optimal response is achieved when at subsequent surgical pathology no residual cancer is detected (pathological complete response, pCR). Between subtypes, pCR rates vary widely from 0.3% (luminal A) to 60% (HER2-type) [2].

To identify breast tumor pCR, a diagnostic lumpectomy is currently necessary, although it may no longer be required for therapeutic reasons. Identifying pCR with imaging only would be a significant improvement, as it would prevent needless surgical procedures. However, this requires that non-pCR is accurately detected. Only then can omitting surgery in favor of a wait-and-see strategy be accepted as a practical and reliable alternative. Such an approach has already been proposed for colorectal cancer treated with neo-adjuvant chemo-radiotherapy [3]. In the case of breast cancer, 18F-FDG PET-CT and/or dynamic contrast-enhanced (DCE) magnetic resonance imaging (MRI) have been extensively investigated to predict and evaluate NST-response [4, 5]. Despite all these efforts, NST response assessment still needs to be improved. The percentage of correctly identified pCR on MRI appears too low to safely omit diagnostic lumpectomy [6]. Furthermore, the accuracy of DCE-MRI seems to depend on the cancer subtype [6, 7]. In addition, the observed gadolinium deposition in the deep nuclei of the brain after repeated exposure to gadolinium-based contrast agents has raised some concerns about potential risks [8]. Therefore, other MRI-techniques, such as diffusion-weighted imaging (DWI), are being investigated [9].

While DCE provides information on perfusion, DWI provides information about cell density and tissue microstructure based on the diffusion of tissue water. Tumors with high cell density have a relatively low apparent diffusion coefficient (ADC), which theoretically increases when the density is reduced by chemotherapy. However, this is not observed in all tumors, since ADC depends on multiple factors [10].

The use of DWI might be beneficial for the response assessment of NST, as microstructure changes may be detected at an earlier stage than tumor size reduction [11]. Previous reviews reported aggregate values on the performance of DWI–MRI for predicting or identifying pCR. Chu et al. reported a sensitivity of 0.88 and a specificity of 0.79 [12]; similarly, Gao et al. reported a sensitivity of 0.89 and a specificity of 0.72 [13]. However, the cutoff ADC-values reported in the individual studies appear variable, preventing the use of a single cutoff value to achieve such performance. It is, therefore, uncertain whether these aggregate performance measures are valid. In addition, studies vary in factors such as patient selection, tumor subtypes, and NST-types. Moreover, the methodology used for quantitative analysis of DWI–MRI is not uniform. To partly solve this issue, Baltzer et al. published a EUSOBI consensus paper regarding DWI of the breast for lesion classification. However, the consensus paper does not provide insights on issues applicable to treatment monitoring using DWI for identifying patients with pCR [14]. To shed light on the magnitude of these issues, this review aims to identify technical, clinical, and biological heterogeneity and their impact in DWI studies identifying pCR on NST. The final aim is to support a more robust implementation of quantitative DWI for NST monitoring in breast cancer patients.

Materials and methods

Search, inclusion/exclusion criteria, and quality assessment

A PubMed-search was performed until April 2020, using Medical Subject Headings (MeSH)- and free-text terms for breast cancer, NST, DCE, DWI, and pCR. Identified abstracts were read and selected by two researchers. Abstracts were excluded when they were: (1) not published in English; (2) not about human breast cancer; (3) studies that performed no prediction/evaluation of the breast tumor with pCR; (4) studies that did not compare outcome to histopathology; (5) studies with neoadjuvant therapy using radiotherapy; (6) comment on; (7) meta-analysis; (8) case report.

After selection, the references of included studies were checked for extra studies (selection process: Fig. 1). Finally, quality of included studies was assessed using QUADAS-2 [15].

Fig. 1 Flow chart of the selection process of this review

Data extraction and analysis

Data were extracted based on general parameters (e.g., first author, publication year), clinical characteristics (e.g., type of tumor, neoadjuvant treatment protocol), scan-moments (i.e., before, during (number of cycles) and/or post-NST), MRI/DWI protocol parameters (e.g., B0-field strength (T), b-values (s/mm2)), and details on the measures derived from the DWI data (e.g., ADC (mm2/s)). The reported performance measures per study were collected. For pCR prediction/detection, pCR-definitions were also extracted, since studies could permit different degrees of residual (tumor) tissue for pCR.

If performance measures were missing, we attempted to reconstruct them by extracting the data normally used in 2 × 2 contingency tables from the full text or supplementary material. In this review, pCR and non-pCR are defined as positive and negative events, respectively.
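The reconstruction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the review: the names `ContingencyTable` and `performance` are hypothetical, and the counts in the usage note are invented for demonstration only. pCR is treated as the positive event, in line with the definition above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContingencyTable:
    """2 x 2 table with pCR as the positive event (hypothetical container)."""
    tp: int  # pathology: pCR,     DWI: predicted pCR
    fp: int  # pathology: non-pCR, DWI: predicted pCR
    fn: int  # pathology: pCR,     DWI: predicted non-pCR
    tn: int  # pathology: non-pCR, DWI: predicted non-pCR


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def performance(table: ContingencyTable) -> dict:
    """Derive the usual performance measures from the four cell counts."""
    total = table.tp + table.fp + table.fn + table.tn
    return {
        "sensitivity": _ratio(table.tp, table.tp + table.fn),
        "specificity": _ratio(table.tn, table.tn + table.fp),
        "ppv": _ratio(table.tp, table.tp + table.fp),
        "npv": _ratio(table.tn, table.tn + table.fn),
        "accuracy": _ratio(table.tp + table.tn, total),
    }
```

For example, `performance(ContingencyTable(tp=8, fp=3, fn=2, tn=17))` gives a sensitivity of 0.8 and a specificity of 0.85 for these invented counts. Returning NaN for an empty row or column keeps a missing cell from being mistaken for a perfect or zero score.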

After data extraction, grouping of results based on comparable study methodologies/definitions was performed. Data were analyzed by comparing study population (-related) and MR (-related) parameters to outcomes in terms of distinguishing pCR/non-pCR.

Sub-analyses were performed on different pCR-definitions (regarding in- or exclusion of residual ductal carcinoma in situ (DCIS)), when sufficient data were available.

Due to expected heterogeneity, we did not initially intend to conduct formal data-pooling and/or meta-analysis. Post hoc analysis of the results also prohibited this.

Results

Search strategy and study selection

The search (Additional file 1: Search term combinations in PubMed) resulted in 42 unique publications. After selection, 20 publications were included (Fig. 1). QUADAS-2 [15] assessment identified sources of bias and applicability concerns in most studies (Table 1). In some studies, patient selection might have introduced bias through a non-representative study population (e.g., not describing the group as consecutive, small research populations over a long time interval, tumor diameter as exclusion criterion). Furthermore, in several studies the number of scans per patient differed within the study.

Table 1 Risk of bias and applicability concerns of included studies

General study characteristics

A general overview of the study characteristics is presented in Tables 2 and 3. In total, approximately 1455 patients/lesions were included (range per study: 7–242) (Table 2). Most studies were retrospective single center. NST-regimens varied widely between and within the studies. The pCR-ratios varied (12.9–39.3%), reflecting the variability and wide distribution of histological and molecular subtypes in the studies. In addition, the pCR-definitions differed between studies. Taking this into account, we noted that in general higher pCR-ratios were observed with less stringent pCR-definitions (especially for DCIS), as shown in Table 4.

Table 2 General study parameters
Table 3 Technical scan parameters
Table 4 Studies classified by pCR-definition

MRI characteristics and DWI measures to predict and evaluate NST response

Regarding MRI-scanners, coils, and acquisition parameters of the DWI sequence, large heterogeneity was observed (Table 3). For example, in ten studies, DWI was performed at 1.5 T, eight studies used a 3.0 T scanner, and two studies used MRI-scanners with both field strengths. Although most studies used single-shot echo-planar imaging (SS-EPI), a wide variety was observed within and between studies regarding echo times (TE), the use of low b-values (< 150 s/mm2), methods to calculate ADC-values, and region of interest (ROI)-definitions (Table 5). The study characteristics (Tables 2, 3) are reviewed in detail in the “Discussion” section.

Table 5 Main region-of-interest specifications

Furthermore, the DWI measures varied between the studies (e.g., absolute values, relative values such as (percentage) change and ratios, or histogram-related values). Figures 2 and 3 illustrate the ADC values and the percentage change in ADC over time, respectively, for pCR and non-pCR. In Fig. 2, studies using scanners with a main magnetic field strength (B0) of 1.5 T or 3 T are also visually separated.

Fig. 2 Mean/median ADC-values (× 10−3 mm2/s) in different studies pre-, during, and post-NST for the pCR group (left figure) and the non-pCR group (right figure). Different time points are connected by solid lines (studies acquired at 1.5 T) or dashed lines (studies acquired at 3.0 T). Hahn et al. and Partridge et al. used both 1.5 T and 3.0 T scanners, which are represented with non-connected points. The legend shows the different studies that are included in the graphs. Note: the period of a cycle of neoadjuvant therapy (number of weeks) can differ within and between studies, as can the total number of cycles. Consequently, the solid and dashed arrow lines should not be used for interpolation of ADC-values between two measuring time points. For Woodhams et al. [64], only the pCR-definition from the full text was used, with ADC rounded to one decimal. For Kim et al. [53], Miller and Payne grade 4 cases were included as good responders

Fig. 3 Studies reporting the percentage difference in ADC for pCR and non-pCR from baseline for the general study population at different time points. Note: The period of a cycle of neoadjuvant therapy (number of weeks) can differ within and between studies as well as the total number of cycles

Baseline DWI–MRI

Sixteen publications investigated DWI–MRI at baseline for predicting pCR. Most studies tried to identify an ADC-threshold. The reported overall (mean or median) ADC-values varied between studies for patients who obtained pCR (0.41 × 10−3–1.16 × 10−3 mm2/s) and those who did not (0.64 × 10−3–1.23 × 10−3 mm2/s). Reported thresholds were highly variable. Figure 4 shows the results of three studies that distinguished pCR/non-pCR based on molecular subtype [16,17,18]. In general, the intervals of ADC-values for pCR and non-pCR cases overlapped between studies (Fig. 5). A trend observed in studies where residual DCIS is explicitly not allowed in the pCR-definition is that some tumors with a relatively low ADC tend to have a higher chance of showing pCR on NST (Fig. 5, category: “Invasive-, DCIS-”).

Fig. 4 ADC-values (× 10−3 mm2/s) at baseline per molecular subtype for two of the included studies, with two subtypes (HR-) in a and two subtypes (HR+) in b. Bufi et al. [17] distinguished triple negative, HER2-enriched, luminal, and hybrid (luminal and HER2+, i.e., HR+/HER2+) tumors. Liu et al. [16] distinguished luminal A (ER+ and/or PR+ incl. Ki67 < 14% or HER2−), luminal B (ER+ and/or PR+ incl. Ki67 ≥ 14% or HER2+), HER2-enriched and triple negative tumors. In this graph, the types from Liu et al. [16] of luminal A are designated as HR+/HER2− and luminal B as HR+/HER2−. From Bufi et al. [17], the luminal group is designated as HR+/HER2−. From Pereira et al. [18], three subtypes were reported

Fig. 5 Mean/median ADC (× 10−3 mm2/s) at baseline for pCR and non-pCR (and if known, the standard deviation), using different sub-classifications for pCR. For Woodhams et al. [64], the mean and standard deviation were extracted from the supplementary material and rounded to two decimals for both pCR-definitions (with and without DCIS)

Some studies reported non-mono-exponential/non-Gaussian models (e.g., intravoxel incoherent motion (IVIM)). For IVIM, a mean true diffusion coefficient (D) of 0.92 × 10−3 mm2/s (pCR) versus 0.83 × 10−3 mm2/s (non-pCR) was reported (p = 0.323) [19]. Another non-Gaussian approach, stretched exponential modeling (SEM), quantifying the intravoxel heterogeneity (i.e., the intravoxel heterogeneity index (α)) and the distributed diffusion coefficient (DDC) in a multi-exponential decay, resulted in cutoff values for α = 0.838 (AUC = 0.644) and DDC = 1.141 × 10−3 mm2/s (AUC = 0.756) [20]. Furthermore, one study identified skewness (p < 0.05) and entropy (p = 0.05) (both histogram-based features) as predictors of pCR [21]. More details are presented in Table 6.
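For orientation, the signal representations named in this section are commonly written as below. These are the textbook forms and are given here as an assumption; the included studies may have used variants (e.g., different fitting constraints or b-value ranges). Here \(S(b)\) is the signal at a given b-value, \(S_0\) the signal at b = 0, \(f\) the perfusion fraction, \(D^{*}\) the pseudo-diffusion coefficient, and \(D\) the true diffusion coefficient.

```latex
% Mono-exponential (ADC) representation
S(b) = S_0 \, e^{-b \cdot \mathrm{ADC}}

% Intravoxel incoherent motion (IVIM), bi-exponential
\frac{S(b)}{S_0} = f \, e^{-b D^{*}} + (1 - f) \, e^{-b D}

% Stretched exponential model (SEM), with 0 < \alpha \le 1
\frac{S(b)}{S_0} = e^{-(b \cdot \mathrm{DDC})^{\alpha}}
```

In the SEM form, \(\alpha = 1\) reduces to the mono-exponential case with DDC equal to ADC, whereas lower values of \(\alpha\) indicate a more heterogeneous, multi-exponential decay within the voxel.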

Table 6 DWI parameters pre-NST

DWI–MRI during NST

Nine studies reported on absolute ADC-values during NST to predict pCR. The scan-moments varied widely between the studies (after 1–5 NST-cycles). Reported ADC-values were heterogeneous. Overall, increasing ADC-values during NST seem to reflect response of the tumor. However, there is no clear threshold to distinguish partial and non-responders from complete responders. The optimal scan-moment evaluating therapy during NST seems to be subtype and NST-regimen dependent.

In one study [22], three types of NST (start) regimens were compared to predict pCR for different molecular subtypes. Looking at the highest AUC per subtype over all NST variants, the optimal scan-moment for pCR prediction in luminal A and B after starting with taxanes or anthracyclines is suggested to be after 3 weeks of therapy. When using change in ADC, an AUC = 0.865 for luminal B (starting with taxanes) and an AUC = 0.845 for luminal A (starting with anthracyclines) are reported. The optimal scan-moment for basal-like and HER2-enriched tumors starting with anthracyclines and taxanes is also suggested to be after 3 weeks, with AUC = 0.879 and AUC = 0.783, respectively, using change in ADC. For other combinations of NST-regimen and molecular subtype, 6 weeks is reported as the optimal scan-moment. The optimum can thus differ depending on the specific NST-type and cancer subtype; see [22] for all details.

A difficulty is that the reported series are generally small. Subdividing them into different subtypes and NST regimens leads to very small study populations. Partridge et al. [23] reported that all subtypes were underpowered, except HR+/HER2−. For this subtype, the predictive value of DWI (\(\Delta\)ADC (%)) after 3 weeks of taxane (paclitaxel) treatment achieved an AUC of 0.61, whereas Yuan et al. [22] reported an AUC = 0.678 for the (absolute) \(\Delta\)ADC in luminal A cancers, neglecting Ki-67 in this comparison. Furthermore, one study investigated three ROI-types in luminal cancer and defined the optimal ROIs according to the specific shrinkage pattern, achieving an AUC = 0.877 for \(\Delta\)ADC% after two cycles [24]. In addition, ADC-ratios, relating a time point (number of cycles) to baseline, were analyzed. Here, increased AUCs were observed as the evaluation moment progressed toward post-NST [25].

Studying IVIM, Che et al. [19] found after two cycles a mean true diffusion coefficient (D) of 1.36 × 10−3 mm2/s (pCR) versus 0.98 × 10−3 mm2/s (non-pCR) over all subtypes (p = 0.001). For distinguishing pCR/non-pCR, they reported a cutoff value of 0.971 × 10−3 mm2/s, yielding a 100% sensitivity at 63% specificity (AUC = 0.851). Another IVIM-parameter, the change in perfusion fraction (\(\Delta f\)), showed an AUC of 0.906 using a cutoff of 11.3% [19]. More details are displayed in Tables 7 and 8.

Table 7 DWI parameters during NST
Table 8 Change in ADC between baseline and during NST; (i) percentage change, (ii) absolute change, (iii) ADC ratios baseline

DWI–MRI after NST

Four papers evaluated absolute post-NST ADC-values (Table 9). In one study [26], an ADC-threshold of 1.19 × 10−3 mm2/s to distinguish pCR/non-pCR yielded an AUC of 0.80. Another study [16] used higher thresholds that also differed for the molecular subtypes (range: 1.33 × 10−3 mm2/s (luminal B) to 1.43 × 10−3 mm2/s (triple negative)).

Using the change in ADC between baseline and post-NST, one study suggested a threshold of 40.7% of increase to identify patients with a pCR, with 100% sensitivity, 91% specificity, and an AUC of 0.96 [26].

Another measure, the ADC-ratio (= mean post-ADC/mean pre-ADC), used by Santamaria et al. [27], was significant (p = 0.009) for distinguishing pCR/non-pCR (AUC = 0.73) (Table 10).
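The two relative measures used in this section, the percentage change in ADC and the ADC-ratio, can be written out as a short sketch. The function names are hypothetical and the example values in the usage note are invented. The 40.7% threshold is the one reported in [26] and is used purely as an illustration; it should not be read as a validated cutoff.

```python
def adc_percentage_change(adc_pre: float, adc_post: float) -> float:
    """Percentage change in ADC relative to the baseline (pre-NST) value."""
    if adc_pre <= 0:
        raise ValueError("baseline ADC must be positive")
    return 100.0 * (adc_post - adc_pre) / adc_pre


def adc_ratio(adc_pre: float, adc_post: float) -> float:
    """ADC-ratio defined as mean post-ADC divided by mean pre-ADC."""
    if adc_pre <= 0:
        raise ValueError("baseline ADC must be positive")
    return adc_post / adc_pre


def exceeds_threshold(adc_pre: float, adc_post: float,
                      threshold_percent: float = 40.7) -> bool:
    """True when the ADC increase reaches the (illustrative) threshold."""
    return adc_percentage_change(adc_pre, adc_post) >= threshold_percent
```

With an invented baseline ADC of 1.00 × 10−3 mm2/s and a post-NST ADC of 1.45 × 10−3 mm2/s, the percentage change is about 45% and the ADC-ratio is 1.45. Both measures carry the same information and differ only in how it is expressed.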

Table 9 DWI parameters after NST

More details about change in ADC are displayed in Fig. 3 (three studies at different time points) and Table 10.

Table 10 Percentage change in ADC after NST

Finally, the ROI-methodology also differed between studies for cases with and without apparent residual disease (ROI-specifications: Table 5).

Discussion

This review describes 20 studies reporting on DWI–MRI prior to/during/after NST to identify pCR of the breast. A major finding is that the studies were very heterogeneous regarding clinical, technical, and epidemiological aspects. These differences make pooling of results for meta-analysis difficult. Previous meta-analyses [12, 13] should therefore be interpreted with caution. Currently, it is impossible to define the role of DWI in identifying pCR after NST. The observed heterogeneity in type of cancers, applied treatments, and used quantification methods precludes straightforward implementation of DWI protocols for NST-monitoring in other hospitals.

Some of these limitations were also recognized for the value of DWI for lesion classification. The European Society of Breast Imaging (EUSOBI) International DWI working group recently published a consensus and mission statement to alleviate this issue for lesion classification only [14]. Based on the findings of the current systematic review, further standardization seems essential to implement DWI for treatment monitoring.

The Quantitative Imaging Biomarkers Alliance (QIBA) of the RSNA published standards in 2018 for some organs, related to the implementation of quantitative DWI biomarkers (such as reproducibility, repeatability, and measurement errors vs. real changes) [28]. In the revised standard, currently under development, technical breast imaging aspects are also included [29]. These aspects may alleviate some differences in acquisition and evaluation parameters that currently make multicenter implementation challenging. From a technical perspective, even more parameters than discussed in this review may influence measurements [30,31,32,33]. Different hardware components and MRI-protocols might also affect the precision and accuracy of the DWI metrics obtained for pCR prediction/evaluation, or even of DWI in general [34,35,36]. Furthermore, interpretation factors (e.g., reading system, reader experience) may affect results. Some quality issues were already addressed in a test–retest study by Newitt et al. [37]. Strikingly, the biological variability of cancers and the differences in treatment protocols are not addressed at all by the available guidelines.

Below, we discuss some of the most eye-catching differences between studies with respect to treatment monitoring that need to be addressed in the near future. We acknowledge that this list is certainly not complete.

As observed, ADC-values overlapped between pCR/non-pCR groups, and between studies. This may partly be explained by the different b-value combinations used for calculating ADC-values [38]. For example, including perfusion-sensitive low b-values can overestimate ADC, whereas using (diffusion and noise sensitive) high b-values can potentially underestimate ADC (Fig. 6, Additional file 1: Figure A and B, illustrating the different slopes). Moreover, the b-values can be constructed in different ways (i.e., depending on the DWI gradient properties). Theoretically, the diffusion time, and thereby the DWI image, can vary between scans even though the b-value is identical. This makes it difficult to compare b-values between scanners. Reporting differences in the gradient strength and its timing properties, which may influence measurement results, makes multi-center multi-scanner studies easier to understand. This is important because DWI, depending on the applied b-value, can be sensitive to intra- and/or extra-cellular water motion effects (i.e., restriction and hindrance, respectively) and/or perfusion/pseudo-diffusion effects. Additionally, the ADC calculation methods (e.g., the scanner or specific formulas) [39] might not be identical.
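The dependence of a two-point ADC on the chosen b-values can be illustrated with a small simulation. This is a sketch under stated assumptions: the signal follows a noise-free bi-exponential (IVIM-type) decay, and the tissue parameters `F`, `D_STAR`, and `D_TRUE` are hypothetical values chosen only for illustration. The Rician noise floor is not modelled, so the underestimation at high b-values mentioned above does not appear here. The b-value pairs are those quoted in this review for Santamaria et al. and Tozaki et al.

```python
import math

# Hypothetical tissue parameters (assumptions, not taken from any study)
F = 0.10          # perfusion fraction
D_STAR = 10e-3    # pseudo-diffusion coefficient (mm2/s)
D_TRUE = 1.0e-3   # true diffusion coefficient (mm2/s)


def ivim_signal(b: float) -> float:
    """Noise-free bi-exponential signal, normalised to S0 = 1."""
    return F * math.exp(-b * D_STAR) + (1.0 - F) * math.exp(-b * D_TRUE)


def two_point_adc(b_low: float, b_high: float) -> float:
    """Mono-exponential ADC from two b-values: ln(S_low / S_high) / (b_high - b_low)."""
    return math.log(ivim_signal(b_low) / ivim_signal(b_high)) / (b_high - b_low)


adc_0_700 = two_point_adc(0, 700)        # includes the full perfusion effect
adc_50_700 = two_point_adc(50, 700)      # perfusion partly suppressed
adc_500_1500 = two_point_adc(500, 1500)  # perfusion almost fully suppressed

for label, value in [("b = 0/700", adc_0_700),
                     ("b = 50/700", adc_50_700),
                     ("b = 500/1500", adc_500_1500)]:
    print(f"{label}: ADC = {value * 1e3:.3f} x 10^-3 mm2/s")
```

For the same simulated tissue, the ADC is highest for b = 0/700 s/mm2, lower for b = 50/700 s/mm2, and closest to the true diffusion coefficient for b = 500/1500 s/mm2. The simulation only demonstrates the direction of the perfusion effect; it does not explain the actual differences between the cited studies.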

Fig. 6 Schematic overview, with semi-logarithmic plots (and S0, the signal at b = 0 without perfusion component), of the signal attenuation of pure diffusion (blue curve) and of the signal attenuation by (micro)perfusion and diffusion, including the contribution of noise through the Rician noise floor (red curve). Within the red curve, the first small arrow represents the mono-exponential slope (ADC1) within segment I, and the second small arrow represents the mono-exponential slope within segment II (ADC2). The large arrow represents the mono-exponential approach/slope (ADC3) using two b-values, one in segment I and one in segment II. Three segments of diffusion-sensitizing gradient strength are defined by the b-values; I: diffusion and flow-sensitive b-values (diffusion gradients); II: diffusion sensitive, flow insensitive b-values; III: flow insensitive and noise sensitive b-values. The b-value independent Rician noise level is indicated as the noise floor. Note: ADC1 ≠ ADC2 ≠ ADC3. The axis scales, the slopes, and thereby the numeric functions serve as a schematic representation of the general picture and might therefore differ from clinical practice

The large variation between studies is well illustrated by the differences in baseline mean ADC between Santamaria et al. and Tozaki et al., for both the non-pCR cases (1.072 × 10−3 mm2/s vs. 0.64 × 10−3 mm2/s) and the pCR cases (1.025 × 10−3 mm2/s vs. 0.41 × 10−3 mm2/s). The b-value combinations were b = 0 s/mm2 and b = 700 s/mm2 or b = 50 s/mm2 and b = 700 s/mm2 for Santamaria et al., and b = 500 s/mm2 and b = 1500 s/mm2 for Tozaki et al. [27, 40]. ADC cutoff values for pCR and non-pCR reported in different studies may thus be sensitive to technical heterogeneity. This makes Quality Control (QC) and Quality Assurance (QA), using DWI phantoms [41,42,43] and patient test–retest procedures [37], essential.

One could argue that, in a longitudinal study, using (flow-sensitive) low b-values may have an undesirable effect on the validity of ADC for measuring response in highly vascularized tumors. NST reduces vascularization within the ROI and therefore leads to a decrease in the perfusion fraction (f), which may cause a decrease in the slope of (a part of) the attenuation curve. Simultaneously, the diffusion coefficient increases and compensates for this decrease, resulting in a smaller (or even no) difference in ADC between time points. Theoretically, separating the perfusion/pseudo-diffusion and diffusion effects by using > 2 b-values and calculating IVIM-parameters could solve this. However, whether this is really beneficial could not be concluded from the studies included in this review, due to the small number of studies and their heterogeneity. The complexity of choosing the optimal scan-moments and parameters can be observed in Li et al. [44], who suggested that tumors with a relatively high ADC during NST are more likely to show pCR, while Tozaki et al. [40] suggested the opposite (Table 7). However, this could not clearly be explained by the DWI acquisition moment during NST.
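The compensation effect described above, and the way a multi-b-value analysis could separate it, can be sketched with a simple two-step ("segmented") IVIM estimate. Everything here is an assumption for illustration: the pre- and post-NST tissue parameters are hypothetical, the signals are noise-free, the threshold of 200 s/mm2 is an arbitrary choice, and the segmented fit is one common approach from the IVIM literature, not necessarily the method used in the included studies.

```python
import math

B_VALUES = [0, 50, 200, 400, 600, 800]  # s/mm2, hypothetical protocol
D_STAR = 20e-3                          # pseudo-diffusion coefficient (mm2/s), assumed


def ivim_signal(b, f, d):
    """Noise-free bi-exponential signal, normalised to S0 = 1."""
    return f * math.exp(-b * D_STAR) + (1.0 - f) * math.exp(-b * d)


def two_point_adc(signals):
    """Mono-exponential ADC from the lowest and highest b-value."""
    span = B_VALUES[-1] - B_VALUES[0]
    return math.log(signals[0] / signals[-1]) / span


def segmented_fit(signals, b_threshold=200):
    """Log-linear least-squares fit on b >= threshold gives D;
    extrapolating that line to b = 0 gives the perfusion fraction f."""
    pts = [(b, math.log(s)) for b, s in zip(B_VALUES, signals) if b >= b_threshold]
    mean_b = sum(b for b, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    slope = (sum((b - mean_b) * (y - mean_y) for b, y in pts)
             / sum((b - mean_b) ** 2 for b, _ in pts))
    intercept = mean_y - slope * mean_b
    s0 = signals[B_VALUES.index(0)]
    return -slope, 1.0 - math.exp(intercept) / s0


# Hypothetical tumor: perfusion fraction drops while true diffusion rises
pre = [ivim_signal(b, f=0.15, d=1.00e-3) for b in B_VALUES]
post = [ivim_signal(b, f=0.05, d=1.15e-3) for b in B_VALUES]

adc_change = 100.0 * (two_point_adc(post) - two_point_adc(pre)) / two_point_adc(pre)
d_pre, f_pre = segmented_fit(pre)
d_post, f_post = segmented_fit(post)
d_change = 100.0 * (d_post - d_pre) / d_pre

print(f"ADC change: {adc_change:.1f}%  D change: {d_change:.1f}%")
```

In this simulated case, the two-point ADC (b = 0 and 800 s/mm2) changes by roughly 1%, while the fitted diffusion coefficient rises by roughly 15% and the fitted perfusion fraction falls. With real, noisy data the fitted parameters are less stable, which is one reason the benefit of IVIM cannot be assumed.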

Besides DWI models [45] and b-values, ROI-selection is also crucial for a representative quantitative analysis of each lesion. Using different ROI-definitions (2D/3D) can influence the quantitative results in general, as reported by Bickel et al. [46]. These authors suggested choosing the most aggressive part of the lesion, i.e., the area with the minimum ADC, for a 2D-ROI [46]. Other methods have also been studied, such as whole-tumor versus small sub-region ROIs [47]. However, these publications are related to lesion classification. It is even more unclear which ROI is most appropriate in a longitudinal setting. Within the ROI, partial volume effects (PVE) might influence the (mean) ADC. During therapy, tumor heterogeneity (and thus PVE) may increase, and the optimal ROI selection may be affected by the various observed shrinkage patterns of breast cancer [24, 48]. Consequently, these aspects make choosing a reliable ROI during and after therapy even more difficult to standardize. Based on a systematic review, no optimal ROI technique was identified [49]. In line with the recent study of Wielema et al. regarding the optimal ROI technique for lesion classification using DWI, more extensive research on this specific topic is also required in the setting of therapy monitoring.

For identifying the most reliable ROI in case of small regions of (residual) disease, sufficient spatial resolution and contrast-to-noise ratio (CNR) between the lesion and the breast parenchyma are required. In DWI, this can be challenging, as SS-EPI is often used with a large field-of-view (FOV) to cover both breasts, thereby compromising spatial resolution due to signal-to-noise ratio (SNR) and scan-time limitations. Therefore, DCE-images are often used as guidance for tumor localization, assisting with identifying lesion(s) on the high b-value images. It should be noted that at higher b-values the SNR decreases, so that the noise level (Rician noise floor) can be reached. To increase SNR in these cases, the number of excitations (NEX) can be increased, which directly increases the total scan-time. Balancing both (noise ratio and scan-efficiency) can be challenging and will depend on the magnitude of the high b-value image. Increasing the highest b-value might result in a longer TE, causing a lower SNR, requiring more NEX, and finally a longer acquisition time. Moreover, as there is an inverse relation between image resolution and SNR, recommendations are required that discuss the optimal use of DWI for near complete response cases at time-points toward surgery, or when small volume lesions (< 1 cm) are detected at baseline (e.g., by using a different or additional high resolution protocol). The development of new DWI sequences addressing this resolution aspect [50] and the implementation of post-processing (noise filtering, using advanced DWI models/representations with their considerations [51]) need to be investigated further for these kinds of cases. However, it should be noted that this would make standardization of DWI for treatment monitoring even more complex.
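The trade-offs described above can be summarized with the usual first-order proportionalities for a spin-echo DWI acquisition. This is a simplified sketch under stated assumptions: it ignores coil geometry, parallel imaging, and bandwidth, and the symbols \(\Delta V\) (voxel volume) and \(T_2\) (transverse relaxation time of the tissue) are introduced here for illustration only.

```latex
\mathrm{SNR}(b) \;\propto\; \Delta V \,\sqrt{\mathrm{NEX}}\; e^{-TE/T_2}\; e^{-b \cdot \mathrm{ADC}},
\qquad
T_{\mathrm{scan}} \;\propto\; \mathrm{NEX}
```

Under these assumptions, halving the voxel volume halves the SNR, and recovering it by averaging alone requires four times the NEX and thus roughly four times the scan-time. A longer TE or a higher b-value lowers the SNR further through the two exponential terms.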

Analyzing the value of DWI requires measurements coupled to a specific pathological endpoint after NST (pCR/non-pCR). Differences in the histo-pathological analysis (and inter-observer differences in defining the molecular subtype of the diagnostic biopsy [52]) and in pCR-definitions can affect this categorization, which further hampers data pooling. Some authors allowed residual DCIS within the group of pCR; others classified it as non-pCR. Furthermore, Liu et al. [16] included Miller & Payne grade 4 (> 90% loss of tumor cells) within the pCR group and Kim et al. [53] labeled those as good responders, whereas others only included grade 5 (no viable tumor cells). Inclusion of DCIS (alone or in combination with grade 4 residual disease) in the pCR group logically leads to different ADC measures than when the pCR group consists of cases without residual DCIS. Notably, while DCIS is not always visible on DWI because of the limited spatial resolution, it may still affect ADC-values due to microstructural changes. With the final goal of identifying pCR of the breast after NST in mind, and thereby omitting breast surgery, it seems most appropriate to use a pCR definition of ypT0 (i.e., residual DCIS is not permitted). However, recommendations from the Breast International Group-North American Breast Cancer Group (the BIG-NABCG) on the pathological evaluation of post-NST specimens still give the option to in- or exclude DCIS from this definition [54, 55]. To achieve more standardization and make studies more comparable, expert consensus is required on the most suitable pCR-definition and on the definition of radiological complete response on DWI.

ADC-values can also vary widely between tumors of different morphological [56] and molecular subtype [57]. Remarkably, in most studies ADC-values were not differentiated by tumor type. Likewise, differences can occur after treatment due to varying NST-regimens. Only four studies reported (absolute/change in) ADC-values for different cancer subtypes, showing differences in distinguishing pCR/non-pCR cases. In other words, all subtypes will likely have specific cutoff values that will also differ further depending on the NST-type. In line with DCE-MRI [7] and PET-CT [58], DWI will likely also have varying diagnostic performance for response prediction in different subtypes. Partridge et al. [23] and Yuan et al. [22] underlined that the optimal timing of DWI during NST also differs between molecular subtypes and types of NST. Substantial knowledge about the tumor and its initial and long-term reaction to NST (e.g., cell swelling, apoptosis, and inflammation) is required to determine the optimal timing. Therefore, future DWI research should study identical treatment regimens for specific tumors in large study populations.

Based on this review, identifying pCR seems to be more accurate with parameters that measure differences in ADC-value during NST than with an (absolute) ADC measured at one or several time point(s). This is likely because the relative changes (partly) compensate for the variability in the acquisition parameters and biological properties of breast cancers. In general, treatment response is represented by an increase in the lesion’s ADC-value, although even this was not apparent in all studies [59].

Moreover, statistical limitations hamper the potential comparison and pooling of studies. For example, in the QUADAS-2 [15] assessment, risks of bias were observed regarding the research populations. Furthermore, for comparing predictive statistical parameters (PPV/NPV) the prevalence of tumor subtypes needs to be identical. Only a ROC-AUC might give some statistical value to all cases, because it is reported to be prevalence independent [60]. However, as reported in this study, this does not compensate for underlying heterogeneity.
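The prevalence dependence of PPV and NPV mentioned above can be made concrete with a short sketch. It assumes a fixed sensitivity and specificity and varies only the proportion of pCR cases; the function name is hypothetical. The usage note applies the aggregate sensitivity and specificity reported by Chu et al. [12] to the lowest and highest pCR-ratios found in this review, purely as an arithmetic illustration.

```python
def predictive_values(sensitivity: float, specificity: float,
                      prevalence: float) -> tuple:
    """Return (PPV, NPV) for a given pCR prevalence, via Bayes' rule."""
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    tp = sensitivity * prevalence                  # true positives (fraction of all cases)
    fp = (1.0 - specificity) * (1.0 - prevalence)  # false positives
    fn = (1.0 - sensitivity) * prevalence          # false negatives
    tn = specificity * (1.0 - prevalence)          # true negatives
    return tp / (tp + fp), tn / (tn + fn)


# Same test characteristics, two different pCR-ratios (12.9% and 39.3%)
ppv_low, npv_low = predictive_values(0.88, 0.79, 0.129)
ppv_high, npv_high = predictive_values(0.88, 0.79, 0.393)

print(f"pCR-ratio 12.9%: PPV = {ppv_low:.2f}, NPV = {npv_low:.2f}")
print(f"pCR-ratio 39.3%: PPV = {ppv_high:.2f}, NPV = {npv_high:.2f}")
```

With identical sensitivity and specificity, the PPV is about 0.38 at a pCR-ratio of 12.9% and about 0.73 at 39.3%, while the NPV moves in the opposite direction (about 0.98 versus 0.91). Predictive values from study populations with a different subtype mix, and therefore a different pCR-ratio, are thus not directly comparable.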

In summary, this review unearths many sources of heterogeneity that are currently present in studies on the use of breast DWI for the prediction of response to NST. This heterogeneity is not limited to acquisition parameters, but is also caused by large differences in patient populations, biological tumor characteristics, applied therapies, and the outcome parameters used. We acknowledge that, besides the factors we specifically addressed, even more characteristics in each of these fields could influence DWI measurements. Considering the limited case and study numbers, and all heterogeneity encountered, it would be premature to define the optimal DWI parameters based upon this review. Overall, the level of evidence for response prediction and evaluation using ADC as DWI metric is moderate. However, specific details, such as the influence of the biology of tumors and the technical aspects of DWI for response prediction, only have a low level of evidence [61]. Proper validation aimed at overcoming the translational gaps [62] and standardization of the study designs (from patient inclusion to analysis) require substantial consensus efforts, which are crucial to accelerate optimization and potential implementation of quantitative DWI for NST-monitoring in breast cancer patients.

Finally, besides standardization and validation issues, there are also limited data about the cost-effectiveness of MRI in the NST setting [63]. To get an overall idea of the added value of DWI in this NST setting, cost-effectiveness also needs to be analyzed.

By addressing these issues, this review aims to increase awareness on different sources of variability and supplements the works of EUSOBI [14], QIBA [29], Padhani et al. [10] and O’Connor et al. [62], to initiate a future consensus for the use of breast DWI in the treatment monitoring setting.

Conclusion

Clinical, technical, and epidemiological heterogeneity was observed in all aspects of studies correlating DWI measurements to pCR/non-pCR.

The observed methodological heterogeneity and the small patient numbers make it currently difficult to assess to what extent DWI–MRI might predict pCR. The preliminary conclusion is that the absolute ADC is not (yet) robust for distinguishing pCR/non-pCR, without considering multiple variables. Therefore, multidisciplinary cooperation/consensus is required, to obtain reliable and reproducible longitudinal DWI measurements for identifying non-pCR/pCR cases in specific and well-defined subgroups of patients.