Dear Editor,

I read with interest your recently published article in Gastric Cancer entitled “Choice of PD-L1 immunohistochemistry assay influences clinical eligibility for gastric cancer immunotherapy” [1]. This paper, like many similar ones, is the result of what the authors refer to as an “unmet clinical and logistical need for harmonization” of PD-L1 testing. The study explores the so-called “interchangeability” of the Dako 22C3, Dako 28-8, and Ventana SP142 assays as predictive biomarkers in gastric cancer.

Studies of this kind may significantly impact clinical practice and may be used as evidence by regulatory bodies (e.g., the FDA, Health Canada, or others) to approve or deny approval of certain biomarkers for certain purposes. It is therefore critical that this type of evidence be transparent about what it is and how it applies to clinical practice and to other published studies on the interchangeability of PD-L1 biomarker assays.

I have reviewed this paper carefully and unfortunately found several issues in its study design and in the interpretation of its results, as follows:

  • All published clinical trials used regulatory agency-approved PD-L1 biomarker assays, specifically the DAKO PD-L1 IHC 22C3 pharmDx, DAKO PD-L1 IHC 28-8 pharmDx, VENTANA PD-L1 (SP142) Assay, or VENTANA PD-L1 (SP263) Assay. These IHC assays have established analytical sensitivity that is generally stable, their results are generally reproducible, and their scoring schemes are directly linked to the respective FDA-approved assay and the clinical trial(s) of a specific immune checkpoint inhibitor; they are also designated as companion diagnostics (CDx). The primary antibody used in these CDx assays is not itself an “assay”; it is just a primary antibody. Because the authors used these primary antibodies in different, laboratory-developed tests (LDTs), which were not conventional IHC but a multiplex mIHC/IF assay based on the Opal Multiplex fIHC kit, the results of this study are not applicable to clinical practice. Although the authors used the same primary antibody clones, it is not clear why they assume that their LDTs will have the same analytical performance as the original FDA-approved CDx assays. There is also no attempt by the authors to compare their 22C3 LDT with the DAKO PD-L1 IHC 22C3 pharmDx, their 28-8 LDT with the DAKO PD-L1 IHC 28-8 pharmDx, or their SP142 LDT with the VENTANA PD-L1 (SP142) Assay, first separately and then when multiplexed. It is common knowledge, and the historical experience of many proficiency testing programs, that the conditions of an IHC protocol beyond the primary antibody are critically important for the result. Some LDTs may be good; others may be insufficient for the purpose for which they were developed. Therefore, fit-for-purpose diagnostic and technical validation of LDTs is essential before they can be considered similar to clinically validated CDx assays (a sketch of what such a validation comparison might report follows this list). The authors cite their previous publication, in which they compared what they refer to as “conventional IHC” to their multiplex LDT using three different clones, which did not include the 28-8 clone [2]. Not only was the 28-8 clone not included, but there is also no clear statement of whether the “conventional IHC” assays were performed according to the CDx specifications or whether the pre-diluted antibodies were used in their own LDT for each IHC assay. Even if the CDx assays were used according to their specified protocols, the purpose of that study was not to validate the multiplex LDT against the CDx assays; the results were merely compared, and the reported concordance rates show that only 2 of 9 comparisons exceeded 90% concordance, with the lowest being 67%. With these results, we can be reasonably assured that the mIHC/IF assay was not validated for diagnostic equivalence against the respective CDx assays.

  • Sample degradation is an important consideration in studies of PD-L1 expression, as it has been shown that, at least for some clones (e.g., 22C3), paraffin blocks older than 3 years may already show degradation of the PD-L1 epitope recognized by the clone [3, 4]. This may or may not hold for other PD-L1 clones, and it may introduce background noise when two assays using two different clones are compared. However, even the “new cohort” is quite old, with the newest samples dating from 2013; all samples are therefore much older than 3 years. It is uncertain whether this contributes to the low(er) sensitivity observed with the 22C3 clone.

  • The authors use the same scoring schemes for the readout of their LDTs as are used by the CDx assays for each primary antibody. It is a serious mistake to apply the same scoring system to assays of unknown and presumably different analytical sensitivity. If the authors developed LDTs with higher analytical sensitivity, using the same scoring scheme would yield a higher number of positive cases; conversely, an LDT with lower analytical sensitivity than the relevant CDx assay would yield a lower number of positive cases (a toy illustration of this effect is sketched after this list). Since the analytical sensitivity of their LDTs is unknown, the results could go either way. The need to align analytical sensitivity with a scoring scheme was recently emphasized for ROS1 IHC assays [5].

  • The authors used correlation (Spearman’s correlation) to analyze the results. This is a common but serious mistake. Correlation should never be used to compare two methods that are “measuring” (or assessing) the same variable or parameter. While it makes sense to assess the correlation between a person’s height and weight, it does not make sense to assess the correlation between two methods that measure a person’s height: there will, of course, be a correlation. What really matters when comparing predictive qualitative assays (such as PD-L1 IHC assays) is their accuracy, which is assessed by diagnostic sensitivity and specificity calculated from the numbers of true-positive, false-positive, true-negative, and false-negative results when a candidate assay is compared with a designated reference method (or comparator assay) [6]; this point is illustrated in the final sketch after this list. It is elaborated in detail in a meta-analysis of PD-L1 interchangeability studies [7]. Similarly, mean scores are completely irrelevant because they tell us nothing about diagnostic errors (e.g., false-negative or false-positive results).
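
As a purely illustrative aside to the first point above, the following is a minimal Python sketch of the kind of agreement metrics (positive, negative, and overall percent agreement) that a fit-for-purpose validation of an LDT against its corresponding CDx assay could report. All of the case calls and the 90% acceptance level below are invented for illustration and are not taken from the study under discussion.

```python
# Minimal sketch (invented data): the kind of agreement summary a fit-for-purpose
# validation of an LDT against its corresponding CDx assay could report.
# "ldt_calls" and "cdx_calls" are hypothetical positive/negative calls on the
# same specimens; the 90% acceptance level is likewise only illustrative.
import numpy as np

cdx_calls = np.array([1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1], dtype=bool)
ldt_calls = np.array([1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1], dtype=bool)

tp = np.sum(ldt_calls & cdx_calls)    # positive by both assays
tn = np.sum(~ldt_calls & ~cdx_calls)  # negative by both assays
fp = np.sum(ldt_calls & ~cdx_calls)   # positive only by the LDT
fn = np.sum(~ldt_calls & cdx_calls)   # missed by the LDT

ppa = tp / (tp + fn)               # positive percent agreement vs. the CDx comparator
npa = tn / (tn + fp)               # negative percent agreement
opa = (tp + tn) / len(cdx_calls)   # overall percent agreement

for name, value in [("PPA", ppa), ("NPA", npa), ("OPA", opa)]:
    status = "meets" if value >= 0.90 else "fails"
    print(f"{name}: {value:.0%}  ({status} a hypothetical 90% acceptance criterion)")
```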
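
On the scoring point, the following is a toy simulation, again with invented numbers and a stand-in positivity cutoff (at least 1% of cells stained), showing why applying one fixed scoring scheme to assays of different analytical sensitivity mechanically shifts the number of positive cases. It is not a model of the authors' LDTs; the distributions and detection thresholds are assumptions chosen only to make the effect visible.

```python
# Toy simulation (all numbers invented): applying one fixed scoring cutoff to
# assays of different analytical sensitivity changes how many cases are called
# positive, even though the specimens and the scoring rule are identical.
import numpy as np

rng = np.random.default_rng(0)
n_cases, n_cells = 500, 1000

# Per-cell PD-L1 expression for each case, drawn around a case-level mean
case_means = rng.lognormal(mean=-1.0, sigma=1.0, size=n_cases)
expression = rng.gamma(shape=2.0, scale=(case_means / 2.0)[:, None],
                       size=(n_cases, n_cells))

def positivity_rate(detection_threshold, score_cutoff=0.01):
    """Fraction of cases called positive at a fixed cutoff (>= 1% stained cells),
    given an assay whose analytical sensitivity is summarized by the lowest
    expression level it can visualize (detection_threshold)."""
    stained_fraction = (expression >= detection_threshold).mean(axis=1)
    return (stained_fraction >= score_cutoff).mean()

# Same specimens, same scoring rule; only the assay's analytical sensitivity differs
for label, threshold in [("reference-like assay", 1.0),
                         ("more sensitive LDT", 0.5),
                         ("less sensitive LDT", 2.0)]:
    print(f"{label:>22}: {positivity_rate(threshold):.0%} of cases called positive")
```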
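
Finally, on the statistical point, the sketch below (again using invented scores and a hypothetical cutoff) illustrates how two assays assessing the same parameter can show a high Spearman correlation while still producing false-negative calls at the clinical cutoff, a discordance that only diagnostic sensitivity, specificity, and percent agreement reveal.

```python
# Illustrative sketch (invented data): a systematically less sensitive candidate
# assay correlates strongly with the comparator yet misclassifies cases at the
# clinical cutoff, which correlation alone does not show.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n = 200
reference = rng.gamma(shape=1.5, scale=3.0, size=n)        # scores from the comparator (reference) assay
candidate = reference * 0.6 + rng.normal(0, 0.5, size=n)   # hypothetical, less sensitive candidate assay

rho, _ = spearmanr(reference, candidate)

cutoff = 1.0                                   # hypothetical clinical positivity cutoff
ref_pos, cand_pos = reference >= cutoff, candidate >= cutoff

tp = np.sum(cand_pos & ref_pos)
fp = np.sum(cand_pos & ~ref_pos)
fn = np.sum(~cand_pos & ref_pos)
tn = np.sum(~cand_pos & ~ref_pos)

print(f"Spearman rho: {rho:.2f}")                   # high correlation despite discordant calls
print(f"Sensitivity: {tp / (tp + fn):.0%}")         # lowered by false-negative calls
print(f"Specificity: {tn / (tn + fp):.0%}")
print(f"Overall percent agreement: {(tp + tn) / n:.0%}")
```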

In summary, the use of a multiplex immunohistochemistry/immunofluorescence (mIHC/IF) LDT for the simultaneous assessment of PD-L1 expression is a very interesting and promising methodology. However, given the design of the study, the methods applied, and the terminological confusion, the results of this study are not applicable to clinical practice. Conclusions about the performance and interchangeability of the FDA-approved assays in gastric cancer cannot be drawn from the results of this LDT; the authors provided no evidence that the multiplex LDT has the same test performance characteristics as, or is equivalent to, the corresponding FDA-approved assays, beyond the fact that the same primary antibodies were used.

Editor-in-Chief

Yasuhiro Kodera, Nagoya.