A deep learning-based histopathology classifier for Focal Cortical Dysplasia

A light microscopy-based histopathology diagnosis of human brain specimens obtained from epilepsy surgery remains the gold standard to confirm the underlying cause of a patient’s focal epilepsy and to further inform postsurgical patient management. The differential diagnosis of neocortical specimens in the realm of epilepsy surgery remains, however, challenging. Herein, we developed an open access, deep learning-based classifier to histopathologically assess whole slide microscopy images (WSI) and to automatically recognize various subtypes of Focal Cortical Dysplasia (FCD), according to the ILAE consensus classification update of 2022. We trained a convolutional neural network (CNN) with fully digitized WSI of hematoxylin–eosin stainings obtained from 125 patients covering the spectrum of mild malformation of cortical development (mMCD), mMCD with oligodendroglial hyperplasia in epilepsy (MOGHE), FCD ILAE Type 1a, 2a and 2b using 414 formalin-fixed and paraffin-embedded archival tissue blocks. An additional series of 198 postmortem tissue blocks from 59 patients without neurological disorders served as control to train the CNN for homotypic frontal, temporal and occipital areas and heterotypic Brodmann areas 4 and 17, entorhinal cortex and dentate gyrus. Special stains and immunohistochemical reactions were used to comprehensively annotate the region of interest. We then programmed a novel tile extraction pipeline and graphical dashboard to visualize all areas on the WSI recognized by the CNN. Our deep learning-based classifier processes tiles of 1000 × 1000 µm and recognizes 25 anatomical regions and FCD categories with an accuracy of 98.8% (F1 score = 0.82). Microscopic review of regions predicted by the network confirmed these results. This deep learning-based classifier will be made available as an online web application to support the differential histopathology diagnosis in neocortical human brain specimens obtained from epilepsy surgery. It will also serve as a blueprint to build a digital histopathology slide suite addressing all major brain diseases encountered in patients with surgically amenable focal epilepsy.


Introduction
Advancements in WSI technology have helped to promote digital pathology and the development of artificial intelligence (AI)-based disease classification algorithms at a larger scale [8,14]. However, published AI studies still cover only a few topics of the diagnostic spectrum in neuropathology or histopathology, including epilepsy surgery. Recently, a deep learning-based algorithm was shown to differentiate the cellular profile of FCD Type 2b from cortical tuber on routine hematoxylin-eosin (HE)-stained WSI [19]. This algorithm recognized bulky and strand-like matrix reaction, halo-like balloon cell artifacts as well as bigger astroglial cell nuclei in cortical tuber compared to FCD ILAE Type 2b as disease-specific signatures. These small differences at the cellular level and in the texture had not been recognized by light microscopy before and also helped to define a histopathology scoring system for training colleagues not specialized in this diagnostic arena. Another hurdle in the development of AI-based histopathology classifiers is the huge work burden for annotation of large datasets, which supports the use of weakly supervised deep learning pipelines [8,14]. This also applies to the prediction of defined genetic mutations in histopathology specimens, e.g., colorectal, gastric and bladder cancer [14]. A different AI-based approach was able to replicate histochemical and immunohistochemical stainings from unstained slides without actually performing the staining procedures [12,29]. In contrast, AI-based classifiers are increasingly developed in neuroradiology in order to reduce the burden of reviewing large stacks of MR images at various protocols and to detect small lesions in drug-resistant focal epilepsy, i.e., FCD [15]. However, the full potential of cost-effective and AI-based digital pathology classifiers needs to be further explored.
Malformations of cortical development (MCD) represent the most common structural brain lesion in children with drug-resistant focal epilepsy. Focal Cortical Dysplasia (FCD) represents approx. 75% of all MCD entities collected in large epilepsy surgery series [6,20]. The International League against Epilepsy (ILAE) has proposed a consensus classification scheme for FCD in order to cover the clinical spectrum of FCD, their associated histopathology patterns and likely etiology, e.g., brain somatic mutations in the mTOR pathway [3,4,7]. The histopathological classification of these categories remains, however, an ever-challenging issue in daily routine practice [22,23], which has been documented many times in the scientific literature [1,4,5,9,11]. Molecular neuropathology has been recognized as a helpful diagnostic tool to objectively classify human brain lesions, e.g., in brain tumors, and was recently introduced also into the realm of epileptology and epilepsy surgery, e.g., FCD ILAE Type 2 with brain somatic mTOR pathway mosaicism or MOGHE, SLC35A2 altered [3,4]. These entities have also been included in the ILAE consensus FCD classification update of 2022, which mandates an integration of clinical, histopathological and neuroimaging information into the final FCD diagnosis [22]. Molecular testing of surgical human brain samples remains, however, cost intensive and may not be accessible to many surgical centers and pathology laboratories around the world. An AI-based histopathology classifier for FCD would be cost-effective and available online, and would thus improve the knowledge of and access to routine histopathology diagnosis. In the current work, we retrieved a large series of MCD specimens covering the spectrum of common and rare FCD subtypes, used the WSI technology and developed a supervised, deep learning-based histopathology classifier to reliably detect FCD subtypes in surgically resected human brain tissue specimens.

Patients included in this study
Archival microscopy slides of tissue specimens from 125 patients with the histopathological diagnosis of mMCD, MOGHE, FCD Type 1a, 2a and 2b according to the ILAE classification scheme [7,22] were retrieved from the Neuropathological Institute at Universitätsklinikum Erlangen (see supplemental Table 1 and supplemental Figure 1). In addition, we retrieved age-matched postmortem brain tissue of Brodmann areas 4 and 17, entorhinal cortex and dentate gyrus from 59 patients without any neurological disorder from the Neuropathological Institute at Universitätsklinikum Erlangen and the Ludwig-Maximillian University München. Histological sections were 3-5 µm thin, obtained from formalin-fixed and paraffin-embedded (FFPE) tissue blocks and stained with hematoxylin-eosin (HE) and HE-Luxol-Fast-Blue (HE-LFB) according to routine neuropathology workup protocols [2]. Additional HE, HE-LFB or immunohistochemical stainings for NeuN, MAP2, SMI32, and Olig2 epitopes were prepared from archival FFPE tissue blocks when necessary using the same protocols. All slides were microscopically reviewed by experienced neuropathologists (IB, JoHe, SR and RC) to select a minimum of 20 cases from each histopathology entity to be included in this study (see supplemental Table 1). The FCD1a category included those patients described recently [17]. The glossary summarizes the definition applied for each category. Whole slide images (WSI) were digitally recorded from HE as well as adjacent immunohistochemical stainings using a Hamamatsu S60 scanner (Hamamatsu Photonics Europe, Herrsching, Germany) with a mean of 3.02 Gb per slide, ranging from 0.47 to 5.72 Gb. Each patient's clinical history was retrieved from the archival hospital files. The study was approved by the University of Erlangen ethical review board under the agreement number 193_18B.

The annotation and extraction pipeline
The region of interest (ROI) representing the anticipated anatomical region or cortical malformation was manually annotated on the microscopy slide or WSI using the Hamamatsu NDPi viewer by our experienced neuropathologists IB, RC, JHe, SR using HE-LFB and/or respective immunostainings (Fig. 1b). Regions not cut perpendicular to the cortical surface were excluded from the study. Annotations were digitally introduced onto the WSI of the adjacent HE staining (Fig. 1c). A Python script was developed to identify the vector coordinates of these annotations and to automatically extract 1 mm² tiles with a stride of 100 µm at Hamamatsu ndpi-Level 2, perpendicular to the annotation line, by using the open access OpenSlide (https://doi.org/10.4103/2153-3539.119005) and diagonal-crop (GitHub: jobevers/diagonalcrop) libraries (Fig. 1d). Tiles were then resampled at a resolution of 1024 × 1024 pixels and stored locally for later use by the CNN, adding the histopathology label to each unique filename.
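The geometric part of this extraction step can be illustrated with a minimal Python sketch. It is not the script used in this study: the function name `tile_centers_along_line`, the input format (annotation vertices in micrometres) and the choice to report the local segment direction are assumptions made for illustration only. The sketch walks along an annotation polyline and returns one sampling position every 100 µm together with the local orientation, from which a rotated 1000 × 1000 µm crop could then be derived.

```python
import math

def tile_centers_along_line(points_um, stride_um=100.0):
    """Sample positions every `stride_um` along an annotation polyline.

    points_um: list of (x, y) vertices in micrometres (hypothetical input format).
    Returns a list of (x, y, angle_deg); angle_deg is the direction of the
    local line segment, so a crop rotated by this angle has edges parallel
    and perpendicular to the annotation line.
    """
    samples = []
    carried = 0.0  # distance still to travel into the next segment
    for (x0, y0), (x1, y1) in zip(points_um, points_um[1:]):
        seg_len = math.hypot(x1 - x0, y1 - y0)
        if seg_len == 0:
            continue  # skip duplicated vertices
        angle = math.degrees(math.atan2(y1 - y0, x1 - x0))
        d = carried
        while d <= seg_len:
            t = d / seg_len
            samples.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0), angle))
            d += stride_um
        carried = d - seg_len  # keep the stride constant across vertices
    return samples
```

With a tile edge of 1000 µm and a stride of 100 µm, neighbouring tiles overlap by 90% along the annotation line, which is one reason why splitting the dataset at the patient level matters.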

Dataset splitting
In order to monitor and control the training process, we divided the total set of available training images into a training group and a validation group. To do this, we determined the total number of training images for each of our eight main entities to be examined (see Supplement Figure 1) and assigned 20% from each group to the validation process. The splitting process was carried out using a random number generator at the patient level and subsequently an iterative selection, also at the patient level, until the required number of validation images had been separated, i.e., all images of a specific patient were available either in the training group or in the validation group. The aim of this approach was to prevent the system from making assignments based on characteristics specific to individual patient samples, to support the generalization of the model and to avoid overfitting to hidden variables within the patient data, e.g., HE coloring patterns of a given case or certain repetitive artifacts.
Fig. 1 The distance between each scale is 500 µm. b For each category we used a special stain or immunostaining to reproducibly recognize anatomical landmarks or the histopathological lesion. Here, we delineated Gennari's strip of Brodmann area 17 on the microscopy glass slide with HE-LFB. c The same region was then annotated with NDPi viewer on the WSI (same shown in a). d A Python script was developed to generate and extract tiles perpendicular to the annotation line.
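The patient-level splitting described in this section could be sketched as follows. This is an illustration under stated assumptions, not the code used in this study: the tuple layout `(patient_id, label, filename)`, the function name `split_by_patient` and the fixed seed are hypothetical. Patients are drawn in random order for each label and moved as a whole into the validation group until about 20% of that label's tiles are covered.

```python
import random
from collections import defaultdict

def split_by_patient(tiles, val_fraction=0.2, seed=0):
    """Split tiles so that no patient appears in both groups.

    tiles: list of (patient_id, label, filename) tuples (hypothetical layout).
    Returns (train, valid) lists of the same tuples.
    """
    rng = random.Random(seed)
    by_label = defaultdict(lambda: defaultdict(list))
    for pid, label, fname in tiles:
        by_label[label][pid].append(fname)

    val_patients = set()
    for label, patients in by_label.items():
        target = val_fraction * sum(len(v) for v in patients.values())
        pids = sorted(patients)
        rng.shuffle(pids)
        # tiles of this label already covered by previously selected patients
        n_val = sum(len(patients[p]) for p in pids if p in val_patients)
        for pid in pids:
            if n_val >= target:
                break
            if pid not in val_patients:
                val_patients.add(pid)
                n_val += len(patients[pid])

    train = [t for t in tiles if t[0] not in val_patients]
    valid = [t for t in tiles if t[0] in val_patients]
    return train, valid
```

Because whole patients are moved, the validation share per label is only approximately 20%; the exact value depends on how many tiles each patient contributes.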

Metric output of the ResNet18 CNN
The prediction score given by the CNN is a measure of the probability that a tile matches any of the introduced categories, i.e., "maximum confidence." To obtain this value, we applied a sigmoid activation function for multi-label classification tasks to the array of values (logits) given by the final linear layer of the neural network for each image.
All metrics were obtained from the fastai library and adapted to our multi-label classification approach. Fastai itself uses scikit-learn metrics and PyTorch loss functions. We monitored the training performance by train_loss and validation_loss as calculated by the BCEWithLogitsLoss function (threshold 0.5). Generally, both parameters should converge toward zero as the number of epochs rises. We defined overfitting of the training process as the divergence between training loss, i.e., numbers decrease, and validation loss, i.e., numbers increase, together with the other metrics, i.e., numbers decrease. Train_loss and validation_loss were therefore monitored to optimize computation time for training and to find the best time point to stop training the model.
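For orientation, the quantity reported as train_loss and validation_loss can be written out in a few lines of plain Python. The sketch below is not the PyTorch implementation; it only reproduces the standard, numerically stable form of binary cross-entropy on logits, averaged over all labels of one tile, and the function name `bce_with_logits` is chosen here for illustration.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all labels, computed from raw logits.

    Uses the stable identity
        -[y*log(s(z)) + (1-y)*log(1-s(z))] = max(z, 0) - z*y + log(1 + exp(-|z|))
    where s is the sigmoid function, z a logit and y the 0/1 ground truth.
    """
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)
```

In this sketch no threshold enters the loss value; a threshold of 0.5 only becomes relevant when the sigmoid outputs are converted into positive or negative predictions.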
In addition, we monitored accuracy_multi, PrecisionMulti, RecallMulti and F1ScoreMulti to evaluate the practicability of the trained models as well as overfitting. The accuracy_multi score was calculated from the logits provided by the ResNet18 for each image. These logits were therefore normalized to a value range between 0 and 1 using the sigmoid function. From these values, we defined a positive prediction at a threshold higher than or equal to 0.5 (Table 1). Values below this threshold were considered not to meet the prediction (negative).
These predictions were converted into a one-hot-encoded format and compared with the basic ground truth, i.e., the manually labeled values for the corresponding image. The mean of all element-wise compared values then provides the accuracy_multi score metric (see below).
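The calculation described above can be sketched for a single tile as follows. This is a minimal illustration rather than the fastai implementation; the example logits and ground truth are invented, and the comparison uses "greater than or equal to" as stated in the text.

```python
import math

def sigmoid(z):
    """Map a logit to the range 0..1."""
    return 1.0 / (1.0 + math.exp(-z))

def accuracy_multi(logits, ground_truth, thresh=0.5):
    """Share of labels whose thresholded prediction equals the ground truth."""
    preds = [1 if sigmoid(z) >= thresh else 0 for z in logits]
    matches = [int(p == t) for p, t in zip(preds, ground_truth)]
    return sum(matches) / len(matches)

# Invented example with four labels: sigmoid gives about 0.62, 0.12, 0.82, 0.43,
# so the one-hot prediction is [1, 0, 1, 0].
example_logits = [0.50, -2.0, 1.5, -0.3]
example_truth = [1, 0, 0, 0]
print(accuracy_multi(example_logits, example_truth))  # 0.75
```

Since every label contributes to the mean, correctly predicted absences count as matches. With 25 labels per tile, accuracy_multi can therefore be high while recall and F1 for individual categories remain lower.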
We applied weighted averaging for PrecisionMulti, RecallMulti and F1ScoreMulti across all labels in this work to take the presence of unbalanced categories into account (see dataset splitting) [26]. This means that the calculated metrics for the respective individual classes are weighted with the proportion of training images in this class in relation to the size of the total population of training image data. Other averaging regimes were not used herein.
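A weighted average of per-class F1 scores can be sketched as below. The sketch follows the common convention of weighting each class by its support, i.e., the number of true instances in the evaluated set; whether this corresponds exactly to the weighting applied in this work is an assumption, and the function name `weighted_f1` is hypothetical.

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-label F1 scores.

    y_true, y_pred: lists of one-hot rows, one row per tile, one column per label.
    """
    n_labels = len(y_true[0])
    weighted_sum = 0.0
    total_support = 0
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        support = tp + fn  # number of tiles that truly carry label j
        weighted_sum += support * f1
        total_support += support
    return weighted_sum / total_support if total_support else 0.0
```

Under this scheme, large classes dominate the reported value, so a weak result in a small class lowers the weighted score only slightly.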

Results
A total of 612 WSI slides were obtained from 184 patients (see supplement Table 1), representing at least 20 independent patients for each of the main anatomo-pathological categories. With this dataset, we generated a total of 533,461 tiles for further use in this study, labeled for 25 subcategories, including nine major histopathology categories of diagnostic interest and their helper classes (see glossary and supplemental Figure 1). Training of the ResNet18 CNN was carried out with a batch size of 16 tiles and the Adam optimizer (https://doi.org/10.48550/arXiv.1412.6980). The learning process used the cyclic learning rate policy (https://doi.org/10.1109/WACV.2017.58). The maximum learning rate multiplier variable was set to 0.015, as determined by the fastai learning rate finder. The basic model was trained for 10 epochs (supplement Table 2, column 1). In addition, a fivefold cross-validation was carried out on the basis of this model, in which validation sets were unique across all five folds (Table 2). Other sub-models were trained for 5 epochs depending on the achieved metrics score until no significant improvement in control parameters could be recognized (see supplement Table 2). The pre-trained convolution base was not changed during training for all models. All models mentioned above were re-initialized with the ImageNet weights before training. Data augmentation was applied during the training at batch level including horizontal and vertical mirroring, rotation up to 180°, image zoom from factor 0.9 to 1.1, and "zeros" as padding mode (fastai).
Table 1 Example for calculation of accuracy_multi
Columns: Logit | Sigmoid | One-hot-encoded prediction after applied threshold | One-hot-encoded ground truth | Comparison | Label (example)
Example row (partial): Logit 0.50 | Sigmoid 0.62
1st column: logit values represent the not yet normalized output of prediction probabilities of the classifier for each WSI tile. 2nd column: The sigmoid function was applied to each logit to normalize the probability values into a value range between 0 = very unlikely and 1 = very likely. 3rd column: A threshold value was set to 0.5 to derive one-hot-encoded predictions from the probabilities shown in the 2nd column. If the value in column 2 was above the threshold value (0.5), a positive prediction (1) was assigned for the presence of the category (= label in 6th column). Values below 0.5 were assigned as negative prediction (0). The real presence of one or more histopathology categories in the original WSI was reported as one-hot-encoded ground truth in column 4. 5th column: The comparison of the model's prediction with the ground truth provides logical truth values for each label trained in the classifier. The summation of calculated truth values and their subsequent averaging resulted in the accuracy_multi score for the input image.
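One simple way to obtain validation sets that are unique across five folds is to partition the patient identifiers rather than the tiles. The sketch below illustrates this idea only; the function name `patient_folds` and the fixed seed are hypothetical, and it is not claimed to be the procedure used for the cross-validation reported in Table 2.

```python
import random

def patient_folds(patient_ids, n_folds=5, seed=0):
    """Partition patient identifiers into disjoint validation folds."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    # every n_folds-th patient goes into the same fold
    return [ids[i::n_folds] for i in range(n_folds)]
```

In each round, the tiles of one fold would serve as the validation set and the tiles of the remaining four folds as the training set, so that every patient is used for validation exactly once.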
The model was trained with a case cohort representing approx. 80% of each of the 25 categories and validated with the remaining cases, i.e., dataset splitting at the patient level. The ResNet18 CNN was fed with tiles from our comprehensive extraction routine and managed by the WSI processing pipeline [24]. The first run already achieved an accuracy_multi of 0.986, which prompted us to stop the training after the tenth run at an accuracy_multi of 98.8% (F1MultiScore = 82.2%) within the best epoch (epoch nine, total computing time of approx. 16 h, supplement Table 2). Training progress ceased after the seventh training epoch. Further training epochs did not show notable improvement in the control metrics (data not shown). However, we did not observe overfitting, i.e., discrepancies between train_loss and validation_loss, accuracy_multi and F1MultiScore at higher epochs (supplemental Figure 2). The advantage of dataset splitting at the patient level vs. random dataset splitting became further evident when challenging the two approaches with entirely new cases of genetically confirmed diagnosis never presented to the CNN (supplement Table 1). This analysis revealed a poorer generalization of the random-split model (supplement Table 3), which was therefore not considered further.
Our basic model (Table 3) included all categories specified (see glossary) and achieved a maximum accuracy_multi score of 0.988 after the tenth epoch. The RecallMulti score for this basic model reached 0.780 and showed more variation across our categories (Table 3). The best PrecisionMulti score and F1MultiScore with applied weighted averaging were 0.900 and 0.820, respectively. The accuracy_multi score was equally high across the major anatomo-pathological categories (Table 3). Within this group, the best accuracy_multi value was achieved for FCD2a (0.991) and the lowest for mMCD (0.955). Overall, the highest values were achieved for DG-PM (1.000) and WM-PM (1.000), followed by BA17-PM (0.998). WM-PM also achieved a high recall value (1.000), whereas recall for WM-SX reached 0.720. Samples of white matter (WM) adjacent to the FCD2a and FCD2b areas showed lower values for precision and recall (FCD2a-WM (0.290, 0.580), FCD2b-WM (0.450, 0.360)). WM adjacent to mMCD could not be recognized by the model as a separate class. However, the evaluation results showed sufficient differentiation between cortex and white matter in general, as did anatomical border zones of the subarachnoidal space (SUB) or the unstained glass slide with non-tissue background (NTU). Out of the 8 main categories, F1 score and recall were lowest for MOGHE (0.450, 0.420), followed by mMCD (0.600, 0.460). FCD2b (0.830, 0.770) showed higher metrics compared to FCD2a (0.740, 0.670). Best results were achieved for BA04-PM, BA17-PM and DG-PM (see Table 3).

A weakly supervised classifier approach did not achieve successful results
In another approach, we collected all slides from FCD1a, FCD2a, FCD2b, MOGHE and mMCD and extracted approx. 1 × 10⁶ tiles from the entire tissue region of the WSI using the previously published library [24]. No specific annotation was applied in this approach, i.e., weakly supervised. All tiles were assigned to one of the five labels mentioned above. The ResNet18 CNN was trained for five epochs, which took approximately two weeks. Neither training nor validation accuracy reached a level above 50%. This experiment confirmed the importance and value of our supervised approach with time-consuming histopathology annotation.

Calculating the most efficient number of labels/categories for the FCD classifier
Another question pertinent to our approach was to assess the best and most efficient number of discriminating categories or combinations thereof, e.g., grouping together homotypic neocortices from postmortem and surgical tissues and of frontal, temporal and occipital regions, or grouping anatomical compartments of each disease category, i.e., white matter (WM) and adjacent cortices (hCx). The best performance of the CNN was achieved when all anatomo-pathological labels were included in the training and validation process (supplement Table 2). Each of the aforementioned groupings reached lower accuracy. Along these lines, exclusion of categories such as all postmortem controls, no tissue regions (NTU) or adjacent FCD compartments did not achieve a better performance of the algorithm.
In a next step, we developed a data class to manage calculated predictions and their graphical representation on the WSI by the trained model and to provide an interpretation aid, i.e., the histopathology classifier (Fig. 2). All digitally available information was extracted from the WSI by OpenSlide including the magnification scale and spatial x and y dimension of each WSI image under study. This information was necessary to extract tiles with the same specification used to train the CNN. Then, the program was able to generate a grid of tiles covering the entire space of the WSI (1000 × 1000 µm and 1024 × 1024 pixels each; see Fig. 2a). The tiles were fed into the trained model in png format, and their prediction score was calculated (Fig. 2b). The results of this calculation were stored with reference to the respective patches. A raster image was created from these predicted values in the same dimension as the original patch raster. An algorithm then determined the average prediction probability for all categories recognized on the slide. Using an adjustable threshold value, categories with low predictive values were hidden, e.g., < 0.5 (Fig. 2b).
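The grid generation and the thresholding of tile predictions can be illustrated by the following sketch. It is a simplified reading of the procedure and not the data class used in the application: the function names `tile_grid` and `prediction_raster`, the dictionary layout of the scores and the choice to keep only the highest-scoring category per tile are assumptions.

```python
import math

def tile_grid(width_um, height_um, tile_um=1000.0):
    """Return (row, col, x_um, y_um) for non-overlapping tiles covering the slide."""
    cols = math.ceil(width_um / tile_um)
    rows = math.ceil(height_um / tile_um)
    return [(r, c, c * tile_um, r * tile_um) for r in range(rows) for c in range(cols)]

def prediction_raster(scores, rows, cols, thresh=0.5):
    """Build a rows x cols raster holding (category, probability) or None.

    scores: dict mapping (row, col) to {category: probability} (hypothetical layout).
    Tiles without a score, or whose best score is below `thresh`, stay hidden (None).
    """
    raster = [[None] * cols for _ in range(rows)]
    for (r, c), probs in scores.items():
        category, p = max(probs.items(), key=lambda kv: kv[1])
        if p >= thresh:
            raster[r][c] = (category, p)
    return raster
```

A raster of this kind has one cell per tile and can be scaled up to the size of a slide thumbnail before it is blended with the image.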
A color and transparency value was calculated for categories above the prediction threshold, based on a pre-defined RGB color code table and depending on the prediction probability as defined by the user (Fig. 2c). A thumbnail of the original WSI image was assembled from the individual png-formatted tiles, anticipating the large file size of each original WSI. The raster image was scaled up to the size of the generated thumbnail, and the two images were merged using an alpha channel and the pillow software package [10]. Our first visualization attempts confirmed the benefit of helper categories such as non-tissue background (NTU) and the subarachnoidal space (SUB-PM and SUB-SX) for training the CNN in order to improve the graphical output of the classifier. During each experiment, we applied the visualization program to 20 randomly selected cases of the validation cohort and microscopically reviewed regions of the slides identified by the CNN with an accuracy threshold above 0.5. This analog review confirmed the presence of the predicted lesion in all our cases, i.e., BA17, BA4, DG, mMCD, MOGHE, FCD1a, FCD2a and FCD2b. This system was further tested using WSI of genetically defined disease samples of MOGHE-SLC35A2 altered, FCD2a-DEPDC5 altered and FCD2b-MTOR altered. These histopathologically and genetically defined cases had not been seen by the CNN before and were correctly assigned to their respective category (Fig. 2e, f). A third approach addressed the potential disease differentiating anatomo-pathological signatures recognized by the CNN. We therefore downloaded the most accurately recognized tiles from the training and validation cohort of the CNN with a predicted confidence ≥ 0.99 for microscopy review. It was evident from this analysis that the CNN had captured the specific disease tissue pattern for each category. Prominent examples are depicted in Fig. 3 showing the clusters of increased oligodendroglial cell densities in MOGHE, dysmorphic and balloon cells in FCD2b and excessive microcolumnar cortical architecture in FCD1a. An online reviewing program was developed and made accessible to nine internationally renowned colleagues experienced in the classification of epilepsy-associated brain lesions to survey their agreement with our CNN's prediction. This study disclosed that 62.47% of the CNN's best tiles would also have been classified as the respective brain disorder by independent external referees, i.e., MOGHE, FCD1a, FCD2a, FCD2b (ranging from 10 to 96%, and kappa values ranging from poor (-0.80) to almost perfect (0.92), respectively; see Table 4).
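For readers less familiar with kappa statistics, the following sketch shows how Cohen's kappa relates raw agreement to the agreement expected by chance for two raters. It is given for orientation only: which kappa variant was computed for Table 4 is not specified here, and the function name `cohens_kappa` and the input format are assumptions.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long lists of categorical ratings."""
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # observed agreement: share of items rated identically
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # agreement expected by chance from each rater's label frequencies
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    if p_e == 1.0:
        return 1.0  # both raters used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)
```

Kappa equals 1 for perfect agreement, 0 for agreement at chance level and becomes negative when two raters agree less often than expected by chance, which is why negative values can occur alongside a percentage agreement above zero.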
Training results as well as the expert panel agreement trial suggested that our model was able to recognize complex tissue patterns within large WSI tiles with reasonable accuracy. To improve transparency of the model's decision-making process, we applied several discrimination software tools and algorithms for this interpretation using open access libraries, e.g., LIME [28] 3089943], to visualize the common patterns of tissue texture learned by the model as a heatmap, which can help to guide the microscopic examination of a surgical specimen [19] (Fig. 4).
We further investigated the class activation across convolution levels of our trained ResNet18 model applying the torchcam library and Layer-CAM approach included therein (Fig. 5). This analysis showed evidence for feature recognition taking place from layer 2 to layer 4 of the CNN. No more complex features were recognized in earlier or later convolutional layers of the model.
Finally, we assembled all features of our deep learning-based histopathology classifier for FCD into a web-based application to facilitate an automated interpretation of WSI slides of epilepsy surgery brain tissue (https://fcd-classifier.eu.ngrok.io). The application has several dashboards to access and review all of the information. It is important to note that the user can individually select the anatomical categories to be reviewed by the model, i.e., 1-25 (see glossary), and more importantly, that the user can verify the model's prediction ad hoc using digital microscopy. The possibility to review each tile's prediction score at the microscopic level is of utmost importance to validate the AI-based approach and to achieve a final and reliable histopathology diagnosis in routine practice.
Fig. 2 Evaluation and visualization of a complete WSI slide. a A grid of tiles, each measuring 1000 × 1000 µm and 1024 × 1024 pixels, respectively, was automatically calculated to cover the entire WSI image. b Prediction values for each tile were obtained from the trained CNN. The threshold to visualize the prediction value was defined as variable and set to 0.5 in this example. c All tiles reaching the defined threshold can be visualized by our color index map. Furthermore, the color intensity indicates the probability value.

Discussion
We developed a WSI-based deep learning algorithm to automatically detect and further classify FCD in epilepsy surgery brain specimens, including those entities newly introduced into the ILAE consensus FCD classification update of 2022, i.e., mMCD and MOGHE [22]. Comprehensive training and validation experiments revealed a success rate of 98.8% for the prediction of a total of 25 categories. The predicted categories can then be visualized at a high spatial resolution on the WSI by color coding and additional metrics to further specify these results. This prototype can be used online for diagnostic routine assessments and may become a template for a digital slide suite for neuropathology and epilepsy surgery in the near future. Notwithstanding, such an automated diagnostic device shall only support the diagnostic workup as a screening tool, e.g., when large surgically resected en bloc brain samples are submitted for neuropathology examination, and will not replace the microscopic inspection by an experienced and board certified neuro-/pathologist.
Table 4 Nine independent experts (A-J) were invited to review 50 best tiles predicted by the model with a confidence score ≥ 0.99. Please note that the agreement among experts varied substantially across all disease categories, i.e., from poor (kappa = -0.80/red) to almost perfect (kappa = 0.84/green) for FCD1a.
To the best of our knowledge, our deep learning-based histopathology classifier of FCD is the first to process large image tiles of 1000 × 1000 µm. Digital tiles of this size represent a rich brain tissue texture with complex cellular and extracellular content, e.g., the various neuronal profiles of different diameter and shape, orientation and architectural organization in layers. The same holds true for the various glial cell types, endothelial cells, blood vessel diameter and the large extracellular space comprising almost 20% of the adult brain tissue [25,27].
In addition, perisurgical and laboratory artifacts, e.g., bleeding, wrinkles and scratches, were often included in such large tiles and also challenge this approach. During the experimental training phases, the question arose whether a 1 mm² tile size is suited for this classifier challenge, as all previous models have used smaller images focusing only on the disease-specific cellular pattern, e.g., dysmorphic neurons w/o calcification in cortical tubers vs. FCD2b [8,19], as well as addressing specific histopathology domains, such as carcinoma islets [8], pituitary glands or immunohistochemical staining patterns [24]. We therefore computed smaller tiles during the course of this project, e.g., of 224 × 224 and 512 × 512 pixels. Accuracy scores were lower in all of these conditions (data not shown), and we concluded from these results the algorithm's difficulty to [...]
We reviewed the microscopy signature of best predicted tiles chosen by the CNN (> 0.99) for each category and recognized the prototypical tissue texture (Fig. 3). In addition, we applied several discrimination software tools to highlight the CNN's activation using open access libraries (Fig. 4). Results confirmed the recognition of prototypical tissue texture including cellular disease hallmarks, e.g., dysmorphic neurons in FCD2a and FCD2b and excessive microcolumns in FCD1a (Figs. 4, 5). This is an interesting observation given the similarity of many histopathology features and the variability of normal anatomical landmarks. As an example, the microcolumnar architecture defining FCD1a was also reported to be partially present in normal brain, in particular in the occipital region close or adjacent to BA17 [16]. Giant cells of Betz characterize Brodmann area 4 in the frontal lobe and may be difficult to differentiate from dysmorphic neurons in FCD2 of the frontal lobe (supplemental Figure x1A). Our trained application never reported this issue, however.
White matter was always differentiated from neocortex based on HE staining alone, and the edges of the tissue specimen were always correctly recognized by the model, i.e., the subarachnoidal space. The assignment of different white matter categories still poses the biggest challenge to the system (Table 3), whether it belongs to normal brain obtained from patients without any epileptic disorders or to mMCD characterized by an excess of heterotopic neurons in the white matter or to MOGHE with increased oligodendroglial cell densities. Results predicted by the system should, therefore, always be confirmed by microscopic inspection of an experienced and specialized neuro-/pathologist and preferably also by specific immunohistochemical markers as recommended by the ILAE classification schemes [2,7].
Fig. 6 Landing page of the online FCD classifier application. 1 First, go to the accordion-tab on the left to select your WSI of interest (see also Fig. 7). A set of WSI slides is already stored in its database. Future versions will enable the upload of external WSI. Technical issues to allow different WSI formats and sizes need to be resolved, however. This tab also allows choosing among trained classifier models. Currently, the herein described basic ResNet18 and One-vs-All FCD2b models can be selected (supplement Table 2), but future applications may envisage different topic-specific models addressing the larger spectrum of lesions associated with epilepsy surgery [6]. 2 Adjust the visual representation of the classifier's prediction (see Fig. 8 for more detail). 3 Output area: at this entry level it provides a short instruction manual. At other stages it reveals the status of the system during/after the calculation process, or will allow zooming, panning and saving of the output image. 4 Access to live microscopy review of a chosen area of interest (HE-stained WSI, Fig. 9). 5 To review the model's prediction scores of all chosen categories. 6 Receive a report
We took this approach to another level and invited nine histopathologists from around the world, all experienced in epilepsy surgery, to review the 50 best tiles selected by the CNN (Table 4). This trial was not designed to challenge the CNN's selection but rather to understand its prediction rules. The results were comparable to the variable inter-rater agreement of previous histopathology trials in epilepsy surgery, in which analog glass slides or digital WSI series were iteratively reviewed by expert panels [4,9,11]. Such agreement scores reached 62.5% in one study [9], with best values for FCD Type 2 and lowest for FCD Type 1. Similarly, a kappa value of 0.64 was reported for the FCD classification scheme of 2011 with best consensus for FCD Type 2 categories [11]. A more recent approach achieved a kappa score of 0.65 only after four rounds of microscopic review and when molecular testing was disclosed [4]. These scores were similar to those obtained in our current study. Overall, our inter-rater agreement varied between 10 and 96% across the expert panel with kappa values ranging from poor (-0.80) to almost perfect (0.96). The category with best agreement was that of FCD2a, for which three raters reached an almost perfect kappa score of > 0.81. In contrast, the FCD1a category achieved agreement with the CNN's selection in 10% to 92% of tiles and kappa values ranging from -0.80 (poor) to 0.84 (almost perfect). While these numbers tell us that some raters could recognize the features selected by the CNN, the subjective nature of human microscopy studies in FCD remains an ever-challenging issue. This assumption also supports our foremost study aim to develop an AI-based histopathology classifier to support difficult-to-diagnose FCD lesions at the microscopy level.
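The kappa values reported for the expert panel measure agreement beyond what chance alone would produce, which is why they can become negative. The following is a minimal sketch, assuming Cohen's kappa between a single rater and the CNN's selection; this section does not restate which kappa variant was used in the trial, and the function name and toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of category labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of tiles with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences consist of one identical label
    return (observed - expected) / (1.0 - expected)


# Toy data (hypothetical): ten tiles, the rater agrees with the CNN on eight.
cnn = ["FCD1a"] * 5 + ["other"] * 5
rater = ["FCD1a"] * 4 + ["other"] + ["other"] * 4 + ["FCD1a"]
kappa_partial = cohen_kappa(cnn, rater)

# Systematic disagreement yields a negative kappa.
kappa_opposed = cohen_kappa(["A", "A", "B", "B"], ["B", "B", "A", "A"])
```

In this toy case, 80% raw agreement corresponds to a kappa of about 0.6, because half of the agreement would be expected by chance.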
Another application of this model could be the recognition of double pathologies, which are currently not included in the ILAE classification of FCD. Although we have not systematically performed such studies with this new classifier, the probability that architectural dysplasia of FCD ILAE Type 1 can occur in the vicinity of FCD Type 2 lesions caught our attention in several cases (see Fig. 2f) and shall be further examined in the near future. Another challenge remains, however: the model falsely assigns small regions or individual tiles to a different disease category when the majority of tiles were predictive otherwise. An example is also shown in Fig. 2e, f with genetically confirmed FCD2b or MOGHE and small islets of other FCD entities identified herein by our model. It is important to review the prediction score, which was below 0.5 in both cases, confirming a low probability of such a double pathology. Visual inspection of the microscopic slide also did not confirm the presence of additional pathologies. We therefore recommend using this classifier as a screening tool in which areas with a high probability rate should be microscopically reviewed to achieve a final diagnosis.
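The screening logic described above can be sketched in a few lines. This is an illustration only, not the application's implementation: it assumes each tile is available as a (category, score) pair with scores between 0 and 1, and it uses 0.5 as the cut-off because that value separated the two example cases; the function name and data layout are hypothetical.

```python
from collections import Counter


def flag_minority_tiles(tiles, min_score=0.5):
    """Sort tiles that deviate from the slide's majority category.

    tiles: list of (category, score) pairs, one pair per tile.
    Returns the majority category, the indices of deviating tiles that
    merit microscopic review, and the indices of deviating tiles whose
    score is too low to suggest a second pathology.
    """
    majority = Counter(category for category, _ in tiles).most_common(1)[0][0]
    review, low_confidence = [], []
    for index, (category, score) in enumerate(tiles):
        if category == majority:
            continue
        if score >= min_score:
            review.append(index)
        else:
            low_confidence.append(index)
    return majority, review, low_confidence


# Hypothetical slide dominated by FCD2b with two deviating tiles.
example_tiles = [
    ("FCD2b", 0.93),
    ("FCD2b", 0.88),
    ("FCD1a", 0.41),
    ("FCD2b", 0.90),
    ("MOGHE", 0.72),
]
majority, review, low_confidence = flag_minority_tiles(example_tiles)
```

Here the FCD1a tile falls below the cut-off and would be treated as a likely misassignment, whereas the MOGHE tile would be passed on for microscopic review.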
Despite the great success in the application of CNN in the arena of digital pathology, we also have to consider its limitations. Poor traceability of the decision-making process is due to the complexity of CNN architectures. We addressed this issue (see Fig. 4) but cannot yet further resolve the nature of its decision-making. Another limitation in the context of digital pathology concerns the robustness of generalization. We cannot exclude overfitting from use of a single scanner model and protocols established in our histopathology laboratory, including formalin fixation, [...] 2012.03843). This implies that a trained classification model should only be used with image inputs for which it was explicitly trained to avoid misclassification. In addition, a basic knowledge of the structure and functionality of the classification network is necessary to ensure its correct handling and interpretation of results. This notion demands data preselection and coarse pre-classification by a pathology expert in the context of digital neuropathology of epilepsy surgery. As mentioned above, we unanimously mandate the use of such tools in the future as a diagnostic aid to reduce work burden, but not to replace the diagnostic decision maker.
Fig. 7 How to pick your slide and DL model. Please click on the "predict on wsi file" accordion in the upper left to start the application. Two file selection dialogs will be presented in order to choose a WSI file (shown on left) and a trained prediction model (shown on right). The application already provides WSI files obtained from epilepsy surgery as well as two trained models: 01_all_resnet18.mod representing our basic model (Table 3); the second model represents the One-vs-All model for FCD2b (supplement Table 2). File upload is not enabled in this version. In addition, the selection list is secured by an informatics sandbox. Take the following steps to complete this task. 1 Select your slide of interest. 2 Push "select" to confirm your selection. 3 Select your DL model. 4 Push "select" to confirm your selection. 5 You can use sliders to manually adapt the parameters of the image. Please note that dimension and resolution must match those of the trained model, i.e., 1024 × 1024 pixels and an image edge length of 1000 µm. These are set as default and should not be changed. 6 Choose your inference mode. In "normal" mode, each extracted tile is scored once. In the "test-time augmentation" (tta) mode, each tile is evaluated four times to further augment the image output. This procedure takes significantly longer. 7 Push the "predict" button to start the algorithm. Calculation progress can be followed on the output area (Fig. 6).
Fig. 8 Use various display options to visualize the results. The box "overview" of the middle panel will reveal the computed prediction map for the WSI. The accordion on the left allows modifying the display options by selecting either all anatomic categories available for the model or a preselection thereof. Check the boxes on the left to make a choice. Please note that only categories that are both checked and recognized will be shown. For your convenience, maximum and average confidence scores for the detected categories are readily visible on the upper left of the image. The sliders on the lower left allow adjusting the graphic presentation with more or less "blurring" to visualize all tiles. "Transparency" is a measure of the detection probability. The AVG-threshold slider can be applied to define or modify the prediction threshold. Only tiles above this threshold will then be shown in the image. The last tab on the lower left allows the user to choose the display mode. The application offers different illustrations of color-coded tiles, e.g., as clouds, rectangles or circles, each with/without recognition scores as attached legend (see supplement Figure 2). When color-coded circles are chosen, the diameter of each circle indicates the level of the detection probability. The output image will be recalculated and displayed in the middle panel following each change of display options. Please accept a time delay of the output results after each interactive adaptation by the user due to data transfer and recalculation (> 5-10 s). Results are best displayed on 16:9 scaled monitors. Asterisk in overview: the output table is magnified on the lower right for better readability.
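The "test-time augmentation" (tta) inference mode mentioned in the Fig. 7 caption evaluates each tile four times. A minimal sketch of this idea follows, under two assumptions that the text does not confirm: that the four evaluations correspond to the four 90° rotations of a tile, and that the per-category scores are averaged. The `score_tile` argument is a hypothetical stand-in for the trained model, and the toy model below only serves to make the example runnable.

```python
def rotate90(tile):
    """Rotate a 2-D tile (list of pixel rows) clockwise by 90 degrees."""
    return [list(row) for row in zip(*tile[::-1])]


def predict_tta(tile, score_tile, n_views=4):
    """Average the category scores over n_views rotated copies of a tile.

    score_tile: callable mapping a tile to a dict {category: score}.
    """
    totals = {}
    view = tile
    for _ in range(n_views):
        for category, score in score_tile(view).items():
            totals[category] = totals.get(category, 0.0) + score
        view = rotate90(view)
    return {category: total / n_views for category, total in totals.items()}


def toy_model(view):
    """Hypothetical, orientation-sensitive scorer used for illustration."""
    p = view[0][0] / 255
    return {"FCD2a": p, "normal": 1 - p}


tile = [[255, 0], [0, 0]]
single = toy_model(tile)             # "normal" mode: one evaluation
averaged = predict_tta(tile, toy_model)  # "tta" mode: four evaluations
```

The toy scorer depends on the orientation of the tile, so a single evaluation gives an extreme score, while the averaged result is less sensitive to how the tile happens to be oriented. The fourfold evaluation also explains the longer run time noted in the caption.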

A free access, online application of the FCD classifier
We developed a web-based open access application to use our FCD classifier on a daily diagnostic routine basis. This application may also become helpful in view of the first update of the FCD consensus classification scheme of the ILAE [22]. This update includes for the first time the histopathologically difficult-to-diagnose disease entities of mMCD and MOGHE, and our model can help to train colleagues not yet familiar with these diagnoses online. The application prototype is accessible via https://fcd-classifier.eu.ngrok.io. A user will have to register to the platform to get free access (please send an email with your registration request to the senior authors). Currently, the application relies on a database of securely stored WSI for testing purposes, as our analysis pipeline is only able to process WSI in the Hamamatsu .ndpi/.ndpa formats. Further advancements to add conventional WSI formats are in progress but largely depend on the availability of a common DICOM format offered by all vendors [18]. It will also be mandatory that the FFPE tissue section has a cutting thickness of 4-5 µm and that the HE staining protocol was performed according to ILAE recommendations [2]. The large WSI file size is another hurdle for any transfer and upload pipeline and should, therefore, be limited to < 3 Gb.
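Should external uploads be enabled in a future version, the requirements listed above could be checked before any file is transferred. The following sketch is purely illustrative: the function name and messages are hypothetical, and the size limit is interpreted here as 3 gigabytes in decimal units, which is an assumption about the stated "3 Gb".

```python
from pathlib import PurePath

ALLOWED_SUFFIXES = {".ndpi", ".ndpa"}  # Hamamatsu formats named in the text
MAX_BYTES = 3 * 10**9                  # assumption: "3 Gb" read as 3 gigabytes


def upload_problems(filename, size_bytes, section_um=None):
    """Return a list of reasons why a WSI would not meet the requirements."""
    problems = []
    if PurePath(filename).suffix.lower() not in ALLOWED_SUFFIXES:
        problems.append("unsupported file format")
    if size_bytes >= MAX_BYTES:
        problems.append("file size not below the limit")
    # Section thickness is optional metadata; check it only when provided.
    if section_um is not None and not 4 <= section_um <= 5:
        problems.append("section thickness outside 4-5 µm")
    return problems


ok = upload_problems("case_01.ndpi", 2_100_000_000, section_um=4)
rejected = upload_problems("case_02.svs", 3_500_000_000, section_um=7)
```

An empty list means the file passes all three checks; the HE staining protocol itself cannot be verified this way and still has to follow the ILAE recommendations.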
In conclusion, we developed an AI-based histopathology classifier to support routine diagnosis of FCD samples obtained from epilepsy surgery. The ResNet18-trained classifier currently recognizes 25 anatomo-pathological categories of five major disease entities compared to normal human neocortex. Major achievements include the use of large 1 mm² WSI tiles, which allow recognition of the disease phenotype admixed with an otherwise normal appearing human neocortex or white matter. Extraction of the activation patterns showed concordance with disease-defining cellular phenotypes, which was also confirmed by our invited expert panel. However, the low concordance among histopathology experts is an ongoing challenge in the neuropathological examination of epilepsy surgery brain tissue specimens and may be resolved by introducing our AI-based algorithm, available online and open access to the worldwide community of neuro-/pathologists.
Fig. 9 Review the results. This figure describes the functionality of the three tabs on the upper right of the application (see Fig. 6), i.e., selection, predictions and slide report in rectangle mode with legend display. You can review the original histopathology image (WSI slide) of each tile to confirm or challenge the model's prediction. Click on any region of interest on the output image, e.g., tile 141 (indicated by white circle on output image). 1 A gray-colored thumbnail with a resolution of 1024 × 1024 pixels will be displayed for this tile. 2 If the <<detail-view>> button is pressed, this tile is retrieved from the WSI file and the HE staining is accessible at high resolution for the user's digital microscopy review. 3 The prediction scores for the selected tile are disclosed for all categories (...; to save space on this image, we removed a few categories without any further information). 4 A slide report can be issued for the studied WSI: 1st column: selected categories (label); 2nd column: number of tiles assigned to this class (count); 3rd column: sum of predicted scores of all tiles assigned to this category (sum); 4th column: highest prediction score (0-1) calculated for any tile of this category (max); and 5th column (not shown): average value of all predictions across all tiles of this category (avg). The footer displays the leading class (category) for this WSI. In this example, the model predicts FCD2a with an average prediction score of 85%.
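The columns of the slide report described in the Fig. 9 caption (label, count, sum, max, avg) can be derived from the per-tile predictions. The following is a minimal sketch under stated assumptions: each tile is represented by its assigned category and prediction score, and the leading class is taken to be the category with the most tiles. The caption does not state how the leading class is determined, so this rule, like the function name and toy data, is hypothetical.

```python
def slide_report(tiles):
    """Aggregate per-tile predictions into slide report rows.

    tiles: iterable of (label, score) pairs, one pair per tile.
    Returns a dict {label: {"count", "sum", "max", "avg"}} and the
    leading class (None if there are no tiles).
    """
    rows = {}
    for label, score in tiles:
        row = rows.setdefault(label, {"count": 0, "sum": 0.0, "max": 0.0})
        row["count"] += 1
        row["sum"] += score
        row["max"] = max(row["max"], score)
    for row in rows.values():
        row["avg"] = row["sum"] / row["count"]
    if not rows:
        return rows, None
    # Assumption: the leading class is the category with the most tiles.
    leading = max(rows, key=lambda label: rows[label]["count"])
    return rows, leading


# Hypothetical slide with three tiles.
report, leading = slide_report(
    [("FCD2a", 0.75), ("FCD2a", 1.0), ("white matter", 0.5)]
)
```

For this toy slide the FCD2a row has a count of 2, a sum of 1.75, a maximum of 1.0 and an average of 0.875, and FCD2a is reported as the leading class.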

Supplementary Information
The online version contains supplementary material available at https://doi.org/10.1007/s00521-023-08364-9.
Campinas participated in our agreement trial. The Erlangen histopathology team was instrumental in retrieving the samples, re-cutting and staining missing/damaged slides, as well as digitizing and annotating all specimens.
Funding Open Access funding enabled and organized by Projekt DEAL.
Data availability The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Declarations
Conflict of interest None of the authors have any disclosure.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.