Gathering and Preprocessing the Protocol Data
The study was approved by the authors’ Institutional Review Board. Abdominal imaging CT studies performed at our institution between 2016 and 2019 were identified from the radiology information system (RIS) (Syngo Workflow, Siemens Healthineers, Germany). All study and final assigned protocol data were exported in a tabular format and initially cleaned and normalized using Excel (Microsoft, Redmond, Washington, USA). Pertinent text data, namely the referring provider’s free-text indication (why the study was requested) and the associated ICD diagnosis code, were concatenated into a single string column. All personal identifiers were stripped. Because the automated builder tool does not support combining structured and unstructured data elements for NLP, structured fields such as the date of the study, patient age, and gender were also removed to maintain comparability of the ML models’ performance. Duplicate studies with identical text columns were removed (Table 1).
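The concatenation and deduplication steps can be sketched in Python as below. This is a minimal illustration only: the study performed this cleanup in Excel, and the field names (`indication`, `icd_code`) and sample values are assumptions, not taken from the dataset.

```python
# Illustrative rows standing in for the exported RIS table.
rows = [
    {"indication": "RUQ pain, r/o cholecystitis", "icd_code": "K81.0"},
    {"indication": "Follow-up hepatic lesion", "icd_code": "K76.9"},
    {"indication": "RUQ pain, r/o cholecystitis", "icd_code": "K81.0"},  # duplicate
]

# Concatenate the free-text indication and ICD code into one string column.
for row in rows:
    row["text"] = f'{row["indication"]} {row["icd_code"]}'

# Remove duplicate studies with identical text columns, preserving order.
seen, deduped = set(), []
for row in rows:
    if row["text"] not in seen:
        seen.add(row["text"])
        deduped.append(row)

print(len(deduped))  # 2
```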
Thirty-three abdominal imaging CT protocols were grouped into 11 organ-specific protocol classes that matched the grouping presented in the RIS. For example, all “CT Abdomen and Pelvis” protocols, including those with or without IV and oral contrast, were included under the “Abdomen and Pelvis” protocol class. Similarly, all “CT Liver” protocols, including three- and four-phase studies, were included under the “Liver” protocol class. A complete list of protocol classes and the protocols they include is provided in Table 2. The final protocol performed for the study, as assigned by the radiologist, was used as the ground truth. Each protocol class name was replaced with an integer value based on a standardized key. Using this two-column format, the integer value of the assigned protocol class followed by the associated text information, the data was processed by each of the workflows separately.
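The two-column label encoding can be illustrated as follows. The ordering of classes in the key is an assumption for illustration; the actual standardized key is not reproduced here, and only two of the 11 class names are shown.

```python
# Standardized key: protocol class name -> integer label (order assumed).
protocol_classes = [
    "Abdomen and Pelvis",
    "Liver",
    # ... remaining 9 organ-specific classes from Table 2
]
class_key = {name: i for i, name in enumerate(protocol_classes)}

def encode(assigned_class: str, text: str) -> tuple:
    """Produce one row of the two-column (integer label, text) format."""
    return class_key[assigned_class], text

print(encode("Liver", "Follow-up hepatic lesion K76.9"))
# (1, 'Follow-up hepatic lesion K76.9')
```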
To address data imbalance, a combination of undersampling, augmentation, and oversampling was used depending on the protocol class. Random undersampling was applied to the largest class (Abdomen and Pelvis) to reduce it to 20,000 samples. Text-based data augmentation and random oversampling were used in all other classes within the training data set to achieve balance at 20,000 samples per class. All augmentation and sampling were performed only on the training set, with the validation set remaining unmodified. Augmentation techniques included back translation, in which the protocol indication text was translated to French and the output translated back to English to create minor modifications without loss of meaning. Additional augmentations included replacing words with synonyms, randomly swapping words within a sample, and randomly deleting a word from a sample. The data was then processed by the manual workflows: the four machine learning algorithms and the universal language model-based deep learning algorithm. Because the data augmentation step is considered part of the data pre-processing for the manual machine learning workflows, only the original training dataset was submitted to the commercial automated machine learning builder.
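The word-level augmentations and random oversampling described above can be sketched with the standard library as below. Back translation is omitted because it requires an external machine-translation system; the synonym map shown is a hypothetical stand-in for whatever lexicon the augmentation tooling used.

```python
import random

random.seed(0)  # deterministic for illustration

def synonym_replace(tokens, synonyms):
    """Replace tokens that have an entry in a (hypothetical) synonym map."""
    return [random.choice(synonyms[t]) if t in synonyms else t for t in tokens]

def random_swap(tokens):
    """Randomly swap two word positions within a sample."""
    i, j = random.sample(range(len(tokens)), 2)
    out = tokens[:]
    out[i], out[j] = out[j], out[i]
    return out

def random_delete(tokens):
    """Randomly delete one word from a sample."""
    out = tokens[:]
    out.pop(random.randrange(len(out)))
    return out

def oversample(samples, target):
    """Random oversampling with replacement up to the target class size."""
    return samples + [random.choice(samples) for _ in range(target - len(samples))]

sample = "right upper quadrant pain".split()
print(random_swap(sample))
print(len(oversample(["a", "b"], 5)))  # 5
```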
Manual Machine Learning Model Workflow
The free, open-source data analytics platform KNIME was used to further preprocess and evaluate the data using common NLP operations such as erasing punctuation, filtering stop words, and Porter stemming. The data was then converted to a bag-of-words model of 2-grams before being vectorized using inverse document frequency weighting. This preprocessed data was randomized and divided into training and testing sets before being input separately into four machine learning algorithms: random forest (RF), tree ensemble (TE), gradient boosted trees (GBT), and multi-layer perceptron (MLP) (Table 3) [14, 15]. Random forest and tree ensemble algorithms were selected as examples of machine learning algorithms commonly used for classification tasks. Gradient boosted trees and multi-layer perceptrons were selected as algorithms sometimes shown to outperform random forests in ML tasks. The outputs of each algorithm were visualized as a confusion matrix and compared using precision, recall, and F1 scores. Class-specific F1 scores and Cohen’s kappa were also calculated. The execution time of the text processing and each model’s training and inference time were recorded.
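A minimal Python approximation of the KNIME text pipeline (punctuation removal, stop-word filtering, 2-gram bag-of-words, inverse document frequency weighting) is sketched below. The stop-word list is illustrative, and Porter stemming is omitted for brevity; KNIME's own nodes perform these steps in the actual workflow.

```python
import math
import string

STOP_WORDS = {"the", "of", "and", "a", "to"}  # illustrative subset only

def preprocess(text):
    """Erase punctuation, lowercase, and filter stop words."""
    text = text.translate(str.maketrans("", "", string.punctuation)).lower()
    return [t for t in text.split() if t not in STOP_WORDS]

def bigram_bow(tokens):
    """Bag-of-words over 2-grams, mirroring the two-NGram representation."""
    bow = {}
    for i in range(len(tokens) - 1):
        gram = " ".join(tokens[i:i + 2])
        bow[gram] = bow.get(gram, 0) + 1
    return bow

def tfidf(docs_bow):
    """Weight term counts by inverse document frequency across documents."""
    n = len(docs_bow)
    df = {}
    for bow in docs_bow:
        for term in bow:
            df[term] = df.get(term, 0) + 1
    return [{t: c * math.log(n / df[t]) for t, c in bow.items()}
            for bow in docs_bow]

docs = ["Pain in the right upper quadrant.", "Pain in the left lower quadrant."]
vectors = tfidf([bigram_bow(preprocess(d)) for d in docs])
```

Terms appearing in every document (here, "pain in") receive zero weight, which is the intended effect of the inverse document frequency term.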
Additionally, a deep-learning language model workflow was deployed. The deidentified two-column data was processed using Python within a Jupyter Notebook instance. The primary libraries used were spaCy for text preprocessing and the fastai deep learning library for training and fine-tuning a language model for the classification task. As in the KNIME workflow, data processing included shuffling the data for randomness, dividing it into training, testing, and validation datasets, and removing any rows with empty columns. Using the previously described ULMFiT technique and AWD-LSTM architecture, a pre-trained general language model was loaded containing approximately 103 million words obtained from the 28,595 Wikipedia articles in the Wikitext-103 corpus [16,17,18]. The language model was then fine-tuned on the entirety of the CT protocol dataset by training it to predict the next word of the text. After fine-tuning, the pre-trained weights and biases of the model were used to aid performance on the original task of classifying the given text field into one of the 11 abdominal CT protocol classes. An AWD-LSTM was again used for the classification task, and training was performed until the validation loss was minimized. fastai automatically incorporated multiple state-of-the-art paradigms for efficient and effective training, including an optimal learning rate finder, variable learning rates throughout training, and dropout. The results, including loss values, F1 scores, and Cohen’s kappa, were obtained, and a confusion matrix was created for visual analysis. Execution time, including processing, training, and validation, was measured using Python’s standard library.
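The shuffle, empty-row removal, and three-way split that open this workflow can be sketched with the standard library as below. The 80/10/10 split fractions and seed are assumptions for illustration; the study does not report them, and in practice fastai's data loaders handle this step.

```python
import random

def split_dataset(rows, train_frac=0.8, valid_frac=0.1, seed=42):
    """Drop rows with empty columns, shuffle, and split into
    training / validation / test sets (fractions are assumed)."""
    rows = [r for r in rows if all(str(v).strip() for v in r)]
    rng = random.Random(seed)
    rng.shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_valid = int(len(rows) * valid_frac)
    return (rows[:n_train],
            rows[n_train:n_train + n_valid],
            rows[n_train + n_valid:])

# Illustrative two-column rows; one has an empty text column.
data = [(3, "RUQ pain K81.0"), (0, ""), (5, "hepatic lesion K76.9")] * 10
train_set, valid_set, test_set = split_dataset(data)
print(len(train_set), len(valid_set), len(test_set))  # 16 2 2
```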
Automated Machine Learning Builder Workflow
We imported the same two-column text data into Google’s commercial automated machine learning builder platform, AutoML (Alphabet, Inc., Mountain View, CA), to create an NLP classification model. The 11 classes and individual samples were presented for a brief final edit, including moving a sample to a different class or deleting a sample. After review, training and inference on the dataset were initiated with a single button click. Several hours later, an e-mail notification was received that the process was complete. Results, including precision, accuracy, and F1 score, were available to view, as well as a confusion matrix and a live interface for testing the model on new text data if desired. Finally, data processing time and model training time were calculated for comparison.
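All three workflows report the same family of metrics, and each can be derived directly from a confusion matrix. The sketch below shows that derivation for per-class precision, recall, and F1, plus overall accuracy and Cohen's kappa; the 3x3 counts are made-up illustrative numbers, not study results.

```python
# Confusion matrix: rows are true classes, columns are predicted classes.
conf = [
    [50, 2, 1],
    [3, 45, 4],
    [0, 5, 40],
]
n_classes = len(conf)
total = sum(map(sum, conf))

def class_metrics(conf, k):
    """Per-class precision, recall, and F1 from the matrix counts."""
    tp = conf[k][k]
    fp = sum(conf[r][k] for r in range(len(conf))) - tp
    fn = sum(conf[k]) - tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, 2 * precision * recall / (precision + recall)

# Overall accuracy and Cohen's kappa (chance-corrected agreement).
accuracy = sum(conf[i][i] for i in range(n_classes)) / total
pe = sum(sum(conf[i]) * sum(conf[r][i] for r in range(n_classes))
         for i in range(n_classes)) / total ** 2
kappa = (accuracy - pe) / (1 - pe)
print(round(accuracy, 3))  # 0.9
```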