1 Introduction

The field of artificial intelligence known as natural language processing (NLP) allows for automated processing and analysis of everyday language. In the past two decades, NLP has rapidly expanded across all information technology domains and is now being utilized more frequently in medicine. Its applications include enhancing the use of unstructured electronic health records, aiding communication with patients, conducting consultations, and finding pertinent information in papers [21]. Most cutting-edge NLP techniques rely on statistical language modeling, which involves representing words as numerical vectors that capture their probability distribution in a sentence structure [12]. These vectors, also known as word embeddings, are numerical representations of words and are frequently generated through self-supervised machine learning methods applied to large, unlabeled textual datasets. More advanced language models create distinct representations for a word based on its context, allowing them to accurately capture polysemous terms, i.e., terms with multiple meanings. Contextual language models based on Transformer architectures, such as BERT [8] or RoBERTa [20], are trained using a deep neural network with a masked language modeling (MLM) objective [33]. These models use a bidirectional self-attention mechanism [34] to associate each word with its context, i.e., the words surrounding it in the sentence. These features enable contextual language models to outperform non-contextual ones in various NLP tasks [8]. Although trained on enormous digital corpora consisting of billions of words, language models built on general text frequently do not work effectively in very specialized domains such as scientific ones. As a result, several recent NLP studies have concentrated on retraining or fine-tuning language models for very specialized domains using domain-specific text (as explained in detail in Sect. 2).

While a large number of domain-specific language models have been developed to improve the understanding of semantic information in their fields of expertise, to the best of our knowledge a specialized model for surgical language does not yet exist, even though the scientific community has shown growing interest in the application of NLP in surgery [19, 28, 38,39,40]. There is an abundance of high-quality resources in the surgical literature, including books, online materials, and academic papers that are adopted and utilized by universities around the globe. This vast quantity of high-quality information can be a valuable resource for various clinical applications, involving both humans and smart robotic systems, if automatically processed via NLP techniques. For instance, one possible application of the content extracted from textual resources is building or extending the knowledge bases exploited by surgical robots, which they can use to make informed decisions in real-life intervention situations. Similarly, as reported in recent studies focusing on the clinical field [30, 42], humans can also benefit from this information in question-answering applications. Such systems could be useful for medical students during their early training phase, or to provide a summary or simplified version of surgical descriptions.

In this paper, we follow this line of research and introduce a new pre-trained language model trained on procedural surgical language, named SurgicBERTa. The main, novel contributions presented in this paper are:

  1. The development of SurgicBERTa, a pre-trained language model specific for the understanding of procedural surgical language;
  2. The intrinsic evaluation of SurgicBERTa with respect to the general-purpose model RoBERTa;
  3. The extrinsic evaluation of SurgicBERTa with respect to RoBERTa, that is, the comparison of their performances when employed on four different downstream tasks;
  4. The public release of SurgicBERTa to the research community: https://gitlab.com/altairLab/surgicberta.

The quantitative assessments are complemented with a qualitative analysis of SurgicBERTa, showing that it contains substantial surgical domain knowledge that could be useful to enrich existing state-of-the-art surgical knowledge bases. The evaluation indicates that SurgicBERTa handles surgical language better than a state-of-the-art yet open-domain and general-purpose model such as RoBERTa, and therefore can be effectively exploited in many computer-assisted applications, specifically in the surgical domain.

The paper is organized as follows: Sect. 2 reviews relevant works in this area. Then, SurgicBERTa is presented in Sect. 3: the required textual data is collected, extracted, pre-processed and used for the continuous training of RoBERTa on the MLM task with domain-specific text. Section 4 presents the metrics and tasks used to evaluate SurgicBERTa. In particular, metrics for the intrinsic evaluation of SurgicBERTa (i.e., perplexity, accuracy, and evaluation loss of the MLM task) are presented in Sect. 4.1, while Sects. 4.2–4.5 present the downstream tasks used to compare SurgicBERTa with RoBERTa, namely, (i) procedural sentence detection, (ii) procedural knowledge extraction, (iii) ontological information discovery, and (iv) surgical terminology acquisition. Section 4.6 reports and qualitatively discusses some examples of surgical domain knowledge contained in SurgicBERTa. Finally, Sect. 5 summarizes the obtained results and proposes future works.

2 Related works

2.1 Transformers and pre-trained language models

Transformers are deep-learning models widely used in NLP [34] and computer vision [9]. In particular, they have fundamentally changed the landscape of NLP by gradually replacing recurrent neural networks across the board. The core innovative part of these architectures is the self-attention mechanism [34]. Since one word can have different meanings in different contexts, self-attention allows the model to look at other positions in the input sequence for clues that can help lead to a better encoding for the current word. Moreover, the creation of large-scale, Transformer-based pre-trained language models such as BERT or RoBERTa has revolutionized the NLP domain. These models only use the encoder part of the Transformer (in contrast, e.g., to denoising autoencoders such as BART [16]). Such large models are pre-trained once in an unsupervised way, e.g., on a language model objective, and can be fine-tuned for a large number of NLP tasks with a modest amount of training data, achieving state-of-the-art results on many of them, such as sentiment analysis, textual entailment, and natural language inference, crucially also across languages [15].
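To make the mechanism concrete, the following is a minimal, single-head sketch of scaled dot-product self-attention in PyTorch. It is illustrative only: real encoders such as BERT and RoBERTa add multiple heads, residual connections, layer normalization, and feed-forward sublayers.

```python
# Minimal single-head scaled dot-product self-attention (illustrative sketch).
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # Learned projections mapping each token embedding to
        # query, key, and value vectors.
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) token embeddings.
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Each token attends to every position (bidirectional context).
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))
        weights = torch.softmax(scores, dim=-1)
        return weights @ v  # context-aware token representations

x = torch.randn(1, 12, 64)          # a 12-token sentence, 64-dim embeddings
print(SelfAttention(64)(x).shape)   # torch.Size([1, 12, 64])
```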

2.2 Pre-trained language models in biomedicine

Transformer-based pre-trained language models have also been fine-tuned for different tasks in the biomedical domain. However, they were originally built for general English, and thus they may miss some domain-specific words or expressions. To overcome this limitation, one option is to train a model from scratch for the given domain of interest, as in [42], where a large clinical-domain model trained on more than 90 billion words of text is proposed. Developing such a model from scratch is, however, very expensive in terms of the computational resources and training time required. For this reason, domain adaptation techniques, such as the MLM described in Sect. 3, have been proposed and widely used in biomedicine, with fine-tuning for various downstream tasks. In [44], domain adaptation is used to obtain a cancer-specific language model for effectively extracting breast cancer phenotypes from electronic health records.

In [37], the authors utilize pre-trained neural models to classify patients as either seizure-free or not, as well as to extract the text from clinical notes that contains their seizure frequency and the date of their last seizure. The first step of this pipeline is unsupervised domain adaptation using progress notes that were not selected for annotation; the resulting model is then fine-tuned for the classification and extraction tasks. Similarly, [41] adopted a domain adaptation technique on clinical notes from the Medical Information Mart for Intensive Care III database [14] to extract clinically relevant information. In [18], causal precedence relations are recognized among the chemical interactions in the biomedical literature to understand the underlying biological mechanisms. Detecting such causal relations is challenging because annotating causal relation detection datasets requires considerable expert knowledge and effort. To overcome this limitation, in-domain pre-training of neural models with knowledge distillation techniques has been adopted, showing that the neural models outperform previous baselines even with a small amount of annotated data. In [7], a domain adaptation strategy is adopted to encourage the model to learn features from the context in order to curate all validated antibiotic resistance genes, i.e., genes enabling bacteria to survive and propagate in the presence of antibiotics, from scientific papers. In [30], a domain adaptation technique has been used to align large language models to new medical domains, showing that, after a proper adaptation step, they encode clinical knowledge usable in question-answering applications. Finally, domain adaptation techniques have been adopted for the biomedical domain in languages other than English, such as Spanish [6] and Chinese [43], showing the same improvement trend when compared to the corresponding base models.

However, due to the syntactic, semantic, and terminological differences between domains, it is often difficult to use these models to gain benefits outside the domain they were trained on. It is generally accepted that model performance may degrade when evaluated on data with a different distribution [31]. Consequently, domain adaptation on relevant domain data is essential to improve performance in very specialized domains [1], and despite the availability of several biomedical language models, to the best of our knowledge, a pre-trained surgical language model is missing. Such a model is essential for mining surgical procedural knowledge from text and developing intelligent surgical systems.

3 A language model for the surgical domain: SurgicBERTa

Fig. 1: MLM task used for adapting SurgicBERTa to the surgical domain. \(\langle s\rangle \) and \(\langle /s\rangle \) are special tokens denoting the sentence's beginning and end, respectively

This section describes the development of SurgicBERTa, the pre-trained language model for the surgical domain that we contribute. SurgicBERTa has been developed on top of RoBERTa, an already available pre-trained language model for general-domain English. Specifically, the roberta-base model from the HuggingFace Transformers library has been adopted. Therefore, the evaluation (presented in Sect. 4) will compare these two models along several dimensions.

RoBERTa [20] is a Transformer model that adopts the same encoder-only architecture made popular by BERT [8], while being trained on a larger quantity of data, consisting of a combination of datasets totaling around 160 GB of raw text: namely, texts from BookCorpus and English Wikipedia, data from the English portion of the CommonCrawl News, from OpenWebText, and some stories from CommonCrawl data. RoBERTa has been trained via MLM with dynamic masking: i.e., each time a sequence is input to the model, a new masking pattern is created. Unlike BERT, RoBERTa was not additionally trained on next sentence prediction, as this training objective did not contribute a significant improvement of the performance in downstream tasks [20].

Leveraging RoBERTa as a starting point, we developed a new model that is tailored to the surgical domain. This involved the continuous training of RoBERTa on a large corpus of surgical text for the MLM unsupervised task. In the MLM task, a token \(w_{t}\) is replaced with \(\langle mask \rangle \) and predicted using all past and future tokens \( {\varvec{W}}_{\setminus t}:= ( {\varvec{w}}_{1},\ldots , {\varvec{w}}_{t-1}, {\varvec{w}}_{t+1},\ldots , {\varvec{w}}_{\vert W \vert })\). Figure 1 illustrates the MLM task used to derive SurgicBERTa.

In more detail, to obtain a surgical model as general as possible, we collected 300 K sentences (7 M words) from surgery books covering several heterogeneous surgical domains, including, for instance, orthopedics, abdominal surgery, and eye surgery. We searched for surgery books written in English on the web pages of several publishing houses, using the names of the surgical macro-areas (e.g., general surgery, abdominal surgery, gynecology surgery, eye surgery, etc.) as keywords. From the results, we downloaded the digital version only of the texts to which our universities have proper free legitimate access. A very minimal pre-processing of the sentences was performed, mainly to clean the text of bibliographic references and URLs. For the MLM objective, \(15\%\) of tokens are selected for possible replacement: among them, \(80\%\) are replaced with the special \(\langle mask \rangle \) token, \(10\%\) are left unchanged, and \(10\%\) are replaced by a random token. The model is then trained to predict the original masked tokens using a cross-entropy loss. Following the RoBERTa approach, tokens are masked dynamically instead of being fixed statically for the whole dataset during pre-processing; this improves variability and makes the model more robust when training for multiple epochs. SurgicBERTa was trained on one NVIDIA RTX A6000 GPU with 48 GB of memory, for 30 epochs, with a learning rate of \(5e{-}06\), a batch size of 32, and the Adam optimizer. The implementation is based on the PyTorch and Transformers libraries. The entire training required about 8 hours.
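As a concrete illustration of this adaptation step, the sketch below shows how such continued MLM training can be set up with the Transformers library. The corpus file name is hypothetical; the hyperparameters mirror those reported above, and the library's default Adam-based optimizer is used.

```python
# Sketch of the domain-adaptation step: continued MLM training of
# roberta-base on surgical text (file name is illustrative).
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# One surgical sentence per line (hypothetical file name).
corpus = load_dataset("text", data_files={"train": "surgical_sentences.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

# Dynamic masking: 15% of tokens are selected on the fly at each pass
# (80% replaced by <mask>, 10% random token, 10% unchanged).
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(output_dir="surgicberta",
                         num_train_epochs=30,
                         learning_rate=5e-6,
                         per_device_train_batch_size=32)

Trainer(model=model, args=args, data_collator=collator,
        train_dataset=tokenized["train"]).train()
```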

4 Evaluation

This section presents the intrinsic evaluation (Sect. 4.1) and the four downstream tasks that we use to evaluate SurgicBERTa in Sects. 4.2 through 4.5, namely: procedural/non-procedural surgical sentence classification, surgical information extraction, ontological information discovery, and surgical terminology acquisition.

4.1 Intrinsically evaluating the quality of language modeling

4.1.1 Evaluation metrics

Perplexity is one of the most common metrics for evaluating language models and measures the degree of uncertainty of a language model in generating a new token, averaged over very long sequences [27]. This means that the lower the perplexity, calculated as the exponentiated average negative log likelihood of a sequence, the better the language model is able to predict a given text. While perplexity can be computed out of the box for traditional language models trained on guessing the next word given the previous context, i.e., autoregressive or causal language models, it is not well defined for language models like BERT or RoBERTa trained with the masked language modeling technique. For these models, we can instead compute the perplexity from their pseudo-log likelihood scores (PPL) [36], i.e., the sum of the conditional log probabilities of each sentence token [29]. Formally, the PPL of a sentence \({\varvec{W}} = (w_1, \ldots , w_{\vert {\varvec{W}} \vert })\) under a language model with parameters \(\Theta \) is defined as:

$$\begin{aligned} \hbox {PPL}({\varvec{W}}):= \sum _{t=1}^{\vert {\varvec{W}} \vert } \hbox {log} \, P_{\text {MLM}}({\varvec{w}}_{t} \vert {\varvec{W}}_{\setminus t}; \Theta ) \end{aligned}$$

where \(P_{\text {MLM}}({\varvec{w}}_{t} \vert {\varvec{W}}_{\setminus t}; \Theta )\) is the conditional probability of token \({\varvec{w}}_{t}\) given all past and future tokens \({\varvec{W}}_{\setminus t}:= (w_1, \ldots , w_{t-1}, w_{t+1}, \ldots , w_{\vert W \vert })\).

The (pseudo) perplexity PP of a masked language model [27] on a corpus of sentences \({\mathbb {W}}\) is then computed as:

$$\begin{aligned} \hbox {PP}({\mathbb {W}}):= \hbox {exp} \left( -\frac{1}{N} \sum _{{\varvec{W}} \in {\mathbb {W}}} \hbox {PPL}({\varvec{W}}) \right) \end{aligned}$$

where N is the number of tokens in the corpus. By computing PP on a test corpus for both RoBERTa and SurgicBERTa, we are evaluating each model's ability to predict the unseen text from the corpus and take this as an intrinsic evaluation metric of the quality of the two models.
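For illustration, pseudo-perplexity can be computed by masking each token in turn and accumulating its conditional log probability, as in the following sketch (model and corpus are placeholders; a practical implementation would batch the masked copies for speed):

```python
# Sketch of pseudo-perplexity (PP) for a masked language model:
# mask each token w_t in turn, accumulate log P(w_t | W \ t) (the PPL
# defined above), then exponentiate the negative average over the corpus.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

def pll(sentence: str) -> tuple[float, int]:
    ids = tok(sentence, return_tensors="pt")["input_ids"][0]
    total, n = 0.0, 0
    for t in range(1, len(ids) - 1):        # skip <s> and </s>
        masked = ids.clone()
        masked[t] = tok.mask_token_id       # mask token w_t
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, t]
        log_probs = torch.log_softmax(logits, dim=-1)
        total += log_probs[ids[t]].item()   # log P(w_t | W \ t)
        n += 1
    return total, n

corpus = ["The colon is reflected medially over the kidney."]  # placeholder
score, tokens = map(sum, zip(*(pll(s) for s in corpus)))
pp = torch.exp(torch.tensor(-score / tokens)).item()
print(f"PP = {pp:.2f}")
```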

Other intrinsic metrics used in this paper to evaluate RoBERTa and SurgicBERTa on the surgical domain are the accuracy of MLM, computed on the masked tokens during the evaluation step, and the evaluation loss. Accuracy measures how well the model predicts the masked words, comparing the model predictions with the true tokens and expressing the result as a percentage. The loss, instead, quantifies the model's overall prediction error: the larger the errors, the higher the loss and the worse the model performs.

Generally, the higher the accuracy and the lower the loss on the evaluation dataset, the better the model will perform.

4.1.2 Results and discussion

Table 1 reports the perplexity, accuracy, and loss values of RoBERTa and SurgicBERTa obtained during the evaluation of the MLM task, as described in Sect. 4.1. SurgicBERTa has a lower perplexity (\(-11.11\)), higher accuracy (\(+15.30\%\)), and lower evaluation loss (\(-1.277\)) than RoBERTa. All results intrinsically confirm that SurgicBERTa handles surgical language better than RoBERTa.

Table 1 Perplexity, accuracy, and evaluation loss

4.2 Extrinsic evaluation: task A—procedural content detection

4.2.1 Task definition

The detection of procedural content is a binary classification task whose aim is to classify each sentence of a corpus into one of two classes (procedural and non-procedural). This task is generally a preliminary and essential step for business or robotic process automation starting from procedural content stored in textual materials, because it allows models to deal only with those sentences that are important for the extraction of a workflow [26]. For the surgical domain, the two classes are defined in [2]:

  • Procedural sentences describe a specific action performed by either the robot or the human surgeon (e.g., an intervention on the body, the positioning of the robot). An example of a procedural sentence is “The colon is reflected medially over the kidney along the white line of Toldt.”;

  • Non-procedural sentences do not contain any indication of a specific surgeon action, but rather describe general, complementary information or anatomical features, not necessarily specific to perform a particular step of the intervention. An example of a non-procedural sentence is “This permits greater range of camera movement inferiorly within the retroperitoneum.”

As training and testing material, we exploit the latest available version (v1.1) of the SPKS dataset, containing 2250 sentences manually annotated as procedural (approx. 68%) and non-procedural (approx. 32%).

In order to fine-tune RoBERTa and SurgicBERTa for procedural sentence classification, the pre-trained models have been extended to produce a classification output (procedural/non-procedural) by adding a softmax-activated classification layer on top of them, and then fine-tuning the resulting models on the SPKS dataset. A standard cross-entropy loss function has been adopted for classification. Due to the reduced size of the dataset, we utilized the classical 10-fold cross-validation protocol, which divides the dataset into ten sets: in each iteration, one set is used for testing the classifier, while the remaining nine are used for training and hyperparameter tuning. This process is repeated ten times, and the classification performance is evaluated by averaging the evaluation metrics over the ten iterations.
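As an illustration of this setup, the sketch below fine-tunes a sequence classification head on one fold. The CSV file names and the number of epochs are assumptions for illustration, not the paper's reported configuration; the same code applies to either checkpoint.

```python
# Sketch of fine-tuning for procedural/non-procedural classification:
# a 2-label classification head on top of the pre-trained encoder,
# trained with the default cross-entropy loss, on one cross-validation fold.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_name = "roberta-base"  # or the SurgicBERTa checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=2 adds a randomly initialised classification layer; a softmax
# over its two logits yields the procedural/non-procedural probabilities.
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels=2)

# Hypothetical CSVs with "sentence" and "label" (0/1) columns for one fold.
data = load_dataset("csv", data_files={"train": "fold_train.csv",
                                       "test": "fold_test.csv"})
data = data.map(lambda b: tokenizer(b["sentence"], truncation=True),
                batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clf", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data["train"],
)
trainer.train()
preds = trainer.predict(data["test"])  # logits for the held-out fold
```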

Standard metrics for classification tasks, namely precision (P), recall (R), and F1-score, are used to measure performance. The metrics are calculated for each class (procedural/non-procedural), and for each of them we report the macro average, i.e., the mean of the considered metric over the two classes. In addition, we also compute accuracy (Acc), i.e., the number of correctly classified sentences divided by the test set size, which, in the case of binary classification, coincides with the micro average of P, R, and F1. For testing the statistical significance, we computed the p value applying McNemar's test with a significance threshold \(\alpha \) of 0.05, as implemented in [10].
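For illustration, the sketch below computes these metrics and McNemar's test with scikit-learn and statsmodels; the prediction arrays are toy placeholders, and [10] provides the implementation actually used in the paper.

```python
# Sketch: macro-averaged P/R/F1, accuracy, and McNemar's test comparing
# two classifiers on the same test sentences (toy data, for illustration).
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from statsmodels.stats.contingency_tables import mcnemar

y_true = np.array([1, 0, 1, 1, 0, 1])   # gold labels (1 = procedural)
y_a    = np.array([1, 0, 1, 0, 0, 1])   # e.g., RoBERTa predictions
y_b    = np.array([1, 0, 1, 1, 0, 1])   # e.g., SurgicBERTa predictions

p, r, f1, _ = precision_recall_fscore_support(y_true, y_b, average="macro")
print(f"Acc={accuracy_score(y_true, y_b):.3f} P={p:.3f} R={r:.3f} F1={f1:.3f}")

# 2x2 table of per-sentence correctness agreement between the two systems.
a_ok, b_ok = y_a == y_true, y_b == y_true
table = [[np.sum(a_ok & b_ok),  np.sum(a_ok & ~b_ok)],
         [np.sum(~a_ok & b_ok), np.sum(~a_ok & ~b_ok)]]
print(mcnemar(table, exact=True).pvalue)  # significant if < 0.05
```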

4.2.2 Results and discussion

Results of the procedural sentence detection task described in Sect. 4.2 are reported in Table 2. SurgicBERTa improves all the performance metrics compared to RoBERTa on both the procedural and non-procedural classes. Overall, averaging the performances on both classes, SurgicBERTa improves accuracy by 0.014 and macro-F1 by 0.015, confirming the benefit of a domain-specific language model for surgery-related text classification. The observed performance difference between the two systems is confirmed to be statistically significant by the considered test.

Table 2 Text classification performance of the tested methods (Extrinsic Evaluation—Task A)

4.3 Extrinsic evaluation: task B—procedural knowledge extraction

4.3.1 Task definition

The purpose of this task is the extraction of procedural information from texts using semantic role labeling (SRL) techniques. Given a sentence, the SRL task aims at labeling the semantic arguments of the sentence predicates in order to extract Who does What to Whom, How, When, and Where. In this paper, we adopt the PropBank [23] approach for SRL, leveraging the catalog of semantic roles and predicate meanings codified in the Robotic-Surgery Propositional Framebank (RSPF) [3].

SRL can be organized into two complementary subtasks: (i) predicate disambiguation, i.e., the understanding of the correct meaning of a word describing an action (a.k.a. a predicate), and (ii) semantic argument identification and classification, i.e., the detection of the argument spans of a predicate and their assignment to the correct semantic role labels from RSPF. For example, given the sentence:

The colon is reflected medially over the kidney along the white line of Toldt.

with task (i) the method should recognize that reflect has in this context the RSPF meaning reflect.03, i.e., to bend or fold back, and not, for example, the RSPF meanings reflect.02, i.e., think about, or reflect.01, i.e., cast an image back. Then, given this meaning, the method has to solve task (ii), i.e., to tokenize and classify the arguments in the sentence as follows:

[Arg.1: The colon] is [V: reflected] [Arg.2: over the kidney] [Arg.3: along the white line of Toldt].

where Arg.1, Arg.2 and Arg.3 indicate (a) the thing reflected, (b) its location, and (c) other useful spatial indications, respectively.

Modern SRL methods rely on neural architectures that require annotated data to learn the language in a supervised way [11, 17]. To train, validate, and test the models, we used two different manually annotated textual datasets for semantic role labeling: CoNLL-2012 [25] and a smaller dataset specific to robotic surgery [5]. CoNLL-2012 is a large-scale general-English corpus with 318 k annotated predicates, covering multiple genres. We used this dataset to teach the common neural architecture the basic knowledge about the SRL task. The smaller dataset is instead domain-specific, containing 1559 SRL-annotated sentences regarding robotic surgery procedures, thus including both traditional surgical actions and specific robot operations. We used this smaller dataset to specialize the models, helping them to better understand surgical language and perform the SRL task more effectively in the given domain. The train, test, and validation splits already provided with the smaller dataset are used for the training, tuning, and evaluation of the performances. Specifically, 80% of the sentences are utilized for training (with 10% of them being set aside for validation), while the remaining 20% are dedicated to the test dataset. Moreover, for comparing the two language models on this task, the same metrics adopted for the procedural content detection task are used (cf. Sect. 4.2). For testing the statistical significance, we applied the Bootstrap test on the accuracy of the label (predicates and arguments) predictions with significance threshold \(\alpha \) of 0.05 and using the implementation of [10].
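The exact neural architectures follow [11, 17]; purely as an illustration, the argument identification and classification subtask can be framed as BIO token tagging over RSPF role labels, as in the sketch below (the simplified label set and sentence are assumptions, not the paper's exact setup).

```python
# Illustrative sketch: SRL argument spans as BIO token classification.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-Arg1", "I-Arg1", "B-Arg2", "I-Arg2",
          "B-Arg3", "I-Arg3", "B-V"]          # simplified RSPF-style tags
tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "roberta-base", num_labels=len(labels))   # head to be fine-tuned on SRL data

sentence = "The colon is reflected medially over the kidney ."
enc = tok(sentence, return_tensors="pt")
with torch.no_grad():
    pred_ids = model(**enc).logits.argmax(-1)[0]
# Before fine-tuning the tags are random; after training on the CoNLL-2012
# and surgical SRL datasets they should recover spans such as
# [Arg1: The colon] [V: reflected] [Arg2: over the kidney].
print([labels[int(i)] for i in pred_ids])
```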

4.3.2 Results and discussion

Table 3 reports the performance on the procedural knowledge extraction task described in Sect. 4.3. SurgicBERTa improves the predicate disambiguation accuracy by 0.018 compared to RoBERTa. Moreover, SurgicBERTa outperforms RoBERTa in all evaluation metrics related to the argument disambiguation task: in particular, it improves precision by 0.007, recall by 0.016, and F1 by 0.011. The improvement is confirmed to be statistically significant by the performed Bootstrap test. These results extrinsically demonstrate the benefit of specializing RoBERTa in the surgical domain for the accurate extraction of actions and related information from surgical text.

Table 3 Performance (overall) on the SRL task (Extrinsic Evaluation—Task B). The best scores are highlighted in bold

4.4 Extrinsic evaluation: task C—ontological information about the surgery and anatomical target

4.4.1 Task definition

The purpose of this task is to associate the name of a surgical procedure with the corresponding anatomical target or relevant feature, to verify whether the language models have learned this type of knowledge during training. For example, prostatectomy has to be associated with prostate, nephrectomy with kidney, and mastectomy with breast. To evaluate our models on this task, we built a dataset consisting of the definitions of 20 different surgical procedures. In particular, surgical procedures that can be performed with the aid of a robot have been chosen, together with other very frequent laparoscopic ones. The definitions are retrieved from the web or from surgical manuals not used during the training of the language models. From each definition, the name of the corresponding anatomical target has been removed, and the models are asked to guess it. As evaluation metrics, we consider the rank of the correct target word among those returned by the model, the reciprocal rank (RR), and the mean reciprocal rank (MRR) [35]. We chose these ranking metrics rather than, e.g., plain accuracy, because the models return a finite ranked list of candidates that we want to be able to inspect beyond the top prediction. MRR is a metric used to assess the performance of systems that provide a ranked list of answers in response to user queries. In this task, the answers are the words returned to fill the \(\langle mask \rangle \), i.e., the anatomical part corresponding to the procedure description, and the queries are the sentences describing the procedure. In more detail, for a single query, the RR is defined as \(\frac{1}{\text {rank}}\), where rank is the position of the correct answer among the ones (sorted by probability, from highest to lowest) predicted by the model. For multiple queries \(\vert Q\vert \), the MRR is the mean of the \(\vert Q \vert \) RRs, i.e.,

$$\begin{aligned} \hbox {MRR} = \frac{1}{\vert Q \vert } \sum _{i=1}^{\vert Q \vert }\frac{1}{\hbox {rank}_i} = \frac{1}{\vert Q \vert } \sum _{i=1}^{\vert Q \vert } \hbox {RR}_{i} \end{aligned}$$
(1)

The vocabulary has not been restricted, i.e., a list of possible candidates to choose from has not been used, so that the models can return any word belonging to the vocabulary.

To better clarify with an example, consider the following sentence (i.e., query):

A sacrocolpopexy is a surgical procedure used to treat \(\langle mask \rangle \) organ prolapse.

Models are asked to fill in the missing word with the correct one, which in the above example is pelvic; they propose a list of possible candidates sorted by probability. For the above sentence, RoBERTa and SurgicBERTa return the correct word pelvic in the third and first position, thus obtaining an RR of 0.33 and 1.0 with probabilities of 0.043 and 1.0, respectively. For testing the statistical significance, we applied the Bootstrap test on the RRs of the correct predictions, using the same \(\alpha \) threshold and implementation as in the other tasks.
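As an illustration of this evaluation, the sketch below ranks fill-mask candidates and computes RR and MRR as in Eq. (1); the queries shown are illustrative examples, not the actual dataset.

```python
# Sketch of the ranking evaluation: query the model's fill-mask head,
# find the rank of the gold word among the returned candidates, and
# average the reciprocal ranks over all queries (Eq. 1).
from transformers import pipeline

fill = pipeline("fill-mask", model="roberta-base", top_k=50)

queries = [  # (masked query sentence, gold answer) - illustrative examples
    ("A sacrocolpopexy is a surgical procedure used to treat "
     "<mask> organ prolapse.", "pelvic"),
    ("A nephrectomy is the surgical removal of a <mask>.", "kidney"),
]

rrs = []
for sentence, gold in queries:
    # Candidates come back sorted by probability; strip RoBERTa's
    # leading-space token marker before comparing.
    candidates = [c["token_str"].strip() for c in fill(sentence)]
    # rank is 1-based; RR = 0 if the gold word is not in the top-k list.
    rr = 1 / (candidates.index(gold) + 1) if gold in candidates else 0.0
    rrs.append(rr)

mrr = sum(rrs) / len(rrs)
print(f"MRR = {mrr:.3f}")
```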

4.4.2 Results and discussion

This section summarizes the results of the task described above, i.e., predicting the anatomical target given the name and a brief definition of the corresponding surgical intervention. On average, the correct target is returned by RoBERTa in position 2.35, while SurgicBERTa outperforms RoBERTa, proposing the correct target in position 1.35. The MRR of RoBERTa is 0.731, while that of SurgicBERTa is 0.902. In more detail, in \(30\%\) of the cases SurgicBERTa performs better than RoBERTa in terms of RR. The base model performs better than SurgicBERTa in only one case (query 19), where the model is asked to predict the anatomical part related to the “endarterectomy,” that is, the “artery.” Since RoBERTa performs only slightly better than SurgicBERTa (the former returns “artery” in 4th position, the latter in 5th), this is perhaps because the base model may have already seen a similar sentence (or documents describing the “endarterectomy”) during its training phase. The violin plots of Fig. 2 summarize the obtained RRs on each query sentence: the one for SurgicBERTa is very wide at the top and narrow in the middle and at the bottom, while the one for RoBERTa, albeit with a similar distribution, is much less wide at the top and has a lower median than that of SurgicBERTa. The shape of the distribution indicates that the RRs of SurgicBERTa are highly concentrated around the first quartile, meaning that the model predicts the proper anatomical target very well. In contrast, the RRs of RoBERTa are more evenly distributed across the entire range, reflecting lower scores. The computed p value (\(<0.05\)) confirms the statistical significance of the observed performance difference, and thus the benefit of having specialized RoBERTa for the surgical language.

Fig. 2: Reciprocal rank of the predicted word in the task of predicting the anatomical target given the information of a surgical procedure (Extrinsic Evaluation—Task C)

4.5 Extrinsic evaluation: task D—surgical terminology acquisition

4.5.1 Task definition

This task is the same as the previous one, but applied to a different dataset and proposed for a different purpose: to verify whether SurgicBERTa masters the surgical language and can use it more appropriately than RoBERTa. In particular, a dataset of 50 surgical sentences was collected from different sources, i.e., surgical books, academic papers, and web pages not used during the MLM training. The sentences were randomly chosen from those that met the following requirements:

  • The sentence has not been used to train SurgicBERTa;

  • One of the following holds:

    • The sentence contains an expression commonly used in surgery. To define widely used expressions, we have selected those typically abbreviated with an acronym in papers. In the sentences included in the dataset, the abbreviations have been substituted with the original expression, and the language models are asked to complete them correctly in the corresponding context;

    • The sentence contains a description of a surgical procedure. In the sentences inserted in the dataset, the verb describing the action is masked, and the language model is asked to guess it based on the context.

Since the task is the same as the previous one, we used the same metrics, i.e., the position at which the correct solution is proposed, the RR, and the MRR. We also applied the same statistical significance test.

4.5.2 Results and discussion

Table 4 summarizes the results obtained for the task described in Sect. 4.5. SurgicBERTa substantially improves all proposed metrics: the mean position at which SurgicBERTa proposes the word that correctly fills the mask, among the list of returned candidates, is 19.19 times better than RoBERTa's. This means SurgicBERTa is much more familiar with surgical terminology than RoBERTa. Consequently, the MRR is improved by 0.396. In \(66\%\) of the cases SurgicBERTa improves the RR when compared to RoBERTa. Only in two cases (out of 50) does RoBERTa perform better than SurgicBERTa: similarly to task C, it is difficult to understand why this happens, and the same considerations may apply. The violin plots of Fig. 3 illustrate the RRs of the two language models for each query: while the one for SurgicBERTa is wide at the top, the one for RoBERTa is wide at the bottom. Furthermore, SurgicBERTa has a much higher median than RoBERTa. This highlights the better accuracy of SurgicBERTa in managing surgical terminology, also confirmed by the significance test performed (p value \(<0.05\)). Hence, this task also confirms that SurgicBERTa better captures the surgical language.

Table 4 Mean position and MRR on the task of surgical terminology acquisition (Extrinsic Evaluation—Task D)
Fig. 3: Reciprocal rank of the predicted word in the task of surgical terminology acquisition (Extrinsic Evaluation—Task D)

4.6 Qualitative examples of surgical knowledge available in pre-trained language models

There is a lot of domain information implicitly stored in pre-trained language models [24]. Adapting the model to the domain through continual learning with MLM helps to capture this kind of knowledge. However, it is difficult to quantify this domain knowledge objectively and exhaustively due to the lack of any gold standard for the surgical domain. For this reason, this section proposes a qualitative analysis, providing examples of the domain information stored in pre-trained language models.

To start with, RoBERTa and SurgicBERTa are asked to return the name of the most commonly used surgical robot in the operating room. In particular, the two models are asked to substitute the \(\langle mask \rangle \) in the following sentence with the five most appropriate words, ranked in order of probability:

The most commonly used surgical robot is \(\langle mask \rangle \) .

Results are reported in Table 5. While, to the best of our knowledge, none of the top five words returned by RoBERTa is the name of a surgical robot, Zeus, Xi, and Si, returned by SurgicBERTa, are instead examples of surgical robots that have been used in operating theaters. This means that the continual MLM learning on domain text has captured this kind of information, which is now available in the model. Nonetheless, it is interesting to note how some of the words returned by RoBERTa are related to the robotics field: “Hawk,” “Orion,” and “Juno” are also examples of (non-surgical) robots. This observation suggests that while the general model tries to be plausible, it lacks specific domain knowledge.

Table 5 RoBERTa and SurgicBERTa most probable words for the most used surgical robots
Fig. 4: Illustration of the critical view of safety method during a cholecystectomy

Fig. 5: Pfannenstiel incision to access the abdomen. This figure is adapted from [13]

As reported in Table 1, SurgicBERTa has a substantially lower perplexity than RoBERTa on the MLM task when applied to surgical literature. This intrinsically means that SurgicBERTa has learned the surgical language and thus also the composition of well-known surgical expressions. The following example highlights how SurgicBERTa has learned specialized domain terminology. In surgery, the expression critical view of safety refers to a method of secure identification in open cholecystectomy in which the cystic duct and artery are putatively identified, after which the gallbladder is taken off the cystic plate so that it remains attached only by the two cystic structures [32], as shown in Fig. 4.

To verify if RoBERTa and SurgicBERTa know this information, they are asked to complete the following sentence:

During cholecystectomy, it is important to achieve the critical view of \(\langle mask \rangle \) .

SurgicBERTa returns the word safety as the 1st result with a probability of 0.3428, while RoBERTa returns it only at the 47th position, with a probability of 0.0032.

This section ends with another example of domain knowledge available in SurgicBERTa. In surgery, a Pfannenstiel incision is a type of surgical incision that allows access to the abdomen (Fig. 5). The following test investigates whether pre-trained language models know this information:

The Pfannenstiel is a type of surgical incision that allows access to the \(\langle mask \rangle \) .

The correct word is abdomen and is retrieved by SurgicBERTa at the 1st position with probability 0.1267 and by RoBERTa at the 5th position with probability 0.0478, after the words brain (0.1969), heart (0.1488), skin (0.0713), and vagina (0.0542).

These qualitative examples show that SurgicBERTa contains a lot of surgical information that could be used, for instance, to enrich and complement that codified in domain ontologies and knowledge bases.

Nevertheless, since the model was fine-tuned on the MLM task on surgical domain texts, SurgicBERTa could also suffer from the problems that models generated in this way typically have. Above all, we underline the frequent risk of introducing bias into the models, which in the case of a surgical model could mean making word predictions that always assume a standard human anatomy, ignoring all possible particular cases. Also, SurgicBERTa was obtained by specializing RoBERTa to the surgical case, so some of the known biases of the latter are likely to be replicated in SurgicBERTa as well. All of these problems can be reduced by choosing better training materials or adapting de-biasing techniques to the domain. Furthermore, the relevance of the returned words could be low in domains not seen (enough) during training: using reinforcement learning from human feedback techniques [22] could help to reduce these problems.

5 Conclusions

This paper proposed SurgicBERTa, a pre-trained language model fine-tuned to capture surgical language and knowledge, i.e., the vocabulary and expertise provided in surgical books and academic papers.

The building process has been described, and the model has been evaluated both intrinsically, by considering perplexity, accuracy, and evaluation loss on the MLM task, and extrinsically, by considering several downstream tasks, namely (i) procedural sentence detection, (ii) procedural knowledge extraction, (iii) ontological information discovery, and (iv) surgical terminology acquisition. All the results confirm that SurgicBERTa deals with surgical language and knowledge more adequately than RoBERTa, a language model targeting general-domain English. Moreover, the potential of SurgicBERTa has been investigated qualitatively by showing several examples of the surgical domain knowledge available in the model, which could be used to complement other knowledge sources, e.g., state-of-the-art surgical knowledge bases. As future work, we will enrich SurgicBERTa by continuously training it on a larger surgical dataset and by extending it to a multilingual scenario.