
Background

This section will cover some of the standard terminology, techniques, and resources that are essential to a variety of NLP tasks. A collection of terms that will appear frequently throughout will be defined and examples given where appropriate. Then, pre-processing techniques will be discussed followed by feature extraction. Finally, key biomedical NLP resources will be discussed.

Basic NLP Terminology

In general, NLP attempts to understand what the contents of a corpus are about. A corpus is a collection of text documents and often consists mostly of unstructured data (data without a formal structure). In the biomedical domain, a corpus may be a collection of physicians’ notes or reports (e.g. radiology or pathology reports) from an Electronic Health Record (EHR), journal abstracts or full-text articles in PubMed, etc. Large medical centers (academic or otherwise) naturally accumulate large volumes of unstructured text to which NLP techniques can be applied to extract relevant clinical information, whether to better understand the patient population or to conduct observational studies.

Often a document refers to a single body of text, e.g. a single physician’s note from a patient visit, an entire journal article, an individual tweet, etc. A document is composed of words, which may or may not be further decomposed into smaller strings of characters called tokens. A word is often treated as the smallest semantic unit of a text, while a token may be a whole word, part of a word, or punctuation (e.g. “hasn’t” is a word that could be decomposed into the tokens “hasn” and “t”).

Standard Pre-processing

There are some instances where it is useful to take raw text and use it in NLP tasks. However, it is often more useful to preprocess the text to some degree. Preprocessing text is applying techniques to prepare the text for ingestion in computational systems. There are a myriad of techniques to preprocess text that can be used together to prepare it for use. The ones that follow are some of the most common.

Preprocessing is often required for many NLP tasks but the specific preprocessing techniques will vary by desired outcome. This includes any potential feature extraction that may be used.

Cleaning. Cleaning the text refers to removing unwanted characters or tokens from text and is often the first step in the preprocessing pipeline. Often, it is advantageous to remove characters that do not belong to a desired character set (e.g. UTF-8, ASCII, etc.) or language (English, Mandarin, Spanish, etc.), as many tools and programs have limitations on what type of text they can accept. Other times, it may be desirable to replace characters with a conventional transliteration (e.g. ß → ss) or with unaccented characters (e.g. ü → u). Depending on the situation, it may also be useful to remove some non-alphanumeric characters (e.g. &, #, *) from text.
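The cleaning steps described above can be sketched with the Python standard library. This is a minimal illustration, not a recommended pipeline: the function name clean_text and the choice to keep only ASCII letters, digits, and whitespace are assumptions made for the example.

```python
import unicodedata


def clean_text(text: str) -> str:
    """Illustrative cleaning: transliterate, strip accents, drop symbols."""
    # Replace a character that has a conventional multi-letter spelling.
    text = text.replace("ß", "ss")
    # Decompose accented characters (e.g. "ü" becomes "u" plus a combining
    # mark), then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumed policy: keep ASCII letters, digits, and whitespace only;
    # everything else is replaced with a space.
    kept = "".join(
        ch if ch.isascii() and (ch.isalnum() or ch.isspace()) else " "
        for ch in stripped
    )
    # Collapse runs of whitespace into single spaces.
    return " ".join(kept.split())


cleaned = clean_text("Müller & Groß #1")  # "Muller Gross 1"
```

Whether symbols such as hyphens should be dropped, kept, or replaced depends on the downstream tokenizer, so the character policy here would normally be adjusted per task.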

Once the documents are all using a shared character set, the next step is often to remove stopwords. A stopword is a frequently occurring word that carries little semantic value (e.g. “the”, “a”, “is”, “of”, etc.). It is well established that word frequencies in languages follow a Zipfian distribution: a small number of high-occurrence words make up the majority of most text. In English, these high-occurrence words are typically articles, some prepositions, pronouns, and variations of “to be” (e.g. “is”, “was”, etc.). Removing these from the text can significantly reduce the number of tokens that models need to process, which can decrease training time and may improve performance. Removing a large number of common words may seem counterintuitive, as in most machine learning problems it is desirable to keep common data points because they often indicate a trend or pattern of some sort. In the case of language, however, low semantic value translates to low information contribution to the model. These words are analogous to noise, in some sense.
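Stopword removal can be sketched as a simple filter. The stopword set below is a tiny illustrative list chosen for this example; in practice a curated list from an NLP library would usually be used.

```python
# Illustrative stopword list only; real lists are much longer.
STOPWORDS = {"the", "a", "an", "is", "was", "of", "to", "and", "in"}


def remove_stopwords(tokens):
    """Drop tokens whose lowercase form is in the stopword set."""
    return [tok for tok in tokens if tok.lower() not in STOPWORDS]


tokens = "The patient was given a dose of aspirin".split()
filtered = remove_stopwords(tokens)  # ["patient", "given", "dose", "aspirin"]
```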

Tokenization. The tokenization of text refers to decomposing a document into smaller computational pieces. This can be accomplished by one of many techniques. One of the most standard techniques for English is to perform simple whitespace tokenization where a string of text is divided based on whitespaces (e.g. spaces, tabs, new lines, etc.). This often produces tokens that are simply words, as the convention in English is typically to separate words with spaces. In the cases of words joined by a hyphen, how these are treated depends, in part, on how hyphens were handled in the preceding cleaning step.
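A minimal sketch of whitespace tokenization follows. The split_hyphens flag is a hypothetical option added here to show how the treatment of hyphens changes the resulting tokens.

```python
def whitespace_tokenize(text, split_hyphens=False):
    """Split text on spaces, tabs, and new lines."""
    if split_hyphens:
        # Assumed policy: treat hyphens as word separators.
        text = text.replace("-", " ")
    # str.split() with no arguments splits on any run of whitespace.
    return text.split()


sentence = "Follow-up visit\tscheduled\nnext week"
kept = whitespace_tokenize(sentence)
split = whitespace_tokenize(sentence, split_hyphens=True)
```

Here kept retains “Follow-up” as a single token, while split produces “Follow” and “up” as separate tokens.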

Other methods for tokenization exist, such as WordPiece, which is an example of a subword tokenization method. That is, it splits a word into smaller sub-word components to help avoid potential Out-of-Vocabulary (OoV) issues in downstream tasks, such as vector embedding.
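The way a trained subword vocabulary is applied to a word can be sketched with greedy longest-match-first splitting, the strategy commonly used with WordPiece vocabularies. The vocabulary below is a hypothetical toy set, and the “##” prefix marking word-internal pieces follows the convention used by BERT-style tokenizers; learning the vocabulary itself is not shown.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedily split a word into the longest pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # marks a word-internal piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece matched: the whole word maps to the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary for illustration.
TOY_VOCAB = {"cardio", "##my", "##opathy", "##o", "nausea"}
pieces = wordpiece_tokenize("cardiomyopathy", TOY_VOCAB)
# ["cardio", "##my", "##opathy"]
```

A word that cannot be covered by any pieces in the vocabulary falls back to a single unknown token, which is why real vocabularies usually include individual characters.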

Stemming & Lemmatization

It is very often the case that a single word will appear in a single document in various grammatical forms (e.g. “running”, “ran”, etc.). These words share a base meaning, with some inflection added for grammatical purposes. To reduce the number of distinct tokens or words in text, it may be useful to apply stemming or lemmatization. Both processes attempt to perform the same task, but in rather different ways.

Stemming works by simply dropping the ends of words in hopes that the base term is recovered. This works well for things that have simple endings such as “ends”, “ending”, “ended”, etc. All of these are reduced to simply “end” by removing the various strings at the end of the base. This method fails when special spelling rules change the base term in some way. For example, “carry” and “carries”, while having the same base, do not reduce to the same word by way of simply removing the ending characters due to English spelling conventions.

This is where lemmatization is useful. Lemmatization attempts to produce the base term but considers a vocabulary and morphological analysis of the words. In this way, lemmatization may return “carry” if given “carries” or “see” if given “saw”.
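The contrast between the two approaches can be sketched as follows. Both functions are deliberately naive: the suffix list and the lemma lookup table are hypothetical, and real stemmers and lemmatizers use far richer rules, vocabularies, and part-of-speech information.

```python
# Suffixes are checked in this order; illustrative only.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Toy stand-in for the vocabulary a lemmatizer would consult.
TOY_LEMMAS = {"carries": "carry", "saw": "see", "ran": "run"}


def toy_lemmatize(word):
    """Look the word up in the toy vocabulary; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)


stems = [naive_stem(w) for w in ("ends", "ending", "ended")]  # all "end"
mismatch = (naive_stem("carry"), naive_stem("carries"))  # ("carry", "carri")
lemma = toy_lemmatize("carries")  # "carry"
```

The stemmer recovers “end” from all three simple forms but yields two different strings for “carry” and “carries”, whereas the lookup-based lemmatizer maps “carries” back to “carry”.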

Feature Extraction

Text contains a myriad of features that can be extracted for use in downstream tasks. These can be at a variety of levels from individual words, groups of words, entire sentences, parts of speech, term frequency, etc. Some key feature extraction methods will be discussed in the following sections. These can be used individually or in tandem and the discussion below is by no means exhaustive.

N-grams. An n-gram is simply a sequence of n consecutive parts. These parts can be at the level of words, tokens, or even individual characters. When n > 1, n-grams are usually extracted by first padding the beginning and end of the sequence with empty items to accommodate the first and last parts. For example, “She ran home” can be represented as:

unigrams

{“she”, “ran”, “home”}

bigrams

{(“”, “she”), (“she”, “ran”), (“ran”, “home”), (“home”, “”)}

trigrams (character level)

{“  s”, “ sh”, “she”, “he ”, “e r”, “ ra”, “ran”, “an ”, “n h”, “ ho”, “hom”, “ome”, “me ”, “e  ”}

N-grams can be collected after cleaning or from raw (unprocessed) text depending on the use case. Stemming or lemmatization may also be performed before n-grams are extracted from text.
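The padded extraction used in the “She ran home” example can be sketched as below. The helper name ngrams and its pad parameter are choices made for this illustration.

```python
def ngrams(items, n, pad=""):
    """Return all n-grams of items, padding both ends with n - 1 pad items."""
    padded = [pad] * (n - 1) + list(items) + [pad] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


words = "she ran home".split()
unigrams = ngrams(words, 1)
bigrams = ngrams(words, 2)
# Character-level trigrams, using a space as the padding item.
char_trigrams = ["".join(g) for g in ngrams("she ran home", 3, pad=" ")]
```

With n = 1 no padding is added, so the unigrams are just the words themselves; for the character trigrams, the first and last items each contain two padding spaces.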

The collection of all unique n-grams is typically compiled into a table, with each n-gram assigned an index value for future use. Depending on the task and desired outcome, character level n-grams may provide more generalized morphological models than token level n-grams but at the cost of lost semantic meaning. Conversely, token level n-grams may provide more generalized semantic models but reduced morphological representation.

TF-IDF. While n-grams may show what tokens are present in a document, they do not, on their own, convey information regarding the relative importance of a token in a given document. This is where term frequency-inverse document frequency (TF-IDF) can be useful. TF-IDF is a statistic that reflects the importance of a token to a document relative to the rest of the corpus: it grows with how often the token appears in the document and shrinks with the number of documents in the corpus that contain the token.
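One common formulation of TF-IDF can be sketched as follows. Several variants exist (different normalizations and smoothing terms); the version here, with term frequency as a within-document proportion and inverse document frequency as log(N / df), is an assumption chosen for simplicity.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {token: score} dict per document in a tokenized corpus."""
    n_docs = len(corpus)
    # Document frequency: number of documents containing each token.
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    scores = []
    for doc in corpus:
        counts = Counter(doc)
        scores.append({
            tok: (count / len(doc)) * math.log(n_docs / df[tok])
            for tok, count in counts.items()
        })
    return scores


corpus = [
    ["patient", "reports", "nausea"],
    ["patient", "reports", "headache"],
    ["patient", "denies", "nausea"],
]
scores = tf_idf(corpus)
```

In this toy corpus “patient” appears in every document and therefore scores zero, while “headache”, which appears in only one document, receives the highest score in the second document.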

Embeddings. Due to the large number of text features that may exist in a corpus, the features may often be sparse inputs that can make downstream machine learning difficult. As such, a common technique to reduce the dimensionality and sparsity of features is to embed the features in a Euclidean space. Three primary techniques exist to accomplish this embedding: an embedding layer, Word2Vec [1], and Global Vectors (GloVe) [2].

An embedding layer is a layer that is learned jointly with a neural network as the first layer in the network model. Because it is, often, trained from scratch, the embedded representation is going to be highly corpus and task specific. The drawback to this approach is that training is subject to the usual shortcomings of training neural networks and is also more data intensive than other techniques.

Word2Vec is a statistical model for learning embeddings in a more efficient manner. There are two primary approaches to training a Word2Vec model: continuous bag-of-words (CBOW) and continuous skip-gram. A CBOW model tries to predict a word given the context around the word, and the continuous skip-gram model tries to predict the context of a word given the word. The core idea for this technique is that a word’s meaning can be learned from the words that occur around it. This technique also tends to be faster and more efficient than training embedding layers.
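The difference between the two training approaches is easiest to see in the training examples each one generates. The sketch below only builds those examples; it does not train a model, and the function names and window size are assumptions for illustration.

```python
def cbow_examples(tokens, window=1):
    """Pair each target word with the list of words in its context window."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((context, target))  # predict target from context
    return examples


def skipgram_examples(tokens, window=1):
    """Pair each target word with each individual context word."""
    return [
        (target, ctx)  # predict a context word from the target
        for context, target in cbow_examples(tokens, window)
        for ctx in context
    ]


tokens = ["patient", "reports", "mild", "nausea"]
cbow = cbow_examples(tokens)
skipgram = skipgram_examples(tokens)
```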

The Global Vectors (GloVe) technique is closely related to Word2Vec. It combines a global word-word co-occurrence matrix with the local-context idea behind methods such as Word2Vec to learn efficient models that scale well as the corpus size increases. Notably, it leverages the statistical information contained in the co-occurrence matrix while training only on its non-zero entries. GloVe can obtain good results even on small corpora and, depending on the task, may produce more informative embeddings.
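The word-word co-occurrence statistics that GloVe starts from can be sketched as below. Storing counts in a dictionary keeps only the non-zero entries. This is a simplified illustration: it uses plain counts within a fixed window, which is an assumption of this sketch, and it does not fit any embedding.

```python
from collections import Counter


def cooccurrence_counts(docs, window=2):
    """Count how often each (word, context word) pair occurs within a window."""
    counts = Counter()
    for tokens in docs:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


docs = [["heart", "attack", "risk"], ["heart", "attack"]]
counts = cooccurrence_counts(docs, window=1)
```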

Best Practice 14.1

Always test multiple preprocessing pipelines to find what works best for the desired goals.

Biomedical Resources for Clinical NLP

There are many resources specialized to the biomedical domain for NLP. Recent years have seen an explosion in databases, specially trained models, specialized tools and packages, various medical knowledge graphs, and much more. Below, we will focus on two resources that have several important tools that use them as a foundational component.

UMLS. The Unified Medical Language System (UMLS) [3] is a critical resource for many clinical NLP tasks. Ambiguity is present in all languages, and biomedical language is no exception. The medical realm is full of various terms that represent the same concept (e.g., “heart attack” and “myocardial infarction”). There is also a natural hierarchy that arises in biomedical terminology such as “transient ischemic attack” which is a type of “ischemic stroke” which is a type of “stroke”. All of these relationships contain valuable information that can aid in a variety of NLP tasks, and this is where the UMLS comes in.

The UMLS is a collection of files containing a wealth of information, organized into three main components: the Metathesaurus, the Semantic Network, and the SPECIALIST Lexicon and Lexical Tools. The Metathesaurus is the largest component and contains the concepts, their semantic types, and their identifiers, such as the concept unique identifiers (CUIs). The Semantic Network contains information about the semantic relationships between concepts. The SPECIALIST Lexicon contains syntactic, morphological, and orthographic information about terms in the UMLS as well as more common English terms.

SemMedDB. The Semantic MEDLINE Database (SemMedDB) [4] is a repository of semantic predications extracted by applying a tool called SemRep to all PubMed citations (over 29 million citations). This results in over 96 million semantic predications, which are triples of the form subject-predicate-object. SemRep will be further discussed in a later section. An example of a semantic predication is:

“Effects of Asian sand dust, Arizona sand dust, amorphous silica and aluminum oxide on allergic inflammation in the murine lung.”

  • Subject: C0002374 - Alumina

  • Predicate: EFFECTS

  • Object: C0021375 - Inflammation, allergic

  • Predication: Alumina EFFECTS Inflammation, Allergic

SemMedDB contains tables of predications, concept mentions, mappings of mentions and predications to source sentences, as well as the source sentences themselves. The database can be readily loaded into a SQL database for querying in various applications. Because predications are subject-predicate-object triples, the SemMedDB predications table can also be loaded into a graph database to leverage the relationships between concepts.

Clinical NLP Tasks

While many of the classic NLP tasks also exist in the clinical domain, they are often applied in more specific ways to clinical problems. Common high-level tasks include Named Entity Recognition (NER), for the purposes of identifying medications, procedures, etc., and Relation Extraction (RelEx), for example to identify why a treatment or course of action was decided on. Both of these are examples of information extraction (IE) tasks, where the goal is to pull desired information from clinical free text. Some other tasks include text classification, text generation, and question answering. These can serve any number of purposes, but often are not instances of IE [5].

Common NLP tasks that are useful in a clinical setting include NER and RelEx as techniques of Information Extraction, as well as Concept Normalization, Text Classification, Language Generation, and Question Answering. These can be used as means to an end (e.g. NER can extract entity mentions from patient notes, the mention counts can be used to generate TF-IDF vectors, and those vectors can be used in various machine learning models to predict outcomes) or as ends in themselves (e.g. RelEx to find whether a patient had an adverse reaction to a particular substance).

Named Entity Recognition

Named Entity Recognition (NER) is the task of locating and labeling spans of text that refer to specific terms or concepts. Like part-of-speech tagging, it is commonly framed as assigning a label to each token in the text. In a clinical setting, this is often used to identify drugs or dietary supplements, procedures, symptoms, conditions, or other desirable information. A model that can identify these terms can be used for a variety of downstream tasks, such as recognizing a particular dietary supplement or an adverse event from clinical records [6].

A term or concept may consist of a single word or several. Using the example of drug recognition, the sentence “She has been taking aspirin for headaches” contains the drug “aspirin” that we would want identified. Alternatively, given the sentence “The patient reports using black cohosh daily”, we would want “black cohosh” to be extracted.

One of the most common forms of labeling for NER is BIO labeling. In this convention, “B-x” means beginning and is used to signify the first word of a named entity of type x in some text. In the examples above, “aspirin” would be labeled “B-drug”, “headaches” might be “B-symptom”, and “black” from “black cohosh” would be labeled “B-supplement”. “I-x” means inside and is, similarly, used to signify a non-beginning term of a multi-word concept. Continuing from the example above, “cohosh” would be labeled “I-supplement”, giving “black cohosh” the labels {“B-supplement”, “I-supplement”}. “O” is used to represent any term outside of the defined labels. For example, “daily” from above is a frequency, but if we are only considering drugs, symptoms, and supplements, “daily” would be labeled “O” since we do not have a label defined for frequencies.
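The BIO convention can be sketched by converting entity spans into per-token labels. The span format used here, (start index, exclusive end index, entity type), is an assumption made for this example.

```python
def bio_labels(tokens, entities):
    """Convert (start, end_exclusive, type) spans into BIO labels."""
    labels = ["O"] * len(tokens)
    for start, end, etype in entities:
        labels[start] = f"B-{etype}"       # first token of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"       # remaining tokens of the entity
    return labels


tokens = "The patient reports using black cohosh daily".split()
labels = bio_labels(tokens, [(4, 6, "supplement")])
# ["O", "O", "O", "O", "B-supplement", "I-supplement", "O"]
```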

Concept Normalization

Concept normalization is the task of mapping multiple terms that have the same meaning to a common, standardized term. The UMLS attempts to do this by way of the Metathesaurus, for example, and often excels on text written by medical professionals or appearing in scientific publications, due to the nature of the vocabulary used by such individuals and in such contexts. However, this problem can be more complicated when the corpus includes text from individuals who are not medical professionals or who write without a sophisticated medical vocabulary. This is prevalent when examining mainstream news articles or social media posts. An example of this would be mapping the phrase “head spinning” to “dizziness” or “feeling like I need to throw up” to “nausea”. For tasks where the goal is to determine the prevalence of symptoms in tweets, for example, or what sort of side effects people experience when using dietary supplements, it is important that concepts are normalized unless specialized models are used [7].
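In its simplest form, concept normalization can be sketched as a phrase lookup. The mapping below is a hypothetical hand-written table built from the examples in the text; practical systems rely on resources such as the UMLS or on trained models rather than exact substring matching.

```python
# Hypothetical lay-phrase to standard-term table for illustration.
LAY_TO_STANDARD = {
    "head spinning": "dizziness",
    "feeling like i need to throw up": "nausea",
    "heart attack": "myocardial infarction",
}


def normalize_concepts(text):
    """Return the sorted standard terms whose lay phrases occur in text."""
    lowered = text.lower()
    return sorted(
        {std for phrase, std in LAY_TO_STANDARD.items() if phrase in lowered}
    )


post = "Head spinning all day and feeling like I need to throw up"
concepts = normalize_concepts(post)  # ["dizziness", "nausea"]
```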

Relation Extraction

Relation Extraction (RelEx) is the task of extracting terms and the interaction or relationship between them. In a clinical setting, this may be trying to determine what drug caused which side-effect in a patient. The relationships extracted can help identify incidence rates for side-effects, determine if a particular patient subpopulation is more likely to experience certain effects than another, etc. The results will often take the form of a subject-relation-object triple, in which the relation between the two entities is one of multiple pre-defined relationship types. This type of result is essentially what is contained in SemMedDB. The tool used to generate SemMedDB, SemRep, is a rule-based tool that performs relation extraction.

As an example, take the sentence “the patient reported nausea after his chemotherapy treatment last week”. One relationship that might be extracted is “chemotherapy-causes-nausea”. Another might be “patient-experiences-nausea”. It should be noted that prior to predicting the relationship between two concepts or terms, those terms first need to be properly identified. In other words, NER is necessary to perform RelEx.

Text Classification

Text classification is the act of classifying a collection of tokens with some label, such as determining if a movie review is positive or negative. In a clinical setting, this might be determining if a treatment was successful, if the patient has started using new medications, etc. Rather than operating on a token level, like most of the above tasks, this tends to consider a chunk of text in its totality. This might be n-grams, sentences, or entire documents depending on the particular task. A discrete collection of labels is often defined based on the goal for the task at hand.

For example, if we want to determine if a patient is stopping the use of a particular medication [8], text classification may be an appropriate task. If we want to classify a string of text as “started”, “continuing”, “discontinuing”, or “unknown” with regard to a patient’s use of a medication or supplement and are given the sentence, “the patient reports that he has not continued taking vitamin D supplements”, the entire sentence would be labeled as “discontinuing”. This can help in understanding trends in patient treatment compliance or the effects that starting or stopping a substance might have on health outcomes.

Natural Language Generation

Natural Language Generation (NLG) is the task of trying to produce text that has the appearance of something a human might produce. Famously, GPT-3 by OpenAI is known to be able to produce prose that has a high degree of similarity to prose written by poets, writers, reporters, and average internet users. In a medical setting, NLG may be used to help generate synthetic clinical notes without leaking any protected health information of patients, and thus can be used to develop clinical NLP systems [9]. NLG can also be used to generate answers to users’ questions [10].

Question Answering

Question Answering (QA) is the task of producing an answer to a query, in its simplest form by selecting from multiple candidate answers. Consumers increasingly turn to the internet or smart devices to find answers to their questions regarding medical conditions or medication usage. A QA system must understand the question, retrieve relevant information, and then generate an answer automatically. The first component is natural language understanding, since the task at hand is to understand the question being asked, as well as any available answer options, in order to correctly determine which option answers the question. The second component is information retrieval, which finds information in knowledge bases or on the internet that may contain the content needed to respond to the question. The final component is NLG, which is discussed above. CHiQA is an experimental AI-based QA system that is learning how to answer health-related questions for patients using reliable sources [11].

Symbolic Based Biomedical NLP

Symbolic NLP uses human-readable symbols and logic to create rules for a system. This is a subset of symbolic AI (also called Good Old-Fashioned AI). This process involves the explicit codification of human knowledge, behavior, and expertise into computer programs. While these systems may, at face value, be easier to understand, they are often very difficult to construct and require considerable manual effort from domain experts to produce quality systems.

In the biomedical domain, there are several key tools that fall under the domain of symbolic NLP. Namely, MetaMap, SemRep, and cTAKES, among others, are systems developed by groups of individuals with expertise in biomedical or clinical fields.

MetaMap [12] is a tool that performs NER and Concept Normalization of biomedical text. When given a body of text, MetaMap performs a number of preprocessing steps before attempting to identify medical terms and mapping them to standardized concepts in the UMLS. As such, MetaMap relies on the UMLS to function.

SemRep [13] is a tool that performs RelEx on biomedical text. SemRep is a rule-based system that uses MetaMap and the UMLS to handle NER before determining the relationship between the extracted concepts. Returning to the example:

“Effects of Asian sand dust, Arizona sand dust, amorphous silica and aluminum oxide on allergic inflammation in the murine lung.”

  • Subject: C0002374 - Alumina

  • Predicate: EFFECTS

  • Object: C0021375 - Inflammation, allergic

  • Predication: Alumina EFFECTS Inflammation, Allergic

In this example, the subject and object are identified by MetaMap in the UMLS Metathesaurus. SemRep then applies a set of rules to these identified concepts, with restrictions determined by the Semantic Network, to determine the relationship, or predicate, between the concepts. The result is the semantic predication Alumina EFFECTS Inflammation, Allergic.

cTAKES (clinical Text Analysis and Knowledge Extraction System) [14] is a system that extracts a variety of information from provided clinical documents. It performs NER, RelEx, negation detection, part-of-speech tagging, normalization, sentence boundary detection, tokenization, and more. This tool also makes use of the UMLS for concept normalization.

There are many useful, freely available biomedical NLP resources. The UMLS, MetaMap, SemRep, and cTAKES are good as tools in their own right or for use as baselines to compare newer methods against.

Pitfall 14.1

Implementing techniques from scratch when not necessary. It is unlikely that a hand-coded pipeline will be as fast as spaCy or cTAKES when using multiple preprocessing steps.

Machine Learning for NLP

Due to the complexity of human language and the high-dimensional feature space, it is often advantageous to leverage machine learning models. These models often have a larger representational capacity than rule-based systems and often require less specialized expertise to develop. Most of the models that are used in non-NLP tasks can be readily used in NLP tasks. However, it will almost always be necessary to use some sort of a reduced input.

Consider a corpus from which 15,000 unique unigrams (1-grams) were extracted during preprocessing. One could use a feature vector that simply has a 1 at the position of the desired unigram (a one-hot encoding), but this results in highly sparse, very high-dimensional input on which it can be difficult, if not impossible, to train a decent model. As such, generating TF-IDF vectors or using vector embeddings is often essential to successfully training machine learning models.
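The sparsity of this representation can be made concrete with a short sketch. The vocabulary entries are placeholders generated for the example, not real unigrams.

```python
# Placeholder vocabulary of 15,000 unigrams for illustration.
vocab = [f"unigram_{i}" for i in range(15_000)]
index = {term: i for i, term in enumerate(vocab)}


def one_hot(term):
    """Return a vector with a single 1 at the term's index."""
    vec = [0] * len(index)
    vec[index[term]] = 1
    return vec


vec = one_hot("unigram_42")
nonzero_fraction = sum(vec) / len(vec)  # 1 / 15000
```

Each token occupies a 15,000-dimensional vector in which only one entry carries information, which is the sparsity that embeddings are meant to address.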

Best Practice 14.2

Document rationale for design decisions. Provide performance metrics (speed and measures of accuracy) where possible.

Pitfall 14.2

Failing to document the results of all experiments can lead to frustration and repeating experiments. Include logging functions to document model parameters and performance metrics to avoid this.

Commonly Used ML Models in NLP

Supervised machine learning is the use of labeled training data to train models. Supervised models are often used in NLP for a variety of tasks. Things such as NER, RelEx, and text classification can be accomplished using models trained using labeled data. In the case of NER, for example, a sentence will have a corresponding BIO label for each token in the sentence and a model will learn to assign a label to each token.

The models typically used for supervised learning on non-NLP tasks can often be used for NLP tasks as well. Logistic regression, random forest, bagging, naive Bayes, etc. can all be used for classification and regression tasks in NLP.

Weakly supervised learning is a special case of supervised learning where the labels for the training set are generated using a simple heuristic or set of rules. The goal is to not generate perfect labels, or even good labels, but rather enough labels that sufficiently capture a general relationship between input text and some prediction.

For example, due to the absence of a comprehensive dietary supplement repository, or similarly specialized lexica, using a simple dictionary lookup to generate labels may be sufficient to generate a sizable labeled dataset for training. This also has the advantage of not requiring time consuming manual labeling by domain experts.
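A dictionary lookup of this kind can be sketched as a weak labeling function that emits BIO labels. The lexicon entries and the function name are hypothetical; the labels it produces are noisy by design and are meant as training signal rather than ground truth.

```python
# Hypothetical supplement lexicon; each entry is a tuple of lowercase tokens.
SUPPLEMENT_LEXICON = {("black", "cohosh"), ("vitamin", "d"), ("ginseng",)}


def weak_bio_labels(tokens, lexicon=SUPPLEMENT_LEXICON, etype="supplement"):
    """Label tokens by exact lexicon match, preferring the longest entry."""
    lowered = [tok.lower() for tok in tokens]
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        match_len = 0
        for entry in lexicon:
            n = len(entry)
            if n > match_len and tuple(lowered[i:i + n]) == entry:
                match_len = n
        if match_len:
            labels[i] = f"B-{etype}"
            for j in range(i + 1, i + match_len):
                labels[j] = f"I-{etype}"
            i += match_len  # skip past the matched entity
        else:
            i += 1
    return labels


tokens = "He has not continued taking vitamin D supplements".split()
weak_labels = weak_bio_labels(tokens)
```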

Unsupervised machine learning is the use of unlabeled training data to train models. Often, we think of clustering algorithms when unsupervised learning is discussed. While there can be a place for clustering, there are other specific instances of unsupervised learning that are more useful for language.

One such example is token embedding, where words are embedded into low-dimensional feature spaces. The key idea behind learning embedded representations of tokens is that semantically similar tokens are embedded more closely together in space than non-similar tokens. This also allows for an arithmetic of sorts to be learned on the embedded representation. The classical example of this is “king - man ~ queen - woman”, and, in some instances, this approximation can be true for some embeddings.
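The arithmetic can be illustrated with hand-crafted two-dimensional vectors. These values are invented for the example (one axis loosely standing for royalty, the other for gender) and are not learned embeddings; with real embeddings the relationship holds only approximately, if at all.

```python
import math

# Hand-crafted toy vectors, NOT learned embeddings.
TOY = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
}


def subtract(a, b):
    """Element-wise difference of two vectors."""
    return [x - y for x, y in zip(a, b)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


king_minus_man = subtract(TOY["king"], TOY["man"])
queen_minus_woman = subtract(TOY["queen"], TOY["woman"])
similarity = cosine(king_minus_man, queen_minus_woman)
```

In this contrived case the two difference vectors are identical, so their cosine similarity is 1.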

Embedding can be done at the level of entire words or at the level of sub-word tokens. In other words, character n-grams can be embedded rather than complete words. This can help avoid one key issue that word-level embeddings can encounter: Out-of-Vocabulary (OoV) words. This occurs when one tries to retrieve the embedding for a word that was not present in the training corpus. Since that word was not seen, there is no vector representation of it in the embedding. This results in very rigid embeddings that require OoV words to be handled during preprocessing to avoid errors.

By embedding n-character grams, words that were not present may be decomposed into strings of characters that are present in the embedding. It is easy to reason that the space of all combinations of, say, 3-character grams is smaller than the space of words. This of course only holds up to a certain n before the space of character combinations exceeds the size of the space of unique words. As mentioned earlier, this is not without its tradeoffs but does come with the advantage of OoV not being as much of a potential issue provided the training set is sufficiently large.
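The decomposition of an unseen word into known character n-grams can be sketched as follows. The “<” and “>” word-boundary markers follow a convention used by some subword embedding methods and are an assumption here, as are the toy training words.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word wrapped in boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


# Toy "training vocabulary" for illustration.
train_words = ["nausea", "nauseous"]
known = {g for w in train_words for g in char_ngrams(w)}

oov_word = "nauseated"  # not in the training words
grams = char_ngrams(oov_word)
covered = [g for g in grams if g in known]
```

Although “nauseated” was never seen, five of its nine trigrams were, so a representation could still be composed from the vectors of those known pieces.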

Best Practice 14.3

Use a grid search over a reasonable set of hyper-parameters to produce the best model possible. This in combination with k-fold cross validation can increase the likelihood that the resulting model will be the strongest and most likely to generalize to unseen data.
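The combination of grid search and k-fold cross validation can be sketched with the standard library. The evaluate callback, the parameter names, and the toy scoring function are all hypothetical; in practice evaluate would train a model on the training indices and score it on the held-out indices.

```python
from itertools import product


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_samples) if i not in held]
        yield train, held_out


def grid_search(param_grid, evaluate, n_samples, k=5):
    """Return the parameter combination with the best mean fold score."""
    best_params, best_score = None, float("-inf")
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [
            evaluate(params, train, test)
            for train, test in k_fold_indices(n_samples, k)
        ]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def toy_evaluate(params, train_idx, test_idx):
    # Hypothetical stand-in for "train on train_idx, score on test_idx".
    return -abs(params["c"] - 1.0) + 0.1 * params["ngram_max"]


grid = {"c": [0.1, 1.0, 10.0], "ngram_max": [1, 2]}
best_params, best_score = grid_search(grid, toy_evaluate, n_samples=10, k=5)
```

Libraries such as scikit-learn provide tested implementations of both pieces, which would normally be preferred over a hand-written loop.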

Deep Learning in NLP

While traditional machine learning models may successfully perform some NLP tasks to a limited degree, there is an inherent limitation to how far these models can go. The complexity of language is better captured with highly non-linear models that have large representational capacity. This is where deep learning tends to excel. Deep learning is the use of neural network models as the model in machine learning tasks.

A neural network, in its simplest case, is a sequence of layers of matrix multiplication and non-linear activation functions with the output from one layer being input into the next. This iterative process allows neural networks to learn highly complex, high dimensional representations of the input data.
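The layer-by-layer computation can be sketched in plain Python. The weights below are arbitrary values picked for illustration, and no training is shown; the sketch covers only the forward pass.

```python
import math


def dense(inputs, weights, biases, activation):
    """One layer: matrix-vector product, bias, then activation."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Feed the output of each layer into the next."""
    for weights, biases, activation in layers:
        inputs = dense(inputs, weights, biases, activation)
    return inputs


# Arbitrary illustrative weights: a 2-unit tanh layer, then a linear output.
layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, 0.0], math.tanh),
    ([[1.0, -1.0]], [0.0], lambda z: z),
]
output = forward([1.0, 1.0], layers)
```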

Models

A large number of the models that have enjoyed success in NLP tasks are models that learn via error back propagation to update model weights. The trends in research have pursued two main veins: increasing the model capacity or architectural innovations.

Increasing model capacity can be as simple as adding more layers or modules to increase the number of learnable parameters (weights) in a model. Innovations here often take the form of new training techniques, which are necessary to train such large models on massive volumes of data; even so, training can take days or weeks on huge clusters of accelerated compute hardware. Architectural advances are changes in the ways the model can learn from the data. In the case of transformers, the use of multi-head attention mechanisms and encoder-decoder structures with positional encoding resulted in considerable gains over previous state-of-the-art NLP methods.

In some cases, changes in training and architecture together yield significant gains. This was the case with BERT, which used masked-language modeling pre-training and transformer encoder units to produce a model that can be easily fine-tuned for a variety of tasks without needing to fully retrain a huge model from scratch every time a new task is introduced.

Best Practice 14.4

Apply appropriate rigor in analyzing the performance of machine learning models. Permutation tests, covariance analysis, etc. can be invaluable for diagnosing issues before deploying models.

RNN, LSTM, biLSTM

A recurrent neural network (RNN) [15] is a special type of neural network that passes its internal state (the hidden state) forward as input for each computation in a sequence of data, while reusing the same weights at every step. This allows the network to learn from sequence data and makes it useful for NLP purposes. However, the vanilla RNN can be prone to suffering from vanishing and exploding gradients. This is when the gradients that are back propagated are either so small that the weights do not change enough to learn or so large that the weights change drastically and fail to converge.
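The recurrence can be sketched with a scalar hidden state. Using single numbers instead of weight matrices is a simplification for readability, and the weight values are arbitrary; only the forward pass is shown.

```python
import math


def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a scalar vanilla RNN over a sequence and return every hidden state."""
    h = h0
    states = []
    for x in inputs:
        # The same weights are reused at every step; only h changes.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


states = rnn_forward([1.0, 0.0], w_x=1.0, w_h=1.0, b=0.0)
```

The second state depends on the first even though the second input is zero, which is how information from earlier in the sequence is carried forward.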

While exploding gradients can be addressed using techniques such as gradient clipping, both problems can also be mitigated by using specialized RNN architectures. One such architecture is the Long Short-Term Memory (LSTM) [16] network. The LSTM introduces “gates”, namely input, output, and forget gates, which control what information is written to, read from, and retained in the network’s internal memory. This gating, the forget gate in particular, helps keep gradients from vanishing over long sequences.

The LSTM was a significant improvement over the vanilla RNN but can only process sequences in a single direction. To learn the context of a data point in a sequence, it is helpful to consider the points both before and after it, which motivates the bi-directional LSTM (biLSTM). The biLSTM is simply an LSTM layer that processes the sequence in the normal manner and a second LSTM layer that processes the reversed input sequence; the outputs are then combined into a single output. The two LSTMs have different sets of weights and internal states. It is often the case that a biLSTM will learn more quickly than an LSTM, depending on the task. In the case of language, where context is important, a biLSTM should often be preferred to an LSTM or RNN.

Transformers. In 2017, the transformer [17] architecture was introduced, and it has since become the basis for most state-of-the-art and foundational models in NLP. The transformer is a neural sequence transduction model that uses an encoder-decoder structure and self-attention. Its multi-head, scaled dot-product attention mechanism, coupled with the architecture of the encoder and decoder layers, helped set new state-of-the-art results.
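Scaled dot-product attention computes softmax(QK^T / sqrt(d_k)) V. The sketch below implements a single attention head in plain Python. It leaves out the learned projections, masking, and multiple heads of a full transformer layer, and the function names are illustrative assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention over lists of row vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Each output row is a weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs
```

Dividing by the square root of the key dimension keeps the dot products from growing with vector size, which would otherwise push the softmax toward near one-hot weights.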

Deep learning, transformer models in particular, can provide significant advantages over traditional machine learning methods, at some cost in terms of additional expertise and increased compute requirements.

BERT. In 2018, a new architecture and training methodology was proposed that used a stack of bi-directional transformer encoder layers, together with masked-language modeling and next sentence prediction training on massive amounts of data. The resulting model, Bidirectional Encoder Representations from Transformers (BERT) [18], achieved state-of-the-art results on several benchmarks.
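Masked-language modeling hides a fraction of the input tokens and trains the model to recover them. The sketch below shows only the data-preparation step and is deliberately simplified: the published BERT procedure selects about 15% of tokens and sometimes substitutes a random token or leaves the selected token unchanged, and that refinement is omitted here. The function name `mask_tokens` and its parameters are illustrative assumptions.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Return (masked_tokens, labels) for masked-language modeling.

    labels[i] holds the original token where position i was masked and None
    elsewhere, so a training loss would be computed on masked positions only.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels
```

Because the labels come from the text itself, no manual annotation is needed, which is what makes pre-training on very large corpora practical.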

A key advantage of BERT was the ability to pre-train the model extensively and then fine-tune the pre-trained weights on individual NLP tasks in considerably less time. This meant that a single, large model could be trained once over the course of days and then reused across many tasks by fine-tuning for hours or minutes.

For each task, a new prediction layer must be initialized that takes the contextual embedding (the output of the main BERT model) as input. The weights inside BERT can be frozen or adjusted while training the new prediction layer. There has been work to identify good fine-tuning procedures for BERT models, such as a decaying learning rate with a warm-up period and recommended ranges for learning rates, batch sizes, etc.
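A decaying learning rate with a warm-up period can be written as a small function of the training step. The sketch below assumes a linear warm-up followed by a linear decay to zero. The peak rate, warm-up length, and total step count are placeholder values chosen for illustration, not recommendations.

```python
def learning_rate(step, peak_lr=2e-5, warmup_steps=100, total_steps=1000):
    """Linear warm-up to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        # Ramp up from zero so early updates do not disrupt pre-trained weights.
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)
```

In practice the peak rate and warm-up length are themselves hyper-parameters and are usually tuned alongside the batch size.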

Some useful BERT models that have been trained on biomedical text to varying degrees include BioBERT [19], Bio_ClinicalBERT [20], PubMedBERT [21], and BlueBERT [22], to name a few. All of these models can be freely downloaded via Hugging Face for immediate use.

Best Practice 14.5

When using machine learning models, including deep learning models, always use a simple baseline for comparison. This can be as simple as some rules or a linear regression model, or as complex as a “vanilla” BERT model, depending on the task at hand. It may be the case that a random forest will provide adequate performance and be faster at inference than a transformer!
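The simplest possible baseline for a classification task is to always predict the most common label. The sketch below is an illustration only: the helper names `majority_baseline` and `accuracy` are hypothetical, and any model under evaluation would be passed to `accuracy` in the same way as the baseline.

```python
from collections import Counter

def majority_baseline(train_labels):
    """Return a predictor that always outputs the most common training label."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return lambda _example: most_common_label

def accuracy(predict, examples, labels):
    """Fraction of examples for which predict(example) matches the label."""
    correct = sum(predict(x) == y for x, y in zip(examples, labels))
    return correct / len(labels)

baseline = majority_baseline(["neg", "neg", "pos"])
baseline_accuracy = accuracy(
    baseline,
    ["note 1", "note 2", "note 3", "note 4"],
    ["neg", "pos", "neg", "neg"],
)
```

On imbalanced clinical data a baseline like this can score deceptively well, so a more complex model should be judged by how far it improves on this number rather than by its raw accuracy.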

Pitfall 14.3

Jumping straight to the most sophisticated, state-of-the-art model can result in hours spent figuring out how to use a researcher’s GitHub repository, or worse, implementing it from scratch based on a paper when a simpler model may have sufficed. Starting with a simple model can, critically, serve as a proof of concept for a product without the resource overhead that comes with many deep learning models.

Graphs

A graph is a collection of vertices and edges. Vertices, or nodes, represent some object or concept (drugs, proteins, diseases, etc.) and edges are relationships between them. A graph is unique in that it contains topological information that is absent in most other data structures. This topological information can be leveraged for NLP, particularly for medical NLP.

Representing biomedical information with a graph structure can unlock insights via latent topological information that cannot be leveraged with other data structures. While this will not always be applicable, it can be powerful when it is.

Medicine is full of concepts and relationships between them. Drugs treat conditions, diseases affect particular organs, proteins are associated with biological processes. These relationships can be compiled into graph structures for use in tasks such as drug repurposing (a case of link prediction), predicting the type of relationship between nodes (edge classification), and interaction discovery (another case of link prediction).

Due to the special nature of graphs, specialized models called graph neural networks (GNNs) [23] have been developed. These come in a variety of flavors, but tend to leverage a mechanism called message passing between nodes to facilitate learning.
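One round of message passing can be sketched as each node averaging its own feature vector with those of its neighbors. This is a simplified illustration: real GNN layers also apply learned weights and a non-linearity. The function name and the dictionary-based graph representation are assumptions.

```python
def message_passing_step(features, neighbors):
    """One round of mean-aggregation message passing.

    features:  dict mapping node -> feature vector (list of floats)
    neighbors: dict mapping node -> list of adjacent nodes
    Each node's new vector is the mean of its own vector and its neighbors'.
    """
    updated = {}
    for node, own in features.items():
        messages = [own] + [features[n] for n in neighbors.get(node, [])]
        updated[node] = [sum(vals) / len(messages) for vals in zip(*messages)]
    return updated
```

Repeating the step lets information travel further: after k rounds, a node's vector reflects nodes up to k edges away.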

Tasks

Link prediction attempts to determine the potential for two nodes to have some relationship where there currently is none; in other words, it predicts an edge that is missing from the graph. One example is to use a trained link prediction model to determine the most likely connections to a particular condition. The result can be filtered down to a ranked list of drugs that are not currently used to treat that condition. This is an instance of drug repurposing, and it can generate dozens of hypotheses to direct bench research.

Another example of link prediction is to try to predict links between drugs and dietary supplements. Given the lack of published research on drug-supplement interactions, such a task can help uncover potential interactions between a well-understood medication and a less-studied supplement [24].
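A simple, non-neural heuristic for link prediction scores a candidate pair by how much their neighborhoods overlap. The sketch below ranks candidates by the Jaccard similarity of their neighbor sets. It is a swapped-in baseline technique for illustration, not the trained model described above, and the node names in the toy graph are hypothetical.

```python
def rank_candidate_links(adjacency, target):
    """Rank nodes not yet linked to target by Jaccard similarity of neighbors."""
    target_neighbors = adjacency[target]
    scores = []
    for node, node_neighbors in adjacency.items():
        if node == target or node in target_neighbors:
            continue  # skip the target itself and links that already exist
        union = target_neighbors | node_neighbors
        overlap = target_neighbors & node_neighbors
        score = len(overlap) / len(union) if union else 0.0
        scores.append((node, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy graph: a condition and two drugs, connected through shared genes.
toy_graph = {
    "condition_x": {"gene_1", "gene_2"},
    "drug_a": {"gene_1", "gene_2"},
    "drug_b": {"gene_2", "gene_3"},
    "gene_1": {"condition_x", "drug_a"},
    "gene_2": {"condition_x", "drug_a", "drug_b"},
    "gene_3": {"drug_b"},
}
ranked = rank_candidate_links(toy_graph, "condition_x")
```

Filtering `ranked` to nodes of type drug would give the kind of ranked hypothesis list described above. A heuristic like this can also serve as a baseline against which a trained link prediction model is compared.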

Edge classification attempts to determine what type of edge exists between two nodes. In the case of two drugs, for example, a trained edge classification model will attempt to predict the type of interaction between them (e.g. synergistic, antagonistic, etc.).

Graph Embeddings. A knowledge graph can be quite large, and machine learning can be difficult when working directly with the graph structure or one of its matrix representations. A graph embedding is a low-dimensional representation of the nodes of a graph: each node is associated with a learned vector in a low-dimensional vector space. Initial efforts in graph embedding worked in Euclidean or complex space, and more recent efforts have explored hyperbolic space as well [25, 26].

Key Concepts in This Chapter

Preprocessing is often required for many NLP tasks but the specific preprocessing techniques will vary by desired outcome. This includes any potential feature extraction that may be used.

There are many useful, freely available biomedical NLP resources. The UMLS, MetaMap, SemRep, and cTAKES are good as tools in their own right or as baselines to compare newer methods against.

Common NLP tasks that will be useful in a clinical setting include NER and RelEx (both techniques of Information Extraction), Concept Normalization, Text Classification, Language Generation, and Question Answering. These can be used as means to ends (e.g. NER to extract entity mentions from patient notes, whose counts can then be used to generate TF-IDF vectors for machine learning models that predict outcomes) or as ends in themselves (e.g. RelEx to find whether a patient had an adverse reaction to a particular substance).

Deep learning, transformer models in particular, can provide significant advantages over traditional machine learning methods, at some cost in terms of additional expertise and increased compute requirements.

Representing biomedical information with a graph structure can unlock insights via latent topological information that cannot be leveraged with other data structures. While this will not always be applicable, it can be powerful when it is.

Pitfalls in This Chapter

Pitfall 14.1 Implementing techniques from scratch when not necessary. It is unlikely that a hand-coded pipeline will be as fast as spaCy or cTAKES when using multiple preprocessing steps.

Pitfall 14.2 Failing to document the results of all experiments can lead to frustration and repeating experiments. Include logging functions to document model parameters and performance metrics to avoid this.

Pitfall 14.3 Jumping straight to the most sophisticated, state-of-the-art model can result in hours spent figuring out how to use a researcher’s GitHub repository, or worse, implementing it from scratch based on a paper when a simpler model may have sufficed. Starting with a simple model can, critically, serve as a proof of concept for a product without the resource overhead that comes with many deep learning models.

Best Practices in This Chapter

Best Practice 14.1 Always test multiple preprocessing pipelines to find what works best for the desired goals.

Best Practice 14.2 Document rationale for design decisions. Provide performance metrics (speed and measures of accuracy) where possible.

Best Practice 14.3 Use a grid search over a reasonable set of hyper-parameters to produce the best model possible. This in combination with k-fold cross validation can increase the likelihood that the resulting model will be the strongest and most likely to generalize to unseen data.

Best Practice 14.4 Apply appropriate rigor in analyzing the performance of machine learning models. Permutation tests, covariance analysis, etc. can be invaluable for diagnosing issues before deploying models.

Best Practice 14.5 When using machine learning models, including deep learning models, always use a simple baseline for comparison. This can be as simple as some rules or a linear regression model, or as complex as a “vanilla” BERT model, depending on the task at hand. It may be the case that a random forest will provide adequate performance and be faster at inference than a transformer!

Questions and Discussion Topics in This Chapter

  1. We want to try to determine what supplements patients undergoing surgical operations are using and correlate supplement use against outcomes.
    (a) The first step would be to identify supplements in patient notes. What type of task is this?
      (i) What existing tools might be used for this?
    (b) What types of preprocessing might be necessary?
    (c) How might weakly supervised learning be leveraged in the absence of a gold standard dataset?

  2. We want to build a graph using biomedical research papers to investigate potential alternative uses for an existing drug.
    (a) What are some node types that would be of interest? (e.g. drug, gene, etc.)
      (i) What task would this be considered?
      (ii) What would be some advantages of applying concept normalization?
      (iii) What existing tools can help with this?
    (b) What are some edge types that would be of interest? (e.g. affects, treats, etc.)
      (i) What task would this be considered?
      (ii) What existing tools can help with this?
    (c) Assume we do not have annotated data to train a model with. How might a transformer model be leveraged in this case?
    (d) Once the graph is constructed, what task are we now dealing with?
      (i) What would be a good baseline model for this task?
      (ii) How could the predictions of the trained model be evaluated?

  3. Theoretically speaking, why might a feed-forward neural network outperform an SVM on a language-related task? (Consider the capacity of each model.)
    (a) What is the key difference between a feed-forward neural network receiving TF-IDF vectors and an LSTM receiving word embeddings? (Is there additional information contained in the sequential order of specific tokens?)

  4. Explain how the UMLS, MetaMap, SemRep, and SemMedDB are related.