Detection and explanation of anomalies in healthcare data

Samariya, Durgesh; Ma, Jiangang; Aryal, Sunil; Zhao, Xiaohui

doi:10.1007/s13755-023-00221-2

Detection and explanation of anomalies in healthcare data

Research
Open access
Published: 06 April 2023

Volume 11, article number 20, (2023)
Cite this article

Download PDF

You have full access to this open access article

Health Information Science and Systems Aims and scope Submit manuscript

Detection and explanation of anomalies in healthcare data

Download PDF

Durgesh Samariya ORCID: orcid.org/0000-0002-1042-7804¹,
Jiangang Ma¹,
Sunil Aryal² &
…
Xiaohui Zhao³

3373 Accesses
7 Citations
Explore all metrics

Abstract

The growth of databases in the healthcare domain opens multiple doors for machine learning and artificial intelligence technology. Many medical devices are available in the medical field; however, medical errors remain a severe challenge. Different algorithms are developed to identify and solve medical errors, such as detecting anomalous readings, anomalous health conditions of a patient, etc. However, they fail to answer why those entries are considered an anomaly. This research gap leads to an outlying aspect mining problem. The problem of outlying aspect mining aims to discover the set of features (a.k.a subspace) in which the given data point is dramatically different than others. In this paper, we present a framework that detects anomalies in healthcare data and then provides an explanation of anomalies. This paper aims to effectively and efficiently detect anomalies and explain why they are considered anomalies by detecting outlying aspects. First, we re-introduced four anomaly detection techniques and outlying aspect mining algorithms. Then, we evaluate the performance of anomaly detection techniques and choose the best anomaly detection algorithm. Later, we detect the top k anomaly as a query and detect their outlying aspect. Lastly, we evaluate their performance on 16 real-world healthcare datasets. The experimental results show that the latest isolation-based outlying aspect mining measure, SiNNE, has outstanding performance on this task and has promising results.

Mining Outlying Aspects on Healthcare Data

Discovering outlying aspects in large datasets

Article 09 February 2016

A New Dimensionality-Unbiased Score for Efficient and Effective Outlying Aspect Mining

Article Open access 29 April 2022

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Introduction

Despite improvements in healthcare instruments, the presence of medical errors remains a severe challenge [1]. Applying machine learning (ML) and artificial intelligence (AI) algorithms in the healthcare industry helps improve patients’ health more efficiently. According to [2], around 86% of healthcare companies use machine learning and artificial intelligence algorithms. These algorithms help in many ways, such as medical image diagnosis [3, 4], disease detection/classification [5,6,7], medical data analysis [8], medical data classification [9, 10], drug discovery [8], robot surgery [8], detect anomalous reading [11], etc. Recently, researchers have been interested in detecting abnormal activity in the healthcare industry. Anomaly or outlier^{Footnote 1} is defined as a data instance that does not conform with the remainder of that set of data instances. In the healthcare domain, an anomaly is referred to as an unusual health condition or activity of a patient [12, 13]. A vast number of applications have been developed to detect anomalies from medical data [14,15,16,17]. However, no study has been conducted to find out why these points are considered as an anomaly, i.e., on which set of features a data point is dramatically different than others, as far as we know. The problem of detecting such an explanation leads to outlying aspect mining (a.k.a, outlier explanation, outlier interpretation, outlying subspaces detection). Outlying aspect mining aims to identify the set of features where the given point (or a given anomaly) is most inconsistent with the rest of the data.

In many healthcare applications, a medical officer wants to know the most outlying aspects of a specific patient compared to other patients. For example, you are a doctor having patients with Pima Indian diabetes disease. While treating a particular patient, you want to know in which aspects this patient differs from others. For example, let’s consider the Pima Indian diabetes disease data set.^{Footnote 2} For ‘Patient A’, the most outlying aspect will be having the highest number of pregnancies and low diabetes pedigree function (see Fig. 1), compared to other subspaces.

Another example is when a medical insurance analyst wants to know in which aspects the given insurance claim is most unusual. The above-given applications are different than anomaly detection. Instead of searching the whole data set for the anomaly, in outlying aspect mining, we are specifically interested in a given data instance. The goal is to find out outlying aspects where a given data instance stands out. Such data instance is called a query $\textbf{q}$.

These interesting applications of outlying aspect mining in the medical domain motivated us to write this paper. In this paper, we first introduce four anomaly detection techniques and outlying aspect mining methods. Later, we evaluate their performance on 16 healthcare datasets. To the best of our knowledge, it is the first time when these algorithms have been applied to healthcare data. Our results have verified their performance on anomaly detection and outlying aspect mining tasks and found that isolation-based algorithm presents promising performance, i.e., iForest perform well in anomaly detection and SiNNE perform well for outlying aspect mining task.

The rest of the paper is organized as follows. Section 2 summarizes the principle and working mechanism of four outlying aspect mining algorithms and anomaly detection algorithms. Next, the experimental setup and results are summarized in Sects. 3 and 4, respectively. Finally, we conclude the paper in Sect. 5.

Existing methods

Before describing different outlying aspect mining algorithms, we first provide the problem formulation.

Basic notations and definitions

Definition 1

(Problem definition) Given a set of n instances ${\mathcal {X}}$ ($\Vert {\mathcal {X}}\Vert$ = n) in d dimensional space, a data point $\textbf{q} \in {\mathcal {X}}$, is called anomaly iff,

${{\textbf {q}}}$ dramatically differs from others in full feature space.

and a subspace S is called outlying aspect of $\textbf{q}$ iff,

outlyingness of $\textbf{q}$ in subspace S is higher than other subspaces, and there is no other subspace with the same or higher outlyingness.

Outlying aspect mining algorithms first require a scoring measure to compute the outlyingness of the query in subspace and a search method to search for the most outlying subspace. In the rest of this section, we review different scoring measures only. For the search part, we will use Beam [18] search method because it is the latest search method and is used in different studies [18,19,20,21,22,23]. The flowchart of the complete process is presented in Fig. 2.

Existing anomaly detection scoring measures

LOF

The core idea of density-based anomaly detection is the density of the anomalous object is significantly different from the normal instance. The first local density-based approach, called LOF, which stands for Local Outlier Factor introduced by [24], which is the widely used local outlier detection approach. For any data object, the LOF score is the ratio of the average local density of its k-nearest neighbors to its local density [25]. The LOF score of data object ${{\textbf {q}}}$ is defined as follows:

$$\begin{aligned} \text{ LOF }({{\textbf {q}}}) = \frac{\sum \limits _{x \in N^k({{\textbf {q}}})} lrd(x)}{\Vert N^k({{\textbf {q}}})\Vert \times lrd({{\textbf {q}}})} \end{aligned}$$

where $lrd({{\textbf {q}}}) = \frac{\Vert N^k ({{\textbf {q}}})\Vert }{\sum \limits _{x \in N^k({{\textbf {q}}})} max(dist^k(x,D),dist({{\textbf {q}}},x))}$, $N^k({{\textbf {q}}})$ is a set of k-nearest neighbours of ${{\textbf {q}}}$, $dist({{\textbf {q}}},x)$ is a distance between ${{\textbf {q}}}$ and x and $dist^k({{\textbf {q}}},D)$ is the distance between ${{\textbf {q}}}$ and its k-NN in ${\mathcal {X}}$. The LOF score represents the sparseness of the data object. Data objects with higher LOF values are considered as anomalies.

iForest

Liu et al. [26] presented a framework called Isolation Forest or iForest, which isolates each data point by axis-parallel partitioning of the attribute space. To the best of our knowledge, iForest is the first technique that uses an isolation mechanism to detect anomalies.

iForest builds an ensemble of trees called isolation trees (iTree). Each iTree is built using a randomly selected sub-sample without replacement from the data set. A random split is performed at each node on a randomly selected point from attribute space. The partition will terminate once all the nodes have only one data object or nodes reach the tree’s height limit for iTree. The anomaly score for ${{\textbf {q}}} \in {\mathcal {R}}^d$ based on iForest is defined as:

$$\begin{aligned} \text{ iForest }({{\textbf {q}}}) = \frac{1}{t} \sum \limits _{i=1}^t l_i({{\textbf {q}}}) \end{aligned}$$

where $l_i({{\textbf {q}}})$ is the path length of ${{\textbf {q}}}$ in tree $T_i$.

Sp

Rather than searching for k-nearest neighbor in the data set, [27] employs scoring measure based on the nearest neighbor (k =1) in random sub-samples (${\mathcal {S}} \subset D$). The Sp score of data object ${{\textbf {q}}}$ is defined as follows:

$$\begin{aligned} \text{ Sp }({{\textbf {q}}}) = \min \limits _{x \in {\mathcal {S}}} dist({{\textbf {q}}},x) \end{aligned}$$

where $dist({{\textbf {q}}},x)$ is a distance between ${{\textbf {q}}}$ and x.

In [27], authors have shown that Sp performs better than state-of-the-art anomaly detector LOF and runs faster than LOF.

iNNE

Bandaragoda et al. [28] proposed iNNE, which is stands for isolation using Nearest Neighbor Ensemble. The core idea behind iNNE is an anomaly is far away from its nearest neighbor, and the inverse is true for the regular object. iNNE implementation is influenced by iForest and LOF. The critical difference between iNNE and iForest is that iForest builds a tree from subspaces while iNNE builds hyperspheres using all dimensions. An isolation score of ${{\textbf {q}}}$ is defined as follows:

$$\begin{aligned} I({{\textbf {q}}}) = {\left\{ \begin{array}{ll} 1 - \frac{\tau (\eta _{cnn({{\textbf {q}}})})}{\tau (cnn({{\textbf {q}}}))}, &{}if {{\textbf {q}}} \in \bigcup _{c \in {\mathcal {S}}} B(c),\\ \quad \quad \quad \quad 1, &{}otherwise \end{array}\right. } \end{aligned}$$

where $cnn({{\textbf {q}}}) = \displaystyle \mathop {\mathrm {arg\,min}}\limits _{c \in S} \{ \tau (c): {{\textbf {q}}} \in {\mathcal {B}}(c) \}$, ${\mathcal {S}}$ is set of randomly selected sub-samples, $\Vert {\mathcal {S}}\Vert = \psi$, ${\mathcal {B}}(c)$ is a hypersphere centered at c with radius $\tau (c) = || c - \eta _c ||$, where $\eta _c$ is nearest neighbour of c. The anomaly score for data object ${{\textbf {q}}}$ is defined as:

$$\begin{aligned} \text{ iNNE }({{\textbf {q}}}) = \frac{1}{t} \sum \limits _{i=1}^t I_i({{\textbf {q}}}) \end{aligned}$$

where $I_i({{\textbf {q}}})$ is isolation score based on sub-sample in $i^{th}$ set.

Outlying aspect mining algorithms

OAMiner

Duan et al. [29] introduce Outlying Aspect Miner (OAMiner in short), which uses a Kernel Density Estimation (KDE) [30] based scoring measure to compute the outlyingness of query $\textbf{q}$ in subspace S:

$$\begin{aligned} f_{S}(\textbf{q}) = \frac{1}{n(2 \pi )^{\frac{m}{2}} \prod \limits _{i \in S} h_{i}} \sum \limits _{\textbf{x} \in {\mathcal {O}}} e^ {- \sum \limits _{i\in S} \frac{(q.i - x.i)^2}{2 h^2_{i}}} \end{aligned}$$

where $f_S(\textbf{q})$ is a kernel density estimation of $\textbf{q}$ in subspace S, m is the dimensionality of subspace S ($|S|=m$), $h_{i}$ is the kernel bandwidth in dimension i.

Duan et al. [29] have stated that density is a bias towards high-dimensional subspaces—density tends to decrease as the dimension increases. Thus, to remove the effect of dimensionality bias, they proposed using the query’s density rank as a measure of outlyingness. To find the most outlying subspace of the query, the density of all data points needs to compute in each subspace, where the subspace with the best rank is selected as an outlying aspect of the given query.

OAMiner systematically enumerates all the possible subspaces. In OAMiner, the author has used the set enumeration tree approach [31], which is widely used by the data mining research community. OAMiner searches for subspaces by traversing a depth-first manner [32]. OAMiner used some anti-monotonicity properties to prune the subspaces. Given data set ${\mathcal {O}}$, a query object $\textbf{q}$ and subspace S, if $rank(f_{S}(\textbf{q}))$ = 1, then every super-set of S cannot be a minimal subspace and thus can be pruned.

Beam

Vinh et al. [18] captures the concept of dimensionality unbiasedness and further investigates dimensionally unbiased scoring functions. Dimensionality unbiasedness is an essential property for outlying measures because the query object is compared in different subspaces with a different number of dimensions. They proposed two novel outlying scoring metrics (1) density Z-score and (2) isolation Path score (iPath in short). Their work showed that the proposed Z-score and iPath are dimensionally unbiased.

Therein, the density Z-score is defined as follows:

$$\begin{aligned} Z\hbox {-Score} ({\tilde{f}}_S(\textbf{q})) \triangleq \frac{{\tilde{f}}_S(\textbf{q}) -\mu _{{\tilde{f}}_S}}{\sigma _{{\tilde{f}}_S}} \end{aligned}$$

where $\mu _{f_S}$ and $\sigma _{f_S}$ are the mean and standard deviation of the density of all data instances in subspace S, respectively.

The iPath score is motivated by isolation Forest (iForest) anomaly detection approach [26]. The intuition behind iForest is that anomalies are few and susceptible to isolation. iForest constructs t trees, where each tree is built from randomly selected sub-samples $\psi$ ($\psi \ll n$). Later, it divides using the axis-parallel random splits. Since in the outlying aspect mining context, the main focus is on the path length of the query; thus, authors have ignored other parts of the tree. In outlying aspect mining, the intuition behind the iPath score is that in the most outlying subspace, a given query is easy to isolate than the rest of the data.

The process of calculating the iPath of query $\textbf{q}$ w.r.t. sub-samples $\psi$ of the data is

$$\begin{aligned} iPath_S(\textbf{q}) = \frac{1}{t} \sum \limits _{i=1}^t l_S^i(\textbf{q}) \end{aligned}$$

where $l_S^i(\textbf{q})$ is path length of $\textbf{q}$ in $i^{th}$ tree and subspace S.

Vinh et al. [18] was the first to coin the term dimensionality unbiasedness.

Definition 2

(Dimensionality unbiased [18]) A dimensionality unbiased outlyingness measure (OM) is a measure of which the baseline value, i.e., average value for any data sample ${\mathcal {O}} = \{o_1, o_2, \cdots , o_n \}$ drawn from a uniform distribution, is a quantity independent of the dimension of the subspace S, i.e.,

$$\begin{aligned} E[OM_S(x)\Vert x \in {\mathcal {O}}] = \frac{1}{n} \sum \limits _{x \in {\mathcal {O}}} OM(x) = \text{ const. } \text{ w.r.t } \Vert S\Vert \end{aligned}$$

In [18, Theorem 3], it is proven that rank transformation and Z-score normalization have resulted in a constant average value in any data distribution. Furthermore, it is worth noting that the Z-score scoring function is not only normalized but also the variance of the normalized measures that are constant to dimensions.

The overall beam search process is divided into three stages. All 1-D subspaces are inspected in the first stage to identify trivial outlying features. In the subsequent stage, an exhaustive search is performed on all possible 2 dimensional subspaces. In the third stage, the beam search is implemented at level l. The beam algorithm only keeps top W subspaces (called beam width) in the search process. The total number of subspace considered by the beam algorithm is in the order of $O(d^2 + W \ \ d_{max})$ where $d_{max}$ is the maximum dimension of subspace, and W is the beam width.

sGrid

Wells and Ting [23] introduced a simple grid-based density estimator called sGrid. sGrid is a smoothed variant of a grid-based density estimator [30]. Let ${\mathcal {O}}$ be a collection of n data objects in D-dimensional space, x.S be a projection of a data object $x \in {\mathcal {O}}$ in subspace S. The sGrid density of point $\textbf{q}$ is computed as points that fall in a bin that covers point $\textbf{q}$ and its surrounding neighbors.

Their work showed that the proposed density estimator has advantages over the existing kernel density estimator in outlying aspect mining by replacing the kernel density estimator with sGrid. By replacing KDE with the sGrid density estimator, OAMiner [29] and Beam [18] run two orders of magnitude faster than their original implementation. However, sGrid is not a dimensionally unbiased measure, requiring Z-Score normalization. Again, it makes sGrid computationally inefficient.

SiNNE

Very recently, [21] proposed a Simple Isolation score using Nearest Neighbor Ensemble (SiNNE in short) measure which from Isolation using Nearest Neighbor Ensembles (iNNE in short) method for anomaly detection [28]. SiNNE constructs t ensemble of models (${\mathcal {M}}_1, {\mathcal {M}}_2, \cdots , {\mathcal {M}}_t$). Each model ${\mathcal {M}}_i$ is constructed from randomly chosen sub-samples (${\mathcal {D}}_i \subset {\mathcal {O}}, \Vert {\mathcal {D}}_i\Vert = \psi < n)$. Each model has $\psi$ hyperspheres, where a radius of the hypersphere is the euclidean distance between a ($a \in {\mathcal {D}}_i)$ to its nearest neighbor in ${\mathcal {D}}_i$.

The outlying score of $\textbf{q}$ in model ${\mathcal {M}}_i$, $I(q\Vert {\mathcal {M}}_i) = 0$ if $\textbf{q}$ falls in any of the ball and 1 otherwise. The final outlying score of $\textbf{q}$ using t models is:

$$\begin{aligned} \text{ SiNNE }({{\textbf {q}}}) = \frac{1}{t} \sum \limits _{i=1}^t I({{\textbf {q}}}\Vert {\mathcal {M}}_i) \end{aligned}$$

In their work, they argue that Z-score normalization is biased towards a subspace having high-density variance, and the definition of dimensionality unbiasedness needs to be revised. Furthermore, SiNNE is computationally faster than density and distance-based measures.

Experimental setup

Datasets

In this study, we used 16 publicly available benchmarking medical datasets for anomaly detection; BreastW and Pima are from [33],^{Footnote 3}Annthyroid, Cardiotocography, Heart disease, Hepatitis, WDBC and WPBC are from [34]^{Footnote 4} and Arrhythmia, Lympho, Mammography, Musk, Thyroid, Vertebral, WBC, and Yeast are from [35].^{Footnote 5} The summary of each data set is provided in Table 1.

Table 1 Characteristics of datasets used

Detection and explanation of anomalies in healthcare data

Abstract

Similar content being viewed by others

Mining Outlying Aspects on Healthcare Data

Discovering outlying aspects in large datasets

A New Dimensionality-Unbiased Score for Efficient and Effective Outlying Aspect Mining

Introduction

Existing methods

Basic notations and definitions

Definition 1

Existing anomaly detection scoring measures

LOF

iForest

Sp

iNNE

Outlying aspect mining algorithms

OAMiner

Beam

Definition 2

sGrid

SiNNE

Experimental setup

Datasets

Algorithm implementation and parameters

Evaluation measure

Definition 3

Empirical evaluation

Experiment-1: Performance of anomaly detection algorithms

Experiment-2: Performance of outlying aspect mining algorithms

Conclusion

Notes

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Ethical approval

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation