A search for the term “bioinformatics” in the Scopus database returned 85,106 publications from 1998 to 2016. Grouped by year, the number of publications shows an upward trend (Fig. 2), with one surprising exception: a noticeable drop in 2012 and 2013.
4.1 Keyword-Based Analysis
We found 100,754 unique keywords across the 85,106 publications spanning 18 years, with an average of about 3 keywords per publication. The number of unique keywords per year (Fig. 2) follows a trend very similar to that of the yearly publication counts.
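The summary statistics above can be reproduced with a short script. The following is a minimal sketch, not the pipeline used in this study: it assumes the Scopus export has already been parsed into (year, keyword list) pairs, and the function name `keyword_summary` and the lowercase normalization are illustrative assumptions.

```python
from collections import defaultdict


def keyword_summary(records):
    """Summarize keyword usage.

    `records` is an iterable of (year, [keywords]) pairs. This input
    format is a hypothetical one chosen for illustration.
    """
    pubs_per_year = defaultdict(int)
    unique_per_year = defaultdict(set)
    all_unique = set()
    total_keywords = 0
    n_pubs = 0

    for year, keywords in records:
        # Assumption: keywords are compared case-insensitively.
        normalized = {k.strip().lower() for k in keywords if k.strip()}
        pubs_per_year[year] += 1
        unique_per_year[year] |= normalized
        all_unique |= normalized
        total_keywords += len(normalized)
        n_pubs += 1

    return {
        "publications_per_year": dict(pubs_per_year),
        "unique_keywords_per_year": {y: len(s) for y, s in unique_per_year.items()},
        "unique_keywords": len(all_unique),
        "mean_keywords_per_publication": total_keywords / n_pubs if n_pubs else 0.0,
    }
```

Plotting `publications_per_year` against `unique_keywords_per_year` gives the kind of side-by-side comparison described for Fig. 2.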
Temporal Keyword Trends. We manually curated 25 interesting keywords from the top keywords of each year. Figure 3 shows the popularity trends of these 25 curated keywords. “big data”, “proteomics”, “rna seq”, “cancer”, “next generation sequencing”, and “transcriptomics” are among the areas that exhibit an increasing presence in publications over the last decade. It is interesting to see the emergence of big data applications within bioinformatics around 2010, accompanied by an exponential increase in relevant publications. “rna seq”, or RNA sequencing, is another area that emerged in the latter part of the past decade and has since become a very popular research area. Unsurprisingly, the trend of “next generation sequencing” is similar to that of “rna seq”. Overall, “next generation sequencing techniques”, “cancer informatics”, “biomarkers”, “metabolomics”, “mirna”, “machine learning”, and “big data” are promising areas of research based on these trends. The emphasis on “cancer”, “biomarkers”, and “big data” indicates that health informatics is a sought-after specialization. Surprisingly, the same positive trend is not observed for “drug discovery”, which has plateaued over time. “functional genomics”, “ontologies”, and “neural networks” show mixed trends.
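A popularity trend of this kind can be computed as follows. This is a minimal sketch under stated assumptions: popularity is taken to be the share of a year's publications that list the keyword, which may differ from the measure plotted in Fig. 3, and the function name `keyword_trends` and the (year, keyword list) input format are hypothetical.

```python
from collections import Counter, defaultdict


def keyword_trends(records, curated):
    """Return, per curated keyword, a list of (year, share) pairs.

    `share` is the fraction of that year's publications listing the
    keyword. Normalizing by yearly output is an assumption; raw counts
    would work the same way without the division.
    """
    curated = {k.strip().lower() for k in curated}
    pubs = Counter()              # year -> number of publications
    hits = defaultdict(Counter)   # keyword -> year -> publications using it

    for year, keywords in records:
        pubs[year] += 1
        present = {k.strip().lower() for k in keywords} & curated
        for keyword in present:
            hits[keyword][year] += 1

    years = sorted(pubs)
    return {
        keyword: [(year, hits[keyword][year] / pubs[year]) for year in years]
        for keyword in sorted(curated)
    }
```

Normalizing by the yearly publication count keeps a keyword from appearing to grow simply because the field as a whole published more in that year.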
Keyword Network. The network built using the top 25 keywords per year comprises 6 clusters, shown in blue, pink, purple, green, brown, and grey (Fig. 4). The blue cluster is clearly central to the network and overlaps substantially with the other clusters. For lack of space, we show only the central blue cluster and the green cluster in greater detail (Figs. 5 and 6). The blue cluster (Fig. 5) is largely focused on health informatics, in particular the study of different types of cancer such as “colorectal”, “prostate”, “breast”, etc. The cluster accurately reflects that microarray and gene expression analyses have been significant contributors to the study of cancer in the past decades [8, 12, 18]. It also hints at more recent approaches to cancer analytics, which include “gene ontology”, “text mining”, and machine learning approaches such as “clustering” [14, 20].
The green cluster (Fig. 6) focuses largely on sequence analysis and alignment using algorithms and techniques from graph theory. It contains certain nodes that are somewhat distant from the rest of the cluster: “MPI”, “hadoop”, “mapreduce”, “cuda”, “membrane”, and “cloud computing”. Interestingly, all these words pertain to big-data approaches that have recently come into play to analyze high-throughput data from next generation sequencing approaches [19, 21]. As sequencing data becomes increasingly complex and voluminous, we can expect these words to become more central to this cluster over time.
The brown cluster focuses on computational techniques such as data mining, machine learning, feature selection for drug design and discovery, protein-structure prediction, pattern recognition, and structural bioinformatics. In the pink cluster, we see “data integration”, “database”, “semantic web”, and ontologies being used for the study of phenotypes, evolution, and phylogenies. This cluster points to the increasing application of ontologies and data integration to the study of evolutionary phenotypes [16]. The grey cluster is largely related to proteomics, systems biology, functional genomics, and the analysis of microRNA. The purple cluster is related to next generation sequencing, gene expression analyses, genomics, transcriptome, and genetics.
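For readers who wish to build a comparable network, the sketch below shows one way to derive nodes and weighted edges. It rests on assumptions not stated in this section: nodes are the union of the top-k keywords of each year, an edge links two keywords that appear in the same publication, and the edge weight is the number of such publications. The function names and the (year, keyword list) input format are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def top_keywords_per_year(records, k=25):
    """Union of the k most frequent keywords of each year."""
    counts = defaultdict(Counter)
    for year, keywords in records:
        # Count each keyword once per publication.
        counts[year].update({kw.strip().lower() for kw in keywords if kw.strip()})
    selected = set()
    for year_counts in counts.values():
        selected.update(kw for kw, _ in year_counts.most_common(k))
    return selected


def cooccurrence_edges(records, nodes):
    """Weighted edges between selected keywords.

    Assumption: two keywords are linked when they co-occur in one
    publication; the weight counts the publications where they do.
    """
    edges = Counter()
    for _, keywords in records:
        present = sorted({kw.strip().lower() for kw in keywords} & nodes)
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges
```

The resulting edge list can be passed to any graph library or visualization tool that supports community detection in order to obtain color-coded clusters like those in Fig. 4. The choice of clustering algorithm is left open here.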