Detecting the knowledge structure of bioinformatics by mining full-text collections

Scientometrics

Abstract

Bioinformatics is a fast-growing, diverse research field that has recently gained much public attention. Although there have been several attempts to understand the field through bibliometric analysis, the approach proposed in this paper is the first to apply text mining techniques to a large set of full-text articles to detect the knowledge structure of the field. To this end, we use PubMed Central full-text articles for bibliometric analysis instead of relying on citation data provided by Web of Science. In particular, we develop text mining routines that build a custom citation database from the mined full text. We report five main findings. First, the majority of papers published in bioinformatics receive few or no citations (63% of papers received fewer than two citations). Second, the number of publications increases linearly and consistently, with 2003 marking the turning point in publication growth. Third, most bioinformatics research is driven by USA-based institutes, followed by European institutes. Fourth, the results of topic modeling and word co-occurrence analysis reveal that the major topics focus more on the biological aspects of bioinformatics than on the computational ones. However, the top 10 articles ranked by PageRank are more closely related to computational aspects. Fifth, visualization of the author co-citation analysis indicates that researchers in molecular biology or genomics play a key role in connecting the sub-disciplines of bioinformatics.
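As a rough illustration of two measures mentioned in the abstract, the sketch below computes the share of papers with fewer than two citations and a PageRank score over a small citation graph. It is a minimal sketch under stated assumptions, not the authors' pipeline: the paper identifiers, the `toy_citations` data, the function names, and the damping factor of 0.85 are all hypothetical choices made for this example.

```python
def citation_counts(citations):
    """Count incoming citations per paper.

    `citations` maps a citing paper to the list of papers it cites
    (hypothetical toy format, each cited paper listed once).
    """
    counts = {paper: 0 for paper in citations}
    for cited_list in citations.values():
        for cited in cited_list:
            counts[cited] = counts.get(cited, 0) + 1
    return counts


def share_below(counts, threshold=2):
    """Fraction of papers that received fewer than `threshold` citations."""
    return sum(1 for c in counts.values() if c < threshold) / len(counts)


def pagerank(citations, damping=0.85, iterations=100):
    """PageRank by power iteration on a citation graph.

    Rank flows from a citing paper to the papers it cites. Papers that
    cite nothing (dangling nodes) spread their rank evenly over all papers.
    """
    nodes = set(citations)
    for cited_list in citations.values():
        nodes.update(cited_list)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        # Rank held by papers with no outgoing citations.
        dangling = sum(rank[p] for p in nodes if not citations.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in nodes}
        for paper, cited_list in citations.items():
            if cited_list:
                share = damping * rank[paper] / len(cited_list)
                for cited in cited_list:
                    new_rank[cited] += share
        rank = new_rank
    return rank


# Hypothetical toy data: papers A, B, and E cite C; C and E cite D.
toy_citations = {"A": ["C"], "B": ["C"], "C": ["D"], "D": [], "E": ["C", "D"]}

counts = citation_counts(toy_citations)
ranks = pagerank(toy_citations)
print("share with fewer than two citations:", share_below(counts))
print("papers by PageRank:", sorted(ranks, key=ranks.get, reverse=True))
```

In this toy graph, three of the five papers are cited fewer than twice, which mirrors in miniature the skew the abstract reports. A real analysis would first have to extract and resolve references from the full text, which this sketch does not attempt.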

Acknowledgments

We give special thanks to Ying Ding for her invaluable comments on the manuscript, which improved the quality of the paper.

Author information

Corresponding author

Correspondence to Min Song.

About this article

Cite this article

Song, M., Kim, S.Y. Detecting the knowledge structure of bioinformatics by mining full-text collections. Scientometrics 96, 183–201 (2013). https://doi.org/10.1007/s11192-012-0900-9

  • DOI: https://doi.org/10.1007/s11192-012-0900-9
