Cluster Computing

Volume 21, Issue 1, pp 481–492

Section-wise indexing and retrieval of research articles

  • Abdul Shahid
  • Muhammad Tanvir Afzal


Extracting relevant information is a pressing need of the scholarly community. A number of systems are available for finding relevant information in the scientific literature, such as search engines, citation indexes, and digital libraries. For a search query, however, users are often presented with a long list of irrelevant documents, mainly because of the huge number of full-text documents available and because of the unstructured nature of the indexed scientific resources. Contemporary systems have formally defined the structure of scientific documents. However, populating this already available enriched structure from unstructured or semi-structured scientific documents has not been addressed previously. In this paper, we design, implement, and evaluate an automated technique that tags each paper’s content with the logical sections appearing in the scientific document. The proposed system has been evaluated against a benchmark and subsequently compared with machine learning techniques that may be used for the same task. Empirically, the overall correctness and completeness of the proposed technique are 0.78 and 0.79 respectively, giving an overall accuracy of about 0.78. These results compare well with machine learning based classification. The developed system may help future information retrieval systems, digital libraries, and citation indexes to index, retrieve, rank, and visualize the most relevant scientific documents for the scientific community.
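The abstract does not define how the overall accuracy is derived from correctness and completeness. As a minimal sketch, assuming that correctness corresponds to precision and completeness to recall (an interpretation, not something the abstract states), the balanced F-measure of the two reported values can be computed as follows. The function name `f_measure` is illustrative only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        # Avoid division by zero when both scores are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract, read here as precision and recall.
# This reading is an assumption made for illustration.
correctness = 0.78
completeness = 0.79

print(round(f_measure(correctness, completeness), 3))
```

Under this assumption the harmonic mean falls between the two reported values, which is in line with the stated overall figure of about 0.78.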


Document structuring; Populating DEo ontology; Logical section mapping; Machine learning based section mapping; Scientific document classification



Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. Kohat University of Science and Technology, Kohat, Pakistan
  2. Capital University of Science and Technology, Islamabad, Pakistan
