Information extraction from scientific articles: a survey

Scientometrics

Abstract

Over the last few decades, with the advent of the World Wide Web (WWW), the world has been flooded with data. This data carries information that, once extracted, can be used for the betterment of humanity. Information can be extracted from it through manual or automatic analysis. Manual analysis is neither scalable nor efficient, whereas automatic analysis relies on computational mechanisms that extract information from large volumes of data. The WWW has also accelerated the growth of scientific literature, which makes literature review a laborious and time-consuming job for researchers. Hence there is a pressing need to automatically extract information from the immense body of scientific articles and so automate the literature review process. This study therefore presents the overall progress in automatic information extraction from scientific articles. The information extracted from scientific articles is classified into two broad categories: metadata and key-insights. Because benchmark datasets play a significant role in the development of this research domain, existing datasets for both categories are extensively reviewed. The research studies that have applied various computational approaches to these datasets are then consolidated. The major computational approaches include rule-based approaches, Hidden Markov Models, Conditional Random Fields, Support Vector Machines, Naïve Bayes classification and deep learning approaches. Multiple ongoing projects focus on constructing datasets tailored to specific information needs from scientific articles. Hence, this study covers the state of the art in information extraction from scientific articles.
It also consolidates evolving datasets as well as the toolkits and code bases that can be used for information extraction from scientific articles.
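
To make the metadata category concrete, the following is a minimal sketch of the simplest family of approaches named above, a rule-based extractor, applied to a single APA-style reference string. The function name `extract_reference_fields`, the three regular expressions and the field names are illustrative assumptions and are not taken from any system covered by the survey; production tools rely on far richer rule sets or on learned sequence models such as HMMs and CRFs.

```python
import re

# Hypothetical hand-written rules for APA-style reference strings.
YEAR_RE = re.compile(r"\((\d{4})\)")                      # "(1995)"
DOI_RE = re.compile(r"https?://doi\.org/(\S+?)\.?$")      # DOI link at the end
PAGES_RE = re.compile(r"(\d+)[–-](\d+)")                  # "273–297" or "273-297"


def extract_reference_fields(reference: str) -> dict:
    """Extract a few metadata fields from one reference string using regex rules."""
    fields = {}
    year = YEAR_RE.search(reference)
    if year:
        fields["year"] = int(year.group(1))
        # Rule: everything before "(year)" is treated as the author list.
        fields["authors"] = reference[: year.start()].strip()
        # Rule: the title runs from just after "(year)." to the next ". ".
        rest = reference[year.end():].lstrip(". ")
        fields["title"] = rest.split(". ", 1)[0]
    doi = DOI_RE.search(reference.strip())
    if doi:
        fields["doi"] = doi.group(1)
    pages = PAGES_RE.search(reference)
    if pages:
        fields["pages"] = (int(pages.group(1)), int(pages.group(2)))
    return fields


EXAMPLE = (
    "Cortes, C., & Vapnik, V. (1995). Support-vector networks. "
    "Machine Learning, 20(3), 273–297. https://doi.org/10.1007/BF00994018."
)

if __name__ == "__main__":
    for name, value in extract_reference_fields(EXAMPLE).items():
        print(f"{name}: {value}")
```

Rules of this kind are easy to write and inspect, but they break as soon as a reference departs from the expected layout (for instance, a year placed at the end of the string), which is one reason much of the surveyed work moves to statistical sequence labelling.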

Notes

  1. https://people.cs.umass.edu/~mccallum/data.html.

  2. http://www.iesl.cs.umass.edu/data/data-umasscitationfield.

  3. https://github.com/alansouzati/artic-poc.

  4. https://github.com/knmnyn/ParsCit/tree/master/crfpp/traindata.

  5. http://www.nilc.icmc.usp.br/mazea-web/.

Author information

Corresponding author

Correspondence to Zara Nasar.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

About this article

Cite this article

Nasar, Z., Jaffry, S.W. & Malik, M.K. Information extraction from scientific articles: a survey. Scientometrics 117, 1931–1990 (2018). https://doi.org/10.1007/s11192-018-2921-5

  • DOI: https://doi.org/10.1007/s11192-018-2921-5
