Information extraction from scientific articles: a survey

Nasar, Zara; Jaffry, Syed Waqar; Malik, Muhammad Kamran

doi:10.1007/s11192-018-2921-5

Published: 29 September 2018

Scientometrics, volume 117, pages 1931–1990 (2018)

Abstract

Over the last few decades, with the advent of the World Wide Web (WWW), the world has been flooded with data. This data carries potential information that, once extracted, can be used for the betterment of humanity. Information can be extracted from this data through manual or automatic analysis. Manual analysis is neither scalable nor efficient, whereas automatic analysis relies on computing mechanisms that extract information from large amounts of data. The WWW has also accelerated the growth of scientific literature, which makes the literature review a laborious, time-consuming and cumbersome job for researchers. Hence, there is a pressing need to automatically extract potential information from the immense body of scientific articles in order to automate the literature review process. This study therefore presents the overall progress in automatic information extraction from scientific articles. The information extracted from scientific articles is classified into two broad categories: metadata and key-insights. Because benchmark datasets play a significant role in the development of this research domain, existing datasets for both categories are extensively reviewed. Research studies that have applied various computational approaches to these datasets are then consolidated. The major computational approaches include rule-based approaches, Hidden Markov Models, Conditional Random Fields, Support Vector Machines, Naïve Bayes classification and deep learning approaches. Multiple ongoing projects focus on constructing datasets tailored to specific information needs from scientific articles. Hence, this study covers the state of the art in information extraction from scientific articles. It also consolidates evolving datasets as well as various toolkits and code bases that can be used for information extraction from scientific articles.
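
To make the metadata category concrete, the following is a minimal sketch of a rule-based extractor for a single journal reference string. It is not taken from the survey or from any of the toolkits it reviews: the pattern, the function name `extract_reference_metadata`, and the returned field names are hypothetical, and the rule assumes an APA-like layout of the form "Authors (Year). Title. Journal, Volume(Issue), Pages." Real reference strings vary far more than a single hand-written rule can cover, which is one reason learned sequence models such as HMMs and CRFs are also used for this task.

```python
import re

# Hypothetical rule for APA-like journal references:
# "Authors (Year). Title. Journal, Volume(Issue), Pages."
REFERENCE_PATTERN = re.compile(
    r"^(?P<authors>.+?)\s+\((?P<year>\d{4})\)\.\s+"    # authors and year
    r"(?P<title>.+?)\.\s+"                             # title up to a period
    r"(?P<journal>[^,]+),\s+"                          # journal name
    r"(?P<volume>\d+)(?:\((?P<issue>[^)]+)\))?,\s+"    # volume, optional issue
    r"(?P<pages>\d+\s*[–-]\s*\d+)"                     # page range
)

# Split "Last, F., Last, F., & Last, F." on commas that follow an initial.
AUTHOR_SEPARATOR = re.compile(r"(?<=\.),\s*(?:&\s*)?")

# Optional DOI link at the end of the string (trailing period dropped).
DOI_PATTERN = re.compile(r"https://doi\.org/(\S+?)\.?$")


def extract_reference_metadata(reference):
    """Return a dict of metadata fields, or None if the rule does not apply."""
    text = reference.strip()
    match = REFERENCE_PATTERN.match(text)
    if match is None:
        return None
    fields = match.groupdict()
    fields["authors"] = AUTHOR_SEPARATOR.split(fields["authors"])
    fields["year"] = int(fields["year"])
    doi = DOI_PATTERN.search(text)
    fields["doi"] = doi.group(1) if doi else None
    return fields


if __name__ == "__main__":
    example = (
        "Cortes, C., & Vapnik, V. (1995). Support-vector networks. "
        "Machine Learning, 20(3), 273–297. "
        "https://doi.org/10.1007/BF00994018."
    )
    print(extract_reference_metadata(example))
```

The function returns `None` for strings that do not fit the assumed layout, so a caller can tell when the rule does not apply and fall back to another method.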

Author information

Authors and Affiliations

Punjab University College of Information Technology, University of the Punjab, Lahore, Pakistan
Zara Nasar, Syed Waqar Jaffry & Muhammad Kamran Malik

Corresponding author

Correspondence to Zara Nasar.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

About this article

Cite this article

Nasar, Z., Jaffry, S.W. & Malik, M.K. Information extraction from scientific articles: a survey. Scientometrics 117, 1931–1990 (2018). https://doi.org/10.1007/s11192-018-2921-5

Received: 19 May 2018
Published: 29 September 2018
Issue Date: December 2018
DOI: https://doi.org/10.1007/s11192-018-2921-5
