Detecting Target Text Related to Algorithmic Efficiency in Scholarly Big Data Using Recurrent Convolutional Neural Network Model

Safder, Iqra; Sarfraz, Junaid; Hassan, Saeed-Ul; Ali, Mohsen; Tuarob, Suppawong

doi:10.1007/978-3-319-70232-2_3

Conference paper
First Online: 03 November 2017

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10647)

Abstract

Scientific literature has grown exponentially over the last few decades. Thanks to advances in web-enabled tools and technologies, millions of articles are now stored and indexed in digital libraries. Within this archived literature, thousands of newly emerging algorithms, mostly illustrated with pseudo-code, are published every year in Computer Science and related computational fields. Previously, an array of techniques has been deployed to retrieve information about these algorithms by indexing their pseudo-code and metadata from a vast pool of scholarly documents. Unfortunately, existing search engines are limited to indexing a textual description of each pseudo-code and cannot provide simple algorithm-specific information such as run-time complexity, performance evaluation (for example precision, recall, or f-measure), or the size of the dataset the algorithm can effectively process. In this paper, we propose a set of algorithms that extract information pertaining to the performance of the algorithm(s) presented and/or discussed in a research article. Specifically, sentences in the paper that convey information about the efficiency of the corresponding algorithm are identified and extracted using the Recurrent Convolutional Neural Network (RCNN) model. To evaluate the performance of our approach, we collected a dataset of 258 scholarly documents, originally downloaded from CiteSeerX and manually annotated by four experts. Our proposed RCNN-based model achieves an encouraging f-measure of 77.65% and an accuracy of 76.35%.
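The abstract reports f-measure and accuracy for a sentence-level classification task (efficiency-related sentence or not). As a reminder of how such figures are typically derived, here is a minimal sketch in Python. The `Confusion` class, the function names, and the example counts are all hypothetical and chosen for illustration; they are not taken from the paper, and the paper's exact evaluation protocol may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion counts for a binary task (hypothetical helper, not from the paper)."""
    tp: int  # efficiency-related sentences correctly flagged
    fp: int  # other sentences wrongly flagged
    fn: int  # efficiency-related sentences missed
    tn: int  # other sentences correctly ignored


def precision(c: Confusion) -> float:
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def recall(c: Confusion) -> float:
    relevant = c.tp + c.fn
    return c.tp / relevant if relevant else 0.0


def f_measure(c: Confusion) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def accuracy(c: Confusion) -> float:
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


# Illustrative counts only; these are NOT the paper's results.
example = Confusion(tp=40, fp=10, fn=10, tn=40)
print(f"precision={precision(example):.2f} recall={recall(example):.2f}")
print(f"f-measure={f_measure(example):.2f} accuracy={accuracy(example):.2f}")
```

Because f-measure ignores true negatives while accuracy counts them, the two figures can diverge when the classes are imbalanced, which is one reason papers often report both.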

Author information

Authors and Affiliations

Information Technology University, Ferozepur Road, Lahore, 54000, Pakistan
Iqra Safder, Junaid Sarfraz, Saeed-Ul Hassan & Mohsen Ali
Faculty of ICT, Mahidol University, Nakhon Pathom, 73170, Thailand
Suppawong Tuarob

Corresponding author

Correspondence to Saeed-Ul Hassan.

Editor information

Editors and Affiliations

Chulalongkorn University, Bangkok, Thailand
Songphan Choemprayong
University of Lugano, Lugano, Switzerland
Fabio Crestani
Waikato University, Hamilton, New Zealand
Sally Jo Cunningham

About this paper

Cite this paper

Safder, I., Sarfraz, J., Hassan, SU., Ali, M., Tuarob, S. (2017). Detecting Target Text Related to Algorithmic Efficiency in Scholarly Big Data Using Recurrent Convolutional Neural Network Model. In: Choemprayong, S., Crestani, F., Cunningham, S. (eds) Digital Libraries: Data, Information, and Knowledge for Digital Lives. ICADL 2017. Lecture Notes in Computer Science, vol 10647. Springer, Cham. https://doi.org/10.1007/978-3-319-70232-2_3

DOI: https://doi.org/10.1007/978-3-319-70232-2_3
Published: 03 November 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-70231-5
Online ISBN: 978-3-319-70232-2
eBook Packages: Computer Science, Computer Science (R0)