Information Extraction from PDF Sources Based on Rule-Based System Using Integrated Formats

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 641)


Information extraction from PDF sources is a tedious task. Most existing approaches use either a tag-based format, such as HTML or XML, or a plain-text format for the extraction of information. In this paper, we present an information extraction technique for research papers that exploits both XML and plain-text formats. The patterns and rules are prepared from the integrated formats. Furthermore, the intelligent processing of XML and plain text in different situations complements the approach and helps it achieve high accuracy. The proposed approach is heuristic-based and extracts information about the logical structure and supportive materials of research papers.
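As a rough illustration of combining the two formats, the following Python sketch prefers headings marked up in an XML rendering of a paper and falls back to a regular-expression rule over the plain text when the XML yields nothing. It is not the rule set of this paper: the `<heading>` tag name, the numbered-heading pattern, and all function names are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule: numbered headings such as "1 Introduction" or "2.1 Related Work".
HEADING_RE = re.compile(
    r"^(\d+(?:\.\d+)*)\.?[ \t]+([A-Z][^\n]{0,80})$", re.MULTILINE
)


def headings_from_xml(xml_source):
    """Collect heading text from <heading> elements (tag name is an assumption)."""
    root = ET.fromstring(xml_source)
    return [" ".join("".join(el.itertext()).split()) for el in root.iter("heading")]


def headings_from_text(plain_text):
    """Apply the regular-expression rule line by line to the plain-text rendering."""
    return [f"{num} {title.strip()}" for num, title in HEADING_RE.findall(plain_text)]


def extract_headings(xml_source, plain_text):
    """Prefer XML headings; fall back to the plain-text rule if XML yields none."""
    try:
        found = headings_from_xml(xml_source)
    except ET.ParseError:
        found = []  # malformed XML: rely on the plain-text rule instead
    return found or headings_from_text(plain_text)
```

The fallback order is a design choice for the sketch only; a real system might instead merge or cross-validate the results from both formats.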


Information extraction; Research papers; Ontology; PDF parser; Regular expression; XML and plain-text formats



We would like to thank the Computer Science Department of CUST (Capital University of Science & Technology) for providing us with a research lab and knowledgeable support to conduct this research. We are also thankful to the members of the CDSC (Center for Distributed & Semantic Computing) research group for supporting us on various occasions in completing this work.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Department of Computer Science, Capital University of Science & Technology, Islamabad, Pakistan
