Information Extraction from PDF Sources Based on Rule-Based System Using Integrated Formats
Extracting information from PDF sources is a tedious task. Most existing approaches use either a tag-based format, such as HTML or XML, or a plain-text format for extraction. In this paper, we present an information extraction technique for research papers that exploits both the XML and the plain-text formats. Patterns and rules are prepared from the integrated formats, and processing XML and plain text differently depending on the situation complements the approach to achieve high accuracy. The proposed approach is heuristic-based and extracts information about the logical structure and supportive materials of research papers.
Keywords: Information extraction; Research papers; Ontology; PDF parser; Regular expression; XML and plain-text formats
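The idea of combining a tag-based rule with a plain-text rule, as outlined in the abstract, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `<h1>` tag name, the `HEADING_RULE` pattern, and all function names are assumptions chosen for illustration. The sketch reads section headings from the XML rendering of a paper and falls back to a regular expression over the plain-text rendering when the XML yields nothing.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical plain-text rule: numbered headings such as
# "1 Introduction" or "2.1 Related Work", each on its own line.
HEADING_RULE = re.compile(
    r"^(\d+(?:\.\d+)*)\.?\s+([A-Z][^\n]{0,80})$", re.MULTILINE
)


def headings_from_xml(xml_source):
    """Tag-based rule: collect the text of every <h1> element (assumed tag)."""
    root = ET.fromstring(xml_source)
    return [el.text.strip() for el in root.iter("h1")
            if el.text and el.text.strip()]


def headings_from_text(plain_text):
    """Pattern-based rule: collect the titles of numbered heading lines."""
    return [m.group(2).strip() for m in HEADING_RULE.finditer(plain_text)]


def extract_headings(xml_source, plain_text):
    """Prefer the XML tags; fall back to plain text if they yield nothing."""
    try:
        found = headings_from_xml(xml_source)
    except ET.ParseError:
        # Malformed XML: rely on the plain-text rule instead.
        found = []
    return found or headings_from_text(plain_text)
```

Calling `extract_headings(xml_source, plain_text)` with both renderings of the same paper returns the headings from whichever format the rules can handle. A real rule set would cover many more elements (authors, captions, references) and more heading styles.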
We would like to thank the computer science department of CUST (Capital University of Science & Technology) for providing us with a research lab and knowledgeable support to conduct this research. We are also thankful to the members of the CDSC (Center for Distributed & Semantic Computing) research group for supporting us on various occasions in completing this work.