Unsupervised document structure analysis of digital scientific articles

  • Stefan Klampfl
  • Michael Granitzer
  • Kris Jack
  • Roman Kern

Abstract

Text mining and information retrieval in large collections of scientific literature require automated processing systems that analyse the documents’ content. However, the layout of scientific articles varies widely across publishers, and common digital document formats are optimised for presentation but lack structural information. To overcome these challenges, we have developed a processing pipeline that analyses the structure of a PDF document using a number of unsupervised machine learning techniques and heuristics. Apart from the meta-data extraction, which we reused from previous work, our system uses only information available from the current document and does not require any pre-trained model. First, contiguous text blocks are extracted from the raw character stream. Next, we determine geometrical relations between these blocks, which, together with geometrical and font information, are then used to categorise the blocks into different classes. Based on the resulting logical structure, we finally extract the body text and the table of contents of a scientific article. We separately evaluate the individual stages of our pipeline on a number of different datasets and compare them with other document structure analysis approaches. We show that our pipeline outperforms a state-of-the-art system in terms of the quality of the extracted body text and table of contents. Our unsupervised approach could provide a basis for advanced digital library scenarios that involve diverse and dynamic corpora.
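
The first stage described above, grouping raw characters into contiguous text blocks without a trained model, can be illustrated with a minimal sketch. The abstract does not specify the clustering technique, so the code below assumes a simple single-linkage grouping with fixed gap thresholds. The `Box` input format, the `group_boxes` function, and the threshold values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Box:
    """Bounding box of one character (hypothetical input format).

    Assumes y grows downward, so a smaller y0 means higher on the page.
    """
    x0: float
    y0: float
    x1: float
    y1: float
    text: str = ""


def gap(a: Box, b: Box) -> Tuple[float, float]:
    """Horizontal and vertical gap between two boxes (0 if they overlap on that axis)."""
    dx = max(a.x0 - b.x1, b.x0 - a.x1, 0.0)
    dy = max(a.y0 - b.y1, b.y0 - a.y1, 0.0)
    return dx, dy


def group_boxes(boxes: List[Box], max_dx: float, max_dy: float) -> List[List[Box]]:
    """Single-linkage grouping: boxes within both thresholds share a group."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of boxes that lies within the gap thresholds.
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            dx, dy = gap(boxes[i], boxes[j])
            if dx <= max_dx and dy <= max_dy:
                parent[find(i)] = find(j)

    groups: Dict[int, List[Box]] = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)

    # Order characters inside each block, then order the blocks themselves,
    # top to bottom and left to right.
    blocks = [sorted(g, key=lambda b: (b.y0, b.x0)) for g in groups.values()]
    blocks.sort(key=lambda g: (min(b.y0 for b in g), min(b.x0 for b in g)))
    return blocks


if __name__ == "__main__":
    chars = [
        Box(0, 0, 5, 10, "a"), Box(6, 0, 11, 10, "b"),
        Box(0, 12, 5, 22, "c"), Box(6, 12, 11, 22, "d"),
        Box(0, 100, 5, 110, "e"), Box(6, 100, 11, 110, "f"),
    ]
    for block in group_boxes(chars, max_dx=2.0, max_dy=4.0):
        print("".join(b.text for b in block))
```

Fixed thresholds are the simplest possible choice here. A pipeline that relies only on the current document, as the abstract describes, would more plausibly derive such values from the document's own spacing statistics, but that step is outside this sketch.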

Keywords

Document structure analysis, Machine learning, Clustering, PDF extraction, Text mining

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Stefan Klampfl (1)
  • Michael Granitzer (3)
  • Kris Jack (4)
  • Roman Kern (1, 2)

  1. Know-Center GmbH, Graz, Austria
  2. Knowledge Technologies Institute, Graz University of Technology, Graz, Austria
  3. University of Passau, Passau, Germany
  4. Mendeley Ltd, London, UK
