
PARDA: A Dataset for Scholarly PDF Document Metadata Extraction Evaluation

  • Tiantian Fan
  • Junming Liu
  • Yeliang Qiu
  • Congfeng Jiang (corresponding author)
  • Jilin Zhang
  • Wei Zhang
  • Jian Wan
Conference paper
Part of the Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering book series (LNICST, volume 268)

Abstract

Metadata extraction from scholarly PDF documents is fundamental to publishing, archiving, digital library construction, bibliometrics, and analyses of scientific competitiveness. However, scholarly PDF documents vary in layout and document elements, which makes it difficult to compare extraction approaches: testers draw their test documents from different sources, even when the documents come from the same journal or conference. Performance evaluation against a standard dataset therefore enables a fair and reproducible comparison of extraction approaches. In this paper we present such a dataset, PARDA (Pdf Analysis and Recognition DAtaset), for the performance evaluation and analysis of scholarly document processing, especially the extraction of metadata such as title, authors, affiliations, author-affiliation-email matching, year, and date. The dataset covers computer science, physics, life science, management, mathematics, and the humanities, with documents from publishers including ACM, IEEE, Springer, Elsevier, and arXiv, and each document has a distinct layout and appearance in terms of metadata formatting. We also provide ground-truth metadata for the dataset in both Dublin Core XML format and BibTeX format.

Keywords

Metadata extraction · Dataset · Performance evaluation · Document analysis

Acknowledgment

The funding support of this work by the National Natural Science Foundation of China (No. 61472109, No. 61572163, No. 61672200, and No. 61772165) is greatly appreciated.


Copyright information

© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2019

Authors and Affiliations

  • Tiantian Fan (1, 2)
  • Junming Liu (1, 2)
  • Yeliang Qiu (1, 2)
  • Congfeng Jiang (1, 2), corresponding author
  • Jilin Zhang (1, 2)
  • Wei Zhang (1, 2)
  • Jian Wan (2, 3)
  1. School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
  2. Key Laboratory of Complex Systems Modeling and Simulation, Ministry of Education, Hangzhou, China
  3. School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou, China
