Classifying Document Types to Enhance Search and Recommendations in Digital Libraries

  • Conference paper
Research and Advanced Technology for Digital Libraries (TPDL 2017)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10450)

Abstract

In this paper, we address the problem of classifying documents available from the global network of (open access) repositories according to their type. We show that the repository-provided metadata needed to distinguish research papers, theses and slides are missing in over 60% of cases. Although metadata describing document types are useful in a variety of scenarios, ranging from research analytics to improving search and recommender (SR) systems, this problem has not yet been sufficiently addressed in the context of the repositories infrastructure. We have developed a new approach for classifying document types using supervised machine learning based exclusively on text-specific features. We achieve an F1-score of 0.96 using the random forest and AdaBoost classifiers, which are the best-performing models on our data. By analysing the SR system logs of the CORE [1] digital library aggregator, we show that users are an order of magnitude more likely to click on research papers and theses than on slides. This suggests that using document types as a feature for ranking or filtering SR results in digital libraries has the potential to improve user experience.
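For readers unfamiliar with the F1-score reported above, the following is a minimal sketch of how a per-class and averaged F1 can be computed for a three-way document type classifier. It is not the authors' evaluation code: the label names, the toy predictions, the helper names `f1_per_class` and `macro_f1`, and the choice of macro averaging are all assumptions made for illustration, since the abstract does not state how the 0.96 figure was averaged.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gets a false positive
            fn[truth] += 1   # true class gets a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (an assumed averaging choice)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical toy labels, not data from the paper.
y_true = ["paper", "paper", "thesis", "slides", "thesis", "slides"]
y_pred = ["paper", "thesis", "thesis", "slides", "thesis", "slides"]

scores = f1_per_class(y_true, y_pred)
overall = macro_f1(y_true, y_pred)
```

Macro averaging weights each document type equally, which matters when one class (such as slides) is much rarer than the others; a support-weighted average would be an equally plausible reading of the reported score.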

Notes

  1. https://www.worldcat.org/

  2. https://www.slideshare.net/

  3. https://figshare.com/

  4. It should be noted that as CORE provides thumbnails on its SR results pages, users get an idea of the document type prior to accessing it.

  5. The number of impressions generated in response to a query can vary across queries. In our case, it can be from zero to ten for search and from zero to five for the recommender.

  6. This excludes network overhead from the API call and the feature extraction process.

References

  1. Knoth, P., Zdráhal, Z.: CORE: three access levels to underpin open access. D-Lib Mag. 18(11/12) (2012)

  2. Rettberg, N., Schmidt, B.: OpenAIRE - building a collaborative open access infrastructure for European researchers. Liber Q. 22(3) (2012)

  3. Summann, F.: Bielefeld academic search engine: a scientific search service for institutional repositories. In: Open Scholarship 2006 Conference (2006)

  4. Classifying document types to enhance search and recommendations in digital libraries - Dataset, https://figshare.com/articles/Classifying_document_types_to_enhance_search_and_recommendations_in_digital_libraries/4834229. Accessed 21 Apr 2017

  5. Poynder, R.: Q&A with CNI’s Clifford Lynch: Time to re-think the institutional repository? The Open Access Interviews (2016)

  6. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. (CSUR) 34(1), 1–47 (2002)

  7. Qi, X., Davison, B.D.: Web page classification: features and algorithms. ACM Comput. Surv. (CSUR) 41(2), 12 (2009)

  8. Ghosh, S., Mitra, P.: Combining content and structure similarity for XML document classification using composite SVM kernels. In: 19th International Conference on Pattern Recognition, ICPR 2008, pp. 1–4. IEEE (2008)

  9. Caragea, C., Wu, J., Gollapalli, S.D., Giles, C.L.: Document type classification in online digital libraries. In: AAAI, pp. 3997–4002 (2016)

  10. Aphinyanaphongs, Y., Fu, L.D., Li, Z., Peskin, E.R., Efstathiadis, E., Aliferis, C.F., Statnikov, A.: A comprehensive empirical comparison of modern supervised classification and feature selection methods for text categorization. J. Assoc. Inf. Sci. Technol. 65(10), 1964–1987 (2014)

  11. Bird, S.: NLTK: the natural language toolkit. In: Proceedings of the COLING/ACL on Interactive Presentation Sessions, pp. 69–72. Association for Computational Linguistics (2006)

  12. Shinyama, Y.: Pdfminer: Python PDF parser and analyzer (2015), http://www.unixuser.org/euske/python/pdfminer/. Accessed 08 Apr 2017

  13. Buuren, S., Groothuis-Oudshoorn, K.: Mice: Multivariate imputation by chained equations in R. J. Stat. Softw. 45(3) (2011)

  14. van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)

  15. Tukey, J.W.: Comparing individual means in the analysis of variance. Biometrics, 99–114 (1949)

  16. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: 7. computational performance - scikit-learn 0.18.1 documentation, http://scikit-learn.org/stable/modules/computational_performance.html. Accessed 08 Apr 2017

  17. Kim, Y., Hassan, A., White, R.W., Zitouni, I.: Modeling dwell time to predict click-level satisfaction. In: Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pp. 193–202. ACM (2014)

Acknowledgements

This work has been partly funded by the EU OpenMinTeD project under the H2020-EINFRA-2014-2 call, Project ID: 654021. We would also like to acknowledge the support of Jisc for the CORE project.

Author information

Corresponding author

Correspondence to Petr Knoth.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Charalampous, A., Knoth, P. (2017). Classifying Document Types to Enhance Search and Recommendations in Digital Libraries. In: Kamps, J., Tsakonas, G., Manolopoulos, Y., Iliadis, L., Karydis, I. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2017. Lecture Notes in Computer Science, vol 10450. Springer, Cham. https://doi.org/10.1007/978-3-319-67008-9_15

  • DOI: https://doi.org/10.1007/978-3-319-67008-9_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67007-2

  • Online ISBN: 978-3-319-67008-9

  • eBook Packages: Computer Science, Computer Science (R0)
