Classifying Document Types to Enhance Search and Recommendations in Digital Libraries

Charalampous, Aristotelis; Knoth, Petr

doi:10.1007/978-3-319-67008-9_15

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10450)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

In this paper, we address the problem of classifying documents available from the global network of (open access) repositories according to their type. We show that the metadata provided by repositories that would enable us to distinguish research papers, theses and slides are missing in over 60% of cases. While metadata describing document types are useful in a variety of scenarios, ranging from research analytics to improving search and recommender (SR) systems, this problem has not yet been sufficiently addressed in the context of the repositories infrastructure. We have developed a new approach for classifying document types using supervised machine learning based exclusively on text-specific features. We achieve an F1-score of 0.96 using the random forest and AdaBoost classifiers, which are the best-performing models on our data. By analysing the SR system logs of the CORE [1] digital library aggregator, we show that users are an order of magnitude more likely to click on research papers and theses than on slides. This suggests that using document types as a feature for ranking/filtering SR results in digital libraries has the potential to improve user experience.
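
The click comparison in the abstract can be read as a click-through rate (clicks divided by impressions) computed separately for each document type. The following is a minimal sketch of that calculation, not the authors' actual log analysis: the function name `click_through_rates`, the `(doc_type, clicked)` log format and the toy log are all assumptions made for illustration, and the numbers are invented rather than taken from the paper.

```python
from collections import Counter


def click_through_rates(log):
    """Return clicks / impressions for each document type.

    `log` is an iterable of (doc_type, clicked) pairs, one pair per
    impression shown to a user. This log format is hypothetical.
    """
    impressions = Counter()
    clicks = Counter()
    for doc_type, clicked in log:
        impressions[doc_type] += 1
        if clicked:
            clicks[doc_type] += 1
    # Only types with at least one impression appear, so no division by zero.
    return {t: clicks[t] / impressions[t] for t in impressions}


# Toy, made-up log (NOT data from the paper): 10 impressions per type.
toy_log = (
    [("research", True)] * 4 + [("research", False)] * 6
    + [("slides", True)] * 1 + [("slides", False)] * 9
)

rates = click_through_rates(toy_log)
for doc_type, rate in sorted(rates.items()):
    print(f"{doc_type}: {rate:.2f}")
```

Comparing the resulting rates across types is what would support a statement such as "users are more likely to click on type A than on type B"; a real analysis would also need to account for the varying number of impressions per query.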

Notes

1. https://www.worldcat.org/.
2. https://www.slideshare.net/.
3. https://figshare.com/.
4. It should be noted that as CORE provides thumbnails on its SR results pages, users get an idea of the document type prior to accessing it.
5. The number of impressions generated in response to a query can vary across queries. In our case, it can be from zero to ten for search and from zero to five for the recommender.
6. This excludes network overhead from the API call and the feature extraction process.

References

1. Knoth, P., Zdráhal, Z.: CORE: three access levels to underpin open access. D-Lib Mag. 18(11/12) (2012)
2. Rettberg, N., Schmidt, B.: OpenAIRE - building a collaborative open access infrastructure for European researchers. Liber Q. 22(3) (2012)
3. Summann, F.: Bielefeld academic search engine: a scientific search service for institutional repositories. In: Open Scholarship 2006 Conference (2006)
4. Classifying document types to enhance search and recommendations in digital libraries - Dataset, https://figshare.com/articles/Classifying_document_types_to_enhance_search_and_recommendations_in_digital_libraries/4834229. Accessed 21 Apr 2017
5. Poynder, R.: Q&A with CNI’s Clifford Lynch: Time to re-think the institutional repository? The Open Access Interviews (2016)
6. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. (CSUR) 34(1), 1–47 (2002)
7. Qi, X., Davison, B.D.: Web page classification: features and algorithms. ACM Comput. Surv. (CSUR) 41(2), 12 (2009)
8. Ghosh, S., Mitra, P.: Combining content and structure similarity for XML document classification using composite SVM kernels. In: 19th International Conference on Pattern Recognition, ICPR 2008, pp. 1–4. IEEE (2008)
9. Caragea, C., Wu, J., Gollapalli, S.D., Giles, C.L.: Document type classification in online digital libraries. In: AAAI, pp. 3997–4002 (2016)
10. Aphinyanaphongs, Y., Fu, L.D., Li, Z., Peskin, E.R., Efstathiadis, E., Aliferis, C.F., Statnikov, A.: A comprehensive empirical comparison of modern supervised classification and feature selection methods for text categorization. J. Assoc. Inf. Sci. Technol. 65(10), 1964–1987 (2014)
11. Bird, S.: NLTK: the natural language toolkit. In: Proceedings of the COLING/ACL on Interactive Presentation Sessions, pp. 69–72. Association for Computational Linguistics (2006)
12. Shinyama, Y.: PDFMiner: Python PDF parser and analyzer (2015), http://www.unixuser.org/euske/python/pdfminer/. Accessed 08 Apr 2017
13. van Buuren, S., Groothuis-Oudshoorn, K.: mice: Multivariate imputation by chained equations in R. J. Stat. Softw. 45(3) (2011)
14. van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)
15. Tukey, J.W.: Comparing individual means in the analysis of variance. Biometrics, 99–114 (1949)
16. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: 7. Computational performance - scikit-learn 0.18.1 documentation, http://scikit-learn.org/stable/modules/computational_performance.html. Accessed 08 Apr 2017
17. Kim, Y., Hassan, A., White, R.W., Zitouni, I.: Modeling dwell time to predict click-level satisfaction. In: Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pp. 193–202. ACM (2014)

Acknowledgements

This work has been partly funded by the EU OpenMinTeD project under the H2020-EINFRA-2014-2 call, Project ID: 654021. We would also like to acknowledge the support of Jisc for the CORE project.

Author information

Authors and Affiliations

CORE, Knowledge Media Institute, The Open University, Milton Keynes, UK
Aristotelis Charalampous & Petr Knoth

Corresponding author

Correspondence to Petr Knoth.

Editor information

Editors and Affiliations

Faculteit der Geesteswetenschappen, Universiteit van Amsterdam, Amsterdam, The Netherlands
Jaap Kamps
Library & Information Center, University of Patras, Patras, Greece
Giannis Tsakonas
Aristotle University of Thessaloniki, Thessaloniki, Greece
Yannis Manolopoulos
Civil Engineering, University of Thrace, Kimmeria, Greece
Lazaros Iliadis
Informatics, Ionian University, Kerkyra, Greece
Ioannis Karydis

About this paper

Cite this paper

Charalampous, A., Knoth, P. (2017). Classifying Document Types to Enhance Search and Recommendations in Digital Libraries. In: Kamps, J., Tsakonas, G., Manolopoulos, Y., Iliadis, L., Karydis, I. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2017. Lecture Notes in Computer Science, vol 10450. Springer, Cham. https://doi.org/10.1007/978-3-319-67008-9_15

DOI: https://doi.org/10.1007/978-3-319-67008-9_15
Published: 02 September 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67007-2
Online ISBN: 978-3-319-67008-9
eBook Packages: Computer Science, Computer Science (R0)
