Abstract
Indexing documents with controlled vocabularies enables a wealth of semantic applications for digital libraries. Due to the rapid growth of scientific publications, machine learning-based methods that assign subject descriptors automatically are required. While stability of the generative processes behind the underlying data is often tacitly assumed, it is violated in practice. Addressing this problem, this article studies explicit and implicit concept drift, that is, settings with new descriptor terms and new types of documents, respectively. First, the existence of concept drift in automatic subject indexing is discussed in detail and demonstrated by example. Subsequently, architectures for automatic indexing are analyzed in this regard, highlighting their individual strengths and weaknesses. The results of the theoretical analysis justify research on the fusion of different indexing approaches, with special consideration of information sharing among descriptors. Experimental results on titles and author keywords in the domain of economics underline the relevance of the fusion methodology, especially under concept drift. Fusion approaches outperformed non-fusion strategies on the tested data sets, which comprised shifts in the priors of descriptors as well as in covariates. These findings can help researchers and practitioners in digital libraries to choose appropriate methods for automatic subject indexing, as is finally shown by a recent case study.
Notes
www.eurovoc.europa.eu, accessed 28. 11. 2017.
www.nlm.nih.gov/mesh, accessed 28. 11. 2017.
www.fao.org/agrovoc, accessed 28. 11. 2017.
www.zbw.eu/en/stw-info, accessed 28. 11. 2017.
© 2017 IEEE. All rights reserved. Reprinted, with permission, from Martin Toepfer and Christin Seifert: Descriptor-invariant Fusion Architectures for Automatic Subject Indexing, 2017 ACM IEEE Joint Conference on Digital Libraries (JCDL). Personal use of this material is permitted. However, permission to reuse this material for any other purpose must be obtained from the IEEE.
The number of indexing terms depends on the particular content of a document and several other factors, such as individual institutional guidelines. As a consequence, averages reported in related work vary considerably. Some data sets are actually very similar to single-label document classification, as mentioned in Sect. 2.
www.w3.org/2004/02/skos, accessed 10. 11. 2017.
In related work, especially in the domain of machine learning, the term “label” is often used for classes, which in turn represent concepts.
This meaning of descriptors has been used in related work, but please note that descriptors denote special labels in SKOS.
At the time of the experiments (Sect. 7), release 9.02 was the latest version. Version 9.04 of the STW was released on June 21, 2017.
Different meanings of \( \mathbf {x} \) will be used in other sections, for instance, in Sect. 5.
https://github.com/JasonKessler/scattertext, accessed 24. 08. 2017.
Journal of Economic Literature (JEL) codes: https://www.aeaweb.org/econlit/jelCodes.php, accessed 10. 11. 2017.
Links to approaches that relax this constraint are given in the related work, see Sect. 2.
https://github.com/HaraldKi/monqjfa, accessed 10. 11. 2017.
https://github.com/zelandiya/maui, accessed 10. 11. 2017.
Several hours on several thousand documents.
www.scikit-learn.org, accessed 10. 11. 2017.
In some cases, the data were not shown to be normally distributed (Shapiro-Wilk test, \(p<0.05\)), and thus the assumptions for t tests were not met.
http://zbw.eu/stw/thsys/70002, accessed 10. 11. 2017.
http://zbw.eu/stw/thsys/70041, accessed 10. 11. 2017.
In STWFSA, we added special processing routines. For instance, it distinguishes upper and lower case words in certain cases, which in particular enables disambiguation of acronyms like SALT (Strategic Arms Limitation Talks) versus salt (mineral) or AIDS (disease) versus aids (plural of aid).
Forty-nine documents were rated by two indexers. The corresponding concept-level ratings were averaged, using the floor function to resolve averages that are not whole numbers.
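The averaging in the last note can be illustrated with a minimal sketch. It assumes that ratings are integers on an ordinal scale; the function name `merge_ratings` and the example values are hypothetical and not taken from the article. When two integer ratings sum to an odd value, their mean is not an integer, and the floor function rounds it down.

```python
def merge_ratings(rating_a: int, rating_b: int) -> int:
    """Average two integer ratings and round down (floor).

    Assumption: ratings are integers; the scale itself is not
    specified here and is purely illustrative.
    """
    # Integer floor division is equivalent to floor((a + b) / 2).
    return (rating_a + rating_b) // 2


# Example: indexers disagree by one step, so the mean 2.5 is floored to 2.
merged = merge_ratings(2, 3)
```

Flooring is a conservative choice: in case of disagreement by one step, the lower of the two ratings is kept.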
References
Aronson, A.R., Demner-Fushman, D., Humphrey, S.M., Lin, J.J., Ruch, P., Ruiz, M.E., Smith, L.H., Tanabe, L.K., Wilbur, W.J., Liu, H.: Fusion of knowledge-intensive and statistical approaches for retrieving and annotating textual genomics documents. In: Voorhees, E.M., Buckland, L.P. (eds.) Proceedings of the Text REtrieval Conference, TREC 2005, NIST, vol Special Publication 500-266 (2005)
Bornmann, L., Mutz, R.: Growth rates of modern science: a bibliometric analysis based on the number of publications and cited references. J. Assoc. Inf. Sci. Technol. 66(11), 2215–2222 (2015)
Breiman, L.: Bagging predictors. Mach. Learn. 24(2), 123–140 (1996). https://doi.org/10.1007/BF00058655
Brill, E.: Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging. Comput. Linguist. 21(4), 543–565 (1995)
Erbs, N., Gurevych, I., Rittberger, M.: Bringing order to digital libraries: from keyphrase extraction to index term assignment. D-Lib Mag. 19(9/10), 1–16 (2013). https://doi.org/10.1045/september2013-erbs
Ferber, R.: Automated indexing with thesaurus descriptors: A co-occurrence based approach to multilingual retrieval. In: Peters, C., Thanos, C. (eds.) Research and Advanced Technology for Digital Libraries, pp. 233–252. Springer, Berlin (1997). https://doi.org/10.1007/bfb0026731
Frank, E., Paynter, G.W., Witten, I.H., Gutwin, C., Nevill-Manning, C.G.: Domain-specific keyphrase extraction. In: Dean, T. (ed.) Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI ’99, Morgan Kaufmann, pp. 668–673 (1999)
Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., Bouchachia, A.: A survey on concept drift adaptation. ACM Comput. Surv. (CSUR) 46(4), 44 (2014)
Gastmeyer, M., Wannags, M., Neubert, J.: Relaunch des Standard-Thesaurus Wirtschaft—Dynamik in der Wissensrepräsentation. Inf. Wiss. Praxis. 67(4), 217–240 (2016). https://doi.org/10.1515/iwp-2016-0039
Gibaja, E., Ventura, S.: A tutorial on multilabel learning. ACM Comput. Surv. 47(3), 52:1–52:38 (2015). https://doi.org/10.1145/2716262
Große-Bölting, G., Nishioka, C., Scherp, A.: A comparison of different strategies for automated semantic document annotation. In: Proceedings of the International Conference on Knowledge Capture, K-CAP 2015, ACM, pp. 8:1–8:8 (2015). https://doi.org/10.1145/2815833.2815838
Jatowt, A., Duh, K.: A framework for analyzing semantic change of words across time. In: IEEE/ACM Joint Conference on Digital Libraries, JCDL 2014, London, United Kingdom, September 8–12, 2014, IEEE Computer Society, pp. 229–238 (2014). https://doi.org/10.1109/JCDL.2014.6970173
Jimeno-Yepes, A., Mork, J.G., Demner-Fushman, D., Aronson, A.R.: A one-size-fits-all indexing method does not exist: automatic selection based on meta-learning. JCSE 6(2), 151–160 (2012). https://doi.org/10.5626/JCSE.2012.6.2.151
Kessler, J.: Scattertext: a browser-based tool for visualizing how corpora differ. In: Bansal, M., Ji, H. (eds.) Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30–August 4, System Demonstrations, Association for Computational Linguistics, pp. 85–90 (2017). https://doi.org/10.18653/v1/P17-4015
Kosnik, L.R.: What have economists been doing for the last 50 years? A text analysis of published academic research from 1960–2010. Economics: The Open-Access, Open-Assessment E-Journal 9, 1–38 (2015). https://doi.org/10.5018/economics-ejournal.ja.2015-13
Lake, B.M., Salakhutdinov, R., Tenenbaum, J.B.: Human-level concept learning through probabilistic program induction. Science 350(6266), 1332–1338 (2015)
Lauser, B., Hotho, A.: Automatic multi-label subject indexing in a multilingual environment. In: Koch, T., Sølvberg, I. (eds.) Proceedings of the Conference on Research and Advanced Technology for Digital Libraries, ECDL 2003, Springer, LNCS, vol 2769, pp. 140–151 (2003). https://doi.org/10.1007/978-3-540-45175-4_14
Mencía, E.L., Fürnkranz, J.: Efficient multilabel classification algorithms for large-scale problems in the legal domain. In: Francesconi, E., Montemagni, S., Peters, W., Tiscornia, D. (eds.) Semantic Processing of Legal Texts—Where the Language of Law Meets the Law of Language, LNAI, vol 6036, 1st edn, pp. 192–215. Springer, Berlin (2010). https://doi.org/10.1007/978-3-642-12837-0_11
Manning, C.D.: Computational linguistics and deep learning. Comput. Linguist. 41(4), 701–707 (2015). https://doi.org/10.1162/COLI_a_00239
Medelyan, O., Witten, I.H.: Domain-independent automatic keyphrase indexing with small training sets. J. Am. Soc. Inf. Sci. Technol. 59(7), 1026–1040 (2008). https://doi.org/10.1002/asi.20790
Medelyan, O., Frank, E., Witten, I.H.: Human-competitive tagging using automatic keyphrase extraction. In: Koehn, P., Mihalcea, R. (eds.) Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2009, ACM, pp. 1318–1327 (2009)
Michel, J.B., Shen, Y.K., Aiden, A.P., Veres, A., Gray, M.K., Pickett, J.P., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., Pinker, S., Nowak, M.A., Aiden, E.L.: Quantitative analysis of culture using millions of digitized books. Science 331(6014), 176–182 (2010). https://doi.org/10.1126/science.1199644
Nam, J., Mencía, E.L., Kim, H.J., Fürnkranz, J.: Predicting unseen labels using label hierarchies in large-scale multi-label learning. In: Proceedings of the Machine Learning and Knowledge Discovery in Databases, ECML PKDD 2015, pp. 102–118. Springer (2015). https://doi.org/10.1007/978-3-319-23528-8_7
Palatucci, M., Pomerleau, D., Hinton, G., Mitchell, T.M.: Zero-shot learning with semantic output codes. In: Proceedings of the International Conference on Neural Information Processing Systems, NIPS ’09, Curran Associates Inc., USA, pp. 1410–1418 (2009)
Pouliquen, B., Steinberger, R., Ignat, C.: Automatic annotation of multilingual text collections with a conceptual thesaurus. In: Proceedings of the Workshop Ontologies and Information Extraction, EUROLAN 2003. arXiv:cs/0609059 (2003)
Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N.D. (eds.): Dataset shift in machine learning. Neural information processing series, MIT Press, Cambridge, Mass (2009). https://mitpress.mit.edu/books/dataset-shift-machine-learning
Rolling, L.N.: Indexing consistency, quality and efficiency. Inf. Process. Manag. 17(2), 69–76 (1981). https://doi.org/10.1016/0306-4573(81)90028-5
Sappadla, P.V., Nam, J., Mencía, E.L., Fürnkranz, J.: Using semantic similarity for multi-label zero-shot classification of text documents. In: Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, d-side publications (2016)
Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47 (2002)
Tahmasebi, N., Risse, T.: On the uses of word sense change for research in the digital humanities. In: Kamps, J., Tsakonas, G., Manolopoulos, Y., Iliadis, L.S., Karydis, I. (eds.) Proceedings of the Research and Advanced Technology for Digital Libraries—21st International Conference on Theory and Practice of Digital Libraries, TPDL 2017, Thessaloniki, Greece, September 18–21, 2017, Lecture Notes in Computer Science, vol. 10450, pp. 246–257. Springer (2017). https://doi.org/10.1007/978-3-319-67008-9_20
Ting, K.M., Witten, I.H.: Issues in stacked generalization. J. Artif. Intell. Res. (JAIR) 10, 271–289 (1999). https://doi.org/10.1613/jair.594
Toepfer, M., Seifert, C.: Descriptor-invariant fusion architectures for automatic subject indexing. In: ACM/IEEE Joint Conference on Digital Libraries, JCDL 2017, Toronto, ON, Canada, June 19–23, 2017, IEEE Computer Society, pp. 31–40 (2017). https://doi.org/10.1109/JCDL.2017.7991557
Toepfer, M., Seifert, C.: Towards semantic quality control of automatic subject indexing, pp. 616–619. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-67008-9_56
Toepfer, M., Corovic, H., Fette, G., Klügl, P., Störk, S., Puppe, F.: Fine-grained information extraction from German transthoracic echocardiography reports. BMC Med. Inform. Decis. Mak. 15, 91 (2015). https://doi.org/10.1186/s12911-015-0215-x
Tsoumakas, G., Laliotis, M., Markantonatos, N., Vlahavas, I.P.: Large-scale semantic indexing of biomedical publications. In: Ngomo, A.N., Paliouras, G. (eds.) Proceedings of the First Workshop on Bio-Medical Semantic Indexing and Question Answering, CEUR-WS.org, CEUR Workshop Proceedings, vol. 1094 (2013). http://ceur-ws.org/Vol-1094/bioasq2013_submission_6.pdf
Wilbur, W.J., Kim, W.: Stochastic gradient descent and the prediction of MeSH for PubMed records. In: Proceedings of the AMIA Annual Symposium, pp. 1198–1207 (2014)
Wolpert, D.H.: Stacked generalization. Neural Netw. 5(2), 241–259 (1992). https://doi.org/10.1016/S0893-6080(05)80023-1
Acknowledgements
We thank all reviewers for their constructive advice. Moreover, we would also like to thank the indexing experts of the ZBW for valuable discussions and their support in gathering data for the experiments.
Additional information
The article was mainly written while C. Seifert was affiliated with the University of Passau.
Cite this article
Toepfer, M., Seifert, C. Fusion architectures for automatic subject indexing under concept drift. Int J Digit Libr 21, 169–189 (2020). https://doi.org/10.1007/s00799-018-0240-3
DOI: https://doi.org/10.1007/s00799-018-0240-3