Fusion architectures for automatic subject indexing under concept drift

Toepfer, Martin; Seifert, Christin

doi:10.1007/s00799-018-0240-3

Analysis and empirical results on short texts

Published: 15 May 2018

Volume 21, pages 169–189 (2020)
International Journal on Digital Libraries

Martin Toepfer¹ & Christin Seifert²,³

Abstract

Indexing documents with controlled vocabularies enables a wealth of semantic applications for digital libraries. Due to the rapid growth of scientific publications, machine learning-based methods that assign subject descriptors automatically are required. While the stability of the generative processes behind the underlying data is often tacitly assumed, it is violated in practice. Addressing this problem, this article studies explicit and implicit concept drift, that is, settings with new descriptor terms and new types of documents, respectively. First, the existence of concept drift in automatic subject indexing is discussed in detail and demonstrated by example. Subsequently, architectures for automatic indexing are analyzed in this regard, highlighting individual strengths and weaknesses. The results of the theoretical analysis justify research on the fusion of different indexing approaches, with special consideration of information sharing among descriptors. Experimental results on titles and author keywords in the domain of economics underline the relevance of the fusion methodology, especially under concept drift. Fusion approaches outperformed non-fusion strategies on the tested data sets, which comprised shifts in priors of descriptors as well as covariates. These findings can help researchers and practitioners in digital libraries to choose appropriate methods for automatic subject indexing, as a recent case study finally shows.
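
To make the two notions from the abstract concrete, the following is a minimal sketch, not the authors' method, of how new descriptor terms and shifts in descriptor priors could be measured between two collections. The function names and the representation of a document as a set of descriptor strings are assumptions made for illustration only.

```python
from collections import Counter


def descriptor_priors(documents):
    """Relative frequency of each descriptor over a list of descriptor sets.

    Each document is assumed to be a set of descriptor strings.
    """
    if not documents:
        return {}
    counts = Counter(d for descriptors in documents for d in descriptors)
    return {d: c / len(documents) for d, c in counts.items()}


def drift_report(old_docs, new_docs):
    """Compare two collections of indexed documents.

    Returns the descriptors that occur only in the newer collection
    (new descriptor terms) and the change in prior for descriptors
    present in both collections (prior shift).
    """
    old_p = descriptor_priors(old_docs)
    new_p = descriptor_priors(new_docs)
    unseen = sorted(set(new_p) - set(old_p))
    shift = {d: new_p[d] - old_p[d] for d in set(old_p) & set(new_p)}
    return unseen, shift
```

Descriptors returned in the first list have no training examples in the older collection, which is the situation where purely associative classifiers cannot help and lexical or fusion approaches become relevant.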

Notes

www.eurovoc.europa.eu, accessed 28. 11. 2017.
www.nlm.nih.gov/mesh, accessed 28. 11. 2017.
www.fao.org/agrovoc, accessed 28. 11. 2017.
www.zbw.eu/en/stw-info, accessed 28. 11. 2017.
© 2017 IEEE. All rights reserved. Reprinted, with permission, from Martin Toepfer and Christin Seifert: Descriptor-invariant Fusion Architectures for Automatic Subject Indexing, 2017 ACM IEEE Joint Conference on Digital Libraries (JCDL). Personal use of this material is permitted. However, permission to reuse this material for any other purpose must be obtained from the IEEE.
The number of indexing terms depends on the particular content of a document and several other factors, such as individual institutional guidelines. As a consequence, averages reported in related work vary considerably. Some data sets are actually very similar to single-label document classification, as mentioned in Sect. 2.
www.w3.org/2004/02/skos, accessed 10. 11. 2017.
In related work, especially in the domain of machine learning, the term “label” is often used for classes, which in turn represent concepts.
This meaning of descriptors has been used in related work, but please note that descriptors denote special labels in SKOS.
At the time of the experiments (Sect. 7), release 9.02 was the latest version. Version 9.04 of the STW was released on June 21, 2017.
Different meanings of \( \mathbf {x} \) will be used in other sections, for instance, in Sect. 5.
https://github.com/JasonKessler/scattertext, accessed 24. 08. 2017.
Journal of Economic Literature (JEL) codes: https://www.aeaweb.org/econlit/jelCodes.php, accessed 10. 11. 2017.
Links to approaches that relax this constraint are given in the related work, see Sect. 2.
https://github.com/HaraldKi/monqjfa, accessed 10. 11. 2017.
https://github.com/zelandiya/maui, accessed 10. 11. 2017.
several hours on several thousand documents.
www.scikit-learn.org, accessed 10. 11. 2017.
In some cases, the data were not shown to be normally distributed (Shapiro-Wilk test, \(p<0.05\)), and thus the assumptions for t tests were not met.
http://zbw.eu/stw/thsys/70002, accessed 10. 11. 2017.
http://zbw.eu/stw/thsys/70041, accessed 10. 11. 2017.
In STWFSA, we added special processing routines. For instance, STWFSA distinguishes uppercase and lowercase words in certain cases, which in particular enables disambiguation of acronyms like SALT (Strategic Arms Limitation Talks) versus salt (mineral) or AIDS (disease) versus aids (plural of aid).
Forty-nine documents were rated by two indexers. The corresponding concept-level ratings were averaged, using the floor function to resolve odd values.
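
The last two notes can be illustrated with a minimal sketch. It is not the STWFSA implementation: the two lexicons, the function names, and the assumption that ratings are integers whose mean is rounded down when fractional are all hypothetical.

```python
import math

# Hypothetical lexicon: surface forms that must match with exact case.
CASE_SENSITIVE = {
    "SALT": "Strategic Arms Limitation Talks",
    "AIDS": "AIDS (disease)",
}

# Hypothetical lexicon: surface forms matched after lowercasing.
CASE_INSENSITIVE = {
    "salt": "salt (mineral)",
    "aids": "aid (plural)",
}


def lookup(token):
    """Return a concept label for a token, or None if it is unknown.

    Exact-case acronym matches take precedence over lowercased matches.
    """
    if token in CASE_SENSITIVE:
        return CASE_SENSITIVE[token]
    return CASE_INSENSITIVE.get(token.lower())


def averaged_rating(rating_a, rating_b):
    """Average two integer ratings, rounding a fractional mean down."""
    return math.floor((rating_a + rating_b) / 2)
```

With these toy lexicons, `lookup("SALT")` and `lookup("salt")` resolve to different concepts, and a sentence-initial `"Salt"` falls back to the lowercased entry.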

References

Aronson, A.R., Demner-Fushman, D., Humphrey, S.M., Lin, J.J., Ruch, P., Ruiz, M.E., Smith, L.H., Tanabe, L.K., Wilbur, W.J., Liu, H.: Fusion of knowledge-intensive and statistical approaches for retrieving and annotating textual genomics documents. In: Voorhees, E.M., Buckland, L.P. (eds.) Proceedings of the Text REtrieval Conference, TREC 2005, NIST Special Publication 500-266 (2005)
Bornmann, L., Mutz, R.: Growth rates of modern science: a bibliometric analysis based on the number of publications and cited references. J. Assoc. Inf. Sci. Technol. 66(11), 2215–2222 (2015)
Breiman, L.: Bagging predictors. Mach. Learn. 24(2), 123–140 (1996). https://doi.org/10.1007/BF00058655
Brill, E.: Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging. Comput. Linguist. 21(4), 543–565 (1995)
Erbs, N., Gurevych, I., Rittberger, M.: Bringing order to digital libraries: from keyphrase extraction to index term assignment. D-Lib Mag. 19(9/10), 1–16 (2013). https://doi.org/10.1045/september2013-erbs
Ferber, R.: Automated indexing with thesaurus descriptors: A co-occurrence based approach to multilingual retrieval. In: Peters, C., Thanos, C. (eds.) Research and Advanced Technology for Digital Libraries, pp. 233–252. Springer, Berlin (1997). https://doi.org/10.1007/bfb0026731
Frank, E., Paynter, G.W., Witten, I.H., Gutwin, C., Nevill-Manning, C.G.: Domain-specific keyphrase extraction. In: Dean, T. (ed.) Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI ’99, Morgan Kaufmann, pp. 668–673 (1999)
Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., Bouchachia, A.: A survey on concept drift adaptation. ACM Comput. Surv. (CSUR) 46(4), 44 (2014)
Gastmeyer, M., Wannags, M., Neubert, J.: Relaunch des Standard-Thesaurus Wirtschaft—Dynamik in der Wissensrepräsentation. Inf. Wiss. Praxis. 67(4), 217–240 (2016). https://doi.org/10.1515/iwp-2016-0039
Gibaja, E., Ventura, S.: A tutorial on multilabel learning. ACM Comput. Surv. 47(3), 52:1–52:38 (2015). https://doi.org/10.1145/2716262
Große-Bölting, G., Nishioka, C., Scherp, A.: A comparison of different strategies for automated semantic document annotation. In: Proceedings of the International Conference on Knowledge Capture, K-CAP 2015, ACM, pp. 8:1–8:8 (2015). https://doi.org/10.1145/2815833.2815838
Jatowt, A., Duh, K.: A framework for analyzing semantic change of words across time. In: IEEE/ACM Joint Conference on Digital Libraries, JCDL 2014, London, United Kingdom, September 8–12, 2014, IEEE Computer Society, pp. 229–238 (2014). https://doi.org/10.1109/JCDL.2014.6970173
Jimeno-Yepes, A., Mork, J.G., Demner-Fushman, D., Aronson, A.R.: A one-size-fits-all indexing method does not exist: automatic selection based on meta-learning. JCSE 6(2), 151–160 (2012). https://doi.org/10.5626/JCSE.2012.6.2.151
Kessler, J.: Scattertext: a browser-based tool for visualizing how corpora differ. In: Bansal, M., Ji, H. (eds.) Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30–August 4, System Demonstrations, Association for Computational Linguistics, pp. 85–90 (2017). https://doi.org/10.18653/v1/P17-4015
Kosnik, L.R.: What have economists been doing for the last 50 years? A text analysis of published academic research from 1960–2010. Economics: The Open-Access, Open-Assessment E-Journal 9, 1–38 (2015). https://doi.org/10.5018/economics-ejournal.ja.2015-13
Lake, B.M., Salakhutdinov, R., Tenenbaum, J.B.: Human-level concept learning through probabilistic program induction. Science 350(6266), 1332–1338 (2015)
Lauser, B., Hotho, A.: Automatic multi-label subject indexing in a multilingual environment. In: Koch, T., Sølvberg, I. (eds.) Proceedings of the Conference on Research and Advanced Technology for Digital Libraries, ECDL 2003, Springer, LNCS, vol 2769, pp. 140–151 (2003). https://doi.org/10.1007/978-3-540-45175-4_14
Mencía, E.L., Fürnkranz, J.: Efficient multilabel classification algorithms for large-scale problems in the legal domain. In: Francesconi, E., Montemagni, S., Peters, W., Tiscornia, D. (eds.) Semantic Processing of Legal Texts—Where the Language of Law Meets the Law of Language, LNAI, vol 6036, 1st edn, pp. 192–215. Springer, Berlin (2010). https://doi.org/10.1007/978-3-642-12837-0_11
Manning, C.D.: Computational linguistics and deep learning. Comput. Linguist. 41(4), 701–707 (2015). https://doi.org/10.1162/COLI_a_00239
Medelyan, O., Witten, I.H.: Domain-independent automatic keyphrase indexing with small training sets. J. Am. Soc. Inf. Sci. Technol. 59(7), 1026–1040 (2008). https://doi.org/10.1002/asi.20790
Medelyan, O., Frank, E., Witten, I.H.: Human-competitive tagging using automatic keyphrase extraction. In: Koehn, P., Mihalcea, R. (eds.) Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2009, ACM, pp. 1318–1327 (2009)
Michel, J.B., Shen, Y.K., Aiden, A.P., Veres, A., Gray, M.K., Pickett, J.P., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., Pinker, S., Nowak, M.A., Aiden, E.L.: Quantitative analysis of culture using millions of digitized books. Science 331(6014), 176–182 (2010). https://doi.org/10.1126/science.1199644
Nam, J., Mencía, E.L., Kim, H.J., Fürnkranz, J.: Predicting unseen labels using label hierarchies in large-scale multi-label learning. In: Proceedings of the Machine Learning and Knowledge Discovery in Databases, ECML PKDD 2015, pp. 102–118. Springer (2015). https://doi.org/10.1007/978-3-319-23528-8_7
Palatucci, M., Pomerleau, D., Hinton, G., Mitchell, T.M.: Zero-shot learning with semantic output codes. In: Proceedings of the International Conference on Neural Information Processing Systems, NIPS ’09, Curran Associates Inc., USA, pp. 1410–1418 (2009)
Pouliquen, B., Steinberger, R., Ignat, C.: Automatic annotation of multilingual text collections with a conceptual thesaurus. In: Proceedings of the Workshop Ontologies and Information Extraction, EUROLAN 2003. arXiv:cs/0609059 (2003)
Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N.D. (eds.): Dataset shift in machine learning. Neural information processing series, MIT Press, Cambridge, Mass (2009). https://mitpress.mit.edu/books/dataset-shift-machine-learning
Rolling, L.N.: Indexing consistency, quality and efficiency. Inf. Process. Manag. 17(2), 69–76 (1981). https://doi.org/10.1016/0306-4573(81)90028-5
Sappadla, P.V., Nam, J., Mencía, E.L., Fürnkranz, J.: Using semantic similarity for multi-label zero-shot classification of text documents. In: Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, d-side publications (2016)
Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47 (2002)
Tahmasebi, N., Risse, T.: On the uses of word sense change for research in the digital humanities. In: Kamps, J., Tsakonas, G., Manolopoulos, Y., Iliadis, L.S., Karydis, I. (eds.) Proceedings of the Research and Advanced Technology for Digital Libraries—21st International Conference on Theory and Practice of Digital Libraries, TPDL 2017, Thessaloniki, Greece, September 18–21, 2017, Lecture Notes in Computer Science, vol. 10450, pp 246–257. Springer (2017). https://doi.org/10.1007/978-3-319-67008-9_20
Ting, K.M., Witten, I.H.: Issues in stacked generalization. J. Artif. Intell. Res. (JAIR) 10, 271–289 (1999). https://doi.org/10.1613/jair.594
Toepfer, M., Seifert, C.: Descriptor-invariant fusion architectures for automatic subject indexing. In: ACM/IEEE Joint Conference on Digital Libraries, JCDL 2017, Toronto, ON, Canada, June 19–23, 2017, IEEE Computer Society, pp. 31–40 (2017). https://doi.org/10.1109/JCDL.2017.7991557
Toepfer, M., Seifert, C.: Towards Semantic Quality Control of Automatic Subject Indexing, pp. 616–619. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-67008-9_56
Toepfer, M., Corovic, H., Fette, G., Klügl, P., Störk, S., Puppe, F.: Fine-grained information extraction from German transthoracic echocardiography reports. BMC Med. Inform. Decis. Mak. 15, 91 (2015). https://doi.org/10.1186/s12911-015-0215-x
Tsoumakas, G., Laliotis, M., Markantonatos, N., Vlahavas, I.P.: Large-scale semantic indexing of biomedical publications. In: Ngomo, A.N., Paliouras, G. (eds.) Proceedings of the First Workshop on Bio-Medical Semantic Indexing and Question Answering, CEUR-WS.org, CEUR Workshop Proceedings, vol. 1094 (2013). http://ceur-ws.org/Vol-1094/bioasq2013_submission_6.pdf
Wilbur, W.J., Kim, W.: Stochastic gradient descent and the prediction of MeSH for PubMed records. In: Proceedings of the AMIA Annual Symposium, pp. 1198–1207 (2014)
Wolpert, D.H.: Stacked generalization. Neural Netw. 5(2), 241–259 (1992). https://doi.org/10.1016/S0893-6080(05)80023-1

Acknowledgements

We thank all reviewers for their constructive advice. Moreover, we would also like to thank the indexing experts of the ZBW for valuable discussions and their support in gathering data for the experiments.

Author information

Authors and Affiliations

ZBW – Leibniz Information Centre for Economics, Düsternbrooker Weg 120, 24105, Kiel, Germany
Martin Toepfer
University of Passau, Innstraße 43, 94032, Passau, Germany
Christin Seifert
University of Twente, Drienerlolaan 5, 7522 NB, Enschede, The Netherlands
Christin Seifert

Corresponding author

Correspondence to Martin Toepfer.

Additional information

The article was mainly written while C. Seifert was affiliated with the University of Passau.

About this article

Cite this article

Toepfer, M., Seifert, C. Fusion architectures for automatic subject indexing under concept drift. Int J Digit Libr 21, 169–189 (2020). https://doi.org/10.1007/s00799-018-0240-3

Received: 16 September 2017
Revised: 27 April 2018
Accepted: 03 May 2018
Published: 15 May 2018
Issue Date: June 2020
DOI: https://doi.org/10.1007/s00799-018-0240-3
