Wikifying Novel Words to Mixtures of Wikipedia Senses by Structured Sparse Coding

Pintér, Balázs; Vörös, Gyula; Szabó, Zoltán; Lőrincz, András

doi:10.1007/978-3-319-12610-4_15

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 318)

Abstract

We extend the scope of Wikification to novel words by relaxing two of its premises: (i) we wikify without using the surface form of the word, and (ii) we wikify to a mixture of Wikipedia senses instead of a single sense. We identify two types of “novel” words: words where the connection between their surface form and their meaning is broken (e.g., a misspelled word), and words where there is no meaning to connect to because the meaning itself is also novel. We propose a method capable of wikifying both types of novel words while also dealing with the inherently large-scale disambiguation problem. We show that the method can disambiguate between up to 1,000 Wikipedia senses, and that it can explain words with novel meaning as a mixture of other, possibly related senses. This mixture representation compares favorably to the widely used bag-of-words representation.
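The abstract does not spell out the optimization problem, so the following is only a minimal sketch of how a "mixture of senses" could be obtained by structured sparse coding. It assumes a group-lasso objective, min over a of 0.5·||x − D a||² + λ·Σ_g ||a_g||₂, where x is the bag-of-words context of the novel word, each column of the dictionary D is a context tagged with one Wikipedia sense, and each group g collects the columns of one sense. The solver (plain proximal gradient), the toy dictionary, and all names (`group_lasso_code`, `sense_mixture`, `lam`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def group_soft_threshold(a, groups, t):
    """Proximal operator of t * sum_g ||a_g||_2 for non-overlapping groups."""
    out = np.zeros_like(a)
    for idx in groups:
        norm = np.linalg.norm(a[idx])
        if norm > t:  # groups with a small norm are set to zero as a whole
            out[idx] = (1.0 - t / norm) * a[idx]
    return out

def group_lasso_code(x, D, groups, lam=0.1, n_iter=200):
    """Minimise 0.5*||x - D a||^2 + lam * sum_g ||a_g||_2 by proximal gradient."""
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)
        a = group_soft_threshold(a - step * grad, groups, step * lam)
    return a

def sense_mixture(a, groups):
    """Turn group norms of the code into mixture weights over senses."""
    norms = np.array([np.linalg.norm(a[idx]) for idx in groups])
    total = norms.sum()
    return norms / total if total > 0 else norms

# Toy setup (hypothetical): 6 dictionary columns, 2 per sense, 3 senses.
# An identity dictionary keeps the example easy to check by hand.
D = np.eye(6)
groups = [[0, 1], [2, 3], [4, 5]]
x = np.array([0.6, 0.8, 0.3, 0.0, 0.05, 0.0])  # context of the novel word

a = group_lasso_code(x, D, groups, lam=0.1)
mixture = sense_mixture(a, groups)
print(mixture)  # weights over the three senses; the weakest sense is dropped
```

With this toy input the third sense has too little support to survive the group penalty, so the word is explained as a mixture of the first two senses only. A real dictionary would have far more columns and senses, and the choice of λ controls how many senses remain in the mixture.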

The work was carried out while Zoltán Szabó was working at Eötvös Loránd University, Hungary.

Notes

1. The form of a word as it appears in the text.
2. The word to be explained with Wikipedia senses.
3. In “A rose is a rose is a rose”, there are three word types (a, rose, is), but eight word tokens.
4. Downloaded from http://dumps.wikimedia.org/enwiki/.
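
The type/token distinction in note 3 can be checked with a few lines of code. This is only an illustrative sketch; lowercasing and whitespace splitting are simplifying assumptions, not the tokenization used in the paper.

```python
from collections import Counter

sentence = "A rose is a rose is a rose"

# Word tokens: every occurrence counts separately.
tokens = sentence.lower().split()

# Word types: distinct words, here with their frequencies.
types = Counter(tokens)

print(len(tokens), "tokens;", len(types), "types:", sorted(types))
```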

Acknowledgments

The research has been supported by the ‘European Robotic Surgery’ EC FP7 grant (no.: 288233). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of other members of the consortium or the European Commission. The research was carried out as part of the EITKIC_12-1-2012-0001 project, which is supported by the Hungarian Government, managed by the National Development Agency, financed by the Research and Technology Innovation Fund and was performed in cooperation with the EIT ICT Labs Budapest Associate Partner Group.

Author information

Authors and Affiliations

Faculty of Informatics, Eötvös Loránd University, Pázmány P. sétány 1/C, Budapest, 1117, Hungary
Balázs Pintér, Gyula Vörös, Zoltán Szabó & András Lőrincz
Gatsby Computational Neuroscience Unit, University College London, Alexandra House, 17 Queen Square, London, WC1N 3AR, UK
Zoltán Szabó

Corresponding author

Correspondence to András Lőrincz.

Editor information

Editors and Affiliations

Instituto de Telecomunicações, Instituto Superior Técnico, Technical University of Lisbon, Lisbon, Portugal
Ana Fred
Department of Computer Science, Sapienza University of Rome, Roma, Italy
Maria De Marsico

About this paper

Cite this paper

Pintér, B., Vörös, G., Szabó, Z., Lőrincz, A. (2015). Wikifying Novel Words to Mixtures of Wikipedia Senses by Structured Sparse Coding. In: Fred, A., De Marsico, M. (eds) Pattern Recognition Applications and Methods. Advances in Intelligent Systems and Computing, vol 318. Springer, Cham. https://doi.org/10.1007/978-3-319-12610-4_15

DOI: https://doi.org/10.1007/978-3-319-12610-4_15
Published: 23 November 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12609-8
Online ISBN: 978-3-319-12610-4
eBook Packages: Engineering, Engineering (R0)
