Wikifying Novel Words to Mixtures of Wikipedia Senses by Structured Sparse Coding

  • Conference paper
Pattern Recognition Applications and Methods

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 318)

Abstract

We extend the scope of Wikification to novel words by relaxing two of its premises: (i) we wikify without using the surface form of the word, and (ii) we wikify to a mixture of Wikipedia senses instead of a single sense. We identify two types of “novel” words: words where the connection between the surface form and the meaning is broken (e.g., a misspelled word), and words where there is no existing meaning to connect to because the meaning itself is novel. We propose a method capable of wikifying both types of novel words while also handling the inherently large-scale disambiguation problem. We show that the method can disambiguate between up to 1,000 Wikipedia senses, and that it can explain words with novel meaning as a mixture of other, possibly related senses. This mixture representation compares favorably to the widely used bag-of-words representation.
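
To make the idea of a sense mixture concrete, the following is a minimal sketch and not the authors' implementation. It assumes a non-overlapping group lasso objective in which each group of dictionary columns collects example contexts of one Wikipedia sense, and it solves the problem by proximal gradient descent (ISTA). The function names, the toy orthonormal dictionary, the group layout, and the value of `lam` are all hypothetical; the abstract does not specify the formulation.

```python
import numpy as np

def group_soft_threshold(v, groups, tau):
    """Proximal operator of tau * sum_g ||v_g||_2 for non-overlapping groups."""
    out = np.zeros_like(v)
    for idx in groups:
        norm = np.linalg.norm(v[idx])
        if norm > tau:
            # Shrink the whole group towards zero; groups with small norm vanish.
            out[idx] = (1.0 - tau / norm) * v[idx]
    return out

def group_lasso_code(x, D, groups, lam, n_iter=200):
    """Minimise 0.5 * ||x - D a||^2 + lam * sum_g ||a_g||_2 with ISTA."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2  # 1 / Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)
        a = group_soft_threshold(a - step * grad, groups, step * lam)
    return a

def sense_mixture(a, groups):
    """Normalised per-group coefficient norms, read as mixture weights over senses."""
    w = np.array([np.linalg.norm(a[idx]) for idx in groups])
    total = w.sum()
    return w / total if total > 0 else w

# Toy setup (assumption): 3 senses, 4 context vectors each, 20-dimensional features.
rng = np.random.default_rng(0)
D, _ = np.linalg.qr(rng.standard_normal((20, 12)))  # orthonormal columns
groups = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]

# A "novel word" whose context is built from senses 0 and 2 only.
x = D[:, groups[0]] @ np.full(4, 0.5) + D[:, groups[2]] @ np.ones(4)

a = group_lasso_code(x, D, groups, lam=0.1)
mixture = sense_mixture(a, groups)
print(mixture)  # sense 1 receives zero weight by construction
```

The group penalty switches whole senses on or off rather than individual contexts, which is what lets the coefficient vector be read as a mixture over senses. In this toy case the dictionary is orthonormal, so the unused sense is removed exactly; with a real, correlated dictionary the selection would be less clean.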

The work was carried out while Zoltán Szabó was working at Eötvös Loránd University, Hungary.

Notes

  1. The form of a word as it appears in the text.

  2. The word to be explained with Wikipedia senses.

  3. In “A rose is a rose is a rose”, there are three word types (a, rose, is), but eight word tokens.

  4. Downloaded from http://dumps.wikimedia.org/enwiki/.
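
The type/token distinction in note 3 can be checked with a short sketch. The tokenizer here (lowercasing, then taking runs of ASCII letters) is an assumption made for illustration; the notes do not say how the text was tokenized.

```python
import re
from collections import Counter

def tokens_and_types(text):
    """Return the list of word tokens and a Counter keyed by word type."""
    # Assumed tokenization: case-insensitive runs of the letters a-z.
    tokens = re.findall(r"[a-z]+", text.lower())
    return tokens, Counter(tokens)

tokens, types = tokens_and_types("A rose is a rose is a rose")
print(len(tokens), "tokens;", len(types), "types:", sorted(types))
# 8 tokens; 3 types: ['a', 'is', 'rose']
```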

Acknowledgments

The research has been supported by the ‘European Robotic Surgery’ EC FP7 grant (no.: 288233). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of other members of the consortium or the European Commission. The research was carried out as part of the EITKIC_12-1-2012-0001 project, which is supported by the Hungarian Government, managed by the National Development Agency, financed by the Research and Technology Innovation Fund and was performed in cooperation with the EIT ICT Labs Budapest Associate Partner Group.

Author information

Corresponding author

Correspondence to András Lőrincz.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Pintér, B., Vörös, G., Szabó, Z., Lőrincz, A. (2015). Wikifying Novel Words to Mixtures of Wikipedia Senses by Structured Sparse Coding. In: Fred, A., De Marsico, M. (eds) Pattern Recognition Applications and Methods. Advances in Intelligent Systems and Computing, vol 318. Springer, Cham. https://doi.org/10.1007/978-3-319-12610-4_15

  • DOI: https://doi.org/10.1007/978-3-319-12610-4_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12609-8

  • Online ISBN: 978-3-319-12610-4

  • eBook Packages: Engineering, Engineering (R0)
