Automatic Discovery of Similar Words

  • Pierre Senellart
  • Vincent D. Blondel

This chapter reviews methods for automatically extracting similar words from different kinds of sources: large corpora of documents, the World Wide Web, and monolingual dictionaries. The underlying goal of these methods is, in general, the automatic discovery of synonyms. This goal is usually too difficult to achieve, however, because it is often hard to distinguish automatically among synonyms, antonyms, and, more generally, words that are semantically close to each other. Most methods therefore return words that are “similar” to each other, under some loosely defined notion of semantic similarity. We mainly describe two kinds of methods: techniques that, given a word as input, automatically compile a list of good synonyms or near-synonyms, and techniques that generate a thesaurus (from some source, they build a complete lexicon of related words). The two differ in that the latter generate a complete thesaurus all at once, and that thesaurus may not contain an entry for every word in the source. Nevertheless, the purposes of both kinds of techniques are very similar, and we shall therefore not distinguish much between them.


Keywords: Vector Space Model, Similar Word, Neighborhood Graph, Automatic Discovery, Principal Eigenvector


Copyright information

© Springer-Verlag London Limited 2008

Authors and Affiliations

  • Pierre Senellart (1)
  • Vincent D. Blondel (2)
  1. INRIA Futurs & Université Paris-Sud, Orsay Cedex, France
  2. Division of Applied Mathematics, Université de Louvain, Louvain-la-neuve, Belgium
