Domain-agnostic discovery of similarities and concepts at scale

  • Regular Paper
Knowledge and Information Systems

Abstract

Appropriately defining and efficiently calculating similarities from large data sets are often essential in data mining, both for gaining understanding of data and generating processes and for building tractable representations. Given a set of objects and their correlations, we here rely on the premise that each object is characterized by its context, i.e., its correlations to the other objects. The similarity between two objects can then be expressed in terms of the similarity between their contexts. In this way, similarity pertains to the general notion that objects are similar if they are exchangeable in the data. We propose a scalable approach for calculating all relevant similarities among objects by relating them in a correlation graph that is transformed to a similarity graph. These graphs can express rich structural properties among objects. Specifically, we show that concepts—abstractions of objects—are constituted by groups of similar objects that can be discovered by clustering the objects in the similarity graph. These principles and methods are applicable in a wide range of fields and will be demonstrated here in three domains: computational linguistics, music, and molecular biology, where the numbers of objects and correlations range from small to very large.
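The pipeline described in the abstract (correlation graph, then similarity graph, then concepts as clusters) can be illustrated with a minimal sketch. The article's own similarity measure and clustering algorithm are not given in this excerpt, so the code below swaps in two stand-ins that are labeled as assumptions: a weighted Jaccard overlap between contexts, and connected components of the thresholded similarity graph. The function names, the `min_sim` threshold, and the toy edge weights are all hypothetical.

```python
from collections import defaultdict


def build_correlation_graph(edges):
    """Turn (a, b, weight) triples into a symmetric adjacency dict."""
    corr = defaultdict(dict)
    for a, b, w in edges:
        corr[a][b] = w
        corr[b][a] = w
    return dict(corr)


def context_similarity(corr, a, b):
    """Weighted Jaccard overlap of the contexts of a and b.

    Assumption: this measure is a stand-in, not the article's definition.
    The context of an object is its correlations to all other objects.
    """
    ctx_a = {k: w for k, w in corr.get(a, {}).items() if k not in (a, b)}
    ctx_b = {k: w for k, w in corr.get(b, {}).items() if k not in (a, b)}
    keys = set(ctx_a) | set(ctx_b)
    num = sum(min(ctx_a.get(k, 0.0), ctx_b.get(k, 0.0)) for k in keys)
    den = sum(max(ctx_a.get(k, 0.0), ctx_b.get(k, 0.0)) for k in keys)
    return num / den if den > 0 else 0.0


def similarity_graph(corr, min_sim=0.5):
    """Compare only pairs that share a neighbour, and keep strong similarities."""
    by_neighbour = defaultdict(set)
    for obj, ctx in corr.items():
        for nb in ctx:
            by_neighbour[nb].add(obj)
    candidates = set()
    for objs in by_neighbour.values():
        ordered = sorted(objs)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                candidates.add((a, b))
    sims = {}
    for a, b in candidates:
        s = context_similarity(corr, a, b)
        if s >= min_sim:
            sims[(a, b)] = s
    return sims


def concepts(sims):
    """Group objects into connected components of the similarity graph.

    Assumption: connected components stand in for a real clustering method.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in sims:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for x in list(parent):
        groups[find(x)].add(x)
    return sorted(sorted(g) for g in groups.values())


# Toy data: hypothetical word correlations, not taken from the article.
edges = [
    ("coffee", "drink", 0.9), ("tea", "drink", 0.8),
    ("coffee", "cup", 0.7), ("tea", "cup", 0.7),
    ("piano", "play", 0.9), ("guitar", "play", 0.8),
    ("piano", "music", 0.6), ("guitar", "music", 0.7),
]
corr = build_correlation_graph(edges)
sims = similarity_graph(corr)
print(concepts(sims))
```

In this toy graph "coffee" and "tea" are never directly correlated, yet they come out as similar because they are exchangeable with respect to "drink" and "cup". Restricting comparisons to pairs that share at least one neighbour is one simple way to avoid scoring every pair when the correlation graph is sparse.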

Notes

  1. That is, most objects are either completely unrelated or at most negligibly correlated. Two randomly selected persons in a large social network, for instance, most likely do not know each other.

  2. https://github.com/sics-dna/concepts.

  3. http://www.last.fm/.

  4. http://nlp.stanford.edu/projects/glove/.

  5. https://catalog.ldc.upenn.edu/LDC2011T07.

  6. https://aws.amazon.com/datasets/google-books-ngrams/.

  7. http://aws.amazon.com/ec2/.

References

  1. Albert R, Barabási A-L (2002) Statistical mechanics of complex networks. Rev Mod Phys 74(1):47–97

  2. Alexandrov A, Bergmann R, Ewen S et al (2014) The Stratosphere platform for big data analytics. VLDB J 23:163–181

  3. Anisimova M, Kosiol C (2009) Investigating protein-coding sequence evolution with probabilistic codon substitution models. Mol Biol Evol 26(2):255–271

  4. Bitton D, Boral H, DeWitt DJ et al (1983) Parallel algorithms for the execution of relational database operations. ACM Trans Database Syst 8(3):324–353

  5. Bouma G (2009) Normalized (pointwise) mutual information in collocation extraction. In: From form to meaning: processing texts automatically, Proceedings of the Biennial GSCL Conference, pp 31–40

  6. Brown PF, deSouza PV, Mercer RL et al (1992) Class-based N-gram models of natural language. Comput Linguist 18(4):467–479

  7. Cancho RF, Solé RV (2001) The small world of human language. Proc R Soc Lond B Biol Sci 268(1482):2261–2265

  8. Celma Ò (2010) Music recommendation and discovery in the long tail. Springer, Berlin

  9. Celma Ò, Cano P (2008) From hits to niches? Or how popular artists can bias music recommendation and discovery. In: Proceedings of the 2nd KDD workshop on large-scale recommender systems and the Netflix Prize Competition. ACM, p 5

  10. Chandra AK, Merlin PM (1977) Optimal implementation of conjunctive queries in relational data bases. In: Proceedings of the ninth annual ACM symposium on theory of computing, STOC ’77. ACM, New York, NY, USA, pp 77–90

  11. Chelba C, Mikolov T, Schuster M et al (2013) One billion word benchmark for measuring progress in statistical language modeling. CoRR arXiv:1312.3005

  12. Church KW, Hanks P (1990) Word association norms, mutual information, and lexicography. Comput Linguist 16(1):22–29

  13. Dayhoff MO, Schwartz RM (1978) Chapter 22: A model of evolutionary change in proteins. In: Atlas of protein sequence and structure

  14. Dice LR (1945) Measures of the amount of ecologic association between species. Ecology 26(3):297–302

  15. Finkelstein L, Gabrilovich E, Matias Y et al (2001) Placing search in context: the concept revisited. In: Proceedings of the 10th international conference on World Wide Web, WWW ’01. ACM, New York, NY, USA, pp 406–414

  16. Firth JR (1957) A synopsis of linguistic theory 1930–55. In: Studies in linguistic analysis (special volume of the Philological Society), vol 1952–59. The Philological Society, pp 1–32

  17. Fortunato S (2010) Community detection in graphs. Phys Rep 486(3–5):75–174

  18. Görnerup O, Gillblad D, Vasiloudis T (2015) Knowing an object by the company it keeps: a domain-agnostic scheme for similarity discovery. In: IEEE international conference on data mining (ICDM 2015)

  19. Halawi G, Dror G, Gabrilovich E et al (2012) Large-scale learning of word relatedness with constraints. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, NY, USA, pp 1406–1414

  20. Harispe S, Ranwez S, Janaqi S et al (2015) Semantic similarity from natural language and ontology analysis. Synth Lect Hum Lang Technol 8(1):1–254

  21. Harris Z (1954) Distributional structure. Word 10(2–3):146–162

  22. Hill F, Reichart R, Korhonen A (2014) Simlex-999: evaluating semantic models with (genuine) similarity estimation. CoRR arXiv:1408.3456

  23. Jaccard P (1912) The distribution of the flora in the alpine zone. New Phytol 11(2):37–50

  24. Jeh G, Widom J (2002) Simrank: a measure of structural-context similarity. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’02. ACM, New York, NY, USA, pp 538–543

  25. Jordan IK, Mariño Ramírez L, Wolf YI et al (2004) Conservation and coevolution in the scale-free human gene coexpression network. Mol Biol Evol 21(11):2058–2070

  26. Kessler M (1963) Bibliographic coupling between scientific papers. Am Doc 14:10–25

  27. Koutris P, Suciu D (2011) Parallel evaluation of conjunctive queries. In: Proceedings of the thirteenth ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, PODS ’11. ACM, New York, NY, USA, pp 223–234

  28. Larson R (1996) Bibliometrics of the World Wide Web: an exploratory analysis of the intellectual structure of cyberspace. In: Annual meeting of the American Society for Information Science

  29. Leicht EA, Holme P, Newman MEJ (2006) Vertex similarity in networks. Phys Rev E 73:026120

  30. Lin Y, Michel J, Aiden EL et al (2012) Syntactic annotations for the Google Books Ngram Corpus. In: Proceedings of the ACL 2012 system demonstrations, ACL ’12. Association for Computational Linguistics, Stroudsburg, PA, USA, pp 169–174

  31. Michel JB, Shen YK, Aiden AP, Veres A, Gray MK, Pickett JP, Hoiberg D, Clancy D, Norvig P, Orwant J, Pinker S, Nowak MA, Aiden EL (2011) Quantitative analysis of culture using millions of digitized books. Science 331(6014):176–182

  32. Mihalcea R, Radev D (2011) Graph-based natural language processing and information retrieval. Cambridge University Press, Cambridge

  33. Mikolov T, Sutskever I, Chen K et al (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119

  34. Miller GA (1995) Wordnet: a lexical database for English. Commun ACM 38(11):39–41

  35. Mislove A, Marcon M, Gummadi KP et al (2007) Measurement and analysis of online social networks. In: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, IMC ’07. ACM, New York, NY, USA, pp 29–42

  36. Nirenberg M, Leder P, Bernfield M et al (1965) RNA codewords and protein synthesis, VII. On the general nature of the RNA code. Proc Natl Acad Sci 53:1161–1168

  37. Palla G, Derényi I, Farkas I et al (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature 435(7043):814–818

  38. Pecina P (2008) A machine learning approach to multiword expression extraction. In: Proceedings of the LREC 2008 workshop towards a shared task for multiword expressions. European Language Resources Association, pp 54–57

  39. Pennington J, Socher R, Manning C (2014) Glove: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, pp 1532–1543

  40. Ravasz E, Somera AL, Mongru DA et al (2002) Hierarchical organization of modularity in metabolic networks. Science 297(5586):1551–1555

  41. Sahlgren M (2006) The Word-Space Model: using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. Ph.D. thesis, Stockholm University

  42. Schneider A, Cannarozzi G, Gonnet G (2005) Empirical codon substitution matrix. BMC Bioinform 6(134):1–7

  43. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13(11):2498–2504

  44. Small H (1973) Co-citation in the scientific literature: a new measure of the relationship between two documents. J Am Soc Inf Sci 24(4):265–269

  45. Sørensen T (1948) A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Biol Skr 5:1–34

  46. Steyvers M, Tenenbaum JB (2005) The large-scale structure of semantic networks: statistical analyses and a model of semantic growth. Cogn Sci 29(1):41–78

  47. Watts DJ, Strogatz SH (1998) Collective dynamics of ‘small-world’ networks. Nature 393(6684):440–442

  48. Wong W, Liu W, Bennamoun M (2012) Ontology learning from text: a look back and into the future. ACM Comput Surv 44(4):20:1–20:36

  49. Wu TD, Brutlag DL (1996) Discovering empirically conserved amino acid substitution groups in databases of protein families. In: States DJ, Agarwal P, Gaasterland T, Hunter L, Smith R (eds) Proceedings of the fourth international conference on intelligent systems for molecular biology, St. Louis, MO, USA, June 12–15 1996. AAAI, pp 230–240

  50. Xie J, Szymanski BK, Liu X (2011) SLPA: uncovering overlapping communities in social networks via a speaker–listener interaction dynamic process. In: ICDM 2011 Workshop on DMCCI

  51. Yih W, Qazvinian V (2012) Measuring word relatedness using heterogeneous vector space models. In: Proceedings of the 2012 conference of the North American chapter of the association for computational linguistics: human language technologies, NAACL HLT ’12. Association for Computational Linguistics, Stroudsburg, PA, USA, pp 616–620

  52. Yu W, Zhang W, Lin X et al (2012) A space and time efficient algorithm for simrank computation. World Wide Web 15(3):327–353

  53. Zaharia M, Chowdhury M, Das T et al (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Presented as part of the 9th USENIX symposium on networked systems design and implementation (NSDI 12), San Jose, CA, pp 15–28

  54. Zhang B, Horvath S (2005) A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol 4(1):Article 17

Acknowledgments

This work was funded by the Swedish Foundation for Strategic Research (Stiftelsen för strategisk forskning) and the Knowledge Foundation (Stiftelsen för kunskaps- och kompetensutveckling). The authors would like to thank the anonymous reviewers for their valuable comments.

Author information

Corresponding author

Correspondence to Olof Görnerup.

Additional information

This paper is an extended version of [18].

About this article

Cite this article

Görnerup, O., Gillblad, D. & Vasiloudis, T. Domain-agnostic discovery of similarities and concepts at scale. Knowl Inf Syst 51, 531–560 (2017). https://doi.org/10.1007/s10115-016-0984-2

  • DOI: https://doi.org/10.1007/s10115-016-0984-2
