Domain-agnostic discovery of similarities and concepts at scale

Görnerup, Olof; Gillblad, Daniel; Vasiloudis, Theodore

doi:10.1007/s10115-016-0984-2

Domain-agnostic discovery of similarities and concepts at scale

Regular Paper
Published: 30 August 2016

Volume 51, pages 531–560, (2017)
Cite this article

Knowledge and Information Systems Aims and scope Submit manuscript

Olof Görnerup¹,
Daniel Gillblad¹ &
Theodore Vasiloudis¹

339 Accesses
1 Citation
1 Altmetric
Explore all metrics

Abstract

Appropriately defining and efficiently calculating similarities from large data sets are often essential in data mining, both for gaining understanding of data and generating processes and for building tractable representations. Given a set of objects and their correlations, we here rely on the premise that each object is characterized by its context, i.e., its correlations to the other objects. The similarity between two objects can then be expressed in terms of the similarity between their contexts. In this way, similarity pertains to the general notion that objects are similar if they are exchangeable in the data. We propose a scalable approach for calculating all relevant similarities among objects by relating them in a correlation graph that is transformed to a similarity graph. These graphs can express rich structural properties among objects. Specifically, we show that concepts—abstractions of objects—are constituted by groups of similar objects that can be discovered by clustering the objects in the similarity graph. These principles and methods are applicable in a wide range of fields and will be demonstrated here in three domains: computational linguistics, music, and molecular biology, where the numbers of objects and correlations range from small to very large.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A Comprehensive Survey of Clustering Algorithms

Article 01 June 2015

Density-Based Clustering Based on Hierarchical Density Estimates

Siamese Neural Networks: An Overview

Notes

That is, most objects are either completely unrelated or at most negligibly correlated. Two randomly selected persons in a large social network, for instance, most likely do not know each other.
https://github.com/sics-dna/concepts.
http://www.last.fm/.
http://nlp.stanford.edu/projects/glove/.
https://catalog.ldc.upenn.edu/LDC2011T07.
https://aws.amazon.com/datasets/google-books-ngrams/.
http://aws.amazon.com/ec2/.

References

Albert R, Barabási A-L (2002) Statistical mechanics of complex networks. Rev Mod Phys 74(1):47–97
Article MathSciNet MATH Google Scholar
Alexandrov A, Bergmann R, Ewen S et al (2014) The Stratosphere platform for big data analytics. VLDB J 23:163–181
Anisimova M, Kosiol C (2009) Investigating protein-coding sequence evolution with probabilistic codon substitution models. Mol Biol Evol 26(2):255–271
Article Google Scholar
Bitton D, Boral H, DeWitt DJ et al (1983) Parallel algorithms for the execution of relational database operations. ACM Trans Database Syst 8(3):324–353
Article Google Scholar
Bouma G (2009) Normalized (pointwise) mutual information in collocation extraction. In: From form to meaning: processing texts automatically, Proceedings of the Biennial GSCL Conference, pp 31–40
Brown PF, deSouza PV, Mercer RL et al (1992) Class-based N-gram models of natural language. Comput Linguist 18(4):467–479
Google Scholar
Cancho RF, Solé RV (2001) The small world of human language. Proc R Soc Lond B Biol Sci 268(1482):2261–2265
Article Google Scholar
Celma Ò (2010) Music recommendation and discovery in the long tail. Springer, Berlin
Book Google Scholar
Celma Ó, Cano P (2008) From hits to niches? Or how popular artists can bias music recommendation and discovery. In: Proceedings of the 2nd KDD workshop on large-scale recommender systems and the Netflix Prize Competition. ACM, p 5
Chandra AK, Merlin PM (1977) Optimal implementation of conjunctive queries in relational data bases. In: Proceedings of the ninth annual ACM symposium on theory of computing, STOC ’77. ACM, New York, NY, USA, pp 77–90
Chelba C, Mikolov T, Schuster M et al (2013) One billion word benchmark for measuring progress in statistical language modeling. CoRR arXiv:1312.3005
Church KW, Hanks P (1990) Word association norms, mutual information, and lexicography. Comput Linguist 16(1):22–29
Google Scholar
Dayhoff MO, Schwartz RM (1978) Chapter 22: A model of evolutionary change in proteins. In: Atlas of protein sequence and structure
Dice LR (1945) Measures of the amount of ecologic association between species. Ecology 26(3):297–302
Article Google Scholar
Finkelstein L, Gabrilovich E, Matias Y et al (2001) Placing search in context: the concept revisited. In: Proceedings of the 10th international conference on World Wide Web, WWW ’01. ACM, New York, NY, USA, pp 406–414
Firth JR (1957) A synopsis of linguistic theory 1930–55. In: Studies in linguistic analysis (special volume of the Philological Society), vol 1952–59. The Philological Society, pp 1–32
Fortunato S (2010) Community detection in graphs. Phys Rep 486(3–5):75–174
Article MathSciNet Google Scholar
Görnerup O, Gillblad D, Vasiloudis T (2015) Knowing an object by the company it keeps: a domain-agnostic scheme for similarity discovery. In: IEEE international conference on data mining (ICDM 2015)
Halawi G, Dror G, Gabrilovich E et al (2012) Large-scale learning of word relatedness with constraints. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, NY, USA, pp 1406–1414
Harispe S, Ranwez S, Janaqi S et al (2015) Semantic similarity from natural language and ontology analysis. Synth Lect Hum Lang Technol 8(1):1–254
Article Google Scholar
Harris Z (1954) Distributional structure. Word 10(23):146–162
Article Google Scholar
Hill F, Reichart R, Korhonen A (2014) Simlex-999: evaluating semantic models with (genuine) similarity estimation. CoRR arXiv:1408.3456
Jaccard P (1912) The distribution of the flora in the alpine zone. New Phytol 11(2):37–50
Article Google Scholar
Jeh G, Widom J (2002) Simrank: a measure of structural-context similarity. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’02. ACM, New York, NY, USA, pp 538–543
Jordan IK, Mariño Ramírez L, Wolf YI et al (2004) Conservation and coevolution in the scale-free human gene coexpression network. Mol Biol Evol 21(11):2058–2070
Article Google Scholar
Kessler M (1963) Bibliographic coupling between scientific papers. Am Doc 14:10–25
Article Google Scholar
Koutris P, Suciu D (2011) Parallel evaluation of conjunctive queries. In: Proceedings of the thirteenth ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, PODS ’11. ACM, New York, NY, USA, pp 223–234
Larson R (1996) Bibliometrics of the World Wide Web: an exploratory analysis of the intellectual structure of cyberspace. Ann. Meeting of the American Soc. Info, Sci
Leicht EA, Holme P, Newman MEJ (2006) Vertex similarity in networks. Phys Rev E 73:026120
Article Google Scholar
Lin Y, Michel J, Aiden EL et al (2012) Syntactic annotations for the google books ngram corpus. In: Proceedings of the ACL 2012 system demonstrations, ACL ’12. Association for Computational Linguistics, Stroudsburg, PA, USA, pp 169–174
Michel JB, Shen YK, Aiden AP, Veres A, Gray MK, Pickett JP, Hoiberg D, Clancy D, Norvig P, Orwant J, Pinker S, Nowak MA, Aiden EL (2011) Quantitative analysis of culture using millions of digitized books. Science 331(6014):176–182
Mihalcea R, Radev D (2011) Graph-based natural language processing and information retrieval. Cambridge University Press, Cambridge
Book MATH Google Scholar
Mikolov T, Sutskever I, Chen K et al (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119
Miller GA (1995) Wordnet: a lexical database for English. Commun ACM 38(11):39–41
Article Google Scholar
Mislove A, Marcon M, Gummadi KP et al (2007) Measurement and analysis of online social networks. In: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, IMC ’07. ACM, New York, NY, USA, pp 29–42
Nirenberg M, Leder P, Bernfield M et al (1965) RNA codewords and protein synthesis, VII. On the general nature of the RNA code. Proc Natl Acad Sci 53:1161–1168
Article Google Scholar
Palla G, Derenyi I, Farkas I et al (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature 435(7043):814–818
Article Google Scholar
Pecina P (2008) A machine learning approach to multiword expression extraction. In: Proceedings of the LREC 2008 workshop towards a shared task for multiword expressions. European Language Resources Association, pp 54–57
Pennington J, Socher R, Manning C (2014) Glove: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, pp 1532–1543
Ravasz E, Somera AL, Mongru DA et al (2002) Hierarchical organization of modularity in metabolic networks. Science 297(5586):1551–1555
Article Google Scholar
Sahlgren M (2006) The Word-Space Model: using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. Ph.D. thesis, Stockholm University
Schneider A, Cannarozzi G, Gonnet G (2005) Empirical codon substitution matrix. BMC Bioinform 6(134):1–7
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13(11):2498–2504
Article Google Scholar
Small H (1973) Co-citation in the scientific literature: a new measure of the relationship between two documents. J Am Soc Inf Sci 24(4):265–269
Article MathSciNet Google Scholar
Sørensen T (1948) A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Biol Skr 5:1–34
Google Scholar
Steyvers M, Tenenbaum JB (2005) The large-scale structure of semantic networks: statistical analyses and a model of semantic growth. Cogn Sci 29(1):41–78
Article Google Scholar
Watts DJ, Strogatz SH (1998) Collective dynamics of ’small-world’ networks. Nature 393(6684):409–10
Article Google Scholar
Wong W, Liu W, Bennamoun M (2012) Ontology learning from text: a look back and into the future. ACM Comput Surv 44(4):20:1–20:36
Article MATH Google Scholar
Wu TD, Brutlag DL (1996) Discovering empirically conserved amino acid substitution groups in databases of protein families. In: States DJ, Agarwal P, Gaasterland T, Hunter L, Smith R (eds) Proceedings of the fourth international conference on intelligent systems for molecular biology, St. Louis, MO, USA, June 12–15 1996. AAAI, pp 230–240
Xie J, Szymanski BK, Liu X (2011) SLPA: uncovering overlapping communities in social networks via a speaker–listener interaction dynamic process. In: ICDM 2011 Workshop on DMCCI
Yih W, Qazvinian V (2012) Measuring word relatedness using heterogeneous vector space models. In: Proceedings of the 2012 conference of the North American chapter of the association for computational linguistics: human language technologies, NAACL HLT ’12. Association for Computational Linguistics, Stroudsburg, PA, USA, pp 616–620
Yu W, Zhang W, Lin X et al (2012) A space and time efficient algorithm for simrank computation. World Wide Web 15(3):327–353
Article Google Scholar
Zaharia M, Chowdhury M, Das T et al (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Presented as part of the 9th USENIX symposium on networked systems design and implementation (NSDI 12), San Jose, CA, pp 15–28
Zhang B, Horvath S (2005) A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol 4, Article17

Download references

Acknowledgments

This work was funded by the Swedish Foundation for Strategic Research (Stiftelsen för strategisk forskning) and the Knowledge Foundation (Stiftelsen för kunskaps- och kompetensutveckling). The authors would like to thank the anonymous reviewers for their valuable comments.

Author information

Authors and Affiliations

Swedish Institute of Computer Science (SICS), 164 29, Kista, Sweden
Olof Görnerup, Daniel Gillblad & Theodore Vasiloudis

Authors

Olof Görnerup
View author publications
You can also search for this author in PubMed Google Scholar
Daniel Gillblad
View author publications
You can also search for this author in PubMed Google Scholar
Theodore Vasiloudis
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Olof Görnerup.

Additional information

This paper is an extended version of [18].

Rights and permissions

Reprints and permissions

About this article

Cite this article

Görnerup, O., Gillblad, D. & Vasiloudis, T. Domain-agnostic discovery of similarities and concepts at scale. Knowl Inf Syst 51, 531–560 (2017). https://doi.org/10.1007/s10115-016-0984-2

Download citation

Received: 12 November 2015
Revised: 29 June 2016
Accepted: 19 August 2016
Published: 30 August 2016
Issue Date: May 2017
DOI: https://doi.org/10.1007/s10115-016-0984-2

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Domain-agnostic discovery of similarities and concepts at scale

Abstract

Access this article

Similar content being viewed by others

A Comprehensive Survey of Clustering Algorithms

Density-Based Clustering Based on Hierarchical Density Estimates

Siamese Neural Networks: An Overview

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Domain-agnostic discovery of similarities and concepts at scale

Abstract

Access this article

Similar content being viewed by others

A Comprehensive Survey of Clustering Algorithms

Density-Based Clustering Based on Hierarchical Density Estimates

Siamese Neural Networks: An Overview

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation