Scientometrics, Volume 111, Issue 2, pp 1017–1031

Clustering articles based on semantic similarity

Article

Abstract

Document clustering is generally the first step in topic identification. Since many clustering methods operate on the similarities between documents, it is important to build representations of these documents that preserve their semantics as far as possible and are also suitable for efficient similarity calculation. As we describe in Koopman et al. (Proceedings of ISSI 2015 Istanbul: 15th International Society of Scientometrics and Informetrics Conference, Istanbul, Turkey, 29 June to 3 July, 2015. Bogaziçi University Printhouse. http://www.issi2015.org/files/downloads/all-papers/1042.pdf, 2015), the metadata of articles in the Astro dataset contribute to a semantic matrix, which uses a vector space to capture the semantics of entities derived from these articles and consequently supports the contextual exploration of these entities in LittleAriadne. However, this semantic matrix does not allow similarities between articles to be calculated directly. In this paper, we describe in detail how we build a semantic representation for an article from the entities that are associated with it. Based on these semantic representations of articles, we apply two standard clustering methods, K-Means and the Louvain community detection algorithm, which yield our two clustering solutions, labelled OCLC-31 (K-Means) and OCLC-Louvain (Louvain). We also give the implementation details and a basic comparison with the other clustering solutions reported in this special issue.
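The general idea of deriving an article representation from entity vectors and then clustering on similarity can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how entity vectors are combined, so the sketch assumes a normalised sum, uses a cosine-based (spherical) variant of K-Means with a fixed seeding, and works on invented toy entities and vectors. The Louvain step is not shown.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def article_vector(entities, entity_vectors):
    """Assumed aggregation: sum the vectors of an article's entities, then normalise."""
    dim = len(next(iter(entity_vectors.values())))
    total = [0.0] * dim
    for name in entities:
        for i, x in enumerate(entity_vectors[name]):
            total[i] += x
    return normalize(total)

def cosine(u, v):
    """For unit vectors, cosine similarity is simply the dot product."""
    return sum(a * b for a, b in zip(u, v))

def kmeans(vectors, k, iterations=20):
    """K-Means on unit vectors, assigning each vector to the most similar centroid.
    Centroids are seeded with the first k vectors to keep the sketch deterministic."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return labels

# Hypothetical toy data: three-dimensional entity vectors and four articles.
entity_vectors = {
    "galaxy": [1.0, 0.1, 0.0],
    "redshift": [0.9, 0.2, 0.0],
    "pulsar": [0.0, 0.1, 1.0],
    "neutron star": [0.1, 0.0, 0.9],
}
articles = {
    "a1": ["galaxy", "redshift"],
    "a2": ["pulsar", "neutron star"],
    "a3": ["galaxy"],
    "a4": ["neutron star"],
}

vectors = [article_vector(ents, entity_vectors) for ents in articles.values()]
labels = kmeans(vectors, k=2)
print(dict(zip(articles, labels)))
```

In this toy setting the two articles about galaxies end up in one cluster and the two about compact stars in the other. For a corpus of realistic size, a library implementation of K-Means and of Louvain community detection would be the more sensible choice.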

Keywords

Semantic indexing, Clustering, Visualisation, K-Means, Louvain community detection

References

  1. Achlioptas, D. (2003). Database-friendly random projections: Johnson–Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4), 671–687. doi:10.1016/S0022-0000(03)00025-4.
  2. Béjar, J. (2013). K-means vs mini batch k-means: A comparison. Tech. rep., Universitat Politècnica de Catalunya. http://upcommons.upc.edu/bitstream/handle/2117/23414/R13-8.pdf.
  3. Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 10, P10008. (12pp).
  4. Boyack, K. W., Klavans, R., & Börner, K. (2005). Mapping the backbone of science. Scientometrics, 64(3), 351–374.
  5. Boyack, K. W., Small, H., & Klavans, R. (2013). Improving the accuracy of co-citation clustering using full text. Journal of the American Society for Information Science and Technology, 64(9), 1759–1767. doi:10.1002/asi.22896.
  6. Bruckner, E., Ebeling, W., & Scharnhorst, A. (1990). The application of evolution models in scientometrics. Scientometrics, 18(1–2), 21–41. doi:10.1007/BF02019160.
  7. Firth, J. R. (1957). A synopsis of linguistic theory 1930–1955. Studies in Linguistic Analysis, pp. 1–32.
  8. Furnas, G. W., Landauer, T. K., Gomez, L. M., & Dumais, S. T. (1983). Statistical semantics: Analysis of the potential performance of keyword information systems. Bell System Technical Journal, 62(6), 1753–1806. doi:10.1002/j.1538-7305.1983.tb03513.x.
  9. Garfield, E. (1983). Citation indexing—Its theory and application in science, technology and humanities. Philadelphia: ISI Press.
  10. Glänzel, W., & Czerwon, H. J. (1996). A new methodological approach to bibliographic coupling and its application to the national, regional and institutional level. Scientometrics, 37, 195–221.
  11. Glänzel, W., & Thijs, B. (2017). Using hybrid methods and ‘core documents’ for the representation of clusters and topics: The astronomy dataset. Scientometrics. doi:10.1007/s11192-017-2301-6.
  12. Gläser, J., Glänzel, W., & Scharnhorst, A. (2017). Same data: different results? Towards a comparative approach to the identification of thematic structures in science. Scientometrics. doi:10.1007/s11192-017-2296-z.
  13. Harris, Z. (1954). Distributional structure. Word, 10(2–3), 146–162.
  14. Johnson, W., & Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26, 189–206.
  15. Koopman, R., Wang, S., & Scharnhorst, A. (2015). Contextualization of topics—browsing through terms, authors, journals and cluster allocations. In A. A. Salah, Y. Tonta, A. A. A. Salah, C. R. Sugimoto, & U. Al (Eds.), Proceedings of ISSI 2015 Istanbul: 15th International Society of Scientometrics and Informetrics Conference, Istanbul, Turkey, 29 June to 3 July, 2015. Bogaziçi University Printhouse. http://www.issi2015.org/files/downloads/all-papers/1042.pdf.
  16. Koopman, R., Wang, S., & Scharnhorst, A. (2017). Contextualization of topics—browsing through the universe of bibliographic information. In J. Gläser, A. Scharnhorst, & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics.
  17. Koopman, R., Wang, S., Scharnhorst, A., & Englebienne, G. (2015). Ariadne’s thread: Interactive navigation in a world of networked information. In B. Begole, J. Kim, K. Inkpen, & W. Woo (Eds.), Proceedings of the 33rd Annual ACM Conference Extended Abstracts on Human Factors in Computing Systems, CHI 2015 Extended Abstracts, Seoul, Republic of Korea, April 18–23, 2015, pp. 1833–1838. ACM. doi:10.1145/2702613.2732781.
  18. Leydesdorff, L. (1989). Words and co-words as indicators of intellectual organization. Research Policy, 18(4), 209–223. doi:10.1016/0048-7333(89)90016-4.
  19. Leydesdorff, L., & Hellsten, I. (2006). Measuring the meaning of words in contexts: An automated analysis of controversies about ‘monarch butterflies’, ‘frankenfoods’, and ‘stem cells’. Scientometrics, 67(2), 231–258.
  20. MacKay, D. (2003). Information Theory, Inference and Learning Algorithms, chap. 20: An Example Inference Task: Clustering, pp. 284–292. Cambridge University Press.
  21. Newman, M. E. (2006). Modularity and community structure in networks. Proc Natl Acad Sci USA, 103(23), 8577–8582. doi:10.1073/pnas.0601602103. http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=retrieve&db=pubmed&list_uids=16723398&dopt=AbstractPlus.
  22. Rip, A., & Courtial, J. P. (1984). Co-word maps of biotechnology: An example of cognitive scientometrics. Scientometrics, 6(6), 381–400.
  23. Rousseeuw, P. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20(1), 53–65. doi:10.1016/0377-0427(87)90125-7.
  24. Sahlgren, M. (2008). The distributional hypothesis. Rivista di Linguistica, 20(1), 33–53.
  25. Sculley, D. (2016). Web scale k-means clustering. In: Proceedings of the 19th International Conference on World Wide Web, pp. 1177–1178. Raleigh, NC, USA.
  26. Small, H. (1973). Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 24, 265–269.
  27. Sugimoto, C. R., & Weingart, S. (2015). The kaleidoscope of disciplinarity. Journal of Documentation, 71(4), 775–794. doi:10.1108/JD-06-2014-0082. http://www.scopus.com/inward/record.url?eid=2-s2.0-84933503812&partnerID=tZOtx3y1.
  28. Velden, T., Boyack, K., van Eck, N., Glänzel, W., Gläser, J., Havemann, F., Heinz, M., Koopman, R., Scharnhorst, A., Thijs, B., & Wang, S. (2017). Comparison of topic extraction approaches and their results. In J. Gläser, A. Scharnhorst, & W. Glänzel (Eds.), Same data—different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics.
  29. Vinh, N. X., Epps, J., & Bailey, J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11, 2837–2854.
  30. Weaver, W. (1955). Translation. In W. Locke & D. Booth (Eds.), Machine translation of languages (pp. 15–23). Cambridge, Massachusetts: MIT Press.
  31. Witten, I. H., Frank, E., & Hall, M. A. (2011). Data mining: Practical machine learning tools and techniques, 3rd edn. The Morgan Kaufmann series in data management systems. Burlington: Morgan Kaufmann.
  32. Zhang, L., Liu, X., Janssens, F., Liang, L., & Glänzel, W. (2010). Subject clustering analysis based on ISI category classification. Journal of Informetrics, 4(2), 185–193. doi:10.1016/j.joi.2009.11.005. http://www.sciencedirect.com/science/article/pii/S1751157709000832.
  33. Zhang, L., Liu, X., Janssens, F., Liang, L., & Glänzel, W. (2010). Subject clustering analysis based on ISI category classification. Journal of Informetrics 4(2), 185–193. doi:10.1016/j.joi.2009.11.005. http://www.sciencedirect.com/science/article/pii/S1751157709000832. The ASIS&ISSI ”metrics” pre-conference seminar and the Global Alliance.

Copyright information

© Akadémiai Kiadó, Budapest, Hungary 2017

Authors and Affiliations

  1. OCLC Research, Leiden, The Netherlands
