Volume 78, Issue 1, pp 113–130

Similarity measures for document mapping: A comparative study on the level of an individual scientist

  • Christian Sternitzke
  • Isumo Bergmann


This paper investigates the utility of the Inclusion Index, the Jaccard Index and the Cosine Index for calculating similarities of documents, as used for mapping science and technology. It is shown that, provided that the same content is searched across various documents, the Inclusion Index generally delivers more exact results, in particular when computing the degree of similarity based on citation data. In addition, various methodologies are compared: co-word analysis, Subject-Action-Object (SAO) structures, bibliographic coupling, co-citation analysis, and self-citation links. We find that the first two tend to describe semantic similarities, which differ from the knowledge flows expressed by the citation-based methodologies.
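The three indices named above can be illustrated with a minimal sketch. It assumes the common set-based definitions, with each document represented as a set of items (for example cited references or index terms). The exact formulas used in the paper are not reproduced in this excerpt, so the function names and the toy data below are illustrative assumptions only.

```python
from math import sqrt


def inclusion_index(a, b):
    """Shared items divided by the size of the smaller set (assumed definition)."""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def jaccard_index(a, b):
    """Shared items divided by the size of the union."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def cosine_index(a, b):
    """Shared items divided by the geometric mean of the set sizes
    (Salton's cosine for binary occurrence data)."""
    a, b = set(a), set(b)
    denom = sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


# Hypothetical reference lists: a short document whose references
# are all contained in the reference list of a longer document.
refs_short = {"r1", "r2"}
refs_long = {"r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8"}

print(inclusion_index(refs_short, refs_long))  # 1.0
print(jaccard_index(refs_short, refs_long))    # 0.25
print(cosine_index(refs_short, refs_long))     # 0.5
```

In this toy case the Inclusion Index reports full overlap because it normalises by the smaller set, while the Jaccard and Cosine indices are pulled down by the size difference between the two documents. This is one plausible reading of why the indices can diverge on citation data, where reference lists vary widely in length.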


Keywords: Similarity Measure; Semantic Similarity; Science Citation Index; Citation Network; Jaccard Index
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer Science+Business Media B.V. 2008

Authors and Affiliations

  1. PATON, Landespatentzentrum Thüringen, Technische Universität Ilmenau, Ilmenau, Germany
  2. Institut für Projektmanagement und Innovation (IPMI), Universität Bremen, Bremen, Germany
