Journal of Classification, Volume 28, Issue 2, pp 156–183

Clustering of Distributions: A Case of Patent Citations

  • Nataša Kejžar
  • Simona Korenjak-Černe
  • Vladimir Batagelj


Data units are often described by discrete distributions (for example, a work described by the distribution of its citations over time, or a population pyramid described by an age-sex distribution). When the set of such units is very large, appropriate clustering methods can reveal the typical patterns hidden in the data.

In this paper we present an adapted leaders method combined with a compatible adapted agglomerative hierarchical method; both are based on a relative error measure between a unit and the corresponding cluster representative (the leader). The approach is illustrated on citation distributions derived from a data set of US patents from 1980 to 1999. The new methods were developed because clustering units described by distributions with the classical k-means method yields patterns with single high peaks, each corresponding to a single year, and these patterns prevail over other distribution shapes that are also present in the data. Compared with the centers in the k-means method, the cluster representatives obtained with the proposed methods better detect the typical distribution shapes of units. The main cluster types obtained for different sets of units show three main patterns: patents with an early peak of importance to the community, patents with a late peak, and patents whose importance increases slowly throughout the time period.
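The leaders scheme mentioned above can be illustrated with a minimal sketch. The abstract does not state the relative error measure, so the code assumes one possible form, d(x, t) = Σ_i ((x_i − t_i) / t_i)², purely for illustration; the function names `relative_error`, `update_leader` and `leaders`, the toy data, and the choice of initial leaders are all hypothetical and are not taken from the paper or from its R package. Under this assumed measure the leader that minimizes the total error of a cluster has the componentwise closed form t_i = Σ_u x_ui² / Σ_u x_ui.

```python
import numpy as np

def relative_error(x, t):
    """Assumed dissimilarity (not the paper's definition):
    sum_i ((x_i - t_i) / t_i)^2 between unit x and leader t."""
    x, t = np.asarray(x, dtype=float), np.asarray(t, dtype=float)
    return float(np.sum(((x - t) / t) ** 2))

def update_leader(units, eps=1e-12):
    """Leader minimizing the summed assumed error over one cluster.
    Componentwise closed form: t_i = sum_u x_ui^2 / sum_u x_ui.
    eps guards against division by zero for all-zero components."""
    num = np.sum(units ** 2, axis=0)
    den = np.sum(units, axis=0)
    return np.maximum(num / np.maximum(den, eps), eps)

def leaders(X, init_idx, max_iter=100):
    """Leaders-type iteration: assign each unit to its nearest leader,
    recompute the leaders, and stop when the assignment is stable."""
    X = np.asarray(X, dtype=float)
    t = np.maximum(X[list(init_idx)].copy(), 1e-12)  # initial leaders
    labels = None
    for _ in range(max_iter):
        # d[u, k] = assumed relative error between unit u and leader k
        d = np.sum(((X[:, None, :] - t[None, :, :]) / t[None, :, :]) ** 2, axis=2)
        new_labels = np.argmin(d, axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for k in range(len(t)):
            members = X[labels == k]
            if len(members) > 0:  # keep the old leader if a cluster empties
                t[k] = update_leader(members)
    return labels, t

# Toy data: two "early peak" and two "late peak" distributions over 3 periods.
X = np.array([[0.6, 0.3, 0.1],
              [0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6],
              [0.1, 0.2, 0.7]])
labels, found = leaders(X, init_idx=(0, 2))
print(labels)
print(found)
```

On this toy input the two early-peak units end up in one cluster and the two late-peak units in the other. Note that a leader computed this way need not sum to one, so under the assumed measure a cluster representative is not necessarily a distribution itself.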


Keywords: Clustering; Distribution; Leaders method; k-means method; Agglomerative hierarchical clustering method; Temporal citation distribution; Citation network; Relative error measure; Patents





Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • Nataša Kejžar (1)
  • Simona Korenjak-Černe (2)
  • Vladimir Batagelj (3)

  1. University of Ljubljana, Faculty of Medicine, Institute of Biostatistics and Medical Informatics, IBMI, Ljubljana, Slovenia
  2. University of Ljubljana, Faculty of Economics, Department of Statistics, Ljubljana, Slovenia
  3. University of Ljubljana, Faculty of Mathematics and Physics, Department of Mathematics, Ljubljana, Slovenia
