Clustering of Distributions: A Case of Patent Citations
Data units are often described by discrete distributions (for example, a work described by its citation distribution over time, or a population pyramid described by an age-sex distribution). When the set of such units is very large, appropriate clustering methods can reveal typical patterns hidden in the data.
In this paper we present an adapted leaders method combined with a compatible adapted agglomerative hierarchical method, both based on a relative error measure between a unit and the corresponding cluster representative (the leader). The approach is illustrated on citation distributions derived from a data set of US patents from 1980 to 1999. These new methods were developed because clustering units described by distributions with the classical k-means method reveals patterns with single high peaks, each corresponding to a single year, and these patterns prevail over the other distribution shapes present in the data. Compared with the centers in the k-means method, the cluster representatives obtained with the proposed methods better detect the typical distribution shapes of the units. The main cluster types obtained for different sets of units show three main patterns: patents with an early peak of importance to the community, patents with a late peak, and patents whose importance increases slowly throughout the time period.
Keywords: Clustering; Distribution; Leaders method; k-means method; Agglomerative hierarchical clustering method; Temporal citation distribution; Citation network; Relative error measure; Patents
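The abstract does not state the relative error measure explicitly, so the following is only a minimal sketch of a leaders-type iteration under an assumed dissimilarity, d(x, t) = sum_i ((x_i - t_i) / t_i)^2. Under that assumption, the leader minimizing the summed error over a cluster has the closed form t_i = sum x_i^2 / sum x_i, and it need not sum to 1. The function names, the `EPS` guard, and the toy data are hypothetical and are not taken from the paper or from the clustddist package.

```python
EPS = 1e-9  # assumption: guards against division by zero when a leader component is 0


def relative_error(x, t):
    """Assumed dissimilarity between unit x and leader t: sum_i ((x_i - t_i) / t_i)^2."""
    return sum(((xi - ti) / max(ti, EPS)) ** 2 for xi, ti in zip(x, t))


def leader(cluster):
    """Leader minimizing the summed assumed error: t_i = sum(x_i^2) / sum(x_i)."""
    dim = len(cluster[0])
    t = []
    for i in range(dim):
        s1 = sum(x[i] for x in cluster)
        s2 = sum(x[i] ** 2 for x in cluster)
        t.append(s2 / s1 if s1 > 0 else 0.0)
    return t


def leaders_method(units, init_leaders, max_iter=100):
    """Alternate between assigning units to the nearest leader and recomputing leaders."""
    leaders = [list(t) for t in init_leaders]
    k = len(leaders)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: relative_error(u, leaders[c]))
            for u in units
        ]
        if new_labels == labels:  # assignments stable: stop
            break
        labels = new_labels
        for c in range(k):
            members = [u for u, lab in zip(units, labels) if lab == c]
            if members:  # keep the old leader if a cluster becomes empty
                leaders[c] = leader(members)
    return labels, leaders


# Toy citation distributions over three periods: two early-peak, two late-peak units.
units = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
    [0.1, 0.3, 0.6],
]
labels, leaders = leaders_method(units, init_leaders=[units[0], units[2]])
print(labels)  # -> [0, 0, 1, 1]
```

Like k-means, this kind of iteration only reaches a local optimum, so the result depends on the initial leaders; here they are passed in explicitly to keep the toy example deterministic.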