# Quantitative Summarization

## Abstract

Before turning to multivariate summarization proper, this chapter considers the concept of a feature and its summarization into histograms, density functions, and centers. Two perspectives, the probabilistic and the vector-space one, are introduced for defining feature centers and spreads. Current views on the types of measurement scales are described, leading to the conclusion that binary scales are both quantitative and categorical. The core of the chapter presents the method of principal components (PCA) as a method for fitting a data-driven summarization model. The model assumes that the data entries are, up to errors, (sums of) products of hidden factor scores and feature loadings. Together with the least-squares fitting criterion, this turns out to be equivalent to finding part of what is known in mathematics as the singular value decomposition (SVD) of a rectangular matrix. Three applications of the method are described: (1) scoring hidden aggregate factors, (2) data visualization, and (3) Latent Semantic Indexing. The conventional, and equivalent, formulation of PCA via covariance matrices and their eigenvalues is also described. The main difference between the two formulations is that the property of principal components being linear combinations of features is postulated in the conventional approach and derived in the SVD-based one. The interpretation of results is discussed as well. A novel, promising approach based on a postulated linear model of stratification is presented via a project. The issue of data standardization in data summarization problems, which remains unsolved, is discussed at length at the beginning. A powerful application of eigenvectors to scoring node importance in networks and pair comparison matrices, the Google PageRank approach, is also described.
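The relation between the two formulations can be illustrated with a minimal sketch. It fits a single-factor model, in which each centered entry is approximated by the product of a factor score and a feature loading, by alternating least squares, and then checks numerically that the loadings form an eigenvector of the feature cross-product matrix. The toy data, the function name `first_principal_component`, and the iteration count are assumptions made for illustration only; they are not taken from the chapter.

```python
import math


def first_principal_component(X, iters=200):
    """Fit the rank-one model x[i][v] ~ mu * z[i] * c[v] by alternating least squares.

    X is a list of rows (entities) over columns (features), assumed to be
    centered already. Returns (mu, z, c), with z and c scaled to unit length.
    """
    n, m = len(X), len(X[0])
    c = [1.0 / math.sqrt(m)] * m  # arbitrary unit-length starting loadings
    mu = 0.0
    for _ in range(iters):
        # Given loadings c, the least-squares scores are proportional to X c,
        # so each score is a linear combination of the features.
        z = [sum(X[i][v] * c[v] for v in range(m)) for i in range(n)]
        z_norm = math.sqrt(sum(t * t for t in z))
        z = [t / z_norm for t in z]
        # Given scores z, the least-squares loadings are proportional to X^T z.
        c = [sum(X[i][v] * z[i] for i in range(n)) for v in range(m)]
        mu = math.sqrt(sum(t * t for t in c))  # estimate of the singular value
        c = [t / mu for t in c]
    return mu, z, c


# Hypothetical toy data: 5 entities by 3 features.
raw = [
    [2.0, 4.0, 1.0],
    [3.0, 6.0, 2.0],
    [4.0, 7.0, 2.0],
    [5.0, 9.0, 4.0],
    [6.0, 9.0, 6.0],
]

# Center each feature by subtracting its mean.
means = [sum(col) / len(raw) for col in zip(*raw)]
X = [[x - mean for x, mean in zip(row, means)] for row in raw]

mu, z, c = first_principal_component(X)

# Share of the data scatter taken into account by the first component.
scatter = sum(x * x for row in X for x in row)
print("loadings:", [round(t, 4) for t in c])
print("share of data scatter explained:", round(mu * mu / scatter, 4))
```

In this sketch the scores are never assumed to be linear combinations of the features; that property falls out of the least-squares update for `z`. At convergence `c` satisfies the eigenvector equation for the matrix of feature cross-products, with eigenvalue `mu ** 2`, which is how the covariance-based formulation would characterize it.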
