Formulations and Rationales for Other Problems in Data Analysis

Owsiński, Jan W.

doi:10.1007/978-3-030-13389-4_4

Part of the book series: Studies in Computational Intelligence (SCI, volume 818)

Abstract

We now comment on some other problems considered in the broadly conceived domain of data analysis, as seen from the perspective of the bi-partial objective function. We show the applicability of the concept of the bi-partial objective function to these (and, indeed, yet other) problems with, wherever appropriate, illustrations of the potential form of the respective objective function.

Notes

1. One would expect a realistic empirical distribution to approximate some known form, whereas here it is apparently “haphazard”.
2. Strictly speaking, the proper Lorenz curve, introduced in 1905 by M. O. Lorenz to represent the distribution of wealth or income (x_i) in a population, concerns normalised values, so that it always starts at 0 and ends at 1.
3. We use the classical name of the k-means algorithm, although throughout this book the number of clusters, referred to in this name as “k”, is denoted p.
4. With this problem we clearly go back to the basic insights of Jan Czekanowski from the beginning of the 20th century, who studied the serial resemblance of the skulls of Neanderthals (see Czekanowski, 1909, 1913, 1926, 1932).
5. Although the extracted rules usually take the form of “if… then…” clauses, reference to causal implication is often explicitly avoided. The two expressions appearing in a rule are referred to, respectively, as the “antecedent”, “premise” or “condition”, and the “consequent”, “conclusion”, “decision” or “hypothesis”.
6. It should be noted that the simplicity of rules is closely associated with their “intuitive appeal” and with the practicability (effectiveness) of their use in many circumstances (e.g. when they are used, or at least consulted, by human operators).
7. It is common to treat the minimum length problem and approach as (practically) indistinguishable from the minimum description length (MDL) problem formulation, first introduced by Rissanen (1978). The thin distinction, in the opinion of this author, is that MDL looks for the minimum length ‘coding’ of a definite set of data items, rather than for the simplest and still effective model (and is thus quite analogous to the rule extraction problem of Sect. 4.6). A very telling quotation from Grünwald (1998) says that MDL is “based on the following insight: any regularity in a given set of data can be used to compress the data, i.e. to describe it using fewer symbols than needed to describe the data literally”. Because the meanings are so close and different interpretations are possible, MDL is also used in the clustering context, see, e.g., Figueiredo, Leitão, and Jain (1999), or Böhm et al. (2006), in a way very much like that of MML. In fact, the two examples provided in this section can be interpreted from the perspective of both MML and MDL.
8. Another case of interest, which can be effectively treated in the framework proposed here, is that of a sports tournament in which teams play in pairs, achieving various scores in their matches, but not all pairs of teams actually meet and play each other.
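
As a small illustration of the normalisation mentioned in Note 2, the following Python sketch computes the points of a normalised Lorenz curve for a list of non-negative values x_i. It is only a minimal sketch under stated assumptions: the function name `lorenz_curve` and its return format are hypothetical choices made here for illustration, and they are not taken from the chapter.

```python
from typing import List, Tuple


def lorenz_curve(values: List[float]) -> List[Tuple[float, float]]:
    """Points of the normalised Lorenz curve for non-negative values.

    Each point is (cumulative share of population, cumulative share of
    the total), so the curve starts at (0, 0) and ends at (1, 1).
    NOTE: name and return format are illustrative assumptions.
    """
    if not values or any(v < 0 for v in values):
        raise ValueError("values must be a non-empty list of non-negative numbers")
    ordered = sorted(values)          # poorest first
    total = sum(ordered)
    if total == 0:
        raise ValueError("the total of the values must be positive")
    n = len(ordered)
    points = [(0.0, 0.0)]             # the normalised curve starts at the origin
    running = 0.0
    for i, x in enumerate(ordered, start=1):
        running += x
        points.append((i / n, running / total))
    return points


if __name__ == "__main__":
    for share_pop, share_total in lorenz_curve([4, 1, 3, 2]):
        print(f"{share_pop:.2f} {share_total:.2f}")
```

Dividing both the ranks and the cumulative sums by their totals is what makes the curve comparable across populations of different size and overall wealth.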

References

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In SIGMOD, vol. 5. Washington, DC: ACM Press, pp. 207–216.
Asheibi, A., Stirling, D., & Soetanto, D. (2008). Determination of the optimal number of clusters in harmonic data classification. In ICHQP 2008: 13th International Conference on Harmonics & Quality of Power. IEEE Publications.
Bock, H.-H. (1994). Classification and clustering: Problems for the future. In E. Diday, et al. (Eds.), New approaches in classification and data analysis (pp. 3–24). Berlin: Springer.
Böhm, C., Faloutsos, Ch., Pan, J.-Y., & Plant, C. (2006). Robust information-theoretic clustering. In KDD’06, August 20–23, 2006. Philadelphia, Pennsylvania, USA: ACM Press.
Bramer, M. (2007). Principles of data mining. New York: Springer.
Comrey, A. L., & Lee, H. B. (1992). A first course in factor analysis (2nd ed.). Hillsdale, N.J.: Lawrence Erlbaum Associates.
Czekanowski, J. (1909). Zur Differenzialdiagnose der Neandertal-gruppe. Korrespondenz-Blatt der Deutschen Gesellschaft für Anthropologie etc. 1, 40.
Czekanowski, J. (1913). Zarys metod statystycznych w zastosowaniu do antropologii (An outline for the statistical methods in application to anthropology; in Polish, Vol. 5). Warszawa: Prace Tow. Nauk. Warsz., III Wydz. Nauk Mat. i Przyr.
Czekanowski, J. (1926). Metoda podobieństwa w zastosowaniu do badań psychometrycznych (The method of similarity applied to psychometric studies; in Polish, Vol. 3). PTF, Badania Psychologiczne, Lwów.
Czekanowski, J. (1932). Coefficient of racial likeness and “durchschnittliche Differenz”. Anthrop. Anz. 9.
Davidson, I. (2000). Minimum message length clustering using Gibbs sampling. In The 16th International Conference on Uncertainty in Artificial Intelligence. Stanford University.
Figueiredo, M. A. T., Leitão, J. M. N., & Jain, A. K. (1999). On fitting mixture models. In E. R. Hancock & M. Pelillo (Eds.), Energy minimization methods in computer vision and pattern recognition. EMMCVPR 1999. Lecture Notes in Computer Science, 1654. Berlin, Heidelberg: Springer.
Fitzgibbon, L. J., Allison, L., & Dowe, D. L. (2000). Minimum message length grouping of ordered data. In H. Arimura, S. Jain, & A. Dharma (Eds.), Algorithmic learning theory. ALT 2000. Lecture Notes in Computer Science, 1968. Berlin, Heidelberg: Springer.
Fitzgibbon, L. J., Dowe, D. L., & Allison, L. (2002). Change-point estimation using new minimum message length approximation. In M. Ishizuka & A. Sattar (Eds.), PRICAI 2002. LNAI 2417. Berlin-Heidelberg: Springer, pp. 244–254.
Gan, G., Ma, C., & Wu, J. (2007). Data clustering. Theory, algorithms and applications. Philadelphia: SIAM & ASA.
Greco, S., Słowiński, R., & Szczęch, I. (2009). Analysis of monotonicity properties of some rule interestingness measures. Control & Cybernetics, 38(1), 9–25.
Grünwald, P. (1998). MDL Tutorial. Retrieved December 12, 2017 from http://homepages.cwi.nl/~pdg/ftp/mdlintro.
Guadagnoli, E., & Velicer, W. (1988). Relation of sample size to the stability of component patterns. Psychological Bulletin, 103, 265–275.
Hair, J. F., Tatham, R. L., Anderson, R. E., & Black, W. (1998). Multivariate data analysis (5th ed.). London: Prentice Hall.
Hansen, P., Brimberg, J., Urosević, D., & Mladenović, N. (2009). Solving large p-median clustering problems by primal-dual variable neighbourhood search. Data Mining and Knowledge Discovery, 19, 351–375.
Hilderman, R., & Hamilton, H. (2001). Knowledge discovery and measures of interest. Boston: Kluwer.
Liao, K., & Guo, D. (2008). A clustering-based approach to the capacitated facility location problem. Transactions in GIS, 12(3), 323–339.
Mulvey, J. M., & Beck, M. P. (1984). Solving capacitated clustering problems. European Journal of Operational Research, 18, 339–348.
Nielsen, L. (2011). Classification of Countries based on their level of development: How it is done and how it could be done. IMF Working Paper, WP/11/31, IMF.
Oliver, J. J., Baxter, R. A., & Wallace, C. S. (1998). Minimum message length segmentation. In X. Wu, R. Kotagiri, & K. B. Korb (Eds.), Research and development in knowledge discovery and data mining. PAKDD 1998. Lecture Notes in Computer Science (Lecture Notes in Artificial Intelligence), 1394. Berlin, Heidelberg: Springer.
Owsiński, J. W. (2001). Clustering as a model and an approach in flexible manufacturing. Taksonomia 8. Klasyfikacja i analiza danych – teoria i zastosowania, In K. Jajuga & M. Walesiak, (Eds.), Prace Naukowe AE we Wrocławiu (No. 906, pp. 168–179). Wrocław: Wydawnictwo AE we Wrocławiu.
Owsiński, J. W. (2009). Machine-part grouping and cluster analysis: Similarities, distances and grouping criteria. Bulletin of the Polish Academy of Sciences. Technical Sciences. Special Issue: Modeling and Optimization of Manufacturing Systems, 57(3), 217–228 (Guest Editors: Z. A. Banaszak, J. Józefczyk).
Owsiński, J. W. (2011). The bi-partial approach in clustering and ordering: the model and the algorithms. Statistica & Applicazioni, 43–59 (Special Issue).
Owsiński, J. W. (2012a). Clustering and ordering via the bi-partial approach: the rationale, the model and some algorithmic considerations. In J. Pociecha & R. Decker (Eds.), Data analysis methods and its applications (pp. 109–124). Warszawa: Wydawnictwo C.H. Beck.
Owsiński, J. W. (2012b, June). On dividing an empirical distribution into optimal segments. Rome: SIS (Italian Statistical Society) Scientific Meeting. http://meetings.sis-statistica.org/index.php/sm/sm2012/paper/viewFile/2368/229.
Owsiński, J. W. (2012c). On the optimal division of an empirical distribution (and some related problems). Przegląd Statystyczny, 1, 109–122. (Special issue).
Owsiński, J. W., Stańczak, J., Sęp, K., & Potrzebowski, H. (2010). Machine-part grouping in flexible manufacturing: Formalisation and the use of genetic algorithms. In P. Leitão, C. E. Pereira, & J. Barata (Eds.), 10th IFAC Workshop on Intelligent Manufacturing Systems (pp. 233–238). IFAC (DVD).
Owsiński, J. W., & Tarchalski, T. (2008). Pomiar jakości życia. Uwagi na marginesie pewnego rankingu (Measuring life quality. Remarks relative to a certain ranking; in Polish, No. 1, pp. 59–95). Zeszyty Naukowe Wydziału Informatycznych Technik Zarządzania “Współczesne Problemy Zarządzania”.
Piatetsky-Shapiro, G. (1991). Discovery, analysis and presentation of strong rules. Knowledge Discovery in Databases, 229–248.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14(5), 465–471.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Computer Journal, 11(2), 185–194.

Author information

Authors and Affiliations

Polish Academy of Sciences, Systems Research Institute, Warsaw, Poland
Jan W. Owsiński

Corresponding author

Correspondence to Jan W. Owsiński.

About this chapter

Cite this chapter

Owsiński, J.W. (2020). Formulations and Rationales for Other Problems in Data Analysis. In: Data Analysis in Bi-partial Perspective: Clustering and Beyond. Studies in Computational Intelligence, vol 818. Springer, Cham. https://doi.org/10.1007/978-3-030-13389-4_4

DOI: https://doi.org/10.1007/978-3-030-13389-4_4
Published: 24 March 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-13388-7
Online ISBN: 978-3-030-13389-4
eBook Packages: Intelligent Technologies and Robotics (R0)
