Formulations and Rationales for Other Problems in Data Analysis

Data Analysis in Bi-partial Perspective: Clustering and Beyond

Part of the book series: Studies in Computational Intelligence (SCI, volume 818)

Abstract

We now comment on some other problems from the broadly conceived domain of data analysis, seen from the perspective of the bi-partial objective function. We show that the concept of the bi-partial objective function applies to these (and, indeed, yet other) problems and, wherever appropriate, illustrate the potential form of the respective objective function.

Notes

  1. One would expect a realistic empirical distribution to approximate some known form, whereas here it is, apparently, “haphazard”.

  2. Strictly speaking, the proper Lorenz curve, introduced in 1905 by M. O. Lorenz to represent the distribution of wealth or income (x_i) in some population, concerns normalised values, so that it always starts at 0 and ends at 1.

  3. We use the classical name of the k-means algorithm, although throughout this book the number of clusters, referred to in this name as “k”, is denoted p.

  4. With this problem we clearly return to the basic insights of Jan Czekanowski from the beginning of the 20th century, who studied the serial resemblance of Neanderthal skulls (see Czekanowski, 1909, 1913, 1926, 1932).

  5. Although the extracted rules usually take the form of “if ... then ...” clauses, reference to causal implication is often explicitly avoided. The expressions appearing in the rules are referred to, respectively, as “antecedent”, “premise” or “condition”, and as “consequent”, “conclusion”, “decision” or “hypothesis”.

  6. It should be noted that simplicity of rules is closely associated with their “intuitive appeal” and with practicability (effectiveness) of use in many circumstances (e.g. when they are used or at least consulted by human operators).

  7. It is common to treat the minimum length problem and approach as (practically) indistinguishable from the minimum description length (MDL) problem formulation, first introduced by Rissanen (1978). The thin distinction, in the opinion of this author, is that MDL looks for the minimum-length ‘coding’ of a definite set of data items, rather than for the simplest and still effective model (and is thus quite analogous to the rule extraction problem of Sect. 4.6). A very telling quotation from Grünwald (1998) says that MDL is “based on the following insight: any regularity in a given set of data can be used to compress the data, i.e. to describe it using fewer symbols than needed to describe the data literally”. Because the two meanings are so close and admit different interpretations, MDL is also used in the clustering context, see, e.g., Figueiredo, Leitão, and Jain (1999), or Böhm et al. (2006), in a way very much like that of MML. Indeed, the two examples provided in this section can be interpreted from the perspective of both MML and MDL.

  8. Another case of interest, which can be effectively treated in the framework proposed here, is that of a sports tournament in which teams play in pairs, achieving various scores in their matches, but not all pairs of teams actually meet and play each other.
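
The normalisation mentioned in Note 2 can be illustrated with a minimal sketch. It assumes non-negative values with a positive total; the function name `lorenz_curve` and the sample values are hypothetical and do not come from the chapter.

```python
def lorenz_curve(values):
    """Return the Lorenz curve as (population share, cumulative value share) points.

    Illustrative sketch only: assumes non-negative values with a positive total.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if any(v < 0 for v in values):
        raise ValueError("values must be non-negative")
    total = sum(values)
    if total <= 0:
        raise ValueError("values must have a positive total")

    ordered = sorted(values)      # poorest first
    n = len(ordered)
    points = [(0.0, 0.0)]         # the normalised curve starts at (0, 0)
    running = 0.0
    for i, x in enumerate(ordered, start=1):
        running += x
        # share of population covered, share of total value held by it
        points.append((i / n, running / total))
    return points


curve = lorenz_curve([4, 1, 2, 1])  # hypothetical incomes
print(curve[0], curve[-1])
```

Because both axes are divided by their totals, the first point is always (0, 0) and the last is always (1, 1), whatever the units of the original values.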

References

  • Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In SIGMOD, vol. 5. Washington, DC: ACM Press, pp. 207–216.

  • Asheibi, A., Stirling, D., & Soetanto, D. (2008). Determination of the optimal number of clusters in harmonic data classification. In ICHQP 2008: 13th International Conference on Harmonics & Quality of Power. IEEE Publications.

  • Bock, H.-H. (1994). Classification and clustering: Problems for the future. In E. Diday, et al. (Eds.), New approaches in classification and data analysis (pp. 3–24). Berlin: Springer.

  • Böhm, C., Faloutsos, Ch., Pan, J.-Y., & Plant, C. (2006). Robust information-theoretic clustering. In KDD’06, August 20–23, 2006. Philadelphia, Pennsylvania, USA: ACM Press.

  • Bramer, M. (2007). Principles of data mining. New York: Springer.

  • Comrey, A. L., & Lee, H. B. (1992). A first course in factor analysis (2nd ed.). Hillsdale, N.J.: Lawrence Erlbaum Associates.

  • Czekanowski, J. (1909). Zur Differenzialdiagnose der Neandertal-gruppe. Korrespondenz-Blatt der Deutschen Gesellschaft für Anthropologie etc. 1, 40.

  • Czekanowski, J. (1913). Zarys metod statystycznych w zastosowaniu do antropologii (An outline for the statistical methods in application to anthropology; in Polish, Vol. 5). Warszawa: Prace Tow. Nauk. Warsz., III Wydz. Nauk Mat. i Przyr.

  • Czekanowski, J. (1926). Metoda podobieństwa w zastosowaniu do badań psychometrycznych (The method of similarity applied to psychometric studies; in Polish, Vol. 3). PTF, Badania Psychologiczne, Lwów.

  • Czekanowski, J. (1932). Coefficient of racial likeness and “durchschnittliche Differenz”. Anthrop. Anz. 9.

  • Davidson, I. (2000). Minimum message length clustering using Gibbs sampling. In The 16th International Conference on Uncertainty in Artificial Intelligence. Stanford University.

  • Figueiredo, M. A. T., Leitão, J. M. N., & Jain, A. K. (1999). On fitting mixture models. In E. R. Hancock & M. Pelillo (Eds.), Energy minimization methods in computer vision and pattern recognition. EMMCVPR 1999. Lecture Notes in Computer Science, 1654. Berlin, Heidelberg: Springer.

  • Fitzgibbon, L. J., Allison, L., & Dowe, D. L. (2000). Minimum message length grouping of ordered data. In H. Arimura, S. Jain, & A. Dharma (Eds.), Algorithmic learning theory. ALT 2000. Lecture Notes in Computer Science, 1968. Berlin, Heidelberg: Springer.

  • Fitzgibbon, L. J., Dowe, D. L., & Allison, L. (2002). Change-point estimation using new minimum message length approximation. In M. Ishizuka & A. Sattar (Eds.), PRICAI 2002. LNAI 2417. Berlin, Heidelberg: Springer, pp. 244–254.

  • Gan, G., Ma, C., & Wu, J. (2007). Data clustering. Theory, algorithms and applications. Philadelphia: SIAM & ASA.

  • Greco, S., Słowiński, R., & Szczęch, I. (2009). Analysis of monotonicity properties of some rule interestingness measures. Control & Cybernetics, 38(1), 9–25.

  • Grünwald, P. (1998). MDL Tutorial. Retrieved December 12, 2017 from http://homepages.cwi.nl/~pdg/ftp/mdlintro.

  • Guadagnoli, E., & Velicer, W. (1988). Relation of sample size to the stability of component patterns. Psychological Bulletin, 103, 265–275.

  • Hair, J. F., Tatham, R. L., Anderson, R. E., & Black, W. (1998). Multivariate data analysis (5th ed.). London: Prentice Hall.

  • Hansen, P., Brimberg, J., Urosević, D., & Mladenović, N. (2009). Solving large p-median clustering problems by primal-dual variable neighbourhood search. Data Mining and Knowledge Discovery, 19, 351–375.

  • Hilderman, R., & Hamilton, H. (2001). Knowledge discovery and measures of interest. Boston: Kluwer.

  • Liao, K., & Guo, D. (2008). A clustering-based approach to the capacitated facility location problem. Transactions in GIS, 12(3), 323–339.

  • Mulvey, J. M., & Beck, M. P. (1984). Solving capacitated clustering problems. European Journal of Operational Research, 18, 339–348.

  • Nielsen, L. (2011). Classification of Countries based on their level of development: How it is done and how it could be done. IMF Working Paper, WP/11/31, IMF.

  • Oliver, J. J., Baxter, R. A., & Wallace, C. S. (1998). Minimum message length segmentation. In X. Wu, R. Kotagiri, & K. B. Korb (Eds.), Research and development in knowledge discovery and data mining. PAKDD 1998. Lecture Notes in Computer Science (Lecture Notes in Artificial Intelligence), 1394. Berlin, Heidelberg: Springer.

  • Owsiński, J. W. (2001). Clustering as a model and an approach in flexible manufacturing. Taksonomia 8. Klasyfikacja i analiza danych – teoria i zastosowania. In K. Jajuga & M. Walesiak (Eds.), Prace Naukowe AE we Wrocławiu (No. 906, pp. 168–179). Wrocław: Wydawnictwo AE we Wrocławiu.

  • Owsiński, J. W. (2009). Machine-part grouping and cluster analysis: Similarities, distances and grouping criteria. Bulletin of the Polish Academy of Sciences. Technical Sciences. Special Issue: Modeling and Optimization of Manufacturing Systems, 57(3), 217–228 (Guest Editors: Z. A. Banaszak, J. Józefczyk).

  • Owsiński, J. W. (2011). The bi-partial approach in clustering and ordering: the model and the algorithms. Statistica & Applicazioni, 43–59 (Special Issue).

  • Owsiński, J. W. (2012a). Clustering and ordering via the bi-partial approach: the rationale, the model and some algorithmic considerations. In J. Pociecha & R. Decker (Eds.), Data analysis methods and its applications (pp. 109–124). Warszawa: Wydawnictwo C.H. Beck.

  • Owsiński, J. W. (2012b, June). On dividing an empirical distribution into optimal segments. Rome: SIS (Italian Statistical Society) Scientific Meeting. http://meetings.sis-statistica.org/index.php/sm/sm2012/paper/viewFile/2368/229.

  • Owsiński, J. W. (2012c). On the optimal division of an empirical distribution (and some related problems). Przegląd Statystyczny, 1, 109–122 (Special Issue).

  • Owsiński, J. W., Stańczak, J., Sęp, K., & Potrzebowski, H. (2010). Machine-part grouping in flexible manufacturing: Formalisation and the use of genetic algorithms. In P. Leitão, C. E. Pereira, & J. Barata (Eds.), 10th IFAC Workshop on Intelligent Manufacturing Systems (pp. 233–238). IFAC (DVD).

  • Owsiński, J. W., & Tarchalski, T. (2008). Pomiar jakości życia. Uwagi na marginesie pewnego rankingu (Measuring life quality. Remarks relative to a certain ranking; in Polish, No. 1, pp. 59–95). Zeszyty Naukowe Wydziału Informatycznych Technik Zarządzania “Współczesne Problemy Zarządzania”.

  • Piatetsky-Shapiro, G. (1991). Discovery, analysis and presentation of strong rules. Knowledge Discovery in Databases, 2, 29–248.

  • Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14(5), 465–471.

  • Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Computer Journal, 11(2), 185–194.

Corresponding author

Correspondence to Jan W. Owsiński.

Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Owsiński, J.W. (2020). Formulations and Rationales for Other Problems in Data Analysis. In: Data Analysis in Bi-partial Perspective: Clustering and Beyond. Studies in Computational Intelligence, vol 818. Springer, Cham. https://doi.org/10.1007/978-3-030-13389-4_4
