Compact and Understandable Descriptions of Mixtures of Bernoulli Distributions

  • Jaakko Hollmén
  • Jarkko Tikka
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4723)


Finite mixture models can be used to estimate complex, unknown probability distributions and to cluster data. The parameters of these models form a complex representation that is not directly suitable for interpretation. In this paper, we present a methodology for describing a finite mixture of multivariate Bernoulli distributions in a compact and understandable way. First, we cluster the data with the mixture model; subsequently, we extract the maximal frequent itemsets from the cluster-specific data sets. The mixture model describes the data set globally, and the frequent itemsets describe the marginal distributions of the partitioned data locally. We present the results in understandable terms that reflect the domain properties of the data. In our application, the analysis of DNA copy number amplifications, the amplification patterns are described in the nomenclature that is used in the literature to report such patterns and that is generally used by domain experts in biology and medicine.
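The two-stage procedure in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy data, the smoothing constant `eps`, and the hard assignment of rows to the component with the largest posterior are all assumptions made for illustration, and the brute-force itemset enumeration stands in for a dedicated maximal frequent itemset miner.

```python
import math
import random
from itertools import combinations


def fit_bernoulli_mixture(data, n_components, n_iter=50, seed=0):
    """EM for a finite mixture of multivariate Bernoulli distributions (sketch)."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    eps = 1e-6  # assumed smoothing constant; keeps parameters strictly inside (0, 1)
    weights = [1.0 / n_components] * n_components
    theta = [[rng.uniform(0.25, 0.75) for _ in range(d)] for _ in range(n_components)]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each row.
        resp = []
        for x in data:
            logp = []
            for j in range(n_components):
                lp = math.log(weights[j])
                for i in range(d):
                    p = theta[j][i]
                    lp += math.log(p if x[i] else 1.0 - p)
                logp.append(lp)
            m = max(logp)
            unnorm = [math.exp(v - m) for v in logp]
            s = sum(unnorm)
            resp.append([u / s for u in unnorm])
        # M-step: re-estimate mixing weights and Bernoulli parameters.
        for j in range(n_components):
            nj = sum(r[j] for r in resp)
            weights[j] = (nj + eps) / (n + n_components * eps)
            for i in range(d):
                num = sum(r[j] * x[i] for r, x in zip(resp, data))
                theta[j][i] = (num + eps) / (nj + 2.0 * eps)
    # Hard partition (an assumption): each row goes to its most probable component.
    labels = [max(range(n_components), key=lambda j: r[j]) for r in resp]
    return weights, theta, labels


def maximal_frequent_itemsets(rows, min_support):
    """Brute-force maximal frequent itemsets; practical only for a few columns."""
    if not rows:
        return []
    d = len(rows[0])
    frequent = []
    for size in range(1, d + 1):
        for items in combinations(range(d), size):
            support = sum(all(r[i] for i in items) for r in rows) / len(rows)
            if support >= min_support:
                frequent.append(frozenset(items))
    # Keep only itemsets that have no frequent proper superset.
    return [s for s in frequent if not any(s < t for t in frequent)]


def describe_clusters(data, labels, n_components, min_support):
    """Maximal frequent itemsets of each cluster-specific data set."""
    description = {}
    for j in range(n_components):
        rows = [x for x, lab in zip(data, labels) if lab == j]
        description[j] = maximal_frequent_itemsets(rows, min_support)
    return description


# Hypothetical 0-1 data: rows are samples, columns are items.
data = [[1, 1, 0, 0]] * 5 + [[1, 1, 1, 0]] + [[0, 0, 1, 1]] * 5 + [[0, 1, 1, 1]]
weights, theta, labels = fit_bernoulli_mixture(data, n_components=2)
for j, itemsets in describe_clusters(data, labels, 2, min_support=0.8).items():
    print(j, [sorted(s) for s in itemsets])
```

In an application, the column indices in each itemset would then be mapped to domain terms, such as names of chromosomal regions, to obtain the final description; that mapping is not shown here.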


Keywords: Mixture Model, Chromosomal Region, Domain Expert, Independent Component Analysis, Frequent Itemsets
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Jaakko Hollmén (1)
  • Jarkko Tikka (1)
  1. Helsinki Institute of Information Technology – HIIT, Helsinki University of Technology, Laboratory of Computer and Information Science, P.O. Box 5400, FI-02015 TKK, Espoo, Finland
