Efficient System for Clustering of Dynamic Document Database

  • Pawel Foszner
  • Aleksandra Gruca
  • Andrzej Polanski
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6874)

Abstract

In this paper we describe a system that groups and classifies documents and finds latent semantic features in a database composed of a large number of documents. The database grows constantly as the users who co-create it add new documents. Users require the system to provide information both about a specific document and about the entire set of documents. This information includes statistical data about the words in documents, information about the aspects in which these words appear, classification, clustering, etc.

To meet these expectations, we propose using methods that search for hidden patterns in multivariate data. We apply machine learning algorithms for data analysis that are useful in identifying local patterns in multivariate data. We consider two algorithms described in the literature: (1) the Probabilistic Latent Semantic Analysis method [2] and (2) the Nonnegative Matrix Factorization algorithm described in [4] and used in the text analysis system of [1].
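
As a rough illustration of the second method, the sketch below applies the multiplicative update rules of Lee and Seung [4] to a toy term-document matrix. The matrix values, the rank k = 2, the iteration count and all function names are assumptions made for this example. It is a minimal sketch of the general technique, not the system described in the paper.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius distance between v and the product w * h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, k, iterations=500, seed=0, eps=1e-9):
    """Factorize a nonnegative matrix v (terms x documents) as w * h.

    w has shape (terms x k) and h has shape (k x documents). Both are
    updated with the multiplicative rules for the squared-error objective.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random start, so every update stays nonnegative.
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h


# Hypothetical toy data: rows are terms, columns are documents.
v = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 2, 2],
    [0, 0, 1, 3],
]

w, h = nmf(v, k=2)

# Assign each document to the semantic feature with the largest weight.
clusters = [max(range(len(h)), key=lambda a: h[a][j]) for j in range(len(v[0]))]
print("reconstruction error:", frobenius_error(v, w, h))
print("document clusters:", clusters)
```

Under these assumptions, each column of `h` gives the weights of one document on the k latent features, and each column of `w` gives the weights of the terms in one feature. Taking the largest weight per document is one simple way to turn the factorization into a clustering.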

Keywords

clustering, classification, NMF, semantic features, document database

References

  1. Chagoyen, M., Carmona-Saez, P., Shatkay, H., Carazo, J., Pascual-Montano, A.: Discovering semantic features in the literature: a foundation for building functional associations. BMC Bioinformatics 7(1), 41 (2006)
  2. Hofmann, T.: Probabilistic latent semantic analysis. In: Proc. of Uncertainty in Artificial Intelligence, pp. 289–296 (1999)
  3. Hofmann, T.: Unsupervised learning by probabilistic latent semantic analysis. Machine Learning 42, 177–196 (2001)
  4. Lee, D., Seung, H.: Algorithms for non-negative matrix factorization
  5. Lee, D., Seung, H.: Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999)

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Pawel Foszner (1)
  • Aleksandra Gruca (1)
  • Andrzej Polanski (1)
  1. Institute of Informatics, Silesian University of Technology, Gliwice, Poland
