Document Representation Based on Maximal Frequent Sequence Sets

  • Edith Hernández-Reyes
  • J. Fco. Martínez-Trinidad
  • J. A. Carrasco-Ochoa
  • René A. García-Hernández
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4225)

Abstract

In document clustering, documents are commonly represented through the vector space model as word vectors whose features correspond to the words of the documents. However, a document set contains a large number of distinct words, so the vectors can be very high-dimensional. The vector space model also ignores word order, which could be useful for grouping similar documents. To reduce these disadvantages, we propose a new document representation in which each document is represented as the set of its maximal frequent sequences. The proposed representation is applied to document clustering, the quality of the clustering is evaluated through internal and external measures, and the results are compared with those obtained with the vector space model.
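The idea of representing a document as a set of maximal frequent sequences can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes that a sequence is a run of consecutive words, that it is frequent when it occurs at least `min_support` times within the same document, and that it is maximal when it is not contained in a longer frequent sequence. The function names, the `min_support` parameter, and the use of Jaccard similarity to compare two documents are all assumptions made for illustration; the abstract does not specify any of them.

```python
from collections import Counter


def maximal_frequent_sequences(words, min_support=2):
    """Return the maximal contiguous word sequences that occur at least
    min_support times in `words` (assumed definition, see lead-in)."""
    frequent = set()
    n = 1
    while True:
        # Count every contiguous n-word sequence in the document.
        counts = Counter(
            tuple(words[i:i + n]) for i in range(len(words) - n + 1)
        )
        level = {seq for seq, c in counts.items() if c >= min_support}
        if not level:
            # No frequent sequence of length n, so none of length n + 1 either.
            break
        frequent |= level
        n += 1

    def contains(longer, shorter):
        k = len(shorter)
        return any(
            longer[i:i + k] == shorter for i in range(len(longer) - k + 1)
        )

    # Keep only sequences not contained in a longer frequent sequence.
    return {
        s for s in frequent
        if not any(len(t) > len(s) and contains(t, s) for t in frequent)
    }


def jaccard(a, b):
    """Set overlap used here as a hypothetical document similarity."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


doc1 = "the money market fund rose while the money market fund fell".split()
doc2 = "the money market fund grew and the money market fund paid".split()
doc3 = "a money market fund is a money market fund".split()

mfs1 = maximal_frequent_sequences(doc1)
mfs2 = maximal_frequent_sequences(doc2)
mfs3 = maximal_frequent_sequences(doc3)
```

Under these assumptions each document collapses to a small set of word sequences rather than a vector over the whole vocabulary, and word order is preserved inside each sequence. Any set-based similarity could then feed a clustering algorithm; Jaccard is used here only because it is simple.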

Keywords

Document Collection; Money Market; Vector Space Model; Cluster Quality; Document Cluster
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Edith Hernández-Reyes (1)
  • J. Fco. Martínez-Trinidad (1)
  • J. A. Carrasco-Ochoa (1)
  • René A. García-Hernández (1)

  1. National Institute for Astrophysics, Optics and Electronics, Puebla, México
