A Statistical Model for Topically Segmented Documents

  • Giovanni Ponti
  • Andrea Tagarelli
  • George Karypis
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6926)

Abstract

Generative models for text data are based on the idea that a document can be modeled as a mixture of topics, each of which is represented as a probability distribution over terms. Such models have traditionally treated a document as an indivisible unit of the generative process, an assumption that may not suit documents with an explicit multi-topic structure. This paper presents a generative model that exploits a given decomposition of documents into smaller, topically cohesive text blocks (segments). A new variable is introduced to model the within-document segments: with this variable at the document level, word generation depends not only on the topics but also on the segments, while the topic latent variable is associated directly with the segments rather than with the document as a whole. Experimental results show that, compared to existing generative models, the proposed model achieves better language-modeling perplexity and better supports effective clustering of documents.
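
The generative story outlined in the abstract can be illustrated with a small sampling sketch. It is only one possible reading of the abstract, not the paper's actual formulation: it assumes that each word is produced by first drawing a segment of the document, then a topic conditioned on that segment, then a word conditioned on both the topic and the segment. The function names, the toy segments ("intro", "results"), the topics, the vocabulary, and all probability values are hypothetical and chosen purely for illustration.

```python
import random


def draw(dist, rng):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes])[0]


def sample_document(p_seg, p_topic_given_seg, p_word_given_topic_seg, n_words, rng):
    """Sample n_words (segment, topic, word) triples for a single document.

    Assumed factorization (an illustration, not taken from the paper):
    segment ~ P(s | d), topic ~ P(z | s), word ~ P(w | z, s).
    """
    tokens = []
    for _ in range(n_words):
        s = draw(p_seg, rng)                          # segment within the document
        z = draw(p_topic_given_seg[s], rng)           # topic tied to the segment
        w = draw(p_word_given_topic_seg[(z, s)], rng) # word tied to topic and segment
        tokens.append((s, z, w))
    return tokens


# Hypothetical toy parameters for one document with two segments.
P_SEG = {"intro": 0.5, "results": 0.5}
P_TOPIC_GIVEN_SEG = {
    "intro": {"sports": 0.9, "finance": 0.1},
    "results": {"sports": 0.2, "finance": 0.8},
}
P_WORD_GIVEN_TOPIC_SEG = {
    ("sports", "intro"): {"match": 0.6, "team": 0.4},
    ("sports", "results"): {"score": 0.7, "team": 0.3},
    ("finance", "intro"): {"market": 0.5, "bank": 0.5},
    ("finance", "results"): {"profit": 0.6, "market": 0.4},
}

tokens = sample_document(
    P_SEG, P_TOPIC_GIVEN_SEG, P_WORD_GIVEN_TOPIC_SEG,
    n_words=20, rng=random.Random(0),
)
for s, z, w in tokens[:5]:
    print(s, z, w)
```

In a model that treats the document as an indivisible unit, the topic would instead be drawn from a single document-level distribution; here the topic distribution changes from segment to segment, which is the contrast the abstract draws.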

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Giovanni Ponti (1)
  • Andrea Tagarelli (2)
  • George Karypis (3)
  1. ENEA - Portici Research Center, Italy
  2. Department of Electronics, Computer and Systems Sciences, University of Calabria, Italy
  3. Department of Computer Science & Engineering, Digital Technology Center, University of Minnesota, Minneapolis, USA
