Language Modelling of Constraints for Text Clustering

  • Javier Parapar
  • Álvaro Barreiro
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7224)

Abstract

Constrained clustering is a recently presented family of semi-supervised learning algorithms. These methods use domain information to impose constraints over the clustering output. Such constraints (typically pair-wise constraints between documents) are usually introduced by designing new clustering algorithms that enforce them. In this paper we present an alternative approach to constrained clustering in which, instead of defining new algorithms or objective functions, the constraints are introduced by modifying the document representation through language modelling. More precisely, the constraints are modelled using the well-known Relevance Models, which have been used successfully in other retrieval tasks such as pseudo-relevance feedback. To the best of our knowledge, this is the first attempt at such an approach. The results show that the presented approach is an effective method for constrained clustering, even improving on the results of existing constrained clustering algorithms.
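The abstract does not give the estimation details, so the following is only a minimal sketch of the general idea of pushing pair-wise constraints into the document representation rather than into the clustering algorithm. It is not the Relevance Model estimation used in the paper: it swaps in a plain uniform mixture of unigram models over must-linked documents. The function names, the mixing parameter `lam`, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter


def unigram_model(tokens):
    """Maximum-likelihood unigram language model P(w|d) for one document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def expand_with_must_links(doc_id, docs, must_links, lam=0.5):
    """Mix a document's own model with the average model of its must-linked documents.

    Hypothetical illustration: linked documents get uniform weights and `lam`
    is an assumed interpolation parameter, not a value taken from the paper.
    """
    own = unigram_model(docs[doc_id])
    # Collect the other endpoint of every must-link that involves doc_id.
    linked = [b if a == doc_id else a for a, b in must_links if doc_id in (a, b)]
    if not linked:
        return own  # Unconstrained documents keep their original representation.
    linked_model = Counter()
    for other in linked:
        for w, p in unigram_model(docs[other]).items():
            linked_model[w] += p / len(linked)
    vocab = set(own) | set(linked_model)
    return {w: lam * own.get(w, 0.0) + (1 - lam) * linked_model.get(w, 0.0)
            for w in vocab}


# Toy data (invented for this sketch).
docs = {
    "d1": ["jaguar", "car", "engine"],
    "d2": ["car", "engine", "speed", "engine"],
    "d3": ["jaguar", "cat", "jungle"],
}
must_links = [("d1", "d2")]

expanded_d1 = expand_with_must_links("d1", docs, must_links)
unchanged_d3 = expand_with_must_links("d3", docs, must_links)
```

After the expansion, `d1` assigns probability mass to terms it never contained (such as "speed") because they occur in its must-linked partner, so any standard clustering algorithm run on the expanded representations sees the two documents as more similar. Documents without constraints, such as `d3`, are left untouched.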

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Javier Parapar (1)
  • Álvaro Barreiro (1)
  1. IRLab, Computer Science Department, University of A Coruña, Spain
