Clustering by Intent: A Semi-Supervised Method to Discover Relevant Clusters Incrementally

  • George Forman
  • Hila Nachlieli
  • Renato Keshet
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9286)


Our business users have often been frustrated with clustering results that do not suit their purpose; when trying to discover clusters of product complaints, the algorithm may return clusters of product models instead. The fundamental issue is that complex text data can be clustered in many different ways, and, really, it is optimistic to expect relevant clusters from an unsupervised process, even with parameter tinkering.

We studied this problem in an interactive context and developed an effective solution that recasts the problem formulation in a way that is radically different from traditional or semi-supervised clustering. Given training labels for some known classes, our method incrementally proposes complementary clusters. In tests on various business datasets, we consistently obtain relevant results at interactive time scales. This paper describes the method and demonstrates its superior ability using publicly available datasets. For automated evaluation, we devised a unique cluster evaluation framework to match the business user’s utility.


Keywords: Semi-supervised clustering, Class discovery, Topic detection





Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Hewlett-Packard Labs, Palo Alto, USA
  2. Hewlett-Packard Labs, Haifa, Israel
