Text Document Cluster Analysis Through Visualization of 3D Projections

Data Mining for Service

Part of the book series: Studies in Big Data ((SBD,volume 3))

Abstract

Clustering is widely used to understand the content of large sets of text documents. As the volume of stored data has grown, so has the need for tools that help users interpret the output of clustering algorithms. We developed a new visual interface to meet this need. The interface helps non-technical users understand documents and clusters in massive databases (e.g., document content, cluster sizes, distances between clusters, similarity of documents within clusters, and extent of cluster overlap) and evaluate the quality of output from different clustering algorithms. When a user enters a keyword query describing their interests, the system retrieves the matching documents and clusters and displays them in three dimensions. More specifically, given a query and a set of documents modeled as vectors in an orthogonal coordinate system, the system finds the three orthogonal coordinate axes that are most relevant and uses them to generate the display (alternatively, users may choose any three orthogonal axes). To demonstrate the value of the system, we conducted implementation studies with an artificial data set and with a de facto benchmark data set of news articles from the Text REtrieval Conference (TREC) run by the United States NIST.
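The axis-selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how relevance is measured, so the scoring rule used here (the absolute value of the query's component along each axis) is an assumption, and the names `pick_axes` and `project` and the toy data are hypothetical, not taken from the chapter.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def pick_axes(query: Vector, basis: Sequence[Vector], k: int = 3) -> List[int]:
    """Return indices of the k axes with the largest |<query, axis>|.

    ASSUMPTION: relevance of an axis is scored by the magnitude of the
    query's component along it; the chapter may use a different criterion.
    """
    scores = [abs(dot(query, axis)) for axis in basis]
    ranked = sorted(range(len(basis)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def project(docs: Sequence[Vector], basis: Sequence[Vector],
            axes: Sequence[int]) -> List[Tuple[float, ...]]:
    """Coordinates of each document along the chosen orthogonal axes."""
    return [tuple(dot(d, basis[i]) for i in axes) for d in docs]


# Toy data (hypothetical): five term axes forming the standard basis.
basis = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
docs = [
    [2.0, 0.0, 1.0, 0.0, 3.0],
    [0.0, 1.0, 0.0, 4.0, 1.0],
    [1.0, 1.0, 1.0, 1.0, 1.0],
]
query = [0.0, 0.5, 0.0, 2.0, 1.0]

axes = pick_axes(query, basis)        # three axes ranked by query weight
points = project(docs, basis, axes)   # one 3D point per document
```

Because `basis` can be any orthonormal set of vectors, the same sketch also covers the case where a user supplies three orthogonal axes directly: pass their indices to `project` and skip `pick_axes`.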

Acknowledgments

This work was conducted at IBM Research-Tokyo. The authors would like to acknowledge many helpful conversations with their colleagues. Our special thanks go out to Arquimedes Canedo, Yun Zhang, Steven Gardiner and Mike Berry for helpful suggestions on our manuscript and to Koichi Takeda and Fumio Ando for their thoughtful management and support of our work.

Author information

Corresponding author

Correspondence to Mei Kobayashi.

Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Aono, M., Kobayashi, M. (2014). Text Document Cluster Analysis Through Visualization of 3D Projections. In: Yada, K. (ed.) Data Mining for Service. Studies in Big Data, vol 3. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45252-9_15

  • DOI: https://doi.org/10.1007/978-3-642-45252-9_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-45251-2

  • Online ISBN: 978-3-642-45252-9

  • eBook Packages: Engineering, Engineering (R0)
