A New Evolving Tree-Based Model with Local Re-learning for Document Clustering and Visualization

Abstract

The Evolving Tree (ETree) is a hierarchical clustering and visualization model that allows the number of clusters to grow and evolve with new data samples in an online learning manner. While many hierarchical clustering models are available in the literature, ETree stands out because of its visualization capability. It is an enhancement of the Self-Organizing Map, a well-known and useful clustering and visualization model. ETree organises the trained data samples in a tree structure for better presentation and visualization, especially for high-dimensional data samples. Even though ETree has been used in a number of applications, its use in textual document clustering and visualization has been limited. In this paper, ETree is modified and deployed as a model for textual document clustering and visualization problems. We introduce a new local re-learning procedure that allows the tree structure to grow and adapt to new features, i.e., new words from new textual documents. The performance of the proposed ETree model is evaluated with two document data sets, one benchmark and one real. Several key aspects of the proposed ETree model are evaluated, including its topology representation, learning time, and recall and precision rates. The results show that the proposed local re-learning procedure is useful for handling an increasing number of features incrementally. In summary, this study contributes a modified ETree model and demonstrates its use in a new domain, i.e., textual document clustering and visualization.
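
To make the "new features" problem concrete, the following minimal Python sketch shows what happens to a feature space when a new document brings an unseen word. It is an illustration under stated assumptions, not the local re-learning procedure of the paper: the class `GrowingVocabulary`, the functions `pad_prototypes` and `best_matching_unit`, the use of raw term frequencies, and the choice to zero-pad existing node weight vectors are all hypothetical.

```python
import numpy as np


class GrowingVocabulary:
    """Maps words to feature indices and grows as unseen words arrive (hypothetical helper)."""

    def __init__(self):
        self.index = {}

    def add_document(self, tokens):
        """Register unseen words; return how many new features were added."""
        added = 0
        for tok in tokens:
            if tok not in self.index:
                self.index[tok] = len(self.index)
                added += 1
        return added

    def vectorize(self, tokens):
        """Term-frequency vector over the current vocabulary.

        Assumes add_document() has already been called for these tokens.
        """
        vec = np.zeros(len(self.index))
        for tok in tokens:
            vec[self.index[tok]] += 1.0
        return vec


def pad_prototypes(prototypes, n_features):
    """Zero-pad node weight vectors to the enlarged feature space.

    Assumption: a node has no evidence about a new word yet, so its weight starts at 0.
    """
    n_nodes, old_dim = prototypes.shape
    padded = np.zeros((n_nodes, n_features))
    padded[:, :old_dim] = prototypes
    return padded


def best_matching_unit(prototypes, x):
    """Index of the prototype closest to x in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(prototypes - x, axis=1)))


vocab = GrowingVocabulary()
vocab.add_document(["tree", "cluster"])
prototypes = np.array([[1.0, 0.0], [0.0, 1.0]])  # two nodes, two features

new_doc = ["cluster", "cluster", "visualization"]
n_new = vocab.add_document(new_doc)               # "visualization" is unseen
prototypes = pad_prototypes(prototypes, len(vocab.index))
x = vocab.vectorize(new_doc)
bmu = best_matching_unit(prototypes, x)
```

In this toy setting the existing nodes stay comparable with the longer document vector after padding, so a best-matching node can still be found; how the affected part of the tree is then re-trained is the subject of the paper itself and is not reproduced here.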

Acknowledgements

The authors thank the 2nd Regional Engineering Conference 2008 (EnCon 2008) and its organizing committee for permission to use the collection of abstracts from EnCon 2008. Special thanks go to Miss Liew Hui Chang, who helped with collecting and compiling the information.

Author information

Corresponding author

Correspondence to Kai Meng Tay.

Appendix

See Tables 7, 8, 9, and 10.

Table 7 Textual documents mapped onto \(N_{81,14} \)
Table 8 Textual documents mapped onto \(N_{82,14} \)
Table 9 Summary of textual documents mapped onto \(N_{4,2} \)
Table 10 Summary of textual documents mapped onto \(N_{5,2} \)

Cite this article

Chang, W.L., Tay, K.M. & Lim, C.P. A New Evolving Tree-Based Model with Local Re-learning for Document Clustering and Visualization. Neural Process Lett 46, 379–409 (2017). https://doi.org/10.1007/s11063-017-9597-3

Keywords

  • Evolving tree
  • Textual documents
  • Clustering
  • Visualization
  • Local re-learning