Exploration of a Text Collection and Identification of Topics by Clustering
An application of cluster analysis to identify topics in a collection of poster abstracts from the 2006 Society for Neuroscience (SfN) Annual Meeting is presented. Topics were identified by selecting, from the abstracts belonging to each cluster, the terms with the highest scores under different ranking schemes. The ranking scheme based on log-entropy performed better in this task than the more classical TF-IDF schemes. The extracted topics were evaluated by assigning each cluster to one dominant category and comparing the result with previously defined thematic categories for which titles are available. The results show that repeated bisecting k-means performs better than standard k-means.
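The term-ranking step described above can be illustrated with a minimal sketch. It assumes the common textbook form of log-entropy weighting (local weight log(1 + tf), global weight 1 + Σ p log p / log n) and ranks the terms of a cluster by summing the weights over its documents. The abstract does not give the exact formula or aggregation used in the paper, so both are assumptions, and the function names `log_entropy_weights` and `top_terms` are hypothetical.

```python
import math
from collections import Counter


def log_entropy_weights(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. Assumed (textbook) scheme:
      local weight  l_ij = log(1 + tf_ij)
      global weight g_i  = 1 + sum_j p_ij * log(p_ij) / log(n)
    with p_ij = tf_ij / (total count of term i) and n = number of documents.
    """
    n = len(docs)
    tfs = [Counter(doc) for doc in docs]

    # Total count of each term over the whole collection.
    totals = Counter()
    for tf in tfs:
        totals.update(tf)

    # Global weight: near 0 for terms spread evenly over all documents,
    # 1 for terms concentrated in a single document.
    global_w = {}
    for term, total in totals.items():
        entropy = 0.0
        for tf in tfs:
            count = tf.get(term, 0)
            if count:
                p = count / total
                entropy += p * math.log(p)
        global_w[term] = 1.0 + entropy / math.log(n) if n > 1 else 1.0

    return [
        {term: global_w[term] * math.log(1 + count) for term, count in tf.items()}
        for tf in tfs
    ]


def top_terms(weights, labels, cluster, k=3):
    """Rank the terms of one cluster by summed weight (an assumed aggregation)."""
    scores = Counter()
    for doc_weights, label in zip(weights, labels):
        if label == cluster:
            scores.update(doc_weights)  # adds the weights term by term
    return [term for term, _ in scores.most_common(k)]


# Toy collection: "the" occurs evenly everywhere, so its weight vanishes.
docs = [["neuron", "the"], ["synapse", "the"], ["neuron", "the"]]
labels = [0, 1, 0]  # hypothetical cluster assignments
weights = log_entropy_weights(docs)
print(top_terms(weights, labels, cluster=0, k=1))
```

In this toy collection the evenly distributed term receives a global weight of approximately zero, which is the property that lets log-entropy suppress uninformative terms when cluster topics are labelled.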
Keywords: Nonnegative Matrix Factorization, Vector Space Model, Document Frequency, Original Category, Ranking Scheme