Abstract
In this work we consider the problem of clustering narrow-domain short texts, such as academic abstracts. Our main objective is to check whether the quality of the k-means algorithm can be improved by expanding the feature space with a dictionary of word groups selected from the texts on the basis of a fixed set of patterns. We also check whether clustering quality can be increased by mapping the feature spaces to a lower-dimensional semantic space using Latent Semantic Indexing (LSI). The results suggest that both modifications are feasible in practice when compared with k-means in the feature space defined only by the main dictionary of the corpus.
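As a rough illustration of the feature-space expansion described in the abstract, the sketch below builds a word dictionary for a tiny corpus and then extends it with word groups. It is not the authors' method: the abstract does not list the patterns used, so the single pattern here (two adjacent tokens, neither a stopword), the stopword list, the sample texts, and all function names are assumptions made for illustration only.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "for", "and", "to", "with", "is", "are"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (hyphens kept)."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

def word_groups(tokens):
    """Assumed stand-in pattern: two adjacent tokens, neither a stopword."""
    return [
        f"{a} {b}"
        for a, b in zip(tokens, tokens[1:])
        if a not in STOPWORDS and b not in STOPWORDS
    ]

def build_features(texts, use_groups=True):
    """Return (sorted feature list, one count vector per text)."""
    docs = []
    for text in texts:
        tokens = tokenize(text)
        # Main dictionary: single non-stopword tokens.
        counts = Counter(t for t in tokens if t not in STOPWORDS)
        if use_groups:
            # Expansion: add the word groups as extra features.
            counts.update(word_groups(tokens))
        docs.append(counts)
    features = sorted(set().union(*docs))
    vectors = [[doc[f] for f in features] for doc in docs]
    return features, vectors

# Made-up sample texts, not taken from the paper's corpora.
abstracts = [
    "Clustering of short texts with k-means",
    "Latent semantic indexing for short texts",
]
base_features, base_vectors = build_features(abstracts, use_groups=False)
expanded_features, expanded_vectors = build_features(abstracts, use_groups=True)
```

The expanded vectors could then be passed to any k-means implementation, optionally after a dimensionality reduction step such as a truncated SVD, which is the usual way LSI is computed.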
Acknowledgement
This work was partially supported financially by the Government of the Russian Federation, Grant 074-U01.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Popova, S., Danilova, V., Egorov, A. (2014). Clustering Narrow-Domain Short Texts Using K-Means, Linguistic Patterns and LSI. In: Ignatov, D., Khachay, M., Panchenko, A., Konstantinova, N., Yavorsky, R. (eds) Analysis of Images, Social Networks and Texts. AIST 2014. Communications in Computer and Information Science, vol 436. Springer, Cham. https://doi.org/10.1007/978-3-319-12580-0_18
DOI: https://doi.org/10.1007/978-3-319-12580-0_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12579-4
Online ISBN: 978-3-319-12580-0
eBook Packages: Computer Science, Computer Science (R0)