Abstract
Nowadays, Twitter is a rich source of online reviews, ratings, recommendations, and other forms of opinion expression. This has created a compelling demand for innovative mechanisms to store, search, organize, and analyze all this data automatically. Unfortunately, enough labeled Twitter data is seldom available, either because labeling is costly or because labels are impossible to obtain given the rapid growth and change of this kind of media. To avoid such limitations, unsupervised categorization strategies are employed. In this paper we address the problem of cross-domain short text clustering through a compact representation that avoids the problems arising from the high dimensionality and sparseness of the vocabulary. Our experiments, conducted in a cross-domain scenario using very short texts, indicate that the proposed representation yields high-quality groups, as measured by the Silhouette coefficient.
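As an illustration of the evaluation setting only, the following minimal sketch clusters a handful of short texts with k-means from scikit-learn and scores the result with the Silhouette coefficient. It assumes a plain Bag-of-Words baseline, not the compact representation proposed in the paper; the toy corpus, the helper name `cluster_and_score`, and all parameter values are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy corpus (not data from the paper): short texts on two topics.
texts = [
    "great phone battery life",
    "phone battery drains fast",
    "new phone battery great",
    "bank raises interest rates",
    "bank interest rates fall",
    "interest rates bank loan",
]


def cluster_and_score(docs, n_clusters=2, seed=0):
    """Cluster docs with k-means over Bag-of-Words counts.

    Returns the cluster labels and the mean Silhouette coefficient.
    """
    # Bag-of-Words term counts; dense is acceptable for a toy corpus.
    features = CountVectorizer().fit_transform(docs).toarray()
    # Assumed settings: 10 restarts and a fixed seed for repeatability.
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(features)
    # The Silhouette coefficient lies in [-1, 1]; higher means tighter,
    # better separated clusters.
    score = silhouette_score(features, labels)
    return labels, score


labels, score = cluster_and_score(texts)
print(labels, round(score, 3))
```

Because the Silhouette coefficient needs no gold labels, it suits the unlabeled setting the abstract describes; a different representation can be compared by replacing only the `features` step.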
Notes
- 1. A research problem known as domain adaptation [1].
- 2. A word that occurs just once within a text.
- 3. Bag-of-Words representation.
- 4. We employed k-means as implemented in http://scikit-learn.org/.
- 5.
References
Li, Q.: Literature survey: domain adaptation algorithms for natural language processing. Department of Computer Science, The Graduate Center, The City University of New York, pp. 8–10 (2012)
Dai, W., Yang, Q., Xue, G.-R., Yu, Y.: Self-taught clustering. In: Proceedings of the 25th International Conference on Machine Learning, ICML 2008, pp. 200–207. ACM, New York (2008)
Gu, Q., Zhou, J.: Learning the shared subspace for multi-task clustering and transductive transfer classification. In: 2009 Ninth IEEE International Conference on Data Mining, pp. 159–168, December 2009
Bhattacharya, I., Godbole, S., Joshi, S., Verma, A.: Cross-guided clustering: transfer of relevant supervision across tasks. ACM Trans. Knowl. Discov. Data 6, 9:1–9:28 (2012)
Samanta, S., Selvan, A.T., Das, S.: Cross-domain clustering performed by transfer of knowledge across domains. In: 2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), pp. 1–4, December 2013
Pinto, D., Jiménez-Salazar, H., Rosso, P.: Clustering abstracts of scientific texts using the transition point technique. In: Gelbukh, A. (ed.) CICLing 2006. LNCS, vol. 3878, pp. 536–546. Springer, Heidelberg (2006). doi:10.1007/11671299_55
Moyotl-Hernández, E., Jiménez-Salazar, H.: Enhancement of DTP feature selection method for text categorization. In: Gelbukh, A. (ed.) CICLing 2005. LNCS, vol. 3406, pp. 719–722. Springer, Heidelberg (2005). doi:10.1007/978-3-540-30586-6_80
Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34, 1–47 (2002)
Sankaranarayanan, J., Samet, H., Teitler, B.E., Lieberman, M.D., Sperling, J.: TwitterStand: news in tweets. In: Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS 2009, pp. 42–51. ACM, New York (2009)
Atefeh, F., Khreich, W.: A survey of techniques for event detection in Twitter. Comput. Intell. 31, 132–164 (2015)
Porter, M.F.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)
Zipf, G.: Human Behaviour and the Principle of Least-Effort. Addison-Wesley, Cambridge (1949)
Booth, A.D.: A law of occurrences for words of low frequency. Inf. Control 10(4), 386–393 (1967)
Aggarwal, C.C., Zhai, C.: A survey of text clustering algorithms. In: Aggarwal, C., Zhai, C. (eds.) Mining Text Data, pp. 77–128. Springer US, Boston (2012)
Rousseeuw, P.J.: Silhouettes: graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)
Amigó, E., et al.: Overview of RepLab 2013: evaluating online reputation monitoring systems. In: Forner, P., Müller, H., Paredes, R., Rosso, P., Stein, B. (eds.) CLEF 2013. LNCS, vol. 8138, pp. 333–352. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40802-1_31
Acknowledgments
This work was partially funded by CONACyT through project grant 258588, the Thematic Networks program (Language Technologies Thematic Network projects 260178, 271622), and scholarship number 587804. We also thank UAM Cuajimalpa and SNI-CONACyT for their support.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Núñez-Reyes, A., Villatoro-Tello, E., Ramírez-de-la-Rosa, G., Sánchez-Sánchez, C. (2017). A Compact Representation for Cross-Domain Short Text Clustering. In: Sidorov, G., Herrera-Alcántara, O. (eds) Advances in Computational Intelligence. MICAI 2016. Lecture Notes in Computer Science, vol 10061. Springer, Cham. https://doi.org/10.1007/978-3-319-62434-1_2
DOI: https://doi.org/10.1007/978-3-319-62434-1_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-62433-4
Online ISBN: 978-3-319-62434-1
eBook Packages: Computer Science, Computer Science (R0)