A Compact Representation for Cross-Domain Short Text Clustering

Núñez-Reyes, Alba; Villatoro-Tello, Esaú; Ramírez-de-la-Rosa, Gabriela; Sánchez-Sánchez, Christian

doi:10.1007/978-3-319-62434-1_2

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10061)

Included in the following conference series:

Mexican International Conference on Artificial Intelligence

Abstract

Twitter is now a rich source of online reviews, ratings, recommendations, and other forms of opinion expression. This has created a compelling demand for innovative mechanisms to store, search, organize, and analyze all this data automatically. Unfortunately, enough labeled Twitter data is seldom available, either because of the cost of labeling or because it is impossible to obtain, given how rapidly this kind of media grows and changes. To avoid such limitations, unsupervised categorization strategies are employed. In this paper we address the problem of cross-domain short text clustering through a compact representation that avoids the problems arising from the high dimensionality and sparseness of the vocabulary. Our experiments, conducted in a cross-domain scenario using very short texts, indicate that the proposed representation yields high-quality groups, according to the Silhouette coefficient values obtained.
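
The Silhouette coefficient used above as the quality measure (Rousseeuw, 1987) can be illustrated with a minimal sketch. For each point it compares the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b), giving s = (b - a) / max(a, b). The function names, the choice of Euclidean distance, and the toy 2-D points are assumptions made for illustration only; they are not the authors' compact representation, data, or implementation.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_score(points, labels):
    """Mean silhouette value over all points (illustrative sketch)."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster: s(i) = 0 by convention
        # a: mean distance to the other members of the point's own cluster.
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Hypothetical toy data: two well-separated groups of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # grouping that matches the geometry
bad = silhouette_score(points, [0, 1, 0, 1])   # grouping that mixes the two groups
```

Values close to 1 indicate compact, well-separated groups, while values near or below 0 indicate overlapping or misassigned points. In this toy case the first labeling scores clearly higher than the second.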

Notes

1. A research problem known as domain adaptation [1].
2. A word that occurs just once within a text.
3. Bag-of-Words representation.
4. We employed the k-means algorithm as implemented in http://scikit-learn.org/.
5. http://nlp.uned.es/replab2013/.

References

1. Li, Q.: Literature survey: domain adaptation algorithms for natural language processing. Department of Computer Science, The Graduate Center, The City University of New York, pp. 8–10 (2012)
2. Dai, W., Yang, Q., Xue, G.-R., Yu, Y.: Self-taught clustering. In: Proceedings of the 25th International Conference on Machine Learning, ICML 2008, pp. 200–207. ACM, New York (2008)
3. Gu, Q., Zhou, J.: Learning the shared subspace for multi-task clustering and transductive transfer classification. In: 2009 Ninth IEEE International Conference on Data Mining, pp. 159–168, December 2009
4. Bhattacharya, I., Godbole, S., Joshi, S., Verma, A.: Cross-guided clustering: transfer of relevant supervision across tasks. ACM Trans. Knowl. Discov. Data 6, 9:1–9:28 (2012)
5. Samanta, S., Selvan, A.T., Das, S.: Cross-domain clustering performed by transfer of knowledge across domains. In: 2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), pp. 1–4, December 2013
6. Pinto, D., Jiménez-Salazar, H., Rosso, P.: Clustering abstracts of scientific texts using the transition point technique. In: Gelbukh, A. (ed.) CICLing 2006. LNCS, vol. 3878, pp. 536–546. Springer, Heidelberg (2006). doi:10.1007/11671299_55
7. Moyotl-Hernández, E., Jiménez-Salazar, H.: Enhancement of DTP feature selection method for text categorization. In: Gelbukh, A. (ed.) CICLing 2005. LNCS, vol. 3406, pp. 719–722. Springer, Heidelberg (2005). doi:10.1007/978-3-540-30586-6_80
8. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34, 1–47 (2002)
9. Sankaranarayanan, J., Samet, H., Teitler, B.E., Lieberman, M.D., Sperling, J.: Twitterstand: news in tweets. In: Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS 2009, pp. 42–51. ACM, New York (2009)
10. Atefeh, F., Khreich, W.: A survey of techniques for event detection in Twitter. Comput. Intell. 31, 132–164 (2015)
11. Porter, M.F.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)
12. Zipf, G.: Human Behaviour and the Principle of Least-Effort. Addison-Wesley, Cambridge (1949)
13. Booth, A.D.: A law of occurrences for words of low frequency. Inf. Control 10(4), 386–393 (1967)
14. Aggarwal, C.C., Zhai, C.: A survey of text clustering algorithms. In: Aggarwal, C., Zhai, C. (eds.) Mining Text Data, pp. 77–128. Springer US, Boston (2012)
15. Rousseeuw, P.J.: Silhouettes: graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)
16. Amigó, E., et al.: Overview of RepLab 2013: evaluating online reputation monitoring systems. In: Forner, P., Müller, H., Paredes, R., Rosso, P., Stein, B. (eds.) CLEF 2013. LNCS, vol. 8138, pp. 333–352. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40802-1_31

Acknowledgments

This work was partially funded by CONACyT through project grant 258588, the Thematic Networks program (Language Technologies Thematic Network projects 260178, 271622), and scholarship number 587804. We also thank UAM Cuajimalpa and SNI-CONACyT for their support.

Author information

Authors and Affiliations

Maestría en Diseño, Información y Comunicación (MADIC), División de Ciencias de la Comunicación y Diseño, Universidad Autónoma Metropolitana (UAM) Unidad Cuajimalpa, Mexico City, Mexico
Alba Núñez-Reyes
Language and Reasoning Research Group, Information Technologies Department, Universidad Autónoma Metropolitana (UAM) Unidad Cuajimalpa, Mexico City, Mexico
Alba Núñez-Reyes, Esaú Villatoro-Tello, Gabriela Ramírez-de-la-Rosa & Christian Sánchez-Sánchez

Corresponding author

Correspondence to Esaú Villatoro-Tello.

Editor information

Editors and Affiliations

Instituto Politécnico Nacional, Centro de Investigación en Computación, Mexico City, Mexico
Grigori Sidorov
Universidad Autónoma Metropolitana, Mexico City, Mexico
Oscar Herrera-Alcántara

About this paper

Cite this paper

Núñez-Reyes, A., Villatoro-Tello, E., Ramírez-de-la-Rosa, G., Sánchez-Sánchez, C. (2017). A Compact Representation for Cross-Domain Short Text Clustering. In: Sidorov, G., Herrera-Alcántara, O. (eds) Advances in Computational Intelligence. MICAI 2016. Lecture Notes in Computer Science, vol 10061. Springer, Cham. https://doi.org/10.1007/978-3-319-62434-1_2

DOI: https://doi.org/10.1007/978-3-319-62434-1_2
Published: 03 August 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-62433-4
Online ISBN: 978-3-319-62434-1
eBook Packages: Computer Science, Computer Science (R0)
