A Compact Representation for Cross-Domain Short Text Clustering

  • Conference paper
Advances in Computational Intelligence (MICAI 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10061)

Abstract

Nowadays, Twitter is a rich source of online reviews, ratings, recommendations, and other forms of opinion expression. This has created a compelling demand for innovative mechanisms to store, search, organize, and analyze all this data automatically. Unfortunately, enough labeled Twitter data is seldom available, either because of the cost of the labeling process or because labels cannot be obtained at all, given the rapid growth and change of this kind of media. To avoid such limitations, unsupervised categorization strategies are employed. In this paper we address the problem of cross-domain short text clustering through a compact representation that avoids the problems arising from the high dimensionality and sparseness of the vocabulary. Our experiments, conducted in a cross-domain scenario using very short texts, indicate that the proposed representation makes it possible to generate high-quality groups, according to the Silhouette coefficient values obtained.
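For readers unfamiliar with the evaluation measure named in the abstract, the following is a minimal sketch of the Silhouette coefficient as defined by Rousseeuw [15]. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the point's score is (b - a) / max(a, b). The function name, the Euclidean distance, and the toy data are assumptions made for illustration; the paper itself reports using scikit-learn, and nothing here reproduces its experiments.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (illustrative sketch)."""
    # Group the points by cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # Usual convention: a singleton cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's distance to itself is 0, so it does not affect the sum).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two tight, well-separated groups.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])  # close to 1
bad = silhouette_score(points, [0, 1, 0, 1])   # negative
```

Values near 1 indicate compact, well-separated groups, values near 0 indicate overlapping groups, and negative values suggest that points sit closer to a neighboring cluster than to their own.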

Notes

  1. A research problem known as domain adaptation [1].

  2. A word that occurs just once within a text.

  3. Bag-of-Words representation.

  4. We employed the k-means implementation available at http://scikit-learn.org/.

  5. http://nlp.uned.es/replab2013/.
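Notes 2 and 3 can be illustrated with a small sketch: a Bag-of-Words count for one text, and the set of words that occur just once in it. The function names and the lowercase, whitespace-based tokenization are assumptions made for illustration only; this is not the compact representation proposed in the paper, and the paper's actual preprocessing is not described on this page.

```python
from collections import Counter


def bag_of_words(text):
    """Map each token to its frequency in the text (assumed tokenization:
    lowercase, split on whitespace)."""
    return Counter(text.lower().split())


def words_occurring_once(text):
    """Return the words that occur exactly once within the text."""
    counts = bag_of_words(text)
    return {word for word, count in counts.items() if count == 1}


# Hypothetical short text, in the spirit of a tweet-sized review.
example = "good phone good price bad battery"
bow = bag_of_words(example)
once = words_occurring_once(example)
```

In very short texts most words occur only once, which is one reason a plain Bag-of-Words vector over the whole vocabulary tends to be both high-dimensional and sparse.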

References

  1. Li, Q.: Literature survey: domain adaptation algorithms for natural language processing. Department of Computer Science, The Graduate Center, The City University of New York, pp. 8–10 (2012)

  2. Dai, W., Yang, Q., Xue, G.-R., Yu, Y.: Self-taught clustering. In: Proceedings of the 25th International Conference on Machine Learning, ICML 2008, pp. 200–207. ACM, New York (2008)

  3. Gu, Q., Zhou, J.: Learning the shared subspace for multi-task clustering and transductive transfer classification. In: 2009 Ninth IEEE International Conference on Data Mining, pp. 159–168, December 2009

  4. Bhattacharya, I., Godbole, S., Joshi, S., Verma, A.: Cross-guided clustering: transfer of relevant supervision across tasks. ACM Trans. Knowl. Discov. Data 6, 9:1–9:28 (2012)

  5. Samanta, S., Selvan, A.T., Das, S.: Cross-domain clustering performed by transfer of knowledge across domains. In: 2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), pp. 1–4, December 2013

  6. Pinto, D., Jiménez-Salazar, H., Rosso, P.: Clustering abstracts of scientific texts using the transition point technique. In: Gelbukh, A. (ed.) CICLing 2006. LNCS, vol. 3878, pp. 536–546. Springer, Heidelberg (2006). doi:10.1007/11671299_55

  7. Moyotl-Hernández, E., Jiménez-Salazar, H.: Enhancement of DTP feature selection method for text categorization. In: Gelbukh, A. (ed.) CICLing 2005. LNCS, vol. 3406, pp. 719–722. Springer, Heidelberg (2005). doi:10.1007/978-3-540-30586-6_80

  8. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34, 1–47 (2002)

  9. Sankaranarayanan, J., Samet, H., Teitler, B.E., Lieberman, M.D., Sperling, J.: Twitterstand: news in tweets. In: Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS 2009, pp. 42–51. ACM, New York (2009)

  10. Atefeh, F., Khreich, W.: A survey of techniques for event detection in twitter. Comput. Intell. 31, 132–164 (2015)

  11. Porter, M.F.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)

  12. Zipf, G.: Human Behaviour and the Principle of Least-Effort. Addison-Wesley, Cambridge (1949)

  13. Booth, A.D.: A law of occurrences for words of low frequency. Inf. Control 10(4), 386–393 (1967)

  14. Aggarwal, C.C., Zhai, C.: A survey of text clustering algorithms. In: Aggarwal, C., Zhai, C. (eds.) Mining Text Data, pp. 77–128. Springer US, Boston (2012)

  15. Rousseeuw, P.J.: Silhouettes: graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)

  16. Amigó, E., et al.: Overview of replab 2013: evaluating online reputation monitoring systems. In: Forner, P., Müller, H., Paredes, R., Rosso, P., Stein, B. (eds.) CLEF 2013. LNCS, vol. 8138, pp. 333–352. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40802-1_31

Acknowledgments

This work was partially funded by CONACyT through project grant 258588, the Thematic Networks program (Language Technologies Thematic Network projects 260178, 271622), and scholarship number 587804. We also thank UAM Cuajimalpa and SNI-CONACyT for their support.

Author information

Corresponding author

Correspondence to Esaú Villatoro-Tello.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Núñez-Reyes, A., Villatoro-Tello, E., Ramírez-de-la-Rosa, G., Sánchez-Sánchez, C. (2017). A Compact Representation for Cross-Domain Short Text Clustering. In: Sidorov, G., Herrera-Alcántara, O. (eds) Advances in Computational Intelligence. MICAI 2016. Lecture Notes in Computer Science, vol 10061. Springer, Cham. https://doi.org/10.1007/978-3-319-62434-1_2

  • DOI: https://doi.org/10.1007/978-3-319-62434-1_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-62433-4

  • Online ISBN: 978-3-319-62434-1

  • eBook Packages: Computer Science, Computer Science (R0)
