Abstract
In microblogging services such as Twitter, users post short text messages called tweets, which are limited in length. These tweets sometimes express opinions about different topics and are presented to the user in chronological order. Because short texts do not provide sufficient contextual information, traditional text representation methods have several limitations when applied directly to short-text tasks. To tackle these issues, we propose to enrich the representation of Arabic tweets by exploiting the internal semantics of the original tweets together with external knowledge from the web, used as a large and open corpus. The method is also based on Rough Set Theory, a mathematical tool for dealing with vagueness and uncertainty.
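As a reminder of how Rough Set Theory handles vagueness, the following minimal sketch computes the classical lower and upper approximations of a set. It is an illustration only: the abstract does not say which rough set variant the chapter uses, and the tweet identifiers, terms, and the `approximations` helper are hypothetical.

```python
from collections import defaultdict


def approximations(universe, key, target):
    """Return (lower, upper) approximations of `target`.

    Objects with the same `key` value are indiscernible and form one
    equivalence class. The lower approximation collects classes fully
    inside `target`; the upper approximation collects classes that
    overlap `target` at all.
    """
    classes = defaultdict(set)
    for obj in universe:
        classes[key(obj)].add(obj)

    lower, upper = set(), set()
    for block in classes.values():
        if block <= target:      # class certainly belongs to the target
            lower |= block
        if block & target:       # class possibly belongs to the target
            upper |= block
    return lower, upper


# Toy data (hypothetical): each tweet is described only by its set of terms.
tweets = {
    "t1": frozenset({"match", "goal"}),
    "t2": frozenset({"match", "goal"}),
    "t3": frozenset({"election"}),
    "t4": frozenset({"match"}),
}
sports = {"t1", "t4"}  # tweets assumed to be labelled as sports

lower, upper = approximations(tweets, lambda t: tweets[t], sports)
boundary = upper - lower  # tweets whose membership is uncertain
```

Here `t1` and `t2` share the same terms, so they cannot be told apart. Only `t1` is labelled as sports, which places both tweets in the boundary region rather than in the lower approximation.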
To test our method for enriching Arabic tweet representation, we build an Arabic tweet categorization system. Its effectiveness has been evaluated and compared in terms of the F1-measure using Naïve Bayesian (NB), Support Vector Machine (SVM), and Decision Tree (DT) classifiers.
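The F1-measure used for the comparison can be sketched as follows. This is a minimal illustration, assuming macro-averaging over categories (the abstract does not state the averaging scheme). The labels and the two prediction lists are made-up toy values, not results from the chapter.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 for one category: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # no correct predictions for this category
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-category F1 scores (an assumption here)."""
    labels = sorted(set(y_true))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


# Toy gold labels and two hypothetical classifier outputs.
y_true = ["sport", "sport", "politics", "politics"]
run_a = ["sport", "sport", "politics", "politics"]
run_b = ["sport", "politics", "politics", "politics"]

score_a = macro_f1(y_true, run_a)
score_b = macro_f1(y_true, run_b)
```

Scoring each classifier's predictions with the same function on the same test split is what makes the F1 values comparable across NB, SVM, and DT.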
Copyright information
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Bekkali, M., Lachkar, A. (2017). Web Search Engine-Based Representation for Arabic Tweets Categorization. In: Kaya, M., Erdoǧan, Ö., Rokne, J. (eds) From Social Data Mining and Analysis to Prediction and Community Detection. Lecture Notes in Social Networks. Springer, Cham. https://doi.org/10.1007/978-3-319-51367-6_4
DOI: https://doi.org/10.1007/978-3-319-51367-6_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-51366-9
Online ISBN: 978-3-319-51367-6
eBook Packages: Computer Science, Computer Science (R0)