Feature extraction and selection for Arabic tweets authorship authentication

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

In tweet authentication, we are concerned with correctly attributing a tweet to its true author based on its textual content. The more general problem of authenticating long documents has been studied before, and the most common approach relies on the intuitive idea that each author has a unique style that can be captured using stylometric features (SF). Inspired by the success of modern automatic document classification, some researchers have followed the Bag-Of-Words (BOW) approach for authenticating long documents. In this work, we consider both approaches and their application to authenticating tweets, which pose additional challenges because of their limited length. We focus on the Arabic language due to its importance and the scarcity of work on it. We create different sets of features from both approaches and compare the performance of different classifiers using them. We also experiment with various feature selection techniques in order to extract the most discriminating features. To the best of our knowledge, this is the first study of its kind to combine these different sets of features for authorship analysis of Arabic tweets. The results show that combining all the computed feature sets yields the best performance.
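As a rough illustration of what combining the two feature families can look like, the minimal sketch below concatenates a few stylometric measurements with Bag-Of-Words counts into a single vector per tweet. Everything in it is an assumption made for illustration: the three style measurements, the whitespace tokenization, the function names, and the toy placeholder tweets are hypothetical and are not the feature sets, preprocessing, or data used in this work.

```python
from collections import Counter

PUNCTUATION = "!?.,;:"  # hypothetical choice of marks to count


def stylometric_features(text):
    """Return a few illustrative style measurements for one tweet."""
    words = text.split()
    n_words = len(words)
    avg_word_len = sum(len(w) for w in words) / n_words if n_words else 0.0
    n_punct = sum(1 for ch in text if ch in PUNCTUATION)
    return [float(n_words), avg_word_len, float(n_punct)]


def build_vocabulary(texts):
    """Sorted list of every whitespace-separated token in the corpus."""
    return sorted({w for t in texts for w in t.split()})


def bow_features(text, vocabulary):
    """Count how often each vocabulary token occurs in the tweet."""
    counts = Counter(text.split())
    return [float(counts[w]) for w in vocabulary]


def combined_features(text, vocabulary):
    """Concatenate the stylometric and Bag-Of-Words vectors."""
    return stylometric_features(text) + bow_features(text, vocabulary)


# Toy placeholder tweets (made up for this sketch).
tweets = ["good morning everyone !", "morning news : rain today"]
vocab = build_vocabulary(tweets)
vectors = [combined_features(t, vocab) for t in tweets]
```

Each vector has three stylometric entries followed by one count per vocabulary token, so a classifier or a feature selection step would see both families side by side.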

Notes

  1. http://www.tweepy.org/.

  2. http://top100arabs.com/.

  3. The table is missing the row for the PCA technique. This is because PCA creates a new set of features as weighted linear combinations of the original features, rather than selecting a subset of them.
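To make the third note concrete, a PCA component can be written in generic notation as below. The symbols are introduced here for illustration only and do not come from the article: x_i is the i-th original (mean-centered) feature, d is the number of original features, k is the number of retained components, and w_{ji} is the weight of feature i in component j.

```latex
z_j = \sum_{i=1}^{d} w_{ji}\, x_i, \qquad j = 1, \dots, k
```

Because each z_j generally mixes all d original features, there is no subset of original features that could be listed for PCA in such a table.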

Acknowledgements

This work was supported in part by Zayed University Research Office, Research Cluster Award # R16086 and Research Incentive Grant # R15121.

Author information

Corresponding author

Correspondence to Mahmoud Al-Ayyoub.

About this article

Cite this article

Al-Ayyoub, M., Jararweh, Y., Rabab’ah, A. et al. Feature extraction and selection for Arabic tweets authorship authentication. J Ambient Intell Human Comput 8, 383–393 (2017). https://doi.org/10.1007/s12652-017-0452-1

  • DOI: https://doi.org/10.1007/s12652-017-0452-1
