An Arabic Multi-source News Corpus: Experimenting on Single-document Extractive Summarization

Chouigui, Amina; Ben Khiroun, Oussama; Elayeb, Bilel

doi:10.1007/s13369-020-05258-z

Research Article-Computer Engineering and Computer Science
Published: 04 February 2021

Arabian Journal for Science and Engineering, Volume 46, pages 3925–3938 (2021)

Amina Chouigui^1,
Oussama Ben Khiroun^1,2 &
Bilel Elayeb^1,3

768 Accesses
14 Citations

Abstract

Automatic text summarization is an important task in several areas of natural language processing, such as information retrieval. It is the process of automatically generating a condensed representation of a text, and it offers one solution to the problem of information overload: given the large amount of information available on the Internet, presenting a document by its summary helps users find the most relevant results of a search. In this paper, we propose a new free, structured Arabic corpus in the standard XML TREC format. ANT corpus v2.1 is collected using RSS feeds from different news sources. The corpus is useful for multiple text mining purposes such as generic text summarization, clustering or classification. We test it on unsupervised single-document extractive summarization using statistical and graph-based language-independent summarizers such as LexRank, TextRank, Luhn and LSA. We investigate the sensitivity of the summarization process to the stemming and stop-word removal steps. We evaluate the summarizers' performance by comparing the extracted text fragments to the abstracts that exist in ANT corpus v2.1, using the ROUGE and BLEU metrics. Experimental results show that the LexRank summarizer achieves the best ROUGE scores in the stop-word removal scenario.
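
The evaluation idea in the abstract, scoring an extracted summary against a reference abstract with and without stop-word removal, can be illustrated with a minimal sketch of ROUGE-N recall. This is not the authors' evaluation code: the function names `ngrams` and `rouge_n_recall`, the whitespace tokenization, the toy English sentences and the tiny stop-word set are all assumptions made for illustration. A real experiment on Arabic text would need a proper tokenizer, an Arabic stop-word list and an established ROUGE implementation.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_recall(candidate, reference, n=1, stop_words=frozenset()):
    """ROUGE-N recall: share of reference n-grams also present in the candidate.

    Tokenization is a plain whitespace split (an assumption for this sketch).
    Tokens listed in `stop_words` are dropped from both texts before counting.
    """
    cand_tokens = [t for t in candidate.split() if t not in stop_words]
    ref_tokens = [t for t in reference.split() if t not in stop_words]

    cand_counts = Counter(ngrams(cand_tokens, n))
    ref_counts = Counter(ngrams(ref_tokens, n))

    total = sum(ref_counts.values())
    if total == 0:
        return 0.0

    # Clipped overlap: an n-gram is credited at most as often as it
    # appears in the candidate.
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return overlap / total


# Toy data (hypothetical, not taken from ANT corpus v2.1).
reference = "the corpus is collected from news feeds"
candidate = "the corpus is built from rss news feeds"
toy_stop_words = frozenset({"the", "is", "from"})

print(rouge_n_recall(candidate, reference, n=1))
print(rouge_n_recall(candidate, reference, n=1, stop_words=toy_stop_words))
print(rouge_n_recall(candidate, reference, n=2))
```

Removing stop words changes both the numerator and the denominator of the recall, which is one reason a summarizer's score can shift between preprocessing scenarios even when the extracted sentences are identical.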

Notes

https://duc.nist.gov/.
MSE 2005: Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization Workshop at the Annual Meeting of the Association for Computational Linguistics (ACL 2005).
MSE 2006: Multilingual Summarization Evaluation at the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL 2006).
http://www.nist.gov/tac/2011/Summarization/index.html.
http://multiling.iit.demokritos.gr/pages/view/662/multiling-2013.
http://www.mturk.com.
http://translate.google.com.
https://github.com/antcorpus/RSSCrawlerArabicCorpus.
https://www.jawharafm.net/ar/.
https://antcorpus.github.io/.
http://www.alarabiya.net/ar/.
http://www.bbc.com/arabic/.
https://arabic.cnn.com/.
http://www.france24.com/ar/.
http://skynewsarabia.com/.
https://pypi.org/project/sumy/.

Acknowledgements

This work was funded by Emirates College of Technology in Abu Dhabi (UAE) under research Grant IRG-BIT-002-2020. The authors would like to thank the editors and the anonymous reviewers for their relevant comments and suggestions, which significantly enhanced the quality of this paper during the reviewing process.

Author information

Authors and Affiliations

RIADI Research Laboratory, ENSI, Manouba University, Manouba, Tunisia
Amina Chouigui, Oussama Ben Khiroun & Bilel Elayeb
Faculty of Economics and Management of Nabeul, University of Carthage, Tunis, Tunisia
Oussama Ben Khiroun
Emirates College of Technology, Abu Dhabi, United Arab Emirates
Bilel Elayeb

Corresponding author

Correspondence to Bilel Elayeb.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

About this article

Cite this article

Chouigui, A., Ben Khiroun, O. & Elayeb, B. An Arabic Multi-source News Corpus: Experimenting on Single-document Extractive Summarization. Arab J Sci Eng 46, 3925–3938 (2021). https://doi.org/10.1007/s13369-020-05258-z

Received: 18 April 2020
Accepted: 17 December 2020
Published: 04 February 2021
Issue Date: April 2021
DOI: https://doi.org/10.1007/s13369-020-05258-z
