Document Similarity for Arabic and Cross-Lingual Web Content

Salhi, Ali; Yahya, Adnan H.

doi:10.1007/978-3-319-73500-9_10

Ali Salhi¹⁴ &
Adnan H. Yahya¹⁴

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 782))

Included in the following conference series:

International Conference on Arabic Language Processing

907 Accesses

Abstract

Document similarity is basic for Information Retrieval. Cross Lingual (CL) similarity is important for many data processing tasks such as CL palgiarism detection and retrieval and document quality assessment. We study CL similarity based on the Explicit Semantic Association (ESA) adapted to a cross lingual setting with focus on Arabic. We compare the degree to which CL similarity testing performs where one of the language is Arabic with its monolingual counterpart for various text chunk sizes. We describe the used infrastructure and report on some of the testing results, study the possible sources of encountered weaknesses and point to the possible directions for improvement.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Azmi-Murad, M., Martin, T.: Asymmetric word similarities for information retrieval, document grouping and taxonomic organization. In: Proceedings of EUNITE 2004 - Aachen, Germany, pp. 277–282 (2004)
Google Scholar
Barrón-Cedeño, A., Paramita, M.L., Clough, P., Rosso, P.: A comparison of approaches for measuring cross-lingual similarity of Wikipedia articles. In: de Rijke, M., Kenter, T., de Vries, A.P., Zhai, C., de Jong, F., Radinsky, K., Hofmann, K. (eds.) ECIR 2014. LNCS, vol. 8416, pp. 424–429. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-06028-6_36
Chapter Google Scholar
Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: Proceedings of the 20th International Joint Conference on Artifical Intelligence, pp. 1606–1611. Morgan Kaufmann Publishers Inc., San Francisco
Google Scholar
Franco-Salvador, M., Rosso, P., Navigli, R.: A knowledge-based representation for cross-language document retrieval and categorization. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2014, Gothenburg, Sweden, 26–30 April 2014
Google Scholar
Freeman, A., Condon, S., Ackerman, C.: Cross linguistic name matching in English and Arabic: a “one to many mapping” extension of the Levenshtein edit distance algorithm. In: Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, New York, pp. 471–478, June 2006
Google Scholar
Gupta, P., Banchs, R., Rosso, P.: Continuous space models for CLIR. Inf. Process. Manage. 53(2), 359–370 (2017)
Article Google Scholar
Hayashi, Y., Luo, W.: Extending monolingual semantic textual similarity task to multiple cross-lingual settings. In: Proceedings of the 10th Language Resources and Evaluation Conference (LREC2016), 23–28 May 2016, Portorož–Slovenia (2016)
Google Scholar
Liberman, S., Markovitch, S.: Compact hierarchical explicit semantic representation. In: Proceedings of IJCAI09 WS on User Contributed Knowledge and Artificial Intelligence, Pasadena, CA, July 2009
Google Scholar
Mihalcea, R., Corley, C., Strapparava, C.: Corpus-based and knowledge-based measures of text semantic similarity. In: Proceedings of the 21st National Conference on Artificial Intelligence, pp. 775–780. AAAI Press, Boston (2006)
Google Scholar
Metzler, D., Dumais, S., Meek, C.: Similarity measures for short segments of text. In: ECIR07 (2007)
Google Scholar
Moreau, E., Yvon, F., Cappe, O.: Robust similarity measures for named entities matching. In: Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), Manchester, pp. 593–600, August 2008
Google Scholar
Ponzetto, S., Strube, M.: Knowledge derived from Wikipedia for computing semantic relatedness. J. Artif. Intell. Res. JAIR 30, 181–212 (2007)
MATH Google Scholar
Potthast, M., Stein, B., Anderka, M.: A Wikipedia-based multilingual retrieval model. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds.) ECIR 2008. LNCS, vol. 4956, pp. 522–530. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78646-7_51
Chapter Google Scholar
Resnik, P.: Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language. JAIR 11, 95–130 (1999)
MATH Google Scholar
Rupnik, J., Muhic, A., Leban, G., Skraba, P., Fortuna, B., Grobenik, M.: News across languages - cross-lingual document similarity and event tracking. J. Artif. Intell. Res. 55, 283–316 (2016)
MathSciNet Google Scholar
Sorg, T., Cimiano, P.: Cross lingual information retrieval with explicit semantic analysis. In: Working Notes of the Annual CLEF, 2008 Workshop (2008)
Google Scholar
Strube, M., Ponzetto, S.P.: Wikirelate! computing semantic relatedness using Wikipedia. In: AAAI, vol. 6 (2006)
Google Scholar

Download references

Acknowledgement

The authors would like to thank Birzeit University which supported this work under a University research grant. They would like to thank the three anonymous referees for their valuable input.

Author information

Authors and Affiliations

Birzeit University, Birzeit, Palestine
Ali Salhi & Adnan H. Yahya

Authors

Ali Salhi
View author publications
You can also search for this author in PubMed Google Scholar
Adnan H. Yahya
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ali Salhi .

Editor information

Editors and Affiliations

Ex ENSA-USMBA, Fez, Morocco
Abdelmonaime Lachkar
EMI, UM5, Rabat, Morocco
Karim Bouzoubaa
FS, UMP, Oujda, Morocco
Azzedine Mazroui
IERA, UM5, Rabat, Morocco
Abdelfettah Hamdani
FS, UMP, Oujda, Morocco
Abdelhak Lekhouaja

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Salhi, A., Yahya, A.H. (2018). Document Similarity for Arabic and Cross-Lingual Web Content. In: Lachkar, A., Bouzoubaa, K., Mazroui, A., Hamdani, A., Lekhouaja, A. (eds) Arabic Language Processing: From Theory to Practice. ICALP 2017. Communications in Computer and Information Science, vol 782. Springer, Cham. https://doi.org/10.1007/978-3-319-73500-9_10

Download citation

DOI: https://doi.org/10.1007/978-3-319-73500-9_10
Published: 05 January 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73499-6
Online ISBN: 978-3-319-73500-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics