Abstract
Robust statistical methods are essential for analysing data and quantifying phenomena, particularly when the events under study are intricate. This research introduces and assesses non-parametric statistical approaches that are compatible with heterogeneous data structures, focusing on their application to language clustering and natural language processing. A central objective is to refine our understanding of language clustering and its implications, including the emergence of linguistic areas known as sprachbunds. To this end, the study examines several non-parametric aspects of linguistic data processing and exploration. Building on previous work (Chattopadhyay et al. in International conference on soft computing and its engineering applications, Springer, Cham, 2022), it proposes a novel framework for structuring language families that incorporates typological and areal characteristics to improve the accuracy and depth of language classification. Non-parametric techniques are used throughout. Notably, multidimensional scaling (MDS) is used to map the resulting data into Cartesian coordinates, which allows data-depth-based methods to be applied for reliable outlier identification. This helps categorise languages that lie on the fringes of existing classifications and opens avenues for re-evaluating established categorisation schemes in light of the new findings.
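The pipeline outlined in the abstract (pairwise language distances, an MDS embedding into Cartesian coordinates, then depth-based outlier screening) can be illustrated with a minimal sketch. The choices below are assumptions made for illustration and are not taken from the article: the distance is a length-normalised Levenshtein distance averaged over aligned word lists, the embedding is classical (Torgerson) MDS, the depth function is the sample spatial depth, and the word lists are invented toy strings rather than real language data.

```python
import numpy as np


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def language_distance(words_a, words_b) -> float:
    """Mean length-normalised edit distance over aligned word pairs (assumed measure)."""
    scores = [
        levenshtein(a, b) / max(len(a), len(b), 1)
        for a, b in zip(words_a, words_b)
    ]
    return sum(scores) / len(scores)


def classical_mds(dist: np.ndarray, dim: int = 2) -> np.ndarray:
    """Classical (Torgerson) MDS: embed a distance matrix into `dim` coordinates."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering  # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:dim]  # indices of the largest eigenvalues
    top = np.clip(eigvals[order], 0.0, None)  # guard against small negative values
    return eigvecs[:, order] * np.sqrt(top)


def spatial_depth(points: np.ndarray) -> np.ndarray:
    """Sample spatial depth: 1 - norm of the mean unit vector towards the other points."""
    depths = np.empty(len(points))
    for i, x in enumerate(points):
        diffs = points - x
        norms = np.linalg.norm(diffs, axis=1)
        mask = norms > 1e-12  # skip the point itself and exact duplicates
        if mask.any():
            units = diffs[mask] / norms[mask, None]
            depths[i] = 1.0 - np.linalg.norm(units.mean(axis=0))
        else:
            depths[i] = 1.0
    return depths


# Invented toy word lists (hypothetical, not real language data).
TOY_WORDLISTS = {
    "lang_a": ["akwa", "nokte", "mano"],
    "lang_b": ["agwa", "nokta", "manu"],
    "lang_c": ["akva", "nokt", "man"],
    "lang_d": ["zzub", "ikrr", "ptolq"],
}


def demo() -> None:
    names = list(TOY_WORDLISTS)
    n = len(names)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = language_distance(TOY_WORDLISTS[names[i]], TOY_WORDLISTS[names[j]])
            dist[i, j] = dist[j, i] = d
    coords = classical_mds(dist, dim=2)
    for name, depth in zip(names, spatial_depth(coords)):
        print(f"{name}: depth = {depth:.3f}")


if __name__ == "__main__":
    demo()
```

Points with the lowest depth lie on the periphery of the embedded configuration and are candidates for outliers. Any threshold for flagging them is a separate modelling decision that this sketch leaves open.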
References
Chattopadhyay A, Ghosh SS, Karmakar S. On language clustering: non-parametric statistical approach. In: International conference on soft computing and its engineering applications. Springer; 2022. p. 42–55.
Siegel S. Nonparametric statistics. Am Stat. 1957;11(3):13–9.
Savage IR. Nonparametric statistics: a personal review. Indian J Stat Ser A. 1969;31(2):107–44.
Wasserman L. All of nonparametric statistics. New York: Springer; 2006.
Tan X, Chen J, He D, Xia Y, Liu T-Y. Multilingual neural machine translation with language clustering. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). Hong Kong: Association for Computational Linguistics (ACL); 2019. p. 963–73.
Karmakar S, Ghosh SS, Chattopadhyay A. Sprachbund as metric space: quantifying linguistic traces of convergences and diffusions in synchrony. 2022. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4230203
Liu RY, Parelius JM, Singh K. Multivariate analysis by data depth: descriptive statistics, graphics and inference. Ann Stat. 1999;27(3):783–858.
Vardi Y, Zhang C-H. The multivariate L1-median and associated data depth. Proc Natl Acad Sci. 2000;97(4):1423–6.
He X, Wang G. Convergence of depth contours for multivariate datasets. Ann Stat. 1997;25(2):495–504.
Dyckerhoff R, Mosler K, Koshevoy G. Zonoid data depth: theory and computation. In: Prat A, editor. COMPSTAT. Heidelberg: Physica-Verlag HD; 1996. p. 235–40.
Aloupis G. Geometric measures of data depth. DIMACS Ser Discrete Math Theoret Comput Sci. 2006;72:147–58.
Classification of Romance languages. 2023. https://en.wikipedia.org/wiki/Classification_of_Romance_languages. Accessed 01 Jan 2023.
Romance Language Word Lists. 2021. http://people.disim.univaq.it/~serva/languages/55+2.romance.htm. Accessed 23 Dec 2021.
Swadesh M. Towards greater accuracy in lexicostatistic dating. Int J Am Linguist. 1955;21(2):121–37.
Wichmann S, Rama T, Holman EW. Phonological diversity, word length, and population sizes across languages: the ASJP evidence. Linguist Typol. 2011;15(2):177–97.
ASJP Database. 2023. https://asjp.clld.org/languages. Accessed 01 Jan 2023.
Levenshtein VI. Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl. 1966;10(8):707–10.
Nerbonne J, Heeringa W, Kleiweg P. Edit distance and dialect proximity. In: Sankoff D, Kruskal J, editors. Time Warps, String Edits and Macromolecules: The theory and practice of sequence comparison. Stanford: CSLI Press; 1999. p. v–xv.
Nerbonne J, Heeringa W. Measuring dialect distance phonetically. In: Proceedings of the third meeting of the ACL special interest group in computational phonology (SIGPHON-97). ACL Anthology; 1997. p. 11–8.
Nerbonne J, Heeringa W, Van den Hout E, Van der Kooi P, Otten S, Van de Vis W, et al. Phonetic distance between Dutch dialects. In: Durieux G, Daelemans W, Gillis S, editors. CLIN VI: proceedings of the sixth CLIN meeting. Antwerp: Centre for Dutch Language and Speech UIA; 1996. p. 185–202.
Ciobanu AM, Dinu LP. A computational perspective on the Romanian dialects. In: Proceedings of the tenth international conference on language resources and evaluation (LREC’16); 2016. p. 3281–5.
Emeneau MB. India as a lingustic area. Language. 1956;32(1):3–16.
Abbi A. Languages of India and India as a linguistic area. Centre of Linguistics, Jawaharlal Nehru University; 2012.
Posner R, Sala M. Linguistic characteristics of the Romance languages. 2023. https://www.britannica.com/topic/Romance-languages/Linguistic-characteristics-of-the-Romance-languages. Accessed 10 Aug.
Ethics declarations
Conflict of Interest
On behalf of all the authors, the corresponding author states that there is no conflict of interest.
This article is part of the topical collection “Soft Computing in Engineering Applications” guest edited by Kanubhai K. Patel.
About this article
Cite this article
Chattopadhyay, A., Ghosh, S.S. & Karmakar, S. Generalisation in Natural Language Clustering Through Non-parametric Statistical Approach. SN COMPUT. SCI. 5, 65 (2024). https://doi.org/10.1007/s42979-023-02389-6
DOI: https://doi.org/10.1007/s42979-023-02389-6