Generalisation in Natural Language Clustering Through Non-parametric Statistical Approach

Chattopadhyay, Anagh; Ghosh, Soumya Sankar; Karmakar, Samir

doi:10.1007/s42979-023-02389-6

Original Research
Published: 06 December 2023

Volume 5, article number 65 (2024)
SN Computer Science

Anagh Chattopadhyay¹,
Soumya Sankar Ghosh² (ORCID: orcid.org/0000-0002-4469-4070) &
Samir Karmakar³

Abstract

Robust statistical methodologies are imperative for effectively analysing data and quantifying specific phenomena, particularly when attempting to comprehend intricate events. The present research endeavours to introduce and assess potent non-parametric statistical approaches that are compatible with heterogeneous data structures. The primary focus lies in their application within the domains of language clustering and natural language processing. A central objective is to refine our understanding of language clustering and its potential implications, including the emergence of linguistic regions known as sprachbunds. To achieve this, the study delves into diverse non-parametric facets of linguistic data processing and exploration. Building upon the foundation established by previous work (Chattopadhyay et al. in International conference on soft computing and its engineering applications, Springer, Cham, 2022), this study extends its scope by proposing a novel framework for structuring language families. This is accomplished through the incorporation of typological and areal characteristics, enriching the accuracy and depth of language classification. The utilisation of non-parametric techniques takes centre stage throughout this process. Notably, multidimensional scaling (MDS) is harnessed to transform resulting data into a Cartesian framework, enabling the deployment of data-depth-based methods for reliable outlier identification. This proves invaluable for effectively categorising a wide array of languages situated on the fringes of existing classifications. Furthermore, it opens avenues for reevaluating established language categorisation schemes in light of newfound insights.
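The pipeline the abstract describes (pairwise language distances, an MDS embedding into Cartesian coordinates, then a data-depth score to flag outliers) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes classical (Torgerson) MDS and the spatial depth function as stand-ins, and the five labelled "languages" with their distance matrix are hypothetical toy data.

```python
import numpy as np


def classical_mds(dist, dims=2):
    """Embed an (n, n) symmetric distance matrix into `dims` coordinates.

    Assumption: classical (Torgerson) MDS; the paper may use another variant.
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared distances to obtain a Gram matrix.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)  # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:dims]
    # Clip tiny negative eigenvalues caused by rounding or non-Euclidean input.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


def spatial_depth(points):
    """Spatial depth of each point relative to the whole sample.

    Depth is 1 minus the norm of the averaged unit vectors pointing from the
    other points to the point in question. Low depth means the point lies
    on the fringe of the cloud. Assumption: one of several possible depths.
    """
    points = np.asarray(points, dtype=float)
    depths = np.empty(len(points))
    for i, x in enumerate(points):
        diffs = x - points
        norms = np.linalg.norm(diffs, axis=1)
        mask = norms > 0  # skip the point itself and exact duplicates
        units = diffs[mask] / norms[mask][:, None]
        depths[i] = 1.0 - np.linalg.norm(units.sum(axis=0) / len(points))
    return depths


# Hypothetical toy data: four mutually close "languages" and one distant one.
labels = ["A", "B", "C", "D", "E"]
points = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [10, 10]], dtype=float)
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords = classical_mds(dist)     # Cartesian coordinates recovered from distances
depths = spatial_depth(coords)   # one depth value per language
print(labels[int(np.argmin(depths))])  # prints "E", the shallowest point
```

In practice the distance matrix would come from a linguistic measure rather than from known coordinates, and the cut-off below which a language counts as an outlier is a modelling choice that this sketch leaves open.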

References

Chattopadhyay A, Ghosh SS, Karmakar S. On language clustering: non-parametric statistical approach. In: International conference on soft computing and its engineering applications. Springer; 2022. p. 42–55.
Siegel S. Nonparametric statistics. Am Stat. 1957;11(3):13–9.
Savage IR. Nonparametric statistics: a personal review. Indian J Stat Ser A. 1969;31(2):107–44.
Wasserman L. All of nonparametric statistics. New York: Springer; 2006.
Tan X, Chen J, He D, Xia Y, Liu T-Y. Multilingual neural machine translation with language clustering. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). Hong Kong: Association for Computational Linguistics (ACL); 2019. p. 963–73.
Karmakar S, Ghosh SS, Chattopadhyay A. Sprachbund as metric space: quantifying linguistic traces of convergences and diffusions in synchrony. 2022. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4230203
Liu RY, Parelius JM, Singh K. Multivariate analysis by data depth: descriptive statistics, graphics and inference. Ann Stat. 1999;27(3):783–858.
Vardi Y, Zhang C-H. The multivariate L1-median and associated data depth. Proc Natl Acad Sci. 2000;97(4):1423–6.
He X, Wang G. Convergence of depth contours for multivariate datasets. Ann Stat. 1997;25(2):495–504.
Dyckerhoff R, Mosler K, Koshevoy G. Zonoid data depth: theory and computation. In: Albert P, editor. COMPSTAT. Heidelberg: Physica-Verlag HD; 1996. p. 235–40.
Aloupis G. Geometric measures of data depth. DIMACS Ser Discrete Math Theoret Comput Sci. 2006;72:147–58.
Classification of Romance languages. 2023. https://en.wikipedia.org/wiki/Classification_of_Romance_languages. Accessed 01 Jan 2023.
Romance Language Word Lists. 2021. http://people.disim.univaq.it/~serva/languages/55+2.romance.htm. Accessed 23 Dec 2021.
Swadesh M. Towards greater accuracy in lexicostatistic dating. Int J Am Linguist. 1955;21(2):121–37.
Wichmann S, Rama T, Holman EW. Phonological diversity, word length, and population sizes across languages: the ASJP evidence. Linguist Typol. 2011;15(2):177–97.
ASJP Database. 2023. https://asjp.clld.org/languages. Accessed 01 Jan 2023.
Levenshtein VI. Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl. 1966;10(8):707–10.
Nerbonne J, Heeringa W, Kleiweg P. Edit distance and dialect proximity. In: Sankoff D, Kruskal J, editors. Time Warps, String Edits and Macromolecules: The theory and practice of sequence comparison. Stanford: CSLI Press; 1999. p. v–xv.
Nerbonne J, Heeringa W. Measuring dialect distance phonetically. In: Proceedings of the third meeting of the ACL special interest group in computational phonology (SIGPHON-97). ACL Anthology; 1997. p. 11–8.
Nerbonne J, Heeringa W, Van den Hout E, Van der Kooi P, Otten S, Van de Vis W, et al. Phonetic distance between Dutch dialects. In: Durieux G, Daelemans W, Gillis S, editors. CLIN VI: proceedings of the sixth CLIN meeting. Antwerp: Centre for Dutch Language and Speech UIA; 1996. p. 185–202.
Ciobanu AM, Dinu LP. A computational perspective on the Romanian dialects. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16); 2016. p. 3281–5.
Emeneau MB. India as a linguistic area. Language. 1956;32(1):3–16.
Abbi A. Languages of India and India as a linguistic area. Centre of Linguistics, Jawaharlal Nehru University; 2012.
Posner R, Sala M. Linguistic characteristics of the Romance languages. 2023. https://www.britannica.com/topic/Romance-languages/Linguistic-characteristics-of-the-Romance-languages. Accessed 10 Aug.

Author information

Authors and Affiliations

Department of Statistics, Indian Statistical Institute, Kolkata, West Bengal, India
Anagh Chattopadhyay
School of Advanced Sciences and Languages, VIT Bhopal University, Bhopal, Madhya Pradesh, India
Soumya Sankar Ghosh
School of Languages and Linguistics, Jadavpur University, Kolkata, West Bengal, India
Samir Karmakar

Corresponding author

Correspondence to Soumya Sankar Ghosh.

Ethics declarations

Conflict of Interest

On behalf of all the authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Soft Computing in Engineering Applications” guest edited by Kanubhai K. Patel.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chattopadhyay, A., Ghosh, S.S. & Karmakar, S. Generalisation in Natural Language Clustering Through Non-parametric Statistical Approach. SN COMPUT. SCI. 5, 65 (2024). https://doi.org/10.1007/s42979-023-02389-6

Received: 28 February 2023
Accepted: 04 October 2023
Published: 06 December 2023
DOI: https://doi.org/10.1007/s42979-023-02389-6
