Generalisation in Natural Language Clustering Through Non-parametric Statistical Approach

  • Original Research
SN Computer Science

Abstract

Robust statistical methods are essential for analysing data and quantifying phenomena, particularly when the events under study are intricate. This research introduces and assesses non-parametric statistical approaches that are compatible with heterogeneous data structures, focusing on their application to language clustering and natural language processing. A central objective is to refine our understanding of language clustering and its implications, including the emergence of linguistic regions known as sprachbunds. To this end, the study examines several non-parametric aspects of linguistic data processing and exploration. Building on previous work (Chattopadhyay et al. in International conference on soft computing and its engineering applications, Springer, Cham, 2022), it proposes a novel framework for structuring language families that incorporates typological and areal characteristics to improve the accuracy and depth of language classification. Non-parametric techniques are used throughout. In particular, multidimensional scaling (MDS) maps the resulting data into Cartesian coordinates, which allows data-depth-based methods to be applied for reliable outlier identification. This helps categorise the many languages that sit on the fringes of existing classifications, and it opens the way to re-evaluating established language categorisation schemes in light of the new findings.
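
The MDS-then-depth step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which MDS variant or which depth function the authors use, so classical (Torgerson) MDS and spatial depth are assumptions made here purely for illustration. The function names `classical_mds` and `spatial_depth` and the toy coordinates standing in for a language distance matrix are likewise hypothetical.

```python
import numpy as np


def classical_mds(dist, k=2):
    """Classical (Torgerson) MDS: embed n items in k dimensions so that
    Euclidean distances between the embedded points approximate `dist`."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-centre the squared distances to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # Keep the k largest eigenpairs; clip tiny negative eigenvalues to zero.
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:k]
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


def spatial_depth(points):
    """Sample spatial depth of each point: 1 minus the norm of the average
    unit vector pointing from the other sample points towards it."""
    points = np.asarray(points, dtype=float)
    depths = np.empty(len(points))
    for i, x in enumerate(points):
        diffs = x - points
        norms = np.linalg.norm(diffs, axis=1)
        keep = norms > 0  # the point itself contributes a zero vector
        units = diffs[keep] / norms[keep, None]
        depths[i] = 1.0 - np.linalg.norm(units.sum(axis=0)) / len(points)
    return depths


# Toy stand-in for a language distance matrix (assumption): five points in a
# tight group plus one distant point, turned into pairwise distances.
coords = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5], [10, 10]], dtype=float)
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

embedding = classical_mds(dist, k=2)   # Cartesian coordinates from distances
depths = spatial_depth(embedding)      # low depth suggests a fringe item
outlier = int(np.argmin(depths))
print(outlier, np.round(depths, 3))
```

In this toy setup the sixth point, placed far from the other five, receives the lowest depth, so a depth threshold would flag it as a candidate outlier. With real data the distance matrix would come from a linguistic distance measure rather than from synthetic coordinates.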

References

  1. Chattopadhyay A, Ghosh SS, Karmakar S. On language clustering: non-parametric statistical approach. In: International conference on soft computing and its engineering applications. Springer; 2022. p. 42–55.

  2. Siegel S. Nonparametric statistics. Am Stat. 1957;11(3):13–9.

  3. Savage IR. Nonparametric statistics: a personal review. Indian J Stat Ser A. 1969;31(2):107–44.

  4. Wasserman L. All of nonparametric statistics. New York: Springer; 2006.

  5. Tan X, Chen J, He D, Xia Y, Liu T-Y. Multilingual neural machine translation with language clustering. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). Hong Kong: Association for Computational Linguistics (ACL); 2019. p. 963–73.

  6. Karmakar S, Ghosh SS, Chattopadhyay A. Sprachbund as metric space: quantifying linguistic traces of convergences and diffusions in synchrony. 2022. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4230203

  7. Liu RY, Parelius JM, Singh K. Multivariate analysis by data depth: descriptive statistics, graphics and inference. Ann Stat. 1999;27(3):783–858.

  8. Vardi Y, Zhang C-H. The multivariate L1-median and associated data depth. Proc Natl Acad Sci. 2000;97(4):1423–6.

  9. He X, Wang G. Convergence of depth contours for multivariate datasets. Ann Stat. 1997;25(2):495–504.

  10. Dyckerhoff R, Mosler K, Koshevoy G. Zonoid data depth: theory and computation. In: Albert P, editor. COMPSTAT. Heidelberg: Physica-Verlag HD; 1996. p. 235–40.

  11. Aloupis G. Geometric measures of data depth. DIMACS Ser Discrete Math Theoret Comput Sci. 2006;72:147–58.

  12. Classification of Romance languages. 2023. https://en.wikipedia.org/wiki/Classification_of_Romance_languages. Accessed 01 Jan 2023.

  13. Romance Language Word Lists. 2021. http://people.disim.univaq.it/~serva/languages/55+2.romance.htm. Accessed 23 Dec 2021.

  14. Swadesh M. Towards greater accuracy in lexicostatistic dating. Int J Am Linguist. 1955;21(2):121–37.

  15. Wichmann S, Rama T, Holman EW. Phonological diversity, word length, and population sizes across languages: the ASJP evidence. Linguist Typol. 2011;15(2):177–97.

  16. ASJP Database. 2023. https://asjp.clld.org/languages. Accessed 01 Jan 2023.

  17. Levenshtein VI. Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl. 1966;10(8):707–10.

  18. Nerbonne J, Heeringa W, Kleiweg P. Edit distance and dialect proximity. In: Sankoff D, Kruskal J, editors. Time Warps, String Edits and Macromolecules: The theory and practice of sequence comparison. Stanford: CSLI Press; 1999. p. v–xv.

  19. Nerbonne J, Heeringa W. Measuring dialect distance phonetically. In: Proceedings of the third meeting of the ACL special interest group in computational phonology (SIGPHON-97). ACL Anthology; 1997. p. 11–8.

  20. Nerbonne J, Heeringa W, Van den Hout E, Van der Kooi P, Otten S, Van de Vis W, et al. Phonetic distance between Dutch dialects. In: Durieux G, Daelemans W, Gillis S, editors. CLIN VI: proceedings of the sixth CLIN meeting. Antwerp: Centre for Dutch Language and Speech UIA; 1996. p. 185–202.

  21. Ciobanu AM, Dinu LP. A computational perspective on the Romanian dialects. In: Proceedings of the tenth international conference on language resources and evaluation (LREC’16). 2016. p. 3281–5.

  22. Emeneau MB. India as a linguistic area. Language. 1956;32(1):3–16.

  23. Abbi A. Languages of India and India as a linguistic area. Centre of Linguistics, Jawaharlal Nehru University; 2012.

  24. Posner R, Sala M. Linguistic characteristics of the Romance languages. 2023. https://www.britannica.com/topic/Romance-languages/Linguistic-characteristics-of-the-Romance-languages. Accessed 10 Aug.

Author information

Corresponding author

Correspondence to Soumya Sankar Ghosh.

Ethics declarations

Conflict of Interest

On behalf of all the authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Soft Computing in Engineering Applications” guest edited by Kanubhai K. Patel.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chattopadhyay, A., Ghosh, S.S. & Karmakar, S. Generalisation in Natural Language Clustering Through Non-parametric Statistical Approach. SN COMPUT. SCI. 5, 65 (2024). https://doi.org/10.1007/s42979-023-02389-6

  • DOI: https://doi.org/10.1007/s42979-023-02389-6
