Statistical Studies on Language Corpus

Dash, Niladri Sekhar; Ramamoorthy, L.

doi:10.1007/978-981-13-1801-6_4

Abstract

In this chapter, we briefly discuss the various statistical approaches that are normally used for processing and analyzing a text corpus and for obtaining data that may be considered statistically reliable enough to support general or specific observations about the patterns of occurrence or the manner of distribution of linguistic elements in a corpus. We also try to show how, depending on the nature of the text in a corpus, the patterns of quantitative analysis may differ from those of qualitative analysis, although in the long run the two types of analysis may be combined to give a clear picture of the linguistic phenomenon under scrutiny. Finally, we give a short history of the use of statistical methods and techniques in corpus analysis before and after the introduction of the digital corpus, and we describe how descriptive, inferential, and evaluative approaches can be combined in corpus analysis, linguistic investigation, and the drawing of inferences.
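As a rough illustration of how a descriptive step and an inferential step might be combined, the following Python sketch counts word frequencies in two toy corpora and then computes a Pearson chi-square statistic for one word. It is a minimal sketch under stated assumptions, not the authors' method: the two sample texts, the crude tokenizer, and the function names `tokenize`, `relative_frequencies`, and `chi_square_2x2` are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract alphabetic word tokens.

    Assumption: a crude tokenizer that is adequate only for this toy example.
    """
    return re.findall(r"[a-z]+", text.lower())


def relative_frequencies(tokens):
    """Descriptive step: map each word to its share of all tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


def chi_square_2x2(a, b, c, d):
    """Inferential step: Pearson chi-square statistic (no continuity
    correction) for the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical toy corpora; real corpora would be far larger.
corpus_a = "the cat sat on the mat and the dog sat on the rug"
corpus_b = "a cat and a dog met a bird on a mat"

tokens_a = tokenize(corpus_a)
tokens_b = tokenize(corpus_b)

# Descriptive: how often does each word occur, relative to corpus size?
rel_a = relative_frequencies(tokens_a)

# Inferential: does "the" occur at different rates in the two corpora?
word = "the"
a = tokens_a.count(word)      # occurrences of the word in corpus A
b = len(tokens_a) - a         # all other tokens in corpus A
c = tokens_b.count(word)      # occurrences of the word in corpus B
d = len(tokens_b) - c         # all other tokens in corpus B
statistic = chi_square_2x2(a, b, c, d)

print(f"relative frequency of '{word}' in corpus A: {rel_a[word]:.3f}")
print(f"chi-square statistic for '{word}': {statistic:.3f}")
```

With samples this small the expected cell counts are too low for the chi-square statistic to be reliable; the example only shows the shape of the calculation. Interpreting the result against the actual linguistic data would correspond to the evaluative step.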

Author information

Authors and Affiliations

Linguistic Research Unit, Indian Statistical Institute, Kolkata, West Bengal, India
Niladri Sekhar Dash
Linguistic Data Consortium-Indian Languages, Central Institute of Indian Languages, Mysore, Karnataka, India
L. Ramamoorthy

Corresponding author

Correspondence to Niladri Sekhar Dash.

About this chapter

Cite this chapter

Dash, N.S., Ramamoorthy, L. (2019). Statistical Studies on Language Corpus. In: Utility and Application of Language Corpora. Springer, Singapore. https://doi.org/10.1007/978-981-13-1801-6_4

DOI: https://doi.org/10.1007/978-981-13-1801-6_4
Published: 14 August 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1800-9
Online ISBN: 978-981-13-1801-6
eBook Packages: Social Sciences (R0)
