A 300 MB Turkish Corpus and Word Analysis

Dalkilic, Gökhan; Cebi, Yalcin

doi:10.1007/3-540-36077-8_20

Conference paper
First Online: 24 October 2002

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2457)

Abstract

Determining the properties of a language requires a corpus of that language. To analyze Turkish, a corpus of about 300 MB, containing more than 44 million words, was first compiled from 10 different web sites with Turkish content. Statistics on the most frequently used Turkish words were calculated from this corpus. The frequencies of the seven most frequently used words were compared with those of their English equivalents, and it was found that the most frequently used words in natural languages are not nouns. The most frequently used words of 1 to 5 letters were then determined and applied to a randomly selected text to test the validity of the process.
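The kind of word analysis described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`turkish_lower`, `tokenize`, `top_words_by_length`, `coverage`), the regular-expression tokenizer, and the reading of "applied to a randomly selected text" as a coverage ratio are all assumptions made for illustration. The tokenizer is also simplified and ignores circumflexed letters such as â, î and û.

```python
import re
from collections import Counter

# Assumed tokenizer: a word is a run of lowercase Latin or Turkish letters.
WORD_RE = re.compile(r"[a-zçğıöşü]+")


def turkish_lower(text):
    """Lowercase text using the Turkish rules for dotted and dotless i."""
    # str.lower() maps "I" to "i", but Turkish needs I -> ı and İ -> i,
    # so these two letters are converted before the generic lowercasing.
    return text.replace("I", "ı").replace("İ", "i").lower()


def tokenize(text):
    """Split text into lowercase word tokens."""
    return WORD_RE.findall(turkish_lower(text))


def top_words_by_length(tokens, max_len=5, k=3):
    """Return the k most frequent words for each length from 1 to max_len."""
    counts = Counter(tokens)
    result = {}
    for n in range(1, max_len + 1):
        same_length = [(w, c) for w, c in counts.items() if len(w) == n]
        # Sort by descending count, then alphabetically, so ties are stable.
        same_length.sort(key=lambda wc: (-wc[1], wc[0]))
        result[n] = [w for w, _ in same_length[:k]]
    return result


def coverage(tokens, vocabulary):
    """Fraction of tokens in a sample text that belong to the vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in vocabulary) / len(tokens)


# Toy input standing in for the corpus; the real corpus is about 300 MB.
corpus_tokens = tokenize("Bir gün ve bir gece. Bu ev ve bu yol.")
top = top_words_by_length(corpus_tokens, max_len=5, k=2)

# Test the frequent-word list against a separate sample text.
frequent = {w for words in top.values() for w in words}
sample_tokens = tokenize("Bu bir ev")
print(top)
print(coverage(sample_tokens, frequent))
```

For a corpus of the size reported in the abstract, the same counting could be done by streaming the files line by line and updating a single `Counter`, instead of holding all tokens in memory.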

Author information

Authors and Affiliations

Computer Engineering Dept., Dokuz Eylul University, 35100, Bornova, Izmir, Turkey
Gökhan Dalkilic & Yalcin Cebi

Editor information

Editors and Affiliations

Computer Engineering Department, Dokuz Eylul University, 35100, Izmir, Bornova, Turkey
Tatyana Yakhno

About this paper

Cite this paper

Dalkilic, G., Cebi, Y. (2002). A 300 MB Turkish Corpus and Word Analysis. In: Yakhno, T. (eds) Advances in Information Systems. ADVIS 2002. Lecture Notes in Computer Science, vol 2457. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36077-8_20

DOI: https://doi.org/10.1007/3-540-36077-8_20
Published: 24 October 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00009-9
Online ISBN: 978-3-540-36077-3
eBook Packages: Springer Book Archive
