Corpus Editing and Text Normalization

Dash, Niladri Sekhar; Ramamoorthy, L.

doi:10.1007/978-981-13-1801-6_3

Abstract

In this chapter, we propose applying processes such as pre-editing and text standardization as essential components of corpus editing and text normalization, so that a text corpus is ready for access across various domains of linguistics and language technology. Here, we identify some of the basic pre-editing and text standardization tasks and describe them with reference to a Bangla text corpus. As the name suggests, text normalization involves diverse tasks of text adjustment and standardization that improve the utility of the texts stored in a corpus for both manual and machine-based applications. The methods and strategies that we propose for overcoming the problems of text normalization are oriented largely toward written text corpora, since normalizing a spoken text corpus usually calls for a different set of operations that hardly match the processes normally applied to a written text corpus. A normalized version of a text not only reduces the workload in subsequent use of a corpus but also makes it more accessible to both humans and machines across all domains where a language corpus has application and referential relevance.

Author information

Authors and Affiliations

Linguistic Research Unit, Indian Statistical Institute, Kolkata, West Bengal, India
Niladri Sekhar Dash
Linguistic Data Consortium-Indian Languages, Central Institute of Indian Languages, Mysore, Karnataka, India
L. Ramamoorthy

Corresponding author

Correspondence to Niladri Sekhar Dash.

About this chapter

Cite this chapter

Dash, N.S., Ramamoorthy, L. (2019). Corpus Editing and Text Normalization. In: Utility and Application of Language Corpora. Springer, Singapore. https://doi.org/10.1007/978-981-13-1801-6_3

DOI: https://doi.org/10.1007/978-981-13-1801-6_3
Published: 14 August 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1800-9
Online ISBN: 978-981-13-1801-6
eBook Packages: Social Sciences, Social Sciences (R0)
