Corpus Editing and Text Normalization

  • Chapter
Utility and Application of Language Corpora

Abstract

In this chapter, we propose applying processes such as pre-editing and text standardization as essential components of corpus editing and text normalization, in order to make a text corpus ready for use across various domains of linguistics and language technology. We identify some of the basic pre-editing and text standardization tasks and describe them with reference to a Bangla text corpus. As the name suggests, text normalization involves diverse tasks of text adjustment and standardization that improve the utility of the texts stored in a corpus for both manual and machine-based applications. The methods and strategies that we propose here for overcoming the problems of text normalization are largely oriented toward written text corpora, since normalizing a spoken text corpus usually involves a different set of operations that hardly match the normalization processes normally applied to written text corpora. A normalized version of a text not only reduces the workload in subsequent use of a corpus but also enhances its accessibility to humans and machines across all domains where a language corpus has application and referential relevance.
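
The abstract does not list the individual pre-editing steps, so the following is only a minimal sketch of what such a pass over written text might look like, not the authors' procedure. The function name `normalize_text` and the three steps chosen (Unicode NFC composition, whitespace collapsing, and removal of stray spaces before punctuation, including the Bangla danda) are assumptions made for illustration.

```python
import re
import unicodedata

# Hypothetical helper patterns (not from the chapter).
_WHITESPACE = re.compile(r"\s+")
# Common Latin punctuation plus the Bangla danda (U+0964).
_SPACE_BEFORE_PUNCT = re.compile(r"\s+([,.;:!?\u0964])")


def normalize_text(raw: str) -> str:
    """Apply three assumed pre-editing steps to a written-text string."""
    # Step 1: canonical composition, so that visually identical
    # character sequences share a single code-point representation.
    text = unicodedata.normalize("NFC", raw)
    # Step 2: collapse runs of whitespace and trim both ends.
    text = _WHITESPACE.sub(" ", text).strip()
    # Step 3: drop stray spaces left before punctuation marks.
    text = _SPACE_BEFORE_PUNCT.sub(r"\1", text)
    return text


if __name__ == "__main__":
    print(normalize_text("In this chapter , we  propose ."))
    print(normalize_text("আমি যাই ।"))
```

The sketch deliberately leaves zero-width joiners untouched, since they can be meaningful in Bangla orthography. A real pipeline would need language-specific rules beyond these three steps.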

Author information

Corresponding author

Correspondence to Niladri Sekhar Dash.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Dash, N.S., Ramamoorthy, L. (2019). Corpus Editing and Text Normalization. In: Utility and Application of Language Corpora. Springer, Singapore. https://doi.org/10.1007/978-981-13-1801-6_3

  • DOI: https://doi.org/10.1007/978-981-13-1801-6_3

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-1800-9

  • Online ISBN: 978-981-13-1801-6

  • eBook Packages: Social Sciences (R0)
