Abstract
In this chapter, we propose applying processes such as pre-editing and text standardization as essential components of corpus editing and text normalization, so that a text corpus is ready for access across various domains of linguistics and language technology. We identify some of the basic pre-editing and text standardization tasks and describe them with reference to a Bangla text corpus. As the name suggests, text normalization involves diverse tasks of text adjustment and standardization that improve the utility of the texts stored in a corpus for manual and machine-based applications. The methods and strategies that we propose here for overcoming the problems of text normalization are oriented largely toward written text corpora, since normalizing a spoken text corpus usually invokes a different set of operations that hardly match the normalization processes normally applied to a written text corpus. A normalized text not only reduces the workload in subsequent use of a corpus but also makes the corpus more accessible to humans and machines across all domains where a language corpus has application and referential relevance.
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Dash, N.S., Ramamoorthy, L. (2019). Corpus Editing and Text Normalization. In: Utility and Application of Language Corpora. Springer, Singapore. https://doi.org/10.1007/978-981-13-1801-6_3
DOI: https://doi.org/10.1007/978-981-13-1801-6_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1800-9
Online ISBN: 978-981-13-1801-6
eBook Packages: Social Sciences (R0)