Abstract
This paper explores the use of machine learning techniques to restore punctuation and case in English text and, as part of this, investigates the co-dependence of case information and punctuation. Using a variety of lexical and contextual features together with iterative retagging, we achieve an overall F-score of 0.619 on the task.
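For readers unfamiliar with the metric, the following is a minimal sketch of how an F-score can be computed for a token-level restoration task. It is not the paper's evaluation protocol: the label names (`CAP`, `COMMA`, `PERIOD`, `O`), the micro-averaging over non-null labels, and the function name `f_score` are all assumptions made for illustration.

```python
from typing import Sequence


def f_score(gold: Sequence[str], predicted: Sequence[str],
            null_label: str = "O") -> float:
    """Micro-averaged F1 over tokens whose label is not the null label.

    Assumption: each token carries exactly one label, and `null_label`
    marks tokens that need no punctuation or case change.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    # A true positive is a non-null gold label that was predicted exactly.
    true_pos = sum(1 for g, p in zip(gold, predicted)
                   if g == p and g != null_label)
    n_predicted = sum(1 for p in predicted if p != null_label)
    n_gold = sum(1 for g in gold if g != null_label)

    if true_pos == 0:
        return 0.0

    precision = true_pos / n_predicted
    recall = true_pos / n_gold
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical five-token example: the comma is predicted one token late.
gold_labels = ["CAP", "O", "COMMA", "O", "PERIOD"]
pred_labels = ["CAP", "O", "O", "COMMA", "PERIOD"]
score = f_score(gold_labels, pred_labels)
```

In this hypothetical example two of the three predicted labels are correct and two of the three gold labels are recovered, so precision and recall are both 2/3 and the F-score is also 2/3.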
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Baldwin, T., Joseph, M.P.A.K. (2009). Restoring Punctuation and Casing in English Text. In: Nicholson, A., Li, X. (eds) AI 2009: Advances in Artificial Intelligence. AI 2009. Lecture Notes in Computer Science, vol 5866. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10439-8_55
DOI: https://doi.org/10.1007/978-3-642-10439-8_55
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-10438-1
Online ISBN: 978-3-642-10439-8