
Language Resources and Evaluation, Volume 47, Issue 1, pp 179–193

Automatic normalization of short texts by combining statistical and rule-based techniques

  • Marta R. Costa-jussà
  • Rafael E. Banchs
Original Paper

Abstract

Short texts are typically composed of a small number of words, most of which are abbreviations, typos and other kinds of noise. This makes the noise-to-signal ratio relatively high for this specific category of text. A high proportion of noise in the data is undesirable for analysis procedures as well as for machine learning applications. Text normalization techniques are used to reduce the noise and improve the quality of text for processing and analysis purposes. In this work, we propose a combination of statistical and rule-based techniques to normalize short texts. More specifically, we focus on SMS messages. We base our normalization approach on a statistical machine translation system that translates from noisy data to clean data. This system is trained on a small manually annotated set. We then study several automatic methods for extracting more general rules from the normalizations generated by the statistical machine translation system. We illustrate the proposed methodology through experiments with an SMS Haitian-Créole data collection. To evaluate the performance of our methodology, we use several Haitian-Créole dictionaries, the well-known perplexity criterion and the achieved reduction in vocabulary size.
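The three evaluation measures named in the abstract (dictionary look-up, perplexity and vocabulary reduction) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names are hypothetical, tokenization is plain whitespace splitting, and perplexity is computed with an add-one smoothed unigram model rather than whatever language-model configuration the authors used.

```python
import math
from collections import Counter


def vocabulary(sentences):
    """Return the set of distinct whitespace-separated tokens."""
    return {tok for s in sentences for tok in s.split()}


def vocabulary_reduction(noisy, normalized):
    """Relative drop in vocabulary size after normalization (0.0 to 1.0)."""
    before = len(vocabulary(noisy))
    after = len(vocabulary(normalized))
    return (before - after) / before if before else 0.0


def dictionary_coverage(sentences, dictionary):
    """Fraction of running tokens that appear in a reference word list."""
    tokens = [tok for s in sentences for tok in s.split()]
    if not tokens:
        return 0.0
    return sum(tok in dictionary for tok in tokens) / len(tokens)


def unigram_perplexity(train, test):
    """Perplexity of `test` under an add-one smoothed unigram model of `train`.

    Simplifying assumption: all unseen tokens share one extra vocabulary slot.
    """
    counts = Counter(tok for s in train for tok in s.split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen tokens
    tokens = [tok for s in test for tok in s.split()]
    if not tokens:
        raise ValueError("test set is empty")
    log_prob = sum(
        math.log((counts[tok] + 1) / (total + vocab_size)) for tok in tokens
    )
    return math.exp(-log_prob / len(tokens))
```

Under these assumptions, a successful normalization should raise dictionary coverage, shrink the vocabulary, and lower the perplexity of the text with respect to a model trained on clean data.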

Keywords

  • Normalization of short texts
  • Statistical machine translation
  • Automatic extraction of rules
  • Perplexity

Notes

Acknowledgments

The authors want to thank the anonymous reviewers for their valuable comments and suggestions, which helped improve this paper. The authors also want to thank Barcelona Media Innovation Center and Institute for Infocomm Research for their support and permission to publish this research. This work has been partially funded by the Spanish Ministry of Economy and Competitiveness through the Juan de la Cierva fellowship program and by the Seventh Framework Programme of the European Commission through the T4ME contract (grant agreement no.: 249119).


Copyright information

© Springer Science+Business Media B.V. 2012

Authors and Affiliations

  1. Barcelona Media Innovation Center, Barcelona, Spain
  2. Institute for Infocomm Research, Singapore, Singapore
