Language Resources and Evaluation, Volume 47, Issue 4, pp 1213–1231

A feature-based approach to better automatic treebank conversion

  • Muhua Zhu
  • Jingbo Zhu
  • Huizhen Wang
Original Paper

Abstract

In constituency parsing, there exist multiple human-labeled treebanks that are built on non-overlapping text samples and follow different annotation standards. Because annotating parse trees by hand is extremely costly, it is desirable to automatically convert one treebank (the source treebank) to the standard of another treebank of interest (the target treebank). Conversion results can be manually corrected to obtain higher-quality annotations, or used directly as additional training data for building syntactic parsers. To perform automatic treebank conversion, we divide constituency parses into two separate levels, part-of-speech (POS) tags and syntactic structures (bracketing structures and constituent labels), and convert each level separately with a feature-based approach. The basic idea of the approach is to encode the original annotations in the source treebank as guide features during the conversion process. Experiments on two Chinese treebanks show that our approach converts POS tags and syntactic structures with accuracies of 96.6% and 84.8%, respectively, which are the best reported results on this task.
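As a rough illustration of the guide-feature idea at the POS level, the sketch below builds features for choosing a target-standard tag, combining ordinary tagging context with the tag already present in the source treebank. It is a minimal sketch under stated assumptions and not the feature set used in the paper: the function name, the feature templates, and the toy tags ("n", "v", "NN") are all hypothetical.

```python
from typing import List


def pos_conversion_features(
    words: List[str],
    source_tags: List[str],
    i: int,
    prev_target_tag: str,
) -> List[str]:
    """Return feature strings for choosing the target-standard tag of words[i].

    All templates here are illustrative assumptions, not the paper's own.
    """
    word = words[i]
    prev_word = words[i - 1] if i > 0 else "<s>"
    next_word = words[i + 1] if i + 1 < len(words) else "</s>"

    # Ordinary tagging features: surrounding words and the previous target tag.
    feats = [
        f"w0={word}",
        f"w-1={prev_word}",
        f"w+1={next_word}",
        f"t-1={prev_target_tag}",
    ]

    # Guide features: the original source-treebank tag, alone and in
    # combination with the word and with the previous target tag.
    src = source_tags[i]
    feats += [
        f"src0={src}",
        f"src0|w0={src}|{word}",
        f"src0|t-1={src}|{prev_target_tag}",
    ]
    return feats


# Toy usage with made-up tags: source standard uses "n"/"v", target uses "NN".
example = pos_conversion_features(["dogs", "bark"], ["n", "v"], 1, "NN")
```

A classifier or sequence model trained on such features could then learn, for example, how strongly a given source tag predicts each target tag in context; which learner to use is left open here.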

Keywords

  • Automatic treebank conversion
  • Feature-based approach
  • Part of speech
  • Constituency syntactic structure

Notes

Acknowledgments

This work was supported in part by the National Science Foundation of China (61073140; 61272376; 61100089), Specialized Research Fund for the Doctoral Program of Higher Education (20100042110031), and the Fundamental Research Funds for the Central Universities (N110404012; N100204002).

Copyright information

© Springer Science+Business Media Dordrecht 2013

Authors and Affiliations

  1. Natural Language Processing Laboratory, Northeastern University, Shenyang, China
