Data Acquisition and Linguistic Resources

  • Stephanie Strassel
  • Caitlin Christianson
  • John McCary
  • William Staderman
  • Joseph Olive

Abstract

All human language technology demands substantial quantities of data for system training and development, plus stable benchmark data to measure ongoing progress. While creating high-quality linguistic resources is both costly and time-consuming, such data has the potential to profoundly impact not just a single evaluation program but language technology research in general. GALE’s challenging performance targets demand linguistic data of a scale and complexity never before encountered. Resources cover multiple languages (Arabic, Chinese, and English) and multiple genres, both structured (newswire and broadcast news) and unstructured (web text, including blogs and newsgroups, and broadcast conversation). These resources include significant volumes of monolingual text and speech, parallel text, and transcribed audio, combined with multiple layers of linguistic annotation ranging from word-aligned parallel text and Treebanks to rich semantic annotation.


References

  1. Al-Sughaiyer, I. A. and I. A. Al-Kharashi. 2004. Arabic Morphological Analysis Techniques: A Comprehensive Survey. Journal of the American Society for Information Science and Technology 55(3):189–213.CrossRefGoogle Scholar
  2. Babko-Malaya, O. 2008. Annotation of Nuggets and Relevance in GALE Distillation Evaluation. Proceedings of the International Conference on Language Resources and Evaluation. Marrakech, Morocco.Google Scholar
  3. Babko-Malaya, O., A. Bies, A. Taylor, S. Yi, M. Palmer, M. Marcus, S. Kulick and L. Shen. 2006. Issues in Synchronizing the English Treebank and PropBank. Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora.Google Scholar
  4. Babko-Malaya, O., M. Palmer, N. Xue, A. Joshi and S. Kulick. 2004. Proposition Bank II: Delving Deeper. Proceedings of the Human Language Technologies Conference/ Meeting of the North American Chapter of the Association for Computational Linguistics Frontiers in Corpus Annotation Workshop. Google Scholar
  5. Babko-Malaya, O., Z. Song, R. Zakhary, S. Strassel and J. Medero. 2006. GALE Y1 - Distillation Training Query Answer Keys V7.0. LDC2006E15.Google Scholar
  6. Badr, I., R. Zbib and J. Glass. 2008. Segmentation For English-To-Arabic Statistical Machine Translation. Proceedings of the Human Language Technologies Conference/ Meeting of the North American Chapter of the Association for Computational Linguistics. Short Papers: 153–156. Columbus, Ohio.Google Scholar
  7. Baker, C.F., C. J. Fillmore and J. B. Lowe. 1998. The Berkeley FrameNet Project. Proceedings of the International Conference on Computational Linguistics/Meeting of the Association for Computational Linguistics pp. 86-90.Google Scholar
  8. Benajiba, Y., M. Diab and P. Rosso. 2008. Arabic Named Entity Recognition Using Optimized Feature Sets. Proceedings of the 2008 Conference on Empirical Methods on Natural Language Processing pp. 284–293. Honolulu, Hawaii.Google Scholar
  9. Benajiba, Y., M. Diab and P. Rosso. 2009. Arabic Named Entity Recognition: A Feature-Driven Study. IEEE Transactions on Audio, Speech & Language Processing 17(5): 926-934.CrossRefGoogle Scholar
  10. Bies, A. and M. Ferguson, K. Katz and R. MacIntyre, Eds. 1995. Bracketing Guidelines for Treebank II Style. Penn Treebank Project, University of Pennsylvania, CIS Technical Report MS-CIS-95-06.Google Scholar
  11. Bikel, D.M. 2004. On the Parameter Space of Generative Lexicalized Statistical Parsing Models. Ph.D. thesis, University of Pennsylvania.Google Scholar
  12. Bikel, D.M. and D. Chiang. 2000. Two Statistical Parsing Models Applied To The Chinese Treebank. Proceedings of the Second Chinese Language Processing Workshop.Google Scholar
  13. Black, E., R. Garside and G. Leech. 1993. Statistically-driven Computer Grammars of English: The IBM/Lancaster Approach. Rodopi, Amsterdam.Google Scholar
  14. Blum, A. and T. Mitchell. 1998. Combining Labeled and Unlabeled Data With Co-Training. Proceedings of the Conference on Learning Theory.Google Scholar
  15. Buckwalter, T. 2002. Buckwalter Arabic Morphological Analyzer Version 1.0. Linguistic Data Consortium, University of Pennsylvania. LDC Catalog No.: LDC2002L49.Google Scholar
  16. Buckwalter, T. 2004. Buckwalter Arabic Morphological Analyzer Version 2.0. Linguistic Data Consortium, University of Pennsylvania. LDC Catalog No.: LDC2004L02.Google Scholar
  17. Burchardt, A., K. Erk, A. Frank, A. Kowalski, S. Pado and M. Pinkal. 2006. Consistency And Coverage: Challenges for Exhaustive Semantic Annotation. Proceedings of The German Society for Linguistics Conference (DGfS). Google Scholar
  18. Chang, P.C., M. Gally and C. Manning. 2008. Optimizing Chinese Word Segmentation for Machine Translation Performance. Proceedings of the Meeting of the Association for Computational Linguistics Third Workshop on Statistical Machine Translation. Google Scholar
  19. Charniak, E. 1997. Statistical Parsing with a Context-Free Grammar and Word Statistics. Proceedings of the International Conference on Artificial Intelligence.Google Scholar
  20. Charniak, E. 2000. A Maximum-Entropy-Inspired Parser. Proceedings of the Meeting of the North American Chapter of the Association for Computational Linguistics.Google Scholar
  21. Charniak, E. and M. Johnson. 2005. Coarse-to-fine Nbest Parsing and MaxEnt Discriminative Reranking. Proceedings of the 43rd Meeting of the Association for Computational Linguistics.Google Scholar
  22. Chen, J. and M. Palmer. 2005. Towards Robust High Performance Word Sense Disambiguation of English Verbs Using Rich Linguistic Features. Proceedings of the International Joint Conference on Natural Language Processing pp. 933-944.Google Scholar
  23. Chen, K.J. and S. H. Liu. 1992. Word Identification for Mandarin Chinese Sentences. Proceedings of the International Conference on Computational Linguistics pp. 101–107.Google Scholar
  24. Cieri, C., S. Strassel, D. Graff, N. Martey, Ka. Rennert and M. Liberman. 2002. Corpora for topic detection and tracking. James Allen, editor, Topic Detection and Tracking: Event-based Information Organization. Kluwer Academic Publishers.Google Scholar
  25. Cieri, C. and Liberman, M. 2000. Issues in Corpus Creation and Distribution: The Evolution of the Linguistic Data Consortium. Proceedings of the International Conference on Language Resources and Evaluation.Google Scholar
  26. Cieri, C., S. Strassel, M. Glenn and L. Friedman. 2007. Linguistic Resources in Support of Various Evaluation Metrics. Automatic Procedures in MT Evaluation Workshop: MT Summit XI.Google Scholar
  27. Collins, M. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania. Google Scholar
  28. Collins, M. and T. Koo. 2005. Discriminative Reranking For Natural Language Parsing. Computational Linguistics, 31(1):25–70.MathSciNetCrossRefGoogle Scholar
  29. Costa-jussà, M.R., J.M. Crego, A. de Gispert, P. Lambert, M. Khalilov, J.A.R. Fonollosa, J.B. Mariño and R. Banchs. 2006. TALP Phrase-Based System and TALP System Combination. Proceedings of the International Workshop on Spoken Language Translation pp. 123–129. Kyoto, Japan.Google Scholar
  30. Cowan, M. 2009. GNU Wget. http://www.gnu.org/software/wget.
  31. Crego, J.M. and N. Habash. 2008. Using Shallow Syntax Information to Improve Word Alignment and Reordering for SMT. Proceedings of the Third Workshop on Statistical Machine Translation pp. 53–61. Columbus, Ohio.Google Scholar
  32. CRL Lesser Studied Languages Center. 2009. http://crl.nmsu.edu/say/
  33. Dean, J. and S. Ghemawat. 2005. MapReduce: Simplified Data Processing on Large Clusters. Proceedings of the Sixth Symposium on Operating System Design and Implementation (OSDI’04).Google Scholar
  34. Diab, M. 2007. Improved Arabic Base Phrase Chunking with a New Enriched Pos Tag Set. Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources pp. 89–96. Prague, Czech Republic.Google Scholar
  35. Diab, M. 2007a. Towards An Optimal Pos Tag Set For Modern Standard Arabic Processing. Proceedings of the Conference on Recent Advances in Natural Language Processing. Borovets, Bulgaria.Google Scholar
  36. Diab, M., K. Hacioglu and D. Jurafsky. 2004. Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks. Proceedings of the Human Language Technologies Conference/ Meeting of the North American Chapter of the Association for Computational Linguistics. Boston, MA.Google Scholar
  37. Diab, M., M. Alkhalifa, S. ElKateb, C. Fellbaum, A. Mansouri, M. Palmer. 2007. SemEval-2007 Task 18: Arabic Semantic Labeling. Proceedings of the International Workshop on Semantic Evaluations. Google Scholar
  38. Diab, M., M. Ghoneim and N. Habash. 2007a. Arabic Diacritization In The Context Of Statistical Machine Translation. Proceedings of the Machine Translation Summit. Copenhagen, Denmark.Google Scholar
  39. Diab, M., K. Hacioglu and D. Jurafsky. 2007b. Automatic Processing of Modern Standard Arabic Text. Arabic Computational Morphology: Knowledge-based and Empirical Methods. Abdelhadi Soudi, Antal van den Bosch and Günter Neumann, eds. Springer.Google Scholar
  40. Dligach, D. and M. Palmer. 2008. Novel Semantic Features for Verb Sense Disambiguation. Proceedings of the Meeting of the Association for Computational Linguistics.Google Scholar
  41. Dligach, D. and M. Palmer. 2009. Using Language Modeling to Select Useful Annotation Data. Proceedings of the Human Language Technologies Conference/ Meeting of the North American Chapter of the Association for Computational Linguistics. Boulder, Colorado.Google Scholar
  42. Elming, J. and N. Habash. 2007. Combination of Statistical Word Alignments Based on Multiple Preprocessing Schemes. Proceedings of the Human Language Technologies Conference/ Meeting of the North American Chapter of the Association for Computational Linguistics Companion Volume, Short Papers: 25–28. Rochester, New York.Google Scholar
  43. Elming, J., N. Habash and J. Crego. 2008. Combination Of Statistical Word Alignments Based On Multiple Preprocessing Schemes. Cyril Goutte, Nicola Cancedda, Marc Dymetman and George Foster, eds., Learning for Machine Translation. MIT Press.Google Scholar
  44. Emerson, T. 2005. The Second International Chinese Word Segmentation Bakeoff. Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing pp. 123–133. Jeju Island, Korea.Google Scholar
  45. Farber, B., D. Freitag, N. Habash and O. Rambow. 2008. Improving NER in Arabic Using a Morphological Tagger. Proceedings of the International Conference on Language Resources and Evaluation. Marrakech, Morocco.Google Scholar
  46. Farwell, D., J. Giménez, E. González, R. Halkoum, H. Rodríguez and M. Surdeanu. 2007. The UPC System for Arabic-to-English Entity Translation. Proceedings of the Automatic Content Extraction Conference.Google Scholar
  47. Favre, B., D. Hakkani-Tur, S. Petrov and D. Klein. 2008. Efficient Sentence Segmentation Using Syntactic Features. Proceedings of the IEEE/Meeting of the Association for Computational Linguistics Workshop on Spoken Language Technologies.Google Scholar
  48. Favre, B., R. Grishman, D. Hillard, H. Ji, D. Hakkani-Tur and M. Ostendorf. 2008a. Punctuating Speech for Information Extraction. Proceedings of the International Conference on Acoustics, Speech and Signal Processing pp. 5013–5106.Google Scholar
  49. Fellbaum, C. ed. 1998. WordNet: An On-line Lexical Database and Some of its Applications. MIT Press. Google Scholar
  50. Fiscus, J., J. Ajot, N. Radde and C. Laprun. 2006. Multiple Dimension Levenshtein Edit Distance Calculations for Evaluating Automatic Speech Recognition Systems During Simultaneous Speech. Proceedings of the International Conference on Linguistic Resources and Evaluation.Google Scholar
  51. Fraser, A. and D. Marcu. 2007. Measuring Word Alignment Quality for Statistical Machine Translation. Computational Linguistics 33(3):293-303MathSciNetCrossRefGoogle Scholar
  52. Friedman, L. and S. Strassel. 2008. Identifying Common Challenges for Human and Machine Translation: A Case Study from the GALE Program. Proceedings of the Meeting of the Association for Machine Translation in the Americas. Waikiki, Hawaii.Google Scholar
  53. Friedman, L., H. Lee and S. Strassel. 2008. A Quality Control Framework for Gold Standard Reference Translations: The Process and Toolkit Developed for GALE. Translating and the Computer 30 Proceedings. London, UK.Google Scholar
  54. Friedman, L., S. Strassel and M. Glenn. 2008. Explicit and Implicit Requirements of Technology Evaluations: Implications for Test Data Creation. Proceedings of the International Conference on Language Resources and Evaluation. Marrakech, Morocco.Google Scholar
  55. Gabbard, R. and S. Kulick. 2008. Construct State Modification in the Arabic Treebank. Proceedings of Meeting of the Association for Computational Linguistics Short Papers. Google Scholar
  56. Gao, J.F., M. Li, A. Wu and C.N. Huang. 2005. Chinese Word Segmentation and Named Entity Recognition: a Pragmatic Approach. Computational Linguistics, 31(4):531–574.CrossRefGoogle Scholar
  57. Glenn, M. and S. Strassel. 2006. Shared Linguistic Resources for the Meeting Domain. Proceedings of the Classification of Events, Activities, and Relationships Evaluation and RT Evaluation. Google Scholar
  58. Glenn, Meghan Lammie, Haejoong Lee and Stephanie M. Strassel. 2009. XTrans: a speech annotation and transcription tool. Proceedings of Interspeech 2009, Brighton, UK. Google Scholar
  59. Habash, N. 2004. Large Scale Lexeme Based Arabic Morphological Generation. Proceedings of the Conference on Automated Processing of Natural Languages (TALN-04) pp. 271–276. Fez, Morocco.Google Scholar
  60. Habash, N. 2007. Arabic Morphological Representations for Machine Translation. Arabic Computational Morphology: Knowledge-based and Empirical Methods. Abdelhadi Soudi, Antal van den Bosch and Günter Neumann, eds. Springer.Google Scholar
  61. Habash, N. and F. Sadat. 2006. Arabic Preprocessing Schemes for Statistical Machine Translation. Proceedings of the 7th Meeting of the North American Chapter of the Association for Computational Linguistics/ Human Language Technologies Conference pp. 49–52. New York, New York.Google Scholar
  62. Habash, N. and O. Rambow. 2005. Arabic Tokenization, Morphological Analysis, and Part-of-Speech Tagging in One Fell Swoop. Proceedings of the Meeting of the Association for Computational Linguistics. Ann Arbor, Michigan.Google Scholar
  63. Habash, N. and O. Rambow. 2006. Magead: A morphological analyzer for Arabic and its dialects. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. Sydney, Australia.Google Scholar
  64. Habash, N. and O. Rambow. 2007. Arabic Diacritization through Full Morphological Tagging. Proceedings of the Human Language Technologies Conference/ Meeting of the North American Chapter of the Association for Computational Linguistics.Google Scholar
  65. Habash, N. and R. Roth. 2008. Identification of Naturally Occurring Numerical Expressions in Arabic. Proceedings of the International Conference on Language Resources and Evaluation. Marrakech, Morocco.Google Scholar
  66. Habash, N. and R. Roth. 2009. CATiB: The Columbia Arabic Treebank. Proceedings of the Conference of the Association for Computational Linguistics. Suntec, Singapore.Google Scholar
  67. Habash, N., O. Rambow and G. Kiraz. 2005. Morphological Analysis and Generation for Arabic Dialects. Proceedings of the Meeting of the Association for Computational Linguistics, Workshop on Computational Approaches to Semitic Languages. Ann Arbor, MI.Google Scholar
  68. Habash, N., A. Soudi and T. Buckwalter. 2007. On Arabic Transliteration. Arabic Computational Morphology: Knowledge-based and Empirical Methods. Abdelhadi Soudi, Antal van den Bosch, and Günter Neumann, eds. Springer.Google Scholar
  69. Habash, N., R. Faraj and R. Roth. 2009. Syntactic Annotation in the Columbia Arabic Treebank. Proceedings of the International Conference on Arabic Language Resources and Tools (MEDAR). Cairo, Egypt.Google Scholar
  70. Hajic, A.J., B. Vidová-Hladká and P. Pajas. 2001. The Prague Dependency Treebank: Annotation Structure and Support. Proceeding of the IRCS Workshop on Linguistic Databases, pp. 105–114.Google Scholar
  71. Hajič, J., O. Smrž, P. Zemánek, J. Šnaidauf and E. Beška. 2004. Prague Arabic Dependency Treebank: Development in Data and Tools. Proceedings of the Network for Euro-Mediterranean Language Resources Conference on Arabic Language Resources and Tools. Cairo, Egypt.Google Scholar
  72. Harper, M., B. Dorr, J. Hale, B. Roark, I. Shafran, M. Lease, Y. Liu, M. Snover, L.Yung, A. Krasnyanskaya and R. Stewart. 2005. The Johns Hopkins Summer Workshop Final Report on Parsing and Spoken Structural Event Detection. Johns Hopkins University.Google Scholar
  73. Hillard, D., Z. Huang, H. Ji, R. Grishman, D. Hakkani-Tur, M. Harper, M. Ostendorf and W. Wang. 2006. Impact of Automatic Comma Prediction on POS/Name Tagging Of Speech. Proceedings of the IEEE/Meeting of the Association for Computational Linguistics Workshop Spoken Language Technology pp. 58–61.Google Scholar
  74. Holovaty, A. 2007. Templatemaker. http://code.-google.com/p/templatemaker/.
  75. Hovy, E.H., A. Philpot, et al. 2009. The Omega Upper Model. Unpublished ms.Google Scholar
  76. Huang, C.R., T.S. You, P. Simon and S.K. Hsieh. 2008. A Realistic and Robust Model for Chinese Word Segmentation. Proceedings of the Conference on Computational Linguistics and Speech Processing (ROCLING-2008). Taiwan.Google Scholar
  77. Huang, J. and G. Zweig. 2002. Maximum Entropy Model for Punctuation Annotation from Speech. Proceedings of the International Conference on Spoken Language Processing pp.917–920.Google Scholar
  78. Huang, Z., M. Harper and W. Wang. 2007. Mandarin Part-of-speech Tagging and Discriminative Reranking. Proceedings of the Conference on Empirical Methods on Natural Language Processing.Google Scholar
  79. Hwang, M.Y., X. Lei, W. Wang and T. Shinozaki. 2006. Investigation on Mandarin Broadcast News Speech Recognition. Proceedings of the International Conference on Spoken Language Processing.Google Scholar
  80. Jin, G. and X. Chen. 2008. The Fourth International Chinese Word Segmentation Bakeoff. Proceedings of the Sixth SIGHAN Workshop on Chinese Language Processing. Hyderabad, India.Google Scholar
  81. Jones, D., E. Gibson, W. Shen, N. Granoien, M. Herzog, D. Reynolds and C. Weinstein. 2005. Measuring Human Readability of Machine-Generated Text: Three Case Studies in Speech Recognition and Machine Translation. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing pp. 1009–1012.Google Scholar
  82. Kahn, J.G., M. Ostendorf and C. Chelba. 2004. Parsing Conversational Speech Using Enhanced Segmentation. Proceedings of the Human Language Technologies Conference/ Meeting of the North American Chapter of the Association for Computational Linguistics pp. 125–128.Google Scholar
  83. Kiraz, G.A. 2000. Multi-tiered Nonlinear Morphology Using Multi-tape Finite Automata: A Case Study on Syriac and Arabic. Computational Linguistics 26(1):77–105.CrossRefGoogle Scholar
  84. Klein, D. and C. Manning. 2003. Accurate Unlexicalized Parsing. Proceedings of the 41st Meeting of the Association for Computational Linguistics pp. 423– 430.Google Scholar
  85. Kruijff-Korbayová, I., K. Chvátalová, O. Postolache. 2006. Annotation Guidelines for Czech-English Word Alignment. Proceedings of the International Conference on Language Resources and Evaluation pp. 1256-1261.Google Scholar
  86. Kulick, S., R. Gabbard and M. Marcus. 2006. Parsing the Arabic Treebank: Analysis and Improvements. Proceedings of the Treebanks and Linguistic Theories Conference. Prague, Czech Republic.Google Scholar
  87. Larkey, L.S., L. Ballesteros and M.E. Connell. 2007. Light Stemming for Arabic Information Retrieval. Arabic Computational Morphology: Knowledge-based and Empirical Methods. Abdelhadi Soudi, Antal van den Bosch, and Günter Neumann, eds. Springer.Google Scholar
  88. Le, S., Y. Jin, L. Du and Y. Sun. 2000. Word Alignment of English-Chinese Bilingual Corpus Based on Chunks. Proceedings of the 2000 Joint SIGDAT conference on Empirical Methods on Natural Language Processing and Very Large Corpora/ The 38th Annual Meeting of the Association for Computational Linguistics 13: 110 – 116.Google Scholar
  89. Levow, G.A. 2006. The Third International ChineseWord Segmentation Bakeoff: Word Segmentation and Named Entity Recognition. Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing. Sydney, Australia.Google Scholar
  90. Levy, R. and C. Manning. 2003. Is it Harder to Parse Chinese, or the Chinese Treebank. Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics. Sapporo, Japan.Google Scholar
  91. Levy, R. and G. Andrew. 2006. Tregex and Tsurgeon: Tools for Querying and Manipulating Tree Data Structures. Proceedings of the International Conference on Language Resources and Evaluation. Google Scholar
  92. LDC. 2004. Simple Metadata Annotation Specification, Version 6.2. http://projects.ldc.upenn.edu/MDE/Guidelines/SimpleMDE_V6.2.pdf
  93. LDC. 2006. Simple Named Entity Guidelines for Less Commonly Taught Languages, Version 6.5. http://projects.ldc.upenn.edu/LCTL/Specifications/SimpleNamedEntityGuidelinesV6.5.pdf
  94. LDC. 2007. Using XTrans for Broadcast Transcription: A User Manual, Version 3.0. http://projects.ldc.upenn.edu/gale/Transcription/XTransManualV3.pdf
  95. LDC. 2008. GALE P3 Supra-Lexical Annotation Experiment Guidelines. http://projects.ldc.upenn.edu/gale/Transcription/Supra-lexicalAnnotation_v0.6.pdf
  96. Liu, F. and Y. Liu. 2007. Soundbite Identification Using Reference and Automatic Transcripts of Broadcast News Speech. Proceedings of the Automatic Speech Recognition and Understanding Workshop pp. 653–658.Google Scholar
  97. Liu, Y., E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf and M. Harper. 2006. Enriching Speech Recognition with Automatic Detection of Sentence Boundaries and Disfluencies. IEEE Transactions on Audio, Speech, and Language Processing 14(5):1526–1540.CrossRefGoogle Scholar
  98. Low, J.K., H.T. Ng and W.Y. Guo. 2005. A Maximum Entropy Approach to ChineseWord Segmentation. Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing pp. 161–164. Jeju Island, Korea.Google Scholar
  99. Luo, X.Q. 2003. A Maximum Entropy Chinese Character-Based Parser. Proceedings of the Conference on Empirical Methods on Natural Language Processing. Sapporo, Japan.Google Scholar
  100. Ma, X. 2006. Champollion: A Robust Parallel Text Sentence Aligner. Proceedings of LREC-2006, Genoa, Italy.Google Scholar
  101. Ma, X. and M. Liberman. 1999. BITS: A method for bilingual text search over the web. Proceedings of the Machine Translation Summit VII. Google Scholar
  102. Maamouri, M. and A. Bies. 2004. Developing an Arabic Treebank: Methods, Guidelines, Procedures, and Tools. Proceedings of the International Conference on Computational Linguistics. Geneva, Switzerland.Google Scholar
  103. Maamouri, M., A. Bies and S. Kulick. 2008. Enhancing the Arabic Treebank: A Collaborative Effort toward New Annotation Guidelines. Proceedings of the International Conference on Language Resources and Evaluation.Google Scholar
  104. Maamouri, M., A. Bies and T. Buckwalter. 2004. The Penn Arabic Treebank : Building a Large-scale Annotated Arabic Corpus. Proceedings of the Network for Euro-Mediterranean Language Resources Conference on Arabic Language Resources. Cairo, Egypt.Google Scholar
  105. Maamouri, M., A. Bies, T. Buckwalter, M. Diab, N. Habash, O. Rambow and D. Tabessi. 2006. Developing and Using a Pilot Dialectal Arabic Treebank. Proceedings of the Language Resource and Evaluation Conference. Genoa, Italy.Google Scholar
  106. MacWhinney, B. 2001. From CHILDES to Talkbank. Research on Child Language Acquisition. M. Almgren, A. Barreña, M. Ezeizaberrena, I. Idiazabal and B. MacWhinney eds., pp. 17-34. Somerville, MA: Cascadilla.Google Scholar
  107. Maeda, K., H. Lee, J. Medero and S. Strassel, 2006. A new phase in annotation tool development at the Linguistic Data Consortium: The evolution of the Annotation Graph Toolkit. Proceedings of the Fifth International Conference on Language Resources and Evaluation. Google Scholar
  108. Maeda, K., H. Lee, S. Medero, J. Medero, R. Parker and S. Strassel, 2008. Annotation tool development for large-scale corpus creation projects at the Linguistic Data Consortium. Proceedings of the Sixth International Conference on Language Resources and Evaluation. Google Scholar
  109. Maeda, K., X. Ma and S. Strassel, 2008a. Creating sentence-aligned parallel text corpora from a large archive of potential parallel text using BITS and Champollion. Proceedings of the Sixth International Conference on Language Resources and Evaluation. Google Scholar
  110. Makhoul, J., A. Baron, I. Bulyko, L. Nguyen, L. Ramshaw, D. Stallard, R. Schwartz and B. Xiang. 2005. The effects of speech recognition and punctuation on information extraction performance. Proceedings of the European Conference on Speech Communication and Technology (Eurospeech) pp. 57–60.Google Scholar
  111. Marcus, M. G. Kim, M.A. Marcinkiewicz, R. MacIntyre, A. Bies, M. Ferguson, K. Katz and B. Schasberger. 1994. The Penn Treebank: annotating predicate argument structure. Proceedings of the Human Language Technologies Workshop.Google Scholar
  112. Marcus, M., B. Santorini and M. A. Marcinkiewicz. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics 19: 313-330.Google Scholar
  113. Marcus, M., G. Kim, M.A. Marcinkiewicz, R. MacIntyre, A. Bies, M. Ferguson, K. Katz and B. Schasberger. 1994. The Penn Treebank: Annotating Predicate Argument Structure. Proceedings of the Human Language Technology Workshop. San Francisco, California.Google Scholar
  114. Matsoukas, S., I. Bulyko, B. Xiang, K. Nguyen, R. Schwartz and J. Makhoul. 2007. Integrating Speech Recognition and Machine Translation. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing pp. 1281–1284.Google Scholar
  115. Matsuzaki, T., Y. Miyao and J. Tsujii. 2005. Probabilistic CFG with Latent Annotations. Proceedings of the Meeting of the Association for Computational Linguistics.Google Scholar
  116. Matusov, E., A. Mauser and H. Ney. 2006. Automatic Sentence Segmentation And Punctuation Prediction For Spoken Language Translation. Proceedings of the International Workshop on Spoken Language Translation pp. 158–165.Google Scholar
  117. Matusov, E., D. Hillard, M. Magimai-Doss, D. Hakkani-Tur, M. Ostendorf and H. Ney. 2007. Improving Speech Translation by Automatic Boundary Prediction. Proceedings of the International Speech Communication Association Conference (Interspeech) pp. 2449–2452.Google Scholar
  118. Mauser, A., R. Zens, E. Matusov, S. Hasan and H. Ney. 2006. The RWTH Statistical Machine Translation System for the IWSLT 2006 Evaluation. Proceedings of the International Workshop on Spoken Language Translation pp. 103– 110.Google Scholar
  119. McClosky, D., E. Charniak and M. Johnson. 2006. Effective Self-training for Parsing. Proceedings of the Human Language Technologies Conference/ Meeting of the North American Chapter of the Association for Computational Linguistics.Google Scholar
  120. Melamed, D. 1998. Annotation Style Guide for the Blinker Project, IRCS Technical Report #98-06. http://www.cs.nyu.edu/~melamed/ftp/papers/styleguide.ps.gz
  121. Meyers, A., R. Reeves, C. Macleod, R. Szekely, V. Zielinska, B. Young and R. Grishman. 2004. The NomBank Project: An Interim Report. Proceedings of the Frontiers in Corpus Annotation Workshop, held in conjunction with the Human Language Technologies Conference/ Meeting of the North American Chapter of the Association for Computational Linguistics Google Scholar
  122. Mohit, B. and R. Hwa. 2007. Localization of Difficult-to-translate Phrases. Proceedings of the Second Workshop on Statistical Machine Translation pp. 248–255, Prague, Czech Republic.Google Scholar
  123. Mohri, M., F. Pereira and M. Riley. 1998. A Rational Design for a Weighted Finite-state Transducer Library. Automata Implementation, Lecture Notes in Computer Science 1436:144– 58. D.Wood and S. Yu, eds. Springer.CrossRefGoogle Scholar
  124. Nichols, C. and R. Hwa. 2005. Word Alignment and Cross-lingual Resource Acquisition. Proceedings of the Meeting of the Association for Computational Linguistics Poster and Demonstration Sessions.Google Scholar
  125. Nigram, K. and R. Ghani. 2000. Analyzing the Effectiveness and Applicability of Co-training. Proceedings of the Conference on Information and Knowledge Management. Google Scholar
  126. Nivre, J., J. Hall, J. Nilsson, A. Chanev, G. Eryigit, S. Kübler, S. Marinov and E. Marsi. 2007. MaltParser: A language-independent System for Data-driven Dependency Parsing. Natural Language Engineering 13(2), 95-135.Google Scholar
  127. Packard, J. 2000. The Morphology of Chinese. Cambridge University Press.Google Scholar
  128. Pajas, P. 2002. Tree Editor TrEd. Charles University, Prague. http://ufal.mff.cuni.cz/~pajas/tred/.
  129. Palmer, M., D. Gildea and P. Kingsbury. 2005. The Proposition Bank: A Corpus Annotated with Semantic Roles. Computational Linguistics 31(1).Google Scholar
  130. Palmer, M., H. Dang and C. Fellbaum. 2007. Making Fine-grained and Coarse-grained Sense Distinctions, Both Manually and Automatically. Journal of Natural Language Engineering 13:2, 137-163.Google Scholar
  131. Palmer, M., J. Hwang, S. W. Brown, K. K. Schuler and A. Lanfranchi, 2009 Leveraging Lexical Resources for the Detection of Event Relations. Proceedings of the AAAI 2009 Spring Symposium on Learning by Reading.Google Scholar
  132. Penman, R.B. 2009. Web Scraping Made Simple with SiteScrapper. http://code.google.com/p/site-scraper/.
  133. Petrov, S., L. Barrett, R. Thibaux and D. Klein. 2006. Learning Accurate, Compact, and Interpretable Tree Annotation. Proceedings of the Meeting of the Association for Computational Linguistics pp. 433– 440. Sydney, Australia, July.Google Scholar
  134. Petrov, S. and D. Klein. 2007. Discriminative Log- Linear Grammars with Latent Variables. Proceedings of the Conference on Neural Information Processing Systems. Google Scholar
  135. Petrov, S. and D. Klein. 2007a. Improved Inference for Unlexicalized Parsing. Proceedings of the Human Language Technologies Conference/ Meeting of the North American Chapter of the Association for Computational Linguistics. Google Scholar
  136. Petrov, S. and D. Klein. 2008. Sparse Multi-Scale Grammars for Discriminative Latent Variable Parsing. Proceedings of the Conference on Empirical Methods on Natural Language Processing. Google Scholar
  137. Philpot, A., E.H. Hovy and P. Pantel. 2005. The Omega Ontology. Proceedings of the International Joint Conference on Natural Language Processing ONTOLEX Workshop. Google Scholar
  138. Pradhan, S., L. Ramshaw, R. Weischedel, J. MacBride, L. Miccuilla. 2007. Unrestricted Coreference: Identifying Entities and Events in OntoNotes. Proceedings of the International Conference on Semantic Computing. Google Scholar
  139. Pradhan, S., E. Hovy, M. Marcus, M. Palmer, L. Ramshaw and R. Weischedel. 2007a. OntoNotes: A Unified Relational Representation. International Journal of Semantic Computing 1:4 405-419CrossRefGoogle Scholar
  140. Pradhan, S., E. Loper, D. Dligach and M. Palmer. 2007b. SemEval-2007 Task 17: English Lexical Sample, SRL and All Words. Proceedings of the International Workshop on Semantic Evaluations.
  141. Pradhan, S., W. Ward, K. Hacioglu, J. Martin, D. Jurafsky. 2005. Semantic Role Labeling Using Different Syntactic Views. Proceedings of Meeting of the Association for Computational Linguistics.
  142. Prescher, D. 2005. Inducing Head-Driven PCFGs with Latent Heads: Refining a Tree-Bank Grammar for Parsing. Proceedings of the European Conference on Machine Learning.
  143. Reeder, F., B. Dorr, D. Farwell, N. Habash, S. Helmreich, E.H. Hovy, L. Levin, T. Mitamura, K. Miller, O. Rambow, A. Siddharthan. 2004. Interlingual Annotation for MT Development. Proceedings of the Meeting of the Association for Machine Translation of the Americas.
  144. Roark, B., M. Harper, E. Charniak, B. Dorr, M. Johnson, J.G. Kahn, M. Ostendorf, Y. Liu, J. Hale, A. Krasnyanskaya, M. Lease, I. Shafran, M. Snover, R. Stewart and L. Yung. 2006. SParseval: Evaluation Metrics for Parsing Speech. Proceedings of the International Conference on Language Resources and Evaluation. Genoa, Italy.
  145. Roark, B., Y. Liu, M. P. Harper, R. Stewart, M. Lease, M. Snover, I. Shafran, B. Dorr, J. Hale, A. Krasnyanskaya and L. Yung. 2006a. Reranking for Sentence Boundary Detection in Conversational Speech. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing pp. 545–548.
  146. Rosenberg, A., M. Sharifi and J. Hirschberg. 2007. Varying Input Segmentation for Story Boundary Detection in English, Arabic, and Mandarin Broadcast News. Proceedings of the International Speech Communication Association Conference (Interspeech) pp. 2589–2592.
  147. Roth, R., O. Rambow, N. Habash, M. Diab and C. Rudin. 2008. Arabic Morphological Tagging, Diacritization, and Lemmatization Using Lexeme Models and Feature Ranking. Proceedings of the Meeting of the Association for Computational Linguistics Short Papers. Columbus, Ohio.
  148. Sadat, F. and N. Habash. 2006. Combination of Arabic Preprocessing Schemes for Statistical Machine Translation. Proceedings of the International Conference on Computational Linguistics and Meeting of the Association for Computational Linguistics pp. 1–8. Sydney, Australia.
  149. Sarkar, A. 2001. Applying Co-training Methods to Statistical Parsing. Proceedings of the Meeting of the North American Chapter of the Association for Computational Linguistics.
  150. Shriberg, E., R. Dhillon, S. Bhagat, J. Ang and H. Carvey. 2004. The ICSI meeting recorder dialog act (MRDA) Corpus. Proceedings of the Special Interest Group on Discourse and Dialogue Conference.
  151. Smrž, O. 2007. Functional Arabic Morphology. Formal System and Implementation. Ph.D. thesis, Charles University in Prague, Prague, Czech Republic.
  152. Sproat, R. 1995. Lextools: Tools for finite-state linguistic analysis. Technical Report 11522-951108-10TM, Bell Laboratories.
  153. Sproat, R. and C. L. Shih. 1990. A Statistical Method for Finding Word Boundaries in Chinese Text. Computer Processing of Chinese and Oriental Languages 4(4):336–351.
  154. Sproat, R. and T. Emerson. 2003. The First International Chinese Word Segmentation Bakeoff. Proceedings of the Second SIGHAN Workshop on Chinese Language Processing. Sapporo, Japan.
  155. Sproat, R., C. Shih, W. Gale and N. Chang. 1996. A Stochastic Finite-State Word-segmentation Algorithm for Chinese. Computational Linguistics 22(3):377–404.
  156. Steedman, M., M. Osborne, A. Sarkar, S. Clark, R. Hwa, J. Hockenmaier, P. Ruhlen, S. Baker and J. Crim. 2003. Bootstrapping statistical parsers from small datasets. Proceedings of the Meeting of the European Chapter of the Association for Computational Linguistics.
  157. Stolcke, A. and E. Shriberg. 1996. Automatic Linguistic Segmentation of Conversational Speech. Proceedings of the International Conference on Spoken Language Processing pp. 1005–1008.
  158. Strassel, S. 2004. Linguistic Resources for Effective, Affordable, Reusable Speech-to-Text. Proceedings of the International Conference on Language Resources and Evaluation. Lisbon, Portugal.
  159. Strassel, S. and A. W. Cole. 2006. Corpus Development and Publication. Proceedings of the International Conference on Language Resources and Evaluation.
  160. Strassel, S. et al. 2006. Integrated Linguistic Resources for Language Exploitation Technologies. Proceedings of the International Conference on Language Resources and Evaluation.
  161. Stroppa, N. and A. Way. 2006. MATREX: DCU Machine Translation System for IWSLT 2006. Proceedings of the International Workshop on Spoken Language Translation pp. 31–36. Kyoto, Japan.
  162. Tomalin, M. and P. Woodland. 2006. Discriminatively Trained Gaussian Mixture Models for Sentence Boundary Detection. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing pp. 549–552.
  163. Tranter, S. and D. Reynolds. 2006. An Overview of Automatic Speaker Diarization Systems. IEEE Transactions on Audio, Speech, and Language Processing 14(5):1557–1565.
  164. Tseng, H., P.C. Chang, G. Andrew, D. Jurafsky and C. Manning. 2005. A Conditional Random Field Word Segmenter. Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing.
  165. Vilar, D., D. Stein, Y. Zhang, E. Matusov, A. Mauser, O. Bender, S. Mansour and H. Ney. 2008. The RWTH Machine Translation System for IWSLT 2008. Proceedings of the International Workshop on Spoken Language Translation pp. 108–115. Hawaii, USA.
  166. Wang, W. 2008. Weakly supervised training for parsing Mandarin broadcast transcripts. Proceedings of the International Speech Communication Association Conference (Interspeech). Brisbane, Australia.
  167. Wang, W., Z. Huang and M. P. Harper. 2007. Semisupervised Learning for Part-of-speech Tagging of Mandarin Transcribed Speech. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.
  168. Xia, F. 2000. The Segmentation Guidelines for the Penn Chinese Treebank (3.0). University of Pennsylvania Institute for Research in Cognitive Science Technical Report No. IRCS-00-06.
  169. Xue, N. 2003. Chinese Word Segmentation as Character Tagging. International Journal of Computational Linguistics and Chinese Language Processing 8(1):29–48.
  170. Xue, N. and L. Shen. 2003. Chinese Word Segmentation as LMR Tagging. Proceedings of the 2nd SIGHAN Workshop on Chinese Language Processing. Sapporo, Japan.
  171. Xue, N. and M. Palmer. 2009. Adding Semantic Roles to the Chinese Treebank. Natural Language Engineering 15(1):143–172.
  172. Xue, N., F. Xia, C. Chiou and M. Palmer. 2005. The Penn Chinese Treebank: Phrase Structure Annotation of a Large Corpus. Natural Language Engineering 11(2):207–238.
  173. Xue, N., F.D. Chiou and M. Palmer. 2002. Building a Large-scale Annotated Chinese Corpus. Proceedings of the Meeting of the Association for Computational Linguistics.
  174. Zens, R. and H. Ney. 2006. Discriminative Reordering Models for Statistical Machine Translation. Proceedings of the Human Language Technologies Conference/ Meeting of the North American Chapter of the Association for Computational Linguistics Workshop on Statistical Machine Translation pp. 55–63.
  175. Zhang, B. and J.G. Kahn. 2008. Evaluation of Decatur Text Normalizer for Language Model Training. Technical report, University of Washington.
  176. Zhao, H. and C. Kit. 2008. Unsupervised Segmentation Helps Supervised Learning of Character Tagging For Word Segmentation and Named Entity Recognition. Proceedings of the Sixth SIGHAN Workshop on Chinese Language Processing pp. 106–111. Hyderabad, India.
  177. Zhao, H., C.N. Huang, M. Li and B.L. Lu. 2006. Effective Tag Set Selection in Chinese Word Segmentation via Conditional Random Field Modeling. Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation pp. 87–94. Wuhan, China.
  178. Zhou, Q. 2003. Build a Large-scale Syntactically Annotated Chinese Corpus. Springer Lecture Notes in Computer Science 2807:106–113.
  179. Zhu, J. and E.H. Hovy. 2007. Active Learning for Word Sense Disambiguation with Methods for Addressing the Class Imbalance Problem. Proceedings of the Conference on Empirical Methods in Natural Language Processing.
  180. Zhu, J., H. Wang and E.H. Hovy. 2008. Multi-Criteria-based Strategy to Stop Active Learning for Data Annotation. Proceedings of the International Conference on Computational Linguistics.
  181. Zimmermann, M., D. Tur, J. Fung, N. Mirghafori, L. Gottlieb, E. Shriberg and Y. Liu. 2006. The ICSI+ Multilingual Sentence Segmentation System. Proceedings of the International Speech Communication Association Conference (Interspeech).

Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • Stephanie Strassel (1)
  • Caitlin Christianson (2)
  • John McCary (2)
  • William Staderman (2)
  • Joseph Olive (2)

  1. Linguistic Data Consortium, University of Pennsylvania, Philadelphia, USA
  2. Defense Advanced Research Projects Agency, Arlington, USA
