Researchers have begun to investigate the use of statistical and machine learning methods for question answering. These techniques require training data, usually in the form of question/answer sets. In this chapter, we describe a reverse-engineering procedure that can be used to generate question/answer sets automatically from ordinary text corpora. Our technique identifies sentences that are good candidates for question/answer extraction, extracts the portions of the sentence corresponding to the question and the answer, and then transforms the information into an actual question and answer. Using this procedure, a collection of questions and answers can be automatically generated from any text corpus. One key benefit of this automatic procedure is that question/answer sets can be easily generated from domain-specific corpora, creating training data which could be used to build a Q/A system tailored for a specific domain.
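The three steps named above (select a candidate sentence, extract the question and answer portions, transform them into a question/answer pair) can be illustrated with a toy sketch. The abstract does not say which patterns or transformations the authors use, so everything here is an assumption made for illustration: the appositive pattern `APPOSITIVE`, the "Who is ...?" question template, and the function names `generate_qa` and `generate_qa_set` are hypothetical and are not the chapter's method.

```python
import re

# Assumed candidate pattern (not from the chapter):
# "<Proper Name>, the <role> of <Entity>, ..."
APPOSITIVE = re.compile(
    r"(?P<answer>[A-Z][a-z]+(?: [A-Z][a-z]+)+), "
    r"the (?P<role>[a-z ]+?) of (?P<entity>[A-Z][\w ]*?),"
)


def generate_qa(sentence):
    """Return a (question, answer) pair, or None if the sentence is not a candidate."""
    # Step 1: candidate identification (does the sentence match the pattern?)
    match = APPOSITIVE.search(sentence)
    if match is None:
        return None
    # Step 2: extraction of the question portion (role, entity) and the answer portion
    role, entity, answer = match["role"], match["entity"], match["answer"]
    # Step 3: transformation into an actual question (assumed template)
    question = f"Who is the {role} of {entity}?"
    return question, answer


def generate_qa_set(corpus):
    """Apply generate_qa to every sentence and keep only the successful pairs."""
    pairs = []
    for sentence in corpus:
        pair = generate_qa(sentence)
        if pair is not None:
            pairs.append(pair)
    return pairs


if __name__ == "__main__":
    corpus = [
        "Jane Smith, the president of Acme Corp, announced the merger.",
        "The weather was mild.",
    ]
    for question, answer in generate_qa_set(corpus):
        print(question, "->", answer)
```

Because the pattern is applied to raw sentences, the same loop could in principle be run over a domain-specific corpus to collect domain-specific pairs, which is the benefit the abstract highlights. A realistic system would need far richer patterns than this single regular expression.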
© 2008 Springer
About this chapter
Cite this chapter
Riloff, E., Mann, G.S., Phillips, W. (2008). Reverse-Engineering Question/Answer Collections From Ordinary Text. In: Strzalkowski, T., Harabagiu, S.M. (eds) Advances in Open Domain Question Answering. Text, Speech and Language Technology, vol 32. Springer, Dordrecht. https://doi.org/10.1007/978-1-4020-4746-6_17
DOI: https://doi.org/10.1007/978-1-4020-4746-6_17
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-4744-2
Online ISBN: 978-1-4020-4746-6
eBook Packages: Humanities, Social Sciences and Law; Social Sciences (R0)