Abstract
Representing a document as a bag of words and using keywords to retrieve relevant documents has seen great success in large-scale information retrieval systems such as Web search engines. The bag-of-words representation is computationally efficient and, with proper term weighting and document ranking methods, can perform surprisingly well for such a simple document representation. However, it ignores the rich discourse structure of a document, which could provide useful clues when determining the relevance of a document to a given user query. We develop the first-ever Discourse Search Engine (DSE), which exploits the discourse structure in documents to overcome the limitations of bag-of-words document representations in information retrieval. We use Rhetorical Structure Theory (RST) to represent a document as a discourse tree that connects numerous elementary discourse units (EDUs) via discourse relations. Given a query, our discourse search engine can retrieve not only documents relevant to the query, but also individual statements from those documents that describe some discourse relations to the query. We propose several ranking scores that use the discourse structure of the documents to measure the relevance of a pair of EDUs to a query. Moreover, we combine those individual relevance scores into a single relevance score using a random decision forest (RDF) model. Despite the numerous challenges of constructing a rich document representation from the discourse relations in a document, our experimental results show that it improves the F-score in an information retrieval task. We publicly release our manually annotated test collection to expedite future research in discourse-based information retrieval.
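To make the retrieval unit concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a document has already been parsed into pairs of EDUs joined by a discourse relation, and it ranks those pairs with a single toy term-overlap score. The names `RelationInstance`, `overlap_score`, and `search`, the relation labels, and the example sentences are all hypothetical. The abstract does not define the actual ranking scores, and the random decision forest that combines them is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationInstance:
    """Two EDUs joined by a discourse relation (one edge of a discourse tree)."""
    doc_id: str
    relation: str   # illustrative label, e.g. "Cause" or "Contrast"
    nucleus: str    # text of the more central EDU
    satellite: str  # text of the supporting EDU


def tokens(text):
    """Lowercase whitespace tokenisation; a placeholder for real preprocessing."""
    return set(text.lower().split())


def overlap_score(query, pair):
    """Hypothetical score: fraction of query terms found in either EDU of the pair."""
    q = tokens(query)
    if not q:
        return 0.0
    covered = q & (tokens(pair.nucleus) | tokens(pair.satellite))
    return len(covered) / len(q)


def search(query, index, top_k=3):
    """Rank EDU pairs by the toy score and drop pairs with no query overlap."""
    scored = [(overlap_score(query, p), p) for p in index]
    scored = [(s, p) for s, p in scored if s > 0]
    scored.sort(key=lambda sp: sp[0], reverse=True)
    return scored[:top_k]


# Hand-written stand-ins for the output of an RST discourse parser.
index = [
    RelationInstance("d1", "Cause", "the river flooded the town",
                     "because heavy rain fell overnight"),
    RelationInstance("d2", "Contrast", "the engine is efficient",
                     "but it is expensive to build"),
]

results = search("heavy rain", index)
for score, pair in results:
    print(pair.doc_id, pair.relation, round(score, 2))
```

In a fuller system, several such scores per EDU pair would form a feature vector, and a learned model such as a random forest would map that vector to one relevance score, as the abstract describes.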
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Kuyten, P., Bollegala, D., Hollerit, B., Prendinger, H., Aizawa, K. (2015). A Discourse Search Engine Based on Rhetorical Structure Theory. In: Hanbury, A., Kazai, G., Rauber, A., Fuhr, N. (eds) Advances in Information Retrieval. ECIR 2015. Lecture Notes in Computer Science, vol 9022. Springer, Cham. https://doi.org/10.1007/978-3-319-16354-3_10
DOI: https://doi.org/10.1007/978-3-319-16354-3_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-16353-6
Online ISBN: 978-3-319-16354-3
eBook Packages: Computer Science, Computer Science (R0)