A Discourse Search Engine Based on Rhetorical Structure Theory

  • Conference paper
Advances in Information Retrieval (ECIR 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9022)

Abstract

Representing a document as a bag of words and retrieving relevant documents by keyword has seen great success in large-scale information retrieval systems such as Web search engines. The bag-of-words representation is computationally efficient and, with proper term weighting and document ranking methods, performs surprisingly well for so simple a representation. However, it ignores the rich discourse structure of a document, which could provide useful clues when determining the relevance of a document to a given user query. We develop the first Discourse Search Engine (DSE), which exploits the discourse structure in documents to overcome the limitations of bag-of-words document representations in information retrieval. We use Rhetorical Structure Theory (RST) to represent a document as a discourse tree that connects elementary discourse units (EDUs) via discourse relations. Given a query, our discourse search engine retrieves not only documents relevant to the query, but also individual statements from those documents that stand in some discourse relation to the query. We propose several ranking scores that consider the discourse structure of the documents to measure the relevance of a pair of EDUs to a query. Moreover, we combine those individual relevance scores using a random decision forest (RDF) model to create a single relevance score. Despite the numerous challenges of constructing a rich document representation from the discourse relations in a document, our experimental results show that it improves the F-score in an information retrieval task. We publicly release our manually annotated test collection to expedite future research in discourse-based information retrieval.
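
To make the idea of retrieving pairs of EDUs concrete, the following is a minimal illustrative sketch and not the paper's implementation. It assumes the discourse tree has already been flattened into labelled EDU pairs, and it swaps in a toy term-overlap score for the ranking scores and random decision forest described in the abstract. The names `EDU`, `Relation`, `overlap_score`, and `search`, along with the sample sentences, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class EDU:
    """An elementary discourse unit: a short, clause-like span of text."""
    text: str


@dataclass(frozen=True)
class Relation:
    """A labelled discourse relation linking two EDUs (hypothetical structure)."""
    label: str
    nucleus: EDU
    satellite: EDU


def tokens(text: str) -> Set[str]:
    """Lowercase the text and strip simple punctuation from each word."""
    stripped = (word.strip(".,;:!?").lower() for word in text.split())
    return {word for word in stripped if word}


def overlap_score(query: str, relation: Relation) -> float:
    """Toy relevance score: fraction of query terms found in either EDU."""
    query_terms = tokens(query)
    if not query_terms:
        return 0.0
    pair_terms = tokens(relation.nucleus.text) | tokens(relation.satellite.text)
    return len(query_terms & pair_terms) / len(query_terms)


def search(query: str, relations: List[Relation],
           top_k: int = 3) -> List[Tuple[float, Relation]]:
    """Return up to top_k EDU pairs with a non-zero score, best first."""
    scored = [(overlap_score(query, rel), rel) for rel in relations]
    matches = [(score, rel) for score, rel in scored if score > 0]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return matches[:top_k]


# Invented example data, for illustration only.
relations = [
    Relation("CAUSE",
             EDU("The river flooded the town"),
             EDU("because heavy rain fell for days")),
    Relation("CONTRAST",
             EDU("The bridge survived"),
             EDU("but the road was destroyed")),
]

results = search("heavy rain flooded", relations)
for score, rel in results:
    print(f"{score:.2f} {rel.label}: {rel.nucleus.text} / {rel.satellite.text}")
```

Unlike a bag-of-words index over whole documents, the unit returned here is a related pair of statements together with its relation label, which is the kind of result the abstract describes. A learned model that combines several such scores would replace `overlap_score` in a fuller system.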

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Kuyten, P., Bollegala, D., Hollerit, B., Prendinger, H., Aizawa, K. (2015). A Discourse Search Engine Based on Rhetorical Structure Theory. In: Hanbury, A., Kazai, G., Rauber, A., Fuhr, N. (eds) Advances in Information Retrieval. ECIR 2015. Lecture Notes in Computer Science, vol 9022. Springer, Cham. https://doi.org/10.1007/978-3-319-16354-3_10

  • DOI: https://doi.org/10.1007/978-3-319-16354-3_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-16353-6

  • Online ISBN: 978-3-319-16354-3

  • eBook Packages: Computer Science, Computer Science (R0)
