Abstract
The abundance of opinions on the Web is becoming a critical source of information in application areas such as business intelligence, market research, and online shopping. Unfortunately, because online content grows so rapidly, no single source offers a comprehensive set of opinions about a specific entity or topic, which severely limits access to such content. While previous work has focused on mining and summarizing online opinions, little work has explored the automatic collection of opinion content on the Web. In this paper, we propose a lightweight and practical approach to collecting opinion-containing pages, namely review pages, on the Web for arbitrary entities. We leverage existing Web search engines and use a novel information network called the FetchGraph to efficiently obtain review pages for entities of interest. Our experiments in three different domains show that our method is more effective than plain search engine results and that we are able to collect entity-specific review pages efficiently with reasonable precision and accuracy.
Cite this article
Ganesan, K., Zhai, C. OpinoFetch: a practical and efficient approach to collecting opinions on arbitrary entities. Inf Retrieval J 18, 530–558 (2015). https://doi.org/10.1007/s10791-015-9272-0
DOI: https://doi.org/10.1007/s10791-015-9272-0