Distributed and Parallel Databases, Volume 32, Issue 3, pp 405–446

Scalable entity-based summarization of web search results using MapReduce

  • Ioannis Kitsos
  • Kostas Magoutis
  • Yannis Tzitzikas


Although Web search engines index and provide access to huge numbers of documents, user queries typically return only a linear list of hits. While this is often satisfactory for focused search, it does not support exploration or deeper analysis of the results. One way to provide advanced exploration facilities, exploiting the availability of structured (and semantic) data in Web search, is to enrich the results with entity mining over their full contents. Such services give users an initial overview of the information space, allowing them to restrict it gradually until they locate the desired hits, even if these are ranked low. This is especially important in areas of professional search such as medical search and patent search. In this paper we consider a general scenario of providing such services as meta-services (that is, layered over systems that support keyword search) without a priori indexing of the underlying document collection(s). To make such services feasible for large amounts of data, we use the MapReduce distributed computation model on a Cloud infrastructure (Amazon EC2). Specifically, we show how the required computational tasks can be factorized and expressed as MapReduce functions. A key contribution of our work is a thorough evaluation of platform configuration and tuning, an aspect that is often disregarded or inadequately addressed in prior work but is crucial for the efficient utilization of resources. Finally, we report experimental results on the speedup achieved in various settings.
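To make the idea of expressing entity mining as MapReduce functions more concrete, the following is a minimal single-process sketch in Python. It is not the decomposition used in the paper, which this abstract does not spell out. The names `TOY_GAZETTEER`, `map_hit`, `reduce_entity` and `run_mapreduce` are hypothetical, and the dictionary lookup only stands in for a real named-entity recognizer. The map step emits one record per entity found in a hit, and the reduce step groups the hits by entity.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for a real named-entity recognizer:
# a tiny lookup table from surface form to entity category.
TOY_GAZETTEER = {
    "Crete": "Location",
    "Amazon": "Organization",
    "Hadoop": "Software",
}


def map_hit(doc_id, text):
    """Map: emit ((category, entity), doc_id) once per entity found in a hit."""
    for token in set(re.findall(r"[A-Za-z]+", text)):
        category = TOY_GAZETTEER.get(token)
        if category is not None:
            yield (category, token), doc_id


def reduce_entity(key, doc_ids):
    """Reduce: collect the distinct hits in which one entity occurs."""
    category, entity = key
    ids = sorted(set(doc_ids))
    return category, entity, len(ids), ids


def run_mapreduce(hits):
    """Simulate the shuffle phase locally by grouping map output by key."""
    groups = defaultdict(list)
    for doc_id, text in hits:
        for key, value in map_hit(doc_id, text):
            groups[key].append(value)
    return [reduce_entity(key, values) for key, values in sorted(groups.items())]


if __name__ == "__main__":
    hits = [
        (1, "Hadoop runs on Amazon EC2"),
        (2, "Amazon and Crete"),
        (3, "nothing here"),
    ]
    for row in run_mapreduce(hits):
        print(row)
```

Because each hit is mapped independently and each entity is reduced independently, both phases could in principle be spread across many workers, which is the property a framework such as Hadoop exploits.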


  • Text data analytics through summaries and synopses
  • Interactive data analysis through queryable summaries and indices
  • Information retrieval and named entity mining
  • MapReduce
  • Cloud computing



Many thanks to Carlo Allocca and to Pavlos Fafalios for their contributions. We gratefully acknowledge the support of the iMarine (FP7 Research Infrastructures, 2011–2014) and PaaSage (FP7 Integrated Project 317715, 2012–2016) EU projects and of Amazon Web Services through an Education Grant. We also acknowledge the interesting discussions we had in the context of the MUMIA COST action (IC1002, 2010–2014).



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Ioannis Kitsos (1, 2)
  • Kostas Magoutis (1, 2)
  • Yannis Tzitzikas (1, 2)

  1. Institute of Computer Science, FORTH-ICS, Crete, Greece
  2. Computer Science Department, University of Crete, Heraklion, Greece
