Exploratory Ad-Hoc Analytics for Big Data

  • Julian Eberius
  • Maik Thiele
  • Wolfgang Lehner
Chapter

Abstract

In a traditional relational database management system, queries can only be defined over attributes defined in the schema, but they are guaranteed to give a single, definitive answer structured exactly as specified in the query. In contrast, an information retrieval system allows the user to pose queries without knowledge of a schema, but the result will be a top-k list of possible answers, with no guarantees about the structure or content of the retrieved documents. In this chapter, we present DrillBeyond, a novel IR/RDBMS hybrid system in which the user seamlessly queries a relational database together with a large corpus of tables extracted from a web crawl. The system allows full SQL queries over a relational database, but additionally enables the user to use arbitrary additional attributes in the query that need not be defined in the schema. The system then processes this semi-specified query by computing a top-k list of possible query evaluations, each based on different candidate web data sources, thus combining properties of both worlds, RDBMS and IR systems.
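
To make the idea of a semi-specified query concrete, the following is a minimal Python sketch and not the DrillBeyond implementation. It assumes a query such as `SELECT name FROM nation WHERE population > 50`, where `population` is not part of the schema. The names `Candidate` and `top_k_evaluations`, the relevance scores, and all data values are hypothetical toy assumptions. Each evaluation here draws on a single candidate source, which is a simplification of the chapter's description.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    source: str   # hypothetical label of a candidate web table
    score: float  # hypothetical relevance score, higher is better
    values: dict  # entity key -> value of the attribute missing from the schema


def top_k_evaluations(rows, key, attr, candidates, predicate, k):
    """Evaluate the same filter once per candidate source, best-scored first."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)[:k]
    evaluations = []
    for cand in ranked:
        # Augment: keep only rows this candidate covers and attach the open attribute.
        augmented = [
            {**row, attr: cand.values[row[key]]}
            for row in rows
            if row[key] in cand.values
        ]
        # Evaluate the query predicate on the augmented rows.
        evaluations.append((cand.source, [r for r in augmented if predicate(r)]))
    return evaluations


# Toy local relation; "population" is deliberately absent from it.
nation = [
    {"name": "A", "revenue": 10},
    {"name": "B", "revenue": 20},
    {"name": "C", "revenue": 30},
]

# Toy candidate sources that disagree on values and on coverage.
candidates = [
    Candidate("web_table_1", 0.9, {"A": 40.0, "B": 80.0, "C": 60.0}),
    Candidate("web_table_2", 0.7, {"A": 55.0, "B": 82.0}),
    Candidate("web_table_3", 0.2, {"C": 1.0}),
]

results = top_k_evaluations(
    nation, "name", "population", candidates,
    predicate=lambda r: r["population"] > 50, k=2,
)
for source, rows in results:
    print(source, [r["name"] for r in rows])
# prints:
# web_table_1 ['B', 'C']
# web_table_2 ['A', 'B']
```

The two evaluations return different answers to the same query because they rest on different sources, which is why the result is a ranked list of alternatives and not one definitive relation.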

Keywords

Query Processing, Information Retrieval System, Augmentation System, Keyword Query, Query Plan
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Faculty of Computer Science, Database Technology Group, Technische Universität Dresden, Dresden, Germany
