The GoOLAP Fact Retrieval Framework

  • Alexander Löser
  • Sebastian Arnold
  • Tillmann Fiehn
Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 96)

Summary

We discuss the novel problem of supporting analytical business intelligence queries over web-based textual content, e.g., BI-style reports based on hundreds of thousands of documents from an ad-hoc web search result. Neither conventional search engines nor conventional Business Intelligence and ETL tools address this problem, which lies at the intersection of their capabilities. Three recent developments have the potential to become key components of such an ad-hoc analysis platform: significant improvements in cloud computing query languages, advances in self-supervised keyword generation techniques, and powerful fact extraction frameworks. We give an informative and practical look at the underlying research challenges in supporting "Web-Scale Business Analytics" applications that we encountered when building GoOLAP, a system that already enjoys a broad user base and holds over 6 million objects and facts.

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Alexander Löser (1)
  • Sebastian Arnold (1)
  • Tillmann Fiehn (1)
  1. FG DIMA, Technische Universität Berlin, Berlin, Germany
