The GoOLAP Fact Retrieval Framework

  • Chapter

Part of the book series: Lecture Notes in Business Information Processing ((LNBIP,volume 96))

Summary

We discuss the novel problem of supporting analytical business intelligence queries over web-based textual content, e.g., BI-style reports based on hundreds of thousands of documents from an ad-hoc web search result. Neither conventional search engines nor conventional business intelligence and ETL tools address this problem, which lies at the intersection of their capabilities. Three recent developments have the potential to become key components of such an ad-hoc analysis platform: significant improvements in cloud computing query languages, advances in self-supervised keyword generation techniques, and powerful fact extraction frameworks. We give an informative and practical look at the research challenges in supporting "Web-Scale Business Analytics" applications that we encountered when building GoOLAP, a system that already enjoys a broad user base and holds over 6 million objects and facts.

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Löser, A., Arnold, S., Fiehn, T. (2012). The GoOLAP Fact Retrieval Framework. In: Aufaure, MA., Zimányi, E. (eds) Business Intelligence. eBISS 2011. Lecture Notes in Business Information Processing, vol 96. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27358-2_4

  • DOI: https://doi.org/10.1007/978-3-642-27358-2_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-27357-5

  • Online ISBN: 978-3-642-27358-2

  • eBook Packages: Computer Science, Computer Science (R0)
