The GoOLAP Fact Retrieval Framework

Löser, Alexander; Arnold, Sebastian; Fiehn, Tillmann

doi:10.1007/978-3-642-27358-2_4



Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 96)

Summary

We discuss the novel problem of supporting analytical business intelligence queries over web-based textual content, e.g., BI-style reports based on hundreds of thousands of documents from an ad-hoc web search result. Neither conventional search engines nor conventional business intelligence and ETL tools address this problem, which lies at the intersection of their capabilities. Three recent developments have the potential to become key components of such an ad-hoc analysis platform: significant improvements in cloud computing query languages, advances in self-supervised keyword generation techniques, and powerful fact extraction frameworks. We give an informative and practical look at the underlying research challenges in supporting "Web-Scale Business Analytics" applications that we encountered when building GoOLAP, a system that already enjoys a broad user base and holds over 6 million objects and facts.



Author information

Authors and Affiliations

FG DIMA, Technische Universität Berlin, Berlin, Germany
Alexander Löser, Sebastian Arnold & Tillmann Fiehn


Editor information

Editors and Affiliations

MAS Laboratory, Ecole Centrale Paris, 92 295, Châtenay-Malabry Cedex, France
Marie-Aude Aufaure
Department of Computer & Decision Engineering (CoDE) CP 165/15, Universite Libre de Bruxelles, 1050, Brussels, Belgium
Esteban Zimányi


About this chapter

Cite this chapter

Löser, A., Arnold, S., Fiehn, T. (2012). The GoOLAP Fact Retrieval Framework. In: Aufaure, MA., Zimányi, E. (eds) Business Intelligence. eBISS 2011. Lecture Notes in Business Information Processing, vol 96. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27358-2_4


DOI: https://doi.org/10.1007/978-3-642-27358-2_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-27357-5
Online ISBN: 978-3-642-27358-2
eBook Packages: Computer Science, Computer Science (R0)
