Abstract
We present a general framework for extracting specific information “on demand” from a large corpus such as the Web under resource constraints. Given a database with missing or uncertain information, the proposed system automatically formulates queries, issues them to a search interface, selects a subset of the returned documents, extracts the required information from them, and fills in the missing values in the original database. We also exploit dependencies inherent in the data to obtain useful information with fewer computational resources. We build such a system in the citation database domain, where it uses limited resources to extract missing publication years from the Web. We discuss a probabilistic approach to this task and present first results. The main contribution of this paper is a general, comprehensive architecture for designing such a system that is adaptable to different domains.
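The pipeline described in the abstract (formulate a query, issue it, select documents, extract, fill) can be illustrated with a minimal sketch. This is not the authors' system: the function names, the `search` callable, the query budget, and the regular expression are all hypothetical, and a simple majority vote over extracted years stands in for the probabilistic approach the paper discusses. It also does not model the data dependencies the paper exploits.

```python
import re
from collections import Counter

# Assumption: a publication year is any four-digit token starting with 19 or 20.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def formulate_query(record):
    """Build a search query from the fields that are already known."""
    return f'"{record["title"]}" {record.get("author", "")}'.strip()


def extract_years(text):
    """Return every year-like token found in a document."""
    return YEAR_RE.findall(text)


def fill_missing_years(records, search, query_budget, docs_per_query=2):
    """Fill missing 'year' fields, issuing at most `query_budget` queries.

    `search` is a hypothetical callable mapping a query string to a list of
    document texts; a real system would wrap a search interface here.
    Returns the number of queries actually issued.
    """
    queries_used = 0
    for record in records:
        if record.get("year") is not None:
            continue  # nothing to acquire for this record
        if queries_used >= query_budget:
            break  # resource bound reached
        documents = search(formulate_query(record))
        queries_used += 1
        votes = Counter()
        # Select only a subset of the returned documents to process.
        for doc in documents[:docs_per_query]:
            votes.update(extract_years(doc))
        if votes:
            # Majority vote: a stand-in for a probabilistic model.
            record["year"] = int(votes.most_common(1)[0][0])
    return queries_used
```

With a budget smaller than the number of incomplete records, only the first records reached are filled and the rest stay empty. A fuller system would presumably choose which records and queries to spend the budget on rather than processing them in order.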
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kanani, P., McCallum, A., Hu, S. (2010). Resource-Bounded Information Extraction: Acquiring Missing Feature Values on Demand. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2010. Lecture Notes in Computer Science, vol 6118. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13657-3_45
DOI: https://doi.org/10.1007/978-3-642-13657-3_45
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13656-6
Online ISBN: 978-3-642-13657-3
eBook Packages: Computer Science, Computer Science (R0)