Automatically Training Form Classifiers

Moraes, Mauricio C.; Heuser, Carlos A.; Moreira, Viviane P.; Barbosa, Denilson

doi:10.1007/978-3-642-41230-1_37

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8180)

Abstract

The state of the art in domain-specific Web form discovery relies on supervised methods that require substantial human effort to provide training examples, which limits their applicability in practice. This paper proposes an effective alternative that reduces the human effort: automatically obtaining high-quality domain-specific training forms. In our approach, the only user input is the domain of interest; we use a search engine and a focused crawler to locate query forms, which are fed as training data into supervised form classifiers. We tested this approach thoroughly on thousands of real Web forms from six domains, including a representative subset of a publicly available form base. The results reported in this paper show that it is feasible to mitigate the demanding manual work required by some state-of-the-art form discovery methods, at the cost of a negligible loss in effectiveness.
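
The step of turning located pages into training data can be illustrated with a minimal sketch. It is not the authors' implementation, which this page does not describe. It assumes the pages for each domain have already been fetched by some search engine and crawler, and it treats every form found on them as a training example labeled with that domain. The names FormTextExtractor, tokenize, and build_training_set are hypothetical, as are the sample pages.

```python
from html.parser import HTMLParser
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class FormTextExtractor(HTMLParser):
    """Collects visible text and field names for each <form> in a page."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one token list per form found
        self._current = None   # tokens of the form being parsed, or None

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self._current = []
        elif self._current is not None and tag in ("input", "select", "textarea"):
            name = dict(attrs).get("name")
            if name:
                self._current.extend(tokenize(name))

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None

    def handle_data(self, data):
        # Text outside any form is ignored.
        if self._current is not None:
            self._current.extend(tokenize(data))


def build_training_set(pages_by_domain):
    """Label every non-empty form with the domain its page was retrieved for.

    Assumption: pages_by_domain maps a domain name to HTML strings that a
    search engine or crawler (not shown) returned for that domain.
    """
    examples = []
    for domain, pages in pages_by_domain.items():
        for html in pages:
            parser = FormTextExtractor()
            parser.feed(html)
            parser.close()
            examples.extend((tokens, domain) for tokens in parser.forms if tokens)
    return examples


# Hypothetical sample pages standing in for crawler output.
pages = {
    "autos": ['<form action="/search">Make <input name="make"> '
              'Model <input name="model"><input type="submit"></form>'],
    "books": ['<p>No form here</p><form>Title <input name="title"> '
              'Author <select name="author"><option>Any</option></select></form>'],
}
training = build_training_set(pages)
for tokens, domain in training:
    print(domain, tokens)
```

The resulting (tokens, domain) pairs could then be passed to any supervised text classifier. Labels produced this way are noisy, since a page retrieved for a domain may contain unrelated forms such as site search or login boxes, so a real pipeline would likely need a filtering step.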

Author information

Authors and Affiliations

Instituto de Informática, Universidade Federal do Rio Grande do Sul, Brazil
Mauricio C. Moraes, Carlos A. Heuser & Viviane P. Moreira
Department of Computing Science, University of Alberta, Canada
Denilson Barbosa

Editor information

Editors and Affiliations

The University of New South Wales, Sydney, NSW, Australia
Xuemin Lin
Aristotle University of Thessaloniki, Thessaloniki, Greece
Yannis Manolopoulos
AT&T Labs-Research, Florham Park, NJ, USA
Divesh Srivastava
Victoria University, Melbourne, Australia
Guangyan Huang

About this paper

Cite this paper

Moraes, M.C., Heuser, C.A., Moreira, V.P., Barbosa, D. (2013). Automatically Training Form Classifiers. In: Lin, X., Manolopoulos, Y., Srivastava, D., Huang, G. (eds) Web Information Systems Engineering – WISE 2013. WISE 2013. Lecture Notes in Computer Science, vol 8180. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41230-1_37

DOI: https://doi.org/10.1007/978-3-642-41230-1_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41229-5
Online ISBN: 978-3-642-41230-1
eBook Packages: Computer Science, Computer Science (R0)
