Abstract
State-of-the-art methods for domain-specific Web form discovery are supervised and require substantial human effort to provide training examples, which limits their applicability in practice. This paper proposes an effective way to reduce that effort: obtaining high-quality domain-specific training forms automatically. In our approach, the only user input is the domain of interest; we use a search engine and a focused crawler to locate query forms, which are then fed as training data into supervised form classifiers. We tested this approach thoroughly on thousands of real Web forms from six domains, including a representative subset of a publicly available form base. The results show that it is feasible to mitigate the demanding manual work required by some state-of-the-art form discovery methods, at the cost of a negligible loss in effectiveness.
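As a rough illustration of the harvesting step described in the abstract, the sketch below turns already-fetched pages into labeled training examples for a given domain. It is not the authors' implementation: the names `FormExtractor`, `looks_like_query_form`, and `harvest_training_forms` are hypothetical, the filter that separates query forms from login forms is an assumed heuristic, and the search-engine and focused-crawler stages are replaced by a plain list of HTML strings.

```python
from html.parser import HTMLParser


class FormExtractor(HTMLParser):
    """Collects, for each <form>, its field types and visible text tokens."""

    def __init__(self):
        super().__init__()
        self.forms = []       # forms whose closing tag has been seen
        self._current = None  # form being parsed, or None outside a form

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"field_types": [], "tokens": []}
        elif self._current is not None:
            if tag == "input":
                # An <input> without a type attribute defaults to a text field.
                field_type = (attrs.get("type") or "text").lower()
                self._current["field_types"].append(field_type)
            elif tag in ("select", "textarea"):
                self._current["field_types"].append(tag)

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["tokens"].extend(data.lower().split())


def looks_like_query_form(form):
    """Assumed heuristic: a fillable field is present and no password field."""
    types = form["field_types"]
    fillable = any(t in ("text", "search", "select", "textarea") for t in types)
    return fillable and "password" not in types


def harvest_training_forms(pages, domain):
    """Turn fetched pages into (tokens, label) training examples for `domain`."""
    examples = []
    for html in pages:
        parser = FormExtractor()
        parser.feed(html)
        parser.close()
        for form in parser.forms:
            if looks_like_query_form(form):
                examples.append((form["tokens"], domain))
    return examples
```

Calling `harvest_training_forms(pages, "autos")` on a list of HTML strings returns token lists labeled with the domain, which could then be passed to any supervised form classifier. In a real pipeline the pages would come from the search engine and focused crawler, and the filtering step would likely be more selective than this heuristic.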
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Moraes, M.C., Heuser, C.A., Moreira, V.P., Barbosa, D. (2013). Automatically Training Form Classifiers. In: Lin, X., Manolopoulos, Y., Srivastava, D., Huang, G. (eds) Web Information Systems Engineering – WISE 2013. WISE 2013. Lecture Notes in Computer Science, vol 8180. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41230-1_37
DOI: https://doi.org/10.1007/978-3-642-41230-1_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41229-5
Online ISBN: 978-3-642-41230-1
eBook Packages: Computer Science, Computer Science (R0)