Automatically Training Form Classifiers

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8180)

Abstract

The state of the art in domain-specific Web form discovery relies on supervised methods that require substantial human effort to provide training examples, which limits their applicability in practice. This paper proposes an effective alternative that reduces this effort: obtaining high-quality domain-specific training forms automatically. In our approach, the only user input is the domain of interest; we use a search engine and a focused crawler to locate query forms, which are then fed as training data into supervised form classifiers. We validated the approach thoroughly using thousands of real Web forms from six domains, including a representative subset of a publicly available form base. The results show that it is feasible to mitigate the demanding manual work required by some state-of-the-art form discovery methods, at the cost of a negligible loss in effectiveness.
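The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the classifiers used, so a small multinomial Naive Bayes model is swapped in here, and the search engine plus focused crawler are replaced by a hypothetical stub (`fake_find_forms`) over made-up form texts. The domain names `airfare` and `books` are assumptions chosen only for illustration. The point shown is that each harvested form inherits its label from the domain used to locate it, so no manual labeling is needed.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stand-in for "search engine + focused crawler":
# maps a domain name to the visible text of forms found for it.
FAKE_WEB = {
    "airfare": [
        "departure city arrival city depart date return date passengers",
        "from airport to airport flight date cabin class",
    ],
    "books": [
        "title author isbn publisher keyword",
        "book title author name subject format",
    ],
}


def fake_find_forms(domain):
    """Return form texts for a domain (stub; a real system would crawl)."""
    return FAKE_WEB[domain]


def tokenize(form_text):
    """Lowercase the form text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", form_text.lower())


def harvest_forms(domains, find_forms):
    """Label every harvested form with the domain used to locate it."""
    return [(form, domain) for domain in domains for form in find_forms(domain)]


class NaiveBayesFormClassifier:
    """Multinomial Naive Bayes with add-one smoothing over form tokens."""

    def fit(self, labeled_forms):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter()
        self.vocab = set()
        for text, label in labeled_forms:
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.class_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        tokens = tokenize(text)
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(n / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# The only "user input" is the list of domains of interest.
training_set = harvest_forms(["airfare", "books"], fake_find_forms)
classifier = NaiveBayesFormClassifier().fit(training_set)
```

With this toy data, `classifier.predict("flight departure date passengers")` selects the `airfare` label. In a realistic setting the harvested forms would be noisy, which is where the trade-off in effectiveness mentioned in the abstract would show up.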

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Moraes, M.C., Heuser, C.A., Moreira, V.P., Barbosa, D. (2013). Automatically Training Form Classifiers. In: Lin, X., Manolopoulos, Y., Srivastava, D., Huang, G. (eds) Web Information Systems Engineering – WISE 2013. WISE 2013. Lecture Notes in Computer Science, vol 8180. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41230-1_37

  • DOI: https://doi.org/10.1007/978-3-642-41230-1_37

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-41229-5

  • Online ISBN: 978-3-642-41230-1

  • eBook Packages: Computer Science, Computer Science (R0)
