Deep Questions in the “Deep or Hidden” Web

Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 236)

Abstract

The Hidden Web is the part of the Web that consists mainly of information inside databases, i.e., anything behind an interactive electronic form (search interface) that conventional Web crawlers cannot access [1, 2, 8]. However, well-defined, effective, and efficient methods exist for accessing Deep Web content. One of these methods employs an approach similar to ‘traditional’ crawling but aims at extracting the data that resides in databases behind search interfaces or forms. This paper gives insight into the various steps a crawler must perform to access the content of the Hidden Web. We structure the problem area and analyze which aspects have already been covered by previous research and what remains to be done.

Keywords

WWW, Hidden web, Surface web, Hidden web crawler

References

  1. Bergman, M.K.: The deep web: Surfacing hidden value. J. Electron. Publ. 7(1), 1174–1175 (2001)
  2. Sherman, C., Price, G.: The Invisible Web: Uncovering Information Sources Search Engines Can’t See. CyberAge Books, Medford (2001)
  3. Raghavan, S., Garcia-Molina, H.: Crawling the hidden web. In: 27th International Conference on Very Large Data Bases (Rome, Italy, September 11–14: VLDB’01), pp. 129–138. Morgan Kaufmann Publishers Inc., San Francisco (2001)
  4. Ntoulas, A., Zerfos, P., Cho, J.: Downloading textual hidden web content through keyword queries. In: 5th ACM/IEEE Joint Conference on Digital Libraries (Denver, USA, Jun 2005) JCDL05, pp. 100–109 (2005)
  5. Barbosa, L., Freire, J.: Siphoning hidden-web data through keyword-based interfaces. In: SBBD, 2004, Brasilia, Brazil, pp. 309–321 (2004)
  6. Liddle, S.W., Embley, D.W., Scott, D.T., Yau, S.H.: Extracting data behind web forms. In: 28th VLDB Conference 2002, Hong Kong, China, pp. 38–49 (2002)
  7. Chang, K.C.-C., He, B., Li, C., Patel, M., Zhang, Z.: Structured databases on the web: Observations and implications. SIGMOD Rec. 33(3), 61–70 (2004)
  8. Gupta, S., Bhatia, K.: Exploring ‘hidden’ parts of the web: The hidden web. In: Lecture Notes in Electrical Engineering, Proceedings of the International Conference ArtCom 2012, pp. 508–515. Springer, Heidelberg (2012)
  9. Gupta, S., Bhatia, K.: A system’s approach towards domain identification of web pages. In: Proceedings of the Second IEEE International Conference on Parallel, Distributed and Grid Computing (India, December 6–8, 2012) PDGC’12, IEEE Xplore
  10. Lawrence, S., Giles, C.L.: Accessibility of information on the web. Nature 400, 107–109 (1999)
  11. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval, 2nd edn. Addison-Wesley-Longman, Boston (1999)
  12. Ipeirotis, P.G., Gravano, L., Sahami, M.: Probe, count, and classify: Categorizing hidden-web databases. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 67–78, Santa Barbara, CA, USA, May (2001)
  13. Wang, W., Meng, W., Yu, C.: Concept hierarchy based text database categorization. In: Proceedings of International WISE Conference, pp. 283–290, China, June (2000)
  14. Chakrabarti, S., van den Berg, M., Dom, B.: Focused crawling: A new approach to topic-specific web resource discovery. In: Proceedings of the 8th International WWW Conference (1999)
  15. Zhang, Z., He, B., Chang, K.C.-C.: Light-weight domain-based form assistant: Querying web databases on the fly. In: Proceedings of the 31st Very Large Data Bases Conference (2005)
  16. McCallum, A., Nigam, K., Rennie, J., Seymore, K.: Building domain-specific search engines with machine learning techniques. In: Proceedings of the AAAI Spring Symposium on Intelligent Agents in Cyberspace (1999)
  17. Chakrabarti, S., Punera, K., Subramanyam, M.: Accelerated focused crawling through online relevance feedback. In: Proceedings of WWW, pp. 148–159 (2002)
  18. Rennie, J., McCallum, A.: Using reinforcement learning to spider the web efficiently. In: Proceedings of ICML, pp. 335–343 (1999)
  19. Diligenti, M., Coetzee, F., Lawrence, S., Giles, C.L., Gori, M.: Focused crawling using context graphs. In: Proceedings of the 26th International Conference on Very Large Databases, pp. 527–534 (2000)
  20. Profusion’s search engine directory. http://www.profusion.com/nav

Copyright information

© Springer India 2014

Authors and Affiliations

  1. Department of Computer Engineering, YMCA University of Science and Technology, Faridabad, India
