Web Context Analysis Based on Generic Ontology

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 391)

Abstract

Popular search engines currently do not take the context of a search word into account, so the returned list of web pages contains a large number of irrelevant pages. A word reflects its true semantics only in context. This paper proposes, for the first time, the concepts of “contextual word”, “web page parsing with contextual information”, and “context representation”. Based on a general ontology, we use word sense disambiguation techniques to determine the context of each word, so that a web page is parsed at the level of word sense and sentence semantics according to the background in which the word appears on the page. First, we transform the web page into a DOM tree, parse it with traditional methods to remove noise and extract the main body, and then use real-time search technology to obtain the last-modified time of the page. Second, we perform lexical analysis on the body of the page: based on the general ontology and natural language processing techniques, we mark the words or terms and obtain the interpretation of each that corresponds to its context. Third, we use named entity recognition to obtain the time and location information in the page, and then organize the information obtained into a structure that we call the web context representation. On this theoretical basis, the authors implement a complete web page contextual parsing tool, JLUCAS. In a large number of comparative experiments, JLUCAS achieves excellent results in every aspect. This demonstrates that the theory and algorithms proposed in this paper can solve the problem of automatic contextual parsing of web pages, and lays a good foundation for ultimately realizing a contextual web search engine.
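The three steps in the abstract can be illustrated with a minimal Python sketch. This is not the JLUCAS implementation, which the abstract does not detail. Every name here (`BodyTextExtractor`, `TOY_SENSES`, `disambiguate`, `WebContext`, `analyze`) is hypothetical. The two-sense gloss table stands in for a general ontology, a simplified Lesk-style gloss overlap stands in for the paper's word sense disambiguation, and the noise filter is a crude tag blacklist rather than true DOM-tree analysis. Named entity recognition is not implemented, so the time and location fields stay empty.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Optional

# Assumption: elements with these tags are treated as page noise.
NOISE_TAGS = {"script", "style", "nav", "footer", "aside"}

# Assumption: a toy sense inventory standing in for a general ontology.
TOY_SENSES = {
    "bank": {
        "bank.finance": "institution that accepts money deposits and makes loans",
        "bank.river": "sloping land beside a river or stream of water",
    }
}


class BodyTextExtractor(HTMLParser):
    """Collects text that lies outside noise elements."""

    def __init__(self):
        super().__init__()
        self._noise_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._noise_depth > 0:
            self._noise_depth -= 1

    def handle_data(self, data):
        if self._noise_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_body(html):
    """Step 1 (simplified): strip noise and return the main text."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def tokenize(text):
    """Step 2a: naive lexical analysis (lowercase, strip punctuation)."""
    stripped = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [w for w in stripped if w]


def disambiguate(word, context_tokens, senses=TOY_SENSES):
    """Step 2b: pick the sense whose gloss shares the most words with the context."""
    candidates = senses.get(word)
    if not candidates:
        return None
    context = set(context_tokens)
    # sorted() makes tie-breaking deterministic (alphabetical by sense id).
    return max(sorted(candidates),
               key=lambda s: len(context & set(candidates[s].split())))


@dataclass
class WebContext:
    """Step 3: a hypothetical container for the web context representation."""
    body: str
    senses: dict = field(default_factory=dict)
    last_modified: Optional[str] = None  # e.g. taken from an HTTP header
    times: list = field(default_factory=list)      # would be filled by NER
    locations: list = field(default_factory=list)  # would be filled by NER


def analyze(html, last_modified=None):
    body = extract_body(html)
    tokens = tokenize(body)
    senses = {}
    for tok in tokens:
        sense = disambiguate(tok, tokens)
        if sense:
            senses[tok] = sense
    return WebContext(body=body, senses=senses, last_modified=last_modified)
```

Calling `analyze` on a page whose main text mentions a river bank, with the word "bank" also appearing inside a `<nav>` element, would keep only the main text and resolve "bank" from the surrounding words. A real system would replace the toy gloss table with an ontology lookup and exclude stop words from the overlap count.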

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhu, L., Zuo, W., He, F., Han, J., Lu, J. (2013). Web Context Analysis Based on Generic Ontology. In: Yang, Y., Ma, M., Liu, B. (eds) Information Computing and Applications. ICICA 2013. Communications in Computer and Information Science, vol 391. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-53932-9_4

  • DOI: https://doi.org/10.1007/978-3-642-53932-9_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-53931-2

  • Online ISBN: 978-3-642-53932-9

  • eBook Packages: Computer Science, Computer Science (R0)