Web Context Analysis Based on Generic Ontology

Zhu, Liang; Zuo, Wanli; He, Fengling; Han, Jiayu; Lu, Jingya

doi:10.1007/978-3-642-53932-9_4

Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 391)

Abstract

Popular search engines currently do not take the context of a search word into account, so the returned list of web pages contains a large number of irrelevant pages. A word reflects its true semantics only in context. This paper proposes, for the first time, the concepts of “contextual word”, “web page parsing with contextual information”, and “context representation”. Based on a general ontology, we use word sense disambiguation techniques to determine the context of each word, which allows a web page to be parsed at the level of word sense and sentence semantics according to the background of the word in that page. First, we transform the web page into a DOM tree, parse it with traditional methods to remove noise and extract the main body, and then use real-time search technology to obtain the page's last-modified time. Second, we perform lexical analysis on the body of the page: using the general ontology and natural language processing techniques, we mark words and terms and obtain the interpretation of each that fits its context. Third, we use named entity recognition to extract the time and location information in the page, and then organize everything obtained into a structure that we call the web context representation. On this theoretical basis, we implemented a complete web page contextual parsing tool, JLUCAS. In a large number of comparative experiments, JLUCAS achieved excellent results in every aspect evaluated. These results indicate that the theory and algorithms proposed in this paper can solve the problem of automatic contextual parsing of web pages, and they lay a good foundation for ultimately building a contextual web search engine.
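
The pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not JLUCAS and does not reproduce the paper's algorithms. Every name in it (TOY_SENSES, NOISE_TAGS, BodyTextExtractor, disambiguate, WebContext, analyze) is hypothetical. Tag-based filtering stands in for noise removal on the DOM tree, a simplified Lesk-style gloss overlap over a two-sense toy inventory stands in for ontology-based word sense disambiguation, and the named entity recognition step is left as empty fields.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Dict, List, Optional

# Hypothetical toy sense inventory standing in for a general ontology.
TOY_SENSES = {
    "bank": {
        "bank#finance": "an institution that accepts deposits and lends money",
        "bank#river": "sloping land beside a river or stream of water",
    }
}

# Assumption: text inside these tags is treated as page noise.
NOISE_TAGS = {"script", "style", "nav", "footer"}


class BodyTextExtractor(HTMLParser):
    """Collects text that is not nested inside a noise tag."""

    def __init__(self):
        super().__init__()
        self._noise_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._noise_depth > 0:
            self._noise_depth -= 1

    def handle_data(self, data):
        if self._noise_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_body(html: str) -> str:
    """Step 1 (simplified): strip noise and return the main text."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def disambiguate(word: str, context_tokens: List[str]) -> Optional[str]:
    """Step 2 (simplified Lesk): pick the sense whose gloss shares
    the most words with the surrounding context."""
    senses = TOY_SENSES.get(word)
    if not senses:
        return None
    context = set(context_tokens)
    return max(senses, key=lambda s: len(context & set(senses[s].split())))


@dataclass
class WebContext:
    """Hypothetical container for a web context representation."""
    body: str
    senses: Dict[str, str] = field(default_factory=dict)
    last_modified: Optional[str] = None
    times: List[str] = field(default_factory=list)      # step 3, not sketched
    locations: List[str] = field(default_factory=list)  # step 3, not sketched


def analyze(html: str, last_modified: Optional[str] = None) -> WebContext:
    body = extract_body(html)
    tokens = [t.strip(".,;:!?").lower() for t in body.split()]
    senses = {}
    for token in tokens:
        sense = disambiguate(token, tokens)
        if sense is not None:
            senses[token] = sense
    return WebContext(body=body, senses=senses, last_modified=last_modified)


if __name__ == "__main__":
    page = (
        "<html><head><script>var x = 1;</script></head><body>"
        "<nav>Home</nav>"
        "<p>She sat on the bank of the river near the water.</p>"
        "</body></html>"
    )
    print(analyze(page, last_modified="2013-08-01"))
```

In this sketch the last-modified time is simply passed in by the caller; how it is obtained (for example from an HTTP response header) is left open, since the abstract does not give details of the real-time search technique used.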

Author information

Authors and Affiliations

College of Computer Science and Technology, China Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education of China, Jilin University, Changchun, 130012, China
Liang Zhu, Wanli Zuo, Fengling He, Jiayu Han & Jingya Lu

Editor information

Editors and Affiliations

Shanghai Jiao Tong University, 800 Dongchuan Road, Dianxinqunlou 1-401, 200240, Shanghai, China
Yuhang Yang
School of Electrical and Electronic Engineering, Nanyang Technological University, Nanyang Avenue, 639798, Singapore, Singapore
Maode Ma
College of Science, Hebei United University, 063009, Tangshan, Hebei, China
Baoxiang Liu

About this paper

Cite this paper

Zhu, L., Zuo, W., He, F., Han, J., Lu, J. (2013). Web Context Analysis Based on Generic Ontology. In: Yang, Y., Ma, M., Liu, B. (eds) Information Computing and Applications. ICICA 2013. Communications in Computer and Information Science, vol 391. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-53932-9_4

DOI: https://doi.org/10.1007/978-3-642-53932-9_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-53931-2
Online ISBN: 978-3-642-53932-9
eBook Packages: Computer Science, Computer Science (R0)
