Using Internet Glossaries to Determine Interests from Home Pages

Portscher, Edwin; Geller, James; Scherl, Richard

doi:10.1007/978-3-540-45229-4_25

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2738)

Abstract

There are millions of home pages on the web. Each page contains valuable data about the page’s owner that can be used for marketing purposes. For that purpose, the pages have to be classified according to interests. The traditional Information Retrieval approach requires large training sets that are classified by human experts. Knowledge-based methods, which use handcrafted rules, require a significant investment to develop the rule base. Both approaches are very time-consuming. We instead use glossaries, which are freely available on the Internet, to determine interests from home pages. Processing these glossaries can be automated and requires little human effort and time compared to the other two approaches. Once the terms have been extracted from the glossaries, they can be used to infer interests from the home pages of web users. This paper describes the system we have developed for classifying home pages by interest. In an experiment on 400 pages, the glossary with the highest number of word matches corresponded to the correct interest for 44.75% of the pages. The correct interest was among the top three returned interests for 72.25% of the pages and among the top five returned interests for 84.5% of the pages.
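
The abstract gives only the outline of the method: count word matches between a home page and each glossary, rank the interests by that count, and check whether the correct interest falls within the top one, three, or five. The following Python sketch illustrates that idea under stated assumptions. The tokenizer, the alphabetical tie-breaking, the restriction to single-word terms, and the tiny example glossaries are all hypothetical choices made for illustration; they are not taken from the paper and should not be read as the authors' implementation.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def rank_interests(page_text, glossaries):
    """Rank interests by the number of page words found in each glossary.

    `glossaries` maps an interest name to a set of lowercase single-word
    terms (an assumption; multi-word glossary terms are not handled here).
    Ties are broken alphabetically so that the ranking is deterministic.
    """
    word_counts = Counter(tokenize(page_text))
    scores = {
        interest: sum(n for word, n in word_counts.items() if word in terms)
        for interest, terms in glossaries.items()
    }
    return sorted(scores, key=lambda interest: (-scores[interest], interest))


def top_k_accuracy(rankings, correct_labels, k):
    """Fraction of pages whose correct interest is among the top k ranked."""
    hits = sum(
        1 for ranking, label in zip(rankings, correct_labels)
        if label in ranking[:k]
    )
    return hits / len(correct_labels)


# Hypothetical glossaries and page text, for illustration only.
glossaries = {
    "chess": {"rook", "gambit", "checkmate", "pawn"},
    "sailing": {"jib", "keel", "tack", "mast"},
    "cooking": {"saute", "braise", "simmer", "whisk"},
}
page = "Welcome to my page! I love a good gambit and a quick checkmate, but I also whisk eggs."

ranking = rank_interests(page, glossaries)
print(ranking)
```

With these example inputs, the page matches two chess terms, one cooking term, and no sailing terms, so chess is ranked first. For reference, the percentages reported for the 400-page experiment correspond to 179 pages (44.75%), 289 pages (72.25%), and 338 pages (84.5%).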

Author information

Authors and Affiliations

New Jersey Institute of Technology, Newark, New Jersey, 07102
Edwin Portscher & James Geller
Monmouth University, West Long Branch, New Jersey, 07764
Richard Scherl

Editor information

Editors and Affiliations

University of Zurich, Department of Informatics (IFI), Winterthurer Straße 190, 8057, Zurich, Switzerland
Kurt Bauknecht
Institute of Software Technology and Interactive Systems, Vienna University of Technology, Favoritenstr. 9-11/188, A-1040, Wien, Austria
A Min Tjoa
Division of Information Technology, Engineering and the Environment, School of Computer and Information Science, University of South Australia, 5095, Mawson Lakes, SA, Australia
Gerald Quirchmayr

About this paper

Cite this paper

Portscher, E., Geller, J., Scherl, R. (2003). Using Internet Glossaries to Determine Interests from Home Pages. In: Bauknecht, K., Tjoa, A.M., Quirchmayr, G. (eds) E-Commerce and Web Technologies. EC-Web 2003. Lecture Notes in Computer Science, vol 2738. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45229-4_25

DOI: https://doi.org/10.1007/978-3-540-45229-4_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40808-6
Online ISBN: 978-3-540-45229-4
eBook Packages: Springer Book Archive
