Classifying Documents According to Locational Relevance

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5816)

Abstract

This paper presents an approach for categorizing documents according to their implicit locational relevance. We report a thorough evaluation of several classifiers designed for this task, built using support vector machines with multiple alternative feature vectors. Experimental results show that feature vectors combining document terms and URL n-grams with simple features related to the locality of the document (e.g., the total count of place references) lead to high accuracy. The paper also discusses how the proposed categorization approach can help improve tasks such as document retrieval and online contextual advertising.
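
To make the kind of combined feature vector mentioned in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. The tiny `GAZETTEER` set, the `term=`, `url=` and `loc:` feature prefixes, the trigram length, and the function names are all assumptions chosen for illustration; the abstract does not say how place references are recognized or how the features are weighted.

```python
import re
from collections import Counter

# Hypothetical toy gazetteer (assumption): a real system would use a
# proper place-name recognizer, which the abstract does not specify.
GAZETTEER = {"lisbon", "porto", "portugal"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def url_ngrams(url, n=3):
    """Character n-grams over the URL with non-alphanumerics removed."""
    cleaned = re.sub(r"[^a-z0-9]", "", url.lower())
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


def build_features(text, url, n=3):
    """Combine term counts, URL n-gram counts, and simple locality features."""
    tokens = tokenize(text)
    features = Counter()
    for token in tokens:
        features["term=" + token] += 1
    for gram in url_ngrams(url, n):
        features["url=" + gram] += 1
    places = [t for t in tokens if t in GAZETTEER]
    # Locality features: total and distinct counts of place references.
    features["loc:place_count"] = len(places)
    features["loc:distinct_places"] = len(set(places))
    return dict(features)


if __name__ == "__main__":
    example = build_features(
        "Hotels in Lisbon and Porto, Portugal. Lisbon guide.",
        "http://www.visitlisboa.pt/hotels",
    )
    for name in sorted(example):
        print(name, example[name])
```

A dictionary of this shape could then be turned into a numeric vector and passed to any support vector machine implementation. Keeping the three feature groups under separate prefixes makes it easy to train classifiers on each group alone or in combination.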

This work was partially supported by the FCT (Portugal), through project grant PTDC/EIA/73614/2006 (GREASE-II).


Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Anastácio, I., Martins, B., Calado, P. (2009). Classifying Documents According to Locational Relevance. In: Lopes, L.S., Lau, N., Mariano, P., Rocha, L.M. (eds) Progress in Artificial Intelligence. EPIA 2009. Lecture Notes in Computer Science, vol 5816. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04686-5_49

  • DOI: https://doi.org/10.1007/978-3-642-04686-5_49

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-04685-8

  • Online ISBN: 978-3-642-04686-5

  • eBook Packages: Computer Science, Computer Science (R0)