Part of the book series: Studies in Computational Intelligence ((SCI,volume 370))

Abstract

This chapter focuses on the structure-based classification of websites according to their hypertext type or genre. A website usually consists of several web pages, and its structure is given by the hyperlinks among them, which form a directed graph. To represent the logical structure of a website, this graph is modeled as a so-called directed Generalized Tree (GT), in which a rooted spanning tree (the kernel tree) captures the logical core structure of the site. The remaining arcs are classified, with respect to this kernel tree, as reflexive, lateral, vertical upward, or vertical downward arcs.
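The arc typing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the chapter's implementation: the kernel tree is assumed to be given already as a parent map (the abstract does not say how the spanning tree is selected), and the function name `classify_arcs`, the type labels, and the toy site are hypothetical.

```python
def classify_arcs(arcs, parent):
    """Type each arc of a website graph relative to a given kernel tree.

    arcs   -- iterable of (source, target) page pairs (the hyperlinks)
    parent -- dict mapping each page to its parent in the kernel tree,
              with the root mapped to None (assumed to be given)
    """
    def ancestors(node):
        # Collect all proper ancestors of `node` in the kernel tree.
        found = set()
        node = parent[node]
        while node is not None:
            found.add(node)
            node = parent[node]
        return found

    typed = {"kernel": [], "reflexive": [], "up": [], "down": [], "lateral": []}
    for u, v in dict.fromkeys(arcs):  # drop duplicate arcs, keep order
        if u == v:
            typed["reflexive"].append((u, v))  # page links to itself
        elif parent[v] == u:
            typed["kernel"].append((u, v))     # arc of the kernel tree
        elif v in ancestors(u):
            typed["up"].append((u, v))         # vertical upward arc
        elif u in ancestors(v):
            typed["down"].append((u, v))       # vertical downward arc
        else:
            typed["lateral"].append((u, v))    # crosses between branches
    return typed


# Hypothetical toy site: home -> {about, blog}, blog -> post.
parent = {"home": None, "about": "home", "blog": "home", "post": "blog"}
arcs = [
    ("home", "about"), ("home", "blog"), ("blog", "post"),  # kernel tree
    ("post", "home"),   # back to an ancestor
    ("home", "post"),   # shortcut to a non-child descendant
    ("about", "blog"),  # between sibling branches
    ("blog", "blog"),   # self-link
]
typed = classify_arcs(arcs, parent)
```

In this sketch a "downward" arc is one that skips at least one level of the kernel tree; arcs from a parent to its direct child are treated as kernel arcs.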

We consider unsupervised and supervised approaches for learning classifiers from a given web corpus. Quantitative Structure Analysis (QSA) describes GTs by a series of attributes that characterize their structural complexity, and employs feature selection combined with unsupervised learning techniques. Kernel methods, the second class of approaches we consider, focus on typical substructures that characterize the classes. We present a series of tree, graph, and GT kernels that are suitable for this classification task and discuss their scalability. All learning approaches are evaluated using a web corpus containing classified websites.
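To make the idea of describing a GT by structural attributes more concrete, the following sketch maps one GT to a flat feature dictionary. The attributes shown (number of pages, height of the kernel tree, share of each arc type) are illustrative assumptions only; the abstract does not list the attributes actually used in QSA, and the function name `structural_features` is hypothetical.

```python
def structural_features(parent, typed_arcs):
    """Map one Generalized Tree to a flat dict of illustrative attributes.

    parent     -- kernel tree as a child -> parent map (root maps to None)
    typed_arcs -- dict mapping an arc type to its list of (source, target) arcs
    """
    def depth(node):
        # Number of kernel-tree arcs between `node` and the root.
        d = 0
        while parent[node] is not None:
            node = parent[node]
            d += 1
        return d

    n_arcs = sum(len(arcs) for arcs in typed_arcs.values())
    features = {
        "order": len(parent),                       # number of pages
        "height": max(depth(n) for n in parent),    # deepest page level
    }
    for arc_type, arcs in typed_arcs.items():
        # Relative frequency of each arc type (0.0 if the graph has no arcs).
        features["share_" + arc_type] = len(arcs) / n_arcs if n_arcs else 0.0
    return features


# Hypothetical toy GT with seven arcs.
parent = {"home": None, "about": "home", "blog": "home", "post": "blog"}
typed_arcs = {
    "kernel": [("home", "about"), ("home", "blog"), ("blog", "post")],
    "reflexive": [("blog", "blog")],
    "up": [("post", "home")],
    "down": [("home", "post")],
    "lateral": [("about", "blog")],
}
features = structural_features(parent, typed_arcs)
```

Feature vectors of this kind, one per website, could then be passed to a feature selection step and a clustering algorithm; which attributes and algorithms the chapter uses is not stated in the abstract.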

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Geibel, P., Mehler, A., Kühnberger, KU. (2011). Learning Methods for Graph Models of Document Structure. In: Mehler, A., Kühnberger, KU., Lobin, H., Lüngen, H., Storrer, A., Witt, A. (eds) Modeling, Learning, and Processing of Text Technological Data Structures. Studies in Computational Intelligence, vol 370. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22613-7_14

  • DOI: https://doi.org/10.1007/978-3-642-22613-7_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-22612-0

  • Online ISBN: 978-3-642-22613-7

  • eBook Packages: Engineering, Engineering (R0)
