Learning Methods for Graph Models of Document Structure

Geibel, Peter; Mehler, Alexander; Kühnberger, Kai-Uwe

doi:10.1007/978-3-642-22613-7_14

Part of the book series: Studies in Computational Intelligence (SCI, volume 370)

Abstract

This chapter focuses on the structure-based classification of websites according to their hypertext type or genre. A website usually consists of several web pages, and its structure is given by the hyperlinks between them, which form a directed graph. To capture the logical structure of a website, this graph is represented as a so-called directed Generalized Tree (GT), in which a rooted spanning tree serves as the logical core (kernel) of the site. The remaining arcs are classified, relative to this kernel tree, as reflexive, lateral, or vertical (upward or downward) arcs.
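As a rough illustration of the arc classification described above, the following Python sketch labels the arcs of a small directed graph relative to a given kernel tree. It is a minimal sketch under stated assumptions, not the chapter's implementation: the function name `classify_arcs`, the label strings, the parent-map encoding of the kernel tree, and the toy page names are all hypothetical, and how the kernel tree is chosen is deliberately left open.

```python
def classify_arcs(arcs, parent):
    """Label each arc (u, v) relative to a kernel tree.

    `parent` maps every node to its parent in the kernel tree
    (the root maps to None). The encoding is an assumption made
    for this sketch.
    """
    def ancestors(node):
        # Collect all proper ancestors of `node` in the kernel tree.
        found = set()
        p = parent[node]
        while p is not None:
            found.add(p)
            p = parent[p]
        return found

    labels = {}
    for u, v in arcs:
        if u == v:
            labels[(u, v)] = "reflexive"   # self-loop
        elif parent[v] == u:
            labels[(u, v)] = "kernel"      # arc of the spanning tree itself
        elif u in ancestors(v):
            labels[(u, v)] = "down"        # shortcut from an ancestor to a descendant
        elif v in ancestors(u):
            labels[(u, v)] = "up"          # link back to an ancestor
        else:
            labels[(u, v)] = "lateral"     # neither node is an ancestor of the other
    return labels


# Hypothetical four-page site: index -> about -> team, index -> news.
kernel_tree = {"index": None, "about": "index", "team": "about", "news": "index"}
site_arcs = [
    ("index", "about"), ("about", "team"), ("index", "news"),  # kernel arcs
    ("index", "team"),   # downward
    ("team", "index"),   # upward
    ("news", "about"),   # lateral
    ("news", "news"),    # reflexive
]
arc_labels = classify_arcs(site_arcs, kernel_tree)
```

Taking the kernel tree as an input keeps the sketch independent of any particular spanning-tree construction; note that a breadth-first spanning tree, for instance, would never produce downward arcs.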

We consider unsupervised and supervised approaches for learning classifiers from a given web corpus. Quantitative Structure Analysis (QSA) describes GTs by a series of attributes that characterize their structural complexity, and combines feature selection with unsupervised learning techniques. Kernel methods, the second class of approaches we consider, focus on typical substructures that characterize the classes. We present a series of tree, graph, and GT kernels suited to this problem and discuss the problem of scalability. All learning approaches are evaluated on a web corpus of classified websites.
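To make the idea of comparing structures by shared substructures concrete, here is a small Python sketch of a complete-subtree counting kernel on unlabeled, unordered rooted trees. It is only an illustration of the general principle and is not one of the kernels presented in the chapter; the function names and the child-list encoding of trees are assumptions made for this example.

```python
from collections import Counter


def subtree_counts(children, root):
    """Count the complete subtrees of a rooted tree by canonical shape.

    `children` maps a node to the list of its children (leaves may be
    omitted). Each subtree is encoded as a nested-parentheses string
    with sorted child encodings, so child order does not matter.
    """
    counts = Counter()

    def encode(node):
        inner = "".join(sorted(encode(c) for c in children.get(node, [])))
        code = "(" + inner + ")"
        counts[code] += 1
        return code

    encode(root)
    return counts


def subtree_kernel(tree_a, tree_b):
    """Dot product of the two subtree-shape histograms.

    Because it is an explicit inner product of feature vectors, the
    resulting kernel is positive semidefinite.
    """
    counts_a = subtree_counts(*tree_a)
    counts_b = subtree_counts(*tree_b)
    return sum(n * counts_b[code] for code, n in counts_a.items())


# Two toy trees: a root with two leaf children, and a chain of three nodes.
star = ({"r": ["a", "b"]}, "r")
chain = ({"r": ["a"], "a": ["b"]}, "r")

print(subtree_kernel(star, chain))   # the trees share only the leaf shape
print(subtree_kernel(star, star))
```

Each tree is processed in time roughly linear in its size (up to sorting the child encodings), which hints at why explicit substructure counts can scale better than kernels that enumerate matching substructures for every pair of nodes.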

Author information

Authors and Affiliations

Fak. IV, TU Berlin, Straße des 17. Juni 135, D-10623, Berlin, Germany
Peter Geibel
Computer Science and Informatics, Goethe-Universität Frankfurt, Senckenberganlage 31, D-60325, Frankfurt am Main, Germany
Alexander Mehler
Institute of Cognitive Science, Universität Osnabrück, Albrechtstraße 28, D-49076, Osnabrück, Germany
Kai-Uwe Kühnberger

Editor information

Editors and Affiliations

Faculty of Linguistics and Literature, Bielefeld University, Universitätsstraße 25, 33615, Bielefeld, Germany
Alexander Mehler
Institute of Cognitive Science, University of Osnabrück, Albrechtstr. 28, 49076, Osnabrück, Germany
Kai-Uwe Kühnberger
Angewandte Sprachwissenschaft und Computerlinguistik, Justus-Liebig-Universität Gießen, Otto-Behaghel-Straße 10D, 35394, Gießen, Germany
Henning Lobin & Harald Lüngen
Institut für deutsche Sprache und Literatur, Technical University Dortmund, Emil-Figge-Straße 50, 44227, Dortmund, Germany
Angelika Storrer
SFB 441 Linguistic Data Structures, Eberhard Karls Universität Tübingen, Nauklerstraße 35, 72074, Tübingen, Germany
Andreas Witt

About this chapter

Cite this chapter

Geibel, P., Mehler, A., Kühnberger, KU. (2011). Learning Methods for Graph Models of Document Structure. In: Mehler, A., Kühnberger, KU., Lobin, H., Lüngen, H., Storrer, A., Witt, A. (eds) Modeling, Learning, and Processing of Text Technological Data Structures. Studies in Computational Intelligence, vol 370. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22613-7_14

DOI: https://doi.org/10.1007/978-3-642-22613-7_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22612-0
Online ISBN: 978-3-642-22613-7
eBook Packages: Engineering, Engineering (R0)