Advertisement

Mining Graph Patterns in Web-based Systems: A Conceptual View

  • Matthias Dehmer
  • Frank Emmert-Streib
Chapter
Part of the Text, Speech and Language Technology book series (TLTB, volume 42)

Abstract

The task of applying Data Mining methods [38] to web-based hypertexts is often referred to as Web Mining [16]. In view of the steadily increasing complexity of web data sources and the huge amount of information available online, Web Mining has been an important and fruitful research topic [16, 46]. Generally, Web Mining can be divided into the following categories:

Keywords

Web structure mining Web mining Web genre Graph Similarity Graph Classification 

Notes

Acknowledgements

We are thankful to Alexander Mehler for fruitful discussions on this topic.

References

  1. 1.
    Albert, R., H. Jeong, and A.L. Barabási. 1999. Diameter of the world wide web. Nature 401:130–131.CrossRefGoogle Scholar
  2. 2.
    Baeza-Yates, R., and B. Ribeiro-Neto, eds. 1999. Modern information retrieval. Reading, MA: Addison-Wesley.Google Scholar
  3. 3.
    Barabási, A.-L., and Z.N. Oltvai. 2004. Network biology: Understanding the cell’s functional organization. Nature Reviews Genetics, 5(2):101–113.CrossRefGoogle Scholar
  4. 4.
    Basak, S.C., V.R. Magnuson, G.J. Niemi, and R.R. Regal. 1988. Determining structural similarity of chemicals using graph-theoretic indices. Discrete Applied Mathematics 19:17–44.MathSciNetCrossRefGoogle Scholar
  5. 5.
    Batagelj, V. 1988. Similarity measures between structured objects. In Proceedings of an International Course and Conference on the Interfaces between Mathematics, Chemistry and Computer Sciences. Dubrovnik, Yugoslavia.Google Scholar
  6. 6.
    Bonchev, D. 1979. Information indices for atoms and molecules. MATCH 7:65–113.Google Scholar
  7. 7.
    Bonchev, D. 1983. Information theoretic indices for characterization of-chemical structures. Chichester: Research Studies Press.Google Scholar
  8. 8.
    Bornholdt, S., and H.G. Schuster. 2003. Handbook of graphs and networks. From the genome to the Internet. Weinheim: Wiley-VCH.Google Scholar
  9. 9.
    Brandes, U., and T. Erlebach. 2005. Network analysis. Lecture Notes in Computer Science. Heidelberg: Springer.Google Scholar
  10. 10.
    Bunke, H. 1983. What is the distance between graphs? Bulletin of the EATCS 20:35–39.Google Scholar
  11. 11.
    Bunke, H. 2000a. Recent developments in graph matching. In Proceedings of the 15th International Conference on Pattern Recognition 2:117–124.Google Scholar
  12. 12.
    Bunke, H. 2000b. Graph matching: Theoretical foundations, algorithms, and applications. In Proceedings of Vision Interface 2000, 82–88. Montreal, Canada.Google Scholar
  13. 13.
    Buttler, D. 2004. A short survey of document structure similarity algorithms. In International Conference on Internet Computing, 3–9. Los Vegas, Nevada, USA.Google Scholar
  14. 14.
    Carrière, S.J., and R. Kazman. 1997. Webquery: Searching and visualizing the web through connectivity. Computer Networks and ISDN Systems 29(8–13):1257–1267.CrossRefGoogle Scholar
  15. 15.
    Chakrabarti, S. 2001. Integrating the document object model with hyperlinks for enhanced topic distillation and information extraction. In Proceedings of the 10th International World Wide Web Conference, May 1–5, 211–220. Hong Kong.Google Scholar
  16. 16.
    Chakrabarti, S. 2002. Mining the web: Discovering knowledge from hypertext data. San Francisco, CA: Morgan Kaufmann.Google Scholar
  17. 17.
    Cook, D., and L.B. Holder. 2007. Mining graph data. Weinheim: Wiley-Interscience.Google Scholar
  18. 18.
    Dehmer, M. 2006. Strukturelle analyse web-basierter Dokumente. Multimedia und Telekooperation. Wiesbaden: Deutscher Universitäts Verlag.Google Scholar
  19. 19.
    Dehmer, M. 2008a. Information-theoretic concepts for the analysis of complex networks. Applied Artificial Intelligence 22(7 and 8):684–706.CrossRefGoogle Scholar
  20. 20.
    Dehmer, M. 2008b. Information processing in complex networks:graph entropy and information functionals. Applied Mathematics and Computation 201:82–94.MathSciNetzbMATHCrossRefGoogle Scholar
  21. 21.
    Dehmer, M., and F. Emmert-Streib. 2007. Structural similarity of directed universal hierarchical graphs: A low computational complexity approach. Applied Mathematics and Computation 194:7–20.MathSciNetzbMATHCrossRefGoogle Scholar
  22. 22.
    Dehmer, M., and A. Mehler. 2007. A new method of measuring similarity for a special class of directed graphs. Tatra Mountains Mathematical Publications 36:39–59.MathSciNetzbMATHGoogle Scholar
  23. 23.
    Dehmer, M., A. Mehler, and R. Gleim. 2004. Aspekte der Kategorisierung von Webseiten. In Proceedings des Multimediaworkshops der Jahrestagung der Gesellschaft für Informatik, eds. P. Dadam und M. Reichert, Lecture Notes in Computer Science, vol. 2, 39–43, Berlin: Springer.Google Scholar
  24. 24.
    Dehmer, M., F. Emmert-Streib, and J. Kilian. 2006. A similarity measure for graphs with lowcomputational complexity. Applied Mathematics and Computation 182:447–459.MathSciNetzbMATHCrossRefGoogle Scholar
  25. 25.
    Dehmer, M., A. Mehler, and F. Emmert-Streib. 2007. Graphtheoretical characterizations of generalized trees. In Proceedings of the International Conference on Machine Learning: Models, Technologies & Applications (MLMTA’07). Las Vegas, NV.Google Scholar
  26. 26.
    Dehmer, M., F. Emmert-Streib, and T. Gesell. 2008. A comparative analysis of multidimensional featuresof objects resembling sets of graphs. Applied Mathematics and Computation 196:221–235.MathSciNetzbMATHCrossRefGoogle Scholar
  27. 27.
    Dehmer, M., F. Emmert-Streib, A. Mehler, and J. Kilian. 2006. Measuring the structural similarity of web-based documents: A novel approach. International Journal of Computational Intelligence 3(1):1–7.Google Scholar
  28. 28.
    Dimter, M. 1981. Textklassenkonzepte heutiger Alltagssprache. Tübingen: Niemeyer.CrossRefGoogle Scholar
  29. 29.
    Dorogovtsev, S.N., and J.F.F. Mendes. 2003. Evolution of networks. From biological networks to the internet and http://WWW. Oxford: Oxford University Press.
  30. 30.
    Emmert-Streib, F., and M. Dehmer. 2007. Information theoretic measures of UHG graphs with low computational complexity. Applied Mathematics and Computation 190:1783–1794.MathSciNetzbMATHCrossRefGoogle Scholar
  31. 31.
    Ferber, R. 2003. Information retrieval. Suchmodelle und Data-Mining-Verfahren für Textsammlungen und das Web. Heidelberg: dpunkt.verlag.Google Scholar
  32. 32.
    Flesca, S., G. Manco, E. Masciari, L. Pontieri, and A. Pugliese. 2002. Detecting structural similarities between XML documents. In Proceedings of the International Workshop on the Web and Databases (WebDB 2002). Madison, Wisconsin, USA.Google Scholar
  33. 33.
    Foulds, L.R. 1992. Graph theory applications. New York, NY: Springer.CrossRefGoogle Scholar
  34. 34.
    Gibson, D., R. Kumar, K.S. McCurley, and A. Tomkins. 2007. Dense subgraph extraction. In Mining graph data, eds. D. Cook and L.B. Holder, 411–441. Hoboken, NJ: Wiley-Interscience.Google Scholar
  35. 35.
    Gleim, R. 2004. Integrierte Repräsentation, Kategorisierung und Strukturanalyse Web-basierter Hypertexte. Master’s thesis, Technische Universität Darmstadt, Fachbereich Informatik, Sept 2004.Google Scholar
  36. 36.
    Gleim, R. 2005. HyGraph: Ein Framework zur Extraktion, Repräsentation und Analyse webbasierter Hypertexte. In Sprachtechnologie, mobile Kommunikation und linguistische Ressourcen. Beiträge zur GLDV-Tagung 2005 in Bonn, eds. B. Fisseni, H.-C. Schmitz, B. Schröder, and P. Wagner, 42–53. Frankfurt a.M.: Lang.Google Scholar
  37. 37.
    Halin, R. 1989. Graphentheorie. Berlin: Akademie Verlag.Google Scholar
  38. 38.
    Han, J., and M. Kamber. 2001. Data mining: Concepts and techniques. New York, NY: Morgan and Kaufmann Publishers.Google Scholar
  39. 39.
    Harary, F. 1969. Graph theory. Reading, MA: Addison Wesley Publishing Company.Google Scholar
  40. 40.
    Huberman, B., and L. Adamic. 1999. Growth dynamics of the world-wide web. Nature, 399:130.Google Scholar
  41. 41.
    Jiang, T., L. Wang, and K. Zhang. 1994. Alignment of trees – an alternative to tree edit. In CPM ’94: Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching, 75–86, London: Springer-Verlag.Google Scholar
  42. 42.
    Joshi, S., N. Agrawal, R. Krishnapuram, and S. Negi. 2003. A bag of paths model for measuring structural similarity in web documents. In KDD ’03: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 577–582, New York, NY.Google Scholar
  43. 43.
    Kaden, F. 1982. Graphmetriken und Distanzgraphen. ZKI-Informationen, Akademie der Wissenschaften der DDR 2(82):1–63.Google Scholar
  44. 44.
    Kaden, F. 1986. Graphmetriken und Isometrieprobleme zugehöriger Distanzgraphen. ZKI-Informationen, Akademie der Wissenschaften der DDR 1(P6):1–100.Google Scholar
  45. 45.
    Kleinberg, J.M. 1999. Authoritative sources in a hyperlinked environment. Journal of the ACM 46(5):604–632.MathSciNetzbMATHCrossRefGoogle Scholar
  46. 46.
    Kosala, R., and H. Blockeel. 2000. Web mining research: A survey. SIGKDD explorations: Newsletter of the Special Interest Group (SIG) on knowledge discovery & data mining, ACM 2(1):1–15.Google Scholar
  47. 47.
    Koschützki, D., K.A. Lehmann, L. Peters, S. Richter, D. Tenfelde-Podehl, and O. Zlotkowski. 2005. Clustering. In Centrality indices, eds. U. Brandes and T. Erlebach, Lecture Notes of Computer Science, 16–61. Berlin: Springer.Google Scholar
  48. 48.
    Kumar, R., P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tompkins, and E. Upfal. 2000. The web as a graph. In PODS ’00: Proceedings of the 19th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 1–10, New York, NY: ACM Press.Google Scholar
  49. 49.
    Levenstein, V.I. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics – Doklady 10(8):707–710, Feb 1966.MathSciNetGoogle Scholar
  50. 50.
    Lindemann, C., and L. Littig. 2010. Classification of web sites at super-genre level. In Genres on the web: Computational models and empirical studies, eds. A. Mehler, S. Sharoff, and M. Santini, Text, Speech and Language Technology. Dordrecht: Springer.Google Scholar
  51. 51.
    Mason, O., and M. 2007. Verwoerd. Graph theory and networks in biology. IET Systems Biology 1(2):89–119.CrossRefGoogle Scholar
  52. 52.
    Mehler, A. 2001. Textbedeutung. Zur prozeduralen Analyse und Repräsentation struktureller ähnlichkeiten von Texten, volume 5 of Sprache, Sprechen und Computer/Computer Studies in Language and Speech. Frankfurt a. M.: Peter Lang.Google Scholar
  53. 53.
    Mehler, A. 2004. Textmining. In Texttechnologie. Perspektiven und Anwendungen, eds. H. Lobin and L. Lemnitzer, 83–107. Tübingen: Stauffenburg.Google Scholar
  54. 54.
    Mehler, A. 2009. Generalized shortest paths trees: A novel graph class applied to semiotic networks. In Analysis of complex networks: From biology to linguistics, eds. M. Dehmer and F. Emmert-Streib, 175–220. Weinheim: Wiley-VCH.Google Scholar
  55. 55.
    Mehler, A. 2010. Structure formation in the web. toward a graphtheoretical model of hypertext types. In Linguistic modelling of information and markup languages, eds. A. Witt and D. Metzing, 225–247. Dordrecht: Springer.Google Scholar
  56. 56.
    Mehler, A., and R. Gleim. 2006. The net for the graphs – towards webgenre representation for corpus linguistic studies. In WaCky! Working papers on the web as corpus, eds. M. Baroni and S. Bernardini, 191–224. Bologna: Gedit.Google Scholar
  57. 57.
    Mehler, A., M. Dehmer, and R. Gleim. 2004. Towards logical hypertext structure – A graph-theoretic perspective. In Proceedings of the Fourth International Workshop on Innovative Internet Computing Systems (I2CS ’04), eds. T. Böhme and G. Heyer, Lecture Notes in Computer Science, vol. 3473, 136–150, Berlin/New York: Springer.Google Scholar
  58. 58.
    Mehler, A., R. Gleim, and M. Dehmer. 2005. Towards structure-sensitive hypertext categorization. In Proceedings of the 29th Annual Conference of the German Classification Society, LNCS, Mar 9–11. Universität Magdeburg, Berlin/New York, NY: Springer.Google Scholar
  59. 59.
    Mehler, A., R. Gleim, and A. Wegner. 2007. Structural uncertainty of hypertext types. An empirical study. In Proceedings of the Workshop “Towards Genre-Enabled Search Engines: The Impact of NLP”, 30 Sept 2007, 13–19, in conjunction with RANLP 2007. Borovets, Bulgaria.Google Scholar
  60. 60.
    Messmer, B.T., and H. Bunke. 1998. A new algorithm for error-tolerant subgraph isomorphism detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(5):493–504.CrossRefGoogle Scholar
  61. 61.
    Noller, S., J. Naumann, and T. Richter. 2001. LOGPAT – Ein webbasiertes Tool zur Analyse von Navigationsverläufen in Hypertexten. http://www.psych.uni-goettingen.de/congress/gor-2001
  62. 62.
    Power, R., D. Scott, and N. Bouayad-Agha. 2003. Document structure. Computational Linguistics 29(2):211–260.CrossRefGoogle Scholar
  63. 63.
    Raghavan, P. 2000. Graph structure of the web: A survey. In LATIN 2000: Theoretical Informatics. Proceedings of 4th Latin American Symposium, 123–125. Punta del Este, Uruguay.Google Scholar
  64. 64.
    Rahm, E. 2002. Web usage mining. Datenbank-Spektrum 2(2)75–76.Google Scholar
  65. 65.
    Rehm, G. 2007. Hypertextsorten. Definition – Struktur – Klassifikation. Norderstedt: Books on Demand.Google Scholar
  66. 66.
    Richter, T., J. Naumann, and S. Noller. 2003. Logpat: A semi-automatic way to analyze hypertext navigation behavior. Swiss Journal of Psychology 62:113:120.CrossRefGoogle Scholar
  67. 67.
    Schädler, C. 1999. Die Ermittlung struktureller ähnlichkeit undstruktureller-Merkmale bei komplexen Objekten: Einkonnektionistischer Ansatz und seine Anwendungen. PhD thesis, Technische Universität Berlin.Google Scholar
  68. 68.
    Scsibrany, H., K. Karlovits, W. Demuth, F. Müller, and K. Varmuza. 2003. Clustering and similarity of chemical structures represented by binary substructure descriptors. Chemometrics and Intelligent Laboratory Systems 67:95–108.CrossRefGoogle Scholar
  69. 69.
    Selkow, S.M. 1977. The tree-to-tree editing problem. Information Processing Letters 6(6):184–186.MathSciNetzbMATHCrossRefGoogle Scholar
  70. 70.
    Skorobogatov, V.A., and A.A. Dobrynin. 1988. Metrical analysis of graphs. MATCH 23:105–155.MathSciNetzbMATHGoogle Scholar
  71. 71.
    Sobik, F. 1982. Graphmetriken und Klassifikation strukturierter Objekte. ZKI-Informationen, Akademie der Wissenschaften der DDR 2(82):63–122.Google Scholar
  72. 72.
    Sobik, F. 1986. Modellierung von Vergleichsprozessen auf der Grundlage von ähnlichkeitsmaßen für Graphen. ZKI-Informationen, Akademie der Wissenschaften der DDR 4:104–144.Google Scholar
  73. 73.
    Spiliopoulou, M. 2000. Web usage mining for web site evaluation. Communications of the ACM 43(8):127–134.CrossRefGoogle Scholar
  74. 74.
    Tai, K.C. 1979. The tree-to-tree correction problem. Journal of the ACM 26(3):422–433. ISSN 0004-5411.MathSciNetzbMATHCrossRefGoogle Scholar
  75. 75.
    Waltinger, U., A. Mehler, and A. Wegner. 2009. A two-level approach to web genre classification. In Proceedings of the 5th International Conference on Web Information Systems and Technologies (WEBIST ’09), 23–26 Mar 2009. Lisboa.Google Scholar
  76. 76.
    Wasserman, S., and K. Faust. 1994. Social network analysis: Methods and applications, Structural Analysis in the Social Sciences. Cambridge, MA: Cambridge University Press.Google Scholar
  77. 77.
    Zelinka, B. 1975. On a certain distance between isomorphism classes of graphs. Časopis pro \(\breve{p}\) est. Mathematiky 100:371–373.MathSciNetzbMATHGoogle Scholar

Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

  1. 1.Institute of Discrete Mathematics and Geometry, Vienna University of TechnologyViennaAustria
  2. 2.Institute for Bioinformatics and Translational ResearchHall in TyrolAustria
  3. 3.Computational Biology and Machine Learning, Center for Cancer Research and Cell Biology, School of Medicine, Dentistry and Biomedical Sciences, Queen’s University BelfastBelfastUK

Personalised recommendations