Mining Taxonomies from Web Menus: Rule-Based Concepts and Algorithms

  • Matthias Keller
  • Hannes Hartenstein
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7977)


The logical hierarchies of Web sites (i.e. Web site taxonomies) are obvious to humans, because humans can distinguish different menu levels and their relationships. But such accurate information about the logical structure is not yet available to machines. Many applications would benefit if Web site taxonomies could be mined from menus, but it was an almost unsolvable problem in the past. While a tag newly introduced in HTML5 and novel mining methods allow to distinguish menus from other contents today, it has not yet been researched, how the underlying taxonomies can be extracted, given the menus. In this paper we present the first detailed analysis of the problem and introduce rule-based concepts for addressing each identified sub problem. We report on a large-scale study on mining hierarchical menus of 350 randomly selected domains. Our methods allow extracting Web site taxonomy information that was not available before with high precision and high recall.


Web site taxonomies Web mining Content hierarchies 


  1. 1.
    Morville, P., Rosenfeld, L.: Information architecture for the World Wide Web. O’Reilly, Sebastopol (2006)Google Scholar
  2. 2.
    Kalbach, J.: Designing Web navigation. O’Reilly, Sebastopol (2007)Google Scholar
  3. 3.
    Lin, S.-H., Chu, K.-P., Chiu, C.-M.: Automatic sitemaps generation: Exploring website structures using block extraction and hyperlink analysis. Expert Systems with Applications 38, 3944–3958 (2011)CrossRefGoogle Scholar
  4. 4.
    Yang, Q., Jiang, P., Zhang, C., Niu, Z.: Reconstruct Logical Hierarchical Sitemap for Related Entity Finding. In: Voorhees, E.M., Buckland, L.P. (eds.) The Nineteenth Text Retrieval Conf (TREC 2010). National Institute of Standards and Technology, NIST (2010)Google Scholar
  5. 5.
    Pavan Kumar, G.M., Leela, K.P., Parsana, M., Garg, S.: Learning website hierarchies for keyword enrichment in contextual advertising. In: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 425–434. ACM, Hong Kong (2011)Google Scholar
  6. 6.
    Amitay, E., Carmel, D., Darlow, A., Lempel, R., Soffer, A.: The connectivity sonar: detecting site functionality by structural patterns. In: Proceedings of the Fourteenth ACM Conference on Hypertext and Hypermedia, pp. 38–47. ACM, Nottingham (2003)CrossRefGoogle Scholar
  7. 7.
    Keller, M., Nussbaumer, M.: MenuMiner: revealing the information architecture of large web sites by analyzing maximal cliques. In: Proceedings of the 21st Int’l. Conf. Companion on World Wide Web, pp. 1025–1034. ACM, Lyon (2012)CrossRefGoogle Scholar
  8. 8.
    Rossi, G., Schwabe, D., Lyardet, O., Puc-rio, D.D.I., MarquêS, R., Vicente, S.: Improving Web information systems with navigational patterns. Computer Networks 31 (1999)Google Scholar
  9. 9.
    Ceri, S., Fraternali, P., Bongio, A.: Web Modeling Language (WebML): a modeling language for designing Web sites. Computer Networks 33, 137–157 (2000)CrossRefGoogle Scholar
  10. 10.
    Schwabe, D., Rossi, G., Barbosa, S.D.J.: Systematic hypermedia application design with OOHDM. In: Proc. of the the Seventh ACM Conf. on Hypertext, pp. 116–128. ACM, Bethesda (1996)CrossRefGoogle Scholar
  11. 11.
    Koch, N., Knapp, A., Zhang, G., Baumeister, H.: Uml-Based Web Engineering. In: Rossi, G., Pastor, O., Schwabe, D., Olsina, L. (eds.) Web Engineering: Modelling and Implementing Web Applications, pp. 157–191. Springer London, London (2008)CrossRefGoogle Scholar
  12. 12.
    Jones, W.P., Furnas, G.W.: Pictures of relevance: a geometric analysis of similarity measures. J. Am. Soc. Inf. Sci. 38, 420–442 (1987)CrossRefGoogle Scholar
  13. 13.
    Ho, Q., Eisenstein, J., Xing, E.P.: Document hierarchies from text and links. In: Proceedings of the 21st International Conference on World Wide Web, pp. 739–748. ACM, Lyon (2012)CrossRefGoogle Scholar
  14. 14.
    Zheng, X., Gu, Y., Li, Y.: Data extraction from web pages based on structural-semantic entropy. In: Proc. of the 21st Int’l. Conf. Companion on World Wide Web, pp. 93–102. ACM, Lyon (2012)CrossRefGoogle Scholar
  15. 15.
    Bernardi, M., Di Lucca, G., Distante, D.: The RE-UWA approach to recover user centered conceptual models from Web applications. International Journal on Software Tools for Technology Transfer 11, 485–501 (2009)CrossRefGoogle Scholar
  16. 16.
    Yang, C.C., Liu, N.: Web site topic-hierarchy generation based on link structure. J. Am. Soc. Inf. Sci. Technol. 60, 495–508 (2009)CrossRefGoogle Scholar
  17. 17.
    Kumar, R., Punera, K., Tomkins, A.: Hierarchical topic segmentation of websites. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 257–266. ACM, Philadelphia (2006)CrossRefGoogle Scholar
  18. 18.
    Cheung, W.K., Sun, Y.: Identifying a hierarchy of bipartite subgraphs for web site abstraction. Web Intelli. and Agent Sys. 5, 343–355 (2007)Google Scholar
  19. 19.
    Bose, A., Beemanapalli, K., Srivastava, J., Sahar, S.: Incorporating concept hierarchies into usage mining based recommendations. In: Nasraoui, O., Spiliopoulou, M., Srivastava, J., Mobasher, B., Masand, B. (eds.) WebKDD 2006. LNCS (LNAI), vol. 4811, pp. 110–126. Springer, Heidelberg (2007)CrossRefGoogle Scholar
  20. 20.
    Wang, C., Lu, J., Zhang, G.: Mining key information of web pages: A method and its application. Expert Syst. Appl. 33, 425–433 (2007)MathSciNetCrossRefGoogle Scholar
  21. 21.
    Liu, Z., Ng, W.K., Lim, E.-P.: An Automated Algorithm for Extracting Website Skeleton. In: Lee, Y., Li, J., Whang, K.-Y., Lee, D. (eds.) DASFAA 2004. LNCS, vol. 2973, pp. 799–811. Springer, Heidelberg (2004)CrossRefGoogle Scholar
  22. 22.
    Keller, M., Nussbaumer, M.: Beyond the Web Graph: Mining the Information Architecture of the WWW with Navigation Structure Graphs. In: Proc. of the 2011 Int’l. Conf. on Emerging Intelligent Data and Web Technologies, pp. 99–106. IEEE Computer Society, Tirana (2011)CrossRefGoogle Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Matthias Keller
    • 1
  • Hannes Hartenstein
    • 1
  1. 1.Steinbuch Centre for ComputingKarlsruhe Institute of TechnologyKarlsruheGermany

Personalised recommendations