Applied Intelligence, Volume 18, Issue 3, pp 243–256

Genetic Mining of HTML Structures for Effective Web-Document Retrieval

  • Sun Kim
  • Byoung-Tak Zhang
Article

Abstract

Web-documents contain a number of tags that indicate the structure of their text. Text segments marked by HTML tags carry specific meaning, which can be used to improve the performance of document retrieval systems. In this paper, we present a machine learning approach that mines the structure of HTML documents for effective Web-document retrieval. We describe a genetic algorithm that learns importance factors for HTML tags; these factors are then used to re-rank the documents retrieved by standard weighting schemes. The proposed method has been evaluated on artificial text sets and a large-scale TREC document collection. The experimental results indicate that the algorithm trains tag weights consistent with the tags' importance for retrieval, and that the approach significantly improves retrieval accuracy. In particular, document-structure mining tends to move relevant documents to higher ranks, which is especially important in interactive Web-information retrieval environments.
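
The idea of learning tag importance factors and using them to re-rank retrieved documents can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the tag set in `TAGS`, the representation of a document as per-tag match scores, average precision as the fitness measure, and the choice of tournament selection, uniform crossover, and Gaussian mutation as genetic operators.

```python
import random

# Hypothetical tag set; the abstract does not list the tags actually used.
TAGS = ["title", "h1", "b", "a", "body"]

def score(doc, weights):
    """Tag-weighted score. `doc` maps a tag to a base match score in that tag."""
    return sum(weights[i] * doc.get(tag, 0.0) for i, tag in enumerate(TAGS))

def rerank(docs, weights):
    """Return document indices sorted by descending tag-weighted score."""
    return sorted(range(len(docs)),
                  key=lambda i: score(docs[i], weights), reverse=True)

def average_precision(ranking, relevant):
    """Average precision of a ranking, given the set of relevant indices."""
    hits, total = 0, 0.0
    for rank, idx in enumerate(ranking, start=1):
        if idx in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def fitness(weights, queries):
    """Mean average precision over (docs, relevant_set) training pairs."""
    return sum(average_precision(rerank(docs, weights), rel)
               for docs, rel in queries) / len(queries)

def evolve(queries, pop_size=20, generations=30, mutation_rate=0.2, seed=0):
    """Evolve one weight in [0, 1] per tag (assumed operators, see lead-in)."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in TAGS] for _ in range(pop_size)]
    best = max(pop, key=lambda w: fitness(w, queries))

    def pick():
        # Binary tournament: the fitter of two random individuals wins.
        a, b = rng.sample(pop, 2)
        return a if fitness(a, queries) >= fitness(b, queries) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: carry the best individual forward
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            # Uniform crossover: each gene comes from either parent.
            child = [x if rng.random() < 0.5 else y for x, y in zip(p1, p2)]
            # Gaussian mutation, clamped to [0, 1].
            child = [min(1.0, max(0.0, g + rng.gauss(0, 0.1)))
                     if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        pop = children
        best = max(pop, key=lambda w: fitness(w, queries))
    return best
```

Given training pairs of retrieved documents and relevance judgments, `evolve(queries)` returns one weight per tag, and `rerank(docs, weights)` applies those weights to a new result list. Because the best individual is always carried forward, the training fitness never decreases from one generation to the next in this sketch.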

Keywords: genetic algorithms, machine learning, web-documents, information retrieval



Copyright information

© Kluwer Academic Publishers 2003

Authors and Affiliations

  • Sun Kim (1)
  • Byoung-Tak Zhang (1)
  1. Biointelligence Laboratory, School of Computer Science and Engineering, Seoul National University, Seoul, Korea
