Sales Intelligence Using Web Mining

  • Viara Popova
  • Robert John
  • David Stockton
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5633)


This paper presents a knowledge extraction system that provides sales intelligence based on information downloaded from the WWW. The information is first located and downloaded from relevant companies’ websites; machine learning is then used to find the web pages that contain useful information, where “useful” is defined as containing news about orders for specific products. Several machine learning algorithms were tested. Of these, k-nearest neighbour, support vector machines, multi-layer perceptron and C4.5 decision tree produced the best results in one or both experiments; however, k-nearest neighbour and support vector machines proved to be the most robust, which is a highly desired characteristic in this application. K-nearest neighbour slightly outperformed support vector machines in both experiments, which contradicts results previously reported in the literature.
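
The classification step described in the abstract can be illustrated with a minimal sketch of k-nearest-neighbour text classification. This is not the authors' implementation: the bag-of-words features, the cosine similarity measure, the value of k, the function names and the toy training pages below are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def vectorize(text):
    """Build a bag-of-words term-frequency vector (assumed feature choice)."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(text, labelled_pages, k=3):
    """Label a page by majority vote among its k most similar training pages."""
    query = vectorize(text)
    scored = sorted(
        ((cosine(query, vectorize(page)), label) for page, label in labelled_pages),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: "useful" pages carry news about product orders.
TRAINING_PAGES = [
    ("Company wins order for fifty new machines", "useful"),
    ("New contract signed: order placed for ten turbines", "useful"),
    ("Customer places large order for machines", "useful"),
    ("Contact us by phone or email", "other"),
    ("About our company history and management team", "other"),
    ("Careers: join our team today", "other"),
]

if __name__ == "__main__":
    print(knn_classify("Firm receives order for twenty machines", TRAINING_PAGES))
```

A real system along these lines would need far more training pages and, most likely, term weighting and feature selection; the sketch only shows how a nearest-neighbour vote can separate order-related news from other pages.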


Keywords: web content mining, text mining, machine learning, natural language processing





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Viara Popova (1)
  • Robert John (2)
  • David Stockton (1)
  1. Centre for Manufacturing, De Montfort University, Leicester, UK
  2. Centre for Computational Intelligence, De Montfort University, Leicester, UK
