HotMiner: Discovering Hot Topics from Dirty Text

Abstract

For companies with websites that contain millions of documents available to their customers, it is critical to identify their customers’ hottest information needs along with the associated documents. This valuable information gives companies the potential to reduce costs and to be more competitive and responsive to their customers’ needs. In particular, technical support centers could drastically lower the number of support engineers by knowing the topics of their customers’ hot problems (i.e., hot topics) and making them available on their websites along with links to the corresponding solution documents, so that customers could efficiently find the right documents to self-solve their problems. In this chapter we present a novel approach to discovering hot topics of customers’ problems by mining the logs of customer support centers. Our technique for search log mining discovers hot topics that match the user’s perspective, which often differs from the topics derived from document content categorization methods. Our techniques for mining case logs include extracting relevant sentences from cases to form case excerpts, which are more suitable for hot topic mining. In contrast to most text mining work, our approach deals with dirty text containing typos, ad hoc abbreviations, special symbols, incorrect use of English grammar, cryptic tables, and ambiguous or missing punctuation. It includes a variety of techniques that either directly handle some of these anomalies or are robust in spite of them. In particular, we have developed a postfiltering technique to deal with the effects of noisy clickstreams due to random clicking behavior, a Thesaurus Assistant to help in the generation of a thesaurus of “dirty” variations of words that is used to normalize the terminology, and a Sentence Identifier with the capability of eliminating code and tables.
The techniques that compose our approach have been implemented as a toolbox, HotMiner, which has been used in experiments on logs from Hewlett-Packard’s Customer Support Center.
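
The abstract does not say how the Thesaurus Assistant finds “dirty” variations of a word, so the following is only an illustrative sketch, not the chapter's method. It assumes that variants are proposed by Levenshtein edit distance against a small list of canonical terms (edit distance appears among the chapter's keywords). The function names `edit_distance`, `suggest_variants`, and `normalize`, and the `max_distance` threshold, are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def suggest_variants(canonical_terms, observed_words, max_distance=1):
    """Assumed heuristic: attach each observed word to its nearest
    canonical term when it lies within max_distance edits of it."""
    terms = list(canonical_terms)
    thesaurus = {term: [] for term in terms}
    if not terms:
        return thesaurus
    for word in observed_words:
        w = word.lower()
        if w in thesaurus:
            continue  # already a canonical spelling
        best = min(terms, key=lambda t: edit_distance(w, t))
        if edit_distance(w, best) <= max_distance and w not in thesaurus[best]:
            thesaurus[best].append(w)
    return thesaurus


def normalize(tokens, thesaurus):
    """Replace known dirty variants with their canonical term."""
    lookup = {v: term for term, variants in thesaurus.items() for v in variants}
    return [lookup.get(t.lower(), t.lower()) for t in tokens]


thesaurus = suggest_variants(["printer", "driver"],
                             ["printr", "Drivr", "monitor", "printer"])
tokens = normalize(["Printr", "is", "broken"], thesaurus)
```

In a setting like the one the abstract describes, suggestions of this kind would presumably be reviewed by a person before entering the thesaurus, since a small edit distance alone cannot distinguish a typo from a different valid word.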

Keywords

  • Noun Phrase
  • Edit Distance
  • Case Document
  • Content View
  • Sentence Boundary

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© 2004 Springer Science+Business Media New York

About this chapter

Cite this chapter

Castellanos, M. (2004). HotMiner: Discovering Hot Topics from Dirty Text. In: Berry, M.W. (eds) Survey of Text Mining. Springer, New York, NY. https://doi.org/10.1007/978-1-4757-4305-0_6

  • DOI: https://doi.org/10.1007/978-1-4757-4305-0_6

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4419-3057-6

  • Online ISBN: 978-1-4757-4305-0

  • eBook Packages: Springer Book Archive