Abstract
For companies whose websites offer millions of documents to their customers, it is critical to identify the customers’ hottest information needs along with the associated documents. This information gives companies the potential to reduce costs and to be more competitive and responsive to their customers’ needs. In particular, technical support centers could drastically lower the number of support engineers by knowing the topics of their customers’ hot problems (i.e., hot topics) and making them available on their websites, along with links to the corresponding solution documents, so that customers could efficiently find the right documents to solve their problems themselves. In this chapter we present a novel approach to discovering hot topics of customers’ problems by mining the logs of customer support centers. Our technique for search log mining discovers hot topics that match the user’s perspective, which often differs from the topics derived by document content categorization methods. Our techniques for mining case logs include extracting relevant sentences from cases to form case excerpts, which are more suitable for hot topics mining. In contrast to most text mining work, our approach deals with dirty text containing typos, ad hoc abbreviations, special symbols, incorrect English grammar, cryptic tables, and ambiguous or missing punctuation. It includes a variety of techniques that either handle some of these anomalies directly or are robust in spite of them. In particular, we have developed a postfiltering technique to deal with the effects of noisy clickstreams due to random clicking behavior, a Thesaurus Assistant to help generate a thesaurus of “dirty” variations of words that is used to normalize the terminology, and a Sentence Identifier capable of eliminating code and tables.
The techniques that compose our approach have been implemented as a toolbox, HotMiner, which has been used in experiments on logs from Hewlett-Packard’s Customer Support Center.
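To make the idea of normalizing terminology with a thesaurus of “dirty” word variations more concrete, the following is a minimal sketch and not HotMiner's actual method, which the abstract does not detail. It assumes a small clean vocabulary is available and uses approximate string matching from Python's standard `difflib` module as a stand-in for whatever similarity measure the Thesaurus Assistant uses. The function names `build_thesaurus` and `normalize`, the sample vocabulary, and the `cutoff` value are all hypothetical.

```python
from difflib import get_close_matches


def build_thesaurus(tokens, vocabulary, cutoff=0.8):
    """Map each out-of-vocabulary token to its closest clean term, if one exists.

    Assumption: `vocabulary` is a list of correctly spelled domain terms.
    Tokens with no sufficiently similar term are left out of the thesaurus.
    """
    thesaurus = {}
    for token in set(tokens):
        if token in vocabulary:
            continue  # already a clean term
        matches = get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        if matches:
            thesaurus[token] = matches[0]
    return thesaurus


def normalize(text, thesaurus):
    """Lowercase the text and replace known dirty variants with clean terms."""
    return " ".join(thesaurus.get(tok, tok) for tok in text.lower().split())


# Hypothetical vocabulary and log tokens, for illustration only.
vocabulary = ["printer", "driver", "install"]
log_tokens = "printr drivr wont instal printer".split()

thesaurus = build_thesaurus(log_tokens, vocabulary)
print(normalize("Printr drivr wont instal", thesaurus))
# prints "printer driver wont install"
```

In a sketch like this the automatically proposed mappings would still need human review, which is consistent with the abstract describing the tool as an assistant rather than a fully automatic corrector.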
Copyright information
© 2004 Springer Science+Business Media New York
About this chapter
Cite this chapter
Castellanos, M. (2004). HotMiner: Discovering Hot Topics from Dirty Text. In: Berry, M.W. (eds) Survey of Text Mining. Springer, New York, NY. https://doi.org/10.1007/978-1-4757-4305-0_6
DOI: https://doi.org/10.1007/978-1-4757-4305-0_6
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4419-3057-6
Online ISBN: 978-1-4757-4305-0
eBook Packages: Springer Book Archive