HotMiner: Discovering Hot Topics from Dirty Text

Castellanos, Malú

doi:10.1007/978-1-4757-4305-0_6


Abstract

For companies with websites that contain millions of documents available to their customers, it is critical to identify their customers’ hottest information needs along with the associated documents. This valuable information gives companies the potential to reduce costs and to be more competitive and responsive to their customers’ needs. In particular, technical support centers could drastically lower the number of support engineers by knowing the topics of their customers’ hot problems (i.e., hot topics) and making them available on their websites along with links to the corresponding solution documents, so that customers could efficiently find the right documents to self-solve their problems. In this chapter we present a novel approach to discovering hot topics of customers’ problems by mining the logs of customer support centers. Our technique for search log mining discovers hot topics that match the user’s perspective, which is often different from the topics derived by document content categorization methods. Our techniques for mining case logs include extracting relevant sentences from cases to form case excerpts, which are more suitable for hot topics mining. In contrast to most text mining work, our approach deals with dirty text containing typos, ad hoc abbreviations, special symbols, incorrect use of English grammar, cryptic tables, and ambiguous or missing punctuation. It includes a variety of techniques that either directly handle some of these anomalies or are robust in spite of them. In particular, we have developed a postfiltering technique to deal with the effects of noisy clickstreams due to random clicking behavior, a Thesaurus Assistant to help in the generation of a thesaurus of “dirty” variations of words that is used to normalize the terminology, and a Sentence Identifier with the capability of eliminating code and tables.
The techniques that compose our approach have been implemented as a toolbox, HotMiner, which has been used in experiments on logs from Hewlett-Packard’s Customer Support Center.
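
As a rough illustration of the idea of normalizing terminology with a thesaurus of “dirty” word variations, the following minimal Python sketch groups misspelled log tokens under a canonical term and then rewrites text using that mapping. It is not the chapter's Thesaurus Assistant: the function names `propose_variations` and `normalize`, the 0.8 similarity cutoff, and the use of the standard-library `difflib` similarity measure are all assumptions made for this example.

```python
import difflib
from collections import defaultdict


def propose_variations(vocabulary, canonical_terms, cutoff=0.8):
    """Group tokens that look like dirty variants under their closest canonical term.

    Hypothetical helper: string similarity via difflib stands in for whatever
    matching the real Thesaurus Assistant uses.
    """
    thesaurus = defaultdict(set)
    for token in vocabulary:
        if token in canonical_terms:
            continue  # already clean, nothing to map
        # Best single candidate whose similarity ratio is at least `cutoff`.
        match = difflib.get_close_matches(token, canonical_terms, n=1, cutoff=cutoff)
        if match:
            thesaurus[match[0]].add(token)
    return dict(thesaurus)


def normalize(text, thesaurus):
    """Replace every known dirty variant in `text` with its canonical term."""
    variant_to_canonical = {
        variant: canonical
        for canonical, variants in thesaurus.items()
        for variant in variants
    }
    tokens = text.lower().split()
    return " ".join(variant_to_canonical.get(tok, tok) for tok in tokens)


if __name__ == "__main__":
    canonical = ["printer", "driver"]            # assumed clean vocabulary
    seen_in_logs = ["printr", "drivr", "install"]  # assumed tokens from a log
    thesaurus = propose_variations(seen_in_logs, canonical)
    print(thesaurus)
    print(normalize("My PRINTR needs a new drivr", thesaurus))
```

In a setting like the one the abstract describes, the proposed groupings would presumably be reviewed by a person before being used, since a pure similarity threshold can merge distinct words that merely look alike.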

Editor information

Editors and Affiliations

Department of Computer Science, University of Tennessee, 203 Claxton Complex, 37996-3450, Knoxville, TN, USA
Michael W. Berry

About this chapter

Cite this chapter

Castellanos, M. (2004). HotMiner: Discovering Hot Topics from Dirty Text. In: Berry, M.W. (eds) Survey of Text Mining. Springer, New York, NY. https://doi.org/10.1007/978-1-4757-4305-0_6

DOI: https://doi.org/10.1007/978-1-4757-4305-0_6
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4419-3057-6
Online ISBN: 978-1-4757-4305-0
eBook Packages: Springer Book Archive
