Mining Text Data: Special Features and Patterns

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNAI, volume 2447)

Abstract

Text mining is an increasingly important research field because of the need to obtain knowledge from the enormous number of text documents available, especially on the Web. Text mining and data mining, both part of the field of information mining, are similar in some respects, so it may seem that data mining techniques can be adapted to text in a straightforward way. However, data mining deals with structured data, whereas text has special characteristics and is essentially unstructured. In this context, this paper has three aims:

  • To study particular features of text.

  • To identify the patterns we may look for in text.

  • To discuss the tools we may use for that purpose.

In relation to the third point, we give an overview of existing proposals as well as some new tools we are developing by adapting data mining tools previously developed by our research group.

Keywords

  • Data Mining
  • Association Rule
  • Knowledge Discovery
  • Text Mining
  • Textual Data

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Delgado, M., Martín-Bautista, M.J., Sánchez, D., Vila, M.A. (2002). Mining Text Data: Special Features and Patterns. In: Hand, D.J., Adams, N.M., Bolton, R.J. (eds) Pattern Detection and Discovery. Lecture Notes in Computer Science, vol 2447. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45728-3_11

  • DOI: https://doi.org/10.1007/3-540-45728-3_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44148-9

  • Online ISBN: 978-3-540-45728-2

  • eBook Packages: Springer Book Archive
