An Analysis of Constructed Categories for Textual Classification Using Fuzzy Similarity and Agglomerative Hierarchical Methods

  • Chapter
Emergent Web Intelligence: Advanced Semantic Technologies

Abstract

Ambiguity is a challenge for any system that handles natural language. To mitigate the linguistic ambiguity encountered in text classification, this work proposes a text categorizer based on Fuzzy Similarity. The Stars and Cliques clustering algorithms are applied within the Agglomerative Hierarchical method; each identifies groups of texts by applying a relationship rule, creating categories from a similarity analysis of the textual terms. Under the proposed methodology, categories are created from the degree of similarity among the texts to be classified, without the number of categories having to be fixed in advance. The combination of techniques used in the categorizer's steps produced satisfactory results and proved efficient for textual classification.
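As a rough illustration of the idea described in the abstract, the following Python sketch groups texts by pairwise fuzzy similarity without fixing the number of categories beforehand. It is a sketch under stated assumptions, not the chapter's implementation: the abstract does not give the similarity formula, so a fuzzy Jaccard ratio (sum of minima over sum of maxima of term memberships) is assumed here; the term weights, the threshold of 0.5, and the function names `fuzzy_similarity`, `stars` and `cliques` are all hypothetical. The two grouping rules follow the usual textbook descriptions of Stars and Cliques in a simplified greedy, single-pass form, and the sketch does not build the agglomerative hierarchy.

```python
def fuzzy_similarity(a, b):
    """Assumed measure: fuzzy Jaccard over term-membership dicts (values in [0, 1])."""
    terms = set(a) | set(b)
    num = sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    den = sum(max(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    return num / den if den else 0.0


def stars(docs, threshold):
    """Stars rule: the first unassigned text becomes a centre and attracts
    every other unassigned text whose similarity to the centre meets the threshold."""
    unassigned = list(docs)
    clusters = []
    while unassigned:
        centre = unassigned.pop(0)
        members = [centre] + [
            d for d in unassigned
            if fuzzy_similarity(docs[centre], docs[d]) >= threshold
        ]
        unassigned = [d for d in unassigned if d not in members]
        clusters.append(members)
    return clusters


def cliques(docs, threshold):
    """Cliques rule (greedy): a text joins the first group in which it is
    similar enough to every existing member; otherwise it starts a new group."""
    clusters = []
    for name in docs:
        for group in clusters:
            if all(fuzzy_similarity(docs[name], docs[m]) >= threshold for m in group):
                group.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Hypothetical term memberships for four short texts (illustrative values only).
docs = {
    "d1": {"goal": 0.7, "match": 0.6, "team": 0.5},
    "d2": {"goal": 0.9, "match": 0.8},
    "d3": {"match": 0.6, "team": 0.5, "vote": 0.3},
    "d4": {"vote": 0.9, "senate": 0.8},
}

print(stars(docs, 0.5))
print(cliques(docs, 0.5))
```

With these illustrative weights, the Stars rule places d1, d2 and d3 in one category because d2 and d3 are each similar to the centre d1, while the stricter Cliques rule separates d3 because it is not similar enough to d2. In both cases the number of categories follows from the threshold rather than being set in advance.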

Notes

  1. Stopwords are closed classes of words that do not carry meaning, such as articles, pronouns, interjections and prepositions.

  2. Corpus extracted from Terra Networks Brasil S/A.

  3. The complete collection has 1,578 texts; however, not all of these files were available for use. Hence, we used only the 100 texts that are available online.

  4. These files, which come from a wide range of RSS channels of Terra Networks Brasil S/A, were collected daily from February to March 2008.

Author information

Corresponding author

Correspondence to Marcus V. C. Guelpeli.

Copyright information

© 2010 Springer-Verlag London

About this chapter

Cite this chapter

Guelpeli, M.V.C., Bicharra Garcia, A.C., Bernardini, F.C. (2010). An Analysis of Constructed Categories for Textual Classification Using Fuzzy Similarity and Agglomerative Hierarchical Methods. In: Badr, Y., Chbeir, R., Abraham, A., Hassanien, AE. (eds) Emergent Web Intelligence: Advanced Semantic Technologies. Advanced Information and Knowledge Processing. Springer, London. https://doi.org/10.1007/978-1-84996-077-9_11

  • DOI: https://doi.org/10.1007/978-1-84996-077-9_11

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-84996-076-2

  • Online ISBN: 978-1-84996-077-9

  • eBook Packages: Computer Science, Computer Science (R0)
