Abstract
Ambiguity is a challenge for any system that handles natural language. To mitigate the linguistic ambiguity that arises in text classification, this work proposes a text categorizer based on Fuzzy Similarity. The Stars and Cliques clustering algorithms are applied within the Agglomerative Hierarchical method; each identifies groups of texts by applying a relationship rule to the similarity of their terms, and these groups become the categories. Under this methodology, categories emerge from the degree of similarity among the texts to be classified, so the number of categories does not have to be fixed in advance. The combination of techniques used in the categorizer’s steps produced satisfactory results and proved efficient for textual classification.
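The grouping idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the abstract gives no formulas, so the fuzzy similarity used here (sum of minimum term weights divided by sum of maximum term weights), the relative-frequency weighting, the single flat pass in place of the full agglomerative hierarchy, and the helper names `term_weights`, `fuzzy_similarity`, `stars` and `cliques` are all assumptions chosen for illustration. The two relationship rules follow the usual textbook descriptions: in Stars a text joins a group if it is similar enough to the group's centre, while in Cliques it must be similar enough to every member.

```python
from collections import Counter


def term_weights(text):
    """Relative term frequencies of a lower-cased, whitespace-tokenised text
    (an assumed weighting, not taken from the chapter)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def fuzzy_similarity(a, b):
    """Assumed measure: sum of min weights over sum of max weights, in [0, 1]."""
    terms = set(a) | set(b)
    num = sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    den = sum(max(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    return num / den if den else 0.0


def stars(vectors, threshold):
    """Each unassigned text becomes a centre and collects every still-unassigned
    text whose similarity to that centre reaches the threshold."""
    unassigned = list(range(len(vectors)))
    clusters = []
    while unassigned:
        centre = unassigned.pop(0)
        members = [centre]
        for i in unassigned[:]:
            if fuzzy_similarity(vectors[centre], vectors[i]) >= threshold:
                members.append(i)
                unassigned.remove(i)
        clusters.append(members)
    return clusters


def cliques(vectors, threshold):
    """Greedy variant: a text joins a group only if its similarity to every
    current member reaches the threshold."""
    unassigned = list(range(len(vectors)))
    clusters = []
    while unassigned:
        members = [unassigned.pop(0)]
        for i in unassigned[:]:
            if all(fuzzy_similarity(vectors[m], vectors[i]) >= threshold
                   for m in members):
                members.append(i)
                unassigned.remove(i)
        clusters.append(members)
    return clusters


if __name__ == "__main__":
    texts = ["b c", "a b", "c d"]  # toy documents, purely illustrative
    vectors = [term_weights(t) for t in texts]
    print(stars(vectors, 0.3))
    print(cliques(vectors, 0.3))
```

In both functions the number of groups is never supplied; it follows from the similarity threshold, which is the property the abstract highlights. On the toy input, Stars places all three texts in one group because each is similar to the centre, whereas Cliques separates the third text because it shares no terms with the second.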
Notes
1. Stopwords are closed classes of words that carry little meaning on their own, such as articles, pronouns, interjections and prepositions.
2. Corpus extracted from Terra Networks Brasil S/A.
3. The complete collection has 1,578 texts; however, not all of these files were available for use. Hence, we used only the 100 texts that are available online.
4. These files, which come from a wide range of RSS channels of Terra Networks Brasil S/A, were collected daily from February to March 2008.
Copyright information
© 2010 Springer-Verlag London
About this chapter
Cite this chapter
Guelpeli, M.V.C., Bicharra Garcia, A.C., Bernardini, F.C. (2010). An Analysis of Constructed Categories for Textual Classification Using Fuzzy Similarity and Agglomerative Hierarchical Methods. In: Badr, Y., Chbeir, R., Abraham, A., Hassanien, AE. (eds) Emergent Web Intelligence: Advanced Semantic Technologies. Advanced Information and Knowledge Processing. Springer, London. https://doi.org/10.1007/978-1-84996-077-9_11
DOI: https://doi.org/10.1007/978-1-84996-077-9_11
Publisher Name: Springer, London
Print ISBN: 978-1-84996-076-2
Online ISBN: 978-1-84996-077-9
eBook Packages: Computer Science (R0)