An Analysis of Constructed Categories for Textual Classification Using Fuzzy Similarity and Agglomerative Hierarchical Methods

Guelpeli, Marcus V. C.; Bicharra Garcia, Ana Cristina; Bernardini, Flavia Cristina

doi:10.1007/978-1-84996-077-9_11

Part of the book series: Advanced Information and Knowledge Processing (AI&KP)

Abstract

Ambiguity is a challenge faced by systems that handle natural language. To mitigate the linguistic ambiguities found in text classification, this work proposes a text categorizer based on the methodology of Fuzzy Similarity. The clustering algorithms Stars and Cliques are adopted within the Agglomerative Hierarchical method; they identify groups of texts by specifying a relationship rule that creates categories from a similarity analysis of the textual terms. With the suggested methodology, categories can be created from the degree of similarity among the texts to be classified, without needing to fix the number of categories in advance. The combination of techniques proposed in the categorizer's steps produced satisfactory results and proved efficient in textual classification.
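The grouping idea in the abstract can be illustrated with a minimal sketch. It is not the chapter's implementation: the membership weighting (term frequency scaled by the most frequent term), the fuzzy Jaccard similarity (sum of minima over sum of maxima), the threshold value, and the function names `term_memberships`, `fuzzy_similarity`, `stars`, and `cliques` are all assumptions made for illustration. The Stars rule lets a center text attract every text similar enough to it, while the Cliques rule admits a text only if it is similar enough to every current member. Neither rule needs the number of categories up front.

```python
from collections import Counter


def term_memberships(text):
    """Map each term to a membership degree in [0, 1].

    Assumption: degree = term count divided by the count of the most
    frequent term in the text. The chapter may weight terms differently.
    """
    counts = Counter(text.lower().split())
    if not counts:
        return {}
    top = max(counts.values())
    return {term: n / top for term, n in counts.items()}


def fuzzy_similarity(a, b):
    """Fuzzy Jaccard similarity (an assumed measure): sum(min) / sum(max)."""
    terms = set(a) | set(b)
    inter = sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    union = sum(max(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    return inter / union if union else 0.0


def stars(texts, threshold):
    """Stars-style grouping: the first unassigned text becomes a center and
    attracts every unassigned text whose similarity to it meets the threshold."""
    vectors = [term_memberships(t) for t in texts]
    unassigned = list(range(len(texts)))
    clusters = []
    while unassigned:
        center = unassigned.pop(0)
        group, remaining = [center], []
        for i in unassigned:
            if fuzzy_similarity(vectors[center], vectors[i]) >= threshold:
                group.append(i)
            else:
                remaining.append(i)
        unassigned = remaining
        clusters.append(group)
    return clusters


def cliques(texts, threshold):
    """Cliques-style grouping: a text joins a group only if its similarity
    to every current member meets the threshold."""
    vectors = [term_memberships(t) for t in texts]
    unassigned = list(range(len(texts)))
    clusters = []
    while unassigned:
        group, remaining = [unassigned.pop(0)], []
        for i in unassigned:
            if all(fuzzy_similarity(vectors[i], vectors[j]) >= threshold
                   for j in group):
                group.append(i)
            else:
                remaining.append(i)
        unassigned = remaining
        clusters.append(group)
    return clusters


if __name__ == "__main__":
    docs = ["a b c d", "a b", "c d"]  # toy texts, not the chapter's corpus
    print(stars(docs, 0.5))
    print(cliques(docs, 0.5))
```

On the toy input, the second and third texts each overlap with the first but not with each other, so the two rules disagree: Stars merges all three around the first text, whereas Cliques keeps the third text apart. The threshold plays the role of the relationship rule and would need tuning on real data.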

Notes

1. Stopwords are closed classes of words that carry little meaning on their own, such as articles, pronouns, interjections and prepositions.
2. Corpus extracted from Terra Networks Brasil S/A.
3. The complete collection has 1,578 texts; however, not all of these files were available for use. Hence, we used only the 100 texts that are available online.
4. These files, which come from a wide range of RSS channels of Terra Networks Brasil S/A, were collected daily from February to March 2008.

Author information

Authors and Affiliations

Departamento de Ciência da Computação, Instituto de Computação—IC, Universidade Federal Fluminense—UFF, Rua Passo da Pátria 156, Bloco E, 3º andar, São Domingos, Niterói, RJ CEP 24210-240, Brazil
Marcus V. C. Guelpeli & Ana Cristina Bicharra Garcia
Departamento de Ciência e Tecnologia—RCT, Pólo Universitário de Rio das Ostras—PURO, Universidade Federal Fluminense—UFF, Rua Recife, s/n, Jardim Bela Vista, Rio das Ostras, RJ CEP 28890-000, Brazil
Flavia Cristina Bernardini

Corresponding author

Correspondence to Marcus V. C. Guelpeli.

Editor information

Editors and Affiliations

INSA de Lyon, avenue Jean Capelle 7, Villeurbanne CX, 69621, France
Youakim Badr
Fac. Sciences Mirande, UMR CNRS 5158, Université de Bourgogne, Dijon CX, France
Richard Chbeir
Center for Quantifiable Quality of, Norwegian University of Science & Technology, O.S. Bragstads plass 2E, Trondheim, 7491, Norway
Ajith Abraham
College of Business & Administration, Dept. Quantitative Methods &, Kuwait University, Safat, Kuwait
Aboul-Ella Hassanien

About this chapter

Cite this chapter

Guelpeli, M.V.C., Bicharra Garcia, A.C., Bernardini, F.C. (2010). An Analysis of Constructed Categories for Textual Classification Using Fuzzy Similarity and Agglomerative Hierarchical Methods. In: Badr, Y., Chbeir, R., Abraham, A., Hassanien, AE. (eds) Emergent Web Intelligence: Advanced Semantic Technologies. Advanced Information and Knowledge Processing. Springer, London. https://doi.org/10.1007/978-1-84996-077-9_11

DOI: https://doi.org/10.1007/978-1-84996-077-9_11
Publisher Name: Springer, London
Print ISBN: 978-1-84996-076-2
Online ISBN: 978-1-84996-077-9
eBook Packages: Computer Science, Computer Science (R0)
