Abstract
This work briefly describes a novel approach to categorizing semi-structured documents using a fuzzy rule-based system. We propose a fuzzy logic representation for semi-structured documents and a new metric for categorizing documents into classes. The idea behind our approach is to divide web pages into semantic sections and to use a fuzzy logic system to extract features and weight the harvested terms, which together represent each semi-structured document. A set of metrics is also used to measure the similarity between documents based on the weight of each region in the text. A clustering algorithm that groups documents into several categories is also explained. The idea is inspired by the area of matchmaking, which tries to match document creators and users in order to find the best similarities between them and connect them for further collaboration.
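The region-based similarity described above can be illustrated with a minimal sketch. This is not the authors' metric or their fuzzy rule base, neither of which is given in the abstract. It assumes a document is already split into named regions, that each region carries a fixed weight (the names `title`, `headings`, `body` and their values are hypothetical), and that plain cosine similarity over term frequencies stands in for the fuzzy term weighting.

```python
import math
from collections import Counter

# Hypothetical region weights; the abstract does not state the actual values.
REGION_WEIGHTS = {"title": 0.5, "headings": 0.3, "body": 0.2}


def cosine(tf_a, tf_b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty region contributes no similarity
    return dot / (norm_a * norm_b)


def region_similarity(doc_a, doc_b, weights=REGION_WEIGHTS):
    """Weighted average of per-region cosine similarities.

    Each document is a dict mapping a region name to a list of tokens.
    Regions missing from a document are treated as empty.
    """
    total_weight = sum(weights.values())
    score = 0.0
    for region, weight in weights.items():
        tf_a = Counter(doc_a.get(region, []))
        tf_b = Counter(doc_b.get(region, []))
        score += weight * cosine(tf_a, tf_b)
    return score / total_weight


# Example: two pages that share a title but differ in body text.
page_a = {"title": ["fuzzy", "logic"], "body": ["web", "clustering"]}
page_b = {"title": ["fuzzy", "logic"], "body": ["course", "syllabus"]}
similarity = region_similarity(page_a, page_b)
```

In this sketch only the title region matches, so the score is driven entirely by the assumed title weight. A fuzzy rule-based system would replace the fixed weights with values inferred from rules over each term's position and frequency.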
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ensan, A., Biletskiy, Y. (2013). Matching Semi-structured Documents Using Similarity of Regions through Fuzzy Rule-Based System. In: Perner, P. (ed.) Advances in Data Mining. Applications and Theoretical Aspects. ICDM 2013. Lecture Notes in Computer Science, vol 7987. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39736-3_16
DOI: https://doi.org/10.1007/978-3-642-39736-3_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39735-6
Online ISBN: 978-3-642-39736-3
eBook Packages: Computer Science, Computer Science (R0)