Text Classification using Graph Mining-based Feature Extraction

Jiang, Chuntao; Coenen, Frans; Sanderson, Robert; Zito, Michele

doi:10.1007/978-1-84882-983-1_2

Abstract

This paper describes a graph-based approach to document classification. A graph representation allows a much more expressive document encoding than the standard bag-of-words/phrases approach, and consequently gives improved classification accuracy. Document sets are represented as graph sets, to which a weighted graph mining algorithm is applied to extract frequent subgraphs; these are then further processed to produce feature vectors (one per document) for classification. Weighted subgraph mining is used to ensure classification effectiveness and computational efficiency: only the most significant subgraphs are extracted. The approach is validated and evaluated using several popular classification algorithms together with a real-world textual data set. The results demonstrate that the approach can outperform existing text classification algorithms on some data sets. As the size of the data set increases, further processing of the extracted frequent features becomes essential.
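
The pipeline outlined in the abstract (documents to graphs, weighted frequent-pattern extraction, one feature vector per document) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: documents are encoded as word-adjacency graphs, only single-edge subgraphs are mined (a real subgraph miner such as gSpan grows larger patterns), and weighted support is defined as the weight of the graphs containing an edge divided by the total weight. The function names `doc_to_edges`, `frequent_edges` and `to_feature_vectors` are hypothetical.

```python
from collections import Counter


def doc_to_edges(text):
    """Encode a document as a set of directed word-adjacency edges.

    Assumption: this toy graph encoding is for illustration only.
    """
    words = text.lower().split()
    return set(zip(words, words[1:]))


def frequent_edges(graphs, weights, min_weighted_support):
    """Return edges whose weighted support meets the threshold.

    Assumption: weighted support = (sum of weights of graphs containing
    the edge) / (sum of all graph weights). Only single-edge patterns
    are considered here, not general subgraphs.
    """
    total = sum(weights)
    support = Counter()
    for edges, weight in zip(graphs, weights):
        for edge in edges:
            support[edge] += weight
    return sorted(e for e, s in support.items() if s / total >= min_weighted_support)


def to_feature_vectors(graphs, features):
    """Build one binary vector per document: 1 if the pattern occurs, else 0."""
    return [[1 if f in edges else 0 for f in features] for edges in graphs]


docs = [
    "graph mining finds frequent subgraphs",
    "graph mining extracts features",
    "text classification uses features",
]
weights = [1.0, 1.0, 1.0]  # hypothetical per-document weights

graphs = [doc_to_edges(d) for d in docs]
features = frequent_edges(graphs, weights, min_weighted_support=0.6)
vectors = to_feature_vectors(graphs, features)
```

The resulting vectors could then be passed to any standard classifier. Raising `min_weighted_support` keeps fewer patterns, which is the lever the abstract refers to when it says that only the most significant subgraphs are extracted.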

Author information

Authors and Affiliations

Department of Computer Science, The University of Liverpool, Ashton Building, Ashton Street, L69 3BX, Liverpool, UK
Chuntao Jiang, Frans Coenen, Robert Sanderson & Michele Zito

Corresponding author

Correspondence to Chuntao Jiang.

Editor information

Editors and Affiliations

Dept. Computer Science and, University of Portsmouth, Lion Terrace, Portsmouth, PO1 3HE, United Kingdom
Max Bramer
Stratum Management Ltd., Southbrook Place 11, Micheldever, Hants., SO21 3DE, United Kingdom
Richard Ellis
School of Computing &, University of Greenwich, Park Row 30, London, SE10 9LS, United Kingdom
Miltos Petridis

About this paper

Cite this paper

Jiang, C., Coenen, F., Sanderson, R., Zito, M. (2010). Text Classification using Graph Mining-based Feature Extraction. In: Bramer, M., Ellis, R., Petridis, M. (eds) Research and Development in Intelligent Systems XXVI. Springer, London. https://doi.org/10.1007/978-1-84882-983-1_2

DOI: https://doi.org/10.1007/978-1-84882-983-1_2
Published: 19 October 2009
Publisher Name: Springer, London
Print ISBN: 978-1-84882-982-4
Online ISBN: 978-1-84882-983-1
eBook Packages: Computer Science, Computer Science (R0)
