Text Classification using Graph Mining-based Feature Extraction

  • Conference paper
Research and Development in Intelligent Systems XXVI

Abstract

This paper describes a graph-based approach to document classification. A graph representation allows a much more expressive document encoding than the more standard bag of words/phrases approach, and consequently improves classification accuracy. Document sets are represented as graph sets, to which a weighted graph mining algorithm is applied to extract frequent subgraphs; these are then further processed to produce feature vectors (one per document) for classification. Weighted subgraph mining is used to ensure both classification effectiveness and computational efficiency, since only the most significant subgraphs are extracted. The approach is validated and evaluated using several popular classification algorithms together with a real-world textual data set. The results demonstrate that the approach can outperform existing text classification algorithms on some datasets. As the size of the dataset increases, further processing of the extracted frequent features becomes essential.
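The pipeline outlined in the abstract (documents to graphs, weighted frequent pattern extraction, one feature vector per document) can be illustrated with a deliberately simplified sketch. It is not the authors' algorithm: it assumes each document graph is just the set of directed edges between adjacent words, it mines only single-edge subgraphs rather than growing larger ones, and the weighting scheme in `weighted_support` is a hypothetical stand-in, as are all function names.

```python
from collections import Counter


def doc_to_edges(text):
    """Encode a document as a multiset of directed edges between adjacent words.

    Assumption: a word-adjacency graph; the paper's actual encoding may differ.
    """
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


def weighted_support(edge, graphs):
    """Hypothetical weighting: mean relative frequency of the edge per document."""
    total = 0.0
    for g in graphs:
        n = sum(g.values())
        if n:
            total += g[edge] / n
    return total / len(graphs)


def mine_frequent_edges(graphs, min_wsup):
    """Keep single-edge subgraphs whose weighted support meets the threshold."""
    candidates = set().union(*graphs)
    return sorted(e for e in candidates if weighted_support(e, graphs) >= min_wsup)


def to_feature_vectors(graphs, features):
    """Build one binary vector per document: 1 if the document contains the edge."""
    return [[1 if g[f] > 0 else 0 for f in features] for g in graphs]


docs = [
    "graph mining finds frequent subgraphs",
    "graph mining extracts features",
    "text classification uses features",
]
graphs = [doc_to_edges(d) for d in docs]
features = mine_frequent_edges(graphs, min_wsup=0.15)
vectors = to_feature_vectors(graphs, features)
print(features)
print(vectors)
```

The resulting vectors can be passed to any standard classifier. A real frequent subgraph miner would extend the surviving edges into larger connected patterns and prune candidates by weighted support at each step; this sketch stops at size one to stay short.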


Author information

Corresponding author

Correspondence to Chuntao Jiang.

Copyright information

© 2010 Springer-Verlag London

About this paper

Cite this paper

Jiang, C., Coenen, F., Sanderson, R., Zito, M. (2010). Text Classification using Graph Mining-based Feature Extraction. In: Bramer, M., Ellis, R., Petridis, M. (eds) Research and Development in Intelligent Systems XXVI. Springer, London. https://doi.org/10.1007/978-1-84882-983-1_2

  • DOI: https://doi.org/10.1007/978-1-84882-983-1_2

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-84882-982-4

  • Online ISBN: 978-1-84882-983-1

  • eBook Packages: Computer Science, Computer Science (R0)
