Learning Structural Representations of Text Documents in Large Document Collections

  • Chapter
Handbook on Neural Information Processing

Part of the book series: Intelligent Systems Reference Library ((ISRL,volume 49))

Abstract

The main aim of this chapter is to study the effects of structural representations of text documents when a connectionist approach is applied to modelling the domain. Although text documents are often processed as unstructured data, we show in this chapter that the performance and problem-solving capability of machine learning methods can be enhanced through the use of suitable structural representations of text documents. We also show that extracting structure from text documents does not require knowledge of the underlying semantic relationships among the words used in the text. This chapter describes an extension of the bag of words approach. Incorporating the “relatedness” of word tokens as they are used in the context of a document yields a structural representation of text documents that is richer in information than the bag of words representation alone. An application to very large datasets, for one classification problem and one regression problem, shows that our approach scales very well. The classification problem is tackled by the latest in a series of techniques that apply the idea of the self-organizing map to graph domains. It is shown that, when the relatedness information expressed by the Concept Link Graph is incorporated, the resulting clusters are tighter than those obtained using a self-organizing map with a bag of words representation alone. The regression problem is to rank a text corpus. Here the idea is to include content information in the ranking of documents and to compare the resulting rankings with those obtained using PageRank. The results are inconclusive, possibly because the Concept Link Graph representations were truncated. It is conjectured that the ranking of documents will be sped up if we include the Concept Link Graph representation of all documents together with their hyperlinked structure.
The methods described in this chapter are capable of solving real-world and data mining problems.
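
To make the contrast between a plain bag of words and a representation that also records word relatedness concrete, the following minimal Python sketch may help. It is not the chapter's Concept Link Graph construction, which the abstract does not specify. It assumes, purely for illustration, that two tokens are "related" when they occur within a small sliding window of each other. The function names `tokenize`, `bag_of_words`, and `cooccurrence_graph`, and the `window` parameter, are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [t for t in tokens if t]


def bag_of_words(tokens):
    """Plain bag of words: token counts, with word order discarded."""
    return Counter(tokens)


def cooccurrence_graph(tokens, window=3):
    """Weighted undirected edges between distinct tokens that fall inside
    the same sliding window of `window` consecutive tokens.

    Assumption: window co-occurrence stands in for "relatedness" here;
    the chapter's own measure may differ.
    """
    edges = Counter()
    for i, left in enumerate(tokens):
        # Pair the token at position i with the next (window - 1) tokens.
        for right in tokens[i + 1 : i + window]:
            if left != right:
                edges[tuple(sorted((left, right)))] += 1
    return edges


if __name__ == "__main__":
    tokens = tokenize("Graph neural networks learn graph structure.")
    print(bag_of_words(tokens))
    print(cooccurrence_graph(tokens, window=3))
```

The bag of words keeps only the node labels and their counts, whereas the edge weights retain some information about how tokens are used together in the document. That additional information is the kind a graph-based learner could exploit.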

Author information

Corresponding author

Correspondence to Ah Chung Tsoi.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Tsoi, A.C., Hagenbuchner, M., Kc, M., Zhang, S. (2013). Learning Structural Representations of Text Documents in Large Document Collections. In: Bianchini, M., Maggini, M., Jain, L. (eds) Handbook on Neural Information Processing. Intelligent Systems Reference Library, vol 49. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36657-4_14

  • DOI: https://doi.org/10.1007/978-3-642-36657-4_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-36656-7

  • Online ISBN: 978-3-642-36657-4

  • eBook Packages: Engineering, Engineering (R0)
