Configurable Application Designed for Mining XML Document Collections

Chapter
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 110)

Abstract

In this chapter we present a flexible, configurable application for mining large XML document collections. The work centers on extracting document features related to structure and content. This extraction produces an attribute frequency matrix which, depending on the clustering algorithm, is transformed and/or used to obtain similarity measures.
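The pipeline outlined in the abstract (feature extraction, attribute frequency matrix, similarity measures) can be illustrated with a minimal sketch. The abstract does not say which features or similarity measure the application uses, so everything below is an assumption made for illustration: root-to-node tag paths stand in for structural features, lowercased word tokens stand in for content features, cosine similarity stands in for the similarity measure, and the function names and sample documents are hypothetical.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter


def count_words(text, counts):
    """Add lowercased word tokens from a text fragment to the counter."""
    for word in re.findall(r"[a-z0-9]+", (text or "").lower()):
        counts[f"word:{word}"] += 1


def extract_features(xml_text):
    """Count assumed structural features (tag paths) and content features (words)."""
    counts = Counter()

    def walk(node, path):
        current = f"{path}/{node.tag}"
        counts[f"path:{current}"] += 1      # structural feature
        count_words(node.text, counts)      # text before the first child
        for child in node:
            walk(child, current)
            count_words(child.tail, counts)  # text following the child element

    walk(ET.fromstring(xml_text), "")
    return counts


def build_frequency_matrix(documents):
    """Return the sorted attribute list and one row of frequencies per document."""
    per_doc = [extract_features(doc) for doc in documents]
    attributes = sorted(set().union(*per_doc))
    matrix = [[counts[a] for a in attributes] for counts in per_doc]
    return attributes, matrix


def cosine_similarity(u, v):
    """Cosine of the angle between two frequency rows (0.0 if either row is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical sample collection: two similar articles and one unrelated recipe.
docs = [
    "<article><title>XML mining</title><body>clustering XML documents</body></article>",
    "<article><title>XML clustering</title><body>mining documents</body></article>",
    "<recipe><name>bread</name><step>mix flour</step></recipe>",
]
attributes, matrix = build_frequency_matrix(docs)
sim_01 = cosine_similarity(matrix[0], matrix[1])
sim_02 = cosine_similarity(matrix[0], matrix[2])
```

In this sketch the two articles share all tag paths and most words, so their similarity is high, while the recipe shares no attribute with them and scores zero. A clustering algorithm could consume either the matrix rows directly or the pairwise similarities, which is one way to read the abstract's "transformed and/or used to obtain similarity measures".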

Keywords

Clustering; Conceptual framework; Methodology; Modularity; UML; XML mining

Notes

Acknowledgments

The authors acknowledge the financial support given to the Promep group “Analysis and Information Management” and thank the Universidad Autónoma Metropolitana, Unidad Azcapotzalco (UAM-A) for funding the project “Study and Systems Modelling,” Number: 2270217. They thank Dr. Nicolas Dominguez Vergara (former head of the Systems Department) and Dr. Silvia Gonzalez Brambila (former Coordinator of the Master’s Computer Science Program at the UAM Azcapotzalco) for their support and encouragement. They also thank Peggy Currid for her copyediting advice.

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. Laboratoire en Sciences et Technologies de l’Information, Institut H. Fayol, Ecole Nationale Supérieure des Mines de Saint Etienne, Saint Etienne Cedex 2, France
  2. Computer Science, MSE Program, Universidad Autónoma Metropolitana, Unidad Azcapotzalco, Mexico, USA
  3. Departamento de Sistemas, Universidad Autónoma Metropolitana, Unidad Azcapotzalco, Mexico D.F., Mexico
