The GENIE System: Classifying Documents by Combining Mixed-Techniques

  • Conference paper
Web Information Systems and Technologies (WEBIST 2014)

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 226)

Abstract

Automatic text classification remains an open problem, and deploying it in companies and organizations that hold large volumes of textual data is not trivial. Achieving good results depends on many factors, such as the language, the context, the level of domain knowledge involved, the format of the documents, and the type of language used in the documents to be classified. In this paper we describe GENIE, a multi-language, rule-based pipeline system for automatic document categorization. We tested the system on several business corpora, and we applied different stages of the pipeline to the same data to measure the influence of each step on the categorization process. The results are very promising, and GENIE is already in use in real production environments with very good results.

Notes

  1. http://www.w3.org/TR/2006/WD-rdf-sparql-query-20061004/

  2. http://www.heraldo.es/

  3. http://www.diariodenavarra.es/

  4. http://www.heraldodesoria.es/

  5. http://nlp.lsi.upc.edu/freeling/

  6. http://www.geonames.org/

Acknowledgements

This research work has been supported by the CICYT project TIN2013-46238-C4-4-R and DGA-FSE. We thank the Heraldo Group and Diario de Navarra.

Author information

Corresponding author

Correspondence to Angel L. Garrido.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Garrido, A.L., Buey, M.G., Escudero, S., Peiro, A., Ilarri, S., Mena, E. (2015). The GENIE System: Classifying Documents by Combining Mixed-Techniques. In: Monfort, V., Krempels, KH. (eds) Web Information Systems and Technologies. WEBIST 2014. Lecture Notes in Business Information Processing, vol 226. Springer, Cham. https://doi.org/10.1007/978-3-319-27030-2_15

  • DOI: https://doi.org/10.1007/978-3-319-27030-2_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27029-6

  • Online ISBN: 978-3-319-27030-2

  • eBook Packages: Computer Science (R0)
