Beyond the Bag of Words: A Text Representation for Sentence Selection

  • Maria Fernanda Caropreso
  • Stan Matwin
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4013)


Sentence selection shares some, but not all, of the characteristics of automatic text categorization. Therefore some, but not all, of the same techniques should be used. In this paper we study a syntactically and semantically enriched text representation for the sentence selection task in a genomics corpus. We show that using technical dictionaries and syntactic relations is beneficial for this problem when using state-of-the-art machine learning algorithms. Furthermore, the syntactic relations can be used by a first-order rule learner to obtain even better performance.
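
As a rough illustration of what an enriched representation of this kind could look like, the sketch below combines plain word features with technical-dictionary matches and syntactic-relation features for a single sentence. It is one possible reading of the abstract, not the authors' implementation: the function name `enriched_features`, the feature-name prefixes, the example dictionary, and the relation triples (assumed to come from some external parser) are all hypothetical.

```python
from collections import Counter


def enriched_features(tokens, dictionary_terms, relations):
    """Build a sparse feature count for one sentence.

    tokens: list of lowercased word strings (the bag-of-words part).
    dictionary_terms: set of technical terms, possibly multi-word
        (hypothetical stand-in for a technical dictionary).
    relations: iterable of (label, head, dependent) triples, assumed to be
        produced by an external syntactic parser (hypothetical format).
    """
    # Plain bag-of-words features.
    features = Counter(f"w={t}" for t in tokens)

    # Dictionary features: a term fires if it occurs as a whole-word
    # sequence in the sentence.
    padded = " " + " ".join(tokens) + " "
    for term in dictionary_terms:
        if f" {term} " in padded:
            features[f"term={term}"] += 1

    # Syntactic-relation features: one feature per relation triple.
    for label, head, dependent in relations:
        features[f"rel={label}({head},{dependent})"] += 1

    return features


# Illustrative input only; the sentence, terms, and relations are made up.
example = enriched_features(
    tokens=["sigk", "activates", "transcription", "of", "gere"],
    dictionary_terms={"sigk", "gere"},
    relations=[("subj", "activates", "sigk")],
)
```

The resulting counts could be fed to a propositional learner as a feature vector, while the relation triples on their own are the kind of structured input a first-order rule learner could consume directly.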


Noun Phrase; Semantic Knowledge; Inductive Logic Programming; Latent Semantic Indexing; Prepositional Phrase
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Maria Fernanda Caropreso
    • 1
  • Stan Matwin
    • 1
    • 2
  1. School of Information Technology and Engineering, University of Ottawa, Ottawa, Ontario
  2. Institute for Computer Science, Polish Academy of Sciences, Warsaw
