Knowledge and Information Systems, Volume 15, Issue 3, pp 285–320

Boosting text segmentation via progressive classification

  • Eugenio Cesario
  • Francesco Folino
  • Antonio Locane
  • Giuseppe Manco (Email author)
  • Riccardo Ortale
Regular Paper


A novel approach for reconciling tuples stored as free text into an existing attribute schema is proposed. The basic idea is to subject the available text to progressive classification, i.e., a multi-stage classification scheme where, at each intermediate stage, a classifier is learnt that analyzes the textual fragments not reconciled at the end of the previous steps. Classification is accomplished by an ad hoc exploitation of traditional association mining algorithms, and is supported by a data transformation scheme which takes advantage of domain-specific dictionaries/ontologies. A key feature is the capability of progressively enriching the available ontology with the results of the previous stages of classification, thus significantly improving the overall classification accuracy. An extensive experimental evaluation shows the effectiveness of our approach.
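The progressive scheme described above can be illustrated with a toy sketch. It is not the authors' algorithm: the hand-written neighbour rules below stand in for the association-rule classifiers learnt at each stage, a plain token-to-attribute dictionary stands in for the domain ontology, and all names (`dictionary_stage`, `context_stage`, `progressive_classify`, the `NAME`/`ADDRESS` attributes, the sample records) are hypothetical. What the sketch does show is the loop itself: each stage looks only at tokens left unreconciled by earlier stages, and labels found so far are fed back into the dictionary.

```python
from typing import Dict, List, Optional, Tuple

Label = Optional[str]  # None marks a token that is still unreconciled


def dictionary_stage(tokens: List[str], labels: List[Label],
                     dictionary: Dict[str, str]) -> List[Label]:
    """Label unreconciled tokens that appear in the (possibly enriched) dictionary."""
    return [lab if lab is not None else dictionary.get(tok.lower())
            for tok, lab in zip(tokens, labels)]


def context_stage(labels: List[Label],
                  rules: Dict[Tuple[Label, Label], str]) -> List[Label]:
    """Label unreconciled tokens from the labels of their left/right neighbours.

    Assumption: fixed rules replace the classifier learnt at each stage.
    """
    out = list(labels)
    for i, lab in enumerate(labels):
        if lab is not None:
            continue
        prev_lab = labels[i - 1] if i > 0 else None
        next_lab = labels[i + 1] if i + 1 < len(labels) else None
        out[i] = rules.get((prev_lab, next_lab))
    return out


def progressive_classify(records: List[List[str]],
                         dictionary: Dict[str, str],
                         rules: Dict[Tuple[Label, Label], str],
                         max_stages: int = 3):
    """Run up to max_stages stages, enriching the dictionary after each record."""
    dictionary = dict(dictionary)  # do not mutate the caller's dictionary
    labelled: List[List[Label]] = [[None] * len(r) for r in records]
    for _ in range(max_stages):
        changed = False
        for idx, tokens in enumerate(records):
            before = labelled[idx]
            after = context_stage(dictionary_stage(tokens, before, dictionary), rules)
            if after != before:
                changed = True
            labelled[idx] = after
            # Enrichment: reconciled tokens become dictionary entries.
            for tok, lab in zip(tokens, after):
                if lab is not None:
                    dictionary.setdefault(tok.lower(), lab)
        if not changed:  # nothing new was reconciled, so stop early
            break
    return labelled, dictionary


# Hypothetical data: two free-text tuples and a two-attribute schema.
records = [["Luca", "Rossi", "Roma"],
           ["Mario", "Rossi", "via", "Roma"]]
seed_dictionary = {"mario": "NAME", "via": "ADDRESS"}
rules = {("NAME", "ADDRESS"): "NAME",
         ("ADDRESS", None): "ADDRESS",
         (None, "NAME"): "NAME"}

labels, enriched = progressive_classify(records, seed_dictionary, rules)
```

In this toy run the first record cannot be labelled at all in the first stage, because none of its tokens are in the seed dictionary. It is reconciled only in the second stage, after the labels obtained for the second record have been added to the dictionary.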


Keywords: Schema reconciliation; Text segmentation; Classification





Copyright information

© Springer-Verlag London Limited 2007

Authors and Affiliations

  • Eugenio Cesario (1)
  • Francesco Folino (1)
  • Antonio Locane (1)
  • Giuseppe Manco (1), Email author
  • Riccardo Ortale (1)

  1. ICAR-CNR, Rende (CS), Italy
