Automatic Legal Document Analysis: Improving the Results of Information Extraction Processes Using an Ontology

  • María G. Buey
  • Cristian Roman
  • Angel Luis Garrido
  • Carlos Bobed
  • Eduardo Mena
Part of the Studies in Big Data book series (SBD, volume 40)


Information Extraction (IE) is a pervasive task in industry that automatically obtains structured data from natural-language documents. Current software systems focused on this activity are able to extract a large percentage of the required information, but they do not usually focus on the quality of the extracted data. In this paper, we present an approach focused on validating and improving the quality of the results of an IE system. Our proposal is based on ontologies that store domain knowledge, which we leverage to detect and resolve consistency errors in the extracted data. We have implemented our approach to run against the output of the AIS system, an IE system specialized in analyzing legal documents, and we have tested it using a real dataset. Preliminary results confirm the value of our approach.


Information extraction, Natural language processing, Ontologies, Data curation, Legal document analysis



This research work has been supported by projects TIN2013-46238-C4-4-R, TIN2016-78011-C4-3-R (AEI/FEDER, UE), and DGA/FEDER.



Copyright information

© Springer International Publishing AG, part of Springer Nature 2019

Authors and Affiliations

  • María G. Buey (1)
  • Cristian Roman (1)
  • Angel Luis Garrido (2)
  • Carlos Bobed (2)
  • Eduardo Mena (2)
  1. InSynergy Consulting S.A., Madrid, Spain
  2. Department of Computer Science and System Engineering, University of Zaragoza, Zaragoza, Spain
