
Adding Missing Words to Regular Expressions

  • Thomas Rebele
  • Katerina Tzompanaki
  • Fabian M. Suchanek
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10938)

Abstract

Regular expressions (regexes) are patterns that are used in many applications to extract words or tokens from text. However, even hand-crafted regexes may fail to match all the intended words. In this paper, we propose a novel way to generalize a given regex so that it also matches a set of missing (previously unmatched) words. Our method finds an approximate match between the missing words and the regex, and adds disjunctions for the unmatched parts where appropriate. We show that, on various datasets, this method can not only improve the precision and recall of the regex but also generate much shorter regexes than baselines and competitors.
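To make the problem setting concrete, the following is a minimal Python sketch of the simplest conceivable repair: appending each missing word to the regex as a literal alternative. It is not the method proposed in the paper, which aligns each missing word with the regex and inserts disjunctions only for the unmatched parts. The function name add_missing_words and the example phone-number pattern are hypothetical and chosen only for illustration.

```python
import re


def add_missing_words(regex: str, missing: list[str]) -> str:
    """Naive repair (illustrative only, not the paper's algorithm):
    return a regex that matches everything `regex` matches, plus each
    word in `missing` that `regex` does not already match in full."""
    pattern = re.compile(regex)
    # Keep only words that are really unmatched, dropping duplicates
    # while preserving their order.
    unmatched = [w for w in dict.fromkeys(missing)
                 if pattern.fullmatch(w) is None]
    if not unmatched:
        return regex
    # Escape each word so it is matched literally, then add it as a
    # top-level alternative next to the original regex.
    alternatives = "|".join(re.escape(w) for w in unmatched)
    return f"(?:{regex}|{alternatives})"


# Hypothetical example: a phone-number regex that misses a variant
# written with a space instead of a hyphen.
generalized = add_missing_words(r"\d{3}-\d{4}", ["555 1234", "555-1234"])
print(generalized)
```

The repaired regex accepts exactly the listed words and nothing similar to them, and it grows by one alternative per missing word. An approach that adds disjunctions only where a word deviates from the regex can plausibly stay much shorter, which is the kind of gain over baselines that the abstract reports.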

Notes

Acknowledgments

This research was partially supported by Labex DigiCosme (project ANR-11-LABEX-0045-DIGICOSME) operated by ANR as part of the program “Investissement d’Avenir” Idex Paris-Saclay (ANR-11-IDEX-0003-02).

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Thomas Rebele (1)
  • Katerina Tzompanaki (2)
  • Fabian M. Suchanek (1)

  1. Télécom ParisTech, Paris, France
  2. ETIS lab/ENSEA/Cergy-Pontoise University/CNRS, Cergy-Pontoise, France
