Integrating Approximate String Matching with Phonetic String Similarity

  • Junior Ferri
  • Hegler Tissot
  • Marcos Didonet Del Fabro
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11019)


Well-defined dictionaries of tagged entities are used in many tasks to identify entities when the scope is limited and machine learning is unnecessary. A common solution is to encode the input dictionary into a Trie and use it to find matches in an input text. However, large dictionaries and spelling errors in the input tokens degrade such solutions. We present an approach that transforms the dictionary and each input token into a compact, well-known phonetic representation. The resulting dictionary is encoded in a Trie that is about 72% smaller than a non-phonetic Trie. We perform inexact matching over this representation to filter an initial set of candidate results. Finally, we apply a second similarity measure to select the best candidate for annotating a given entity. In our experiments, the approach achieved good F1 results. The solution was developed as an entity recognition plug-in for GATE, a well-known information extraction framework.
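
The pipeline described in the abstract (phonetic encoding, a Trie over the encoded dictionary, inexact matching, then a second similarity measure) can be illustrated with a minimal Python sketch. It is not the authors' implementation or the GATE plug-in, and it rests on several assumptions: `phonetic_key` is a toy stand-in rather than real Metaphone, the inexact search is a plain edit-distance traversal of the Trie rather than the active-node technique named in the keywords, and the second measure is `difflib.SequenceMatcher` on the original strings. All function names, the distance bound, and the score threshold are hypothetical.

```python
from difflib import SequenceMatcher


def phonetic_key(word):
    """Toy phonetic code (assumption: NOT Metaphone, only a stand-in)."""
    word = word.lower()
    # Collapse a few similar-sounding spellings.
    for src, dst in (("ph", "f"), ("ck", "k"), ("c", "k"), ("z", "s"), ("y", "i")):
        word = word.replace(src, dst)
    out = []
    for ch in word:
        if not ch.isalpha():
            continue
        if out and ch in "aeiou":      # drop vowels after the first kept letter
            continue
        if out and out[-1] == ch:      # collapse repeated letters
            continue
        out.append(ch)
    return "".join(out)


class TrieNode:
    def __init__(self):
        self.children = {}
        self.entries = []  # dictionary entries whose phonetic key ends here


def build_trie(entries):
    """Index each dictionary entry under its phonetic key."""
    root = TrieNode()
    for entry in entries:
        node = root
        for ch in phonetic_key(entry):
            node = node.children.setdefault(ch, TrieNode())
        node.entries.append(entry)
    return root


def fuzzy_lookup(root, key, max_dist):
    """Return entries whose phonetic key is within max_dist edits of key."""
    results = []

    def walk(node, ch, prev_row):
        # One Levenshtein row per Trie edge, shared by all words on the path.
        row = [prev_row[0] + 1]
        for j in range(1, len(key) + 1):
            cost = 0 if key[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1, prev_row[j] + 1, prev_row[j - 1] + cost))
        if row[-1] <= max_dist:
            results.extend(node.entries)
        if min(row) <= max_dist:       # prune branches that cannot recover
            for nxt, child in node.children.items():
                walk(child, nxt, row)

    first_row = list(range(len(key) + 1))
    for ch, child in root.children.items():
        walk(child, ch, first_row)
    return results


def best_match(root, token, max_dist=1, min_score=0.6):
    """Phonetic filter first, then a second similarity on the raw strings."""
    candidates = fuzzy_lookup(root, phonetic_key(token), max_dist)
    scored = [(SequenceMatcher(None, token.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, entry = max(scored, default=(0.0, None))
    return entry if score >= min_score else None


trie = build_trie(["Curitiba", "Philip", "Madrid"])
```

With this toy encoding, `best_match(trie, "Kuritiba")` resolves to `"Curitiba"` because both spellings share the same phonetic key, while a token with no phonetically close entry, such as `"Berlin"`, yields `None`. Because many spellings collapse to one key, the phonetic Trie stores fewer distinct paths than a Trie over the raw strings, which is the intuition behind the size reduction the abstract reports.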


Keywords: Entity recognition · Metaphone · Text tagging · Trie · Active nodes · Fast similarity search



This work was partially funded by Project Sistema de Monitoramento de Políticas de Promoção da Igualdade Racial (SNPPIR).



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Junior Ferri (1)
  • Hegler Tissot (1)
  • Marcos Didonet Del Fabro (1), email author

  1. C3SL Labs, Universidade Federal do Paraná, Curitiba, Brazil
