Abstract
Well-defined dictionaries of tagged entities are used in many tasks to identify entities when the scope is limited and machine learning is unnecessary. One common solution is to encode the input dictionary into a Trie and use it to find matches in an input text. However, large dictionaries and spelling errors in the input tokens degrade such solutions. We present an approach that transforms the dictionary and each input token into a compact, well-known phonetic representation. The resulting dictionary is encoded in a Trie that is about 72% smaller than a non-phonetic Trie. We perform inexact matching over this representation to obtain an initial set of candidate results. Lastly, we apply a second similarity measure to select the best candidate with which to annotate a given entity. Experiments showed that the approach achieved good F1 results. The solution was developed as an entity recognition plug-in for GATE, a well-known information extraction framework.
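The pipeline described above (phonetic encoding, a Trie over the encoded dictionary, inexact matching, then a second similarity filter) can be illustrated with a minimal sketch. It is not the authors' implementation: `phonetic_key` is a toy stand-in for Metaphone, the inexact match is a plain edit-distance-bounded Trie traversal, and the second measure is Python's `difflib` ratio. All function names, parameters, and thresholds are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def phonetic_key(word):
    """Toy phonetic key (an assumption, NOT Metaphone): uppercase the word,
    keep its first letter, drop later vowels, then collapse repeated letters."""
    letters = [c for c in word.upper() if c.isalpha()]
    key = []
    for i, c in enumerate(letters):
        if i > 0 and c in "AEIOU":
            continue  # vowels after the first letter are dropped
        if key and key[-1] == c:
            continue  # collapse runs of the same kept letter
        key.append(c)
    return "".join(key)


class TrieNode:
    def __init__(self):
        self.children = {}
        self.entries = []  # original dictionary entries whose key ends here


def build_trie(entries):
    """Index every dictionary entry under its phonetic key."""
    root = TrieNode()
    for entry in entries:
        node = root
        for ch in phonetic_key(entry):
            node = node.children.setdefault(ch, TrieNode())
        node.entries.append(entry)
    return root


def fuzzy_search(root, key, max_dist):
    """Return entries whose phonetic key is within max_dist edits of key.
    One Levenshtein row is computed per Trie node; a branch is pruned as
    soon as every cell in its row exceeds max_dist."""
    results = []
    first_row = list(range(len(key) + 1))

    def walk(node, ch, prev_row):
        row = [prev_row[0] + 1]
        for j in range(1, len(key) + 1):
            cost = 0 if key[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1, prev_row[j] + 1, prev_row[j - 1] + cost))
        if node.entries and row[-1] <= max_dist:
            results.extend(node.entries)
        if min(row) <= max_dist:
            for next_ch, child in node.children.items():
                walk(child, next_ch, row)

    for ch, child in root.children.items():
        walk(child, ch, first_row)
    return results


def best_match(root, token, max_dist=1, min_ratio=0.7):
    """Second filter: rank phonetic candidates by similarity of the original
    strings and return the best one, or None if nothing passes min_ratio."""
    candidates = fuzzy_search(root, phonetic_key(token), max_dist)
    scored = [(SequenceMatcher(None, token.lower(), c.lower()).ratio(), c)
              for c in candidates]
    scored = [s for s in scored if s[0] >= min_ratio]
    return max(scored)[1] if scored else None
```

For example, with `trie = build_trie(["Curitiba", "Brasilia", "Salvador"])`, the misspelled token in `best_match(trie, "Kuritiba")` resolves to `"Curitiba"`, while `best_match(trie, "London")` returns `None`. Because the phonetic keys are shorter than the original strings, the Trie holds fewer nodes than one built over the raw entries, which is the effect the abstract quantifies for real Metaphone codes.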
Keywords
- Entity recognition
- Metaphone
- Text tagging
- Trie
- Active nodes
- Fast similarity search
Notes
- 4. Implementation from the commons-codec-1.10.jar library, available at https://commons.apache.org/proper/commons-codec/download_codec.cgi.
- 5. Implementation from lucene-suggest-5.2.1.jar, available at http://lucene.apache.org/.
Acknowledgments
This work was partially funded by Project Sistema de Monitoramento de Políticas de Promoção da Igualdade Racial (SNPPIR).
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Ferri, J., Tissot, H., Del Fabro, M.D. (2018). Integrating Approximate String Matching with Phonetic String Similarity. In: Benczúr, A., Thalheim, B., Horváth, T. (eds) Advances in Databases and Information Systems. ADBIS 2018. Lecture Notes in Computer Science, vol 11019. Springer, Cham. https://doi.org/10.1007/978-3-319-98398-1_12
DOI: https://doi.org/10.1007/978-3-319-98398-1_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98397-4
Online ISBN: 978-3-319-98398-1
eBook Packages: Computer Science, Computer Science (R0)