Application of the Near Miss Strategy and Edit Distance to Handle Dirty Data

Varol, Cihan; Bayrak, Coskun; Wagner, Rick; Goff, Dana

doi:10.1007/978-1-4419-0176-7_5

Chapter
First Online: 01 January 2009

Part of the book series: International Series in Operations Research & Management Science (ISOR, volume 132)

Abstract

In today’s information age, processing customer information in a standardized and accurate manner is known to be a difficult task. Data collection methods vary from source to source in format, volume, and media type. It is therefore advantageous to deploy customized data hygiene techniques that standardize the data so that it is meaningful and useful to the organization.

Notes

1. NameCheck is an algorithm currently being used by Acxiom.

Author information

Authors and Affiliations

Computer Science Department, University of Arkansas at Little Rock, Little Rock, AR, USA
Cihan Varol & Coskun Bayrak
Acxiom Corporation, Conway, AR, USA
Rick Wagner & Dana Goff

Editor information

Editors and Affiliations

Dept. Systems Engineering, Donaghey College of Info Sci., University of Arkansas, South University Avenue 2801, Little Rock, 72204-1099, Arkansas, USA
Yupo Chan
Dept. Information Science, University of Arkansas, Little Rock, South University Ave. 2801, Little Rock, 72204-1099, USA
John Talburt
Acxiom Corporation, Conway, USA
Terry M. Talley

About this chapter

Cite this chapter

Varol, C., Bayrak, C., Wagner, R., Goff, D. (2009). Application of the Near Miss Strategy and Edit Distance to Handle Dirty Data. In: Chan, Y., Talburt, J., Talley, T. (eds) Data Engineering. International Series in Operations Research & Management Science, vol 132. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-0176-7_5

DOI: https://doi.org/10.1007/978-1-4419-0176-7_5
Published: 05 September 2009
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4419-0175-0
Online ISBN: 978-1-4419-0176-7
eBook Packages: Computer Science, Computer Science (R0)
