Application of the Near Miss Strategy and Edit Distance to Handle Dirty Data

Part of the book series: International Series in Operations Research & Management Science (ISOR, volume 132)

Abstract

In today’s information age, processing customer information in a standardized and accurate manner is a difficult task. Data collection methods vary from source to source in format, volume, and media type. It is therefore advantageous to deploy data hygiene techniques customized to the organization, so that the standardized data is meaningful and useful.

Notes

  1. NameCheck is an algorithm currently being used by Acxiom.

Copyright information

© 2009 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Varol, C., Bayrak, C., Wagner, R., Goff, D. (2009). Application of the Near Miss Strategy and Edit Distance to Handle Dirty Data. In: Chan, Y., Talburt, J., Talley, T. (eds) Data Engineering. International Series in Operations Research & Management Science, vol 132. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-0176-7_5

  • DOI: https://doi.org/10.1007/978-1-4419-0176-7_5

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4419-0175-0

  • Online ISBN: 978-1-4419-0176-7

  • eBook Packages: Computer Science, Computer Science (R0)