Application of the Near Miss Strategy and Edit Distance to Handle Dirty Data

  • Cihan Varol
  • Coskun Bayrak
  • Rick Wagner
  • Dana Goff
Chapter
Part of the International Series in Operations Research & Management Science book series (ISOR, volume 132)

Abstract

In today’s information age, processing customer information in a standardized and accurate manner is known to be a difficult task. Data collection methods vary from source to source in format, volume, and media type. It is therefore advantageous to deploy customized data hygiene techniques that standardize the data so that it is meaningful and useful to the organization.

Keywords

Edit Distance, Optical Character Recognition, Spelling Error, Cognitive Error, Spelling Correction
These keywords were added by machine and not by the authors.

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Cihan Varol (1)
  • Coskun Bayrak (1)
  • Rick Wagner (2)
  • Dana Goff (2)
  1. Computer Science Department, University of Arkansas at Little Rock, Little Rock, USA
  2. Acxiom Corporation, Conway, USA