LexEQUAL: Supporting Multiscript Matching in Database Systems

  • A. Kumaran
  • Jayant R. Haritsa
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2992)

Abstract

To effectively support today’s global economy, database systems need to store and manipulate text data in multiple languages simultaneously. Current database systems do support the storage and management of multilingual data, but are not capable of querying or matching text data across different scripts. As a first step towards addressing this lacuna, we propose here a new query operator called LexEQUAL, which supports multiscript matching of proper names. The operator is implemented by first transforming matches in multiscript text space into matches in the equivalent phoneme space, and then using standard approximate matching techniques to compare these phoneme strings. The algorithm incorporates tunable parameters that impact the phonetic match quality and thereby determine the match performance in the multiscript space. We evaluate the performance of the LexEQUAL operator on a real multiscript names dataset and demonstrate that it is possible to simultaneously achieve good recall and precision by appropriate parameter settings. We also show that the operator run-time can be made extremely efficient by utilizing a combination of q-gram and database indexing techniques. Thus, we show that the LexEQUAL operator can complement the standard lexicographic operators, representing a first step towards achieving complete multilingual functionality in database systems.
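The matching step described in the abstract, comparing phoneme strings approximately with a cheap q-gram filter in front, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `threshold` parameter (taken here as a fraction of the shorter string's length), the padding characters, and the sample phoneme strings are all hypothetical, and the text-to-phoneme conversion is assumed to have happened already. The q-gram count filter is the standard one from the approximate string join literature.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def qgrams(s: str, q: int = 2) -> Counter:
    """Multiset of q-grams of s, padded with q-1 marker characters per side."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def passes_count_filter(a: str, b: str, k: int, q: int = 2) -> bool:
    """Necessary condition for edit_distance(a, b) <= k.

    Strings within edit distance k share at least
    max(len(a), len(b)) + q - 1 - k * q padded q-grams.
    """
    shared = sum((qgrams(a, q) & qgrams(b, q)).values())
    return shared >= max(len(a), len(b)) + q - 1 - k * q


def phonemes_match(a: str, b: str, threshold: float = 0.25, q: int = 2) -> bool:
    """Hypothetical LexEQUAL-style test on two phoneme strings.

    Assumption: the allowed number of edits is `threshold` times the
    length of the shorter string, rounded down.
    """
    k = int(threshold * min(len(a), len(b)))
    if abs(len(a) - len(b)) > k:              # length filter
        return False
    if not passes_count_filter(a, b, k, q):   # q-gram count filter
        return False
    return edit_distance(a, b) <= k           # exact verification


# Made-up phoneme strings, one character per phoneme.
print(phonemes_match("nehru", "neru"))   # one edit apart, within the budget
print(phonemes_match("nehru", "nero"))   # rejected by the q-gram filter
```

The two filters only discard pairs that cannot match, so the final edit-distance check decides every surviving pair. In a database setting, the q-grams would typically be stored in an auxiliary indexed table so that the filters run as ordinary joins.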

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • A. Kumaran (1)
  • Jayant R. Haritsa (1)
  1. Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India
