Finding Median and Center Strings for a Probability Distribution on a Set of Strings Under Levenshtein Distance Based on Integer Linear Programming
For a data set of numbers or numerical vectors, the mean is the most fundamental measure of the center of the data. For a data set of strings, however, a mean cannot be defined, so median and center strings are frequently used as measures of the center instead. In contrast to calculating the mean of numerical data, constructing median and center strings of string data is not easy, and no algorithm is known that is guaranteed to construct exact center strings. In this study, we first generalize the definitions of median and center strings of string data to a probability distribution on the set of all strings over a given alphabet. This generalization corresponds to generalizing the mean of numerical data to the expected value of a probability distribution on a set of numbers or numerical vectors. Next, we develop methods that apply integer linear programming to construct exact median and center strings for a probability distribution on a set of strings. When the set of strings forms a metric space under the Levenshtein distance, we make these methods faster by exploiting the triangle inequality. Furthermore, we develop methods for constructing approximate median and center strings very rapidly when the probability of a subset composed of similar strings is close to one. Lastly, we perform simulation experiments to examine the usefulness of the proposed methods in practical applications.
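To make the two notions concrete, the following minimal Python sketch computes a median and a center for a distribution with finite support. It is not the integer linear programming method of this study: candidates are restricted to the support itself (often called the set median and set center), whereas the exact problem searches over all strings on the alphabet. The definitions used here are assumptions for illustration: the median minimizes the expected Levenshtein distance, and the center minimizes the maximum Levenshtein distance to any string with positive probability. The function names `levenshtein` and `set_median_and_center` are hypothetical.

```python
from typing import Dict, Tuple


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance, computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def set_median_and_center(dist: Dict[str, float]) -> Tuple[str, str]:
    """Return (median, center), with candidates restricted to the support of dist.

    Assumed definitions (illustrative only):
      median: minimizes the expected distance  sum_s p(s) * d(t, s)
      center: minimizes the maximum distance   max_{s: p(s) > 0} d(t, s)
    """
    support = [s for s, p in dist.items() if p > 0]
    median = min(
        support,
        key=lambda t: sum(dist[s] * levenshtein(t, s) for s in support),
    )
    center = min(
        support,
        key=lambda t: max(levenshtein(t, s) for s in support),
    )
    return median, center


if __name__ == "__main__":
    example = {"ab": 0.4, "abc": 0.4, "abcd": 0.2}
    print(set_median_and_center(example))
```

Restricting candidates to the support keeps the sketch short, but the result is in general only an upper bound on the optimal objective value, since a string outside the support may be closer to the distribution.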
This work was partially supported by Grants-in-Aid #24500361 and #26610037 from MEXT, Japan.