New Algorithms for Text Fingerprinting

  • Roman Kolpakov
  • Mathieu Raffinot
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4009)


Let \(s = s_1 \ldots s_n\) be a text (or sequence) over a finite alphabet Σ. A fingerprint in s is the set of distinct characters contained in one of its substrings. Fingerprinting a text consists of computing the set \({\mathcal{F}}\) of all fingerprints of all its substrings and being able to efficiently answer several questions about this set. A given fingerprint \(f \in {\mathcal{F}}\) is represented by a binary array, F, of size |Σ|, called a fingerprint table. A fingerprint \(f \in {\mathcal{F}}\) admits a number of maximal locations (i, j) in s, that is, the alphabet of \(s_i \ldots s_j\) is f and \(s_{i-1}\) and \(s_{j+1}\), if defined, are not in f. The total number of maximal locations is \({\mathcal{L}} \leq n |\Sigma|+1\). We present new algorithms and a new data structure for three problems: (1) compute \({\mathcal{F}}\); (2) given F, decide whether F represents a fingerprint in \({\mathcal{F}}\); (3) given F, find all maximal locations of F in s. These problems are solved in \(O(({\mathcal{L}}+ n) \log |\Sigma|)\), Θ(|Σ|), and Θ(|Σ| + K) time, respectively, where K is the number of maximal locations of F.
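To make the definitions concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm or data structure of the paper and does not reach the stated time bounds; it simply enumerates substrings in quadratic time and keeps the maximal locations. The function names `maximal_locations` and `table_to_fingerprint` are hypothetical, positions are 1-based as in the text, and only fingerprints of non-empty substrings are considered.

```python
from typing import Dict, FrozenSet, List, Sequence, Tuple

Location = Tuple[int, int]  # (i, j), 1-based and inclusive


def maximal_locations(s: str) -> Dict[FrozenSet[str], List[Location]]:
    """Map every fingerprint of s to its maximal locations (naive, O(n^2))."""
    n = len(s)
    table: Dict[FrozenSet[str], List[Location]] = {}
    for i in range(n):
        chars = set()
        for j in range(i, n):
            chars.add(s[j])  # chars is now the alphabet of s[i..j]
            # (i, j) is maximal when neither neighbour, if defined,
            # belongs to the fingerprint.
            left_ok = i == 0 or s[i - 1] not in chars
            right_ok = j + 1 == n or s[j + 1] not in chars
            if left_ok and right_ok:
                table.setdefault(frozenset(chars), []).append((i + 1, j + 1))
    return table


def table_to_fingerprint(F: Sequence[int], alphabet: Sequence[str]) -> FrozenSet[str]:
    """Turn a binary fingerprint table F of size |alphabet| into a character set."""
    return frozenset(c for bit, c in zip(F, alphabet) if bit)


# Problem (1): the keys are the fingerprints of non-empty substrings.
locations = maximal_locations("abaca")
# Problem (2): does F represent a fingerprint?  Problem (3): where?
f = table_to_fingerprint([1, 0, 1], "abc")  # the set {a, c}
print(f in locations, locations.get(f, []))
```

Every fingerprint has at least one maximal location, since any occurrence can be extended left and right until a character outside the fingerprint is met, so the dictionary keys cover all fingerprints of non-empty substrings.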


