# New Algorithms for Text Fingerprinting

• Roman Kolpakov
• Mathieu Raffinot
## Abstract

Let $$s = s_1 \ldots s_n$$ be a text (or sequence) over a finite alphabet Σ. A fingerprint in s is the set of distinct characters contained in one of its substrings. Fingerprinting a text consists of computing the set $${\mathcal{F}}$$ of all fingerprints of all its substrings and being able to efficiently answer several questions on this set. A given fingerprint $$f \in {\mathcal{F}}$$ is represented by a binary array, F, of size |Σ|, named a fingerprint table. A fingerprint $$f \in {\mathcal{F}}$$ admits a number of maximal locations (i,j) in s, that is, the alphabet of $$s_i \ldots s_j$$ is f and $$s_{i-1}$$, $$s_{j+1}$$, if defined, are not in f. The total number of maximal locations is $${\mathcal{L}} \leq n |\Sigma|+1$$. We present new algorithms and a new data structure for three problems: (1) compute $${\mathcal{F}}$$; (2) given F, decide whether F represents a fingerprint in $${\mathcal{F}}$$; (3) given F, find all maximal locations of F in s. These problems are solved in $$O(({\mathcal{L}}+ n) \log |\Sigma|)$$, Θ(|Σ|), and Θ(|Σ| + K) time, respectively, where K is the number of maximal locations of F.
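
To make the definitions concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm or data structure presented in the paper and does not achieve the stated time bounds; it simply enumerates every substring and keeps the locations that cannot be extended on either side. The function names `maximal_locations` and `fingerprint_table` are hypothetical, chosen for this illustration only.

```python
from collections import defaultdict


def maximal_locations(s):
    """Map each non-empty fingerprint of s to its maximal locations.

    Brute-force illustration only (quadratic number of substrings);
    NOT the paper's method. Locations (i, j) are 1-indexed and inclusive,
    matching the notation s_i .. s_j.
    """
    n = len(s)
    locations = defaultdict(list)
    for i in range(n):
        seen = set()  # alphabet of s[i..j], grown as j advances
        for j in range(i, n):
            seen.add(s[j])
            # Maximal: the neighbours, if defined, are not in the fingerprint.
            left_ok = i == 0 or s[i - 1] not in seen
            right_ok = j == n - 1 or s[j + 1] not in seen
            if left_ok and right_ok:
                locations[frozenset(seen)].append((i + 1, j + 1))
    return dict(locations)


def fingerprint_table(f, alphabet):
    """Binary array F of size |alphabet| representing fingerprint f.

    The ordering of `alphabet` is an assumption of this sketch.
    """
    return [1 if c in f else 0 for c in alphabet]
```

For the text `"aab"`, this sketch yields three fingerprints: {a} with maximal location (1,2), {a,b} with (1,3), and {b} with (3,3). Every fingerprint has at least one maximal location, so the keys of the returned dictionary are exactly the non-empty fingerprints of the text.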

## Keywords

Maximal Location · Hash Table · Distinct Character · Naming Algorithm · Edge Label

## Authors and Affiliations

• Roman Kolpakov (1)
• Mathieu Raffinot (2)

1. Liapunov French-Russian Institute, Lomonosov Moscow State University, Moscow, Russia
2. CNRS, Poncelet Laboratory, Independent University of Moscow, Moscow, Russia