Background

DNA in eukaryotic genomes is characterized, and often dominated, by repetitive, non-genic DNA sequences. Initially thought to be non-functional, repeats have been found to influence gene expression [1] and provide diversity to the genome via mutation. Mobile repeat sequences [2] (transposons) have played a prominent role in the evolutionary histories of eukaryotic genomes [3, 4], and their persistence in eukaryotic DNA indicates that they have, on the whole, been evolutionarily advantageous. Although a growing number of algorithms have been developed for discovering novel dispersed repeats [5-7], significant analysis of the repeats and their relationships to other genome features will be required before we can truly understand the complex ways in which dispersed repeat sequences contribute to evolutionary fitness. We propose a data mining technique based on spatial proximity rules to discover highly fragmented repeat regions for which only the conserved parts are reported by a computational repeat finder.

Materials and methods

We present an algorithm for mining the coordinates of different families of ab initio identified repetitive regions on chromosomal length DNA sequences to yield proximity relationships between repeat families [8]. Association rule mining [9] is used to compute the statistical significance of the discovered relationships. False positives are screened out by means of Monte Carlo methods. The filtered proximity relationships are in turn used to build graphs in which repeat families correspond to the vertices and the discovered proximity relationships correspond to edges. Connected components are extracted from the graphs to yield sets of related families denoting diverged repeat regions.
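
The pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `max_gap` distance threshold, the shuffle count, and the toy coordinates are all assumptions made for illustration, and a simple label-permutation test stands in for both the association rule statistics and the Monte Carlo screening. Repeat instances are assumed to be `(family, start, end)` tuples on a single chromosome.

```python
import random
from collections import Counter, defaultdict


def proximity_counts(repeats, max_gap):
    """Count, per unordered pair of distinct families, the instance pairs
    whose gap (start of the later one minus end of the earlier one) is
    at most max_gap bases."""
    ordered = sorted(repeats, key=lambda r: r[1])  # sort by start coordinate
    counts = Counter()
    for i, (fam_a, _, end_a) in enumerate(ordered):
        for j in range(i + 1, len(ordered)):
            fam_b, start_b, _ = ordered[j]
            if start_b - end_a > max_gap:
                break  # starts are sorted, so all later instances are farther
            if fam_a != fam_b:
                counts[frozenset((fam_a, fam_b))] += 1
    return counts


def significant_pairs(repeats, max_gap, n_shuffles=200, alpha=0.05, seed=0):
    """Keep family pairs whose observed proximity count is rarely matched
    when family labels are shuffled over the fixed coordinates."""
    rng = random.Random(seed)
    observed = proximity_counts(repeats, max_gap)
    labels = [r[0] for r in repeats]
    coords = [(r[1], r[2]) for r in repeats]
    exceed = Counter()
    for _ in range(n_shuffles):
        rng.shuffle(labels)
        shuffled = [(f, s, e) for f, (s, e) in zip(labels, coords)]
        null = proximity_counts(shuffled, max_gap)
        for pair, obs in observed.items():
            if null[pair] >= obs:
                exceed[pair] += 1
    # Empirical p-value with the usual +1 correction.
    return {pair for pair in observed
            if (1 + exceed[pair]) / (1 + n_shuffles) <= alpha}


def connected_components(edges):
    """Group families linked by retained proximity edges (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for pair in edges:
        a, b = tuple(pair)
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


# Toy data (hypothetical): famA and famB always occur side by side,
# while famC sits far from both at every locus.
repeats = []
for i in range(20):
    p = i * 10_000
    repeats += [("famA", p, p + 100),
                ("famB", p + 150, p + 250),
                ("famC", p + 5_000, p + 5_100)]

edges = significant_pairs(repeats, max_gap=200)
components = connected_components(edges)
```

On this toy input, only the famA/famB pair survives the screening, so the two families form a single connected component, which is the kind of family set the method interprets as one diverged repeat region. A real analysis would need per-chromosome handling, strand and orientation information, and a null model that respects repeat density, none of which this sketch attempts.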

Results and conclusion

We demonstrate that this approach, applied to the rice genome [10], can discover annotated repeat regions [11, 12] and can identify novel relationships among repetitive DNA sequences. These novel relationships can be used to detect hitherto unknown repeat regions in sequenced genomes. The approach can be extended to investigate proximity relationships between all annotated elements within a genome, including genes, repetitive elements, non-coding RNAs, and regulatory elements.