First Application of a Distance-Based Outlier Approach to Detect Highly Differentiated Genomic Regions Across Human Populations

Lodi, Stefano; Angiulli, Fabrizio; Basta, Stefano; Luiselli, Donata; Pagani, Luca; Sartori, Claudio

doi:10.1007/978-3-319-23497-7_10

Abstract

Genomic scans for positive selection or population differentiation are often used in evolutionary genetics to shortlist genetic loci with potentially adaptive biological functions. However, the vast majority of such tests rely on empirical ranking methods, which suffer from high false positive rates. In this work we computed a modified genetic distance over a 10,000 bp sliding window between sets of three samples each from the CHB, CEU and YRI populations of the 1000 Genomes Project. For each window, we obtained the average pairwise distance for each of the three population pairs (CHB-CEU, CHB-YRI and CEU-YRI). We then applied SolvingSet, a distance-based outlier detection method capable of mining hundreds of thousands of multivariate entries in a computationally efficient manner, to these distances in order to compute the top-n genic windows exhibiting the highest scores across the three distances. The outliers detected by this approach were screened for their biological significance and showed good overlap with previously known targets of differentiation and positive selection in human populations.
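
To make the idea of top-n distance-based outlier detection concrete, the following is a minimal sketch and not the SolvingSet algorithm itself. It assumes one common distance-based score, the sum of Euclidean distances from a point to its k nearest neighbours, and computes it by brute force over all pairs, which SolvingSet is designed to avoid. The function names, the parameters k and n, and the toy distance values are hypothetical and chosen only for illustration; each tuple stands for one window's (CHB-CEU, CHB-YRI, CEU-YRI) average distances.

```python
import heapq
import math


def knn_weight(point, others, k):
    """Sum of Euclidean distances from `point` to its k nearest neighbours."""
    dists = [math.dist(point, other) for other in others]
    return sum(heapq.nsmallest(k, dists))


def top_n_outliers(windows, k, n):
    """Return the n (index, weight) pairs with the largest k-NN weights.

    Brute force: every window is compared with every other window,
    so the cost grows quadratically with the number of windows.
    """
    scored = []
    for i, point in enumerate(windows):
        others = windows[:i] + windows[i + 1:]
        scored.append((i, knn_weight(point, others, k)))
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


# Hypothetical per-window distances: (CHB-CEU, CHB-YRI, CEU-YRI).
windows = [
    (0.10, 0.12, 0.11),
    (0.11, 0.13, 0.10),
    (0.09, 0.11, 0.12),
    (0.10, 0.11, 0.10),
    (0.45, 0.50, 0.12),  # window far from the rest in two of three distances
]

result = top_n_outliers(windows, k=2, n=1)
print(result)
```

With these toy values the last window (index 4) receives the largest weight, because its two nearest neighbours are still far away in the three-dimensional distance space. On hundreds of thousands of windows the quadratic comparison above becomes impractical, which is the kind of cost that motivates a dedicated method such as SolvingSet.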

Acknowledgements

This work has been partially supported by the Italian Ministry of Education, Universities and Research under PRIN Data-Centric Genomic Computing (GenData 2020) and by CINECA ISCRA project HIOXICGP. Luca Pagani would like to thank Guy Jacobs for his help with simulations. The authors have no conflicts of interest to declare.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of Bologna, 40136, Bologna, Italy
Stefano Lodi & Claudio Sartori
Department of Computer Engineering, Modelling, Electronics, and Systems, University of Calabria, 87036, Rende, Italy
Fabrizio Angiulli
Institute of High Performance Computing and Networking, Italian National Research Council, 87036, Rende, Italy
Stefano Basta
Department of Biological, Geological and Environment Sciences, University of Bologna, 40126, Bologna, Italy
Donata Luiselli
Department of Archaeology and Anthropology, University of Cambridge, Cambridge, UK
Luca Pagani

Editor information

Editors and Affiliations

Institute of Genetics and Biophysics (IGB) "A. Buzzati-Traverso", National Research Council of Italy (CNR), Naples, Italy
Valeria Zazzu
Department of Statistical Sciences, Sapienza University of Rome, Rome, Italy
Maria Brigida Ferraro
High Performance Computing and Networking Institute (ICAR), National Research Council of Italy (CNR), Napoli, Italy
Mario R. Guarracino

About this chapter

Cite this chapter

Lodi, S., Angiulli, F., Basta, S., Luiselli, D., Pagani, L., Sartori, C. (2015). First Application of a Distance-Based Outlier Approach to Detect Highly Differentiated Genomic Regions Across Human Populations. In: Zazzu, V., Ferraro, M., Guarracino, M. (eds) Mathematical Models in Biology. Springer, Cham. https://doi.org/10.1007/978-3-319-23497-7_10

DOI: https://doi.org/10.1007/978-3-319-23497-7_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23496-0
Online ISBN: 978-3-319-23497-7
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)