Abstract
Genomic scans for positive selection or population differentiation are often used in evolutionary genetics to shortlist genetic loci with potentially adaptive biological functions. However, the vast majority of such tests relies on empirical ranking methods, which suffer from high false positive rates. In this work we computed a modified genetic distance on a 10,000 bp sliding window between sets of three samples each from CHB, CEU and YRI samples from the 1000 Genomes Project. We applied SolvingSet, a distance-based outlier detection method capable of mining hundreds of thousands of multivariate entries in a computationally efficient manner, to the average pairwise distances obtained from each window for each CHB-CEU, CHB-YRI and CEU-YRI to compute the top-n genic windows exhibiting the highest scores for the three distances. The outliers detected by this approach were screened for their biological significance, showing good overlap with previously known targets of differentiation and positive selection in human populations.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
1000 Genomes Project Consortium, Abecasis, G.R., Auton, A., Brooks, L.D., DePristo, M.A., Durbin, R.M., Handsaker, R.E., Kang, H.M., Marth, G.T., McVean, G.A.: An integrated map of genetic variation from 1,092 human genomes. Nature 491(7422), 56–65 (2012)
Angiulli, F., Basta, S., Lodi, S., Sartori, C.: Distributed strategies for mining outliers in large data sets. IEEE Trans. Knowl. Data Eng. 25(7), 1520–1532 (2013)
Angiulli, F., Basta, S., Lodi, S., Sartori, C.: Fast outlier detection using a gpu. In: International Conference on High Performance Computing and Simulation (HPCS), pp. 143–150 (2013)
Angiulli, F., Pizzuti, C.: Outlier mining in large high-dimensional data sets. Trans. Knowl. Data Eng. 2(17), 203–215 (2005)
Angiulli, F., Basta, S., Lodi, S., Sartori, C.: Accelerating outlier detection with intra- and inter-node parallelism. In: International Conference on High Performance Computing and Simulation (HPCS), pp. 476–483. IEEE, Bologna, Italy, 21–25 July (2014)
Angiulli, F., Basta, S., Pizzuti, C.: Distance-based detection and prediction of outliers. Trans. Knowl. Data Eng. 18(2), 145–160 (2006)
Angiulli, F., Fassetti, F.: Dolphin: an efficient algorithm for mining distance-based outliers in very large datasets. ACM Trans. Knowl. Disc. Data 3(1), 4:1–4:57 (2009)
Ayub, Q., Moutsianas, L., Chen, Y., Panoutsopoulou, K., Colonna, V., Pagani, L., Prokopenko, I., Ritchie, G.R.S., Smith, T.C., McCarthy, M.I., et al.: Revisiting the thrifty gene hypothesis via 65 loci associated with susceptibility to type 2 diabetes. Am. J Hum. Genet. 94(2), 176–185 (2014)
Barnett, V., Lewis, T.: Outliers in Statistical Data, 3rd edn. Wiley, Chichester (1994)
Bay, S.D., Schwabacher, M.: Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In: Knowledge Discovery and Data Mining (2003)
Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 93–104. ACM, New York, USA (2000)
Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Comput. Surv. 41(3), 15:1–15:58 (2009)
Colonna, V., Ayub, Q., Chen, Y., Pagani, L., Luisi, P., Pybus, M., Garrison, E., Xue, Y., Tyler-Smith, C., et al.: Human genomic regions with exceptionally high levels of population differentiation identified from 911 whole-genome sequences. Genome Biol. 15(6), R88 (2014)
Dutta, H., Giannella, C., Borne, K.D., Kargupta, H.: Distributed top-k outlier detection from astronomy catalogs using the DEMAC system. In: SDM (2007)
Ewing, G., Hermisson, J.: Msms: a coalescent simulation program including recombination, demographic structure and selection at a single locus. Bioinformatics 26(26), 2064–2065 (2010)
Fay, J.C., Wu, C.I.: The neutral theory in the genomic era. Curr. Opin. Genet. Dev. 11(6), 642–646 (2001)
Ghoting, A., Parthasarathy, S., Otey, M.E.: Fast mining of distance-based outliers in high-dimensional datasets. Data Min. Knowl. Disc. 16(3), 349–364 (2008)
Han, J., Kamber, M.: Data Mining, Concepts and Technique. Morgan Kaufmann, San Francisco (2001)
Hodge, V.J., Austin, J.: A survey of outlier detection methodologies. Artif. Intell. Rev. 22, 85–126 (2004)
Hung, E., Cheung, D.W.: Parallel mining of outliers in large database. Distrib. Parallel Dat. 12(1), 5–26 (2002)
Knorr, E., Ng, R.: Algorithms for mining distance-based outliers in large datasets. In: VLDB. pp. 392–403 (1998)
Koufakou, A., Georgiopoulos, M.: A fast outlier detection strategy for distributed high-dimensional data sets with mixed attributes. Data Min. Knowl. Disc. (2009, Published online)
Lozano, E., Acuña, E.: Parallel algorithms for distance-based and density-based outliers. In: ICDM. pp. 729–732 (2005)
Otey, M.E., Ghoting, A., Parthasarathy, S.: Fast distributed outlier detection in mixed-attribute data sets. Data Min. Knowl. Disc. 12(2–3), 203–228 (2006)
Pickrell, J.K., Coop, G., Novembre, J., Kudaravalli, S., Li, J.Z., Absher, D., Srinivasan, B.S., Barsh, G.S., Myers, R.M., Feldman, M.W., et al.: Signals of recent positive selection in a worldwide sample of human populations. Genome Res. 19(5), 826–837 (2009)
Ramaswamy, S., Rastogi, R., Shim, K.: Efficient algorithms for mining outliers from large data sets. In: SIGMOD, pp. 427–438 (2000)
Ramaswamy, S., Rastogi, R., Shim, K.: Efficient algorithms for mining outliers from large data sets. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 427–438. ACM, New York, USA (2000)
Sabeti, P.C., Varilly, P., Fry, B., Lohmueller, J., Hostetter, E., Cotsapas, C., Xie, X., Byrne, E.H., McCarroll, S.A., Gaudet, R., et al.: Genome-wide detection and characterization of positive selection in human populations. Nature 449(7164), 913–918 (2007)
Tajima, F.: Statistical method for testing the neutral mutation hypothesis by dna polymorphism. Genetics 123(3), 585–595 (1989)
Tao, Y., Xiao, X., Zhou, S.: Mining distance-based outliers from large databases in any metric space. In: KDD, pp. 394–403 (2006)
Voight, B.F., Kudaravalli, S., Wen, X., Pritchard, J.K.: A map of recent positive selection in the human genome. PLoS Biol. 4(3), e72 (2006)
Wright, S.: Isolation by distance under diverse systems of mating. Genetics 31(1), 39 (1946)
Yi, X., Liang, Y., Huerta-Sanchez, E., Jin, X., Cuo, Z.X.P., Pool, J.E., Xu, X., Jiang, H., Vinckenbosch, N., Korneliussen, T.S., et al.: Sequencing of 50 human exomes reveals adaptation to high altitude. Science 329(5987), 75–78 (2010)
Acknowledgements
This work has been partially supported by the Italian Ministry of Education, Universities and Research under PRIN Data-Centric Genomic Computing (GenData 2020) and by CINECA ISCRA project HIOXICGP. Luca Pagani would like to thank Guy Jacobs for his help with simulations. The authors have no conflict of interests to declare.
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Lodi, S., Angiulli, F., Basta, S., Luiselli, D., Pagani, L., Sartori, C. (2015). First Application of a Distance-Based Outlier Approach to Detect Highly Differentiated Genomic Regions Across Human Populations. In: Zazzu, V., Ferraro, M., Guarracino, M. (eds) Mathematical Models in Biology. Springer, Cham. https://doi.org/10.1007/978-3-319-23497-7_10
Download citation
DOI: https://doi.org/10.1007/978-3-319-23497-7_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23496-0
Online ISBN: 978-3-319-23497-7
eBook Packages: Mathematics and StatisticsMathematics and Statistics (R0)