Examine Manipulated Datasets with Topology Data Analysis: A Case Study

  • Yun Guo
  • Daniel Sun
  • Guoqiang Li
  • Shiping Chen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11149)


Learning and mining technologies are widely applied to extract value from very large datasets and to inform decision-making. The correctness of the resulting decisions depends on the truthfulness of the underlying data. Data fraud is widespread, and even data that were originally true can be maliciously manipulated by cyber-attackers. Methods for examining data authenticity have long been in use, but they are less effective when only values are manipulated without violating the data's scopes and definitions. Decisions made from fraudulent or manipulated data are then wrong or hijacked. Data manipulation has been described as the latest technique in “the art of war in cyberspace.” Examining each data instance against its source is exhausting and often impossible; re-collecting the data for a national census is one example. In this paper, through a case study on banknote data, we exploit Topological Data Analysis (TDA) to examine manipulated data. A fraction of the data records is examined as a whole rather than record by record. We first test whether TDA can detect such manipulation and verify data efficiently, and then discuss the limitations of the state of the art. Although TDA is not yet mature, it has been reported to be effective in many applications, and our work provides evidence that it can be used to detect data anomalies.


Keywords: Data manipulation · Topological features · TDA · Mapper



This work is supported by the Key Program of the National Natural Science Foundation of China under grant No. 61732013, and by the Key R&D Project of Zhejiang Province under grant No. 2017C02036.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Yun Guo (1)
  • Daniel Sun (2, 3)
  • Guoqiang Li (2), email author
  • Shiping Chen (4)
  1. Department of Computer Science and Technology, Shanghai Jiao Tong University, Shanghai, China
  2. School of Software, Shanghai Jiao Tong University, Shanghai, China
  3. Data61, CSIRO, Canberra, Australia
  4. Data61, CSIRO, Sydney, Australia
