Exploring the Freedoms in Data Mining: Why the Trustworthiness and Integrity of the Findings are the Casualties, and How to Resolve These?

  • Conference paper
Proceedings of the Future Technologies Conference (FTC) 2021, Volume 1 (FTC 2021)

Part of the book series: Lecture Notes in Networks and Systems ((LNNS,volume 358))

Abstract

This work aims to improve the accuracy of data mining. A problem with today’s systems for Artificial Intelligence (AI) is their reliance on frameworks that lack oversight: AI is trained on data that is not guaranteed to be representative, and it is constructed through software APIs maintained by a large number of developers, each with their own interests. Society, in contrast, expects these systems to work, as the result is otherwise zombie-like behavior in systems used for industrial control of drug discovery, oversight of public governance, heating systems, etc. This paper tackles this issue. The idea is to relate the freedoms in software and algorithms, thereby identifying the blind spots in AI. The results reveal how knowledge discovery is directly linked to hidden attributes (in software and algorithms). Thus, this work provides users with a recipe for improving trust in their AI predictions.

Acknowledgments

The authors would like to thank MD K.I. Ekseth at UIO, Dr. O.V. Solberg at SINTEF, Dr. S.A. Aase at GE Healthcare, and Dr. B.H. Helleberg at NTNU/St. Olavs, for their support.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ekseth, O.K., Morset, E., Witzø, V., Refsnes, S., Hvasshovd, SO. (2022). Exploring the Freedoms in Data Mining: Why the Trustworthiness and Integrity of the Findings are the Casualties, and How to Resolve These?. In: Arai, K. (eds) Proceedings of the Future Technologies Conference (FTC) 2021, Volume 1. FTC 2021. Lecture Notes in Networks and Systems, vol 358. Springer, Cham. https://doi.org/10.1007/978-3-030-89906-6_41
