The basics of data, big data, and machine learning in clinical practice

  • Review Article
Clinical Rheumatology

Abstract

Health informatics and biomedical computing have introduced computational methods for analyzing clinical information and provide tools that assist clinicians in diagnosing and treating diverse clinical conditions. Given the volume of information that can be obtained in the healthcare setting, new methods to acquire, organize, and analyze data are continually being developed, including new applications of big data and machine learning. In this review, we first present the most basic concepts in data science, including the structural hierarchy of information and how it is managed. One section discusses topics relevant to data acquisition, in particular the availability and use of online resources such as survey software and cloud computing services. Together with digital datasets, these tools make it possible to create more diverse models and facilitate collaboration. We then describe concepts and techniques in machine learning used to process and analyze health data, especially those most widely applied in rheumatology. Overall, the objective of this review is to help readers understand how data science is used in health, with special emphasis on its relevance to rheumatology. It provides clinicians with basic tools for approaching and understanding the trends in health informatics analysis currently used in rheumatology practice. Clinicians who understand the potential uses and limitations of health informatics will be better placed to take part in interdisciplinary conversations and continuing projects involving data, big data, and machine learning.

Author information

Corresponding author

Correspondence to Alfonso Gastelum-Strozzi.

Ethics declarations

Disclosures

None.

About this article

Cite this article

Soriano-Valdez, D., Pelaez-Ballestas, I., Manrique de Lara, A. et al. The basics of data, big data, and machine learning in clinical practice. Clin Rheumatol 40, 11–23 (2021). https://doi.org/10.1007/s10067-020-05196-z

  • DOI: https://doi.org/10.1007/s10067-020-05196-z
