Abstract
Health informatics and biomedical computing have introduced computer methods to analyze clinical information and provide tools that assist clinicians during the diagnosis and treatment of diverse clinical conditions. Given the large amount of information that can be obtained in the healthcare setting, new methods to acquire, organize, and analyze data are being developed each day, including new applications in big data and machine learning. In this review, we first present the most basic concepts in data science, including the structural hierarchy of information and how it is managed. A section is dedicated to topics relevant to data acquisition, in particular the availability and use of online resources such as survey software and cloud computing services. Along with digital datasets, these tools make it possible to create more diverse models and facilitate collaboration. We then describe concepts and techniques in machine learning used to process and analyze health data, especially those most widely applied in rheumatology. Overall, the objective of this review is to aid in the comprehension of how data science is used in health, with a special emphasis on its relevance to rheumatology. It provides clinicians with basic tools to approach and understand new trends in health informatics analysis currently used in rheumatology practice. Clinicians who understand the potential uses and limitations of health informatics will be better placed to take part in interdisciplinary conversations and ongoing projects relating to data, big data, and machine learning.
Ethics declarations
Disclosures
None.
About this article
Cite this article
Soriano-Valdez, D., Pelaez-Ballestas, I., Manrique de Lara, A. et al. The basics of data, big data, and machine learning in clinical practice. Clin Rheumatol 40, 11–23 (2021). https://doi.org/10.1007/s10067-020-05196-z
DOI: https://doi.org/10.1007/s10067-020-05196-z