Abstract
The data we analyze will typically come from multiple places, both internal repositories and external databases. Even when all of the data is in one place, it is highly unlikely that it meets our needs in its current form. Preparing the data for analysis takes an appreciable amount of time and can often be the most complex part of the analytical effort. So what do we need to do to ensure that the data is in the right format, with the right content, so that our analysis is not skewed by incorrect, inadequate, or invalid data? Here, we focus on some of the physical data manipulation activities we may (or will) need to undertake to transform the data into the structure and content we need in order to analyze it effectively. Is each piece of data relevant? What do we do about data elements missing from our dataset? Can we enforce consistency on the data without changing its meaning? Extracting the data from its source(s), transforming it, and then loading it into our analytical repository (ETL) can be a significant effort: what techniques can we use? How do we update our dataset when new data is generated? We will also consider techniques to help us standardize, normalize, and transform data, eliminate and reduce unwanted data, and deal with noisy data.
Notes
- 1.
This example highlights an interesting point relevant to the following question: Do we include calculated values such as age in our analytics environment or do we calculate “on the fly”? This issue will be considered later in this chapter.
- 2.
Sample, record, and input are often used in the literature.
- 3.
Elsewhere in this book we use the generic term “encounter” to cover any encounter (sic) between a patient and a health-care worker.
- 4.
We are not even including the more colloquial nomenclatures!
- 5.
We will use the prime suffix to identify the transformed attribute and its value.
- 6.
Note that this type of operation can also be accomplished for tables in a one-to-many relationship, but the update process is significantly more complicated because there will be much more redundant data to handle. Discussion of this scenario is outside the scope of this book.
- 7.
This also has a significant impact on the amount of storage taken up by the data since storing a numeric value such as 12345 takes 2 or 4 bytes of storage, whereas Staphylococcus aureus takes up 21 bytes of storage.
- 8.
Recall that normalizing for repeating groups takes groups of columns (which repeat) and makes them distinct rows in different tables. This typically results in larger disk storage requirements (if only because of the overhead of the table itself) and requires the query to process more rows of data, using joins, to satisfy the query.
- 9.
Two notes: first, a UNION is an operation in Structured Query Language (SQL), the underlying relational data manipulation language. Second, you would need to know the value of n in order to code the n-1 UNIONs!
- 10.
SQL alone is not typically used to build the table due to the complexity of the logic, the inefficiency of SQL, and because the number of levels in the hierarchy is either unknown or constantly changing.
- 11.
- 12.
Of course, this assumes our call does not get dropped as we switch cells! Some readers familiar with the telecommunications industry may also be aware that additional data is captured on a periodic basis – such as every 6 s – but we avoid this over-complication for our illustration.
- 13.
This is because mining unstructured documents requires its own special techniques, and we consider these separately.
- 14.
Assuming we don’t have chain-smoking children.
- 15.
Statisticians reading this will immediately raise a red flag about the data we have already collected, since the underlying population may be different to the “independent” population we are looking for to eliminate our missing data. This is true, and we return to this in Chap. 6.
- 16.
From Pearson (2006) once again, the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) liver transplant database includes cyclosporine (CSA) levels and FK506 measurements but states that if one is recorded for a given patient at a given time, the other will be omitted.
- 17.
We are being fast and loose with the use of the word “same.” Here only, we use it to mean identical or similar to the point that any variances in the results are so small as to be inconsequential for our analysis.
- 18.
A timestamp is a data value that includes both a date component and a time component (typically down to some subset of a second, such as millisecond).
- 19.
We are purposely using the term data warehouse as a generic label at this time.
- 20.
In the last several years, Bill Inmon, one of the leading figures in the data warehousing world, has defined exploration warehouse and adaptive data mart as parts of an infrastructure and model designed to support such change in a (somewhat) deterministic manner.
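The storage comparison in Note 7 can be checked with a short sketch. The lookup table and the function name `storage_bytes` below are hypothetical, introduced only for illustration; the sketch assumes the organism name is stored as plain UTF-8 text and the code as a fixed-width 2- or 4-byte integer, and it ignores any per-row or per-column overhead a real database would add.

```python
import struct

# Hypothetical lookup table: a numeric code standing in for an organism name.
ORGANISM_CODES = {12345: "Staphylococcus aureus"}


def storage_bytes(code: int, name: str) -> dict:
    """Compare fixed-width integer storage with the raw text of the name."""
    # Packing succeeds only if the code fits in the given width;
    # 12345 fits in a signed 16-bit integer (maximum 32767).
    as_int16 = struct.pack("<h", code)
    as_int32 = struct.pack("<i", code)
    return {
        "int16": len(as_int16),             # 2-byte integer
        "int32": len(as_int32),             # 4-byte integer
        "text": len(name.encode("utf-8")),  # one byte per ASCII character
    }


sizes = storage_bytes(12345, ORGANISM_CODES[12345])
print(sizes)
```

Under these assumptions, replacing the name with its code saves 17 to 19 bytes per stored value, which is the saving the note describes.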
References
Chen LYY et al (2002) Single nucleotide polymorphism mapping using genome wide unique sequences. J Genome Res
DuBois D, DuBois E (1916) A formula to estimate the approximate surface area if height and weight be known. Arch Intern Med 17:863–871
Horton NJ, Lipsitz SR (2001) Multiple imputation in practice: comparison of software packages for regression models with missing variables. Am Stat 55:244–254
Lippert R, Mobarry C, Walenz B (2005) A space-efficient construction of the Burrows-Wheeler transform for genomic data. J Comput Biol 12:943–951
Lyman P, Varian HR (2003) How much information. University of California, Berkeley, pp. 100
Pearson RK (2006) The problem of disguised missing data. SIGKDD Explorations Newsl 8:83–92
Refaat M (2007) Data preparation for data mining using SAS. Morgan Kaufmann Publishers, San Francisco
Tietz NW (1995) Clinical guide to laboratory tests. W.B. Saunders Co, Philadelphia, pp. xxxix, 1096
Witten I, Frank E (2005) Emboss European Molecular Biology Open Software Suite. In: Data mining: practical machine learning tools and techniques with Java implementations. Morgan Kaufman, Amsterdam/Boston
© 2012 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Sullivan, R. (2012). The Input Side of the Equation. In: Introduction to Data Mining for the Life Sciences. Humana Press. https://doi.org/10.1007/978-1-59745-290-8_5
DOI: https://doi.org/10.1007/978-1-59745-290-8_5
Publisher Name: Humana Press
Print ISBN: 978-1-58829-942-0
Online ISBN: 978-1-59745-290-8
eBook Packages: Biomedical and Life Sciences (R0)