
The Input Side of the Equation

Chapter in Introduction to Data Mining for the Life Sciences

Abstract

The data we analyze will typically come from multiple places, both internal repositories and external databases. Even when all of the data is in one place, it is highly unlikely that it meets our needs in its current form. Preparing the data for analysis takes an appreciable amount of time and can often be the most complex part of the analytical effort. So what do we need to do so that the data is in the right format, with the right content, and our analysis is not skewed by incorrect, inadequate, or invalid data? Here, we focus on some of the physical data manipulation activities we may (or will) need to undertake to transform the data into the structure and content we need in order to analyze it effectively. Is each piece of data relevant? What do we do about data elements missing from our dataset? Can we enforce consistency on the data without changing its meaning? Extracting the data from its source(s), transforming it, and then loading it into our analytical repository (ETL) can be a significant effort: what techniques can we use? How do we update our dataset when new data is generated? We will also consider techniques to help us standardize, normalize, and transform data, to eliminate and reduce unwanted data, and to deal with noisy data.


Notes

  1. This example highlights an interesting point relevant to the following question: do we include calculated values such as age in our analytics environment, or do we calculate them “on the fly”? This issue is considered later in this chapter.

  2. Sample, record, and input are terms often used in the literature.

  3. Elsewhere in this book we use the generic term “encounter” to cover any encounter (sic) between a patient and a health-care worker.

  4. We are not even including the more colloquial nomenclatures!

  5. We will use the prime suffix to identify the transformed attribute and its value.

  6. This type of operation can also be accomplished for tables in a one-to-many relationship, but the update process is significantly more complicated because there is much more redundant data to handle. Discussion of this scenario is out of the scope of this book.

  7. This also has a significant impact on the amount of storage taken up by the data: a numeric value such as 12345 takes 2 or 4 bytes of storage, whereas the string “Staphylococcus aureus” takes 21 bytes.

  8. Recall that normalizing for repeating groups takes groups of columns (which repeat) and makes them distinct rows in separate tables. This typically increases disk storage requirements (if only because of the overhead of the additional table) and requires the query to process more rows of data, using joins, to satisfy the query (see the first SQL sketch following these notes).

  9. Two notes: first, a UNION is an operation in Structured Query Language (SQL), the underlying relational data manipulation language; second, you would need to know the value of n in order to code the n-1 UNIONs (see the second SQL sketch following these notes).

  10. SQL alone is not typically used to build the table because of the complexity of the logic, the inefficiency of SQL for this task, and because the number of levels in the hierarchy is either unknown or constantly changing (see the third SQL sketch following these notes).

  11. http://david.abcc.ncifcrf.gov/helps/knowledgebase/DAVID_gene.html#gene

  12. Of course, this assumes our call does not get dropped as we switch cells! Some readers familiar with the telecommunications industry may also be aware that additional data is captured on a periodic basis, such as every 6 seconds, but we avoid this over-complication for our illustration.

  13. This is because mining unstructured documents requires its own special techniques, which we consider separately.

  14. Assuming we don’t have chain-smoking children.

  15. Statisticians reading this will immediately raise a red flag about the data we have already collected, since the underlying population may be different from the “independent” population we are looking for to eliminate our missing data. This is true, and we return to it in Chap. 6.

  16. From Pearson (2006) once again, the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) liver transplant database includes cyclosporine (CSA) levels and FK506 measurements, but notes that if one is recorded for a given patient at a given time, the other will be omitted.

  17. We are being fast and loose with the use of the word “same.” Here only, we use it to mean identical, or similar to the point that any variances in the results are so small as to be inconsequential for our analysis.

  18. A timestamp is a data value that includes both a date component and a time component (typically down to some fraction of a second, such as a millisecond).

  19. We are purposely using the term data warehouse as a generic label at this time.

  20. In the last several years, Bill Inmon, one of the leading figures in the data warehousing world, has defined the exploration warehouse and adaptive data mart as parts of an infrastructure and model designed to support such change in a (somewhat) deterministic manner.
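
The first SQL sketch, for note 8, shows how a repeating group of columns might be normalized into a separate table and the join a query then needs. It is a minimal sketch only; all table and column names (patient_wide, patient, patient_result, result_1, and so on) are hypothetical and not taken from the chapter.

    -- Hypothetical repeating-group layout: one row per patient, with
    -- repeating result columns result_1, result_2, result_3.
    -- CREATE TABLE patient_wide (
    --   patient_id INTEGER PRIMARY KEY,
    --   result_1 NUMERIC, result_2 NUMERIC, result_3 NUMERIC
    -- );

    -- Normalized layout: the repeating group becomes rows in a child table.
    CREATE TABLE patient (
      patient_id INTEGER PRIMARY KEY
    );

    CREATE TABLE patient_result (
      patient_id INTEGER REFERENCES patient (patient_id),
      result_seq INTEGER,   -- which repeating column the value came from
      result_val NUMERIC,
      PRIMARY KEY (patient_id, result_seq)
    );

    -- Queries now touch more rows and need a join to see all results per patient.
    SELECT p.patient_id, r.result_seq, r.result_val
    FROM patient p
    JOIN patient_result r ON r.patient_id = p.patient_id;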
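
The second SQL sketch, for note 9, shows the n-1 UNIONs needed to turn n repeating columns back into rows; with the three hypothetical columns above, two UNIONs are required (UNION ALL is used here so that duplicate values are not silently removed).

    -- Unpivot three repeating columns into rows: n = 3, so n - 1 = 2 UNIONs.
    SELECT patient_id, 1 AS result_seq, result_1 AS result_val FROM patient_wide
    UNION ALL
    SELECT patient_id, 2, result_2 FROM patient_wide
    UNION ALL
    SELECT patient_id, 3, result_3 FROM patient_wide;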
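
The third SQL sketch, for note 10, illustrates why SQL alone becomes impractical for flattening a hierarchy: each level needs its own self-join, so the query must be rewritten whenever the number of levels changes. The parent/child table (taxon) and its columns are hypothetical.

    -- Hypothetical parent/child hierarchy table.
    -- CREATE TABLE taxon (
    --   taxon_id INTEGER PRIMARY KEY,
    --   parent_id INTEGER,   -- NULL for the root level
    --   name VARCHAR(100)
    -- );

    -- Flattening exactly three levels takes one self-join per level below the
    -- root; an unknown or changing number of levels cannot be coded this way.
    SELECT t1.name AS level_1, t2.name AS level_2, t3.name AS level_3
    FROM taxon t1
    LEFT JOIN taxon t2 ON t2.parent_id = t1.taxon_id
    LEFT JOIN taxon t3 ON t3.parent_id = t2.taxon_id
    WHERE t1.parent_id IS NULL;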

References

  • Chen LYY et al (2002) Single nucleotide polymorphism mapping using genome wide unique sequences. J Genome Res

  • DuBois D, DuBois E (1916) A formula to estimate the approximate surface area if height and weight be known. Arch Intern Med 17:863–871

  • Horton NJ, Lipsitz SR (2001) Multiple imputation in practice: comparison of software packages for regression models with missing variables. Am Stat 55:244–254

  • Lippert R, Mobarry C, Walenz B (2005) A space-efficient construction of the Burrows-Wheeler transform for genomic data. J Comput Biol 12:943–951

  • Lyman P, Varian HR (2003) How much information. University of California, Berkeley, p 100

  • Pearson RK (2006) The problem of disguised missing data. SIGKDD Explor Newsl 8:83–92

  • Refaat M (2007) Data preparation for data mining using SAS. Morgan Kaufmann Publishers, San Francisco

  • Tietz NW (1995) Clinical guide to laboratory tests. W.B. Saunders Co, Philadelphia, pp xxxix, 1096

  • Witten I, Frank E (2005) Data mining: practical machine learning tools and techniques with Java implementations. Morgan Kaufmann, Amsterdam/Boston



Copyright information

© 2012 Springer Science+Business Media, LLC

Cite this chapter

Sullivan, R. (2012). The Input Side of the Equation. In: Introduction to Data Mining for the Life Sciences. Humana Press. https://doi.org/10.1007/978-1-59745-290-8_5
