
1 Introduction

In 2021, there should be no need to explain that scientists’ daily routines are challenged by a constant data deluge. Email messages, conferences, social media, and web and smartphone alerts from around the world constantly bring information about new articles to read, discoveries to understand, datasets to analyze, or hypotheses to evaluate. The rapidity with which data acquisition technologies evolve contrasts with the amount of work required to develop solid expertise in a domain. Researchers are struggling to find a balance between the ever-increasing flow of data they receive and the limited time they have to process and understand the data.

The term “data deluge” first appeared in the literature over 10 years ago [1]. In biology and the life sciences, it has been tightly coupled to the emergence of multi-omics experimental methods, i.e., genomics, transcriptomics, proteomics, metabolomics, and more recently epigenomics, glycomics, lipidomics, or microbiomics [2]. The common element of these disciplines is the simultaneous acquisition of experimental measurements for several thousands of cellular components. Year after year, the number of technical solutions grows considerably, and the associated biological data accumulate in public databases. As illustrations, the website “Enseqlopedia” references more than 350 different Next-Generation Sequencing (NGS) methods [3], and the scientific journal Nucleic Acids Research (NAR) publishes descriptions of more than 1600 molecular biology databases [4]. And this is very likely just the tip of the iceberg. How much data has been left behind on the hard drives of scientists’ computers? It is not surprising that journals like BMC Research Notes have created a new article category called “data notes” to encourage data sharing [5].

Omics data are very valuable for research in biology, providing incredible opportunities to better understand cell function and the underlying genomic regulatory systems, but like any innovative technology, omics data follow the “hype cycle” [6]. After a peak of inflated expectations in the 2000s (the Human Genome Project), many researchers experienced the trough of disillusionment [7]. It became clear that the accumulated omics datasets cannot speak for themselves, and thus the discipline of “data science” emerged as the new hype [8]. Ten years later, despite the great prospects opened by machine learning and artificial intelligence in medical fields [9], multi-omics data analysis remains a major bottleneck in most research projects [10].

We define data analysis as the process of inspecting, cleaning, transforming, and modeling a dataset, with the aim of revealing new information. In an omics project, full interpretation of the generated data constitutes the major expense [11]. It requires a multidisciplinary team of bioinformaticians, statisticians, and biologists to transform the raw data into useful biological knowledge. With this in mind, many integrated computational tools have been developed to help researchers with no particular informatics skills work on their data. The most emblematic is the Galaxy web platform [12], which is based on a tool shed gathering several thousand software packages. Clearly, technical resources exist today, as well as training programs to learn how to use them [13]. Comprehensive courses in Data Science, Bioinformatics, Statistics, and Computational Biology are freely available on MOOC platforms like Coursera [14], for both beginners and advanced researchers. In this context, we may wonder why data analysis remains such a burden for those who produce the data. We believe that beyond the issue of dealing with a very large number of experimental results, working with omics data requires a change in the way we solve scientific problems.

Most biologists were taught to follow the scientific method, which consists of (1) making an observation; (2) formulating hypotheses to explain this observation; (3) designing experiments to test the hypotheses; (4) performing the experiments; and (5) drawing conclusions from the experiments, which support or reject the hypotheses. This is a difficult process, which can last several weeks, months, or even years. How, then, can we imagine generalizing such a method in the context of omics data? In the literature, the answer to this question sparks disagreements between researchers over how science should properly be conducted [15]. More pragmatically, if a researcher does not want to spend his/her entire scientific career testing the hypotheses that can be raised after a single omics experiment, deductive and inductive logics are both indispensable. And this new way of thinking is especially difficult to acquire, because it is not explicitly described or even well understood. In a word, there is no standard procedure.

In this chapter, our aim is to share pitfalls and tips from our experience of functional genomics in yeasts, working with multi-omics datasets [16,17,18,19]. Even though our colleagues refer to us as “bioinformaticians,” we have very different scientific backgrounds and histories. Fifteen years separate our first contributions to functional genomics projects [20, 21], 15 years during which bioinformatics has changed considerably [22]. From a field essentially devoted to the study of sequences (genes and proteins), bioinformatics has become a discipline centered around data (Fig. 1). We each have a different conception of what “we should be doing” to improve what “we already do.” Above all, we have learned by working together that following a protocol does not guarantee a successful project. More than ever, there is no right or wrong answer; hence, data analysts must be creative scientists who feel free to think outside the box. The following text is organized into three main sections. In the first, we present three basics for data analysis. We believe they provide a foundation for grasping the concepts that underlie the challenges of any data analysis. In the second section, we remind the reader that “cherry picking” and “p-hacking” are two major issues they must be aware of, especially in omics data analyses. Finally, in the third section, we recommend “10 simple rules” that we have found helpful in our daily work.

Fig. 1

Word clouds from the titles of articles published in the field of “bioinformatics” between 2000–2005 (left) and 2015–2020 (right). The list of journals from which the articles were taken is: Bioinformatics, BMC Bioinformatics, Briefings in Bioinformatics, and Journal of Bioinformatics and Computational Biology. Note that the larger the word, the more frequently it is used. The period 2000–2005 corresponds to the time G. Lelandais prepared her PhD and 2015–2020 corresponds to the time T. Denecker prepared his PhD

2 Basics for Data Analysis

2.1 Basics #1: Differentiate “Data,” “Information,” and “Knowledge”

Data analysis is the procedure by which new information and knowledge are discovered from the data; hence, it helps in making decisions and drawing conclusions. In this context, the words “data,” “information,” and “knowledge” are often connected, but it is worth noting that they refer to very different steps and aims in a data analysis [23].

Data, by definition, represent the starting point of the whole story. The term is plural, equivalent to “facts.” Therefore, data are the individual facts, which are still out of any context. In biology, they correspond to raw experimental results or observations. Data can be of different types, i.e., numerical or textual, but they can also be images, audio recordings, videos, etc. The diversity of data sources is certainly one of the biggest challenges for their analysis. At this stage, it is important to distinguish structured data from unstructured data (Fig. 2). Structured data are easily organized in databases and can be efficiently retrieved and manipulated. Unstructured data are more inconsistent and complicated to use. Preliminary work is required to find and/or define their underlying organization, and for that, manual verification or curation is often necessary. This makes the process uncertain and time consuming. Whether structured or unstructured, the key idea behind the concept of data is that they have no meaning.

Fig. 2

Illustrations of basics #1 for data analysis. The progression from data to information and knowledge is illustrated here. Data are individual facts, out of context and with no meaning. They can be either structured or unstructured (see the main text). Information arises from a set of data that have been placed in a context. Information is therefore relevant to several people. In this picture, two lists of genes whose expression is activated (red) or repressed (green) are shown, grouped by functional category. More details on this analysis can be found in [16]. Knowledge is the final stage of any data analysis. It is the most challenging to achieve

This is precisely what distinguishes data from “information.” Considered to be singular, information can be seen as a set of data placed in a particular context. Hence, information has relevance to one or more people and, for them, means something. Information can, for instance, be that a gene is more expressed when cells are in a particular condition than in another, or that the overall organization of a network of genes has a particular topology (Fig. 2). Whatever its significance (simple or more complex), information arises from inspection of the data, applying procedures decided by the scientist doing the data analysis. Importantly, this implies that different pieces of information can arise from a single dataset, depending on the processes used to inspect the data. Information therefore represents a great advance in a data analysis, but it should not be confused with “knowledge.”

This last idea is a key concept for understanding what data analysis really is. Indeed, by definition, knowledge is information that is understood. It means that we know why and how the information was obtained, and we can often make predictions based on this understanding. Considering the previous examples, the knowledge could be, for instance, the existence of a regulatory protein that impacts the transcriptional activity of a set of genes in the studied condition and drives the overall topology of the associated gene network.

Data analysis is a very lexically challenged discipline. The terms “data,” “information,” and “knowledge” are often misused or used interchangeably. To limit misunderstanding among collaborators in a project, these three terms must be formally defined and consistently used. Indeed, if one person expects new knowledge from an analysis, whereas another is working on collecting new information, their expectations may be difficult to reconcile. Also, it is our experience that trying to obtain new knowledge without first generating new information is a very risky strategy, a bit like looking for a needle in a haystack.

2.2 Basics #2: Recognize the Importance of Data Visualization

At all stages of a data analysis, visualization plays a critical role. This is well illustrated by Anscombe’s quartet [24]. Whereas the four datasets have nearly identical descriptive statistics (in terms of means, variances, and pairwise correlations), they appear very different when plotted on a graph (Fig. 3a). Graphics are therefore very important for gaining perspective on a set of observations.

Fig. 3

Illustrations of basics #2 for data analysis. (a) Presentation of Anscombe’s quartet, an emblematic example of the necessity of converting numbers into graphics. Whereas all four (x, y) datasets have nearly identical means, standard deviations, and pairwise correlations, their graphical shapes are very different. (b) Illustration of the direct connection between data visualization and cognition [42]
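For readers who wish to verify this, here is a minimal sketch (in Python) that recomputes these summary statistics and redraws the four panels. For convenience, it assumes the copy of Anscombe’s quartet distributed with the seaborn example datasets; any transcription of the original values [24] would work equally well.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# The four (x, y) sets of Anscombe's quartet, fetched from the
# seaborn example datasets (requires internet access on first use).
quartet = sns.load_dataset("anscombe")

# Nearly identical summary statistics for the four sets...
for name, subset in quartet.groupby("dataset"):
    print(name,
          f"mean_y = {subset['y'].mean():.2f}",
          f"var_y = {subset['y'].var():.2f}",
          f"corr_xy = {subset['x'].corr(subset['y']):.2f}")

# ...yet completely different shapes once plotted (compare with Fig. 3a).
sns.lmplot(data=quartet, x="x", y="y", col="dataset", col_wrap=2, ci=None)
plt.show()
```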

We have often observed that data visualization is underestimated by our collaborators. The main reason is probably that we are so immersed in a world of images that we no longer notice them. But more than a simple means to “show” or “communicate” ideas, visualization triggers a cognitive process (Fig. 3b). Applied to data, visualization (often referred to as “dataviz”) helps to detect potential biases in the collected datasets, to uncover underlying structures, and to anticipate the significance of the arising information. Applied to information (often referred to as “infographics”), visualization helps to integrate heterogeneous results and to compare and confront them with current knowledge in a particular field of interest. Finally, applied to knowledge, visualization helps to explain and communicate new understanding of processes. While some visualizations are quick procedures, possibly automated with graphics software, others are very slow and require deep thought to realistically synthesize and convey a complex message.

Again, visualization is above all a cerebral process. It allows us to form mental images of concepts, supports thinking, and amplifies cognition. Visualization thus gives new perspectives and drives further analyses. This is worth considering for the benefit of a data analysis.

2.3 Basics #3: Think of Data Analysis as a Cycle

Our research projects have taught us a lesson: data analysis never happens as originally planned. It turns out to be more complicated or time consuming than anticipated, and something is always missing. What is the effect of a method parameter on the retrieved list of genes? The genes are not those expected; what shall we do? Should we change the raw data normalization? These are examples of questions raised in our analyses of transcriptomics data. Writing this chapter was an opportunity for us to take a step back from our practices and explain the different steps of a more reasoned and efficient strategy.

In that respect, our main insight is that data analysis is not a linear process, but rather a cycle. The challenge for researchers is to avoid getting stuck in a perpetual cycle. We identified five different steps, which are represented in Fig. 4 and detailed below.

Fig. 4

Representation of the data analysis cycle that underlies the step-by-step progression of a scientific project. Five main steps are required: (1) Formulation of the scientific question; (2) Collection of available data; (3) Data exploration and preliminary analyses; (4) Formulation of statistical hypotheses; and (5) Interpretations and conclusions. This last step often opens new perspectives in the project, thus initiating a new cycle of data analysis. The succession of coherent cycles allows a step-by-step progression of the project

2.3.1 Formulation of the Scientific Question

This is the most important step. A clear formulation of the scientific question helps to delineate the purpose of the data analysis and to anticipate the information that will be obtained, together with the potential issues associated with it. A typical mistake is to directly target the creation of new knowledge, forgetting the information intermediate. Even if they are very interdependent notions (see previous section), keeping their specificities in mind is critical at this step. Information is associated with “data meaning,” whereas knowledge is associated with “data understanding.” These are very different aims. In the case of a transcriptomics data analysis, for instance, this can be illustrated as follows. Imagine you are comparing gene expression between two conditions; it is very different to say “I want to know how cells adapt in response to environmental changes” (a vague question, to which multiple answers can be formulated) than to say “I want to identify the set of genes for which normalized expression measurements are significantly different, with an associated type I error rate of 1%” (a precise question associated with classical statistical procedures). Of course, the first question is more attractive from a scientific point of view. The underlying idea is clearly to create new knowledge, the goal of all researchers. But we can anticipate that the first question will be very hard to fully answer, especially if several collaborators hold different views of what “cell adaptation” means. There is no such risk with the second goal. Better still, we can anticipate, based on the number of replicates available, the power of the statistical test and hence our ability to detect genes that are truly differentially expressed [25]. The results may be disappointing, but they will still be results, i.e., new information that can be combined with other information as part of a more ambitious scientific project.
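As a rough illustration of this last point, the sketch below estimates by simulation the power of a per-gene two-sample t-test at a 1% type I error rate, for a hypothetical effect size of one log2 unit and a within-group standard deviation of one; these numbers, like the replicate counts tested, are placeholders rather than values from any particular study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def estimated_power(n_rep, effect_size, alpha=0.01, n_sim=5000):
    """Fraction of simulated experiments in which a two-sample t-test
    detects a true shift of `effect_size` with `n_rep` replicates."""
    rejected = 0
    for _ in range(n_sim):
        control = rng.normal(loc=0.0, scale=1.0, size=n_rep)
        treated = rng.normal(loc=effect_size, scale=1.0, size=n_rep)
        rejected += stats.ttest_ind(control, treated).pvalue < alpha
    return rejected / n_sim

# How many replicates are needed to reliably detect a 1 log2-unit change?
for n_rep in (3, 5, 10):
    print(f"{n_rep} replicates -> estimated power {estimated_power(n_rep, 1.0):.2f}")
```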

2.3.2 Collection (and Cleaning) of Available Data

Being a data analyst in 2021 is a strong advantage at this step: many sources of biological data exist and are freely available in public databases [4]. It is thus possible to answer many scientific questions without carrying out new experiments. Of course, the experiments of others can still be considered imperfect and unsatisfactory, but this matters little if some new information can be obtained from the data. A typical mistake at this step is to collect more data than we actually need. It is important to remain focused on the scientific question only. Indeed, with too much data, the risk is to spend more time than necessary on purely technical (informatics) issues. Data cleaning is also a necessary step, whose cost increases with the amount of data collected. On this subject, useful advice can be found in [26]. To ensure good data quality, it is necessary to check validity, accuracy, completeness, and consistency. This takes time but deserves the effort, because the quality of data collection and cleaning has a direct impact on the quality of the resulting information, hence the adage “garbage in, garbage out” [27]. In this context, being specific about what we are trying to accomplish is a great help in deciding what actions to take.
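As a minimal illustration of such checks, the sketch below screens a hypothetical count table and its sample sheet for missing values, negative counts, duplicated gene identifiers, and samples lacking metadata; the file names (counts.tsv, samples.tsv) and column names (gene_id, sample_id) are assumptions made for the example.

```python
import pandas as pd

# Hypothetical inputs: a gene x sample count table and its sample sheet.
counts = pd.read_csv("counts.tsv", sep="\t", index_col="gene_id")
samples = pd.read_csv("samples.tsv", sep="\t", index_col="sample_id")

# Completeness: are any measurements missing?
print("Missing values per sample:")
print(counts.isna().sum())

# Validity: raw counts should never be negative.
print("Negative counts:", int((counts < 0).sum().sum()))

# Consistency: gene identifiers should be unique ...
print("Duplicated gene IDs:", int(counts.index.duplicated().sum()))

# ... and every column of the count table should be described in the sample sheet.
missing_metadata = set(counts.columns) - set(samples.index)
print("Samples without metadata:", sorted(missing_metadata))
```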

2.3.3 Data Exploration and Preliminary Analyses

This step is our favorite. For the first time, the collected dataset is handled without any restraint regarding what is right or wrong to do. It consists of exploring the data, plotting multiple graphs, calculating many descriptive statistics, etc. We like the idea that this step is about “getting to know” the collected data. The visualization methods described above are of great interest at this stage, while ensuring the reproducibility of the process is not yet necessary (see next section). This greatly reduces the work of the data analyst and provides very appreciable freedom. A typical mistake is to spend too much time at this step and to over-interpret the observations. Indeed, no solid conclusions will be drawn from these explorations and preliminary analyses; the objective of this step is only to develop intuitions, which will help with the following steps.
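To make this concrete, here is one possible “getting to know the data” sketch for a gene expression table: a log transform, one boxplot per sample, and a quick PCA of the samples. The simulated count matrix only stands in for real data, and none of these choices is prescriptive.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Simulated gene x sample count table standing in for real data.
rng = np.random.default_rng(1)
sample_names = ["ctrl_1", "ctrl_2", "ctrl_3", "treat_1", "treat_2", "treat_3"]
counts = pd.DataFrame(rng.poisson(lam=50, size=(2000, 6)), columns=sample_names)

log_counts = np.log2(counts + 1)  # tame the dynamic range

# One boxplot per sample: spot library-size or normalization problems.
log_counts.boxplot(rot=90)
plt.ylabel("log2(count + 1)")
plt.show()

# PCA on samples: do replicates of the same condition group together?
coords = PCA(n_components=2).fit_transform(log_counts.T)
plt.scatter(coords[:, 0], coords[:, 1])
for label, (x, y) in zip(sample_names, coords):
    plt.annotate(label, (x, y))
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```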

2.3.4 Formulation of Statistical Hypotheses

This is undoubtedly the most challenging part of the cycle. The problem is that we are going to subject the collected data to more complex calculations, applying procedures that involve mathematical details often beyond our general understanding. At this step, the hard work is purely mental. It consists of converting the scientific question into relevant statistical hypotheses, a prerequisite to using the right methodology. Once this is done, all the rest is mere computation, which simply requires statistical software. Note that at this step of the data analysis cycle, the issues of repeatability, replicability, and reproducibility must be carefully considered [28]. While the previous exploration step can remain “quick and dirty” [29], this part of the analysis must be rigorous and well documented [30]. A typical mistake is to confuse “scientific hypotheses” with “statistical hypotheses.” Again, data analysis is a very lexically challenged discipline. As an illustration, a scientific hypothesis could be “I think that the function of this gene is required for the adaptation of cells to this environmental change,” whereas a derived statistical hypothesis could be “the mean of the logFC replicates in RNAseq data for this gene is different from 0.” These are two very different statements. We have often encountered colleagues who were already convinced that their scientific hypothesis was true (for very good reasons which we do not question), and thus strongly rejected the results of a statistical analysis that did not go in the expected direction. Such misunderstandings lead to difficult discussions, which could be avoided through a better understanding of statistics [31]. Rejecting a statistical hypothesis, based on a p-value calculation and/or a threshold for the type I error rate, does not imply that the scientific hypothesis is wrong [32]. But the information cannot simply be ignored; on the contrary, it can help decide the course of further analyses.
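To make the distinction concrete, the minimal sketch below evaluates the statistical hypothesis given as an example, i.e., that the mean of the logFC replicates for a given gene differs from 0; the four logFC values are made-up placeholders.

```python
from scipy import stats

# Made-up log2 fold-change estimates for one gene across four replicates.
log_fc = [1.8, 2.3, 1.5, 2.0]

# H0: the mean logFC equals 0 (no differential expression).
# H1: the mean logFC differs from 0.
result = stats.ttest_1samp(log_fc, popmean=0)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")

# Rejecting (or failing to reject) H0 says nothing, by itself, about the
# broader scientific hypothesis on the gene's role in adaptation.
```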

2.3.5 Interpretation and Conclusions

Finally, the last step. It is an essential time to describe the results and to interpret them in the general context of the scientific project. Expertise in the scientific field for which the data analysis is performed is very important. Links with already established knowledge are identified, discussed, and may lead to new scientific questions. A new cycle of analysis can be performed and, in this way, the scientific project advances one step (or cycle) after another.

3 Specific Issues in Omics Data Analyses

Below, we alert the reader to two important issues that are specific to the analysis of omics data. Although a detailed explanation is beyond the scope of this chapter, we provide references for those who wish to pursue these topics.

3.1 Be Aware of Cherry Picking

Over the past 10 years, Big Data has revolutionized many areas of the life sciences [33]. But is it an infinite source of new information? The answer is a priori yes, if appropriate analytic procedures are applied. Otherwise, Big Data can become an important source of false discoveries. Indeed, when the mind is motivated to search through a mass of data, it always ends up discovering a path, usually a straight line, to the narrative it initially wished to find. In other words, if we are already convinced that our scientific hypothesis is true (see previous section), there is a risk of retaining, from the multiple explorations of a large dataset, only the information that supports the hypothesis, while rejecting (consciously or unconsciously) the information that contradicts the initial conviction. This is called “cherry picking” [34], i.e., the action of pointing to individual cases that confirm a particular idea, while ignoring a significant portion of related and similar cases that may contradict it. In this context, statistics have an important role to play, helping to discriminate observations that have a high probability of occurring by chance from those which, on the contrary, have a low probability of doing so.

3.2 Be Aware of P-Hacking

P-hacking refers to the inappropriate manipulation of data so that a result can be presented as statistically significant [35]. Several P-hacking scenarios have been extensively discussed in the literature (see, for instance, [36]). In the first one, several statistical tests are performed on the same dataset (e.g., a parametric test and a nonparametric test), and only the results of the most significant test (the one with the lowest P-value) are shown. In the second one, observations are added to (or removed from) the sample used as statistical test input until the calculated P-value falls below a targeted threshold (generally 0.05). In the last scenario, a single dataset composed of multiple variables is used to test multiple statistical hypotheses until one of them turns out to be significant. It is very easy to fall into the P-hacking trap, even for experienced data analysts, and especially with scenario 3, which is very close to the exploration step of the data analysis cycle described above. This is the reason why the exploration step must be followed by a rigorous validation step. To delve deeper into this subject, we recommend the very comprehensive article [37].
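The toy simulation below illustrates scenario 3: when many variables drawn from pure noise are each tested against the same grouping, a handful of “significant” P-values below 0.05 appear by chance alone, which is precisely what a rigorous validation step (or an explicit multiple-testing correction) is meant to catch. The numbers of variables and samples are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_variables, n_per_group = 1000, 10

# Pure noise: no variable truly differs between the two groups.
group_a = rng.normal(size=(n_variables, n_per_group))
group_b = rng.normal(size=(n_variables, n_per_group))

# One t-test per variable.
p_values = stats.ttest_ind(group_a, group_b, axis=1).pvalue

# Roughly 5% of the variables look "significant" at the usual threshold,
# even though every null hypothesis is true.
print("Apparent discoveries at p < 0.05:", int((p_values < 0.05).sum()))
```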

4 Our Ten Recommended Rules

In the last section, we list 10 rules that help us move forward, despite the constant deluge of data. We are not always able to follow them consistently, but they represent a common objective.

  1.

    Take the time to set your long-term goal and break it into smaller, very specific parts.

    While it is important to dream big, it is also important to define the different stages that will allow you to move forward in the desired direction. Each of these steps must be specific enough to be carried out in a simple and effective way, without too much effort.

  2.

    Do not bypass elementary analyses.

    Even if there is a temptation to directly apply “hype” methodologies (in 2021, this is AI), simple analyses are still very valuable. Because they are easy to understand, you can stay focused on the data rather than on the underlying assumptions that guide the methodology. You also limit the risk of misunderstanding what the method can tell you. Elementary analyses are also useful for guiding the subsequent choice of more complex methodologies. Taking a little time here can save a lot of time later. If, for instance, your data turn out not to be worth considering, you might not put as much energy into using them. Note that the choice of methods (simple or complex) must always be guided by the stated objective, no more, no less, hence the importance of the previous rule.

  3.

    Use unlimited data visualization.

    We explained in previous sections how data visualization can drive our thinking in an analysis. Be careful not to restrict your data explorations to graphics that are automatically and systematically generated by a single tool. To be fully useful, data visualization should be as flexible and broad as possible, allowing data to be viewed from very different angles. Online tools [38, 39] exist to support your creativity. Keep in mind the importance of distinguishing “dataviz” (rapid process used for exploration) and “infographics” (lengthy process to support communication and help other people understand your message).

  4.

    Use statistics because you need them, not because others tell you to.

    Do not ask a question if you do not want to listen to the answer. This is the best advice to remember when the issue of carrying out a statistical test is discussed. Do we need statistics to convince ourselves or to convince others? We deeply believe that statistics are a powerful and indispensable tool for performing rigorous data analysis. However, poorly mastered or applied for the wrong reasons, their use can lead to borderline situations. “P-hacking” and “Cherry picking” are perfect illustrations (see previous section). Do not hesitate to consult a statistician to verify the accuracy of the analyses, and keep in mind that, in the end, it is you and you alone who make the decision; statistics are only one part of your overall thinking.

  5.

    Choose the right time to discuss your results, neither too early nor too late.

    This is important advice. Presenting your results (whether preliminary or advanced) requires clear thinking. Explaining your motivations, justifying your choices, and expressing your doubts cannot be improvised. We control how a message is sent, but we do not control how it is received. Communication therefore deserves a little preparation. Also, to be able to listen to feedback and comments, and thus identify weak points in the analysis, you cannot be in a hurry because of an impending deadline. Timing is therefore important. Finally, it is essential to take your audience into account. The less people know about your project, the more you will have to explain. This is worth considering if you are expecting the discussion to focus on a specific element of your project.

  6.

    Listen to the opinions of others but take full responsibility for all decisions.

    Collaboration in a multidisciplinary context can sometimes be difficult. While having different perspectives on the same issue can be very rich, it can also be crippling. At some point, therefore, it is important to be able to decide and move forward. At the end of the study, a critical review process can be applied in order to assess the relevance of the decisions that have been taken and therefore to learn from any errors.

  7.

    Guarantee the reproducibility of analyses but pay attention to not unnecessarily burden the process.

    The reproducibility of analyses is one of the major challenges of current bioinformatics. However, it can be arduous to achieve and very constraining in situations where creativity can make all the difference. This is why we clearly separate two stages of the data analysis cycle: when exploring the data, it is important to work as freely as possible, and then, only if the preliminary results are worth it, to implement good reproducibility practices. Keep in mind that the more these practices are mastered, the less effort is required to use them.

  8.

    Remain constantly intentional regarding where you are in a cycle.

    Working in analysis cycles has been a decisive change of habit for us. Our experience is that performing short cycles (a few days) is most effective.

  9.

    Take your time, but not too much, perfection is stagnation.

    You will have understood that, to navigate a constant deluge of data, you must be able to slow down, not run faster and faster just to stay in the same place, like the Red Queen [40]. Be careful, however, not to stop completely. Doing nothing but reading and thinking, looking for the perfect solution to the problem at hand, is ultimately useless if it does not turn into concrete action.

  10.

    Remember that being a data analyst is an amazing job!

    Data are nowadays everywhere and allow endless creativity for those who appreciate it. Combining computer science, statistics, and biology, omics data analysis is fascinating interdisciplinary work with many attractive perspectives. Isn’t data the new (s)oil?

5 Conclusion

In this chapter, we have shared key ideas and convictions that help us work on a daily basis. They have been developed both during our own omics data analyses and in the context of collaborative projects with biologists, informaticians, and statisticians. Overwhelmed by the constant flow of data and challenged by the methodological evolution of bioinformatics toward the FAIR Guiding Principles [41], we started to feel “stuck,” as if our work was never enough. The desire to automate our analyses was very strong at that point, and we did make significant progress in that direction. While this has been helpful, it was not enough. Today, we accept that we cannot compete with the pace at which omics technologies evolve, and we recognize that the burden represented by omics data is much too heavy to be carried alone. Again, data analysis is a complex process for which there is no universal protocol. It requires strong expertise, a lot of curiosity, and above all, patience. We believe that developing one’s own “know-how” is one key to success. It helps us become strategic and intentional, thus avoiding an endless drift into the ocean of possibilities. Taking full responsibility for the different choices required at each analysis step is another key to success. This is not easy in a multidisciplinary context, where we often think that others are better placed to say whether we are proceeding in the right direction. But in practice, we have learned that if we become the person in the room who asks the right questions, we greatly help the experts to give us accurate answers. Finally, we like the idea that data analysis is as much an art as a science [29], that there are no right or wrong answers, and that it is OK to have differences of opinion. So listening, reading, brainstorming, and going one step at a time are our main recommendations, and we suggest constantly keeping in mind that if we often overestimate what we can do in a day, we also underestimate what we can do in a year.