Biomarker heatmaps: visualization of complex biomarker data to detect storm-induced source changes in fluvial particulate organic carbon

Fluvial particulate organic carbon (POC) is a complex mixture that undergoes rapid and complicated shifts in source during storm events. High-temporal resolution sampling and source-sensitive chemical analyses, such as those for organic geochemical biomarkers, are necessary to investigate the dynamic POC source behaviour during storm events. However, experimental designs that accommodate those requirements inevitably yield large datasets that require a new data analysis approach. Here, we adapt one of the widely used data visualization techniques, heatmaps with clustering analysis, to seek patterns in source mobilization and transition and pinpoint their timing during storm events more effectively and intuitively. Biomarker concentration data are scaled and used to construct a biomarker heatmap using the ComplexHeatmap package in R. Hierarchical clustering is performed to reorder the biomarkers based on (dis)similarities in their concentration fluctuations during storm events. We implemented our approach to visualize our high-frequency biomarker data obtained from storm POC samples collected in the well-characterized field site of Clear Creek, Iowa. The results demonstrated clear sequential source changes from algal and microbial OC to vascular plants- and soil-rich OC during the event, with an additional source transition identified within the vascular plant biomarkers. The sensitivity analyses results showed that the additional source transition was lost as the temporal resolution of sampling was reduced to 25% of the original data. The sensitivity of the identified clustering to varying scaling methods and number of biomarkers was also examined. Comparison with principal component analysis (PCA) showed that the biomarker heatmap performed better in visualizing temporal changes of individual biomarkers. This biomarker heatmap approach will help scientists to understand the complex storm-induced POC source changes by offering a new perspective to explore the data and generate hypotheses to be tested in follow-up analyses.


Introduction
Research in the earth and environment sciences has become increasingly data-rich as advancements in technology have enabled the collection of large quantities of environmental data (Blair et al. 2019;Hassanpour et al. 2022;Hubbard et al. 2020). As a result, new challenges have emerged in integrating and interpreting such often-complex data with large spatial and temporal scales (Bell and Blais 2019;Hassanpour et al. 2022). To tackle these challenges, an interdisciplinary approach that incorporates data science can serve as a potential solution. Data science involves utilizing techniques from computer science and statistics to extract insights from data and has been widely used to address similar challenges in many fields such as life sciences (Blair et al. 2019;Greene et al. 2014;Provost and Fawcett 2013). Given the data-driven nature of earth and environment sciences and the growing need for interdisciplinary insights to discover new patterns and relationships in complex environmental data, data science has the potential to greatly improve monitoring and research in these fields.
One example where the application of data science can be beneficial is identification of the origins of particulate organic carbon (POC) in rivers and streams. Understanding the sources of organic carbon in rivers is crucial for comprehending the carbon cycle at regional and global levels, the impacts of land use and management practices on the carbon cycle, and the functioning of ecosystems (Cole et al. 2007;Giorgio and Pace 2008;Jeanneau et al. 2018). Source changes of POC in rivers and streams occur most actively during storm events (Hope et al. 1994;McClain et al. 2003). Dramatic changes in both concentrations and fluxes of fluvial POC during storm events have been well documented (Caverly et al. 2013;Cerro et al. 2014;Dalzell et al. 2005;Dhillon and Inamdar 2014;Jung et al. 2012). However, knowledge of time-resolved POC sources and related hydrological processes that transport the sources is sparse. Previous research has shown that POC is a complex mixture of autochthonous (produced in-channel) and allochthonous (imported) sources, the transport of which are initiated at different times during storm events (Hatten et al. 2012;Jeanneau et al. 2018). Details concerning the triggers for the initiation of transport, the temporal sequencing of inputs, and the integration of inputs are significant knowledge gaps.
In an effort to address these knowledge gaps, chemical composition changes of riverine POC have been monitored during storm events using various analytical techniques such as elemental analysis (C and N contents, C:N ratio), stable isotope measurements (δ 13 C, δ 15 N), and biomarker analysis (Blair et al. 2021;Derrien et al. 2020;Hatten et al. 2012;Johnson et al. 2018;Rowland et al. 2017). Often used in combination with other analytical techniques, biomarker analysis is a powerful tool that allows for the characterization of the complex composition of organic carbon by tracking biological sources through the measurement of individual organic compounds (Canuel and Hardison 2016;He et al. 2014). By monitoring multiple biomarkers simultaneously, a broad perspective of the organic carbon composition can be obtained (Pondell and Canuel 2020;Schmidt et al. 2010). However, such analysis, when applied to large sample collections such as our high-temporal resolution sampling, can create data sets that evade easy interpretation.
In this study, we adapt heatmap visualization and hierarchical clustering to interpret the complex biomarker data and obtain insights into how organic carbon sources change during storm events. Ideally, once the changes of sources and their timing are detected, hypotheses can be generated and tested in follow-up analyses (Bell and Blais 2019). Heatmaps are an exploratory method to visualize complex multi-dimensional data (Gu et al. 2016). They are extremely useful to capture hidden patterns in complex data when combined with clustering, given their ability to classify samples/variables based on their (dis)similarity and display the similar ones close together in the generated heatmap (Gu et al. 2016;Pleil et al. 2011). When detecting source shifts of POC during storm events, the absolute mass of biomarkers in samples is less important to understand. Rather, how the (relative) quantity of biomarkers change as a function of time is more of interest; therefore, the co-variations in the concentrations of biomarkers in POC are greater utility. Heatmaps with clustering analysis help to identify patterns or trends in relative abundance variations. Here we demonstrate (1) how to construct a biomarker heatmap from biomarker concentration data of POC samples collected during a storm event, (2) how to interpret the biomarker heatmap in the context of detecting POC source shifts during a storm event, (3) evaluation of the sensitivity of the method, and (4) comparison with a widely used visualization method, principal component analysis (PCA). Our work will be demonstrated using R, the widely used open-source statistical computing language (Ihaka and Gentleman 1996).

Input data and data scaling
The input data are organized as a two-dimensional matrix (n samples in rows × i biomarkers in columns) containing sample ID and concentrations of biomarkers of interest. The input data can be provided in any concentration/flux format such as the absolute mass of a biomarker normalized to organic carbon (OC) contents in the sample (mg biomarker/100 mg OC), to dry weight of the sample (ng/ mg sediment), to the volume of water collected ( µ g/L), or the storm water flux (mg/s) depending on the scientific questions asked. Alternatively, the peak area of biomarkers can be used as input data. Biomarker concentration data obtained from GC-MS analysis contain peak area quantified for each compound. Heatmap visualization focuses on the relative abundances of biomarkers. Thus, the peak area of biomarkers can be used directly without the need to calculate response factors for each compound to determine absolute concentrations. This is particularly advantageous when dealing with a large collection of diverse compounds, for which individual concentration calibrations would be arduous. This is a major advantage of this approach over more traditional methods.
The input data are loaded into RStudio, the integrated development environment for the statistical programming language R (Racine 2012). Biomarker concentrations often have different averages and ranges of values. To avoid inadvertent weighting of analyses due to extreme high or low absolute concentrations, the data are scaled by normalizing each column (i.e., biomarker) of data using autoscaling, which converts all biomarker concentrations into z-scores (Eq. 1). Autoscaling was selected for our heatmap visualization due to its better performance in visualizing the color contrast in the heatmap. Comparison with other scaling methods is discussed in Sect. 3.2. The calculated z-scores indicate how far the value of the sample is from the mean biomarker concentration of all samples. Negative z-scores mean the biomarker concentration of the sample is lower than the mean of biomarker concentration of the sample group and positive means higher.
Once the biomarker concentration data are scaled, the z-scores are used to construct a biomarker heatmap. A variety of packages are available to construct heatmaps in R including the built-in heatmap() function. In our example, we used the Heatmap() function from the ComplexHeatmap package version 2.8.0 due to its flexibility and the ability to add annotations and choose clustering algorithms (Gu et al. 2016). In heatmaps, z-scores are represented with colors.
Color in each box indicates the relative amount of the corresponding biomarker in the sample at the collection time compared to that of samples collected at other times of the event. With the default color setting, red indicates a higher concentration of the biomarker than the mean biomarker concentration of the sample group, whereas blue indicates a lower concentration. Colors in heatmaps must be interpreted with caution. Z-scores were calculated by comparing the relative concentrations of individual biomarkers across samples, not between biomarkers. Z-score (color) comparisons between biomarkers within a sample cannot be done with this method of normalization. For example, a biomarker with the richest red does not mean that the biomarker is the most abundant compound among all quantified biomarkers in the sample.

Clustering
Both rows and columns of the heatmap can be reordered based on their (dis)similarity in biomarker compositions using a choice of clustering algorithm (Gu et al. 2016). This heatmap coupled with clustering allows us to statistically identify clusters of samples and/or biomarkers and help to recognize hidden patterns in the data. In our example, we used hierarchical clustering on biomarkers with the Pearson distance metric (Jain et al. 1999) to capture the timing of compositional changes in POC based on chemical (dis)similarities ( Fig. 1). Categorical variables can be added next to biomarkers to color-tag their subgroups such as biochemical classes or biological sources.

Data source
We tested our biomarker heatmap approach on a field data set collected from the Clear Creek watershed, a small, wellcharacterized agricultural watershed (270 km 2 ) in Iowa, in which there is extensive prior knowledge concerning POC sources, both visually and chemically (Blair et al. 2021;Kim et al. 2020;Papanicolaou et al. 2017). Highfrequency suspended sediment samples were collected near Coralville, the outlet of the Clear Creek watershed, using an autosampler installed adjacent to the USGS stream gage Biomarker heatmap constructed using the same dataset without clustering (c) Biomarkers (rows) were reordered using hierarchical clustering (Fig. 1). The relative abundances of biomarkers are now recognized effectively and intuitively as 'patterns'.

Identification of POC sources using clustering and detection of source shifts on heatmap visualization
Multiple sources contributing riverine POC and the relative importance changes of each source during storm events have been observed in many watersheds with a range of sizes, land uses, and locations (Giorgio and Pace 2008; Derrien et al. 2020; Goñi et al. 2013;Hatten et al. 2012). Due to the low-frequency sampling strategy used in previous studies, such as once a day or once a week, their results have mostly focused on the contrasting POC compositions during low-flow conditions vs. high-flow conditions. Algal and microbial OC have been shown to dominate during lowflow conditions, while increasing contributions of terrestrial inputs were observed with high flow (Goñi et al. 2013;Hatten et al. 2012;Qiao et al. 2020).
Clear Creek was an Intensively Managed Landscapes -Critical Zone Observatory (IML-CZO) study site (Kumar et al. 2018), and previous research in Clear Creek has provided evidence that this trend of event-driven POC source shifts is present . As an agriculturedominated watershed, the low water velocity during lowflow conditions and the nutrients supplied from croplands in Clear Creek create an environment conducive for algae growth in the channel (Abban et al. 2016;Blair et al. 2021;Kim et al. 2020). With precipitation and increasing stream flow, increasing sediment load from surface soils and bank failure has been observed in the subbasin in Clear Creek (Abaci and Papanicolaou 2009;Wilson et al. 2012). Particularly in March when the POC samples for this study were collected, the fields lacked plant cover, enhancing the precipitation-driven erosion of surface materials (Xu et al. 2019). Another previous study in Clear Creek, which used stable C isotope measurements, has also shown temporal variations in the isotopic signature during storm events, highlighting different dominant POC sources at contrasting flow conditions (Blair et al. 2021). Given this knowledge of the watershed, we predict partitioning of biomarkers on the heatmap based on their OC sources and their expected mobilization timing during the storm event. Figure 2 summarizes the biomarkers quantified in this study and their corresponding biomarker groups and potential OC sources.
The expected partitioning was observed in the heatmap (Fig. 3a). Biomarkers were grouped together with those that simultaneously increased in concentration. This was shown as a 'hot spot' in the heatmap where the normalized (USGS 05454300). A total of 24 storm water samples were collected at approximately 2-hr intervals during a spring storm that took place between March 29 and 31 in 2017. The suspended sediment samples were analysed using the tetramethylammonium hydroxide (TMAH)-thermochemolysis sample preparation followed by GC-MS analysis (Frazier et al. 2003;He et al. 2020). This is a broad-spectrum analysis method that provides information on a wide range of organic species. In this case, we used 36 biomarker compounds to track common POC sources (e.g., algae, bacteria, vascular plants, and soils) (Canuel and Hardison 2016;He et al. 2020;Schmidt et al. 2010). The biomarker concentrations were obtained by normalizing the mass of a biomarker in a sample to the mass of OC in the sample and an internal standard. %OC of samples were obtained from elemental analysis. Detailed procedures of the chemical analyses are available in our previous publications (Blair et al. 2021;Kim et al. 2020).
A matrix with 24 samples × 36 biomarkers was imported into RStudio and scaled using autoscaling to normalize the concentration data within each column (i.e., biomarker). After scaling, the matrix was transposed to display time series samples in columns and was then used to construct a biomarker heatmap. Hierarchical clustering analysis was performed to find clusters among biomarkers. Samples were displayed in the order of their collection time in columns of the heatmap. This biomarker heatmap illustrates a comprehensive overview of relative abundances of biomarkers, instead of the conventional data plots such as line graphs displaying concentration profiles of individual biomarkers  Hedges and Mann 1979;Otto and Simpson 2006;Wysocki et al. 2008). Though the transition from algal/ microbial sources to vascular plant POC was fully expected, the presence of two time-resolved clusters (2 and 3) of different vascular plant biomarkers was not predicted, and this provides an indication of the heatmap value. This observation leads us to the following hypotheses to be tested in the future: (1) different sources or different parts of a source (e.g., leaves, stems) were mobilized and introduced into the system sequentially, (2) there are different diagenetic (degradation) states of materials with altered biomarker distributions and they were delivered at different times, and/or (3) hydrodynamic sorting separated the sources with different biomarker distributions during transport processes.

Sensitivity analyses
To evaluate the sensitivity of the method, we examined the effects of altering key parameters such as (1) scaling method, (2) number of samples (i.e., sampling frequency), and (3) number of biomarkers. The original input data (i.e., biomarker concentration data) was scaled using five different scaling methods (autoscaling, range scaling, pareto scaling, vast scaling, and level scaling) to determine which concentration of a group of biomarkers is at its maximum. This hot spot may indicate mobilization of in-stream sources, input of external sources into the stream, or transport of sources from upstream, and the biomarkers associated with it will be referred to be 'activated' at the time point hereafter.
The first cluster of biomarkers (cluster 1 in Fig. 3) was activated in the early portion of the storm and consists of short-(C12-14) to mid-chain (C16-24) even carbon number saturated fatty acids and unsaturated C16 and C18 fatty acids, which are commonly associated with algal or microbial inputs (Budge and Parrish 1998;Pondell and Canuel 2020;Schmidt et al. 2010;Volkman et al. 1989). This cluster was expected based on previous visual, stable carbon isotope, and biomarker studies at this site (Blair et al. 2021;Kim et al. 2020;Papanicolaou et al. 2017). At ~ 8 h after the first sample and during the rising limb of the hydrograph, the second group of biomarkers (cluster 2 in Fig. 3) began to increase in concentration, while concentrations of cluster 1 biomarkers decreased. Cluster 2 includes long-chain saturated fatty acids (C26-32), bacterial fatty acids, and phytosterols, indicating vascular plant and bacterial inputs (Jeanneau et al. 2018;Nakakuni et al. 2019;Pondell and Canuel 2022), and we hypothesize this is from in-channel debris and/or surface soils mobilized by rain impact on bare  Stream discharge is shown in the shaded area. Cluster 1 was the most important POC source in the pre-event period than any other time during the storm. Cluster 2 was more important during the rapid increase of discharge. Concentrations of biomarkers in cluster 3 were the highest on and after peak discharge large relative changes in their concentrations (hot spots in the heatmaps of pareto scaling and level scaling in Fig. 4), whereas vast scaling draws more attention to the biomarkers with relatively small concentration changes (red/blue color contrast in the heatmap of vast scaling in Fig. 4) (Berg et al. 2006). Therefore, pareto scaling and level scaling could be useful to focus on biomarkers with largest inputs during the storm event, whereas vast scaling may be useful to explore biomarkers that are relatively less impacted by the event.
The next key input factor we tested was the temporal resolution of sampling (i.e., number of samples used to construct a heatmap) on the biomarker clustering results (Fig. 5). The original dataset consisted of samples collected every ~ 2 h (Δt = 2 h) with a total event duration of ~ 47 h (sample n = 24). We omitted a given number of evenly spaced samples (e.g., exclusion of every 4th sample in the first trial) and generated a heatmap using the adjusted dataset (Fig. 5). As shown in Fig. 5, the clustering results consistently preserved the three biomarker clusters until the sampling frequency reached 8 h (Δt = 8 h ; n = 6) with only minor deviations from the original clustering. Specifically, when Δt = 4 h (n = 12), a single unsaturated fatty acid (C18:2) was included in cluster 2, which was originally in cluster 1. C18:2 fatty acid is a relatively less source-specific biomarker and has been reported to be originated from microalgae (Volkman et al. 1989) or vascular plants (Budge and Parrish 1998). Such biomarkers with multiple origins need additional information such as co-activated biomarkers to elucidate their origin. Hence, the change in this single biomarker did not affect our original conclusion with three biomarker clusters and their associated source identification. Similarly, when Δt = 6 h (n = 8), the three-biomarker cluster structure in the heatmap remained with only two biomarkers (iso-C17 fatty acid and β-sitosterol) moved from cluster 2 to cluster 3 in the trial. However, with a further reduction of the temporal resolution (Δt = 8 h; n = 6), the three-group clustering pattern disappeared, and biomarkers clustered into two groups instead. The first biomarker was most effective (Fig. 4). As shown in Fig. 4, the clustering results remained unchanged in the resulting heatmaps, indicating that the choice of scaling did not influence the ordering of biomarkers in the heatmaps. However, autoscaling and range scaling provided greater contrasts for all biomarkers compared to the other methods.
The original biomarker data include biomarker concentrations with differences in orders of magnitude between biomarkers. For example, C16-18 fatty acids are common in most organisms (Rustan and Drevon 2001), and their concentrations are higher than less abundant biomarkers such as bacterial fatty acids. Moreover, the range of concentration changes in response to storm events may vary for different biomarkers depending on the introduced sources. Autoscaling and range scaling correct these concentration variations and provide a relatively consistent range for all biomarkers, which allows us to compare relationships among all biomarkers equally. In contrast, the heatmaps with pareto scaling and level scaling highlight the biomarkers with Fig. 5 Effects of varying sampling frequency on heatmap visualization and biomarker clustering. The original data has 24 samples at ~ 2 h intervals (Δt = 2 h; n = 24). Δt and n below each heatmap indicate the sampling interval and the number of samples used to construct the corresponding heatmap, respectively. 1st trial: every 4th sample was removed; 2nd trial: every other sample was omitted; 3rd trial: starting from the first sample, the next two samples were removed; 4th trial: starting from the first sample, the next three samples were removed acids affected the number of clusters by removing cluster 3. This was because cluster 3 was primarily defined by those two biomarker groups (lignin phenols and cutin acids), and removing the two groups eliminated almost the entire cluster from the dataset. This result highlights the advantage of broad-spectrum analysis that offers redundancy in the definition and detection of clusters compared to more targeted analyses.

Comparison with PCA
Principal component analysis (PCA) is a widely used dimension reduction method in similar studies (Goñi et al. 2000;He et al. 2014;Pondell and Canuel 2022;Wysocki et al. 2008). We applied PCA to the same data set to compare the results with our biomarker heatmap (Fig. 6). Both PCA and our biomarker heatmap are unsupervised exploratory methods that generate informative visualizations of samples and variables by positioning similar objects close together on the plot and dissimilar objects apart from each other. In fact, the trend captured in our biomarker heatmap (Fig. 3) appeared on the PCA plot (Fig. 6). Specifically, the largest gradient of variability in the data was captured along the PC1 axis (40.3%), which separated early POC containing high concentrations of algal and microbial biomarkers (negative PC1) from later POC samples containing more lignin phenols and cutin acids (positive PC1). The PC2 axis, which explains the next 24.2% of variability in the data, separated the samples containing high concentrations of bacterial fatty acids and long-chain fatty acids from those collected at the early or late stages of the storm event.
However, there were limitations in capturing temporal changes of individual biomarkers independently in the PCA group that was mobilized earlier included algal and microbial biomarkers, and the later group included vascular plant biomarkers that were originally separated into cluster 2 and 3. This clustering in the lowest resolution leads to the more traditional interpretation of POC source shifts during storm events. This result clearly demonstrates that high temporal resolution sampling captures additional source resolution that would have otherwise been overlooked with low-frequency sampling. It should be noted that the optimal temporal resolution of sampling may vary based on several factors, such as watershed size, geomorphology, and precipitation characteristics.
Lastly, the sensitivity of the clustering to the number of biomarkers was examined by reducing the number of biomarker groups (see Fig. 2 for biomarker groups) as it might occur if one changed from using a broad-spectrum method such as TMAH-thermochemolysis to more targeted analyses such as lignin phenol analysis. When we eliminated a single biomarker group from the total of nine groups, the clustering structure remained unaffected in all nine trials. This is because clusters identified in this study are composed of multiple biomarker groups. To further reduce the number of biomarkers, we selected two biomarker groups from each biomarker cluster, including one group with the most specific temporal change associated with the cluster (e.g., lignin phenols for cluster 3) and another group that is more generic (e.g., unsaturated fatty acids for cluster 1 and bacterial fatty acids for cluster 2) within each cluster. Twelve trials were conducted by removing pairs among these selected six groups. The results showed a consistent three-group clustering pattern with each cluster displaying distinct activation timing in all trials except one. In that exceptional trial, the elimination of lignin phenols and cutin that constituted an entire cluster were excluded. Comparison with a conventional statistical method (PCA) showed that our biomarker heatmap provided more detailed information about biomarker compositions and their temporal changes. Potential exists for applications of the biomarker heatmap approach in addressing spatial or seasonal variations of biomarkers. Furthermore, untargeted analyses and the discovery of novel biomarkers can be facilitated by the clustering technique. plot, unlike our heatmap. Our biomarker heatmap provided a clear visualization of the relationships between all biomarkers using the dendrogram on the side of the heatmap (Fig. 3a). Also, transitions between co-activated biomarkers were visualized while samples were displayed in the order of collection time. This allows us to specifically pinpoint the transition timing of POC source activation and the shifts in biomarker compositions. In contrast, the PCA highlights which biomarkers contribute to explaining the dominant gradients in the data and how much of the variability in the given PC the biomarkers account for. Therefore, PCA does not provide temporally resolved information of individual biomarkers or a group of biomarkers. Furthermore, an artifact of PCA was observed when used on our dataset (Fig. 6). Early POC samples and late POC samples appeared close to each other in the PC plot, which can be misinterpreted as the biomarker compositions of late POC samples being similar to those of early POC samples. This issue may arise when the samples at both ends of the gradient (i.e., collection time) lack common biomarkers (Paliy and Shankar 2016). Applying a data transformation prior to the PCA analysis can partially address this issue (Paliy and Shankar 2016). Alternatively, one may choose to keep the data form as is, as it provides additional information about the data (Morton et al. 2017). However, we will not discuss it further as it is out of the scope of this study.

Conclusion
Appropriate data analysis techniques are essential in earth and environmental sciences to fully utilize the increasing volume of data obtained through technological and analytical advancements. With the recent trend towards high-temporal resolution sampling in storm-driven POC source shift studies, the size and complexity of datasets are expected to increase, especially when combined with the multi-biomarker analysis. In this study, we adopted the heatmap with clustering approach, one of the most popular data visualization methods, to better visualize POC sources and their changes as a function of time within a storm. Correlations between biomarkers produced identifiable clusters that could be linked to previously identified sources in the organic geochemical literature and in prior field studies. This approach provides flexibility in the form of input data, data scaling, and clustering, which was explored in part in this study. The choice of data scaling could be determined depending on the scientific question asked. The clustering pattern identified in the original analysis was lost when the temporal resolution of sampling was reduced from a 2 h-interval to an 8 h-interval. The reduction in the number of biomarkers did not affect the clustering pattern, unless the biomarkers article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons. org/licenses/by/4.0/.