Analysing spectroscopy data using two-step group penalized partial least squares regression

Environmental and Ecological Statistics

Abstract

A statistical challenge in analysing hyperspectral data is the multicollinearity between spectral bands. Partial least squares (PLS) has been extensively used as a dimensionality reduction technique that constructs lower-dimensional latent variables from the spectral bands that correlate with the response variables. However, it does not take into account the grouping structure of the full spectrum, where spectral subsets may exhibit distinct relationships with the response variables. We propose a two-step group penalized PLS regression approach. In the first step, a PLS regression is performed on each group of predictors identified by a clustering approach. In the second step, a group penalty is imposed on the latent components to select the group with the highest predictive power. When applied to simulations and a real-world observational data set, the proposed method demonstrated superior prediction performance, higher R-squared values and faster computation than other PLS variations. Interpretation of the model performance is illustrated using a real-world example in which leaf spectra are used to indirectly quantify leaf traits. The method is implemented in an R package called “groupPLS”, which is accessible from github.com/jialiwang1211/groupPLS.
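The two steps described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the groupPLS package and does not reproduce the authors' algorithm. The helper names (`adjacent_band_groups`, `pls_scores`, `group_lasso`), the correlation-threshold grouping used in place of a proper clustering method, the group-lasso form of the penalty, the tuning value `lam=0.5` and the simulated data are all assumptions made for illustration.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def centre(v):
    m = sum(v) / len(v)
    return [a - m for a in v]


def adjacent_band_groups(cols, threshold=0.7):
    """Stand-in for clustering: start a new group when neighbouring
    (centred) bands have absolute correlation below the threshold."""
    groups = [[0]]
    for j in range(1, len(cols)):
        a, b = cols[j - 1], cols[j]
        r = dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))
        if abs(r) >= threshold:
            groups[-1].append(j)
        else:
            groups.append([j])
    return groups


def pls_scores(cols, y, n_comp):
    """Step 1: single-response PLS (NIPALS-style) on one group.
    Returns mutually orthogonal scores, each scaled so that t't = n."""
    n = len(y)
    cols = [c[:] for c in cols]
    scores = []
    for _ in range(min(n_comp, len(cols))):
        w = [dot(c, y) for c in cols]              # covariance with the response
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:
            break
        t = [sum(wj * c[i] for wj, c in zip(w, cols)) / norm for i in range(n)]
        tt = dot(t, t)
        deflated = []
        for c in cols:                             # remove the part explained by t
            load = dot(c, t) / tt
            deflated.append([ci - load * ti for ci, ti in zip(c, t)])
        cols = deflated
        scale = math.sqrt(n / tt)
        scores.append([ti * scale for ti in t])
    return scores


def group_lasso(score_groups, y, lam, sweeps=100):
    """Step 2 (assumed penalty): block coordinate descent for
    (1/2n)||y - sum_g T_g b_g||^2 + lam * sum_g sqrt(p_g) ||b_g||_2.
    The closed-form update relies on T_g'T_g = n I within each group."""
    n = len(y)
    coefs = [[0.0] * len(g) for g in score_groups]
    resid = y[:]
    for _ in range(sweeps):
        for g, scores in enumerate(score_groups):
            for b, t in zip(coefs[g], scores):     # partial residual without group g
                resid = [r + b * ti for r, ti in zip(resid, t)]
            z = [dot(t, resid) / n for t in scores]
            z_norm = math.sqrt(dot(z, z))
            thresh = lam * math.sqrt(len(scores))
            shrink = max(0.0, 1.0 - thresh / z_norm) if z_norm > 0 else 0.0
            coefs[g] = [shrink * zk for zk in z]
            for b, t in zip(coefs[g], scores):
                resid = [r - b * ti for r, ti in zip(resid, t)]
    return coefs


# Simulated spectra: 12 bands in three blocks of four, each block driven by
# its own latent factor; the response depends on the middle block only.
random.seed(1)
n = 60
latent = [[random.gauss(0, 1) for _ in range(n)] for _ in range(3)]
cols = [centre([f + 0.3 * random.gauss(0, 1) for f in latent[j // 4]])
        for j in range(12)]
y = centre([2.0 * f + 0.2 * random.gauss(0, 1) for f in latent[1]])

groups = adjacent_band_groups(cols)
score_groups = [pls_scores([cols[j] for j in g], y, n_comp=2) for g in groups]
coefs = group_lasso(score_groups, y, lam=0.5)
selected = [g for g, b in zip(groups, coefs) if any(abs(v) > 1e-8 for v in b)]
print("selected band groups:", selected)
```

Because the penalty acts on whole blocks of latent components, a group is either kept with all of its components or dropped entirely, which is what allows the second step to pick out informative spectral regions. In practice the penalty weight would be chosen by cross-validation rather than fixed.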



Acknowledgements

The authors would like to thank Dr. Klara L Verbyla and Dr. Alexander B Zwart for their insightful discussions and comments. Dr. William Woodgate is supported by an Australian Research Council DECRA Fellowship (DE190101182). The OzFlux and SuperSite network is supported by the National Collaborative Research Infrastructure Strategy (NCRIS) through the Terrestrial Ecosystem Research Network (TERN).

Author information

Corresponding author

Correspondence to Jiali Wang.

Additional information

Communicated by Pierre Dutilleul.

Appendix

Fig. 8 Root mean squared prediction error (RMSPE) for predicting Carotene (\(\mu g~cm^{-2}\)), Chlorophyll (\(\mu g~cm^{-2}\)), EPS and Weight (g) by the six methods in the eucalypt data set

Fig. 9 R-squared value (\(R^2\)) for Carotene (\(\mu g~cm^{-2}\)), Chlorophyll (\(\mu g~cm^{-2}\)), EPS and Weight (g) by the six methods in the eucalypt data set

Table 3 The averages and standard deviations (in brackets) of the root mean squared prediction error (RMSPE) for Carotene (\(\mu g~cm^{-2}\)), Chlorophyll (\(\mu g~cm^{-2}\)), EPS and Weight (g) by the six methods in the eucalypt data set
Table 4 The averages and standard deviations (in brackets) of R-squared value (\(R^2\)) for Carotene (\(\mu g~cm^{-2}\)), Chlorophyll (\(\mu g~cm^{-2}\)), EPS and Weight (g) by the six methods in the eucalypt data set

About this article

Cite this article

Chang, L., Wang, J. & Woodgate, W. Analysing spectroscopy data using two-step group penalized partial least squares regression. Environ Ecol Stat 28, 445–467 (2021). https://doi.org/10.1007/s10651-021-00496-2

  • DOI: https://doi.org/10.1007/s10651-021-00496-2
