Markov regression model for analyzing big data to predict trajectories of repeated categorical outcomes: an application to \(\hbox {PM}_{2.5}\) air pollution data


Fine particulate matter (\(\text{PM}_{2.5}\)) consists of tiny airborne particles that harm the environment and human health when their concentration in the air is high. Elevated \(\text{PM}_{2.5}\) levels also reduce visibility and make the air appear hazy. Because of these impacts, almost every country tracks \(\text{PM}_{2.5}\) air quality and records the data repeatedly over time at many sites. Since the data are collected repeatedly, the repeated measures of \(\text{PM}_{2.5}\) level at a given site are likely to be naturally dependent. Modeling and analyzing these repeated data can help policymakers recommend new policies or update existing ones, so adequate modeling of such data is of great interest to researchers and policymakers. Because the data are collected repeatedly and in immense volume, big data modeling techniques are required. This paper proposes a new modeling framework for analyzing repeatedly collected categorical responses in big data and for predicting their trajectory risks. We develop a divide and recombine approach for analyzing big data gathered repeatedly. A Markov model is used to divide the data, and a Markov chain is used to recombine the marginal and conditional probabilities into estimated joint probabilities for a trajectory. We illustrate the proposed model using \(\text{PM}_{2.5}\) outdoor air pollution data from the United States for the years 2000 to 2020. The performance of the proposed methodology is also checked through bootstrap simulation studies. The proposed methodology will be useful for analyzing repeatedly measured responses and predicting trajectory risks from big data in various fields.
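The recombination idea in the abstract, chaining a marginal probability with conditional (transition) probabilities to obtain the joint probability of a trajectory, can be illustrated with a minimal sketch. It assumes a first-order Markov chain with two outcome categories, and every probability below is a made-up placeholder rather than an estimate from the paper. The function name `trajectory_probability` and the state labels are hypothetical, and in the paper the probabilities would come from fitted regression models rather than fixed tables.

```python
from itertools import product
from typing import Dict, List, Tuple

State = str
Transition = Dict[Tuple[State, State], float]


def trajectory_probability(
    marginal: Dict[State, float],
    transitions: List[Transition],
    path: List[State],
) -> float:
    """Joint probability of a trajectory under a first-order Markov chain.

    P(y_1, ..., y_T) = P(y_1) * prod_{t=2..T} P(y_t | y_{t-1})
    """
    if len(path) != len(transitions) + 1:
        raise ValueError("need one transition table per consecutive pair")
    prob = marginal[path[0]]  # marginal probability of the first outcome
    for table, (prev, curr) in zip(transitions, zip(path, path[1:])):
        prob *= table[(prev, curr)]  # conditional probability of each step
    return prob


# Illustrative placeholder values (assumptions, not results from the paper).
marginal = {"good": 0.7, "poor": 0.3}
step: Transition = {
    ("good", "good"): 0.9,
    ("good", "poor"): 0.1,
    ("poor", "good"): 0.4,
    ("poor", "poor"): 0.6,
}
transitions = [step, step]  # two transitions, so trajectories of length 3

p = trajectory_probability(marginal, transitions, ["good", "good", "poor"])

# The joint probabilities over all possible trajectories should sum to 1.
total = sum(
    trajectory_probability(marginal, transitions, list(path))
    for path in product(marginal, repeat=3)
)
```

Here `p` is the product 0.7 × 0.9 × 0.1, and `total` serves as a sanity check that the chained probabilities form a valid joint distribution. Using one transition table per consecutive pair of time points leaves room for transition probabilities that differ from year to year.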






This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada (NSERC). We also acknowledge the United States Environmental Protection Agency (EPA) for making these data publicly available. The authors are grateful to the referees for their helpful comments, which greatly improved the quality of the paper.

Author information


Corresponding author

Correspondence to M. Tariqul Hasan.

Additional information

Communicated by Jun Zhu.


Appendix A

figure a

Appendix B

Table 7 Outcome frequency for the years 2000 to 2020
Table 8 Crosstabulation of outcomes between consecutive years to identify different transitions
Table 9 Estimates of Markov regression models for all transitions
Table 10 Marginal, conditional and joint probabilities for Seaford, Delaware

Appendix C

Fig. 7 ROC curves for the subsets 1 to 15
Fig. 8 ROC curves for the subsets 16 to 30
Fig. 9 ROC curves for the subsets 31 to 41
Fig. 10 Proportion of \(\hbox {PM}_{2.5}\) level based air quality indicator at various monitoring sites in the USA
Fig. 11 Pairwise distance for data subsets. Red represents high and blue represents low similarity. The color level is proportional to the value of the dissimilarity between observations


Cite this article

Chowdhury, R.I., Hasan, M.T. Markov regression model for analyzing big data to predict trajectories of repeated categorical outcomes: an application to \(\hbox {PM}_{2.5}\) air pollution data. Environ Ecol Stat 29, 149–184 (2022).


  • Big data
  • Divide and recombine
  • Longitudinal data
  • Markov model
  • Trajectory Risks