Abstract
Imbalanced construction datasets reduce the accuracy of machine learning (ML) models. Recent construction management research has addressed this issue through different sampling approaches. Despite their advantages, these sampling approaches reduce the reliability of the prediction model and pose the risk of artificial bias. The objective of this study is to address the challenge of imbalanced datasets in construction progress prediction models using a novel variational autoencoder (VAE) that generates synthetic data for underrepresented classes. The VAE's encoder-decoder architecture, along with its latent space components, is optimized for this task. A comparative analysis using decision tree-based ML models, including grid search optimization, substantiated the effectiveness of the VAE approach. The results indicate that the ML models benefited from the addition of the synthesized data in the hybrid dataset, showing 2% improvements in performance metrics across most models. The synthetic data generated by VAEs contributes to the construction of more balanced datasets, which, in turn, can lead to more reliable and accurate predictive models. The enhanced accuracy of the VAE-ML model addresses the class imbalance problem and improves the reliability of construction productivity predictions and related resource allocation plans.
1 Introduction
The construction industry often grapples with the complexity of data, particularly when it comes to monitoring and predicting project progress. Effective project control is essential for the success of construction projects (Ezzeddine et al. 2022), as it reduces the rate of construction budget and schedule failures. However, the increasing complexity of projects, coupled with inefficiencies in project control systems, accelerates the rate of budget and schedule failures (Ezzeddine et al. 2022). Earned value analysis (EVA), employing indicators such as the cost performance index (CPI) and the schedule performance index (SPI), is a widely accepted tool for comprehensive construction performance analysis (Kim and Kim 2014). Together, SPI and CPI provide crucial insights into the health and progress of a project, offering a bird’s-eye view of its performance triangle (Kim and Kim 2014). The effectiveness of EVA is often limited by the quality of input data, particularly from traditional forecasting methods like S-curves. Recognizing the limitations of traditional methods, the construction sector has increasingly adopted ML solutions (Candaş and Tokdemir 2022; Kazar et al. 2022; Koc 2023; Mammadov et al. 2023; Mostofi et al. 2022; Mostofi and Toğan 2023; Toğan et al. 2022). However, the efficacy of these models is contingent on the quality and size of the underlying datasets, which remains a challenge in the construction industry (Althnian et al. 2021; Li et al. 2017; Sordo and Zeng 2005).
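For reference, the two EVA indicators follow the standard earned value definitions: CPI is the earned value divided by the actual cost, and SPI is the earned value divided by the planned value. The short sketch below computes both. The function name and the sample figures are illustrative assumptions and are not taken from the project dataset.

```python
def earned_value_indices(earned_value, actual_cost, planned_value):
    """Return (CPI, SPI) from the standard earned value definitions."""
    if actual_cost <= 0 or planned_value <= 0:
        raise ValueError("actual_cost and planned_value must be positive")
    cpi = earned_value / actual_cost    # above 1: under budget, below 1: over budget
    spi = earned_value / planned_value  # above 1: ahead of schedule, below 1: behind
    return cpi, spi


# Hypothetical activity (assumed numbers): 80 earned, 100 spent, 90 planned.
cpi, spi = earned_value_indices(earned_value=80.0, actual_cost=100.0, planned_value=90.0)
print(f"CPI={cpi:.2f}, SPI={spi:.2f}")
```

In this made-up case both indices fall below 1, which would flag the activity as over budget and behind schedule.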
While ML models have shown potential in forecasting construction productivity, their performance heavily relies on the quality and size of the underlying datasets (Althnian et al. 2021; Aroyo et al. 2021; Barbierato et al. 2022). A critical issue faced in the construction sector is the collection of ample, relevant, and reliable data, a challenge that is aggravated by the dynamic and heterogeneous nature of construction projects. Data augmentation has been recognized as a critical strategy to enhance the performance of ML models (Barbierato et al. 2022; Mostofi et al. 2023). Variational autoencoders (VAEs) stand out as a powerful tool in this context. VAEs are generative models capable of learning complex data distributions and generating new data samples. Prior research demonstrated the potential of VAEs to improve the accuracy of ML by 8% (Islam et al. 2021). Mostofi et al. (2023) explored the use of VAEs in construction management to improve the prediction accuracy of graph attention networks for CPI productivity prediction, achieving considerable improvement in prediction accuracy. However, that study did not evaluate the VAE on other ML models, particularly on SPI prediction. As a result, the present research delves into the application of VAEs on a comprehensive construction progress dataset, with the aim of generating synthetic data that addresses the imbalance issue. In addition, our study details the components of the VAE, including its encoder-decoder architecture, latent space, and hyperparameters, which are crucial for its effective functioning. The objective is to leverage the generative capabilities of VAEs to address the challenges of underperforming ML solutions due to training on imbalanced datasets, thereby aiding in more accurate and reliable construction project forecasting and control.
A comparative analysis using decision tree-based models, such as random forest, AdaBoost, gradient boosting, LightGBM, XGBoost, extra trees classifier, and bagging models, was performed.
2 Methodology
The methodology of using a VAE for data generation and then aggregating the generated data with the collected dataset involves several key steps. Figure 1 displays the research flow of this study.
The original dataset comprises 1,342 progress records collected from a construction project site, detailing the resources used for the execution of different construction activities, whose performance was reported through CPI and SPI. Data preparation then included handling records with inconsistencies and missing values, adjusting the format, and selecting the features for SPI prediction. Next, the median number of instances per class was determined to identify the underrepresented classes within the dataset. Subsequently, the underrepresented portion of the data was used for targeted data generation to address the issue of class imbalance. Here, a VAE model was configured with an encoder and a decoder network. The encoder maps input data to a latent space representation, while the decoder reconstructs the data from the latent space. At this stage, a sampling function was implemented within the VAE architecture to facilitate the generation of new data points from the learned distribution. The VAE was trained on 80% of the underrepresented-class data to learn the distributions specific to these classes, while the remaining 20% was used for model validation. The training of the VAE was guided by a reconstruction loss and the Kullback–Leibler divergence as the latent loss, while being optimized based on the mean squared error. Figure 2 displays the performance of the utilized VAE.
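The sampling function and the two loss terms described above can be illustrated with a minimal sketch. It is not the configuration used in this study: the encoder and decoder networks are omitted, the vectors `mu` and `log_var` stand in for hypothetical encoder outputs, and the equal weighting of the two terms (`kl_weight=1.0`) is an assumption. The sampling step uses the standard reparameterization trick, and the latent loss is the closed-form Kullback–Leibler divergence between a diagonal Gaussian and the standard normal prior.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_divergence(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over the latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


def mse(x, x_hat):
    """Mean squared error between an input record and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def vae_loss(x, x_hat, mu, log_var, kl_weight=1.0):
    """Reconstruction term plus weighted latent term (the weighting is an assumption)."""
    return mse(x, x_hat) + kl_weight * kl_divergence(mu, log_var)


rng = random.Random(0)
mu, log_var = [1.0, 0.0], [0.0, 0.0]       # hypothetical encoder outputs
z = sample_latent(mu, log_var, rng)        # latent point that a decoder would map to a record
x, x_hat = [1.0, 2.0], [1.5, 2.0]          # hypothetical input and reconstruction
kl = kl_divergence(mu, log_var)
loss = vae_loss(x, x_hat, mu, log_var)
print(len(z), kl, loss)
```

Once such a model is trained, new records are typically generated by drawing latent points from the prior and passing them through the decoder.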
The generated data comprises 315 new records created by the VAE, which aimed to address the imbalance in the original dataset. The synthesized data was then combined with the original dataset to create a balanced dataset. The aggregated dataset, combining original and generated data, totals 1,496 observations. The combined dataset provides a more substantial and diverse training base, which could potentially lead to more robust models that generalize better to unseen data. Next, a variety of tree-based models were considered, namely decision trees, random forest, AdaBoost, gradient boosting, LightGBM, XGBoost, extra trees classifier, and bagging, ranging from simple decision rules to complex ensemble methods that aggregate multiple weak learners into a stronger predictive model.
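The identification of underrepresented classes and the aggregation of original and generated records can be sketched as follows. The rule "a class is underrepresented when its count is below the median class count" is an assumption about how the median was applied, and the class names and record layout are hypothetical.

```python
from collections import Counter
from statistics import median


def underrepresented_classes(labels):
    """Return the classes whose record count is below the median class count."""
    counts = Counter(labels)
    threshold = median(counts.values())
    return sorted(c for c, n in counts.items() if n < threshold), threshold


# Hypothetical SPI classes with assumed counts 6, 3, and 1.
labels = ["on_track"] * 6 + ["behind"] * 3 + ["ahead"] * 1
minority, threshold = underrepresented_classes(labels)

# Tag each record with its origin so the hybrid dataset stays traceable.
original = [{"spi_class": c, "source": "original"} for c in labels]
synthetic = [{"spi_class": c, "source": "vae"} for c in minority for _ in range(2)]  # stand-in for VAE output
hybrid = original + synthetic

print(minority, threshold, Counter(r["spi_class"] for r in hybrid))
```

Keeping a `source` field makes it possible to evaluate models on original records only, so that synthetic records never leak into a test split.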
3 Results
This study evaluated the proposed VAE-ML approach on various ML models, comparing their performance when trained on the hybrid dataset, comprising synthesized and original data, with their performance on the original collected data alone. For each model type, we conducted hyperparameter optimization using grid search, systematically running the models through multiple combinations of parameter values and cross-validating them. This allows the determination of the settings that give the best performance according to a specified metric. The models configured with their best hyperparameters were further evaluated based on accuracy, F1 score, precision, recall, and area under the curve (AUC) metrics. Figure 3 displays the prediction performance of the ML models trained over the original and hybrid datasets based on accuracy and F1 score metrics.
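The grid search procedure described above amounts to a loop over all parameter combinations, each scored by cross-validation. The sketch below shows only that control flow. The scoring function is a placeholder that stands in for "train a tree-based model on the training fold and score it on the validation fold", and the parameter names and grid values are assumptions, not the grids used in this study.

```python
from itertools import product
from statistics import mean


def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size


def grid_search(param_grid, score_fn, n_samples, k=5):
    """Try every parameter combination and keep the best mean validation score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = mean(score_fn(params, tr, va) for tr, va in k_fold_splits(n_samples, k))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params, train_idx, val_idx):
    """Placeholder scorer (assumption): peaks at max_depth=5, n_estimators=200."""
    return 1.0 - 0.01 * abs(params["max_depth"] - 5) - 0.0001 * abs(params["n_estimators"] - 200)


grid = {"max_depth": [3, 5, 7], "n_estimators": [100, 200]}  # hypothetical grid
best_params, best_score = grid_search(grid, toy_score, n_samples=20, k=5)
print(best_params, best_score)
```

With imbalanced classes, the folds would normally be shuffled and stratified rather than contiguous. Library routines such as scikit-learn's `GridSearchCV` implement the same loop with those refinements.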
The hybrid dataset, which includes the synthesized data, consistently exhibited superior performance compared to the original dataset. Notably, models like LightGBM and gradient boosting achieved an accuracy of 0.98. The integration of VAE in ML models improved their prediction accuracies by about 2%. Figure 4 compares the prediction performance of these models.
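The accuracy, precision, recall, and F1 scores compared here follow the usual classification definitions. The sketch below computes them for a multi-class problem with macro averaging. The averaging scheme is an assumption, since the study does not state how per-class scores were combined, and the labels are made up. AUC is omitted because it requires predicted probabilities rather than class labels.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 over all classes."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical labels for three SPI classes encoded as 0, 1, 2.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
scores = macro_metrics(y_true, y_pred)
print(scores)
```

Macro averaging weights every class equally, which makes it sensitive to performance on the underrepresented classes that plain accuracy can hide.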
The VAE-ML approach proposed in this research demonstrated a robust predictive capability after data augmentation. Previous studies employed undersampling, oversampling, and synthetic minority oversampling technique (SMOTE) approaches. Undersampling removes records related to the majority class (Guo et al. 2018; Mishra and Singh 2021) and thus loses important information from the utilized dataset (Taha et al. 2021). On the other hand, oversampling replicates the underrepresented class (Guo et al. 2018; Mishra and Singh 2021), where the introduced repetition in the dataset poses the risk of overfitting (Taha et al. 2021).
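The trade-off between the two baseline approaches can be seen in a small sketch of random undersampling and random oversampling. The record layout, class names, and class sizes are hypothetical. The cited studies may use more elaborate variants.

```python
import random
from collections import Counter


def group_by_class(records):
    """Group (features, label) records by their label."""
    groups = {}
    for features, label in records:
        groups.setdefault(label, []).append((features, label))
    return groups


def random_undersample(records, rng):
    """Drop records at random until every class matches the smallest class."""
    groups = group_by_class(records)
    target = min(len(g) for g in groups.values())
    return [r for g in groups.values() for r in rng.sample(g, target)]


def random_oversample(records, rng):
    """Replicate records at random until every class matches the largest class."""
    groups = group_by_class(records)
    target = max(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(g)
        balanced.extend(rng.choices(g, k=target - len(g)))
    return balanced


rng = random.Random(42)
records = [((i,), "majority") for i in range(8)] + [((100 + i,), "minority") for i in range(2)]
under = random_undersample(records, rng)
over = random_oversample(records, rng)
distinct_minority = {r for r in over if r[1] == "minority"}

print(Counter(label for _, label in under))
print(Counter(label for _, label in over))
print(len(distinct_minority))
```

In this toy case undersampling discards six of the eight majority records, while oversampling yields eight minority records that contain only two distinct observations, which illustrates the information loss and repetition discussed above.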
SMOTE creates artificial samples for the minority class using nearest-neighbor approaches and has been pivotal in enriching datasets, thereby fostering better classifier performance across various domains (Bogner et al. 2018; Chawla et al. 2002). However, research has questioned the reliability of SMOTE, as its over-generalization of the minority class can result in artificial biases (Bao and Yang 2023) and an inclination of the predictions towards the minority class (Blagus and Lusa 2012, 2013).
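The nearest-neighbor interpolation at the core of SMOTE can be sketched in a few lines. This is a simplified illustration of the interpolation step only, with made-up two-dimensional points, and not a complete implementation of the algorithm by Chawla et al. (2002).

```python
import math
import random


def smote_sample(minority, k, rng):
    """Create one synthetic point on the segment between a minority sample
    and one of its k nearest minority-class neighbours."""
    i = rng.randrange(len(minority))
    base = minority[i]
    others = [p for j, p in enumerate(minority) if j != i]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))


rng = random.Random(7)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # hypothetical minority records
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(5)]
print(synthetic)
```

Because every synthetic point lies on a line segment between two existing minority records, the generated samples can fill regions of feature space that the minority class does not actually occupy, which is one way the over-generalization noted above can arise.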
Considering the drawbacks mentioned, there exists a gap in methodologies that not only address class imbalance but also ensure the authenticity and reliability of prediction models.
The incorporation of VAE-generated synthetic data appears to be a potent strategy for enhancing ML model performance. This approach has demonstrated substantial benefits in accuracy, precision, recall, and AUC measures without incurring prohibitive computational costs. These findings suggest a promising direction for future research and applications. Future studies could focus on the application of VAEs across a wider range of predictive tasks and explore the impact of different types of generative models on the predictive accuracy of various machine learning algorithms.
While the results are promising, caution must be exercised in interpreting these findings. The performance boost from the synthetic data must be validated against more extensive datasets from different construction projects to confirm the ability of the VAE to improve the prediction performance of ML models.
4 Conclusion
There exists a gap in construction management research regarding methods that improve prediction accuracy on real-life construction datasets with imbalanced prediction classes. The objective of this study was to address the challenges associated with data imbalance in construction management datasets and enhance the performance of machine learning models using data augmentation through a VAE model. The research entailed a comprehensive analysis of multiple decision tree-based ML models, evaluating their ability to predict construction project outcomes effectively, considering the SPI metric.
The obtained results highlighted several pivotal findings. Firstly, the incorporation of VAE-generated data enhanced the accuracy of SPI predictions, improving the prediction performance of the eight investigated ML approaches by up to 2%. The implementation of the VAE-ML model has thus improved the prediction accuracy on the construction management dataset.
This suggests that addressing class imbalance through synthetic data generation can effectively improve model performance. Secondly, the results underscored the need for rigorous data pre-processing and augmentation to improve the robustness of ML models in construction management.
Overall, throughout the research, it was evident that data imbalance impacts the predictive capability of traditional models. The use of VAEs for data generation proved to be a promising approach to mitigate this issue. The generative capability of VAEs allowed for the creation of synthetic yet realistic samples, which balanced the dataset and provided a more uniform learning environment for the ML models.
However, the study also revealed gaps that warrant further investigation. The application of VAE was primarily focused on SPI prediction using different decision tree-based ML models, and its effectiveness on deep learning approaches remains unexplored, pointing to an area ripe for future research.
References
Althnian, A., et al.: Impact of dataset size on classification performance: an empirical evaluation in the medical domain. Appl. Sci. 11(2), 796 (2021). https://doi.org/10.3390/app11020796
Aroyo, L., Lease, M., Paritosh, P., Schaekermann, M.: Data excellence for AI: why should you care (2021)
Bao, Y., Yang, S.: Two novel SMOTE methods for solving imbalanced classification problems. IEEE Access 11, 5816–5823 (2023). https://doi.org/10.1109/ACCESS.2023.3236794
Barbierato, E., Della Vedova, M.L., Tessera, D., Toti, D., Vanoli, N.: A methodology for controlling bias and fairness in synthetic data generation. Appl. Sci. 12(9), 4619 (2022). https://doi.org/10.3390/app12094619
Blagus, R., Lusa, L.: Evaluation of SMOTE for high-dimensional class-imbalanced microarray data. In: 2012 11th International Conference on Machine Learning and Applications, pp. 89–94. IEEE (2012)
Blagus, R., Lusa, L.: SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics 14(1), 106 (2013). https://doi.org/10.1186/1471-2105-14-106
Bogner, C., Seo, B., Rohner, D., Reineking, B.: Classification of rare land cover types: Distinguishing annual and perennial crops in an agricultural catchment in South Korea. PLoS ONE 13(1), e0190476 (2018). https://doi.org/10.1371/journal.pone.0190476
Candaş, A.B., Tokdemir, O.B.: Automated identification of vagueness in the FIDIC silver book conditions of contract. J. Constr. Eng. Manag. 148(4) (2022). https://doi.org/10.1061/(ASCE)CO.1943-7862.0002254
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002). https://doi.org/10.1613/jair.953
Ezzeddine, A., Shehab, L., Lucko, G., Hamzeh, F.: Forecasting construction project performance with momentum using singularity functions in LPS. J. Constr. Eng. Manag. 148(8) (2022). https://doi.org/10.1061/(ASCE)CO.1943-7862.0002320
Guo, H., Diao, X., Liu, H.: Embedding undersampling rotation forest for imbalanced problem. Comput. Intell. Neurosci. 2018, 1–15 (2018). https://doi.org/10.1155/2018/6798042
Islam, Z., Abdel-Aty, M., Cai, Q., Yuan, J.: Crash data augmentation using variational autoencoder. Accid. Anal. Prev. 151, 105950 (2021). https://doi.org/10.1016/J.AAP.2020.105950
Kazar, G., Doğan, N.B., Ayhan, B.U., Tokdemir, O.B.: Quality failures–based critical cost impact factors: logistic regression analysis. J. Constr. Eng. Manag. 148(12), 04022138 (2022). https://doi.org/10.1061/(ASCE)CO.1943-7862.0002412
Kim, B.-C., Kim, H.-J.: Sensitivity of earned value schedule forecasting to s-curve patterns. J. Constr. Eng. Manag. 140(7), 04014023 (2014). https://doi.org/10.1061/(ASCE)CO.1943-7862.0000856
Koc, K.: Role of national conditions in occupational fatal accidents in the construction industry using interpretable machine learning approach. J. Manag. Eng. 39(6) (2023). https://doi.org/10.1061/JMENEA.MEENG-5516
Li, D.-C., Lin, W.-K., Lin, L.-S., Chen, C.-C., Huang, W.-T.: The attribute-trend-similarity method to improve learning performance for small datasets. Int. J. Prod. Res. 55(7), 1898–1913 (2017). https://doi.org/10.1080/00207543.2016.1213447
Mammadov, A., Kazar, G., Koc, K., Tokdemir, O.B.: Predicting accident outcomes in cross-border pipeline construction projects using machine learning algorithms. Arab. J. Sci. Eng. 1–19 (2023). https://doi.org/10.1007/s13369-023-07964-w
Mishra, N.K., Singh, P.K.: Feature construction and smote-based imbalance handling for multi-label learning. Inf. Sci. (N Y) 563, 342–357 (2021). https://doi.org/10.1016/j.ins.2021.03.001
Mostofi, F., Toğan, V.: Explainable safety risk management in construction with unsupervised learning, pp. 273–305 (2023)
Mostofi, F., Toğan, V., Ayözen, Y.E., Tokdemir, O.B.: Predicting the impact of construction rework cost using an ensemble classifier. Sustainability (Switzerland), 14(22) (2022). https://doi.org/10.3390/su142214800
Mostofi, F., Toğan, V., Tokdemir, O.B.: Enhancing construction productivity prediction through variational autoencoders and graph attention network. In: Proceedings of 3rd International Civil Engineering and Architecture Congress (ICEARC 2023), pp. 120–128 (2023). Trabzon: Golden light Publishing
Sordo, M., Zeng, Q.: On sample size and classification accuracy: a performance comparison, pp. 193–201 (2005)
Taha, A.Y., Tiun, S., Abd Rahman, A.H., Sabah, A.: Multilabel over-sampling and under-sampling with class alignment for imbalanced multilabel text classification. J. Inf. Commun. Technol. 20 (2021). https://doi.org/10.32890/jict2021.20.3.6
Toğan, V., Mostofi, F., Ayözen, Y.E., Tokdemir, O.B.: Customized AutoML: an automated machine learning system for predicting severity of construction accidents. Buildings 12(11) (2022). https://doi.org/10.3390/buildings12111933
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2024 The Author(s)
About this paper
Cite this paper
Mostofi, F., Tokdemir, O.B., Toğan, V. (2024). Leveraging Variational Autoencoder for Improved Construction Progress Prediction Performance. In: Feng, G. (eds) Proceedings of the 10th International Conference on Civil Engineering. ICCE 2023. Lecture Notes in Civil Engineering, vol 526. Springer, Singapore. https://doi.org/10.1007/978-981-97-4355-1_51
DOI: https://doi.org/10.1007/978-981-97-4355-1_51
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-4354-4
Online ISBN: 978-981-97-4355-1
eBook Packages: Engineering, Engineering (R0)