1 Introduction

The construction industry often grapples with complex data, particularly when monitoring and predicting project progress. Effective project control is essential for the success of construction projects (Ezzeddine et al. 2022) because it reduces the rate of budget and schedule failures. However, the increasing complexity of projects, coupled with inefficiencies in project control systems, accelerates the rate of these failures (Ezzeddine et al. 2022). Earned value analysis (EVA), which employs indicators such as the cost performance index (CPI) and the schedule performance index (SPI), is a widely accepted tool for comprehensive construction performance analysis (Kim and Kim 2014). Together, SPI and CPI provide crucial insights into the health and progress of a project, offering a bird’s-eye view of its performance triangle (Kim and Kim 2014). The effectiveness of EVA is often limited by the quality of input data, particularly data from traditional forecasting methods such as S-curves. Recognizing the limitations of traditional methods, the construction sector has increasingly adopted machine learning (ML) solutions (Candaş and Tokdemir 2022; Kazar et al. 2022; Koc 2023; Mammadov et al. 2023; Mostofi et al. 2022; Mostofi and Toğan 2023; Toğan et al. 2022). However, the efficacy of these models is contingent on the quality and size of the underlying datasets, which remains a challenge in the construction industry (Althnian et al. 2021; Li et al. 2017; Sordo and Zeng 2005).
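For reference, the two indices follow the standard earned value definitions, CPI = EV/AC and SPI = EV/PV, where EV is the earned value, AC the actual cost, and PV the planned value. The following minimal Python sketch illustrates the calculation; the function name and the figures are illustrative assumptions and are not taken from the dataset used in this study.

```python
def performance_indices(earned_value, actual_cost, planned_value):
    """Return (CPI, SPI) using the standard earned value definitions."""
    if actual_cost <= 0 or planned_value <= 0:
        raise ValueError("actual_cost and planned_value must be positive")
    cpi = earned_value / actual_cost    # above 1: under budget
    spi = earned_value / planned_value  # above 1: ahead of schedule
    return cpi, spi


# Hypothetical activity: 80 units of work earned, 100 spent, 90 planned.
cpi, spi = performance_indices(80.0, 100.0, 90.0)
print(f"CPI={cpi:.2f}, SPI={spi:.2f}")
```

In this hypothetical case both indices fall below 1, which indicates an activity that is over budget and behind schedule.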

While ML models have shown potential in forecasting construction productivity, their performance relies heavily on the quality and size of the underlying datasets (Althnian et al. 2021; Aroyo et al. 2021; Barbierato et al. 2022). A critical issue in the construction sector is the collection of ample, relevant, and reliable data, a challenge aggravated by the dynamic and heterogeneous nature of construction projects. Data augmentation has been recognized as a critical strategy for enhancing the performance of ML models (Barbierato et al. 2022; Mostofi et al. 2023). Variational autoencoders (VAEs) stand out as a powerful tool in this context. VAEs are generative models capable of learning complex data distributions and generating new data samples. Prior research demonstrated the potential of VAEs to improve the accuracy of ML models by 8% (Islam et al. 2021). Mostofi et al. (2023) explored the use of VAEs in construction management to improve the prediction accuracy of graph attention networks for CPI productivity prediction and achieved a considerable improvement. However, that study did not evaluate the VAE with other ML models, particularly for SPI prediction. The present research therefore applies VAEs to a comprehensive construction progress dataset, with the aim of generating synthetic data that addresses the class imbalance issue. In addition, the study details the components of the VAE, including its encoder-decoder architecture, latent space, and hyperparameters, which are crucial for its effective functioning. The objective is to leverage the generative capabilities of VAEs to address the underperformance of ML solutions trained on imbalanced datasets, thereby aiding more accurate and reliable construction project forecasting and control. A comparative analysis was performed using decision tree-based models, namely random forest, AdaBoost, gradient boosting, LightGBM, XGBoost, extra trees classifier, and bagging models.

2 Methodology

The methodology of using a VAE to generate data and then aggregating the generated data with the collected dataset involves several key steps. Figure 1 displays the research flow of this study.

Fig. 1. Research flow.

The original dataset comprises 1,342 progress records collected from a construction project site, detailing the resources used to execute different construction activities, whose performance was reported through CPI and SPI. Data preparation then involved handling records with inconsistencies and missing values, adjusting the format, and selecting the features for SPI prediction. Next, the median number of instances per class was determined to identify the underrepresented classes within the dataset. The underrepresented portion of the data was then used for targeted data generation to address the class imbalance. For this purpose, a VAE model was configured with an encoder network and a decoder network. The encoder maps input data to a latent space representation, while the decoder reconstructs the data from the latent space. A sampling function was implemented within the VAE architecture to generate new data points from the learned distribution. The VAE was trained on 80% of the underrepresented-class data to learn the distributions specific to these classes, while the remaining 20% was used for model validation. Training was guided by a reconstruction loss and a Kullback–Leibler divergence term for the latent loss, with optimization based on the mean squared error. Figure 2 displays the performance of the VAE.
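The class-selection step can be sketched as follows. This is a minimal illustration that assumes a class counts as underrepresented when its record count falls below the median count across classes; the exact threshold rule, the function name, and the label values are assumptions and are not reported in this study.

```python
from collections import Counter
from statistics import median


def underrepresented_classes(labels):
    """Return the classes whose record count is below the median class count."""
    counts = Counter(labels)
    threshold = median(counts.values())
    minority = sorted(c for c, n in counts.items() if n < threshold)
    return minority, threshold


# Hypothetical SPI class labels (not the study's data).
spi_labels = ["on_track"] * 60 + ["high"] * 25 + ["low"] * 10 + ["critical"] * 5
minority, threshold = underrepresented_classes(spi_labels)
print(minority, threshold)
```

Only the records belonging to the returned classes would then be passed to the generative model, so that the synthetic records target the imbalance rather than enlarging every class.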

Fig. 2. Performance of the VAE proposed.
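The sampling function and the two loss terms of a VAE can be sketched in plain Python as shown below. This is an illustrative sketch, not the implementation used in this study: it assumes a diagonal Gaussian latent distribution with a standard normal prior, takes the mean squared error as the reconstruction loss, and introduces a hypothetical weight `beta` on the Kullback–Leibler term. The encoder and decoder networks are omitted.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterization: z = mu + sigma * eps, with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def reconstruction_mse(x, x_hat):
    """Mean squared error between an input record and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def kl_divergence(mu, log_var):
    """KL divergence from N(mu, sigma^2) to N(0, I), summed over latent dims."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Total loss: reconstruction term plus beta-weighted latent term."""
    return reconstruction_mse(x, x_hat) + beta * kl_divergence(mu, log_var)


rng = random.Random(0)
mu, log_var = [0.0, 0.0], [0.0, 0.0]   # hypothetical encoder outputs
z = sample_latent(mu, log_var, rng)    # latent point passed to the decoder
loss = vae_loss([1.0, 2.0], [1.0, 2.0], mu, log_var)
```

Once training has converged, new records are obtained by drawing latent points and passing them through the decoder.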

The generated data comprise 315 new records created by the VAE to address the imbalance in the original dataset. The synthesized data were then combined with the original dataset to create a balanced dataset. The aggregated dataset, combining original and generated data, totals 1,496 observations. It provides a more substantial and diverse training base, which could lead to more robust models that generalize better to unseen data. Next, a variety of tree-based models were considered, namely decision trees, random forest, AdaBoost, gradient boosting, LightGBM, XGBoost, extra trees classifier, and bagging. These range from simple decision rules to complex ensemble methods that aggregate multiple weak learners into a stronger predictive model.

3 Results

This study evaluated the proposed VAE-ML approach on various ML models by comparing their performance on two datasets: the hybrid dataset, comprising synthesized and original data, and the original collected dataset alone. For each model type, hyperparameters were optimized using grid search, which systematically evaluates multiple combinations of parameter values with cross-validation. This identifies the settings that give the best performance according to a specified metric. The models configured with their best hyperparameters were further evaluated based on accuracy, F1 score, precision, recall, and area under the curve (AUC) metrics. Figure 3 displays the prediction performance of the ML models trained on the original and hybrid datasets based on the accuracy and F1 score metrics.
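The tuning and evaluation procedure can be illustrated with scikit-learn. The sketch below is a minimal example under stated assumptions: it uses a synthetic imbalanced toy dataset rather than the study's records, a random forest as a representative model, and a parameter grid, scoring metric, and fold count that are illustrative choices not reported in this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic three-class toy data with imbalanced class weights (assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, weights=[0.7, 0.2, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Illustrative grid; every combination is scored with 3-fold cross-validation.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      scoring="f1_macro", cv=3)
search.fit(X_train, y_train)

# Evaluate the best configuration on the held-out split.
y_pred = search.best_estimator_.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro")
print(search.best_params_, round(accuracy, 3), round(macro_f1, 3))
```

Running the same procedure once on the original dataset and once on the hybrid dataset gives the pairwise comparison reported for each model.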

Fig. 3. Comparison of accuracy and F1 score performance of prediction models.

Models trained on the hybrid dataset consistently outperformed those trained on the original dataset; notably, LightGBM and gradient boosting achieved an accuracy of 0.98. Integrating the VAE with the ML models improved their prediction accuracies by about 2%. Figure 4 compares the prediction performance of these models.

Fig. 4. Comparison of prediction performance in each SPI class using both datasets.

The VAE-ML approach proposed in this research demonstrated robust predictive capability after data augmentation. Previous studies employed undersampling, oversampling, and the synthetic minority oversampling technique (SMOTE). Undersampling removes records from the majority class (Guo et al. 2018; Mishra and Singh 2021) and thus discards important information from the dataset (Taha et al. 2021). Oversampling, on the other hand, replicates records of the underrepresented class (Guo et al. 2018; Mishra and Singh 2021), and the resulting repetition in the dataset poses a risk of overfitting (Taha et al. 2021).

SMOTE creates artificial minority-class samples using a nearest-neighbor approach and has been pivotal in enriching datasets, thereby fostering better classifier performance across various domains (Bogner et al. 2018; Chawla et al. 2002). However, research has questioned the reliability of SMOTE, as its over-generalization of the minority class can introduce artificial biases (Bao and Yang 2023) and incline predictions towards the minority class (Blagus and Lusa 2012, 2013).
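For comparison, the interpolation at the core of SMOTE can be sketched as follows. This is a simplified illustration in plain Python and not a reference implementation: the function name, the value of `k`, and the toy points are assumptions, and a production workflow would normally rely on an established library.

```python
import math
import random


def smote_sample(minority, k, rng):
    """Create one synthetic point between a minority record and a near neighbor."""
    base = rng.choice(minority)
    # k nearest minority-class neighbors of the base record, excluding itself.
    neighbors = sorted((p for p in minority if p is not base),
                       key=lambda p: math.dist(base, p))[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # interpolation factor in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbor)]


# Hypothetical minority-class records with two features.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rng = random.Random(42)
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(5)]
```

Because every synthetic point lies on a line segment between two existing minority records, the method cannot produce samples outside the region already spanned by that class, whereas a VAE samples from a learned distribution.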

Considering these drawbacks, there remains a gap for methodologies that not only address class imbalance but also ensure the authenticity and reliability of prediction models.

The incorporation of VAE-generated synthetic data appears to be a potent strategy for enhancing ML model performance. This approach has demonstrated substantial benefits in accuracy, precision, recall, and AUC measures without incurring prohibitive computational costs. These findings suggest a promising direction for future research and applications. Future studies could focus on the application of VAEs across a wider range of predictive tasks and explore the impact of different types of generative models on the predictive accuracy of various machine learning algorithms.

While the results are promising, caution must be exercised in interpreting these findings. The performance boost from the synthetic data must be validated against more extensive datasets from different construction projects to confirm that the VAE improves the prediction performance of ML models.

4 Conclusion

There is a gap in construction management research regarding methods that improve prediction accuracy on real-life construction datasets with imbalanced prediction classes. The objective of this study was to address the challenges associated with data imbalance in construction management datasets and to enhance the performance of machine learning models using data augmentation through a VAE model. The research entailed a comprehensive analysis of multiple decision tree-based ML models, evaluating their ability to predict construction project outcomes effectively in terms of the SPI metric.

The obtained results highlighted several pivotal findings. Firstly, the incorporation of VAE-generated data enhanced the accuracy of SPI predictions, improving the prediction performance of the eight investigated ML approaches by up to 2%. The implementation of the VAE-ML approach thus improved prediction accuracy on the construction management dataset.

This suggests that addressing class imbalance through synthetic data generation can effectively improve model performance. Secondly, the results underscored the need for rigorous data pre-processing and augmentation to improve the robustness of ML models in construction management.

Overall, throughout the research, it was evident that data imbalance impacts the predictive capability of traditional models. The use of VAEs for data generation proved to be a promising approach to mitigate this issue. The generative capability of VAEs allowed for the creation of synthetic yet realistic samples, which balanced the dataset and provided a more uniform learning environment for the ML models.

However, the study also revealed gaps that warrant further investigation. The application of VAE was primarily focused on SPI prediction using different decision tree-based ML models, and its effectiveness on deep learning approaches remains unexplored, pointing to an area ripe for future research.