1 Introduction

Construction drawings are essential documents as they show what will be built for a project. These drawings are still frequently stored in undigitised formats, and consequently, retrieving information from them must be carried out manually. This requires domain experts [1], and can be very time-consuming [2] and costly.

One of the most important processes in a construction project is material takeoff or quantity takeoff [3, 1]. The purpose of this task is to create a list of the required materials and quantities. The list is an essential document as it is used for cost estimation [3]. It is important that it is accurate, as any errors can impact the project budget and schedule [4, 5]. The takeoff is traditionally carried out through manual drawing analysis. However, this can be time-consuming and prone to counting errors, particularly for large projects [2]. Furthermore, the results are dependent on individual interpretations [4].

Artificial Intelligence (AI) based methods can augment employees’ capabilities by simplifying time-intensive and repetitive tasks [6]. The use of state-of-the-art digital technologies to transform traditional industry practices into autonomous systems has been referred to as Industry 4.0, or the fourth industrial revolution [7]. The number of publications that discussed AI applications in the construction industry has increased in recent years [8]. Within this, one of the main research topics was computer vision [8], where the applications mainly focussed on the monitoring of construction sites and structural health. However, the current adoption of AI-based applications in the building and construction industry is relatively low [7], and lags behind that in other industries [8, 9].

Across a range of sectors, there has recently been an increasing demand to create methods to digitise engineering diagrams [10, 11]. This involves extracting all diagram components, which are the symbols, text and lines. Although published research on this topic dates back to the 1980s [12, 13], as stated in a recent review by Moreno-Garcia et al. [14], digitising complex engineering diagrams is still considered challenging. For instance, deep learning methods have been used for symbol digitisation in other types of engineering diagrams [15, 16]. This was considered a difficult task for multiple reasons, including the numerous symbols present in each diagram [15], relatively small symbol sizes [15, 16] and the use of non-standard symbols [17, 18].

Although construction drawings are more complex than other types of engineering drawings, methods for their digitisation have received considerably less attention than those for other drawing types, such as Piping and Instrumentation Diagrams (P&IDs) [15, 17, 16, 10, 19, 20, 21]. One reason that construction drawings are more complex is that they are typically composed of multiple drawing layers. These organise graphical elements by type [22], and can be shown overlapping each other. This means that symbols are typically shown on a highly complex background. Additionally, these drawings contain a significant number of visually similar shapes. Furthermore, they are typically grayscale and thus no colour information is available to help distinguish between components.

This paper presents a novel deep learning framework to process construction drawings automatically. It should be noted that engineering diagrams are generally unavailable in the public domain [11, 14] due to confidentiality reasons. Therefore, in this experiment, a dataset was obtained from an industry partner to ensure that the research is relevant to a real-world scenario. Multiple building systems are represented, including plumbing and Heating, Ventilation and Air Conditioning (HVAC). The drawings are very complex and contain various symbol classes, typically shown on a cluttered background, as shown in Fig. 1.

Fig. 1

Section of an ‘HVAC’ drawing. This is challenging to digitise for multiple reasons, including the dense representation of equipment, overlapping components, and complex background

The main contributions of this paper are outlined as follows:

  • A novel framework for the automatic processing of construction drawings is presented. This automatically detects symbols for the material takeoff, resulting in significant time-saving compared to manual drawing analysis.

  • An extensive set of experiments was carried out using a large dataset of challenging high-resolution drawings of varying quality. Various symbol classes were used, which have high levels of intra-class variability and inter-class similarity. This is believed to be the first example of these experiments using complex construction drawings from industry.

  • Two state-of-the-art deep learning models were utilised for symbol detection in construction drawings. This allows for a comparative analysis of the performance and speed between two object detection architecture types, one-stage and two-stage.

The rest of this paper is structured as follows. Section 2 critically discusses the related work in symbol digitisation in complex engineering drawings with a focus on construction drawings. Next, the proposed methods are discussed in Sect. 3. The experiments and results are presented in Sect. 4. Finally, the conclusion and future direction are provided in Sect. 5.

2 Related work

Methods to automate engineering diagram analysis have been presented since the 1980s [13, 23]. Most of these methods were based on traditional image processing approaches [14]. These rely on pre-established rules, which results in weak generalisation capability [24, 10]. Such approaches struggled to perform well across the wide range of variations present in engineering drawings, including image quality [25], object orientations [14], and overlapping objects [14].

More recently, Convolutional Neural Network (CNN) based deep learning models have significantly improved on traditional computer vision methods, including object detection, segmentation and classification [26]. Significant improvement has been seen since 2012 when the AlexNet [27] CNN-based classification model was shown to outperform previous methods by a large margin. Since then, method improvements have been facilitated by algorithm developments, together with an increase in data and computing power.

Despite this progress, deep learning methods for engineering diagram digitisation were only proposed very recently. Symbol digitisation methods were mostly based on object detection models, which predict the class and bounding box of target objects in an image. Most research focussed on one type of object detector, such as You Only Look Once (YOLO) [28] based approaches [24, 29, 30, 31, 11, 15, 32] or Faster Region-based Convolutional Neural Network (Faster R-CNN) [33] based approaches [6, 34, 35, 2, 55]. Other approaches were based on Fully Convolutional Network (FCN) [37] segmentation models [38, 39] or graph-based methods [40, 41, 42].

Deep learning methods generally showed improvements compared to traditional approaches, although it is clear from the published literature that challenges remain [43]. For instance, due to the lack of real-world engineering drawing datasets in the public domain, much of the existing research has been carried out using small datasets or simplified drawings [43]. Furthermore, deep learning models typically require a large annotated dataset, which is very time-consuming to obtain for these drawings [44]. Another challenge is that of imbalanced datasets, which give rise to the class imbalance problem [45]. Additional challenges that still require further research are the evaluation of drawing digitisation methods and contextualisation [43].

The literature on symbol digitisation in engineering drawings covers a range of drawing types, with a particular focus on P&IDs [15, 17, 16, 10, 19, 20, 21]. For instance, Elyan et al. [15] created a YOLO-based method to detect symbols in P&IDs. They reported high performance overall with an accuracy of 95%, although the results varied across the symbol classes. Meanwhile, Gao et al. [16] presented a Faster R-CNN-based symbol detection method. On a dataset of publicly available nuclear power plant drawings, they reported mAP values of 92% and above for three separate groups of symbols. In another example on P&IDs, Mani et al. [10] created a CNN-based classification method for fixed-size drawing patches. They obtained promising results for two symbol classes; however, this method may be computationally slow when scaled up to a larger number of classes.

There is also published research on symbol digitisation in architectural floor plans [46, 6]. For instance, Rezvanifar et al. [46] presented a YOLO-based method for symbol detection. They evaluated the method on a private dataset as well as the public Systems Evaluation SYnthetic Documents (SESYD) dataset. On the latter, they showed that their method outperformed traditional symbol spotting approaches. In the same domain, Jakubik et al. [6] presented a human-in-the-loop approach for the detection and classification of symbols. They used a Faster R-CNN-based symbol detection method that was trained on a synthetic dataset created using a data augmentation approach.

However, there is a lack of research on construction drawings; only a few recent works have presented deep learning methods for generating a list of materials from them [2, 5]. Joy and Mounsef [2] presented a Faster R-CNN [33] based method to automate material takeoff from electrical engineering plans. They used a dataset of five drawings. Training data was generated using symbols extracted from the legend and image processing-based data augmentation. Whilst the method did not require extensive manual annotation, it relied on a legend being available. Prior to testing, the background and text strings were removed, meaning that these components were not present in the testing data; this pre-processing step may therefore be particularly important to the reported performance.

Chowdhury and Moon [5] presented a Mask R-CNN [47] based method to automatically generate the bill of materials (BoM), a list of the required item quantities and costs, from 2D images of concrete formwork. Mask R-CNN is an object segmentation model, which predicts pixel-level object masks rather than bounding boxes. They created 206 drawings from 3D models, which were relatively clean with few components. On the validation data, an mAP of 98% was reported. The method showed promising results on the test drawings; however, detailed metrics were not presented. On an actual construction shop drawing, the increased complexity meant that preprocessing was required to remove unnecessary elements. The solution relied on the manual selection of relevant items within a cost database to produce the BoM.

In a related area, published work discussed automated quantity takeoff from Building Information Modelling (BIM) models [48, 1]. BIM has played the leading role in digitising the construction industry [8], and it concerns the creation of a 3D model to manage building data. The drawback of BIM-based takeoff approaches is that considerable time and resources are needed to create the BIM. Furthermore, errors in the BIM impact the accuracy of the quantity takeoff [48].

The literature shows that although deep learning has significantly improved computer vision methods, there is a lack of progress in construction drawing digitisation methods. Deep learning methods for engineering diagram digitisation were proposed only very recently, and these were primarily focussed on other engineering diagram types [15, 17, 16, 10, 19, 20, 21]. Moreover, there was a lack of research showing how different deep learning object detection models performed on complex real-world construction drawings.

3 Methods

This section discusses the methods used in the proposed symbol detection framework for complex construction drawings. This includes a discussion of the real-world dataset used for evaluation purposes.

3.1 Dataset

3.1.1 Overview

A dataset of 198 PDF construction drawings was obtained from an industrial partner. It contains three unequally represented types: 92 ‘plumbing’, 103 ‘HVAC’ and 3 ‘other’ drawings. To prepare the dataset for the experiment, the PDFs were converted at 300 dpi to high-resolution PNG images of 14,042 × 9,934 pixels, as shown in Fig. 2.
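
The paper does not name the conversion tool; as an illustrative sketch, the snippet below rasterises each PDF at 300 dpi using the pdf2image library (an assumed choice, which requires Poppler to be installed).

```python
from pathlib import Path
from pdf2image import convert_from_path  # assumed tool; needs Poppler installed

def convert_drawings(pdf_dir, png_dir, dpi=300):
    """Rasterise each PDF construction drawing to a high-resolution PNG."""
    Path(png_dir).mkdir(parents=True, exist_ok=True)
    for pdf in Path(pdf_dir).glob("*.pdf"):
        pages = convert_from_path(str(pdf), dpi=dpi)
        for i, page in enumerate(pages):
            page.save(Path(png_dir) / f"{pdf.stem}_{i}.png")
```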

Fig. 2

Data preparation steps. The dataset of undigitised construction diagrams was pre-processed by firstly converting the PDF to PNG. Next, the image files were annotated with the target symbol classes. Finally, the drawing border was removed using a Connected Components algorithm

The diagrams contain numerous symbol classes, 13 of which were selected for the experiment. These were chosen as they are required in the takeoff, and are shown in multiple building systems. The ‘Detail Legend’ and ‘Direction of Flow’ symbols were also included, as detecting them can help to determine links between diagrams or flow direction [44].

The symbols of interest are represented by various shapes, as shown in Fig. 3. It should be noted that these examples were cropped from the legend, and are thus displayed on a white background, unlike typically seen within a diagram. The symbols are challenging for a model to detect for several reasons. Each symbol is represented by only a few lines or shapes, and thus there are only a few features available for a model to learn from. Symbols are commonly represented in different orientations, with different shading, and often overlap other shapes. Intra-class variability in the graphical notations was also seen, as shown in Fig. 4. Furthermore, there is high inter-class similarity; for instance, the shape that represents a Gate Valve is also part of the Automatic Control Valve (ACV) and the Valve and Capped (V&C) Provision.

Fig. 3

Symbol legend

Fig. 4

Examples of intra-class variability. The symbols in each group represent the same class, which are a ACVs, b Ball valves and c Detail legends

3.1.2 Data annotation

The diagrams were manually annotated to create a symbol dataset, which can be a very time-consuming and demanding task [29, 44, 49]. The process involves drawing bounding boxes closely around each target symbol. For the purpose of the experiment, the diagrams were manually annotated using Sloth, an open-source object annotation tool. The annotations are exported to one output file per diagram; these record the labelled symbol information, including the class and bounding box coordinates. In total, the symbol dataset consists of 6231 symbols from 13 classes.
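
As an illustration of how per-diagram annotation files might be aggregated into the symbol dataset, the sketch below assumes a Sloth-style JSON export in which each rectangle carries class, x, y, width and height fields; the actual export schema and field names should be checked against the tool used.

```python
import json
from pathlib import Path

def load_symbol_annotations(annotation_dir):
    """Collect labelled symbols from per-diagram annotation files.

    Assumes one JSON export per diagram, each holding a list of images
    with an 'annotations' list of rectangles carrying 'class', 'x', 'y',
    'width' and 'height' fields (an assumed schema, not confirmed by the
    paper)."""
    symbols = []
    for path in Path(annotation_dir).glob("*.json"):
        for image in json.loads(path.read_text()):
            for ann in image.get("annotations", []):
                symbols.append({
                    "diagram": image.get("filename"),
                    "class": ann["class"],
                    "bbox": (ann["x"], ann["y"], ann["width"], ann["height"]),
                })
    return symbols

# Example: class counts, as visualised in Fig. 5
# from collections import Counter
# print(Counter(s["class"] for s in load_symbol_annotations("annotations/")))
```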

3.2 Data exploration and pre-processing

Different equipment items appear with varying frequencies within construction diagrams; therefore, the symbol dataset is highly imbalanced, as shown in Fig. 5. Class imbalance is a major problem in both machine and deep learning [50, 52], and refers to algorithms trained on an imbalanced dataset being biased towards the majority class. It was observed that the Ball Valve symbol is significantly overrepresented, as it constitutes 35.3% of the dataset. In contrast, the four least represented classes each constitute less than 1%.

Fig. 5

The left image shows the class distribution across the whole symbol dataset. The right image shows the distribution amongst those classes with fewer than 100 instances in more detail

Small object detection was another challenge in this experiment; it is considered difficult due to factors such as limited context information and indistinguishable features [53, 54]. For example, on the COCO dataset [54], the Average Precision (AP) of YOLOv7 for small objects, 35.2%, was lower than that for medium objects, 56.0%, and large objects, 66.7% [55]. In this paper the COCO definition of object size was used. This classifies objects as small if their area is less than 32 × 32 pixels, medium if between 32 × 32 pixels and 96 × 96 pixels, and large if more than 96 × 96 pixels [56]. Most of the symbols here are small or medium sized according to these criteria.
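
A minimal helper implementing the COCO size definition quoted above, as used to categorise the symbols:

```python
def coco_size_category(width, height):
    """Categorise a bounding box by area using the COCO thresholds [56]:
    small < 32*32 px, medium < 96*96 px, large otherwise."""
    area = width * height
    if area < 32 * 32:
        return "small"
    if area < 96 * 96:
        return "medium"
    return "large"
```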

The diagrams were pre-processed to reduce false positives, as shown in Fig. 2. This was done by removing the diagram border, which contains text and no target symbols. A Connected Component (CC) algorithm was used to locate the largest CC of white pixels, which was considered to be the background of the main diagram area. In this calculation, the pixels were defined as connected to each other if they had four-way connectivity. An image mask was then applied to replace the pixels outwith the bounding box of the largest CC with white pixels.
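
A sketch of this border-removal step, assuming OpenCV is used and that white pixels can be identified with a simple intensity threshold (the threshold value is illustrative, not taken from the paper):

```python
import cv2
import numpy as np

def remove_drawing_border(image):
    """Whiten everything outside the main drawing area.

    The largest 4-connected component of white pixels is taken as the
    background of the main diagram area, and all pixels outside its
    bounding box are replaced with white."""
    grey = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) if image.ndim == 3 else image
    white = (grey > 200).astype(np.uint8)  # assumed binarisation threshold
    _, _, stats, _ = cv2.connectedComponentsWithStats(white, connectivity=4)
    # label 0 is the non-white background; pick the largest white component
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    x, y = stats[largest, cv2.CC_STAT_LEFT], stats[largest, cv2.CC_STAT_TOP]
    w, h = stats[largest, cv2.CC_STAT_WIDTH], stats[largest, cv2.CC_STAT_HEIGHT]
    cleaned = np.full_like(image, 255)          # all-white canvas
    cleaned[y:y + h, x:x + w] = image[y:y + h, x:x + w]
    return cleaned
```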

3.3 Symbol detection

Two state-of-the-art object detection models were used in the experiment: YOLOv7 [55] and Faster R-CNN [33]. YOLOv7 [55] is a variant from the YOLO series [28, 57, 58, 59, 60, 61, 55, 62]. It is a one-stage model that predicts objects’ locations and classes using a single CNN. It is known for its speed; for instance, on the COCO dataset [54], YOLOv7 surpassed other object detectors in the range of 5 FPS to 160 FPS in both speed and detection accuracy [55]. YOLO also performs well across different types of diagrams [15]. Faster R-CNN [33] is known to be accurate, with state-of-the-art performance on the PASCAL Visual Object Classes (VOC) benchmarks [63]. Whilst Faster R-CNN [33] improved on the speed of earlier related models, Fast R-CNN and R-CNN [64], its separate region proposal stage results in slower speeds compared to one-stage models.

The construction diagrams are significantly larger than the typical image input size of deep learning object detection models. For example, the diagrams are 14,042 × 9,934 pixels, whereas the YOLOv7 input size is 640 × 640 pixels [55]. Using whole diagrams as training images would require considerable computing resources and therefore a patch-based approach was used. This involves splitting high-resolution images into smaller patches [65, 66, 15]. In this experiment, the patch size was set at 640 × 640 pixels. Note that the diagrams cannot be split exactly by the patch size, and the patches cropped at the edges of each diagram overlap each other, as shown in the sketch below. Only the symbols that appeared completely within a patch were used for training purposes. Note that the drawings were annotated prior to being split into patches, therefore these symbols were determined automatically.
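
A minimal sketch of the patch-splitting strategy, in which the final row and column of patches are shifted back so that they end at the drawing edge and therefore overlap their neighbours; the exact implementation may differ.

```python
def patch_origins(image_size, patch=640):
    """Top-left coordinates of patches covering one image dimension.

    Patches are tiled at the given size; the last patch is shifted back
    so it ends exactly at the image edge, producing the overlapping
    edge patches described above."""
    origins = list(range(0, image_size - patch + 1, patch))
    if origins[-1] + patch < image_size:
        origins.append(image_size - patch)
    return origins

def split_into_patches(image, patch=640):
    """Yield ((x, y), patch_image) pairs for one high-resolution drawing."""
    h, w = image.shape[:2]
    for y in patch_origins(h, patch):
        for x in patch_origins(w, patch):
            yield (x, y), image[y:y + patch, x:x + patch]
```

For a 14,042 × 9,934 pixel drawing this rule yields 22 × 16 = 352 patches per drawing, consistent with the 59,136 training patches reported below for 168 diagrams.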

Due to the limited size of the dataset, transfer learning was used. This technique improves a learner by transferring information from one domain to another [67]. Both models were pre-trained on a large-scale object detection dataset, Microsoft Common Objects in Context (COCO) 2017 [54]. All model layers were fine-tuned during training.
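
For illustration, the snippet below shows one way to set up such transfer learning with torchvision's COCO pre-trained Faster R-CNN; it is not the authors' implementation, and the learning rate is an assumed value (the momentum of 0.9 matches the setting reported in Sect. 4.1).

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# COCO pre-trained Faster R-CNN with a ResNet-50 backbone
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box predictor: 13 symbol classes plus background
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=14)

# All layers are fine-tuned (nothing is frozen); lr is illustrative
optimiser = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
```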

4 Experiment and results

4.1 Experiment setup

The experiment can be divided into two phases: the first is Deep Learning Model Training on Construction Symbols, and the second is Method Evaluation, as shown in Fig. 6. The input to the first phase is the pre-processed diagram dataset output by the Data Preparation phase shown in Fig. 2.

Fig. 6

Experiment steps. In Deep Learning Model Training on Construction Symbols, the pre-processed diagram dataset was split into training, validation and test sets prior to being split into patches. This is followed by model training. In Method Evaluation, the method was tested and the predictions on individual patches were combined using Non-Maximum Suppression. Next, the predictions were visualised to create the processed test diagrams and compared to the ground truth

The pre-processed dataset of 198 diagrams was split into training, validation and test sets. These contained 168, 15 and 15 diagrams respectively. Each subset contained all three diagram types and instances of each symbol class. As the classes were unevenly distributed across the diagrams, there was a different proportion of each class across the subsets, as seen in Fig. 5.

Following the patch-based approach described above, the 168 training diagrams were split into 59,136 patches, of which 1633 contained at least one annotated symbol. Patches not containing symbols of interest were also included in the training data, in equal ratio to the labelled patches; this may help to reduce false positives caused by similar shapes in the diagrams. To select the more cluttered patches, these were randomly sampled from those containing over 15% black pixels, as illustrated below. The 15 validation diagrams were split into 5280 patches, of which 94 were annotated. Again, patches without symbols of interest were included in equal number to the labelled patches.
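
A hedged sketch of how unlabelled background patches could be sampled under the 15% black-pixel rule described above, assuming the patches are supplied as greyscale arrays and using an illustrative binarisation threshold:

```python
import random
import numpy as np

def sample_background_patches(patches, n_required, black_fraction=0.15, seed=0):
    """Randomly pick unlabelled patches that are sufficiently cluttered.

    Only patches whose share of black pixels exceeds the given fraction
    are candidates, and as many are drawn as there are labelled patches.
    The pixel threshold of 128 is an assumed binarisation value."""
    rng = random.Random(seed)
    cluttered = [p for p in patches
                 if np.mean(np.asarray(p) < 128) > black_fraction]
    return rng.sample(cluttered, min(n_required, len(cluttered)))
```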

YOLOv7 was trained using a batch size of eight, as initial experiments showed improved results compared with larger batch sizes. The momentum was set to 0.937. To help prevent overfitting, mosaic data augmentation, which mixes four images [59], was applied to all the training images. The idea is to expose the model to extra symbol contexts, and it also reduces the need for a large mini-batch size [59]. The probability of a left-right flip was set at 0.5, that of an up-down flip at 0.0, and the image translation factor at ± 0.2.

The Faster R-CNN batch size was set at four due to memory requirements. The momentum was set at 0.9, and the probability of a horizontal flip was set at 0.5. Following the original baseline model, a ResNet-50 backbone was used [33]. Note that as the aim was to compare the methods based on the two models, the data augmentations were kept as standard in each implementation.

Each model was trained for 100 epochs, which took 2.92 hours for the YOLO-based method and 40.86 hours for the Faster R-CNN-based method. Note that the official implementations were used [68]. The experiments were carried out using an NVIDIA Quadro RTX 5000 16GB GPU with 256GB RAM.

The methods were evaluated using a test set of 15 drawings split into 19,995 patches, as seen in Fig. 6. An overlapping patches strategy was used to ensure all symbols appeared fully within a patch, which means that overlapping predictions can occur. Non-Maximum Suppression (NMS) was used to handle this, as shown in Fig. 7, with the overlap threshold set at 0.3 and the confidence threshold at 0.005. Testing the whole set of 15 drawings took 0.09 hours using the YOLO-based method and 2.72 hours using the Faster R-CNN-based method. This is significantly less than the time required for manual drawing analysis, which can take hours of work per drawing and requires subject matter specialists.
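
As an illustration of how the per-patch detections can be merged, the sketch below shifts each patch's boxes into drawing coordinates and applies class-agnostic NMS with the thresholds quoted above (whether NMS was applied per class or across classes is not stated, so this is an assumption):

```python
import torch
from torchvision.ops import nms

def merge_patch_predictions(patch_preds, iou_thresh=0.3, conf_thresh=0.005):
    """Combine per-patch detections into drawing-level detections.

    patch_preds: iterable of ((x_off, y_off), boxes, scores, labels), where
    boxes are [x1, y1, x2, y2] tensors in patch coordinates. Boxes are
    shifted into drawing coordinates, low-confidence detections dropped,
    and class-agnostic NMS removes duplicates from overlapping patches."""
    all_boxes, all_scores, all_labels = [], [], []
    for (x_off, y_off), boxes, scores, labels in patch_preds:
        keep = scores >= conf_thresh
        offset = torch.tensor([x_off, y_off, x_off, y_off], dtype=boxes.dtype)
        all_boxes.append(boxes[keep] + offset)
        all_scores.append(scores[keep])
        all_labels.append(labels[keep])
    boxes = torch.cat(all_boxes)
    scores = torch.cat(all_scores)
    labels = torch.cat(all_labels)
    keep = nms(boxes, scores, iou_thresh)
    return boxes[keep], scores[keep], labels[keep]
```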

Fig. 7

Non-Maximum Suppression was used to handle the overlapping predictions. Initial predicted bounding boxes are shown in red (left image). The results following Non-Maximum Suppression are shown in green (right image) (color figure online)

4.2 Evaluation metrics

The methods were evaluated using multiple metrics, including Precision, Recall, and F1-score. Precision is the ratio of True Positives to predicted positives (Eq. 1). Recall is the ratio of True Positives to actual positives (Eq. 2). The F1-score is the harmonic mean of Precision and Recall (Eq. 3). The Intersection Over Union (IOU) defines the overlap between the predicted and ground truth bounding boxes (Eq. 4). For a True Positive, the IOU threshold was set at 0.5 in accordance with the PASCAL evaluation metric [69], and the model confidence threshold was set at 0.25. The latter setting should ensure an appropriate trade-off between obtaining true positives and reducing false positives.

$$Precision = \frac{True\ Positives}{True\ Positives + False\ Positives}$$
(1)
$$Recall = \frac{True\ Positives}{True\ Positives + False\ Negatives}$$
(2)
$$F1\ score = \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}}$$
(3)
$$Intersection\ Over\ Union = \frac{Area\ of\ Overlap}{Area\ of\ Union}$$
(4)

The method was also evaluated using the mean Average Precision (mAP) at an IOU threshold of 0.5 (mAP@0.5). The AP of each class, the area under the Precision-Recall curve, was determined using the all-point interpolation method as in PASCAL VOC 2010 [63]. In addition, the AP@[0.5:0.05:0.95], APsmall, APmedium and APlarge were reported [54]. An open-source toolkit for object detection metrics created by Padilla et al. [56] was used to perform these calculations.
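
For reference, the per-detection metrics in Eqs. 1-4 can be transcribed directly into code as below; the mAP values themselves were computed with the open-source toolkit of Padilla et al. [56], not with this sketch.

```python
def iou(box_a, box_b):
    """Intersection Over Union of two [x1, y1, x2, y2] boxes (Eq. 4)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall_f1(tp, fp, fn):
    """Eqs. 1-3, computed from detection counts. A detection counts as a
    True Positive when its IoU with a ground-truth box is at least 0.5
    and the model confidence is at least 0.25."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1
```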

4.3 Results

The results were initially evaluated by visual inspection, with help from domain experts, in order to understand the model performance. As shown in Fig. 8, this was facilitated by drawing bounding boxes around the ground truth in red, correct predictions by the YOLO-based method in orange and correct predictions by the Faster R-CNN-based method in purple. Incorrect predictions by the YOLO-based method are shown in dark blue and those by the Faster R-CNN-based method in light blue. For in-depth analysis, the model confidence and the IOU were also shown. This inspection suggested that various symbol classes were detected well, even with multiple overlapping components. It was also observed that where a correct prediction was recorded by both methods, the difference in predicted bounding box locations was small and most visible on the larger symbols; refer to patches a and b in Fig. 8.

Fig. 8

Examples of test patches. To facilitate visual inspection, bounding boxes are shown around the ground truth in red, correct predictions by the YOLO-based method in orange and correct predictions by the Faster R-CNN-based method in purple. Incorrect predictions by the YOLO-based method are shown in dark blue and those by the Faster R-CNN-based method in light blue. The model confidence, c, and the IOU are also shown (color figure online)

Table 1 Method performance on the test set. The highest performing score for each metric is highlighted in bold

The results on the whole dataset were assessed using several metrics, as shown in Table 1. The mAP@0.5 of the YOLO-based method was 79%, whilst that of the Faster R-CNN-based method was 83%. Out of the 665 symbols, 637 were correctly detected by the YOLO-based method and 636 by the Faster R-CNN-based method, equivalent to an accuracy of 95.8% and 95.6%, respectively. In terms of the AP@[0.5:0.05:0.95], both methods performed equally with a score of 0.50. The results were also evaluated according to symbol size. This shows that both methods perform better as the symbol size increases, which is likely due to more information being present in the larger symbols compared to the smaller ones. Both methods performed equally on the medium-sized symbols, as shown in Table 1. It is also evident that the YOLO-based method performs slightly better on the small symbols than the Faster R-CNN-based method, with an APsmall of 0.27 compared to 0.19. Similarly, the YOLO-based method performs slightly better on the large symbols, with APlarge values of 0.68 and 0.64 respectively. These results suggest that although the YOLO-based method outperforms the Faster R-CNN-based method on certain metrics, both methods performed well on this challenging dataset.

Table 2 Method performance on the test set per class. The highest recall, precision and F1-score for each symbol is highlighted in bold

The precision, recall and F1-score were calculated for each class, as can be seen in Table 2. These results show that both methods performed well for the detection of various symbol classes. Although class imbalance can strongly affect performance, other factors also influenced these results. For instance, the highest F1-score was not obtained for the majority class, the Ball Valve. Its recall was high, indicating that most instances were detected correctly, including those in different orientations, as shown in patches a, c and e in Fig. 8. However, the precision was lower, which may be due to several reasons, including model bias caused by class imbalance. Furthermore, similar shapes were very common; refer to the incorrect predictions of Ball Valves shown in patches f, h and i in Fig. 8. The highest performance by the YOLO-based method was for the Pump and by the Faster R-CNN-based method was for the ACV, even though these were only the sixth and tenth most represented classes respectively. This indicates that the results are impacted by other factors as well as class representation, such as similar shapes in the drawings.

The results also show that the lowest performance was obtained for classes with very few instances. For example, an F1-score of 0.00 was recorded by both methods for one class, the V&C Provision, which had only 47 instances. Although the Meter had fewer instances, 21, the performance was higher, likely due to the relatively consistent appearance of this symbol.

It can also be seen in Table 2 that the levels of recall were higher than those of precision. This was due to false positives, of which there were 320 by the YOLO-based method and 154 by the Faster R-CNN-based method. Only a few of these were the result of inter-class similarity. There was one prediction of a Meter as a Detail Legend by the YOLO-based method, and one prediction of a Gate Valve as a Check Valve by the Faster R-CNN-based method. The other misclassifications between target classes were that all V&C Provisions were predicted as two separate symbols, the Gate Valve and the Capped Pipe. An example of this is shown in patch d in Fig. 8. This can be explained by the fact that the shapes that constitute the V&C Provision are essentially a combination of these two symbols, see Fig. 3.

The majority of the false positives were due to similar shapes in the background of the drawing. The highest number of false positives for any symbol, 150, was for the Gate Valve by the YOLO-based method. This was often due to similar triangular shapes used to shade parts of the diagram, as shown in patches f and h in Fig. 8. In contrast, the Faster R-CNN-based method performed better here and predicted only 9 false positives for the Gate Valve. Another noticeable difference between the methods was in the number of false positives recorded for the Detail Legend symbol, for which the YOLO-based method predicted 47 whereas the Faster R-CNN-based method predicted fewer, at 14. Both methods predicted a similar number of false positives for the most represented symbol in the dataset, the Ball Valve, with 39 by the YOLO-based method and 36 by the Faster R-CNN-based method. Both methods predicted that similarly shaped components in the diagram were Ball Valves, see patch i of Fig. 8. It should also be pointed out that there were no false positives from similar shapes in the diagram border area, as this section of the drawing was removed during pre-processing, refer to Fig. 2. Overall, these results show that both methods have high discriminative power between the target classes, and that most incorrect predictions result from the similar shapes that are used throughout the drawing.

The YOLO-based method required less time for testing than the Faster R-CNN-based method. However, both methods substantially reduced the time required to process a drawing compared to manual analysis. For instance, using one test drawing as an example, the YOLO-based method took 0.54 minutes, whereas the Faster R-CNN-based method took 18.17 minutes. Note that this time would be complemented by that required for a manual review to correct any errors in the model output, which in this case took an additional 1.20 minutes. Contrast this with the much longer time needed for manual analysis of the whole drawing, which took a subject matter expert 1.34 hours. These time savings would be substantial, especially in projects with a large dataset of complex drawings.

5 Conclusion and future direction

In this paper, we present a deep learning framework for the automatic processing of construction drawings. This enables symbol digitisation and can therefore automate tasks such as material takeoff. Two state-of-the-art object detection models, YOLO and Faster R-CNN, were utilised. An extensive set of experiments was carried out using a large dataset of challenging high-resolution drawings sourced from an industry partner.

The results show significant time-saving compared with manual drawing analysis. Although the highest accuracy was obtained with the YOLO-based method, both methods achieved high performance, in terms of both recall and precision, for a range of symbols. This was achieved despite the challenges posed by the dataset, such as relatively small symbol sizes, different orientations and the presence of multiple overlapping objects. One limitation was that the performance was inconsistent across the classes, due to factors including the class imbalance, similar shapes and intra-class variations such as size and orientation.

This work could be extended by improving the symbol detection methods. For instance, further experiments could assess the impact of various model backbones on the performance. Another interesting idea that we aim to investigate is how segmentation models such as Mask R-CNN [47] or FCN [37] would perform in this scenario. As segmentation models predict a pixel-level outline of an object instead of a bounding box, they may alleviate some of the errors that occur as a result of drawing objects that are located in close proximity to, or overlap, the target object. Additionally, the challenge of class imbalance could be addressed. One possible direction is to create synthetic image patches to balance the dataset, using generative deep learning models such as Generative Adversarial Networks (GANs).

Future directions of this work also include automatically processing the entire construction drawing. This involves methods to digitise all components including the text and lines. Extending this framework will enable the digital transformation of the whole drawing. This will allow for the extraction of vast amounts of valuable data, and additionally will substantially reduce the manual effort required to analyse construction drawings for a wide range of tasks.