Application of NLP-based topic modeling to analyse unstructured text data in annual reports of construction contracting companies

The construction industry is the backbone of a nation’s economy. It is a matter of great concern that such an industry suffers from time and cost overruns, especially in these challenging times. Coupled with the overrun issues, the sector is often criticized for lacking adequate quality and quantity of structured secondary data. The emerging technologies in data science and machine intelligence present a unique opportunity to understand the sector better and aid in effective decision-making. To better understand the utility of such technologies, the Management Discussion and Analysis sections of the annual reports of publicly listed top Indian construction contracting firms are analyzed to identify the presence of ‘strategy themes’ and further map them to the organizations considered. Natural Language Processing (NLP)-based topic modeling algorithms, namely Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF), are used in this study to perform a qualitative content analysis to identify the latent themes. From a methodological standpoint, considering the context of this study, the NMF results are better in accuracy, precision, and recall compared with the LDA results. The results show that while most construction contracting firms prioritized a ‘revenue-focused’ strategy to expand their order books, a smaller set of large-sized firms seems to prioritize process improvement to improve their execution productivity and is therefore ‘profit margin improvement focused’ or ‘lean-focussed’ in its approach. Although a proof-of-concept, this study demonstrates the potential of unsupervised NLP-based topic-modeling tools to understand and infer from unstructured and freely available text data in the public domain to aid sectoral analysis and policymaking.


Introduction
The construction sector is operating in unprecedented times. The industry is now facing the onslaught of the Coronavirus pandemic (and its aftereffects), which critically affects the sector's value-generating potential. Top management decision-making is crucial to successfully steer organizations in these trying times [1]. 'Top management,' in this study, refers to the group of individuals in a construction contracting firm who are responsible for strategic decision-making. Construction management researchers and industry professionals can collaborate, and their collective efforts can be crucial to informed decision-making. However, irrespective of the domain, researchers require data to analyze trends, patterns, and scenarios before they can propose any solution. In addition to the quality and quantity requirements, if the available data is structured and quantifiable, it becomes easier for researchers to analyze it and report findings quickly. Unfortunately, the construction sector, at least in India, is often criticized for being poor in maintaining a research-ready database that can serve as a publicly available and dependable secondary data source for researchers [1][2][3][4]. Private-sector data is generally considered confidential and is out of bounds for researchers. Therefore, researchers often bank upon primary data collection techniques like questionnaire surveys and interviews to equip themselves with data [5].
While the structured-data inadequacy issue is a severe concern in the construction sector, unstructured text data related to the construction sector is abundantly available in the public domain. In the private sector context, companies' websites contain information on vision and mission statements and promotional videos. Specifically, in the case of publicly listed private firms, annual reports and financial statements are mandatory disclosures in the public domain. Knowledge extraction from such unstructured data is now possible with the recent developments in computer-aided text mining and Natural Language Processing (NLP) [6][7][8].
In this research, the authors explore the efficiency of NLP-based topic modeling algorithms to extract keywords and topics from the publicly available annual reports of construction contracting firms and use the information obtained to analyze the strategies such firms adopt in dealing with emerging sectoral challenges, which are explained in the next section.

Opportunities and sectoral challenges
Considering the stakeholder-intensive and labor-oriented nature of the industry, the pandemic has made construction execution challenging due to its impact on the logistics and supply chain efficiency and the working style of people involved in the project delivery. Notwithstanding the pandemic impact, the construction sector cannot pause even momentarily, as it plays a central role in shaping a nation's economy. With the pandemic affecting virtually all sectors of the economy, a renewed growth of the construction sector can infuse a much-needed boost to propel economic growth. Realizing this, the Government of India has announced massive investments in the infrastructure sector that can revive the construction industry. Schemes like the National Infrastructure Pipeline (NIP) envisage providing world-class infrastructure to citizens and improving their quality of life [9]. NIP is an ambitious multi-billion dollar investment plan spread over infrastructure sectors like energy, roads, urban development, and railways, with 2465 projects currently under development in active partnership with the private sector [9].
While such massive investments open up much-needed avenues for the private sector to bounce back, there are challenges in converting the opportunity to benefits. The tangible positive impact of the opportunities rolled out by the Government on the construction sector can be seen only if there are structural changes in the execution style of construction contracting companies involved in infrastructure project deliveries. This observation rests on the fact that the industry traditionally suffers from significant cost and time overruns, ultimately leading to disputes and litigation [10,11]. Therefore, with such a high probability of overruns and a poor track record in claims management and dispute resolution [12], there is a risk of massive investments in infrastructure (like the NIP) turning into breeding grounds for claims and disputes. Overall, there seems to be a two-pronged attack on the construction contracting sector. Firstly, the burden of unresolved pre-pandemic claims and disputes stresses a construction contracting company's balance sheet. Secondly, the onslaught of the pandemic is forcing organizations to secure their strained profit margins from deteriorating further. Treating this two-pronged attack as a challenge faced by construction contracting firms amidst opportunities, this article reviews the strategies Indian construction contracting firms adopted in response.
Considering the evolving critical financial situation and the impending opportunities the Government is creating, construction contracting firms can strategize their future path in several ways. While embarking on a revenue-focussed strategy [13] can be one option, profit margin improvement [14,15] through cost reduction exercises, liquidating long-pending claims, mergers, and acquisitions, among others, can be the other option. While it is not intended to infer that the strategies above are mutually exclusive, the study explores the predominant strategy adopted by construction contracting firms in response to the evolving scenario. The study uses an NLP-based topic modeling algorithm to qualitatively analyze the management viewpoints from the annual reports of publicly-listed top construction contracting firms in India to understand the strategies adopted.

Literature review
In the face of calamities, the role of top management takes center stage [16,17]. Therefore, understanding the top management view becomes crucial to decipher an organization's strategy to manage an emerging scenario unfolding risks (due to the pandemic) and opportunities (Government's infrastructure push through capital investment). Considering the nature of the study, it is essential to refer to those documents and materials that can help explain an organization's strategic intent. Such reference documents should be authentic and also available in the public domain. Considering the requirement, the annual reports of publicly listed construction contracting firms are analyzed to understand top management's commitment to formulating and communicating strategies in response to the risks and opportunities evolving in the market.
Publicly listed companies have a regulatory requirement to publish annual reports containing the company's financial information and the management opinion about the performance in the previous year and prospects [18]. Since an annual report is a public document read by a sizeable heterogeneous audience, companies may also treat them as an instrument of mass communication of their achievements and innovations [19]. In academics, annual reports have often been used as grey literature, especially for research on corporate strategy [20,21]. Specifically, content analysis of annual reports of multiple companies is an effective way to gauge the industry's strategy [22]. In particular, annual reports, among others, are used to understand leadership phenomena [23].
As per the statutory requirements laid down by the Ministry of Corporate Affairs (MCA), Government of India, and the relevant governing laws of the land, an annual report is typically divided into three parts, namely corporate overview, statutory reporting, and financial statements. Among the three parts, the second part, statutory reporting, consists of a section titled 'management discussion and analysis' (MDA). According to the Securities and Exchange Board of India (SEBI), MDA should include discussion on matters about a company's competitive position, such as (1) overall business scenario (global, national, and sectoral), (2) opportunities and threats, (3) highlights of the performance, (4) business outlook, (5) risks, concerns and mitigation plans, (6) internal control systems and their adequacy, and (7) discussion on financial performance concerning operational performance and human resources/ industrial relations initiatives [24]. From the very structure of this section, it is clear that the report captures the management's view on the impending threats and opportunities and the organization's response plan. Therefore, the contents of MDA sections of multiple construction contracting firms are analyzed to understand the direction of top management thinking in response to the emerging scenario.
A qualitative analysis technique is employed to analyze the text data in the MDA section of the annual reports. Earlier studies have attempted to understand such reports using a keyword-based search of specific pre-decided themes [25]. In this study, however, the themes are an output of the study, and therefore the keyword-based search is not suitable. Since the study aims to identify various topics/themes evolving out of the annual reports' MDA section, the authors use tools that help cluster topics from raw text. Topic modeling algorithms like Natural Language Processing (NLP)-based Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF) can be employed to meet the objectives of this study. Such NLP-based tools have been used widely in the construction management research domain [26]. LDA, in particular, is widely used in several domains as a tool for topic modeling [27]. Table 1 presents a non-exhaustive summary of construction management domain research employing topic modeling, particularly LDA models. The articles are obtained by employing a keyword search in the Google Scholar database. The keywords used are 'topic modeling,' 'NLP,' 'Construction,' and 'Project.' While the keyword-based search highlighted many recent LDA-based articles in construction management, the search is further narrowed by adding the phrase 'annual report' to the keywords to identify construction management domain-specific topic modeling studies employing LDA on annual reports. The result indicated that topic modeling on annual report documents is quite common in finance [36]. However, the search with the key phrase 'annual report' did not yield any similar articles in the construction domain. The study attempts to address this gap in the literature using NLP-based topic modeling techniques.
While NLP-based techniques are commonly used for detailed analysis of text from standard contract forms to identify problematic provisions and control-focus, among other things [7,37], topic modeling algorithms are rarely used to extract keywords and topics from standard contract forms, as the text in those documents is already classified into topics in the form of clauses and sub-clauses with clear headings and sub-headings. However, extraction of strategies (represented through keywords) from the annual reports in response to market challenges is relatively more challenging because companies do not explicitly lay down their action plan. The plan can be gathered only after reading through all the relevant pages of the annual report. Therefore, topic modeling algorithms can potentially help in a quick and reliable extraction of strategies from annual report documents. Relative to contract documents, the MDA section of the annual reports falls under the more unstructured data categories, such as news articles, social media data, and court cases, where topics are not identified as explicitly as in the clauses of a standard form contract. The following sections describe the methodology adopted for the study, followed by results, validation, discussion, research implications, future scope, and conclusion.

Methodology
As explained in the earlier sections, the study employs qualitative topic modeling techniques to thematically analyze the MDA section of annual reports of publicly listed construction contracting companies in India. The construction sector in India consists of more than 31,000 registered firms as well as a much larger number of contractors in the unorganized sector, but only around 100 companies are listed in the National Stock Exchange (NSE) and Bombay Stock Exchange (BSE) [38]. India's leading 18 publicly-listed construction contracting companies are selected for the study considering their market capitalization, with representations from real estate, infrastructure, and commercial development sectors. Next, the NLP-based LDA and NMF approach models latent topics in the input text corpus (MDA section of annual reports). Anaconda 3, a pre-packaged distribution of an open-source Python programming interface, is used in the Jupyter Notebook web application to perform topic modeling.

Topic modeling process
Two topic modeling algorithms, namely LDA and NMF, are employed in this study. LDA is a Bayesian statistical model. It assumes that each document is a combination of a given number of topics, and each word in the corpus is associated with each of the given topics with some probability. The model builds a probability distribution according to which topics are modeled by allocating words under them [39]. NMF, on the other hand, is a linear algebraic model. While NMF pursues the same objective of topic modeling, it is a matrix factorization and multivariate analysis technique that generates coefficients (instead of probabilities) for each word while mapping them to a given topic. LDA and NMF models have been employed for topic modeling, and earlier studies in non-construction contexts (large text stream data analysis and review data analysis) report superior performance of one algorithm over the other [40,41]. As there was no data on the robustness of the topic modeling algorithms in the construction context, both LDA and NMF algorithms are used in this study. The coding process of LDA and NMF is nearly identical except for some changes in the commands and tools called by the program. Therefore, LDA coding is first described, followed by a few lines highlighting key differences in the NMF coding process.
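To make the matrix-factorization view of NMF concrete, the following is a minimal sketch in plain Python. It is illustrative only: it uses the classic multiplicative-update rule for the squared-error objective, which is an assumption and not necessarily the solver behind the tool used in the study, and the names `nmf`, `frobenius_error`, and the toy document-term matrix are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error between V and the product W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Approximate V (documents x terms) by W (documents x topics) times
    H (topics x terms), keeping every entry non-negative."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h


# Toy document-term matrix (hypothetical): two documents share one set of
# terms, the other two share a different set.
v = [[2, 1, 0, 0], [4, 2, 0, 0], [0, 0, 1, 3], [0, 0, 2, 6]]
w, h = nmf(v, n_topics=2)
```

In this sketch, each row of `w` holds one document's coefficients over the topics and each row of `h` holds one topic's coefficients over the terms, which corresponds to the "coefficients instead of probabilities" distinction drawn above.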
Topic modeling is carried out as a sequential process. The process begins with preparing the input corpus in a format recognizable by the programming language. The data from the MDA section of the annual reports (which are primarily published in the Portable Document Format or PDF) is copied and pasted into a Comma-Separated Values (CSV) file with Unicode Transformation Format, 8 Bit (UTF-8) encoding. This enables the programming interface to recognize and read the data in the files. The coding process is then initiated in the Jupyter Notebook interface using open-source software libraries designed for data analysis and manipulation through the Python programming language [42]. After the necessary libraries are imported, the input corpus (a CSV file) is read into the program. The text data is then pre-processed. First, the program is instructed to ignore words found in more than 95% of the documents (95% of 18 companies) and words found in two or fewer documents (2 out of 18 companies). Second, common English stop words such as 'a,' 'the,' 'is,' and 'are' are eliminated. Overall, pre-processing eliminates words used commonly across documents, which may not contribute to differentiating the companies. Similarly, words that appear in the MDA data of only two or fewer companies are specific to those companies and may not be relevant while modeling for latent topics hidden across the text corpus. The input data is then ready for 'fitting and transformation,' the 'unsupervised learning' step. Finally, the algorithm extracts four latent topics from the input corpus. While the number of topics is initially set to '4', the input parameters are subsequently modified to generate more models for comparison.
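The pre-processing rules described above can be sketched without any external library. The snippet below is a simplified stand-in for the vectorizer used in the study, not a reproduction of it: the function names, the regular-expression tokenizer, and the four toy documents are hypothetical, and the stop-word set contains only the four examples named in the text (a real stop-word list is much longer).

```python
import re
from collections import Counter

# Only the four examples from the text; assumed to stand in for a full list.
STOP_WORDS = {"a", "the", "is", "are"}


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(docs, max_df=0.95, min_docs=3):
    """Keep terms that appear in at most max_df of the documents and in at
    least min_docs documents (min_docs=3 drops terms found in two or fewer)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    return sorted(
        term for term, df in doc_freq.items()
        if term not in STOP_WORDS and df / n_docs <= max_df and df >= min_docs
    )


def count_matrix(docs, vocabulary):
    """Return one row of raw term counts per document."""
    index = set(vocabulary)
    rows = []
    for doc in docs:
        counts = Counter(t for t in tokenize(doc) if t in index)
        rows.append([counts[t] for t in vocabulary])
    return rows


# Hypothetical toy corpus standing in for the 18 MDA sections.
docs = [
    "the order book is growing with government orders",
    "the digital team is improving productivity",
    "the government order pipeline and digital tools",
    "the order growth and digital productivity",
]
vocabulary = build_vocabulary(docs)
matrix = count_matrix(docs, vocabulary)
```

With 18 documents, a 95% ceiling removes only terms present in all 18, since a term present in 17 documents has a document frequency of about 94%.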
Table 1. Non-exhaustive summary of construction management research employing topic modeling
Input text data | Technique | Reference
Permit descriptions and other urban data | LDA | [28]
Site issue records | LDA | [29]
Defect litigation cases | LDA | [30]
Building information modeling (BIM) case studies | LDA, Latent Semantic Analysis (LSA), and Support Vector Machine | [31]
Construction market news | LDA | [32]
Construction schedules | LDA, LSA, word2vec and fastText | [33]
Construction specifications | Named Entity Recognition (NER) models | [34]
Building regulations (rule-checking) | NLP and Deep Learning-based semantic analysis processes | [35]

The output obtained is presented in the results section. While the above steps are for LDA, similar steps are followed for NMF-based topic modeling. However, the difference is that a 'Term Frequency and Inverse Document Frequency' (or 'tfidf') vectorization is used instead of the 'CountVectorizer,' and the 'NMF' tool is called instead of the 'LatentDirichletAllocation' tool. Finally, in both LDA and NMF-based models, topics are assigned to the input corpus text to display and interpret the results.
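The tf-idf weighting used in place of raw counts for the NMF model can be illustrated as follows. This is a sketch of the textbook formula (term count multiplied by the natural logarithm of the number of documents over the document frequency); the vectorizer used in the study may apply smoothing or normalization on top of this, so the exact numbers should be treated as an assumption. The function name `tfidf` and the toy counts are hypothetical.

```python
import math


def tfidf(counts):
    """Weight raw term counts by inverse document frequency.

    counts: one row of term counts per document.
    A term present in every document gets a weight of zero.
    """
    n_docs = len(counts)
    n_terms = len(counts[0])
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / df) if df else 0.0 for df in doc_freq]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in counts]


# Hypothetical counts: three documents, three terms.
counts = [[2, 0, 1], [1, 1, 0], [3, 0, 0]]
weights = tfidf(counts)
```

The first term appears in all three toy documents and is therefore weighted to zero, which shows why tf-idf favors words that distinguish one company's text from another's.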

Results
The keywords assigned to the four topics are displayed by calling the LDA components function (NMF components in the NMF model). However, the output is in the form of vectorized values, as shown in Fig. 1. The output values have to be sorted by their weight under a given topic (a probability in LDA, a coefficient in NMF) and converted into actual words so that the topic names can be deciphered. The output in the form of keywords is shown in Fig. 2.
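The conversion from vectorized values to readable keywords is a sort over each topic's weights. A minimal sketch follows; the function name `top_words`, the five-word vocabulary, and the weights are hypothetical and are not taken from Fig. 1 or Fig. 2.

```python
def top_words(components, vocabulary, n_words=10):
    """Return the n_words highest-weighted terms for each topic.

    components: one row of per-term weights per topic.
    vocabulary: the term corresponding to each column of components.
    """
    topics = []
    for weights in components:
        ranked = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
        topics.append([vocabulary[j] for j in ranked[:n_words]])
    return topics


# Hypothetical weights for two topics over a five-word vocabulary.
vocabulary = ["digital", "government", "order", "productivity", "team"]
components = [
    [0.1, 2.0, 3.5, 0.2, 0.0],
    [2.2, 0.1, 0.0, 1.9, 1.4],
]
keywords = top_words(components, vocabulary, n_words=3)
```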

Naming the topics and validation of results
To begin with, as discussed in the methodology section, the number of topics sought from the model is four. On observing the components, there are broadly two themes of strategy spread across the four topics. One set of words referred to an inward-looking strategy to improve the execution processes (like 'digital,' 'technology,' 'productivity,' 'employee,' and 'value'), and the second set indicated an outward-looking strategy that involves interaction with external stakeholders (words like 'government,' 'clients' and 'business'). Considering the presence of two latent themes, the algorithm is again executed with an instruction to generate two topics. The LDA and the NMF topic modeling program results for the 2-topic model are shown in Figs. 3 and 4. The results show that except for the companies with index numbers 1, 6, and 15, the LDA and NMF models show similar results. Since the LDA and the NMF tools model the topics using different algorithms, there can be slight differences in the set of words under topics (topic 0 and topic 1). The two topics modeled and allocated by the LDA and NMF tool (shown in Table 2) indicate 'latent' themes in the corpus text. On a keen observation, it is seen that topic-1 contains words such as 'team,' 'digital,' 'process,' 'management,' 'productivity,' 'technology,' 'learning,' 'safety,' 'innovation,' and 'quality.' Here, the term 'digital' refers to the implementation of technologies that help in increased visualization through tools and techniques such as Building Information Modeling (BIM). Similarly, the term 'team' can be equated to a 'collaborative' mindset amongst stakeholders. According to [25], the above terms correspond to 'lean signs,' and such signs indicate a construction contracting organization's keenness to implement lean construction techniques. Lean construction techniques help reduce waste, thereby increasing the productivity of the construction execution operations [43] and improve the profit margin. 
Therefore, the term 'profit margin improvement-focussed strategy' is renamed 'lean-focussed strategy,' considering the positive impact of lean on profit-margin improvement (through waste reduction and productivity improvement). Fittingly, the word 'productivity' itself appears under topic-1. Therefore, topic-1 is named the 'lean-focussed' strategy. On the other hand, the words grouped under topic-0, such as 'growth,' 'government,' 'capital,' 'order,' and 'industry,' indicate that the strategy focuses on external opportunities to propel revenues. This is a seemingly different strategy compared to the inward-focused lean strategy of certain other organizations. Therefore, the name 'revenue-focused strategy' is retained.
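Assigning each company's MDA text to a topic, and then listing the companies on which the LDA and NMF models disagree, can be sketched as below. The document-topic weights are hypothetical and do not reproduce Figs. 3 and 4; the sketch also assumes that the topic indices of the two models have already been aligned, since topic 0 in one model need not correspond to topic 0 in the other.

```python
def dominant_topics(doc_topic):
    """Return, for each document, the index of its highest-weighted topic."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in doc_topic]


def disagreements(labels_a, labels_b):
    """Return the document indices where two label lists differ.

    Assumes both models use the same topic numbering.
    """
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


# Hypothetical document-topic weights for three companies.
lda_weights = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]  # probabilities
nmf_weights = [[1.2, 0.1], [0.9, 0.3], [0.7, 0.2]]  # coefficients

lda_labels = dominant_topics(lda_weights)
nmf_labels = dominant_topics(nmf_weights)
differing = disagreements(lda_labels, nmf_labels)
```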
Choosing the keywords 'collaboration' and 'team,' a phrase extraction algorithm is applied to extract 30 words on either side of the keywords from one company's text to check whether the extracted keywords are relevant in the lean context. The result in one case is shown in Fig. 5. The result shows that the keywords are used in the lean context, aimed at improving work productivity. Considering the unsupervised nature of the modeling process, the results need validation. To validate the results, the authors consider the companies under the lean-focussed strategy (allotted to topic-1) and compare the results with those of the earlier keyword-based non-NLP work [25]. This study considers the 49 'lean signs' identified by [25] (in which lean signs are based on the keywords from lean-related research articles) and performs a simple keyword search in the annual reports of the 18 companies considered in the analysis. It is found that four out of the top five companies with the highest count of 'lean signs' are identified correctly by both the LDA and the NMF-based models (see column 4 of Table 2 for lean signs frequency). However, the LDA-based model placed four additional companies under the 'lean-focussed' category although they are not lean-focussed (false positives) and one company under the 'revenue-focussed' category even though it has a high frequency of 'lean signs' (false negative). In the NMF model, one false positive and one false negative are observed. A confusion matrix and the associated validity parameters are shown in Tables 3 and 4. It is evident that in the context and the data considered for the study, NMF performs better than LDA.
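Both validation steps described above, the keyword-in-context extraction and the validity parameters derived from a confusion matrix, are simple to express in code. The sketch below is an assumption about how such checks could be written, not the authors' implementation; the function names and the sample sentence are hypothetical, and the counts passed to `validity_metrics` are illustrative placeholders rather than the values in Tables 3 and 4.

```python
def keyword_in_context(text, keyword, window=30):
    """Return each occurrence of keyword with up to `window` words on either side."""
    words = text.split()
    hits = []
    for i, word in enumerate(words):
        # Strip surrounding punctuation before comparing, ignoring case.
        if word.strip(".,;:()\"'").lower() == keyword.lower():
            start = max(0, i - window)
            hits.append(" ".join(words[start:i + window + 1]))
    return hits


def validity_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical sentence; a window of 2 is used only to keep the example short.
sample = "Our team uses digital tools to raise productivity on site"
snippets = keyword_in_context(sample, "team", window=2)

# Illustrative counts only (not taken from the study's tables).
metrics = validity_metrics(tp=3, fp=1, fn=2, tn=4)
```

Here a true positive is a company that a model labels lean-focussed and that also ranks high on lean-sign frequency; precision penalizes false positives and recall penalizes false negatives.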

Discussion
The results confirm the presence of two strategies to counter the emerging construction scenario. The four companies with a lean-focussed strategy seem to be in the minority compared with the revenue-oriented (opportunity-focussed) companies. The results align with extant literature that finds a shallow extent of lean penetration in construction contracting organizations, especially in India [44,45]. It is observed from the results that the few companies categorized as 'lean-focussed' are strategizing collaboration, increased visualization, and productivity improvement so that work efficiency is improved, resulting in a more significant contribution to the profit margin. Extant studies elaborate on the low penetration of lean philosophy in construction, and the absence of a 'lean mindset' is considered a cultural barrier to implementing lean [46].
At the heart of all construction operations lies the concept of productivity. Construction productivity is the efficiency with which a firm converts inputs (resources like workforce, material, machinery, and money) to saleable outputs (work done or revenue earned). When productivity, which represents the execution efficiency of a construction contracting organization, is high, then the balance sheet tends to be healthy [47]. Three out of the four companies highlighted as lean-focussed are the top three companies in market capitalization. Therefore, the lean elements in the MDA section of the annual reports of large companies offer hope that large companies are setting benchmarks that the companies in the middle and the lower rung can follow. As a testimony to this fact, in the October 2021 edition of the 'Construction World' magazine, construction industry leaders have reported the positive impact of implementing lean in their organizations [48]. Another interesting observation from the study is that words like 'learning' and 'employees' appear in the top twenty words mapped to topic-1 (the lean-focussed organizations). An employee-focussed and learning-oriented strategy is a hallmark of lean organizations [49]. Regarding the 'revenue-focused' organizations, the firms' primary focus seems to be expanding their order books by tapping into the infrastructure push by the Government. While it is not argued that the two strategies are mutually exclusive, it is only highlighted that there is a visible difference in the top priorities of the top management when it comes to responding to the uncertain scenario that the world is currently facing. The absence of strict mutual exclusivity is evident from the inclusion of words such as 'time,' 'cost,' 'quality,' and 'operations,' which show that some aspects of a lean-focussed thought process in the top management can be seen in the topic-0 (or revenue-focussed) organizations as well. However, such words are few in the topic-0 firms compared with the lean-focussed topic-1 firms.
The results demonstrate that the NLP-based topic modeling algorithms can be helpful in clustering and understanding the themes in the large text corpus sourced from the MDA section of the annual reports of Indian construction contracting firms. While this study focuses on the contracting firms, a similar analysis of the MDA sections of the employer firms will indicate the extent to which employers find lean implementation necessary compared to the contracting firms' interest in lean.

Research contributions and limitations
Even though the study is only a proof-of-concept, its findings open up many new research directions. The observations from the study pave the way for expanding the use of such algorithms to analyze other sections of annual reports as well as the vision and mission statements of construction organizations. Policymakers and researchers can understand the overall strategies adopted by the construction sector and juxtapose them with the topics extracted from the Government's industrial policy documents so that the extent of alignment between industry and the Government can be assessed. In addition, a comparison of results with the keywords obtained from similar strategies adopted earlier in the timeline, across a cross-section of various countries, or in both dimensions can reveal crucial information on the effectiveness of policy implementation, so that necessary course correction can be adopted well in advance. In terms of contribution to the body of practice, the study can help organizations assess their strategy compared to their competitors and embrace newer and better strategies to set their organizations on the growth trajectory.
The accuracy and the precision levels of the model need further improvement. Improvement in model validity can be achieved by incorporating the MDA section of more companies across various stakeholder types such as consultants, owners, and developer organizations. Further, data from international organizations can also be included as a part of input data to understand its effect on the topic modeling results and its validity. Additionally, it will be interesting to identify keywords representing strategies that combine both revenue and lean focus.

Conclusion
The study was initiated with the primary intention of developing a proof-of-concept to test the applicability of NLP-based unsupervised topic modeling algorithms in deciphering latent topics prevalent in unstructured, publicly available construction sector data. Considering the need to understand a construction contracting organization's strategy to respond to both risks and opportunities evident in the evolving scenario in the Indian construction sector, the MDA section of the annual reports of construction contracting firms is used as a text data corpus to fulfill the primary objective. The results show that the top management of construction contracting firms laid down their strategy in two broad ways. A majority of the organizations are revenue-focused, seeking to propel growth by tapping into the massive infrastructure investment unveiled by the Government. On the other hand, a small number of organizations, typically having large market capitalization, focus on improving their work productivity by embracing lean construction concepts dominated by digitalization, team building, and process improvement. It is important to note that the two groups indicated above are not strictly mutually exclusive but only indicate an organization's priority in trying times. While the finding above confirms the observation of lean experts and researchers, the uniqueness of the study is in the demonstration of the potential of NLP-based topic modeling algorithms to analyze and cluster keywords in a manner that can help researchers visualize the strategies of construction contracting firms, which would otherwise require a significant amount of time if analyzed manually. The demonstrated potential of the NMF algorithm paves the way to greater use of NLP-based topic modeling techniques for the systematic content analysis of text documents.
Notwithstanding that the work is just a 'proof-of-concept,' it is an essential step in developing a robust text mining algorithm that can simplify the process of reading, understanding, and summarizing unstructured text data.