Introduction

The data available to the software development community can be overwhelming. Although the data carry indications of the direction in which software systems evolve as they are developed or enhanced, it has been challenging to extract those insights. This makes it essential to pay close attention to the data and examine whether they offer any insight. Since security vulnerabilities are a key concern for software systems, looking for patterns of security lapses is necessary. Data are available from multiple sources, starting with the customer conversations in which the software development team is involved. Industry processes large amounts of data across various domains and makes them available [1]. It is up to software development teams to process these data meaningfully. Early identification of signs of security lapses helps development teams make their processes more robust and secure. Owing to the nature of security issues, they become costly for corporations that lose track of them and discover them only after the software systems are in use. Although substantial effort already goes into checking for security vulnerabilities, leveraging the data and learning from them can take that effort to the next level.

Integrating information from customer conversations, software development process data, and industry data will strengthen the insights available to the software development team. In the long run, this integrated view will make the information smarter and more predictive. Data science capabilities have excelled at building solutions across domains, and secure software systems as a domain have not been left behind. Exploring ways to bring data science approaches and software development practices together will therefore be beneficial [2]. Software system security experts will benefit from this system, as their decisions can be validated with data, and the system also enhances customer confidence. Effort is also made to study suitable techniques and fine-tune them to align with the thought process devised as part of this study.

The system proposed in this paper combines AI (artificial intelligence) with industry best practices for managing knowledge effectively across the software development industry. Suitable AI algorithms are explored and used for the key processes in the knowledge management pipeline. The framework emphasizes connecting industry knowledge, software development knowledge, and business knowledge from the customer's point of view.

Literature Review

This literature review explored work done on software source code modeling, with a focus on handling software vulnerabilities effectively. In [3,4,5], the authors focus on approaches for effectively fixing security issues and the critical contributing factors involved; the techniques explored include regression and neural networks, and the data used for the experiments come from SAP's internal secure development practices. The location of a vulnerability within a software system component has been shown to significantly influence the time taken to address it. The work in [6, 7] encourages exploring approaches beyond regression techniques to predict security issues. The research in [8] studies software and hardware components to understand the security issues injected into software systems; its focus is the interplay between vulnerabilities injected in hardware and those in software components. Work [9] also highlights that the time needed to fix a security vulnerability depends on the type of vulnerability.

The research in [10] looks at global data covering multiple projects, whereas the study in [3] examines only company data, which helps obtain information relevant to the company where the implementation takes place. The research in [11, 12] worked on defect fixes across software development, and [13,14,15,16] are other references where similar work has been done. Study [17] leverages static code-related parameters to understand predictors of defects. The research in [13] studies the prediction of software defects and the effort involved.

Compared to rule-based decision trees, NB (Naive Bayes) has been proposed as a better approach. The approaches used in [13] are based on the work done in [18]. The study in [14] examines the factors influencing the time taken to fix a security issue in SAP company software. In [11], a review of the methods used for software vulnerability management with data mining approaches is presented; the area of exploration is vulnerability prediction approaches that work with software process-related metrics. In [15], the focus is on the value brought by transfer learning over traditional learning approaches, which struggle with cross-company data. The work in [19] highlights the impact of less experienced software development team members, how experience influences changes to their code, and the possible injection of vulnerabilities. The work in [20] looks at metrics such as code complexity, drawn from historical software development processes, that indicate the possibility of vulnerabilities being discovered. Feature extraction with graph learning and graph mining is studied in [21,22,23]. The work in [24, 25] focuses on domain-based feature engineering in software development and explores its application. Work [26] focuses on program analysis with deep learning approaches that can be considered for software vulnerability management use cases. Other works that provided considerable inspiration for this exploration are [27,28,29,30,31,32,33,34].

Research Gaps

An extensive literature review covering 77 papers was conducted to understand the research gaps on the subject. These papers cover the areas of deep learning, knowledge graphs, natural language processing, requirements management, secure requirements management, software analytics, software optimization, software requirements management, and software system security. The papers span 2006 to 2021, and 42 of them are from the last 5 years (2015 to 2021). The primary gaps identified are the lack of systematic exploration of machine learning and deep learning approaches for software security knowledge management, the influence of time on software issues, the extent of impact of various factors or features on software security outcomes, and the validation of these scenarios in real-world software development practices. Beyond these, there are research gaps around security issue prediction approaches, exploration of a wide range of metrics that can help validate machine learning and deep learning experiments, and building machine learning and deep learning solutions that are explainable.

Based on the exploration done across the literature, the following are the areas that need focus. Customer requirements flow into the software development team through various modes, and it is valuable to understand the patterns in these conversations; in the process, security-related information can be derived from them. The agile methodology of development has brought more rigor to software system development, yet little work has explored security-related requirements in organizations that use agile methods for their software development. The data flowing in from the industry can overwhelm the software development team, so it is valuable to generate useful information from these data and feed it to the concerned members of the team as and when needed. These industry data carry rich knowledge and provide an opportunity to learn the granular details of the security needs of a software system, and this granularity of security-related information helps uncover the various security categories hidden within the data. There is also a need to integrate the knowledge sources from the industry, from within the company, and from multiple projects within the company. This integration helps correlate the themes of security needs and prioritize their implementation.

Motivation for the Work

This work is motivated by the challenges faced by the software development community with respect to information explosion. Although all the guidance needed to ensure a smooth software development ecosystem exists, it is humanly impossible to stay on top of what matters most. This exploration studies the most important challenges facing software development teams. The security of the software system is one of the key concerns that must be tackled. Figure 1 shows the trend of cyber security events and their impact on organizations. This exploration intends to build a knowledge processor that can integrate the key knowledge and then provide features to consume that knowledge as and when needed. The work also focuses on continuous learning from this knowledge and on providing proactive inputs before things go wrong for the customer. Some of the thought process demonstrated in [35,36,37] has been beneficial for our exploration.

With the onset of the pandemic, cyber security risks have taken a toll on organizations. For a software development industry in the midst of digital transformation, this situation posed a huge challenge, with organizations facing reputational, legal, and operational consequences. The increased rate of people working remotely was the biggest risk, and phishing-related impacts were a significant challenge. The growth of video conferencing solutions brought with it threats and vulnerabilities that could compromise organizational systems. Improper access provisioning due to human error is an old problem, but it became more critical as people worked from home. The increase in the range of malware used for cyberattacks is another trend that concerns organizations. With these growing trends impacting cyber security, there is a need for an organizational solution that sustains a long-term cyber security vision for the organization.

This research work focuses on a customer landscape module, a software development landscape module, and an industry landscape module. In the industry landscape module, the focus is on understanding critical trends in software system security and providing an effective mechanism to translate that knowledge to the experts in the organization; cyber security related aspects are also accounted for in this module. The other modules largely focus on software system security while the software system is in development and deployment. The cyber security domain revolves around the possibility of a software system being hacked by external entities, whereas software security focuses on ensuring that software systems are built with security capabilities so that they are self-sufficient when used in the real world.

Fig. 1 Percentage of organizations compromised by at least one successful cyber attack

Data

The data chosen for these experiments are the information processed by software development teams as part of their agile development processes. The data used in the experiments belong to a software development organization that caters to the title insurance domain in the United States of America. This organization follows the agile method of software development and uses ADO (Azure DevOps) as its application lifecycle management tool. The data include customer requirements, internal technical requirements, issues reported during development work, testing outcomes, and other sources. The collected data were further labeled with the involvement of experts from the software development teams. This exercise focused on categorizing the security-related information within the base data, and the labeling was needed to build the base for supervised learning. It was observed that a large part of the base data was non-security related, which is expected, as security-related information forms a smaller subset. A visual analysis of the data was conducted to understand the underlying themes, which helped guide the software development experts as they took up the task of labeling the data.

Proposed Architecture of the System

This system has three prominent modules: customer conversation, software landscape, and industry landscape. These modules gather data from their respective sources, process it, and forward it for further usage. The customer conversation module fetches data from the various customer interactions conducted by the software development team. The software landscape module leverages multiple data sources within the software development processes, such as the technical requirements drafted by the software development team, the outcomes of the team's testing processes, and conversations within the team. The industry landscape module works on data fed in from the industry through publicly available sources on the internet. Figure 2 shows the overall architecture.

Fig. 2 Overall architecture

Customer Conversation Module

The customer conversation module takes in all the data associated with the customer conversations conducted by the software development team. This includes any information exchanged during discussions between the software development team and their customers. Since these discussions range across various areas, it is essential to leverage the critical information that holds hidden insights. Implicit customer requirements play a significant role in building effective software systems, so it is necessary to understand the patterns hidden in customer conversations. Apart from this data, other data processed internally within the software development processes are also utilized. These internal data are a further refinement of the customer conversations: in these internal discussions, customer needs from a business perspective are analyzed to derive the technical requirements. Figure 3 shows the customer conversation module.

Fig. 3 Customer conversation module

In the customer conversation module, information is processed at two levels. In the first-level modeling, all incoming data are modeled with a binary classification model that predicts whether an event is security related or not. In level 2, the security events are further modeled with a multi-class classification model that categorizes each security event into a specific security category. Understanding the security categories in the data provides a comprehensive view of the different security requirements before the software development team starts building the software system.
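As an illustration of how this two-level routing could be wired, the following minimal sketch assumes a fitted feature vectorizer and two already-trained classifiers exposing a scikit-learn-style predict() interface; the names and label conventions are placeholders rather than the exact components of the module.

```python
# Illustrative two-level routing of customer-conversation records.
# `binary_model` and `multiclass_model` are assumed to be already trained.

def classify_conversations(texts, vectorizer, binary_model, multiclass_model):
    """Level 1 separates security from non-security items;
    level 2 assigns a security category to the security items."""
    features = vectorizer.transform(texts)
    level1 = binary_model.predict(features)   # assumed: 1 = security, 0 = non-security

    results = []
    for i, text in enumerate(texts):
        if level1[i] == 1:
            category = multiclass_model.predict(features[i:i + 1])[0]
            results.append({"text": text, "security": True, "category": category})
        else:
            results.append({"text": text, "security": False, "category": None})
    return results
```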

The output of this module is a set of security-related expectations for the software systems being built, and these expectations are categorized. The categorized security expectations help build the ecosystem needed for software development and help software development teams proactively identify appropriate security measures while developing the systems. These systems must evolve to adapt to the dynamic nature of the business. The module includes a checkpoint where security experts review the predictions and make the necessary updates; these updates can then be used to calibrate the models. Over time, the information collected in the database becomes more intelligent and more relevant to the software systems being worked upon.

Models for Customer Conversation Module and Comparative Study

A detailed exposition of the level 1 and level 2 models is given in our paper [38]. A bi-directional LSTM (long short-term memory) network with an attention layer is the model used in level 1.

Fig. 4 Experimental design for the bi-directional LSTM with attention layer

Figure 4 depicts the general theme of the level 1 model construction. Creation of a pre-trained FastText embedding matrix; tokenization, vectorization, and padding; and an LSTM-GRU combination with an attention mechanism are the key constructs of this architecture.
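A minimal sketch of such a level 1 network is shown below, assuming a TensorFlow/Keras stack, a pre-built FastText embedding matrix, and illustrative layer sizes and sequence length; it is not the exact configuration reported in [38].

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Illustrative dimensions; the paper's exact hyper-parameters may differ.
VOCAB_SIZE, EMBED_DIM, MAX_LEN = 10_000, 300, 100

def build_level1_model(embedding_matrix):
    """Bi-directional LSTM-GRU with a simple attention layer for
    binary (security / non-security) classification."""
    inputs = layers.Input(shape=(MAX_LEN,))                      # padded token ids
    x = layers.Embedding(VOCAB_SIZE, EMBED_DIM,
                         weights=[embedding_matrix],             # (VOCAB_SIZE, EMBED_DIM) FastText vectors
                         trainable=False)(inputs)
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    x = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)
    attn = layers.Attention()([x, x])                            # self-attention over time steps
    x = layers.GlobalAveragePooling1D()(attn)
    outputs = layers.Dense(1, activation="sigmoid")(x)

    model = Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```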

Fig. 5 Architecture of the bi-directional LSTM with attention layer

Figure 5 depicts the architecture of the bi-directional LSTM with attention layer used in the level 1 model of the module.

The embedding matrix is a key component of this modeling. As shown in Fig. 6, consider embedding the sentence shown with a 300-dimensional embedding matrix over a vocabulary of 10,000 words. Taking the word "ORANGE" in the sentence as an example, it can be represented as a one-hot vector of dimension 10,000 by 1. Multiplying the embedding matrix by this one-hot vector yields the embedding vector for "ORANGE" of dimension 300 by 1.

Fig. 6 Embedding matrix representation

$$E \rightarrow (300 \times 10{,}000) \tag{1}$$

$$O \rightarrow (10{,}000 \times 1) \tag{2}$$

$$\textrm{Embedding vector: } E \times O_{6257} = e_{6257} \tag{3}$$

where

E \(\rightarrow\) the (300 \(\times\) 10,000) embedding matrix for the vocabulary of the sentence "I EAT.....ORANGE.....I LIKE <UNK>"

O \(\rightarrow\) the one-hot vector for the word "ORANGE"

e \(\rightarrow\) the embedding vector of dimension (300 \(\times\) 1) for the word "ORANGE"
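The lookup in Eqs. (1)-(3) can be illustrated with a few lines of NumPy; the matrix here is random and the index 6257 is simply the position of "ORANGE" in the vocabulary, as in the example above.

```python
import numpy as np

# Toy illustration of Eqs. (1)-(3): multiplying the embedding matrix by a
# one-hot vector selects the embedding column of that word.
vocab_size, embed_dim = 10_000, 300
E = np.random.rand(embed_dim, vocab_size)    # stands in for the FastText matrix

orange_index = 6257                          # index of "ORANGE" in the vocabulary
O = np.zeros((vocab_size, 1))
O[orange_index] = 1.0                        # one-hot encoding of "ORANGE"

e_orange = E @ O                             # (300 x 1) embedding vector
assert np.allclose(e_orange[:, 0], E[:, orange_index])
```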

Results: validation accuracy was 98.69%, with a weighted average precision of 91%, recall of 84%, and F1-score of 87%. Compared to the work in the literature, our exploration is unique in combining a pre-trained embedding model with a bi-directional LSTM, an attention mechanism, and a GRU (gated recurrent unit). In the second level of the module, Distil-BERT was finalized for multi-class classification of the security events; here the model specializes in categorizing security events into specific event categories. The experiments conducted with Distil-BERT demonstrated that the pre-trained model base comes in handy for practical application in this context. These models also provide a lighter base, helping to make the solution robust and scalable, and they offer language modeling capabilities that can be leveraged for many use cases, such as next-sentence prediction. Distil-BERT fits well for the second-stage prediction, where specific categorization is needed.

Fig. 7 Experimental design for Distil-BERT

Figure 7 demonstrates the key constructs of Distil-BERT as used in this module: Distil-BERT-based input feature creation, Distil-BERT architecture creation, and tokenization based on distilbert-base-uncased.
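A hedged sketch of how such a Distil-BERT classifier could be set up with the Hugging Face transformers library is given below; the number of security categories, the sequence length, and the training settings are assumptions for illustration, not the values used in the study.

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

NUM_SECURITY_CLASSES = 10   # assumed number of security categories

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=NUM_SECURITY_CLASSES)

def encode(texts):
    # Tokenization based on distilbert-base-uncased, padded/truncated to a fixed length.
    return dict(tokenizer(texts, truncation=True, padding=True,
                          max_length=128, return_tensors="tf"))

# The transformers model computes its classification loss internally when labels
# are supplied, so only an optimizer is set at compile time.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5))

# Example fine-tuning call (train_texts: list of str, train_labels: list of int):
# model.fit(encode(train_texts), tf.constant(train_labels), epochs=3, batch_size=16)
```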

Fig. 8 Architecture for the Distil-BERT experiment

Figure 8 depicts the architecture of Distil-BERT as used at the second level of the module. Distil-BERT was first subjected to experimentation on binary classification, showing an average precision of 95%, an average recall of 95%, and an F1-score of 94%. Distil-BERT was then subjected to multi-class classification specialized with the data collected based on experts' input in the software development processes, yielding a precision of 79%, a recall of 78%, and an F1-score of 78%.

Industry Landscape Module

In the industry landscape module, all the industry information related to security is leveraged. Figure 9 shows the industry landscape module architecture.

Fig. 9 Industry landscape module architecture

Modeling of the vulnerabilities is conducted to identify the CWE (common weakness enumeration) of the security event. The company will have an internal security event database, and events from this database can also be fed into the model. Once the CWE is predicted, the database is updated with the CWE and threat details. The database must also be maintained through periodic updates of the mapping between threats and vulnerabilities. Vulnerabilities are potential loopholes in the system that threats can exploit; from the security point of view, it is essential to understand the threat-vulnerability combinations so that appropriate security controls can be put in place. The database is updated with the modeled threats, the vulnerability mappings, and the CWE mappings, and it must also be periodically updated with security event information from across the company. This information helps assess the probability of occurrence and the impact of a security event. Based on the type of threat, its associated vulnerabilities, the likelihood of occurrence, and its impact on the business, a risk value can be tagged to each threat. Based on the risk values, controls can be recommended for the threats, and the control information is also updated in the database. This information helps predict the controls needed in future as and when new threats are identified in the software development.
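To make the risk tagging concrete, the sketch below assumes a simple likelihood-times-impact scoring scheme and illustrative thresholds for recommending controls; the actual scales, thresholds, and control catalogue would be set by the organization's risk assessment practice.

```python
from dataclasses import dataclass, field

# Illustrative risk tagging of threats in the security event database.
# The scoring scheme and thresholds are assumptions, not prescribed by the framework.

@dataclass
class Threat:
    name: str
    cwe_id: str                   # predicted CWE, e.g. "CWE-89"
    vulnerabilities: list         # loopholes this threat can exploit
    likelihood: int               # 1 (rare) .. 5 (almost certain)
    impact: int                   # 1 (negligible) .. 5 (severe)
    risk: int = field(init=False)

    def __post_init__(self):
        self.risk = self.likelihood * self.impact

def recommend_control(threat: Threat) -> str:
    if threat.risk >= 15:
        return "mandatory control: fix before release"
    if threat.risk >= 8:
        return "planned control: schedule remediation"
    return "monitor: track in security event database"

sql_injection = Threat("SQL injection", "CWE-89",
                       ["unvalidated input", "dynamic query building"],
                       likelihood=4, impact=5)
print(sql_injection.risk, recommend_control(sql_injection))   # 20 mandatory control: ...
```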

CWEs (common weakness enumerations) are community-developed descriptions of common software and hardware weaknesses. They are used as a reference to understand the parameters of software security, and they formed the base reference for understanding critical software system security needs while the machine learning and deep learning models were built. Some examples of CWEs in software development are API-related errors, key management issues, and data validation issues. With CWEs as the foundation of software security, our intended knowledge management framework will help in weakness identification, optimization, and prevention.

Model for Industry Landscape Module

The prediction model needed in this module is a specialized threat modeling model, which takes in industry and company security-related information and refines the security classes in terms of CWEs. Based on the software development expert team's analysis, the top 20 most essential CWEs were shortlisted for this modeling, and 3,214 data points from various software development programs were collected for these 20 CWEs. Among the experiments conducted on multiple modeling approaches, a stacking model of a decision tree classifier, a K-neighbors classifier, and logistic regression showed the best performance, with 77.7% accuracy and a standard deviation of 2.5%. The decision tree classifier and K-neighbors classifier were used at level 0, and the logistic regression model was used at level 1 of the stacking.
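A minimal sketch of this stacking arrangement with scikit-learn is shown below; the feature matrix X and CWE labels y are assumed to come from the module's feature extraction step, and the hyper-parameters are left at illustrative defaults.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Level 0: decision tree and K-neighbors; level 1: logistic regression.
level0 = [("dt", DecisionTreeClassifier()),
          ("knn", KNeighborsClassifier())]
stack = StackingClassifier(estimators=level0,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)

# Evaluation sketch (X: extracted features, y: CWE labels for the 20 classes):
# scores = cross_val_score(stack, X, y, cv=10)
# print(scores.mean(), scores.std())
```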

Evaluation against the state of the art has shown the following outcomes. Work [39] utilizes SMOTE, SVM (support vector machine) with an RBF (radial basis function) kernel, and logistic regression on recordings of meetings between developers and their customers from a software development company located in the United States; they explore a classification approach to identify security vulnerabilities and report a precision of 70.8% and a recall of 18.3%. Work [1] explored LDA (latent Dirichlet allocation) and SVM approaches on a Stack Overflow dataset to classify security vulnerabilities, reporting a precision of 70.33% and a recall of 77% for LDA, and a precision of 72% and a recall of 77% for SVM. Among all the experiments we conducted, the stacking model of the decision tree classifier, K-neighbors classifier, and logistic regression showed the best performance, with 77.7% accuracy and a standard deviation of 2.5%. In comparison with these best prior works, our approach achieves better performance, with a precision of 76% and a recall of 79%.

Software Landscape Module

In the software landscape module, all the information processed within the software development practices is leveraged. Figure 10 shows the software landscape module architecture.

Fig. 10 Software landscape module architecture

Figure 10 depicts the architecture of the software landscape module. The focus here is on software source code, software production-related information, and all other information processed during software development. Software source code holds considerable patterns within it, which can be used to learn the security weaknesses of the software systems under study. The naturalness of software source code has been extensively highlighted in the literature, where source code is noted to be even more structured than natural language, and this property can be leveraged in this module. In our book [40], we have dealt in depth with the topic of statistical modeling of source code and ways to leverage it to solve some of the challenges faced by the software development community.

Software development-related data include all the information processed as part of software requirements processing, software design, software construction, and software testing. All this information can also be leveraged to learn the security flaws of the software systems. Software requirements-related conversations are already modeled as part of the customer conversation module, and the rest of the information can be leveraged as part of this module. The threat modeling conducted in the industry landscape module can model the software design-related information; threat modeling is one of the classical approaches followed by the software development industry to strengthen software design. Software testing provides the information associated with various testing activities, from unit testing to integration testing and a wide variety of software security testing. Software security testing includes Veracode scans performed in static and dynamic modes, and penetration testing is another test conducted to understand possible loopholes in a software system. Greenlight, integrated into the coding tools, helps proactively identify security issues as software developers write code, and SonarQube is another code analysis tool that can surface more code-related features for analysis. However, none of these systems alone establishes a full-fledged, fool-proof system that plugs all the security loopholes. Hence, this smart system for managing security vulnerabilities in a software system is needed to bank on all the learnings that happen in parts across the various security measures, and it can become smarter by integrating all this information over time [41, 42].

Much of this software-related information can be converted into metrics, which can be used to model possible security flaws. Security vulnerabilities can be identified directly from the software scanning processes and can be further modeled to understand the categories of security vulnerabilities. Understanding the security categories helps strengthen the company's security event database, and this information can be built up over time into a security heat map of the software systems being built. The software metrics that can be derived from other software development data broadly fall under the categories of productivity, product quality, process quality, and technical quality of software development. Under productivity, the measures include the software development team's planned effort and the actual software code output. Under product quality, all software quality issues across the software life cycle are counted [43, 44]. Process quality measures the maturity of the foundational processes used for software development; in this case, since the agile model of development is followed, the maturity of the agile practices is accounted for. Under technical quality, the code's maintainability, technical debt, and security issues are counted.
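As an illustration only, a per-sprint metrics record covering these four categories could look like the following; the metric names and values are placeholders rather than a prescribed schema.

```python
# Hypothetical per-sprint metrics record grouped into the four categories above.
metrics_record = {
    "sprint": "2021-S14",
    "productivity":      {"planned_effort_hours": 320, "actual_code_output_kloc": 4.2},
    "product_quality":   {"defects_reported": 18, "defects_escaped_to_production": 2},
    "process_quality":   {"agile_maturity_score": 3.8},
    "technical_quality": {"maintainability_index": 72, "technical_debt_days": 9,
                          "open_security_issues": 3},
}
```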

These metrics can be modeled against the security flaws identified by the other security-related tests and scans, and based on these correlations, projections of security issues can be made. These projections help the software development team proactively identify security issues before they become costly when discovered late in the software development lifecycle. Apart from these metrics, metrics based on the performance of the software systems in production can also be added to this module. Software systems in production are monitored with various tools, such as SCOM (System Center Operations Manager) and Uptrends for website and web performance monitoring, and AppDynamics can provide more insight into the software systems in action. The monitoring-related information includes server processor time, CPU utilization rate, database utilization rate, SQL Server alert information, database health, business transaction health, application availability, production incidents, and so on. These performance signals are to be correlated with the security flaws modeled in the other modules of this system and with the security-related information generated by security testing. Building a connected system that can use all this information helps build a security prediction system, and all this information can be gathered and developed into a software development security vulnerability database [45].
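A hedged sketch of such a correlation step is shown below, using pandas on a toy per-component table; the columns and values are illustrative, and in practice the metrics would come from the tools and scans listed above.

```python
import pandas as pd

# Toy per-component table: development and monitoring metrics alongside the
# vulnerability counts reported by security scans. Column names are illustrative.
df = pd.DataFrame({
    "code_complexity":       [12, 35, 8, 27, 41],
    "technical_debt_days":   [3, 14, 2, 9, 20],
    "cpu_utilization_pct":   [40, 85, 30, 70, 90],
    "vulnerabilities_found": [1, 6, 0, 4, 8],
})

# Rank metrics by how strongly they co-vary with discovered vulnerabilities.
correlations = df.corr()["vulnerabilities_found"].drop("vulnerabilities_found")
print(correlations.sort_values(ascending=False))
```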

Integrated Software Security Vulnerability Management System

The customer conversation module generates information on security-related themes and classes from customer conversations. The software landscape module generates information from the software development processes, including qualitative security data and quantitative software metrics data. The industry landscape module generates vulnerability class data from the information available in the industry. Together, these feed a central platform for the company, which can be leveraged to build the organization's security management system. This central platform can be enhanced with a query database where software development practitioners can query any information of interest, a visualization module that provides visual output for the software development data, and a prediction module where a variety of software system prediction capabilities can be built in.
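As a minimal sketch of the query capability, the snippet below assumes the module outputs are consolidated into a single relational store with an illustrative schema; it is not a prescription of the platform's actual storage design.

```python
import sqlite3

# Hypothetical central store consolidating the outputs of the three modules.
conn = sqlite3.connect("security_knowledge.db")
conn.execute("""CREATE TABLE IF NOT EXISTS security_findings (
                    source  TEXT,     -- customer / software / industry module
                    project TEXT,
                    cwe_id  TEXT,
                    risk    INTEGER)""")

def open_high_risk_findings(project, min_risk=15):
    """Example query a practitioner might run against the central platform."""
    cur = conn.execute(
        "SELECT source, cwe_id, risk FROM security_findings "
        "WHERE project = ? AND risk >= ? ORDER BY risk DESC",
        (project, min_risk))
    return cur.fetchall()
```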

Fig. 11 Expansion of software landscape module and central module

Figure 11 demonstrates the software landscape module, the central module, and its knowledge management capabilities. All the data flowing from the other modules are integrated here, and the data are processed to assimilate key information that the software development community can leverage. Security-related information picked up from customer conversations, industry knowledge content, software monitoring systems, and software development information is used to devise a security threat-vulnerability map. Security insights derived from various software development metrics are another critical source of information. All this essential information feeds the querying, visualization, and prediction capabilities of the central module.

Conclusion

This study explored the possibility of bringing together software development practices and data science practices to better control software system vulnerabilities. The sources of information are the data generated in the industry, within the software development processes, and in the conversations between the customer and the software development team. The architecture proposed in this exploration combines these data sources to create an artificial intelligence-based system for managing knowledge around software system vulnerabilities. In the customer conversation module, we built a data classifier that refines the data to identify security-related information and further provides a granular classification of it; deep learning models are used for the two-level processing of this information, and the approaches used perform better than state-of-the-art approaches. Under the industry landscape module, we proposed an integrated approach that combines risk assessment and threat modeling with machine learning models; we customized the ensemble models in this architecture to perform the needed modeling and demonstrated the best performance in comparison with state-of-the-art results. Under the software landscape module, we proposed an architecture to combine data from across the software development practices and demonstrated the creation of a smart knowledge base for software development practitioners from a proper combination of all the data across software development processes. Finally, the system feeds into a central database for the organization, which is enabled with capabilities for extracting helpful information based on the needs of the software system development team. The authors have devised suitable machine learning approaches for building the customer conversation module to process the data transacted between the customer and the software development team. This solution will enable the knowledge processing platform to mature with time and make software development processes more efficient.

Contribution of the Research

This research has helped bring together multiple knowledge sources from the customer, the outside industry, and within the company. The work went deep into two specific areas of the customer conversation module, where all the data generated between the software requirements management team and the customers are processed. Two-level processing of the customer conversations using machine learning capabilities was explored, and we have presented results that compare favorably with the best work in this area.

Future Study

This study can be further extended to businesses across various domains. The models built into this architecture can be refined as the system processes more data in the future. Initially, the system will be enabled with an expert in the loop; as the system becomes smarter, this dependency on the expert can be removed. Advanced deep learning approaches can be explored and fine-tuned to provide better modeling capabilities for the models used in the architecture. Machine learning or deep learning models can also be examined for modeling the quantitative data processed in software development practices.