1 Introduction

Software security is the safeguarding of software and its components within the scope of information security’s three elements: confidentiality, integrity, and availability [1]. It also involves ensuring that software should perform its expected functions even when under attack. To detect potential software vulnerabilities, institutions perform various analyses on source code developed in-house or obtained from third parties.

Analysis of closed source software, including executables and libraries, is also essential. Accordingly, institutions also aim to identify security vulnerabilities in the closed source software they provide or use and take appropriate measures to address them. The Lighttpd web server, which is widely used, provides an example of the use of vulnerable executables. The absence of a verification process for the input string used in the vulnerable function base64_decode enables an adversary to use malicious inputs, which may result in the occurrence of out-of-bounds and abnormal termination of the web server [2]. Nevertheless, by being aware of the aforementioned circumstance, it is possible to implement a security mechanism that checks the input from users. Furthermore, the Heartbleed vulnerability is a well-known example. As the data size was not controlled by the popular OpenSSL cryptographic software library, it was sending back as much data as requested. This meant that the server was capable of sending all information in the memory [3]. Consequently, with relative ease, the attackers were able to obtain the server memory, which may contain passwords, certificates, and so forth.

Even end users want to know whether the applications they are currently using contain any vulnerabilities. The Free MP3 CD Ripper software may be an illustrative example of this necessity. This software enables the conversion of audio CDs into the MP3 format. Due to the presence of a local buffer overflow vulnerability, in the event that the user took a malicious.wav file which had been prepared by an attacker, the version 1.1 of the software was capable of executing attacker’s instructions which had been embedded in the.wav file [4]. It is essential to inform users when such a vulnerability is identified. This enables users to refrain from utilising software that contains vulnerabilities. As a result, in today’s world, there is an increasing demand for information about vulnerabilities in closed source software.

Cyber threat intelligence can be shared with institutions, developers, and end users who lack the ability or resources to conduct their own analysis. This enables them to become aware of security vulnerabilities and take necessary precautions.

To produce cyber threat intelligence, it is obvious that the first step is to identify any vulnerabilities. Software can exist in three states: source code, bytecode, or binary code. Therefore, security vulnerabilities can be detected by examining all three types of code. As the study focuses on software whose source code cannot be accessed, the analysis and examination were carried out on binary code. This is because the application area is broader as source code is not required for analysis [5]. In certain fields, such as Internet of Things (IoT) and performance-oriented industries, the source code is not shared and is often compiled into binary [6,7,8]. Additionally, updates are usually distributed in binary code [9]. Furthermore, the existence of malicious compilers, such as Xcode-Ghost, which inject additional code, is a concern [8, 10]. In numerous cases, the source code may not precisely represent the code that is executed. Analyzing the source code alone may not provide adequate results as compilers make changes when converting the source code into executable binary code [5, 6, 11, 12]. Given the significant progress made in source code-based vulnerability analysis, binary code analysis is now considered more important [13]. Additionally, advanced analysis of byte code may not be necessary as it can be converted back into source code.

With most existing analysis methods, many vulnerabilities can be missed due to the need for human resources that create features and require expertise [7]. It is therefore important to develop automated methods. Automated analysis of binary code is a promising research area, because it enables the development of new software analysis technologies that can perform more accurate and reliable examinations [5, 14]. Detecting security vulnerabilities in binary code is a challenging process [6, 11]. The structure of binary code is complex [11]. There are many architectures that contain dozens of instructions [14]. Furthermore, the data and control structures lack high-level and semantically rich information [6, 11]. Additionally, there is a significant requirement for programming effort to develop a tool for binary code analysis [14].

This study aims to automate the process of creating cyber threat intelligence for closed source software vulnerabilities by addressing the issue of manual steps that require expert knowledge to detect vulnerabilities and create related intelligence. The objective is to present an approach that offers a new perspective for binary code analysis, shares a standardized way for vulnerability intelligence creation, and does not compromise performance by simplifying the steps. This paper contributes to the literature on the following topics:

  • Literature review of vulnerability detection in binary code and software vulnerability intelligence,

  • Proposed flowchart and system design for creating threat intelligence for closed source software, based on the analysis of existing process steps and system architectures,

  • Reviews and assessments for currently available datasets and cyber threat intelligence standards,

  • Data types that should be defined in accordance with the recommended intelligence standard,

  • A deep learning model that is not limited to a single processor architecture, file format, or software domain,

  • The function-as-sentence approach, which converts each machine instruction as a word and each function as a sentence, for quickly detecting vulnerabilities in binary code,

  • New data representation for binary code analysis,

  • Publicly available source code that is ready to use.

This study differs from previous studies in several ways. The majority of studies have focused solely on vulnerability analysis. There is insufficient focus on the production of intelligence. So our study aims to fill the gap and produce intelligence in a short time. Our approach assumes that each sentence expresses information about the existence of security vulnerability. It therefore tries to understand what the function wants to say. For this reason, we have proposed simple transformations, avoiding any steps that could introduce additional meanings. We have shared the usage details of the proposed cyber threat intelligence standard to establish a standardised approach for the literature.

Section 2 examines studies presented in the literature on the analysis of closed source software. It also analyses studies that have performed production, use, and/or management for software vulnerability intelligence. In the section, an evaluation of the literature is presented and the necessity of the study is shown. Section 3 presents the recommended flowchart and system design. It also includes a survey of usable datasets and cyber threat intelligence. In the section, our approach is detailed and compared with approaches in the literature. Section 4 describes the algorithms used, presents their performance, and evaluates the implemented system. The conclusion is presented in Sect. 5.

2 Related studies

The literature review addressed vulnerability analysis and vulnerability intelligence separately in order to cover a wide range of approaches, to evaluate the solutions effectively, and to enable better comparison with our work. Relevant studies were identified using combinations of the following keywords:

  • For vulnerability detection studies:

    • binary software, binary program, binary code, binary code executable, binary, executable, program

    • vulnerability, vulnerability detection, software vulnerability, software vulnerability analysis

    • machine learning, deep learning

  • For vulnerability intelligence studies:

    • software vulnerability, software vulnerability analysis

    • cyber threat intelligence, threat intelligence, vulnerability intelligence

Keywords were searched in both English and Turkish. The results are not restricted to a specific start or end date. Keywords were searched in the fields of title, abstract, keywords and main text fields.

Some studies [15,16,17,18,19,20,21] used control and data dependencies, semantic information, or flow graphs for the purpose of detecting vulnerabilities. Cui et al. employed a combination of CNN and Bi-LSTM algorithms to detect multiple vulnerabilities [15]. The proposed model created binary code slices based on function calls, control dependencies, and data dependencies. Features were extracted from the created slices during the analysis. 19 types of vulnerabilities were identified. Similarly, in [16], control and data dependencies were analyzed from function calls to determine the exact location of vulnerabilities within binary code.

Diwan et al. [17] used VDGraph2Vec, which was developed to represent the machine language, along with RoBERTa to obtain control flow and semantic information. The analysis was carried out through various neural networks, with a focus on the CWE-121 and CWE-190 vulnerabilities of the Juliet dataset. Only the codes containing the CWE-119 vulnerability of the NDSS18 dataset were used.

Cheng et al. [18] conducted a vulnerability analysis of Internet of Things devices. The study examined control flow using CNN and achieved 80% accuracy.

Gao et al. [19] analysed binary code vulnerabilities across multiple platforms. Labelled semantic flow graphs were created for the binary code, followed by the use of DNN for vulnerability detection. The tests reported a 0.2 s time to detect a vulnerable function and an AUC value of 88.48%.

Padmanabhuni and Tan [20] utilised machine learning techniques to identify buffer overflow vulnerabilities in binary code. The variable information was obtained by parsing the binary data, and then the BinAnalysis tool was used to detect dependencies. They then created and analyzed various attributes using the Weka tool. The analysis compared the Naive Bayes, Multilayer Perceptron (MLP), Simple Logistics (SL), and Sequential Minimal Optimization (SMO) techniques. The results indicated that SMO had the highest accuracy rate of 94%.

Chen and He [21] conducted risk analysis and vulnerability detection through feature matching. The control flow graph was first obtained through binary code, then semantic information was extracted using LSTM, and finally, the vector representation was produced.

In some studies [22,23,24,25], prior to analysis, the binary code was translated into an intermediate language. Wang et al. [22] conducted a study on detecting known and unknown vulnerabilities in binary code. They obtained pseudocode from the binary code and analyzed it using the BiLSTM-attention model. The Juliet test set was used with a focus on the C/C++ programming language.

Taviss et al. conducted a vulnerability analysis on binary code using machine language summarisation methods [23]. The Bi-GRU algorithm yielded the best results, achieving a performance of 79.6% on the Juliet dataset and 84.1% on the NDSS18 dataset.

Redmond [24] utilised Natural Language Processing (NLP) to support multiple architectures, including ARM and X86. The machine instructions that are dependent on architecture were converted into a common language using NLP. Analyses were conducted using Bivec. Tests were performed on the openssl, coreutils, findutils, diffutils, and binutils programs.

Zheng et al. [25] proposed a method to detect CWE-134, CWE-191, CWE-401, and CWE-590 vulnerabilities in binary code using a recurrent neural network. Binary code was decompiled with the RetDec tool and converted to an intermediate language. Then, flow analysis was performed. To determine the most suitable recurrent neural network, they compared the performance of six different models: SRNN, BSRNN, LSTM, BLSTM, GRU, and BGRU. The analysis revealed that BSRNN achieved the highest accuracy. The NIST SARD dataset was used compiling with GCC 5.3.1. Upon examining the compiled programs, they identified vulnerabilities in 9,352 out of 14,657 binary codes.

A number of studies [26,27,28,29,30] employed the use of vectors to represent binary data. Lee et al. [26] introduced a method that converts assembly codes into fixed vectors, rather than prioritising the ordering of API function calls. To achieve this, they utilised the Intrusion2vec tool and employed Text-CNN for analysis and classification of the vectors. The function classification model was trained and tested using the Juliet dataset provided by NIST, which was divided into 80% for training and 20% for testing, resulting in an accuracy rate of 96.1%.

Yan et al. [27] analysed the machine language using the Juliet and ICLR19 datasets and various algorithms. They first used Bi-GRU and a ‘word-attention module’ to obtain contextual information, followed by TextCNN and a ‘patial-attention module’ for feature extraction. Their hybrid approach yielded better results than previous studies.

Nguyen et al. [28] proposed the deep cost-sensitive kernel machine to analyze binary code for CWE-119 and CWE-399. A total of 79,848 binary codes were tested. Their best F1 score for the NDSS18 binary dataset was 85.7%.

Le et al. [29] proposed the maximal divergence sequential autoencoder to automatically derive features for binary code vulnerability analysis. Furthermore, a labelled dataset suitable for binary code vulnerability analysis was developed. Their best F1 score for the dataset was 87.1%.

Feng et al. [30] developed a bug detection engine called Genius using unsupervised machine learning. The study analysed 8126 firmware containing 420,558,702 functions using high-level numerical feature vectors instead of raw features.

Few studies [31,32,33] employed similarity analysis as a method of detection. Eschweiler et al. [31] presented a new approach for performing similarity analysis of functions within binary code. They implemented a preprocessing process to reduce computational costs, allowing for analysis of high-dimensional code bases. The developed prototype analysed binary codes for various architectures, including x86, x64, ARM and MIPS. The system identified vulnerabilities, including Heartbleed and POODLE, in an Android image that contained over 130,000 ARM functions. Luo et al. [32] also performed a study of binary code similarity analysis using a DeBERTa model. Recall was 79% for their best model.

Durmuş and Soğukpınar [33] presented a new method for analysing heap buffer overflow vulnerabilities in binary executables. The study aimed to identify similarities between the distributions and sequences of binary opcodes used for branching, looping, register value updating, memory value updating, stack operations, and system calls. Learning and testing sets were created, consisting of a total of 30 binaries, 15 of which contain vulnerabilities. These binaries were derived from the NIST SARD dataset. All programs were developed using the C/C++ programming language. The tests employed various classification methods/algorithms, including KNN, Naive Bayes, SVM, ZeroR, Bayes Multinomial, Hoeffding Tree, J48, K-Star, Random Forest, Random Tree, and Decision Table. The KNN, Naive Bayes, Hoeffding Tree, K-Star, and Random Tree algorithms demonstrated high performance rates, indicating their suitability for vulnerability analysis.

Various studies [7, 34,35,36,37,38,39] indicated that the analysis of binary instructions is a preferable approach. Liu et al. [7] analysed binary code programs created for the Internet of Things using a deep learning-based approach for automated vulnerability analysis. The binary instructions obtained through the IDA Pro tool were preprocessed and cleaned before extracting binary functions. Model learning was then performed on these functions using Att-BiLSTM (attention bidirectional LSTM) for vulnerability prediction.

Dong et al. [34] analysed defects in Android binary executables. They obtained over 90,000 smali files from approximately 50 apk files. Then, the extracted features were analysed using DNN, resulting in an AUC value of 85.98%.

Morrison et al. [35] investigated several methods for analysing binary data of Windows 7/8, including logistic regression, naive bayes, recursive partitioning, support vector machines, tree bagging, and random forest. However, when compared with the literature, the accuracy of the results fell behind the studies that detect vulnerabilities through source code.

Saxe and Berlin [36] presented the Invincea system for binary code analysis. The system achieved an accuracy rate of 95% and a false positive rate (FPR) of 0.1 based on tests conducted on over 400,000 pieces of software from various sources. The process involved creating features from binary code and then using a DNN with 4 layers, 1024 inputs, 1204 hidden units, and 1 output. Feedforward is preferred, and Keras, a deep learning library, is used to create the neural network model.

Rosenblum et al. [37] performed a binary code analysis on the Intel IA32 architecture in both Windows and Linux environments. A probabilistic model was used with IDA Pro and Dyninst tools to classify the start byte (FEP) of each function. The modelling was done with Markov Random Field (MRF) resulting in the recall rate of 1%.

Tian et al. [38] compared the performance of several algorithms, including RNN, LSTM, GRU, Bi-RNN, Bi-LSTM, and Bi-GRU. The results show that the Bi-GRU and Bi-LSTM algorithms achieved the highest F1 score, reaching almost 90%. They developed a prototype called BVDetector and tested it using the SARD (Software Assurance Reference Dataset) provided by NIST.

Wu and Tang [39] used the Bi-GRU algorithm with the ELMo model. They achieved 98.9% and 87.7% accuracy on the Juliet and NDSS18 datasets, respectively.

A limited number of studies [40, 41] employed a dynamic approach to analyse the binary. Li et al. [40] proposed the use of fuzzing techniques to identify vulnerabilities. The V-Fuzz prototype was implemented using the described approach. The prototype comprises two main components: the vulnerability prediction model and the vulnerability-oriented evolutionary fuzzifier. When analysing a binary program with V-Fuzz, the system first predicts which parts of the program are more likely to contain vulnerabilities. The fuzzing component then tests the program by using an evolutionary algorithm to generate inputs to the relevant parts of the program. The testing was conducted using Uniq, MP3Gain, pdftotext, and libgxps. The experimental results indicate that 10 common CWEs were detected, including 3 new ones.

Grieco et al. [41] conducted a study on memory corruption prediction using static and dynamic features. The study analysed a dataset called VDiscover, which was built from 1039 Debian programs. The binary code was disassembled, static properties were removed, and dynamic properties were obtained through execution analysis. The predictive module is trained on the extracted features using supervised machine learning techniques, such as logistic regression and random forest. The experiments were conducted with different algorithms, and the maximum accuracy achieved was 55%.

Commercial products that analyse binary code were found through search engines. Examples of these tools include Grammatech CodeSurfer [42] and CodeSonar [43, 44], SourceForge BugScam [45], hex-rays IDA Pro [46], Veracode SAST [47], Microsoft CAT.Net [48] and Carnegie Mellon University BAP [5]. The tools do not provide a detailed description of the business logic, and we could not find any information on the use of machine learning/deep learning techniques.

Only a few studies have focused on cyber threat intelligence for software vulnerabilities. Patil and Malla [49] focused on virtualization environment software, with the aim of identifying vulnerabilities and prioritising their patch management. The vulnerabilities were identified by analysing both internal and external threat intelligence. Following this, patch prioritisation was carried out, and information was provided on possible controls to prevent the exploitation of vulnerabilities.

Wu et al. [50] presented a model for producing vulnerability intelligence by analysing code repositories of open source tools. The tool was developed solely for C/C++ programming languages, and no cyber threat intelligence standards were utilised in the model.

According to Wu et al. [51], software vulnerabilities present a significant threat to power systems. To minimize this threat, they presented a solution and developed a model using the China National Vulnerability Database (CNNVD). The model enables corrective action to be taken based on vulnerability intelligence. The initial stage involves defining the entities within a topology. Based on the gathered information, patch management is performed on these assets to eliminate vulnerabilities.

Davidson et al. [52] discussed a solution for transferring information about software vulnerabilities between organisations or units within an organisation. The authors employed game theory as a methodology and developed a system that utilised standards such as STIX and TAXII. They also developed a proof-of-concept protocol for exchanging intelligence. The protocol was implemented using the Go programming language.

Previously, we examined [7, 38] in the context of vulnerability analysis. In this section, we will discuss them again, this time with a focus on vulnerability intelligence. The models did not focus on producing or distributing cyber threat intelligence in a specific format. The data acquired is primarily static and not shareable.

The majority of studies have focused solely on vulnerability analysis. Intelligence production is not given sufficient weight, especially in a shareable format. The lack of a commonly used benchmark dataset makes comparisons difficult [53]. Furthermore, many studies concentrate on software specific to a particular domain.

Studies in the literature vary in granularity. Some only identify vulnerabilities in binary code, while others classify them. Moreover, many studies concentrate on a single vulnerability and/or architecture. As a result, it is not feasible to make direct comparisons of the results presented in the literature.

It is clear that deep learning techniques are not yet mature enough for binary code analysis [54]. Additionally, there is a lack of models/systems that can produce shareable cyber threat intelligence. Therefore, our study aims to fill this gap.

3 Proposed system

3.1 Dataset selection

An analysis has been conducted on the available datasets. Buffer overflow benchmark from MIT Lincoln Laboratories [55] contains outdated applications. Furthermore, the dataset only includes three applications and 14 exploitable vulnerabilities. Similarly, the LAVA-M dataset [56] also has a limited number of applications.

The DARPA CGC dataset [57] focuses on a single architecture. Hence, it may not be sufficient for studies aiming to support multiple architectures.

The NIST SARD [58] is a collection of source code for various programming languages, platforms, and vulnerabilities. It contains numerous test cases and test suites, such as the Juliet C++ [59], C# Vulnerability Test Suite [60], and Java Test Suite [61]. However, researchers must compile the source code, which may pose a challenge.

The VulDeePecker dataset [62], also known as the NDSS18 dataset or NDSS18 source code dataset, was created using the NIST SARD. It contains buffer error vulnerabilities (CWE-119) and resource management error vulnerabilities (CWE-399). The dataset has been created for deep learning studies, but it also does not include compiled binary.

By compiling the source code files of the VulDeePecker dataset, the NDSS18 binary dataset [29] were created with a specific data representation in 2018. The dataset includes records for various operating systems and architectures. In 2020, Nguyen et al. [28] presented a new dataset created from six open-source applications, using the same specific data representation as before.

Our study utilised the binary dataset from six open-source projects (hereinafter referred to as SOSP) and the NDSS18 binary dataset to detect CWE-119 and CWE-399 vulnerabilities. These datasets were chosen because of their simple raw data acquisition procedures. Firstly, the binary code is decompiled to obtain machine language instruction sequences in base 16. Then, the function blocks are converted to decimal base. The opcode and instruction values are extracted from each line and merged using the pipe and comma symbols respectively. This representation conforms to our desired word structure in the function-as-sentence approach.

3.2 Solution method

Natural languages are formed by combining root and derived word groups with suffixes, whereas machine language is formed by combining opcode and instruction information. The function-as-sentence approach has been proposed to treat machine language as a natural language. This approach considers each line in machine language as a word and each function as a sentence.

The first step is to convert the binary code based on the data representation of the selected dataset. This process provides the words for the functions. Next, these words are joined with spaces to form a sentence. In our vectorisation approach, we consider punctuation marks within the word structure as separate entities rather than as part of the alphabet. Therefore, we remove all punctuation marks. The dataset is then analysed to determine the frequency of individual string values. This information is used to create a vocabulary. The batch of strings is transformed into a list of integer token indices using the vocabulary. A deep learning model utilises the resulting vectorised sentence for training.

When analysing a binary, the same steps are applied. The sentence is then classified using a deep learning model, and cyber threat intelligence is produced based on the presence of vulnerabilities. Accuracy is considered to be the main criterion of our approach, rather than time and space complexity. However, time remains an important consideration for detection systems. So we have included metrics to evaluate performance. Therefore, in Sect. 4, we compared our achievements with those of other studies.

This study differs from previous approaches in several ways. For vulnerability analysis, some studies in the literature used control and data dependencies, semantic information, or flow graphs [15,16,17,18,19,20,21]. Our approach assumes that functions express themselves, similar to sentences in natural language, rather than focusing on their implementation details.

In certain cases, the binary code was transformed into an intermediate language before analysis [22,23,24,25]. Furthermore, several studies used vectors to represent binary data [26,27,28,29,30]. Our approach does not intend to impose new meanings. It solely acquires existing machine language sequences and applies various transformations to facilitate more efficient and effective calculations.

Similarity analysis was also employed in the literature [31,32,33]. The approach does not perform similarity analysis directly as it assigns meaning to functions through patterns. The function-as-sentence approach assumes that each sentence expresses information about the existence of security vulnerability. It therefore tries to understand what the function wants to say. For this reason, the objective is to perform simple transformations, avoiding any steps that could introduce additional meanings.

In the field of vulnerability intelligence, some researchers have narrowed their focus to specific domains [51]. Additionally, the exchange of intelligence has been studied [52], but most of the data obtained is static and not easily shareable [7, 38]. This study proposes the automatic production of shareable cyber threat intelligence without focusing on a specific domain or platform. Table 1 presents the methods and features of the proposed solution.

Table 1 Solution overview

3.3 General flow

The studies obtained in the literature review have been analysed for their process steps and system architectures. Based on this analysis, a flow has been determined.

Then, it has been improved by adding new steps, including plotting history, saving and importing a model, exporting an intelligence file, and preparing data in accordance with our approach. The flow is shown in Fig. 1.

Fig. 1
figure 1

Flow diagram for closed source software vulnerability intelligence

First, the model preparation phase begins. Then, binary files are converted to the appropriate format and the classification process is performed. After that, the cyber threat intelligence creation phase uses the classification results and exports the intelligence file.

3.4 General design

The system architecture was designed based on the literature and then improved by updating it according to the flow diagram, standardizing the components, and refactoring it according to software engineering best practices.

Figure 2 illustrates the architecture. In this regard, a comprehensive design has been presented to the literature in order to build a system that can analyze binary code and create cyber threat intelligence.

Fig. 2
figure 2

Component diagram of the proposed system

In the architecture, Orchesrator ensures that the learning, testing and detection steps are performed in the correct order. ModelManager is responsible for preparing the deep learning model and analysing the binary code for classification. FormManager vectorises the input for training or testing. Additionally, it transforms machine code into the appropriate format. Reconstructor decompiles the executable for analysis. CtiManager generates and exports the acquired intelligence. Commons comprises common modules and necessary functionalities, such as executing operating system commands, storing and retrieving configurations, reading and writing files, creating and deleting directories and files, logging, and preparing and exporting figures.

3.5 Cyber threat intelligence selection

Intelligence that does not comply with a specific format may create additional workloads and incompatibilities for stakeholders. Several formats are available for creating intelligence in a standardised form. The following formats have been found [63,64,65,66,67,68,69,70,71,72,73,74,75,76]: CybOX, CSAF, CVRF, GENE, IDMEF, IODEF, MAEC, MISP, MMDEF, OCIL, OpenIOC, OVAL, RID, Sigma, STIX, VERIS, XCCDF, YARA.

The GENE [77], OCIL [78], Sigma [79], XCCDF [80], and YARA [81] standards are not suitable for representing cyber threat intelligence related to software vulnerabilities. GENE, Sigma, and YARA are primarily used for creating detection rules. OCIL focuses on creating audit questions, while XCCDF is used for defining checklists.

The RID XML data schema is currently based on the IODEF [76]. Neither the IODEF [82] nor the RID [83] standard include data fields directly relevant to a software vulnerability.

The MAEC [84] and MMDEF [85] standards are primarily intended to express information obtained from static and dynamic analysis of malware. Therefore, a vulnerability may not be expressed in full detail.

CVRF [86] is currently presented as CSAF [87], and CybOX [88] has been integrated into STIX [89]. As a result, the most suitable formats are CSAF, IDMEF [90], MISP [91], OpenIOC [92], OVAL [93], STIX and VERIS [94].

After evaluating which standard provides more compatible results with less effort, it was determined that the STIX standard is the most suitable due to its compatible fields, widespread usage, and mature documentation. All data fields in the standard can be utilised for vulnerability intelligence. Table 2 provides recommendations for the information to be included in each field when using the standard. This is expected to make a valuable contribution to the literature and prevent inconsistencies in studies using the same standard.

Table 2 Proposed uses of the standard

All information is automatically filled in by our system. For CWE-199 and CWE-399 vulnerabilities, any data that cannot be automatically extracted from the binary file is directly included in the source code.

4 Algorithms, achievements and evaluation

The architecture was implemented using Python (v3.11.3) [95] due to the availability of numerous mature libraries and its widespread use in deep learning applications. Keras (v2.11.0) [96] and Tensorflow (v2.11.0) [97] libraries were used to manage deep learning models. The dataset was processed using sklearn (v0.0.post1) [98] for operations such as splitting and random selection. Numpy (v1.23.4) [99] and Pandas (1.5.2) [100] libraries were used for data filtering and transformation. Figures of data distributions and learning/test performance curves were generated using the matplotlib (v3.6.2) [101] library. The RetDec tool [102] was used on an executable file containing binary code to decompile, obtain machine code in 16-bit format and access additional data for cyber threat intelligence. The intelligence schema was created with the STIX 2 Python API (stix2) [103]. The hardware specifications were as follows: Intel Core I7-7700HQ CPU (@2.80 GHz), 16 GB of RAM, and the 64-bit version of Microsoft Windows 10 operating system.

The following metrics have been employed for the purposes of evaluation: accuracy, precision, recall, F1 score, and AUC. Accuracy is the ratio of number of correctly predicted functions to the total number of functions. Precision indicates the model’s ability to accurately predict vulnerable functions. Recall measures the number of correct vulnerable function predictions made across all vulnerable functions in the dataset. The F1 score is defined as the harmonic mean of the precision and recall values. It provides a single score, making it a more valuable metric. The AUC is a measure of the model’s ability to distinguish between non-vulnerable and vulnerable functions.

MLP, OneDNN, LSTM and Bi-LSTM were used because the algorithm selection was made based on the existing capabilities of the deep learning library (Keras). Bi-LSTM achieved the best performance for both the NDSS18 binary dataset and the SOSP dataset. Figure 3 shows its learning and validation (val_prefix is used) achievements, as well as the label distribution.

Fig. 3
figure 3

Performance metrics of Bi-LSTM

An accuracy of over 99% was achieved on the SOSP dataset when a test rate of 0.2, a validation rate of 0.5, and shuffling were used. The dataset comprises approximately 52,000 non-vulnerable records and only 600 vulnerable records. The fluctuation in the AUC, precision, and recall values during the learning and validation stages confirms the presence of skewed data. Furthermore, the small percentage of vulnerable data is insufficient for proportional selection.

For this reason, the NDSS18 binary dataset was used for the learning phase and to compare achievements. The dataset was split using a 0.2 test rate, 0.5 validation rate, and shuffling. The data distribution was equal.

The results showed that the Bi-LSTM model achieved an accuracy of 84.2%, precision of 81.7%, recall of 88.6%, and AUC of 94.6% on the training subset. On the test subset, the model achieved an accuracy of 81.8%, precision of 82.3%, recall of 82.6%, and an AUC of 93%. Table 3 presents the confusion matrix of the Bi-LSTM algorithm on the test subset, along with the rates of false positive and false negative.

Table 3 The confusion matrix of the Bi-LSTM on the test subset of the NDSS18 Binary Dataset (All)
Table 4 Comparison of achievements in studies using the NDSS18 Binary dataset

The accuracy, AUC, and precision values in both the learning and validation stages are similar and converge at a common point. Although the recall value initially produced varying results, it eventually converged as well. This is a common occurrence in machine learning applications, where the model improves with each iteration and produces more true positives than in the beginning. Furthermore, there are no significant differences between the results achieved in the learning and testing phases. Table 4 presents a comparison with other studies that have used the same NDSS18 binary dataset.

Although achievements on the learning subset were much higher, achievements on the test subset have been used for comparison. In the table, if the highest achievement in our study (HA) is higher than other study in the relevant category, green is used. Blue indicates the proximity of the HA to the achievement in the other study. The closeness rate is taken as 1%. If our accuracy is lower and our F1 score is higher, purple is used. Additionally, if there was no accuracy, comparison has been made through the F1 score computed by us. In the light of all this information, the comparison of our achievements can be summarized as follows:

  • There were 21 models for Windows data. Our model outperformed 14 of them. The two models were evaluated as similar. Therefore, our model was as successful as 16 of the other 21 studies.

  • There were 21 models for Linux data. Our model outperformed 10 of them. The two models were evaluated as similar. Therefore, our model was as successful as 12 of the other 21 studies.

  • There were 26 models for all data. Our model outperformed 22 of them. One model was evaluated as similar. Therefore, our model was as successful as 23 of the other 26 studies. Additionally, we achieved the highest AUC score.

AUC is a measure of the model’s capacity to discriminate between classes. Our model has demonstrated the most effective ability in the existing literature to distinguish between vulnerable and non-vulnerable functions.

When using the same dataset and algorithm, our approach has achieved better results. Table 4 shows the achievements of [23] using the Bi-LSTM algorithm. The Bi-LSTM algorithm, when used with our approach, achieved approximately twice the precision and recall values.

In conclusion, the function-as-sentence approach has been proven to be effective in performing binary code vulnerability analysis. To evaluate the efficiency of the developed system on real applications, we created two binary samples using C++ and compiled them with the gcc [104] tool. We analysed cmd.exe as a real-world application. During all testing processes, we used the Bi-LSTM algorithm model, which was trained on all x32 and x64 datasets of Windows and Linux.

One of the binary samples is the hello-world application. As expected, the software did not detect any vulnerabilities. The other is a binary containing two functions including buffer overflow vulnerability. Its source code is presented in Fig. 4.

Fig. 4
figure 4

Source code of the binary sample containing two problematic functions

The cti-for-css has calculated a 74% probability for the badLevelA function and a 99% probability for the badLevelB function. This demonstrates that the calculations are dependent on the codes used within the function. It is a predictable outcome that the strength of the calculation will vary depending on the use and repetition of codes that cause vulnerability. It is also widely acknowledged that the sscanf and cin functions are vulnerable [105,106,107].

Fig. 5
figure 5

Classification logs of the binary sample containing two problematic functions

On average, cyber threat intelligence reports were produced in 0.1 s, excluding the detection process. There are no studies in the literature that report the related duration. The transformation from binary to our data representation took approximately 0.49 s for each function. Each record, which was represented using our approach, was classified in 0.32 s. Thanks to our filtering feature, unwanted functions can be ignored in under 0.001 s while classifying.

Duration information was not provided in most studies. Gao et al. [19], Feng et al. [30], Eschweiler et al. [31] classified a function in 0.2, 0.1, and 0.08 s, respectively. Gao et al. [19] and Eschweiler et al. [31] require preparation time of 50 min. Feng et al. [30] did not provide any information on its preparation time. However, instead of using Portable Executable (PE) or Executable and Linkable Format (ELF) binaries, they processed images. In addition, their time costs may increase linearly with the number of basic blocks in a function [19]. Also, the preferred datasets and system resources are not the same. Therefore, a direct comparison would not be accurate. Figure 5 provides a section from the classification logs of the binary sample containing two problematic functions.

Figure 6 shows the visualisation of the cyber threat intelligence produced automatically for the relevant sample. The intelligence includes metadata such as the software producer, production location, and processor architecture used. It also presents information on the vulnerable code section, vulnerability definition, remediation techniques, groups aiming to exploit the vulnerability, the analyzing product, and the relational connections.

Fig. 6
figure 6

Visualised cyber threat intelligence of the binary file

During the analysis process of cmd.exe, 1332 functions were obtained. These functions had mostly meaningless names and code lines that are difficult to understand. As a result, it was suspected that the software may be obfuscated. To prevent false positives, the threshold value was increased to 0.99 using the system’s configuration feature.

In conclusion, 10 of the functions have been classified as having CWE-199/CWE-399 vulnerabilities. This executable is known to contain vulnerabilities [108, 109]. Upon examination of the detected functions, it was manually verified that they were performing memory-related operations and could contain vulnerabilities.

5 Conclusion

The aim of the study is to automatically identify vulnerabilities in closed source software and create cyber threat intelligence. The study focused on multiple architectures and developed a deep learning-based system named cti-for-css, which stands for “Cyber Threat Intelligence for Closed Source Software”. CWE-119 and CWE-399 vulnerabilities have been detected successfully.

When using the entire dataset, our model was as successful as 23 of the other 26 studies according to F1 scores. The results obtained from our approach outperformed those of other algorithms that used the same dataset and algorithm. Additionally, our model achieved the highest AUC score. AUC is a measure of the model’s capacity to discriminate between classes. Thus, our model has demonstrated the best ability in the existing literature to distinguish between vulnerable and non-vulnerable functions. The data transformation, data classification, and production of shareable intelligence were achieved quickly. The system can produce a shareable intelligence from known vulnerabilities in 0.1 s. The simplified pre-processing steps, low time cost, and absence of hybrid/fusion strategies render our approach an optimal benchmark for the comparative evaluation of future studies.

Furthermore, this study presents the proposed flow diagram, software design, cyber threat intelligence standard, and dataset to the literature. Sharing the usage details of the proposed cyber threat intelligence standard is expected to make a valuable contribution to the literature and prevent inconsistencies in studies using the same standard.

The source code is publicly available on the Github platform at https://github.com/arikansm/cti-for-css. We have also shared three different tools: cti-for-css-library-usage, which shows how to use the system; cti-for-css-gui, which provides a graphical interface; and cti-for-css-stix-viewer, which visualises the produced cyber threat intelligence.

As a future study, we have planned to create a balanced dataset from real-world applications. And future research could investigate the performance of our approach in addressing different vulnerabilities and utilising hybrid/fusion strategies. Additionally, the approach may be used for purposes beyond software vulnerability, such as malware analysis and function task interpretation. For this reason, the issue of software vulnerability is not emphasised in the name of the tool.