1 Introduction

Digital technology has evolved over recent years and has become an essential part of many organizations, including academic institutions where academic documents are officially issued. Academic documents are issued and used by academic institutions of all levels from kindergartens, primary schools, and secondary schools up to higher education institutions. The purpose of these documents is to record the teaching and learning processes and to provide evidence of the status of students and their study progress. More importantly, academic documents indicate the knowledge and skills that a student has achieved throughout their academic years. Moreover, an academic document proves that a student has successfully completed a course, and the document holder can use it to apply for a job or an advanced degree in their field. The problem is that obtaining such achievements in education is neither easy nor cheap. This has led to adversaries offering to illegally forge or even fake academic documents for those who want to advance their careers or education easily.

In many organizations worldwide, a degree certificate or an academic transcript is considered important evidence of academic achievement, which has to be attached to an application for a job or an advanced education level. When an academic document holder applies for a job or advanced education, the organizations or academic institutions that receive the document usually need to verify its authenticity and validity. Traditionally, organizations or academic institutions need to call the institution that issued the document to confirm its validity and authenticity. Unfortunately, this is a time-consuming process, and many therefore choose to skip it completely. Consequently, criminal acts of forging academic documents have begun to appear.

There have been many cases [17] where forged academic documents were successfully used to apply for a job. Two notable ones are as follows. The first case occurred in 2019 in Thailand, where a hospital employee who had worked at the hospital for over ten years was found guilty of using a fake academic document to apply for the job. The employee was then dismissed from his job by the Ministry of Public Health and asked to pay back all the income he had earned over the working period. The second case also occurred in 2019 in Singapore, where a person was arrested for creating and using fake academic documents for 4 years. He had applied for jobs and successfully received 38 offers. However, after the arrest, he was sentenced to almost 3 years in prison and incurred a fine of 1600 SGD [21].

With the advancement of digital technology, many institutions have begun to issue academic documents in a digital format with the advantages of easy storage and durability. However, there remains the same problem of how to verify faked or forged digital academic documents. Currently, a few known methods have been designed to help verify the authenticity of digital academic documents. They include digital signature-based platforms and blockchain-based platforms.

On a digital signature-based platform [5, 39], academic institutions generate a pair of public and private keys and request a digital certificate from a certificate authority (CA). When academic documents are issued, the institution digitally signs them using its private key. The verifier of the signature and documents, having obtained the institution’s digital certificate, uses the institution’s public key to ensure validity. This appears to be a possible solution to detect document forgery. However, there is one main disadvantage of this solution, which is the infrastructure required for this platform to work, i.e., a CA is required and must be established so that digital certificates can be issued to all institutions. Moreover, all the verifiers need to obtain these digital certificates to verify the documents and signatures.
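The sign-and-verify step of such a platform can be illustrated with a minimal Java sketch. It is not taken from the cited platforms: the class name `SignatureSketch`, the choice of RSA with 2048-bit keys, and the `SHA256withRSA` scheme are assumptions made for illustration, and the CA-issued certificate that would bind the public key to the institution is omitted.

```java
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;

// Hypothetical illustration of signing and verifying a document.
final class SignatureSketch {

    // The institution generates its public/private key pair (RSA-2048 is an assumption).
    static KeyPair newIssuerKeyPair() {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            return generator.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // The institution signs the document bytes with its private key.
    static byte[] sign(byte[] document, PrivateKey privateKey) {
        try {
            Signature signer = Signature.getInstance("SHA256withRSA");
            signer.initSign(privateKey);
            signer.update(document);
            return signer.sign();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // The verifier checks the signature with the institution's public key,
    // which in practice would be taken from the institution's certificate.
    static boolean verify(byte[] document, byte[] signature, PublicKey publicKey) {
        try {
            Signature verifier = Signature.getInstance("SHA256withRSA");
            verifier.initVerify(publicKey);
            verifier.update(document);
            return verifier.verify(signature);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch, any change to the document bytes after signing causes `verify` to return false, but the verifier still has to obtain and trust the public key, which is the infrastructure burden described above.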

Blockchain technology [20], although originally applied to Bitcoin and other cryptocurrencies, has evolved enormously over the years from cryptocurrency to smart contracts, which then have been integrated into many applications, such as the Internet of Things and decentralized finance. The reason for the suitability of blockchain technology for many applications is that it provides a combination of desirable properties, which include decentralization, immutability, security, and consensus. This has also led to the application of blockchain to digital academic document platforms. The first example of a blockchain-based digital academic document platform is Blockcerts [5], developed by the Massachusetts Institute of Technology (MIT). On this platform, students are required to download the Blockcerts Wallet before being able to manage their academic documents [6]. Since then, a few more universities have developed their blockchain-based platforms, including the Blockchain Lab by Birmingham University and the Block.co platform by the University of Nicosia in Cyprus. In addition, a nonuniversity organization developed a blockchain-based platform for universities called EduCTX [37], which is claimed to be a global credit platform used by students to store their credits after completing a course.

Although blockchain appears to have many desirable properties for ensuring the validity of academic documents, it still has several disadvantages. The first obvious one is its infrastructure, i.e., for a blockchain platform to work, three roles must be considered: students, academic institutions, and academic document verifiers. Furthermore, an education authority should be added to the infrastructure. This implies that there must be some form of formal collaboration among these entities for the blockchain-based platform to function. The second issue concerns the joining of new participants. Such a platform would need to give some entities the responsibility of authenticating new joiners. The third drawback is that digital wallets are needed for sending and receiving academic documents. Just as cryptocurrency wallets are required to transfer cryptocurrencies from one entity to another, wallets are required here because it is the tokens representing the documents that are transferred from academic institutions to students. Hence, this represents extra overhead for all entities on the platform.

The objective and contribution of this study is to design and develop a system based on a cryptographic hash function that addresses the problem of academic document forgery by detecting modifications to the original document. This study will show that neither new infrastructure nor a consortium of the education authority, academic institutions, students, and document verifiers is required. In addition, we compare the forgery detection success rate and efficiency in terms of processing time between our proposed method and other recent technologies, including convolutional neural network and blockchain-based schemes.

The remainder of the paper is organized as follows. Section 2 provides some background knowledge and reviews the literature related to cryptographic hash functions and document forgery detection systems. The details and description of the proposed document forgery detection system and how it was evaluated are presented in Sect. 3. The system is then evaluated in Sect. 4. Section 5 concludes the paper.

2 Background knowledge and related work

This section describes and analyzes existing technologies that are applied for issuing and verifying electronic academic documents. The principle of the cryptographic hash function, which is an integral part of the proposed method, is also provided.

2.1 Existing research and technologies

The existing research and technologies related to the verification of electronic academic documents can be divided into two categories: nonblockchain-based and blockchain-based technologies.

Academic document manipulation can be done in two ways. The first is to just make changes to the document, such as changing the grade by deleting the existing one and typing in the new grade. This method is essentially adding new content to the document. The second method is called the copy-move forgery method [2, 36], where an adversary copies some existing textual information (e.g., the letter “A”) from a document and pastes it somewhere else in the same document.

One system that aims to detect document forgery is known as the Photon System [14]. It was developed by two specialists in ultraviolet radiation, Hug and Reid. The authors applied their expertise to examine documents in many dimensions, including the ink used to print on the paper, the material used for the paper, and other relevant materials. Although the authors claimed that the Photon System was able to detect document forgery, their technology solely focuses on the physical aspect of documents and not the abnormalities in the digital forms of documents. Another disadvantage of this system is the need to invest in high-priced equipment, which also requires expert technicians to work on it. Thus, it is unsuitable for detecting forgeries in digital academic documents.

Valida, a document forgery detection system developed by Gradiant, works with digital or electronic documents [11]. It applies artificial intelligence and deep learning along with cloud computing for computation and storage purposes. To detect document abnormalities, users of Valida can work on either a mobile application or a web browser, provided that the device is connected to the Internet. This implies that users do not have to be specialists. However, one disadvantage of Valida is that the system applies artificial intelligence and deep learning, both of which require an adequate quantity of data or sample documents for training. Thus, the system cannot work as soon as users obtain it. Moreover, Gradiant does not provide information regarding the forgery detection accuracy of Valida.

Valida is not the only document forgery detection system that uses artificial intelligence, machine learning, and deep learning as its base technology. Several other existing studies apply similar methods to detect forgeries in electronic documents. Sirajudeen and Anitha [34] tried to validate document authenticity using an image processing technique. However, they found that conventional image processing did not provide sufficient accuracy. They, therefore, attempted to improve it by creating a model for document verification using a convolutional neural network (CNN). They also used optical character recognition (OCR) and linear binary pattern to extract textual information from documents. The authors claimed that the results of their CNN-based forgery detection model were promising and could be used in the implementation of real-world applications. Sahu et al. [31] demonstrated a technique to detect content tampering using a scheme similar to that proposed in [34], combining a CNN with image processing; however, they also focused on finding alterations in the exchangeable image file format. Other researchers [26] not only worked with the CNN technique but also applied the Fourier transform and Chebyshev’s theorem to improve the accuracy of forged document detection. Although the accuracy was improved over the conventional CNN technique, their work traded off processing time for accuracy gains. Another CNN-based detection method is the work by Ullah et al. [38], who worked on malware classification and detection. Although this work was not proposed to directly detect document forgery, it could still be applied by first using N-gram feature analysis on academic documents and then using a deep learning approach similar to CNN to learn deep and broad features. This way, features of legitimate academic documents would be analysed, and fake documents could potentially be detected. Sarode et al. [32] and Saxena et al. [33] then further developed a design of a document manipulation detection system by not only using CNN but also applying a blockchain for storing data related to each document. However, they did not report the accuracy and efficiency of their proposed system.

There is another existing machine learning-based method that is worth mentioning. Although it is not specifically proposed for document forgery detection, it could still potentially be used to achieve such a goal. This method is a community detection algorithm based on graph representation learning [40]. A community detection algorithm can be used to evaluate how items are clustered. In the context of academic document forgery detection, one could attempt to locate communities of strongly connected documents. This method could, therefore, potentially differentiate forged documents from legitimate ones. Although the work of Wang et al. [40] showed a promising result, it would only work in the context of academic document forgery detection if the number of documents available was large enough to form communities.

Blockchain was introduced as a part of the Bitcoin proposal in 2009 [25], although it was first designed in 1991. It has since taken off and become a platform upon which many applications, especially those related to finance, have been built. The reason for this is mainly its properties, such as decentralization, immutability, security, and consensus. Consequently, many recent systems for issuing and verifying digital academic documents have been developed based on blockchain technology.

MIT was one of the many universities that issued digital academic documents using blockchain. The system was developed by the MIT Media Lab and Learning Machine company and is known as Blockcerts [5, 29]. Using blockchain, digital academic documents issued by Blockcerts can be accessed anytime as long as the blockchain is available. Blockcerts uses the Bitcoin blockchain as its platform and hence is inefficient in terms of speed and ease of use. This is due to the computation time of the proof-of-work consensus and the need to install the Blockcerts Wallet. Another problem with the Blockcerts system was found by Baldi et al. [3]; they introduced a way to impersonate a document-issuing institution because of the lack of document issuer verification.

The next major blockchain-based system is known as BTCert [5, 30], which was developed by Birmingham University after realizing that blockchain could be used to protect the authenticity of digital academic documents. The main difference between Blockcerts and BTCert is that the latter uses digital signatures to ensure the authenticity of the document-issuing institutions. This way, when an electronic academic document is shared with a third party, the digital signature on the document can be verified. Moreover, a hash function is applied to the academic document stored on the blockchain. Thus, the authenticity of the actual document can be verified. Although BTCert is an improvement over Blockcerts, especially on document issuers’ authenticity, the system appears to apply numerous cryptographic methods causing high overhead.

OpenCerts of Nanyang Technological University [27] is another blockchain-based academic document platform that supports both issuing and verifying academic documents. Document verification on the OpenCerts website requires that a digital certificate of the issuer be paired with each academic document.

In addition, Sribastava and Gupta [35] worked on preserving the integrity of electronic health records, a concept similar to document forgery detection. The authors introduced a method that allowed each patient’s information to be recorded in a blockchain-based system. By applying blockchain technology, the system provided an immutable and unforgeable healthcare record. To enhance the security of electronic health records, the authors added encryption and digital signatures to prevent unauthorized access and to obtain authenticity, respectively. Mishra et al. [24] also proposed a blockchain-based file system related to healthcare. The authors suggested that secure access to medical records was necessary. Therefore, a data management system based on blockchain was implemented to ensure data integrity, privacy, and security of the Internet of medical things (IoMT). In this system, apart from the blockchain technology, a hash function was introduced so that the completeness and correctness of data could be examined. This approach applies the same idea as our proposed system in utilizing the capability of hash functions, but our method differs in that blockchain is not required to achieve the objective of checking data integrity. Another study closely related to [24, 35] was proposed by Gowda and Malakreddy [10], who designed a blockchain-based privacy preservation mechanism for fog computing environments. However, this mechanism only focused on the privacy of data, rather than forgery detection or the immutability of the data.

Khan et al. [18] advanced the application of blockchain beyond medical data by proposing a data verification system tailored for closed-circuit television (CCTV) surveillance cameras. The authors proposed a three-part architectural framework comprising the sensor layer, responsible for image capture; the data layer, facilitating decentralized ledger-based data storage; and the application layer, through which data access was facilitated. Authorised users seeking access to camera-captured imagery would engage in a process of image falsification detection, entailing a frame-by-frame comparison between raw camera images and their blockchain-stored counterparts. It is pertinent to note that a hash function is utilized only to verify the legitimacy of a CCTV camera within the network, rather than to check the integrity of video imagery.

Even though blockchain provides many desirable properties for academic document issuing and verification, it usually comes with high overhead, such as the need to apply many cryptographic methods and install a blockchain wallet. Furthermore, to be successfully used by many stakeholders, blockchain requires the development of a massive infrastructure. There is also a need for a more suitable consensus algorithm than the proof-of-work one used by Blockcerts to reduce the computation time. Finally, if a public blockchain is used, which should be the case in the context of academic documents, a method for verifying and authenticating a person, an organization, and an academic institution before being allowed to join the blockchain should be in place for security purposes. Hence, herein, we have proposed another system for academic document forgery detection. The proposed system does not use blockchain technology and only applies a cryptographic method called the cryptographic hash function.

2.2 Principle of cryptographic hash functions

Information security has three major characteristics, namely, confidentiality, integrity, and availability, which make up the CIA model [4]. The first characteristic is confidentiality, which involves keeping information secret and making it available only to authorized entities. The second characteristic is integrity, which is concerned with information correctness and completeness, i.e., information should remain the same as long as it is not changed by authorized parties. The third is availability, i.e., information is accessible to authorized parties when required.

Out of the three security characteristics, integrity is the one that is directly related to our objective: detecting academic document manipulation. One security mechanism that can help detect if and when data have been changed is known as the cryptographic hash function.

A cryptographic hash function, h(), takes an input of any size, and in our case, the input is an academic document (D). The cryptographic hash function is then computed and outputs a hash value of that particular input or h(D). This can be summarized in Eq. 1:

$$\begin{aligned} D \rightarrow h() = h(D) \end{aligned}$$
(1)

Cryptographic hash functions have the following important properties. First, they are one-way functions, i.e., it is relatively simple to compute the h(M) of an input M; however, it is very difficult to turn h(M) back to its original value M.

Collision resistance is the second property that we are interested in. In the context of cryptographic hash functions, a collision occurs when two different pieces of data, x and y, are input to h() and their resultant hash values are exactly the same; i.e., \(h(x) = h(y)\). Because inputs can be of any size while hash values have a fixed size, collisions must exist in principle; collision resistance therefore requires that finding any such pair be computationally infeasible. In practice, two different inputs can be expected to produce different hash values.
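As a rough guide to what collision resistance means quantitatively, the generic birthday bound can be stated as follows. Here n (the hash output length in bits) and q (the approximate number of hash evaluations needed before a collision becomes likely) are symbols introduced only for this note:

$$\begin{aligned} q \approx \sqrt{2^{n}} = 2^{n/2} \end{aligned}$$

For a 512-bit hash value this gives roughly \(2^{256}\) evaluations. The estimate covers only the generic attack; a design weakness in a specific algorithm can reduce the effort considerably, as has been shown for MD5 and SHA-1.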

Because of these properties, cryptographic hash functions can be used to examine the integrity of information or, in this case, academic documents. When a message, M, is generated, its h(M) is also computed and is appended to the original document to form \(M \,\Vert\, h(M)\). When checking the integrity of this piece of information, the same process is followed, i.e., M is hashed to obtain h(M). The resultant h(M) is then compared with the previously computed hash value. If they are equal, it means that the document has not been tampered with. However, if the values are different, it can be assumed that the document has been modified. Hence, the integrity of the document is said to be broken.
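The check described above can be sketched in a few lines of Java. This is an illustration only: the class name `IntegrityCheckSketch` is hypothetical, and SHA-256 is chosen here merely as an example of an unkeyed hash function.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration of hash-based integrity checking.
final class IntegrityCheckSketch {

    // Computes h(M) with SHA-256 (an assumed choice of hash function).
    static byte[] hash(byte[] message) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(message);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Recomputes h(M) and compares it with the previously computed value.
    static boolean isUnmodified(byte[] message, byte[] previousHash) {
        return MessageDigest.isEqual(hash(message), previousHash);
    }
}
```

Because no secret is involved, anyone who alters M can also recompute a matching h(M) if the hash travels with the document, which motivates the keyed variant discussed next.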

There are two categories of cryptographic hash functions: modification detection code (MDC) and message authentication code (MAC). For MDC, the hash value of a message is computed without using any secret keys, i.e., given a message M, its hash value is h(M). Examples of MDC cryptographic hash functions are MD5, SHA-1, and the SHA-2 family. Unlike MDC, MAC needs a secret key, k, in the computation process, i.e., given a message M, its hash value is computed to be \(h_k(M)\). This means that when a MAC cryptographic hash function is used, the creator of the hash value \(h_k(M)\) and the integrity verifier must possess k so that the integrity can be checked, which makes this type of cryptographic hash function more secure than MDC, where anyone can compute the hash value of a message. The two well-known MAC cryptographic hash functions are cipher block chaining message authentication code (CBC-MAC) and hash-based message authentication code (HMAC).
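For reference, the HMAC construction (as specified in RFC 2104) builds the keyed value \(h_k(M)\) from an unkeyed hash function h. The symbols \(k'\), ipad, opad, and B below are not defined elsewhere in this paper and are introduced only for this note:

$$\begin{aligned} h_k(M) = h\big((k' \oplus opad) \,\Vert\, h((k' \oplus ipad) \,\Vert\, M)\big) \end{aligned}$$

Here \(k'\) is the key k padded with zero bytes to the block size B of h (a key longer than B is hashed first), ipad is the byte 0x36 repeated B times, and opad is the byte 0x5C repeated B times. For SHA-512, B is 128 bytes.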

The main aim of a cryptographic hash function is to enable the detection of unauthorized modifications to data. However, existing research works have predominantly focused on the utilization of cryptographic hash functions for user authentication purposes. For instance, Ali and Anwer [1] suggested the hashing of registered user details, including a user ID and password, within an Internet of Things (IoT) network context. Upon user login, the IoT network would generate a hash code corresponding to the user’s details, which was subsequently compared against the stored hash code. A cryptographic hash function was also applied to secure resources within an organization as explained in [28]. Specifically, the authors of [28] proposed using a cryptographic hash function to generate a hash table that stores hash values of data, acting as content identifiers. This way, if there were changes to the data, they would be detected. Furthermore, Jain and Doriya [15] introduced a security framework aimed at facilitating the secure download of healthcare data from cloud platforms. Central to their proposed framework was a data auditing mechanism, necessary for ensuring data integrity. The authors underlined the significance of verifying file integrity by comparing the hash code of downloaded files from the cloud with those stored within the cloud repository. However, the details of this integrity verification process were not presented thoroughly. From existing literature, it is evident that cryptographic hash functions have been applied across various domains, including user authentication, organizational resource security, and file integrity verification, but not to the detection of academic document forgery, which constitutes the primary objective of our research.

Table 1 provides an overview of the methodologies employed in existing research for detecting data falsification. A majority of the studies leveraged blockchain technology to achieve data integrity verification. Investigations conducted between 2020 and 2023 also explored the utilization of machine learning techniques, particularly the CNN algorithm, for detecting data modification. However, only studies [15] and [28] have integrated hash functions into their methodologies, albeit in approaches distinct from what is proposed in our research. Specifically, [15] utilized a hash function to ascertain the completeness and integrity of downloaded files from the cloud, while [28] employed a hash function for the creation of a hash table aimed at content identification.

Table 1 Summary of literature related to data integrity verification

3 Methodology

This section presents the design, implementation, and evaluation of the academic document forgery detection system proposed in this research. It describes how the detection algorithm works, what the typical system architecture is, how the system was implemented, and how the system’s performance was assessed.

3.1 Proposed framework of an academic document forgery detection system

This section describes how the proposed system works. First, we provide the definitions of the symbols used to describe the proposed process. Then, we present the overall framework for detecting academic document forgery.

The symbols used for the academic document forgery detection framework are as follows:

i: Academic document issuers, such as an academic institution

s: Students receiving academic documents

\(ID_s\): Identification data of student s

v: Verifiers, such as an employer or an academic institution

D: An academic document

\(D_s\): Academic document of student s

\(k_i\): Secret key k of issuer i

\(h_{k_i}(D)\): Cryptographic hash function generating a hash value of academic document D computed with secret key \(k_i\)

The proposed system involves three roles: academic document issuers, students, and verifiers. An academic document issuer is an academic institution that creates and issues academic documents. A student is the one who requests and receives an academic document from the academic institution. A verifier is either an employer or another academic institution that receives academic documents as a part of an application process and is responsible for checking the validity of academic documents. The proposed system can be divided into two main phases: the academic document-issuing phase and the academic document-verification phase.

3.1.1 Phase 1: issuing academic document

Before any academic document can be generated, an academic institution, i, must generate its secret key \(k_i\), which will be used for the computation of \(h_{k_i}(D)\) of each academic document. It is assumed that the institution has a way to generate its secret key and a method for storing the key securely. A hardware security module can be used for such purposes [23].

Usually, an academic document is generated and issued by an academic institution after a student submits a request for it. Following is the procedure for issuing an academic document.

  1. Student s submits a request to an academic institution i for an academic document \(D_s\), where \(\text{Request} = ID_s, Name_s, Email_s\). Note that this is a typical process for students to submit their personal information to the educational institution when asking for an academic document, i.e., \(s \rightarrow i: \text{Request for } D_s\).

  2. The academic institution i generates the academic document, \(D_s\), and computes the hash value of \(D_s\) using its secret key \(k_i\) to obtain \(h_{k_i}(D_s)\).

  3. i stores the hash value of the academic document \(h_{k_i}(D_s)\) in a database along with other details, such as the academic document \(D_s\) itself and the student’s ID, name, or email for future reference, i.e., i stores \(D_s\) and \(h_{k_i}(D_s)\) in its database.

  4. i sends \(D_s\) to student s, i.e., \(i \rightarrow s: D_s\).

The issuing procedure of academic documents is depicted in Fig. 1.

Fig. 1 Academic document-issuing procedure

3.1.2 Phase 2: verifying academic document

The verification of an academic document ensures that the document does not contain any changes or forgeries, such as a modification to the holder’s name or grades. Specifically, this is an examination of the integrity of the document. This verification process occurs when student s, who is the holder of an academic document \(D_s\), applies for a job or applies to study at another educational institution. In this case, the employer or the institution is the verifier, v, of the academic document. The verification process consists of the following steps:

  1. When applying for a job or further study, student s sends their academic document \(D_s\) to an employer or another academic institution, which is known as verifier v, i.e., \(s \rightarrow v: PID_s, D_s\), where \(PID_s\) represents the personal information of s, such as the name and email address, required for the job or institutional application.

  2. Having received \(D_s\), v contacts the issuer i and sends \(D_s\), i.e., \(v \rightarrow i: D_s\).

  3. i verifies \(D_s\) by retrieving its secret key \(k_i\) and computing \(h_{k_i}(D_s)\), which is then compared to the hash value stored in the database for this particular academic document. If the newly computed hash value matches the stored hash value, then \(D_s\) is authentic or valid. Otherwise, \(D_s\) has been forged. Note that although it is possible for v to send \(D_s\) to be compared with the \(D_s\) stored at i, the comparison between hash values is more computationally efficient [12].

  4. i transmits the verification result back to v, i.e., \(i \rightarrow v: D_s \text{ valid/invalid}\).

The verification procedure for academic documents is depicted in Fig. 2.

Fig. 2 Academic document-verification procedure
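The two phases can be sketched together in Java as follows. This is a simplified illustration under stated assumptions, not the implemented system: the class `IssuerSketch` is hypothetical, an in-memory set stands in for the issuer's database, the secret key is held in memory rather than in a hardware security module, and stored hash values are looked up by the recomputed value itself because the lookup method is not specified in the procedure above.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Base64;
import java.util.HashSet;
import java.util.Set;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical stand-in for issuer i; not the paper's implementation.
final class IssuerSketch {
    private final byte[] secretKey = new byte[64];             // k_i, kept only by the issuer
    private final Set<String> storedHashes = new HashSet<>();  // stand-in for the database

    IssuerSketch() {
        new SecureRandom().nextBytes(secretKey);
    }

    // Computes h_{k_i}(D) with HMAC-SHA512 and encodes it as Base64 text.
    private String keyedHash(byte[] document) {
        try {
            Mac mac = Mac.getInstance("HmacSHA512");
            mac.init(new SecretKeySpec(secretKey, "HmacSHA512"));
            return Base64.getEncoder().encodeToString(mac.doFinal(document));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Phase 1, steps 2 to 4: hash D_s, store the hash, return D_s for the student.
    byte[] issue(byte[] document) {
        storedHashes.add(keyedHash(document));
        return document;
    }

    // Phase 2, steps 3 and 4: recompute the hash of the received document
    // and report whether it matches a stored value.
    boolean verify(byte[] receivedDocument) {
        return storedHashes.contains(keyedHash(receivedDocument));
    }
}
```

In this sketch the verifier never sees \(k_i\) or the stored hash values; it only submits the document and receives a valid or invalid result, as in steps 2 and 4 of the verification procedure.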

3.2 System architecture and implementation

The proposed detection method for academic document forgery was developed as a web application, which included three types of users to simulate real-world situations. The first is academic institutions or issuers, whose responsibility is to generate and issue academic documents. The second is students, who request academic documents and submit them as a part of job or education applications. The third is employers or other educational institutions that receive academic documents from students and initiate a document-verification process.

A student can request an academic document via the developed application, although this step can also be performed through the institution’s own process. The issuer, having already created its secret key, generates the academic document after receiving a request and computes the corresponding hash value, which is then stored in the database. The student can then retrieve their academic document by downloading it from the application.

The developed system also allows students to apply for a job or further education by submitting their information along with the academic document received from the previous stage. Having received the application and an academic document, the verifier contacts the issuer and transmits the academic document via the application. The issuer computes the hash value of the received document and compares it with the hash value stored in the database. The verification result is then sent back to the verifier.

Figure 3 illustrates the system architecture used for our implementation purpose.

Fig. 3 Architecture of the academic document forgery detection system

The proposed detection system was implemented as a web application in Java. The application was developed on a 64-bit Ubuntu 20.04 long-term support (LTS) system, with Apache Tomcat 9.0.63 as the development server and MySQL 8.0.31 as the database server. We chose this architecture because Ubuntu is a Linux distribution with approximately 16% of the market share for server operating systems, Apache Tomcat holds approximately 14% of the market share for web servers, and MySQL holds around 33% of the market share for database servers [9]. All of these components are also currently used at our institution. The proposed system applies a cryptographic hash function as the main mechanism for detecting changes to the original document. For the implementation, we adopted HMAC-SHA512, where SHA stands for secure hash algorithm and 512 is the size of the hash output in bits [8]. Because it requires a secret key to compute hash values, it is more secure than unkeyed hash functions (the MDC type). To implement this hash function, we used javax.crypto, a Java package containing various cryptographic algorithms. Specifically, the javax.crypto.Mac class was used for HMAC-SHA512. Because HMAC-SHA512 requires a secret key, the javax.crypto.spec.SecretKeySpec class was needed for the secret key generation process. After the implementation had been completed, the application was uploaded to a server for testing and evaluation. The application server had two Intel virtual central processing units, 4 GB of memory, and 25 GB of storage and ran a 64-bit Ubuntu 20.04 (LTS) system.
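The hash computation and comparison described above can be sketched with the standard javax.crypto.Mac and javax.crypto.spec.SecretKeySpec classes. This is a minimal illustration rather than the code of the implemented system: the class name HmacSketch, its method names, the 64-byte key length, and the sample document strings are all assumptions made for the example.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.security.SecureRandom;

// Hypothetical helper class; names are illustrative and not taken from the described system.
final class HmacSketch {
    private static final String ALGORITHM = "HmacSHA512";

    // Generates a random issuer key. The 64-byte length is an assumption.
    static byte[] newKey() {
        byte[] key = new byte[64];
        new SecureRandom().nextBytes(key);
        return key;
    }

    // Computes the HMAC-SHA512 value of the document bytes under the issuer's key.
    static byte[] tag(byte[] key, byte[] document) {
        try {
            Mac mac = Mac.getInstance(ALGORITHM);
            mac.init(new SecretKeySpec(key, ALGORITHM));
            return mac.doFinal(document);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA512 unavailable or key invalid", e);
        }
    }

    // Recomputes the hash value of the received document and compares it
    // with the stored value using a time-constant comparison.
    static boolean verify(byte[] key, byte[] document, byte[] storedTag) {
        return MessageDigest.isEqual(tag(key, document), storedTag);
    }

    public static void main(String[] args) {
        byte[] key = newKey();
        // Placeholder document contents; a real system would hash the file bytes.
        byte[] original = "Transcript: GPA 3.10".getBytes(StandardCharsets.UTF_8);
        byte[] altered = "Transcript: GPA 3.90".getBytes(StandardCharsets.UTF_8);

        byte[] stored = tag(key, original); // value the issuer would store at issuance

        System.out.println("hash length in bytes: " + stored.length);
        System.out.println("original accepted: " + verify(key, original, stored));
        System.out.println("altered accepted:  " + verify(key, altered, stored));
    }
}
```

In this sketch the issuer stores only the hash value, and verification recomputes it from the submitted bytes, so a single changed character in the document produces a mismatch.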

3.3 System evaluation

The evaluation of the proposed academic document forgery detection system consisted of two primary assessments. Firstly, a correctness evaluation was carried out to assess the system’s ability to accurately differentiate between legitimate and illegitimate academic documents. Secondly, the system’s performance in terms of academic document verification time was analysed to ensure the efficiency in real-world usage scenarios.

3.3.1 Correctness evaluation

The main purpose of the proposed academic document forgery detection system is to detect whether or not an academic document has been modified. Therefore, we designed a two-part experiment that required an issuer to generate 15 academic documents, such as transcripts, letters of studentship confirmation, and degree certificates. For the purpose of the experiment, 15 samples of academic documents were obtained from various schools and universities in Thailand. These documents were then sent unmodified to a verifier (an employer or an educational institution) as a part of an application process. The verifier validated these documents by following the process explained in the previous section. The other part of the experiment involved modifying or forging a part or parts of the previously obtained 15 academic documents, which were again sent to the verifier for authenticity verification. Modifications made to academic documents included changes to grades on transcripts and alterations to the name on a letter of studentship and degree certificate. This was done to ensure that the academic document forgery detection system could distinguish between legitimate (unmodified) academic documents and illegitimate (modified or forged) ones.

The proposed system was evaluated using a confusion matrix, which consisted of four elements: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). In the context of detecting forged academic documents, TP means that when the system is provided with an unmodified academic document, the system can accurately confirm that the document is genuine. TN means that when a forged academic document is presented to the system, the system accurately detects that the document has been modified. FP means that when an unmodified academic document is provided to the system, the system identifies the document as forged. Finally, FN means that the system fails to detect a forged academic document.
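The four outcomes defined above can be tallied with a small helper, sketched below. The class ConfusionMatrixSketch and its methods are hypothetical names introduced for illustration, the outcome labels follow the definitions given in this section, and the verdicts recorded in main are hard-coded placeholders that mirror the experiment design (15 unmodified and 15 modified documents), not measured output.

```java
// Hypothetical tally class; outcome labels follow the definitions given in the text.
final class ConfusionMatrixSketch {
    private int tp, tn, fp, fn;

    // actuallyGenuine: ground truth; reportedGenuine: verdict returned by the system.
    void record(boolean actuallyGenuine, boolean reportedGenuine) {
        if (actuallyGenuine && reportedGenuine) {
            tp++; // unmodified document confirmed as genuine
        } else if (!actuallyGenuine && !reportedGenuine) {
            tn++; // forged document detected as modified
        } else if (actuallyGenuine) {
            fp++; // unmodified document identified as forged
        } else {
            fn++; // forged document not detected
        }
    }

    int tp() { return tp; }
    int tn() { return tn; }
    int fp() { return fp; }
    int fn() { return fn; }

    // Fraction of all recorded documents that were classified correctly.
    double accuracy() {
        int total = tp + tn + fp + fn;
        return total == 0 ? 0.0 : (double) (tp + tn) / total;
    }

    public static void main(String[] args) {
        ConfusionMatrixSketch m = new ConfusionMatrixSketch();
        // Placeholder verdicts: every document is assumed to be classified correctly.
        for (int i = 0; i < 15; i++) m.record(true, true);
        for (int i = 0; i < 15; i++) m.record(false, false);
        System.out.println("TP=" + m.tp() + " TN=" + m.tn()
                + " FP=" + m.fp() + " FN=" + m.fn()
                + " accuracy=" + m.accuracy());
    }
}
```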

3.3.2 Performance evaluation

The core operation of the proposed academic document forgery detection system is to generate a hash value of the document and to verify the integrity of the document by checking its hash value.

In this experiment, 15 different academic documents were obtained from various schools and universities, as explained previously. The sizes of the academic documents varied from 128 to 1500 KB, depending on the type of the documents. They included academic transcripts, letters of studentship confirmation, and degree certificates. The sizes of these documents were collected from the actual documents issued to students.

Performance evaluation was conducted by measuring the time taken to compute the hash value of each academic document and the time taken to check its integrity. Note that the cryptographic hash function used in our system is HMAC-SHA512. Therefore, the resultant measurements would only reflect this particular method. However, we believe the results can still indicate how fast the academic document forgery detection system can perform.
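A measurement of this kind could be set up as in the following sketch, which times hash generation and verification with System.nanoTime and reports the mean and standard deviation. Everything here is an assumption made for illustration: the class TimingSketch, the use of random bytes in place of real documents, the three sample sizes, and the choice of the sample (n - 1) standard deviation. Timings from this sketch depend on the machine and on JIT warm-up and are not expected to match the reported figures.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Random;

// Hypothetical timing harness; not the measurement code used in the study.
final class TimingSketch {
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // Sample standard deviation (divides by n - 1); this choice is an assumption.
    static double sampleStdDev(double[] xs) {
        double m = mean(xs);
        double ss = 0.0;
        for (double x : xs) ss += (x - m) * (x - m);
        return Math.sqrt(ss / (xs.length - 1));
    }

    // Returns {generationMs[], verificationMs[]}, one entry per document size.
    static double[][] measure(int[] sizesInKb) {
        try {
            Random rng = new Random(42);
            byte[] key = new byte[64];
            rng.nextBytes(key);
            Mac mac = Mac.getInstance("HmacSHA512");
            mac.init(new SecretKeySpec(key, "HmacSHA512"));

            double[] gen = new double[sizesInKb.length];
            double[] ver = new double[sizesInKb.length];
            for (int i = 0; i < sizesInKb.length; i++) {
                byte[] doc = new byte[sizesInKb[i] * 1024]; // synthetic stand-in document
                rng.nextBytes(doc);

                long t0 = System.nanoTime();
                byte[] stored = mac.doFinal(doc);                               // generation
                long t1 = System.nanoTime();
                boolean ok = MessageDigest.isEqual(mac.doFinal(doc), stored);   // verification
                long t2 = System.nanoTime();

                if (!ok) throw new IllegalStateException("unmodified document failed verification");
                gen[i] = (t1 - t0) / 1e6; // nanoseconds to milliseconds
                ver[i] = (t2 - t1) / 1e6;
            }
            return new double[][] {gen, ver};
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] sizesInKb = {128, 512, 1500}; // illustrative sizes within the reported range
        double[][] t = measure(sizesInKb);
        System.out.printf("generation:   mean %.3f ms, sd %.3f ms%n", mean(t[0]), sampleStdDev(t[0]));
        System.out.printf("verification: mean %.3f ms, sd %.3f ms%n", mean(t[1]), sampleStdDev(t[1]));
    }
}
```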

4 Results and discussion

This section provides a detailed twofold evaluation of the system proposed for detecting academic document forgery. Herein, we assess the correctness of the detection system by applying the well-known confusion matrix and system performance in terms of the time taken to verify the authenticity of academic documents.

4.1 Results of correctness evaluation

The results obtained from the correctness evaluation experiment were divided into two parts. The first is when 15 genuine academic documents were verified by the proposed system (Fig. 4a). The second is when 15 forged academic documents were verified (Fig. 4b) by the same system.

Fig. 4 Evaluation of the accuracy of the academic document forgery detection process

It can be seen from Fig. 4a that when 15 legitimate academic documents were provided, the proposed forgery detection system could tell with 100% accuracy that the presented academic documents were genuine, demonstrating a 100% TP rate. Furthermore, Fig. 4b shows that the proposed document forgery detection system could detect that the 15 documents had been modified and therefore were invalid, thereby presenting a 100% TN rate.

4.2 Results of performance evaluation

The results obtained from the experiment are illustrated in Fig. 5, which shows the average time taken to generate the hash value of an academic document and the time taken to verify whether each document was authentic.

Fig. 5 Time taken for hash value generation and document verification

Figure 5 shows that the average time taken to compute the hash value of an academic document is 0.120 ms with a standard deviation of 0.0169 ms, whereas the average time taken to verify the authenticity of the document is 0.232 ms with a standard deviation of 0.0432 ms.

4.3 Discussion

This study proposes a method for detecting forgeries in academic documents. There are three aspects that we would like to discuss here. First, we compare our method’s forgery detection accuracy with that of a couple of existing studies that apply machine learning methods. Second, we compare the processing time of our method with that of one of the most recent blockchain-based approaches. Finally, the possible real-world adoption and implications of the proposed method are discussed.

Most existing research on document forgery detection, such as [16, 19, 22], involves applying machine learning techniques, especially CNN. This approach has been applied to OCR, handwriting recognition, and copy-move image forgery detection. Table 2 provides a comparison of the forgery detection accuracy between our method and a couple of existing CNN approaches.

Table 2 Document forgery detection accuracy comparison

Our method for detecting document manipulation applies a cryptographic hash function whose purpose is to ensure data integrity. Applying such an algorithm ensures that any modifications or forgeries can be detected, which is why it resulted in 100% forgery detection accuracy. However, this accuracy rate can be achieved only when there is an original document, or more precisely the hash value of the original document, to compare against.

Next, we compare the processing time of our method with that of a couple of existing approaches involving CNN [16] and blockchain technology [13]. We admit that this comparison is by no means extensive; however, we believe it can still give a general idea of how the different technologies compare with each other. Processing time, or transaction time, is the time it takes to prove the authenticity of an academic document. For the CNN approach, it is the time measured from the moment the data (or document) are input to the system (or a machine learning model) to the moment the result is generated. For the blockchain-based approach, the transaction time is the time from the moment the required public key is read from the system (so that the digital signature on a document can be verified) to the moment the document is written or recorded into the blockchain. For our proposed method, the processing time includes the time taken to generate the hash value of an academic document and the time taken to check its integrity. Moreover, we implemented a digital signature-based system for academic document issuance and verification similar to those described in [5, 39]. The system was run in the same environment as the proposed cryptographic hash function-based method. Its processing time, which included the time taken for digital signature generation on academic documents and for signature verification, was then measured and compared with that of our proposed method. Table 3 provides a comparison of the processing time of the four approaches.

Table 3 Document forgery detection processing time comparison

Table 3 demonstrates that the proposed academic document forgery detection system requires a shorter time to process and verify documents than the convolutional neural network-based, blockchain-based, and digital signature-based forgery detection systems. Although the proposed method yielded a forgery detection processing time of 0.352 ms, this result is not necessarily conclusive for all environments, because the processing time depends on several factors. The first is the size of the academic documents. Our experiments were carried out using documents of sizes between 128 and 1500 KB. Smaller documents would be processed faster, whereas larger documents would take longer. Another factor is the system implementation. The proposed system was implemented using Java; if other programming languages were used, the resultant processing time would be different [7]. For example, if C++ were used, the execution speed would be faster. In contrast, if the system were implemented using Python, the processing time would be longer than that of the Java-based system. The machine on which the program runs also has an impact on the processing time: a machine with more memory and processing power would achieve a higher forgery detection speed.

Next, we compare our proposed method with the CNN-based and blockchain-based methods in terms of computational complexity by analysing their main operations. For the CNN-based method [16], the computational complexity is usually evaluated based on the number of operations needed during the training stage. Specifically, it is calculated in terms of the number of parameters m and the number of CNN layers n. Therefore, the computational complexity of the CNN-based methods can be expressed as O(n · m). For the blockchain-based system [13], the authors did not specifically state which consensus mechanism was used in their design. It is, therefore, assumed here that it was implemented on an Ethereum-based blockchain, which applies the proof-of-stake consensus mechanism. In the proof-of-stake mechanism, the creator of a new block is chosen in a randomised fashion. Therefore, the time complexity of this mechanism can be considered constant, which can be expressed as O(1). Finally, our proposed academic forgery detection system applies a cryptographic hash function, namely HMAC, whose computational complexity can be evaluated as follows. HMAC’s complexity is determined by the underlying hash function, such as SHA-256 or SHA-512, whose complexity depends on the length of the input n (here n denotes the input length, not the number of CNN layers). HMAC applies a key to its computation, so the key length also contributes to its complexity, but because the key is processed as a fixed-size block in the HMAC computation, this contribution can be considered constant. Therefore, the overall computational complexity of HMAC can be expressed as O(n). Table 4 provides a comparison of the computational complexity of the three document forgery detection methods.

Table 4 Document forgery detection computational complexity comparison
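The linear bound for HMAC can be read off the standard HMAC construction, sketched here for illustration. The symbols are assumptions introduced for this sketch and are not part of the original description: \(H\) is the underlying hash function (SHA-512), \(B\) is its block size in bytes (128 for SHA-512), \(k'\) is the key \(k\) padded to \(B\) bytes, opad and ipad are the fixed padding constants, and \(n\) is the length of the document \(D\) in bytes.

\[
h_k(D) = H\bigl((k' \oplus \mathrm{opad}) \,\|\, H\bigl((k' \oplus \mathrm{ipad}) \,\|\, D\bigr)\bigr)
\]

The inner hash processes \(B + n\) bytes, whereas the outer hash processes only \(B + 64\) bytes, since a SHA-512 digest is 64 bytes long. Counting invocations of the compression function, with a constant \(c\) covering message padding and the outer hash, gives approximately

\[
T(n) \approx \left\lceil \frac{B + n}{B} \right\rceil + c = O(n).
\]

The key affects only the two fixed-size blocks \(k' \oplus \mathrm{ipad}\) and \(k' \oplus \mathrm{opad}\), which is why its contribution is treated as constant.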

The implication of the proposed system is that each educational institution needs to keep a database of academic documents and corresponding hash values. This also means that the verifiers will have to contact the relevant document issuers to verify the integrity of the academic documents. Even though the system architecture was proposed in this distributed manner so that the databases of academic documents belong to different institutions, it is possible to consolidate them into one central authority, such as the Ministry of Education. Consequently, the verifiers need to contact only one location when checking document integrity.

Overall, it is acknowledged that different technologies serve distinct purposes. Our research specifically focuses on forgery detection for academic documents. We would like to emphasize that the advantage of our work lies in its correctness and efficiency in the context of data integrity verification. The proposed academic forgery detection method is centered around key-based cryptographic hash functions, which offer a direct solution to data integrity verification. Moreover, while CNN-based methods involve feature engineering to enhance predictive capabilities, blockchain technology leverages zero-knowledge proofs, and digital signatures achieve non-repudiation, the proposed method is distinct in that it focuses on data tampering detection by making use of the actual capabilities of cryptographic hash functions.

5 Conclusion

Academic document forgery has been an issue for employers and academic institutions as people want to work at their desired organizations or study at their desired institutions. Cryptographic hash functions can offer a solution to this problem by addressing the issues related to the integrity and authenticity of academic documents.

Therefore, this study proposes a forgery detection system using a cryptographic hash function to detect whether academic documents have been manipulated. In particular, this study provides a solution for determining the genuineness of academic documents. Additionally, our approach can detect changes in the original documents within a short processing time.

Consequently, this study can lead to the adoption and development of a design framework for academic document forgery detection that addresses the issues associated with the accuracy and efficiency of verifying document genuineness and detecting forgery.

Even though we show that the proposed scheme can efficiently detect forged academic documents, it has some shortcomings when compared with other existing research. Because it relies on a cryptographic hash function, our proposed scheme only works when there is an original academic document to compare with, i.e., the hash value of the original document needs to be accessible so that the hash value of the submitted document can be compared against it. This is in contrast with other technologies, such as machine learning and deep learning, which help in detecting forged documents without requiring the original document at the time of detection. Another limitation is the dependency on secure key management for the keyed cryptographic hash function. The proposed system requires that academic document issuers possess a key so that the hash values of the issued academic documents can be computed. It is, therefore, essential that the key is securely managed and stored, which places a burden on the institutions issuing academic documents. The use of cryptographic hash functions could also expose the system to attacks such as collision attacks and birthday attacks, which could in turn undermine the effectiveness of forgery detection. Although these are limitations of our approach, the trade-off is that our proposed method provides a more accurate forgery detection result and requires a shorter processing time than the methods discussed in this study.

In addition to our focus on applying a cryptographic hash function to help detect academic document forgeries, several directions could be explored in future research to enhance the system’s applicability. Firstly, the adoption of a mobile application could facilitate a seamless and convenient academic document verification process. Secondly, the proposed system could be extended to include a wider range of documents. Overall, these directions represent opportunities to enhance the system’s versatility and practical utility in countering document fraud across various domains.