1 Introduction

Melanoma is a serious form of skin cancer caused by the abnormal growth of melanocytes (i.e., the cells that give the skin its tan or brown colour). When these cells grow out of control, cancer develops and can spread to other parts of the body. Melanoma can be the deadliest of all skin cancers; however, it is treatable if diagnosed early. Exposure to ultraviolet (UV) light, primarily from the sun, is responsible for most melanomas (approximately 86%). One can also develop the disease without sun exposure, since melanoma can be genetically inherited. Melanoma also affects men and women differently: under the age of 49, it is more frequent in women than in men, whereas after the age of 50 it is more prevalent in men, and men are more likely than women to die of skin cancer. This may be because, in men, melanoma frequently occurs on parts of the body that are more difficult to monitor.

Melanoma is not a single disease; three types are common. Nodular melanoma can be pink, black, or red in colour and may appear as a raised bump. Lentigo maligna melanoma is usually found in older people; it looks like a flat, light-brown smear that expands and darkens over time. Acral lentiginous melanoma appears as dark bands or patches on the palms of the hands, the soles of the feet, and under the nails. Machine learning algorithms have made it possible to diagnose skin cancer from images. In this research, we aim to develop a mobile application using Artificial Intelligence (AI), particularly Machine Learning (ML), to track skin spots. The application is supported by an ML model that classifies skin images as malignant or benign.

The novelty of this study is mainly related to the proposed system architecture. The system's integration of various modules, including the mobile application, web server, database, classification system, and cloud storage, provides a seamless and efficient flow of data and allows quick and accurate diagnostic results for users. The adoption of specific technologies, such as Flutter for the client application, NodeJS for the server, MongoDB for the database, and Flask for deploying the machine learning model, demonstrates a comprehensive and contemporary approach to building the system. While some preliminary studies using CNN algorithms have been carried out in the literature, they have not considered the state-of-the-art technologies used in this research. Furthermore, the web service acting as a bridge between the mobile application, the MongoDB database, and Google Cloud Storage buckets demonstrates our system's ability to connect and interact with multiple data sources, making it a versatile and robust solution. Finally, while other methods of diagnosing melanoma exist, our research stands out for introducing a user-friendly and accessible solution through a mobile application.

The contributions of this research are three-fold:

  1. A user-friendly mobile application that diagnoses skin spots with confidence rates was developed.

  2. A machine learning model to classify melanoma images as malignant or benign was built.

  3. A web service connecting the mobile application with the machine learning model was implemented.

2 Related works

Research confirms that regular access to health care is an effective strategy for preventing poor health outcomes. Biomedical engineers state that early diagnosis can help save more than 40% of the lives that would otherwise be lost [1]. According to industry research, the Google Play Store and Apple iTunes host over 165,000 health-related smartphone applications (a.k.a., apps) [2]. Nowadays, medical apps are developed using artificial intelligence and, particularly, machine learning techniques to aid in diagnosis. Researchers can build algorithms and models to identify everyday objects, such as cats or dogs; in medicine, however, we rarely have ground-truth values [3]. The best way to gain confidence in such models is to put them through thorough validation and testing stages.

Melanoma classification using machine learning, deep learning, and neural networks has received a lot of attention in the past few years. This is because AI has progressed dramatically: deep learning-based algorithms have been developed that can automatically extract features and learn from them. Gupta et al. [4] cited many articles related to CNNs and compared them in terms of the classification technique used and the reported accuracy. The best accuracy achieved was 98.7%, using a CNN and a dataset of clinical images.

TensorFlow and Keras are helpful software libraries for developing powerful deep learning models. As for the dataset, it is typically divided into several sets (i.e., training, validation, and testing). Benbrahim et al. [5] split their dataset into 80% training, 10% validation, and 10% testing sets. Their CNN architecture used a convolutional layer followed by a max-pooling layer, with this pattern repeated four times. They then added a flattening layer and a dropout layer, and ended the model with two dense layers. Their proposed model achieved 93.93% accuracy on the testing set.
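As an illustration, a minimal Keras sketch of this repeated convolution/max-pooling pattern is given below. The filter counts, kernel sizes, dropout rate, input shape, and number of output classes are assumptions, as the full architecture from [5] is not reproduced here.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Hypothetical sketch of the pattern described in [5]: four
# (convolution -> max-pooling) blocks, then flatten, dropout,
# and two dense layers. Widths and sizes are assumptions.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(256, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dropout(0.5),                              # assumed dropout rate
    layers.Dense(128, activation="relu"),
    layers.Dense(7, activation="softmax"),            # assumed number of classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```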

Another study that uses a CNN architecture is by Namozov and Cho [6]. Their dataset contained 10,015 images and was taken from ISIC 2018. The CNN used for classification had a total of nine layers: four convolutional and two pooling layers for feature extraction, followed by three fully connected layers for classification. The distinguishing feature of this CNN is the piecewise linear activation function used in the convolutional layers, which enhances the performance of the network. The first experiment, based on the ReLU function, resulted in 93.25% accuracy; the tangent function yielded 91.76%. Lastly, the piecewise activation function returned the best result of 95.86%. With more training data, the CNN using the piecewise function reached 98% accuracy after 65 epochs.
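The excerpt above does not give the exact activation used in [6]. One common learnable piecewise linear formulation, f(x) = max(0, x) + Σ_s a_s · max(0, −x + b_s), can be sketched as a custom Keras layer as follows; the number of segments and the initializers are assumptions.

```python
import tensorflow as tf

class PiecewiseLinearUnit(tf.keras.layers.Layer):
    """A learnable piecewise linear activation (a sketch only; the
    exact formulation used in [6] is not specified in this paper)."""

    def __init__(self, segments=2, **kwargs):
        super().__init__(**kwargs)
        self.segments = segments

    def build(self, input_shape):
        # One learnable (slope, hinge) pair per segment, shared by all units.
        self.a = self.add_weight(name="slopes", shape=(self.segments,),
                                 initializer="zeros", trainable=True)
        self.b = self.add_weight(name="hinges", shape=(self.segments,),
                                 initializer="random_normal", trainable=True)

    def call(self, x):
        out = tf.nn.relu(x)
        for s in range(self.segments):
            out = out + self.a[s] * tf.nn.relu(-x + self.b[s])
        return out
```

Such a layer would replace the fixed ReLU after each convolution, letting the network learn the shape of its own activation.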

Another study that used the same dataset (ISIC 2018) was done by Guha et al. [7]. The first step is image preprocessing, which focuses on enhancing image quality: the images are processed with a median filter to remove noise, and then Otsu's thresholding technique is applied to the grey-level image as part of feature extraction. The final step is classification, where a CNN was applied. The CNN used two max-pooling layers and two fully connected layers. The first layer used 32 convolution filters of size 3x3, producing 32 feature maps of 396 x 644; the second layer contained 64 convolution filters, producing 64 feature maps of 394 x 642. Two convolution layers were applied to the input images, and the max-pooling size was set to 2x2. The third layer utilized 64 convolution channels of size 195 x 319, with a stride of 2x2 and a kernel size of 2x2 in the max-pooling layer. Dropout was linked with SoftMax, and the model was trained for 40 epochs. The model achieved an accuracy of 79.42%.
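A minimal OpenCV sketch of this preprocessing pipeline (median filtering followed by Otsu thresholding) is shown below; the kernel size and the file name are assumptions.

```python
import cv2

# Sketch of the preprocessing described in [7]: median filtering to
# remove noise, then Otsu thresholding on the grey-level image.
image = cv2.imread("lesion.jpg")                       # hypothetical input file
denoised = cv2.medianBlur(image, 5)                    # assumed 5x5 kernel
grey = cv2.cvtColor(denoised, cv2.COLOR_BGR2GRAY)
# Otsu's method selects the threshold automatically (the 0 is ignored).
_, mask = cv2.threshold(grey, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
```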

A dense deconvolutional network (DDN) based on residual learning for skin lesion segmentation can help address the wide variation in skin lesions [8]. Dense deconvolutional layers (DDLs), chained residual pooling (CRP), and hierarchical supervision (HS) are all parts of the proposed network. DDLs allow the proposed DDNs to reuse learned features from previous layers, forming tight linkages between all feature maps. Another study [9] uses a convolutional-deconvolutional neural network (CDNN) for tumour segmentation. The dataset, used for both training and validation, is from ISBI 2017. All images were resized to 192 x 256 pixels using bilinear interpolation. Additionally, the HSV colour space was used instead of RGB for image augmentation. The kernels applied in this CDNN are of size 3x3 in both the convolutional and deconvolutional layers. Lastly, batch normalization is applied, with a batch size of 18. The CDNN contained 29 layers with 5,042,589 trainable parameters. The experiment provided a testing accuracy of 93%.

ResNet is a residual network architecture widely used for feature extraction and classification. A study [10] proposes a deep learning framework for end-to-end skin lesion analysis. The proposed framework deals with multiple tasks simultaneously and uses ResNet50 (50 layers) and ResNet101 (101 layers) to extract melanoma features. It also uses a loss function designed to reduce the class-imbalance problem in the skin lesion data and utilizes a region proposal network to remove background information. It provided the best results using segmentation with different hyperparameters, ImageNet initialization, and a batch size of 2x2. During the testing stage, ResNet50 achieved 97% accuracy, while ResNet101 achieved 96%. In another study, ResNet 50, 40, 25, 10, and 7 models were compared to find the best accuracy. According to the researchers [11], the best model in that study was ResNet50 trained for 10 epochs with data augmentation.

Gonzalez-Diaz [12] used both a CNN and ResNet-50 for training, along with DermaKNet, a CAD system for the automatic diagnosis of skin lesions. The dataset used in this study is ISBI, which contains 2,750 dermoscopic images with varying resolutions. The dataset was split into three parts: training (2,000 images), validation (150 images), and testing (600 images). All images were then resized to 256 x 256 pixels. The Area Under the ROC Curve (AUC) was 87.4% for ResNet-50 and 95% for the CNN-based model.

Google also developed a classification model called MobileNetV2, which is based on the CNN architecture and specifically designed for mobile and embedded applications. Akay et al. [13] stated that MobileNetV2 showed better results than the traditional CNN algorithm; the suggested network achieved 94.8% accuracy on the testing dataset.
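For illustration, a typical MobileNetV2 transfer-learning setup in Keras is sketched below; the frozen backbone, the classifier head, and the binary output are assumptions, as the exact configuration from [13] is not given here.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# A common MobileNetV2 transfer-learning setup for binary skin-lesion
# classification (a sketch, not the configuration from [13]).
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False                       # freeze the pretrained backbone

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),   # benign vs. malignant
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```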

Regions of interest (ROIs) can be extracted using other methods. Ashraf et al. [14] identify and differentiate melanoma from nevus lesions using a CNN-based approach. The system uses an enhanced K-Means algorithm to extract ROIs from images and is trained and tested on DermIS and DermQuest images. This framework achieved its best accuracy with data augmentation and the ROI-based approach: 97.9% on the DermIS dataset and 97.4% on the DermQuest dataset.

Deep convolutional neural networks can be deployed on mobile phones using a web service, TFLite, or TensorFlow Mobile. Emuoyibofarhe et al. [15] compared TensorFlow Mobile and TFLite to find which one is better suited to mobile deployment. The results indicate that TFLite is a good solution for running machine learning models on mobile and embedded devices; however, it does not support all TensorFlow operations. TensorFlow Mobile, on the other hand, supports more features. In this research, our mobile app uses web service technology.

Table 1 compares the different research studies discussed above. In a recent review article [19], researchers presented a review of state-of-the-art machine learning techniques used to detect skin cancer. From this review, we observed that although various algorithmic-level solutions exist, there is a lack of a comprehensive and contemporary approach that explains how to build a classification system using state-of-the-practice software tools. In this article, we present how to design and implement such a system using current tools and technologies.

Table 1 Comparison of related works

3 Methodology

In this section, we introduce the proposed melanoma detection system. Section 3.1 provides the system architecture. Section 3.2 describes the tools and technologies used. In Section 3.3, the proposed neural network-based model is explained. In Section 3.4, the web service is discussed, and in Section 3.5, the implementation details of the mobile application are presented.

3.1 System architecture

The proposed system consists of the following five modules: the mobile application, the web server, the database, the classification system hosted on a server, and cloud storage. Patients interact with the mobile application to view their logs and profiles. When a patient adds a new entry to their log list or retrieves old logs, the mobile application communicates with the web server, which in turn performs CRUD operations on the database and sends data back to the mobile application. When a patient captures a new photo for diagnosis purposes, the mobile application communicates directly with the web server, which returns the diagnosis result. Google Cloud Storage is used to save the log photos once they are captured. The application was developed in accordance with the microservice architecture [16]. The order of communication between the different parts of the system is shown in Fig. 1.

Fig. 1 System architecture diagram

3.2 Tools and technologies

3.2.1 Dataset

The dataset had a total of 53,081 images, which were split into training and validation sets: 80% of the images were used for training and the rest for validation. The images are 224x224 pixels and were collected from multiple sources on Kaggle [17] and SensioAI [18]. For testing, an external dataset containing 4,100 images was used.
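As an illustration, an 80/20 training/validation split can be reproduced with Keras utilities as sketched below, assuming the images are organized in class sub-folders under a hypothetical "data/" directory (the actual layout used in this work is not specified here).

```python
import tensorflow as tf

# Sketch of the 80/20 training/validation split described above.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/", validation_split=0.2, subset="training", seed=42,
    image_size=(224, 224), batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "data/", validation_split=0.2, subset="validation", seed=42,
    image_size=(224, 224), batch_size=32)
```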

3.2.2 Tools

Jupyter Notebook was used to train the models, and a GPU environment was created. Flutter was used to build the client application, NodeJS was utilized for building the application server, and MongoDB was used for the database.

3.3 The proposed model

The input of our neural network is of size 224x224x3. Our first layer is a Conv2D layer, which acts as the feature extractor. It is followed by a max-pooling layer, which down-samples the extracted feature maps. After that, we used dropout, a technique that randomly deactivates a fraction of the neurons to prevent overfitting; the dropout rate was 0.25. We then repeated this pattern, so the model had in total two convolutional layers, two max-pooling layers, and two dropout layers. Next, we flattened the output into a one-dimensional array and created a fully connected layer. The output layer determines whether the mole is benign or malignant, as represented in Fig. 2. We started with 32 filters in the first convolutional layer, then 64 in the second, and the fully connected layer was composed of 128 neurons.

We used the normal initialization approach, which samples the weights from a normal distribution with a small standard deviation. As for the activation function, we used ReLU, and for the optimizer, Adam was preferred.
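Putting the two paragraphs above together, a minimal Keras sketch of the proposed model follows. The 3x3 kernel size, the 2x2 pooling size, the exact initializer deviation, and the single sigmoid output unit are assumptions not stated in the text; the filter counts, dropout rate, activation, and optimizer follow the description.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Normal initialization with a small standard deviation (value assumed).
init = tf.keras.initializers.RandomNormal(stddev=0.05)

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", kernel_initializer=init,
                  input_shape=(224, 224, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),
    layers.Conv2D(64, (3, 3), activation="relu", kernel_initializer=init),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(128, activation="relu", kernel_initializer=init),
    layers.Dense(1, activation="sigmoid"),   # 0 = benign, 1 = malignant
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```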

Fig. 2 Architecture of the CNN model

3.4 Web service

In order to deploy the machine learning model, we used the Flask web framework and saved our model as a file with the '.h5' extension. We used the POST method to receive the image from the gallery. Then, we decoded the image and opened it by wrapping it in io.BytesIO, because at that point it is not yet an actual file. We pre-processed the received image and converted it to RGB, with a target size of 224 x 224. After that, we turned the image into an array and expanded that array by inserting a new axis at position 0. Finally, the model classified the image, where 0 represents the benign case and 1 the malignant case, and the result was returned as a JSON message to the front end.
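The following is a minimal Flask sketch of these steps. The '/predict' route, the multipart form field name 'image', the 1/255 rescaling, and the 0.5 decision threshold are assumptions not stated in the text.

```python
import io

import numpy as np
from flask import Flask, jsonify, request
from PIL import Image
from tensorflow.keras.models import load_model

app = Flask(__name__)
model = load_model("model.h5")   # the trained model saved with the .h5 extension

@app.route("/predict", methods=["POST"])   # route name is an assumption
def predict():
    # Read the uploaded bytes, wrap them in io.BytesIO, and convert to RGB;
    # the form field name "image" is hypothetical.
    data = request.files["image"].read()
    image = Image.open(io.BytesIO(data)).convert("RGB")
    image = image.resize((224, 224))
    # Turn the image into an array and insert a new batch axis at position 0.
    array = np.expand_dims(np.asarray(image) / 255.0, axis=0)
    score = float(model.predict(array)[0][0])
    # 0 represents the benign case and 1 the malignant case.
    label = "malignant" if score >= 0.5 else "benign"
    return jsonify({"label": label, "confidence": score})
```

A client can then POST an image to this endpoint and parse the returned JSON message.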

3.5 Mobile application

3.5.1 Client app

The Flutter framework was used to develop the app, and the Consumer-Provider approach was used to manage the application state. A models directory keeps track of the class schemas, containing a separate file for each schema and its functions. For navigation and routing between pages, the named-routes method is used: a routes directory keeps the list of route names, and a custom router class is responsible for managing navigation using the given route names. The screens directory contains a separate file for each screen in the application, with its layout and respective functions. The services folder contains services with methods responsible for communicating with the diagnosis web service or the mobile server. The HTTP protocol is used to communicate with the web services, with the help of a library called Dio, which makes sending and receiving HTTP requests easier than using the plain HTTP package.

3.5.2 Server

A web service was created to act as a bridge between the Flutter application and both the MongoDB database and the Google Cloud storage buckets. The responsibilities of the web service can be summarized into the following two main categories:

  • Connecting the application with the MongoDB database and performing all CRUD operations on it. This covers all operations available on the LogList screen of the application (i.e., adding entries, deleting entries, editing folders, retrieving the log list, and saving all mole images taken by the user so that they remain accessible even if the user uninstalls the application).

  • Managing the Login and SignUp operations. This includes hashing user passwords at registration, storing and handling user information in the database, and authenticating registered users.

To create this web service, NodeJS was used, specifically the ExpressJS framework. The Passport library was used for the authentication process (i.e., user login and sign-up). Bcrypt was used for hashing user passwords when creating new accounts and for verifying them during authentication. Data was exchanged with the database in JSON format.

Fig. 3 Confusion matrix before data augmentation

Fig. 4 Metrics after data augmentation

Fig. 5 Confusion matrix after data augmentation

Fig. 6 ROC curve after data augmentation

Full documentation for the methods and classes of the server was generated using the JSDoc markup language. A library called Swagger was used to create interactive documentation for the HTTP requests, documenting all headers, queries, and parameters needed in the different requests performed across the application's life cycle. The server code is divided into three main directories (or modules, in ExpressJS terminology). The methods directory is where all actions performed on the user's profile or log are defined. Actions are essentially ExpressJS middleware functions; their job is to perform CRUD operations on the database using Mongoose as a mediator, and they use JSON for data transfer and for returning feedback on whether a given request has succeeded. The models directory contains the schema definitions used in the MongoDB database. There are two main schemas: user, which stores all information in a user's profile, and log list, which stores all information related to the user's moles and mole images. The routes directory is where the HTTP requests are defined, organized, and documented for Swagger.

3.5.3 Database

MongoDB was used, hosted on the MongoDB Atlas cloud service (free tier). We chose MongoDB because it is more flexible and easier to use than relational SQL databases.

4 Results

4.1 Without data augmentation

We first trained the model on the original images, using 29,454 images for training and 7,364 images for validation. Both training and validation accuracy reached 99%. However, when we tested the model on 660 unseen images, we obtained only 61% accuracy, as shown in Fig. 3, which indicated an over-fitting problem.

4.2 With data augmentation

To address the over-fitting problem, we first tried to reduce the complexity of the CNN architecture, lowering the learning rate and applying different regularization methods. Since none of these solved the problem, we turned to data augmentation. As a result, our training set grew to 42,464 images and our validation set to 10,617. The training and validation accuracy was 99% (Fig. 4). We then tested the model on 4,100 unseen images and obtained 84% accuracy, as shown in Fig. 5, which means the over-fitting problem was reduced significantly. A precision of 72%, a recall of 40%, an F1-score of 51%, and a ROC-AUC score of 64% were reached. The ROC curve is shown in Fig. 6.
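A sketch of the augmentation step using Keras' ImageDataGenerator follows; the specific transformations and their ranges are assumptions, as the text does not enumerate them, and the "data/train" directory layout is hypothetical.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Hypothetical augmentation pipeline; transformations and ranges assumed.
augmenter = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=20,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True,
    vertical_flip=True,
)
train_gen = augmenter.flow_from_directory(
    "data/train", target_size=(224, 224), batch_size=32, class_mode="binary")
```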

4.3 Application

The outcome of the mobile application is shown in the following figures. Figure 7 shows the login and main screens of the application. The user can register a new account from the registration screen or log into an existing account from the login screen. After that, they can diagnose their moles and save them in their log list, shown in Fig. 8b, where all the diagnosis information is kept as a list of folders containing multiple entries. On the camera screen (Fig. 8a), the user takes a photo by positioning the mole inside the focus circle. The photo is then cropped and resized so it can be processed by the ML model. Once the user confirms the photo, it is sent for diagnosis. When the diagnosis is ready, the diagnosis report screen appears with information about the classification and the confidence rates. Figure 9 shows sample screens for these classification results. The results are automatically saved to the log list, where the user can view them again later if needed. The current APK size of the application is approximately 66 MB; the actual size of the app bundle is yet to be determined if we decide to release the app on Google Play in the future.

Fig. 7 Login and main screens of the application

Fig. 8 Camera and log list screens

Fig. 9 Sample screens for classification results

5 Conclusion

Finding a very large medical dataset that is precise and correctly formatted is very difficult. Most of the melanoma datasets on Kaggle contained only 2% malignant images, while the rest were benign. Therefore, we combined two datasets to address this issue. We also encountered an overfitting problem, which we solved by training on augmented images with more CPU/GPU resources. As a result, the training accuracy increased from 84% to 99% and the testing accuracy from 60.6% to 84.3%. As for the mobile application, we connected it with our model successfully and were able to receive real-time diagnoses. Eventually, we created an APK to test the application on Android devices.

As we share data with a web service in the proposed architecture, there are several security implications that need to be considered to protect user privacy. While the main focus of this paper is not security, we followed the best practices related to each software framework utilized. Transmitting sensitive user data over the Internet may expose it to eavesdropping by malicious actors, and there is also a risk of data tampering during transmission. Furthermore, the web service may become a target of Denial of Service (DoS) attacks, and insecure communication channels may lead to man-in-the-middle attacks. As such, we followed a holistic approach involving secure coding practices, ongoing monitoring, and regular security assessments. For instance, the Passport library was used for the authentication process, and Bcrypt was used for hashing user passwords when creating new accounts and for verifying them during authentication. We ensured that Node.js and Passport are configured securely, following best practices, and we validated and sanitized all user input to prevent common vulnerabilities such as injection attacks. We also keep Node.js, Passport, and other dependencies up to date with the latest security patches. As security is a continuous process, we must remain proactive in addressing emerging threats.

In the future, we aim to create a much larger dataset and include new models for benchmarking. We also want to focus on interpretable machine learning (a.k.a., explainable artificial intelligence). Another research direction is digital twins, which have been applied successfully in many different domains [20,21,22]. A digital twin can continuously update its representation based on changes in the user's skin over time; the application could then compare new images of the user's skin with the digital twin to monitor any evolving moles. Finally, different object detection algorithms surveyed in a recent review article [23] can be investigated, and novel models can be built.