Introduction

The term "visually impaired" is used for people with no vision or non-recoverable low vision. According to the data, 89% of visually impaired people belong to less developed countries, and 55% of them are women [1]. Braille is a language specifically designed for the visually impaired and is composed of combinations of six-dot patterns. Symbols corresponding to sounds in natural language are represented by activating and deactivating these dots [2].

Initially, Braille was read and written on paper or slates with raised dots. The Perkins Brailler, a Braille typewriter, was introduced for visually impaired people in 1951 [3]. With the advent of technology, touch-screen devices have become available with applications such as talkback services, screen readers, navigators, smart magnifiers [4], and obstacle detection and avoidance systems for visually impaired people [5, 6]. These devices allow people to communicate easily anytime, anywhere in the world [7], and have helped improve conversational technology and natural language communication for educational purposes [8]. Even so, most visually impaired people are hesitant to use smartphones because of accessibility and usability issues, such as difficulty in finding the location of keys on the screen. Furthermore, whenever visually impaired people perform a task, they require feedback regarding its outcome [9].

Various studies have been conducted to convert Braille into natural language. Applications such as BeMyEyes [10], BeSpecular [11], TapTapSee [12], KNFBReader [13], Color Teller [14], and text-to-speech synthesizers [15,16,17] have been developed specifically to improve the quality of life of people with visual impairments. Most of these take input from scanned Braille documents using computational tools; that is, the process consists of two steps, writing the document and then scanning it. However, if the input is taken directly from touch-screen devices, the conversion can be performed instantaneously [18]. Currently, only a few studies have investigated touchscreen-based Braille input, and most of these schemes require screen-location-specific input, which is difficult for visually impaired users [19]. Moreover, no studies have collected a Braille character dataset using Android devices.

Traditional machine learning models have been used extensively in numerous research domains. However, these techniques require in-depth domain knowledge and rely on pre-defined features, so extensive manual feature extraction is required. Deep Learning (DL) techniques have become popular because they learn features directly from the data without pre-defined features, and their performance does not depend on handcrafted feature engineering.

This study focuses on the design, implementation, and evaluation of a new touchscreen-based Braille input method. An accessible Braille input method that places little burden on the user is proposed for the visually impaired. Users are only required to tap the dots needed for a specific character, anywhere on the screen. The user's input is predicted using a DL method, which automatically extracts features from the saved images to identify the Braille character and translate it into the corresponding English character. To the best of our knowledge, this is the first study to use DL for Braille-to-natural-language conversion by classifying Braille images collected from touch-screen devices.

The main contributions of this study are as follows:

  • A position-free, accessible touchscreen-based Braille input method for the visually impaired is proposed that places minimal burden on the user.

  • A Grade 1 Braille dataset was collected from visually impaired students at the Special Education School, Manak Payyan, Pakistan, because there is no existing Braille dataset to which DL techniques can be applied.

  • A DL model is proposed to classify Braille images and match them with specific English alphabetic characters.

  • Evaluation of the performance of an actual dataset collected using Android devices from visually impaired people is presented.

  • A comparison with traditional methods, namely Naïve Bayes (NB), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Decision Tree (DT), is made.

The rest of this paper is organized as follows: Sect. 2 briefly covers the background and related work. Section 3 describes the materials and methods, including the new accessible Braille input method and its important features. Section 4 provides the experimental setting for the evaluation, and Sect. 5 presents the results in detail. The conclusions and future directions are given in Sect. 6.

Background and related work

Braille input mechanism

Several input methods have been designed for entering Braille using touch screens, e.g., Type In Braille [20], Edge Braille [21], VBraille [22], Perkinput [23], and Braille Easy [9]. Small screen sizes, difficulty in finding specific locations on the screen [24], the lack of physical keys, and a lack of adequate input are some factors that make the use of these applications impractical for visually challenged people. Accessibility is a major challenge faced by visually impaired people when using these software applications. Although many researchers have focused on making touch screens viable for the visually impaired, it is still a cumbersome process [25,26,27].

Braille dots are entered using a three-row, two-column format in TypeIn Braille [20]. The user enters data by tapping on the screen, but tapping with both fingers simultaneously to enter two consecutive dots is not viable for the visually impaired. In Braille Touch, visually impaired people have to use both hands and multiple fingers to enter characters at fixed positions [23]. A comparative study conducted by Subash specifies four large buttons on the screen, and tapping gestures such as single, double, and triple taps are used to enter Braille dots [28]. Similarly, Braille Easy uses single, double, and triple taps to enter dots, but memorizing the reference points is difficult [9].

From the literature, it is clear that the most important problem faced by the visually impaired when entering data on touch screen smart devices is finding the position of the dots on the screen. This problem was resolved by Shabnam, who designed Eyedroid [29]. Braille dots are entered by activating and deactivating dots anywhere on the screen, and visually impaired people do not need to find the fixed location of the dots. Edge Braille is another position-free Braille entry method that requires a continuous pattern to be drawn along the edges of the screen [21].

Navigation is also an important factor in the usability analysis of Braille input mechanisms. In VBraille, navigation is performed by swiping up and down, as well as from left to right [22]. Difficulty in memorizing the gestures for navigation and switching among various writing modes are the major challenges faced by users of TypeIn Braille [20]. In SingleTap Braille, operations are performed by flipping left, right, up, and down, and learning letters requires the user to press "1" [30]. Improved SingleTap Braille uses different swipe gestures for inserting upper-case letters, lower-case letters, spaces, backspaces, and Grade 2 Braille [31]. Braille Tap introduced four gestures to perform arithmetic operations: swiping from bottom to top for addition and subtraction, from top to bottom for division and multiplication, from left to right to clear, and from right to left for backspace [32]. Braille Enter was introduced to remove the navigation issues of SingleTap Braille; it supports upper- and lower-case letters, numbers, and special characters, and the addition or deletion of spaces is also allowed [27]. Karmal et al. designed an application for blind, deaf, and mute people that takes images, audio, and text through a customized keyboard as input, with feedback provided to the user in the form of voice and text [33, 34].

Some applications provide tactile-based object recognition [35] and audio feedback; however, audio feedback poses privacy issues in some scenarios [36]. Voice feedback is provided to inform the user of a task's completion in [20, 25, 31, 32, 37], while vibro-tactile feedback on task completion is provided in [20] and [21]. An educational application named mBraille was designed for learning English and Bengali [38]. Braille-to-text conversion has also been performed for other languages such as Hindi, Tamil, Urdu, Odia, Arabic, Chinese, Bengali, Telugu, Malayalam, and Kannada [18, 39,40,41,42,43,44,45].

Braille-to-natural language conversion using deep learning

DL techniques provide better results for image and pattern recognition than other machine learning techniques [46,47,48]. A deep learning method for Braille data recognition was used by Li et al. [49]. In their scheme, data were collected in the form of Braille images taken with a digital camera from a Braille book. They designed a Stacked Denoising Auto Encoder (SDAE) to address the feature extraction and dimensionality reduction problems in Braille character recognition, achieving 92% accuracy when the SDAE was used with a Softmax classifier. This encoder works better than traditional methods for automatic feature extraction from images. To improve the quality of scanned Braille images, Vishwanath et al. presented an improved optical Braille recognition system [50].

In a study carried out by Jha and Parvathi, Hindi and Odia text was converted into Braille using scanned documents. The robust SVM machine learning technique was applied with the Histogram of Oriented Gradients (HOG) feature extraction method, achieving accuracies of 94.5% for Hindi [51] and 99% for Odia [52]. The same techniques were applied to convert scanned English and Sinhala documents into Braille, and the results revealed accuracies of 99% and 80%, respectively [53]. Another study by Li et al. used the same classifier with the Haar feature extraction method, which reduced the classification error by 10% when converting handwritten English sheets to Braille [54]. Using the KNN technique, English text entered through gesture-based touch screens was converted into Braille; the distance between two dots was calculated using the Bayesian Touch Distance, and 97.4% accuracy was attained [55]. Ahnaf et al. designed a dynamic Braille recognition system using the YOLO algorithm that identifies an image captured from the camera and converts the information into Braille text. They achieved 87% accuracy for detecting an apple, 86% for a bottle, and 67% for a book [56].

From an extensive review of the current literature, it was found that the most common problems the visually impaired face when using Braille on touch screens are the use of both hands, multi-finger touch, multiple single-entry touches, gesture memorization, deficient feedback, privacy issues, and location-specific data entry.

Therefore, in this study, a position-free text entry method for Braille is presented that reduces the problems found in the methods discussed above. In the proposed method, the user is only required to tap the active dots of a specific character and is freed from entering unnecessary dots. This reduces both the training time and the user's memorization requirements. In earlier work, researchers computed handcrafted features for classical (non-deep) machine learning techniques, whose performance depends on the type and relevance of the features extracted. In this study, a new dataset is generated using the proposed application, and the GoogLeNet deep learning architecture with a transfer learning approach is proposed. The fine-tuned deep learning model does not depend on handcrafted features; its activation functions (ReLU) and its convolution and pooling layers support optimization and error minimization. For further analysis, classical machine learning techniques, namely NB, DT, SVM, and KNN, are also applied.

Materials and methods

This section describes the DL model used for text character prediction. The following subsections illustrate the use of the proposed Braille input interface and the character input method. Then, the DL-based classification model that predicts a user's input character is discussed in detail. The proposed framework for predicting text characters from Braille input using the DL model is shown in Fig. 1.

Fig. 1
figure 1

Framework for text character prediction from Braille input using the DL model

Dataset collection

Presently, there is no existing digital dataset for Braille. For dataset collection, a new touch-screen interface for Android devices was developed, and the Braille dataset was collected using touch-screen devices at the Special Education School, Manak Payyan, Pakistan. The average age of the participants in the study was 19 years. English letters (a–z) were collected as Braille input from 24 students who were partially or fully visually impaired. Each Braille input was saved as a 64 × 64 pixel image because this size retains the important features of the image and reduces the computation time [57]. Any input image larger than this size was resized and saved accordingly, as shown in Fig. 1. Inconsistencies in the data were removed manually by matching the patterns of the Braille dots on paper, and unmatched characters were removed from the dataset. The final dataset contains 1284 images arranged in alphabetical order, as shown in Fig. 2. Training was performed using 858 images of the Braille dataset, and validation was performed using 390 images. Choosing an appropriate hyperparameter combination, such as the optimizer, batch size, learning rate, and loss function, is one of the most challenging tasks when using a CNN; if these hyperparameters are not set properly, the model can overfit [58]. Therefore, we used a batch size of 32 with 26 classes and 50 epochs, and the risk of overfitting was further reduced by adding a dropout layer.
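As a rough illustration of this setup, the following Python sketch shows how a folder of 64 × 64 grayscale Braille images might be loaded and split with Keras. The directory name "braille_dataset" (one subfolder per letter) and the 30% validation split are assumptions for illustration only; the paper reports 858 training and 390 validation images.

```python
# Minimal sketch, assuming images are stored as "braille_dataset/a/*.png",
# "braille_dataset/b/*.png", and so on (hypothetical layout).
import tensorflow as tf

IMG_SIZE = (64, 64)   # image size used in the paper
BATCH_SIZE = 32       # batch size reported in the paper

train_ds = tf.keras.utils.image_dataset_from_directory(
    "braille_dataset",
    validation_split=0.3, subset="training", seed=42,
    color_mode="grayscale", image_size=IMG_SIZE, batch_size=BATCH_SIZE)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "braille_dataset",
    validation_split=0.3, subset="validation", seed=42,
    color_mode="grayscale", image_size=IMG_SIZE, batch_size=BATCH_SIZE)
```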

Fig. 2
figure 2

Sample Braille input taken from a participant using the proposed interface and stored in image format in the dataset

Braille text entry: interaction of visually impaired users with the proposed Braille input design

A new application was designed keeping in mind the limitations of the earlier applications. This input method improves on previously designed techniques for the following reasons: Braille text entry places a minimal burden on visually impaired users; there are no fixed positions for entering data, so the user can freely enter dots anywhere on the screen; users only have to enter the active dots, with no need to enter deactivated or otherwise unnecessary dots; no difficult gestures such as double or triple taps are required; and vibro-tactile feedback is given. The touch-screen interface for Braille input developed using the proposed method is depicted in Figs. 3 and 4.

Fig. 3
figure 3

Proposed Braille interface

Fig. 4
figure 4

A participant using the proposed Braille interfaces

Input algorithm designed for Braille image extraction

The proposed input method is illustrated in Fig. 5. The user inputs the Braille dots of a specific character by tapping on the screen. A right-swipe gesture saves the character, and a left swipe clears the screen if the user makes a mistake. At the current stage of development, the user has to enter data character by character. This mechanism is extremely easy for visually impaired users to memorize and use. Braille character recognition from the input image is performed using a DL model trained on the collected Braille dataset. Thus, there is little burden on visually impaired users when using this application.
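To make the saving step concrete, the sketch below shows, in Python rather than the actual Android code, how recorded tap coordinates for one character could be rasterized into a 64 × 64 grayscale image once the user swipes right. The dot radius, screen dimensions, and example coordinates are illustrative assumptions, not values taken from the application.

```python
# Illustrative sketch of turning tap coordinates into a 64 x 64 dataset image.
from PIL import Image, ImageDraw

def taps_to_image(taps, screen_w, screen_h, size=64, radius=3):
    """Scale raw (x, y) tap positions to a size x size canvas and draw dots."""
    img = Image.new("L", (size, size), color=0)   # black background
    draw = ImageDraw.Draw(img)
    for x, y in taps:
        cx, cy = x / screen_w * size, y / screen_h * size
        draw.ellipse([cx - radius, cy - radius, cx + radius, cy + radius], fill=255)
    return img

# Example: three taps approximating the active dots of one character
sample = taps_to_image([(200, 300), (200, 700), (600, 300)], 1080, 1920)
sample.save("sample_character.png")
```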

Fig. 5
figure 5

Algorithm for Braille image extraction from the proposed input interface

Classification techniques

Different DL methods can be used for analyzing the Braille dataset. A CNN is the most suitable method for working with image data because it can be composed of many (even several hundred) layers that learn image features hierarchically [59].

In this research, two CNN techniques, a sequential model and transfer learning, were trained using the collected dataset.

Sequential method

In this method, no pre-trained model is available, so the model needs to be trained from scratch. This approach is more challenging than using a pre-trained model, but it can provide better accuracy. The sequential model consists of the following layers:

Convolution layer

This layer applies a linear operation to its input using kernels (filters). The depth of each kernel matches that of the input, while its other dimensions can differ. The kernels are activation filters whose weights are learned from the training image data; they are convolved with the input image to generate the output feature maps, one per kernel. A special dimensional parameter called the stride can also be used to downsample the image.
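As a small illustration of such a layer, the hedged Keras sketch below applies a bank of learned kernels to a 64 × 64 single-channel input; the filter count, kernel size, and stride are illustrative choices rather than values from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

# A convolution layer: 32 learned 3 x 3 kernels; a stride of 2 downsamples the map.
conv = layers.Conv2D(filters=32, kernel_size=(3, 3), strides=(2, 2), padding="same")

x = tf.random.normal((1, 64, 64, 1))   # one 64 x 64 grayscale image
y = conv(x)
print(y.shape)                         # (1, 32, 32, 32): spatial size halved by the stride
```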

Non-linear Activation Layer

Non-linear activation functions are the most commonly used in recent neural networks. This layer allows the model to map the network input to the output, which is required for learning from the dataset. The most commonly used activation functions are the sigmoid, hyperbolic tangent, Softmax, rectified linear unit (ReLU), Leaky ReLU, and SoftPlus. This layer introduces non-linearity into the network. When applied to an image (as with ReLU), negative values are replaced by zero and positive values are retained.

  • Sigmoid function: This is a non-linear activation function; it prevents jumps in output values. Output values lie in the range (0, 1). It can be calculated using Eq. 1.

    Where "h" is the sigmoid function and "i" is input data. E is the mathematical constant, and its value is equal to 2.718 approx.

  • Hyperbolic tangent: This function can easily model neutral, extremely positive, or highly negative input values. Its output values lie in the range (−1, 1). It measures non-linear activations using the function given in Eq. 2.

    Where "f" is the hyperbolic tangent function, "b" is the input data and "e" is the exponential constant.

  • Softmax: This function normally works at the output layer, where the input is converted into multiple classes. It finalizes the output using a probability distribution, as shown in Eq. 3.

    Where sigma (\(\sigma\)) is the Softmax function, "S" is the input, "j" indexes the inputs, and e is the exponential constant.

  • ReLU: This function is the most widely used non-linear function. It replaces all negative values with zero in a pixel feature map, which increases the non-linear properties of the model and enables quick convergence of the network. Neural networks are then able to learn more complex functions, using Eq. 4.

Where "f" is the ReLu activation function and "a" is the input value. If this value is less than 0, It will place a 0, and if this value is greater than 0, it places value that lies in a.

$$h\left(i\right)=\sigma \left(i\right)=\frac{1}{1+{e}^{-i}}$$
(1)
$$f\left(b\right)=\mathrm{tanh}\left(b\right)=\frac{2}{1+{e}^{-2b}}-1$$
(2)
$$\sigma {\left(S\right)}_{j}=\frac{{e}^{{S}_{j}}}{{\sum }_{i}{e}^{{S}_{i}}}$$
(3)
$$f\left(a\right)=max(0,a)$$
(4)
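For reference, the activation functions of Eqs. 1–4 can be transcribed directly into NumPy as a short sketch; the input is any array of pre-activation values, and the stability shift inside the softmax is an implementation detail that leaves the result of Eq. 3 unchanged.

```python
import numpy as np

def sigmoid(i):
    return 1.0 / (1.0 + np.exp(-i))               # Eq. (1)

def hyperbolic_tangent(b):
    return 2.0 / (1.0 + np.exp(-2.0 * b)) - 1.0   # Eq. (2), equivalent to np.tanh(b)

def softmax(s):
    e = np.exp(s - np.max(s))                     # shifted for numerical stability
    return e / e.sum()                            # Eq. (3)

def relu(a):
    return np.maximum(0.0, a)                     # Eq. (4)
```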

Pooling layer

This layer also uses a non-linear activation function, and it is used for downsampling. The pooling layer reduces the dimensionality of each feature map by extracting the highest values that contain the most important features. This layer reduces the computational complexity of the network.

Fully connected layer

The fully connected layer is the final output layer. Every node in this layer is connected to every node in the previous layer, as in an ordinary neural network. After processing in this layer, a set of output scores is generated, and the top three outputs are selected using a probability distribution function such as Softmax, or an SVM. The fully connected layer is parameter-heavy, and it can make a network compute bound.

$$\delta {(y)}_{j}=\frac{{e}^{{y}_{j}}}{\sum_{m=1}^{M}{e}^{{y}_{m}}}$$
(5)

where delta \((\updelta )\) denotes the fully connected layer (Softmax) function, "y" is the input data, "j" indexes the inputs, M is the number of inputs, and e is the exponential constant.

Transfer learning

Transfer learning is used to transfer base knowledge to a relevant target task [60]. It improves accuracy by applying previously gained information to a new task. After transfer learning was introduced for CNNs, considerable improvements in accuracy were shown for small samples of images [61, 62]. Transfer learning can improve learning in three different ways: (1) by transferring already extracted knowledge without any further learning, (2) by reducing the time required to train the model for a task from scratch using the transferred knowledge, and (3) by improving the final performance level [63]. The GoogLeNet transfer learning architecture is used in this study.

GoogLeNet (Inception model)

GoogLeNet is a CNN-based architecture that won the ImageNet Large Scale Visual Recognition Challenge in 2014 [64], achieving a substantial improvement over ZFNet, the winner in 2013 [65]. In another study, Wei et al. proposed a CNN-based 3D object classification model and achieved an accuracy of up to 93.3% [66]. The network architecture of GoogLeNet differs from other models in that convolution layers with 1 × 1 kernels and ReLU activation are used in the middle of the model; the major advantage of these 1 × 1 convolutions is a reduction in computational complexity. Instead of fully connected layers, global average pooling is used, and convolution kernels of different sizes are applied to the input with all outputs concatenated in an Inception module [67]. In the proposed technique, along with the position-free text entry method, we reduce the tapping time of the user: the input character is recognized by the application using DL, which shifts the burden from the user to the application.
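A hedged sketch of this transfer-learning setup is given below. Keras ships InceptionV3 rather than the original GoogLeNet (Inception v1) used in the paper, so InceptionV3 is used here as a stand-in; the 96 × 96 RGB input size is an assumption, chosen because the pretrained Inception weights require inputs larger than the 64 × 64 Braille images, and the dropout rate is illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Pretrained Inception backbone reused as a frozen feature extractor.
base = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, input_shape=(96, 96, 3))
base.trainable = False

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),        # GoogLeNet-style global pooling
    layers.Dropout(0.5),                    # dropout to reduce overfitting
    layers.Dense(26, activation="softmax"), # one output per English letter
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```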

Naïve Bayes algorithm

NB is a popular classification technique based on Bayes' theorem for calculating probabilities [68]. This simple technique is well suited to multi-class problems. The algorithm assumes that the presence of one feature does not affect the presence of any other feature.

$$P\left(c|x\right)=\frac{P(x|c)P(c)}{P(x)}$$
(6)

The posterior probability is calculated using:

  • Where P(c|x) is the posterior probability of the class,

  • P(x|c) is the likelihood of the predictor given the class,

  • P(c) is the prior probability of the class,

  • P(x) is the prior probability of the predictor.

The following steps are needed to perform classification using NB: first, the dataset is converted into a frequency table; next, a likelihood table is created by finding the probabilities; finally, the NB equation is used to calculate the posterior probability of each class.

Support vector machines

SVM is a generalized classifier that can be used in different domains such as character recognition and image recognition. This technique works well even when only a small sample of data is available for training [68]. The kernel trick is used to handle non-linearly separable data [69]: a non-linear mapping transforms the input space into a higher-dimensional feature space. Polynomial, Gaussian, and radial basis functions are the most popular kernels.

The decision function for an input "i" is given below:

$$D\left(i\right)=V.i-x$$
(7)

where "V" is the vector to the normal plane and "i" is the displacement relative to the origin.

Decision Trees

DT is another well-known machine learning model. It has a tree-like structure and is easy to understand, even for non-expert users. The data are examined, and common attributes are extracted across all the classes [68]. These attributes are further divided into branches until the required criteria are met.

Mathematically it can be represented as follows:

$$\left(S,Z\right)=\left({S}_{1},{S}_{2},{S}_{3},{S}_{4},\dots ,{S}_{n},Z\right)$$
(8)

where "S" is a vector that is composed of different features used for classification and "Z" is the target variable.

K Nearest Neighbors

KNN is a machine-learning algorithm that can be used for both classification and regression problems. The algorithm calculates the distance between the query object and each training object, and the nearest K training objects are delineated [70]. Finally, the object is assigned to the category that is most common among these K neighbors.
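For comparison with these classical baselines, the sketch below trains NB, SVM, DT, and KNN on flattened pixel vectors using scikit-learn. The random placeholder arrays merely stand in for the flattened 64 × 64 Braille images and their labels, and raw pixels are used here instead of the paper's traditional feature extraction; both are simplifying assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Placeholder data standing in for the flattened Braille images (26 classes).
rng = np.random.default_rng(0)
X_train, y_train = rng.random((858, 64 * 64)), rng.integers(0, 26, 858)
X_val, y_val = rng.random((390, 64 * 64)), rng.integers(0, 26, 390)

classifiers = {
    "NB": GaussianNB(),
    "SVM": SVC(kernel="rbf"),
    "DT": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_val, clf.predict(X_val)))
```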

Performance measures

Prediction assessment using a confusion matrix

A confusion matrix is a way of representing the performance of a classification algorithm with deep insight into the number of observations of each class. It shows the standard output for binary or multi-class classification problems [71] by plotting instances of the actual output classes versus the predicted classes. The performance metrics derived from the confusion matrix and used to analyze these classification techniques are the True Positive Rate (TPR), True Negative Rate (TNR), Positive Predictive Value (PPV), Negative Predictive Value (NPV), False Positive Rate (FPR), False Negative Rate (FNR), False Discovery Rate (FDR), and Total Accuracy (TA), as shown in Table 1. TPR measures the proportion of actual positive samples that are correctly classified, and TNR measures the proportion of actual negative samples that are correctly identified. PPV shows the proportion of positive predictions that are actually true, while NPV measures how likely it is that a negative prediction is actually negative. FPR is how often the model reports a false result as true. Accuracy is the proportion of True Positives (TP) and True Negatives (TN) among all samples. These models are also compared with the NB, KNN, SVM, and DT algorithms.

Table 1 Comparison of DL and classical machine learning techniques
$$Recall/Sensitivity (TPR)=\frac{TP}{TP+FN}$$
(9)
$$Specificity (TNR)=\frac{TN}{TN+FP}$$
(10)
$$Positive Predictive Value\left(PPV\right)=\frac{TP}{TP+FP}$$
(11)
$$Negative Predictive Value\left(NPV \right)=\frac{TN}{TN+FN}$$
(12)
$$False Positive Rate(FPR)=1-Specificity$$
(13)
$$Total Accuracy (TA)=\frac{TP+TN}{TN+FP+TP+FN}$$
(14)

Other well-known performance measures for multi-class classification problems are the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC). ROC analysis is one of the most appropriate techniques for evaluating the performance of a classification model: it measures performance across different threshold settings [72] and indicates how well a model can distinguish between classes. ROC-based accuracy is more informative than total accuracy because total accuracy is measured only at a specific cut point, whereas the ROC is measured over all cut points.
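As a sketch of this evaluation, assuming y_true holds the validation labels, y_pred the predicted labels, and y_score the predicted class probabilities, the per-class rates of Eqs. 9–14 and the one-vs-rest ROC/AUC values can be computed as follows:

```python
from sklearn.metrics import confusion_matrix, roc_curve, auc
from sklearn.preprocessing import label_binarize

def per_class_rates(y_true, y_pred, labels):
    """TPR, TNR, PPV, NPV, FPR, and TA per class, following Eqs. 9-14."""
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    for k, label in enumerate(labels):
        tp = cm[k, k]
        fn = cm[k, :].sum() - tp
        fp = cm[:, k].sum() - tp
        tn = cm.sum() - tp - fn - fp
        tpr = tp / (tp + fn)                        # Eq. 9: recall / sensitivity
        tnr = tn / (tn + fp)                        # Eq. 10: specificity
        ppv = tp / (tp + fp) if (tp + fp) else 0.0  # Eq. 11
        npv = tn / (tn + fn) if (tn + fn) else 0.0  # Eq. 12
        fpr = 1.0 - tnr                             # Eq. 13
        ta = (tp + tn) / cm.sum()                   # Eq. 14
        print(label, tpr, tnr, ppv, npv, fpr, ta)

def per_class_auc(y_true, y_score, labels):
    """One-vs-rest ROC AUC per class, as plotted in Figs. 8-10."""
    y_bin = label_binarize(y_true, classes=labels)
    for k, label in enumerate(labels):
        fpr, tpr, _ = roc_curve(y_bin[:, k], y_score[:, k])
        print(label, auc(fpr, tpr))
```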

Experimental setup

In the proposed model, the user inputs Braille dots using the touch screen, as shown in Fig. 4. The input is saved as an image, which is then processed to map the Braille dots to the corresponding Braille character. The algorithm for predicting the English character is shown in Fig. 6.

Fig. 6
figure 6

Algorithm for English character prediction

A sequential model, which consists of a linear stack of layers, was built, trained, and validated using the Braille dataset. Each layer is treated as an object, and variables are initialized for training and validation of the model, with each layer feeding its output to the next. An input image of 64 × 64 pixels is passed to the convolution layer. This layer works like a "flashlight" that illuminates portions of the image: the "flashlight" is the filter, and the region it illuminates is the receptive field. A filter is an array of numbers, and these numbers are the weights of that layer, which act as a feature identifier. As the filters are convolved with the input, their values are multiplied element-wise by the input image pixels, and the products from each region are summed over all parts of the image. The CNN learns the values of its filters during training. The resulting feature information is passed to the next layer, the activation layer, which applies a non-linear activation function. The pooling layer then reduces the dimensionality of each feature map while retaining the most important features, reducing the computational complexity of the network.

These three layers are grouped together, and the group is then repeated nine times. The features learned by the first layers are low-level; the output feature map of each layer is the input to the next, whose filters learn progressively more abstract features. The dropout layer drops a random set of activations by setting them to zero as data flow through it. A loss function is used to measure the difference between the target and the predicted output; to minimize this loss, the derivative of the loss with respect to the weights in each layer is calculated.

$$\sigma {^{\prime}}_{j}= \frac{\partial J}{\partial {s}_{j}^{i}}$$
(15)

where J is the loss and \({s}_{j}^{i}\) denotes the weighted input of neuron j in layer i. The differentiation is performed starting from the last layer to compute the directions in which the network must be updated. The loss is propagated backward through each layer, and the value of each filter is updated in the direction of the gradient to minimize the loss. The learning process is initiated using the compile method. Adam optimization can be used to maximize accuracy in a smaller number of epochs, although the training time increases with the number of images [57, 73]; it individually calculates the learning rate from the first and second moments of the gradients of the network weights. Finally, the fit function is used for training and validation. The input and output neurons of the DNN architecture are shown in Fig. 7.
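A hedged Keras sketch of this training pipeline is shown below. Only three convolution/pooling groups are included because nine pooling stages would exhaust a 64 × 64 input; the filter counts and dense-layer size are illustrative, and train_ds/val_ds are assumed to be datasets like those in the loading sketch in the dataset collection section.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(64, 64, 1)),
    layers.Rescaling(1.0 / 255),                          # scale pixels to [0, 1]
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                                  # dropout against overfitting
    layers.Dense(26, activation="softmax"),               # one output per English letter
])
model.compile(optimizer="adam",                           # Adam optimizer, as described above
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
history = model.fit(train_ds, validation_data=val_ds, epochs=50)
```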

Fig. 7
figure 7

Graphical representation of Input and Output neurons of DNN

Results and discussion

The classification performance was evaluated using the DL techniques, namely the sequential model and the GoogLeNet Inception model. For a comparative analysis, classical machine learning techniques, namely NB, DT, SVM, and KNN, were also applied, with features extracted using traditional feature extraction methods. The performance was evaluated using the following metrics: TPR, TNR, PPV, TA, FPR, FNR, FDR, and AUC.

In this study, we analyzed the classification of the English Braille alphabet. Figures 8 and 9 show the results as ROC curves, plotted with the TPR on the Y-axis and the FPR on the X-axis. The AUC represents a portion of a unit square, so its value lies between 0 and 1; an AUC greater than 0.5 indicates separation of the classes [74]. Figure 8 shows the performance of the sequential model. For Category A (a–m), the highest separation is achieved for all classes. Similarly, for Category B (n–z), the highest separation is obtained for all classes except class "o" (AUC = 0.98) and class "x" (AUC = 0.97). The total accuracy achieved by the sequential model is 92.21%, and the model has micro- and macro-average ROC separations of 100%, as shown in Fig. 10a.

Fig. 8
figure 8

ROC for the sequential model for a classes a–m and b classes n–z

Fig. 9
figure 9

ROC for GoogLeNet for a classes a–m and b classes n–z

Fig. 10
figure 10

Macro and micro ROCs a for the sequential model and b GoogLeNet

Figure 9 shows the performance of the GoogLeNet Inception model. The results show that the highest separation is achieved for all classes in Category A (a–m). For Category B (n–z), the highest separation is achieved except for class "o" (AUC = 0.98) and class "x" (AUC = 0.99). The total accuracy achieved by the GoogLeNet Inception model is 95.8%, and it also achieves micro- and macro-average ROC separations of 100%, as shown in Fig. 10b.

Figure 11 shows a comparison of the various machine-learning techniques. The results obtained from the Naïve Bayes classifier show that the highest accuracy was achieved for the character "a", with TA = 100%, TPR = 100%, and TNR = 97%. The Decision Tree classifier achieved its highest accuracy for "v", with TA = 100% and TPR and TNR = 100%, followed by "w", which achieved TA = 92% with TPR and TNR = 100%. KNN showed TA = 100% with TPR and TNR = 100% for "a", "b", "f", "q", and "w", followed by "a", "c", and "d" with TA = 99% and TPR and TNR = 100%. Moreover, SVM also showed TA = 100% and TPR and TNR = 100% for "a", "e", "i", and "r", followed by "c" and "l", which achieved TA = 99% and TPR = 100%.

Fig. 11
figure 11

Comparison with classical machine learning techniques

We also analyzed our results using the confusion matrix for the GoogLeNet Inception model. In a confusion matrix, correctly predicted classes appear on the diagonal entries of an n × n matrix. Figure 12 shows the confusion matrix of the 26-class Braille-to-English-alphabet conversion problem using the GoogLeNet architecture. There is a possibility of confusion among several classes: in this case, class "l" is misclassified as class "b", class "w" is predicted as class "d", "j" is predicted as "t", and "o" is predicted as "x".
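A short sketch of producing such a matrix with scikit-learn is given below; y_true and y_pred are assumed to be the validation letter labels and the model's predicted letters.

```python
import string
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

letters = list(string.ascii_lowercase)                 # the 26 class labels a-z
cm = confusion_matrix(y_true, y_pred, labels=letters)  # 26 x 26 matrix, diagonal = correct
ConfusionMatrixDisplay(cm, display_labels=letters).plot(xticks_rotation="vertical")
plt.show()
```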

Fig. 12
figure 12

Confusion Matrix representing the classification performance of GoogLeNet

We also compared the results obtained from the deep learning techniques with those of the classical machine learning techniques; Table 1 shows this comparison.

All comparisons are made using the same performance measures: TPR, TNR, PPV, NPV, TA, FPR, FNR, and FDR. Among all the classifiers, the performance of the GoogLeNet Inception model was the best, achieving TA = 95.8% along with TPR = 95.89%, TNR = 99.83%, and an FDR of only 3.39%. The second-highest performance was achieved by the sequential model, with TA = 92.21%, TPR = 90.76%, TNR = 99.63%, and FDR = 8.90%. The third-highest performance was achieved by SVM, with TA = 83%, TPR = 68.67%, TNR = 99.16%, and FDR = 23.25%. The remaining three classifiers, NB, DT, and KNN, obtained higher TA values of 96.38%, 97.20%, and 97.04%, respectively, but with no TPR; therefore, these results cannot be considered better classification. Thus, TA alone cannot indicate the efficiency of a classifier: in addition to a high TA, an efficient classifier must have high values of TPR, TNR, PPV, and NPV, and small values of FPR, FNR, and FDR. Therefore, GoogLeNet shows better performance than the other techniques in terms of TPR, TNR, PPV, NPV, TA, FPR, and FNR, with a reduced FDR of 3.39%.

Conclusions and future work

Deep learning (DL) has been widely used for pattern recognition; to our knowledge, this is the first study to use it for Braille-to-natural-language conversion. In this research, we designed an innovative touchscreen-based Braille input method that provides maximum ease of use for visually impaired people. A Braille dataset was then collected using a purpose-built Android-based smartphone application. In this study, only level 1 Braille data for the English alphabet were used, and the English characters corresponding to the Braille input were recognized using DL techniques. The performance evaluation indicates that the GoogLeNet Inception model has the highest accuracy of 95.8%, with 95.89% sensitivity (TPR) and 99.83% specificity (TNR), and a reduced false discovery rate of 3.39%.

In the future, we plan to include an error-correction feature so that the method can be more beneficial to the visually impaired community. DL techniques will also be used for accurate classification of level 2 Braille datasets for English word recognition. Currently, the application only supports English alphabet characters, but it could be extended to support other regional languages such as Urdu and Pothohari. These techniques will also be applied to Braille-to-Urdu conversion and to numerical characters for level 1 and level 2 datasets. Other DL techniques can be applied to improve the accuracy and efficiency of the designed interface, and several other related educational games could also be developed using this position-free input interface.