1 Introduction

Visual impairment refers to the loss of visual acuity that cannot be ameliorated by refractive correction or medical technologies (Rahman et al. 2021). Without sound vision, it is difficult for visually impaired people, especially those of educational age, to build a full perception of the world, which ultimately affects their living abilities, self-esteem, and mental health. According to the World Health Organization (Organization et al. 2019), as of 2010 there were 285 million visually impaired people in the world, among whom 37 million were blind. Based on the estimation of Ackland et al. (2017), the number of blind people will reach 55 million by 2030 and 115 million by 2050. Sadly, except for a minority of visual impairments caused by senile eye diseases such as cataracts and glaucoma, the vast majority are congenital, which means that these people have never seen the world they live in. Due to medical limitations, there is no easy cure for most visual impairments. Hence, helping visually impaired people integrate into society has become an urgent problem. Although visually impaired people who are educated in special education schools can acquire basic living abilities, not all of them have such opportunities and conditions. There has long been a shortage of professional educators for special education positions. In addition, most special education schools in poor areas lack the budget to hire professional special education workers, which has led not only to a lack of assistance for visually impaired people, but also to more potential discrimination, abuse, and indecent treatment against them (Warren 1994).

With the rapid development of artificial intelligence and advanced sensing technology, the use of deep learning to assist visually impaired people is booming. In recent years, there has been a substantial body of work using deep learning to help visually impaired people perceive the environment and avoid obstacles (Poggi and Mattoccia 2016; Liu et al. 2021; Kumar et al. 2019). However, to our knowledge, research on deep learning based educational assistance systems targeted at visually impaired people is insufficient. In the field of human–computer interaction, Metatla et al. (2020) used a co-design approach to design and evaluate a robot-based educational game that could be inclusive of both visually impaired and sighted children, but it cannot continuously educate visually impaired people to develop their ability to perceive the world. Ahmetovic et al. (2020) provide a vision-based deep learning approach to assist visually impaired people in recognizing daily objects. However, because the classification algorithm relies on vision alone, it is difficult for visually impaired people to get perceptual feedback and a sense of participation in the whole process. To bridge the gap in this area, we propose AviPer, a system that aims to assist visually impaired people in perceiving the world with visual and tactile multimodal object classification.

Fig. 1 System overview. The user sits at the table and grasps the object to be identified with his hand wearing the tactile glove. The glove collects the tactile time series and transmits the data to the computer through the USB interfaces of two Arduino Mega microcontrollers. At the same time, the webcam arranged on the desktop transmits the video of the hand area to the computer. After data preprocessing, the tactile data and visual data are sent into the trained multi-modal attention-based classification model. Finally, the model gives the classification label and broadcasts it to the user through a Bluetooth speaker

Our objective is to develop continuous, immersive, and educational assistance for people who are visually impaired. More specifically, we expect to create a paradigm in which visually impaired people can safely and continuously learn to identify everyday objects without being supervised or taught, while gaining as much experience as possible of the objects they are learning to recognize. These demands lead us to multimodal deep learning. Multimodal learning, or multi-view learning, refers to building models that can process and relate information from multiple modalities (Baltrušaitis et al. 2018). We integrate the recognition of tactile signals with visual recognition, in order to provide the model with more classification evidence and to give users a real sense of the grasping operation. In the evaluation experiments, we show that visual and tactile modal fusion is both necessary and beneficial, as the multimodal model achieves robust classification in extreme cases that are hard to distinguish with only one modality. Besides, we innovatively embed three different attention mechanisms, namely temporal, spatial, and channel-wise attention, to better extract important information from the whole process of grasping objects for classification. Figure 1 shows the whole process of the system.

To unleash the power of the visual and tactile multimodal attention network in the application scenario of assisting visually impaired people, we have to address a series of challenges:

  1. Accurate tactile data acquisition. In previous studies, deep learning tasks based on tactile sensors are often deployed on robotic arms (Romano et al. 2011; Yuan et al. 2015; Morrison et al. 2018; Li et al. 2019). Such sensors can hardly meet the softness and flexibility requirements of wearable devices. Sundaram et al. (2019) develop an advanced sensor integration that can be fixed on a glove for wearing, but a wearable device with such high sensor density is not suitable for visual impairment assistance: the sensor itself incurs high production costs, and the complicated circuit requires considerable maintenance.

  2. Heterogeneous data. Tactile and visual signals naturally differ hugely in scale. More specifically, vision can perceive the full view of items, while touch only reflects the local characteristics of the contact points. In a machine learning model, this gap in scale is embodied in the completely different data dimensions of the tactile signal and the video frames. Furthermore, different approaches are required to extract the important information contained in tactile time series and video frames. Therefore, the model structure requires careful design to extract features from heterogeneous data.

  3. Privacy security. Because a webcam is needed to obtain video frame data and because of the particularity of the people served, data privacy and other security issues also need to be considered.

To tackle the above challenges, we fabricated a tactile glove with a much lower sensor density and used our self-developed capacitance-based force sensors to collect accurate tactile data. The flexible sensors directly reflect the degree of bending of the joints, helping the model infer the gesture of the hand. To handle heterogeneous data, we creatively designed a dual-modal attention network. The model uses different modules to process tactile and visual input and conducts modality fusion on the extracted feature vectors. In the modules that respectively process tactile and visual input, we implement three kinds of attention mechanism, namely temporal, spatial, and channel-wise attention, to focus on key features of the different modalities. To protect the users' privacy, we strictly limit the collected video to the grasping action in the hand area, which means that nothing of the user except the hand wearing the tactile glove is exposed to the webcam. In addition, the collection of data for training and testing the model is completely done by people with unimpaired vision, and the rights of visually impaired people are fully respected. The hardware, including the tactile glove and camera, the visual-tactile bimodal dataset, and the multimodal deep learning classification algorithm together constitute our assistance system for visually impaired people: AviPer. Detailed information about the system will be discussed in Sect. 3.

We summarize the contributions of this paper as follows:

  • We take the lead in addressing continuous immersive assistance for visually impaired people to perceive the world, and are the first to put forward the idea of applying multimodal deep learning to this application scenario.

  • We propose a complete system including hardware, data, and algorithms to put the above idea into practice. We design a flexible tactile sensor glove tailored to the needs of the use scenario. We propose a multimodal attention model that achieves robust classification with high accuracy, and we construct a visual-tactile bimodal dataset to train and evaluate our system. For the proposed system, we conduct a variety of evaluation experiments, including tests of the model in extreme situations and an evaluation of the system in real use scenarios.

  • We open-source all the code and datasets in the hope that researchers can freely use them to promote the development of the field of visual impairment assistance, which will accelerate the practical application of research in this field and bring about a change in the lives of visually impaired people.

The remainder of the paper is organized as follows. Section 2 surveys the related problems and methodologies. Section 3 presents the design and implementation of the AviPer system in detail. Section 4 shows the extensive experiments we conduct to evaluate the system, including model evaluation and real-world tests. In Sect. 5, we discuss the insights, achievable optimizations, and prospects of our system. Then we conclude in Sect. 6.

2 Related work

The problem and methodologies presented in the paper are highly related to the following three research areas: visually impaired assistance, multimodal learning, and attention mechanism.

2.1 Assistance for visually impaired people

Assistance for people with impairments has always been a hot topic in the fields of human–computer interaction and pervasive computing. Many works have focused on visual assistance, such as Aladren et al. (2014), Praveen and Paily (2013) and Papadopoulos and Goudiras (2005) in navigation and reading accessibility. Deep learning has broad application prospects for impairment assistance, especially assistance for visually impaired people. Poggi and Mattoccia (2016) propose a wearable mobility aid for the visually impaired based on 3D computer vision and machine learning, which achieves effective and real-time obstacle detection. Tapu et al. (2017) develop a system called DEEP-SEE which realizes joint object detection, tracking, and recognition for visual impairment assistance. Liu et al. (2021) and Delahoz and Labrador (2017) apply deep learning to floor detection. There are also some smartphone-based approaches such as Lin et al. (2017) and the app Seeing AI by Microsoft. However, the above-mentioned existing research, along with Wang et al. (2017), Lakde and Prasad (2015) and Ganz et al. (2014), mainly focuses on navigation and obstacle avoidance. There are other applications such as facial recognition for visually impaired people (Neto et al. 2016), facilitating search tasks (Zhao et al. 2016), and password managers (Barbosa et al. 2016). But research on continuous immersive motivational aids for visually impaired people, especially those of educational age, is insufficient. We hope that our proposed system AviPer, which aims to assist visually impaired people in understanding the world, can fill this gap.

2.2 Multimodal learning

Human perception of the world is multimodal. We see objects, hear sounds, smell odors, feel texture, and taste flavors. Modality refers to the way in which something happens or is experienced (Baltrušaitis et al. 2018). Unlike machine learning models with a single data source, multimodal machine learning aims to build models that can process and relate information from multiple modalities, which has great potential to provide a stronger understanding ability for the model. Multimodal learning has a wide range of applications. The earliest examples of multimodal learning include audio-visual speech recognition (Yuhas et al. 1989) and multimedia content indexing and retrieval (Snoek and Worring 2005). In the early 2000s, multimodal learning began to be applied to human activity detection (Smith et al. 2005; Yin et al. 2008), as it is inherently very suitable for handling multimodal human behavior. Now, multimodal learning has been widely used in tasks that require a complex perception of the surrounding environment like self-driving (Xiao et al. 2020; Cui et al. 2019) and health monitoring (Banos et al. 2015; De et al. 2015).

The main challenge in multimodal learning is to choose the optimal fusion structure. Deep architectures offer the flexibility of implementing multimodal fusion as early, intermediate, or late fusion (Ramachandram and Taylor 2017). In early fusion, also known as data-level fusion, the varying sampling rates of different sensors and the huge dimensional differences of heterogeneous data are tricky to handle. The common approach to alleviating the challenges of raw data fusion is to extract high-level representations from each modality before fusion, which can be handcrafted features or learned representations, as widely used in works like Wu et al. (2016), Karpathy et al. (2014) and Simonyan and Zisserman (2014). Hence intermediate fusion is also known as feature-level fusion. Late fusion, or decision-level fusion, represents a paradigm for fusing the results of network branches handling different modalities. The advantage of this method is that it is feature independent, which means the error caused by each modality is uncorrelated. Neural network architectures with intermediate fusion need careful design. Recently, the use of automatic machine learning to adjust intermediate fusion network architectures has become a hot trend (Ramachandram et al. 2017; Li et al. 2017).

The fusion of tactile and visual perception has also been developed for decades. As early as around 2000, neurologists studied the coordination of vision and touch in human perception (Zangaladze et al. 1999; Ernst and Banks 2002). Björkman et al. (2013) and Luo et al. (2015) introduce low-resolution tactile sensing to assist visual tasks. Kroemer et al. (2011), Güler et al. (2014) and Gao et al. (2016) respectively propose haptic-visual multimodal deep learning models for specific tasks. The main difference between our work and the above studies is that they focus on robot perception and manipulation, while our system uses flexible wearable tactile sensors to increase the user's participation in assistance tasks for visually impaired people, as well as to boost performance.

2.3 Attention mechanism

The attention mechanism is a data processing approach in machine learning. Since it was first proposed by Bahdanau et al. (2014), it has been extensively used in natural language processing (Hu 2019), computer vision (Sun et al. 2020), and various other machine learning tasks. The main idea of the attention mechanism comes from the way humans perceive things: it is expected to put more attention on key features that deserve more concern. At the implementation level, the basic approach is to use a mask to reweight the data, in order to endow the regions of concern with higher weights. Attention mechanisms can be classified as soft attention, hard attention, and self-attention. Soft attention is differentiable, while hard attention is not; the training process of the latter is usually completed through reinforcement learning. Self-attention is a special form of attention that focuses on the intrinsic correlation of different elements in the data source, whose representative architectures are the Transformer (Vaswani et al. 2017) and its variants.

Among soft attention mechanisms for computer vision tasks, Hu et al. (2018) propose the Squeeze-and-Excitation Network, which carries out channel-wise attention. For spatial attention, Wang et al. (2017) employ the idea of residuals to develop the Residual Attention Network, which learns attention-aware features from different modules that change adaptively as layers go deeper. Woo et al. (2018) then develop a channel-spatial integrated attention mechanism. Wang et al. (2018) and its improvement Cao et al. (2019) propose non-local attention for CNN architectures, which provides a way to capture long-range dependencies. For deep learning architectures handling time series data, there are many studies on attention in the time domain, like Qin et al. (2017) and Liang et al. (2018), and in the frequency domain, like Lee et al. (2020). As for research highly relevant to our work, Cao et al. (2020) embed a spatial attention mechanism in tactile sensing for texture recognition. But studies on attention mechanisms in the fusion of tactile and visual modalities are lacking. Besides, attention-based research specifically for visual impairment assistance is rather insufficient.

3 System design and implementation

In this section, we introduce the components of our proposed system in detail. The system consists of three parts: hardware, data, and the multimodal attention model. To achieve accurate classification of objects, sensors that collect tactile data and a webcam that records visual data are required in the first place. This hardware, along with the microcontrollers for data transmission, the GPU for training, and the speaker for broadcasting, is discussed in Sect. 3.1. For training and testing the classification model, we constructed a bimodal dataset and preprocessed the data; the specific collection strategy and preprocessing pipeline, including data augmentation, are given in Sect. 3.2. Then, we introduce the architecture and settings of the multimodal attention network used for classification in detail in Sect. 3.3.

3.1 Hardware

3.1.1 Tactile glove

To achieve high-precision object recognition while providing users with as much participation and realism as possible, we integrated our self-developed capacitance-based tactile sensors into a wearable glove. By investigating real human grasping processes, we inferred that most grasping gestures can be deduced from the degree of bending of the fingers and the force on the fingertips and palm. Hence there is no need to use high-density sensors as in Sundaram et al. (2019). This not only allows our glove to be manufactured at a relatively low cost, but also sharply reduces the wiring difficulties and maintenance costs, which is more in line with the actual use requirements.

Fig. 2 The tactile glove. a Production and integration: MWCNTs (multi-walled carbon nanotubes) powder and PDMS (polydimethylsiloxane) are used to make an active layer film with a regular macroscopic shape. The measurement surfaces of the thin film structure are placed and packaged face to face and \(\mathrm {Au}\) nanoparticles are sputtered on the outside as electrodes to produce a flexible capacitive sensing unit. Pressure sensors are placed on the fingertips and palm of the hand, while tension sensors are placed on the joints. b The response of the pressure sensors to the force. c The response of the tension sensors to the bending angle

As shown in Fig. 2, the sensing glove consists of 14 sensors: 9 pressure sensors and 5 tension sensors. The pressure sensors are fixed on the five fingertips and the four corners of the palm, and each sensor unit is about \(1.2\times 1.2 \ \mathrm {cm}^2\), while the tension sensors are fixed on the 5 finger joints and each sensing area is about \(1\times 1.5 \ \mathrm {cm}^2\). Both pressure and tension sensors are fabricated with two thin films as active layers and \(\mathrm {Au}\) nanoparticles as electrodes. Changes in shape and electrode distance are manifested as changes in capacitance. Therefore, the pressure sensors on the palm and fingertips give different signals when objects are grasped with different forces, and the tension sensors on the joints give different signals when objects of different shapes are grasped, as the fingers bend at different angles.

3.1.2 Others

Besides the tactile glove mentioned above, the hardware of our system includes two Arduino Mega microcontrollers, which are used to transmit signals between the tactile sensors and the computer's serial port. In addition, we use an easily available web camera (Logitech C922 Pro Stream webcam) to collect video data. The above-mentioned microcontrollers and webcam are directly connected to the host computer through USB interfaces. For the multimodal attention model, we complete all training and evaluation experiments on an NVIDIA GeForce RTX 3070 GPU. Finally, we use a Bluetooth speaker (MC A7) to announce the predicted results to visually impaired users.

3.2 Data

3.2.1 Data collection

In the scenario of assisting visually impaired people in learning to recognize objects, we aim to give prediction results based on a short period of grasping action. The time period should allow users to fully perceive the characteristics of the object, and at the same time allow the model to give an accurate prediction. Therefore, we adopt the strategy of collecting hand motion videos and tactile time series of every complete grasp. Specifically, in the basic classification experiment, we first select ten types of items that are commonly used in daily life, illustrated in Fig. 3. For each type of object, we carried out more than 150 complete grasps: picking it up, holding it for a few seconds, then putting it down, which forms the original dataset (a). When grasping the same item, we adopt a variety of gestures to enhance the diversity of the data at the collection stage, which also tallies with the actual use situation. Besides, we choose two different desk textures as background: one is warm with wood grain, and the other is cool without grain. The ratio of data in the two backgrounds is about 2:1. Each piece of data includes a grasping video (.avi) with a length of about 17 s, tactile sensor information for the same time period (.csv), and four frames (.jpg) extracted from the video at the 0th, 5th, 10th, and 15th seconds.
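For concreteness, a single grasp sample could be loaded roughly as follows; the directory layout, file names, and the assumption that the CSV has no header are ours for illustration, not the released dataset format.

```python
import csv
import numpy as np
from PIL import Image

def load_sample(sample_dir):
    """Load one grasp: the 4 frames (0th/5th/10th/15th second) and the tactile CSV."""
    frames = [Image.open(f"{sample_dir}/frame_{sec:02d}s.jpg")   # assumed file names
              for sec in (0, 5, 10, 15)]
    with open(f"{sample_dir}/tactile.csv", newline="") as f:     # assumed file name
        tactile = np.asarray([[float(x) for x in row] for row in csv.reader(f)])
    return frames, tactile                                       # tactile: (time, 14 channels)
```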

To demonstrate the robustness of visual and tactile multimodal prediction to extreme situations, we design 6 categories of items that are difficult to distinguish from the camera data alone. More than 100 samples are collected for each category with the same collection pipeline described above, but only on the warm grained background, which makes up dataset (b). We build another small dataset (c) which contains 200 pieces of data of 10 categories in total. When collecting this part of the data, irrelevant objects are placed in the field of view, which aims to test whether the participation of tactile information can help the model resist interference and whether spatial attention can focus on the operating hand. We finished this part of the collection on the cool non-grained desk. The items used to collect data are shown in Fig. 3. We perform data augmentation to enlarge all three datasets at a ratio of 1:10. Data augmentation and preprocessing will be discussed in Sect. 3.2.2. Details of the augmented dataset configurations can be found in Table 1.

Fig. 3 Data collection. a The 10 objects used for constructing the main dataset. b When constructing the second dataset, which is made up of objects that are hard to recognize by vision alone, we choose two bottles of different sizes. Each one is grasped under the following 3 states: full, half full, empty. c We collect the same 10 items as in a, but under the condition where there are distracting objects on the table

Table 1 Dataset configurations

3.2.2 Data augmentation and preprocessing

Data augmentation is a data-space solution to the problem of limited data (Shorten and Khoshgoftaar 2019). Through flips, translations, and rotations, we can significantly expand the dataset. For tactile time series, there are augmentation approaches in both the time domain and the frequency domain (Wen et al. 2020). In our experiments, we simply augmented the visual data while keeping the tactile time series unchanged. We generated new video frames from the collected datasets (a), (b), and (c) at a ratio of 1:10 following the process:

  1. Color jitter: adjusting brightness, contrast, saturation, and hue by \(\pm 20 \%\)

  2. Random rotation in the interval \([-45^{\circ },+45^{\circ }]\)

  3. Random vertical flip and horizontal flip, both with a probability of 0.2

  4. Random gray scale with a probability of 0.2

It is worth mentioning that we apply data augmentation to the 4 frames from the same video independently, that is, each of the 4 frames goes through a different augmentation configuration, which enhances the diversity of the augmented datasets. An augmentation example can be seen in Fig. 4.
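The steps above map naturally onto standard torchvision transforms. The following sketch applies them independently to each of the 4 frames; the function name and the way the \(\pm 20\%\) hue jitter is expressed through torchvision's hue argument are our own assumptions, not the released implementation.

```python
from torchvision import transforms

frame_transform = transforms.Compose([
    # 1. Color jitter: brightness, contrast, saturation +-20%; hue jitter assumed as 0.2
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.2),
    # 2. Random rotation in [-45, +45] degrees
    transforms.RandomRotation(degrees=45),
    # 3. Random vertical and horizontal flips, each with probability 0.2
    transforms.RandomVerticalFlip(p=0.2),
    transforms.RandomHorizontalFlip(p=0.2),
    # 4. Random grayscale with probability 0.2
    transforms.RandomGrayscale(p=0.2),
])

def augment_frames(frames, copies=10):
    """Augment the 4 frames of one grasp independently, `copies` times (the 1:10 ratio)."""
    return [[frame_transform(f) for f in frames] for _ in range(copies)]
```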

Fig. 4 Data augmentation example: hand cream. a The original video frames at different time points. b The 4 frames after data augmentation

The function of the data preprocessing component is to generate data that can be directly fed into the multimodal deep learning model. Due to the characteristics of the capacitance-based tactile sensor, the initial value (which refers to the value in the natural placement state without grasping anything) and the range of change of the raw tactile data vary across tactile sensor units, as shown in Fig. 5. The huge differences in the raw tactile time series across channels would bias the importance that the trained model assigns to each channel. Therefore, in order to let the model learn the information of each unit more comprehensively and evenly, we normalize each tactile sensor channel independently. Specifically, we find the maximum and minimum values of all the data of a certain sensor unit, denoted as \(V^{(i)}_{\max }\), \(V^{(i)}_{\min }\), then use the following simple linear mapping to map all the time series recorded by the sensor to [0, 1]:

$$\begin{aligned} V_i'(t)=\frac{1}{V^{(i)}_{\max }-V^{(i)}_{\min }}V_i(t) -\frac{V^{(i)}_{\min }}{V^{(i)}_{\max }-V^{(i)}_{\min }},\quad i =0,1,\ldots ,13. \end{aligned}$$

For the four frames of images used for training, we also adopt normalization. First, we convert the RGB image to a tensor in the range [0, 1]. Then, the mean is adjusted to 0 and the standard deviation to 1. The data preprocessing pipeline is illustrated in Fig. 5.
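A minimal sketch of both normalizations, assuming the per-channel extrema have already been computed over the whole dataset; the exact image statistics used for standardization are not specified in the text, so the values below are placeholders.

```python
import numpy as np
from torchvision import transforms

def normalize_tactile(series, v_min, v_max):
    """Map each of the 14 sensor channels to [0, 1] with the per-channel extrema
    computed over the whole dataset (the linear mapping above).

    series: (T, 14) raw tactile time series; v_min, v_max: (14,) arrays."""
    return (series - v_min) / (v_max - v_min)

# Image preprocessing: convert the RGB frame to a [0, 1] tensor, then standardize.
# The statistics below are placeholders; the paper only states mean 0 and std 1.
image_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```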

Fig. 5 Data preprocessing: a the preprocessing pipeline of video data. We first select 4 special frames from the original video, then stack them by time and apply transforms to normalize them. For tactile time series, as shown in b, we first apply the normalization and then stack them by sensor channel

3.3 Visual and tactile multimodal attention network

In this section, we discuss in detail the proposed multimodal attention network that classifies objects with tactile and visual data. Concretely, as the data we collect are time series from the 14 tactile sensor units over the glove and 4 frames taken at particular times from the corresponding downward-looking videos of the hand, we carefully design 4 distribution functions \(p_i \ (i=0,1,2,3)\) to hard-code the importance of tactile signals at different times, forming tactile-map and video image pairs, denoted by \((t_i,v_i),(i=0,1,2,3)\). Since only a portion of the sensor units plays a major role in the process of grasping, to fully extract the information of different tactile units, we embed squeeze-and-excitation (SE) blocks in the branch of the network that extracts haptic information, which achieves channel-wise attention. For a single video frame, we apply attention to spatial positions through residual attention modules to locate the hand area, making the prediction more robust. We extract feature vectors from the data with a visual network branch and a tactile network branch respectively, then perform feature-level modal fusion with a modality weight hyperparameter \(\lambda\). Ultimately, we pass the fused features, which have a relatively low dimension, to a two-layer perceptron to give the prediction. It is worth mentioning that although we have 4 different \((t_i,v_i)\) pairs, each of them is fed into the same two network branches (tactile branch and vision branch), which means the weights are shared when the proposed model processes data pairs from different temporal intervals. The overall schematic diagram is shown in Fig. 6; the structural details of the separate parts of the proposed network architecture are discussed in the following subsections.

Fig. 6 Network architecture. a The overall structure of the model, in which we adopt a weight sharing strategy to learn from different visual-tactile data pairs \((v_i,t_i),(i=0,1,2,3)\) of a single grasping process. The network is divided into tactile and visual branches, based on the squeeze-and-excitation network (Hu et al. 2018) and the residual attention network (Wang et al. 2017) respectively. The structure of the SE blocks is shown in b. \(\mathbf {\bar{X}}\) is the SE block output with the original feature \(U=Conv(\mathbf {X})\) as input. The structure of a single residual attention module is shown in c. In our residual attention network, we stack 3 residual attention modules along with max-pooling layers and convolution layers

3.3.1 Hard-coded temporal attention

As mentioned in Sect. 3.2.1, the vision data we use are 4 single frames cut out from the full-length video of grasping objects rather than the video itself, which would require lots of computational resources. The chosen 0th, 150th, 300th, and 450th frames of the roughly 15 s video represent the hand gesture in specific temporal intervals. It is natural to also weight the tactile time series in the time dimension to distinguish the importance of the current frame's time interval from other intervals. Therefore, we design 4 quadratic distribution functions to depict attention in time. As the tactile series' length is 60, we divide it into four intervals: \(0\sim 14,15\sim 29,30\sim 44,45\sim 59\), denoted as time periods No. 0, No. 1, No. 2, No. 3. For each time period i, the starting index is denoted as \(T_{i}^{s}\) and the ending index as \(T_{i}^{e}\). The distribution function \(p_i,(i=0,1,2,3)\) is given by the following equation.

$$\begin{aligned} p_i(t) = \frac{f_i(t)+g_i(t)}{s} \end{aligned}$$
(1)

s is a scale factor, which is set to 15, while \(f_i(t)\) is the peak distribution, defined as:

$$\begin{aligned} f_i(t)=-(t-T_{i}^{s})\cdot (t-T_{i}^{e}), \quad t\in [T_{i}^{s},T_{i}^{e}]. \end{aligned}$$
(2)

\(g_i(t)\) is the body distribution, which is defined as:

$$\begin{aligned} g_i(t)={\left\{ \begin{array}{ll} -w \cdot (t-59)\cdot (t-T_i^s - T_i^e +59) , \quad &{} i=0,1\\ -w \cdot t \cdot (t-T_i^s - T_i^e) ,\quad &{} i=2,3 \end{array}\right. }, \quad t\in [0,59], \end{aligned}$$
(3)

where w is the weight that adjusts the ratio of the maxima of \(g_i(t)\) and \(f_i(t)\). Outside their domains of definition, both functions are set to 0. The weight distributions obtained are shown in Fig. 7. For each distribution, we sample at the integer indices between 0 and 59 to form a 60-dimensional vector \(\mathbf {p_i}\); then the weights are hard-coded into the tactile time series of each sensor unit j by element-wise multiplication, so that we end up with 4 tactile-visual data pairs \((v_i,t_i), i=0,1,2,3\), which correspond to 4 adjacent time intervals.

$$\begin{aligned} t_i^{(j)} \leftarrow t_i^{(j)} \odot \mathbf {p_i},\quad j=0,1,\ldots ,13 \end{aligned}$$
(4)
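The weighting of Eqs. (1)-(4) can be reproduced directly. In the sketch below, the scale factor s = 15 follows the text, while the body-to-peak ratio w is left as a free parameter since its value is not reported.

```python
import numpy as np

T_LEN = 60                        # length of one tactile time series
STARTS = [0, 15, 30, 45]          # T_i^s for time periods No. 0..3
ENDS = [14, 29, 44, 59]           # T_i^e

def temporal_weights(i, w=0.1, s=15.0):
    """Sampled 60-dim weight vector p_i of Eqs. (1)-(3); w = 0.1 is an assumed value."""
    t = np.arange(T_LEN, dtype=float)
    Ts, Te = STARTS[i], ENDS[i]

    # Peak distribution f_i(t), nonzero only inside [T_i^s, T_i^e]   (Eq. 2)
    f = -(t - Ts) * (t - Te)
    f[(t < Ts) | (t > Te)] = 0.0

    # Body distribution g_i(t) over the whole series [0, 59]         (Eq. 3)
    if i in (0, 1):
        g = -w * (t - 59) * (t - Ts - Te + 59)
    else:
        g = -w * t * (t - Ts - Te)

    return (f + g) / s                                               # Eq. (1)

def apply_temporal_attention(tactile):
    """tactile: (14, 60) array -> the four weighted copies t_0..t_3 of Eq. (4)."""
    return [tactile * temporal_weights(i)[None, :] for i in range(4)]
```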
Fig. 7 Weight distributions, from left to right labeled as No. 0, No. 1, No. 2, No. 3

3.3.2 Channel-wise tactile attention with squeeze–excitation blocks

Because the tactile sensors are distributed all over the palm and knuckles, it is very unlikely that every sensor shows a strong and similar signal change in a single grasping action. To dynamically exploit the different importance of the sensor channels, we embed an SE block, proposed by Hu et al. (2018), in the tactile branch of the model. It contains two processes, namely squeeze and excitation, shown in Fig. 6b. In the squeeze process, we generate a channel descriptor \(\mathbf {z_i}\in \mathbb {R}^{14},(i=0,1,2,3)\) using global average pooling:

$$\begin{aligned} z_i^{(j)}=\frac{1}{60}\sum _{m=0}^{59} {(t_i^{(j)})}_m,\quad j=0,1,\ldots ,13. \end{aligned}$$
(5)

Then comes the excitation process, which makes use of the information aggregated in the squeeze process and is implemented with a gating mechanism:

$$\begin{aligned} \mathbf {s_i} = \sigma (W_2\delta (W_1\mathbf {z_i})),\quad i=0,1,2,3, \end{aligned}$$
(6)

where \(\delta\) refers to the ReLU activation function, \(\sigma\) refers to a Sigmoid function, and \(W_1 \in \mathbb {R}^{\frac{14}{r}\times 14},W_2 \in \mathbb {R}^{14\times \frac{14}{r}}\). r is the reduction ratio, which was set to 2 in our experiments. At the end of excitation, the original input \(t_i\) is rescaled by channel:

$$\begin{aligned} t_i^{(j)} \leftarrow \mathbf {s_i}^{(j)} t_i^{(j)},\quad j=0,1,\ldots ,13. \end{aligned}$$
(7)
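A sketch of this channel-wise attention block for the 14 tactile channels, following Hu et al. (2018) with the reduction ratio r = 2; the class and layer names are ours.

```python
import torch
import torch.nn as nn

class TactileSEBlock(nn.Module):
    """Channel-wise attention over the 14 tactile sensor channels (Eqs. 5-7)."""

    def __init__(self, channels: int = 14, reduction: int = 2):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool1d(1)            # global average pooling (Eq. 5)
        self.excite = nn.Sequential(                       # gating mechanism (Eq. 6)
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, t):                 # t: (batch, 14, 60) temporally weighted series
        z = self.squeeze(t).squeeze(-1)   # (batch, 14) channel descriptors z_i
        s = self.excite(z)                # (batch, 14) channel weights s_i
        return t * s.unsqueeze(-1)        # rescale each channel (Eq. 7)
```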

3.3.3 Spatial visual attention with residual attention modules

In our application scenario, although the gloved hand only grips one item at a time, there is no guarantee that no interfering objects are on the desktop within the camera view. To prevent other objects on the table from interfering with the inference of the visual branch of the model, we make the model capable of spatial attention so that it can locate the object being grasped. We adopt the Residual Attention Module proposed in Wang et al. (2017), which borrows the idea of residual learning and adds a soft-mask attention mechanism on top of an identity mapping:

$$\begin{aligned} H_{i,c}(x) = (I + M_{i,c}(x)) \odot F_{i,c}(x) \end{aligned}$$
(8)

where i ranges over all spatial positions and c is the index of image channels. \(M_{i,c}(x)\) lies in [0, 1], and as \(M_{i,c}(x)\) approaches 0, \(H_{i,c}(x)\) approaches the original features \(F_{i,c}(x)\). Meanwhile, the identity mapping guarantees that the performance will be at least no worse than that of the original network without attention.

The structure of a single residual attention module is shown in Fig. 6c; its main components are the trunk branch and the mask branch. In the construction process, there are 3 hyperparameters p, t, r controlling the block's size, respectively denoting the number of preprocessing Residual Units before splitting into the trunk branch and mask branch, the number of Residual Units in the trunk branch, and the number of Residual Units between adjacent pooling layers in the mask branch. In our implementation, the values are set to \(\{ p=1,t=2,r=1 \}\). We stack 3 residual attention modules along with max-pooling layers and convolution layers to form the visual residual attention network. The corresponding sizes of the 3 residual attention modules' outputs are \(56\times 56@256,\ 28\times 28@512, \ 14\times 14 @ 1024\).
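A simplified sketch of one such module is given below: it realizes \(H = (1 + M(x)) \odot F(x)\) with \(p=1, t=2, r=1\), but for brevity it uses plain convolutional blocks in place of full Residual Units and only a single down/up-sampling stage in the mask branch, so it illustrates the mechanism rather than reproducing the exact module of Wang et al. (2017).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(channels):
    """Plain pre-activation conv block standing in for a full Residual Unit (simplification)."""
    return nn.Sequential(
        nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
    )

class ResidualAttentionModule(nn.Module):
    """Minimal sketch of H = (1 + M(x)) * F(x) with p=1, t=2, r=1."""

    def __init__(self, channels):
        super().__init__()
        self.pre = conv_block(channels)                          # p = 1
        self.trunk = nn.Sequential(conv_block(channels),
                                   conv_block(channels))         # t = 2
        self.mask = nn.Sequential(nn.MaxPool2d(2),               # down-sample
                                  conv_block(channels))          # r = 1
        self.mask_out = nn.Sequential(nn.Conv2d(channels, channels, kernel_size=1),
                                      nn.Sigmoid())              # soft mask M(x) in [0, 1]

    def forward(self, x):
        x = x + self.pre(x)                       # preprocessing unit (residual form)
        trunk = x + self.trunk(x)                 # trunk branch features F(x)
        m = F.interpolate(self.mask(x), size=trunk.shape[-2:],
                          mode='bilinear', align_corners=False)  # up-sample
        m = self.mask_out(m)
        return (1.0 + m) * trunk                  # Eq. (8)
```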

3.3.4 Weight sharing

As mentioned in Sect. 3.3, although there are 4 different \((t_i,v_i), (i=0,1,2,3)\) pairs as network input, we use a single network with visual branch V(x) and tactile branch T(x) rather than four separate ones, which means the weights are shared when the proposed model processes data pairs from different temporal intervals. Therefore, each time we update the weights, the gradient comes equally from the four data pairs. This not only greatly reduces the number of network parameters, but, more importantly, indirectly provides more data for training a single network. The desired visual branch should have the ability to extract features at any moment, and so should the tactile branch. Through weight sharing, we can train network branches with stronger representational ability.

3.3.5 Modal fusion with tunable weight \(\lambda\)

The two branches' outputs \(V(v_i)\) and \(T(t_i)\) have exactly the same dimension, denoted as the fusion dimension \(D_f\), which is also a hyperparameter (set to 100 in our experiments). We use another parameter \(\lambda \in [0,1]\) to indicate the importance of each modality when performing modal fusion. We set the weight of visual features to \(\lambda\) and that of tactile features to \(1-\lambda\). In our preliminary trials, \(\lambda\) is set to 0.5, which means that visual and tactile features are equally important in this case. The fused feature vector \(\mathbf {v}\) is given by:

$$\begin{aligned} \mathbf {v}=\sum _{i=0}^3 [\lambda V(v_i) +(1-\lambda )T(t_i)] \end{aligned}$$
(9)

3.3.6 Classifier

The fused feature vector \(\mathbf {v}\) is passed through a two-layer perceptron whose output dimension is the number of object classes. We set the dimension of the hidden layer to \(\lfloor \sqrt{D_f+\frac{1}{2}} \rfloor\). With a hidden layer of reasonable size, the two-layer classifier provides stronger nonlinearity for the model and thus helps to achieve higher classification accuracy.
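Putting the weight-shared branches, the fusion of Eq. (9), and the two-layer classifier together, the forward pass can be sketched as follows; `visual_branch` and `tactile_branch` stand for the V and T networks described above, and all names are illustrative rather than taken from the released code.

```python
import math
import torch
import torch.nn as nn

class AviPerHead(nn.Module):
    """Sketch of the weight-shared fusion (Eq. 9) and the two-layer classifier.
    `visual_branch` and `tactile_branch` both map their input to a D_f-dim vector."""

    def __init__(self, visual_branch, tactile_branch,
                 num_classes=10, d_f=100, lam=0.5):
        super().__init__()
        self.V, self.T = visual_branch, tactile_branch    # shared across the 4 pairs
        self.lam = lam                                    # modality weight lambda
        hidden = math.floor(math.sqrt(d_f + 0.5))         # hidden layer dimension
        self.classifier = nn.Sequential(nn.Linear(d_f, hidden),
                                        nn.ReLU(inplace=True),
                                        nn.Linear(hidden, num_classes))

    def forward(self, pairs):
        # pairs: four (v_i, t_i) tensors; the same V and T weights are reused
        # for every temporal interval (weight sharing).
        fused = sum(self.lam * self.V(v) + (1.0 - self.lam) * self.T(t)
                    for v, t in pairs)                    # Eq. (9)
        return self.classifier(fused)
```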

4 Experiments

In this section, we first discuss the experimental settings in detail in Sect. 4.1, then provide the overall evaluation result in Sect. 4.2 and the ablation study in Sect. 4.3. In Sect. 4.4, the robust classification results of the system in several extreme cases are presented. Then we discuss the effects of input data pair selection in Sect. 4.5. Finally, we present the actual use scenario test in Sect. 4.6.

4.1 Experimental settings

We first clarify the general settings of our evaluation experiments. For the basic ablation studies in Sect. 4.3, we split the multimodal dataset for training and testing at a ratio of 7:3. During the training process, the Adam algorithm (Kingma et al. 2014) is employed to optimize the learnable model parameters. We set the learning rate to 0.0005 and the batch size to 4. We choose cross-entropy loss as the loss function and use the accuracy score as our performance metric. All of our model implementations and experiments are based on PyTorch (Paszke et al. 2019). The training and testing processes are completed on a Windows 10 PC with an NVIDIA GeForce RTX 3070 GPU. Other specific settings will be given in the corresponding sections of the experiments.
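A sketch of this training setup; the model object, the dataset object, and the number of epochs are assumptions (the epoch count is not reported), and the batch collation of the four \((v_i,t_i)\) pairs is glossed over.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)                       # the multimodal attention network (assumed)
loader = DataLoader(train_set, batch_size=4,   # train_set: the 70% split (assumed object)
                    shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)
criterion = nn.CrossEntropyLoss()
num_epochs = 50                                # assumed; the epoch count is not reported

for epoch in range(num_epochs):
    model.train()
    for pairs, labels in loader:               # each sample: four (v_i, t_i) pairs + label
        pairs = [(v.to(device), t.to(device)) for v, t in pairs]
        labels = labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(pairs), labels)
        loss.backward()
        optimizer.step()
```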

4.2 Best accuracy performance

With the 3 kinds of attention mechanism and the 2 modalities, our system achieved 99.75% classification accuracy, which is about 1.5% higher than the result of a similar network structure with roughly the same number of parameters but no attention mechanism, and significantly higher than the single-modality models (about \(5\%\) higher than the tactile-only model and about \(2\%\) higher than the visual-only model).

4.3 Ablation study

The ablation study of the multimodal attention network on dataset (a), shown in Table 2, is intended to show the necessity of each component. All the experiments in this section follow the settings in Sect. 4.1.

Table 2 Ablation study on attention mechanisms

It is worth mentioning that we disable a specific attention mechanism by setting all of its parameters to 1 if they are attention weights or to 0 if they are soft masks, and remove a modality by resetting the modality fusion hyperparameter \(\lambda\) to 0 or 1, respectively representing removing the visual or the tactile modality. From the experimental results, we can conclude that the modules we integrate distinctly improve the accuracy of the model. Concretely, with both visual and tactile modalities and all three attention mechanisms loaded, the model achieves an extremely high accuracy of 99.75%. If we abandon the visual spatial attention, the test accuracy drops to 98.71%, shown as A-TC. If we remove the channel-wise attention or the temporal attention, the accuracy descends to 99.14% and 99.02% respectively, shown as A-TS&CS, whose decreases are smaller than that of A-TC. We can infer that the visual attention matters more than those two kinds of tactile attention mechanisms. In the A-S experiment of the ablation study, we remove both kinds of attention mechanism applied to the tactile modality. The accuracy is 98.97%, still better than the 98.71% of the A-TC experiment, which is in line with our inference above. Removing all attention mechanisms ends up with an accuracy of 98.28%, which implies the effectiveness of introducing attention mechanisms into the classification process. As for the ablation experiments on modality, if we remove the visual branch of the network and only use tactile information, the accuracy drops sharply to 94.54%, shown as M-T. The accuracy is 97.98% when inferring with the visual modality only, namely M-V. These two experiments together prove that the modal fusion is rewarding.

4.4 Multimodal prediction in extreme cases

4.4.1 Classification of almost visually indistinguishable items.

In this section, we test the classification model in the case where items are hard to distinguish by vision alone, in order to prove that the integration of the tactile modality improves the robustness of the classification model. It can be expected that some items are troublesome to distinguish with the webcam alone. However, these items usually have very different tactile characteristics, such as weight, texture, material, and feel. Based on the above features, we selected 6 types of bottles as targets to be identified, which are hard to distinguish even for a human a little distance away. Specifically, we used two bottles with exactly the same color but significantly different volumes (380 mL/550 mL). Each bottle size was used to collect data in three different filling states: empty, half full, and full, as shown in Fig. 3. We conducted classification experiments on dataset (b) of the above 6 kinds of bottles. We tested three cases separately: using visual data only, using tactile data only, and visual-tactile modal fusion. The results show that the auxiliary recognition of tactile information can significantly improve the results of visual classification. For specific results, see Table 3. Besides the accuracy of classification on the 6 bottles, we evaluated the classification accuracy for the 3 filling states. We also calculated the recall rates for the different bottle sizes (marking the large bottle as positive), which derives from:

$$\begin{aligned} Recall=\frac{TP}{TP+FN}, \quad TP \text { for true positive and } FN \text { for false negative.} \end{aligned}$$
(10)
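As a minimal illustration of this metric, with the large bottle marked as the positive class; the label strings below are ours, not the dataset's.

```python
def recall(y_true, y_pred, positive="large"):
    """Recall = TP / (TP + FN), marking the large bottle as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0

# Example: two of the three large-bottle grasps are retrieved -> recall = 2/3.
print(recall(["large", "large", "small", "large"],
             ["large", "small", "small", "large"]))
```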
Table 3 Multimodal prediction in extreme cases (I): classification of almost visually indistinguishable items

It can be seen that the classification result combining the two modalities is significantly better than predictions made with single-modality data. The experiment also proves that our proposed system has the potential to achieve more than just object classification assistance. The above example of accurately distinguishing bottles of different capacities and different filling states shows that the system can qualitatively assist visually impaired people in perceiving the size and weight of items, that is, let them feel the characteristics of items.

There is also a large category of items that are easy to distinguish visually but almost the same to the touch, for example, items of the same shape but different colors. However, this kind of object is of little practical significance to our application of assisting visually impaired people. Additionally, the advantages of visual recognition have been proved by a large number of previous studies. Therefore, we did not conduct these additional experiments.

4.4.2 Classification when there are interfering objects on the desktop.

In this section, we test the classification model in the case of interfering objects placed messily on the table. In practical situations, it is common for more than one item to appear in the field of view captured by the webcam, while only one of these objects is the target that the visually impaired user is grasping, which is the one the model should predict. Specifically, we mixed dataset (c) of the 10 items mentioned in Sect. 3.2.1, approximately 2200 pieces of data in total after augmentation, with the original dataset (a) in varying proportions to form the training set and test set. During each grasping process in the collection of dataset (c), there is more than one item in the field of view. We labeled the data according to the item directly grasped by the gloved hand. The results evaluated on the mixed dataset are shown in Table 4.

Table 4 Multimodal prediction in extreme cases (II): classification when there are interfering objects on the desktop

From the results, we can see that in the first five experiments, the classification accuracy is considerably high even though the proportion of each dataset in the training set or the test set varies. Concretely, in the first five experiments, data from each of the two datasets are partly used to form the training set, which ensures the capability of the model to learn the essential features of these 10 items and a robust attention mechanism that overcomes the disturbance of irrelevant objects. However, in the other two experiments, when we use one dataset to train and the other to evaluate, the accuracy declines steeply. A conceivable underlying cause is that the visual attention training is misled. We therefore tried removing the visual branch to see the impact of this visual misleading. The classification results of the last two experiments with the single tactile modality are 84.43% and 88.57% respectively, which proves that an ill-trained visual branch does severely degrade the results.

4.5 Effects of input data pairs \((v_i,t_i)\) selection

The main purpose of this experiment is to explore the influence of the number of input tactile-visual data pairs \((v_i,t_i)\) on classification accuracy. In the previous experiments, we found a difference between the collected data and the actual use scene. Because every piece of data in the dataset records an entire grasping process, the first frame of each group of video frames always starts with the user's hand next to the object to be classified, while in the actual use scene it is possible that the object to be identified is held or hidden in the gloved hand during the entire identification process. To verify the impact of input data pair selection on the experimental results, we used different combinations of input data pairs to perform classification experiments. Details are shown in Table 5.

Table 5 Input frame selection

The results show that for the same number of input frames, there is no significant difference in classification accuracy, but reducing the number of input frames results in a slight decrease in classification accuracy compared to our best accuracy of 99.59%. Therefore, the frame selection strategy we use should not have a major impact on the actual use scene. In addition, the accuracy in experiments done without hard-coded temporal attention is lower than in those with it, which reflects that the proposed temporal attention effectively helps the model learn features from data pairs of different time intervals.

4.6 Evaluation in real use scenario

In this section, we deploy our system in the actual use scene. Due to the COVID-19 pandemic, we were limited in finding a well-matched and adequate target audience. In line with the principle of epidemic prevention and control, we did not recruit test users from society but carried out the practical application scenario test among 5 colleagues from the department. However, we believe that this choice has little influence on the validity of our user experiment, since our purpose is to test the adaptability of the system to different individual user behaviors. Therefore, the core is to test the predictive performance of the model on data collected by different people in real time. Specifically, we recruited 5 test users, including 3 females and 2 males, with normal vision. In order to more realistically simulate the use of the system by visually impaired people, we put blindfolds on the users during the test, and the users explored the use of the system by themselves. The use scenario is shown in Fig. 8. The 5 users conducted a total of 200 tests, covering 10 types of items 20 times each. The number of correct predictions for each category is listed in Table 6, from which we can see that the accuracy in the actual use scene drops only slightly compared with that on the test set.

Table 6 The accuracy of each item in use scenario evaluation
Fig. 8 Actual use scene. The main hardware of our system is marked in User 1's picture. When the mouse is clicked, the system begins to work. The webcam and glove are turned on at the same time to collect data. The blindfolded user then repeatedly changes the posture of his gloved hand to grasp the current object, building his own cognition of the item and giving his own judgment. When the collection is done, the multimodal classification model quickly broadcasts the predicted label via the speaker. Because the user's purpose is to continuously learn to distinguish the objects in his gloved hand through this process, if the user is in doubt about the result, he can repeat the process, change the gesture, and listen to the prediction result again, eventually forming his cognition of the item in hand

Since the system is designed for continuous assistance, in actual situations users can strengthen their perceptual knowledge of objects by repeatedly identifying the same object. The accuracy of the actual application will therefore be higher than the accuracy of simple classification, because in most instances there is repeated recognition. Besides, there is an obvious trend that large and distinctively shaped items are predicted better, while small and light items are predicted more poorly, which is in line with the results on the test set.

The user-friendliness of the system is mainly reflected in the fact that blindfolded users, after a brief oral introduction in advance, could all master the use of the system within a few minutes. This demonstrates the potential of our system and suggests that it would be equally easy to use for serving truly visually impaired people. From the feedback of the test users, all five were impressed by the actual feeling of operation after putting on the glove. In addition, everyone completed the process of repeatedly grasping, predicting, and finally determining what the object in the hand is. In our opinion, it can be used by visually impaired people equally easily. Whenever there is an opportunity, we hope to recruit visually impaired users to do more in-depth usage tests.

5 Discussion

5.1 Visualization of spatial and channel attention

In this section, we attempt to visualize the learned spatial and channel attention weights so as to examine the effectiveness of the attention mechanisms and give more insights.

5.1.1 Spatial attention

As mentioned in Sect. 3.3.3, the vision branch of our model has 3 residual attention blocks to enforce spatial attention perception. The sizes of the attention maps are \(56\times 56@256\), \(28\times 28@512\), and \(14\times 14@1024\) correspondingly, which provide attention focus from local regions to global features. We visualized every channel of the attention maps calculated by these 3 attention blocks, which show an excellent effect in focusing on the tactile glove and the grasped object. Some of the visualizations are shown in Figs. 9, 10, and 11, in which we randomly sampled several channels of the three attention layers. From this, we can get a preliminary sense of how the attention mechanism works to improve the performance of the model. As the attention layers get deeper, the attention mechanism gradually focuses on the area of interest. Note that not every channel of a particular attention layer is guaranteed to pay attention to valid regions, but the number of attention map channels is large, which statistically guarantees that the attention mechanism works effectively. Especially in the visualization on dataset (c), in which interfering objects appear in view, the attention mechanism helps the model focus correctly on the grasped objects, as shown in Fig. 11. Comparing the visualization results of the attention maps obtained from the experiments on the three datasets, we can see its superiority in focusing on key regions and eliminating interfering features.

Fig. 9 Visualized attention maps on dataset (a). Objects of small size rely more on attention maps in the deeper layers, while those of larger size have effective attention at both shallow and deep levels

Fig. 10 Visualized attention maps on dataset (b). Left: empty-small bottle. Right: full-large bottle. Spatial attention is still very effective, but the visual appearance between classes is too similar to provide distinguishing features

Fig. 11 Visualized attention maps on dataset (c). Left: chess. Right: glasses box. Attention mechanism helps to highlight objects of interest and suppress irrelevant ones

5.1.2 Channel attention

In Sect. 3.3.2, we introduced the channel attention in our proposed model in detail. It aims to select the sensor channels with more importance. In practice, we found that a well-trained channel-wise attention module gives different sensor channels significantly different weights, shown in Fig. 12, which verifies its validity. Besides, we observed that inputs from different categories only change the channel-wise weights slightly, which means the channel weights are almost consistent across object classes.

Fig. 12 Channel weights visualization on datasets (a) and (b). The 4 curves in each figure represent the channel attention weights of the 14 sensors obtained with tactile time series under the 4 kinds of temporal attention as input

5.2 Learnable modality fusion parameter \(\lambda\)

In all of the above experiments, we treated the visual and tactile modalities equally in the modal fusion process of our model, which means we did not account for the preference of different objects for one of the two modalities. However, as mentioned in Sect. 4.4, there are cases in which objects are more likely to be identified if more weight is attached to one particular modality because their distinguishing features lie in that modality. In order to see the modality preference of the experimental objects and, furthermore, use this preference to improve classification accuracy, we make the fusion hyperparameter \(\lambda\) learnable.

5.2.1 Learning process of \(\lambda\)

Technically, we implement a tiny network that gives the optimal \(\lambda\) for each input based on the fusion-layer features. Given the vision-branch outputs \(V(v_i)\) and tactile-branch outputs \(T(t_i)\) for \(i=0,1,2,3\), \(\lambda\) is given by:

$$\begin{aligned} \lambda =\mathscr {G}\left( \sum _{i=0}^3 T(t_i)+\sum _{i=0}^3 V(v_i)\right) \quad T(t_i),V(v_i) \in \mathbb {R}^{D_f}, \end{aligned}$$
(11)

where \(\mathscr {G}\) is a linear mapping followed by a Sigmoid function to constrain \(\lambda\) to (0, 1). Then, the fused feature is computed by Eq. (9), in which \(\lambda\) weights the visual features and \(1-\lambda\) weights the tactile features.
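A sketch of this \(\lambda\)-learner and of how the learned, per-sample \(\lambda\) replaces the fixed weight in Eq. (9); class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class LambdaLearner(nn.Module):
    """Tiny network G of Eq. (11): maps the summed fusion-layer features to a
    per-sample fusion factor lambda in (0, 1)."""

    def __init__(self, d_f: int = 100):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(d_f, 1), nn.Sigmoid())  # linear map + Sigmoid

    def forward(self, visual_feats, tactile_feats):
        # visual_feats, tactile_feats: lists of the four V(v_i) and T(t_i) vectors,
        # each of shape (batch, D_f).
        lam = self.g(sum(tactile_feats) + sum(visual_feats))     # Eq. (11), shape (batch, 1)
        # Fused feature of Eq. (9) with the learned, per-sample lambda.
        fused = sum(lam * v + (1.0 - lam) * t
                    for v, t in zip(visual_feats, tactile_feats))
        return fused, lam
```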

5.2.2 Evaluation on 3 datasets

We performed an evaluation on all three datasets (a), (b), and (c). On the basic dataset (a) with 10 classes, we randomly sampled 100 times from each category and calculated the fusion factors \(\lambda\) and \(1-\lambda\) of each sample from the trained model. The results show that in most categories the visual modality is clearly preferred, with \(\lambda\) varying from 0.6 to nearly 1.0, but in three categories, namely calculator, hand cream, and tape, \(\lambda\) is rather low, which indicates that touch dominates in the classification of these three classes. The box plots in Fig. 13 give the details of the visual and tactile fusion factors of each class.

Fig. 13 The fusion factors \(\lambda\) and \(1-\lambda\) for experiments on dataset (a). See Table 1 for category labels. The yellow dotted line is the median line and the orange diamond is the mean point. The red points are extreme outliers

In the experiments on datasets (b) and (c), significant results appeared. Both experiments learned a \(\lambda\) close to 0 for all categories in the datasets, which means that with the self-adapting fusion factor, the model tended to rely almost exclusively on tactile features while ignoring visual features when making inferences. As mentioned in Sect. 4.4, both of these experiments represent a kind of visual indiscernibility. The learned fusion factor reasonably confirms the dominant position of tactile features in these cases, which highlights the necessity of introducing the tactile modality, as shown in Fig. 14.

Fig. 14 The fusion factors \(\lambda\) and \(1-\lambda\) for experiments on datasets (b) and (c). See Table 1 for category labels

Moreover, the self-adaptive \(\lambda\)-learner can be thought of as a fourth kind of attention mechanism in our model, namely modality attention or branch attention, as it makes a trade-off between the output feature vectors of the visual and tactile branches of the network. The classification accuracy improves in all three experiments, as illustrated in Table 7.

Table 7 The improvement in classification accuracy for experiments on different datasets due to the adoption of \(\lambda\)-learner

5.3 From research to practice

Although our system has shown excellent performance in the experimental evaluation, it has not yet been promoted to practical application or really brought convenience to the lives of visually impaired people. Much more needs to be done to make this continuous, immersive, and educational system practical. According to data from the Department of Educational Planning of China (Footnote 1), only about 11,000 people with visual disabilities were in the national education system (or had just graduated) in 2020, which is far from the actual number of people of educational age who have vision problems. Many visually impaired people cannot receive compulsory education due to disability, travel restrictions, poverty, etc. Our proposed system has the advantages of low production and maintenance costs, convenience of use, and no need for human supervision. Therefore, the system has strong potential for promotion. When it is deployed in education places for visually impaired people with the system's accompanying hardware, it only needs to work with the same shared classification model in the cloud to achieve high-precision robust classification on a given set of items. The training process for learning new items can be carried out completely on the server side and then deployed to all user terminals, which shows the excellent scalability of the system.

In the next stage, we plan to apply for funding from the government and public welfare enterprises to implement a pilot system, and then to promote and apply it to provide solutions for the continuous education of visually impaired people and improve their living ability and quality of life. Specifically, we will first join the local government department and the information center for disabled persons to deploy the pilot system to demonstration areas such as the warm homes and schools for the disabled in various districts and counties. Secondly, we will collect the video and tactile data and the feedback of visually impaired people using the system; based on the collected data and usage feedback, we will further update and improve the system. For example, we might improve the recognition accuracy and generalization ability of the vision and tactile fusion model by introducing data collected in practical use scenes, and design lighter and thinner gloves according to the operating habits of visually impaired people. After that, the updated system can be deployed and applied in a new round. Through this cycle of deployment, data and feedback collection, system update, and re-deployment, we hope to bring our system from the research level to practical application, so as to truly change the lives of visually impaired people.

6 Conclusions

To provide a solution for unsupervised assistance of visually impaired people, we proposed AviPer, a system that aims to offer continuous, immersive, and educational assistance for visually impaired people to perceive the world. We developed a flexible tactile glove to work with visual recognition and achieve robust multimodal object classification. The key insight of AviPer is that it can provide visually impaired people with a sense of participation and real experience in the assistance process, while reaching a high level of classification accuracy. In the process of developing the system, we overcame the difficulties of heterogeneous data by creatively designing a multi-attention multimodal classification network. We used the intelligent tactile glove to achieve low-cost and stable acquisition of tactile data. We fully respected the privacy of data collectors and users in every session from data acquisition to actual application. The verification experiments under various extreme situations prove the robustness of our system. The user experience in the actual scene shows the usability and user-friendliness of the system. We will improve our work in the direction of further improving its generality and putting it into practice. We look forward to this work truly entering the lives of visually impaired people and bringing substantial changes to their living ability and quality of life.