1 Introduction

The past decade has seen a revival in Artificial Intelligence (AI) and machine learning, driven mainly by the impressive reported performance of deep learning. Computer vision, together with its sister field natural language processing, has improved significantly and contributed to several key application areas [1]. Devices such as the Amazon Echo, Google Home and the many other internet of things devices found in our homes generate data that can be used to train deep learning models to recognise various activities.

Image recognition and classification systems benefit enormously from deep learning and have contributed to several consumer applications, such as the Apple photo organiser [2], which is capable of grouping similar pictures with a common theme. Facebook's face recognition system [3] is another classic example of how image recognition has been incorporated into social media. At the industrial level, AI has contributed in various ways. For example, Amazon's warehouses rely heavily on robots to move loaded shelves [4]. These robots coordinate with one another and can navigate around the warehouse, avoiding collisions with other robots as well as stationary objects. In healthcare, medical image analysis contributes to non-invasive diagnosis [5, 6]. The number of useful applications for AI and computer vision increases daily and is all around us [7,8,9,10]. In agriculture, satellite imagery has contributed in various ways to estimating crop yield [11,12,13,14]; in sports such as football (or soccer) it supports decisions such as goal-line technology [15]. Various manufacturing companies use computer vision on their production lines to identify defective items [16]. Similarly, the success of driverless cars relies heavily on computer vision to identify objects in the scene [17, 18]. What is still missing is the ability for machines (robots) to see, recognise and react to their immediate surroundings just as humans do with their most complex cognitive ability, vision: in other words, the use of perception to generate knowledge.

This work explores the failures of existing machine learning models for scene understanding and proposes new directions for modelling the scene, drawing on bio-inspired vision, together with experimental results to demonstrate the effectiveness of this approach. A key contribution of this exploratory work is to show how attention mechanisms studied by psychologists can improve the modelling of global descriptors needed to fully understand a visual scene.

2 Current State of Affairs

Machine learning (a subset of artificial intelligence) has rapidly evolved in the past decade and continues to advance. Several key areas have seen significant progress and have become popular research topics; those most relevant to this work are listed below:

  • Deep Learning: Deep learning has been at the forefront of machine learning advancements. Neural networks with multiple layers (deep neural networks) have achieved remarkable results in various domains, including image recognition, natural language processing, and speech synthesis. Techniques such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been extensively explored and improved upon. In [19], deep learning was used to monitor a construction site, identifying safety concerns and worker behaviour.

  • Transfer Learning and Pre-trained Models: Transfer learning has gained prominence, allowing models to take advantage of knowledge acquired through pre-training on large datasets. Pre-trained models such as Bidirectional Encoder Representations from Transformers (BERT) [20], the Generative Pre-trained Transformer (GPT) [21], and ImageNet-trained CNNs have demonstrated excellent performance across various tasks. By fine-tuning these models on specific data, researchers have achieved state-of-the-art results with smaller labelled datasets (a minimal fine-tuning sketch is given after this list). Transfer learning has been used effectively in areas such as medical imaging [22], where labelled data may not be readily available.

  • Reinforcement Learning: Reinforcement learning has made significant strides, especially in game-playing agents. Algorithms such as Deep Q-Networks (DQN) [23] achieved superhuman performance on Atari games, and related approaches have since mastered Go, Chess, and Dota 2. Reinforcement learning has also been applied to robotics, control systems, and recommendation systems [24].
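To make the transfer-learning item above concrete, the following is a minimal fine-tuning sketch using PyTorch and torchvision: an ImageNet-pretrained ResNet-18 has its feature extractor frozen and only its final classification layer retrained. The dataset path and class count are illustrative placeholders, not part of this work.

```python
# Minimal transfer-learning sketch (PyTorch/torchvision assumed available;
# the dataset directory and number of classes are illustrative placeholders).
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

num_classes = 5  # hypothetical number of target classes

# Load an ImageNet-pretrained backbone and freeze its feature extractor.
model = models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully-connected layer so only it is trained on the new data.
model.fc = nn.Linear(model.fc.in_features, num_classes)

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
train_set = datasets.ImageFolder("data/train", transform=preprocess)  # placeholder path
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

optimiser = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:
    optimiser.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimiser.step()
```

Because only the small final layer is updated, such fine-tuning can reach strong accuracy with far less labelled data than training from scratch, which is the point made above.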

2.1 The Key Factors for AI Acceleration

Standard neural networks, which normally consist of many simple neurons producing sequences of real-valued activations, have been around for many decades [25]. Purely supervised neural networks improved significantly during the 1990s and 2000s, which contributed to the success of deep learning and artificial intelligence in general. Three main factors have accelerated AI, making it possible to incorporate AI into various application areas. These key driving factors are threefold:

  1. Better algorithms for supervised and unsupervised learning [26, 27];

  2. Huge volumes of data available from multiple sources on the internet [28,29,30]; and

  3. Improved computational power from heterogeneous architectures [31,32,33,34].

We have moved on from the days of predefined rules (expert systems) and now have better algorithms that learn from examples [25]. Neural networks and machine learning algorithms have contributed to the success of deep learning in various ways; deep learning architectures mainly rely on cascading several neural network layers with various machine-learning-based classifiers. The advent of the internet and social media has increased the volume of useful data generated every second. It is estimated that approximately 300 hours of video are uploaded to YouTube every minute [35], adding to the pool of useful training data. Similarly, a white paper published by Facebook reported that its users upload approximately 350 million new photos each day [36]. These huge volumes of data generated by people around the world have a positive impact on the datasets needed to train supervised learning algorithms such as deep learning. This has been made possible by the fast internet access we enjoy today, and the volume of data is expected to increase further with ubiquitous internet of things devices [27].

Thanks to Moore's law, computing power has increased steadily for the past three decades. As the gains from faster single processors have diminished, performance improvements have increasingly come from concurrent and parallel execution [37]. The dominance of Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs) for applications such as image processing, which naturally benefit from parallel execution, has also contributed to the acceleration of AI [32, 34].

2.2 Current Machine Learning Achievements

There have been significant and key achievements in computer vision that are worth pointing out. The richness of current computer vision algorithms, thanks to deep learning, has made it possible to perform visual recognition with accuracies as high as those of humans, and even to outperform humans in certain instances [38]. These accuracies are, however, limited to recognition and detection (perception) rather than understanding of visual scenes (knowledge). Zhang et al. [39] proposed to leverage emerging deep reinforcement learning techniques to enable model-free control of driverless vehicles, presenting a novel and highly effective control framework that uses a convolutional neural network to extract the necessary information (including traffic flow) and then makes decisions under the guidance of the network. How reinforcement learning is deployed in decision making is illustrated in recent works [40, 41]. Tremendous achievements have also been made in medical imaging, where deep learning approaches have helped with the early detection of tumours in brain images [42].

There have been reported cases of the use of deep learning for the detection of leakage in 3D blood vessels [42]; leaks that were missed by medical professionals were readily detected by machine-learning-based systems. The list in medical imaging continues with the detection of lesions in the eye as well as various cancer cells [42]. Detection of known objects in an image has matured to the level where computer vision techniques can detect multiple objects in a single image, even when partially occluded, with very high accuracy [43]. Work on constructing meaningful sentences from images has also produced impressive results [44]. The use of multiple images for the reconstruction of 3D environments has likewise reached an acceptable level, and the ability to stitch together multiple 2D images to form a 3D panoramic view has been reported by Song et al. [45]. He et al. [46] presented Mask regional convolutional neural network (R-CNN), which is conceptually simple and aims to segment each occurrence of any object in an image. Faster R-CNN [43] has two outputs for each candidate object, a class label and a bounding-box offset; to this, [46] added a third branch that outputs the object mask. Mask R-CNN [46] is thus a natural and intuitive idea, essentially combining two state-of-the-art models (a region proposal network and a binary mask classifier). However, the additional mask output is distinct from the class and box outputs and requires the extraction of a much finer spatial layout of an object, making it possible for such a model to be used innovatively in medical imaging.
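To make the Mask R-CNN description above concrete, the sketch below runs torchvision's pretrained Mask R-CNN on a single image and reads out the three per-instance outputs discussed (class label, bounding box and mask). The image path is a placeholder; this is an off-the-shelf model, not the system proposed later in this paper.

```python
# Instance-segmentation sketch with a pretrained Mask R-CNN
# (torchvision assumed available; the image path is a placeholder).
import torch
from torchvision import transforms
from torchvision.models.detection import maskrcnn_resnet50_fpn
from PIL import Image

model = maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = Image.open("scene.jpg").convert("RGB")  # placeholder image
tensor = transforms.ToTensor()(image)

with torch.no_grad():
    prediction = model([tensor])[0]

# Each detected instance carries a class label, a bounding box and a mask,
# mirroring the two Faster R-CNN outputs plus the added mask branch.
for label, box, score, mask in zip(prediction["labels"], prediction["boxes"],
                                   prediction["scores"], prediction["masks"]):
    if score > 0.5:
        print(int(label), box.tolist(), tuple(mask.shape))
```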

There are also a number of challenges to consider when deploying a deep learning model in a real-world application. When the application domain has limited training data available, deep learning may not be the best choice. Similarly, because the training process makes predictions based on statistical associations, applications that rely heavily on causality rather than correlation may not be a good fit. These are only a few of the challenges associated with the deployment of deep learning, and the next section highlights some of these problems in the real world.

2.3 Areas where Boosting would be Needed

The hype around computer vision continues to grow, and it is worth pointing out that, even though computer vision and AI in general have made significant progress and are increasingly solving complex problems [7,8,9,10], there are several shortfalls [38]. Considering the application areas where computer vision can be used, the few areas it has dominated [1,2,3,4,5] should not overshadow the instances where it underperforms and still needs significant improvement [47], if not a complete overhaul of the existing techniques.

On March 18, 2018, Uber's autonomous car hit and killed a 49-year-old woman as she was walking her bike across the street at night in Tempe, Arizona [48, 49]. With its 360-degree cameras and sensors, the car should have been able to detect someone crossing in front of it, even at night. Safety reports released by Uber in November 2018 said the software that detects obstacles on the road and processes that information was too slow to act. In the same month, on March 23, 2018, a Tesla vehicle in autopilot mode slammed into a concrete road divider, killing its driver [50]. With all the sensors, including cameras, around today's autonomous cars, one could be fooled into thinking these cars can truly operate autonomously. What these systems still struggle with are unfamiliar scenes or combinations of objects that they have not been trained on or cannot interpret correctly [48], as well as real-time processing demands. There is no doubt that the ways in which computer vision techniques have been used in these systems are novel, but we should note that the systems are not yet perfect [38].

Another example supporting this is the fact that a robot trained to open doors struggled to open a significantly different door [51]. This confirms that when such systems are challenged with very different scenarios, their behaviour may not conform to what we expect. In another incident, on 12th July 2016, a Stanford mall security robot collided with a 16-month-old toddler and nearly ran him over [52]. Programmed to predict crimes in schools, businesses, and neighbourhoods, the K5 robot uses video cameras, thermal imaging sensors, a laser range finder, radar, air quality sensors and a microphone to detect irregularities in the area. If it detects any abnormal noise, temperature change, or even the appearance of known criminals, it notifies the authorities. It turns out that the robot did not detect the young boy.

To compare systems trained to understand a scene using deep learning with what is proposed in this work, Table 1 summarises the differences.

Table 1 Some differences between deep-learning-trained systems for scene understanding and what this work proposes.

3 Unwrapping the Failures

These failures point to the fact that while AI and computer vision techniques have performed exceptionally in some areas (such as image categorisation [2] and industrial packaging [4]), this does not mean that new and emerging areas (such as autonomous vehicles [38]) will enjoy the same benefits with simple tweaks. For autonomous cars, the state of the art in computer vision is good and provides bounding boxes around objects in the scene. However, the detected objects are normally pre-trained, and the system recognises variants of such objects individually with a high degree of accuracy. What is missing, and remains challenging, is putting together the individual objects to give a global interpretation of the scene. State-of-the-art recognition systems have no contextual understanding of the scene at a global level; hence such systems have little ability to make acceptable and reasonable scene-level decisions [53].

For example, a navigational robot will generally be able to move from one point to another, avoiding collisions, and arrive at its destination in the most optimal way [54]. It becomes challenging when the robot has to decide, based on the scene rather than on localised objects, to take a safer route that might not necessarily be the shortest path. Such decisions are made by humans intuitively, as they understand how objects interact in a scene [55]. In the case of a robot, however, the decision taken might be optimal but not necessarily feasible or safe. Modelling a crowded scene to infer interactions as well as unusual situations with little or no data poses minimal problems for humans, but it can be incredibly hard for state-of-the-art AI systems to handle. Generally, things that humans find intuitive (such as dealing with complex scenes and walking) tend to be very hard for artificially modelled systems. This goes a long way towards explaining why machine learning techniques with inspiration from biological systems tend to outperform pure statistical models [56]. Deep learning models such as convolutional networks used in computer vision and related application areas represent candidate models for the computations performed in mammalian visual systems [56]. The convolutional layers found in Convolutional Neural Networks (CNNs) mimic how the brain processes information from visual inputs. Visual object recognition frameworks have gained renewed interest with the success of deep neural network models trained to "recognise" objects: these hierarchical feed-forward networks show similarities to the human visual cortex, including categorical separability [57]. Deep neural networks (DNNs) such as convolutional neural networks and recurrent neural networks (RNNs) have existed since around 1990, but only reached an acceptable performance level in the past decade, thanks to the three factors that have accelerated AI in general [58].

4 Pre-DNN Detection and Classification

Prior to the huge drive towards DNNs in computer vision, standard feature detectors such as the scale-invariant feature transform (SIFT) [59], histogram of oriented gradients (HOG) [60] and local binary patterns (LBP) [61] had been the dominant hand-crafted features in many computer vision tasks and neuroscientific studies [62]. In contrast to CNNs, where features are learnt and stacked in a hierarchical structure, features in SIFT, LBP and HOG are hand-crafted. Rather than using different algorithms for object detection and categorisation, with CNNs the same algorithm is adapted for both purposes, at the expense of requiring large volumes of training data [59]. A standard neural network (NN) consists of many simple, connected neurons, each producing a sequence of real-valued activations, and may suffer from the curse of dimensionality [63] or may not scale very well. Input neurons are activated by sensors perceiving the environment, while other neurons are activated through weighted connections from previously active neurons [25]. Unlike standard or shallow NNs [25], deep neural networks have three distinguishing factors that help minimise the common problems associated with standard neural networks. There are more neurons, with varying width, height and depth, in a DNN compared to a shallow NN. DNNs also enforce local connectivity between adjacent neurons and replicate each filter across the entire visual field to allow translational invariance.
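As a reminder of what "hand-crafted" means in practice, the sketch below computes HOG descriptors for an image using scikit-image; no learning is involved, since the descriptor is fixed entirely by its parameters. The image path is a placeholder.

```python
# Hand-crafted HOG feature extraction (scikit-image assumed installed;
# the image path is a placeholder).
from skimage import io, color
from skimage.feature import hog

image = color.rgb2gray(io.imread("scene.jpg"))  # placeholder image

# The descriptor is fully specified by these parameters; nothing is learnt from data.
features, hog_image = hog(
    image,
    orientations=9,
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
    visualize=True,
)
print(features.shape)
```

A classifier such as a support vector machine would then be trained on these fixed descriptors, whereas a CNN learns the descriptors and the classifier jointly.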

The convolutional neural network was first proposed by Kunihiko Fukushima in 1982 [1], whose work was inspired by a 1962 article by Hubel and Wiesel [64] revealing mechanisms of the visual system. Since then there has been a great deal of research in this area, with the most significant contributions being those of LeCun in 1989 [65] and Krizhevsky in 2009 [3]. Deep convolutional networks became illustrious in 2012, when Krizhevsky et al. [28] used CNNs to win the annual computer vision challenge with an impressive 15% error rate, compared to 26% the previous year. Deep convolutional networks belong to a class of deep, feedforward artificial neural networks trained with backpropagation, which transmits error signals backwards through the network. In a CNN, the weights of the convolutional layers are used for feature extraction and the weights of the fully-connected layers are used for classification; both are determined through the training process. Success stories about the rise of CNNs and their ability to learn high-level features for object recognition have increased steadily since 2012 [34] and keep increasing due to the availability of large datasets such as ImageNet [30]. Although deep learning architectures can classify objects in an image with near-human-level performance, other studies have revealed some of the shortfalls of CNNs in computer vision [38, 66]. Nguyen et al. [38] demonstrated that discriminative DNN models are easily fooled into classifying many unrecognisable images, with very high certainty, as members of a recognisable class.
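The split described above, convolutional layers for feature extraction and fully-connected layers for classification, can be seen directly in a minimal CNN definition. The sketch below, written in PyTorch purely for illustration, uses arbitrary layer sizes and a CIFAR-sized input.

```python
# Minimal CNN sketch separating convolutional feature extraction
# from fully-connected classification (PyTorch assumed available).
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Convolutional layers: learnable filters replicated across the image,
        # giving local connectivity and translational invariance.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Fully-connected layer: weights used for the final classification.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = SmallCNN()(torch.randn(1, 3, 32, 32))  # e.g. a CIFAR-sized input
print(logits.shape)  # torch.Size([1, 10])
```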

Here we describe the shortfalls of CNNs across the three key stages of their design: training data requirements, the training process itself, and finally the recognition or testing phase. To properly train a CNN architecture, the training data is expected to be large enough to cover most variations. Typically, to train an architecture to detect birds, the training data is expected to include all kinds of bird orientations, various image resolutions, and all possible actions a bird may perform [25]. This is clearly not the case for humans, who can easily recognise a bird in any state after learning about it in a single state [67]. The training phase and internal architecture of CNNs have long been treated as black boxes; however, because they are computer programs, one can step through them to understand how input images are represented at each stage. Turner et al. [68] used various input images to visualise the output of every single layer of a CNN architecture (the Visual Geometry Group VGG-19 network). In [68] it becomes clear that the internal outputs of the various layers of a CNN may not necessarily say much about the input image. Other forms of visualisation have also been used to analyse the internals of CNN architectures, namely Activation Maximization [66], Network Inversion [69], DeconvNet [70] and Network Dissection [71]. These visualisation techniques not only show the low-level features but also explain the working mechanisms of CNNs in general. It must be noted that the existing visualisation tools have their limitations compared to the capabilities of humans, as reported by neuroscientists studying the mechanisms of the visual system [72]. Another problem associated with CNN training is the demand for huge computational time and power to detect and classify an object [73], compared to the visual system, in which an object can be recalled in a few seconds using minimal resources in the brain [74]. CNN models also involve a large search space (including the depth, the number of feature maps, interconnection patterns, and window sizes for convolution and pooling layers), making it impractical to discover an optimal network structure by any systematic approach [75].
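A simple way to reproduce the kind of layer-by-layer inspection described above (cf. [68]) is to register forward hooks on a pretrained VGG-19 and capture each convolutional layer's activations. The sketch below uses torchvision for illustration; the image path is a placeholder and this is not the exact tooling used in [68].

```python
# Capturing intermediate activations of a pretrained VGG-19 with forward hooks
# (torchvision assumed available; the image path is a placeholder).
import torch
from torchvision import models, transforms
from PIL import Image

model = models.vgg19(pretrained=True).eval()
activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Register a hook on every convolutional layer of the feature extractor.
for idx, layer in enumerate(model.features):
    if isinstance(layer, torch.nn.Conv2d):
        layer.register_forward_hook(save_activation(f"conv_{idx}"))

preprocess = transforms.Compose([
    transforms.Resize(224), transforms.CenterCrop(224), transforms.ToTensor(),
])
image = preprocess(Image.open("scene.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    model(image)

for name, act in activations.items():
    print(name, tuple(act.shape))  # e.g. conv_0 (1, 64, 224, 224)
```

The captured tensors can then be rendered as feature maps, which is the starting point for the visualisation techniques listed above.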

Finally, the recognition or test phase of CNNs still exhibits some level of error, especially in multi-object recognition and classification tasks [28]. Even with huge training data, CNNs cannot produce error-free recognition. Although the error rates are very small, they are still not acceptable for application areas such as autonomous or driverless cars. By comparison, it would be odd to find a sound and healthy individual unable to distinguish between two different items such as an apple and a banana. The reported accidents [48,49,50] related to self-driving vehicles confirm that deep convolutional networks in their current state cannot handle complex situations, alongside the intriguing fact that they can easily be fooled into misinterpreting unrecognisable objects with high confidence [38].

5 Summary of Recommendations

Some of the problems with current artificial systems have been identified, and it will take some time to resolve them [76], if they can ever be fully solved. Computer scientists have long worked with other key disciplines such as physics, engineering and mathematics to solve challenging computer vision problems. What is missing, and might be crucial to addressing most of the challenges in computer vision, is combined expertise from biology and psychology. To succeed in taking computer vision to the next level of robust scene understanding, the most appropriate research direction is multi-disciplinary, involving neuroscientists, psychologists, and physiologists. Biological and physiological data collected from experiments can thus be used to inform the design of models that mimic human vision and interpretation of a scene. There have been some successful examples of cross-fertilisation between visual cognitive neuroscience and CNNs that provide a rationale for multi-disciplinary work on robust scene understanding. Greene et al. [77] presented a model for visual scene categorisation that reflects the functions or actions that can be performed within a scene. The model in [77] is much closer to human scene categorisation and outperformed alternative models such as object-based distance and visual features from CNNs. Another study, by Groen et al. [78], determined the contributions of the models tested in [77] to neural representations in scene-selective cortex by disentangling different types of information in the visual cortex. In addition, [79] shows how strongly a simple property of the visual encoding of an image, its population response magnitude, correlates with its memorability, demonstrating how memory is shaped by visual context. It is, however, worth noting that comparisons of deep network models with empirical electrophysiological, functional magnetic resonance imaging, and behavioural data do not invariably show only similarities between brains and models [80], but at times also discrepancies [76].

The main reason DNNs have achieved state-of-the-art performance has been linked to the human visual system through the way they learn otherwise uninterpretable solutions. This has been reiterated in [75] with the development of a CNN model that borrows biological guidance from the human visual cortex and is capable of determining critical design choices with simple calculations. The model in [75] simulates the V1, V2, V4 and inferotemporal cortex (IT) layers of the human ventral stream, uses convolutional layers of varied sizes and complexities, and increases the use of concurrency for improved processing speed. The design presented by Zhang et al. [75] outperformed seven other CNN techniques to achieve state-of-the-art performance on four widely used benchmark datasets: CIFAR-10, CIFAR-100, SVHN and MNIST. Others similarly advocate that the brain's innate structures, such as its connectivity and mechanisms, will inform deep learning network models and steer them toward more authentic, human-like learning [81]. Rajalingham et al. [76] demonstrated that state-of-the-art deep convolutional neural network models cannot account for the image-level behavioural patterns of primates (humans and macaque monkeys) and made a strong case for the design of new models that precisely capture the neural mechanisms underlying primate object vision. The experiments conducted in [76] confirm that the failure of current DNN models to accurately capture the image-level signatures of primates cannot easily be rectified by modifying the existing architectures; rather, a complete overhaul of the models and architectures is required.

Redmon et al. [82] presented a new approach to object detection, based on the observation that humans glance at an image and instantly know what objects are in it, where they are, and how they interact. The same analogy is used in [76]: primates, including humans, can typically recognise objects in visual images at a glance, despite naturally occurring identity-preserving image transformations. The model presented in [82] reframes object detection as a single regression problem, straight from image pixels to bounding-box coordinates and class probabilities. What makes the YOLO model [82] different from other DNN models is that it does not rely on a sliding window but instead implicitly encodes contextual information about classes as well as their appearance. The model [82] has outperformed other state-of-the-art DNN models, such as Deformable Part Models (DPM) and regional CNNs, on the ImageNet and COCO datasets, confirming the observations in [76].

5.1 Scene Understanding

To understand a scene, Zhou et al. [71] combined various local and global features and used a CNN to learn deep features from the scene in order to assign categories in a human-like manner, under the assumption that high density in an image set implies that images generally have similar neighbours. Semantic segmentation, which assigns a category label to each pixel of an image, is a fundamental yet challenging task in computer vision research [83]. There are other related feature descriptors that are combined in various ways to model and understand a scene using computer vision techniques. Table 2 lists some of the common feature descriptors used in scene understanding.

Table 2 Some key scene understanding features and their importance.
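As a concrete reference point for the semantic segmentation task mentioned above, the sketch below runs an off-the-shelf DeepLabV3 model from torchvision to obtain a per-pixel category map. The image path is a placeholder, and this pretrained model is used only for illustration; it is not the model proposed in this work.

```python
# Semantic-segmentation sketch with a pretrained DeepLabV3
# (torchvision assumed available; the image path is a placeholder).
import torch
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50
from PIL import Image

model = deeplabv3_resnet50(pretrained=True).eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
image = preprocess(Image.open("street.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    output = model(image)["out"]          # per-class scores for every pixel
class_map = output.argmax(dim=1)          # category label assigned to each pixel
print(class_map.shape)                    # (1, H, W)
```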

There have been great advances in research into the modelling of scenes for segmentation and detection using convolutional neural networks [84]. However, as with [71] and related CNN-based scene understanding techniques, segmentation-based convolutional neural networks require tremendous computational power and are not optimal for future autonomous vehicles. To address scene understanding and power consumption together, Gaurav et al. [85] presented a deep spiking neural model that translates a conventional CNN model into its spiking equivalent. That work demonstrates the capabilities of spiking neural architectures as well as the energy efficiency of neuromorphic hardware such as Intel's Loihi [86].

5.2 Object-Scene Appearance Modelling

The strong need for a more robust scene understanding model suitable for application areas such as autonomous or self-driving cars has motivated our proposed model for scene understanding with the use of physiological data. Our work has been motivated by the conclusions drawn by Eckstein et al. [87], who emphasised that missing giant targets is a functional brain strategy for discounting distractors. The work in [87] demonstrates that search is guided toward target sizes consistent with the scene; thus, if targets are scaled to be larger but inconsistent in size with the scene, they are missed more often during visual search. To utilise the results from [87] in our model, we will conduct further experiments (cf. [87]) to understand how humans and primates recognise key objects in a scene. Rather than using synthetic scenes alone, we will combine natural and synthetic scenes as part of our experiments.

Similar to the approach used by Izadinia et al. [88], we argue that the type of scene can be determined by the objects present, their sizes, and their distribution. For example, in a kitchen we expect to see a worktop, cabinets and possibly a kettle or microwave around the worktop. We also aim to avoid the place category as in Zhou et al. [89] and instead provide a list of objects in the scene with associated spatial relationships, considering their relative sizes. Deep convolutional networks take full advantage of the ubiquitous and improved computational power of heterogeneous architectures, and this resource will be utilised in generating our exhaustive list of scene objects and the relationships between them. The key is to establish how primates, especially humans, use spatial relationships between known objects during visual search, beyond attempts relying solely on salience and contextual cues [90]. This also differs from studies that have focused on passive free viewing by humans and monkeys [91, 92]. The proposed model will incorporate global and local descriptors. The need to build a model that utilises the spatial relationships between objects, their aspect ratios, and pair-wise relationships between objects is the driving factor and novelty behind our model, described in the following two sections.

5.3 Eye Tracking Paradigm on Humans

This section details an eye tracking paradigm on humans that will be used to compare results from our computational model. Clues will also be taken from the experimental design to inform our model and make it more biologically inspired. Experimental design: 20 participants will perform a search-identification task. The design contains six conditions obtained by crossing two factors: the number of objects (2, 3 or 5) and the time limit allowed for target search/identification (limited time or self-paced). We will use eye tracking to record eye movements throughout the task, and the eye movement data will be incorporated into the computational model. Each trial is composed of three phases (Fig. 1).
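For clarity, the six conditions follow directly from crossing the two factors. The small sketch below enumerates them and builds a shuffled trial list; the number of trials per condition is an illustrative placeholder, not the finalised design.

```python
# Enumerating the 2 x 3 experimental conditions and building a shuffled trial list
# (the trial count per condition is an illustrative placeholder).
import itertools
import random

set_sizes = [2, 3, 5]                      # number of objects in the search cue
timing = ["limited", "self-paced"]         # time allowed for search/identification
trials_per_condition = 20                  # placeholder

conditions = list(itertools.product(set_sizes, timing))   # six conditions
trials = [{"set_size": s, "timing": t} for s, t in conditions
          for _ in range(trials_per_condition)]
random.shuffle(trials)
print(len(conditions), len(trials))        # 6 conditions, 120 trials
```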

Figure 1 An example trial. Participants initiate a self-paced start with a central cue. A number of objects (two are shown as an example) are then presented as a search cue. Participants then search for and identify the targets (presented at the search cue stage) within a scene, with both eye-fixation and keyboard-press responses.

In terms of experimental material, we will generate 120 scenes using gameplay from The Sims 4 (Maxis Software, Electronic Arts). These scenes are unique and depict various types of indoor scenes, including kitchens, living rooms, bedrooms, classrooms, and offices. Overall, 960 objects are extracted from these scenes (8 objects from each scene), removing any information associated with the original scenes from which they were taken. The objects are made to be of comparable dimensions when presented to participants as part of the trial (second phase, as shown in Fig. 1). To track and record eye movements we use an EyeLink 1000 Plus [93].

Figure 2 Three representative scenes (using kitchens as an example) taken from our simulated scenes.

5.4 Eye Tracking Paradigm on Non-human Primates

In addition to the comparison with humans, we provide a further example to illustrate how eye tracking can be applied to non-human primates. In this study, we trained macaque monkeys to view sets of still images and, after a delay, to choose from three options the one they had viewed previously [94]. In the current context, we would then analyse physiological responses such as saccadic scan-paths, fixations, and even pupil dilation as the animals view and process these still images [95], and compare them with the computational model presented in Section 5.5.

Figure 3 Eye movement tracking for macaque monkeys.

An example trial is shown in Fig. 3: the blue trace depicts the actual saccadic scan-path, whereas the yellow trace depicts the shortest distance between two fixations. The blue circles mark fixations, and their associated numerals represent fixation durations (in ms). The image in Fig. 3 shows the memory test stage of a 3-alternative forced-choice recognition memory trial (the encoding stage is not shown here). The three test images were created using DreamStudio AI [96].
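The straight-line inter-fixation distances shown by the yellow trace in Fig. 3, together with the fixation durations, can be summarised directly from fixation coordinates. The sketch below uses hypothetical fixation data purely for illustration; reproducing the curved scan-path (blue trace) would require the raw eye samples from the tracker.

```python
# Summarising fixation data: straight-line distances between successive fixations
# (cf. the yellow trace in Fig. 3) and fixation durations. Coordinates and durations
# are hypothetical values for illustration only.
import numpy as np

fixations = np.array([[120, 340], [400, 310], [415, 180], [640, 200]])  # (x, y) in px
durations = np.array([210, 180, 260, 150])                              # in ms

inter_fixation = np.linalg.norm(np.diff(fixations, axis=0), axis=1)
print("saccade amplitudes (px):", inter_fixation.round(1))
print("total shortest path (px):", inter_fixation.sum().round(1))
print("total fixation time (ms):", durations.sum())
```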

5.5 Computational Model for Scene Category Recognition

The pre-processing stages of the model will utilise state-of-the-art deep neural networks for target or object detection, followed by the construction of the three unique vectors (spatial, size and pair-wise relationships) that will be learned for known scenes. The emphasis here is on the use of spiking neurons, much closer to the human visual system, to take advantage of their minimal power consumption. The proposed model involves four major activities:

  1. Design of a model that goes beyond object detection and identification;

  2. the introduction of a real-world and novel dataset that can be used to justify the model;

  3. comparing human search performance with other deep learning approaches trained with large-scale images, as well as our model, for object and scene identification;

  4. and finally, making the architecture deployable on low-power neuromorphic hardware.

5.5.1 Design of a Model that goes Beyond Object Detection and Identification

This aspect of our model comprises four tasks. The first involves the use of a pre-trained convolutional neural network (deep learning) to identify all known objects in a scene. The input to the pre-trained CNN model will be a series of images (scenes) representing everyday environments at home (Fig. 2), in the office and on the streets. The main aim here is to identify all known objects in the scene automatically, using the deep learning architecture. The second task is to group all common objects in the scene. For example, images of an office will normally contain a desk, chairs, a keyboard, a monitor and other related objects. The collection of objects in the scene will then be used to form an object-set containing all prominent objects commonly found in the defined environment. A threshold will be used to classify an object as part of the object-set for any given environment; that is, an object will have to appear in a specified number of scenes to be counted as part of that object-set. The third task will use the object-set to generate a vector that represents the spatial relationship between any two objects, including their relative position, distance and orientation. Part of these measures will be acquired during the process of object-set generation. For every pair of objects in an object-set, a three-dimensional vector will be generated to represent the extremal as well as the average values. The fourth and final task will involve the generation of size relationships. Like the spatial relationship, this will be generated for each pair of objects in each object-set, making use of their combined aspect ratio as well as scale-invariant features such as Histogram of Oriented Gradients (HOG), Speeded-Up Robust Features (SURF) and Binary Robust Independent Elementary Features (BRIEF).
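A rough sketch of tasks two and three is given below, under assumed data structures: each detection is a label plus a bounding box, the occurrence threshold is illustrative, and the exact composition of the relation vector (distance, orientation, relative size) is one possible choice rather than the finalised design.

```python
# Sketch of object-set construction (task 2) and pairwise spatial relations (task 3).
# A detection is assumed to be a (label, (x1, y1, x2, y2)) tuple; the threshold and
# the relation-vector definition are illustrative choices, not the final design.
import itertools
import math
from collections import Counter

def build_object_set(scenes, min_scenes=3):
    """Keep labels that appear in at least `min_scenes` scenes of an environment."""
    counts = Counter(label for scene in scenes for label, _ in scene)
    return {label for label, n in counts.items() if n >= min_scenes}

def spatial_relation(box_a, box_b):
    """Illustrative 3-D relation: centre distance, orientation and relative area."""
    (ax1, ay1, ax2, ay2), (bx1, by1, bx2, by2) = box_a, box_b
    ca = ((ax1 + ax2) / 2, (ay1 + ay2) / 2)
    cb = ((bx1 + bx2) / 2, (by1 + by2) / 2)
    dx, dy = cb[0] - ca[0], cb[1] - ca[1]
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return (math.hypot(dx, dy),          # distance between object centres
            math.atan2(dy, dx),          # orientation of b relative to a
            area_a / max(area_b, 1e-6))  # relative size of the pair

def pairwise_relations(scene, object_set):
    """Relation vector for every pair of object-set members present in a scene."""
    objects = [(label, box) for label, box in scene if label in object_set]
    return {(a[0], b[0]): spatial_relation(a[1], b[1])
            for a, b in itertools.combinations(objects, 2)}
```

Aggregating these per-scene relation vectors across many scenes (keeping extremal and average values per object pair) would give the learned descriptors referred to above.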

5.5.2 Introduction of a Real-World and Novel Dataset that can be Used to Justify the Model

This aspect of the model involves building a database of images that will be used to train, test and verify the model. The images will include indoor and outdoor scenes with typical objects found in those environments. To avoid the biases present in most of the available datasets, such as ImageNet and CAFFE, used for training deep CNN architectures, this work considers a collection of ordinary scene images to design a data-acquisition protocol for visual scene understanding in self-driving and surveillance systems. Synthetic images will also be generated to represent atypical scenes for training purposes. As with ImageNet, images available on the internet with themes similar to the scenes tested in this work will also be used.

5.5.3 Architectural Comparison

The last aspect of the model involves comparing the model designed in this work with human search capabilities as well as with off-the-shelf deep learning models (CaffeNet, VGG-16, GoogLeNet, ResNet and YOLOv2) trained offline on large-scale images. The comparison is mainly intended to show how humans and deep learning architectures interpret scenes containing objects of varying size and position. These comparisons will also evaluate the global understanding of the scene to infer possible actions and identify any anomalies.

5.5.4 Neuromorphic Computing

Neuromorphic hardware is specialised hardware designed to mimic the structure and functionality of the human brain; the target implementation in this work is to perform tasks that involve processing information in ways similar to how the brain's neural networks function. The human brain is known to be highly efficient [97] at processing complex information, recognising visual patterns, and adapting to new situations. The traditional von Neumann architecture is not optimised for recognising or interpreting scenes and can be relatively power-hungry and slow for certain types of computation, such as pattern recognition and sensory processing.

Neuromorphic computing aims to address the limitations of the von Neumann architecture by designing hardware architectures inspired by the brain's structure and functionality. Neuromorphic architectures often involve large numbers of simple processing units (neurons) that are interconnected and can communicate with each other. The connections, similar to synapses in the brain, allow for the transmission of signals and the formation of networks that can adapt over time based on experience. Such an architecture, efficient in processing visual information, is the target for the proposed scene understanding system.
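To give a flavour of the neuron models such hardware typically emulates, the sketch below simulates a single leaky integrate-and-fire neuron in discrete time. All constants are illustrative rather than taken from any particular chip.

```python
# Discrete-time leaky integrate-and-fire (LIF) neuron, the basic unit typically
# emulated by neuromorphic hardware. All constants here are illustrative.
import numpy as np

dt, steps = 1e-3, 300                      # 1 ms step, 300 ms of simulated time
tau, v_rest, v_thresh, v_reset = 20e-3, 0.0, 1.0, 0.0

rng = np.random.default_rng(0)
input_current = rng.uniform(0.0, 2.5, steps)   # hypothetical synaptic drive

v, spike_times = v_rest, []
for t in range(steps):
    # Leak toward the resting potential while integrating the input current.
    v += (dt / tau) * (-(v - v_rest) + input_current[t])
    if v >= v_thresh:                      # threshold crossing: emit a spike and reset
        spike_times.append(t * dt)
        v = v_reset

print(f"{len(spike_times)} spikes in {steps * dt * 1000:.0f} ms of simulated time")
```

Because such neurons only communicate through sparse spike events, networks of them can be mapped onto event-driven hardware such as Loihi [86] with very low power consumption.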

6 Concluding Remarks

In this paper, we have reviewed recent advances in deep learning architectures that have taken inspiration from human or primate learning and visual systems, and provided directions for future advances in deep learning inspired by physiological experiments. Following a review of areas that have benefited from deep learning, we outline a physiologically inspired model for scene understanding that encodes three key components: object location, size and category. Human vision can serve as a valuable source of inspiration for bio-inspired computer vision and, for that matter, bio-inspired AI. By effectively emulating the mechanisms and principles underlying human visual perception, bio-inspired computer vision can aim to achieve levels of performance and robustness similar to those of a human when it comes to scene understanding. For example, the selective nature of human vision can be incorporated into bio-inspired AI to prioritise important features or regions in an image and effectively reduce computational cost. Similarly, human vision integrates contextual information such as scene layout, object relationships and semantic context to make sense of visual scenes, attributes that can enhance scene understanding but are hard to build into existing vision systems. The model proposed in this work goes beyond simple object detection and identification: it aims to introduce a novel real-world dataset and ultimately provide a comparison between how humans and deep learning architectures interpret complex, naturalistic scenes.