An Introduction to Vision AI for Business

Tools and Technologies

In this section, we will cover some of the latest tools associated with Vision AI.

Keywords

  • Artificial Intelligence
  • AI
  • Big data
  • Vision AI
  • Computer Vision
  • Analytics
  • Image Processing
  • Deep Learning Neural Network Architecture
  • CNN
  • RNN
  • Faster R-CNN
  • VGG-16
  • Transfer Learning
  • Embedding
  • Keras

About this video

Author(s): Neena Sathi, Arvind Sathi
First online: 03 July 2021
DOI: https://doi.org/10.1007/978-3-030-78761-5_3
Online ISBN: 978-3-030-78761-5
Publisher: Palgrave Macmillan
Copyright: © The Editor(s) (if applicable) and The Author(s), under exclusive licence to Springer Nature Switzerland AG, part of Springer Nature 2021

Video Transcript

Tools and technologies. Computer vision is transforming some of today’s most critical and complex challenges, from predictive maintenance to medical imaging. There are over a dozen companies in this area. Forrester has published an analyst report listing a range of criteria for evaluating these computer vision companies, including the major insights they offer. In this section, we will cover some of these tool vendors and technologies associated with Vision AI.

As we speak, many companies are emerging that provide commercial computer vision offerings for classifying images and detecting features. Each AI vendor is beginning to provide domain-specific libraries for classifying and identifying key features or objects in an image. These vendors expose their computer vision APIs through very simple steps: pick the pre-trained model you would like to use, upload an image, view the results, and so on. We will review results from a couple of these commercial offerings.

I took a picture of a Beyond Burger from my refrigerator and tried it with a couple of commercial offerings in this area. Here are the results from the Google Vision API, using the wrapper of my favorite Beyond Burger. As you can see from the results, Google broke the image into multiple blocks and provided key entities, such as Beyond Meat and veggie burger, along with the text associated with each block. It identified over 20 blocks. Here is an example of the extracted text from blocks 12 and 13. Here is another computer vision example for our Beyond Burger. This time we used the computer vision APIs provided by Microsoft to classify and extract text from the Beyond Burger label. As you can see from the extracted results, the Microsoft Computer Vision API provides the objects associated with the image and the text associated with the image, each with a confidence factor. It also provides a description of the object, as shown here.
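To make this concrete, here is a minimal sketch of what such an API call looks like in Python, assuming the google-cloud-vision client library is installed, credentials are configured, and a hypothetical label photo named beyond_burger.jpg:

    # Extract text from a product label with the Google Cloud Vision API.
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()

    # "beyond_burger.jpg" is a hypothetical photo of the label.
    with open("beyond_burger.jpg", "rb") as f:
        image = vision.Image(content=f.read())

    response = client.text_detection(image=image)

    # The first annotation is the full text; the rest are per-block entries,
    # matching the block-by-block results described above.
    for annotation in response.text_annotations:
        print(annotation.description)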

Finally, let’s use the food recognition model developed by another vendor, Clarifai, to classify the Beyond Burger label, and see what it finds. Computer vision: key technologies. We have a several-hour-long course on Vision AI technologies covering many of the key computer vision technologies. Here, we will introduce some of the key technologies behind Vision AI. These include deep learning; computer vision models such as CNN, RNN, and Faster R-CNN; embedding; transfer learning; and the very popular open-source computer vision library called Keras.
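Since Keras, VGG-16, and transfer learning are named together here, a minimal sketch of transfer learning with a pre-trained VGG-16 in Keras follows; the input shape and the three-class head are illustrative placeholders, not details from the video:

    # Transfer learning: reuse VGG-16's ImageNet features, train a new head.
    from tensorflow import keras
    from tensorflow.keras import layers

    base = keras.applications.VGG16(weights="imagenet",
                                    include_top=False,
                                    input_shape=(224, 224, 3))
    base.trainable = False  # freeze the pre-trained convolutional layers

    model = keras.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(3, activation="softmax"),  # illustrative task-specific head
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")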

What is deep learning? With massive amounts of computational power, machines can now recognize objects and translate speech in real time. So how does a deep learning model work? Deep learning is a multi-step process of pattern recognition. There are a number of hidden layers that carry a large number of regression equations, each representing the deduction or derivation of a higher-level abstraction of the input data. For example, my input data for images could be a bitmap, where each pixel represents either a dark or a light color. Note that we never recognize the information pixel by pixel; we abstract these pixels into edges, into shapes, into faces, into features. A number of these hidden layers may never surface in our representation, but they do get processed. The focus of the model is on the final result, not on the intermediate features.
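As a hedged illustration of the stacked hidden layers described above, here is a minimal Keras model; the input size and layer widths are illustrative assumptions, not taken from the video:

    # A deep model: each hidden layer derives a higher-level abstraction.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        layers.Input(shape=(784,)),              # e.g. a flattened 28x28 bitmap
        layers.Dense(256, activation="relu"),    # lower-level abstractions
        layers.Dense(64, activation="relu"),     # higher-level abstractions
        layers.Dense(10, activation="softmax"),  # only the final result surfaces
    ])
    model.summary()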

Neural network architectures. The underlying technology behind computer vision is called neural network architecture, and there are many flavors of this technology. An artificial neural network, or ANN, is based on a collection of connected units or nodes called artificial neurons. The ANN was the first architecture built using neural network technology. It gave us the ability to solve problems using a fully connected, well-defined neural network, where every node was connected to every other node and there were no hidden layers. The problem with the ANN was that it became too detailed; there were too many regression equations.
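For contrast with the deeper models that follow, the fully connected, no-hidden-layer setup described here can be sketched in a single Keras layer; the sizes are illustrative assumptions:

    # The simplest fully connected setup: inputs map directly to outputs,
    # with no hidden layers in between.
    from tensorflow import keras
    from tensorflow.keras import layers

    ann = keras.Sequential([
        layers.Input(shape=(784,)),
        layers.Dense(10, activation="softmax"),  # every input feeds every output
    ])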

A number of techniques came out that reduced the size of the computations. The first of these was the Convolutional Neural Network, or CNN. CNNs are regularized versions of multilayer perceptrons. A CNN uses convolutional filters as a way of detecting edges. The key assumption was that for domains like computer vision, it is not necessarily the entire image but the edges that make up the primary interpretation, or recognition, task. The CNN is a popular mechanism for image classification.
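A minimal CNN classifier in Keras might look like the sketch below; the 32x32 RGB input and ten output classes are illustrative assumptions:

    # A small CNN: convolutional filters act as learnable edge detectors.
    from tensorflow import keras
    from tensorflow.keras import layers

    cnn = keras.Sequential([
        layers.Input(shape=(32, 32, 3)),
        layers.Conv2D(32, (3, 3), activation="relu"),  # edge-like filters
        layers.MaxPooling2D((2, 2)),                   # shrink the feature map
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(10, activation="softmax"),
    ])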

The next one to come out was the Recurrent Neural Network, or RNN. Here, the connections between nodes form a directed graph along a temporal sequence. The key assumption of the Recurrent Neural Network is that there is a sequencing, so any information in a sequence depends on something prior to it. This could be text or it could be an image: as we read left to right, there is a dependency. Tables, where you have a row header and a value, are another good example. The RNN takes advantage of the sequencing of the information to do its detection.
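A minimal sequence model in Keras is sketched below; the sequence length, feature size, and two-class output are illustrative assumptions:

    # A small RNN: the hidden state carries context from earlier steps,
    # so each output depends on what came before it in the sequence.
    from tensorflow import keras
    from tensorflow.keras import layers

    rnn = keras.Sequential([
        layers.Input(shape=(20, 8)),  # 20 time steps, 8 features per step
        layers.SimpleRNN(32),
        layers.Dense(2, activation="softmax"),
    ])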

The next one was the Region Proposal Network, or RPN, which applied regions to the CNN architecture. This was the first attempt to bridge the gap between image classification and object detection. It showed how a CNN can lead to dramatically higher object detection performance by applying recognition to a region or set of regions. The key assumption here is that a typical image contains regions, and each region independently provides us with part of the interpretation. Each of these regions can be processed in parallel by the algorithm, each producing an interpretation of its part of the image independent of the other components. Faster R-CNN is a popular mechanism for image and object interpretation, and I have applied it successfully in several case studies in the inventory management domain.
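Keras does not bundle a Faster R-CNN, so as one hedged illustration, here is how a pre-trained detector can be loaded from the torchvision library; the image file name is a hypothetical placeholder:

    # Run a pre-trained Faster R-CNN object detector from torchvision.
    import torch
    import torchvision
    from torchvision.io import read_image
    from torchvision.transforms.functional import convert_image_dtype

    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    model.eval()

    # "shelf.jpg" is a hypothetical photo, e.g. of inventory on a shelf.
    img = convert_image_dtype(read_image("shelf.jpg"), torch.float)
    with torch.no_grad():
        detections = model([img])[0]  # dict of boxes, labels, and scores

    print(detections["boxes"], detections["labels"], detections["scores"])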

The next one to come was the Generative Adversarial Network, or GAN. The key idea behind a GAN is that there are two neural networks: the first one creates an image and the second one evaluates it. In this case, it is primarily meant for creating test data or test images. You can use GANs to create an image by imitating another image, and thereby generate a lot of test images, or variations of the original image. For example, I can take an original image and change it by creating some systematic variations. Generative Adversarial Networks are powerful machine learning models capable of generating realistic images, videos, and voice outputs.

GANs have widespread applications, from improving cybersecurity by fighting adversarial attacks, and anonymizing data to preserve privacy, to generating state-of-the-art images, colorizing black-and-white images, increasing image resolution, creating avatars, turning 2D images into 3D, and more.
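Returning to the two-network mechanics described above, a minimal Keras sketch of a GAN pair follows; the noise dimension and 28x28 single-channel images are illustrative assumptions, and the adversarial training loop is omitted:

    # A GAN pair: the generator creates images from noise, the
    # discriminator judges whether an image is real or generated.
    from tensorflow import keras
    from tensorflow.keras import layers

    generator = keras.Sequential([
        layers.Input(shape=(100,)),                    # random noise vector
        layers.Dense(7 * 7 * 64, activation="relu"),
        layers.Reshape((7, 7, 64)),
        layers.Conv2DTranspose(1, (4, 4), strides=4, padding="same",
                               activation="sigmoid"),  # 28x28 generated image
    ])

    discriminator = keras.Sequential([
        layers.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.Flatten(),
        layers.Dense(1, activation="sigmoid"),         # probability image is real
    ])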

Last but not least are attention models. Humans do not actively use all the information available to them from the environment at any given time; they focus on the specific portion of the data that is relevant to the task. Attention models are input processing techniques for neural networks that allow the network to focus on a specific aspect of a complex input, one at a time, until the entire data set is categorized. You start from a larger image, but you utilize a focus area to zoom in on one part of the image. This is a two-stage process: in the first stage you define the focus area, and in the second you actually find something in that focus area. Attention models require continuous reinforcement, or backpropagation training, to be effective.
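As a hedged illustration, Keras ships a built-in Attention layer that computes where focus should be placed; the tensor shapes below are illustrative assumptions:

    # Attention weights show how much focus each query places on each
    # position of the input sequence.
    import tensorflow as tf
    from tensorflow.keras import layers

    query = tf.random.normal((1, 4, 16))   # 4 query positions, 16 features
    value = tf.random.normal((1, 10, 16))  # 10 input positions to attend over

    attended, weights = layers.Attention()(
        [query, value], return_attention_scores=True)

    print(weights.shape)  # (1, 4, 10): one focus score per input position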