One of the key problems in robot design lies in providing the machine with an understanding of the world around it. The robot needs to be able not only to detect obstacles and dangers but to understand their nature so it can react to each situation appropriately.
Cobots and Machine Learning
A collaborative robot (cobot) is designed to interact with humans in a shared space and, for example, must be able to differentiate between objects that it needs to pick up and move from the people who may be working alongside it.
Although it is possible to build rules-based models that guide the motion of an autonomous system, it has proven difficult to engineer these systems to be robust and effective in the complex situations that they are likely to face in applications such as cobots in the factory or warehouse environment, or delivery robots. Machine learning provides an alternative path to achieving a solution. It has been demonstrated in numerous applications, from drones that can follow paths through a forest, to self-driving vehicles that are reliable enough to be allowed to run in trials on city streets.
Machine learning in robot design
The key application for machine learning in robot design is that of perception – providing the robot with the ability to react appropriately to the input from cameras and sensors that image the 3D landscape around it. Sensory artificial intelligence provides the robot with the ability to recognise objects in the surrounding environment. Using that understanding, the robot can use pattern matching to learn appropriate behaviours from past experience. And it may learn new situations as they arise through reinforcement-learning techniques.
AI’s Presence in Everyday Devices
AI is becoming more present in our daily lives. Voice assistants such as Amazon’s Alexa and Google’s ‘OK Google’, along with many other web services, depend on these complex algorithms, which run on servers in the cloud. Robot designers will turn to similar approaches, helped both by improvements in hardware performance and by the ability to offload some of their processing to the cloud.
Machine learning dates back more than 50 years, and there are now many approaches to it. The fundamental link between all machine-learning technologies is that they take in data, train a model on that data and then use the derived model to make predictions on new data. The process of training a model is a learning process where the model is exposed to unfamiliar data at each step and is asked to make predictions. Feedback from these predictions in the form of an error term is used to alter the model so that, over time during the training process, the model improves.
Often the model adjustments made for new data will worsen performance on prior samples, so it takes multiple iterations over the training set to achieve consistent performance. Typically, training stops when the error of the model’s predictions no longer improves, a point that may be a local or, ideally, a global minimum. As a result, machine learning has strong links to optimisation and to curve-fitting techniques such as linear regression, in which a line is fitted to a set of data points.
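The predict, measure, adjust cycle described above can be illustrated with a minimal sketch that fits a straight line by gradient descent. The function name `fit_line`, the learning rate and the toy data are assumptions chosen for illustration, not part of any particular library.

```python
def fit_line(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w*x + b by repeatedly nudging w and b against the error gradient."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every sample under the current model.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        # Feedback step: move the parameters a little way downhill.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Toy data generated from y = 2x + 1 (an illustrative assumption).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(w, b)  # close to 2 and 1
```

Each pass over the data corresponds to one iteration over the training set, and the loop would normally stop early once the error ceases to improve rather than run for a fixed count.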
Supervised and Unsupervised Learning
There are many machine-learning algorithms available. An important distinction is between supervised and unsupervised learning. In the latter case, the model is provided with unlabelled data and asked to segment the elements into groups. A common algorithm used for this purpose is k-means clustering. The algorithm works iteratively to assign each data point to one of a number of clusters. It does this by first estimating centroids for each cluster – often by an initial random selection – and then refining its model based on the distance between each data point and the centroids: every point is assigned to its nearest centroid, and each centroid is moved to the mean of the points assigned to it, until the assignments stop changing.
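As a rough sketch of the k-means procedure, the following code alternates between an assignment step and a centroid update. The function names, the fixed iteration count and the six sample points are illustrative assumptions; production code would normally use an established library and a convergence test.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        labels = [min(range(len(centroids)),
                      key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # Update step: recompute each centroid; keep it if its cluster is empty.
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return labels, centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
# Initial centroids chosen at random from the data, with a fixed seed.
start = random.Random(0).sample(points, 2)
labels, centroids = k_means(points, start)
print(labels, centroids)
```

Because the starting centroids are random, different seeds can label the clusters in a different order, and on harder data they can settle on different groupings altogether.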
In robotics, k-means and similar unsupervised clustering approaches have been used to support the automated mapping of unknown spaces by groups of robots. However, for perception-based tasks, supervised learning is currently the most common form of machine learning being applied in research and production robots.
Until recently, one of the most successful techniques for image-recognition tasks was the support vector machine (SVM). This technique is similar to clustering but works with data that has been labelled into two or more classes. The job of the SVM is to determine the parameters that will allow the model to place unlabelled data into the most appropriate class. Although SVMs were used in research for applications such as autonomous vehicles in the late 1990s and early 2000s, their use has largely given way to deep learning.
Deep Learning in Robotics
Deep learning is a modification of the artificial neural network (ANN) technology that was highly publicised in the 1980s and 1990s. That technology itself drew on theories, developed more than half a century earlier, that were inspired by the biology of the animal brain. In a traditional ANN design, artificial neurons are arranged in a small number of layers – an input, an output and a hidden layer. Each neuron in the hidden layer takes in data from every neuron in the input layer, performs a weighted sum and applies an activation function, such as the hyperbolic tangent or logistic function, before passing the result to the output layer.
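The weighted-sum-then-activation behaviour of such a network can be sketched in a few lines. The layer sizes, weights and function names below are arbitrary assumptions for illustration; the hidden layer uses the hyperbolic tangent and the output layer the logistic function.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: a weighted sum per neuron, then the activation."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> tanh hidden layer -> logistic output layer."""
    hidden = dense(inputs, hidden_w, hidden_b, math.tanh)
    return dense(hidden, out_w, out_b, logistic)

# Three inputs, two hidden neurons, one output (illustrative values only).
inputs = [0.5, -1.0, 2.0]
hidden_w = [[0.2, -0.4, 0.1], [-0.3, 0.1, 0.5]]
hidden_b = [0.0, 0.1]
out_w = [[1.0, -1.0]]
out_b = [0.0]
result = forward(inputs, hidden_w, hidden_b, out_w, out_b)
print(result)
```

Each row of a weight matrix holds the connections into one neuron, which is why every hidden neuron sees every input.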
Neural Network Training
Training of the network is typically performed using backpropagation, an approach to optimisation and error reduction that works from the output back to the input – giving the technique its name. Backpropagation calculates the gradient of the error. This gradient is used to perform gradient descent in an attempt to find a set of weight values that are more likely to reduce the error during each epoch of training. This approach to ANN showed early promise. But the need for intensive computing resources to perform backpropagation and its inability to compete with the SVM meant that ANN slipped into relative obscurity. That situation began to reverse with a reinvigoration of deep networks – ANNs with more than one hidden layer – that were first proposed in the 1960s but which foundered because optimising the network weights proved extremely difficult.
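To make the output-to-input direction of backpropagation concrete, here is a minimal sketch for a tiny network with a tanh hidden layer and a single linear output. The network shape, the squared-error measure, the starting weights and the learning rate are all assumptions made for illustration.

```python
import math

def forward(x, w_hidden, w_out):
    """Tiny network: tanh hidden layer feeding one linear output (no biases)."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = sum(w * h for w, h in zip(w_out, hidden))
    return hidden, output

def error(x, target, w_hidden, w_out):
    return 0.5 * (forward(x, w_hidden, w_out)[1] - target) ** 2

def backprop(x, target, w_hidden, w_out):
    """Gradients of the error, worked backwards from the output to the input."""
    hidden, output = forward(x, w_hidden, w_out)
    delta_out = output - target                  # dE/d(output)
    grad_out = [delta_out * h for h in hidden]   # dE/d(w_out[j])
    # Push the error back through each hidden neuron; tanh'(z) = 1 - tanh(z)^2.
    delta_hidden = [delta_out * w * (1.0 - h * h) for w, h in zip(w_out, hidden)]
    grad_hidden = [[d * xi for xi in x] for d in delta_hidden]
    return grad_hidden, grad_out

def train(x, target, w_hidden, w_out, learning_rate=0.1, steps=50):
    """Plain gradient descent: step each weight against its gradient."""
    for _ in range(steps):
        grad_hidden, grad_out = backprop(x, target, w_hidden, w_out)
        w_hidden = [[w - learning_rate * g for w, g in zip(w_row, g_row)]
                    for w_row, g_row in zip(w_hidden, grad_hidden)]
        w_out = [w - learning_rate * g for w, g in zip(w_out, grad_out)]
    return w_hidden, w_out

x, target = [1.0, 0.5], 1.0
w_hidden = [[0.1, -0.2], [0.3, 0.4]]
w_out = [0.5, -0.5]
before = error(x, target, w_hidden, w_out)
trained_hidden, trained_out = train(x, target, w_hidden, w_out)
after = error(x, target, trained_hidden, trained_out)
print(before, after)  # the error should fall after training
```

A real network repeats the same chain-rule step through every layer and over many training samples, which is where the heavy compute demand comes from.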
A key development was the application of a more efficient approach to training and backpropagation developed by Geoffrey Hinton and Ruslan Salakhutdinov, working at the University of Toronto in the mid-2000s. The development was aided by the massive improvement in compute performance compared to the early 1990s, first with multi-core CPUs and then with GPUs. Increases in model performance came with the application of refinements to the fully connected architecture that had been proposed over the previous two decades. One was to introduce convolutional layers interspersed between fully connected layers.
Convolutional Neural Networks (CNNs)
Convolution is a matrix operation that applies a small array of weights, referred to here as a feature map and often called a kernel or filter, to an array of data – pixels in the case of image recognition.
The feature map can be regarded as a filter. Convolutions of this kind are frequently used in image processing to blur images or to find sharp edges. They also provide a way of converting data in the spatial domain to a representation based on the frequency domain, where waves are superimposed on each other to form the overall image. As a result, convolutions make it possible to convert pixel arrays into collections of features that can be worked on independently by the following layers.
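The blur and edge-finding uses of convolution can be sketched with two hand-written 3x3 filters applied to a small image containing a vertical step in brightness. The filter values, the image and the function name are illustrative assumptions; in a CNN the filter weights would be learned rather than fixed.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image ('valid' positions only) and return
    the weighted sum at each position. As in most CNN code, the kernel is
    not flipped, so strictly speaking this is cross-correlation."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

# Averaging filter: every output pixel is the mean of its 3x3 neighbourhood.
BLUR = [[1.0 / 9.0] * 3 for _ in range(3)]
# Responds where brightness changes from left to right.
VERTICAL_EDGE = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

# 4x6 image: dark on the left, bright on the right.
image = [[0, 0, 0, 9, 9, 9] for _ in range(4)]

edges = convolve2d(image, VERTICAL_EDGE)
blurred = convolve2d(image, BLUR)
print(edges)    # large values only around the step
print(blurred)  # the step is smeared into a ramp
```

The edge filter produces zero over the flat regions and a strong response at the boundary, which is the kind of localised feature that later layers can build on.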
In contrast to the conventional use of convolution in image processing, the feature maps are learned as part of the ANN training process. This makes it possible for the model to adapt to differences in the training set that make it easier to distinguish between examples. For example, feature maps tuned to detect differences in shape will be most appropriate for general image-recognition tasks. Feature maps optimised for colour will be favoured in situations where the objects to be separated have similar shapes but are differentiated by their surface attributes.
The Efficiency and Organisation of Convolutional Neural Networks
One major advantage of the convolutional layer is compute efficiency. It is easier to implement in an ANN as it employs far fewer connections per neuron than fully connected layers, and maps readily to GPUs and other parallel-processing architectures with single-instruction, multiple-data (SIMD) arithmetic units. Another attribute of convolutional layers is that the design resembles the organisation of neurons in the visual cortex of the organic brain, which is different to the more highly connected regions used for cognition.
Multiple convolutional layers are often used in series in deep-learning architectures. Each successive layer filters the image for increasingly abstract content. In a convolutional neural network (CNN), a set of convolution layers is often followed by a pooling layer. These pooling layers combine the outputs from multiple neurons to produce a single output – producing a sub-sampling effect – that can be fed to multiple inputs in the following layer. This pooling has the effect of concentrating information and steering it to the most appropriate set of neurons that follow. The benefit of their use is that they improve the performance of recognition operations on images where important features may move around within the input. For example, a person’s face may move around in the image field as the robot approaches. Pooling layers help ensure that features activated by the shape and colour consistent with those of a face are steered towards neurons that can perform a more detailed analysis. Training on images in which faces are offset and rotated helps build the connections between the most appropriate neurons.
When Machines Surpass Humans: Deep Learning’s Recognition Prowess
There are different kinds of pooling operations. A max-pooling layer, for example, takes the maximum value from the inputs and passes that on. The highly influential AlexNet entry to the ImageNet LSVRC-2012 contest employed these structures. AlexNet comprised five convolution layers, three fully connected layers, and three max-pooling stages.
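A minimal sketch of max pooling shows both the sub-sampling effect and the tolerance to small shifts in feature position. The 2x2 window, the stride of 2 and the sample activations are assumptions chosen for illustration and are not taken from any specific network.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep only the largest activation in each size x size window."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [[max(feature_map[r * stride + i][c * stride + j]
                 for i in range(size) for j in range(size))
             for c in range(out_w)]
            for r in range(out_h)]

# A strong activation (5) near the top-left corner of a 4x4 map.
activations = [[0, 0, 0, 0],
               [0, 5, 0, 0],
               [0, 0, 0, 0],
               [0, 0, 0, 1]]
# The same feature shifted by one pixel up and to the left.
shifted = [[5, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 1]]

pooled = max_pool2d(activations)
pooled_shifted = max_pool2d(shifted)
print(pooled, pooled_shifted)
```

Both inputs pool to the same 2x2 output because the shift stays within one pooling window, so the following layer sees an identical signal.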
A further improvement to training performance came with the adoption of stochastic gradient descent (SGD) as the mechanism for applying the gradients calculated during backpropagation. This was primarily a choice made for computational efficiency, as it uses a small, randomly chosen sub-set of the training data to estimate the gradient at each step. However, the random-walk effect of SGD can also help the optimisation escape poor local minima, so it often reaches a good minimum faster and more frequently than previous techniques.
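The mini-batch idea behind SGD can be sketched on a simple line-fitting problem, where each update uses only a few randomly chosen samples rather than the whole training set. The batch size, learning rate, seed and toy data are illustrative assumptions.

```python
import random

def sgd_fit(xs, ys, learning_rate=0.1, batch_size=4, epochs=200, seed=0):
    """Fit y = w*x + b, estimating the gradient from small random batches."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            # Gradient estimate from this batch only, not the full data set.
            grad_w = 2.0 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = 2.0 * sum(errors) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

# Toy data generated from y = 3x - 1 (an illustrative assumption).
xs = [i / 10.0 for i in range(20)]
ys = [3.0 * x - 1.0 for x in xs]
w, b = sgd_fit(xs, ys)
print(w, b)  # close to 3 and -1
```

Each batch gives a slightly different gradient, which is the source of the random-walk behaviour; on this noise-free data all batches agree at the solution, so the estimate still settles there.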
Not long after deep-learning architectures were first employed, researchers at IDSIA in Switzerland showed that the machines could outperform humans on recognition tasks. In one experiment, the CNN could correctly identify heavily damaged road signs because it was able to make use of visual features that humans would normally ignore. However, this ability to make use of non-obvious features can be a weakness with current approaches based on ANNs.
CNN Architecture for Optimal Performance
Poor selection of training materials can cause the network to train on elements that will lead to mistakes in the field. Researchers have found in recent years that changing just a single pixel in an image can be enough to make a network provide the wrong classification. Analysis of the weights chosen by one CNN indicated that, in trying to classify cats, the network had learned to use unrelated markings in some of the training images as part of the identification. Networks will also sometimes claim a successful classification for an image that is only noise.
The architecture of the CNN should be chosen to fit the application. There is no one-size-fits-all architecture. Decisions as to the number and ordering of convolutional, pooling and fully connected layers have a strong impact on performance. And the feature map and kernel sizes for each of the convolutional layers provide trade-offs between performance, memory usage and compute resources.
The classical feedforward architecture of the basic CNN is far from being the only option, particularly as deep learning moves from classification tasks to control. Feedback is becoming an element of the design in applications such as voice recognition. Recurrent neural networks use feedback loops. Memory networks make use of elements other than neurons to hold temporary data that can be used to store contextual information that is likely to be useful in applications that call for a degree of planning, which may include systems that control robot behaviour and motion. Another option is the adversarial architecture, based on two linked networks. The competition between them helps avoid the risk of a single network making fundamental mistakes. As the technology continues to develop, we can expect other novel architectures to emerge.
Cloud-Based Robotics: Leveraging Compute Resources
Supervised learning is different from the organic experience in that training and execution occur in different phases: the network does not typically learn as it runs. However, in order to ensure the system is able to meet new challenges, it can be important to perform training sessions on recorded data, particularly if the system flags them as situations that led to errors or poor performance.
For control of the core robot functions, reinforcement learning is often employed. This rewards the robot during training for ‘good’ behaviour and penalises poor decisions. In contrast to simple image-classification tasks, forward planning is a vital component of the process. This calls for the use of discounting techniques to tune rewards for decisions made in a given state. With a discounting factor of 0.5, for example, a reward that lies three state changes ahead will be worth just one-eighth of its original value. This will cause the machine-learning network to pursue near-term rewards. A higher discounting factor will push the network to consider longer-term outcomes.
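The effect of the discounting factor can be checked with a short calculation. The function name and the two reward sequences below are hypothetical and serve only to show how the factor shifts the balance between near-term and longer-term rewards.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards in which a reward k steps ahead is scaled by gamma**k."""
    return sum(r * gamma ** k for k, r in enumerate(rewards))

print(0.5 ** 3)  # 0.125, one-eighth of the original value after three steps

# Two hypothetical plans: a small reward now or a larger one three steps later.
quick = [1, 0, 0, 0]
patient = [0, 0, 0, 4]

for gamma in (0.5, 0.9):
    print(gamma, discounted_return(quick, gamma), discounted_return(patient, gamma))
```

With a factor of 0.5 the quick plan scores higher, whereas with 0.9 the patient plan wins, which matches the behaviour described above.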
A key question for designers of robots is where training occurs. The separation of training and the inferencing needed during execution provides an opportunity to offload the most compute-intensive part of the problem to remote servers. Inferencing can take place in real time using less hardware while servers perform training updates in batches overnight. The cloud environment provides access to standard tools such as Caffe and TensorFlow that can be used to design, build and test different CNN strategies.
Optimising Hardware for Efficient CNN Processing in Robotics
With a hardware platform optimised for inferencing, designers can take advantage of some features of CNN architecture to improve processing efficiency. Typically, the backpropagation calculations used during training demand high-precision floating-point arithmetic. This keeps errors to a minimum. The processes of normalisation and regularisation work to reduce the size of individual weights on each neuronal input. These steps are needed to prevent a small number of nodes developing strong weights that reduce overall performance.
As a result of these steps, some weights will reduce to very low levels and, in the optimisation process, can be set to zero. In the runtime application, the calculations for these connections can be dropped entirely. In many of the interneural connections with low significance, the weighted-sum calculation can tolerate increased errors from the use of low-precision fixed-point arithmetic. Often 8-bit fixed-point arithmetic is sufficient. And, for some connections, 4-bit resolution has been found not to increase errors significantly. This favours hardware platforms that offer high flexibility over numeric precision. A number of microprocessors with SIMD execution units will handle low-precision arithmetic operations in parallel. Field-programmable gate arrays (FPGAs) provide the ability to fine-tune arithmetic precision. An upcoming generation of coarse-grained reconfigurable arrays (CGRAs) optimised for deep learning will provide an intermediate solution between microprocessors and FPGAs. They will help improve performance and make AI-enabled robots and cobots more feasible.
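A rough sketch of the low-precision idea follows: trained floating-point weights are mapped to signed integer codes that share one scale factor, and connections whose code is zero are skipped. The single shared scale, the rounding scheme, the function names and the sample weights are assumptions for illustration; real inference toolchains use more elaborate schemes.

```python
def quantise(weights, bits=8):
    """Map floats to signed fixed-point integer codes sharing one scale factor."""
    levels = 2 ** (bits - 1) - 1           # 127 for 8 bits, 7 for 4 bits
    largest = max(abs(w) for w in weights)
    scale = largest / levels if largest else 1.0
    return [round(w / scale) for w in weights], scale

def dequantise(codes, scale):
    return [c * scale for c in codes]

def sparse_weighted_sum(codes, scale, inputs):
    """Integer-coded weighted sum that skips connections with a zero weight."""
    return scale * sum(c * x for c, x in zip(codes, inputs) if c != 0)

# Hypothetical trained weights, including one tiny and one zero weight.
weights = [0.82, -0.41, 0.003, 0.0, 0.27, -0.9]
inputs = [1.0, 0.5, -2.0, 3.0, 0.25, -1.0]

exact = sum(w * x for w, x in zip(weights, inputs))
codes8, scale8 = quantise(weights, bits=8)
codes4, scale4 = quantise(weights, bits=4)
print(exact)
print(codes8, sparse_weighted_sum(codes8, scale8, inputs))
print(codes4, sparse_weighted_sum(codes4, scale4, inputs))
```

The very small weight rounds to zero at 8-bit precision, so its multiplication disappears along with that of the weight that was already zero, while the 4-bit version trades a larger error for a narrower datapath.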