


Bachelor thesis

Tobias Hassenklöver

Classification of highly variant patterns with convolution networks

Faculty of Engineering and Computer Science
Department of Computer Science

Tobias Hassenklöver

Classification of highly variant patterns with convolution networks

Bachelor thesis submitted as part of the Bachelor examination in the Computer Engineering course at the Department of Computer Science of the Faculty of Engineering and Computer Science at the Hamburg University of Applied Sciences

Supervising examiner: Prof. Dr.-Ing. Andreas Meisel
Second reviewer: Prof. Dr. Wolfgang Fohl

Submitted on January 16, 2012

Tobias Hassenklöver

Topic of the bachelor thesis
Classification of highly variant patterns with convolution networks

Keywords
convolution networks, face recognition, object recognition, convolution, filters, neural networks

Abstract
The classification of objects, and especially of humans, by software has been a difficult problem for years. These highly variant forms are recorded optically and so far have often been recognized with neural networks in combination with statistical methods. A new method for recognizing highly variant patterns such as characters, objects or people is the convolution network. Convolution networks are a variant of neural networks that make their decisions on the basis of algorithms from image processing. In this thesis the limits of detection of convolution networks are tested. In tests with differently modified input data, the effect on the classification accuracy is examined.

Table of Contents

List of Figures

1. Introduction
   1.1. Motivation and Objective
   1.2. Outline
2. State of the Art
   Convolution Networks
   Objective of the Work
   Extension Possibilities
      Stereo input data
      Recognition of three-dimensional input vectors
3. Classification of patterns
   3.1. Neural networks
      Biological neural networks
      Artificial neural networks
   3.2. Principal component analysis
      Analysis
      Reduction of the dimensions
4. Convolution networks
   Functionality and structure
   Convolution layers
   Learning process
5. Realization and test
   5.1. Test data
      Creation of training and test data
      Noise of image data
   5.2. Test setup
      Matlab LeNet

      EBlearn++ Convolutional Neural Network
   5.3. Results
      General face recognition
      Recognition of specific faces
6. Discussion

Bibliography

A. Training Convolutional Neural Networks
   A.1. Training for Persnet
   A.2. Training for Matlab CNN
B. Starting the Persnet application
C. Using the toolchain for image and training data
   C.1. Generation of images from video streams
   C.2. Conversion of images to the MNIST format
   C.3. Background noise from facial images

List of Figures

1.1. A string of characters handwritten and as a digital pattern (left) and two highly variant forms of chairs [8] (right)
1.2. Highly variant letters that differ in rotation, distortion, size, noise and sharpness
1.3. Three license plates [9] [1] [2] in different perspectives and degrees of recognizability
2.1. Sketched structure of a convolution network with stereo input data
2.2. Three-dimensional point cloud of a person [10]
3.1. Pyramidal cells (highlighted in green) in the human cerebral cortex [31]
3.2. Neural network with two layers of neurons and three patterns to be recognized
3.3. Inner structure of a perceptron neuron
3.4. Examples of possible activation functions φ
3.5. Structure of a small artificial neural network. Input data x_1 ... x_n, processed by neurons e_1 ... e_n and v_1 ... v_n and output from neuron a as result y
3.6. A neural network with several layers. (a) input layer, (b) hidden layer, (c) output layer
3.7. Structure of the training of a perceptron neuron
3.8. Error mountains of a neuron with two input values and weights
3.9. The gradient descent to find the smallest error with the backpropagation algorithm. Each red arrow represents an iteration of the gradient descent
3.10. Applying the backpropagation algorithm to an entire network
3.11. Point cloud [21] with its first principal component - a straight line that approximates all points
3.12. Point cloud for which the first two principal components r and h are calculated (left). The principal components suffice to display the data in a new coordinate system (right)
Convolution of a two-dimensional vector with a 3x3 convolution mask

Unfiltered image (left), image filtered with a 3x3 Laplace convolution mask (center) and image filtered with a randomly initialized 3x3 convolution mask (right)
Example of the localization of similar patterns by a 1x3 correlation mask in a row of pixels
Structure of the layers of a convolution network
Structure of a recording from the image database muct [18]. A series of images is created in which five images are taken in parallel from different angles
Four examples from the image database Faces in the Wild [30]
Some examples of the background classification group. Object sections, differently colored backgrounds or edges were used as training images
The individual processing steps of the toolchain. Random noise (left), the noise depth-filtered (middle) and the face inserted into the noise (right)
Structure of the layers of the LeNet-5 convolution network

1. Introduction

1.1. Motivation and Objective

The recognition of shapes, objects and people by software is still a difficult problem to solve. These variant patterns are recorded optically and processed further digitally. Over the past few years, research has developed many methods for recognizing and classifying these patterns. The patterns can be roughly divided into two categories:

- two-dimensional patterns, such as characters, symbols or stamped parts. An application example is the recognition of handwriting.
- three-dimensional objects that are depicted in two-dimensional images. A practical application is the recognition of faces or license plates in a picture or video. Among three-dimensional objects, a further distinction is made between dynamic, living and rigid ones.

The patterns in both categories are subject to certain variances, which make detection more difficult. In the case of two-dimensional patterns, such as characters, these variances can be rotation, scaling, different spellings, line widths or distortions. With three-dimensional patterns, additional effects such as perspective, the influence of light, shadow, depth and size of the object come into play. An example of the different complexity of two-dimensional and three-dimensional objects is shown in Figure 1.1. The character strings on the left of the figure have a comparatively low variance in contrast to the chairs on the right. The left string consists of four symbols, which are evaluated individually when the string is classified. Each symbol has comparatively little rotation, color deviation or distortion compared to the digital font sample next to it; the pattern therefore has little variance. If the chairs are to be classified optically, several problems arise. A chair is a three-dimensional object represented in a two-dimensional image.
This three-dimensional object can, as can be seen with the right chair, be subject to rotation and deviate greatly from the basic shape. Despite the

differences in shape and rotation, both figures must be classified as a chair. Letters and numbers can also be exposed to rotation and distortion, which likewise makes them highly variant.

Figure 1.1: A string of characters handwritten and as a digital pattern (left) and two highly variant forms of chairs [8] (right)

These variant patterns can be recognized with a combination of different pattern recognition methods; many classification methods, however, fail when the patterns vary strongly. The classification of a pattern is based on a description of what is to be recognized. The closer the pattern to be recognized corresponds to this description, the higher the chance that the method will classify it correctly. This problem is illustrated in Figure 1.2 using the example of handwriting recognition. Here the letters i and a are shown with different variances.

Figure 1.2: Highly variant letters, which differ in rotation, distortion, size, noise and sharpness

In the figure, the shapes of the letters with little variance are shown on the left and shapes of the letters with stronger variance on the right.

Distortions, slight and strong rotations, noise and smearing make detection extremely difficult. In the bottom row there is the additional difficulty of recognizing whether the letter is an a or a d. Many methods improve their detection by correcting certain forms of variance. In this way, noise can be recognized and suppressed, distortion can be counteracted by an affine transformation, or weak line widths can be made more distinct with local contrast enhancement. To do this, however, the variances in the respective application area must be known. An example of this is the recognition of license plates; three plates are shown in Figure 1.3.

Figure 1.3: Three license plates [9] [1] [2] in different perspectives and degrees of recognizability

Only small variances in the letters and numbers can be seen on the license plate on the left. After determining the position of the license plate in the image, the characters can be evaluated one after the other. Recognition of the middle plate is more difficult: parts of the license plate are dirty, and the evaluation could classify numbers incorrectly or skip reading them completely. If one is aware of this problem when designing the detection process, it can be counteracted by a combination of corrective measures. In the picture on the right, the viewing angle of the license plate varies. Here the procedure would have to be adjusted again in order to correct the angle of the recording and its effect on the view of the license plate with an affine transformation. Individual variances are either known when the classification process is designed or can be recognized and corrected at runtime. Since patterns in many areas of application do not vary predictably, new methods have to be developed and existing ones combined in order to recognize patterns better despite variations. In this thesis a new approach for the recognition of highly variant patterns is investigated.
The basis for this is the use of convolution networks. They are a special form of neural networks, the processing of which is based on graphic filters.
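The affine correction of perspective mentioned earlier (for example for the skewed license plate) can be sketched in a few lines. The matrix values below are purely illustrative, a horizontal shear standing in for a perspective skew, not values taken from any of the recognition systems discussed here:

```python
# Sketch: correcting a skew with an affine transformation. A 2x3 matrix
# [[a, b, tx], [c, d, ty]] maps a point (x, y) to
# (a*x + b*y + tx, c*x + d*y + ty). The shear factor 0.5 is illustrative.

def affine_transform(points, m):
    """Apply a 2x3 affine matrix to a list of (x, y) points."""
    (a, b, tx), (c, d, ty) = m
    return [(a * x + b * y + tx, c * x + d * y + ty) for x, y in points]

# Undo a horizontal shear of 0.5, as an oblique viewing angle might produce:
shear = [(1.0, 0.5, 0.0), (0.0, 1.0, 0.0)]
unshear = [(1.0, -0.5, 0.0), (0.0, 1.0, 0.0)]

square = [(0, 0), (10, 0), (10, 10), (0, 10)]
skewed = affine_transform(square, shear)
restored = affine_transform(skewed, unshear)
print(restored)  # the original square corners are recovered
```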

1.2. Outline

Chapter 2 gives an overview of the state of the art of convolution networks and the aims of this work. A short list clarifies which implementations were necessary to achieve these aims. Furthermore, an outlook is given on future approaches and possibilities with convolution network technology.

Chapter 3 introduces two ways to classify patterns. One of these methods is the use of neural networks, an important tool for digital classification. Artificial neural networks are inspired by biological neural networks, so the biological model is presented first. Next, the functionality and training of artificial neural networks are discussed. After the introduction of neural networks, the second method, principal component analysis, is explained. It is required in order to examine large amounts of data for their identifying features and to remove unnecessary data, whereby the massively reduced data still carries the same distinctive identifying features. The reduced data can then be compared with other data and recognized. Both techniques are widely used in classification tools and form an important basis for the functioning of convolution networks.

Chapter 4 is entirely devoted to how convolution networks work. The differences to conventional neural networks are clarified, and the specific basics of convolution techniques are discussed.

Chapter 5 explains the different scenarios in which the convolution networks are tested for their recognition of highly variant patterns. First, in Section 5.1, the test data, i.e. the highly variant patterns, are discussed in more detail. For some tests, noise was added to parts of the test data in the hope of emphasizing features in the patterns and changing the detection rate of the networks. The methods required for this are explained at this point.
Section 5.2 explains in detail the structure of the convolution networks used in the tests. The results of the tests with the various configurations are listed at the end of the chapter. Chapter 6 includes a discussion of the tests, the results presented in Chapter 5, and the training of the convolution networks.

2. State of the art

In this chapter an overview of the state of the art of convolution networks, the goals of this thesis, and the technologies and tools used in it is given. In addition, possibilities are presented for using the results and tools created for this work as a basis for further tests and extensions.

Convolution networks

Convolution networks are an emerging form of neural networks. Despite positive results in digit [13] and object classification [16], this technology is only at the beginning of widespread industrial use. Convolution network technology has already been used in commercial products, especially in face recognition. An example is the recognition of faces in the Google Street View image database, where faces had to be made unrecognizable in the pictures. For this purpose a program pipeline was developed that recognizes faces in an image and blacks them out; part of the pipeline is based on convolution network technology [26]. The technology of convolution networks is still in the development stage. For this reason, tools, documentation and open source initiatives in this area are almost non-existent, in contrast to other, very common technologies. Although much has happened in this area in the last few years, the number of tools that this thesis can build on is very small.

Objective of the thesis

The aim of this thesis is to reproduce the classification results described by Osadchy, LeCun and Miller in [20] with our own series of tests. In the results of Osadchy, LeCun and Miller, faces were recognized in images with the help of convolution networks. For this purpose a program was written that breaks each picture down into many partial pictures.

Each partial image was passed to the trained convolution network, which then decided whether or not there is a face in the image section. This type of classification of images with or without a face is also investigated in this work with the help of two different implementations of convolution networks. Since the technology of convolution networks is still young, there are few supporting tools. For this reason, a number of tools had to be created in addition to the two convolution networks already implemented. All the partial aspects realized in this thesis are listed below:

- A Matlab program that implements a convolution network was analyzed and rewritten. The program and the changes to it are explained in more detail later.
- An undocumented programming interface of the LeCun working group, the Computational and Biological Learning Lab of New York University, was analyzed and used to implement our own programs with convolution networks and their training. A more detailed explanation of the implemented programs and convolution networks follows later.
- The following programs had to be written to convert the training and test data:
  - a program for converting image data into the MNIST format, which the Matlab program requires as its input format; the structure of the file format is described in detail later;
  - a program for adding noise to the surroundings of faces in image data; for one series of tests it was to be determined whether the recognition rate could be improved by noise in the surroundings of the faces;
  - a script that breaks video streams down into individual images with the help of the Gimp and ffmpeg programs. The individual images were saved as 32x32 grayscale images.
In this way, training and test data could be generated from the videos.
- Training of a convolution network for the generation of image data of facial surroundings. These images were used for the later training of other convolution networks and represented the training data for the classification group "image contains no face". A more detailed description of the training and test data is given in Section 5.1.
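The MNIST/IDX image format mentioned above has a simple, publicly documented layout: a big-endian header with the magic number 2051, the image count and the image dimensions, followed by raw unsigned-byte pixels. A minimal encoder can be sketched as follows; the 32x32 dummy images are illustrative, not the thesis's actual training data:

```python
import struct

def mnist_image_bytes(images, rows=32, cols=32):
    """Encode grayscale images (lists of 0-255 pixel values) in the
    MNIST/IDX image format: magic number 2051, image count, rows and
    cols as big-endian 32-bit integers, then raw pixel bytes."""
    header = struct.pack(">IIII", 2051, len(images), rows, cols)
    return header + b"".join(bytes(img) for img in images)

# Two dummy 32x32 "images": one black, one mid-gray.
data = mnist_image_bytes([[0] * (32 * 32), [128] * (32 * 32)])
print(struct.unpack(">IIII", data[:16]))  # (2051, 2, 32, 32)
```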

- Carrying out training and tests with the two different convolution networks, with the aim of achieving classification results similar to those of Osadchy, LeCun and Miller. For this purpose, various mixtures of classification groups and training data were used to train the convolution networks. A detailed list of the results achieved is given in Section 5.3.

Extension possibilities

The following is a brief overview of the options for building on the state of this thesis. In the Computational and Biological Learning Lab working group at New York University [4] there is brisk development in the field of convolution networks. The results and tools made available by the EBlearn++ [23] project can, in addition to the findings of this work, help to further test convolution networks for their strengths and weaknesses.

Figure 2.1: Sketched structure of a convolution network with stereo input data

Stereo input data

As Scherer, Müller and Behnke have shown [25], the recognition of variant patterns can be improved by stereo input. In a convolution network with stereo input, the input layer has two input fields instead of one. A convolution network with such a structure is sketched in Figure 2.1. The input layer distributes two images into the convolution network. Depending on the implementation of the network, it takes several convolution layers until the convolutions of the two input images meet.

Recognition of three-dimensional input vectors

The convolution networks presented in this thesis work on two-dimensional input vectors. Since convolution also works in higher dimensions, a convolution network with input data of three or more dimensions would be possible. Point clouds of objects would be one possible form of input data.

Figure 2.2: Three-dimensional point cloud of a person [10]

Figure 2.2 shows a point cloud of a person. Such point clouds can be created quickly with modern and increasingly affordable hardware, such as the

Microsoft Kinect [17]. This could enormously simplify and accelerate the generation of training and test data for further tests. With enough three-dimensional data, the tests presented in this thesis could be repeated. As Lai and Fox [12] have shown, it is possible to recognize objects in point clouds and to distinguish them from one another. The question arises whether convolution networks could be a suitable tool for distinguishing or recognizing highly variant faces or other objects of the same type.
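The convolution masks these networks build on (treated in detail in Chapter 4) can be illustrated with a minimal two-dimensional sketch. The 3x3 Laplace mask appears in the list of figures; the flat test image is illustrative. Since the mask is symmetric, the mask flip of a true convolution is omitted without changing the result:

```python
def convolve2d(image, mask):
    """Valid-mode 2D convolution of an image with a (2k+1)x(2k+1) mask.
    (No mask flip; for the symmetric Laplace mask this is equivalent.)"""
    k = len(mask) // 2
    h, w = len(image), len(image[0])
    out = []
    for y in range(k, h - k):
        row = []
        for x in range(k, w - k):
            s = 0.0
            for dy in range(-k, k + 1):
                for dx in range(-k, k + 1):
                    s += image[y + dy][x + dx] * mask[k + dy][k + dx]
            row.append(s)
        out.append(row)
    return out

# A 3x3 Laplace mask responds to intensity changes and yields zero on
# flat image regions.
laplace = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
flat = [[7] * 5 for _ in range(5)]  # constant 5x5 image
print(convolve2d(flat, laplace))  # [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
```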

3. Classification of patterns

Pattern recognition is the ability to recognize regularities, repetitions or similarities in a set of data [35]. Computer science offers many different procedures for these tasks. In this thesis the focus is on the recognition of highly variant patterns with convolution networks. Since convolution networks are based on neural networks, the basics of their structure and functionality are explained in this chapter. Another method in the context of classification is principal component analysis, which enables large data sets to be reduced to their distinctive patterns. The extracted features are well suited for comparison and identification. Principal component analysis is a statistical method and is often used in combination with neural networks to classify patterns [32].

3.1. Neural networks

The use of neural networks in data processing is almost as old as the programmable computer itself [33]. Since then, approximating functions with artificial neural networks has been a frequently used technique in computer science. The idea behind artificial neural networks is to mimic biological processes in the brain. Since artificial neural networks are inspired by biology, the structure of neurons is first examined using the biological model.

Biological neural networks

In the brain and nervous system of humans and animals, nerve cells (neurons) are highly interconnected [24]. With the help of this network structure (approx. 100 billion to 1 trillion neurons) [29], information is evaluated and processed in the human brain. This processing between the neurons takes place electrochemically. Figure 3.1 shows a microscopic section of the network with several interconnected neurons, one of which (in the middle) is highlighted in color. A neuron has

Figure 3.1: Pyramidal cells (highlighted in green) in the human cerebral cortex [31]

several input signals and one output signal. The output signal is passed on to other neurons as an input signal through several connections. Depending on the input stimuli of a neuron, an output signal can be generated, which is then passed on to all neurons connected to the output. Through this processing and transporting of impulses, decisions are made in the human body, motor actions are set in motion and reactions to events are generated.

Artificial neural networks

Artificial neural networks are the attempt to implement the properties of a neural network in a computer (with approx. 100 to 1000 neurons). Artificial neural networks adopt some ideas from biological neural networks for processing data. In many ways, however, they differ considerably from their model, since the structure and the electrochemical processes would be too complex to reproduce efficiently. Inspired by the biological model, neurons are used as well. In contrast to the biological model, the information flows through the network from input to output.

Figure 3.2: Neural network with two layers of neurons and three patterns to be recognized

The basic idea is to process a complex task with a neural network instead of an algorithm. In a network, the input information is passed on from neuron to neuron and transformed. Each neuron has a different parameterization with which it evaluates the input information. An artificial neural network stands out

in that it performs complex tasks quickly through the calculations of its neurons. Each neuron bases its calculation on a few additions and multiplications. The result of a neuron is then passed on to other neurons through an activation function or provided as the final result. In this way, functions can be approximated with neural networks. Three patterns are given in Figure 3.2. With the correct parameterization of the neurons, the given neural network could realize this mapping function. The information flows through the network of neurons, and a decision is made based on the connections between the neurons and their parameterization.

Structure of a neuron

In the biological model there are different types of neurons [5]. An artificial neuron can also have different structures. In this work, however, only the structure of the perceptron neuron is dealt with.

Figure 3.3: Internal structure of a perceptron neuron

Figure 3.3 shows the internal structure of a perceptron neuron. Each perceptron neuron has a certain number of inputs. Each input x_i has an associated weight w_i; this weight is a value that is multiplied by the input value. After all inputs have been multiplied by their respective weights, the products are summed. To this sum the so-called threshold value, the

bias θ, is added; the result is then passed to the activation function φ. The result of the activation function is the output value y of the neuron. The entire calculation of a neuron can be summarized as equation 3.1:

y = φ( Σ_{i=1}^{n} x_i w_i + θ )        (3.1)

Various functions can be used as activation functions. The frequently used linear activation function y = x and the sigmoid activation function y = 1 / (1 + e^(-x)) are shown in Figure 3.4.

Figure 3.4: Examples of possible activation functions φ

Structure of a neural network

Figure 3.5 shows an example of an artificial neural network. A neuron has one output signal; however, this output can also serve as an input for several neurons. The difficulty with artificial neural networks is to find the right setting for each neuron. The result of a calculation of a neural network therefore depends on the evaluation of the inputs by each individual neuron in the network. With the correct setting of every neuron in a network, a task could be optimally approximated. The setting of the neurons is different for each task and calculation of a neural network.

Figure 3.5: Structure of a small artificial neural network. Input data x_1 ... x_n, processed by neurons e_1 ... e_n and v_1 ... v_n and output from neuron a as result y

Since it turns out to be very difficult to set all neurons in a network by hand, and since there is no algorithm that can initialize a neural network with the correct values in finite runtime, an alternative procedure must be used to search for the ideal setting: the dynamic adaptation of the neuron values by training the network with different input data. The structure of neural networks depends on the neurons used; this chapter is limited to the structure of multilayer perceptron networks. Multilayer perceptron networks have at least three layers of neurons. Figure 3.6 shows an example of such a network. The first layer is the input layer, where the input data of the entire network is connected directly to the first neurons. The second layer is called the hidden layer. A neural network can have any number of hidden layers; it is only important that the neurons of one layer are only ever connected to neurons of the directly adjacent layers. The third and last layer is the output layer. Here the last calculations are made in the neurons, and

Figure 3.6: A neural network with several layers. (a) input layer, (b) hidden layer, (c) output layer

their results represent the decision of the entire neural network. The neurons of a layer are always structured in the same way; the neurons of different layers mostly differ in their activation function. Input layers often have linear functions, whereas hidden layers mainly use sigmoid functions. The activation function of the output layer depends on the application of the neural network: linear functions are used for function approximation, sigmoid functions for classification.

Learning process of a neural network

There are different ways to train an artificial neural network; this work focuses on only one of them. Its basic principle can be compared to learning with the help of a teacher, who lets the learner solve a problem and then corrects him in the event of a mistake. This procedure can also be applied to artificial neural networks (supervised learning). When a neural network is trained, it receives input data that runs through the network. A desired result must be available for the respective input data. Each output of the network is then compared with the respective desired result. If the output deviates, all weights and the bias θ in the neurons must be adjusted. A schematic representation of this process for one neuron is shown in Figure 3.7. The output of the perceptron neuron is negated and summed with the desired result. The resulting error is then squared in order to exclude mutual cancellation of positive and negative errors.
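The perceptron calculation from equation 3.1 and the squared error against a teacher's desired result can be sketched together; the weights, bias and desired result below are illustrative values, not trained ones:

```python
import math

def sigmoid(x):
    """Sigmoid activation: y = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def perceptron(inputs, weights, bias, activation=sigmoid):
    """Equation 3.1: y = phi(sum of x_i * w_i, plus the bias theta)."""
    return activation(sum(x * w for x, w in zip(inputs, weights)) + bias)

# Illustrative values: two inputs, hand-picked weights and bias.
y = perceptron([1.0, 0.5], weights=[0.4, -0.2], bias=-0.3)
print(round(y, 4))  # sigmoid(0.4 - 0.1 - 0.3) = sigmoid(0.0) = 0.5

# Squared error against the teacher's desired result, as in Figure 3.7:
desired = 1.0
error = (desired - y) ** 2
print(round(error, 4))  # 0.25
```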

This results in the error E_p. The smaller the error E_p, the closer the result of the neuron is to the goal set by the teacher.

Figure 3.7: Structure of the training of a perceptron neuron

Various approaches can be followed to find the smallest error E_p for a neuron. Using the example in Figure 3.7, two of these approaches are compared here:

Brute force: In order to find the smallest error E_p for the weights w_1 and w_2, all combinations of weight values for the input data x_1 and x_2 are compared. Figure 3.8 shows the possible combinations in a two-dimensional space. In order to find the smallest error in this error mountain, all points must be run through and compared with one another. The search for the smallest error E_p in the error mountain would take O(n^2) per iterative run per neuron. In a larger neural network, a neuron has significantly more weights that have to be adjusted. This leads to a multi-dimensional feature space and a complexity of O(n^m) for a learning step, where m stands for the number of weights of the neuron.

Gradient descent: Here the principle is to find a relatively small error in the error mountain in a given number of steps. This is done not by iterating through all possible weight combinations, but by taking a step from the current error (in the error mountain) in the direction

in which the descent is steepest. There are several algorithms for gradient descent; one is the so-called backpropagation algorithm. This approach to the search for the smallest error is explained in this work because it is very often used for learning in perceptron-based neural networks. More details are described in Chapter 4.

Figure 3.8: Error mountains E(w_1, w_2) of a neuron with two input values and weights

As can be seen in Figure 3.9, the backpropagation algorithm finds a very small error for the weights within a few steps. This can be applied to any number of dimensions; with larger input data and more dimensions, however, a large number of steps may be necessary to find a small error for the weights. In order to train an entire neural network with the backpropagation algorithm, each neuron must be adapted from the output layer to the input layer, as shown in Figure 3.10. For this purpose, the actual output of the network is compared with the desired output and a difference vector is formed. This difference vector is required for the adjustment of all neurons. The adjustment of the weights begins in the output layer. From there the backpropagation algorithm works its way backwards through the network to the input layer and adjusts the weights of all neurons by a factor that depends on the difference vector. The entire network structure is trained with each training set (consisting of input data and the desired result). After the training, the accuracy of the network

26 3. Classification of patterns 26 E (w 1, w 2) w 1 w 2 Figure 3.9 .: The gradient descent to find the smallest error with the backpropagation algorithm. Each red arrow represents an iteration of the gradient descent, which can be checked for accuracy with the help of a test set. It is important that the data from the test set was not used to train the network. In order to test the hit accuracy of the network, the test data need not have been known to the network beforehand.
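The weight update that gradient descent performs for a single neuron can be sketched in a few lines. This is a minimal illustration only, assuming a linear activation and the squared error E_p; the learning rate and the toy training set are invented for the example:

```python
import random

def train_neuron(samples, lr=0.1, epochs=100):
    """Gradient descent for a single neuron with two weights w1, w2.

    samples: list of ((x1, x2), target) pairs.
    A linear activation is assumed; the error is E_p = (target - out)^2.
    """
    random.seed(0)                            # reproducible start weights
    w1, w2 = random.uniform(-1, 1), random.uniform(-1, 1)
    for _ in range(epochs):
        for (x1, x2), t in samples:
            out = w1 * x1 + w2 * x2           # neuron output
            err = t - out                     # difference to the teacher's target
            # one step against the gradient of E_p, scaled by the learning rate
            w1 += lr * err * x1
            w2 += lr * err * x2
    return w1, w2

# toy training set realizable by out = 2*x1 - 1*x2
data = [((1, 0), 2), ((0, 1), -1), ((1, 1), 1), ((2, 1), 3)]
w1, w2 = train_neuron(data)
```

Unlike the brute-force search over all weight combinations, each update only follows the local slope of the error surface, which is why the cost per step does not grow exponentially with the number of weights.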

Figure 3.10: Applying the backpropagation algorithm to an entire network

3.2. Principal Component Analysis

Principal component analysis is a statistical method that is used to reduce and simplify extensive data sets. Data with many dimensions can be compressed using principal component analysis: the number of dimensions of the data is reduced without losing the distinctive patterns in the data. The parts of the data that are removed are unnecessary for recognizing the pattern. This technique turns out to be very efficient, especially in combination with neural networks, since smaller input vectors increase both the speed and the recognition rate.

Analysis

Principal component analysis is about finding features in a set of recorded data. These features are called principal components or eigenvectors. Principal components that contribute little to the information content of the data can be discarded. Each principal component is a weighted combination of the original data dimensions. Principal components are calculated iteratively from the data. Figure 3.11 shows a point cloud in which ten points, each with three values x, y, z, are represented in three dimensions. The principal components are determined in a similar way to the solution of a linear least-squares fit [32]: the calculation results in a straight line that approximates all points as well as possible.

Figure 3.11: Point cloud [21] with its first principal component, a straight line that approximates all points.

Each further principal component must be orthogonal to the previous ones and pass through their center. Only as many principal components can be calculated as the data set has dimensions. After the principal components of a data set have been calculated, they serve as a new Cartesian coordinate system, as shown in Figure 3.12. The new axes (the principal components) on which the data is oriented span new value ranges, which are combinations of the old value ranges of the axes with different weightings [28]. This can be illustrated with an example: in the original coordinate system, a point has the values y = 15, x = 120, z = 47. In the new coordinate system, based on the principal components, the new values r, h, t are each weighted combinations of the old values x, y, z.

Figure 3.12: Point cloud plotted against the first two principal components r and h (left). These principal components suffice to display the data in a new coordinate system (right).

The new values of the points are combinations of the old values. With the exact combination of these values, transitions from one coordinate system to the other can be carried out.

Reduction of the dimensions

The principal components presented above can be used to create a new coordinate system for the data. In this new coordinate system, the data is represented on the basis of the combined old values. Often the new axes (principal components) cannot be interpreted in terms of content. The so-called total variance can be used to determine which principal components have more influence on the nature of the data than others; it serves as a measure of the importance of the principal components. The variance is calculated from the distances between the points and the respective principal component: the sum of the squared distances in the direction of the principal component forms the variance of the data along that component. The more principal components are calculated, the smaller the distances between the points and the principal components become; the variance of the i-th principal component therefore decreases with increasing i. The total variance of the data is the sum of the variances of all principal components. Principal components with a low variance can be removed, and the data can then be represented in a coordinate system with fewer dimensions. The reduced data can be used in further data processing in their new coordinate system.
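The procedure above, centering the data, computing the principal components as eigenvectors, ranking them by their share of the total variance, and dropping the weakest, can be sketched with NumPy (assumed as available; the random point cloud merely stands in for the data of Figure 3.11):

```python
import numpy as np

# Point cloud: 10 points with three values x, y, z each (as in Figure 3.11).
points = np.random.default_rng(0).normal(size=(10, 3))
points[:, 0] = 3 * points[:, 1] + 0.1 * points[:, 0]   # make x depend strongly on y

centered = points - points.mean(axis=0)     # move the cloud's center to the origin
cov = np.cov(centered, rowvar=False)        # 3x3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvectors = principal components
order = np.argsort(eigvals)[::-1]           # sort by variance, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Share of the total variance explained by each principal component
explained = eigvals / eigvals.sum()

# Reduction: keep only the first two components as the new coordinate system
reduced = centered @ eigvecs[:, :2]
```

The array `explained` corresponds to the variance shares discussed above; components at its tail can be dropped, which is exactly the dimension reduction used before feeding data into a neural network.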

4. Convolution networks

Convolutional neural networks are a variant of the neural networks presented in Chapter 3. Convolution networks do not only rely on the perceptron neurons presented there, but also on a technique for filtering data that comes from signal processing. Convolution networks got their name from this technique, because it is known in digital signal processing as convolution. A more detailed explanation of the structure of convolution networks follows in section 4.1. The idea of convolution networks is based on the neocognitron developed in the 1980s [7]. Convolution networks as self-learning networks first appeared in 1998, when Simard [27] and LeCun [13] showed that learning algorithms such as the backpropagation learning cycle can be used with convolution networks. This made it possible to train convolution networks with image data without the need for further manual settings for the subsequent classification. With the convolution networks presented in [13], highly variant digits in images could be classified more robustly than is possible with conventional perceptron networks. In the following years, further tests on the classification of people and objects in images with the help of convolution networks [16] proved very successful. The fact that convolution networks are able to recognize complex shapes in image data makes them a capable successor to perceptron-based neural networks in the field of image classification.

4.1. Functionality and structure

The functionality of convolution networks is based on the filtering of input data. The data flow within a convolution network does not differ significantly from the networks presented in Chapter 3: the input data pass through the convolution network layer by layer. The difference lies in the interconnection of the layers and the processing of the data in the respective layers.
In the layers of a convolution network, the data from the previous layer is filtered. To go into more detail about this functionality, the term filtering, or convolution, of image data must first be explained.

Convolution

In a convolution, the input data is processed with a so-called convolution mask so that certain features are emphasized or suppressed in the result. Convolution is one possible form of image filtering.

Figure 4.1: Convolution of a two-dimensional vector with a 3x3 convolution mask

This filtering can take place in the complex domain [19] or on the basis of convolution masks. In this work only filtering on the basis of convolution masks is considered further, since convolution networks also calculate their convolutions in this way. In a convolution with convolution masks, the input vector y is traversed. Each element y_i of this vector and its neighboring elements y_{i±1} ... y_{i±n} are multiplied by the convolution mask k and summed, where n depends on the size of the convolution mask k. See Figure 4.1 for a graphical representation of the convolution.
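The multiply-and-sum scheme of Figure 4.1 can be sketched directly with NumPy (assumed as available). Note that the strict mathematical convolution would also mirror the mask; this is omitted here, as the masks in convolution networks are learned anyway:

```python
import numpy as np

def convolve2d(image, mask):
    """Convolve an image with a square mask. The border ring of width
    (mask width - 1) / 2 is lost, since the outermost pixels lack neighbours."""
    n = mask.shape[0] // 2
    h, w = image.shape
    out = np.zeros((h - 2 * n, w - 2 * n))
    for i in range(n, h - n):
        for j in range(n, w - n):
            # multiply the pixel and its neighbours by the mask and sum
            region = image[i - n:i + n + 1, j - n:j + n + 1]
            out[i - n, j - n] = np.sum(region * mask)
    return out

# A common 3x3 Laplace mask (one of several variants)
laplace = np.array([[0, 1, 0],
                    [1, -4, 1],
                    [0, 1, 0]])
image = np.ones((5, 5))          # constant brightness: Laplace response is 0
result = convolve2d(image, laplace)
```

Because the mask weights sum to zero, a region of constant brightness yields a response of zero; only changes in brightness produce non-zero output, which is why such masks highlight edges.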

The result is placed in the output vector x. The exact calculation of x_i is given by equation 4.1:

x_i = Σ_{j=-n}^{n} y_{i+j} · k_j    (4.1)

By filtering images, many different effects can be achieved. Figure 4.2 shows an example of such a filtering. Here the left grayscale image was filtered with the Laplace convolution mask shown in (4.2). The middle grayscale image in Figure 4.2 is the result of this Laplace convolution. Laplace filters are often used to find inflection points in brightness, i.e. edges. If the convolution mask is initialized with random values, the resulting filtering effect is likewise unpredictable. As can be seen in Figure 4.2 on the right, such a filtering can nevertheless bring out distinctive patterns in the image. The input data of a convolution network is filtered in the same way: the convolution masks are initially randomly initialized, but are then adjusted by a training process.

Figure 4.2: Unfiltered image (left), image filtered with a 3x3 Laplace convolution mask (center), and the image filtered with a 3x3 randomly initialized convolution mask (right)

A convolution can also be used to compare data with other data [11]. This type of convolution is called correlation. In image processing, correlation is used, among other things, for the precise localization of a pattern in an image [34]. A simple one-dimensional image is given as an example in Figure 4.3: the grayscale values of a pixel row are given as numerical values. The pixel values are convolved with the correlation mask, and the resulting correlation value represents a measure of how well the pattern is matched. In the example, the highest correlation value is 85; it marks the pixel sequence that, of all pixel sequences in the image, comes closest to the pattern to be found. In the same way, two-dimensional images can be correlated and distinctive patterns can be localized.

Processing in convolution networks makes use of this effect. In a convolution network, the values of the filter masks correspond to the weights of the perceptron neurons. Depending on the setting of their weights, neurons evaluate their input data differently. The same happens in convolution networks, except that here input images are convolved differently depending on the filter mask setting and new output images are generated. In convolution networks, the filter masks, i.e. the weights, can be set with a learning cycle, as presented for perceptron neurons in Chapter 3. In this way, adapted filter masks are created through training with a learning algorithm and a large amount of training data. The input data are then filtered and evaluated differently by the convolution network depending on the nature of the training data used. Each time an image is filtered, the result shrinks by a pixel ring as wide as (convolution mask width - 1) / 2: since the convolution mask cannot cover the outermost pixels of an image (the outermost pixels lack neighbors), these are ignored in the calculation.

Figure 4.3: Example of the localization of a pattern with a 1x3 correlation mask in a row of pixels
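The one-dimensional localization of Figure 4.3 can be sketched as follows, with invented numbers rather than the figure's actual grayscale values (in practice a normalized correlation is often used so that bright regions do not dominate):

```python
import numpy as np

def correlate1d(row, mask):
    """Slide the mask over a pixel row; high values mark positions
    where the row resembles the pattern encoded in the mask."""
    m = len(mask)
    return np.array([np.dot(row[i:i + m], mask)
                     for i in range(len(row) - m + 1)])

pattern = np.array([1, 5, 1])                   # the pattern to be found
row = np.array([0, 1, 0, 1, 5, 1, 0, 2, 1])     # the pixel row contains it at index 3
scores = correlate1d(row, pattern)
best = int(np.argmax(scores))                    # position of the highest correlation value
```

The position of the maximum in `scores` is the localization result; exactly the same sliding comparison, extended to two dimensions, is what each feature map of a convolution network computes.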

Layers

Neural networks are built up in layers, as explained in Chapter 3. Convolution networks consist of an input layer, several hidden layers and an output layer. The main difference to conventional multilayer perceptron networks is the way in which the data is processed and passed on through the network. One of these processing steps is the convolution explained in section 4.1. The layers in a convolution network differ in their function. In contrast to multilayer perceptron networks, in which each layer is made up of the same kind of neurons, there are three different types of layers in convolution networks:

Convolution layer
Subsampling layer
Perceptron layer

The convolution and subsampling layers each consist of several parallel convolutions whose results are called feature maps [14]. Just as several neurons work in parallel in one layer, the feature maps in a convolution network are computed in parallel. Each layer has its own feature map size in pixels. A layer takes the data from the previous layer and convolves it in each of its feature maps with a convolution kernel. As can be seen in Figure 4.4, each convolution layer (C_n) is followed by a subsampling layer (S_n); these layers always work in pairs. A subsampling layer reduces the size of the data from the previous layer by a factor of two. Depending on the structure of a network, any number of convolution and subsampling layer pairs alternate up to the output layer. The output layer consists of perceptron neurons that are fully connected to the previous layer. It is also possible, as in the LeNet-7 [23] built by Yann LeCun, to place several fully connected perceptron layers before the output. The neurons of the output layer each produce an output value when they are evaluated. Every convolution network has a threshold value that determines when an output value counts as a classification hit. The output values are normalized to the range 0 to 1.0; with a threshold of 0.8, for example, all neuron outputs of 0.8 and above count as hits.
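The interplay of convolution and subsampling layers determines how the feature map size shrinks through the network. A small sketch; the 32x32 input and 5x5 masks are chosen to roughly match LeNet-5, and the actual sizes depend on the network at hand:

```python
def layer_sizes(input_size, mask_size, pairs):
    """Trace the feature-map size through alternating convolution (C_n)
    and subsampling (S_n) layers, as in Figure 4.4.

    Each convolution loses a border ring of (mask_size - 1) // 2 pixels
    on every side; each subsampling layer halves the size."""
    sizes = [input_size]
    s = input_size
    for _ in range(pairs):
        s = s - (mask_size - 1)      # C_n: border ring lost on both sides
        sizes.append(s)
        s = s // 2                   # S_n: reduction by a factor of two
        sizes.append(s)
    return sizes

sizes = layer_sizes(32, 5, 2)        # input, C1, S1, C2, S2
```

For a 32x32 input this yields 28x28 after C1, 14x14 after S1, 10x10 after C2 and 5x5 after S2, so the fully connected perceptron layers at the output only see a strongly reduced representation.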

Figure 4.4: Structure of the layers of a convolution network

4.2. Learning process

The learning principle described in Chapter 3 for the multilayer perceptron can also be used for convolution networks, whereby the weights in a convolution network are the values of the filter kernels. It is important to set these correctly for the respective task with the learning process. This is done with the backpropagation learning algorithm, which, as shown in Figure 3.10, adapts all weights from the output layer to the input layer. A special feature of the learning process is that several feature maps can share weights, that is, filter kernels [6]. Here a distinction must again be made between the convolution layer and the subsampling layer: in subsampling layers there is only one weight for each connection to the previous convolution layer, which can be set and thus trained; in convolution layers, the feature maps of the respective layer share a convolution kernel.
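The effect of weight sharing on the number of trainable parameters can be illustrated with a rough count (bias terms and the exact map-to-map connection scheme are ignored, and the layer sizes are invented for the comparison):

```python
def shared_parameters(n_maps, mask_size):
    """Trainable weights of a convolution layer with weight sharing:
    every position of a feature map reuses the same mask values."""
    return n_maps * mask_size * mask_size

def dense_parameters(in_pixels, out_pixels):
    """Weights of a fully connected layer of the same size, for comparison."""
    return in_pixels * out_pixels

conv = shared_parameters(n_maps=6, mask_size=5)    # 6 feature maps, 5x5 kernels
dense = dense_parameters(32 * 32, 6 * 28 * 28)     # same input/output pixel counts
ratio = dense // conv                               # how many times more weights
```

For these invented sizes the convolution layer has 150 trainable weights while a fully connected layer of the same geometry would have several million, which is why weight sharing makes training with backpropagation feasible on image data.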

5. Realization and test

In this work, convolution networks were tested for their performance in the recognition of highly variant patterns. The series of tests was carried out with two different convolution networks. The first is based on the work of Nikolay Chumerin, who implemented a convolution network in Matlab based on Yann LeCun's LeNet-5 [3]. This Matlab program was developed with the aim of reaching the recognition rate for letters and digits of LeNet-5 explained in [14]. The convolution network used in the Matlab program had to be adapted for the respective test data. The second convolution network used in the tests was taken from the EBLearn++ project [23]. EBLearn++ is an open source project of the Computational and Biological Learning Lab at New York University; its tools are based on Yann LeCun's research. EBLearn++ provides many tools and examples for experimenting with convolution networks. One of these tools makes it possible to describe a convolution network in a configuration file, which can then be trained and used in the following steps. For the test series in this work, examples of the EBLearn++ project were subjected to small changes and the configuration files of this tool were adapted. A more detailed explanation of the test setup and the networks is given in section 5.2. The highly variant patterns that are classified in the test series come from various image databases. Since it was necessary for some tests to generate our own test data or to supplement the existing data, a number of script-controlled tool chains were developed. They extracted individual images from video data and, depending on the test series, converted them into different formats and modified their content. Since the convolution networks used are different programs that require their input data in specific formats, the test data had to be converted for the respective platform.
For this, too, a number of script-controlled tools were developed that prepare the test data for the respective convolution network. A more detailed overview of the test data and the changes made to them can be found in section 5.1. Detailed information about the tool chains and their use can be found in Appendix C.

5.1. Test data

To test the convolution networks, faces were chosen as the highly variant patterns in this work. Like the chairs from Chapter 1.1, faces are three-dimensional patterns that have to be recognized in a two-dimensional image. Two image databases were used for the tests; they were mixed or used individually depending on the test. The first image database used is called MUCT [18]. MUCT consists of 751 different faces, each shot in five poses and illuminated differently.

Figure 5.1: Structure of a recording session of the image database MUCT [18]. A series of images is created in which five images are taken in parallel from different angles.

The 751 people come from diverse ethnic origins. The faces are not always centered in the picture but appear in different positions, and the viewing angle differs in each of the five pictures taken per person. The structure of a recording session can be seen in Figure 5.1: three different poses per person were recorded under different lighting conditions, each captured by five cameras at the angles a, b, c, d, e. All images have a resolution of 480x640 pixels. As the second image database, Faces in the Wild [30] was used. This is an image database created by the University of Massachusetts that shows 5749 different faces in images. Of these people, 1680 appear several times