Deep Convolutional Neural Networks A-Z | Denis Volkhonskiy | Skillshare


Deep Convolutional Neural Networks A-Z

Denis Volkhonskiy, AI Researcher



Lessons in This Class

29 Lessons (1h 55m)
    • 1. Introduction

    • 2. Computer Vision Problems

    • 3. PRACTICE #1: Data loading

    • 4. Linear Layer and Classification Pipeline

    • 5. Loss functions and Softmax

    • 6. Stochastic Gradient Descent

    • 7. PRACTICE #2: Linear Classifier in PyTorch (part 1)

    • 8. PRACTICE #3: Linear Classifier in PyTorch (part 2)

    • 9. PRACTICE #4: Multi-layer perceptron

    • 10. What is image

    • 11. Motivation to Convolutions

    • 12. Convolution operation

    • 13. Parameters of the convolution

    • 14. Max Pooling and Average Pooling

    • 15. Non-linear functions

    • 16. Building deep convolutional network

    • 17. PRACTICE #5: Convolutional Neural Network

    • 18. Overfitting. L2 regularization

    • 19. DropOut. DropConnect

    • 20. Dropblock

    • 21. Early stopping

    • 22. Batch Normalization

    • 23. Data Augmentation

    • 24. Datasets

    • 25. Modern Architectures

    • 26. Transfer Learning

    • 27. PRACTICE PROJECT: Data Loading

    • 28. PRACTICE PROJECT: Data augmentation

    • 29. PRACTICE PROJECT: ResNet18






About This Class

Dear friend, welcome to the course "Modern Deep Convolutional Neural Networks with PyTorch"! In this course, you will learn:

  1. What are convolutional neural networks and why do people need them
  2. How to efficiently train them
  3. What is the best way to regularize and speed up the training of neural networks
  4. How we can improve the prediction quality

Warmly welcome!

Meet Your Teacher


Denis Volkhonskiy

AI Researcher





1. Introduction: Hello, my dear friend. My name is Denis Volkhonskiy. Welcome to the course "Modern Deep Convolutional Neural Networks". Thank you very much for choosing this course. I'll do my best to share my knowledge of image processing with deep learning and convolutional neural networks with you. Let me begin with the story of the deep learning revolution in neural networks. There exists a competition called the ImageNet Large Scale Visual Recognition Challenge. The task for the computers is the following: you have 14 million images of size 256 by 256, and each image is assigned to one out of 1000 classes: cats, dogs, frogs, etcetera. You have to build a classifier, an algorithm that takes an image as an input and returns the class of this image. This competition started in 2010, and the top-5 error was 28%. Year after year, people developed neural networks that are deeper than the previous ones: in 2012 they consisted of eight layers, then 19, then 22. Finally, in 2015, people developed the ResNet architecture, which showed an error of 3.5%, and what is amazing about it: it outperformed humans. There are several reasons for it. First, the network is really big: 152 layers. In fact, training such a deep network is not only slow but also very hard, and the authors proposed how to overcome such limitations, which we, sure enough, will discuss in our course. The second reason: this network is able to recognize classes that many humans can't recognize. For example, this network can distinguish different breeds of dogs, and I'm not sure that everyone can distinguish between the two different breeds of dogs on the slide, but they are different. Neural networks show amazing performance in many tasks. In 2013, Kaggle, the biggest machine learning competition platform, hosted the Dogs vs. Cats challenge. The task was to classify whether there was a cat or a dog in the photo, and the accuracy of the winner's model was almost 99%.
During our course, we will move from the basics of convolutional neural networks to the recently developed advanced techniques for training them and achieving a high score. Firstly, we will consider which problems can be solved with convolutional neural networks. We will briefly recall the linear layer and how to train a multi-layer perceptron, what a loss function is, and how gradient descent works. We will study convolutional neural networks: what they are, how to use them, what their parameters are, etcetera. Later, we will move forward to studying advanced deep learning techniques such as batch normalization, instance normalization and others. I will tell you about the most modern regularization techniques such as DropOut, DropConnect and others. We will learn what an autoencoder is and how to apply it in practical business tasks. Finally, we will study the most popular convolutional models for the classification task and study how to use them in order to solve problems with a small amount of data. Thank you, and see you in the next video. 2. Computer Vision Problems: Let us now consider the hottest topics of research in computer vision. All of these problems are solved using convolutional neural networks. Each of the problems is a separate topic, and in this course I will pay attention to the basis for all of them: convolutions and the training of neural networks. The first basic problem in computer vision is classification. Usually you have to build a function that takes some image as an input and produces a label for this image. On the slide, you can see a simple case of classification: a binary, or two-class, problem. Our neural network should decide whether there is a cat or a dog in the image. The second computer vision problem is image segmentation. For a given scene image, we should find where each object is located and mark the pixels of different objects with different colors or classes.
For example, on this slide, the glass is marked with blue color, the laptop is black, and the table is yellow. Usually this task is solved with the classification of each pixel separately. Another example of a computer vision task is object detection: for the input image, we need to find all the objects in the image and highlight them with a rectangle, which is called the bounding box. The next problem of computer vision is image captioning: for a given image, the model should generate a text description of this image, as shown on the slide; for the input image, the caption may be "the man is passing the road". The problem of person identification is widely used in the banking system. Usually, banks would like to identify persons by the photo and say whether this is the same person or not. We will study this problem at the end of the course. In 2014, Ian Goodfellow proposed a new method of generating synthetic objects: generative adversarial networks. They are able to learn the training data distribution and sample from this distribution. Now generative adversarial networks can generate synthetic objects of big size and high quality. Another recently proposed method can transfer the styles of great artists to a photo. The input photo is in the left corner, and the other three photos are stylized into different styles of famous drawings. Thank you for watching, and see you in the next video. 3. PRACTICE #1: Data loading: Hello, my dear friend. Welcome to the practical part of Modern Deep Convolutional Neural Networks with PyTorch. In this video, we will build a linear classifier for the dataset which is called CIFAR-10, which consists of ten classes of images like plane, car, bird, cat, deer, etcetera. So first, let's load everything we need. These are the PyTorch, Matplotlib and NumPy libraries. In the next cell, we have the following code. First, we define the transformation that will be applied to all our images. What is it?
The first is the transformation from the NumPy array to the tensor. The tensor is the basic element of the PyTorch library; the same tensor we can see in TensorFlow, in Theano, or in any other library. The second thing is normalization. If we look at the signature of the Normalize class from the transforms module, it says that we will subtract the mean, which is (0.5, 0.5, 0.5), and divide by the standard deviation, which is again (0.5, 0.5, 0.5). So this is what we will obtain, and since our images have values from 0 to 1, after this normalization our images will be centered by subtracting the mean and will have a standard deviation equal to one, so they will be normalized. And this is very good practice before training every neural network. After we have defined the normalization and the transformation that will be applied to all images, we should load our dataset. For this purpose, PyTorch has very good tools, which are called Dataset and DataLoader. So what do we do here? First, we load the dataset, and you see its name: it's CIFAR-10. We say that it will be located in the data folder. We say that we want the train set, not the test set, in this section. We should download it, because we don't have it. And we want to apply the transformation that we defined. After we define the train set, we should create a loader. A loader is a class in PyTorch that says what the batch size will be, whether we need to shuffle our data, how many processes we should use for loading this data from memory, and so on. So we'll use a batch size equal to four, we will shuffle our data, and we will have two worker processes to load our data from memory. We do the same with the test set; the difference is that now we set the train flag to False. So let's run it. Finally, in this cell, we define the classes that we have: plane, car, bird, etcetera. In total, there are ten classes. Now we run the cell and wait until PyTorch downloads the dataset.
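As a sanity check on the arithmetic, here is a small NumPy re-implementation of what the two transforms do; in the notebook itself this is `transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])` from torchvision. The random image standing in for a CIFAR-10 sample is, of course, an assumption made so the sketch runs offline:

```python
import numpy as np

# What transforms.ToTensor does: HxWxC uint8 in [0, 255] -> CxHxW float in [0, 1]
def to_tensor(img_u8):
    return img_u8.astype(np.float32).transpose(2, 0, 1) / 255.0

# What transforms.Normalize(mean, std) does, channel by channel
def normalize(x, mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)):
    mean = np.asarray(mean, dtype=np.float32).reshape(3, 1, 1)
    std = np.asarray(std, dtype=np.float32).reshape(3, 1, 1)
    return (x - mean) / std

img = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)  # fake CIFAR-10 image
x = normalize(to_tensor(img))
print(x.shape, x.min() >= -1.0, x.max() <= 1.0)  # (3, 32, 32) True True
```

After these two steps the pixel values are centered: 0 maps to −1, 255 maps to +1.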
The necessary files are downloaded and unpacked automatically. So, as you see, the files are already downloaded and verified. Good. Now let's visualize this data. We grab some random training samples, four of them, because we have batch size four, as I said before. And we use a very good tool for displaying images, which is called make_grid, from the torchvision library. And you can see the images: they all have size 32 by 32 and have three channels: red, green and blue. Okay, so you can see plane, horse, bird and bird. We can, in fact, rerun this cell and obtain another four images. 4. Linear Layer and Classification Pipeline: Now I'm going to remind you of the core classification and regression pipeline. I will briefly describe what a loss function is, how to build a multi-layer perceptron, and how to train it. You can skip this video if you think you are good at these topics. Let's start with the linear layer. This, in fact, is the basis for every linear classifier, such as logistic regression or SVM, as well as for linear regression. Many people like to explain the linear layer using neurons, which for me seems very complicated. I like to look at a linear layer as a mapping of information from one linear space to another. So we have some input vector x from the n-dimensional space. The linear layer, or, if you like, the linear transformation, is defined by the transformation matrix of size n by k. In terms of neural networks, this is a weight matrix, which is updated during training. In PyTorch, this is called a Linear layer, and n is in_features and k is out_features. The transformation itself is a matrix multiplication of the input vector and the weight matrix. After the matrix multiplication, we get a new vector of size k, and in fact, this is all about the linear layer. We can stack several linear layers into one neural network. Don't forget that a combination of linear transformations is also a linear transformation.
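In code, the layer just described is a single matrix multiplication; a minimal NumPy sketch (the (out, in) weight shape mirrors PyTorch's `nn.Linear` convention, and the sizes here are arbitrary):

```python
import numpy as np

n, k = 5, 3                      # in_features, out_features
rng = np.random.default_rng(0)
W = rng.normal(size=(k, n))      # learnable weight matrix, stored as (out, in) like nn.Linear
b = np.zeros(k)                  # learnable bias

x = rng.normal(size=n)           # input vector from the n-dimensional space
y = W @ x + b                    # the linear layer is just this matrix multiplication
print(y.shape)                   # (3,)
```

Training updates `W` and `b`; everything else about the layer is fixed.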
That's why you should always insert a non-linearity between them, for example a sigmoid function. The sigmoid is a bad choice, but don't worry, we will study which non-linearities to use later in this course. Depending on the problem, we consider a different number of output neurons. In the case of classification into k classes, we should have k output neurons and use a cross-entropy loss with a softmax layer. In the case of a regression problem, we should use one output neuron with any regression loss function, for example L2, the Euclidean distance between predicted and target values. Now let's talk about classification in this section. I will describe the full classification pipeline. We have some input vector. Each element of the input vector is a feature. This vector is a part of a linear space. We can apply a linear transformation to it; this is called a linear layer, and as I said before, the linear layer is represented as a matrix of learnable parameters. Then we apply some non-linearity, such as sigmoid or hyperbolic tangent. If we don't, we will get a problem from linear algebra: we should remember that a combination of linear functions is also a linear function. Thus, adding several linear layers one after another will lead to some bad effects during training. Now we can repeat this section several times in order to make our network deeper; between each pair of linear layers we add a non-linearity, such as sigmoid or hyperbolic tangent. If we want to perform a classification into C classes, then we need to set C as the output dimension of our network. The last layer is now called the logits. It consists of numbers from minus infinity to infinity. Each logit is responsible for one class: the larger the logit, the higher the probability of the corresponding class. Usually, we should transform our logits into probabilities. Here the softmax layer comes to help us. The probabilities that were obtained on the last layer can be used in two ways. First, prediction.
Second, training, where we use the probabilities to compute the loss value: the lower our loss, the better our predictions are. Once the loss is computed, we can minimize it using the stochastic gradient descent algorithm. Minimizing the loss is the process of training a neural network. To predict the class for the input object, we use the class with the highest probability, and that's it. Thank you for watching, and see you in the next video. 5. Loss functions and Softmax: Let us talk about the softmax layer and loss functions in detail. The first thing we're going to consider is the softmax layer. This layer is widely used in classification, and it just transforms logits to probabilities. You can look at it as a normalization. Firstly, we transform all logits with the exponential function, and secondly, we divide by the sum of the exponents of all the logits. Please stop the video here and think: why do we need both transformations, the exponent of all the logits and the normalization, and why do we call the output of this layer probabilities? Let us look at an example of a softmax layer. We want to classify into four classes: cat, dog, cow and horse. The output of the last layer of the network is four logits: 4, 1, 2 and minus 8. The softmax layer takes each of these logits and normalizes them to the probabilities which are on the right. Once the probabilities are computed, it is important to speak about the loss function. Now we're talking about the binary version of cross-entropy. Let's look at it in detail. The first thing we see is w, which is the weight of the current object. The loss function is computed for each object and then averaged. Thus we can compute not the usual average but a weighted one; however, usually all weights are equal to one. After setting some weights for the objects, we have the target value for our object. Since we consider binary classification, the labels take values of either zero or one.
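Returning to the softmax example above, here is a quick numerical check; the four logit values are my reading of the transcript's garbled slide description, so treat them as illustrative:

```python
import numpy as np

def softmax(logits):
    # subtract the max for numerical stability; this does not change the result
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

logits = np.array([4.0, 1.0, 2.0, -8.0])  # cat, dog, cow, horse
p = softmax(logits)
print(p.round(3), p.sum())  # the outputs sum to 1, and the largest logit gets the largest probability
```

Exponentiation makes every value positive, and dividing by the sum makes them add up to one, which is exactly why the outputs can be read as probabilities.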
Depending on it, we will have only one non-zero term, the left or the right. We also have the predicted probability of class one, which is the output of the last layer of our neural network with a sigmoid function at the end. Since it is a probability, it takes values in the range from 0 to 1. If our target label is one, then we have only the left term and maximize the probability of class one. If the target label is zero, then we minimize the probability of class one. Note that maximizing or minimizing is governed by the minus sign before the loss. Let's consider the full cross-entropy loss function. In PyTorch, it is called CrossEntropyLoss. What you should know about it is that we maximize the probability of the true class and minimize all other probabilities. If you use CrossEntropyLoss in PyTorch, note that it already includes a softmax layer inside. Considering a regression problem, you're free to use any regression loss, such as the L2 Euclidean distance, the L1 distance, or any other. Just remember that you should compute the average distance between your predicted y and target y: the larger the distance, the worse your predictions. Thank you, see you in the next video. 6. Stochastic Gradient Descent: Now I'm going to briefly remind you of the stochastic gradient descent algorithm, the common way to train deep neural networks. Gradient descent is an algorithm that step by step updates the parameters of the network, going to the minimum of the loss function: starting from some initial parameters, we move in the direction opposite to the gradient vector. We have a loss function L(x, theta) that we wish to optimize. In other words, we want to find such parameters theta that minimize the loss function. We set some initial parameters theta_0; in this course, we will later discuss how to properly choose the initial parameters of the neural network. Here is the formula for the gradient descent algorithm. Let's look deeper into it. Firstly, we compute the average loss.
The loss function is computed for each training sample and averaged. Secondly, we take its gradient with respect to each parameter of the neural network. In this formula, we consider the update of the i-th parameter. Then we multiply our computed gradient by the predefined learning rate, which is the speed of learning. Next, we update our parameters by subtracting the gradient multiplied by the learning rate from the parameters of the previous step. Since we live in the era of big data, computing the average of the loss function over millions or billions of objects seems computationally impossible. However, we can use not all the data from our training dataset, but just a subset of it. This is what people usually call stochastic gradient descent. You should understand some facts about the choice of the batch size. The larger your batch size, the more GPU memory is required by your neural network; therefore, you are bounded from above in the choice of the batch size. On the other hand, it is better to take a bigger batch size, since it will decrease the variance of your gradient; a low batch size will lead to high variance. The third important note: I prefer to take a batch size which is a power of two. The reason for this is the binary system in the lower-level computations; it will just be faster. The usual choices are 32, 64, 128 and so on. The last but not least note for stochastic gradient descent and its modifications: it almost doesn't matter how big your batch is; what is important is the number of updates. There will be almost no difference between batch sizes 256 and 512, since you just slightly improve the estimation of the average gradient. Let us now consider how we work with data in the mini-batch setting. As usual, we have a set of training data. At each iteration, the neural network looks at a batch of data points. When the neural network has looked at all the data, it is called an epoch.
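The recipe above — shuffle each epoch, split into batches, step opposite the gradient — can be sketched on a toy linear-regression problem; the data, learning rate and epoch count here are arbitrary choices for illustration, not from the course:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)          # initial parameters theta_0
lr = 0.1                 # learning rate
batch_size = 32          # a power of two, as suggested above

for epoch in range(20):
    idx = rng.permutation(len(X))          # shuffle before each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        err = X[b] @ w - y[b]
        grad = 2 * X[b].T @ err / len(b)   # gradient of the mean squared error on the batch
        w -= lr * grad                     # step opposite to the gradient

print(np.round(w, 3))  # close to the true parameters [1, -2, 0.5]
```

Each batch gives only a noisy estimate of the full gradient, but over many updates the parameters still converge.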
We're now going deeper into the details. The first thing you should do is shuffle your data before each epoch. We fix the size of the batch to some value and split the whole training dataset into batches of this size. We update the network according to the gradient computed on the first batch, then the second batch; we update our network batch by batch. Note that the last batch has a size that is less than or equal to the batch size, because the total number of training samples is not obliged to be divisible by the batch size. All I showed you is called an epoch. After it, we start the next epoch of training. Note that we should shuffle our data again. There exist a number of modifications of stochastic gradient descent which use some empirical improvements of the convergence, but this is not the topic for our course. In practice, it is always better to use the Adam algorithm for neural network training. Adam is also a modification of stochastic gradient descent and is available in PyTorch out of the box. Thank you for watching, see you in the next video. 7. PRACTICE #2: Linear Classifier in PyTorch (part 1): Now we'll write a loop for training. What do we have here? We have five epochs. An epoch is what I described to you in the lecture: we want to look at each image from our dataset five times. At each epoch, we use our train loader that we defined at the beginning, here: the train loader and the test loader. And as I said, it is very convenient that we can do the following: we just write "for data in train loader", and that is all. Our data is inputs and labels. We transfer these input images and labels to the GPU for speeding up training. After that, in PyTorch, we should set our gradients to zero. Then we should make a forward pass and compute the loss. So how do we do the forward pass?
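The whole loop described in this practice video might be sketched end to end as follows; random tensors stand in for the CIFAR-10 loaders, the model is a stand-in linear net, and the `.cuda()` transfers are omitted so the snippet runs anywhere:

```python
import torch
import torch.nn as nn

# Tiny synthetic stand-in for the CIFAR-10 loaders: 8 "images" of shape (3, 32, 32)
images = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, 10, (8,))

net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

for epoch in range(5):                 # five epochs, as in the practice video
    optimizer.zero_grad()              # reset the gradients
    outputs = net(images)              # forward pass
    loss = criterion(outputs, labels)  # the "two lines of code" from the video
    loss.backward()                    # backward pass
    optimizer.step()                   # optimization step

with torch.no_grad():                  # no gradients needed for evaluation
    predicted = torch.max(net(images), dim=1).indices
    accuracy = (predicted == labels).float().mean().item()
print(round(loss.item(), 3), accuracy)
```

The real notebook iterates over batches from the loaders inside each epoch; the structure of the five steps per update is the same.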
We just need to apply our network to the images, and this will be our output. And in order to compute the loss function, we need to say that the loss is equal to the criterion applied to the output and the labels. And this is it, just two lines of code. And after that, we make the backward pass, we make the optimization step, and then we just print the loss. Let's run this. Okay, our training has now finished, and you can see how the loss decreased from 2.3 to about 2.0. And now let's check the result. For this purpose, we will use the test loader in the same way that we used the train loader in training. The difference is that we don't use any gradients or backward pass, and that's why we write torch.no_grad here. What we do is just make a forward pass, calling our net class. And after that, we use the torch.max function in order to obtain the class with the maximum score. And let's compute the accuracy. So we have an accuracy equal to 31%, which is better than random, because a random classifier would have 10% accuracy, because we have 10 classes. But again, this is just a linear classifier. In the next video, we will consider the multi-layer perceptron, and then we'll move forward to convolutional neural networks. Thank you, and see you in the next video. 8. PRACTICE #3: Linear Classifier in PyTorch (part 2): Now we have to write our linear classifier in PyTorch. For this, we need to define a class, which we call LinearClassifier. Here you can see it is an nn.Module. And we want it to take n input neurons and return n output neurons; a bit later I will describe what the input and output neurons are. So we have to define our linear classifier. The linear classifier consists of one layer, and this is the linear layer: it is called nn.Linear, from torch.
And here we pass n_input_neurons and n_output_neurons. How many neurons? I will describe that later. Okay, this is our linear layer. We should assign it to self.linear in order to make a call of this layer in the forward function. Now, in order to complete this neural network, we should define the forward pass. We don't need to define any backward pass, because PyTorch will automatically do the backward pass for us. So in forward, we do the following: we want to apply self.linear to x and return the result. Okay. But will it work? And I say no, because the image x has shape (batch_size, 3, height, width). That's why we can't apply the linear layer: it's an image. What we need to do is to reshape this x into a batch of vectors, and how can we do it? We define the variable batch_size, which is equal to x.shape[0]. Then we say that x is equal to x.view, which is the equivalent of reshape in NumPy, and pass it batch_size and minus one, which means that the first dimension will be batch_size and the second dimension will be 3 * height * width. This is it, and we have defined our neural network. Now we have to instantiate this class, LinearClassifier, and it takes n_input_neurons as a parameter, which is equal to something, and n_output_neurons, which is again equal to something. Okay, what should we put here? Here we should say that it will be 3 * height * width, because, due to the minus one, the view multiplies the product of these shapes; that's why the dimension of our input space will be 3 * height * width. Okay, what are height and width? Let's check the size of our image. We have here images; let's say images.shape, and look, it is (4, 3, 32, 32). So this is the batch size, this is the number of channels, and these are the height and the width.
Okay, so we can see that height and width are equal to images.shape[2] and images.shape[3]. And let's check that they are equal to 32. Yes, we're right: the height and width are equal to 32. And how many neurons should we have in our output? I say to you, it should be exactly 10. Why ten? Because we have exactly 10 classes, and when making a forward pass, we should obtain one score for each class, and the scores are translated to probabilities with the softmax layer. And again, we should have ten probabilities because of the ten classes. So this is how we create an object from our class. And then we apply .cuda(), which means that we transfer our network to the GPU. Now we have to define the criterion, which will be the cross-entropy loss, which I mentioned in the lectures. And we use the optimizer, which is Adam, which is now state of the art, one of the best optimizers invented, and is good for deep neural networks. We pass it the parameters of our network, and we set the learning rate to some value, which is 0.1. 9. PRACTICE #4: Multi-layer perceptron: Now let us modify the linear classifier that we built in the previous videos into a multi-layer perceptron. The difference between the multi-layer perceptron and the linear classifier is that now we have not one linear layer, but a set of linear layers and some non-linearities between them. So we take the same code for data preprocessing. We import the data; PyTorch says that we have already downloaded and verified the files. Then we can again visualize our dataset, which is, I remind you, CIFAR-10, which consists of ten classes of images of size 32 by 32, with three channels each. Then we take the LinearClassifier that we built in the previous video. Now let's rename it to MultiLayerPerceptron and let's modify it. So, as I said, we have several linear layers.
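Before modifying it, here is what the complete LinearClassifier from PRACTICE #3 looks like, put together (a sketch, not the notebook's exact code):

```python
import torch
import torch.nn as nn

class LinearClassifier(nn.Module):
    def __init__(self, n_input_neurons, n_output_neurons):
        super().__init__()
        self.linear = nn.Linear(n_input_neurons, n_output_neurons)

    def forward(self, x):
        batch_size = x.shape[0]
        x = x.view(batch_size, -1)   # (batch, 3, H, W) -> (batch, 3*H*W)
        return self.linear(x)

net = LinearClassifier(3 * 32 * 32, 10)   # 10 output neurons: one score per class
out = net(torch.randn(4, 3, 32, 32))      # a batch of four fake images
print(out.shape)  # torch.Size([4, 10])
```

The `.cuda()` call from the notebook is omitted here so the sketch also runs on a CPU.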
Okay, let's do the following. Let's say that now we'll have a block that we call main, a Sequential block. Sequential is a nice tool in PyTorch that allows writing several layers as a sequence. So the first layer will be nn.Linear, which has n_input_neurons inputs and, let's say, 256 outputs. Then the second linear layer will be nn.Linear again, and now we have 256 input neurons and, let's say, 128 outputs. The third linear layer will be our last one: it will have 128 input neurons and ten output neurons. And in forward, instead of self.linear, we just call self.main. So, now, please stop the video and think: where is the mistake that I made? I hope that you guessed it. As I said in the lectures, it is very important to insert some non-linearities between linear layers, because a combination of linear transformations in a linear space is again a linear transformation. That is the first reason, and the second reason is that we can obtain some bad effects with gradients if we have several linear layers one after another without non-linearities. So let us insert a LeakyReLU here, and, as far as I remember, we should say that inplace should be True. And let's insert the same layer here. Okay. And you see the parameter here, negative_slope, which is 0.01 by default; we can set it to 0.02. This is a hyperparameter, and of course, we usually should tune it with some validation set. Okay, so we are done. Then, again, we obtain the height and width of our image. Then we create not a LinearClassifier but an MLP, a multi-layer perceptron, again with n_input_neurons equal to 3 * height * width and ten outputs, and set it to CUDA. We created our network; then, again, the criterion and the optimization algorithm. And again we have five epochs of training, so let's train our network. Good, our training has finished.
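Assembled, the modified model might look like the following sketch (again not the notebook's exact code; the 0.02 slope is the value discussed above):

```python
import torch
import torch.nn as nn

class MultiLayerPerceptron(nn.Module):
    def __init__(self, n_input_neurons, n_output_neurons):
        super().__init__()
        self.main = nn.Sequential(
            nn.Linear(n_input_neurons, 256),
            nn.LeakyReLU(negative_slope=0.02, inplace=True),  # non-linearity between linear layers
            nn.Linear(256, 128),
            nn.LeakyReLU(negative_slope=0.02, inplace=True),
            nn.Linear(128, n_output_neurons),
        )

    def forward(self, x):
        x = x.view(x.shape[0], -1)   # flatten the image batch, as before
        return self.main(x)

mlp = MultiLayerPerceptron(3 * 32 * 32, 10)
print(mlp(torch.randn(4, 3, 32, 32)).shape)  # torch.Size([4, 10])
```

Without the two LeakyReLU layers, the three Linear layers would collapse into a single linear map, which is exactly the mistake the video asks you to spot.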
Let's obtain the test quality using the same function that we used for the linear classifier. And now we have 45% accuracy. With linear classification, we obtained 31% accuracy; now we have 45%, while random classification would give us 10% accuracy. Okay, that's not a very good result, of course, because we don't use convolutional layers at all, and we didn't tune the network, we just took some layers. But we again improved the results. In the next section, we will consider convolutional layers and will significantly improve the results. So continue watching. Thank you, and see you in the next video. 10. What is image: In this video, I'm going to describe what an image is and what people did before deep learning. Usually, a grayscale image is represented as a matrix of size height by width. In this matrix, there are pixel values, which can take values from the range from 0 to 255. Another way to represent an image is a matrix of pixels that take values between zero and one. If we're talking about color images, usually an image is represented as a 3D matrix of size three by height by width. Each pixel now is represented as three values: the red intensity, the green intensity and the blue intensity. These are the so-called channels of the image. Let us consider the old-school approach to image processing. As always, we start from the data. Usually, this is a set of images. For each image, people constructed a set of features by hand. Using the obtained features, one can apply any standard machine learning algorithm, such as linear regression, SVM, gradient boosting or any other. Given images, people used to construct so-called image descriptors. These descriptors were not learnable at all.
As you can guess, the choice of the descriptors influenced training and prediction results a lot. However, now it is the era of deep learning, and the key advantage of the deep learning approach is automatic feature learning. With neural networks you don't need to think about features and descriptors: you can just train the neural network, and it learns all the features for you. Thank you for watching, see you in the next video. 11. Motivation to Convolutions: in this course we study convolutional neural networks. However, we should understand why we have to use them. Why not use just linear layers? Quite a natural question: why not use linear layers with images? After all, we can just reshape our image into a vector and do a forward pass through a set of linear layers with non-linearities. Good idea, but if our input image has three channels and size 256 × 256, after reshaping we will have a vector of size around 200,000. That's the main motivation for introducing convolutional neural networks: too many parameters in the linear layer. Before we start learning convolutions, let us consider some assumptions. Could you agree with me that close pixels have close values? Actually, if we look at a pixel on the cat which is red, then with high probability the close pixels will also be red. Okay, so we don't need a separate neuron for each pixel. The open question for now: how can we reuse the parameters of our network? If we don't need a separate neuron for each separate pixel, we should use one small linear layer for local regions of the image, and this is the basic idea behind convolution. 12. Convolution operation: we understood why we should use convolutions instead of linear layers for images; let us now consider convolutions in detail. Let us look at the visual explanation of the convolution. Again, we have an input image of size C × H × W, where C is the number of channels and H is the height
and W is the width, correspondingly. We introduce the convolution weight, a three-dimensional matrix of size C × K × K. Here the number of channels is the same as in the input image and equals C. We use the convolution weight for sliding over the image, taking the element-wise product of the local region of the image, which is green, and the three-dimensional weight matrix, which is blue. We sum up the results of the product and obtain one value. We write it into one of the cells of the output three-dimensional matrix. We repeat this action several times with different filters (yellow, red); in fact, the convolution operation consists of several kernels. Convolutions were popular before deep neural networks and were called filters. On the slide you can see the result of applying a Gaussian filter to an image: you can see that, depending on the parameters of this filter, the image lost its color and became blurred. Convolution is an operation which can be represented as a layer in a neural network; that is why you can combine several convolutional layers into one neural network. Let me demonstrate the main advantage of convolutions over linear layers. Say we have an input image with three channels of size 256 × 256, and we want to obtain 100 output features after the first layer. Then, using a linear layer, we will obtain about 19 million parameters. At the same time, four convolutional kernels will have only 300 parameters. Now you see the difference. In the next video we'll go deeper into convolutions. Thank you for watching. 13. Parameters of the convolution: the convolution operation has several parameters. Let's discuss the most important ones: stride and padding. The first parameter of the convolution is its stride. The default value of the stride is one, which means that our kernel is moving from left to right and from top to bottom by one pixel. In the animation on the slide, you can see how a convolution with stride 1 works.
In the animation, the red square is the kernel and the green is the input image. Note that if we have input size 4 × 4 and kernel size 2 × 2, then the output size will be 3 × 3. Now let's increase the stride to two and look at what happens: we increased the stride, and the size of the output of the convolution decreased. Now it is 2 × 2. That is why, on the first layers of the network, it is better to use a stride of more than one in order to downsample big input images. Now let's set the stride to three. What will happen? The kernel can no longer cover the image evenly: the positions that don't fit inside the image are simply dropped, and if the resulting output size would be smaller than one pixel, PyTorch will return an error. I hope you understood what stride is, so now let's switch to another parameter: padding. Look at the image on the left. If we apply a convolutional layer to it, we will obtain an unequal contribution of pixels to the result of the convolution: green pixels will play their role more than blue pixels. If we want to make the contribution of the pixels equal, we can add zeros on the border of the image, and here you can see how it looks: red pixels now have value zero. Another reason for the padding is what we considered on the previous slide: if our image size does not divide evenly under the kernel, we can always extend the image with zero padding. Another possible variation of padding is mirror padding, which in PyTorch is called reflection padding. In this case we add not zero pixels, but we reflect the border pixels, as it is on the slide. Well, that is all for padding. Thank you for watching and see you in the next video. 14. Max Pooling and Average Pooling: in this video I will show you the most popular ways of downsampling images. Sometimes we have very big images as input; for example, the size can be more than 1000 pixels. If we process them directly, we will need a lot of parameters in the network.
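Before moving on, the output sizes in the stride examples above all follow one formula: out = floor((in + 2·padding − kernel) / stride) + 1. A quick sketch (the function name is my own; this is the same formula PyTorch's Conv2d uses):

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution:
    floor((n + 2*padding - kernel) / stride) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

# The 4x4 input with a 2x2 kernel from the animation:
print(conv_output_size(4, 2, stride=1))  # -> 3 (output is 3x3)
print(conv_output_size(4, 2, stride=2))  # -> 2 (output is 2x2)
# Stride 3: positions that don't fit are dropped by the floor:
print(conv_output_size(4, 2, stride=3))  # -> 1
```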
However, in order to recognize whether it is a cat or a dog in the image, we don't need so much information; we don't need so many pixels. That is why our goal is to reduce the input image size. For this purpose, people usually use a max pooling layer. It divides your image into squares of some fixed, predefined size, and in each of the squares it selects the maximum value, as in the example on the right: in the green square it selects six, in the red square eight, in the blue three and in the yellow four. In the example, the size of the square is two by two. In the max pooling layer we have the same parameters as in convolution: kernel size, which is the size of the squares; also stride, the same as in convolution; and padding, also the same as in convolution. Another type of pooling for images is average pooling. The only difference from max pooling is that now we take not the maximum value but the average one. You might ask me then: what should I use, max pooling or average pooling? As I will show you in the practice videos, there is not so much difference between these two methods; often people prefer to use max pooling instead of average pooling. Thank you for watching and see you in the next video. 15. Non-linear functions: the convolution operation is a linear function in linear space. It means that between several convolutions it is important to insert a non-linearity function. Let's look at the most popular choices and discuss their advantages and disadvantages. The most popular non-linearities are sigmoid and hyperbolic tangent. The sigmoid transforms the input into a value from 0 to 1, while the hyperbolic tangent transforms the input into a value from minus 1 to 1. Both the hyperbolic tangent and the sigmoid are bad choices. Let's look at the sigmoid gradient: it is equal to zero for almost all inputs; it has its largest value at the point 0, where it equals 0.25. It means that for most inputs the neural network weights will not be updated.
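This claim about the sigmoid gradient is easy to check numerically: the derivative is σ'(x) = σ(x)(1 − σ(x)), which peaks at 0.25 at x = 0 and vanishes quickly away from zero. A small sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s * (1 - s).
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25, the largest possible value
print(sigmoid_grad(6.0))   # ~0.0025, already almost vanished
print(sigmoid_grad(-6.0))  # the same by symmetry
```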
Almost the same situation is with the hyperbolic tangent. Another popular choice of non-linearity is the ReLU function: it sets the input to zero if it is negative, that is, it is the maximum of zero and x. However, it sets the derivative to zero on half of the inputs: if the input is negative, then the derivative will be zero, which again means that the weights will not be updated. A rather nice modification of ReLU is Leaky ReLU. It doesn't set the negative inputs to zero but slightly changes the slope on the negative side, as you can see on the slide. It is a good choice in most cases. There are many more non-linearities available in every deep learning library. Usually, in practice, the best choices will be Leaky ReLU and ReLU; however, you may use some validation set of samples in order to choose the best non-linearity. Thank you for watching and see you in the next video. 16. Building deep convolutional network: now we discuss how to build a deep neural network. As I already mentioned, it is important to add a non-linearity after each convolutional layer, because convolution is a linear operation. Usually, after a set of convolutional layers, we reshape the obtained output into a vector. Then we add several linear layers with non-linearities between them. As a result, we should obtain a vector of size equal to the number of classes in the classification problem. If we solve a regression problem, then the output size should be one. In order to build a deep network, we should stack several blocks that contain convolutions with non-linearities; after that, add a max pooling layer. We can repeat this big block several times. After that, we reshape the output into a vector and use linear layers with softmax at the end. Now, this is our deep convolutional neural network. Thank you for watching and see you in the next video. 17. PRACTICE #5: Convolutional Neural Network: We have already built a linear classifier and a multi-layer perceptron for classification of CIFAR-10.
Now let's switch to convolutional neural networks. So let me restart the kernel and clear the outputs. Now we again import the necessary libraries. We again load our data; we use the already downloaded and verified dataset. We again have a train set, a test set, a train loader and a test loader, and we can work with our images. Again we have ten classes; all images have size 32 × 32 and have three channels. Now we have to build a convnet. From the lecture you should remember that when we build a convnet, usually we start from a set of convolutional layers, and after that we use a multi-layer perceptron. Okay, so in PyTorch I wrote two parts here: the main part, which is the convolutional part, and the MLP, the multi-layer perceptron, which stands for a set of linear layers. When we do the forward pass, we will first apply the convolutional part, then reshape the output into a batch of vectors, and then apply the multi-layer perceptron. Now I will show you a service called Fomoro AI, which is very helpful when you build a convnet. It is good when you want to understand what the output size of your convolutional neural network will be. So what do we have here? We have the input size 32, which means that our images are 32 by 32. Then we should define the convolutional layer: here are the kernel size, stride, dilation and padding. We didn't discuss dilation yet, but the other parameters I assume you know. Okay, let's set the kernel size to 5, for example, the stride to 2, and the padding to valid. We have two options here: same and valid. So let's look: when we use same padding, it means that all pixels will have the same contribution; this is what I described in the padding video. If we use valid padding, then our padding will be zero, meaning that no padding will be applied. Okay, so what do we have here for layer one: kernel size five means the kernel is 5 × 5.
The stride is 2, the padding is zero, the input size is 32, and the output size will be 14. It means that the output will have shape 14 × 14, and we will have several channels; but there is no field here about channels, because they don't affect the spatial size. Okay, we can duplicate this layer and apply it after the first one, so the input size will be 14 now, and the output size will be 5. And if, for example, we want to duplicate it once more, then we will have input size 5, and the output size is 1. Okay, so we can use three layers which have the same parameters: kernel size 5, stride 2 and padding 0. Let's write them: an nn.Conv2d, where the first parameter is in_channels, which is n_channels, and the next parameter is out_channels. The out_channels is the number of features which we will have in our convolutional layer; let's say it will be 16. This is a hyperparameter and, of course, it should be tuned on a validation set. Then we have kernel_size, which we decided to set to five, then stride 2, then padding zero. Okay. And so now we should repeat this layer three times, but the next one has as input not n_channels, which is three: now it will be 16, and let's say the second layer will have 32 features. It means that the third layer will have 32 channels as input, and let's assume it will have 64 kernels. Now, this is our convolutional part. As I said in the lectures, convolution is a linear operation. It means that between each pair of convolutional layers we should add some non-linearity, because a combination of linear transformations is again a linear transformation, and we can also obtain bad effects with the gradients of the network. That's why we can write here an nn.ReLU with inplace=True, here and here. Good. So now we have the main model, which consists of three convolutional layers and two non-linearities.
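Putting the pieces together, the network being assembled here can be sketched roughly as follows. The sizes (16, 32, 64 channels; kernel 5, stride 2) and the small linear head on top follow the numbers in the video, but treat this as a sketch rather than the exact notebook code, and add `.cuda()` as in the video if a GPU is available:

```python
import torch
from torch import nn

class ConvNet(nn.Module):
    def __init__(self, n_channels=3, n_classes=10):
        super().__init__()
        # Convolutional part: for a 32x32 input the spatial size
        # shrinks 32 -> 14 -> 5 -> 1.
        self.main = nn.Sequential(
            nn.Conv2d(n_channels, 16, kernel_size=5, stride=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, kernel_size=5, stride=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=5, stride=2),
            nn.ReLU(inplace=True),
        )
        # Linear head: 64 features (64 channels x 1 x 1) down to 10 classes.
        self.mlp = nn.Sequential(
            nn.Linear(64, 32),
            nn.LeakyReLU(inplace=True),
            nn.Linear(32, n_classes),
        )

    def forward(self, x):
        x = self.main(x)           # (batch, 64, 1, 1) for 32x32 inputs
        x = x.view(x.size(0), -1)  # reshape to a batch of vectors (batch, 64)
        return self.mlp(x)

net = ConvNet()
out = net(torch.randn(4, 3, 32, 32))
print(out.shape)  # torch.Size([4, 10])
```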
We can also place a ReLU here, after the last convolution. After that we will have the shape: x.shape is batch size by 64, because we have 64 output channels, 64 kernels in the last convolutional layer, and then 1 × 1. Why one? Because after applying the three convolutional layers the output sizes were 14, 5 and 1. Now we have to reshape this: the "image" consists of one pixel by one pixel and 64 channels, and we should reshape it into a batch of vectors. Again we use x.view with the batch size and minus one. And after that we apply the multi-layer perceptron. Okay, let's write it: an nn.Linear which will have 64 input elements, and the output dimension will be 32; then a non-linearity, which will be a LeakyReLU with inplace=True; then again an nn.Linear with 32 input neurons, and the number of output neurons for our task will be 10, because we have exactly 10 classes. So let's compute the height and width, as it was in the previous lectures; n_channels is three, because if we look at the image shape, it is the batch size and then three as the number of channels; and the number of output neurons will be 10, meaning that we classify into 10 classes. So again, let's create our convnet, send it to CUDA, and check that if we apply our network to some images, it will have output size 4 × 10: four stands for the batch size and ten stands for the number of classes. Then we have some criterion, which is the cross-entropy loss, we have the Adam optimizer, and let's train our network for five epochs. Okay, our training has finished, and now let's check the quality. So now we have 59%. If you remember, when we trained the multi-layer perceptron we had 45%, so we improved our score. In the next practical videos we will improve the score more and more, applying in practice what we learn in the lectures. Thank you for watching and see you in the next video. 18. Overfitting.
L2 regularization: if you studied machine learning before, you should know what overfitting is. Also, you should know how to avoid overfitting and what regularization is. However, there are a number of regularization techniques that are specifically good for deep learning and image processing. From a machine learning course you should know that when you increase the number of parameters in your model, it starts to overfit. Overfitting means that your model learns the answers for the training objects instead of predicting them. This leads to bad behavior of your model on the test set: the prediction will be very bad. The solution to the problem of overfitting is regularization. Stop, you may ask me: why should we regularize the network at all? Maybe it is better not to increase the model size? So we have two possible approaches: a deep network with regularization, or a smaller network without regularization. Let's think: our loss function is non-convex. It means that there are multiple local minima of the loss function, but a local minimum is usually good enough for us. Yes, it would be better to reach the global optimum, but no optimization method for deep learning can do that. If we have a small number of parameters, then we have a few local minima, and in practice they are bad. If we increase the number of parameters, making our network bigger, then our loss function will have more local minima, and usually in practice they are better. The conclusion is simple: it is better to have a deeper network with regularization than a small network without it. Of course, you should understand that this is a practical result, and in your concrete case the situation may be different.
The first method of regularization comes from classical machine learning: L2 regularization, the idea of which is the same as in ridge regression, which is the name of linear regression with L2 regularization. We have some loss function for our neural network, and we add a regularization term to the optimization problem. This regularization term consists of the multiplication of lambda by the squared norm of the weights. Lambda is the regularization coefficient; it should be tuned on a validation set. The larger the lambda parameter, the closer our weights will be to zero. In order to implement such regularization in PyTorch, you can use the parameter weight_decay in the optimizer definition. This is absolutely the same parameter as lambda in our formula. 19. DropOut. DropConnect: we considered L2 regularization, but there are many more regularization techniques that were designed specifically for deep learning. The first deep learning regularization technique is dropout. It is easy to understand it on a multi-layer perceptron. On the left we have the original multi-layer perceptron with several layers; on the right we have the same network, but with dropout applied. Look: we turned off some of the neurons. In other words, we made some of the input features equal to zero, and, the important thing, we did it randomly. Let's take a closer look. We first set some probability p of an input feature becoming zero. Now we have two cases: training and testing. During training, we set some features to zero with probability p on each forward pass; that is, on each iteration we choose different features. We can look at it another way.
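As an implementation aside, here is what PyTorch's nn.Dropout layer does. One detail to be aware of: PyTorch uses so-called inverted dropout, scaling the surviving features by 1/(1 − p) during training instead of multiplying by p at test time; the two conventions are equivalent in expectation:

```python
import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()                     # training mode: a random Bernoulli mask is applied
y = drop(x)
print((y == 0).float().mean())   # roughly half of the features are zeroed
print(y.max())                   # survivors are scaled to 1 / (1 - p) = 2.0

drop.eval()                      # test mode: dropout is the identity
z = drop(x)
print(torch.equal(z, x))         # True
```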
We generate a Bernoulli mask and multiply it by the features. During testing, since we sampled the mask from the Bernoulli distribution during training, we should take the average, so we multiply our input features by the probability p. Dropout is usually implemented as a separate layer which randomly sets some features to zero; that is why it is possible to add dropout multiple times, before the layers where you need it. By default, the probability in the dropout layer is set to 0.5; however, it can, and in fact should, be tuned on the validation set. Also, there exist research papers that describe ways of automatically choosing a separate probability p for each feature; if you are interested, read about variational dropout. Dropout can be used as a good regularization technique, but what if you want to train an ensemble of neural networks? As you may know from a machine learning course, ensembles of predictors, for example gradient boosting or random forest, almost always work better than a single algorithm, for example a decision tree. Constructing an ensemble of neural networks is hard: it would require a lot of computational time and memory. However, dropout comes to help here. Since for each forward pass during training we sample a random Bernoulli mask, we have a new neural network on each iteration that is different from the previous one, and we can use these different networks as an ensemble, for example averaging their predictions. And the good news is that we don't need to store them separately: we just put the same data into the network each time, randomly setting some inputs of some layers to zero. The authors of the cited article showed that such a dropout ensemble usually works better in terms of prediction quality. The other regularization technique is DropConnect, which can also be used in deep learning. It is very similar to dropout, but with a big difference: while dropout sets some features to zero, DropConnect sets some weights to zero.
The pipeline is the same as in dropout. We choose some probability of the weights becoming zero; during training we set the weights to zero according to the selected probability, and during test time we multiply our weights by the probability p. Here is the comparison of the dropout and DropConnect methods; although the difference between them is not that big, people prefer to use dropout. 20. Dropblock: since we study modern deep image processing, I have to share with you a recently proposed regularization technique for images: DropBlock. The intuitive idea behind dropout is that we sometimes should set some pixels to zero, but we know that usually in an image there is only a limited set of pixels that contains useful semantic information. For example, for the dog photo, the useful region is green on the middle image. If we learn a classifier with two classes, cats or dogs, then we don't need the information that surrounds the dog, which in this photo is the grass. The disadvantage of dropout as a regularization technique is the following: if it sets pixels of the grass to zero, then this will not affect the prediction results at all, since the grass is useless information. In other words, sometimes our regularization will not work, because we want our regularization method to deprive our network of some semantically important information during training only; that is what helps prevent it from memorizing the training samples. And here comes the DropBlock algorithm: on each training iteration it sets a random square of pixels to zero, as shown on the right image. You can think of it in the following way: if our network can recognize the dog in the photo, it should also be able to recognize the dog without an ear, or with some other part hidden. And this is the idea behind DropBlock. How should it be implemented? In fact, rather simply. At first we set the length of the side of the square, which is called block size.
We also should set the probability of the block with its center at each pixel becoming zero. Then, knowing the block size, we can select the region which can potentially contain the center of a block; of course, we cannot place the center of a block right at the border of the image. This region is green on the left image; here we have the block size equal to five. After that, we generate a Bernoulli mask of size equal to the green area. For the Bernoulli mask we use the probability that we set up before. As you see on the left image, the Bernoulli mask gave us two centers of blocks to make zero, and on the right image you can see the result of DropBlock: some pixels became zero, as in dropout. The DropBlock is random during training and plays no role during testing. If you want, you can read the original paper by the link in the right corner of the slide. The experimental section of the DropBlock paper shows that this regularization technique allows modern architectures, such as ResNet-50, which we will study later in this course, to overcome the same architecture with dropout in validation accuracy. Here is another interesting visualization from the authors of this paper: class activation mapping. In the first row there are input images, and the other three rows show the original model, a model with DropBlock with block size one, and a model with block size seven. A model trained with DropBlock learns spatially distributed representations that induce high class activations in multiple regions; on the other hand, the model without DropBlock tends to focus on one or a few regions. So that is all for DropBlock; if you are interested, please read the original paper. 21. Early stopping: another important regularization technique is early stopping. We already considered several regularization techniques, such as L2, dropout, DropBlock and others, but the open question is: can we just stop our learning earlier? And the answer is: yes, we can.
And in fact we have to. Usually, when you train your neural network, you plot some metrics: some loss function on your train set and test set. It may be the loss that you minimize, or the accuracy of the classification, or any other metric. While training, your train error should always decrease, while your test error decreases only for some limited time and after that starts increasing. This is what is called the learning curve, and I hope you studied it in a machine learning course. In order to stop earlier, you should define where to stop your model. The stop may be somewhere as on the slide, or the stop may be here, sometime later. So the question is where to stop, and here comes the early stopping algorithm. It is important to say that, since we use batch gradient descent, at each iteration our loss can decrease or increase; this is a normal thing that you will always see while training. That is why we need to introduce a parameter n_steps. It means: how many steps should we tolerate the validation loss not improving? In other words, how many iterations should we wait while the loss is not decreasing? Let us look at the algorithm implementation in Python. At each iteration we check whether our validation loss decreased compared to the best so far; of course, if you are tracking accuracy instead, it should increase. If the loss decreased, then it is OK: we reset the wait parameter. If it didn't decrease, we should check two options. First, if we have already been waiting for an improvement for more than n_steps, in this case we should stop training; otherwise, we increase the parameter wait, which is responsible for how many iterations we have already been waiting. And that's it. Again, the only idea behind early stopping: if our validation loss didn't decrease for n_steps, then we should stop our training. Finally, what should we do in practice? At first, we should always use early stopping, always.
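The algorithm just described fits in a few lines of Python (a sketch; the class name, the n_steps value and the loss sequence below are my own):

```python
class EarlyStopping:
    """Stop training when validation loss hasn't improved for n_steps epochs."""
    def __init__(self, n_steps=5):
        self.n_steps = n_steps
        self.best = float("inf")
        self.wait = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:   # loss improved: remember it, reset the counter
            self.best = val_loss
            self.wait = 0
            return False
        self.wait += 1             # no improvement this epoch
        return self.wait > self.n_steps

# Hypothetical validation losses: improvement stalls after 0.8.
stopper = EarlyStopping(n_steps=2)
for loss in [1.0, 0.8, 0.9, 0.85, 0.81]:
    if stopper.should_stop(loss):
        print("stopping early")   # triggered on the 3rd epoch without improvement
        break
```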
Secondly, big networks plus regularization are better than small networks. As a regularization technique, people usually use dropout; DropBlock is also good, but it is a recently proposed method which is not implemented in deep learning libraries yet. That is all for this video. Thank you for watching. 22. Batch Normalization: now we come to the question of speeding up our training, and to normalization. Training of modern convolutional neural networks may take up to several weeks, and the question is: how can we reduce this time? Of course, we can use several GPUs, but that is not always available. Moreover, training time can take several weeks even with multiple GPUs. One way to think of speeding up our training is data normalization. If the data points are close enough to each other in the linear space, the network will find the dependencies in the data faster. Usually, normalization is performed in a way that makes our data zero-centered, that is, with a zero mean, and also with a standard deviation equal to one. In order to do it, we should first pre-compute the mean and standard deviation for each feature before training. After that, we subtract the pre-computed mean and divide by the pre-computed standard deviation. Note that we add some small epsilon to the standard deviation in order to avoid division by zero. This normalization is applied before training to all data points, including both train and test objects. However, we have another problem, which I described to you in the section on non-linearities in the convolutional model. If you haven't seen it, please do: it is really important to know, and many other courses omit this information. So, we have the sigmoid non-linearity function. What is the problem with it? It is its gradient, which is equal to zero almost everywhere except for a small region around zero. For us it means that when our data after some layer is not around zero, our gradient will vanish.
It will be zero, the weights will receive no update at all, and we will learn nothing from our data. Okay, what can we do with it? In 2015 researchers proposed a new normalization technique called batch normalization. The idea behind it is simple: let us create a layer that will normalize our data and will fit a learnable mean and standard deviation to our data. So how does it work? We again, as in standard normalization, compute the mean and standard deviation; then we subtract the mean and divide by the standard deviation plus some small epsilon. After that, we multiply our normalized data by a new learnable standard deviation and add a new learnable mean value. Pay attention: these parameters are learnable, which means they are also optimized with gradient descent. Also note that this is a layer, which can be inserted before each linear or convolutional layer in the network. It turned out that with batch normalization the training time decreased by 11 times on the Inception architecture, which we will study in the next module. Just imagine: inserting several batch normalization layers can speed up your training by 11 times. It's amazing. There are many other normalization techniques, such as batch renormalization, layer normalization, instance normalization and so on. Let us compare batch normalization with layer normalization. In the case of batch normalization, the mean and standard deviation are computed for each feature separately, and we obtain the mean of a concrete feature in the batch along with the standard deviation for this concrete feature. In the case of layer normalization, the mean and standard deviation are computed not for each feature separately, but for the object: for each object we compute the mean and standard deviation among all features. As a result, for each object we will have a single mean and standard deviation. Layer normalization can be used for images, but usually people prefer to use batch normalization; layer normalization is more popular for text processing and natural language processing. When you use batch normalization in practice, you should insert it after the convolutional layers and before the non-linearity, such as ReLU or sigmoid or any other. Of course, you can insert it in another way, for example after the non-linearity, and sometimes it can be better. Instead of computing the statistics, such as the mean and standard deviation, for each batch, people usually use running statistics, which are computed with the use of the momentum parameter, available in PyTorch. So, that is all for normalization; next time we'll discuss weight initialization. Thank you for watching and see you in the next video. 23. Data Augmentation: hello, my friend. In this video we'll talk about data augmentation, a good technique that allows you to extend your dataset and to improve your quality. So what are the problems of convolutional neural networks? They are, in fact, very sensitive to rotations of images: if we have a good prediction for one image and then rotate it a little bit, the quality may decrease. They are also sensitive to light: if we trained our network only on light images and then make the images dark, then the quality again may decrease. And there is also the problem of small datasets: if you have a very small amount of data, it's hard to train a neural network and obtain good quality. The solution to all of this is data augmentation. The first method of augmentation is rotation: say we have an image of a cat; we can see that a rotated cat is also a cat. So we can rotate our training images randomly, add them to our training set, and also use them for training. The second method of augmentation is flipping: we can flip our cat, again add it to the train set and train our neural network. The third method is scaling: we can take a random crop of the image, and again add it to the train set and train the network.
We can also change the contrast of our images, and we can add some random noise, for example from a normal distribution or from a uniform distribution, to the image and again add this image to the dataset. In fact, these methods can increase the size of your dataset several times over. But now, what do we do at test time? There is no single good answer. One possible solution is not to use augmentation at test time at all. A second possible solution is to use a combination of different augmentations: we take one test image, apply a contrast change, apply scaling, apply rotation, apply blurring, and then use the average of these predictions. The effect on quality varies from one task to another, so you should try it. But believe me, data augmentation is a very good technique that allows you to increase the size of your dataset and, with it, the quality, a lot. Thank you for watching and see you in the next video. 24. Datasets: In this video we are going to talk about existing datasets that can be used for training your neural network. The first and most popular dataset is the MNIST dataset. It consists of images of size 28 by 28, all images are grayscale, and there are 10 classes, the digits from 0 to 9. In total there are 60,000 training images of these classes. As I said, note that you should be wary of quality numbers measured on this dataset: the accuracy of, for example, the k-nearest neighbors algorithm on this dataset is about 96%. So this is a rather bad dataset for testing, because it is very easy. For this reason a new dataset, which is called Fashion-MNIST, was proposed. There are again 60,000 images, again of size 28 by 28, with 10 classes, but now they are not digits but different types of clothes, and this is a rather good alternative to MNIST. At the bottom of the slide,
you can see the link to this dataset, and again, all of these datasets can be downloaded with the torchvision library. The next famous dataset, which we used in our practical videos, is CIFAR-10. It consists of 60,000 color images of 10 classes, such as airplane, automobile, bird, etcetera; there are 6,000 images per class, and all of them have size 32 by 32. Another dataset, which is very similar to the previous one, is CIFAR-100. It again consists of 60,000 images, but now of 100 classes. There are 600 images per class, again of size 32 by 32, and you can see the classes on this slide: fish, flowers, people, reptiles, trees, and so on. There is also a good set of big datasets. The first is Tiny Images, which was proposed in 2008 and has 80 million images of size 32 by 32. There also exists a Caltech dataset, which has around 30,000 images of roughly 300 by 200 size. And the last and most popular one is ImageNet, which consists of 14 million images of the bigger size 256 by 256. Here is what I suggest you do: if, for example, you have a set of cats and a set of dogs and you want to build a classifier that distinguishes cats and dogs, try to use some additional labelled data from these datasets. You are able to just download ImageNet, take all the cats and all the dogs from there, and add them to your dataset, and this increase in your data will improve your quality. That is all for this video. Thank you for watching and see you in the next one. 25. Modern Architectures: In this video we'll talk about modern convolutional architectures, and in the next one we'll discuss how we can apply them in order to solve any image classification problem. The first model, proposed in 1998, was LeNet. It was designed for the MNIST dataset and consists of convolutional layers, average pooling, and sigmoid and hyperbolic tangent non-linearities; the input was 32 by 32 and the output was ten classes.
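A LeNet-style network in the spirit of the description above can be sketched in a few lines of PyTorch. The channel sizes follow the classic LeNet-5 layout, but this is an illustrative reconstruction, not a faithful reproduction of the 1998 paper:

```python
import torch
import torch.nn as nn

# A rough LeNet-style network: 32x32 grayscale input, convolutions with
# average pooling and tanh non-linearities, and 10 output classes.
lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),   # 32x32 -> 28x28
    nn.Tanh(),
    nn.AvgPool2d(2),                  # 28x28 -> 14x14
    nn.Conv2d(6, 16, kernel_size=5),  # 14x14 -> 10x10
    nn.Tanh(),
    nn.AvgPool2d(2),                  # 10x10 -> 5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),
    nn.Tanh(),
    nn.Linear(120, 10),               # 10 digit classes
)

out = lenet(torch.randn(4, 1, 32, 32))
print(out.shape)  # torch.Size([4, 10])
```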
The next famous architecture, AlexNet, was proposed in 2012. It consists of 60 million parameters, and it took up to six days of training to train this network on two GPUs. The input of this network was 224 by 224 and the output was 1000 classes, because this network was designed for ImageNet, which has 1000 classes. The next network, which improved on the quality of the previous one, was VGG, which consists of 138 million parameters, and most of these parameters, 123 million, were in the linear layers. It took up to three weeks of training on four GPUs. Now, how can we solve the problem of different scales? Look at these three images: on each of them there is a dog, but the scale is different. On the first image the dog occupies most of the photo, and on the last image the dog takes up only a small part of it. For solving this problem, the Inception network was proposed in 2014. It consists of so-called inception modules. The naive version of the inception module is the following. We have some previous layer (the output of the previous layer, or it can be the input of the network), and the idea is simple: let's apply convolutions with different kernel sizes in order to handle the different possible scales, a 1 by 1 convolution, a 3 by 3 convolution, a 5 by 5 convolution, and let's also use max pooling. After we apply these different types of convolution, we can concatenate their outputs into one result. This is the naive version of the inception module. In the same paper they proposed dimension reduction within the inception module. What is it? In fact, we keep the same convolutions as in the naive version, but add 1 by 1 convolutions, which are yellow in the right image. A 1 by 1 convolution allows us to keep the same output size as the input size but to reduce the number of filters, which means we can easily decrease the number of parameters in the 3 by 3 convolution and the 5 by 5 convolution.
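The naive inception module just described can be sketched as a small PyTorch module. The branch channel counts here are illustrative, not those of the actual GoogLeNet paper:

```python
import torch
import torch.nn as nn

# A sketch of the "naive" inception module: parallel 1x1, 3x3 and 5x5
# convolutions plus max pooling, concatenated along the channel axis.
class NaiveInception(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 16, kernel_size=1)
        self.b2 = nn.Conv2d(in_ch, 16, kernel_size=3, padding=1)
        self.b3 = nn.Conv2d(in_ch, 16, kernel_size=5, padding=2)
        self.b4 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        # every branch preserves height and width, so the outputs can
        # be concatenated channel-wise
        return torch.cat(
            [self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

block = NaiveInception(32)
out = block(torch.randn(2, 32, 28, 28))
print(out.shape)  # torch.Size([2, 80, 28, 28]): 16+16+16+32 channels
```

The dimension-reduction version would add 1 by 1 convolutions in front of the 3 by 3 and 5 by 5 branches to shrink the channel count before the expensive convolutions.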
And they found that a 1 by 1 convolution is better used after max pooling, not before it. They put all these layers together and proposed what is called the Inception network. But what if we want to stack more layers? What if we want, for example, to use 150 layers? In fact, if you just stack 150 layers of convolutions plus some non-linearities, you will run into several problems. The first problem is a CUDA out-of-memory error. This problem can be solved by using multiple GPUs, so let's imagine we can get the additional resources. Then we will have two problems with gradients, namely vanishing gradients and exploding gradients, and we will have vanishing activations and exploding activations. The last two can be solved with batch normalization, as we described in the previous section. But vanishing and exploding gradients occur not only because of the magnitudes of the activations but also because of the number of layers, and in fact it is very hard for the gradient to travel from the head of the network to the tail of the network, to the input. It is hard, and usually the gradient becomes zero. So, in fact, if you just stack 150 layers, your network will not learn at all. But in 2015 people proposed the so-called ResNet architecture. It consists of 152 layers and uses skip connections, also known as residual blocks. What is that? Look at the right image. Imagine you have some input x. Then we apply some weight layer, which is a convolutional layer, and we can apply it again; but on the right side we take x and add this x to the output of the convolutions without any modification. So the formula on the left is F(x) + x, where F is the convolutions and x is just our input. In fact, this helps a lot with building such deep neural networks, and it turned out that such a network was able to outperform humans on the ImageNet dataset. That's how it looks: here is a 34-layer residual network, but in practice they used 152 layers, and we will also use that in practice.
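The residual block just described can be sketched in PyTorch as follows. This is a minimal version with two 3 by 3 convolutions, assuming equal input and output channel counts; real ResNets also use strided blocks with 1 by 1 projections on the skip path:

```python
import torch
import torch.nn as nn

# A minimal residual block implementing out = F(x) + x: the input x
# skips past the convolutions and is added to their output unchanged.
class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)  # skip connection: add x unchanged

block = ResidualBlock(64)
out = block(torch.randn(2, 64, 16, 16))
print(out.shape)  # torch.Size([2, 64, 16, 16])
```

Because the identity path carries the signal (and the gradient) straight through, stacking many such blocks stays trainable where a plain stack of convolutions would not.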
So the bottom network is VGG-19, which we looked at a few slides ago; then a 34-layer network without residual connections; and then a 34-layer network with the residual connections. That is all for modern architectures; we will look at them in detail and study how to use them in practice. Thank you and see you in the next video. 26. Transfer Learning: In this video we will study how we can use existing trained networks in order to improve the quality on the problem we are solving. Let's imagine the following: you have a small dataset of 150 photos of airships, and you want to build a classifier which classifies, for example, different types of these airships. What possible solutions can we use? The first is data augmentation, and it is okay, but it will not be enough, because we will still have too few images in our training set. We could possibly use some existing datasets, but that is also not enough, because in the existing datasets, I guess, there are not that many airships: there are a lot of cats, dogs, birds and so on, but not airships. Okay, can we do something else? In fact, yes: we can use some pre-trained network, and this is called transfer learning. Imagine you have a big network that was trained on ImageNet. For example, it may be the residual network ResNet, which consists of 152 layers and which we studied in the previous video. It was trained on ImageNet, which consists of 14 million images and 1000 classes, and it is already trained. Then what can we do? We can remove the last few layers, the last layer or the last two layers, add some new layers, and train these added layers on our existing small dataset; and here we can, of course, also apply data augmentation. Look: now we need to train not a deep network but just a small one, and the amount of data that we have is usually enough to train one or two layers. But as a result we will have a big, deep network, and it usually works, because the first layers of a convolutional
network learn very good general features that are applicable to almost any problem. Another possible solution is to fine-tune the whole network, not just the new layers. That depends on the problem, of course, and on the amount of data you have in your training set. That is all for transfer learning; let's look at how we can use it in practice. Thank you for watching and see you in the next video. 27. PRACTICE PROJECT: Data Loading: Hello, my dear friend, and welcome to the boat recognition section. In this section we will solve the practical problem of classifying different types of boats. Imagine you have a bridge over a river, you want to set up a camera, and you want to build an algorithm that will classify all the boats that pass under the bridge. For this we need to build a convolutional classifier. For our goal we will use a dataset that I took from Kaggle, which is the biggest machine learning competition host. You can download this data manually by the link, or use the kaggle library, which allows you to download datasets: you can just install the kaggle library and then run `kaggle datasets download` with the name of the dataset, boat-types-recognition, where `-p` gives the path where to download the data. Let's import the necessary libraries; in this video we will load the data. Now, as in the previous practical videos, we define the transformations that we will apply to all our images. We now have not two but three transformations. The first transformation is resizing each image to the size 224 by 224. This is necessary because different images have different sizes, and usually they are big, like 1000 by 1000, and it is really hard to process such big images with a neural network. The second reason is that all the images should have the same size in order to be used for neural network training and prediction; otherwise we would have different output sizes and could not train the neural network at all.
The second transformation is ToTensor, which transforms the image into a PyTorch tensor, and the third is normalization, where we subtract the mean and divide by the standard deviation, making our mean equal to zero and the standard deviation equal to one. Now, our images are located in the data folder, and if we look at it with `ls ./data`, you will see different folders containing the images, for example cruise ship, and inside you can see a set of images. Because the data has exactly this folder structure, we can use the torchvision class ImageFolder. If you look at its documentation, it requires that all your images are located in separate per-class folders, like dog and cat. So we load the dataset by just writing `torchvision.datasets.ImageFolder` and applying our transformation. After that, we should divide our dataset into the train set and the test set. For this we can use `torch.utils.data.random_split` and pass the dataset, the train size, and the test size, which are just 0.7 of our dataset and 0.3 of our dataset. And after that we can define two loaders, a train loader and a test loader, as in the previous practical videos, with a predefined batch size, shuffling of the data, and the number of processes that load our data into memory equal to two. Okay, let's run this cell. The length of our dataset is 1476, which means it is a very small dataset, and believe me, it is really hard to train any convolutional network on such a dataset. In total we have nine classes, which means that each class has very few images. The classes are cruise ship, ferry boat, freight boat, gondola, and others. Since we have very few images in our train set, we should do something about it, and what I propose is, first, data augmentation, which we will cover in the next video, and also transfer learning, so we will use some pre-trained network, which we will also cover in the next videos. Now let's visualize the data.
Here you can see a set of images, and there are some different types of ships. Okay, that's all for this video. Thank you, and see you in the next one. 28. PRACTICE PROJECT: Data augmentation: In this video we will do data augmentation. Let's again import all the libraries. In PyTorch it is very convenient to do data augmentation: everything we need, we can define in the transformation. So what do we need here? The first transform is resizing to the size 224 by 224. After that we can add some random transformations from the transforms module of the torchvision library. Let's look at which transformations are available: CenterCrop (careful, this one is not random), RandomCrop, RandomGrayscale, RandomHorizontalFlip, RandomResizedCrop, RandomRotation. This is the set of random transformations that we can apply to our images. Let's apply, for example, RandomRotation. If you look at its signature, its main parameter is degrees, and here we should pass degrees equal to (-30, 30). It means that we will randomly rotate our image by an angle which is itself random, from minus 30 to 30 degrees. Now let's try it and visualize the result. Okay, we did it, and here you can see the rotated images; all these images were rotated randomly. Now let's see what more we can do: transforms.RandomHorizontalFlip. Let's add RandomHorizontalFlip and transforms.RandomVerticalFlip, and let's look at the result: load the dataset again, run the cells, and here you can see a horizontal flip, now a vertical flip, again a vertical flip, and some of them are flipped horizontally. Good. Now let's see what more we can do: transforms.RandomGrayscale. Now you can see that some random images became grayscale. And in fact, this is all for data augmentation; it is really easy to apply.
And you should apply it in practice not only when you have a small amount of data: even when you have a big amount of data, use data augmentation. It is really easy. Thank you for watching and see you in the next video. 29. PRACTICE PROJECT: ResNet18: In the previous video we studied how to do data augmentation in PyTorch. In this video we will learn how we can reuse existing pre-trained neural networks like AlexNet, ResNet, or any other. So let's do it. The first thing we need to do is to import the torchvision library. After that, we create the network: `torchvision.models.resnet18`, for example; it is the shortest version of ResNet. There is one parameter, which is called pretrained; by default it is False, so let's set it to True: `pretrained=True`. And this will be our network. Okay, let's look at it. That is it: we just loaded ResNet-18, and it is pre-trained, it already has trained weights. In order to look at a network in PyTorch, we can just type `net`, and here you will see all its layers. And now, what should we do? You can see that this is a sequence of basic blocks: Conv2d, BatchNorm2d, and so on, and at the end it has average pooling plus a fully connected layer. What we need now is to cut this fully connected layer off, add a new fully connected layer, and after that train only this new fully connected layer using our boats dataset. First, turn off all the gradients of our network; we need this because we don't want to train the 18 layers before the fully connected layer all over again. For this, let's do the following: `for param in net.parameters(): param.requires_grad = False`. Here it is, okay: we turned off the updating of all the layers. Now we want to set the new fully connected layer.
net.fc is set equal to `nn.Linear`, where the input neurons will be 512 and the output neurons will be 9, because we now have nine classes. Good. Now we can check our net again: we see it has a lot of convolutions, batch norms, and so on, and the last layer is Linear with 512 input features and 9 output features; we have nine classes. Good. Let's check the output shape. We get an error, because we didn't move our network to CUDA. Let's do it: `net = net.cuda()`. Okay, good. The output of our network is 16 by 9: 16 is the batch size, and nine is the number of output neurons, which is equal to the number of our classes. Let's define the criterion and the optimizer. Now we have two functions: the first of them computes the quality on the test set, and the second one is the training function; it takes the net as a parameter and the number of epochs for training. So let's train our network for five epochs and look at the quality. Our training has finished, and the final accuracy on the test images is 70%, which I think is rather good, taking into account the fact that we have only about 1500 images. Thank you for watching and see you in the next video.