Transcripts
1. Course Trailer: You've probably read in the news that deep learning is the secret recipe behind many exciting developments and has made many of the world's dreams, and perhaps also nightmares, come true. Who would have thought that DeepMind's AlphaGo could beat Lee Sedol at a board game which boasts more possible moves than there are atoms in the entire universe? A lot of people, including me, never saw it coming. It seemed impossible, but it's here now. Deep learning is everywhere. It's beating physicians at diagnosing cancer. It's responsible for translating webpages in a matter of mere seconds, and for autonomous vehicles from companies like Tesla. Hi, my name is Jason, and welcome to this course on deep learning, where you'll learn everything you need to get started with deep learning in Python: how to build remarkable algorithms capable of solving complex problems that weren't possible just a few decades ago. We'll talk about what deep learning is and the difference between artificial intelligence and machine learning. I'll introduce neural networks, what they are and just how essential they are to deep learning. You're going to learn how deep learning models train and learn, and the various types of learning: supervised, unsupervised and reinforcement learning. We're going to talk about loss functions, optimizers, the gradient descent algorithm, the different types of neural network architectures and the various steps involved in deep learning. So what are you waiting for? Enroll today and I'll see you in the course.
2. Introduction to Deep Learning: This entire course is centered on the notion of deep learning. But what is it? Deep learning is a subset of machine learning, which in turn is a subset of artificial intelligence, and unlike more traditional methods it learns representations directly from data. Machine learning involves teaching computers to recognize patterns in data in the same way as our brains do. As humans, it's easy for us to distinguish between a cat and a dog, but it's much more difficult to teach a machine to do this, and we'll talk more about this later on in the course. Before we do that, I want to give you a sense of the amazing successes of deep learning in the past. In 1997, Garry Kasparov, the most successful champion in the history of chess, lost to IBM's Deep Blue, one of the first artificial computer systems. It was the first defeat of a reigning world chess champion by a computer. In 2011, IBM's Watson competed on the game show Jeopardy against its champions Brad Rutter and Ken Jennings, and won the first prize of $1,000,000. In 2015, AlphaGo, a deep learning computer program created by Google's DeepMind division, defeated Lee Sedol, an 18-time world champion, at Go, a game a googol times more complex than chess. But deep learning can do more than just beat us at board games. It finds applications anywhere from self-driving vehicles to fake news detection to even predicting earthquakes. These were astonishing moments, not only because machines beat humans at their own games, but because of the endless possibilities that they opened up. What followed such events has been a series of striking breakthroughs in artificial intelligence, machine learning and, yes, deep learning. To put it simply, deep learning is a machine learning technique that learns features and tasks directly from data by running inputs through a biologically inspired neural network architecture. These neural networks contain a number of hidden layers through which the data is processed, allowing the machine to go "deep" in its learning, making connections and weighing inputs for the best results. We'll go over neural networks in the next video. So why deep learning? The problem with traditional machine learning algorithms is that no matter how complex they get, they'll always be machine-like. They need a lot of domain expertise and human intervention, and are only capable of what they're designed for. For example, if I show you the image of a face, you will automatically recognize it's a face. But how would a computer know what this is? Well, if we followed traditional machine learning, we'd have to manually and painstakingly define to a computer what a face is: for example, it has eyes, ears and a mouth. But now, how do you define an eye or a mouth to a computer? Well, if you look at an eye, the corners are at some angle, definitely not 90 degrees, definitely not zero degrees, but some angle in between, so we could work with that and train a classifier to recognize these kinds of lines in certain orientations. This is complicated both for AI practitioners and for the rest of the world. That's where deep learning holds a bit of promise. The key idea in deep learning is that you can learn these features just from the raw data, so I can feed a bunch of images of faces to my deep learning algorithm, and it's going to develop some kind of hierarchical representation of detected lines and edges, then use these lines and edges to detect eyes and a mouth, and compose them together to ultimately detect a face.
As it turns out, the underlying algorithms for training these models have existed for quite a long time. So why has deep learning been gaining popularity only many decades later? Well, for one, data has become much more pervasive. We're living in the age of big data, and these algorithms require massive amounts of data to be effectively implemented. Second, we have hardware and architectures that are capable of handling the vast amounts of data and computational power that these algorithms require, hardware that simply wasn't available a few decades ago. Third, building and deploying these algorithms, or models as they're called, is extremely streamlined with the increasing popularity of open source software like TensorFlow and PyTorch.
3. What are Neural Networks?: Deep learning models refer to the training of things called neural networks. Neural networks form the basis of deep learning, a subset of machine learning where the algorithms are inspired by the structure of the human brain. Just like neurons make up the brain, the fundamental building block of a neural network is also a neuron. Neural networks take in data and train themselves to recognize patterns in this data, and then predict outputs for a new set of similar data. In a neural network, information propagates through three central components that form the basis of every neural network architecture: the input layer, the output layer, and several hidden layers between the two. In the next video, we'll go over the learning process of a neural network.
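To make those three components concrete, here is a minimal sketch (my own illustration, not code from the course) of a fully connected network in Keras with an input layer, two hidden layers and an output layer; the layer sizes and the 4-feature, 3-class shapes are arbitrary assumptions.

```python
# A minimal sketch of the three components described above (illustrative only).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),               # input layer: 4 features
    tf.keras.layers.Dense(8, activation="relu"),     # first hidden layer
    tf.keras.layers.Dense(8, activation="relu"),     # second hidden layer
    tf.keras.layers.Dense(3, activation="softmax"),  # output layer: 3 classes
])
model.summary()  # lists the layers and the number of weights and biases in each
```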
4. Learning Process of a Neural Network: The learning process of a neural network can be broken into two main processes: forward propagation and backpropagation. Forward propagation is the propagation of information from the input layer to the output layer. We can define our input layer as several neurons, x1 through xn. These neurons connect to the neurons of the next layer through channels, and they are assigned numerical values called weights. The inputs are multiplied by the weights, and their sum is sent as input to the neurons in the hidden layer, where each neuron in turn is associated with a numerical value called the bias, which is then added to the input sum. This weighted sum is then passed through a nonlinear function called the activation function, which essentially decides if that particular neuron can contribute to the next layer. In the output layer, it's basically a form of probability: the neuron with the highest value determines what the output finally is. So let's go over a few terms. The weight of a neuron tells us how important the neuron is: the higher the value, the more important it is in the relationship. The bias is like the neuron having an opinion of its own; its role is to shift the activation function to the right or to the left. If you've had some experience with high school math, you should know that adding a scalar value to a function shifts its graph either to the left or to the right, and that's exactly what the bias does: it shifts the activation function to the right or to the left. Backpropagation is almost like forward propagation, except in the reverse direction: information here is passed from the output layer back to the hidden layers. But what information gets passed on from the output layer? Isn't the output layer supposed to be the final layer where we get the final output? Well, yes and no. Backpropagation is the reason why neural networks are so powerful; it is the reason why neural networks can learn by themselves. In the last step of forward propagation, a neural network spits out a prediction. This prediction has two possibilities: either right or wrong. In backpropagation, the neural network evaluates its own performance and checks if the prediction is right or wrong. If it is wrong, the network uses something called a loss function to quantify the deviation from the expected output, and it is this information that is sent back to the hidden layers for the weights and biases to be adjusted so that the network's accuracy level increases. Let's visualize the training process with a real example. Let's suppose we have a dataset. This dataset gives us the weight of a vehicle and the number of goods carried by the vehicle, and also tells us whether those vehicles are cars or trucks. We want to use this data to train a neural network to predict cars or trucks based on their weight and goods. To start off, let's initialize the neural network by giving it random weights and biases; these can be anything, we really don't care what these values are. In the first entry of our dataset, we have vehicle weight equal to a value, which in this case is 15, and goods as 2. According to the label, it's a car. We now start moving these input dimensions through the neural network. So basically what we want to do is take both inputs, multiply them by their weights and add the bias, and this is where the magic happens: we run this weighted sum through an activation function.
Now let's say that the output of this activation function is 0.1. This again is multiplied by the weights and added to the bias, and finally, in the output layer, we have a guess. Now, according to this neural network, a vehicle with weight 15 and goods 2 has a greater probability of being a truck. Of course, that's not true, and our neural network got it wrong. So we use backpropagation. We're going to quantify the difference between the expected result and the predicted output using a loss function, and in backpropagation we're going to go back and adjust our initial weights and biases. Remember that during the initialization of the neural network we chose completely random weights and biases; while doing backpropagation, these values will be adjusted to improve the prediction. Okay, so that was one iteration through the first entry of the dataset. In the second entry, we have a different vehicle weight and goods value. We're going to use the same process as before: multiply the inputs by the weights, add the bias, pass the result into an activation function, repeat until the output layer, check the error or difference, and employ backpropagation to adjust the weights and the biases. Your neural network will continue doing this repeated process of forward propagation, calculating the error, and then backpropagation for as many entries as there are in this dataset. The more data you give the neural network, the better it will be at predicting the right output, but there's a tradeoff, because training too much on the same data can end you up with a problem like overfitting, which I'll discuss later on in this course. But that's essentially how a neural network works: you feed input, the network initializes with random weights and biases that are adjusted each time during backpropagation, until the network has gone through all your data and is now able to make predictions. This learning algorithm can be summarized as follows. First, we initialize the network with random values for the network's parameters, that is, the weights and the biases. We take a set of input data and pass it through the network. We compare the predictions obtained with the values of the expected labels and calculate the loss using a loss function. We perform backpropagation in order to propagate this loss to each and every weight and bias. We use this propagated information to update the weights and biases of the neural network with the gradient descent algorithm, in such a way that the total loss is reduced and a better model is obtained. The last step is to continue iterating over the previous steps until we consider that we have a good enough model.
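Here is a hedged NumPy sketch of that whole loop on a made-up version of the cars-versus-trucks data: forward propagation, a loss, backpropagation and a gradient descent update. The dataset values, layer sizes and learning rate are my own assumptions, and constant factors in the gradients are folded into the learning rate.

```python
import numpy as np

# Toy data loosely following the lecture's example: [vehicle weight, goods carried],
# labels: 0 = car, 1 = truck. The numbers are made up for illustration.
X = np.array([[15.0, 2.0], [34.0, 67.0], [12.0, 1.0], [40.0, 80.0]])
y = np.array([[0.0], [1.0], [0.0], [1.0]])
X = X / X.max(axis=0)                # scale features so the sigmoid stays well-behaved

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 3)), np.zeros((1, 3))   # random weights/biases, hidden layer of 3
W2, b2 = rng.normal(size=(3, 1)), np.zeros((1, 1))   # output layer
lr = 0.5                                             # learning rate

for epoch in range(2000):
    # forward propagation
    h = sigmoid(X @ W1 + b1)             # hidden activations
    y_hat = sigmoid(h @ W2 + b2)         # predicted probability of "truck"
    loss = np.mean((y_hat - y) ** 2)     # squared error loss

    # backpropagation (chain rule through the sigmoid layers)
    d_out = (y_hat - y) * y_hat * (1 - y_hat)
    d_hid = (d_out @ W2.T) * h * (1 - h)

    # gradient descent update of weights and biases
    W2 -= lr * h.T @ d_out / len(X);  b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * X.T @ d_hid / len(X);  b1 -= lr * d_hid.mean(axis=0)

print(np.round(y_hat, 2))   # the probabilities should approach the labels after training
```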
5. Activation Functions: In this section we're going to talk about the most common terminology used in deep learning today. Let's start off with the activation function. The activation function serves to introduce something called nonlinearity into the network and also decides whether a particular neuron can contribute to the next layer. But how do you decide if a neuron can fire or activate? Well, we had a couple of ideas, which led to the creation of different activation functions. The first idea we had is: how about we activate the neuron if it is above a certain value or threshold, and if it is less than this threshold, don't activate it. Activation A = "activated" if y > threshold, else not. This is essentially a step function: its output is 1, or activated, when the value is greater than some threshold, and it outputs not activated otherwise. Great, so this makes an activation function for a neuron, no confusion, life is perfect. Except there are some drawbacks with this. To understand it better, think about the following: a case where you want to classify inputs into multiple classes, say class 1, class 2, class 3, and so on, with one output neuron per class. What will happen if more than one neuron is activated? All these neurons will output a 1. Well, how do you decide now? How do you decide which class it belongs to? It's complicated, right? You would want the network to activate only one neuron and the others to be zero; only then would you be able to say it was classified properly. In real practice, however, it is harder for training to converge this way. It would be better if the activation were not binary and instead took some probabilistic value, like 75% activated or 16% activated: there's a 75% chance that it belongs to class 2, and so on. Then, if more than one neuron activates, you could find which neuron "fires" based on which has the highest probability. Okay, so maybe you say to yourself: I want something that gives me a more analog value rather than just saying activated or not activated, something other than binary. And maybe you thought about a linear function, a straight-line function where the activation is proportional to the input by a value called the slope of the line. This way, it gives us a range of activations, so it isn't a binary activation. We can definitely connect a few neurons together, and if more than one fires, we could take the maximum value and decide based on that. So that is okay too. Then what is the problem with this? Well, if you're familiar with gradient descent, which I'll come to in just a bit, you'll notice that the derivative of a linear function is a constant. That makes sense, because its slope isn't changing at any point: for a function f(x) = mx + c, the derivative is m. This means that the gradient has no relationship whatsoever with x. It also means that during backpropagation the adjustments made to the weights and biases aren't dependent on x at all, and this is not a good thing. Additionally, think about connected layers: no matter how many layers you have, if all of them are linear in nature, the activation function of the final layer is nothing but a linear function of the input of the first layer. Pause for a bit and think about it. This means that the entire neural network of dozens of layers can be replaced by a single layer. Remember, a combination of linear functions combined in a linear manner is still another linear function.
And this is terrible, because we've just lost the ability to stack layers this way. No matter how much we stack, the whole network is still equivalent to a single layer with a single activation. Next we have the sigmoid function, and if you've ever watched a video on activation functions, this is the kind of function used in the examples. A sigmoid function is defined as f(x) = 1 / (1 + e^(-x)). Well, this looks smooth and kind of like a step function. What are its benefits? Think about it for a moment. Well, first things first, it is nonlinear in nature, and combinations of this function are also nonlinear. Great, so now we can stack layers. What about non-binary activations? Yes, that too: this function outputs an analog activation, unlike the step function, and it also has a smooth gradient. One advantage of this activation function is that, unlike the linear function, its output is always going to be in the range (0, 1), compared to the negative infinity to infinity of the latter. So we have activations bound in a range, and this won't blow up the activations, and this is great. Sigmoid functions are among the most widely used activation functions today. But life isn't always rosy, and sigmoids too tend to have their share of disadvantages. If you look closely, between x = -2 and x = 2 the y values are very steep: any small change in the value of x in that region will cause the value of y to change drastically. Also, toward either end of the function, the y values tend to respond very little to changes in x; the gradient in those regions is going to be really, really small, almost zero, and it gives rise to the vanishing gradient problem, which just says that if the input to the activation function is very large or very small, the sigmoid is going to squish it down to a value between zero and one, and the gradient of the function becomes really small. You'll see why this is a huge problem when we talk about gradient descent. Another activation function that is used is the tanh function. This looks very similar to the sigmoid; in fact, mathematically it is what's known as a scaled and shifted sigmoid function. Okay, so like the sigmoid, it has the characteristics that we discussed above: it is nonlinear in nature, so we can stack layers, and it is bound to a range, from -1 to 1, so there's no worrying about the activations blowing up. The derivative of the tanh function, however, is steeper than that of the sigmoid, so deciding between the sigmoid and the tanh would really depend on your requirements for the gradient. Like the sigmoid, tanh is also a very popular and widely used activation function, and yes, like the sigmoid, tanh does have the vanishing gradient problem. The rectified linear unit, or ReLU function, is defined as f(x) = max(0, x). At first glance, this would look like a linear function, right? The graph is linear in the positive part of the x axis. But let me tell you, ReLU is in fact nonlinear in nature, and combinations of ReLU are also nonlinear. Great, so this means that we can stack layers. However, unlike the previous two functions we discussed, it is not bounded: the range of ReLU is from zero to infinity. This means there is a chance of blowing up the activations. Another point I'd like to discuss here is the sparsity of activations. Imagine a big neural network with lots of neurons: using a sigmoid or a tanh will cause almost all the neurons to fire in an analog way.
This means that almost all activations will be processed to describe the network's output; in other words, the activations will be dense, and this is costly. Ideally, we want only a few neurons in the network to activate, thereby making the activations sparse and efficient. Here's where the ReLU comes in. Imagine a network with randomly initialized weights where almost 50% of the network yields zero activation because of the characteristic of ReLU: it outputs zero for negative values of x. This means that only around 50% of the neurons fire (sparse activation), making the network lighter. But when life gives you an apple, it comes with a little worm inside. Because of that horizontal line in ReLU for negative values of x, the gradient is zero in that region, which means that during backpropagation those weights will not get adjusted during descent. This means that the neurons which go into that state will stop responding to variations in the error: simply because the gradient is zero, nothing changes. This is called the dying ReLU problem. This problem can cause several neurons to just die and not respond, making a substantial part of the network passive rather than active. There are workarounds for this; one especially is to simply make the horizontal line into a non-horizontal component by adding a small slope, usually a small value such as 0.01, and this new version of the ReLU is called Leaky ReLU. The main idea is that the gradient should never be zero. One major advantage of ReLU is the fact that it's less computationally expensive than functions like tanh and sigmoid, because it involves simpler mathematical operations. This is a really good point to consider when you're designing your own deep neural networks. Great, so now the question is which activation function to use. Because of the advantages that ReLU offers, does this mean that you should use ReLU for everything you do, or could you consider sigmoid and tanh? Well, both. When you know the function that you're trying to approximate has certain characteristics, you should choose an activation function which will approximate that function faster, leading to faster training. For example, a sigmoid works well for binary classification problems, because approximating a classifier function as combinations of the sigmoid is easier than with, say, the ReLU. This leads to faster training and better convergence. You can use your own custom functions too. If you don't know the nature of the function you're trying to learn, I would suggest you start with ReLU and then work backwards from there. Before we move on to the next section, I want to talk about why we use nonlinear activation functions as opposed to linear ones. If you recall, in my definition of activation functions I mentioned that activation functions serve to introduce something called nonlinearity into the network. For all intents and purposes, introducing nonlinearity simply means that your activation function must be nonlinear, that is, not a straight line. Mathematically, linear functions are polynomials of degree one that, when graphed in the x-y plane, are straight lines inclined to the x axis at a certain value; we call this the slope of the line. Nonlinear functions are polynomials of degree greater than one, and when graphed they don't form straight lines but rather curves.
If we use linear activation functions to model our data, then no matter how many hidden layers the network has, it will always be equivalent to a single-layer network, and in deep learning we want to be able to model any kind of data without being restricted, as would be the case if we used linear functions.
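To tie the activation functions from this section together, here is a small NumPy sketch of each of them; this is my own illustration, and the threshold and leak slope values are just common defaults, not values prescribed by the course.

```python
import numpy as np

def step(x, threshold=0.0):
    return np.where(x > threshold, 1.0, 0.0)   # binary: activated or not

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))            # squashes inputs into (0, 1)

def tanh(x):
    return np.tanh(x)                          # scaled/shifted sigmoid shape, range (-1, 1)

def relu(x):
    return np.maximum(0.0, x)                  # f(x) = max(0, x), range [0, infinity)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)       # small slope for x < 0 avoids "dying ReLU"

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
for fn in (step, sigmoid, tanh, relu, leaky_relu):
    print(fn.__name__, np.round(fn(x), 3))
```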
6. Loss Functions: We discussed previously, in the learning process of neural networks, that we start with random weights and biases, the neural network makes a prediction, this prediction is compared against the expected output, and the weights and biases are adjusted accordingly. Well, loss functions are the reason we're able to calculate that difference. Really simply, a loss function is a way to quantify the deviation of the predicted output of the neural network from the expected output. It's as simple as that, nothing more, nothing less. There are plenty of loss functions out there. For example, for regression we have squared error loss, absolute error loss and Huber loss; in binary classification we have binary cross-entropy and hinge loss; and in multi-class classification problems we have multi-class cross-entropy and the Kullback-Leibler divergence loss, and so on. The choice of loss function really depends on what kind of project you're working on; different projects require different loss functions. Now, I don't want to talk any further about loss functions right now; we'll do that under the optimization section, because that's really where loss functions are utilized.
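As a quick illustration of two of the loss functions mentioned above, here is a hedged NumPy sketch of squared error loss and binary cross-entropy; the prediction numbers are made up.

```python
import numpy as np

def mse(y_true, y_pred):
    # squared error loss, averaged: typical for regression
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # typical for binary classification; eps avoids log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.6])
print(mse(y_true, y_pred))                   # small: predictions are close to the labels
print(binary_cross_entropy(y_true, y_pred))  # penalizes confident wrong predictions heavily
```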
7. Optimizers: In the previous section we dealt with loss functions, which are mathematical ways of measuring how wrong the predictions made by a neural network are. During the training process, we tweak and change the parameters, the weights, of the model to try and minimize that loss function and make our predictions as correct and optimized as possible. But how exactly do you do that? How do you change the parameters of your model, by how much, and when? We have the ingredients; how do we make the cake? This is where optimizers come in. They tie together the loss function and the model parameters, the weights and biases, by updating the network in response to the output of the loss function. In simpler terms, optimizers shape and mold your model into a more accurate model by adjusting the weights and biases. The loss function is their guide: it tells the optimizer whether it's moving in the right or the wrong direction. To understand this better, imagine that you have just scaled Mount Everest, and now you decide to descend the mountain blindfolded. It's impossible to know which direction to go in. You could either go up, which is away from your goal, or go down, which is toward it. To begin, you would start taking steps, and using your feet you'd be able to gauge whether you're going up or down. In this analogy, you resemble the neural network; going down is your goal, which is trying to minimize the error; and your feet resemble the loss function, since they measure whether you're going the right way or the wrong way. Similarly, it's impossible to know what your model's weights should be right from the start, but with some trial and error based on the loss function, you could end up getting there eventually. We now come to gradient descent, often called the granddaddy of optimizers. Gradient descent is an iterative algorithm that starts at a random point on the loss function and travels down its slope in steps until it reaches the lowest point, the minimum of the function. It is the most popular optimizer used nowadays; it's fast, robust and flexible, and here's how it works. First, we calculate what a small change in each individual weight would do to the loss function. Then, we adjust each individual weight based on its gradient, that is, take a small step in the determined direction. The last step is to repeat the first and second steps until the loss function gets as low as possible. I want to talk about this notion of a gradient. The gradient of a function is the vector of its partial derivatives with respect to all independent variables. The gradient always points in the direction of the steepest increase of the function. Suppose we have a graph like so, with the loss on the y axis and the value of the weight on the x axis. We have a little data point here that corresponds to the randomly initialized weight. To minimize the loss, that is, to get this data point to the minimum of the function, we need to take the negative gradient, since we want to find the steepest decrease in the function. This process happens iteratively until the loss is as low as possible, and that's gradient descent in a nutshell. When dealing with high-dimensional datasets, that is, with lots of variables, it's possible you'll find yourself in an area where it seems like you've reached the lowest possible value for your loss function, but in reality it's just a local minimum. To avoid getting stuck in a local minimum, we make sure we use a proper learning rate.
Changing the weights too fast by adding or subtracting too much, that is, taking steps that are too large, can hinder your ability to minimize the loss function. We don't want to make a jump so large that we skip over the optimal value for a given weight. To make sure this doesn't happen, we use a variable called the learning rate. This is usually just a small number, like 0.001, that we multiply the gradients by to scale them down. This ensures that any changes we make to the weights are pretty small. In math talk, taking steps that are too large can mean that the algorithm will never converge to an optimum. At the same time, we don't want to take steps that are too small, because then we might never end up with the right values for our weights; in math talk, steps that are too small might lead to the optimizer converging on a local minimum of the loss function, but never the absolute minimum. For a simple summary, just remember that the learning rate ensures that we change our weights at the right pace, not making any changes that are too big or too small. Instead of calculating the gradients over all your training examples on every pass of gradient descent, it's sometimes more efficient to only use a subset of the training examples each time. Stochastic gradient descent is an implementation that either uses batches of examples at a time or random examples on each pass. Stochastic gradient descent also uses the concept of momentum: momentum accumulates gradients of the past steps to dictate what might happen in the next steps. Also, because we don't include the entire training set, SGD is less computationally expensive. It's difficult to overstate how popular gradient descent really is; backpropagation is basically gradient descent implemented on a network. There are other types of optimizers based on gradient descent that are used today. Adagrad adapts the learning rate specifically to individual features; that means that some of the weights in your model will have different learning rates than others. This works really well for sparse datasets where a lot of input features are missing. Adagrad has a major issue, though: the adaptive learning rate tends to get really, really small over time. RMSprop is a special version of Adagrad developed by Professor Geoffrey Hinton; instead of letting all the gradients accumulate for momentum, it accumulates gradients in a fixed window. RMSprop is similar to Adadelta, another optimizer that seeks to solve some of the issues that Adagrad leaves open. Adam stands for adaptive moment estimation and is another way of using past gradients to calculate current gradients. Adam also utilizes the concept of momentum, which is basically our way of telling the neural network whether we want past changes to affect the new change, by adding fractions of the previous gradients to the current one. This optimizer has become pretty widespread and is practically the default choice for training neural networks. It's easy to get lost in the complexity of some of these newer optimizers; just remember that they all have the same goal: minimizing the loss function, and trial and error will get you there.
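Here is a tiny sketch of the gradient descent update itself, on an assumed one-weight loss L(w) = (w - 3)^2 whose gradient I computed by hand; the starting point and learning rate are arbitrary. The commented Keras lines at the end just show that the optimizer variants above are usually picked by name in a framework rather than implemented from scratch.

```python
# A minimal sketch of gradient descent on a single weight (illustrative only).
# Assume the loss as a function of one weight w is L(w) = (w - 3)**2,
# whose gradient is dL/dw = 2 * (w - 3); the minimum is at w = 3.
w = 10.0                 # arbitrary starting point
learning_rate = 0.1      # the small step-size factor discussed above

for step in range(50):
    grad = 2 * (w - 3)            # gradient of the loss at the current weight
    w -= learning_rate * grad     # take a small step against the gradient
print(round(w, 4))                # close to 3.0, the minimum of the loss

# In a framework such as Keras, the optimizers mentioned above are selected by name:
# model.compile(optimizer="sgd", loss="mse")      # stochastic gradient descent
# model.compile(optimizer="rmsprop", loss="mse")  # RMSprop
# model.compile(optimizer="adam", loss="mse")     # Adam
```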
8. Parameters VS Hyperparameters: You may have heard me referring to the word "parameters" quite a bit, and often this word is confused with the term "hyperparameters". In this video, I'm going to outline the basic difference between the two. A model parameter is a variable that is internal to the neural network and whose value can be estimated from the data itself. Parameters are required by the model when making predictions, and their values define the skill of the model on your problem. They can be estimated directly from the data and are often not manually set by the practitioner. Oftentimes, when you save your model, you are essentially saving your model's parameters. Parameters are key to machine learning algorithms, and examples include the weights and the biases. A hyperparameter is a configuration that is external to the model and whose value cannot be estimated from data. There's no way we can find the best value for a model hyperparameter on a given problem; we may use rules of thumb, copy values used on other problems, or search for the best value by trial and error. When a machine learning algorithm is tuned for a specific problem, such as when you're using grid search or random search, you are in fact tuning the hyperparameters of the model in order to discover the parameters that result in the most skillful predictions. Model hyperparameters are often referred to as parameters, which can make things confusing, so a good rule of thumb to overcome this confusion is as follows: if you have to specify a parameter manually, then it is probably a hyperparameter; parameters are inherent to the model itself. Some examples of hyperparameters include the learning rate for training a neural network, the C and sigma hyperparameters for support vector machines, and the k in k-nearest neighbors.
9. Epochs, Batches, Batch Sizes and Iterations: We need terminologies like epochs, batch size and iterations only when the data is too big, which happens all the time in machine learning, and when we can't pass all of this data to the computer at once. So to overcome this problem, we need to divide the dataset into smaller chunks, give them to the computer one by one, and update the weights of the neural network at the end of every step so that it fits the data it's given. One epoch is when the entire dataset is passed forward and backward through the network once. In a majority of deep learning models, we use more than one epoch. I know it doesn't make sense in the beginning: why do we need to pass the entire dataset many times through the same neural network? Passing the entire dataset through the network only once is like trying to read the entire lyrics of a song once: you won't be able to remember the whole song immediately; you have to re-read the lyrics a couple more times before you can say you know the song by memory. The same is true with the neural network: we pass the dataset multiple times through the network so it's able to generalize better. Gradient descent is an iterative process, and updating the parameters with backpropagation in a single pass, or one epoch, is not enough. As the number of epochs increases, the more the parameters are adjusted, leading to a better performing model. But too many epochs could spell disaster and lead to something called overfitting, where the model has essentially memorized the patterns in the training data and performs terribly on data it's never seen before. So what is the right number of epochs? Unfortunately, there is no right answer; the answer is different for different datasets. Sometimes your dataset can include millions of examples, and passing this entire dataset at once becomes extremely difficult. So what we do instead is divide the dataset into a number of batches rather than passing the entire dataset into the network at once. The total number of training examples present in a single batch is called the batch size. Iterations is the number of batches needed to complete one epoch; note that the number of batches is equal to the number of iterations for one epoch. Let's say that we have a dataset of 30,000 training examples. If we divide the dataset into batches of 500, then it will take 60 iterations to complete one epoch.
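A quick sketch of that arithmetic, plus how the same quantities typically show up as Keras fit() arguments; the model and data names in the comments are assumptions.

```python
import math

# How epochs, batch size and iterations relate (numbers from the lecture's example).
num_examples = 30_000
batch_size = 500
iterations_per_epoch = math.ceil(num_examples / batch_size)
print(iterations_per_epoch)   # 60 iterations to complete one epoch

# In Keras, the same quantities appear as arguments to fit(); model and data are assumed:
# model.fit(X_train, y_train, epochs=10, batch_size=500)
# Keras then runs ceil(len(X_train) / batch_size) iterations per epoch.
```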
10. Conclusion to Terminologies: Well, I hope that gives you some kind of sense of the very basic terminologies used in deep learning. Before we move on, I do want to mention this, and you will see it a lot: in deep learning, you often have a bunch of different choices to make, such as how many hidden layers to choose, or which activation function to use and where. And to be honest, there are no clear-cut guidelines as to what your choice should always be. That's the fun part about deep learning: it's extremely difficult to know in the beginning what's the right combination to use for your project. What works for me may not work for you, and a suggestion from my end would be that you dabble along with the materials shown, try out various combinations, and see what works best for you. Ultimately, that's a learning process, pun unintended. Throughout this course, I'll impart a bit of intuition as to what's popular, so that when it comes to building a deep learning project, you won't find yourself lost.
11. Regularization: A central problem in deep learning is how to make an algorithm that will perform well not just on training data, but also on new inputs. One of the most common challenges you'll face when training models is the problem of overfitting, a situation where your model performs exceptionally well on training data but not on testing data. Say I have a dataset graphed in the x-y plane, like so. Now I want to construct a model that best fits the dataset. What I could do is draw a line with some random slope and intercept. Now, evidently, this isn't the best model; in fact, this is called underfitting, because it doesn't fit the data well, it underestimates the dataset. Instead, what we could do is draw a curve that looks something like this, which really fits our data the best. But this is overfitting. Remember that while training, we show our network the training data, and once that's done, we expect it to be almost close to perfect. The problem with this graph is that although it is probably the best fit, it is the best fit only when considering your training data. What your network has done in this graph is memorize patterns in the training data, and it wouldn't give accurate predictions at all on data it's never seen before. And this makes sense, because instead of learning patterns that generalize to perform well on both training and new testing data, our network has memorized patterns only in the training data, so obviously it won't perform well on new data it has never seen before. This is the problem of overfitting: it fitted too much. And by the way, this would be the more appropriate kind of fit: it's not perfect, but it will do well on both training as well as new testing data with sizable accuracy. There are a couple of ways to tackle overfitting. The most interesting type of regularization is dropout. It produces very good results and is consequently the most frequently used regularization technique in the field of deep learning. To understand dropout, let's say that we have a neural network with two hidden layers. What dropout does is that at every iteration it randomly selects some nodes and removes them, along with their incoming and outgoing connections, as shown. So each iteration has a different set of nodes, and this results in a different set of outputs. So why do these models perform better? They usually perform better than a single model because they capture more randomness and memorize less of the training data, and hence the network is forced to generalize better and build a more robust predictive model. Sometimes the best way to make a deep learning model generalize better is to train it on more data. In practice, the amount of data we have is limited, and one way to get around this problem is to create fake data and add it to the training set. For some deep learning tasks, it is reasonably straightforward to create new fake data. This approach is easiest for classification: a classifier needs to take a complicated, high-dimensional input x and summarize it with a category identity y. This means that the main task facing a classifier is to be invariant to a wide variety of transformations, so we can generate new (x, y) pairs easily just by applying transformations to the x inputs in our training set. Dataset augmentation has been a particularly effective technique for a specific classification problem:
object recognition. Images are high dimensional and include an enormous range of factors of variation, many of which can easily be simulated. Operations like translating the training images a few pixels in each direction can often greatly improve generalization, and many other operations, such as rotating or scaling the image, have also proved quite effective. You must be careful not to apply transformations that would change the correct class. For example, optical character recognition tasks require recognizing the difference between a "b" and a "d", and the difference between a 6 and a 9, so horizontal flips and 180-degree rotations are not appropriate ways of augmenting datasets for these tasks. When training large models with sufficient representational capacity to overfit the task, we often observe that the training error decreases steadily over time, but the error on the validation set begins to rise again. This means we can obtain a model with better validation set error, and thus hopefully better test set error, by stopping training at the point where the error on the validation set starts to increase. This strategy is known as early stopping. It is probably the most commonly used form of regularization in deep learning today; its popularity is due to both its effectiveness and its simplicity.
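Here is a hedged Keras sketch that puts the three ideas from this section, dropout, data augmentation and early stopping, side by side. The layer sizes, dropout rate, patience and augmentation factors are my own choices, and X_train, y_train, X_val and y_val are assumed to exist.

```python
import tensorflow as tf

# Dropout: randomly remove nodes (and their connections) on each training iteration.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),                 # drops 50% of this layer's units per step
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Early stopping: halt training once the validation loss starts rising again.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                              restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=100, callbacks=[early_stop])

# Data augmentation for images: small, label-preserving transformations.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomTranslation(0.1, 0.1),  # shift a few pixels in each direction
    tf.keras.layers.RandomRotation(0.05),         # small rotations only, to keep the class
])
```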
12. Introduction to Learning: In this section, we're going to talk about the different types of learning, which are machine learning concepts that extend to deep learning as well. Throughout this section of the course, we'll go over supervised learning, unsupervised learning and reinforcement learning.
13. Supervised Learning: Supervised learning is the most common sub-branch of machine learning today. Typically, if you're new to machine learning, your journey will begin with supervised learning algorithms. Let's explore what these are. Supervised machine learning algorithms are designed to learn by example. The name "supervised learning" originates from the idea that training this type of algorithm is almost like there's a human supervising the whole process. In supervised learning, we train our models on well-labeled data: each example is a pair consisting of an input object, which is typically a vector, and a desired output value, also called the supervisory signal. During training, a supervised learning algorithm will search for patterns in the data that correlate with the desired outputs. After training, it will take in new, unseen inputs and determine which label the new inputs should be classified as, based on prior training data. At its most basic form, a supervised learning algorithm can simply be written as y = f(x), where y is the predicted output, determined by a mapping function that assigns a class to an input value x. The function used to connect input features to a predicted output is created by the machine learning model during training. Supervised learning can be split into two subcategories: classification and regression. During training, a classification algorithm will be given data points with an assigned category. The job of a classification algorithm is then to take an input value and assign it to a class or category that it fits into, based on the training data provided. The most common example of classification is determining whether an email is spam or not. With two classes to choose from, spam or not spam, this problem is called a binary classification problem. The algorithm will be given training data with emails that are both spam and non-spam, and the model will find the features within the data that correlate to either class and create a mapping function. Then, when provided with an unseen email, the model will use this function to determine whether or not the email is spam. An example of a classification problem would be the MNIST handwritten digits dataset, where the inputs are images of handwritten digits (pixel data) and the output is a class label for which digit the image represents, that is, numbers zero to nine. There are numerous algorithms to solve classification problems, each of which depends on the data and the situation. Here are a few popular classification algorithms: linear classifiers, support vector machines, decision trees, k-nearest neighbors and random forests. Regression is a predictive statistical process where the model attempts to find the important relationship between dependent and independent variables. The goal of a regression algorithm is to predict a continuous number, such as sales, income or test scores. The equation for a basic linear regression can be written as y = w1*x1 + w2*x2 + ... + wn*xn + b, where the xi represent the features of the data and the wi and b are parameters which are developed during training. For simple linear regression models with only one feature in the data, the formula looks like y = wx + b,
where w is the slope, x is the single feature, and b is the y-intercept. For simple regression problems such as this, the model's predictions are represented by the line of best fit. For models using two features, a plane is used, and for models with more than two features, a hyperplane is used. Imagine we wanted to predict a student's test grade based on how many hours they studied the week of the test. Let's say the plotted data with the line of best fit looks like this. There is a clear positive correlation between hours studied (the independent variable) and the student's final test score (the dependent variable). A line of best fit can be drawn through the data points to show the model's predictions when given new input. Say we wanted to know how well a student would do with five hours of studying; we can use the line of best fit to predict the test score based on other students' performances. Another example of a regression problem would be the Boston house prices dataset, where the inputs are variables that describe a neighborhood and the output is the house price in dollars. There are many different types of regression algorithms; the three most common are linear regression, lasso regression and multivariate regression. Supervised learning finds applications in classification and regression problems like bioinformatics (such as fingerprint, iris and face recognition in smartphones), object recognition, spam detection and speech recognition.
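As a small illustration of the hours-studied regression example, here is a hedged scikit-learn sketch; the hours and scores are invented numbers, not data from the course.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

hours_studied = np.array([[1.0], [2.0], [3.0], [4.0], [6.0], [8.0]])  # independent variable
test_scores = np.array([52, 58, 65, 70, 81, 90])                      # dependent variable

model = LinearRegression().fit(hours_studied, test_scores)  # learns w (slope) and b (intercept)
print(model.coef_, model.intercept_)                         # the learned w and b in y = wx + b
print(model.predict([[5.0]]))                                # predicted score for 5 hours of study
```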
14. Unsupervised Learning: Unsupervised learning is a branch of machine learning that is used to find underlying patterns in data and is often used in exploratory data analysis. Unlike supervised learning, unsupervised learning does not use labeled data but instead focuses on the data's features. Labeled training data has a corresponding output for each input. The goal of an unsupervised learning algorithm is to analyze data and find important features in that data. Unsupervised learning will often find subgroups or hidden patterns within the dataset that a human observer might not pick up on, and this is extremely useful, as we'll soon find out. Unsupervised learning can be of two types: clustering and association. Clustering is the simplest and among the most common applications of unsupervised learning. It is the process of grouping the given data into different clusters or groups. Clusters will contain data points that are as similar as possible to each other and as dissimilar as possible to data points in other clusters. Clustering helps find underlying patterns within the data that may not be noticeable to a human observer. It can be broken down into partitional clustering and hierarchical clustering. Partitional clustering refers to a set of clustering algorithms where each data point in a dataset can belong to only one cluster. Hierarchical clustering finds clusters by a system of hierarchies: every data point can belong to multiple clusters, and some clusters will contain smaller clusters within them. This hierarchy system can be organized as a tree diagram. Some of the most commonly used clustering algorithms are k-means, expectation maximization and hierarchical cluster analysis (HCA). Association, on the other hand, attempts to find relationships between different entities. The classic example of association rules is market basket analysis: this means using a database of transactions in a supermarket to find items that are frequently bought together. For example, a person who buys potatoes and burgers usually buys beer, or a person who buys tomatoes and pizza cheese might also want to buy pizza bread. Unsupervised learning finds applications almost everywhere. For example, Airbnb, which helps people host stays and experiences and connects people all over the world, uses unsupervised learning algorithms: a potential client queries their requirements, and Airbnb learns these patterns and recommends stays and experiences which fall under the same group or cluster. For example, a person looking for houses in San Francisco might not be interested in finding houses in Boston. Amazon also uses unsupervised learning to learn a customer's purchases and recommend products which are frequently bought together, which is an example of association rule mining. Credit card fraud detection is another unsupervised learning application, where the algorithm learns the various usage patterns of a user and their credit card; for inputs that do not match this behavior, an alarm is generated which could possibly be marked as fraud, and in some cases your bank might call you to confirm whether it was you using the card or not.
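Here is a minimal clustering sketch with k-means in scikit-learn, using a toy set of 2-D points of my own; note that no labels are given to the algorithm, it discovers the two groups on its own.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of 2-D points; no labels are provided to the algorithm.
points = np.array([[1, 2], [1, 4], [2, 3],
                   [9, 8], [10, 9], [9, 10]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)            # which cluster each point was assigned to
print(kmeans.cluster_centers_)   # the centers the algorithm discovered on its own
```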
15. Reinforcement Learning: Reinforcement learning is a type of machine learning technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences. Like supervised learning, it uses a mapping between inputs and outputs, but unlike supervised learning, where the feedback provided to the agent is the correct set of actions for performing a task, reinforcement learning uses rewards and punishments as signals for positive and negative behavior. When compared with unsupervised learning, reinforcement learning is different in terms of its goals: while the goal in unsupervised learning is to find similarities and differences between data points, in reinforcement learning the goal is to find a suitable action model that maximizes the total accumulated reward of the agent. Reinforcement learning refers to goal-oriented algorithms which learn how to attain a complex objective or goal, or how to maximize along a particular dimension over many steps; for example, they can maximize the points won in a game over many moves. Reinforcement learning algorithms can start from a blank slate and, under the right conditions, achieve superhuman performance. Like a pet incentivized by scolding and treats, these algorithms are penalized when they make the wrong decisions and rewarded when they make the right ones; this is reinforcement. Reinforcement learning is usually modeled as a Markov decision process, although other frameworks like Q-learning are also used. Some key terms describe the elements of a reinforcement learning problem: the environment is the physical world in which the agent operates; the state represents the current situation of the agent; the reward is the feedback received from the environment; the policy is the method to map the agent's state to the agent's actions; and finally, the value is the future reward that an agent will receive by taking an action in a particular state. A reinforcement learning problem can be best explained through games. Let's take the game of Pac-Man, where the goal of the agent, or Pac-Man, is to eat the food in the grid while avoiding the ghosts on its way. The grid world is the interactive environment for the agent. Pac-Man receives a reward for eating food and a punishment if it gets killed by a ghost, that is, if it loses the game. The state is the location of Pac-Man in the grid world, and the total accumulated reward is Pac-Man winning the game. Reinforcement learning finds applications in robotics, business strategy planning, traffic light control, web system configuration, and aircraft and robot motion control.
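To make the reward, state and policy terms a bit more concrete, here is a toy Q-learning sketch (one of the frameworks mentioned above); the corridor environment, the reward of +1 at the goal and the learning parameters are entirely my own assumptions, not the Pac-Man example from the lecture.

```python
import numpy as np

# A 1-D corridor of 5 states; the agent moves left/right and gets +1 only at the rightmost state.
n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))   # value estimates for each (state, action) pair
alpha, gamma, epsilon = 0.1, 0.9, 0.1

rng = np.random.default_rng(0)
for episode in range(500):
    s = 0
    while s != n_states - 1:
        # epsilon-greedy policy: mostly exploit the current values, sometimes explore
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 1.0 if s_next == n_states - 1 else 0.0   # reward from the environment
        # Q-learning update: move the estimate toward reward + discounted future value
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.argmax(Q, axis=1))   # learned policy: move right in every non-terminal state
```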
16. Introduction to Neural Network Architectures: In this section, I'm going to introduce the three most common types of neural network architectures today: fully connected feedforward neural networks, recurrent neural networks and convolutional neural networks.
17. Fully-Connected Feed Forward Neural Networks: The first type of neural network architecture we're going to discuss is the fully connected feedforward neural network. By fully connected, I mean that each neuron in the preceding layer is connected to every neuron in the subsequent layer, without any connections backwards; there are no cycles or loops in the connections of the network. As I mentioned previously, each neuron in a neural network contains an activation function that changes the output of the neuron given its input. There are several types of activation functions that can change this input-output relationship to make a neuron behave in a variety of ways. Some of the most well-known activation functions are the linear function, which is a straight line that essentially multiplies the input by a constant value; the sigmoid function, which ranges from 0 to 1; the hyperbolic tangent, or tanh function, which ranges from -1 to +1; and the rectified linear unit, or ReLU function, which is a piecewise function that outputs zero if the input is less than a certain value, or a linear multiple of the input if it is greater than that value. Each type of activation function has its pros and cons, so we use them in various layers of a deep neural network based on the problem each is designed to solve. In addition, the last three activation functions are referred to as nonlinear functions, because the output is not a linear multiple of the input. Nonlinearity is what allows deep neural networks to model complex functions. Using everything we've learned so far, we can create a wide variety of fully connected feedforward neural networks: networks with various inputs, various outputs, various hidden layers and neurons per hidden layer, and a variety of activation functions. These numerous combinations allow us to create a variety of powerful deep neural networks that can solve a wide array of problems. The more neurons we add to each hidden layer, the wider the network becomes, and the more hidden layers we add, the deeper the network becomes. However, each neuron we add increases the complexity, and thus the computational resources necessary to train the network. This increase in complexity isn't linear in the number of neurons we add, so it can lead to an explosion in complexity and training time for large neural networks. That's a tradeoff you need to consider when you're building deep neural networks.
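As a rough illustration of how quickly complexity grows as we widen and deepen a fully connected network, here is a small sketch that just counts parameters; the layer sizes are arbitrary examples of mine.

```python
# Each fully connected (dense) layer has (inputs x neurons) weights plus one bias per neuron.
def dense_params(n_in, n_out):
    return n_in * n_out + n_out

# A narrow, shallow network: 100 inputs -> 32 -> 10 outputs
narrow = dense_params(100, 32) + dense_params(32, 10)
# A wider, deeper network: 100 inputs -> 512 -> 512 -> 512 -> 10 outputs
wide = (dense_params(100, 512) + dense_params(512, 512)
        + dense_params(512, 512) + dense_params(512, 10))

print(narrow)  # 3,562 parameters
print(wide)    # well over half a million parameters
```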
18. Recurrent Neural Networks: All the neural networks we've discussed so far are known as feedforward neural networks: they take in a fixed-size input and give you a fixed-size output. That's all they do, and that's what we expect neural networks to do: take an input and give a sensible output. But as it turns out, these plain or vanilla neural networks aren't able to model every single problem we have today. To better understand this, use this analogy. Suppose I show you the picture of a ball, a round spherical ball, that was moving in space in some direction; I've just taken a photo, or a snapshot, of the ball at some time t. Now I want you to predict the next position of the ball in, say, two or three seconds. You're probably not going to give me an accurate answer. Now let's look at another example. Suppose I walk up to you and say the word "dog". You will never understand my statement, because, well, it doesn't make sense on its own. There are trillions of possible sentences using the word "dog", and among these trillions of combinations I can't expect you to guess what I'm trying to tell you. What these two examples have in common is that they can't be understood on their own. In the first case, I'm expecting you to predict the next position in time, and in the second, I'm expecting you to understand what I mean by "dog". These two examples cannot be understood and interpreted unless some information about the past is supplied. Now, in the first example, if I give you the previous position states of the ball and then ask you to predict the future trajectory of the ball, you're going to be able to do this accurately. And in the second case, if I give you a full sentence, say "I have a dog", this makes sense, because now you understand that out of the trillions of possible combinations involving the word "dog", my original intent was for you to understand that I have a dog. Why did I give you these examples, and how do they apply to neural networks? In the introduction, I said vanilla neural networks can't model every single situation or problem that we have, and the biggest problem, it turns out, is that plain vanilla feedforward neural networks cannot model sequential data. Sequential data is data in a sequence: for example, a sentence is a sequence of words, and a ball moving in space is a sequence of its position states. In the sentence that I had shown you, you understand each word based on your understanding of the previous words. This is called sequential memory: you're able to understand a data point in the sequence via your memory of the previous data points in that sequence. Traditional neural networks can't do this, and it seems like a major shortcoming. One of the disadvantages of modeling sequences with traditional neural networks is the fact that they don't share parameters across time. Let's take, for example, these two sentences: "On Tuesday, it was raining" and "It was raining on Tuesday". These sentences mean the same thing, although the details are in different parts of the sequence. When we feed these sentences into a feedforward neural network for a prediction task, the model will assign different weights to "On Tuesday" and "it was raining" at each moment in time. Things we learn about the sequence won't transfer if they appear at different points in the sequence. Sharing parameters gives the network the ability to look for a given feature everywhere in the sequence, rather than just in a certain area. So, to model sequences,
we need a specific learning framework that is able to deal with variable-length sequences, maintain sequence order, keep track of long-term dependencies rather than cutting the input too short, and finally share parameters across the sequence so as not to re-learn things at each point in the sequence. And that's where recurrent neural networks come in. RNNs are a type of neural network architecture that uses something called a feedback loop in the hidden layer. Unlike feed-forward neural networks, the recurrent neural network, or RNN, can operate effectively on sequences of data with variable input length. This is how an RNN is usually represented: this little loop here is called the feedback loop. Sometimes you may find RNNs depicted unrolled over time like this. The first part represents the network at the first time step: the hidden node h1 uses the input x1 to produce the output y1. This is exactly what we've seen with basic feed-forward neural networks. However, at the second time step, the hidden node at the current time step, h2, uses both the new input x2 as well as the state from the previous time step, h1, as input to make new predictions. This means that a recurrent neural network uses the knowledge of its previous state as input for its current prediction, and we can repeat this process for an arbitrary number of steps, allowing the network to propagate information via its hidden state throughout time. This is almost like giving a neural network a short-term memory. RNNs have this abstract concept of sequential memory, and because of this, we're able to model certain kinds of data, sequential data, that standalone neural networks aren't able to model. Recurrent neural networks remember their past, and their decisions are influenced by what they have learned from the past. Basic feed-forward networks remember things too, but they remember things they have learned during training. For example, an image classifier learns what a three looks like during training and then uses that knowledge to classify things in production. So how do we train an RNN? Well, it is almost the same as training a basic fully connected feed-forward network, except that the backpropagation algorithm is applied for every sequential data point rather than for the entire sequence. This algorithm is sometimes called the backpropagation through time, or BPTT, algorithm. To really understand how this works, imagine we're creating a recurrent neural network to predict the next letter a person is likely to type based on the previous letters they've already typed. The letter that the user just typed is quite important to predicting the new letter; however, all the previous letters are also very important to this prediction as well. At the first time step, say the user has typed the letter F, so our network might predict that the next letter is an E, based on all of the previous training examples that contained words starting with those letters. At the next time step, the user types the letter R, so our network uses both the new letter R plus the state of the first hidden neuron in order to compute the next prediction, based on the high frequency of occurrences of such letter combinations in our training data set. Adding the letter A might predict the letter N, and adding an N would predict the letter K, which would match the word the user intended to type, which is "frank". There is, however, an issue with RNNs known as short-term memory. Short-term memory is caused by the infamous vanishing and exploding gradient problems: as the RNN processes more words,
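As a rough illustration of the feedback loop, here's a hand-rolled sketch in NumPy of the hidden-state update at each time step. The tanh activation, the dimensions, and the random weights are arbitrary assumptions made just for this example.

```python
# A hand-rolled sketch of one recurrent step: the hidden state h_t depends on
# both the current input x_t and the previous hidden state h_{t-1}.
# Dimensions, tanh activation, and random weights are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 8, 16

W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden (the feedback loop)
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One time step of a vanilla RNN."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Process a toy sequence, carrying the hidden state forward through time.
h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):   # a toy sequence of 5 inputs
    h = rnn_step(x_t, h)                       # h now "remembers" the earlier steps
```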
it has trouble retaining information from previous steps. It's kind of like our own memory: if you're given a long sequence of numbers like pi and you try reciting them, you're probably going to forget the initial few digits, right? That's short-term memory, and the vanishing gradient is due to the nature of backpropagation, the algorithm used to train and optimize neural networks. After the forward propagation pass, the network compares its prediction to the ground truth using a loss function, which outputs an error value, an estimate of how poorly the network is performing. The network uses that error value to perform backpropagation, which calculates the gradients for each node in the network. The gradient is a value used to adjust the network's internal weights, allowing the network to learn: the bigger the gradient, the bigger the adjustments are, and vice versa. Here's where the problem lies. When performing backpropagation, each node in a layer calculates its gradient with respect to the effects of the gradients in the layer before it. So if the adjustments to the layers before it are small, then the adjustments to the current layer will be even smaller. This causes the gradients to shrink exponentially as they are backpropagated down, so the early layers fail to do any learning, as their internal weights are barely being adjusted due to the extremely small gradients. And that is the vanishing gradient problem. Let's see how this applies to recurrent neural networks. You can think of each time step in a recurrent neural network as a layer. To train a recurrent neural network, you use an application of backpropagation called backpropagation through time. The gradient values will exponentially shrink as they backpropagate through each time step. Again, the gradient is used to make adjustments in the neural network's weights, thus allowing it to learn; small gradients mean small adjustments, and this causes the early layers not to learn. Because of the vanishing gradients, the RNN doesn't learn the long-range dependencies across time steps. This means that in the sequence "It was raining on Tuesday", there is a possibility that the words "it" and "was" are not considered when trying to predict the user's intention. The network then has to make its best guess with "on Tuesday", and that's pretty ambiguous and would be difficult even for a human. So not being able to learn across all the time steps causes the network to have a short-term memory. We can combat the short-term memory of an RNN by using two variants of recurrent neural networks: gated RNNs and long short-term memory networks, also known as LSTMs. Both these variants function just like RNNs, but they're capable of learning long-term dependencies using mechanisms called gates. These gates are different tensor operations that can learn what information to add to or remove from the hidden state of the feedback loop. The main difference between a gated RNN and an LSTM is that the gated RNN has two gates to control its memory, an update gate and a reset gate, while an LSTM has three gates, an input gate, an output gate, and a forget gate. RNNs work well for applications that involve sequences of data that change over time. These applications include natural language processing, sentiment classification, DNA sequence classification, speech recognition, and language translation.
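In practice, you'd rarely write the gates by hand. Here's a hedged Keras sketch showing how a vanilla RNN layer might be swapped for a GRU or an LSTM in a small sequence classifier; the layer sizes, the 32-feature input, and the sigmoid output are assumptions for illustration only.

```python
# A hypothetical sequence classifier showing the three recurrent layer variants.
# The feature size, unit counts, and single-output head are arbitrary examples.
from tensorflow import keras
from tensorflow.keras import layers

def build(recurrent_layer):
    return keras.Sequential([
        layers.Input(shape=(None, 32)),          # variable-length sequences of 32 features
        recurrent_layer,
        layers.Dense(1, activation="sigmoid"),   # e.g. sentiment classification
    ])

vanilla = build(layers.SimpleRNN(64))  # prone to vanishing gradients on long sequences
gru     = build(layers.GRU(64))        # gated RNN: update gate + reset gate
lstm    = build(layers.LSTM(64))       # LSTM: input gate + output gate + forget gate
```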
19. Convolutional Neural Networks: A convolutional neural network, or CNN for short, is a type of deep neural network architecture designed for specific tasks like image classification. CNNs were inspired by the organization of neurons in the visual cortex of the animal brain. As a result, they provide some very interesting features that are useful for processing certain types of data like images, audio, and video. Like a fully connected neural network, a CNN is composed of an input layer, an output layer, and several hidden layers between the two. CNNs derive their name from the type of hidden layers they consist of. The hidden layers of a CNN typically consist of convolutional layers, pooling layers, fully connected layers, and normalization layers. This means that instead of the traditional activation functions we use in feed-forward neural networks, convolution and pooling functions are used in these layers instead. More often than not, the input of a CNN is a two-dimensional array of neurons, which corresponds to the pixels of an image. For example, if you're doing image classification, the output layer is typically one-dimensional. Convolution is a technique that allows us to extract visual features from a 2D array in small chunks. Each neuron in a convolutional layer is responsible for a small cluster of neurons in the preceding layer. The bounding box that determines this cluster of neurons is called a filter, also called a kernel. Conceptually, you can think of it as a filter moving across the image and performing a mathematical operation on individual regions of the image; it then sends its result to the corresponding neuron in the convolutional layer. Mathematically, a convolution of two functions f and g is, in fact, the dot product of the input function and the kernel function. Pooling, also known as subsampling or downsampling, is the next step in a convolutional neural network. Its objective is to further reduce the number of neurons necessary in subsequent layers of the network, while still retaining the most important information. There are two different types of pooling that can be performed: max pooling and min pooling. As the name suggests, max pooling is based on picking the maximum value from the selected region, and min pooling is based on picking the minimum value from that region. When we put all these techniques together, we get an architecture for a deep neural network quite different from a fully connected neural network. For image classification, where CNNs are used heavily, we first take an input image, which is a two-dimensional matrix of pixels, typically with three color channels: red, green, and blue. Next, we use a convolutional layer with multiple filters to create a two-dimensional feature matrix as the output for each filter. We then pool the results to produce a down-sampled feature matrix for each filter in the convolutional layer. Next, we typically repeat the convolution and pooling steps multiple times, using the previous features as input. Then we add a few fully connected hidden layers to help classify the image, and finally we produce a classification prediction in the output layer. Convolutional neural networks are used heavily in the field of computer vision and work well for a variety of tasks, including image recognition, image processing, image segmentation, video analysis, and natural language processing.
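Here's a minimal Keras sketch of that conv-pool-dense pipeline for image classification. The input shape, filter counts, and the 10-class output are assumed values I've chosen for illustration, not something dictated by the course.

```python
# A minimal CNN mirroring the pipeline above: convolution -> pooling, repeated,
# then fully connected layers, then a classification output.
# Input shape, filter counts, and the 10-class output are assumptions.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(64, 64, 3)),                       # RGB image: red, green, blue channels
    layers.Conv2D(16, kernel_size=3, activation="relu"),   # filters slide across the image
    layers.MaxPooling2D(pool_size=2),                      # downsample, keep the strongest responses
    layers.Conv2D(32, kernel_size=3, activation="relu"),   # repeat convolution + pooling
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),                   # fully connected layers for classification
    layers.Dense(10, activation="softmax"),                # classification prediction
])
```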
20. The 5 Steps to Building a Deep Learning Model: In this section, I'm going to discuss the five steps that are common to every deep learning project you build. These can be extended to include various other aspects, but at their very core there are, fundamentally, five steps.
21. Gathering Data & Datasets: Data is at the core of what deep learning is all about. Your model will only be as powerful as the data you bring, which brings me to the first step: gathering your data. The choice of data and how much data you require ultimately depends on the problem you're trying to solve. Picking the right data is key, and I can't stress how important this part is: bad data implies a bad model. A good rule of thumb is to make assumptions about the data you require, and be careful to record these assumptions so that you can test them later if needed. Data comes in a variety of sizes. For example, the Iris flower data set contains about 150 examples in total, Gmail Smart Reply has around 238 million examples in its training set, and Google Translate reportedly has trillions of data points. When you're choosing a data set, there's no one size fits all, but the general rule of thumb is that the amount of data you need for a well-performing model should be about 10 times the number of parameters in that model. However, this may differ from case to case, depending on the type of model you're building. For example, in regression analysis, you should use around 10 examples per predictor variable; for image classification, the minimum you should have is around 1,000 images per class that you're trying to classify. While quantity of data matters, quality matters too: there's no use having a lot of data if it's bad data. There are certain aspects of quality that tend to correspond to well-performing models. One aspect is reliability. Reliability refers to the degree to which you can trust your data: a model trained on a reliable data set is more likely to yield useful predictions than a model trained on unreliable data. How common are label errors? If your data is labeled by humans, sometimes there may be mistakes. Are your features noisy, or are they completely accurate? Some noise is all right; you'll never be able to purge your data of all its noise. There are many other factors that determine quality; for the purpose of this video, though, I'm not going to talk about the rest, although if you're interested, I'll leave them in the show notes below. Luckily for us, there are upwards of 20 sources on the Web that offer good data sets for free. Here are a few sites where you can begin your data set search. The UCI Machine Learning Repository maintains around 500 extremely well-maintained data sets that you can use in your deep learning projects. Kaggle is another one you'll love for how detailed their data sets are: they give info on features, data types, number of records, and so on, and you can use a kernel too, so you won't have to download the data set. Google's Dataset Search is still in beta, but it's one of the most amazing sites you can find today. Reddit is a great place to request the data set you want, but again, there is a chance of it not being properly organized. Or create your own data set; that will work too. You can use web scrapers like Beautiful Soup to get the data you require.
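As a tiny example, here's one way to grab a small, ready-made data set like the Iris flower set mentioned above, assuming scikit-learn is installed; any of the sources listed above would work just as well.

```python
# Loading a small, ready-made data set (the Iris flower set mentioned above).
# Assumes scikit-learn is installed; any of the sources above would also work.
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
print(X.shape, y.shape)   # (150, 4) features and (150,) labels -- about 150 examples in total
```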
22. Pre-processing Data: After you have selected your data set, you now need to think about how you're going to use this data. There are some common pre-processing steps that you should follow. The first is splitting the data set into subsets. In general, we usually split a data set into three parts: the training, testing, and validation sets. We train our models with the training set, evaluate them on the validation set, and finally, once they're ready to use, test them one last time on the testing data set. Now, it is reasonable to ask the following question: why not have two sets, training and testing? That way the process would be much simpler: just train the model on the training data and test it on the testing data. The answer to that is that developing a model involves tuning its configuration, in other words, choosing certain values for the hyperparameters or the weights and biases. This tuning is done with the feedback received from the validation set and is, in essence, a form of learning. Note that we can't just split the data set randomly; do that, and you'll get random results. There has to be some kind of logic to splitting the data set. Essentially, what you want is for all three sets, the training, testing, and validation sets, to be very similar to each other, and to eliminate skew as much as possible. How you split mainly depends on two things: first, the total number of samples in your data, and second, the actual model you're trying to train. Models with very few hyperparameters will be very easy to validate and tune, so you can probably reduce the size of your validation set; but if your model has many hyperparameters, you would want to have a large validation set, as well as consider cross-validation. Also, if you happen to have a model with no hyperparameters whatsoever, or ones that cannot be easily tuned, you probably don't need a validation set at all. All in all, like many other things in machine learning and deep learning, the train-test-validation split ratio is quite specific to your use case, and it gets easier to make this judgement as you train and build more and more models. So here's a quick note on cross-validation. Usually, you'll first split your data set into two: the train and the test set. After this, you keep aside the test set and randomly choose some percentage of the training set to be the actual train set, with the remainder being the validation set. The model is then iteratively trained and validated on these different sets. There are multiple ways to do this, and this is commonly known as cross-validation. Basically, you use your training set to generate multiple splits of the train and validation sets. Cross-validation avoids overfitting and is getting more and more popular, with k-fold cross-validation being the most popular method. Additionally, if you're working on time series data, a frequent technique is to split the data by time. For example, if you have a data set with 40 days of data, you can train your model on the data from days 1 to 39 and evaluate it on the data from day 40. For systems like this, the training data is older than the serving data, so this technique ensures your validation set mirrors the lag between training and serving. However, keep in mind that time-based splits work best with very large data sets, such as those with tens of millions of examples.
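Here's a hedged sketch of that splitting workflow using scikit-learn; the roughly 80/10/10 ratios, the toy data, and the five folds are just example choices, not rules from the course.

```python
# A sketch of the splitting workflow described above, using scikit-learn.
# The split ratios, toy data, and 5 folds are example choices only.
import numpy as np
from sklearn.model_selection import train_test_split, KFold

X = np.arange(1000).reshape(500, 2)   # toy features standing in for a real data set
y = np.arange(500)                    # toy labels

# First carve off a held-out test set, then split the rest into train and validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.11, random_state=42)

# K-fold cross-validation: generate multiple train/validation splits from the training data.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kfold.split(X_train)):
    print(f"fold {fold}: {len(train_idx)} train rows, {len(val_idx)} validation rows")
```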
The second method we have in pre-processing is formatting. The data set you've picked might not be in the format that you'd like. For example, the data might be in the form of a database, but you'd like it as a CSV file, or vice versa. Of course, there are a couple of ways to do this, and you can Google them if you'd like. Dealing with missing data is one of the most challenging steps in the gathering of data for your deep learning projects. Unless you're extremely lucky and land the perfect data set, which is quite rare, dealing with missing data will probably take a significant chunk of your time. It is quite common in real-world problems to be missing some values in our data samples. This may be due to errors in data collection, blank spaces on surveys, measurements not being applicable, and so on. Missing values are typically represented with the NaN or null indicators. The problem with this is that most algorithms can't handle these kinds of missing values, so we need to take care of them before feeding the data to our models. There are a couple of ways to deal with them. One is eliminating the samples or the features with missing values; the downside, of course, is that you risk deleting relevant information. The second is to impute the missing values; a common way is to set the missing value to the mean value of the rest of the samples, but of course there are other ways to deal with specific data sets. Be smart about handling missing data: doing it the wrong way can spell disaster. Sometimes you may have more data than you require. More data can result in larger computational and memory requirements; in cases like this, it's best practice to use a smaller sample of the data set. It will be faster and will ultimately give you more time to explore and prototype solutions. In most real-world data sets, you're going to come across imbalanced data, that is, classification data with skewed class proportions, leading to a minority class and a majority class. If we train a model on data like this, the model will spend most of its time learning about the majority class and a lot less time on the minority class, and hence the model will ultimately be biased towards the majority class. In cases like this, we usually use a process called downsampling and upweighting, which is essentially reducing the majority class by some factor and adding example weights of that factor to the downsampled class. For example, if we downsample the majority class by a factor of 10, then the example weight we add to that class should be 10. It may seem odd to add example weights after downsampling. What is its purpose? Well, there are a couple of reasons. It leads to faster convergence: during training, we see the minority class more often, which helps the model converge faster. By consolidating the majority class into fewer examples with larger weights, we spend less disk space storing them, and upweighting ensures the model is still calibrated. We add upweighting after downsampling so as to keep the data set in similar proportions. These processes essentially help the model see more of the minority class, rather than just the majority class, and this helps the model perform better in real-world situations. Feature scaling is a crucial step in the pre-processing phase, as the majority of deep learning algorithms perform much better when dealing with features that are on the same scale. The most common techniques are normalization, which refers to the rescaling of features to a range between 0 and 1 and is in fact a special case of min-max scaling, and standardization. To normalize the data, we need to apply min-max scaling to each feature column.
Standardization consists of centering the feature at mean zero with standard deviation one, so that the feature columns have the same parameters as a standard normal distribution, that is, zero mean and unit variance. This makes it much easier for the learning algorithms to learn the weights of the parameters. In addition, it keeps useful information about outliers and makes the algorithms less sensitive to them.
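Here's a small, illustrative sketch of these pre-processing steps, mean imputation of missing values followed by min-max normalization and standardization, using scikit-learn; the toy values and column layout are made up.

```python
# Illustrative pre-processing sketch: mean imputation of missing values, then
# min-max normalization and standardization. The toy values are made up.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, np.nan],     # a missing value (NaN) that most algorithms can't handle
              [3.0, 400.0]])

X = SimpleImputer(strategy="mean").fit_transform(X)   # replace NaN with the column mean

X_norm = MinMaxScaler().fit_transform(X)     # normalization: each feature rescaled to [0, 1]
X_std = StandardScaler().fit_transform(X)    # standardization: zero mean, unit variance
```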
23. Training your Model: Once our data has been prepared, we now feed it into our neural network to train it. We discussed the learning process of a neural network in the previous module, so if you are unsure, I'd advise you to watch that module first. But essentially, once the data has been fed in, forward propagation occurs, the loss is computed against the loss function, the parameters are adjusted based on this loss, and the cycle repeats. Nothing too different from what we discussed previously.
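For example, with Keras the forward pass, loss computation, and weight updates are all handled inside a single fit call. The tiny random data set and the optimizer, loss, and epoch choices below are purely illustrative assumptions on my part.

```python
# A hedged training sketch: compile with a loss function and optimizer, then fit.
# The tiny random data set and the loss/optimizer/epoch choices are illustrative only.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X_train = np.random.rand(200, 20)             # 200 toy samples, 20 features each
y_train = np.random.randint(0, 2, size=200)   # toy binary labels

model = keras.Sequential([layers.Input(shape=(20,)),
                          layers.Dense(16, activation="relu"),
                          layers.Dense(1, activation="sigmoid")])

# Forward propagation, loss computation, and weight adjustment all happen inside fit().
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=5, batch_size=32)
```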
24. Evaluating your Model: Your model has successfully trained. Congratulations! Now we need to test how good our model is, using the validation set that we put aside earlier. The evaluation process allows us to test a model against data it has never seen before, and this is meant to be representative of how well the model might perform in the real world.
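Continuing the toy sketch from the previous step (it assumes the `model` and `np` defined there), evaluation in Keras might look like this; the random held-out data just stands in for your real validation or test set.

```python
# Continuing the toy example above: evaluate on data the model has never seen.
X_test = np.random.rand(50, 20)              # stand-in for a real held-out set
y_test = np.random.randint(0, 2, size=50)

loss, accuracy = model.evaluate(X_test, y_test)   # an estimate of real-world performance
print(f"held-out loss: {loss:.3f}, accuracy: {accuracy:.3f}")
```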
25. Optimizing your model's accuracy: After the evaluation process, there's a high chance that your model could be optimized further. Remember, we started with random weights and biases, and these were fine-tuned during backpropagation. Well, in quite a few cases backpropagation won't get it right the first time, and that's OK. There are a few ways to optimize your model further. Tuning hyperparameters is a good way of optimizing a model's performance. One way to do this is by showing the model the entire data set multiple times, that is, by increasing the number of epochs; this has sometimes been shown to improve accuracy. Another way is by adjusting the learning rate. We talked about what the learning rate is in the previous module, so if you don't remember, I'd invite you to check out the previous module. But essentially, the learning rate defines how far we shift our weights during each step, based on information from the previous training step in backpropagation. These values all play a role in how accurate our model can become and how long the training takes. For complex models, initial conditions can also play a significant role in determining the outcome of training. There are many considerations at this phase of training, and it's important you define what makes a model good enough; otherwise, you might find yourself tweaking parameters for a long, long time. The adjustment of these hyperparameters remains a bit of an art and is more of an experimental process that heavily depends on the specifics of your data set, model, and training process. You will develop a feel for this as you go further and further into deep learning, so don't worry too much about it now. One of the more common problems you'll encounter is when your model performs well on the training data but performs terribly on data it has never seen before. This is the problem of overfitting. It happens when the model learns patterns specific to the training data set that are irrelevant to other, unseen data. There are two main ways to avoid this overfitting: getting more data, and regularization. Getting more data is usually the best solution; a model trained on more data will naturally generalize better. Reducing the model's size, by reducing the number of learnable parameters in the model and with it its learning capacity, is another way. By lowering the capacity of the network, you force it to learn the patterns that matter, the ones that minimize the loss. On the other hand, reducing the network's capacity too much will lead to underfitting: the model will not be able to learn the relevant patterns in the training data. Unfortunately, there is no magical formula to determine this balance; it must be tested and evaluated by setting different numbers of parameters and observing performance. The second method for addressing overfitting is applying weight regularization to the model. A common way to achieve this is to constrain the complexity of the network by forcing its weights to take only small values, regularizing the distribution of weight values. This is done by adding to the loss function of the network a cost associated with having large weights, and this cost comes in two flavors: L1 regularization, where the cost added is proportional to the absolute value of the weight coefficients, that is, the L1 norm of the weights; and L2 regularization, where the cost added is proportional to the squared value of the weight coefficients, that is, the L2 norm of the weights.
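As a hedged example of these two knobs, here's how a lower learning rate and L2 weight regularization might be added in Keras; the regularization factor of 0.01 and the learning rate of 1e-4 are arbitrary example values, not recommendations from the course.

```python
# Hypothetical tweaks from this step: a smaller learning rate and L2 weight regularization.
# The regularization factor 0.01 and learning rate 1e-4 are arbitrary example values;
# regularizers.l1(...) could be used instead for the L1 flavor.
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(16, activation="relu",
                 kernel_regularizer=regularizers.l2(0.01)),   # penalize large weights (L2 norm)
    layers.Dense(1, activation="sigmoid"),
])

# A lower learning rate means smaller weight adjustments at each training step.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy")
```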
Another way of reducing overfitting is by augmenting the data. For a model to perform satisfactorily, we need to have a lot of data, as we've discussed already. But typically, if you're working with images, there's always a chance that your model won't perform as well as you'd like, no matter how much data you have. In cases like this, when you have a limited data set, data augmentation is a good way of increasing the data set without really increasing it. We artificially augment the data, or in this case the images, so that we get more data from already existing data. So what kind of augmentations are we talking about? Well, anything from flipping the image over the y-axis, flipping it over the x-axis, adding blur, to even zooming in on the image. What this does is show your model more than what meets the eye: it exposes your model to more of the existing data, so that in testing it will automatically perform better, because it has seen images represented in almost every single form. Finally, the last method we're going to talk about is dropout. Dropout is a technique used in deep learning that randomly drops out units, or neurons, in the network. Simply put, dropout refers to ignoring a randomly chosen set of neurons during the training phase. By ignoring, I mean that these units are not considered during a particular forward or backward pass. So why do we need dropout at all? Why do we need to shut down part of a neural network? A fully connected layer occupies most of the parameters, and hence neurons develop co-dependency amongst each other during training, which curbs the individual power of each neuron and ultimately leads to overfitting of the training data. So dropout is a good way of reducing overfitting.
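To wrap up, here's a sketch of how augmentation and dropout might both appear in a Keras image model. The augmentation layers shown assume a recent Keras version, and the dropout rate, flip and zoom settings, and layer sizes are example values of my own choosing.

```python
# A hedged sketch of the last two ideas: image augmentation (flips, zoom) baked into the
# model, plus a Dropout layer. The dropout rate and augmentation settings are examples,
# and the RandomFlip/RandomZoom layers assume a recent Keras version.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(64, 64, 3)),
    layers.RandomFlip("horizontal_and_vertical"),  # flip over the x- and y-axes
    layers.RandomZoom(0.2),                        # zoom in on the image
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),       # randomly ignore half the neurons during each training pass
    layers.Dense(10, activation="softmax"),
])
```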