## Transcripts

1. Introduction to deep learning: Let's now go through a basic introduction to deep learning. So what exactly is deep learning? Deep learning can be thought of as a subfamily of machine learning, so it's not actually going to be a huge jump from what you've already learned. And that's a really important thing for you to feel confident in at this point. You can think of deep learning as being like any other machine learning model. Even what we were talking about with our line of best fit and linear regression is actually not a million miles away from it, except we just have more parameters. There's more of a computational cost, and what's really good about deep learning, and why it's so popular, is that it can uncover complex patterns in data. There's lots of math that needs to be done, just like in machine learning, but you don't have to do any of it by hand, so don't be intimidated. And like I always say, if I can do it, you can too. So let's talk about a few examples of deep learning in practice. Take image classification, for example. In videos and images, deep learning algorithms are now able to detect things like, here for example, a tree, a human or an airplane. They can also detect different faces in images. We also have self-driving cars being developed that can actually navigate their way from one place to another, and they can recognize cars in the way, pedestrians, street signs, roads and that kind of stuff. So deep learning is very good with things like images and sounds. With audio, deep learning can be used to convert speech to subtitles, which is really exciting. One project I've really enjoyed working on was generating new images. There are deep learning algorithms that can generate images: you give one, for example, a million images of artwork, and then it will be able to generate its own artwork, which I think is really cool. Another example is automated robots.
So now in warehouses, for example, we have robots that are able to pick up certain packages and move them to different areas, and the robots are getting more and more complex. They're able to navigate their way through difficult terrain and even maintain balance. So, the requirements for deep learning: what is required? First of all, there's a large amount of mathematical computation. One of the reasons why deep learning has gotten so popular in the last decade or so is that we now have GPUs. We have computers that are able to do a large amount of mathematical computation in a small amount of time. What's also required is big data. For example, for image classification, let's say we needed an algorithm to tell the difference between cats and dogs: we would need to have a lot of images of cats and dogs. Also required is careful consideration by the developer of the architecture of the neural network. And finally, you might not think it, but it does take a lot of curiosity and perseverance. The deep learning engineer needs to be very curious about how things work: what would happen if I tried changing this? What if I tried changing that? And you need to spend quite a lot of time in order to optimize your deep learning model. So without any further ado, let's talk about some real-world applications of deep learning.
2. Real world applications of AI: Before we get into some of the deeper mathematical concepts of deep learning, I just wanted to go over a higher-level view of what deep learning actually is, by showing you some real-world applications but also describing how it's different from machine learning. So, just a few points on that. In deep learning we usually apply neural networks, which we'll be going on to very soon, and I think you'll be pleasantly surprised that, actually, it's not the most complicated thing in the world, like rocket science or brain surgery. Deep learning is normally superior for trying to uncover more intricate patterns in data. It tends to underperform when it's looking at data that doesn't have complex patterns, because it often finds complex patterns that aren't there. So machine learning is actually superior for less complex patterns. Also, there's a great deal of mathematical calculation that goes on in deep learning, so it can be computationally very expensive, and that's actually one of the big reasons why deep learning has only just started to get big in the last five to seven years. It's because now we're getting access to lots of big data, and we have computers that can handle all these large mathematical calculations. Obviously they can both use big data. So that's just a little point on how it's different from machine learning. Now, what's really important to grasp at this point in this lecture is that it's all about patterns. Neural networks, machine learning, all of this: it's just looking for patterns in data. It's never going to give us hard, solid rules, but it's always going to give us some probabilities. So, I don't know, for example, let's say on a specific road there's a high accident level for motorbike drivers. And let's say that if they go over a certain speed round this kind of bend, say over 90 miles an hour, then eight out of 10 motorbike drivers have a crash.
There's a 0.8, or an 80%, probability that they're going to crash if they're going at that speed. That's a very basic pattern. And with deep learning, you can look at much more intricate patterns. It might look at things like the speed, of course, but it may also look at the behavior or profile of the person driving the motorbike. It may look at what kind of personality they have, how fast they usually go, and whether this is different from how they usually ride. It basically goes deep into these patterns. As always, you give it the data, the labeled data that already gives it information about previous drivers, and it can extract those patterns. So, just a few examples. In marketing, neural networks can be used in a number of different ways. They're actually quite often used when deciding which individuals to show an ad to. So on, say, Facebook, certain marketing companies will have neural networks that are able to predict which users are going to be most likely to click on an ad. And the way they do that is by giving the neural network parameters like the users' age, their sex and their number of friends, and with this whole Cambridge Analytica thing, for example, they could actually look at the profiles of their friends as well. So that goes very deep into the personal profile, maybe even their posts and keywords in those posts. That's just one way of looking at how deep neural networks have been used for marketing. It's also used in fraud detection. Banks are able to look at the behavior of an individual or an individual account, say, and if the behavior is different from how it usually is for the person who owns the account, then it may be flagged up as fraud, because neural networks are able to detect patterns in behavior. It's also used quite frequently in medicine now.
It's used not only to predict how successful a treatment might be for an individual, based on the individual's profile and the medicine itself. It's even being used to create new medicines, and that's absolutely incredible. We're now at the stage where neural networks are able to come up with new combinations to try to create new medicines, and then other neural networks are able to simulate a person's disease, so we're almost using neural networks both to create and to shape new medicines. That's very exciting. So it's all about learning, and the really important thing that neural networks need is, of course, data. We can't just give a network a task and expect it to, I don't know, predict whether a motorcyclist is going to fall off their bike or not, if it doesn't have any data on previous motorcyclists who have and haven't fallen off their bikes before. So not only do networks need to have some kind of data to go on, but they also need to be given a direction. For example, if we're creating an image classifier that's able to say whether there's a dog in an image, then we need to give the neural network a bunch of images of dogs so it knows what to look for. There are three main types of learning. We've pretty much solely focused on supervised learning up until now, and we'll also be looking at supervised learning for the main part of the neural networks section, because it's one of the most commonly used and applied. With supervised learning, you basically have labeled data and a task. With Titanic, for example, the labeled data was data on people who either survived or didn't survive when the Titanic sank, and we were given a lot of information on each of those individuals.
So if we go back to some of the applications from before, say, for example, the motorbike accidents, then what's important there is that we've provided the network with information on previous accidents. It's also helpful to have information on the motorbike drivers who didn't have accidents as well. So that's what our labeled data is. It also needs to be given a task, right? It needs to be told to predict whether someone has an accident or not, or to identify a dog in an image, or something like that. So it needs to have some kind of task. Now, unsupervised learning is different. It doesn't actually need to have a task. It's just given a lot of unlabeled data and asked to show clustering. So it's basically not being asked to make predictions like that, but it's able to look at data and show patterns to us. Reinforcement learning is very different again. Here the only label, if you like, is a win or a lose: we give the program some rules and a win-or-lose situation. One example might be if you create a robot, and the rules are that it can move its arm up, down, left and right, and it can open and close its hand. And the task is to pick up a cup and move it to the right. But it has never experienced the world before, so it has to try moving left, right, up, down. It tries lots of times, and each time it doesn't manage to pick up the cup and move it to the right, that means it's lost. So it's going to keep trying until it wins. And that's reinforcement learning, where it's just trying to win in a certain situation and it's been given some rules in order to get there. So we're just going to do a few practice examples now, just to test whether you understand what supervised, unsupervised and reinforcement learning are. Don't worry if you're not completely sure. Give it a try, because it's very helpful at this point.
It's just to get your brain working and get you thinking. So I'm going to give you an example here: language identification. Let's say we're creating a website where the user can put in a paragraph of text in whatever language they want, and then we're able to output what language it is. So we would need to train a neural network. We need to give it lots of data that already contains these languages and tell it which language is which, so that it's then able to make predictions when it's provided with new data. So do you think that might be supervised, unsupervised or reinforcement learning? Pause the video now, have a think, have a guess. So this one is supervised learning, simply because, as we look back here, we're giving the neural network labeled data. We're saying which language it is. So we're giving it some data, the actual text, and then we're giving it the name of the language, so it's being labeled, and then we give it the task of being able to predict whether it's French, Spanish, et cetera. Another example: robotics with movement. Let's say we create a robot and we want it to be able to walk across the room, and this is the first time it's ever moved. But it does know that the rules are that it can move these things called legs up and down, forwards and backwards, left and right. It's allowed as many goes as it wants, but the task is to walk across the room. So pause the video now and have a think about whether that's supervised, unsupervised or reinforcement learning. So this one is reinforcement learning. That's because the only feedback it's been given is win or lose. To win is to walk across the room; to lose is to not manage that. The rules are that it can move its legs in certain ways. Another example of reinforcement learning that's quite popular is that we can actually train a computer to play computer games, for example Mortal Kombat or Street Fighter.
That's quite a popular example, where you have a Street Fighter game and the computer is given the rules: it can jump, move left and right, and it has these four attack buttons, and that's all it's given. The win is to win a fight, and to lose is to lose a fight, and you just let it play over and over and over again against the computer until it gets really good. Finally, voting records. Let's say we have the data for some voting records, and we want to see patterns in that data. We want to see what demographics are voting for a specific party. That one is unsupervised learning, because essentially we're just providing it with lots of data and we just want to see patterns. We don't want any predictions. We don't want it to do any kind of win-or-lose thing. We really just want to see what patterns are in this data. We want to see what demographics are voting for which candidate. So that's unsupervised learning. So that's it for this lecture. Have a go yourself: try to create some examples of supervised, unsupervised and reinforcement learning for yourself, and then, essentially, you already know the different types of learning used in deep learning. So you'll be ready for the next lecture.
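As a rough illustration of the difference between labeled and unlabeled data, here is a minimal Python sketch. All records are made up for illustration: the crash records echo the eight-out-of-ten motorbike example from this lecture, and the voter ages are hypothetical. The grouping step is a very simple two-group clustering written out by hand, not a method taken from the lecture, and reinforcement learning is left out.

```python
# Supervised-style data: each record pairs an input (speed in mph) with a label (crashed?).
# Hypothetical records: 10 drivers over 90 mph, 8 of whom crashed, plus 5 slower drivers.
labelled = [(95, True)] * 8 + [(95, False)] * 2 + [(60, False)] * 5

fast = [crashed for speed, crashed in labelled if speed > 90]
crash_probability = sum(fast) / len(fast)  # fraction of fast drivers who crashed
print("P(crash | speed > 90) =", crash_probability)


# Unsupervised-style data: just values (hypothetical voter ages), no labels, no task.
ages = [22, 25, 27, 30, 61, 64, 68, 70]


def two_groups(values, max_rounds=100):
    """Split 1-D values into two clusters by repeatedly moving two centres."""
    low, high = min(values), max(values)  # starting guesses for the two centres
    for _ in range(max_rounds):
        group_a = [v for v in values if abs(v - low) <= abs(v - high)]
        group_b = [v for v in values if abs(v - low) > abs(v - high)]
        new_low = sum(group_a) / len(group_a)
        new_high = sum(group_b) / len(group_b)
        if (new_low, new_high) == (low, high):  # centres stopped moving
            break
        low, high = new_low, new_high
    return group_a, group_b


younger, older = two_groups(ages)
print("clusters:", younger, older)
```

In the first half the labels (crashed or not) are what make a prediction possible; in the second half nothing is predicted, and the code only reports which values sit together.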
3. L306 RecapLinearRegr: Okay, so we're now going to do a recap of our linear regression and some of the other concepts, so that we're completely ready to get on to deep learning. Because actually, what I find most exciting about teaching deep learning is that I teach these kind of basic-ish concepts and people think, OK, right, now we have to really get into the hard work of learning deep learning. And to be completely honest, once you've learned this stuff, when you're really confident with the points I'm going to go over in this lecture, which we've already been over in the past, it's not that big a transition to get on to deep learning. It requires some time, obviously, to really appreciate the concepts, but don't worry about it, I'll get you there. Just make sure you're solid on these concepts. If you're already feeling confident with them, no problem, you can move on to the next lecture. We're not going to be covering much, if anything, of importance that we haven't already covered. So we're going to start by talking about the x and y axes, something very standard. So x is our independent variable and y is our dependent variable. Our dependent variable is the one that we're looking at, the one that's of real importance to us. So, for example, in the Titanic dataset, it's whether someone survived or not. If we were looking at marketing, for example at whether a certain person is going to click on an ad or not, we would have a number of different x's, x1, x2, which would be their age, their sex, et cetera. But y, the dependent variable, would be whether or not they click on the ad.
It could also be, if you want to create a house price predictor, that y would be the price of the house, and then we would have a number of different x's, which would be the size of the house, the age of the house, the number of rooms, et cetera, which would be x1, x2, x3. And obviously, as we'll talk about later, you can create more x's: x1 could be the size of the house, x2 could be the number of rooms. But for now we're keeping it relatively simple. We're just having one input, which could be, for example, something like the number of rooms. So then let's say we already had some data, so we had labeled data. Let's go with x being the number of rooms in the house and y being the price. For x, let's say we have one house that has one room, another that has one room, one that has two rooms, one that has four rooms and one that has five rooms. And let's say that the prices are in thousands. So that one's worth 10,000, this one's worth 20,000, this one's worth 100,000, let's say this one's worth 150,000, and finally this one's worth 180,000. I just like drawing eights. OK, so let's just plot that out here: one, two, three, four, five on the x axis. I'm just going to guess the positions. Let's say this one goes here at 10, and this one at 20. Then for two rooms, this one's going to be 100, so let's say it's about here. Then we have four rooms, which is here, and that's 150, so about here, and 180, that's up here. OK, so now we've been able to plot our data on a graph, which is very helpful because we can already kind of see a pattern. We can see that as x increases, y also increases. Obviously, this is incredibly basic data. We only have one x parameter that we're looking at, which is the number of rooms.
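To make the whiteboard sketch concrete, the five houses can be written down as data. This is a minimal Python sketch using the numbers from the example above (prices in thousands); the correlation check is an addition for illustration, not something the lecture computes.

```python
import math

# The five hypothetical houses from the example: x = rooms, y = price in thousands.
rooms = [1, 1, 2, 4, 5]
prices = [10, 20, 100, 150, 180]


def pearson(xs, ys):
    """Pearson correlation: close to +1 when y tends to rise as x rises."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


for x, y in zip(rooms, prices):
    print(f"{x} room(s) -> {y} thousand")
print("correlation:", round(pearson(rooms, prices), 3))
```

A value close to +1 is the numeric counterpart of what the plot shows by eye: price tends to go up as the number of rooms goes up.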
Of course, in real life we would look at a lot more inputs, and we will be going on to that in a future lecture. And we only have five data points, because we're talking conceptually here, but we can see some kind of pattern. So the next thing I want to do is create a line of best fit. We want to create a line that represents the data quite well. And we were saying before that if we had some data that was perfectly in line, then that's quite easy to do. So let's say we had something like this: one point here, one here, one here, one here. So this goes perfectly. We have a straight line that goes right through them. Obviously, I haven't drawn that very well, but you get the idea. And what we can do with linear regression is create an equation for this line. We do that by writing out y equals mx plus, sorry, no, y = wx + b, where w stands for weight and b stands for bias. So b basically defines where the line intersects the y axis when x is zero. It looks like it's at around 10 there, so y equals something times x plus, what shall we say, 15 or something. Now, w defines the steepness of the graph: as we go up one in x, how much do we go up in y? I have absolutely no idea from this sketch, but let's say this is 20 and this is 100. Let's say this one's 25 and this one's 50. So as we go up one in x, we go up by 25 in y. So that would be y = 25x + 15. And now we can use that to make predictions. Let's say we want to predict what y is when x is six. We can put that in here: y equals 25 times six, which is 150, plus 15, so that's 165. So y would be 165. So we could just plot that: 165 might be about here, yes, probably. So we can use the line: we could just go along to six here and trace up to the line.
Then we go across from the line to the y axis, and we can say, OK, so y is going to be this when x is this. We can also use the equation. So that's how you can work out the line of best fit and how we can use it to make predictions. Now, unfortunately, we're never going to have these perfectly straight lines. So we're going to have to try to create a line that fits all the data points well enough, because with real-world data it's so, so unlikely that we're going to be able to get a line that goes through all the data points. So we're probably going to have to draw a line that's a bit more like this. What I'm doing in drawing this line is trying to create a line that minimizes the distance between all the data points and the line. So this is the distance between this data point and my line, and this is the distance here. The distance between a data point and my line of best fit is called the cost. So the cost is just for one data point: its distance from the line of best fit. Now the loss is the sum of all of the distances, and we can write that as an equation. We're going to define the loss as J. Just to be clear before I write it out, J is a letter that we tend to use to define loss. It's just because a lot of people do, but you can use any letter you want. It's not that this one has to be used, and there's no real logic behind it, so you don't have to worry about that. So that's going to be the equation. It's the sum of, so this symbol here means 'the sum of', and then it's y minus y hat. If you remember, y hat is our prediction. So, for example, where x is here, let's look at this. I'll use a dotted line here so I don't make it too messy. Following this dotted line, our prediction is going to be here on the line, and this is going to be y hat. This value here is y hat, and this value here is our actual y. So it's the distance between those two that we're after.
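The prediction step with the example line y = 25x + 15 can be sketched in a few lines of Python. The function names are illustrative, and the "actual" price of 180 used for the cost is a made-up value, since the lecture does not give one for a six-room house.

```python
W = 25  # weight: how much y rises when x goes up by one
B = 15  # bias: where the line crosses the y axis (x = 0)


def predict(x, w=W, b=B):
    """Return y hat, the line's prediction for input x."""
    return w * x + b


def cost(y_actual, y_predicted):
    """Distance between one data point and the line: y minus y hat."""
    return y_actual - y_predicted


y_hat = predict(6)
print("prediction for x = 6:", y_hat)  # 25 * 6 + 15 = 165

# Hypothetical actual price for a six-room house, for illustration only.
print("cost for that point:", cost(180, y_hat))
```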
That distance is the cost, or y minus y hat. And now, because we really want to minimize the number of data points that are very far away, we're going to penalize them quite heavily if they're far away. We do that by squaring this number, because the larger the number is, the more it's increased when it's squared. And then we're going to take the average of all of this by dividing by n, and n is just the number of data points that we have. So that's how we work out the mean squared error. You can also include a little i here. You can use any letter you want, but by convention we use the little i. And by putting it here, we're basically just saying we're going to do it for the same point. So we're not taking the y from this point minus the y hat from over here, because that would be silly. We want to do it where x is exactly the same. So we're just saying that for any given point, we're taking the y and the y hat from that specific point. So we're using y_i and y hat_i in both places. So that's the mean squared error. Now what's really cool, and for me this is the really exciting part, is what comes next, because we've been able to create the line of best fit ourselves. It's like, OK, cool, so could our computer do that? Our computer could come up with y = wx + b: it can choose a w and a b however it likes. It draws the line out, and let's say the first one is like this, because it's just completely random, and it works out the loss. Well, that's actually going to be absolutely huge. And then this is the really cool bit, where it can update the line of best fit. So it says, OK, the loss is really high here. What happens if I increase w? Oh, that increases the loss even further. So I'm going to decrease my w, and I might decrease the b a bit as well.
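Written out, the mean squared error described above is $J = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, with $\hat{y}_i = w x_i + b$. A minimal Python sketch follows. It reuses the five-house example (rooms, prices in thousands) as data, which is an assumption made for illustration, and both candidate lines are arbitrary choices.

```python
def mean_squared_error(xs, ys, w, b):
    """Average of (y_i - y_hat_i)^2, where y_hat_i = w * x_i + b."""
    n = len(xs)
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n


rooms = [1, 1, 2, 4, 5]
prices = [10, 20, 100, 150, 180]  # in thousands

# Two candidate lines: the earlier example line and a steeper one.
print("loss for w=25, b=15 :", mean_squared_error(rooms, prices, 25, 15))
print("loss for w=40, b=-10:", mean_squared_error(rooms, prices, 40, -10))
```

The second line gives the smaller number, so by this measure it describes these five points better than the first.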
We'll see what happens then: the loss reduces even more. And then it tries again, reducing b and reducing w a bit, and you can see that after a while it starts to produce a much better line. So that's where I find this really exciting, and that's where the learning occurs. And trust me, this is very, very applicable to neural networks. So we choose a weight and a bias, or let's say the computer does. It looks at the loss, and then it tries another w and b. If the loss is lower, then that one is better, so that's the best one so far and it will probably stick with that. And it will try another w and b, and another, until it finds the minimum loss: the line that is least far away from all the data points. And what's quite cool is that you can even plot the loss. So you'd have loss up here, and then you'd actually have two axes down here, w and b. And you'd be looking at the loss going down like this. Obviously this would actually be a 3D graph, because you have two axes down here. You can see that as we change the weights and the biases, we get these little dips, and finally, down here, probably, is what we call our global minimum. That's the best possible line of best fit, the one that describes the data well and minimizes the distance to the points. Now, what you might notice is that we have these little dips on either side. If I just draw it bigger over here and zoom in on this bit, for example, we have a dip and then it goes back up. So the danger is that the program can find this loss here and say, well, if I go this way or this way it's going to increase the loss, so I'm just going to stop here. And so it just stops in what we call a local minimum. We'll be talking about that problem later on. It's just a really good idea for you to note now that that can happen.
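The loop described above, where a change is tried and kept only if the loss drops, can be sketched as follows. This is a deliberately naive trial-and-error search written for illustration, using the five-house example as data. It is an assumption about how one might code the idea, not the lecture's implementation, and neural network libraries typically use gradient-based updates instead.

```python
rooms = [1, 1, 2, 4, 5]
prices = [10, 20, 100, 150, 180]  # in thousands


def loss(w, b):
    """Mean squared error of the line y = w * x + b on the example data."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(rooms, prices)) / len(rooms)


def fit_by_trial_and_error(w=0.0, b=0.0, step=10.0, min_step=1e-6, max_tries=100_000):
    """Nudge w or b up or down; keep a nudge only if it lowers the loss."""
    best = loss(w, b)
    for _ in range(max_tries):
        if step < min_step:
            break
        candidates = [(w + step, b), (w - step, b), (w, b + step), (w, b - step)]
        trial_loss, trial_w, trial_b = min((loss(cw, cb), cw, cb) for cw, cb in candidates)
        if trial_loss < best:
            best, w, b = trial_loss, trial_w, trial_b  # improvement: keep it
        else:
            step /= 2  # nothing helped: try smaller nudges
    return w, b, best


w, b, final_loss = fit_by_trial_and_error()
print(f"w = {w:.3f}, b = {b:.3f}, loss = {final_loss:.3f}")
```

For a single straight line the loss surface has only one dip, so this search cannot get stuck in a local minimum; that risk appears with the more flexible models discussed later.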
But you don't have to worry about it too much now. The next thing I'm going to talk about, and I'll just create a new whiteboard and leave that one as it is, is that we can actually introduce nonlinear lines or graphs. Let's just draw a few data points. What we could do with linear regression is go kind of through the middle, with something that minimizes all of these distances. So that's OK. But what we can also do is start to create lines that are more like this. We can go through the different data points, and you'd have some crazy equation. We're suddenly starting to have an x cubed term minus some x squared term, or whatever, and it gets quite complex. But that's not incredibly important, and we're getting to the point now where we don't have to do the calculations ourselves. We can kind of leave it to the computer. So this is what neural networks do. I wouldn't say they do this exactly, but neural networks do start to introduce nonlinearities. You don't need to worry about this too much at this point. All I want you to appreciate is that with neural networks we start to introduce nonlinearities, which basically means we're no longer using straight lines to describe the data. Now, there comes a danger with this. The really good thing about it is that it's more flexible and we can reduce the loss even further. But the big danger is overfitting, and you'll hear me talking a lot about this. If you can appreciate the importance of overfitting, and you can put that into practice when you're creating neural networks, then you'll be doing really, really well. Overfitting essentially is where you're over-describing the data: your equation is going too close to these data points. If we now introduce more data, let's say we have some more points around here, then, as you can see, the line is not describing the data well at all, because it's essentially overfitted on the original data.
So now what we need would be something better, a more generalized line that goes along here. It doesn't actually go through any of the data points, but essentially it represents the data well. Now, what's really important: you could kind of say, OK, well, let's just give it all the data, and then we can make it go through all the points that we have, no problem. The issue with that is you can never have enough data. I can't think of a situation, usually, where you'd have all the data that's possibly out there. So, for example, if we're talking about, I don't know, the house price predictor, there are always going to be different houses. There might be one that, yes, on paper seems to have a lot of good parts, but then there's one parameter, for example the distance from a school, which is really high and brings down the value. There's always going to be more data out there. So you can't perfectly describe the data in these kinds of situations. You need to describe it well enough, so that the description you have of the data is sufficient. But if you describe the data too well and you go through all the data points, then essentially you're making it very hard for when new data is brought in, for example when you need to make new predictions. So in order to avoid overfitting, one really good thing, of course, is to increase the amount of data. We'll talk about lots of strategies to reduce overfitting in the future. But what's really important for you to appreciate for now is that we split our data. We split it into training data, that says 'train', you'll just have to trust me on that, and then what you can call test or validation data. That's what people commonly call it: test or validation. And what we do, basically, is train our neural network on this training data, so it creates the line.
But then we apply the exact same line that it has come up with to some data that the neural network has not been trained on, so it can't update its parameters. So it creates that line, and then we put it in here and, oh God, the loss. The loss is OK on our training data, it's quite low, but the loss is extremely high when we check it against the test data. So we say to our neural network, go back and try again on the training data, and it needs to try something different. And it can't use the test data to train on, so every time it comes back to it, it's like it's seeing it afresh. So the important thing to realize here is that overfitting is a danger. So we split up our data into training and test. We still train our neural network using the training data, and we use the test data purely to make sure that it's not overfitting on our training data and that it's able to adapt to real-life data that's outside of the training set. So that's everything for our recap. Hopefully that all makes sense. Make sure you understand these concepts. Maybe go through the lecture again at a faster speed and just note down the important points, even if it's just the headers from, like, the first 30 seconds. Have a go at that, and if you're able to really appreciate these concepts, you'll be well on your way. I'll see you in the next lecture.
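The train/test idea can be sketched in plain Python. Everything here is made up for illustration: the six training points sit alternately one unit above and below the line y = 2x + 1, the three test points lie exactly on that line, and the overly flexible model is the polynomial that passes through every training point (Lagrange interpolation), standing in for an overfitted curve.

```python
train_x = [0, 1, 2, 3, 4, 5]
train_y = [2, 2, 6, 6, 10, 10]   # y = 2x + 1, nudged up and down by 1
test_x = [0.5, 2.5, 4.5]
test_y = [2, 6, 10]              # held-out points on y = 2x + 1


def fit_line(xs, ys):
    """Closed-form least-squares straight line; returns a prediction function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return lambda x: w * x + b


def fit_through_every_point(xs, ys):
    """Polynomial that hits every training point exactly (Lagrange form)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared error of a prediction function on a data set."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


line = fit_line(train_x, train_y)
wiggly = fit_through_every_point(train_x, train_y)

for name, model in [("straight line", line), ("through every point", wiggly)]:
    print(f"{name}: train loss = {mse(model, train_x, train_y):.3f}, "
          f"test loss = {mse(model, test_x, test_y):.3f}")
```

The curve through every point has zero training loss but a much larger test loss than the straight line, which is the pattern the test split is there to catch.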
4. L307 ScalingLinearRegression: I'm very excited to introduce you to our first neural network. And you might think, OK, well, this is going to be a really big jump from what I've already learned. I can absolutely assure you it's not. It's just going to be a baby step, a very small one. And we'll actually now be talking about deep neural networks. So here's a diagram of the kind you've probably seen before, something that looks very intimidating, with lots of crazy lines and things called hidden layers. This is what you get when I ask someone to draw what they think neural networks are when they don't know much about them: they draw something with lots and lots of lines. Actually, it's really not that complicated, and I'm going to lead you through it right now. So here's one I drew earlier, which is obviously so much better than this one. Essentially, we're going to break this down. We have what we call an input layer, we have hidden layers, and we have an output layer. Now, I'm just going over the basics at this point, and there's not that much more that we need to talk about. For the input layer, let's say we're creating a house price predictor, and we're just going to keep it really simple for now. So we'll have one thing here, the number of rooms. We want to predict the price of the house, and the input we're giving is the number of rooms. Now, of course, and where's my little drawing tool, we could create a number of other inputs here. This could be, let's say, the age of the house and stuff like that. Of course we can have lots of different inputs, and we will get on to that, don't worry about it, but for now we're going to keep it a bit more simple, and it'll be a very easy transition to add more parameters. Next up are the hidden layers, so these things here. Each of these circles we can refer to as a node. We'll be zooming into one of these in a second.
But essentially, this is where all the calculations go on, and the calculations are very simple, similar to our linear regressions. And at the end is our output layer, and this will be our prediction. So, for example, what we might have here is the price in bucks. So essentially, we have our input here, then we do a lot of calculations in the hidden layers, and then we have an output layer, which is our prediction. And that is a neural network. That's all there is to it. There's nothing crazy about it that I haven't told you about. But now we need to look into what these calculations actually do. So what we're going to be looking at now is our input layer, then one of these nodes, and then another one afterwards. So we start off with the input layer, and then essentially what we're doing at the start here is, yes, you've guessed it, linear regression. I know it's kind of difficult to believe, but actually that is kind of what's happening. So I've got y = wx + b, and this weight is going to be randomly initialized, so it's just going to be a completely random weight to begin with. So that's the first thing that happens. So we have our input layer and, of course, like I said before, we can have a number of different inputs, and we're going to go over that later on, but for now we're just going to ignore that. And so we have our number. Let's say we have the number of rooms, which is four. So we're basically going to do y equals a random weight times four, plus b, which is our bias, and that will also be random unless we specify it. So we're going to have y in here; we have a value for y. And now we actually want to do something else here. We want to make that into a new number, and we're going to call this number z. Because if you think about it, this is going to be y = wx + b, and then we're going to do the same calculation here. We can separate these two calculations by numbering them.
We can call this one y1, and we can call this one y2 = wx + b, where x will actually be our y1, if you think about it, because that's what we're putting in. Our input is going to be this y, so we could write y2 = w·y1 + b. Now, that step might have been slightly confusing, just with the way I explained it. Hopefully it wasn't, but just to make sure you understand: x is always our input. Our input here is four, and then we work out y here. Now, to put this into our next node, obviously our input is going to be y. So that's y1; that's why x is now y1. But if we were to just do linear regression the whole way, if you really stop and think about it, if you do this calculation, y1 = wx + b, then y2 = w·y1 + b, et cetera, what you realize is it's still linear, and so you actually won't make any changes. It would just be one straight line, which is essentially the same as just having one equation, y = wx + b. And so that would just be doing linear regression, but with more steps. So we need to introduce something called a non-linearity. I'm definitely going to type that out, otherwise you'll be spending five minutes watching me trying to write out non-linearity. And for that, the basic one we use is something called sigmoid. So I'm just going to draw that out for you on the graph. So sigmoid has a shape like this. Something like this, anyway. No, that's... well, let me try again. Kind of like this. It can definitely be drawn better. I definitely suggest you do a Google image search. But what this does: we have zero here, we have one here. So it introduces a non-linearity, which basically means that now that we have y, we want to convert that to a number between 0 and 1. Based on what value y is, it will be a number somewhere on this non-linear graph. So this y will be converted to z, and z will be between zero and one. So what have we done in this first step?
We've done linear regression, y = wx + b, and then we've converted y to a number between zero and one using a sigmoid graph, and that gives us z. And we're just calling it z because it's a different number, and we want to signify that it's a different number. So instead of it being y2 = w·y1 + b, it's going to be y2 = w·z, and because it's the first z, we just call that z1, plus b. So that's it. And so you may be wondering, right, is this w going to be the same as this w? Well, that's a really, really important point: all of our w's are going to be different. So when you look back at this graph here, we're essentially doing linear regression and adding our sigmoid, which is our activation function. So that's a very, very important term to remember: activation function. So we're introducing our activation function, sigmoid. Sigmoid is just one type of activation function; there are a number of different ones. You really don't need to be concerned with them for now. Just remember that, essentially, between any of these nodes we're just doing linear regression, and then we're adding an activation function that converts the output of our linear regression into a number, usually between zero and one. But what's important is that it's not linear, it's non-linear. So that means we're not just going to have a straight line, and it's actually going to be quite, quite bendy, to use the technical term, of course. So right here, what we've worked out is that on any one of these lines, what we're doing is linear regression, and then we're adding an activation function to turn the result into a number between zero and one. Then we take that number that's between 0 and 1 and input it into the next one using linear regression, and then again we add an activation function. And we do that every time we're doing linear regression.
We're using a different w each time, and we'll talk a bit more about b. Essentially, in the different layers, we're using the same b within a layer, but it's different between layers. That's not too important for now. What's really, really important is that you're excited, because we've literally just cracked the code of what a neural network is. And that's essentially that wherever you see a line, all that's happening there is that we're doing linear regression, and then we're adding an activation function that turns the number into a number between 0 and 1. And we keep doing that, even up to the point where we get to the output layer. And so that's it for our transformation from linear regression to deep neural networks. We'll be going into some of the finer points, but if you can get your head around that, then you're really, really well on your way. So congratulations for basically setting up your first neural network, and I'll see you in the next lecture.
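The point in this lecture about chaining linear regressions can be checked with a few lines of Python. This is only a sketch: the weights and biases below are hypothetical values picked for illustration (the lecture has them randomly initialized), and the function names are made up. It shows that two linear steps collapse into a single y = wx + b, while adding a sigmoid after each step does not.

```python
import math

def sigmoid(y):
    """Squash any number into the range between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-y))

# Hypothetical weights and biases, chosen only for illustration.
w1, b1 = 0.5, 0.1
w2, b2 = -1.5, 0.3

def two_linear_steps(x):
    y1 = w1 * x + b1        # first linear regression
    y2 = w2 * y1 + b2       # second one, fed with y1
    return y2

def one_linear_step(x):
    # The same thing collapsed: y2 = (w2*w1)*x + (w2*b1 + b2)
    return (w2 * w1) * x + (w2 * b1 + b2)

def two_steps_with_sigmoid(x):
    y1 = w1 * x + b1
    z1 = sigmoid(y1)        # activation function after the first step
    y2 = w2 * z1 + b2       # the next step takes z1 as its input
    return sigmoid(y2)

for x in [0.0, 2.0, 4.0]:
    print(x, two_linear_steps(x), one_linear_step(x), two_steps_with_sigmoid(x))
```

The first two columns match for every x, which is the "linear regression with more steps" problem. The last column stays between 0 and 1 and does not change by equal amounts for equally spaced inputs, so it is no longer a straight line.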
5. L308 MultipleInputs: So now that we've looked at neural networks and how each of these connections works between the nodes, let's now just take a quick look at having multiple inputs. So, as you can see in this neural network, for example, we have a lot of input nodes in our input layer, each of these connecting to each of the different nodes in the next hidden layer. So we're just going to focus now on the input layer and see how that works. Now, it should all be relatively straightforward, so hopefully this will all be quite easy to understand, but let's have a go. So we're going to look again at this being a house price predictor. So, one of the independent variables we're going to be adding here, we can just call this x. Let's draw this out. This could be x1. This is our first variable. This is the number of rooms, our independent variable. Obviously, our dependent variable is the house price, which ends up on the right-hand side of our neural network, which is the output layer, but we won't worry about that for now. So just remember that for each of these connections to the node, what we're doing is y = wx, and this is x1, plus b. Now, actually, in a hidden layer, b is always the same. It's w that changes, and b is the same for each of the different nodes. It will obviously be updated, but it remains the same between the nodes. So this is b1, and the next thing here we have is b2, for the next layer. So this is what happens: y = w·x1 + b1, and then we add in our activation function, which is that sigmoid graph here, which I draw absolutely terribly. And then that turns into z equals, and then, to signify that we're adding the activation function, you can write this however you want. It's very common for it to be written just as h, h of y. Don't ask me why, that's just how it is. And that's y1, of course.
So hopefully, at this point, that will make sense, because that's what we went over in the last lecture. So essentially we're taking our input, which we're calling x1, and we're working out y1 by doing linear regression. So y1 is equal to w·x1 + b1. The reason why this is y1 is because this is our x1; this is our first y. So for the next linear regression here, it's going to be y2 = w, sorry, w·x, and so it could be x2. It's actually z, right, because that's the input. So this is going to be w·z1 + b2. So hopefully that will make sense to you. Now that we have that settled, and that happens for each of these nodes, so again, this is h of y1 equals, blah, blah, blah; we have an activation function going on in there and we get another output. So what I want us to concern ourselves with in this lecture is that we have multiple inputs. Let's add two more and see how that works. Let's say we don't only have number of rooms now, but we also have the age of the house, and so that's going to be a continuous number. But we could also have a zero or one, so yes or no, and that could be "has a pool". So if it does have a pool, this will be a one; if not, then zero. So obviously you have different types of numbers coming in here. Number of rooms could be zero to infinity, or 1,000, whatever. Pool would be just 0 or 1. We can call these x2, so age of house would be x2, and then "has a pool" could be x3. So then these all connect with these different nodes as well. And this is where you might think, oh God, this is going to get messy, but actually it doesn't get all that messy, especially because the computer does all the calculations. But what's important here is, let's say x1, x2, and x3 are all connecting to this hidden node here. Well, we're still working out y here.
And we'll be working out a y here for the age of the house, and here for whether it has a pool, so we'll be getting three different y's. Now each of these will have a different w. They'll have a different weight, and obviously they have a different x, but they will have the same b. So with different weights but the same bias, we'll be getting three different y's coming into here. So what do we do with all these different y's? Well, it's actually really, really simple: we literally just add them together. We just take the sum. That's all there is to it. So then, in this node here, this is taking the sum of this linear regression, and this linear regression, and this linear regression here. So what's really important to keep in mind here is the fact that we have these three different y's, or however many are coming into that node. They're all added together, and each of the different linear regressions will have a different weight but the same bias. And then obviously, if we have another hidden layer, there'll be a different bias for that layer, but we'll talk about that in the next lecture. What I want you to really appreciate from this lecture is obviously the foundational thing we've been talking about, having a linear regression and then adding an activation function. But then, on top of that, I want you to appreciate that with our input layer, we can have lots of different inputs going to the same node in the next hidden layer, and they're going to have a different weight, a different w, but the same bias. And then we're just adding them all together, and that's going to be a y that's going to be put into the activation function here. And this can be continued, so it's not just the inputs. Let's say, for example, we have another hidden layer over here, and all of these are connected to the next node. Now, there's going to be a linear regression here, a linear regression here, and a linear regression here, and they're going to have different w's.
They're going to have the same bias, which is going to be b2. And when we get to here, y is going to be the sum of all of those y's that we just worked out. So this one, this one, and this one are just going to be added together and then run through the activation function, and that's all there is to it. So on the one hand it's very simple, but on the other hand it's really important that you appreciate the specific steps it goes through, because otherwise you can get lost in the detail. So what I highly, highly recommend you do at this point is get a pen and paper, or go onto this whiteboard here. I use tutorialspoint dot com slash free online whiteboard. And just draw it out and make sure you can appreciate all those different things. And keep doing that until you can explain this to someone. Let's just say you're having dinner with a friend and they say, oh, what are neural networks? If you can explain this to them and include all the important points, then you're doing incredibly well, and that's the stage you want to get to at this point. So even if it takes you a day or two, I highly recommend you just draw this out a couple of times, try to explain it to friends or family, and give that a go. And then once you're feeling confident, I'll see you in the next lecture.
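As a companion to drawing it out by hand, the single-node calculation from this lecture can be sketched in Python. The function name `node_output`, the example house, and the weight and bias values are all assumptions for illustration. The code follows the lecture's picture literally: one linear regression per incoming connection, each with its own w and the shared b, summed and passed through sigmoid. Many libraries instead write a node as one weighted sum plus a single bias per node; the two forms differ only in how the bias is counted.

```python
import math

def sigmoid(y):
    """Activation function: turns y into a number between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-y))

def node_output(inputs, weights, bias):
    """One hidden node as described in the lecture."""
    ys = [w * x + bias for w, x in zip(weights, inputs)]  # y_i = w_i * x_i + b
    y = sum(ys)                                           # add the y's together
    return sigmoid(y)                                     # z = h(y)

# Illustrative house: 4 rooms, 10 years old, has a pool (1 = yes, 0 = no).
x = [4, 10, 1]
# Hypothetical weights (normally randomly initialized) and a shared bias b1.
w = [0.2, -0.05, 0.3]
b1 = 0.1

z = node_output(x, w, b1)
print(z)
```

Working through the same numbers on paper (three y's, their sum, then the sigmoid) is a quick way to check the drawing exercise suggested above.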
6. L309 HiddenLayers: Okay, so now that we've gone over most of the basics of the neural network, I just want to go over the hidden layers very quickly. So the hidden layers are where the bulk of our calculations are done. Essentially, this is where we're usually going to have the largest number of our weights and our biases. So yeah, we have a lot of calculations going on here. Now, we can have as many hidden layers as we like. I have three here; it's just a random network architecture. So the architecture is based on how many hidden layers you have and how many nodes you have in each hidden layer. It's completely up to you, and it depends on how complex you want your neural network to be. Now, what most people think when they start creating their first neural networks is, let's just chuck in, like, 50 different layers and 100 nodes in each layer, because obviously that's going to perform the best. Well, actually, it's not so simple. First, we definitely have to pay attention to the fact that the more hidden layers you have, and the more nodes you have, the more computation you have to do. And also, one of the things is that it actually might start to overfit because of that. We've already talked about overfitting, and we'll talk about it more in the future. But really, what I want to do is just solidify this and make sure that you're happy understanding what hidden layers are. Essentially, they're just where the bulk of the calculations go on, where the linear regression and our activation function are going on. So let's just go into a few more details of what actually happens. I'm just going to use a magnified version here. So let's say this is our first hidden layer here, which we'll call HL1. This will be hidden layer two. Now, it's actually very simple what happens here. So let's just look at what happens to this guy here. Okay, so we have one linear regression coming in here.
We'll just call this y1, and then we have another linear regression going on here, which we'll call y2, and then we have a third linear regression coming in here, which we'll call y3. Now, in order to get the y here, that's literally just going to be y = y1 + y2 + y3. It's as simple as that. So however many nodes you have connected to this node that we're working out, all it is is a load of linear regressions, summed up. So we could write this as y equals the sum of those y's; we could just put y_i. So what's really important to note here is that we're adding together these linear regressions. And this is the really, really important part: every single linear regression we have in the whole neural network, so all of these different lines, is going to have a different weight. Obviously they may be the same at some points because, you know, by random chance this one might be 0.8 and so might this one, but they're all independent of each other, let's say. So each and every one of these will have a different weight and will be updated by itself. But within the same hidden layer, we've always got the same b, the same bias. So if we look back at our equation, y = wx + b, the bias will stay the same within a hidden layer, but the weights will be independent of each other, and that just allows for a lot more complexity to occur. And that's the whole point of neural networks: to allow for a lot more complexity, to find more hidden patterns. And obviously this is all kind of random at first. The thing that's actually shaping how we change the weights and how we change the biases between these layers is the loss at the end. So when we get to our output layer over here, we calculate the loss: how different our prediction is, because that's what's going to come out, our y-hat.
How different that is, our computer will work out by comparing it to the actual y. Based on our weights and our calculations, we've created y-hat, and we compare it to y and see how different it is. And then, we'll talk about this later, but the computer will then update the weights. Just like, remember, when we had our line of best fit: it might start here, and then it realizes that to lower the loss it needs to change the weights to be here, or something like that. It's kind of similar to that. Obviously it's very hard to actually visualize this on a graph now, because there are so many different axes, so we can't really do that. But what I really want to emphasize in this lecture is that you can have as many hidden layers as you like, and as many nodes within a hidden layer as you like. It's computationally very expensive to add lots of hidden layers and lots of nodes, because with all of these different lines we're doing linear regression, and all of them have independent weights and the same bias within a layer. And to calculate what goes into a node at any point, the computer is basically going to be taking the linear regression for each of these and summing them together. Now, some of you may be thinking, well, I just want to be able to code neural networks, and I'm going to let the computer do the calculations for me, so why worry? But actually, it's really helpful to understand the theory behind neural networks, so you can understand things like why it's computationally expensive, what's actually going on under the hood, what the point of doing all this is, and what it generates. The network generates our prediction, which helps us to work out our loss. So it really does help to inform you to create your best neural networks. So the takeaways from this lecture are about your deep neural network and these middle layers. So you have your input layer on the far left, presuming we're doing the house price predictor.
That's going to be our independent variables, like number of rooms and age of house. The output layer will be literally just one node for us, so you can ignore those four ones. It's just going to be one here, saying the price. Everything in between is the hidden layers, and within those we just have lots of connected nodes. Each time they have a connection, they're doing a linear regression with independent weights but the same bias within a hidden layer. So make a note of those important points, and I'll see you in the next lecture.
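To get a feel for why adding layers and nodes gets computationally expensive, here is a small Python sketch that counts the connections (lines) in a network, since each line carries its own independent weight. The helper name `count_connections` and the two example architectures are assumptions for illustration; the second one mirrors the "50 layers of 100 nodes" idea mentioned in this lecture, with two inputs and one output added.

```python
def count_connections(layer_sizes):
    """Number of weights when every node connects to every node in the next layer."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

small = [2, 3, 3, 1]            # 2 inputs, two hidden layers of 3 nodes, 1 output
big = [2] + [100] * 50 + [1]    # 2 inputs, 50 hidden layers of 100 nodes, 1 output

print(count_connections(small))  # 2*3 + 3*3 + 3*1 = 18
print(count_connections(big))    # 2*100 + 49*100*100 + 100*1 = 490300
```

Every one of those weights is a separate linear regression on each forward pass and a separate number to update, which is where the extra computation comes from.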
7. L310 OneFullPass: In this lecture we're going to go through one full pass, well, one forward pass, of a neural network, our conceptual neural network, because once you're able to do the forward pass of a neural network, you kind of understand how it works. I mean, there are a few more things to pick up, but we're definitely on our way. So, to do a pass this way: the first thing that happens in a neural network is that we do forward propagation. That's what it's called, forward propagation. Forward propagation is when we go this way, and then straight after we do backward propagation. And so we're just going to go into a bit more detail about what these two things are. So essentially, what we start off with right at the beginning is that we have our input layer here, and let's say this is for our house price predictor. Let's say this is the number of rooms, and that's the age of the house here, whether it has a swimming pool or not, those kinds of things. Now, at the start, what our computer is going to do, unless we specify otherwise, is initialize each of our weights, each of these w's, to be between zero and one, randomly. It's just going to assign them completely randomly. So, let's say we have a graph, and then it's just going to create a line that's completely random, regardless of what the information is actually telling us. And then it's going to go through and create all the weights and all the biases, and then it's going to test out these weights and biases by doing forward propagation. So let's say we have loads of labeled data. Let's say we have 100 pieces of data, and all of them contain our x's, so the information about the age of the house, the number of rooms, that kind of stuff, and they also include how much each of those houses is worth. So then we try it. We take all the x's of our data points, one at a time.
So we take the first bit of data, about one house, and we run it through this whole network so that the output at the end here is the price, and it will come out with some kind of prediction, which will be our y-hat. And maybe you can see where I'm going with this now. So then the computer has its prediction, but it also has the ground truth. By that, I mean it has the actual price of the house, and then it can work out: okay, so that's the y-hat I predicted using these weights and biases, and this is the actual ground truth. So we do this, and we can start to work out the loss, because, you might remember, if we use mean squared error, for example, we could look at the sum of (y-hat minus y) squared, over n. Because it's just one data point, this would be the loss for that point. But then, if you run all 100 houses through this, and then you add them together like we're doing here and divide by n, then we'll have the overall loss. So that's what we're doing here, where it says the purpose: we're testing out the weights and biases. So the computer has generated loads of weights and loads of biases. It runs through the data that we have, and then it says, right, what's the loss when I use these weights and biases? And then it gives us an output, which is the loss. Just for argument's sake, let's say the loss is 10, a completely random number. Now, what the computer will go through next is backward propagation. And during this time, what it will do is change all of the weights and biases, usually just by a small amount, in an attempt to reduce the loss. So it will say, okay, so that's the loss if I use these w's and b's here. What happens if I use w's and b's over here? Obviously this graph, because we have so many w's and b's, would be multi-dimensional, a lot more than this, but let's just, for argument's sake, say it's like this. So, on to backward propagation.
During backward propagation, it goes through all of these different weights and biases and, using some more complex math, which we don't actually need to fully understand for the purposes of this course, it will basically update the weights and biases, usually ever so slightly. And then it will take these 100 data points again and do the exact same thing. So it runs them all through this and sees: okay, now that I've updated the weights and the biases, what's the loss now? You know, we've got some new predictions; how different are they from our ground truth? It might be like, okay, so now the loss is 9.5. And then it will do the exact same thing again. It will go back and say, okay, I'm going to update my weights and biases even more, in an attempt to reduce that loss even more. And that will just keep on going. So each time you do one run of forward propagation and one run of backward propagation, that is what we call, ladies and gentlemen, an epoch. And I absolutely love that word; I think it's a nice word. So make a note of that, because when you're coding, you specify the number of epochs you want your neural network to do. Now, usually you kind of just want to use a lot of epochs. I mean, each epoch takes time, depending on how big your neural network is. Epochs do take up time and computation, and also, sometimes you can overtrain and overfit slightly, but usually you want to have quite a lot of epochs in order to train something. So just keep in mind that an epoch is one run of forward propagation and one run of backward propagation. So that's literally the whole thing now of how neural networks work, which is super exciting. So we're just going to make sure that you've got the maths kind of down and you understand what happens during one pass of forward propagation. So I'm just going to create one very basic neural network here. We've only got three nodes in a hidden layer.
Let's say this has two hidden layers, and then we'll have an input and an output. Just for fun, let's add a couple more. Okay, so here we have our input. So this is going to be, for example, number of rooms, and let's have multiple inputs, just because that's always good practice. So let's say here we also have distance from school in kilometers. And then finally we have over here the output layer, which is house price. So I want to make sure you guys completely understand what's going on here. So during forward propagation, what we're going to be doing is taking one single data point. We talked about this before: remember, with supervised learning, you have to have existing labeled data. In production, you'd probably want to have tens of thousands of existing houses that are labeled. When I say labeled data, what I mean is that we have existing information on the number of rooms of the house, the distance from school of the house, and the price of the house. So that's how we're able to train our neural network. So let's say we have 100 data points. We just take the first one of these and we put it into our neural network. Let's say the number of rooms is four and the distance from the nearest school is two kilometers, or something like that. So what's going to happen here is, for each of these lines, as we've discussed before, we're going to have linear regression happening. So we're going to have a linear regression happening here, y1, and a linear regression here, y2. So with the linear regression, we have y = wx + b. The w stands for the weight, and that's just a parameter that's going to be independent of any of the other weights. It's usually just going to be randomly initialized between 0 and 1. And then we're going to have a bias, and the bias is going to be the same for the whole neural network. So this could be y equals, and then a w here.
That's randomly initialized, plus this bias, and the bias will be the same for the two of them. But we have two y's coming in here. So the way that we actually get just one number coming in is that we literally just sum all of those y's, so this could be the sum of y_i, for example. So then, once we have that, let's just call that y. And then we want to make sure it is not linear, because we want to make sure this isn't just a linear regression that gives a straight line. If we just ran these lines with linear regressions, it could be generalized to just one single y = wx + b, which means this would be a straight line, and it's not going to be able to describe complex patterns, and that's the whole point of neural networks. So you run it through an activation function, and we tend to use sigmoid. There are other ones that are better, but most of them basically just give you back a number that's between 0 and 1, or minus one and one, or something like that. You don't need to worry too much about that. Just remember that we use an activation function, and we're using sigmoid. And let's say what comes out of that is z. So we're using sigmoid, and z is going to be a number between 0 and 1. So we've taken our two inputs, run them through linear regression with independent weights and the same bias, added them together, and run the result through an activation function, and now we have a number between zero and one. So that's going to happen in exactly the same way for this guy here. And then we're going to do something similar here. Well, pretty much the exact same thing. So each of these nodes is now going to be receiving two inputs, so two linear regressions with independent weights and the same bias.
And so this is also going to be between 0 and 1. So those two linear regressions are going to happen, and then we're going to add them together here, and then we take whatever number that is and run it through our sigmoid activation function so we get an output, which we can call z, and that's going to happen for this one as well. Although this, again, is going to have two different weights. So the weights are going to be different, and that's really important for you to appreciate, but the biases are going to be the same. And then again, it's going to do the exact same thing. So I won't waste time going through that again, but it's two linear regressions with independent weights, added together and then run through the activation function. And then the exact same thing happens here, and then it might just run straight into here, and so these two will be added together to give the final house price. Now, when we get to this point, this number, after all this being fed through, if you think about it, is going to be our y-hat, our prediction. Okay, so essentially, in the first run of forward propagation, all these weights and biases are completely random, but we get a prediction. And now we can compare that, because the data is labeled: we have what the actual price of the house was. And so then we can work out what the loss is, what the cost is, so how different y-hat was from y. Now, once we've run 100 houses through, we've been able to collect 100 of these, so we can just take the average. This is if we're using mean squared error to find our loss. We're squaring that, as you might remember, just to penalize any predictions that are really incorrect, and then dividing by n, so dividing by 100. And that's literally one full pass of forward propagation. After that, backward propagation occurs, where the weights and the biases are updated. That is a bit more complex maths, and we use a number of different methods.
One of the most common ones is something called gradient descent, but there are other ones that can be used, and for the purposes of this course we're not going to be going too deep into the math on that. So for now, all you need to do is appreciate what backward propagation is: during that time, the weights and the biases are updated in an attempt to reduce the loss the next time we do forward propagation. Each time forward propagation happens, the weights aren't changing and the biases aren't changing. It's literally just a way of testing out our new weights and biases. Backward propagation is when we actually update the weights and biases; forward propagation is when we test them out. One pass of forward propagation and then one pass of backward propagation is called an epoch, and we usually run that numerous times, each time hopefully with the loss decreasing. Obviously, sometimes the computer gets it wrong and updates the biases and the weights incorrectly, but it's all an attempt to reduce the loss to a minimum. And now, if you think about when we're using train and test data sets, or validation data sets, we'll be running forward propagation for all of our training set, and we'll be able to get a loss for our training data. But then what's really important is that we take our test data set as well. Let's say we have a separate 20 bits of data, and we run those through our network as well. When we do back propagation, it does not take these into account, so it's almost as if it's blind to them: it has never seen these bits of data before. So it does forward propagation on these 20 as well. Then we have a loss for our training data and also for our validation data. And so that, ladies and gentlemen, is a full forward propagation run. We've described the backward propagation and how those two together make up an epoch, so definitely make sure that you can understand and appreciate how this works.
Obviously you can add many, many more nodes to all of these, or more hidden layers; that's up to you. But the theory will stay the same no matter how many you add. You could make it as complex as you like. It's just the same thing happening more times. So if that makes sense, I'll see you in the next lecture.
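The forward pass and the loss described in this lecture can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions, not the course's own code: the function names (`sigmoid`, `node_output`, `forward`, `mean_squared_error`), the two-node hidden layer and the example weights and prices are all made up for the sketch, and the bias is added once per input because that is how the lecture describes each node.

```python
import math


def sigmoid(z):
    """Squash any number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-z))


def node_output(inputs, weights, bias, activate=True):
    """One node: a w * x + b per input, summed, then (optionally) sigmoid."""
    total = sum(w * x + bias for w, x in zip(weights, inputs))
    return sigmoid(total) if activate else total


def forward(inputs, hidden_params, output_params):
    """Feed the inputs through one hidden layer and a linear output node."""
    hidden = [node_output(inputs, w, b) for w, b in hidden_params]
    out_weights, out_bias = output_params
    return node_output(hidden, out_weights, out_bias, activate=False)


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between y and y hat."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)


# Hypothetical values standing in for the random starting weights and biases.
hidden_params = [([0.5, -0.2], 0.1), ([0.3, 0.8], 0.1)]  # (weights, shared bias)
output_params = ([1.5, -0.7], 0.0)
houses = [[0.2, 0.9], [0.6, 0.1]]  # two inputs per house
prices = [1.0, 0.4]                # the labeled values, y

predictions = [forward(x, hidden_params, output_params) for x in houses]
loss = mean_squared_error(prices, predictions)
print(predictions, loss)
```

Backward propagation, which would update `hidden_params` and `output_params` to reduce `loss`, is deliberately left out, in line with the lecture.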
8. L311 BiasandVariance: Now I just wanted to discuss in a bit more detail what we've talked about before in terms of overfitting and underfitting. But now I'd like to talk about them in terms of bias and variance, because very often, especially if you're working in a team, you'll hear those terms. So I've talked about overfitting and underfitting, but often they're referred to as bias and variance. I just wanted to make sure you've got a handle on not only what overfitting and underfitting are and how we can reduce the issues of both, but that you're also able to refer to them as bias and variance. It's not a huge deal to refer to them as one or the other, but it's good to understand both. So if you have high bias, it means your model is underfitting. If you think of your model, you can just imagine the line of best fit that we've been talking about with linear regression. You can imagine that it's not fitting the data well: the line of best fit is nowhere near the data points. So it's not picking up enough patterns to describe the data well and to make accurate predictions. So with our data sets, when we have our training and test sets (some people call them train and validation, some people call them training and evaluation, whatever you call it; I've been talking about training and test, so I'll just talk about it like that), if you have low accuracy for both your training and your test data, so if you have a high loss, then that means you have high bias, or you can refer to it as underfitting. So basically, when you have low accuracy for both your training and your test data, that means you're underfitting, and you have high bias. Now, on the other side of things, there's high variance, and this is where your model has abstracted too much complexity from the data that it has.
So it's not able to respond very well to new data. There's an example of tanks with forest in the background that we'll be talking about later on in this lecture. We've talked about this quite a lot, and we've talked about a number of different strategies we can use to reduce variance, namely reducing the complexity of our neural nets. But really, the most important thing is to try and have as much data as possible. If we can increase our data size, that's really what's going to help with overfitting, because the network is going to have more data to abstract complexity from. So just remember, essentially, if you have high bias, that means you're underfitting, and you'll be able to see that in your results if both your training and your test data have low accuracy. On the other side of things, you may have high variance, or what we call overfitting, and that's if the accuracy is high on your training data set but low on your test or evaluation data set. So, a few examples that we're going to be going through. For high bias, when you're underfitting, we're going to look at the driverless car and an example of a teacher and a student. And for overfitting, there's actually a really interesting real-life application where it went wrong because of overfitting, and then we'll be looking at the case where a friend says they don't like a movie. So, first of all, the driverless car. You can imagine that if you have a driverless car and you don't train it for very long, so you don't allow it to pick up enough patterns, then it could, as a very basic pattern, say: okay, well, when I see black and white, that's essentially the road, so I'm just going to stick to that. Okay, well, what if it then sees a zebra? If the only rule is that it's looking for black and white, then it's underfitting. It needs to be able to detect a lot more patterns than just the colors.
Black and white is not enough, so you'd say it's underfitting, and you probably wouldn't want to take this car out on the road, especially if you're going on safari. That's one example of underfitting. How about this: let's say we have a teacher, and he's been explaining a number of different rules in math. Let's say that the student has been sleeping, and the teacher says, what did I just say? The student actually didn't pick up any of the patterns and hasn't learned anything from the lesson. So you could say that he's been underfitting. Now here's the really interesting real-life application. Some time ago, an attempt was made to create a neural network, an image classifier, that could detect heavily armored vehicles and tanks. It was given a large data set. And it was very strange, because once the neural network had been trained, they were sure that it was working. But when they actually tried to add in some new images to predict, they found that if any image contained trees, then it would essentially say that there was a tank, even if there wasn't. What had happened is that in all the data that was provided at the start, the tanks and heavily armored vehicles were only in forests, around trees. So essentially the issue here is that the neural network was overfitting: it abstracted too much complexity because there wasn't a diverse enough data set. In this example, it wouldn't have helped at all to reduce the overfitting by, let's say, reducing the complexity of the neural network. What's really important here is that they should have added a larger diversity of images, so that a lot of the images had tanks and so on but without the trees around them. So finally, here's another example. Let's say someone says they don't like a movie. Now suppose you were to abstract from that a pattern of saying, okay, well, that means they don't like any movies ever.
Well, that's not true. You need to ask some more questions about whether they like other movies or not. It's taking just one piece of information, that they don't like one movie, and abstracting a pattern that's basically saying they don't like any movies ever. Whereas if you were to ask more questions, like: okay, well, do you like comedy movies? Do you like action movies? They might say, oh yeah, I like comedy movies but I don't like action movies. Then you'd be able to abstract more patterns from it. So in this example, you need more data, in terms of just asking more questions. So those are a few examples of different real-life cases of overfitting and underfitting. In order to reduce bias, or underfitting, there are a couple of very useful things you can do in general: increase the complexity of your neural network by adding hidden layers and adding nodes to layers, and increase the amount of training, so the number of epochs, for example. That helps to improve both the training accuracy and the test accuracy. In terms of reducing variance, one main thing you can do is decrease the complexity of your neural network, and that's often very helpful, so that it doesn't abstract so many complex patterns that may not even be there in real life. But one of the biggest things you can do to reduce overfitting, and one of the best things you can do for your network, is to make sure you have as much data in the training set as possible, and that you also have enough in your test or evaluation data set to check that you're not overfitting. If you can't get more data, then one thing you can try is to augment the data, which we'll talk about in a moment. You could also add dropout to your layers, which we've talked about before.
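The rule of thumb from this lecture (low accuracy on both sets means high bias, high training accuracy with low test accuracy means high variance) can be written as a small check. This is only a sketch: the function name `diagnose` and the two thresholds `good_enough` and `max_gap` are assumptions chosen for illustration, and sensible values depend on the problem.

```python
def diagnose(train_accuracy, test_accuracy, good_enough=0.8, max_gap=0.1):
    """Label a result as underfitting, overfitting, or neither.

    good_enough and max_gap are hypothetical thresholds, not fixed rules.
    """
    if train_accuracy < good_enough and test_accuracy < good_enough:
        # Low accuracy on both training and test data.
        return "high bias (underfitting)"
    if train_accuracy - test_accuracy > max_gap:
        # Good on the training data, clearly worse on unseen data.
        return "high variance (overfitting)"
    return "no obvious problem"


print(diagnose(0.60, 0.55))
print(diagnose(0.95, 0.70))
print(diagnose(0.90, 0.85))
```

The first call matches the underfitting case, the second the overfitting case, and the third a model that does similarly well on both sets.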
So in terms of dropout, what's important is that when you have your neural network, just like this, essentially what dropout is going to be doing is muting a number of the different nodes. Every time it goes through an epoch, it randomly chooses a certain percentage of nodes and just ignores their weights and doesn't actually apply them. What that does is add a certain amount of randomness. You can use the word randomness, or you can say stochasticity. It's kind of the same word, and I thought I'd introduce that to you. So with dropout you're randomly removing a certain percentage of these nodes, or you could say you're just muting them. As for the dropout rate, let's say you have a rate of 0.5: that would mean you're dropping 50% of the nodes in any given epoch, and they're randomly chosen. If you were to choose a dropout rate of 0.1, then just 10% of them would be muted. In terms of augmenting data, let's look at a few examples. Let's say we were training one of these home devices where you can talk to it and it recognizes you. Well, let's say you didn't have enough examples of people talking. Then one thing you could do to augment the data would be to take all of that audio data of people talking and increase the frequency, so that you've got a wider range of voices. You could add in things like random delays in the audio and background noise. So you could use the original voice, and then you could add these things as well. Once you've supplemented your data set like that, you've suddenly got a lot more data to train your network on, and it includes things that the network needs to be able to deal with, like background noise, delays in the audio, and different frequencies for different voices. So just one more. Let's say you're looking at images.
Well, there are actually a lot of different ways you can augment your data. Here, for example, you've got an image of a cat, and you can move it to different parts of the image. You can cut out different parts. You can change the angle, even the colors. You can even apply blurriness, and you can increase and decrease the lighting. There are so many ways that you can augment your data. But here is an example where you might want to take it a bit easier. For example, you can see one that's just completely green, and it's kind of unlikely that you're going to get an image of this animal that's completely green. And there's an upside-down one as well. So you kind of want to logic-check it at the same time, rather than just chucking in loads of augmentation. It needs to make sense. So that's it for us talking about bias and variance; I hope that makes sense to you. What I'd do if I were you is write down overfitting and underfitting and explain, just in your own words, what they mean in terms of bias and variance and our train and evaluation data sets, and what we can do to reduce overfitting and reduce underfitting. So have a go at that, and I'll see you in the next lecture.
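The dropout idea from earlier in this lecture, where a chosen percentage of nodes is muted at random, can be sketched in plain Python. The function name `apply_dropout` and the list of activations are assumptions made for this illustration. Library implementations typically also rescale the nodes that are kept, which is left out here to keep the sketch close to the description above.

```python
import random


def apply_dropout(activations, rate, rng):
    """Mute a randomly chosen fraction `rate` of the nodes by zeroing them."""
    n = len(activations)
    num_muted = round(n * rate)
    muted = set(rng.sample(range(n), num_muted))  # indices of muted nodes
    return [0.0 if i in muted else a for i, a in enumerate(activations)]


rng = random.Random(0)            # fixed seed so the run is repeatable
layer_output = [1.0] * 10         # hypothetical outputs of ten nodes

print(apply_dropout(layer_output, 0.5, rng))  # five of the ten are muted
print(apply_dropout(layer_output, 0.1, rng))  # one of the ten is muted
```

Each call picks a fresh random set of nodes, which is where the randomness (stochasticity) mentioned in the lecture comes from.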
9. L311 Hyperparameters: So we're now just going to quickly look at hyperparameters. Hyperparameters are a way to tune our neural networks in order to get the best out of them. I'm going to include a list just here of some of the most common hyperparameters. These will become very familiar to you. As soon as you start creating your own neural networks, you'll very quickly become comfortable tuning these hyperparameters yourself, and it's actually very easy to do in your code. It really is simple, so don't worry too much about that. What's important for now is maybe just to note down these different types of hyperparameters somewhere, and what's really important is that you understand what they do and why it's important to tune them. So we're just going to go through them one at a time. First, the number of hidden layers. I think that's quite obvious: it's just the number of these layers that we have here. So why would it be important to change this? Well, right now we have three, but we could equally have another five or 10 or 100. It doesn't really matter; well, of course it matters, but it is all possible. Now, the reason why we would have more hidden layers: the more hidden layers we have, the more complexity we can hold in our equations. So basically we can extract more intricate patterns from our data. That's a really big pro; that's very helpful. But what makes it less helpful? The con of having a large number of hidden layers is that it's computationally expensive, so it would take more time to train your network. That's one difficulty. But also, one of the most important things, which we've talked about before and will talk about very often, is overfitting the data. This is a real problem when you train your neural networks, and you want to avoid overfitting as much as you can. So you want to have some kind of balance, right? Because you do want to extract complex patterns.
That's why we're using deep neural networks. But you don't want the network to be so complex that it actually fits our training data too well, and then it doesn't leave room, let's say, for new data, and it doesn't actually represent data that it doesn't know yet. So the good thing about adding hidden layers and increasing their number is that it increases the complexity. The cons are that it's computationally expensive and that it's prone to overfitting if you have too many hidden layers. The exact same thing goes for the number of nodes in a layer. The more nodes you have, the more complexity, and we could just add more and more. You tend to have the most in the middle of your network, and what you don't want to do is have, like, two nodes in one of your hidden layers, and then in the next one just have loads and loads and loads, and then drop back down to two straight away. In general, you want the change to be relatively gradual rather than extreme, because you'd lose a lot of information if you just had extreme changes. So that's one thing to keep in mind. And yes, it's the same thing really: it's computationally expensive to have a large number of nodes in a layer, and it does help to increase the complexity of our deep neural network architecture, but it's also prone to overfitting. The next thing is the number of epochs. As we described before, an epoch is when we do one go of forward propagation and then one go of backward propagation. Usually you want this number to be quite high. We'll play with a number of different values, but usually it's good to have a high number of epochs, because essentially the more it trains, the better. It may be that at some point during training it actually gets worse, and that's something to keep in mind and keep an eye on.
So we'll, at some point, go over how we can create a graph that shows us what the loss is over the epochs. It might start pretty high, and then our loss decreases, which is really good. But then it starts to go up. Then we could say, okay, at epoch number, I don't know, this might be epoch 500 or something, okay, at this point we'll stop. So the important thing is that at some point more epochs might actually not be helpful, and might increase the loss again. The other thing is that it takes time to train. So very often, especially when you're playing with different architectures, and when I say architectures I mean your number of hidden layers, your hyperparameters, all that kind of stuff, you might want to do a low number of epochs just to save time. And then when you really want to go for a specific architecture, and you're maybe even going to train overnight or something, then you could increase the number of epochs. Next, the learning rate is very important, and we touched on it very, very briefly. It's essentially about our loss, if we graph it out. So let's say this is our loss, and it's gradually decreasing and then increasing, based on the parameters we're using. And then you can see that, okay, right, this is the bottom here. This is where we want our loss to be. But what if our algorithm gets stuck here? So let's zoom in on this bit here. It gets to about here, and the task is to minimize loss, right? That's what we're trying to do with our back propagation. So when we get to this point here, the program might think: well, if I go this way or if I go this way, the loss is going to increase, so this is the best we can do to minimize the loss. Whereas in fact it's just a local minimum. That's a good keyword to keep in mind. So this is a local minimum, and this is the global minimum down here, which is actually the true minimum.
So what's important is the learning rate. The learning rate is basically how much you'd update your parameters each time. If you have a low learning rate, then during backward propagation your weights are only going to change by a very, very small amount. And that's useful here: you can slowly go down, slowly go down. But then, when it gets to here, because the learning rate is so low, it will try going either way by a small step and get stuck in this local minimum. Whereas if you have a larger learning rate, which is sort of like this length, for example, it gets to here and says, okay, well, we've got this learning rate, so we could go beyond that, and that's when it starts to look a bit further afield. So what's good about having a larger learning rate is that you can avoid local minimums better. But the problem is that it might be so large that it's actually never going to be able to find the global minimum, because that's too small a space and the steps can't go into the fine detail. So a higher learning rate basically means that your weights and your biases are updated to a larger extent. A low learning rate means that each epoch the weights and biases are only updated by a small amount. So the pro of having a large learning rate is that you avoid local minimums, but it often misses the global minimum, because it's such a large learning rate that it's never going to be able to go down into it. So one good thing that some people have started to use is something called learning rate decay, which is basically where you start off with a large learning rate and then, over the epochs, that learning rate actually decreases. That's a cool idea and something to keep in mind. The next thing is batch size. That's literally just how many bits of data we pass through at once. Say we were doing the house price predictor.
How many houses, the information on them, do we pass through in one go? If you try and pass through 1,000, it'll probably be a bit too much for a computer to handle, and it might hit an out-of-memory error or something like that. So it's a relatively minor hyperparameter to keep in mind, but it is useful nonetheless to keep a relatively low batch size, but large enough that you keep things running efficiently. Next up we have the activation function. So far we've just talked about using sigmoid, this thing here, which introduces our nonlinearity after we do the linear regression. There are a large number of other activation functions, which we'll be going into when we actually do practical projects, so don't worry too much about that. I'll be introducing you to the names, and when it comes to the code, it's literally just changing one line of code. But it turns out that using different activation functions for different purposes is very important. Sigmoid is very useful if in the output you want, like, a 0 or 1. So with the Titanic, for example, you want to know: did someone survive, zero or one? That's when sigmoid could be really useful. For other ones, we have things like ReLU, leaky ReLU and tanh. Those are ones you don't need to worry about too much at the moment, because we'll be going into them in the future. The next thing is back propagation techniques. The one I mentioned is gradient descent, and that's kind of an intuitive name, right, because you're looking for where the gradient of the loss is descending. But there are other ones as well, and I'll be introducing those to you in the future. Finally, we have regularization techniques. So, for example, dropout. Dropout is a really cool idea. Essentially, if you find your model is overfitting, then dropout does something relatively basic but quite interesting, and actually very effective.
With dropout, basically, every time you do an epoch, it just randomly decides to, like, kill off some of your nodes. It literally just kills them off, removing their weights and their biases. Each time, random ones are removed, or you could say muted rather than removed. And you can change the rate of your dropout. So 50% would mean literally 50% of your nodes are muted each time, so that's quite a lot to add. However, it does work quite well. There are other regularization techniques, which we'll be talking about in the future, but those are the main hyperparameters. So make a note of each of these, and just try to remember some of the pros and cons of, for example, having a high number of hidden layers, high learning rates, low learning rates, stuff like that, just to become used to it. As you do more and more projects and you have to play with these more and more, you'll find it becomes quite intuitive. So have a go, just remind yourself of these and the pros and cons, and I'll see you in the next lecture.
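Learning rate decay, as described in this lecture, starts with a large learning rate and shrinks it over the epochs. Here is a small sketch of one common way to do that, sometimes called time-based decay. The formula, the function names and the toy loss `(w - 3) ** 2` are assumptions chosen for illustration; the lecture does not say which decay schedule it has in mind.

```python
def decayed_learning_rate(initial_rate, decay, epoch):
    """Time-based decay: the rate shrinks as the epoch number grows."""
    return initial_rate / (1.0 + decay * epoch)


def train_toy_weight(epochs, initial_rate=0.4, decay=0.1):
    """Gradient descent on the toy loss (w - 3) ** 2 with a decaying rate."""
    w = 0.0
    for epoch in range(epochs):
        rate = decayed_learning_rate(initial_rate, decay, epoch)
        gradient = 2.0 * (w - 3.0)  # derivative of the toy loss
        w -= rate * gradient        # the update step: bigger rate, bigger step
    return w


print(decayed_learning_rate(0.4, 0.1, 0))   # large steps at the start
print(decayed_learning_rate(0.4, 0.1, 40))  # much smaller steps later on
print(train_toy_weight(50))                 # ends up close to 3
```

The early large steps move the weight quickly, and the later small steps let it settle into the fine detail around the minimum.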
10. L312 SettingUpForNeuralNetwork: Now, before we get into actually coding out our neural networks, we need to make sure that we're fully set up with everything that we need. So first of all, if you don't have Python already installed, you want to go ahead and install that. Go to python.org/downloads/release/python- and then whatever version you want. I'm going to be using version 3.7.4, so I recommend you use the same if you want things to work the exact same way that I'm doing it. Now, what's really important here is that, even if you've already got Python running, you need the 64-bit version. So if you already have Python, what you want to do is either go into Jupyter Notebook or just open up whatever editor you're using and run this: import platform, then platform.architecture() with square brackets zero, and that will let you know whether you're running 32-bit or 64-bit. It's really important that you have 64-bit Python, because if you don't, then essentially you won't be able to install or run TensorFlow. So what you need to do is go to your Python release and go down to the files. If you're using Windows, you could pick something like this, which is the Windows x86-64 executable installer. Otherwise, if you're on macOS, just make sure that you're using the 64-bit one. That's really important. So the first thing to do is get Python installed. Once you've got that, you can use pip to install things like TensorFlow and Keras. So if you've got Python set up and you want to use pip, which is a package installer, then you want to install TensorFlow, and then you also want to install something called Keras, and also something called Jupyter Notebook. So I'm just going to go ahead and install TensorFlow here now.
One other thing you can do if you don't want to use pip, for whatever reason: there's something called Anaconda, which is a distribution which, if you download it, installs things like Jupyter Notebook and pandas for you. Then, instead of using pip install, from then on you could use conda, so c-o-n-d-a: conda install tensorflow, or conda install keras, or anything else. So you can either use pip or you can use Anaconda. I prefer using pip myself, but a lot of people have said that Anaconda can be quite useful. So once you've got all this set up: you want to install TensorFlow this way, and we'll just exit that for now, because it's not too important. Just make sure you install it. Then you also want to do pip install for Jupyter Notebook, if you haven't already installed it, and you also want to install Keras, because that's the library we're going to be using for our deep learning. So that's super important. Once you've installed all these things, just go into Jupyter Notebook, have it running, and try typing in something like import keras and import tensorflow, and just make sure that you aren't getting any errors when you run those. If you do have errors, there are a multitude of errors it could be. But make sure you're using 64-bit Python and not 32-bit; that's super important. Otherwise, basically, you won't be able to install TensorFlow or run it. And if there are any other errors, then I suggest using Stack Overflow to take a look at what other people are saying, because there are so many errors that can come up. So now that we've talked about how to get set up on your own PC: if you're having problems with that, don't worry. It's definitely good to have the environment set up on your local computer, don't get me wrong, but a lot of the time what I do is use Google Colaboratory.
So as you can see, this guy seems a little bit too happy about Google Colaboratory, so we'll ignore him. But if you go to colab.research.google.com, essentially it's just like Jupyter notebooks. If I go to File, New Python 3 notebook, it is very, very similar to how Jupyter Notebook works. We already have Python 3 here, so I could just do a print and hit Shift+Enter to run the cell, or let's say a = 3, print(a). It works. It works just like Python. Now, what's really cool is that not only do we already have TensorFlow installed, so you don't have to worry about any of that kind of stuff, but we also have GPUs. If you go to Runtime, Change runtime type, Hardware accelerator, you can set it to GPU or TPU. Now this is really cool. Essentially, most computers have CPUs, and that's what's going to be used for doing all the mathematical calculations in the neural network. If you are using a CPU, then it's going to run relatively slowly. Maybe not in the first example we're coding, but when we create larger architectures, it will definitely take a long time. GPUs, which stands for graphics processing units, are much, much better at doing all these calculations very quickly, so they're much better for training your neural networks. If you have one on your laptop, then that's all good. But I often use this one here, because they have very fast GPUs. TPUs run even faster, depending on the architecture. I highly recommend that you try out Google Colab, because I use it all the time when I'm just playing around and trying to train my networks, and you get free GPUs, which is just absolutely fantastic. You don't need to install anything. All you need is a web browser: just type in the address, and then you have access to a GPU. So now that you're set up with everything we need, let's get started with our first neural network.
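The 64-bit check mentioned earlier in this lecture can be run as a couple of lines of Python, in a notebook cell or any editor. `platform.architecture()` is part of the standard library; the helper names `python_bitness` and `is_64_bit` are just made up for this sketch.

```python
import platform


def python_bitness():
    """Return the bit width of the running Python, e.g. '64bit' or '32bit'."""
    return platform.architecture()[0]


def is_64_bit():
    """True when the running Python interpreter is a 64-bit build."""
    return python_bitness() == "64bit"


print(python_bitness())
print("OK for TensorFlow" if is_64_bit() else "Install 64-bit Python first")
```

If the second line reports that 64-bit Python is needed, reinstall Python using the 64-bit installer before trying to install TensorFlow.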
11. L313 P ApplyNeurToTitanic: Now that we've been through the theoretical side of creating neural networks, we can really get to the exciting thing of coding out our own neural networks using Python, and we'll do it on our Titanic data set. Now, there are a number of different frameworks that we could use in Python for creating neural networks. You have the option of using something called PyTorch, or you can use Keras or TensorFlow. Those are three of the main popular ones. PyTorch is relatively easy to use, and you can code out your networks very quickly. However, it doesn't offer a huge degree of flexibility in creating neural networks compared to something like TensorFlow. With TensorFlow, you actually have to code a lot more lines in order to create a neural network, and it's slightly more complicated, but you're allowed a lot of flexibility and can customize your neural networks a lot more easily. Then there's Keras. Keras is actually built on top of TensorFlow, so it uses TensorFlow, but it's just a lot simpler to create neural networks with. You still have enough flexibility for most things, so that's why we're going to be using Keras. We're going to be using the Titanic data set. Hopefully you remember this from when we were doing data science and machine learning, where we were using information on passengers of the Titanic to predict whether they survived or not. The first thing we're going to do is make sure that we have the data in the right place. I'm just going to upload train.csv for the Titanic. We're only going to be using train.csv for now; that's fine. I'm just going to run all the cells so we prepare our data like we did in our machine learning. So once we get down to here, essentially, we're doing the train test split, splitting our data into validation and train. This is all the stuff that we did in our machine learning, and there we were just running it through a random forest.
So we're getting 83% accuracy with our random forest. Now that we've done that, I'd like to compare it to creating a neural network using Keras. Now, because the data isn't that complex, we don't have loads of data, and the patterns aren't going to be all that complex, I'm not sure if a neural network will significantly outperform a random forest, but this is a very good data set just for us to get started with. Then we can look into applying neural networks in different ways where they're really most effective. So for now, we're just going to use Keras to create a neural network for doing predictions with the Titanic data. The first thing I want to do is import Keras. So from keras.models I want to import something called Sequential. Sequential, which I'll show you in a second, you can think of as a way of just creating a placeholder, let's say, for a neural network. Then we're going to import a few layers, so from keras.layers import two things: first Dense and then Dropout. Dense essentially just refers to one hidden layer. Keras refers to it as a dense layer; some people refer to it as a fully connected layer, and some people call it a hidden layer. So we're importing that, and Dropout. You might remember I mentioned this in a recent lecture. It's a regularization technique which basically mutes some nodes during each forward propagation. It chooses them randomly, and depending on what proportion you tell it to mute, it mutes that many, and that helps to avoid overfitting. That's why we're importing that. So those are the things we need to import for now. Now we want to create our model. I'm just going to store this in a variable called model, so I'm going to type model equals Sequential.
So in this line, all I'm basically saying is: create a placeholder for a neural network. And now you're going to be blown away by just how easy it is for us to create a neural network. The first thing I want to do is create our first layer, so I'm going to call `model.add` and add a dense layer, which is a fully connected layer, so one of these here. Then I'm going to say how many nodes I want in there. I'm just going to say 32, because that's a nice number: not too big, not too small. So that's the number of nodes. Then I'm going to add in my activation function, which is ReLU. The activation function we've been looking at up until now was sigmoid. Let's just get up an image of sigmoid and of ReLU so you can see what I mean. ReLU is different from the sigmoid, which I'll show you in another tab, because, as you can see, the sigmoid is rounded off on both sides, and there's an issue that can arise from that. Let's say, for example, that what we were getting out of our weighted sum, wx + b, was ending up around here. Then you have something called the vanishing gradient, which is essentially where you get stuck around this area and it just keeps outputting these kinds of numbers. That's an issue. When you use ReLU, we don't have that vanishing gradient issue. So just make a note for now that ReLU is much more often used in neural networks than sigmoid, especially when you don't need an output between zero and one, because it doesn't have a vanishing gradient: it gives you a number that doesn't curve off like the sigmoid does. So we're going to have an activation of ReLU, and then we specify the input shape, `input_shape`.
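The difference between the two activations can be made concrete by looking at their slopes. The sketch below is not from the lecture: it uses plain Python and the standard formulas for sigmoid and ReLU to show that the sigmoid's gradient shrinks toward zero as wx + b gets large, while ReLU's gradient stays at 1 for positive inputs. The function names are chosen here purely for illustration.

```python
import math


def sigmoid(z):
    """Squash z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Slope of the sigmoid at z: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu(z):
    """Pass positive values through, clamp negatives to zero."""
    return max(0.0, z)


def relu_grad(z):
    """Slope of ReLU at z: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


# Compare the slopes as the weighted sum z = wx + b grows.
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:>4}: sigmoid slope={sigmoid_grad(z):.6f}, relu slope={relu_grad(z):.1f}")
```

ReLU's slope is zero for negative inputs, so it avoids the plateau only on the positive side; that detail goes beyond what the lecture covers.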
And this is where we need to put in the number of our different inputs, so basically the number of independent variables. If we just scroll up, that's going to be the class, the sex, the age, the fare, the embarked, the title, that kind of stuff. So I believe we have eight, and then you just put a comma and leave the rest blank. And so now we've created our first hidden layer. The input layer is already created: our input layer is basically what we put here, our input shape, and then we specify later what we actually want to go in there. So that's our first fully connected layer, or dense layer, my first hidden layer. What's good to do next is add in a dropout layer. Actually, for now, let's just add in some more fully connected layers, because we want to have at least a few layers. So we're going to add another one, and I'm going to have it pretty much the same, so I'm just going to copy and paste this. We have the same activation again, and we also want 32 nodes in this one. Then let's add one more before we have our output, and for this one let's reduce the number, because if we have 32 every time, that's going to be a lot of computation. So I've added one more fully connected layer, and then finally, and this is really important, I'm going to add the output layer. This is where your output layer should be. For different neural networks you're going to want a different number here. For us, we literally just want one number, either zero or one, which is basically not survived or survived. So we only want one node here. We're going to do one more `model.add`, again with `Dense`, but we're just going to have one node. And now we want the output to be between zero and one.
Because we want the output to be between zero and one, we're going to use our sigmoid. So the activation, finally, is going to be sigmoid, and that's literally all we have to do. That is a neural network in five lines of code, which is absolutely incredible. So just to review what we've done: we created our placeholder by just using whatever variable name you want. We could have changed that to camel or banana, but `model` is kind of a good name. That equals `Sequential()`, which is like a placeholder, and then we just used `model.add`. We put in some fully connected layers, which are these hidden layers here. In the first two we had 32 nodes each, so that's the number of these things we have in the layer. Then we had an activation: we're using ReLU instead of sigmoid here to avoid the vanishing gradient, which we talked about very briefly. We don't want to get lost over here where the curve plateaus, and with ReLU you don't have that problem, so we use ReLU there. And then here we have our input shape. Essentially what we have to put in here is the number of independent variables we have, and then a comma afterwards. If we were, for example, doing images rather than something like this, it would be different, and we'll go over that in the future, but don't worry about that for now. We then added two more fully connected layers, or two hidden layers, using `Dense`, and we did the same thing, except I'm using fewer nodes here at the end because I don't want it to be too computationally intensive. And then finally, this is our output layer, where I just wanted one number, and I want it to be between zero and one, because we want to have a zero or one output. So we're using the sigmoid activation. Next, what we need to do is something called compiling the model.
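Put together, the network reviewed above might look like the following sketch. It assumes Keras is installed and importable as `keras`. The 16 nodes in the third hidden layer are an assumption, since the lecture only says that layer is smaller than 32. Newer Keras versions may warn that an explicit `Input` layer is preferred over the `input_shape` argument.

```python
from keras.models import Sequential
from keras.layers import Dense

# Placeholder for the network; layers are added one at a time.
model = Sequential()

# First hidden layer: 32 nodes, ReLU, fed by the 8 independent variables.
model.add(Dense(32, activation="relu", input_shape=(8,)))
# Second hidden layer: same size and activation.
model.add(Dense(32, activation="relu"))
# Third hidden layer: fewer nodes (16 is an assumed value).
model.add(Dense(16, activation="relu"))
# Output layer: one node with sigmoid, so the output lies between 0 and 1.
model.add(Dense(1, activation="sigmoid"))

# Print the layers and how many weights and biases each one holds.
model.summary()
```

The trailing comma in `(8,)` is what the lecture means by "a comma afterwards": it makes the input shape a one-element tuple.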
When we're compiling the model, all we're saying is basically: what do we want to use for backpropagation, what technique do we want to use, like gradient descent or something like that; how do we want to calculate the loss; and what metrics do we want reported back. So it's just adding in a few more details. I'm going to call `model.compile`, and then in brackets, first I'm going to say what the optimizer is. You can think of that as what you want to use for backpropagation. We're going to use something called Adam. I'll include resources on what Adam is and how it can be useful, but for now we just type it in. Then we want to include our loss, which is binary cross-entropy. Before, we've been looking at mean squared error, which was a really good introduction to how we can calculate the loss. Binary cross-entropy is kind of an evolution of that, let's say, and again I'll include some resources on it. And then for metrics, we just want to look at accuracy. As well as looking at the loss, it's also really good to have a percentage accuracy: how accurate are we? Once we've compiled that, we want to train the model. So in here, let's put `model_train` equals, and then, just like we did with our machine learning, `model.fit`. In here we need to put our X and our y. What Keras does, which is really helpful, is that we don't actually need to do something like this, where we split into training and test separately; it does that in this line of code. So I'm just trying to find what we called them: predictors and targets. So we're going to put those in: predictors is going to be our X and target is our y. Actually, I'm just going to change that here, so I'm going to call this `X` and call this `y`. That's a lot more straightforward. X and y. Where were we?
So I'm going to put in literally `X` and `y`. X is all of our independent variables: the age, the sex, the fare, all that kind of stuff. Then y is whether they survived or not. Just make sure you understand that we're ignoring this part here where we did the train/test split, because Keras actually does that for us, really helpfully. So we put in our X and our y. Next, what we want to say is how many epochs we want to run for. Hopefully you remember that an epoch is when you do forward propagation and then backpropagation. Forward propagation is when you run through the network this way, and that's where you're basically testing out your weights and your biases: you're running through your labeled data and seeing how close your predictions are to the real values. Backpropagation is updating the weights and biases to minimize that loss. So going once forward and once back over the training data is an epoch. Let's go for something quite small for now; let's go for 100. And then our batch size: with the Titanic, each data point is a person, so that's basically how many people you want to put through forward propagation each time. Let's split that up into 50 people per go, because if you put everything in at once, you might get an out-of-memory error. Then `verbose`: that's basically how much information we want output, how much we want our model to tell us about the training while it's going. If we just put zero, it stays quiet. And finally, `validation_split`. This is going to do for us what we were doing ourselves here with the train/test split: what fraction of our data do we want to put into validation? I'm going to go for 10%, so 0.1.
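To see how epochs, batch size, and the validation split interact, here is some back-of-the-envelope arithmetic in plain Python. It assumes a dataset of 800 passengers (a round number, not the actual row count of train.csv) and a 10% validation split; the variable names are illustrative only.

```python
import math

n_samples = 800          # assumed dataset size
validation_split = 0.1   # fraction held out for validation
batch_size = 50          # people per forward/backward pass
epochs = 100             # full passes over the training data

# Rows held out for validation, and rows left for training.
n_validation = round(n_samples * validation_split)
n_train = n_samples - n_validation

# Each epoch walks through the training rows in batches of 50.
batches_per_epoch = math.ceil(n_train / batch_size)

# Every batch triggers one weight update via backpropagation.
total_updates = batches_per_epoch * epochs

print(n_train, n_validation, batches_per_epoch, total_updates)
```

Under these assumptions each epoch is made up of 15 batches, the last one holding the 20 rows left over after fourteen full batches of 50.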
Okay, so finally, this should now train nicely. What I'd also very much like is a way of visualizing this, so I'm going to use matplotlib. Let me just check whether I imported matplotlib at the top. No, I haven't. Okay, so I'm going to put it here for now: `import matplotlib.pyplot as plt`. As you might remember, matplotlib is a very helpful library for visualization, and I want to be able to visualize how the accuracy changes over the epochs. So I'm going to plot very quickly: `plt.plot` of the model train accuracy, with the label "train". Then we do that again, this time with the validation accuracy, so it's `val_acc`, and this one is going to be "test". The next thing we want to do is give ourselves a title: `plt.title`. You may remember doing this when we were plotting our graphs. Let's just call this "Model accuracy". Then let's put in our labels: `plt.xlabel` is going to be the epoch number, and we want our y label to be accuracy. Then let's also have a legend. Finally, let's put the legend in the bottom right, so the location equals lower right. And finally `plt.show()`. So this should be everything we need for it to run. Just to go over what we've done: we created a placeholder for our model, then we added a number of fully connected layers using ReLU as the activation, making sure we included the input shape here, which is very, very important. Then we included a sigmoid activation at the end with only one node, because that's what we want our prediction to be: basically zero or one, did they survive or not. We then compiled the model using an optimizer of Adam, which we'll be going into in more detail later, and using a loss that, instead of mean squared error, is binary cross-entropy.
And then for the metrics we're looking at accuracy. This is all syntax specific to Keras, but you'll have the scripts to play around with yourself anyway. Then I'm using `model.fit`. I called it `model` here; if I'd changed this to `banana`, it would be `banana.fit`. We're including what X and y are: X is the independent variables of the data, and y is whether they survived or not. We're setting that we want 100 epochs to run, a batch size of 50, and then our validation split: we want 10% of our data to go into validation, or test data basically. And then we've got the graphing. So let's run this and see if we get any errors. Okay, so first of all, it says a positional argument follows a keyword argument. Let's take a look at this one and see what's going on here. Okay, that's changed, so let's run it again. Now it says the History object is not subscriptable. Right, so I just need to change this here: instead of indexing `model_train` directly, we look at the history of our model, `model_train.history`, and access the accuracy and the validation accuracy from that. Now it should be running nicely. As you can see, in that very short amount of time our neural network has already trained, which shows that we haven't created some incredibly computationally expensive neural network. And we can see something really exciting here: we can see what the accuracy of our neural network is, for both train and test. The test has actually outperformed the train, which may suggest that we're not overfitting our data, which is fantastic. It looks like we're around the 80 mark, which is really good. So that's the basics of creating a neural network for our Titanic data.
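The compile, fit, and history steps described above can be sketched end to end. This is a hedged example, not the lecture's notebook: it trains on random stand-in data instead of the Titanic columns, runs only 3 epochs to stay fast, and uses a hypothetical 16-node third layer. Depending on the Keras version, the accuracy is stored under `acc` or `accuracy`, so the sketch checks for both.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Random stand-in data: 200 rows, 8 features, binary target (all assumed).
rng = np.random.default_rng(0)
X = rng.random((200, 8))
y = rng.integers(0, 2, size=200)

model = Sequential()
model.add(Dense(32, activation="relu", input_shape=(8,)))
model.add(Dense(32, activation="relu"))
model.add(Dense(16, activation="relu"))  # 16 is an assumed size
model.add(Dense(1, activation="sigmoid"))

# Optimizer, loss, and the metric to report back.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# fit() returns a History object; the per-epoch numbers live in .history.
model_train = model.fit(X, y, epochs=3, batch_size=50, verbose=0,
                        validation_split=0.1)

history = model_train.history
acc_key = "accuracy" if "accuracy" in history else "acc"
train_acc = history[acc_key]
val_acc = history["val_" + acc_key]
print(train_acc, val_acc)
```

The two lists, one value per epoch, are what would be handed to `plt.plot` for the "train" and "test" curves. Indexing `model_train` directly instead of `model_train.history` is what raises the "not subscriptable" error mentioned above.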
So in the next lecture we'll be looking at how we can get slightly better accuracy, maybe by increasing the complexity and adding some dropout. We'll play with that in the next lecture.
12. L314 P OptimizingTitanic: In the previous lecture we were able to code out our first neural network, which is very exciting, and we were even able to see the accuracy of our model. Now I'd like to see the actual number for accuracy, so I'm going to change the verbosity: `verbose` here to one. Then when we train it, we can actually see what the accuracy is across the epochs. So this is literally every epoch reporting back what the accuracy is. As you can see, this time it actually dips just near the end, but in general the accuracy for our train is around 75%, and validation is the same, around the 75, 76, towards 80 mark. So that's not too bad, but I think we can do better. So let's scroll up to the top again and make a few changes. If we think back to what some of our hyperparameters are: pause the video now and have a think about what we could change here in the hyperparameters to try to improve our model. So pause the video here and have a quick think. I'm just going to have a few goes at improving this myself. The first thing we're going to do is add a few more hidden layers. I'm going to add two more, why not, just for fun, and I'm going to make these slightly larger. I'm going to make this one twice as large, so 64; I'm going to make this one 128, this one 64, this one 32. These are kind of standard numbers to use because they divide by two nicely, which is always handy: you go down through 32, 16, 8, 4, 2, 1. They're just nice numbers to use. But I'm keeping one at the end. So what I've done is add extra hidden layers and increase the number of nodes in those hidden layers. I'm going to stick with ReLU, because that's quite a useful activation function, and keep sigmoid at the end. Next, I'm increasing the number of epochs.
I'm going to go for 500, and then I'm going to change the validation split. I'm going to use less data in the validation, so let's go for 6%, so I'm putting 0.06. Now we actually have more data that we're training on. I'm going to put verbose back to zero, just because I don't want to see these numbers 500 times. This will take a little bit longer to train this time, obviously, because we've included more layers and more nodes and we've greatly increased the number of epochs. So just to review what we've gone over for this optimization: we've been changing some of the hyperparameters. What you'll often see in a script is that all the hyperparameters are put at the top. You might put `epochs = 500` at the top here, and then in the actual code you just refer to `epochs` and things like that, because that way you can scroll straight up to the top of your script and make the changes there, rather than having to look through the code. If we take a look now at the results, it looks like we're pushing way above 90. So I'm going to change verbose back to one and train again, because I want to see the actual accuracy number. It looks like we've got over 90, which is quite a significant jump for just these small changes. You can see that we're going through the epochs now, almost halfway. So it seems like it's very much helped to increase the complexity of our model. One other thing we could try, if we found that it may have been overfitting, is to start adding some dropout layers. That's definitely one option. It should be finished by now, so let's take a look and scroll down.
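The habit of keeping hyperparameters at the top of the script could look like the sketch below. The constant names and the `fit_settings` helper are made up for illustration; the values for epochs and validation split are the ones mentioned in this lecture, and the batch size of 50 is carried over from the previous one on the assumption that it was left unchanged.

```python
# Hyperparameters gathered at the top of the script, so they can be
# changed in one place without hunting through the code below.
EPOCHS = 500
BATCH_SIZE = 50          # assumed unchanged from the previous lecture
VALIDATION_SPLIT = 0.06
VERBOSE = 0


def fit_settings():
    """Return the keyword arguments that would be passed to model.fit."""
    return {
        "epochs": EPOCHS,
        "batch_size": BATCH_SIZE,
        "validation_split": VALIDATION_SPLIT,
        "verbose": VERBOSE,
    }


settings = fit_settings()
print(settings)

# Further down the script (needs a compiled Keras model and the data):
# model_train = model.fit(X, y, **settings)
```

Switching verbosity on for one run then means editing a single line at the top rather than the call itself.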
So now, as you can see, our training accuracy is at 90%, which is fantastic. But the problem is that our validation accuracy is much lower than that. So one final thing I want to do in this lecture is add our regularization technique called dropout. We are obviously slightly overfitting here, because our training accuracy is high but the validation accuracy is a lot lower. Or we could say the loss is low with the training data and the loss is high with the validation data. One thing we could do is reduce the complexity of the model; that's definitely a good idea. But one thing that's important: don't make all your changes at once, because then you don't know what's working and what's not. So I'm going to make one change first, and that's to add in dropout between the dense layers. Dropout is the regularization technique we were talking about before, where on each forward pass it randomly mutes out some of the nodes in the hidden layers. I'm going to select 20% to mute out each time. People can go up as high as 50 or 60% and say it still works, so I'm just going to go for something a little more conservative. The important thing to remember for dropout is that you add it in between your layers, and it's a regularization technique that's used to reduce overfitting. I'm going to add this in between every fully connected layer. Hopefully that will at least reduce the difference between the training accuracy and the validation accuracy. While this is running: the next thing I can try, if that's not working too well, is reducing the complexity of the neural network, because one of the biggest things that can cause overfitting is having either too many layers or too many nodes within those layers.
So what I might consider is reducing this one to 64 and, if that still doesn't help, removing that middle one altogether. But for now, let's just have a look and see if that's worked. So here we are. We can now see from our graph that the train and test accuracies are a lot more similar. The validation accuracy is up to 85, and the accuracy on the train data is up to 86. So what we've found here is that the training accuracy has dropped somewhat, but the validation accuracy has gone up, which is super important. What we could do now is play around with the dropout rate, maybe decreasing it to 0.1, and see if that improves our training accuracy again. But these are just some of the hyperparameters that you can use to improve your neural network. So I suggest that you have a go at this yourself, play around with hyperparameter tuning, and then I'll see you in the next lecture.
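What a dropout layer does to a layer's outputs during training can be sketched in a few lines of NumPy. This is an illustration of the idea rather than Keras's own code: the `dropout` function and the all-ones activations are made up for the example. It mutes roughly 20% of the values at random and scales the survivors by 1 / (1 - rate), which is also how Keras keeps the expected total unchanged.

```python
import numpy as np


def dropout(activations, rate, rng):
    """Randomly zero a fraction `rate` of values and rescale the rest."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    # True where a node is kept, False where it is muted.
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)


rng = np.random.default_rng(0)

# Stand-in outputs of a 32-node hidden layer for 1000 passengers.
hidden = np.ones((1000, 32))

dropped = dropout(hidden, rate=0.2, rng=rng)
muted_fraction = float(np.mean(dropped == 0.0))
print(f"fraction muted: {muted_fraction:.3f}")
```

In the Keras model itself the same effect comes from placing `model.add(Dropout(0.2))` between the dense layers; lowering the rate to 0.1, as suggested above, mutes fewer nodes per pass.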
13. L316 P NewPredsSaveModel: So finally, with the neural network for the Titanic dataset, I just want to make sure that we've gone over everything you need in order to get this set up on Django and get it live, just like we did for the Titanic dataset when we were using our random forest. It's actually incredibly easy for us to do the same thing. First of all, we want to save our model. We don't have to use pickle for this one. When using Keras, we can just do `model`, which is what I've called my model up here (whatever you called it when you did the `Sequential()` up here, you use that name), then `.save`, and then in brackets you give it the file name. Now, instead of the `.sav` extension that we were using before, we're using something called `.h5`. I'm going to call this `titanic_nn`, where the nn stands for neural network. That's literally all you have to do. So once you've trained your model, you do `model.save`, or whatever name you've called your model up here, and then give it a file name, making sure the file extension is `.h5`. I've just put in a few epochs for now, to keep it short and to show you that it does work. Everything's fine; that's saved just fine. The next thing I want to do is show you how you can load your model and also how you can make a single prediction. First of all, you need to import a few things. I'm going to import NumPy, which is very important for scientific computing. What we use it for a lot when we're playing around with neural networks is putting data into the correct format, which is a NumPy array. That's why I'm importing it here. Then, from `keras.models`, import something called `load_model`. When you're saving a model, you don't actually need to import anything, but to load a model, you do need to import this `load_model` function.
So then finally, we just want to load up our model. I could just load it as `model`, but I want to make sure I'm not getting confused with the one up here. Let's say I want to run a cell later and do `model.fit` or something like that; I don't want to get confused, so I'm not going to save it to the same variable called `model`. I'm going to call this one `model_predict`. To put it into that variable, all we need to do is run this function, `load_model`, and then put in the model's file name in quotes, `titanic_nn`, making sure to use the `.h5` file extension at the end. Let's just make sure that works before we go any further. Okay, so that's working just fine. Now, in order to make a prediction, I know that we need to put in eight parameters, so let's have a look at the order of them. Here's the list, so let's do this properly and make sure we have all the correct ones. I was just going to put in completely random numbers, but that would be a little bit silly. So the Pclass, I know we have 1, 2, 3. Sex is one or zero. Age, let's go for 30. Number of siblings or spouses, 2. Number of parents or children, let's go with 2 again. Fare, let's go for 50. Embarked, that's one of three options. And then the title. Okay, so what we were doing before was getting the data into this format. Now, when we're making predictions, we need to cast that into a NumPy array. I'm going to call this `x_example`, and to cast it into a NumPy array I'm going to use `np.array`, then put this whole thing in brackets, and that will put it into the correct format. So you can imagine that if we're on our Django website, this is what we would need to do.
So we take the different fields we have in our form, and then we can just put them into this NumPy array with two square brackets, and that should be just fine. I've called this `x_example`, so let's print it out and see what we get: `print(x_example)`. Of course, this might take a bit of time, because essentially it's loading the model and then creating the NumPy array and printing it out. Oh, this is embarrassing: I haven't actually run the prediction yet, have I? I've literally just created an array. So let's get Keras to run an actual prediction this time. What we have to do in order to create the prediction is store it in a variable called `prediction`. Then we take the name of the model, which is `model_predict`, and we just need to add `.predict`, and then put in what we want to predict on, which is our example. Let me just change this to print the prediction rather than the array. So essentially, because we've loaded our model through Keras, all we have to do is put in `.predict` and then the data we want to predict with. Let's go with this. So now this is going to make the prediction: it's going to run through the neural net, do one pass of forward propagation with our data point here, and come up with a prediction. So this is about 0.5, which essentially means about 50%. With this prediction, it's almost 50/50 as to whether it believes that someone survived or not. Let's just talk through this. When we look at the architecture, what we have in our output layer is just one node that has a sigmoid activation function. As we know, sigmoid outputs a number between zero and one, and that's going to be our prediction. So zero would be that they didn't survive.
One would be that they survived. Because the output is between zero and one, it's providing us with a probability: it's saying there's a 50.5% chance that the person survived, so it's slightly more likely than not that this person would have survived. This is a good time to do the next step. We don't want this number to appear on, say, a website, because the person looking at it would think, well, that's just some weird number. What I'd do is use an if statement. We can say: if the prediction is less than 0.5, then the prediction is going to be equal to "Not survived". And if the prediction is more than 0.5... well, I guess we can just use else, because it's either going to be less than 0.5 or it's not. So else, prediction equals "Survived". And I should store the raw number as `prediction_num`, so we can print both things: we can see the probability, and we can see the tag as well, whether it's "Not survived" or "Survived". Let's try this again. The probability may vary ever so slightly each time. So this time, hopefully what's going to come out is not only the probability but also the tag. Okay, my mistake: I need to put `prediction_num` here as well. Then it should be fine, and underneath here it prints the prediction. So this is going to output, first of all, our number, which is the probability of survival. If it's less than 0.5, it's going to say "Not survived"; otherwise "Survived". I'm just full of silly mistakes today: I hadn't put that in as a string. Let's do this one more time. So it's going to output the probability, but also the tag, whether they survived or not, and then we'll just try one more example after that.
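The if/else step described above can be sketched without Keras. The nested list below is a stand-in for what `.predict` returns for a single passenger (one row, one column), and `label_from_probability` is a helper name made up for this example.

```python
def label_from_probability(probability, threshold=0.5):
    """Turn the sigmoid output into a human-readable tag."""
    if probability < threshold:
        return "Not survived"
    return "Survived"


# Stand-in for the model output: one row with one probability in it.
prediction_output = [[0.505]]

# Keep the raw number and the tag separately so both can be printed.
prediction_num = float(prediction_output[0][0])
prediction = label_from_probability(prediction_num)

print(prediction_num)
print(prediction)
```

A probability of exactly 0.5 falls into the else branch, so it is tagged "Survived"; whether that is the right call for a real application is a design choice the lecture does not go into.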
So it's just over 50%, like we had before, so the person is likely to have survived. Let's change this up a bit. Let's say the person is 20 and female and paid 10 on the fare, so we're putting in a completely different example. Now let's run that through and see what our model thinks of this one, whether they would have survived or not. So it's less than 0.5, so it's less likely the person survived, and it's gone for the label "Not survived". Excellent. Let's just go over what we've done in this lecture. We've been able to create a neural network, but I really wanted to make sure that you have the tools you need in case you want to apply this to a website, just like we did with our random forest. It's pretty much the exact same thing here. We're saving our model, and to do that we just take the name of our model (I called mine `model`, or whatever you called it up here with `Sequential`), then `.save`, and then you put in whatever file name you want, making sure you have the file extension `.h5`. Then here we were actually making a prediction. We're loading the model first of all by using the `load_model` function from `keras.models`, and then we're creating an example. Our example, of course, has eight data points, because that's the number of data points the model needs: age, sex, fare and all that kind of stuff. Then we cast that into a NumPy array. This syntax is really important to remember: you have your different variables within two square brackets, and then you're casting that with `np.array`. After that, we're doing predict. So in order to create a prediction, all we need to do is take our model, however we've loaded it, add `.predict`, and enter the information. Once we've got that, that's essentially all we really need.
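The save, load, and predict round trip reviewed above can be sketched as follows. This is an illustration under stated assumptions, not the lecture's notebook: the network is a small untrained stand-in, so the probability it prints is arbitrary; the file goes to a temporary folder; and of the eight example values only age 30, 2 siblings/spouses, 2 parents/children, and fare 50 come from the lecture, while the other four are made up.

```python
import os
import tempfile

import numpy as np
from keras.models import Sequential, load_model
from keras.layers import Dense

# Small untrained stand-in network with 8 inputs and 1 sigmoid output.
model = Sequential()
model.add(Dense(32, activation="relu", input_shape=(8,)))
model.add(Dense(1, activation="sigmoid"))

# Save to an .h5 file (written to a temporary folder here).
path = os.path.join(tempfile.mkdtemp(), "titanic_nn.h5")
model.save(path)

# Load it back under a different variable name to avoid confusion.
model_predict = load_model(path)

# One passenger, eight values, inside two square brackets: shape (1, 8).
# Assumed column order: Pclass, Sex, Age, SibSp, Parch, Fare, Embarked, Title.
x_example = np.array([[3, 1, 30, 2, 2, 50, 1, 1]], dtype="float32")

# One pass of forward propagation; the result has one row and one column.
prediction = model_predict.predict(x_example, verbose=0)
print(prediction.shape, prediction)
```

Recent Keras releases may print a notice that `.h5` is a legacy format; the lecture's approach still works, and the notice can be ignored for this exercise.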
But if we want to turn that into a readable prediction, it's relatively easy. We can just say: if it's less than 0.5, we output "Not survived"; else, it must be 0.5 or above, so they would have survived. One thing I really want to touch on before we finish this lecture: if I've put in the wrong number of values, then it's not going to work. Keras is going to complain about the shape. I've found, and a lot of my peers have found, that one of the most frustrating things when you're learning about neural networks is getting these errors about shapes, and it gets so confusing. So it's important to be able to deal with that and always make sure you're putting in the correct shape. If you were to not pass this in as a NumPy array, it's also going to complain about the shape. So you just have to be aware when you're doing this kind of thing that shape errors are going to come up all the time. Here, for example, it says it expected shape 8 but got an array with shape 1. When I first started, all I did was put the exact error into a search engine, adding "Keras" as well, and usually you find people who have had the same issue and they can help you. But it's also really helpful to check that the shapes are correct, and you can do that by using `.shape`, so `x_example.shape`. You can always use `.shape` to help you out and check that the shapes are all correct. So that's it for this lecture. Hopefully that's of use to you. Have a go yourself, and then I'll see you in the next lecture.
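A quick way to catch shape problems before they reach the model is to inspect `.shape` on the array. The sketch below uses NumPy only; the example values and the `flat` and `fixed` names are illustrative. It shows the difference between one pair of square brackets and two, and one way to repair the former.

```python
import numpy as np

# Two square brackets: one row of eight values, which is what a model
# expecting eight inputs per passenger needs.
x_example = np.array([[3, 1, 30, 2, 2, 50, 1, 1]])
print(x_example.shape)

# One pair of square brackets: eight values but no row dimension.
flat = np.array([3, 1, 30, 2, 2, 50, 1, 1])
print(flat.shape)

# reshape(1, -1) adds the missing row dimension; -1 lets NumPy work out
# the number of columns from the data.
fixed = flat.reshape(1, -1)
print(fixed.shape)

# A simple guard to run before calling .predict.
n_features = 8
shape_ok = fixed.ndim == 2 and fixed.shape[1] == n_features
print(shape_ok)
```

Checking the shape this way is usually faster than decoding the framework's error message after the fact.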
14. L317 DataShapes: In this section I want to discuss something very important, and that's the shapes of data. It's very important, when you get into machine learning and deep learning, that you become confident in how to deal with the shape of data, the dimensionality, how to work it out yourself and how to work it out using code as well. We're going to start off by talking about how a shape will appear when you're using Python. It's going to appear like this: (n, x_i). n stands for the number of data points we have. I'm not going to write out each x_i, that would take ages, but essentially you can think of x_i as the number of independent variables that we're working with. So let's look at the Titanic survival predictor as a good start. As you can see, what we have here is our independent variables. These are all the things that are being passed in as X: the passenger class, the sex, the age, these things. So how many do we have? One, two, three, four, five, six, seven, eight. And let's say in our dataset we have 800 different people. So if we look at what the shape of the data is going to be: it's 800, because that's the number of data points we have, and then the number of independent variables is eight. So the data shape is going to be 800 by 8, and this is almost always how you're going to see it, so just keep that in mind. Let's do one together now and try to work it out. Now we have a house price predictor, and these are the things we're going to use to help make the prediction: the age of the house, the number of rooms, whether there's a swimming pool present, the number of windows, the distance from a school, and the energy efficiency score. That's been misspelt; let me just change that. Perfect. We also have 300 different data points, so different houses in our labeled data. So pause the video.
Now, try to work out what the data shape is going to be yourself. Okay, I hope you had a go at that. It's going to be n, the number of data points we have, by the number of independent variables we have. We have 300 data points, and the number of independent variables is 1, 2, 3, 4, 5, 6. So the shape of our data is going to be 300 by 6. Now let's look at something a bit more tricky, but really important for us at this point: the shapes of images. Let's say we have 300 images, or even better, let's say we've got a great dataset and we have 500 images. So at least we know that n is going to be 500. Now this is going to be a bit tricky, because we don't just have independent variables like before. We don't have the age of the picture or something like that. Let's say this is an image classifier for dogs (this one was drawn expertly by me). Literally all we're going to be passing into our neural network is what it can see. So basically a computer program will break this down into pixel values: you have a pixel value for every single pixel on here, going across and down. Because it's black and white, we essentially just have those two directions to deal with. This coordinate here, for example, would be (0, 0). Now let's look at the dimensionality. This is going to be incredibly small, but just to make it easy, let's say this is 20 pixels in length and 20 pixels in height. A pixel value has to be between 0 and 255; that's the largest pixel value you can have. So each of these values could be between 0 and 255.
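The two tabular shapes worked out above can be checked in code. A minimal sketch, assuming NumPy and using zero-filled placeholder arrays in place of the real Titanic and house-price data:

```python
import numpy as np

# Placeholder data: zeros stand in for the real feature values.
titanic = np.zeros((800, 8))   # 800 passengers, 8 independent variables
houses = np.zeros((300, 6))    # 300 houses, 6 independent variables

for name, data in [("titanic", titanic), ("houses", houses)]:
    n, n_features = data.shape  # (number of data points, number of variables)
    print(f"{name}: n={n}, features={n_features}, shape={data.shape}")
```

The first entry of `.shape` is always the number of data points here, and the second is the number of independent variables.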
Now how many pixels are there overall? There are going to be 20 times 20, right? That's how we work out how many pixels are in there. The actual shape of this data to begin with, however, still starts with the number of images. You don't need to worry too much about the pixel value stuff at the moment; I was just introducing that. But if we look at the dimensionality of the data, it's no longer just two dimensions. With the house data it's very simple: we have two dimensions, one being the number of houses and one being the number of independent variables. Now we've actually got three. We have the number of images, then what we'll call the length, and then the height. The length is 20, and let's change the height to 30 just so that we can differentiate between them. In order to run this through a neural network, we need to get it down to the two-dimensional kind of shape we had before. So all we have to do is something called flattening. What we're going to do is take all the pixel values and line them up in one big row, like this: we take the pixel value at the top left, then the one to its right, then the one to the right of that, and so on. Then we go down a row, take that value, and again the one to its right. So we literally have our image laid out in just one dimension. The first number is still 500, the number of images, and the other one is going to be 20 times 30, which is 600. So what we've done here is take the dimensionality of an image, which is actually 2D because we have both width and height, keep n as it is, and flatten the width and height down into one value. Now it gets even trickier if we have colour. Let's say we add a red background.
By the way, you can buy these prints online. This is just one of my better works of art, so it did sell quite quickly. I'm totally joking about this being for sale, by the way. So let's say we have this colour image now. We still have the number of images, the width and the height. But now we also have something called RGB, which we can refer to as channels. You might remember this from our CSS days, when we were looking at RGB values, and it's basically the same thing: these numbers define the colour. So now that we've got colour, we're actually looking at four dimensions. We've got the number of images, then the width, then the height, and then three more values per pixel, because we have R, G and B. We have a value from 0 to 255 for the R, for the G and for the B, and that's how we define colour in general: every colour has a red, a green and a blue value. So with colour images we're dealing with four dimensions. But we do the exact same thing. We still have 500, and then we multiply the width, the height and the three RGB values: 20 times 30 times 3 is 1,800. So the shape of the data is going to be 500 by 1,800. Hopefully this has been relatively straightforward for you, as we've stepped up slowly. We started with standard data, where we have the number of examples in our data and then the number of independent variables. We did that with the Titanic survival predictor, and then the same with the house price predictor, which you were hopefully able to do by yourself. Then we moved on to black and white images, and we saw that our independent variables were actually the pixel values for every single pixel in an image.
Obviously, images are usually much bigger than 20 by 30. We saw that the first dimension is still 500, the number of images, and after that we have the width and the height. Then, in order to get those into just one dimension, we stack them, or what we call flatten them, into one long row, where all the pixels are lined up one after another. Then we looked at colour images, where we also have to add our RGB values. But we can still do the same thing and flatten it down into one number like this. So that's just to prepare you for when we begin image classification, as well as to help you appreciate the shapes of standard data. I hope that makes sense to you, and I'll see you in the next lecture.
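The flattening described in this lecture can be sketched with NumPy's `reshape`. This is a sketch under assumptions: the arrays are zero-filled placeholders, and the axis order (images, width, height, channels) follows the order used in the lecture, whereas many image libraries store height before width.

```python
import numpy as np

n_images, width, height = 500, 20, 30

# Grayscale: one value (0-255) per pixel.
gray = np.zeros((n_images, width, height), dtype=np.uint8)
gray_flat = gray.reshape(n_images, width * height)
print(gray.shape, "->", gray_flat.shape)      # (500, 20, 30) -> (500, 600)

# Colour: three values (R, G, B) per pixel.
colour = np.zeros((n_images, width, height, 3), dtype=np.uint8)
colour_flat = colour.reshape(n_images, -1)    # -1 lets NumPy work out 20*30*3
print(colour.shape, "->", colour_flat.shape)  # (500, 20, 30, 3) -> (500, 1800)
```

In both cases the first dimension (the number of images) is left alone, and everything after it is collapsed into a single number of independent variables.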
15. L318 P HandwrittenDigits: We're now going to look at how we can apply neural networks to image classification problems. This is a very exciting point, and we're going to create it completely from scratch together. All I've done here is open Google Colab. It's fine if you're in Jupyter notebooks as well. You don't need to have downloaded any data or anything like that. If you're on Google Colab, just like me, go to Runtime, Change runtime type, and change the hardware accelerator to GPU so we can run this a bit faster. Okay, so first we want to import a few things that are going to make life a lot easier, and we're also going to import the dataset. First of all, from `keras.models` we're going to import `Sequential`; that's how we actually create a model in the first place. Then we want to add a few layers, so from `keras.layers.core` import `Dense`. We're also going to import `Activation` from core, because we want to use an activation function in here. Now we want to add an optimizer. This is what we talked about before, gradient descent. So from `keras.optimizers` we're going to import `SGD`, which stands for stochastic gradient descent. Stochastic is another way of saying random. Then from `keras.utils` import `np_utils`, and then we're going to import a dataset from `keras.datasets`. What's really helpful about Keras is that it has some built-in datasets we can use. They're kind of like toy datasets that we can use just to practise, to have a go at creating these things and try image classification. So in this practical we're going to be loading one of those. What I found really frustrating when I was trying to learn is that, of all the tutorials out there,
I could only find ones where they were showing you how to load datasets like this, so when you actually want to load your own data and create your own image classifier, suddenly you don't have the tools and you're not sure how to load the data in yourself. So don't worry: in this one we're going to do it the easy way and load a pre-created dataset, but in a future lecture we'll do another project where you can actually use your own data. We're also going to import `numpy` as `np`, as we always do, and then of course `matplotlib.pyplot` as `plt` for our visualizations. Now let's load in our data. I'll add a comment here, "loading in the data". We're going to have `x_train` and `y_train` (you can obviously call these whatever you like), and I also want to load in `x_test` and `y_test`. All we have to do is use `mnist`, which has a helpful function called `load_data`, so `mnist.load_data()`. Okay, let's run this and make sure it's all happy and actually able to load and import all of these before we push on. That's the thing I really like about Google Colab and Jupyter notebooks: you can run things bit by bit like this very easily to make sure they're working. As we've been talking about shape recently, let's take a look at what `x_train` looks like by printing `x_train.shape`. As you can see, the first number is the number of data points we have, so we have 60,000 in our training data. Let's see how much we have in our test data by changing this to `x_test`. So we have 60,000 data points in our train set and 10,000 in our test set. Then we have 28 by 28, and that is just the dimensions of our images: 28 pixels by 28 pixels. The reason this is such a good dataset to start with is that these are very small images, so it's not going to take ages for us
to actually run a neural network on them. We can also look at `y_train` just to be sure. We very much hope that the first number here is going to be 60,000, because we want to have the exact same number of data points. Printing `y_train.shape`, you can see it's just 60,000 followed by a comma, because all we have for each data point is one number, which is our target. So now let's move on to preparing our data. There are a few things we need to do. Essentially, we want to flatten this data, like we talked about in the previous lecture. So we want the height times the width, which is 784, and I'm going to save that into a variable called `reshaped`. So first we need to do some reshaping. I'm going to reshape `x_train` first: `x_train` is equal to `x_train.reshape`, which is a very helpful function for reshaping our data. We want to keep the first number as it is, and we want to change the rest into this number here, 784. That's going to flatten each image into one row of numbers, which lets us feed it into a neural network. Then we want to do this again for `x_test`, so I'm going to change this `x_train` to `x_test` as well. This is what you need to do whenever you're preparing images: reshape them so that you've got two dimensions, the number of images here and then your independent variables in one single number there. Next we want to change the data type. This is another frustrating thing when you're first getting used to neural networks: you need to make sure the data types are correct. We want to make sure that all of these numbers are floats.
So I'm making sure of that by writing `x_train` is equal to `x_train.astype`, which just means turning it into this type, and I'm turning it into `float32`. We talked about floats previously in Python; they're basically just decimal numbers, and that's what Keras wants. So what Keras wants, we provide. We're just converting the x numbers into `float32`. Okay, I'm going to run this to make sure it's working fine. Great, so we've got our first error. It's saying it cannot reshape an array of size 7,840,000, a big number basically, into 60,000 by 784. Let me just make sure I've got these numbers right. That's 60,000 there, and 28 times 28 I believe is 784. Let me check by printing 28 times 28: 784, and that's the one we've saved here. Well, of course: this one isn't `x_train`, this is `x_test`. If I look here, the error is on `x_test`, not on `x_train`, because `x_test` doesn't have 60,000 images; it has 10,000. So that works nicely now. The next thing I want to do is normalize the data. Normalizing the data is basically just preparing it so that it's easier for the neural network to train with. As we were talking about before with images, pixel values are between 0 and 255; those are the different values that any pixel could be. Just to make sure you understand what that means, it's basically how white or how dark the pixel is. It actually helps to train our model if we make sure that all of our numbers aren't between 0 and 255 but between 0 and 1. So what I'm going to do is something called normalizing the data.
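The reshaping and casting steps, together with the reshape error hit above, can be reproduced on small stand-in arrays. This is a hedged sketch, not the lecture's notebook: the arrays are zero-filled and scaled down (600 and 100 images instead of MNIST's 60,000 and 10,000) to keep it light, and reading the row count from `shape[0]` is a suggested way to avoid typing the wrong count, not something the lecture does.

```python
import numpy as np

# Small stand-ins for MNIST (which has 60,000 train and 10,000 test images).
x_train = np.zeros((600, 28, 28), dtype=np.uint8)
x_test = np.zeros((100, 28, 28), dtype=np.uint8)

reshaped = 28 * 28  # 784 values per flattened image

# Using the training count on the test set fails, as in the lecture.
try:
    x_test.reshape(600, reshaped)
    reshape_error = None
except ValueError as err:
    reshape_error = str(err)
print(reshape_error)

# Reading the row count from each array avoids the mismatch.
x_train = x_train.reshape(x_train.shape[0], reshaped).astype("float32")
x_test = x_test.reshape(x_test.shape[0], reshaped).astype("float32")
print(x_train.shape, x_test.shape)  # (600, 784) (100, 784)
```

The error message reports the total number of elements in the array, which is why the lecture's message showed such a large number: 10,000 images times 784 pixels.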
So I'm going to make sure that all the values in `x_train` and `x_test` are no longer between 0 and 255 but between 0 and 1. You can pause the video now and have a think about how you might do that yourself. The best way to do it is basically to divide all the numbers by 255. You might remember that previously in Python we've done things like `+=`. Say `x_train += 3`: that basically means `x_train` is equal to `x_train` plus 3, so it changes all the numbers in `x_train` to be three more. I'm doing the same thing here, but with divide, `/=`. So what I'm doing is dividing all the numbers by 255. As you can imagine, if I divide 255 by 255 I get 1, and anything between 0 and 255 divided by 255 is going to be between 0 and 1. So that's the perfect solution for us. We do the same thing for `x_test`: divide-equals 255. Let's just check that everything's looking okay by printing off the shapes again, starting with `x_train.shape`. Now, I don't completely understand what this error is referring to, but it's talking about casting here, so it's got something to do with the data type. I'm going to scroll up and look at where we set the data type, which is just here, where we're converting to floats, and see if everything's okay. I can see that I haven't changed this second line to `x_test`. Because I hadn't cast `x_test` to floats, that's what it's complaining about. So hopefully if I redo this now, it's absolutely okay. So that was the issue: I just had to make sure I was casting both of them to `float32`. Okay, so I can see that this is the shape now. Just to mention one more thing: I can index this. If I take index zero, that should give me just the number of data points, which is 60,000. Let's see. Oh, that's not right, and that's because I've done it incorrectly.
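The normalization step and the casting error described above can be reproduced on a handful of values. A minimal sketch, assuming NumPy and four made-up pixel values; the names `pixels`, `broken` and `scaled` are illustrative only.

```python
import numpy as np

# Four made-up pixel values stored the way image data usually is: uint8.
pixels = np.array([0, 64, 128, 255], dtype=np.uint8)

# In-place division on an integer array cannot store the fractional results,
# so NumPy refuses; this is the kind of "casting" complaint described above.
try:
    broken = pixels.copy()
    broken /= 255
    cast_error = None
except TypeError as err:
    cast_error = err

# Casting to float32 first makes the same in-place division work.
scaled = pixels.astype("float32")
scaled /= 255
print(scaled.min(), scaled.max())  # 0.0 1.0
```

This is why both `x_train` and `x_test` need the `astype` call before the divide-equals line.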
I wanted to index the shape, `x_train.shape[0]`. There we go. So now we've got the number of data points, and index one should give us the number of independent variables: 784, perfect. So we're all good. The next thing we want to do is convert our y values, all of our labels, into categorical form, and we do this using something called `to_categorical`. I realise I haven't actually described what MNIST is. MNIST is essentially a bunch of handwritten digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. So in this dataset we're provided with lots of images of handwritten digits between 0 and 9. I can show you that quite easily using `plt.imshow`. Let's go for `x_train[0]`, and we also put in something called `cmap`, set equal to `'gray'`, and then `plt.show()`. So that's the number five. If we put in another index, we're just grabbing examples from our `x_train` here, and that's the number one. You could keep on doing this for ages, but essentially we're given these images, which are just 28 by 28 images of handwritten digits, and we're creating an image classifier that can tell which digit from 0 to 9 each one is. So our neural net needs to be able to predict which category an image goes into, zero to nine. That's 10 different categories, because zero is one of them: counting 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 gives 10. Just so you're not confused, it's not nine. Okay, so what we're going to do now is turn those labels into categories. We're going to write `Y_train`, and yes, this is a capital Y, so it's different from the original. That equals, using `np_utils`, a function called `to_categorical`, and we pass in `y_train` and then the number of classes we want to divide into, which is 10.
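To see what this conversion produces without needing Keras installed, here is a minimal NumPy sketch of the same idea. The `one_hot` helper is a hypothetical stand-in written for illustration, not the Keras `to_categorical` function itself, and the five labels are example values.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.shape[0], num_classes), dtype="float32")
    encoded[np.arange(labels.shape[0]), labels] = 1.0  # one 1 per row
    return encoded


y_train = np.array([5, 0, 4, 1, 9])  # example digit labels
Y_train = one_hot(y_train, num_classes=10)

print(Y_train.shape)  # (5, 10)
print(Y_train[0])     # the single 1 sits at index 5
```

Each label becomes a row of ten numbers, and the position of the single 1 tells you which digit it is.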
And then we want to do the same thing with `y_test`, so I'm going to copy and paste that to do the exact same thing, changing this to `Y_test` and this to `y_test` as well. Now if we print an example from `Y_train`, you can see that each label is now a row of numbers. Some of it is hidden behind the dot-dot-dots, but just ignore that: each row is ten numbers long, with nine zeros and a single one. When the one is in the first position, the label is zero, and the rest are going to be zeros; there's only ever going to be a single one in there. If the second position is a one, the label is the number one. If the third position is a one, the label is the number two. If the one is at the end here, the label is the number nine. And that's basically how we make predictions as well. What we're doing here with `to_categorical` is often referred to as one-hot encoding; you'll hear different phrases thrown around for that kind of thing. So essentially that's how we're preparing our data. Just to go over what we've been doing: we've been able to load in our data and take a look at it quite easily. After that, we normalized the data by making sure it's all between 0 and 1, and then we turned all of our y values into categorical form, which is what we call one-hot encoding. So next up, let's create our neural network. I'm going to create an incredibly simple neural network for now, and of course we can play around with it afterwards and improve it, but you'll actually find that even the most basic neural network can do a pretty good job on the handwritten digits. So we go `model.add`, and in there we're putting the number of classes, which is 10. So what we're doing here is putting in a dense layer.
Let me just start that part again. The first thing we're doing is adding in our dense layer. This is a fully connected layer, as we talked about before. For the number of nodes, I'm just going to put the number of classes, so it's actually going to be relatively small. Then we have to say what the input shape is, so that our model knows what to expect. That's going to be equal to what we saved up here, our number 784. So we put that in here: 784 is the number of independent variables we're putting in. When we were looking at the shape of our data right at the start, it's essentially what the second number became: we flattened the width and height together into one number, and multiplying them together gives us 784. So that's what we're putting in here as the input shape, and that's everything we need for our first layer. Then, literally, I'm just going to have an output layer. Actually, I'll add one more layer, why not, just for fun, and let's go for 20 nodes. We can add some kind of activation function in here using the `activation` argument. Okay, then we're going to add our output layer, so `model.add` again, and we're going to add in another activation function. Before, when we were doing the Titanic dataset, we were adding in a sigmoid, because that gave us a single number between zero and one. Now we again want numbers between zero and one, but we want that for all ten of these outputs, and we want all of them to add up to one. The position with the largest number is going to be the one we end up predicting. With sigmoid we did that for one number, but here the labels are one-hot encoded.
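The requirement described above, ten outputs between zero and one that add up to one, is what the softmax function provides. Here is a minimal NumPy sketch of it; the raw scores are made-up numbers, and this is an illustration of the function itself rather than of how Keras implements its softmax activation.

```python
import numpy as np


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtracting the max avoids overflow
    exps = np.exp(shifted)
    return exps / exps.sum()


# Made-up raw outputs for the ten digit classes 0-9.
scores = np.array([1.0, 0.5, 0.1, 2.0, 0.3, 4.0, 0.2, 0.0, 1.5, 0.7])
probs = softmax(scores)

print(round(float(probs.sum()), 6))  # 1.0
print(int(np.argmax(probs)))         # 5, the class with the largest score
```

The predicted digit is simply the position of the largest probability.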
So we've got 10 different classes. Instead of using sigmoid, we're going to use something called softmax. If you ever need to do something where you've got a number of different categories but you still want to output some kind of probabilistic prediction, you're going to be using softmax. So that's going to be our model. Let's just make sure it's happy with everything there. It says I've put in the wrong keyword argument for activation. I've just noticed I didn't actually put in `Dense` here. These are all fully connected layers we're putting in, which Keras calls dense. If I run this, it should work now. Okay, that's all good. Now, one useful thing we can do if we want to see how many parameters our neural network has, or just see a basic summary, is `model.summary()`. If we run this now, it tells us what's going on: we can see the shape of the data as it goes through, and we can see the number of parameters at each stage as well. What's really helpful is that I've actually noticed an issue here. First of all, we can see the total number of parameters. That's the number of weights: when we were looking at y = wx + b, it's basically the number of w's (plus the b's), which is very helpful for us to know. That's actually not too many, which is fine. But the really helpful thing I've noticed is that in our final layer here, where we're doing softmax, we only want to have 10 nodes. So I've changed this to 10. We could put this other one in as 20 if we wanted more, but we'll leave both at 10 for now, because we don't want to make it too big. If I run that again, there should be a few fewer parameters, because we have fewer nodes in our second layer. And the output shape is as we want now, so that's all good. So our model is all set up. The next thing we want to do is compile our model.
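The parameter counts that a model summary reports for fully connected layers can be worked out by hand: each node has one weight per input plus one bias. A small sketch, assuming the layout the lecture settles on (784 inputs, then two dense layers of 10 nodes each); `dense_params` is a hypothetical helper name.

```python
def dense_params(n_inputs, n_nodes):
    """Weights plus biases for one fully connected layer."""
    return n_inputs * n_nodes + n_nodes


# Layer sizes as described above: 784 inputs -> 10 nodes -> 10 outputs.
layer_sizes = [784, 10, 10]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(per_layer)       # [7850, 110]
print(sum(per_layer))  # 7960
```

With 20 nodes in the second layer instead of 10, that layer would have 10 × 20 + 20 = 220 parameters rather than 110, which is the small drop mentioned when the size was changed.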
Now that we've designed the architecture behind it, we want to do `model.compile`. After that, we just need to put in some of the different parameters. We want to have a loss, and for this we're going to use categorical cross-entropy, which is basically a good equation for calculating the loss when we're looking at categories. So `loss` equals `'categorical_crossentropy'`, then `optimizer` equals `SGD`, which we imported above; that's our stochastic gradient descent. And the metric we want to look at is accuracy. Now that we've compiled our model (oops, let me just make sure that's in the right place, yep), there's not much left to do except train it. You can call this `model_train`, and it's equal to `model.fit`. We're going to put in all the information it needs. First of all, we tell it what we're passing in, which is `x_train` and also `Y_train`. We want the capital Y here because, as we did above, after using `to_categorical` we're using upper-case Y's. So we have `x_train`, `Y_train`. Then we want to give it a batch size; let's give it something reasonable like 100. After that, we want to dictate the number of epochs. Just to keep it short, I'm going to give it 50 epochs. You can set a much larger number if you want; it just depends on how much time you want to spend on training. Then there's the verbosity. `verbose` basically means that if I put in zero, it's not going to print very much out, but if I put in one, it's going to show us what the accuracy is for each epoch. We might as well do that, because we're only doing 50 of them. Then finally the validation split. Let's go for 0.05, so we're holding out 5% of the data. Okay, so after all that, let's plot this out.
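To make "a good equation for calculating the loss" a little more concrete, here is a minimal NumPy sketch of categorical cross-entropy on made-up predictions. It illustrates the formula only; the clipping constant `eps` and the example numbers are assumptions, and Keras's own implementation may differ in detail.

```python
import numpy as np


def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean of -sum(y_true * log(y_pred)) over the examples in a batch."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))


# Two made-up examples with three classes each (one-hot labels).
y_true = np.array([[0, 1, 0], [1, 0, 0]], dtype="float32")
confident = np.array([[0.05, 0.90, 0.05], [0.80, 0.10, 0.10]])
unsure = np.array([[0.40, 0.30, 0.30], [0.30, 0.40, 0.30]])

print(categorical_cross_entropy(y_true, confident))  # smaller loss
print(categorical_cross_entropy(y_true, unsure))     # larger loss
```

The loss is small when the probability assigned to the correct class is close to one, and grows as that probability shrinks. On the training settings themselves: with 60,000 training images, a validation split of 0.05 sets aside 3,000 images and trains on the remaining 57,000, which works out to 570 batches of 100 per epoch.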
We want to be able to see what the accuracy actually looks like for our train and test data, and a plot is a much easier way of visualizing it. So let's plot `model_train.history`, in exactly the same way that we did before for Titanic. We're taking the accuracy, and let's give that the label "train". Then let's do the exact same thing for the test, so let's put in the test accuracy here, which is `val_acc`. We'll label this "test"; "val" stands for validation. The next thing we want to do is the usual things like giving it a title, something like "Model accuracy: train and test". Then we add in our axis labels: `plt.xlabel` is the epoch number, and `plt.ylabel` is accuracy. Finally, `plt.legend`, so we can tell the difference between the train and the test, and let's put that in the lower right. Then we do `plt.show()`, and hopefully this will compile. If I spell "compile" right, that would help. So we're going to compile the model, train it, and then plot out the accuracy afterwards. Let's run that. Okay, we have an error here, and it looks like it's pointing at this line. Let me just check the syntax first. Alright, it looks like I need to put an equals there and some brackets; let's see if that sorts it out. And now it's not happy because it's saying something about the data shape. It seems to be complaining about our `x_train`. We had 60,000 data points in our `x_train`, and it looks like it hasn't been reshaped into just two dimensions. So let's look above. I think what may have happened is that we haven't run the cell where I reshaped it, or, oh yeah, it seems like it's been accidentally deleted. So I'm just going to pause the video and add those lines back in really quickly. Okay, I've added those back in. Let's start from the top and run all of our cells, all the way from the top here, to make sure the data has been reshaped.
Is everything looking good now? Okay, so it seems to be happy now with our data structures. That's all good. As we can see, the validation accuracy is starting at about 58%, and the accuracy for the train set started at about 41%. As you can see, the accuracy is increasing relatively quickly here, which is great news for us, and this is only over a small number of epochs. So it's looking very hopeful so far. Once we get to the end, it's going to plot out the accuracy for us as well, which will be great. So I'll pause the video here and start again once we get to 50. Okay, so here we are. In only 50 epochs we've already been able to get the accuracy up to 75% on our validation data, so that's really good. If we were to increase the number of epochs by a lot more, I think the accuracy would increase much further. We won't do that in this lecture, as I don't want you guys to be waiting around for too long. But one thing I wanted to mention is about the verbosity. If you did set the number of epochs to, I don't know, 500 or 1,000 or something like that, then you wouldn't want to have `verbose` as one. You'd change that to zero so you don't have to see all the outputs. But then you wouldn't be able to see the actual number at the end; you'd only have this graph. If you want to be able to see the number right at the end, it's relatively easy to get that out as well. We just say `score` is equal to `model`, or whatever you have named your model, dot `evaluate` (let me get the spelling right), and we put in, let's say we want to see the accuracy on the test set, `x_test` and `Y_test`, with `verbose` as zero. Then we can just print "Test accuracy" along with the score.
We use curly brackets and `.format` for that, and in here we put in the score at index one. So now we can see the test accuracy, which comes out at around 70%. That's a really helpful way of seeing an overall accuracy after training for these epochs. So that's it for this dataset. What I highly recommend you do by yourself is see how you can improve the architecture, based on what we've talked about in the lectures on variance and bias and what we did with the Titanic dataset. So have a go at that yourself and see how high you can get the accuracy. I'll see you in the next lecture.
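The accuracy figure printed at the end can also be worked out by hand from predicted probabilities, which helps when checking what the number means. A minimal NumPy sketch on made-up data; the `accuracy` helper and the four example rows are illustrative assumptions, not output from the lecture's model.

```python
import numpy as np


def accuracy(y_true_one_hot, y_prob):
    """Fraction of rows where the most probable class matches the label."""
    predicted = np.argmax(y_prob, axis=1)
    actual = np.argmax(y_true_one_hot, axis=1)
    return float(np.mean(predicted == actual))


# Made-up one-hot labels and predicted probabilities for four examples.
Y_test = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
probs = np.array([
    [0.7, 0.2, 0.1],  # correct (class 0)
    [0.1, 0.8, 0.1],  # correct (class 1)
    [0.5, 0.3, 0.2],  # wrong (predicts 0, label is 2)
    [0.2, 0.6, 0.2],  # correct (class 1)
])

print("Test accuracy: {}".format(accuracy(Y_test, probs)))  # Test accuracy: 0.75
```

Three of the four made-up predictions match their labels, so the accuracy is 0.75.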