Transcripts
1. Course Introduction: Welcome to deep learning and neural networks with Python. I'm your instructor, Frank Kane, and I spent nine years at Amazon.com and IMDb.com building and managing some of their best-known features: product recommendations, people-who-bought-also-bought, top sellers, and movie recommendations at IMDb. All of these features required applying machine learning techniques to real-world data sets, and that's what this course is all about. I don't need to tell you that artificial intelligence, deep learning, and artificial neural networks are among the most valuable technical skills to have right now. These fields are exploding with progress and new opportunities. This short course will tackle deep learning from an applied, practical standpoint. We won't get mired in notation and mathematics, but you'll understand the concepts behind modern AI and be able to apply the main techniques using the most popular software tools available today. We'll start off with a refresher on Python and the Pandas library in case you're new to them. Then we'll cover the concepts behind artificial neural networks. Then you'll dive right into using the TensorFlow library to create your first neural network from scratch, and we'll use the Keras library to make prototyping neural networks even easier. You'll understand and apply multilayer perceptrons, deep neural networks, convolutional neural networks, and recurrent neural networks. At the end of the course, a quick final project will let you practice what you've learned. The activities in this course are really interesting: you'll perform handwriting recognition and sentiment analysis, and predict people's political parties, using artificial neural networks built with a surprisingly small amount of code. If you're a software developer or programmer looking to understand the exciting developments in AI in recent years and how it all works, this course is for you. We'll turn concepts right into code, using Python, with no nonsense and no academic pretense. Building a neural network is not as hard as you think. All you need is some prior experience in programming or scripting to be successful in this course. The general format of this course is to introduce a concept using some slides and graphical examples. Then we'll look at Python code that implements the concept on some real or fabricated data. You'll then be given some ideas on how to modify or extend the code yourself in order to get some hands-on experience with each concept. The code in this course is provided as IPython notebook files, which means that in addition to containing real, working Python code that you can experiment with, they also contain notes on each technique that you can keep around for future reference. If you need a quick reminder on how a certain technique works, you'll find this an easy way to refresh yourself without rewatching an entire video.
2. Getting Started and Pre-Requisites: It's hard to think of a hotter topic than deep learning, and that's what we're going to talk about, in depth and hands-on, for the next few hours. I'm going to show you how neural networks work: artificial neural networks, perceptrons, multilayer perceptrons. And then we're going to dig into some more advanced topics like convolutional neural networks and recurrent neural networks. None of that probably means anything to you right now, but the bottom line is, if you've been curious about how deep learning and artificial neural networks work, you're going to understand that by the end of these next few hours. So think of it as deep learning for people in a hurry. I'm going to give you just enough depth to be dangerous, and there will be several hands-on activities and exercises so you can get some confidence in applying these techniques and really understanding how they work and what they're for. I think you'll find that they're a lot easier to use than you might have thought. So let's dive in and see what it's all about. This section of my larger machine learning and data science course is actually available as a standalone course as well. So if you are new to this course, you are going to need to install the course materials and a development environment if you want to follow along with the hands-on activities in this deep learning section. If you are new, head on over to media.sundog-soft.com/machine-learning.html. Pay attention to capitalization and dashes; it all matters. You should get to this page here, where you'll find a handy link to the course materials. Just download that, decompress it however you do that on your platform, and remember where you put it. Our development environment for this course will be Anaconda, which is a scientific Python 3 environment. You can install it from here; it is free software. Make sure you install the Python 3.7 or newer version. Once you've installed Anaconda, you'll need to install the TensorFlow package. On Windows, you would do that by going to the Anaconda Prompt: go to Anaconda in your Start menu and open up Anaconda Prompt. On macOS or Linux, you would just go to a terminal prompt, and it would be all set up for you already. From there you would type in "conda install tensorflow", and let that run to install the TensorFlow framework that we will use within Anaconda. If you have an NVIDIA GPU, you might get better performance by installing "tensorflow-gpu" instead, but sometimes that results in compatibility issues, so don't do that unless you know what you're doing. You do not need to install pydotplus for this particular section of the course; it won't hurt to do it, though, as that's also part of the setup instructions for the larger course. You also need to understand how to actually start the notebooks once you have them installed. From that same Anaconda Prompt that we talked about earlier, to launch one of the notebooks in this course, you would first change your directory to wherever you installed the course materials. For me, I put them in C:\MLCourse, and if I do a DIR, you'll see all the course materials are here. From here, if I type in "jupyter notebook" (Jupyter is spelled funny, with a y), that should launch your web browser with a directory of all the different notebooks that are part of this course.
So when I say in this course to open up, for example, Tensorflow.ipynb, the TensorFlow notebook, you would just scroll down this list, open up Tensorflow.ipynb, and up it should come. When you're done experimenting and playing around with a notebook, you can just go to File, Close and Halt to get out of it. When you're done with Jupyter entirely for this session, just quit, and that will shut everything down for you. Alright, so with that out of the way, let's move on. Let's talk about some of the mathematical prerequisites that you need to understand deep learning. This is going to be the most challenging part of the course, actually: just some of the mathematical jargon that we need to familiarize ourselves with. But once we have these basic concepts down, we can talk about them a little more easily. I think you'll find that artificial intelligence itself is actually a very intuitive field, and once you get these basic concepts down, it's very easy to talk about and very easy to comprehend. The first thing we want to talk about is gradient descent. This is basically a machine learning optimization technique for trying to find the most optimal set of parameters for a given problem. What we're plotting here is basically some sort of cost function, some measurement of the error of your learning system. And this applies to machine learning in general, right? You've got to have some sort of function that defines how close your model's results are to the results you want. We're always working in the context of supervised learning here: we will be feeding our algorithm, or model if you will, a group of parameters, some set of knobs that tune the model, and we need to identify the values of those parameters that produce the optimal results. The idea with gradient descent is that you just pick some point at random, and each one of these dots represents some set of parameters. Maybe it's the various parameters for some model we've talked about before, or maybe it's the exact weights within your neural network; whatever it is, we try some set of parameters to start with, and we measure the error that produces on our system. Then what we do is move on down the curve here: we might try a different set of parameters, just sort of moving in a given direction with different parameter values, and we then measure the error that we get from that. In this case, we actually achieved less error by trying this new set of parameters, so we say, okay, I think we're heading in the right direction here; let's change them even more in the same way. And we just keep on doing this in different steps until finally we hit the bottom of a curve, and our error starts to increase after that point. At that point, we'll know that we actually hit the bottom of this gradient. So you understand the nature of the term here, gradient descent: basically, we're picking some point at random with a given set of parameters that we measure the error for, and we keep pushing those parameters in a given direction until the error minimizes itself and starts to come back up at some other value. Okay, and that's how gradient descent works in a nutshell. We're not going to get into all the hardcore mathematics of it; the concept is what's important here, because gradient descent is how we actually train our neural networks to find an optimal solution.
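To make that concrete, here's a minimal sketch of gradient descent in Python on a made-up, one-parameter cost function; the function, the learning rate, and the step count are all hypothetical, just for illustration.

```python
import numpy as np

# A minimal sketch of gradient descent on a made-up, one-parameter cost
# function. Real networks have thousands of parameters, but the idea is
# exactly the same: measure the slope, step downhill, repeat.
def cost(w):
    return (w - 3.0) ** 2 + 1.0          # error is lowest at w = 3

def slope(w, eps=1e-6):
    # Numerically estimate the gradient (the slope of the cost curve) at w
    return (cost(w + eps) - cost(w - eps)) / (2 * eps)

w = np.random.uniform(-10.0, 10.0)       # pick a starting point at random
learning_rate = 0.1
for step in range(200):
    w -= learning_rate * slope(w)        # nudge w downhill along the gradient
print(w, cost(w))                        # w converges near 3, the minimum
```

Notice that all we ever use is the local slope; we never need a formula that tells us where the minimum is.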
Now, you can see there are some areas for improvement in this idea. First of all, you can actually think of this as sort of a ball rolling downhill. So an optimization that we'll talk about later uses the concept of momentum: you can have that ball gain speed as it goes down the hill, if you will, and slow down as it reaches the bottom and kind of bottoms out there. That's a way to make it converge more quickly, which can make actually training your neural networks even faster. Another thing worth talking about is the concept of local minima. What if I randomly picked a point and ended up over here on this curve? I might end up settling into this minimum here, which isn't actually the point of least error; the point of least error in this graph is over here. That's a problem. I mean, that's a general problem in gradient descent: how do you make sure that you don't get stuck in what's called a local minimum? Because if you just look at this part of the graph, that looks like the optimal solution, and if I just happen to start over here, that's where I'm going to get stuck. Now, there are various ways of dealing with this problem. Obviously, you could start from different locations to try to prevent that sort of thing. But in practical terms, it turns out that local minima aren't really that big of a deal when it comes to training neural networks; it just doesn't happen that often. You don't end up with shapes like this in practice, so we can get away with not worrying about that as much. That's a very good thing, because for a long time people believed that AI would be limited by this local minimum effect, and in practice it's really not that big of a deal. Another concept we need to familiarize ourselves with is something called autodiff. We don't really need to go into the hardcore mathematics of how autodiff works; we just need to know what it is and why it's important. When you're doing gradient descent, somehow you need to know what the gradient is, right? We need to measure the slope that we're taking along our cost function, our measurement of error, which might be mean squared error for all we know. To do that mathematically, you need to get into calculus: if you're trying to find the slope of a curve and you're dealing with multiple parameters, we're talking about partial derivatives, the first partial derivatives, to figure out the slope of the direction we're heading in. Now, it turns out that this is very mathematically intensive and inefficient for computers to do, so just taking the brute force approach to gradient descent gets very expensive very quickly. Autodiff is a technique for speeding that up. Specifically, we use something called reverse-mode autodiff, and what you need to know is that it can compute all the partial derivatives you need in just the number of outputs you have, plus one, traversals of your graph. This works out really well for neural networks, because in a neural network you tend to have artificial neurons with very many inputs, but probably only one output, or very few outputs in comparison to the inputs. So this turns out to be a pretty good little calculus trick. It's complicated, and you can look up how it works; it is pretty hardcore stuff, but it works, and that's what's important. And what's also important is that it's what the TensorFlow library uses under the hood to implement its gradient descent.
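As a small taste of that, here's a minimal sketch of reverse-mode autodiff in TensorFlow 2 using tf.GradientTape; the function y = x squared is just a made-up example.

```python
import tensorflow as tf

# GradientTape records the operations we perform on x, then traverses them
# backwards (reverse-mode autodiff) to compute the derivative for us.
x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x ** 2                  # some computation involving x
dy_dx = tape.gradient(y, x)     # d(x^2)/dx = 2x, evaluated at x = 3
print(dy_dx.numpy())            # prints 6.0
```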
So again, you're never going to have to actually implement gradient descent from scratch, or implement autodiff from scratch. These are all baked into the libraries that we're using for deep learning, libraries such as TensorFlow. But they are terms that we throw around a lot, so it's important that you at least know what they are and why they're important. So just to back up a little bit: gradient descent is the technique we're using to find the minimum of the error that we're trying to optimize for, given a certain set of parameters, and autodiff is a way of accelerating that process, so we don't have to do quite as much math, or quite as much computation, to actually measure the gradient for gradient descent. One other thing we need to talk about is softmax. Again, we won't dwell on the mathematics here; what's really important is understanding what it is and what it's for. Basically, at the end of a neural network, you end up with a bunch of weighted outputs that come out of the network. So how do we make use of that? How do we make practical use of the output of our neural network? Well, that's where softmax comes in. Basically, it converts each of the final values that come out of your neural network into a probability. So if you're trying to classify something with your neural network, for example, deciding if an image is a picture of a face, or a picture of a dog, or a picture of a stop sign, you might use softmax at the end to convert those final outputs of the neurons into probabilities for each class, and then you can just pick the class that has the highest probability. It's just a way of normalizing things, if you will, into a comparable range, and in such a manner that if you choose the highest value of the softmax function from the various outputs, you end up with the best choice of classification at the end of the day. So it's just a way of converting the final output of your neural network into an actual answer for a classification problem. Again, you might have the example of a neural network that's trying to drive your car for you, and it needs to identify pictures of stop signs or yield signs or traffic lights: you might use softmax at the end of some neural network that will take your image and classify it as one of those sign types. So again, just to recap: gradient descent is an algorithm for minimizing error over multiple steps. Basically, we start at some random set of parameters, measure the error, move those parameters in a given direction, see if that results in more error or less error, and just keep moving in the direction of minimizing error until we find the actual bottom of the curve, where we have a set of parameters that minimizes the error of whatever it is you're trying to do. Autodiff is just a calculus trick for making gradient descent faster; it makes it easier to find the gradients in gradient descent using some calculus trickery. And softmax is just something we apply on top of our neural network at the very end to convert its final output into an actual choice of classification, given several classification types to choose from. Okay? So those are the basic mathematical terms, or algorithmic terms, that you need to understand to talk about artificial neural networks.
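Here's a minimal sketch of softmax in plain Python with NumPy; the raw output values are made up, just to show the shape of the idea.

```python
import numpy as np

# Softmax converts raw neural network outputs into probabilities that sum
# to one. The example output values below are hypothetical.
def softmax(outputs):
    exps = np.exp(outputs - np.max(outputs))   # shift for numerical stability
    return exps / np.sum(exps)

raw_outputs = np.array([4.0, 1.0, 0.5])        # e.g. face, dog, stop sign
probs = softmax(raw_outputs)
print(probs)                # roughly [0.93, 0.05, 0.03] -- sums to 1.0
print(probs.argmax())       # 0: the "face" class has the highest probability
```

Subtracting the maximum before exponentiating doesn't change the result, but it keeps the exponentials from overflowing. So with that under our belt, let's talk about artificial neural networks next.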
3. The History of Artificial Neural Networks: Let's dive into artificial neural networks and how they work at a high level. Later on we'll actually get our hands dirty and create some, but first we need to understand how they work and where they came from. It's pretty cool stuff. I mean, this whole field of artificial intelligence is based on an understanding of how our own brains work. Over millions of years of evolution, nature has come up with a way to make us think, and if we just reverse engineer the way that our brains work, we can get some insights on how to make machines that think. Within your brain, specifically your cerebral cortex, which is where your thinking happens, you have a bunch of neurons: these are individual nerve cells, and they are connected to each other via axons and dendrites. You can think of these as connections, wires if you will, that connect different neurons together. Now, an individual neuron will fire, or send a signal to all the neurons that it's connected to, when enough of its input signals are activated. So at the individual neuron level, it's a very simple mechanism: you just have this cell, the neuron, that has a bunch of input signals coming into it, and if enough of those input signals reach a certain threshold, it will in turn fire off a set of signals to the neurons that it, in turn, is connected to as well. But when you start to have many, many, many of these neurons connected together in many, many different ways, with different strengths between each connection, things get very complicated. This is kind of the definition of emergent behavior: you have a very simple concept, a very simple model, but when you stack enough of them together, you can create very complex behavior at the end of the day, and this can yield learning behavior. This actually works, and not only does it work in your brain, it works in our computers as well. Now think about the scale of your brain. You have billions of neurons, each of them with thousands of connections, and that's what it takes to actually create a human mind. This is a scale that we can still only dream about in the field of deep learning and artificial intelligence, but it's the same basic concept: you just have a bunch of neurons with a bunch of connections that individually behave very simply, but once you get enough of them together, wired in complex enough ways, you can create very complex thoughts, if you will, and even consciousness. The plasticity of your brain is basically tuning where those connections go and how strong each one is, and that's where all the magic happens, if you will. Furthermore, if we look deeper into the biology of your brain, you can see that within your cortex, neurons seem to be arranged into stacks, or cortical columns, that process information in parallel. For example, in your visual cortex, different areas of what you see might be getting processed in parallel by different columns, or cortical columns, of neurons. Each one of these columns is in turn made of mini-columns of around 100 neurons per mini-column, which are then organized into larger hypercolumns, and within your cortex there are about 100 million of these mini-columns. So again, they just add up quickly. Now, coincidentally, this is a similar architecture to how the video card, the 3D video card, in your computer works. It has a bunch of very simple, very small processing units that are responsible for computing
how little groups of pixels on your screen are computed at the end of the day. And it just so happens that that's a very useful architecture for mimicking how your brain works. It's sort of a happy accident that the research that happened to make video games run quickly, to play Call of Duty or whatever it is that you like to play, lent itself to the same technology that made artificial intelligence possible on a grand scale and at low cost. The same video cards you're using to play your video games can also be used to perform deep learning and create artificial neural networks. Think about how much better it would be if we made chips that were purpose-built specifically for simulating artificial neural networks. Well, it turns out some people are designing chips like that right now; by the time you watch this, they might even be a reality. I think Google's working on one as we speak. So at one point someone said, hey, the way we think neurons work is pretty simple. It actually wouldn't be too hard to replicate that ourselves, and maybe try to build our own brain. This idea goes all the way back to 1943. People proposed a very simple architecture where, if you have an artificial neuron, maybe you can set things up so that the artificial neuron fires if more than a certain number of its input connections are active. And when they thought about this more deeply, in a computer science context, people realized that you can actually create logical expressions, Boolean expressions, by doing this. Depending on the number of connections coming from each input neuron, and whether each connection activates or suppresses another neuron (you can do both; it works that way in nature as well), you can do different logical operations. So this particular diagram is implementing an OR operation. Imagine that the threshold for our neuron is that if you have two or more inputs active, you will in turn fire off a signal. In this setup here, we have two connections coming in from neuron A, and in turn two connections coming in from neuron B. If either of those neurons produces an input signal, that will cause neuron C to fire. So you can see we've created an OR relationship here: if either neuron A or neuron B feeds neuron C two input signals, that will cause neuron C to fire and produce a true output. We've implemented the Boolean operation C equals A OR B, just using the same wiring that happens within your own brain. I won't go into the details, but it's also possible to implement AND and NOT by similar means. Then we started to build upon this idea. We created something called the linear threshold unit, or LTU for short, in 1957. This just built on things by assigning weights to those inputs; so instead of just simple on-and-off switches, we now have the concept of weights on each of those inputs as well, that you can tune. And again, this is working more toward our understanding of the biology: different connections between different neurons may have different strengths, and we can model those strengths in terms of these weights on each input coming into our artificial neuron. We're also going to have the output be given by a step function. This is similar in spirit to how we were using it before, but instead of saying we're going to fire if a certain number of inputs are active, well, there's no concept anymore of active or not active; there are weights coming in, and those weights could be positive or negative.
So we'll say that if the sum of those weighted inputs is greater than zero, we'll go ahead and fire the neuron off, and if it's less than zero, we won't do anything. It's just a slight adaptation to the concept of an artificial neuron, where we're introducing weights instead of just simple binary on-and-off switches. Let's build upon that even further, and we'll create something called the perceptron. A perceptron is just a layer of multiple linear threshold units. Now we're starting to get into things that can actually learn, okay? By reinforcing the weights between these LTUs that produce the behavior we want, we can create a system that learns, over time, how to produce the desired output. And again, this also tracks our growing understanding of how the brain works within the field of neuroscience. There's a saying that goes, "cells that fire together, wire together," and that's kind of speaking to the learning mechanism going on in our artificial perceptron here: if we have weights that are leading to the desired result (you can think of those weights, again, as strengths of connections between neurons), we can reinforce those weights over time and reward the connections that produced the behavior that we want. So you see here we have our inputs coming into weights, just like we did with LTUs before, but now we have multiple LTUs ganged together in a layer, and each one of those inputs gets wired to each individual neuron in that layer. We then apply a step function to each one. Maybe this would apply to classifications; maybe this would be a perceptron that tries to classify an image into one of three things, or something like that. Another thing we introduce here is something called the bias neuron, off there on the right. That's just something to make the mathematics work out: sometimes you need to add in a little fixed, constant value, and that might be something else you can optimize as well. So this is a perceptron. We've taken our artificial neuron, extended it into a linear threshold unit, and now we've put multiple linear threshold units together in a layer to create a perceptron, and already we have a system that can actually learn. You can actually try to optimize these weights, and you can see there are a lot of them at this point: if you have every one of those inputs going to every single LTU in your layer, they add up fast, and that's where the complexity of deep learning comes from. Let's take that one step further, and we'll have a multilayer perceptron. Instead of a single layer of LTUs, we're going to have more than one, and we now have a hidden layer in the middle there. You can see that our inputs are going into a layer at the bottom, the outputs come out of the layer at the top, and in between we have this hidden layer of additional linear threshold units that can perform what we call deep learning. So here we already have what we would call today a deep neural network. Now, there are challenges in training these things, because they are more complex, but we'll talk about that later on; it can be done. And again, the thing to really appreciate here is how many connections there are. Even though we only have a handful of artificial neurons here, you can see there are a lot of connections between them, and there's a lot of opportunity for optimizing the weights between each connection. Okay, so that's how a multilayer perceptron works.
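To see just how simple a single unit is, here's a minimal sketch of one LTU in Python; the weights and bias are hypothetical values chosen so that it implements the OR operation we described earlier.

```python
import numpy as np

# A single linear threshold unit (LTU): weighted inputs, summed, then
# passed through a step function. The weights and bias are made up so that
# either input alone clears the threshold, giving us C = A OR B.
def ltu(inputs, weights, bias):
    weighted_sum = np.dot(inputs, weights) + bias
    return 1 if weighted_sum > 0 else 0       # the step function

weights = np.array([1.0, 1.0])
bias = -0.5
for a in (0, 1):
    for b in (0, 1):
        print(a, b, '->', ltu(np.array([a, b]), weights, bias))
# Fires for every combination except (0, 0): the OR operation
```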
You can see, again, that we have emergent behavior here. An individual linear threshold unit is a pretty simple concept, but when you put them together in these layers, and you have multiple layers all wired together, you can get very complex behavior, because there are a lot of different possibilities for all the weights between all those different connections. Finally, we'll talk about a modern deep neural network, and really, this is all there is to it. The rest of this course, we're just going to be talking about ways of implementing something like this. All we've done here is replace that step function with something better. We'll talk about alternative activation functions; this one's illustrating something called ReLU, which we'll talk about later. The key point there is that a step function has a lot of nasty mathematical properties, especially when you're trying to figure out slopes and derivatives, so it turns out that other shapes work out better and allow you to converge more quickly when you're trying to train a neural network. We'll also apply softmax to the output, which we talked about in the previous lecture; that's just a way of converting the final outputs of our deep neural network into probabilities, from which we can choose the classification with the highest probability. And we will also train this neural network using gradient descent, or some variation thereof; there are several of them to choose from, and we'll talk about that in more detail as well. And maybe we'll use autodiff, which we also talked about earlier, to make that training more efficient. So that's pretty much it. In the past five minutes or so that we've been talking, I've given you pretty much the entire history of deep neural networks and deep learning, and those are the main concepts. It's not that complicated, right? That's really the beauty of it. It's emergent behavior: you have very simple building blocks, but when you put these building blocks together in interesting ways, very complex, and frankly mysterious, things can happen. So I get pretty psyched about this stuff.
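As a preview, here's a hedged sketch of what such a modern deep neural network might look like in Keras, which we'll meet properly later in the course; the layer sizes, input dimension, and class count are all hypothetical, not a specific network from the course materials.

```python
from tensorflow import keras

# Hidden layers use ReLU instead of a step function, the output layer
# applies softmax to produce per-class probabilities, and we train with
# (stochastic) gradient descent. All sizes here are made up.
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(20,)),
    keras.layers.Dense(64, activation='relu'),       # a hidden layer
    keras.layers.Dense(3, activation='softmax'),     # 3 hypothetical classes
])
model.compile(optimizer='sgd', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```

Let's dive into more details on how it actually works up next.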
4. Hands-On in the Tensorflow Playground: So now that we understand the concepts of artificial neural networks and deep learning, let's mess around with it. It's surprisingly easy to do. The folks behind TensorFlow at Google have created a nice little website called playground.tensorflow.org that lets us experiment with creating our own neural networks, and you don't even write a line of code to do it. So it's a great way to get a hands-on, intuitive feel of how they work. Head over to playground.tensorflow.org and you should see a screen like this. You can follow along here or just watch me do it, but I definitely encourage you to play around with this yourself and get an intuitive, hands-on feel of how deep learning works. This is a very powerful thing if you can understand what's going on in this web page. What we're trying to do here is classify a bunch of points just based on their location in this 2D image. This is our training data set, if you will: we have a bunch of points here, and the ones in the middle are classified as blue, and the ones on the outside are classified as orange. Our objective is to create a neural network that, given no prior knowledge, can figure out whether a given point should be blue or orange, and successfully predict which classification it should be. So think of this as our training data. We know ahead of time what the correct classifications are for each one of these points, and we're going to use this information to train our neural network to hopefully learn that stuff in the middle should be blue, and stuff on the outside should be orange. Now, here we have a diagram of the neural network itself, and we can play around with this. We can manipulate it: we can add layers, take layers out, add more neurons to layers, whatever you want to do. Let's review what's going on here. First of all, we're selecting the data set that we want to play with, and we're starting with this default one that's called Circle. The inputs are simply the X and Y coordinates, the vertical and horizontal position, of each data point. So as our neural network is given a point to classify, all it has to work with are those two values: its horizontal position and its vertical position. They start off equally weighted between horizontal and vertical, and we can define the position of any one of these points in terms of those coordinates. For example, this point here would have a horizontal position of negative one and a vertical position of about negative five, and then we feed it into our network. You can see that these input nodes have connections to each one of these four neurons in our hidden layer, and we can manipulate the weights between each one of these connections to create the learning that we want. Those in turn feed into two output neurons here that will ultimately decide which classification we want at the end of the day. Keep in mind, this is a binary classification problem: it's either blue or orange, so at the end of the day we just need a single signal, really, and that's what comes into this output here. Let's go ahead, hit play, and see what happens. What it's going to do is run a bunch of iterations where it learns from this training data. So we're going to keep feeding it input from this training data set.
And as it iterates through it, it will start to reinforce the connections that lead to the correct classifications, through gradient descent or some similar mechanism, right? And if we do that enough times, it should converge to a neural network that is capable of reliably classifying these things. Let's hit play and just watch it in action. Keep your eye on that image to the right there. All right, you can see that we have already converged on a solution. I can go ahead and pause that now. Pretty cool stuff: you can see it has successfully created this pattern where stuff that fits into this middle area here is classified as blue, and stuff on the outside is classified as orange. So we can dive into what actually happened here. The thicknesses of all these connections represent their weights, so you can see the individual weights that are wired between each one of these neurons. We start off here; you see these are more or less equally weighted. Well, not exactly equally, some of these are kind of weak, but what it at least leads to is this behavior in the middle. So we start off with equally weighted X and Y coordinates, and those go to this layer here. For example, in this hidden layer, this neuron is saying, I want to weight things a little more heavily in this corner, and things in the lower-left corner not so much. This other one is picking out stuff on the top and bottom; this one's a little more diagonal to the bottom right; and this one's even more bottom-right heavy. If you combine these together, we end up with output layers that look like this. So we end up with these two blobby things, where we're sort of giving a boost to things on the right, and giving a boost to things that lie within this more blobby, circular area. And when we combine those together, we end up with our final output, which looks like this. Now, this might look different from run to run; there is some randomness to how this is all initialized. Do we actually even need a deep neural network to do this, though? One optimization is to remove layers and see if you can get away with it. Maybe we don't even need deep learning. I mean, really, this is kind of a simple problem: stuff in the middle is blue, stuff on the outside is orange. Let's go ahead and remove one of these neurons from the output layer; all we need is a binary result anyway. Can it still work? It does, and in fact just as quickly. So do I even need that layer at all? Let's go ahead and remove that final layer entirely. It still works, right? So for this very basic problem, we don't even need deep learning. All I have here is a single layer; it's not even a multilayer perceptron, it's just a perceptron. Do I even need four neurons in there? Well, I think maybe I do, but this one here isn't really doing much, right? All it's doing is basically a pass-through, and the inputs coming into it have been weighted down to pretty much nothing. So maybe I don't even need that one; let's get rid of it. It still works. Isn't that kind of cool? I mean, think about that: we only have three artificial neurons, and that's all it takes to solve this problem. Compare that to the billions of neurons that exist inside your head. Now, we probably can't get away with less than that. Let's go ahead and try it with two neurons and see what happens. Yeah, that's just not going to happen, right?
So for this particular problem, all you need is three neurons; two won't cut it. Let's play around some more. Let's try a more challenging data set. Okay, so here's a spiral pattern, and you can tell this is going to be harder, because we can't just say stuff in this corner is going to be this classification; we need a much finer-grained way of identifying these individual spirals. And again, we're going to see if we can just train a neural network to figure that rule out on its own. Well, obviously two neurons won't cut it. Let's go back to four; let's see if that's enough. I bet it isn't. You can see it's trying, but it's really struggling. We can let this run for a while, and you can see it's starting to kind of get there; the blue areas are converging on some blue areas, and it's really trying hard, but there just aren't enough neurons to pull this one off. Let's go ahead and add another layer; let's see if that helps. You can see it's doing more complicated things now that it has more neurons to work with, but it still can't quite get to where it needs to be. Let's add a couple more neurons to each layer. Generally speaking, you can either add more neurons to a layer or add more layers; it's going to produce similar results, but it might affect the speed at which it converges, depending on which approach you take. Just fascinating watching this work, isn't it? All right, this one got stuck; it still can't quite pull it off. Let's add one more layer. This is actually a very common pattern: you start off with a lot of neurons at first, and then you kind of narrow them down as you go. So we're going to go from an initial layer of six neurons, to a hidden layer of four neurons, and then a layer of two neurons, which will ultimately produce a binary output at the end. Well, I think it's getting there. Here we go. Wow. Okay, so technically it's still kind of refining itself, but it kind of did it, right? Now, this is what we call overfitting, to some extent. I mean, obviously it has these tendrils kind of cutting through here, and that's not really part of the pattern we're looking for. It's still going, though; those tendrils are getting weaker and weaker. So it still doesn't have quite enough neurons to do exactly the thing that we would do intuitively. But still, this is a pretty complicated classification problem. It figured it out, maybe overfitting a little bit, but it figured it out, and all we have is, what, 12 neurons here? I mean, that's insane, right? Now, another thing I want to talk about here is that this kind of illustrates the fact that once you get into multiple layers, it becomes very hard to intuitively understand what's going on inside the neural network. This gets kind of spooky, you know? What does this shape really mean? Once you have enough neurons, it's kind of hard to fit inside your own head what these patterns all really represent. The first layer is pretty straightforward: it's basically breaking up the image into different sections. But as you get into these hidden layers, things start to get a little bit weird as they get combined together. Let's go ahead and add one more, shall we? I should say, two more to this output layer, and one more layer at the end. Let's see if that helps things converge a little more quickly. Yeah, all right, it's starting to struggle a little bit.
See that? It's actually got a spiral shape going on here now. So with those extra neurons, it was able to do something more interesting. We still have this little spike here that's doing the wrong thing, and it can't seem to quite think its way out of that one; give it a few more iterations, though, and it might be able to figure it out. These ones are also misclassified. But I find it interesting that it actually created a spiral pattern here on its own. So maybe with a few more neurons, or one more layer, you could create an even better solution. But I will leave that as an exercise for you. I really encourage you to just mess around with this and see what kind of results you can get; this spiral pattern in particular is an interesting problem. Let me just explain some of the other parameters here. We're doing classification here; that's what we're going to be doing throughout this section. For the activation function, we talked about not using a step function and using something else instead; ReLU is one that's actually very popular right now, an activation function we haven't talked about much yet. The learning rate is just basically the step size in the gradient descent that we're doing, so you can adjust that if you want to as well. Let's see if ReLU actually makes a difference; I would expect it to just affect the speed. Oh my gosh, look at that! That's pretty darn close to what we want, right? Apart from this little tiny spike here, which isn't really even there, a little bit of overfitting going on, we have basically created that spiral shape just out of this handful of neurons. Gosh, I could do this all day, guys, and I hope you will too. Just play around with this; it's so much fun, and it gives you such a concrete understanding of what's going on under the hood. I mean, look at this hidden layer here, where these spiral shapes are starting to emerge and come together. And when you think about the fact that your brain works in very much the same way, it's quite literally mind-blowing. Anyway, mess around with this. It's a really great exercise, and I hope you have some fun with it.
5. Deep Learning Details: All right, I know you're probably itching to dive into some code by now, but there's a little more theory we need to cover with deep learning. I want to talk a little bit about exactly how neural networks are trained, and some tips for tuning them, now that you've had a little bit of hands-on experience using the TensorFlow Playground. So how do you train a multilayer perceptron? Well, it uses a technique called backpropagation. It's not that complicated, really, at a conceptual level. All we're doing is gradient descent, like we talked about before, using that mathematical trick of reverse-mode autodiff to make it happen efficiently. For each training step, we just compute the output error for the weights we currently have in place for each connection between each artificial neuron. And then this is where the backpropagation happens: since there are multiple layers to deal with, we have to take that error, which is computed at the end of our neural network, and propagate it back down in the other direction, pushing it back through the neural network, backwards. That way we can distribute that error back through each connection, all the way back to the inputs, using the weights that we're currently using at this training step. Pretty simple concept: we take the error, and we use the weights we're currently using in our neural network to backpropagate that error to individual connections. Then we can use that information to tweak the weights, through gradient descent, to try and arrive at a better value on the next pass, at the next epoch, if you will, of our training. So that's all backpropagation is: we run a set of weights, we measure the error, we backpropagate that error using those weights, tweak things using gradient descent, and try again. And we just keep doing this over and over again until our system converges. We should also talk a little bit about activation functions. In our previous exercise, using the TensorFlow Playground, we were using the hyperbolic tangent activation function by default, and then we switched to something called ReLU and saw that the results were a little bit better. What was going on there? Well, the activation function is just the function that determines the output of a neuron, given the sum of its inputs. You take the sum of all the weighted inputs coming into a neuron, and the activation function is what takes that sum and turns it into an output signal. Now, like we talked about back in the first lecture, using a step function is what people did originally, but that doesn't really work with gradient descent, because there is no gradient there: if it's a step function, there is no slope. It's either on or off; it's either straight across or straight up and down. There's no useful derivative there at all. So that's why alternative functions work a little better in practice. There are some other ones, like the logistic function and the hyperbolic tangent function, that produce more of a curvy curve. If you think about what a hyperbolic tangent looks like, it doesn't have that sharp cutoff at zero, at the origin, so that can work out pretty well. There's also something called the exponential linear unit, which is also a little bit more curvy. What we ended up using, though, was ReLU. That stands for rectified linear unit.
And that's what this graph here is showing: basically, the output is zero if the input is less than zero, and if it's greater than zero, it climbs up at a 45-degree angle. So it's just giving you the actual sum of the weighted inputs as its output, if that sum is greater than zero. The advantage that ReLU has is that it's very simple, very easy, and very fast to compute. So if you're worried about converging quickly, and about your computing resources, ReLU is a really good choice. Now, there are variants of ReLU that work even better if you don't care so much about efficiency. One is called leaky ReLU, and all that is, is that instead of being flat left of zero, it has a little bit of a slope there as well, a very small slope. Again, that's for mathematical purposes, to have an actual meaningful derivative there to work with, and that can provide even better convergence. There's also something called noisy ReLU, which can also help with convergence. But these days, ELU, the exponential linear unit, will often produce faster learning. It's gaining popularity now that computing resources are becoming less and less of a concern, now that you can do deep learning over a cluster of PCs on a network in the cloud. So that's what activation functions are all about. You can also choose different optimization functions. We've talked in very general terms about gradient descent, but there are various variations of gradient descent you can use as well. We talked a little earlier about momentum optimization: basically, the idea there is to speed things up as you're going down a hill, and slow things down as you start to approach the minimum. It's a way of making gradient descent happen faster, by kind of skipping over the steeper parts of your learning curve; and for once, "learning curve" is being used in a context where it actually means something mathematically meaningful. Anyway, there's also something called the Nesterov accelerated gradient, which is just a tweak on top of momentum optimization: basically, it looks ahead a little bit at the gradient in front of you and takes that information into account, so it works even better. There's also something called RMSProp, which uses an adaptive learning rate that, again, helps point you in the right direction toward the minimum. Think back to how gradient descent works: it's not always obvious which direction you're going to be heading in, given a change in parameters, so RMSProp is just a more sophisticated way of trying to figure out the right direction. Finally, there's something called Adam, which stands for adaptive moment estimation. Basically, it's the momentum optimizer and RMSProp combined; it kind of gives you the best of both worlds, and it's a popular choice today because it works really well and is very easy to use. Again, the libraries you're going to use for this stuff are very high-level and very easy to use, so it's not like you're going to have to implement the Nesterov accelerated gradient from scratch; you're just going to say optimizer equals Adam and be done with it. It's just a matter of choosing the one that makes sense for what you're trying to do, and making your own trade-offs between speed of convergence and the computational resources and time required to actually achieve that convergence.
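Just to show how little code that takes, here's a hedged sketch in Keras of picking activation functions and an optimizer by name; the layer sizes and input shape are hypothetical, and you could swap 'adam' for 'sgd' or 'rmsprop' simply by changing the string.

```python
from tensorflow import keras

# Activation functions and optimizers are picked by name; no need to
# implement Nesterov momentum or RMSProp yourself. Sizes are made up.
model = keras.Sequential([
    keras.layers.Dense(32, input_shape=(10,)),
    keras.layers.LeakyReLU(alpha=0.01),          # small slope left of zero
    keras.layers.Dense(32, activation='elu'),    # exponential linear unit
    keras.layers.Dense(2, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
```

Now, let's talk about overfitting as well.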
You can see that you often end up with patterns like this, where you're not really getting a clean solution: you get these weird spikes sometimes, and if you let things go a little too long, it ends up reinforcing those spikes, those overfitted areas where you're not really fitting the pattern you're looking for, you're just fitting the training data you were given. There are ways to combat that. Obviously, if you have thousands of weights to tune, those connections between each neuron in each layer add up really quickly, so it is very easy for overfitting to happen. One way to deal with it is called early stopping: as soon as you see performance start to drop, that might be a nice way of telling you that it's time to stop training; at that point, maybe you're just overfitting. There are also regularization terms you can add to the cost function during training, a bit like the bias term that we talked about earlier; that can help too. But a surprisingly effective technique is called dropout, and it is an example of a very simple idea that is very effective. The idea is just to ignore, say, half of the neurons, randomly chosen, at each training step; pretend they don't exist at all. The reason this works is that it forces your model to spread out its learning. If you're basically taking away half of its brain, if you will, at each training step, you're going to force the remaining half of those neurons to do as much work as possible. This prevents situations where individual neurons take on more of the work than they should. You even saw, in some of the examples we ran in the TensorFlow Playground, that sometimes we ended up with neurons that were barely used at all; using dropout would have forced those neurons to be used more effectively. So it's a very simple concept that's very effective in making sure you're making full use of your neural network. Let's talk about tuning your topology. Another way to improve the results of your deep learning network is to just play games with how many neurons you have, and how many layers of neurons you have. One way of dealing with it is just trial and error, which is kind of what we did in the TensorFlow Playground, but there can be a methodology to that. You can start off with the strategy of evaluating a smaller network with fewer neurons in the hidden layers, or you can evaluate a larger network with more layers. Basically, you want to see: can I get away with a smaller network and still get good results? Just keep on making it smaller and smaller until you find the smallest it can safely be. Or you can try making your network larger and larger, and see at what point it stops providing more benefit to you. So just start sizing things differently and see what works and what doesn't. Again, there's sort of a spooky aspect to how this stuff all works together; it's very hard to understand intuitively what's going on inside a neural network, a deep learning network in particular, so sometimes you just have to use your intuition to tune the thing and arrive at the right amount of resources you need. Also, in today's modern computing environments, sometimes you don't really care so much; it's probably okay to have a deep neural network that has more neurons than it really needs, right?
I mean, what's the real expense involved in that these days? Probably not much. I will say that more layers will often yield faster learning than having more neurons in fewer layers. So if you care about speed of convergence, adding more layers is often the right thing to do. Or you can also use something called model zoos. There are actually libraries out there of neural network topologies for specific problems. So if you don't think you're the first person in the world to solve a specific classification problem, or anything else you're trying to apply a deep neural network to, maybe you should check out one of the model zoos out there, to see if someone's already figured out the optimal topology for what you're trying to achieve, instead of trying to reinvent the wheel. People share these things for a reason, and it can save you a lot of time. So that's enough theory; that's enough talk. In our next lecture, we'll get our hands dirty with TensorFlow and start writing some real Python code to implement our own neural networks.
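Before we do, here's one last sketch tying together the overfitting defenses we just covered, as they might look in Keras; the layer sizes, input shape, and the x_train and y_train names are hypothetical placeholders, not code from the course materials.

```python
from tensorflow import keras

# Dropout randomly ignores half the neurons in the previous layer at each
# training step; EarlyStopping halts training once the validation loss
# stops improving. All sizes and variable names here are made up.
model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(20,)),
    keras.layers.Dropout(0.5),                   # drop 50% of neurons
    keras.layers.Dense(3, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)
# model.fit(x_train, y_train, validation_split=0.2,
#           epochs=100, callbacks=[early_stop])
```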
6. Introducing Tensorflow: If you've done any previous research into deep learning, you've probably heard of the TensorFlow library. It's a very popular framework developed by the folks at Google, and they've been kind enough to make it open source and freely available to the world. So let's talk about what TensorFlow is all about and how it can help you construct artificial neural networks. The thing that kind of took me by surprise when I first encountered TensorFlow was that it wasn't really purpose-built for deep learning at first, or even for neural networks in general. It's a much more general-purpose tool that Google developed that just happens to be useful for developing deep learning and neural networks. More generally, it's an architecture for executing a graph of numerical operations. It's not just about neural networks: you can have any sequence of operations, and define a graph of how those operations fit together. What TensorFlow actually does is figure out how to distribute that processing across the various GPU cores on your PC, or across various machines on a network, and make sure that you can do massive computing problems in a distributed manner. In that respect, it sounds a lot like Apache Spark. If you've taken other courses from me, you've probably heard me talk about Spark; it's a very exciting technology, and Spark is also developing machine learning, AI, and deep learning capabilities of its own. So in some ways, TensorFlow is a competitor to Apache Spark. But there are some key differences that we should talk about. It's not just about distributing graphs of computation across a cluster or across your GPU: you can also run TensorFlow on just about anything. One thing that's special about TensorFlow is that I can even run it on my phone if I want to; it's not limited to running on computers in a cluster in some data center. That's important, because in the real world you might want to push that processing down to the end user's device. Take the example of a self-driving car: you wouldn't want your car to suddenly crash into a wall just because it lost its network connection to the cloud, now would you? The way it actually works is that you might push the trained neural network down to the car itself, and execute that neural network on the computer that's running embedded within your car. The heavy lifting of deep learning is training that network, so you can do the training offline, push the weights of that network, which are relatively small, down to your car, and then run the neural network completely within your car itself. By being able to run TensorFlow on a variety of devices, it opens up a lot of possibilities for doing deep learning on the edge, on the actual devices where you're trying to use it. TensorFlow is written in C++ under the hood, whereas Spark is written in Scala, which ultimately runs on top of a JVM. Going down to the C++ level with TensorFlow gives you greater efficiency, but at the same time it has a Python interface, so you can talk to it just like you would any other Python library. That makes it easy to program and easy to use as a developer, but very efficient and very fast under the hood. The other key difference between TensorFlow and something like Spark is that it can work on GPUs. A GPU is just your video card, the same video card that you're using to play Fortnite, or whatever it is you play.
You can actually distribute the work across the GPU cores on your PC, and it's a very common configuration to have multiple video cards on a single computer and use them to gain more performance on clusters that are purpose-built for deep learning. Plus, TensorFlow is free, and it's made by Google. Just the fact that it's made by Google has led to a lot of adoption. There are competing libraries out there, notably Apache MXNet, but TensorFlow as of right now is still by far the most popular. Installing TensorFlow is really easy. All you have to do is use the conda command in your Anaconda environment to install TensorFlow, or you can use Anaconda Navigator to do it all through a graphical user interface. There's also a tensorflow-gpu package you can install instead if you want to take advantage of GPU acceleration. If you're running this on Windows, I wouldn't go there quite yet; I've had some trouble getting tensorflow-gpu to work on my own Windows system. You'll find that a lot of these technologies are developed primarily for Linux systems running on a cluster. So if you're running on a purpose-built computer in a cluster on EC2 or something that's made for deep learning, go ahead and install tensorflow-gpu, although it's probably going to be installed for you already. Let's talk about what TensorFlow is all about. What is a tensor, anyway? Well, this is another example of fancy, pretentious terminology that people use to make themselves look smart. At the end of the day, a tensor is just a fancy name for an array or a matrix of values. It's just a structured collection of numbers, and that's it. That's all a tensor is. Using TensorFlow can be a little bit counterintuitive, but it's similar to how something like Apache Spark works too. You don't actually execute things right away. Instead, you build up a graph of how you want things to execute, and then when you're ready, you say, okay, TensorFlow, go do this. TensorFlow will then figure out the optimal way to distribute and parallelize that work across your entire set of GPUs and computers in your cluster. So let's take a look at the world's simplest TensorFlow application in Python. All it's going to do is add one plus two together, but it's a good illustrative example of what's actually going on under the hood. We start by importing the TensorFlow library; we'll refer to it as tf as a shorthand. We'll start off by saying a = tf.Variable(1, name='a'). All that is doing is setting up a variable in TensorFlow, a Variable object which contains the single value one and which goes by the name of 'a'. The name is what will appear in visualization tools for your graph, if you're using that sort of thing, but we'll also assign it to a variable in Python called a. Then we set up a b variable that's assigned the value two and given the name 'b'. Here's where the magic starts to happen. We say f = a + b, and you might think that would put the number three into the variable f, but it doesn't. f is actually your graph. It's the connection that you're building up between the a and b tensors to add them together. So f = a + b does not do anything except establish that relationship between a and b and their dependency together on that f graph that you're creating.
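To make that concrete, here's a minimal sketch of the snippet being described so far (the same example gets run in the notebook in an upcoming lecture):

    import tensorflow as tf

    # Two TensorFlow variables, each holding a single value and a name
    # that shows up in graph visualization tools.
    a = tf.Variable(1, name='a')
    b = tf.Variable(2, name='b')

    # As described above: f represents the relationship "a plus b".
    f = a + b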
Nothing actually happens until we try to access the value of f, at which point TensorFlow 2 uses something called eager execution to actually execute that graph. At that point, it will say, okay, I need to take the a variable, which contains one, and the b variable, which contains two, and add them together. It will figure out how to distribute that incredibly complicated operation (I'm being sarcastic) across your entire cluster, and that will ultimately print the value three in the form of a new tensor. So we have just created the most complicated way imaginable of computing one plus two. But if these were larger tensors dealing with larger datasets, or for example a huge array or matrix of weights in a neural network, that distribution of the work becomes important. So although adding one plus two isn't a useful exercise to do with TensorFlow, once you scale this up to the many, many connections in a big neural network, it becomes very important to be able to distribute these things effectively. So how do we extend this idea to neural networks? Well, the thing with TensorFlow is that it's not just made for neural networks; it can do things like matrix multiplication. And it turns out that you can take all the different weights and sums that happen within a single layer of a perceptron and model that as just a matrix multiplication. You take the output of the previous layer in your multilayer perceptron and do a matrix multiplication with a matrix that describes the weights between each neuron of the two layers you're computing. Then you add in a vector that contains the bias terms as well. So at the end of the day, you can take that fancy diagram of what a perceptron looks like and just model it as a matrix multiplication and a vector addition. Go back and read up on your linear algebra if you want to know more about how that works mathematically, but this is just a straightforward matrix multiplication operation with a vector addition at the end for the bias terms. By using TensorFlow's lower-level APIs, we're kind of doing this the hard way. There are higher-level APIs in TensorFlow that make it much simpler and more intuitive to define deep neural networks, but as we're describing TensorFlow at a low level right now, its purpose in life is just to distribute mathematical operations on groups of numbers, or tensors, and it's up to us to describe what we're trying to do in mathematical terms. It turns out that's really not that hard to do for a neural network. But to actually compute a complete deep learning network from end to end, there's more to it than just computing the weights between different layers of neurons. We have to actually train this thing somehow, and actually run it when we're done. So the first thing we need to do is load the training data that contains the features we want to train on and the target labels. To train a neural network, you need a set of known inputs with a set of known correct answers that you can use to descend, or converge, upon the correct solution of weights that leads to the behavior you want. After that, we need to associate some sort of optimizer with the network. TensorFlow makes that very easy to do; it can be gradient descent or some variation thereof, such as Adam. We will then run our optimizer using our training data, and again, TensorFlow makes that pretty easy to do as well.
Finally, we'll evaluate the results of our trained network using our test dataset. To recap at a high level: we're going to create a given network topology and fit the training data using gradient descent to converge on the optimal weights between each neuron in our network. When we're done, we can evaluate the performance of this network using a test dataset that it's never seen before, and see if it can correctly classify data that it was not trained on. One other gotcha when you're using neural networks: it's very important to make sure that your input data is normalized, meaning it's all scaled into the same range. Generally speaking, you want your input data to have a mean of zero and unit variance. That's just the best way to make the various activation functions work out mathematically. What's really important is that your input features are comparable in terms of magnitude; otherwise it's hard to combine those weights together in a meaningful way. Your inputs all sit at the same level at the bottom of your neural network, feeding into that bottom layer, so it's important that they're comparable in magnitude so you don't end up skewing or weighting things in weird ways. For example, say I created a neural network that tries to classify people based on their age and their income. Age might range from 0 to 100, but income might range from 0 to a million. Those are wildly different ranges, and they're going to lead to real mathematical problems if they're not scaled down to a comparable range first. Fortunately, Python's scikit-learn library has a StandardScaler package that will do that for you automatically with just one line of code; all you have to do is remember to use it. Many datasets we use in research will be normalized to begin with. The one we're about to use is already normalized, so we don't actually have to do that here, but later on in the course I'll show you an example of actually using StandardScaler. We've talked about how this all works at a low level, and in TensorFlow 2 it's still possible to implement a complete neural network basically from scratch. But TensorFlow 2 has replaced much of that low-level functionality with a higher-level API called Keras. There is value in understanding how it all works under the hood first, so let's work through a simple example of a neural network using the lower-level APIs next. After that, we'll see how the Keras API simplifies common neural network setups and hides a lot of this complexity from you.
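By the way, here's a minimal sketch of what that StandardScaler normalization looks like in practice, for when you do need it (the feature values here are made up purely for illustration):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Two wildly different scales: age (0-100) and income (0-1,000,000).
    features = np.array([[25, 50000.0],
                         [52, 120000.0],
                         [37, 80000.0]])

    # Rescale each column to mean 0 and unit variance.
    scaled = StandardScaler().fit_transform(features)
    print(scaled)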
7. Using Tensorflow for Handwriting Recognition, Part 1: Okay, so let's play around with TensorFlow using its lower-level API, so you get more of an appreciation of what's going on under the hood, if you will. On Windows, we'll start by going to our Start menu, finding the Anaconda3 group, and from there opening up the Anaconda Prompt. On macOS or Linux, of course, you'll just open up a terminal prompt and you'll be good. The first thing you want to do is make sure that you have actually installed TensorFlow itself. If you haven't already taken care of that, you can just say conda install tensorflow. I've already done that, so it won't actually do anything for me, but if you do need to install or update it, it will prompt you to do so. Give that a second to check everything... all right, looks like we're good. Next, you want to cd into the directory where you installed the course materials. For me, that's going to be cd C:\MLCourse. From within the course materials directory, type in jupyter notebook (Jupyter is spelled funny, with a 'y'). That should bring up your favorite web browser. From here, find the Tensorflow notebook, go ahead and open that, and let's start playing. We'll start off by running the world's simplest TensorFlow application that we looked at in the slides; we're just going to add the numbers one plus two together using TensorFlow. We start by importing the TensorFlow library itself, and we'll give it the name tf as a shorthand. We'll create two variables in TensorFlow, one called a and one called b. The variable a will have the number one associated with it, and the variable b will be initialized with the number two. We then say f = a + b, which does not put the number three into f. f just represents the graph that we're defining; it says that f represents the addition of whatever is in a and b together. So nothing has actually happened here except for constructing that graph. It's only when we say tf.print, looking for the output of f, that TensorFlow will use what's called eager execution to go off and actually execute that graph and evaluate its results. At that point, it says: okay, we have this graph constructed of a and b; a contains one, b contains two; let's add them together, get the output of the f graph, and print it out. And it could actually distribute that across an entire cluster if it had to, though obviously for just adding one plus two there's no need for all that. But let's see if it works. Go ahead and hit Shift+Enter within that block after clicking inside of it, and we should get the number three. Sure enough, the sum of a and b is three. Hey, it works. So let's do something a little bit more interesting: let's actually do handwriting recognition using TensorFlow. This is a pretty common example to use when people are learning TensorFlow. Basically, it's a dataset of 70,000 handwriting samples, where each sample represents someone trying to draw the numbers zero through nine. So we have 70,000 images, each 28 by 28 pixels, of people drawing the digits zero through nine, and our challenge is to create a neural network that looks at those images and tries to figure out what number they represent. Now, this is a very common example when people are learning TensorFlow, maybe a little bit too common, but there's a good reason for that: it's built into TensorFlow, it's easy to wrap your head around, and it's really good for learning.
And our little twist on it, which you're not going to see in many other places, is actually using the lower-level APIs to implement this neural network to begin with. So let's dive in. Let's walk through actually loading this data and converting it to the format that we need. The first thing we do is import the libraries we need; we're going to be using NumPy and TensorFlow itself, and the MNIST dataset is part of TensorFlow's keras.datasets package, so we just import that right in and have the data accessible to us. We'll define some convenient variables here. num_classes is 10; that represents the total number of classifications for each one of these images. Again, these can represent the numbers zero through nine, and that's a total of 10 possible classifications. Our features are 784 in number, and we get that because each image is a 28 by 28 image: 28 times 28 is 784 individual pixels for every training image that we have. So our training features are the individual pixels of every individual image that we're using to train our neural network. Let's start off by loading the dataset itself. We'll call mnist.load_data() to actually retrieve that data from TensorFlow, and we'll put the resulting dataset into these variables here. The convention we usually use is that x refers to your feature data, in our case the images themselves, and y refers to your labels, so that will represent whether an image is the number zero through nine. Furthermore, we split things into training and testing datasets. With MNIST, we have 60,000 training samples and 10,000 test samples. That means we're only going to train our neural network using that set of 60,000 training samples, and we're holding aside 10,000 test samples so we can test how well our trained network works on data it's never seen before. This is how we prevent overfitting: we evaluate our model on data the model has never seen before, so it didn't have a chance to overfit to that data to begin with. Next, we need to convert this to 32-bit floating point values, because that's what TensorFlow expects. So we start by creating NumPy arrays of the underlying training and test data and converting them to the np.float32 data type. We then flatten those images down: the data comes in as two-dimensional 28 by 28 images, and while there are ways of constructing neural networks that can deal with two-dimensional data (we'll get there), for now we're going to keep things simple and just treat each image as a one-dimensional array, or vector, or tensor if you will, of 784 features, 784 pixels. The reshape command is what does that. By saying reshape(-1, num_features), where num_features again is 784, we flatten those two-dimensional arrays down to one-dimensional 784-element tensors. So we end up with a new x_train and a new x_test that contain these 784-pixel one-dimensional tensors. Next, we need to normalize our data; we talked about that in the slides as well. The raw data coming in from this dataset has every pixel represented as an integer value between 0 and 255, where 0 represents a black pixel, 255 represents a white pixel, and values in between represent various shades of gray. We need to scale that down to the range of 0 to 1, and to do that, very simply, we just divide everything by 255 and we're done.
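Putting those steps together, the data preparation might look something like this (a sketch reconstructed from the description above, not the exact notebook code):

    import numpy as np
    from tensorflow.keras.datasets import mnist

    num_classes = 10       # digits 0-9
    num_features = 784     # 28 x 28 pixels per image

    # Load the 60,000 training and 10,000 test samples.
    (x_train, y_train), (x_test, y_test) = mnist.load_data()

    # Convert to 32-bit floats and flatten each 28x28 image into a 784-element vector.
    x_train = np.array(x_train, np.float32).reshape(-1, num_features)
    x_test = np.array(x_test, np.float32).reshape(-1, num_features)

    # Rescale pixel values from 0-255 down to the range 0-1.
    x_train, x_test = x_train / 255.0, x_test / 255.0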
All right. So we've prepared and scrubbed and cleaned our data. Let's actually do some more stuff with it, starting by wrapping our heads around what this data looks like. It's always a good idea to visualize the data you're going to be training with and understand its quirks and nuances before you actually try to implement an algorithm. That's what we're doing in this display_sample function here. It takes as input the number of a specific sample from our training data that we want to look at, and we'll extract its label; y_train, again, holds the training labels, the number zero through nine that each image represents. Then we'll reshape that sample back to a two-dimensional 28 by 28 image so we can actually display it on the screen. We'll give it a title, show that two-dimensional image in grayscale, and just show it. So let's go ahead and kick that off. Actually, we didn't kick off the previous block yet, did we? Before we forget, go back up to the block where we prepare our data and Shift+Enter to run that. Now that we've done that, we can actually visualize the data we've loaded. Click down here and Shift+Enter, and here's a sample data point. Sample number 1000 is this image here, and we can see that it's supposed to represent the number zero, and, well, it looks like a zero. So this isn't a particularly challenging one for us to learn, hopefully. You can play around here and try different sample numbers to get a better feel for what this data is like. Let's try 1500; turns out that's the number nine, kind of a weird-looking nine, so that might be a little bit of a challenge. How about, I don't know, 1700? That's a one; looks like a one. But if you keep poking around here and trying different values, eventually you'll find some weird ones. For example, that's a pretty funny-looking two. You can see there's a wide variety of handwriting ability among the people who made this test data. So that's a good way to wrap your head around what we're dealing with. Moving on, we can take this visualization to the next step and actually visualize those one-dimensional arrays that we're actually training our neural network on. This will give us a better picture of the input that our neural network is going to see, and sort of make us appreciate just what's going on here and how differently it 'thinks,' quote unquote. What we're going to do is reshape everything back down to one-dimensional arrays of 784 pixels, using our training data here, and we're going to iterate through the first 500 samples and concatenate each individual image onto that original image of the zero. So we're going to take the first 500 training images, flatten them down to one-dimensional arrays of 784 pixel values, and then combine them all together into a single two-dimensional image that we'll plot. Go ahead and click in that block and Shift+Enter to run it. And this is interesting. This is showing you the input that's going into our actual neural network for each of the first 500 images, and you can see that your brain does not do a very good job at all of figuring out what these things represent. Every single row of this image represents the input data going into our neural network.
So our neural network is going to take each one of those rows of one-dimensional data and try to figure out what number it represents in two-dimensional space. You can see that it's thinking about the world, perceiving these images, in a very different way than your own brain does. It's sometimes a dangerous assumption to think that neural networks work the same way your brain does. They're inspired by your brain, but they don't always work the same way; an important distinction there. All right, let's actually start setting up our neural network. We'll start off by defining some parameters, or hyperparameters if you will, that define how our training will work. The learning rate is basically how quickly we'll descend during gradient descent while trying to find the optimal weights for our neural network. training_steps is basically how many training iterations we'll conduct, how many times we'll iterate over this neural network trying to train it. batch_size is how many random samples we'll take from our training data during each step, and display_step is just how often we'll display our progress as we train. n_hidden represents how many hidden neurons we'll have in our hidden layer, so that middle layer of neurons in our neural network will have 512 neurons within it. You can play with that number and see what works best for you. These are all hyperparameters, and the dirty little secret of machine learning is that a lot of your success depends on how well you can guess the best values for these. A lot of it is just trial and error: trying to find the right learning rate, the right number of hidden neurons. These numbers are basically determined through experimentation, so it's not quite the exact science you might think it is. Let's go ahead and execute that block with Shift+Enter. We're also going to further slice up our dataset here and prepare it for training using TensorFlow. We're going to use tf.data.Dataset to create what's called a Dataset object within TensorFlow from our training images and training labels. We will then use that Dataset to create the individual batches that we train the neural network with. shuffle(60000) means we're going to randomly shuffle all 60,000 training images that we have. We will then batch them up into batches of 250 and prefetch the first batch, so we have it ready to go. That's all that's going on here. Shift+Enter. All right, now we're going to start to actually construct the neural network itself. We'll start by creating the variables that store the weights and bias terms for each layer of our neural network. We start off by initializing all of our variables with random values, just to make sure we have a set of initial random settings for our weights. We want to start with something, and for lack of something better, we'll start with random values. Actually, your choice of initialization values can make a big difference in how your neural network performs, so it's worth looking into how to choose the right initial values for a given type of neural network. We set up our weights for the hidden layer here; we'll call those weights h, and we'll use the random_normal function that we just defined
to initialize those weights randomly. n_hidden, as you might recall (let's look up here again), is 512, so this will create 512 variables that contain the weights for the hidden neurons. We also need a set of weights for our output layer of 10 output neurons. The output is going to be 10 neurons, where every neuron represents the likelihood of a given classification, zero through nine. We also need biases associated with both of these layers. b will be the set of biases for our hidden layer; again, there will be 512 of those. And we also have biases associated with our output layer of 10 neurons as well. These will be initialized to zeros. Okay, so a little bit different there for the biases: by default, we want our biases to be zero. Let's go ahead and execute that. All right, moving on.
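For reference, the weight and bias setup described here might look roughly like this. This is a reconstruction based on the walkthrough; the variable names and the exact form of the random_normal helper are my own guesses at what the notebook uses:

    import tensorflow as tf

    num_features = 784
    num_classes = 10
    n_hidden = 512

    # Helper to create randomly initialized weights (assumed shape of the
    # notebook's version of this function).
    def random_normal(shape):
        return tf.random.normal(shape)

    # Weights and biases for the hidden layer and the output layer.
    weights = {
        'h': tf.Variable(random_normal([num_features, n_hidden])),
        'out': tf.Variable(random_normal([n_hidden, num_classes]))
    }
    biases = {
        'b': tf.Variable(tf.zeros([n_hidden])),
        'out': tf.Variable(tf.zeros([num_classes]))
    }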
8. Using Tensorflow for Handwriting Recognition, Part 2: Now we're going to set up the topology of our neural network itself. That's what neural_net here does, and as we said in the slides, you can define these fancy neural networks as just simple matrix multiplication and addition functions. So we'll start off by saying tf.matmul; that's just a matrix multiplication of our input neurons, which are the raw 784 pixel values, with the 512 weights in our hidden layer of neurons. That matrix multiplication is multiplying each one of those input values by the weights in the hidden layer. We then say tf.add to add in the bias terms, which again are stored in the b variable that we just defined above. Next, we apply a sigmoid function to the output of that hidden layer; that's basically the activation function on each hidden neuron. That's all that's happening there, very simple. The output layer does it again: we do a matrix multiplication of the hidden layer with our output weights, add in those output biases at the end as well, and call softmax to normalize those output neurons into a final probability for each individual classification. Softmax, again, is just a mathematical trick for taking the outputs of these neural networks and converting those output neuron values into what we can interpret as the probability of each individual classification being correct. Go ahead and execute that; that's quick as well. Again, remember, all we're doing is defining our graph at this point; we're not actually training anything or running anything yet. But take the time to noodle on that: we have basically constructed the topology of the neural network itself. The next thing we need to do is figure out how to actually train this neural network. Again, we're doing this the hard way, so we have to write this by hand. We'll start by defining our loss function, and it's called cross-entropy. Basically, it's a way of measuring error for gradient descent that applies a logarithmic scale, and that has the effect of penalizing incorrect classifications much more than ones that are close to the correct answer. That's a handy property for making training go quickly. We're going to pass two things into this function: y_pred is the predicted values coming out of our neural network, and y_true is the known true label associated with each image. We have to talk about one-hot encoding at this point. We covered this when we talked about feature engineering in the course, but in order to compare that known label, which is a number from 0 to 9, to the output of this neural network, remember the output of the neural network is actually 10 different neurons, where each one represents the probability of a given classification. To compare that to the known correct value, we need to convert that known correct number into a similar format, so we're going to use one-hot encoding. It's best understood with an example. Let's say we know that the correct label for an image is one. We would one-hot encode that as a 10-value array, where each value represents the probability of a given classification. Since we know with 100% certainty that this image is a one,
we can say that for the classification zero there's a 0% chance, for the classification one there's a 100% chance (1.0), for two it's zero, for three zero, and so on and so forth. So that is one-hot encoding: you're just creating a binary representation of an integer value, if you will. The number one is represented as 0, 1, 0, 0, 0, 0, 0, 0, 0, 0. Each of those slots in the array represents a different classification value, and this just makes it easier for the math to work out and to construct things. All right, so we start off by one-hot encoding that known label. We then do some clipping to avoid numerical issues with taking the logarithm of zero, and then we can compute the actual cross-entropy term by calling reduce_sum to go across the entire set of values within this batch, using that logarithmic comparison, like we said, to give cross-entropy its logarithmic property. Let's go ahead and Shift+Enter to run that. Again, the key things here are reduce_mean and reduce_sum, which mean we're applying this across the entire batch all at once. Next, we need to define what's called an optimizer, and in this case we're going to use stochastic gradient descent. We've talked about that before, and we will use our learning rate hyperparameter, which we'll want to tune through experimentation later on, to define that optimizer. We also need to write a function to actually run that optimization, and again, with the lower-level TensorFlow APIs, we kind of have to do this the hard way. We're going to use something called GradientTape to compute the gradients automatically. You can see here that it calls our neural_net function, which defines the topology of our neural network, and computes the loss with the cross_entropy function that we defined above as well. So this ties it all together and allows us to optimize this neural network within this one function. It's not terribly important to understand what's going on here at a low level; you can kind of follow it through the comments. We update our variables as we go through each step: we compute the gradients, and then we update our weights and biases at each training step. So this code is what computes new weights and biases through each training pass. Again, this is going to be a lot easier using the Keras higher-level API; right now we're just showing you this to give you an appreciation of what's going on under the hood. Let's go ahead and Shift+Enter that one as well. So now we have everything we need: the topology of our network defined, the variables defined for our weights and biases, a loss function defined, which is cross-entropy, and an optimization function that ties it all together, called run_optimization. Let's go ahead and start training this thing. So that's what's going on here... oh wait, one more thing. We need an accuracy metric as well. A loss function isn't enough; we also want to display the actual accuracy at each stage too. All this accuracy metric does is take the argmax of each output array, which corresponds to our one-hot encoded prediction, and compare that to the one-hot encoded known value we have for that label. So this is just a way of saying:
we're going to call reduce_mean to compute the accuracy of each individual prediction, averaged across the entire dataset. That's what our accuracy metric does. Shift+Enter, and now we can actually kick it off. You can see that we've done all the heavy lifting; now it's fairly simple to actually do it. For every training step, we take a batch from our training data. Remember, we created those batches earlier using a Dataset. Across the training steps, that's going to be 3,000, I think we said. And for each batch, each step of training, we call run_optimization. That's the one function that ties everything together, applying our optimization across the neural network and computing the optimal weights and biases at each step. Every 100 steps (that's what display_step is), we'll output our progress as we go. So every 100th step, we'll execute our neural network on the current batch, get a set of predictions for that batch of 250 values, compute cross-entropy on it to get a snapshot of our current loss, and compute accuracy on it as well, so we get a metric of our accuracy at each stage and can watch it converge over time throughout our 3,000 training steps, or epochs if you will. Let's go ahead and kick that off. And this is where the action happens. This is going to iterate over 3,000 epochs, and we can see the accuracy changing as we go. It's kind of interesting to watch, because the accuracy fluctuates a little bit as it goes here. You can tell it's maybe settling into little local minima and working its way out of them, finding better solutions over time. But as it goes, it starts to converge on better and better values. We started off with just 84% accuracy; right now we're up to about 94%, but it's still kind of all over the place. Let's give it some more time... pretty firmly in the nineties at this point, so that's good. 93s, I'd say, up to 2,000 epochs now. Remember, we're going to go up to 3,000, and you just have to watch this and see where it starts to converge. We could do early stopping to figure out at what point we stop getting an improvement, but you kind of have to eyeball it first to get a sense of how many epochs you really need. We're almost there, and it looks like we're not going to do much better than 93% accuracy. So there we have it: over 3,000 epochs, we ended up with an accuracy of 92.8%, and remember, this is using our training dataset, so there's a possibility that we're overfitting here. To really evaluate how well this model does, we want to evaluate it on the test dataset, on data it's never seen before. So let's go ahead and use that test dataset we set aside at the beginning, run the neural network on it, and call our accuracy function to see how well it does on test images it's never seen before. 93%. Not too bad, you know. I mean, we could do better, but for a very simple first take at using TensorFlow, setting up a very simple neural network with just one hidden layer, that's not too bad. We'll do better throughout the course as we try different techniques, but we're off to a good start here. Again, it's good to understand your data, visualize things, see what's actually going on under the hood.
So let's take a look at some of those misclassified images to get more of a gut feel for how good our model really is. Let's look at some examples of images that it did not classify correctly and see just how forgivable those errors are. That's what this function is doing. Basically, we're going through 200 images, looking at the actual predicted value by taking argmax on the output neuron array and comparing that to the known correct labels; if a prediction is not correct, we'll print it out with the original label and the predicted label, to get an idea of what's happening. Shift+Enter. All right, so these are some pretty messy examples. In this example, we knew that this person was trying to draw a five; we thought it was a six. I can understand that. That's a really nasty-looking five, and there's a pretty good case for saying it was actually a six. So that's a case where your human brain probably couldn't do a whole lot better; I would not have guessed that that's a five, it just looks like a squiggle. This one too: our model's best guess was a six, but the person intended a four. That does not look like a four to me; it looks like half of a four, basically, like I don't know what happened to the person's arm while they were drawing it. But again, this lets you appreciate just how well it's doing, or not. This one, I'm not sure how our model thought it was a seven; it's a very weird-looking six, but it doesn't look like a seven either. That one's also kind of a nasty one; it looks like a two to your brain, a really funny, squished, odd-looking two, but this is an example of where your brain does a better job than our simple neural network. Overall, though, these are largely forgivable errors. The ones it messed up on were some pretty weird, messy examples. Like, what is that? I guess it's supposed to be a two; we guessed it was a nine. I could actually see that going either way. So, not too bad. Anyway, if you want to play with it, I encourage you to do so; see if you can improve upon things. Like we talked about, there are a lot of different hyperparameters here to play with: the learning rate, how many hidden neurons we have. Try different values, try different learning rates, try more neurons, fewer neurons, and see what effect that has. Just play around with it, because in the real world, that's what you have to do. Try adding a second hidden layer, even, or different batch sizes, or a different number of epochs. Just get your hands dirty and get a good gut feel for how those different parameters affect the final results that you get. So give it a shot, and if you can get more than 93% accuracy, we'd love to hear about it in the Q&A.
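For reference, here's roughly how the pieces described in this lecture fit together, building on the earlier sketches (x_train, y_train, weights, and biases). This is a reconstruction from the walkthrough rather than the notebook verbatim, and the learning rate shown is a placeholder value:

    import tensorflow as tf

    learning_rate = 0.001     # placeholder; tune through experimentation
    training_steps = 3000
    batch_size = 250
    display_step = 100

    # Shuffle the training data and slice it into prefetched batches.
    train_data = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    train_data = train_data.repeat().shuffle(60000).batch(batch_size).prefetch(1)

    # Topology: 784 inputs -> sigmoid hidden layer -> softmax over 10 digits.
    def neural_net(x):
        hidden = tf.nn.sigmoid(tf.add(tf.matmul(x, weights['h']), biases['b']))
        out = tf.add(tf.matmul(hidden, weights['out']), biases['out'])
        return tf.nn.softmax(out)

    # Cross-entropy loss against one-hot encoded true labels.
    def cross_entropy(y_pred, y_true):
        y_true = tf.one_hot(tf.cast(y_true, tf.int64), depth=num_classes)
        y_pred = tf.clip_by_value(y_pred, 1e-9, 1.0)   # avoid log(0)
        return tf.reduce_mean(-tf.reduce_sum(y_true * tf.math.log(y_pred), 1))

    optimizer = tf.optimizers.SGD(learning_rate)

    # One training step: compute gradients with GradientTape, then update
    # the weights and biases.
    def run_optimization(x, y):
        with tf.GradientTape() as tape:
            loss = cross_entropy(neural_net(x), y)
        trainable = list(weights.values()) + list(biases.values())
        gradients = tape.gradient(loss, trainable)
        optimizer.apply_gradients(zip(gradients, trainable))

    # The training loop itself, reporting loss every display_step batches.
    for step, (batch_x, batch_y) in enumerate(train_data.take(training_steps), 1):
        run_optimization(batch_x, batch_y)
        if step % display_step == 0:
            loss = cross_entropy(neural_net(batch_x), batch_y)
            print("step %d, loss %f" % (step, loss.numpy()))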
9. Introducing Keras: So we've had a look at developing neural networks using TensorFlow's lower-level APIs, where instead of really thinking about neurons or units, you're thinking more about tensors and matrices and multiplying them together directly. That's a very efficient way of doing it, but it's not really intuitive, and it can be a little bit confusing, especially when you're starting out, to try to implement a neural network in those terms. Fortunately, there's a higher-level API called Keras that's now built into TensorFlow. It used to be a separate product that sat on top of TensorFlow, but as of TensorFlow 1.9 it's actually been incorporated into TensorFlow itself as an alternative, higher-level API that you can use. And it's really nice, because it's purpose-built for deep learning. All the code is very much built around the concept of artificial neural networks, and that makes it very easy to construct the layers of a neural network, wire them together, and use different optimization functions on them. It's a lot less code, and a lot fewer things that can go wrong as a result. Another benefit of Keras, in addition to its ease of use, is its integration with the scikit-learn library. If you're used to doing machine learning in Python, you probably use scikit-learn a lot, and with Keras you can actually integrate your deep neural networks with scikit-learn. You might have noticed in our previous lecture that we kind of glossed over the problem of actually doing train/test or cross-validation on our neural network, because it would have been kind of a big pain in the butt. But with scikit-learn, it's very easy to do cross-validation and perform proper analysis and evaluation of the neural network. That makes it easier to evaluate what we're doing and to integrate it with other models, or even chain a neural network with other deep learning or machine learning techniques. There's also a lot less to think about, and that means you can often get better results without even trying. With TensorFlow, you have to think about every little detail at the linear algebra level of how these neural networks are constructed, because it doesn't really natively support neural networks out of the box. You have to figure out: how do I multiply all the weights together? How do I add in the bias terms? How do I apply an optimizer? How do I define a loss function? Keras can take care of a lot of those details for you. So when there are fewer things for you to screw up, and more things that Keras can take on for you in terms of optimizing what you're really trying to do, you can often get better results without doing as much work, which is great. Why is that important? Well, the faster you can experiment and prototype things, the better your results will be. If it's that much easier for you to try different layers in your neural network, different topologies, different optimizers, different variations, it's going to be that much easier and quicker for you to converge on the optimal kind of neural network for the problem you're trying to solve, whereas TensorFlow puts up a bunch of roadblocks for you along the way. At the end of the day, you only have so much time to devote to these problems, right?
So the more time you can spend on the topology and tuning of your neural network, and the less on the implementation of it, the better your results will be at the end of the day. Now, you might find that Keras is ultimately a prototyping tool for you. It's not as fast as going straight to TensorFlow, so sometimes you'll want to converge on the topology you want and then go back and implement it at the TensorFlow layer. But even that use for prototyping alone is well worth it; it makes life a whole lot easier. So let's take a closer look. Again, Keras is just a layer on top of TensorFlow that makes deep learning a lot easier. All we need to do is start off by importing the stuff we need. We're going to import the Keras library and some specific modules from it: we have the MNIST dataset here that we're going to experiment with, the Sequential model, which is a very quick way of assembling the layers of a neural network, and the Dense and Dropout layers, so we can add some new things to this neural network to make it even better and prevent overfitting. And we will import the RMSprop optimizer, which is what we're going to use for our gradient descent. Shift+Enter, and you can see it's already loaded up Keras just by importing those things, using TensorFlow as the backend. Let's go ahead and load up the MNIST dataset that we used in the previous example. Keras's version is a little bit different; it actually has 60,000 training samples, as opposed to 55,000, and still 10,000 test samples, and it's just a one-line operation. All right, so now we need to, as before, convert this to the shape that TensorFlow expects under the hood. We're going to reshape the training images to be 60,000 by 784. Again, we're going to still treat these as one-dimensional images; we're going to flatten them all out into 1D rows of 784 pixels for each 28 by 28 image. We also have our test dataset of 10,000 images, each with 784 pixels apiece, and we will explicitly cast the images to 32-bit floating point values, just to make the library a little bit happier. Furthermore, we're going to normalize these things by 255. The image data here is actually 8-bit at the source, 0 to 255, so to convert that to 0-to-1, what we're basically doing is converting it to a floating point number first, from that 0-to-255 integer, and then dividing it by 255 to rescale the input data to 0 to 1. We've talked before about the importance of normalizing your input data, and that's all we're doing here: taking data that started off as 8-bit, 0-to-255 data and converting it to 32-bit floating point values between zero and one. That's all that's going on there. As before, we will convert our labels to one-hot format; that's what to_categorical does for you. It just converts the label data, on both the training and test datasets, to one-hot 0/1 values. Let's go ahead and run that previous block before we forget, and we'll run this one as well; again, I'm just hitting Shift+Enter after selecting the appropriate blocks of code. All right, as before, let's visualize some of the data just to make sure it loaded up successfully. This is pretty much the same as the previous example. We're just going to look at our input data for sample number 1234, and we can see that our one-hot label here shows a one in position four; since we start counting at zero (0, 1, 2, 3), that indicates the label three. Using argmax
gives us back the human-readable label, and by reshaping that 784-pixel array into a 2D shape, we can see that this is somebody's attempt at drawing the number three. Okay, so far so good; our data looks like it makes sense and was loaded correctly. Now remember, back when we were dealing with TensorFlow, we had to do a whole bunch of work to set up our neural network. Look at how much easier it is with Keras. All we need to do is say that we're setting up a model, a Sequential model, and that means we can add individual layers to our neural network one layer at a time, sequentially, if you will. So we'll start off by adding a Dense layer of 512 neurons with an input shape of 784. This is basically our first layer, which takes the 784 input signals from each image, one per pixel, and feeds them into a hidden layer of 512 neurons, and those neurons will have the ReLU activation function associated with them. So with one line of code, we've done a whole lot of work that we had to do by hand in TensorFlow before. And then on top of that, we'll put a softmax activation function on a final layer of 10 neurons, which maps to our final classification of what digit this represents, from 0 to 9. So, wasn't that easy? We can even ask Keras to give us back a summary of what we set up, just to make sure that things look the way we expected. And sure enough, we have two layers here: one with 512 neurons, going into a 10-neuron layer for the final classification. This does sort of omit the input layer, but we do have that input shape of 784 features going into that first layer. All right. Now, you might also remember that it was kind of a pain in the butt to get the optimization and loss function set up in TensorFlow. Again, that's a one-liner in Keras. All we have to do is say that our loss function is categorical_crossentropy, and it will know what to do there. We're going to use the RMSprop optimizer, just for fun; we could use any other one we wanted to. We could just choose Adam if we wanted, or there are other choices like Adagrad and SGD; you can read up on those at this link if you want to. And we will measure accuracy as we go along. That's all that's saying. Let's go ahead and run that, and that will build the underlying graph that we want to run in TensorFlow. All right, so now we actually have to run it, and again, that's just one line of code with Keras. All we need to do is say we're going to fit this model using this training dataset; these are the input features and labels we're going to train with. We want to use batch sizes of 100, we're going to run 10 epochs, I'm going to set a verbosity level of two because that's what works best within an IPython notebook, and for validation we will provide the test dataset as well. So instead of writing a big function that does each iteration of learning by hand, like we did in TensorFlow, Keras does it all for us. Let's go ahead and hit Shift+Enter and kick that off. Now, Keras is slower than raw TensorFlow; it's doing a little bit more work under the hood, so this will take more time. But you'll see that the results are really good. I mean, even on that first epoch, we've already matched the accuracy that we got after 2,000 iterations in our hand-coded TensorFlow implementation. We're already up to epoch six, and we're approaching 99% accuracy on our training data.
Keep in mind this is measuring accuracy on the training dataset, and we're almost there. But yeah, even with just 10 epochs, we've done a lot better than we did with raw TensorFlow. Again, Keras is doing a lot of the right things for you automatically, without making you even think about it. That's the power of Keras: even though it's slower, it might give you better results in less time at the end of the day. Now, here's something we couldn't really do easily with TensorFlow. Well, it's possible; I just didn't get into it because that lecture was long enough as it was. Remember that we can integrate Keras with scikit-learn, so we can just say model.evaluate, and that's just like a scikit-learn model as far as Python is concerned, and actually measure the accuracy based on our test dataset. Using the test dataset as a benchmark, it had a 98% success rate in correctly classifying those images, so that's not bad. Now, mind you, a lot of research goes into optimizing this particular dataset problem, and 98% is not really considered a good result; like I said, later in the course we'll talk about some better approaches we can use. But hey, that's a lot better than we got in the previous lecture, isn't it? As before, let's go ahead and take a look at some of the ones it got wrong, just to get a feel for where it has trouble, things where our neural network has challenges. The code here is similar; we're just going to go through the first 1,000 test images, and since it has a much higher accuracy rate, we have to go deeper into the test set to find examples of things that went wrong. We'll reshape each image into a flat 784-pixel array, which is what our neural network expects as input, call argmax on the resulting classification in one-hot format, and see if that predicted classification matches the actual label for that data; if not, we print it out. All right, so you can see here that this model really is doing better; the ones it's getting wrong are pretty wonky. In this case, we predicted that this was a number nine, and if I were to look at that myself, I might guess it was a nine as well. Turns out this person was trying to draw the number four, but this is a case where even a human brain starts to run into trouble figuring out what this person was actually trying to write. I don't know what that next one is supposed to be; apparently they were trying to draw the number four, and our best guess was the number six, not unreasonable given the shape of things. Here's somebody who was trying to draw a two, but it looks a whole lot more like a seven; I wouldn't be too sure about that myself either. So even though we flattened this data to one dimension, this neural network we've constructed is already rivaling the human brain at doing handwriting recognition on these digits. I mean, that's kind of amazing. That one, I probably would have guessed a three on as well. You can see that the quality of the stuff it has trouble with is really sketchy. What is that, a scorpion? Apparently that was supposed to be an eight, and our best guess was a two. Wow. Okay, yeah, some people really can't write. That's a seven? Yeah, you get the point here. So just by using Keras alone, we've gotten better accuracy; we've gotten a better result because there was less for us to think about. All right.
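Pulling together the Keras workflow we just walked through, it looks roughly like this. This is a sketch reconstructed from the lecture rather than the notebook verbatim, and the import paths assume the Keras bundled with TensorFlow 2:

    from tensorflow.keras.datasets import mnist
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.optimizers import RMSprop
    from tensorflow.keras.utils import to_categorical

    # Load and prepare the data: flatten, cast, rescale, one-hot encode.
    (train_images, train_labels), (test_images, test_labels) = mnist.load_data()
    train_images = train_images.reshape(60000, 784).astype('float32') / 255
    test_images = test_images.reshape(10000, 784).astype('float32') / 255
    train_labels = to_categorical(train_labels, 10)
    test_labels = to_categorical(test_labels, 10)

    # One hidden ReLU layer of 512 neurons, softmax output over 10 digits.
    model = Sequential()
    model.add(Dense(512, activation='relu', input_shape=(784,)))
    model.add(Dense(10, activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer=RMSprop(),
                  metrics=['accuracy'])

    model.fit(train_images, train_labels, batch_size=100, epochs=10,
              verbose=2, validation_data=(test_images, test_labels))

    # Score against the held-out test set.
    print(model.evaluate(test_images, test_labels, verbose=0))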
And you can probably improve on this even more. So again, as before with TensorFlow, I want you to go back and see if you can actually improve on these results. Try using a different optimizer than RMSprop; try different topologies. The beauty of Keras is that it's a lot easier to try those different topologies now, right? Keras actually comes with an example of using MNIST in its documentation, and this is the actual topology they use in their example. So go back, give that a try, and see if it's actually any better or not; see if you can improve upon things. One thing you can see here is that they're actually adding Dropout layers to prevent overfitting, and it's very easy to add those sorts of features here. Basically, what they've done is add that same dense layer of 512 hidden neurons taking the 784 features, and then drop out 20% of the neurons at the next layer, to force the learning to be spread out more and prevent overfitting. So it might be interesting to see if adding those dropout layers (sketched below) actually improves your results on the test dataset. All right, so go play with that, then come back, and we'll do some even more interesting stuff using Keras.
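For reference, the documentation-style topology with dropout described above might look something like this (my sketch of it, assuming the same 784-input MNIST setup as before):

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Dropout

    # Same 512-neuron hidden layers, but discarding 20% of neuron outputs at
    # each training step to spread out the learning and prevent overfitting.
    model = Sequential()
    model.add(Dense(512, activation='relu', input_shape=(784,)))
    model.add(Dropout(0.2))
    model.add(Dense(512, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(10, activation='softmax'))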
10. Using Keras to Learn Political Affiliations: So that was a lot easier using Keras, wasn't it? Now, the MNIST dataset is just one type of problem you might solve with a neural network. It's what we call multi-class classification: multi-class because the classifications we were fitting things into range from the number zero through nine, so in this case we had 10 different possible classification values, and that makes it a multi-class classification problem. Now, Keras's documentation and examples offer general advice on how to handle different types of problems. Here's an example of how they suggest setting up a multi-class classification problem in general. You can see we have two hidden layers here, and an input dimension of however many features you have coming into the system; in this example there are 20, but depending on the nature of your problem there may be more. It sets up two ReLU activation layers with 64 neurons each, and again, that's something you'd want to tune depending on the complexity of what you're trying to achieve. It sticks in a Dropout layer to discard half of the neurons at each training step, again to prevent overfitting. And at the end, it uses a softmax activation over, in this example, 10 different output values. So this is how they suggest tackling a multi-class classification problem in their own documentation. They then use an SGD optimizer with a categorical cross-entropy loss function. So you can just refer to the Keras documentation for a general starting point, somewhere to begin from at least, when you're tackling a specific kind of problem. Again, the actual number of neurons, number of layers, and number of inputs and outputs will obviously vary depending on the problem you're trying to solve, but this is the general guidance they give you on the right loss function and the right optimizer to start with. Another type of classification problem is binary classification. Maybe you're trying to decide whether pictures of people are male or female, or trying to decide if someone's political party is Democrat or Republican. If you have an either/or sort of problem, that's what we call a binary classification problem, and you can see that their recommendation here is to use a sigmoid activation function at the end instead of softmax, because you don't really need the complexity of softmax if you're just trying to choose between zero and one. So sigmoid is the activation function of choice in the case of binary classification. They also recommend the RMSprop optimizer, and the loss function in this case will be binary_crossentropy in particular. So a few things are special about doing binary classification as opposed to multi-class. Finally, I want to talk a little bit more about using Keras with scikit-learn. It does make it a lot easier to do things like cross-validation, and here's a little snippet of code showing how that might look. Here's a little function that creates a model that can be used with scikit-learn. Basically, we have a create_model function here that creates our actual neural network. We're using a Sequential model, putting in a dense layer with four inputs and six neurons, which feeds another hidden layer of four neurons, and finally it goes to a binary classifier at the end with a sigmoid activation function.
So that's a little example of setting up a binary classification neural network. We can then set up an estimator using the KerasClassifier function, and that gives us back an estimator that's compatible with scikit-learn. You see at the end there that we're passing that estimator into scikit-learn's cross_val_score function, and that will allow scikit-learn to run your neural network just as if it were any other machine learning model built into scikit-learn. That means cross_val_score can automatically train your model and then evaluate its results using k-fold cross-validation, giving you a very meaningful result for how accurate it is in its ability to correctly predict classifications for data it's never seen before. So with those snippets under our belt, let's try out a more interesting example and finally move beyond the MNIST sample. What we're going to do is try to predict the political parties of members of Congress, just based on their votes in Congress, using the Keras library. So let's try this out. Now, this is actually going to be an example that I give to you to try out yourself as an exercise. I'm going to help you load up this data and clean it up, but after that, I'm going to leave it up to you to actually implement a neural network with Keras to classify these things. So again, to back up: what we're going to do is load up some data about a bunch of congressional votes that various politicians made, and we're going to see if we can predict whether a politician is Republican or Democrat just based on how they voted on 16 different issues. This is older data; it's from 1984, so you definitely need to be of a certain age, shall we say, to remember what these issues were. And if you're from outside of the United States, just to give you a brief primer in US politics: basically, there are two main political parties in the United States, the Republicans, which tend to be more conservative, and the Democrats, which tend to be more progressive. Obviously those have changed over the years, but that's the current picture. So let's load up our sample data. I'm going to use the pandas library, which is part of our scientific Python environment here, to load up these CSV files, which are just comma-separated value data files, massage that data, clean it up a little bit, and get it into a form that Keras can accept. We'll start by importing the pandas library, which we'll call pd for short. I've built up this array of column names because it's not actually part of the CSV file, so I need to provide it by hand. The columns of the input data are going to be the political party, Republican or Democrat, and then a list of the different issues they voted on. For example, we can see whether each politician voted yea or nay on religious groups in schools. I'm not really sure what the details of that particular bill were, but by reading these, you can probably guess the direction the different parties would tend to vote. So go ahead and read in that CSV file using pandas's read_csv function. We'll say that question marks indicate missing values, and we'll pass in our names array of feature names so we know what to call the columns. Then we'll just display the resulting DataFrame using the head command. Go ahead and hit Shift+Enter to bring that up, and we should see something like this: just the first five entries.
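The loading step described here might look something like the following sketch. The filename and the generic vote_1 through vote_16 column names are stand-ins; the actual notebook uses descriptive issue names for each column, which I'm not reproducing here:

    import pandas as pd

    # Party label plus one column per vote; the real notebook uses descriptive
    # issue names (like the religious-groups-in-schools bill) rather than
    # vote_1..vote_16, and the filename is a guess at the course materials.
    feature_names = ['party'] + ['vote_%d' % i for i in range(1, 17)]

    # '?' marks a missing vote in this file, so treat it as NaN.
    voting_data = pd.read_csv('house-votes-84.data.txt', na_values=['?'],
                              names=feature_names)
    print(voting_data.head())
    print(voting_data.describe())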
So for the first five politicians at the head of our data, we can see each person's party, which is the known label we're going to try to predict, and their votes on each issue. Now, we can also use the describe function on the resulting data frame to get a high-level overview of the nature of the data. For example, you can see there's a lot of missing data: even though there are 435 people that have a party associated with them, only 387 of them actually had a vote recorded on the water project cost sharing bill, for example. So we have to deal with missing data here somehow, and the easiest thing to do is to just throw away rows that have missing data. Now, in the real world, you'd want to make sure that you're not introducing some sort of unintentional bias by doing that. Maybe there's more of a tendency for Republicans not to vote than Democrats, or vice versa; if that were the case, then you might be biasing your analysis by throwing out politicians that didn't vote on every single issue. But let's assume there is no such bias, and we can just go ahead and drop those missing values. That's what this little line here does: it says dropna, inplace equals true, which just means we're going to drop any rows that are missing data from our voting data frame. Then we'll call describe again, and we should see that every column has the same count, because there's no missing data at this point. So we've whittled things down to 232 politicians here. Not ideal, but hey, that's what we have to work with. The next thing we need to do is massage this data into a form that Keras can consume. Keras does not deal with y's and n's; it deals with numbers. So let's replace all the y's and n's with ones and zeros using this line here; pandas has a handy replace function on data frames you can use to do that. Similarly, we'll replace the strings Democrat and Republican with the numbers one and zero. So this turns this into a binary classification problem: if we classify someone as belonging to label one, that indicates they're a Democrat, and label zero indicates they're a Republican. So let's go ahead and run that to clean up the data, and if you run head on that data frame again, you should see that everything has been converted to numerical data between zero and one, which is exactly what we want for the input to a neural network. All right, finally, let's extract this data into NumPy arrays that we can actually feed to Keras. To do that, we're just going to call .values on the columns we care about: we'll extract all of the feature columns into a features array, and all of the actual labels, the actual parties, into an all_classes array. So go ahead and hit shift-enter to run that, and at this point, I'm going to turn it over to you. The code snippets you need were covered in the slides just prior to coming out to this notebook, so refer back to those, and that should give you what you need to work off of and give things a go. I want you to try it yourself. Now, my answer is below here. No peeking! I put a little image there to try to stop you from scrolling further than you should. But if you want to, hit pause here, come back later, and compare your results to mine. Okay, so at this point, I want you to pause this video and give it a go yourself.
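For reference, here's a sketch of just the cleanup steps we walked through (not the exercise solution), assuming the voting_data frame and feature_names list from the loading step above; the lowercase party strings follow the UCI file and may differ in your copy:

```python
# Drop any row with a missing vote, then confirm the counts now all match.
voting_data.dropna(inplace=True)
voting_data.describe()

# Keras wants numbers, not strings: map votes and parties to 1/0.
voting_data.replace(['y', 'n'], [1, 0], inplace=True)
voting_data.replace(['democrat', 'republican'], [1, 0], inplace=True)

# Extract NumPy arrays: the 16 vote columns as features, the party as label.
all_features = voting_data[feature_names[1:]].values
all_classes = voting_data['party'].values   # 1 = Democrat, 0 = Republican
```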
And when you think you've got something up and running, or if you just want to skip ahead and see how I did it, hit play again and I'll show you. All right, I hope you did your homework here. Let's take a look at my implementation. Again, it's pretty much taken straight from the slides I showed you earlier. All we're going to do is import the stuff we need from Keras (we're using Dense, Dropout, and Sequential), and we're also going to use cross_val_score to actually evaluate our model and illustrate integrating Keras with scikit-learn, like we talked about. So when we're integrating with scikit-learn, we need to create a function that creates our model, which we can pass into cross_val_score. Ultimately, we're going to create a sequential model, and we're just going to follow the pattern we showed earlier for a binary classification problem. In this case, we have 16 different issues that people voted on. We're going to use a ReLU activation function with a layer of 32 neurons, and a pretty common pattern is to start with a large number of neurons in one layer and whittle things down as you get to the higher layers. So we're going to distill those 32 neurons down to another hidden layer of 16 neurons. And I'm using the term "units" in this particular example; a little bit of an aside here: more and more researchers are using the term unit instead of neuron, and you're seeing that in some of the APIs and libraries that are coming out. The reason is that we're starting to diverge a little bit between how artificial neural networks work and how the human brain actually works; in some cases we're actually improving on biology. So some researchers take issue with calling these artificial neurons, because we're kind of moving beyond neurons, and they're becoming their own thing at this point. Finally, we'll have one last layer with a single output neuron for our binary classification, with a sigmoid activation function to choose between zero and one, and we will use the binary cross-entropy loss function and the Adam optimizer, and kick it off. At that point, we construct a KerasClassifier to actually execute that, creating an estimator object that we can then pass into scikit-learn's cross_val_score to perform k-fold cross-validation automatically, and we'll display the mean result when we're done. So shift-enter, and let's see how long this takes. Mind you, in 1984 politicians were not as polarized as they are today, so it might be a little bit harder than it would be today to predict someone's party just based on their votes. It would be very interesting to see if that's the case using more modern data. Hey, we're done already: 93.9% accuracy, and that's without even trying too hard. We didn't really spend any time tuning the topology of this network at all; maybe you could do a better job. If you did get significantly better results, post that in the course here; I'm sure the students would like to hear about what you did. So there you have it: using Keras for a more interesting example, predicting people's political parties using a neural network, and also integrating it with scikit-learn to make life even easier. That's the magic of Keras for you.
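Here's a sketch of that solution as described. One hedge on the imports: KerasClassifier lived in the keras.wrappers.scikit_learn module in the Keras of this era; in newer setups, the equivalent wrapper comes from the separate scikeras package instead. The epoch count here is illustrative:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score

def create_model():
    model = Sequential()
    model.add(Dense(32, input_dim=16, activation='relu'))  # 16 votes in, 32 units
    model.add(Dense(16, activation='relu'))                # whittle down to 16 units
    model.add(Dense(1, activation='sigmoid'))              # binary output: party
    model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model

# Wrap the model so scikit-learn can drive it like any other estimator.
estimator = KerasClassifier(build_fn=create_model, epochs=100, verbose=0)
scores = cross_val_score(estimator, all_features, all_classes, cv=10)
print(scores.mean())   # mean k-fold cross-validation accuracy
```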
11. Convolutional Neural Networks: So far, we've seen the power of just using a simple multilayer perceptron to solve a wide variety of problems. But you can take things up a notch: you can arrange more complicated neural networks together and solve more complicated problems with them. So let's start by talking about convolutional neural networks, or CNNs for short. Usually you hear about CNNs in the context of image analysis, and their whole point is to find things in your data that might not be exactly where you expected them to be. Technically, we call this feature-location invariance: it means that if you're looking for some pattern or some feature in your data, but you don't know exactly where it might be, a CNN can scan your data and find those patterns for you wherever they are. So, for example, in this picture here, that stop sign could be anywhere in the image, and a CNN is able to find that stop sign no matter where it might be. Now, it's not just limited to image analysis. It can also be used for any sort of problem where you don't know where the features you want might be located within your data, and machine translation or natural language processing tasks come to mind. You don't necessarily know where the noun, or the verb, or the phrase you care about might be in some paragraph or sentence you're analyzing, but a CNN can find it and pick it out for you. Sentiment analysis is another application of CNNs: you might not know exactly where a phrase might be that indicates some happy sentiment, or some frustrated sentiment, or whatever you're looking for, but a CNN can scan your data and pluck it out. And you'll see that the idea behind it isn't really as complicated as it sounds. This is another example of using fancy words to make things sound more complicated than they really are. So how do they work? Well, CNNs, convolutional neural networks, are inspired by the biology of your visual cortex. They take cues from how your brain actually processes images from your retina, and it's pretty cool. It's also another example of interesting emergent behavior. The way your eyes work is that individual groups of neurons service a specific part of your field of vision. We call these local receptive fields: they're just groups of neurons that respond only to a part of what your eyes see. They subsample the image coming in from your retinas, with specialized groups of neurons for processing specific parts of the field of view that you see with your eyes. Now, these little areas overlap each other to cover your entire visual field, and this is called convolution. Convolution is just a fancy word for saying: I'm going to break up this data into little chunks, process those chunks individually, and then assemble a bigger picture of what you're seeing higher up in the chain. So the way it works within your brain is that you have many layers (it is a deep neural network) that identify features of increasing complexity, if you will. The first layer your image goes into inside your head might just identify horizontal lines, or lines at different angles, or specific kinds of edges; we call these filters. Those might feed into a layer above them that would then assemble the lines identified at the lower level into shapes, and maybe there's a layer above that which can recognize objects based on the patterns of shapes that it sees.
Then, if you're dealing with color images, we have to multiply everything by three, because you actually have specialized cells within your retina for detecting red, green, and blue light, and we need to assemble those together as well; each of those gets processed individually, too. So that's all a CNN is: it's just taking a source image, or source data of any sort, really, breaking it up into little chunks called convolutions, and then assembling those and looking for patterns of increasingly higher complexity at higher levels in your neural network. So how does your brain know that it's looking at a stop sign there? Let's talk about this in more colloquial language, if you will. Like we said, you have individual local receptive fields that are responsible for processing specific parts of what you see, and those local receptive fields are scanning your image, and they overlap with each other, looking for edges. You might notice that your brain is very sensitive to contrast; edges that it sees in the world tend to catch your attention, right? That's why the letters on this slide catch your attention: there's high contrast between the letters and the white background behind them. So at a very low level, you're picking out the edges of that stop sign, and the edges of the letters on the stop sign. Now, a higher level might take those edges and recognize the shape of that stop sign and say: oh, there's an octagon there, that means something special to me. Or: those letters form the word STOP, that means something special to me, too. And ultimately, that will get matched against whatever classification pattern of a stop sign your brain has. So no matter which receptive field picked up that stop sign, at some layer it will be recognized as a stop sign. And furthermore, because you're processing data in color, it can also use the information that the stop sign is red, and use that to aid in its classification of what this object really is. So somewhere in your head, there's a neural network that says: hey, if I see edges arranged in an octagon pattern that has a lot of red in it and says STOP in the middle, that means I should probably hit the brakes on my car. And at some even higher level, where your brain is doing higher reasoning, that's what happens; there's a layer that says: hey, there's a stop sign coming up here, I'd better hit the brakes on my car. And if you've been driving long enough, you don't even really think about it anymore, do you? It's almost hard-wired, and that literally may be the case. Anyway, a convolutional neural network, an artificial convolutional neural network, works the same way; same exact idea. So how do you build a CNN with Keras? Obviously, you probably don't want to do this at the lower-level TensorFlow layer. You can, but CNNs get pretty complicated; a higher-level library like Keras becomes essential. First of all, you need to make sure your source data is of the appropriate dimensions, the appropriate shape, if you will, and you are going to be preserving the actual 2D structure of the image if you're dealing with image data here. So the shape of your data might be the width times the length times the number of color channels. And by color channels, I mean: if it's a black-and-white image, there's only one channel indicating brightness, so you only have one color channel for a grayscale image.
But if it's a color image, you'd have three color channels: one for red, one for green, and one for blue, because you can create any color by combining red, green, and blue together. Okay, now, there are some specialized types of layers in Keras you use when you're dealing with convolutional neural networks. For example, there's the Conv2D layer type that does the actual convolution on a 2D image; and again, convolution is just breaking up that image into little subfields that overlap each other for individual processing. There are also Conv1D and Conv3D layers available as well. You don't have to use CNNs with images, like we said; they can also be used with text data, for example, which might be an example of one-dimensional data, and the Conv3D layer is there if you're dealing with 3D volumetric data of some sort. So there are a lot of possibilities there. Another specialized layer in Keras for CNNs is MaxPooling2D; obviously, there are 1D and 3D variants of that as well. The idea is just to reduce the size of your data: it takes the maximum value seen in a given block of the image and reduces the layer down to those maximum values. It's just a way of shrinking the image so that it reduces the processing load on the CNN. As you'll see, processing CNNs is very compute-intensive, and the more you can do to reduce the work, the better. So if you have more data in your image than you need, a MaxPooling2D layer can be useful for distilling that down to the bare essence of what you need to analyze. Finally, at some point you need to feed this data into a flat layer of neurons, right? At some point it's going to go into a perceptron, and at that stage we need to flatten the 2D layer into a 1D layer, so we can just pass it into a layer of neurons. From that point, it just looks like any other multilayer perceptron. So the magic of CNNs really happens at the lower level; ultimately, it gets converted into what looks like the same type of multilayer perceptron we've been using before. The magic happens in actually processing your data, convolving it, and reducing it down to something that's manageable. So a typical usage of image processing with a CNN would look like this: you might start with a Conv2D layer that does the actual convolution of your image data; you might follow that up with a MaxPooling2D layer on top of that, which distills the image down and shrinks the amount of data you have to deal with; you might then do a dropout layer on top of that, which prevents overfitting, like we talked about before. At that point, you might apply a Flatten layer so you can actually feed that data into a perceptron, and that's where a Dense layer might come into play; a Dense layer in Keras is just a perceptron, really, a hidden layer of neurons. From there, you might do another dropout pass to further prevent overfitting, and finally a softmax to choose the final classification that comes out of your neural network. Now, like I said, CNNs are compute-intensive. They are very heavy on your CPU, your GPU, and your memory requirements; shuffling all that data around and convolving it adds up really, really fast. And beyond that, there are a lot of what we call hyperparameters: a lot of different knobs and dials that you can adjust on CNNs.
So, in addition to the usual stuff you can tune, like the topology of your neural network, which optimizer you use, which loss function, or which activation function, there are also choices to make about the kernel sizes: what is the area that you actually convolve across? How many layers do you have? How many units? How much pooling do you do when you're reducing the image down? There's a lot of variance here; there's an almost infinite number of possibilities for configuring a CNN, and often just obtaining the data to train your CNN with is the hardest part. So, for example, if you own a Tesla, it's actually taking pictures of the world around you and the road around you, and all the street signs and traffic lights as you drive, and every night it sends all those images back to some data server somewhere, so Tesla can actually run training on its own neural networks based on that data. So if you slam on the brakes while you're driving a Tesla at night, that information is going to be fed into a big data center somewhere, and Tesla's going to crunch on that and say: hey, is there a pattern to be learned here, from what the cameras on the car saw, that means you should slam on the brakes in a self-driving car? And when you think about the scope of that problem, just the sheer magnitude of processing and obtaining and analyzing all that data, that becomes very challenging in and of itself. Now, fortunately, the problem of tuning the parameters doesn't have to be as hard as I described it to be. There are specialized architectures of convolutional neural networks that do some of that work for you. A lot of research goes into trying to find the optimal topologies and parameters for a CNN for a given type of problem, and you can think of this as a library you can draw from. For example, there's the LeNet-5 architecture that you can use, which is suitable for handwriting recognition in particular. There's also one called AlexNet, which is appropriate for image classification; it's a deeper neural network than LeNet. In the example we talked about on the previous slide, we only had a single hidden layer, but you can have as many as you want; it's really just a matter of how much computational power you have available. There's also something called GoogLeNet; you can probably guess who came up with that. It's even deeper, but it has better performance because it introduces a concept called inception modules, which basically group convolution layers together, and that's a useful optimization for how it all works. Finally, the most sophisticated one today is called ResNet, which stands for residual network. It's an even deeper neural network, but it maintains performance by using what are called skip connections: special connections between the layers of the perceptron that further accelerate things. So it builds upon the fundamental architecture of a neural network to optimize its performance, and as you'll see, CNNs can be very demanding on performance. So with that, let's give it a shot. Let's actually use a CNN and see if we can do a better job at image classification than we've done before.
12. Using CNNs for Handwriting Recognition: We're going to revisit the MNIST handwriting recognition problem, where we try to classify a bunch of images of people drawing the numbers zero through nine, and see if we can do a better job of it using CNNs. Again, CNNs are better suited to image data in general, especially if you don't know exactly where the feature you're looking for is within your image, so we should expect to get better results here. All right, we're going to start by importing all the stuff we need from Keras. We're going to import the MNIST dataset that we're playing with; the Sequential model, so we can assemble our neural network; and then all these different layer types that we talked about in the slides: the Dense, Dropout, Conv2D, MaxPooling2D, and Flatten layer types. And in this example, we'll use the RMSprop optimizer. Go ahead and kick that off. The rest here, for loading up the training and test data, is going to look just like it did before. Still waiting for Keras to initialize itself there. All right, so that should load up the MNIST dataset. Now, we're going to shape the data a little bit differently this time. Since convolutional neural networks can process 2D data in all its 2D glory, we're not going to reshape that data into flat 1D arrays of 784 pixels. Instead, we're going to shape it into the width times the length times the number of color channels. In this case, our data is grayscale in nature, so there's only a single color channel that just defines how light or dark a specific pixel is. And there are a couple of different ways that data can be stored, so we need to handle a couple of different cases here: it might be organized as color channels by width by length, or it might be width by length by color channels. That's what this little bit of code here is dealing with. Either way, we check whether it's a channels-first format or not, and reshape the data accordingly, and we store that shape in this thing called input_shape; that's the shape of our input test data, and training data, for that matter. As before, we're going to scale this data down. It comes in as 8-bit byte data, and we need to convert it into normalized floating-point data instead, so we'll convert the data to 32-bit floating-point values and then divide each pixel by 255 to transform it into some number between zero and one. Go ahead and hit shift-enter there to kick that off. And as before, we will convert the label data into one-hot categorical format, because that will match up nicely with the output of our neural network. Nothing different here; we just do a sanity check again to make sure that we successfully imported our data. We'll pick a random training set sample to print out and display. There's the one-hot format of the label three (0, 0, 0, 1, and so on), the correct human-readable format, three, and sure enough, that looks like the number three, so it looks like our data is in good shape for processing. So now let's actually set up a CNN and see how that works. Let's walk through what's going on in this next code block. As before, we start off by setting up a sequential model, which just allows us to very easily build up the layers of our neural network, and we will start with a Conv2D layer.
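Here's a sketch of that data preparation, assuming the Keras MNIST loader and backend utilities of this era; the variable names are illustrative, not necessarily the notebook's:

```python
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import backend as K

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Keep the 2D structure; the axis order depends on the backend's convention.
if K.image_data_format() == 'channels_first':
    train_images = train_images.reshape(train_images.shape[0], 1, 28, 28)
    test_images = test_images.reshape(test_images.shape[0], 1, 28, 28)
    input_shape = (1, 28, 28)
else:
    train_images = train_images.reshape(train_images.shape[0], 28, 28, 1)
    test_images = test_images.reshape(test_images.shape[0], 28, 28, 1)
    input_shape = (28, 28, 1)

# Normalize the 8-bit pixel values into floats between 0 and 1.
train_images = train_images.astype('float32') / 255
test_images = test_images.astype('float32') / 255

# One-hot encode the labels to match the 10-neuron softmax output.
train_labels = to_categorical(train_labels, 10)
test_labels = to_categorical(test_labels, 10)
```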
So what this syntax here means is that our convolutional 2D layer is going to have 32 windows, or 32 regional fields, if you will, that it will use to sample the image, and each one of those samples will have a three-by-three kernel size. It also needs to know the shape of your input data, which we stored previously; see, that's 1 by 28 by 28, or 28 by 28 by 1, depending on the input format. We will then add a second convolutional filter on top of that to hopefully identify higher-level features. This one will have 64 kernels, also of three-by-three size, and we're going to use a ReLU activation function on that as well. So we've built up two convolution layers here. And again, you want to reuse any previous research you can for a given problem. There are so many ways to configure CNNs that if you start from scratch, you're going to have a very hard time tuning it, especially when you consider how long it takes between each run; these are very resource-intensive. So I've just taken this from the CNN example that comes with the Keras library and drawn my initial topology from it. Now that we've done our convolution layers, we're going to do a MaxPooling2D step to actually reduce things down a little bit. We're going to take a two-by-two pool size, and for each two-by-two pixel block at this stage, we're going to reduce that down to a single pixel that represents the maximum pixel found within that pool. Note that the pool size can be different from the underlying kernel size of the convolution you did. Really, this is just a technique for shrinking your data down to something more manageable. At this point, we'll do a dropout pass to prevent overfitting. We will then flatten what we have so far; that will take our 2D data and flatten it into a 1D layer, and from this point, it's just going to look like any other multilayer perceptron, just like we used before. So all the magic of CNNs has happened by this point, and now we're just converting it down to a flat layer that we input into a hidden layer of neurons. In this case, we're going to have 128 neurons in that layer, again with a ReLU activation function. We'll do one more dropout pass to prevent overfitting, and finally choose our final categorization of the numbers zero through nine by building one final output layer of 10 neurons with a softmax activation function on it. All right, let's go ahead and let that run. Again, nothing's really happening until we actually kick off the model, so that doesn't take any time at all. We can do a model.summary() just to double-check that everything is the way we intended it to be, and you can see that we have our two convolution layers here, followed by a pooling layer, followed by a dropout and a flatten; and from there we have the dense, dropout, and dense multilayer perceptron that actually does our final classification. All right, finally, we need to compile that model with a specific optimizer and loss function. In this case, we're going to use the Adam optimizer and categorical cross-entropy, because that's the appropriate loss function for a multiple-category classification problem. And finally, we will actually run it. Now, like I said, CNNs are very expensive to run. So let's talk about what this command does. First of all, nothing unusual here: it just says that we're going to run batches of 32, which is smaller than before, because there is a much higher computational cost per batch.
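Before moving on, here's a sketch of the model we just assembled, through the compile step (the fit call, with its batch size and epochs, follows below). It assumes the input_shape computed during data preparation; the dropout rates follow the Keras mnist_cnn example this topology is drawn from:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu',
                 input_shape=input_shape))        # 32 3x3 convolution kernels
model.add(Conv2D(64, (3, 3), activation='relu'))  # 64 kernels, higher-level features
model.add(MaxPooling2D(pool_size=(2, 2)))         # shrink each 2x2 block to its max
model.add(Dropout(0.25))                          # fight overfitting
model.add(Flatten())                              # back to a flat 1D layer
model.add(Dense(128, activation='relu'))          # ordinary hidden layer
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))        # final classification, 0 through 9

model.summary()
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
```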
We're only going to run 10 epochs this time because, again, it takes a long time; more would be better. We set verbosity level 2, because that's what you want to choose when running within an IPython notebook, and we pass in our validation test data for it to work with as well. Now, I'm not going to actually run this, because it could take about an hour, and if you don't have a beefy machine, it might not finish at all; if you don't have enough RAM or enough CPU power, this might even be too much for your system. So I'm going to skip ahead here. I actually ran this earlier, and it did take about 45 minutes, but you can see that it very quickly converged to a very good accuracy value, and it was still increasing, so there probably would have been value in going even beyond 10 iterations of the training. But even after just 10 epochs, or 10 iterations, we ended up with an accuracy of over 99%, and we can actually evaluate that against our test data and recreate that 99% accuracy. So that's kind of awesome. CNNs are definitely worth doing if accuracy is key, and for applications where lives are at stake, such as a self-driving car, obviously it's worth the effort, right? You want complete accuracy in detecting whether there's a stop sign in front of you, ideally; even a 0.1% error rate is going to be unacceptable in a situation like that. So that's the power of CNNs. They are more complicated and take a lot more time to run, but as we said, the power of TensorFlow, which Keras is running on top of, means you can distribute its work across an entire cloud of computers, with an entire array of GPUs on each computer. So there are ways of accelerating this; we're just not taking advantage of that in this little example. It's just illustrative. So there you have it: your first convolutional neural network, and you can see how powerful it is at successfully doing image classification, among other things. So cool; let's move on to another type of neural network next.
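A sketch of the training and evaluation step described above; fair warning, on a single CPU this can take the better part of an hour:

```python
history = model.fit(train_images, train_labels,
                    batch_size=32,   # smaller batches, given the higher cost
                    epochs=10,
                    verbose=2,       # notebook-friendly progress output
                    validation_data=(test_images, test_labels))

score = model.evaluate(test_images, test_labels, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
```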
13. Recurrent Neural Networks: Let's talk about another kind of neural network: the recurrent neural network. What are RNNs for? Well, a couple of things. Basically, they're for sequences of data, and that might be a sequence in time, so you might use them for processing time-series data, where you're trying to look at a sequence of data points over time and predict the future behavior of something. So RNNs are built for sequential data of some sort. Some examples of time-series data might be weblogs, where you're receiving different hits to your website over time; or sensor logs, where you're getting different inputs from sensors on the Internet of Things; or maybe you're trying to predict stock behavior by looking at historical stock trading information. These are all potential applications for recurrent neural networks, because they can look at behavior over time and take that behavior into account when making future projections. Another example: if you're trying to develop a self-driving car, you might have a history of where your car has been, its past trajectories, and maybe that can inform how your car might want to turn in the future. You might take into account the fact that your car has been turning along a curve to predict that perhaps it should continue to drive along that curve until the road straightens out. And another example: it doesn't have to be time; it can be any kind of sequence of arbitrary length. Something else that comes to mind is language. Sentences are just sequences of words, right? So you can also apply RNNs to language, or machine translation, or producing captions for videos or images. These are examples where the order of words in the sentence might matter, and the structure of the sentence and how the words are put together can convey more meaning than you could get by looking at those words individually, without context. So an RNN can make use of the ordering of the words and use that as part of its model. Another interesting application of RNNs is machine-generated music. You can think of music sort of like text, where instead of a sequence of words or letters, you have a sequence of musical notes. It's kind of interesting: you can actually build a neural network that can take an existing piece of music and extend upon it, using a recurrent neural network to learn the patterns that were aesthetically pleasing in the music of the past. Conceptually, this is what a single recurrent neuron looks like as a model. It looks a lot like the artificial neurons we've looked at before; the big difference is this little loop here. So now, as we run a training step on this neuron, some training data gets fed into it, or maybe this is an input from a previous layer in our neural network, and it applies some sort of activation function after summing all the inputs into it. In this case, we're drawing something more like a hyperbolic tangent, because mathematically, you want to make sure you preserve some of the information coming in, in a smoother manner. Now, usually we would just output the result of that summation and activation function as the output of this neuron, but we're also going to feed that back into the same neuron. So the next time we run some data through this neuron, the output from the previous run also gets summed into the result.
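That feedback loop is easier to see in code than in prose, so here's a toy NumPy sketch of a single recurrent neuron. The weights here are made-up scalars, just to show how each step's output feeds back into the next step; this isn't a trained model:

```python
import numpy as np

w_input, w_recurrent = 0.8, 0.5   # illustrative weights, not learned values
output = 0.0                      # nothing has been fed back yet

for x in [1.0, 0.2, -0.5]:        # a tiny three-step input sequence
    # Sum the new input with the previous output, then squash with tanh.
    output = np.tanh(w_input * x + w_recurrent * output)
    print(output)
```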
Okay, so as we keep running this thing over and over again, we'll have new data coming in that gets blended together with the output from the previous run through this neuron, and that just keeps happening over and over again. So you can see that over time, the past behavior of this neuron influences its future behavior, and it influences how it learns. Another way of thinking about this is by unrolling it in time. What this diagram shows is the same single neuron at three different time steps, and when you start to dig into the mathematics of how RNNs work, this is a more useful way of thinking about it. Consider this to be time step zero: you can see there's some data input coming into this recurrent neuron, and that will produce some output after going through its activation function, and that output also gets fed into the next time step. So if this is time step one, with the same neuron, you can see that this neuron is receiving not only a new input but also the output from the previous time step; those get summed together, the activation function gets applied, and that gets output as well. The output of that combination then gets fed on to the next time step, call it time step two, where a new input for time step two gets fed into this neuron along with the output from the previous step; they get summed together, the activation function runs, and we have a new output. This is called a memory cell, because it does maintain memory of its previous outputs over time. And you can see that even though things get summed together at each time step, over time those earlier behaviors kind of get diluted, right? We're adding that time step into this time step, and then the sum of those two things into this one. So one property of memory cells is that more recent behavior tends to have more influence on the current output of a recurrent neuron, and that can be a problem in some applications; there are ways to work against it that we'll talk about shortly. Stepping this up, you can have a whole layer of recurrent neurons; you don't have to have just one, obviously. In this diagram, we're looking at four individual recurrent neurons working together as part of a layer. You can have some input going into this layer as a whole that gets fanned out into these four different recurrent neurons, and then the output of those neurons can get fed back into every neuron in that layer at the next step. All we're doing is scaling this out horizontally: instead of a single recurrent neuron, we have a layer of four recurrent neurons in this example, where all of the output of those neurons feeds into the behavior of those neurons in the next learning step. So you can scale this out to have more than one neuron and learn more complicated patterns as a result. RNNs open up a wide range of possibilities, because now we have the ability to deal not just with vectors of information, static snapshots of some sort of state, but with sequences of data as well. So there are four different combinations here that you can deal with. We can have sequence-to-sequence neural networks, where the input is a time series, or some sort of sequence of data, and the output is a time series or sequence of data as well.
So if you're trying to predict stock prices in the future based on historical trades, that might be an example of a sequence-to-sequence topology. We can also mix and match sequences with the older vector-style static states that we predicted back when we were just using multilayer perceptrons. We would call that sequence-to-vector: if we start with a sequence of data, we can produce just a snapshot of some state as a result of analyzing that sequence. An example might be looking at the sequence of words in a sentence to produce some idea of the sentiment that sentence conveys; we'll actually get to an example of that shortly. You can go the other way around, too: you can go from a vector to a sequence. An example of that would be taking an image, which is a static vector of information, and producing a sequence from that vector; for example, words in a sentence, creating a caption from an image. And we can chain these things together in interesting ways as well. We can have encoders and decoders built up that feed into each other. For example, we might start with a sequence of information from a sentence in some language, embody what that sentence means as some sort of vector representation, and then turn that around into a new sequence of words in some other language. That might be how a machine translation system could work: you might start with a sequence of words in French, build up a vector that embodies the meaning of that sentence, and then produce a new sequence of words in English, or whatever language you want. That's an example of using a recurrent neural network for machine translation. So there are lots of exciting possibilities here. Training RNNs, just like CNNs, is hard; in some ways, it's even harder. The main twist here is that we need to backpropagate not only through the neural network itself, in all of its layers, but also through time. From a practical standpoint, every one of those time steps ends up looking like another layer in our neural network while we're training it, and those time steps can add up fast. Over time, we end up with an ever deeper neural network that we need to train, and the cost of actually performing gradient descent on that increasingly deep network becomes increasingly large. So to put an upper cap on that training time, often we limit the backpropagation to a limited number of time steps; we call this truncated backpropagation through time. So just something to keep in mind when you're training an RNN: you not only need to backpropagate through the neural network topology you've created, you also need to backpropagate through all the time steps you've built up to that point. Now, we talked earlier about the fact that, as you're building up an RNN, the state from earlier time steps ends up getting diluted over time, because we just keep feeding in behavior from the previous step to the current step. And this can be a problem if you have a system where older behavior does not matter less than newer behavior. For example, if you're looking at words in a sentence, the words at the beginning of the sentence might be even more important than the words toward the end. If you're trying to learn the meaning of a sentence, there is, in many cases, no inherent relationship between where a word sits in the sentence and how important it might be.
So that's an example of where you might want to do something to counteract that effect, and one way to do that is something called the LSTM cell. LSTM stands for long short-term memory, and the idea is that the cell maintains separate ideas of both short-term and long-term state, and it does this in a fairly complex way. Now, fortunately, you don't really need to understand the nitty-gritty details of how it works. There is a diagram of it here for you to look at if you're curious, but the libraries you use will implement it for you. The important thing to understand is that if you're dealing with a sequence of data where you don't want to give preferential treatment to more recent data, you probably want to use an LSTM cell instead of just a straight-up RNN. There's also an optimization on top of LSTM cells called the GRU cell; that stands for gated recurrent unit. It's just a simplification of LSTM cells that performs almost as well, so if you need to strike a balance, or a compromise, between performance in terms of how well your model works and performance in terms of how long it takes to train it, a GRU cell might be a good choice. Training these is really hard. If you thought CNNs were hard, wait until you see RNNs. They are very sensitive to the topologies you choose and the choice of hyperparameters, and since we have to simulate things over time, and not just through the static topology of your network, they can become extremely resource-intensive. And if you make the wrong choices here, you might have a recurrent neural network that doesn't converge at all; it might be completely useless, even after it's been running for hours. So again, it's important to build upon previous research; try to find some sets of topologies and parameters that work well for problems similar to what you're trying to do. This will all make a lot more sense with an example, and you'll see that it's really nowhere near as hard as it sounds when you're using Keras. Now, I used to work at IMDb, so I can't resist using a movie-related example. So let's dive into that next and see RNNs, recurrent neural networks, in action.
14. Using RNNs for Sentiment Analysis: What we're going to do here is sentiment analysis. This is going to be an example of a sequence-to-vector RNN problem, where we take the sequence of words in a user-written movie review and try to output a vector that's just a single binary value: whether or not that user liked the movie, whether they gave it a positive rating. So this is an example of doing sentiment classification using real user review data from IMDb, and since I used to run IMDb's engineering department, this is a little bit too tempting for me not to use as an example. Now, just to give credit where credit is due: this draws heavily upon one of the examples that ships with Keras, the imdb_lstm sample. I've embellished on it a little bit here, but the idea is theirs, so credit where credit's due. And it does warm my heart, by the way, that they include the IMDb dataset as part of Keras, free to experiment with; it's bringing back good memories for me. I enjoyed working there. Anyhow, this is another example of how we're going to use LSTM cells, long short-term memory cells, because again, when you're dealing with textual data, a sequence of words in a sentence, it doesn't necessarily matter where in the sentence a word appeared. You don't want words toward the end of the sentence counting more toward your classification than words at the beginning; in fact, often it's the other way around. So we're going to use an LSTM cell to counteract that effect you see in normal RNNs, where data becomes diluted over time, or as the sequence progresses, in this example. So let's just dive in and see how it works. We'll start by importing all the stuff we need from Keras. We're going to use the sequence preprocessing module; the Sequential model, so we can assemble different layers together; we're going to introduce a new Embedding layer as part of our RNN, in addition to the Dense layer we had before; we will import the LSTM module; and finally, we'll import the IMDb dataset. So let's go ahead and shift-enter to do all of that and get Keras initialized. And that's done. So now we can import our training and testing data. Like I said, Keras has a handy IMDb dataset preinstalled. Oddly, it has 5,000 training reviews and 25,000 testing reviews, which seems backwards to me, but it is what it is. The one parameter you're seeing here, num_words, indicates how many unique words you want to load into your training and testing datasets. By saying num_words equals 20,000, I'm limiting my data to the 20,000 most popular words in the dataset; if someone uses some really obscure word, it's not going to show up in our input data. Let's go ahead and load that up. Since it does have to do some thinking, it doesn't come back instantly, but it's pretty quick. Okay, we're in business here. Let's take a peek at what this data looks like. Let's look at the first instance of training data here, and, what the heck, it's just a bunch of numbers! It doesn't look like a movie review to me. Well, you can be very thankful to the folks at Keras for doing this for you. The thing is, when you're doing machine learning in general, models don't usually work with words; they work with numbers. So we need to convert these words to numbers somehow as the first step, and Keras has done all of this preprocessing for you already.
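Here's a sketch of the imports and data load described above; keras.datasets.imdb handles the word-to-number encoding for us, and the variable names here are illustrative:

```python
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.datasets import imdb

# Keep only the 20,000 most frequent words in the vocabulary.
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=20000)

print(x_train[0])  # a review, already encoded as a list of word IDs
print(y_train[0])  # 1 = positive sentiment, 0 = negative
```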
So, you know, the number 1 might correspond to the word "the"; I actually have no idea what it corresponds to, but they've encoded each unique word as a number between zero and 20,000, because we said we wanted the 20,000 most popular words. Okay, so it's kind of a bummer that we can't actually read these reviews and get an intuitive sense of what they're saying, but it saves us a whole lot of work. And I have said before that often a lot of the work in machine learning is not so much building your models and tuning them; it's just processing and massaging your input data and making sure your input data is good to go. So even though this doesn't look like a movie review, it is a movie review; they've just replaced all of the words with unique numbers that represent each word. We can also take a look at the training labels. The classification of this particular review was 1, which just means they liked it. The only classifications are zero and one, which correspond to a negative or a positive sentiment for that review. Okay, so we have all of our input data converted to numerical format already. That's great. Now we just have to set things up. Let's start by creating some vectors for input here. We're going to take our training and testing data and call sequence.pad_sequences, just to make sure everything is limited to 80 words. The reason we're doing this is because, like we said, RNNs can blow up very quickly; you have to backpropagate through time, so we want an upper bound on how many time steps we need to backpropagate through. By saying maxlen equals 80, we're only going to look at the first 80 words in each review and limit our analysis to that. So that is a way of truncating our backpropagation through time. It's kind of a low-tech way of doing it, but it's effective; otherwise, we'd be running this thing for days. Okay, so the only point here is to trim all of these reviews, in both the training and the test datasets, to their first 80 words, which, again, have been converted to numbers for us already. Now let's build up the model itself. Oh, we didn't actually run that; let's go ahead and hit shift-enter on that block. Okay, now we can build up the model itself, and for such a complicated neural network, I think it's pretty remarkable how few lines of code are going on here. So let's talk through this. We'll start by creating a sequential model, meaning we can just build up the topology of our network one step at a time. We'll start with some additional preprocessing, using what's called an embedding layer here, and all that does is convert our input data, the words (up to the first 80 words in a given review), into dense vectors of some fixed size. So it takes our vocabulary of 20,000 words, converts each word into a dense vector, and funnels that into 128 hidden neurons inside my neural network. That's all that embedding layer is doing: taking that encoded input text data and converting it into a format that's suitable for input into my neural network. Then, with a single line of code, we build our recurrent neural network. We just say add an LSTM, and we can go through the properties here: we want 128 recurrent neurons in that LSTM layer, and we can also specify dropout terms in that same command, so we can say that we want a dropout of 20%. And that's all there is to it.
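A sketch of the padding and model construction just described; the pair of dropout arguments follows the Keras imdb_lstm sample this example is drawn from:

```python
# Cap each review at 80 word IDs to bound backpropagation through time.
x_train = sequence.pad_sequences(x_train, maxlen=80)
x_test = sequence.pad_sequences(x_test, maxlen=80)

model = Sequential()
model.add(Embedding(20000, 128))  # 20,000-word vocabulary -> 128-value dense vectors
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))  # 128 recurrent units
model.add(Dense(1, activation='sigmoid'))  # single binary output: liked it or not
```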
That one line of code sets up our LSTM neural network with 128 recurrent neurons and adds dropout phases of 20%, all in one step. Finally, we need to boil that down to a single output neuron with a sigmoid activation function, because we're dealing with a binary classification problem. And that's it. We've defined the topology of our network with just four lines of code, even though it's a very complicated recurrent neural network using LSTM cells and dropout phases. Keras makes that all very easy to do. We then need to tell Keras how to optimize this neural network, how to train it. We will use binary cross-entropy, because this is ultimately a binary classification problem (did the person like this movie or not?), and we'll use the Adam optimizer this time, just because that's sort of the best of both worlds for optimizers. Then we can kick it off. So let's go ahead and run these two previous blocks, shift-enter, shift-enter, and at this point, you're ready to actually train your neural network. Let's just walk through what's going on here; it's very similar to the previous examples. In this case, we're going to use batch sizes of 32 reviews at once, we're going to run it over 15 training steps, or epochs, set a verbosity level that's compatible with IPython notebooks, and provide the validation data for it as well. Now, again, I'm not going to actually run this right now, because it will take about an hour. Like I said, RNNs are hard; they take a lot of computing resources. And since all I'm doing is running this on a single CPU, I don't even have things configured to use my GPU, let alone a cluster of computers, this takes a long time. But I did run it earlier, and you can see the results here. So over 15 epochs, you can see that the accuracy it was measuring on the training data was beginning to converge; it seems like after about 13 steps, it was getting about as good as it was going to get. And furthermore, we can actually evaluate that model given the testing dataset. So let's go ahead and call evaluate on that with our test data, again using batches of 32, and if we were to run that, we would see that we end up with an accuracy of 81% on our model here. That doesn't sound that impressive, but when you consider that all we're doing is looking at the first 80 words of each review and trying to figure out, just based on that beginning, whether or not the user liked the movie, that's not too bad. And again, step back and think about what we just made here. We've made a neural network that can essentially read English-language reviews and determine some sort of meaning behind them. In this case, we've trained it to take a sequence of words at the beginning of a movie review that some human wrote, and classify it as a positive review or a negative review. So in a very real sense, at some very basic level, we've taught our computer how to read. How cool is that? And the amount of code we wrote to do that was minimal, right? It's kind of humbling. It's really just a matter of knowing what technique to use to build your neural network and providing the appropriate training data, and then your neural network does the rest. It's really kind of spooky when you sit back and think about it. Anyway, cool stuff.
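A sketch of the compile, train, and evaluate steps described above; expect this to take a long time on a single CPU:

```python
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=32,
          epochs=15,
          verbose=2,   # notebook-friendly output
          validation_data=(x_test, y_test))

score, acc = model.evaluate(x_test, y_test, batch_size=32)
print('Test accuracy:', acc)
```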
So there you have it: a great example of how powerful Keras can be, and a great example of an application of a recurrent neural network, not the typical example of stock trading data or something like that, but sentiment analysis, where we took a sequence of words and used it to create a binary classification of sentiment based on that sequence. Fun stuff, RNNs and Keras.
15. Transfer Learning: The world of AI is in a strange and exciting time. With transfer learning, it's never been easier to deploy a fully trained artificial intelligence model and start using it for real-world problems. The idea here is to use pre-trained models that are already out there, available on the Internet for anyone to use. For a lot of common problems, you can just import a pre-trained model that somebody else did all the hard work of putting together, optimizing, and figuring out the right parameters and the right topology for, and just use it. For example, if you're trying to do image classification, there are pre-trained models out there that you can just import; some of them are called ResNet, Inception, MobileNet, and Oxford VGG, to name some examples. They come pre-trained on a very wide variety of object types, so in many cases you can just unleash one of these off-the-shelf models, point a camera at something, and it will tell you what it is. That's kind of freaky. Similarly, for natural language processing, there are pre-trained models available as well, such as Word2Vec and GloVe, that you can use to basically teach your computer how to read with just a few lines of code. Now, you can use them as-is, but you can also use them as a starting point if you want to extend on them or build on them for more specific problems. So even if they don't solve the specific problem you're trying to solve, you can still use these pre-trained models as a starting point to build off of, which is a lot easier to get going with. You don't have to waste a lot of time trying to figure out the right topology and parameters for a specific kind of problem; you can start with a model that's already figured all that out, and just add on top of it. This is called transfer learning: basically, we're transferring an existing trained model from somebody else to your application. Now, you can find more of these pre-trained models in what are called model zoos. A popular one is the Caffe Model Zoo. And I'm not sure what to think of all this. I mean, it is super, super easy to deploy AI now. As you'll soon see in our next example, you can just import an existing model-zoo model and start using it with just four or five lines of code. You don't really have to be a very good developer anymore to use AI for practical applications. So it's kind of a weird place for the industry to be right now, and it opens up a lot of interesting and potentially scary possibilities for how people might use this technology when there's such a low barrier to entry. Anyway, let's dive into a real-world example, and I'll show you just how scarily easy it is. So let's dive into transfer learning. Open up the Transfer Learning notebook in your course materials, and you should see this; you will soon see just how crazy easy it is to use, and how crazy good it can be. We're going to be using the ResNet50 model here. This is used for image classification, and it's an incredibly easy way to identify objects in arbitrary images. So if you have a picture of anything, maybe coming from a camera, or video frames, or what have you, this can tell you what's in that picture, pretty reliably, it turns out. So let's have some fun with it. Just to prove a point, I'm going to try it with some of my own vacation photos.
That way, we can be sure the pictures we're giving ResNet50 to classify are pictures it's never seen before, and see what it can do with them. For example, I took this picture of a fighter jet while I was exploring the deserts of California. Let's just run that; this is included with your course materials, and there we have a picture of a fighter jet. So as a start, let's see if the ResNet50 model can identify it, and see what's involved in actually coding that up. First, we just need to import the modules we need. We need to import the ResNet50 model itself; again, that is built into Keras, along with several other models as well. We don't even have to go to the trouble of downloading and installing it; it's just there. We're also going to import some image preprocessing tools, both from Keras itself and from the ResNet50 package itself. And we're going to import NumPy, because we're going to use NumPy to manipulate the image data into a NumPy array, which is ultimately what we need to feed into a neural network. So let's go ahead and run that block. Now, one limitation of the ResNet50 model is that your input images have to be 224 by 224 resolution; that's in part to make sure it can run efficiently. It's also limited to 1,000 possible categories, and that might not sound like a lot, but I think you'll be surprised at just how much detail it will give you as to what a thing is. So let's go ahead and load up that image. This time, we will scale it down to 224 by 224 while we're loading it, and we will convert it to a NumPy array with these two lines of code, and then call the ResNet50 model's preprocess_input to prepare that data. I assume it's scaling it into whatever range the model wants, and maybe doing some preprocessing of the image itself to make it work better. It's kind of a black box, and that's a little bit of what's weird about using transfer learning: you're just sort of taking it on faith that it's doing the right thing. But from a practical standpoint, that's not a bad thing to do. Let's go ahead and run this. All right, it's preprocessed my image; that was pretty quick. Now we'll load up the actual model itself. One line of code is all it takes: model equals ResNet50. The weights parameter there says that it's going to use weights learned from the ImageNet dataset, so you could potentially use variations of ResNet50 that were trained on different sets of images. So let's go ahead and load up the model, and that's done. So now we can just use it. We now have a pre-trained image classification model here, with one line of code, and we can just use it. All I have to do is call predict on it, and we're done. That's it; it's really that easy. So let's try it. We have, as you recall, our preprocessed fighter jet image here in the x array, and we will just call model.predict on x and see what it comes back with. It'll come back with a classification, and to translate that into something human-readable, we'll just call the decode_predictions function that comes with the ResNet50 model as well. It's just that easy. Okay, literally two lines of code here, right? One line to actually load up the ResNet50 model, transferring that learning to our application, if you will, specifying a given set of weights that was pre-learned from a given set of images.
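Here's a sketch of those steps end to end; the image filename is a hypothetical stand-in for whatever photo you're classifying (the course materials include the fighter jet shot):

```python
import numpy as np
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet50 import (
    ResNet50, preprocess_input, decode_predictions)

# ResNet50 expects 224x224 input, so scale the photo down while loading it.
img = image.load_img('fighterjet.jpg', target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)   # the model wants a batch dimension
x = preprocess_input(x)         # ResNet50's own black-box preprocessing

model = ResNet50(weights='imagenet')  # one line: a fully trained classifier

preds = model.predict(x)
print(decode_predictions(preds, top=3))  # human-readable top guesses
```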
And then we just call predict on that model, and we're done. That's it. Let's run that and see if it actually works. Wow. Okay, so yeah, its top prediction was actually warplane, and that's exactly what this is a picture of, even though it's never seen this picture before, ever. And I didn't do anything to make sure the photo was from the right angle, or properly framed, or preprocessed with a lot of contrast or anything like that. It just works. It's kind of spooky good. Its second guess was a missile, followed by projectile, and yeah, there were missiles and projectiles on that plane as well. So not only did it tell me it was a warplane, it told me it was a warplane that had missiles on it. I mean, wow, that's crazy good, right? Let's try some other images; maybe we just got lucky. So let's make a little convenience function here to do this on a given image more quickly. We'll write a little classify function, and it will start off by displaying a picture of the thing we're trying to classify. It will then load that image up, scaling it down to the required 224 by 224 dimensions, convert that to a NumPy array, preprocess it, and then just call predict on the ResNet50 model and see what it comes back with. So now we can just say classify and whatever our image file name is, and it will tell us what it is. We've reduced our little bit of code to just one line now, so I can hit Shift+Enter to define that function. And now I can just say, well, I have a file called bunny.jpg in my course materials; let's classify that. Shift+Enter. There's a picture of a bunny rabbit in my front yard that I took once, and sure enough, the top classification is wood rabbit, followed by hare. So not only is it saying it's a rabbit, it's telling me what kind of rabbit. I don't really know my rabbit species that well, so I'm not sure if that's actually a wood rabbit, but it could be. Either way, it's pretty impressive. I mean, it's not even a prominent piece of this image; it's just sort of sitting there in the middle of my lawn, and it's not even that clear of a photo, either. Imagine this scaled down to 224 by 224; there's really not going to be a lot of information there, but it still figured out that that's a rabbit. How about a fire truck? Here's a picture of a fire truck, and this isn't a normal fire truck either. This is at that same aviation museum where I took the picture of that warplane; it's sort of an antique fire truck that was used by the Air Force. But still, fire engine is the top prediction. Wow, that's kind of cool. I took a picture of my breakfast once at a fancy hotel in London; let's see what it does with that. A full English breakfast, mind you. When in London, one must eat as Londoners do. Actually, I don't know if they really eat full English breakfasts there, but it's still good. So it picked up that there's a dining table in this picture, there's a tray containing my food, a restaurant. Well, this was actually room service, but you could definitely imagine it's in a restaurant instead. So yeah, again, an impressive job on a random photo from vacation. It's never seen this picture before, and I put absolutely no thought into making sure this was a picture that would work well with machine learning or artificial intelligence for image classification. Let's keep going. When I was in England, I visited some castles in Wales.
Here's a picture of a castle. I did go to Wales, guys; it's beautiful there. And yeah, castle is its top prediction. Its second guess was a monastery or a palace, both good guesses, but yeah, it's a castle. And you know, it's not even a typical-looking castle, and it still figured it out. This is incredible stuff. All right, let's see if I can trip it up. I also took a trip to New Mexico once and visited what's called the Very Large Array. This is basically an array of giant radio astronomy dishes; with only 1,000 classifications, I wouldn't have imagined it would get this right. So there's the picture; it's just a bunch of giant radio astronomy telescopes. And it says it's a radio telescope. This is mind-blowing stuff, guys. All right, one more. I took a picture of a bridge once; do you remember which bridge it is? London Bridge, apparently. So, okay, what does ResNet50 say? A suspension bridge, and it also sees a pier, a chain, and a chain-link fence in there as well, for good measure. So that's pretty impressive, right? I mean, if you need to do image classification, you don't even need to know the details of how convolutional neural networks work, or how to tune them, or how to build up the right topology and iterate on the right hyperparameters. You can just use somebody else's work; they already did all that, and by using model zoos, from the Caffe Model Zoo or elsewhere, for a lot of common problems you can just get up and running in a couple of lines of code. It's never been easier to use artificial intelligence in a real-world application. So although it's good to understand the fundamentals, especially if you're going to be doing something that nobody's ever done before, for common AI problems there's been so much research in recent years that there's a good chance somebody has already solved the problem you're trying to solve, and you can just reuse their results if they were kind enough to publish them in a model zoo somewhere. Wow. So yeah, try it out on some photos of your own, too. If you have some, just throw them in the course materials and call my classify function on them and see what it does with them; just have some fun with it. You can also try some different models and see how they behave differently. ResNet50 was actually the model that worked the best for my photos, but there are other models included with Keras, including Inception and MobileNet, that you might want to try out. If you do want to play with them, you'll have to refer to the documentation; there's a link to it here. You do need to know what image dimensions each model expects as input, for example, or it won't work at all. So yeah, give it a try, and man, it's mind-blowing stuff, guys. Just kind of sit there and let it sink in that it's that easy to use AI right now.
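For reference, here's a minimal sketch of the kind of classify helper described in this lecture, written against the standard Keras ResNet50 API. The exact code in the course notebook may differ, and the file name bunny.jpg is just an example from the lecture.

```python
# A minimal sketch of a classify() helper using Keras's built-in ResNet50.
import numpy as np
from tensorflow.keras.applications.resnet50 import (
    ResNet50, preprocess_input, decode_predictions)
from tensorflow.keras.preprocessing import image

# Load the pre-trained model once, with weights learned from ImageNet.
model = ResNet50(weights='imagenet')

def classify(img_path):
    # ResNet50 requires 224x224 input images.
    img = image.load_img(img_path, target_size=(224, 224))
    x = image.img_to_array(img)      # convert the image to a NumPy array
    x = np.expand_dims(x, axis=0)    # add a batch dimension
    x = preprocess_input(x)          # ResNet50-specific preprocessing
    preds = model.predict(x)
    # Print the top three human-readable guesses.
    for (_, label, score) in decode_predictions(preds, top=3)[0]:
        print(f'{label}: {score:.3f}')

classify('bunny.jpg')  # e.g., a photo from the course materials
```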
16. Tuning Neural Networks: Learning Rate and Batch Size Hyperparameters: Let's talk a bit about tuning your neural networks. This is not stuff that is typically taught, but I'm going to try to convey it as best I can. So let's talk about learning rate first of all. What do we mean by learning rate? Well, you need to understand how these neural networks are trained. They're trained using a technique called gradient descent, or something similar to it; there are various different flavors of it out there. The basic idea is that we start at some random point of weights in our neural network, and we sample different solutions, different sets of weights, trying to minimize some cost function that we've defined, over several epochs. Those are the key words there: we have many epochs, iterations over which we train, and at each epoch we try a different set of weights on our neural network, trying to minimize some cost function, which might be based on how well the model makes predictions on our validation set. We need to have some rhyme and reason as to how we sample those different solutions, those different sets of weights, if you will. If we were to boil this down into a two-dimensional graph, it might look something like this, where we're just sampling different points along a curve of solutions, and we're trying to find the one that minimizes the cost function; that's the y-axis here. So what we're trying to find is the lowest point on this graph, and we're trying to get there by sampling it at different points and learning from each previous sample. That's what gradient descent is all about. So the learning rate is all about how far apart those samples are. You see here, we might have started up here, and our learning rate said, okay, I'm going to try another point here, and try again here, and so on and so forth, until I finally find the lowest point along this curve and call that my best solution. So it's not too hard to understand the effect of learning rate on your training, right? If you have too high a learning rate, you might overshoot the solution entirely. Imagine my learning rate was huge and I went straight from here to here: I might miss that bottom point entirely. But you can see that if my learning rate is too small, I'm going to be sampling a whole lot of different points, and it's going to take a lot of epochs, a lot of steps, to actually find that optimal solution. So too high a learning rate might mean that I overshoot the correct solution entirely, but too small a learning rate will mean that my training might take longer than it needs to. Now, learning rate is an example of what we call hyperparameters. It's one of the knobs and dials that you use while training your deep learning model that can affect its end result. And oftentimes these hyperparameters can have just as much influence on the quality of your model as the topology of the model, the feature engineering you've done, and everything else. So it's just another piece of the puzzle that you need to arrive at experimentally. In addition to learning rate, another important hyperparameter is the batch size, and this is how many training samples are used within each training step. Now, hammer this into your heads, guys, because it's kind of counterintuitive. You would think that a large batch size would be a good thing, right? The more data the better. But no, that's not how it ends up working.
It turns out that if you have a small batch size, it has a better ability to work its way out of what we call local minima. In this example here, you can see that we have a minimum, sort of a dip in the graph, where we have a pretty nice low loss function value; whatever we're trying to optimize for is pretty good here. But there's a risk during gradient descent that we get stuck in that local minimum, when in fact the better solution is over here somewhere. So we want to make sure that during the process of gradient descent, we have some ability to wiggle our way out of that dip and find the better solution. It turns out that smaller batch sizes can do that more effectively than larger ones. A small batch size can wiggle its way out of these local minima, but a large batch size might end up getting stuck in there, basically weighting it down, if you will. So batch sizes that are too large can end up getting stuck in the wrong solution. And what's even weirder is that because you will usually randomly shuffle your data at the beginning of each training epoch, this can manifest itself as very inconsistent results from run to run. If my batch size is just a little bit too big, maybe sometimes I'll get stuck in this minimum and sometimes I won't, and I'll see that in the end results: from run to run, sometimes I'll get one answer and sometimes I'll get another. So hammer this into your head, guys: smaller batch sizes tend not to get stuck in local minima, but large batch sizes can converge on the wrong solution at random. A large learning rate can end up overshooting the correct solution, but small learning rates can increase the training time. So remember this, write it down; it's important stuff, and again, it's an example of things that most people just learn the hard way through experience, but I'm trying to teach it to you up front here.
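As a concrete reference, here's roughly where these two hyperparameters show up in Keras code. This is a minimal sketch, assuming a model and some training data (model, train_x, train_y) already exist; the specific values shown are just starting points to experiment with.

```python
# A minimal sketch of where learning rate and batch size appear in Keras.
# `model`, `train_x`, and `train_y` are assumed to be defined already.
from tensorflow.keras.optimizers import Adam

# Learning rate: how big each gradient descent step is. Too high and you
# may overshoot the minimum; too low and training takes longer than needed.
model.compile(optimizer=Adam(learning_rate=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Batch size: how many samples are used per gradient update. Smaller
# batches can wiggle out of local minima; larger ones may get stuck.
model.fit(train_x, train_y,
          batch_size=32,      # try 32 vs. 256 and compare run-to-run results
          epochs=10,
          validation_split=0.2)
```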
17. Deep Learning Regularization with Dropout and Early Stopping: Let's dive into regularization techniques in the world of neural networks. What is regularization, anyway? Well, basically, regularization is any technique that is intended to prevent overfitting. And what is overfitting? Well, if you have a model that's good at making predictions on the data it was trained on, but it doesn't do so well on new data it hasn't seen before, then we say that model is overfitted. That means it's learned patterns in your training data that don't really exist in the general sense, in the real world. So if you see a high accuracy on your training dataset but a lower accuracy on your test set or your evaluation dataset, that's nature's way of telling you that you might be overfitting. Let's take a step back here; this is probably the first time I've used the term evaluation dataset. If you're new to this world: in deep learning we typically talk about three different datasets. We have the training dataset; this is the actual training data fed into your neural network, and it's what we actually train the network on. Then, as we train each epoch, we can evaluate the results of the network against an evaluation dataset; basically, that's a subset of the training data that's set aside to evaluate the results and the accuracy of your model as it's being trained. And then we can also have a testing dataset that lives outside of all of that, so once we have a fully trained model, we can use our testing dataset to evaluate the complete, finished model, if you will. So again, if your training accuracy is a lot higher than the accuracy measured against your evaluation data, or against your testing data at the end, it probably means you're overfitting to the training data. The graph on the right here makes it a little bit easier to understand. Imagine I'm trying to build a model that just separates things that are blue from things that are red. If you eyeball this data, your brain can pretty much figure out that there's probably a curve that separates where the blue stuff is from where the red stuff is, right? But in the real world, data's messy; there's a little bit of noise in there too. So if a model were overfitting, it might actually learn that green curve that's snaking in and out of all the data points to try to fit the training data exactly. But that's just noise, right? Just looking at it, your brain knows that that's not correct, but your neural network doesn't really have that intuition built into it. So we need regularization techniques to prevent that from happening, to prevent a neural network, or any machine learning model, from curving and undulating and making these high-frequency paths out of its way to overfit its model to the data. All right, that's overfitting, and that's a good way to generalize it: the so-called correct answer, the correct model, would be the black line, but an overfitted model would be more like the green line. And this is something that really happens in neural networks. If you have a really deep neural network with lots of weights and connections and neurons built into it, it can totally pick up on complex patterns like that, so you do have to be careful with it. That's where the world of regularization techniques comes in. Let's go into some.
A very simple issue might be that you just have too complex a model. Maybe you have too many layers or too many neurons; you could have a deep neural network that's too deep, or maybe too wide, or maybe both. By simplifying your model down, you restrict its ability to learn those more complicated patterns that might be overfitting. If the correct answer is just a simple curve like that, it could probably be achieved through a regression; maybe you're better off with a simpler model. And the simplest regularization technique is simply to use fewer neurons or fewer layers. That is a totally valid thing to do; sometimes you need to experiment with that. So if you find that your model is overfitting, probably the simplest thing is just to use a simpler model. Try fewer layers, try fewer neurons in each layer, and see what kind of effect that has. If you can still get the same accuracy on your test dataset without overfitting to your training dataset, then why use more neurons than you need? Another technique is called dropout, and this is kind of an interesting one. The idea with a dropout layer is that it actually removes some of the neurons in your network at each training step. And this has the effect of basically forcing your model to spread its learning out amongst the different neurons and layers within your network. By dropping out specific neurons that are chosen at random at each training step, we're basically forcing the learning to spread itself out more, and this has the effect of preventing any individual neuron from overfitting to a specific data point. So it's a little bit counterintuitive that removing neurons from your neural network can make it train better, but that's what happens: it prevents overfitting. That's what dropout is all about, and again, it's a very effective regularization technique. We see this a lot in, say, CNNs, for example; it's pretty standard to have a fairly aggressive dropout layer there, maybe even 50 percent of neurons being dropped for each training pass. So that's all dropout is: removing some neurons at random, at each training step, to force your model to spread its learning out a little bit better. That has a regularization effect that prevents overfitting. Another very simple solution is called early stopping. Let's take a look at this printout as we're actually training a real neural network. You can see that if you look at the accuracy on the validation set, that's the right-hand column there, we've gone from 95 percent to 97 percent and things are getting better, and then all of a sudden we get up to around 98 percent and things start to get weird; it starts to oscillate, right? So we can say, just by looking at this, that after around epoch five we're not getting any more benefit by training further. In fact, we might be doing more harm than good, because at this point we're probably starting to overfit. And indeed, if you look at the training set accuracy, that's the first column of accuracy, the second column of numbers in this display, the accuracy on the training set continues to increase as we train more and more epochs. But the accuracy on the validation set pretty much stopped getting better at around five epochs. So this is pretty clearly starting to overfit beyond the fifth epoch. Early stopping is a way of automatically detecting that.
It's an algorithm that will just say: okay, the validation accuracy has leveled out while the training accuracy is still increasing, so we should probably stop now. So early stopping just means, okay, I know you wanted ten epochs, but I can see that after five, things are just getting worse as far as overfitting goes, so we're going to stop at five, guys; we're done here. That's all early stopping is about. It's just making sure that you're not training your neural network further than you should, and that prevents overfitting. A very simple solution there.
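Here's a minimal sketch of what both of these techniques look like in Keras. The layer sizes and the flattened 784-pixel input shape are just illustrative, and train_x and train_y are assumed to exist.

```python
# A minimal sketch of dropout and early stopping in Keras.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

model = Sequential([
    Dense(512, activation='relu', input_shape=(784,)),
    Dropout(0.5),   # randomly drop 50% of these neurons each training step
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Stop training once validation accuracy stops improving, instead of
# blindly running all requested epochs.
early_stop = EarlyStopping(monitor='val_accuracy', patience=2,
                           restore_best_weights=True)

# `train_x` and `train_y` are assumed; epochs=10 is just an upper bound.
model.fit(train_x, train_y,
          validation_split=0.2,
          epochs=10,
          callbacks=[early_stop])
```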
18. The Ethics of Deep Learning: A lot of people are talking about the ethics of deep learning. Are we actually creating something that's good for humanity, or ultimately bad for humanity? So let's go there. Now, I'm not going to preach to you about sentient robots taking over the world. Maybe that will be a problem 50 years from now, maybe even sooner, but for the immediate future, it's the more subtle ways in which deep learning can be misused that you should concern yourself with. As someone entering the field, either as a researcher or a practitioner, it is up to you to make sure that this powerful technology is used for good and not for evil. Sometimes this can be very subtle: you might deploy a new technology in your enthusiasm, and it might have unintended consequences. That's mainly what I want to talk about in this lecture: understanding the unintended consequences of the systems you're developing with deep learning. First of all, it's important to understand that accuracy doesn't tell the whole story. We've evaluated our neural networks by their ability to accurately classify something, and if we see something like a 99.9% accuracy value, we congratulate ourselves and pat ourselves on the back, but often there's more to think about. First of all, there are different kinds of errors. There's what we call a type 1 error, which is a false positive; that's when you say that something is something it isn't. For example, maybe you misinterpreted a tumor measured by some biopsy taken from a breast sample as being malignant, and that false positive of a malignant, cancerous result could lead to real, unnecessary surgery for somebody. Or maybe you're developing a self-driving car, and the camera on the front of your car sees a shadow from an overpass ahead (this has actually happened to me, by the way) and slams on the brakes, because it thinks the road is just falling away into oblivion, into this dark mass, and there's nothing to drive on in front of you. Neither of those is a very good outcome. They could be worse, mind you: arguably it's worse to leave a cancer untreated than to have a false positive for one, and it might be worse to actually drive off the edge of a cliff than to slam on your brakes. But these can also be very bad, right? You need to think about the ramifications of what happens when your model gets something wrong. Now, for the self-driving car example, maybe it could take the confidence level of what it thinks is in front of you and factor in who is behind you, so at least if you do slam on the brakes for no reason, you can make sure there's not someone riding on your tail who's going to rear-end you or something like that. So think through what happens when your model is incorrect, because even a 99.9% accuracy means that one time out of 1,000 you're going to get it wrong. And if people are using your system more than 1,000 times, there's going to be some bad consequence that happens as a result. You need to wrap your head around what that consequence is and how you want to deal with it. The second type of error is a false negative. For example, somebody might have breast cancer but you fail to detect it; you may have misclassified it as benign instead of malignant. Somebody dies if you get that wrong. Okay?
So think very closely about how your system is going to be used, and about the caveats you put in place, the fail-safes and the backups you have, to make sure that if you have a system that is known to produce errors under some conditions, you are dealing with those in a responsible way. Another example of a false negative would be thinking that there's nothing in front of your self-driving car when in fact there is. Maybe it doesn't detect the car that's stopped at a stoplight in front of you. This has also happened to me. What happens then? If the driver is not alert, you crash into the car in front of you, and that's really bad; again, people can die. So people are very eager to apply deep learning to different situations in the real world, but often the real-world consequence of getting something wrong is quite literally a life-and-death matter. You need to really, really think about how your system is being used, and make sure that your superiors, the people who are actually rolling this out to the world, understand the consequences of what happens when things go wrong, and the real odds of things going wrong. You can't oversell your systems as being totally reliable, because I promise you they're not. There can also be hidden biases in your system. Just because the artificial neural network you've built is not human does not mean that it's inherently fair and unbiased. Remember, your model is only as good as the data you train it with. Let's take an example: say you're going to build a neural network that tries to predict whether somebody gets hired, based just on attributes of that person. Now, your model itself may be all pure and whatnot, but if you're feeding it training data from real humans who made hiring decisions, that training data is going to reflect all of their implicit biases. That's just one example. So you might end up with a system that is in fact racist or ageist or sexist, simply because the training data you provided was made by people who have these implicit biases, people who may not have even been fully aware of them at the time. So you need to watch out for these things. There are simple things you can do. Obviously, making an actual feature for this model that includes age or sex or race or religion would be a pretty bad idea, right? But I can see some people doing that; think twice before you do something like that. And even if you don't explicitly put in features that you don't want considered as part of your model, there might be unintended dependencies in your features that you might not have thought about. For example, if you're feeding years of experience into a system that predicts whether or not somebody should get a job interview, you're going to have an implicit bias in there, right? The years of experience will very definitely be correlated with the age of the applicant. So if your past training data had a bias toward, say, white men in their twenties who were fresh out of college, your system is going to penalize more experienced candidates who might in fact be better candidates, but who got passed over simply because they were viewed as being too old by actual people. So think deeply about whether the system you're developing has hidden biases, and what you can do to at least be transparent about what those biases are. Another thing to consider: is the system you just built really better than a human?
If you're building a deep learning system that the people in your sales department, or your management, or your investors really want to sell as something that can replace jobs and save companies money, you need to think about whether the system you're selling really is as good as a human, and if it's not, what the consequences of that are. For example, you can build deep learning systems that perform medical diagnoses, and you might have a very eager sales rep who wants to sell that as being better than a human doctor. Is it really? What happens when your system gets it wrong? Do people die? That would be bad. It would be better to insist with your superiors that the system is only marketed as a supplementary tool to aid doctors in making a decision, and not as a replacement for human beings making a decision that could affect life or death. Again, the self-driving car is another example: if your self-driving car isn't actually better than a human being and someone puts your car on autopilot, it can actually kill people. And I see this happening already, where self-driving cars are being oversold, and there are still a lot of edge cases in the world where self-driving cars just can't cut it where a human could. I think that's very dangerous. Also, think about unintended applications of your research. Let me tell you a story, because this has actually happened to me more than once. Sometimes you develop something that you think is a good thing, something that will be put to positive use in the real world, but it ends up getting twisted by other people into something destructive, and that's something else you need to think about. You need to think about how the technology you're developing might be used in ways you never anticipated, and whether those usages could in fact be malicious. This has actually happened to me a couple of times; I'm not talking theoretically here. And this isn't just limited to deep learning; it's really an issue with machine learning in general, or with any new, powerful technology. Sometimes our technology gets ahead of us as a species, socially. Let me tell you one story. This one isn't actually related to deep learning, but one of the first things I built in my career was a military flight simulator and training simulator. Its purpose was to simulate combat in sort of a virtual reality environment in order to train our soldiers to better preserve their own lives and come out of the battlefield safely. I felt that was a positive thing: hey, I'm saving the lives of soldiers. But after a few years, the same technology I made ended up being used in a command and control system. It was being used to help commanders visualize how to roll out real troops and actually kill real people. I wasn't okay with that, and I left the industry in part because of it. A more relevant example: back when I worked at amazon.com, I was one of the... well, I don't want to take too much credit for this, because the people who came up with the ideas were there before me, but I was one of the early people actually implementing recommendation algorithms and personalization algorithms on the Internet, taking your user behavior on the Internet and distilling it down into recommendations for content to show you. And that ended up being sort of the basis that got built upon over the years.
That ultimately led to things like Facebook's targeting algorithms, as another example. And when I look at how people are using fake news and fake accounts on social media to try to spread their political beliefs, or some ulterior motive that may be financially driven and not really for the benefit of humanity, I don't feel very good about that. I mean, the technology that I created at the time just to sell more books, which seemed harmless enough, ended up getting twisted into something that really changed the course of history, in ways that might be good or bad depending on your political leanings. So again, remember that if you have a job in deep learning and machine learning, you can go anywhere you want to. If you find yourself being asked to do something that's morally questionable, you don't have to do it. You can find a new job tomorrow, okay? This is a really hot field, and by the time you have real-world experience in it, the world's your oyster. If you find yourself being asked to do something that's morally questionable, you can say no; someone else will hire you tomorrow, I promise you, if you're any good at all. I see this happening a lot lately. There are a lot of people publishing research about using neural networks to crack people's passwords, or to illustrate how this could be used for evil, for example by trying to predict people's sexual orientation based just on a picture of their face. This can't go anywhere good, guys. What are you trying to show by publishing that sort of research? So think twice before you publish stuff like that, and think twice before you implement stuff like that for an employer, because your employer only cares about making money, about making a profit. They are less concerned about the moral implications of the technology you're developing to deliver that profit, and people will see what you're building out there, and they will probably take that same technology, those same ideas, and twist them into something you may not have considered. So I just want you to keep these ideas and these concerns in the back of your head, because you are dealing with new and powerful technologies here, and it's really up to us as technologists to try to steer that technology in the right direction, and use it for the good of humanity and not for its detriment. That sounds very high-level, high-horse preachy, I know, but these are very real concerns, and there are a lot of people out there who share them. So please consider these concerns as you delve into your deep learning career.
19. Variational Auto-Encoders (VAE's): All right, it's time to get into the fun part of this course. We're going to talk about generative modeling. This is the technology behind all those viral apps that let you swap faces around and age people and stuff like that. It's also the technology behind deepfakes, kind of a poster child for the ethics discussion we had earlier in the course, but I'll spare you my rant on that for now. Before we can talk about generative adversarial networks, though, which is the tech behind all that stuff, we need to talk about variational autoencoders, which are the underpinning of generative adversarial networks. So let's dive into VAEs first. Before we talk about variational autoencoders, let's first talk about autoencoders in general; here's kind of a diagram of how they work. An encoder learns how to reduce input down to its latent features. That's the left side of this diagram, the yellow part there. Basically, we take an input, which is often an image, and by using convolution, just like we saw in convolutional neural networks, we distill that down to some representation of the latent features, the latent vectors, of that input signal. So our encoder looks a lot like a CNN: it's using convolutional layers and max pooling and all that stuff to distill down the patterns in the input we're training it with, into these latent vectors, which we're representing here as the blue z box in the center. Nothing really new there. On the other side we have the decoder, which is basically the inverse of the encoder. It's being trained on how to reconstruct complete images, or complete data more generally, from those latent vectors in the z box. By using transposed convolutions and un-pooling and things like that, it learns how to take those latent vectors and reconstruct them into a full, complete image or a full, complete dataset. While we're training this, the objective is to get the input and the reconstructed input to be as similar as possible; that's what we're trying to optimize for. So x should be equal to x prime, where x is the original input images we're training on and x prime is the generated, reconstructed images based on those latent vectors learned through the training process. Pretty interesting stuff. So the system as a whole is trained such that the original input fed into the encoder is as close as possible to the reconstructed data generated by the decoder. I know, that's what I just said, but let me try to make it a more real-world example. Let's say we're training this on pictures of shoes, just to pick something out of the blue. The encoder would distill that down into sort of the essence of what makes each kind of shoe different. It's not going to be thinking about it in these terms, it's just a neural network, but what those latent vectors might ultimately represent is: this thing kind of fits the pattern of a sandal, this other thing kind of fits the pattern of a sneaker, and this other thing kind of fits the pattern of a boot that I've learned. It doesn't know how to label them or call them those things, but that's what it might learn in those latent vectors in the z box in the middle. And the decoder will learn how to take that and reconstruct a complete image of a boot or a sandal or a sneaker based on that latent vector.
And there can be more to it than just a classification, right? A very simple latent vector would just be a classification of what kind of shoe this is, but we can have more nuanced information in there too, and that's kind of the beauty of this system: it's a little bit more flexible. Once we've trained the whole system, we can actually throw the encoder away, because we really just used it for training the decoder. If you want to just generate synthetic images of different kinds of shoes, you can use the decoder to do that; the idea is you can generate a bunch of synthetic pictures of sandals or sneakers or whatever it is, at random if you want to. That's where the underpinnings of all those fabricated faces you see in viral apps come from. So there are a lot of applications for this. Going back to the ethics discussion, nobody really intended this to be used for deepfakes or deceiving people, right? The original goal was actually compression. You can think of this as sort of a noiseless way of compressing data down. If you want a really clean way of saying "I just want a picture of a sandal," here's a way to do that with basically just a few bytes of information, potentially, where it learns how to reconstruct that without any noise. It can also be applied to dimensionality reduction, which is obviously much the same thing as compression: we're taking that higher-dimensional input and boiling it down to those latent vectors, a lower-dimensional representation, instead. It also has applications in search; we could use it to distill a corpus of text data down to its relevant search terms. De-noising is another good application: the decoder can be used to reconstruct a noise-free image based on an image that does have noise in it. I've used that in Photoshop all the time, actually. Colorization is also a cool application. Maybe this thing can be trained to recognize that this picture of a person looks like they're wearing a pair of jeans in a black-and-white image, and based on the shape and the shade of gray in that image, we think it's a pair of jeans and maybe we should color it blue, because most jeans are blue. Again, it's not thinking about it in those terms, but those are the labels we might attach to what's going on under the hood. So the trick for doing this in the decoder is to use transposed convolutions instead of convolutions: instead of Conv2D layers, we're using Conv2DTranspose layers to reconstruct images from those latent features we talked about. It's going to learn all the weights to use to create a new image, with new pixels, from a lower-dimensional representation. And again, this can be used on more than just images; like we talked about, search and text information is another application, but images are the easiest example to wrap your head around. So with a transposed convolution, we're not distilling things down or shrinking images down; we're expanding them back into their original form. You'll often see it used with a stride of two. I've also seen some decoders that intersperse strides of one and two at different layers, if you need more complexity, but you'll probably see a stride of two in there somewhere.
And again, you'll be using un-pooling instead of the max pooling we talked about with convolutional neural networks. So think of the decoder roughly as a CNN that works backwards; that's a good way to wrap your head around it. Now let's talk specifically about variational autoencoders. What do we mean by that? In a variational autoencoder, those latent vectors are probability distributions, and this is how we might graph those probability distributions for a given set of categories or what have you. We represent those probability distributions by the mean and variance of a Gaussian normal distribution. So a variational autoencoder specifically uses Gaussian normal distributions, and the properties of those distributions, as the latent vectors it learns over time. Mathematically, we can express it like this: the input data x is encoded as a probability distribution p(z | x), where z is those latent vectors, and then we can reconstruct x using the probability of x given z, p(x | z), flipping that back around to reconstruct the original image. And this is the inspiration for generative adversarial networks, or GANs; we're getting there soon. We'll see that GANs are just another spin on variational autoencoders that's more general than using Gaussian normal distributions. One thing you're going to see in the code that we should talk about is something called the repara... wow, that's hard to say. The reparameterization trick. I'm not going to say that again, because it's a tongue twister. One mathematical problem with the idea of VAEs is that the probability distribution we're calling z can't be differentiated. Any time you have randomness in the equation, these random distributions, it throws a wrench into the calculus, right? And as you might recall from when we talked about training neural networks, we need the derivatives of the system for backpropagation and the actual learning to work. So the trick we use to get around that is to convert the random sampling of z into a deterministic form. Mathematically, we might write that as z = mu + sigma * epsilon. The specific Greek letters used don't matter; we're actually using different ones on the diagram here. But the idea is that we're introducing this new epsilon term, or whatever you want to call it, where the randomness gets moved off to. Epsilon is a random variable drawn from the standard normal distribution, and by pushing that out into its own term, we push the random step out of the network as an input. Then we have a connected graph again, one we can actually differentiate and do training and backpropagation on. I'm not going to get into the math of that too much; I just want you to know what it is. Also in the category of things you just need to know the name of is Kullback-Leibler divergence, or KL divergence for short. This solves the problem of measuring the distance between two probability distributions. What we need as we're training is to measure the distance between the probability distributions of our original data and the reconstructed data; we want those to be as close as possible, right? And one way to measure that is KL divergence. You'll sometimes hear this discussed alongside the earth mover's distance (strictly speaking, that's a different metric, but it's a similar idea for comparing distributions), which is named for an analogy of how much earth you would need to move to transform the United States into the shape of the United Kingdom; I think that's the actual example used.
But a simpler way to think about it is this diagram here. Say we have a shape made of three stacks of three blocks; it's basically, how many blocks do I have to move to get it to look like nine stacks of one block each, right? It's just how much information I need to move around to make these two distributions, these two graphs, line up with each other. Mathematically it looks like this: KL(P || Q) = the sum over x of P(x) * log(P(x) / Q(x)), which is equivalent to the cross-entropy of P and Q minus the entropy of P. So we sometimes also call this relative entropy. And as you might recall, cross-entropy is a really common loss function, which means we can use this as a loss function while we're training our variational autoencoder. Furthermore, we can decompose that into the function shown here. I'm not going to get into the mathematical derivation of it, but when you see that function, this is where it's coming from: that is the Kullback-Leibler loss function, expressed as you might see it in TensorFlow. So with that, let's dive into a notebook and see how VAEs actually work.
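To make both ideas concrete before we open the notebook, here's a minimal sketch of the reparameterization trick and the closed-form KL divergence term as they commonly appear in TensorFlow VAE code. The function and variable names here are illustrative; the course notebook's code may differ in detail.

```python
# A minimal sketch of the reparameterization trick and the KL loss term.
import tensorflow as tf

def sample_z(z_mean, z_log_var):
    # Reparameterization trick: z = mu + sigma * epsilon, where epsilon
    # carries all the randomness, so the rest of the graph stays
    # differentiable and backpropagation still works.
    epsilon = tf.random.normal(shape=tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_log_var) * epsilon

def kl_loss(z_mean, z_log_var):
    # Closed-form KL divergence between the learned N(mu, sigma^2)
    # and the standard normal N(0, 1), summed over latent dimensions
    # and averaged over the batch.
    kl = -0.5 * (1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
    return tf.reduce_mean(tf.reduce_sum(kl, axis=1))
```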
20. VAE's: Hands-On with Fashion MNIST: All right, let's see variational autoencoders in action, with a little bit of a twist on things. We're not going to use the traditional MNIST dataset of handwritten numbers; we're going to use the Fashion MNIST dataset, which is little pictures of pieces of clothing, so we're mixing it up a little bit here. What we're going to try to do is train our VAE to generate pictures of clothing. So let's see how it works. All right, let's dive in here. Open up the variational autoencoders notebook from the course materials if you'd like to follow along. If you do want to follow along, though, it's pretty important that you have a GPU available and that you have TensorFlow's GPU support installed; otherwise you're going to find that this takes a really long time to train. If you don't, just head over to this link at tensorflow.org/install/gpu, and that will walk you through what you need. You will need an NVIDIA graphics card, and you'll probably need to go to the NVIDIA developer website and sign up to get the cuDNN library installed as well. This walks you through what you need to do if you want to follow along and don't have TensorFlow GPU support installed. If you don't know whether you have GPU support already, that's what this first block checks. So let's go ahead and Shift+Enter in here, and it's just going to list how many GPUs I have access to from TensorFlow. It's also going to load up the TensorFlow library itself, so it will take a little bit of extra time; I can hear my hard drive chugging away as it loads up TensorFlow. All right, the chugging is slowing down. We have one GPU available; that's enough. But if it says 0, you're probably going to want to stop and either go install TensorFlow with GPU support or just watch the video without following along, because otherwise both VAEs and GANs will take a long time to train. We're also going to set a consistent random seed so I can get somewhat consistent results out of this notebook and not have any nasty surprises. This is pretty complicated; it's harder than you might think to set a consistent random seed and get consistent results, and that's because we're using a GPU. We're doing a bunch of training in parallel, and getting consistent results out of that, when you're dealing with randomly generated initial conditions and various randomness within the model itself, can get a little bit dicey. Even with all of this that we're doing here, we're not going to get perfectly consistent results, but they will at least be close. So let's go ahead and Shift+Enter on that as well. All right, let's talk about that Fashion MNIST dataset a little bit. Like I said, it's like the MNIST dataset we've been using before, which is just handwritten digits 0 through 9; the difference is that instead of numbers, we're looking at pictures of clothes. A little more interesting. The various classes we have available to us are T-shirts, trousers, and pullovers (I'm guessing this came from the UK or something; that's pants and sweaters where I come from), dresses, coats, sandals, shirts, sneakers, bags, and ankle boots. So you can see that the example I talked about in the slides with different kinds of shoes was not theoretical; we really are going to be looking at sandals, sneakers, and boots independently here. So let's start by importing our data and making sure that we understand its dimensionality.
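Here's a minimal sketch of that first data-loading step, assuming the Keras built-in Fashion MNIST loader; the notebook may arrange this slightly differently.

```python
# A minimal sketch of loading Fashion MNIST and sanity-checking its shape.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = \
    tf.keras.datasets.fashion_mnist.load_data()

# 60,000 training and 10,000 test images, each a 28x28 grayscale image.
print(x_train.shape)  # (60000, 28, 28)
print(x_test.shape)   # (10000, 28, 28)
```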
All we're going to do here is load up the Fashion MNIST dataset, which conveniently is built into keras.datasets already for us, and just check that everything is as we expect: 60,000 training images and 10,000 test images, each one a 28 by 28 grayscale image. All right, Shift+Enter, and nothing is complaining, so everything is as we expect. And it's always a good idea to take a look at the data and get a feel for it before you start messing around with it. So what we're going to do here is plot what this looks like in a grid: we'll take nine samples at random out of our 60,000-image training set and plot each one in grayscale, just plucking out nine at random. You can see we've got what looks like, I guess it would be called a sandal, a handbag, a dress, some pants (or trousers, as they call them), an ankle boot, I guess... honestly, I'm not sure what that one is. I think that's a handbag. So you can see the data's not the greatest to begin with, but that's what makes it challenging and interesting, right? Apart from that one thing I'm not sure about, everything else is pretty recognizable as a piece of clothing. All right, so the first thing we need to do is preprocess our data. First, we're going to combine our training and test data together. Why? Because this is not a classification problem; I'm not trying to figure out whether this is a boot or a pair of pants. I'm trying to create a system that can generate pictures of clothes in general. Because I'm not actually trying to test my ability to classify these things, the test dataset, with its label data that tells me whether I classified something right, isn't really useful for this problem. So instead, I'm just going to use it as extra training data. The first thing we do is concatenate the training dataset and the test dataset together into a single dataset called, creatively, dataset. I'm also going to add an extra dimension to that dataset, because our convolutional layers expect three dimensions of input, not just two. I'm also going to convert the data to floating point and normalize it to the range 0 to 1. The raw data is just integer data from 0 to 255 representing how bright each pixel is; this transforms it to a floating point number between 0 and 1, because the sigmoid activation values in our model are between 0 and 1. So we're just massaging the data here to better fit the input of our convolutional layers. Shift+Enter to run that. All right, so now things get interesting. The first thing we're going to set up is our sampling layer, and this is where that reparameterization trick (see, I said it right that time) comes into play. Remember, we need to move the random component out into an epsilon term while retaining mu and sigma, the mean and variance of the data coming in; actually, we're going to be using the log of the variance coming in here, as you'll see shortly. Our sampling layer is a custom layer, and it takes the mean and log variance as inputs. It then extracts what the batch size is by looking at the first dimension of the z_mean input coming in, and the dimensionality of that data by looking at the second dimension. We then compute epsilon; that's the random term we talked about in the slides. It's just a batch of normally distributed random numbers, shaped to match up with the z_mean data that's coming in.
So we're creating a bunch of epsilon random numbers, sized to whatever the batch size is times the number of dimensions we have. And what we return is just mu plus sigma times epsilon. It's a little more complicated than that only because, again, we're using the log of the variance here for training purposes later on. To convert that back to an actual standard deviation, we take the exponential of 0.5 times the log of the variance; some basic algebra there. But fundamentally, this is implementing the reparameterization trick we talked about: z equals mu plus sigma times epsilon. All right, next let's actually create our encoder model. Let's run that sampling layer before we forget; Shift+Enter. We're going to be using the Keras functional API to build this, because it's a little more complicated than usual; it's not really that hard, though. We're going to have this build_encoder function, and its job is to return a Keras model for the encoder part of our larger model. We start by setting up a sequential model that consists of two Conv2D layers: we start off with 128 filters, go down to 64, then flatten that out and dump it into a dense layer of 256 neurons at the end, with strides of two, like we talked about. Again, these are kind of hyperparameters: the topology of this, the exact number of filters you use at each convolution layer, those are things you can play with, as are the number of convolution layers you have and what the strides are. Like I said, I've seen people interleave strides of one and two in there for more complexity, if they need more neurons in their model for a more complex task. Again, it's a lot of trial and error to get this right; that's kind of the dirty little secret of deep learning, a lot of it is just experimentation and trying different things until you see what works best. All right, so we're going to pass our inputs through that convolutional block; we just call that sequential model we set up with the encoder inputs that come into this build_encoder function. Next, we create dedicated layers to learn the mean and the variance in parallel, so we send the output to two different layers here. Each is a dense layer of the dimensionality we pass in as a parameter to build_encoder (again, another hyperparameter we can play with); one is dedicated to learning the means, and one is dedicated to learning the logs of the variances as we go. We then call our sampling layer with the z_mean and z_log_var we just set up, dumping them into the layer where we apply that reparameterization trick to combine it all together. Finally, we return the actual Keras model composed of it all: a model with the encoder inputs, and with z_mean, z_log_var, and the latent vectors z as outputs. And there's a little note in the comments here: z_mean and z_log_var are not the final output of this encoder, but we will feed them into the KL divergence loss in a little bit. Now that we have our function to build the encoder, let's actually use it. We set up our input layer of 28 by 28 by 1; again, our input images from Fashion MNIST are 28 by 28 with one color channel, just grayscale. We call that encoder_inputs and pass it into build_encoder with a latent dimensionality of just two, along with the inputs we just set up, and we print out the summary just to do a sanity check. Shift+Enter. There we go.
All right, looks reasonable to me. Moving on: now we also need to implement our decoder. Again, it's kind of the inverse of what we just did. We're just going to be using Conv2DTranspose instead of Conv2D, because instead of trying to reduce images down to their latent vectors, we're expanding them from their latent vectors back into an image; we're going backwards here. The model starts off with a dense layer, then reshapes that to 7 by 7 by 64, and then we set up three Conv2DTranspose layers, starting at 128 filters, going down to 64, and finally down to the single image we should finish up with. This is potentially a little confusing, because you might ask yourself, why are we going down in size here when we're trying to make a bigger image? But this is the number of filters we're applying, not the size of the image we're producing; keep that in mind. All right, we return the model, again constructing the model (which we call L1 here) with the latent inputs being passed into build_decoder, and we call the result decoder. Then we actually build it by setting up the input shape: the shape coming in is just two, because we created the encoder with a dimensionality of two just above. We call build_decoder and print out the summary of the decoder model. Shift+Enter. Yep, looks reasonable to me. All right, next we need to set up our loss functions. Yes, plural: there are two of them. I didn't really make that explicit in the slides, I don't think, but it's definitely explicit here. One is the reconstruction loss, and that's what penalizes generated images that are not similar to the original images. Remember, our overall goal is for our decoder to generate images that are as close as possible to the original images fed into the autoencoder, and this is the loss function that measures that. It's just using binary cross-entropy on the original data fed in and the reconstructed data generated by our decoder. Pretty simple stuff there; Shift+Enter. And then we're also going to measure the KL divergence loss here as well. We talked about that in depth in the slides, so I'm not going to go over it again, but we did talk about where all these formulas came from. Again, this is looking at the distance between the probability distributions on both sides: the probability distribution of the original data versus the probability distribution of the generated data, and we want those to be as close as possible. We're measuring that with the KL divergence loss function here, so let's go ahead and define that as well. Shift+Enter. Well, we need an overall loss function at the end of the day; you can't really optimize two loss functions at the same time. That's what this next block does: it calculates the total loss as a function of both the reconstruction loss, which again measures how similar the original and reconstructed images are to each other, and the KL divergence loss, which again measures how close the probability distributions are. Now, we need to combine them together somehow. The obvious thing would be to just take the mean of the two, but it turns out that how you weight them is another important hyperparameter that needs to be tuned.
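Here's a minimal sketch of that weighted combination; the function name and the loss details are illustrative, and the weight of 3 matches the value discussed next.

```python
# A minimal sketch of combining the reconstruction loss and the KL loss
# into one total loss. The notebook's actual code may differ in detail.
import tensorflow as tf

KL_WEIGHT = 3.0  # the weighting hyperparameter discussed below

def calc_total_loss(data, reconstruction, z_mean, z_log_var):
    # Reconstruction loss: binary cross-entropy between the original and
    # reconstructed images, summed over pixels, averaged over the batch.
    bce = tf.keras.losses.binary_crossentropy(data, reconstruction)
    loss1 = tf.reduce_mean(tf.reduce_sum(bce, axis=(1, 2)))
    # KL divergence between the learned latent distribution and N(0, 1).
    kl = -0.5 * (1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
    loss2 = tf.reduce_mean(tf.reduce_sum(kl, axis=1))
    # Total loss = reconstruction loss + weight * KL loss.
    return loss1, loss2, loss1 + KL_WEIGHT * loss2
```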
So you can see we're getting pretty deep into the number of hyperparameters that need tuning in this variational autoencoder; that's why it's so difficult to get these things trained well. Now, this KL weight is that parameter: we're going to weight the KL divergence loss by that number, and it turns out to be a very important parameter for how good the final results are. I've done some experimenting already and settled on a value of three, but maybe you can do better; maybe a different weight will produce better results for you. Let's go ahead and Shift Enter. Again, all we're doing here is taking the reconstruction loss, calling that loss one, and the KL divergence loss, calling that loss two. We return both individual losses so we can keep track of them as we train, but the final total loss that we actually optimize on is the reconstruction loss plus the weight of three times the KL divergence loss. That's how we combine them into the overall loss function. All right, moving on. Because we have a custom loss function, we kind of have to make our own custom model that uses it. This looks like a lot of code, but most of it is just keeping track of those different loss functions so we can graph and visualize them later on. We create a VAE model here that derives from keras.Model, so it's just a custom model. We create a little constructor that sets up a total loss tracker, a CE loss tracker, and a KL loss tracker. These are used to keep track of the total loss (the combined loss we talked about, including that weight on the KL loss), the reconstruction loss, which here is called the CE loss, and the KL loss itself. And we declare those as all being observable in this block of code. Here's where we get to the actual meat of the model, by overriding the train_step function. We use a gradient tape in TensorFlow to explicitly define how the training works. For the forward pass, we just call self.encoder with the data coming in, and that returns, as we saw above, the z_mean, the z_log_var, and z itself. We then construct the reconstructed image by calling self.decoder, which gives us back our reconstructed images. So the encoder is running our encoder model to boil those images down to that latent sample z, and we're also keeping track of the means and variances independently here as well; then the decoder is called with that resulting z to try to reconstruct the image, and we call that the reconstruction. We then calculate the total loss between the two. Again, calc_total_loss looks at both the reconstruction loss and the KL divergence loss, and we keep track of them both independently so we can view them; the total loss is what's used for the actual training itself, which happens in the backpropagation phase, if you remember how neural networks work in general. So what we're going to do here is calculate the gradients based on the trainable weights we set up, then apply the gradients by zipping together those gradients and the trainable weights, optimizing on those trainable weights.
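Putting that together, here is a hedged sketch of the custom model. The loss trackers, the overridden train_step with a gradient tape, and the KL weight of three all follow the walkthrough; the compile-and-fit lines at the bottom match the training setup described next, where x_train is a stand-in name for the combined Fashion MNIST images and the learning rate is just a placeholder:

```python
KL_WEIGHT = 3.0  # the weighting hyperparameter discussed above

def calc_total_loss(data, reconstruction, z_mean, z_log_var):
    loss1 = reconstruction_loss(data, reconstruction)
    loss2 = kl_loss(z_mean, z_log_var)
    return loss1, loss2, loss1 + KL_WEIGHT * loss2

class VAE(Model):
    def __init__(self, encoder, decoder, **kwargs):
        super().__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder
        # Trackers so we can graph each loss separately later
        self.total_loss_tracker = tf.keras.metrics.Mean(name='total_loss')
        self.ce_loss_tracker = tf.keras.metrics.Mean(name='ce_loss')
        self.kl_loss_tracker = tf.keras.metrics.Mean(name='kl_loss')

    @property
    def metrics(self):
        # Declare the trackers so Keras resets and reports them each epoch
        return [self.total_loss_tracker, self.ce_loss_tracker, self.kl_loss_tracker]

    def train_step(self, data):
        with tf.GradientTape() as tape:
            # Forward pass: encode to a distribution, decode a sample from it
            z_mean, z_log_var, z = self.encoder(data)
            reconstruction = self.decoder(z)
            ce, kl, total_loss = calc_total_loss(data, reconstruction,
                                                 z_mean, z_log_var)
        # Backward pass: zip gradients with trainable weights and apply them
        grads = tape.gradient(total_loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        self.total_loss_tracker.update_state(total_loss)
        self.ce_loss_tracker.update_state(ce)
        self.kl_loss_tracker.update_state(kl)
        return {m.name: m.result() for m in self.metrics}

# Usage, matching the training setup described below:
vae = VAE(encoder, decoder)
vae.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
history = vae.fit(x_train, epochs=32, batch_size=128)
```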
What ties it all together is using that total loss that came from calc_total_loss, and baking it into how the backpropagation works. That's really where we incorporate our custom loss function into how this model is trained. Again, we keep track of all the individual loss functions, the KL loss, the reconstruction loss, and the total loss that combines the two, independently, so we can graph them all independently as well, and we return all three just so we can keep track of them. All right, Shift Enter to set that up; that shouldn't take long, but this next block will take a long time, because here we actually do the training. We set up our VAE model, passing in our encoder model and our decoder model; the VAE model puts them together. We compile that model using the Adam optimizer, and this learning rate is yet another hyperparameter that needs to be tuned. It's one I had to tweak quite a bit while experimenting with this notebook myself. Finally, we call fit to actually do the training itself. We're going to use 32 epochs of training and a batch size of 128. Yet more hyperparameters that need to be tuned. In general, more epochs is better, but if you're finding that the model's not stable, which is pretty common, more might not be better. So again, experimentation is needed to see how many epochs you really need, once you get everything else right, to get good results, and what batch size makes sense. It's very easy to overfit here, and very easy to get stuck in a local minimum and not get out of it; that's also something that happens a lot when you're training these models. So I'm going to go ahead and kick this off, and even with a GPU it's going to take a while. What I'm going to do, through the magic of video editing, is pause this recording and come back when it's done. Let's watch the first epoch run at least before I do that. I think the act of recording this video was actually competing for resources on my GPU. There it goes. You can see here that we can watch the reconstruction loss, the KL loss, and the total loss coming together. And you can kind of see why we had to weight that KL loss a bit: those are much smaller numbers, at least at first, compared to the reconstruction loss. So the reconstruction loss is really dominating the training right now in its contribution to the total loss. As we go over time, we should see that reconstruction loss getting smaller and smaller, and as that happens, the KL loss will be more of a factor as training continues. So really, the more epochs you have, the better it gets and the more that KL loss will come into play. All right. Again, as I said, I'm just going to pause this and come back when it's done. All right, that took about 10 minutes even with a GPU, but our training has finally wrapped up, and we can eyeball what happened by looking at the loss functions reported over time. You can see we started off with a total loss of 313 and got down to about 266. You can definitely see, just by eyeballing it, that it was reaching a point of diminishing returns; it was really struggling by the time we got to epoch 31, and it actually went back up at 32. So 266 seems to be about as good as we're able to get. Whether that's because it found the best solution or because it got stuck in a local minimum, well, I guess we'll find out.
You can also see that the reconstruction loss is really much larger than the KL loss, so it was playing a much larger role than the KL loss as we went. It might make sense to experiment with a larger weight on the KL loss; if you have time, you might want to play around with that a little bit. It's also possible we got stuck in a local minimum here. I guess we'll see how good the results are in a moment, but if so, experimenting with the batch size might be a good way to get out of that minimum more easily. Anyhow, we can kind of eyeball what's going on, but let's plot it, because we went through all this trouble of actually keeping track of those numbers. And look, here we can see plotted the total loss, the reconstruction loss, and the KL loss. Again, the KL loss is a much smaller value, so it's hard to see what's going on there. You can see that after just a couple of epochs it was really struggling to decrease that loss function further. The KL loss is hard to see because it's so small in comparison, so let's zoom in and look at the KL loss independently. You can see that it was actually getting worse over time, which is kind of interesting, right? Definitely not the direction you want a loss function to be going, but at least it wasn't going up exponentially; it at least started to level out. This might suggest that training for even more epochs would have been beneficial, because we were getting to a point where we couldn't squeeze anything more out of the reconstruction loss, but there were still improvements to be made on the KL loss. If we went further in training, we might have seen that KL loss start to drop as the model turned its attention to making that better. All right, so let's see what kind of results we got. We're just going to pick a mu of one and a sigma of two and see what we get out. So again, the idea is that now that we've trained the model, we can throw out the encoder and just use the decoder to construct synthetic images. Let's see what a latent point of (1, 2) gives us. We call the decoder, ask it to predict (actually, to generate) an image based on the input of one comma two; again, that corresponds to our mean and variance. We plot that as a 28 by 28 grayscale image and see what we get back. Hey, that's pretty cool. We have synthetically created what is apparently a pair of pants, and it looks reasonable, right? I'm pretty happy with that. The training worked. Awesome. Let's go further and actually generate 256 images now, entirely at random. What we're going to do is just guess at the z distribution each time with a random mu and sigma. And don't worry about the guessing here; we can actually get back the mu and sigma associated with each category systematically, or at least approximate it, if we want to. But for now, let's just generate 256 random images. We guess with random normal distributions for both mu and sigma at a scale of four and see what we get back. So this constructs 256 of those random distributions, calls the decoder on that entire array of input values, and plots them all out one at a time, again as 28 by 28 grayscale images in a 16 by 16 grid. And there we have it: 256 synthetically generated images of clothing. And these are actually pretty darn good. Wow, I'm happy with that.
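Those two generation steps might look roughly like this, assuming the trained decoder from above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Decode a single hand-picked latent point (mu=1, sigma=2)
synth = decoder.predict(np.array([[1.0, 2.0]]))
plt.imshow(synth[0].reshape(28, 28), cmap='gray')
plt.show()

# Decode 256 random latent points, drawn from normal distributions at a scale of 4
z_samples = np.random.normal(0, 4, size=(256, 2))
images = decoder.predict(z_samples)

# Plot them all as 28x28 grayscale images in a 16x16 grid
fig, axes = plt.subplots(16, 16, figsize=(16, 16))
for img, ax in zip(images, axes.flat):
    ax.imshow(img.reshape(28, 28), cmap='gray')
    ax.axis('off')
plt.show()
```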
So yeah, I mean, this isn't really much worse than the source imagery. I think we got lucky that time and actually hit on a real solution during training. I see sweaters, I see shirts, I see pants, I see sandals, I see ankle boots. We really got lucky on this one. Good results, I'd say, so far. So that's all well and good: we can generate random images of pieces of clothing. But what if I want to generate a specific kind of clothing? How do I do that? What if I just want to draw a pair of sandals or something? Well, one way to go about it would be to run a known example of a given category through the encoder, observe the mu and sigma that come back for that image, and send those back into the decoder to try to get an image similar to the one you just threw at the encoder. It's not perfect; it's not a totally concrete way of going about this. There is something called a conditional variational autoencoder if you want a more rigorous approach, but this is a reasonable way to do it. You know: take a picture of a shirt, put it through the encoder, say "I want a picture that looks like this," and you'll probably get a shirt back. So let's just pick an image, number 11280, whatever that is in our training dataset. I'm going to expand that to three dimensions, the way the encoder expects it, convert it to floating-point values between 0 and 1, and print out the shape just to double-check it. Then we send it to our encoder, ask it to predict what the probability distribution will be for that specific image, and pass that distribution into our decoder to get back a synthesized image. We'll see what exactly that distribution is by just typing z at the end there to print it out. Shift Enter. All right, we can verify that we have a 28 by 28 by 1 image, just like we expected, and what came back from the encoder was a distribution with a mean of negative 0.427 and a variance of 1.259. All right, cool. So let's take that distribution and see what it gives us. We already called the decoder on it and saved the result in synth, so let's visualize what's in synth, alongside the original image as well. We set up a plot here: we plot the training image that we fed into the encoder, the thing we want our output to be similar to, and the synthesized image that should be similar to it. Shift Enter. And yeah, sure enough, it turns out image number 11280 is a dress, and we got back a shape that looks kind of like a dress. So by doing this, we were able to synthesize a specific category of clothing. That's one way to go about using a VAE. Another thing you can use VAEs for is basically unsupervised learning. So let's visualize what those probability distributions look like if we color them by the actual known classifications. I'm going to feed in the labels from our training and test data here and plot the means and variances across our entire dataset from the encoder. So we throw the entire dataset into our encoder, plot the resulting probability distributions, and color them based on their known categories. Got it. Let's go ahead and Shift Enter and see what that looks like. And there you have it: you can see those clusters.
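Both of the ideas just described, reconstructing a specific known image and plotting the latent distributions colored by label, can be sketched like this; train_images and train_labels are hypothetical names for the raw Fashion MNIST arrays:

```python
# Run one known training image through the encoder, then decode its latent point
img = train_images[11280].reshape(1, 28, 28, 1).astype('float32') / 255.0
print(img.shape)  # sanity check: should be (1, 28, 28, 1)
z_mean, z_log_var, z = encoder.predict(img)
print(z)  # the latent point for this specific image
synth = decoder.predict(z)

# Show the original next to the synthesized lookalike
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.imshow(img.reshape(28, 28), cmap='gray')       # original training image
ax2.imshow(synth[0].reshape(28, 28), cmap='gray')  # synthesized reconstruction
plt.show()

# Plot the latent means for the whole dataset, colored by known class labels
all_imgs = train_images.reshape(-1, 28, 28, 1).astype('float32') / 255.0
means, _, _ = encoder.predict(all_imgs)
plt.figure(figsize=(10, 8))
plt.scatter(means[:, 0], means[:, 1], c=train_labels, cmap='tab10', s=2)
plt.colorbar()
plt.show()
```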
We can actually visualize that there are distinct probability distributions for these different kinds of clothes. Now, I don't know what each color represents; maybe the purple ones are pants, maybe the green ones are dresses. But you can think of this as a form of unsupervised learning, where those distinct regions of probability distributions probably correspond to classifications, different kinds of things in our source data. So if I didn't know what those labels were ahead of time, maybe I could infer them by digging into what these different probability distributions tie back to in our source data. That's another possible application of VAEs: unsupervised learning of categories. But the more interesting application is generating synthetic images, and next we'll build on that with GANs.
21. Generative Adversarial Networks (GANs): All right, now that we have variational autoencoders under our belt, let's talk about generative adversarial networks. It's a similar idea, but different; different enough that it's its own thing. Yes, this is the tech behind deepfakes and all those viral face-swapping and aging apps you've seen before. For example, this is a picture of someone who doesn't exist. It's the output of a generative adversarial network that has been trained to generate realistic-looking images of people's faces. Real thing. And again, I'm not going to get into the ethics of this; we already lectured about that earlier in the course. But this is the tech behind deepfakes, and it's also the tech behind all those viral apps for face swapping, aging people, making people look like Disney characters, and all that stuff. Researchers had nobler intentions for this work originally. One of the applications envisioned was generating synthetic datasets when the real data contains private information. This is especially useful in the medical field, where you can't easily get real training data because of privacy laws. So if you're trying to build a new neural network that can learn to detect breast cancer or something, it's tough to get real data to train it on. But by training a GAN on real data, we can teach it to make synthetic datasets that are very close to the original ones, without any actual private information in them. That's one practical application of GANs that is not just a viral app on your phone. GANs can also be used for anomaly detection: comparing an image to what the model thinks an image should look like, and detecting anomalies automatically that way. They have applications in self-driving cars, and in art and music too. You could train a GAN to create artwork in the style of Picasso, or whoever your favorite artist is, or to generate a symphony in the style of Beethoven or Mozart. And this stuff really works; you can generate synthetic art pieces and synthetic works of music using GANs that are pretty convincing. So when you see all these impressive demos of an AI that made its own symphony, this is how it works, and I think that's about to be demystified for you. It's not really as complicated as you might think. All right, to understand how GANs actually work, you've got to really noodle on this diagram; it sums up how it all comes together, and as you can see, it's not that hard. First of all, we don't assume Gaussian normal distributions in the latent vectors that we're learning, like VAEs do. In the case of GANs, it could be anything. But the thing that's really different here is that the generator maps random noise to those latent distributions, whatever they might end up being. By taking random noise as input, we can generate a random whatever-it-is as the output of the generator. So the generator learns how to take some sort of random signal and make a random face, or a random piece of music, or random art, or a random dataset. That's where the randomness comes in as the input to the generator. And on the other side, we have a discriminator that is trying to learn the difference between the real images that we're training the system as a whole on, and the generated images coming from the generator.
Okay, so this is really the heart of it all. We have this generator that's learning how to generate whatever sorts of images or data we're trying to create, and the discriminator's job is to detect whether those generated images can be distinguished from the real images in general. So say you're training this on a set of example images of faces: we give it a bunch of real face images, and the discriminator learns how to tell those real faces apart from generated faces in general. When we get to the point where the discriminator can no longer tell the difference, that's when we're done training. So you have this adversarial network going on here. The adversarial part is that the generator is an adversary to the discriminator. The generator is always trying to fool the discriminator into thinking the images it's creating are real, and the discriminator is trying to catch the generator in its lies. They're in tension with each other, and that makes it a very tricky thing to train; in practice it's a very fragile system to actually get going. But once it works, it works really well. That's why I'm saying that once the discriminator can no longer tell the difference between the real faces and the generated faces, or the real whatevers and the generated whatevers, we're done training, in theory. In practice, it's really hard to get this trained right: there are a ton of hyperparameters to tune, and it ends up being very unstable, so an awful lot of trial and error is needed to actually train one of these things and get good results out of it. But when you make that effort, you end up with some pretty impressive stuff from GANs. Before I move on, I want to let you noodle on that diagram a little bit more, because this really is the heart of understanding GANs. So: we have random noise going into a generator that learns how to fabricate whatever it is, faces in this example. We then train the system as a whole with real faces as well. The discriminator is being trained to distinguish the real images or data from the fake, in general. Over time, the discriminator has a harder and harder time telling what's real and fake, as the generator gets better and better at generating convincing fake images. Once the discriminator can no longer tell the difference, we have a really well-trained generative adversarial network going on here. Okay, so that's kind of the heart of it. We're training a generator to generate fake data, and a discriminator that is trained to tell the difference between real data and fake data. When these things come together, we have a generator that can generate stuff the discriminator cannot distinguish from the real stuff. Okay, fancy math. This is what it all comes down to. I'm not going to get into it too much, but that is the adversarial loss function for the system as a whole. We call it a min-max game, and that is worth talking about. Again, the generator is trying to minimize its loss in creating realistic images, while the discriminator is maximizing its ability to detect fakes. So when we say min sub G, that's the generator minimizing its loss, and max sub D is the discriminator maximizing its ability to detect fakes. That's what all the fancy math means. And like I said, it's all very complicated and delicate.
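For reference, the standard form of that min-max objective, as given in the original GAN paper (the slide's formula is presumably this one or a close variant):

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]$$

Here $D(x)$ is the discriminator's estimated probability that a real sample $x$ is real, and $D(G(z))$ is its estimate for a sample generated from noise $z$. The discriminator $D$ tries to push the whole expression up, while the generator $G$ tries to push the second term down by making its fakes score as real.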
The training can be very unstable. There's a lot of tuning, a lot of trial and error involved in getting it to work well. Just making the notebook we're about to look at took a long time to put together, to get halfway decent results out of it. It can also take a very long time to train, and a lot of computational resources. But once you have it, it's a very efficient way of creating fake images of faces, or whatever it is you're trying to create. Some other problems it runs into: there's something called mode collapse. The problem is that if you have different kinds of something, you know, different kinds of faces, different kinds of pictures of shoes, whatever it is, the system can learn to make just one of those things very efficiently and very convincingly, and that will still result in a low loss function across the system as a whole. So it's not uncommon, for example, if you're trying to train it to create fake images of shoes, for it to only really learn how to make fake images of a sandal. That's the problem called mode collapse, where it just learns how to make one specific kind of thing really well, but it's not as general as we want it to be. GANs also suffer from the vanishing gradient problem a lot; we talked about that earlier in the course. And with that, let's go look at some examples of GANs in action, because I think it makes a lot more sense when you can see what's happening in real time under the hood. Then we'll dive into a notebook and actually get hands-on with using one.
22. GAN Demos & Live Training: So to help you understand how GANs are trained, let's look at a few hands-on interactive examples here, and then after this we'll go through a notebook in more detail. A really nice tool for visualizing how this training works is GAN Lab. Let's talk about what's going on here. So instead of starting off by trying to generate fake images, let's start with something simpler, easier to wrap our heads around. Let's just choose a two-dimensional data distribution. What I'm going to do is try to create a GAN that learns how to create a ring. Okay? So we have this distribution of 2D points in this general ring shape, and what we want to do is train a GAN to take random input and generate a distribution that matches it as well as possible. In this visualization, we'll see the real data points plotted in green, coming from our real distribution, and the fake ones created by our generator showing up in purple as we train. As the training goes on, we'll also be able to visualize, as a heatmap, the loss landscape of the generator and the discriminator. So as we go, we'll see the discriminator trying to classify these points as real or fake, and what we should see is that it eventually converges around sort of a ring shape, where it identifies things in the ring as real and things outside the ring as fake. And we can actually see over time how well the generator is doing at tricking the discriminator, and how well the discriminator is doing at telling real and fake apart. So let's go ahead and kick this off. Let's hit play and watch it in action. You can see that our fake data is kind of going all over the place at first as it starts to learn, but pretty quickly it starts to fall into more and more of a ring shape. And we can see our discriminator's heatmap here. Right now the discriminator is saying: okay, fake ones are over here, real ones are here. The purple, again, is fake and the green is real. Already we can see that we're getting sort of this ring, this circle of white, where it's starting to put the green "real" classification around that ring and the purple "fake" classification everywhere else. As we go on, it should get tighter and tighter; already we're seeing those generated purple fake samples trying to fall more and more within the actual distribution that came from the original real data. Let's take a look at the metrics over here. We can see that over time the discriminator's loss function is decreasing but kind of holding steady, because it's starting to get harder and harder to tell the two apart, right? But the generator's loss function is starting to stabilize. It could be better; it gets a little bit wonky here. We can see that we've maybe hit a local minimum there and gotten a little bit unstable, and it's kind of trying to work its way back to something better. And again, you know, the thing with GANs is that they're very unstable, so it really takes just the right set of hyperparameters for this to work right, and a stroke of luck, quite frankly, because there are some random components to the training going on here. But eventually it does seem to be getting back to where it should be. I think it's getting out of that and starting to get back to where it wants to be.
We have kind of this weird crescent shape going on in the discriminator right now, which obviously is not correct. Circular shapes are notoriously hard to learn, and those generated samples are really kind of falling into a line; they're not really a circle. It was actually better earlier. Oh wait, I think we're getting back into a better state here. All right, now we're starting to converge to something that looks a little bit better. Yeah, you can see that starting to stabilize. And we can see in the graphs here that the generator's loss function is decreasing and the discriminator's loss function is starting to stabilize. Those are all pretty good signs, but again, you know, not awesome. We can see the overall manifold of the generator here. It's not really that circle shape, but at least it's getting closer. It's kind of falling off the edge of that distribution, and it can't really seem to get out of that in this case. So we're kind of stuck here, it seems. If we were to start it again, we'd probably get different results, so let's try it again, now that we have a better feel for what's going on. All right, so the generator got a little bit luckier this time; we're actually getting into more of that circle shape a little earlier on, it seems, and we can see that the loss functions are more or less stable, because again, we kind of guessed correctly from the get-go. There are still some of those fake samples in the middle where they shouldn't be; we'll see if that improves over time. This is starting to look better. The generator's manifold there is looking more circular. The discriminator is still kind of focusing on that upper-right quadrant, but overall the results aren't too bad. And now we're starting to look good. The discriminator is really getting that circle shape and learning it, so that anything outside of that circle, it knows is fake. At the same time, the generator starts to get a little bit wonky again. So again, it's unstable, right? What we're learning here is that GANs are difficult to train. They can get unstable, and getting the hyperparameters, the number of epochs, and the batch size just right is all critical for good performance. Well, let's try something a little bit simpler. Let's try this data distribution, which is just a straight line. That should be an easier case. We can see that the generator pretty quickly converges on that line, and we should see the discriminator starting to pick up on that as well. Already the generator is pretty close to the real data. So yeah, that's a lot easier, right? Circles are always hard, but in this case we converged on something much better, much more quickly. And you can see here that the KL divergence really shrank quickly there; the discriminator loss and the generator loss are basically stable at this point, so I think that's about as good as it's going to get. If you want to play with this yourself, head over to poloclub.github.io/ganlab and fiddle with it. There are other data models here you can try; play around with the different scenarios and see what happens. All right, moving on. Taking this to another level, let's look at the GAN Playground. Go to reiinakano.com (not sure how to pronounce that) and find the GAN Playground. This one actually uses the old MNIST dataset instead of those more synthetic data distributions. So in this case, let's go ahead and hit train here.
So we're going to be training this to generate numbers, pictures of numbers. And remember, again, with a GAN we're not training it to generate specific numbers, so I'm not expecting to see a nine here associated with a nine, or a six with a six; it's just generating numbers in general. Again, there is that mode-collapse degenerate state where it just learns to make one number reliably, but hopefully we'll see something more interesting as training continues. What we're seeing here visually is how well the discriminator is doing at telling real and fake apart. These are the real images coming in, and the generated images as well, and how well the discriminator is doing at telling the two apart. We can see right now we're getting a 60% or so success rate at identifying real images as real. Actually, all of a sudden it's gotten a lot better, and we're identifying the fake images as fake more often than not as well. Still not great, though. Obviously the visual results here are not awesome, but they are visually improving over time. Making a number is a much more complicated task than just making a circle like we were trying to do with GAN Lab before, so you'd expect this to be a little more computationally expensive. We can also visualize the loss functions of the discriminator and the generator here as well. We can see that it's kind of struggling to get any better, but given enough time it will. If you want to let this run yourself for a while, after about ten epochs or so I find that the results get pretty interesting and already start to look like numbers. That kind of looks like a five; there's still some room for improvement, right? That kind of looks like an eight. It's getting there, but give it more time and these will look more and more like numbers. And already I think they're kind of improving, right? That one looks like a flourish kind of thing there. But given enough training, we can see that slowly but surely the discriminator cost is decreasing. The generator, however, is really struggling to improve. Now, you can play around with the different topology of the network here if you want to, and the different hyperparameters, and see if you can do a better job. But, you know, it's going to be tough. If you do want to play with this interactively, though, that's one way to do it. Let's stop this and move on to a more fun example. This is an NVIDIA GAN demo called GauGAN, where they intentionally spelled it with GAN at the end, after the famous painter Gauguin. It uses a GAN to generate fake images of different types of landscapes. So if I just hit the button here, it's going to automatically take this segmentation map that I drew, which says I want water down here and sky up here, and generate a synthetic ocean and a synthetic sky that match the segments I defined. I have to agree to the terms and conditions first, and now we should see an image like that. So they have a GAN that's been trained on how to make water and how to make a sky, and it just kind of puts them together there. It gets more interesting, though. Let's actually select landscape here and put some mountains on the horizon. I feel like Bob Ross here. You can make any kind of landscape you want this way. Now, we're not really getting into any math or neural network topologies here; it's just a fun example of what you can do with GANs.
And we've got some mountains there, with kind of a weird boundary between the mountains and the water. Let's give it some ground there, some dirt, and I'll sort of turn that ocean into a lake. I don't want to waste too much time on this; this isn't a painting class. But just to give you an idea of what it can do. And while we're at it, let's put some happy little clouds up there: about one there, one here, and one over there. And there we have our little fake landscape. A little waterhole, surrounded by some dirt, with mounds in the background and some clouds in the sky, all generated through GANs that were trained on how to make those different types of features in a painting, at random, algorithmically. So, kind of a fun example of GANs. All right, you've seen GANs in action. Let's go into an actual notebook and dive into how they actually work under the hood.
23. GANs: Hands-On with Fashion MNIST: All right, we've had our fun. So let's go ahead and implement a generative adversarial network in a notebook, and actually see how they run under the hood and how to code one up. It's a surprisingly small amount of code; not a whole lot to go through, really. Deceptively simple, although they're very complex to train. Now again, these are very computationally intensive things to train, so if you don't have a GPU, you probably don't want to be doing this yourself. Let's go ahead and check whether we have one with this first block of code. That will go off and load up TensorFlow, which will take a little bit, but we can see that I do have one GPU available for training. If yours says 0 and you do have an NVIDIA GPU, head over to this link and it will walk you through installing TensorFlow with GPU support under Anaconda; it's just pip install tensorflow-gpu. But first, you'll probably need to install some dependencies, like the cuDNN library, which requires going to the NVIDIA developer website, signing up for an account there, and all that stuff. But it is worth it for accelerating this training. All right: as before with the VAE sample, we're going to load up the Fashion MNIST dataset, and we're going to be creating a GAN that can generate random articles of clothing. So let's go ahead and load that up from the keras.datasets package. As before, we merge together the training and testing datasets just to get more training data. Again, our goal here is not to classify data, it's to generate fake data, so there's really no use for the test dataset here; we just glom it all together. While we're at it, we normalize the image data from integer values from 0 to 255 to floating-point values between 0 and 1, because that's what sigmoid functions expect. We reshape the data, adding the extra dimension that we need for the convolutional (CNN) layers, and we also shuffle it and batch it up, with a batch size of 64. There's our first hyperparameter that we might want to tweak. Not much going on there. All right, let's start by setting up our generator model. Again, this is the thing that's trying to generate fake images given just random input. It makes random pictures of, well, articles of clothing in our example here. Let's walk through what's happening here. We import the stuff we need from TensorFlow and Keras, and here's another hyperparameter: how many dimensions of noise do we have, how many inputs are going into this thing? We're going to start with 150 random noise values and use that as our input. Again, you can change that later on and see what impact it has. We set up a sequential model in Keras, starting with that noise vector of 150 noise values, and we can play with what kind of distribution that noise is as well. We feed that into a dense layer of seven by seven by 256, and then do a transpose convolution on that in three layers, working up to a final image when we're done, to reconstruct it. So again, this is very similar to the decoder from a VAE, right? Pretty much the same thing. You can see how closely these things are related. We print out a summary of the model just to make sure it looks right before we move on. Yep, looks sane to me; there's a sketch of this generator just below. Next, we'll make our discriminator model. And again, this looks a lot like the encoder from the VAE, right?
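Here's that generator sketch. The 150-value noise input, the 7x7x256 dense layer, the three transpose convolutions, the leaky ReLU activations, and the sigmoid output follow the walkthrough; kernel sizes and strides are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

NOISE_DIM = 150  # hyperparameter: size of the random noise input

generator = tf.keras.Sequential([
    layers.Input(shape=(NOISE_DIM,)),
    layers.Dense(7 * 7 * 256),
    layers.LeakyReLU(),
    layers.Reshape((7, 7, 256)),
    # Three transpose convolutions, working back up to a 28x28x1 image
    layers.Conv2DTranspose(128, 3, strides=2, padding='same'),
    layers.LeakyReLU(),
    layers.Conv2DTranspose(64, 3, strides=2, padding='same'),
    layers.LeakyReLU(),
    layers.Conv2DTranspose(1, 3, padding='same', activation='sigmoid'),
])
generator.summary()
```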
So what we take here as input is a 28 by 28 by 1 image: 28 by 28 pixels, one channel of color, grayscale. We feed that into a 256-filter convolutional 2D layer, then a 128-filter one, and flatten that down into a dense layer of just 64 neurons, applying a dropout phase to prevent overfitting. Finally, we output a single yes-or-no number: does this thing think the image is real or fake? That's the main difference from VAEs. We're not trying to generate a vector of latent features here; we're just generating an output of "do I think this is a real image or a fake image." Otherwise, it's pretty similar to the encoder from a VAE. Let's go ahead and run that; the model looks correct to me. One thing I forgot to point out: we're using the ReLU activation function here on the discriminator, and the leaky ReLU activation function on the generator. That's just something people have learned over time works well. Standard best practice there. All right, let's set up our optimizers and loss functions. So, yay, more hyperparameters. We have an optimizer each for the generator and discriminator, called optimizer_g and optimizer_d. Both use the Adam optimizer, but with different learning rates, and these are critical to get right if you want a stable model while training. If you get them wrong, it just blows up, kind of like we saw during that earlier demo of trying to learn a ring shape, if you remember back in GAN Lab. Our loss function is pretty straightforward, though: we're just trying to figure out whether things are real or fake, so binary cross-entropy fits the bill. Similarly, for accuracy, we want to see how accurate we are at guessing yes or no, is this a real image, so binary accuracy is totally reasonable as well. Shift Enter, and we've got that all set up; there's a sketch of the discriminator and optimizer setup below. All right, let's tie it all together. We'll start by defining our training step for the discriminator. For the batch coming in, we read the shape of the data to figure out what its dimensionality is, what the batch size is. Then we create a vector of random noise that matches that batch size for the noise vector. We're using random.normal here for a normal distribution; it doesn't necessarily have to be a normal distribution. You could try a uniform distribution here if you wanted to, and see what that does, if you have time to play around with it. What we do next is concatenate the real and fake labels. As we're going to see, we actually stick the real and fake data together as we feed it into the discriminator, so we'll have all the real data with a label of one, followed by all the fake data with a label of zero, as we go in and define our labels for the data we're feeding in. That's what's going on there. We set up our gradient tape for training here. We first generate our fake data by calling the generator with that noise vector to get a set of fake images. We then concatenate that with the real data, so we have the real data followed by the fake data; again, that lines up with those real and fake labels we set up above under y_true. We then feed that into the discriminator to see how well it does at guessing whether these are real or fake images.
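And here's the discriminator and optimizer setup, sketched under the same caveats; the transcript only says the two learning rates differ, so the specific values here are placeholders:

```python
discriminator = tf.keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(256, 3, strides=2, padding='same', activation='relu'),
    layers.Conv2D(128, 3, strides=2, padding='same', activation='relu'),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.2),                     # guard against overfitting
    layers.Dense(1, activation='sigmoid'),   # single output: real (1) or fake (0)?
])
discriminator.summary()

# Separate Adam optimizers with different learning rates -- critical for stability
optimizer_g = tf.keras.optimizers.Adam(learning_rate=0.0001)
optimizer_d = tf.keras.optimizers.Adam(learning_rate=0.0003)
loss_fn = tf.keras.losses.BinaryCrossentropy()
accuracy_fn = tf.keras.metrics.BinaryAccuracy()
```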
Then we calculate the loss function that we defined above, based on whether it got those guesses right. We also have to define the backward pass for training here. We're just setting up, again, a gradient tape, with the optimizer_d that we set up before and apply_gradients. Nothing really interesting here, kind of boilerplate code, but we do pass in that discriminator loss function that we defined earlier. All right, we report the accuracy back, keep track of it, and print it out over time. And that's all there is to the discriminator step. Let's Shift Enter to set that up. Now let's turn our attention to the generator. The generator's job is to not get caught by the discriminator, right? So we're going to test how good it is at getting the discriminator to guess that its output was real. As its images go into the discriminator, we measure success by whether the discriminator guessed that the image was real, even though it's fake. That's what's at the heart of the generator training going on here. Again, we extract the batch size from the shape of the input data and set up a noise vector that matches that batch size. And this time we just set up a vector of ones, because we're measuring how well we're doing at being classified as real; that's what one represents. We set up our gradient tape, and this is kind of the heart of it all: we take that noise we generated, pass it into the generator to create a bunch of fake images for whatever the batch size is, then take those fake images and pass them into the discriminator, and get back whether the discriminator thought those images were real or fake. We then calculate our loss based on how well we did at making the discriminator think our fake images were actually real ones. All right, so that's what's happening here. You know, not too much code, really. The rest is kind of boilerplate; again, we're tying together our backpropagation and keeping track of the loss and accuracy over time. Shift Enter to set that up; a sketch of both training steps appears at the end of this walkthrough. All right, next is a handy-dandy little function to visualize the generated images as we go. All this does is take a model as input, pick 81 random noise vectors, pass them into whatever that model is, some sort of generator, and visualize what comes out in a nine-by-nine grid. Just a little helper function to visualize 81 random images given some model. And now let's actually do the training. We aren't going to just use a one-line fit method here. The reason is that the original GAN paper actually ran multiple discriminator training steps for each generator step. In our case, we're going to keep it simple and just do one discriminator training step followed by one generator training step. But if you wanted to, you could duplicate this discriminator training multiple times to more closely match what the original paper did. So that gives you some flexibility and things to experiment with. Fundamentally, though, all we're going to do is go through 30 epochs: train the discriminator and keep track of the overall loss and accuracy; train the generator, again keeping track of overall loss and accuracy; and just print the performance as we go. Also, every other epoch, we call that plot images helper function to visualize 81 random sample images from our generator model.
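Here's a minimal sketch of those two alternating training steps. The logic (concatenated real-then-fake batches labeled one and zero for the discriminator, and all-ones labels for the generator) follows the walkthrough, but this is a simplification, not the notebook's exact code:

```python
def train_d_step(real_data):
    batch_size = tf.shape(real_data)[0]
    noise = tf.random.normal(shape=(batch_size, NOISE_DIM))
    # Real images get a label of 1, fake images a label of 0
    y_true = tf.concat([tf.ones((batch_size, 1)),
                        tf.zeros((batch_size, 1))], axis=0)
    with tf.GradientTape() as tape:
        fakes = generator(noise)
        x = tf.concat([real_data, fakes], axis=0)  # real data followed by fake
        y_pred = discriminator(x)
        d_loss = loss_fn(y_true, y_pred)
    grads = tape.gradient(d_loss, discriminator.trainable_weights)
    optimizer_d.apply_gradients(zip(grads, discriminator.trainable_weights))
    accuracy_fn.update_state(y_true, y_pred)  # track how often it guessed right
    return d_loss

def train_g_step(real_data):
    batch_size = tf.shape(real_data)[0]
    noise = tf.random.normal(shape=(batch_size, NOISE_DIM))
    # The generator "wins" when the discriminator labels its fakes as real (1)
    y_true = tf.ones((batch_size, 1))
    with tf.GradientTape() as tape:
        y_pred = discriminator(generator(noise))
        g_loss = loss_fn(y_true, y_pred)
    grads = tape.gradient(g_loss, generator.trainable_weights)
    optimizer_g.apply_gradients(zip(grads, generator.trainable_weights))
    return g_loss
```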
So, every second epoch, we take our currently trained generator model, with all the weights it's learned so far, and visualize how good it is at generating fake images of clothing. Let's go ahead and set that up. And this is where the training starts. This is going to take a while, guys, but let's at least see if it kicks off successfully; even with a GPU, this is going to be an incredibly intensive operation. All right, here we go. So here we are at epoch 0, getting random blobs at this point, as you'd expect. But let's wait for at least the next visualization at epoch 2. Again, what's going on here is that our generator is learning how to trick the discriminator, and the discriminator is trying to learn how to discriminate real from fake images at the same time. It's adversarial because these two things are in tension with each other, and getting that balance just right is the key to creating a generator that can create convincing fakes. All right, epoch 2 finally came back, and, well, it's still not really recognizable, but you can see that we're getting somewhat more complex shapes out of it already. We're going to go ahead and let this run for 30 epochs and see what we end up with. It's going to take a while, so I'm just going to pause the video here and come back when it's done, while my GPU heats my room. All right, about half an hour has passed, and my GPU has been working overtime; it's actually hot in the office now. The other sort of dirty secret of deep learning is how much energy it consumes, definitely something to be aware of and cognizant of. Let's see what happened, though. As you can see, at epoch 0 we just got some random blobs; at epoch 2, not so much better. By epoch 4 it starts to do some more complicated stuff, and things start to form into something sort of recognizable. By epoch 8, these are starting to look like shirts and pants, I think, and by the time we get to 10, they're definitely starting to become recognizable. Over time, it just gets better and better and better. So even though we stopped at 30 epochs, because I just don't want to destroy the planet by running it further, you can see the trend here: it is in fact getting better and better over time. By the time we got to the end (we actually stopped printing images at epoch 28, just a vagary of the code), at the 29th epoch, which is really the 30th because we started counting at 0, the discriminator was still only 80% effective at discriminating real from fake. So there was still room for improvement; we could let this train even longer to get even better results if we wanted. Very intensive to train, these things, and again, they're very touchy. Given that you have to run them repeatedly to tune all those hyperparameters, you can see how training a GAN can be a very time-consuming and energy-consuming proposition. But these are reasonable results. You know, these definitely look like dresses and shirts. Well, it's mostly dresses and shirts, isn't it? We'll talk about that in a moment. Let's just take the final trained model and, again, spit out nine-by-nine random images to see what our final results are. It's pointed out in the comments here, and it's kind of noteworthy, how quickly it can generate those images once you have the trained model. There's actually a demo on the NVIDIA website where they created a GAN that recreates a game of Pac-Man.
Just based on analyzing video of people playing it over and over and over again. So GANs can really represent anything you want, without any real knowledge of what's going on, necessarily. Anyway, you can see here that these look like reasonable dresses and sweaters, I guess, but that's about it. We're seeing that mode collapse problem we talked about before: we're really just measuring the model's ability to generate pictures of clothing, and we're not measuring its ability to generate specific types of clothing, just whether the generator can produce an image that fools the discriminator into thinking it's real. So it kind of found a way to cheat here, because different types of shoes are going to be more complicated to generate than a picture of a pair of pants or a shirt, and given the limited training time we had, that's what it converged on. That's the mode collapse problem we discussed earlier in the slides. But overall, pretty satisfying results given just 30 epochs and a pretty simple training dataset. Obviously, you can tell that there's a lot more to explore here. There's still quite a gap between these results and all those face-swapping apps you see on your phone. Obviously there are much larger GANs out there, trained for much longer periods of time, that are able to generate those convincing fakes of real pictures of human faces, which is really the hardest thing to get right. But it's out there, and this is just the tip of the iceberg as far as GAN research goes; it's really the cutting edge of deep learning research these days. If you go look up GAN model zoos, you'll see that there's a huge variety of different models out there that you can work off of and build off of for very specific problems, and people are incorporating GANs into larger systems. For example, there's something out there that will automatically generate a talking head given a picture of somebody: its eyes can move around, it can fake talking to you and everything, by melding together feature extraction of somebody's face with GANs that can create patches of that face doing different things. So there's a lot of exciting innovation going on in this space right now. Again, I'll remind you from the ethics standpoint: please use this technology for good, and please be cognizant of the energy it requires for training. So that's GANs: an introduction, at least, and enough for you to go off and learn more and be a little bit dangerous with it.
24. Deep Learning Project Intro: So, it's time to apply what you've learned so far in this deep learning course. Your final project is to take some real-world data (we actually talked about this in the ethics lecture) of masses detected in mammograms, and, just based on the measurements of those masses, see if you can predict whether they're benign or malignant. We have a dataset of mammogram masses that were detected in real people, and real doctors have looked at them and determined whether or not they are benign or malignant in nature. So let's see if you can set up a neural network of your own that can successfully classify whether these masses are benign or malignant in nature. Let's dive in. I've given you some data and a template to work from. The data we're going to be working with is in your course materials: the mammographic_masses.data.txt file is the raw data you'll use to train your model, and you can see it's just six columns of stuff. What those represent, I'll show you in a minute. There's actually a description of the dataset in the .names text file that goes along with it. But to get you started, I've given you a little bit of a template to work with, if you open up the Deep Learning Project .ipynb file. This came from the UCI repository, and I've given you a link to where the original data came from. It's actually a great resource for finding other data to play with; if you're still learning machine learning, that's a great place to find stuff to mess around with and experiment with. But for this example, this is what we're going to be messing with. So here's the description of those six columns. The first is called the BI-RADS assessment, and that's basically a measurement of how confident our diagnosis was for this particular mass. Now, that's sort of giving away the answer; it's not what we call a predictive attribute, so we're not going to use it for training our model. Next, we have the patient's age. We have a classification of the shape of the mass, a classification of the mass margin (how its edge looks), and the density of the mass. And finally, we have the thing we're trying to predict: the label, the severity, whether it's benign (zero) or malignant (one). So we have, as you see here, a binary classification problem, very similar to stuff we've done earlier in the course, and you shouldn't need much more than code snippets from earlier exercises in this course, adapted to this dataset. Okay, now, one little caveat here. Typically, when you're doing machine learning, you don't want to deal with what we call nominal data, and both shape and margin are technically nominal data. While we're converting them to numbers, those numbers aren't necessarily meaningful in terms of their gradation; going from 1 to 4 doesn't mean that we're increasing linearly from round to irregular. But sometimes you have to make do with what you have. It's better than nothing, and at least there is some logic to the progression of the numerical scores here: they do generally go from more regular to more irregular as the numbers increase. So we're going to go ahead and use them anyway. Anyway, this is important stuff. There's a lot of unnecessary anguish and surgery that comes from false positives on mammograms.
So if you can build a better way of diagnosing these things, all the better. But again, think back to my ethics lecture: you don't want to oversell this. You want to be sure this is just a tool that might be used by a real human doctor, unless you're very confident that the system can actually outperform a human. And by definition, it can't, because we're training it on data that was created by humans. So how could it possibly be better than a human? Think about that. All right, so your job is to build a multilayer perceptron to classify these things. I was able to get over 80% accuracy with mine; let's see how you can do. Now, a lot of the work is just going to be in cleaning the data, and I will step you through the things you need to do here. Start by importing the data file using the read_csv function. You can then take a look at it, convert the missing data into NaN (not-a-number) values, and make sure you import all the column names. You might need to clean the data, so try to get an idea of the nature of the data by using describe on your resulting pandas DataFrame. Next, you'll need to drop rows that have missing data, and once you've taken care of that, you'll need to convert the DataFrames into NumPy arrays that you can then pass into scikit-learn or into Keras. Okay, you'll also need to normalize the data before you analyze it with Keras. A little hint there: use preprocessing.StandardScaler out of sklearn; that can make things very easy for you. That's the one thing that we haven't really done before. The rest of this, you should be able to figure out just based on previous examples. Once you have the data in the proper format, it should be pretty straightforward to build an MLP model using Keras, and you can experiment with different topologies and different hyperparameters and see how well you can do. So I'm going to set you loose here and give you this to try. Let's see how you do. When you come back in the next lecture, I'll show you my solution and how I worked through this myself. So go forth and practice what you've learned.
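If you want a starting point, the cleanup steps just listed might look roughly like this. This is a sketch, not the official solution; the column names come from the dataset's description file:

```python
import pandas as pd
from sklearn import preprocessing

# Read the file, treating '?' as missing data, with explicit column names
col_names = ['BI-RADS', 'age', 'shape', 'margin', 'density', 'severity']
df = pd.read_csv('mammographic_masses.data.txt', na_values=['?'], names=col_names)
print(df.describe())

# Drop rows with missing data, then pull out NumPy arrays for Keras
df.dropna(inplace=True)
all_features = df[['age', 'shape', 'margin', 'density']].values
all_classes = df['severity'].values

# Normalize the features before feeding them into a neural network
scaler = preprocessing.StandardScaler()
all_features_scaled = scaler.fit_transform(all_features)
```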
25. Deep Learning Project Solution: So I hope you got some good experience there in actually applying what you've learned to create a neural network that can classify masses found in mammograms as benign or malignant. Like I said, I got around 80% accuracy myself; I wonder how you did. Anyway, I started off by just blindly reading in the CSV file using pd.read_csv and taking a look at it. And I saw at that point that the column names were wrong — there was no column name information in the data file — and there were missing values in there, indicated by question marks. So I had to read it in a little more intelligently. On my second try, I called read_csv passing in explicitly the knowledge that question marks mean missing (NaN) values, and passing an array of column names like we did before, and did another head on the resulting Pandas DataFrame. Things look a lot better now. At that point, we have to clean the data. Now that it's in a proper format and organized properly, we can do a describe on it to take a look at things, and we get an idea that we are missing some data but things seem to be reasonably well distributed. At this point, I did a little count to see what exactly is missing. My strategy here was to see if there's any sort of bias that I'm going to introduce by removing missing values. If the missing data seems to be randomly distributed, just by eyeballing it, at least, that's probably a good indication that it's safe to just go ahead and drop those missing rows. So, having determined that that's an OK thing to do, I went ahead and called dropna on that DataFrame and described it, and now I can see that I have the same counts of rows in each individual column. So I now have a complete data set where I've thrown out the rows that are missing data, and I've convinced myself that's an OK thing to do statistically. All right, so now we need to actually extract the features and labels that we want to feed into our model, our neural network. So I extracted the feature data — the age, shape, margin, and density — from that DataFrame into a NumPy array called all_features. I also extracted the severity column and converted it into an all_classes array that I can pass in as my label data. And I created a handy-dandy array of column names, since I'll need that later on. Just to visualize, I printed out all_features to take a look at it, and sure enough, it looks legitimate: an array with four features apiece on each row looks reasonable. At that point, I need to scale my data down. To do that, all I need to do is import the StandardScaler from sklearn's preprocessing module and apply it to my feature data. And if I look at all_features_scaled, which came out of that transformation, I can see that everything appears to be normally distributed now, centered around zero with a standard deviation of one, which is what we want. Remember, when you're putting inputs into a neural network, it's important that your data is normalized first. Now we get to the actual meat of the thing: setting up our MLP model. And I'm going to wrap it in such a way that I can use scikit-learn's cross_val_score to evaluate its performance.
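
Before I walk through it, here's roughly what that model and its cross-validation wrapper could look like — a minimal sketch, assuming the all_features_scaled and all_classes arrays from above, and an epoch count I picked arbitrarily. Note that in newer Keras releases the KerasClassifier wrapper has moved out to the separate scikeras package:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
    from sklearn.model_selection import cross_val_score

    def create_model():
        model = Sequential()
        # One hidden layer of six neurons with ReLU activation,
        # taking our four scaled features as input.
        model.add(Dense(6, input_dim=4, activation='relu'))
        # A single sigmoid unit for the final benign/malignant call.
        model.add(Dense(1, activation='sigmoid'))
        model.compile(loss='binary_crossentropy', optimizer='adam',
                      metrics=['accuracy'])
        return model

    # Wrap the Keras model so scikit-learn can drive it, then run
    # 10-fold cross-validation and average the accuracy scores.
    estimator = KerasClassifier(build_fn=create_model, epochs=100, verbose=0)
    scores = cross_val_score(estimator, all_features_scaled, all_classes, cv=10)
    print(scores.mean())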
So in my example here, I've created a little function called create_model that creates a Sequential model, adds in a dense layer with six units, or six neurons, using the ReLU activation function, and adds another layer with one unit on top of that, which does my final sigmoid classification — my binary classification. And I compiled it with the Adam optimizer and the binary cross-entropy loss function. So with this, we've set up a single layer of six neurons that feeds into one final binary classification layer. Very simple. I've then gone ahead and used the KerasClassifier to build a scikit-learn-compatible version of this neural network, and I've passed that into cross_val_score to actually do k-fold cross-validation, in this case with 10 folds, and print out the results. So with just these six neurons, I was able to achieve 80% accuracy in correctly predicting whether or not a mass was benign or malignant, just based on the measurements of that mass. Now, in the real world, doctors use a lot more information than just those measurements, so our algorithm is kind of at a disadvantage compared to those human doctors to begin with. But that's not too bad. If you did better by using more layers or more neurons, I'd be curious to see if you actually got a better result; there's a sketch of a deeper variant below if you want a starting point. It turns out that sometimes you don't need a whole lot to get the optimal result from the data that you have. But if you were able to substantially improve upon that result, congratulations! So I hope deep learning has been demystified for you. Up next, we'll talk about how to continue learning more in the field of deep learning.
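
For anyone who wants to experiment as suggested above, a deeper topology is just a matter of adding layers inside the model-building function. This is a hypothetical variant, reusing the imports from the previous sketch — not the lecture's solution, and not guaranteed to score any better:

    def create_deeper_model():
        model = Sequential()
        model.add(Dense(6, input_dim=4, activation='relu'))
        model.add(Dense(4, activation='relu'))  # an extra hidden layer to try
        model.add(Dense(1, activation='sigmoid'))
        model.compile(loss='binary_crossentropy', optimizer='adam',
                      metrics=['accuracy'])
        return model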