Deep Learning and Neural Networks with Python | Frank Kane | Skillshare

Deep Learning and Neural Networks with Python

Frank Kane, Founder of Sundog Education, ex-Amazon

21 Lessons (2h 39m)
    • 1. Course Introduction

    • 2. Getting Started and Prerequisites

    • 3. Please Follow Me on Skillshare!

    • 4. The History of Artificial Neural Networks

    • 5. Hands-On in the Tensorflow Playground

    • 6. Deep Learning Details

    • 7. Introducing Tensorflow

    • 8. Using Tensorflow for Handwriting Recognition, Part 1

    • 9. Using Tensorflow for Handwriting Recognition, Part 2

    • 10. Introducing Keras

    • 11. Using Keras to Learn Political Affiliations

    • 12. Convolutional Neural Networks

    • 13. Using CNN's for Handwriting Recognition

    • 14. Recurrent Neural Networks

    • 15. Using RNN's for Sentiment Analysis

    • 16. Transfer Learning

    • 17. The Ethics of Deep Learning

    • 18. Deep Learning Project Intro

    • 19. Deep Learning Project Solution

    • 20. Deep Learning: Learning More

    • 21. Let's Stay in Touch


About This Class

It's hard to imagine a hotter technology than deep learning, artificial intelligence, and artificial neural networks. If you've got some Python experience under your belt, this course will de-mystify this exciting field with all the major topics you need to know. 

We'll cover:

  • Artificial Neural Networks
  • Multi-Layer Perceptrons
  • Tensorflow
  • Keras
  • Convolutional Neural Networks
  • Recurrent Neural Networks

And it's not just theory! In addition to the class project, as you go you'll get hands-on with some smaller activities and exercises:

  • Building neural networks for handwriting recognition
  • Learning how to predict a politician's political party based on their votes
  • Performing sentiment analysis on real movie reviews
  • Interactively constructing deep neural networks and experimenting with different topologies

A few hours is all it takes to get up to speed, and learn what all the hype is about. If you're afraid of AI, the best way to dispel that fear is by understanding how it really works - and that's what this course delivers.



1. Course Introduction: Welcome to Deep Learning and Neural Networks with Python. I'm your instructor, Frank Kane, and I spent nine years at Amazon.com and IMDb.com building and managing some of their best-known features: product recommendations, people-who-bought-this-also-bought, top sellers, and movie recommendations at IMDb. All of these features required applying machine learning techniques to real-world data sets, and that's what this course is all about. I don't need to tell you that artificial intelligence, deep learning, and artificial neural networks are among the most valuable technical skills to have right now. These fields are exploding with progress and new opportunities. This short course will tackle deep learning from an applied, practical standpoint. We won't get mired in notation and mathematics, but you'll understand the concepts behind modern AI and be able to apply the main techniques using the most popular software tools available today. We'll start off with a refresher on Python and the Pandas library in case you're new to them. Then we'll cover the concepts behind artificial neural networks. Then you'll dive right into using the TensorFlow library to create your first neural network from scratch, and we'll use the Keras library to make prototyping neural networks even easier. You'll understand and apply multi-layer perceptrons, deep neural networks, convolutional neural networks, and recurrent neural networks. At the end of the course, a quick final project will let you practice what you've learned. The activities in this course are really interesting: you'll perform handwriting recognition and sentiment analysis, and predict people's political parties, using artificial neural networks and a surprisingly small amount of code.
If you're a software developer or programmer looking to understand the exciting developments in AI in recent years and how it all works, this course is for you. We'll turn concepts right into code, using Python, with no nonsense and no academic pretense. Building a neural network is not as hard as you think. All you need is some prior experience in programming or scripting to be successful in this course. The general format of this course is to introduce a concept using some slides and graphical examples; then we'll look at Python code that implements the concept on some real or fabricated data. You'll then be given some ideas on how to modify or extend the code yourself in order to get some hands-on experience with each concept. The code in this course is provided as an IPython notebook file, which means that in addition to containing real, working Python code that you can experiment with, it also contains notes on each technique that you can keep around for future reference. If you need a quick reminder on how a certain technique works, you'll find this an easy way to refresh yourself without rewatching an entire video.

2. Getting Started and Prerequisites: It's hard to think of a hotter topic than deep learning, and that's what we're going to talk about, in depth and hands-on, for the next few hours. I'm going to show you how artificial neural networks work: perceptrons and multi-layer perceptrons, and then we're going to get into some more advanced topics like convolutional neural networks and recurrent neural networks. None of that probably means anything to you right now, but the bottom line is, if you've been curious about how deep learning and artificial neural networks work, you're going to understand that by the end of these next few hours. So think of it as deep learning for people in a hurry.
I'm going to give you just enough depth to be dangerous, and there will be several hands-on activities and exercises so you can actually get some confidence in applying these techniques, and really understand how they work and what they're for. I think you'll find that they're a lot easier to use than you might have thought. So let's dive in and see what it's all about. This section of my larger machine learning and data science course is actually available as a standalone course as well. So if you are new to this course, you are going to need to install the course materials and a development environment if you want to follow along with the hands-on activities in this deep learning section. Just head on over to sundog-education.com/machine-learning. Make sure you pay attention to the dashes and the capitalization, it all matters, and don't spell anything wrong, and you should get to this page. Here you'll find a handy link to the course materials. Just download that and decompress it, however you do that on your platform, and remember where you put it. If you want to join our Facebook group, you can; it's totally optional, just a place for students to hang out with each other. Our development environment for this course will be Anaconda, which is a scientific Python 3 environment. You can install it from here; it is free software. Make sure you install the Python 3.7 or newer version. Once you've installed Anaconda, you'll need to install the TensorFlow package. On Windows, you would do that by going to the Anaconda prompt: go to Anaconda in your Start menu and open up Anaconda Prompt. On macOS or Linux, you would just go to a terminal prompt, and things should be all set up for you already. From there you type in "conda install tensorflow" and let that run to install the TensorFlow framework that we will use within Anaconda. If you have an NVIDIA GPU, you might get better performance by saying "tensorflow-gpu" instead.
But sometimes that results in compatibility issues, so don't do that unless you know what you're doing. You do not need to install PyDotPlus for this particular section of the course; it can't hurt to do it, though, as that's also part of the setup instructions for the larger course. You also need to understand how to actually start the notebooks once you have them installed. From that same Anaconda prompt we talked about earlier, to actually launch one of the notebooks in this course, you would first change your directory to wherever you installed the course materials. For me, I put them in C:\mlcourse, and if I do a dir you'll see all the course materials are here. From here, if I type in "jupyter notebook" (and Jupyter is spelled funny, with a "y"), that should launch your web browser with a directory of all the different notebooks that are part of this course. So when I say in this course to open up, for example, the TensorFlow notebook, Tensorflow.ipynb, you would just scroll down this list, open up Tensorflow.ipynb, and up it should come. When you're done experimenting and playing around with a notebook, you can just go to File > Close and Halt to get out of it. And when you're done with Jupyter entirely for this session, just quit, and that will shut everything down for you. All right, so with that out of the way, let's move on. Let's talk about some of the mathematical prerequisites that you need to understand deep learning. This will probably be the most challenging part of the course, actually: just some of the mathematical jargon that we need to familiarize ourselves with. But once we have these basic concepts down, we can talk about them a little more easily. I think you'll find that artificial intelligence itself is actually a very intuitive field, and once you get these basic concepts down, it's very easy to talk about and very easy to comprehend.
The first thing we want to talk about is gradient descent. This is basically a machine learning optimization technique for trying to find the most optimal set of parameters for a given problem. What we're plotting here is basically some sort of cost function, some measurement of the error of your learning system. And this applies to machine learning in general: you're going to have some sort of function that defines how close your model's results are to the results you want. In the context of supervised learning, we will be feeding our algorithm, our model if you will, a group of parameters, some sort of weights that tune the model, and we need to identify the values of those parameters that produce the optimal results. The idea of gradient descent is that you just pick some point at random, and each one of these dots represents some set of parameters for your model. Maybe it's the various parameters for some model we've talked about before, or maybe it's the exact weights within your neural network. Whatever it is, you try some set of parameters to start with, and we then measure whatever the error is that that produces on our system. Then what we do is move on down the curve: we try a different set of parameters, sort of like moving in a given direction with different parameter values, and then measure the error that we get from that. In this case, we actually achieved less error by trying the new set of parameters, so we say, okay, I think we're heading in the right direction here; let's change them even more in the same way. And we just keep on doing this in steps until finally we hit the bottom of a curve, and our error starts to increase after that point. At that point we'll know that we actually hit the bottom of this gradient; so you understand the nature of the term gradient descent.
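To make that concrete, here is a tiny sketch of gradient descent in plain Python, minimizing a made-up one-parameter cost function. This is just an illustration of the concept; it is not how TensorFlow implements it, and the cost function and learning rate are invented for the example.

```python
# Toy gradient descent: minimize cost(w) = (w - 3)^2, whose minimum is at w = 3.
# A hypothetical illustration of the concept, not a real training loop.

def cost(w):
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative of (w - 3)^2 with respect to w
    return 2.0 * (w - 3.0)

w = -10.0            # start from some arbitrary parameter value
learning_rate = 0.1  # how big a step we take downhill each iteration

for step in range(100):
    w -= learning_rate * gradient(w)  # move against the slope, reducing error

print(round(w, 4))  # prints 3.0 (converged to the bottom of the curve)
```

Each iteration moves the parameter a little in the direction that reduces the error, exactly the "keep stepping downhill until the error bottoms out" process described above.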
Basically, we're picking some point at random with a given set of parameters that we measure the error for, and we keep on pushing those parameters in a given direction until the error minimizes itself and starts to come back up to some other value. And that's how gradient descent works, in a nutshell. I'm not going to get into all the hardcore mathematics of it; the concept is what's important here, because gradient descent is how we actually train our neural networks to find an optimal solution. Now, you can see there are some areas of improvement here for this idea. First of all, you can actually think of this as sort of a ball rolling downhill. So one optimization that we'll talk about later is using the concept of momentum: you can have that ball gain speed as it goes down the hill, and slow down as it reaches the bottom and bottoms out there. That's a way to make it converge more quickly, and it can make actually training your neural networks even faster. Another thing worth talking about is the concept of local minima. What if I randomly picked a point that ended up over here on this curve? I might end up settling into this minimum here, which isn't actually the point of least error; the point of least error in this graph is over here. That's a general problem in gradient descent: how do you make sure that you don't get stuck in what's called a local minimum? Because if you just look at this part of the graph, it looks like the optimal solution, and if I just happen to start over here, that's where I'm going to get stuck. Now, there are various ways of dealing with this problem. Obviously you could start from different locations to try to prevent that sort of thing. But in practical terms, it turns out that local minima aren't really that big of a deal when it comes to training neural networks.
It just doesn't really happen that often; you don't end up with shapes like this in practice, so we can get away with not worrying about it as much. That's a very important, good thing, because for a long time people believed that AI would be limited by this local minimum effect. In practice, it's really not that big of a deal. Another concept we need to familiarize ourselves with is something called autodiff. We don't really need to go into the hardcore mathematics of how autodiff works; we just need to know what it is and why it's important. When you're doing gradient descent, somehow you need to know what the gradient is, right? We need to measure the slope that we're taking along our cost function, our measurement of error, which might be mean squared error for all we know. And to do that mathematically, you need to get into calculus: if you're trying to find the slope of a curve and you're dealing with multiple parameters, we're talking about partial derivatives, the first partial derivatives, to figure out the slope we're heading in. Now, it turns out that this is very mathematically intensive and inefficient for computers to do, so the brute-force approach to gradient descent gets very expensive very quickly. Autodiff is a technique for speeding that up; specifically, we use something called reverse-mode autodiff. What you need to know is that it can compute all the partial derivatives you need just by traversing your graph a number of times equal to the number of outputs you have, plus one. This works out really well for neural networks, because in a neural network you tend to have artificial neurons that have very many inputs, but probably only one output, or very few outputs in comparison to the inputs. So this turns out to be a pretty good little calculus trick. It's complicated; you can look up how it works, and it is pretty hardcore stuff, but it works, and that's what's important.
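To make the idea a little less mysterious, here is a miniature reverse-mode autodiff in plain Python: build a little graph of operations, then sweep backward through it, applying the chain rule, to get every partial derivative. This is a deliberately simplified sketch of my own; real frameworks like TensorFlow do essentially this, but far more efficiently and at far greater scale.

```python
# A toy reverse-mode autodiff: each Var remembers which values it was computed
# from, and what the local derivative with respect to each of them is.

class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # list of (parent_var, local_gradient) pairs
        self.grad = 0.0

    def __add__(self, other):
        # d(a + b)/da = 1, d(a + b)/db = 1
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        # d(a * b)/da = b, d(a * b)/db = a
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, seed=1.0):
        # Accumulate d(output)/d(self) into every ancestor via the chain rule.
        # (Summing over paths like this is correct, though real implementations
        # traverse the graph more cleverly for efficiency.)
        self.grad += seed
        for parent, local_grad in self.parents:
            parent.backward(seed * local_grad)

x = Var(2.0)
y = Var(3.0)
z = x * y + x      # z = x*y + x, so dz/dx = y + 1 = 4 and dz/dy = x = 2
z.backward()       # one backward sweep yields all partial derivatives at once
print(x.grad, y.grad)  # prints 4.0 2.0
```

The point to notice is that a single backward pass from the one output produced the gradient with respect to every input, which is exactly why reverse mode suits neural networks with many inputs and few outputs.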
And what's also important is that it's what the TensorFlow library uses under the hood to implement its gradient descent. So again, you're never going to have to actually implement gradient descent from scratch or implement autodiff from scratch; these are all baked into the libraries we're using for deep learning, libraries such as TensorFlow. But they are terms that we throw around a lot, so it's important that you at least know what they are and why they're important. Just to back up a little bit: gradient descent is the technique we're using to find the minimum of the error we're trying to optimize for, given a certain set of parameters, and autodiff is a way of accelerating that process, so we don't have to do quite as much math or quite as much computation to actually measure the gradient in gradient descent. One other thing we should talk about is softmax. Again, the mathematics aren't so complicated here, but what's really important is understanding what it is and what it's for. Basically, when you have the end result of a neural network, you end up with a bunch of what we call weights that come out of the neural network at the end. So how do we make practical use of the output of our neural networks? That's where softmax comes in. Basically, it converts each of the final weights that come out of your neural network into a probability. So if you're trying to classify something with your neural network, for example deciding if an image is a picture of a face, or a picture of a dog, or a picture of a stop sign, you might use softmax at the end to convert those final outputs of the neurons into probabilities for each class, and then you can just pick the class that has the highest probability.
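Softmax itself is only a couple of lines. Here is a sketch in plain Python; the class names and raw scores below are made up purely for illustration.

```python
import math

# Softmax sketch: turn the raw output scores from the final layer of a
# network into probabilities that sum to 1, then pick the most likely class.

def softmax(scores):
    # Subtracting the max first is a standard numerical-stability trick
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["face", "dog", "stop sign"]   # hypothetical classes
logits = [2.0, 1.0, 0.1]                # hypothetical raw network outputs

probs = softmax(logits)
print(labels[probs.index(max(probs))])  # prints "face", the top class
```

Notice the outputs always sum to 1, which is what lets us read them as probabilities and simply take the largest one as our classification.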
So it's just a way of normalizing things, if you will, into a comparable range, and in such a manner that if you choose the highest value of the softmax function from the various outputs, you end up with the best choice of classification at the end of the day. It's just a way of converting the final output of your neural network to an actual answer for a classification problem. Again, you might have the example of a neural network that's trying to drive your car for you, and it needs to identify pictures of stop signs or yield signs or traffic lights; you might use softmax at the end of some neural network that will take your image and classify it as one of those sign types. So, just to recap: gradient descent is an algorithm for minimizing error over multiple steps. Basically, we start at some random set of parameters, measure the error, move those parameters in a given direction, see if that results in more error or less error, and keep moving in the direction of minimizing error until we find the actual bottom of the curve, where we have a set of parameters that minimizes the error of whatever it is you're trying to do. Autodiff is a calculus trick for making gradient descent faster; it makes it easier to find the gradients in gradient descent just by using some calculus trickery. And softmax is just something we apply on top of our neural network at the very end to convert its final output to an actual choice of classification, given several classification types to choose from. Okay, so those are the basic mathematical terms, or algorithmic terms, that you need to understand to talk about artificial neural networks. So with that under our belt, let's talk about artificial neural networks next.

3. Please Follow Me on Skillshare!: The world of big data and machine learning is immense and constantly evolving.
If you haven't already, be sure to hit the follow button next to my name on the main page of this course. That way you'll be sure to get announcements of my future courses, and news as this industry continues to change.

4. The History of Artificial Neural Networks: Let's dive into artificial neural networks and how they work at a high level. Later on we'll actually get our hands dirty and create some, but first we need to understand how they work and where they came from. It's pretty cool stuff: this whole field of artificial intelligence is based on an understanding of how our own brains work. Over millions of years of evolution, nature has come up with a way to make us think, and if we just reverse-engineer the way that our brains work, we can get some insights on how to make machines that think. Within your brain, specifically your cerebral cortex, which is where your thinking happens, you have a bunch of neurons. These are individual nerve cells, and they are connected to each other via axons and dendrites. You can think of these as connections, wires if you will, that connect different axons together. Now, an individual neuron will fire, or send a signal to all the neurons it's connected to, when enough of its input signals are activated. So at the individual neuron level it's a very simple mechanism: you just have this cell, the neuron, with a bunch of input signals coming into it, and if enough of those input signals reach a certain threshold, it will in turn fire off a set of signals to the neurons that it, in turn, is connected to as well. But when you start to have many, many, many of these neurons connected together in many, many different ways, with different strengths between each connection, things get very complicated. This is kind of the definition of emergent behavior: you have a very simple concept, a very simple model.
But when you stack enough of them together, you can create very complex behavior at the end of the day, and this can yield learning behavior. This actually works, and not only does it work in your brain, it works in our computers as well. Now think about the scale of your brain. You have billions of neurons, each of them with thousands of connections, and that's what it takes to actually create a human mind. This is a scale that we can still only dream about in the field of deep learning and artificial intelligence, but it's the same basic concept: you just have a bunch of neurons with a bunch of connections that individually behave very simply, but once you get enough of them together, wired in complex enough ways, you can create very complex thoughts, if you will, and even consciousness. The plasticity of your brain is basically tuning where those connections go and how strong each one is, and that's where all the magic happens. Furthermore, if we look deeper into the biology of your brain, you can see that within your cortex, neurons seem to be arranged into stacks, or cortical columns, that process information in parallel. For example, in your visual cortex, different areas of what you see might be getting processed in parallel by different cortical columns of neurons. Each one of these columns is in turn made of mini-columns of around 100 neurons each, which are then organized into larger hyper-columns, and within your cortex there are about 100 million of these mini-columns. So again, they add up quickly. Now, coincidentally, this is a similar architecture to how the video card, the 3D video card in your computer, works. It has a bunch of very simple, very small processing units that are responsible for computing
how little groups of pixels on your screen are computed at the end of the day, and it just so happens that that's a very useful architecture for mimicking how your brain works. So it's sort of a happy accident that the research that happened to make video games run really quickly, to play Call of Duty or whatever it is you like to play, lends itself to the same technology that made artificial intelligence possible on a grand scale and at low cost. The same video cards you're using to play your video games can also be used to perform deep learning and create artificial neural networks. Think about how much better it would be if we actually made chips that were purpose-built specifically for simulating artificial neural networks. Well, it turns out some people are designing chips like that right now; by the time you watch this, they might even be a reality. I think Google's working on one as we speak. So at one point someone said, hey, the way we think neurons work is pretty simple; it actually wouldn't be too hard to replicate that ourselves, and maybe try to build our own brain. This idea goes all the way back to 1943. People proposed a very simple architecture where, if you have an artificial neuron, maybe you can set up an architecture where that artificial neuron fires if more than a certain number of its input connections are active. And when they thought about this more deeply in a computer science context, people realized that you can actually create logical expressions, Boolean expressions, by doing this. Depending on the number of connections coming from each input neuron, and whether each connection activates or suppresses a neuron (you can actually do both; it works that way in nature as well), you can do different logical operations. So this particular diagram is implementing an OR operation. Imagine that the threshold for our neuron was that if you have two or more inputs active, you will in turn fire off a signal.
In this setup here, we have two connections coming in from neuron A, and in turn two connections coming in from neuron B. If either of those neurons produces an input signal, that will actually cause neuron C to fire. So you can see we've created an OR relationship: if either neuron A or neuron B feeds neuron C two input signals, that will cause neuron C to fire and produce a true output. We've implemented the Boolean operation C = A OR B, just using the same wiring that happens within your own brain. I won't go into the details, but it's also possible to implement AND and NOT by similar means. Then we started to build upon this idea. We created something called the linear threshold unit, or LTU for short, in 1957. This just built on things by assigning weights to those inputs. So instead of just simple on-and-off switches, we now have the concept of weights on each of those inputs as well, which you can tune further. And again, this is working more toward our understanding of the biology: different connections between different neurons may have different strengths, and we can model those strengths in terms of these weights on each input coming into our artificial neuron. We're also going to have the output be given by a step function. This is similar in spirit to what we were doing before, but instead of saying we're going to fire if a certain number of inputs are active, there's no concept anymore of active or not active; there are weights coming in, and those weights can be positive or negative. So we'll say that if the sum of those weighted inputs is greater than zero, we'll go ahead and fire our neuron off, and if it's less than zero, we won't do anything. Just a slight adaptation to the concept of an artificial neuron, where we're introducing weights instead of simple binary on-and-off switches. So let's build upon that even further, and we'll create something called the perceptron.
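Before we get to perceptrons, it's worth seeing how little code these ideas take. Here is a sketch of a linear threshold unit that, with hand-picked weights, reproduces the OR gate described above. The specific weights are my own illustration, not anything learned.

```python
# A linear threshold unit (LTU) sketch: sum the weighted inputs and fire
# (output 1) if the sum exceeds zero, otherwise stay quiet (output 0).

def ltu(inputs, weights):
    total = sum(i * w for i, w in zip(inputs, weights))
    return 1 if total > 0 else 0

# OR gate: each real input gets weight 1, and a weight of -0.5 on a constant
# input of 1 acts as the firing threshold (the "bias" idea, hand-chosen here).
def or_gate(a, b):
    return ltu([a, b, 1], [1.0, 1.0, -0.5])

for a in (0, 1):
    for b in (0, 1):
        print(a, b, or_gate(a, b))  # fires whenever either input is active
```

Swapping in different weights gives you AND or NOT the same way, which is the sense in which these simple units can compute Boolean logic.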
And a perceptron is just a layer of multiple linear threshold units. Now we're starting to get into things that can actually learn. By reinforcing the weights between these LTUs that produce the behavior we want, we can create a system that learns over time how to produce the desired output. And again, this is working more toward our growing understanding of how the brain works. Within the field of neuroscience there's a saying that goes, "cells that fire together, wire together," and that's kind of speaking to the learning mechanism going on in our artificial perceptron here: if we have weights that are leading to the desired result (you can think of those weights, again, as strengths of connections between neurons), we can reinforce those weights over time and reward the connections that produce the behavior that we want. So you see here we have our inputs coming into weights, just like we did with LTUs before, but now we have multiple LTUs ganged together in a layer, and each one of those inputs gets wired to each individual neuron in that layer. We then apply a step function to each one. Maybe this would apply to classifications; maybe this would be a perceptron that tries to classify an image into one of three things, or something like that. Another thing we introduce here is something called the bias neuron, off there on the right. That's just something to make the mathematics work out: sometimes you need to add in a little fixed, constant value, which might be something else you can optimize as well. So this is a perceptron. We've taken our artificial neuron, moved on to the linear threshold unit, and now we've put multiple linear threshold units together in a layer to create a perceptron, and already we have a system that can actually learn.
You can actually try to optimize these weights, and you can see there are a lot of them at this point: if you have every one of those inputs going to every single LTU in your layer, they add up fast, and that's where the complexity of deep learning comes from. Let's take that one step further, and we'll have a multi-layer perceptron. Instead of a single layer of LTUs, we're going to have more than one, and we now have a hidden layer in the middle there. You can see that our inputs go into a layer at the bottom, the outputs come out of a layer at the top, and in between we have this hidden layer of additional LTUs (linear threshold units) that can perform what we call deep learning. So here we already have what we would call today a deep neural network. Now, there are challenges in training these things, because they are more complex, but we'll talk about that later on; it can be done. And again, the thing to really appreciate here is just how many connections there are. Even though we only have a handful of artificial neurons here, you can see there are a lot of connections between them, and there's a lot of opportunity for optimizing the weights between each connection. So that's how a multi-layer perceptron works. You can see that, again, we have emergent behavior here: an individual linear threshold unit is a pretty simple concept, but when you put them together in these layers, and you have multiple layers all wired together, you can get very complex behavior, because there are a lot of different possibilities for all the weights between all those different connections. Finally, we'll talk about a modern deep neural network, and really, this is all there is to it; for the rest of this course we're just going to be talking about ways of implementing something like this. All we've done here is replace that step function with something better. We'll talk about alternative activation functions.
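The layered wiring just described can be sketched as a simple forward pass in plain Python. The weights below are hypothetical, picked by hand rather than learned, purely to show the mechanics of inputs flowing through a hidden layer to an output.

```python
# Forward pass through a tiny multi-layer perceptron:
# 2 inputs -> hidden layer of 2 LTUs -> 1 output LTU.
# Every neuron sees every value from the layer below, plus a bias term.

def step(x):
    return 1 if x > 0 else 0

def layer(inputs, weight_rows, biases):
    # One row of weights (and one bias) per neuron in this layer
    return [step(sum(i * w for i, w in zip(inputs, row)) + b)
            for row, b in zip(weight_rows, biases)]

# Hand-picked (not learned) weights that happen to make this network
# compute XOR, just to illustrate the structure.
hidden_w = [[1.0, 1.0], [-1.0, -1.0]]
hidden_b = [-0.5, 1.5]
output_w = [[1.0, 1.0]]
output_b = [-1.5]

def network(a, b):
    hidden = layer([a, b], hidden_w, hidden_b)      # hidden layer fires first
    return layer(hidden, output_w, output_b)[0]     # output layer decides

for a in (0, 1):
    for b in (0, 1):
        print(a, b, network(a, b))
```

As a side note, XOR is a function a single-layer perceptron famously cannot represent, so this little example also hints at why the hidden layer matters; in a real network, gradient descent would find weights like these instead of us choosing them by hand.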
This one's illustrating something called ReLU, which we'll talk about later. The key point is that a step function has a lot of nasty mathematical properties, especially when you're trying to figure out slopes and derivatives, so it turns out that other shapes work better and allow you to converge more quickly when you're trying to train a neural network. We'll also apply softmax to the output, which we talked about in the previous lecture; that's just a way of converting the final outputs of our deep neural network into probabilities, from which we can choose the classification with the highest probability. And we will train this neural network using gradient descent, or some variation thereof; there are several to choose from, and we'll talk about that in more detail as well. That may use autodiff, which we also talked about earlier, to make that training more efficient. So that's pretty much it. In the past five minutes or so that we've been talking, I've given you pretty much the entire history of deep neural networks and deep learning, and those are the main concepts. It's not that complicated, right? That's really the beauty of it. It's emergent behavior: you have very simple building blocks, but when you put these building blocks together in interesting ways, very complex and, frankly, mysterious things can happen. So I get pretty psyched about this stuff. Let's dive into more details on how it actually works, up next.

5. Hands-On in the Tensorflow Playground: So now that we understand the concepts of artificial neural networks and deep learning, let's mess around with it. It's surprisingly easy to do. The folks behind TensorFlow at Google have created a nice little website called playground.tensorflow.org that lets us experiment with creating our own neural networks, and you don't have to write a line of code to do it.
So it's a great way to get a hands-on feel for how they work. Let's dive in. Head over to playground.tensorflow.org and you should see a screen like this. You can follow along here or just watch me do it, but I definitely encourage you to play around with this yourself and get an intuitive, hands-on feel for how deep learning works. It's a very powerful thing if you can understand what's going on in this web page. What we're trying to do here is classify a bunch of points just based on their location in this 2D image. So this is our training data set, if you will. We have a bunch of points here, and the ones in the middle are classified as blue, and the ones on the outside are classified as orange. Our objective is to create a neural network that, given no prior knowledge, can figure out whether a given point should be blue or orange, and successfully predict which classification it should be. So think of this as our training data. Okay, we know ahead of time what the correct classifications are for each one of these points, and we're going to use this information to train our neural network to hopefully learn that stuff in the middle should be blue and stuff on the outside should be orange. Now, here we have a diagram of the neural network itself, and we can play around with this. We can manipulate it. We can add layers, take layers out, add more neurons to layers, whatever you want to do. Let's review what's going on here. First of all, we're selecting the data set that we want to play with; we're starting with this default one, called Circle. The inputs are simply the x and y coordinates, the horizontal and vertical position of each data point. So as our neural network is given a point to classify, all it has to work with are those two values: its horizontal position and its vertical position.
And those start off equally weighted, horizontal and vertical, so we can define the position of any one of these points in terms of its horizontal and vertical position. For example, this point here would have a horizontal position of negative one and a vertical position of about negative five. Then we feed it into our network. You can see that these input nodes have connections to each one of these four neurons in our hidden layer, and we can manipulate the weights between each one of these connections to create the learning that we want. Those in turn feed into two output neurons here that will ultimately decide which classification we want at the end of the day. So keep in mind, this is a binary classification problem: it's either blue or orange. At the end of the day, we just need a single signal, really, and that's what comes into this output here. Let's go ahead, hit play, and see what happens. What it's going to do is run a bunch of iterations where it learns from this training data. We're going to keep feeding it input from this training data set, and as it iterates through it, it will start to reinforce the connections that lead to the correct classifications, through gradient descent or some similar mechanism, right? And if we do that enough times, it should converge to a neural network that is capable of reliably classifying these things. Let's hit play and just watch it in action. Keep your eye on that image to the right there. All right, you can see that we have already converged on a solution. I can go ahead and pause that now. Pretty cool stuff. You can see it has successfully created this pattern where stuff that fits into this middle area here is classified as blue, and stuff on the outside is classified as orange. So we can dive into what actually happened here. The thickness of all these connections represents their weights, so you can see the individual weights that are wired between each one of these neurons.
We start off here; you see these are more or less equally weighted. Well, not exactly equally; some of these are kind of weak. But what it leads to is this behavior in the middle. So we start off with equally weighted x and y coordinates. Those go to this layer here. For example, in this hidden layer, this neuron is saying, I want to weight things a little more heavily in this corner, okay, and things in the lower left-hand corner, not so much. And then this other one is picking out stuff on the top and bottom. This one's a little bit more diagonal to the bottom right, and this one's even more bottom-right heavy. If you combine these things together, we end up with output layers that look like this. Okay? So we end up with these two blobby things, where we're giving a boost to things on the right, and giving a boost to things that lie within this more blobby, circular area. And when we combine those together, we end up with our final output, which looks like this. Now, this might look different from run to run; there is some randomness in how this is all initialized. Do we actually even need a deep neural network to do this, though? One optimization is to remove layers and see if you can get away with it. Maybe we don't even need deep learning. I mean, really, this is kind of a simple thing: stuff in the middle is blue, stuff on the outside is orange. Let's go ahead and remove one of these neurons from the output layer; again, all we need is a binary result anyway. Can it still work? It does, in fact, just as quickly. So do I even need that layer at all? Let's go ahead and remove that final layer entirely. It still works, right? So for this very basic problem, I don't even need deep learning. All I have here is a single layer, so it's not even a multi-layer perceptron; it's just a perceptron. Do I even need four neurons in there?
Well, I think maybe I do, but this one here isn't really doing much, right? All it's doing is basically a pass-through, and the inputs coming into it have been weighted down to pretty much nothing. So I bet I don't even need that one. Let's get rid of it. It still works. Isn't that kind of cool? I mean, think about that: we only have three artificial neurons, and that's all it takes to do this problem. Compare that to the billions of neurons that exist inside your head. Now, we probably can't get away with less than that. Let's go ahead and try just two neurons and see what happens. Yeah, that's just not gonna happen, right? So for this particular problem, all you need is three neurons; two won't cut it. Let's play around some more. Let's try a more challenging data set. Okay, so here's a spiral pattern, and you can tell this is going to be harder, because we can't just say stuff in this corner is going to be this classification. We need a much finer-grained way of identifying these individual spirals. And again, we're going to see if we can just train a neural network to figure that rule out on its own. Well, obviously two neurons won't cut it. Let's go back to four. Let's see if that's enough. I bet it isn't. You can see it's trying, but it's really struggling. We can let this run for a while, and you can see it's starting to kind of get there. The blue areas are converging on some blue areas, and it's really trying hard, but there just aren't enough neurons to pull this one off. Let's go ahead and add another layer. Let's see if that helps. You can see it's doing more complicated things now that it has more neurons to work with, but it still can't quite get to where it needs to be. Let's add a couple more neurons to each layer. Generally speaking, you can either add more neurons to a layer or add more layers, and it's going to produce similar results.
But it might affect the speed with which it converges, depending on which approach you take. It's fascinating watching this work, isn't it? All right, this one got stuck. It still can't quite pull it off. Let's add one more layer. This is actually a very common pattern: you start off with wider layers at first and kind of narrow them down as you go. OK, so we're going to go from an initial layer of six neurons to a hidden layer of four neurons, and then a layer of two neurons, which will ultimately produce a binary output at the end. Well, I think it's getting there. There we go. Wow. Okay, so technically it's still refining itself, but it basically did it, right? Now, this is what we call overfitting, to some extent. I mean, obviously it has these tendrils kind of cutting through here, and that's not really part of the pattern we're looking for. It's still going, though; those tendrils are getting weaker and weaker. So it still doesn't have quite enough neurons to do exactly the thing that we would do intuitively. But still, this is a pretty complicated classification problem. It figured it out, maybe overfitting a little bit, but it figured it out, and all we have is what, 12 neurons here? I mean, that's insane, right? Now, another thing I want to talk about here is that this kind of illustrates the fact that once you get into multiple layers, it becomes very hard to intuitively understand what's going on inside the neural network. It gets kind of spooky, you know? I mean, what does this shape really mean? Once you have enough neurons, it's kind of hard to fit inside your own head what these patterns all really represent. The first layer is pretty straightforward; it's basically breaking up the image into different sections. But as you get into these hidden layers, things start to get a little bit weird as they get combined together.
Let's go ahead and add one more, shall we? I should say, two more to this output layer, and add one more layer at the end. Let's see if that helps things converge a little more quickly. Yeah, all right. It's starting to struggle a little bit. See that? It's actually got a spiral shape going on here now. So with those extra neurons, it was able to do something more interesting. We still have this little spike here that's doing the wrong thing, and it can't seem to quite think its way out of that one. Given a few more neurons, though, it might be able to figure it out. These ones are also misclassified, but I find it interesting that it actually created a spiral pattern here on its own. So maybe with a few more neurons, or one more layer, you could actually create an even better solution. But I will leave that as an exercise for you. I really encourage you to just mess around with this and see what kind of results you can get. This spiral pattern in particular is an interesting problem. Just to explain some of the other parameters here: we're doing classification here, and that's what we're going to be doing throughout this section. For the activation function, we talked about not using a step function and using something else; some others are popular as well, and ReLU is actually very popular right now. There's also a regularization setting we haven't talked about yet. The learning rate is just basically the step size in the gradient descent that we're doing, so you can adjust that if you want to as well. Let's see if ReLU actually makes a difference; I would expect it to just affect the speed. Oh my gosh, look at that. That's pretty darn close to what we want, right? I mean, apart from this little tiny spike here, which isn't really even there, there's a little bit of overfitting going on. But we have basically created that spiral shape just out of this handful of neurons. Gosh, I could do this all day, guys.
And I hope you will too. Just play around with this. It's so much fun, and it gives you such a concrete understanding of what's going on under the hood. I mean, look at this hidden layer here; that's where these spiral shapes were starting to emerge and come together. And when you think about the fact that your brain works in very much the same way, it's quite literally mind-blowing. Anyway, mess around with this. It's a really great exercise, and I hope you have some fun with it. 6. Deep Learning Details: All right. I know you're probably itching to dive into some code by now, but there's a little more theory we need to cover with deep learning. I want to talk a little bit about exactly how these networks are trained, and some tips for tuning them, now that you've had a little bit of hands-on experience with them using the Tensorflow Playground. So how do you train a multi-layer perceptron? Well, it uses a technique called backpropagation. It's not that complicated, really. At a conceptual level, all we're doing is gradient descent, like we talked about before, using that mathematical trick of reverse-mode autodiff to make it happen efficiently. For each training step, we just compute the output error for the weights that we currently have in place for each connection between each artificial neuron. And then this is where the backpropagation happens. Since there are multiple layers to deal with, we have to take that error, which is computed at the end of our neural network, and propagate it back in the other direction, push it back through the neural network backwards, okay? That way, we can distribute that error back through each connection, all the way back to the inputs, using the weights that we're currently using at this training step. Okay, so it's a pretty simple concept. We just take the error and use the weights currently in our neural network to back-propagate that error to the individual connections.
Then we can use that information to tweak the weights through gradient descent, to try to arrive at a better value on the next pass, the next epoch if you will, of our training. So that's all backpropagation is: we run a set of weights, we measure the error, we back-propagate that error using those weights, tweak things using gradient descent, and try again. And we just keep doing this over and over again until our system converges. We should talk a little bit about activation functions. In our previous exercise, using the Tensorflow Playground, we were using the hyperbolic tangent activation function by default, and then we switched to something called ReLU, and we saw that the results were a little bit better. What was going on there? Well, the activation function is just the function that determines the output of a neuron, given the sum of its inputs. You take the sum of all the weighted inputs coming into a neuron, and the activation function is what takes that sum and turns it into an output signal. Now, like we talked about way back in lecture one, using a step function is what people did originally. But that doesn't really work with gradient descent, because there is no gradient there. If it's a step function, there is no slope; it's either on or off. It's either straight across or straight up and down; there's no useful derivative there at all. So that's why alternative functions work a little better in practice. There are some other ones, like the logistic function and the hyperbolic tangent function, that produce more of a curvy shape. If you think about what a hyperbolic tangent looks like, it doesn't have that sharp cutoff at zero, at the origin, so that can work out pretty well. There's also something called the exponential linear unit, which is also a little more curvy. What we ended up using, though, was ReLU. That stands for rectified linear unit.
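The backpropagation loop just described, run the weights forward, measure the error, push the error back through the current weights, then nudge the weights with gradient descent, can be sketched for a tiny two-layer network in plain NumPy. This is a hedged illustration with made-up sizes, a squared-error loss, and an arbitrary learning rate, not production training code.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3))          # one training example, 3 features
y = np.array([[1.0]])                # the known correct answer
W1 = rng.normal(size=(3, 4))         # input -> hidden weights
W2 = rng.normal(size=(4, 1))         # hidden -> output weights
lr = 0.1                             # learning rate (step size)

# Forward pass: weighted sums with a ReLU hidden layer
h = np.maximum(0.0, x @ W1)          # hidden layer activations
out = h @ W2                         # output layer (linear)
err = out - y                        # error measured at the output

# Backward pass: push the error back through the current weights
grad_W2 = h.T @ err
grad_h = (err @ W2.T) * (h > 0)      # ReLU's derivative gates the error
grad_W1 = x.T @ grad_h

# Gradient descent: nudge the weights downhill, then repeat next epoch
W2 -= lr * grad_W2
W1 -= lr * grad_W1
```

In practice a framework's reverse-mode autodiff computes those gradients for you; this just makes the "error flowing backwards through the weights" idea concrete.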
And that's what this graph here is showing: basically, it's zero if the input is less than zero, and if the input is greater than zero, it climbs up at a 45-degree angle. So it's just giving you the actual sum of the weighted inputs as its output, if that sum is greater than zero. Okay, so the advantage that ReLU has is that it's very simple, very easy, and very fast to compute. If you're worried about converging quickly and about your computing resources, ReLU is a really good choice. Now, there are variants of ReLU that work even better if you don't care so much about efficiency. One is called leaky ReLU. All that is, is that instead of being flat left of zero, it has a little bit of a slope there as well, a very small slope. And again, that's for mathematical purposes, to have an actual meaningful derivative there to work with, so that can provide even better convergence. There's also something called noisy ReLU, which can also help with convergence. But these days, ELU, the exponential linear unit, will often produce faster learning. It's gaining popularity now that computing resources are becoming less and less of a concern, now that you can actually do deep learning over a cluster of PCs on a network in the cloud. So that's what activation functions are all about. You can also choose different optimization functions. We've talked in very general terms about gradient descent, but there are various variations of gradient descent you can use as well. We talked a little earlier about momentum optimization. Basically, the idea there is to speed things up as you're going down a hill, and slow things down as you start to approach the minimum. So it's a way of making the gradient descent happen faster, by kind of skipping over the steeper parts of your learning curve. That may be the only time I've used the phrase "learning curve" in a context where it actually means something mathematically meaningful.
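The activation shapes described above, ReLU, leaky ReLU, and ELU, are each a one-liner. A hedged NumPy sketch follows; the 0.01 leak and the ELU alpha of 1.0 are common defaults used for illustration, not values the course mandates.

```python
import numpy as np

def relu(x):
    # Flat at zero left of the origin, 45-degree line to the right
    return np.where(x > 0, x, 0.0)

def leaky_relu(x, leak=0.01):
    # A small slope left of zero keeps a meaningful derivative everywhere
    return np.where(x > 0, x, leak * x)

def elu(x, alpha=1.0):
    # Curvy left of zero; approaches -alpha asymptotically
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

print(relu(-2.0), leaky_relu(-2.0), elu(-2.0))   # 0.0, -0.02, ~-0.86
```

All three agree exactly for positive inputs; they only differ in how they treat the negative side, which is where the step function's "no useful derivative" problem lived.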
But anyway, there's also something called the Nesterov accelerated gradient, which is just a tweak on top of momentum optimization. Basically, it's looking ahead a little bit at the gradient in front of you and taking that information into account, so that works even better. There's also something called RMSProp, which uses an adaptive learning rate that again helps point you in the right direction toward the minimum. Remember back to how gradient descent works: it's not always obvious which direction you're going to be heading in, given a change in parameters. RMSProp is just a more sophisticated way of trying to figure out the right direction. Finally, there's something called Adam, which stands for adaptive moment estimation. Basically, it's the momentum optimizer and RMSProp combined, which kind of gives you the best of both worlds, and it's a popular choice today because it works really well. It's very easy to use; again, the libraries you're going to use for this stuff are very high-level and very easy to use. So it's not like you're going to have to implement the Nesterov accelerated gradient from scratch. You're just going to say optimizer equals Adam and be done with it. It's just a matter of choosing the one that makes sense for what you're trying to do, and making your own trade-offs between speed of convergence and the computational resources and time required to actually achieve that convergence. Let's talk about overfitting as well. You can see that you often end up with patterns like this, where you're not really getting a clean solution, like these weird spikes. And sometimes, if you let things go a little too long, it ends up reinforcing those spikes, those overfitted areas, where you're not really fitting to the pattern you're looking for; you're just fitting to the training data that you were given. Okay, so there are ways to combat that.
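The "optimizer equals Adam and be done with it" point really is one line in a high-level library like Keras. A hedged sketch, with an arbitrary toy topology (20 inputs, one hidden layer, two output classes), just to show where the optimizer choice lives:

```python
from tensorflow import keras

# Arbitrary toy topology for illustration
model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(2, activation="softmax"),
])

# Choosing the optimizer is one string; swap in "sgd", "rmsprop",
# or "nadam" (Adam with Nesterov momentum) to try the variants above.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

Each of those strings maps to a configurable optimizer class (e.g. `keras.optimizers.Adam`) if you later want to tune its learning rate.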
Obviously, if you have thousands of weights to tune, those connections between each neuron in each layer of your network can add up really quickly, so it is very easy for overfitting to happen. There are ways to deal with it. One is called early stopping: as soon as you see performance start to drop, that might be a nice way of telling you that it's time to stop learning. At that point, maybe you're just overfitting. There are also regularization terms you can add to the cost function during training, kind of like the bias term we talked about earlier; that can help too. But a surprisingly effective technique is called dropout, and again, this is an example of a very simple idea that works very well. The idea is just to ignore, say, half of the neurons, randomly, at each training step, and pretend they don't exist at all. The reason this works is that it forces your model to spread out its learning. If you're basically taking away half of its brain, if you will, at each training step, you're going to force the remaining half of those neurons to do as much work as possible. This prevents situations where individual neurons take on more of the work than they should. You even saw in some of the examples that we ran in the Tensorflow Playground that sometimes we ended up with neurons that were barely used at all; using dropout would have forced those neurons to be used more effectively. So it's a very simple concept that's very effective in making sure you're making full use of your neural network. Let's talk about tuning your topology. Another way to improve the results of your deep learning network is to just play games with how many neurons you have, and how many layers of neurons you have. One way of dealing with it is just trial and error. That's kind of what we did in the Tensorflow Playground, but there can be a methodology to that.
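The two overfitting defenses above, dropout and early stopping, are each one line in Keras. A hedged sketch with arbitrary sizes; `X_train` and `y_train` in the commented-out fit call are hypothetical placeholders for your own training data.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.5),   # randomly ignore half the activations each training step
    keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Early stopping: quit once validation loss stops improving for 3 epochs in a row.
stop_early = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)
# model.fit(X_train, y_train, validation_split=0.2, epochs=100, callbacks=[stop_early])
```

Note that the Dropout layer is only active during training; at prediction time all neurons participate, so you get the full network's behavior.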
You can start off with the strategy of evaluating a smaller network, with fewer neurons in the hidden layers, or you can evaluate a larger network with more layers. Basically, you want to see: can I get away with a smaller network and still get good results? Just keep making it smaller and smaller until you find the smallest it can safely be. Or you can try to make your network larger and larger, and see at what point it stops providing more benefit to you. So just start sizing things differently and see what works and what doesn't. Again, there's sort of a spooky aspect to how this stuff all works together. It's very hard to understand intuitively what's going on inside a neural network, a deep learning network in particular, so sometimes you just have to use your intuition to try to tune the thing and get at the right amount of resources you need. Also, in today's modern computing environments, sometimes you don't really care so much. It's probably okay to have a deep neural network that has more neurons than it really needs, right? I mean, what's the real expense involved in that these days? Probably not much. I will say that more layers will often yield faster learning than having more neurons in fewer layers. So if you care about speed of convergence, adding more layers is often the right thing to do. You can also use something called model zoos. There are actually libraries out there of neural network topologies for specific problems. So if you don't think you're the first person in the world to solve a specific classification problem, or anything else you're trying to apply a deep neural network to, maybe you should check out one of the model zoos out there, to see if someone's already figured out the optimal topology for what you're trying to achieve, instead of trying to reinvent the wheel. Okay? People share these things for a reason, and it can save you a lot of time. So that's enough theory.
That's enough talk. In our next lecture, we'll get our hands dirty with Tensorflow and start writing some real Python code to implement our own neural networks. 7. Introducing Tensorflow: If you've done any previous research in deep learning, you've probably heard of the Tensorflow library. It's a very popular framework developed by the folks at Google, and they have been kind enough to make it open source and freely available to the world. So let's talk about what Tensorflow is all about and how it can help you construct artificial neural networks. The thing that kind of took me by surprise when I first encountered Tensorflow was that it wasn't really purpose-built for deep learning at first, or even for neural networks in general. It's a much more general-purpose tool that Google developed, which just happens to be useful for developing deep learning and neural networks. More generally, it's an architecture for executing a graph of numerical operations. It's not just about neural networks. You can have any sequence of operations and define a graph of how those operations fit together. What Tensorflow actually does is figure out how to distribute that processing across the various GPU cores on your PC, or across various machines on a network, and make sure that you can do massive computing problems in a distributed manner. In that respect, it sounds a lot like Apache Spark. If you've taken other courses from me, you've probably heard me talk about Spark. It's a very exciting technology, and Spark is also developing machine learning, AI, and deep learning capabilities of its own. So in some ways, Tensorflow is a competitor to Apache Spark. But there are some key differences that we should talk about. It's not just about distributing graphs of computation across a cluster or across your GPUs. You can also run Tensorflow on just about anything. One thing that's special about Tensorflow is that I can even run it on my phone if I want to.
It's not limited to running on computers in a cluster in some data center. That's important, because in the real world you might want to push that processing down to the end user's device. Take the example of a self-driving car. You wouldn't want your car to suddenly crash into a wall just because it lost its network connection to the cloud, now would you? The way it actually works is that you might push the trained neural network down to the car itself, and execute that neural network on the computer embedded within your car, because the heavy lifting of deep learning is training that network. So you can do that training offline, push the weights of that network, which are relatively small, down to your car, and then run that neural network completely within the car itself. By being able to run Tensorflow on a variety of devices, it opens up a lot of possibilities for doing deep learning at the edge, on the actual devices where you're trying to use it. Tensorflow is written in C++ under the hood, whereas Spark is written in Scala, which ultimately runs on top of a JVM. Going down to the C++ level gives Tensorflow greater efficiency, but at the same time, it has a Python interface, so you can talk to it just like you would any other Python library. That makes it easy to program and easy to use as a developer, but very efficient and very fast under the hood. The other key difference between Tensorflow and something like Spark is that it can work on GPUs. A GPU is just your video card, the same video card that you're using to play Fortnite, or whatever it is you play. You can actually distribute the work across the GPU cores on your PC, and it's a very common configuration to have multiple video cards on a single computer, and to use that to gain more performance on clusters that are purpose-built for deep learning.
Plus, Tensorflow is free, and it's made by Google. Just the fact that it's made by Google has led to a lot of adoption. There are competing libraries out there, notably Apache MXNet, but Tensorflow, as of right now, is still by far the most popular. Installing Tensorflow is really easy. All you have to do is use the conda command in your Anaconda environment to install tensorflow, or you can use Anaconda Navigator to do it all through a graphical user interface. There's also a tensorflow-gpu package you can install instead, if you want to take advantage of GPU acceleration. If you're running this on Windows, I wouldn't go there quite yet; I've had some trouble getting tensorflow-gpu to work on my own Windows system. You'll find that a lot of these technologies are developed primarily for Linux systems running on a cluster. So if you're running on a purpose-built computer in a cluster, on EC2 or something that's made for deep learning, go ahead and install tensorflow-gpu, although it's probably going to be installed for you already. Let's talk about what Tensorflow is all about. What is a tensor, anyway? Well, this is another example of fancy, pretentious terminology that people use to make themselves look smart. At the end of the day, a tensor is just a fancy name for an array or a matrix of values. It's just a structured collection of numbers. That's it. That's all a tensor is. Using Tensorflow can be a little counterintuitive, but it's similar to how something like Apache Spark works, too. You don't actually execute things right away. Instead, you build up a graph of how you want things to execute, and then, when you're ready, you say, okay, Tensorflow, go do this. Tensorflow will then figure out the optimal way to distribute and parallelize that work across your entire set of GPUs and computers in your cluster. So let's take a look at the world's simplest Tensorflow application in Python.
All this is going to do is add one plus two together, but it's a good illustrative example of what's actually going on under the hood. We start by importing the Tensorflow library; we're going to refer to it as tf as a shorthand. We start off by saying a equals tf.Variable of 1, with name equals a. All that's doing is setting up a variable in Tensorflow, a Variable object which contains a single value, one, and which goes by the name a. The name is what will appear in visualization tools for your graph, if you're using that sort of thing, but internally we also assign it to a variable in Python called a. Then we set up a b variable that's assigned the value two and given the name b. Here is where the magic starts to happen. We say f equals a plus b, and you might think that would put the number three into the variable f. But it doesn't. f is actually your graph; it's the connection that you're building up between the a and b tensors to add them together. So f equals a plus b does not do anything except establish that relationship between a and b, and their dependency together in that f graph that you're creating. Nothing actually happens until we try to access the value of f, at which point Tensorflow 2 uses something called eager execution to actually execute that graph. At that point it will say, okay, I need to take the a variable, which contains one, and the b variable, which contains two, and add them together. It will figure out how to distribute that incredibly complicated operation (I'm being sarcastic) across your entire cluster, and that will ultimately print the value three, in the form of a new tensor. So we have just created the most complicated way imaginable of computing one plus two. But if these were larger tensors dealing with larger data sets, or, for example, a huge array or matrix of weights in a neural network, that distribution of the work becomes important.
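The example just walked through looks roughly like this in TensorFlow 2 code. This is a reconstruction from the description above; with eager execution (the TF2 default), `a + b` is evaluated as soon as we ask for its value.

```python
import tensorflow as tf

a = tf.Variable(1, name="a")   # a tensor holding the single value 1
b = tf.Variable(2, name="b")   # a tensor holding the single value 2
f = a + b                      # the add; eager execution runs it immediately
print(f.numpy())               # 3
```

The `name` arguments are only there for graph visualization tools; Python itself just sees the variables `a` and `b`.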
So although adding one plus two isn't a useful exercise to do with Tensorflow, once you scale this up to the many, many connections in a big neural network, it becomes very important to be able to distribute these things effectively. So how do we extend this idea to neural networks? Well, the thing about Tensorflow is that it's not made only for neural networks; it can do things like matrix multiplication. And it turns out that you can think about applying all the different weights and sums that happen within a single layer of a perceptron, and model that as just a matrix multiplication. You can take the output of the previous layer in your multi-layer perceptron and do a matrix multiplication with the matrix that describes the weights between each neuron of the two layers you're computing. Then you can add in a vector that contains the bias terms as well. So at the end of the day, you can take this fancy diagram of what a perceptron looks like and just model it as a matrix multiplication and a vector addition. Go back and read up on your linear algebra if you want to know more about how that works mathematically, but this is just a straightforward matrix multiplication operation with a vector addition at the end for the bias terms. By using Tensorflow's lower-level APIs, we're kind of doing this the hard way, but there are higher-level APIs in Tensorflow that make it much simpler and more intuitive to define deep neural networks. As we're describing Tensorflow at a low level right now, its purpose in life is just to distribute mathematical operations on groups of numbers, or tensors. It's up to us to describe what we're trying to do in mathematical terms, and it turns out that's really not that hard to do for a neural network. To actually compute a complete deep learning network from end to end, there's more to it than just computing the weights between different layers of neurons.
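The "layer as a matrix multiply plus a bias vector" idea above can be sketched directly with Tensorflow's low-level ops. The sizes here (3 inputs, 4 neurons, a batch of 2 examples) are arbitrary, chosen for illustration.

```python
import tensorflow as tf

x = tf.random.normal((2, 3))   # previous layer's output: batch of 2 examples, 3 values each
W = tf.random.normal((3, 4))   # weights between the 3 inputs and 4 neurons
b = tf.zeros((4,))             # one bias term per neuron

# All the weighted sums at once as a matrix multiply, the biases as a
# vector addition, then the activation function on top
layer_out = tf.nn.relu(tf.matmul(x, W) + b)
print(layer_out.shape)         # (2, 4): one output per neuron, per example
```

Stacking several of these matmul-plus-bias steps, one per layer, is essentially what a higher-level API's Dense layers do for you.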
We have to actually train this thing somehow and actually run it when we're done. So the first thing we need to do is load the training data that contains the features we want to train on and the target labels. To train a neural network, you need to have a set of known inputs with a set of known correct answers that you can use to actually descend, or converge, upon the correct solution of weights that leads to the behavior you want. After that, we need to associate some sort of optimizer with the network. TensorFlow makes that very easy to do. It could be gradient descent or some variation thereof, such as Adam. We will then run our optimizer using our training data, and again, TensorFlow makes that pretty easy to do as well. Finally, we'll evaluate the results of our trained network using our test data set. To recap at a high level: we're going to create a given network topology and fit the training data using gradient descent to actually converge on the optimal weights between each neuron in our network. When we're done, we can evaluate the performance of this network using a test data set that it's never seen before, and see if it can correctly classify data that it was not trained on. One other gotcha when you're using neural networks: it's very important to make sure that your input data is normalized, meaning it's all scaled into the same range. Generally speaking, you want to make sure that your input data has a mean value of zero and unit variance. That's just the best way to make the various activation functions work out mathematically. What's really important is that your input features are comparable in terms of magnitude; otherwise, it's hard to combine those weights together in a meaningful way. Your inputs all sit at the same level at the bottom of your neural network, feeding into that bottom layer, so it's important that they're comparable in terms of magnitude, so you don't end up skewing things and weighting things in weird ways.
For example, if I created a neural network that tries to classify people based on their age and their income, age might range from 0 to 100, but income might range from zero to a million. Those are widely different ranges, so that's going to lead to real mathematical problems if they're not scaled down to a comparable range first. Fortunately, Python's scikit-learn library has a StandardScaler package that you can use that will automatically do that with just one line of code. All you have to do is remember to use it, and many data sets that we use while researching will be normalized to begin with. The one we're about to use is already normalized, so we don't actually have to do that. But later on in the course, I'll show you an example of actually using StandardScaler. We've talked about how this all works at a low level, and in TensorFlow 2 it's still possible to implement a complete neural network basically from scratch, but in TensorFlow 2 they have replaced much of that low-level functionality with a higher-level API called Keras. There is value in understanding how it all works under the hood first, so let's work through a simple example of a neural network using the lower-level APIs next. After that, we'll see how the Keras API simplifies common neural network setups and hides a lot of this complexity from you. 8. Using Tensorflow for Handwriting Recognition, Part 1: Okay, so let's play around with TensorFlow using its lower-level API, so you get more of an appreciation of what's going on under the hood, if you will. On Windows, we'll start by going to our Start menu and finding the Anaconda3 group, and from there open up your Anaconda Prompt. On macOS and Linux, of course, you'll just open up a terminal prompt and you'll be good. The first thing you want to do is make sure that you have actually installed TensorFlow itself. So if you haven't already taken care of that, you can just say conda install tensorflow.
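The age/income scaling described above can be sketched directly in NumPy. This is a hedged illustration of what scikit-learn's StandardScaler computes (subtract each column's mean, divide by its standard deviation); the data values are made up:

```python
import numpy as np

# Standardizing two features with wildly different ranges so both end up
# with mean 0 and unit variance, as the lecture recommends for network inputs.

data = np.array([
    [25.0,  40_000.0],
    [50.0, 120_000.0],
    [75.0, 500_000.0],
])  # columns: age, income

mean = data.mean(axis=0)
std = data.std(axis=0)
scaled = (data - mean) / std     # what StandardScaler().fit_transform(data) produces

print(scaled.mean(axis=0))       # ~[0, 0]
print(scaled.std(axis=0))        # ~[1, 1]
```

After scaling, both columns are comparable in magnitude, so neither feature dominates the weighted sums at the bottom layer of the network.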
And I've already done that, so it won't actually do anything for me. But if you do need to install or update it, that will prompt you to do so. Give that a second just to check everything. All right, looks like we're good. Next, you want to cd into the directory where you installed the course materials. For me, that's going to be cd C:\mlcourse, and from within the course materials directory, type in jupyter notebook. That should bring up your favorite web browser. From here, find the Tensorflow notebook, go ahead and open it, and let's start playing. So we'll start off by running the world's simplest TensorFlow application that we looked at in the slides: we're just going to add the numbers one plus two together using TensorFlow. We start by importing the TensorFlow library itself and give it the name tf as a shorthand. We'll create two variables in TensorFlow, one called a and one called b. The variable a will have the number one associated with it, and the variable b will be initialized with the number two. We then say f = a + b, which does not put the number three into f; f just represents the graph that we're defining. It says f represents the addition of whatever's in a and b together. So nothing has actually happened here except for constructing that graph. It's only when we say tf.print, asking for the output of f, that TensorFlow will use what's called eager execution to go off and actually execute that graph and evaluate its results. So at that point, it goes off and says: okay, we have this graph constructed of a and b; a contains one, b contains two; let's add them together, get the output of the f graph, and print that out. And it could actually distribute that across an entire cluster if it had to, but obviously for just adding one plus two there's no need for all that. But let's see if it works.
Go ahead and hit Shift+Enter within that block after clicking inside it, and we should get the number three. Sure enough, the sum of a and b is three. Hey, it works. So let's do something a little bit more interesting: let's actually do handwriting recognition using TensorFlow. This is a pretty common example to use when people are learning TensorFlow. Basically, it's a data set of 70,000 handwriting samples, where each sample represents someone trying to draw the numbers zero through nine. So we have 70,000 images, each a 28 by 28 image of someone drawing a number from zero through nine, and our challenge is to create a neural network that looks at those images and tries to figure out what number they represent. Now, this is a very common example when people are learning TensorFlow, maybe a little bit too common, but there's a good reason for that: it's built into TensorFlow, it's easy to wrap your head around, and it's really good for learning. And our little twist on it, which you're not going to see in many other places, is actually using the lower-level APIs to implement this neural network to begin with. So let's dive in. Let's walk through actually loading up this data and converting it to the format that we need. The first thing we're going to do is import the libraries we need; we're going to be using NumPy and TensorFlow itself, and the MNIST data set itself is part of TensorFlow's keras.datasets package, so we can just import that right in and have that data accessible to us. We'll define some convenient variables here. num_classes is 10; that represents the total number of classifications for each one of these images. So again, these can represent the numbers zero through nine, and that's a total of 10 possible classifications. Our features are 784 in number, and we get that by saying that each image is a 28 by 28 image, so we have 28 times 28, which is 784 individual pixels for every training image that we have.
So our training features are each individual pixel of every individual image that we're using to train our neural network. Let's start off by loading the data set itself, so we'll say mnist.load_data() to actually retrieve that data from TensorFlow, and we'll put the resulting data set into these variables here. The convention we usually use is that x refers to your feature data (in our case, the images themselves) and y refers to your labels, so that will represent which number, zero through nine, the image depicts. Furthermore, we split things into training and testing data sets. So with MNIST, we have 60,000 training samples and 10,000 test samples. That means we're only going to train our neural network using that set of 60,000 training samples, and we're holding aside 10,000 test samples so we can actually test how well our trained network works on data that it's never seen before. This is how we prevent overfitting: we actually evaluate our model based on data the model has never seen before, so it didn't have a chance to overfit on that data to begin with. Next, we need to convert this to 32-bit floating point values, because that's what TensorFlow expects. So we start by creating NumPy arrays of the underlying training and test data and converting them to the np.float32 data type. We then flatten those images down. The data comes in as two-dimensional 28 by 28 images, and there are ways of constructing neural networks that can deal with two-dimensional data; we'll get there. But for now we're going to keep things simple and just treat each image as a one-dimensional array, or vector, or tensor, if you will, of 784 features, 784 pixels in this case. The reshape command is what does that. So by saying reshape(-1, num_features), where num_features, again, is 784, that's going to flatten these two-dimensional arrays down to one-dimensional 784-element tensors.
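The float conversion, flattening, and 0-to-1 scaling steps described here can be sketched with NumPy. To keep it runnable offline, a small batch of random fake "images" stands in for the real MNIST download:

```python
import numpy as np

# Fake pixel data in place of MNIST: 1000 images of 28x28 integer pixels
# in the range 0..255, just like the raw data set described in the lecture.
num_features = 28 * 28                              # 784 pixels per image
fake_images = np.random.randint(0, 256, size=(1000, 28, 28))

x_train = np.array(fake_images, np.float32)         # 32-bit floats, as TF expects
x_train = x_train.reshape(-1, num_features)         # (1000, 28, 28) -> (1000, 784)
x_train = x_train / 255.0                           # scale pixel values into [0, 1]

print(x_train.shape)                                # (1000, 784)
print(x_train.min() >= 0.0, x_train.max() <= 1.0)   # True True
```

The -1 in reshape tells NumPy to infer the number of images automatically, so the same line works for the 60,000-image training set and the 10,000-image test set.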
So that gives us a new x_train and a new x_test that contain these 784-pixel one-dimensional tensors. Next, we need to normalize our data; we talked about that in the slides as well. The raw data coming in from this data set represents every pixel as an integer value between 0 and 255: 0 represents a black pixel, 255 represents a white pixel, and values in between represent various shades of gray. We need to scale that down to the range of 0 to 1. To do that, very simply, we just divide everything by 255 and we're done. All right, so we've prepared and scrubbed and cleaned our data here. Let's actually do some more stuff with it. Let's start by wrapping our heads around what this data looks like. It's always a good idea to visualize the data you're going to be training with and understand its quirks and nuances before you actually try to implement an algorithm. That's what we're doing in this display_sample function here. It takes as input the number of a specific sample from our training data that we want to look at, and we'll just extract its label. y_train, again, holds the training labels, the number zero through nine that each image represents. Then we'll reshape it back to a two-dimensional 28 by 28 image so we can actually display it on the screen. We'll give it a title, show that two-dimensional image in grayscale, and just show it. So let's go ahead and kick that off. Actually, we didn't kick off that previous block, did we? So before we forget, go back up to the block where we prepare our data and Shift+Enter to run that. Now that we've done that, we can actually visualize the data we've loaded. Click down here and Shift+Enter, and here's a sample data point. Sample number 1000 is this image here, and we can see that it's supposed to represent the number zero, and, well, it looks like a zero. So this isn't a particularly challenging one for us to learn.
Hopefully, you can play around here and try different sample numbers to get a better feel for what the data is like. So let's try 1500. Turns out that's the number nine, kind of a weird-looking nine, so that might be a little bit of a challenge. How about, I don't know, 1700? That's a one; it looks like a one. But if you keep poking around here and trying different values, eventually you'll find some weird ones. For example, that's a pretty funny-looking two. But you can see there's a wide variety of handwriting among the people who made this test data. So that's a good way to wrap your head around what we're dealing with. Moving on, we can take this visualization to the next step and actually visualize those one-dimensional arrays that we're actually training our neural network on. This will give us a better picture of the input that our neural network is going to see, and sort of make us appreciate just what's going on here and how differently it, quote-unquote, "thinks." So what we're going to do here is reshape everything back down to one-dimensional arrays of 784 pixels, take one of our training data sets here, iterate through the first 500 samples, and concatenate each individual image onto that original image of zero. So we're going to take basically the first 500 training images, flatten them down to one-dimensional arrays of 784 pixel values, and then combine that all together into a single two-dimensional image that we'll plot. Let's go ahead and click in that block and Shift+Enter to run it. And this is interesting. This is showing you the input that's going into our actual neural network for each of the first 500 images, and you can see that your brain does not do a very good job at all of figuring out what these things represent, right? So every single row of this image represents the input data going into our neural network.
So our neural network is going to take every one of those rows of one-dimensional data and try to figure out what number it represents in two-dimensional space, so you can see that it's thinking about the world, perceiving these images more specifically, in a very different way than your own brain does. So it's sometimes a dangerous assumption to think that neural networks work the same way your brain does. They're inspired by your brain, but they don't always work the same way; an important distinction there. All right, let's actually start setting up our neural network. We'll start off by defining some parameters, or hyperparameters, if you will, that define how our training will actually work. The learning rate is basically how quickly we'll descend through gradient descent in trying to find the optimal values, the optimal weights, for our neural network. Training steps is basically how many training epochs we'll actually conduct, how many times we'll actually iterate over this entire neural network trying to train it. Batch size is how many random samples we'll take from our training data during each step, and display step is just how often we'll display our progress as we train this neural network. And n_hidden represents how many hidden neurons we'll have in our hidden layer, so that middle layer of neurons in our neural network will have 512 neurons within it. You can play with that number and see what works best for you. These are all hyperparameters, and the dirty little secret of machine learning is that a lot of your success depends on how well you can guess the best values for these. A lot of it's just trial and error, trying to find the right value for the learning rate, the right number of hidden neurons. These numbers are basically determined through experimentation, so it's not quite the exact science that you might think it is.
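Collected in one place, the hyperparameters this lecture mentions look something like the block below. Note that batch size (250), hidden neurons (512), training steps (3,000), and display step (100) come from the lecture itself, but the learning rate value here is an assumed placeholder, since the lecture only says to tune it by experimentation:

```python
# Hyperparameters for the lower-level TensorFlow MNIST example.
# learning_rate is an assumption; the other values are stated in the lecture.

learning_rate = 0.001      # assumed placeholder: tune through trial and error
training_steps = 3000      # how many training iterations to run
batch_size = 250           # random samples drawn from the training data per step
display_step = 100         # print loss/accuracy every 100 steps
n_hidden = 512             # neurons in the single hidden layer

print(training_steps // display_step)   # 30 progress reports during training
```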
Let's go ahead and execute that block: Shift+Enter. We're also going to further slice up our data set here and prepare it for training using TensorFlow. We're going to use tf.data.Dataset to create what's called a Dataset object within TensorFlow from our training images and training labels. We will then use that Dataset to create the individual batches that we use to train the neural network. So shuffle(60000) means I'm going to shuffle all 60,000 training images that I have, just randomly shuffle them all. I will then batch them up into batches of 250 and prefetch the first batch, so I have that ready to go. That's all that's going on here. Shift+Enter. All right, now we're going to start to actually construct our neural network itself. We'll start by creating the variables that will store the weights and bias terms for each layer of our neural network. We start off by initializing all of our variables with random values, just to make sure we have a set of initial random settings there for our weights. We want to start with something, and for lack of anything better, we'll start with random values. Actually, your choice of initialization values can make a big difference in how your neural networks perform, so it's worth looking into how to choose the right initial values for a given type of neural network. We set up our weights here for the hidden layer, we'll call those weights h, and we'll use the random_normal function that we just defined to initialize those weights randomly. And n_hidden, as you might recall (let's look up here again) is 512. So this will create the variables that contain the weights for 512 hidden neurons. We also need a set of weights on our output layer of 10 output neurons.
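The shuffle-then-batch step described above (what tf.data.Dataset's shuffle(60000).batch(250) does in the lecture) can be sketched in NumPy. Simple integer "samples" stand in for the real images:

```python
import numpy as np

# Shuffle the whole training set once, then slice it into fixed-size batches,
# mimicking the Dataset.shuffle(...).batch(...) pipeline from the lecture.

rng = np.random.default_rng(42)
x = np.arange(1000)                    # pretend these are 1000 training samples
batch_size = 250

indices = rng.permutation(len(x))      # one random ordering of the data set
batches = [x[indices[i:i + batch_size]]
           for i in range(0, len(x), batch_size)]

print(len(batches))                    # 4 batches
print(len(batches[0]))                 # 250 samples each
```

The real Dataset object also prefetches the next batch in the background while the current one trains; that part is an optimization with no NumPy equivalent here.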
So the output is going to be 10 neurons, where every neuron represents the likelihood of a given classification, zero through nine. We also need biases associated with both of these layers. So b will be the set of biases for our hidden layer (again, there will be 512 of those), and we also have biases associated with our output layer of 10 neurons as well. These will be initialized to zeros. Okay, so a little bit different there for the biases: by default, we want our biases to be zero. Let's go ahead and execute that. All right, moving on. 9. Using Tensorflow for Handwriting Recognition, Part 2: Now we're going to set up the topology of our neural network itself. That's what neural_net here does, and as we said in the slides, you can define these fancy neural networks as just simple matrix multiplication and addition functions, right? So we'll start off by saying tf.matmul; that's just a matrix multiplication of our input neurons, which are the raw 784 pixel values, with the 512 weights in our hidden layer of neurons. That matrix multiplication is multiplying each one of those input values by the weights in that hidden layer. We then say tf.add to add in the bias terms, which again are stored in the b variable that we just defined above. Okay. Next, we need to apply a sigmoid function to the output of that hidden layer. So that's basically the activation function on each hidden neuron. That's all that's happening there; very simple. For the output layer we'll do it again: we do a matrix multiplication of the hidden layer with our output weights, and then we add in those output biases at the end as well, and we'll call softmax to actually normalize those output neurons to a final probability for each individual classification. So softmax, again:
It's just a mathematical trick for taking the outputs of these neural networks and converting those output neuron values into what we can interpret as a probability of each individual classification being correct. Go ahead and execute that; that's quick as well. Again, remember, all we're doing is defining our graph at this point. We're not actually training anything or running anything yet, but do take the time to noodle on that: we have basically constructed the topology of the neural network itself. The next thing we need to do is figure out how to actually train this neural network. Again, we're doing this the hard way, so we have to write this by hand. We'll start by defining our loss function, and it's called cross entropy. Basically, it's a way of driving gradient descent by applying a logarithmic scale, and that has the effect of penalizing incorrect classifications much more than ones that are close to the correct answer. That's a handy property for making training go quickly. So we're going to pass two things into this function: y_pred is the predicted values coming out of our neural network, and y_true is the known true label associated with each image. We have to talk about one-hot encoding at this point. We talk about this when we discuss feature engineering in the course, but in order to compare that known value, that known label, which is a number from 0 to 9, to the output of this neural network, remember that the output of the neural network is actually 10 different neurons, where each one represents the probability of a given classification. To actually compare that to the known correct value, we need to convert that known correct number to a similar format, so we're going to use one-hot encoding. It's best understood with an example. So let's say that we know that the known correct label for an image is one.
We would one-hot encode that as a 10-value array, where each value represents the probability of a given classification. Since we know with 100% certainty that this image was a one, we can say that for the classification zero there's a 0% chance, for the classification one there's a 100% chance (1.0), for two it's zero, for three zero, and so on and so forth. So that is one-hot encoding: you're just creating a binary representation of an integer value, if you will. The number one is represented as 0 1 0 0 0 0 0 0 0 0. Every one of those slots in the array represents a different classification value, and this just makes it easier for the math to work out and to construct things. All right. So we start off by encoding that known label into a one-hot encoded array. We then do some clipping to avoid numerical issues with log of zero, and then we can compute the actual cross-entropy term by doing reduce_sum to go across the entire set of values within this batch, using this logarithmic comparison, like we said, to actually compute cross entropy with that logarithmic property. So let's go ahead and Shift+Enter to do that. Again, the key things here are reduce_mean and reduce_sum, which mean we're going to apply this across the entire batch all at once. All right. Next we need to define what's called an optimizer, and in this case we're going to use stochastic gradient descent. We've talked about that before, and we will use our learning rate hyperparameter, which again we'll want to tune through experimentation later on, to define that optimizer. We need to write a function to actually run that optimization, and again, with the lower-level TensorFlow APIs, we kind of have to do this the hard way. We're going to use something called GradientTape to actually compute the gradients automatically.
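The one-hot encoding and cross-entropy loss described above can be sketched in NumPy. The clipping step mirrors what the lecture does to avoid taking log(0); the probability values are made up for illustration:

```python
import numpy as np

# One-hot encoding turns a label like 1 into [0, 1, 0, ..., 0]; cross entropy
# then compares that target against the network's 10 output probabilities.

def one_hot(label, num_classes=10):
    encoded = np.zeros(num_classes)
    encoded[label] = 1.0
    return encoded

def cross_entropy(y_pred, y_true_label):
    y_true = one_hot(y_true_label)
    y_pred = np.clip(y_pred, 1e-9, 1.0)        # clip to avoid log(0)
    return -np.sum(y_true * np.log(y_pred))    # the logarithmic penalty

print(one_hot(1))    # [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]

confident = np.array([0.01, 0.91] + [0.01] * 8)   # strongly predicts class 1
wrong     = np.array([0.91, 0.01] + [0.01] * 8)   # strongly predicts class 0
print(cross_entropy(confident, 1) < cross_entropy(wrong, 1))   # True
```

The comparison at the end shows the property the lecture cares about: because of the logarithm, a confidently wrong prediction is penalized far more heavily than a confidently correct one.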
What that's doing, as you can see here, is actually calling our neural_net function that defines the topology of our neural network, and computing the loss function, the cross-entropy function that we defined above, as well. So this is tying it all together and actually allowing us to optimize this neural network within this one function. It's not terribly important to understand what's going on here at a low level; you can kind of follow it through the comments. We're updating our variables as we go through each step: we compute the gradients, and then we update our weights and biases at each training step. So this code is what's computing new weights and biases through each training pass. Again, this is going to be a lot easier using the higher-level Keras API; right now we're just showing you this to give you an appreciation of what's going on under the hood. Let's go ahead and Shift+Enter that one as well. All right, so now we have everything we need: we have the topology of our network defined, we have the variables defined for our weights and biases, we have a loss function defined, which is cross entropy, and we have an optimization function that ties it all together, called run_optimization. Let's go ahead and start training this thing. So that's what's going on here. Oh wait, one more thing: we need an accuracy metric as well. A loss function isn't enough; we also want to display the actual accuracy at each stage too. And all this accuracy metric does is take the maximum argument from each output array, which corresponds to our one-hot encoded value, and compare that to the one-hot encoded known value that we have for that label. So this is just a way of saying: we're going to call reduce_mean to actually compute the correctness of each individual prediction and average that across the entire data set. That's what our accuracy metric does. Shift+Enter.
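The accuracy metric just described can be sketched in NumPy: take the argmax of each row of output probabilities, compare it to the known label, and average the hits (the lecture's reduce_mean over correct predictions). The values here are made up:

```python
import numpy as np

# Accuracy = fraction of samples where the highest-probability output neuron
# matches the known correct label.

predictions = np.array([
    [0.1, 0.7, 0.2],      # predicts class 1
    [0.8, 0.1, 0.1],      # predicts class 0
    [0.3, 0.3, 0.4],      # predicts class 2
    [0.6, 0.2, 0.2],      # predicts class 0
])
labels = np.array([1, 0, 1, 0])            # the third prediction is wrong

predicted_classes = np.argmax(predictions, axis=1)
accuracy = np.mean(predicted_classes == labels)
print(accuracy)                            # 0.75
```

This uses three classes instead of ten to keep the arrays short; the arithmetic is identical for the 10-neuron MNIST output layer.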
And now we can actually kick it off. You can see that we've done all the heavy lifting; now it's fairly simple to actually run it. For every training step, we're going to take a batch from our training data. Remember, we created those batches earlier on using a Dataset. Across the training steps (that's going to be 3,000, I think we said), for each batch, each step of training, we're going to call run_optimization, that one function that ties everything together, to apply our optimization across our neural network and compute the optimal weights and biases at each step. And every 100 steps (that's what display_step is) we'll output our progress as we go. So every 100th epoch, we will actually execute our neural network on the current batch, get a set of predictions for that batch of 250 values, compute cross entropy on that to get a snapshot of our current loss function, and compute accuracy on it as well, so that you get a metric of our accuracy at each stage and we can see it converge over time, for every 100 steps throughout our 3,000 training steps, or epochs, if you will. Let's go ahead and kick that off. And this is where the action's happening. So this is going to iterate over all 3,000 epochs, and we can see the accuracy changing as we go. It's kind of interesting to watch this, because the accuracy fluctuates a little bit as it goes, so you can tell it's maybe settling into a little local minimum here, working its way out of those, and finding better solutions over time. But as it goes, it will start to converge on better and better values. So we started off with just 84% accuracy; right now we're up to about 94%, but it's still kind of all over the place. Let's give it some more time. Pretty firmly in the nineties at this point, so that's good; 93s, I'd say, up to 2,000 epochs.
Now remember, we're going to go up to 3,000, and again, you're just going to have to watch this and see where it starts to converge. We could do early stopping to figure out at what point we actually stop getting an improvement, but you kind of have to eyeball it first to get a sense of how many epochs you really need. We're almost there, and it looks like we're not going to do much better than 93% accuracy. So there we have it: over 3,000 epochs, we ended up with an accuracy of 92.8%, and remember, this is actually measured on our training data set. So there's a possibility that we're overfitting here. To really evaluate how well this model does, we want to evaluate it on the test data set, on data that it's never seen before. So let's go ahead and use that test data set that we set aside at the beginning, run the neural network on it, and actually call our accuracy function to see how well it does on test images it's never seen before. 93%. Not too bad, you know? I mean, we could do better, but for a very simple take at using TensorFlow and setting up a very simple neural network with just one hidden layer, that's not too bad. We'll do better throughout the course as we try different techniques, but we're off to a good start here. Again, it's good to understand your data, visualize things, and see what's actually going on under the hood. So let's take a look at some of those misclassified images and get more of a gut feel for how good our model really is. Let's take a look at some examples of images that it did not classify correctly and see just how forgivable those errors are. That's what this function is doing: basically, we're going through 200 images, taking a look at the actual predicted value (by taking argmax on the output array of the output neuron layer) and comparing that to the known correct labels.
And if it's not correct, we'll print it out with the original label and the predicted label, to get an idea of what's happening here. Shift+Enter. All right, so these are some pretty messy examples. In this example, we knew that this person was trying to draw a five; we thought it was a six. Yeah, I can understand that. I mean, that's a really nasty-looking five, and there's a pretty good case to say that it was actually a six. So that's a case where your human brain couldn't actually do a whole lot better. I would not guess that's a five; it just looks like a squiggle. This one too: our model's best guess was a six; the person intended a four. That does not look like a four to me. It looks like half of a four, basically; I don't know what happened to the person's arm when they were drawing it. But again, this lets you appreciate just how well it's doing, or not. This one, I'm not sure how our model thought it was a seven. Again, that's a very weird-looking six, but it doesn't look like a seven either. Anyway, that one's also kind of a nasty one; it looks like a two to the brain. It's a really funny, squished, odd-looking two, and this is an example of where your brain does a better job than our simple neural network. But overall, these are largely forgivable errors. The ones where it messed up were some pretty weird, messy examples. Like, what is that? I guess it's supposed to be a two; we guessed it was a nine. I could actually see that going either way. So, not too bad. Anyway, if you want to play with it, I encourage you to do so. See if you can improve upon things. Like we talked about, there are a lot of different hyperparameters here to play with: the learning rate, how many hidden neurons we have. So try different values; try different learning rates; try more neurons, fewer neurons; see what effect that has. Just play around with it.
Because in the real world, that's what you have to do. Try adding a second hidden layer, even, or different batch sizes, or a different number of epochs. Just get your hands dirty and get a good gut feel for how those different parameters affect the output and the final results that you get. So give it a shot, and if you can get more than 93% accuracy, we'd love to hear about it in the Q&A. 10. Introducing Keras: So we've had a look at developing neural networks using TensorFlow's lower-level APIs, where instead of really thinking about neurons or units, you're thinking more about tensors and matrices and multiplying them together directly. That's a very efficient way of doing it, but it's not really intuitive, and it can be a little bit confusing, especially when you're starting out, to implement a neural network in those terms. Fortunately, there's a higher-level API called Keras that's now built into TensorFlow. It used to be a separate product that sat on top of TensorFlow, but as of TensorFlow 1.9 it's actually been incorporated into TensorFlow itself as an alternative, higher-level API that you can use. And it's really nice, because it's purpose-built for deep learning. All the code is very much built around the concept of artificial neural networks, and it makes it very easy to construct the layers of a neural network, wire them together, and use different optimization functions on them. It's a lot less code and a lot fewer things that can go wrong as a result. Another benefit of Keras, in addition to its ease of use, is its integration with the scikit-learn library. So if you're used to doing machine learning in Python, you probably use scikit-learn a lot, and using Keras, you can actually integrate your deep neural networks with scikit-learn.
And you might have noticed in our previous lecture that we kind of glossed over the problem of actually doing train/test splits or cross validation on our neural network, because it would have been kind of a big pain in the butt. But with scikit-learn, it's very easy to do cross validation and perform proper analysis and evaluation of this neural network. So that makes it easier to evaluate what we're doing, and to integrate it with other models, or even chain a neural network with other deep learning or machine learning techniques. There's also a lot less to think about, and that means you can often get better results without even trying. With TensorFlow, you have to think about every little detail at the linear algebra level of how these neural networks are constructed, because it doesn't really natively support neural networks out of the box. You have to figure out: how do I multiply all the weights together? How do I add in the bias terms? How do I apply an optimizer? How do I define a loss function? Things like that. Whereas Keras can take care of a lot of those details for you. So with fewer things for you to screw up, and more things that Keras can take on for you in terms of optimizing what you're really trying to do, often you can get better results without doing as much work, which is great. Why is that important? Well, the faster you can experiment and prototype things, the better your results will be. If it's that much easier for you to try different layers in your neural network, different topologies, different optimizers, different variations, it's going to be that much easier and quicker for you to converge on the optimal kind of neural network for the problem you're trying to solve, whereas TensorFlow is putting up a bunch of roadblocks for you along the way. At the end of the day, you only have so much time to devote to these problems, right?
So the more time you can spend on the topology and tuning of your neural network, and the less on the implementation of it, the better your results will be at the end of the day. Now, you might find that Keras is ultimately a prototyping tool for you. It's not as fast as going straight to TensorFlow, so sometimes you'll want to converge on the topology you want and then go back and implement it at the TensorFlow layer. But even that use for prototyping alone is well worth it; it makes life a whole lot easier. So let's take a closer look. Again, Keras is just a layer on top of TensorFlow that makes deep learning a lot easier. All we need to do is start off by importing the stuff we need. We're going to import the Keras library and some specific modules from it. We have the MNIST dataset here that we're going to experiment with; the Sequential model, which is a very quick way of assembling the layers of a neural network; the Dense and Dropout layers as well, so we can actually add some new things onto this neural network to make it even better and prevent overfitting; and we will import the RMSprop optimizer, which is what we're going to use for our gradient descent. Shift-Enter, and you can see it's already loaded up Keras just by importing those things, using TensorFlow as the back end. Let's go ahead and load up the MNIST dataset that we've used in the previous example. Keras's version is a little bit different: it actually has 60,000 training samples, as opposed to 55,000, and still 10,000 test samples, and loading it is just a one-line operation. All right, so now we need to, as before, convert this to the shape that TensorFlow expects under the hood. So we're going to reshape the training images to be 60,000 by 784. Again, we're still going to treat these as 1D images; we're going to flatten them all out into 1D rows of 784 pixels for each 28 by 28 image.
We also have our test dataset of 10,000 images, each with 784 pixels apiece, and we will explicitly cast the images as 32-bit floating point values, and that's just to make the library a little bit happier. Furthermore, we're going to normalize these things by 255. The image data here is actually 8-bit at the source, so it's 0 to 255. To convert that to 0 to 1, what we're doing here, basically, is converting it to a floating point number first from that 0-to-255 integer, and then dividing it by 255 to rescale that input data to 0 to 1. We've talked before about the importance of normalizing your input data, and that's all we're doing here: we're just taking data that started off as 8-bit, 0-to-255 data, and converting it to 32-bit floating point values between zero and one. That's all that's going on there. As before, we will convert our labels to one-hot format; that's what to_categorical does for you. It just converts the label data, on both the training and the test datasets, to one-hot values. Let's go ahead and run that previous block there before we forget, and we will run this one as well. Again, I'm just hitting Shift-Enter after selecting the appropriate blocks of code here. All right, as before, let's visualize some of the data just to make sure that it loaded up successfully. This is pretty much the same as the previous example. We're just going to look at our input data for sample number 1234, and we can see that our one-hot label here is showing a one in position four, and since we start counting at zero (0, 1, 2, 3), that indicates label three. Using argmax, that gives us back the human-readable label. And by reshaping that 784-pixel array into a 2D shape, we can see that this is somebody's attempt at drawing the number three. OK, so far, so good. Our data looks like it makes sense and was loaded correctly. Now, remember, back when we were dealing with TensorFlow, we had to do a whole bunch of work to set up our neural network.
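Taken together, the reshape, cast, rescale, and one-hot steps just described can be sketched with plain NumPy. This is a minimal sketch using random stand-in data in place of the actual MNIST download, and it imitates Keras's to_categorical with np.eye so it runs without Keras installed:

```python
import numpy as np

# Random stand-in for the MNIST training set: 8-bit 28x28 grayscale images
train_images = np.random.randint(0, 256, size=(60000, 28, 28), dtype=np.uint8)
train_labels = np.random.randint(0, 10, size=(60000,))

# Flatten each 28x28 image into a 1D row of 784 pixels
x_train = train_images.reshape(60000, 784)

# Cast to 32-bit floats, then rescale the 0-255 integers down to 0-1
x_train = x_train.astype('float32') / 255.0

# One-hot encode the labels (what Keras's to_categorical does for you)
y_train = np.eye(10)[train_labels]
```

With the real dataset you would get the arrays from the Keras MNIST loader instead of np.random, but the transformations are the same.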
We'll look at how much easier it is with Keras. All we need to do is say that we're setting up a model, a Sequential model, and that means we can add individual layers to our neural network one layer at a time, sequentially, if you will. So we will start off by adding a dense layer of 512 neurons with an input shape of 784. This is basically our first layer: it takes the 784 input signals from each image, one for each pixel, and feeds them into a hidden layer of 512 neurons, and those neurons will have the ReLU activation function associated with them. So with one line of code, we've done a whole lot of work that we had to do by hand in TensorFlow before. And then on top of that we'll put a softmax activation function on a final layer of 10 neurons, which will map to our final classification of what number this represents, from 0 to 9. Okay, wasn't that easy? We can even ask Keras to give us back a summary of what we set up, just to make sure things look the way we expected. And sure enough, we have two layers here: one with 512 neurons, and then a 10-neuron layer for the final classification. This does sort of omit the input layer, but we do have that input shape of 784 features going into the first layer. All right. Now, you also might remember that it was kind of a pain in the butt to get the optimization and loss functions set up in TensorFlow. Again, that's a one-liner in Keras. All we have to do is say that our loss function is categorical cross-entropy, and it will know what to do there. We're going to use the RMSprop optimizer just for fun; we could use any other one we wanted to. We could just use Adam if we wanted to, or there are other choices, like Adagrad or SGD; you can read up on those at this link here if you want to. And we will measure the accuracy as we go along. So that's all that's saying. Let's go ahead and hit that, and that will build the underlying graph that we want to run in TensorFlow.
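The model construction and the one-line compile described here can be sketched like this, assuming a recent TensorFlow with the bundled tf.keras API (layer sizes and choices as in the lecture):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Sequential model: add layers one at a time
model = Sequential()
# Hidden layer: 784 input pixels feeding 512 ReLU neurons
model.add(Dense(512, activation='relu', input_shape=(784,)))
# Output layer: 10 softmax neurons, one per digit 0-9
model.add(Dense(10, activation='softmax'))

# Loss function, optimizer, and metric, all in one line
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

# Sanity-check the topology we just built
model.summary()
```

Swapping 'rmsprop' for 'adam' or 'sgd' in the compile call is all it takes to try a different optimizer.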
All right, so now we actually have to run it. And again, that's just one line of code with Keras. All we need to do is say that we're going to fit this model using this training dataset; these are the input features and labels we're going to train with. We want to use batch sizes of 100, and we're going to run that 10 times. I'm going to set a verbosity level of 2, because that's what works best within an IPython notebook, and for validation, we will provide the test dataset as well. So instead of writing a big function that does each iteration of learning by hand, like we did in TensorFlow, Keras does it all for us. Let's go ahead and hit Shift-Enter and kick that off. Now, Keras is slower than raw TensorFlow; it's doing a little bit more work under the hood, so this will take more time, but you'll see that the results are really good. I mean, even on that first epoch, we've already matched the accuracy that we got after 2,000 iterations in our hand-coded TensorFlow implementation. We're already up to epoch six, and we're approaching 99% accuracy on our training data. Keep in mind this is measuring the accuracy on the training dataset, and we're almost there. But yeah, even with just 10 epochs, we've done a lot better than using TensorFlow directly. And again, Keras is kind of doing a lot of the right things for you automatically, without making you even think about it. That's the power of Keras: even though it's slower, it might give you better results in less time at the end of the day. Now, here's something that we couldn't really do easily with TensorFlow. Well, it's possible; I just didn't get into it because that lecture was long enough as it was. Remember that we can actually integrate Keras with scikit-learn, so we can just say model.evaluate, and that's just like a scikit-learn model.
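The one-line training call just described can be sketched as follows. To keep this self-contained and quick, it uses tiny random stand-in arrays and 2 epochs rather than the real MNIST data and the 10 epochs used in the lecture:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Tiny synthetic stand-ins for the MNIST arrays so the sketch runs fast
x_train = np.random.rand(200, 784).astype('float32')
y_train = np.eye(10)[np.random.randint(0, 10, 200)]
x_test = np.random.rand(50, 784).astype('float32')
y_test = np.eye(10)[np.random.randint(0, 10, 50)]

model = Sequential([
    Dense(32, activation='relu', input_shape=(784,)),
    Dense(10, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])

# The one-line training call: batch size 100, verbose=2 for
# notebook-friendly per-epoch output, and the test set passed in
# so validation accuracy is reported after every epoch
history = model.fit(x_train, y_train,
                    batch_size=100, epochs=2, verbose=2,
                    validation_data=(x_test, y_test))
```

The returned history object records the loss and accuracy for each epoch, which is how the per-epoch numbers shown in the lecture get reported.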
As far as Python is concerned, it behaves like any scikit-learn model, and we can actually measure, based on our test dataset, what the accuracy is. Using the test dataset as a benchmark, it had a 98% success rate in correctly classifying those images, so that's not bad. Now, mind you, a lot of research goes into optimizing this particular dataset problem, and 98% is not really considered a good result. Like I said, later in the course we'll talk about some better approaches we can use. But hey, that's a lot better than we got in the previous lecture, isn't it? As before, let's go ahead and take a look at some of the ones that it got wrong, just to get a feel for where our neural network has challenges. The code here is similar. We're just going to go through the first 1,000 test images, and since it does have a much higher accuracy rate, we have to go deeper into the test set to find examples of things that went wrong. We'll reshape each image into a flat 784-pixel array, which is what our neural network expects as input, call argmax on the resulting classification in one-hot format, and see if the predicted classification matches the actual label for that data. If not, we print it out. All right, so you can see here that this model really is doing better; the ones it's getting wrong are pretty wonky. In this case, we predicted that this was a number nine, and if I were to look at that myself, I might guess it was a nine as well. Turns out this person was trying to draw the number four, but this is a case where even a human brain starts to run into trouble as to what this person was actually trying to write. I don't know what that next one is supposed to be. Apparently, they were trying to draw the number four; our best guess was the number six, which is not unreasonable given the shape of things. Here's somebody who was trying to draw a two, but it looks a whole lot more like a seven. Again, I wouldn't be too sure about that one myself.
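The error-scanning loop just described boils down to comparing the argmax of the model's prediction with the argmax of the one-hot label. Here's a minimal NumPy sketch, with made-up prediction arrays standing in for actual model output:

```python
import numpy as np

# Hypothetical stand-ins for four test samples: true labels
# (one-hot encoded) and the model's softmax output for each
true_labels = np.array([3, 7, 1, 9])
y_true = np.eye(10)[true_labels]
y_pred = np.full((4, 10), 0.01)
y_pred[0, 3] = 0.9   # correct
y_pred[1, 7] = 0.9   # correct
y_pred[2, 7] = 0.9   # wrong: predicts 7, actual label is 1
y_pred[3, 9] = 0.9   # correct

misclassified = []
for i in range(len(y_true)):
    predicted = np.argmax(y_pred[i])   # most probable class
    actual = np.argmax(y_true[i])      # decode the one-hot label
    if predicted != actual:
        # In the notebook, this is where the image gets plotted
        misclassified.append((i, int(actual), int(predicted)))

print(misclassified)  # [(2, 1, 7)]
```

In the real notebook, y_pred comes from calling the trained model on the reshaped 784-pixel test images; the argmax-and-compare logic is the same.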
So, you know, even though we flattened this data to one dimension, this neural network we've constructed is already rivaling the human brain at doing handwriting recognition on these numbers. I mean, that's kind of amazing. That one, I probably would have guessed a three, but again, you can see that the quality of the stuff it has trouble with is really sketchy. What is that, a scorpion? Apparently that was supposed to be an eight, and our best guess was a two. Wow. Okay, yeah, some people really can't write. That's a seven? You get the point here. So just by using Keras alone, we've gotten better accuracy; we've got a better result, because there's less for us to think about. And you can probably improve on this even more. So again, as before with TensorFlow, I want you to go back and see if you can actually improve on these results. Try using a different optimizer than RMSprop. Try different topologies. And the beauty of Keras is that it's a lot easier to try those different topologies now, right? Keras actually comes with an example of using MNIST in its documentation, and this is the actual topology they use in their examples. So go back, give that a try, and see if it's actually any better or not; see if you can improve upon things. One thing you can see here is that they're actually adding dropout layers to prevent overfitting, and it's very easy to add those sorts of features here. Basically, what they've done is add that same dense layer of 512 hidden neurons taking the 784 features, and then drop out 20% of the neurons at the next layer, to force the learning to be spread out more and prevent overfitting. So it might be interesting to see if adding those dropout layers actually improves your results on the test dataset. All right, so go play with that and come back; we'll do some even more interesting stuff using Keras. 11.
Using Keras to Learn Political Affiliations: So that was a lot easier using Keras, wasn't it? Now, the MNIST dataset is just one type of problem that you might solve with a neural network. It's what we call multi-class classification: multi-class because the classifications we were fitting into range from the number zero through nine. So in this case, we had 10 different possible classification values, and that makes this a multi-class classification problem. Now, in Keras's documentation and examples, they have general advice on how to handle different types of problems. So here's an example of how they suggest setting up a multi-class classification problem in general. You can see here that we have two hidden layers. We have an input dimension of however many features you have coming into the system; in this example there are 20, but depending on the nature of your problem, there may be more. It's setting up two ReLU activation function layers, each with 64 neurons, and again, that's something you would want to tune depending on the complexity of what you're trying to achieve. It's sticking in a dropout layer to discard half of the neurons in each training step; again, that's to prevent overfitting. And at the end, it's using a softmax activation to map to one of, in this example, 10 different output values. OK, so this is how they go about solving the MNIST problem in their own documentation. They then use an SGD optimizer on a categorical cross-entropy loss function. So again, you can just refer to the Keras documentation for a general starting point, somewhere to begin from at least, when you're tackling a specific kind of problem. The actual numbers of neurons, the number of layers, and the number of inputs and outputs will obviously vary depending on the problem you're trying to solve, but this is the general guidance they give you on what the right loss function is to start with,
and what the right optimizer to start with might be. Another type of classification problem is binary classification. Maybe you're trying to decide if images of people are pictures of males or females, or trying to decide if someone's political party is Democrat or Republican. If you have an either/or sort of problem, then that's what we call a binary classification problem, and you can see here that their recommendation is to use a sigmoid activation function at the end instead of softmax, because you don't really need the complexity of softmax if you're just trying to go between zero and one. So sigmoid is the activation function of choice in the case of binary classification. They're also recommending the RMSprop optimizer, and the loss function in this case will be binary cross-entropy in particular. So there are a few things that are special about doing binary classification, as opposed to multi-class. Finally, I want to talk a little bit more about using Keras with scikit-learn. It does make it a lot easier to do things like cross validation, and here's a little snippet of code showing how that might look. So here's a little function that creates a model that can be used with scikit-learn. Basically, we have a create_model function here that creates our actual neural network. We're using a sequential model, putting in a dense layer with four inputs and six neurons, and that layer feeds into another hidden layer of four neurons. Finally, it goes to a binary classifier at the end with a sigmoid activation function. So that's a little example of setting up a small binary classification neural network. We can then set up an estimator using the KerasClassifier function, and that allows us to get back an estimator that's compatible with scikit-learn.
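The create_model pattern described here can be sketched like this (the ReLU activations on the hidden layers are an assumption, since the transcript only gives the layer sizes; the KerasClassifier wrapper itself is noted in comments because its import path has moved between Keras versions):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def create_model():
    """Model-building function that a KerasClassifier wrapper can call."""
    model = Sequential()
    # Topology as described: 4 input features -> 6 neurons
    model.add(Dense(6, activation='relu', input_dim=4))
    # -> another hidden layer of 4 neurons
    model.add(Dense(4, activation='relu'))
    # -> a single sigmoid output for the binary classification
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='rmsprop',
                  metrics=['accuracy'])
    return model

# Wrapping this in a KerasClassifier (from keras.wrappers.scikit_learn in
# older Keras, or the scikeras package in newer setups) produces an
# estimator you can hand to scikit-learn just like any other model.
model = create_model()
```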
So you see at the end there, we're actually passing that estimator into scikit-learn's cross_val_score function, and that will allow scikit-learn to run your neural network just as if it were any other machine learning model built into scikit-learn. That means cross_val_score can automatically train your model and then evaluate its results using k-fold cross validation, and give you a very meaningful result for how accurate it is in its ability to correctly predict the classifications for data it's never seen before. So with those snippets under our belt, let's try out a more interesting example. Let's finally move beyond the MNIST sample. What we're going to do is try to predict the political parties of Congressmen, just based on their votes in Congress, using the Keras library. So let's try this out. Now, this is actually going to be an example that I'm going to give to you to try out yourself as an exercise. I'm going to help you load up this data and clean it up, but after that, I'm going to leave it up to you to actually implement a neural network with Keras to classify these things. So again, to back up: what we're going to do is load up some data about a bunch of congressional votes that various politicians made, and we're going to try to see if we can predict whether a politician is Republican or Democrat, just based on how they voted on 16 different issues. This is older data; it's from 1984, so you definitely need to be of a certain age, shall we say, to remember what these issues were. And if you're from outside of the United States, just to give you a brief primer on US politics: basically, there are two main political parties in the United States, the Republicans, which tend to be more conservative, and the Democrats, which tend to be more progressive. Obviously those have changed over the years, but that's the current milieu. So let's load up our sample data. I'm going to use the Pandas library, that's part of our scientific Python environment here,
to actually load up these CSV files, which are just comma-separated value data files, massage that data, clean it up a little bit, and get it into a form that Keras can accept. So we'll start by importing the Pandas library; we'll call it pd for short. I have built up this array of column names, because it's not actually part of the CSV file, so I need to provide it by hand. The columns of the input data are going to be the political party, Republican or Democrat, and then a list of different bills that they voted on. So, for example, we can see whether each politician voted yea or nay on religious groups in schools. I'm a little short on the details of what that particular bill was about, but by reading these, you can probably guess the direction the different parties would probably vote. So go ahead and read that CSV file using Pandas's read_csv function. We will say that any missing values are marked with a question mark, and we'll pass in a names array of the feature names, so we know what to call the columns. Then we'll just display the resulting DataFrame using the head command. Go ahead and hit Shift-Enter to get that up, and we should see something like this. These are just the first five entries, so for the first five politicians at the head of our data, we can see each person's party, the known label that we're going to try to predict, and their votes on each issue. Now, we can also use the describe function on the resulting DataFrame to get a high-level overview of the nature of the data. For example, you can see there's a lot of missing data: even though there are 435 people that have a party associated with them, only 387 of them actually had a vote on the water project cost sharing bill, for example. So we have to deal with missing data here somehow.
And the easiest thing to do is to just throw away rows that have missing data. Now, in the real world, you'd want to make sure that you're not introducing some sort of unintentional bias by doing that. Maybe there's more of a tendency for Republicans to not vote than Democrats, or vice versa. If that were the case, then you might be biasing your analysis by throwing out politicians that didn't vote on every single issue here. But let's assume that there is no such bias, and we can just go ahead and drop those missing values. That's what this little line here does. It says dropna, inplace=True; that just means we're going to drop any rows that are missing data from our voting data DataFrame. Then we'll call describe again, and we should see that every column has the same count, because there is no missing data at this point. So we've winnowed things down to 232 politicians here. Not ideal, but hey, that's what we have to work with. The next thing we need to do is actually massage this data into a form that Keras can consume. Keras does not deal with y's and n's; it deals with numbers. So let's replace all the y's and n's with ones and zeros using this line here. Pandas has a handy-dandy replace function on DataFrames you can use to do that, and similarly, we'll replace the strings Democrat and Republican with the numbers one and zero as well. So this is turning it into a binary classification problem: if we classify someone as belonging to label one, that will indicate they're a Democrat, and label zero will indicate they're a Republican. So let's go ahead and run that to clean up the data, and we should see now, if you run head on that DataFrame again, that everything has been converted to numerical data between zero and one, which is exactly what we want for the input to a neural network. All right, finally, let's extract this data into NumPy arrays that we can actually feed to Keras.
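The load-and-clean steps just described can be sketched as follows. A tiny inline DataFrame stands in for the real house-votes CSV (which isn't reproduced here), and the two vote columns are a shortened, hypothetical subset of the real 16:

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the 1984 house-votes data; '?' marks a missing vote.
# The real file has 16 vote columns rather than the two shown here.
feature_names = ['party', 'budget', 'religious-groups-in-schools']
voting_data = pd.DataFrame([
    ['republican', 'n', 'y'],
    ['democrat',   'y', '?'],
    ['democrat',   'y', 'n'],
    ['republican', '?', 'y'],
    ['democrat',   'y', 'n'],
], columns=feature_names)

# With the real file you'd load it with something like:
# voting_data = pd.read_csv('house-votes-84.data', na_values=['?'],
#                           names=feature_names)
voting_data = voting_data.replace('?', np.nan)

# Throw away rows with missing votes (watch for bias in real analyses)
voting_data.dropna(inplace=True)

# Keras wants numbers, not strings: y/n -> 1/0, and party -> 1/0
voting_data = voting_data.replace(['y', 'n'], [1, 0])
voting_data = voting_data.replace(['democrat', 'republican'], [1, 0])

# Extract NumPy arrays to feed to Keras
features = voting_data[feature_names[1:]].values
classes = voting_data['party'].values
print(features.shape)  # (3, 2): three complete rows, two vote columns
```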
To do that, we're just going to call .values on the columns that we care about. We're going to extract all of the feature columns into a features array, and all of the actual labels, the actual parties, into an all_classes array. So let's go ahead and hit Shift-Enter to get that in, and at this point, I'm going to turn it over to you. The code snippets you need were actually covered in the slides just prior to coming out to this notebook, so just refer back to those; that should give you the stuff you need to work off of, and you can actually give things a go here. I want you to try it yourself. Now, my answer is below. No peeking! I put a little image there to try to stop you from scrolling further than you should. So if you want to, hit pause here; we can come back later, and you can compare your results to mine. Okay, so at this point, I want you to pause this video and give it a go yourself. And when you think you've got something up and running, or if you just want to skip ahead and see how I did it, hit play again, and I'll show you right now. All right, I hope you did your homework. Let's go ahead and take a look at my implementation. Again, it's pretty much taken straight from the slides that I showed you earlier. All we're going to do is import the stuff we need from Keras here. We're using Dense, Dropout, and Sequential, and we're also going to use cross_val_score to actually evaluate our model, and to illustrate integrating Keras with scikit-learn, like we talked about. So when we're integrating with scikit-learn, we need to create a function that creates our model, so we can pass that into cross_val_score. Ultimately, we're going to create a sequential model, and we're just going to follow the pattern that we showed earlier for doing a binary classification problem. So in this case, we have 16 different issues that people voted on, and we're going to use a ReLU activation function with a layer of 32 neurons.
And a pretty common pattern is to start with a large number of neurons in one layer and winnow things down as you get to the higher layers. So we're going to distill those 32 neurons down to another hidden layer of 16 neurons. I'm using the term "units" in this particular example. A little bit of an aside: more and more researchers are using the term "unit" instead of "neuron," and you're seeing that in some of the APIs and libraries that are coming out. The reason is that we're starting to diverge a little bit between how artificial neural networks work and how the human brain actually works; in some cases, we're actually improving on biology. So some researchers are taking issue with calling these artificial neurons, because we're kind of moving beyond neurons, and they're becoming their own thing at this point. Finally, we'll have one last layer with a single output neuron for our binary classification, with a sigmoid activation function to choose between zero and one, and we will use the binary cross-entropy loss function and the Adam optimizer, and kick it off. At that point, we construct a KerasClassifier to actually execute that, and we create an estimator object from it that we can then pass into scikit-learn's cross_val_score to actually perform k-fold cross validation automatically, and we will display the mean result when we're done. So Shift-Enter, and we'll see how long this takes. Mind you, in 1984, politicians were not as polarized as they are today, so it might be a little bit harder than it would be today to actually predict someone's party just based on their votes. It would be very interesting to see if that's the case using more modern data. Hey, we're done already: 93.9% accuracy, and that's without even trying too hard. We didn't really spend any time tuning the topology of this network at all.
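A sketch of the solution topology described here: 16 vote features in, 32 units, then 16, then a single sigmoid output, compiled with binary cross-entropy and Adam. Only the model-building function is shown runnable; the KerasClassifier/cross_val_score wiring is noted in comments, since the wrapper's import path depends on your Keras or scikeras version:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def create_model():
    model = Sequential()
    # 16 vote features in, first hidden layer of 32 units
    model.add(Dense(32, activation='relu', input_dim=16))
    # Winnow down to a smaller hidden layer of 16 units
    model.add(Dense(16, activation='relu'))
    # Single sigmoid output unit for the Democrat/Republican decision
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model

# To evaluate with scikit-learn, wrap create_model in a KerasClassifier
# and pass the resulting estimator to:
#   cross_val_score(estimator, features, classes, cv=10).mean()
model = create_model()
```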
Maybe you could do a better job. If you do get significantly better results, post them in the course here; I'm sure the students would like to hear about what you did. So there you have it: using Keras for a more interesting example, predicting people's political parties using a neural network, and also integrating it with scikit-learn to make life even easier. That's the magic of Keras for you. 12. Convolutional Neural Networks: So far, we've seen the power of just using a simple multi-layer perceptron to solve a wide variety of problems. But you can take things up a notch. You can arrange more complicated neural networks together and solve more complicated problems with them. So let's start by talking about convolutional neural networks, or CNNs for short. Usually, you hear about CNNs in the context of image analysis, and their whole point is to find things in your data that might not be exactly where you expect them to be. Technically, we call this feature-location invariance. That means that if you're looking for some pattern or some feature in your data, but you don't know exactly where it might be, a CNN can scan your data and find those patterns for you wherever they might be. So, for example, in this picture here, that stop sign could be anywhere in the image, and a CNN is able to find it no matter where it might be. Now, this isn't just limited to image analysis. It can also be used for any sort of problem where you don't know where the features you care about might be located within your data; machine translation or natural language processing tasks come to mind. You don't necessarily know where the noun, or the verb, or the phrase you care about might be in some paragraph or sentence you're analyzing, but a CNN can find it and pick it out for you. Sentiment analysis is another application of CNNs.
You might not know exactly where a phrase might be that indicates some happy sentiment, or some frustrated sentiment, or whatever you might be looking for, but a CNN can scan your data and pluck it out. And you'll see that the idea behind it isn't really as complicated as it sounds. This is another example of using fancy words to make things sound more complicated than they really are. So how do they work? Well, CNNs, convolutional neural networks, are inspired by the biology of your visual cortex. They take cues from how your brain actually processes images from your retina, and it's pretty cool. It's also another example of interesting emergent behavior. The way your eyes work is that individual groups of neurons service a specific part of your field of vision. We call these local receptive fields: they're just groups of neurons that respond only to a part of what your eye sees. That subsamples the image coming in from your retinas, with specialized groups of neurons processing specific parts of the field of view that you see with your eyes. Now, these little areas overlap each other to cover your entire visual field, and this is called convolution. Convolution is just a fancy word for saying: I'm going to break up this data into little chunks, process those chunks individually, and then assemble a bigger picture of what you're seeing higher up in the chain. The way it works within your brain is that you have many layers; it is a deep neural network that identifies features of increasing complexity, if you will. So the first layer that you go into from your convolutional neural network inside your head might just identify horizontal lines, or lines at different angles, or specific kinds of edges. We call these filters, and those might feed into a layer above them that would then assemble the lines identified at the lower level into shapes.
And maybe there's a layer above that that can recognize objects based on the patterns of shapes it sees. Then, if you're dealing with color images, we have to multiply everything by three, because you actually have specialized cells within your retina for detecting red, green, and blue light, and we need to assemble those together as well; they each get processed individually too. So that's all a CNN is. It's just taking a source image, or source data of any sort, really, breaking it up into little chunks called convolutions, and then assembling those and looking for patterns at increasingly higher complexities at higher levels in your neural network. So how does your brain know that you're looking at a stop sign? Let's talk about this in more colloquial language, if you will. Like we said, you have individual local receptive fields that are responsible for processing specific parts of what you see, and those local receptive fields are scanning your image, overlapping with each other, looking for edges. You might notice that your brain is very sensitive to contrast: edges that it sees in the world tend to catch your attention, right? That's why the letters on this slide catch your attention, because there's high contrast between the letters and the white background behind them. So at a very low level, you're picking out the edges of that stop sign, and the edges of the letters on the stop sign. Now, a higher level might take those edges and recognize the shape of that stop sign, and say: oh, there's an octagon there, that means something special to me. Or: those letters form the word STOP, and that means something special to me too. And ultimately, that will get matched against whatever classification pattern your brain has of a stop sign. So no matter which receptive field picked up that stop sign, at some layer it will be recognized as a stop sign.
And furthermore, because you're processing data in color, it can also use the information that the stop sign is red, and further use that to aid in its classification of what this object really is. So somewhere in your head there's a neural network that says: hey, if I see edges arranged in an octagon pattern that has a lot of red in it and says STOP in the middle, that means I should probably hit the brakes on my car. And at some even higher level, where the brain is actually doing higher reasoning, there's a layer that says, "Hey, there's a stop sign coming up here; I'd better hit the brakes on my car." And if you've been driving long enough, you don't even really think about it anymore, do you? It's almost hardwired, and that literally may be the case. Anyway, a convolutional neural network, an artificial convolutional neural network, works the same way, the same exact idea. So how do you build a CNN with Keras? Obviously, you probably don't want to do this at the lower-level TensorFlow layer; you can, but CNNs get pretty complicated, so a higher-level library like Keras becomes essential. First of all, you need to make sure your source data is of the appropriate dimensions, of the appropriate shape if you will, and you are going to be preserving the actual 2D structure of an image if you're dealing with image data here. So the shape of your data might be the width times the length times the number of color channels. And by color channels I mean: if it's a black-and-white image, there's only one "color," black and white, so you only have one color channel for a grayscale image. But if it's a color image, you'd have three color channels: one for red, one for green, and one for blue, because you can create any color by combining red, green, and blue together.
Okay, now there are some specialized types of layers in Keras you can use when you're dealing with convolutional neural networks. For example, there's the Conv2D layer type that does the actual convolution on a 2D image, and again, convolution is just breaking up that image into little subfields that overlap each other for individual processing. There's also a Conv1D and a Conv3D layer available as well; you don't have to use CNNs with images, like we said. They can also be used with text data, for example; that might be an example of one-dimensional data. And a Conv3D layer is available if you're dealing with 3D volumetric data of some sort. So there are a lot of possibilities there. Another specialized layer in Keras for CNNs is MaxPooling2D (obviously, there's a 1D and 3D variant of that as well). The idea of that is just to reduce the size of your data. So I take just the maximum value seen in a given block of an image and reduce that layer down to those maximum values. It's just a way of shrinking the images in such a way that it can reduce the processing load on the CNN. As you'll see, processing CNNs is very compute-intensive, and the more you can do to reduce the work you have to do, the better. So if you have more data in your image than you need, a MaxPooling2D layer can be useful for distilling that down to the bare essence of what you need to analyze. Finally, at some point you need to feed this data into a flat layer of neurons, right? At some point it's going to go into a perceptron, and at this stage we need to flatten that 2D layer into a 1D layer so we can just pass it into a layer of neurons. From that point, it just looks like any other multi-layer perceptron. So the magic of CNNs really happens at a lower level, you know.
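The max-pooling step just described is simple enough to sketch in plain Python: take each non-overlapping 2x2 block of a grid and keep only its maximum value, shrinking the data by a factor of four. The function name and sample values are made up for illustration.

```python
def max_pool_2x2(image):
    """Shrink a 2D grid by taking the max of each non-overlapping 2x2 block."""
    out = []
    for y in range(0, len(image) - 1, 2):
        row = []
        for x in range(0, len(image[0]) - 1, 2):
            block = [image[y][x], image[y][x + 1],
                     image[y + 1][x], image[y + 1][x + 1]]
            row.append(max(block))
        out.append(row)
    return out

image = [
    [1, 3, 2, 1],
    [4, 2, 0, 1],
    [5, 6, 1, 2],
    [7, 8, 3, 0],
]
# A 4x4 grid shrinks to 2x2, keeping the strongest response in each block
print(max_pool_2x2(image))  # [[4, 2], [8, 3]]
```

The strongest responses survive while three quarters of the data is discarded, which is exactly why pooling cuts the processing load on the layers above it.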
Ultimately, it gets converted into what looks like the same types of multi-layer perceptrons that we've been using before; the magic happens in actually processing your data, convolving it, and reducing it down to something that's manageable. So a typical usage of image processing with a CNN would look like this: you might start with a Conv2D layer that does the actual convolution of your image data. You might follow that up with a MaxPooling2D layer on top of that that distills that image down, just shrinks the amount of data that you have to deal with. You might then do a dropout layer on top of that, which just prevents overfitting, like we talked about before. At that point you might apply a Flatten layer to actually be able to feed that data into a perceptron, and that's where a Dense layer might come into play. A Dense layer in Keras is just a perceptron, really; you know, it's a hidden layer of neurons. From there we might do another dropout pass to further prevent overfitting, and finally do a softmax to choose the final classification that comes out of your neural network. Now, like I said, CNNs are compute-intensive. They are very heavy on your CPU, your GPU, and your memory requirements; shuffling all that data around and convolving it adds up really, really fast. And beyond that, there are a lot of what we call hyperparameters, a lot of different knobs and dials that you can adjust on CNNs. So in addition to the usual stuff you can tune, like the topology of your neural network, or what optimizer you use, or what loss function or activation function to use, there are also choices to make about the kernel sizes: what is the area that you actually convolve across? How many layers do you have? How many neurons do you have? How much pooling do you do when you're reducing the image down? There's a lot of variance here; there's almost an infinite number of possibilities for configuring a CNN.
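Since that final layer picks the classification with softmax, here's a tiny plain-Python sketch of what softmax actually computes: it turns raw output scores into probabilities that sum to one. The scores here are made-up numbers just for illustration.

```python
import math

def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    m = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]      # imagine one raw score per output neuron
probs = softmax(scores)
print(probs)                   # probabilities summing to 1
print(probs.index(max(probs)))  # index 0 wins: the predicted class
```

The biggest raw score always gets the biggest probability, so taking the argmax of the softmax output gives you the network's final classification.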
And often, just obtaining the data to train your CNN with is the hardest part. So, for example, if you own a Tesla, it's actually taking pictures of the world around you and the road around you, and all the street signs and traffic lights as you drive, and every night it sends all those images back to some data servers somewhere, so Tesla can actually run training on its own neural networks based on that data. So if you slam on the brakes while you're driving a Tesla at night, that information is going to be fed into a big data center somewhere, and Tesla's gonna crunch on that and say, "Hey, is there a pattern to be learned here, from what I saw from the cameras on the car, that means you should slam on the brakes?" in the case of a self-driving car. And when you think about the scope of that problem, just the sheer magnitude of processing and obtaining and analyzing all that data, that becomes very challenging in and of itself. Now, fortunately, the problem of tuning the parameters doesn't have to be as hard as I described it to be. There are specialized architectures of convolutional neural networks that do some of that work for you. A lot of research goes into trying to find the optimal topologies and parameters for a CNN for a given type of problem, and you can just think of this as a library you can draw from. So, for example, there's the LeNet-5 architecture that you can use that's suitable for handwriting recognition in particular. There's also one called AlexNet, which is appropriate for image classification; it's a deeper neural network than LeNet. You know, in the example we talked about on the previous slide, we only had a single hidden layer, but you can have as many as you want; really it's a matter of how much computational power you have available. There's also something called GoogLeNet (you can probably guess who came up with that). It's even deeper, but it has better performance because it introduces a concept called inception modules.
They basically group convolution layers together, and that's a useful optimization for how it all works. Finally, the most sophisticated one today is called ResNet, which stands for "residual network." It's an even deeper neural network, but it maintains performance through what are called skip connections. So it has special connections between the layers of the perceptron to further accelerate things; it sort of builds upon the fundamental architecture of a neural network to optimize its performance. And as you'll see, CNNs can be very demanding on performance. So with that, let's give it a shot. Let's actually use a CNN and see if we can do a better job at image classification than we've done before. 13. Using CNN's for Handwriting Recognition: We're going to revisit the MNIST handwriting recognition problem, where we try to classify a bunch of images of people drawing the numbers 0 through 9, and see if we can do a better job of it using CNNs. Again, CNNs are better suited to image data in general, especially if you don't know exactly where the feature you're looking for is within your image, so we should expect to get better results here. All right, so we're gonna start by importing all the stuff we need from Keras. We'll import the MNIST dataset that we're playing with, the Sequential model so we can assemble our neural network, and then we're gonna import all the different layer types that we talked about in the slides: the Dense, Dropout, Conv2D, MaxPooling2D, and Flatten layer types. And in this example we'll use the RMSprop optimizer. Go ahead and kick that off, and the rest here, for loading up the training and test data, is gonna look just like it did before. Still waiting for Keras to initialize itself there. All right, so that should load up the MNIST dataset. We're gonna shape the data a little bit differently, since convolutional neural networks can process 2D data in all its two-dimensional glory.
We're not going to reshape that data into flat 1D arrays of 784 pixels. Instead, we're going to shape it into the width times the length times the number of color channels. In this case our data is grayscale in nature, so there's only a single color channel that just defines how light or dark a specific pixel is. And there are a couple of different ways that data can be stored, so we need to handle a couple of different cases here: it might be organized as color channels by width by length, or it might be width by length by color channels. That's what this little bit of code here is dealing with. Either way, we will see if it's a channels-first format or not and reshape the data accordingly, and we're gonna store that shape in this thing called input_shape. That's the shape of our input test data, and training data for that matter. As before, we're going to scale this data down. It comes in as 8-bit byte data, and we need to convert that into normalized floating-point data instead. So we'll convert that data to 32-bit floating-point values and then divide each pixel by 255 to transform it into some number between 0 and 1. Go ahead and hit shift-Enter in there to kick that off. All right, and as before, we will convert the label data into one-hot categorical format, because that will match up nicely with the output of our neural network. Nothing different here; we're just going to do a sanity check again to make sure that we successfully imported our data. So we'll pick a random training set sample to print out and display here. And there's the one-hot format of the label 3, and the correct human-readable format, 3. And sure enough, that looks like the number 3. So it looks like our data is in good shape for processing. So now let's actually set up a CNN and see how that works.
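The two preprocessing steps just described, scaling byte pixels into the [0, 1] range and one-hot encoding the labels, can be sketched without any libraries. The helper names and sample values here are invented for illustration; Keras does the one-hot step for you with its own utilities.

```python
def normalize_pixels(pixels):
    """Convert 8-bit pixel values (0-255) into floats between 0 and 1."""
    return [p / 255.0 for p in pixels]

def one_hot(label, num_classes=10):
    """Encode a digit label as a one-hot vector matching the 10 output neurons."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

print(normalize_pixels([0, 128, 255]))  # 0 maps to 0.0, 255 maps to 1.0
print(one_hot(3))                        # 1.0 in position 3, zeros elsewhere
```

The one-hot vector lines up with the network's final layer of 10 neurons: the "correct" neuron should fire 1 and the rest should fire 0.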
So let's walk through what's going on in this next code block. As before, we start off by setting up a Sequential model; that just allows us to very easily build up layers to build our neural network here. And we will start with a Conv2D layer. What this syntax means is that our convolutional 2D layer is going to have 32 windows, or 32 regional fields if you will, that it will use to sample that image with, and each one of those samples will have a 3x3 kernel size. It also needs to know the shape of your input data, which we stored previously; that's 1x28x28 or 28x28x1, depending on the input format. We will then add a second convolutional filter on top of that to hopefully identify higher-level features. This one will have 64 kernels, also of 3x3 size, and we're going to use a ReLU activation function on that as well. So we've built up two convolution layers here. And again, you want to just reuse any previous research you can for a given problem. There are so many ways to configure CNNs that if you start from scratch, you're gonna have a very hard time tuning it, especially when you consider how long it takes between each run; these are very resource-intensive. So I've just taken this from the CNN example that comes with the Keras library and drawn my initial topology from it. Now that we've done our convolution layers, we're gonna do a MaxPooling2D step to actually reduce that down a little bit. We're gonna take a 2x2 pool size, and for each 2x2 pixel block at this stage, we're going to reduce that down to a single pixel that represents the maximum pixel found within that pool. Note that the pool size can be different from the underlying kernel size of the convolution you did, so really this is just a technique for shrinking your data down to something that's more manageable at this point.
We'll do a dropout pass to prevent overfitting. We will then flatten what we have so far, so that will take our 2D data and flatten it to a 1D layer, and from this point it's just gonna look like any other multi-layer perceptron, just like we used before. So all the magic of CNNs has happened at this point, and now we're just gonna convert it down to a flat layer that we input into a hidden layer of neurons. In this case, we're gonna have 128 neurons in that layer, again with a ReLU activation function. We'll do one more dropout pass to prevent overfitting, and finally choose our final categorization of the number, 0 through 9, by building one final output layer of 10 neurons with a softmax activation function on it. All right, so let's go ahead and let that run. Again, nothing's really happening until we actually kick off the model, so that doesn't take any time at all. We can do a model.summary() just to double-check that everything is the way we intended it to be, and you can see that we have our two convolution layers here, followed by a pooling layer, followed by a dropout and a flatten. And from there we have a dense, dropout, and dense multi-layer perceptron to actually do our final classifications. All right, finally, we need to compile that model with a specific optimizer and loss function. In this case, we're going to use the Adam optimizer, and categorical cross-entropy because that's the appropriate loss function for a multiple-category classification problem. And finally, we will actually run it. Now, like I said, CNNs are very expensive to run. So when we talk about what this command does: first of all, nothing unusual here; it just says that we're going to run batches of 32, which is smaller than before because there is a much higher computational cost to this. We're only gonna run 10 epochs this time because, again, it takes a long time; more would be better.
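Putting the narration above together, the model construction might look roughly like this. This assumes the Keras API bundled with TensorFlow; the layer sizes (32 and 64 3x3 kernels, a 128-neuron hidden layer, 10 softmax outputs) follow the narration, while the specific dropout rates (0.25 and 0.5) are typical values borrowed from the Keras MNIST CNN example, not confirmed by this lecture. Treat it as an illustrative sketch of the topology, not the exact course notebook.

```python
# Sketch of the CNN topology described above (assumed tf.keras API;
# dropout rates are assumed typical values, not from the transcript)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

model = Sequential([
    Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)),
    Conv2D(64, kernel_size=(3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),   # shrink each 2x2 block to its max
    Dropout(0.25),                    # guard against overfitting
    Flatten(),                        # 2D feature maps -> 1D vector
    Dense(128, activation='relu'),    # the hidden perceptron layer
    Dropout(0.5),
    Dense(10, activation='softmax'),  # one output per digit 0-9
])

model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
model.summary()
```

From here, `model.fit(...)` with a batch size of 32 and 10 epochs, plus the validation data, matches what the lecture describes running.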
But there's only so much time we have. We'll do verbosity level 2, because that's what you want to choose when running within an IPython notebook, and we will pass in our validation test data for it to work with as well. Now, I am not going to actually run this, because it could take about an hour, and if you don't have a beefy machine it might not finish at all; if you don't have enough RAM or enough CPU power, this might even be too much for your system. So I'm gonna skip ahead here. I actually ran this earlier, and it did take about 45 minutes. But you can see that it very quickly converged to a very good accuracy value, and it was still increasing, so there probably would have been value in going even beyond 10 iterations of the training. But even after just 10 epochs, or 10 iterations, we ended up with an accuracy of over 99%, and we can actually evaluate that based on our test data and recreate that 99% accuracy. So that's kind of awesome. CNNs are definitely worth doing if accuracy is key, and for applications where lives are at stake, such as a self-driving car, obviously that's worth the effort, right? You want complete accuracy in detecting whether there's a stop sign in front of you, ideally; even a 0.1% error rate is going to be unacceptable in a situation like that. So that's the power of CNNs. They are more complicated and take a lot more time to run, but as we said, the power of TensorFlow, which Keras is running on top of, means you can distribute its work across an entire cloud of computers, and an entire array of GPUs on each computer. So there are ways of accelerating this; we're just not taking advantage of that in this little example here, it's just illustrative. So there you have it: your first convolutional neural network, and you can see how powerful it is at successfully doing image classification, among other things. So cool, let's move on to another type of neural network next. 14.
Recurrent Neural Networks: Let's talk about another kind of neural network: the recurrent neural network. What are RNNs for? Well, a couple of things. Basically, they're for sequences of data, and that might be a sequence in time, so you might use them for processing time-series data, where we're trying to look at a sequence of data points over time and predict the future behavior of something over time. So RNNs are built for sequential data of some sort. Some examples of time-series data might be web logs, where you're receiving different hits to your website over time, or sensor logs, where you're getting different inputs from sensors on the Internet of Things. Or maybe you're trying to predict stock behavior by looking at historical stock trading information. These are all potential applications for recurrent neural networks, because they can take a look at behavior over time and try to take that behavior into account when making future projections. Another example: if you're trying to develop a self-driving car, you might have a history of where your car has been, its past trajectories, and maybe that can inform how your car might want to turn in the future. So you might take into account the fact that your car has been turning along a curve to predict that perhaps it should continue to drive along a curve until the road straightens out. And another example: it doesn't have to just be time; it can be any kind of sequence of arbitrary length. Something else that comes to mind is language; sentences are just sequences of words, right? So you can also apply RNNs to language, or machine translation, or producing captions for videos or images. These are examples where the order of words in the sentence might matter, and where the structure of the sentence and how the words are put together can convey more meaning than you could get by just looking at those words individually, without context.
So again, an RNN can make use of that ordering of the words and try to use that as part of its model. Another interesting application of RNNs is machine-generated music. You can also think of music sort of like text, where instead of a sequence of words or letters, you have a sequence of musical notes. So it's kind of interesting: you can actually build a neural network that can take an existing piece of music and sort of extend upon it, by using a recurrent neural network to try to learn the patterns that were aesthetically pleasing in the music in the past. Conceptually, this is what a single recurrent neuron looks like in terms of a model. It looks a lot like an artificial neuron that we've looked at before; the big difference is this little loop here. Okay, so now, as we run a training step on this neuron, some training data gets fed into it, or maybe this is an input from a previous layer in our neural network, and it will apply some sort of step function after summing all the inputs into it. In this case, we're gonna be drawing something more like a hyperbolic tangent, because mathematically you want to make sure that you preserve some of the information coming in, in more of a smooth manner. Now, usually we would just output the result of that summation and that activation function as the output of this neuron. But we're also going to feed that back into the same neuron. So the next time we run some data through this neuron, that data from the previous run also gets summed into the results. Okay? So as we keep running this thing over and over again, we'll have some new data coming in that gets blended together with the output from the previous run through this neuron, and that just keeps happening over and over and over again. So you can see that over time, the past behavior of this neuron influences its future behavior, and it influences how it learns. Another way of thinking about this is by unrolling it in time.
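That loop is easy to sketch in plain Python: a single recurrent neuron sums its new input with its own previous output, squashes the result through a hyperbolic tangent, and feeds it forward to the next time step. The weight values here are arbitrary, chosen purely for illustration.

```python
import math

def run_recurrent_neuron(inputs, w_input=1.0, w_recurrent=0.5):
    """Unroll one recurrent neuron over a sequence: each step's output
    is tanh(w_input * x_t + w_recurrent * previous_output)."""
    outputs = []
    prev = 0.0  # no memory before the first time step
    for x in inputs:
        prev = math.tanh(w_input * x + w_recurrent * prev)
        outputs.append(prev)
    return outputs

# The same final input (0.0) produces different outputs depending on history:
print(run_recurrent_neuron([1.0, 0.0]))
print(run_recurrent_neuron([0.0, 0.0]))
```

The second output differs between the two runs even though the second input is identical, which is the "memory cell" behavior: past inputs influence present outputs.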
So what this diagram shows is the same single neuron, just at three different time steps, and when you start to dig into the mathematics of how RNNs work, this is a more useful way of thinking about it. Consider this to be time step 0: you can see there's some sort of data input coming into this recurrent neuron, and that will produce some sort of output after going through its activation function, and that output also gets fed into the next time step. So in time step 1, with the same neuron, you can see that this neuron is receiving not only a new input but also the output from the previous time step, and those get summed together, the activation function gets applied, and that gets output as well. The output of that combination then gets fed on to the next time step; call this time step 2, where a new input for time step 2 gets fed into this neuron, and the output from the previous step also gets fed in. They get summed together, the activation function is run, and we have a new output. This is called a memory cell, because it does maintain memory of its previous outputs over time. And you can see that even though things are getting summed together at each time step, over time those earlier behaviors kind of get diluted, right? We're adding in that time step to that time step, and then the sum of those two things feeds into this one. So one property of memory cells is that more recent behavior tends to have more of an influence on the current behavior that you get out of a recurrent neuron, and this can be a problem in some applications; there are ways to work against that, which we can talk about later. Stepping this up, you can have a layer of recurrent neurons; you don't have to have just one, obviously. So in this diagram, we are looking at four individual recurrent neurons that are working together as part of a layer, and you can have some input
going into this layer as a whole that gets sent into these four different recurrent neurons, and then the output of those neurons can then get fed back, at the next step, to every neuron in that layer. So all we're doing is scaling this out horizontally: instead of a single recurrent neuron, we have a layer of four recurrent neurons in this example, where all of the output of those neurons is feeding into the behavior of those neurons in the next learning step. Okay, so you can scale this out to have more than one neuron and learn more complicated patterns as a result. RNNs open up a wide range of possibilities, because now we have the ability to deal not just with vectors of information, static snapshots of some sort of state; we can also deal with sequences of data as well. So there are four different combinations here that you can deal with. We can deal with sequence-to-sequence neural networks, where the input is a time series or some sort of sequence of data, and the output is also a time series or some sequence of data. If you're trying to predict stock prices in the future based on historical trades, that might be an example of a sequence-to-sequence topology. We can also mix and match sequences with the older static vector states that we predicted back when we were just using multi-layer perceptrons. We would call that sequence-to-vector: if we were starting with a sequence of data, we could produce just a snapshot of some state as a result of analyzing that sequence. An example might be looking at the sequence of words in a sentence to produce some idea of the sentiment that that sentence conveys; we'll get to an example of that shortly. You can go the other way around, too.
You can go from a vector to a sequence. An example of that would be taking an image, which is a static vector of information, and then producing a sequence from that vector, for example words in a sentence: creating a caption from an image. And we can chain these things together in interesting ways as well. We can have encoders and decoders built up that feed into each other. For example, we might start with a sequence of information from a sentence in some language, embody what that sentence means as some sort of vector representation, and then turn that around into a new sequence of words in some other language. That might be how a machine translation system could work: you might start with a sequence of words in French, build up a vector that sort of embodies the meaning of that sentence, and then produce a new sequence of words in English or whatever language you want. That's an example of using a recurrent neural network for machine translation. So, lots of exciting possibilities here. Training RNNs, just like CNNs, is hard; in some ways it's even harder. The main twist here is that we need to back-propagate not only through the neural network itself, in all of its layers, but also through time. From a practical standpoint, every one of those time steps ends up looking like another layer in our neural network while we're trying to train it, and those time steps can add up fast. Over time, we end up with an ever deeper neural network that we need to train, and the cost of actually performing gradient descent on that increasingly deep neural network becomes increasingly large. So to put an upper cap on that training time, we often limit the back-propagation to a limited number of time steps; we call this truncated back-propagation through time. So it's just something to keep in mind when you're training an RNN.
You not only need to back-propagate through the neural network topology that you've created, you also need to back-propagate through all the time steps that you've built up to that point. Now, we talked earlier about the fact that as you're building up an RNN, the state from earlier time steps ends up getting diluted over time, because we just keep feeding in behavior from the previous step to the current step. And this can be a problem if you have a system where older behavior should not matter less than newer behavior. For example, if you're looking at words in a sentence, the words at the beginning of a sentence might be even more important than words toward the end. If you're trying to learn the meaning of a sentence, in many cases there is no inherent relationship between where a word sits in the sentence and how important it might be. So that's an example of where you might want to do something to counteract that effect, and one way to do that is something called the LSTM cell. LSTM stands for long short-term memory. The idea here is that it maintains separate ideas of both short-term and long-term states, and it does this in a fairly complex way. Now, fortunately, you don't really need to understand the nitty-gritty details of how it works; there is an image of it here for you to look at if you're curious, but the libraries that you use will implement this for you. The important thing to understand is that if you're dealing with a sequence of data where you don't want to give preferential treatment to more recent data, you probably want to use an LSTM cell instead of just using a straight-up RNN. There's also an optimization on top of LSTM cells called GRU cells; that stands for gated recurrent unit. It's just a simplification of LSTM cells that performs almost as well.
So if you need to strike a balance, or a compromise, between performance in terms of how well your model works and performance in terms of how long it takes to train it, a GRU cell might be a good choice. Training RNNs is really hard; if you thought CNNs were hard, wait till you see RNNs. They are very sensitive to the topologies that you choose and to the choice of hyperparameters, and since we have to simulate things over time, and not just through the static topology of your network, they can become extremely resource-intensive. And if you make the wrong choices here, you might end up with a recurrent neural network that doesn't converge at all; it might be completely useless, even after you've run it for hours to see if it actually works. So again, it's important to build upon previous research: try to find some sets of topologies and parameters that work well for problems similar to what you're trying to do. This all makes a lot more sense with an example, and you'll see that it's really nowhere near as hard as it sounds when you're using Keras. Now, I used to work at IMDb, so I can't resist using a movie-related example. So let's dive into that next and see RNNs, recurrent neural networks, in action. 15. Using RNN's for Sentiment Analysis: What we're gonna do here is try to do sentiment analysis. This is going to be an example of a sequence-to-vector RNN problem, where we're taking the sequence of words in a user-written movie review, and we try to output a vector that's just a single binary value of whether or not that user liked the movie, whether they gave it a positive rating. So this is an example of doing sentiment classification using real user review data from IMDb, and since I used to run IMDb's engineering department, this is a little bit too tempting for me not to use as an example here. Now, mind you, just to give credit where credit is due:
this is drawn heavily upon one of the examples that ships with Keras, the imdb_lstm sample. I've sort of embellished on it a little bit here, but the idea is there, so credit where credit's due. And it does warm my heart, by the way, that they include the IMDb dataset as part of Keras, free to experiment with; it's bringing back good memories for me. I enjoyed working there. Anyhow, this is another example of how we're going to use LSTM cells, long short-term memory cells, because again, when you're dealing with textual data, a sequence of words in a sentence, it doesn't necessarily matter where in the sentence a word appeared. You don't want words toward the end of the sentence counting more toward your classification than words at the beginning of the sentence; in fact, often it's the other way around. So we're going to use an LSTM cell to try to counteract the effect that you see in normal RNNs, where data becomes diluted over time, or as the sequence progresses in this example. So let's just dive in and see how it works. We'll start by importing all the stuff that we need from Keras: we're going to use the sequence preprocessing module, and the Sequential model so we can build up different layers together. We're going to introduce a new Embedding layer as part of our RNN, in addition to the Dense layer that we had before; we will import the LSTM module; and finally we'll import the IMDb dataset. So let's go ahead and hit shift-Enter to do all of that and get Keras initialized. And that's done now, so now we can import our training and testing data. Like I said, Keras has a handy-dandy IMDb dataset preinstalled. Oddly, it has 5,000 training reviews and 25,000 testing reviews, which seems backwards to me, but it is what it is. The one parameter you're seeing here, num_words, indicates how many unique words you want to load into your training and testing datasets.
So by saying num_words=20000, I'm going to limit my data to the 20,000 most popular words in the dataset. If someone uses some really obscure word, it's not going to show up in our input data. Let's go ahead and load that up. And since it does have to do some thinking, it doesn't come back instantly, but it's pretty quick. OK, we're in business here. Let's take a peek at what this data looks like. So let's take a look at the first instance of training data here, and, what the heck, it's just a bunch of numbers. It doesn't look like a movie review to me. Well, you can be very thankful to the folks at Keras for doing this for you. The thing is, when you're doing machine learning in general, usually models don't work with words; they work with numbers, right? So we need to convert these words to numbers somehow as a first step, and Keras has done all of this preprocessing for you already. So, you know, the number one might correspond to the word "the"; I actually have no idea what it corresponds to. But they've encoded each unique word, the 20,000 most popular words, since that's what we asked for, to a number between zero and 20,000. Okay, so it's kind of a bummer that we can't actually read these reviews and get an intuitive sense of what they're saying, but it saves us a whole lot of work. And I have said before that often a lot of the work in machine learning is not so much building your models and tuning them; it's processing and massaging your input data and making sure that your input data looks good to go. So even though this doesn't look like a movie review, it is a movie review. They've just replaced all of the words with unique numbers that represent each word. We can also take a look at the training labels. The classification of this particular review was one, which just means that they liked it. The only classifications are zero and one, which correspond to a negative or a positive sentiment for that review.
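To make that encoding less mysterious, here's a tiny pure-Python sketch of the idea. This is just an illustration, not Keras's actual preprocessing code: rank words by frequency across all reviews, then replace each word with its rank, dropping anything outside the vocabulary.

```python
from collections import Counter

def build_index(reviews, num_words):
    """Map each word to an integer by frequency rank (most common = 1),
    keeping only the num_words most popular words."""
    counts = Counter(word for review in reviews for word in review.split())
    ranked = [word for word, _ in counts.most_common(num_words)]
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def encode(review, index):
    """Replace each word with its index; obscure words simply disappear."""
    return [index[w] for w in review.split() if w in index]

reviews = ["the movie was great", "the movie was awful", "great fun"]
index = build_index(reviews, num_words=10)
print(encode("the movie was great", index))  # → [1, 2, 3, 4]
```

Keras does the equivalent over the full IMDb corpus with num_words=20000, which is why the "review" you see is just a list of integers.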
Okay, so we have all of our input data converted to numerical format already. That's great. Now we just have to set things up. Let's start by creating some vectors for input here. We're going to break out our training and testing data, and we're going to call sequence.pad_sequences just to make sure that everything is limited to 80 words. The reason we're doing this is because, like we said, RNNs can blow up very quickly; you have to backpropagate through time, so we want an upper bound on how many time steps we need to backpropagate through. By saying maxlen=80, we're only going to look at the first 80 words in each review and limit our analysis to that. So that is a way of truncating our backpropagation through time. It's kind of a low-tech way of doing it, but it's effective; otherwise we would be running this thing for days. Okay, so the only point here is to trim all of these reviews, in both the training and the test data set, to their first 80 words, which again have been converted to numbers for us already. Next let's build up the model itself. Hey, we didn't actually run that; let's go ahead and hit shift-enter on that block. Okay, now we can build up the model itself. And for such a complicated neural network, I think it's pretty remarkable how few lines of code are going on here. So let's talk through this. We'll start by creating a Sequential model, meaning that we can just build up the topology of our network one step at a time. We'll start with some additional preprocessing, using what's called an Embedding layer, and all that does is convert our input data of words, up to the first 80 words in a given review, into dense vectors of some fixed size. So it's going to take our vocabulary of 20,000 words, map each word to a dense vector, and then funnel that into 128 hidden neurons inside my neural network.
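The truncate-and-pad behavior described above can be sketched in plain Python. Keras's real `sequence.pad_sequences` does this in optimized form, and note that by default it pads and truncates at the front of each sequence rather than the end; this sketch just illustrates the core idea of forcing every review to the same length.

```python
def pad_or_truncate(sequences, maxlen, value=0):
    """Force every sequence to exactly maxlen items: truncate long
    ones and pad short ones with `value` (here: at the end)."""
    result = []
    for seq in sequences:
        seq = seq[:maxlen]                       # keep at most maxlen items
        result.append(seq + [value] * (maxlen - len(seq)))
    return result

print(pad_or_truncate([[5, 6, 7], [1, 2, 3, 4, 5, 6]], maxlen=4))
# → [[5, 6, 7, 0], [1, 2, 3, 4]]
```

The payoff is that every review becomes a fixed-length row, so the whole training set fits in one rectangular array, and backpropagation through time is bounded at 80 steps.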
So that's all that Embedding layer is doing: taking that input textual data that's been encoded and converting it into a format that's suitable for input into my neural network. Then, with a single line of code, we build our recurrent neural network. We just add an LSTM layer, and we can go through its properties here: we want 128 recurrent neurons in that LSTM layer, and we can also specify dropout terms in that same command. So we can say that we want a dropout of 20%, and that's all there is to it. That one line of code sets up our LSTM neural network with 128 recurrent neurons and adds dropout phases of 20%, all in one step. Finally, we need to boil that down to a single output neuron with a sigmoid activation function, because we're dealing with a binary classification problem. And that's it. We've defined the topology of our network with just four lines of code, even though it's a very complicated recurrent neural network using LSTM cells and dropout phases. Keras makes that all very easy to do. We then need to tell Keras how to optimize this neural network, how to train it. We will use binary cross-entropy, because this is ultimately a binary classification problem: did the person like this movie or not? We'll use the Adam optimizer this time, just because that's sort of the best of both worlds for optimizers, and then we can kick it off. So let's go ahead and run these two previous blocks, shift-enter, shift-enter, and at this point you're ready to actually train your neural network. Let's just walk through what's going on here; it's very similar to the previous examples. In this case, we're going to use batch sizes of 32 reviews at once, we're going to run it over 15 training steps, or epochs, set a verbosity level that's compatible with IPython notebooks, and provide the validation data for it as well. Now, again, I'm not going to actually run this right now, because it will take about an hour.
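Putting those steps together, a minimal sketch of the model described above might look like this. The exact import paths and variable names are my assumptions, written against a recent tensorflow.keras rather than copied from the course notebook, but the layer sizes and training settings match the lecture:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential()
# Map each of the 20,000 word indices to a dense 128-dimensional vector.
model.add(Embedding(20000, 128))
# One LSTM layer with 128 recurrent units and 20% dropout.
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
# A single sigmoid output neuron for the binary like/dislike decision.
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])

# Training and evaluation (slow on CPU -- this is the step that takes
# about an hour, so it's commented out here):
# model.fit(x_train, y_train, batch_size=32, epochs=15,
#           validation_data=(x_test, y_test))
# score, acc = model.evaluate(x_test, y_test, batch_size=32)
```

Four lines define the topology, one compiles it, and one trains it; everything else is handled inside Keras.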
Like I said, RNNs are hard; they take a lot of computing resources. And since all I'm doing is running this on a single CPU, I don't even have things configured to use my GPU, let alone a cluster of computers, this takes a long time. But I did run it earlier, and you can see the results here. So over 15 epochs, you can see that the accuracy it was measuring on the training data was beginning to converge. It seems like after about 13 steps, it was getting about as good as it was going to get. Then, furthermore, we can actually evaluate that model against the testing data set. So let's go ahead and call evaluate on that with our test data, again using batches of 32, and if we were to run that, we would see that we end up with an accuracy of 81% on our model here. That doesn't sound that impressive, but when you consider that all we're doing is looking at the first 80 words of each review and trying to figure out, just based on that beginning, whether or not the user liked the movie, that's not too bad. But again, step back and think about what we just made here. We've made a neural network that can essentially read English-language reviews and determine some sort of meaning behind them. In this case, we've trained it to take a sequence of words at the beginning of a movie review that some human wrote and classify it as a positive review or a negative review. So, in a very real sense, at some very basic level, we've taught our computer how to read. How cool is that? And the amount of code that we wrote to do that was minimal, right? So it's kind of liberating. It's really just a matter of knowing what technique to use to build your neural network and providing the appropriate training data, and then your neural network does the rest. It's really kind of spooky when you sit back and think about it. Anyway, cool stuff.
So it's a great example of how powerful Keras can be, and a great example of an application of a recurrent neural network: not the typical example of stock trading data or something like that, but sentiment analysis, where we took a sequence of words and used it to create a binary classification of sentiment. So, fun stuff, RNNs and Keras. 16. Transfer Learning: The world of AI is in a strange and exciting time. With transfer learning, it's never been easier to deploy a fully trained artificial intelligence model and start using it for real-world problems. The idea here is to use pre-trained models that are already out there, available on the Internet for anyone to use. For a lot of common problems, you can just import a pre-trained model that somebody else did all the hard work of putting together, optimizing, and figuring out the right parameters and the right topology for, and just use it. So, for example, if you're trying to do image classification, there are pre-trained models out there that you can just import; ResNet, Inception, MobileNet, and Oxford's VGG are some examples. And they come pre-trained on a very wide variety of object types. So in many cases you can just unleash one of these off-the-shelf models, point a camera at something, and it will tell you what it is. That's kind of freaky. Similarly, for natural language processing, there are pre-trained models available as well, such as Word2Vec and GloVe, that you can use to basically teach your computer how to read with just a few lines of code. Now, you can use them as-is, but you can also use them as a starting point if you want to extend them or build on them for more specific problems. So even if they don't solve the specific problem you're trying to solve, you can still use these pre-trained models as a starting point to build off of, which is, you know, a lot easier to get going with.
You don't have to waste a lot of time trying to figure out the right topology and parameters for a specific kind of problem. You can start with a model that's already figured all that out for you and just add on top of it. This is called transfer learning. Basically, we're transferring an existing trained model from somebody else to your application. Now, you can find more of these pre-trained models in what are called model zoos. A popular one is called the Caffe Model Zoo, and, well, I'm not sure what to think of all this. I mean, it is super, super easy to deploy AI now, as you'll soon see in our next example. You can just import an existing model-zoo model and start using it with just, you know, four or five lines of code. You don't really have to be a very good developer anymore to actually use AI for practical applications. So it's kind of a weird place for the industry to be right now, and it kind of opens up a lot of interesting and potentially scary possibilities for how people might start to use this technology, when there's such a low barrier to entry now to actually using it. Anyway, let's dive into a real-world example, and I'll show you just how scary easy it is. So let's dive into transfer learning. Open up the transfer learning notebook in your course materials, and you should see this, and you will soon see just how crazy easy it is to use and how crazy good it can be. We're going to be using the ResNet50 model here. This is used for image classification, so it's an incredibly easy way to identify objects in arbitrary images. If you have a picture of anything, maybe it's coming from a camera, or video frames, or what have you, this can tell you what's in that picture, pretty reliably, it turns out. So let's have some fun with it. Just to prove a point, I'm going to try some of my own vacation photos here with it.
That way we can be sure that the pictures I'm giving ResNet50 to classify are pictures it's never seen before, and see what it can do with them. For example, I took this picture of a fighter jet while I was exploring the deserts of California. Let's just run that; this is included with your course materials, and there we have a picture of a fighter jet. So as a start, let's see if the ResNet50 model can identify it, and see what's involved in actually coding that up. First we just need to import the modules we need, so we need to import the ResNet50 model itself. Again, that is built into Keras, along with several other models as well; we don't even have to go to the trouble of downloading and installing it, it's just there. We're also going to import some image preprocessing tools, both from Keras itself and from the ResNet50 package itself. And we're going to import NumPy, because we're going to use NumPy to manipulate the image data into a NumPy array, which is ultimately what we need to feed into a neural network. So let's go ahead and run that block. Now, one limitation of the ResNet50 model is that your input images have to be at 224 by 224 resolution. That's in part to make sure it can run efficiently. It's also limited to one of 1,000 possible categories, and that might not sound like a lot, but I think you'll be surprised at just how much detail it will give you as to what the thing is. So let's go ahead and load up that image again. This time, we will scale it down to 224 by 224 while we're loading it, and we will convert that to a NumPy array with these two lines of code, and then call the ResNet50 model's preprocess_input to prepare that data. I assume it's scaling it into whatever range the model wants, and maybe doing some preprocessing of the image itself to make it work better. It's kind of a black box, and that's a little bit weird about using transfer learning.
What's weird about using transfer learning? You know, you're just sort of taking it on faith that it's doing the right thing. But from a practical standpoint, that's not a bad thing to Dio. Let's go ahead and run this all right, so it's pre processed my image. That was pretty quick. Now we'll load up the actual model itself. One line of code is all it takes. Model equals resident 50. The weights there represent that it's going to use weights learned from the image net data set. So you can even use variations of resident 50 that were trained on different sets of images . There potentially. So let's go ahead and load up the model and thats done. So now we can just use it. So we now have a pre trained image classification model here with one line of code, and we can just use it now. All have to do is call. Predict on it and we're done that that's it. It's really that easy. So let's try it. We have. Ah, as you recall our pre processed fighter jet image here in the X ray here, and we will just call modeled operative decks and see what it comes back with. I'll come back with a classification and to translate that into something human readable will just call the decode predictions function that comes with the resident 50 model as well. It's just that easy. Okay, literally two lines of code here, right? We decide one line to actually load up the resident 50 model and transfer that learning to our application, if you will, specifying a given set of weights that was pre learned from a given set of images. And then we just call, Predict on that model and we're done. That's it. Let's run that and see if it actually works. Wow. Okay, so, yeah, it's top prediction was actually warplane, and that's exactly what this is a picture of, even though it's never seen this picture before, ever. And I didn't do anything to make sure that the photo was like from the right angle or properly framed or anything like that or, you know, pre processed. With a lot of contrast, it just works. 
It's kind of spooky good. Its second guess was "missile," followed by "projectile," and yeah, there were missiles and projectiles on that plane as well. So not only did it tell me it was a warplane, it told me that it was a warplane with missiles on it. I mean, wow, that's crazy good, right? Let's try it with some other images; maybe we just got lucky. So let's make a little convenience function here to do this on a given image more quickly. We'll write a little classify function, and it will start off by displaying a picture of the thing we're trying to classify. It will then load that image up, scaling it down to the required 224 by 224 dimensions, convert that to a NumPy array, preprocess it, and then just call predict on the ResNet50 model and see what it comes back with. So now we can just say classify and whatever our image file name is, and it will tell us what it is. We've reduced our little bit of code here to just one line. So I can hit shift-enter to define that function, and now I can just say, well, I have a file called bunny.jpg in my course materials; let's classify that. Shift-enter. There's a picture of a bunny rabbit in my front yard that I took once, and sure enough, the top classification is "wood rabbit," followed by "hare." So not only is it saying it's a rabbit, it's telling me what kind of rabbit. I don't really know my rabbit species that well, so I'm not sure if that's actually a wood rabbit, but it could be. Either way, it's pretty impressive. I mean, it's not even a prominent piece of this image; it's just sort of sitting there in the middle of my lawn. It's not even that clear of a photo, either. Imagine this scaled down to 224 by 224; there's really not going to be a lot of information there, but it still figured out that that's a rabbit. How about a fire truck? Here's a picture of a fire truck, and this isn't a normal fire truck, either.
This is from that same aviation museum where I took the picture of that warplane; it's sort of an antique fire truck that was used by the Air Force. But still, "fire engine" is the top prediction. Wow, that's kind of cool. I took a picture of my breakfast once at a fancy hotel in London; let's see what it does with that. A full English breakfast, mind you; when in London, one must eat as Londoners do. Actually, I don't know if they really eat full English breakfasts there, but it's still good. Ah, yes. So it picked up that there's a dining table in this picture, there's a tray containing my food, a restaurant. I mean, well, this was actually room service, but you could definitely imagine it's in a restaurant instead. So, yeah, again, an impressive job here on a random photo from vacation. It's never seen this picture before, and I put absolutely no thought into making sure this was a picture that would work well with machine learning or artificial intelligence for image classification. Let's keep going. When I was in England, I visited some castles in Wales; here's a picture of a castle. I did go to Wales, guys; it's beautiful there. And yeah, "castle," that's its top prediction. Its second guesses were a monastery and a palace; both good guesses, but yeah, it's a castle. And, you know, it's not even a typical-looking castle, and it still figured it out. This is incredible stuff. All right, let's see if I can trip it up. I also took a trip to New Mexico once and visited what's called the Very Large Array. This is basically an array of giant radio astronomy dishes; with only 1,000 classifications, I wouldn't imagine it would get this right. So there's the picture; it's just a bunch of giant radio astronomy telescopes. And it says it's a radio telescope. This is mind-blowing stuff, guys. All right, one more. I took a picture of a bridge once, and you remember what bridge it is: London Bridge, apparently.
Okay, so what does ResNet50 think this is? A suspension bridge. And it also sees a pier, a chain, and a chain-link fence in there as well, for good measure. So that's pretty impressive, right? I mean, if you need to do image classification, you don't even need to know the details of how convolutional neural networks work, or how to tune them, or how to build up the right topology and iterate on the right hyperparameters. You can just use somebody else's work; they already did that. And by using model zoos, from the Caffe Model Zoo or elsewhere, for a lot of common problems you can just get up and running in a couple of lines of code. It's never been easier to use artificial intelligence in a real-world application. So although it's good to understand the fundamentals, especially if you're going to be doing something that nobody's ever done before, for common AI problems there's been so much research in recent years that there's a good chance somebody has already solved the problem you're trying to solve, and you can just reuse their results, if they were kind enough to publish them on a model zoo somewhere. Wow. So, yeah, try it out on some photos of your own. Why not? If you have some, just throw them in the course materials and call my classify function on them and see what it does. Just have some fun with it. You can also try some different models and see how they behave differently. ResNet50 was actually the model that worked the best for my photos, but there are other models included with Keras, including Inception and MobileNet, that you might want to try out. If you do want to play with them, you'll have to refer to the documentation; there's a link to it here. You do need to know what image dimensions each model expects its input in, for example, for it to work at all. So, yeah, give it a try, and, man, it's mind-blowing stuff.
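If you do want to try one of those other models, swapping them in is mostly a matter of changing the import, the preprocess function, and the input size. As a sketch, and per the Keras applications documentation InceptionV3 defaults to 299 by 299 input rather than ResNet50's 224 by 224 (the filename is again a placeholder):

```python
import numpy as np
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.inception_v3 import (
    InceptionV3, preprocess_input, decode_predictions)

if __name__ == "__main__":
    # InceptionV3 wants 299x299 input, not ResNet50's 224x224.
    img = image.load_img("bunny.jpg", target_size=(299, 299))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    model = InceptionV3(weights="imagenet")   # downloads weights on first use
    print(decode_predictions(model.predict(x), top=3)[0])
```

Note that each model family ships its own preprocess_input; InceptionV3's rescales pixels into the range -1 to 1, which is different from ResNet50's mean-subtraction scheme, so don't mix them up.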
Guys, just sit there and let it sink in that it's that easy to use AI right now. 17. The Ethics of Deep Learning: A lot of people are talking about the ethics of deep learning. Are we actually creating something that's good for humanity, or ultimately bad for humanity? So let's go there. Now, I'm not going to preach to you about sentient robots taking over the world. Maybe that will be a problem 50 years from now, maybe even sooner. But for the immediate future, it's the more subtle ways in which deep learning can be misused that you should concern yourself with. And as someone entering the field, either as a researcher or a practitioner, it is up to you to make sure that this powerful technology is used for good and not for evil. Sometimes this can be very subtle: you might deploy a new technology in your enthusiasm, and it might have unintended consequences. And that's mainly what I want to talk about in this lecture: understanding unintended consequences of the systems you're developing with deep learning. First of all, it's important to understand that accuracy doesn't tell the whole story. We've evaluated our neural networks by their ability to accurately classify something, and if we see, say, a 99.9% accuracy value, we congratulate ourselves and pat ourselves on the back, but often that's not enough. For one thing, there are different kinds of errors. There's what we call a type 1 error, which is a false positive; that's when you say that something is something that it isn't. For example, maybe you misinterpreted a tumor measured from a breast biopsy sample as being malignant, and that false positive of a malignant, cancerous result could lead to real, unnecessary surgery for somebody. Or maybe you're developing a self-driving car, and the camera on the front of your car sees a shadow from an overpass in front of you.
This has actually happened to me, by the way. The car slams on the brakes because it thinks the road is just falling away into oblivion, into this dark mass, and there's nothing to drive on in front of you. Both of those are not very good outcomes. They could be worse, mind you; arguably, it's worse to leave a cancer untreated than to have a false positive, and it might be worse to actually drive off the edge of a cliff than to slam on your brakes. But these can also be very bad as well, right? You need to think about the ramifications of what happens when your model gets something wrong. Now, for the example of the self-driving car: maybe it could take the confidence level of what it thinks is in front of you and work in what's behind you as well, so that at least if you do slam on the brakes for no reason, you can make sure there's not someone riding on your tail who's going to rear-end you or something like that. So think through what happens when your model is incorrect, because even 99.9% accuracy means that one time out of 1,000 you're going to get it wrong. And if people are using your system more than 1,000 times, there's going to be some bad consequence that happens as a result. You need to wrap your head around what that result is and how you want to deal with it. The second type of error is a false negative. For example, someone might have breast cancer, but you failed to detect it; you may have misclassified it as benign instead of malignant. Somebody dies if you get that wrong. Okay? So think very closely about how your system is going to be used, the caveats you put in place, and the fail-safes and the backups that you have, to make sure that if you have a system that is known to produce errors under some conditions, you are dealing with those in a responsible way. Another example of a false negative would be thinking that there's nothing in front of your self-driving car when in fact there is.
Maybe it doesn't detect the car that's stopped at a stoplight in front of you. This has also happened to me. What happens then, if the driver is not alert? You crash into the car in front of you, and that's really bad; again, people can die. So people are very eager to apply deep learning to different situations in the real world, but often the real-world consequence of getting something wrong is a life-and-death matter, quite literally. So you need to really, really, really think about how your system is being used, and make sure that your superiors and the people who are actually rolling this out to the world understand the consequences of what happens when things go wrong, and the real odds of things going wrong. You can't oversell your systems as being totally reliable, because I promise you they're not. There can also be hidden biases in your system. Just because the artificial neural network you've built is not human does not mean that it's inherently fair and unbiased. Remember, your model is only as good as the data you train it with. Let's take an example: suppose you're going to build a neural network that tries to predict whether somebody gets hired or not, just based on attributes of that person. Now, your model itself may be all pure and whatnot, but if you're feeding it training data from real humans who made hiring decisions, that training data is going to reflect all of their implicit biases. That's just one example. So you might end up with a system that is, in fact, racist or ageist or sexist, simply because the training data you provided was made by people who have these implicit biases, and who may not have even been fully aware of them at the time. So you need to watch out for these things. There are simple things you can do. Obviously, making an actual feature for this model that includes age or sex or race or religion would be a pretty bad idea, right? But I can see some people doing that.
Think twice before you do something like that. But even if you don't explicitly put in features that you don't want considered as part of your model, there might be unintended consequences or dependencies in your features that you might not have thought about. For example, if you're feeding years of experience into a system that predicts whether or not somebody should get a job interview, you're going to have an implicit bias in there, right? Years of experience will very definitely be correlated with the age of the applicant. So if your past training data had a bias toward, you know, white men in their twenties who are fresh out of college, your system is going to penalize more experienced candidates, candidates who might in fact be better but who got passed over simply because they were viewed as being too old by human beings. So think deeply about whether the system you're developing has hidden biases, and what you can do to at least be transparent about what those biases are. Another thing to consider is: is the system you just built really better than a human? Suppose you're building a deep learning system that the people in your sales department, or your management, or your investors really want to sell as something that can replace jobs and save companies money. You need to think about whether the system you're selling really is as good as a human, and if it's not, what the consequences of that are. For example, you can build deep learning systems that perform medical diagnoses, and you might have a very eager sales rep who wants to sell it as being better than a human doctor. Is it really? What happens when your system gets it wrong? Do people die? That would be bad. It would be better to insist with your superiors that this system is only marketed as a supplementary tool to aid doctors in making a decision, and not as a replacement for human beings making a decision that could affect life or death.
A self-driving car is another example: if your self-driving car isn't actually better than a human being, and someone puts your car on autopilot, it can actually kill people. And I see this happening already, you know, where self-driving cars are being oversold, and there are still a lot of edge cases in the world where self-driving cars just can't cut it where a human could, and I think that's very dangerous. Also, think about unintended applications of your research. Let me tell you a story, because this has actually happened to me more than once; I'm not talking theoretically here. Sometimes you develop something that you think is a good thing, that will be used for positive ends in the real world, but it ends up getting twisted by other people into something that is destructive, and that's something else you need to think about. You need to think about how the technology you're developing might be used in ways that you never anticipated, and whether those usages could in fact be malicious. And this isn't just limited to deep learning; it's really an issue with machine learning in general, or really any new, powerful technology. Sometimes our technology gets ahead of us as a species, you know, socially. So let me tell you one story. This one isn't actually related to deep learning, but one of the first things I built in my career was a military flight simulator and training simulator. The idea was to simulate combat in sort of a virtual reality environment in order to train our soldiers to better preserve their own lives and, you know, come out of the battlefield safely. I felt that was a positive thing: hey, I'm saving the lives of soldiers. But after a few years, the same technology I made ended up being used in a command and control system.
It was being used to help commanders visualize how to roll out real troops and actually kill real people. I wasn't okay with that, and I left the industry in part because of that. A more relevant example: back when I worked at amazon.com, I was one of the, well, I don't want to take too much credit for this, because the people who came up with the ideas were before me, but I was one of the early people actually implementing recommendation algorithms and personalization algorithms on the Internet: taking your user behavior on the Internet and distilling that down into recommendations for content to show you. And that ended up being sort of the basis that got built upon over the years, and that ultimately led to things like Facebook's targeting algorithms, as another example. And, you know, when I look at how people are using fake news and fake accounts on social media to try to spread their political beliefs, or to serve some ulterior motive that may be financially driven and not really for the benefit of humanity, I don't feel very good about that. I mean, the technology that I created at the time just to sell more books, which seemed harmless enough, ended up getting twisted into something that really changed the course of history, in ways that might be good or bad depending on your political leanings. So again, remember that if you actually have a job in deep learning and machine learning, you can go anywhere you want to. If you find yourself being asked to do something that's morally questionable, you don't have to do it. You can find a new job tomorrow, okay? This is a really hot field, and by the time you have real-world experience in it, the world is your oyster. If you're asked to do something morally questionable, you can say no; someone else will hire you tomorrow, I promise you, if you're any good at all. And I see this happening a lot lately.
There are a lot of people publishing research about using neural networks to crack people's passwords, or to illustrate how they could be used for evil, for example by trying to predict people's sexual orientation based on nothing but a picture of their face. This can't go anywhere good, guys. What are you trying to show by publishing that sort of research? So think twice before you publish stuff like that, and think twice before you implement stuff like that for an employer, because your employer cares about making a profit; they are less concerned about the moral implications of the technology you're developing to deliver that profit. People will see what you're building out there, and they may well take that same technology, those same ideas, and twist them into something you never considered. So keep these ideas and concerns in the back of your head, because you are dealing with new and powerful technologies here, and it's really up to us as technologists to steer that technology in the right direction and use it for the good of humanity, not for its detriment. That sounds very high-horse and preachy, I know, but these are very real concerns, and there are a lot of people out there who share them. So please consider them as you delve into your deep learning career.

18. Deep Learning Project Intro: So it's time to apply what you've learned so far in this deep learning course. Your final project is to take some real-world data of masses detected in mammograms (we actually talked about this in the ethics lecture), and, based on the measurements of those masses, see if you can predict whether they're benign or malignant. We have a data set of mammogram masses that were detected in real people, and real doctors have looked at them and determined whether or not they are benign or malignant in nature.
So let's see if you can set up a neural network of your own that can successfully classify whether these masses are benign or malignant. Let's dive in.

I've given you some data and a template to work from; the data is in your course materials. The mammographic_masses.data.txt file is the raw data you'll be using to train your model, and you can see it's just six columns of numbers. What they represent, I'll show you in a minute; there's also a description of the data set in the .names text file that goes along with it. To get you started, I've given you a bit of a template to work with if you open up the DeepLearningProject.ipynb file. The data came from the UCI repository, and I've given you a link to where the original data came from. It's actually a great resource for finding other data to play with, so if you're still learning machine learning, it's a great place to find things to experiment with. But for this example, this is what we're going to be working with.

Here's the description of those six columns. The first is called the BI-RADS assessment, and it's basically a measurement of how confident the diagnosis of this particular mass was. That sort of gives away the answer, so it's not what we call a predictive attribute, and we're not going to use it for training our model. Next we have the patient's age, a classification of the shape of the mass, a classification of the mass margin (how its edge looks), and the density of the mass. Finally, we have the thing we're trying to predict: the label, the severity, whether it's benign (zero) or malignant (one).
So we have a binary classification problem, very similar to things we've done earlier in the course, and you shouldn't need much more than code snippets from earlier exercises in this course, adapted to this data set. Now, one little caveat here. Typically, when you're doing machine learning, you don't want to treat what we call nominal data as if it were numerical, and both shape and margin are technically nominal. While we're converting them to numbers, those numbers aren't necessarily meaningful in terms of their gradation: going from 1 to 4 doesn't mean we're going linearly from round to irregular. But sometimes you have to make do with what you have; it's better than nothing, and at least there is some logic to the progression of the numerical scores here. They do generally go from more regular to more irregular as the numbers increase, so we're going to go ahead and use them anyway.

Anyway, this is important stuff. There's a lot of unnecessary anguish and surgery that comes from false positives on mammograms, so if you can build a better way of diagnosing these things, all the better. But again, think back to my ethics lecture. You don't want to oversell this; you want to be sure it's presented as just a tool to assist a real human doctor, unless you're very confident the system can actually outperform a human. And arguably it can't, because we're training it on data that was labeled by humans, so how could it possibly be better than a human? Think about that.

All right, your job is to build a multi-layer perceptron to classify these masses. I was able to get over 80% accuracy with mine; let's see how you do. A lot of the work is just going to be in cleaning the data, and I'll step you through the things you need to do here. Start by importing the data file using the read_csv function.
You can then take a look at it, convert the missing data into NaN (not-a-number) values, and make sure you import all the column names. You might need to clean the data, so try to get an idea of its nature by using describe() on the resulting Pandas DataFrame. Next, you'll need to drop rows that have missing data, and once you've taken care of that, you'll need to convert the DataFrame into NumPy arrays that you can then pass into scikit-learn or into Keras. You'll also need to normalize the data before you analyze it with Keras; a little hint there is to use preprocessing.StandardScaler out of sklearn, which can make things very easy for you. That's the one thing we haven't really done before; the rest of this you should be able to figure out based on previous examples. Once you have the data in the proper format, it should be pretty straightforward to build an MLP model using Keras, and you can experiment with different topologies and different hyperparameters to see how well you can do. So I'm going to set you loose here to give this a shot and see how you do. When you come back in the next lecture, I'll show you my solution and how I walked through this myself. So go forth and practice what you've learned.

19. Deep Learning Project Solution: So I hope you got some good experience applying what you've learned to create a neural network that can classify masses found in mammograms as benign or malignant. Like I said, I got around 80% accuracy myself; I wonder how you did. Anyway, I started off by just blindly reading in the CSV file using pd.read_csv and taking a look at it. I saw at that point that the column names were wrong (the column name information was missing from the data file), and there were missing values in there, indicated by question marks. So I had to read it in a little more intelligently.
So on my second try, I called read_csv, explicitly passing in the knowledge that question marks mean missing (NA) values, along with an array of column names like we did before, and did another head() on the resulting Pandas DataFrame. Things looked a lot better. At that point, we have to clean the data. Now that it's in a proper format and organized properly, we can do a describe() on it to take a look at things, and we get an idea that we are missing some data, and that things seem to be reasonably well distributed, at least at this point. I did a little count here to see exactly what's missing. My strategy was to check whether I'd introduce any sort of bias by removing the missing values, and since the missing data seems to be randomly distributed, just by eyeballing it at least, that's probably a good indication that it's safe to simply drop those rows. Having determined that that's an okay thing to do, I called dropna() on the DataFrame and described it again, and now I can see that I have the same count of rows in each individual column. So I now have a complete data set where I've thrown out the rows that are missing data, and I've convinced myself that's an okay thing to do statistically.

All right, now we need to extract the features and labels that we want to feed into our model, our neural network. I've extracted the feature data (the age, shape, margin, and density) from that DataFrame into a NumPy array called all_features. I've also extracted the severity column and converted it into an all_classes array that I can pass in as my label data. And I've created a handy-dandy array of column names, since I'll need that later on. Just to visualize, I've printed out all_features to take a look at it, and sure enough, it looks legitimate: an array of four features
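The cleaning steps described here can be sketched roughly as follows. This is a minimal sketch rather than the course's actual notebook; a tiny inline sample stands in for the mammographic_masses.data.txt file so the snippet runs on its own.

```python
import io
import pandas as pd

# Tiny inline sample standing in for mammographic_masses.data.txt;
# a '?' marks a missing value in this data set.
sample = io.StringIO(
    "5,67,3,5,3,1\n"
    "4,43,1,1,?,1\n"
    "5,58,4,5,3,1\n"
    "4,28,1,1,3,0\n"
)
names = ['BI-RADS', 'age', 'shape', 'margin', 'density', 'severity']

# Read the data, treating '?' as NaN and supplying the column names
df = pd.read_csv(sample, na_values=['?'], names=names)

# Drop rows with missing data, then split features from the label.
# BI-RADS is excluded because it gives away the answer.
df = df.dropna()
all_features = df[['age', 'shape', 'margin', 'density']].values
all_classes = df['severity'].values
```

Running describe() on the DataFrame before and after the dropna() is how you check the per-column counts and convince yourself the missing data looks randomly distributed.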
appearing on each row, which looks reasonable. At that point, I need to scale my data down. To do that, all I need to do is import the preprocessing.StandardScaler class and apply it to my feature data. If I look at all_features_scaled, which came out of that transformation, I can see that everything now appears to be normally distributed, centered around zero with a standard deviation of one, which is what we want. Remember, when you're feeding inputs into a neural network, it's important that your data is normalized first.

Now we get to the actual meat of the thing: setting up our MLP model. I'm going to wrap it in such a way that I can use scikit-learn's cross_val_score to evaluate its performance. In my example, I've created a little function called create_model that creates a Sequential model, adds in a dense layer with six units (six neurons) using the ReLU activation function, adds another layer with one unit that does my final sigmoid classification, my binary classification, on top of that, and compiles it with the Adam optimizer and the binary cross-entropy loss function. So with this, we've set up a single layer of six neurons that feeds into one final binary classification layer; very simple. I've then used KerasClassifier to build a scikit-learn-compatible version of this neural network, passed it into cross_val_score to do K-fold cross-validation, in this case with 10 folds, and printed out the results. With just those six neurons, I was able to achieve 80% accuracy in correctly predicting whether or not a mass was benign or malignant, just based on the measurements of that mass. Now, in the real world, doctors use a lot more information than just those measurements, so our algorithm is at a disadvantage compared to those human doctors to begin with.
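Here's a minimal sketch of the scaling and model just described, assuming TensorFlow's bundled Keras API. Random data stands in for the mammogram features so the snippet is self-contained; in the actual solution, create_model is additionally wrapped in KerasClassifier (found in keras.wrappers.scikit_learn in older Keras releases, or the separate SciKeras package today) and evaluated with scikit-learn's cross_val_score over 10 folds.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

def create_model():
    # One hidden layer of six ReLU neurons feeding a single
    # sigmoid unit for binary classification
    model = Sequential([
        Input(shape=(4,)),
        Dense(6, activation='relu'),
        Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

# Random stand-in data: 100 samples, 4 features, binary labels
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype('float32')  # a trivially learnable label

# Normalize to zero mean / unit variance, as the lecture recommends
X_scaled = StandardScaler().fit_transform(X)

model = create_model()
model.fit(X_scaled, y, epochs=5, verbose=0)
preds = (model.predict(X_scaled, verbose=0) > 0.5).ravel()
```

With the real data, you'd replace X and y with the all_features_scaled and all_classes arrays built during cleaning.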
But that's not too bad. If you did better, if you used more layers or more neurons, I'd be curious to see whether you actually got a better result. It turns out that sometimes you don't need a whole lot to get the optimal result from the data you have. But if you were able to substantially improve upon that result, congratulations. So I hope deep learning has been demystified for you. Up next, we'll talk about how to continue learning more in the field of deep learning.

20. Deep Learning: Learning More: Well, I hope I've whet your appetite for deep learning and you're eager to learn more. It's a very quickly emerging field, and a relatively new one, so traditional resources for learning it are still a little hard to find. There are a few other online courses out there besides this one, but as far as books go, O'Reilly is a pretty good bet. I like this one a lot. I don't make any money from it, by the way; I'm not recommending it to make a buck, I just actually like this book. It's called Hands-On Machine Learning with Scikit-Learn and TensorFlow, and it sticks at the TensorFlow level. So, like we talked about, it's more of a lower-level implementation; it doesn't get into Keras much at all, but it is a good overview of the theory behind the different techniques of neural networks, RNNs, CNNs, and whatnot. As a bit of supplementary print material to reinforce what you learned in this course, I think it's a pretty good bet. There's also a new book that just came out (new as I record this in the fall of 2017) called Deep Learning, also from O'Reilly. It's also good, but it's different in that it actually uses Java: it's not using Python at all, or the TensorFlow or Keras libraries. Instead, it uses the DL4J library. So if you prefer Java over Python, this might be a very good book to pick up.
It also goes over the general concepts we talk about here, in a little more depth as well. Cool. And remember, Google is your friend. If there's some specific thing you're trying to figure out, like "how do I use a GRU cell in Keras," just punch that into Google, and it will probably come up with some tutorials and documentation on how to do it. Google really is your friend in this stuff, because this field is always changing, and Google is going to have the most up-to-date resources available. Cool. So go forth and do more deep learning.

21. Let's Stay in Touch: Congratulations again on completing this course. It's just one of many that I offer in the fields of AI and big data, and I hope you'll continue your learning journey with me. If you click on my name on the course's main page, you'll see the other courses I offer. There's no fixed order you need to take them in; just go wherever your curiosity and interests lead you. If you'd like to stay in touch with me outside of Skillshare, my website is sundog-education.com. From there, you can subscribe to our mailing list to be the first to know about new courses and the latest news in the fields of AI and big data. Links to follow us on social media are there as well, and I've also written a few books that you can explore at our website, if you want something a bit more permanent than online videos. Again, congrats on completing a challenging course, and I hope to see you again soon.