Transcripts
1. Introduction: Hi. Welcome to Data Science and Machine Learning with Python. I'm your instructor, Frank Kane, and I spent over nine years at amazon.com and imdb.com developing and managing some of their most famous features, like Recommended for You, People Who Bought Also Bought, top sellers, and movie recommendations at IMDb. All of these features required applying data mining and machine learning techniques to real-world data sets, and that's what this course is all about. Being a data scientist in the tech industry is one of the most rewarding careers on the planet today. I went and studied actual job descriptions for data scientist roles at tech companies, and I distilled those requirements down into the topics that you'll see in this course. This course is really comprehensive. We'll start with a crash course in Python and do a review of some basic statistics and probability. But then we're going to dive right into over 60 topics in data mining and machine learning. That includes things like Bayes' theorem, clustering, decision trees, regression analysis, and experimental design. We'll look at them all, and some of these topics are really fun. We're gonna develop an actual movie recommendation system using actual user movie rating data. We're gonna create a search engine that actually works for Wikipedia data. We're gonna build a spam classifier that can correctly classify spam and non-spam emails in your email account. And we also have a whole section on scaling this work up to a cluster that runs on big data using Apache Spark. If you're a software developer or programmer looking to transition into a career in data science, this course will teach you the hottest skills without all the mathematical notation and pretense that comes along with these topics. We're just going to explain these concepts in plain English and show you some Python code that actually works, that you can dive in and mess around with to make those concepts sink home.
And if you're working as a data analyst in the finance industry, this course can also teach you to make the transition into the tech industry. All you need is some prior experience in programming or scripting, and you should be good to go. The general format of this course is that I'll start with each concept, explaining it in a bunch of slides and graphical examples, and I'll do it in plain English as much as possible. I will introduce you to some of the notation and fancy terminology that data scientists like to use, so you can talk the same language. But the concepts themselves are generally pretty simple. After that, I'll throw you into some actual Python code that actually works, that we can run and mess around with. And I will show you how to actually apply these ideas to actual data. These are going to be presented as IPython Notebook files, and that's a format where I can intermix code and notes surrounding the code that explain what's going on in the concepts. So you can take these notebook files with you after this course and use them as a handy, quick reference later on in your career. And at the end of each concept, I'll encourage you to actually dive into that Python code, make some modifications, mess around with it, and just gain more familiarity by getting hands-on and seeing the effects your modifications have. So with that, let's dive in and get started.
2. Windows Setup Instructions: So let's get you all set up for this course. Pretty easy to do. We're going to start by installing a Python development environment called Anaconda, if you don't have it already. Once that's installed, we're just going to install a couple of packages that we need that don't come with it: pydotplus and TensorFlow, which we'll use later on for doing neural networks. And then we'll download all the course materials from our website, get them installed, and make sure it all works. So let's start by heading over to anaconda.com. This will be the Python environment that we use for this course. It is already made for data science and data analytics, so it saves us a lot of trouble in setting things up. Just head on over to the big friendly Get Started button, or however you need to navigate to the download area. We're going to download the Anaconda installers. And we want, for me, the 64-bit Windows version. Obviously choose whatever version makes sense for whatever operating system you are using, but for me that's Windows 64-bit. So we'll wait for that to download. It should come down pretty quickly. And once that's down, we're just going to open the installer and run it. All right, so nothing special here, just your standard Windows installer. I'll hit Next to start. Agree to the license terms. Install it just for me. And you wanna make sure you're installing this someplace that has plenty of hard drive space. For me, the C drive is a little bit tight because it's a small SSD, so I'm actually going to change this to my E drive. Just do whatever makes sense for your system to make sure that you have enough room for it. And we'll hit Next to continue. Those default options are fine. And now we just wait for it to install, and this will take a few minutes. There's a lot of stuff to install, so I'll pause and come back when this is done. All right, a couple of minutes later that did finish. So I'll hit Next to continue on with the installer.
And we'll just say Next again. And I'm going to uncheck these options; I don't really need to see a tutorial. And there we have it. So now that Anaconda has been installed, we just need to install a couple of extra packages into it. To get into our new Anaconda environment, go to your Start menu and look for the Anaconda3 folder there. From there, select Anaconda Prompt, and you should see a little command terminal like this. And from here we're just going to type in conda install pydotplus. This is just going to install a package that lets us visualize decision trees later in the course. It'll take a moment to figure out how to do that. And when you get that prompt, just say y, Enter, and let it do its thing. All right, very good. Next we need to install the TensorFlow package. That's what we're going to be using for deep learning later in the course, and neural networks and all that fun stuff. Normally you would say conda install tensorflow to do that, but there is a bug with that right now on Windows. So instead, we're going to say pip install tensorflow, and that should work. Again, just let it do its thing. There's a bunch of dependencies it has to get first, but it shouldn't take too long. All right, so now we have TensorFlow installed as well. So let's leave this window up; we're going to come back to it a little bit later. But let's go back to our web browser and download the course materials next. For that, you'll head on over to media.sundog-soft.com/machine-learning.html. Pay attention to the dashes and capitalization; it all matters. And once you're here, you'll find a nice friendly link for the course materials. Just go ahead and click that, and down it comes. While you're here, if you want a copy of the slides, you'll find those here as well. Once that's downloaded, go ahead and open that up. And let's expand the course materials there: right-click and Extract All.
And what's in here is a bunch of what we call Jupyter Notebook files. These are ways of running Python code interactively within a web browser. So pretty much every lecture in this course is going to be accompanied by a hands-on IPython Notebook that you can play around with and experiment with, and that's what's in here. Also in here is a lot of experimental test data that you can use for actually training these models, playing around, and actually making predictions based on real data. And that's most of what's being decompressed right now, so give that some time to finish decompressing. All right, it's done unzipping. So let's go ahead and find the folder that it expanded into. There should be an MLCourse folder there now, so let's open that up. And inside the MLCourse folder is another MLCourse folder. This is the one that we actually want. So let's go ahead and select that MLCourse folder inside the other one, and I'm going to hit Control-X to cut that. And now we want to put that someplace that's going to be easy for us to remember and easy to type. So I'm gonna put this on the root of my C drive. I'll just hit Control-V to paste it into my C drive here, and wait for that to copy over. And the reason I'm putting it here is because we're going to have to type in the path that it's in. That's going to be C:\MLCourse in this case. All right, so now we have a C:\MLCourse folder, and inside that are the actual course materials themselves. You can see all the data there, as well as all the IPython Notebooks. That's what .ipynb stands for; it's called a Jupyter Notebook these days. So let's go ahead and try one and see if it actually works. So go back to our Anaconda Prompt here; we're done with all this stuff in the background. Remember to practice this, guys. This is going to be something you need to do with almost every lecture.
To actually open up the notebook file for a given lecture, first you need to open up an Anaconda Prompt. And again, that's under the Start menu, under the Anaconda menu. And then I need you to cd to wherever you installed those materials. So I'm going to say cd C:\MLCourse, because that's where I installed the course materials. It's important that you start this from within the correct directory, or else these notebooks won't show up. But once we're in the directory that we actually installed these materials to, I can then type in jupyter notebook. And that will start up the web browser that will allow me to actually run those notebooks. So again, remember, every time you need to open up a Jupyter notebook, open up an Anaconda Prompt, cd to the directory you installed the course materials to, and then type in jupyter notebook. You might want to write this down, guys. You're gonna have to do this a lot in the future. And what this does is actually bring up your web browser. This brings us to the Jupyter main page, where we can actually select the different notebooks to run. So let's see if it actually works. Let's scroll down a bit. Outliers is a fun one, so let's click on Outliers.ipynb. And a quick introduction to Jupyter notebooks here: you can see that it's basically a way of running Python code inline. We can actually see the results within the browser and run it. And it's not just a pre-made webpage; you can actually run code in here. So watch this. I can actually click on one of these blocks here and hit the Run button, and it actually executes this code and generates a new graph in response to that. So this is a great way to interactively experiment with some Python code and play around with new algorithms, and that's exactly what we're gonna do in this course. So let's really quickly walk through this example here, just so you can see what's going on at a high level. We'll talk about it in more depth later.
But basically what's going on here is we're simulating an income distribution. So we've simulated a bunch of random people who have incomes of, you know, $27,000 plus or minus $15,000 a year. And then to mess things up, we throw in Jeff Bezos, who's got a billion dollars to his name, probably more than that at this point, right? And you can see that that kinda skews our distribution here. So we just have this one little skinny point here that represents all the normal people, and then we have Jeff Bezos out here messing up our data. So what we're talking about in this particular exercise is how to identify outliers like Jeff Bezos and remove them from our data so that we can actually get a more meaningful distribution. And that's what's happening here. And there's actually a shortcut to run this whole thing all at once. You can go to the Kernel menu here and say Restart and Run All, and that will automatically rerun all of these cells. And you can see that it actually works. So hopefully if you do that, you're seeing some pretty graphs. And if so, that means that you have everything set up properly. Congratulations. Again, remember how to get here, guys; write it down. You're going to open up an Anaconda Prompt, cd to the directory that you installed the course materials to, then type in jupyter notebook and select the notebook that you want to open. All right, with that under our belt, let's move on and actually start learning some stuff.
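As a preview of the outlier idea described above, here is a minimal sketch. It is not the code from the course's Outliers notebook: the sample size, the fixed random seed, the helper name reject_outliers, and the "within two standard deviations of the median" rule are all assumptions made for illustration. Only the income figures ($27,000 plus or minus $15,000, and a one-billion-dollar outlier) come from the lecture.

```python
import numpy as np

# Fixed seed (an assumption) so the numbers are the same on every run.
rng = np.random.default_rng(seed=0)

# Simulated incomes: mean 27,000 and standard deviation 15,000, as in the lecture.
# The sample size of 10,000 is an assumption.
incomes = rng.normal(27000, 15000, 10000)

# Throw in one billionaire to skew the distribution.
incomes_with_outlier = np.append(incomes, 1_000_000_000)


def reject_outliers(data, num_std=2.0):
    """Keep only values within num_std standard deviations of the median.

    This filtering rule is one simple choice, assumed here for illustration.
    """
    median = np.median(data)
    std = np.std(data)
    return data[np.abs(data - median) <= num_std * std]


filtered = reject_outliers(incomes_with_outlier)

print("mean with outlier:   ", np.mean(incomes_with_outlier))
print("mean after filtering:", np.mean(filtered))
```

Running this in a notebook cell should show the mean dropping sharply once the single extreme value is removed, which is the effect the lecture describes.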
3. Mac Setup Instructions: All right, so real quick, let me walk you through getting set up here. I'll show you how to do all this firsthand. But for the summary, we're going to install Anaconda, which is a Python environment made for scientific computing, data science, and machine learning. Once we have that set up, we'll install a package called pydotplus, which we'll use later on in the course for visualizing decision trees. And we'll also install TensorFlow, which we're going to use to build neural networks and real AI and deep learning later on in the course. We'll also download the course materials from my website, open up one of the notebooks from those materials, and see if it works. So let's dive in. The first thing we need is a Python development environment, and in this course we're using Anaconda for that purpose. It comes with most of the packages that we need for this course pre-installed, so it's going to save you a lot of trouble. So even if you have an existing Python development environment, I recommend installing Anaconda on the side as well. To get it, just head on over to anaconda.com. You want to go to the Products menu and go to the Individual Edition. Basically, look for the open source version of Anaconda, wherever they might be hiding it on the website these days. And from there we're going to click on the big friendly Download button, select the macOS graphical installer, and wait for that to download. All right, that is completed. Let's go ahead and open that installer up. There it is; just double-click it to launch it. And we'll hit Continue. Continue. Read the license agreement. Continue, and Agree. All right. We'll just go through the defaults here. We'll install for me only. And you do need quite a bit of disk space for this guy; it's 2.13 gigabytes to start, and we're going to install more stuff on top of it as well. So make sure you have plenty of room for this install.
And this will take a while. There's a lot for it to install, so we'll come back when it's done. All right, just about done here. We'll hit Continue here to finish things up, and Close. And yes, we're done with the installer; we can discard that. So now let's open up a new terminal. And if you had one open already, you're going to need to close that and reopen it in order to pick up the new environment variables that Anaconda installed. So once you're sure that you have a fresh terminal open here, we can just type in conda install pydotplus, just like that. And that will install the pydotplus package that we're going to need later in the course in order to visualize decision trees. So just let it go off and do its thing here for a bit. Y to continue. All right, the other thing we need is TensorFlow, and that's just as easy to install: conda install tensorflow, just like that. And this is a package that we need to make deep neural networks, which is going to be a lot of fun later in the course. This will be a larger environment. Y to proceed, and off that goes. Cool. So we have Anaconda completely set up for the entire course at this point; that was pretty easy. The next thing we need to do is get the course materials. So let's go back to our browser and go to media.sundog-soft.com/machine-learning.html. Pay attention to capitalization, spelling, and where the dashes are; it all matters. You should see a page that looks like this. And you'll see a big friendly link here for the course materials. Go ahead and click on that to download it. And also here you'll see a link to the course slides, if you want a copy of those. All right, so now that the course materials have downloaded, let's go ahead and open that when it's done; it should be just doing a quick virus scan right now. All right, looks like our course materials are there, and it already decompressed them for us as well. That's cool.
So let's go back to our terminal here and see if we can actually use these materials. Now we need to know where they are. In order to actually launch these notebook files that are inside the course materials, we need to be able to navigate to them first. So we're in our home directory right now. Let's cd into Downloads, and that should be where the MLCourse directory lives. There it is. All right, so if you want to move that someplace else, you can. It just matters that you know how to get to this place, okay? Because what's in here is a bunch of what we call Jupyter Notebook files. These are interactive ways of actually running and experimenting with Python code that will allow us to play around with all the algorithms in this course. Now in order to launch these, we need to first navigate into this directory. So I need to use the cd command to navigate into where this is. So again, I'm under Users, Frank, Downloads, MLCourse; for you that will be in your own Users directory, most likely. And you just need to be able to cd into that. So the first step to actually launching these things is to cd into the directory that I downloaded these into. I'm just doing this for the sake of illustration here; for you, you can cd to wherever you saved that to. Once you're in there, you're going to type in the following: jupyter notebook. Just like that. Jupyter is spelled funny, so make sure you get that right. And what that will do is launch Jupyter Notebook within this folder. That way all the correct script files will show up, and all the data that they need will be in the right place as well. And you can see here a listing of all the notebook files that come with the course. So let's see if it actually works. Let's go to Outliers.ipynb. That's a simple one. So whenever I'm in a lecture in this course and I say open up Outliers.ipynb or whatever it is, this is what you wanna do, guys. This is important. Write this down.
Okay, I'm not going to go over this very many times again in the future. You need to, again, open up a terminal, cd to where you installed the materials, and type in jupyter notebook, with Jupyter spelled funny. And from there, select the script that you want to open. So let's open up Outliers and see if it works. It does, cool. So you should be seeing a screen like this. And what's cool is that you can actually run this code inline here, and actually modify it, play with it, and mess around with it. It's not just a static webpage. So for example, I can click on this block of Python code here and hit this Run button, and it'll go ahead and actually execute that code and generate this graph automatically based on that execution. So how cool is that? If you're curious what's going on here, we're basically creating a distribution of people's incomes, and then we're throwing in Jeff Bezos as a billionaire at the end to show the effect of an outlier on a data distribution. And throughout the rest of this exercise, we go through and find ways of identifying outliers such as Jeff Bezos and rejecting them from the dataset, which allows us to get a more meaningful interpretation of the data for everybody else. But we'll talk about that more in the future. It's fun stuff. For now though, you should be seeing a working Jupyter notebook here. If you do, then great, things are set up properly. If not, go back and double-check things. There could be a conflict with another Python environment that you might have installed, so that might be something you have to track down. But hopefully things are working and we can move on.
4. Linux Setup Instructions: All right, so real quick, let me walk you through getting set up here. I'll show you how to do all this firsthand. But for the summary, we're going to install Anaconda, which is a Python environment made for scientific computing, data science, and machine learning. Once we have that set up, we'll install a package called pydotplus, which we'll use later on in the course for visualizing decision trees. And we'll also install TensorFlow, which we're going to use to build neural networks and real AI and deep learning later on in the course. We'll also download the course materials from my website, open up one of the notebooks from those materials, and see if it works. So let's dive in. All right, let's get things set up on Linux. I'm on an Ubuntu host here. And the first thing we need to do is install Anaconda. Anaconda is a Python environment made for scientific computing. It contains libraries needed for data science and machine learning, so it will save you a lot of trouble in installing packages by using this instead of just a generic Python installation. Enthought Canopy can also work if you've got that, but Anaconda is what I'm using in this course for now. So head over to anaconda.com if you don't have it, find the Download button, and hit Download again. And select your operating system; we are on Linux. And you want the Python 3 version, 3-point-whatever it is. The code in this course is for Python 3, not Python 2. I'm on an x86 system, so I will go ahead and install the x86 installer. You can see it's large. We'll download that to my home directory, in the Downloads folder automatically, and we'll just wait for that to come down. Once that's done downloading, we can minimize our browser and open up a terminal. We'll cd into our Downloads folder. And we need to make that shell script executable, thusly: chmod a+x Anaconda3, whatever it is. And now we should be able to run that installer script.
Press Enter to look at the license agreement, pressing space as you read it. And assuming you agree to the terms, type in yes. That home directory location is fine by me, so we'll hit Enter. And off it goes. There is a lot for it to install, so we'll come back when it's done. All right, it's almost done installing. It wants to know if you want to initialize Anaconda; not sure why you would say no, so let's type in yes. And we're ready to go. So Anaconda is installed at this point. That's awesome. Now some environment variables were changed here, so to make sure we pick them up, I'm going to close out of this terminal and bring up a new one. Now there are a couple of packages that we need to install that did not come with the default installation. One is pydotplus. We'll use that to visualize decision trees later in the course. To install that, just type in conda install pydotplus, just like that. And you're gonna see a lot of warnings in general when you run code and operate with the Anaconda environment. Usually they're just talking about things that are being deprecated in the future, and it's safe to ignore those. So don't freak out about the warning messages, guys. There are going to be a lot of them, and they are almost always safe to ignore. If it's an error, that's a different story, but don't sweat the warnings. All right, we'll hit y to proceed. And that was quick. Now we also need to install TensorFlow. TensorFlow is a package used for building deep learning neural networks, and we'll be playing with that later in the course as well. To install that, you just say conda install tensorflow. And if you're on a system that has an NVIDIA GPU, you can accelerate things by actually installing tensorflow-gpu instead. But if you're not sure, just stick with tensorflow. As I'm running in a little virtual environment here, I don't have a lot of confidence that tensorflow-gpu will actually work, so I'm going to stick with just plain tensorflow.
And that will go off and do its thing as well. And again, y to proceed. All right, and at this point, Anaconda is installed with everything we need for this course. Let's go ahead and clear out the screen. And let's head back to our browser, and we'll get the course materials next. Now to get those, you're going to head over to http://media.sundog-soft.com/machine-learning.html. Pay attention to the capitalization and dashes; it all matters. And you should see a page that looks like this. All right, so you'll see a big friendly link here for the course materials. This contains all of the Python notebook files that we're going to use throughout the course. Go ahead and click that to download it, and we'll go ahead and save that. And if you want the slides, there's a copy of them here as well. All right, now that the course materials are in place, let's go ahead and close out of the browser here, and go back to our terminal. Let's go to our Downloads folder again, and let's unzip that MLCourse.zip file. All right, so we just need to remember where this is. So let's go into MLCourse. And so it's under Downloads/MLCourse. You can move that someplace else if you want to. The important thing is just that you remember where it is and how to get there. All right, so let me show you how to actually run these things. This is a collection of Jupyter Notebook files, is what they're called. They're ways of interactively running Python code within a web browser. And this also includes all of the sample data that we need for the course to actually train our models and actually do machine learning. But in order to actually run them, we need to start what's called Jupyter Notebook from within this directory.
So remember, whenever we actually start a notebook within this course, what you need to do first is open up a terminal window and cd into the directory where you installed these course materials. Okay, for us, that's going to be our home directory, under Downloads and then MLCourse. And once you're in that directory, type in jupyter, with a y, notebook, just like that. It's important that you start this from the correct directory. Once we do that, we should have a browser pop up. And there it is. So that's cool. Now we can see the list of notebook files that we have for this course here; we just need to select the one that we want. So whenever I say to open up a specific notebook file in the course, just look for it here. Let's open up Outliers.ipynb. That's an interesting one. Cool. So you can see that we have all this inline Python code that we can actually run interactively and actually see the output as we run it within our little web browser. It's kinda cool. And this isn't just a pre-made webpage, guys. This is actually an environment where you can run and modify code. So for example, I can click on this little block here. What it's actually doing is creating a random distribution of people's incomes, and then throwing in Jeff Bezos as an outlier with a billion dollars. I can just click on that and hit this Run button, and that will actually execute that code and produce this graph in response. Pretty cool. And the rest of this, we'll talk through later when we actually get to this lesson. But basically we talk about how to remove Jeff Bezos as an outlier and get more meaningful visualizations of the data for normal people here. So, an important activity there. But anyway, yeah, it seems to be working. So if you've got this far, everything is set up properly. Congratulations, and we can move forward. Again, remember how to go about opening these notebook files.
You want to open up a terminal, cd to the directory that you installed the course materials into, and then type in jupyter notebook. And remember, Jupyter is spelled funny, J-U-P-Y-T-E-R, and that should bring you to where you need to be. All right, guys, let's move on.
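Whichever operating system you set up on, one quick way to double-check the installs from the setup lectures is to ask Python whether the packages can be found. The sketch below is not part of the course materials; the helper name check_packages is hypothetical, and the package list simply mirrors the ones named in the lectures (pydotplus and tensorflow, plus numpy, which the Python basics lecture imports).

```python
import importlib.util


def check_packages(names):
    """Return a dict mapping each top-level package name to True if it is importable.

    find_spec() only locates the package; it does not actually import it,
    so this stays fast even for large packages.
    """
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Package names taken from the setup lectures.
    for name, found in check_packages(["numpy", "pydotplus", "tensorflow"]).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

You could paste this into a new notebook cell or run it from the same terminal you use to start Jupyter. If a package shows as missing, rerunning the matching install command from the lecture in that same terminal is the first thing to try.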
5. Python Basics, Part 1: So if you already know Python, you can probably skip the next two lectures. But if you need a refresher, or if you haven't done Python before, you'll want to go through these. There are a few quirky things about the Python scripting language that you need to know about. So let's dive in, just jump into the pool, and learn some Python by writing some actual code. All right, time for a crash course in Python. Now, like I said before in the requirements for this course, you should have some sort of programming background to be successful in this course. You've coded in some sort of language, even if it's a scripting language. JavaScript, I don't care what it is. C++, Java, something. But if you're new to Python, I'm gonna give you a little bit of a crash course here. I'm just gonna dive right in and go right into some examples. There are a few quirks about Python that are a little bit different than other languages you might have seen, so I just want to walk through what's different about Python from other scripting languages you may have worked with, and the best way to do that is by looking at some real examples. So let's dive into some code. So one last time, I'll illustrate how to actually open a notebook on your system here. I'm on Windows here. Refer to the previous lectures if you need instructions on a different operating system, but in general, you're gonna want to open a command prompt of some sort. And on Windows, you're going to need to use the Anaconda command prompt. So find your Anaconda3 menu in the Start menu and go to Anaconda Prompt. Wait for that to come up, and again, you need to cd into the directory where you saved all of your course materials. So for me, that was C:\MLCourse. You could do a dir just to make sure it's all there. And once you're in the correct directory, type in jupyter, with a y, notebook, and you should see a screen like this.
And from here we want to select the Python 101 notebook, because that's gonna contain our little tutorial in Python. So go ahead and click on Python101.ipynb, and you should now have a screen that looks like this. So let's dive in. If you've never seen a Jupyter notebook before, the way it works is you can click in any of these boxes of code and hit the Run button or Shift-Enter, and it will execute that code right from your web browser. Let's try it with this first block. Click inside of it to select it and hit Shift-Enter. Now we're just going to cover the syntax of Python here and the main ways in which it differs from other languages. So let's take a closer look at this code. One thing with Python is that whitespace is really important. Any nesting of code, like for iteration or conditional expressions, relies on the number of tabs to group code together, instead of curly brackets like other languages. So here we have a list of numbers. In Python, a list is like an array or a vector in other languages. We define a list of the numbers one through six by putting them in between square brackets, separated by commas. In Python, there is no character required to terminate a line; you just hit Enter when you're done. Let's add seven to the list just to prove that running this actually does something. Yeah, we've got results for one through seven now. Next, we have an example of a for loop in Python. This statement iterates through the list named list_of_numbers, storing the current item in the variable number each time. A for statement does have to end with a colon like this, but now we use indents to indicate what code lives within this for block. And here we have an example of an if-else clause. If the number is evenly divisible by two, we print that it's even. Otherwise, we print that it's odd.
And again we use indents to indicate what code lives within each if or else clause. Here we'll remove all indents to get out of the for loop and print "all done" at the end. Notice that we never had to declare any variable ahead of time, nor did we have to define their types. Python is what's called a dynamically typed language. It tries to infer the data type of your variables based on how you initially use them, but internally, they do have types. You can also explicitly cast variables to different types if you need to. But variables won't be recast automatically as they would in weakly typed languages. Sometimes this can lead to unexpected errors, and we'll see some of that as we go through the course. Let's move on to the next block, which just shows how to import external modules into your Python scripts. You just use the import command for this, and you can define an alias for the module if you'd like to save yourself some typing, too. So here we're importing the NumPy module so we can refer to it within our script, and we're importing it under the name np. This lets us then use NumPy's random functions by just typing np.random, and in this case, we're asking NumPy to give us 10 normally distributed random values with the given mean and standard deviation.
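The points from this lecture can be put together in a short sketch like the one below. It is a reconstruction from the description, not the notebook's exact code: the names list_of_numbers and number come from the lecture, while the printed wording, the casting example, and the mean of 25.0 and standard deviation of 5.0 are assumptions chosen for illustration.

```python
import numpy as np  # "np" is an alias that saves typing

# A list is written with square brackets; no line terminator is needed.
list_of_numbers = [1, 2, 3, 4, 5, 6]

# The colon ends the for statement; indentation marks what is inside the loop.
for number in list_of_numbers:
    if number % 2 == 0:
        print(number, "is even")
    else:
        print(number, "is odd")

# Back at the left margin, so this runs once, after the loop finishes.
print("All done.")

# Types are inferred, but they are not converted automatically:
# "7" + 1 would raise a TypeError, so the string is cast explicitly.
total = int("7") + 1

# Ten normally distributed random values. The mean (25.0) and standard
# deviation (5.0) are assumed values for illustration.
samples = np.random.normal(25.0, 5.0, 10)
print(samples)
```

Try changing the indentation of the final print so it sits inside the loop; it will then run once per number, which is a quick way to see how much the whitespace matters.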
6. Python Basics, Part 2: Next, let's discuss lists in more depth, since we use them a lot. If you need to know how many elements are in a list, you can use the built-in len function to get that, like so. Often, you also need to slice lists in certain ways to extract values in a given range within the list. The colon is used for this. In this first example, we use :3 to ask for the first three elements of the list, and similarly, we can use 3: to ask for everything after the third element, like so. We can also do something like -2: to ask for the last two elements of the list. And if you want to append a list to another list, that's what the extend function is for. Like this: this adds the list containing seven and eight to our original list. And if you want to append a single value to a list, you can use the append function, like so. Another cool thing about Python is that lists can contain just about any type you want. You could even make a list of lists, so let's do that right now. We'll make a new list called y, and make a new list of lists that contains our newly grown x list and this new y list. To retrieve an element of a list, just use the bracket operator like this. Here, we're getting back element one of the y list. This is zero-based, so y[1] actually gives you back the second element, not the first one. y[0] would give you the first element, which is the number 10 in this example. Lists also have a built-in sort function you can use to sort the list in place, like so. And if you want to sort in reverse order, you just pass reverse=True into the sort function. This is also a good time to mention that there are a couple of ways to pass parameters into functions. You could just pass in a list of values like you would in most languages, but you can also pass them in by name.
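The list operations described above can be sketched as follows. The names x and y and the values 7, 8, and 10 come from the lecture; the remaining values, and the list z, are assumptions made for illustration.

```python
x = [1, 2, 3, 4, 5, 6]

length = len(x)        # number of elements
first_three = x[:3]    # the first three elements
after_third = x[3:]    # everything after the third element
last_two = x[-2:]      # the last two elements

x.extend([7, 8])       # append another list
x.append(9)            # append a single value

y = [10, 11, 12]       # assumed values; only y[0] == 10 is stated in the lecture
list_of_lists = [x, y]
second_of_y = y[1]     # zero-based, so this is the second element

z = [3, 2, 1]
z.sort()               # sorts in place
ascending = list(z)    # copy the sorted result before re-sorting
z.sort(reverse=True)   # a parameter passed by name

print(length, first_three, after_third, last_two)
print(x, second_of_y, ascending, z)
```

Note that sort changes the list itself and returns nothing, which is why the sketch copies the ascending result before sorting again in reverse.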
Often, Python functions will have a lot of parameters that have default values assigned to them, and you just specify the ones you care about by name. Okay, let's keep going and talk about tuples next. Tuples are a lot like lists, but the main difference is that they are immutable. Once you create a tuple, you can't change it. They're handy for people doing functional programming, or for interfacing with systems like Apache Spark that are developed in functional programming languages. We'll do that later in the course. The only real difference in syntax is that you enclose tuples with parentheses instead of square brackets. So here's a tuple of the values one, two, and three. We can use len on it, just like we would with a list. You can reference elements in a tuple in the same way that you would in a list as well. Again, it's zero-based, so y[2] gives us back the third element of the tuple, not the second one. You can also make a list of tuples if you so desire. Another common use of tuples is in passing around groups of variables that you want to keep together. For example, the split function on a string will give you back a bunch of string values extracted from that string, and we can assign those values to elements in a tuple as a quick way of naming each one. Look at this example. We have two numbers separated by a comma, and we know the first value represents an age and the second an income value. We can extract them right into variables named age and income, like so. Moving on, another useful data structure in Python is the dictionary. In other languages, you might know this as a map or a hash table. It's basically a lookup table where you store values associated with some unique set of key values. It makes more sense with an example.
You declare a dictionary using curly brackets, so let's make a dictionary mapping starship names to the names of their captains. We'll call this dictionary captains. Now, to create an entry in the dictionary, we use square brackets to specify a key value that we're interested in assigning. So to assign the value Kirk to the key Enterprise, we can just say captains["Enterprise"] = "Kirk". We do the same for the other starships we know about. Then retrieving a dictionary element is done in the same way. Just use square brackets to get back the value you want. So to get the captain of the USS Voyager, we can do it like so. But what happens if you try to retrieve a value for a key that doesn't exist? You get an exception in that case. One way to avoid that is to use the get function on the dictionary. So you see, you can retrieve the captain of the Enterprise successfully, as it exists in the dictionary. But if we try to get a ship that isn't in our dictionary, it returns the special value None, which you can test for and deal with however you want. And if you know that the captain of the NX-01 is Jonathan Archer, you're my new best friend. You can iterate through all of the keys in a dictionary, just like you'd iterate through a list. Like this: for ship in captains will give you back each key in the captains dictionary and name it ship for you, so we can then print it out, like so.
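Here is a short sketch of tuples and dictionaries as described in this lecture. The age and income string and the contents of the tuple y are made-up values; the captains dictionary follows the lecture, with only the two ships needed for the example.

```python
# Tuples: like lists, but immutable and written with parentheses.
x = (1, 2, 3)
y = (4, 5, 6)                 # assumed values
list_of_tuples = [x, y]
third_of_y = y[2]             # zero-based, so this is the third element

try:
    x[0] = 99                 # tuples cannot be changed after creation
except TypeError as error:
    print("Tuples are immutable:", error)

# Unpack the pieces of a split string straight into named variables.
(age, income) = "32,120000".split(",")   # hypothetical age and income
print(age, income)            # both are still strings at this point

# Dictionaries: a lookup table from unique keys to values.
captains = {}
captains["Enterprise"] = "Kirk"
captains["Voyager"] = "Janeway"

print(captains["Voyager"])    # a missing key here would raise KeyError
missing = captains.get("NX-01")   # get returns None instead of raising
print(missing)

ships = []
for ship in captains:         # iterating a dictionary yields its keys
    ships.append(ship + ": " + captains[ship])
print(ships)
```

The get function also accepts a second argument to use as a default value in place of None.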
7. Python Basics, Part 3: OK, moving on to functions now. Fortunately, they work pretty much as you'd expect in Python. The syntax looks like this. You define a function with the def keyword, followed by the function name, followed by any parameters you want to pass in with their names. Be sure to end the function definition with a colon. After that, your code within the function needs to be indented. Python's all about the whitespace, remember? So here we have a super simple function called SquareIt that just takes in one value, calls it x, and returns its square. Calling the function works exactly as you would expect. Just type in SquareIt and then 2 in parentheses to get the square value of two, for example. Let's run it. There are a few funky things you can do with functions in Python. For example, you can actually pass in another function to a function as a parameter. This is something that allows Python to play nice with functional programming languages, which you encounter a lot in distributed computing. We can create a function like this called DoSomething that takes in some function f and some parameter x, and returns the result of f of x. So we can actually say DoSomething(SquareIt, 3) as a really convoluted way of getting back the square of three. But it works. By the way, did you notice that this block of code still knows that the SquareIt function exists, even though we defined it in a previous block? Jupyter notebooks run within a single kernel session, so anything you define or modify before the block you're running will still be in memory within that kernel. Sometimes this can lead to confusing behavior, especially if you're running blocks within a notebook out of order or repeatedly. If weird things start happening, you can always start from a clean slate by selecting Kernel, Restart from the menu.
Let's also give some attention to lambda functions. These allow you to pass around simple functions inline without even giving them a name. You see this pretty often in the world of data science. It's easiest to understand with an example. Here we're calling our DoSomething function with a lambda function that computes the cube of a value instead of its square. Take a close look at the syntax here. We say lambda, followed by the function parameter, then a colon, after which we can do what we want to that parameter. Whatever value is computed after the colon is implicitly the return value of the lambda function. So this says to create a lambda function that takes in something called x and returns x times x times x. We pass that lambda function into DoSomething with a value of three, which then executes our lambda function for the value three. And if we run this, you can see that we do in fact get back 27, which is three cubed.
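The function examples from this lecture can be sketched like this. The names square_it and do_something are illustrative spellings of the functions described above, not necessarily the exact identifiers in the course notebook.

```python
def square_it(x):
    """Return the square of x."""
    return x * x


def do_something(f, x):
    """Call the function f on x and return the result."""
    return f(x)


squared = square_it(2)
via_function = do_something(square_it, 3)      # a function passed as a parameter
cubed = do_something(lambda x: x * x * x, 3)   # an unnamed inline function

print(squared, via_function, cubed)  # prints: 4 9 27
```

The expression after the colon in a lambda is its return value, so no return keyword is needed there.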
8. Python Basics, Part 4: Okay, the other stuff is pretty straightforward. Here's what the syntax looks like for Boolean expressions. You can use a double equal sign to test whether two values are equal, and you can also use the word is, which checks whether two things are actually the same object instead of comparing their values. Unlike some other languages, Python has no triple equal sign. You'll see is used less commonly. True and False are a little weird in that they are capitalized in Python, but only the first letter, like so. And for operators like and, or, and not, you just use the words and, or, and not instead of special symbols. You can do something like a case or switch statement by using Boolean expressions with an if/elif/else structure like this. In this example, we print "How did that happen?" if one is three, which is always false. Otherwise, we move on to test whether one is greater than three, which should also be met with disbelief if it were true. When that test fails, we fall back to our final else clause, which prints "All is well with the world." Again, we use indents to associate code with specific clauses here. We touched on looping earlier; let's go into more depth here. You can use the range function to automatically build up a sequence of values within a given range, like so. Again, we start counting from zero here, so range(10) gives us back the values zero through nine. The continue keyword within a loop means to skip the rest of the iteration and go straight to the next one, and the break keyword means to stop the iteration early. In this example, we use continue to skip over the printing of each value if the value is one, and we use break to stop after we reach the value five. Study the output of 0, 2, 3, 4, 5 and you'll see that's exactly what it did. An alternative syntax is the while loop, which iterates until some Boolean expression is false. Here we set up a counter variable called x and loop through printing x only while x is less than 10. Once it hits 10, the loop ends, as you can see in its output.
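A compact sketch of the Boolean and looping syntax covered here follows. The two lists used to contrast == with is are an assumption added for illustration; the rest mirrors the examples described in the lecture.

```python
# == compares values; "is" asks whether two names refer to the same object.
a = [1, 2]
b = [1, 2]
same_value = (a == b)     # True: the contents match
same_object = (a is b)    # False: they are two separate lists
print(same_value, same_object, True or False, not False)

# if / elif / else in place of a case or switch statement.
if 1 == 3:
    message = "How did that happen?"
elif 1 > 3:
    message = "Yikes"
else:
    message = "All is well with the world"
print(message)

# range(10) yields 0 through 9; continue skips ahead, break stops early.
printed = []
for x in range(10):
    if x == 1:
        continue
    if x > 5:
        break
    printed.append(x)
print(printed)            # prints: [0, 2, 3, 4, 5]

# A while loop runs until its condition becomes false.
counted = []
x = 0
while x < 10:
    counted.append(x)
    x += 1
print(counted)
```

Comparing numbers with is can give surprising results, so == is the usual choice for values, with is reserved for checks such as "is None".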
So try out a really simple activity that pieces together some of the stuff we've just talked about. Your challenge is to write a block of code here that creates a list of integers, loops through each element of the list, and only prints out even numbers in the list. This should just be a matter of rearranging some of the other code in this notebook. So even if you're not comfortable with Python, I encourage you to give it a try. You'll feel a lot better with Python if you've written something yourself with it, no matter how small. Okay, so hopefully you didn't have much trouble with that. Here's my solution, and keep in mind there's more than one way to do it. Let's type in my list equals, in square brackets, 0, 1, 2, 5, 8, 3, and you can use any numbers that you want. Then we can say for number in my list, colon. Of course, you might choose different variable names. If number mod 2 is zero, colon, print number. And as you can see, we do get the expected output of just the even numbers here. Okay, so that's everything that's peculiar about Python, for the most part. If you're coming from some other language, that should give you enough knowledge to understand the Python code I'll be showing you in this course and to work with that code yourself. So that's your Python crash course. That's obviously just some very basic stuff there. As we go through more and more examples throughout the course, it'll make more and more sense, since you'll have more examples to look at. But if you do feel a little bit intimidated at this point, maybe you're a little bit too new to programming or scripting, and it might be a good idea to go and take a Python course before moving forward. But if you feel pretty good about what you've seen so far, let's move ahead and we'll keep on going.
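Written out, the solution dictated above might look like the sketch below. The variable name my_list is one possible spelling, and the list comprehension at the end is an added alternative, not part of the lecture.

```python
my_list = [0, 1, 2, 5, 8, 3]

evens = []
for number in my_list:
    if number % 2 == 0:      # "number mod 2 is zero" means the number is even
        print(number)
        evens.append(number)

# One of the other ways to do it: a list comprehension.
evens_again = [n for n in my_list if n % 2 == 0]
print(evens_again)           # prints: [0, 2, 8]
```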
9. Intro to Pandas: So now that you have a brief introduction to the Python programming language, let's talk a little bit about the pandas library. That's a library of Python functions that you'll be using a lot as a data scientist. So follow along with me here. Go ahead and find the pandas tutorial .ipynb file that should be inside your course materials. You should be seeing something like this at this point. What is pandas? Well, it's basically a way of processing tabular data. So when you have columns and rows of information, like you often do in data science, pandas is a very easy way of loading that data in, manipulating it, examining your data, cleaning it up, and things like that. And it works together with two other libraries that you're going to be using a lot in the field of data science and machine learning. So when we talk about actual machine learning algorithms, we'll be using a Python library called scikit-learn, or sklearn for short. That's what has all the actual code for doing things like linear regressions or SVMs, all the stuff we're going to talk about later on, and that usually takes as an input a NumPy array. So NumPy is another library in the mix here that has its own representation of arrays of data, and these can be multidimensional arrays of data, too. So it's sort of a way of representing information. So the way it usually goes is you might use pandas to load in your data and manipulate it and clean it up and understand it, and then translate that into a NumPy array that then gets fed into scikit-learn. And that translation often happens automatically, by the way; you don't have to do anything special. So what's more important at this stage is understanding pandas, right? Because actually feeding that into scikit-learn is pretty trivial. So let's talk about pandas. Let's scroll down here a little bit and play with some data, shall we? So we'll start off by importing what we need.
So we're going to say that we want to use the matplotlib library inline. This just means that any graphs that we create as part of our notebook will appear within the notebook itself and not within an external window. Then we need to specifically import the libraries that we want to use in our Python code. So we're going to import the NumPy library as np. That means that we can refer to NumPy with the shorthand np within our script now. And we will also import the pandas library as pd. So this means that we've basically created an alias for the pandas library of pd, just to save us some keystrokes. So let's go ahead and use pandas for the first time. What I say here is df = pd.read_csv("PastHires.csv"). So what's going on here? This is going to load up the PastHires.csv file. That's a comma-separated values file. That just means that it's tabular information where each column is separated by a comma. So it's a very simple, text-based format, and the first row usually corresponds to the titles of those columns. So with one line of code, we can read that data in from disk and create what's called a DataFrame out of it, a pandas DataFrame, and we're going to assign that DataFrame to a variable called df. So this loads in PastHires.csv and converts it into a pandas DataFrame. And then we can call head on that DataFrame object to visualize the first five rows of that DataFrame, and this is what it looks like. So let's click in here and hit Shift+Enter to run it. And you can see here, this is a little preview of the file. So if you just want to double check that everything loaded in correctly and understand what's in it, this is a good way to do a little spot check, with head. You can see here that we have the first five rows being displayed, and our columns are titled properly: Years Experience, Employed, Previous employers, Level of Education, Top-tier school, Interned, and Hired.
We're going to be using this data set later on in the course, to actually see if we can predict whether a job applicant gets hired or not based on their past history. Okay, so everything looks reasonably good there. You can also pass an integer into head if you want to see some specific number of rows from the beginning of your file. So if I want to see the first 10 rows of my DataFrame, I could just say df.head(10) like that, and there are the first 10 rows; you can see a little bit of a larger sampling of data there. And you can also look at the end of your data file as well, or your DataFrame as well. So df.tail(4) will look like that, and that's displaying the last four rows in our DataFrame. You can see this is a very small data set. It's just something I made up. It only contains 13 rows of information. Now, sometimes we'll talk about the shape of your DataFrame or the shape of your data. And what we mean when we talk about the shape is just the dimensionality of it. So, for example, if we say df.shape, that will come back with (13, 7), and what that means is that we have 13 rows and seven columns in our DataFrame, and that's the shape of our DataFrame: just how many columns it has and how many rows it has. It's just a fancy word for a very simple concept. We can also say df.size. That comes back as 91, which is just the number of cells in our DataFrame, basically the number of individual data points. And that's just going to be 13 times seven in our example: 13 rows times seven columns is 91. There's also the len function. You can call len(df), and that comes back with 13. That just gives you back the number of rows in your DataFrame, if you just need that. And if you do df.columns, what that gives you back is an array-like index of the actual column names. So if you want a little quick reminder of what your column names are and what they mean, that's a good way to get a quick look at what your columns are.
And sometimes you need to remind yourself, so that's a handy little trick. Now let's do some manipulation of this DataFrame. Let's say we want to extract just a single column from that DataFrame. Let's say we just want to extract the Hired column and do something specifically with that. Often when you're loading in data, you're not interested in every single feature that's in it. You want to extract certain features that you care about for the model that you're building. This is how that would work. So if I say df['Hired'], that will extract just that column, the Hired column, by itself. So we now have a new object that consists of just that single column, the Hired column, and I could assign that to another variable if I wanted to turn around and do something else with it. So I could say, you know, hired column equals df['Hired'] or something like that. You can also extract a given range of rows within a column like this. So if I do df['Hired'] and then an additional bracket with :5, that will extract the first five rows of the Hired column, and I get back a new object that looks like that: just five rows of the Hired column and nothing else. So you can see how pandas could be used to sort of extract the data that you care about when you're trying to preprocess your data. You can also extract a single value like that. If I just say [5] at the end, that's explicitly plucking out the Hired column at row index 5, which happens to be the value Y. Okay, so let's talk a little bit about terminology here. We have a DataFrame that's basically a multidimensional object, in this case a 13 by seven object. And then when we extract a single row or a single column like this, that's going to give us back what's called a Series. Okay, so that Series is basically a one-dimensional array.
And if we extract a single value, that's usually referred to as, well, a value. So, a little bit of terminology there. You can also extract more than one column if you want to. Obviously, that's going to be a more common situation. You would do that like this. So instead of passing in just a single quoted column name, you can pass in a list of column names instead, like this. So we're going to say df, bracket, and then we're going to have another layer of brackets inside those brackets that represents the list of column names that you want. So I say df[['Years Experience', 'Hired']], and that will give us back this new DataFrame here that consists of the Years Experience and Hired columns and nothing else. And obviously you can add more columns to that list if you want to. So again, it's a very common operation to extract only the features or the columns that you actually care about for a specific task. The less data that you push around, the better. So that's usually the first thing you want to do: get rid of the stuff you don't care about. Okay, you can also extract ranges of rows from more than one column in the same way. So I could say I want just the first five rows of the Years Experience and Hired columns, like this. Nothing too surprising in the syntax there. If you want to sort your results, sort your DataFrame, you can do it like this. There's the sort_values function you can call on a DataFrame. Just pass in a list of the columns that you want to sort by, and we'll say Years Experience; we want to sort by years of experience. You can see it did, in fact, do that sort from lowest to highest, zero up to 20 years of experience in our little fabricated DataFrame here. What else can we do? We can also do value counts. That's a way of breaking down how many of each unique value exists, which can be a useful way to visualize your data and kind of look for weird values that might be outliers.
All I need to do is say df, and then the name of the column that you want to count on, dot value_counts(), and that will give you a count of each unique value within it. So to make it real, let's say we want to create a value counts Series out of the Level of Education column in our DataFrame, and then we'll go ahead and print that out by just saying degree_counts, and we get back this result, which indicates that in our entire DataFrame there are seven BS degrees, four PhD degrees, and two MS degrees. Okay, so the tempting thing to do here is to create a histogram, right? So if you want to plot that distribution, that's very easy in pandas as well. We can just say degree_counts.plot(kind='bar'); that's saying we want a bar plot of those degree counts. And since we used matplotlib inline way at the top, that just goes ahead and displays within our IPython notebook, so it's very easy to actually make graphs using pandas as well. All right, if you want to practice this for yourself, I have a challenge for you. Get your hands dirty here and get a little bit of hands-on experience. Try extracting rows five through 10 of that source DataFrame of the potential hires that we have, and I want you to preserve only the Previous employers and the Hired columns. Assign that to a new DataFrame object and then create a histogram like we just did here, plotting the distribution of the number of previous employers within just that subset of the data. Okay, so that should allow you to sort of put together all the stuff we talked about here. There's a lot more to pandas than this, obviously, but these are the most common operations that you need to deal with, and pretty much everything you need to know to get through this course and understand what's going on. So have a crack at that exercise. I think it'll be good practice for you, and with that under your belt, we can move on to some actual data science.
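The pandas operations walked through in this lecture can be tried with the sketch below. It assumes pandas is installed. Because the course's PastHires.csv file is not reproduced here, the sketch reads a small made-up table from an in-memory string; the six rows are invented for illustration and only four of the seven columns are included.

```python
import io

import pandas as pd

# Made-up stand-in for PastHires.csv (hypothetical rows, subset of columns).
csv_text = """Years Experience,Previous employers,Level of Education,Hired
10,4,BS,Y
0,0,BS,Y
7,6,BS,N
2,1,MS,Y
20,2,PhD,N
0,0,PhD,Y
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(3))          # first three rows; head() alone shows five
print(df.tail(2))          # last two rows
print(df.shape)            # (rows, columns)
print(df.size)             # rows times columns
print(len(df))             # number of rows
print(list(df.columns))    # column names

hired = df["Hired"]                        # one column comes back as a Series
first_three = df["Hired"][:3]              # a range of rows within that column
one_value = df["Hired"][5]                 # a single value, at row label 5
subset = df[["Years Experience", "Hired"]]     # a list of columns gives a DataFrame
sorted_df = df.sort_values(["Years Experience"])
degree_counts = df["Level of Education"].value_counts()
print(degree_counts)

# With matplotlib installed, degree_counts.plot(kind="bar") would draw the bar chart.
```

The same slicing works on several columns at once, for example df[["Years Experience", "Hired"]][:5] for the first five rows of both columns.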
10. Types of Data: All right. If you want to be a data scientist, we need to talk about the types of data that you might encounter, how to categorize them, and how you might treat them differently. So let's dive into the different flavors of data you might encounter. All right, let's talk about different kinds of data that you might encounter. Pretty basic stuff here, but, you know, I've got to start with the simple stuff, and we'll work our way up to the more complicated data mining and machine learning things. But it is important to know what kind of data you're dealing with, because different techniques might have different nuances depending on what kind of data you're handling. So there are several flavors of data, if you will, and I like ice cream, which is the main reason this slide is up. But moving on, there are numerical, categorical, and ordinal data, and again, there are different variations of techniques that you might use for different types of data. So you always need to keep in mind what kind of data you're dealing with when you're analyzing it. Let's start with numerical. It's probably the most common data type. It basically represents some quantifiable thing that you can measure. Some examples I have here: heights of people, page load times, stock prices. You know, things that vary, things that you can measure, things that have a big range of possibilities. Now, there are basically two kinds of numerical data, so a flavor of a flavor, if you will. There's discrete data, which is integer-based. For example, that could be counts of some sort of event. Some examples here: how many purchases did a customer make in a year? Well, that could only be discrete values. They bought one thing, or they bought two things, or they bought three things. They couldn't have bought, you know, 2.25 things or three and three-quarters things. It's a discrete value that has an integer restriction to it. The other type of numerical data is continuous data. This is stuff that has an infinite range of possibilities, where you can go into fractions.
So, for example, going back to the height of people, there is an infinite number of possible heights for people. You could be five feet and 10.37625 inches tall or whatever. Or the time it takes to do something like check out on a website could be any of a huge range of possibilities, you know, 10.7625 seconds for all you know. Or how much rain fell on a given day. Again, there's an infinite amount of precision there, so that's an example of continuous data. So, to recap, numerical data is something you can measure quantitatively with a number, and it can be either discrete, where it's integer-based, like some sort of event count, or continuous, where there's infinite precision available to that data. The second type of data that we're going to talk about is categorical data, and this is data that has no inherent numeric meaning. You can't really compare one category to another directly. Things like gender, yes/no questions, race, state of residence, product category, political party. You can assign numbers to these categories, and often you will, but those numbers have no inherent meaning. So, for example, I can say that the area of Texas is greater than the area of Florida, but I can't just say Texas is greater than Florida. I mean, they're just categories. There's no real numerical, quantifiable meaning to them. It's just ways that we categorize different things. Now again, I might have some sort of numerical assignment for each state. I mean, I could say that Florida is state number three and Texas is state number four, but there's no real relationship between three and four there, right? It's just sort of a shorthand to more compactly represent these categories. So again, categorical data does not have any intrinsic numerical meaning.
It's just a way that you're choosing to split up a set of data based on categories. And the last category that you tend to hear about with types of data is ordinal data, and it's sort of a mixture of numerical and categorical data. A common example is star ratings for a movie or music or what have you. So in this case, we have categorical data in that it could be one through five stars, you know, where one might represent poor and five might represent excellent, but they do have mathematical meaning. We do know that five means it's better than a one. So this is a case where we have data where the different categories actually have a numerical relationship to each other. So I can say that one star is less than five stars. I can say that two stars is less than three stars, and say that four stars is greater than two stars in terms of a measure of quality. So it's sort of in the middle there. You could also think of this, if you just thought of it as the actual number of stars, as discrete numerical data. So there's definitely a fine line there, and in a lot of cases, you can actually treat it that way. So there you have it, the three different types: there's numerical, categorical, and ordinal data. Let's see if it's sunk in. Don't worry, I'm not going to make you pass in your work or anything. Quick quiz. So for each of these examples, is the data numerical, categorical, or ordinal? Let's start with how much gas is in your gas tank. What do you think? Well, the right answer is numerical. It's a continuous numerical value there, because you can have an infinite range of possibilities of gas in your tank. I mean, yeah, there's probably some upper bound on how much gas you can fit in it, but there is no end to the number of possible values of how much gas you have. It could be three-quarters of a tank. It could be seven-sixteenths of a tank. It could be one over pi of a tank. I mean, who knows, right?
How about if you're rating your overall health on a scale of one to four, where those choices correspond to the categories poor, moderate, good, and excellent? What do you think? Well, that's a good example of ordinal data. That's very much like our movie rating data. And again, depending on how you model that, you could probably treat it as discrete numerical data as well. But technically, we're going to call that ordinal data. How about the races of your classmates? You know, what nationality are they? Pretty clear example of categorical data. You know, you can't really compare yellow people to green people, right? They're just yellow and green, and I just picked two random races that don't exist. But, you know, they are categories that you might want to study and understand the differences between on some other dimension. How about the ages of your classmates here in this class, in years? A little bit of a trick question there. If I said it had to be an integer value of years, like, you know, 40, 50, or 55 years old, then that would be discrete numerical data. But if I had more precision, like, you know, 40 years, three months, and 2.67 days, then that would be continuous numerical data. But either way, it's a numerical data type. And finally, money spent in a store. Again, that could be an example of continuous numerical data. So again, this is only important because you might apply different techniques to different types of data. So there might be some concepts where we do one type of implementation for categorical data and a different type of implementation for numerical data, for example. So that's all you need to know about the different types of data. Pretty simple concepts there. You've got numerical, categorical, and ordinal data, and numerical data can be continuous or discrete, and there might be different techniques you apply depending on what kind of data you're dealing with. So we'll see that throughout the course going forward. Let's move on.
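As a rough illustration of the distinctions in this lecture, the sketch below represents each type of data in plain Python. The variable names and the helper is_better are hypothetical, and the sample values are either taken from the lecture's examples or made up.

```python
# Numerical, discrete: counts that can only be whole numbers.
purchases_per_year = [1, 2, 3]

# Numerical, continuous: measurements with arbitrary precision.
heights_in_inches = [70.37625, 64.5]    # 70.37625 is five feet, 10.37625 inches

# Categorical: the codes are only labels, so comparing 3 and 4 means nothing.
state_codes = {"Florida": 3, "Texas": 4}

# Ordinal: categories whose codes do carry an order.
health_scale = {"poor": 1, "moderate": 2, "good": 3, "excellent": 4}


def is_better(first, second):
    """Return True if the first rating ranks above the second on the ordinal scale."""
    return health_scale[first] > health_scale[second]


all_discrete = all(isinstance(n, int) for n in purchases_per_year)
ranked = sorted(health_scale, key=health_scale.get)

print(all_discrete)
print(is_better("good", "poor"))
print(ranked)
```

Sorting by the ordinal codes gives a meaningful ranking, whereas sorting states by their codes would only reflect how the labels happened to be assigned.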
11. Mean, Median, Mode: Let's do a little refresher of Statistics 101. I mean, this is like elementary school stuff, but it's good to go through it again and kind of appreciate the differences and how these different techniques are used. Mean, median, and mode: I'm sure you've heard those terms before, but it's good to see how they're used differently. So let's dive in. Let's move on to mean, median, and mode. That should be a review for most of you. I think they teach this stuff in elementary school these days. But just a quick refresher as we start to actually dive into some real statistics. Let's just look at some actual data and figure out how to measure these things. So the mean, as you probably know, is just another name for the average. To calculate the mean of a data set, all you've got to do is sum up all of the values and divide by the number of values that you have. So let's take this example here. Let's say I went door to door in my neighborhood and asked everyone how many kids they have. How many children live in their household? That, by the way, is a good example of discrete numerical data, right? Remember from the previous lecture. So let's say I go around and I find out that the first house has no kids in it, and the second house has two children, and the third household has three children, and two, and one, and so on and so forth. So I amass this little data set of discrete numerical data. And to figure out the mean, all I do is add them all together and divide by the number of houses that I went to. So in that case, it comes out to zero plus two plus three, plus all the rest of these things, divided by the number of houses that I looked at, which is nine. And the mean number of children per house in my sample is 1.11. So there you have it: mean. Now, the median is a little bit different. The way you compute the median of a data set is by sorting all the values and taking the one that ends up in the middle.
So, for example, if that was my original data set, I could sort it numerically, and I can take the number that's smack dab in the middle of the data, which turns out to be one. So again, all I do is take the data, sort it numerically, and take the center point. That's all there is to the median. Now, there is one nuance: if you have an even versus an odd number of data points, then the median might actually fall in between two data points, right? Like, what if I had an even number of data points there? Then it wouldn't be clear which one is actually the middle. In that case, all you do is take the average of the two that fall in the middle. So if you have an even number, you just average the middle two. Now, in this example, the median and the mean were pretty close to each other because there weren't a lot of outliers. We had zero, one, two, or three kids, but we didn't have some wacky family that had 100 kids. That would have really skewed the mean, but it might not have changed the median too much, right? And that's why the median is often a very useful thing to look at, and often overlooked. People have a tendency to mislead people with statistics sometimes, and I'm going to keep pointing this out throughout the course wherever I can. For example, you can talk about the mean or average household income in the United States, and that actual number from last year, when I looked it up, was 72,000-some-odd dollars. But that doesn't really give an accurate picture of what the typical American makes, because if you look at the median income, it's much lower: it's $51,939. Why is that? Well, because of income inequality. There are a few very rich people in America, and the same is true in a lot of countries as well. America is not even the worst. But those billionaires, those super rich people that live on Wall Street or in Silicon Valley or wherever, skew the mean, right? But there are so few of them that they don't really affect the median so much.
So this is a great example of where the median tells a much better story about the typical person or data point than the mean does. So whenever someone talks about the mean, you have to think: what does the data distribution look like? Are there outliers that might be skewing that mean? And if the answer is potentially yes, you should also ask for the median, because often that provides more insight than the mean or the average. Finally, we'll talk about mode. This doesn't really come up too often in practice, but you can't talk about mean and median without talking about mode. I don't know why. Because it starts with M, I guess. But all it is is the most common value in the data set. So going back to my example of the number of kids in each house, if I just look at what number occurs most frequently, it turns out to be zero. And the mode of this data, therefore, is zero. The most common number of children in a given house in this neighborhood is no kids, and that's all that means. Now, this is actually a pretty good example of continuous versus discrete data, right? Because this only really works with discrete data. If I have a continuous range of data, then I can't really talk about the most common value that occurs unless I quantize that somehow into discrete values. OK, so we've already run into one example here where the data type matters. Mode is usually only relevant to discrete numerical data, and if you have continuous data, not so much. A lot of real-world data tends to be continuous, so maybe that's why you don't hear too much about mode. But it's here for completeness. There you have it: mean, median, and mode in a nutshell, and we can move on. All right. Mean, median, and mode: kind of the most basic statistics stuff you can possibly do. But I hope you gained a little refresher there on the importance of choosing between median and mean.
They can tell very different stories, and yet people tend to equate them in their heads. So make sure you're being a responsible data scientist and representing data in a way that conveys the meaning you're trying to represent. If you're trying to display a typical value, often the median is a better choice than the mean because of outliers. So remember that. Let's move on.
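To make the door-to-door example concrete, here is a minimal sketch using Python's built-in statistics module. One assumption to flag: the lecture only names the first few houses, so the nine values below are a made-up completion chosen to agree with the results quoted above (mean 1.11, median 1, mode 0). The 100-kid household is also hypothetical, added just to show the outlier effect and the even-count median rule.

```python
import statistics

# ASSUMPTION: the lecture lists only "0, 2, 3, 2, 1, and so on" for nine houses.
# These nine values are an illustrative completion that matches the stated results.
kids_per_house = [0, 2, 3, 2, 1, 0, 0, 2, 0]

mean_kids = statistics.mean(kids_per_house)      # sum of the values / number of values
median_kids = statistics.median(kids_per_house)  # middle value after sorting
mode_kids = statistics.mode(kids_per_house)      # most frequently occurring value

print(round(mean_kids, 2), median_kids, mode_kids)

# Hypothetical outlier: one "wacky family" with 100 kids.
# There are now ten values (an even count), so the median is the
# average of the two middle values after sorting.
with_outlier = kids_per_house + [100]
mean_with_outlier = statistics.mean(with_outlier)
median_with_outlier = statistics.median(with_outlier)

print(mean_with_outlier, median_with_outlier)
```

With these assumed values, the single outlier pulls the mean from about 1.11 up to 11, while the median only moves from 1 to 1.5.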
12. Using mean, median, and mode in Python: Let's start doing some real coding in Python and see how you compute the mean, median, and mode using Python in an IPython notebook file. Okay, so open up the mean median mode notebook from your course materials over here in your Jupyter notebook page. And again, if you do need a refresher on how to do that, head back to the setup lecture and I'll show you how to do that again. But once you have mean median mode open, you can play around with it. So let's see what's going on here. What we're going to start off doing is creating a fake data set of income distribution. We're going to model 10,000 people and how much money they make each year. To do this, we're going to use the NumPy package. So, like we talked about in importing packages in our introduction to Python, we'll start off by saying import numpy as np. This allows us to refer to the NumPy package as just np, which saves us some typing, in addition to importing that package so we can use it. Now, the NumPy package includes a function called random.normal, and what this does is create a random distribution. Basically, it creates a bell curve distribution of data around a certain point, in this case $27,000, with a standard deviation of $15,000. And we want 10,000 data points in this data set. Now, if you're not familiar with standard deviation and normal distributions, we will talk about that in more detail later on. But I think you'll start to make sense of it as you go through this exercise. So once we've created that data set of 10,000 people that have a distribution centered around $27,000, we can then call np.mean to use the NumPy package to compute the mean or average of that data set. And since we specified that it should be centered around $27,000, we would expect that to be about $27,000.
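As a rough sketch of what that first notebook block does, based only on the parameters described above (the notebook itself isn't reproduced in this transcript), something like the following should work. The fixed random seed is my own addition so that reruns give the same numbers; the lecture does not seed the generator.

```python
import numpy as np

# ASSUMPTION: a fixed seed is used here only so the sketch is repeatable.
np.random.seed(0)

# 10,000 fake incomes drawn from a bell curve centered on 27,000
# with a standard deviation of 15,000.
incomes = np.random.normal(27000, 15000, 10000)

# The mean should land close to the 27,000 center we asked for.
mean_income = np.mean(incomes)
print(mean_income)
```

Without the seed, the printed value changes a little on every run, but it should stay near 27,000.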
So let's click into this block of code and hit Shift+Enter to run it, and sure enough, it is about $27,000. Now, because there is a random component to this, your result may be slightly different. That's expected. That's OK, but it should be close to 27,000. We can actually plot this to get a more intuitive feel for how this data is distributed. To do that, we're going to use a package called matplotlib, which allows us to create really simple graphs here inline and actually display these graphs within the notebook file itself. We need to say %matplotlib inline as the first line here. If you ever run into a problem in the future with this course in the notebooks where your graphs aren't displaying, it's probably because you forgot to run a block of code that contained %matplotlib inline, which is required to actually see these graphs. With that out of the way, though, we can import the actual matplotlib package itself, specifically the pyplot part of it, and again we'll refer to that by an alias of plt just to save us some typing. So now that we have matplotlib's pyplot imported, we can just say plt.hist to create a histogram of our incomes data set, and we're going to pass in the number 50, meaning that we want this split up into 50 different buckets. So we're going to quantize our data set into 50 discrete buckets of data, if you will, and then we can call plt.show to actually show it. Let's go ahead and run that as well: Shift+Enter within that block, and you should see something that looks a little bit like that. So there's that bell curve that I promised you, right? And you can see that it is centered around $27,000 or so. If you're not familiar with histograms, the way to interpret this data is that a lot of people are making around $27,000 a year, and very few people are making between $60,000 and $80,000 per year. Okay, so that's the fictitious data that we made up. So we've seen that the mean is about $27,000.
That makes sense. That's what we would expect. What's the median? Again, the median is just: if we were to sort all of this data, what would be the value in the middle of it all? And since we do have a nice, even bell curve distribution here, the median should be about the same as the mean. Let's go ahead and click in block three here and run that with Shift+Enter, and sure enough, that's also about $27,000. So for an evenly distributed data set like this, the median and the mean will be about the same. However, not every data set is evenly distributed. Let's see what happens if we add Jeff Bezos into the mix. And let's just say that he made $1,000,000,000 last year. It's probably a little bit on the high side, even for Jeff Bezos, but just for the sake of argument. We're going to call np.append to append one extra value to the incomes list, and it will contain a single value, one billion. So we'll have a new incomes list here that contains our normally distributed data plus Jeff Bezos there to mess things up. Now remember, the median just represents the middle value if we were to sort them all. And we've only added one more data point here, so that shouldn't change a whole lot, right? So let's run the median again, and we're still getting a value close to $27,000. So Jeff Bezos did not mess up the median of our data set. However, if we compute the mean, it's going to be very different, right? That's up to almost $127,000. So this is a great story about how an outlier in a data set can really mess up the mean or the average value of that data set. So when people talk about averages or means, take that information with a grain of salt. Ask yourself: could there be outliers that are skewing that data? Income distribution is a great example of this. In that case, the median is going to tell you a better story about what's really happening in the larger population.
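Here is a small sketch of that outlier experiment, assuming the same fake income data described earlier. The seed and the variable names are mine, not the notebook's.

```python
import numpy as np

# ASSUMPTION: seed and variable names are illustrative, not taken from the notebook.
np.random.seed(0)
incomes = np.random.normal(27000, 15000, 10000)

median_before = np.median(incomes)
mean_before = np.mean(incomes)

# Append a single billion-dollar income to the 10,000 existing data points.
incomes_with_outlier = np.append(incomes, [1000000000])

median_after = np.median(incomes_with_outlier)
mean_after = np.mean(incomes_with_outlier)

print("median:", median_before, "->", median_after)  # barely moves
print("mean:  ", mean_before, "->", mean_after)      # jumps by roughly 100,000
```

The jump in the mean is just arithmetic: one billion spread across 10,001 people adds roughly 100,000 to the average, which is where the figure of almost $127,000 comes from.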
So, you know, statistics lesson number one, or how to watch out for people lying with statistics: make sure you understand the difference between median and mean, and if there are outliers involved, the median is probably going to give you more useful information. We'll also touch on mode, just because it starts with M and people talk about it together with mean and median for some reason. So let's go ahead and create another fake data set here. This will be evenly distributed. We'll have a bunch of fake ages for 500 people. We're going to call np.random.randint to get an even distribution of ages between 18 and 90 years old for 500 people. And then we'll just type in ages, which allows us to visualize that array inline here. Shift+Enter within that, and this is random, so we're going to get different results every time. Now, to compute the mode: again, that's just the value that appears most often in this data set. So let's go ahead and do that. To do that, we're going to use the SciPy package and its stats module, and this is just another way of importing it. We're saying from scipy import stats. In this case, we're not going to use the as clause, because I can type stats; that's not too hard. And then we will call stats.mode on the ages array to get back our mode result. After that chugs away for a little bit, since it has to load up that package, we get our answer. It turns out that in this instance the mode is 28, which occurred 14 times in this data set. And this is completely random, so every time you run this, you will get a different answer. Let's go back up to block seven here again and hit Shift+Enter again to get a fresh data set. And if we run this block again, you should get a different answer: this time the mode is 20, which occurred 15 times. So that just shows you that the mode works.
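A hedged sketch of the mode example follows. The ages array is built as described above; the seed is my own addition. One caveat worth labeling: the shape of the result returned by stats.mode has differed between SciPy releases (arrays in older versions, scalars in newer ones), so this sketch wraps the fields in np.atleast_1d to read them the same way in either case.

```python
import numpy as np
from scipy import stats

# ASSUMPTION: fixed seed for repeatability; the lecture reruns with fresh random data.
np.random.seed(0)

# 500 fake ages, evenly distributed over the integers 18 through 89.
ages = np.random.randint(18, high=90, size=500)

# stats.mode returns the most common value and how many times it occurred.
result = stats.mode(ages)

# np.atleast_1d papers over the array-versus-scalar difference between SciPy versions.
mode_age = int(np.atleast_1d(result.mode)[0])
mode_count = int(np.atleast_1d(result.count)[0])

print("mode:", mode_age, "occurred", mode_count, "times")
```

Because the data is random, the specific mode means nothing here; the point is only the mechanics of the call.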
But this is random data, so it's not terribly meaningful; it just illustrates how you would do this in SciPy. All right, so there you have it: mean, median, and mode. Let's move on to an exercise to let you practice with it. I'm going to give you a little assignment here. If you open up the mean median exercise IPython notebook, there's some stuff you can play with, so I want you to roll up your sleeves and actually try to do this. Here we have some random e-commerce data. What this data represents is the total amount spent per transaction. And again, just like with our previous example, it's just a normal distribution of data, just like our income example. You can run that, and your homework is to go ahead and find the mean and median of this data using the NumPy package. Pretty much the easiest assignment you could possibly imagine. All the techniques you need are in the mean median mode IPython notebook. My point here is not really to challenge you; it's just to make you actually write some Python code and convince yourself that you can actually get a result and make something happen here. So go ahead and play with that. If you have any trouble, post in the discussions for this lecture and we'll help you out. But it should be pretty trivial. If you want to play with it some more, feel free to play around with the data distribution here and see what effect you can have on the numbers. You know, add some outliers, kind of like what we did with the income data. So mess around; this is the way you learn this stuff. Have at it, have fun. All right. I hope you rolled up your sleeves and actually played around with that code a little bit, to get some confidence in actually doing statistics in an IPython notebook here, and in Python in general. So with that behind us, let's move forward to our next concept: standard deviation and variance.
13. Variation and Standard Deviation: Let's talk about standard deviation and variance, concepts and terms you've probably heard before. But let's go into a little bit more depth about what they really mean and how you compute them. It's a measure of the spread of a data distribution, and that will make a little bit more sense in a few minutes. Let's talk about standard deviation and variance, two fundamental quantities for a data distribution that you'll see over and over again in this course. So let's see what they are, if you need a refresher. Again, let's look at a histogram, because variance and standard deviation are all about the spread of the data, the shape of the distribution of a data set. So let's take a look at this fake data. Let's say we have some data on the arrival frequency of airplanes at an airport, for example, and this histogram would indicate, however we choose to quantize that data, that we have around four arrivals per minute, and that happened on, say, around 12 days that we looked at for this data. So 12 different discrete data points at four arrivals per minute-ish. But then we have these outliers. We had one really slow day that only had one arrival per minute, and we only had one really fast day where we had almost 12 arrivals per minute. So again, the way to read a histogram is: look up the bucket of a given value, and that tells you how frequently that value occurred in your data. And the shape of the histogram can tell you a lot about the probability distribution of a given set of data. So we know from this data that it's very likely to have around four arrivals per minute, but it's very unlikely to have one or 12, right? And we can also talk specifically about the probabilities of all the numbers in between. So not only is it unlikely to have 12 arrivals per minute, it's also very unlikely to have nine arrivals per minute. And once we start getting around, you know, eight or so, things start to pick up a little bit.
So there's a lot of information we can get from a histogram, and the variance just speaks to how spread out that data is, what the shape of that data is. How spread out is your data set? How do you measure variance? Well, we usually refer to it as sigma squared, and you'll find out why momentarily. But for now, just know that variance is just the average of the squared differences from the mean. So to compute the variance of a data set, you first figure out the mean of it. Let's say I have some data. It could represent anything. Let's say the maximum number of people that were standing in line for a given hour or something, I don't know. In the first hour I observed one person standing in line, then four, then five, then four, then eight. Okay, so the first step in computing the variance is just to find the mean, the average, of that data. I add them all together, divide by the number of data points, and that comes out to 4.4 as the average number of people standing in line. Now the next step is to find the differences from the mean for each data point. I know the mean is 4.4. So for my first data point, I have 1, so 1 minus 4.4 is negative 3.4; 4 minus 4.4 is negative 0.4, and so on and so forth. Okay, so I end up with these both positive and negative numbers that represent the difference from the mean for each data point. But what I want is a single number that represents the variance of this entire data set. So the next thing I'm going to do is find the squared differences. We just go through each one of those raw differences from the mean and square them. This is for a couple of different reasons. First of all, I want to make sure that negative differences count just as much as positive differences, right? Otherwise, they would cancel each other out, and that would be bad. I also want to give more weight to the outliers.
So this amplifies the effect of things that are very different from the mean, while still making sure that the negatives and positives are treated comparably. So let's look at what happens there. Negative 3.4 squared is a positive 11.56. Negative 0.4 squared is a much smaller number, 0.16, because that's much closer to the mean of 4.4. And 0.6 is also close to the mean, so that's only 0.36. But as we get up to the positive outlier, 3.6 ends up being 12.96. Okay, and to find the actual variance value, we just take the average of all those squared differences from the mean. So we add up all these squared differences, divide by five, the number of values that we have, and we end up with a variance of 5.04. OK, that's all variance is. Now, typically we talk about standard deviation more than variance, and it turns out standard deviation is just the square root of the variance. It's just that simple. So I had a variance of 5.04, and the standard deviation is 2.24. So you see now why we called variance sigma squared. It's because sigma itself represents the standard deviation. So if I take the square root of sigma squared to get sigma, that ends up in this example being 2.24. This is a histogram of the actual data we were looking at. We see that the number four occurred twice in our data set, and then we had one 1, one 5, and one 8. Now, the standard deviation is usually used as a way to think about how to identify outliers in your data set. So if I'm within one standard deviation of the mean of 4.4, that's considered to be kind of a typical value in a normal distribution. But you can see in this example that the numbers 1 and 8 actually lie outside of that range. So if I take 4.4 plus or minus 2.24, we end up around there and there, and 1 and 8 both fall outside of that range of one standard deviation. So we can say mathematically that 1 and 8 are outliers. We don't have to just guess and eyeball it. Now there is still sort of a judgment
call as to what you consider an outlier in terms of how many standard deviations, so you can generally talk about how much of an outlier a data point is by how many standard deviations from the mean it is. So that's something you'll see standard deviation used for in the real world. There is a little nuance to standard deviation and variance, and that's when you're talking about population versus sample variance. Okay, just one little minor difference. If you're working with a complete set of data, you know, a complete set of observations, then you do exactly what I told you. You just take the average of all the squared differences from the mean, and that's your variance. But if you're sampling your data, if you're taking some subset of the data just to make computing easier, you have to do something a little bit different. Instead of dividing by the number of samples, you divide by the number of samples minus one. Okay, so let's look at the example we just had. The population variance is exactly what we did. We took the sum of the squared differences and divided by five, the number of data points that we had, to get 5.04. But the sample variance, which is designated as S squared, is divided by four, that is, n minus one. So we took the number of data points we had, subtracted one, and got the sample variance, which comes out to 6.3. So again, if this was some sort of sample that we took from a larger data set, that's what you would do. If it's the complete data set, you divide by the actual number. Okay, and that's the difference between population and sample variance. As for why, it gets into, like, really weird things about probability that you probably don't want to think about too much. And if you want to express this in terms of fancy mathematical notation: I try to avoid notation in this course as much as possible, because I think the concepts are more important, but this is basic enough stuff that you will see it over and over again.
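The worked example above can be checked step by step in a few lines. This is a sketch using NumPy with the five values from the lecture (1, 4, 5, 4, 8); the variable names are my own. The ddof argument of np.var is how NumPy switches between dividing by N (ddof=0, the default) and dividing by N minus 1 (ddof=1).

```python
import numpy as np

# The five observations from the lecture: people standing in line each hour.
people_in_line = np.array([1, 4, 5, 4, 8])

mean = people_in_line.mean()            # 4.4
differences = people_in_line - mean     # -3.4, -0.4, 0.6, -0.4, 3.6
squared = differences ** 2              # 11.56, 0.16, 0.36, 0.16, 12.96

# Population variance divides by N; sample variance divides by N - 1.
n = len(people_in_line)
population_variance = squared.sum() / n          # 25.2 / 5
sample_variance = squared.sum() / (n - 1)        # 25.2 / 4
std_dev = np.sqrt(population_variance)           # square root of the population variance

# NumPy's built-ins give the same numbers.
numpy_population_variance = np.var(people_in_line)          # ddof=0 by default
numpy_sample_variance = np.var(people_in_line, ddof=1)

# Values more than one standard deviation from the mean.
outliers = people_in_line[np.abs(differences) > std_dev]

print(population_variance, sample_variance, std_dev)
print(outliers)
```

Run as written, this reproduces the lecture's numbers: a population variance of 5.04, a sample variance of 6.3, a standard deviation of about 2.24, and 1 and 8 flagged as lying outside one standard deviation of the mean.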
Population variance is usually designated as sigma squared, with sigma being the standard deviation, and we can say that it is the summation of each data point x minus the mean mu, squared (that's the squared difference of each sample from the mean), over N, the number of data points: σ² = Σ(x − μ)² / N. And sample variance similarly is designated as S squared, and that's the summation of each data point minus the mean M of the sample set, squared, over n minus one: S² = Σ(x − M)² / (n − 1). So you subtract one from the number of samples that you have. That's all there is to it. So let's look at some real examples and write some Python code to make this happen. Let's write some code here and play with some standard deviations and variances. So if you pull up the standard deviation and variance IPython notebook file, follow along with me here. Please do, because there is an activity at the end that I want you to try. What we're going to do here is just like the previous example. We're going to use matplotlib to plot a histogram of some normally distributed random data. We're calling it incomes, and we're saying it's going to be centered around 100. Hopefully that's an hourly rate or something and not annual, or it's some weird denomination. It has a standard deviation of 20 and 10,000 data points. So let's go ahead and generate that and plot it. There you have it. We have 10,000 data points centered around 100, as you can see here, with a normal distribution and a standard deviation of 20, so that's a measure of the spread of this data. And sure enough, you can see that the most common occurrence is around 100, and as we get further and further from that, things become less and less likely. And the standard deviation point of 20 that we specified is around there and around there. So you can see it's kind of the point where things start to fall off sharply, right? So we can say that things beyond that standard deviation boundary are unusual.
Now, NumPy also makes it incredibly easy to compute the standard deviation and variance. If you want to compute the actual standard deviation of this data set that we generated, you just call the std function right on the data set itself. So NumPy, when it creates a list, it's not just a normal Python list. It actually has some extra stuff tacked on to it, so you can call functions on it like std for standard deviation. And we can do that, and we should get a number pretty close to 20, because that's what we specified when we created our random data: we wanted a standard deviation of 20. Sure enough, 19.96, pretty close. And the variance is just a matter of calling .var, which comes out to pretty close to 400, which is 20 squared, right? So the world makes sense. Yay. Standard deviation is just the square root of the variance, or you could say that the variance is the standard deviation squared, the other way around. Sure enough, that works out, so the world works the way it should. I want you to dive in here and actually play around with it. Make it real. Try out different parameters when you generate that normal data. Remember, this is a measure of the shape of the distribution of the data. So what happens if I change that center point? Does it matter? Does it actually affect the shape? Try it out and find out. Try messing with the actual standard deviation that we specify there and see what impact that has on the shape of the graph. So if I want a standard deviation of 30, I could change that there, and you can see how that actually affects things. Or let's make it even more dramatic, like 50. Just play around with that. It's starting to get a little bit fatter there, right? So play around with different values and just get a feel for how these values work. This is the only way to really get an intuitive sense of standard deviation and variance: mess around with some different examples and see the effect that it has.
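Here is a minimal sketch of that experiment, using the parameters quoted in the lecture (center 100, standard deviation 20, 10,000 points). The seed and the two extra arrays, shifted and wider, are my own additions to illustrate the suggested activity.

```python
import numpy as np

# ASSUMPTION: fixed seed for repeatability; not part of the lecture.
np.random.seed(0)

# Normally distributed fake data: centered on 100, standard deviation 20.
incomes = np.random.normal(100.0, 20.0, 10000)

std = incomes.std()   # should come out close to 20
var = incomes.var()   # should come out close to 400, i.e. 20 squared

# Hypothetical variations for the activity:
shifted = np.random.normal(500.0, 20.0, 10000)  # different center, same spread
wider = np.random.normal(100.0, 50.0, 10000)    # same center, much more spread

print(std, var)
print(shifted.std(), wider.std())
```

Moving the center leaves the standard deviation essentially unchanged, while asking for a larger standard deviation gives the fatter histogram described above.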
So play around with this a little bit, and I'll see you in the next lecture. So that's standard deviation and variance in practice. You got hands-on with some of it there, and I hope you played around with it a little bit to get some familiarity with it. These are very important concepts. We'll talk about standard deviations a lot throughout the course, and you will throughout your career in data science, so make sure you've got that under your belt. Let's move on.
14. Probability Density Function; Probability Mass Function: So we've already seen some examples of a normal distribution function in some of the examples in this course. That's an example of a probability density function, and there are other types of probability density functions out there, so let's dive in and see what it really means and what some other examples of them are. Let's talk about probability density functions. We've actually used these already in the course; we just didn't call them that. So let's formalize some of the stuff that we've talked about. For example, we saw the normal distribution a few times in our examples, and that is an example of a probability density function. Here is that normal distribution curve. It's easy conceptually to try to think of this as the probability of a given value occurring. But that's a little bit misleading when you're talking about continuous data, right? Because there's an infinite number of actual possible data points in a continuous data distribution. There could be 0, or 0.1, or 0.001, right? So the actual probability of a very specific value happening is very, very small, infinitely small, even. The probability density function really speaks to the probability of a given range of values occurring. So that's how you've got to think about it. For example, in a normal distribution, between the mean and one standard deviation from the mean, it turns out there's a 34.1% chance of a value falling in that range. And you can tighten this up or spread it out as much as you want and figure out the actual values, but that's the way to think about a probability density function: for a given range of values, it gives you a way of finding out the probability of that range occurring. Okay, you can see here that if you get close to the mean, within one standard deviation, you're pretty likely to land there.
I mean, if you add up 34.1 and 34.1, whatever that comes out to is the probability of landing within one standard deviation of the mean. But as you get out here between two and three standard deviations, we're down to just a little bit over 4% combined, with the positive and negative sides. And as you get out beyond three standard deviations, we're at much less than 1%, actually. So this is a way to visualize and talk about the probabilities of a given data point happening. Again, a probability density function gives you the probability of a data point falling within some given range of values. Okay. And a normal function is just one example of a probability density function. We'll look at some more in a moment. Now, when you're dealing with discrete data, that little nuance about having infinite numbers of possible values goes away, right? And we call that something different. That is a probability mass function: if you're dealing with discrete data, you can talk about the probability mass function. So, for example, you can plot a normal probability density function of continuous data as this black curve. But if we were to quantize that into a discrete data set, like we do with the histogram, we can say the number three occurs some set number of times, and you can actually say the number three has a little over a 30% chance of occurring. So a probability mass function is the way that we visualize the probability of discrete data occurring, and it looks a lot like a histogram, because it basically is a histogram. Okay, so the terminology difference: a probability density function is a solid curve that describes the probability of a range of values happening with continuous data. A probability mass function is the probabilities of given discrete values occurring in a data set. Okay, so let's look at some actual examples, and it will make even more sense. We're going to go into some more depth next.
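The percentages quoted for the normal curve can be reproduced directly. This is a sketch that is not from the lecture: it uses the standard normal cumulative distribution function, written with math.erf from Python's standard library, to get the probability of landing in a range measured in standard deviations from the mean. The helper names normal_cdf and prob_between are my own.

```python
import math

def normal_cdf(z):
    """Probability that a standard normal value is less than or equal to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_between(low, high):
    """Probability of landing between low and high standard deviations from the mean."""
    return normal_cdf(high) - normal_cdf(low)

mean_to_one_sigma = prob_between(0, 1)        # mean up to +1 standard deviation
within_one_sigma = prob_between(-1, 1)        # both sides combined
two_to_three_sigma = 2 * prob_between(2, 3)   # both tails, between 2 and 3 sigma
beyond_three_sigma = 2 * (1 - normal_cdf(3))  # both tails, past 3 sigma

print(round(mean_to_one_sigma, 3))    # 0.341
print(round(within_one_sigma, 3))     # 0.683
print(round(two_to_three_sigma, 3))   # 0.043
print(round(beyond_three_sigma, 4))   # 0.0027
```

These line up with the figures above: about 34.1% between the mean and one standard deviation, a little over 4% between two and three standard deviations on both sides combined, and well under 1% beyond three.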
15. Common Data Distributions: Let's look at some real examples of probability distribution functions and data distributions in general, and wrap your head a little bit more around data distributions and how to visualize them and use them in Python. So go ahead and open up the distributions IPython notebook file from the course materials, and you can follow along with me here if you'd like. Let's start off with a really simple example. A uniform distribution just means there is a flat, constant probability of a value occurring within a given range. We can create one by using the NumPy random.uniform function. And this call says: I want a uniformly distributed random set of values that ranges between negative 10 and positive 10, and I want 100,000 of them. If I then create a histogram of those values, you can see it looks like this. So there is pretty much an equal chance of any given value or range of values occurring within that data. So unlike the normal distribution, where we saw a concentration of values near the mean, a uniform distribution has equal probability across any given value within the range that you define. So what would the probability distribution function of this look like? Well, I'd expect to see basically zero below negative 10 or beyond 10. But between negative 10 and 10, I would see a flat line, because there is a constant probability of any one of those ranges of values occurring. Okay, so for a uniform distribution, you would see a flat line on the probability distribution function, because there is basically a constant probability. Every value, every range of values, has an equal chance of appearing as any other. Okay, and that does happen sometimes. Now, we've seen normal, also known as Gaussian, distribution functions already in this course, and you can actually visualize those in Python. There is a pdf function on the scipy.stats.norm package.
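As a rough sketch of these first two distributions, using the parameters given in the lecture (uniform values between -10 and 10, and a normal curve evaluated from -3 to 3 in steps of 0.001): the notebook cells themselves are not reproduced in this transcript, so the variable names and the seed are assumptions. Instead of drawing a plot, np.histogram is used here to compute the same 50 bucket counts a histogram would display.

```python
import numpy as np
from scipy.stats import norm

# ASSUMPTION: fixed seed and variable names are illustrative only.
np.random.seed(0)

# 100,000 uniformly distributed values between -10 and 10.
values = np.random.uniform(-10.0, 10.0, 100000)

# Bucket counts for a 50-bin histogram; a flat distribution gives
# roughly 100000 / 50 = 2000 values per bucket.
counts, bin_edges = np.histogram(values, bins=50)
print(counts.min(), counts.max())

# Normal probability density function evaluated on a grid of x values,
# where x is measured in standard deviations from a mean of zero.
x = np.arange(-3, 3, 0.001)
density = norm.pdf(x)
print(density.max())  # the peak of the bell curve, at x = 0
```

To see the actual pictures in a notebook, the same arrays can be handed to matplotlib, for example plt.hist(values, 50) and plt.plot(x, density).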
So here, in this example, let's just walk through what's happening. We're creating a list of x values for plotting that range between negative three and positive three, with an increment of 0.001 in between them. Okay, so those are the x values on the graph, and we're going to plot the x-axis using those values, and the y-axis is going to be the normal function, norm.pdf, the probability density function for a normal distribution, on those x values. And we end up with this. So the pdf function with a normal distribution looks just like it did in our previous slide. That is a normal distribution for the given numbers that we provided, where zero represents the mean and these numbers are standard deviations. Now, to actually generate random numbers with a normal distribution: we've done this a few times already, but just as a refresher, if you use the NumPy package, it has a random.normal function. The first parameter, mu, represents the mean that you want to center the data around. Sigma is the standard deviation of that data, which is basically the spread of it. And then we specify the number of data points that we want. Okay, so that's the way to use a probability distribution function, in this case the normal distribution function, to generate a set of random data. We can then plot that, just show a histogram broken into 50 buckets, and that's what we end up with. It does look more or less like a normal distribution, but since there is a random element, it's not going to be a perfect curve. We're talking about probabilities, so there are some odds of things not quite being what they should be. Another distribution function you see pretty often is the exponential probability distribution function, where things fall off in an exponential manner. So when you talk about exponential falloff, expect to see a curve like this, where it's very likely for something to happen, you know, near zero.
But then, as you get farther away from it, it drops off very quickly. There are a lot of things in nature that behave in this manner. And to do that in Python, just like we had a function in scipy.stats for norm.pdf, we also have an expon.pdf for an exponential probability distribution function, and we can use the same syntax that we did for the normal distribution with an exponential distribution here. So again, we just create our x values using the NumPy arange function to create a bunch of values between zero and 10 with a step size of 0.001. And then we plot those x values against the y axis, which is defined as the function expon.pdf of x, and it looks like that exponential falloff. We can also visualize probability mass functions. This one is called the binomial probability mass function, and again, it's the same syntax as before. So instead of expon or norm, we just use binom. And again, a reminder: a probability mass function deals with discrete data, and in this case we are dealing with discrete data. We have been all along, really; it's just how you think about it. We're creating some discrete x values between zero and 10 at a spacing of 0.001, and we're saying I want to plot a binomial probability mass function using that data. With the probability mass function, I can actually specify the shape of that data using two shape parameters, n and p. In this case, they're 10 and 0.5, and if you want to go in and play around with different values to see what effects they have, that's a good way to get an intuitive sense of how those shape parameters work on the probability mass function. Lastly, the other distribution function you might hear about is a Poisson probability mass function, and this has a very specific application. It looks a lot like a normal distribution, but it's a little bit different. The idea here is that you have some information about the average number of things that happen in some given time period.
Okay, this can give you a way to predict the odds of getting some other value instead on a given future day. So as an example, let's say you have a website, and on average you get 500 visitors per day. I can use the Poisson probability mass function to estimate the probability of seeing some other value on a specific day. So let's say I get an average of 500 visits per day. What are the odds of seeing 550 visitors on a given day? That's what a Poisson probability mass function can give you. So in this example, I'm saying my average is 500, mu. I'm going to set up some x values to look at between 400 and 600 with a spacing of 0.5, and I'm going to plot that using the Poisson probability mass function. I can use that graph to look up the odds of getting any specific value that's not 500, assuming the visits follow a Poisson distribution. So 550, it turns out, comes out to about 0.002 as the probability there, or 0.2%. Very interesting. All right, so those are some common data distributions you might run into in the real world. Pop quiz, to make sure you're paying attention: what is the equivalent of a probability density function when using discrete instead of continuous data? So remember, we used a probability density function with continuous data. But when we're dealing with discrete data instead, we use, hint, it's right on the screen right now, a probability mass function. Okay, let's move on. So that's probability density functions and probability mass functions: basically a way to visualize and measure the actual chance of a given range of values occurring in a data set. That's very important information and a very important thing to understand. We'll keep using that concept over and over again, so make sure you watch this video again if you have to. You're good? All right, let's move on.
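Here's a minimal sketch of the distributions from this lecture in one place. It assumes NumPy and SciPy are installed; the seed, the 20 histogram bins, and the variable names are choices made for this sketch and are not taken from the course notebook.

```python
import numpy as np
from scipy.stats import norm, expon, binom, poisson

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable (assumption)

# Uniform: 100,000 values spread evenly between -10 and 10.
values = rng.uniform(-10.0, 10.0, 100_000)
counts, _ = np.histogram(values, bins=20)  # every bin should hold roughly 5,000

# Continuous distributions are described by a probability density function (pdf).
x = np.arange(-3, 3, 0.001)
normal_density = norm.pdf(x)       # bell curve that peaks at the mean, x = 0

x_pos = np.arange(0, 10, 0.001)
expon_density = expon.pdf(x_pos)   # largest at 0, then falls off quickly

# Discrete distributions are described by a probability mass function (pmf).
n, p = 10, 0.5                     # the two shape parameters from the lecture
k = np.arange(0, n + 1)            # the whole-number outcomes 0..10
binom_mass = binom.pmf(k, n, p)

mu = 500                           # average visitors per day
p_550 = poisson.pmf(550, mu)       # chance of exactly 550 visitors in a day
print(f"P(550 visitors | average of 500) = {p_550:.4f}")
```

Evaluating the pmf directly gives the exact probability for a single count such as 550, rather than reading it off the graph, so it may come out a little different from the eyeballed figure.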
16. Percentiles and Moments: Next we'll talk about percentiles and moments. Percentiles: you hear about those in the news all the time. People that are in the top 1% of income, that's a percentile. We'll explain that and have some examples there, and we'll talk about moments, a very fancy mathematical concept, but it turns out it's very simple to understand conceptually. So let's dive in and get started. Let's talk about percentiles and moments, a couple of pretty basic concepts in statistics. But again, we're working our way up to the hard stuff, so bear with me as we go through some of this review. So, percentiles. Basically, if you imagine that you were to sort all of the data in the data set, a given percentile is the point at which that percent of the data is less than the point you're at. A common example that you see talked about a lot is income distribution. When we talk about the 99th percentile, or the one percenters, imagine that you were to take all of the incomes of everybody in the country, in this case the United States, and sort them by income. The 99th percentile would be the income amount at which 99% of the rest of the country was making less than that. OK, so it's a very easy way to comprehend it. This is some real data here. So, for example, at the 99th percentile we can say that 99% of the data points here, which represent people in America, make less than $506,000 a year, and conversely, 1% make more than that. So if you're a one percenter, you're making more than $500,000 a year, roughly. Congratulations. But if you're a more typical median person, the 50th percentile defines the point at which half of the people are making less and half are making more, which is the definition of median, right? So the 50th percentile is the same thing as the median, and that would be at $42,000 given this data set. So if you're making $42,000 a year in the U.
S, you are making exactly the median amount of income for the country, and you can see the problem of income distribution here. Things tend to be very concentrated toward the high end, which is a very big political problem right now in the country, so we'll see what happens with that. But that's beyond the scope of this course. So that's percentiles in a nutshell. Percentiles are also used in the context of talking about the quartiles in a distribution. So if you're looking at, say, a normal distribution here, people talk about quartiles, and quartile one and quartile three in the middle are just the points that together contain 50% of the data, so 25% are on this side of the median and 25% are on this side of the median. The median in this example happens to be near the mean. So, for example, the interquartile range, when we talk about a distribution, is the area in the middle of the distribution that contains 50% of the values. OK, now, this is an example of what we call a box and whisker diagram. Don't concern yourself yet about this stuff out here on the edges. That gets a little bit confusing, and we'll cover that later. Even though they're called quartile one and quartile three, they don't really represent 25% of the data, but don't get hung up on that yet. Focus on the point that these quartiles in the middle each represent 25% of the data distribution, and those tend to be in the middle. Let's look at some more examples using Python, and kind of get our hands on it and conceptualize this a little bit more. All right, let's get our hands dirty with percentiles. Go ahead and open up the Percentiles IPython notebook file if you'd like to follow along, and again, I encourage you to do so, because I want you to play around with this a little bit later. So let's start off by generating some randomly distributed normal data, or normally distributed random data.
Rather. In this example, what we're going to do is generate some data centered around zero, with a mean of zero and a standard deviation of 0.5. I'm going to make 10,000 data points with that distribution, and we're going to plot a histogram and see what we come up with. It looks a little bit something like that, very much like a normal distribution. But because there is a random component, you know, we have a little outlier here, and things are tipped a little bit to the right here. There's a little bit of random variation there to make things interesting. Now, to compute the percentile values of this distribution, NumPy provides a very handy percentile function that will do that for you. So we created our vals list of data here using numpy.random.normal, and I can just call np.percentile to figure out the 50th percentile value, and in this example that turns out to be 0.005. So remember, the 50th percentile is just another name for the median, and it turns out the median is very close to zero in this data. You can see we're tipped a little bit to the right, so that's not too surprising. Say I want to compute the 90th percentile. That gives me the point at which 90% of the data is less than this given value. So the 90th percentile of this data turns out to be 0.65. So it's around here, and basically, at that point, 90% of the data is less than that. So I'll believe that 10% is greater and 90% is less than right around there. Now for the 20th percentile value. That would give me the point at which 20% of the values are less than the number that I come up with. So the 20th percentile point works out to be negative 0.4, roughly, and again, I believe that. It's saying that 20% of the data lies to the left of negative 0.4, and conversely, 80% is greater. So if you want to get a feel for where those breaking points are in the data set, the percentile function is an easy way to compute them.
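The percentile calls just described can be sketched like this. It's a minimal example that assumes NumPy is installed; the seed and variable names are choices for this sketch, so your exact numbers will differ a little from the ones read out in the lecture.

```python
import numpy as np

rng = np.random.default_rng(42)       # seeded for repeatability (assumption)
vals = rng.normal(0.0, 0.5, 10_000)   # mean 0, standard deviation 0.5

median = np.percentile(vals, 50)      # the 50th percentile is the median
p90 = np.percentile(vals, 90)         # 90% of the data lies below this value
p20 = np.percentile(vals, 20)         # 20% of the data lies below this value

# Sanity check: count what fraction of the points really fall below p90.
fraction_below_p90 = np.mean(vals < p90)

print(f"median={median:.3f}  p90={p90:.3f}  p20={p20:.3f}")
print(f"fraction below p90: {fraction_below_p90:.3f}")
```

Checking the fraction of points below a percentile is a quick way to convince yourself of what the number means.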
If this were a data set representing income distribution, like in our slides, we could just call np.percentile(vals, 99) to figure out what the 99th percentile is. So you could figure out who those one percenters people keep talking about really are, and if you're one of them. All right, now get your hands dirty. I want you to play around with this data. This is an IPython notebook for a reason: it's so you can mess with it and mess with the code. Try different standard deviation values, and see what effect it has on the shape of the data and where those percentiles end up lying. For example, try using smaller data set sizes and adding a little bit more random variation into things. Just get comfortable with it, play around with it, and find that you can actually do this stuff and write some real code that works. So spend a few minutes playing around with that, hit pause while you do that, and when you continue we'll come back to the concept of moments of a distribution. Next, let's talk about moments. Moments are a fancy mathematical phrase, but you don't actually need a math degree to understand them. Intuitively, it's a lot simpler than it sounds. It's one of those examples where people in statistics and data mining and machine learning and data science like to use big fancy terms to make themselves sound really smart. But the concepts are actually very easy to grasp, and that's the theme you're going to hear again and again in this course. So let's talk about moments. Basically, they're ways to measure the shape of a data distribution, of a probability density function, of anything, really. And mathematically, we've got some really fancy math notation here for how they're defined. And, you know, if you do know calculus, it's actually not that complicated of a concept.
We're taking the difference between each value and some value, raised to the nth power, where n is the moment number, and integrating across the entire function from negative infinity to infinity. But intuitively, it's a lot easier than calculus. Ready? Here we go. The first moment works out to just be the mean of the data that you're looking at. That's it. The first moment is the mean, the average. It's that simple. The second moment is the variance. That's it. The second moment of a data set is the same thing as the variance value, and, you know, it might seem a little bit creepy that these things kind of fall out of the math naturally. But think about it. The variance is really based on the square of the differences from the mean, so coming up with a mathematical way of saying that variance is related to the mean isn't really that much of a stretch, right? It's just that simple. Now, when we get to the third and fourth moments, things get a little bit trickier, but they're still concepts that are easy to grasp. So the third moment is called skew, and it is basically a measure of how lopsided a distribution is. You can see in these two examples: if I have a longer tail on the left, that is a negative skew, and if I have a longer tail on the right, that's a positive skew. So you can see here what the shape of a normal distribution would look like without skew. If I stretch that out on one side, then I end up with a skew, a positive skew in that example. Okay, so that's all skew is. It's basically stretching out the tail on one side or the other, and it is a measure of how lopsided, how skewed, a distribution is. The fourth moment is called kurtosis. Wow, that's a fancy word. All that really is, is how thick the tail is and how sharp the peak is. So again, it's a measure of the shape of the data distribution. Here's an example, and you can see that the higher peak values have a higher kurtosis value.
So the red curve has a higher kurtosis than, you know, this blackish, I can't even tell, blackish-brown curve here at the bottom. It's a very subtle difference, but a difference nonetheless. It basically measures how peaked your data is. So again, review: first moment, mean; second moment, variance; third moment, skew; fourth moment, kurtosis. You already know what mean and variance are. Skew is how lopsided the data is, how stretched out one of the tails might be, and kurtosis is how peaked, how squished together, the data distribution is. So let's play around in Python and actually compute these moments and see how you do that. Okay, to play around, let's go ahead and open up the Moments IPython notebook file, and you can follow along with me here. So let's again create that same normal distribution of random data, and again we're going to make it centered around zero with a 0.5 standard deviation and 10,000 data points, and plot that out. So again, it's a randomly generated set of data with a normal distribution around zero. To find the mean and variance, we've done this before: NumPy just gives you a mean and a var function to compute that. So we can just call np.mean to find the first moment, which is just a fancy word for the mean. And that works out to be very close to zero, just like we would expect for normally distributed data centered around zero. So the world makes sense so far. The second moment is just another name for the variance, and that works out to be about 0.25. And again, that works out as a nice sanity check. Remember that standard deviation is the square root of variance, and if you take the square root of 0.25 it comes out to 0.5, which is the standard deviation we specified while creating this data. So again, that checks out too. The third moment is skew, and to do that we're going to need to use the SciPy package instead of NumPy. But that again is built into any scientific computing package like Enthought Canopy or Anaconda. We import
scipy.stats as sp, and then we can just say sp.skew on vals, and that will give us a skew value. Since this is centered around zero, it should be almost a zero skew. It turns out that, from random variation, it does skew a little bit left. And actually, that does jibe with the shape that we're seeing here. It looks like we did kind of pull it a little bit negative. The fourth moment is kurtosis, which describes the shape of the tail, and again, for a normal distribution that should be about zero. And indeed it is. So, you know, the shape of the tail and how sharp the peak is kind of go together. If I were to squish the tail down, it kind of pushes up that peak to be more pointy. And likewise, if I were to push down on that distribution, you can imagine that kind of spreading things out a little bit, making the tails a little bit fatter and the peak a little bit lower. So that's what kurtosis means. And in this example, kurtosis is near zero because it is just a plain old normal distribution. So if you want to play around with that, go ahead and again try to modify the distribution. Make it centered around something besides zero and see if that actually changes anything. Should it? Well, it really shouldn't, because these are all measures of the shape of the distribution, and they don't really say a whole lot about where that distribution is exactly. It's a measure of the shape. That's what the moments are all about. So go ahead and play around with that. Try different center values, try different standard deviation values, and see what effect it has on these values, or if it doesn't change at all. Of course, you'd expect things like the mean to change, because you're changing the mean value. But variance, skew? Maybe not. Play around, find out. All right, that's moments. Let's move on. And there you have percentiles and moments. Percentiles: pretty simple concept. Moments: sounds hard, but it's actually pretty easy to understand how to do it, and it's easy in Python, too.
So now that you've got that under your belt, let's move on.
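To make the four moments concrete, here's a minimal sketch, assuming NumPy and SciPy are installed. The seed, the hand-rolled skew calculation, and the shift by 10 are additions for illustration and are not from the course notebook.

```python
import numpy as np
import scipy.stats as sp

rng = np.random.default_rng(1)        # seeded for repeatability (assumption)
vals = rng.normal(0.0, 0.5, 10_000)   # mean 0, standard deviation 0.5

mean = np.mean(vals)          # first moment
variance = np.var(vals)       # second moment, about 0.5 ** 2 = 0.25
skew = sp.skew(vals)          # third moment, near 0 for symmetric data
kurt = sp.kurtosis(vals)      # fourth moment; SciPy's default reports 0 for a normal curve

# The same skew by hand: average cubed distance from the mean,
# divided by the standard deviation cubed.
centered = vals - mean
skew_by_hand = np.mean(centered ** 3) / np.std(vals) ** 3

# Moving the whole distribution changes the mean but not its shape.
shifted = vals + 10.0
shifted_skew = sp.skew(shifted)

print(f"mean={mean:.3f}  variance={variance:.3f}  skew={skew:.3f}  kurtosis={kurt:.3f}")
```

Comparing `skew` with `skew_by_hand`, and `skew` with `shifted_skew`, is one way to see that skew really is just a shape measure.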
17. A Crash Course in matplotlib: So, you know, your data is only as good as you can present it to other people, really. So let's talk about plotting and graphing your data, and how to present it to others and make your graphs look pretty. We're going to introduce matplotlib, which is a library you can use in Python to make pretty graphs, and I'll show you a few tricks on how to make them as pretty as you can. Let's go there. Let's have some fun with graphs. It's always good to make pretty pictures out of your work, and this will give you some more tools in your tool chest for visualizing different types of data, using different types of graphs, and making it look pretty: different colors, different line styles, different axes, things like that. It's not only important to use graphs and data visualization to try to find interesting patterns in your data, but it's also important to present your findings well to a non-technical audience. So let's dive into matplotlib. Go ahead and open up the MatPlotLib IPython notebook and you can play around with this stuff with me. We'll start by drawing a simple line graph. In this example, I'm going to import matplotlib.pyplot as plt, and we'll just refer to it as plt from now on in this notebook. What I'm going to do is use numpy.arange to create an x axis filled with values between negative three and three at increments of 0.001, and I'm going to use pyplot's plot function to plot x, and the y function will be norm.pdf of x. So I'm going to create a probability density function with a normal distribution based on the x values, and I'm using the scipy.stats norm package to do that. So, tying it back into our earlier lecture about probability density functions, here we are plotting a normal probability density function using matplotlib. So we just call the pyplot plot method to set up our plot.
And then we display it using plt.show, and when we run that, that's what we get: a pretty little graph with all the default formatting. Let's say I want to plot more than one thing at a time. You can actually call plot multiple times before calling show, to add more than one function to your graph. So in this example, I'm calling my original function of just a normal distribution, but I'm going to render another normal distribution here as well, with a mean around 1.0 and a standard deviation of 0.5. And I'm going to show those together so you can see how they compare to each other. You can see that by default, matplotlib chooses different colors for each graph automatically for you, which is very nice and handy of it. There you have it. If I want to save this to a file, you know, maybe I want to include it in a document or something, I can do something like this. Instead of just calling plt.show, I can call plt.savefig with a path to where I want to save this file and what format I want it in. So in this example, I have the same plot set up, but instead of show, I'm calling savefig to this path, and you'll want to change that to an actual path that exists on your machine. If you're following along, you probably don't have a Users\Frank folder on your system. And remember, too, if you're on Linux or macOS, instead of a backslash you're going to use forward slashes, and you're not going to have a drive letter. So with all of these Python notebooks, whenever you see a path like this, make sure that you change it to an actual path that works on your system. Okay, but I am on Windows here, and I do have the Users\Frank folder, so I can go ahead and run that. And if I check my file system under Users\Frank, sure enough, I have a MyPlot PNG file. I can open it up and look at it, and I can use that in whatever document I want. So, pretty cool. All right, let's move on.
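Here's a minimal sketch of plotting two curves on one graph and saving the result to a file. It assumes NumPy, SciPy, and matplotlib are installed. The Agg backend, the temp-folder location, and the file name are placeholders chosen for this sketch so that it runs on any operating system; in a notebook you would normally just call plt.show().

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so no display is required
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

x = np.arange(-3, 3, 0.001)

# Each call to plot() adds another curve to the same graph.
plt.plot(x, norm.pdf(x))             # standard normal: mean 0, standard deviation 1
plt.plot(x, norm.pdf(x, 1.0, 0.5))   # second curve: mean 1.0, standard deviation 0.5

# Hypothetical output location; replace it with a folder that exists on your machine.
out_path = os.path.join(tempfile.gettempdir(), "MyPlot.png")
plt.savefig(out_path, format="png")
plt.close()

print("Saved plot to", out_path)
```

Building the path with os.path.join sidesteps the backslash versus forward slash issue between Windows, Linux, and macOS.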
Let's say I don't like the default choices for the axes of this graph. It's automatically fitting it to the tightest set of axis values it can find, which is usually a good thing to do. But sometimes you want things on an absolute scale, right? So in this example, first I get the axes using plt.axes. Once I have this axes object, I can adjust it. So by calling set_xlim, I can set the x range from negative 5 to 5, and with set_ylim I set the y range from 0 to 1. And you can see that down here my x values are ranging from minus 5 to 5, and y goes from 0 to 1. I can also have explicit control over where these tick marks are. So I'm saying I want the x ticks to be at minus five, minus four, minus three, et cetera, and the y ticks from 0 to 1 at 0.1 increments. Now, I could use the arange function to do that more compactly, but the point is you have explicit control over where exactly those tick marks happen, and you can skip some. You can have them at whatever increments you want or whatever distribution you want. Beyond that, it's the same thing. Once I've adjusted my axes, I just call plot with the functions that I want to plot and call show to display it. And sure enough, there you have the result. What if I want grid lines? Well, same idea. All I do is call grid on the axes that I get back from plt.axes. By doing that, I get these nice little grid lines, and that makes it a little bit easier to see where a specific point is, although it clutters things up a little bit, so it's a little bit of a stylistic choice there. What if I want to play games with the line types and colors? You can do that, too. You see here that there's actually an extra parameter on the plot function, where you can pass a string that describes the style of the line. In this first example, what this indicates is I want a blue line with a solid line. That's what the b stands for.
Blue. And the dash means a solid line. For my second function, I'm going to plot it in red, that's what the r means, and the colon means I'm going to plot it as a dotted line, with little vertical hashes all the way up. If I run that, you can see that's what it does, and you can change to different types of line styles there. In addition, you can do a double dash, and that gives you this dashed line as a line style. Likewise, you can do dash dot, and you get something that looks like that. So what are the different choices there? I could make it green with vertical slashes. There you go. So have some fun with that if you want. Experiment with different values and you can get different line styles. Something you'll do more often is labeling your axes. You never want to present data in a vacuum. You definitely want to tell people what it represents. To do that, you can use the xlabel and ylabel functions on pyplot to actually put labels on your axes. So I'm going to label the x axis Greebles and the y axis Probability. You can also add a legend. Normally this would be the same thing, but just to show that it's set independently, I'm setting up a legend here, and you pass in basically a list of what you want to name each graph. So my first graph is going to be called Sneetches, the second graph is going to be called Gacks, and the loc parameter here indicates what location you want it at, where 4 represents the lower right-hand corner. So let's go ahead and run that, and you can see that I'm plotting Greebles versus Probability for both Sneetches and Gacks. A little Dr. Seuss reference for you there. So that's how you set axes labels and legends. A little fun example here: if you're familiar with the web comic XKCD, there's a little bit of an Easter egg in matplotlib where you can actually plot things in XKCD style, and you can do that by calling plt.xkcd.
Which kind of puts matplotlib in XKCD mode. After you do that, things will just start to look like this style, with kind of a comic book font and squiggly lines, automatically. This simple little example shows a funny little graph where we're plotting your health versus time, where your health takes a steep decline once you realize you can cook bacon whenever you want to. All we're doing there is using this xkcd method to go into that mode. There's a little bit of interesting Python here in how we're actually putting this graph together. We're starting out by making a data line that is nothing but the value one across 100 data points. Then we use the old Python list slicing operator to take everything after the value of 70, and we subtract off from that sub-list of 30 items the range of zero through 30. So that has the effect of subtracting off a larger value linearly as you get past 70, which results in that line heading downward, down to zero, beyond the point 70. So that's a little example of some Python list slicing in action, and a little creative use of the arange function to modify your data. Now, going back to the real world, we can remove XKCD mode by calling rcdefaults on matplotlib, and we can get back to normal mode here. If you want a pie chart, all you have to do is call plt.pie and give it an array of your values, colors, labels, and whether or not you want items exploded, and if so, by how much. So you can see here I'm creating a pie chart with these values: 12, 55, 4, 32, 14. I'm going to assign explicit colors to each one of those values and explicit labels to each one of those values. I'm going to explode out the Russia segment of the pie by 20%, and I'm going to give this plot a title called Student Locations and show it. That's all there is to it. If you want a bar chart, that's also very simple, kind of a similar idea to the pie chart.
You give it an array of values and an array of colors, and you just plot your data. So I'm telling it to plot from the range of 0 to 5, using these y values in this array and using this explicit list of colors. Go ahead and show that, and there you have your bar chart. Next, a scatter plot. This is something we'll see pretty often in this course. Say you have a couple of different attributes you want to plot for the same set of people or things. For example, maybe we're plotting ages against incomes, where each dot represents a person and these axes represent different attributes of those people. The way you do that with a scatter plot is you call pyplot's scatter function using the two axes that you want to define, that is, the two attributes that contain the data that you want to plot against each other. So let's say I have a random distribution in X and Y, and I scatter those on a scatter plot, and I show it. That's what it looks like. Pretty cool. You can see there's sort of a concentration in the center here because of the normal distribution that's being used in both axes, but since it is random, there's no real correlation between those two. Finally, we'll remind ourselves how a histogram works. We've already seen this plenty of times in the course, but if I just create, for example, a normal distribution centered on 27,000 with a standard deviation of 15,000 and 10,000 data points, I can just call pyplot's histogram function, hist, and specify the input data and the number of buckets that I want to group things into in my histogram, and then call show, and the rest is magic. Finally, box and whisker plots. Remember, in the previous lecture, when we talked about percentiles, I touched on this a little bit. Again, with a box and whisker plot, the box represents the two inner quartiles, where 50% of your data resides, and conversely, another 25% resides on either side of that box.
But these dotted-line whiskers represent the range of the data except for outliers. We define outliers in a box and whisker plot as anything beyond 1.5 times the interquartile range, or the size of this box. So we take the size of that box times 1.5, and up to that point on the dotted whiskers is what we call these outer quartiles. But anything outside of the outer quartiles is considered an outlier, and that's what these lines beyond the outer quartiles represent. That's where we are defining outliers based on our definition with the box and whisker plot. Now, just to give you an example, here we have created a fake data set where we have a uniform, random distribution of data, and then we add in a few outliers on the high end and a few negative outliers as well. Then we concatenate those lists all together and create a single data set from these three different sets that we created using NumPy. We then take that combined data set of uniform data and a few outliers, and we plot it using plt.boxplot. That's how you get a box and whisker plot. Call show to visualize it, and there you go. You can see it's showing that box that represents the inner 50% of all data, and then we have these outlier lines where you can see little crosses for each individual outlier that lies in that range. All right, that's your matplotlib crash course. Now get your hands on it and actually do some exercises here. As your challenge, I want you to create a scatter plot that represents random data that you fabricate on age versus time spent watching TV, and you can make that anything you want, really. If you have a different fictional data set in your head that you'd like to play with, have some fun with it. So create a scatter plot that plots two random sets of data against each other, and label your axes. Make it look pretty. Play around with it, have fun with it. Everything you need should be in this IPython notebook for reference and for examples.
But if you have any trouble, feel free to post in the discussions for this lecture and we'll help you out. So keep that IPython notebook around with your tips and tricks for matplotlib. It's kind of a cheat sheet, if you will, for different things you might need to do for generating different kinds of graphs and different styles of graph, so I hope it proves useful.
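The 1.5 times interquartile range rule for box and whisker plots can be checked by hand, as in this minimal sketch. It assumes NumPy and matplotlib are installed. The seed, the specific value ranges for the fake data, and the fence variable names are assumptions for illustration and are not confirmed by the lecture.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so no display is required
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(7)  # seeded for repeatability (assumption)

# Fake data: a uniform bulk plus a few extreme values planted on each side.
uniform_bulk = rng.random(100) * 100 - 40     # roughly -40 to 60
high_extremes = rng.random(10) * 50 + 100     # roughly 100 to 150
low_extremes = rng.random(10) * -50 - 100     # roughly -150 to -100
data = np.concatenate((uniform_bulk, high_extremes, low_extremes))

# The box spans the first to third quartile; its height is the interquartile range.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Anything more than 1.5 box heights beyond the box counts as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

plt.boxplot(data)   # matplotlib applies the same 1.5 * IQR rule by default
plt.close()

print(f"box: {q1:.1f} to {q3:.1f}, fences: {lower_fence:.1f} to {upper_fence:.1f}")
print(f"{len(outliers)} points fall outside the fences")
```

Note that not every planted extreme value has to land outside the fences. Whether a point is drawn as an outlier depends only on the 1.5 times interquartile range rule, not on how the data was generated.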
18. Data Visualization with Seaborn: All right, let's talk about Seaborn now, which is basically matplotlib plus plus, if you will. Seaborn is basically a visualization library that sits on top of matplotlib, and it makes things a little bit prettier to look at. But it also has a bunch of different kinds of charts and graphs that we didn't have in matplotlib. Just like the previous example, we'll start off again by saying %matplotlib inline, meaning that we want to view all of our results as part of this notebook itself within the browser. We'll import pandas as pd and load up a fuel efficiency .csv file that I've uploaded to my website here, and this is real data, by the way. This is actual data that comes from the U.S. government about the fuel efficiency of every car they have a record of for the 2019 model year, specifically. So let's extract some information from that that we can play with. Let's start by extracting the number of gears from that resulting data frame, and we're going to do value_counts. If you remember back from our pandas tutorial, that basically gives us back the data we need for a histogram that says how many times each unique value occurs in our data frame. So this should give us back a Series that maps the gear numbers to the number of times each unique value appeared. We can then just plot that, saying that we want a bar chart. So right now, we're just using matplotlib as is, just to visualize this data. And there you have it. You can see that eight-speed transmissions seem to be the most common, followed by six-speed, and we have sort of an exponential drop-off there to other, more obscure values. Now let's use Seaborn. Seaborn in its most basic form can just make matplotlib look better. So all we need to do is say import seaborn as sns, and then we can say sns.set, and all that does is replace
The default settings in Matplotlib get replaced with the more visually modern-looking settings that Seaborn gives us. Matplotlib is pretty old. I mean, it goes back to MATLAB, and it's kind of showing its age, quite frankly. So this gives it a more modern look and feel. Now we can do that same exact bar chart, but with the Seaborn defaults applied, and you can see it's a little bit prettier. We have more muted tones here, and it's also against this nice little grid background that actually lets you visualize that grid a little bit better. Otherwise it's pretty much the same, just a little bit easier on the eyes, right? Let's dive into some more depth here and take a closer look at the data that we're dealing with. Here's our raw data frame that we actually loaded up, which came from the government, and we just use head to take a look at the first five rows as an example. The information I have extracted is the car manufacturer, like Aston Martin or Volkswagen; the car line, which is basically the model; the engine displacement, that is, how many liters the engine is; how many cylinders are in the engine; the transmission type; its city MPG fuel efficiency; its highway fuel efficiency; the combined city plus highway MPG value; and the number of gears the car has. So that's the information that we have to play with here. Now, Seaborn has some plots that Matplotlib doesn't offer at all. For example, there's distplot, and that's a way of plotting a histogram together with a smooth distribution overlaid on top of that histogram. So let's take a look at that on the CombMPG column. Here we have a histogram of how many times each value within CombMPG appears. You can see we have this spike here around the low to mid twenties, right? That seems to be the most common MPG rating for a vehicle.
And we can overlay this sort of trend curve here automatically as part of distplot. So that's something Seaborn is doing for us automatically without us even trying, and it makes it a little bit easier to visualize the bigger trends here. You can see that's kind of helpful because we have these weird values in between these other values. There seems to be some sort of quantization that occurs in our data that we can smooth over a little bit with that trend line. That's sometimes a useful way to visualize things. Another thing you can have in Seaborn is the pair plot; that's also something unique to Seaborn. And this is cool stuff because it lets you visualize plots of every possible combination of a set of attributes. So you can just look at every possible way of visualizing a set of values and try to find the ones that look interesting and might be useful to investigate more deeply. As an example, let's classify cars by how many cylinders they have, and we'll look for relationships between how many cylinders each car has and their city MPG rating, their highway MPG rating, and their combined MPG rating. So let's start by extracting those columns from our data frame into df2. We're going to use the same syntax we introduced in our Pandas tutorial to extract these columns into a new data frame, so we now have a new set of rows here that only contain the cylinders and the MPG columns from our original data. Now watch this. If we do pairplot on that new data frame, df2, we can say that we want to focus on the cylinders as our primary thing that we want to look at, with a given height to say that we want this to be a nice, big plot that we can visualize easily. Let that run. Here we go. So what we have here is like a grid of grids, right? So this is kind of neat.
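Here is a minimal sketch of the data preparation described so far: counting gear values and pulling the cylinder and MPG columns into df2. The tiny data frame and its column names (Cylinders, CityMPG, HwyMPG, CombMPG, "# Gears") are assumptions standing in for the real fuel efficiency file, and the Seaborn call is left as a comment so the sketch runs with pandas alone.

```python
import pandas as pd

# Made-up rows standing in for the fuel efficiency data; the column
# names are assumptions, not necessarily those of the real CSV file.
df = pd.DataFrame({
    "Cylinders": [4, 4, 6, 8, 12],
    "CityMPG":   [28, 24, 19, 15, 11],
    "HwyMPG":    [36, 33, 27, 22, 17],
    "CombMPG":   [31, 27, 22, 18, 13],
    "# Gears":   [6, 8, 8, 8, 7],
})

# How many times each gear count occurs: the data behind the bar chart.
gear_counts = df["# Gears"].value_counts()

# Keep only the columns we want to compare against each other.
df2 = df[["Cylinders", "CityMPG", "HwyMPG", "CombMPG"]]

# With Seaborn installed, the grid of pairwise plots would come from
# something like:
#   import seaborn as sns
#   sns.pairplot(df2, height=2.5)
print(gear_counts)
print(df2.head())
```

Passing a list of column names inside the square brackets is what returns a new data frame rather than a single Series.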
Let's scroll down a little bit so we can sort of visualize what's going on here. You can see that we have every single column along here, and over here we have every single column as well. So if you want to plot CombMPG versus cylinders, you can look here. If you want to plot highway MPG versus city MPG, you can look at this plot here. So you can find interesting linear relationships between different columns. For example, just looking at the Cylinders column, we can see that there's a pretty clear relationship between the number of cylinders and the MPG, whether it's city, highway, or combined. As the number of cylinders increases, we can see that MPG tends to drop. But there's a really wide spread here for four-cylinder vehicles, so there's more to the story in the world of four-cylinder vehicles. Some are really bad, some are really good; there's a really big spread there. So already we've got some useful insight into our data. We can also use a scatter plot in Seaborn 0.9. It's just sort of a prettier version of the Matplotlib one. Basically, you can plot individual data points across any two axes you want and see how your data is distributed across those dimensions. So let's say sns.scatterplot. We're going to say the x axis is going to be engine displacement and y is going to be combined MPG, and for the data itself, we're going to refer to our df data frame from our raw data. So this is going to pluck out those two columns and plot them against each other on a scatter plot. And there you have it. Each individual point in our data frame is being scattered onto this plot at that particular point's engine displacement and combined MPG value. And again, you can see there is a relationship here. So already we're getting some insights from visualizing that data.
Again, the lower engine displacements tend to have a very wide spread of MPGs, but in general, the bigger the engine displacement, the worse the fuel efficiency, which shouldn't be that big of a surprise, right? Another cool thing in Seaborn is the joint plot. This lets you visualize scatter plots and histograms at the same time, with a histogram on each axis. So let's take a look at that same spread of engine displacement versus CombMPG, but this time we're going to do a jointplot instead of a scatterplot. Here's what it looks like. We have the same scatter plot as before, but we have histograms overlaid on each axis, so we can see over here on this side the histogram of MPG ratings. We can visualize that very easily and sort of see how this data all rolls up. And up here, we have a histogram of the engine displacement values as well, so this makes it a lot easier to tell that the most common engine displacement is a little bit under 2 liters or so, right? So that's an easier way of trying to figure out how many dots are in a given column or section here, because a lot of times they can overlap and it's not really that intuitive to figure out. The histogram makes that distribution of data easier to see. Another thing Seaborn offers is lmplot, and that's just a scatter plot with a linear regression applied to it automatically. So I can ask for that same scatter plot, but calling lmplot instead of scatterplot gives me back the same exact scatter plot with a linear regression applied to it. And if you look really closely, you can see a sort of shaded area around the line that gives you the bounds on that regression. We'll talk about linear regression in more depth later in this course, but basically we're fitting a line to the data that we have, a very simple concept. Back in Matplotlib we talked about box plots, and Seaborn has its own version of those as well: box-and-whiskers plots.
In this example, let's take a look at each vehicle manufacturer and visualize the MPG ratings across the vehicles they produce. That's going to give us the spread of MPG ratings across all the vehicles each manufacturer offers. So we're going to do basically an individual box plot for each manufacturer, showing the distribution of MPG ratings across their entire product line. Got it? All right. There are a lot of manufacturers, so we're going to have to do a couple of things here to take advantage of what Seaborn offers. First of all, we're going to set the figure size to 15 by 5. That just makes it bigger so we can fit more information on the screen. We'll then define the box plot itself. We're going to say we want to plot the manufacturer on the x axis and the combined MPG values on the y axis, using our original data frame, df, as the data, and we're going to save that box plot into an ax variable. We will then set the tick labels on that plot to have a 45-degree rotation. That way they'll be easier to read, because there are a lot of them. The syntax here is we're going to say set_xticklabels on the x tick labels that we get back from that plot, with a rotation of 45 degrees. So it's basically saying, I want to set the labels on the x axis to the existing labels, you know, leave them unchanged, but specify a rotation of 45 degrees. So let's go ahead and kick that off. The set_xticklabels command puts out some output here as part of its process, but here's the chart itself. Pretty interesting. You can see that 45-degree angle that we specified on the labels being used there, which is a lot easier to read, and you can look at the spread of MPG values for each individual manufacturer. Volkswagen has a really wide range, for example, whereas Aston Martin is pretty tightly clustered. Volvo is also pretty tight here. So, interesting stuff.
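The box plot steps just described might look roughly like the sketch below. It assumes Seaborn and Matplotlib are installed, uses a handful of made-up rows rather than the real data, and the column names "Mfr Name" and "CombMPG" are assumptions. The Agg backend line is only there so the sketch also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so this also runs without a display

import pandas as pd
import seaborn as sns

# Made-up rows; "Mfr Name" and "CombMPG" are assumed column names.
df = pd.DataFrame({
    "Mfr Name": ["Volvo", "Volvo", "Volvo", "Ferrari", "Ferrari", "Ferrari"],
    "CombMPG":  [26, 27, 28, 14, 15, 16],
})

# Apply Seaborn's styling and ask for a wide 15 x 5 inch figure.
sns.set(rc={"figure.figsize": (15, 5)})

# One box per manufacturer, showing the spread of combined MPG.
ax = sns.boxplot(x="Mfr Name", y="CombMPG", data=df)

# Reuse the existing x labels unchanged, but rotate them 45 degrees.
ax.set_xticklabels(ax.get_xticklabels(), rotation=45)
```

Depending on the Matplotlib version, the last line may print a list of text objects or a warning about tick locations; that is the extra output mentioned in the lecture and it does not affect the chart.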
Also, General Motors tends to be clustered here around the mid twenties or so, but they have a lot of outliers up here on the higher end as well. So it seems there are a few very efficient General Motors cars out there. Then we have Ferrari, obviously not very good MPG, because people who drive Ferraris care more about performance than fuel efficiency, I think. So there are interesting insights to be gained from this box-and-whiskers plot of fuel efficiency across the models for each vehicle manufacturer that we know about. Fun stuff, and it's pretty to look at. Again, it's modern, pleasing colors, and that's kind of what Seaborn gives you out of the box. There's also the swarm plot, which, instead of boxes and whiskers, plots each individual data point, but it groups them together in a way that makes it easier to visualize them. It makes more sense when you look at it. We'll just do a swarmplot on the same exact thing, so the manufacturer name and combined MPG from our df data frame. Again, we will set the rotation to 45 degrees on the x axis and kick it off. The only difference here is we're doing a swarm plot instead of a box plot. You can see that instead of the boxes and whiskers, we're getting this different format where we're sort of clumping these points together to represent the distribution of the data better. So each individual vehicle is being plotted as a point on this graph, but we're grouping those points together horizontally to try to reflect the distribution of those points a little bit better. It's a way of looking at the raw data a little bit more than in a box plot, but it's still grouped in a way that gives you the same information as a box plot, with more refined information. So this is what we call a swarm plot. You can get the same results out of it. Again, looking more deeply into Volkswagen, you can see that they have a pretty wide spread here.
There's a bunch around 30 and a bunch around 10 and not much in between. So that's kind of a curious case, and I think it's because Volkswagen actually owns a bunch of different brands that are targeted at very different markets. So we're probably seeing the consumer vehicles up here and the performance vehicles way down here, would be my guess. General Motors is very tightly clustered in this range here. They are more about mass-market vehicles, so they kind of want to be in that sweet spot of things that perform reasonably well, to appeal to an American audience. Anyway, it's just another way of looking at it. One more is the count plot. It's basically the same thing as a histogram, but it's for categorical data. A histogram really is only a histogram if you're dealing with numerical values. If you're dealing with categories, though, that's called a count plot. So let's just look at an example. Again, let's extract the manufacturer names and take a look at how many cars each manufacturer makes. We're going to do a countplot, counting up how many vehicles each manufacturer has, and again we'll rotate the labels by 45 degrees so we can actually read those x labels. And there you have it. It's just like a histogram, except that it's broken down by category, so there's no real inherent meaning to the order that these appear in. They're just counts broken down by category. That's all a count plot is. You can see pretty clearly here that General Motors has the largest number of car models available, followed closely by BMW, and again, these are big companies that own other manufacturers. So we're not necessarily saying that there are over 100 different BMW models on the market in 2019. Those include other brands.
They own those as well. But on the other end here, there's a very small number of Aston Martin models and a very low number of Rolls-Royce models, for example, so you can really see the distribution of how many models each manufacturer produces very easily. Finally, let's take a look at a heat map. Heat maps are fun. They're a way to plot 2D data where the colors represent the individual values within each cell of that table. Again, it makes more sense if you just look at it. Let's make a pivot table from our original data frame to create a 2D table mapping the average MPG rating for each combination of the number of cylinders and engine displacement. Let's take a look at the heat map that we got here. We call pivot_table on the original data frame to extract this 2D information, basically a 2D array that maps the combined MPG for each combination of cylinders and engine displacement. So what we end up with is kind of like a data frame, where we're mapping cylinders against engine displacement, and the individual cells contain the MPG for each combination. And we're going to aggregate these together using the mean. We're saying we want to look across all the different values and take the mean for each individual combination of cylinders and engine displacement. So if there's more than one car that has, say, a four-cylinder, 2.0-liter engine, we'll take the average of all those cars together to arrive at the value in that cell. Okay, so this is what that table looks like as a heat map. Now, a lot of the data is missing because apparently there's no such thing as a 12-cylinder, 1.4-liter engine. That would be crazy. But these represent all the values that we actually have data for in our data frame, and the color of each point corresponds to the value of that cell. So, for example, here's the legend of what those colors mean. Black is somewhere around 12 MPG.
So if you have a 16-cylinder, 8-liter engine, that's going to have really horrible fuel efficiency, on average just 12 MPG. Okay, so that's how you read this thing. And you can see just by looking at it that as you go up to this corner of the plot, where you have a low number of cylinders and low engine displacement, those have very light colors because they're more fuel efficient. As you get down toward this corner, with lots of cylinders and lots of engine displacement, you get into worse and worse fuel efficiency. So this heat map makes it very easy to visualize how those MPG ratings change as a function of where they are in this plot. So that's a heat map. All right. If you want to try this out on your own, here's a little bit of a challenge for you. Try to explore the relationship between the number of gears a car has and its combined MPG rating. I want you to visualize these two dimensions of data in a bunch of different ways: do a scatter plot, do an lmplot, do a joint plot, do a box plot, and do a swarm plot. What conclusions can you draw from that? So before you scroll down, give it a try yourself. I left you some empty spots here to actually play with and give that a shot. No peeking ahead of time, but I do have my solution down below if you want to take a look when you're done and compare your results to mine. So give that a shot and hopefully get some results. If you do get stuck, feel free to scroll down and take a peek; my answers are down there. Okay, so have some fun with that, and I hope that makes Seaborn a little bit more real to you. Again, we're going to be using it quite a bit throughout this course. It's a very useful visualization library that is also good to look at. And there you have it.
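The pivot table behind the heat map from this lecture can be sketched with pandas alone. The rows below are made up, and the column names (Cylinders, "Eng Displ", CombMPG) are assumptions; the point is only to show how the mean aggregation fills each cell and why some cells come out empty.

```python
import pandas as pd

# Made-up cars; column names are assumptions standing in for the real data.
df = pd.DataFrame({
    "Cylinders": [4, 4, 4, 8, 8],
    "Eng Displ": [2.0, 2.0, 1.4, 5.0, 5.0],
    "CombMPG":   [30, 26, 33, 17, 15],
})

# Rows are cylinder counts, columns are engine displacements, and each
# cell holds the mean combined MPG of every car with that combination.
table = df.pivot_table(index="Cylinders", columns="Eng Displ",
                       values="CombMPG", aggfunc="mean")
print(table)

# With Seaborn installed, sns.heatmap(table) would color each cell by
# its value; combinations with no cars stay empty (NaN in the table).
```

Here the two four-cylinder, 2.0-liter cars average to 28 MPG in one cell, while the eight-cylinder, 1.4-liter cell has no data at all.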
19. Covariance and Correlation: Next, we're going to talk about covariance and correlation. So let's say I have two different attributes of something, and I want to see if they're actually related to each other or not. This will give you the mathematical tools you need to do so, and we'll dive into some examples and actually figure out covariance and correlation using Python. Next, let's talk about covariance and correlation. These are ways of measuring whether two different attributes are related to each other in a set of data, which can be a very useful thing to find out. So let's talk about covariance. Imagine we have a scatter plot here, and maybe each one of these data points represents a person that we measured, and we're plotting maybe their age on one axis versus their income on another. So each one of these dots would represent a person, where their x value represents their age and their y value represents their income. Okay, I'm totally making this up; this is fake data. Now if I had a scatter plot that looked like this, you'd see that these values tend to lie all over the place, and this would tell you that there's no real correlation between age and income based on this data, right? For any given age, there could be a huge range of incomes. They tend to be clustered around the middle, but we're not really seeing a very clear relationship between these two attributes of age and income. Now, in contrast, over here on the right, you can see there's a very clear linear relationship between age and income. So covariance and correlation give us a means of measuring just how tightly these things are correlated. I would expect a very low correlation or covariance for this data on the left, but a very high covariance and correlation for the data on the right.
So that's the concept: covariance and correlation measure how much these two attributes that I'm measuring seem to depend on each other. Measuring covariance mathematically is a little bit hard, but I'll try to explain it. It's really more important that you understand how to use it and what it means than to actually derive it. Think of the attributes of the data as high-dimensional vectors. What we're going to do on each attribute, for each data point, is compute the deviation from the mean at each point. So now I have these high-dimensional vectors where each data point, each person, if you will, corresponds to a different dimension. I have one vector in this high-dimensional space that represents all the deviations from the mean for one attribute, say age. And then I have another vector that represents all the deviations from the mean for some other attribute, like income. What I do then is take these vectors that measure the deviations from the mean for each attribute, and I take what's called the dot product between the two. Mathematically, that's a way of measuring the angle between these high-dimensional vectors. So if they end up being very close to each other, that tells me that these deviations are pretty much moving in lockstep with each other across these different attributes. And if I take that final dot product and divide by the sample size, that's how I end up with the covariance amount. Now, you're never going to have to actually compute this yourself the hard way. We'll see how to do this in Python, but conceptually, that's how it works. The problem with covariance is that it can be hard to interpret. If I have a covariance that's close to zero, well, I know that's telling me there's not much correlation between these variables at all, and a large covariance implies there is a relationship. But how large is large?
You know, depending on the units that I'm using, there might be very different ways of interpreting that data. So that's a problem that correlation solves. It normalizes everything by the standard deviation of each attribute, and by doing so, I can say very clearly that a correlation of negative one means there's a perfect inverse correlation: as one value increases, the other decreases, and vice versa. A correlation of zero means there's no correlation at all between these two sets of attributes, and a correlation of one would imply a perfect correlation, where these two attributes are moving in exactly the same way as you look at different data points. Remember, correlation does not imply causation. Just because you find a very high correlation value does not mean that one of these attributes causes the other. It just means there's a relationship between the two, and that relationship could be caused by something completely different. The only way to really determine causation is through a controlled experiment, which we'll talk about more later. Let's get our hands dirty computing covariance and correlation and see how we actually do this in Python. All right, let's get our hands dirty with covariance and correlation here with some actual Python code. Again, as I explained in the slides, you can think conceptually of covariance as taking these multi-dimensional vectors of deviations from the mean for each attribute and computing the angle between them as a measure of the covariance. And the math for doing that is a lot simpler than it sounds. We're talking about high-dimensional vectors. I mean, it sounds like Stephen Hawking stuff, but really, from a mathematical standpoint, it's pretty straightforward. So I'm going to do this the hard way. NumPy does have a method to just compute the covariance for you, and we'll talk about that later. But for now, I want to show that you can actually do this.
We'll do it from first principles. Covariance, again, is defined as the dot product, which is a measure of the angle between two vectors, of the vector of deviations from the mean for a given set of data and the deviations from the mean for another given set of data for the same data points. And then we divide that by n minus 1 in this case, because we're actually dealing with a sample. So the de_mean, or deviation-from-mean, function is taking in a set of data, x, actually a list, and it's computing the mean of that set of data. And here's a little bit of Python trickery for you. This syntax is saying I'm going to create a new list and go through every element of x, call it xi, and then return the difference between xi and the mean, xmean, for that entire data set. So this function returns a new list of data that represents the deviations from the mean for each data point. My covariance function will do that for both sets of data coming in, take the dot product, and divide by the number of data points minus one. Remember that thing about sample versus population? Well, that's coming into play here. And then we can just use those functions and see what happens. In this example, I'm going to fabricate some data to try to find a relationship between page speeds, that is, how quickly a page renders on a website, and how much people spend. For example, at Amazon, we were very concerned about the relationship between how quickly pages render and how much money people spend after that experience. We want to know, is there an actual relationship between how fast the website is and how much money people actually spend on the website? This is one way you might go about figuring that out. So let's just generate some normally distributed random data for both page speeds and purchase amounts. Since it's random, there's not going to be a real correlation between them. So we'll start with a sanity check.
We'll start off by scatter plotting this stuff, and you'll see that it tends to cluster around the middle because of the normal distribution on each attribute. But there's no real relationship between the two. For any given page speed there's a wide variety of amounts spent, and for any given amount spent there's a wide variety of page speeds. So there's no real correlation there, except for ones that come out through randomness or through the nature of the normal distribution. And sure enough, if we compute the covariance of these two sets of attributes, we end up with a very small value, negative 0.7. That's a very small covariance value, close to zero, which implies there's no real relationship between these two things. Now let's make life a little bit more interesting. Let's actually make the purchase amount a real function of page speed. We are keeping things a little bit random here, but we are creating a real relationship between these two sets of values. So for a given user, there is a real relationship between the page speeds they encounter and the amount that they spend. If we plot that out, we can see that it's actually this curve here where things tend to be tightly aligned. Things get a bit wonky near the bottom just because of how random things work out. But if we compute the covariance here, we end up with a much larger value, negative eight. And it's the magnitude of that number that matters. The sign, positive or negative, just implies a positive or negative correlation. But that value of eight says, hey, that's a much higher value than zero, so there is something going on there. Again, though, it's hard to interpret what eight actually means. So that's where correlation comes in, where we normalize everything by the standard deviations. So again, doing that from first principles, we can compute the correlation between two sets of attributes.
Compute the standard deviation of each, compute the covariance between these two things, and divide by the standard deviations of each data set. That gives us the correlation value, which is normalized to the range negative 1 to 1, and we end up with a value of negative 0.4, which tells us there is some correlation between these two things in the negative direction. It's not a perfect line; that would be negative one. But there's something interesting going on there. And again, a correlation coefficient of negative one means perfect negative correlation, zero means no correlation, and one means perfect positive correlation. Now, NumPy can actually compute correlation for you using the corrcoef function. So if we wanted to do this the easy way, we could just say numpy.corrcoef, page speeds, purchase amount. What that gives you back is an array with the correlation between every possible combination of the sets of data that you pass in. The way to read this is that the 1 implies there is a perfect correlation when comparing page speeds to itself and purchase amount to itself, which is expected. But when you start comparing page speed to purchase amount, or purchase amount to page speed, you end up with a value of about negative 0.4, which is roughly what we got when we did it the hard way. There are going to be little precision errors, but that's not really important. Now, we could force a perfect correlation by fabricating a totally linear relationship, and that's what we did in this example. Here we would expect the correlation to come out to negative one for a perfect negative correlation, and in fact, that's what we end up with. All right, so again, a reminder: correlation does not imply causality. Just because people might spend more if they have faster page speeds, maybe that just means that they can afford a better Internet connection.
Maybe that doesn't mean that there's actually a causation between how fast your pages render and how much people spend. But it tells you there's an interesting relationship that's worth investigating more. You cannot say anything about causality without running an experiment, but correlation can tell you what experiments you might want to run. So get your hands dirty. For this exercise, I want you to use the numpy.cov function. That's actually a way to get NumPy to compute covariance for you. We saw how to compute correlation using the corrcoef function, so go back and rerun these examples using the numpy.cov function and see if you get the same results or not. You should; they should be pretty darn close. So instead of doing it the hard way with the covariance function that I wrote from scratch, just use NumPy and see if you get the same results. Again, the point of this exercise is to get you familiar with using NumPy and applying it to actual data. So have at it and see where you get. So there you have it: covariance and correlation, both in theory and in practice. It's a very useful technique to have, so definitely remember that lecture. Let's move on.
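As a recap of this lecture, here is a sketch of the from-scratch covariance and correlation functions next to NumPy's own. The data is fabricated and the linear relationship between page speed and purchase amount is an assumption chosen for illustration, so the numbers will not match the ones quoted above. One deliberate choice: the correlation here uses the sample standard deviation (ddof=1), which is an assumption about how to normalize, picked so the result lines up with numpy.corrcoef.

```python
import numpy as np

def de_mean(x):
    """Return each value's deviation from the mean of x."""
    xmean = np.mean(x)
    return [xi - xmean for xi in x]

def covariance(x, y):
    """Sample covariance: dot product of the deviations over n - 1."""
    n = len(x)
    return np.dot(de_mean(x), de_mean(y)) / (n - 1)

def correlation(x, y):
    """Covariance normalized by both sample standard deviations."""
    return covariance(x, y) / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Fabricated data: purchase amount falls as pages get slower, plus noise.
rng = np.random.default_rng(0)
page_speeds = rng.normal(3.0, 1.0, 1000)
purchase_amount = 100.0 - 3.0 * page_speeds + rng.normal(0.0, 1.0, 1000)

print("covariance (scratch):", covariance(page_speeds, purchase_amount))
print("covariance (numpy):  ", np.cov(page_speeds, purchase_amount)[0, 1])
print("correlation (scratch):", correlation(page_speeds, purchase_amount))
print("correlation (numpy):  ", np.corrcoef(page_speeds, purchase_amount)[0, 1])

# A totally linear relationship should give a correlation of negative one.
print("perfect line:", correlation(page_speeds, 100.0 - 3.0 * page_speeds))
```

Both numpy.cov and numpy.corrcoef return a 2 x 2 matrix for two inputs, so the off-diagonal entry at [0, 1] is the value that compares page speed to purchase amount.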
20. Exercise: Conditional Probability: Let's talk about conditional probability. It's a pretty simple concept. It's basically the probability of something occurring given that something else, which it depends on, occurred first. A good real-world example: if you go to amazon.com and look at a feature like People Who Bought This Also Bought, or People Who Viewed This Also Viewed, you can think of that in terms of conditional probability. What are the odds of purchasing another item given that you purchased this other item first? Same concept. Now, the notation in conditional probability is probably the most confusing part, so we're going to try to walk you through it in this lecture. But it would help if you grab an extra cup of coffee or put on your thinking cap, whatever it takes to get yourself into your sharpest mental state, because this is one of the more challenging things to get through. With that, let's dive in and I'll try to make it as simple as possible. Let's talk about conditional probability. Again, it's a simple concept, but the notation can trip you up sometimes, so let's just dive into what that notation is and what it means. The basic concept of conditional probability is that if I have two events that depend on each other, I can make a statement about the probability of that second event occurring, given that the first event occurred. The notation that we're going to use here is twofold. There is P(A,B), and that means the probability of both A and B occurring, independently of each other. And then we have P(B|A). That's the probability of B given that A has occurred. So that implies some dependency between B and A.
And we can tie this all together using this handy-dandy equation here, where the probability of B given A, that is, the conditional probability of B given event A, is equal to the probability of A and B together divided by the probability of A. So you can use that to tease out the dependence between B and A. It'll make more sense with a real example, so let's take a look at one. Let's say that I give my students two tests, and overall, 60% of my students passed both tests. So if we call the tests A and B, P(A,B) would be 60%. However, the first test was easier; 80% of my students passed that one. So if B is the second test and A is the first test, P(A) would be 80% in this example, right? And see, this gets confusing pretty quickly with all the A's and B's and commas and pipes. But let's review again. 60% of the students passed both tests, so P(A,B) is 60%. The first test was easier; 80% passed that test, so P(A) is 80%. So now, how do I figure out the percentage of students who passed the first test who also passed the second? What's the probability of passing my second test, given that you passed the first test? That's conditional probability. So we're asking for P(B|A), the conditional probability of passing test two given that you passed test one, and we can compute that using the equation that we just saw. P(B|A), the conditional probability of B given A, is equal to the non-conditional probability P(A,B), which was 60%, over the probability of A, which is 80%. And if we do that division, we end up with 75%. So we can say that 75% is the conditional probability of passing the second test, given that you passed the first one. Make sense? Hit pause and digest this for a minute, because with all these letters and different punctuation marks and whatnot, it can get confusing.
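The two-test example can be checked in a few lines of Python. This is only a sketch of the arithmetic above; the variable names are made up, and exact fractions are used instead of floats so the division comes out to exactly three quarters.

```python
from fractions import Fraction

p_a_and_b = Fraction(60, 100)  # P(A,B): passed both tests
p_a = Fraction(80, 100)        # P(A): passed the first test

# P(B|A) = P(A,B) / P(A)
p_b_given_a = p_a_and_b / p_a

print(f"P(B|A) = {float(p_b_given_a):.0%}")  # prints "P(B|A) = 75%"
```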
So let me dive into a notebook with you, and we'll go through a bunch of other examples to try to make this notation, and how to handle all these things, second nature to you. And when we're done, I'll give you a little bit of an exercise so you can go and practice it yourself. Okay, for this exercise, I want you to open up the conditional probability exercise notebook in your course materials. I'll try to walk through this a little bit slowly, because it's sort of a tough thing to wrap your head around. What we're going to do in this activity is generate 100,000 random people, and let's say they're all customers on a big e-commerce website like Amazon. For each of these 100,000 people, we'll randomly assign them to a given age range: in their twenties, their thirties, their forties, all the way up to their seventies. And we're going to create a dependence between their likelihood of purchasing something and their age. Basically, we're going to say that the older you are, the more likely you are to buy something, so if you're young, you'll have a lower probability of purchasing something. Let's call purchasing something event E, and being in a given age range event F. That means we do have a dependency, a conditional probability, between E and F. All right, so let's walk through the code that sets up that random data set. I get a lot of questions from people about how this code works, so bear with me, guys, if you know Python already. For the people who are new to it, this does take some explaining, so I'm going to go through it line by line. We start off by importing random from the NumPy package. Nothing exciting there; that's just so we can generate random numbers within this little snippet of code. Next, random.seed sets a seed value for the random number generator.
The purpose of this line is to make sure that we get consistent results every time we run this code. Without it, you would get different results every time you ran this. But by having a consistent seed number, we get the same random numbers back each time we run it. The number zero is arbitrary. All that matters is that it's the same value being used each time. It could be 1234 or any number you want, as long as it's always the same one. We're just making sure we have some consistency in our results. Next, I'm setting up two Python dictionaries called totals and purchases. What these do is keep track of how many total people I have in each age range, the 20-year-olds, the 30-year-olds, the 40-year-olds and so forth, and how many purchases were made by people in each of those age ranges. So basically, I'm saying that initially I have zero people in the 20-year-old bucket, zero in the 30-year-old bucket, zero in the 40-year-old bucket and so on. And I have zero purchases from 20-year-olds, zero purchases from 30-year-olds, zero purchases from 40-year-olds and so on. This is just how I'm going to keep track of the total number of people and the total number of purchases associated with each age group. I will also keep track of the total number of purchases regardless of age, with the total purchases variable. Next, we're going to create a loop to iterate through the 100,000 random people that we're going to create. That underscore is just a placeholder. I could say "for x in range(100000)" or whatever you want, but since I'm not actually using that value anywhere within the loop, I can just use the underscore character instead. It means I don't really care what that value actually is each time through. I don't care that this is user number 1776; I can just discard that information. That's all.
The underscore just means I don't really care what the actual number is. So we're going to iterate through this loop 100,000 times, and each time we will assign an age decade to this person. random.choice will just randomly pick a value out of the list that we pass in, so it will randomly pick one of the numbers 20, 30, 40, 50, 60 or 70, evenly distributed. Each individual person has an even chance of being a 20-year-old or a 30-year-old or a 40-year-old, all the way up to a 70-year-old. Now here's where things get a little bit weird. Based on your age, we compute a purchase probability. We're taking your age and dividing it by 100 to figure out the odds of you actually purchasing something from our website. So, for example, if I'm in my twenties, I take 20 divided by 100, and that works out to 0.2, or 20%. So 20-year-olds will have a 20% chance of actually buying something, 30-year-olds will have a 30% chance of buying something, and so on. This is how we make younger people less likely to buy something than older people in our randomly generated data set. As we go through, we bump up the total for that age decade by one, meaning that we generated a new random person within this age range. And then we're saying: if random.random is less than that purchase probability, actually attribute a purchase to this person. How does that work? Well, random.random just randomly selects a value between zero and one. So if that random number is less than our purchase probability, we say this person actually bought something. Let's look at an example to make that a little bit more intuitive. Let's say that we have someone in their thirties. Their purchase probability will work out to 0.3, or 30%. So if our random number between zero and one comes out to be less than 0.3, that person is attributed a purchase. If it's greater than 0.3,
they did not buy something. So you see how that works. We're basically rolling the dice to see if this person bought something or not, given the overall probability of purchase for their age range. So when we're done, we've built up the total number of purchases across the entire data set. We're also keeping track of purchases for each individual age decade, and the total number of people in each age decade as well. We'll need all of these numbers to figure out things like P(E) and P(F) and P(E|F) and all that stuff. Let's go ahead and run this, and that generates our fake data set that has a dependence between age and purchases. Let's take a look at what we got. I would expect the totals for each age range to be roughly consistent, and they are: we have about 16,500 people in each age decade. So that's good, an even distribution, which is what we expect. But if we look at the purchases attributed to each age range, you can see that they increase with age. So the dependency that we tried to model in there is working nicely. About 3,000 20-year-olds purchased something, but almost 12,000 70-year-olds bought something, even though the two groups have about the same number of people. So we're seeing very clearly that there is a relationship between your age and your likelihood of purchasing something. Okay, so we have a nice little fake data set to work with here for conditional probability. We can also compute the total number of purchases across the entire data set, and that comes out to 45,012. And now we have the values we need for playing with conditional probability. Again, a lot of this is just getting your head around the notation and keeping track of what letter means what. We're going to call E purchasing something, and F being in a given age range. So let's start by computing P(E|F).
This is the conditional probability of making a purchase, E, given F, where we're going to call F being in your thirties. That's just an arbitrarily selected age range. So the probability of purchasing something, E, given that you're in your thirties, F: we can actually compute that directly. We can just figure out what fraction of 30-year-olds bought something. So let's go ahead and compute that: how many total purchases did we see from thirtysomethings, divided by how many thirtysomethings were in the data set. That works out to 0.299, almost 0.3, right? We can also independently compute P(E) and P(F). P(F) is the probability of being in your thirties overall in this data set. That's easy to compute. We just take the total number of 30-year-olds divided by the total number of people in our data set, and that works out to 16.6%, or 0.166. P(E) is just the overall probability of buying something regardless of your age, as if there were no dependency there at all. To compute P(E), we take the total number of purchases across everybody, regardless of age, divided by the total number of people overall, and that works out to 0.45, or 45%. So overall in our entire data set, taking age out of the equation, there's a 45% chance of buying something. All right, this is where you start to put on your thinking cap a little bit, so wrap your head around this statement: if E and F, that is, buying something and your age, were independent, then you would expect P(E|F) to be about the same as P(E), right? If there were no dependency between buying something and your age, you would expect the overall probability of buying something to be the same as the probability of buying something given your age, because your age shouldn't matter. F shouldn't matter. But we've seen that that's not true, right?
So P(E) we computed to be about 45%, but P(E|F) we computed earlier to be about 30%, or 0.299, whatever it is. These numbers are fairly different. That alone is telling us that E and F are dependent, that there is a conditional relationship between these two things. And we know that's the case in this example, because we artificially created a dependency between purchase probability and age. So that's one way to tease that out of the data right there. If you see that P(E) and P(E|F) are not equal, or if you want to use different letters, that P(A) is not equal to P(A|B) (whatever letters you want to use, it's just notation), then there might be a dependency going on there that you need to know about. All right, let's also compute P(E,F). Again, this is all about notation: P(E,F) is different from P(E|F). P(E,F) is the overall probability of being both in your thirties and buying something, measured across everybody, not just restricted to the population of people who are in their thirties. We can compute that easily enough. We just look at the total number of purchases from thirtysomethings over the total size of the data set, and that works out to about 5%, just under 0.05. While we're at it, we can also compute the product of P(E) and P(F). That's just multiplying the overall probability of buying something by the overall probability of being in your thirties, and it comes out to about 7.5%. Now, in statistics, when they talk about probability, you'll often see the relationship that P(E,F) is equal to the product of P(E) and P(F). But that is only true if E and F are independent.
We found here that P(E,F), the overall probability of being in your thirties and buying something out of the total data set, is about 0.05, but P(E) times P(F) is about 0.075. So when you have a dependency going on between these two variables, when there's a conditional probability going on, the relationship P(E,F) = P(E) times P(F) no longer holds. That's another way you can figure out that maybe there's a dependency going on here that's messing up your results. However, we can go back and check the equation that we gave back in the slides, and see if P(E|F) is in fact equal to P(E,F) over P(F). That's a way of computing conditional probability if you cannot compute it directly as we could in this example. And sure enough, we can confirm that it's true. P(E,F) is just the total number of purchases from 30-year-olds over the entire data set, and we divide that by P(F), which we computed earlier. That does work out to 0.29929, which is exactly the same as the P(E|F) that we computed way up at the top of this, right? Let's double check that: 0.29929, the same number that we got up here originally. So that's cool. The math works out. Wow. Okay, so this is basically a couple of ways to figure out if you have a dependency in your data that you might not have known about, and a way of computing conditional probability given other things that you might know. So anyway, let's do a little assignment here, a little bit of a challenge, if you will. Your task, should you choose to accept it, is to modify the code above such that the purchase probability does not vary with age. Remember, up in the first block we have this purchase probability that is a function of your age. Just make that a constant value instead, and see what that does to your results. If you do that, can you generate a new data set where you can show that P(E|F) is about the same as P(E)?
That would show you that there is no conditional relationship there. If you show that P(E|F) is the same as P(E), then there is no dependency between those two things, and that's a mathematical way to find that out. So give that a shot, see if you can prove that to yourself, and I'll show you my solution in the next lecture. So there we have some examples of using conditional probability. Again, the concept's not that hard. It's just really easy to get tripped up on all the notation, with all the pipes and commas meaning different things. But once you get used to it, it's not so bad. I hope you have a chance to dive into the homework assignment and try this little exercise yourself: remove that dependency and confirm that the conditional probability ends up wiping itself out in that case. I'll show you my solution in the next lecture.
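For reference, the walkthrough in this lecture can be sketched roughly as follows. This is a reconstruction from the spoken description, not a copy of the course notebook: the variable names (total_purchases, p_e_given_f and so on) are my own, and the exact numbers you get may differ from the ones quoted in the lecture depending on your NumPy version.

```python
from numpy import random

random.seed(0)  # arbitrary fixed seed, so every run produces the same data

# People and purchases per age decade, plus purchases regardless of age
totals = {20: 0, 30: 0, 40: 0, 50: 0, 60: 0, 70: 0}
purchases = {20: 0, 30: 0, 40: 0, 50: 0, 60: 0, 70: 0}
total_purchases = 0
n_people = 100000

for _ in range(n_people):  # "_" because the loop counter itself is never used
    age_decade = int(random.choice([20, 30, 40, 50, 60, 70]))
    purchase_probability = age_decade / 100.0  # older people buy more often
    totals[age_decade] += 1
    if random.random() < purchase_probability:  # roll the dice for a purchase
        total_purchases += 1
        purchases[age_decade] += 1

# E = purchasing something, F = being in your thirties
p_e_given_f = purchases[30] / totals[30]   # P(E|F), computed directly
p_f = totals[30] / n_people                # P(F)
p_e = total_purchases / n_people           # P(E)
p_e_and_f = purchases[30] / n_people       # P(E,F)

print("P(E|F)      :", p_e_given_f)
print("P(E)        :", p_e)
print("P(E,F)      :", p_e_and_f)
print("P(E) * P(F) :", p_e * p_f)
print("P(E,F)/P(F) :", p_e_and_f / p_f)
```

The two checks from the lecture show up in the printout: P(E|F) differs clearly from P(E), and P(E,F) differs from P(E) times P(F), while P(E,F) / P(F) reproduces P(E|F).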
21. Exercise Solution: Conditional Probability: Did you do your homework? I hope so. Let's take a look at my solution to the problem of how conditional probability tells us whether there's a relationship between age and purchase probability in a fake data set. Let's go. Okay, let me walk you through my solution here, and this should be pretty straightforward. Again, the objective was to remove the dependency between your age and your likelihood of purchasing, and see if you can actually tease out of the math that the dependency, that conditionality, went away. If you recall from our walk through the code that generates our data set, we used to have a line here on purchase probability that generated it based on your age. It basically took your age decade and divided it by 100 to create that relationship between your age and how likely you are to purchase something. What I've done here instead is just hard-code that to 40%, so there's no longer any relationship between your age and your likelihood of purchasing something. So let's go ahead and regenerate that data set, removing the dependency between age and purchase probability. Now we can compute P(E|F) again. That's the probability of making a purchase given your age, for some age group, and we'll pick 30-year-olds again just to be consistent. So again, we divide the total number of purchases by 30-year-olds by the total number of 30-year-olds, and we end up with a number of about 40%. And we can independently compute P(E), which is just the probability of purchasing something overall, regardless of your age, and that also comes out to 40%. So 39.8% and 40.0%, pretty darn close. In this case, P(E) is roughly equivalent to P(E|F), a little bit different just because of random variation. But it's close enough that we can say that E and F are likely independent variables in this case.
So the math told us that P(E|F) is the same as P(E), more or less, and sure enough, that reflects the fact that we removed the tie between age and purchase probability. I hope you were able to reach the same result and learn a little something along the way about conditional probability and how to tease the dependencies between the various features of your data set out of the math. Hopefully you arrived at a similar solution on your own. If not, go back and study my solution. It's right there in the data files for this course, if you need to open it up, study it and play around with it. And with that behind us, let's move on to Bayes' theorem.
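One way to sketch this solution is shown below. It is not the course's solution file; the simulate helper and its parameter names are hypothetical, introduced here only so the purchase rule can be swapped out in one place. The constant 0.4 is the 40% mentioned in the lecture.

```python
from numpy import random

def simulate(purchase_probability_for, n_people=100000, seed=0):
    """Generate the fake customers; the purchase rule is passed in as a function."""
    random.seed(seed)
    ages = [20, 30, 40, 50, 60, 70]
    totals = {age: 0 for age in ages}
    purchases = {age: 0 for age in ages}
    total_purchases = 0
    for _ in range(n_people):
        age_decade = int(random.choice(ages))
        totals[age_decade] += 1
        if random.random() < purchase_probability_for(age_decade):
            total_purchases += 1
            purchases[age_decade] += 1
    return totals, purchases, total_purchases

# Constant 40% purchase probability: age no longer influences purchasing
totals, purchases, total_purchases = simulate(lambda age_decade: 0.4)

p_e_given_f = purchases[30] / totals[30]          # P(E|F) for thirtysomethings
p_e = total_purchases / sum(totals.values())      # P(E) overall

print("P(E|F):", p_e_given_f)
print("P(E)  :", p_e)
```

Passing `lambda age_decade: age_decade / 100.0` instead brings back the original age-dependent data set, which makes it easy to compare the two cases side by side.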
22. Bayes' Theorem: So now that you understand conditional probability, you can understand how to apply Bayes' theorem, which is based on conditional probability. It's a very important concept, especially if you're going into the medical field, but it's broadly applicable too, and you'll see why in a minute. Sometimes it can tell you, very quantitatively, when people are misleading you with statistics. So let's see how that works. You hear about Bayes' theorem a lot, but not many people really understand what it means or its significance. So let's talk about Bayes' theorem at a high level. Bayes' theorem is simply this: the probability of A given B is equal to the probability of A times the probability of B given A, over the probability of B. That is, P(A|B) = P(A) P(B|A) / P(B). You can substitute whatever you want for A and B. One common example is drug testing. We might ask, what's the probability of being an actual user of a drug, given that you tested positive for it? The reason Bayes' theorem is important is that it calls out that this very much depends on both the probability of A and the probability of B. The probability of being a drug user, given that you tested positive, depends very much on the base overall probability of being a drug user and the overall probability of testing positive. It also means that the probability of B given A is not the same thing as the probability of A given B. The probability of being a drug user, given that you tested positive, could be very different from the probability of testing positive, given that you are a drug user. So you can see where this is going. There's a very real problem where diagnostic tests in medicine, or drug tests, yield a lot of false positives, right? You can still say that the probability of a test detecting a user could be very high.
But it doesn't necessarily mean that the probability of being a user, given that you tested positive, is high. Those are two different things, and Bayes' theorem allows you to quantify that difference. So let's nail that example home a little bit more. Again, a drug test is a common example of applying Bayes' theorem to prove a point: even a highly accurate drug test can produce more false positives than true positives. In our example here, we're going to come up with a drug test that can accurately identify users of a drug 99% of the time, and that accurately gives a negative result for 99% of non-users. But only 0.3% of the overall population actually uses the drug in question. So we have a very small probability of actually being a user of the drug, and what seems like a very high accuracy of 99% isn't actually high enough, right? We can work out the math. Let's let event A mean that you're a user of some drug, and event B mean that you tested positive for the drug using this drug test. We need to work out the probability of testing positive overall. We can do that by adding the probability of testing positive if you are a user and the probability of testing positive if you're not a user, each weighted by how common that group is: 0.99 times 0.003 plus 0.01 times 0.997, which works out to about 1.3% in this example. So that's our probability of B, the probability of testing positive for the drug overall, without knowing anything else about you. Now, if you do the math, the probability of being a user of the drug given that you tested positive (in other words, the probability that a positive test result means you're actually a drug user) works out to the probability of being a user of the drug overall, which is 0.3% (we know that 0.3% of the population uses the drug), times the probability of B given A, the probability of testing positive given that you're a user. And again, this test has what sounds like a very high accuracy of 99%.
So we have the 0.3% of the population that uses the drug, times the accuracy of 99%, divided by the probability of testing positive overall, which worked out to 1.3%. So the probability of being an actual user of this drug, given that you tested positive for it, is only 22.8%. So even though this drug test is accurate 99% of the time, it's still giving a false result in most of the cases where you test positive. People overlook this all the time. So if there's one lesson to be learned from Bayes' theorem, it's to always take these sorts of things with a grain of salt. Apply Bayes' theorem to these actual problems, and you'll often find that what sounds like a high accuracy rate can actually be yielding very misleading results if you're dealing with a low overall incidence of a given problem. We see the same thing in cancer screening and other sorts of medical screening as well. It's a very real problem, and a lot of people are getting very real and very unnecessary surgery as a result of people not understanding Bayes' theorem. So if you're going into the medical profession with big data, please, please, please remember this lecture. So that's Bayes' theorem. Always remember that the probability of something given something else is not the same thing as the other way around, and it actually depends a lot on the base probabilities of the two things that you're measuring. So how much you can trust a positive drug test depends a lot on the overall probability of being a drug user in the population, not just the accuracy of the test. That's a very important thing to keep in mind, so always look at your results with that in mind. Bayes' theorem gives you the tools to quantify that effect. I hope it proves useful.
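The drug test arithmetic from this lecture can be checked with a few lines of Python. This is just a sketch of the calculation, with variable names of my own choosing; the three input numbers (99%, 99% and 0.3%) are the ones given in the lecture.

```python
# A = "is a user of the drug", B = "tested positive"
p_user = 0.003                 # P(A): 0.3% of the population uses the drug
p_pos_given_user = 0.99        # P(B|A): the test catches 99% of users
p_neg_given_nonuser = 0.99     # the test clears 99% of non-users
p_pos_given_nonuser = 1 - p_neg_given_nonuser  # false positive rate

# P(B): overall probability of a positive test, users plus non-users,
# each weighted by how common that group is
p_pos = p_pos_given_user * p_user + p_pos_given_nonuser * (1 - p_user)

# Bayes' theorem: P(A|B) = P(A) * P(B|A) / P(B)
p_user_given_pos = p_user * p_pos_given_user / p_pos

print(f"P(B)   = {p_pos:.4f}")
print(f"P(A|B) = {p_user_given_pos:.3f}")
```

Without rounding P(B) to 1.3% along the way, the result comes out at roughly 23% rather than exactly 22.8%, but the conclusion is the same: most positive results in this scenario are false positives.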
23. Linear Regression: Let's talk about regression analysis, a very popular topic in data science and statistics. All it is is trying to fit a curve, some sort of function, to a set of observations. Then you can use that function to predict new values that you haven't seen yet. That's all there is to it. So let's start off by talking about the simplest form of regression analysis, linear regression. You hear a lot about regression analysis in the field of data science. It sounds fancy, but it's actually a very simple concept. Linear regression is just fitting a straight line to a set of observations. That's it. That's all there is to it. So, for example, let's say that I have a bunch of people that I measured, and the two features that I measured for these people are their weight and their height. I'm showing the weight on the X axis and the height on the Y axis, and I can plot all these data points, people's weight versus their height. And I can say, hmm, that looks like a linear relationship, doesn't it? Maybe I can fit a straight line to it and use that to predict new values. And that's what linear regression does. In this example, I might end up with a slope of 0.6 and a Y intercept of 130.2, and that slope and Y intercept define the straight line that best fits the data I have. And I can use that line to predict new values. You can see that the weights I observed only went up to people who weighed 100 kilograms. What if I had someone who weighed 120 kilograms? Well, I could use that line to figure out where the height would be for someone weighing 120 kilograms, based on all this previous data. That's it. I don't know why they call it regression.
Regression kind of implies that you're doing something backwards, and I guess you can think of it in terms of creating a line to predict new values based on observations you made in the past, backwards in time. But it seems like a little bit of a stretch. It's just a confusing term, quite honestly, and yet another way that we obscure very simple concepts using very fancy terminology. So don't let linear regression trip you up by sounding fancy. All it is is fitting a straight line to a set of data points. How does it work? Well, internally it uses a technique called least squares, or ordinary least squares, also known as OLS. You might see that term tossed around as well. The way it works is that it tries to minimize the squared error between each point and the line, where the error is just the distance between each point and the line that you have. So we sum up all the squares of those errors. That sounds a lot like when we computed variance, right? Except instead of being relative to the mean, it's relative to the line that we're defining. We can measure the variance of the data points from that line, and by minimizing that variance, we can find the line that fits best. Now, you'll never have to actually do this yourself the hard way. But if you had to for some reason, or if you're just curious about what happens under the hood, this describes the overall algorithm for how you would go about computing the slope and Y intercept yourself if you needed to. It's really not that complicated. The slope just turns out to be the correlation between the two variables, times the standard deviation in Y, divided by the standard deviation in X. It might seem a little bit weird that standard deviation just kind of creeps into the math naturally there. But remember, correlation had standard deviation baked into it as well, so it's not too surprising that we have to reintroduce that term.
The intercept can then be computed as the mean of Y minus the slope times the mean of X. And again, even though that's really not that difficult, Python will do it all for you. The point is that these aren't complicated things to run; they can actually be done pretty efficiently. So just remember, least squares minimizes the sum of squared errors from each point to the line. Another way of thinking about linear regression is that you're defining a line that represents the maximum likelihood of an observation lying there, that is, the maximum probability of the Y value being something for a given X value. People sometimes call this maximum likelihood estimation, and it's just another example of people giving a fancy name to something that's very simple. So if you hear someone talk about maximum likelihood estimation in this context, they're really talking about regression. They're just trying to sound really smart. But now you know that term too, so you too can sound smart. There is more than one way to do it. We talked about ordinary least squares as a simple way of fitting a line to a set of data, but there are other techniques as well, gradient descent being one of them. It works best with three-dimensional data, and it kind of tries to follow the contours of the data for you. It's very fancy and obviously a little bit more computationally expensive, but Python does make it easy for you to try it out if you want to compare it to ordinary least squares. Usually, though, least squares is a perfectly good choice for doing linear regression, and it's always a legitimate thing to do. But if you do run into gradient descent, you will know that it's just an alternate way of doing linear regression, and it's usually seen with higher-dimensional data. So how do I know how good my regression is? How well does my line fit my data? Well, that's where R
squared comes in. R squared is also known as the coefficient of determination. Again, someone trying to sound smart might call it that, but usually it's called R squared. It is the fraction of the total variation in Y that is captured by your model. So how well does your line follow the variation that's happening? Are we getting an equal amount of variance on either side of the line or not? That's what R squared is measuring. To actually compute the value, take one minus the sum of the squared errors over the sum of the squared variations from the mean. So it's not very difficult to compute, but again, Python will give you functions that just compute it for you, so you'll never have to actually do that math yourself. As for how to interpret R squared, you'll get a value that ranges from zero to one. Zero means your fit is terrible; it doesn't capture any of the variance in your data. One is a perfect fit, where all of the variance in your data gets captured by this line, and all the variance you see on either side of the line should be the same in that case. So zero is bad, one is good. That's all you really need to know, and something in between is something in between. A low R squared value means it's a poor fit, and a high R squared value means it's a good fit. And, as you'll see in the coming lectures, there is more than one way to do regression. Linear regression is one of them. It's a very simple technique, but there are other techniques as well, and you can use R squared as a quantitative measure of how good a given regression is for a set of data points, and then use that to choose the model that best fits your data. Okay, so let's play with it and actually compute some linear regression and R squared. Let's have some fun with linear regression, hands on. Go ahead and open up the linear regression IPython notebook file and follow along with me, if you will.
You'll want to play around with this to get a good feel for it. We'll start by creating a little bit of Python code that generates some random-ish data that is, in fact, linearly correlated. In this example, I'm going to fake some data about page rendering speeds and how much people purchase, just like in a previous example. We'll fabricate a linear relationship between the amount of time it takes for a website to load and the amount of money people spend on that website. What I've done here is make a random, normal distribution of page speeds centered around three seconds with a standard deviation of one second, and I've made the purchase amount a linear function of that. I'm making it 100 minus the page speeds, plus some normal random variation around them, times three. And if we scatter that, we can see what the data ends up looking like. You can see just by eyeballing it that there's definitely a linear relationship going on there, and that's because we did hard-code a real linear relationship in our source data. So let's see if we can tease that out and find the best-fit line using ordinary least squares. We talked about how to do ordinary least squares and linear regression in the slides, but you don't have to do any of that math yourself, because the SciPy package has a stats package that you can import. So you can just say "from scipy import stats," and then you can call stats.linregress on your two features. I have a list of page speeds and a corresponding list of purchase amounts. linregress gives me back a bunch of stuff: the slope and the intercept, which are what I need to actually define my best-fit line. It also gives me the R value, from which we can get R squared to measure the quality of that fit, and a couple of other things that we'll talk about later in the course. For now, we just need the slope, the intercept and the R value. So let's go ahead and run these.
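As a rough sketch of the steps just described: the code below fabricates the page speed and purchase data and fits a line with stats.linregress. The seed value, the variable names, the size of the data set (1,000 points) and the 0.1 standard deviation of the noise are assumptions made for illustration; the lecture does not state them. I am also reading "100 minus the page speeds, plus some random variation, times three" as 100 - (page speed + noise) * 3.

```python
import numpy as np
from scipy import stats

np.random.seed(2)  # assumed seed, only here to make the sketch repeatable

# Page speeds: normally distributed around 3 seconds, standard deviation 1 second
page_speeds = np.random.normal(3.0, 1.0, 1000)

# Purchase amount as a linear function of page speed, with a little noise
# (noise scale 0.1 is an assumption)
purchase_amount = 100 - (page_speeds + np.random.normal(0, 0.1, 1000)) * 3

# Ordinary least squares fit of purchase amount against page speed
result = stats.linregress(page_speeds, purchase_amount)
slope, intercept, r_value = result.slope, result.intercept, result.rvalue
r_squared = r_value ** 2

def predict(x):
    """Predicted purchase amount for a page speed (or array of page speeds) x."""
    return slope * x + intercept

fit_line = predict(page_speeds)  # Y values of the best-fit line at each observed X

print("slope:", slope, "intercept:", intercept, "R squared:", r_squared)
```

With this setup the fitted slope should land near -3 and the intercept near 100, since those are the values baked into the fake data. To draw the picture described in the lecture, you could pass page_speeds and purchase_amount to matplotlib's scatter and then plot fit_line over the same X values.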
So there is my line, and let's go ahead and find the linear regression best fit. Now the R squared value of the line that we got back is 0.99. That's almost 1.0, so that means we have a really good fit, which isn't too surprising because we made sure there was a real linear relationship in this data. Even though there is some variance around that line, our line captures that variance. So we have roughly the same amount of variance on either side of the line, which is a good thing. It tells us that we do have a linear relationship and our model is a good fit for the data that we have. Let's actually plot that line. So this little bit of code will create a function to draw that red best fit line alongside the data. So a little bit more of Matplotlib magic here: we're going to make a fit line list, and we're going to use this predict function we wrote to take the page speeds, which is our X axis, and create the Y values from that. So instead of taking the observations for amount spent, we're going to find the predicted ones just using the slope times X plus the intercept that we got back from the linregress call above. So basically, we're going to do a scatter plot like we did before to show the raw data points, the observations. And then we're also going to call plot on that same pyplot instance using our fit line that we created using the line equation that we got back, and show them both together. Do that, and it looks like that. So you can see that our line is in fact a great fit for our data. It goes right smack down the middle, and all you need to predict new values is this predict function. So, given a new, previously unseen page speed, we could predict the amount spent just using the slope times the page speed plus the intercept. That's all there is to it. So, time to get your hands dirty. Try increasing the random variation in the test data and see if that has any impact. Remember, R squared is a measure of the fit.
How much do we capture the variance? So change the amount of variance, and see if it actually makes a difference or not. That's linear regression. Pretty simple concept. All we're doing is fitting a straight line to a set of observations, and then we can use that line to make predictions of new values. That's all there is to it. But why limit yourself to a line? There are other types of regression we could do that are more complex. We'll do that next.
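As a starting point for that exercise, here is a hedged sketch that wraps the fit in a helper, defines a predict function of the slope-times-x-plus-intercept form, and compares R squared at two noise levels. The helper name fit_and_score, the seed, and the specific noise values 0.1 and 1.0 are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

def fit_and_score(noise_std, n=1000, seed=0):
    """Generate fake page-speed data with the given noise and fit a line."""
    rng = np.random.default_rng(seed)
    page_speeds = rng.normal(3.0, 1.0, n)
    purchase_amount = 100 - (page_speeds + rng.normal(0, noise_std, n)) * 3
    result = stats.linregress(page_speeds, purchase_amount)
    return result.slope, result.intercept, result.rvalue ** 2

def predict(x, slope, intercept):
    """Predicted amount spent for a page speed x, using the fitted line."""
    return slope * x + intercept

slope, intercept, r2_low_noise = fit_and_score(noise_std=0.1)
_, _, r2_high_noise = fit_and_score(noise_std=1.0)

# Prediction for a previously unseen page speed of 2 seconds.
predicted_spend = predict(2.0, slope, intercept)
print(r2_low_noise, r2_high_noise, predicted_spend)
```

With more random variation around the line, the straight line explains a smaller share of the total variance, so the second R squared comes out lower than the first.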
24. Polynomial Regression: So we talked about linear regression. Polynomial regression is our next topic, and that's using higher order polynomials to fit your data. Sometimes your data might not really be appropriate for a straight line. That's where polynomial regression comes in. Let's dive in. All right. We talked about linear regression earlier, where we fit a straight line to a set of observations. Let's talk about polynomial regression, which is a more general case of regression. So why limit yourself to a straight line? Maybe your data doesn't actually have a linear relationship. Maybe there's some sort of a curve to it, right? That happens pretty frequently. Not all relationships are linear, and linear regression is just one example of a whole class of regressions that we can do. So if you remember, the linear regression line that we ended up with was of the form y equals m x plus b, where we got back the values m and b from our linear regression analysis, from ordinary least squares or whatever method you choose. Now this is just a first order, or first degree, polynomial, and the order or the degree is the highest power of x that you see. So that's the first order polynomial. But we could also use a second order polynomial, and that would look like y equals a x squared plus b x plus c. And if we were doing a regression using a second order polynomial, we would get back values for a, b and c. Or we could do a third order polynomial that has a x cubed plus b x squared plus c x plus d. And the higher the orders that you get, the more complex the curves you can represent. Right? So the more powers of x you have blended together, the more complicated shapes and relationships you can get. But more degrees isn't always better. Usually there's some natural relationship in your data that isn't really all that complicated. And if you find yourself throwing very large degrees at fitting your data, you might be overfitting.
Okay, so if you have data that's kind of all over the place and has a lot of variance, you can go crazy and create this line that just goes up and down to try to fit that data as closely as it can. But in fact, that doesn't represent the intrinsic relationship of that data. It doesn't do a good job of predicting new values. So always start by just visualizing your data and think about how complicated this curve really needs to be. Now you can use R squared to measure how good your fit is. But remember, that's just measuring how well this curve fits your training data, the data that you're using to actually make your predictions based off of. It doesn't measure your ability to predict accurately going forward. Later on, we'll talk about a technique for preventing overfitting called train/test. But for now, you're just going to have to eyeball it and make sure that you're not overfitting and throwing more degrees at a function than you need to. This will make more sense when we do an example. Fortunately, NumPy has a polyfit function that makes it super easy to play with this and experiment with different results. So let's go take a look. Time for fun with polynomial regression. I really do think it's fun, by the way. It's kind of cool, seeing all that high school math actually coming into some practical application. Go ahead, open the polynomial regression IPython notebook and let's have some fun. So let's create a new relationship between our page speeds and purchase amount fake data. And this time we're going to create a more complex relationship that's not linear. We're going to take the page speeds and make the purchase amount some function of a division by the page speeds. And if we do a scatter plot, we end up with this.
By the way, if you're wondering what this np.random.seed line does, that creates a random seed value, and it means that when I do subsequent random operations, they will be deterministic. So by doing that, I can make sure that every time I run this bit of code, I end up with the same exact results. And that's going to be important later on, because I'm going to have you come back and actually try different fits to this data to compare the fits that you get. So it's important that you're starting with the same initial set of points. So there we have it. You can see that that's not really a linear relationship. We could try to fit a line to it, and it would be okay for a lot of the data, maybe down here, but not so much here. We really have more of an exponential curve. Now it turns out NumPy has a polyfit function that allows you to fit any degree polynomial you want to this data. So, for example, we could say our x axis is an array of the page speeds that we have, and our y axis is an array of the purchase amounts that we have. We can then just call np, which is a shortcut for NumPy, dot polyfit with x, y and 4, meaning that we want a fourth degree polynomial fit to this data. So let's go ahead and run that. It runs pretty quickly, and we can then plot it. So we're going to create a little graph here that plots our scatter plot of original points versus our predicted points, and it looks like that. So at this point, it looks like a reasonably good fit. What you want to ask yourself, though, is: am I overfitting? Does my curve look like it's actually going out of its way to accommodate outliers? And not really. I don't really see a whole lot of craziness going on. If I had a really high order polynomial, it might swoop up here to catch that one and then swoop down here to catch that one, and get a little bit more stable through here, where we have a lot of density.
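A sketch of that fourth degree fit is below. The seed value of 2 and the normal distribution used for the numerator (mean 50, standard deviation 10) are assumptions for illustration; the transcript only says the purchase amount is some function divided by the page speeds. np.polyfit returns the coefficients from the highest power down, and np.poly1d turns them into a callable polynomial.

```python
import numpy as np

np.random.seed(2)  # assumed seed; any fixed value keeps the points repeatable

page_speeds = np.random.normal(3.0, 1.0, 1000)
# Assumed nonlinear relationship: some random amount divided by page speed.
purchase_amount = np.random.normal(50.0, 10.0, 1000) / page_speeds

x = np.array(page_speeds)
y = np.array(purchase_amount)

# Fit a 4th degree polynomial; coefficients come back highest power first.
coefficients = np.polyfit(x, y, 4)
p4 = np.poly1d(coefficients)

# Evaluate the fitted curve on an evenly spaced grid, e.g. for plotting.
grid = np.linspace(x.min(), x.max(), 100)
fitted = p4(grid)
print(coefficients)
```

Plotting grid against fitted on top of a scatter of x and y gives the kind of picture described here, where you can eyeball whether the curve is chasing outliers.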
And maybe then it could potentially go all over the place trying to fit this last set of data here. So maybe it would go up and down like a roller coaster, for example. If you see that sort of nonsense, you know you have too many orders, too many degrees in your polynomial, and you should probably bring it back down, because although it fits the data that you observed, it's not going to be useful for predicting data you haven't seen. So imagine I have some curve that swoops way up here and then back down again to fit these data points. My prediction for something in between there isn't going to be accurate, right? It really should be in the middle here. So again, later in the course we'll talk about more principled means of detecting that overfitting. But for now, just eyeball it. Now we can measure the R squared error. By taking the y and the predicted values, we have an r2_score function in sklearn.metrics, from scikit-learn, that we can use that computes that for us. So basically, it compares a set of observations to a set of predictions and computes R squared for you with just one line of code, and our R squared score for this turns out to be 0.829, which isn't too bad. Remember, zero is bad, one is good. 0.82 is pretty close to one, you know, not perfect. And intuitively that makes sense. You can see that our line is pretty good in this section of the data, but not so good out here and not so good up here, and 0.82 sounds about right. So I want you to get down and dirty with this stuff. Try different orders of polynomial. So go back up here to where we ran the polyfit function and try different values there besides four. You could use one, and that would go back to a linear regression. Or you could try some really high amount, like eight, and maybe you'd start to see overfitting, so see what effect that has. You're going to want to change that. For example, let's go to a third degree polynomial.
Just keep hitting run to go through each step, and you can see the effect it has. So our third degree polynomial is definitely not as good of a fit. And if you actually measure the R squared error, it's actually worse quantitatively. But if I go too high, I might start to see overfitting. So just have some fun with it, play around with different values and get a sense of what different orders of polynomial do to your regression line here. Get your hands dirty and try to learn something. So that's polynomial regression. Again, you need to make sure that you don't throw more degrees at the problem than you need to. Use just the right amount to find what looks like an intuitive fit to your data. Too many can lead to overfitting; too few can lead to a poor fit. So you can use both your eyeballs, for now, and the R squared metric to figure out what the right number of degrees is for your data. Let's move on.
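One way to run that experiment in a single pass is sketched below: fit several degrees and score each with r2_score from scikit-learn. The seed and the data-generating numbers are the same illustrative assumptions as before, not values confirmed by the lecture, so the scores you see may differ from the ones quoted.

```python
import numpy as np
from sklearn.metrics import r2_score

np.random.seed(2)  # assumed seed for repeatable points
x = np.random.normal(3.0, 1.0, 1000)
y = np.random.normal(50.0, 10.0, 1000) / x  # assumed nonlinear relationship

scores = {}
for degree in (1, 3, 4, 8):
    model = np.poly1d(np.polyfit(x, y, degree))
    # r2_score takes the observed values first, then the predictions.
    scores[degree] = r2_score(y, model(x))

for degree, score in scores.items():
    print(degree, round(score, 3))
```

Keep in mind that this scores each curve against the same points it was fitted to, so a higher degree will not score worse here even when it is overfitting. That is why the eyeball check still matters until train/test comes along.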
25. Multiple Regression: Let's dive into multiple regression. That's just a regression that takes more than one variable into account, more than one feature. The concept is actually pretty simple. It's just answering the question: what if I have more than one variable influencing the thing that I'm trying to predict? So basically I'm doing a regression where I don't just have one feature that I'm measuring to try to predict some value. I have many features that might come together. So for an example, that might be predicting the price of a car based on its many attributes. The car has a lot of different things you can measure that might influence its price, such as its mileage, its age, how many cylinders it has, how many doors it has, things like that. And you can actually take all those into account and roll that into one big model, with many variables as part of it. Now, as is often the case in data science, there is some confusing terminology here. In addition to multiple regression, which is using multiple features to predict a single value, we also have the concept of multivariate regression. And you would think that would mean the same thing, but it doesn't. Typically when we talk about multivariate regression, we're talking about not only having multiple feature attributes that we're trying to use to make a prediction, but also trying to predict more than one thing at the same time. So maybe I'm trying to predict not only the price of a car based on its mileage and age and number of doors. I'm also trying to predict how long it will take to sell it, or something like that. That would be an example of multivariate regression, where we have multiple things we're trying to predict in addition to multiple features being used to make those predictions. Either way though, the way we do it is actually pretty simple. Instead of a single coefficient attached to some single feature variable, we can have multiple terms with multiple variables.
So we can say that we can predict the value, price, based on some constant value alpha, plus some coefficient called beta one times your first feature, which say could be mileage, plus some coefficient beta two, which might be multiplying with some other feature like the age of the car, plus beta three times the number of doors, whatever you want to do. And those coefficients are just measuring how important each factor is to the actual end result. Now this assumes that all of your features are normalized going into it, so you can actually compare those coefficients together fairly. If they're not normalized, that coefficient will also be working to scale that feature into the final result as well. And that can also be informative if you actually work out the values of beta one, beta two, beta three, and whatever else you might have. That can tell you a little bit about what features are actually important for your model. So if you end up with a very low coefficient for a given feature after things are normalized, that might be nature's way of telling you that that feature isn't actually very important for predicting the thing that you're trying to predict. And that can help you to simplify your model by eliminating feature data that you don't need. So that's a very useful thing. That's called feature selection, and it's often a very important part of building a good machine learning model. Now all of this still uses least squares. So in our notebook we're going to use something called OLS, which stands for ordinary least squares, and it can handle multiple features like this. So we can still measure the fit of this thing overall using R-squared. Nothing is different there. And another thing that we need to point out is that this whole thing assumes that there's no dependency between these different features. Notice that I'm treating all these features independently, with their own coefficients.
So if there is in fact a relationship between these features, this model will not capture that. And this is actually an example of where that would probably be the case. For example, the mileage on the car would probably be highly correlated to the age of the car, and this model will not capture that relationship. In fact, you'd probably be just fine using mileage or age independent of each other. But this could at least tell you which one of those is more important to keep. So with that, let's dive in and actually fire up a notebook and see how it works. There is something called the statsmodels package that makes things easy, and it offers the OLS model that we can just use to go off and make it chug away and do it all for us. So let's make that example of multiple regression real. Go ahead and open up the multiple regression notebook file here, and you should be seeing something like this. All right, let's see multiple regression in action. Fortunately, the statsmodels package has an OLS regression model that can handle multiple regression. It's pretty easy to use, although there are a few quirks that we'll talk about here that you need to know about. What we're going to try to do is predict car prices using multiple regression on various attributes of the cars. For example, the mileage, the number of cylinders and the number of doors. And I do have a real dataset for you here to play with that I've uploaded to our website. So the first thing we're going to do is import pandas and call its read_excel function to load up this Excel spreadsheet of a bunch of data about cars and their attributes and what they sold for. And we're going to load that up into a DataFrame called df. So let's go ahead and hit Shift-Enter to do that. Now if you get a failure there, just try it again. Sometimes accessing the internet from a notebook can be a little bit unreliable, but usually if you just try it again, things will catch up and it will work the second time.
All right, so let's try to visualize this data. The first thing you usually want to do when you're dealing with a new dataset is to inspect the data and make sure that you can be comfortable with it and manipulate it and get information out of it, and make sure that it's what you expect. So that's all we're going to do in this next block. We're going to load up matplotlib so we can plot things, and we'll import the NumPy package so we can manipulate our data. And we will create a new df1 DataFrame that just extracts the mileage and price features from our original DataFrame, the one from the Excel spreadsheet that we loaded up as df. So all we're going to try to do is see: is there a relationship between mileage and price? You would think there would be, right? You would think that higher mileage cars in general would cost less than low mileage cars. So let's see if that's true. Let's bucket this up. So we're going to create some bins, and we're going to call np.arange. What this is going to do is break up our data into 10,000 mile chunks between 0 and 50,000 miles. Got it? So this is going to give me back ranges of data between 0 and 10,000 miles, 10,000 and 20,000 miles, and so on up. And then we will create these groups: take that mileage and price feature data, group it by those bins that we created, and compute the mean for each one of those bins. So now groups is going to contain the mean price for each one of those ranges of mileages. We'll print that out, make sure that it looks reasonable, and then we'll plot it and see what we got. So let's hit Shift-Enter. And you can see here that we do have those ranges we expected. So between 0 and 10,000 miles, the mean mileage was 4,588 and the mean price was $24 thousand. Work your way up to 30 to 40 thousand miles and the mean price drops down to $19,463. And if you plot it, then you see what you expect. Higher mileage cars in general cost less than low mileage cars.
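Here is a minimal sketch of that bucketing step. Since the spreadsheet itself isn't reproduced here, the sketch uses a tiny made-up table with hypothetical Mileage and Price columns; the numbers are invented purely to show the mechanics, and the use of pd.cut for the grouping is an assumption about how the bins are applied.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real car data; values are invented.
cars = pd.DataFrame({
    "Mileage": [5000, 8000, 12000, 18000, 25000, 27000, 33000, 39000],
    "Price":   [25000, 23000, 22000, 21000, 20500, 19500, 19000, 18000],
})

# Bin edges every 10,000 miles. Note that np.arange excludes the stop value,
# so the edges are 0, 10000, 20000, 30000, 40000 (four bins).
bins = np.arange(0, 50000, 10000)

# pd.cut assigns each car to a mileage interval; then take the mean per bin.
groups = cars.groupby(pd.cut(cars["Mileage"], bins), observed=False).mean()
mean_price = groups["Price"]
print(groups)
```

With this toy data the mean price falls from one bin to the next, which is the pattern the real dataset is described as showing. Calling groups["Price"].plot.line() would draw it if matplotlib is installed.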
Now I will warn you there are some outliers in this dataset, and this might not hold true all the way up beyond 50 thousand miles. So maybe there are some collectible sports cars that have high mileage, but they're still worth a lot, right? So this dataset is actually a good one to play with for dealing with outliers too, if you want to dig into that, and I'll encourage you to do so at the end of this exercise. Anyway, our objective there was just to make sure that we're comfortable with the data, we're able to load it, and we're able to manipulate it. So far, so good. All right, so let's build a model. We're going to start by importing statsmodels.api as sm. This is going to be importing the model that we're actually going to be using here. We're going to import StandardScaler from sklearn.preprocessing, and we're going to create a new instance of StandardScaler called scale. StandardScaler is what we're going to use to normalize all of our feature data so that they're all in the same general range. That makes models work a lot better in some cases, including this one. So let's start by extracting the feature data that we want. We're going to go back to our original DataFrame that was loaded directly from that Excel spreadsheet, and just extract three features: mileage, number of cylinders and number of doors. And by convention we call our feature array X, uppercase X. So we're going to try to build a model that only tries to predict price based on mileage, number of cylinders, and number of doors. Now there's actually a lot of other data there in the source data, for example, the make and model of the car, and that would be pretty important as well, you would think. But with this sort of regression model, you can't really mix and match ordinal data with numerical data. So we kind of have to pick one or the other. In this case, we're going to go with some numerical features, and that's why we're omitting things like the make and model of the car.
By convention, lowercase y is our labels, the things that we're trying to predict. In our case, that is the price, okay? Now we need to preprocess that feature data to make it work well with our model. This is a pretty common thing to do. So we're going to take scale, which is our StandardScaler, call fit_transform, and pass in the mileage, cylinder and doors features from our X array. We extract the values from that, pass them into fit_transform, and that will give us back scaled mileage, cylinders and doors, each scaled down to have a mean of zero and a standard deviation of one. So more or less a bell curve centered on zero, with most values roughly between negative 1 and 1, and we'll stick that right back into the original X array in the same columns that we started with. This line, again, is going to scale the mileage, cylinder and doors features within that dataset into that standardized range. This next line is kind of a quirk of OLS. You'll remember from the lecture that there is a constant term, a y-intercept, for where we start from. And unless you add a constant column to your feature data, the model can't create that. So in order to allow it to have a y-intercept, to have that constant in the model, we need to call sm.add_constant and pass in our feature array. All it does there is add a column full of ones to the beginning of our DataFrame. We'll print that out and make sure that it looks as we expect. And then we'll actually train the model itself, and that's a one-liner here. We just call sm.OLS, pass in our label data, pass in our feature data, and call fit to fit that data to our model and create a new estimator. We will then print out summary information about how that training went, and it will give us some insights about what's actually going on inside the model right there. So let's hit Shift-Enter and kick that off. Pretty quick. All right, so the first thing we printed out was that feature array that's been scaled down.
And you can see that we have that constant column that we stuck in the front there. That's going to be used for modelling the y-intercept of the model, b0. And you can see that our feature data has been scaled more or less into a plus or minus 1 range. So it looks like that worked. We then trained the model and printed out summary information about the trained model. And by looking at this, you can get some insights. R-squared and all the usual metrics are here for you to look at. You've really got to read the documentation on what these stats really mean, because they're not always what you expect. For example, R-squared here is based on a weighted metric that is not what you would normally think of as being R-squared, but the values are still comparable as you're running different models on the same data. Let's take a look at the actual coefficients here. So this is kind of like the meat of the output of our model. These are the actual coefficients for those b0, b1, b2, b3 terms in our multiple regression model. So you can see our constant term is actually pretty big: 21,340-ish dollars is our y-intercept. Mileage actually has a negative coefficient. So again, that makes sense, because as you increase mileage, the price would decrease. We're also seeing that cylinders have a really big impact on it. The number of cylinders in the car actually seems to have the highest absolute magnitude of coefficient that we see here. So rather surprisingly, the number of cylinders seems to affect the model more than anything. I wasn't seeing that coming, but I guess it makes sense. You know, if you have really high cylinder vehicles, you're probably in the world of exotic supercars, right? So this might be driven by outliers, where somebody has some million dollar supercar with ten cylinders, and that might be skewing this whole model and doing weird things.
So again, outliers are something to dig into, and we'll talk about that more later. Number of doors: surprisingly, also a negative coefficient. More doors does not mean more money, it turns out. And if you think about that again, you know, sports cars, sports coupes, they tend to be two-door and they tend to be expensive. So that's a little bit of an interesting insight into the data just from looking at these coefficients and nothing more. All right, moving on. So this was actually a pretty convoluted and complex way of figuring out that more doors does not mean more money. And I want to reiterate that you always want to go for the simpler solution when you can. I could have figured that out with one line of code here, much more simply, if I just grouped my data by the number of doors and computed the mean price for each number of doors. Let's see what we come up with. So right there I can just see, with that one simple line of code, without building a big fancy machine learning model, that the mean price of a two-door vehicle is $23,800, while the mean price of a four-door is $20,580. So I could have reached that conclusion a lot more easily and with less complexity if I wanted to. So, a little side parable there on how simplicity is often a good thing. Anyway, we have this model. What are we going to do with it? So let's say you want to make an actual prediction for a fictitious car, or some new car that you've encountered. Well, it's not quite as simple as you would think it would be to get to an actual prediction using that model. But here's how you go about it. So first of all, we're going to fabricate this new fake car that has 45 thousand miles, eight cylinders and four doors. So that's what that array there means. First, we need to scale that into the same range that was used to train our model.
So we're going to take that same scale instance of StandardScaler that we had before and call transform on it to scale this particular car into the same range that we used to actually train the model. That way it will be compatible with the model that we made. We also need to insert that constant column again. So we're going to call numpy.insert. We're going to extract that feature data from scaled, which is what we called that resulting array for our fictitious car, and we're going to say we want to insert, at position 0, the number one. Okay, that's all that's going on here. We'll then print that out and make sure that's what we expect it to be. Finally, we can call predict on the estimator that we created, with that fictitious vehicle that we created, and get back a predicted sale price for it. So let's go ahead and run that. There we go. So we can see that our input feature array here has that constant column of 1, and our scaled feature data for the mileage, number of cylinders and number of doors. And it came back with a predicted price in this case of $27,658, which is in the ballpark of our data. So I think that's a reasonable estimation. So there you have it: multiple regression in action, using some real data from actual car sales. I guess it's kind of an old dataset, but, you know, car prices haven't changed all that much, not used cars anyway. As always, I encourage you to fiddle around with this further on your own. Try downloading that XLS spreadsheet from our website. You can just use that link that was in the first block there and actually download it through your browser if you want to, and get more familiar with what's going on. So in the activity here, I suggested maybe trying to mess around with the number of doors, and see if you can actually fabricate some data to make a more interesting, or maybe a different, influence of the number of doors on the price.
And maybe we can have some fun with it and create cars that have 10 doors or something like that, to try to see if you can skew things one way or the other. It would also be a good idea to take a look at the data and try to identify some of those outliers we were talking about. Try removing them and see what that does to the quality of your model. Like I said, I think there might be some supercars in there that are throwing things off for the more common kinds of cars that people generally buy. So have a little bit of fun with it and tinker, if you're so inclined. And that is multiple regression in a nutshell. Again, all we're doing is regression on multiple features at the same time, assigning different coefficients to each feature, to have a single regression model that we can use to make predictions based on more than one feature.
26. Multi-Level Models: Let's talk about multilevel models. This is definitely an advanced topic, and I'm not going to get into a whole lot of detail here. My objective here is just to introduce the concept of multilevel models to you and let you understand some of the challenges and how to think about them when you're putting them together. That's it. So the concept here is that some effects happen at various different levels in the hierarchy. So, for example, your health might depend on how healthy your individual cells are. And those cells might be a function of how healthy the organs that they're inside are, and the health of your organs might depend on the health of you as a whole, and your health might depend in part on your family's health and the environment your family gives you. And your family's health, in turn, might depend on some factors of the city that you live in. How much crime is there, how much stress is there, how much pollution is there? And even beyond that, it might depend on factors in the entire world that we live in. Maybe just the state of medical technology in the world is a factor, right? Another example: your wealth. How much money do you make? Well, that's a factor of your individual hard work, but it's also a factor of the work that your parents did. You know, how much money were they able to invest into your education and the environment that you grew up in? And in turn, how about your grandparents? What sort of environment were they able to create? And what sort of education were they able to offer for your parents, which in turn influenced the resources they have available for your own education and upbringing? So these are all examples of multilevel models, where there is a hierarchy of effects that influence each other at larger and larger scales. Okay, now the challenge of multilevel models is to try to figure out, well, how do I model these interdependencies?
How do I model all these different effects and how they affect each other? You see the line of health care, by the way. So the challenge here is to identify the factors in each level that actually affect the thing you're trying to predict. So say I'm trying to predict overall SAT scores, for example. Well, I know that depends in part on the individual child that's taking the test. But what is it about the child that matters? Well, it might be the genetics, it might be their individual health, the individual brain size that they have. You can think of any number of factors about the individual that might affect their SAT score. And then, if you go up another level, look at their home environment, look at their family. You know, what is it about their families that might affect their SAT scores? How much education were they able to offer? Are the parents able to actually tutor the children in the topics that are on the SAT? These are all factors at that second level that might be important. What about their neighborhood? The crime rate of that neighborhood might be important. The facilities they have for teenagers and keeping them off the streets, things like that. Now, the idea is you want to keep looking at these higher levels, but at each level identify the factors that impact the thing you're trying to predict. And I can keep going up to the quality of the teachers in their school, the funding of the school district, the education policies at the state level. You can see there are different factors at different levels that all feed into this thing you're trying to predict. And some of these factors might exist in more than one level. So crime rate, for example, exists at the local and state levels. You need to figure out how those all interplay with each other as well when you're doing multilevel modelling. Okay. And as you can imagine, this gets very hard and very complicated very quickly.
It is really way beyond the scope of this course, not just the point that you're at now in this course, but any introductory course in data science. This is hard stuff. There are entire thick books about it. You could do an entire course about it, and that would be a very advanced topic. The only reason I'm even bringing it up in this course is because I've seen it mentioned on job descriptions, in a couple of cases, as something that they want you to know about. I've never had to use it in practice. But I think the important thing from the standpoint of getting a career in data science is that you are at least familiar with the concept, and you know what it means and some of the challenges involved in creating a multilevel model. So I hope I've given you those concepts, and with that we can move on to the next section. So there you have the concepts of multilevel models. It's a very advanced topic, but you need to understand what the concept is, at least, and the concept itself is pretty simple. You are just looking at the effects at different levels, different hierarchies, when you're trying to make a prediction. So maybe there are different layers of effects that have impacts on each other, and those different layers might have factors that interrelate with each other as well. Multilevel modeling tries to take account of all those different hierarchies and factors and how they interplay with each other. That's all you need to know for now.
27. Supervised vs. Unsupervised Learning, Train / Test: Let's talk about some more machine learning techniques. One of the fundamental concepts behind machine learning is something called train/test, which lets us very cleverly evaluate how good a model that we make in machine learning is. So let's learn more about that. Let's talk about machine learning and specifically the difference between supervised and unsupervised machine learning. We're getting into the interesting stuff here, so let's go. So what is machine learning? Well, if you look it up on Wikipedia or whatever, it'll say that it's algorithms that can learn from observational data and can make predictions based on it. Sounds really fancy, right? Like artificial intelligence stuff, like you have a throbbing brain inside of your computer. But in reality, these techniques are usually very simple, and we've already done this. If you look at regressions, we took a set of observational data, we fit a line to it, and then we could use that line to make predictions. So by this definition, that's machine learning, and it's pretty darn simple. And yeah, I mean, your brain works that way, too, so it's kind of fun to think about. Are there any insights in these algorithms into how your brain actually works? Maybe there are. Maybe underneath it all, there's really a very simple thing going on in there, but that's a topic for a different course. So let's talk about the two different types of machine learning: supervised and unsupervised. Sometimes there can be kind of a blurry line between the two, to be honest. But the basic definition of unsupervised learning is that you're not giving your model any answers to learn from. You're just presenting it with a group of data, and it tries to make sense out of it, given no additional information.
So for an example, let's say I give it a bunch of different objects, you know, balls and cubes and sets of dice and whatnot. And I have some algorithm that will cluster these objects into things that are similar to each other, based on some similarity metric. Okay, now I haven't told it ahead of time what categories certain objects belong to. I don't have a sort of cheat sheet that it can learn from, where I have a set of existing objects and my correct categorization of them. It has to infer those categories on its own. So that's an example of unsupervised learning, where I don't have a set of answers that I'm letting it learn from. I'm just trying to let it gather its own answers based on the data presented to it alone. Okay, so the problem with that is that you don't necessarily know what the algorithm will come up with. So if I gave it a bunch of these objects on this slide, is it going to group things into things that are round, things that are large versus small, things that are red versus blue? I don't know. It's going to depend primarily on the metric that I give it for similarity between items. But sometimes you'll find clusters emerge that are surprising and that you didn't expect to see. So that's really the point of unsupervised learning. If you don't know what you're looking for, it can be a powerful tool for discovering classifications that you didn't even know were there. We call that a latent variable: some property of your data that you didn't even know was there originally, but that unsupervised learning can tease out for you. So, an example. Let's say I was clustering people instead of balls and dice and whatnot. I'm running a dating site, and I want to see what sorts of people tend to cluster together. There are some attributes that people tend to cluster around, that determine whether they tend to like each other and date each other or whatever. And you might find that the clusters that emerge don't conform to your preconceived stereotypes.
Maybe it's not about college students versus middle-aged people, or people who are divorced and whatnot, or their religious beliefs. Maybe if you look at the clusters that actually emerge from that analysis, you learn something new about your users and figure out that there's something more important than any of those existing features of your people that really counts toward whether they like each other. So that's an example of unsupervised learning providing useful results. Okay, another example: clustering movies based on their properties. If you were to run clustering on a set of movies from, like, IMDb or something, maybe the results would surprise you. Maybe it's not just about the genre of the movie. Maybe there are other properties, like the age of the movie or the running length or what country it was released in, that are more important. You just never know. Or we could analyze the text of product descriptions and try to find the terms that carry the most meaning for a certain category. Again, we might not necessarily know ahead of time what terms, what words, are most indicative of a product being in a certain category. But through unsupervised learning, we can tease out that latent information. Now, in contrast, supervised learning is a case where we have a set of answers that the model can learn from. So we give it a set of training data that the model learns from, and it can then infer relationships between the features and the categories that we want, and then apply that to unseen new values and predict information about them. So, going back to our earlier example, where we were trying to predict car prices based on the attributes of those cars, that's an example where we are training our model using actual answers. So I have a set of known cars and the actual prices that they sold for.
I train the model on that set of complete answers, and then I have a model that I can use to predict the prices of new cars that I haven't seen before. So that's an example of supervised learning, where you're giving it a set of answers to learn from. You've already assigned categories, or whatever, to a set of data, and the algorithm uses that to build a model that it can use to predict new values. So how do you evaluate supervised learning? The beautiful thing about supervised learning is that I can use a trick called train/test. The idea here is: what if I were to split the observational data that I want my model to learn from into two groups, a training set and a testing set? So when I actually train my model, when I build my model based on the data that I have, I only do that with part of my data, which I'm calling my training set. I reserve another part of my data, and I'm going to use that for testing purposes. So I can build my model using a subset of my data as training data, and then I can evaluate the model that comes out of that and see if it can successfully predict the correct answers for my testing data. So you see what I did there? I have a set of data where I already have the answers that I can train my model from, but I'm going to withhold a portion of that data and actually use that to test the model that was generated using the training set. Okay, so that gives me a very concrete way to test how good my model is on unseen data, because I actually have a bit of data that I set aside that I can test it with. And you can then measure quantitatively how well it did using R-squared or some other metric like root mean squared error, things like that. You can use that to test one model versus another and see what the best model is for a given problem. You can tune the parameters of that model and use train/test to maximize the accuracy of that model on your testing data. So it's a great way to prevent overfitting.
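To make the train/test idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the made-up data (y = 3x + 2 plus noise), the 80/20 split, and the helper names train_test_split, fit_line, and r_squared, none of which come from the course materials. It shuffles the observations, fits a straight line on the training portion only, and then scores that line on the held-out portion.

```python
import random

def train_test_split(xs, ys, test_fraction=0.2, seed=0):
    """Shuffle the paired observations, then hold out a fraction for testing."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(xs) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return ([xs[i] for i in train_idx], [ys[i] for i in train_idx],
            [xs[i] for i in test_idx], [ys[i] for i in test_idx])

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def r_squared(ys, predictions):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Made-up observational data: y = 3x + 2 plus Gaussian noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [3 * x + 2 + rng.gauss(0, 1) for x in xs]

train_x, train_y, test_x, test_y = train_test_split(xs, ys)
slope, intercept = fit_line(train_x, train_y)          # model sees training data only
test_r2 = r_squared(test_y, [slope * x + intercept for x in test_x])
print(f"R-squared on held-out data: {test_r2:.3f}")
```

The important design point is that fit_line never sees the test points, so the printed score reflects how the model does on data it was not trained on.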
There are some caveats. You need to make sure that both your training and test data sets are large enough to actually be representative of your data. You need to make sure that you're catching all the different categories and outliers that you care about in both training and testing, to get a good measure of its success and to build a good model. You have to make sure that you select those data sets randomly, so you're not just carving your data set in two and saying everything left of here is training and everything right of here is testing. You want to sample randomly, because there could be some sequential patterns in your data that you don't know about. But fundamentally, it is, like I said, a great way to guard against overfitting. So if your model is overfitting and just going out of its way to accommodate outliers in your training data, well, that's going to be revealed when you put it against an unseen set of testing data, right? Because all those gyrations for outliers won't help with the outliers that it hasn't seen before. Now, train/test isn't perfect. You can get misleading results from it. Maybe your sample sizes are too small, like we already talked about. Or maybe, just due to random chance, your training data and your test data look remarkably similar. They actually do have a similar set of outliers, and you can still be overfitting. Who knows? And you can see in this example, yeah, it can happen. So there is a way around that too, called K-fold cross validation, and we'll do an example of this later in the course. But the basic concept is you do train/test many times. So you actually split your data not into just one training set and one test set; you split your data into multiple randomly assigned segments, K segments. That's where the K comes from. You reserve one of those segments as your test data, train your model on the remaining segments, and measure its performance against that reserved segment. Then you repeat that so each of the K segments gets a turn as the test data.
And then you take the average performance across those K models: you take their average R-squared score. So this way you're actually training on different slices of your data and measuring each against a segment it didn't train on. And that way, if you have a model that's overfitting to a particular segment of your training data, that will get averaged out by the other ones that are contributing to K-fold cross validation. It will make more sense later in the course. I just want you to know this tool exists for making train/test even more robust than it already is. So let's go and actually play with some data and evaluate it using train/test.
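As a rough sketch of K-fold cross validation in its standard form, where each of the K segments takes one turn as the test data, here is a plain-Python version. The helper names (k_fold_splits, cross_validate, fit_line, r_squared), the straight-line model, and the made-up data are all assumptions for illustration, not code from the course notebooks.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # K roughly equal random segments
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train_idx, test_idx

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def r_squared(ys, predictions):
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def cross_validate(xs, ys, k=5):
    """Train on K-1 segments, score on the held-out one, once per segment."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(xs), k):
        slope, intercept = fit_line([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        predictions = [slope * xs[i] + intercept for i in test_idx]
        scores.append(r_squared([ys[i] for i in test_idx], predictions))
    return scores

# Made-up data: y = 3x + 2 plus Gaussian noise.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [3 * x + 2 + rng.gauss(0, 1) for x in xs]

scores = cross_validate(xs, ys, k=5)
mean_score = sum(scores) / len(scores)
print("per-fold R-squared:", [round(s, 3) for s in scores])
print("average R-squared:", round(mean_score, 3))
```

Averaging the per-fold scores is what smooths out a lucky or unlucky single split.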
28. Using Train/Test to Prevent Overfitting: Let's put train/test into action. You might remember that a regression can be thought of as a form of supervised machine learning. So let's just take a polynomial regression, which we already covered earlier in this course, and use train/test to try to find the right degree polynomial to fit a given set of data. So just like in our previous example, we're going to set up a little fake data set of randomly generated page speeds and purchase amounts, and I'm going to create a weird little relationship between them that's kind of exponential in nature. So let's go ahead and generate that data. It's going to use a normal distribution of random data for both page speeds and purchase amounts, using this relationship here. Next I'm going to split that data. I'm going to take 80% of my data and reserve that for my training data. So only 80% of these points are going to be used for training the model. And then I'm going to reserve the other 20% for testing that model against unseen data. OK, so I'm just going to use Python's syntax here for splitting a list: the first 80 points are going to go to the training set, and the last 20, everything after 80, is going to go to the test set. Remember that from our Python basics; we covered that syntax before. And I'll do the same thing for purchase amounts. Now, on the slides I did say you shouldn't just slice your data set in two like this. You should randomly sample it for training and testing. In this case, it works out because my original data was randomly generated anyway, so there's really no rhyme or reason to where things fell. But in real-world data, you'll want to shuffle that data before you split it, and there's a random.shuffle method you can use for that purpose. Also, if you're using the pandas package, there are some handy functions in there for making training and test data sets automatically for you.
But we're just going to do it using a Python list here to keep it simple. So let's visualize the training data set that we ended up with. We'll do a scatter plot of our training page speeds and purchase amounts, and it looks like this. Basically, 80 points selected at random from the original complete data set have the same shape. That's a good thing. It's representative of our data, and that's important. And our remaining 20 for testing also have the same general shape as our original data. So I think that's a representative test set too, though a little bit smaller than you would like to see in the real world, for sure. You'd probably get a little bit of a better result if you had, say, 1000 points instead of 100 to choose from and reserved 200 instead of 20. So now I'm going to try to fit an eighth-degree polynomial to this data. I'm picking the number eight because I know it's a really high order and is probably overfitting. So let's go ahead and fit our eighth-degree polynomial using np.poly1d and np.polyfit with x, y, and 8, where x is an array of the training data only and y is an array of the training data only. So we're fitting our model using only those 80 points that we reserved for training, and now we have this p4 function that results, which we can use to predict new values. So let's go ahead and plot the polynomial this came up with against the training data. We can scatter our original data here for the training data set, and then we can plot our predicted values against them. You can see here it looks like a pretty good fit, but clearly it's doing some overfitting. What's this craziness out here? I'm pretty sure our real data, if we had it out here, wouldn't be as crazy high as this function would imply. So this is a great example of overfitting your data.
It fits the data you gave it very well, but it would do a terrible job of predicting new values beyond this point. Right? So let's try to tease that out. Let's give it our test data set. And indeed, if we plot our test data against that same function, well, it doesn't actually look that bad. We got lucky, and none of our test data is actually out here to begin with. But you can see that it's a reasonable fit, though it's far from perfect. And in fact, if you actually measure the R-squared score, it's worse than you might think. We can do that here using the r2_score function from sklearn.metrics. We just give it our original data and our predicted values, and it goes through and measures all the variances from the predictions and squares them all up for you. We end up with an R-squared score of just 0.3. So, not that hot. And you can see that it fits the training data a lot better, with an R-squared value of 0.6, which isn't too surprising because we trained it on the training data. The test data is sort of its unknown, its test, and it failed the test, quite frankly. 30%: that's an F. So that's an example of using train/test to evaluate a supervised learning algorithm. And, like I said before, pandas has some means of making this even easier. We'll look at that a little bit later, and we'll also look at more examples of train/test, including K-fold cross validation, later in the course as well. And you can probably guess what your homework is. We know that an eighth-order polynomial isn't very useful. Can you do better? So I want you to go back and run this IPython notebook all the way through, but use different values for the degree of polynomial that you're going to use to fit. Change that 8 to different values and see if you can figure out what degree polynomial actually scores best using train/test as a metric. Where do you get your best R-squared score for your test data? What degree fits best here?
So go play with that. If you have any problems, post in the discussions, but it should be a pretty easy exercise and a very enlightening one for you as well. So have some fun with it. So that's train/test in action. It's a very important technique to have, and you're going to use it over and over and over again to make sure that your model is a good fit for the data that you have and is a good predictor of unseen values. It's a great way to prevent overfitting when you're doing your modeling. Let's move on.
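For anyone following along without the notebook open, here is a hedged sketch of the experiment described above. The seed and the distribution parameters (mean 3.0 and standard deviation 1.0 for page speeds, mean 50.0 and standard deviation 30.0 for purchase amounts, with purchase amount divided by page speed) are assumptions chosen for illustration and may not match the notebook. R-squared is computed by hand with NumPy rather than with r2_score so the block depends only on NumPy. NumPy may warn that the highest-degree fits are poorly conditioned; that does not stop the script.

```python
import numpy as np

np.random.seed(2)  # assumed seed, only so the run is repeatable

# Assumed fake data: purchase amount falls off as page speed rises.
page_speeds = np.random.normal(3.0, 1.0, 100)
purchase_amount = np.random.normal(50.0, 30.0, 100) / page_speeds

# First 80 points train the model; the last 20 are held back for testing.
# (Slicing is acceptable here only because the data is already random.)
train_x, test_x = page_speeds[:80], page_speeds[80:]
train_y, test_y = purchase_amount[:80], purchase_amount[80:]

def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

scores = {}
for degree in range(1, 9):
    # Fit on the training data only, then score on both sets.
    model = np.poly1d(np.polyfit(train_x, train_y, degree))
    scores[degree] = (r_squared(train_y, model(train_x)),
                      r_squared(test_y, model(test_x)))

for degree, (train_r2, test_r2) in scores.items():
    print(f"degree {degree}: train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")

# The degree with the best score on unseen data is the one to prefer.
best_degree = max(scores, key=lambda d: scores[d][1])
print("best degree on test data:", best_degree)
```

Comparing the two columns side by side is the point of the exercise: the training score keeps improving as the degree goes up, while the test score tells you when the extra flexibility has stopped helping.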
29. Bayesian Methods: Concepts: Did you ever wonder how the spam classifier in your email works? How does it know that an email might be spam or not? Well, one popular technique is something called naive Bayes, and that's an example of a Bayesian method. So let's learn more about how that works. Let's discuss Bayesian methods. We talked about Bayes' theorem earlier in the course, in the context of how things like drug tests can be very misleading in their results. But you can actually apply the same Bayes' theorem to larger problems, like spam classifiers. So let's dive into how that might work. That's called a Bayesian method. So, just a refresher on Bayes' theorem. Remember, the probability of A given B is equal to the overall probability of A times the probability of B given A, over the overall probability of B. That is, P(A|B) = P(A) P(B|A) / P(B). So how can you use that in machine learning? I can actually build a spam classifier with that: an algorithm that can analyze a set of known spam emails and a known set of non-spam emails, and train a model to predict whether new emails are spam or not. And this is a real technique used in actual spam classifiers in the real world. So, as an example, let's just figure out the probability of an email being spam, given that it contains the word free. If people are promising you free stuff, it's probably spam, so let's work that out. The probability of an email being spam, given that it has the word free in it, works out to the overall probability of it being a spam message, times the probability of it containing the word free given that it's spam, over the overall probability of containing the word free. Now, the numerator can just be thought of as the probability of a message being spam and containing the word free. But that's a little bit different from what we're looking for, because that's the odds out of the complete data set, and not just the odds within things that contain the word free.
Okay, and the denominator is just the overall probability of containing the word free. Sometimes that won't be immediately accessible to you from the data that you have. If it's not, you can expand that out to this other expression here if you need to derive it. So at the end of the day, that gives you the percentage of emails containing the word free that are spam, which would be a useful thing to know when you're trying to figure out if it's spam. What about all the other words in the English language, though? Our spam classifier should know about more than just the word free. It should automatically pick up every word in the message, ideally, and figure out how much each one contributes to the likelihood of this email being spam. So what we can do is train a model on every word that we encounter during training, throwing out things like a and the and and, meaningless words like that, of course. And then when we go through all the words in a new email, we can multiply the probability of being spam for each word together, and we get the overall probability of that email being spam. Okay, now it's called naive Bayes for a reason. One reason that it's naive is because we're assuming that there are no relationships between the words themselves. We're just looking at each word in isolation, individually within a message, and basically combining all the probabilities of each word's contribution to it being spam or not. We're not looking at the relationships between the words. A better spam classifier would do that, but obviously that's a lot harder. So this sounds like a lot of work. The overall idea is not that hard, but scikit-learn in Python makes it actually pretty easy to do. It offers a feature called CountVectorizer that makes it very simple to split up an email into all of its component words and process those words individually.
And then it has a MultinomialNB function, where NB stands for naive Bayes, that will do all of the heavy lifting for naive Bayes for us. So we can actually build a spam classifier with not a lot of code. Included in your course materials is some sample data that includes a set of known spam emails and a set of known ham emails. Ham is what we call email that's not spam. So let's do it. Let's build a spam classifier.
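Before reaching for scikit-learn, it can help to see the arithmetic by hand. The sketch below is an illustration under stated assumptions, not the course's code: the email counts and the four training messages are made up, and the train and classify helpers are hypothetical. It first works out P(spam | free) with Bayes' theorem, expanding the denominator as P(free) = P(free | spam) P(spam) + P(free | ham) P(ham), and then combines per-word probabilities the naive way. Two details go beyond what the lecture describes and are labeled as such in the comments: add-one smoothing, so an unseen word does not zero out the product, and summing logarithms instead of multiplying raw probabilities, which is mathematically equivalent but numerically safer.

```python
import math
from collections import Counter

# Made-up counts: 100 emails, 40 spam; "free" appears in 30 spam and 5 ham.
p_spam, p_ham = 40 / 100, 60 / 100
p_free_given_spam, p_free_given_ham = 30 / 40, 5 / 60

# Denominator expanded over both classes.
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham
p_spam_given_free = p_spam * p_free_given_spam / p_free
print(f"P(spam | free) = {p_spam_given_free:.3f}")

def train(messages):
    """Count how often each word appears in each class."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in messages:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocabulary

def classify(text, model):
    """Pick the class with the highest naive Bayes score."""
    word_counts, class_counts, vocabulary = model
    total_messages = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_messages)   # class prior
        words_in_class = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocabulary:                                # skip unseen words
                # Add-one smoothing (an addition not covered in the lecture).
                likelihood = (word_counts[label][word] + 1) / (words_in_class + len(vocabulary))
                score += math.log(likelihood)                     # log of a product
        scores[label] = score
    return max(scores, key=scores.get)

model = train([
    ("free viagra now", "spam"),
    ("win free money now", "spam"),
    ("hi bob how about golf tomorrow", "ham"),
    ("lunch tomorrow with bob", "ham"),
])
print(classify("free money now", model))
print(classify("golf tomorrow bob", model))
```

Each word contributes independently to the score, which is exactly the "naive" assumption: word order and word combinations are ignored.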
30. Implementing a Spam Classifier with Naive Bayes: All right, let's write a spam classifier using naive Bayes. You're going to be surprised how easy this is. In fact, most of the work ends up just being reading all the input data that we're going to train on and parsing that data. The actual spam classification, the machine learning bit, is just a few lines of code. That's usually how it works out. Reading in and massaging and cleaning up your data is usually most of the work when you're doing data science, so get used to the idea. Go ahead and open up the naive Bayes IPython notebook if you'd like to follow along with me. And like I said, most of the work is just in reading in the data. So what I have here in your course materials is a couple of different directories filled with emails. One is a bunch of emails that I already know are spam, that have been classified ahead of time, and another is a directory full of emails that are ham, that are not spam. I'm going to use this information to train my model and actually test it out. So the first thing I have to do is read all those emails in somehow, and we're going to again use pandas to make this a little bit easier. Again, pandas is a useful tool for handling tabular data. First we import all the different packages that we're going to use within our example here. That includes the os library, the io library, NumPy, pandas, and CountVectorizer and MultinomialNB from scikit-learn, and we'll go through all of those as we encounter them. Let's skip past these function definitions for now and go down to the first thing that our code actually does, and that's to create a pandas DataFrame object. We're going to construct this from a dictionary that initially contains an empty list for message and an empty list for class.
So this syntax is saying I want a DataFrame that has two columns: one that contains the message, the actual text of each email, and one that contains the class of each email, that is, whether it's spam or ham. Okay, so basically this one line is saying I want to create a little database of emails, and this database has two columns. It has the actual text of the email, and it has whether it's spam or not. Okay, now I need to put something in that database, into that DataFrame in Python syntax. So I'm going to call these two methods to actually throw in all of my spam emails from my spam folder and all of my ham emails from my ham folder. And if you are playing along here, make sure you modify this path to match wherever you installed the materials for this course, okay? And again, if you're on a Mac or Linux, pay attention to backslashes and forward slashes and all that stuff. In this case, it doesn't matter, but you won't have a drive letter, for example, if you're not on Windows. So just make sure those paths are actually pointing to where your spam and ham folders are for this example. So what is this dataFrameFromDirectory function I wrote up here? Basically, it says I have a path to a directory, and I know its given classification, spam or ham. What I'm going to do is call this readFiles function that I also wrote, which will iterate through every single file in a directory. I don't want to go into too much detail on how this works, but basically it's using the os.walk function to find all of the files in a directory. It builds up the full path name for each individual file in that directory, and then it reads it in. And while it's reading it in, it actually skips the header for each email and goes straight to the text. It does that by looking for the first blank line here.
It knows that everything after the first empty line is actually the message body, and everything in front of that first empty line is just a bunch of header information that I don't actually want to train my spam classifier on. So it gives me back both the full path to each file and the body of the message. Okay, so that's how I read in all of my data, and that is the majority of the code. So what I have at the end of the day here is a DataFrame object, basically a database with two columns, that contains message bodies and whether each one is spam or not. We can go ahead and run that, and we can use the head command from the DataFrame to preview what this looks like. So the first few entries in our DataFrame look like this: for each path to a given email file, we have a classification and we have the message body. Okay, all right, now for the fun part. We're going to use the MultinomialNB function from scikit-learn to actually perform naive Bayes on the data that we have, and what it expects to get is two things. Once we build a MultinomialNB classifier, it needs two inputs: the actual data that we're training on, and the targets for each thing. So what that data is, basically, is a list of all the words in each email and the number of times each word occurs. That's what this CountVectorizer thing does. This syntax means take the message column from my DataFrame and take all the values from it. Then I'm going to call vectorizer.fit_transform. What that does is basically tokenize, or convert, all the individual words seen in my data into numbers, into values, and it will then count up how many times each word occurs. So this is a more compact way of representing how many times each word occurs in an email. Instead of actually preserving the words themselves, I'm representing those words as different values in a sparse matrix.
Okay, which is basically saying that I'm treating each word as a number, as a numerical index into an array. So what that does, in plain English, is split each message up into a list of the words that are in it and how many times each word occurs. We're calling that counts. It's basically that information of how many times each word occurs in each individual message. And then targets is the actual classification data for each email that I've encountered. And I can call classifier.fit using my MultinomialNB function to actually create a model using naive Bayes that will predict whether new emails are spam or not, based on the information I gave it. So let's go ahead and run that. It runs pretty quickly. So let's try it out. I'm going to use a couple of examples here. Let's try a message body that just says Free Viagra now, which is pretty clearly spam, and a more innocent message that just says, Hi Bob, how about a game of golf tomorrow? So we pass those in. The first thing we need to do is convert these messages into the same format that I trained my model on. So I'm going to use that same vectorizer that I created when creating the model to convert each message into a list of words and their frequencies, where the words are represented by positions in an array. And then once I've done that transformation, I can use the predict function on my classifier on that array of examples that have been transformed into lists of words, and see what we come up with. And sure enough, it works. Given this array of two input messages, Free Viagra now and Hi Bob, it's telling me that the first result came back as spam and the second result came back as ham, which is what I would expect. So that's pretty cool. So there you have it. Now, we have a pretty small data set here, so you could try running some different emails through it if you want and see if you get different results. But here's something to try yourself.
Try applying train/test to this example. The real measure of whether or not my spam classifier is good is not just intuitively whether it can figure out that Free Viagra now is spam. You want to measure that quantitatively. So if you want a little bit of a challenge, go ahead and try to split this data up into a training set and a test data set. You can look up online how pandas can split data up into training and testing sets pretty easily for you, or you can do it by hand, whatever works for you. Then see if you can apply your MultinomialNB classifier to a test data set and measure its performance. So that's a little bit of an exercise, a bit of a challenge. Go ahead and give that a try. How cool is that? We just wrote our own spam classifier using just a few lines of code in Python. It's pretty easy using scikit-learn and Python. That's naive Bayes in action, and you can actually go and classify some spam or ham messages now that you have that under your belt. Pretty cool stuff. Let's talk about clustering next.
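Here is a compact sketch of the pipeline walked through above, using an in-memory list of emails instead of the course's spam and ham directories. The six training emails, their headers, and the message_body helper are made up for illustration; the notebook's own readFiles and dataFrameFromDirectory functions are not reproduced here. CountVectorizer, MultinomialNB, fit_transform, transform, fit, and predict are the real scikit-learn names.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def message_body(raw_email):
    """Return the text after the first blank line, i.e. skip the headers."""
    lines = raw_email.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "":
            return "\n".join(lines[i + 1:])
    return raw_email  # no blank line found: treat everything as body (a choice made here)

# Hypothetical raw emails: a header block, a blank line, then the body.
raw_emails = [
    ("Subject: offer\nFrom: a@example.com\n\nfree viagra now", "spam"),
    ("Subject: winner\nFrom: b@example.com\n\nwin free money now", "spam"),
    ("Subject: deal\nFrom: c@example.com\n\ncheap pills free offer", "spam"),
    ("Subject: golf\nFrom: d@example.com\n\nbob are we still on for golf tomorrow", "ham"),
    ("Subject: meeting\nFrom: e@example.com\n\nhi team the meeting moved to noon", "ham"),
    ("Subject: lunch\nFrom: f@example.com\n\nhow about lunch next week", "ham"),
]
bodies = [message_body(raw) for raw, _ in raw_emails]
targets = [label for _, label in raw_emails]

# Turn each message into a sparse row of word counts.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(bodies)

# Train naive Bayes on the word counts and their known classes.
classifier = MultinomialNB()
classifier.fit(counts, targets)

# New messages must go through the SAME fitted vectorizer (transform, not fit_transform).
examples = ["Free Viagra now!!!", "Hi Bob, how about a game of golf tomorrow?"]
predictions = classifier.predict(vectorizer.transform(examples))
print(list(predictions))
```

Reusing the fitted vectorizer matters because it fixes which column each word maps to; fitting a fresh one on the new messages would scramble that mapping.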
31. K-Means Clustering: Next we're going to talk about K-means clustering. This is an unsupervised learning technique where you have a collection of stuff that you want to group together into various clusters. Maybe it's movie genres or demographics of people, who knows? But it's actually a pretty simple idea. So let's go and see how it works. All right, let's talk about K-means clustering. It's a very common technique in machine learning, where you just try to take a bunch of data and find interesting clusters of things based on the attributes of the data itself. Sounds fancy, but it's actually pretty simple. All we do in K-means clustering is try to split our data into K groups. That's where the K comes from: it's how many different groups you're trying to split your data into. And it does this by finding K centroids. So basically, what group a given data point belongs to is defined by which of these centroid points it's closest to in your scatter plot. You can visualize that over here. This is showing an example of K-means clustering with K of three. The squares represent data points in a scatter plot, the circles represent the centroids that the K-means clustering algorithm came up with, and each point is assigned a cluster based on which centroid it is closest to. Okay, so that's all there is to it, really. It is an example of unsupervised learning. So it's not a case where we have a bunch of data and we already know the correct cluster for a given set of training data. Rather, you're just given the data itself, and it tries to converge on these clusters naturally, based on the attributes of the data alone. So it's a good case for when you're trying to find clusters or categorizations that you didn't even know were there. As with most unsupervised learning techniques, the point is to find latent values, things you didn't really realize were there until the algorithm showed them to you.
So, for example, where do millionaires live? I don't know. Maybe there is some interesting geographical cluster where rich people tend to live, and k-means clustering could help you figure that out. Maybe I don't really know if today's genres of music are meaningful. What does it mean to be "alternative" these days? Not much, right? But by using k-means clustering on attributes of songs, maybe I could find interesting clusters of songs that are related to each other and come up with new names for what those clusters represent. Or maybe I can look at demographic data, and maybe existing stereotypes are no longer useful. Maybe "Hispanic" has lost its meaning, and there are actually other attributes that define groups of people, for example, that I could uncover with clustering. Sounds fancy, doesn't it? Really complicated stuff: unsupervised machine learning with K clusters. But as with most techniques in data science, it's actually a very simple idea. So here's the algorithm. We start off with a randomly chosen set of centroids. If we have a K of three, we're going to look for three clusters in our group, and we will assign three randomly positioned centroids in our scatter plot. We then assign each data point to the randomly assigned centroid that it is closest to, and then we recompute the centroid for each cluster that we come up with. So for a given cluster that we end up with, we will move that centroid to be the actual center of all those points. And then we will do it all again until the centroids stop moving, that is, until we hit some threshold value that says, OK, we've converged on something here. Then, to predict the clusters for new points that I haven't seen before, we can just go through our centroid locations and figure out which centroid each point is closest to in order to predict its cluster. Okay, let's look at a graphical example here to make a little bit more sense of it. So say I have a scatter plot again.
These gray squares represent data points in our scatter plot. The axes represent some different features of something. Maybe it's age and income, which is an example I keep using, but it could be anything. And these squares represent individual people or individual songs or individual somethings that I want to find relationships between. Okay, so I start off by just picking three points at random in my scatter plot. Could be anywhere. Got to start somewhere, right? The next thing I'm going to do is, for each point, compute which one of these centroids it's closest to. You can see where that ends up: the points shaded in blue are associated with this blue centroid, the green points are closest to the green centroid, and this single red point is closest to that red random point that I picked out. But you can see that's not really reflective of where the actual clusters appear to be. So what I'm going to do next is take the points that ended up in each cluster and compute the actual center of those points. For example, in this green cluster here, the actual center of all that data turns out to be a little bit lower, so we're going to move that centroid down a little bit. In this red cluster I only had one point, so its center moves down to where that single point is. And the blue centroid was actually pretty close, so that just moves a little bit. On this next iteration, we end up with something that looks like this. Now you can see that our cluster for red things has grown a little bit, and things have moved a little bit; those points got taken from the green cluster, basically. And if we do that again, you can probably see what's going to happen next. That green centroid is going to move over here a little bit, and that blue centroid is still about where it should be, but at the end of the day, you're going to end up with the clusters you would probably expect to see.
The red cluster will end up being this group, blue will end up being this group, and green will end up being that group. That's how k-means works. It just keeps iterating, trying to find the right centroids, until things stop moving around and we converge on a solution. Now, there are obviously some limitations to k-means clustering. First of all, we need to choose the right value of K, and that's not a straightforward thing to do at all. The principled way of choosing K is to just start low and keep increasing the value of K, that is, how many groups you want, until you stop getting large reductions in squared error. If you look at the distances from each point to its centroid, you can think of that as an error metric, and at the point where you stop getting big reductions in that error metric, you know you probably have too many clusters. You're not really gaining any more information by adding additional clusters at that point. Also, there is the problem of local minima. You could just get very unlucky with those initial choices of centroids, and they might end up converging on local phenomena instead of more global clusters. So usually you want to run this a few times and maybe average the results together. We call that ensemble learning, and we'll talk about that a little bit more later on. But it's always a good idea to run k-means more than once using a different set of random initial values, and just see whether or not you do in fact end up with the same overall results. Finally, the main problem with k-means clustering is that there are no labels for the clusters that you get. It will just tell you that this group of data points is somehow related, but you can't put a name on it. It can't tell you the actual meaning of that cluster.
Let's say I have a bunch of movies that I'm looking at, and k-means clustering tells me that a bunch of science fiction movies are over here. It's not going to call them "science fiction" movies for me. It's up to me to actually dig into the data and figure out, well, what do these things really have in common, and how might I describe that in English? That's the hard part, and k-means won't help you with that. So again, scikit-learn makes it very easy to do this. Let's go actually do an example and put k-means clustering into action.
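Before moving on, the loop described in this lecture can be sketched in plain Python. This is only an illustration under a few assumptions: the function names are made up, the starting centroids are picked from the data points themselves rather than placed anywhere in the plot, and "stop moving" is taken to mean the centroids are exactly unchanged.

```python
import math
import random

def closest(point, centroids):
    """Index of the centroid nearest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 1: start from k randomly chosen data points as the centroids.
    centroids = rng.sample(points, k)
    assignments = []
    for _ in range(max_iterations):
        # Step 2: assign every point to its closest centroid.
        assignments = [closest(p, centroids) for p in points]
        # Step 3: move each centroid to the mean of the points assigned to it.
        new_centroids = []
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # leave an empty cluster's centroid alone
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignments

# Two obvious groups of made-up (x, y) points.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
centroids, assignments = k_means(points, k=2)
print(centroids, assignments)

# Predicting the cluster of a new point is just a nearest-centroid lookup.
new_point_cluster = closest((8.2, 8.1), centroids)
```

Notice that the cluster numbers that come back are arbitrary; which group ends up as 0 or 1 depends on the random start, which is the "no labels" limitation in miniature.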
32. Clustering People by Income and Age: All right, let's see just how easy it is to do k-means clustering using scikit-learn and Python. The first thing I'm going to do is create some random data that I want to try to cluster. And just to make it easier, I'm going to actually build some clusters into my fake test data. So let's pretend there's some real fundamental relationship between these data, and there are some real natural clusters that exist in it. To do that, I just wrote this little createClusteredData function in Python. It starts off with a consistent random seed, so you'll get the same result every time, and it takes in N and K: I want to create N people in K clusters. So it figures out how many points per cluster that works out to first, and then builds up this list X that starts off empty. Then for each cluster, for i in range(k), I'm going to create some random centroid of income between $20,000 and $200,000, and some random centroid of age between 20 and 70. So what I'm doing here is creating a fake scatter plot that will show income versus age for N people and K clusters. For each random centroid that I created, I'm then going to create a normally distributed set of random data with a standard deviation of 10,000 in income and a standard deviation of 2 in age. That will give me back a bunch of age and income data that is clustered into some pre-existing clusters that I chose at random. Okay, let's go ahead and run that. And now, to actually do k-means, you'll see how easy it is. All you do is import KMeans from scikit-learn's cluster package. We also import matplotlib so we can visualize things, and also scale, so we can take a look at how that works in a minute. So I'm going to use my createClusteredData function to say I want 100 random people around five clusters. So there are five natural clusters in the data that I'm creating.
I'm then going to create a KMeans model with a K of five. I'm picking five clusters because I know that's the right answer. But again, in unsupervised learning you don't necessarily know what the real value of K is. You need to iterate and converge on it yourself. And then I can just call model.fit on my KMeans model, using the data that I had. Now, the scale I alluded to earlier: that's normalizing the data. One important thing with k-means is that it works best if your data is all normalized, meaning everything is at the same scale. A problem that I have here is that my ages range from 20 to 70, but my incomes range all the way up to 200,000. So these values are not really comparable; the incomes are much larger than the age values. Scale will take all of that data and scale it together to a consistent scale, so I can actually compare these things as apples to apples, and that will help a lot with your k-means results. So once I've actually called fit on my model, I have a trained model, and I can actually look at the resulting labels that I got. Then we can visualize it using this little bit of matplotlib magic. You can see I have a little trick here where I assign the color to the labels that I ended up with, converted to some floating point number. That's a little trick you can use to assign arbitrary colors to a given value. So let's see what we end up with. It didn't take that long. You see the results here: basically, what clusters it assigned everything into. We know that our fake data is already pre-clustered, and it seems that it did identify the first and second clusters pretty easily. It got a little bit confused beyond that point, though, because our clusters here in the middle are actually a little bit mushed together. They're not really that distinct, so that was a challenge for k-means. But regardless, it did come up with some reasonable guesses at the clusters.
What we ended up with here was a cluster here, a cluster here, a cluster there, a cluster there, and a cluster down here, and it's not a bad choice. This is probably an example of where four clusters would more naturally fit the data. So what I want you to do for an activity is to try that out. Try a different value of K and see what you end up with. Just eyeballing this, it looks like four would work well. Does it really? What happens if I increase K too much? What happens to my results there? What does it try to split things into, and does it even make sense? So play around with it and try different values of K. In the n_clusters parameter here, change the five to something else, run all through it again, and see what you end up with. Play around, have some fun with it. That's all there is to k-means clustering. It's just that simple. You can just use scikit-learn's KMeans thing from cluster. And again, the only real gotcha: make sure you scale the data. Normalize it, or whiten it, as the case may be; that's another name for the same thing. You want to make sure the things that you're using k-means on are comparable to each other, and the scale function will do that for you. So those are the main things for k-means clustering. Pretty simple concept, and even simpler to do using scikit-learn. There you have it. That's all there is to it. That's k-means clustering. So if you have a bunch of data that is unclassified and you don't really have the right answers ahead of time, it's a good way to try to naturally find interesting groupings of your data, and maybe that can give you some insight into what that data is. So it's a good tool to have. I've used it before in the real world, and it's really not that hard to use, so keep that in your tool chest.
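If you want a starting point for that activity, here is a rough sketch. The data generator is a hypothetical stand-in written for this example (make_clustered_data is not the notebook's function, and its ranges are assumptions), and the loop simply prints scikit-learn's inertia_ value, the summed squared distance from each point to its centroid, for several values of K.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

def make_clustered_data(n_people, k, seed=10):
    """Fake (income, age) rows grouped around k random centroids."""
    rng = np.random.default_rng(seed)
    points_per_cluster = n_people // k
    rows = []
    for _ in range(k):
        income_centroid = rng.uniform(20000.0, 200000.0)
        age_centroid = rng.uniform(20.0, 70.0)
        incomes = rng.normal(income_centroid, 10000.0, points_per_cluster)
        ages = rng.normal(age_centroid, 2.0, points_per_cluster)
        rows.append(np.column_stack([incomes, ages]))
    return np.vstack(rows)

data = make_clustered_data(100, 5)
scaled = scale(data)  # put income and age on comparable scales

# Fit k-means for several values of K and record the squared error.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled)
    inertias[k] = model.inertia_
    print(k, round(model.inertia_, 2))
```

Look for the value of K where the printed numbers stop dropping sharply. Setting n_init=10 reruns each fit from ten random starts and keeps the best one, which is one way to guard against an unlucky set of initial centroids.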
33. Measuring Entropy: All right. Pretty soon we're going to get to one of the cooler parts of machine learning, at least I think so, called decision trees. But before we can talk about that, you need to understand the concept of entropy in data science. It's a pretty simple exercise, and a very short lecture, but let's just get that concept under your belt. Let's talk about entropy, another example of a fancy word for a simple concept. Just like it is in physics and thermodynamics, entropy is a measure of a data set's disorder: how same or different is the data set? Imagine we have a data set of different classifications, for example, animals. Let's say I have a bunch of animals that are classified by species. Now, if all of the animals in my data set are iguanas, I have a very low entropy, because they're all the same. But if every animal in my data set is a different animal, if I have iguanas and pigs and sloths and who knows what else, then I would have a higher entropy, because there's more disorder in my data set. Things are more different than they are the same. Entropy is a way of quantifying that sameness or that differentness throughout my data. So again, an entropy of zero implies all the classes in the data are the same, whereas if everything is different, I would have a high entropy, and something in between would be a number in between. Entropy just describes how same or different the things in a data set are. That's all there is to it. It's a very short lecture because it's a very simple concept. Now, mathematically, it's a little bit more involved than that. When I actually compute a number for entropy, it's computed using this expression here. For every different class that I have in my data, I'm going to have one of these p terms.
So p_1, p_2, and so on and so forth through p_n, for n different classes that I might have. Each p_i represents the proportion of the data that is that class. And if you actually plot what this looks like for each term, this -p_i times the natural logarithm of p_i, it'll look a little bit something like this, and you add these up for each individual class. If you look at it, it kind of makes sense. For example, if the proportion of the data that is a given class is zero, then the contribution to the overall entropy is zero. And if everything is that class, then again, the contribution to the overall entropy is zero. In either case, if nothing is this class or everything is this class, that's not really contributing anything to the overall entropy. It's the things in the middle that contribute entropy for the class, where there's some mixture of this classification and other stuff. When you add all these terms together, you end up with an overall entropy for the entire data set. So mathematically, that's how it works out. But again, the concept is very simple. It's just a measure of how disordered your data set is, how same or different the things in your data are. That's all there is to entropy: just a measure of the disorder of a data set. You just need to understand that as we talk about decision trees up next.
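As a small sketch of that calculation (the function name and the animal lists are made up for illustration), you can count each class, turn the counts into proportions, and add up -p_i * ln(p_i) over the classes that actually appear:

```python
import math
from collections import Counter

def entropy(labels):
    """Sum of -p * ln(p) over the classes present in labels."""
    total = len(labels)
    proportions = [count / total for count in Counter(labels).values()]
    return sum(-p * math.log(p) for p in proportions)

all_same = ["iguana", "iguana", "iguana", "iguana"]
all_different = ["iguana", "pig", "sloth", "cat"]
mixed = ["iguana", "iguana", "iguana", "pig"]

print(entropy(all_same))       # nothing but iguanas: no disorder
print(entropy(mixed))          # somewhere in between
print(entropy(all_different))  # every animal is different: the most disorder
```

A class that never appears is simply not counted, which matches the idea that a proportion of zero contributes nothing to the total.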
34. Windows: Installing Graphviz: So this is going to be the shortest lecture in the world. On Windows, you don't have to do anything special to actually use decision trees in Anaconda. It already installed everything you need automatically. You might notice that there are setup videos in here for Mac and Linux users, where they do need to follow an extra step. But you're good. So just continue on to the next lesson, and we can actually start playing with our decision trees.
35. Mac: Installing Graphviz: Now, before we can actually display decision trees, we need a package called Graphviz installed on your system, and on the Mac, the easiest way to do that is through Homebrew. So if you don't already have Homebrew installed, go over to brew.sh. To install Homebrew, we're just going to copy all of this information here with Command-C, go to a terminal prompt, paste it in with Command-V, and run that. We'll hit return. You'll need to authenticate, and off it goes. All right, after a few minutes, that did finish. And now we can just type in brew install graphviz, just like that. All right, looks like we're in business. Let's go ahead and close this terminal window so we can be sure that we pick up the new environment next time. And that should be it. We should be all set to go now.
36. Linux: Installing Graphviz: So on Linux, before we can actually play with decision trees, we need to install a package called Graphviz that will allow us to actually draw and visualize them within our notebook. It's really easy on Linux. All you have to do is open up a terminal and type in sudo apt-get install graphviz. At least that's how it works on Ubuntu. On different flavors of Linux you might have different package managers, but Graphviz is probably in there. So just do whatever you need to do in order to install Graphviz on your Linux system. Hit Y to continue, and Graphviz is now installed. It's just that easy. So now we can continue and start playing with decision trees.
37. Decision Trees: Concepts: Believe it or not, given a set of training data, you can actually get Python to generate a flow chart for you to make a decision. So if you have something you're trying to predict, some classification, you can use a decision tree to look at multiple attributes that you could decide upon at each level in a flow chart. And you can actually print out a flow chart to make a decision from, based on actual machine learning. How cool is that? Let's see how it works. All right, we're going to talk about decision trees. This is one of the most interesting applications in machine learning that I can think of. I think it's pretty cool stuff, but let's talk about how it works. A decision tree basically gives you a flow chart of how to make some decision. You have some dependent variable, like whether or not I should go play outside today based on the weather. When you have a decision like that, one that depends on multiple attributes, multiple variables, a decision tree could be a good choice. There are many different aspects of the weather that might influence my decision of whether I should go outside and play. It might have to do with the humidity, the temperature, or whether it's sunny or not, for example. A decision tree can look at all these different attributes of the weather, or anything else, and decide what the thresholds are, what decisions I need to make on each one of those attributes, before I arrive at a decision of whether or not I should go play outside. That's all a decision tree is. So it's a form of supervised learning.
The way it would work in this example is that I would have some sort of data set of historical weather and whether or not people went outside to play that day. I would feed the model this data: whether it was sunny or not on each day, what the humidity was, whether it was windy or not, and whether or not it was a good day to go play outside. Given that training data, a decision tree algorithm can arrive at a tree that gives you this flow chart that you can print out. It looks just like this, and you can just walk through it and figure out whether or not it's a good day to play outside based on the current attributes. So you can use that to predict the decision for a new set of values. It's pretty awesome stuff. I mean, how cool is that? We have an algorithm that will make a flow chart for you automatically, just based on observational data. And what's even cooler is how simple it all is once you learn how it works. So, for example, and we're going to actually do this, let's say I want to build a system that will automatically filter out resumes based on the information in them. A big problem that technology companies have is that we get tons and tons of resumes for our positions, and then we have to decide who we actually bring in for an interview, because it can be expensive to fly somebody out and actually take the time out of the day to conduct an interview. So what if there were a way to take historical data on who actually got hired and map that to things that are found on their resumes? We could construct a decision tree that would let us go through an individual resume and say, OK, this person actually has a high likelihood of getting hired, or not. So we can train a decision tree on that historical data and walk through it for future candidates. Wouldn't that be a wonderful thing to have? So let's make some totally fabricated hiring data that I'm going to use in this example.
We have candidates that are just identified by numerical identifiers, and I'm going to pick some attributes that I think might be interesting for predicting whether or not they're a good hire. How many years of experience do they have? Are they currently employed? How many employers have they had previous to this one? What's their level of education, what degree do they have? Did they go to what we classify as a top-tier school? Did they do an internship while they were in college? We can take a look at this historical data, and the dependent variable here is Hired: did this person actually get a job offer or not, based on that information? Now, obviously there's a lot of information that isn't in this model that might be very important. But the decision tree that we train from this data might actually be useful in doing an initial pass at weeding out some candidates, and what we end up with might be a tree that looks like this. It just turns out that in my totally fabricated data, anyone that did an internship in college actually ended up getting a job offer. So my first decision point is, did this person do an internship or not? If yes, go ahead and bring them in. And in my experience, internships are actually a pretty good predictor of how good a person is. If they have the initiative to go out and do an internship and actually learn something at that internship, that's a good sign. Do they currently have a job? Well, if they're currently employed, in my very small fake data set it turned out that they are worth hiring, just because somebody else thought they were worth hiring too. Obviously, it would be a little bit more of a nuanced decision in the real world. If they're not currently employed, do they have less than one prior employer? If yes, this person has never held a job, and they never did an internship either. Probably not a good hire decision. Don't hire that person.
But if they did have a previous employer, did they at least go to a top-tier school? If not, it's kind of iffy. If so, then yes, we should hire this person based on the data that we trained on. So that's how you walk through the results of a decision tree. It's just like going through a flow chart, and it's kind of awesome that an algorithm can produce this for you. The algorithm itself is actually very simple. Here's how it works. At each step of the decision tree flow chart, we find the attribute that we can partition our data on that minimizes the entropy of the data at the next step. We have a resulting set of classifications, in this case hire or don't hire, and we want to choose the attribute decision at each step that will minimize the entropy at the next step. So basically, at each step, we want to make all of the remaining choices result in either as many no-hires or as many hire decisions as possible. We want to make that data more and more uniform, so that as we work our way down the flow chart, we ultimately end up with a set of candidates that are either all hires or all no-hires, which we can classify into yes/no decisions on a decision tree. So that's it. We just walk down the tree, minimize entropy at each step by choosing the right attribute to decide on, and we keep on going until we run out. There's a fancy name for this algorithm: it's called ID3. I don't even know what that stands for. It's what's known as a greedy algorithm: as it goes down the tree, it just picks the attribute that will minimize entropy at that point. Now, that might not actually result in an optimal tree that minimizes the number of choices that you have to make, but it will result in a tree that works, given the data that you gave it. So that's all there is to it. It's a pretty simple idea. Now, one problem with decision trees is that they're very prone to overfitting.
So you can end up with a decision tree, kind of like the one we saw, that works beautifully for the data that you trained it on, but it might not be that great for actually predicting the correct classification for new people that it hasn't seen before. Decision trees are all about arriving at the right decision for the training data that you gave them. Maybe you didn't really take into account the right attributes. Maybe you didn't give it enough of a representative sample of people to learn from. These can result in real problems. So to fight that, we use a technique called random forests. The idea here is that we sample the data that we train on in different ways for multiple different decision trees. Each decision tree takes a different random sample from our set of training data and constructs a tree from it, and then each resulting tree can vote on the right result. Now, that technique of randomly resampling our data with the same model is called bootstrap aggregating, or bagging. Again, a fancy term for a very simple idea. And this is a form of what we call ensemble learning, which we'll cover in more detail shortly. But the basic idea is that we have multiple trees, a forest of trees if you will, each of which uses a random subsample of the data that we have to train on. Then each of these trees can vote on the final result, and that will help us combat overfitting for a given set of training data. The other thing random forests can do is restrict the number of attributes a tree can choose between at each stage as it tries to minimize the entropy as it goes. We can randomly pick which attributes it can choose from at each level. That also gives us more variation from tree to tree, and therefore we get more of a variety of trees that can compete with each other. They can all vote on the final result using slightly different approaches to arriving at the same answer.
So that's how random forests work. Basically, it is a forest of decision trees that draw from different samples, and also from different sets of attributes that they can choose between at each stage. Okay, so with all that, let's go make some decision trees, and we'll actually use random forests as well when we're done, because scikit-learn makes it really, really easy to do, as we'll see in a minute.
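To make the greedy step from this lecture concrete, here is a minimal sketch of choosing one split. Everything in it is illustrative: the candidate rows are invented, the helper names are made up, and it only picks a single attribute rather than building a whole tree the way a real ID3 implementation would.

```python
import math
from collections import Counter

def entropy(labels):
    """Sum of -p * ln(p) over the classes present in labels."""
    total = len(labels)
    return sum(-(n / total) * math.log(n / total) for n in Counter(labels).values())

def entropy_after_split(rows, attribute, target):
    """Size-weighted average entropy of the groups created by splitting on attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

def best_attribute(rows, attributes, target):
    """Greedy choice: the attribute whose split leaves the least entropy."""
    return min(attributes, key=lambda a: entropy_after_split(rows, a, target))

# Invented candidates: 1 means yes, 0 means no.
candidates = [
    {"interned": 1, "employed": 0, "top_tier": 0, "hired": 1},
    {"interned": 1, "employed": 1, "top_tier": 1, "hired": 1},
    {"interned": 0, "employed": 1, "top_tier": 0, "hired": 0},
    {"interned": 0, "employed": 0, "top_tier": 1, "hired": 0},
    {"interned": 1, "employed": 0, "top_tier": 1, "hired": 1},
    {"interned": 0, "employed": 0, "top_tier": 0, "hired": 0},
]

attributes = ["interned", "employed", "top_tier"]
first_split = best_attribute(candidates, attributes, "hired")
print("split first on:", first_split)
```

In this invented data the internship column separates hires from no-hires perfectly, so splitting on it leaves zero entropy and the greedy step picks it first. A full tree would then repeat the same choice inside each resulting group.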
38. Decision Trees: Predicting Hiring Decisions: All right, let's make some decision trees. It's pretty easy. In fact, it's crazy how easy this is. It's pretty exciting stuff if we can create an actual flow chart from training data that really works, with just a few lines of code in Python. So let's give it a try. I've included a PastHires.csv file with your course materials, and that just includes some fabricated data that I made up about people that either got a job offer or not, based on the attributes of those candidates. So go ahead and change that path to wherever you installed the materials for this course. I'm not sure where you put it, but it's almost certainly not there. So go ahead and edit that path. We're in the decision tree Python notebook file here right now, and we're just going to use pandas to read that CSV in and create a DataFrame object out of it. Okay, so let's go ahead and read that in, and we can use the head function on the DataFrame to print out the first few lines and make sure that it looks like it makes sense. And sure enough, we have some valid data here. For each candidate ID, we have their years of past experience, whether or not they were employed, their number of previous employers, their highest level of education, whether they went to a top-tier school, and whether they did an internship. And finally, here are the answers, where we know whether we extended a job offer to this person or not. Okay, so as usual, most of the work is just in massaging your data, preparing your data before you actually run the algorithms on it, and that's what we need to do here. Scikit-learn requires everything to be numerical, so we can't have Ys and Ns and BSs and MSs and PhDs. We have to convert all those things to numbers for the decision tree model to work. The way of doing that is using some shorthand in pandas, which makes these things easy.
Basically, I'm making a dictionary in Python that maps the letter Y to the number 1 and the letter N to the value 0. I want to convert all my Ys to ones and my Ns to zeros, so 1 will mean yes and 0 will mean no. What I can do is just take the Hired column from the DataFrame, using this syntax here, and call map on it using a dictionary. What that will do is go through the entire Hired column in the entire DataFrame and use that dictionary lookup to transform all of the entries in that column. It returns a new DataFrame column that I'm putting back into the Hired column, so this basically replaces the Hired column with one that's been mapped to ones and zeros. Okay, and I do that same thing for Employed, Top-tier school, and Interned, so all of those get mapped using the yes/no dictionary, and the Ys and Ns become ones and zeros instead. For the level of education, I do the same trick: I create a dictionary that assigns BS to 0, MS to 1, and PhD to 2, and use that to remap those degree names to actual numerical values. So if I go ahead and run that and do a head again, you can see that it worked. All my yeses are ones, my nos are zeros, and my level of education is now represented by a numerical value that has real meaning. So now we just need to prepare everything to actually go into our decision tree classifier, which isn't that hard. To do that, we need to separate our feature information, which is the attributes that we're trying to predict from, and our target column, which contains the thing that we're trying to predict. To extract the list of feature column names, we're just going to create a list of columns up to number six, so the first six columns, and we go ahead and print that out.
And those are the column names that contain our feature information: years of experience, employed, previous employers, level of education, top-tier school, and interned. These are the attributes of candidates that we want to predict hiring on. Next we construct our y vector, which is assigned to what we're trying to predict, and that is our Hired column. So this extracts the entire Hired column and calls it y, and then it takes all of our columns for the feature data and puts them into something called X. So this is a collection of all of the data in all of the feature columns, and these are the two things that our decision tree classifier needs. To actually create the classifier itself takes two lines of code: we call tree.DecisionTreeClassifier() to create our classifier, and then we fit it to our feature data and the answers, that is, whether or not people were hired. So let's go ahead and run that. Pretty quick. Now, to display it, I don't want to get into too much detail about what's going on here. Basically, displaying graphical data is a little bit tricky, so just consider this boilerplate code for how to do it. You would need to understand how Graphviz works, and dot files and all that stuff, but it's not important. Basically, this is the code you need to actually display the end result of a decision tree. So let's go ahead and run that. And there we have it. How cool is that? We have an actual flow chart here. Now let me show you how to read it. At each stage, we have a decision. Remember, most of our data is yes/no, so it's going to be 0 or 1. The decision point here is: is Employed less than 0.5? That means if we have an employment value of 0, which is no, we're going to go left. If employment is 1, which is yes, we're going to go right. So, were they employed? If not, go left. If yes, go right. And it turns out that in my sample data, everyone who is currently employed actually got a job offer. So I can very quickly say, if you are currently employed, yes, you're worth bringing in.
We're going to follow it down to this level here. How do you interpret this? The gini score is basically a measure of impurity, much like entropy, that it's using at each step. Remember, as we're going down, the algorithm is trying to minimize the amount of entropy. The samples are the remaining number of samples that haven't been sectioned off by a previous decision. The way to read the final leaf nodes here is this value array. That tells you that at this point we have zero candidates that were no-hires and five that were hires. So the way to interpret this is: if Employed was one, I'm going to go to the right, meaning that they are currently employed, and this brings me to a world where everybody got a job offer. That means I should hire this person. Okay, now let's say that this person doesn't currently have a job. The next thing we'll look at is whether they did an internship. If yes, then we're at a point where, in our training data, everybody got a job offer. So at that point we can say our entropy is now zero, because everyone's the same: they all got an offer. However, if we keep going down, we're at a point here where the entropy is 0.32. It's getting lower and lower, and that's a good thing. Next we're going to look at how much experience they have: do they have less than one year of experience? If they do have some experience and they've gotten this far, they're a pretty clear no-hire decision. We end up at this point where we have zero entropy, but all three remaining samples in our training set were no-hires. We have three no-hires and zero hires. But if they do have less experience, then they're probably fresh out of college, and they still might be worth looking at. The final thing we'll look at is whether or not they went to a top-tier school, and if so, they end up being a good prediction for being a hire.
And if not, they end up being a no-hire, because we end up with one candidate that fell into that category that was a no-hire and zero that were hires, whereas in this case we have zero no-hires and one hire. So you can see we just keep going until we reach an entropy of zero, if at all possible, for every case. Now let's say we want to use a random forest. You know, we're worried that we might be overfitting our training data. It's actually very easy to create a random forest classifier of multiple decision trees. To do that, we can use the same data that we created before. Again, you just need your X and y vectors, the set of features and the column that you're trying to predict on. We're going to make a RandomForestClassifier, also available from scikit-learn, and all you need to pass it is how many trees you want in your forest. So let's make 10 trees in our random forest. We can then fit the model to our data. You don't have to walk through the trees by hand, and when you're dealing with a random forest you can't really do that anyway. Instead, we're going to use the predict function on the classifier that we made. We pass in a list of all the different features for a given candidate that we want to predict employment for. If you remember, this maps to these columns: Years Experience, Employed?, Previous employers, Level of Education, Top-tier school, and Interned, interpreted as numerical values. So if we want to predict the employment of an employed 10-year veteran, we can do that, or if we want to predict the employment of an unemployed 10-year veteran, we can do that too. And sure enough, we get a result. In this particular case, we ended up with a hire decision on both. But what's interesting is that there is a random component to that, so you don't actually get the same result every time.
More often than not, the unemployed person does not get a job offer, and if you keep running this, you'll see that's usually the case. But the random nature of bagging, of bootstrap aggregating each one of those trees, means you're not going to get the same result every time. So maybe 10 isn't quite enough trees. Anyway, that's a good lesson to learn there. For an activity, if you want to go back and play with this, mess around with the input data. Go ahead and edit that CSV file that we started from and create an alternate universe where it's a topsy-turvy world: everyone that I gave a job offer to now doesn't get one, and vice versa. See what that does to your decision tree. Just mess around, see what you can do, and try to interpret the results. Have some fun with it. This is interesting stuff. I think this is really cool. So that's decision trees and random forests, one of the more interesting bits of machine learning, in my opinion. I always think it's pretty cool to just generate a flow chart out of thin air like that, so hopefully you'll find that useful. Let's move on.
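As a rough sketch of the random forest step from this lecture, assuming scikit-learn is installed: the eight training rows below are invented, not the course data, and `random_state` is an addition of mine to pin down the bootstrap sampling so repeated runs agree. Leave it out and you will see the run-to-run variation the lecture describes.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical rows: [years experience, employed?, previous employers,
#                     level of education, top-tier school, interned]
X = [
    [10, 1, 4, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [7, 0, 6, 0, 0, 0],
    [2, 1, 1, 1, 1, 0],
    [20, 0, 2, 2, 1, 0],
    [0, 0, 0, 2, 1, 1],
    [5, 1, 2, 1, 0, 1],
    [3, 0, 1, 0, 0, 0],
]
y = [1, 1, 0, 1, 0, 1, 1, 0]   # 1 = hired, 0 = not hired

# Ten trees, each trained on its own bootstrap resample of the rows.
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)

employed_veteran = [[10, 1, 4, 0, 0, 0]]
unemployed_veteran = [[10, 0, 4, 0, 0, 0]]
print(clf.predict(employed_veteran))
print(clf.predict(unemployed_veteran))
```

Raising `n_estimators` is the simplest way to make the vote more stable when the answer keeps flipping between runs.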
39. Ensemble Learning: So when we talked about random forests, that was an example of ensemble learning, where we're actually combining multiple models together to come up with a better result than any single model could come up with. So let's learn about that in a little bit more depth. We've covered this a little bit already, but there is more than one way to do it, so I just want to cover some of the basic techniques that exist for ensemble learning. It is kind of an important topic, but it'll be a short lecture because the ideas are pretty basic. So remember random forests: we had a bunch of decision trees that were using different subsamples of the input data and different sets of attributes that they would branch on, and they all voted on the final result when you were trying to classify something at the end. That's an example of ensemble learning. Another example: when we were talking about k-means clustering, we had the idea of maybe using different k-means models with different initial random centroids and letting them all vote on the final result as well. That's also an example of ensemble learning. Basically, the idea is that you have more than one model. It might be the same kind of model, or it might be different kinds of models, but you run them all on your set of training data, and they all vote on the final result for whatever it is you're trying to predict. And oftentimes you'll find that this ensemble of different models produces better results than any single model could on its own. A good example from a few years ago was the Netflix Prize. Netflix ran a contest where they offered, I think it was $1,000,000, to any researcher who could outperform their existing movie recommendation algorithm, and the ones that won were ensemble approaches, where they actually ran multiple recommender algorithms at once and let them all vote on the final result.
So ensemble learning can be a very powerful yet simple tool for increasing the quality of your final results in machine learning. Now, random forests again use a technique called bagging, which is short for bootstrap aggregating, which is yet another fancy term for a simple concept. All it means is that we take random subsamples of our training data and feed them into different versions of the same model and let them all vote on the final result. So if you remember, a random forest took many different decision trees that each used a different random sample of the training data to train on, and then they all came together in the end to vote on the final result. Okay, so that's bagging. Boosting is an alternate model, and the idea there is that you start with a model, but each subsequent model boosts the attributes that address the areas that were misclassified by the previous model. So you run train/test on a model, you figure out what the attributes are that it's basically getting wrong, and then you boost those attributes in subsequent models in hopes that those subsequent models will pay more attention to them and get them right. That's the general idea behind boosting: you run a model, figure out its weak points, amplify the focus on those weak points as you go, and keep building more and more models that keep refining that model based on the weaknesses of the previous one. Another technique, and this is kind of what the Netflix Prize winner did, is called a bucket of models, where you might have entirely different models that try to predict something. Maybe I'm using k-means and a decision tree and regression. I can run all three of those models together on a set of training data and let them all vote on the final classification result when I'm trying to predict something. And maybe that would be better than using any one of those models in isolation. Okay, stacking: same idea.
You run multiple models on the data and combine the results together somehow. The subtle difference between bucket of models and stacking is that with a bucket of models, you basically pick the model that wins. You'd run train/test, find the model that works best for your data, and use that model, whereas stacking will combine the results of all of those models together to arrive at a final result. Now, there is a whole field of research on ensemble learning that tries to find the optimal ways of doing ensemble learning. And if you want to sound smart, usually that involves using the word Bayes a lot, which you will see on the slide. So there are some very advanced methods of doing ensemble learning, but all of them have weak points, and I think this is yet another lesson in that you should always use the simplest technique that works well for you. These are all very complicated techniques that I can't really get into in the scope of this course, but at the end of the day, it's hard to outperform just the simple techniques that we've already talked about. In theory, there's something called the Bayes optimal classifier that will always be the best, but it's impractical. It's computationally prohibitive to do it. People have tried variations of the Bayes optimal classifier to make it more practical, like Bayesian parameter averaging, but it's still susceptible to overfitting, and it's often outperformed by bagging, which is the same idea behind random forests: you just resample the data multiple times, run different models, and let them all vote on the final result. It turns out that works just as well, and it's a heck of a lot simpler. Finally, there's something called Bayesian model combination that tries to solve all the shortcomings of the Bayes optimal classifier and Bayesian parameter averaging.
But at the end of the day, it doesn't do much better than just cross-validating against a combination of models, against stacking, basically. So again, these are very complex techniques that are very difficult to use in practice. You're better off with the simpler ones that we've talked about in more detail. But if you want to sound smart and use the word Bayes a lot, it's good to be familiar with these techniques, at least to know what they are. So that's ensemble learning. Again, the takeaway is that the simple techniques, like bootstrap aggregating (or bagging), boosting, stacking, or bucket of models, are usually the right choices. There are some much fancier techniques out there, but they're largely theoretical. At least you know about them now. Anyway, ensemble learning is always a good idea to try out. It's been shown time and time again that it often produces better results than any single model, so definitely consider it. Even though it's more work, you will usually get better results from using ensemble learning.
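To make bagging concrete, here is a small standard-library sketch of the idea from this lecture: resample the training data with replacement, train one copy of the same model on each resample, and let the copies vote. The "model" is a deliberately trivial one-dimensional threshold that I made up for the illustration, and the data, seed, and function names are all hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]

def train_stump(sample):
    """Toy model: split x halfway between the two class means."""
    zeros = [x for x, label in sample if label == 0]
    ones = [x for x, label in sample if label == 1]
    if not zeros or not ones:
        # The resample happened to contain a single class: always predict it.
        only_label = sample[0][1]
        return lambda x: only_label
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > threshold else 0

def majority_vote(votes):
    """Return the label that most models voted for."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical 1-D training data: small x values are class 0, large are class 1.
data = [(x, 0) for x in range(1, 6)] + [(x, 1) for x in range(6, 11)]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
models = [train_stump(bootstrap_sample(data, rng)) for _ in range(5)]

def bagged_predict(x):
    """Ask every model for its prediction and return the majority answer."""
    return majority_vote([model(x) for model in models])

print(bagged_predict(9), bagged_predict(2))
```

A random forest follows the same recipe with decision trees as the model, plus a random choice of attributes at each branch.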
40. [Activity] XGBoost: So now that we've talked about boosting and we've talked about decision trees in this course, let's put those concepts together and talk about XGBoost, which is arguably one of the most powerful machine learning algorithms out there today. So, a very important chapter here. XGBoost stands for eXtreme Gradient Boosted trees. Now remember, boosting is an ensemble method. The idea is that we take a model and we have multiple versions of that model chained together. What happens is every tree within our boosting scheme is going to boost the attributes that led to misclassifications from the previous tree. So basically we have multiple trees that are just building on top of each other to correct the errors of the tree before it. And it turns out this rather simple idea is really, really amazing. XGBoost is routinely winning Kaggle competitions. It's very easy to use. It's very computationally efficient. So it makes for a really good choice for an algorithm to start from. Whatever your problem, whether it's classification or regression, there's a really good chance that XGBoost is going to prove to be the best algorithm to make a model for your data and make accurate predictions based on that model. And it's really easy to use. I mean, it's almost disturbing how easy it is. Under the hood, it has a lot of really cool features that make it so good. One is something called regularized boosting. This is what sets it apart from other boosted tree methods out there. Regularization is something that helps prevent overfitting. It ensures that the model we end up with is generalized and not over-fitted to the set of data that you trained it on. We'll talk about regularization in more detail later in the course, but under the hood it uses L1 and L2 regularization, which again we'll talk about more later. Another really cool feature is that XGBoost can handle missing values automatically.
So it will automatically figure out the best way to handle the missing values in your data. You don't have to think about it too much. That's a really cool feature because dealing with missing values and imputing those missing values can be a huge part of your job as a data scientist, but XGBoost just kind of makes it happen. It can also be run in parallel, and that's the key to its efficiency. It can take advantage of all the cores on your CPU, running in parallel across multiple threads, or even take advantage of a cluster of computers. This also means that you can use it for big data, for large datasets that won't necessarily fit on one machine. So not only is XGBoost a really powerful and accurate algorithm for small datasets, it also scales well. So what's not to love about it, right? Another nice feature is that you can do cross-validation at each iteration. We haven't talked about cross-validation too much yet, but the idea is that you can evaluate the performance of XGBoost at each step of its training. That allows you to say, well, I'm not really seeing much more benefit from further iterations, so I'm going to go ahead and stop early. Or I can actually find the optimal number of iterations as I'm training it: I can monitor the accuracy of my model as it iterates, figure out when I should stop it, and find that optimal point pretty easily. It also supports incremental training. What I mean by that is that you can stop the training of an XGBoost model, save it, and come back and pick it up again later. So if you want to split up the training over a period of time or across multiple batch jobs, that's a possibility as well with XGBoost. Also, it allows you to plug in your own optimization objectives. This makes it very flexible in nature.
So whatever problem you have, if you can describe it in terms of something that you want to optimize, you can probably get XGBoost to work on it. And finally, it uses a feature called tree pruning. Unlike normal decision trees, where the tree just stops branching once it stops seeing a benefit from doing so, XGBoost takes a different approach: it will actually go very deep by default and then try to prune that tree backwards. That generally results in deeper but more highly optimized trees, and that's part of the key to its success. XGBoost is also ridiculously easy to install. You just type pip install xgboost from your Anaconda Prompt, and once it's installed, you can just start using it. It also offers a command-line interface and interfaces for C++, which is what it's written in natively, and also for the R language and Julia, and it has a JVM interface. So you can use XGBoost very efficiently from languages like Java or Scala, and in Spark using Scala, for example. So it's not just made for scikit-learn and Python notebooks. It's more general than that, and as such, it has its own interface. Things are a little bit different when you're using XGBoost alongside scikit-learn within a notebook. The main difference is that it uses something called a DMatrix data structure to hold the feature and label data. But using it is very easy, and there is a very easy way to create one of those DMatrix structures from a NumPy array, so in practice it's not a big deal. You'll pass in all the parameters for XGBoost as a giant dictionary, and we'll talk about that in a second. Once you've done that, all you have to do is call train on the model and then call predict on the trained model to make predictions from it. It's really, really easy. The hard part is tuning all those hyperparameters of XGBoost. There are a bunch of knobs and dials in XGBoost.
And to get the best results, you need to choose the right settings. Often that's just done through experimentation. Some things are going to be pretty straightforward. For example, you can choose your booster: gbtree for classification problems or gblinear for regression problems. You also need to choose your type of objective function. For example, I might choose multi:softmax if I want to just choose one of many classifications and pick the best classification for any given thing. Or I could say I want multi:softprob, which gives me actual probabilities for each classification. That could potentially allow me to get a list of likely classifications, more than one for each thing that I'm trying to predict. Beyond that, the rest need to be learned through experimentation. Eta is probably the primary parameter, the biggest knob that you have in XGBoost, if you will. You can think of it as the learning rate. It adjusts the weights on each step of training. The default value is 0.3, and often you'll find that in practice, lowering that a little bit to 0.2 or even lower will produce better results. So that's the main thing you want to start tinkering with once you're trying to tune the performance of your XGBoost model. Some other important parameters are max_depth, which is the maximum depth of the tree. Obviously, if that's too small, you're not going to be able to create a very accurate model, but if it's too large, you might end up overfitting. So tuning that can be a very important thing to try to get right. There's also one called min_child_weight. This can also be used to control overfitting, making sure that your model is not too specific to the data that you trained it on. But if you set it too high, you'll end up underfitting, so you need to get the right balance on that as well. There are a large number of other parameters in XGBoost as well.
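The difference between the two objectives mentioned above, multi:softmax and multi:softprob, comes down to whether you get back one winning class or a probability for every class. The standard-library sketch below illustrates that idea only. It is a conceptual illustration with made-up scores, not a description of XGBoost's internals.

```python
import math

def softmax(scores):
    """Turn raw per-class scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]   # subtract the max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

raw_scores = [0.2, 2.5, 1.1]   # hypothetical scores for three classes

# softprob-style answer: one probability per class.
probabilities = softmax(raw_scores)

# softmax-style answer: just the index of the most likely class.
best_class = probabilities.index(max(probabilities))

print(probabilities)
print(best_class)
```

If you only need a single label per item, the softmax-style answer is enough. The probabilities become useful when you want a ranked list of likely classes or a confidence threshold.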
But these are the main ones that you want to fiddle with and experiment with. And again, sometimes you just need to experiment to figure out which combination of values works the best. Within a Python notebook, you can use tools such as GridSearchCV to automatically try an array of different values for these parameters and find out which one is the right one. Or if you're using a larger system like AWS SageMaker, it will have things like automatic hyperparameter tuning jobs that you can set up to try to find just the right combination of parameters. So tuning these just right is key to getting the best performance out of XGBoost. But as you'll see, you don't have to think too hard to get good results out of it. So remember, XGBoost is almost all that you need to know for machine learning these days in practical terms. For simple classification or regression problems, odds are you're going to get the best results from XGBoost, and using it is really easy. So with that, let's actually see it in action. We're going to throw it at the Iris dataset. This is a common dataset used for educational purposes. It's just a dataset of a bunch of flowers where they measured the length and width of both the petals and the sepals. A sepal is just a specific kind of petal; I think it's the one on the bottom of the iris flower. Based on those measurements, we're now trying to predict what subspecies of iris that flower actually belongs to. And as we'll see, XGBoost is extremely good at this. So let's dive in and give it a shot. Let's play with XGBoost. We'll start by bringing up our Anaconda Prompt, as always. I'll go to my Start menu here and go to the Anaconda Prompt, or on Mac or Linux, bring up the terminal. The first thing we need to do is cd into where our course materials are stored. For me, that's going to be cd c:\MLCourse. And before we spark up the Jupyter Notebook, we need to first install XGBoost itself.
I've already done that, but you probably haven't, so go ahead and type in pip install xgboost to take care of that. For me, it won't do anything because I've already installed it, but for you it should be going out there and downloading and installing the latest version of XGBoost. Once that's done, we can type in jupyter notebook, Jupyter with a y. Once we're in here, go ahead and find the XGBoost notebook and open that up, and let's get started. Again, using XGBoost is really easy. There's not a whole lot here to look at because there's not a whole lot to do. It's really easy to use and it just works. We already installed it, but as a reminder, that's how you would install it if you hadn't already. And again, we're going to be experimenting with the Iris dataset. The idea is that we have a dataset of flowers where we have measurements of both the petals and the sepals, which are just a special kind of petal: the length and width of each. So for every flower, we're going to have four measurements, four attributes or features, if you will: the length and width of the petals and the length and width of the sepals. What we need to do is predict which subspecies of flower it is based on those measurements. It turns out there are three subspecies of iris flowers: setosa, versicolor, and virginica. Let's go ahead and load that data up. Fortunately, it's already built into scikit-learn, so we just have to say load_iris to load it. And we can explore the parameters and features of that data here. We'll look at the shape of the data to try to understand what's in there. Go ahead and hit Shift+Enter to actually run that. We can see that this consists of 150 samples, so there are only 150 flowers in our dataset. Every flower has four features: the length and width of the petal, and the length and width of the sepal. And then there are the possible target names.
The actual categories that we're trying to predict, the labels, are setosa, versicolor, and virginica. So let's get started by dividing up that data into a training dataset and a test dataset. The idea here is that we want to make sure that we're only training our model based on our training data, and then we're going to set aside 20% of our data to actually evaluate the model with. We want to make sure that we're holding this test dataset aside and not training the model on it, so that we can say: okay, XGBoost model, how well do you do at predicting subspecies for flowers you've never seen before? This makes sure that we're not training on the answers; we're not cheating, if you will. So the idea is that we set aside 20% of our data for testing purposes, train the model on the remaining 80%, and then evaluate the model based on the data that was withheld. That's what train_test_split is doing: it's just randomly splitting up that data. We pass in the actual feature data and the label data, and we say that we want to reserve 20% of the data for testing. We can also give it a specific random state to make sure that we get the same results every time we run this. That will go into a bunch of different arrays here. X is, by convention, the feature data, in this case the lengths and widths of the petals and sepals, and y, by convention, refers to the label data, so that's going to be what subspecies it is. What this means is that the feature data, the actual measurements, is going to go into X_train for the training dataset and X_test for the test dataset. And the labels, the answers, if you will, of what subspecies it is, will go into y_train for the training data and y_test for the test data. Let's go ahead and run that before I forget. And now we can load up XGBoost itself.
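The loading and splitting steps just described look roughly like this, assuming scikit-learn is installed. The `random_state` value of 0 is my own choice for the sketch; any fixed value makes the split repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# 150 flowers, 4 measurements each, and three possible labels.
num_samples, num_features = iris.data.shape
print(num_samples, num_features, list(iris.target_names))

# Hold back 20% of the rows for testing; the model never trains on them.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape)
```

With 150 samples, an 80/20 split leaves 120 flowers for training and 30 for evaluation.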
So as we said, XGBoost is a little bit quirky in that you have to use these DMatrix things instead of just using straight-up NumPy arrays. But as you can see here, it's really easy to create them from a NumPy array. We're going to say the entire training data is going to be a DMatrix that consists of the training feature data, the actual measurements, and the label data, the actual subspecies. And we do the same thing for the test data. So we've basically embodied all of the feature and label data for both training and testing in these two DMatrix objects. Go ahead and Shift+Enter to run that. Next, we need to define our hyperparameter values. As we said, this is often the hardest part of the whole thing: trying to find the right values of these settings, if you will. So we'll just start with a guess. Let's say that we're going to start with a maximum depth of our trees of four. For eta, we'll start with the default learning rate of 0.3, and again, usually you want to go a little bit lower on that, if anything. Our objective function will be softmax. Softmax just means that we want to look at the most likely classification for each flower. In contrast, we could use softprob to get the actual probabilities associated with each individual subspecies. But I'm just interested in one answer per flower, so I want it to automatically pick the best classification value, and that's what softmax does: it picks the classification with the maximum probability. And we will specify the number of classifications that we have, in this case three, because there are only three subspecies to choose from. The other thing we have to guess at here and tune is the number of epochs, or iterations, call it what you want. Basically, how many times are we going to actually run this algorithm? So with that, Shift+Enter, and we can then train our model with one line of code. It's just that easy.
So by calling XGBoost's train function, we just pass in that dictionary of parameters, the actual training data, which is that DMatrix object, and how many epochs we want to run it over. Go ahead and Shift+Enter. And you can see it's already done. Like we said, XGBoost is really, really fast. So now we can just make predictions based on that trained model. Let's go ahead and call predict, passing in that test data that we withheld. Remember, we took 20% of our dataset and set it aside so that the model never saw it while training, and now we're going to evaluate how well it predicts the flowers that it's never seen before. If we print out those predictions, we can see that it's printing out the category number of each individual flower in that test dataset. I forgot what these actually correspond to. I think two means virginica, for example. So these are the actual subspecies predictions for each data point in the test data that the model never saw before. Let's see how well it actually did. I'm just going to call accuracy_score from the scikit-learn metrics package. I'll pass in the actual known correct values, which are going to be in y_test (those are the correct classifications that we know), and the predicted values, and we'll compare how well they match up. And look at that. It's actually perfect. You do not see that very often. I mean, we just guessed at the right hyperparameters for XGBoost, and even just guessing, we got perfect results out of it. Obviously that's exceptional; you're usually not going to get 100 percent accuracy. But wow, that is amazing performance. Now normally, as a hands-on activity, I'd have you try to improve the accuracy by messing with the hyperparameters further, but you can't improve on 100 percent. So instead, what I want you to do is try to make this model more efficient.
Could you actually get away with fewer epochs or iterations? Could you actually get away with smaller trees by lowering the max_depth parameter? Try to optimize the simplicity of this model, and therefore its performance, and see how simple you can make it before you actually start losing accuracy. Play around with that and get a feel for how those hyperparameters affect the actual accuracy of your results. But yeah, that's XGBoost in action. As you can see, it produces awesome results, and it's not hard to use. It should be your go-to algorithm in a lot of cases.
41. Support Vector Machines (SVM) Overview: Finally, we're going to talk about support vector machines, which is a very advanced way of clustering or classifying higher-dimensional data. So what if you have multiple features that you want to predict based off of? SVM can be a very powerful tool for doing that, and the results can be scary good. It's very complicated under the hood, but the important thing is understanding when to use it and how it works at a high level. So let's cover that now. Support vector machines is a fancy word for what actually is a fancy concept, but fortunately, it's pretty easy to use. The important thing is knowing what it does and what it's good for. Support vector machines work well for classifying higher-dimensional data, and by that I mean lots of different features. It's easy to use something like k-means clustering to cluster data that has two dimensions, maybe age on one axis and income on another. But what if I have many, many different features that I'm trying to predict based off of? Well, support vector machines might be a good way of doing that. Mathematically, what it can do is find these higher-dimensional support vectors (that's where it gets its name) that define the higher-dimensional planes that split the data into different clusters. Obviously the math gets pretty weird pretty quickly with all of this. Fortunately, the scikit-learn package will do it all for you without you having to actually get into it yourself. Under the hood, you need to understand, though, that it uses something called the kernel trick to actually find those support vectors, and there are different kernels you can use to do this in different ways. It all sounds very fancy. But again, the main point is that SVMs are a good choice.
That's especially true if you have higher-dimensional data with lots of different features. There are different kernels you can use that have varying computational costs and might be better fits for the problem at hand. I also want to point out that this is a supervised learning technique, so we're actually going to train it on a set of training data, and we can use that to make predictions for future unseen data or test data. So it's a little bit different from k-means clustering, in that k-means was completely unsupervised. A support vector machine, by contrast, is trained on actual training data, where you have the answer, the correct classification, for some set of data that it can learn from. Okay, so SVMs are useful for classification, clustering if you will, but it's a supervised technique. Keep that in mind. One example that you often see with SVMs uses something called support vector classification, or SVC, and the typical example uses the Iris dataset. One of the sample datasets that comes with scikit-learn is called the Iris dataset, and what it is is a classification of different flowers: different observations of different iris flowers and their species. The idea is to classify these using information about the length and width of the petal on each flower and the length and width of the sepal of each flower. A sepal, apparently, is a little support structure underneath the petal. I didn't know that until now, either. But you have four dimensions of attributes there: the length and width of the petal and the length and width of the sepal, and you can use that to predict the species of an iris, given that information. So here's an example of doing that with SVC. Basically, we have sepal width and sepal length projected down to two dimensions, so we can actually visualize it. And with different kernels, you might get different results.
So SVC with a linear kernel produces something like this, it turns out, and then there's a LinearSVC method as well, which also uses a linear kernel. It produces a result like that. You can also use polynomial kernels or fancier kernels that might project down to curves in two dimensions, so you can do some pretty fancy classifications this way. So that's an example. Again, these have increasing computational costs, and they can produce more complex relationships. But again, it's a case where too much complexity can yield misleading results, so you need to be careful and actually use train/test when appropriate. Since we are doing supervised learning, you can actually do train/test and find the right model that works, or maybe it's an ensemble approach. You need to arrive at the right kernel for the task at hand. And for things like polynomial SVC, what's the right degree of polynomial to use? Even things like LinearSVC will have different parameters associated with them that you might need to optimize. It'll make more sense with a real example, so let's dive into some actual Python code and see how it works.
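As a rough sketch of the train/test idea described in this lecture, the following compares a few kernels on the Iris data set that ships with scikit-learn. It is not the course notebook: it uses all four features rather than a two-dimensional projection, and the split size, random seed, and polynomial degree are assumptions chosen only for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

# Load the Iris data: 4 features (sepal/petal length and width), 3 species.
iris = datasets.load_iris()

# Hold out 30% of the observations for testing (assumed split and seed).
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# Train one support vector classifier per kernel and score it on held-out data.
scores = {}
for kernel in ["linear", "poly", "rbf"]:
    model = svm.SVC(kernel=kernel, C=1.0, degree=3)  # degree only affects "poly"
    model.fit(X_train, y_train)
    scores[kernel] = model.score(X_test, y_test)

for kernel, accuracy in scores.items():
    print(f"{kernel:>6}: test accuracy = {accuracy:.3f}")
```

Comparing held-out accuracy like this is one simple way to choose between kernels; a fuller treatment would also vary C and the polynomial degree.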
42. Using SVM to Cluster People: Let's play around with support vector machines. Open up the SVC notebook in your course materials. Fortunately, it's a lot easier to use support vector machines than to understand how they work. For this example, let's go back to the same clustering sample data that we created for our k-means clustering lecture. We just have this createClusteredData function here again. We're going to put in a consistent random seed so you'll see the same results that I see here. We will pass in a given number of clusters that we want, and the number of points per cluster will be computed as N over k. So basically, we're passing in how many clusters we want and how many total points we want to distribute between them. We then create our X and y arrays here. X will contain our feature data, which will be 2D points consisting of incomes and ages that are randomly generated and fabricated, and y, our labels, will represent the clusters that these points are associated with. So, for each cluster we want to create, we'll pick a random income centroid and a random age centroid, randomly selected between 20 and 70 years and between $20,000 and $200,000. And then for every point in that cluster, we will create a fictitious point randomly sampled from a normal distribution for a given income and age around those centroids. Those will be appended to our X feature array, and then we'll append the actual cluster number to our y label array. We'll convert those to NumPy arrays and return them. All right, so now we'll actually prepare our data. We'll call createClusteredData, asking for 100 points broken up into five different clusters, and that function will do that for us. And let's plot what it comes back with.
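The data-generation function described above might look roughly like the sketch below. The centroid ranges come from the lecture, but the function and variable names, the seed value, and the standard deviations of the normal distributions are assumptions made for illustration, not the notebook's actual code.

```python
import numpy as np

def create_clustered_data(n_points, n_clusters, seed=1234):
    """Fabricate (income, age) points grouped around random centroids."""
    rng = np.random.RandomState(seed)  # fixed seed so results are repeatable
    points_per_cluster = n_points // n_clusters  # N over k
    X, y = [], []
    for cluster in range(n_clusters):
        # Pick a random centroid for this cluster.
        income_centroid = rng.uniform(20000.0, 200000.0)
        age_centroid = rng.uniform(20.0, 70.0)
        for _ in range(points_per_cluster):
            # Sample a point around the centroid (spreads are assumed values).
            X.append([rng.normal(income_centroid, 10000.0),
                      rng.normal(age_centroid, 2.0)])
            y.append(cluster)
    return np.array(X), np.array(y)

# 100 points split across 5 clusters, as in the lecture.
X, y = create_clustered_data(100, 5)
print(X.shape, y.shape)
```

Each row of X is one fabricated person, and the matching entry of y is the cluster that person was generated from.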
So we'll just say we want to create an eight-by-six figure here, and we will do a scatter plot with the color correlated with the actual label associated with each point. That will make sure that each cluster gets its own color. We'll then plot that. Now it turns out that in order for SVC to work well with some kernels, especially the poly kernel, we need to scale that data down to a normalized range between -1 and 1. Otherwise the solutions might never converge. So we're also going to scale this data down to the range -1 to 1 here. To do that, we're using something called MinMaxScaler that comes out of the preprocessing module of scikit-learn. We'll just say we want to fit all the features in the X array, which contains all the incomes and ages, to the range -1 to 1. We'll apply that transform to our feature data, and then we'll plot the resulting scaled data as well. Now, if you ever wanted to get back to the original data range, there's also an inverse transform function on MinMaxScaler you can use to go back the other direction, but for this example we won't need to do that. Anyway, let's go ahead and run these before we forget. I'll go back up to the function definition block here and run that with Shift+Enter, and then in block two, Shift+Enter again. You can see here the plot of our unscaled results, where the incomes range from 20,000 to 200,000, plus or minus some slop value there, and our ages range roughly from 30 to 70 years old on the age axis. So basically, our feature data consists of incomes on the x axis and ages on the y axis. After we scale it, it looks exactly the same, except the actual numbers have been scaled down to the range -1 to 1 on both axes. So that's prepared our data for being friendly to SVC. Now we just have to actually apply SVC, which is incredibly easy.
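The scaling step described above can be sketched as follows. The three rows of data are made-up stand-ins for the generated feature array, chosen so the scaled values are easy to check by eye; the name scaling follows the lecture.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical (income, age) rows standing in for the generated feature array.
X = np.array([[20000.0, 30.0],
              [110000.0, 50.0],
              [200000.0, 70.0]])

# Learn each column's min and max, then map every column onto [-1, 1].
scaling = MinMaxScaler(feature_range=(-1, 1)).fit(X)
X_scaled = scaling.transform(X)
print(X_scaled)

# inverse_transform goes back the other way, to the original units.
X_restored = scaling.inverse_transform(X_scaled)
print(X_restored)
```

Note that each column is scaled independently, so income and age both end up spanning -1 to 1 even though their original ranges are very different.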
All we have to do is import the svm package from scikit-learn and call svm.SVC. Specify the kernel that you want; we'll start with a simple linear kernel. The main hyperparameter that matters here is C, which is 1.0 by default, and that works well. We'll ask it to fit to our feature array and our label array. So we're basically saying, fit with our array of all the 2D points of ages and incomes and all of our labels of what clusters we know those were originally assigned to. That will train our SVC model. Let's go ahead and run that. That went really quickly, so let's take a look at the results. What we're going to do here is create a plot where we can actually visualize the ranges within this 2D space of ages and incomes that our model thinks correlate to given clusters, and at the same time we'll plot the actual results of what cluster it thinks each individual point is in as well. To build up this graphic, we first need to build up a very dense 2D array that we can actually create predictions on, so we can plot what those clusters are at every point in this space. To do that, we're calling meshgrid on NumPy, saying that we want a grid of sample points that range between -1 and 1 on both the x and y axes at a spacing of just 0.1 apart from each other. This is close enough that it's going to look like a continuous range of color when we actually plot it. We then need to convert the result from meshgrid to NumPy arrays we can use with a classification prediction, and then we need to convert that in turn to a list of 2D income/age points. So the ravel function here is just converting this back to flat NumPy arrays, and np.c_ is shorthand for concatenate. That says, I'm going to take all of these x values and all of these y values and concatenate them together into a list of points that go x, y, x, y, x, y, et cetera.
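To make the meshgrid, ravel, and np.c_ steps concrete, here is a small sketch. It uses a deliberately coarse spacing of 0.5 (an assumption, so the arrays stay small enough to inspect), and the variable name sample_points is hypothetical.

```python
import numpy as np

step = 0.5  # coarse spacing so the output is easy to read

# meshgrid returns two 2D arrays: the x coordinate and the y coordinate
# of every grid point in the square from -1 up to (but not including) 1.
xx, yy = np.meshgrid(np.arange(-1, 1, step), np.arange(-1, 1, step))

# ravel flattens each 2D grid into a 1D array.
npx = xx.ravel()
npy = yy.ravel()

# np.c_ pairs them column-wise, giving one (x, y) row per grid point.
sample_points = np.c_[npx, npy]
print(xx.shape, sample_points.shape)
print(sample_points[:3])
```

The resulting two-column array has the same layout as the scaled feature data, which is what lets it be passed straight to a trained model's predict method.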
So now that I have that list of sample points, I can pass that into predict. Now, just to clarify a point of confusion here, x and y here correspond to the x and y axes. So npx is going to end up being all of the income data and npy will be all of the age data. So don't let that confuse you: when we talk about y in this context, I'm talking about the y axis, not the label data y. So we have this feature array we're going to pass into predict, and that will give us back predicted labels, which are the actual cluster numbers for each point within this space. Then we just have to plot it. Again, we'll make an eight-by-six figure. We will reshape those results to make sure they match the x dimension of our plot, and then we'll call contourf. This will basically go through every single point that we created on that mesh grid and take the labels that were predicted, so these are the predicted labels, not the actual labels, and plot them using whatever color map we want to use for those values. After that, we'll do a scatter plot of the original points that are still living in our capital X array. That's our feature data, again, income and age, and we will color them based on the actual original clusters, calling y.astype(float) to convert those original labels to color values for the plot. So let's go ahead and run that and see what it looks like. It takes a little while to create that many predictions, but not too long. And you can see here what happened. Remember, what we're doing here is plotting the actual ranges of predictions as the background color. We're using a linear kernel, so we have straight lines delineating each region that it believes corresponds to a cluster. So basically anything in this triangle is going to be predicted by the model to be in one cluster, and anything that's on a red background will be in another cluster.
And we're plotting on top of that the original true clusters from when we actually created the data. Now, these clusters do overlap to some extent, so it's really not going to be possible or realistic to expect any algorithm to get 100% on this. You can see that it missed a couple. For example, we have this blue point here that's showing up in a different color than the rest of the ones that should be in the same cluster. Here the background colors don't necessarily correlate with the foreground ones, but you can see when a point doesn't match its neighbors around it in the rest of that cluster, that's probably going to be an incorrect result. Also, we have this purple point here from this cluster that seems to have snuck into this red region that really belongs to this other cluster. So it got a couple of points wrong, but I can't really fault it too much for that. The clusters do, in fact, overlap. Now, you can also use the predict function on the trained model to make predictions for specific new points that the model hasn't seen before. That's kind of the whole point of all of this. So in this particular example, we're calling svc.predict to predict a cluster for someone with an income of $200,000 at age 40, and it turns out that lands in cluster number three with our model. And someone with $50,000 at age 65 ended up in cluster number two. Now note that we have to scale those numbers down first before we can actually make a prediction using them. Remember that before we trained this model, we scaled all of our inputs down using MinMaxScaler. So we have to use that same scaler, which we named scaling, to transform our input data when we're trying to make predictions for new individual points as well. It's very important not to forget that step. It's very easy to forget, and I've done it myself.
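The point about reusing the same scaler at prediction time can be sketched end to end as below. This is not the notebook's data: the three centroids, their spreads, and the seed are hypothetical values chosen so the clusters are well separated, and new_people is an illustrative name.

```python
import numpy as np
from sklearn import svm
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)

# Hypothetical, well-separated (income, age) centroids; not the course data.
centroids = [(40000.0, 25.0), (90000.0, 45.0), (160000.0, 60.0)]
X = np.array([[rng.normal(income, 5000.0), rng.normal(age, 2.0)]
              for income, age in centroids for _ in range(30)])
y = np.repeat(np.arange(len(centroids)), 30)  # 30 labels per cluster

# Fit the scaler on the training features, then train on the scaled features.
scaling = MinMaxScaler(feature_range=(-1, 1)).fit(X)
svc = svm.SVC(kernel="linear", C=1.0).fit(scaling.transform(X), y)

# New, unseen people must go through the SAME fitted scaler before predict.
new_people = np.array([[40000.0, 25.0], [160000.0, 60.0]])
predicted = svc.predict(scaling.transform(new_people))
print(predicted)
```

Passing the raw incomes and ages to predict without the transform would place the points far outside the -1 to 1 square the model was trained on, so the predictions would be meaningless.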
But with that, I will challenge you to extend this. For your activity here, try different kernels. Linear is just one of them. Go and look up the documentation in scikit-learn online for the SVC model, which itself is a good exercise because you can answer a lot of your own questions that way, and find out what the other possible kernel options are. See if you can find a better one. When you get into nonlinear kernels, maybe those can fit this data a little bit better and do a better job on it. So just do some experimentation. Try different kernels and see what effect that has on the shapes of the clusters and the regions that define them. You might find it interesting. Also see what effect it has on the run time. Some of these are more complicated than others, and some of them require more computational resources than others, so that's also a trade-off you have to consciously make. You might also want to fiddle around with some of the hyperparameters associated with these different models. They'll be documented in the scikit-learn documentation for each kernel type. Many of them have a lot of different parameters, and you just have to experiment to find the parameters that work best. So play with that as well. Hyperparameter tuning is a very important part of machine learning. So go and get your hands dirty, if you will, and see if you can do a better job using a different kernel type.
43. User-Based Collaborative Filtering: Let's talk about my personal area of expertise: recommender systems. So, systems that can recommend stuff to people based on what everybody else did. We'll look at some examples of this and a couple of ways to do it, specifically two techniques called user-based and item-based collaborative filtering. So let's dive in. I want to talk about a subject that's near and dear to my heart: recommender systems. If you remember, I actually spent most of my career at amazon.com and imdb.com, and a lot of what I did there was developing recommender systems, things like People Who Bought Also Bought or Recommended for You, and things that did movie recommendations for people. So this is something I know a lot about personally, and I hope to share some of that knowledge with you. So what do we mean by recommender systems? Well, like I said, Amazon is a great example and one that I'm very familiar with. If you go to their recommendations section here, you can see that it will actually recommend things that you might be interested in purchasing based on your past behavior on the site. That might include things that you've rated or things that you bought, among other signals that they might use as well. I can't go into the details because they'll hunt me down and, you know, do bad things to me. But it's pretty cool. It's pretty good stuff, too. You could also think of the People Who Bought Also Bought feature on Amazon as a form of recommender system. The difference is that the recommendations you're seeing here are based on all of your past behavior, whereas People Who Bought Also Bought or People Who Viewed Also Viewed, things like that, are just based on the thing you're looking at right now, or the thing that you're thinking of buying right now, and showing you things that are similar to it that you might also be interested in. It turns out, what you're doing right now is probably the strongest signal of your interest anyhow.
Another example is from Netflix. They have various features that try to recommend new movies or other movies you haven't seen yet, based on the movies that you liked or watched in the past as well. Then they break that down by genre. They have kind of a different spin on things where they try to identify the genres or the types of movies that they think you're enjoying the most, and they show you more results from those genres. So that's another example of a recommender system in action. And the whole point of it is to help you discover things that you might not have known about before. So it's pretty cool. It gives movies or books or music or whatever a chance to be discovered by people who might not have heard about them before. So not only is it a cool technology, it also kind of levels the playing field a little bit and helps new items get discovered by the masses. So it plays a very important role in today's society. At least I'd like to think so. So there are a few ways of doing this. Let's talk about recommending stuff based on your past behavior. One technique is called user-based collaborative filtering, and here's how it works. Collaborative filtering, by the way, is just a fancy name for recommending stuff based on the combination of what you did and what everybody else did. So it's looking at your behavior and comparing that to everyone else's behavior to arrive at the things that might be interesting to you that you haven't heard of yet. The idea here is we build up a matrix of everything that every user I have has ever bought or viewed or rated, or whatever signal of interest you want to base this system off of. So basically I end up with a row for every user in my system, and that row contains all of the things they did that might indicate some sort of interest in a given product. So picture a table. I have users for the rows, and each column is an item. Each column might be a movie or a product.
It could also be a web page or whatever; you can use this for many different things. Then I use that matrix to compute the similarity between different users. So I basically treat each row of this as a vector, and I can compute the similarity between each vector of users based on their behavior. So two users who liked mostly the same things would be very similar to each other, and I can then sort this by those similarity scores. So if I can compute the similarity between all the users based on their past behavior, I can then find the users most similar to me and recommend stuff that they liked that I didn't look at yet. Okay, let's look at a real example, and it will make a little bit more sense. Let's say that this nice lady here watched Star Wars and The Empire Strikes Back, and she loved them both. So we have a user vector here of this lady giving a five-star rating to Star Wars and The Empire Strikes Back. Now let's say Mr. Edgy Mohawk Man comes along and he only watched Star Wars. That's the only thing he's seen. He doesn't know about The Empire Strikes Back yet. Somehow he lives in some strange universe where he doesn't know that there are actually many, many Star Wars movies, growing every year, in fact. But we can say, well, this guy's actually pretty similar to this other lady because they both enjoyed Star Wars a lot. So their similarity score is probably fairly good, and we can say, okay, what has this lady enjoyed that he hasn't seen yet? And The Empire Strikes Back is one. So we can then take that information that these two users are similar based on their enjoyment of Star Wars, find that this lady also liked The Empire Strikes Back, and that might be a good recommendation for Mr. Edgy Mohawk Man. We can go ahead and recommend him The Empire Strikes Back, and he'll probably love it because, in my opinion, it's actually a better film. But I'm not going to get into geek wars with you here.
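The Star Wars example above can be sketched in a few lines of plain Python. Cosine similarity is used here as one reasonable way to compare user vectors; the lecture does not name a specific metric at this point, so that choice, the third user, and all names in the code are assumptions for illustration.

```python
from math import sqrt

# Hypothetical star ratings; unrated movies are simply absent.
ratings = {
    "nice_lady": {"Star Wars": 5, "The Empire Strikes Back": 5},
    "mohawk_man": {"Star Wars": 5},
    "drama_fan": {"Casablanca": 4},
}

def cosine_similarity(a, b):
    """Cosine similarity between two rating dicts, treating missing as 0."""
    dot = sum(a[m] * b[m] for m in a if m in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend(user, ratings):
    """Suggest unseen movies rated by the single most similar other user."""
    others = [u for u in ratings if u != user]
    nearest = max(others,
                  key=lambda u: cosine_similarity(ratings[user], ratings[u]))
    return [m for m in ratings[nearest] if m not in ratings[user]]

print(recommend("mohawk_man", ratings))
```

A real system would combine several similar users and weight by rating rather than copying from just the nearest one, but the flow is the same: compare user rows, find similar users, recommend what they liked.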
Now, unfortunately, user-based collaborative filtering has some limitations. When we think about relationships and recommending things based on relationships between items and people and whatnot, our mind tends to go to relationships between people. So we want to find people that are similar to you and recommend stuff that they liked. That's kind of the intuitive thing to do, but it's not the best thing to do. One problem is that people are fickle; their tastes are always changing. So maybe that nice lady in the previous slides had sort of a brief science fiction action film phase that she went through, and she got over it. Maybe later on in her life she started getting more into dramas or romance films or rom-coms. So what would happen if my Edgy Mohawk Guy ended up with a high similarity to her just based on her earlier sci-fi period, and we ended up recommending romantic comedies to him as a result? That would be bad, right? I mean, there is some protection against that in terms of how we compute the similarity scores to begin with, but it still pollutes our data that people's tastes can change over time. So comparing people to people isn't always a straightforward thing to do, because people change. The other problem is that there are usually a lot more people than there are things in your system. There are seven billion people in the world and counting. There are probably not seven billion movies in the world, or seven billion items that you might be recommending out of your catalog. So the computational problem of finding all the similarities between all of the users in your system is probably much greater than the problem of finding similarities between the items in your system. By focusing the system on users, you're making your computational problem a lot harder than it might need to be, because you have a lot of users, at least hopefully you do if you're working for a successful company. The other problem is that people do bad things.
There's a very real economic incentive to make sure that your product or your movie or whatever it is gets recommended to people, and there are people who try to game the system to make that happen for their new movie or their new product or their new book or whatever. When you're basing this on user relationships, it's pretty easy to fabricate fake personas in the system by creating a new user and having them do a sequence of events that likes a lot of popular items and then likes your item too. This is called a shilling attack, and ideally we want to have a system that can deal with that. There is research around how to detect and avoid these shilling attacks in user-based collaborative filtering. But an even better approach would be to use a totally different approach that's not so susceptible to gaming the system, and we'll talk about that in our next lecture. There is a way to flip this on its head and actually do better than user-based collaborative filtering. So that's user-based collaborative filtering. Again, it's a simple concept: you look at similarities between users based on their behavior and recommend stuff that a user who was similar to you enjoyed that you haven't seen yet. Now, it does have its limitations, as we talked about. So let's talk about flipping the whole thing on its head with a technique called item-based collaborative filtering, up next.
44. Item-Based Collaborative Filtering: So let's try to address some of the shortcomings in user-based collaborative filtering with a technique called item-based collaborative filtering, and we'll see how that can be more powerful. It's actually one of the techniques that Amazon uses under the hood, and they've talked about this publicly, so I can tell you that much. But let's see why it's such a great idea. So we talked about user-based collaborative filtering, where we recommend items based on what people similar to you liked that you haven't seen or experienced yet. And we talked about some of the problems with user-based collaborative filtering. So what if we flip it on its head? Instead of basing our recommendations on relationships between people, we base them on relationships between items, and that's what item-based collaborative filtering is. This draws on a few insights. For one thing, we talked about people being fickle. Their tastes can change over time, so comparing one person to another person based on their past behavior becomes pretty complicated. People have different phases where they have different interests, and you might not be comparing people that are in the same phase to each other. But an item will always be whatever it is. A movie will always be a movie. It's never going to change. Star Wars will always be Star Wars, well, until George Lucas tinkers with it a little bit. But for the most part, items do not change as much as people do. So we know that these relationships are more permanent, and there's more of a direct comparison you can make when computing similarity between items, because they do not change over time. The other advantage is that there are generally fewer things that you're trying to recommend than there are people you're recommending to. So again, there are seven billion people in the world, and you're probably not offering seven billion things on your website.
So you can save a lot of computational resources by evaluating relationships between items instead of users, because you will probably have fewer items than you have users in your system. That means you can run your recommendations more frequently and make them more current, more up to date, better. You can use more complicated algorithms because you have fewer relationships to compute, and that's a good thing. It's also harder to game the system. We talked about how easy it is to game a user-based collaborative filtering approach by just creating some fake users that like a bunch of popular stuff and then the thing you're trying to promote. With item-based collaborative filtering that becomes much more difficult. You have to game the system into thinking there are relationships between items, and since you probably don't have the capability to create fake items with fake ties to other items based on many, many other users, it's a lot harder to game an item-based collaborative filtering system, which is a good thing. While I'm on the topic of gaming the system, another important thing is to make sure that people are voting with their money. A general technique for avoiding shilling attacks or people trying to game your recommender system is to make sure that the behavior you're basing it on is people actually spending money. You're always going to get better and more reliable results when you base recommendations on what people actually bought, as opposed to what they viewed or what they clicked on. All right, so let's talk about how item-based collaborative filtering works. It's very similar to user-based collaborative filtering, but instead of users, we're looking at items. So let's go back to the example of movie recommendations. The first thing we would do is find every pair of movies, every movie pairing, that was watched by the same person. So we go through and find every pair of movies that was watched by the same people.
And then we measure how similar the ratings are among all those people who watched both movies. By this means we can compute similarities between two different movies based on the ratings of the people who watched both of them. So I have a movie pair, maybe Star Wars and The Empire Strikes Back. I find a list of everyone who watched both of those movies and compare their ratings to each other. If they're similar, then I can say these two movies are similar, because they were rated similarly by people who watched both of them. That's the general idea here. That's one way to do it; there's more than one way. Then I can just sort everything by the movie and then by the similarity strength of all the similar movies to it, and there are my results for people who liked also liked, people who rated this highly also rated this highly, and so on and so forth. And like I said, that's just one way of doing it. So that's sort of step one of item-based collaborative filtering. First, I find relationships between movies based on the relationships of the people who watched every given pair of movies. It'll make more sense when we go through the example. So, for example, let's say that our nice young lady here watched Star Wars and The Empire Strikes Back and liked both of them, rating both five stars or something. Now along comes Mr. Edgy Mohawk Man, who also watched Star Wars and The Empire Strikes Back and also liked both of them. So at this point, we can say there's a relationship, a similarity, between Star Wars and The Empire Strikes Back based on these two users who liked both movies. So what we're going to do is look at each pair of movies. We have a pair of Star Wars and The Empire Strikes Back, and then we look at all the users that watched both of them, which are these two guys. If they both liked them, then we can say that they're similar to each other, or if they both disliked them, we can say they're similar to each other, right?
So we're just looking at the similarity score of these two users' behavior related to the two movies in this movie pair. So along comes Mr. Mustache Lumberjack Hipster Man, and he watches The Empire Strikes Back. He lives in some strange world where he watched The Empire Strikes Back but had no idea that Star Wars, the first movie, existed. Well, that's fine. We computed a relationship between The Empire Strikes Back and Star Wars based on the behavior of these two people, so we know that these two movies are similar to each other. So given that Mr. Hipster Man liked The Empire Strikes Back, we can say with good confidence that he would also like Star Wars, and we can then recommend that back to him as his top movie recommendation. So you can see you end up with very similar results in the end, but we've kind of flipped the whole thing on its head. Instead of focusing the system on relationships between people, we're focusing it on relationships between items. Those relationships are still based on the aggregate behavior of all the people that watched them, but fundamentally, we're looking at relationships between items and not relationships between people. Got it? All right, so let's do it. We actually have some Python code that will use pandas and all the various other tools at our disposal to create movie recommendations with a surprisingly small amount of code. The first thing we're going to do is show you item-based collaborative filtering in practice, so we'll build up people who watched also watched, or basically, people who rated things highly also rated this thing highly. So we're building up these movie-to-movie relationships, and we're going to base it on real data that we got from the MovieLens project. So if you go to movielens.org, it's actually an open movie recommender system.
It's a site where people can rate movies and get recommendations for new movies, and they make all the underlying data publicly available for researchers like us. So we're going to actually use some real movie ratings data. It is a little bit dated, it's like 10 years old, so keep that in mind, but it is real behavior data that we're going to be working with, finally. We will use that to compute similarities between movies, and that data in and of itself is useful. You can use that to say people who liked also liked. So let's say I'm looking at a web page for a movie. I can say right then and there: if you like this movie, and given that you're looking at it you're probably interested in it, you might also like these movies. That's a form of recommender system right there, even though we don't even know who you are. Now, it is real-world data, so we're going to encounter some real-world problems with it. Our initial set of results isn't going to look good, so we're going to spend a little bit of extra time trying to figure out why, which is a lot of what you spend your time doing as a data scientist, correct those problems, and go back around again until we get results that make sense. And finally, we'll actually do item-based collaborative filtering in its entirety, where we recommend movies to individuals based on their own behavior. So let's do this. Let's get started. So that's item-based collaborative filtering, a wonderful idea thought up by people much smarter than me. But, you know, it did have its origins at Amazon, which is kind of cool. As you can see, it addresses a lot of the shortcomings of user-based collaborative filtering, and it works out really well. So let's actually put it into practice and start writing some Python code to make it happen.
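Before moving to pandas, the item-based steps described in this lecture can be sketched in plain Python: find every movie pair with shared viewers, compare the ratings of those shared viewers, then recommend unseen movies similar to what a user already watched. The ratings, the use of cosine similarity, and the min_common threshold are all assumptions for illustration; the threshold is there because a pair rated by only one shared viewer would otherwise always score a perfect similarity.

```python
from itertools import combinations
from math import sqrt

# Hypothetical star ratings keyed by user, then by movie.
ratings = {
    "nice_lady": {"Star Wars": 5, "The Empire Strikes Back": 5, "Casablanca": 2},
    "mohawk_man": {"Star Wars": 5, "The Empire Strikes Back": 5},
    "drama_fan": {"Star Wars": 1, "Casablanca": 5},
    "hipster_man": {"The Empire Strikes Back": 5},
}

def movie_similarities(ratings, min_common=2):
    """Cosine similarity per movie pair, over users who rated both movies."""
    movies = sorted({m for user in ratings.values() for m in user})
    sims = {}
    for a, b in combinations(movies, 2):
        pairs = [(r[a], r[b]) for r in ratings.values() if a in r and b in r]
        if len(pairs) < min_common:
            continue  # too few shared viewers to trust the comparison
        dot = sum(x * y for x, y in pairs)
        norm = (sqrt(sum(x * x for x, _ in pairs))
                * sqrt(sum(y * y for _, y in pairs)))
        sims[(a, b)] = sims[(b, a)] = dot / norm
    return sims

def recommend(user, ratings, sims):
    """Rank unseen movies by their best similarity to a movie the user saw."""
    seen = ratings[user]
    scores = {}
    for (a, b), sim in sims.items():
        if a in seen and b not in seen:
            scores[b] = max(scores.get(b, 0.0), sim)
    return sorted(scores, key=scores.get, reverse=True)

sims = movie_similarities(ratings)
print(recommend("hipster_man", ratings, sims))
```

Notice that the similarity table is built once from everyone's behavior and then reused for any individual user, which is part of why the item-based approach scales well.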
45. Finding Movie Similarities: So let's apply the concept of item-based collaborative filtering, starting with movie similarities: figuring out what movies are similar to other movies. In particular, we'll try to figure out what movies are similar to Star Wars based on user rating data, and we'll see what we get out of it. Let's dive in. Okay, so let's go ahead and actually compute the first half of item-based collaborative filtering, which is finding similarities between items. In this case, we're going to be looking at similarities between movies based on user behavior, and we're going to be using some real movie rating data from the GroupLens project. If you go to grouplens.org, they make publicly available to researchers like us real movie ratings data from real people who are using the movielens.org website to rate movies and get recommendations back for new movies that they want to watch. We have included the data files that you need from the GroupLens data set with the course materials, and the first thing we need to do is import those into a pandas DataFrame. We're really going to see the full power of pandas in this example. It's pretty cool stuff. So the first thing we're going to do is import the u.data file that's part of the MovieLens data set. That is a tab-delimited file that contains every rating in the data set. The way this works is that even though we're calling read_csv on pandas, we can specify a different separator than a comma. In this case, it's a tab. So we're basically saying, take the first three columns in the u.data file and import them into a new DataFrame with three columns: the user ID, the movie ID, and the rating. What we end up with here is a DataFrame that has a row for every user ID, which identifies some person, and then for every movie they rated, we have the movie ID, which is just a numerical shorthand for a given movie.
So Star Wars might be movie 53 or something, and their rating, you know, 1 to 5 stars. So we have here a data frame of every user and every movie they rated. Okay, now we want to be able to work with movie titles so we can actually interpret these results more intuitively. So we're gonna use their human-readable names instead. If you're using a truly massive dataset, you'd save that to the end, because you do want to be working with numbers, which are more compact, for as long as possible. But for the purpose of example and teaching, we'll keep the titles around so you can see what's going on. So there's a separate data file with the MovieLens dataset called u.item, and it is pipe-delimited, and the first two columns that we import will be the movie ID and the title of that movie. So now we have two data frames: r_cols has all of the user ratings and m_cols has all the titles for every movie ID. And we can use this magical merge function in pandas to mush it all together. What we end up with is something like this. That was pretty quick. So we end up with a new data frame that contains the user ID and rating for each movie that user rated, and we have both a movie ID and the title that we can read and see what it really is. So the way to read this is: user ID number 308 rated Toy Story four stars, user ID 287 rated Toy Story five stars, and so on and so forth. And if we were to keep looking at more and more of this data frame, we'd see different ratings for different movies as we go through it. Now the real magic of pandas comes in. So what we really want is to look at relationships between movies based on all the users that watched each pair of movies. So we need, at the end, a matrix of every movie and every user and all the ratings that every user gave to every movie, and the pivot_table command in pandas can do that for us. It can basically construct a new table from a given data frame pretty much any way that you want it.
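The title lookup and merge can be sketched like this. The sample rows are invented, and the variable names (m_cols, movies, ratings) are assumptions for illustration rather than a claim about the course notebook.

```python
import io

import pandas as pd

# A few illustrative ratings, standing in for what was loaded from u.data.
ratings = pd.DataFrame(
    {"user_id": [308, 287, 308], "movie_id": [1, 1, 50], "rating": [4, 5, 5]}
)

# Stand-in for the pipe-delimited u.item file; only the first two fields are used.
sample_u_item = io.StringIO(
    "1|Toy Story (1995)|01-Jan-1995\n"
    "50|Star Wars (1977)|01-Jan-1977\n"
)
m_cols = ["movie_id", "title"]
movies = pd.read_csv(sample_u_item, sep="|", names=m_cols, usecols=range(2))

# merge joins on the column the two frames share, here movie_id.
ratings = pd.merge(movies, ratings)
print(ratings)
```

When reading the real u.item file you may also need to pass an encoding argument such as encoding="ISO-8859-1", since some titles contain non-ASCII characters.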
So what we're saying here is: take our ratings data frame up here, and I want to create a new data frame called movieRatings, and I want the index of it to be the user IDs, so we'll have a row for every user ID. And I'm gonna have every column be the movie titles, so I'm gonna have a column for every title that I encounter in that data frame, and each cell will contain the rating value if it exists. So let's go ahead and do that. And we end up with a new data frame that looks like this. Kind of amazing how that just put it all together for us. Now, these NaN values, that stands for not a number, and it's just how pandas indicates a missing value. So the way to interpret this is: user ID 1, for example, did not watch the movie 1900, but user ID 1 did watch 101 Dalmatians and rated it two stars. User ID 1 also watched 12 Angry Men and rated it five stars, but did not watch the movie 2 Days in the Valley, for example. Okay, so what we end up with here is a sparse matrix, basically, that contains every user and every movie, and at every intersection where a user rated a movie, there's a rating value. Okay, so you can see now we can very easily extract vectors of every movie that a user watched. And we can also extract vectors of every user that rated a given movie, which is what we want. So that's useful for both user-based and item-based collaborative filtering, right? If I want to find relationships between users, I could look at correlations between these user rows. But if I want to find correlations between movies for item-based collaborative filtering, I can look at correlations between columns based on the user behavior. Okay, so this is where the real flipping things on its head for user versus item-based similarities comes into play. Now, we're going with item-based collaborative filtering, so we want to extract columns. So let's do that next.
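As a small sketch of that pivot, assuming a merged ratings frame with user_id, title and rating columns (the five rows here are invented):

```python
import pandas as pd

# Illustrative merged ratings: one row per (user, movie) rating.
ratings = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2, 3],
        "title": [
            "101 Dalmatians (1996)",
            "12 Angry Men (1957)",
            "Star Wars (1977)",
            "12 Angry Men (1957)",
            "Star Wars (1977)",
        ],
        "rating": [2, 5, 5, 4, 4],
    }
)

# One row per user, one column per title; cells with no rating become NaN.
movieRatings = ratings.pivot_table(index="user_id", columns="title", values="rating")
print(movieRatings)

# A column is the vector of every user's rating for one movie.
starWarsRatings = movieRatings["Star Wars (1977)"]
```

Rows of movieRatings are what you would compare for user-based filtering; columns like starWarsRatings are what item-based filtering compares.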
Let's go ahead and extract all the users who rated Star Wars, and we can see most people have in fact watched and rated Star Wars, and everyone liked it, at least in this little sample that we took from the head of the data frame. So we end up with a resulting set here of user IDs and their ratings for Star Wars. And user ID 3 did not rate Star Wars, for example, so we have a not-a-number value indicating a missing value there. But that's OK. You know, we want to make sure that we preserve those missing values so we can directly compare columns from different movies. So how do we do that? Well, pandas keeps making it easy for us and has a corrwith function here that we can use, and that will actually go ahead and correlate a given column with every other column in the data frame, compute the correlation scores, and give that back to us. So what we're gonna do here is use corrwith on the entire movieRatings data frame, that's the entire matrix of user and movie ratings, correlated with just the Star Wars ratings column. We're going to then drop all of the missing results with dropna, so that just leaves us with items that actually had a correlation, you know, where there was more than one person that viewed it, and we will create a new data frame based on the results and look at the top 10 results. So again, just to recap, we're going to build the correlation score between Star Wars and every other movie, drop all the NaN, not-a-number values, so that we only have movie similarities that actually exist, where more than one person rated it. And we're going to construct a new data frame from the results and look at the top 10 results. And here we are. So we ended up with this result of correlation scores between each individual movie and Star Wars. And you can see, for example, a surprisingly high correlation score with the movie Till There Was You.
There's a negative correlation, actually, with the movie 1900, and a very weak correlation with 101 Dalmatians. So now all we should have to do is sort this by similarity score, and we should have the top movie similarities for Star Wars, right? Let's go and do that. Just call order on the resulting data frame. Again, pandas makes it really easy, and we can say ascending=False to actually get it sorted in reverse order by correlation score. So let's do that. Okay, so Star Wars came out pretty close to the top, because Star Wars is similar to itself. But what's all this other stuff? What the heck? Full Speed, Man of the Year, The Outlaw. These are all, you know, fairly obscure movies, most of which I've never even heard of. And yet they have perfect correlations with Star Wars. That's kind of weird. So obviously we're doing something wrong here. What could it be? Well, let's talk about that in our next lecture. Turns out there's a perfectly reasonable explanation, and this is a good lesson in why you always need to examine your results when you're done with any sort of data science task. Question the results, because often there's something you missed. There might be something you need to clean in your data. There might be something you did wrong. You should also always look skeptically at your results; don't just take them on faith. Okay? If you do so, you're gonna get in trouble, because if I were to actually present these as recommendations to people who like Star Wars, I would get fired. Don't get fired. Pay attention to the results. So let's dive into what went wrong in our next lecture. So that's our initial crack at item-based collaborative filtering and finding movie similarities based on user behavior, and the initial results really aren't that great. But it turns out there's a perfectly rational explanation for why, and a perfectly simple way to account for it. So let's dive into what went wrong and fix it.
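The correlate, drop and sort steps from this lecture can be sketched on a toy matrix. Everything here is invented for illustration: six users, shortened titles, and one deliberately obscure movie rated by only two people. Note that the old order method no longer exists in current pandas, so this sketch uses sort_values instead.

```python
import numpy as np
import pandas as pd

# Toy user-by-movie matrix (rows are users 1 to 6); NaN means "did not rate".
movieRatings = pd.DataFrame(
    {
        "Star Wars": [5, 4, 5, 3, 4, 2],
        "Empire Strikes Back": [5, 4, 4, 3, 5, 2],
        "Full Speed": [5, np.nan, np.nan, 3, np.nan, np.nan],  # only two raters
        "One-Rater Movie": [np.nan, 4, np.nan, np.nan, np.nan, np.nan],
    },
    index=pd.Index(range(1, 7), name="user_id"),
)

starWarsRatings = movieRatings["Star Wars"]

# Correlate every column with the Star Wars column. A movie with a single
# overlapping rater has no defined correlation, so it comes back as NaN
# (NumPy may print a RuntimeWarning for it) and dropna removes it.
similarMovies = movieRatings.corrwith(starWarsRatings).dropna()
similarMovies = similarMovies.sort_values(ascending=False)
print(similarMovies)
```

Even in this toy data the two-rater movie ties with Star Wars itself at a perfect correlation, which is the same symptom the lecture runs into.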
46. Improving the Results of Movie Similarities: So as you recall, our initial results for movies similar to Star Wars, using item-based collaborative filtering techniques, didn't come out so well. So let's figure out why and see what we can do about it. So let's figure out what went wrong with our movie similarities there. We went through all this exciting work to very easily, with pandas, compute correlation scores between movies based on their user ratings vectors, and the results we got kind of sucked. So just to remind you, we looked for movies that are similar to Star Wars using that technique, and we ended up with a bunch of weird recommendations at the top that had a perfect correlation, and most of them are very obscure movies. So what do you think might be going on there? Well, one thing that might make sense is, let's say we have a lot of people who watch Star Wars and some other obscure film. You know, we end up with a good correlation between these two movies because they're tied together by Star Wars. But at the end of the day, do we really want to base our recommendations on the behavior of, you know, one or two people that watched some obscure movie? Probably not. I mean, yes, for the two people in the world, or whatever it is, that watched the movie Full Speed and both liked it in addition to Star Wars, maybe that is a good recommendation for them, but it's probably not a good recommendation for the rest of the world. You know, we need to have some sort of confidence level in our similarities by enforcing some minimum bound on how many people watched a given movie. You know, we can't make a judgment that a given movie is good just based on the behavior of one or two people. So let's try to put that insight into action here. So what we're gonna do is try to identify the movies that weren't actually rated by very many people and just throw them out, okay, and see what we get.
So to do that, we're gonna take our original ratings data frame, and we're going to say group by title. Again, pandas has all sorts of magic in it, and this will basically construct a new data frame that aggregates together all of the rows for a given title into one row, and we can say that we want to aggregate specifically on the rating, and we want to show both the size, the number of ratings for each movie, and the mean, the average rating for that movie. So when we do that, we end up with something like this. So this is telling us, for example, for the movie 101 Dalmatians, 109 people rated that movie, and their average rating was 2.9 stars. So not that great of a score, really. So, you know, if we just eyeball this data, we can say, OK, well, movies that I consider obscure, like 187, had 41 ratings, but 101 Dalmatians, I've heard of that. You know, 12 Angry Men, I've heard of that. It seems like there's sort of a natural cutoff value at around 100 ratings, where maybe that's the magic value where things start to make sense. So let's go ahead and get rid of movies rated by fewer than 100 people. And yes, you know, I'm kind of doing this intuitively at this point. Like we'll talk about later, there are more principled ways of doing this. We could actually run train/test experiments on different threshold values to find the one that actually performs the best. But initially, let's just use our common sense and filter out movies that were rated by fewer than 100 people. Again, pandas makes that really easy to do. So we can just say popularMovies, a new data frame, is going to be constructed by looking at movieStats, and we're going to only take rows where the rating size is greater than or equal to 100, and I'm then going to sort that by mean rating, just for fun, to see the top-rated, widely watched movies. You know, I'm getting this warning now. Ever since I originally created this course, a new version of pandas came out.
You can just use sort_values there, and it will work just as well and make the warning go away, and we end up with this. So, you know, we have basically here a list of movies that were rated by at least 100 people, sorted by their average rating score, and this in itself is a recommender system: highly rated popular movies. A Close Shave apparently was a really good movie, and a lot of people watched it, and they really liked it. So again, this is a very old dataset from the late nineties. So even though you might not be familiar with the film A Close Shave, it might be worth going back and rediscovering it. Add it to your Netflix queue or whatever. Schindler's List: not a big surprise there, that comes up on the top of most top movies lists. The Wrong Trousers: another example of an obscure film that apparently was really good, and it was also pretty popular. So some interesting discoveries there already, just by doing that. So things look a little bit better now. So let's go ahead and basically make our new data frame of Star Wars recommendations, movies similar to Star Wars, where we only base it on movies that appear in this new data frame. So we're gonna use the join operation to go ahead and join our original similarMovies data frame to this new data frame of only movies that have at least 100 ratings. Okay, so we create a new data frame based on similarMovies where we extract the similarity column, join that with our movieStats data frame, which is our popular movies data frame, and we'll look at the combined results. And there we go. So now we have, restricted only to movies that were rated by at least 100 people, the similarity score to Star Wars. So now all we need to do is sort that. We'll get that warning again. Yeah, let's use sort_values instead of sort, reverse sorted, and we're just gonna take a look at the first 15 results. And, hey, this is starting to look a little bit better.
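The group, filter and join steps can be sketched as follows. The data and the similarity scores are invented, and the cutoff is set to 3 instead of 100 only because the toy data is tiny. This sketch aggregates just the rating column so the result has plain column names (size and mean), which keeps the later join simple; that is a design choice here, not necessarily how the course notebook does it.

```python
import pandas as pd

# Illustrative merged ratings: Star Wars has 4 ratings, Empire 3, Full Speed 1.
ratings = pd.DataFrame(
    {
        "title": ["Star Wars"] * 4 + ["Empire Strikes Back"] * 3 + ["Full Speed"],
        "rating": [5, 4, 5, 4, 5, 4, 4, 5],
    }
)

# Number of ratings and mean rating per title.
movieStats = ratings.groupby("title")["rating"].agg(["size", "mean"])

# Keep only titles with enough ratings (the lecture uses 100).
min_ratings = 3
popularMovies = movieStats[movieStats["size"] >= min_ratings]
print(popularMovies.sort_values("mean", ascending=False))

# Hypothetical similarity-to-Star-Wars scores, as corrwith might produce.
similarMovies = pd.Series(
    {"Star Wars": 1.0, "Empire Strikes Back": 0.75, "Full Speed": 1.0},
    name="similarity",
)

# Joining on the title index drops the obscure movie, since it is not in popularMovies.
df = popularMovies.join(similarMovies).sort_values("similarity", ascending=False)
print(df.head(15))
```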
So Star Wars comes out on top because it's similar to itself. The Empire Strikes Back is number two, Return of the Jedi is number three, Raiders of the Lost Ark number four. You know, it's still not perfect, but these make a lot more sense, right? So you would expect the three Star Wars films from the original trilogy to be similar to each other. This data goes back to before the next three films. And Raiders of the Lost Ark, also a very similar movie to Star Wars in style, comes out at number four. So I'm starting to feel a little bit better about these results. There's still room for improvement. But, hey, we got some results that make sense. Now, ideally, we'd also filter out Star Wars. You don't want to be looking at similarities to the movie itself that you started from, but we'll worry about that later. So if you want to play with this a little bit more: like I said, 100 was sort of an arbitrary cutoff for the minimum number of ratings. If you do want to experiment with different cutoff values, I encourage you to go back and do so. See what that does to the results. You know, you can see here that the results that we really like actually had much more than 100 ratings in common. So with Austin Powers coming in there pretty high with only 130 ratings, maybe 100 isn't high enough. Pinocchio snuck in at 101, not very similar to Star Wars, so you might want to consider an even higher threshold there and see what it does. So keep in mind, too, this is a very small, limited dataset that we use for experimentation purposes, and it's based on very old data, so you're only going to see older movies. So, you know, interpreting these results intuitively might be a little bit challenging as a result, but not bad results. So let's move on and actually do full-blown item-based collaborative filtering, where we recommend movies to people using a more complete system. We'll do that next.
So that's looking much better. You always need to watch out for spurious relationships, so there's a certain amount of confidence or support that you should have when you're looking at relationships in data, and by enforcing that minimum threshold of support, we ended up with much better and more reasonable results. So, good lesson to learn there. Let's take it to the next level and actually do full-blown item-based collaborative filtering and produce recommendations for a user based on their entire history. And we can build a system that could do that for any user in our dataset. We'll do that next.
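If you want to experiment with the cutoff, one way is to wrap the steps in a small function. This is a hypothetical helper, not something from the course notebook: similar_to and its parameters are names invented here, and it also drops the starting movie from its own results, as suggested above.

```python
import numpy as np
import pandas as pd


def similar_to(title, movie_ratings, min_ratings=100):
    """Rank other movies by correlation with `title`, skipping rarely rated ones."""
    counts = movie_ratings.count()  # non-missing ratings per movie column
    popular = counts[counts >= min_ratings].index
    sims = movie_ratings[popular].corrwith(movie_ratings[title]).dropna()
    # Remove the movie we started from, then sort best match first.
    return sims.drop(title, errors="ignore").sort_values(ascending=False)


# Toy user-by-movie matrix; NaN means "did not rate".
movie_ratings = pd.DataFrame(
    {
        "Star Wars": [5, 4, 5, 3, 4, 2],
        "Empire Strikes Back": [5, 4, 4, 3, 5, 2],
        "Full Speed": [5, np.nan, np.nan, 3, np.nan, np.nan],
    },
    index=pd.Index(range(1, 7), name="user_id"),
)

# A low cutoff lets the two-rater movie through; a higher one removes it.
print(similar_to("Star Wars", movie_ratings, min_ratings=1))
print(similar_to("Star Wars", movie_ratings, min_ratings=3))
```

Trying a few values of min_ratings on the real data is a quick way to see how the cutoff trades obscure matches against well-supported ones.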
47. Making Movie Recommendations to People: Okay, let's actually build a full-blown recommender system that can look at all of the behavior information of everybody, and what movies they rated, for every movie, and use that to actually produce the best recommended movies for any given user in our dataset. Kind of amazing, and you'll be surprised how simple it is. Let's go. Okay, let's put it all together and actually do full-blown item-based collaborative filtering, where we can recommend movies for any user based on all the behavior of how everybody rated every movie. How amazing is that? What's really amazing is how simple pandas makes it to do. So let's walk through it. Okay, so let's start off by importing the MovieLens dataset that we have. Again, we're using a subset of it that just contains 100,000 ratings for now. But there are larger datasets you can get from grouplens.org that go up to millions of ratings, if you're so inclined. Keep in mind, though, when you start to deal with that really big data, you're gonna be pushing the limits of what you can do on a single machine and what pandas can handle. So, you know, I do have other courses on techniques like Spark and MapReduce that can handle much larger scale recommendations. So if you're curious, go check those out. But for now, let's work with this. So just like before, we're going to import the u.data file that contains all of the individual ratings for every user and what movie they rated. And then we're gonna tie that together with the movie titles so we don't have to just work with the numerical movie IDs. Go ahead and do that, and we end up with this data frame. The way to read this: for example, user ID 308 rated Toy Story four stars, and user ID 66 rated Toy Story three stars, and this would contain every rating for every user for every movie. And again, just like before, we use the wonderful pivot_table command in pandas to construct a new data frame based on that information, where the index of each row is the user ID, the columns are made up of all the unique movie titles in my dataset, and each cell contains a rating.
So what we end up with is this incredibly useful sparse matrix that contains users for every row and movies for every column. And we have basically every user rating for every movie in this matrix. So user ID 1, for example, gave 101 Dalmatians two stars. And again, all these NaNs, not-a-numbers, represent missing data, so that just indicates, for example, user ID 1 did not rate the movie 1900. Okay, so again, a very useful matrix to have. If we were doing user-based collaborative filtering, we could compute correlations between each individual user's rating vector to find similar users. And since we're doing item-based collaborative filtering, we're more interested in relationships between the columns. So, like, doing a correlation score between any two columns, that will give us a correlation score for a given movie pair. So how do we do that? It turns out that pandas makes that incredibly easy to do as well. It has a built-in corr function that will actually compute the correlation score for every column pair found in the entire matrix. It's almost like they were thinking of us. So let's go ahead and run that. It's a fairly computationally expensive thing to do, so it will take a moment to actually come back with a result. But there we have it. So what do we have here? We have here a new data frame where every movie is on the row and in the column, so we can look at the intersection of any two given movies and find their correlation score to each other based on this user rating data that we had up here originally. How cool is that? So, for example, the movie 101 Dalmatians is perfectly correlated with itself, of course, because it has identical user rating vectors.
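A minimal sketch of the all-pairs correlation step, on an invented six-user matrix (the names userRatings and corrMatrix are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy user-by-movie matrix; NaN means "did not rate".
userRatings = pd.DataFrame(
    {
        "Star Wars": [5, 4, 5, 3, 4, 2],
        "Empire Strikes Back": [5, 4, 4, 3, 5, 2],
        "Full Speed": [5, np.nan, np.nan, 3, np.nan, np.nan],
    },
    index=pd.Index(range(1, 7), name="user_id"),
)

# Pairwise correlation between every pair of columns (Pearson by default).
# Each pair is computed only over users who rated both movies.
corrMatrix = userRatings.corr()
print(corrMatrix)
```

The result is a square, symmetric table with movies on both axes, so corrMatrix.loc[a, b] is the similarity of movie a to movie b.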
But if you look at 101 Dalmatians' relationship to the movie 12 Angry Men, it's a much lower correlation score, because those movies are rather dissimilar. Makes sense, right? So I have this wonderful matrix now that will give me the similarity score of any two movies to each other. It's kind of amazing and very useful for what we're going to be doing. Now, just like before, we have to deal with spurious results, so I don't want to be looking at relationships that are based on a small amount of behavior information. So it turns out that the pandas corr function actually has a few parameters you can give it. One is the actual correlation score method that you want to use, so I'm gonna say use Pearson correlation. But it also has a min_periods parameter you can give it, and that basically says: I only want you to consider correlation scores that are backed up by at least, in this example, 100 people that rated both movies. And that will get rid of these spurious relationships that are based on just a handful of people. That's a little bit different than what we did in the item similarities exercise, where we just threw out any movie that was rated by fewer than 100 people. What we're doing here is throwing out movie similarities where fewer than 100 people rated both of those movies. Okay, so you can see now that we have a lot more NaNs in the resulting matrix. In fact, even movies that are similar to themselves get thrown out. So, for example, the movie 1900 was presumably watched by fewer than 100 people, so it just gets tossed entirely. 101 Dalmatians, however, survives with a correlation score of one, and there are actually no movies in this little sample of the dataset that are different from each other and had 100 people in common that watched both. But there are enough movies that survive to get meaningful results. So what do we do with this data? Well, what we want to do is recommend movies for people.
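Here is the same call with the two parameters mentioned above, on the same kind of invented six-user matrix. The threshold is 3 rather than 100 only because the toy data is so small.

```python
import numpy as np
import pandas as pd

# Toy user-by-movie matrix; Full Speed has only two raters.
userRatings = pd.DataFrame(
    {
        "Star Wars": [5, 4, 5, 3, 4, 2],
        "Empire Strikes Back": [5, 4, 4, 3, 5, 2],
        "Full Speed": [5, np.nan, np.nan, 3, np.nan, np.nan],
    },
    index=pd.Index(range(1, 7), name="user_id"),
)

# Only keep a correlation when at least min_periods users rated both movies.
corrMatrix = userRatings.corr(method="pearson", min_periods=3)
print(corrMatrix)
```

Notice that the two-rater movie becomes NaN everywhere, including against itself, which mirrors what happens to 1900 in the lecture.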
So the way we do that is we look at all of the ratings for a given person, find movies similar to the stuff they rated, and those are candidates for recommendations to that person. So let's start by creating a fake person to create recommendations for. So I've actually added a fake user, ID number zero, by hand to the MovieLens dataset that we're processing, and that kind of represents someone like me who loved Star Wars and The Empire Strikes Back but hated the movie Gone with the Wind. So this represents someone who really loves Star Wars but does not like old-style, overly romantic dramas. Okay, so I gave a five-star rating to The Empire Strikes Back and Star Wars, and a one-star rating to Gone with the Wind. So I'm gonna try to find recommendations for this fictitious user. So how do I do that? Well, let's start by creating a series called simCandidates, and I'm going to go through every movie that I rated. So for i in range zero through the number of ratings that I have in my ratings, I am going to add up similar movies to the ones that I rated. So I'm gonna take that corrMatrix data frame, that magical one that has all of the movie similarities, I am going to pull out the similarities for each movie in my ratings, drop any missing values, and then I'm going to scale that resulting correlation score by how well I rated that movie. So the idea here is I'm going to go through all the similarities for The Empire Strikes Back, for example, and I will scale those all by five, because I really liked The Empire Strikes Back. But when I go through and get the similarities for Gone with the Wind, I'm only going to scale those by one, because I did not like Gone with the Wind. So this will give more strength to movies that are similar to movies that I liked, and less strength to movies that are similar to movies that I did not like. Okay, so I just go through and build up this list of similarity candidates, recommendation candidates if you will, sort the results, and print them out.
Let's see what we get. Hey, those don't look too bad, right? So obviously The Empire Strikes Back and Star Wars come out on top, because I like those movies explicitly. I already watched them and rated them. But bubbling up to the top of the list is Return of the Jedi, which we would expect, and Raiders of the Lost Ark. So let's start to refine these results a little bit more. We're seeing that we're getting duplicate values back. So if we have a movie that was similar to more than one movie that I rated, it will come back more than once in the results. So we want to combine those together. So if I do in fact have the same movie, Return of the Jedi, for example, that was similar to both Star Wars and The Empire Strikes Back, maybe that should get added up together into a combined, stronger recommendation score. Let's go ahead and do that. We're going to use the groupby command again to group together all of the rows that are for the same movie, and we will sum up their correlation scores and look at the results. Hey, this is looking really good. So Return of the Jedi comes out way on top, as it should, with a score of seven. Raiders of the Lost Ark is a close second at five. And then we start to get to Indiana Jones and the Last Crusade and some more movies: The Bridge on the River Kwai, Back to the Future, The Sting. These are all movies that I would actually enjoy watching. You know, I actually do like old-school Disney movies, too, so this isn't as crazy as it might seem. So the last thing we need to do is filter out the movies that I've already rated, because it doesn't make sense to recommend movies you've already seen. So I can drop any rows that happen to be in my original ratings series and look at the top 10 results. There we have it. Return of the Jedi, Raiders of the Lost Ark, Indiana Jones, all the top results for my fictitious user, and they all make sense. We're seeing a few family-friendly films.
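The whole recommendation loop can be sketched on a tiny, hand-made similarity matrix. All of the numbers below are invented so the arithmetic is easy to follow, and the pieces are combined with pd.concat, which is one way to do it in current pandas and not necessarily what the course notebook uses.

```python
import pandas as pd

nan = float("nan")
titles = ["Star Wars", "Empire Strikes Back", "Return of the Jedi",
          "Gone with the Wind", "Dumbo"]

# Hypothetical movie-to-movie similarity matrix; NaN marks pairs without enough support.
corrMatrix = pd.DataFrame(
    [
        [1.0,   0.75,  0.5,  nan, 0.125],
        [0.75,  1.0,   0.75, nan, 0.125],
        [0.5,   0.75,  1.0,  0.0, 0.25],
        [nan,   nan,   0.0,  1.0, 0.5],
        [0.125, 0.125, 0.25, 0.5, 1.0],
    ],
    index=titles,
    columns=titles,
)

# The fictitious user's ratings.
myRatings = pd.Series(
    {"Empire Strikes Back": 5, "Gone with the Wind": 1, "Star Wars": 5}
)

# For each rated movie, take its similarities and scale them by the rating given.
pieces = []
for title, rating in myRatings.items():
    sims = corrMatrix[title].dropna() * rating
    pieces.append(sims)
simCandidates = pd.concat(pieces)

# A movie similar to several rated movies appears several times: add the scores up.
simCandidates = simCandidates.groupby(simCandidates.index).sum()

# Do not recommend what the user already rated, then sort best first.
simCandidates = simCandidates.drop(myRatings.index, errors="ignore")
simCandidates = simCandidates.sort_values(ascending=False)
print(simCandidates.head(10))
```

In this toy data Dumbo still picks up some score through Gone with the Wind even though that movie was rated one star, which is the same effect discussed next.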
You know, Cinderella, The Wizard of Oz, Dumbo creeping in, probably based on the presence of Gone with the Wind in there. Even though it was weighted downward, it's still in there. It's still being counted. And there we have our results. So there you have it. Pretty cool. We have actually generated recommendations for a given user, and we could do that for any user in our entire data frame. So go ahead, play with that if you want to. Next, I want to talk about how you can get your hands dirty a little bit more and play with these results to try to improve upon them. All right, I'm pretty excited by these results so far. They're actually looking really reasonable. There is room for improvement, though, and that's gonna be my challenge to you in our next lecture. We'll talk about some ways that you might actually extend and build upon this Python notebook and actually make better movie recommendations than what I gave you to start with. So there's a bit of an art to this. You know, you need to keep iterating and trying different ideas and different techniques until you get better and better results. And you can do this pretty much forever. I mean, I made a whole career out of it, so I don't expect you to spend the next, you know, 10 years trying to refine this like I did. But there are some simple things you can do, so let's talk about that.
48. Improving the Recommender's Results: So as an exercise, I want to challenge you to go and make those recommendations even better. So let's talk about some ideas I have, and maybe you'll have some of your own, too, that you can actually try out and experiment with. Get your hands dirty and try to make better movie recommendations. Okay, so there's a lot of room for improvement still in these recommendation results, and as you can see, there's sort of an art to it. There's a lot of decisions we made about how to weigh different recommendation results based on your rating of the item that it came from, or what threshold you want to pick for the minimum number of people that rated two given movies. So there's a lot of things you can tweak, a lot of different algorithms you can try, and you can have a lot of fun with trying to make better movie recommendations out of the system. So if you're feeling up to it, I'm challenging you to go and do just that. So here are some ideas on how you might actually try to improve upon the results in this lecture. You can just go ahead and play with the ItemBasedCF.ipynb notebook file and tinker with it. So, for example, we saw that the correlation method actually had some parameters for the correlation computation. We used Pearson in our example, but there are other ones you can look up and try out. See what it does to your results. We used a min_periods value of 100. Maybe that's too high. Maybe it's too low. We just kind of picked it arbitrarily. What happens if you play with that value? If you were to lower that, for example, I would expect you to see some new movies that maybe you've never heard of but might still be a good recommendation for that person. Or if you were to raise it higher, you would see, you know, nothing but blockbusters. So sometimes you have to think about what the result is that you want out of a recommender system.
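One quick way to get a feel for the min_periods setting is to count how many movie pairs keep a similarity score at each value. This is a sketch on invented data; surviving_pairs is a hypothetical helper name, and the method argument is passed straight through to corr (pandas documents "pearson", "kendall" and "spearman" as options).

```python
import numpy as np
import pandas as pd


def surviving_pairs(user_ratings, min_periods, method="pearson"):
    """Count distinct movie pairs whose correlation is still defined."""
    corr = user_ratings.corr(method=method, min_periods=min_periods)
    # Look only above the diagonal so each pair is counted once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    return int(corr.notna().to_numpy()[upper].sum())


# Toy user-by-movie matrix with different amounts of overlap between movies.
user_ratings = pd.DataFrame(
    {
        "Star Wars": [5, 4, 5, 3, 4, 2],
        "Empire Strikes Back": [5, 4, 4, 3, 5, 2],
        "Dumbo": [2, 3, 1, 4, np.nan, np.nan],
        "Full Speed": [5, np.nan, np.nan, 3, np.nan, np.nan],
    },
    index=pd.Index(range(1, 7), name="user_id"),
)

for threshold in (1, 3, 5):
    print(threshold, surviving_pairs(user_ratings, threshold))
```

Lower thresholds keep more (and more obscure) pairs; higher thresholds keep only the pairs that many people rated together.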
Is there a good balance to be had between showing people movies that they've heard of and movies that they haven't heard of? How important is discovery of new movies to these people versus having confidence in the recommender system by seeing a lot of movies that they have heard of? So again, there's sort of an art to that. We can also improve upon the fact that we saw a lot of movies in the results that were similar to Gone with the Wind, even though I didn't like Gone with the Wind. You know, we weighted those results lower than similarities to movies that I enjoyed. But maybe those movies should actually be penalized. If I hated Gone with the Wind that much, maybe similarities to Gone with the Wind, like The Wizard of Oz, should actually be penalized and, you know, lowered in their score instead of raised at all. So that's another simple modification you could make and play around with. There are probably some outliers in our user rating dataset. What if we were to throw away people that rated some ridiculous amount of movies? Maybe they're skewing everything. You could actually try to identify those users and throw them out, as another idea. And if you really want a big project, if you really want to sink your teeth into this stuff, you could actually evaluate the results of this recommender engine by using the techniques of train/test. So what if, instead of having an arbitrary recommendation score that sums up the correlation scores of each individual movie, we actually scaled that down to a predicted rating for each given movie? So if the output of my recommender system were a movie and my predicted rating for that movie, in a train/test system I could actually try to figure out how well I predict movies that the user has in fact watched and rated before. Okay, so I could, like, set aside some of the ratings data and see how well my recommender system is able to predict the user's ratings for those movies, and that would be a quantitative and principled way to measure the error of this recommender engine.
But again, there's a little bit more of an art than a science to this. Even though the Netflix Prize actually used that error metric, root-mean-square error is what they used, to be particular, is that really a measure of a good recommender system? Basically, you're measuring the ability of your recommender system to predict the ratings of movies that a person already watched, but isn't the purpose of a recommender engine to recommend movies that a person hasn't watched that they might enjoy? Those are two different things. So, unfortunately, it's not very easy to measure the thing you really want to be measuring. So sometimes you do kind of have to go with your gut instinct, and the right way to measure the results of a recommender engine is to measure the results that you're trying to promote through it. Maybe I'm trying to get people to watch more movies, or rate new movies more highly, or buy more stuff. Running actual controlled experiments on a real website would be the right way to optimize for that, as opposed to using train/test. So, you know, I went into a little bit more detail there than I probably should have. But the lesson is, you can't always think about these things in black and white. You know, sometimes you can't really measure things directly and quantitatively, and you have to use a little bit of common sense, and this is an example of that. Anyway, those are some ideas on how to go back and improve upon the results of this recommender engine that we wrote. So please feel free to tinker around with it. See if you can improve upon it however you wish to, and have some fun with it. This is actually a very interesting part of the course, so I hope you enjoy it. So go give it a try. See if you can improve on our initial results there.
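Two of the ideas above can be sketched in a few lines. Both helpers are hypothetical: centered_weight assumes a 1-to-5 star scale with 3 as the neutral midpoint (that midpoint is an assumption, not something from the lecture), and rmse is the standard root-mean-square error you could use to score predicted ratings against held-out ones.

```python
import numpy as np
import pandas as pd


def centered_weight(rating, midpoint=3.0):
    """Signed weight for a rating: positive if liked, negative if disliked."""
    return rating - midpoint


def rmse(predicted, actual):
    """Root-mean-square error between predicted and actual ratings."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))


# Penalizing instead of boosting: similarities to a one-star movie now count against.
sims_to_gwtw = pd.Series({"Wizard of Oz": 0.5, "Dumbo": 0.25})  # invented scores
print(sims_to_gwtw * centered_weight(1))  # negative contributions

# Scoring predictions on ratings that were set aside (numbers are made up).
held_out_actual = [5, 3, 3]
held_out_predicted = [4, 3, 5]
print(rmse(held_out_predicted, held_out_actual))
```

Keep in mind the caveat from the lecture: a low RMSE says you predict ratings for movies people already watched, which is not quite the same thing as recommending movies they will enjoy.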
There are some simple ideas there to try to make those recommendations better, and some much more complicated ones, too. Now, there's no right or wrong answer. I'm not going to ask you to turn in your work, and I'm not going to review your work. You just have to play around with it, get some familiarity with it, experiment, and see what results you get. That's the whole point: just to get you more familiar with using Python for this sort of thing, and more familiar with the concepts behind item-based collaborative filtering. So have some fun with it. See what you come up with. If you come up with some really good results, make sure you post those for all of our other students to see in the discussions here. I'd be curious to see what you come up with. So have at it.
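The train/test idea mentioned in this lecture boils down to comparing predicted ratings against ratings that were set aside. Here is a minimal sketch of the root-mean-square error calculation, assuming you already have predictions for some held-out ratings. The function name and the numbers are made up for illustration; they are not from the course notebook.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between two equal-length lists of ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical held-out ratings and hypothetical predictions for the same movies.
held_out_ratings = [5.0, 3.0, 4.0, 1.0]
predicted_ratings = [4.0, 3.0, 5.0, 3.0]

error = rmse(predicted_ratings, held_out_ratings)
print(error)
```

A lower value means the predictions sit closer to the held-out ratings. As the lecture points out, that only tells you how well you predict movies people already rated, which is not quite the same thing as recommending good new movies.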
49. K-Nearest-Neighbors: Concepts: Let's talk about a few more data mining and machine learning techniques people will expect you to know about. We'll start with a really simple one called K-nearest neighbors, or KNN for short. And you're going to be surprised at just how simple a good supervised machine learning technique can be. Let's take a look. So let's talk about some more data mining and machine learning techniques that employers expect you to know about, a few more that we haven't covered yet. One of the simpler ones is called K-nearest neighbors, so let's start with that. It sounds fancy, but it's actually one of the simplest techniques out there. The idea is, let's say you have a scatter plot, and you can compute the distance between any two points on that scatter plot, right? And let's say you have a bunch of data that you've already classified, that you can train the system from. If I have a new data point, all I do is look at the K nearest neighbors based on that distance metric and let them all vote on the classification of that new point. So let's take an example here. Let's imagine that this scatter plot is plotting movies, and maybe the blue squares represent science fiction movies and the red triangles represent drama movies. And maybe this is plotting ratings versus popularity, or anything else you can dream up. So we have some sort of distance that we can compute, based on rating and popularity, between any two points on the scatter plot. Let's say a new point comes in, a new movie that we don't know the genre for. What we could do is say, let's set K to 3 and take the three nearest neighbors to this point on the scatter plot. They can all then vote on the classification. So let's see what happens if I take the three nearest neighbors, with K equal to 3.
Among those three, I have two drama movies and one science fiction movie. I would then let them all vote, and we would choose the classification of drama for this new point based on those three nearest neighbors. Now, if I were to expand this circle to include the five nearest neighbors, K of 5, I get a different answer. In that case, I pick up three science fiction and two drama movies. If I let them all vote, I would end up with a classification of science fiction instead. So you can see the choice of K can be very important. You want to make sure it's small enough that you don't go too far and start picking up irrelevant neighbors, but it has to be big enough to enclose enough data points to get a meaningful sample. So often you'll have to use train/test or a similar technique to actually determine what the right value of K is for a given data set. But at the end of the day, you have to start with your intuition and work from there. That's all there is to it. It's just that simple. So even though it is a very simple technique, and all you're doing is literally taking the K nearest neighbors on a scatter plot and letting them all vote on the classification, it does qualify as supervised learning, because it is using the training data of a set of known points and known classifications to inform the classification of a new point. But let's do something a bit more complicated with it and actually play around with movies just based on their metadata. Let's see if we can figure out the nearest neighbors of a movie based on just the intrinsic values of those movies, such as the ratings and the genre information for it. So in theory, we could recreate something similar to Customers Who Watched Also Watched (this is a screenshot from amazon.com) just using K-nearest neighbors. And I could take it one step further.
Once I identify the movies that are similar to a given movie based on the K-nearest neighbors algorithm, I can let them all vote on a predicted rating for that movie. So that's what we're going to do in our next example. Let's get to it. So there you have the concepts of KNN, K-nearest neighbors. Let's go ahead and apply that to an example of actually finding movies that are similar to each other and using those nearest-neighbor movies to predict the rating for another movie we haven't seen before.
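To make the voting idea from the slides concrete, here is a small sketch in plain Python. The points, labels, and function name are all invented for illustration; they are arranged so that K of 3 votes drama and K of 5 votes science fiction, like the scatter plot described above.

```python
import math
from collections import Counter

def knn_classify(training_points, new_point, k):
    """Let the k nearest labeled points vote on the label of new_point."""
    # Sort the known points by straight-line distance to the new point.
    by_distance = sorted(
        training_points,
        key=lambda item: math.dist(item[0], new_point),
    )
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (rating, popularity) coordinates with known genres.
movies = [
    ((1.0, 0.0), "drama"),
    ((0.0, 1.2), "drama"),
    ((1.5, 0.0), "science fiction"),
    ((0.0, 2.0), "science fiction"),
    ((2.0, 1.0), "science fiction"),
    ((5.0, 5.0), "drama"),
]
new_movie = (0.0, 0.0)

print(knn_classify(movies, new_movie, k=3))
print(knn_classify(movies, new_movie, k=5))
```

Changing only K flips the answer here, which is why the choice of K usually gets tuned with train/test rather than guessed.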
50. Using KNN to Predict a Rating for a Movie: All right, we're going to take the simple idea of KNN, K-nearest neighbors, and apply that to a more complicated problem. And that's predicting the rating of a movie, given just its genre and rating information. So let's dive in. Let's have some fun with KNN and actually try to predict movie ratings just based on the K-nearest neighbors algorithm, and see where we get. If you want to follow along, go ahead and open up the KNN IPython notebook and you can play along with me. What we're going to do is define a distance metric between movies just based on their metadata. By metadata, I just mean information that is intrinsic to the movie, information associated with the movie. Specifically, we're going to look at the genre classifications of the movie. Every movie in our MovieLens data set has additional information on what genres it belongs to, and a movie can belong to more than one genre, a genre being something like science fiction or drama or comedy or, you know, what have you, animated movies. We will also look at the overall popularity of the movie, given by the number of people who rated it. And we also know the average rating of each movie. So I can combine all this information together to basically create a metric of distance between two movies, just based on rating information and genre information. So let's see what we get. We'll use pandas again to make life simple. And if you are following along, again, make sure to change the path to the MovieLens data set to wherever you installed it, which will almost certainly not be what is in this Python notebook. So go ahead and change that if you're going to follow along. As before, we're just going to import the actual ratings data file itself, which is u.data, using the read_csv function in pandas, and we tell it that the file actually has a tab delimiter and not a comma.
And we're going to import the first three columns, which represent the user ID, movie ID, and rating for every individual movie rating in our data set. So we go ahead and run that and look at the top of it, and we can see that it's working. We end up with a DataFrame that has user ID, movie ID, and rating. For example, user ID 0 rated movie ID 50, which I believe is Star Wars, five stars, and so on and so forth. Now, we want aggregate information about the ratings for each movie, so that's the next thing we have to figure out. We're going to use the groupby function in pandas to group everything by movie ID. So we're going to combine together all of the ratings for each individual movie, and we're going to output the number of ratings and the average rating score, the mean, for each movie. So let's go ahead and do that. It comes back pretty quickly. This gives us another DataFrame that tells us, for example, that movie ID 1 had 452 ratings, which is a measure of its popularity (how many people actually watched it and rated it), and a mean review score of about 3.8. So 452 people watched movie ID 1, and they gave it an average review of 3.87, which is pretty good. Now, the raw number of ratings isn't that useful to us. I mean, I don't know if 452 means it's popular or not. So to normalize that, what we're going to do is basically measure that against the maximum and minimum number of ratings for each movie. We can do that using this little lambda function here, so we can apply a function to an entire DataFrame this way. And what we're going to do is use the NumPy min and max functions to find the maximum number of ratings and the minimum number of ratings found in the entire data set. So we'll take the most popular movie and the least popular movie, find the range there, and normalize everything against that range. So what this gives us, when we run it, is basically a measure of popularity for each movie on a scale of 0 to 1.
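The loading, grouping, and normalizing steps just described can be sketched roughly like this. It is not the notebook's exact code: the column names are assumptions, the rows are made up and fed in from memory instead of from u.data, and the normalization uses pandas min and max directly rather than the lambda with NumPy that the lecture mentions.

```python
import io
import pandas as pd

# Made-up rows in a tab-separated layout: user, movie, rating, timestamp.
sample_u_data = io.StringIO(
    "0\t50\t5\t100\n"
    "0\t1\t4\t101\n"
    "1\t1\t3\t102\n"
    "2\t1\t5\t103\n"
    "1\t50\t4\t104\n"
    "2\t2\t2\t105\n"
)

# Keep only the first three columns; the column names are our own choice.
ratings = pd.read_csv(
    sample_u_data,
    sep="\t",
    names=["user_id", "movie_id", "rating"],
    usecols=range(3),
)

# Number of ratings and mean rating for each movie.
movie_stats = ratings.groupby("movie_id")["rating"].agg(
    num_ratings="count", mean_rating="mean"
)

# Min-max normalize the rating counts to a 0-to-1 popularity score.
counts = movie_stats["num_ratings"]
popularity = (counts - counts.min()) / (counts.max() - counts.min())

print(movie_stats)
print(popularity)
```

With the real file you would pass its path to read_csv in place of the in-memory sample.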
So a score of 0 here would mean, loosely, that nobody watched it: it's the least popular movie. And a score of 1 would mean that everybody watched it: it's the most popular movie, or more specifically, the movie that the most people watched. Okay, so we have a measure of movie popularity now that we can use for our distance metric. Next, let's extract some genre information. It turns out that there is a u.item file that not only contains the movie names, but also all the genres that each movie belongs to. So this little bit of code will actually go through each line of u.item. We're doing this the hard way; we're not using any pandas functions, we're just going to use straight-up Python this time. Again, make sure you change that path to wherever you installed this information. So we will open our u.item file, and then we will iterate through every line in the file one at a time. We strip out the newline at the end and split it based on the pipe delimiters in this file, and we extract the movie ID, the movie name, and all of the individual genre fields. So basically there's a bunch of zeros and ones in 19 different fields in the source data, where each one of those fields represents a given genre. We will construct a Python dictionary in the end that maps movie IDs to their names and genres, and then we'll also fold back in our rating information. So we will have the name, the genres, the popularity on a scale of 0 to 1, and the average rating. That's what this little snippet of code does. Let's run that, and just to see what we end up with, we can extract the value for movie ID 1, which happens to be Toy Story, a little Pixar film from 1995 you've probably heard of. What we have in our dictionary for entry 1, movie ID 1, is this: the name is Toy Story, and this is a list of all the genres, where a 0 indicates it is not part of that genre and a 1 indicates it is part of that genre.
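Here is a rough sketch of that parsing step in plain Python. It assumes each line has the movie ID in the first field, the name in the second, and the 19 genre flags starting at the sixth field; the sample lines are built by hand for illustration and the second title is hypothetical, so check the layout against your own copy of u.item.

```python
def parse_item_lines(lines, genre_count=19):
    """Map movie IDs to their name and list of 0/1 genre flags."""
    movies = {}
    for line in lines:
        fields = line.rstrip("\n").split("|")
        movie_id = int(fields[0])
        name = fields[1]
        # Assumed layout: genre flags start at the sixth pipe-separated field.
        genres = [int(flag) for flag in fields[5:5 + genre_count]]
        movies[movie_id] = {"name": name, "genres": genres}
    return movies

def make_line(movie_id, name, flags):
    """Build an illustrative pipe-delimited line (date and URL fields left blank)."""
    return f"{movie_id}|{name}|||" + "|" + "|".join(str(f) for f in flags) + "\n"

# Illustrative genre flags, not copied from the real data file.
toy_flags = [0, 0, 0, 1, 1, 1] + [0] * 13
other_flags = [0, 1, 1] + [0] * 16

sample_lines = [
    make_line(1, "Toy Story (1995)", toy_flags),
    make_line(2, "Hypothetical Action Movie (1995)", other_flags),
]

movie_dict = parse_item_lines(sample_lines)
print(movie_dict[1])
```

In the lecture's version the popularity score and mean rating get folded into the same dictionary entry, which you could do by adding two more keys per movie.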
And there is a data file in the MovieLens data set that will tell you what each of these genre fields actually corresponds to. But for our purposes, that's not actually important, right? We're just trying to measure the distance between movies based on their genres, so all that matters mathematically is how similar this vector of genres is to another movie's. The actual genres themselves are not important. We just want to see how similar or different two movies are in their genre classifications. So we have that genre list, we have the popularity score that we computed, and we have the mean, or average, rating for Toy Story. So let's go ahead and figure out how to combine all this information together into a distance metric, so we can find the K nearest neighbors for Toy Story, for example. I've rather arbitrarily come up with this distance function that takes two movie IDs and computes a distance score between the two. We're going to base this, first of all, on the similarity between the two genre vectors, using a cosine similarity metric. So, like I said, we're just going to take the list of genres for each movie and see how similar they are to each other. Again, a 0 indicates it's not part of that genre, and a 1 indicates it is. We will then compare the popularity scores and just take the raw difference, the absolute value of the difference between those two popularity scores, and use that toward the distance metric as well. And we will use that information alone to define the distance between two movies. So, for example, if we were to compute the distance between movie IDs 2 and 4, this function would return some distance based only on the popularity of those movies and on the genres of those movies. So imagine that as a scatter plot, if you will, going back to our original example in the slides, where one axis might be a measure of genre similarity based on the cosine metric, and the other axis might be popularity.
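One way to write down the distance metric just described is sketched below in plain Python. The lecture takes movie IDs and looks them up in its dictionary; this version takes the movie entries directly, computes the cosine distance by hand, and simply adds the two parts together. The dictionary keys, the handling of an all-zero genre vector, and the sample values are assumptions for illustration.

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: a movie with no genres at all counts as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def movie_distance(movie_a, movie_b):
    """Genre distance plus the absolute difference in popularity."""
    genre_distance = cosine_distance(movie_a["genres"], movie_b["genres"])
    popularity_distance = abs(movie_a["popularity"] - movie_b["popularity"])
    return genre_distance + popularity_distance

# Hypothetical entries with shortened genre vectors.
action = {"genres": [1, 1, 0, 0], "popularity": 0.2}
comedy = {"genres": [0, 0, 1, 0], "popularity": 0.3}

print(movie_distance(action, comedy))
print(movie_distance(action, action))
```

Two movies with no genres in common and similar popularity come out far apart, while a movie compared with itself comes out at essentially zero.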
Okay, we're just finding the distance between these two things. So for this example, where we're trying to compute the distance between movies 2 and 4 using our distance metric, we end up with a score of 0.8. And remember, a far distance means it's not similar, right? We want the nearest neighbors, with the smallest distance. So a score of 0.8 is a pretty high number on a scale of 0 to 1, and that's telling me that these movies really aren't similar. Let's do a quick sanity check and see what these movies really are. It turns out it's the movies GoldenEye and Get Shorty, which are pretty darn different movies. You have a James Bond action-adventure here and a comedy movie there, not very similar at all. They're actually comparable in terms of popularity, but the genre difference did it in. Okay, so let's put it all together. Next, we're going to write a little bit of code to actually take some given movie ID and find the K nearest neighbors. So all we have to do is compute the distance between Toy Story and all of the other movies in our movie dictionary, and sort the results based on their distance score. That's what this little snippet of code does here. Take a moment to wrap your head around it. It's fairly straightforward, but like we say, we have a little get neighbors function that will take the movie that we're interested in and the K neighbors that we want to find. It will iterate through every movie that we have. If it's actually a different movie than the one we're looking at, it will compute that distance score from before and append that to the list of results that we have. We sort that result, and then we pluck off the K top results. Okay, so in this example, we're going to set K to 10 and find the 10 nearest neighbors. We will find the 10 nearest neighbors using get neighbors, and then we will iterate through all these 10 nearest neighbors and compute the average of the ratings of those neighbors.
And that average rating will inform our rating prediction for the movie in question. As a side effect, we also get the 10 nearest neighbors based on our distance function, which we could call similar movies. So that information itself is useful. Going back to that Customers Who Watched Also Watched example, if you wanted to do a similar feature that was just based on this distance metric and not actual behavior data, this might be a reasonable place to start, right? So let's go ahead and run this and see what we end up with. And the results aren't that unreasonable. We are using as an example the movie Toy Story, which is movie ID 1, and what we came back with for the top 10 nearest neighbors is a pretty good selection of comedy and children's movies. So given that Toy Story is a popular comedy and children's movie, we got a bunch of other popular comedy and children's movies. So it seems to work. We didn't have to use a bunch of fancy collaborative filtering algorithms, and these results aren't that bad. And if we use KNN to predict the rating, where we're thinking of the rating as the classification in this example, we end up with a predicted rating of 3.34, which actually isn't all that different from the actual rating for that movie, which was 3.87. So it's not great, but it's not too bad either. I mean, it actually works surprisingly well, given how simple this algorithm is. Most of the complexity in this example was just in determining our distance metric, and we intentionally got a little bit fancy there just to keep it interesting, but you could do anything else you want to. So if you want to fiddle around with this, I definitely encourage you to do so. Our choice of 10 for K was completely out of thin air. I just made that up. How does this respond to different K values? Do you get better results with a higher value of K, or with a lower value of K? Does it matter?
If you really want to do a more involved exercise, you could actually try to apply train/test to find the value of K that can most optimally predict the rating of a given movie based on KNN. And you can use different distance metrics. I kind of made that up too, so play around with the distance metric. Maybe you can use different sources of information, or weigh things differently. That might be an interesting thing to do. Maybe popularity isn't really as important as the genre information, or maybe it's the other way around. See what impact that has on your results, too. So go ahead and mess with these algorithms, mess with the code, run with it, and see what you can get. And if you do find a significant way of improving on this, share that with your classmates. That is KNN in action. So it's a very simple concept, but it can actually be pretty powerful. So there you have it: similar movies just based on the genre and popularity and nothing else. It works out surprisingly well. And we used the concept of KNN to actually use those nearest neighbors to predict a rating for a new movie, and that actually worked out pretty well, too. So that's KNN in action. Very simple technique, but often it works out pretty darn good.
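The neighbor search and rating prediction walked through in this lecture can be sketched as follows. This is an illustrative version, not the notebook's code: the function names, the dictionary layout, and the toy distance function (popularity difference only) are assumptions, and the movie entries are made up.

```python
def get_neighbors(movies, movie_id, k, distance):
    """Return the IDs of the k movies closest to movie_id."""
    scored = []
    for other_id, other in movies.items():
        if other_id != movie_id:  # skip the movie we are asking about
            scored.append((distance(movies[movie_id], other), other_id))
    scored.sort()  # smallest distance first
    return [other_id for _, other_id in scored[:k]]

def predict_rating(movies, movie_id, k, distance):
    """Average the mean ratings of the k nearest neighbors."""
    neighbors = get_neighbors(movies, movie_id, k, distance)
    return sum(movies[n]["rating"] for n in neighbors) / len(neighbors)

def popularity_gap(movie_a, movie_b):
    """Toy distance for this sketch: difference in popularity only."""
    return abs(movie_a["popularity"] - movie_b["popularity"])

# Hypothetical movie entries.
movie_dict = {
    1: {"popularity": 0.90, "rating": 3.9},
    2: {"popularity": 0.80, "rating": 3.0},
    3: {"popularity": 0.85, "rating": 4.0},
    4: {"popularity": 0.10, "rating": 1.0},
}

nearest = get_neighbors(movie_dict, 1, 2, popularity_gap)
predicted = predict_rating(movie_dict, 1, 2, popularity_gap)
print(nearest, predicted)
```

Passing the distance function in as an argument makes it easy to try the experiments suggested above: swap in a different metric, or loop over several values of K and compare the predictions.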
51. Dimensionality Reduction; Principal Component Analysis: All right, time to get all trippy. We're going to talk about higher dimensions and dimensionality reduction. Sounds scary. There is some fancy math involved, but conceptually, it's not as hard to grasp as you might think. So let's talk about dimensionality reduction and principal component analysis next. Let's talk about the curse of dimensionality. Very dramatic sounding. Usually when people talk about this, they're talking about a technique called principal component analysis, and a specific technique called singular value decomposition, or SVD. So PCA and SVD are the topics of this lecture. Let's dive into it. So what is the curse of dimensionality? Well, a lot of problems can be thought of as having many different dimensions. So, for example, when we were doing movie recommendations, we had attributes of various movies, and every individual movie could be thought of as its own dimension in that data space. So if you have a lot of movies, that's a lot of dimensions, and you can't really wrap your head around more than three, right? Because that's what we evolved within. Or you might have some sort of data that has many different features that you care about. In a moment, we'll look at an example of flowers that we want to classify, and that classification is based on four different measurements of the flowers. Those four different features, those four different measurements, can represent four dimensions, which again is very hard to visualize. So dimensionality reduction techniques exist to find a way to reduce higher-dimensional information into lower-dimensional information. And not only can that make it easier to look at and classify things, but it can also be useful for things like compressing data.
So by preserving the maximum amount of variance while we reduce the number of dimensions, we're more compactly representing a data set while still trying to preserve the variance in that data set. So a very common application of dimensionality reduction is not just for visualization, but also for compression and for feature extraction. We'll talk about that a little bit more in a moment. A very simple example of dimensionality reduction can be thought of as K-means clustering. So, for example, you might start off with many points that represent many different dimensions in a data set, but ultimately we can boil that down to K different centroids and your distance to those centroids. So that's one way of boiling data down to a lower-dimensional representation. But usually when people talk about dimensionality reduction, they're talking about a technique called principal component analysis. And this is a much more fancy technique. It gets into some pretty involved mathematics, but at a high level, all you need to know is that it takes a higher-dimensional data space and it finds planes within that data space in higher dimensions. These higher-dimensional planes are called hyperplanes, and they're defined by things called eigenvectors. You take as many planes as you want dimensions in the end, project that data onto those hyperplanes, and those become the new axes in your lower-dimensional data space. You know, unless you're familiar with higher-dimensional math and you've thought about it before, that's going to be hard to wrap your head around. But at the end of the day, it means that we're choosing planes in a higher-dimensional space that still preserve the most variance in our data, and projecting the data onto those higher-dimensional planes, which we then bring into a lower-dimensional space. Okay? I mean, you don't really have to understand all of the math to use it.
The important point is that it's a very principled way of reducing a data set down to a lower-dimensional space while still preserving the variance within it. We talked about image compression as one application of this. So if I want to reduce the dimensionality in an image, I could use PCA to boil it down to its essence. Facial recognition is another example. So if I have a data set of faces, maybe each face represents a third dimension of 2D images, and I want to boil that down. SVD and principal component analysis could be a way to identify the features that really count in a face, so it might end up focusing more on the eyes and the mouth, for example, as important features that are necessary for preserving the variance within that data set. So it can produce some very interesting and very useful results that just emerge naturally out of the data, which is kind of cool. To make it real, we're going to use a simpler example, using what's called the Iris data set. This is a data set that's included with scikit-learn. It's used pretty commonly in examples, and here's the idea behind it. An iris actually has two different kinds of petals on its flower. One is called a petal, which is the flower petal you're familiar with, and it also has something called a sepal, which is kind of this supportive lower set of petals on the flower. We can take a bunch of irises of different species and measure the petal length and width and the sepal length and width. So together, the length and width of the petal and the length and width of the sepal are four different measurements that correspond to four different dimensions in our data set. And I want to use that to classify what species an iris might belong to. Now, PCA will let us visualize this in two dimensions instead of four, while still preserving the variance in that data set.
So let's see how well that works and actually write some Python code to make PCA happen on the Iris data set. So those are the concepts of dimensionality reduction, principal component analysis, and singular value decomposition. All big fancy words, and yeah, it is kind of a fancy thing. You know, we're dealing with reducing higher-dimensional spaces down to lower-dimensional spaces in a way that preserves their variance. Fortunately, scikit-learn makes this extremely easy to do; like three lines of code is all you need to actually apply PCA. So let's make that happen.
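For anyone curious what the projection looks like without a library doing it for you, here is a minimal sketch of PCA using NumPy's singular value decomposition. It is one common way to compute it, offered under that assumption rather than as a description of what scikit-learn does internally, and the function name and the synthetic data are made up.

```python
import numpy as np

def pca_project(data, n_components):
    """Project data onto its top principal directions using SVD."""
    centered = data - data.mean(axis=0)  # PCA works on mean-centered data
    # Rows of vt are the principal directions, strongest first.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    # Share of the total variance captured along each direction.
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return centered @ components.T, components, explained[:n_components]

# Synthetic 3D points that mostly lie along a single line, plus a little noise.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
points = np.column_stack([t, 2 * t, -t]) + 0.01 * rng.normal(size=(200, 3))

projected, directions, explained = pca_project(points, 1)
print(projected.shape, explained)
```

Because the points were built to vary along one line, a single dimension keeps nearly all of the variance, which is the same effect the Iris example shows with real measurements.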
52. PCA Example with the Iris Data Set: So let's apply principal component analysis to the Iris data set. This is a four-dimensional data set that we're going to reduce down to two dimensions, and we're going to see that we can actually still preserve most of the information in that data set, even by throwing away half the dimensions. It's pretty cool stuff, and it's pretty simple, too. So let's dive in. All right, let's do some principal component analysis and cure the curse of dimensionality. It's actually very easy to do using scikit-learn, as usual. And again, PCA is a dimensionality reduction technique. It sounds very science-fictiony, all this talk of higher dimensions, but just to make it more concrete and real again, a common application is image compression. So you can think of a black-and-white photograph, an image of a black-and-white picture, as three dimensions, where you have the width as your X axis and the height as your Y axis, and then each individual cell has some brightness value on a scale of 0 to 1, that is, black to white or some value in between. So that would be three-dimensional data. You have two spatial dimensions, and then a brightness or intensity dimension on top of that. If you were to distill that down to, say, two dimensions alone, that would be a compressed image. And if you were to do that with a technique that preserved the variance in that image as well as possible, you could still reconstruct the image without a whole lot of loss, in theory. So that's dimensionality reduction distilled down to a practical example. Now, we're going to use a different example here, using the Iris data set, and scikit-learn includes this. All it is, is a data set of various iris flower measurements and the species classification for each iris in that data set. And it also has, like I said before, the length and width measurement of both the petal and the sepal for each iris specimen.
So between the length and width of the petal and the length and width of the sepal, we have four dimensions of feature data in our data set. We want to distill that down to something we can actually look at and understand, because your mind doesn't deal with four dimensions very well, but you can look at two dimensions on a piece of paper pretty easily. So let's go ahead and load that up. There is a handy-dandy load_iris function built into scikit-learn that will just load that up for you with no additional work, so you can focus on the interesting part. And if we take a look at what that data set looks like, you can see that we are extracting the shape of that data set, which means how many data points we have in it, 150, and how many features, or how many dimensions, that data set has, and that is four. So we have 150 iris specimens in our data set with four dimensions of information. Again, that is the length and width of the sepal and the length and width of the petal, for a total of four features, which we can think of as four dimensions. We can also print out the list of target names in this data set, which are the classifications, and we can see that each iris belongs to one of three different species: setosa, versicolor, or virginica. So that's the data that we're working with: 150 iris specimens classified into one of three species, and we have four features associated with each iris, being the length and width of the petal and the length and width of the sepal. So let's look at how easy PCA is. Even though it's a very complicated technique under the hood, doing it is just a few lines of code. We're going to assign the entire Iris data set and call it X. We will then create a PCA model, and we're going to say n_components equals 2, because we want two dimensions. We're going to go from four to two. We're also going to set whiten equal to True.
Setting whiten to True means that we're going to normalize the projected data and make sure that everything is nice and comparable. Normally you will want to do that to get good results. Then we will fit the PCA model to our Iris data set X, and then we can use that model to transform that data set down to two dimensions. Let's go ahead and do that. It happened pretty quickly. So think about what just happened there. We actually created a PCA model to reduce four dimensions down to two, and it did that by choosing two four-dimensional vectors to create hyperplanes around, to project that four-dimensional data down to two dimensions. You can actually see what those four-dimensional vectors are, those eigenvectors, by printing out the actual components of the PCA model. So PCA stands for principal component analysis, and those principal components are the eigenvectors that we chose to define our planes about. You can actually look at those values here. It's not going to mean a lot to you, because you can't really picture four dimensions anyway, but it's just so you can see that it's actually doing something with principal components. So let's evaluate our results. The PCA model gives us back something called the explained variance ratio, and basically that tells you how much of the variance in the original four-dimensional data was preserved as I reduced it down to two dimensions. So let's go ahead and take a look at that. What it gives you back is actually a list of two items, for the two dimensions that we preserved. This is telling me that in the first dimension, I can actually preserve 92% of the variance in the data, and the second dimension only gave me an additional 5% of variance. And if I sum it together, these two dimensions that I projected my data down into still preserved over 97% of the variance in the source data. So four dimensions weren't really necessary to capture all the information in this data set, which is pretty interesting.
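The steps described so far (load the data, fit a two-component PCA, transform, and inspect the results) look roughly like this with scikit-learn. This is a sketch that follows the lecture's description; the variable names are assumptions and the notebook's own code may differ in detail.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X = iris.data  # 150 specimens, 4 measurements each

# Reduce from four dimensions to two; whiten rescales the projected output.
pca = PCA(n_components=2, whiten=True).fit(X)
X_pca = pca.transform(X)

print(X.shape, X_pca.shape)
print(list(iris.target_names))

# The two principal directions, each expressed in the original 4D space.
print(pca.components_)

# Share of the original variance kept by each of the two new dimensions.
print(pca.explained_variance_ratio_)
total_variance_kept = sum(pca.explained_variance_ratio_)
print(total_variance_kept)
```

To try the one-dimension experiment suggested later in the lecture, change n_components to 1 and look at the explained variance ratio again.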
It's pretty cool stuff. So if you think about it, why do you think that might be? Well, maybe the overall size of the flower has some relationship to the species at its center. Maybe it's the ratio of length to width for the petal and the sepal. You know, some of these things probably move together in concert with each other for a given species, or for a given overall size of a flower. So perhaps there are relationships between these four dimensions that PCA is extracting on its own. It's pretty cool and pretty powerful stuff. Let's go ahead and visualize this. The whole point of reducing this down to two dimensions was so that we could make a nice little 2D scatter plot of it. At least, that's our objective for this little example here. So we're going to do a little bit of Matplotlib magic to do that. There is some sort of fancy stuff going on here that I should at least mention. What we're going to do is create a list of colors: red, green, and blue. We're going to create a list of target IDs, so the values 0, 1, and 2 map to the three different iris species that we have. And what we're going to do is zip all this up with the actual names of each species. So this little line here, for i, c, label in zip(target_ids, colors, iris.target_names), means that we will iterate through the three different iris species, and as we go, we're going to have the index for that species, a color associated with it, and the actual human-readable label name for that species. So we'll take one species at a time and plot it on our scatter plot, just for that species, with a given color and the given label. We'll then add in our legend and show the results, and this is what we end up with. So that is our four-dimensional Iris data projected down to two dimensions. Pretty interesting stuff. You can see it still clusters together pretty nicely. You have all the virginicas sitting together over here.
All the versicolors are sitting in the middle, and the setosas are way off on the side here. It's really hard to imagine what these actual values represent, but the important point is that we've projected 4D data down to 2D in such a way that we still preserve the variance, and we can still see clear delineations between these three species. There's a little bit of intermingling going on in there. It's not perfect, you know, but by and large it was pretty effective. So if you want to play with this a little bit: as you recall from the explained variance ratios, we actually captured most of the variance in a single dimension. Maybe the overall size of the flower is all that really matters in classifying it, and you could specify that with one feature. So go ahead and modify the results if you are feeling up to it. See if you can get away with one dimension instead of two. Go change that n_components to 1 and see what kind of variance ratio you get. See what happens. Does it make sense? So play around with it and get some familiarity with it. And that is dimensionality reduction, principal component analysis, and singular value decomposition, all in action. Very, very fancy terms, and, to be fair, it is some pretty fancy math under the hood. But as you can see, it's a very powerful technique, and with scikit-learn it's not hard to apply, so keep that in your tool chest. So there you have it: a four-dimensional data set of flower information boiled down to two dimensions that we can easily visualize, and in which we can still see clear delineations between the classifications that we're interested in. So PCA works really well in this example, and again, it's a useful tool for things like compression or feature extraction or facial recognition as well. So keep that in your toolbox. Know it's there for you.
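To make the explained variance ratio a little less of a black box, here is a minimal from-scratch sketch in plain Python. It is not the notebook's code and not how scikit-learn does it internally (scikit-learn relies on singular value decomposition); it uses power iteration on the covariance matrix instead. The measurements are made up for illustration, not the real iris data, and the names covariance_matrix, top_eigenpair, pca, and flowers are all hypothetical.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def top_eigenpair(matrix, iterations=500):
    """Power iteration: dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    d = len(matrix)
    vec = [1.0 / (j + 1) for j in range(d)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        nxt = [sum(matrix[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    # Rayleigh quotient gives the eigenvalue that goes with vec.
    value = sum(vec[a] * sum(matrix[a][b] * vec[b] for b in range(d))
                for a in range(d))
    return value, vec

def pca(rows, n_components):
    """Return (components, explained_variance_ratios) for the top components."""
    cov = covariance_matrix(rows)
    d = len(cov)
    total_variance = sum(cov[j][j] for j in range(d))
    components, ratios = [], []
    for _ in range(n_components):
        value, vec = top_eigenpair(cov)
        components.append(vec)
        ratios.append(value / total_variance)
        # Deflate: remove the variance just explained, then look for the next vector.
        cov = [[cov[a][b] - value * vec[a] * vec[b] for b in range(d)]
               for a in range(d)]
    return components, ratios

# Made-up 4D "flower" measurements driven mostly by one size factor.
flowers = []
for i in range(20):
    size = i - 9.5             # overall size of the flower
    shape = 0.5 * (-1) ** i    # small shape wiggle
    flowers.append([5 + 0.4 * size + shape, 3 + 0.2 * size - shape,
                    4 + 0.6 * size + shape, 1 + 0.2 * size - shape])

components, ratios = pca(flowers, n_components=2)
print(ratios)
```

Because these made-up measurements move together with a single size factor, the first component should carry the bulk of the variance, which is the same effect described for the iris data above.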
53. Data Warehousing; ETL and ELT: Next, we're going to talk a little bit about data warehousing, and this is a field that's really been upended recently by the advent of Hadoop and some big data techniques and cloud computing. So a lot of big buzzwords there, but concepts that are important for you to understand. So let's dive in and explore these concepts. Let's talk about ELT and ETL and data warehousing in general. This is more of a concept, as opposed to a specific practical technique, so we're going to talk about it conceptually. But it is something that's likely to come up in the setting of a job interview, so let's make sure you understand these concepts. Let's start by talking about data warehousing in general. So what is a data warehouse? Well, it's basically a giant database that contains information from many different sources and ties them together for you. So, for example, maybe you work at a big e-commerce company, and they might have an ordering system that feeds information about the stuff people bought into your data warehouse. And you might also have information from web server logs that gets ingested into the data warehouse as well, and this would allow you to tie together browsing information on the website with what people ultimately ordered. For example, maybe you could also tie in information from your customer service systems and measure whether there's a relationship between browsing behavior and how happy the customers are at the end of the day. So a data warehouse has the challenge of taking data from many different sources and transforming them into some sort of schema that allows us to query these different data sources simultaneously. And it lets us make insights through data analysis using these disparate data sources. Large corporations and organizations have this sort of thing pretty commonly. We're getting into the concept of big data here, right?
And you can have a giant Oracle database, for example, that contains all of this stuff. You know, maybe it's partitioned in some way and replicated, with all sorts of complexity there. And you could just query that through SQL, Structured Query Language, or through graphical tools. Tableau is a very popular one these days, and that's what a data analyst does: they query large data sets using stuff like Tableau. That's kind of the difference between a data analyst and a data scientist. As a data scientist, you might actually be writing code to perform more advanced techniques on data that border on AI, as opposed to just using tools to extract graphs and relationships out of a data warehouse. And it's a very complicated problem. At Amazon, we had an entire department for data warehousing that took care of this stuff full time, and they never had enough people, I can tell you that. It's a big job. There are a lot of challenges in doing data warehousing. One is data normalization. You have to figure out how all of the fields in these different data sources actually relate to each other. How do I actually make sure that a column in one data source is comparable to a column from another data source, and has the same set of data at the same scale using the same terminology? How do I deal with missing data? How do I deal with corrupt data, or data from outliers or from robots and things like that? These are all very big challenges. Maintaining those data feeds is also a very big problem. A lot can go wrong when you're importing all this information into your data warehouse, especially when you have a very large transformation that needs to happen to take the raw data, say, from web logs, into an actual structured database table that can be imported into your data warehouse. Scaling can also get tricky when you're dealing with a monolithic data warehouse.
You know, eventually your data will get so large that those transformations themselves start to become a problem, and this starts to get into the whole ELT vs. ETL thing. So let's first talk about ETL. What does that stand for? It stands for extract, transform, and load, and that's the conventional way of doing data warehousing. Basically, first you extract the data that you want from the operational systems that you want. So, for example, I might extract all of the web logs from our web servers each day. Then I need to transform all that information into an actual structured database table that I can import into my data warehouse. That transformation stage might go through every line of those web server logs and transform it into an actual table, where I'm plucking out from each web log line things like the session ID, what page they looked at, what time it was, what the referrer was, and things like that. I can organize that into a tabular structure that I can then load into the data warehouse itself as an actual table in the database. So as data becomes larger and larger, that transformation step can become a real problem. Think about how much processing work is required to go through all of the web logs on Google or Amazon or any large website, and transform that into something a database can ingest. That itself becomes a scalability challenge, and something that can introduce stability problems to the entire data warehouse pipeline. So that's where the concept of ELT comes in, and it kind of flips everything on its head. It says: well, what if we don't use a huge Oracle instance? What if, instead, we use some of these newer techniques that allow us to have a more distributed database over a Hadoop cluster, and that let us take the power of these distributed databases.
You know, these things built on Hadoop, like Hive or Spark or MapReduce, and use that to actually do the transformation after the data has been loaded. So the idea here is we're going to extract the information we want, as we did before, say, from a set of web server logs. But then we're going to load that straight into our data repository, and we're going to use the power of the repository itself to actually do the transformation in place. So instead of doing an offline process to transform my web logs, as an example, into a structured format, I'm just going to suck those in as raw text files and go through them one line at a time, using the power of something like Hadoop to actually transform those into a more structured format that I can then query across my entire data warehouse solution. Things like Hive let you host a massive database on a Hadoop cluster. And there are things like Spark SQL that also let you do queries in a very SQL-like, data-warehouse-like manner on a data warehouse that is actually distributed on a Hadoop cluster. There are also distributed NoSQL data stores that you can query using Spark and MapReduce. And the idea is that instead of using a monolithic database for a data warehouse, you're instead using something built on top of Hadoop, or some sort of cluster, that can not only scale up the processing and querying of that data, but also scale the transformation of that data as well. So again, you first extract the raw data. But then we're going to load it into the data warehouse system itself as is, and then use the power of the data warehouse, which might be built on Hadoop, to do that transformation as the third step. Then I can query things together. So it's a very big project, a very big topic. Data warehousing is an entire discipline in and of itself. And we're going to talk about Spark some more in this course very soon, which is one way of handling this thing.
There's something called Spark SQL, in particular, that's relevant. There are also things like Hive and MapReduce, more modern big data techniques in general that we can cover, and there are other courses that I offer on Spark and MapReduce that will give you more insight into this. I also have a free course on big data basics you can check out. There's lots to learn about there. So again, the overall concept is that if you move from a monolithic database built on Oracle or MySQL to one of these more modern distributed databases built on top of Hadoop, you can take that transform stage and actually do it after you've loaded in the raw data, as opposed to before. That can end up being simpler and more scalable, and it takes advantage of the power of the large computing clusters that are available today. So that's ETL vs. ELT: the legacy way of doing it, before we had a lot of clusters all over the place and cloud-based computing, versus a way that makes sense today, when we do have large clouds of computing available to us for transforming large data sets. That's the concept. So again, ETL is kind of the old-school way of doing it. You transform a bunch of data offline before importing it and loading it into a giant data warehouse, a monolithic database. But with today's techniques, with cloud-based databases and Hadoop and Hive and Spark and MapReduce, you can actually do it a little bit more efficiently and take the power of a cluster to do that transformation step after you've loaded the raw data into your data warehouse. So this is really changing the field, and it's important that you know about it. Again, there's a lot more to learn on this subject, so I encourage you to explore more on this topic. But that's the basic concept, and now you know what people are talking about when they talk about ETL vs. ELT.
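The difference in ordering between ETL and ELT can be sketched on a toy scale. The example below is only an illustration under loud assumptions: an in-memory SQLite database stands in for the warehouse (a real ELT setup would use something like Hive or Spark SQL on a cluster), the log format is a made-up "session_id page" line, and the function and table names are hypothetical.

```python
import sqlite3

# Made-up web log lines in a simplified "session_id page" format.
RAW_LOG_LINES = ["s1 /home", "s1 /cart", "s2 /home"]

def run_etl(conn, lines):
    """ETL: transform outside the database, then load structured rows."""
    rows = [tuple(line.split(" ", 1)) for line in lines]           # transform
    conn.execute("CREATE TABLE page_views_etl (session_id TEXT, page TEXT)")
    conn.executemany("INSERT INTO page_views_etl VALUES (?, ?)", rows)  # load

def run_elt(conn, lines):
    """ELT: load the raw text first, then transform inside the data store."""
    conn.execute("CREATE TABLE raw_logs (line TEXT)")
    conn.executemany("INSERT INTO raw_logs VALUES (?)",
                     [(line,) for line in lines])                   # load raw
    conn.execute(
        """
        CREATE TABLE page_views_elt AS
        SELECT substr(line, 1, instr(line, ' ') - 1) AS session_id,
               substr(line, instr(line, ' ') + 1)    AS page
        FROM raw_logs
        """
    )                                                               # transform in place

conn = sqlite3.connect(":memory:")
run_etl(conn, RAW_LOG_LINES)
run_elt(conn, RAW_LOG_LINES)

query = "SELECT session_id, page FROM {} ORDER BY session_id, page"
etl_rows = conn.execute(query.format("page_views_etl")).fetchall()
elt_rows = conn.execute(query.format("page_views_elt")).fetchall()
print(etl_rows == elt_rows)
```

Both routes end with the same queryable table. What changes is where the parsing work runs: in a separate offline process for ETL, or on the data store's own compute for ELT.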
54. Reinforcement Learning: Our next stop is a fun one: reinforcement learning. We can actually illustrate this idea with an example of Pac-Man. We can create a little intelligent Pac-Man agent that can play the game Pac-Man really well on its own, and you'll be surprised how simple the technique is for building up the smarts behind this intelligent Pac-Man. Let's take a look. Let's talk about reinforcement learning. This is kind of a fun little concept that you can think about in terms of the game Pac-Man, one of my all-time favorites. The idea behind reinforcement learning is that you have some sort of agent, in this case Pac-Man, that explores some sort of space, and in our example that space will be the maze that Pac-Man is in. As it goes, it learns the value of different state changes within different conditions. So, for example, here the state of Pac-Man might be defined by the fact that it has a ghost to the south, a wall to the west, and empty spaces to the north and east. That might define the current state of Pac-Man, and the state changes it can take would be to move in a given direction, and I can then learn the value of going in a certain direction. So, for example, if I were to move north, nothing would really happen. There's no real reward associated with that. But if I were to move south, I would be destroyed by the ghost, and that would be a negative value. So as I go and explore this entire space, I can build up a set of all the possible states that Pac-Man can be in, and the values associated with moving in a given direction in each one of those states. And that's reinforcement learning. As it explores this whole space, it refines these reward values for a given state, and it can then use those stored reward values to choose the best decision to make, given a current set of conditions. In addition to Pac-Man, there's also a game called Cat & Mouse, an example that's used commonly, which we'll look at later.
And the benefit of this technique is that once you've explored the entire set of possible states that your agent can be in, you can very quickly have very good performance when you run different iterations of this. So you can basically make an intelligent Pac-Man by running reinforcement learning and letting it explore the values of the different decisions it can make in different states, and then storing that information to very quickly make the right decision, given a future state that it sees in an unknown set of conditions. A very specific implementation of reinforcement learning is called Q-learning, and this formalizes what we just talked about a little bit more. So again, you start with a set of environmental states, which we're going to call s. Possible states are the surrounding conditions of the agent: is there a ghost next to me? Is there a power pill in front of me? Things like that. I have a set of possible actions that I can take in those states, and we're going to call that set of actions a. In the case of Pac-Man, those possible actions are move up, down, left, or right. And then we have a value for each state/action pair that we'll call Q. That's why we call it Q-learning. So for each state, meaning a given set of conditions surrounding Pac-Man, a given action will have a value Q. Moving up might have a given Q value, and moving down might have a negative Q value if it means encountering a ghost, for example. We start off with a Q value of zero for every possible state that Pac-Man could be in. As Pac-Man explores a maze, and as bad things happen to Pac-Man, we reduce the Q value for the state that Pac-Man was in at the time. So if Pac-Man ends up getting eaten by a ghost, we penalize whatever he did in that current state. And as good things happen to Pac-Man, as he eats a power pill or eats a ghost, we'll increase the Q value for that action for the state that he was in.
Okay, and then what we can do is use those Q values to inform Pac-Man's future choices, and sort of build a little intelligent agent that can perform optimally and make a perfect little Pac-Man. So, getting back to a real example of some state actions here: we could define the current state of Pac-Man by the fact that he has a wall to the west, empty space to the north and east, and a ghost to the south, and we can look at the actions he can take. He can't actually move left at all, but he can move up, down, or right, and we can assign a value to all those actions. By going up or right, nothing really happens at all. There's no power pill or dots to consume. But if he goes down, that's definitely a negative value. So you can say that for the state given by the current conditions that Pac-Man is surrounded by, moving down would be a really bad choice, and there should be a negative Q value for that. Moving left just can't be done at all, and moving up or right are just neutral, so the Q value would remain zero for those action choices for that given state. Now, you can also look ahead a little bit to make an even more intelligent agent. I'm actually two steps away from getting a power pill here. So as Pac-Man explores this state, if I were to hit the case of eating that power pill on the next state, I could actually factor that into the Q value for the previous state. If you just have some sort of discount factor based on how far away you are in time, how many steps away you are, you can factor that all in together. That's a way of actually building a little bit of memory into the system. So the Q value that I experienced when I consumed that power pill might actually give a boost to the previous Q values that I encountered along the way. That's a way to make Q-learning even better. Now, one problem that we have in reinforcement learning is the exploration problem.
How do I make sure that I efficiently cover all the different states and actions within those states during the exploration phase? The naive approach is to always choose the action for a given state with the highest Q value that I've computed so far, and if there's a tie, just choose at random. So initially, all of my Q values might be zero, and I'll just pick actions at random at first, and as I start to gain information about better Q values for given actions and given states, I'll start to use those as I go. But that ends up being pretty inefficient, and I can actually miss a lot of paths that way if I just tie myself into this rigid algorithm of always choosing the best Q value that I've computed thus far. So a better way is to introduce a little bit of random variation into my actions as I'm exploring. We call that an epsilon term. So I roll the dice: I have a random number, and if it ends up being less than this epsilon value, I don't actually follow the highest Q value. I don't do the thing that makes sense. I just take a path at random to try it out and see what happens. That actually lets me explore a much wider range of possibilities, a much wider range of actions for a wider range of states, more efficiently during that exploration stage. So what we just did can be described in very fancy mathematical terms, but conceptually it's pretty simple. I explore some set of actions that I could take for a given set of states. I use that to inform the rewards associated with a given action for a given set of states. And after that exploration is done, I can use that information, those Q values, to intelligently navigate through an entirely new maze, for example. But this can also be called a Markov decision process. A lot of data science is just assigning fancy, intimidating names to simple concepts, and there's a ton of that in reinforcement learning.
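The epsilon idea described above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the course: the function name choose_action, the example Q values, and the epsilon of 0.1 are all assumptions made for the example.

```python
import random

def choose_action(q_values, epsilon, rng):
    """Pick an action index from a list of Q values, epsilon-greedy style.

    With probability epsilon, explore a uniformly random action. Otherwise
    exploit the highest Q value, breaking ties at random.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))              # explore
    best = max(q_values)
    best_actions = [a for a, q in enumerate(q_values) if q == best]
    return rng.choice(best_actions)                      # exploit

rng = random.Random(0)
# Hypothetical Q values for one state, in the order up, down, left, right.
q_for_state = [0.0, -5.0, 0.0, 2.0]

picks = [choose_action(q_for_state, epsilon=0.1, rng=rng) for _ in range(1000)]
greedy_share = picks.count(3) / len(picks)
print(greedy_share)
```

With epsilon set to 0.1, roughly nine choices in ten follow the best known Q value, and the rest are random probes that keep the agent from locking onto the first path that happens to work.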
So if you look up the definition of Markov decision processes, it is a mathematical framework for modeling decision making (what action do we take, given a set of possibilities for a given state?) in situations where outcomes are partly random, kind of like our random exploration there, and partly under the control of a decision maker, the decision maker being the Q values that we computed. So MDPs, Markov decision processes, are a fancy way of describing the exploration algorithm that we just described for reinforcement learning, and the notation is even similar. States are still described as s, and s' is the next state that we encounter. We have state transition functions that are defined as P_a for a given s and s', and our Q values are basically represented as a reward function, so R_a is some value for a given s and s'. So moving from one state to another has a given reward associated with it, and moving from one state to another is defined by a state transition function. So again, it's describing what we just did, only with mathematical notation and a fancier-sounding word, Markov decision processes. And if you want to sound even smarter, you can also call a Markov decision process by another name: a discrete time stochastic control process. Holy cow, that sounds intelligent, but the concept itself is the same thing that we just described. Even more fancy words: dynamic programming can be used to describe what we just did as well. Wow, that sounds like artificial intelligence, computers programming themselves, Terminator 2, Skynet stuff. But no, it's just what we just did. If you look up the definition of dynamic programming, it is a method for solving a complex problem, like creating an intelligent Pac-Man (that's a pretty complicated end result), by breaking it down into a collection of simpler subproblems. So, for example, what is the optimal action to take for a given state that Pac-Man might be in?
There are many different states Pac-Man could find himself in, but each one of those states represents a simpler subproblem where there's a limited set of choices I could make, and there's one right answer for the best move to make. The definition continues: and storing their solutions, those solutions being the Q values that I associated with each possible action in each state, ideally using a memory-based data structure. Well, of course, I need to store those Q values and associate them with states somehow, right? The next time the same subproblem occurs, meaning the next time Pac-Man is in a given state that I have a set of Q values for, instead of recomputing its solution, one simply looks up the previously computed solution, the Q value I already have from the exploration stage, thereby saving computation time at the expense of a modest expenditure in storage space. That's exactly what we just did with reinforcement learning. We have a complicated exploration phase that finds the optimal rewards associated with each action for a given state, and once we have that table of the right action to take for a given state, we can very quickly use that to make our Pac-Man move in an optimal manner in a whole new maze that he hasn't seen before. So reinforcement learning is also a form of dynamic programming. Wow. So to recap, you can make an intelligent Pac-Man agent by just having it semi-randomly explore different choices of movement given different conditions, where those choices are actions and those conditions are states. We keep track of the reward or penalty associated with each action or state as we go, and we can actually discount going back multiple steps if we want to make it even better. Then we store those Q values that we end up associating with each state, and we can use that to inform its future choices. So we could go into a whole new maze and have a really smart Pac-Man that can avoid the ghosts and eat them up pretty effectively, all on its own. Pretty simple concept.
Very powerful, though. And you can also say that you understand a bunch of fancy terms, because it's all called the same thing: Q-learning, reinforcement learning, Markov decision processes, dynamic programming, all tied up in the same concept. I think it's pretty cool that you can actually make a sort of artificially intelligent Pac-Man through such a simple technique, and it really does work. If you want to go look at it in more detail, here are a few examples that have actual source code you can look at and potentially play with. There is a Python Markov decision process toolbox that wraps it up in all that terminology we talked about. There's a working example of the Cat & Mouse game, which is similar, and there's actually a Pac-Man example you can look at online as well that ties in more directly with what we were talking about. So feel free to explore these links and learn even more about it. But that is reinforcement learning in a nutshell. More generally, it's a useful technique for building an agent that can navigate its way through a set of possible states, each of which has a set of actions associated with it. We've talked about it mostly in the context of a maze game, but you can think more broadly. Whenever you have a situation where you need to predict the behavior of something, given a set of current conditions and a set of actions it can take, reinforcement learning and Q-learning might be a way of doing it, so keep that in mind.
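To make the bookkeeping from this section concrete, here is a small sketch of a Q table stored as a lookup structure, with penalties, rewards, and the discounted look-ahead all handled by one update function. Everything specific in it is an assumption for illustration: the state names, the reward sizes of plus or minus 10, the learning rate of 0.5, and the discount of 0.9 do not come from the course.

```python
from collections import defaultdict

ACTIONS = ("up", "down", "left", "right")

def update_q(q, state, action, reward, next_state,
             learning_rate=0.5, discount=0.9):
    """Nudge Q(state, action) toward reward + discounted best future value."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    target = reward + discount * best_next
    q[(state, action)] += learning_rate * (target - q[(state, action)])

# Every Q value starts at zero; entries are stored so they can be looked up later.
q = defaultdict(float)

# Bad outcome: moving down into the ghost is penalized.
update_q(q, "ghost_south", "down", -10, "eaten")

# Good outcome: eating the power pill is rewarded.
update_q(q, "one_step_from_pill", "right", 10, "pill_eaten")

# Look-ahead: the step before the pill earns nothing by itself, but it
# inherits a discounted share of the value of the state it leads to.
update_q(q, "two_steps_from_pill", "right", 0, "one_step_from_pill")

print(q[("ghost_south", "down")],
      q[("one_step_from_pill", "right")],
      q[("two_steps_from_pill", "right")])
```

Once the table is filled in, choosing a move is just a lookup of the stored values for the current state, which is the dynamic programming trade of storage space for computation time described above.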
55. Hands-On with Q-Learning: So reinforcement learning has become a lot more popular in recent years as excitement about machine learning in general has grown. And fortunately, there's a new package called OpenAI Gym that makes it pretty easy for you to set up test cases for reinforcement learning. So let's actually do some hands-on practice using it. First, however, we need to install Gym. To do that, go to your Anaconda prompt on Windows, or your terminal on Linux or macOS, and just type in pip install gym. Make sure you Ctrl+C out of Jupyter Notebook first if that's still running, of course. Gym actually works better on Linux machines than Windows machines. There, it will allow you to do things like play Space Invaders, and train it how to play Space Invaders and Doom and more graphical, video-gamey things. On Windows, it's going to be more limited because it doesn't have access to the graphical system, but as you'll see, we can still use it. All right, so we have at least the bare-bones version of Gym installed here. Let's go ahead and start our notebook with jupyter notebook. Of course, I'm already in the ML course folder where my course materials are. Look for the Q-learning notebook. There it is. Let's see if it works. All right, so what we're going to play with here is what's called the taxi problem. It will make more sense when we look at it, but basically we're modeling a self-driving taxi that can pick up passengers at one of a set of fixed locations, drop them off at another location, and try to get there in the quickest amount of time while avoiding obstacles. So we're basically going to train our taxi on how to get passengers from one point to another in the quickest way possible, using reinforcement learning. Let's start off by importing the environment that we need. We'll import the gym package we just installed, and random, and give ourselves a consistent random seed.
So we get the same results each time. We will make our environment, called Taxi-v2, which just contains the rules of this game, if you will, and how it all works. We'll call the resulting model streets, and then render it out so we can visualize what this environment looks like. So let's go ahead and hit Shift+Enter, and there we have it. Here's how to interpret what you're seeing. The letters R, G, B, and Y there, which correspond to red, green, blue, and yellow or whatever you want, are the valid pickup and drop-off locations. A passenger may choose to be picked up at any of those locations and may choose to be dropped off at any of those locations. Our taxi needs to learn how to get them from one point to another as quickly as possible. Now, whichever letter is colored blue is where we need to pick somebody up from, and whichever letter is colored in magenta indicates where that passenger wants to go. So remember, the blue letter is where we're coming from, and the magenta letter is where we're going. In this case, our customer is being picked up at B and wants to go to G. Now, those solid lines represent walls, so the taxi cannot cross those lines. That's kind of like the edge of the road there, or whatever. And the filled-in rectangle, that orangey-yellow square, is actually the taxi itself. That represents where the taxi is now. When it's empty, it will be yellow, and when it's green, that means it is carrying a passenger. So we have this little virtual taxi game here, and we need to train our taxi how to play the game. We have a little world here, and we've named it streets. It's basically a five-by-five grid, and we can define the state of this world at any time with just a few things: where the taxi is, which is one of 25 possible locations; what the current destination is, which is one of four possibilities; and where the passenger is, which is five possibilities.
Actually, it could be either one of the destinations or inside the taxi, for a fifth possibility. So in total, that works out to 500 possible states that could describe our world at any given point. That's a manageable number. Now, for each possible state, each one of those 500 states, there are six possible actions associated with that state, and we need to learn which of those actions make sense for each state. For each state, we can move south, east, west, or north, we can pick up a passenger, or we can drop off the passenger. So as our virtual taxi is exploring this environment and learning about it, we need to associate rewards and penalties with what it does as it applies different actions to these different states. We'll define those thusly. If our taxi successfully drops someone off where they belong, we'll give it a reward of 20 points. If you take a time step while driving a passenger but not dropping them off, you get a negative one point penalty. That will serve to make sure that we reward the shortest path over time. And if you do something bad, like picking up or dropping off at an illegal location, that gives you a negative 10 point penalty. So it doesn't have any built-in smarts to know ahead of time where you're allowed to pick up and drop off passengers. It has to learn that, too. And as far as walls go, well, it knows about the walls. We just don't allow that to happen at all. You cannot cross a wall no matter what. That's physically impossible. All right, so let's start with an initial state. Our initial state here will just be the taxi at location x = 2 and y = 3. The passenger will be at pickup location number 2, and the destination will be location 0. We encode that state as the location of the taxi at (2, 3), the passenger location, and the destination location. We'll set our streets' state to that and render it so we can visualize it again. And there you have it.
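The count of 500 states is just 25 taxi positions times 5 passenger locations times 4 destinations, and packing those variables into one state number can be sketched as below. This is an illustrative assumption, not a description of Gym's internals: Gym ships its own encoder, and the helper names encode and decode plus the ordering of the variables here are hypothetical choices for the example.

```python
GRID_SIZE = 5          # the world is a 5 x 5 grid
PASSENGER_SLOTS = 5    # four pickup spots, or riding inside the taxi
DESTINATIONS = 4       # R, G, B or Y

NUM_STATES = GRID_SIZE * GRID_SIZE * PASSENGER_SLOTS * DESTINATIONS
NUM_ACTIONS = 6        # south, north, east, west, pick up, drop off

# Rewards and penalties as described in the text.
REWARD_DROPOFF = 20
REWARD_TIME_STEP = -1
REWARD_ILLEGAL = -10

def encode(taxi_row, taxi_col, passenger, destination):
    """Pack the four state variables into a single integer (mixed radix)."""
    index = taxi_row
    index = index * GRID_SIZE + taxi_col
    index = index * PASSENGER_SLOTS + passenger
    index = index * DESTINATIONS + destination
    return index

def decode(index):
    """Invert encode(): recover the four state variables from the integer."""
    index, destination = divmod(index, DESTINATIONS)
    index, passenger = divmod(index, PASSENGER_SLOTS)
    taxi_row, taxi_col = divmod(index, GRID_SIZE)
    return taxi_row, taxi_col, passenger, destination

# Taxi at (2, 3), passenger at pickup location 2, destination 0.
initial_state = encode(2, 3, 2, 0)
print(NUM_STATES, initial_state, decode(initial_state))
```

A single integer per state is convenient because the Q table can then be a plain two-dimensional array with one row per state and one column per action.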
So our taxi is at that initial location, (2, 3). The initial passenger pickup location, indicated by the blue letter, is location 2, and they want to go to location 0, which is colored in magenta. So basically, our taxi is starting over here, it has to pick up a passenger down here, and it wants to drop them off up there. All right, let's take a look at our initial reward table for that initial state. Here's how you actually interpret the reward table. Each row is going to correspond to a potential action at this state. Like we said, there are six potential actions for each state: it can move south, north, east, or west, pick up, or drop off. The four values in each row are the probability assigned to that action, the next state that results from that action, the reward for that action, and whether that action indicates a successful drop-off took place. So, for example, we can see that moving north from this state would put us into state number 368. It would incur a penalty of negative one for just taking up a time step, and it does not result in a successful drop-off. Initially, our probabilities are all 1.0, because we haven't learned anything yet. Let's take care of that. So let's actually do some Q-learning here, like we talked about in the slides. The first thing we need to do is train our model. At a high level, we're going to train over 10,000 simulated taxi runs, and for each run we'll step through time with a 10% chance of just making a random exploratory step instead of using the learned Q values to guide our actions. So basically we have a 10% exploration factor here that we're using in the learning phase. Let's quickly walk through the code. We start off by defining a Q table, which is a NumPy array. It contains a 2D array that represents every possible state and action within our virtual space here, and it initializes those Q values all to zero.
We have some hyperparameters here as well: the learning rate, which is basically how quickly we try to learn, and the discount factor. These all feed into that Q-learning equation that we looked at in the slides. And there's our exploration rate, which, as we said, is 10%, and the number of epochs that we're going to train over is 10,000. So we go over all 10,000 taxi runs. We reset our virtual playing field here, and while we're not done, we first draw a random number between zero and one. If that number is less than our exploration rate, which is 0.1, then we will actually just explore a random action. Otherwise, we will just go with the highest Q-value associated with the actions that are available to us. Okay, so since exploration is 0.1, that's basically a 10% chance of the random number being less than that, in which case we just pick a random sample action. Otherwise, we go with the maximum Q-value available to us for the actions that we have at hand. We then call streets.step to actually apply that action. That gives us back the next state that results from that action, the resulting reward, whether or not we're done at that point (whether we successfully dropped off our passenger), and some more info as well. At that point, we need to actually do the Q-learning equation. This is basically the code version of the equation that we looked at in the slides. We're just taking a look at the current Q-value, then the one for the next state, and then computing the new Q-value based on the Q-learning equation that takes our learning rate and discount factor into account. And then we assign the Q-table entry for that given state and action to the newly learned Q-value, and we set our state to the next state and do it all over again until we're actually done and actually drop off our passenger. That will complete a single run of the taxi, and then we do that 10,000 times, so this could take a little bit of time to run. Let's actually try it out.
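The training loop just described can be sketched without Gym at all. What follows is only a minimal sketch under stated assumptions: the one-dimensional corridor, the `step` function, its reward values, and the hyperparameter values are all hypothetical stand-ins chosen for illustration, not the Taxi environment or the notebook's code. The update line itself is the standard Q-learning rule.

```python
import random

# Hypothetical toy world (an assumption, not the Taxi environment):
# states 0..4 along a corridor, with the goal at state 4.
# Actions: 0 = move left, 1 = move right.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4


def step(state, action):
    """Apply an action and return (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    if next_state == GOAL:
        return next_state, 20, True   # reward for reaching the goal
    return next_state, -1, False      # small penalty for every time step


def train(epochs=1000, learning_rate=0.1, discount_factor=0.6,
          exploration=0.1, seed=0):
    rng = random.Random(seed)
    # Q-table: one row per state, one column per action, all zeros to start.
    q_table = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(epochs):
        state, done = 0, False
        while not done:
            if rng.random() < exploration:
                action = rng.randrange(N_ACTIONS)                   # explore
            else:
                action = q_table[state].index(max(q_table[state]))  # exploit
            next_state, reward, done = step(state, action)
            prev_q = q_table[state][action]
            next_max_q = max(q_table[next_state])
            # Standard Q-learning update, blending the old value with the
            # reward plus the discounted best value of the next state.
            q_table[state][action] = (
                (1 - learning_rate) * prev_q
                + learning_rate * (reward + discount_factor * next_max_q)
            )
            state = next_state
    return q_table


q_table = train()
# Greedy action for each non-goal state after training.
best_actions = [row.index(max(row)) for row in q_table[:GOAL]]
print(best_actions)
```

In this toy world the learned greedy action in every non-goal state should be "move right", which is the shortest path to the goal; the same explore-or-exploit structure is what the lecture applies to the much larger Taxi state space.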
We'll click in here and hit Shift-Enter and just watch it go. Not too bad, it took a few seconds. That's pretty impressive, right? All right, so let's see what we got. So now we should have built up a table of Q-values that we can quickly use to determine the optimal next action for any given state. So let's take a look at the Q-table for our initial state where we started from. So here you can see that the highest Q-value (it's pretty close here) corresponds to the action go west, it turns out. So that makes sense. If you look back at our initial state, yeah, I mean, we want to go that way, right? We want to pick up our guy around this wall, so the only direction that makes sense is going west. So looks like it worked so far. That's pretty cool. All right, so let's actually see it in action. A cool thing about OpenAI Gym is that it kind of lets you animate what's going on, and it's really fun to watch. So what we're going to do in this block of code is actually simulate 10 different trips here. For each trip we'll reset the streets state, and we will just walk through it one step at a time, applying whatever action we learned from that learned Q-table. So there's no learning that has to happen here, in that we've already learned the optimal action for every possible state we could have in this world. We can very quickly guide our taxi through any given situation at all. So that's really the power of Q-learning. You know, once you've learned that Q-table, actually powering an AI to go through your virtual environment is super fast and super simple. And the really fun part here is that we can call streets.render to actually animate that taxi moving through as we go. So let's watch. There it goes, picking up our passenger down at the Y location, picked it up and delivered it to G. Cool. And we're on trip number three now, picking up our passenger again. It's in the green box now.
It's in the taxi and got dropped off successfully. Starting again, picking up a passenger at the blue location and going down to magenta as quickly as possible. This is really fun. So again, we picked up a passenger at G, and now we're dropping him off down there at Y again. It took a really quick path. So this actually seems to work, guys. How cool is that? So we've actually trained a virtual taxi how to find the quickest path between any two points for picking up a passenger and dropping them off at a given location. So fun stuff. We've taught a little virtual vehicle how to navigate this world all on its own. That's fun stuff, guys. You can just watch this forever. But anyway, if you want to play with this some more, I definitely challenge you to do so. So try modifying this little experimental block up here to keep track of the total number of time steps that it takes to actually get through all 10 trips. And that will actually be a useful metric as to how good our system is. Basically, if it gets through those 10 trips in the minimum amount of time, that implies that it's doing a really good job, right? So if you want to actually sample over a larger number of trips, you can remove that sleep function there to make it run faster, and that will allow you to run over more samples more quickly. So once you have that metric in place of how many steps it takes to get through those trips, you could start playing with those hyperparameters. So try to see how low you can make the number of epochs go before the model starts to suffer. Do I really need 10,000 training runs? Can you come up with a better value for the learning rate or the discount factor or the exploration factor to make training even more efficient? Can those factors influence how many epochs you actually need to get a good result? So these are good things to get an intuitive feel for, and to see how those values influence your resulting model.
The exploration rate in particular is going to be an interesting one to experiment with. So if you only have time to play with one value, I would recommend that one. That's a fun little example, huh? So that's Q-learning in action, training a virtual taxi. And you have now actually applied reinforcement learning on a real example.
56. Understanding a Confusion Matrix: Something you might encounter is the concept of confusion matrices. So let's dive into what those are all about. What's the confusion matrix for? Well, the thing is, sometimes accuracy doesn't tell the whole story, and a confusion matrix can help you understand the more nuanced results of your model. For example, a test for a rare disease could be 99.9% accurate by just guessing "no" all the time, by saying you don't have it. A model that does that would look on paper to have very high accuracy, but in reality, it's worse than useless, right? So you need to understand, with a case like this, how important a true positive or true negative is, and how important a false positive or false negative is to what you're trying to accomplish, and to be able to measure how good your model is at each one of those cases. And a confusion matrix is just a way to illustrate those nuances in the accuracy of your model. One might look like this. This is the general format of it. So imagine that we have a binary situation where we're just predicting yes or no. Like, I have this disease or I don't have this disease, or I test positive for this drug or I don't test positive for this drug. This image has a cat in it or this image does not have a cat in it. This is the format of what it would look like. So you see that on the rows we have predicted values and in the columns we have actual values. So let's go through it. If we predicted something is true and it really is, then that's a true positive. If we predicted yes, but it's actually no, actually negative, that'll be a false positive. If we predicted no, but it's actually yes, that's a false negative. And if we predicted no and it's actually no, that's a true negative. I mean, it gets a little confusing, but if you think it through, this all makes sense, right? In an actual confusion matrix, these cells will contain actual numbers of how often your model actually did that on its testing dataset.
So keep in mind too that you have to pay attention to the labels. There's no real convention to how this is ordered. Sometimes you'll see predictions up here and the actual values over here. Don't just jump in assuming that a given confusion matrix is of a certain format. Pay attention to how it's labeled and make sure you understand what it's telling you before you draw conclusions from it. Something else worth noting here is that you typically want to have most of your values here and here, right? So the diagonal here of your confusion matrix is where most of your results should be. This is where accuracy lives, right? So this is where I have a true positive. This is where I have a true negative. You want those to be nice big numbers, and false negatives and false positives to be comparably low numbers, hopefully, right? So an accurate model would have high numbers along this diagonal here. Let's plug in some actual numbers to see what that might look like. So say I have a machine learning model that's trying to figure out if an image contains a picture of a cat or not. If we predicted that it had a cat and it really did have a cat, that happened 50 times in my test set. But sometimes I predicted it was a cat but it wasn't a cat, it was a dog or a fish or something. And that happened 5 times. If I predicted that it wasn't a cat, but it really was a cat, that happened 10 times in this example. And if I said it was not a cat and it really was not a cat, that happened 100 times in this case. So that's just how you interpret a confusion matrix. And we'll talk shortly about how to make metrics off of this data that are more useful for analysis. Sometimes you'll see confusion matrices in a different format where we actually add things up on each row and column as well. So that's something you might see once in a while.
All that is, is adding up how many actual no's we have, how many actual yeses we have, how many predicted no's we have, and how many predicted yeses we have in total. So just so you have seen that format before, that's what that looks like. The inner part of it, though, is just the same confusion matrix that we looked at before. And again, remember, things can be flipped as far as where the predicted values and the actual values are. So make sure you pay attention to the labels on these things. And you know, what can I say? Confusion matrices can be confusing. Sometimes you'll see them in this sort of a format too. So maybe we have a multi-class classification model here. Imagine that we have a handwriting recognition system that's trying to identify somebody writing the digits 0 through 9. So a more complicated confusion matrix might look like this, where instead of just yes/no answers we actually have multiple classifications, but it works the same way. So here we have predicted labels on this axis and true labels on this axis. So we're saying that if I predicted something was a five and it really was a five, well, that shade of blue corresponds to some number here. So two things are different in this example. First of all, we have more than yes/no options here. We have multiple classifications, so our confusion matrix is larger. Let's dive into another example there just to drive that home. So sometimes I predicted it was a one, but it was really an eight. That has sort of a lighter blue there. Maybe that happened, you know, 20 or so times in this example. And we're also using what's called a heatmap. So instead of just displaying numbers in these individual cells, we're mapping those numbers to colors, where the darkness of that color corresponds to how high of a number it is. You'd expect to see sort of a dark line going down the diagonal here, representing good accuracy on true positives and true negatives.
And some sparser, lighter colors outside of it, ideally. But that color will map to an actual value, and it just makes it easy to visualize how your confusion matrix is laid out. All right, make sense, guys? That's what a confusion matrix is all about. And it can be a little bit confusing, but just stare at these examples a little bit and it should make sense to you.
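To make the layout concrete, here's a minimal sketch that tallies a binary confusion matrix from a list of actual and predicted labels. The `confusion_matrix` helper and the tiny label lists are made up for illustration (they are not from the course materials), and the rows-are-predicted, columns-are-actual layout is just the convention from the first slide; as the lecture says, other sources flip it.

```python
def confusion_matrix(actual, predicted):
    """Count outcomes for boolean labels, keyed by (predicted, actual)."""
    counts = {(p, a): 0 for p in (True, False) for a in (True, False)}
    for a, p in zip(actual, predicted):
        counts[(p, a)] += 1
    return counts


# Made-up labels: True means "this image has a cat in it".
actual = [True, True, True, False, False, False, False, True]
predicted = [True, True, False, True, False, False, False, True]

cm = confusion_matrix(actual, predicted)
tp, fp = cm[(True, True)], cm[(True, False)]    # predicted yes
fn, tn = cm[(False, True)], cm[(False, False)]  # predicted no

# Rows are predicted values, columns are actual values.
print("                actual yes  actual no")
print(f"predicted yes   {tp:10d}  {fp:9d}")
print(f"predicted no    {fn:10d}  {tn:9d}")
```

Keying the counts by an explicit (predicted, actual) pair is one way to avoid the row/column mix-ups the lecture warns about, since the code never depends on which axis is drawn where.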
57. Measuring Classifiers (Precision, Recall, F1, ROC, AUC): Let's talk about some metrics that you can derive from a confusion matrix. So let's revisit our friend the confusion matrix again. In this particular example of one, we have actual values going down the columns and predicted values across the rows. That can be different. But in this format, we have the number of true positives in the upper left corner, the number of true negatives in the lower right corner, the number of false negatives in the lower left corner, and the number of false positives in the upper right corner. Okay? So make sure you understand where your true positives and negatives are, and where your false positives and negatives are, when you're starting to look at a confusion matrix. And again, that can vary based on the layout of the confusion matrix itself. Let's start with recall. So recall is computed as the true positives over true positives plus the false negatives. You should seriously memorize this. You need to know this. It goes by other names as well, just to make things more confusing. So it's also known as sensitivity, true positive rate, and completeness. And completeness kind of hearkens back to its original use in the world of information retrieval. So it's a good choice of metric when you care a lot about false negatives. Okay, so fraud detection is a great example of where you might be focusing on recall, because a false negative in the world of fraud means that something was fraud, but you failed to identify it as fraud. You had a fraudulent transaction that was flagged as being perfectly okay. That is the worst possible outcome in a system that's supposed to be detecting fraud, right? You want to be erring on the side of false positives over false negatives in that case. So recall is a good choice of metric when you care about false negatives, fraud detection being an example of that, and it is true positives over true positives plus false negatives. Let's make it real with an example here.
So in this particular example of a confusion matrix, again, recall is true positives over true positives plus false negatives. We just pluck the values out of this confusion matrix. In this particular layout, true positives will be 5, false negatives will be 10. So we just say 5 over 5 plus 10, which is 5 over 15, or one third, or 33.33 percent, right? So that's recall. Recall's partner in crime is precision, and precision is computed as true positives over true positives plus false positives. This goes by other names as well, including the correct positive rate or the percent of relevant results. So this is a measure of relevancy in the world of information retrieval. When should you care about precision? Well, it's an important metric when you care about false positives. Some examples would be medical screening or drug testing. You don't want to say somebody's, you know, on cocaine or something when they're not. That would have really bad effects on their life and career and stuff, right? So again, use precision when you care about false positives more so than false negatives, drug testing being a classic example of that. Again, it is computed as true positives over true positives plus false positives. And again, we'll dive into an example here. In this particular confusion matrix, the true positives will be 5, and the false positives in this example are 20. So the precision is calculated as 5 over 25, which is 20 percent. There are other metrics as well. For example, specificity, which is the true negatives over true negatives plus false positives, also known as the true negative rate. Also, F1 score is a very common thing that's used. That is 2 times the true positives over 2 times true positives plus false positives plus false negatives. You can also compute it as 2 times the precision times recall, over the precision plus recall. Either way works. Mathematically, it is the harmonic mean of precision and sensitivity.
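As a quick check on the arithmetic above, here's a minimal sketch of those formulas in Python. The function names are my own for illustration; the counts (5 true positives, 10 false negatives, 20 false positives) are the ones from the example confusion matrix. No true-negative count is given for that example, so `specificity` is defined but not evaluated.

```python
def recall(tp, fn):
    """True positives over true positives plus false negatives."""
    return tp / (tp + fn)


def precision(tp, fp):
    """True positives over true positives plus false positives."""
    return tp / (tp + fp)


def specificity(tn, fp):
    """True negatives over true negatives plus false positives."""
    return tn / (tn + fp)


def f1_score(tp, fp, fn):
    """2 * TP over (2 * TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)


tp, fn, fp = 5, 10, 20   # counts from the lecture's example

r = recall(tp, fn)
p = precision(tp, fp)
f1 = f1_score(tp, fp, fn)
f1_harmonic = 2 * p * r / (p + r)   # harmonic mean of precision and recall

print(f"recall={r:.4f} precision={p:.4f} f1={f1:.4f} harmonic={f1_harmonic:.4f}")
```

Both ways of computing F1 agree here, which is a handy sanity check if you ever have to reconstruct the formula from memory.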
So if you care about precision and recall (remember, recall and sensitivity are the same thing), F1 score is a metric that balances the two. If you know that your model doesn't just care about accuracy alone and you want to capture both precision and recall, F1 score can be a way of doing that. But in the real world, you're probably going to care about precision or recall more so than the other. So it really pays to think about what you care about more. Using F1 score, in my opinion, is a bit of a shortcut, a little bit of laziness. Also, RMSE is often used as a metric. It's just a straight-up measure of accuracy, and it's exactly what it sounds like, the root mean squared error. So you just take the squared errors of each prediction from its actual true value, average them, and take the square root of that. That's it. So it only cares about right and wrong answers. It doesn't get into the nuances of precision and recall. So if all you care about is accuracy, RMSE is a common metric used for that. Another way of evaluating your models is the ROC curve. That stands for receiver operating characteristic curve. And what it does is plot your true positive rate, or your recall, versus your false positive rate at various threshold settings in your model. So as you choose different thresholds for choosing between true and false, what does that curve look like? Basically, the way to interpret a ROC curve is that you want it to be above the diagonal line there. So the ideal curve would just be a single point at the upper left-hand corner, just a big right angle where the whole thing is in that upper left-hand side of the graph, if you will, to the left of that diagonal line. So the more bent your ROC curve is toward that upper left corner, the better. That's how you interpret these things. We can also talk about the area under the curve, which is the area under the ROC curve, exactly what it sounds like.
So you can actually interpret that value as the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. So an AUC of 0.5 would be what you'd expect to see if you were at that diagonal line, right? So if you actually had the area underneath that diagonal line, where things are no better than random, that turns out to be an area of 0.5, right? So that kind of makes sense. So if you see an AUC of 0.5 or below, that's useless or worse than useless. The perfect classifier would have an area under the curve, or AUC, of 1. That would again be that perfect case where the curve is just a right angle with its corner at (0, 1) up there in the upper left-hand corner. That would include the entire area of that graph, which works out to one. So AUC can be a useful metric for comparing different classifiers together, where the higher the value, the better. So there you have it. Some common metrics for evaluating classifiers: precision, recall, F1 score, ROC, and AUC are the important ones to remember.
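To see that the "ranking probability" interpretation and the "area under the ROC curve" interpretation really give the same number, here's a minimal sketch that computes AUC both ways. The helper names and the six made-up labels and scores are assumptions for illustration only; in practice you'd more likely reach for a library routine than write this by hand.

```python
def auc_by_ranking(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def roc_points(labels, scores):
    """(false positive rate, true positive rate) at each score threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_by_trapezoid(points):
    """Area under a piecewise-linear curve through the given points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up data: 1 = positive, 0 = negative, with a model score for each.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc_rank = auc_by_ranking(labels, scores)
auc_area = auc_by_trapezoid(roc_points(labels, scores))
print(auc_rank, auc_area)
```

On this toy data, eight of the nine positive/negative pairs are ranked correctly, so both calculations come out to 8/9; a classifier that ranked every positive above every negative would score 1.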
58. Bias / Variance Tradeoff: In this next section, we're going to talk about the challenges of dealing with real-world data and some of the quirks you might run into. So let's start by talking about the bias-variance tradeoff. It's just kind of a more principled way of talking about the different ways you might overfit and underfit data and how it all interrelates. Let's take a look. So one of the basic challenges that we face when dealing with real-world data is overfitting versus underfitting your regressions to that data, or your models or your predictions. And when we talk about underfitting and overfitting, we can often talk about that in the context of bias and variance and the bias-variance tradeoff. So let's talk about what that means. So conceptually, bias and variance are pretty simple. Bias is just how far off you are from the correct values. So how good are your predictions overall in predicting the right overall value? If you take the mean of all your predictions, are they more or less in the right spot? Or are your errors all consistently skewed in one direction or another? If so, then your predictions are biased in a certain direction. Variance is just a measure of how spread out, how scattered your predictions are. So if your predictions are all over the place, that's high variance. But if they're very tightly focused on what the correct values are, or even an incorrect value in the case of high bias, then your variance is small. So let's look at these examples here. Let's imagine that this dartboard represents a bunch of predictions we're making, where the real value we're trying to predict is in the center of the bull's-eye. So starting in the upper left-hand corner, you can see that our points are all scattered about the center. So overall, you know, the mean error comes out to be pretty close to reality. Our bias is actually very low, because our predictions are all around the same correct point.
However, we have very high variance because these points are scattered about all over the place. So this is an example of low bias and high variance. If we move on to this one in the upper right corner, we see here that our points are all consistently skewed from where they should be, to the northwest here. So this is an example of high bias in our predictions, where they're consistently off by a certain amount, and we have low variance because they're all clustered tightly around this wrong spot. But at least they're close together, so we're being consistent in our predictions, and that's low variance. But the bias is high. So again, this is high bias, low variance. In this example in the lower left corner, you can see that our predictions are scattered around the wrong mean point. So we have high bias, everything is skewed to some place where it shouldn't be. But our variance is also high, so that's kind of the worst of both worlds. Here we have high bias and high variance in this example. And finally, in a wonderful, perfect world, we would have an example like the lower right image here, where we have low bias, where everything is centered around where it should be, and low variance, where things are all clustered pretty tightly around where they should be. So, in a perfect world, that's what you end up with. But in reality, you often need to choose between one or the other. So let's take a look at this example, a little bit of a different way of thinking of bias and variance here. So here we have a straight line, and you can think of that as having a very low variance relative to these observations. Okay, so there's not a lot of variance in this line: low variance. But the bias, you know, the error from each individual point, is actually high. Okay, now contrast that to this overfitted data here, where we've kind of gone out of our way to fit these observations. This line has high variance but low bias, because each individual point is pretty close to where it should be.
So this is an example of where we traded off variance for bias. Now, at the end of the day, you're not out to just reduce bias or just reduce variance. You want to reduce error, right? That's what really matters. And it turns out you can express error as a function of bias and variance, so error is equal to bias squared plus variance. So these two things both contribute to the overall error, with bias entering as a squared term. But keep in mind, it's error you really want to minimize, not the bias or the variance specifically. An overly complex model will probably end up having a high variance and low bias, whereas a too simple model will have low variance and high bias. But they could both end up having similar error terms at the end of the day. So you just have to find the right happy medium of these two things when you're trying to fit your data, and we'll talk about some more principled ways of actually avoiding overfitting in our forthcoming lectures. But it's just the concept of bias and variance that I want to get across, because people do talk about it. You're going to be expected to know what it means. Let's tie that back to some earlier concepts in this course. For example, in K-nearest neighbors, if we increase the value of K, we start to spread out the neighborhood that we're averaging across to a larger area. So that has the effect of decreasing variance, because we're kind of smoothing things out over a larger space, but it might increase our bias, because we'll be picking up a larger population that may be less and less relevant to the point we started from. So by smoothing out KNN over a larger number of neighbors, we can decrease the variance because we're smoothing things out over more values. But we might be introducing bias because we're introducing more and more points that are less and less related to the point we started with. Decision trees are another example.
So we know that a single decision tree is prone to overfitting, so that might imply that it has a high variance. But random forests seek to reduce that variance, possibly at the cost of a little bias. And they do that by having multiple trees that are randomly varied and averaging all their solutions together. So kind of like when we average things out by increasing K in KNN, we can average out the results of a decision tree by using more than one decision tree with random forests. Similar idea. So that's the basic idea of bias and variance and the bias-variance tradeoff. I hope it makes a little sense. Let's move on. So that's the bias-variance tradeoff. You know, again, there's sometimes a decision you have to make between how accurate your values are overall and how spread out or how tightly clustered they are. So that's the bias-variance tradeoff, and they both contribute to the overall error, which is the thing you really care about minimizing, so keep those terms in mind.
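The "error equals bias squared plus variance" relationship from this lecture can be checked numerically. This is a minimal sketch under simplifying assumptions: a single true value, a handful of made-up predictions that are skewed high, and no irreducible noise term, so it illustrates the identity rather than a full treatment of the decomposition.

```python
def mean(values):
    return sum(values) / len(values)


true_value = 10.0
predictions = [12.1, 11.4, 12.8, 11.9, 12.3]   # made-up, consistently too high

avg_prediction = mean(predictions)
bias = avg_prediction - true_value
# Variance: how scattered the predictions are around their own average.
variance = mean([(p - avg_prediction) ** 2 for p in predictions])
# Error: mean squared distance of the predictions from the true value.
mse = mean([(p - true_value) ** 2 for p in predictions])

print(f"bias^2 + variance  = {bias ** 2 + variance:.4f}")
print(f"mean squared error = {mse:.4f}")
```

The two printed numbers match, which is the dartboard picture in arithmetic form: being far from the bull's-eye on average (bias) and being scattered (variance) both add to the same total error.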
59. K-Fold Cross Validation: Earlier in the course, we talked about train/test as a good way of preventing overfitting and actually measuring how well your model can perform on data it's never seen before. We can take that to the next level with a technique called k-fold cross-validation. And we'll talk about that next. It's an important tool in your tool chest for fighting overfitting. So let's talk about another powerful tool in your arsenal for fighting overfitting, k-fold cross-validation. And you may remember that we talked about train/test earlier in this course as a good way of fighting overfitting as well. K-fold cross-validation makes train/test even better. So let's learn how that works. So if you recall from train/test, the idea was that we split all of the data that we're building a machine learning model from into two segments, a training dataset and a test dataset. The idea is that we train our model using only the data in our training dataset. And then we evaluate its performance using the data that we reserved for a test dataset. That prevents us from overfitting to the data that we have, because we're testing the model against data that it's never seen before. However, train/test still has its limitations. You could still end up overfitting to your specific train/test split. Maybe your training dataset isn't really representative of the entire dataset, and too much stuff ended up in your training dataset that skews things. So that's where k-fold cross-validation comes in. It takes train/test and kicks it up a notch. So the idea, although it sounds complicated, is fairly simple. Instead of dividing our data into two buckets, one for training and one for testing, we divide it into k buckets. For each bucket, we'll use that bucket as our test dataset, and we use the remaining data as our training data. We then measure the resulting r-squared error score from using that bucket as our test data.
Then we move on to the next of our k buckets, and we use it as our test data and the remaining k minus 1 buckets as our training data, and we measure the error again. We keep doing this until we've tried using all k buckets as our test set. And we just average all the r-squared scores that we ended up with to get a more robust measure of our model's accuracy. And that's all it is. K-fold cross-validation is a more robust way of doing train/test, and that's one way of doing it. There are other variations on this. For example, you could reserve one bucket as your test data and then train against the remaining individual buckets and average those scores together. But the technique described in this slide is how scikit-learn does it, and that's what we're going to play with next. So fortunately, scikit-learn makes this very easy to do, and it's even easier than doing normal train/test. It's extremely simple to do k-fold cross-validation, so you may as well just do it. Now, the way this all works in practice is you'll have a model that you're trying to tune, and you'll have different variations of that model or different parameters that you might want to tweak on it. For example, the degree of polynomial for a polynomial fit. So the idea is to try different values in your model, different variations, measure them all using k-fold cross-validation, and find the one that minimizes error against your test dataset. And that's kind of your sweet spot there. So in practice, you want to use k-fold cross-validation to measure the accuracy of your model against a test dataset. And just keep refining that model. Keep trying different values within it, keep trying different variations of that model or maybe even different models entirely, until you find the technique that reduces the error the most using k-fold cross-validation. Let's go dive into an example and see how it works. We're going to apply this to our Iris dataset again, revisiting SVC.
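Before handing the job to scikit-learn, it can help to see the bucket rotation written out by hand. This is only a minimal sketch: `k_fold_indices`, `cross_validate`, the stand-in "model" that just predicts the training mean, and the toy data are all hypothetical names and values made up for illustration. Real implementations typically also shuffle or stratify the data before splitting, which this sketch does not do.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous buckets."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_validate(evaluate, n_samples, k):
    """Score each bucket as the test set once, then average the scores."""
    scores = [evaluate(train, test)
              for train, test in k_fold_indices(n_samples, k)]
    return scores, sum(scores) / len(scores)


# Toy data, made up for illustration.
data = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]


def evaluate(train, test):
    # Stand-in "model": predict the mean of the training values.
    prediction = sum(data[i] for i in train) / len(train)
    # Stand-in score: mean absolute error on the held-out bucket.
    return sum(abs(data[i] - prediction) for i in test) / len(test)


scores, average = cross_validate(evaluate, len(data), k=5)
print(scores, average)
```

Every sample lands in the test bucket exactly once and in the training data the other k minus 1 times, which is why the averaged score is less sensitive to any one lucky or unlucky split.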
And we'll play with k-fold cross-validation and see how simple it is. Let's actually put k-fold cross-validation and train/test into practice here using some real Python code. You'll see it's actually very easy to use, which is a good thing, because this is a technique you should be using to measure the accuracy, the effectiveness, of your models in supervised learning. So go ahead and open up the k-fold cross-validation notebook and follow along if you will. And we're going to look at the Iris dataset again. Again, remember we introduced this when we talked about dimensionality reduction. And just to refresh your memory, the Iris dataset contains a set of 150 iris flower measurements. Each flower has a length and width of its petal and a length and width of its sepal. And then we also know which one of three different species of iris each flower belongs to. So the challenge here is to create a model that can successfully predict the species of an iris flower just given the length and width of its petal and sepal. Okay. So let's go ahead and do that. We're going to use the SVC model. If you remember back again, that's just a way of classifying data that's pretty robust. There's a lecture on that if you need to go refresh your memory. So what we're going to do is use the cross-validation library from scikit-learn. And we're going to start by just doing a conventional train/test split, just a single train/test split, and see how that will work. To do that, we have a train_test_split function that makes it pretty easy. So the way this works is we can feed train_test_split a set of feature data. iris.data just contains all the actual measurements of each flower. And iris.target is basically the thing we're trying to predict. So in this case, it contains all of the species for each flower. And test_size here says what percentage we want to train versus test.
So 0.4 means we're going to extract 40% of that data randomly for testing purposes and use 60% for training purposes. And what this gives us back is four datasets: basically a training dataset and a test dataset for both the feature data and the target data. So X_train ends up containing 60% of our iris measurements, and X_test contains the 40% of the measurements used for testing the results of our model. And y_train and y_test contain the actual species for each one of those segments. So we'll go ahead and build an SVC model for predicting iris species given their measurements here. And you'll see that we're building that using only the training data. So we're going to fit this SVC model using a linear kernel, using only the training feature data and the training species data, the target data. And we're going to call that model clf. Now we can call the score function on clf to just measure its performance against our test dataset. So we're going to score this model against the test data we reserved for the iris measurements and the test iris species and see how well it does. And it turns out it does really well. Over 96% of the time, our model is able to correctly predict the species of an iris that it had never seen before, just based on the measurements of the iris. So that's pretty cool. But this is a fairly small dataset, 150 flowers if I remember right, so we're only using 60% of 150 flowers for training and only 40% of 150 flowers for testing. These are still fairly small numbers, so we could still be overfitting to the specific train/test split that we made. So let's use k-fold cross-validation to protect against that. And it turns out that using k-fold cross-validation, even though it's a more robust technique, is actually even easier to use than train/test. So that's pretty cool. So let's see how that works. We have a model already, the SVC model that we defined for this prediction.
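For reference, the single train/test split just described might look roughly like this. It's a sketch, not the exact notebook code: it assumes a recent scikit-learn, where train_test_split lives in sklearn.model_selection, and the random_state and C=1 values are assumptions added only so the run is repeatable.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Reserve 40% of the flowers for testing and train on the other 60%.
# random_state is an assumption, used only to make the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

# Fit a linear-kernel SVC using only the training data
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)

# Score it against the flowers it has never seen
accuracy = clf.score(X_test, y_test)
print(accuracy)
```

The exact accuracy depends on which flowers land in the test set, which is the weakness k-fold cross-validation is meant to address.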
And all you need to do is call cross_val_score from the cross-validation package. So you pass it a model of a given type and the entire dataset that you have. So this is all of my feature data and all of my target data, all of the measurements, all the species. And we're going to say I want a cross-validation fold count of five. And that means it's actually going to use five different training datasets, okay, each time reserving one fold for testing. Basically, it's going to run it five times. And that's all we need to do. That will automatically evaluate our model against the entire dataset, split up five different ways, and give us back the individual results. So if we print the output of that, it gives us back a list of the actual score from each one of those iterations, each one of those folds. And we can average those together to get an overall score based on k-fold cross-validation. And when we do this over five folds, we can see that our results are even better than we thought: 98 percent accuracy. So that's pretty cool. In fact, in a couple of runs we had perfect accuracy. So pretty amazing stuff. So let's see if we can do even better. We're using a linear kernel here. What if we used a polynomial kernel and got even fancier? Will that be overfitting, or will it actually better fit the data that we have? It kind of depends on whether there's actually a linear relationship or a polynomial relationship between these petal measurements and the actual species. So let's try that out. We'll just run this all again using the same technique, but this time we're going to use a polynomial kernel. We'll do the same thing and fit that to our training dataset. And it doesn't really matter what we fit it to in this case, because cross_val_score will just keep re-running it for you. And it turns out that when we use a polynomial fit, we end up with an overall score that's even lower than our original run.
So this tells us that the polynomial kernel is probably overfitting. When we use k-fold cross-validation, it reveals an actually lower score than with our linear kernel. And the important point here is that if we had just used a single train/test split, we wouldn't have realized that. We would have actually gotten the same result with a single train/test split here as we did on the linear kernel. So we might inadvertently be overfitting our data there and not have even known it had we not used k-fold cross-validation. So it's a good example of where k-fold comes to the rescue and warns you of overfitting where a single train/test split might not have caught it. So keep that in your tool chest. If you want to play around with this some more, go ahead and try different degrees. You can actually specify a different number of degrees. The default is three degrees for the polynomial kernel, but you can try a different one. You can try two. Does that do better? If you go down to one, that degrades basically to a linear kernel, right? So maybe there is still a polynomial relationship, and maybe it's only a second-degree polynomial. So go find out. Try it out and see what you get back. So go play around with that. That's k-fold cross-validation. As you can see, it's very easy to use thanks to scikit-learn. So use it. It's an important way to measure how good your model is in a very robust manner.
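If you want a starting point for that experiment, here's a minimal sketch that compares kernels with five-fold cross-validation. It assumes a recent scikit-learn (cross_val_score imported from sklearn.model_selection); the helper name mean_cv_accuracy and the C=1 setting are assumptions. The numbers you get may not match the ones quoted in the lecture, since SVC defaults have changed between scikit-learn versions.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()

def mean_cv_accuracy(kernel, folds=5, **svc_params):
    """Return the per-fold accuracies and their mean for an SVC with the given kernel."""
    clf = svm.SVC(kernel=kernel, C=1, **svc_params)
    # cross_val_score refits the model once per fold on the other folds
    scores = cross_val_score(clf, iris.data, iris.target, cv=folds)
    return scores, scores.mean()

linear_scores, linear_mean = mean_cv_accuracy('linear')
poly3_scores, poly3_mean = mean_cv_accuracy('poly')            # default degree is 3
poly2_scores, poly2_mean = mean_cv_accuracy('poly', degree=2)  # second-degree polynomial

print(linear_mean, poly3_mean, poly2_mean)
```

Comparing the three means is the whole exercise: whichever kernel holds up best across all five folds is the one less likely to be overfitting.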
60. Data Cleaning and Normalization: Now, this is one of the simplest, and yet it might be the most important, lectures in this whole course. We're gonna talk about cleaning your input data, which you're going to spend a lot of your time doing. How well you clean your input data and understand your raw input data is going to have a huge impact on the quality of your results, maybe even more so than what model you choose or how well you tune your models. So pay attention. This is important stuff. So let's talk about an inconvenient truth of data science, and that's that you spend most of your time actually just cleaning and preparing your data, and relatively little of it analyzing it and trying out new algorithms. So it's not quite as glamorous as people might make it out to be all the time. But this is an extremely important thing to pay attention to. There are a lot of different things that you might find in raw data. Data that comes in to you, just raw data, is going to be very dirty. It's going to be polluted in many different ways. And if you don't deal with it, it's going to skew your results, and it will ultimately end up in your business making the wrong decisions. And if it ever comes back that you made a mistake, where you ingested a bunch of bad data and didn't account for it, didn't clean that data up, and what you told your business was to do something based on those results that later turns out to be completely wrong, you're gonna be in a lot of trouble. So pay attention. There are a lot of different kinds of problems in data that you need to watch out for. One is outliers. So maybe you have, you know, people that are behaving kind of strangely in your data, and when you dig into them, they turn out to be data you shouldn't be looking at in the first place. A good example would be if you're looking at web log data and you see one session ID
that keeps coming back over and over and over again, that keeps doing something at a ridiculous rate that a human could never do. What you're probably seeing there is a robot, you know, a script that's being run somewhere to actually scrape your website, or it might even be some sort of malicious attack. But at any rate, you don't want that behavior data informing your models that are meant to predict the behavior of real human beings using your website. So watching for outliers is one way to identify types of data that you might want to strip out of your model when you're building it. Missing data: what do you do when data is just not there? Going back to the example of a web log, you might have a referrer in that line, or you might not. What do you do if it's not there? Do you create a new classification for missing or not specified? Or do you throw that line out entirely? You have to think about what the right thing to do is there. Malicious data, like we talked about: there might be people trying to game your system. There might be people trying to cheat the system, and you don't want those people getting away with it. Let's say you're making a recommender system. There could be people out there trying to fabricate behavior data in order to promote their new item, right? So you need to be on the lookout for that sort of thing, and make sure that you're identifying these shilling attacks or other types of attacks on your input data and filtering them out from the results. Don't let them win. Erroneous data: what if there's a software bug somewhere in some system that's just writing out the wrong values in some set of situations? It can happen. Unfortunately, there's no good way for you to know about that, but if you see data that just looks fishy or the results don't make sense to you, digging in deeply enough can sometimes uncover an underlying bug that's causing the wrong data to be written in the first place.
Maybe things aren't being combined properly at some point. Maybe sessions aren't being held throughout the entire session. People might be dropping their session ID and getting new session IDs as they go through a website, for example. Irrelevant data: a very simple one here. Maybe you're only interested in data from New York City people or something, for some reason. In that case, all the data from people from the rest of the world is irrelevant to what you're trying to find out, and the first thing you want to do is just throw all that data away and restrict your data, whittle it down to the data that you actually care about. Inconsistent data: this is a huge problem. For example, in addresses, people can write the same address in many different ways. They might abbreviate street, or they might not abbreviate street. They might not put street at the end of the street name at all. They might combine lines together in different ways. They might spell things differently. They might use a zip code in the US or a zip plus four code in the US. They might have a country on it. They might not have a country on it. You need to somehow figure out what the variations are that you see and how you can normalize them all together. Maybe I'm looking at data about movies, and a movie might have different names in different countries, or a book might have different names in different countries, but they mean the same thing. So you need to look out for these things where you need to normalize your data, where the same data can be represented many different ways and you need to combine them together in order to get the correct results. Okay, formatting: that can also be an issue. Things can be inconsistently formatted. Take the example of dates. In the US we always do month, day, year. But in other countries they might do day, month, year, who knows? But you need to be aware of these formatting differences. Maybe phone numbers
have parentheses around the area code. Maybe they don't. Maybe they have dashes between each section of the numbers. Maybe they don't. Maybe Social Security numbers have dashes. Maybe they don't. These are all things that you need to watch out for, and you need to make sure that variations in formatting don't get treated as different entities or different classifications during your processing. So lots of things to watch out for, and those are just the main ones to be aware of. Okay, remember: garbage in, garbage out. Your model is only as good as the data that you give to it, and this is extremely, extremely true. You know, you could have a very simple model that performs very well if you give it a large amount of clean data, and it could actually outperform a complex model on a dirtier data set. So making sure that you have enough data, and high-quality data, is often most of the battle. You'd be surprised how simple some of the most successful algorithms used in the real world are. They're only successful by virtue of the quality of the data going into them and the amount of data going into them. You don't always need fancy techniques to get good results. Often, the quality and quantity of your data count just as much as anything else. And always question your results. You don't want to go back and look for anomalies in your input data only when you get a result that you don't like. That will introduce an unintentional bias into your results, where you're letting results that you like or expect go through unquestioned, right? You want to question things all the time to make sure that you're always looking out for these things, because even if you find a result you like, if it turns out to be wrong, it's still wrong. It's still going to be informing your company in the wrong direction, and that could come back to bite you later on. So as an example, I have a website called No Hate News.
It's nonprofit, so I'm not trying to make any money by telling you about it. But let's say I just want to find the most popular pages on this website that I own. That sounds like a pretty simple problem, doesn't it? I should just be able to go through my web logs, count up how many hits each page has, and sort them, right? How hard can it be? Well, it turns out it's really hard. So let's dive into this example and see why, and see some examples of real-world data cleanup that has to happen. So let's see just how important data cleaning can be. We have a very simple task ahead of us: find the top viewed pages on a very small website. How hard can it be? Well, we'll dive into that next and see just how hard it is.
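Before moving on, here's a small sketch of the formatting problem mentioned earlier in this lecture, phone numbers and dates written several different ways. The function names, the accepted date formats, and the sample values are all assumptions made for illustration. Note that a string like 03/04/2015 can't be resolved automatically; you have to know whether the source writes month first or day first.

```python
import re
from datetime import datetime

def normalize_us_phone(raw):
    """Reduce a US phone number to its 10 digits, or return None if it doesn't fit."""
    digits = re.sub(r"\D", "", raw)          # drop parentheses, dashes, spaces
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                  # drop a leading country code
    return digits if len(digits) == 10 else None

def normalize_date(raw, formats=("%m/%d/%Y", "%Y-%m-%d")):
    """Try each known format in order and return an ISO date string, or None.

    The format list is an assumption about the data source: here, US-style
    month/day/year is tried first.
    """
    for fmt in formats:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None

# Hypothetical inputs: the same entity written different ways
print(normalize_us_phone("(407) 555-0123"), normalize_us_phone("407-555-0123"))
print(normalize_date("11/29/2015"), normalize_date("2015-11-29"))
```

Values that come back as None are the ones to look at by hand before deciding whether to drop them.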
61. Cleaning Web Log Data: So we're going to show the importance of cleaning your data. I have some web log data from a little website that I own, and we're just going to try to find the top viewed pages on that website. Sounds pretty simple, but as you'll see, it's actually quite challenging. So let's walk through a simple example. Actually, it's not so simple, where I just want to figure out the top viewed web pages on my website. Sounds pretty easy, doesn't it? Well, let's see. So if you want to follow along, the top pages IPython notebook is the one that we're working from here. And let's start. So I actually have an access log that I took from my actual website. It's a real HTTP access log from Apache, and that is in your course materials. So I went and got this little snippet of code off of the Internet that will parse an Apache access log line into a bunch of fields. So it contains things like the host and the user and the time and the actual page request and status, and the referrer, and the user agent, meaning what browser was actually used to view this page. So this builds up what's called a regular expression, and we're using the re library to use it. And that's basically a very powerful language for doing pattern matching on a large string. So by using this regular expression, we can actually apply that to each line of our access log and automatically group the bits of information in that access log line into these different fields. Okay, so if you want to play along here, make sure you update the path to the access log to wherever you saved the course materials for this course, and let's go ahead and run this. All right, so we have a path to our data file. So the obvious thing to do here: let's just whip up a little script that counts up each URL that we encounter that was requested and keeps count of how many times it was requested. Then we can sort that list and get our top pages, right? Sounds simple enough.
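To give a feel for what that parsing step does, here's a simplified sketch. This is not the snippet from the notebook; the pattern below is my own approximation of Apache's combined log format, and the sample log line is made up.

```python
import re

# A simplified pattern for Apache's combined log format (an approximation,
# not the regular expression used in the course notebook).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# A made-up access log line for illustration
sample = ('203.0.113.7 - - [29/Nov/2015:03:50:05 +0000] '
          '"GET /orlando-headlines/ HTTP/1.1" 200 10479 "-" '
          '"Mozilla/5.0 (Windows NT 6.1; rv:42.0) Gecko/20100101 Firefox/42.0"')

match = LOG_PATTERN.match(sample)
# groupdict() turns the named groups into a field-name -> value dictionary
fields = match.groupdict() if match else {}
print(fields.get("request"), fields.get("status"), fields.get("user_agent"))
```

Lines that don't fit the pattern come back as no match at all, which is already a first, crude filter on malformed input.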
So we're going to construct a little Python dictionary here called URL counts, and we're going to open up our log file, and for each line, we're going to apply our regular expression. And if it actually comes back with a successful match for the pattern that we're trying to match, we'll say, OK, this looks like a decent line in our access log. Let's extract the request field out of it, which is the actual HTTP request, what page is actually being requested by the browser, and split that up into its three components. It consists of an action like GET or POST, the actual URL being requested, and the protocol being used. So given that information split out, we can then just see if that URL already exists in my dictionary. If so, I will increment the count of how many times that URL has been encountered by one. Otherwise, I'll introduce a new dictionary entry for that URL and initialize it to the value of one. I do that for every line in the log, sort the results in reverse order numerically, and print it out. So let's go ahead and run that. Oops. We end up with this big old error here, and it's telling us that we need more than one value to unpack. So apparently we're getting some request fields that don't contain an action, a URL, and a protocol. They contain something else. Let's see what's going on there. So if we print out all the requests that don't contain three items, we can see what's actually showing up here. So what we're going to do here is a similar little snippet of code, but we're going to actually do that split on the request field and print out cases where we don't get the expected three fields and see what's actually in there. I mean, so, a bunch of empty fields, that's our first problem. But then we have this field that's full of just garbage. You know, who knows where that came from. It is clearly erroneous data, so okay, fine.
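Here's a minimal sketch of that counting step, written so that requests which don't split into three fields are set aside instead of triggering the unpacking error described above. It uses collections.Counter rather than a hand-rolled dictionary, which is my substitution, and the request strings are made up rather than read from the real log.

```python
from collections import Counter

# Hypothetical request fields, as they might come out of the parsed log
requests = [
    "GET /blog/ HTTP/1.1",
    "GET / HTTP/1.1",
    "",                               # empty request field
    "POST /xmlrpc.php HTTP/1.0",
    "GET / HTTP/1.1",
    "\\x16\\x03\\x01",                # garbage that isn't an HTTP request at all
]

url_counts = Counter()
malformed = []

for request in requests:
    fields = request.split()
    if len(fields) != 3:
        # Not "action URL protocol": keep it aside so we can inspect it
        malformed.append(request)
        continue
    action, url, protocol = fields
    url_counts[url] += 1

top_pages = url_counts.most_common()   # sorted by count, highest first
print(top_pages)
print(malformed)
```

Keeping the malformed requests in their own list, rather than silently skipping them, makes it easy to check that what you're discarding really is junk.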
Let's modify our script. We'll actually just throw out any lines that don't have the expected three fields in the request. And that seems like a legitimate thing to do, because this does in fact have completely useless data inside of it. It's not like we're missing out on anything here by doing that. So we'll modify our script to do that. We've introduced this if len(fields) == 3 line before it actually tries to process it, and we'll run that. Hey, we got a result, but this doesn't really look like the top pages on my website. Remember, this is a news site. So we're getting a bunch of PHP file hits. Those are, you know, scripts. What's going on there? Our top result is this xmlrpc.php script, and then wp-login.php, followed by the home page. So not very useful. Then robots.txt and a bunch of XML files. You know, when I looked into this later on, it turned out that my site was actually under a malicious attack. Someone was trying to break into it, and this XML-RPC script was the way they were trying to guess at my passwords, and they were trying to log in using the login script. Fortunately, I shut them down before they could actually get through to this website. But this was an example of malicious data being introduced into my data stream that I have to filter out. So by looking at that, we can see that not only was that malicious attack looking at PHP files, but it was also trying to execute stuff. So it wasn't just doing a GET request. It was doing a POST request on the script to actually try to execute code on my website. Now, I know that the data that I care about, in the spirit of the thing I'm trying to figure out, is people getting web pages from my website. So a legitimate thing for me to do is to filter anything that's not a GET request out of these logs. Let's do that next. So we're going to check again if we have three fields in our request field, and then we're also going to check if the action is GET.
And if it's not, we're just going to ignore that line entirely. So we should be getting closer to what we want now. And, yeah, this is starting to look more reasonable, but it still doesn't really pass a sanity check. This is a news website. People go to it to read news. Are they really reading my little blog on it that just has a couple of articles? I don't think so. That seems a little bit fishy, so let's dive in a little bit and see who's actually looking at those blog pages. If you were to actually go into that file and examine it by hand, you would see that a lot of these blog requests don't actually have any user agent on them. They just have a user agent of dash, which is highly unusual, because if a real human being with a real browser was trying to get this page, it would say something like Mozilla or Internet Explorer or Chrome or something like that, right? So it seems that these requests are coming from some sort of a scraper, again, potentially malicious traffic that's not identifying who it is. So OK, maybe we should be looking at the user agents too, to see if these are actual humans making requests or not. Let's go ahead and print out all the different user agents that we're encountering. So in the same spirit as the code that summed up the different URLs we were seeing, we can look at all the different user agents that we were seeing and sort them by the most popular user agent strings in this log. And you can see most of it looks legitimate. So if it's a scraper, and in this case it actually was a malicious attack, they were actually pretending to be a legitimate browser. But this dash user agent shows up a lot too, so I don't know what that is, but I know that it's not an actual browser. A real browser, you know, would look something more like this. The other thing I'm seeing is a bunch of traffic from spiders, from web crawlers. So Baidu is a search engine in China.
Googlebot, you'll notice, is crawling the page. You know, I think I saw Yandex in here too, a Russian search engine. So our data is being polluted by a lot of crawlers that are just trying to mine our website for search engine purposes. And again, that traffic shouldn't count toward the intended purpose of my analysis, which is seeing what pages actual human beings are looking at on my website. These are all automated scripts. All right, so this gets a little bit tricky. There's no really good way of identifying spiders or robots just based on the user agent string alone, but we can at least take a legitimate crack at it and filter out anything that has the word bot in it, or anything from my caching plugin that might be requesting pages in advance as well. And we'll also strip out our friend, the single dash. So we will once again refine our script to, in addition to everything else, strip out any user agents that look fishy, and see what we get. All right, so here we go. This is starting to look more reasonable for the first two entries. The home page is most popular, which would be expected. Orlando headlines is also popular, because I use this website more than anybody else, and I live in Orlando. But then we've got a bunch of stuff that isn't web pages at all: a bunch of scripts, a bunch of CSS files. Those aren't web pages. So again, I can just apply some knowledge about my site, where I happen to know that all of the legitimate pages on my site just end with a slash in their URL. So let's go ahead and modify this again to strip out anything that doesn't end with a slash. Finally, we're getting some results that seem to make sense. So it looks like the top page requested by actual human beings on my little No Hate News site is the home page, followed by Orlando headlines, followed by world news, followed by the comics, then the weather and the about screen. So this is starting to look more legitimate.
If you were to dig even deeper, though, you'd see that there are still problems with this analysis. For example, those feed pages are still coming from robots just trying to get RSS data from my website. So this is a great parable on how a seemingly simple analysis requires a huge amount of preprocessing and cleaning of the source data before you get results that make any sense. And again, make sure the things you're doing to clean your data along the way are principled, and that you're not just cherry-picking problems that don't match your preconceived notions. So always question your results. Always look at your source data and look for weird things that are in it. All right. If you want to mess with this some more, you can solve that feed problem. Go ahead and strip out things that include feed, because we know that's not a real web page, just to get some familiarity with the code. Or go look at the log a little bit more closely and gain some understanding as to where those feed pages are actually coming from. Maybe there's an even better and more robust way of identifying that traffic as a larger class. So feel free to mess around with that. But I hope you learned your lesson: data cleaning is hugely important, and it's going to take a lot of your time. So it's pretty surprising how hard it was to get some reasonable results on a simple question like, what are the top viewed pages on my website? And you can imagine, if that much work had to go into cleaning the data for such a simple problem, think about all the nuanced ways that dirty data might actually impact the results of more complex problems and complex algorithms. It's very important to understand your source data. Look at it. Look at a representative sample of it. Make sure you understand what's coming into your system, and always question your results and tie them back to the original source data to see where questionable results are coming from.
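Putting all of those filters together, the cleanup might be sketched like this. It's an illustration rather than the notebook code: the function name human_page_url, the sample entries, and the choice to match "bot" and "spider" case-insensitively are assumptions, and the caching plugin's user agent is left out because the transcript doesn't say what it is. The feed filter is the optional activity from the end of the lecture.

```python
from collections import Counter

# Hypothetical (request, user_agent) pairs standing in for parsed log lines
entries = [
    ("GET / HTTP/1.1", "Mozilla/5.0 (Windows NT 6.1) Firefox/42.0"),
    ("GET /orlando-headlines/ HTTP/1.1", "Mozilla/5.0 (Macintosh) Safari/601.1"),
    ("GET / HTTP/1.1", "Mozilla/5.0 (X11; Linux x86_64) Chrome/46.0"),
    ("POST /xmlrpc.php HTTP/1.0", "Mozilla/4.0 (compatible)"),               # not a GET
    ("GET /blog/ HTTP/1.1", "-"),                                            # no user agent
    ("GET /robots.txt HTTP/1.1", "Mozilla/5.0 (compatible; Googlebot/2.1)"), # crawler
    ("GET /style.css HTTP/1.1", "Mozilla/5.0 (Windows NT 6.1) Firefox/42.0"),# not a page
    ("GET /feed/ HTTP/1.1", "Mozilla/5.0 (Windows NT 6.1) Firefox/42.0"),    # RSS feed
    ("", "-"),                                                               # malformed
]

def human_page_url(request, user_agent, exclude_feeds=True):
    """Return the URL if this hit looks like a person viewing a page, else None."""
    fields = request.split()
    if len(fields) != 3:
        return None                      # malformed request
    action, url, _protocol = fields
    if action != "GET":
        return None                      # we only care about page fetches
    agent = user_agent.lower()
    if user_agent == "-" or "bot" in agent or "spider" in agent:
        return None                      # missing agent or self-identified crawler
    if not url.endswith("/"):
        return None                      # site-specific rule: real pages end in a slash
    if exclude_feeds and "feed" in url:
        return None                      # optional: drop RSS feed requests
    return url

url_counts = Counter()
for request, user_agent in entries:
    url = human_page_url(request, user_agent)
    if url is not None:
        url_counts[url] += 1

top_pages = url_counts.most_common()
print(top_pages)
```

Each rule here encodes a decision made for one particular site, so it's worth writing down why each one exists before reusing it anywhere else.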
62. Normalizing Numerical Data: This is a very quick lecture. I just want to remind you that sometimes you need to normalize or whiten your data going into an algorithm. So just keep that in the back of your head, because sometimes it will affect the quality of your results if you don't. So, real quick lecture here. I just want to remind you about the importance, sometimes, of normalizing your data, making sure that your various input feature data is on the same scale and is comparable. Sometimes it matters, and sometimes it doesn't, but you just have to be cognizant of when it does. So sometimes models will be based on several different numerical attributes. Remember, like multivariate models. You know, we might have different attributes of a car that we're looking at, and they might not be directly comparable measurements. Or, for example, if we're looking at relationships between ages and incomes, ages might range from 0 to 100, but incomes in dollars might range from zero to billions, and depending on the currency, it could be an even larger range. Some models are okay with that. If you're doing a regression, usually that's not a big deal. But other models don't perform so well unless those values are scaled down first to a common scale. So if you're not careful, you can end up with some attributes counting more than others. Maybe the income would end up counting much more than the age if you were trying to treat those two values as comparable values in your model. There can also be a bias in the attributes, and that can be a problem too. So maybe one set of your data is skewed. Sometimes you need to normalize things against the actual range seen for that set of values, and not just to a zero to whatever-the-maximum-is scale. And there's no set rule as to when you should or shouldn't do this sort of normalization. All I can say is, always read the documentation for whatever technique you're using.
So, for example, in scikit-learn, the PCA implementation has a whiten option that will automatically normalize your data for you. You should probably use that. It also has some preprocessing modules available that will normalize and scale things for you automatically as well. Be aware, too, of textual data that should actually be represented numerically or ordinally. So if you have yes or no data, you might need to convert that to one or zero, and do that in a consistent manner. So again, just read the documentation. Most techniques do work fine with raw, unnormalized data. But before you start using a new technique for the first time, just read the documentation and understand whether or not the inputs should be scaled or normalized or whitened first. If so, scikit-learn will probably make it very easy for you to do so. You just have to remember to do it. Don't forget to rescale your results when you're done, if you are scaling the input data. If you want to be able to interpret the results you get, sometimes you need to scale them back up to their original range after you're done. So if you are scaling things, and maybe even biasing them toward a certain amount before you input them into a model, make sure that you unscale them and unbias them before you actually present those results to somebody, or else they won't make any sense. Okay, so just a little reminder, a little bit of a parable if you will: always check to see if you should normalize or whiten your data before you pass it into a given model. So there's no exercise associated with this lecture. It's just something I want you to remember. I'm just trying to drive the point home. Some algorithms require whitening or normalization. Some don't. So always read the documentation. If you do need to normalize the data going into an algorithm, it will usually tell you so, and it will make it very easy to do so. So just be aware of that.
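There's no notebook for this lecture, but as a rough sketch of the idea, here's one way to put ages and incomes on a common scale with scikit-learn's StandardScaler and then scale them back afterward. The sample values are made up, and StandardScaler is just one of several preprocessing options; whether a given model needs it is still something to check in that model's documentation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical people: [age in years, income in dollars]
data = np.array([
    [25.0,   30000.0],
    [40.0,   85000.0],
    [58.0, 1200000.0],
    [33.0,   52000.0],
])

scaler = StandardScaler()
scaled = scaler.fit_transform(data)           # each column now has mean 0, std 1
restored = scaler.inverse_transform(scaled)   # back to years and dollars

# Yes/no text data converted to 1/0 in a consistent way (sample answers are made up)
answers = ["yes", "no", "Yes ", "NO"]
encoded = [1 if a.strip().lower() == "yes" else 0 for a in answers]

print(scaled)
print(restored)
print(encoded)
```

Keeping the fitted scaler object around is what makes the "scale it back" step possible, since it remembers the mean and spread of the original columns.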
63. Detecting Outliers: Sometimes your real-world data will contain outliers, and they might be legitimate outliers. They might be caused by real people and not by some sort of malicious traffic or fake data, and you'll have to decide how you're going to deal with them. So sometimes it's appropriate to remove them. Sometimes it isn't. Make sure you make that decision responsibly. So, for example, if I'm doing collaborative filtering and I'm trying to make movie recommendations or something like that, you might have a few power users that have watched every movie ever made and rated every movie ever made. And they could end up having an inordinate influence on the recommendations for everybody else. You don't really want a handful of people to have that much power in your system. So that might be an example of where it would be a legitimate thing to filter out an outlier, and identify them by how many ratings they've actually put into the system. Or maybe an outlier would be someone who doesn't have enough ratings. Or we might be looking at web log data, like we saw in our example earlier where we were doing data cleaning. Outliers could be telling you that there's something very wrong with your data to begin with. It could be malicious traffic. It could be bots or other agents that should be discarded, that don't represent the actual human beings that you're trying to model. But if somebody really wanted, say, the mean average income in the United States, and not the median, they specifically want the mean, you shouldn't just throw out the billionaires in the country just because you don't like them. The fact is, their billions of dollars are going to push that mean amount up, even if it doesn't budge the median much. So don't fudge your numbers by throwing out outliers, but do throw out outliers if they're not consistent with what you're trying to model in the first place. Now, how do we identify outliers? Well, remember our old friend standard deviation?
We covered that very early in this course. It's a very useful tool for detecting outliers in a very principled manner. You compute the standard deviation of a data set that should have a more or less normal distribution, and if you see a data point that's outside of one or two standard deviations, there you have an outlier. Remember, we talked about the box and whisker diagrams too, and those also have a built-in way of detecting and visualizing outliers. Those define outliers as lying more than 1.5 times the interquartile range beyond the quartiles. So what multiple do you choose? Well, you kind of have to use common sense. There's no hard and fast rule as to what is an outlier. You have to look at your data and kind of eyeball it. Look at the distribution. Look at the histogram. See if there are actual things that stick out to you as obvious outliers, and understand what they are before you just throw them away. So let's look at some example code and see how you might do that in practice. If you'd like to follow along, go on back to the outliers notebook in your course materials, and we're going to revisit our example of income distribution data. So as before, we're going to use np.random to create a normal distribution centered around $27,000 per year with a standard deviation of $15,000. We'll create 10,000 of these normally distributed incomes, and now, just to mess things up a little bit, we'll throw a billionaire into the mix. Maybe Jeff Bezos, whoever you want to imagine. We'll just append that to the list of the income data that we have. Now, you can see that this messes things up pretty quickly, even just trying to visualize this data. So as soon as we try to plot a histogram of this data, we see that all the so-called normal people that are making around $27,000 a year are down here in this big spike, and way out here, at $1 billion,
We have a single data point there that you can't even see, but it's already messed up our ability to even visualize this data. That one billionaire ended up squeezing everybody else into a single tiny line in our histogram, and it also skewed our mean quite significantly. So let's go ahead and run this to actually get that data in the system here. If you compute the mean, we'll see that it's not $27,000 at all. It's more like $127,000, just because of that one outlier. So it's very important in a case like this to dig into what is causing your outliers and understand where they're coming from. You want to make sure that if you're throwing data away, it's justifiable. It's based on something principled, right? So if the purpose of this analysis was really to understand the incomes of quote-unquote typical Americans, filtering out that handful of billionaires would seem like a legitimate thing to do. You just want to make sure you're transparent that you did that when you're presenting your data. Now there is something a little bit more robust you could do than just saying, if you're a billionaire, I'm going to throw you out. We could say, for example, that anything beyond two standard deviations of the median value in the data set will be defined as an outlier, and we can choose whatever multiple of the standard deviation we want there. So here's a little function that figures that out for us. I called it reject_outliers. It starts by computing the median of a data set and the standard deviation of that data set, and this little line of code here just checks whether we're more than two standard deviations below the median or more than two standard deviations above the median, and returns the filtered data set that throws all of those out. So then we set filtered equal to the result of calling reject_outliers on the income data, and that just applies that filter function to the entire data set and returns the filtered set in its place. We can then plot that filtered data set, and we'll see if it works.
Sure enough, it does. And the nice thing here is that it doesn't make me throw out a whole lot of data, right? So we still have a nice, clean little bell curve here, but our billionaire is gone without having to write any special logic that says, if you make more than $1,000,000,000, we're throwing you out. Instead, it's based on a multiple of standard deviations, which is a little bit more of a principled thing to do, and our mean will be, well, more meaningful now as well. Let's go ahead and run this before I forget. So now if we compute the mean of the filtered data set, that's back close to $27,000. Your results will be a little bit different because of the randomness involved. But we did successfully filter out our billionaire without having to hard-code a special case for them. We're just, in a very principled manner, rejecting outliers beyond two standard deviations of the median. That's a reasonable thing to do. So here's an activity, if you want to play with this some more: instead of a single billionaire outlier, add several randomly generated outliers to the data. You know, pick a range of values there and just throw them in. Experiment with different values of the multiple of the standard deviation used to identify those outliers, and see what effect it has on the final results. It's just an opportunity to get your hands on this and play around a little bit more directly, if you will. So give that a shot if you want to, and then we'll move on.
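For reference, here is a minimal sketch of the steps walked through in this lecture. It is a reconstruction from the spoken description, not a copy of the course notebook, so the exact variable names (incomes, filtered, multiple) and the fixed random seed are assumptions added for illustration.

```python
import numpy as np

np.random.seed(0)  # assumption: fixed seed so the run is repeatable

# 10,000 incomes centered on $27,000 with a $15,000 standard deviation,
# plus a single billionaire appended to the end.
incomes = np.random.normal(27000, 15000, 10000)
incomes = np.append(incomes, [1_000_000_000])

def reject_outliers(data, multiple=2):
    """Keep only values within `multiple` standard deviations of the median."""
    median = np.median(data)
    std = np.std(data)
    keep = (data > median - multiple * std) & (data < median + multiple * std)
    return data[keep]

filtered = reject_outliers(incomes)

print("mean with billionaire:   ", np.mean(incomes))
print("mean after filtering:    ", np.mean(filtered))
print("data points thrown away: ", len(incomes) - len(filtered))
```

Note that the billionaire inflates the standard deviation enormously, so in this particular data set the two-standard-deviation band ends up very wide and only the billionaire falls outside it. With several milder outliers, as in the activity, the choice of multiple matters a lot more.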
64. Feature Engineering and the Curse of Dimensionality: Let's dive into the world of feature engineering in machine learning. What is feature engineering anyway? Well, basically it's the process of applying what you know about your data to sort of trim down the features that you're using, or maybe create new features, or transform the features you have. What do I mean by features? Well, those are the attributes of your training data, the things that you're training your model with. Let's take an example. So let's say we're trying to predict how much money people make based on various attributes of the people. So your features in that case might be the age of a person, their height, their weight, their address, what kind of car they drive, any number of things, right? Some of those things are going to be relevant to the thing you're trying to predict and some of them won't be. So the process of feature engineering is, in part, just selecting which features are important to what I'm trying to predict and choosing those features wisely. A lot of times you need to transform those features in some way as well. Maybe the raw data isn't useful for the specific model you're using. Maybe things need to be normalized or scaled in some way, or encoded in some specific way. Often you'll have things like missing data. In the real world, often you do not have complete data for every single data point, and the way that you choose to deal with that can very much influence the quality of the resulting model that you have. Also, sometimes you want to create new features from the existing ones that you have. Perhaps the numerical trends in the data that you have for a given feature are better represented by taking the log of it, or the square of it, or something like that. Or maybe you're better off taking several features and combining them mathematically into one to reduce your dimensionality. This is all what feature engineering is about.
You can't just take all the data you have and throw it into this big machine-learning hopper and expect good things to come out the other end. This is really the art of machine learning. This is where your expertise is applied to actually get good results out of it. It's not just a mechanical process where you follow these steps, take all the data you have, throw it into this algorithm and see what predictions you make. That's what separates the good machine learning practitioners from the bad ones. The ones that can actually do feature engineering are the ones that are the most successful and most valuable in the job marketplace, of course. And this isn't stuff that's generally taught, right? This is largely stuff that is learned through experience, by actually being out there in the real world and practicing machine learning. Why is feature engineering important in the first place? Well, it's about the curse of dimensionality. What do we mean by that? Well, like I said, you can't just throw every feature you have into the machine and expect good things to happen. Too many features can actually be very problematic for a few different reasons. First is that it leads to sparse data. So again, come back to the example of trying to train a model on attributes of people. There are hundreds of attributes of a person you could come up with, right? Like we said, age, height, weight, what car do you drive? How much money do you make? Where do you live? Where did you go to college? The list goes on and on and on. And you can actually envision each person as a vector in the dimensional space of all these features. Okay, so stay with me here. Imagine, for example, that the only feature we have is a person's age. You could represent a person by a vector along a single age axis, right, going from 0 to 100 or whatever. Now we throw in another dimension, say their height.
We have another dimension, another axis, and we have this vector pointing to a spot that encodes both their age on one axis and their height on another, right? So now we have a two-dimensional vector. Throw in a third dimension there, say how much money they make. Now we have a vector in three dimensions where one dimension is their age, one dimension is their height, and one dimension is how much money they make. And as we keep adding more and more dimensions, the available space that we have to work with just keeps exploding, right? This is what we call the curse of dimensionality. So the more features you have, the larger the space that we have to find a solution within. And having a big space to try to find the right solution in just makes it a whole lot harder to find that optimal solution. So the more features you have, the more sparse your data becomes within that solution space, and the harder it is to actually find the best solution. So you're better off boiling those features down to the ones that matter the most. That will give you less sparse data and make it a lot easier to find the correct solution. Also, just from a performance standpoint, imagine trying to create a neural network that has inputs for every one of those features, encoded in whatever way it needs, right? This neural network would have to be massive, extremely wide at the bottom and probably extremely deep as well, to actually find all of the relationships between these many features. And it's just going to be ridiculously hard to get that to converge on anything. So a big part of success in machine learning is not just using the algorithm, not just cleaning your data, but also choosing the data that you're using in the first place. And that's what feature engineering is all about. Again, a lot of it comes down to domain knowledge and sort of using your common sense about what will and what won't work toward improving your model, and just experimenting with different things.
What has an effect and what doesn't? What helps, and what hurts things? So a lot of it's just going back and forth: does this feature help things? No? Okay, we won't use it. Does this feature help things? No? Okay, try something else. Now, you don't always have to guess. To be fair, there are some more principled ways of doing dimensionality reduction. One of them is called PCA, Principal Component Analysis. PCA is a way of taking all of those higher dimensions, all those different features that you have, and distilling them down into a smaller number of features, a smaller number of dimensions. And it tries to do this in a way that preserves information as well as possible. So, I mean, if you have enough computational power to actually use PCA on a large set of features, that is a more principled way of distilling it down to the features that actually matter. And the features you end up with aren't actually things you can put a label on. They're just artificially created features that capture the essence of the features that you started with. K-means clustering is another way of doing this. What's nice is that these are both unsupervised techniques, so you don't need labeled training data for them. You can just throw the feature data you have into either of these algorithms and it will boil out, if you will, a smaller set of dimensions that will work just as well, or hopefully close to as well. But again, more features is not better. That leads to what we call the curse of dimensionality. And that's one of the main reasons that we want to do feature engineering, and one of the main things you're going to be doing in that process.
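To make the PCA idea a little more concrete, here is a small sketch using scikit-learn's PCA class. The data is synthetic and made up purely for illustration: ten features that are secretly driven by just two hidden factors plus a little noise, so two principal components should capture almost all of the variation.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data (an assumption for this example): 200 samples with 10
# features, all of which are mixtures of 2 hidden factors plus small noise.
hidden = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = hidden @ mixing + 0.05 * rng.normal(size=(200, 10))

# Distill the 10 original features down to 2 artificial ones.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("original shape:", X.shape)
print("reduced shape: ", X_reduced.shape)
print("fraction of variance kept:", pca.explained_variance_ratio_.sum())
```

The two columns of X_reduced don't correspond to any one of the original features, which is what was meant by features you can't put a label on.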
65. Imputation Techniques for Missing Data: So a big part of feature engineering is imputation of missing data. What do you do when your data has missing data elements in it? This is what happens in the real world. For every observation you have, there's going to be some missing data points, more than likely. Well, a simple solution is called mean replacement. The idea is that if you have a missing attribute or feature within one of the rows of your data, you just replace it with the mean from the entire column. And remember, we're talking about columns, not rows here. You want to take the mean of all the other observations of that same feature. It doesn't really make sense to take the mean of all the other features for that row, right? So mean replacement is all about taking the mean of that column and replacing all the empty values with that mean. So it's fast and it's easy. Those are some of the positives of this approach. It also doesn't affect the mean or the sample size of your overall dataset, because you're just replacing missing data with the mean. It won't affect the overall mean of the entire dataset, which can be nice. Now, one nuance is that if you have a lot of outliers in your dataset, which is also something you have to deal with when preparing your data, you might find that the median is actually a better choice than the mean. So say you have a dataset of a bunch of people and maybe one of those columns is income, and some people don't report their income because they think it's sensitive. You might have your mean skewed by a bunch of millionaires and billionaires in your dataset. So if you do mean imputation in that sort of situation where you have outliers, you might end up with an overly high or overly low value that you're using for replacement. So if you do have outliers that are skewing your mean, you might want to think about using the median instead. That'll be less sensitive to those outliers.
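Here is a quick sketch of mean versus median replacement using pandas. The tiny table of ages and incomes is invented for illustration, with one millionaire thrown in to show how an outlier drags the mean-based fill value around.

```python
import numpy as np
import pandas as pd

# Made-up data: two people did not report their income.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "income": [30000.0, np.nan, 52000.0, 1_000_000.0, np.nan],
})

# Replace missing values with the mean or median of the same column.
mean_filled = df["income"].fillna(df["income"].mean())
median_filled = df["income"].fillna(df["income"].median())

print("fill value using mean:  ", df["income"].mean())
print("fill value using median:", df["income"].median())
```

Here the mean-based fill value is pulled far above what most people in the table earn, while the median stays at 52,000.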
But generally speaking, mean replacement is not the best choice for imputation. First of all, it only works on the column level. So if there are correlations between other features in your dataset, it's not going to pick up on those. So if there is a relationship between, say, age and income, that relationship is going to be totally missed. So you can end up saying that a 10-year-old is making, you know, $50,000 a year because that's the mean of your dataset, but it really doesn't make sense, right? I mean, a 10-year-old wouldn't be making that much money yet. So it's a very naive approach from that standpoint. The other issue is that you can't really use it on categorical features. How do you take the mean of a categorical piece of data? That just doesn't make sense, right? Although you could use the most frequent value; using the most commonly seen category would be a reasonable thing to do in that case. It's sort of in the same spirit as mean replacement, but not really the same thing. Overall, though, it's not going to be a very accurate method. It's a very ham-handed attempt at doing imputation. So, although it's quick and easy and has some advantages in practice, if someone's asking you, say on a certification exam, what's the best way to do data imputation, mean replacement probably isn't it. It's also probably not just dropping the missing rows, although as we've seen, sometimes that's a reasonable thing to do. If you have enough data that dropping a few rows doesn't matter, and you don't have too many rows that contain missing data, well, it doesn't sound that unreasonable. The other thing, too, is that you want to make sure that dropping the rows that have missing data doesn't bias your dataset in some way. What if there's an actual relationship between which rows are missing data and some other attribute of those observations? For example, let's say that we're looking at income again.
There might be a situation where people that have very high or very low incomes are more likely to not report it. So by removing or dropping all of those observations, you're actually removing a lot of people that have very high or low incomes from your model. And that might have a very bad effect on the accuracy of the model you end up with. So you want to make sure that if you are going to drop data, it's not going to bias the dataset in some way as a byproduct, right? So dropping rows is a very quick and easy thing to do. Probably the quickest and easiest thing to do. You can literally do this in one line of code in Python, but it's probably never going to be the best approach. So again, if an exam is asking you what's the best way to deal with missing data, dropping data probably is not the right answer. Almost anything is going to be better. Maybe you could just substitute a similar field, right? I mean, that would also be a simple way of doing it. For example, I might have a dataset of customer reviews of movies, right? Maybe if I have a review summary and a full-text review as well, it would make more sense to just take the review summary and copy that into the full text for people who left the full text blank. As an example, almost anything is better than just dropping data. But in the real world, if you're just trying to do something quick and dirty and sort of start experimenting with some data, just to start playing with it, it can be a reasonable thing to do. I just wouldn't leave that in place for production, necessarily. The thing you probably really want to do in production is to use machine learning itself to impute the missing data that goes into your machine learning training. So it's kind of a meta thing. There are different ways of doing this. One is called KNN, which stands for k-nearest neighbors. And if you have any experience with machine learning, you probably know what that is already.
The general idea is to find the k, where k is some number, most similar rows to the one you're looking at that has missing data, and just average together the values from those most similar rows. So you can imagine having some sort of a distance metric between each row. Maybe it's just the Euclidean distance between the normalized features within each row, or something like that. And if you find the, say, 10 nearest rows that are most similar to the one that's missing data, you can just take the average of that feature from those ten most similar rows and impute the value from that. So that takes advantage of relationships between the other features of your dataset, which is a good thing. One problem with that, though, is that the idea assumes that you have numerical data that you're trying to impute and not categorical data. It's tough to take the average of a category, though there are ways of handling categories with techniques like Hamming distance. But KNN is generally a better fit for numerical data, not categorical. If you have categorical data, you're probably better served by actually developing a deep learning model. Neural networks are great at doing categorization problems. So the idea would be to actually build a machine learning model to impute the data for your machine learning model, right? It's kind of a cycle there. And that works really well for categorical data. It works really well; it's tough to beat deep learning these days. However, of course, it is complicated. There's a lot of code involved and a lot of tuning as well. But it's hard to beat the results if you actually have a deep learning model that tries to predict what a missing feature is based on the other features in your dataset. That's going to work out well. It's a lot of work and a lot of computational effort, but it's going to give you the best results. You can also just do a multiple regression on the other features that are in your dataset. That's also a totally reasonable thing to do.
And through regressions, you can find linear or non-linear relationships between your missing feature and the other features that are in your dataset. And there is a very advanced technique along these lines called MICE, which stands for Multiple Imputation by Chained Equations. It's kind of the state of the art in this space right now for imputing missing data. So, all right, and finally, probably the best way to deal with missing data is to just get more data. So if you have a bunch of rows that have missing data, maybe you just have to try harder to get more complete data from people. And it's hard to beat just getting more real data, so that you don't have to worry about all the rows that had missing data. Again, you want to be careful that if you are dropping data, you're not biasing your dataset in some way. But really, the best way to deal with not having enough data is to just get more of it. Sometimes you just have to go back and figure out where that data came from and collect more, better-quality data. The better the quality of the data you have going into your system, the better the results you will get. And while imputation techniques are a way to cover up issues where you just don't have enough data and you can't get more of it, it's always a good idea to just get more and better data if you can.
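As a rough sketch of the KNN idea described earlier in this lecture, scikit-learn ships a KNNImputer class that fills each missing value with the average of that feature over the nearest rows. The little age and income table below is made up for illustration, and k is set to 2 only because the table is so small. A real pipeline would normally scale the features first so that no single column dominates the distance.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up rows: [age, income in thousands]. The last person's income is missing.
X = np.array([
    [22.0, 25.0],
    [24.0, 28.0],
    [45.0, 80.0],
    [47.0, 85.0],
    [23.0, np.nan],
])

# Find the 2 most similar rows (judged by the features that are present)
# and average their income to fill in the gap.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled[4])
```

The 23-year-old's nearest neighbors are the 22- and 24-year-olds, so the imputed income is the average of theirs, 26.5, rather than the column mean, which the two older, higher earners would have pulled up to 54.5.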
66. Handling Unbalanced Data: Oversampling, Undersampling, and SMOTE: Another problem in the world of feature engineering is handling unbalanced data. What do we mean by that? Well, let's assume we have a large discrepancy between our positive and negative cases in our training data. A common example is in the world of fraud detection. Actual fraud is pretty rare, right? So most of your training data is going to contain training rows that are not fraudulent. This can lead to difficulty in actually building a model that can identify fraud, because it has so few data points to learn from compared to all of the non-fraud data points. So it's very easy for a model to say, okay, well, since fraud actually only happens like 0.01% of the time, I'm just going to predict that it's not fraud all the time. And hey, my accuracy is awesome now, right? So if you have an unbalanced dataset like that, you can end up in a situation where you have a machine learning model that looks like it has high accuracy, but it's just guessing "no" every time. And that's not helpful, right? So there are ways of dealing with this in feature engineering. Now, first of all, don't let the terminology confuse you. This is actually something that I got hung up on a lot at first. When I say positive and negative cases, I'm not talking about good and bad, so don't conflate positive and negative with a positive or a negative outcome. Positive simply means: is this the thing that I'm testing for? Is that what happened? So that might be fraud, right? If my model is trying to detect fraud, then fraud is the positive case, even though fraud is a very negative thing. Remember, positive is just the thing that you're trying to detect, whatever that is. So beat that into your head, because you'll get confused if you keep conflating positive and negative with moral judgments. That's not what it's about in this context. This is mainly a problem with neural networks, by the way.
So it is a real issue that if you have an unbalanced dataset like this, the model is probably not going to learn the right thing, and we have to deal with that somehow. What's one way of dealing with it? Oversampling is a simple solution. Just take samples from your minority class. In this example of fraud, just take more of those samples that are known to be fraud and copy them over and over and over again. Make an army of clones, if you will, of your fraudulent cases. And you can do that at random. You would think that that wouldn't actually help, but it does with a neural network. So that's a very simple thing you can do. Just fabricate more of your minority case by making copies of other samples from that minority case. The other way you can go is undersampling. So instead of creating more of your minority cases, remove some of the majority ones. In the case of fraud, we'd be talking about just removing some of those non-fraudulent cases to balance it out a little bit more. However, throwing data away is usually not the right answer. I mean, why would you ever want to do that? You're discarding information, right? So the only time when undersampling might make sense is if you're specifically trying to avoid some scaling issue with your training, right? Maybe you just have more data than you can handle on the hardware that you're given. And if you have too much data to actually process and handle, fine, throwing away some of the majority case might be a reasonable thing to do, but the better solution would be to get more computational power, right? And actually scale this out on a cluster or something. So undersampling is usually not the best approach. Something that's even better than undersampling or oversampling is something called SMOTE. And this is something you might see on an exam. It stands for Synthetic Minority Over-sampling TEchnique, kind of a creative acronym. What it does is artificially generate new samples of the minority class using nearest neighbors.
So just like we talked about using KNN for imputation, it's the same idea here. We're running k-nearest neighbors on each sample of the minority class, and then we create new samples from those KNN results by taking the mean of those neighbors. So instead of just, you know, naively making copies of other cases from the minority class, we're actually fabricating new ones based on averages from other samples, and fabricating them that way works pretty well. So it both generates new samples and undersamples the majority class, which is good. This is better than just oversampling by making copies, because it's actually fabricating new data points that still have some basis in reality. So remember, if you're dealing with unbalanced data, SMOTE is a very good choice. A simpler approach is just adjusting the thresholds when you're actually making inferences and applying your model to the data that you have. So when you're making predictions for a classification, say fraud or not fraud, you're going to have some sort of threshold probability at which you say, okay, this is probably fraud. Most machine learning models don't just output fraud or not fraud. They actually give you some sort of probability that it's fraud or not fraud. And you have to choose a threshold of probability at which you say, okay, this is probably fraud, it deserves some investigation. So if you have too many false positives, one way to fix that is to just increase that threshold, right? That is guaranteed to reduce your false positive rate, but it comes at the cost of more false negatives. So before you do something like this, you have to think about the impact that threshold will have. If I raise my threshold, that means I'm going to have fewer things that are flagged as fraud. That may mean that I miss some actual fraudulent transactions, but I'm not going to be bothering my customers as much, saying, hey, I flagged this as fraud, I shut down your credit card.
You might actually want the opposite effect, right? Maybe I want to be even more liberal in what I'm flagging as fraud, so I would lower that threshold to actually get more fraud cases flagged. And fraud might be a case where you're better off guessing wrong when it's not fraud than the other way around, right? So you need to think about the cost of a false positive versus a false negative and choose your thresholds accordingly.
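Two of the ideas from this lecture are easy to sketch with plain NumPy: random oversampling of the minority class, and moving the decision threshold. Everything below is made up for illustration, including the feature matrix, the labels, the predicted fraud probabilities, and the helper name count_errors. It is not SMOTE itself, which synthesizes new points rather than copying existing ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Random oversampling: clone minority rows until the classes balance ---
X = rng.normal(size=(1000, 3))   # made-up feature data
y = np.zeros(1000, dtype=int)
y[:10] = 1                       # only 10 of 1,000 rows are fraud

minority_idx = np.flatnonzero(y == 1)
n_extra = (y == 0).sum() - minority_idx.size
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

X_balanced = np.vstack([X, X[extra_idx]])
y_balanced = np.concatenate([y, y[extra_idx]])
print("fraud rows before:", (y == 1).sum(), "after:", (y_balanced == 1).sum())

# --- Threshold adjustment on made-up model outputs ---
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
p_fraud = np.array([0.05, 0.20, 0.45, 0.60, 0.55, 0.70, 0.80, 0.95])

def count_errors(threshold):
    """Return (false positives, false negatives) at a given threshold."""
    flagged = p_fraud >= threshold
    false_pos = int(np.sum(flagged & (y_true == 0)))
    false_neg = int(np.sum(~flagged & (y_true == 1)))
    return false_pos, false_neg

print("threshold 0.50:", count_errors(0.50))
print("threshold 0.75:", count_errors(0.75))
```

On these made-up numbers, raising the threshold from 0.50 to 0.75 removes both false positives but causes one real fraud case to be missed, which is exactly the trade-off described above.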
67. Binning, Transforming, Encoding, Scaling, and Shuffling: Let's quickly go through some other techniques you might use in the process of feature engineering. One is called binning. The idea here is just to take your numerical data and transform it into categorical data by binning these values together based on ranges of values. So as an example, maybe I have the ages of people in my dataset. I might put everyone in their 20s into one bucket, everyone in their 30s into another bucket, and so on and so forth. That would be an example of binning, where I'm just putting everyone in a given range into a certain category. So instead of training on the fact that you're 22 years and three months old, I'm just going to bucket you into the bin of people in their 20s, right? So I've changed that number of 22, whatever it is, into a category of twenty-somethings. That's all binning is. Why would you want to do that? Well, there are a few reasons. One is that sometimes you have some uncertainty in your measurements. So maybe your measurements aren't exactly precise and you're not actually adding any information by saying this person is 22.37 years old versus 22.38 years old. Maybe some people would remember the wrong birthday or something, or you asked them on different days and you've got different values as a result. So binning is a way of covering up imprecision in your measurements. That's one reason. Another reason might be that you just really want to use a model that works on categorical data instead of numerical data. That's kind of a questionable thing to be doing, because you're basically throwing some information away by binning, right? So if you're doing that, you should think hard about why you're doing that. The only really legitimate reason to do this is if there is uncertainty or error in your actual underlying measurements that you're trying to get rid of.
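The age-bucket example can be sketched with pandas' cut function. The ages and the bin labels here are made up for illustration.

```python
import pandas as pd

# Made-up ages, some recorded with more precision than is really meaningful.
ages = pd.Series([22.25, 27.0, 31.5, 38.9, 45.0, 59.99])

# Bin by decade. right=False makes each bin include its lower edge,
# so 30.0 would land in the "30s" bin rather than the "20s" bin.
decade = pd.cut(
    ages,
    bins=[20, 30, 40, 50, 60],
    right=False,
    labels=["20s", "30s", "40s", "50s"],
)

print(decade.tolist())
```

After binning, 22.25 and 27.0 are indistinguishable to the model, which is the information loss being traded for robustness to measurement noise.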
Now there's also something called quantile binning that you should understand. The nice thing about quantile binning is that it categorizes your data by their place in the data distribution. So it ensures that every one of your bins has an equal number of samples within it. With quantile binning, I make sure that I have my data distributed in such a way that I have the same number of samples in each resulting bin. Sometimes that's a useful thing to do. So remember, quantile binning will have even sizes in each bin. Another thing we might do is transform our data, applying some sort of a function to our features to make them better suited for our algorithms. So for example, if you have feature data that has an exponential trend within it, it might benefit from a logarithmic transform to make that data look more linear. That might help out your model in actually finding real trends in it. Sometimes models have difficulty with nonlinear data coming into them. A real-world example is YouTube. They published a paper on how their recommendations work, which is great reading by the way, and there's a reference to that in the slide here. They have a whole section on feature engineering there that you might find useful. And one thing they do is, for any numeric feature x that they have, you know, for example, how long it has been since you watched a video, they also feed in the square of that and the square root of it. And the idea there is that the model can learn super-linear and sub-linear functions of the underlying data that way. So they're not just throwing in raw values; they're also throwing in the square and the square root, just to be careful and see if there actually are non-linear trends there that they should be picking up on. They found that that actually improved their results. So that's an example of transforming data. It's not necessarily replacing data with a transformation.
Sometimes you're actually creating a new feature by transforming an existing one. That's what's going on here. So they're feeding in the original feature x as well as x squared and the square root of x. You can see in this graph here why you might want to do that. So if I'm starting off with a function of x here, the green line, you can see that by taking the ln, the natural logarithm of that, I end up with a linear relationship instead, which might be easier for models to pick up on. I could also raise that to a higher power, which would actually make things worse in this case, but sometimes more data is better. Again, we're talking about the curse of dimensionality, so there is a limit to that, but that's what feature engineering is all about: trying to find that balance between having just enough information and too much information. Another very common thing you'll do while preparing your data is encoding. And you see this a lot in the world of deep learning. A lot of times your model will require a very specific kind of input, and you have to transform your data and encode it into the format that your model requires. A very common example is called one-hot encoding. Okay, so make sure you understand how this works. The idea is that I create a bucket for every category that I have. And basically I have a 1 that represents that that category exists, and a 0 that represents that it's not that category. Let's look at this picture as an example. Let's say that I'm building a deep learning model that tries to do handwriting recognition on people drawing the numbers 0 through 9. This is a very common example that we'll look at more later. So to one-hot encode this information: I know that this thing represents the number eight, and to represent that in a one-hot encoded manner, basically I have 10 different buckets, one for every possible digit that it might represent: 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. Now, I usually start counting at 0 here.
So you can see here that in the ninth slot there's a 1 that represents the number eight, and every other slot has a 0, representing that it is not that category. That's all one-hot encoding is. So again, if I had a 1 in the first slot, that would represent the number 0. If I had a 1 in the second slot, that would represent the number 1, and so on and so forth. We do this because in deep learning, neurons generally are either on or off; they're activated or they're not activated. So I can't just feed the number eight or the number one into an input neuron and expect it to work. That's not how these things operate. Instead, I need to have this one-hot encoding scheme where, for every single training value, that label is actually going to be fed into 10 different input neurons, where only one of them represents the actual category I have. We also need to talk about scaling and normalizing your data. Again, pretty much every model requires this as well. A lot of models prefer their feature data to be normally distributed around 0, and this is also true of most deep learning and neural networks. And at a minimum, most models will require that your feature data is at least scaled to comparable values. I mean, there are models out there that don't care so much, such as decision trees, but most of them will be sensitive to the scale of your input data. Otherwise, features that have larger magnitudes will end up having more weight in your model than they should. Going back to the example of people, say I'm trying to train a system based on their income, which might be some very large number like 50,000, and also their age, which is a relatively small number like 30 or 40. If I weren't normalizing that data down to comparable ranges before training on it, income would have a much higher impact on the model than age, and that's going to result in a model that doesn't do a very good job.
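Going back to the one-hot encoding scheme described above, here is a small NumPy sketch of the digit example. The function name one_hot and the sample labels are illustrative assumptions; deep learning libraries ship their own helpers for this, which the course may use instead.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer labels into rows that are all 0 except a single 1."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype=int)
    # For row i, switch on the slot whose index equals the label (counting from 0).
    encoded[np.arange(labels.size), labels] = 1
    return encoded

# Hypothetical labels for three handwritten digits.
digit_labels = [8, 0, 1]
encoded = one_hot(digit_labels, num_classes=10)

# The first row encodes the digit 8: a 1 in the ninth slot (index 8), 0 elsewhere.
eight = encoded[0]

# Decoding is just asking which slot is switched on.
decoded = encoded.argmax(axis=1)
```

Each row has exactly one slot switched on, which matches the idea of feeding a label into 10 neurons where only one of them represents the actual category.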
Now it's very easy to do this, especially with scikit-learn in Python. It has a preprocessing module that helps you out with this sort of thing, and it has something called MinMaxScaler that will do it for you very easily. The only thing is, you have to remember to scale the results back up if what you're predicting isn't just categories but actual numeric data. So sometimes, if you are predicting something, you have to make sure to apply that scaling in reverse to actually get a meaningful result out of your model at the end of the day. Finally, we will talk about shuffling. A lot of algorithms benefit from shuffling your training data. Otherwise, sometimes there's sort of a residual signal in your training data resulting from the order in which that data was collected. So you want to make sure you're eliminating any byproducts of how the data was actually collected by shuffling it and just randomizing the order in which it is fed into your model. Often that makes a difference in quality as well. There are a lot of stories I've seen where someone got a really bad result out of their machine learning model, but just by shuffling the input, things got a lot better. So don't forget to do that as well. And that's the world of feature engineering in a nutshell.
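As a rough picture of what min-max scaling, reversing the scaling, and shuffling amount to, here is a NumPy-only sketch. It is not scikit-learn's code: the helper names and the income and age values are made up, and in practice MinMaxScaler (with its fit_transform and inverse_transform methods) does the same per-column rescaling to the 0 to 1 range by default.

```python
import numpy as np

def fit_min_max(data):
    """Remember each column's minimum and maximum."""
    data = np.asarray(data, dtype=float)
    return data.min(axis=0), data.max(axis=0)

def scale(data, lo, hi):
    """Map each column onto the 0..1 range (assumes hi > lo in every column)."""
    return (np.asarray(data, dtype=float) - lo) / (hi - lo)

def unscale(scaled, lo, hi):
    """Apply the scaling in reverse to get back to the original units."""
    return np.asarray(scaled, dtype=float) * (hi - lo) + lo

# Hypothetical people: [income, age]. Income dwarfs age until both are scaled.
people = np.array([[50000.0, 30.0],
                   [82000.0, 40.0],
                   [36000.0, 25.0],
                   [120000.0, 58.0]])

lo, hi = fit_min_max(people)
scaled = scale(people, lo, hi)
restored = unscale(scaled, lo, hi)

# Shuffle the rows with a seeded generator so the run is repeatable.
# Any label array must be reordered with the same `order`.
rng = np.random.default_rng(seed=0)
order = rng.permutation(len(scaled))
shuffled = scaled[order]
```

After scaling, both columns run from 0 to 1, so neither income nor age dominates just because of its units, and unscale recovers the original numbers when a prediction has to be reported in real units.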
68. Important Spark Installation Notes:
69. Installing Spark - Part 1: So far in this course, we've talked about a lot of general data mining and machine learning techniques that you can use in your data science career, but they've all been running on your desktop. As such, you can only process as much data as a single machine can handle using these techniques with Python and scikit-learn and whatnot. Now, everyone talks about big data, and odds are you might be working for a company that does, in fact, have big data to process. Big data means that you can't actually wrangle it all on just one system; you need to compute using the resources of an entire cloud, a cluster of computing resources, and that's where Apache Spark comes in. So in this next section, I'm going to get you set up using Apache Spark and show you some examples of actually using it to solve some of the same problems that we solved using a single computer earlier in this course. But the first thing we need to do is get Spark set up on your computer, so we're going to walk you through how to do that in the next couple of lectures. It's pretty straightforward stuff, but there are a few gotchas, so don't just skip through these lectures. There are a few things you need to pay special attention to in order to get Spark running successfully, especially on a Windows system. And again, we're going to be developing these examples just using your own computer, but the same examples can scale up to actually run on a Hadoop cluster later on, if you wish. So let's get started. All right, let's get Apache Spark set up on your system so you can actually dive in and start playing around with it. It's a very powerful tool for managing big data and doing machine learning on large data sets.
Now, we're going to be running this just on your own desktop for now during this course, but the same programs that we're going to write in this section could be run on an actual Hadoop cluster. So you can take the same scripts that we're writing and running locally on your desktop in Spark standalone mode, run them from the master node of an actual Hadoop cluster, and let them scale up to the entire power of that cluster and process massive data sets that way. So even though we're going to set things up to run locally on your own computer, keep in mind that the same concepts will scale up to running on a cluster as well. Now, getting Spark installed on Windows involves several steps that we'll walk you through here. I'm just going to assume that you're on Windows, because most people take this course at home. We'll talk a little bit about dealing with other operating systems in a moment. But here are the basic steps. If you're already familiar with installing stuff and dealing with environment variables on your computer, then you can just take this little cheat sheet and go off and do it, but I will walk you through it one step at a time in the upcoming videos. The first thing you need to install is a JDK, that's a Java Development Kit, so you can just go to Sun's website and download that and install it if you need to. We need the JDK because even though we're going to be developing in Python during this course, that gets translated under the hood to Scala code, which is what Spark is developed in natively, and Scala, in turn, runs on top of the Java interpreter. So in order to run Python code, you need a Scala system, which will be installed by default as part of Spark, and you also need Java, or rather Java's interpreter, to actually run that Scala code. So it's like this technology layer cake.
Obviously you'll need Python, but if you've gotten to this point in the course, you already have a Python environment set up. And fortunately, the Apache Spark website makes available prebuilt versions of Spark that will just run out of the box, precompiled for the latest Hadoop version, so you don't have to build anything. You can just download that to your computer, stick it in the right place, and be good to go for the most part. Then we have a few configuration things to take care of. One thing we want to do is adjust our warning level so we don't get a bunch of warning spam when we run our jobs, and we'll walk through how to do that. Basically, you need to rename one of the properties files and then adjust the error setting within it. Then we need to set some environment variables to make sure that you can actually run Spark from any path that you might be in. So we're going to add a SPARK_HOME environment variable pointing to where you installed Spark, and then we will add SPARK_HOME\bin to your system path, so that when you run spark-submit or pyspark or whatever Spark command you need, Windows will know where to find it. Finally, on Windows, there's one more thing we need to do. We need to set a HADOOP_HOME variable as well, because Spark is going to expect to find one little bit of Hadoop, even if you're not using Hadoop on your standalone system, and then we need to install a file called winutils.exe into that path. There's a link to winutils.exe within the resources for this lecture, so you can get it there. So if you want to walk through it in more detail, we can do that. A quick note on installing Spark on other operating systems: the same steps will basically apply. The main difference is going to be in how you set environment variables on your system in such a way that they will automatically be applied whenever you log in, and that's going to vary from OS to OS.
macOS does it differently from the various flavors of Linux, so you're going to have to be at least a little bit familiar with using a Unix terminal command prompt and how to manipulate your environment to do that. But most macOS or Linux users who are doing development already have those fundamentals under their belt. And of course, you're not going to need winutils.exe if you're not on Windows. So those are the main differences for installing on different OSes. All right, let's get started by actually installing a JDK. I'll walk you through that real quickly, and then in our next lecture we'll go through all the other details of getting set up with Spark. So, like I mentioned before, Spark runs on top of Scala, which in turn runs on top of the Java environment. If you don't already have a Java Development Kit installed on your system, you will need to go get one. Walking through that real quickly: just go to your favorite search engine and search for JDK, and it should come up. Just take the most recent one you can find, and that should redirect you to the Oracle website. You just want to select the version that is appropriate for your system. So on Windows, I'm going to accept the license agreement and look for the Windows x64 version in my example, since I'm running a 64-bit version of Windows. Go ahead and get that. Down it comes, and 187 megabytes later, we should have something we can install. Nothing special here, it's just your standard installer, but that is step one for getting Spark installed and up and running on your system. In our next lecture, we'll go ahead and talk about the remaining steps, which are installing Spark itself, then all the associated config files, and also one little extra gotcha on Windows: the winutils.exe file that needs to be installed in a special place. So we are getting there, and this download is also getting there.
Click on that, and it should just walk you through a standard installer for the Java SE Development Kit. So just go ahead and hit OK. You can just accept all the defaults and let it do its thing, and that's all there is to it. So let's move on to the next steps. All right, we're on our way to getting Spark set up on your computer. We've got a JDK set up; that's step one. So let's move on to the remaining steps in the next lecture.
70. Installing Spark - Part 2: All right, there are a bunch of little niggly details that we need to work out to get Spark actually up and running on your desktop in standalone mode, so let's just go through them all and get them out of the way. So far, we've installed Python and we've installed Java. Next, we need to install Spark itself. So back to a new browser tab here: just head to spark.apache.org and click on the big, friendly Download Spark button. Now, this course has been tested with Spark 2.1.1, so given the choice, anything beyond 2.0 should work just fine, but that's where we are today. Make sure you get a prebuilt version, and we're just going to do a direct download, so all these defaults are perfectly fine. Go ahead and download that package. Now, what's downloading is a TGZ file; that stands for tar and gzip. So again, Windows is kind of an afterthought with Spark, quite honestly, and on Windows you're not going to have a built-in utility for actually decompressing TGZ files. So you might need to install one if you don't have one already. The one I use is called WinRAR, and you can pick that up from a website called rarlab.com. Just go to the downloads page if you need it and download the installer for WinRAR, 32-bit or 64-bit depending on your operating system, and that will allow you to actually decompress TGZ files on Windows. So hit pause and go install that if needed. If not, let's check back on our Apache Spark download. It looks like it came down, so I'm going to go ahead and show that in my Downloads folder, right-click on it, and extract it to a folder of my choosing. Again, WinRAR is doing this for me at this point. I should now have a folder associated with that package. Sure enough, there it is. Let's open that up.
And there is Spark itself, so I need to install that someplace where I'm going to remember it. You don't want to leave it in your Downloads folder, obviously. So let's go ahead and open up a new File Explorer window here, and I'm going to go to my C drive and create a new folder, and let's just call it spark. So my Spark installation is going to live under C:\spark. Again, nice and easy to remember. Open that up. I'm going to hit Ctrl+A to select everything in the Spark distribution, Ctrl+C to copy it, go back to where I want to put it, in C:\spark, and hit Ctrl+V to paste it in. All right, now there's one little thing we need to do here. Open up the conf folder under where we installed Spark and rename log4j.properties.template such that you just take the .template off, so it should just be log4j.properties instead. Yes, I'm sure I want to change it. And now I can open that up; you might need to right-click there, say Open with, and select WordPad. What I want to do is change this line here where it says log4j.rootCategory=INFO. I want to change that INFO to ERROR, and that will just remove the clutter of all the log spam that gets printed out when I run stuff. So change that from INFO to ERROR, save it, and close out of your editor. Okay, so now where are we? This is kind of exhausting. We've installed Python, we've installed Java, and we've installed Spark. Now, the next thing we need to do is install something that will trick your PC into thinking that Hadoop exists. And again, this step is only necessary on Windows, so you can skip it if you're on Mac or Linux. But for you Windows people, I want you to go to this link here. You should find it in the resources associated with this lecture, on platforms that have such a thing, but if you want, you can just type this link in yourself by hand. If you download that, it will give you a copy of a little snippet of an executable.
That executable can be used to trick Spark into thinking that you actually have Hadoop. Now, since we're going to be running our scripts just locally on our desktop, it's not a big deal, and we don't have to have Hadoop installed for real. It just gets around yet another quirk of running Spark on Windows. So now that we have that, let's show it in my Downloads folder, hit Ctrl+C to copy it, and go to our C drive to create a place for it to live. I'm going to create a new folder again, and we're going to call it winutils. I'm going to open up that winutils folder I made and create another folder within it called bin, and within that bin folder I'm going to paste the file that I just downloaded. Okay, this next step is only required on some systems, but just to be safe, open up a command prompt in Windows. You can do that by going to your Start menu and going down to Windows System, then Command Prompt. From here, I want you to type in cd c:\winutils\bin, which is where we stuck our winutils.exe file. If you do a dir, you should see it there. Now type in the following: winutils.exe chmod 777 \tmp\hive. That just makes sure that all the file permissions you need to actually run Spark successfully are in place without any errors. You can close that command prompt now that you're done with that step. Wow, we're almost done, believe it or not. Now the last thing we need to do is set up some environment variables, so all the software knows where to find each other. To do that, let's close out of this browser and get out of all this stuff. I'm going to right-click on my Windows icon. Again, on different operating systems you'll set environment variables in different ways, but on Windows you do it through your Control Panel, by clicking on System and Security, then System, and then Advanced system settings.
And from here you click on Environment Variables. We need to set up a few here, so let's start. I'm going to hit the New button for my user environment variables, and I'll start by defining one for SPARK_HOME. That's going to be the directory I installed Spark into, which is C:\spark. Next, create another one. This one's going to be called JAVA_HOME, and that's where I installed the JDK, which is going to be C:\jdk. And finally, I need to set up HADOOP_HOME. That's where I put the winutils.exe file, and that will just be C:\winutils, just like that. Finally, I need to update my path. So I'm going to click on the Path environment variable, hit Edit, and add a new path. It's going to be %SPARK_HOME%\bin. I'm going to add another one, %JAVA_HOME%\bin. All right. Whew, I think that's it.
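With that many manual steps, a quick sanity check can help. The sketch below uses only the Python standard library. The function names are made up for illustration, and it assumes the logging line in the template reads log4j.rootCategory=INFO, console, as it does in the Spark 2.x template; check your own file if it differs.

```python
import os

def quiet_log4j(properties_text):
    """Return log4j properties text with the root logger changed from INFO to ERROR."""
    lines = []
    for line in properties_text.splitlines():
        # Assumed template line: log4j.rootCategory=INFO, console
        if line.startswith("log4j.rootCategory=INFO"):
            line = line.replace("INFO", "ERROR", 1)
        lines.append(line)
    return "\n".join(lines)

def missing_settings(env, on_windows=True):
    """List problems with the environment variables set up in this lecture."""
    required = ["SPARK_HOME", "JAVA_HOME"]
    if on_windows:
        required.append("HADOOP_HOME")   # only needed for the winutils.exe workaround
    problems = [name + " is not set" for name in required if not env.get(name)]
    hadoop_home = env.get("HADOOP_HOME")
    if on_windows and hadoop_home:
        winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
        if not os.path.isfile(winutils):
            problems.append("winutils.exe not found at " + winutils)
    return problems

# Check the current machine; an empty list means nothing obvious is missing.
problems = missing_settings(dict(os.environ), on_windows=(os.name == "nt"))
print(problems if problems else "Environment looks OK")
```

New command prompts pick up environment variable changes, but ones that were already open do not, so run the check from a freshly opened prompt. The quiet_log4j helper only transforms text; reading and writing conf\log4j.properties is left to you so that nothing gets overwritten by accident.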
71. Spark Introduction: Let's get started with a high-level overview of Apache Spark: what it's all about, what it's good for, and how it works. Let's dive in. Let me give you a brief, high-level overview of what Apache Spark is all about, a simple introduction to the whole concept. So what is Spark? Well, if you go to the Spark website, they give you a very high-level, hand-wavy answer: a fast and general engine for large-scale data processing. It slices, it dices, it does your laundry. Well, not really, but it is a framework for writing jobs or scripts that can process very large amounts of data, and it manages distributing that processing across a cluster of computers for you. Basically, Spark works by letting you load your data into these large objects called resilient distributed datasets, RDDs, and it can automatically perform operations that transform and create actions based on those RDDs, which you can think of as large data frames, basically. The beauty of it is that Spark will automatically and optimally spread that processing out amongst an entire cluster of computers, if you have one available. So no longer are you restricted to what you can do on a single machine or within a single machine's memory. You can actually spread that out to all of the processing capabilities and memory available to a cluster of machines. And in this day and age, computing is pretty cheap. You can actually rent time on a cluster through things like Amazon's Elastic MapReduce service, and just rent some time on a whole cluster of computers for just a few dollars and run a job that you couldn't run on your own desktop. So how is it scalable? Well, let's get a little bit more specific here about how it all works. The way it works is you write a driver program. It's just a little script that looks just like any other Python script, really, and it uses the Spark library to actually write your script with.
Within that library, you define what's called a SparkContext, which is sort of the root object that you work within when you're developing in Spark, and from there, the Spark framework kind of takes over and distributes things for you. If you're running in standalone mode on your own computer, like we're going to be doing in these upcoming lectures, it all just stays there on your computer, obviously. But if you are running on a cluster manager, Spark can figure that out and automatically take advantage of it. Spark actually has its own built-in cluster manager, so you can use it on its own without even having Hadoop installed. But if you do have a Hadoop cluster available to you, it can use that as well. Hadoop is more than MapReduce; there's actually a component of Hadoop called YARN that separates out the entire cluster management piece of Hadoop, and Spark can interface with YARN to optimally distribute the components of your processing amongst the resources available to that Hadoop cluster. Within a cluster, you might have individual executor tasks that are running. These might be running on different computers, or they might be running on different cores of the same computer, and they each have their own individual cache and their own individual tasks that they run. The driver program, the SparkContext, and the cluster manager work together to coordinate all this effort and return the final result back to you. The beauty of it is that all you have to do is write this little script that uses a SparkContext to describe, at a high level, the processing you want to do on this data, and Spark, working together with the cluster manager that you're using, figures out how to spread that out and distribute it, so you don't have to worry about all those details. Well, until it doesn't work.
Obviously, you might have to do some troubleshooting to figure out if you have enough resources available for the task at hand. But in theory, it's all just magic. Now, what's the big deal about Spark? I mean, there are similar technologies like MapReduce that have been around longer. Spark is fast, though, and on the website they claim that Spark is up to 100 times faster than MapReduce when running a job in memory, or 10 times faster on disk. Of course, the key words here are "up to"; your mileage may vary. I don't think I've ever seen anything actually run that much faster than MapReduce, since well-crafted MapReduce code can actually still be pretty darned efficient. But I will say that Spark does make a lot of common operations easier. MapReduce forces you to really break things down into mappers and reducers, whereas Spark is a little bit higher level, so you don't always have to put as much thought into doing the right thing with Spark. And part of that leads to another reason why Spark is so fast: it has a DAG engine, a directed acyclic graph. That's hard to say. A directed acyclic graph, say that 10 times fast. And while that's another fancy word, what does it mean? What it means is that the way Spark works is you write a script that describes how to process your data, and you might have an RDD, that's basically like a data frame, and you might do some sort of transformation on it or some sort of action on it. But nothing actually happens until you perform an action on that data. What happens at that point is Spark will say, OK, so this is the end result you want on this data. What are all the other things I had to do to get up to this point, and what's the optimal way to lay out the strategy for getting there? So under the hood, it will figure out the best way to split up that processing and distribute that information to get the end result that you're looking for.
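The "nothing happens until you perform an action" idea can be pictured with a toy class in plain Python. To be clear, this is not Spark and not how its DAG engine is implemented; LazyDataset and everything in it are made up purely to show transformations being recorded first and run later.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not executed."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)   # the recorded plan, oldest step first

    def map(self, func):
        # Transformation: just extend the plan and hand back a new dataset.
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, predicate):
        # Transformation: same idea, nothing is computed yet.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        """Action: only now is the recorded plan actually run."""
        rows = iter(self._source)
        for kind, func in self._steps:
            rows = map(func, rows) if kind == "map" else filter(func, rows)
        return list(rows)

calls = []   # records every time the squaring function really runs

def square(x):
    calls.append(x)
    return x * x

plan = LazyDataset([1, 2, 3, 4]).map(square).filter(lambda v: v > 4)
work_before_action = len(calls)   # still 0: building the plan did no work
result = plan.collect()           # the action triggers the whole chain
```

Building the plan calls square zero times; only collect runs it. Real Spark goes further than this toy: because it sees the whole chain before running anything, it can decide how to split up and distribute the work.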
So the key insight here is that Spark waits until you tell it to actually produce a result, and only at that point does it actually go and figure out how to produce that result. It's kind of a cool concept, and that's the key to a lot of its efficiency. Spark is a very hot technology, and relatively young, so it's still very much emerging and changing quickly. But a lot of big people are using it. Amazon, for example, has claimed they're using it, as have eBay, NASA's Jet Propulsion Laboratory, Groupon, TripAdvisor, Yahoo, and many, many others. I'm sure there are a lot using it that don't fess up to it, but if you go to the Spark Apache wiki page here, there's actually a list you can look up of known big companies that are using Spark to solve real-world data problems. So if you are worried that you're getting into the bleeding edge here, fear not. You're in very good company with some very big people that are using Spark in production for solving real problems, and it is pretty stable stuff at this point. It's also not that hard. You have your choice of programming in Python, Java, or Scala, and they're all built around the same concept that I just described earlier, the resilient distributed dataset, RDD for short. We'll talk about that in a lot more detail in the coming lectures. Spark actually has many different components that it's built up of. There is Spark Core, which lets you do pretty much anything you can dream up just using Spark Core functions alone. I mean, I have a course where I make an entire recommender system just using Spark Core. But there are these other things built on top of Spark that are also useful.
For example, Spark Streaming is a library that lets you process data in real time, so data can be flowing into a server continuously, say from web logs, and Spark Streaming can help you process that data in real time as you go, forever. Spark SQL lets you treat data as a SQL database and actually issue SQL queries on it, which is kind of cool if you're familiar with SQL already. MLlib is what we're going to be focusing on in this section. That is a machine learning library that lets you perform common machine learning algorithms, with Spark underneath the hood to distribute that processing across a cluster, so you can perform machine learning on much larger data sets than you could have otherwise. And finally, GraphX. That's not for making pretty charts and graphs; it refers to graphs in the network theory sense. Think about a social network, for example. That's an example of a graph, and GraphX just has a few functions that let you analyze the properties of a graph of information. Now, I do get some flak sometimes about using Python when I'm teaching people about Apache Spark, but there's a method to my madness. It is true that a lot of people use Scala when they're writing Spark code, because that's what Spark is developed in natively. So you are incurring a little bit of overhead by forcing Spark to translate your Python code into Scala and then into Java interpreter commands at the end of the day. But Python is a lot easier, and you don't need to compile things. Managing dependencies is a lot easier too, so you can really focus your time on the algorithms and what you're doing, and less on the minutiae of actually getting it built and running and compiling and all that nonsense. Plus, obviously, this course has been focused on Python so far, and it makes sense to keep using what we've learned and stick with Python.
However, I will say that if you were to do some Spark programming in the real world, there's a good chance people are using Scala. Don't worry about it too much, though, because in Spark, Python and Scala code ends up looking very similar, since it's all built around the same RDD concept. The syntax is slightly different, but it's not that different. So if you can figure out how to do Spark using Python, learning how to use it in Scala isn't that big of a leap, really. So let's go look at some examples and dive in. So those are the basic concepts of Spark itself, why it's such a big deal, and how it's so powerful in letting you run machine learning algorithms, or any algorithm, really, on very large data sets. Next, let's go into a little bit more detail about how it does that, and about the resilient distributed dataset.
72. Spark and the Resilient Distributed Dataset (RDD): Let's get a little bit deeper into how Spark works. We're going to talk about resilient distributed datasets, known as RDDs. It's sort of the core that you use when programming in Spark, and we'll have a few code snippets to try to make it real. So let's have a look. We're going to give you sort of a crash course in Apache Spark here. There's a lot more depth to it than what we're going to cover in these next few lectures, but I'm just going to give you the basics you need to actually understand what's going on in these examples, and hopefully get you started and pointed in the right direction. So the most fundamental piece of Spark is called the resilient distributed dataset, an RDD, and this is going to be the object that you use to actually load and transform and get the answers you want out of the data that you're trying to process. So it's a very important thing to understand. It stands for resilient distributed dataset, so it is a dataset at the end of the day. It's just a bunch of rows of information that can contain pretty much anything. But the key is the R and the first D. It is resilient in that Spark makes sure that if you're running this on a cluster and one of the machines in that cluster goes down, it can automatically recover from that and retry. Now, that resilience only goes so far, mind you. If you don't have enough resources available to the job that you're trying to run, it will still fail, and you will have to add more resources to it. There are only so many things it can recover from; I mean, there is a limit to how many times it will retry a given task. But it does make its best effort to make sure that, in the face of an unstable cluster or an unstable network, it will still continue to try its best to run through to completion. And obviously it is distributed.
The whole point of using Spark is that you can use it for big data problems where you can actually distribute the processing across the entire CPU and memory power of a cluster of computers, and that can be distributed horizontally, so you can throw as many computers as you want at a given problem. The larger the problem, the more computers; there's really no upper bound to what you can do there. Now, you always start your Spark scripts by getting a SparkContext object, and this is the object that sort of embodies the guts of Spark. It is what is going to give you your RDDs to process, so it is what generates the objects that you use in your processing. You don't actually think about the SparkContext very much when you're writing Spark programs, but it is sort of the substrate that is running them for you under the hood. If you're running in the Spark shell interactively, it has an sc object already available for you that you can use to create RDDs and whatnot. But in a standalone script, you will have to create that SparkContext explicitly, and you'll have to pay attention to the parameters that you use, because you can actually tell the SparkContext how you want the work to be distributed. Should I take advantage of every core that I have available to me? Should I be running on a cluster or just standalone on my local computer? So that's where you set up the fundamental settings of how Spark will operate. Let's look at some little code snippets of actually creating RDDs, and I think it will make a little bit more sense. Here's a very simple example. If I just want to make an RDD out of a plain old Python list, I can call the parallelize function in Spark, and that will convert a list of stuff, in this case just the numbers 1, 2, 3, 4, into an RDD object called nums. That is the simplest case of creating an RDD, just from a hard-coded list of stuff, and that list could come from anywhere. It doesn't have to be hard-coded either.
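The slide itself isn't reproduced in this transcript, so here is a hedged sketch of what creating a SparkContext in a standalone script and parallelizing a list might look like. The helper names local_master and make_numbers_rdd and the app name RDDBasics are assumptions for illustration. SparkConf, SparkContext, setMaster, setAppName, and parallelize are the PySpark calls being illustrated, and the pyspark import sits inside the function so the file can still be loaded on a machine without Spark.

```python
def local_master(threads="*"):
    """Build a local-mode master URL: "local[*]" uses every core, "local[2]" uses two threads."""
    return "local[{}]".format(threads)

def make_numbers_rdd(threads="*"):
    """Create a SparkContext for local use and turn a plain Python list into an RDD."""
    # Imported here so this file can be loaded even where Spark is not installed.
    from pyspark import SparkConf, SparkContext

    # The master setting is where you say how the work should be distributed.
    conf = SparkConf().setMaster(local_master(threads)).setAppName("RDDBasics")
    sc = SparkContext(conf=conf)

    # parallelize converts a hard-coded list into an RDD called nums.
    nums = sc.parallelize([1, 2, 3, 4])
    return sc, nums
```

On a machine with Spark installed, you would call sc, nums = make_numbers_rdd(), look at nums.collect(), and then call sc.stop() when finished. In the interactive Spark shell, the sc object already exists, so only the parallelize line is needed.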
But that kind of defeats the purpose of big data, right? If I have to load the entire dataset into memory before I can create an RDD from it, what's the point? So I can also load an RDD from a text file, and that file could be anywhere. In this example, maybe I have some giant text file that's the entire encyclopedia or something. I'm reading it from my local disk here, and that will convert every line of that text file into its own row in an RDD. You can think of the RDD as a database of rows, and in this example it loads my text file into an RDD where every row contains one line of text. I can then do further processing on that RDD to parse it or break out the delimiters in that data, but that's where I start from. Remember when we talked about ETL and ELT? This is a good example of where you might load raw data into a system and do the transform on the same system that you use to query your data. You can take raw text files that haven't been processed at all and use the power of Spark to transform them into more structured data. Spark can also talk to things like Hive. If you have an existing Hive database set up at your company, you can create a HiveContext that's based on your SparkContext. And how cool is this: you can create an RDD, in this case called rows, that's generated by executing a SQL query on your Hive database. And there are more ways to create RDDs as well. You can create them from JDBC connections, so basically any database that supports JDBC can also talk to Spark and have RDDs created from it. The same goes for Cassandra, HBase, Elasticsearch, and files in JSON format, CSV format, sequence files, object files, and a bunch of other compressed formats like ORC or what have you. 
I don't want to get into the details of all of those; you can go get a book and look them up if you need to. The point is that it's very easy to create an RDD from data wherever it might be, whether it's on a local file system or in a distributed data store. To call attention to that again: up here I'm loading from a local file using a file:// URL, but I could also use S3 if I want to host this file in a distributed Amazon S3 bucket, or HDFS if I want to refer to data that's stored on a distributed HDFS cluster. HDFS stands for Hadoop Distributed File System, if you're not familiar with it. When you're dealing with big data and working with a Hadoop cluster, that's usually where your data will live. So again, an RDD is just a way of loading and maintaining very large amounts of data and keeping track of it all at once. But conceptually, within your script, an RDD is just an object that contains a bunch of data. You don't have to think about the scale, because Spark does that for you. Now, there are two different classes of things you can do on RDDs once you have them: transformations and actions. Let's talk about transformations first. Transformations are exactly what they sound like: a way of taking an RDD and transforming every row in it to some new value, based on some function you provide. map and flatMap are the ones you'll see most often. Both of these take any function you can dream up that takes a row of an RDD as input and outputs a transformed row. For example, you might take raw input from some CSV file, and your map operation might take each line and break it up into individual fields based on the comma delimiter, returning a Python list that has that data in a more structured format that you can perform further processing on. 
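As a rough sketch of the CSV example just described, assuming an existing SparkContext named sc: the function names and the sample line are hypothetical, while textFile and map are the RDD calls the lecture mentions. The path passed in could be a file:// or hdfs:// URL.

```python
def parse_line(line):
    """Split one raw comma-separated line into a Python list of fields."""
    return line.split(",")


def load_fields(sc, path):
    """Load a text file into an RDD and split every line into fields.

    `sc` is assumed to be an existing SparkContext; `path` is any location
    Spark can read, such as a file:// or hdfs:// URL.
    """
    lines = sc.textFile(path)      # one RDD row per line of text
    return lines.map(parse_line)   # one RDD row per list of fields


# The parsing step on its own, using a made-up line of data:
print(parse_line("alice,42,Seattle"))
```

Keeping the parsing logic in a plain function like this makes it easy to try out on a single line before handing it to map.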
And you can chain map operations together, so the output of one map might end up creating a new RDD that you then do another transformation on, and so on and so forth. Again, the key is that Spark can distribute those transformations across the cluster, so it might take part of your RDD and transform it on one machine, and another part of your RDD and transform it on another. Like I said, map and flatMap are the most common transformations you'll see. The only difference is that map only allows you to output one value for every row, whereas flatMap lets you output multiple new rows, or none, for a given row. So you can create a larger RDD or a smaller RDD than you started with using flatMap. filter can be used if what you want is a Boolean function that says, should this row be preserved or not, yes or no? And there are some less commonly used transformations as well, like distinct, which only returns the distinct values within your RDD, and sample, which lets you take a random sample from it. Then you can perform set operations like union, intersection, and subtract, or even produce every Cartesian combination of two RDDs. Here's a little example of how it might work. Let's say I created an RDD from just the list 1, 2, 3, 4. I can then call rdd.map with a lambda function of x that takes in each row, each value of that RDD, calls it x, and applies the function x times x to square it. If I were to then collect the output of this RDD, it would be 1, 4, 9, and 16, because it takes each individual entry of that RDD, squares it, and puts that into a new RDD. Make sense? Now, if you don't remember what lambda functions are, we did talk about them a little bit earlier in this course, but as a refresher, a lambda function is just a shorthand for defining a function inline. 
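To make the difference between map, flatMap and filter concrete, here is a small sketch. The top half is a plain-Python analogue on ordinary lists, not Spark itself; the function at the bottom shows what the equivalent RDD calls might look like, assuming an existing SparkContext named sc. All names and the sample data are made up for illustration.

```python
def square_it(x):
    """Named equivalent of the inline function `lambda x: x * x`."""
    return x * x


nums = [1, 2, 3, 4]
lines = ["the quick fox", "jumps"]

# map analogue: exactly one output value per input row.
squares_named = [square_it(x) for x in nums]
squares_lambda = list(map(lambda x: x * x, nums))
word_lists = [line.split() for line in lines]

# flatMap analogue: each input row may produce zero or more output rows.
words = [word for line in lines for word in line.split()]

# filter analogue: keep only the rows where the function returns True.
evens = [x for x in nums if x % 2 == 0]


def spark_versions(sc):
    """The same three operations as RDD transformations, then collected."""
    rdd = sc.parallelize(nums)
    text = sc.parallelize(lines)
    return (
        rdd.map(square_it).collect(),
        text.flatMap(lambda line: line.split()).collect(),
        rdd.filter(lambda x: x % 2 == 0).collect(),
    )


print(squares_lambda)  # [1, 4, 9, 16]
```

Note how the map analogue on the text keeps two rows (one list of words per line), while the flatMap analogue produces four rows (one per word).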
So lambda x: x * x is exactly the same thing as defining a separate named function called squareIt that returns x * x, and saying rdd.map(squareIt). It's just a shorthand for very simple functions that you want to pass in as a transformation. It eliminates the need to declare a separate named function of its own. Passing functions around like this is the whole idea of functional programming, so you can say you understand functional programming now, by the way. But really, it's just shorthand notation for defining a function inline as part of the parameters to a map function, or any transformation for that matter. You can also perform actions on an RDD. When you want to actually get results, you can call collect on an RDD, and that will give you back a plain old Python object that you can then iterate through and print out the results, or save them to a file, or whatever you want to do. You can also call count, which forces Spark to go count how many entries are in the RDD at this point. countByValue gives you a breakdown of how many times each unique value occurs within that RDD. You can also peek at the RDD using take, which gives you the first few entries, or top, which gives you the few largest entries, if you just want a little look at what's in there for debugging purposes; takeSample gives you a random sample instead. The more powerful action is reduce, which lets you combine all of the values in an RDD together using a function you provide. You can also use RDDs in the context of key/value data, and there the related reduceByKey function lets you define a way of combining together all of the values for a given key. So it's very similar in spirit to MapReduce. 
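The actions listed above can be sketched as follows. The module-level lines are a plain-Python analogue, so the results can be checked without a cluster; spark_actions shows the corresponding RDD calls, assuming an existing SparkContext named sc. The sample data and helper names are invented for this example.

```python
from collections import Counter
from functools import reduce

values = [3, 1, 3, 2, 3]
pairs = [("a", 1), ("b", 5), ("a", 2)]

count = len(values)                              # like count()
by_value = dict(Counter(values))                 # like countByValue()
first_two = values[:2]                           # like take(2): first rows
top_two = sorted(values, reverse=True)[:2]       # like top(2): largest rows
total = reduce(lambda a, b: a + b, values)       # like reduce(): fold all rows


def reduce_by_key(key_value_pairs, func):
    """Combine all values that share a key, like reduceByKey on an RDD."""
    combined = {}
    for key, value in key_value_pairs:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined


def spark_actions(sc):
    """The same operations on real RDDs; `sc` is an existing SparkContext."""
    rdd = sc.parallelize(values)
    keyed = sc.parallelize(pairs)
    return {
        "collect": rdd.collect(),
        "count": rdd.count(),
        "countByValue": dict(rdd.countByValue()),
        "take": rdd.take(2),
        "top": rdd.top(2),
        "reduce": rdd.reduce(lambda a, b: a + b),
        "reduceByKey": dict(keyed.reduceByKey(lambda a, b: a + b).collect()),
    }
```

In Spark, reduceByKey returns a new RDD rather than a final value, which is why the sketch calls collect on it before building the dictionary.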
So reducing by key is basically the analogous operation to a reducer in MapReduce, and map is analogous to a mapper, so it's often very straightforward to take a MapReduce job and convert it to Spark by using these functions. Remember, too, that nothing actually happens in Spark until you call an action. Once you call one of those action methods, that's when Spark goes out and does its magic with directed acyclic graphs and computes the optimal way to get the answer you want. But nothing really occurs until that action happens. That can sometimes trip you up when you're writing Spark scripts, because you might have a little print statement in there and expect to get an answer at that point, but it doesn't appear until the action is actually performed. So that is Spark 101 in a nutshell. Those are the basics you need for Spark programming: what an RDD is, and what you can do to an RDD. Once you get those concepts, you can write some Spark code. Up next, we'll talk about MLlib, and some specific features in Spark that let you run machine learning algorithms.
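The "nothing happens until you call an action" behavior can be imitated in plain Python, because the built-in map is lazy as well. This is only an analogy under that assumption, not Spark's actual scheduler; the names are invented, and spark_version sketches the matching RDD code for an existing SparkContext named sc.

```python
calls = []


def noisy_square(x):
    """Square x, and record the call so we can see when the work happens."""
    calls.append(x)
    return x * x


# Building the pipeline runs nothing yet, much like an RDD transformation.
lazy = map(noisy_square, [1, 2, 3, 4])
calls_before = list(calls)   # still empty at this point

# Consuming the pipeline plays the role of an action such as collect().
result = list(lazy)
calls_after = list(calls)    # now every row has been processed


def spark_version(sc):
    """With a real SparkContext: map only records the plan, collect runs it."""
    rdd = sc.parallelize([1, 2, 3, 4]).map(noisy_square)
    return rdd.collect()
```

Here calls_before stays empty even though the map line has already executed, which is the same surprise the lecture warns about with print statements in Spark scripts.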
73. Introducing MLLib: Fortunately, you don't have to do things the hard way in Spark when you're doing machine learning. It has a built-in component called MLlib that lives on top of Spark Core, and this makes it very easy to run complex machine learning algorithms on massive datasets, distributing that processing across an entire cluster of computers. So, very exciting stuff. Let's learn more about what it can do. There's one more thing I need to cover before we start diving into some real code, at which point things will make a lot more sense, and that is MLlib: the machine learning library, a component built on top of Spark. Obviously that is very relevant to this course. So what are some of the things MLlib can do? Well, one is feature extraction. One thing it can do at scale is term frequency and inverse document frequency, and that's useful for creating search indexes, for example. We will actually go through an example of that a few lectures from now. The key, again, is that it can do this across a cluster using massive datasets, so you could potentially make your own search engine for the web with this. It also offers your basic statistics functions: chi-squared tests, Pearson or Spearman correlation, and some simpler things like min, max, mean, and variance. None of those are terribly exciting in and of themselves. What is exciting is that you can compute the variance, or the mean, or the correlation score across a massive dataset, and Spark will break that dataset up into chunks and run across an entire cluster if necessary. So even if some of these operations aren't terribly sexy, what's sexy is the scale at which they can operate. MLlib also supports things like linear regression and logistic regression, so if you need to fit a function to a massive set of data and use that for predictions, you can do that too. 
It also supports support vector machines, so we're getting into some of the fancier algorithms here, some of the more advanced stuff, and that too can scale up to massive datasets using Spark's MLlib. There is a naive Bayes classifier built into MLlib. Remember that spam classifier that we built a few lectures ago? You could do that for an entire email system using Spark, and scale it up as far as you want. Decision trees, one of my favorite things in machine learning, are also supported by Spark, and we'll have an example of that later in this course. We'll also do an example of K-means clustering later in the course; you can do clustering with K-means on massive datasets using Spark and MLlib. Even principal component analysis and SVD can be done with Spark, and we'll have an example of that, too. And finally, there's a recommendations algorithm called alternating least squares built into MLlib. Personally, I've had mixed results with it. It's a little bit too much of a black box for my taste, but I am kind of a recommender system snob, so take that with a grain of salt. Using MLlib is usually pretty straightforward; there are just some library functions you need to call. It does introduce a few new data types that you need to know about, however. One is the vector. Remember when we were doing movie similarities and movie recommendations? An example of a vector might be a list of all the movies that a given user rated. There's a difference between a sparse vector and a dense vector. Remember, there are many, many movies in the world, and a dense vector would represent data for every single movie, whether or not that user actually watched it. So, for example, let's say I have a user who watched Toy Story. 
Obviously, I would store their rating for Toy Story. But if they didn't watch the movie Star Wars, I would actually store the fact that there is not a number there: there is no value, there's missing data for Star Wars. So with a dense vector we end up taking up space for all these missing data points. A sparse vector only stores the data that exists, so it doesn't waste any memory on missing data. It's a more compact way of representing a vector internally, but obviously that introduces some complexity while processing. So it's a good way to save memory if you know that your vectors are going to have a lot of missing data in them. There's also a LabeledPoint data type that comes up, and that's just what it sounds like: a point that has some sort of label associated with it that conveys the meaning of this data in human-readable terms. And there is a Rating data type that you'll encounter if you're using recommendations with MLlib. It takes in a rating that represents a 1 to 5 or 1 to 10, or whatever star rating a person might have given, and uses that to inform product recommendations automatically. So I think you finally have everything you need to get started here. Let's dive in, look at some real MLlib code, and run it, and then it will make a lot more sense. So that's MLlib: it makes it very easy to run complicated machine learning algorithms, potentially on very large datasets, and distribute that processing across an entire cluster. Like I said before, Spark is still young and it's growing every day, so I expect these capabilities to keep expanding and evolving as time goes on. Cool stuff. Up next, let's get our hands dirty, write some MLlib code, and do some real machine learning using Spark.
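As an illustration of the dense versus sparse idea, here is a sketch with a made-up four-movie catalogue and made-up ratings. The top half uses plain Python structures to show the storage difference; mllib_versions shows how the corresponding MLlib types might be constructed, assuming pyspark is installed.

```python
import math

# Hypothetical catalogue and ratings, purely for illustration.
movies = ["Toy Story", "Star Wars", "Casablanca", "Alien"]

# Dense: one slot per movie, with NaN marking "no rating".
dense = [5.0, math.nan, math.nan, 3.0]

# Sparse: only the positions that actually have a rating.
sparse = {0: 5.0, 3: 3.0}


def to_dense(size, sparse_ratings):
    """Expand a sparse {index: value} mapping into a full-length list."""
    return [sparse_ratings.get(i, math.nan) for i in range(size)]


def mllib_versions():
    """Build the MLlib equivalents. Imported lazily so pyspark is optional."""
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint

    dense_vec = Vectors.dense([1.0, 0.0, 3.0])
    # Size 4, values only at indices 0 and 3. Note that MLlib treats the
    # entries left out of a sparse vector as 0.0 rather than as NaN.
    sparse_vec = Vectors.sparse(4, {0: 5.0, 3: 3.0})
    # A label (here 1.0) paired with a feature vector.
    point = LabeledPoint(1.0, dense_vec)
    return dense_vec, sparse_vec, point
```

With thousands of movies and only a handful of ratings per user, the sparse form stores a handful of entries where the dense form stores thousands.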
74. Decision Trees in Spark: So let's make this real. Let's look at some actual Spark code that makes a decision tree using MLlib, one that could scale up to a cluster if you wanted it to. It's actually pretty simple. Let's take a look. So let's play around with Spark and MLlib. Open up your Anaconda prompt or your terminal, depending on your operating system. By the way, if you did just install Spark, remember that we set some environment variables, so you will need to close and reopen your Anaconda prompt if you have one open already, to pick those up. All right, so let's cd into our course materials folder, as we always do. In here there are a couple of Python scripts that we can use with Spark. Now, unlike before, we can't just run this in a notebook. So instead we're going to use whatever text editor we have to look at these files and go through what they're doing. We may as well use Spyder, the Python editor that comes with Anaconda. So just go ahead and type in spyder, with a y, and wait for that to come up. And here we are. Go ahead and hit the open icon, and navigate to your course materials in C:\MLCourse. We want the script SparkDecisionTree.py. All right, so let's walk through what's going on here, shall we? Again, we're not using an IPython notebook this time around. We're just using a standalone Python script, hence the .py extension instead of .ipynb. It is actually possible to run Spark code within a notebook, but it involves even more setup steps, and I think we've done enough of that for now just to run a couple of Spark examples. So let's keep this as a standalone script. In the real world, the way you run this on a cluster will typically be as follows. 
You copy the script to the master node of that cluster, and there's a script called spark-submit that comes with Spark, which will interpret that script and distribute it throughout the rest of the cluster for you. So that's really the way you would want to do it in the real world anyway. It's possible to kick it off from a notebook, but it's just a little bit more trouble than I want to deal with right now. Anyway, let's walk through what this script is doing. It's simple enough, but this may be new to you, so I'll go through it a little bit slowly. We start by importing all the packages we need, of course, and we need some stuff from MLlib, obviously, if we're going to be writing MLlib code. We need something called a LabeledPoint, and the DecisionTree itself, from MLlib, both of which we talked about earlier. And pretty much every Spark script is going to import SparkConf and SparkContext as well. We're also going to import array from NumPy, which allows us to use NumPy arrays as we're manipulating and preparing our data. Now, keep in mind that Spark is not going to magically make everything from NumPy and scikit-learn distributable and parallelized across a cluster. If you call NumPy or scikit-learn functions within this script, they just run on the specific node that the code is running on; Spark is not going to automatically distribute that work across your cluster for you. You have to be using the actual functions within MLlib for that to happen. So keep that in mind. Yes, you can still use NumPy and scikit-learn in here, but those methods will not be distributed. If you want distributed machine learning, you've got to stick with what's in MLlib. All right. To kick off a Spark script, we need to set up a SparkContext, which is the environment that we're running Spark within. 
It basically takes care of all the niggly details of how to actually distribute this stuff, how to organize the order in which things run, and how the results get assembled back together across your cluster. The beauty of Spark is that it does all that thinking for you; you don't have to worry about that part of it. To set up a SparkContext, however, we need a configuration object first. What's going on here is that we're creating a new SparkConf object. setMaster with local means that we're only going to be running on our local PC for this example, because I don't have a cluster handy. If you were running on a real cluster, you would change that to something else. We're also going to set an application name, so that when you view this in the Spark console, if we had one running, you would see it referred to by that name. With that, we set up our SparkContext. We'll skip these functions for now and get back to them when we actually call them. If we go down below these functions, we get to the lines of code that will actually be executed here. We start off by loading up our raw data from the PastHires.csv file. We saw this earlier in our decision tree example. Let's go ahead and open that up to refresh ourselves on what it looks like. If we go to our course materials, we should find it in there: PastHires.csv. Let's open that up, and it will open in Excel for me, which makes it look like a pretty table, even though it is just a comma-separated values file. Again, the structure here is that the first line contains the headings for the columns. So our first row tells us what these columns mean: years of experience, whether or not they're employed, previous number of employers, level of education, and so on and so forth. As before, we have a lot of data here that needs to be converted into numerical data. Just like any machine learning algorithm, this one deals better with numbers than with letters. 
So we're going to have to transform these Ys and Ns into ones and zeros. And these BS, PhD, and MS levels of education will need to be converted into numerical, ordinal data instead. So that's what we're dealing with here. Let's go back to our script. The first thing we need to do is strip that header row off, because that's not actually useful information for the algorithm. To do that, the trick is this: we say header = rawData.first(). What happened when we called sc.textFile is that it loaded every individual row of that CSV file into an RDD called rawData. So now we have an RDD called rawData that contains nothing but the raw, comma-separated strings of each row of that data. What we're doing here is extracting the first row of that data, which is our header row that just contains the names of the columns. Then we can call the filter function on our rawData RDD with a lambda function. Again, this is an inline function, and it says that as long as the given row is not equal to the header row, we'll preserve it. By doing this, we make a copy of rawData that filters out that first header row, and we save that into a new rawData. So at this point we have a rawData RDD where that first header row has been filtered out. Now, this is as good a time as any to mention that in modern Spark code there's something called a Dataset that can be used instead of an RDD, and that tends to be used more widely these days. It has slightly better performance in some instances. Well, it has a lot better performance in some instances; it depends on how you're using it. It also lets you execute SQL against the data right in place. Because of those conveniences, people are migrating more toward using Datasets instead of RDDs. It's basically a higher-level structure, but in this case it doesn't really make a big difference. 
So we can use RDDs, and MLlib will work basically the same way with them, so we're going to stick with RDDs for now. My way of looking at it is, if you have a simple solution and a more complicated solution and there's no big performance difference, stick with the simple solution. So I'm going to stick with RDDs here. But just so you know, when you talk to people today about Spark, they're probably going to be working with Datasets or DataFrames instead of RDDs. It's the same general concept; it just has more functionality. All right, so now we need to split our comma-separated values into actual fields. To do that, we call a map function with a little inline lambda function that calls split on the line using the comma. That takes each row of data and splits it up at the commas into a list of individual fields. So we have a new RDD called csvData, where we have structured that data somewhat. Instead of one value that contains a big comma-separated string of stuff, each row now contains the individual fields that we're interested in. Now we need to convert those fields to what we want, so at this point we call map with an actual named function called createLabeledPoints. Let's move up to that function and see what it does. createLabeledPoints takes in a list of fields that came from our csvData after splitting on the commas, and it converts them into the format we need for training our decision tree. The first thing we do is convert the first field, which represents years of experience, into an integer instead of a string. Then we take the employed field and call our binary function on it. So fields[1] is going to be a field that contains either the letter Y or N, which communicates whether or not they're currently employed. 
The binary function just says, if it's a Y, return 1, else return 0. This function gets called on each row to convert that Y to a 1 or that N to a 0. Remember, machine learning generally wants numbers, not strings or letters, whenever possible. We convert the previous number of employers from a string to an integer. For the education level, we call this mapEducation function on that field, and that just converts BS, MS, and PhD to the ordinal values 1, 2, and 3. And we call the binary function again to convert the Ys and Ns for whether they came from a top-tier school, whether they had a previous internship, and the final label data of whether they were hired or not, into ones and zeros. As you may recall, MLlib wants labeled points as its input. So we're going to return a LabeledPoint structure that contains the label, which is the hired field, followed by all the feature data, which is an array that contains the years of experience, whether or not they're employed, previous employers, and so on and so forth. So the labeled point contains the label, which is the thing we're trying to predict, whether or not they should be hired, and then the features, which are all the different attributes of each person that might influence whether or not they would be hired. All right, so at this point, if you go back down to where this was called, we have a new RDD called trainingData that contains all of our training data, converted into numerical data and ultimately into labeled points, which is what MLlib expects. So, awesome. Now we can start playing with MLlib. Let's create a set of test candidates to try this out with. In this example we'll just set up one person. We're going to set up an array that contains information representing 10 years of prior experience. They are currently employed. 
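The field conversion walked through above can be sketched as follows. This is a reconstruction from the spoken description, not the course script itself: the snake_case names, the fallback value of 0 for an unrecognized degree, and the sample row are assumptions. The conversion is kept as a plain function so it can be checked without Spark, with a thin wrapper that builds the MLlib LabeledPoint.

```python
def binary(yn):
    """Convert a 'Y'/'N' flag into 1 or 0."""
    return 1 if yn == "Y" else 0


def map_education(degree):
    """Convert a degree into an ordinal value (0 for anything unrecognized)."""
    return {"BS": 1, "MS": 2, "PhD": 3}.get(degree, 0)


def convert_fields(fields):
    """Turn one split CSV row into a (label, features) pair of numbers."""
    years_experience = int(fields[0])
    employed = binary(fields[1])
    previous_employers = int(fields[2])
    education_level = map_education(fields[3])
    top_tier = binary(fields[4])
    interned = binary(fields[5])
    hired = binary(fields[6])   # the label we are trying to predict
    features = [years_experience, employed, previous_employers,
                education_level, top_tier, interned]
    return hired, features


def create_labeled_point(fields):
    """Wrap the converted row in the LabeledPoint type that MLlib expects."""
    from pyspark.mllib.regression import LabeledPoint

    label, features = convert_fields(fields)
    return LabeledPoint(label, features)


# A made-up row in the column order described in the lecture:
row = "10,Y,4,BS,N,N,Y".split(",")
print(convert_fields(row))  # (1, [10, 1, 4, 1, 0, 0])
```

On an RDD of split rows, the wrapper would be applied with something like csv_data.map(create_labeled_point), where csv_data is whatever name the split-row RDD has in your script.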
This candidate had three previous employers and currently has a BS degree. They're not from a top-tier school, and they did not do an internship. So we've set up this fake test candidate to see if we can make a prediction on a new person that we haven't seen before, once our decision tree has been created. Then we take that test candidate and create an RDD out of it, so we can feed it into Spark, using the parallelize function. That just converts this array of test candidates, which is really just one candidate, into an RDD called testData. Next, we make our decision tree classifier, which we'll call model. We just call DecisionTree.trainClassifier, which comes from the MLlib library, passing in our trainingData RDD that contains all the labeled training data, and a bunch of hyperparameters. numClasses indicates that we only have two classes that we're trying to sort people into: whether or not they're hired, yes or no. We also have to pass in a dictionary defining which of our features are categorical in nature. And then we can specify how the decision tree itself is constructed: with what impurity function, its maximum depth, and the maximum number of bins. Once we have that model trained, we can use it to make predictions. We just call model.predict on our testData RDD that contains our test candidate, and we print out the result of that prediction. And here's the important point: at this point we're calling predictions.collect(). I want to actually get something back from Spark that gives me an answer, and it's not until an action like this that Spark actually does the work. All that happens when we define transformations is that a directed acyclic graph gets constructed of all the stuff that Spark needs to do to produce an answer. 
And it can do that at large scale. Once I actually say I want a result, I want an answer, it will go back and construct the optimal way of putting it all together, and the optimal way of distributing it if I were on a cluster, and at that point it will go off and start chugging away and produce an answer for me. So we print out our ultimate hire prediction, and we also print out the model itself. There's a handy toDebugString method on the decision tree model that lets us understand what's going on inside the decision tree, and what decisions it's making based on what criteria. So with that, we can try it. Now again, with Spark, we need to run this within the Spark environment itself. I can't just run this from within Spyder, at least not without doing a bunch of extra setup steps. So let's close out of Spyder, or at least minimize it for now, and go back to our Anaconda prompt. Let's actually open up a new one: Anaconda, Anaconda Prompt. This makes sure that we have Anaconda's Python environment available to us. Again, we cd to our course materials. Now what we can do is type in spark-submit followed by the script name, which was SparkDecisionTree.py. The spark-submit script is part of Spark itself. It's what takes the script in, decides how to distribute it, and feeds it into the Spark engine. Let's hit Enter and see what happens. If you installed Spark successfully, you should be seeing something like this. And there we have it. For our test user, we predicted that we would hire that person, and we also have the decision tree itself printed out here. Now, we obviously can't do a nice, pretty graphical representation like we did before, because we're just in a command console, but you can still interpret this. Basically, it starts with: if feature 1 is in {0}. 
The way to interpret that is, if we look back at our source data and start counting at zero, feature 1 would be employed: 0, 1. And remember, we converted Y and N to 1 and 0. So it basically says, if you're not employed: if feature 1, which is employed, is in the set that contains the single value 0. For categorical data, you'll see that syntax, with curly brackets around whatever the categories are. All right, so if you're not employed, and if feature 5 is also 0 (counting 0, 1, 2, 3, 4, 5, that's the internship), so if you're unemployed and you did not do an internship, and then it says if you have less than half a year of experience, so basically no experience, and you have only a Bachelor of Science degree, we will not hire you. That's what the prediction is. You can go through and figure out the rest of the structure here if you want to, but that's how you read this stuff. Cool. So there you have it: an actual decision tree running within Apache Spark. And although that seems like a convoluted way of doing things on a single computer, and it is, the beauty is that if you were to run this on the master node of a real Hadoop cluster or a real Spark cluster, it would just work. It would distribute that work across the entire cluster. How cool is that? You could feed in a massive dataset of training data and a massive set of people that you want to make predictions for, and it could distribute that throughout an entire cluster and give you back results, no matter how big that dataset might be. That's what's really exciting about this. You could imagine a world where you're working for some huge company, or some company that produces hiring or recruiting software, and you could run this at massive scale across a massive number of people. 
I'll leave aside the ethical concerns of doing something like this in the real world, where you're just boiling people down into numbers and feeding them into a model. Obviously, I wouldn't really want real hiring decisions to be based on that alone. That would not be a world I want to live in. But for the sake of illustration, this is how it would work. So there we have decision trees in Spark, running for real: an actual decision tree built using Spark and MLlib that works and makes sense. Pretty awesome stuff. You can see it's pretty easy to do, and you can scale it up to as large a dataset as you can imagine, if you have a large enough cluster. So there you have it.
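Pulling the training and prediction steps of this lecture together, a minimal sketch might look like the following. It assumes an existing SparkContext (sc) and an RDD of LabeledPoint rows (training_data). The transcript does not state the hyperparameter values, so the impurity, depth and bin settings below are assumptions, as is the category count for each categorical feature; DecisionTree.trainClassifier, predict and toDebugString are the MLlib calls the lecture names.

```python
# The test candidate from the lecture: 10 years of experience, employed,
# 3 previous employers, BS degree, not top-tier school, no internship.
test_candidate = [10, 1, 3, 1, 0, 0]

# Feature index -> number of categories, for the categorical columns
# (employed, education level, top-tier school, interned). Assumed values.
categorical_features = {1: 2, 3: 4, 4: 2, 5: 2}


def train_and_predict(sc, training_data):
    """Train a decision tree and predict for the test candidate.

    `sc` is an existing SparkContext and `training_data` an RDD of
    LabeledPoint rows. The hyperparameter values are assumptions.
    """
    from pyspark.mllib.tree import DecisionTree

    model = DecisionTree.trainClassifier(
        training_data,
        numClasses=2,                                  # hired or not hired
        categoricalFeaturesInfo=categorical_features,
        impurity="gini",
        maxDepth=5,
        maxBins=32,
    )
    test_data = sc.parallelize([test_candidate])
    predictions = model.predict(test_data).collect()   # the action
    return predictions, model.toDebugString()
```

A script containing this would be launched with spark-submit rather than plain python, as described in the lecture.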
75. K-Means Clustering in Spark: Next, let's take our K-means clustering example that we used earlier in this course and solve it this time using Spark and MLlib, and you'll see it's just as easy, maybe even easier. So again from Spyder, let's open up a file. This time we'll navigate to our course materials location, under C:\MLCourse, and the script we want is SparkKMeans.py, so go ahead and open that up. As before, we're going to take an example that we did earlier in the course on a single PC using just scikit-learn, and do the same thing using Apache Spark so we could actually scale this out to a whole cluster. So let's walk through this code. Again, some boilerplate stuff: we're going to import the KMeans package from the MLlib clustering package. We're going to import array and random from NumPy, because again, we're free to use whatever we want. This is a Python script at the end of the day, and MLlib often does require NumPy arrays as input. We're going to import the square root function, and the usual boilerplate we need pretty much every time from PySpark, SparkConf and SparkContext. We're also going to import the scale function from scikit-learn. So again, it's OK to use scikit-learn, as long as you make sure it's installed on every machine that you're going to be running this job on. And don't assume that scikit-learn will magically scale itself up just because you're running it on Spark. But since I'm only using it for the scaling function, it's OK. All right, let's go ahead and set things up. I'm going to set a global variable K to 5, so I'm going to run K-means clustering in this example with a K of 5, with five different clusters, and I'm going to set up a local SparkConf just running on my own desktop.
I'm going to set the name of my application to SparkKMeans and create a SparkContext object that I can then use to create RDDs that run on my local machine. We'll skip past this function for now and go to the first line of code that gets run. The first thing we're going to do is create an RDD by parallelizing some fake data that I'm creating, and that's what this createClusteredData function does. Basically, I'm telling it to create 100 data points clustered around K centroids, and this is pretty much identical to the code that we looked at when we played with K-means clustering earlier in the course. So if you want a refresher, go back and look at that lecture. But basically what we're going to do is create a bunch of random centroids around which we normally distribute some age and income data. So what we're doing is trying to cluster people based on their age and income, and we're fabricating some data points to do that. All right, so that returns a NumPy array of our fake data. Now, once that result comes back from createClusteredData, I'm calling scale on it, and that will ensure that my ages and incomes are on comparable scales. Remember me lecturing you about data normalization? This is one of those examples where it is important. So we're normalizing that data with scale so that we get good results from K-means. And finally, we parallelize the resulting list of arrays into an RDD using parallelize. So now our data RDD contains all of our fake data. All we have to do, and this is even easier than a decision tree, is call KMeans.train on our training data, passing the number of clusters we want, our K value, and a couple of parameters that put an upper bound on how much processing it's going to do. We tell it to use the random initialization mode of K-means, where we just randomly pick our initial centroids for our clusters before we start iterating on them. And back comes the model that we can use.
We're going to call that clusters. All right, now we can play with that. So let's start by printing out the cluster assignments for each one of our points. We're going to take our original data and map it, that is, transform it, using this lambda function. This function is just going to transform each point into the cluster number that is predicted from our model. So again, we're just taking our RDD of data points, calling clusters.predict to figure out which cluster our K-means model is assigning them to, and putting the results in our resultRDD. Now, one thing I want to point out here is this cache call. An important thing when you're doing Spark is that any time you're going to call more than one action on an RDD, it's important to cache it first. Remember, when you call an action on an RDD, Spark goes off and figures out the DAG for it and how to optimally get to that result, and then actually executes everything to get that result. So if I call two different actions on the same RDD, it will actually end up evaluating that RDD twice. If you want to avoid all that extra work, you can cache your RDD to make sure that it does not recompute it more than once. By doing that, we make sure these two subsequent operations do the right thing. So in order to get an actual result out of this resultRDD, all we're going to do is use countByValue, and what that will do is give us back a count of how many points are in each cluster. Remember, resultRDD has mapped every individual point to the cluster it ended up with, so now we can use countByValue to count up how many values we see for each given cluster ID, and we can print that list out. We can also look at the raw results of that RDD by calling collect on it, and that will give me back every single point's cluster assignment, and we can print out all of them. Now, how do we measure how good our clusters are?
Well, one metric for that is called the within set sum of squared errors. Wow, that sounds fancy. It's such a big term that we need an abbreviation for it: WSSSE. All it is: we look at the distance from each point to its centroid, the final centroid in each cluster, take the square of that error, and sum it up for the entire data set. So it's just a measure of how far apart each point is from its centroid. Obviously, if there's a lot of error in our model, then the points will tend to be far apart from the centroids, and that might imply that we need a higher value of K, for example. So we will go ahead and compute that value and print it out. How do we do that? We define this error function that computes the squared error for each point. It just takes the distance from the point to the centroid, the center of its cluster. So we're taking our source data and calling that lambda function on it, which computes the error from its centroid, and then we can chain different operations together here. We're calling map to first compute the error for each point, and then, to get a final total that represents the entire data set, we're calling reduce on that result. So we're doing data.map to compute the error for each point, and then .reduce to take all of those errors and add them all together. That's what this little reduce lambda function does. This is basically a fancy way of saying, I want you to add up everything in this RDD into one final result. Reduce will take the entire RDD two things at a time and combine them together using whatever function you provide. The function I'm providing here takes the two rows that I'm combining and just adds them up. And if we do that throughout every entry of the RDD, we end up with the final summed-up total. It might seem like a little bit of a convoluted way to just sum up a bunch of values.
But by doing it this way, we're able to make sure that we can actually distribute this operation if we need to. We could end up computing the sum of this piece of the data over here on this machine, and the sum of a different piece over on this other machine, and then take those two sums and combine them together into a final result. So you see how that works? This reduce function is saying, how do I take any two values, any intermediate results from this operation, and combine them together. So again, feel free to take a moment and stare at this a little bit longer if you want it to sink in. Nothing really fancy going on here, but there are a few important points. We introduced the use of cache, for when you want to make sure that you don't do unnecessary recomputations on an RDD that you're going to use more than once. We introduced the use of the reduce function, and we have a couple of interesting mapper functions here as well. So there's a lot to learn from in this example, but at the end of the day, it will just do K-means clustering. So let's go ahead and run it. As before, we'll open up an Anaconda prompt, or your terminal on other platforms. We will cd to our course materials directory, type in spark-submit SparkKMeans.py, and just let that run and see what happens. And we have a result. Very cool. All right, so we first have the counts by value here, showing how many points got assigned to each cluster, and these do seem pretty evenly distributed. We had 20 points categorized as cluster 2, 20 as cluster 0, and then 23, 20 and 17. So that's a good sign, because we did create an even number of points per cluster in the fabricated data that we were trying to train against. And if you look at the actual cluster assignments, and if you remember how we generated the data to begin with, we actually did generate one cluster at a time.
So it's a good sign that these are all together: we have all the twos, all the zeros, all the ones, all the threes. It starts to get a little bit confused with the fours, with a one thrown in the middle there and a couple more ones in there, so it didn't always get it right. Some of these clusters were a little bit overlapping, it seems. And finally, we got a WSSSE metric, actually computing how good it is, of 20.3. Cool. So it worked. We actually did K-means clustering using Apache Spark, distributed potentially across a cluster if we had one. So very cool. And if you want a challenge for yourself to dive in even more deeply, there are a few things here you can try. Try increasing or decreasing the value of K. One of the big challenges in K-means clustering is choosing the right value of K, so see what impact that has. Obviously, there is a real value of K here that we generated the data with, and it will be informative to see what having the wrong value of K does to your results. Also, what happens if you don't normalize the input data before you cluster it? Does that affect the quality of this algorithm? And what happens if you change the maxIterations or runs parameters, the hyperparameters, if you will, on the K-means algorithm itself? So go play around with it. Those are things to try, and see what you come up with. There you have it: K-means clustering done on Spark and MLlib. Pretty simple stuff. And the beauty of it is you could actually throw a large, massive, real data set at it, and if you were to run that on a cluster, it would carve it up for you and distribute all that processing automatically, and it would still work just fine. Pretty awesome. Let's move on to an even cooler example.
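As a rough illustration of the count-by-value and map-then-reduce steps from this lecture, here is a minimal sketch in plain Python rather than Spark. The points, the two centroids, and the helper names (distance, predict, squared_error) are made up for the example and are not taken from the course script; Counter stands in for countByValue, and functools.reduce stands in for the RDD reduce call.

```python
from collections import Counter
from functools import reduce
from math import sqrt

# Hypothetical toy data: four 2-D points and two made-up centroids.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = [(0.0, 1.0), (10.0, 11.0)]


def distance(p, q):
    """Euclidean distance between two points."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def predict(point):
    """Index of the nearest centroid (a stand-in for the model's predict)."""
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))


def squared_error(point):
    """Squared distance from a point to its assigned centroid."""
    return distance(point, centroids[predict(point)]) ** 2


# "map" step: one cluster id per point; Counter plays the role of countByValue.
assignments = list(map(predict, points))
counts = Counter(assignments)

# "map" then "reduce": per-point squared errors, added up two at a time.
wssse = reduce(lambda x, y: x + y, map(squared_error, points))

print(counts)
print(wssse)
```

Because addition does not care how the values are grouped, the same two-at-a-time function could combine partial sums computed on different machines, which is the point the lecture makes about reduce.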
76. TF / IDF: So our final example of MLlib is going to use something called term frequency inverse document frequency, or TF-IDF, which is the fundamental building block of many search algorithms. So first, let's talk about the concepts of TF-IDF and how we might go about using it to solve a search problem. As usual, it sounds complicated, but it's not as complicated as it sounds, and what we're actually going to do with TF-IDF is create a rudimentary search engine for Wikipedia using Apache Spark and MLlib. How awesome is that? So let's get started. TF-IDF stands for term frequency and inverse document frequency, and these are basically two metrics that are closely interrelated for doing search and figuring out the relevancy of a given word to a document, given a larger body of documents. So, for example, every article on Wikipedia might have a term frequency associated with it. Every page on the Internet could have a term frequency associated with it, for every word that appears in that document. Sounds fancy, but as you'll see, it's a fairly simple concept. All term frequency means is how often a given word occurs in a given document. So within one web page, within one Wikipedia article, within one whatever, how common is a given word within that document? What is the ratio of that word's occurrence rate throughout all the words in that document? That's it. That's all term frequency is. Document frequency is the same idea, named a little bit confusingly. All it is is the frequency of that word across the entire corpus of documents. So how often does this word occur throughout all of the documents that I have, all of the web pages, all of the articles on Wikipedia?
So, for example, common words like "a" or "the" would have a very high document frequency, and I would expect them to also have a very high term frequency. But that doesn't necessarily mean they're relevant to a given document. You can kind of see where we're going with this. Let's say we have a very high term frequency and a very low document frequency for a given word. The ratio of these two things can give me a measure of the relevance of that word to the document. So if I see a word that occurs very often in a given document, but not very often in the overall space of documents, then I know that this word probably conveys some special meaning to this particular document. It might really convey what this document is actually about. So that's TF-IDF. It just stands for term frequency times inverse document frequency, which is just a fancy way of saying term frequency over document frequency, which is just a fancy way of saying, how often does this word occur in this document compared to how often it occurs in the entire body of documents? That's all TF-IDF means. It's that simple. In practice, there are a few little nuances to how we use this. For example, we use the log of the inverse document frequency instead of the raw value, and that's because word frequencies in reality tend to be distributed exponentially. So by taking the log, we end up with a slightly better weighting of words, given their overall popularity. There are some limitations to this approach. Obviously, one is that we basically assume a document is nothing more than a bag full of words. We assume there are no relationships between the words themselves, and obviously that's not always the case. And actually parsing them out can be a good part of the work, because you have to deal with things like synonyms and various tenses of words, abbreviations, capitalizations, misspellings. This gets back to the idea of cleaning your data being a large part of your job as a data scientist.
That's especially true when you're dealing with natural language processing. Fortunately, there are some libraries out there that can help you with this, but it is a real problem, and it will affect the quality of your results. Another implementation trick that we use with TF-IDF is that, instead of storing actual string words with their term frequencies and inverse document frequencies, to save space and make things more efficient we map every word to a numerical value, a hash value, we call it. The idea is we have some function that can take any word, look at its letters, and assign it in some fairly well distributed manner to a set of numbers in some range. So that way, instead of using the word "represented", that might have a hash value of 10, and we can refer to the word "represented" as 10 from now on. Now, if the space of your hash values isn't large enough, you could end up with different words being represented by the same number, which sounds worse than it is. But you want to make sure that you have a fairly large hash space, so that that is unlikely to happen. Those are called hash collisions, and they can cause issues. But in reality, there are only so many words that people commonly use in the English language, so you can get away with 100,000 or so and be just fine. And obviously, doing this at scale is the hard part. If you want to do this over all of Wikipedia, then you're going to have to run this on a cluster. But for the sake of argument, we're just going to run this on our own desktop for now, using a small sample of Wikipedia data. So how do we turn that into an actual search problem? Once we have TF-IDF, we have this measure of each word's relevancy to each document. What do we do with it?
Well, one thing you could do is compute TF-IDF for every word that we encounter in the entire body of documents that we have. And then let's say we want to search for a given term, a given word. Let's say we want to find which Wikipedia article in my set of Wikipedia articles is most relevant to Gettysburg. I could sort all the documents by their TF-IDF score for Gettysburg and just take the top results, and those are my search results for Gettysburg. That's it. Take your search word, compute TF-IDF, and take the top results: which documents have the highest TF-IDF score? Obviously, in the real world, there's a lot more to search than that. Google has armies of people working on this problem, and it's way more complicated in practice. But this will actually give you a working search engine algorithm that produces reasonable results. So there you have the concepts of TF-IDF. Another one of those things that sounds really fancy, but once you understand it, it's actually quite simple. So let's go ahead and turn that into actual source code and run it in our next lecture.
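To make the search recipe from this lecture concrete, here is a minimal sketch in plain Python on a tiny, made-up three-document corpus. Everything in it (the documents, the function names, and the plain log(N / df) weighting) is an assumption chosen for illustration; a library such as MLlib may smooth or scale the inverse document frequency differently.

```python
from math import log

# Hypothetical three-document corpus, already split into lowercase words.
documents = {
    "Abraham Lincoln": "lincoln gave the gettysburg address in the war".split(),
    "Apple": "the apple is a fruit and the apple tree is a tree".split(),
    "Civil War": "the war was fought in the united states".split(),
}


def term_frequency(word, words):
    """Share of the document's words that are `word`."""
    return words.count(word) / len(words)


def inverse_document_frequency(word):
    """log(N / df); 0.0 when the word appears in no document at all."""
    df = sum(1 for words in documents.values() if word in words)
    return log(len(documents) / df) if df else 0.0


def search(word):
    """Document names ranked by TF-IDF score for a single search word."""
    idf = inverse_document_frequency(word)
    scores = {name: term_frequency(word, words) * idf
              for name, words in documents.items()}
    return sorted(scores, key=scores.get, reverse=True)


print(search("gettysburg")[0])
print(inverse_document_frequency("the"))
```

In this toy corpus "the" appears in every document, so its log-weighted inverse document frequency comes out as zero, which matches the lecture's point that very common words carry little relevance.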
77. Searching Wikipedia with Spark: This might be the coolest thing we do in this entire course. We're going to build an actual working search algorithm for a piece of Wikipedia using Apache Spark and MLlib, and we're going to do it all in less than 50 lines of code. Pretty awesome. So let's go ahead and open this up in Spyder, or whatever text editor you want to use, really. We'll go to our course materials folder and open up the TF-IDF.py script. And here we have it. Now step back for a moment and let it sink in that we're actually creating a working search algorithm, along with a few examples of using it, in less than 50 lines of code here. And it's scalable. I could run this on a cluster. It's kind of amazing. All right, so let's walk through this. It's actually fairly straightforward. We'll start by importing the SparkConf and SparkContext libraries that we need for any Spark script that we run in Python, and then we're going to import HashingTF and IDF. This is what computes the term frequencies and inverse document frequencies within our documents. We'll start off with our boilerplate Spark stuff that creates a local Spark configuration and a SparkContext created from that, from which we can then create our initial RDD. We use the SparkContext to create an RDD from subset-small.tsv. So this is tab-separated values, and it represents a small sample of Wikipedia articles. That gives me back an RDD where every document is on its own line of the RDD. So this TSV file contains one entire Wikipedia document on every line, and I know that each one of those documents is split up into tabular fields that have various bits of metadata about each article. So the next thing I'm going to do is split those up.
So I'm going to split up each document based on its tab delimiters into a Python list, and create a new fields RDD that, instead of raw input data, now contains Python lists of each field in that input data. Finally, I'm going to map that data: take in each list of fields and extract field number three, which I happen to know is the body of the article itself, the actual article text, and I'm in turn going to split that based on spaces. So what this does is extract the body of the text from each Wikipedia article and split it up into a list of words. So my new documents RDD has one entry for every document, and every entry in that RDD contains a list of words that appear in that document. Now, so we actually know what to call these documents later on when we're evaluating the results, I'm also going to create a new RDD that stores the document names. All that does is take that same fields RDD and use the map function to extract the document name, which I happen to know is in field number one. So I have two RDDs: documents, which contains lists of words that appear in each document, and documentNames, which contains the name of each document. And I know these are in the same order, so I can combine them later on to look up the name for a given document. Now the magic happens. The first thing we're going to do is create a HashingTF object, and we're going to pass in a parameter of 100,000. This means that I am going to hash every word into one of 100,000 numerical values. So instead of representing words internally as strings, which is very inefficient, it's going to try to, as evenly as possible, distribute each word to a unique hash value, and I'm giving it up to 100,000 hash values to choose from. So basically, this is mapping words to numbers at the end of the day. And I'm going to call transform on hashingTF with my actual RDD of documents.
So what that's going to do is take my list of words in every document and convert it to a list of hash values, a list of numbers that represent each word instead. And this is actually represented as a sparse vector at this point, to save even more space. So not only have we converted all of our words to numbers, but we've also stripped out any missing data. In the event that a word does not appear in a document, we're not explicitly storing the fact that the word does not appear, and that saves even more space. Now, to actually compute the TF-IDF score for each word in each document, we first cache this tf RDD, because we know we're going to use it more than once. We use IDF with a minDocFreq of 2, meaning that we're going to ignore any word that doesn't appear at least twice. We call fit on tf and then transform on tf, and what we end up with here is an RDD of the TF-IDF score for each word in each document. So let's try to put that to use. Let's try to look up the best article for the word Gettysburg. If you're not familiar with US history, that's where Abraham Lincoln gave a famous speech. So we can transform the word Gettysburg into its hash value; that's what this bit of code does. We will then extract the TF-IDF score for that hash value into a new RDD for each document. So what this does is extract the TF-IDF score for Gettysburg, from the hash value it maps to, for every document, and store that in this gettysburgRelevance RDD. We then combine that with the document names so we can see the results, and print out the answer. So let's go run that and see what happens. To run this, we need to bring up another Anaconda prompt, cd to our course materials directory, and just type in spark-submit TF-IDF.py, and let's let this chug away.
So right now, it's analyzing a subset of Wikipedia data, and it's going to try to find the best document match for the search term Gettysburg. It's already coming up with an answer, and it's done already. And the answer is Abraham Lincoln. It actually worked, guys. That's awesome. So that's TF-IDF in action. We created our own little mini search engine here with less than 50 lines of code. Most of those are comments, mind you; the actual code itself was maybe 20 lines. And with just that little bit of code, MLlib is so powerful that it was able to construct an entire search engine that could figure out that the article for Abraham Lincoln is the best match for someone looking for information about Gettysburg, at least in the subset of Wikipedia that we gave it. I think that's kind of awesome. And what's even cooler is that if I had the entire corpus of Wikipedia, I could, with very few modifications, run the same exact script on a cluster and get back results over the entire Wikipedia data set. It's scalable to that extent. So pretty cool stuff: TF-IDF in action in Spark, and we made our own little search engine really easily. And there you have it, an actual working search algorithm for a little piece of Wikipedia using Spark, MLlib and TF-IDF. And the beauty is we could actually scale that up to all of Wikipedia if we wanted to, if we had a cluster large enough to run it. Now, we've just touched on some of the capabilities of Apache Spark. I'm sure you have a lot of questions. There's much, much more to it. In fact, I have a whole course on Apache Spark, so there's a lot we can talk about. If you do want to learn more about Apache Spark, skip to the final couple of lectures in this course, where we talk about where to learn more and where to go next.
There are some jumping-off places there for you to go explore and learn more about Spark, because it is a big topic. But hopefully we got your interest up in Spark, and you can see how it could be applied to solve what can be pretty complicated machine learning problems in a distributed manner. So it's a very important tool, and I want to make sure you don't get through this course on data science without at least knowing the concepts of how Spark can be applied to big data problems. So when you need to move beyond what one computer can do, remember, Spark is at your disposal.
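The word-hashing step described in this lecture can be sketched in plain Python as follows. This is only an illustration under stated assumptions: zlib.crc32 is used as a convenient stable hash and is not claimed to be the hash function Spark uses, and hash_word and hashing_tf are hypothetical helper names, not MLlib calls.

```python
from collections import Counter
from zlib import crc32


def hash_word(word, num_features):
    """Map a word to a bucket in [0, num_features) with a stable hash."""
    return crc32(word.encode("utf-8")) % num_features


def hashing_tf(words, num_features=100000):
    """Sparse term-frequency vector: {bucket index: count}, zeros omitted."""
    return dict(Counter(hash_word(w, num_features) for w in words))


words = "four score and seven years ago and so on".split()
vector = hashing_tf(words)

# Looking up a search term means hashing it the same way as the documents.
bucket = hash_word("and", 100000)
print(vector.get(bucket, 0))

# With a tiny hash space, distinct words are forced to share buckets,
# which is the hash collision problem mentioned in the lecture.
tiny = hashing_tf(words, num_features=3)
print(len(set(words)), "distinct words in", len(tiny), "buckets")
```

Only buckets that actually occur are stored, which mirrors the sparse vector idea: a word that never appears in a document takes up no space at all.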
78. Using the Spark 2 DataFrame API for MLLib: So in July of 2016, Spark released Spark 2.0, so let's talk about what's new and what new capabilities exist in MLlib now. The main thing with Spark 2.0 is that they're moving people more and more toward DataFrames and Datasets. Datasets and DataFrames are kind of used interchangeably sometimes. Technically, a DataFrame is a Dataset of Row objects. They're kind of like RDDs, but the difference is that whereas an RDD just contains unstructured data, where every row in an RDD can contain pretty much anything, a Dataset has a defined schema. So a Dataset knows ahead of time exactly what columns of information exist in each row of that Dataset, and what types those are. Because it knows about the actual structure of that Dataset ahead of time, it can optimize things more efficiently. It also lets us think of the contents of this Dataset as a little mini database. Well, actually a very big database, if it's on a cluster. And that means we can do things like issue SQL queries on it. So this creates a higher level API with which we can query and analyze massive data sets on a Spark cluster. It's pretty cool stuff. It's faster, it has more opportunities for optimization, and it has a higher level API that's often easier to work with. Going forward in Spark 2.0, MLlib is pushing DataFrames as its primary API. So this is kind of the way of the future here. Let's take a look at how it works. I've gone ahead and opened up the SparkLinearRegression.py file, so let's walk through it a little bit. You see, for one thing, we're using ml instead of mllib, and that's where the new DataFrame-based APIs are. In this example, we're going to implement linear regression, and linear regression is just a way of fitting a line to a set of data.
So what we're going to do in this exercise is take a bunch of fabricated data that we have in two dimensions and try to fit a line to it with a linear model. We're going to separate our data into two sets, one for building the model and one for evaluating the model, and we'll compare how well this linear model does at actually predicting real values. To do that, first of all, in Spark 2, if you're going to be doing stuff with the Spark SQL interface and using Datasets, you've got to be using a SparkSession object instead of a SparkContext. To set one up, you do something like this: spark is going to be the name of our session, and we build it with SparkSession.builder and a config call. Now, this config bit is only necessary on Windows in Spark 2.0. It kind of works around a little bug that they have, to be honest. So if you're on Windows, make sure you have a C:\temp folder. If you want to run this, go create that now if you need to. If you're not on Windows, you can delete that whole bit here that I've highlighted, because it's not going to be necessary and it won't work. Then we give it an appName and call getOrCreate. Now, this is interesting, because once you create a Spark session, if it terminates unexpectedly, you can actually recover from that the next time that you run it. So if we have a checkpoint directory, it can actually restart where it left off using getOrCreate. All right, now we're going to use this regression.txt file that I've included with the course materials. All that is is a text file that has comma-delimited values in two columns, and they're just two columns of more or less randomly linearly correlated data. It can represent whatever you want. Let's imagine that it represents heights and weights, for example. So the first column might represent heights, and the second column might represent weights. In the lingo of machine learning, we talk about labels and features.
Labels are usually the thing that you're trying to predict, and features are a set of known attributes of the data that you use to make a prediction from. So in this example, maybe heights are the labels and the features are the weights. Maybe we're trying to predict heights based on your weight. It could be anything. It doesn't matter. This is all normalized down to data between negative one and one, so there's no real meaning to the scale of the data anywhere. You can pretend it means anything you want, really. To use this with ML, we need to transform our data into the format it expects. So the first thing we're going to do is split that data up with this map function that just splits each line into two distinct values in a list. And then we're going to map that to the format that ML expects. That's going to be a floating point label, and then a dense vector of the feature data. Now, in this case, we only have one bit of feature data, the weight, so we have a vector that just has one thing in it. But even if it's just one thing, the ML linear regression model requires a dense vector there. This is like a LabeledPoint in the older API, but you have to kind of do it the hard way here. Next, we need to actually assign names to those columns, so here's the syntax for doing that. We're going to tell ML that these two columns in my resulting RDD correspond to the label and the features, and then I can convert that RDD to a DataFrame object. So at this point, I have an actual DataFrame, or if you will, a Dataset, that contains two columns, label and features, where the label is a floating point height and the features column is a dense vector of floating point weights. That is the format required by ML, and ML can be pretty picky about this stuff, so it's important that you pay attention to these formats.
All right, now, like I said, we're going to split our data in half, so we're going to do a 50/50 split between training data and test data. This returns back two DataFrames, one that I'm going to use to actually create my model and one that I'm going to use to evaluate my model. I will next create my actual linear regression model with a few standard parameters here that I've set. We're going to call that lir, for linear regression, and then I will fit that model to the set of data that I held aside for training, the training DataFrame. And that gives me back a model that I can use to make predictions from. So let's go ahead and do that. I will call model.transform with testDF, and what that's going to do is predict the heights based on the weights in my testing data. Okay, so in the testing data set I actually have the known labels, the actual correct heights, and this is going to add a new column to that DataFrame called prediction that has the predicted values based on that linear model. I'm going to do a couple of things with this, so I'm going to cache the results, and now I can just extract them and compare them together. So let's pull out the prediction column just using select, just like you would in SQL. And then I'm going to actually transform that DataFrame and pull out the RDD from it, and use that to map it to just a plain old RDD full of floating-point heights, in this case, right? So these are the predicted heights, and then we're going to get the actual heights from the label column, and then we can zip them back together and just print them out side by side and see how well it does. Now, mind you, this is kind of a convoluted way of doing it. I did this to be more consistent with the previous example, but a simpler approach would be to just select prediction and label together into a single RDD that maps out those two columns together. And then I don't have to zip them up. But either way works.
Let's see if it works. So to run this, again, we'll open up an Anaconda prompt, and we will cd to our course materials folder, and let's type in spark-submit spark-linear-regression.py and let it do its thing. So again, we're using the new ML interface that's DataFrame-based, and by the way, if you're using Scala instead of Python, you'd probably be using Datasets instead of DataFrames; it's just different terminology there. And there we have the results. Each result here is the predicted value, followed by the actual value. You can see that, by and large, they're pretty close. I mean, it's not perfect, but the model at least did something reasonable. So that's it. That's cool. So, yeah, we just did a linear regression using Spark's new ML API. Well, relatively new as of 2016. You know, several years have passed since then, but this is still kind of the way of the future that they're pushing people toward. Again, RDDs still work, but the DataFrame-based API is sort of where they're concentrating their efforts these days, so it's worth understanding it. There you have it: ML in action in Spark.
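For reference, the script walked through in this lecture can be sketched roughly as follows. This is a minimal sketch and not the course's actual file: the function names parse_line and run_linear_regression, the hyperparameter values, and the default file path are assumptions made for illustration, and the Windows-only setting is shown commented out because its exact form is also an assumption. It takes the simpler route mentioned above of selecting prediction and label together instead of zipping two RDDs.

```python
def parse_line(line):
    """Split one 'label,feature' line into (label, [feature]) as plain floats."""
    fields = line.split(",")
    return float(fields[0]), [float(fields[1])]


def run_linear_regression(path="regression.txt"):
    """Fit on half the data, predict on the other half, return (predicted, actual) pairs."""
    # Imported lazily so parse_line can be used without a Spark installation.
    from pyspark.sql import SparkSession
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.regression import LinearRegression

    spark = (
        SparkSession.builder
        # Windows-only workaround mentioned in the lecture (assumed form):
        # .config("spark.sql.warehouse.dir", "file:///C:/temp")
        .appName("LinearRegression")
        .getOrCreate()
    )

    # Each row becomes (float label, DenseVector of features).
    lines = spark.sparkContext.textFile(path)
    rows = lines.map(parse_line).map(lambda p: (p[0], Vectors.dense(p[1])))
    df = rows.toDF(["label", "features"])

    # 50/50 split between training and test data.
    train_df, test_df = df.randomSplit([0.5, 0.5])

    # Hyperparameter values here are illustrative assumptions.
    lir = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
    model = lir.fit(train_df)

    # transform() adds a "prediction" column next to the known labels.
    results = model.transform(test_df).select("prediction", "label").collect()
    pairs = [(row["prediction"], row["label"]) for row in results]

    spark.stop()
    return pairs
```

Calling run_linear_regression() needs a working Spark installation and the data file; with spark-submit you would add a short main block that prints each returned pair.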
79. Deploying Models to Production: The question I often get is: it's all well and good to train these models and deploy them within a Jupyter notebook, but how would I use these models in the real world? How do I actually connect the output of these machine learning models to a real application, like a mobile phone app or a website or something like that? Well, that's a whole world of its own. You know, we're getting into the world of actually designing larger systems here, but I'll give you some high-level guidance anyway. Now, obviously your external applications aren't going to be running Jupyter notebooks and getting results that way, locally. What we need to do is separate the process of training and tuning our model from the process of actually making predictions based on that model. So the training part can all be done offline, right? We can do that within our notebook, or we can export a standalone Python script that gets run periodically, perhaps to pick up new training data as it comes in. Maybe it's even running in some sort of a streaming environment, but that can all happen on the back end. You can still use your notebooks for that if you want to refine it. But then when you actually have a trained model, remember, we just have a model, like a classifier, that's sitting there in scikit-learn at that point, and all we have to do is call predict on it to actually get a result. So it's possible to actually export that model to a file and execute that model on a web service. The idea would be to push the model out to a web service fleet. You know, this could be something in the cloud somewhere where it's just running a distributed set of services, where hopefully you don't have to take care of the actual servers themselves, and all that service does is respond to web service requests over a REST or some other interface.
The request says, here's the feature data that I want a prediction for, give me a prediction, and that pre-trained model that's deployed to that entire web service can then quickly provide that result back at large scale, hopefully at low latency and high transaction rates, reliably. So your app would call the web service that actually just generates predictions based on the model, but the model itself is created offline, and then the result of that, the model itself, is pushed to the web service. Let's talk about some more specific examples here to make it real. Let's say you're using Google Cloud services. One way you could do this very easily is to just use the sklearn.externals joblib module here. With that, you can just say joblib.dump with whatever your model is. You know, any classifier, which could be anything, right? K-means, or you can even have a deep learning classifier in there if you want as well, and scikit-learn can dump that out to a file after it's been trained. You just give it a file name, and for Google Cloud, it wants it to be called model.joblib. You then upload the resulting model.joblib, which contains the model itself, to Google Cloud Storage, and from there you can just say this is going to run within the scikit-learn framework, and it will know what to do with it. At that point, you can just tie Cloud ML Engine into it, the Google Cloud ML Engine, and that will expose a REST API that you can call to make predictions in real time based on that model that you uploaded to Google Cloud. See how that works? So basically, you would train your classifier offline in a notebook or however you want to do it, export that classifier to a file, upload that file into Google Cloud Storage, and then Google Cloud's ML Engine can actually connect that with your applications over a REST API.
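To make the export step concrete, here is a minimal sketch of dumping a trained scikit-learn model to a file called model.joblib and loading it back, the way a serving process would. The toy training data and the variable names are made up for illustration. One assumption to flag: newer scikit-learn releases no longer ship sklearn.externals.joblib, so this sketch imports the standalone joblib package directly. The upload to cloud storage is left out.

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LogisticRegression

# Toy training data, made up for illustration: one feature, two classes.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

# "Offline" step: train the model, then export it to a file.
clf = LogisticRegression().fit(X, y)
model_dir = tempfile.mkdtemp()
model_path = os.path.join(model_dir, "model.joblib")
joblib.dump(clf, model_path)

# "Serving" step: a separate process would load the file and call predict.
restored = joblib.load(model_path)
original_preds = list(clf.predict(X))
restored_preds = list(restored.predict(X))
print(original_preds == restored_preds)
```

The point of the round trip is that the serving side only needs the file and a compatible scikit-learn version; it never sees the training data.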
And if you don't know what a REST API is, it's basically built on the same protocol that you use for looking at websites. So when you go to your web browser and say, I want to look at this URL, that's sending a GET request to a server somewhere that says, I want to get the results of this URL, and it gives it back. In the same exact way, your application would say, I want to get a prediction for this set of features, and give it back to me, please. Same exact mechanism. You could get a lot more complicated, too. Let's say that you're using Amazon Web Services and you want to build an entire end-to-end system that makes product recommendations. One way to do this in AWS would be, if you have a fleet of servers that are generating order data, you know, if you're actually monitoring the servers where people are placing orders, we could have a service called Kinesis Data Firehose running on top of those servers that is feeding that log data into Amazon S3 storage. And from there, Amazon's Elastic MapReduce service could be consuming that log data from S3. Maybe that contains all the purchase information or all the view information or all the ratings information for the stuff that we want to recommend. And on Amazon Elastic MapReduce, we could be running Apache Spark across an entire cluster that's consuming that data from S3 and creating recommendations based on that data. Now, in this case, I'm not going to push the actual model itself out to a web service. I'm just going to push the results of it out. So what I could do is pre-generate recommendations for every user ahead of time. So after consuming all of the latest data from S3, all the latest purchase data or ratings data or whatever you have to work with, I can go off and use Apache Spark to generate recommended items that people might like for every user in my system.
I could then publish that to something like Amazon DynamoDB, which is just a NoSQL database that will allow me to very quickly associate a list of item identifiers with a user identifier, and that will be a very fast way of looking up what items I should recommend for a given user. So I take the output of the model, and I publish the output to something more scalable, like DynamoDB. DynamoDB is horizontally scalable, so it can handle very high transaction rates at very low latency. Then, to expose that to the outside world, I could use something like AWS Lambda, which is their serverless functionality that just lets you write very simple functions that would access that DynamoDB database for a given user and retrieve the results back for some application. And it does it in such a manner that you don't have to worry about actually provisioning server capacity to run that. AWS takes care of all that for you. In front of that, you might have something called Amazon API Gateway, which actually would provide the REST interface that your mobile apps or websites would actually talk to in order to retrieve that data. So, looking at it from the other direction, your client application could say, I want to get recommendations for this user ID. It would say, okay, Amazon API Gateway, give me recommendations for this user ID, through some REST query. Behind the scenes, that would transfer that request to AWS Lambda, which again has to be scalable. Lambda would say, okay, I'm going to take this and execute a little bit of JavaScript code or something to retrieve the actual item IDs in DynamoDB for that user ID. And that was all pre-generated by Apache Spark running on EMR, creating recommendations off of Amazon S3. So this is a more complicated way of doing it, but it's more of an example of the sort of things you might see in the real world. You know, these systems do tend to get rather complicated once you start dealing with massive scale.
You need to deal with the cloud at some point, and you have to deal with technologies like Apache Spark that can scale up to massive data sets. So this is sort of a more advanced example, if you will, but a very real one, of how you might actually implement something like that in the real world. Again, the key is that the actual generation of the model is separate from the vending of the data, the results from it. So on the top row there, that's the process of actually building the recommendations themselves. But then we push the actual recommendations down to something that's more scalable, to DynamoDB, fronted by Lambda, which is ultimately fronted by Amazon API Gateway. And then we have an end-to-end system that can actually handle vending the results of our model at massive scale. There are other ways of doing it as well. You could just write your own web service if you wanted to, using Flask or whatever web service framework you like. You would then obviously have to provision your own servers and maintain them, which isn't a whole lot of fun. I mean, that's why people use services like AWS and Google Cloud and Azure these days. Also, you can go all in with a platform. All these cloud providers tend to have their own systems and technologies for doing machine learning these days, so below here is a partial list of the AWS services that are available in the machine learning space right now, for example. So, you know, if you just want to do speech recognition or image recognition, they have services that are available off the shelf that will just do that for you. And you can obviously integrate those very easily with other AWS services to build larger systems like the ones that we've been looking at here, so that's something to consider as well. It's always going to be a good idea to sort of prototype new ideas offline in a notebook or something like that.
But ultimately, the way that you implement them for a larger system could be very different. So I hope that gives you sort of a high-level idea of how you might actually go about exposing your models to a real-world system. Again, the key insight is to separate the generation of the model itself and its results from the problem of actually serving those results to a massive fleet of consumers of that data. Whether you go with AWS or Google Cloud services or roll your own or use Microsoft Azure, I mean, those are all the topics of entire courses of their own. I can't cover that in a whole lot of depth right now, but at least that gives you a starting point of where to explore if you are in a position of trying to deploy these results to someplace real.
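The "serve precomputed results, not the model" idea from the AWS example can be sketched in a few lines. This is only an illustration under stated assumptions: the in-memory dictionary is a hypothetical stand-in for a key-value table like DynamoDB, the handler signature loosely imitates a serverless function, and the user and item IDs are made up. No real AWS API is called.

```python
import json

# Hypothetical stand-in for the key-value table that the offline batch job
# (Spark, in the lecture's example) would have filled in ahead of time.
PRECOMPUTED_RECS = {
    "user-1": ["item-42", "item-7", "item-19"],
    "user-2": ["item-3"],
}


def get_recommendations(user_id, table=PRECOMPUTED_RECS):
    """Look up the pre-generated item IDs for a user; unknown users get an empty list."""
    return table.get(user_id, [])


def handler(event, context=None):
    """Serverless-style entry point: the event carries a user ID, the response is JSON."""
    user_id = event.get("user_id")
    if user_id is None:
        return {"statusCode": 400, "body": json.dumps({"error": "user_id is required"})}
    items = get_recommendations(user_id)
    return {"statusCode": 200, "body": json.dumps({"user_id": user_id, "items": items})}


print(handler({"user_id": "user-1"}))
```

Nothing in the request path touches the model; the serving side is a plain key lookup, which is why it can scale independently of training.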
80. A/B Testing Concepts: If you work as a data scientist at a web company, you'll probably be asked to spend some time analyzing the results of A/B tests. These are basically controlled experiments on a website to measure the impact of a given change. So let's talk about what A/B tests are and how they work. Let's talk about A/B testing. If you're going to be a data scientist at a big tech web company, this is something you're definitely going to be involved in, because people need to run experiments to try different things on a website and measure the results of it. And that's actually not as straightforward as most people think it is. So let's talk about what A/B tests are and what the challenges surrounding them are. So what is an A/B test? Well, it's a controlled experiment that you usually run on a website. I mean, it could be applied to other contexts as well, but usually we're talking about a website, and what we're trying to do is test the performance of some change to that website versus the way it was before. So you have basically a control set of people that see the old website and a test group of people that see the change to the website. And the idea is to measure the difference in behavior between these two groups and use that data to actually decide whether this change was beneficial or not. So, for example, I own a business that has a website. We license software to people, and right now I have a nice, friendly orange button that people click on when they want to buy a license. Maybe blue would be better. How do I know? I mean, intuitively, maybe that might catch people's attention more, or intuitively, maybe people are more used to seeing orange buy buttons and are more likely to click on that. I could spin that either way, right? So my own internal biases or preconceptions don't really matter. What matters is how people actually react to this change on my actual website, and that's what an A/B test does.
It will actually split people up so that some people see the orange button and some people see the blue button, and I can then measure the behavior between these two groups and how they might differ, and make my decision on what color my button should be based on that data. You can test all sorts of things with an A/B test. We can talk about design changes, so, you know, the color of a button, the placement of a button, the layout of a page, what have you. It might be a whole UI flow. So maybe you're actually changing the way that your purchase pipeline works and how people check out on your website, and you can actually measure the effect of that. Algorithmic changes: let's go back to the example of doing movie recommendations. Maybe I want to test one algorithm versus another. And instead of relying on error metrics and my ability to do train/test, you know, what I really care about is driving purchases or rentals or whatever it is on this website. An A/B test can let me directly measure the impact of this algorithm on the end result that I actually care about, and not just my ability to predict movies that other people have already seen. Pricing changes: this one gets a little bit controversial. You know, in theory, you can actually experiment with different price points using an A/B test and see if it actually increases volume to offset the price difference or whatever. But use that one with caution. If customers catch wind that other people are getting better prices than they are for no good reason, they're not going to be very happy with you. So keep in mind, doing pricing experiments can have a negative backlash, and you don't want to be in that situation. Anything else you can dream up, too. Really, any change that impacts how users interact with your site is worth testing. Maybe it's even, you know, making the website faster. It could be anything.
So the first thing you need to figure out when you're designing an experiment on a website is what you are trying to optimize for. What is it that you really want to drive with this change? And this isn't always a very obvious thing, right? Maybe it's the amount that people spend, the amount of revenue. OK, well, we'll talk about the problems with variance in using amount spent, but if you have enough data, you can still reach convergence on that metric a lot of the time. But maybe that's not what you actually want to optimize for. Maybe you're actually selling some items at a loss intentionally just to capture market share. Or, you know, there's more complexity that goes into your pricing strategy than just top-line revenue. Maybe what you really want to measure is profit. And that could be a very tricky thing to measure, because a lot of things cut into how much money a given product might make, and those things might not always be obvious. And again, if you have loss leaders, this experiment will discount the effect that those are supposed to have. So, bottom line, you have to talk to the business owners of the area that's being tested and figure out what it is they're trying to optimize for. What are they being measured on? What is their success measured on? What are their, you know, key performance indicators, or whatever the MBAs want to call it? And make sure that we're measuring the thing that actually matters to them. Okay, maybe you just care about driving ad clicks on your website, or order quantities to reduce variance. Maybe people are okay with that. And, you know, you can measure more than one thing at once, too. You don't have to pick one. You can actually report on the effect on many different things: revenue, profit, clicks, ad views. And if these things are all moving in the right direction together, that's a very strong sign that this change had a positive impact in more ways than one, right? So why limit yourself to one metric?
Just make sure you know which one matters the most and what's going to be your criterion for success of this experiment ahead of time. Another thing to watch out for is attributing conversions to a change downstream. So if the action you're trying to drive doesn't happen immediately upon the user experiencing the thing that you're testing, things get a little bit dodgy. So let's say I change the color of a button on page A. The user then goes to page B and does something else, and ultimately buys something from page C. Well, who gets credit for that purchase? Is it page A or page B or something in between? Do I discount the credit for that conversion depending on how many clicks that person took to get to the conversion action? Do I just discard any conversion action that doesn't happen immediately after seeing that change? These are complicated things. And, you know, it's very easy to produce misleading results by fudging how you account for these different distances between the conversion and the change that you're measuring. So keep that in mind, too. Another thing that you need to really internalize is that variance is your enemy when you're running an A/B test. So a very common mistake people make who don't know what they're doing with data science is they will put up a test on a web page, you know, blue button versus orange button, whatever it is, run it for a week, and take the mean amount spent from each of those groups. And they say, oh, look, the people with the blue button on average spent a dollar more than the people with the orange button. Blue is awesome. I love blue. I'm going to put blue all over the website now. But in fact, all they might have been seeing was just random variation in purchases. You know, they didn't have a big enough sample, because people don't tend to purchase a lot. You know, you get a lot of views, but you probably don't have a lot of purchases on your website in comparison.
And there's probably a lot of variance in those purchase amounts, because different products cost different amounts, so you could very easily end up making the wrong decision that ends up costing your company money in the long run instead of earning your company money if you don't understand the effect of variance on these results. Shortly we'll talk about some principled ways of measuring and accounting for that. And you need to make sure that your business owners understand that this is an important effect that you need to quantify and understand before making business decisions following an A/B test or any experiment that you run on the web. Now, sometimes you need to choose a conversion metric that has less variance. You know, it could be that the numbers on your website just mean that you would have to run an experiment for years in order to get a significant result based on something like revenue or amount spent. So sometimes, if you're looking at more than one metric, something like order quantity has less variance associated with it than order amount, and so you might see a signal on order quantity before you see a signal on revenue, for example. And at the end of the day, it ends up being a judgment call. You know, if you see a significant lift in order quantities and maybe a not-so-significant lift in revenue, then you have to say, well, I think there might be something real that's beneficial going on here. But at the end of the day, the only thing that statistics and data science can tell you are probabilities that an effect is real. It's really up to you to decide whether or not it's real at the end of the day. So let's talk about how to do this in more detail. So that's an introduction to A/B tests. The key takeaway there is that just looking at the differences in means isn't enough. When you're trying to evaluate the results of an experiment, you need to take the variance into account as well.
So let's go into some examples in our next lecture of how you actually measure the effects of variance using the T statistic and P value metrics.
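To get a feel for how misleading a raw difference in means can be, here is a small simulation sketch. Both groups see exactly the same site, so any difference between them is pure noise. Every number in it (the 2% purchase rate, the price list, 2,000 visitors per group, the 10% threshold) is made up for illustration and does not come from real data.

```python
import random


def mean_spend_per_visitor(rng, visitors, purchase_rate=0.02):
    """Simulate one group: few visitors buy, and prices vary a lot."""
    total = 0.0
    for _ in range(visitors):
        if rng.random() < purchase_rate:
            total += rng.choice([5.0, 20.0, 50.0, 200.0])
    return total / visitors


def fraction_of_misleading_runs(trials=200, visitors=2000, threshold=0.10, seed=1):
    """Share of experiments where two identical groups differ by more than `threshold`."""
    rng = random.Random(seed)
    misleading = 0
    for _ in range(trials):
        a = mean_spend_per_visitor(rng, visitors)
        b = mean_spend_per_visitor(rng, visitors)
        baseline = max(a, b)
        if baseline > 0 and abs(a - b) / baseline > threshold:
            misleading += 1
    return misleading / trials


fraction = fraction_of_misleading_runs()
print(f"Identical groups differed by more than 10% in {fraction:.0%} of runs")
```

Try raising visitors or swapping the price list for a single fixed price to see how sample size and purchase variance change the result.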
81. T-Tests and P-Values: So how do I know if a change resulting from an A/B test is real, if it's actually a real result of what I changed or just random variation? Well, there are a couple of statistical tools at our disposal called the T statistic and the P value. So let's learn more about what those are and how they can help you determine whether an experiment is good or not. So, as I said in our previous lecture, variance is your enemy when you're running an A/B test. So how do we account for that? Well, there are some statistical tools at our disposal, called the t-test, or the T statistic more specifically, and the P value, which allow us to quantify the effect of variance on our results and make a decision that takes that variance into account. So the point is to figure out if a result is real or not. Was this just a result of random variance that's inherent in the data itself? Or are we seeing an actual, statistically significant change in behavior between our control group and our test group? T-tests and P values are a way to compute that. And remember, again, statistically significant doesn't really have a specific meaning. At the end of the day, it has to be a judgment call. You have to pick some probability value that you're going to accept of a result being real or not. But there's always going to be a chance that it's still a result of random variation, and you have to make sure your stakeholders understand that. So let's start with the T statistic. Also known as a t-test, it is basically a measure of the difference in behavior between these two sets, between your control and treatment group, expressed in units of standard error. So it is based on standard error, which accounts for the variance inherent in the data itself. By normalizing everything by that standard error, we get some measure of the change in behavior between these two groups that takes that variance into account.
So the way to interpret a T statistic is that a high T value means there's probably a real difference between these two sets. A low T value means not so much. So you have to decide, you know, what's a threshold you're willing to accept? And the sign of the T statistic will tell you if it's a positive or a negative change. So if you're comparing your treatment group to your control and you end up with a negative T statistic, that implies that this is a bad change, if the absolute value of that T statistic is large. How large is large? Well, that's debatable. We'll look at some examples shortly. Now, this does assume that you have a normal distribution of behavior. And when we're talking about things like the amount people spend on a website, that's usually a decent assumption. There does tend to be a normal distribution of how much people spend. However, there are more refined versions of the T statistic that you might want to look at for other specific situations. For example, there's something called Fisher's exact test for when you're talking about click-through rates, the E-test for when you're talking about transactions per user, like how many web pages did they see, and the chi-squared test, which is often relevant for when you're looking at order quantities. So sometimes you'll want to look at all of these statistics for a given experiment and choose the one that actually fits what you're trying to do the best. Now, it's a lot easier to talk about P values than T statistics, since you don't have to think about how many standard deviations we are talking about and what the actual value means. The P value is a little bit easier for people to understand, which makes it a better tool for you to communicate the results of an experiment to the stakeholders in your business. So think of the P value in terms of the null hypothesis, which says that there is no real difference between the control and treatment groups' behavior.
The P value is basically the probability of seeing a difference at least as large as the one you measured if that null hypothesis were true, that is, if random variation were all that was going on. So a low P value means there's a low probability that random variation alone would produce your result. There's kind of a double negative going on there, so it's a little bit counterintuitive, but at the end of the day, you just have to understand that a low P value is strong evidence that your change had a real effect. So what you want to see is a high T statistic and a low P value, and that will imply a significant result. Now, before you start your experiment, you need to decide what your threshold for success is going to be. You know, decide that with the people in charge of the business. So what P value are you willing to accept as a measure of success? Is it 1%? Is it 5%? And again, this is basically how likely it is that you'd see a result like this when there is no real effect and it's just a result of random variance. It's just a judgment call at the end of the day. A lot of times people use 1%. Sometimes they use 5% if they're feeling a little bit more risky. But there's always going to be that chance that your result was just spurious random data that came in. But you can choose the probability that you're willing to accept as being likely enough that this is a real effect that's worth rolling out into production. So when your experiment is over, and we'll talk about when you declare an experiment to be over, you want to measure your P value. If it's less than the threshold you decided upon, then you can reject the null hypothesis, and you can say, well, there's a high likelihood that this change produced a real positive or negative result. If it is a positive result, then you can roll that change out to the entire site, and it is no longer an experiment. It's part of your website that will hopefully make you more and more money as time goes on. And if it's a negative result, you want to get rid of it before it costs you any more money.
So remember, there's a real cost to running an A/B test when your experiment has a negative result, so you don't want to run it for too long, because there's a chance you could be losing money. And this is why you want to monitor the results of an experiment on a daily basis. So if there are early indications that a change is making a horrible impact on the website, maybe there's a bug in it or something that's horrible, you can pull the plug on it prematurely if necessary and limit the damage. So let's go to an actual example and see how you might measure T statistics and P values using Python, up next. So that's the t-test, the T statistic and the P value: useful tools for determining whether a result is actually real or a result of random variation. So next, let's dive into some real examples and get our hands dirty with some Python code and compute these things.
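To make "difference in means expressed in units of standard error" concrete, here is a minimal standard-library sketch. Two assumptions to flag: it uses the unequal-variance (Welch) form of the standard error, which matches the pooled form when both groups have the same size, and the P value uses a normal approximation that is only reasonable for large samples. A statistics library would use the exact t distribution instead. The function names and the sample data are made up for illustration.

```python
import math
import random
import statistics


def t_statistic(a, b):
    """Difference in means divided by its standard error."""
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def approx_p_value(t):
    """Two-sided P value under a normal approximation (large samples only)."""
    return math.erfc(abs(t) / math.sqrt(2))


# Made-up example: the treatment spends slightly less on average than the control.
rng = random.Random(0)
treatment = [rng.gauss(25.0, 5.0) for _ in range(5000)]
control = [rng.gauss(26.0, 5.0) for _ in range(5000)]

t = t_statistic(treatment, control)
p = approx_p_value(t)
print(f"t = {t:.2f}, approximate p = {p:.3g}")
```

A negative t here means the treatment group did worse than the control, and the small P value says a gap this size would be very unlikely from random variation alone.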
82. Hands-On with T-Tests: Let's fabricate some experimental data and use the T statistic and P value to determine whether or not a given experimental result is a real effect. All right, let's get our hands dirty with some t-tests. We're going to actually fabricate some fake experimental data and run T statistics and P values on them, and see how it works and how to compute it in Python. So let's start by fabricating some data here. Let's imagine that we're running an A/B test on a website and we have randomly assigned our users into two groups, group A and group B. Group A is going to be our test subjects, our treatment group, and group B will be our control, basically the way the website used to be. So in this example, our treatment group is going to have a randomly distributed purchase behavior where they spend on average $25 per transaction, with a standard deviation of five and 10,000 samples, whereas the way the website used to be had a mean of $26 per transaction, with the same standard deviation and sample size. So we're basically looking at an experiment that had a negative result here, and all you have to do to figure out the T statistic and the P value is use this handy-dandy stats.ttest_ind method from SciPy. It's just that simple. So what you do is you pass in your treatment group and your control group, and out comes your T statistic, which in this case is negative 13. The negative indicates that it is a negative change. This was a bad thing. And the P value is very, very small. So that implies that there is an extremely low probability that this change is just a result of random chance. So remember, in order to declare significance, we need to see a high T statistic and a low P value. And that's exactly what we're seeing here.
We're seeing negative 13 which is a very hi absolute value, the T statistic negative, indicating that's a bad thing and an extremely little P value telling us that there's virtually no chance that this is just a result of random variation. So if you saw these results in the real world, you would pull the plug on this experiment as soon as you could. Okay, so just is a sandy check. Let's go ahead and change things so that there's actually no riel change difference between these two groups. So I'm gonna change Group B. The control group in this is case to be the same as a treatment where the mean is 25 the standard deviation that is unchanged in the sample size of sun chain. So if you go ahead and run that you got to see that our T test ends up being below one now . So you remember this is in terms of standard deviation, so that implies, Well, there's probably not a real change there. It must have a much higher P value as well over 30%. Now, these air still relatively high ish numbers. So you know, you can see that random variation can be kind of an insidious thing. This is why you need to decide ahead of time. What is an acceptable limit for P value here? You know, you could look at this after the fact and say, 30% odds that it's, you know, that's not so bad. We can live with that. But no, I mean in reality, in practice, you want to see P valleys that are below 5% ideally below 1% and a value of 30% means it's actually not that strong of a result. So don't justify it after the fact. Go into your experiment to knowing what your threshold is. Let's do some changes here, and, you know, again, we're creating these sets under the same conditions. Let's see if we actually get a difference in behavior by increasing the sample size, so we're gonna go from 210,000 samples. You can see here that actually the P value got a little bit lower and the T test a little bit larger, but it's still not have to declare a real difference. So you know, you would expect that Teoh. 
It's actually going in the direction you wouldn't expect it to go, right? Kind of interesting. But these are still high values. Again, it's just the effect of random variance, and it can have more of an effect than you realize, especially on a website when you're talking about order amounts. Let's actually go to 1,000,000. What is that? Well, now we're back under one for the t-statistic and our p-value is around 35%. So we're seeing these fluctuate a little bit in either direction as we increase the sample size. This means that going from 10,000 samples to 100,000 to 1,000,000 isn't going to change your result at the end of the day. And running experiments like this is a good way to get a gut feel as to how long you might need to run an experiment for: how many samples does it actually take to get a significant result? And if you know something about the distribution of your data ahead of time, you can actually run these sorts of models. Now, as a sanity check, if we were to actually compare the set to itself, this is called an A/A test, and we'll talk about that more later. We should see a t-statistic of zero and a p-value of 1.0, because there is in fact no difference whatsoever between these sets. Now, if you were to run that using real website data, where you know you're looking at the same exact people, and you saw a different value, that indicates there's a problem in the system itself that runs your testing. All right. So, at the end of the day, like I said, it's all a judgment call. So go ahead and play with this. See what effect different standard deviations have on the initial data sets, or different differences in means, or different sample sizes. I just want you to dive in, play around with these different data sets, actually run them, and see what the effect is on the t-statistic and the p-value.
And hopefully that will give you more of a gut feel for how to interpret these results. But again, the important thing to understand is that you're looking for a large t-statistic and a small p-value. The p-value is probably going to be what you want to communicate to the business. And remember, lower is better for p-value. You want to see that in the single digits, ideally below 1%, before you declare victory. Okay, let's talk about A/B tests some more in our next lecture. So there you have it. SciPy makes it really easy to compute t-statistics and p-values for a given set of data, so you can very easily compare the behavior between your control and treatment groups and measure what the probability is of that effect being real or just a result of random variation. So make sure you are focusing on those metrics, and that you are measuring the conversion metric that you care about when you're doing those comparisons.
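Here is a minimal sketch of the experiment described in this lecture, assuming NumPy and SciPy are installed. The helper name run_ab_test and the random seed are illustrative choices rather than anything from the course notebook, and the exact numbers you get will differ from the ones quoted in the lecture because the data is randomly generated.

```python
import numpy as np
from scipy import stats

# Fixed seed so the sketch is repeatable; the value itself is arbitrary.
rng = np.random.default_rng(42)

def run_ab_test(treatment_mean, control_mean, std=5.0, n=10_000):
    """Fabricate normally distributed order amounts and compare the two groups."""
    treatment = rng.normal(treatment_mean, std, n)
    control = rng.normal(control_mean, std, n)
    # ttest_ind takes the two samples and returns (t-statistic, p-value).
    t_stat, p_value = stats.ttest_ind(treatment, control)
    return t_stat, p_value

# Treatment spends $25 on average, control $26: a real negative effect.
t_neg, p_neg = run_ab_test(25.0, 26.0)
print("negative change:", t_neg, p_neg)

# Both groups drawn from the same distribution: no real difference.
t_same, p_same = run_ab_test(25.0, 25.0)
print("no real change: ", t_same, p_same)

# A/A sanity check: compare a sample to itself.
same = rng.normal(25.0, 5.0, 10_000)
t_aa, p_aa = stats.ttest_ind(same, same)
print("A/A test:       ", t_aa, p_aa)
```

Try changing std, n, or the two means and rerunning to see how the t-statistic and p-value respond.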
83. Determining How Long to Run an Experiment: So how long do you run an experiment for? How long does it take to actually get a result? At what point do you give up? Let's talk about that next. This is going to be a really quick lecture. I just want to spend a couple of minutes talking about how to decide when an experiment is over. So if someone in your company has developed a new experiment, a new change that they want to test, they have a vested interest in seeing that succeed. They put a lot of work and time into it, and they want it to be successful. And maybe you've gone weeks and you still haven't reached a significant outcome on this experiment, positive or negative. They're going to want to keep running it pretty much indefinitely in hopes that it will eventually show a positive result. It's up to you to draw the line on how long you're willing to run this experiment for. So how do I know when I'm done running an A/B test? I mean, it's not always straightforward to predict how long it will take before you can achieve a significant result. But obviously, if you have achieved a significant result, if your p-value has gone below 1% or 5% or whatever threshold you've chosen, you're done. At that point you can pull the plug on the experiment and either roll out the change more widely or remove it because it was actually having a negative effect. You can always tell people to go back and try again, to use what they learned from the experiment to maybe try it again with some changes. And that's fine. That softens the blow a little bit. But the other thing that might happen is that it's just not converging at all. If you're not seeing any trends over time in the p-value, it's probably a good sign that you're not going to see this converge anytime soon. It's just not going to have enough of an impact on behavior to even be measurable, no matter how long you run it.
So what you want to do is, every day, plot on a graph for a given experiment the p-value, the t-statistic, whatever you're using to measure the success of this experiment. And if you're seeing something that looks promising, you will see that p-value start to come down over time. So the more data it gets, the more significant your results should be getting. Now, if you see instead a flat line or a line that's all over the place, that kind of tells you that the p-value is not going anywhere, and it doesn't matter how long you run this experiment. It's just not going to happen. So you need to agree up front: in the case where you're not seeing any trends in p-values, what's the longest you're willing to run this experiment for? Is it two weeks? Is it a month? Keep in mind that having more than one experiment running on the site at once can conflate your results. So time spent on experiments is a valuable commodity. You can't make more time in the world, so you can only really run as many experiments as you have time to run in a given year. So if you spend too much time running one experiment that really has no chance of converging on a result, that's an opportunity you've missed to run another, potentially more valuable experiment during the time that you were wasting on this other one. So it's important to draw the line on experiment lengths, because time is a very precious commodity when you're running A/B tests on a website, at least as long as you have more ideas than you have time, which hopefully is the case. And that's it. It's a little parable: make sure you go in with agreed upper bounds on how long you're going to spend testing a given experiment. And if you're not seeing trends in the p-value that look encouraging, it's time to pull the plug at that point. So remember, time is a precious commodity.
When you're experimenting on a website, you can only run so many experiments at once, usually one, and the time that you waste trying to wait for a result that will never come on an experiment is time you could have spent trying an experiment that actually did make a positive difference. So choose your time limits wisely.
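The daily tracking idea from this lecture could be sketched roughly like this, assuming NumPy and SciPy are installed. The function names, the 14-day limit, the 1% threshold, and the simulated order amounts are all hypothetical choices for illustration; in practice the per-day samples would come from your own experiment logs.

```python
import numpy as np
from scipy import stats

def cumulative_p_values(treatment_by_day, control_by_day):
    """Return the p-value after each day, using all data collected so far."""
    p_values = []
    treatment_so_far, control_so_far = [], []
    for t_day, c_day in zip(treatment_by_day, control_by_day):
        treatment_so_far.extend(t_day)
        control_so_far.extend(c_day)
        _, p = stats.ttest_ind(treatment_so_far, control_so_far)
        p_values.append(p)
    return p_values

def decide(p_values, threshold=0.01, max_days=14):
    """Stop at the first significant day, or give up at the agreed upper bound."""
    for day, p in enumerate(p_values[:max_days], start=1):
        if p < threshold:
            return f"significant on day {day}"
    return "stop: no significant result within the agreed limit"

# Simulated data: 14 days, 500 orders per group per day (made-up numbers).
rng = np.random.default_rng(0)
treatment_by_day = [rng.normal(25.5, 5.0, 500) for _ in range(14)]
control_by_day = [rng.normal(25.0, 5.0, 500) for _ in range(14)]

p_values = cumulative_p_values(treatment_by_day, control_by_day)
for day, p in enumerate(p_values, start=1):
    print(day, p)
print(decide(p_values))
```

The list of daily p-values is what you would put on the graph: a promising experiment trends downward, while a flat or erratic line is the cue to stop at the limit you agreed on up front.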
84. A/B Test Gotchas: An important point I want to make is that the results of an A/B test, even when you measure them in a principled manner using p-values, are not gospel. There are many effects that can actually skew the results of your experiment and cause you to make the wrong decision. So let's go through a few of those and let you know how to watch out for them. So let's talk about some gotchas with A/B tests. It sounds really official to say there's a p-value of 1%, meaning there's only a 1% chance that a given experiment was due to spurious results or random variation. But it's still not the be-all and end-all of measuring success for an experiment. There are many things that can skew or conflate your results that you need to be aware of. So even if you see a p-value that looks very encouraging, your experiment could still be lying to you, and you need to understand the things that could make that happen so you don't make the wrong decisions. Remember, correlation does not imply causation. Even with a well-designed experiment, all you can say is there is some probability that this effect was caused by the change you made. At the end of the day, there's always going to be a chance that there was no real effect, or you might even be measuring the wrong effect. It could still be random chance; there could be something else going on. It's your duty to make sure that the business owners understand that these experimental results need to be interpreted. They need to be one piece of their decision. They can't be the be-all and end-all that they base the decision on, because there is room for error in the results and there are things that can skew those results. And if there's some larger business objective to this change beyond just driving short-term revenue, that needs to be taken into account as well.
One problem is novelty effects. The Achilles' heel of an A/B test is the short time frame over which they tend to be run, and this causes a couple of problems. First of all, if there are longer-term effects of this change, you're not going to measure those. But also, there is a certain effect from just something being different on the website. So maybe your customers are used to seeing orange buttons all the time, and that blue button comes up and it catches their attention just because it's different. But as new customers come in who have never seen your website before, they don't notice that as being different, and over time even your old customers get used to the new blue button. It could very well be that if you were to run the same test a year later, there would be no difference. Or maybe it would be the other way around. I could very easily see a situation where you test orange button versus blue button, and in the first two weeks the blue button wins. People buy more because they are more attracted to it, because it's different. But a year goes by, and I could probably run another web lab that puts that blue button against an orange button, and the orange button would win again, simply because the orange button is different and it's new and it catches people's attention for that reason alone. So for that reason, if you do have a change that is somewhat controversial, it's a good idea to rerun that experiment later on and see if you can actually replicate its results. That's really the only way I know of to account for novelty effects: you actually measure it again when it's no longer novel, when it's no longer just a change that might capture people's attention simply because it's different. And I really can't overstate the importance of understanding this. This could really skew a lot of results. It biases you toward attributing positive changes to things that don't really deserve it.
You know, being different in and of itself is not a virtue, at least not in this context. Another problem: seasonal effects. If you're running an experiment over Christmas, people don't tend to behave the same during Christmas as they do the rest of the year. They definitely spend their money differently during that season. They're spending more time with their families at home. They might be a little bit checked out of work, so people have a different frame of mind. It might even be involved with the weather. During the summer, people behave differently because it's hot out. They're feeling kind of lazy; they're on vacation more often. Maybe if you happen to do your experiment during the time of a terrible storm in a highly populated area, that could skew your results as well. So again, just be cognizant of potential seasonal effects. Holidays are a big one to be aware of. And always take your experiments with a grain of salt if they're run during a period of time that's known to have seasonality. You can determine this quantitatively by actually looking at the metric you're trying to measure as a success metric, whatever you're calling your conversion metric, and looking at its behavior over the same time period last year. Are there seasonal fluctuations that you see every year? If so, you want to try to avoid running your experiment during one of those peaks or valleys. Another potential issue that can skew results is selection bias. It's very important that customers are randomly assigned to either your control or your treatment group, your A or your B group. But there are subtle ways in which that random assignment might not be random after all. For example, let's say that you're hashing your customer IDs to place them into one bucket or the other.
Maybe there's some subtle bias in how that hash function affects people with lower customer IDs versus higher customer IDs. And that might have the effect of putting all of your longtime, loyal customers into the control group and your newer customers, who don't know you that well, into your treatment group. What you end up measuring then is just a difference in behavior between old customers and new customers. So it's very important to audit your systems to make sure there is no selection bias in the actual assignment of people to the control or treatment group. You also need to make sure that the assignment is sticky. If you're measuring the effect of a change over an entire session, say they saw a change on page A but over on page C they actually did a conversion, you have to make sure they're not switching groups in between those clicks. So you need to make sure that within a given session people remain in the same group, and how to define a session can become kind of nebulous as well. Now, these are all issues that using an established off-the-shelf framework like Google Experiments or Optimizely or one of those guys can help with, so you're not reinventing the wheel on all these problems. But if your company does have a homegrown, in-house solution because they're not comfortable with sharing that data with outside companies, it's worth auditing whether or not there is selection bias. One way of doing that is running what's called an A/A test. If you actually run an experiment where there is no difference between the treatment and control, you shouldn't see a difference in the end result, right? There should not be any sort of change in behavior when you're comparing those two things. So an A/A test can be a good way of testing your A/B framework itself and making sure there's no inherent bias or other problems, for example session leakage and whatnot, that you need to address.
Another big problem is data pollution. We talked at length about the importance of cleaning your input data, and it's especially important in the context of an A/B test. What would happen if you have some robot, some malicious crawler that's crawling through your website all the time, doing some unnatural amount of transactions, and that robot ends up getting assigned to either the treatment or the control? That one robot, not even a person, could skew the results of your experiment. So it's very important to study the input going into your experiment, look for outliers, and analyze what those outliers are. Should they be excluded? Are you actually letting some robots leak into your measurements, and are they skewing the results of your experiment? This is a very, very common problem and something you need to be cognizant of. There are malicious robots out there; there are people trying to hack into your website. There are benign scrapers out there that are just trying to crawl your website for search engines or whatnot. There's all sorts of weird behavior going on on a website, and you need to filter those out and really get at the people who are really your customers and not these automated scripts. And that can be a very challenging problem, actually. Yet another reason to use off-the-shelf frameworks like Google Analytics or whatnot, if you can. All right, and we talked briefly about attribution errors. If you are actually using downstream behavior from a change, that gets into a gray area. You need to understand how you're actually counting those conversions as a function of distance from the thing that you changed, and agree with your business stakeholders up front as to how you're going to measure those effects. You also need to be aware of this: if you're running multiple experiments at once, will they conflict with one another?
Is there a page flow where someone might actually encounter two different experiments within the same session? If so, that's going to be a problem, and you have to apply your judgment as to whether these changes actually could interfere with each other in some meaningful way and actually affect the customer's behavior in some meaningful way. All right, so again, you need to take these results with a grain of salt. There are a lot of things that can skew results, and you need to be aware of them. So just be aware of them, and make sure your business owners are also aware of the limitations of A/B tests, and you'll be okay. So remember, the short-term nature of an A/B test subjects it to a lot of limitations. You might be just seeing novelty effects or seasonal effects and whatnot. So if you're not in a position where you can actually devote a very long amount of time to an experiment, you need to take those results with a grain of salt and ideally retest them later on during a different time period.
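To make the selection bias and sticky assignment points from this lecture concrete, here is one possible sketch using only Python's standard library. The function names, the experiment label, and the use of SHA-256 are assumptions made for illustration, not how any particular A/B framework works; the customer IDs are simply made-up integers, with lower IDs standing in for older customers.

```python
import hashlib

def assign_group(customer_id, experiment="button-color"):
    """Deterministically map a customer ID to 'treatment' or 'control'.

    Hashing the same ID always gives the same answer, so the assignment
    is sticky across page views and sessions.
    """
    key = f"{experiment}:{customer_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return "treatment" if digest[0] % 2 == 0 else "control"

def treatment_share(customer_ids):
    """Fraction of the given customers that land in the treatment group."""
    groups = [assign_group(cid) for cid in customer_ids]
    return groups.count("treatment") / len(groups)

# Audit: older (lower ID) and newer (higher ID) customers should both
# split roughly 50/50 if the hash is not biased by ID range.
old_customers = range(0, 5_000)
new_customers = range(5_000, 10_000)
share_old = treatment_share(old_customers)
share_new = treatment_share(new_customers)
print("treatment share, old customers:", share_old)
print("treatment share, new customers:", share_new)
```

If one of those shares drifted far from one half, that would be a sign of exactly the kind of subtle bias described above, and a reason to run an A/A test before trusting any A/B results.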