Transcripts
1. Essentials Of Machine Learning: Hey, everyone, it's Max, and welcome to my course on the essentials of machine learning. In this lecture we're going to get an introduction to machine learning. So I'm going to tell you what this course is all about, what you're going to learn, and also the things that we're not going to cover in this course, just to make sure that you know what to expect and that you can decide whether this is the right course for you to be taking at this time. In this lecture I'm also going to give you a short introduction into who I am, and then we'll talk about what machine learning is and what machine learning generally looks like when you're doing it in practice. Okay, so what are we going to cover in this whole course? First, we're going to learn about a ton of essential terms, and these are terms that span across the whole field of machine learning, so it's important to know them when you're talking about anything related to machine learning. Then we're going to look at data preparation techniques. We're then going to look at performance metrics, so how you can evaluate how well your machine learning implementation is doing. Then we're going to look more specifically into different regression and classification algorithms. And finally, we're also going to learn about different optimizers and how you can decide which optimizer to use for different machine learning algorithms. All right, so what are you going to get from this course? Well, you're going to get a good understanding of what you need to consider when you're approaching machine learning problems. You're going to learn about machine learning algorithms and what you need to consider when you're choosing one, as well as what you need to consider before and while you're training one. You're going to be able to talk about different machine learning implementations with people.
So you're going to be able to have conversations with people where they tell you about their implementation, you're going to be able to understand what they're talking about, and you may even be able to give feedback or suggestions for things to consider. And you're going to get a whole-picture understanding of what the entire machine learning process looks like. So what is not included in this course? This course does not include coded examples of machine learning implementations. Like I said, this is to give you an understanding of the machine learning topics so that you know everything that's going on, so that you can talk about it, understand when people talk about it, and give feedback. But it doesn't include coded examples of actually implementing any of the algorithms. And we're also not going to go over the mathematical background and derivation of the algorithms and the different techniques. Now, at this point, I also want to point out that machine learning is an extremely vast field, and there's a lot of active research going on. It's a very fast-developing field, and in this course we're not going to go into areas such as deep neural networks, and we're also not going to go into the latest research. Again, the point of this course is that you feel comfortable talking about machine learning as a whole, that you can have conversations about it, that you can understand implementations that people have done and provide feedback on them. And a lot of this is going to be based upon an understanding of the fundamental machine learning techniques that have been around for a while and that all of this new research is built upon. Okay, so who am I? Hi, my name is Max, and I'm going to be your instructor for this course.
I've been working as a data scientist for over three years now, and I've also been lucky enough to teach over 13,000 students about the world of data science. So what is machine learning? Well, the general purpose of machine learning is that you train a machine to perform a task, and this machine will learn the rules behind the system. This is really cool because you don't need to write up all the individual rules and you don't need to keep changing them. Your system is going to learn the rules, and it can actually evolve with time and learn new rules as things change. And ultimately you're letting the machine generalize its knowledge, so it's going to learn a certain thing, and then it's going to be able to apply that knowledge to a bigger field and also onto new problems that it may not have seen before. So what would a machine learning engineer do? Well, it's a pretty involved process. First you need to have data. So you need to get and process data, you need to convert it from a raw format into a clean format, and you're also going to need to be able to analyze and understand your data and create features and indicators. This first part is based upon all of the data science skills. If you're not exactly sure about data science, I've also got a lot of material on data science, so if you're unsure about these first two steps, I'd recommend you check that out. But of course, machine learning builds upon all of the data science skills. Now, what you're going to continue on doing, once you've done all of that, is you're going to weigh the different machine learning algorithms. You're going to apply different ones, you're going to choose several to test out, and you're going to identify the most optimal ones, or the ones that you want to use. And over this whole process you're going to iterate, and you're going to train it until you're eventually satisfied, and then it's going to be launched into production.
So it's going to go live, essentially, and even after it's gone live, you're still going to need to monitor it. You're still going to need to see how it's performing and, over time, also fix it, improve it, or just keep it up to date.
2. Machine Learning Essential Terms: In this lecture, we're going to go over the first part of the essential terms. Now, I want to mention that if you don't understand everything that's going on right now, don't worry. I want to introduce all of these terms to you before we go into the different areas, so that when they pop up again, you're already familiar with them and we don't need to sidetrack to explain new terms. Most of these terms we're going to see again, and if you don't understand them right now, again, don't worry about it; we'll cover them again in later lectures. The idea here is just to give you an understanding of all the terms that don't really pertain to one particular area or another, so that you've seen all of these, you understand them, you know how to contextualize them, and you understand the bigger picture of everything. And then we'll dive a little bit more into detail into each of the things that we mentioned in the previous video. All right, so let's jump right into it. The first thing that we're going to talk about is the different approaches that you can take to machine learning. These are split up into supervised, unsupervised, and reinforcement approaches. Now, a supervised approach is when you have data and you're also provided solutions, so ideal answers that you want. In this case, what we see in the picture on the left is a straight line, and each of these values may come with an ideal answer. So you may get all the values on the x-axis and you want to predict the values on the y-axis. In this case, you already know what the pairing is. And the idea of supervised machine learning approaches is that you separate your data, so you only take the data that you want to make predictions on, and then you check those predictions against the answers.
And based upon how correct or how wrong your predictions are, you then change your machine learning algorithm to ultimately make it better at coming very close to those answers. Now, the other type of approach that you can take is unsupervised, where there aren't actually any correct answers, or you don't have specific answers. And so the goal of your algorithm here is to try to find patterns itself. A good example: if we look at the image in the middle, if we have just a dataset like this and we want to know whether there are different groups present, something that we can run is finding clusters, which we'll talk about more in a second and also more in later lectures. And so the idea is that our machine learning algorithm in this case will find two different groups, which are shown here in blue and in orange, which it has learned to separate. Finally, the third machine learning approach that you can take is a reinforcement approach. This is in some sense similar to unsupervised, but it really takes it to the next level, where you let your machine learning algorithm just kind of go on its own. It will start to do stuff, and it will get feedback based on whether its action was good or whether its action was bad. And you can define what is good and you can define what is bad. These are some of the most modern approaches, and they're also very complex, and you can really get very specific with them. Essentially, the idea is that this is kind of emulating the way that we learn. And so the point here is that you let your machine learning algorithm go, you just kind of feed it data, and then you let it make its own decisions. And based on the decision that it makes, you then either say, yes, this was a good decision, or this was a bad decision. And then it will learn over time to make more good decisions and also avoid bad decisions.
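To make the unsupervised clustering idea above concrete, here is a minimal sketch. It assumes scikit-learn is installed, and the two "groups" are made-up points; the key thing to notice is that we never tell the algorithm which point belongs to which group.

```python
# Unsupervised clustering sketch: the algorithm receives points with
# no answers and must find the two groups on its own.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two artificial groups of 2-D points, centered at (0, 0) and (5, 5)
group_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
group_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])

# Ask KMeans to find 2 clusters; no "answers" are ever provided
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(set(labels))  # two distinct cluster labels, e.g. {0, 1}
```

Because the groups here are well separated, KMeans recovers them cleanly; in messier real data, choosing the number of clusters is itself a decision you have to make.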
So, the different types of machine learning algorithms: essentially, you can have different goals with your machine learning algorithms. There's regression, there's classification, and there's also dimensionality reduction. So let's go through each of these. Regression is when you're trying to predict a specific value. The goal here is, let's say, you're trying to predict a number. So if we have our x data here, we're just trying to predict the corresponding y value. This could be that you're trying to predict a kind of continuous series of numbers, or numbers in specific intervals, or something like that. But the goal is that you're trying to predict a certain number. Classification, on the other hand, is when you're trying to split your data into different groups. In this case, and this is the same chart that we used last time, we'd have two different groups, and the goal of the algorithm is just to sort data points into either group A or group B. So for example, if you have users on your website, the machine learning algorithm could look at what a user does, and it could then say either: this is a user who's likely to purchase from us in the future, or this is a user who needs more hand-holding, or more education on how to use our product, or however else your groups may come about. But that's the idea: you're not trying to assign any numerical values; rather, you're trying to sort the data into different groups. Now, dimensionality reduction is an approach that you can take to prepare data for machine learning, and it's actually itself kind of a whole set of machine learning algorithms. The goal is that oftentimes when you have data, and especially when things get very complicated, you have a lot of different dimensions. So let's just take an example here.
If you have an image and you're trying to identify something on that image, the image is made up of a bunch of different pixels, depending on your resolution. And each pixel, if the image is colored, also has different color values with it. So it comes with three different color values. Very quickly, even if you just have 100 pixels, you have 100 pixels times three colors. That's 300 different values that each image can take on. And 100 pixels is also not a very big image. So you can see that very quickly your dimensions can get very, very large. And so the idea of dimensionality reduction is that you take all of these images, and not just images, but all of these data sources that have tons and tons of data, and you try to reduce them, so that rather than having a million or five million different data points for each set of data that you have, you can reduce that number down to a much lower one, which will help your machine learning algorithm, because ultimately you're just focusing on the important things. All right, so let's dive a little bit deeper into the building out, or the evolution, of a machine learning algorithm, so the machine learning flow. The first thing that you're going to need to do is train your algorithm. Now, you may either take a completely new algorithm, or you could have a partially trained or an already existing algorithm that you need to improve upon. But whatever you start with, you're still going to want to train it on whatever data you've collected. So you're going to use that data, you're going to make predictions, and then you're going to evaluate those predictions and look for the mistakes, based upon the data that you have available to you. Now, alongside training, you're also going to want to do validation. The goal of validation is ultimately just to have a dataset that you can evaluate your predictions, or your current model, on, and see how it performs on data that hasn't been seen yet.
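Coming back to the dimensionality reduction idea above, here is a minimal sketch using PCA from scikit-learn (one common dimensionality reduction technique; the library and the random "image" data are assumptions for illustration). It mimics the example of 100 pixels times 3 color values, so 300 raw dimensions per image, reduced down to 10.

```python
# Dimensionality reduction sketch: compress 300-dimensional "image"
# vectors down to 10 dimensions with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_images, n_dims = 50, 300          # 100 pixels * 3 color values each
X = rng.random((n_images, n_dims))  # made-up raw data

pca = PCA(n_components=10)          # keep only 10 dimensions
X_reduced = pca.fit_transform(X)

print(X.shape)          # (50, 300)
print(X_reduced.shape)  # (50, 10)
```

The reduced data keeps the directions along which the data varies the most, which is the "focus on the important things" idea from the lecture.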
The purpose of this validation step is that you can avoid issues of overfitting, which we'll talk about again a little bit later. But generally, overfitting is just when your algorithm finds patterns that don't really exist. So the point of validation is that, once you've trained, you want to test the model on some data that it hasn't seen before, data that its predictions haven't been corrected on, and see how it does against that. And validation is really nice because essentially you're just taking your training set and splitting part of it off: you're using most of your training set for training, and you're going to use the other part to validate against. It can really help you identify issues of overfitting, it can tell you when you need to stop training and what a good stopping point is, and whether your model is actually performing well. Now, the last part is going to be the testing part. The point of testing is actually very similar to validation, but there's one very big difference: your model only gets to see the testing data once, all at once, and you're not going to continue to improve your model to try to fit the testing data better. So generally what you want to do is take your initial dataset, split off 80 percent of it, and put that into the training part. This is data that your model is probably going to see more than once. And the other 20 percent you just put aside and you don't touch it; you don't even look at it yourself as a human, because you don't want to introduce any bias into your algorithm. You just put it aside, you leave it there, and you don't touch it until the very, very end, until you actually want to know: okay, how does my algorithm perform? Now, with the training data, that 80 percent, you can split that up into the training and the validation parts.
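The 80/20 split just described can be sketched with scikit-learn's train_test_split (assuming scikit-learn is installed; the data here is made up). The test set is split off first and locked away, and the remaining 80 percent is split again into training and validation.

```python
# Train / validation / test split sketch.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(500, 2)  # 500 made-up observations
y = np.arange(500)

# First: set aside 20% as the untouched test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Then: split the remaining 80% into training and validation
# (0.25 of the remaining 80% = 20% of the original data)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 300 100 100
```

The exact proportions are a judgment call; 80/20 is just a common starting point.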
And the very big difference between the validation and the testing is that in both validation and testing your model is not going to see the data, it's not going to learn to predict from that data, but it's going to be evaluated against it. With validation, you can train your model several times and you can always test it against unseen data, and you're going to test against that validation data several times. Whereas for the test set, you really, truly only leave it to the very end. You take what you think is your final model, and then you run it against the test data and see how it performs. And from there, you get an actual good representation of how your model is likely to perform when it sees completely new data. Now, the important thing is, once you run it against the testing set, you don't want to train it anymore and you don't want to tune it anymore so that it performs better on the testing set, because the whole goal of the testing set is to introduce completely unseen and unknown data, without any bias and without any input of what is correct or not. And so if you start tuning your model against the testing set, then it's no longer a testing set, it's just another validation set. And from there, it's not likely that the results you're going to get on that validation are going to be representative of what you're actually going to see when you deploy your model, or when you use it on completely unseen data. The point of the testing set is to get an almost completely fresh perspective, and to really have a good understanding of how your model is going to perform when it's actually out there and when new values come in that it's never seen before. All right, so an important term to know about during the whole training process is something called hyperparameters.
Now, hyperparameters are essentially tunable parameters of your model, or of your whole learning process. So that's your model, how you decide to compute the errors, and also how you decide to do the learning. An example would be: how fast does your algorithm learn? Now, you may say, oh, well, that's easy, let's just make it as fast as possible. The problem with this is, sometimes if your model learns too fast, it may actually perform worse and worse with time because it's trying to overcorrect. So choosing how fast to learn is a pretty important balance, because it's the balance between taking too much time to reach the solution, or overcorrecting and never becoming as good as it potentially can be. So hyperparameters are things that your model usually won't learn, although you can use different machine learning algorithms to learn hyperparameters for your model. But there are a lot of these free parameters that you choose. Another example is: how complicated is my model going to be? So all of these things that are kind of left up to you, and that are part of this machine learning art, are these hyperparameters, where ultimately you're deciding: what are the things that it should try out? Does this need to be very complicated? Does it need to be simple? How fast do I want this to learn? All those sorts of things. A good thing to know about hyperparameters is that you can do this thing called grid search. So rather than just picking hyperparameters and hoping everything works well, you can use this technique called grid search, where you give a list of the hyperparameter values that you want to try out, run the training several times, and then compare how the models perform based upon these different hyperparameters. And then you can see, okay, what combination of all of these free parameters is the best, the one that ultimately gives me good model performance.
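The grid search idea just described can be sketched with scikit-learn's GridSearchCV (scikit-learn and the particular hyperparameter choices here are assumptions for illustration): you list the values to try, it trains one model per combination, and it reports the best one.

```python
# Grid search sketch: try every combination of the listed
# hyperparameter values and keep the best-performing one.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hyperparameters to explore: model complexity (max_depth) and the
# minimum number of samples required to split a node
param_grid = {
    "max_depth": [1, 2, 3, 4],
    "min_samples_split": [2, 5, 10],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # the winning combination
print(search.best_score_)   # its cross-validated accuracy
```

Twelve combinations are trained here (4 x 3); grids grow multiplicatively, which is exactly why the random-search alternative mentioned next exists.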
And maybe also one that makes the model learn fast. So those are also some of the things that you may need to consider: how much time do I actually have available to train it? How much performance do I really need? How much accuracy do I really need? Now, grid search you can do either by predefining the parameters that you want to use, so you can say, all right, I want you to try out all of these different combinations; or you can just let your computer choose random combinations and tell it how long you want it to run. So the trade-off here is: one, you can choose, all right, what do I want to explore? And the other one is, okay, how long do I want to let it run until I can take it and go to the next step? And finally, something else important to know is cross-validation. We talked about validation in the previous part, but the idea of cross-validation is that you take your data and split it into smaller subsets. You take all but one of those subsets for training, and then you take the last one for validation. That way you can train several different models, or the same model several times, using different training and validation sets. And then your model gets different types of data coming in, and also different unseen data. And the cool thing about cross-validation is that you not only get an understanding of, okay, how does my model perform right now, but also how much does my model vary? So what is the kind of expected performance range that I can expect from this model?
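The cross-validation idea above can be sketched with scikit-learn's cross_val_score (the library and dataset are assumptions for illustration). With cv=5, the data is split into five subsets; the model is trained on four and validated on the remaining one, rotating through all five, so you get five scores rather than one.

```python
# k-fold cross-validation sketch: one score per fold, plus the
# mean (expected performance) and spread (how much it varies).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)  # one accuracy per fold

print(scores)         # 5 accuracy values
print(scores.mean())  # expected performance
print(scores.std())   # the performance range you can expect
```

The standard deviation of the fold scores is exactly the "expected performance range" the lecture mentions.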
3. ML Essential Terms continued: In this lecture we're going to continue looking at essential terms. Now, at this point you may be asking: okay, cool, so I kind of understand how the whole training process works, but how do I even start with knowing what model to choose? Well, the first thing to know, and this is extremely important, is that every model is always just an approximation of reality, or just an approximation of whatever you're trying to do. This is true for essentially everything; it even holds for physics: even physics models are just approximations. Now, the goal that you're trying to achieve with a model is to mimic or understand reality as closely as possible, so that the differences between your model and reality aren't really important anymore, because they essentially behave the same way. Now, every model usually comes with an assumption, and based on whatever assumptions you have about your data, you're going to choose a specific model. So for example, if you assume that your data is linear, you may want to choose linear regression. If you assume that your data is more complicated, you may want to choose polynomial regression or decision trees, or you may even want to go down the route of neural networks, depending on how much complexity you want to add. Now, if you don't make any assumptions about your data, then there's this cool theorem called the no free lunch theorem, and it basically says that it's then impossible for you to know which model is the correct choice; they're all equally viable. So applying this to machine learning, what this really says is: you may have an initial understanding of your data, but it's always a really good idea to take several different models that you think will perform well on the task you're trying to achieve, and train all of these models. You know, you don't need to completely optimize them.
Just pick some default parameters, or change them around a little bit, and train these different models and see how each of them performs. Now, sometimes, if you're lucky, you're going to get different models where some perform extremely poorly and others perform generally well. So you want to pick out your winners. Now, if they all perform equally well, then at this point you're kind of free to choose, but usually you're trying to narrow down the number of models. And so a lot of times you don't just come up with the model from scratch, because it's extremely difficult to know what the correct initial approach is. A good thing to do is pick several models, train them all, try them all out, see which ones perform best, then take those, optimize them further, see how they perform then, and then ultimately decide on one and fully go down the route of really optimizing it and training it on your whole large dataset, or whatever you may have available. Speaking of datasets, let's go over some of the important terms that you'll encounter when talking about datasets, specifically in the field of machine learning. The first term is going to be features. Now, features are all the data that you're going to use to train your algorithm. So let's say you're building an algorithm to predict the weight of an individual. The features could be their sex, their height, their occupation, where they live, their daily activity, whatever. Anything that you want to feed into your algorithm, that your algorithm will use to try to predict that final weight, is going to be a feature. Now, this can either be raw data, or it can be formatted or processed data; it doesn't matter. It's just that this is the data that you're going to be feeding into your algorithm, and that's the data that you're going to use to try to make your predictions.
Now, if we look on the right, this is usually how everything is denoted when you're talking about multiple observations, which is what they're called; this is just multiple rows of data. So each of the rows in this example would correspond to a different person, and each column would correspond to a different feature, which is what we have along the top there. So feature one, for example, could be sex, feature two could be their height, and so on, and feature n could be where they may live. Then observation one would be person number one, observation two would be person number two, and so on, until you get down to person number m, which is how many observations you have. And as you can see, this is very often denoted by just a capital X. X is the matrix where each row holds an observation and each column is for a specific feature. Now, the other thing that's important to know about are the targets, and these are often denoted by a lowercase y. The target is the reference value that your algorithm is trying to learn. In this example, our targets would just be the final weight. And we can see that we don't have multiple columns, we just have one column, but we still have the same number of observations. And so for each observation, which in this case would just be a person, we have in our X, in our features, all of the relevant features, and in our y, in our targets, we have just the weight. So y1 would be the weight of person one, y2 would be the weight of person two, and so on. Now, there are also important terms to know about machine learning models. The first of these is called a bias. Now, this is going to be different from another type of bias that we're going to learn about, but the idea of a bias in machine learning models is just to provide an offset, and this is also known as the intercept.
And the easiest way to think about this is: if you just think about a straight line, your bias, or your intercept, is shifting that line up and down, shifting around that y-intercept. The other thing that you have are feature weights. In this vector we store the importance of each feature. Ultimately, if you have multiple features, your algorithm is going to try to learn these weights based on whatever formula it's using, whatever algorithm it's using. It's going to assign weights to your features, it's going to assign relative importance. And so what we have is that each feature has a specific weight associated with it. Finally, we also have the parameter vector. Now, the parameter vector is just the combination of the bias and the feature weights in one full vector. And you often do this just because it makes writing things down easier, so that you have one vector that contains both your offset, or your intercept, or your bias, whatever you want to call it, plus all of the weights of your features. So we can see, if we go back, we have our features here, and then when we go forward again, we've got a weight for each feature. And that's ultimately what our algorithms are going to be wanting to learn. Some algorithms will have several sets of feature weights, and some algorithms will just use a single set of feature weights together with a bias. And the easiest way, of course, to represent this is using the parameter vector, because it allows you to group everything together, which just makes it a little bit neater.
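The terms from this lecture can be tied together in one small sketch, assuming scikit-learn and made-up data: X is the feature matrix (one row per observation, one column per feature), y is the target vector, and after fitting a linear model, the learned feature weights and the bias (intercept) can be read off directly.

```python
# Features (X), targets (y), feature weights, and bias in one place.
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up rule: weight = 0.5 * height_cm + 2.0 * activity_score + 10
X = np.array([[170.0, 1.0],
              [180.0, 2.0],
              [160.0, 3.0],
              [175.0, 1.5]])                # features: height, activity
y = 0.5 * X[:, 0] + 2.0 * X[:, 1] + 10.0   # targets: final weight

model = LinearRegression().fit(X, y)

print(model.coef_)       # feature weights, close to [0.5, 2.0]
print(model.intercept_)  # bias / intercept, close to 10.0
```

Because the made-up data follows the rule exactly, the model recovers the weights and bias exactly; with real, noisy data the learned values would only approximate the underlying relationship.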
4. Wrapping up essential terms: Now you may be wondering: what are the different approaches that I can take to training an algorithm, and how do I make sure that my model stays up to date? Generally, when you train, there are two different approaches that you can take. The first one is called batch learning and the second one is called online learning. Now, the big difference between the two, and you can see this in the image on the right, is that in batch learning you train your model beforehand. So you have batches of data that come in, chunks of data, and these can either be smaller subsets or they can be huge. Batches are sets of data that come in that you train your model on. And basically every green vertical line that you see there is a new model being created. So you train your model several times and you continue to make improvements upon it. And at some point you decide: all right, it's performing very well, let's make it available, let's put it into production, or let's make it go live, or whatever you want to call it. And then you have live data coming in, but at this point your model isn't changing anymore; your model was fixed beforehand. Now the data comes in and the model just outputs its prediction, or whatever the model is supposed to do. Online learning, on the other hand: usually you also start with a batch-trained model, so you want to train it first so that it does well beforehand. But the option with online learning is that you can continue to train it as new data is coming in. Now, this sounds really nice, but of course there are also complications that come with it. For example, when new data is coming in, how do you know what the right decision is, or what the right answer is, if the right answer doesn't come along with it? So you may not have an obvious right answer that comes with your system when it's online and it's just working and performing.
You can run into some problems, because you're either going to have to guess the answer, or you're going to have to find some other smart solution for how to evaluate what the right answer is and what you should be using to train it. The point, though, is that if you do have these answers, or if your dataset changes with time, online learning can be really nice, because your model is going to adapt as the data evolves with time. So let's say you create a product and you only have a couple of thousand users at first. You have your model online, and you have a good understanding of how to evaluate its performance and how to make changes with time. As new users come in and do a bunch of things, your algorithm, or your model, can develop with these users. And so as your product grows, your model is also going to grow and it's going to change. And you can see here, all of those small green vertical lines after it goes live are basically new versions of your model. Now, it's also important, for both of these cases, to evaluate performance over a longer period of time. So you want to come back at least a couple of months after you've deployed it to see how things are performing. For the batch learning case, this is important because your model is probably going to be outdated at some point, and so it may no longer be performing as well as it initially was, just because things have changed. For the online learning case, it could be that your model kind of goes off in its own direction, and at some point it just hasn't learned correctly, it's gone off in a wrong direction, and it's no longer performing as well as you'd like it to. And at that point you need to stop, and you then need to revert to an earlier version, or retrain it on some fresher data, or just update it. So in both of these cases, you don't just want to put them online and kind of leave them there.
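Going back to the batch-then-online flow described above, here is a minimal sketch using scikit-learn's partial_fit (SGDRegressor is one model that supports incremental updates; the library, model choice, and made-up data are assumptions for illustration): the model is first trained on an initial batch, then kept up to date as new chunks of data arrive.

```python
# Online learning sketch: initial batch training, then incremental
# updates with partial_fit as new data comes in.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)

def make_batch(n):
    # Made-up "live" data following a fixed linear rule
    X = rng.random((n, 3))
    y = X @ np.array([1.0, 2.0, 3.0])
    return X, y

model = SGDRegressor(learning_rate="constant", eta0=0.1, random_state=0)

# Initial (batch) training before going live
X0, y0 = make_batch(200)
model.partial_fit(X0, y0)

# Live: keep updating the SAME model as new chunks of data arrive
for _ in range(20):
    X_new, y_new = make_batch(50)
    model.partial_fit(X_new, y_new)

X_check, y_check = make_batch(100)
print(model.score(X_check, y_check))  # stays high as it keeps learning
```

In a real deployment, the hard part is the one the lecture flags: getting trustworthy "right answers" for the incoming data to feed into each update.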
For the batch learning case, it could be that the data updates and your model goes out of style, whereas for the online learning case, the model may change with the data, or your model may start going in the wrong direction. Now, at this point it's also important to talk about data and how effective data is. It's important to note, first of all, that more data generally means better performance, and with enough data, even simple models can perform very well on complex tasks. So if you have tons and tons of data, at some point you may not want to spend so much time thinking about, okay, what algorithm exactly am I going to use? You're going to want to go with one that generally performs well on this kind of task, and you're going to want to focus more on, okay, what model can learn quickly? Because when you have tons and tons of data, what this effectiveness of data basically means is that several algorithms can perform equally well, and so it becomes important how quickly you can get a model to perform well. So how can I save time with training, so that I can get my model up faster, or so that I can make improvements to it quicker? Now, another important thing to know about data, or to think about when creating a model and trying to model data, is underfitting. Underfitting is when your model is too simple to correctly understand the data, and this can also be called bias. Underfitting and bias are both variations, or forms, of oversimplicity. What you have in this case is that you're assuming something is much simpler than it actually is. So let's take an example: the stock market, which is one of the most complicated things, with so much that feeds into it. If you try to predict the stock market using a simple linear model, it's not going to perform particularly well, and that's because the stock market isn't something so simple that it can be understood with just a linear model.
And if you try to go that route, you're going to have a heavily underfit model, because you're assuming so much simplicity and the stock market is extremely complicated. Another way that simplicity can be introduced into your model is through regularization. Regularization is part of the loss function, or the cost function, which is something that we're going to look at in more detail in later lectures. But ultimately, in very short terms, what it does is penalize the model: you're trying to make it as simple as possible by limiting its flexibility. Now, on the other hand, you can have overfitting, and variance goes in the same direction. That's when your model is overly complex, when it has too much freedom and it finds things that actually aren't there. So if your model is overfitting, that means it's found patterns that don't actually exist. And similarly, when you have more variance, that means your model is becoming extremely complex and has too much freedom. And because it has too much freedom, it's no longer performing well, because it's focusing on things that aren't important, and it's finding those things because you're giving it so much freedom. Overfitting, or variance, can come from using extremely complicated models. For example, if you decide to go with deep decision trees, high-degree polynomial functions, or deep neural networks, and you don't try to restrain their freedom, if you just let them run free, it's pretty likely that they're going to overfit your data, because they're going to go so deep into it, and they're going to think they've found something extremely interesting and extremely complex, and it's probably not going to be true. So the idea of overfitting, and in the same region the idea of variance, is about the complexity of a model. Oftentimes you want to think about: what is my bias and what is my variance?
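To make the regularization idea concrete, here is a small sketch of how a penalty term enters a loss function. The function names and the example numbers are invented for this illustration; the point is only the shape of the formula: total loss = error term + penalty that grows with the size of the weights.

```python
# Sketch of L2 regularization: the loss is the usual error plus a penalty
# proportional to the squared size of the model's weights, which pushes
# the optimizer toward simpler (smaller-weight) models.

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def regularized_loss(y_true, y_pred, weights, lam=0.1):
    # lam (lambda) controls how strongly complexity is penalized
    l2_penalty = lam * sum(w ** 2 for w in weights)
    return mse(y_true, y_pred) + l2_penalty

y_true = [1.0, 2.0, 3.0]
y_pred = [1.1, 1.9, 3.2]
small_w = [0.5, 0.2]
large_w = [5.0, 3.0]
```

With identical predictions, the model with larger weights pays a larger loss, so during training it is nudged toward the simpler solution.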
What are the trade-offs? Essentially, what are the simplicities and what are the complexities? And how can I make my model so that it's neither too simple, missing important things, nor too complex, with so much freedom that it finds things that aren't even there? If you look at the graph on the left, for example, this is one of the test datasets that come with a library called SKLearn, which is a machine learning library for Python; it's the iris dataset specifically. What we have here is just a simple decision tree, which we'll also learn about in later lectures. The important thing to note is that we have three different classes, which you can see from the three different types of colored dots, and we also have three different identifications that our model makes, which you can see with the three different background colors: this purple, this pink, and this yellow. And you can see that in the yellow there is a streak, and that streak of pink only hits one yellow point. Even though this data is pretty simple, this is already an example of overfitting, where the model is trying to become too complex, and it's introducing these overcomplexities, in this case a small narrow line, to fit one data point in a region that is otherwise dominated by another class. My point here is that even for these very simple datasets, overfitting can become a problem if you leave your model too much freedom, and if you just let it go off on its own without controlling it. So what does the project flow of a machine learning project generally look like? Well, the first thing that you're going to need to do is think about the goal. Before you do anything else, you want to know: what is the business case, or what is the use case of my model? What is the actual goal that I'm trying to achieve? What is the accuracy that I'm going for? And what things are acceptable?
So what are acceptable mistakes and what are unacceptable mistakes? For example, say you're in the field of medicine, and you're trying to help doctors detect a disease, or to have some sort of pre-test. An acceptable mistake would be to sometimes detect something that isn't there: a false positive. So sometimes you can say, we may want to do further investigation because this disease may be present, and if it turns out it's not there, then it's okay, because the person is not going to be harmed. Even though in the short term it may not be so nice to go through that anxiety, in the long term the effects are okay. But an unacceptable mistake would be if your model misses the disease: if it says it's not there, no further tests are done, and the person ends up having that diagnosis. So in this case, for example, you're going to want to heavily focus on not making mistakes where you miss something that's actually there, because that can cause irreparable damage, and it's not a road that you want to go down. So you can see that sometimes it's not just about how much you get right and how much you get wrong; more importantly, what are the most important things that you need to get right, and where is it okay to get things wrong? Ultimately, your model is probably never going to be perfect, so you need to tune it to make sure that the things that you need to get right are right as often as possible. And only after that do you want to make sure that the things that you should get right are also right as often as possible. But most importantly, before you do anything else, you need to know what the ultimate goal is. Otherwise, you're going to go off in the wrong direction.
Or you're going to spend so much time trying to go for an extremely high level of accuracy that's not even needed, because that's just not the goal of your business. Now, once you have an idea of what you actually want to do and what you actually need to do, then it's time to take a deep dive into your data. You want to make sure you understand your data properly, and you also want to make sure you go through all the data preparation steps that we talked about earlier. A lot of this first part is actually the data science process: understanding the business questions; understanding how to analyze the data; what the different ways are that I can contextualize my data; how I can bring more information into the data through my domain knowledge that the machine learning algorithm can maybe use. So the first part is going to be very heavily based on data science skills. Now, the second part is where the machine learning really kicks in. You're going to want to create your train and test split, or your train, validation, and test split, and you're going to want to start training and performing validations on your models so that you can improve them with time. Of course, you want to pick multiple models at first, and you want to pick a loss function or an error measure that you think is good. Then you're going to want to train these different models and compare them, pick some of the winners, and optimize those. At this point, you're also going to need good optimizers to make sure that your models learn as fast as possible. You're going to want to do grid search with cross-validation and iterate over batches of training data. Then, ultimately, you want to evaluate on your test set, and you want to see if there are any signs of under- or overfitting. And you also want to take this to other people and get input from them.
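The train, validation, and test split described above can be sketched in a few lines of plain Python. This is a hand-rolled illustration (in practice you would typically reach for a library helper such as SKLearn's `train_test_split`); the function name and fractions are choices made for the sketch.

```python
import random

# Sketch of a three-way split: test data is held out until the very end,
# validation data is used to compare and tune models, and the rest trains them.

def three_way_split(data, val_frac=0.2, test_frac=0.2, seed=42):
    shuffled = data[:]                        # copy so the original stays intact
    random.Random(seed).shuffle(shuffled)     # seeded shuffle for reproducibility
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = three_way_split(list(range(100)))
```

With 100 data points and 20/20 percent held out, you end up with 60 training, 20 validation, and 20 test examples, and no example appears in more than one set.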
See what they think about the process, see if they have any other input based on the model's performance, or based upon what data you decided to supply it with, and then iterate from there. And once you feel good about how your model is performing, then it's time to launch it, to make it available for users, to put it wherever it needs to go. And even then, you're still going to want to monitor the performance of your model. You're going to want to come back to it and see how it's performing a week later, a month later. And ultimately, you're also going to need to retrain it on new data to make sure that the model stays up to date.
5. Data Preparation: Welcome to the lesson on data preparation techniques. Properly preparing your data has amazing effects on the performance of your machine learning algorithms and is a crucial step in machine learning. This is definitely not something that you should think about skipping; it's actually one of the most important parts of making sure your machine learning algorithms are properly set up. So let's look at an example first. How would you deal with a distribution that looks like the following? What we have here is just a generated distribution that shows the income distribution of some city. On the y-axis you have the counts, or you could also look at it as the occurrence rate, and on the x-axis you have the income in tens of thousands of dollars. So what you see here is a city with a kind of average middle class, but of course it has people whose salaries extend up to very high rates. Essentially, what we have here is a skewed distribution, or something that has a very long tail. We have a kind of normal distribution that we recognize in the center, and then towards the right, that distribution just kind of continues on. And this has significant effects on our machine learning algorithm, because the scale of the data becomes extremely big. You can see that we keep extending on towards the right, and our count goes down, but we can always encounter values in this higher range. And this can actually cause problems when we're trying to put this data into our machine learning algorithm, and it may not always deal with it as well as you may want it to. So how do we approach problems like this? How do we address these types of distributions? Well, one thing that you can do is take the log of the income, and you can see the effects here. What it does is, rather than having a scale of about 2.5 to 20, which is what we had in this distribution:
here, we go to about 1 to 3.25 or something like that. So we've dramatically cut down the range of the data in our distribution, or at least the range of a significant part of the data in our distribution. Rather than going about eightfold, from 2.5 to 20, we only have about a threefold change, from 1 to 3. And that is extremely important for our machine learning algorithm, because it now has to focus on a much smaller range of data. Another important thing that we're doing with this log scaling is we're saying that higher numbers are less different from one another. In this case, the difference between 2.5 and 5.5, so $25,000 and $55,000, is much greater than the difference between, say, $150,000 and $175,000. And that also makes sense, right? At some point, that difference just doesn't really matter that much anymore. That's what we're physically saying when we're using the log scaling. But it also has significant impacts for a machine learning algorithm, because it's really good if we can reduce this range down and not make it so big. Another thing that we can do to reduce the range that our data spans is just to take a threshold. We could say, for example, that everything after $125,000, or 12.5 in this case, is basically all the same for whatever project we're considering. Let's say our project is seeing what types of people can afford what houses, and we say, well, anyone who's earning $125K or above can basically afford all of the houses that we're looking at, and there's no real difference between them anymore, because that extra income doesn't make a difference to our project. So we can take a hard threshold here, which is based on the physical meaning of why exactly we chose $125K: because we decided that at this point, it just doesn't make a difference anymore.
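The two transforms just described, the log transform and the hard threshold, can be sketched in a couple of lines. The income values below are made up to mimic the lecture's example scale (tens of thousands of dollars); the threshold of 12.5 is the $125K cutoff from the text.

```python
import math

# Sketch of two ways to tame a long-tailed income distribution:
# 1) a log transform, which compresses the tail, and
# 2) a hard threshold (clipping) at 12.5, i.e. $125K.

incomes = [2.5, 4.0, 5.0, 6.0, 8.0, 12.5, 20.0]   # tens of thousands of dollars

log_incomes = [math.log(x) for x in incomes]       # roughly 0.9 .. 3.0

def clip_at(x, threshold=12.5):
    return min(x, threshold)

clipped = [clip_at(x) for x in incomes]            # tail cut off at 12.5
```

After the log transform, the roughly eightfold spread (2.5 to 20) collapses to about a threefold spread, exactly the compression the lecture describes, while clipping simply caps the maximum.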
And in this way, we're also reducing our range from 2.5 to 20 down to 2.5 to 12.5. So we're not getting as big of a reduction as we do in the log case, but it's still a good reduction. However, there's a problem in this case. Look at that normal distribution that we see around the 5 mark, or the $50K mark; I've put a green bar above it so you can visualize where the distribution lies. Now, if we look at the right-hand tail, we see that the green indicator is about the size of the distribution that we have on the left, and the tail is still much longer than that. So our distribution in this case still has an extremely long tail. This threshold may be a good thing to use in some cases, but for this specific example it wouldn't be, just because our tail is still so long. Even though we used a threshold and cut off a significant portion, going from 12.5 to 20, the tail is still extremely long, and it's actually longer than the main part of the distribution itself. Now, something else that you can do is take percentiles of your data. You're basically taking all the data that you have and splitting it into 100 groups, and then you can take each of these income values and instead replace it with its percentile. What you get from this is what you can see on the screen: you get 100 equally distributed groups, and you get scores between 0 and 100, which you can also scale to go from 0 to 1 or whatever you like. But the important thing here is that you have a well-defined range, and there's none of the imbalance that we see especially in this case, where the tail stretches on so long in one direction and there's a mismatch between the main part of the distribution and how far the data extends away from it. That's not something that we see here. So now that we know about scaling, something else that we need to talk about is preparing the inputs for our algorithm.
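The percentile idea above can be sketched as follows: replace each value by the percentage of data points below it, which flattens any long tail into a bounded, evenly spread scale. The function name and the sample values are invented for this illustration, and it assumes the values are distinct for simplicity.

```python
# Sketch of a percentile transform: each value is replaced by its rank
# position in the sorted data, scaled to 0-100. The long tail disappears
# because only order matters, not magnitude.

def percentile_ranks(values):
    ranked = sorted(values)
    n = len(values)
    # position of each value in the sorted order, scaled to a 0-100 score
    return [100.0 * ranked.index(v) / (n - 1) for v in values]

incomes = [2.5, 3.0, 4.0, 5.0, 6.0, 8.0, 12.5, 14.0, 20.0]
pct = percentile_ranks(incomes)
```

The smallest income maps to 0, the largest to 100, and the outlier at 20 no longer stretches the scale; it is just the top rank.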
In this case, we've looked at a scale that goes from about 2.5 to 20. If we just have one range of input and we do some of the data preparation we talked about previously, where we try to move it around or scale it to reduce the impact of the tail, that can be a good thing. But if we have several different features that we use for input and their scales are significantly different, then this can have negative effects on the performance of the algorithm. Machine learning algorithms generally like to have numbers that come in similar ranges. So, for example, if you have values here going from about 2.5 to 20, or about 1 to 20, then you want to make sure that your other values are also within a comparable range, for example going from 5 to 30, or from 1 to 15, or something like that. What you don't want is one distribution that goes from about 0 to 20, another one that goes from about 0 to 100, and then one that goes from 50 to 500,000. Machine learning algorithms sometimes have problems dealing with those types of distributions. So what you really want is for the scales of your inputs to be comparable to each other. What we're talking about here is not reducing the effect of the tail, which is what we did previously, but making sure that your inputs are of comparable size to one another when you're using multiple inputs. So what you want to do here is essentially scale your features to a smaller, more defined range. One way that you can do that is using something called min-max scaling. What you do is you take the minimum value and the maximum value, and you say that any number at the minimum becomes 0, any number at the maximum becomes 1, and anything in between becomes a number between 0 and 1 that linearly depends on where it lies. So if it's halfway to the maximum, its value is going to be 0.5.
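Min-max scaling as just described is a one-line formula; here is a quick sketch (in practice SKLearn's `MinMaxScaler` does the same job, and the sample numbers below are made up):

```python
# Min-max scaling: minimum maps to 0, maximum maps to 1, and everything
# in between is placed linearly according to where it lies.

def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([2.5, 5.0, 11.25, 20.0])
```

Note that 11.25 is exactly halfway between 2.5 and 20, so it scales to 0.5, matching the halfway example in the text.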
Now, the problem with this is that we can still have an uneven distribution of data. We can see that we've changed the scale at the bottom here; we're going from 0 to 1 rather than from 2.5 to 20, but we're still maintaining the tail. So in this case, it's of course still important that, in addition to doing this feature scaling, we also do some other form of data preparation to make sure we reduce the effects of this tail. However, this is a good way to approach having several features and putting them on the same range. Now, of course, another problem that you can get from this is that if we have most of our data within the lower part and we have some outliers, these outliers can heavily affect the range of our distribution. In this case, we can see that most of our data is around the 0.3 mark, but just one of the outliers, which is the 20 up here, is what pulls the scale all the way up to 1. So in this case, we still have the problem of a smaller dominant range in our distribution: most of the data is now between about 0.1 and 0.4, but our range goes all the way up to 1 just because of the effects of the outlier. So it's still important to do some of the scaling or preparation that we talked about earlier to reduce the effects of these outliers, or to reduce the effects of the long tail that we're seeing. Now, another approach that you can take is something called standardization. What you do here, rather than setting a definite scale like we do in the min-max case, is use a relative scale. You find the mean of all of your data and you find the standard deviation, and you apply a transformation to each data point where you take the value, subtract the mean, and then divide by the standard deviation. And we can see from the graph on the left that our data's range now goes from about negative one to four, but this isn't something that's predefined.
It's just that we're scaling it based on our dataset and how it's distributed. So this is a nice way to kind of regularize your data, because you're not setting a definite scale; you're letting the distribution of the data govern what the new scale looks like. But it's still useful, because if you do this across several features or several different inputs, it will bring them down to a comparable scale. So rather than setting an absolute scale, like we do for min-max, which goes from, for example, 0 to 1, we can have varying scales that are still comparable to each other using the technique of standardization.
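Standardization as described, subtract the mean and divide by the standard deviation, can be sketched like this (SKLearn's `StandardScaler` does the equivalent; the sample values are made up):

```python
import math

# Standardization: each value becomes (value - mean) / standard deviation,
# so the transformed data has mean 0 and standard deviation 1.

def standardize(values):
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

z = standardize([2.0, 4.0, 6.0, 8.0])
```

Applied to each feature separately, every feature ends up on a comparable relative scale, which is exactly the point made above, without forcing a fixed 0-to-1 range.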
6. Data Preparation Continued: In this lecture, we're going to continue looking at data preparation techniques. All right, so something else that's important to know about is sampling and sampling bias. Ultimately, when you're training a machine learning algorithm, you're picking out data that you use for training and data that you use for testing. Now, it's important to know that sometimes the distribution of your data can have significant impacts on how your machine learning algorithm performs on your training as well as your testing sets. If we look at the graph on the right, what we have is a distribution of age groups. We've got 15 to 25, our first group, which has a little over 30 participants in it; then we've got 26 to 59, which has almost 15; and then we've got the 60-plus group, which has about 20. Now, one technique is to randomly take participants from these age groups, and if you do that, you may not mimic the underlying distribution. Sometimes that's not a bad thing, sometimes that's okay. But in some cases, there is an important feature in your distribution that you want to mimic, because those important features can have effects on how the subjects behave, or on the outcome of whatever your experiment is. For example, in the case of the age groups, the different ages may have varying opinions or perspectives, and therefore their answers to questions may have a significant effect on the outcome. And so if we just do random sampling, then what these error bars show is the standard deviation that we can expect. We can see that between the actual distribution and the values that we can get if we just randomly sample, there's a lot of variance and there's a lot of uncertainty.
And so we can see here, for example, that if we have our training set, which is represented in blue, and our testing set, which is represented in green, some of these age groups can be under- or over-represented. In this case, the 60-plus age group is under-represented in the training and over-represented in the testing. Now, the effect that this has on our algorithm is that it will have fewer inputs from the 60-plus group, but it will have to evaluate more about the 60-plus group. It may decide that the opinions of the 60-plus age group are not as important, or not as significant. But then, when it's being tested, it actually has to evaluate a lot of the 60-plus age group's responses, and it's not properly prepared for that. There's a disproportion between the amount of training that's been done on that group relative to the other age groups and the amount of testing that's being done on it, and this can have significant consequences for the performance of your machine learning algorithm. So what you can do is something called stratified sampling, where you're trying to mimic the underlying distribution. We can see in the training and the testing, which are again shown in blue and in green, that the distributions are now much more similar. Now, it's not always going to be a perfect match, and in some cases a random sample can actually look like a stratified sample, like the one we see here. Sometimes, when you're randomly choosing groups, it will look like this, but you're not guaranteed that, whereas with a stratified sample, you're guaranteeing yourself that these distributions look similar in the training as well as in the testing case. And that way, you can make sure that these underlying distributions are kept the same.
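Stratified sampling as described can be sketched by splitting each age group separately, so the train/test proportions mirror the underlying distribution. This is a hand-rolled illustration (SKLearn's `train_test_split` offers a `stratify` option for the same purpose); the group sizes below roughly follow the lecture's example, and the labels are just strings for the sketch.

```python
import random

# Sketch of stratified sampling: split each group on its own so every
# group keeps the same train/test proportion.

def stratified_split(samples, test_frac=0.25, seed=0):
    rng = random.Random(seed)
    groups = {}
    for label, value in samples:
        groups.setdefault(label, []).append((label, value))
    train, test = [], []
    for members in groups.values():
        rng.shuffle(members)
        n_test = int(len(members) * test_frac)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# roughly the lecture's age distribution: 32 people 15-25, 16 people 26-59, 20 people 60+
data = [("15-25", i) for i in range(32)] + \
       [("26-59", i) for i in range(16)] + \
       [("60+", i) for i in range(20)]
train, test = stratified_split(data)
```

With a 25 percent test fraction, every group contributes exactly a quarter of its members to the test set, so the 60-plus group can never be accidentally under- or over-represented.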
The way this data is allocated into your training and your testing sets is kept the same, so that you don't mess with some of these important features. Now, in some cases, what you actually want is an over-representation of a sample. Sometimes you don't want to stratify your sample; you don't want an even distribution, because that can have negative effects on your machine learning algorithm. For example, let's take a spam filter, where we're trying to detect if an email is spam or not, and let's say that most of the data that we have is not spam. If you do stratified sampling here, what your machine learning algorithm may learn is that if it just classifies everything as not spam, it's going to do pretty well. If the chance of an email being spam is pretty low, it's going to do a pretty good job saying nothing is spam, but that's not at all what you want; that kind of defeats the whole purpose. So something that you can do here is over-sample the spam. Rather than having very little spam, you can create your sets so that you're oversampling the amount of spam, and that way your algorithm is not going to learn that spam is unimportant, because it appears about as often as the non-spam content, so it has to learn how to identify it. And then when you go back to the testing case, or the online case, the live case, you may still not be getting that much spam, but now the algorithm can much better identify spam, because it had to do that during the training process. So sometimes, when you have extremely rare events, you may want to think about oversampling those events: including more of those events than you would actually expect when your algorithm is live, to make sure that your algorithm learns these important parts, and that it learns to distinguish and identify even rare events.
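The oversampling idea above can be sketched by duplicating random examples of the rare class until the classes are balanced. The function name and the 95/5 ham-to-spam ratio are invented for this illustration; libraries such as imbalanced-learn offer more sophisticated versions of the same idea.

```python
import random

# Sketch of oversampling a rare class: duplicate random minority examples
# (here, spam) until the model sees both classes about equally often.

def oversample(samples, minority_label, seed=0):
    rng = random.Random(seed)
    minority = [s for s in samples if s[0] == minority_label]
    majority = [s for s in samples if s[0] != minority_label]
    boosted = minority[:]
    while len(boosted) < len(majority):
        boosted.append(rng.choice(minority))
    return majority + boosted

# a training pool where only 5 of 100 emails are spam
emails = [("ham", i) for i in range(95)] + [("spam", i) for i in range(5)]
balanced = oversample(emails, "spam")
```

After oversampling, the training pool is balanced, so a "classify everything as ham" strategy no longer looks good during training, even though the live data remains mostly ham.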
So even though stratified sampling may seem like a good idea initially, sometimes you want to make sure that the distribution of data is not the same, because that will actually have much better implications for the performance of your algorithm. All right, the next thing that I want to talk about is how you deal with non-numerical data. We'll look at the first type, which is going to be categorical data. So you've got different categories. Now, within categorical data, we can again split this up into different things. One of these is ordinal-type values. For example, we've got the star rating system: one star, two stars, three stars, four stars, and five stars. What we can do is treat these values ordinally, and we can say that one star gets the numerical value of one, two stars the numerical value of two, three stars three, four stars four, and five stars five. So what we need to do for categorical data is transform it to numerical values. Now, doing this type of transformation to sequential numbers is good when the underlying values have a distinct order between them, when there's a hierarchy: two stars is better than one star, four stars is better than three stars, and five stars is the best. So this is a good way to treat data that has an underlying order. Another example is if you have review ratings, which generally say this is bad, this is okay, this is good, this is great; you can again assign numerical values that are increasing. You go 1, 2, 3, 4, 5, and the five is actually better than the four. So using this kind of transformation is good when there's a clear order in your data, but otherwise it can actually cause problems.
So if you do this, for example, when you have the categories of student, alumni, teacher, and parent, you can't really assign 1, 2, 3, 4, because that way your algorithm may learn that a student is lower than an alumni, which is lower than a teacher, which is lower than a parent. You can't really compare the categories like that, because these are different categories, and they may be experiencing the same thing from different perspectives. So what you can do in this case, when there is no clear order between the different categories, is something called one-hot encoding. That's what we see in the table on the left, where each of the categories gets its own column, and whenever that value is present, that column gets a one. We can see here that if it's a student, the student column is going to have a one and every other column is going to have a zero; if it's an alumni, every other column is going to have a zero except for the alumni column, which has a one; if it's a teacher, the teacher column is going to have a one and everything else is going to be zero; and if it's a parent, the parent column is going to have a one and everything else will be zero. So what you can do here is take categories and transform them into this one-hot encoding, which lets your algorithm deal better with these categorical types. Now, this is usually good to do if you have a low number of categories, say about 10. But you don't want to do this if you have about 100 categories, because that just inflates the amount of input that you have, and your algorithm may not deal with it very well, because it has to learn all of these different input factors. So another way that you can go about this is using something called embeddings. Now, this is very often used for text, which is why we're also going to be looking at it as the text part of non-numerical data, but embeddings are something that you can do for categories as well as text.
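The two encodings just discussed, ordinal values for ordered categories and one-hot vectors for unordered ones, can be sketched in a few lines. The function names are invented for this illustration, and the category lists come straight from the lecture's examples.

```python
# Sketch of the two categorical encodings from the lecture:
# ordinal encoding for ordered categories (star ratings),
# one-hot encoding for unordered ones (student/alumni/teacher/parent).

def ordinal_encode(rating,
                   order=("1 star", "2 stars", "3 stars", "4 stars", "5 stars")):
    # ordered categories map to increasing integers: 1 star -> 1, ..., 5 stars -> 5
    return order.index(rating) + 1

def one_hot_encode(category,
                   categories=("student", "alumni", "teacher", "parent")):
    # unordered categories map to a vector with a single 1 in their own column
    return [1 if c == category else 0 for c in categories]

stars = ordinal_encode("4 stars")
role = one_hot_encode("teacher")
```

Note that the one-hot vector carries no ordering: "teacher" is not greater or smaller than "student", it just occupies a different column.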
The idea here is that you take a value and instead transform it into a set of numerical values. For example, we can take the word potato and use a 3D embedding, so every word, or every category, is assigned three numbers. Now, these numbers will initially be set randomly, but these embeddings are actually something that you can learn, and the embedding dimension is actually a hyperparameter of your model, too. An example of this: if we take the words clouds, sun, rain, and rainbow, then if we learn these embeddings, they may actually look like what we see on the left. And you can see that to get a rainbow, you take the clouds, add the rain, and add the sun, and basically you've got yourself a rainbow. So you can see that there is a relation between the numbers that we assign to each of the words, and related categories, or related words, are actually going to start to group together, while different words are going to be spread further apart. Embeddings are extremely useful because they allow you to take a large number of words, or a large number of categories, and reduce them down to a much lower set of dimensions. For example, if we do 3D embeddings, we can take a ton of words and reduce those down to three different numerical values. Rather than having, let's say, a one-hot encoding with 100 different columns, we instead just have three. And this is much better for our algorithm, because it can then deal with these embeddings much more easily, and it can know which values are similar as well as which values are different.
It can treat these more appropriately, and it doesn't have to learn or deal with all of these different categories.
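The clouds-plus-rain-plus-sun example above can be sketched with a toy embedding table. The three-number vectors below are invented so that the arithmetic works out exactly; real embeddings are learned from data and only approximately exhibit this kind of relation.

```python
# Toy 3D embedding table: each word maps to three numbers.
# The vectors are hand-picked for the illustration, not learned.

embeddings = {
    "clouds":  [1.0, 0.0, 0.0],
    "rain":    [0.0, 1.0, 0.0],
    "sun":     [0.0, 0.0, 1.0],
    "rainbow": [1.0, 1.0, 1.0],
}

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# clouds + rain + sun lands exactly on rainbow in this toy space
combo = add(add(embeddings["clouds"], embeddings["rain"]), embeddings["sun"])
```

However many words the vocabulary holds, each one still costs only three input numbers, versus one column per word with one-hot encoding.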
7. Classification Algorithms: In this lecture, we're going to learn about classification algorithms. So, first of all, what are they? Well, they're essentially algorithms that let you predict categories from your data. There are also classification algorithms that let you identify groups within your data. Let's go through this with a more practical example using the graph on the right. What we have is just a simple time-spent distribution. Each data point represents, for example, a person. On the x-axis, we have the time spent at work, with the right being more time and the left being less time, and on the y-axis we have the time spent with family, again with the top being more and the bottom being less. Now, let's also say that we know that this group is split into two different categories. We've got group A, the one in the top left, which is more time with family and less at work, and we've got the one in the bottom right. Now, these groups may not just be identified by the way that they act socially; maybe this is also related to how people buy. You've got two different types of consumers, which just happen to fall into these categories, with some spending more time with family and others spending more time at work. Now, let's also look at the green dot in the middle. This is a new person that's being entered, and we want to be able to predict what category they're going to fall into, so that we know the right way to approach them from, for example, a marketing perspective: what kind of catalogs would they maybe be interested in, what kind of buying behavior can we expect from them, and how can we adjust their experience accordingly? So we want to know what group of people they fit into. And a great way to do this is using classification algorithms, which will let you predict the group that this new person is going to fall into.
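Predicting the group for the new green dot can be sketched with one of the simplest possible classifiers: assign the new person to the nearest group centroid in (time at work, time with family) space. The coordinates and the nearest-centroid approach are illustrative choices for this sketch; the lecture's actual plots use other classifiers.

```python
import math

# Sketch of a nearest-centroid classifier for the two groups in the example.
# Coordinates are made up: (time at work, time with family).

group_a = [(1.0, 8.0), (2.0, 9.0), (1.5, 7.5)]   # more family time, less work
group_b = [(8.0, 2.0), (9.0, 1.0), (7.5, 2.5)]   # more work time, less family

def centroid(points):
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def classify(point, centroids):
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    # pick the label whose centroid is closest to the new point
    return min(centroids, key=lambda label: dist(point, centroids[label]))

centroids = {"A": centroid(group_a), "B": centroid(group_b)}
prediction = classify((3.0, 6.0), centroids)   # the new "green dot"
```

The new point sits closer to the family-time cluster, so it is assigned to group A; real classifiers draw more sophisticated decision boundaries, but the prediction task is the same.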
So when do you use classification algorithms? Well, you go about using them when you're more interested in an attribute rather than an exact numerical value. For example, let's say we've got two categories, not cats and dogs, but just simply cats and not-cats. And now we've got another image in this case, and we want to assign this image to a category; we want to classify this image. Is this image of a cat, or is it not a cat? And so this is another very simple example of what a classification algorithm can do. It will assign this image into the already defined classes of cat or not-cat. All right, so let's go into a little bit more detail on some of the algorithms and actually see some examples of them. We're going to be using the Iris dataset that's included with SKLearn. It's basically just looking at three different types of flowers, which is what we have on the left here. Each of these flower images is just taken from the Wikipedia page for the Iris dataset, so you can also look them up and find them there. And essentially there are two overarching forms of the way that we can approach classification. One of them is supervised, the other one is unsupervised. The big difference here, of course, being that with supervised, we have already-known targets for when we're training, whereas with unsupervised, we're not really sure what the correct solution is. Now, another aspect of classification algorithms is single-class versus multi-class classification: being able to predict a single class or being able to predict multiple classes. And we'll look at each of these cases in a little bit more detail now. So first of all, we're going to look at single-class supervised algorithms. Now, what we have on the left here is just two of the four features being plotted from the Iris dataset. So on the y-axis, we have the petal width in centimeters, and on the x-axis for both of these graphs, we have the petal length.
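As a quick aside, the Iris dataset the lecture refers to can be loaded directly from scikit-learn. A minimal sketch (assuming scikit-learn is installed):

```python
from sklearn.datasets import load_iris

# load the dataset used throughout the lecture: 150 flowers, 4 features each
iris = load_iris()

print(iris.feature_names)   # sepal/petal length and width, in cm
print(iris.target_names)    # the three flower types: setosa, versicolor, virginica
```

The petal features used in the plots are the last two columns of `iris.data`.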
Now, we've also split our data into two categories. We've got the not-virginica category (I have no idea how to pronounce this, but it's one of the flowers), and then you've got the other category, which is that type of flower. So basically, the way that I've labeled this data is that everything in black is not that flower, and everything in green is that flower. Now, what you see in the background colors is two different classification algorithms. One of them is a stochastic gradient descent classifier, the SGD, using a linear support vector machine. Again, these are just names for classifiers, so if this sounds confusing, just think of them as names. The other one is a logistic regression classifier. Now, the big difference here is that the blue area represents where the machine-learning algorithm says this is not that type of flower, and the red area says this is that type of flower. And what we have in the middle there is the decision boundary, essentially that kind of green squiggly line. If you cross over that line to the left, for example in the SGD case, then you're going to go into the not-that-flower category, and if you go to the top right, you're going to go into the that-flower category. Now, you can see, by the way that I've decided to display the data, that this algorithm is of course not 100 percent accurate. We've got some of the not-that-flower points showing up where the algorithm predicts that it actually is that type of flower. So again, the black and green represent what the flower actually is, and the colors in the background represent what the algorithm predicts. So black is not that flower, green is that flower, and blue is where the algorithm predicts it's not that flower, and red is where the algorithm predicts it is that type of flower.
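The two classifiers described here can be sketched in a few lines with scikit-learn. This is an illustration, not the lecture's exact code; I'm assuming class 2 in the dataset is virginica:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression, SGDClassifier

iris = load_iris()
X = iris.data[:, 2:4]                  # petal length and petal width
y = (iris.target == 2).astype(int)     # 1 = virginica, 0 = not virginica

# SGD with hinge loss behaves like a linear support vector machine
sgd = SGDClassifier(loss="hinge", random_state=0).fit(X, y)
logreg = LogisticRegression().fit(X, y)

# a flower with large petals, deep in the virginica region of the plot
sgd_pred = sgd.predict([[5.5, 2.0]])
logreg_pred = logreg.predict([[5.5, 2.0]])
```

Both models learn a (roughly) linear decision boundary, but they place it differently, which is exactly the difference between the top and bottom graphs.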
Now, an interesting thing to note here is that if we compare the top and the bottom graph, we can see that the two different algorithms that we're using have different decision boundaries. Not only are they located at different positions, but they're also at different angles. In this case, both of them are, in some sense, almost linearly separating the data. So what they're doing is basically just drawing a straight line through the data which separates it. But we can see that the stochastic gradient descent, which is just using a linear support vector machine at this point, has more of a diagonal separation, whereas the logistic regression has more of a horizontal separation. So let's look at each of these algorithms in a little bit more detail. In the case of logistic regression, what our algorithm is trying to do is optimize the bottom equation; it's trying to find the best probability, and it's using the logistic equation for that. And so we see we've got the features as x, we've got y as our prediction, and we've got our intercept and our coefficients, or our weights, as we introduced them earlier. And so what we get from here, on the left-hand side, is a probability: at this petal width and petal length, is this that flower or not? What is the probability that, given these two values, this is the flower? And so we can see that at low petal width and basically any petal length, the probability is very low. And if we go back and look at that, that's exactly what we see: at low petal width, no matter the petal length, our probability is low, which means the classifier classifies this as not that type of plant. Whereas when we go to high petal width, for basically any petal length our probability is high, which means our classifier now classifies that region as that plant. So we can see that for anything in red, our algorithm is basically saying there's a high probability that this is that plant.
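The logistic equation being described can be written out directly. Here's a small sketch with entirely made-up intercept and weights, just to show the shape of the probability:

```python
import numpy as np

def logistic_probability(x, intercept, weights):
    # p(y = 1 | x) = 1 / (1 + exp(-(intercept + w . x)))
    z = intercept + np.dot(weights, x)
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical intercept and weights for [petal length, petal width]
b, w = -12.0, np.array([1.0, 5.0])

low = logistic_probability(np.array([1.5, 0.2]), b, w)   # small petals: low probability
high = logistic_probability(np.array([5.5, 2.0]), b, w)  # large petals: high probability
```

The output is always between 0 and 1, which is why the bottom graph shades smoothly from blue to red.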
And so it's going to assign these values here and say anything over here is that plant, whereas in the blue region we have a low probability that anything is our plant. So that's what we have in the bottom graph here for logistic regression. If we look at the linear SVM, or the SGD classifier, the stochastic gradient descent classifier here: what the linear SVM does is it tries to find a hyperplane that linearly separates the data. Now, what a hyperplane is: it's basically just a line in this case, the two-dimensional case here. If it's three dimensions, it's a plane, and if it's four dimensions, it's the higher-dimensional equivalent of that; it just tries to find something that it can draw that separates the data. And so that's exactly what we see here. What the algorithm is trying to do is find a good line that separates the data, but it's also trying to keep the distance from the data points to the line as big as possible. And so the algorithms take different approaches to the problem, and therefore we can also see that they come out with different results. With the logistic regression, we have a probability: each value is assigned a probability between 0 and 1 of belonging or not belonging to that type of class. Whereas for the linear support vector machine, we have a region of 0, which is not that class, and of 1, which is that class, and we basically have that one region in the middle where there's a big jump from not that class to that class. So we can see one is a probability that changes much more smoothly, and the other one basically has a big jump between them.
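That contrast, a smooth probability versus a hard boundary, can be seen directly in scikit-learn. A hedged sketch (again assuming class 2 is virginica):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

iris = load_iris()
X = iris.data[:, 2:4]                  # petal length and petal width
y = (iris.target == 2).astype(int)

logreg = LogisticRegression().fit(X, y)
svm = LinearSVC().fit(X, y)

point = [[4.9, 1.7]]                       # a point near the decision boundary
proba = logreg.predict_proba(point)[0, 1]  # smooth value between 0 and 1
score = svm.decision_function(point)[0]    # signed distance; only the sign decides
```

The logistic regression reports an intermediate probability near the boundary, while the SVM's prediction flips abruptly as the sign of `score` changes.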
8. Classification Algorithms Continued: In this lecture, we're going to continue on with the classification algorithms that we started last lecture. All right, so let's take a look at multi-class supervised algorithms. So what we did before is we just tried to differentiate between either that flower or not that type of flower. But actually, in this dataset there are three different types of flowers, which is what we see here. So in the top graph on the left, we've got our three different types of flowers. They're separated using an algorithm called KNN, standing for k-nearest neighbors. And on the bottom, we have a logistic regression using a one-versus-rest approach. So let's talk about these in a little bit more detail, specifically the one-versus-rest. Now, what we had in the first case was that we were just trying to separate between a class of either yes or no. And so that's what the single-class part refers to: we're either trying to say yes or no; it's binary. Now, some of these algorithms don't actually have a multi-class counterpart. They can only say yes or no; they don't have an A/B/C option. And so the way that you can use these algorithms, turning them from saying just yes or no into saying A, B, or C, is to use a one-versus-rest approach, which means you train a classifier that predicts, for each class, the probability of belonging to that type of flower. So we've got, in this case, three logistic regressions, each of them giving a score for a certain petal length and petal width belonging to a specific type of flower. And the classifier that gives us the highest probability that this petal length and this petal width correspond to its flower, that's the one that's going to be chosen. And so that's called the one-versus-rest approach. And in this way, we can take a single-class or binary classifier that can only predict yes or no values, and use several of them together to then predict multiple classes.
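The one-versus-rest approach is available directly in scikit-learn as a wrapper. A minimal sketch of how three binary logistic regressions combine into a three-class classifier:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
X = iris.data[:, 2:4]   # petal length and petal width
y = iris.target         # three classes: 0, 1, 2

# trains one binary "this class vs everything else" logistic regression per class
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)

n_binary_classifiers = len(ovr.estimators_)   # one per flower type
pred = ovr.predict([[1.4, 0.2]])              # tiny petals: should be class 0 (setosa)
```

At prediction time, the wrapper asks all three binary classifiers for a score and picks the class whose classifier is most confident.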
And so we can see here at the bottom, that's the decision boundary that we get. And it does a pretty good job of separating them. Again, you can see that some of the dots kind of spill over. So we've got the green flower kind of spilling over into the red regime, and some of the red flowers spilling into what the algorithm would predict to be the green regime. But it does a pretty good job of separating these three. Now, for the k-nearest neighbors: this algorithm just looks at what the points around a given point are, and based on those, it makes a guess. And an important thing to note here is that for the logistic regression and the support vector machine, what we always had was essentially a straight line separating the data, whereas for the k-nearest neighbor, we can see that there's actually a curve going in, specifically for the difficult part between the two flowers in the top right corner. And so we can see that different algorithms give different decision boundaries. The decision boundaries, again, are those squiggly lines that separate the output from the classifiers. And of course, there are many more different types of classifiers that you can go into; these are just some examples. Now, because all of these classifiers behave so differently, it's often not great to try to learn every single classification algorithm. Rather, it's more important that you have a very good understanding of a couple of them. So pick three, four, or five and understand those in more detail and feel comfortable with them. Often, just having that kind of range of algorithms to choose from and fully understanding them will make your model a lot better, rather than trying to use tons of different classification algorithms where you don't really understand what they're doing, how they're behaving, and which one would be most appropriate to use in this type of situation.
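A k-nearest-neighbors classifier like the one in the top graph can be sketched as follows (an illustration, not the lecture's exact setup):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X = iris.data[:, 2:4]   # petal length and petal width
y = iris.target

# each prediction is a majority vote among the 5 nearest training points
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
pred = knn.predict([[1.4, 0.2], [5.8, 2.2]])
```

Because the prediction depends only on nearby points, the boundary can curve freely, which is why KNN's decision boundary bends in the difficult region while the linear models cannot.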
So, for example, other types of multi-class supervised algorithms are Naive Bayes, or you can use neural networks, but there are of course many more. Ultimately, again, the best thing is just to get pretty familiar with a couple and then stick to those, because those are the ones that you understand best. And if you have a handful of classification algorithms to choose from, odds are one of them is always going to do a pretty good job of helping you solve whatever problem you're tackling. All right, let's look at multi-class unsupervised algorithms. Now, we're only going to look at one, because these can become pretty complicated, and we're going to look at a clustering algorithm called k-means. What a clustering algorithm does is separate our data into different clusters, into different groups. Again, we see here we have a decision boundary and we've got different classes. In the top case, we're saying beforehand that we want this algorithm to split this data into two different groups, and that's what it does: we see the left-hand side is one group and the right-hand side is another group. And in the bottom case, we say we want this algorithm to split it into three different groups; that's something we specify beforehand. And here we see we've got one group in the bottom left, one in the middle, and one on the right. Now, this approach is called clustering, and we define beforehand how many different groups we want the algorithm to identify. The algorithm then tries to group the data together so that there's minimal variance within each group. All right, so how would you go about evaluating how well your classification algorithm is performing? Well, an easy, kind of straightforward approach would be just to measure the accuracy: basically look at the proportion of correctly classified outcomes or predictions. But this can start causing problems if you have a high number of classes.
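A k-means sketch on synthetic data, where we tell the algorithm beforehand how many clusters to find:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two synthetic groups of points, one around (0, 0) and one around (4, 4)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(4.0, 0.5, size=(50, 2))])

# we choose the number of groups beforehand; k-means finds the grouping itself
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
centers = kmeans.cluster_centers_   # one center per requested cluster
```

No labels were given anywhere, which is what makes this unsupervised: the algorithm only minimizes the spread of points within each group.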
So let's say you're predicting 20 different classes. That means each class is about 5% if they're evenly distributed. If you have an accuracy of 95%, that's basically still just random guessing, because you could be saying no every single time and 95% of the time you would still be correct. So accuracy is usually not a good way to go about evaluating classification algorithms. But we can see on the bottom right here, I've still shown the accuracy of the stochastic gradient descent as well as the logistic regression, which is just the proportion of correct predictions, which in this case actually isn't too bad, because we just have two classes. But again, accuracy can really start to cause problems when you're going to a higher number of classes. So what other options are there? Well, another option is looking at precision. Precision is calculated by taking the number of correct positive predictions and dividing it by the number of correct positive predictions plus incorrect positive predictions. A true positive, or a correct positive prediction, is when you say this flower is that type of flower, and it is in fact that type of flower; then that's a true positive. A false positive is when you say this flower is that type of flower, but it's actually not that type of flower. If you say, yes, this is the flower I'm looking for, but it's actually not that flower, then what you have is a false positive. Precision is really good to use when you want to know how reliable your trues are: when I say this is the case, then it's very likely that this is the case. Now, another metric that you can use is something called recall, where you look at the number of true positives over the sum of the true positives and the false negatives. A false negative is when you're saying this is in fact not that class, whereas in actuality it is that class.
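Worked through with concrete (made-up) labels, precision and recall look like this:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]   # actual classes
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]   # predictions: 2 TP, 1 FP, 1 FN

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 3
```

Here the model's "yes" answers are right two times out of three (precision), and it also finds two of the three actual positives (recall).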
So you're saying this is not that type of flower, but in actuality, it is that type of flower. Recall is good to use when you want to evaluate how often you miss your true values. Now, you can also use something called a confusion matrix. What the confusion matrix does is show you how often one class was confused with another. So in this case, we can see on the right here we've got the two tables. For the SGD, the rows are showing the true values and the columns are showing the predicted values. And so we can see 78 were predicted negative and were actually negative, whereas 22 were predicted positive but were actually negative; so we've got some confusion going on here. But we've got no predicted-negative values that were actually positive. And so what you're looking at with a confusion matrix is how often one class is confused with another class. Now, this can be extremely helpful, especially if you're doing multi-class classification. But of course, it requires some more checking, because it's not just one number, it's a matrix. You can go in and really see: okay, where is my algorithm making mistakes? But it requires more work, because you need to go in and look at the whole set of numbers rather than just having one number to measure performance. Now, these are just some of the tools that you can use for evaluation. There are, of course, also other tools, but these are kind of the most basic ones that you should know about. And these are also the ones that should give you a taste and an understanding of why it's not always ideal to use accuracy, and of what other ways you can approach evaluating classification algorithms.
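A tiny confusion-matrix sketch with hypothetical labels, to show the rows-are-true, columns-are-predicted layout:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]

# rows: true class, columns: predicted class
cm = confusion_matrix(y_true, y_pred)
# cm[0][1] counts the true-0 points that were confused as class 1
```

Reading it off: two true-0 points predicted correctly, one true-0 point confused as class 1, and all three true-1 points predicted correctly.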
9. Regression Algorithms: In this video, we're going to go over regression algorithms. So first of all, what are they? Well, regression algorithms are essentially algorithms that let us predict numerical values. So when do you want to use them? Well, a really nice case to use them would be if you want to fill in missing values. For example, if you've got a dataset that's somewhat incomplete, but you have comparable data from before, you can train your algorithms to then fill in the missing values so that you can still use that data, rather than having to discard it. Another place where they're very often used is forecasting. This could be time forecasting, or it could also be forecasting in different areas. So regression algorithms are very often used if you want to use one part of your data to make predictions about another part that you may not have data on. Now, you can also use regression algorithms when you want to understand how your system behaves in different regimes. And what I want you to notice on the graph on the right here is that we actually have the exact same curve as before, where the curve looked linear. So if we go back, we see that this looks like a straight line on the right here. And now if we look at the graph further out, we can see that this part is actually no longer linear; we can see it's a polynomial, it has curves in it. Now, this data was obviously generated by me; this is not extremely realistic. But the point that I want to make here is that in some regions, our data can actually look linear. So if we zoom in on the region between 0 and 1, or 0.5 and 1, which is what we do here, we can actually see that a straight line does a pretty good job of fitting the data. But when we zoom out, this is no longer the case; a straight line is no longer going to do a good job of modeling this.
And so if we have a good model or a good algorithm, then we can go into different regimes, and hopefully we can understand how the data behaves in these different parts. In this way, a very good model can help us understand something that we haven't been able to investigate yet. But that being said, you should also be aware that sometimes, or often, when you train a model, it's only going to be valid for the data range that you trained it on, rather than for the whole set of possible data. So even though regression algorithms can be very powerful in helping you predict numerical values, you want to make sure that you're using them in the appropriate data range, and that if you're getting data that's completely outside of the range you've normally considered, you re-evaluate how your algorithm performs on those pieces of data. Because as we can see from the comparison here on the right, first we start off linear and then we go to this polynomial. So if we go into data very much outside of the range that we're used to, the way that our system behaves can change. All right, so let's take a look at some of the algorithms. We're again going to be using the Iris dataset that we also used for the classification case. Now, when you're doing regression, you usually only talk about supervised learning; unsupervised isn't particularly well defined for regression algorithms. And so that's why, when we're talking about regression, it's usually going to be a supervised case, where you have your training data and you also have appropriate labels or target values that you want to achieve. All right, so let's consider again this Iris data. What we have here on the x-axis, which you can see on the graph on the left, is the sepal length, and on the y-axis we've got the petal length. And what you can see from this graph is that these two characteristics seem to be somewhat related.
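The point about data ranges can be demonstrated with a small sketch: fit a straight line to data that only looks linear in a narrow range, then predict outside that range. The data here is entirely made up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# a cubic relationship, sampled only in a range where it looks almost linear
x_train = np.linspace(0.5, 1.0, 20).reshape(-1, 1)
y_train = x_train.ravel() ** 3

model = LinearRegression().fit(x_train, y_train)

inside = model.predict([[0.8]])[0]    # close to the true value 0.8**3 = 0.512
outside = model.predict([[3.0]])[0]   # the true value is 27; the line is far off
```

Inside the training range the line does fine; far outside it, the prediction is badly wrong, which is exactly why predictions outside the trained range need re-evaluation.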
So let's try to use some regression algorithms to see if we can predict some of the values, or if we can have a model that tries to capture this behavior that we're seeing here. Now, there are different ways that we can approach regression, and there are very many different algorithms. The simplest one is a linear regression, linear just being a straight line, and polynomial being a curved line. That's kind of what we saw here: this is going to be linear, and this is going to be a polynomial. And then you've also got support vector machines that you can use for regression. You can also use k-nearest neighbors regression. Neural networks are, of course, also very often used for regression, and there are still many other supervised algorithms that you can use for regression. So just like for classification, we have a lot of algorithms to choose from. And oftentimes it just comes down to: okay, what is the problem, what is our dataset, and what model is most appropriate to use for this type of data? So let's take a look at the linear regression. What we have here is a linear regression model trained on this dataset. And what I've used is one-hot encoding; remember, we talked about that in the data preparation case, where we one-hot encode the three different flower types. And so our linear model actually creates three different best fits, one for each of the individual flower types, and we can see that this is the result. And so what you can see from the linear model is that we can go down to lower sepal lengths and predict corresponding petal lengths. We don't know if those values would be correct or not, but we can at least try to predict them. And we can do that for all the flowers, since we've used one-hot encoding.
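A sketch of the setup being described: a single linear regression whose inputs include a one-hot encoding of the flower type. This is an illustration, not the lecture's exact model:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

iris = load_iris()
sepal_length = iris.data[:, [0]]
petal_length = iris.data[:, 2]

# manual one-hot encoding of the three flower types (rows of the identity matrix)
one_hot = np.eye(3)[iris.target]
X = np.hstack([sepal_length, one_hot])

model = LinearRegression().fit(X, petal_length)

# predict petal length for a setosa-encoded flower with sepal length 5.0
pred = model.predict([[5.0, 1.0, 0.0, 0.0]])[0]
```

Because the one-hot columns act as per-flower intercepts, this one model effectively contains a separate best-fit line for each flower type.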
And so what our linear model has actually done is split into three different linear models, one for each of the three flowers in this case. It checks which encoding this is, so which flower it is, and then uses the appropriate linear model to make predictions. We can also do the same thing using a k-nearest neighbor regression. Now, whereas the linear model tries to find the best straight line to go through the data, the k-nearest neighbor looks at the surrounding neighbors of a data point, and the prediction is just the average value from the surrounding data points. And so we can see that when we go to lower or higher sepal lengths, our KNN algorithm doesn't really provide great predictions anymore. We can see that when we go into the lower sepal lengths, it's basically all flat, whereas in the linear case, it's decreasing. Now, because we don't have data there, we actually don't know which one is correct, or if flowers with those characteristics even exist. So until we have data there, we can't actually validate which one of these is more accurate. But we can see that the behavior of these two is very different. And we can also see that the linear case really is just a straight line, whereas for the k-nearest neighbor case, we've got small horizontal lines that jump up and down as we go across the different sepal lengths. And again, here we've got three different models, which are also color-coded in purple, black, and gray for the orange, blue, and red colored flowers respectively. And so you can see that we have small regimes where these predictions would be the same, and it almost looks like a staircase that isn't fully connected that we're making predictions on. So just like in the classifier case, we can see that different algorithms, of course, behave very differently.
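The k-nearest-neighbor regression being described can be sketched like this (again an illustration, ignoring the per-flower split for simplicity):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsRegressor

iris = load_iris()
X = iris.data[:, [0]]   # sepal length
y = iris.data[:, 2]     # petal length

# the prediction is just the average petal length of the 5 nearest sepal lengths
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
preds = knn.predict([[5.0], [7.0]])
```

Because the prediction is an average over a fixed neighborhood, the output stays constant over small input ranges and then jumps, producing the staircase look, and it goes flat outside the data range instead of extrapolating like the straight line.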
And so a great thing to do, of course, is to just see: okay, what kind of predictions does my algorithm make, or what does my algorithm actually look like? So visualizing the predictions that your algorithm would make for all of these different sepal lengths is a nice way to understand what your algorithm actually does. Now, of course, this was just two algorithms. We looked here at the linear case, and we looked here at the k-nearest neighbors. Again, like I said, there are very many different regression algorithms, and each of them takes a different approach, so the way that the predictions look is also going to be very different. So how would we evaluate these different models? Well, a good thing to do is just to look at the error. Now, I'm going to use y to denote our targets here, and I'm going to use y hat to denote our prediction. And one way that you can look at error is just by looking at the mean absolute error. You look at each prediction and see how far the prediction is away from the true value. You take the absolute value so that the numbers are only positive, you sum them up, and you divide by the number of data points. And so in that way, you get a measure of, on average, how far away your prediction is from the true value. And so we can see on the right, we've got two models: on the top left we have the linear regression, and on the right we have the k-nearest neighbor. And on the bottom we can see that, using the mean absolute error, the linear regression is off by about 0.2; so that means on average it's off by about 0.2 centimeters in the petal length prediction, whereas the k-nearest neighbor is off by only about 0.17 centimeters. So using this measure, it looks like the k-nearest neighbor works better. Now, there's actually also a different way that you can look at error.
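The mean absolute error can be computed in one line; a sketch with made-up targets and predictions:

```python
import numpy as np

y_true = np.array([1.4, 4.5, 5.1, 1.5])   # targets (y)
y_pred = np.array([1.6, 4.1, 5.0, 1.3])   # predictions (y hat)

# average absolute distance between prediction and target:
# (0.2 + 0.4 + 0.1 + 0.2) / 4
mae = np.mean(np.abs(y_true - y_pred))
```

Every error contributes in proportion to its size, so a miss of 0.4 counts exactly twice as much as a miss of 0.2.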
And this is often the preferred one, which is looking at the mean squared error; in machine learning, it's often also discussed through the root mean squared error, which is a different way of looking at it, and we'll talk about that in a second. For the mean squared error, rather than taking your prediction, seeing how far it is away from the target, and just taking absolute values, you square the difference. And the reason you square it is that outliers, or predictions that are very far off, are punished much more heavily. So for example, in the mean absolute error case, if we have a prediction that's off by 0.1 and another one that's off by 0.5, these are essentially going to be weighted in proportion. Whereas in the mean squared error case, the higher error is going to have a much larger effect on our evaluation. And the reason the mean squared error is nice is that it not only gives us a measure of how good our algorithm is, but it heavily punishes predictions that are very far off from the actual target. And so we can see here, if we look at the mean squared error for the linear regression and the k-nearest neighbors, the linear regression in this case actually also has a higher mean squared error. So not only is the model on average off by more, as we saw from the mean absolute error, but from the mean squared error we see that the extremes are also not as well fit as in the k-nearest neighbor case. Now, what is this root mean squared error? Well, the root mean squared error is just taking the square root of the mean squared error. The reason that you want to do this is that the mean squared error gives us values in squared units. So what we have from the mean squared error is essentially a difference in centimeters squared, which doesn't really make sense; it's not really what we're after.
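And the mean squared error and root mean squared error, on the same made-up numbers:

```python
import numpy as np

y_true = np.array([1.4, 4.5, 5.1, 1.5])
y_pred = np.array([1.6, 4.1, 5.0, 1.3])

mse = np.mean((y_true - y_pred) ** 2)   # squaring punishes large errors more heavily
rmse = np.sqrt(mse)                     # back in the original units (centimeters)
```

Notice that the 0.4 miss dominates the MSE far more than it dominated the MAE, which is the outlier-punishing behavior described above.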
So we can take the square root, and now we can read from the root mean squared error that for the linear regression we've got an error of about 0.2 centimeters, and for the k-nearest neighbor case we've also got an error of about 0.2 centimeters. So taking the square root brings our units back to what we're actually measuring and allows us to assign a more physical meaning to the value. Now, the reason that you don't really use the root mean squared error in machine-learning cases is that the performance is the same; you're just taking the square root at the very end, and the square root is an extra operation that takes more time. And so you can just use the mean squared error, because if one mean squared error is bigger than another, then its square root is also going to be bigger. So taking the square root doesn't actually add anything extra when you want to use this to evaluate your algorithm. Taking the square root adds a lot when you want to interpret the error from a human perspective, but when you're using it in your algorithm to train, the mean squared error does the exact same thing, and you don't need to pay the extra cost of computing the square root every single time.
10. Optimization Techniques: In this lecture, we're going to look at optimization techniques for our algorithms. Before that though, let's take a quick recap of what we've done so far. First, we looked into preparing for training: identifying potentially good features and just generally going about understanding our data. Then we went into more detail about preparing our data, using techniques like normalization and scaling, and the importance of sampling. And then we looked at different types of algorithms, specifically for regression and for classification. But how do we make our algorithms learn faster? So far, we've talked about what algorithms we can use, and about the whole data science flow, which is basically everything up to the algorithm choice: how do we go about our data, how do we prepare our data, how do we understand our data, how do we identify good features, how do we explore — all of those important topics. And then we talked about using our model and how to evaluate it. But now, how do we make our algorithm learn faster? Because sometimes, when you're using a lot of data, your algorithm can actually take quite a while to train, and you just have to wait for it to complete. And so, of course, one thing you can do is just use more or stronger computers. But there are also other techniques that you can use to try to make your algorithms learn faster, so that they become better, quicker. And that's what we're going to look at now. So here we're going to cover the different types of optimizers. We're going to start with gradient descent, which is something that we also used before. We're going to look at momentum, then NAG, AdaGrad, and RMSProp, and finally Adam as well as Nadam. Okay? Before we go into that though, I quickly want to introduce another term called the learning rate.
Now, the learning rate defines how fast our algorithm learns, or basically how much our algorithm is affected by the errors that it's currently making. Now, you may intuitively think: all right, I want my algorithm to learn as fast as possible, so that it becomes as good as possible quickly. That's not actually how it works. So let's look at the two extremes. This is usually one of those Goldilocks things where you don't want it to be too small, because if it's too small, then essentially you're going to take a very long time to reach an optimal solution, or a model that does an acceptable enough job. So if your learning rate is too small, it's going to take too long. But if your learning rate is too big, then you may actually be taking jumps that are too big, and you may never reach an optimal solution because your algorithm is overcorrecting. Now, the learning rate is denoted by eta, which is the symbol down at the bottom that basically just looks like a strange n. And so when we go through the next parts of the lecture, if you see that symbol, that strange n, that's going to be our learning rate. Now, the learning rate is also a hyperparameter of your system that you can tune. And of course, you want to find that optimal learning rate: not so big that your model overcorrects, but also not so small that your model takes too long to reach the optimal solution when it could have done the same thing with bigger steps. All right, let's look at the first algorithm, which is called gradient descent. With gradient descent, all that we're doing is taking our error function — or loss function, or cost function — which in this case, for example, is the mean squared error that we talked about in the regression case. The mean squared error is just a quadratic equation, which is just this square here.
And what gradient descent does is it just takes our current weights and shifts them by the learning rate times the gradient of wherever we currently are on our cost function. So if we look at the graph on the right, if our error is more towards the left side, higher up the curve, then we're going to have a steeper gradient at that point, and so we're going to adjust more. Whereas if we're closer to the minimum, which is what we want to reach, then our gradient is going to be smaller. And the learning rate, you can essentially think of as scaling that gradient. So we can see that changing the learning rate is basically changing between the orange and the purple arrow in the one case, or the black and the blue. We can make the learning rate bigger or smaller, and that basically affects the step size that we're going to take. So all that we're doing with gradient descent is we have our weights, including our bias, so you remember that term theta from before, and we just look at our current weights and we update them based on the gradient for each weight, which comes from the cost function. All right, the next thing that we can do is use momentum optimization. Now what momentum does is it basically takes gradient descent, which is what we looked at just previously, and it adds a little bit. So what gradient descent does is every single step we calculate the gradient and then we adjust our weights accordingly. If we're very high up the error curve, our gradients are going to be steeper, and we're going to adjust more. And the closer we get to that dip at the very bottom, we can see that our arrows get a little smaller and smaller, and so we're going to adjust less. So what momentum does is it keeps track of basically the previous gradients. And so we can see here the momentum is what I've noted in green. And if we start on the left, we start rolling down the hill.
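That weight-update rule, weights minus the learning rate times the gradient of the cost, can be sketched for a simple linear-regression mean-squared-error cost. Everything here (the random data, the true weights, the learning rate) is a made-up illustration, assuming NumPy:

```python
import numpy as np

def gradient_descent_step(X, y, w, eta):
    """One gradient-descent update on the mean-squared-error cost."""
    residual = X @ w - y                      # current prediction errors
    grad = (2.0 / len(y)) * (X.T @ residual)  # gradient of the MSE w.r.t. each weight
    return w - eta * grad                     # shift the weights against the gradient

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])  # bias column + one feature
true_w = np.array([1.0, 2.0])
y = X @ true_w                    # noiseless targets, so the optimum is exactly true_w

w = np.zeros(2)                   # start with all weights (including the bias) at zero
for _ in range(500):
    w = gradient_descent_step(X, y, w, eta=0.1)
# w is now very close to true_w
```

Note that the bias is treated as just another weight (the column of ones), which matches the theta-includes-bias framing above.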
And our momentum is shown there in green. And then we also take the gradient at the local point and change our momentum accordingly. And then we update our weights based on the new momentum. And the effect that this has, as we can see on the left, is we're going to start rolling down the hill much faster, essentially, because we're going to pick up momentum. But when we reach the bottom, we can see on the right, for example, that the momentum is still pointing up. So we're going to roll past the bottom a little bit further before we then kind of decelerate and go back down. So with gradient descent, basically, if we're on the left-hand side, we're kind of rolling down the hill and we're slowing down as we reach the bottom. Whereas with momentum, you can imagine it more like we have a bowl, and we put a ball on the rim of the bowl and just let it run down. And even as the ball reaches the lower point, it still has some momentum that's going to carry it up the other side of the bowl before it gets slowed down, and then it comes back, and it kind of bounces around a little bit until it eventually stays put in that lowest part. And momentum as an optimizer works in a very similar way. The next type of optimizer is called the Nesterov accelerated gradient, or NAG for short. And it basically just takes momentum, but rather than looking at the gradient at the current weights, it looks at the gradient at the current weights plus our momentum. So we can see that our current step would be those more transparent arrows, and our step at our current weights plus the momentum would be the more opaque arrows, the more colored ones. And the nice thing about that is that our optimizer actually looks a little bit "into the future", in quotation marks, at the next step. And so it's not going to accelerate as much, as we can see in the left-hand case, where we're actually not going to get that big of a boost.
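That ball-in-a-bowl behavior shows up in a tiny, hypothetical sketch of momentum on the same kind of made-up quadratic loss f(w) = (w - 3)^2; the beta decay factor and eta values are illustrative assumptions:

```python
# Momentum keeps a running, decayed sum of past gradients (m) and steps by
# that, rather than by the current gradient alone.

def momentum_descent(eta, beta, steps, w=0.0):
    m = 0.0
    path = []
    for _ in range(steps):
        grad = 2 * (w - 3)         # gradient of f(w) = (w - 3)**2
        m = beta * m + eta * grad  # accumulate decayed past gradients
        w = w - m                  # step by the momentum, not the raw gradient
        path.append(w)
    return w, path

w_final, path = momentum_descent(eta=0.1, beta=0.9, steps=300)
# The iterate overshoots past the minimum at w = 3, bounces back and forth
# like the ball in the bowl, and eventually settles near the bottom.
```

Plain gradient descent would only ever slow down as it approaches w = 3; the overshoot in `path` is exactly the momentum carrying the ball up the other side of the bowl.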
And we're also going to decelerate more once we've passed the optimal point, as we can see on the right-hand side, where we have a larger kickback already. So what NAG does is, rather than looking at and using the gradients at the current weights, it looks at the gradients at the current weights plus the momentum, then adjusts the momentum according to that, and updates our weights based on the momentum. All right, so the next types are going to be AdaGrad and RMSProp. And these are very much related, which is why I put them in the same category. Essentially, all that we're doing here is we have, kind of similar to the momentum before, another value that we're updating. We have our current starting value, and we change that based upon the gradient in every single direction. So we're kind of scaling by our gradients, as we can see: we're taking the gradient of each weight and we're multiplying it by itself. That's going to be our s value here, which is going to be defined for every single weight. And then we update our actual weights and biases by subtracting the gradient, with each component divided by that accumulated value. So what we're doing here is we're actually reducing the effect of bigger gradients, and that way we're not going to be going off in the wrong direction, but we're essentially going to be guided towards that optimum quicker. Now, that strange e at the bottom is called an epsilon, and that's just a very small number. You have that so that you don't divide by zero. But essentially, what this does is it takes all of the gradients and it scales them based on their size. And that way your algorithm doesn't really go off into wrong directions, but it will go more quickly in the right direction. RMSProp is almost the same thing. All that we're doing is we're adding a scaling term, which is this gamma here.
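The accumulated s value and the epsilon just described, plus the RMSProp decay term gamma, can be sketched like this. The two-weight quadratic loss and all the parameter values are made-up illustrations, assuming NumPy:

```python
import numpy as np

def adagrad(grad_fn, w, eta=0.5, steps=200, eps=1e-8):
    s = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        s += g * g                            # running sum of each weight's squared gradients
        w = w - eta * g / (np.sqrt(s) + eps)  # scale each step down by sqrt(s); eps avoids /0
    return w

def rmsprop(grad_fn, w, eta=0.05, gamma=0.9, steps=400, eps=1e-8):
    s = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        s = gamma * s + (1 - gamma) * g * g   # decaying average: older gradients fade out
        w = w - eta * g / (np.sqrt(s) + eps)
    return w

# Loss f(w) = 10*w0^2 + 0.1*w1^2: the two weights have very different gradient
# scales, which is exactly where this per-weight scaling helps.
grad = lambda w: np.array([20.0 * w[0], 0.2 * w[1]])
w_ada = adagrad(grad, np.array([1.0, 1.0]))   # ends up very close to the minimum at (0, 0)
w_rms = rmsprop(grad, np.array([1.0, 1.0]))   # settles near the minimum as well
```

The only difference between the two functions is that one line for s: a running sum for AdaGrad versus a gamma-weighted decaying average for RMSProp.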
And so this way we kind of discard some of the earlier gradients, and we can adjust the weight that our current gradients have, as well as the memory of the previous values. So we can see that AdaGrad and RMSProp are very similar, and they have the same goal, which is not letting your algorithm go off in the wrong directions. But they achieve this a little bit differently. RMSProp is usually better than AdaGrad, just because it doesn't remember as much of the old values, it forgets faster, and it considers the current values more. But this gamma, of course, is also something that you can tune, so this is another hyperparameter. All right, finally we've got Adam and Nadam. Now, in what these do, you'll probably recognize a lot of the parts. We've got the p, which is essentially a p for momentum. We've got the s, which is the s from AdaGrad, or actually RMSProp in this case. And what we're doing is we combine our p and our s, and we basically update based on these two values. And there's a little bit of scaling going on: we're dividing our p by one minus gamma one, and our s by one minus gamma two. And essentially we're just doing this so that we reach the optimal values of p and s quicker. But the idea of this Adam algorithm is that it combines momentum and it combines what we use for RMSProp, and it works pretty well. And then we've also got Nadam, which basically takes Adam and adds the Nesterov approach. So rather than looking at the gradients at the current place, it will look at the gradients at the next place, after considering momentum. So basically, what NAG is to momentum, Nadam is to Adam. So these are all kind of the popular optimizers, if you want to go and just pick one. The default one that's usually used is gradient descent, but it is by no means the optimal one. If you're just going to choose an optimizer, it's usually good to go with Adam or Nadam.
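Putting those pieces together, here's a hypothetical sketch of the Adam update, using the common beta1/beta2 names for the gamma one and gamma two scaling factors above; the loss and all parameter values are again illustrative assumptions:

```python
import math

# Adam on the made-up loss f(w) = (w - 3)**2: p is the momentum-style average
# of gradients, s is the RMSProp-style average of squared gradients, and both
# are bias-corrected (divided by 1 - beta**t) because they start at zero.

def adam(eta=0.1, beta1=0.9, beta2=0.999, steps=1000, w=0.0, eps=1e-8):
    p, s = 0.0, 0.0
    for t in range(1, steps + 1):
        g = 2 * (w - 3)                      # gradient at the current weight
        p = beta1 * p + (1 - beta1) * g      # momentum part
        s = beta2 * s + (1 - beta2) * g * g  # RMSProp part
        p_hat = p / (1 - beta1 ** t)         # bias corrections, so the early
        s_hat = s / (1 - beta2 ** t)         # averages aren't dragged toward zero
        w = w - eta * p_hat / (math.sqrt(s_hat) + eps)
    return w

w_final = adam()   # ends up close to the minimum at w = 3
```

Nadam would change only where the gradient is evaluated: at the look-ahead point after applying the momentum step, just like NAG does relative to plain momentum.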
These perform very, very well. And so, you know, if you're not really sure, you should probably go with those ones; you should be aware that Adam and Nadam are usually the default ones to go for. But you can see that each of these optimizers behaves a little differently. And so, you know, there are guidelines for which ones are generally better, but of course, this can also depend on the situation. And if you're not sure, it may be worthwhile experimenting with two or three of them, and then seeing which optimizer actually performs the best in terms of making your algorithm learn the quickest, and then going with that the rest of the way through, so that when you're improving your model, it actually learns quicker than it otherwise would. All right, and so with that, we've reached the end of the course. And I just wanted to thank you all for joining me. I hope you've learned a lot, I hope you've enjoyed the course, and I hope you feel comfortable having some conversations about machine learning now. Now, you may have noticed that throughout this course there's a lot of data science work that goes into it; machine learning, or the machine learning specialization, is actually built on having really solid data science skills. So if you're interested in specializing in machine learning, or if you're interested in becoming a data scientist, I also have some great content on my website, codingwithmax.com, designed specifically to get you from absolutely no experience to data scientist. So if you're interested in that, or if you want to hear my tips on how I got started, again, you can check it out on codingwithmax.com. And yeah, thank you again for joining me, and I hope to see you in one of my other courses.