Transcripts
1. Essentials Of Machine Learning: Hey, everyone, it's Max, and welcome to my course on the essentials of machine learning. In this lecture we're gonna get an intro into machine learning. So I'm gonna tell you what this course is all about, what you're gonna learn, and also the things that we're not gonna cover in this course, just to make sure that you know what to expect and that you can decide if this is the right course for you to be taking at this time. In this lecture, I'm also going to give you a short introduction into who I am, and then we'll talk about what machine learning is and what machine learning generally looks like when you're doing it in practice. Okay, so what are we gonna cover in this whole course? First, we're gonna learn about a ton of essential terms, and these are terms that span across the whole field of machine learning, so it's important to know them when you're talking about anything related to machine learning. Then we're gonna look at data preparation techniques. We're then gonna look at performance measures, so how you can evaluate how well your machine learning implementation is doing. Then we're gonna look more specifically into different regression and classification algorithms. And finally, we're also gonna learn about different optimizers that you can use and how you can decide which optimizer to use for different machine learning algorithms. All right, so what are you gonna get from this course? Well, you're gonna get a good understanding of what you need to consider when you're approaching machine learning problems. So you're gonna learn about machine learning algorithms and what you need to consider when you're choosing one, as well as what you need to consider before and while you're training one. You're gonna be able to talk about different machine learning implementations with people.
So you're gonna be able to have conversations with people where they tell you about their implementation. You're going to be able to understand what they're talking about, and you may even be able to give feedback or suggestions for things to consider. And you're going to get a whole-picture understanding of what the whole machine learning process looks like. So what is not included in this course? This course does not include coded examples of machine learning implementations. So, like I said, this is to give you an understanding of the machine learning topics so that you know everything that's going on, you can understand when people talk about it, and you can have conversations about it and give feedback. But it doesn't include any coded examples of actually implementing any of the algorithms. And we're also not going to go over the mathematical background and derivation of the algorithms and the different techniques. Now, at this point, I also want to point out that machine learning is an extremely vast field, and there's a lot of active research going on. It's a very fast-developing field, and in this course we're not going to go into areas such as deep neural networks, and we're also not gonna go into the latest research. Again, the point of this course is that you feel comfortable talking about machine learning as a whole, that you can have conversations about it, that you can understand implementations that people have done, and that you can provide feedback on them. And a lot of this is gonna be based upon an understanding of the fundamental machine learning techniques that have been around for a while and that all of this new research is being built upon. Okay, so who am I? Hi, my name is Max, and I'm going to be your instructor for this course.
I've been working as a data scientist for over three years now, and I've also been lucky enough to teach over 13,000 students about the world of data science. So what is machine learning? Well, the general purpose of machine learning is that you train a machine to perform a task, and this machine will learn the rules behind the system. So this is really cool because you don't need to write up all the individual rules and you don't need to keep changing them. Your system is gonna learn the rules, and it can actually evolve with time and learn new rules as things change. And ultimately you're letting the machine generalize its knowledge, so it's gonna learn a certain thing, and then it's gonna be able to apply that knowledge to a bigger field and also to new problems that it may not have seen before. So what would a machine learning engineer do? Well, it's a pretty long process. First you need to have data. So you need to get and process data; you need to convert it from a raw format into a clean format. And you're also gonna need to be able to analyze and understand your data and create features and indicators. So this first part is very similar to, and is actually based upon, all of the data science skills. So if you're not exactly sure about data science, I've also got a lot of stuff on data science, and if you're unsure about these first two things, I'd recommend you check that out. But of course, machine learning builds upon all of the data science skills. Now, what you're gonna continue on doing, once you've done all of that, is weigh the different machine learning algorithms. You're gonna choose several ones to test out, you're gonna apply them, and you're gonna identify the most optimal ones or the ones that you want to use. And over this whole process, you're gonna iterate, and you're gonna train the model until you're eventually satisfied, and then it's gonna be launched into production.
So it's going to go live, essentially, and even after it's gone live, you're still gonna need to monitor it. You're still going to need to see how it's performing and, over time, also fix it, improve it, or just keep it up to date.
2. Essential Terms Part 1: Hey, everyone, it's Max from Coding with Max, and welcome to this lecture. We're gonna go over the first part of the essential terms. Now I want to just mention that if you don't understand everything that's going on right now, don't worry. I want to introduce all these terms to you before we go into the different areas so that when they pop up again, you're already familiar with them and we don't need to sidetrack to explain new terms. So most of these terms we're going to see again. And if you don't understand them right now, again, don't worry about it. We'll cover them again, probably in later lectures. The idea here is just to give you an understanding of all the terms that don't really pertain to one particular area or another, so that you've seen all of these, you know how to contextualize them, and you understand the bigger picture of everything. And then we'll dive a little bit more into detail on each of the things that we mentioned in the previous video. All right, so let's jump right into it. The first thing that we're gonna talk about is the different approaches that you can take in machine learning. These are split up into a supervised, an unsupervised, and a reinforcement approach. Now, a supervised approach is when you have data and you're also provided solutions, so the ideal answers that you want. So in this case, what we see in the picture on the left is a straight line, and each of these values may come with an ideal answer. So you may get all the values on the x axis, and you want to predict the values on the y axis. And so in this case, you already know what the pairing is.
And the idea of supervised machine learning approaches is that you separate your data so you only take the data that you want to make predictions on, and then you check those predictions against the answers, and based upon how correct or how wrong your predictions are, you then change your machine learning algorithm to ultimately make it better at coming very close to those answers. Now, the other type of approach that you can take is unsupervised, where there aren't actually any correct answers, or you don't have specific answers. And so the goal of your algorithm here is to try to find patterns itself. So a good example is the image in the middle: if we have just a data set like this and we want to understand, or we want to know, if there are different groups present, something that we can run is finding clusters, which we'll talk about more in a second and also more in later lectures. And so the idea is that our machine learning algorithm in this case will find two different groups, which are shown here in blue and in orange, which it has learned to separate. Finally, the third machine learning approach that you can take is the reinforcement approach, and this is in some sense similar to unsupervised. But it really takes it to the next level, where you let your machine learning algorithm just kind of go on its own, and it will start to do stuff, and it will get feedback based on if its action was good or if its action was bad, and you can define what is good and you can define what is bad. These are some of the most modern approaches, and they're also very complex, and they can really get very specific. And essentially, the idea is that this is kind of emulating the way that we learn. And so the point here is that with your machine learning algorithm, you just kind of feed it data, and then you let it make its own decisions.
And based on the decision that it makes, you then either say, yeah, this was a good decision, or this was a bad decision. And then it will learn over time to make more good decisions and also avoid bad decisions. So, the different types of machine learning algorithms that there are: essentially you can have different goals with your machine learning algorithms. There's regression, there's classification, and there's also dimensionality reduction. So let's go through each of these. Regression is when you're trying to predict a specific value. So the goal here is, let's say, you're trying to predict a number. If we have our x data here, we're just trying to predict the corresponding y value. So this could be that you're trying to predict a continuous series of numbers, or you're trying to predict numbers in specific intervals, or something like that. But the goal here is that you're trying to predict a certain number. Whereas classification, on the other hand, is when you're trying to split your data into different groups. So in this case, and this is the same chart that we used last time, we have two different groups, and the goal of the algorithm is just to sort data points into either Group A or Group B. So, for example, if you have users on your website, the machine learning algorithm could look at what the user does, and it could then either say this is a user who's likely to purchase from us in the future, or this is a user who needs more hand-holding, or who needs more education on how to use your product, or however else your groups may come about. But that's the idea: you're not trying to assign any numerical values to them, but rather you're trying to sort them into different groups. Now, dimensionality reduction is an approach that you can take to prepare data for machine learning, and it's actually itself kind of a whole set of machine learning algorithms.
But the goal is that oftentimes when you have data, and especially when things get very complicated, you have a lot of different dimensions. So let's just take an example here. If you have an image and you're trying to identify something in that image, the image is made up of a bunch of different pixels, depending on your resolution, and each pixel, if the image is colored, also has different color values with it. So it comes with three different color values. So very quickly, even if you just have 100 pixels, you have 100 pixels times three colors. That's 300 different values that you've got going on, and 100 pixels is also not a very big image, so you can see that very quickly your dimensions can get very large. And so the idea of dimensionality reduction is that you take all of these images, or not just images, but all these data sources that have tons and tons of data, and you try to reduce them so that rather than having a million or five million different data points for each set of data that you have, you can reduce that number down to a much lower one, which will help your machine learning algorithm, because ultimately you're just focusing on the important things. All right. So let's dive a little bit deeper into the building out, or the evolution, of a machine learning algorithm. So, the machine learning flow: the first thing that we're gonna need to do is train the algorithm. Now, you may either take a completely new algorithm, or you could have a partially trained or an already existing algorithm that you need to improve upon. But whatever you start with, you're still gonna wanna train it on whatever data you've collected. So you're gonna use that data, you're going to make predictions, and then you're gonna evaluate those predictions, and you're gonna look for the mistakes based upon the data that you have available to you. Now, alongside training, you're also gonna want to do validation.
The goal of validation is ultimately just to have a data set that you can evaluate your predictions or your current model on and see how it performs on data that it hasn't seen yet. The purpose of this is that you can avoid issues of overfitting. Now, overfitting we'll talk about again a little bit later, but generally it's when your algorithm finds patterns that don't truly exist. So the point of validation is that once you've trained your model, you want to test it on some data that it hasn't seen before, so data that its predictions haven't been corrected on, and see how it does against that. And validation is really nice because essentially, you're just taking your training set and you're splitting part of it off. You're using most of your training set for training, and you're gonna use another part to validate against. And it can really help you identify issues of overfitting. And it can tell you when you need to stop training: what is a good point, when is my model actually performing well? Now, the last part is gonna be the testing part, and the point of testing is actually very similar to validation. But there's one very big difference: your model only gets to see the testing data once, and you're not going to continue to improve your model to try to fit the testing data better. So generally what you want to do is you take your initial data set and you split off 80% of it and you put that into the training part. So this is data that your model is probably going to see more than once. And the other 20% you just put aside and you don't touch it. You don't even look at it, not even as a human, because you don't want to introduce any bias into your algorithm. You just put it aside and you just leave it there and you don't touch it until the very, very end, until you actually want to know.
Okay, how does my algorithm perform? Now, with the training data, so that 80%, you can split that up into the training and the validation parts. And what validation and testing have in common is that in both, your model is not gonna see the data during training, and it's not gonna learn to predict from that data, but it's gonna be evaluated against it. The very big difference is that with validation, you can train your model several times and you can always test it against the unseen data in the validation set, so you're gonna test against that validation data several times. Whereas for the test set, you really, truly leave it to the very end. You take what you think is your final model, and then you run it against the test data and you see how it performs. And from there you get an actual good representation of how your model is likely to perform when it sees completely new data. Now, the important thing is, once you run it against the testing set, you don't want to train it anymore. You don't want to tune it anymore so that it performs better on the testing set, because the whole goal of the testing set is to introduce completely unseen and unknown data, without any bias and without any input of what is correct and so on. And so if you start tuning your model against the testing set again, then it's no longer a testing set; it's just another validation set. And from there, it's not likely that the results that you're gonna get on that set are going to be representative of what you're actually going to see when you deploy your model and use it on completely unseen data. So the point of the testing set is to get an almost completely fresh perspective and to really have a good understanding of how your model is gonna perform when it's actually out there and when new values come in that it's never seen before. All right, so an important term to know about during the whole training process is something called hyperparameters.
Now, hyperparameters are essentially tunable parameters of your model or of your whole learning process. So that covers your model, how you decide to grade the errors, and also how you decide to do the learning. So an example would be: how fast does your algorithm learn? Now, you may say, oh, well, that's easy, let's just make it as fast as possible. The problem with this is that sometimes, if your model learns too fast, it may actually perform worse and worse with time because it's trying to overcorrect. So choosing how fast to learn is a pretty important balance, because it's the balance between taking too much time to reach the solution, or overcorrecting and never becoming as good as it potentially can be. So hyperparameters are things that your model usually won't learn, although you can use different machine learning algorithms to learn hyperparameters for your model. But there are a lot of these free parameters that you choose. Another example is: how complicated is my model gonna be? So all of these things that are kind of left up to you, and that are part of this machine learning art, are these hyperparameters that ultimately you're deciding. Okay, what are the things that it should try out? Does this need to be very complicated? Does it need to be simple? How fast do I want this to learn? All those sorts of things. So a good thing to know about hyperparameters is that you can do this thing called grid search. Rather than just taking hyperparameters and hoping everything works well, you can use this technique called grid search, where you give a list of the hyperparameters that you want to try out and just run the training several times, and then compare how the models perform based upon these different hyperparameters, and you can see, okay, what combination of all of these free parameters is the best, the one that ultimately gives me a good model performance.
And maybe also one that makes the model learn fast. So those are also some of the things that you may need to consider: okay, how much time do I actually have available to train it? And how much performance do I really need? How much accuracy do I really need? Now, grid search you can either do by predefining the parameters that you want to use, so you can say, all right, I want you to try out all of these different combinations. Or you could just let your computer choose random combinations and tell it for how long you want it to run. So the trade-off here is that with one, you choose what you want to explore, and with the other, you choose how long you want to let it run until you can take the result and go to the next step. And finally, also something important to know is cross-validation. So we talked about validation in the previous part. But the idea of cross-validation is that you take your data, you split it into smaller subsets, and you take all but one of those subsets for training, and then you take the last one for validation. And that way you can train several different models, or the same model several times, using different training and validation sets. That way your model gets different data coming in for training and also different unseen data for validation. And the cool thing about cross-validation is that you not only get an understanding of, okay, how does my model perform right now, but also, how much does my model vary? So what is the performance range that I can expect from this type of model?
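As a minimal sketch of the ideas above (putting a test set aside, grid search over a hyperparameter, and cross-validation), here is a small Python example. Everything concrete in it is an assumption made for illustration and does not come from the lecture: the toy data (a noisy straight line), the model (a one-feature linear fit trained by gradient descent), the helper names `train_linear`, `mse` and `cross_validate`, the choice of 4 folds, and the grid of learning rates. It uses only the standard library.

```python
import random
import statistics


def train_linear(xs, ys, learning_rate, epochs=100):
    """Fit y ~ w*x + b with plain batch gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def mse(model, xs, ys):
    """Mean squared error of a (w, b) model on the given data."""
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cross_validate(xs, ys, learning_rate, k=4):
    """Train k times, each time validating on a different held-out fold.

    Returns the mean validation error and its spread across folds.
    """
    scores = []
    for fold in range(k):
        train_x = [x for i, x in enumerate(xs) if i % k != fold]
        train_y = [y for i, y in enumerate(ys) if i % k != fold]
        val_x = [x for i, x in enumerate(xs) if i % k == fold]
        val_y = [y for i, y in enumerate(ys) if i % k == fold]
        model = train_linear(train_x, train_y, learning_rate)
        scores.append(mse(model, val_x, val_y))
    return statistics.mean(scores), statistics.pstdev(scores)


# Hypothetical toy data: y = 3x + 1 plus a little noise.
random.seed(0)
data = [(i / 39, 3 * (i / 39) + 1 + random.gauss(0, 0.1)) for i in range(40)]
random.shuffle(data)

# 80% for training/validation, 20% put aside as the test set.
split = len(data) * 8 // 10
train_val, test = data[:split], data[split:]
xs = [x for x, _ in train_val]
ys = [y for _, y in train_val]

# Grid search: try each candidate learning rate, score it by cross-validation.
grid = [0.001, 0.1, 0.5, 1.0]
results = {lr: cross_validate(xs, ys, lr) for lr in grid}
best_lr = min(results, key=lambda lr: results[lr][0])

# Only now touch the test set, once, with the model we consider final.
final_model = train_linear(xs, ys, best_lr)
test_error = mse(final_model, [x for x, _ in test], [y for _, y in test])

for lr, (mean_score, spread) in results.items():
    print(f"learning rate {lr}: validation MSE {mean_score:.4g} +/- {spread:.4g}")
print(f"chosen learning rate: {best_lr}, test MSE: {test_error:.4g}")
```

In this sketch the smallest learning rate is too slow to get anywhere in 100 epochs and the largest one overcorrects and blows up, which mirrors the balance described above. The spread across folds is the "how much does my model vary" number that cross-validation adds on top of a single validation score.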
3. Essential Terms Part 2: Hey, everyone, it's Max from Coding with Max. And in this lecture, we're gonna continue looking at essential terms. Now, at this point, you may be asking, okay, cool, so I kind of understand how the whole training process works, but how do I even start with knowing what model to choose? Well, the first thing to know, and this is extremely important, is that every model is always just an approximation of reality, or just an approximation of whatever you're trying to do. Now, this is true for essentially everything. It even holds for physics; even physics models are just approximations. Now, the goal that you're trying to achieve with the model is to mimic or understand reality as closely as possible, so that the differences between your model and reality aren't really important anymore because they essentially behave the same way. Now, every model usually comes with an assumption, and based on whatever assumptions you have about your data, you're going to choose a specific model. So, for example, if you assume that your data is linear, you may want to choose linear regression. If you assume that your data is more complicated, you may want to choose a polynomial regression or decision trees. Or you may even want to go down the route of neural networks, depending on how much complexity you want to add. Now, if you don't make any assumptions about your data, then there's this cool theorem that's called the no free lunch theorem, and this basically says it's impossible for you to know which model is the correct choice. They're all equally viable. So applying this to machine learning, really what this says is you may have an initial understanding of your data, but it's always a really good idea to take several different models that you think will perform well on the task that you're trying to achieve and train all of these models.
You know, you don't need to completely optimize them, but just kind of pick some default parameters or change them about just a little bit, and train these different models and see how each of them performs. Now, what you're going to get sometimes, if you're lucky, is different models where some perform extremely poorly and others perform generally well. So you want to pick out your winners. Now, if they all perform equally well, then at this point you're kind of free to choose. But usually you're trying to narrow down the number of models, and a lot of times you don't just come up with the model, because it's often extremely difficult to know what the correct initial approach is. So a good thing to do is pick several models, train them all, try them all out, see which ones perform best, and then use those, optimize them further, see how they perform then, and then ultimately decide on one and fully go down the route of really optimizing it and training it on your whole large data set or whatever you may have available. Speaking of data sets, let's go over some of the important terms that you'll encounter when talking about data sets, specifically also in the field of machine learning. The first term is gonna be features. Now, features are all the data that you're going to use to train your algorithm. So let's say you're building an algorithm to predict the weight of an individual. Features could be their sex, their height, their occupation, where they live, their daily activity, whatever. Anything that you want to feed into your algorithm that your algorithm will use to try to predict that final weight is gonna be a feature. Now, this can be raw data, it can be formatted data, or it can be processed data. It doesn't matter. It's just that this is the data that you're going to be feeding into your algorithm. That's the data that you're going to use to try to make your predictions.
Now, if we look on the right, this is usually how everything is denoted when you have multiple observations, which is what they're called. But this is just multiple rows of data, so each of these rows in this example would correspond to a different person. And each column, which is what we have on the top there, would correspond to a different feature. So feature 1, for example, could be sex, feature 2 could be their height, and so on, and feature n could be wherever they may live. Now, observation 1 would be person number one, observation 2 would be person number two, and so on until you get down to person number m, which is how many observations you have. And as you can see, this is very often denoted by just capital X, and X contains the matrix where each row holds an observation and each column is for a specific feature. Now the other thing that's important to know about are the targets. These are often denoted by lowercase y. Now, the target is the reference value that your algorithm is trying to learn. In this example, our targets would just be the final weight. We can see that we don't have multiple columns. We just have one column, but we still have the same number of observations. And so for each observation, which in this case would just be a person, we have in our X, in our features, all of the relevant features, and in our y, in our targets, it would just be the weight. So y1 would be the weight of person one, y2 would be the weight of person two, and so on. Now, there are also important terms to know about machine learning models. The first of these is called a bias. Now, this is gonna be different from another type of bias that we're gonna learn about. But the idea of a bias in machine learning models is just to provide an offset.
This is also known as the intercept, and the easiest way to think about this is if you just think about a straight line: your bias, or your intercept, is shifting that line up and down, shifting around that y-intercept. The other thing that you have are feature weights. So in this vector we store the importance of each feature, and ultimately, if you have multiple features, your algorithm is going to try to learn from these features, based on whatever formula it's using, whatever algorithm it's using, and it's gonna assign weights to your features, so it's gonna assign relative importance. And so what we have is that each feature has a specific weight associated with it. Finally, we also have the parameter vector. Now, the parameter vector is just a combination of the bias and the feature weights into one full vector, and you often do this just because it makes writing it down easier, so that you have one vector that contains both your offset, or your intercept, or your bias, whatever you want to call it, plus all of the weights of your features. So we can see, if we go back, we have our features here, and then when we go forward again, we've got a weight for each feature. And that's ultimately what our algorithms are going to be wanting to learn. Some algorithms will have several sets of feature weights, and some algorithms will just use a single set of feature weights together with the bias. And the easiest way, of course, to represent this is using the parameter vector, because this allows you to group everything together, which just makes it a little bit neater.
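To make the notation concrete, here is a small Python sketch of a feature matrix X, a target vector y, and a parameter vector that groups the bias with the feature weights. The numbers, the two chosen features, and the names `theta` and `predict` are made up for illustration; the linear prediction rule is one simple model that uses a single set of feature weights, not something the lecture prescribes.

```python
# Each row of X is one observation (a person); each column is one feature.
# Hypothetical features: [height in cm, daily activity in hours]
X = [
    [170.0, 1.0],
    [182.0, 0.5],
    [158.0, 2.0],
]

# One target per observation: the weight in kg (made-up reference values).
y = [68.0, 85.0, 52.0]

# The bias (intercept) is an offset; there is one weight per feature.
bias = -15.0
weights = [0.5, -2.0]

# The parameter vector simply groups the bias and the feature weights.
theta = [bias] + weights


def predict(theta, features):
    """Linear prediction: bias + sum of weight * feature.

    Prepending a constant 1.0 to the features lets the bias be handled
    like any other entry of the parameter vector.
    """
    extended = [1.0] + features
    return sum(t * f for t, f in zip(theta, extended))


predictions = [predict(theta, row) for row in X]
print(predictions)  # -> [68.0, 75.0, 60.0]

# Same result when the bias and the weights are written out separately.
separate = [bias + sum(w * f for w, f in zip(weights, row)) for row in X]
print(separate == predictions)  # -> True
```

Comparing `predictions` with `y` shows what a learning algorithm would then work on: adjusting the entries of the parameter vector so that the predictions move closer to the targets.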
4. Essential Terms Part 3: Hey, everyone, it's Max from Coding with Max, and welcome to the final lecture of the essential terms section. Now you may be wondering, all right, what are the different approaches that I can take to creating an algorithm? Or how do I make sure that my model stays up to date? Generally, when you train, there are two different approaches that you can take. The first one is called batch learning, and the second one is called online learning. Now the big difference between the two is that with batch learning, and you can see that also on the image on the right, you train your model beforehand. So you have batches of data that come in, so chunks of data, and these can either be smaller subsets, or they could be huge batches or sets of data that come in, that you train your model on, and basically every green vertical line that you see there is a new model being created. So you train your model several times and you continue to make improvements upon it, and at some point you decide, all right, it's performing very well. Let's make it available. Let's put it into production, or let's make it go live, or whatever you wanna call it, and then you have live data coming in. But at this point, your model isn't changing anymore. Your model was fixed beforehand. Now the data comes in and it just outputs its prediction or whatever the model is supposed to do. With online learning, on the other hand, you usually also start with batch learning, so you want to train it first so that it does well beforehand. But the option with online learning is that you can continue to train it as new data is coming in. Now, this sounds really nice, but of course, there are also complications that come with it. For example, when new data is coming in, how do you know what the right decision is or what the right answer is if the right answer doesn't come along with it?
So if you don't have an obvious right answer that comes with your system when it's online, as in when it's just working and performing, you can run into some problems, because you're just gonna have to guess the answer, or you're gonna have to find some other smart solution for how to evaluate what the right answer is and what you should be using to train it. The point, though, is that if you do have these answers, or if your data set changes with time, online learning can be really nice because your model is going to adapt as the data evolves with time. So let's say you create a product and you only have a couple thousand users at first, and you have your model online, and you have a good understanding of how to evaluate its performance and how to make it change with time. As new users come in and do a bunch of things, your algorithm, or your model, can develop with these users. And so as your product grows, your model is also gonna grow and it's gonna change. And you can see here that all of those small green vertical lines, also while the model is live, are basically new versions of your model. Now it's also important here, for both of these cases, to evaluate performance over a longer period of time. So you want to come back at least a couple of months after you've deployed it to see how things are performing. For the batch learning case, this is important because your model is probably gonna be outdated at some point, and so it may no longer be performing as well as it initially was, just because things have changed. For the online learning case, it could be that your model goes off in its own direction and at some point it hasn't learned correctly, it's gone off in a wrong direction, and it's no longer performing as well as you'd like it to. And so at that point you need to stop, and you either need to revert to an earlier version, or you want to retrain it on some fresher data, or you just need to update it.
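The difference between a model that is fixed after batch training and one that keeps learning online can be sketched in a few lines of Python. This is an illustration under assumptions, not a recipe from the lecture: the data is noise-free and made up, the relationship is assumed to drift from y = 2x to y = 3x after deployment, the correct answer is assumed to arrive with every live data point, and the update rule (`sgd_step`, one gradient step per observation) is just one simple way to keep training.

```python
def sgd_step(model, x, y, learning_rate=0.05):
    """One small correction of a (w, b) line model from a single observation."""
    w, b = model
    error = w * x + b - y
    return (w - learning_rate * 2 * error * x, b - learning_rate * 2 * error)


def mse(model, data):
    """Mean squared error of a (w, b) model on a list of (x, y) pairs."""
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


# Batch phase: train beforehand on the data available at that time (y = 2x).
batch_data = [(i / 10, 2 * (i / 10)) for i in range(11)]
model = (0.0, 0.0)
for _ in range(500):
    for x, y in batch_data:
        model = sgd_step(model, x, y)

# The batch-learned model goes live and is never updated again.
frozen_model = model

# Live phase: the world has changed (hypothetical drift to y = 3x), and in
# this sketch every live observation comes with its correct answer.
live_data = [(i / 10, 3 * (i / 10)) for i in range(11)]

# The online model starts from the same batch-trained state but keeps learning.
online_model = frozen_model
for _ in range(100):
    for x, y in live_data:
        online_model = sgd_step(online_model, x, y)

print("frozen model error on new data:", mse(frozen_model, live_data))
print("online model error on new data:", mse(online_model, live_data))
```

The frozen model keeps predicting the old relationship and goes out of date, while the online model follows the change. If the live answers were wrong or missing, the same update loop could just as easily pull the online model in a wrong direction, which is why both setups need monitoring.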
So in both of these cases, you don't just want to put them online and kind of leave them there. For the batch learning case, it could be that the data kind of updates and your model is going to go out of style. Whereas for the online learning case, the model may change with the data, and your model may start going in the wrong direction. Now, at this point, it's also important to talk about data and how effective data is. And it's important to know, first of all, that more data generally means better performance. And with enough data, even simple models can perform very well on complex tasks. So if you have tons and tons of data, at some point you may not want to spend so much time thinking about, OK, what algorithm exactly am I gonna use? You're gonna want to go with one that generally performs well on this kind of task, and you're gonna want to focus more on, okay, what model can learn quickly? Because when you have tons and tons of data, what this effectiveness of data basically means is that several algorithms can perform equally well, and so it's gonna become important how quickly you can get a model to perform well. So how can I save time with training so that I can get my model up faster, or so that I can make improvements to it quicker? Now, another important thing to know about, or just think about, when creating a model and trying to model data is underfitting. Now, underfitting is when your model is too simple to correctly understand the data, and this can also be called bias. Now, underfitting and bias are both variations or forms of oversimplicity. So what you have in this case is you're assuming something is much more simple than it actually is. So let's take the example of the stock market, which is one of the most complicated things, and there's so much that feeds into it. If you try to predict the stock market using a simple linear model, it's not gonna perform particularly well.
And that's because the stock market isn't something that's so simple that it can be understood just using a linear model. And if you try to go that route, you're gonna heavily underfit, or you're gonna have a heavily underfit model, because you're assuming so much simplicity and the stock market is extremely complicated. Another way that simplicity can be introduced into your model is through regularization. Now, regularization is part of the loss function or the cost function, and is going to be something that we're gonna look at more in the later lectures. But ultimately what it is, in short, is you're penalizing the model and you're trying to make it as simple as possible. You're trying to limit its flexibility. Now, on the other hand, you could have overfitting, and variance is also kind of going in the same direction, and that's when your model is overly complex or when it has too much freedom, and it finds things that actually aren't there. So if your model is overfitting, that means it's found patterns that actually don't exist. And similarly, when you have more variance, that means your model is becoming extremely complex and it has too much freedom. And because it has too much freedom, it's no longer performing well, because it's focusing on things that aren't important. And it's finding these things because you're giving it so much freedom. And overfitting or variance can come from using an extremely complicated model. So, for example, if you decide to go with, like, deep decision trees or high-degree polynomial functions or deep neural networks, and you don't try to restrain their freedom, if you just let them run free, it's pretty likely that they're going to overfit your data, because they're gonna go so deep into it, and they're gonna think they found something extremely interesting or extremely complex. And it's probably not going to be true.
And so the idea of overfitting, and in the same vein the idea of variance, is complexity of the model, and so oftentimes you want to think about, okay, what is my bias and what is my variance, and what are the trade-offs? Essentially, what are the simplicities and what are the complexities? And how can I make my model so that it's neither too simple, so that it doesn't find important things, nor too complex, so that it has so much freedom that it finds things that aren't even there? So if you look at the graph on the left, for example, this is one of the testing data sets that you have from a library called sklearn, which is a machine learning library for Python. It's one of those test data sets, the Iris data set specifically, and what we have here is just a simple decision tree, which we'll also learn about in the later lectures. And the important thing to know here is that we have three different classes, which you can see from the three different types of colored dots. And we also have three different identifications that our model makes, which you can see with the three different colors. So we've got this purple, this pink, and this yellow, and you can see in the yellow there is a streak of pink, and the streak of pink only hits one yellow point. And even though this data is pretty simple, this is already an example of overfitting, where the model is trying to become too complex. And it's introducing these overcomplexities, in this case a small, narrow line to fit one data point in a region that is otherwise dominated by another class. And my point here is that even for these very simple data sets, overfitting can become a problem if you leave your model too much freedom, and if you just kind of let it go off on its own without controlling it. So what does the project flow of a machine learning project generally look like? Well, the first thing that we're going to need to do is think about the goal.
Before you do anything else, you want to know: what is the business case or what is the use case of my model? What is the actual goal that I'm trying to achieve? What is the accuracy that I'm going for, and what things are acceptable? So what are acceptable mistakes and what are unacceptable mistakes? So, for example, if you're in the field of medicine, and if you're trying to help doctors detect disease or have some sort of pre-test, an acceptable mistake would be to sometimes detect something that isn't there, so a false positive. So sometimes you can say, oh, you know, we may want to do further investigation because this disease may be present, and if it turns out it's not there, then it's OK, because the person is not gonna be harmed. Even though in the short term it may not be so nice to go through that anxiety, in the long term it has okay effects. But an unacceptable mistake is if your model misses the disease. So if it says it's not there, and there are no further tests that are done, and the person ends up having the disease. So in this case, for example, you're gonna want to heavily focus on making sure that you don't make mistakes where you miss something that's actually there, because that can cause irreparable damage. And it's not a road that you want to go down. So in this case you can see that sometimes it's not just about how much do you get right and how much do you get wrong, but more importantly, what are the most important things that you need to get right, and where is it okay to get things wrong? And you need to think about that because ultimately your model is probably not ever going to be perfect. So you need to tune it to make sure that the things that you need to get right are right as often as possible, and only after that do you then want to make sure that the other things that you know you should get right are also right as often as possible.
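The idea that some mistakes are more acceptable than others can be sketched in a few lines. This is only an illustration under assumptions: the labels are made up, the two "models" are just hard-coded prediction lists, and the cost weights (a missed disease, often called a false negative, counted as ten times worse than a false alarm) are hypothetical numbers that in practice would come from the business or medical goal.

```python
def count_mistakes(actual, predicted):
    """Return (false_positives, false_negatives) for boolean labels."""
    false_positives = sum(1 for a, p in zip(actual, predicted) if not a and p)
    false_negatives = sum(1 for a, p in zip(actual, predicted) if a and not p)
    return false_positives, false_negatives


def accuracy(actual, predicted):
    return sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)


def weighted_cost(actual, predicted, fp_cost=1.0, fn_cost=10.0):
    # Hypothetical weights: missing a disease is treated as 10x worse than a false alarm.
    fp, fn = count_mistakes(actual, predicted)
    return fp * fp_cost + fn * fn_cost


# True means "disease present". All values are made up for illustration.
actual      = [True, True,  False, False, False, True, False, False]
cautious    = [True, True,  True,  True,  False, True, False, False]  # two false alarms
overlooking = [True, False, False, False, False, True, False, False]  # one missed case

for name, predicted in [("cautious", cautious), ("overlooking", overlooking)]:
    print(name, accuracy(actual, predicted), weighted_cost(actual, predicted))
```

With these made-up numbers the model that misses a case has the higher plain accuracy, but once the mistakes are weighted by how acceptable they are, the cautious model comes out ahead.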
But most importantly, it's so important that before you do anything else, you know what the ultimate goal is. Otherwise you're going to go off in the wrong direction, or you're going to spend so much time trying to go for an extremely high level of accuracy that's not even needed, because that's just not something that's the goal of your business. Now, once you have an idea of what you actually want to do and what you actually need to do, then it's time to take a deep dive into your data. You want to make sure you understand your data properly, and you also want to make sure you go through all the data preparation steps that we talked about earlier. So a lot of this first part is actually the data science process. So understanding the business questions, understanding, all right, how do I analyze my data? What are the different ways that I can contextualize my data? How can I bring more information through my domain knowledge into the data that the machine learning algorithm can maybe use? So the first part is gonna be very heavily based on data science skills. Now, the second part is where the machine learning stuff really kicks in, which is you're then gonna want to create your train and test split, or your train, validation, and test split, and you're gonna want to start training and validating your models so that you can, you know, improve them with time. Of course, you want to pick multiple models at first, and you want to pick a loss function or an error measure that you think is good, and then you're gonna want to train these different models, and you're gonna want to compare them. Pick some of the winners, optimize those. At this point, you're also gonna need to pick good optimizers to make sure that your models learn as fast as possible. You're gonna want to do grid search with cross-validation, and just iterate over batches of training data.
And then ultimately, you want to evaluate on your test set, and you want to see if there are any signs of under- or overfitting, and you also want to take this to other people, get input from them, see what they think about the process, see if they have any other input based on the model's performance, or based upon what data you decided to supply it with, and then iterate from there. And once you feel good about how your model is performing, then it's time to launch it, to make it available. You know, put it wherever it needs to go. And even then you're still gonna want to monitor the performance of your model. And you're gonna want to come back to it and see how it's performing a week later, a month later, and ultimately you're also going to need to retrain it on new data to make sure that the model stays up to date.
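The machine learning part of the project flow described above (split the data, train several candidate models, compare them on the validation set, and only then look at the test set) can be sketched roughly as follows. The data is synthetic, the two candidate models (predict the mean, or fit a straight line) and the 60/20/20 split are assumptions chosen only to keep the example short, and mean squared error stands in for whatever error measure fits your goal.

```python
import random


def split_data(rows, validation_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle once, then cut into train, validation and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    n_val = int(len(rows) * validation_fraction)
    test = rows[:n_test]
    validation = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, validation, test


def fit_mean(train):
    # Very simple candidate: always predict the average target value.
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_line(train):
    # Second candidate: ordinary least squares for one feature.
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
             / sum((x - mean_x) ** 2 for x, _ in train))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mean_squared_error(model, rows):
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)


# Synthetic data: y is roughly 3 * x plus some noise.
rng = random.Random(1)
data = [(x, 3 * x + rng.gauss(0, 1)) for x in range(100)]

train, validation, test = split_data(data)

# Train every candidate on the training set, compare on the validation set.
candidates = {"mean": fit_mean(train), "line": fit_line(train)}
validation_errors = {name: mean_squared_error(model, validation)
                     for name, model in candidates.items()}
best_name = min(validation_errors, key=validation_errors.get)

# The test set is only used once, for the final check of the chosen model.
test_error = mean_squared_error(candidates[best_name], test)
print(best_name, test_error)
```

Keeping the test set out of the model comparison is the point of the three-way split: the validation set is what you use to pick winners, so the test set still gives an honest picture of how the chosen model does on data it has never influenced.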
5. Data Preparation Part 1: Hey, everyone, it's Max from coding with Max, and welcome to the lesson on data preparation techniques. Properly preparing your data has amazing effects on the performance of your machine learning algorithms and is a crucial step in machine learning. And this is definitely not something that you should think about skipping. This is actually one of the most important parts of making sure your machine learning algorithms are properly set up. So let's look at an example first. How would you deal with a distribution that looks like the following? So what we have here is just a distribution generated by me that shows the income distribution of whatever city, and on the y-axis you have the counts, or you could also look at it like the occurrence rate, and on the x-axis you have the income in tens of thousands of dollars. And so what you see here is a city with a kind of average middle class. But of course it has people whose salaries extend up to very high values. And so essentially what we have here is a skewed distribution, or something that has a very long tail. We have a kind of normal distribution that we'd recognize in the center, and then towards the right the distribution just kind of continues on. And this has significant effects on our machine learning algorithm, because the scale of the data becomes extremely big. So you can see that we're just kind of extending on towards the right, and our count is going down, but we can always encounter values in this higher range, and this can actually be a problem when we're trying to put this data into a machine learning algorithm, and it may not always deal with it as well as you may want it to. And so how do we approach problems like this? How do we address these types of distributions? Well, one thing that you can do is you can take the log of the income, and you can see the effects here.
What it does is, rather than having a scale of about 2.5 to 20, which is what we had in this distribution here, we go to about 1 to 3.25 or something like that. So we've dramatically cut down the range of the data in our distribution, or the range of a significant part of the data in the distribution. So rather than going about eightfold, which is from 2.5 to 20, we only have about a threefold change from 1 to 3. And that is extremely important for a machine learning algorithm, because it now has to focus on a much smaller range of data. And another important thing that we're doing with this log scaling is we're saying the higher numbers are less different from one another. So in this case, the difference between 2.5 and 5, so 25,000 and 50,000, is much greater than the difference between, say, 150,000 and 175,000. And that also makes sense, right? Like at some point that difference just doesn't really matter that much anymore. And so that's what we're physically saying when we're using the log scaling. But it also has significant impacts for a machine learning algorithm, because it's really good if we can reduce down this range and not make it so big. Another thing that we can do to reduce the range that our data is going over is just take a threshold. So we could say, for example, that, you know, everything above 125,000, or 12.5 in this case, is basically all the same for, you know, whatever project we're considering. Like, let's say our project is seeing what types of people can afford what houses. And we say, well, anyone who's earning 125K or above, they can basically afford all of the houses that we're looking at, and there's no real difference between them anymore, because that extra income doesn't make a difference to our project. So we can say, OK, we're gonna take a hard threshold here, which is based on this physical meaning of why exactly we chose 125K, because we decided at this point it just doesn't make a difference anymore.
And so in this way, we're also reducing down the range from 2.5 to 20 down to 2.5 to 12.5. So we're not having as big of a reduction as we do in the log case, but it's still a good reduction. However, the problem in this case is, if we look at the normal distribution that we see around the 5 mark, the 50K mark, I've put a green bar above so you can kind of visualize where the distribution lies. And now if we look at the right-hand tail, where the green indicator is about the size of the distribution that we have on the left, we can see the tail is still much longer than that. And so our distribution in this case still has an extremely long tail, and so the threshold may be a good thing to use in some cases. But actually, for this specific example, it wouldn't be a good thing to use, just because our tail is still so long. Even though we used a threshold and kind of cut off a significant portion going from 12.5 to 20, it's still extremely long, and it's actually longer than the main part of the distribution itself. Now, something else that you can do is you can take percentiles of your data. So you're basically taking all the data that you have and you're splitting it into 100 groups, and then you can take each of these income values and instead replace it with the percentile. And what you get from here is what you can see on the screen. You get 100 equally distributed groups and you get scores between 0 and 100, which you can also scale to go from 0 to 1 or whatever you like. But the important thing here is that you have a well-defined range and there's no imbalance like we see especially in this case, where the tail stretches on so long in one direction and there's kind of a mismatch with the main part of our distribution and how far the data extends away from it. That's not something that we see here. So now that we know about scaling, something else that we need to talk about is preparing input for our data.
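The three ways of dealing with the long tail covered so far (taking the log, applying a threshold, and replacing values with percentiles) can be sketched like this. The income list is made up for illustration, the natural log is an assumption (it matches the "about 1 to 3" range mentioned for incomes of 2.5 to 20), and the percentile function here uses one simple definition, the share of values less than or equal to a given value, out of several that exist.

```python
import math

# Incomes in tens of thousands of dollars; made-up values with one long-tail entry.
incomes = [2.5, 3.0, 4.0, 4.5, 5.0, 5.0, 5.5, 6.0, 7.5, 20.0]

# 1) Log scaling: compresses the high end much more than the low end.
log_incomes = [math.log(value) for value in incomes]

# 2) Threshold: everything above 12.5 (125K) is treated as the same.
THRESHOLD = 12.5
capped_incomes = [min(value, THRESHOLD) for value in incomes]


# 3) Percentiles: replace each value by its rank position on a 0 to 100 scale.
def percentile_ranks(values):
    """Percentage of values that are less than or equal to each value."""
    n = len(values)
    return [100.0 * sum(1 for other in values if other <= value) / n
            for value in values]


income_percentiles = percentile_ranks(incomes)

print(min(log_incomes), max(log_incomes))
print(min(capped_incomes), max(capped_incomes))
print(income_percentiles)
```

Note how the single value of 20 still dominates the range after capping at 12.5, while after the percentile transformation it is simply the top of a well-defined 0 to 100 scale.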
In this case, we've looked at a scale that goes from about 2.5 to 20, and if we just have one range of input and we do some of the data preparation we talked about previously, where we try to move it around or scale it to reduce the impact of the tail, that could be a good thing. But if we have several different features that we use for input, and their scales are significantly different, then this can have negative effects on the performance of the algorithm. So machine learning algorithms generally like to have numbers that come in similar ranges. So, for example, if you have values here going from about 2.5 to 20, or about 1 to 20, then you want to make sure that your other values are also within a comparable range. So, for example, going from 5 to 30 or from 1 to 15 or something like that. But what you don't want is one distribution that goes from about 0 to 20, another one that goes from about 0 to 100, and then one that goes from 50 to 500,000. Machine learning algorithms sometimes have problems dealing with those types of distributions. And so what you really want is that the scales that your inputs have are comparable to each other. So what we're talking about here is not reducing the effect of the tail, that's what we did previously, but making sure that your inputs are of comparable size to one another when you're using multiple inputs. So what you want to do here is essentially you want to scale your features to have a more concrete range, or a smaller, more defined range. And one way that you can do that is just using something called min-max scaling. So what you do is you take the minimum value and you take the maximum value and you say, all right, any number that's at the minimum is zero, any number that's at the maximum is one, and anything in between is a number between zero and one that linearly depends on where it lies. So if it's halfway between the minimum and the maximum, its value is going to be 0.5.
The problem with this is that we can still have an uneven distribution of data. So we can see that we've changed the scale at the bottom here, so we're going from 0 to 1 rather than from 2.5 to 20, but we're still maintaining the tail. And so in this case, it's of course still important that in addition to doing this feature scaling, we also do some other form of data preparation to make sure we reduce the effects of this tail. However, this is a good way to approach having several features and putting them on the same range. Now, of course, a problem that you can also get from this is, if we have most of our data within the lower part and we have some outliers, these outliers can, you know, heavily affect the range of our distribution. So in this case, we can see that most of our data is around the 0.3 mark, but just some of the outliers, which is the 20 up here, that's what pulls it all the way up to one. And so in this case, we still have the problem that we have a smaller dominant range in our distribution. So most of the data is now between about 0.1 and 0.4, but our range goes all the way up to one just because of the effects of the outlier. So it's still important to do some of the scaling or the preparation that we talked about earlier to reduce the effects of these outliers, or to reduce the effects of the long tail that we're seeing. Now, another approach that you could take is something called standardization. Now, what you do here, rather than setting a definite scale like we do in the min-max case, is use a sort of relative scale, and so we find the mean of all of our data, and we find the standard deviation, and we apply a transformation to each data point where we take the value, we subtract the mean, and then we divide by the standard deviation. And so we can see from the graph on the left, our data range now goes from about negative 1 to 4. But this isn't something that's predefined.
This is just us scaling it down based on our data set and how it's distributed. So this is a nice way to kind of rescale your data, because you're not setting a definite scale, and you're kind of letting the distribution of the data govern what the new scale looks like. But it's still useful, because if you do this across several features or several different inputs, it will bring them down to a comparable scale. And so, rather than setting an absolute scale like we do for min-max, which goes from, for example, 0 to 1, we can have varying scales that are still comparable to each other, using the technique of standardization.
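Both feature scaling approaches from this lecture fit in a few lines each. This is a minimal sketch with made-up income and age values; the function names are hypothetical, the population standard deviation is an assumption (the sample version would work similarly), and neither function guards against a feature where every value is identical, which would cause a division by zero.

```python
import statistics


def min_max_scale(values):
    """Map the minimum to 0, the maximum to 1, and everything else linearly in between."""
    lowest, highest = min(values), max(values)
    return [(value - lowest) / (highest - lowest) for value in values]


def standardize(values):
    """Subtract the mean, then divide by the standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (an assumption)
    return [(value - mean) / std for value in values]


# Two made-up features on very different scales.
incomes = [2.5, 4.0, 5.0, 5.5, 6.0, 20.0]   # tens of thousands of dollars
ages = [18, 25, 40, 33, 61, 52]             # years

scaled_incomes = min_max_scale(incomes)
scaled_ages = min_max_scale(ages)

standardized_incomes = standardize(incomes)
standardized_ages = standardize(ages)

print(scaled_incomes)
print(standardized_incomes)
```

In the min-max output you can see the outlier effect described above: the single income of 20 lands at 1 and squeezes the rest of the values into the lower part of the range. After standardization both features have mean 0 and standard deviation 1, so they are comparable without sharing a fixed minimum and maximum.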
6. Data Preparation Part 2: Hey, everyone, it's Max from coding with Max. And welcome back. In this lecture, we're gonna continue looking at data preparation techniques. All right, so something else that's important to know about is sampling and sampling bias. Because ultimately, when you're training a machine learning algorithm, you're picking out data that you use for training and you're picking out data that you use for testing. Now, it's important to know that sometimes the distribution of your data can have significant impacts on how your machine learning algorithm performs in your training as well as your testing steps. So if you look at the graph on the right, what we have is a distribution of age groups. So we're going from 15 to 25. That's our first group, which has a little over 30 participants in it. Then we've got from 26 to 59, which has almost 15. And then we've got the 60 plus, which has about 20. Now, one technique that you can use is you can go about randomly taking participants from these age groups, and if you do that, then you may not mimic the underlying distribution. Now, sometimes that's not a bad thing, sometimes that's OK. But in some cases there is an important feature that you want to mimic in your distribution. Sometimes this important feature can have effects on how the subjects behave or what the outcome of, you know, whatever your experiment is. So, for example, in the case of the age group, the different ages may have varying opinions or perspectives, and therefore their answers to questions or whatever may have a significant effect on the outcome.
And so if we just do random sampling, then what the error bars show is kind of a standard deviation that we can expect, and so we can see that between what we have in the actual distribution and the values that we can get if we just randomly sample, there's a lot of variance in there and there's a lot of uncertainty. And so we can here, for example, see that, all right, if we have our training set, which is represented in blue, and our testing set, which is represented in green, some of these age groups can be under- or overrepresented. In this case, the age group of 60 plus is underrepresented in the training and overrepresented in the testing. Now, the effect that this has on the algorithm is that our algorithm will have fewer inputs from the 60 plus, and it will have to evaluate more about the 60 plus. And so it may learn that the thoughts of the 60 plus age group are not as important or something like that, that they're not as significant. But then, when it's being tested, it actually has to evaluate a lot of the 60 plus age group's thoughts, and it's not properly prepared for that. There's a disproportionate amount of training, relative to the other age groups, versus the amount of testing that's being done, and this can have significant consequences on the performance of your machine learning algorithms. So what you can do is something called stratified sampling, where you're trying to mimic the underlying distribution, and so we can see in the training and the testing, which are again shown in blue and in green, that the distributions are now much more similar. Now, it's not always going to be a perfect match, and in some cases the random sample can actually look like a stratified sample like the one that we see here. So sometimes when you're randomly choosing groups, it will look like this, but you're not guaranteed that.
Whereas with a stratified sample, you're guaranteeing yourself that these distributions look similar in the training as well as in the testing case. And so that way you can make sure that these underlying distributions are kept the same, that the way this data is allocated into your training and your testing sets is kept the same, so that you don't mess with some of these important features. Now, in some cases, what you actually want to do is you want to have an overrepresentation of a sample. So sometimes you don't want to stratify your sample. You don't want a matching distribution, because this can have negative effects on the machine learning algorithm. So, for example, let's take a spam filter, where we're trying to detect if an email is spam or not. And let's say that most of the data that we have is not spam. Now, if you do stratified sampling here, what your machine learning algorithm may learn is that if it just classifies everything as not spam, it's gonna do pretty well. If the chance of an email being spam is pretty low, it's going to do a pretty good job saying nothing is spam. But that's not at all what you want. That kind of defeats the whole purpose. So something that you can do here is you can oversample the spam. So rather than having very little spam, you can create your sets so that you're oversampling the amount of spam. And that way, your algorithm is not going to learn that spam is unimportant, because it appears about equally as much as the not-spam content. So it has to learn how to identify it. And then when you go back to the testing case, or the online case, the live case, you may still not be getting that much spam. But now it can identify spam much better, because it had to do that during the training process.
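Here is a rough sketch of both ideas: a stratified split that keeps each group's share the same in training and testing, and oversampling of a rare class. All of it is illustrative. The group sizes (30, 50 and 20 people) and the 95-to-5 email mix are made up, the function names are hypothetical, and the oversampling simply duplicates randomly chosen rare rows until they are as frequent as everything else, which is only one of several ways to do it and assumes there is at least one rare row to copy.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, key, test_fraction=0.2, seed=0):
    """Split every group separately so train and test keep the same group proportions."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    train, test = [], []
    for members in groups.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def oversample(rows, key, rare_value, seed=0):
    """Duplicate randomly chosen rare rows until they are as common as all other rows."""
    rng = random.Random(seed)
    rare = [row for row in rows if key(row) == rare_value]
    common = [row for row in rows if key(row) != rare_value]
    extra = [rng.choice(rare) for _ in range(len(common) - len(rare))]
    return common + rare + extra


# Made-up participants: (age_group, participant_id)
people = ([("15-25", i) for i in range(30)]
          + [("26-59", i) for i in range(50)]
          + [("60+", i) for i in range(20)])
train_people, test_people = stratified_split(people, key=lambda row: row[0])
test_counts = Counter(group for group, _ in test_people)

# Made-up emails: (label, email_id), with spam being rare.
emails = [("not spam", i) for i in range(95)] + [("spam", i) for i in range(5)]
balanced_emails = oversample(emails, key=lambda row: row[0], rare_value="spam")
balanced_counts = Counter(label for label, _ in balanced_emails)

print(test_counts)
print(balanced_counts)
```

One design note: oversampling is normally applied to the training data only. The test set should keep the distribution you expect when the model is live, otherwise the evaluation no longer tells you how it will actually perform.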
So sometimes when you have extremely rare events, you may want to think about oversampling those events, including more of those events than you may actually expect when your algorithm is live, to make sure that your algorithm learns these important parts, and it learns to distinguish even rare events, that it learns to identify those. So even though stratified sampling may seem like a good idea initially, sometimes you want to make sure that the distribution of data is not the same, because that will actually have much better implications for the performance of your algorithm. All right, the next thing that I want to talk about is how do you deal with non-numerical data. So we'll look at the first type of data, which is gonna be categorical data. So you've got different categories. Now again, in categorical data, we can split this up into different things. One of these is ordinal types, where, for example, we've got the star rating system. We've got one star, two stars, three stars, four stars, and five stars. And what we can do is we can treat these values as ordered and we can say, all right, one star, we're gonna give that the numerical value of 1, two stars the numerical value of 2, three stars is 3, four stars 4, five stars 5. So what we need to do for categorical data is we need to transform it to numerical values. Now, doing this type of transformation to sequential numbers is good when the underlying values have a distinguishable ranking between them, if there is a hierarchy. So two stars is better than one star, four stars is better than three stars, and five stars is the best. So this is a good way to treat data that has an underlying order. Another example is if you have review ratings which generally say, this is bad, this is OK, this is good, this is great. There you can again assign numerical values that are increasing. So you go 1, 2, 3, 4, 5, and the five is actually better than the four.
So using this kind of transformation is good when there's a clear order in your data, but otherwise it can actually cause problems. So, for example, if you have categories of student, alumni, teacher, and parent, you can't really assign 1, 2, 3, and 4, because that way your algorithm may learn that a student is lower than an alumni, which is lower than a teacher, which is lower than a parent. And you can't really compare the categories like that because, you know, these are different categories, and they may be experiencing the same thing from different perspectives. So what you can do in this case, when there is no clear order between these different categories, is something called one-hot encoding. And that's what we see in the table on the left, where for each of the categories you create its own column. And whenever that value is present, that column then gets a one. And so we can see here, if it's a student, the student column is gonna have a one, and every other column is gonna have a zero. If it's an alumni, every other column is gonna have a zero, except for the alumni column, which has a one. If it's a teacher, the teacher column is gonna have a one. Everything else is going to be zero. And if it's a parent, the parent column is gonna have a one and everything else will be zero. And so what you can do here is you can take categories and you can transform them instead with this one-hot encoding, which lets your algorithm better deal with these categorical types. Now, this is usually good to do if you have a low number of categories, so about 10. But you don't want to do this if you have about 100 categories, because it just inflates the amount of input that you have. And your algorithm may not deal with that very well, because it has to learn all of these different input factors. So another way that you could go about this is using something called embeddings.
Now, this is very often used for text, which is why we're also gonna be looking at it in the text part of non-numerical data. But embeddings are something that you can use for categories as well as text. And the idea here is that you take a value and you instead transform it to a set of numerical values. So, for example, what we can do is we can take the word potato and we can use a 3D embedding, and so every word or every category is assigned three numbers. Now, these numbers will initially be set randomly, but these embeddings are actually something that you can learn. And the embedding dimension is actually a hyperparameter of your model. But an example of this is if we take the words clouds, sun, rain, and rainbow. Then if we learn these embeddings, our embeddings may actually look like what we see on the left, and you can see that to get a rainbow, what you have is, well, you go to the sun, you subtract the clouds, you add the rain, and basically you've got yourself a rainbow. Or sorry, you go to the clouds, add the rain, and add the sun, and you've got yourself a rainbow here. And so you can see that there is a relation between the numbers that we assign to each of the words, and so related categories or related words are actually gonna start to group together, and different words are gonna be spread further apart. And so embeddings are extremely useful because they allow you to take a large number of words or a large number of categories and reduce them down to a much lower number of dimensions. So, for example, if we do 3D embeddings, we can take a ton of words and we can just reduce those down to three different numerical values. So rather than having, let's say, 100 one-hot encoded columns, so 100 different columns, we instead just have three.
And this is much better for our algorithm, because it can deal much better with these embeddings. It can know which values are similar as well as which values are different, and it can treat them more appropriately. And it doesn't have to deal with all of these different categories as separate inputs.
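The two encodings described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `one_hot` and `init_embeddings` are made up for this sketch, and the embedding values are just random starting numbers, since actually learning them requires training a model.

```python
import random

# The four unordered categories from the lecture example.
CATEGORIES = ["student", "alumni", "teacher", "parent"]


def one_hot(category, categories=CATEGORIES):
    """Return a list with a 1 in the column for `category` and 0 everywhere else."""
    if category not in categories:
        raise ValueError(f"unknown category: {category!r}")
    return [1 if c == category else 0 for c in categories]


def init_embeddings(vocabulary, dim=3, seed=0):
    """Give every word `dim` random starting numbers (hypothetical helper).

    In a real model these numbers would be adjusted during training;
    here they only show the shape of an embedding table.
    """
    rng = random.Random(seed)
    return {word: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for word in vocabulary}


embeddings = init_embeddings(["clouds", "sun", "rain", "rainbow"])

print(one_hot("teacher"))          # one column per category
print(len(embeddings["rainbow"]))  # always `dim` numbers, however many words there are
```

Note how the one-hot vector grows with the number of categories, while the embedding stays at three numbers per word.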
7. Classification Algorithms Part 1: Hey, everyone, it's Max from Coding with Max, and welcome. In this lecture, we're going to learn about classification algorithms. So first of all, what are they? Well, they're essentially algorithms that let you predict categories from your data. There are also classification algorithms that let you identify groups within your data. So let's do this with a more practical example. Using the graph on the right, what we have is just a simple time-spent distribution. Each data point represents, for example, a person. On the x-axis we have the time spent at work, with the right being more time and the left being less time. And on the y-axis we have time spent with family, again with the top being more and the bottom being less. Now let's also say that we know that this group is split up into two different categories. We've got group A, which is the one in the top left, with more time with family and less at work, and we've got the one in the bottom right. Now, these groups may not just be identified by the way that they act socially; maybe this is also related to how people buy. You've got two different types of consumers, which just happen to fall into these categories where some spend more time with family and others spend more time at work. Now let's also look at the green dot in the middle. This is a new person that's being entered, and we want to be able to predict what category they're going to fall into so that we know the right way to approach them from, for example, a marketing perspective. What kind of catalogues would they be interested in, what kind of buying behaviour can we expect from them, and how can we adjust their experience accordingly? So we want to know which group of people they fit into. And a great way to do this is using classification algorithms, which will let you predict the group that this new person is going to fall into.
So when do you use classification algorithms? Well, you use them when you're more interested in an attribute than in an exact numerical value. For example, let's say we've got two categories, cats and dogs, or just simply cats and not cats, and now we've got another image that we want to assign to a category. We want to classify this image: is it a cat, or is it not a cat? This is another very simple example of what a classification algorithm could do. It will assign this image to one of the already defined classes, cat or not cat. All right, so let's go into a little bit more detail on some of the algorithms and actually see some examples. We're going to be using the iris data set that's included with scikit-learn. It's basically just looking at three different types of flowers, which is what we have on the left here. Each of these flower images is taken from the Wikipedia page for the iris data set, so you can also look them up there. Essentially, there are two forms of classification that we can approach. One of them is supervised, the other one is unsupervised. The big difference here, of course, is that in the supervised case we have already known targets when we're training, whereas in the unsupervised case we're not really sure what the correct solution is. Another distinction is between single-class and multi-class classification algorithms, so being able to predict a single class (yes or no) or being able to predict multiple classes, and we'll look at each of these cases in a little bit more detail now. So first of all, we're going to look at single-class supervised algorithms. What we have on the left here is just two of the four features from the iris data set being plotted. On the y-axis we have the petal width in centimeters, and on the x-axis, for both of these graphs, we have the petal length.
Now, we've also split our data into two categories. We've got the not-virginica category (I have no idea how to pronounce this, but it's one of the flowers), and then we've got the other category, which is that type of flower. The way that I've labeled this data is that everything in black is not that flower, and everything in green is that flower. What you see in the background colors is two different classification algorithms. One of them is a stochastic gradient descent classifier, the SGD, using a linear support vector machine. Again, these are just names for classifiers, so if this sounds confusing, just think of it as a name for something. The other one is a logistic regression classifier. Now, the blue area represents where the machine learning algorithm says this is not that type of flower, and the red area is where it says that this is that type of flower. What we have in the middle there is the decision boundary. Essentially, if you cross over that green squiggly line in the middle and go to the left, for example in the SGD case, then you're going into the not-that-flower category, and if you go to the top right, you're going into the that-flower category. Now, you can see from the way that I've decided to display the data that this algorithm is, of course, not 100% accurate. We've got some of the not-that-flower points showing up where the algorithm predicts that it is that type of flower. So again, the dots represent what the flower actually is, and the colors in the background represent what the algorithm predicts. Black is not that flower, green is that flower; blue is where the algorithm predicts it's not that flower, and red is where the algorithm predicts it is that type of flower.
An interesting thing to note here is that if we compare the top and the bottom graph, we can see that the two algorithms have different decision boundaries. Not only are they located at different positions, but they're also at different angles. In this case, both of them are in some sense almost linearly separating the data, so they're basically drawing an almost straight line between the data to separate it. But we can see that the stochastic gradient descent, which is just using a linear support vector machine at this point, has more of a diagonal separation, whereas the logistic regression has more of a horizontal separation. So let's look at each of these algorithms in a little bit more detail. In the case of logistic regression, what our algorithm is trying to do is optimize the equation at the bottom. It's trying to find the best probability, and it's using the logistic equation for that. We've got our features, which is x, we've got y, which is our prediction, and we've got our intercept and our coefficients, or our weights, like we introduced them earlier. What we get from this, on the left-hand side, is a probability: at this petal width and this petal length, what is the probability that this is that flower? And we can see that at low petal width and basically any petal length, the probability is very low. If we go back and look at the graph, that's exactly what we see: at low petal width, no matter the petal length, our probability is low, which means that the classifier classifies this as not that type of plant. Whereas when we go to high petal width, for basically any petal length, our probability is high, which means our classifier now classifies that region as that plant. So for anything in red, our algorithm is basically saying there's a high probability that that is that plant, and it's going to assign anything over here to that plant, whereas in the blue region we have a low probability that anything in that region is our plant. So that's what we have in the bottom graph here for the logistic regression.
If we look at the linear SVM, or the SGD classifier, the stochastic gradient descent classifier, what the linear SVM does is try to find a hyperplane that linearly separates the data. Now, a hyperplane is basically just a line in the two-dimensional case here; in three dimensions it's a surface, and in four dimensions it's a three-dimensional volume. It just tries to find something that it can draw that separates the data, and that's exactly what we see here. The algorithm is trying to find a good line that separates the data, but it's also trying to keep the distance from the data points to the line as big as possible. So the algorithms take different approaches to the problem, and therefore they also come out with different results. With the logistic regression, each value is assigned a probability between zero and one of belonging to the class. Whereas for the linear support vector machine, we have a region of zero, which is not that class, and a region of one, which is that class, with a big jump in the middle from not that class to that class. So one is a probability that changes more gradually, and the other one basically has a big jump between the two regions.
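To make the contrast between a gradual probability and a hard jump concrete, here is a minimal sketch in plain Python. The weights and intercept are hypothetical numbers chosen for illustration, not values fitted to the iris data, and the same linear score is reused for both outputs so that only the last step differs (in the lecture's plots the two classifiers also learn different boundaries).

```python
import math

# Hypothetical weights for (petal length, petal width) and a hypothetical intercept.
WEIGHTS = (0.5, 2.0)
INTERCEPT = -5.0


def score(features):
    """Linear score: intercept plus the weighted sum of the features."""
    return INTERCEPT + sum(w * x for w, x in zip(WEIGHTS, features))


def logistic_probability(features):
    """Logistic-regression style output: a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score(features)))


def hard_decision(features):
    """Linear-SVM style output: 1 on one side of the boundary, 0 on the other."""
    return 1 if score(features) >= 0 else 0


for petal in [(5.0, 1.0), (2.0, 2.0), (5.0, 2.0)]:
    print(petal, round(logistic_probability(petal), 3), hard_decision(petal))
```

A point exactly on the boundary, such as `(2.0, 2.0)` with these made-up weights, gets a probability of 0.5, while the hard decision has to commit to one side.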
8. Classification Algorithms Part 2: Hey, everyone, it's Max from Coding with Max, and welcome back. In this lecture, we're going to continue with the classification algorithms that we started last lecture. All right, so let's take a look at multi-class supervised algorithms. What we did before is we just tried to differentiate between either that flower or not that type of flower. But actually, this data set contains three different types of flowers, which is what we see here. In the top graph on the left, our three types of flowers are separated using an algorithm called KNN, standing for k-nearest neighbors, and on the bottom we have a logistic regression using a one-versus-rest approach. So let's talk about these in a little bit more detail, specifically the one-versus-rest. In the first case, we were just trying to separate a class into either yes or no, and that's what the single-class part refers to: we're either trying to say yes or no, it's binary. Now, some of these algorithms don't actually have a multi-class counterpart. They can only say yes or no; they don't have an A, B, C option. The way that you can turn these yes-or-no algorithms into an A, B, C is to use a one-versus-rest approach, which means you train one classifier per type of flower, each predicting the probability of a point belonging to that type. So in this case we've got three logistic regressions, each of them giving a score for a certain petal length and a certain petal width belonging to a specific type of flower. The classifier that gives us the highest probability that this petal length and this petal width correspond to its flower is the one that's chosen. And that's called the one-versus-rest approach.
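The one-versus-rest idea can be sketched as follows. This is an illustration only: the three binary models use hypothetical, hand-picked weights rather than trained ones, and the class names `flower_a`, `flower_b` and `flower_c` are placeholders.

```python
import math

# Hypothetical (intercept, weight_for_length, weight_for_width) for each binary model.
# Each model answers only one yes/no question: "is it this flower or not?"
BINARY_MODELS = {
    "flower_a": (6.0, -2.0, -1.0),
    "flower_b": (-0.5, 0.1, 0.0),
    "flower_c": (-9.0, 1.0, 2.0),
}


def probability(model, petal_length, petal_width):
    """Logistic probability that the point belongs to this model's class."""
    intercept, w_length, w_width = model
    score = intercept + w_length * petal_length + w_width * petal_width
    return 1.0 / (1.0 + math.exp(-score))


def predict_one_vs_rest(petal_length, petal_width):
    """Ask every binary model and pick the class with the highest probability."""
    probabilities = {
        name: probability(model, petal_length, petal_width)
        for name, model in BINARY_MODELS.items()
    }
    return max(probabilities, key=probabilities.get)


print(predict_one_vs_rest(1.4, 0.2))
print(predict_one_vs_rest(4.5, 1.4))
print(predict_one_vs_rest(6.0, 2.2))
```

With real data, each of the three models would be trained separately on "this flower" versus "everything else" before being combined like this.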
In this way we can take a single-class or binary classifier that can only predict yes or no values and use several of them together to predict multiple classes. We can see at the bottom the decision boundary that we get, and it does a pretty good job of separating them. Again, you can see that some of the dots spill over: we've got the green flower spilling over into the red region, and some of the red flowers spilling into what the algorithm would predict to be the green region. But it does a pretty good job of separating these three. Now, the k-nearest neighbors algorithm just looks at the points around the one it's predicting for, and based on those it tries to make a guess. An important thing to note here is that for the logistic regression and for the support vector machine, what we always had was essentially a straight line separating the data. Whereas for the k-nearest neighbors, we can see that there's actually a curve, specifically in the difficult part between the two flowers in the top right corner. So we can see that different algorithms give different decision boundaries. The decision boundaries, again, are those squiggly lines that separate the outputs of the classifiers. And of course, there are many more types of classifiers that you can go into; these are just some examples. Now, because all of these classifiers behave so differently, it's often not worthwhile to try to learn every single classification algorithm. Rather, it's more important that you have a very good understanding of a couple of them. So pick three, four or five, understand those in more detail, and feel comfortable with them.
Often, just having that range of algorithms to choose from and fully understanding them will make your model a lot better than trying to use tons of different classification algorithms where you don't really understand what they're doing, how they're behaving, and which one would be most appropriate for the situation. So, for example, other types of multi-class supervised algorithms are naive Bayes, or you can use neural networks, and there are, of course, many more. But ultimately, again, the best thing is just to get pretty familiar with a couple and then stick to those, because those are the ones that you understand best. If you have a handful of classification algorithms to choose from, odds are one of them is going to do a pretty good job of helping you solve whatever problem you're tackling. All right, let's look at multi-class unsupervised algorithms. Right now we're only going to look at one, because these can become pretty complicated. We're going to look at a clustering algorithm called k-means. What the clustering algorithm does is separate our data into different clusters, into different groups. Again, we see here we have a decision boundary, and we've got different classes. In the top case, we're saying beforehand that we want this algorithm to split the data into two different groups, and that's what it does: the left-hand side is one group and the right-hand side is another group. In the bottom half, we say we want this algorithm to split it into three different groups, so that's something that we decide beforehand, and in this case we've got one on the left, one in the middle and one on the right. This approach is called, again, clustering.
When we define beforehand how many different groups we want the algorithm to identify, it tries to group the data together so that there is minimal variance within each group. All right, so how would you go about evaluating how well your classification algorithm is performing? Well, an easy, straightforward approach would be to just measure the accuracy: basically, what proportion of predictions are correctly classified? But this can start causing problems if you have a high number of classes. Let's say you're predicting 20 different classes. That means each class is about 5% of the data if they're evenly distributed. If you look at a single class and have an accuracy of 95%, that's basically still no better than a trivial guess, because you could be saying no every single time and 95% of the time you're still going to be correct. So accuracy is usually not a good way to go about evaluating classification algorithms. We can see on the bottom right here that I've still shown the accuracy of the stochastic gradient descent as well as the logistic regression, which is just the proportion of correct predictions, and in this case it actually isn't too bad because we just have two classes. But again, accuracy can really start to cause problems when you go to a higher number of classes. So what other options are out there? Well, another option is looking at precision. You calculate the precision by taking the number of correct positive predictions and dividing it by the number of correct positive predictions plus the number of incorrect positive predictions. A true positive, or a correct positive prediction, is when you say this flower is this type of flower, and it is in fact that type of flower. A false positive is when you say this flower is that type of flower, but it's actually not that type of flower.
If you say, yes, this is the flower I'm looking for, but it's actually not that flower, then what you have is a false positive. Precision is really good to use when you want to know how reliable your positive predictions are: when I say this is the case, how likely is it that this really is the case? Another measure that you can use is something called recall, where you look at the number of true positives over the sum of the true positives and false negatives. A false negative is when you say this is not that class, whereas in actuality it is that class. So you're saying this is not this type of flower, but in actuality it is that type of flower. Recall is good to use when you want to evaluate how often you miss your true values. What you can also use is something called the confusion matrix. The confusion matrix shows you how often one class was confused with another. In this case, we can see on the right here we've got the two tables; for the SGD, our rows are showing the true values and our columns are showing the predicted values. We can see that 78 were predicted negative and were actually negative, whereas 22 were predicted positive but were actually negative. So we've got some confusion going on here, but we've got no values that were predicted negative and were actually positive. So what you're looking at with the confusion matrix is how often one class is confused with another class. This can be extremely helpful, especially if you're doing multi-class classification. You can go in and really see where your algorithm is making mistakes. But of course it requires more work, because it's not just one number, it's a matrix, and you need to look at the whole set of numbers rather than just having one number to measure performance. Now, these are just some of the tools that you can use for evaluation.
There are, of course, still other tools, but these are the most basic ones that you should know about. They should also give you a taste and an understanding of why it's not always ideal to use accuracy, and of what other ways there are to approach evaluating classification algorithms.
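The evaluation measures from this lecture can be written out in plain Python as a small sketch. The labels below are made-up toy data (1 means "is that flower", 0 means "is not"), not the lecture's iris results, and returning 0.0 when a ratio has an empty denominator is a convention assumed for this example.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, true negatives and false negatives."""
    tp = fp = tn = fn = 0
    for truth, prediction in zip(y_true, y_pred):
        if prediction == 1 and truth == 1:
            tp += 1
        elif prediction == 1 and truth == 0:
            fp += 1
        elif prediction == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def precision(y_true, y_pred):
    """How reliable are the positive predictions? tp / (tp + fp)."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0  # assumed convention for no positives


def recall(y_true, y_pred):
    """How many of the true positives were found? tp / (tp + fn)."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy example: 6 negatives and 4 positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
print(confusion_counts(y_true, y_pred), accuracy(y_true, y_pred))
print(precision(y_true, y_pred), recall(y_true, y_pred))

# The "always say no" case: the class is only 5% of the data.
rare_true = [1] * 5 + [0] * 95
always_no = [0] * 100
print(accuracy(rare_true, always_no), recall(rare_true, always_no))
```

The last line shows why accuracy alone can mislead: saying no every time scores 95% accuracy while finding none of the positives.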
9. Regression Algorithms: Hey, everyone, it's Max from Coding with Max, and welcome. In this video, we're going to go over regression algorithms. So first of all, what are they? Well, regression algorithms are essentially algorithms that let us predict numerical values. So when do you want to use them? A really nice case would be if you want to fill in missing values. For example, if you've got a data set that's somewhat incomplete but you have comparable data from before, you can train your algorithm to fill in the missing values, so that you can still use that data rather than having to discard it. Another thing they're very often used for is forecasting. This could be forecasting in time, or it could be forecasting in other areas. So regression algorithms are very often used if you want to take one part of your data and make predictions about another part that you may not have data on. You can also use regression algorithms when you want to understand how your system behaves in different regimes. What I want you to notice on the graph on the right here is that we have actually the exact same curve as before, but before, this curve looked linear. If we go back, we see this is actually a straight line that we have on the right here. Now, if we look at the graph further out, we can see that it's actually no longer linear. It's a polynomial; it has curves. Now, this data was obviously generated by me, so it's not extremely realistic, but the point that I want to make here is that in some regions our data can actually look linear. If we zoom in on the region between zero and one, or zero and 1.5, which is what we do here, we can see that a straight line does a pretty good job of fitting the data. But when we zoom out, this is no longer the case.
So the straight line is no longer going to do a good job of modeling this. If we have a good model or a good algorithm, then we can go into different regimes and hopefully understand how the data behaves in these different parts. In this way, a very good model can help us understand something that we haven't been able to investigate yet. That being said, you should also be aware that often, when you train a model, it's only going to be valid for the data range that you trained it on rather than for the whole set of data. So even though regression algorithms can be very powerful in helping predict numerical values, you want to make sure that you're using them in the appropriate data range, and if you're getting data that's completely outside of the range you've normally considered, you want to reevaluate how your algorithm performs on that data. Because, as we can see from the comparison here on the right, first we start off linear, and then we go to this polynomial. So if we go to data very much outside of the range that we were used to before, the way that our data and our system behave can change. All right, so let's take a look at some of the algorithms. We're again going to be using the iris data set that we also used for the classification case. Now, when you're doing regression, you usually only talk about supervised algorithms; unsupervised isn't particularly well defined for regression. That's why, when we're talking about regression, it's usually going to be a supervised case where you have your training data and you also have appropriate labels, or appropriate target values, that you want to achieve. All right, so let's consider this iris data again. What we have on the x-axis, which you can see on the graph on the left, is the sepal length, and on the y-axis we've got the petal length.
What you can see from this graph is that these two characteristics seem to be somewhat related. So let's try some regression algorithms to see if we can predict some of the values, or have a model that tries to capture the behavior that we're seeing here. Now, there are different ways that we can approach regression, and there are very many different algorithms. The simplest one is a linear regression, linear just being a straight line. A polynomial is a curved line, like what we saw before: this is going to be linear, and this is going to be a polynomial. Then you've also got support vector machines that you can use for regression. You can also use k-nearest neighbors for regression. Neural networks are, of course, also very often used for regression, and there are still many other supervised algorithms that you can use. So just like with classification, we have a lot of algorithms to choose from, and oftentimes it just comes down to: what is the problem, what is our data set, and which model is most appropriate for this type of data? So let's take a look at the linear regression. What we have here is a linear regression model trained on this data set, and I've used one-hot encoding. Remember, we talked about that in the data preparation lecture. We one-hot encode the three different flower types, so our linear model actually creates three different best fits, one for each of the individual flower types, and we can see that this is the result. What you can see from the linear model is that we can go down to lower sepal lengths and it will predict corresponding petal lengths. We don't know if those values are correct or not, but we can at least try to predict them, and we can do that for all the flowers, since we've used one-hot encoding.
So what our linear model has actually done is split into three different linear models, since we have three different flowers in this case. It checks which encoding this is, so which flower it is, and then uses the appropriate linear model to make the prediction. We can also do the same thing using a k-nearest neighbors regression. Whereas the linear model tries to find the best straight line to go through the data, the k-nearest neighbors looks at the surrounding neighbors of a data point, and its prediction is just the average value of those surrounding data points. So we can see that when we go to lower or higher sepal lengths, our KNN algorithm doesn't really provide great predictions anymore. When we go to lower sepal lengths, it's basically all flat, whereas in the linear case it's decreasing. Now, because we don't have data in that region, we actually don't know which one is correct, or whether flowers with those characteristics even exist. So until we have data there, we can't validate which one of these is more correct. But we can see that the behavior of the two is very different. We can also see that the linear case really is just a straight line, whereas in the k-nearest neighbors case we've got small horizontal lines that jump up and down as we go across the different sepal lengths. And again, here we've got three different models, color coded in purple, black and gray for the orange, blue and red colored flowers, respectively. So you can see that we have small regimes where the predictions stay the same, and it almost looks like a staircase that isn't fully connected. And so, just like in the classifier case, we can see that different algorithms, of course, behave very differently.
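The different behavior of the two regressors outside the data range can be reproduced with a small sketch in plain Python. The data points are hypothetical numbers chosen for illustration, not the iris measurements, and `fit_line` and `knn_predict` are names made up for this example.

```python
# Hypothetical one-dimensional training data (x could stand for sepal length,
# y for petal length; the values are illustrative only).
XS = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
YS = [1.0, 2.1, 2.9, 4.2, 5.0, 6.1]


def fit_line(xs, ys):
    """Ordinary least squares fit of a straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def knn_predict(x, xs, ys, k=3):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k


slope, intercept = fit_line(XS, YS)

# Inside the data range both models give similar answers; far below it the
# line keeps decreasing while k-nearest neighbors stays flat.
for x in [3.5, 0.0, -5.0]:
    print(x, round(slope * x + intercept, 2), round(knn_predict(x, XS, YS), 2))
```

Below the smallest training point, the k nearest neighbors are always the same three points, which is why the prediction stops changing.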
So a great thing to do, of course, is to just look at what kind of predictions your algorithm makes, or what your algorithm actually looks like. Visualizing the predictions that your algorithm would make for all of these different sepal lengths is a nice way to understand what it actually does. Now, of course, this was just two algorithms: we looked at the linear case, and we looked at the k-nearest neighbors. Like I said, there are very many different regression algorithms, and each of them takes a different approach, so the way that their predictions look is also going to be very different. So how would we evaluate these different models? Well, a good thing to do is just to look at the error. I'm going to use y to denote our targets and y-hat to denote our predictions. One way that you can look at error is the mean absolute error. You take each prediction and see how far it is from the true value, you take the absolute value so that the numbers are only positive, you sum them up, and you divide by the number of data points that you have. In that way you get a measure of how far away, on average, your prediction is from the true value. We can see on the right that we've got our two models, with the linear regression on the top and the k-nearest neighbors on the bottom. Using the mean absolute error, the linear regression is off by about 0.2, which means that on average it's off by about 0.2 centimeters in the petal length prediction, whereas the k-nearest neighbors is off by only about 0.17 centimeters. So using this measure, it looks like the k-nearest neighbors works better.
There's also a different way that you can look at error, and this is often a preferred one, which is the mean squared error. Outside of machine learning, you'll often also see the root mean squared error, which is a closely related measure that we'll talk about in a second. With the mean squared error, rather than taking how far your prediction is from the target and just taking the absolute value, you square that difference. The reason you square it is that outliers, or predictions that are very far off, are punished much more heavily. For example, in the mean absolute error case, if we have one prediction that's off by 0.1 and another one that's off by 10.5, each of them counts only in proportion to its size, whereas in the mean squared error case, the larger error is going to have a much bigger effect on our evaluation. So the mean squared error is nice because it not only gives us a measure of how good our algorithm is, but it also heavily punishes predictions that are very far off from the actual target. If we look at the mean squared error for the linear regression and the k-nearest neighbors, the linear regression again has the higher mean squared error. So from the mean absolute error we see that the model is on average off by more, and from the mean squared error we see that outliers, or the extremes, are also not fit as well as in the k-nearest neighbors case. Now, what is this root mean squared error? Well, the root mean squared error is just the square root of the mean squared error. The reason that you want to do this is that the mean squared error gives us values in squared units. What we have from the mean squared error is a difference in centimeters squared, which doesn't really make sense.
It's not really what we're after, so we can take the square root. Now we can read off that in the root mean squared error case, the linear regression has an error of about 0.28 centimeters, whereas the k-nearest neighbors has an error of about 0.23 centimeters. So taking the square root brings our units back to what we're actually measuring and allows us to assign a more physical value. Now, the reason that you don't really use the root mean squared error in the machine learning case is that the comparison comes out the same: you're just taking the square root at the very end, and the square root is just an extra operation that takes more time. So you can just use the mean squared error, because if the mean squared error is bigger, then the square root of that value is also going to be bigger than whatever value you're comparing it to. Taking the square root doesn't add anything extra when you want to use this to evaluate your algorithm. It adds a lot when you want to interpret the error from a human perspective, but when you're using it to train your algorithm, the mean squared error does the exact same thing, and you don't need to pay the extra cost of computing the square root every single time.
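The three error measures from this lecture can be sketched in plain Python. The example reuses the two errors mentioned in the lecture, 0.1 and 10.5; the targets of zero are just an assumption to make those errors easy to produce.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average distance between prediction and target, in the original units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared distance; large misses dominate the result."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared error, back in the original units."""
    return math.sqrt(mean_squared_error(y_true, y_pred))


# One prediction is off by 0.1 and the other by 10.5 (targets assumed to be 0).
targets = [0.0, 0.0]
predictions = [0.1, 10.5]

print(mean_absolute_error(targets, predictions))      # about 5.3
print(mean_squared_error(targets, predictions))       # about 55.13
print(root_mean_squared_error(targets, predictions))  # about 7.42
```

In the squared version, the 10.5 miss contributes 110.25 to the sum while the 0.1 miss contributes only 0.01, which is the "heavier punishment" of outliers described above. And because the square root is monotonic, comparing two models by mean squared error or by root mean squared error always ranks them the same way.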
10. Optimization Techniques: Hey, everyone, it's Max from Coding with Max. And welcome back. In this lecture, we're gonna look at optimization techniques for algorithms. Before that, though, let's take a quick recap of what we've done so far. First, we looked into preparing for training, so identifying potentially good features and just generally going about understanding our data. Then we went into more detail on preparing our data, with techniques like normalization and scaling, and the importance of sampling. And then we looked at different types of algorithms, specifically for regression and for classification. But how do we make our algorithms learn faster? So before this, we've just talked about what algorithms we could use and what they do, or even the whole data science flow, which is basically everything up to the algorithm choice. So how do we go about our data? How do we prepare our data? How do we understand our data? How do we identify good features? How do we inspect it? How do we explore it? All of those important topics. And then we talked about using our model and how we evaluate our model. But now, how do we make our algorithm learn faster? Because sometimes, when you're using a lot of data, your algorithm can actually take quite a while to train, and you just have to wait for it to complete. And so, of course, one thing that you can do is just use stronger computers. But there are also other techniques that you can use to try and make your algorithms learn faster, so that they become better quicker, and that's what we're gonna look at now. So here we're gonna cover the different types of optimizers. We're gonna start with gradient descent, which is something that we also used before. We're gonna look at momentum, then NAG, AdaGrad and RMSProp, and finally Adam as well as Nadam. Okay. Before we go into that, though, I quickly want to introduce another term, called the learning rate.
Now, the learning rate defines how fast our algorithm learns, or basically how much our algorithm is affected by the errors that it's currently making. Now, you may intuitively think, all right, well, I want my algorithm to learn as fast as possible, so that it becomes as good as possible quickly. That's not actually how it works. So let's look at the two extremes. This is usually one of those Goldilocks things. You don't want the learning rate to be too small, because if it's too small, then essentially you're going to take a very long time to reach an optimal solution, or a model that does an acceptable enough job. So if your learning rate is too small, it's gonna take too long. But if your learning rate is too big, then you may be taking jumps that are too big, and you may never reach an optimal solution because your algorithm is overcorrecting. Now, the learning rate is denoted by eta, which is this symbol down at the bottom, which basically just looks like a strange n. And so when we go through the next part of the lecture, if you see that eta symbol, or that strange n, that's gonna be our learning rate. Now, the learning rate is also a hyperparameter of your system that you can tune. And of course you want to get that optimal learning rate, so that you're not going so big that your model overcorrects, but also not so small that your model takes too long to reach the optimal solution when it could have done the same thing with faster steps, or bigger steps. All right, let's look at the first algorithm, which is called gradient descent. With gradient descent, all that we're doing is we're taking our error function, or loss function, or cost function, which in this case, for example, is what I've plotted, the mean squared error, which is what we talked about in the regression case.
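The Goldilocks behavior of the learning rate can be seen on a toy problem. This is a minimal sketch, not the lecture's own example: it assumes a made-up one-weight cost J(w) = w², whose gradient is 2w and whose minimum is at w = 0, and the function name `gradient_descent_1d` is hypothetical.

```python
def gradient_descent_1d(learning_rate, start=5.0, steps=20):
    """Take `steps` gradient descent steps on the toy cost J(w) = w**2."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w               # slope of J(w) = w**2 at the current weight
        w = w - learning_rate * gradient # step against the slope, scaled by eta
    return w

too_small = gradient_descent_1d(learning_rate=0.001)  # barely moves away from the start
reasonable = gradient_descent_1d(learning_rate=0.1)   # ends up close to the minimum at 0
too_big = gradient_descent_1d(learning_rate=1.1)      # overcorrects, ends up far from 0

print(too_small, reasonable, too_big)
```

With the same number of steps, the small learning rate has hardly made progress, the moderate one is close to the minimum, and the large one jumps across the minimum by more than it gains on each step, so it moves further away instead of settling.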
So the mean squared error is just a quadratic equation, that's the square here, and we can see that what gradient descent does is it just takes our current weights and basically shifts our weights downhill by the learning rate times the gradient of our cost function at wherever we currently are. So if we look at the graph, if our error is more towards that left side, higher up the curve, then we're gonna have a steeper gradient at that point, and so we adjust more. Whereas if we're closer to the minimum, which is what we want to reach, then our gradient is going to be smaller. Now, the learning rate: you can essentially think of it as scaling that gradient. So we can see that changing the learning rate is basically changing between the orange and the purple arrow in the one case, or the black and the blue in the other. So we can make the learning rate bigger or smaller, and that basically affects the step size that we're going to take. So all that we're doing with gradient descent is we have our weights, including a bias, you remember that term from before, and we just look at our current weights and we update them based on the error for each weight, which comes from the cost function. All right, the next thing that we can do is use momentum optimization. Now, what momentum does is it basically takes gradient descent, which is what we looked at just previously, and adds a little bit to it. So what gradient descent does is, at every single step, we calculate the gradient, and then we adjust our weights accordingly. So if we're very high up the error curve, our gradient is steeper, and we're gonna adjust more. And the closer we get to that dip at the very bottom, we can see that our arrows get smaller and smaller, and so we're gonna adjust less. What momentum does is basically keep track of the previous gradients, as we can see here.
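The gradient descent update on the mean squared error can be sketched for a single weight and a bias. This is an illustrative example under stated assumptions: the data are invented so that y = 2x + 1 exactly, the helper names `mse_gradients` and `fit_line` are hypothetical, and the update used is the common textbook form (new weight = old weight minus eta times gradient), since the slide's exact formula is not in the transcript.

```python
def mse_gradients(weight, bias, xs, ys):
    """Gradients of the mean squared error of the line weight*x + bias."""
    n = len(xs)
    grad_w = sum(2.0 * (weight * x + bias - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2.0 * (weight * x + bias - y) for x, y in zip(xs, ys)) / n
    return grad_w, grad_b

def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Plain gradient descent: step each parameter against its gradient."""
    weight, bias = 0.0, 0.0
    for _ in range(steps):
        grad_w, grad_b = mse_gradients(weight, bias, xs, ys)
        weight -= learning_rate * grad_w  # steep gradient -> big adjustment
        bias -= learning_rate * grad_b    # flat gradient  -> small adjustment
    return weight, bias

# Made-up data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

weight, bias = fit_line(xs, ys)
print(weight, bias)
```

Far from the minimum the gradients are large and the parameters move a lot. Near the minimum the gradients shrink and the steps shrink with them, which is the slowing-down behavior described for the arrows on the curve.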
The momentum is what I've marked in green. And if we start on the left, we start rolling down the hill, and our momentum is shown there in green. And then we also take the gradient at the local point and change our momentum accordingly. And then we update our weights based on the new momentum. The effect that this has, as we can see on the left, is that we're going to start rolling down the hill much faster, essentially, because we're gonna pick up momentum. But when we reach the bottom, we can see on the right, for example, that the momentum is still pointing up, and so we're gonna roll past the bottom a little bit further before we decelerate and come back down. So with plain gradient descent, starting on the left-hand side, we're kind of rolling down the hill and slowing down as we reach the bottom. Whereas with momentum, you can imagine it more like we have a bowl, and we put a ball on the edge of the bowl and just let it run down. And even as the ball reaches the lowest point, it still has some momentum that's gonna carry it up the other side of the bowl before it gets slowed down. And then it comes back, and it kind of bounces around a little bit until it stays put in that lowest part. And so momentum as an optimizer works in a very similar way. The next type of optimizer is called the Nesterov accelerated gradient, or NAG for short, and it basically just takes momentum, but rather than looking at the gradient at the current weights, it looks at the gradient at the current weights plus our momentum. So we can see that our current step would be those now more transparent arrows, and our step at our current weights plus the momentum would be the more opaque arrows, so the more colored ones. And a nice thing about that is that our algorithm, or rather the optimizer, looks a little bit into the future, in quotation marks, at the next step.
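The ball-in-a-bowl picture, and the look-ahead idea of NAG, can be sketched on the same kind of toy bowl. This is a rough illustration, not the lecture's slide formula: it assumes the cost J(w) = w², a hypothetical function name `momentum_descent`, a momentum factor called `beta`, and one common textbook form of the updates (velocity = beta * velocity - eta * gradient, then weight = weight + velocity).

```python
def gradient(w):
    return 2.0 * w  # slope of the toy bowl J(w) = w**2, minimum at w = 0

def momentum_descent(start=5.0, learning_rate=0.1, beta=0.9, steps=200,
                     nesterov=False):
    """Return the list of weights visited by momentum (or NAG) on the toy bowl."""
    w, velocity = start, 0.0
    path = [w]
    for _ in range(steps):
        # Plain momentum measures the slope where we are; NAG measures it
        # where the current momentum is about to carry us.
        lookahead = w + beta * velocity if nesterov else w
        velocity = beta * velocity - learning_rate * gradient(lookahead)
        w = w + velocity
        path.append(w)
    return path

momentum_path = momentum_descent(nesterov=False)
nesterov_path = momentum_descent(nesterov=True)

# The most negative weight reached shows how far each one rolled up the far side.
print(min(momentum_path), min(nesterov_path))
```

Both runs start at w = 5 and roll past the minimum at 0 onto the negative side, like the ball going up the other side of the bowl. In this sketch the NAG run overshoots less than plain momentum, because its look-ahead gradient starts braking before the weights have passed the bottom.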
And so it's not going to accelerate as much. As we can see in the left-hand case, we're actually not going to get that big of a boost. And we're also going to decelerate more once we've passed the optimal point, as we can see on the right-hand side, where we already have a larger kickback. So what NAG does is, rather than using the gradients at the current weights, it looks at the gradients at the current weights plus the momentum. And then it adjusts the momentum according to that and updates our weights based on the momentum. All right, so the next types are going to be AdaGrad and RMSProp, and these are very much related, which is why I put them in this category. Essentially, kind of similar to the momentum before, we have another symbol that we're using for updating, and all that we're doing is we have our current starting value, and we change that based upon the gradient in every single direction. And so we're kind of scaling the gradients, as we can see. So we're taking the gradient of each weight and we're multiplying it by itself. That's gonna be our s value here, which is gonna be defined for every single weight. And then we update our actual weights and biases by subtracting the gradient, with each value divided by that previously defined s value. So what we're doing here is we're actually reducing the effect of bigger gradients, and that way we're not going to be going off in the wrong direction, but we're essentially going to be guided more towards that optimum, quicker. Now, that strange e at the bottom, that's called epsilon. That's just a very small number, and you have it so that you don't divide by zero. But essentially what AdaGrad does is it takes all of the gradients and scales them based on their size. And that way your algorithm doesn't really go off into wrong directions, but it will go more quickly in the right direction. RMSProp is almost the same thing.
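The AdaGrad idea of keeping one s value per weight and dividing by it can be sketched as a single update step. This is a sketch under assumptions: the function name `adagrad_step` is hypothetical, the gradients are made-up numbers, and the update uses the common textbook form, which divides by the square root of s plus epsilon (the slide's exact formula is not in the transcript).

```python
import math

def adagrad_step(weights, gradients, s_values, learning_rate=0.5, epsilon=1e-8):
    """One AdaGrad update. Every weight has its own accumulated s value."""
    new_weights, new_s_values = [], []
    for w, g, s in zip(weights, gradients, s_values):
        s = s + g * g  # gradient multiplied by itself, added to the running total
        # Divide by sqrt(s + epsilon): big gradients are scaled down strongly,
        # and epsilon keeps us from dividing by zero.
        w = w - learning_rate * g / math.sqrt(s + epsilon)
        new_weights.append(w)
        new_s_values.append(s)
    return new_weights, new_s_values

# Made-up situation: the first direction is 100 times steeper than the second.
weights = [1.0, 1.0]
gradients = [100.0, 1.0]
s_values = [0.0, 0.0]

weights, s_values = adagrad_step(weights, gradients, s_values)
print(weights, s_values)
```

Even though one gradient is 100 times larger than the other, both weights move by almost the same amount on this first step. That is the "reducing the effect of bigger gradients" part: the steep direction no longer dominates where the update goes.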
All that we're doing is we're adding a scaling term, which is this gamma here, and so this way we kind of discard some of the earlier parts, and we can adjust the weight that we give to the current gradients as well as to the memory of the previous values. So we can see that AdaGrad and RMSProp are very similar. They have the same goal, which is not letting your algorithm go off into the wrong directions, but they achieve this a little bit differently. And RMSProp is usually a bit better than AdaGrad, just because it doesn't remember as much of the old values, it forgets faster, and it gives more consideration to the current values. But this gamma, of course, is also something that you can tune, so there's another hyperparameter. All right, finally, we've got Adam and Nadam. Now, what these do, and you'll probably recognize a lot of these parts: you've got the p, which is essentially a p for momentum, and we've got the s, which is the s from AdaGrad, or from RMSProp, actually, in this case. And what we're doing is we combine our p and our s, and we basically update based on these two values. And there's a little bit of scaling going on, so we're dividing our p by one minus gamma one and our s by one minus gamma two. Essentially we're just doing this so that we reach more optimal values of p and s quicker. But the idea of this Adam algorithm is that it combines momentum with what's used for RMSProp, and it works pretty well. And then you've also got Nadam, which basically takes Adam and adds the Nesterov part. So rather than looking at the gradients at the current place, it will look at the gradients at the next place, after considering momentum. So basically what NAG is to momentum, Nadam is to Adam. So these are all the popular optimizers, if you just want to go and pick one. The default one that's usually used is gradient descent, but it is by no means the optimal one.
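Here is a rough sketch of how the gamma term changes AdaGrad into RMSProp, and how Adam combines a momentum-like p with an RMSProp-like s. The function names are hypothetical, the default values are commonly used ones rather than anything stated in the lecture, and the formulas are the usual textbook forms. In particular, the scaling is assumed to divide by one minus gamma raised to the step number t, which is the standard Adam correction. The transcript only says "one minus gamma".

```python
import math

def rmsprop_step(w, g, s, learning_rate=0.01, gamma=0.9, epsilon=1e-8):
    """One RMSProp update for a single weight."""
    # gamma decides how much old history is kept and how much is forgotten.
    s = gamma * s + (1.0 - gamma) * g * g
    w = w - learning_rate * g / math.sqrt(s + epsilon)
    return w, s

def adam_step(w, g, p, s, t, learning_rate=0.1,
              gamma1=0.9, gamma2=0.999, epsilon=1e-8):
    """One Adam update for a single weight; t is the step number, starting at 1."""
    p = gamma1 * p + (1.0 - gamma1) * g       # momentum-like running gradient
    s = gamma2 * s + (1.0 - gamma2) * g * g   # RMSProp-like running squared gradient
    p_hat = p / (1.0 - gamma1 ** t)           # scaling so early values are not too small
    s_hat = s / (1.0 - gamma2 ** t)
    w = w - learning_rate * p_hat / (math.sqrt(s_hat) + epsilon)
    return w, p, s

# Usage on the toy cost J(w) = w**2 (gradient 2*w), a few steps only.
w, p, s = 5.0, 0.0, 0.0
for t in range(1, 4):
    w, p, s = adam_step(w, 2.0 * w, p, s, t)
print(w)
```

Nadam is not shown here. Following the description above, it would evaluate the gradient after taking the momentum into account, in the same way that NAG modifies plain momentum.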
If you're just going to choose an optimizer without much thought, it's usually good to go with Adam or Nadam. These perform very, very well. And so, if you're not really sure, you should probably go with those. But you should be aware that Adam and Nadam usually aren't the defaults, so you have to choose them yourself. And you can see that each of these optimizers behaves differently, and so there are guidelines for which ones are generally better. But of course, this can also depend on the situation. And if you're not sure, it may be worthwhile experimenting with, like, two or three of them, and then seeing which type of optimizer actually performs the best in terms of making your algorithm learn the quickest, and then go with that the rest of the way through, so that when you're improving your model, it actually learns quicker than it otherwise would. All right. And so with that, we've reached the end of the course, and I just wanted to thank you all for joining me. I hope you've learned a lot. I hope you've enjoyed the course, and I hope you feel comfortable having some conversations about machine learning now. Now, you may have noticed throughout this course that there's a lot of data science work that goes into it, and that the machine learning specialization is actually built on having really solid data science skills. So if you're interested in specializing in machine learning, or if you're interested in becoming a data scientist, I also have some great content on my website, codingwithmax.com, designed specifically to get you from absolutely no experience to data scientist. So if you're interested in that, or if you want to hear my tips on how I got started, again, you can check it out on codingwithmax.com. And yeah, thank you again for joining me, and I hope to see you in one of my other courses.