Machine Learning Classification Profits - Top 7 CLASSIFICATION Models You Must Know in 2021 | Python Profits | Skillshare

Machine Learning Classification Profits - Top 7 CLASSIFICATION Models You Must Know in 2021

Python Profits, Master Python & Accelerate Your Profits

Lessons in This Class

46 Lessons (3h 8m)
    • 1. course intro + benefits

    • 2. installation and setup

    • 3. methodology

    • 4. what is classification

    • 5. applications of classification

    • 6. LR introduction

    • 7. logistic regression models in scikit learn

    • 8. quiz

    • 9. quiz solution

    • 10. KNN introduction

    • 11. KNN on a Kaggle data set

    • 12. KNN Visualizations

    • 13. quiz

    • 14. quiz solution

    • 15. SVM introduction

    • 16. Linear SVMs

    • 17. Non linear SVMs

    • 18. Preprocessing for SVMs

    • 19. Quiz

    • 20. Intro to Naive Bayes Classifiers

    • 21. Comparison with SVMs on text data

    • 22. Exercise

    • 23. Exercise solution

    • 24. Introduction to DT Classification

    • 25. Comparison with the other models

    • 26. Visualizing Decision Trees

    • 27. Hyperparameter Analysis

    • 28. Quiz

    • 29. Introduction to RF Classification

    • 30. Comparison between DT and RF

    • 31. Visualizing Random Forests

    • 32. Hyperparameter Analysis

    • 33. Quiz

    • 34. Introduction to MLP Classification

    • 35. Comparison to the DT and RF models

    • 36. Hyperparameter Analysis

    • 37. Exercise

    • 38. Data collection

    • 39. Results and analysis

    • 40. Quiz

    • 41. Useful data preprocessing techniques and approaches

    • 42. Ensemble methods

    • 43. Overview and hints

    • 44. Proposed solution

    • 45. Tips and tricks, pitfalls to avoid, shortcuts

    • 46. Additional resources, next steps, recommended tools

About This Class

Are You Ready to Master the Advanced Machine Learning Skill That Will Make You the Expert Data Scientist You Should Be?

Then, you’ve come to the right place! I’ll tell you the framework that the best machine learning experts use to help businesses solve the crisis they’re facing while positioning themselves as one of the top experts in the world!

Dear Seeker of Data Knowledge,

Are you struggling to improve your machine learning development skills?

Do you want to move forward and learn the skill that will make you a great data scientist?

If your answer is ‘Yes’, then continue reading because I’m about to share with you the machine learning skill that will take you from being a mediocre data scientist to an expert level.

Is Data Science the Next Big Thing?

According to a survey done by some experts, data science will be the next big thing for many businesses.

It is expected that at least 1.3 Trillion U.S. dollars in revenue will be generated because of this industry.

The reason? Data science will solve a lot of business problems and needs, which will make their whole workflow more efficient.

So, why is it important for aspiring professionals like you to learn about this?

This is because the industry is in its beginning stages…

Riding that tide as early as now will mean more opportunities for people like you.

How Do You Benefit from this?

If you are looking for a lucrative career where you can earn a lot, then becoming a machine learning expert is the thing for you!

A lot of companies will be looking for the best data scientists and machine learning or deep learning experts to help them with their business needs.

And if you already have the basic knowledge of machine learning and deep learning, you are already ahead of the pack!

Now, this doesn’t mean they can’t catch up to you.

This is why the only way to stay ahead is to continuously improve your knowledge!

So, here’s the deal…

We want to offer you the chance to learn the advanced lessons that will help you use your machine learning and deep learning lessons easily in your day-to-day tasks!

The ONLY course you will ever need to take your machine learning to the HIGHEST level!

You also get these exciting FREE bonuses!

Bonus #1: Big Insider Secrets
These are industry secrets that most experts don’t share without being paid thousands of dollars. They include how experts successfully debug and fix projects that usually hit a dead end, and how they successfully launch a machine learning program.

Bonus #2: 4 Advanced Lessons
We will teach you the advanced lessons that are not included in most deep learning courses out there. It contains shortcuts and programming “hacks” that will make your life as a machine learning developer easier.

Bonus #3: Solved Capstone Project
You will be given access to apply your new-found knowledge through the capstone project. This ensures that both your mind and body will remember all the things that you’ve learned. After all, experience is the best teacher.

Bonus #4: 20+ Jupyter Code Notebooks 
You’ll be able to download files that contain live code, narrative text, numerical simulations, visualizations, and equations that most experts use to create their own projects. This can help you come up with better codes that y


Meet Your Teacher

Python Profits

Master Python & Accelerate Your Profits


We are Python Profits, and our goal is to help people like you become better prepared for future opportunities in Data Science using Python.

The amount of data collected by businesses exploded in the past 20 years. But, the human skills to study and decode them have not caught up with that speed.

It is our goal to make sure that we are not left behind in terms of analyzing these pieces of information for our future.

This is why throughout the years, we’ve studied methods and hired experts in Data Science to create training courses that will help those who seek the power to become better in this field.

1. course intro + benefits: Classification models with real exercises, course overview. Let me tell you a bit about me first. Most importantly, I hope that I am able to teach you and inspire confidence in you to learn the course materials and to believe in yourself that you can do this. Even if it might seem difficult at times, I assure you that it's possible and not as hard as it seems at first. I have over six years of teaching experience, both in formal settings such as high schools and universities, and in informal settings like online tutoring websites and video courses such as this one. I have a PhD in machine learning focused on applying machine learning techniques to help researchers and workers in other fields of study. And last but not least, I love to explore all things tech related and learn new things every day. The target audience for this course is anyone who is interested in studying machine learning, regardless of their programming background. Whether you are just starting out with coding in general or you are an experienced coder seeking to broaden your skills, this is the right course for you. You will learn how to get things done fast and correctly, and how to understand them at an intuitive level. So this is for practicality-oriented people. The emphasis in this course is on practicality. This means that we will only cover the theory that is absolutely necessary for understanding something. Otherwise, we will focus on getting things done and understanding how to use the right tools for that. We will also favor libraries that focus on ease of use, that accomplish things fast, and that are used for developing real-world applications. Here are the benefits that you will get from following this course. It's going to be practical and applicable. We will learn how and where to find the real-world datasets that highly skilled professionals use to do their jobs and bring value to their businesses. It's going to be suitable for everyone.
We will introduce all the necessary prerequisites before advancing to new topics. And we will make everything as easy to understand as possible without sacrificing depth and correctness. We are going to teach essential skills. Most of the jobs on LinkedIn currently have some machine learning or data science related requirement. And this will only become more and more true as the field advances each day. No one wants to get left behind, and it's as easy as ever not to. We will use the latest tech and datasets developed by industry-leading professionals and researchers. So we will not teach anything that is already obsolete. We will also teach the best practices, and these, compared to the technologies, do not change often. So you will be able to adapt to new tech easily.

2. installation and setup: In this video, we are going to talk about the installation and setup of the necessary tools and software for following our video courses. We are going to cover how to install Anaconda and why it's useful, how to install Jupyter and what we'll be using it for and why, and how to use Jupyter with some basic examples. So the first step is to go and download Anaconda. This is an optional step, but it's highly recommended anyway because our videos will use Anaconda, and in order to follow them as closely and as efficiently as possible, we recommend that you use it as well. Anaconda is a wrapper around Python that comes with a lot of packages pre-installed. And it has a lot of useful features, such as allowing you to run multiple Python versions simultaneously, each with their own installed packages. So it's a great tool, and a lot of people in the machine learning and data science community use it. Therefore, it helps to know what it's about as well. So, according to your OS: for Windows, download the 64-bit graphical installer. For macOS, we also recommend the graphical installer. And for Linux, just the 64-bit installer. In each case, get the Python 3.7 version.
All right, I'm not going to be installing it because I already have Python installed. We're just going to cover now how to use Jupyter. In order to use Jupyter, you have to go into some folder. I recommend a rather empty folder. I have here the folder that I use to store my course files. Shift-click and click Open PowerShell window. If you're on Mac or Linux, there are other ways you can use to open a terminal window in the same folder that you are in. Or you can just open it anywhere and move to the folder you want to use. Alright, so if you installed Anaconda, you can run Jupyter simply by typing jupyter notebook in your terminal. It should open a window like this. And you can see we have the same files that we had in the Explorer for that folder. So we are in the folder that we ran the command from, in my case, this folder. Alright, but before we get into Jupyter and how we can use it, let me stop this by hitting Ctrl+C twice, and tell you what you can do if you don't have Anaconda installed. If you only have Python installed, then chances are you don't have Jupyter installed. In order to install it, you can use pip. So type pip install jupyter and execute this. It will install a bunch of stuff, and it'll take a while. It didn't take a while for me because I already have it, so it's simply saying that I have everything already. But for you, it might take a few minutes if you haven't installed it already. All right, so let's run it again and talk a bit about how to use it. OK, so we are here, and the first thing we need to do is create a file in which we can write Python code. So we'll go to New and, under Notebook, choose Python 3. Okay, and we should get something like that. Let's write some Python code here. These are called cells, and in each of these cells we can execute Python code. So let's start with something simple: ten to the power of three, which should output 1000. So I executed it, and here is the result.
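As a small illustration of that first cell (a sketch, not a copy of what is typed on screen): Python's exponentiation operator is `**`, and the variable name `result` is only an assumption used here so the value can be printed.

```python
# Contents of a single notebook cell.
# In Python, ** is the exponentiation operator, so this is "ten to the power of three".
result = 10 ** 3
print(result)  # 1000
```

In a notebook, typing just `10 ** 3` as the last line of a cell displays the value without an explicit `print`.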
But how did I execute it? Because I didn't click on anything. Well, I executed it with Shift+Enter. What that does is execute the code cell and create a new cell. There are other ways of executing code in our cells as well. For example, you can see here that I executed it again without clicking anything, and I got the result, but no new cell was made. That is because I didn't use Shift+Enter. I used Ctrl+Enter. Okay? Another thing we can do is execute it with our mouse. We have this Run button here, which executes it and goes to a new cell. We can also insert cells using this menu here. And we can change the cell type. For example, we can have Markdown cells, which let us write some kind of comments in the notebook itself. They are not Python comments; it's text that will show up as text in the notebook, and it will not be executed as code. This is usually used for headings, or simply to provide some information to whoever is using or reading your notebook without putting that information in code. So you could have here something like Section 1 and give it some title. And this is executed in the same way. For example, if I hit Shift+Enter now, I get this, and that remains there forever. Okay? Another important thing to know is that you will often run into situations in which your code gets stuck or runs forever. For example, if we do something like x = 0, then while True: x += 1. If we execute this, see, we get this star here, which means that the code is executing. While that star stays there, this code has not finished executing. But obviously it will run forever, and we don't want to wait until something happens that stops it. We want to stop it now. In order to do that, we can click this button here, which says Interrupt the kernel, and it will stop. But depending on what you're doing, this isn't always so easy.
So let's run it again and go over a few other methods of stopping code that refuses to stop. OK, so if this doesn't work, what you can try is going to Kernel, then Restart. It gives you this warning. You can simply click Restart. And you saw that it said Restarting kernel, and then it said Kernel ready. Once it says Kernel ready, it has finished restarting. The disadvantage of this is that you lose all the variables. By that I mean that these cells should generally be executed sequentially. So if I have, let's not call it x, let's say a = 1 here, and I print a here. If I run this cell first, I get an error: a is not defined. But if I run this cell, and then this one, see, now a is defined and its value is printed. OK? So if you restart the kernel, you lose all this; you lose the a here. So you generally want to avoid that if possible. Okay? And here is another thing that almost definitely works. If we run this again, and let's say you go here to Kernel, Restart, and nothing happens when you restart the kernel. Well, then you have to go back. So let's save this with Ctrl+S. Okay, click here. It gives you this warning, but let's ignore it. The green thing here means that the notebook is running. And it even says Running here. We can stop it by clicking here and clicking Shutdown. It took a couple of seconds, but you can see it's now grayed out, which means it's no longer running. And this will definitely work. This should be able to stop any code that refuses to terminate. Okay, I'll end this by saying that if you click here on the title, you can give the notebook a name: Intro to Jupyter. It's always good to name your notebooks. As you can see, the advantage here is that you can organize everything very nicely. And it reads like a real notebook that someone took notes in, which is very good for presenting the kinds of concepts and the kind of information that we will be presenting in our videos.
And that is why we encourage the use of Jupyter. I hope this has been informative and that you will enjoy using Jupyter together with us in our video courses.

3. methodology: Let's talk about the methodology employed in this course. First, we have the how, when and why approach. You might have noticed that we've already covered how, when and why in relation to this course overview. We've talked about what we'll be teaching, why, a bit about how, and now we're talking more about the how. This will be done throughout the course for every topic and concept, so that you have a clear picture about how to do it, when to do it, and most importantly, why you might want to do something in the first place. Next, we will have quizzes. It's important to test your understanding of things. So we will have regular quizzes for you to make sure that you are following along. They will all be explained in case you ever get stuck. Next, we will also have exercises. It's also important to put what you learn to good use outside the classroom. The exercises we have prepared for you will help you do just that by being as similar as possible to what you might encounter in the industry. Don't be discouraged, though. You won't have to spend hours on each one. Last but not least, tutorial hell is when you are stuck just following tutorials about a technology but still don't have a clear idea about what to actually do with what you're learning. We definitely want to help you avoid this pitfall by always pointing out things that you can build on your own.

4. what is classification: In this video, we will talk about what classification is. Classification means placing an object into one out of a number of categories. In machine learning, it means predicting this category or class for a given object based on its features. Let's see what this means with a concrete example. Consider the following figures. One is a cat and the other is a dog.
A classification problem would go like this: given the pixels in an image, determine if the image is of a cat or a dog. We would use machine learning to determine this. Our dataset would consist of a bunch of images of cats and dogs, which we would use to train a machine learning algorithm. The algorithm should then be able to make accurate predictions for images that it has not seen during training. Of course, in the real world, pictures will not be as clear as these two. They might be blurred or taken from weird angles. The animal might be obstructed by other objects, and so on. Also, what happens if we have many more pictures of cats than of dogs in the dataset, or vice versa? We will answer all of these questions and others in the upcoming videos.

5. applications of classification: In this video, we are going to talk about the applications of classification. Here are some possible uses for classification. The same applications that regression has also apply here. So if you've watched our regression course, feel free to fast forward. First, we have forecasting. Let's say that you have a business and are considering buying some video ads to show up in various places, such as YouTube videos. You could use classification to get an idea about whether or not enough people would watch your ad without skipping it. This could help you decide how much to pay for such ads. Next, there is business process optimization. Let's consider a food delivery business. We might analyze the impact of electric bike delivery during rush hour traffic on customer satisfaction. This could help us decide whether or not it's worth it to buy more electric bikes. Most people nowadays are faced with decisions on a daily basis. You don't even need to be a manager for this to be true. Classification techniques can help you make sense of large amounts of data and test the impact of a certain decision before you actually make it. We can also use classification to correct certain errors in judgment.
For example, one topic that can become quite controversial is whether a minimum wage increase is a good thing or not. One side argues that it will benefit low-income workers, while the other argues that small businesses won't be able to pay and that it will thus increase unemployment. One side is in error, but it might not be the same one everywhere on the globe. Classification could help with this by learning from the various effects that these kinds of measures generated where they were applied. Over time, businesses have gathered a lot of data that has the potential to yield valuable insights. However, this data requires proper analysis. Classification analysis techniques can find a relationship between different variables by uncovering patterns that were previously unnoticed. So as you can see, classification has plenty of uses, and you can almost always find something to use it on, whether you're a business owner or are simply looking to improve small things in your life. Another important application of classification is in image processing. It has applications in image recognition, object detection, and other image-related tasks. We will see some of them in this course.

6. LR introduction: Logistic regression models are machine learning models that learn to predict probabilities for a certain class or event given some input features. So they give us the probability for the class or event. Basically, these are the most basic models used for classification. They use the logistic function, or sigmoid function, graphed in this figure here. Let's take a look at it. First of all, it is defined as: sigmoid of t is equal to one over one plus e to the power of minus t, that is, sigmoid(t) = 1 / (1 + e^(-t)). Recall that raising something to a negative power basically means one over that something to that power. So e to the power of minus t here means one over e to the power of t. If we look at the graph, this function's value is close to 0 for most negative numbers and close to 1 for most positive numbers.
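The logistic function defined above can be evaluated directly. This is a minimal sketch using only the standard library; the function name `sigmoid` and the sample inputs are choices made here for illustration, not taken from the course notebooks.

```python
import math


def sigmoid(t: float) -> float:
    """Logistic (sigmoid) function: 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


# Evaluate the function at a few points across the range shown on the graph.
for t in (-6, -2, 0, 2, 6):
    print(f"sigmoid({t:>2}) = {sigmoid(t):.4f}")
```

The printed values stay strictly between 0 and 1, which is what lets the output be read as a probability.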
Between about -6 and 6, its value slowly rises from 0 to 1, reaching 0.5 at 0. Let's say we want to predict if it will rain tomorrow or not based on some inputs that we don't really care about at this time. After training our logistic regression model and giving it some inputs, it will use this logistic function to make a prediction. This output will be the probability of rain. Below 0.5 we will interpret as no rain tomorrow, and above or equal to 0.5 we will interpret as yes, it will rain tomorrow. This is all the theory we need at this moment.

7. logistic regression models in scikit learn: In this video, we are going to talk about logistic regression models in scikit-learn. We're going to use a dataset for predicting mobile phone prices. If we go to Kaggle, it's this one, the mobile price classification dataset from about two years ago, with these tags and this usability score. And if we scroll down, we are only going to use this train.csv file to keep things simple. You can also easily modify the code to make predictions on the test dataset. If we look at the columns, make sure that you select all of them. They might not all be selected by default. We see that we have a price range here with four options, from 0, which is low cost, to 3, which is very high cost, and they are equally distributed. We have about 500 data instances, well, actually exactly 500 data instances in each class, making this dataset have 2,000 data instances in total. And if we go to scikit-learn, we are going to use this logistic regression model. It's also called the logit or MaxEnt classifier. You can find it under these names as well. And you can see that we have a lot of hyperparameters here, but we are not going to be using most of them. Okay, so let's go over the code that we have here. It's very similar to the code in the linear regression course. So if you haven't watched that course, I strongly recommend that you do. I will be explaining things, but I will do it a bit faster.
I go into a bit more detail in that course. But if you are already somewhat familiar with scikit-learn, the faster explanations that I'm going to be providing in this course should be enough for you. Okay, so here we import pandas, which we will use to read the dataset and to do some simple statistical analysis, and seaborn for displaying some figures for more data analysis. We read the data and call describe on the data, which gives us this table here. And we can see that the data is quite nice. The values are nothing out of the ordinary. There are no missing values. So we should be able to easily work with this dataset. Here we have a function that will give us the train-test split. The test set size will be 30% of the data. And for data analysis, we have the pair plot, which I'm not going to run right now because I ran it before. It doesn't provide any useful information in this case. It also takes a very long while to run and generates a very large figure that would be quite hard to talk about at this time. But we will discuss this correlations map, which displays correlations between the features and also correlations between the features and the target column, which is the price range here. And we can see that all of them are quite correlated. Even if most of this is in blue, which would suggest a low correlation, if we look here, only the darkest blue is at 0 and everything else is above 0. So quite good. This value here is about this shade here, so it's probably around a correlation of 0.3, 0.3-something. And this one here, which is the RAM, is the most correlated with the price range. It's quite close to 1 according to this shade here. So this is encouraging. It's likely that we're going to get good results. Okay, here we have our pipeline setup, where we set up our classification pipeline. We don't need most of these. I have copy-pasted this from the regression course.
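The data-analysis steps described above (describe, correlations with the target, and a 30% test split) might look roughly like the sketch below. It does not read the Kaggle file; it builds a small synthetic stand-in, and the column names `ram`, `battery_power` and `price_range`, the helper `split_data`, and all numeric ranges are assumptions made only for this illustration. The heatmap itself is replaced by a plain printout of the correlations with the target.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the phone data (NOT the real Kaggle file).
rng = np.random.default_rng(0)
n_rows = 2000
ram = rng.integers(256, 4000, size=n_rows)
battery_power = rng.integers(500, 2000, size=n_rows)
# Make the target depend mostly on RAM, then cut it into 4 equally sized classes (0..3).
score = ram + rng.normal(0, 300, size=n_rows)
price_range = pd.qcut(score, q=4, labels=False)

df = pd.DataFrame(
    {"ram": ram, "battery_power": battery_power, "price_range": price_range}
)

# Summary statistics, similar to the table shown in the video.
print(df.describe())

# Correlation of every feature with the target column (numeric view of the heatmap).
correlations = df.corr()["price_range"].drop("price_range")
print(correlations.sort_values(ascending=False))


def split_data(frame, target, test_size=0.3):
    """Split a DataFrame into train and test features/labels (hypothetical helper)."""
    features = frame.drop(columns=[target])
    labels = frame[target]
    return train_test_split(features, labels, test_size=test_size, random_state=0)


X_train, X_test, y_train, y_test = split_data(df, "price_range")
print(len(X_train), len(X_test))
```

With the real dataset, the DataFrame would come from `pd.read_csv` on the downloaded training file instead of the synthetic columns.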
So let's get rid of the things that we don't need, like the linear regressor there. This setup is there to help us work with multiple datasets, even if so far we only have one. We are given a machine learning model and a list of one-hot columns, which are the columns that will need to be one-hot encoded. We encode them using a column transformer and the one-hot encoder, passing remainder equals passthrough so that we only encode the columns given here. We use a min-max scaler. And we return our pipeline for classification. Again, this is taken from the linear regression course, or the regression course, actually. We have this evaluate function, which takes a list of classifiers and datasets and runs each classifier on each dataset. And we will output the accuracy score. We do that by iterating over the datasets, printing something helpful, splitting the data into train and test sets, getting the pipeline for each classifier, fitting the pipeline on the training data, and outputting the accuracy score for that classifier on this dataset. In our case, we call that only with logistic regression, for which I have passed in the solver as LBFGS, which is considered the best general-case solver. And I passed in multi-class as multinomial. Multinomial means that there are more than two possible classes. Logistic regression, like we've seen in the introductory video, only makes sense for binary classification, that is, when we have only two classes. But here we have four. So we need a generalization. That generalization is done by passing multinomial to this multi-class hyperparameter here, and that will generalize logistic regression to work with multiple classes. And here we pass in the dataset. We pass df, which is the DataFrame, as our mobile DataFrame. Okay, let's run this again. I already ran it, but just so you can see that it runs. We run it again and we get an accuracy of 0.9, that is 90%. This means that for unseen data, about 90% of our predictions should be correct.
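The pipeline described above can be sketched as follows. This is an approximation under stated assumptions, not the course's notebook code: the helper name `get_pipeline`, the synthetic data from `make_classification`, and the made-up categorical column `colour` are all hypothetical. The sketch also leaves out the `multi_class="multinomial"` argument used in the video, because recent scikit-learn releases deprecate that parameter; with the `lbfgs` solver, a problem with more than two classes is fitted as a multinomial model anyway.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder


def get_pipeline(model, one_hot_columns):
    """One-hot encode the listed columns, pass the rest through, scale, then apply the model."""
    encoder = ColumnTransformer(
        [("one_hot", OneHotEncoder(handle_unknown="ignore"), one_hot_columns)],
        remainder="passthrough",  # keep every column that is not one-hot encoded
        sparse_threshold=0,       # always produce a dense array for MinMaxScaler
    )
    return Pipeline([("encode", encoder), ("scale", MinMaxScaler()), ("model", model)])


# Synthetic stand-in for the phone data: 4 classes, numeric features (an assumption).
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=5, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, class_sep=1.5, random_state=0,
)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
# Hypothetical categorical column, only there to give the one-hot encoder something to do.
df["colour"] = np.random.default_rng(0).choice(["black", "blue", "red"], size=len(df))

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, random_state=0
)

pipeline = get_pipeline(LogisticRegression(solver="lbfgs", max_iter=1000), ["colour"])
pipeline.fit(X_train, y_train)
accuracy = accuracy_score(y_test, pipeline.predict(X_test))
print(f"accuracy: {accuracy:.3f}")
```

Wrapping the encoder, scaler and model in one `Pipeline` means the scaler is fitted on the training split only, so no information from the test split leaks into preprocessing.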
So you can see that is already quite good. And we've done it with very little effort. Basically, we just renamed a few things in the code of our regression course. But let's check something out. You might recall that in the regression course, this min-max scaler over here didn't have much of an effect. We got about the same results with and without it. So let's see what happens if we comment this out and run this again. Okay, so first of all, we get a warning here that LBFGS, the solver, failed to converge. It's suggesting that we increase the number of iterations, but that's honestly a Band-Aid solution. If it failed to converge with the default settings, it's unlikely it will converge if you increase the number of iterations. And our accuracy took quite a dive as well. We are at 63% now. So here, at least for this dataset and this machine learning model, it seems that data normalization is a lot more important. Min-max scaling has a huge effect here. And that's something to keep in mind for the future. That's it for this video. We'll continue with the exercise.

8. quiz: Here is your exercise for this module. We've passed in these parameters here, solver as LBFGS and multi-class equals multinomial. Read the documentation and see what other things you can pass in here. See how they affect the performance, that is, the accuracy here. And try to explain why those effects happen.

9. quiz solution: Here's the quiz solution, in case you gave up or just want to see the official solution that I had in mind. So first of all, reading the documentation, we see that these solvers are the only ones that support multi-class equals multinomial. So I separated them out and sort of grouped them by this. And then I also ran the same ones using multi-class equals OvR. And you might wonder, what is this OvR? OvR stands for one versus rest.
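As a rough sketch of the one-versus-rest idea: one binary classifier is trained per class, each separating that class from all the others, and the class whose classifier is most confident wins. The example below uses scikit-learn's `OneVsRestClassifier` wrapper on synthetic data rather than the `multi_class` argument used in the video, so it is a stand-in for the scheme and not the course's exact code; the dataset parameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 4-class problem standing in for the phone data (an assumption).
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=5, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, class_sep=1.5, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# One binary logistic regression per class: "this class" versus "everything else".
ovr = OneVsRestClassifier(LogisticRegression(solver="liblinear"))
ovr.fit(X_train, y_train)

print("binary classifiers trained:", len(ovr.estimators_))
ovr_accuracy = accuracy_score(y_test, ovr.predict(X_test))
print(f"one-vs-rest accuracy: {ovr_accuracy:.3f}")
```

The number of fitted estimators equals the number of classes, which is the defining feature of the scheme.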
It basically transforms the multi-class problem, since we have four classes, into multiple binary classification problems with two classes each. What it does is a binary classification for each class, basically taking that class against all the others. We'll see how well that works in practice. But until then, if you read the documentation, you will realize that liblinear only supports this OvR scheme. Intuitively, this OvR scheme sounds kind of like a crutch. It's something that you do to make an algorithm that only works in the binary case also work in the multiple-classes case. So we'll see if that behaves well in practice or not. OK, so for the first four, which are using multinomial for the multi-class parameter, we get the same accuracy, about 90%, almost 91. Going to OvR and liblinear, the accuracy drops to about 80%, not even a full 80%. So the OvR, the one-versus-rest scheme, does not seem to work very well in practice. And that's to be expected because it's kind of a Band-Aid solution. It's better to use an algorithm that supports multiple classes by default without having to use schemes such as these. But if that is not possible, then this approach, the one-versus-rest approach, can save you from not being able to use the algorithm at all. So hopefully you came up with something similar for this exercise and this quiz. Hopefully you found the same explanations or similar ones. If you did, congratulations, and if not, I encourage you to re-watch some of the previous videos and maybe even the regression course.

10. KNN introduction: In this video, we are going to introduce k-nearest neighbors. The k-nearest neighbors model looks at the closest instances in the training set to the instance that we're trying to classify. Because this is all it does, it is usually referred to as a nonparametric model. There's no actual learning going on, just a look at the neighbors at prediction time.
The prediction will be the majority class of the neighbors we look at, and k will be the number of neighbors that we look at. The choice of k can have a big impact on results. For example, consider this Wikipedia example. So here we want to classify this green dot as either a square or a triangle. And if k is equal to three, then the neighbors in this circle are the ones that are going to be considered. So two triangles and one square. This means that the green dot will be classified as a triangle. However, if we increase k to five, we get the neighbors in this dotted circle. And these are three squares and two triangles. So the green dot will be classified as a square. So as you can see, kNN models can be quite useful in some problems, but they depend a lot on the choice of k. They will also lead to some nice visualizations, which we'll take a look at in the next videos.
11. KNN on a Kaggle data set: In this video, we are going to talk about applying kNN on a Kaggle dataset. We are going to use the same mobile phones dataset that we used in the previous video, which dealt with logistic regression. So most of the code remains the same, but let's run the cells in order. We're not going to run the data analysis here because we don't really need it. We need the pipeline setup. And here, for classification, we imported KNeighborsClassifier from sklearn.neighbors. And we added some classifiers for different settings of the number of neighbors to look at, starting with three and going up to 200. And let's see why we did that. So we also left in the logistic regression one for comparison, which gave about 90% accuracy. And K nearest neighbors gives varying levels of accuracy depending on the neighbors parameter. We can see that with 200 neighbors to look at, we get 55% accuracy. And this decreases as we decrease the number of neighbors to look at. So it doesn't behave very well here. And let's discuss a bit about why that might be.
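A minimal sketch of this kind of comparison, trying several values of the number of neighbors and recording the accuracy for each. The synthetic data and the exact list of k values are assumptions (the video only says the settings start at three and go up to 200), so the numbers it prints will differ from the ones quoted here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic four-class data standing in for the mobile phones dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

accuracies = {}
for k in [3, 5, 7, 50, 200]:  # hypothetical settings for n_neighbors
    # Defaults: metric="minkowski" with p=2, i.e. Euclidean distance.
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)  # "fitting" only stores the training set
    accuracies[k] = knn.score(X_test, y_test)
    print(f"n_neighbors={k}: accuracy={accuracies[k]:.3f}")
```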
And let's also discuss its hyperparameters, such as the metric hyperparameter, which is set to minkowski, and this p equals two here. Well, basically this uses the Euclidean distance to compute the nearest neighbors. A Minkowski metric with p equals two basically reduces to the Euclidean distance. So you can think of the Minkowski distance as a generalization of the Euclidean distance, and it becomes the Euclidean distance if you set p equal to two. So we didn't touch those. So why is the algorithm behaving so badly? Well, one possible reason for this might be that, remember, in the introductory video we said that it doesn't really learn anything. It simply looks at the closest neighbors. And that's not always useful, not for all problems. It might not make sense for some problems, and this one might be one of those. It makes sense that the accuracy would be lower, or a lot lower, than that of an algorithm that actually learns something. Another reason might be that it's affected by this MinMaxScaler here. Because this changes the values of the features, it might affect the distance computations done by K nearest neighbors. So let's disable it and see if that's the case here. Well, look at that. So logistic regression drops in accuracy significantly. We already knew that from the previous video. And K nearest neighbors goes up. You can see here, even with just three neighbors, we beat the accuracy of logistic regression with min-max scaling. And then it goes up a bit for 5 and 7 neighbors. And then it starts dropping if we look at too many. So that's another thing to keep in mind: as opposed to logistic regression, K nearest neighbors might not want you to do this kind of scaling of the data. Here it was better to leave the features as they are, so that the distances make more sense. A logistic regressor doesn't care about distances, and it benefits from the scaling, but something that does care about distances between data instances might not like such scaling to be done.
12.
KNN Visualizations: In this video, we are going to talk about kNN visualizations. Because of the nature of k nearest neighbors, we can display some interesting visualizations, and we'll see what we mean in a few seconds. First of all, we are going to reduce the number of features to only two. I have picked RAM and the number of cores for this, but you can pick anything else that you want. And of course the price range has to remain as well. This is the target. So after we do this, make sure we run this cell here. And then we scroll down. And I've written this code here that will plot a decision map for the k nearest neighbors classifier for this problem. I have adapted this code from this page here from the scikit-learn documentation. You can read it to see the original. I've only made a few changes. So here's what's going to happen and how this code works. First of all, we call get splits to get our data into train and test sets. We declare our classifier and fit it on the training data. Next we declare four colors, which we will use to print the decision map. We have four colors because we have four possible classes. And we use a mesh grid from NumPy on which we will make our predictions. So a mesh grid is basically an array of points between the minimum and maximum values in the train set for both features, so feature 0 and feature 1, in increments of 0.5 here. And this is a two-dimensional array. We will predict all of the values in this two-dimensional array, we will plot those predictions, and we will also plot the training points. This is why we have to reduce the dataset to two features, because we want to be able to plot it in two dimensions. So let's see what happens if we run this. It might take a while, maybe a few seconds or even minutes, depending on your processing hardware. And you should get something like this with four colors.
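The decision map steps just described can be sketched roughly as follows. This is not the notebook's code: it uses made-up blob data with two features and four classes, skips the train and test split for brevity, and the plotting helper with its colormaps and file name is a hypothetical illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# Two features and four classes, as in the video; the data itself is made up.
X, y = make_blobs(n_samples=400, centers=4, n_features=2, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Grid covering the range of both features in increments of 0.5.
step = 0.5
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                     np.arange(y_min, y_max, step))

# Predict the class of every grid point, then reshape back to the grid.
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)


def plot_decision_map(path="knn_decision_map.png"):
    """Colour the grid by predicted class and overlay the data points."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    plt.pcolormesh(xx, yy, Z, cmap="Accent", shading="auto")
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap="Dark2", edgecolor="k", s=20)
    plt.savefig(path)
```

Calling plot_decision_map() writes the picture to disk. A smaller step gives a smoother map but means more grid points to predict, which is where the long running time mentioned above comes from.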
So basically, everything here in the green area would be classified as the class corresponding to green, and so on for the purple here, the cyan or sky blue here, and the orange here. And we can see this if we also scatter the actual training instances. You can see that most of these correspond to the color they're on. And if they don't, that likely means that most of their neighbors are of that other color, according to the Euclidean distance. And we get these lined up nicely like this in rows because, remember, I chose one of the features to be the number of cores, and the number of cores goes from one to eight. That's why they are lined up like this. If you choose some continuous feature, one that's not a category like 1, 2, 3, 4, 5, 6, 7, 8, you will get some other distribution. Okay? And of course, this was done just for visualization purposes. If you leave it like this, the accuracy drops significantly, to about 72%. But this shows that K nearest neighbors is useful for problems that have a decision map with certain properties. So if the areas here are well separated, then that means that a kNN algorithm will usually perform very well. And we can actually get this to be even better separated by increasing the number of neighbors here. Let's set it to 50, maybe. If we run it now, you can see that the separation here is even clearer. So you should always try to use kNN if possible, because for classification problems there's a good chance that it will behave quite well in practice. And even if it doesn't, it usually doesn't take many computational resources, so it doesn't cost much to at least try it out.
13. quiz: Here is what your exercise is going to be for this module. Investigate the effects of the algorithm hyperparameter and the p hyperparameter, for some setting of the number of neighbors, to see how they affect the algorithm. Which combination gives the best accuracy?
14. quiz solution: All right, so here is my proposal.
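One way such an experiment could be laid out is sketched below, as a grid over the algorithm and p hyperparameters of KNeighborsClassifier. The synthetic data, the fixed n_neighbors=5 and the four p values are assumptions for illustration.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

algorithms = ["ball_tree", "kd_tree", "brute"]
p_values = [1, 2, 3, 4]  # hypothetical choice; p=1 is Manhattan, p=2 Euclidean

results = {}
for algorithm, p in product(algorithms, p_values):
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm, p=p)
    results[(algorithm, p)] = knn.fit(X_train, y_train).score(X_test, y_test)

best = max(results, key=results.get)
print("best setting:", best, "accuracy:", round(results[best], 3))
```

The algorithm hyperparameter only changes how the neighbors are searched for, not which points count as nearest, so for a fixed p the three algorithms should give essentially the same accuracy and differ mainly in speed.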
I tested out four values of p for each of the three algorithms available and listed in the scikit-learn documentation. And these are ball tree, k-d tree, and brute force. You can test more values of p if you want, but usually people use one or two. So p equals one or p equals two; others are quite rare, and you don't really see them used in practice. So looking at the results, they are all about the same, around 92%, with the best being the 92.5% here. And we obtain that three times. But honestly, these are so close together that it's not really worth analyzing what settings gave us 92.5%. I suggest using the KNeighborsClassifier class with the default hyperparameters, other than the number of neighbors. Of course, we can also investigate what happens if we change the number of neighbors here. For example, let's see if we get better results if we set it to ten. And also, the way I did this, the way I changed all of them at the same time, is by holding down the Alt key. That will allow you to do a free-form selection like this. Simply release your mouse and the Alt key when done and you're able to edit the selection. Click anywhere to get out of that editing mode. If we run it again, we can see that with ten neighbors here, we get about 93% accuracy. And again, all of these results are very close together. It's not really worth differentiating between them. So this idea that looking at more neighbors means an increase in accuracy still seems to hold across the different algorithms and different values of p. Congratulations if you were able to do this quiz and exercise, and if you got into at least as much detail as I have in this official presentation of the solution I had in mind. If you dug even deeper and tried more things as well, that's even better, and even more congratulations are in order.
15. SVM introduction: In this module, we are going to talk about support vector machines, which we are going to introduce by diving straight into the scikit-learn documentation.
So for classification, support vector machines are quite similar to what we have seen for regression, if you have watched that video. And actually, they are meant more for classification than for regression, so they usually work better for classification problems. And let's take a look at these figures to explain how support vector machines work. For now, ignore the text here, just focus on the figures. So we can see that we have about three classes here. Well, actually exactly three classes, because we have three colors. And what support vector machines try to do is draw these decision surfaces such that they are as close to perfect as possible. And by perfect we mean that they are in the middle, or as close to the middle as possible, between the boundaries of the different classes. So if we look at this group here and this group here, this line here, you can see, cuts about straight through the middle of the space between the two groups. And that's a good thing, that's what support vector machines in classification problems try to accomplish. And of course here we have straight shapes, straight lines. But that's not a necessity. If we look here, by using different kernels such as the RBF kernel, or radial basis function kernel, we can obtain these circular shapes as well. And here as well with a polynomial kernel. So these two are obtained using a linear kernel, and these two are obtained using some non-linear kernels. Basically, the kernel determines what this decision surface will look like. All right, so compared to what we've seen so far, support vector machines usually give better results, and they are also used, and give very good results, for text classification problems. That is, we are given some text and we need to classify that text. So in this module we'll also introduce a text dataset and some pre-processing methods pertaining to text datasets. And we'll see how support vector machines work on those datasets.
16.
Linear SVMs: In this video, we are going to talk about linear support vector machines. We're going to introduce another dataset here, and it's going to be a text dataset. We are going to be using this one. It's an Amazon reviews for sentiment analysis dataset. We are not going to be doing actual sentiment analysis, just some basic text classification for now. But this will work just fine. So scroll down and get the train file here. And you might notice that it has this weird .bz2 extension, but that's just an archive. You can open it with something like 7-Zip. Anyway, the course files will have it in text format. Be aware though that unarchived it's about 1.5 gigabytes, so it's quite a big file. Okay, and now let's go over the code changes needed to accommodate this new dataset. So first of all, we will keep the mobiles dataset that we used previously. And we are also going to shuffle it. If you recall, we discussed shuffling in the regression course as well, but we haven't done it here. And we should have, honestly, but better late than never. If you've spotted in the previous videos that we didn't shuffle, congratulations. Now we will have the chance to see how this affects our results. Hopefully it won't affect them much. For this dataset, the Amazon reviews one, we are going to open it using standard Python files. We're not going to be using pandas yet. And we're going to shuffle it using the shuffle function from scikit-learn. It has millions of instances, but that would make training quite slow, so we are only going to be using 100 thousand data instances. And here is what they look like. So it's a list of strings, and each string starts with the label, which can be label one or label two. And this label refers to the number of stars given by the reviewer. Label one means one or two stars, and label two means four or five stars. The three-star ones were removed from this dataset.
You can read more about it on the Kaggle page if you're interested. All right, here we could also close this file. And if we run this, it takes a while, but it reads the file just fine and displays some random reviews. Next, we need to do some more preprocessing in order to turn it into a data frame and to take out the labels into a separate array. And that's because here they are part of the string. And we don't want that; we want a separate array with 0 or 1 for the label. And we do that here. Using list comprehensions, we put a 0 if the instance string starts with label one and a 1 otherwise. And after that, using list comprehensions again and the replace method, we get rid of this label text in the features. We turn it into a DataFrame and we display the DataFrame. So let's run this again. Here it is. Then we have the train test split function, which is mostly the same. I modified it a bit to add this little thing here. And what this does is it ensures that the labels are integers. When taking them out of text data, it can happen that they will not be saved as integers here, and you will get an error further down the line. So it's good practice to add this here. We are not going to be running this data analysis part. And for the pipeline, we also had to make a few changes. The problem here is that for text data, this ColumnTransformer can give some errors. So we have a check here. If we don't have any one-hot columns, which we won't have for text data, then we are not going to be using this one-hot encoder. So we are only going to use it if the list of one-hot columns is not an empty list. The rest remains the same. And I also got rid of adding the MinMaxScaler; we'll come back to that in a few seconds. For the classification, we only made a few changes here again. So first of all, the new imports: LinearSVC from sklearn.svm, then the CountVectorizer class from feature extraction. It will turn the words in our strings into numbers.
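The shuffling, subsampling and label extraction steps described earlier in this video can be sketched as follows. The four sample lines are made up, the "__label__1" / "__label__2" prefixes are an assumption about the file's exact format, and the DataFrame step is left out to keep the sketch short.

```python
import numpy as np
from sklearn.utils import shuffle

# A few made-up lines in the "<label> <review text>" layout.
# In the course the lines are read from the unpacked train file instead.
lines = [
    "__label__2 Great phone case, fits perfectly.",
    "__label__1 Broke after two days, very disappointed.",
    "__label__2 Excellent book, could not put it down.",
    "__label__1 Terrible sound quality.",
]

# Shuffle, then keep at most 100 thousand instances.
lines = shuffle(lines, random_state=0)[:100_000]

# 0 for label one (one or two stars), 1 otherwise (four or five stars).
labels = [0 if line.startswith("__label__1") else 1 for line in lines]

# Remove the label text so that only the review remains as the feature.
texts = [
    line.replace("__label__1 ", "").replace("__label__2 ", "")
    for line in lines
]

# Make sure the labels end up as integers, as the split function does.
y = np.array(labels).astype(int)
```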
And exactly how this works, we will discuss in a future video. For now, just understand that it basically counts how many times each word appears in an instance and represents that instance by those counts. We also have this change here. Again, this is for text data. You want to use ravel here, otherwise you will get some errors when training. Basically, what ravel does is it turns arrays with a shape like (something, 1) into a shape like (something,). And that's how scikit-learn needs it to be. Okay. Then we call evaluate, and we use make_pipeline here to only add the MinMaxScaler for the things that need it, such as logistic regression. For k nearest neighbors, we won't add it, because we saw that it works better without it. And for LinearSVC, we decided to add a StandardScaler here, just to familiarize ourselves with other methods of doing scaling as well. And then we have another evaluate call for our Amazon data. Here we instantiate the CountVectorizer in the pipeline, passing logistic regression and a linear support vector classifier. So two classifiers here, both with CountVectorizer. And I commented out running K nearest neighbors here because it is quite slow. If you want to let it finish, you can. But it's very, very slow, because we have a lot of data, and computing the nearest neighbors with so much data is very slow. So that's a disadvantage of the K nearest neighbors classifier: its speed on very large datasets. So let's run this again and see what kind of results we get. It might take a few minutes to run, but it shouldn't be very long. And it looks like this. We can see that with the MinMaxScaler and logistic regression, we keep getting about 90% accuracy. So the shuffling of the data did not ruin our good results. K nearest neighbors gets about 91 or almost 92%. And the linear support vector classifier only gets about 80%. Can you think about why this might be?
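To make the text pipeline and the ravel step a bit more concrete, here is a minimal sketch on a tiny made-up corpus. The review texts, the labels and the max_iter value are assumptions; the course's own evaluate function is not reproduced here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "great product, works well", "love it, great quality",
    "excellent value and fast delivery", "really good, very happy",
    "terrible, broke after a day", "bad quality, do not buy",
    "awful product, waste of money", "very poor, really disappointed",
]
y_column = np.array([[1], [1], [1], [1], [0], [0], [0], [0]])  # shape (8, 1)
y = y_column.ravel()                                           # shape (8,)

# Each row holds per-word counts for one instance, stored as a sparse matrix.
counts = CountVectorizer().fit_transform(texts)

models = {
    "logistic regression": make_pipeline(
        CountVectorizer(), LogisticRegression(max_iter=1000)),
    "linear SVC": make_pipeline(CountVectorizer(), LinearSVC()),
}
for name, model in models.items():
    model.fit(texts, y)
    print(name, model.predict(["great quality, very happy"]))
```

With only eight training sentences the predictions mean little; the sketch is only meant to show how the vectorizer and the classifier are chained so that raw strings can be passed straight to fit and predict.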
So we'll discuss it later, and we'll see if we can improve it later as well. But now let's go see the results for the text data. With the CountVectorizer and logistic regression, we get about 88% accuracy, which is quite good. And with linear support vector classification, we get about 86, which is also quite good. And the differences between these two are definitely smaller than they were for the mobiles dataset. So this was a quick introduction to support vector machines, the linear one in this case, and the kind of data pre-processing you will need to do in order to be able to run machine learning models on text data. We'll go into more details in the future videos.
17. Non linear SVMs: In this video, we are going to talk about nonlinear support vector machines. So most of the code here is the same as in the previous video. So I'm going to skip over the data loading, the pre-processing, the train test split, the data analysis, and the pipeline setup. We are going to dive straight into the classification part. So first of all, we have a few more imports here. The most important one is this one: from sklearn.svm, we are going to import SVC, support vector classification. And this is the general class. LinearSVC is the specialized class that only deals with linear support vector classification. And this one is more general because it allows non-linear classification as well. So recall that the linear support vector machine tries to find a line that goes through the middle of the space between two classes, basically. And in the general case, where it doesn't have to be linear, that line does not have to be a straight line. It can go like this and so on. Okay, so here what we do is we simply add the support vector classifier for the mobile prices dataset and also for the Amazon reviews dataset here. But as you can see, we've added some other things as well. So let's talk about those. For mobile prices, we didn't really add anything else.
We have the StandardScaler followed by the learning model. So that's quite standard and there's nothing really new to talk about here. However, here we have a few things. So first of all, let's look at this LinearSVC. You can see we've added this MaxAbsScaler, which is another type of scaler that we haven't really seen so far. And why did we add this one? Because the other scalers we've used don't support sparse data with their default settings. The CountVectorizer basically returns a sparse matrix. So what's a sparse matrix, you might ask? Well, a sparse matrix is a matrix that contains a lot of 0 values. So recall that the CountVectorizer basically gives you a count for each token or word that appears in your dataset. And in order to do that, it must keep track of a lot of tokens or words. And that would lead to a very large matrix if we were to store it in the classical way that we store matrices, with rows and columns. Well, it's basically a grid. You'll have thousands, maybe tens of thousands of tokens. So the size of that matrix would be your number of instances times 10 thousand, or maybe even more. And that's huge. It would most likely not fit in memory; it would be practically useless. So what the CountVectorizer does is it basically stores tuples containing the row, the column, and the value. And that means that the value at that row and column in the sparse matrix is definitely not 0 and is equal to this. So it only stores this for non-zero values. And because the non-zero values will be quite few, since most of the words in the vocabulary will not appear in one particular sentence or data instance, this will be a much smaller matrix if we store it like this. And of course, technically it might not be exactly like this, but this is the reasoning behind it. However it actually is implemented in SciPy, it's based on this idea. But the problem with that is that not everything works with a sparse matrix. MaxAbsScaler does work. StandardScaler does not, at least not with its default settings.
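The row, column and value idea can be seen directly with SciPy's coordinate (COO) sparse format. This is only an illustration on a tiny hand-made matrix; CountVectorizer itself returns SciPy's CSR format, which compresses the row indices further but rests on the same reasoning.

```python
import numpy as np
from scipy import sparse

# A small "count matrix": 3 instances, 5 tokens, mostly zeros.
dense = np.array([
    [0, 0, 3, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 2],
])

coo = sparse.coo_matrix(dense)

# Only the non-zero entries are stored, as (row, column, value) triples.
triples = list(zip(coo.row.tolist(), coo.col.tolist(), coo.data.tolist()))
print(triples)
print("stored values:", coo.nnz, "out of", dense.size)
```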
For StandardScaler, we need to make that matrix a dense matrix. And to do that, we've written this DenseTransformer class that derives from TransformerMixin, which we imported above. And this is all you need to write it. It's basically calling the sparse matrix's todense method in the transform method. And you can find this in the scikit-learn documentation. But if we use this, what about what I told you before, that if we store it in the regular way, that is, as a dense matrix, it will not fit into memory? Well, that's still true. So what we did is we limited the number of features to 100. That's very important to do, because otherwise you will get a memory error. And we've done the same thing with support vector classification, but here we run into another issue. Unfortunately, for the general case of support vector classification, by default this uses the RBF kernel, which means it's a nonlinear classifier, but its complexity is on the order of the number of samples squared times the number of features. The number of features is already 100, and if you recall from above, we have over 10 thousand data instances as well. So this is huge. It will never finish unless we limit it somehow. And the way we limit it is by setting max_iter equals ten. But unfortunately that's very little. And if we look here, we get very poor results because of it. We get about 50% accuracy, compared to LinearSVC, for example, which already gets 70%. So that's quite bad. And unfortunately there's no way around it if your dataset is very, very large, like the one we have here. All you can do is either wait a very long time for this to finish, use another model, or reserve support vector classification for smaller datasets. The upside is that it usually learns better. It usually learns much better than a linear support vector classifier, but it's very, very slow. This is a trade-off that you must keep in mind.
18. Preprocessing for SVMs: In this video, we are going to highlight some important preprocessing key points for support vector machines.
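The two ways of handling the sparse count matrix from the previous video can be sketched as follows. This is not the course's exact class: it calls toarray() rather than todense(), on the assumption that a plain NumPy array is safer with current scikit-learn versions, and the tiny corpus is made up.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler, StandardScaler
from sklearn.svm import SVC, LinearSVC


class DenseTransformer(BaseEstimator, TransformerMixin):
    """Turn a SciPy sparse matrix into a regular (dense) NumPy array."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X.toarray()


texts = [
    "great product, works well", "love it, great quality",
    "excellent value, very happy",
    "terrible, broke after a day", "bad quality, do not buy",
    "awful product, waste of money",
]
y = [1, 1, 1, 0, 0, 0]

# Option 1: a scaler that accepts sparse input directly.
sparse_route = make_pipeline(CountVectorizer(), MaxAbsScaler(), LinearSVC())

# Option 2: cap the vocabulary, densify, then use StandardScaler.
dense_route = make_pipeline(
    CountVectorizer(max_features=100), DenseTransformer(),
    StandardScaler(), LinearSVC(),
)

# Non-linear SVC with a hard cap on iterations; expect a convergence warning.
rbf_capped = make_pipeline(
    CountVectorizer(max_features=100), DenseTransformer(),
    StandardScaler(), SVC(kernel="rbf", max_iter=10),
)

for model in (sparse_route, dense_route, rbf_capped):
    model.fit(texts, y)
```

The max_features cap is what keeps the dense matrix small, and the max_iter cap trades accuracy for a bounded running time, which is the trade-off discussed above.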
So first of all, as we have seen, support vector classification can be very sensitive to data pre-processing. We should always try to apply StandardScaler on the data, as it can noticeably improve results. However, some text pre-processors that handle turning the text into numbers give sparse matrices that don't work with the StandardScaler class, and some other classes as well. An example of this is the CountVectorizer. We must be careful to handle this either by using a scaler that works with sparse matrices or by turning the sparse matrix into a dense one. We must be careful, however, to make sure that the dense matrix will fit into memory, because we've discussed some issues with this to-dense transformation. Also, for text data, it's good to keep in mind that we can try various pre-processors too. I encourage you to read about the TF-IDF and word2vec preprocessors. You will find TF-IDF in scikit-learn if you search for it in the documentation on their website. But for word2vec you will have to find a library as well. It shouldn't be too difficult, however, and it's definitely very useful to read a bit about word2vec. As we've seen with the non-linear SVC class, care must also be taken about the computational complexity of the algorithms that we're using. Put simply, this is the number of operations that they execute. If this is too much, it will be very slow. And we don't really want to use very slow algorithms. Or if we do decide to use them, we should be aware that they are going to be very slow.
19. Quiz: You've had exercises so far in the previous modules, so I thought we'd do a quiz this time. Which of these best explains how support vector machines work for classification? A, find the line that separates the two classes that goes as much as possible through the middle of the space between them. B, find the formula that maximizes the number of correctly classified instances.
C, find the line that best separates the two classes. The correct answer here is A, find the line that separates the two classes that goes as much as possible through the middle of the space between them. This is the principle that support vector machines work on. Finding the formula that maximizes the number of correctly classified instances is kind of true for all classifiers. And so is finding the line that best separates the two classes. These answers are too general. A is much more particular to SVMs. Which of these are disadvantages of nonlinear support vector classifiers? A, they are much slower. B, they are more likely to overfit on simple data. Or C, they are harder to use due to their dependencies. The correct answers here are A and B. They are much slower, as we've seen in previous videos, where we had to limit the number of iterations in our nonlinear SVC because our dataset was too large and otherwise it would basically run forever. And B, they are more likely to overfit on simple data. This is true for most non-linear models, because if your data is simple and you use a model that is too complex, it's very likely for that model to overfit. They are not really harder to use due to their dependencies. There are no extra dependencies; as long as you have scikit-learn installed, you can use them. Why is preprocessing important for text data? A, to improve classification accuracy. B, to reduce the training time. C, to turn the text into numbers so that we can apply a classifier. The correct answer here is C, to turn the text into numbers so that we can apply a classifier. Most classifiers can't take text as input. This doesn't really reduce the training time, nor does it necessarily improve accuracy. It might improve accuracy, but that's not the main reason. The main reason is that we can't even run anything on the data unless that data consists of numbers. Which of these are possible disadvantages of the CountVectorizer class? A, it is very slow.
B, it returns a sparse matrix which some scikit-learn classes cannot work with. C, it uses a lot of memory. The correct answer here is B, it returns a sparse matrix which some scikit-learn classes cannot work with. It's not really slow, it's quite fast actually. And it doesn't use a lot of memory either, because it returns a sparse matrix. Of course, if we turn that sparse matrix into a dense matrix, that will use a lot of memory, but that's not something the CountVectorizer itself uses. So C is not a proper answer here. Congratulations if you got this right, and if not, I hope you learned something from the answers. And I encourage you to re-watch the videos for the questions that you got wrong.
20. Intro to Naive Bayes Classifiers: In this video, we are going to introduce naive Bayes classifiers. Naive Bayes classifiers are based on a statistical model, which makes them quite a bit math heavy and difficult to understand at first, even at an intuitive level. I will try to briefly explain the math behind them, but I will only scratch the surface, enough so that you get a really, really basic idea and that you have something to look up in various documentation if you are interested in more mathematical details than what I'm going to cover here. Don't worry about it though, because they will be just as easy to use in practice thanks to scikit-learn. So they use what's called Bayes' theorem, which is a way of manipulating equalities between probabilities. Or, putting it another way, it gives us nicer ways to write the formulas, nicer ways to write the probabilities that we are interested in, which allows them to be computed better and more easily. So here, Bayes' theorem says that the probability of some class k, so C_k represents class k, given a data instance x, is equal to the probability of that class. And of course this refers to our particular data. The probability of some class k will of course depend on the data we are working with.
And therefore, all of these probabilities here will depend on the data that we're working with. So this is multiplied by the probability of x given that class. So note that here we have the same thing as here, except in the opposite order. And all of this is divided by the probability of that data instance. Next, when it comes time to make a prediction, we are simply going to iterate over all classes k, and capital K here means that we have capital K classes. And argmax means that we take the k from this set here which maximizes this formula here. And that k will basically be our prediction for some x. So we take the probability of the class k that we're at at the moment in our iteration, and then the product of this. So what is this? Well, x_i is a feature of instance x, and we have capital F features. So basically here we iterate over the features and we take the probability of that feature given class k. And of course we need to initialize some things. The probability of C_k will be equal to the number of samples in class k in our dataset divided by the total number of samples. This is one way of doing it. Another way is simply one over K, if we assume that all the classes are equally likely, regardless of how many instances we currently have in each class in our training set, for example. And there are some more complex initializations that we can do, for example for P of x, or P of x given C_k here, or P of some feature given C_k. But those are a bit more math heavy and I don't want to go into them right now. But basically, this is a statistical model, a probability model. There's quite a bit of math involved, and it works surprisingly well in a lot of cases, especially with text data. As we'll see, it's very easy to use there and it works quite well. So I hope you have at least a basic idea of how this works.
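Written out, the formulas described above look roughly like this. The notation (C_k for class k, x_i for feature i, K classes, F features) follows the video, while N_k and N, the number of training samples in class k and in total, are names introduced here. The product over features comes from the "naive" assumption that the features are independent of each other given the class, and P(x) is dropped in the prediction rule because it is the same for every class.

```latex
% Bayes' theorem for class C_k and data instance x
P(C_k \mid x) = \frac{P(C_k)\, P(x \mid C_k)}{P(x)}

% Prediction: the class that maximizes prior times per-feature likelihoods
\hat{y} = \operatorname*{arg\,max}_{k \in \{1, \dots, K\}} \; P(C_k) \prod_{i=1}^{F} P(x_i \mid C_k)

% Two ways of setting the class prior
P(C_k) = \frac{N_k}{N}
\qquad \text{or} \qquad
P(C_k) = \frac{1}{K}
```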
Don't worry if you didn't understand most of the math; you don't really need it to move forward. But I do encourage you to try to read more about it and get a better understanding when you feel comfortable with doing that.
21. Comparison with SVMs on text data: In this video, we are going to compare naive Bayes with support vector machines on text data. We're going to be using the same Amazon reviews dataset. And we got rid of the other dataset, the mobiles dataset, because it's not text data and we want to keep this as simple as possible. So most of this code is the same. So we're going to scroll down until we get to the classification part. And here we imported MultinomialNB, multinomial naive Bayes, from sklearn.naive_bayes. There are multiple implementations of naive Bayes in scikit-learn, and we'll talk about those a bit later. Right now we are going to use this one, which is the most general one and the one that you will see used most often. Okay? So if we scroll down, you can see that we've added it here together with the logistic regression, the LinearSVC that follows a MaxAbsScaler, and the LinearSVC that follows a StandardScaler. And we're going to run this. I've already run it. And we'll see how it compares with the linear support vector classifier and the logistic regressor. Let's run it again so we can get an idea about the execution time as well. And until this executes, let me tell you a bit more about naive Bayes in general. So, this multinomial naive Bayes usually expects discrete data. Discrete data basically, and usually, means that you are dealing with integers. It's not always the case, but that's a good enough definition for now. You can read more in-depth things about it if you search for this term, discrete data. And the CountVectorizer gives us that, because it returns integers that represent counts of tokens. It can also work with floating point data, or continuous, or non-discrete data.
For example, if you use a TF-IDF transformer here, it can also work with that. It might work better, it might work worse, or it might be the same thing. But in general, you should use something like CountVectorizer, because Naive Bayes is made to work with that kind of data. So let's see if this finished. And it seems to have finished. It was quite fast for all of these. And we see an 85% accuracy for the multinomial Naive Bayes. We can also see its hyperparameters here. We have alpha, class_prior, and fit_prior equals True. Generally you want to leave these as the default, but if you are going to mess with something, I suggest you start with the alpha hyperparameter. And if we scroll up, we can see the StandardScaler and LinearSVC getting a considerably lower accuracy value. LinearSVC following the MaxAbsScaler gets a bit more, and logistic regression gets a bit more than that. So the logistic regression, the basic model, still wins here. And that's definitely interesting to see. But again, it depends on the dataset and the hyperparameters. These are just good models to have in mind to try when dealing with text data. You can see that they all execute quite fast even on this large dataset. So there's really no point in not trying all of them when trying to decide what works best for some text data. And it's the same with the pre-processors, such as CountVectorizer and TF-IDF.
22. Exercise: Here is our exercise for this video. So first of all, let's get rid of all but the best models here. So that is, get rid of these, right? So we're left with logistic regression, which gave the best result, and multinomial Naive Bayes, which gave a decent result but was worse than logistic regression. And what you should do here is look up other Naive Bayes models in scikit-learn, add them here, and compare the results with the multinomial Naive Bayes. And also try TF-IDF, which we've mentioned quite a few times in previous videos.
And I think now it's time to actually get hands-on and plug it into this code and see exactly how Naive Bayes behaves with it. And why not, let's see how logistic regression behaves with it as well.
23. Exercise solution: Here is the solution to the proposed exercise. So first of all, from sklearn.naive_bayes we import the other implementations of Naive Bayes in scikit-learn, and those are Gaussian Naive Bayes, Complement Naive Bayes, and Bernoulli Naive Bayes. I won't go into detail on exactly what the differences are between these. I will mention only a few, and you can definitely read more about the differences and when each one is best applicable in the scikit-learn documentation. So first of all, we have to keep this DenseTransformer, and we'll see why in a few seconds. And also, I forgot to mention that we have to import the TfidfVectorizer from sklearn.feature_extraction.text. After that it's simply a matter of calling the appropriate functions, that is, make pipeline with the proper models. And those are CountVectorizer with multinomial Naive Bayes, which we had before, and the same with TfidfVectorizer. Then logistic regression with the TfidfVectorizer, and Gaussian Naive Bayes with a CountVectorizer and with the TfidfVectorizer. And here we also notice a difference quite immediately, and that is that Gaussian Naive Bayes does not work with sparse matrices. If you got an error when attempting to use CountVectorizer directly followed by Gaussian Naive Bayes, you probably noticed this. That's why you got the error. And the way to fix this is to use the DenseTransformer class. In that case, we also need to limit the maximum number of features to something like 100, because if we don't, we will get an out-of-memory exception.
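A quick back-of-the-envelope calculation shows why converting to a dense matrix can run out of memory unless the number of features is capped. This sketch only does the arithmetic. The sample count and the uncapped vocabulary size are assumptions chosen for illustration, not measurements from the video.

```python
def dense_matrix_bytes(n_samples, n_features, bytes_per_value=8):
    """Memory needed for a dense n_samples x n_features matrix of 64-bit values."""
    return n_samples * n_features * bytes_per_value


N_SAMPLES = 70_000           # assumption: rough size of the training set
HYPOTHETICAL_VOCAB = 50_000  # assumption: made-up vocabulary size without a cap

for n_features in (100, HYPOTHETICAL_VOCAB):
    gigabytes = dense_matrix_bytes(N_SAMPLES, n_features) / 1e9
    print(f"{n_features:>6} features -> {gigabytes:.3f} GB as a dense matrix")
```

Under these assumptions, capping the vocabulary at 100 features needs about 56 MB, while an uncapped vocabulary of 50,000 tokens would need about 28 GB. A sparse matrix avoids this because it only stores the non-zero counts.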
After that we have Complement Naive Bayes, which works fine directly following CountVectorizer and TfidfVectorizer, which return sparse matrices. So Complement Naive Bayes is able to handle sparse matrices. And finally, we have Bernoulli Naive Bayes, which we only used with CountVectorizer. And we pass in binary equals True to the CountVectorizer, because Bernoulli Naive Bayes requires binary features. This will not count the occurrences of tokens or words. Instead, it will be a one where that word appears and a zero if it doesn't. And let's take a look at the results quickly. For logistic regression with CountVectorizer we get 88.4% accuracy. And replacing that with a TfidfVectorizer, we do a little bit better with 89.2%. Let's see, multinomial Naive Bayes with CountVectorizer: that's eighty-five percent, something we kind of expected from the previous videos. With TfidfVectorizer, it's about the same, so no real improvement here. CountVectorizer with the Gaussian Naive Bayes is much worse, unfortunately, only 67% or so. And switching to TF-IDF improves that somewhat, but not significantly. Complement Naive Bayes is comparable with multinomial Naive Bayes at eighty-five percent with the CountVectorizer, and also eighty-five percent with the TfidfVectorizer. So it doesn't seem that TfidfVectorizer is doing something better, at least for this problem and at least with the default settings. Because take a look here. These are all the hyperparameters for TfidfVectorizer. It has something like 15 hyperparameters that we can tune. And if you want to play around with this, the most important ones are the ngram_range, the preprocessor, smooth_idf, stop_words, and sublinear_tf. And this last one is quite important, because a lot of times I've noticed in my experience that setting it to True can improve results at least slightly. So that's something to take into consideration.
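To see what the three kinds of feature values mentioned above look like side by side, here is a small sketch in plain Python: raw token counts, the binary presence flags that binary equals True asks for, and the 1 + ln(tf) scaling that sublinear_tf equals True applies to the term frequency. The helper names and the example review are made up, and splitting on whitespace is a simplification of the real tokenization. The IDF weighting and normalization that TfidfVectorizer also performs are left out.

```python
import math
from collections import Counter


def count_features(text):
    """Token counts, the kind of integer features a count vectorizer produces."""
    return Counter(text.lower().split())


def binary_features(counts):
    """1 for every token that appears at all, regardless of how often."""
    return {token: 1 for token in counts}


def sublinear_tf(counts):
    """Replace each term frequency tf with 1 + ln(tf)."""
    return {token: 1 + math.log(tf) for token, tf in counts.items()}


review = "good good good phone bad battery"  # made-up example text
counts = count_features(review)
print(dict(counts))
print(binary_features(counts))
print(sublinear_tf(counts))
```

With sublinear scaling, a word that appears three times counts for roughly 2.1 instead of 3, so heavily repeated words dominate a review less.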
The tokenizer might also be worth playing around with, and use_idf as well. Okay, let's see Bernoulli. Bernoulli does surprisingly well, I feel, considering that it works with binary data. We get 84%, so that's pretty good. Anyway, congratulations if you came up with something similar to this. It was basically an exercise in reading the documentation, because the code, as you've seen, is not too involved.
24. Introduction to DT Classification: In this video, we are going to introduce DT, or decision tree, classification. Decision trees are a type of model that looks kind of like this. And don't be worried if you think this is a bit overwhelming, that there's too much information here. I'll explain the most important things and we'll go over the others in future videos. Ok, so this decision tree lets us make a prediction, or a classification, for some data instance by going over its nodes and its edges. This particular decision tree is built on the iris dataset, and this figure is taken from scikit-learn's documentation, so thanks to them. You can read more about decision trees in general in scikit-learn's documentation, but here's the gist of it. So I mentioned that this is done on the iris dataset, which contains some information about plants. And we have this line here in each node, and that's a rule that the decision tree checks for. It asks if the petal length feature in centimeters is less than or equal to 2.45. Well, if that's true, we should take the left edge here, so the left arrow, the arrow that's labeled True here. Okay, and if we go there, we see it doesn't check any more features. We just have these values here, and we have this important value here, class equals setosa. This means that this is a terminal node. And whatever the class is in this node, that is going to be our prediction.
So basically, if the petal length in centimeters is lower than or equal to 2.45, then this decision tree will always predict the class as being setosa. If that's false, and the petal length in centimeters is not less than or equal to 2.45, then we take this path here, the False edge here. And here we don't have a terminal node, because we have another expression. We have the expression petal width in centimeters lower than or equal to 1.75. Well again, if that's true, we go towards the left and down, so we would consider this subtree here. And if that's false, we would consider the right subtree here. So let's assume it's false and we go over this branch here. Well, then we would check if the petal length in centimeters is lower than or equal to 4.85. So notice that this is a similar expression to the one at the top, but instead of 2.45, we get 4.85. Alright, and so on, until we reach some terminal node or until we decide we no longer want to continue, in which case we would take the prediction from the node we stop at. So class equals virginica here. And that's mostly it about decision trees and how classification would work. And there are a few things that we can get out of this whole thing. So why would we want to use decision trees? Well, first of all, it's because it's easy to learn something from them as a human, as a person. With the models we've seen so far, such as logistic regression, support vector machines, and even Naive Bayes, it's hard to explain why they make a certain prediction, right? So, yeah, they learn to draw those lines that can be straight or non-straight and so on. But it's hard to take a look at the model, and maybe even some of its predictions, and say, oh, we predicted that because the petal width is lower than or equal to three, or something like that. We will not be able to come up with things like that. So while those models can help us do our job better, they will not help us understand things better ourselves in general.
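The walk through the iris tree described above can be written down as a short sketch in plain Python. Only the three checks mentioned in the video (petal length at 2.45, petal width at 1.75, petal length at 4.85) and the setosa and virginica leaves come from the transcript. The Node class, the function name, the majority labels stored on inner nodes, and the collapsed branches that the video does not walk through are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    label: str                      # class predicted if we stop at this node
    feature: Optional[str] = None   # feature checked here (None for a leaf)
    threshold: float = 0.0
    left: Optional["Node"] = None   # followed when feature <= threshold (True edge)
    right: Optional["Node"] = None  # followed otherwise (False edge)


TREE = Node(
    label="setosa", feature="petal_length", threshold=2.45,  # root label: assumption
    left=Node(label="setosa"),
    right=Node(
        label="versicolor", feature="petal_width", threshold=1.75,  # label: assumption
        left=Node(label="versicolor"),  # assumption: this subtree is collapsed to a leaf
        right=Node(
            label="virginica", feature="petal_length", threshold=4.85,
            left=Node(label="virginica"),  # assumption: collapsed to a leaf
            right=Node(label="virginica"),
        ),
    ),
)


def predict(node, sample, max_depth=None):
    """Follow True/False edges until a leaf or until max_depth checks were made."""
    path = []
    depth = 0
    while node.feature is not None and (max_depth is None or depth < max_depth):
        go_left = sample[node.feature] <= node.threshold
        path.append(f"{node.feature} <= {node.threshold}: {go_left}")
        node = node.left if go_left else node.right
        depth += 1
    return node.label, path


label, path = predict(TREE, {"petal_length": 5.5, "petal_width": 2.1})
print(label)
for step in path:
    print("  ", step)
```

The returned path is the part that makes the model explainable: it lists exactly which conditions were checked and how each one turned out.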
Decision trees, however, will help us do that. Because by learning these rules, well, these are rules that we as humans can understand, that we can work with. Say this one, right? So if the petal length is lower than or equal to 2.45, then it's always going to be setosa. And let's say this gets something very good, like over 95% accuracy. Well, it might be that we didn't know that this is true in 95% of cases or even more. So this could be something that we learn from seeing this decision tree. Also, they work with quite a lot of data types, and they are not very particular about their data being scaled and so on. And they usually perform well in most problems. So you should be using decision trees if you care about explaining the learning process, about being able to explain it to another human being, and if you don't want to worry too much about processing numerical data in some fancy ways in order to try to improve accuracy. And of course, the first reason is quite important. For example, consider a decision tree that decides whether or not to give a loan to someone. Maybe you work for a bank and you need to implement something like that for the bank. Well, it's very important to explain to people, maybe to that person, to management and so on, why you made that decision. Why did you choose to give a loan to this person? Or why did you decide to reject the loan for this person? Decision trees will help you do stuff like that. So that's when you should consider them.
25. Comparison with the other models: In this video, we are going to compare decision tree classifiers with the other models. So we are going to start off with the same code, which handles the Amazon reviews dataset we talked about in the previous section as well. We have the same cells as before and the same pipeline setup, although we will not be using this function for these comparisons.
And if we scroll down here, we've imported the DecisionTreeClassifier from sklearn.tree. And similarly, if we wanted to do regression for some reason, we would import DecisionTreeRegressor. And as a quick tip, if you write the first few characters of some classifier, let's say you don't remember the entire name, and then press Tab, you will get autocomplete suggestions, which you can then select by clicking on them or by using the arrow keys. Anyway, we are not going to be doing regression here, so let's get rid of this one. We have the same evaluate function as before. And I've added some variations here for pipelines involving the DecisionTreeClassifier class. So we've added the CountVectorizer as the first thing, with the default parameters. You must add something, because the decision tree classifier will not work with the strings that we have in this dataset. We must still process them somehow. We've added the CountVectorizer with a maximum of 100 features as another variation, followed by the decision tree classifier with the default hyperparameters again. Then we've added the CountVectorizer with the default parameters, followed by the decision tree classifier with a max depth of ten. And what this will usually do is prevent overfitting. If you don't pass this in, the tree will be very deep. We will see that in a future video, where we will visualize the trees. We'll see exactly what this means and why it can lead to overfitting, and we don't want that. So usually passing in a max depth is a good idea. Then we've added a combination of the above, where the max features of the CountVectorizer is 100 and the max depth of the decision tree classifier is ten. And then we've also added a variation using a TfidfVectorizer instead of a CountVectorizer, and we've limited the decision tree classifier's depth to ten as well here. So let's go down. I've already run this, so let's see how the decision tree classifier compares.
So the first one gives us an accuracy of about 75%. And if you recall, the best that we've got was with the logistic regression, where we got about 88 to 89%. So this is worse so far. Let's see how the others did. Okay, here, if I recall correctly, yes, with a max features of 100 in the CountVectorizer, we have worse results, 63%. If we limit the depth, we get 72%, so close to the first one. And if we do both, so depth limited and max features limited, we get about 69%. So this is also worse. And for TF-IDF, this gets a bit better, to 72%. So this doesn't manage to beat the logistic regression. So for this problem, at least, it seems that the decision tree by itself does not do a very good job, at least with the hyperparameters that we have tried. I suggest you try other hyperparameters as well. Look up the class on scikit-learn's documentation page, on the scikit-learn website, see what other things you can tweak, and simply try them all. There aren't that many, and most of them are categorical settings. There aren't any continuous values, such as the C in logistic regression, that can take an infinite number of values. So it's possible to simply list them all. You will probably have something like 20 combinations that make sense, so not that many. Try out about 20 or 30 combinations of settings for the decision tree classifier and see what the best results that you can get are. And in the future videos we're going to talk about each hyperparameter in turn, and also visualize decision trees to get a better feel for how they look and the kind of problems that they are good at.
26. Visualizing Decision Trees: In this video, we are going to talk about visualizing decision trees. So I've gone straight to the part where the code changes a bit. We instantiate a decision tree classifier here, and I've set the max depth to 20 and the max features to auto.
And setting max features to auto here will cause the decision tree classifier to only use the square root of the number of features. I've done this for practical reasons in this case, because otherwise it will generate a very large tree, which will be difficult to visualize. So I have done this in order to make it easier for our visualization purposes and for the code to execute faster. You can play around with not setting this one, but I highly suggest setting a max depth. Otherwise it will take forever to generate the image of the tree, and it will also be difficult to open that image, take a look at it, and understand stuff from it. So at least set max depth to at most 20. And we pass that to our make pipeline function and evaluate it. We get some result. We don't really care about the result here. It's worse than before, but we are interested in visualizing the tree, so let's focus on that. In order to visualize it, you have to install Graphviz, which you can simply pip install, and also pydot, which you should also be able to pip install. After that, from sklearn.tree you have to import export_graphviz and export_text. We're not going to cover export_text, but basically what it does is display the tree in textual format. You can learn more about it by watching our regression course, where we cover this as well. So after that, we call export_graphviz with the dt parameter, which, recall, is our model, which we also passed to evaluate and which we instantiated as a decision tree classifier here. So it's basically our decision tree. And the rest you can just pass in like here. It's simply some stuff to make it look decent. We must save this to a dot data variable, from which we will use pydot to create a graph and save that graph to a PNG file. And after you run this, this PNG file will be saved in the current directory. So you can either open it from there or use this code below to display it in your Jupyter notebook.
And if it displays quite small like this, you can simply double-click on it and it will enlarge. So you will get something like this. Okay, so let's go over what this means. And let's try to find the root of the tree. It can take a while. You can see that even with limiting the max depth and the number of features, it's still a very large tree. Ok, so this looks to be the root, because there are no arrows going above this. So this node seems to be the root. And here is how you read it. So the first line here is the check that we are doing. So if this feature, 13,256, has a value lower than or equal to 0.5, then you would move slightly to the left. It's a bit difficult to scroll here, because you can see it doesn't work with the mouse. It might work for you, but if it doesn't, simply scroll down and use the scroll bar. And see, you have a True here. So if that condition is true, you follow the left arrow, and if it's false, you would follow the right arrow. And here you get to another node and proceed similarly. So let's see what the values mean here. So gini is some internal statistic that the decision tree algorithm uses, which is a bit more mathematically heavy, and I don't want to get into it too much. You can read about it more on scikit-learn's page, but it's not very important for us here. What is important is the samples and value metrics here. So samples tells us how many samples are under this node, so how many samples respect the condition that leads to this node. And in this case it's 69,762. And if we scroll to the right, let's see if that's enough. Okay, a bit more. So recall that at the root node, 70 thousand samples reach the root node. So basically this is the entire dataset. So here we have 238. And recall that to the left we had 69,762. So if we add 69,762 to the 238 which we had to the right, we would get the 70 thousand in the root node.
So that means that 69,762 samples are respecting the root node's condition, which is whatever feature was there, x something, I forgot what it was, lower than or equal to 0.5. And the rest, the 238, don't respect that. So that means that the root node's condition does not really cut the dataset in half. Ideally it should do that, but in our case it doesn't. Ok. And what the value metric here tells us is how many of these 69,762 are in each class, so in class 0 and class one. Here you can see that this is quite well balanced, right? So we have about half in each class. So this condition is quite good, because it cuts the dataset kind of in half, let's say. And again, we can follow the arrows here for proceeding down the decision tree. And we can do the same with all nodes except these leaf nodes here. Okay? So the leaf nodes will always have few samples, and they will basically give us a prediction. Once we reach a leaf node, we can make a prediction. So that's about it with visualizing decision trees. This is how they work. And the great advantage is that they are able to explain why a decision was made, right? So a tree can tell us why we decided to predict class 0, or why we decided to predict class one. It's very easy to read these nodes and tell a client, for example, if this feature is lower than that, and then if this feature is higher than that, then this is the predicted class. And you can't really do that with logistic regression, Naive Bayes and so on. Those are much more abstract, much more black-box like. And also, of course, now you should be able to understand why overfitting occurs if the tree is very large. Because you keep splitting, you basically memorize the data. You will have something like a node for almost every data instance, so your tree will not be able to generalize. It has just memorized your data, basically, and it makes predictions according to that. But you want a small tree, ideally, which can generalize well to any data instance that comes along.
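The video skips over what the gini line in each node means, so here is a minimal sketch, in plain Python, of how Gini impurity can be computed from the per-class counts shown in a node's value line, together with the sample bookkeeping described above. The function name and the class counts passed to it are made up for illustration. Only the three sample totals come from the tree in the video.

```python
def gini(value):
    """Gini impurity of a node, given its per-class sample counts."""
    total = sum(value)
    return 1.0 - sum((count / total) ** 2 for count in value)


# Samples reported in the video: the two children of the root add up to the root.
root_samples = 70_000
left_samples, right_samples = 69_762, 238
print(left_samples + right_samples == root_samples)

# Hypothetical class counts, chosen only to show the range of the measure.
print(gini([50, 50]))   # two classes, perfectly balanced: highest impurity
print(gini([100, 0]))   # only one class present: a pure node
```

For two classes the impurity ranges from 0 for a pure node to 0.5 for a perfectly balanced one, so a value line with about half the samples in each class goes together with a gini close to 0.5.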
So let's see what happens if we set max depth equals five here. We run this, and we run the visualization again. Ok, now you can see that the tree is much smaller. You almost don't even need to double-click to enlarge it. And it's also going to be much easier to read because of this. And you can see that scrolling is also much more likely to work, right?
27. Hyperparameter Analysis: In this video, we are going to discuss some of the hyperparameters that affect the classification accuracy of the decision tree classifier. And in my opinion, the most important three are the criterion, the max depth, and the max features hyperparameters. The criterion refers to the statistical measure used internally by the classification algorithm in order to build the decision tree. And this can be gini, for an impurity measure, and entropy, for an information gain measure. I won't get into detail about what these mean exactly. These are the only two choices, so it's usually not at all costly to simply try both of them and see what happens. Next, we have the max depth, which should generally not be left unlimited, but should be limited to something like five or ten or 20 in order to reduce the chances of overfitting. And then we have the max features, for which we have two choices. Either auto, which is the same as sqrt, so this will use the square root of the number of features, or log2, which will use log base two of the number of features. Again, these are something where you should always try both, unless it takes a very, very long time to train your model, in which case you should maybe pick one. And the only one you should really try multiple values for is the max depth, which here I've set to 20, and to unlimited for the last two. So what I've done here is alternate between gini and entropy, also alternate between auto, or sqrt, and log2, and between max depth 20, which I have for all of these, and unlimited here.
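The combinations described above can be listed with a few lines of plain Python. This sketch only enumerates the settings and does not train anything. The feature count used at the end is a made-up number, and the rounding in features_considered (truncate, but use at least one feature) is my understanding of how scikit-learn turns sqrt and log2 into a feature count, so treat it as an assumption.

```python
import math
from itertools import product

criteria = ["gini", "entropy"]
max_features_options = ["sqrt", "log2"]  # the video treats "auto" as the same as sqrt
max_depths = [20, None]                  # None means the depth is not limited

# Every combination of the three hyperparameters discussed in the video.
grid = [
    {"criterion": c, "max_features": f, "max_depth": d}
    for c, f, d in product(criteria, max_features_options, max_depths)
]


def features_considered(option, n_features):
    """How many features each split looks at for a given max_features option."""
    if option == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if option == "log2":
        return max(1, int(math.log2(n_features)))
    raise ValueError(f"unknown option: {option}")


for params in grid:
    print(params)

N_FEATURES = 10_000  # hypothetical vocabulary size
for option in max_features_options:
    print(option, "->", features_considered(option, N_FEATURES), "features per split")
```

Each dictionary in grid could be unpacked into a classifier constructor inside a loop, which keeps the number of runs small enough to simply try them all.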
And surprisingly, contrary to what I've told you, it seems that not limiting the max depth here gives us the best result. But this is not a very sweet victory. It's kind of a bitter victory, because it still doesn't get us good results for this dataset. The conclusion here would be that decision trees are not the best choice for this dataset. We're not going to be spending more time trying to optimize things here. We'll see in the next chapter a better alternative in general to decision trees, and hopefully those will provide better results. But what you should take from this is that you should pay attention to these three hyperparameters when dealing with decision tree classifiers and also regressors.
28. Quiz: Here is your decision trees quiz. Which of these is an advantage of decision trees? A, they train very fast. B, they always give the best accuracy. Or C, they provide a model that is easier for people to understand. The correct answer is C, they provide a model that is easier for people to understand. While they do train fast in general, this is not really an advantage, as other models also train very fast. We've seen that for logistic regression, for Naive Bayes and so on. In fact, most of the models we've seen so far train quite fast, so this isn't really an advantage. And they don't always give the best accuracy. In fact, in our case, we actually got quite bad accuracy. Which of these are likely pitfalls of decision trees? A, they can overfit in some cases, for example, if the max depth is not limited. B, they take a long time to visualize. C, they cannot be used for regression. The correct answer is A, they can overfit in some cases, for example, if the max depth is not limited. We discussed this in the videos of this chapter. And while they do take a long time to visualize in some cases, for example, when the max depth is not limited, this isn't really a pitfall, because the visualization does not affect the accuracy, and we judge models by their performance.
So by accuracy, or by mean absolute error, by some metric like that. Visualization is just something that is nice to have, but not something that we judge classifiers, or machine learning models in general, by. And they can be used for regression. We discussed that in this section as well, and we also have a chapter in the previous course that discusses regression with decision trees in detail. How do we visualize a decision tree? By using Graphviz, by using PNG files, or only by displaying the tree in a text format? The correct answer is by using Graphviz. We've shown this in our video dedicated to this in this chapter. And while we did use PNG files, that isn't really the method or the library that we use. That is just a file format that we save the decision tree image as. And of course we mentioned displaying it in a text format, but that is not the only way to do it. Do you think preprocessing is important for decision trees? Yes, because the building algorithm uses various arithmetic operations which can spiral out of control if the data is not normalized? No, because it uses conditional expressions on feature values, which aren't affected by the values' order of magnitude? Yes, because we didn't do any preprocessing and we got bad results? The correct answer is no, because it uses conditional expressions on feature values, which aren't affected by the values' order of magnitude. It doesn't use arithmetic operations such as matrix multiplications. So there is nothing in it to cause very large values or very small values that can spiral out of control if not normalized. And indeed, we didn't do any preprocessing, but that is not the reason that we got bad results. We got bad results because the decision tree simply isn't a good fit for our dataset.
29. Introduction to RF Classification: In this video, we are going to introduce random forest classification.
In this chapter, we are going to work with the same dataset, and we are going to use the same code setup, into which we will plug a random forest. So random forests are basically collections of decision trees, right? So if you recall from the previous section, this was the decision tree that we introduced decision tree classifiers with in the first video. And a random forest would basically have multiple such trees. Of course they would have different conditions and different nodes, but they would look similar. And that's all there is to random forests. The idea is that by having multiple decision trees in them, they will get better classification accuracy and the chances of overfitting will be lower as well, while keeping the advantages of decision trees, in particular explainability, which is the most important advantage in general.
30. Comparison between DT and RF: In this video, we are going to compare decision trees and random forests for classification. We are going to use the same Amazon reviews dataset as in the previous section, and most of this code is the same. The only change is here, where we import the RandomForestClassifier class from sklearn.ensemble. So why do I import it from sklearn.ensemble? Why not from something like sklearn.tree or something like sklearn.forest, since we said it's basically a collection of decision trees? Well, when we have a model that combines multiple models, that model is called an ensemble. So that is why we import it from sklearn.ensemble. And we will see some other things in this module as well later on. Okay, so if we scroll down, we've added a random forest classifier here with the default parameters. Basically, I did specify n_estimators equals 100 here, because depending on your scikit-learn version, 100 might not be the default value. It might be ten for some older versions.
So I just specified it so it's clear, but you can basically take it as this being the default instantiation of a random forest classifier. And what this tells us is that this model will use 100 decision trees. It will generate these 100 decision trees on the dataset which we will fit it on, so on our Amazon reviews dataset. And it will result in 100 distinct decision trees, which should lead to better classification accuracy and fewer chances of overfitting. And here, I also specify a max depth of 20. So you have access to all of the hyperparameters that you had access to for decision tree classifiers. And in this case, we are telling the random forest not to create decision trees with a depth larger than 20. So let's see how this compares to the best variants of decision tree classifiers that we identified in the previous section. And don't worry if this takes quite a bit of time to run. It can take maybe five to 15 minutes depending on your CPU. It's normal, because we have a lot of estimators here and also here. Okay, and if we scroll down to the results, we can see that the random forest classifier with no depth limit gives us an accuracy of about 85, almost 86%. And the one with depth limited to 20 gives us 80%. Both of these are much better than the around 65% we get from decision trees. And also, if you recall, the logistic regression model gave us about 88 to 89%. So it's quite likely that with a few more tweaks, we could get this random forest classifier to reach that accuracy of about 88 to 89%. Feel free to try it out. And in general, ensemble methods like this random forest classifier, and random forests in general, give very good results on multiple problems and multiple datasets. So they are always a good thing to try out if we can afford to do it computationally and in terms of implementation and training time.
31. Visualizing Random Forests: In this video, we are going to talk about visualizing random forests.
The visualization algorithm is the same as for the decision trees, which we've covered in the previous section. The only difference is that here we have multiple decision trees. So what I've done is I instantiated the random forest classifier here and passed it in here for getting the results, like in the previous video. And I've also limited the depth to five to make visualization easier, but you don't necessarily have to do that. And because we have multiple decision trees here, we have to plot them separately. I've only done this example for one of them. You can access the individual decision trees with this estimators_ attribute of the random forest classifier instance. This one is a collection of the decision trees, and this code basically plots the first one. You can put this in a for loop that plots all of them and maybe saves them to different PNG files. And after that, it's simply a matter of displaying the images. And in this case, the first one looks like this. And recall that we've limited the depth to five. That's why this tree is so shallow. And that's all for visualizing random forests. It's very, very similar to visualizing decision trees, since random forests consist of multiple decision trees.
32. Hyperparameter Analysis: In this video, we are going to talk about the hyperparameters of a random forest classifier. And we've said that most of them are the same as for decision trees, because a random forest consists of multiple decision trees. So the hyperparameters you pass to a random forest will be passed over to the individual decision trees. The only new one, or the most important new one actually, because there are a couple more new ones, is this n_estimators one, which controls the number of trees in the forest. And let's read here, on scikit-learn's documentation page, the description of a random forest.
It says that it fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the accuracy and control overfitting. The sub-sample size is controlled with the max_samples parameter, so this is also a new parameter, if bootstrap is True, which is the default; otherwise the whole dataset is used to build each tree. So bootstrap is True by default. Let's see what the default value of max_samples is. In order to get that, we have to look it up here. So let's see: min samples, max features, min impurity decrease, max_samples equals None. So it's the last parameter and its default value is None. That doesn't help us much, so let's scroll down and look it up here. Okay, so the default is None, and in that case it draws X.shape[0] samples. So in this case each tree draws as many samples as there are rows in the dataset. If we pass in an int, it will draw that many samples. And if we pass in a float, it will be treated as a fraction between 0 and 1. So basically, if you want 50%, you have to pass in 0.5. This is something good to play with, because it can also control overfitting and prediction accuracy. So you might want to try out various values for this max_samples parameter of a random forest. And that's basically it. You have a few more; for example, you have this oob_score parameter that you can play around with, but it's not that useful in general, and usually leaving it at the default value will do just fine. In general, you will want to use random forests and not decision trees. Because they use multiple decision trees, they are usually going to be better than an individual decision tree.

33. Quiz: It's time for the random forest quiz. Which of these is an advantage of random forests over decision trees? A: They train faster. B: Having multiple decision trees usually leads to better accuracy and less chance of overfitting. C: They provide a model that is easier for people to understand.
The correct answer is B: having multiple decision trees usually leads to better accuracy and less chance of overfitting. They don't really train faster. In fact, having multiple decision trees usually causes them to train slower. And they don't provide a model that is easier for people to understand. It's just as easy as a decision tree, or if anything a little less easy because there are multiple decision trees involved, but otherwise it's as simple as reading a decision tree. So this isn't really an advantage. It's mostly the same. Which of these are hyperparameters that random forests have and decision trees do not? A: n_estimators. B: max_depth. C: criterion. The correct answer is n_estimators. This controls how many decision trees the random forest should create. max_depth and criterion are available in decision trees as well, because they refer to the maximum depth of a particular decision tree and the criterion that is used in building the tree. Passing these to a random forest will cause them to be passed on to the individual decision trees making up the forest. Bottom line, B and C are both available in decision trees and random forests, so they are not specific to random forests. How do I visualize a random forest? A: By using Graphviz. B: By iterating every decision tree in it and visualizing it. C: Since there are multiple trees involved, we cannot do this. The correct answers are A and B. We have to use Graphviz, just as we had to do when visualizing a decision tree, and we have to iterate every decision tree and visualize it if we want to visualize the entire forest. So this is possible; there is no reason why it shouldn't be possible. So C is not correct. Is a random forest better than a decision tree? A: No, having multiple trees only complicates things unnecessarily. B: Yes, because we obtained better results on our dataset than we did with the decision tree. C: Usually yes, they have been found to perform better in general.
But there could be some rare cases where a decision tree will outperform a random forest. The correct answer is C. They have been shown to perform better in general, but of course there can always be cases where a decision tree will outperform a random forest. While we did obtain better results on our dataset, this isn't a good explanation, so we don't consider B to be a good answer. It could be that we were lucky with those results, or that there is something specific about our dataset that led to those results. With a bit of research, we can see that that's not the case here, but still this line of reasoning is not correct. So we don't consider B to be a correct answer. And of course, we've discussed quite a bit that having multiple trees may complicate things, but not unnecessarily; there is definitely an advantage to doing it that way. So A is also incorrect.

34. Introduction to MLP Classification: In this video, we are going to introduce multilayer perceptron classification. You can think of multilayer perceptrons as a generalization of the logistic regression model we talked about in the beginning. So let's discuss it by explaining this figure from the scikit-learn documentation. First of all, we have the features going into the network. Multilayer perceptrons are also called artificial neural networks. You will hear both terms, so that's why I'm referring to it as a network. So x1, x2, x3, and so on, up to xn, are the input features that go into the network. They are basically the inputs, right? We'll only discuss one data instance going in at a time to make things easier, but in the actual implementation this is handled efficiently for multiple instances going in at once as well. All of these input feature values go into what are called the hidden layers. Here we have just one hidden layer, represented by these a1, a2, and so on up until ak. And these nodes here are called neurons.
So we have k neurons in this only hidden layer of this network. For now we're just going to ignore these biases; we'll discuss them at the end. What happens, as you can see here by the arrows, is that each input feature value goes into all of these neurons. Internally, each neuron is basically a logistic regressor. It has a weight vector that is multiplied with all of these features going in, and that multiplication is a pointwise multiplication whose results are summed. So for a1, it's something like w1 times x1 plus w2 times x2 plus w3 times x3, and so on, plus wn times xn. And a2 will have another such w vector. So the result of that pointwise multiplication is a value, and that value is what is outputted by the neuron. In this case, it goes to the output neuron, which applies some function to all of its inputs. Of course, this can be made more complex. For example, we can have another function here at the output, and so on. But we don't have to go into these details right now. The basic idea is this: by having multiple neurons here, each with its own weight vector, we get a much stronger model. We are able to model nonlinear relationships between these features, and we are thus able to model our dataset much better. But we are also able to overfit much more easily, so certain steps will have to be taken to prevent that. Now what are these biases? These bias nodes simply output a one, a plus one, and in practice this helps with prediction accuracy or regression quality. It's mostly an experimental thing that has been proven to work well, but it also has some theoretical merits, which we're not going to get into right now. So that's all there is to multilayer perceptrons, or artificial neural networks. Training them can also take a long time, because it's a complex model with a lot of weights that need to be tuned.
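The weighted sum described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the weights, the bias values, and the function names (neuron_output, mlp_forward) are made up for the example, and a logistic function is applied at every neuron because the lecture compares each neuron to a logistic regressor. It is not how scikit-learn implements its network internally.

```python
import math


def logistic(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(weights, features, bias=0.0):
    """Weighted sum w1*x1 + w2*x2 + ... + wn*xn, plus a bias term."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def mlp_forward(features, hidden_weights, hidden_biases, output_weights, output_bias):
    """One hidden layer: every feature feeds every hidden neuron,
    and every hidden neuron feeds the single output neuron."""
    hidden = [
        logistic(neuron_output(w, features, b))
        for w, b in zip(hidden_weights, hidden_biases)
    ]
    return logistic(neuron_output(output_weights, hidden, output_bias))


# Hypothetical numbers: 3 input features, 2 hidden neurons (a1 and a2).
x = [1.0, 2.0, 3.0]
hidden_weights = [[0.1, 0.2, 0.3], [-0.3, 0.1, 0.0]]  # one weight vector per neuron
hidden_biases = [0.0, 0.0]
output_weights = [1.0, -1.0]

a1_sum = neuron_output(hidden_weights[0], x)  # 0.1*1 + 0.2*2 + 0.3*3
probability = mlp_forward(x, hidden_weights, hidden_biases, output_weights, 0.0)
print(a1_sum, probability)
```

Each hidden neuron has its own weight vector, which is why adding neurons adds capacity, and also why there are many more weights to tune than in a single logistic regression.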
The training algorithms that are quite fast for logistic regression or linear regression can be applied to this as well, but they will usually take longer. The quality of the training can also vary a lot. And as we will see, there are many hyperparameters that can be tuned for such a network. In fact, there are probably more for artificial neural networks than for any of the other models we've seen. That again poses various challenges. But hopefully after watching this section, you will gain at least a basic understanding of how these work and when and why you should be using them.

35. Comparison to the DT and RF models: In this video, we are going to compare our multilayer perceptron with the decision tree and random forest classifiers that we've seen in the previous sections. So first of all, we have to import the MLPClassifier class from the sklearn.neural_network module. After that, it's simply a matter of plugging it into our evaluation pipeline. So first of all, we instantiate a CountVectorizer, we use a MaxAbsScaler, and we instantiate the multilayer perceptron classifier, passing in the maximum number of iterations and verbose=True so that we get some feedback while the neural network is training. There are a few things to talk about here before we get into the results. Let's start off with this scaler, which could also be a MinMaxScaler. But if you use a MinMaxScaler, make sure that your data is not sparse, as it will not work otherwise and you will get an error. You can fix that using our dense transformer class from one of our previous videos, but I didn't want to complicate things too much. So I just used this scaler, even though it might not be the best one. It's very important to have some kind of scaler for neural networks, because they don't like data that is not scaled. So you almost always want to have something like this. And the second thing is this max_iter hyperparameter.
And we set this much lower than the default. The default is something like one hundred or two hundred. I forgot exactly what it is, but you can look it up in the scikit-learn documentation. The reason I set it much lower is because it takes a long time to train this neural network. Neural networks train best on GPUs, that is, graphics cards or video cards, but scikit-learn is not that advanced in this regard. In order to get a neural network to train on a GPU, you will have to use something like TensorFlow or PyTorch. But of course we're dealing with scikit-learn here, so in order to speed things up a bit, I simply reduced the number of iterations. Generally, the more iterations you have, the better the model will perform, so the better the classification accuracy in our case. But that's not always the case, because it can also lead to overfitting if you train it for too long. And we will see later on that there are many hyperparameters that affect the MLPClassifier and neural networks in general. The other parameters we left at their defaults. For example, if you look at the scikit-learn documentation for the MLPClassifier class, you will see that the default is having one hidden layer with 100 neurons. You will also see the default activation function for each neuron in there. Okay, so let's go down and see how this compares to our random forest. We can see here the reporting that takes place thanks to the fact that we passed in verbose=True. This loss function basically needs to be minimized, and that's what the solver, or the optimizer, in the neural network does: it optimizes a loss function. You can read more about this loss function and the loss functions for classification in general in the scikit-learn documentation, or you can Google for even more information. Just know that this is inversely correlated with accuracy in this case.
So the smaller the loss is, the higher the accuracy should be, and that is the accuracy on the training set. Of course, it remains to be seen how good it is on the test set, which is what we care about. We can also see here that the loss steadily decreases as the iterations advance, which is a good thing in general. It's likely that if we had left this to run for longer, it would have decreased even further, because we can see here that it was still decreasing when it reached the max number of iterations we set and stopped. So as you can see, there is a lot that you can tweak for a neural network. If we scroll down, we can see that the accuracy is 85%, which is very good and also very close to the baseline we got with logistic regression. It's also comparable with the random forests. Don't be scared if you get slightly different results. You might also be wondering why the results are not the exact same results we got in the previous videos where we discussed random forests. Well, random forests, as their name implies, have a random component in them, so that will affect results. And multilayer perceptrons, or neural networks, also have a random component. So if you run this code yourself, or even if we run it again here, this value might change slightly, but it shouldn't change much. It'll be something around 85%. So, as you can see, multilayer perceptrons give comparable results to random forests. They are quite close to the top of the pack considering the results we've seen so far, and it's very likely that you can tweak their hyperparameters to get even better results and get close to 90% accuracy. That's a good exercise for you to try. So give it a try, and see if you can make it happen.

36. Hyperparamter Analysis: In this video, we are going to discuss some of the hyperparameters of the MLPClassifier class in scikit-learn. As you can see, there are a lot of them, and most of them have a big impact on the results.
It's very important to choose them right. There is a lot of ongoing research dedicated to the optimization of these hyperparameters, since they are so important for the quality of a neural network. So let's go over them in sequence and discuss them a bit. First of all, we have hidden_layer_sizes, which is a tuple. It basically specifies the size of each hidden layer, so how many neurons you want in each hidden layer. The default is a one-element tuple, which means that we have a single hidden layer with 100 neurons in it. Next we have the activation function, which specifies the activations of the hidden layers. We have four options here. We have the identity function, which is basically just f(x) = x. What this means is, if you recall from the introductory video, we mentioned that each neuron has a set of weights, which are pointwise multiplied with the input features, summed, and returned. And identity means exactly this: the sum of the pointwise multiplication is returned directly, with no function applied to it. Then we have the logistic activation, which applies this function to the pointwise multiplication result. And of these other two, the most common one is ReLU, or rectified linear unit, which takes the maximum between 0 and x. So it basically sets all negative results to 0. Next we have the solver. You might see it referred to as the optimizer as well. It is the algorithm used to optimize the weights in each neuron. Those weights need to be optimized so that the neural network performs well, that is, so that it has a good accuracy on the test set. I'm not going to go into details on what each of these algorithms is or how it works. The best one is generally Adam; it behaves the best for most scenarios, and generally you shouldn't need to change this. Next we have alpha, which is a regularization term. If you recall, we discussed regularization for support vector machines as well.
The default value here should generally be good as well. The batch size is not very important in our case. It basically refers to how many samples to process at once in case the dataset does not fit into memory, or in case the intermediate results do not fit into memory. It can sometimes be a good idea to reduce this, but in our case it doesn't really matter. Next we have learning_rate, which is another important hyperparameter. The naming here is a bit confusing, because it's not the actual learning rate; it's the learning rate algorithm, so to speak. It tells us how we want the learning rate to adapt, but it's only used when the solver is SGD. Usually you won't use this solver, so it's not that important. learning_rate_init is the actual learning rate used in the algorithm, and this can be important. It is used by all the solvers except the LBFGS one, so it is used for SGD and Adam. We mentioned that Adam is the most popular one, so it's important to understand what this refers to and how it can impact results. I'm not going to go into the mathematical details, but it basically refers to the amount of change the algorithm, Adam or SGD, will make to the neurons' weights in order to optimize them. If it's too low, training will be very slow and it might not reach a good result at the end. And if it's too big, training might be very noisy, meaning that we can reach some good result, then change it to a bad result, then again to a good result, again to a bad result, and so on. So you don't want to change those weights by too much. Generally this is a good default value, but sometimes you might want to add another 0 after the dot here, so you might want to make this 0.0001. max_iter is the one we changed. Generally 200 can be a bit much; it can take a lot of time to let it finish. So you might want to reduce this like we did.
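The effect of a learning rate that is too low or too high can be seen on a toy problem. The sketch below is an assumption-laden illustration, not the Adam or SGD solver from scikit-learn: it runs plain gradient descent on a single made-up weight with the loss w squared, and the three learning rates are picked only to show the three behaviours.

```python
def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimise loss(w) = w**2 with plain gradient descent.

    The gradient of w**2 is 2*w, so each step moves w against that gradient,
    scaled by the learning rate. Returns the loss after every step.
    """
    w = start
    losses = []
    for _ in range(steps):
        w -= learning_rate * 2 * w
        losses.append(w ** 2)
    return losses


too_small = gradient_descent(0.001)  # barely moves in 20 steps
reasonable = gradient_descent(0.1)   # shrinks steadily towards 0
too_large = gradient_descent(1.1)    # overshoots the minimum more and more

for name, losses in [("too small", too_small),
                     ("reasonable", reasonable),
                     ("too large", too_large)]:
    print(f"{name:>10}: final loss {losses[-1]:.4f}")
```

With the small rate the loss has hardly improved after 20 steps, with the large rate each update jumps past the minimum and the loss grows, and the middle value decreases steadily, which is the pattern you want to see in the verbose training output.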
Okay, then we have the tolerance, tol, together with n_iter_no_change, which control when to stop the optimization process based on the amount of improvement in the loss function that we discussed in the previous video. Momentum and Nesterov's momentum are only used for SGD, so I'm not going to go into detail on them, but they are interesting concepts and I do recommend that you Google them in conjunction with SGD. SGD stands for stochastic gradient descent. So those are good things to Google and learn more about. The others are quite self-explanatory; you should be able to understand them when reading the documentation for yourself. Now, another thing that impacts multilayer perceptron performance, other than these hyperparameters that we talked about, is the preprocessing of the data. I've seen applications of neural networks that go from something like 10% accuracy to 70% accuracy without changing anything in the neural network, only because something was changed in the preprocessing. So it's very important for the data to be in a suitable format. Neural networks generally like data to be between 0 and 1 or between -1 and 1, so a min-max scaling can help a lot. Keep that in mind when working with neural networks, or MLP classifiers in scikit-learn, and in other libraries as well in the future.

37. Exercise: Here is the exercise that you have to complete to end this section. Try to get a bit more accuracy out of this multilayer perceptron classifier, at least 1 or 2% more. It doesn't have to be that much. Do this by changing the hyperparameters and maybe the preprocessing algorithms. Pause the video here and continue once you have a solution. Then I will be presenting my own approach. Okay, so congratulations if you did it, and if not, here is what I did to gain about 2% more accuracy.
So I kept the MaxAbsScaler, because I didn't want to mess around too much with the MinMaxScaler. In order to use that one, we need to make use of the dense transformer, which can lead to out-of-memory errors, so I didn't do this. Instead, I increased the hidden layer size and decreased the learning_rate_init hyperparameter. Let me tell you why I did this. If you recall from the previous video, around here at the end, we would get the loss value decreasing, then increasing a bit, then decreasing back, and so on. Usually when this happens, it's a sign that your learning rate is too large. You want the loss to steadily decrease, like we have here. Even if it decreases by very little, you do want it to decrease, because any decrease is an improvement. By doing this, I got almost 88%, 87.7% accuracy. So it complies with the requirements I stated. Congratulations for getting this far and for completing this exercise, or at least attempting it.

38. Data collection: In this video, we are going to do the final classification showdown by running multiple classifiers on multiple datasets. I've already written some code here. It's based on the previous code, but I've changed a few things to make things easier. First of all, we are not going to be downloading any dataset from Kaggle, for example, at least not manually. We are going to automate this part by using this fetch_openml function from sklearn.datasets. What this function does is automatically download a specific dataset from the OpenML site. This is what the site looks like, and this is one of the datasets that we are going to automatically be downloading and running our classifiers on. You can read more about it here. Just go to openml.org and search for it there. It's basically a dataset consisting of images of handwritten digits. It has 784 features, which means that the images have 784 pixels.
Where this comes from is that they are 28 by 28 black and white images. Anyway, you can read more about it there. So that's one of the datasets we'll be using. The way this works is we call fetch_openml, pass in the dataset name, and tell it that we want it to return X and y. Otherwise it will return a Bunch object, which is basically kind of like a dictionary. You can read the exact documentation on the scikit-learn site; you can find it if you Google for it. Then we pass those X and y to shuffle, also from scikit-learn, in order to shuffle the data. This star here is because fetch_openml will return a tuple consisting of the X and the y, and we want the elements of that tuple to be passed in as separate parameters to the shuffle function. So that's what this star here does. Okay, and we do the same for the CIFAR dataset, CIFAR-10, which is also a dataset consisting of images, and they are split into ten classes. You can also read more about this dataset on the OpenML website. This is a more complex dataset. It's known in the literature that neural networks do very well on these two datasets, and we will see in this set of videos if that's true or not. Okay. Then I also modified this train-test split function. It now accepts an X and a y. I commented out this line, and we simply return the result of calling train_test_split on the passed-in X and y, also passing in the same 30% for the test size. Then I removed some code from that get-pipeline function, because we are not going to be using it anymore. We are going to jump straight into classification. Here we have the evaluate function, which I've modified a bit. It now directly uses the passed-in classifiers. And because we are dealing with X and y directly, we are going to pass datasets as tuples of X and y. So that's why we have to do this here. The same thing is reported, the accuracy score.
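The star-unpacking step and the modified split can be sketched as follows. The instructor's actual code is not shown in the transcript, so the function names load_shuffled and split, the random_state values, and the small synthetic arrays are assumptions made for illustration. fetch_openml downloads data on first use, so it is wrapped in a function here and not called.

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle


def load_shuffled(name):
    """Download a dataset from OpenML by name and shuffle it.

    return_X_y=True makes fetch_openml return an (X, y) tuple instead of a
    Bunch, and the star passes X and y to shuffle as two separate arguments.
    Not called in this sketch because it needs network access.
    """
    return shuffle(*fetch_openml(name, return_X_y=True), random_state=0)


def split(X, y):
    """Hold out 30% of the data for testing, as in the lecture."""
    return train_test_split(X, y, test_size=0.3, random_state=0)


# Small synthetic stand-in: row i of X is [2*i, 2*i + 1] and its label is i.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

data = (X, y)                                   # same shape of result as return_X_y=True
X_shuffled, y_shuffled = shuffle(*data, random_state=0)
X_train, X_test, y_train, y_test = split(X_shuffled, y_shuffled)
print(len(X_train), len(X_test))
```

Because X and y are shuffled in the same call, each row keeps its own label, which is the reason for passing both to shuffle together.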
I've only run this with the MLPClassifier, the random forest classifier, and the logistic regression. We no longer need a CountVectorizer, and I've added min-max scaling for the logistic regression and the neural network. For the random forest classifier, it shouldn't be necessary to add in any preprocessing. We'll see how it manages to handle this type of data as is. And here, like I said, we pass in the pairs of X and y, or data and targets, or features and targets, whatever you want to call these. I also got rid of the one-hot encoding. So this is it for getting the data and running our code on it. You can add more classifiers here if you want, but these are the ones that got the best results previously on our Amazon reviews dataset, and I kind of expect them to get the best results here as well. We'll see if that's true or not, but it's very easy to plug in more if you want to. So write down what you think the results will be, or what you think the best model will be, and come back in the next video for the results. It's already running as we speak.

39. Results and analysis: Alright, so we got some results. Let's look at them and see if we can draw any conclusions. So for MNIST, which is the easier dataset, because the images are smaller, black and white, and generally less complex, we have this progression of the loss function, and we can already draw some conclusions here. The loss is steadily decreasing, and it was steadily decreasing at a good rate when we stopped it. What this tells us is that this neural network likely had not finished training when we stopped it. So it might have been a good idea to let it run a bit longer, and then maybe we would have gotten some better results. You might want to try that out. Anyway, if we scroll down, we get 96, almost 97% accuracy, which is very, very good.
So the neural network does very well on this dataset. Now let's look at the random forest classifier. The random forest classifier also gets almost 97%, and again, this is very good. The random forest does very well here as well. But remember, this is quite a simple dataset. It's kind of like the hello world of applying machine learning to images. Now let's see what the logistic regression did. It didn't do that well; it only got almost 92%, 91.5% accuracy. If you recall our Amazon reviews dataset, for that one the logistic regression model actually got the best results, surpassing the neural network and the random forest classifier. Of course, like we mentioned, this can happen. Neural networks are generally good for image tasks, where we want to do classification on images and where we feed pixel values to the network; for other problems, not always so much. So this is kind of to be expected. Logistic regression is a linear model. It doesn't do that well for more complex tasks, and especially not for images. Okay, now let's look at the CIFAR dataset, which is more complex. Again, the loss is steadily decreasing nicely. There is no noise here; it doesn't decrease and increase and decrease again and so on. So that's a good thing. But again, it was steadily decreasing when we stopped it, or rather when it got to the max number of iterations that we set, at which point it stopped. We don't really want this to happen. Ideally, you would want to run this for longer. It's quite likely that the loss could get below one if you let it run for maybe 100 or 200 iterations or even more. But anyway, let's see what accuracy we got. We got almost 50% accuracy. That is much worse than we did for MNIST, but recall that this is a much harder dataset, so it's not that bad. It's not great, but not bad. Now the random forest classifier got quite noticeably less than the neural network. It got 45%.
This again shows that neural networks are very well suited for tasks involving images: for classifying images, maybe for doing regression on images, and stuff like that. More advanced neural networks can do other things as well, such as identify where a certain thing is in an image, so we can draw those bounding boxes you might have seen in various tutorials or articles. So already we can see that neural networks are good for this task, even though our neural network is very simple. For the logistic regression, we didn't get any results yet, and I have been running this for close to an hour now. So it's quite likely that it takes a long time and it's not converging. We might as well stop it, because we've seen that logistic regression doesn't do as well as the random forest and the neural network on the simpler dataset. So it's very likely that even if this finishes eventually, the result will be worse than 45% accuracy. So I'm going to stop this. While this is stopping, let's discuss some of the other models that we've covered until now but didn't run in this final showdown. We've covered the k-neighbors classifier, and this is not likely to give very good results for images. You can try it out, but don't really expect too much. For LinearSVC and SVC, you can probably expect them to behave similarly to the logistic regression, meaning it might take a long time, and a linear model will not provide very good accuracy here anyway. For a nonlinear support vector classifier, it's almost certain that it will take a very, very long time to finish. For Naive Bayes, again, you shouldn't expect great results. This is usually good for text data and simpler problems in general, so don't expect too much here. But I do encourage you to plug all of these in and see exactly what happens.
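Plugging extra classifiers into an evaluation loop could look roughly like the sketch below. It is not the course's evaluate function: to keep it quick and offline it uses scikit-learn's small bundled digits dataset (8 by 8 images) as a stand-in for MNIST and CIFAR-10, and the dictionary of models, the max_iter value, and the random_state values are assumptions chosen for the example.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Bundled 8x8 digit images; pixel values range from 0 to 16, so dividing
# by 16 brings every feature into the range 0 to 1.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X / 16.0, y, test_size=0.3, random_state=0
)

# Models discussed in the course but left out of the final showdown.
classifiers = {
    "k-neighbors": KNeighborsClassifier(),
    "linear SVC": LinearSVC(max_iter=5000),
    "naive Bayes": GaussianNB(),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

accuracies = {}
for name, classifier in classifiers.items():
    classifier.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, classifier.predict(X_test))
    print(f"{name}: {accuracies[name]:.3f}")
```

Results on this small dataset will not match what these models would do on the larger datasets from the lecture; the point is only that adding a model is one more dictionary entry.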
As for the decision tree, again, we mentioned, if you recall, that it's unlikely for it to perform better than a random forest. So at best it will probably get similar accuracy to this random forest classifier, and at worst it will get much less. So you shouldn't expect it to outperform a neural network on image tasks either. So that's it for the showdown. We've seen so far that neural networks can do very, very well on image tasks, better than the other models. But on other tasks, like our Amazon reviews dataset, they don't always perform the best, and in those cases it's worth trying something like a random forest or a support vector classifier. This is just something to keep in mind: there's no one size fits all in machine learning. You always have to adapt to your dataset, try multiple things, and see what works best on your particular problem and your exact data.

40. Quiz: Here is the final showdown quiz. This will involve questions that refer to all of the models we discussed and their relationships to some of the dataset types we have seen. So the questions will be more general. You will need a very good understanding of what we've discussed and of the models in order to be able to correctly answer these questions. So pay very good attention to them, and good luck. Which of these models will work best on all datasets and problems? Artificial neural networks, decision trees or random forests, or none? The correct answer is none, because it really depends. There is no silver bullet that will work on all datasets and problems. Each model has its pros and cons, and each type of dataset has specific types of models that generally work best. Which of these dataset types are likely good fits for artificial neural networks? Image data, text data, any numerical data. The correct answer is image data.
We've seen that artificial neural networks worked the best on two well-known image datasets and that they didn't work so well on the Amazon reviews dataset, which involves text data that was transformed to numerical data. Why is the logistic regression model unlikely to perform well on image data? A: Because it requires a lot of hyperparameter tuning. B: Because it is a linear model and thus unable to model complex data such as images. C: Because it cannot handle more than two classes, which image data usually have. The correct answer is B, because it is a linear model and thus unable to model complex data such as images. We've seen that this was the case when applying logistic regression to our MNIST and CIFAR datasets. Logistic regression models don't really require a lot of hyperparameter tuning. In fact, they don't even have that many hyperparameters, and tuning them will not get rid of the linear model status. Linear models are generally not able to handle complex datasets such as images. And they can handle more than two classes: MNIST and CIFAR both have ten classes, and we've seen that logistic regression did give us answers for MNIST. So they are able to handle multiple classes. Which of these are likely outcomes if we try to apply support vector classifiers to a big image dataset? A: A linear support vector classifier is unlikely to perform well. B: A nonlinear support vector classifier is likely to take a very long time to train. C: It's impossible to tell for sure. The correct answers are A and B. A, because we've already discussed that linear models are unlikely to do well. And B, because we've seen nonlinear support vector classifiers taking a long time to train on other datasets as well. Of course it's impossible to tell for sure, but the question asked which of these are likely outcomes, not definitive outcomes. So I don't consider C to be a correct answer given the way the question was phrased. That's it for the quiz.
Congratulations if you were able to answer the questions. If you did, it means that you have a very good understanding of classification and the relationships between models and datasets.

41. Useful data preprocessing techniques and approaches: Here are some useful data preprocessing techniques and approaches. First of all, we have model-specific techniques. For support vector machines, you should generally use a standard scaler prior to feeding the data to your support vector machine; they generally work best with this kind of scaling. For neural networks, a min-max scaler usually works best, but it could be that the standard scaler works as well, so it's worth trying both of them if you can afford it. For trees, you generally don't need any preprocessing, because they only compare feature values against thresholds rather than doing arithmetic with the data. So they should be able to do just fine without any preprocessing. But if you suspect that your kind of data causes trees, that is, decision trees and random forests, to behave badly, then you should consider some sort of preprocessing, either using a standard scaler, a min-max scaler, or some other problem-specific scaling that you might be able to identify. Next, we have problem-specific types of preprocessing. For text data, we've already covered that we need to transform the text into numbers first. This can be done using a count vectorizer or a TF-IDF vectorizer. It's also very important to optimize the various parameters of these. For example, the TF-IDF vectorizer has quite a few parameters that we can tweak, and they will cause it to output different values, and those variations can affect classification accuracy. So it's best to try out as many of them as possible, if time allows and if your training time isn't too long. For images, you can usually just divide by 255. That is because a pixel value is between 0 and 255.
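The preprocessing choices described here can be sketched in scikit-learn roughly as follows. This is a minimal sketch, not the course's own code: the toy reviews, the toy pixel array, and the particular `TfidfVectorizer` parameter values are hypothetical and only meant to show where the knobs are.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

# Model-specific scaling: bundling the scaler with the model in a pipeline
# means the scaler is fitted only on whatever data the pipeline is fitted on.
svm_model = make_pipeline(StandardScaler(), SVC())
mlp_model = make_pipeline(MinMaxScaler(), MLPClassifier())

# Text data: turn documents into numbers. These parameter values are
# hypothetical starting points to vary, not settings from the course.
toy_reviews = [
    "great product, works well",
    "terrible product, broke quickly",
    "works great, would buy again",
]
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1, sublinear_tf=True)
text_features = vectorizer.fit_transform(toy_reviews)

# Image data: 8-bit pixel values lie in [0, 255], so dividing by 255
# maps them into [0, 1].
toy_pixels = np.array([[0, 64, 128], [255, 32, 16]], dtype=np.uint8)
scaled_pixels = toy_pixels / 255.0

print(text_features.shape)
print(scaled_pixels.min(), scaled_pixels.max())
```

Trying both scalers for a neural network then amounts to swapping `MinMaxScaler()` for `StandardScaler()` in the second pipeline and comparing the results.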
So by dividing by 255, you basically do a min-max scaling without having to instantiate the class and without the added overhead of the min-max scaler. Just dividing by 255 will usually get you better results than not doing it. For other kinds of data, think about whether you can use any domain knowledge to transform your data. For example, maybe you have some domain knowledge that tells you that the data can be divided by 100,000 and it will still make sense. That will lead to the values of your preprocessed dataset being close together, which should help classification algorithms out. There are other approaches that you can take. For text data, think about whether you can remove various words and punctuation. Some keywords that you can look up for more information are stopword removal and lemmatization. This can help a lot. For example, in this sentence, the word "to" here probably does not provide much value when applying a machine learning algorithm, because it's a very common word. The same could be true for "and", "a", "an", et cetera. Again, this colon here, this full stop here, or this comma here likely do not provide much extra information. Another thing is that words take different forms depending on where you use them. For example, here we used "remove" and here we used "removal". Again, this might confuse learning algorithms, and they might prefer that the word has the same form everywhere. So it might help a machine learning algorithm if you changed this "removal" here to "remove"; it might understand the context and the meaning better that way. For images, it can help to create more data instances out of existing ones. This is very easy for images because there is a lot of stuff you can do to them, such as rotations and various resizings. For example, consider this: we have this seven, and here we have it rotated. Here we have it a little bit smaller. Here we have it a bit smaller in the other way, horizontally. This one is smaller vertically.
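The four variations just described (rotated, smaller overall, smaller horizontally, smaller vertically) could be produced along these lines. This is a sketch under assumptions: the transcript does not say which tool generated its examples, so `scipy.ndimage` is swapped in here, and the `pad_to` and `augment` helpers, the 15-degree angle, the 0.8 factor, and the synthetic image are all hypothetical.

```python
import numpy as np
from scipy import ndimage


def pad_to(image, shape):
    """Center a smaller image on a zero-filled canvas of the given shape."""
    canvas = np.zeros(shape, dtype=image.dtype)
    top = (shape[0] - image.shape[0]) // 2
    left = (shape[1] - image.shape[1]) // 2
    canvas[top:top + image.shape[0], left:left + image.shape[1]] = image
    return canvas


def augment(image, angle=15, factor=0.8):
    """Return four variants of a 2D image, each with the original shape."""
    shape = image.shape
    return [
        ndimage.rotate(image, angle, reshape=False),          # rotated
        pad_to(ndimage.zoom(image, factor), shape),           # smaller overall
        pad_to(ndimage.zoom(image, (1.0, factor)), shape),    # smaller horizontally
        pad_to(ndimage.zoom(image, (factor, 1.0)), shape),    # smaller vertically
    ]


# Synthetic 28x28 stand-in for a digit image: a bright vertical bar.
original = np.zeros((28, 28), dtype=float)
original[4:24, 12:16] = 1.0

variants = augment(original)
print(len(variants), [v.shape for v in variants])
```

Each variant keeps the label of the original image, so the augmented images can simply be appended to the training set, and only the training set.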
You can actually get tons of new images by doing this kind of thing. And it makes sense, because it can help with overfitting and gives you more data to learn from as well. So definitely try to do this for image data if you can and if you feel the need to, for example if you don't have enough data.

42. Ensemble methods: In this video, we are going to talk about some more advanced classification models, namely ensemble methods. We've already seen one ensemble classifier, the random forest classifier. But there are more, and here are the ones we are going to talk about in this video. First of all, we have the bagging classifier, which is quite a basic one. The way this works is that it fits base classifiers on random subsets of the dataset and then aggregates their predictions. This has the effect that it can improve classification accuracy and reduce overfitting. For most of these types of estimators, we can pass in a base estimator that will be used, and if this isn't passed in, the default is a decision tree. In the case of bagging, we are usually also able to specify the number of estimators to use. That's the idea of most of them, but let's see the others anyway. So this is bagging. Then we have the extra trees classifier, which fits randomized decision trees. So it's kind of like a random forest, and it uses averaging to improve the accuracy and control overfitting. Again, we can pass in n_estimators, a criterion, and, because it deals with decision trees, a max_depth. So this is very similar to random forests, as you can see. Next, we have random trees embedding, which is an ensemble of totally random trees. So as opposed to a random forest, there is more randomness going on here, but it's still very similar. As you can see here, we have the same parameters, n_estimators and max_depth. We no longer have the criterion because, like I said, this has even more randomness, so there is no criterion being used here. And note that this is called an embedding. It's not called a classifier.
What this means is that it does not do classification, at least not directly. What it does is construct a representation of your dataset. You can see here that a data instance is encoded according to which leaf of each tree it is sorted into, using a one-hot encoding. This leads to an encoding of your data instance with as many ones as there are trees in the forest, and you have here some formulas about the dimensionality and so on. Note that because this does not do classification directly, you will need to add a classifier on top of it. So think of it kind of like a preprocessor. Next we have the AdaBoost classifier, which is quite a popular one. It begins by fitting a classifier on the original dataset and then fits additional copies of it on the same dataset, but with the weights of incorrectly classified instances adjusted. So basically, subsequent classifiers specialize on the misclassified instances. Next we have the gradient boosting classifier, which is also quite popular. It also works in multiple stages, and it optimizes differentiable loss functions. As opposed to the previous ones, you don't pass in a base estimator here; at each stage it fits regression trees on the gradient of the loss. And it's supposed to be robust to overfitting. A disadvantage, as we will see, is that it is quite slow. So if we go into our code, I already plugged these in. We have the random forest classifier and the bagging classifier with a decision tree classifier passed in, so this is the base model used by the bagging classifier. We have the extra trees classifier, the random trees embedding, on top of which we apply logistic regression, and AdaBoost. As for gradient boosting, I commented it out because it took a very, very long time; it wouldn't finish. These are very big datasets, so gradient boosting might not be the best thing to apply to them unless you have a very, very fast CPU, okay?
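A setup like the one described could look roughly like this. It is a sketch under assumptions rather than the course's actual code: it uses scikit-learn's small built-in digits dataset instead of MNIST or CIFAR so that it runs in seconds, it sets `n_estimators=20` for speed where the transcript used defaults, and the `models` and `results` names are hypothetical.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    ExtraTreesClassifier,
    RandomForestClassifier,
    RandomTreesEmbedding,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Small stand-in dataset (assumption): 8x8 digit images, ten classes.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "random forest": RandomForestClassifier(n_estimators=20, random_state=0),
    # The decision tree is passed in as the base model for bagging.
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=20, random_state=0
    ),
    "extra trees": ExtraTreesClassifier(n_estimators=20, random_state=0),
    # The embedding is not a classifier, so a classifier goes on top of it.
    "random trees embedding + LR": make_pipeline(
        RandomTreesEmbedding(n_estimators=20, random_state=0),
        LogisticRegression(max_iter=1000),
    ),
    "adaboost": AdaBoostClassifier(n_estimators=20, random_state=0),
    # GradientBoostingClassifier is left out here, mirroring the transcript,
    # where it was commented out because training took too long.
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = model.score(X_test, y_test)

for name, accuracy in results.items():
    print(f"{name}: {accuracy:.3f}")
```

The numbers this prints are for the digits stand-in only and should not be compared with the MNIST and CIFAR figures quoted in the video.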
For most of these parameters, I left the default settings in, which might not give the best results. But still, just to get an idea of what they can do, I left them like this. So let's look at the results. We also changed the code a bit here, so that the output is easier to read for so many models. So we have MNIST: random forest, which we know gives us about 96% to 97%. Bagging seems to do worse. Extra trees seems to do slightly better; it gets closer to 97%. Random trees embedding and AdaBoost don't do well at all. And the same seems to happen for CIFAR. Random forest gets almost 46%. Bagging is much worse. Extra trees is a little better at 46.2%. And the remaining ones are worse. So this is it for ensemble methods. They are something good to try, and they generally get better results than individual models. But for some kinds of data, in particular image data, it is hard for them to beat models such as neural networks, which are very, very good for this type of data.

43. Overview and hints: Here is what you will have to do for your capstone project. The requirements are very simple. You have to get at least 55% accuracy on the CIFAR-10 dataset, so this dataset right here, under the settings that we've used when talking about it. So don't change the train/test split or any other setup. Only change the parameters you pass into the evaluate function: you should pass in a model that gets 55% accuracy on the test set. Pause the video here unless you want to hear some hints. Okay, so here are some hints. First of all, I suggest that you start with the model that got the best accuracy so far. I'm not going to tell you which one that is; go back and watch the previous videos if you don't remember. After that, I suggest that you tune that model: play around with its hyperparameters and run it multiple times until you hit at least 55% accuracy.
This might take a while: depending on the model and the hyperparameters you choose, the training could be very slow. But it is what it is. That's life in machine learning in general: you have to try multiple things until something works the way you want it to. That's it for this video. Good luck, and come back and watch the official solution video once you manage to obtain at least 55% accuracy, or if you've given up. But I do suggest that you spend some time on it, even a few days if needed.

44. Proposed solution: Okay, so here is the result. I got 55%, almost 55.2%, accuracy. And as you might guess from this output here, I did this using a multilayer perceptron, or artificial neural network. I'm going to scroll up now and show you the settings I used. So if you still want to try to figure it out yourself, pause the video here and keep working on your own approach. So here are the settings I've used. I have two hidden layers here, one with 2048 neurons and another with 128 neurons. The maximum number of iterations was set to 200. The activation was the rectified linear unit, and the learning rate was what we've talked about. If we scroll down here, you can see that the training actually stopped at iteration 166. And this is because of this: the training loss did not improve by more than the set tolerance. You can pass in this tolerance as a parameter to the MLPClassifier class in scikit-learn if you want something larger or smaller. But I didn't have to change it; as you can see, it's the default. Now, I'm not saying that this is the only setting that works and that this is what you had to find. No, there could be multiple things that work. So you may well have got similar or even better results with some other settings.
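The settings described here map onto scikit-learn's `MLPClassifier` roughly as sketched below. Assumptions are labeled: the learning rate setting is not recoverable from the transcript and is left at its default, the course's own evaluate function is not reproduced, and the small `demo_model` trained on the built-in digits dataset is a hypothetical stand-in that only shows where the iteration count and tolerance can be read.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Settings as described in the transcript. Constructing the model is cheap;
# it is not fitted here because training on the full image data is slow.
capstone_model = MLPClassifier(
    hidden_layer_sizes=(2048, 128),  # two hidden layers
    activation="relu",               # rectified linear unit
    max_iter=200,                    # maximum number of iterations
)

# Small stand-in (assumption): a much smaller network on the digits dataset,
# just to show early stopping on the training-loss tolerance.
X, y = load_digits(return_X_y=True)
X = X / 16.0  # digits pixel values range from 0 to 16
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

demo_model = MLPClassifier(
    hidden_layer_sizes=(64, 16),
    activation="relu",
    max_iter=200,
    random_state=0,
)
demo_model.fit(X_train, y_train)

print("iterations run:", demo_model.n_iter_)
print("tolerance (tol):", demo_model.tol)
print("test accuracy:", demo_model.score(X_test, y_test))
```

If `n_iter_` comes out below `max_iter`, training stopped because the loss stopped improving by more than `tol`; passing a different `tol` to the constructor changes when that happens.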
That's great, and congratulations. As long as you got the result and used the appropriate methodology, that is, splitting your data into train and test sets and reporting the results on the test set, then that's great. So congratulations on finishing your capstone project, and good luck in your future learning experiences.

45. Tips and tricks, pitfalls to avoid, shortcuts: In this video, we are going to talk about tips, tricks, common pitfalls, and various shortcuts. So the first tip I would give you is to start with the known best models when dealing with a machine learning classification problem. For example, we have discussed image classification tasks, and we said that the best models for those are likely to be artificial neural networks. So you should start with those if you have an image classification task. There's no point starting with logistic regression because, like we've said, it's unlikely to be able to model your data very well, and you won't be getting the best accuracy. By doing this, you save time on trying various models. Of course, there might be multiple models that work best, and in that case you should try all of them, but you shouldn't try those that definitely won't do very well. Next is to start with the default parameters. This is always a good idea. The default parameters in scikit-learn and other machine learning frameworks have been selected because they generally work well. Of course, they might not be the best for your particular problem, but they are a good starting point. Another tip I would give you is to look stuff up online. Chances are, you are not the first one to deal with your particular problem. It might be distinct from other problems you find online, but it should still have some similarities. And the stuff that you can find online for a similar problem, or for your exact problem if you can find anything for it, is likely to help a lot.
And last but not least, I suggest that you keep it simple. Don't start with overly complex models, and don't spend a lot of time trying to get them to work. More often than not, something simple will work just as well or even better. Here are some common pitfalls that you should avoid. The most common one and the most dangerous one is test set snooping. You are doing this whenever you use the test set in order to make decisions about your model. So the test set should never guide your decisions about your model. You should lock it away somewhere and only use it to test your final model. So it should go something like this: shuffle your data and split it into train and test sets. Only use the training set, which you can split further to see how your model does on unseen data, but never touch the test set until the end. When you are happy with the performance of your model on the training set, then evaluate it on the test set. If you're happy, that's great. If you're not happy, you should start over completely: reshuffle your data, re-split it, and start again. The test set should be what your client tests your model on, so you should not use it to guide your decisions. Another common pitfall is to use small datasets. For very small datasets, it's very possible for a model to perform well only because the data is not diverse enough and the model is able to capture all the necessary relationships. This might change when you get a lot more data. So try to get a lot more data at the start and use it. Poor or bad optimizations are also quite a common pitfall. Remember that we said that default hyperparameters are generally good choices, but they are good choices to start with. It doesn't mean that they are the best ones. You should still try to optimize the hyperparameters and end up with something that you're confident in, that you've spent enough time on, and that works well enough for your particular problem.
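The workflow described for avoiding test set snooping, combined with hyperparameter optimization, can be sketched as follows. This is one possible arrangement under stated assumptions, not the course's code: the digits dataset, the decision tree, and the `max_depth` grid are hypothetical choices, and cross-validation on the training set stands in for "splitting the training set further".

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Step 1: shuffle and split once, then lock the test set away.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0
)

# Step 2: make every modelling decision using the training set only.
# GridSearchCV splits the training set further into folds internally.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [4, 8, None]},
    cv=3,
)
search.fit(X_train, y_train)
print("chosen parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)

# Step 3: touch the test set exactly once, with the final model.
final_test_accuracy = search.score(X_test, y_test)
print("test accuracy:", final_test_accuracy)
```

If the final number is disappointing, the advice above is to reshuffle and re-split (for example with a different `random_state`) and start again, rather than to keep tuning against the same test set.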
And finally, here are some shortcuts that you can take when dealing with classification machine learning problems. You should always look stuff up. We've mentioned this before. Chances are that you can find information online that helps you, and this can generally save you a lot of time. This ties in with the other shortcut, which is: don't reinvent the wheel. There's no prize to be won for figuring out stuff that has already been figured out by other people or by teams of experienced researchers. Even if you can do it, you will spend a lot of time on it. If this is your job, or if this is an important project that you have good chances to monetize, you shouldn't do this, okay? If you are doing it for fun and you have plenty of time, sure, it's nice. It's a nice feeling to figure stuff out yourself, but you shouldn't rely on it. Look stuff up and see what you can find, use it, and adapt it as needed. But don't try to reinvent the wheel. There's a lot of information online, and you might as well use it. Next, realize that there is no free lunch, right? So if you find some model that people say gives very good accuracy, it's very likely that that model will require a lot of computational resources to train. On the other hand, if you find a model that trains very fast, it might not give the best accuracy out of all the other models. So there is always going to be some trade-off, some compromise, and you have to accept that. In fact, it will usually help a lot and save you a lot of time if you decide beforehand what that compromise should be. Do you care about a model that trains very fast and gives decent results? Or do you not care how long it takes to train, and only care about getting the best possible accuracy? If you can answer this question, it will likely be a good shortcut, because it will save you a lot of development time. So I hope this advice helps when deciding how to approach a particular machine learning classification problem.

46.
Additional resources, next steps, recommended tools: In this video, we are going to talk about additional resources, next steps, and recommended tools. So for additional resources, we recommend that you start with Kaggle. Try to find some interesting datasets and maybe even some competitions that you can take part in. Don't worry if you don't get good results at first; that will come with experience. The important thing is that you get something working and that you understand what the datasets and the competitions are about. Also look over the existing solutions. A lot of people post their approaches, and there is plenty to be learned from those. Another good resource for datasets is the UCI Machine Learning Repository, where you can find a lot of interesting datasets as well. Look some up in fields that you are passionate about and, again, try to apply some classification models to them. The scikit-learn documentation is also very important. Of course, you've seen that we made use of it extensively throughout this course, in almost every video. So definitely get used to it, get used to searching it and understanding how it's structured. It also has a lot of examples. Make sure that you are able to reproduce as many of them as you can afford to spend time on. Google AI is another good resource, with plenty of information about the field, that I recommend you check out. And last but not least, our own courses. There are more advanced courses that you can take in order to complete your education. They will definitely be useful. As next steps, we recommend the following. Make use of the other courses. For example, data visualization with little exercises, which will show you how to visualize data in more detail. We've covered a bit of data visualization, but not a lot, and there is definitely plenty more to be learned. So check that one out, and check the other ones out as well. Watch the regression models with real exercises videos as well.
It deals with regression as opposed to classification, which we dealt with in this course. They usually go hand in hand, and it's important to understand both of them. Next, try to think about a field that you're passionate about and try to come up with a way of applying classification models to that field. After you've thought of something, see if you can find some dataset that lets you do what you thought about. This is an iterative process, because you might think of something cool but be unable to find a dataset that lets you do it. In that case, either you have to build your own dataset, which is possible but might be harder, or you have to change your idea a bit and see if you can find a dataset for your new idea, and so on. So it's definitely a cool thing to take on. As recommended tools, we have mostly Python with Jupyter and Anaconda. As we've seen, this is what we used. We didn't really need any external tools. Whatever new thing we needed, we could just pip install. This is generally the case for machine learning. Python and Anaconda have very rich ecosystems that provide you with all that is necessary in at least 99% of cases. Google AI also has some tools that are worth checking out. OpenML has a few as well. We've used it for datasets, and it was definitely very convenient. So check out this website as well in more detail. And last but not least, try to find some yourself and make sure to keep track of them. Don't just be glad that you found something cool for a particular thing you are working on at the moment, because it might come in handy in the future as well. So write it down somewhere, in some list of cool tools that you like, which you can refer to in the future for other projects as well. I hope this advice helps you.