Python machine learning tutorial | classification | road to machine learning part 4 | Michal Hucko | Skillshare

Michal Hucko, Python | Docker | Kubernetes

Lessons in This Class

13 Lessons (2h 15m)
    • 1. Introduction (2:07)
    • 2. Machine learning introduction (4:13)
    • 3. Dataset description (4:40)
    • 4. Exploratory analysis (17:46)
    • 5. First prediction (18:25)
    • 6. Numerical preprocessing (18:33)
    • 7. Categorical preprocessing (10:59)
    • 8. Final preprocessing (10:18)
    • 9. Algorithm explanation (9:24)
    • 10. Feature selection (14:36)
    • 11. Grid search (8:42)
    • 12. Pipelines (11:00)
    • 13. Project (4:02)

About This Class

In this class we will go through the basic process of classification in Python. During the lectures we will work with the HR Analytics case study dataset linked in the class description. We will try to predict whether a given employee will leave the company in the near future or not. For the predictions we will compare the performance of two algorithms: the K-nearest neighbors classifier (KNN) and logistic regression.

During this course we will go through the code examples summarized on GitHub: https://github.com/misohu/classification_basics

This tutorial is dedicated to machine learning beginners who are ready to learn the practical basics. For each phase of the machine learning process we will go through example code in Python. For most of the algorithms we will use the scikit-learn library.

This tutorial is not a mathematical or statistical lecture filled with formal proofs and equations. I don't recommend this tutorial to absolute Python beginners; for those I recommend my previous courses, Road to Machine Learning parts 1, 2 and 3.

This course will cover the following topics:

  1. Machine learning introduction 
  2. Dataset introduction
  3. Exploratory analysis
  4. First prediction 
  5. Numerical preprocessing 
  6. Categorical preprocessing
  7. Final preprocessing 
  8. Algorithm explanation
  9. Feature selection
  10. Grid search 
  11. Scikit-learn pipelines

After this course you will be familiar with the following terms:

  1. Scaling
  2. Normalization
  3. Encoding (one hot encoding)
  4. Histogram 
  5. Scatter plot
  6. Train and test set
  7. Cross validation
  8. Precision, recall and F1 score
  9. Hyperparameters

Hi guys, my name is ItGuyMichal and I am an experienced machine learning DevOps engineer. For updates about my current machine learning courses, check out my Skillshare profile.

Meet Your Teacher

Michal Hucko

Python | Docker | Kubernetes

Hello world!! My name is Michal Hucko and I am a passionate Python developer and a former university teacher. I was working on a PhD in computer science, but due to an unfortunate situation I have decided to postpone the studies for now. That's why I want to teach computer science online. I hope I can help you understand the modern world of machine learning and distributed computing.

Besides programming I like to spend time with my wife, my brother and my friends. I am a passionate fitness guy and sometimes I play computer games.

About my engineering career

For the past 5 years I have been working as a machine learning DevOps developer. I work mostly with Docker, Kubernetes and Python. Currently I am working for one of the biggest computer companies in the wor... See full profile


Transcripts

1. Introduction: Hey guys, my name is Michal Hucko, also known online as ItGuyMichal, and welcome to this classification course on Skillshare. In this course we are going to talk about the basics of classification in Python, and we will go through many practical examples. The course is intended for those of you who already have some experience with Python and libraries like pandas. If you are an absolute novice to this topic, I highly recommend first going through the series of courses called Road to Machine Learning, parts 1 to 3, which you can find on my Skillshare profile. During this course we are going to apply basic classifiers to predict whether an employee will leave the company or not, using the HR analytics case study dataset. At the end you will be familiar with terms like exploratory analysis, preprocessing of numerical data, preprocessing of categorical data, feature selection, hyperparameter tuning and pipelines in scikit-learn, as well as with classifiers like the K-nearest neighbors classifier and logistic regression. Throughout the course I will provide a lot of practical hands-on examples with which I will explain the difficult terms in machine learning. I work as a machine learning DevOps engineer; for the past five years I have been working with machine learning, where I have gained a lot of experience. Most of my contributions concern the classification of emotions and the clustering of texts, and you can find a couple of my publications on the internet. If you are interested in more machine learning courses like this one, please consider following me on Skillshare or on social media. Thank you very much for watching this video, and I hope I will see you in the first lecture of this course. 2. Machine learning introduction: Hey guys, Michal here, ItGuyMichal on social media. Thank you very much for watching this lecture in the basics of classification in Python. In this video we are going to talk about the kinds of machine learning that are out there. Before we move to the practical parts, it is useful to talk about the distinctions between the types of machine learning and the kinds of problems we can solve with them. We can distinguish four basic types of machine learning. The first one is supervised learning. Supervised means that we are provided with a training dataset that already contains the labels we want to learn to predict. A great example is housing price prediction: you have a training set with details about houses, such as the area or whether there is a balcony, and then you have a target value you want to predict, which is the price of the house. In this course we will have an HR dataset: the data describe employees, and we would like to predict whether an employee will leave the company or not. That is the target value. In supervised learning we have this target value in the training data, and we then predict it on the test data. I also mentioned training and test data: the training data is used during the training process of the machine learning algorithm, and the test data is then used to evaluate it. So that was the supervised learning problem.
Now, let's talk about unsupervised learning. Now, you have sometimes the data where the labor label is missing click, you have a lot of data about some people read and you don't have any label and the, or you don't know what you want to predict on them, right? So, so these kind of problems are called the unsupervised learning problems. Here are the predictions like sometimes you want to crew, for example, with these people, you want to group them in some certain groups. And then the unsupervised learning algorithms, we'll group them like without any knowledge of the label, right? So clustering is one of the cases of the Unsupervised Learning. Or for example, you want to, you want to look for the anomaly in your data set. Yes. You have a certain like behavioral data from your light no employees. And you want to find some, some errors or some suspicious activity, like in the banking sector, is this quite the common case? Now, this is the unsupervised learning problem. The third group of problems as semi-supervised learning problems. So it is what? It's like this craft in the name. So in the training set you have some data which are labeled and some of them which are not. And you want to use both of them later to predict something in the test data, right? So this is sometimes getting popular with, with the x-rays screens or with the medical data. And then we have also reinforcement learning, which I don't want to go into details in this video or in this lecture in this course, right? So this course will be about classification. And classification is the case of the supervised learning problem. So we will have a training data with the target label which we want to predict. So, and to be specific, as I mentioned, we are going to predict whether the employee will leave the company or an odd based on his specification in the time, right? Okay guys, I hope this gave you some insights into differences between the learnings in the machine learning. And I hope I will see you in the next video of this course. Thank you very much. 3. Dataset description: Hey guys IT guy Mikkel here and welcome to the first lecture of the course for classification in Python. In this video, I'm going to show you which data set we're going to use throughout this course. In this description of this course, you will get a link to a data set or to a case study, which is called the HR analytics case study. So please go there. And here you can find an amazing data set which is describing some employees in one of the company. And they gathered the data about whether the employee will leave the company or not. And based on the characteristics of the the employee. Now, if you go to HR analytics case study, you will be prompted with a view like this. If you click here on the data and if you scroll down there where you can download the data, it's about 45 megabytes and you will get all these files here. Now, I will go through these files right now. So most important one is this general data. And here you can see are about five megabytes of data with the data set. So as you can see here, we godlike the H, What was the age of the employee than the aeration stock? So this is the Target column whether the whether the employee left or not, then the business travel. How often does the employee business travel that department at SAP like this. And so there are about, I guess 24 columns, right? So, so 24 columns, and there are even more data here like in time. 
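To make the distinction concrete, here is a minimal sketch (using made-up toy numbers, not the HR dataset) contrasting a supervised classifier, which is fit on features together with labels, with an unsupervised clustering algorithm, which only ever sees the features:
```python
# Minimal sketch of supervised vs. unsupervised learning on toy data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Toy feature matrix, e.g. [age, years at company] for six employees.
X = np.array([[25, 1], [30, 2], [28, 1], [45, 15], [50, 20], [48, 18]])
y = np.array([1, 1, 1, 0, 0, 0])  # supervised target: 1 = left the company, 0 = stayed

# Supervised: trained on features AND labels, then predicts labels for new rows.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict([[27, 2]]))          # e.g. [1]

# Unsupervised: only sees the features and groups similar rows on its own.
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(clusters)                        # cluster ids, no notion of "left" vs. "stayed"
```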
So when the guy came to the work and how when the guy went down, this is the alle time. Then there is a manager solving survey which was I think gave them after a few months in the work and stuff like this. And then there is the employee survey data. So step like this. But for the purpose of this tutorial, we just going to use the general data. And now, as you can see, there are not only the numerical data, there are also some categorical data. So this is exactly the great that'll sit for the purpose of this course, where I will show you how to predict not just based on the numerical data like in the iris dataset and hope the examples on the internet, but they will also show you how you can use the categorical data or text data in your classification. Now, also very important file in the folder that is called the data dictionary. And here you can see the meaning of each of the variables. Now, this is very cool now. So let's go through this. So as I mentioned, there is a ratio at attrition. Sorry, I can't pronounce it properly. So hopefully you are not angry with me. So business travel, how frequently the employee travel for the business department? The department in company now, the distance from home. So so how much does the employee needs to travel to the work? What's the employee education? There are five degrees like doctrine master, Partial College, and below college education field, the field of education employed count and like ambitious and not useful variable. It's just there because it was in the database. I think we will get rid of them during the prediction employee numbers. So that's the ID of the employee. If you want to map, for example, that in time, you can use the employee number for this, but we will not do it in this course, maybe in some future courses. So stay tuned. And now environments satisfaction. So work environment satisfaction level, again, four levels, right? This was measured somehow, we don't clear right now. The gender, the job involvement, right? Job level, the job role, the sad like self-explanatory job satisfaction, job satisfaction level, blah, blah, blah, martial status, monthly income, data, some quite interesting numbers. You can shake them if you want. A number company's worked. So total number of companies which the employee work before, over 18, whether the employee is above 18 years of age or not. And then there's just the performance rating and other stuff, right? Yeah, training and and lots of other stuff. So, so you need to download the data set and then put it inside your working directory for the purpose of this course, I highly recommend create a folder, create the data folder inside the data, put there the data from these, from these datasets. You can put there everything. Now, if you are interested in the commands which I'll will use through these videos, you can check out the good linked in the description below. I will go through very complicated staff. And the staff needs a lot of knowledge in Python. So if you didn't know anything about Python, I highly recommend my previous courses on the skill share, and I hope you'll have a lot of fun with this course. So I will see you in the next video. Thank you very much. 4. Exploratory analysis: Hello guys, welcome back. My name is IT guide Mikkel, and I'm glad you joined this video. This is another lecture of the course for Python classification, where I show you how to classify data set for HR. 
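A minimal sketch of this first look at the raw file, assuming it was unpacked into a local data/ folder and that the file and ID column are named general_data.csv and EmployeeID (check your download; the actual course code lives in the linked GitHub repo):
```python
import pandas as pd

# Assumed local layout: ./data/general_data.csv (adjust the path to your download).
df = pd.read_csv("data/general_data.csv")

print(df.shape)             # number of rows (employees) and columns
print(df.columns.tolist())  # column names, e.g. Age, Attrition, BusinessTravel, ...
print(df.dtypes)            # which columns are numerical and which are text/categorical
print(df.head())            # first few rows to eyeball the values
```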
Now we are going to predict whether an employee will leave the company or not, as I mentioned in the previous video. Now, let's go to the first step of the whole machine learning or the predictive cycle. And the first step is always an exploratory analysis. Now, we already did the first part of this in the previous video where we will looking at the columns and the structure of the data set and stuff like this. But now I want to show you just some code which is used for export for deeper exploratory analysis. Yes. So let's jump right back into the Jupyter lab. And as you can see, I have here the first file which is currently exploratory analysis. So here is the code. I highly recommend you to read the code on your own. So just stop the video if I'm going too fast, but please don't copy, paste the code from the gate. You will not learn anything. You need to learn the syntax. You need to get the code inside your fingers, inside your head through your fingers and stuff like this. Yeah, you cannot just copy, paste it in and then you'll just end up being these kind of data scientist who's just copying and pasting everything from the internet wall whatsoever. I just put it here before because, well, you are developing, there are a lot of errors. I just want you to think about me that I'm not making any Universe, but you know, I do a lot, right? So let's go here. So first of all, compared to the other courses, as you can see, we're going to use a lot of libraries are a lot of stuff, right? So in the GitHub you can find the requirements TXT file with all the libraries which I'm using and versions of them. So feel free to install them with the PIP installed that I think it's the r dash are requirements that the C. Okay, so let's jump here. So we're going to use pandas for data processing, the num pi for data processing, seaborne. Now the seaborn is library which helps you to plot data. Now there is also a matplotlib library. Those of you who don't know matplot be MOOCers can copy, paste the commands MongoDB school, but seaborne is, is like seaborn is built upon the map LTP. And to me it's more like nice, more beautiful graphs you can bring into the seaborne light. But since it's up to you, then you also need to put if the matplotlib and I'm also importing the mountain Libyan away. Yeah, because I'm sitting here now this is just a setup for the matplotlib. So I'm saying how big should be the images presented in these Jupyter Notebook? Saying what is the size of the font here? So you just can copy-paste it. This is nothing important, this is no big stuff. You can find these on Stack Overflow and copy-pasted, whatever. So first of all, we are reading the dataframe with read_csv and I'm reading the general data CSV and I'm setting the employee ID as the index column. So if you look here and this is the general data, and here you can see there is a lot of staff. And I'm setting this employee ID, which is a unique column as an index. This can be very helpful because I don't want to have this, this, this column as a feature. So I'm setting it as index. So it's much, it makes me wear. Is it easy to work with that data later? So and then I'm dropping these two columns, standard hours and employee gum, which makes no sense. These these columns are just filled with the same values. They don't give any value to our data set, right? How do I know it? Because I looked into the data in the previous, previous part here. 
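A sketch of the notebook setup described above; the figure settings are arbitrary, and the column names EmployeeID, StandardHours and EmployeeCount are assumptions based on the lecture, so adjust them to your copy of the data:
```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

# Plot defaults for the notebook (figure size and font size).
matplotlib.rcParams["figure.figsize"] = (12, 8)
matplotlib.rcParams["font.size"] = 12

# Use the unique employee id as the index so it is not treated as a feature.
df = pd.read_csv("data/general_data.csv", index_col="EmployeeID")

# Drop columns that hold the same value for every row; they carry no information.
df = df.drop(columns=["StandardHours", "EmployeeCount"])
print(df.shape)
```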
So when I look here and I saw that standard hours of work for the employee, this is just the same number for every one. That's the number of hours the worker shoe should work during the day and everyone was working the same hours. Why and how do I know it? I look into the data, right, that and that's it. So I think even if you like, look here and find standard, it's not here. Nevermind, it's not you. Okay. So I dropped the columns, right? And then you can see we have about 4,410 rows and 21 columns. We can also look at EFF. Yes. Show you how does it look like? Yeah, we had here the text data, the categorical data, yada, numerical data and stuff like this. Now, the thing that I always like to do is just setting, splitting the columns to numerical and categorical data. How do you know it? So if you put a method which is called the types, here, It's a parameter. Yeah, it outputs. This is a pandas Series and index is the name of the column and the value is the specific data type of the column. As you can see, this is the integers 64-bit. Then there are the objects. These are the the categorical data, and then there is the float 64 for flow points. Now I'm splitting it into like numerical columns, which is everything not the object. You can just notice the syntax. I'm accessing the data types, which is the series, and I'm setting the dy times, this will return a series of booleans and then you can access them if you have trouble like sub selecting dataframes and working with data frames, I highly recommend my another course at the skill shared goal there, it's a Pandas Data course. I don't know, there are just nice students, but, but this is the greatest cores which I haven't paid so far. So please consider watching the course before moving here. Now then I'm accessing the index, right? And then why? Because I would like to know just the names of the columns and then I'm accessing the values because the index will retune some strange array. I just want to have the list, right? So this will return a list. I'm transferring it to a list, and then I'm adding here the attrition which started with vari variable, which is not integer, but I want to have it in my columns because I don't know why I will add it and use it later. Now the categorical columns are the object once. And now, the next thing is, it's always a good part of the of the exploratory analysis is just, I'm, I'm showing you how you can convert the target variable if it's a categorical value into a numerical ones. So the problem with this is this one. So if I run this and then if I go here and I put here the DFT, let's say numeric, whatever DFT numeric. And let's say we want the apparition, right? And you can see this is a categorical value. Why? Because I transferred it and with the astype category now and now what I can do is I can access the categorical part of it and I can access the coats. And now this will encode the true four or one and already yes for one and no 4-0 or otherwise. Right. But I think it's just based on the yes. So yes, it's one and no is 0. Luckily, always check it out. Sometimes it can be reversed. Now, Yeah, because there is no logic. There are just strings now. So this is the way how I am encoding the categorical values into the numerical ones. Now I'm removing the NA values. Well, why am I doing that? The empty values or NAN values inside your dataframe can cost you a lot of troubles. I was talking about these in the previous course. 
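A sketch of the dtype split and the target encoding described above, under the same assumed file and column names:
```python
import pandas as pd

df = pd.read_csv("data/general_data.csv", index_col="EmployeeID")

# Split column names by dtype: everything that is not `object` is treated as numerical.
numerical_columns = list(df.dtypes[df.dtypes != object].index)
categorical_columns = list(df.dtypes[df.dtypes == object].index)
print(categorical_columns)   # text columns that will need encoding later

# Encode the target "Attrition" (Yes/No) as 1/0. The codes follow the alphabetical
# category order, so always double-check which value became 1.
df["Attrition_numeric"] = df["Attrition"].astype("category").cat.codes
print(df[["Attrition", "Attrition_numeric"]].drop_duplicates())
numerical_columns.append("Attrition_numeric")
```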
So this is just the way I'm deleting it the same way as I did in the previous course. So I'm asking if it's a Northern LA and if the whole so so so if there is a row with these than just Dealey deed, and then as you can see, the shape just changed from Lobo from let's get the shape here, 4,410 to 4,382 rate. So deleted some of the rows. Now, next thing which is important during the exploratory analysis, just to look on the correlation between the average attrition. So the target value which we want to predict and all the other features. Now you can do it because we transferred the categorical type into the numerical. If this was a categorical type, you are still not able to do this. Now, please beware that transferring this into codes makes sense only if you want binary classification. Now, we will talk about the classification in the next video and what is the classification and what we're gonna do. But right now, the intuition is like we are going to predict based on some employee data whether the employee will leave the company or not? Yes, in the given state, right? So, so, and how are we going to accomplish this? We are going to create something which is called the classifier, which is going to be trained on the historical data. And based on this historical data from the previous employees, the classifier is able to see whether the current employee will leave the company or not. Yes. So for these purpose, we are going to use this data. Now. The best use is, are the numerical data because how are we gonna do it? I will show you later. We are going to use some mathematical functions which are going to map the behavior of the employee based on the data. And the mathematical functions work best with the numbers. So we are going to use the numbers mostly, but you can use also the categorical data, which we are going to transfer into numerical data the same way as we transferred the attrition into the, into the numerical data in these line, right? So now, because the dataframe numeric. So if I put in here is just composed of the numerical data, I can compute the correlations between each of them, right? So if you remember from the previous video, from previous course, these returns a metrics. Where is the correlation of each from each of each rows. But we are interested only in attrition numeric. And as you can see, that the most important features for attrition is the total working years. What does it mean? Because this is a negative number. The less working years there are, there is a higher probability that the employee will leave the company, which makes sense. There is also the negative correlation between the age and the, and the attrition. And there is also a negative correlation between the years with current manager. So the less years you are with the current manager, then the more likely it is that you will leave the company, right? Which is quite interesting. I don't know, I like my manager and I'm not talking with him. So let's go next. Or you can also plot these correlations. I really like the plots. They can talk, tell you a lot just based on the picture, right? So how you can plot. So please, this is a very complicated matplotlib commands which I just create. I don't know, I copy pasted from the Stack Overflow. Please be sure to copy-paste it also read. So I'm just saying setting acts as numeric and I'm saying the alpha, well, this is just a variable name, but I'm setting the, you know, the axis to the names. And then this beautiful code can create these beautiful plot. 
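A sketch of the missing-value removal and the correlation check, with the heatmap added at the end; file and column names are again assumptions:
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Rebuild the numerical frame (assumed file/column names, see the earlier sketches).
df = pd.read_csv("data/general_data.csv", index_col="EmployeeID")
df = df.drop(columns=["StandardHours", "EmployeeCount"])
df["Attrition_numeric"] = df["Attrition"].astype("category").cat.codes
df_numeric = df.select_dtypes(include="number")

print(df_numeric.shape)               # before dropping missing values
df_numeric = df_numeric.dropna()      # remove every row that contains a NaN
print(df_numeric.shape)               # a few rows fewer

# Correlation of each numerical feature with the encoded target, most negative first.
print(df_numeric.corr()["Attrition_numeric"].sort_values())

# The same correlations as a heatmap; the Attrition_numeric row is the one to read.
sns.heatmap(df_numeric.corr(), cmap="coolwarm", center=0)
plt.show()
```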
And as you can see, we are just interested in these row. So as you can see, there is a very based on the colour, there is a very interesting correlation in, in case of h, which is a little bit dimmer than the grey parts which are not useful at all, right? But you cannot make any assumptions just based on the correlation. I think 99% of my cases when I was trying to predict something with the data set, everything was great, right? And then at the end of the day I was I was able to create some classifier. Yes. So please, if everything is gray in your data set, don't worry. This is just arena are like dependency. But sometimes in your data there is a non-linear dependency which you cannot like see just with the correlation, right? And there is also a thing like correlation is not a causality. And this is a big topic. Go to the Google and Google it and you can see some funny examples between the correlations in the world and which are not telling you anything about the real world. Right? Let's go next. So this was the correlation. That next step of the exploratory analysis is that data distribution. Now what does it mean the data distribution to your seat on the example. So you've got the, as you can see, still am working just with numerical data in these examples. So I am getting the numerical data from the data frame and I'm writing a histograms. And as you can see here, you can see the range of The distribution for h. So we have some examples with the guys who are between the years. I think this is anything up to 60 years. And most examples here are between 3040, about 35, right? Make sense? That's like valid, right? There are the pension lists after 60 and there are just too young to work before, right? Then this is distance from home. A lot of people are living very close to the office. I think this is in kilometer, it's amature, maybe not. And then some guys are living as far as 30 kilometers away from office. So they need to go to the office like at least one hour from the application point of view, that is the three most awkward one. And that was I just tell you, education tree is a bachelor degree. Well, to my country, bachelor degrees like Nas often lead then the job level blablabla. And as you can see, there are different ranges. These graphs are called the distributions. And in data science world or in predictive analysis, the best distribution is the normal distribution. So it looks like this. So everything is the most common Park is around the center of distribution. And then you have these beautiful like bell-shaped function, right? So this is good one. And then in most of the real world cases you will end up with the long tail problem. So something like this. You have this big hill and then the long tail of lower values. And we will fix this long-time issue in the preprocessing or the or the preferences? Yeah, in the preprocessing video, right? As you can see, there are a lot of long tails or also this is somehow a long tail and this is a long tail red. So these long tails can cause troubles to your classifiers. So we will need to like adjust these distribution to be more than normal distributions. So these bell-shaped distribution, also, as you can see, there are various ranges from the numbers. This is from 0 going up to 60. Well, this one is going from 0 up to two hundred thousand, two hundred thousand monthly income. What is this? Well, these guys sound dem, are very rich. Well, next, I have something special. This is a pair plot. 
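A sketch of the distribution plots described above (assumed file and column names):
```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/general_data.csv", index_col="EmployeeID")
df_numeric = df.select_dtypes(include="number").dropna()

# One histogram per numerical column. Look for roughly bell-shaped ("normal")
# distributions vs. long right tails, and note how different the value ranges are
# (e.g. Age in tens vs. MonthlyIncome in tens of thousands); both observations
# motivate the scaling done in the preprocessing lectures.
df_numeric.hist(figsize=(16, 12), bins=30)
plt.tight_layout()
plt.show()
```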
And now reading a pair plot can be sometimes very tricky. So I will just show you. So these, for example, this one shows you how are the examples distributed between two variables. So the age variable and the years with the current manager. So as you can see, when there is a last years with the credit manager and here is the legend. So the orange ones are likely to leave and the 0 means they are not likely to live. It's a blue one. And as you can see, there is some linear. So if you can split these, these, these graph, we just one line and separate a lot of cases on one side and a lot of other cases on the other side. This means that there is some visual linear correlation or Rindler relation between the predicted value and two of the target variables. So as you can see, if we put here the line, we will be able to separate these two gals, right? And also here so that the pair plot can show you the linear dependency between two features and the predicted feature, right? So this can be very helpful. Now looking into these crafts before you move to some predictions can save you a lot of time. Why? Because you may find that there is not another singling our dependency in your data. And you'll just based on your time with the simple models. And you can maybe try some neural networks which can help you. But, you know, sometimes without any correlation in the data and any linear dependency, there is just, there is, you're just wasting time to make some analysts later. Ok. So this was just like introduction into exploratory analysis. I showed you some examples how you can look into data to find some correlations, to find some dependencies, please always consider doing these things before you move to the actual predictions. So guys, thank you very much for watching this lecture. I will see you in the next lecture of this course. Have a nice day. 5. First prediction: Hey guys IT gay, Michelle here and welcome to another lecture of our core course for classification biotin. In this lecture we're gonna talk about how to evaluate the classifier, right? We didn't I didn't show you any classifier or I didn't explain you some of the classifiers in the Python. But before we move to classify a classification and running the classification, we need to understand how we going to evaluate it. Trust me, this is much better to start with them going there later. So if you look inside these notebook over here, you can see that I'm importing here a lot of stuff. For the purpose of example, I'm going to use something which is called K nearest neighbors classifier in this notebook. I will later explain how this classifier work. Right now, you can think of a classifier as a black box. You put inside there the parameters or the features of the, let's say in this case, a given employee, like what's true, what's gender? How many years is he working there? What was the whole lumpy suffering under these Manager? Yes. And stuff like this. And then the black box will say whether this guy will leave the company or not. So, and now let's talk about how we can evaluate the classifier red. So, So before we remove Next, let's just legislating about right. Let's say you want to predict ten employees, okay? So, so then you want to know about thin employees if they will stay or if they will go, you will put an array with ten features for each of the employee inside the black box. And the classifier will output array like 11111000000, right? 
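A sketch of such a pair plot; the four columns picked for `vars` are an arbitrary illustrative choice, not the lecture's exact selection:
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("data/general_data.csv", index_col="EmployeeID")   # assumed names
df["Attrition_numeric"] = df["Attrition"].astype("category").cat.codes
df_numeric = df.select_dtypes(include="number").dropna()

# Pair plot of a few features, coloured by the target. If one straight line could
# roughly separate the two colours in some panel, there is a usable (close to
# linear) relationship for the classifier to exploit.
sns.pairplot(
    df_numeric,
    vars=["Age", "TotalWorkingYears", "YearsWithCurrManager", "MonthlyIncome"],  # assumed column names
    hue="Attrition_numeric",
    plot_kws={"alpha": 0.4},
)
plt.show()
```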
So he said that the first five of the employees in the feature vector will leave the company, while the last five, at least five will stay at the company, right? And now let's say you know the truth, right? You somehow have the magic bowl and after a few months, you will find the true, like 10111101000. So this is the truth. And now in the machine-learning Durer, a ways how you can evaluate the accuracy or the performance of the classifier, right? Forget about accuracy right now, okay? So first of all, yeah, the, the basic thing is like computing how many times does the classifier, it does the accuracy. So if there were like, then like possible examples out of ten examples, he too, like how many of them ate, right? So it's 80% accuracy rate, eight of them he was ranked, so it's 80%. But now there are more sophisticated approaches to evaluate the classifiers, which I want to tell you right now. So let's imagine what are the cases that can happen. So in these case, these, in, in these metrics are in this table. We have at the left side. The actual class. So what was the real value of the prediction? So this means in these rows, we will have cases when the actual class was 0. And here we have cases where the actual class plus one. And in the columns we will have predictions. So when the prediction was 0 and when the prediction was one. So for this case, when the prediction was that you will stay and the prediction is that he will or she will go red. So for the metrics, the cell is called a true negative. That means that we were predicting about something that will not happen. And in truth, it didn't happen. So we were right and it was true, negative prediction. In this case, we were saying it will be positive, but in reality it was false. That is the reason why it's called a false positive example. And now in this case, we have a false negative because we are predicting it was false. And basically we were false because it was true. So in this false negative, and here we have true positive because we predicted a positive and basically it was a positive. So when we go back to these two arrays, this is our prediction. And this is the reality. I will just line it one after another. So this is a true positive example. This is what we are predicting that it's true. So it's a false positive because the reality was 0, right? It's a false positive. If you look here, it's prediction was one, and then this is the true positive troopers does true positive, false negative, true negative two, negative three, negative two negative, right? Simple. Now what's next? Having these like counts for these, we can evaluate the accuracy which I showed you right now. We compute the true positive plus true negatives, right? So that's eight and we divide it by each of that, write the sum of everything, right? That's what I did in the first case, that's 80% now, but there is something which is called precision and recall, and these can be much more descriptive for our cases. Now, precision is the number of true positives out of true positive plus true negative. If you look in the table, the true positives is here. So that's the number of times when we hit the prediction, the true and it was true and the true negative is here. So the number of time when the prediction was negative, it was negative. So basically precision is saying when we predict something. And, and so how many of, of, sorry, one more time. The precision has in the denominator, the true positive plus false positive. 
So if you look in the table, is true positive and false positive. So we are talking about the cases when we predict that something is happening, rank that something was positive. And basically the precision says, that's the number or the percentage out of our predictions. So how many of our positive predictions are actually true, right? Given the boss and think about it. Because in the denominator is true positive plus false positives. That are the cases when we are predicting that something happens, is that what's the probability that we are right in case when we are saying that something is happening, right? And then there is something which is called recall. And if you call, if you look at the recalls and denominator, there is a true positive that's here and there is a false negative. False negative. So basically recalls saying, how many of actual positive cases are we hitting with our classifier? Yes, because this is a true positive. So how many of them are reheating? Now, there is something which is called there is something for something law. So while we are trying to optimize the precision, many times we are lowering the recall. So you need to find the balance between the precision and recall, right? And sometimes we have a very high precisions. So if you look here, the less we bred it, something that is true, the less chance that we are false, right? So, so basically as you can see, this is quite trade-off. Now, many of you can be confused. Right now. We are losing here and we don't understand this given the boss and go through it again. Now, there is some, one more descriptive measure which is very hard to understand because it's a lot mathematical compared to the precision and recall which I explained you here by words. And it's called, it's called an F1 measure. Now, in this case, the F1 measure is just the combination of precision and recall. And here you can see the formula for these and F1 measure is just one number saying something about the classifier accuracy or sorry, classifier performance, right? So you can think as F1 measure as some ground truth, which can be used anytime. But many times, I, as a professional Machine Learning Engine or I like to look at the precision recall values than to the F1 measure. Now, you can see that we are talking about a heating. Now, a heating, a class. Now it's, it's quite simple for binary classification, but it comes quite trickier when you are alike talking about multiple classes. Let's forget about something. Forget about classes like true and false. Think about classes like a and B, red. So the precision for a will be like how many times are reheating a when we are saying that something is a and when we want to compute the precision for the model in general, we need to compute the precision for Class a and Class B, right? And if we have n classes, we need to compute for each of these n classes, right? So it's quite simple to compute it for one class, but it's harder to combine it for multiple class. And in these case, we are talking about macro precision and micro precision. These are the ways how we can combine the precision for each class. And basically the macro precision is just the average of precision for each class. Well, the micro precision is something else, something more complicated. In micro precision, we are just computing the true positive for each of the class divided by true positive for each of the class plus some of the false positive for each of the class, right? It can be quite complicated. 
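The same bookkeeping can be done with scikit-learn's metric functions. Here is a small sketch on an illustrative 10-employee example (not the exact arrays from the lecture): the confusion matrix has actual classes in rows and predicted classes in columns, accuracy is (TP + TN) divided by all cases, precision is TP / (TP + FP), and recall is TP / (TP + FN):
```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

# Illustrative 10-employee example (made up for this sketch).
y_pred = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])   # what the classifier said
y_true = np.array([1, 0, 1, 1, 1, 1, 0, 0, 0, 0])   # what actually happened

# Layout: [[TN, FP],
#          [FN, TP]]
print(confusion_matrix(y_true, y_pred))

print(accuracy_score(y_true, y_pred))    # (TP + TN) / all = 0.8 here
print(precision_score(y_true, y_pred))   # TP / (TP + FP): how many predicted leavers really left
print(recall_score(y_true, y_pred))      # TP / (TP + FN): how many actual leavers we caught
```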
I don't want to go into too much details and bore you if you are interested in McMurray micro and macro precision goal to the Internet right now for the purpose of a binary classification, it's much easier to work with the, with the macro precision because it's like the same, the microeconomic reform, the binary classification. Who saw now that was a lot of information about precision and recall. So what we want to do. So it's a good practice that after the exploratory analysis, you just go take the data and make a first classifier just for numerical data and see how it, how it behaves, right? So if you remember in the previous exploratory analysis notebook, at the end, I save the numerical data to a file called Process Data sludge numerical CSV. So here I can immediately read it and now I can, I know that in this dF, I have only numerical data with the target value. So if you look at the columns, there is the edge distance and at the end we have the attrition number, right? So this is amazing. So right now we can make a simple prediction just on the numerical data. And the running of the first prediction is as simple as this. First of all is good to split the data frame into the features and the predicted or the target volume. In this case, the target value is attrition number while the x, the features I will refer to x with capital X to a metrics, while I will refer with the lower case letters to a vector. What does it mean? This is a vector, right? And this is a matrix with multiple rows and columns. So here I'm saying, I want each column except these two columns. And here I'm saying for the y, I will just these goals, right? So if you look inside here, by running these need to run everything at this time. We can ignore this one. And here, by running these, you can see that we have 3,677 cases when the attrition is equal to 070005 cases when the attrition is equal to one. And now what do we want to do is we are going to split the data set into training sample and this sample. And what is this? So in order to evaluate a data set, we want to test the performance. In order to evaluate the classifier, we want to test the performance of the classifier on the data which was not presented during the training of the classifier, right? So when you have a labeled data set, as in this case, we have, we want to split some part to training and some part to actually testing and evaluation, right? And it's a good rule of dump, which is called 80 to 20, where we are saying that 80% of the data set is set to training, while 20% of the data set is said to evaluation, right? And Python and scikit-learn allows us to run these train test split, which is a function from the, from the scikit-learn to split this data set for us. And by running these, it will just equally split the data set to train and test. And here you can see that it automatically split the, the positive and negative cases according to the rule. And also it did for the training set, right? So you don't need to worry about writing a function to split the training and test set and to keep the equal number of these indices also the data set is being shuffled so, so you don't need to worry about this thing, right? And now that we have the training extreme and x, sorry, x train and y train will be used for training while x-test and widest will be used for evaluation to compute this precision. Recall F1 measures right and accuracy. Now, so how, how do we, how do we run the classifier? Yes. 
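For reference, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). Below is a sketch of the split described above; the file name processed_data/numerical.csv and the column Attrition_numeric are assumptions, and note that plain train_test_split only shuffles, so stratify=y is passed explicitly here to keep the class ratio similar in both parts:
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed file produced by the exploratory-analysis notebook.
df = pd.read_csv("processed_data/numerical.csv", index_col=0)

X = df.drop(columns=["Attrition_numeric"])   # feature matrix (capital X = matrix)
y = df["Attrition_numeric"]                  # target vector (lower-case y = vector)
print(y.value_counts())                      # how many stayed vs. left

# 80/20 split; shuffling is on by default, stratify keeps the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```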
So in order to create the classifier, and you need to create an instance of the classifier. Let's say for this case it's just a black box for you. I understand. So just follow the steps. So here we are saying into the variable name, we are setting it to the instance of the K-Nearest Neighbors classifier for the n neighbors, which is a variable set to three. You don't need to think about it right now. And then we are training a classifier. So we are saying Nate, fit with extreme and y train the fit, the method of which will be used for every machine learning model you will met in your life. When you create some, some machine learning will own your own. Always provide the feet and feet transform and predict functions for, for, for the others. So as you can see here, we are training the classifier and by running these, the Nate variable will be in place. Lee update it so you don't need to assign it to another variable. And then we can predict, now we are predicting how many of the test set employees will go in the future, right? So by running these, you will see that inside the, why predict. We will have an array of ones and zeros, right? And this is the prediction error from the beginning, right? We can compare now this prediction error array with the wife test, which is the actually the truth of the predictions for the given employees, right? So in the next cell, I can compute the F1 score for the whitest white bread and average micro whatever. Don't care about this. And then we can also compute the precision score, recall score for macro. So by running this, you will see that we are having breezy, nicer results. Even the baseline classifier got 92% for F1 score and precision score 84. Now, there is a classification report which is a function which can provide more insights. So here you can see that for the class 0. So for the class, when we are saying that the guys will not leave the company, we are reaching the precision of ninety-six percent. So out of the existing examples for the class is 0, we are reaching to 96% of them. And when we are saying that these 96, and when we are saying that somebody will stay in the company and we are labeling it as his hero. We are the, we are accurate and 96%, right? And here you can see on how many classes it was predicted. Here is also f score for these guys. And there you can see other measures down there. The most important part of this table is this precision recall for each of these values and f score, right? You can even plot the confusion matrix. And here you can see this is true. So true positive, we have 89 cases through negative we have 722, right? But we have troubles in these two regions. So in, in a now for you, what do you think which of these two numbers is more critical for us? And the answer is, is this one? Why? Because there is less, as sort of this one. There is less like true labels, so delay. So there is less examples when the, when the employees will leave. And so these number is relatively bigger to these number than these number to this number rate. So our biggest problem is this false negative, which is quite big. So we want to minimize this in the next videos now. Okay, this could be quite confusing. It took me a lot of energy to explain this. I hope this was not confusing. It's still it's still conferencing. Please leave a comment in the discussion below. 
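A sketch of this first K-nearest-neighbors baseline and its evaluation, under the same assumed file and column names as in the previous sketch:
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (f1_score, precision_score, recall_score,
                             classification_report, confusion_matrix)

df = pd.read_csv("processed_data/numerical.csv", index_col=0)   # assumed file name
X = df.drop(columns=["Attrition_numeric"])
y = df["Attrition_numeric"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

knn = KNeighborsClassifier(n_neighbors=3)   # k=3, an arbitrary first choice
knn.fit(X_train, y_train)                   # fit() trains the model in place
y_pred = knn.predict(X_test)                # 0/1 prediction for every test employee

print(f1_score(y_test, y_pred, average="micro"))
print(precision_score(y_test, y_pred, average="macro"))
print(recall_score(y_test, y_pred, average="macro"))
print(classification_report(y_test, y_pred))   # per-class precision/recall/F1
print(confusion_matrix(y_test, y_pred))        # rows = actual, columns = predicted
```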
I will make my best to give you some other resources to explain this issue like properly, because this is the essential part and without this part you are unable to continue with the next lectures. The guys, thank you very much for watching this video. I hope I will see you in the next lecture of this course. 6. Numerical preprocessing: Hey guys IT gamey hill here and welcome to another lecture of our classification in Python, tutorial or course on skill share. So in this lecture, we're gonna talk about numerical data. In the previous videos, I show you the basics of evaluation. We were talking a little bit about precision. Recall F1 score. If you still are lost in these terms, go there, check out the lecture again, write down the comment. I'd be glad to answer your questions. Now in this video, let's go to the Jupyter lap, I mean the notebook. And as you can see, again, there are a lot of our libraries which we're going to use. Most things are important for scikit-learn, which we're going to use just for evaluation. And the important part is here these modules like standard scaler polynomial features, min-max scalar from scikit-learn, which we are going to use. So whenever you will go through the code and you will use something which I am not defining here. Go to the top and check the libraries which I'm using and the modules which I'm using. And make sure you import the same thing Society. For the versions, I will create the requirements TXT file on the kid linked in the description below. So please go ahead and install the same libraries. I'm running it in the winter and virtual environment of the Python treats so. So basically the standard process. Now, I am reading the numerical data into the DF And as the same as in the previous video, I'm splitting the Y vector of predicted values and the metrics of the features outside. So harm splitting downside. And then our first I'm bringing these hissed. I already printed it into exploratory analysis notebook over here. But right now I'm doing it here just to recap what we did. So I showed you here these distributions. So here we can see distributions of, of our data. And now we are going to play a little bit with these distributions. So let's say we want, we want to first talk about normalization. So in this case we want to rescale the data or put it into different distributions. And first let's talk about min-max normalization. So the min-max normalization idealization, we will ensure that the data will go from the distribution SDS into the range between 01. So this is very helpful for some types of classifier, let's say logistic regression nine or many of you don't know what is logistic regression. We'll talk about it later, but just need to know that sometimes you need to rescale the numerical data in order to improve the classification, right? You can create a classification with the data as you have, as we saw in the previous video, in the previous lecture. But to improve the classification, he asked to get the better recall and precision, you need to play little bit with the data in the re-scaling the Domenico data is one of the options. So let's rescale that with min-max scalar, although the minmax normalizer. So the formula for this is that you want from the x, which is the given row of the given value. And subtract the minimal value from all the xs in the feature. And then you want to divide it with max of the given feature minus the min of the given feature, right? 
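A tiny worked example of that formula, x' = (x - min) / (max - min), on the [1, 2, 3] toy column from the lecture:
```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])                  # toy "age-like" column

# Min-max normalisation: x' = (x - min(x)) / (max(x) - min(x))
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)                                # [0.  0.5 1. ]  -- everything lands in [0, 1]
```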
So basically we are just subtracting the minimum from the value, the minimum of, let's say you have, I can display it on the example. Let's say you have like 123. These are the features. Let's say these are the values for the h. I know this make no sense. Ages not so small, but let's say it's for age. So for each of the row, you will do the equation like minus the minimum of these vector, which is in this case one. And then you will divide this thing by a maximum of this, which is a, which is three, and minus the minimum which is one. And you will do the same thing for each of it. Like here you, well, here you will put the same, the same thing instead that you put hidden number two. And here you will do the same thing. Instead, you will put here number three, right? And when you will run this, you'll see that each of the values will be between 01. You can play with the values which one you will end up within these like range. Now, when you got here, for example, the years at company, you see that they are starting from 0 going up to 40. And when you use the minimax scalar, this, we'll do this transformation for you. So how to use it? You will create an instance of min-max scalar. You can import the minmax killer from scikit-learn pre-processing tools, right? Always if you don't know how to import it, put it into Google and the first reason will be mostly the documentation. So go ahead and check it out. So we are creating the minmax scalar. And then we are calling the normalizer is the minimax killer to function which is called Free Transform and basically defeat transform is the same function as the feed in the previous case. But instead of just, you know, training the classifier in this case, it will also transform the data. So result equals normalize their feet transform n. And because, you know, these scalars needs to work with an umpire race and, and they need the special like they are accepting only the column vectors and row vectors. You need to reshape the values of the panda series. This is just the technical detail. Please copy this, I will show you later how to use it easily. So, but the thing is, I am putting it here just to demonstration that when I ran the minmax scalar upon the years at company, I need to extract the values and reshape this into the column vector. So if I put it here outside, you will see that. You will see that this current is yeah, I didn't run this. So don't forget the rondos. Take some time. Yeah, just play these, right? And here is, you can see that I'm creating just the column vector as an input to the normalizer. And then I'm just reshaping beg, the column vector into the series. Just so I can, I can have like the, the output for, for the histogram. And now you can see that the histogram went from these into these starting from 0 going up to one. So this normalization or the min-max normalization can be very helpful for certain classifiers. Now again, there is not like a golden rule for everything. The application of different normalization tools depends on the classifier which you're going to use through the prediction. So this is what makes a good data scientist, a good data scientist, knowing what classifier needs, what is the essential key to success in the data science and machine learning? So please note that this is a tool and you can use this tool. 
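A sketch of both scalers applied to a single column; the file name and the YearsAtCompany column are assumptions. Note that StandardScaler not only centres the column on zero but also rescales it to unit standard deviation:
```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.read_csv("processed_data/numerical.csv", index_col=0)     # assumed file name
column = df["YearsAtCompany"].values.reshape(-1, 1)               # scalers expect a 2-D column vector

# Min-max: squeezes the column into [0, 1].
minmax_scaled = MinMaxScaler().fit_transform(column)
print(minmax_scaled.min(), minmax_scaled.max())                   # 0.0 1.0

# Standard scaling (z-normalisation): z = (x - mean) / std, centred on 0 with std 1.
standard_scaled = StandardScaler().fit_transform(column)
print(standard_scaled.mean().round(3), standard_scaled.std().round(3))
```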
I add my beginning is it was just experimenting whenever I was running some classifier was always trying to min-max normalization and other non-white normalized for us, which I can present you here in this video. Now, the another approach which is quite common is called the standard scaling or the set normalization, where we are transforming data into normal distribution. So to resemble the normal distribution, this operation centers the data around the mean of 0 and when the standard deviation, right? So it keeps the standard deviation and it's centered data around the 0. So the range from minus something up to one, but the mean is always 0. Now, the equation for this here, so we are just like subtracting the poo, see, yeah, subtracting the mean, which is the EU and divide it by the standard deviation of the row. So you have the row of edges, let's say you compute the mean and then you do just divided with the s. So for example, this is the year of the company, right? This is the mean. And you can also get the standard deviation down there. But the simplest thing to show you is just you can create an object which is, so this is the original again. And you can create here the object which is called the standard scaler. And then you feed in these data and transform it. And as you can see, the center is around 0. Now. Now, the most important thing. So this is the Zen normalization. It's quite popular. You can use it with your data to improve the scores now. But the thing is like right now I showed you how to transform the whole X. But you can haul X with one column, right? But you can transform the whole dataframe in one run. So let's say you have a whole data frame rate and you want to transform and scale or using the minmax scalar or the standard scaler, the whole dataframe. So you create the scalar array the way how we did it here. And then you run the feet transform upon the whole dataframe. And these will end up in creating the dataframe, the new dataframe, right? So here I am transforming the whole dataframe, adding ED to results. And then, because these will, so sorry, let me just show you. So the x is in this case a dataframe, right? But when I ran the scalar. I'm so sorry, scalar fields transform x. So the S1, I didn't specify this. So let's get the standard scaler to it here. And then feet transformed is one. You will see that the output is an array of array. It's a numpy array. So if you go, but the shape is the same as, as the x. So this is the shape of these. And I get here three X, right? So the same shape. And now because this is an area of array, I want to do the print, the histogram. I need to convert it into dataframe. And that's the reason why I am putting it here to the constructor of dataframe and then printing it like these. And as you can see, everything was centered around 0 with the normalizer. Right? Now. Let's go next. So beware of the future scaling. So what does it mean? So, so as you can see by running the standard scaler here, we were extracting or subtracting the meaning. But you still need to have in mind that these mean is calculated from the training data? Yes. So you want to transform also the test data, Right? Right now I will show him how to run it on the whole eggs, but you need to extract the mean and the standard deviation or the max min max in case of min-max scalar from the training data and apply these transformation on the test data, right? So please always be sure that you are not extracting these values from the whole data set. 
And when you're splitting into train and test, always keep the train and test sets separate. Now I will show you how to do it. So, beware of feature scaling: I create the scaler here, and then I split X by the 80/20 rule into X_train, X_test, y_train and y_test. Now look at this: I want the StandardScaler to learn the mean and the standard deviation from the train data - that's the reason I'm using fit. Whenever you use fit, you are saying "learn something". In the previous video I showed how to train the classifier using just fit; here I also want to transform X_train before I can put it into the classifier, so I call fit_transform. Then, for the test set, I'm calling a method which is just transform, so the learned mean and standard deviation are used to transform X_test. Take some time to absorb this information, go through the example, and try it on your own - always rewrite the code, don't copy it from GitHub. The next topic is polynomial features. You have this DataFrame of features - Age, DistanceFromHome, Education and so on. I showed you how to look at linear dependencies and correlations, and sometimes what helps is to multiply one column with another. When I say "helps", I mean it can improve the performance of the classifier later. To apply this combination of columns, something called polynomial features is used. To create a polynomial you need at least two columns, so say you have a column a and a column b: when you apply polynomial features of the second degree to them, the result contains the column a, the column b, the column a times a, the column a times b, and the column b times b. So we are artificially creating new columns in the DataFrame by multiplying the old ones. Let's go to the example: suppose you have a DataFrame with these two columns, and you apply PolynomialFeatures of the second degree to it using fit_transform. As you can see, the first column is always just 1; then there is a, then b, then a times a, then a times b, and then b times b, which is the last column. So it ultimately creates all the combinations of these columns and appends them to the end of the DataFrame. You can do this for the whole DataFrame and create combinations of every column, and this can significantly improve the performance of some classifiers. So I showed you how to normalize things and how to create polynomial features - now let's test what the impact on the final results is. First I'm splitting the dataset into X and y, and this is the same prediction as in the first-prediction lecture: numerical data without any normalization or polynomial features.
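Here is a rough sketch of both ideas: fitting the scaler on the train split only, and the two-column polynomial-features example. X and y are assumed to be the numerical feature matrix and target already prepared in the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

# Correct scaling: learn the mean and std on the train split, reuse them on the test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)       # fit = learn, then transform the train data
X_test_scaled = scaler.transform(X_test)             # only transform the test data

# Polynomial features on a tiny two-column frame
example = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(example))
# column order: 1, a, b, a*a, a*b, b*b
```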
You can see the baseline results here. Now, what I'm doing is creating polynomial features of the third degree - so there will also be terms like age to the third power and so on - and applying the standard scaler. First I apply the PolynomialFeatures transformation with fit_transform, then I add the scaler fit, and then I transform the test data the same way - always after fitting on the train data. Then I apply K-nearest neighbors to this scaled and transformed data and predict on the scaled and transformed test data. And as you can see from the results, which I'll put here: the precision for the negative class is the same, the precision for class one - predicting whether somebody will leave or not - is lower in this case, but the recall is higher, not by much, about 3%. So what did this transformation do? It improved the recall for class one. The point of this is that sometimes applying a modification or transformation improves something in your classifier, and sometimes it doesn't. You need to try the different possibilities and play with them to get the point - that is the work of data scientists and machine learning engineers: trying combinations of different data transformations to reach better results. In the next videos I'll also show you how to make this even easier. I also have an example with the MinMaxScaler, and if you look at it, the recall is even higher than here - so the MinMaxScaler was very helpful for these predictions. Okay guys, this was a little bit about numerical features; I hope you learned something new about scaling, normalizing and creating polynomial features. In the next video I will talk a little bit about categorical features. Thank you very much for watching, and I hope I will see you in the next video. 7. Categorical preprocessing: Hey guys, ItGuyMichal here and welcome to another lecture of our Skillshare course about classification in Python. In this lecture we're going to talk about categorical data, but before we move on to how to process categorical data in Python, let's talk about what categorical data actually is. I always like to explain these things with an example. If you look into general_data.csv, you can see numerical columns, which are quite straightforward: integers and float values, with essentially infinite possible values - an integer can range from minus infinity to infinity, and a float can be 0.1, 0.11111 and so on. Now let's talk about categorical data. In this example we will use text data as categorical data, but not all text data is categorical. If you look at the BusinessTravel column, you will see repeated values like Travel_Rarely and Travel_Frequently - that is an example of categorical data: a given text value can apply to multiple rows. But if the value in every row is unique, that's not a good example of categorical data; then we'd rather talk about unique data or free-text data.
And there are other kinds of processing for that kind of data. So it's not that whenever you see text in a column you should apply categorical preprocessing - always check the column first to know whether categorical processing applies or not. As you can see, most of the remaining columns - EducationField, Gender, JobRole, MaritalStatus, Over18 and so on - are great examples of categorical data in this dataset. So what are we going to do? First of all we import the libraries - the usual ones again. I'm setting matplotlib's rcParams for the figure size, so the plots throughout the notebook are the same size, and also the font size, so it's nice and readable - if you don't set these up you end up with smaller plots during your plotting, and I prefer it this way. This time, as you can see, I'm reading the whole general_data.csv, not just the numerical data as in the previous lecture. I'm dropping the not-so-useful columns I talked about in the previous video, and I'm deleting the rows with NA values - if there is at least one NaN value inside a row, I delete the whole row. Then I'm removing Attrition from the data, because I'm only going to work with the categorical features here, and I'm listing the categorical columns simply by asking which columns have dtype equal to object. The first, seemingly easiest technique in categorical preprocessing leads towards something called the one-hot encoder, but before we get there, let's start with label encoding: we import the class named LabelEncoder from sklearn.preprocessing, create a LabelEncoder object, and apply its fit_transform to each column inside df_cat, where df_cat holds the categorical columns of the dataset. After applying it, you will see that every value has been converted into a numerical type. For example, take the BusinessTravel column: if I go back to the original DataFrame, the values were Travel_Rarely, Travel_Frequently, Non-Travel and so on, and now we have substituted an integer for each text value. This is what's called encoding - numerical or label encoding - and it is one way to transform categorical data into numerical data. But used like this, it is highly discouraged. Why? Because machine learning algorithms will try to find a linear relationship inside a numerical field. What does that mean? Imagine you feed a column like BusinessTravel - or an even better example, Department - encoded as 1, 2 and so on into a classifier, a black box: this classifier will think there is a relation between department 1 and department 2.
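A minimal sketch of the label encoding step just described; the file name and the exact way of collecting the categorical columns are assumptions, and the caveats about why plain label encoding is discouraged continue below.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("general_data.csv").dropna()                    # path is an assumption
df_cat = df.select_dtypes(include="object").drop(
    columns=["Attrition"], errors="ignore")                      # keep only the categorical features

# fit_transform each column separately: every text value becomes an integer code
encoder = LabelEncoder()
df_encoded = df_cat.apply(encoder.fit_transform)
print(df_encoded.head())
```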
A good example is the K-nearest neighbors classifier: it will think that the distance between department 1 and department 2 is smaller than the distance between department 1 and department 3, so department 3 will look further away from department 1 in the feature space. It looks for numerical relations between the values of the column, and that's why it's not good to label-encode string categorical data into plain integers. Some of you may say: wait, Michal, but look at BusinessTravel - Non-Travel, Travel_Rarely, Travel_Frequently - there is a meaningful order there, a distance; the gap between Non-Travel and Travel_Rarely may well be comparable to the gap between Travel_Rarely and Travel_Frequently. True, but when you use the LabelEncoder out of the box, without any settings, you cannot control which number will be assigned to each category. If you do have that insight into the dataset, you can set up the encoding to represent the ordinal relationship between the values of the categorical column, which I highly encourage. For the purposes of this course, which is meant to stay simple, we're going to use a different technique. So this was just a short word about label encoding - always make sure you know what you're doing with the categorical data before you encode its values with the LabelEncoder. The second option is something called one-hot encoding, and I will explain it on an example. Imagine you have a DataFrame with two rows and a column 0, and the values of this column are a and b - a radically simple DataFrame. Now I use the OneHotEncoder from sklearn.preprocessing and apply fit_transform to this example DataFrame. It returns a sparse matrix - let's not talk about sparse matrices just yet; the thing is, I need to convert it to a DataFrame to make it readable for you. You will now see that the one-hot encoder created a specific column for each value inside column 0 - if there had been a c there, there would be one more column - and the number 1 appears only in the row, and only in the column, where that value resides. As you can imagine, when you apply this to a big DataFrame with a lot of values, you will get a lot of zeros and ones, and the most common value will be 0. Pause the video and think about this with the BusinessTravel column: the one-hot encoder will create a separate column for each unique value of BusinessTravel - for example a column called BusinessTravel_Travel_Rarely - and there will be a 1 only in the rows where Travel_Rarely resides, and zeros everywhere else in that column. And for this purpose, something called a sparse matrix is used: a sparse matrix essentially throws away the zeros, and its internal representation is different and much friendlier to your CPU and your hard drive.
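Here is a tiny, self-contained sketch of the one-hot example above. Note that newer scikit-learn versions use the sparse_output parameter while older ones call it sparse - which one you need is an assumption about your installed version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

example = pd.DataFrame({"0": ["a", "b"]})

encoder = OneHotEncoder(sparse_output=False)          # use sparse=False on older scikit-learn
encoded = encoder.fit_transform(example)

print(pd.DataFrame(encoded, columns=encoder.get_feature_names_out()))
#    0_a  0_b
# 0  1.0  0.0
# 1  0.0  1.0
```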
So that's the reason why it is stored like this. What's happening behind the scenes is that we're replacing the one column with a separate column for each of its unique values. So let's apply it to our dataset: I create the OneHotEncoder, and here I put the result of transforming the DataFrame of categorical values. As you can see, the columns have been created for each unique value, as I said: Non-Travel, with a 1 only in the rows where Non-Travel was present, Travel_Frequently, Travel_Rarely. In the first row of the original DataFrame the value was Travel_Rarely, and that's why the 1 is there. Now, this is much friendlier than the label encoder. Why? Because there are no artificial numerical relationships between the values - yes, there are 0s and 1s, but the distance between any two categories inside the column is now the same for every row. On the other hand, applying this transformation can make your dataset explode: you suddenly have a lot of columns and a lot of data in there, and if a column has a lot of unique values, you can even end up with more columns than rows, which is not a good situation. For this purpose there are techniques for choosing only the most important columns, called feature selection, which I will show in a later lecture. But for now this is enough, guys. Thank you very much for watching this lecture, please go through the examples in the notebook, try to apply them to your own dataset, and I hope I will see you in the next lecture. Thank you very much. 8. Final preprocessing: Hey guys, ItGuyMichal here and welcome to another lecture of our course, classification in Python for beginners, where I'm showing you practical applications of Python libraries to classify data. In this course we are classifying the HR case study from Kaggle, so please join us on this beautiful adventure. Let's go to our next notebook, which is called 005 final preprocessing. For those of you who are following along with the code, you can find the code examples linked in the description below or on my GitHub - please feel free to follow my GitHub. Now let's go back to the notebook. In this video I'm going to show you how to connect the preprocessing mechanisms I showed you in previous videos and apply them together to predict on the dataset, and I will show you how applying these preprocessing methods can improve classifiers - machine learning algorithms like, in this case, logistic regression - when making predictions. As you can see, in the first cell I'm importing a bunch of classes from scikit-learn. For those of you who are new to scikit-learn: you will see this scenario everywhere - everybody imports everything in the first cell, and it's good practice. Whenever you find, for example, the StandardScaler used with no definition above it, you can always go to the first cell, look for StandardScaler, and see where it was originally imported from.
I highly recommend, before moving on to any of these examples, that you go through the modules and classes which I'm importing. So first of all, I'm reading the DataFrame from the CSV file, general_data.csv, I'm dropping the unused columns, and I'm also dropping the rows with NaN values. In the next lectures I will show you different mechanisms you can apply to these NaN values, which is also part of preprocessing; for those of you who don't know anything about NaN values, I highly recommend my previous course on pandas basics, where I talk about NaN values and how to handle them. I'm also splitting the columns, as I did in the previous lecture, into categorical and numerical columns. And now, as you can see, I'm creating a specific instance for each of the preprocessors and assigning it to a variable: a scaler, a polynomial variable for PolynomialFeatures - here I'm setting the degree to 3 - and the encoder set to OneHotEncoder. I'm turning off the sparse matrices because I don't need them; we don't have so much data right now. Many of you may ask: okay, but what is "too much data"? You will find out what too much data is for your computer when you reach a certain limit of its computation - when the computer starts to overheat and it takes many minutes or hours to predict something. At that moment there is too much data for your computer, and you either need to buy a new one, run it somewhere else, or just wait for the results, as I always do. So let me go back to the predictions. As you can see, from the original DataFrame I'm putting away Attrition, which is the target value I want to predict. It's always important to put away the target value: if you forget about this and predict, you will suddenly have 100% on everything and think "man, I'm good, my classifier predicts at 100%" - and basically you will have a wrong classifier, because it just predicted based on the target value which was already present in the dataset. Then I'm putting the target value into the y vector variable, and I'm splitting the dataset into a train and test set based on the 80/20 rule, as I mentioned in the previous videos. Now, what's next? I want to train the logistic regression. For those of you who don't know what logistic regression is, you don't need to worry right now - I will explain it in the next video. To be honest, this course is more of a practical application of machine learning; it's not a mathematical-statistical course where I talk about the theory of machine learning algorithms - I would bore every one of you, you would just turn off the course, go away and never come back. I want to make this as engaging as possible with practical examples of the machine learning algorithms in Python, so I will keep the scope of this course to the practical applications. In the next video I will talk just a little bit about the algorithms, but it will be an easy, intuitive explanation, because I would like everybody to understand these videos - I try to make them as simple as possible, even for my mother or father watching this.
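As a rough sketch, the setup described above could look like this. The list of dropped columns, the file name and the sparse_output flag are assumptions for illustration - follow the notebook for the exact values.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder

df = pd.read_csv("general_data.csv")                          # path is an assumption
df = df.drop(columns=["EmployeeCount", "EmployeeID", "Over18",
                      "StandardHours"], errors="ignore")       # illustrative "unused" columns
df = df.dropna()

X = df.drop(columns=["Attrition"])                             # never leave the target in X
y = df["Attrition"]
categorical_columns = X.select_dtypes(include="object").columns
numerical_columns = X.select_dtypes(include="number").columns

scaler = StandardScaler()
polynomial = PolynomialFeatures(degree=3)
encoder = OneHotEncoder(sparse_output=False)                   # sparse=False on older versions

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```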
Okay, so let's go back to the training data. Here, as you can see, I'm using the fit_transform method on the numerical columns of X_train first: I'm fitting the polynomial features so I can later apply them to the test set, and I'm saving the result into the poly variable. Then I transform that polynomial output with the standard scaler and save the result into the scaled variable. The same thing I'm doing with the encoder - which is the one-hot encoder in this case, not the label encoder from the previous video - and putting it into the encoded variable. As you can see, the numerical part now has 560 columns. Quick pause and a small question: why are there so many columns? The answer is because I set the polynomial features to degree 3; if I set it to 2 you will have fewer columns, and if you set it to 1 you are not applying any polynomial features at all. Then, in the next row, I'm concatenating the scaled and encoded parts into one matrix. Well, to be honest, this is not a DataFrame - it's a numpy matrix. Sorry, let me just run everything - yeah, as you can see, it's an array of arrays, a numpy array, but you can still convert it into a DataFrame if you want; I prefer to work with numpy here. So we have this transformation, and now I'm applying the transformation to the test set. Notice the difference: instead of calling fit_transform, I am just calling the transform function of each of the preprocessors. At the end I have a test set which has the same number of columns as the training set. Well, is that it? What's next? Now we are going to compare the prediction performance of logistic regression without the preprocessing and with the preprocessing. Logistic regression is my favorite - I did my master's thesis on logistic regression and its application to predicting emotions, so it's my first choice. Logistic regression is a simple algorithm, very fast but very effective, and it's used a lot in production code nowadays. Using it is as simple as setting up the LogisticRegression here. And as you can see, I'm not using X_train_pre but the X_train variable, which is the original DataFrame without any custom preprocessing, and I'm predicting on X_test. As you can see, it is unable to predict that anybody will go away from the job: I have 0 precision and 0 recall for the positive class - it never predicted that somebody would leave. So basically, what I just created here is something called a one-label classifier: if you look inside y_pred, you will see that there are only "No"s. It's a decent baseline - if you can beat it, you are at least doing something - but our classifier is basically just always predicting "No", which is as good as predicting the majority class, and sometimes you can do even better. So now, well, we need to transform the data in order to help the logistic regression, right? And the answer is yes: if you look at the logistic regression with X_train_pre, which is the preprocessed data from the numerical and categorical attributes shown in the previous videos, and I run the prediction, you will see that suddenly I have some great precision numbers, even for the "Yes" class.
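A rough sketch of the workflow just described, building on the variables from the previous sketch. The baseline logistic regression is fitted on the raw numerical columns only, since scikit-learn models can't consume string columns directly - that restriction and max_iter=1000 are my assumptions, not part of the lecture.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Fit every preprocessor on the train split only
poly_train = polynomial.fit_transform(X_train[numerical_columns])
scaled_train = scaler.fit_transform(poly_train)
encoded_train = encoder.fit_transform(X_train[categorical_columns])
X_train_pre = np.concatenate([scaled_train, encoded_train], axis=1)

# Reuse the fitted preprocessors on the test split (transform only, no refitting)
poly_test = polynomial.transform(X_test[numerical_columns])
scaled_test = scaler.transform(poly_test)
encoded_test = encoder.transform(X_test[categorical_columns])
X_test_pre = np.concatenate([scaled_test, encoded_test], axis=1)

# Baseline: logistic regression without the custom preprocessing
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train[numerical_columns], y_train)
print(classification_report(y_test, baseline.predict(X_test[numerical_columns])))

# The same model on the preprocessed matrix (polynomial + scaling + one-hot)
improved = LogisticRegression(max_iter=1000)
improved.fit(X_train_pre, y_train)
print(classification_report(y_test, improved.predict(X_test_pre)))
```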
I also get some recall - it's not the best; the K-nearest neighbors classifier was better before - but we are getting somewhere. The purpose of this video was to show you that some of the algorithms are unable to predict anything useful without preprocessing. K-nearest neighbors is quite forgiving: it can work with the raw data, you don't need to preprocess anything - or maybe I was just lucky with this dataset, who knows. Maybe for the project you will work with your own dataset and suddenly see that KNN returns nothing useful and just always predicts "No". In that case, try applying the preprocessing or a different algorithm. So you always need to know how to preprocess data, and as I showed in this video, by applying preprocessing to the data I was able to improve the logistic regression significantly - I got from nothing to at least somewhere - and an F1 score of 0.90 is quite good. So guys, thank you very much for watching this lecture, and I hope I will see you in the next one. 9. Algorithm explanation: Hey guys, ItGuyMichal here and welcome to another lecture of our course about classification in Python. This video is a little different from the others in this course: I'm going to talk a bit about the algorithms we use for classification. To make things clear, there are dozens, even hundreds, of algorithms applied today for classification, and in this video I'm only going to talk about the K-nearest neighbors classifier and a little bit about logistic regression. There are plenty of other options - Bayesian classifiers, neural networks, various types of deep neural networks, XGBoost, random forests and decision trees, for example. If you are interested in other classifiers, please go to the internet and look them up; there are a lot of great tutorials out there, and I refer to some of them during this video. As I said in the previous video, I don't want to make this theoretical - I like to explain things with practical examples - and since this video is the theoretical one, I'll make it as fast as possible. I have created a notebook with a nice picture for the K-nearest neighbors algorithm - KNN stands for K-nearest neighbors, and I will explain why. Imagine you have a dataset with just two columns, some x values and some y values, plotted on a grid: the x-axis at the bottom and the y-axis on the side. We plot the examples - say the square is class one and the triangle is class two, or in our case "he will leave the company" versus "he will not leave the company". Let's keep the example simple with two dimensions. So how do we predict - how does the classifier decide - whether a given new example, the circle, is a square or a triangle based on its two values? The algorithm puts the example into the feature space and looks at the K nearest neighbors. How simple, right?
So in this case it's looking at the three nearest neighbors, and as you can see there is one square and two triangles. To make it easy, the classifier guesses the class from the K nearest neighbors by majority vote. In this case the majority is triangle, so the circle will be classified as a triangle. How simple is that? Now, KNN can be quite inaccurate. Why? Because the first question is: how many neighbors are we going to use - one, two, twelve? A common rule of thumb is to keep the number fairly low, and there are other recommendations you can find on the internet for choosing the number of neighbors, but it's always a tough decision and the right number can vary depending on the dataset. The next question is: how are we going to measure the distance between these points? There are a lot of distance metrics, and many of you may say: well, it's simple, this is a 2D space, so Euclidean distance - the Pythagorean one. But you need to understand that when you have a dataset with 521 columns, you are going to work in a feature space with 521 dimensions. Most of our brains can't even visualize a 500-dimensional space, but the algorithm can do it behind the scenes, and it starts to compute distances inside this multidimensional space. So which distance metric are we going to use - Euclidean distance, Manhattan distance, Minkowski distance? That's a tough question, and I will show you how you can evaluate the classifier based on the distance metric in the case of KNN, and how you can evaluate the number of neighbors you are using, and so on. So this was KNN. Now let's talk about logistic regression. Logistic regression was derived from linear regression, so first let's talk about linear regression. Imagine that this time we are not going to predict a class but a value: x is, for example, the number of rooms in the house, and y is the predicted price of the house. In this case we are working with just one dimension, the number of rooms. And as you can see, if there is one room the price is 6, if there are two rooms the price is 5, and if there are three rooms the price is 7. What does linear regression do? It tries to create a line that goes through these examples, and then when a new example comes - for example, if you predict what the price will be when the number of rooms is five - the regressor predicts the value on the line, so it will be around ten. So linear regression tries to fit a function - the line - to the training data, and then, based on this trained line, the regressor predicts the value for a new point. The line is found by minimizing the error with respect to the training data: these red dots are the training data, and the algorithm sums the distances between the training examples and the line and tries to place the line so that this total distance is as small as possible.
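Before moving on with the regression part, here is a tiny sketch of the majority vote just described; the coordinates and labels are made up purely for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier

# Two features per point; class 0 = "stays", class 1 = "leaves" (toy data)
X_toy = [[1, 1], [2, 1], [1, 2],
         [6, 5], [7, 6], [6, 7]]
y_toy = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_toy, y_toy)

# The three nearest neighbours of (6, 6) all belong to class 1, so the vote says 1
print(knn.predict([[6, 6]]))        # -> [1]
```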
Yes, there is a lot of math behind the scenes - derivatives are applied, because we are trying to find the minimum of that function - but as I said, this is not a mathematical course, so you can find the full explanations in books and on the internet. So that was linear regression, and it predicts numbers, but we want to predict classes, Michal - so how are we going to do that? Logistic regression is almost the same. Say you have hours of study on the x-axis, and on the y-axis the probability that you will pass the exam. There are only two outcomes: you pass or you don't. And as you can see, by increasing the number of hours, more of the points are up at "pass". Logistic regression tries to fit a function that maps this behavior of the dataset, and then a rounding step is applied on top of that function. So if we ask the function what the probability is that you will pass the exam if you studied 3.5 hours, we land roughly here, and the function predicts a probability of about 0.75 - and 0.75 is rounded to 1, so the classifier yields the result 1. So again, logistic regression, similarly to linear regression, tries to fit a function to the data. This explanation could be an hour long with a lot of mathematical equations, but you don't need to know those equations when you apply the Python libraries, because the whole mathematics happens behind the scenes, and I don't want to bother you with derivations on paper. I could make a separate course for that, but this course is dedicated to those of you who just want to use the libraries in Python. Okay, so I hope this video gives you a little more insight into the algorithms I'm using throughout these examples - we have already used the KNN classifier and the logistic regression. If you are still confused, please write it down in the discussion and I will link you some books about this topic and some YouTube tutorials; I think I could even create a more mathematical course about this in the future, but for now I'll stay with the practical examples. Guys, thank you very much for watching this lecture - try it on your own, try some examples from the things we've been talking about in the previous lectures, and I hope I will see you in the next lecture. Thank you very much. 10. Feature selection: Hey guys, ItGuyMichal here and welcome to another lecture of our course, classification in Python, on Skillshare. Thank you very much for making it to this lecture, and without further ado, let's go into JupyterLab. This lecture is dedicated to something called feature selection. What does it mean? In previous videos I showed you how you can create new columns using polynomial features and how you can expand your dataset using the one-hot encoding. When you analyze and preprocess your datasets, you will reach the point where your dataset, your feature matrix, has a lot of columns - new columns keep being added by the preprocessing and so on.
When you have millions of rows and big DataFrames, having even more columns will make the algorithm very slow. And the number of columns doesn't always mean great accuracy: you will see during your data science career that having a lot of columns mostly causes trouble rather than performance improvements. To solve this issue there is an approach called feature selection. What feature selection does is search through the features of your dataset and identify those which are important for the classification, or for the statistics in general. There are a lot of approaches, and not every one of them is applicable to all datasets, but I want to talk about them so you know what to Google in the future, or maybe you will just use them - although to use some of them it's quite necessary to understand some laws of statistics and mathematics. With that mentioned, let's look at the first one. The first feature selection approach is called the low-variance elimination method. What it does is search through the features, the columns of the DataFrame, and compute their variance. The intuition behind it is that when a column has low variance, there is less information in it. So we specify a threshold, and all the columns whose variance is lower than the threshold are deleted from the set. How do we do that? First of all, notice that at the beginning I'm reading the DataFrame and specifying X and y, and here, as you can see, I'm using the scaler, because it's good to have the data on a common scale before comparing variances. Then, on the scaled data, I apply the VarianceThreshold, where I'm specifying 0.9 and saying: remove all the features whose variance is lower than 0.9. But then I got an error: "No feature in X meets the variance threshold 0.9". What does that mean? If you print the variance of the columns, you will see that every feature has a variance lower than 0.9 - and that's an issue. We can lower the threshold and then end up with just some of the data. But what is the right value for the threshold? How do you set it? Well, this is tricky, and most of the time it doesn't work all that well, but if you are a strong mathematician or statistician, you may be able to use it in your projects. Another approach is something called univariate feature selection, also referred to as SelectKBest. It selects the top k features based on a univariate statistical test. I'm not going into descriptions of what univariate statistical tests are; you just need to know that there is, for example, the chi-squared test, which is used to identify the features whose distributions differ across the classes. This is another approach to feature selection, and as you can see it's as simple as SelectKBest - you need to specify the test you're going to use, in this case chi2, which is part of sklearn.feature_selection. There are also other tests.
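A minimal sketch of the low-variance elimination described above, assuming X holds the numerical features from the notebook; the 0.05 threshold is only an illustrative value, since the lecture's 0.9 removed everything. The SelectKBest part continues right below.

```python
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold

X_scaled = MinMaxScaler().fit_transform(X)       # put all columns on the same 0-1 scale first

selector = VarianceThreshold(threshold=0.05)     # drop features whose variance is below 0.05
X_reduced = selector.fit_transform(X_scaled)

print(X_scaled.shape, "->", X_reduced.shape)     # fewer columns after the selection
print(selector.variances_)                       # the per-feature variances it computed
```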
I encourage you to go to the documentation of SelectKBest; then you just apply fit_transform - as you can see, you need to fit it on something and then transform. That was quite simple, but you also need to specify the k - well, we want to use feature selection, but we must say how many features we want to keep, and that can be tricky too: what number of features do we like, and so on. In the next video I will show you how to overcome these issues. Now the last thing I want to talk about is something called the recursive feature elimination method. Let's dive right into it. This method is always used together with a classifier inside it - or a regressor in the case of regression. What does the method do? It consists of a few steps: it starts with all the features, applies the classifier, and computes the performance of the classifier on all the features. To apply this method you need a classifier that can return feature importances - values that say which of the features are most useful for the actual prediction. Then, in the second round, the features with the lowest importance are eliminated and the whole process is repeated - the next round runs without one or two features, depending on the setup, and then the next round and the next - and the algorithm stops when the improvement is too small or there is none at all. So we run this loop as long as there are improvements, and then we stop the method. Now, those of you who are listening carefully may say: wait, Michal, are we using the test set to evaluate the performance of this method? No - you always run the recursive feature elimination method on the training dataset, and then something called cross-validation is applied. Basically, the recursive feature elimination takes the training set and splits it - let's say 80 to 20 - into a training part and a cross-validation part: it runs the round on the 80% and then evaluates it on the 20%. Sometimes a k-fold method is applied, where not just one split is used but multiple splits: the training set is split into various chunks of train and cross-validation data, and then the average is computed. Cross-validation can be quite confusing and I don't want to go into all the details; the thing I do want to say is that the recursive feature elimination method is run on the training set and not evaluated on the test set - it is a part of training. Right after the recursive feature elimination you have the features which are important for the algorithm we are using inside it. Those of you who are listening carefully may have some questions, and I will answer them right now. If you look here, I'm using the MinMaxScaler, and then I'm saying: okay, run the recursive feature elimination for the logistic regression.
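Here is a minimal sketch of the SelectKBest usage described above (k=10 is just an assumed value, and X/y are assumed to be the numerical features and target from the notebook); the recursive feature elimination walkthrough continues below.

```python
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2

X_scaled = MinMaxScaler().fit_transform(X)       # chi2 requires non-negative features

selector = SelectKBest(score_func=chi2, k=10)    # keep the 10 best-scoring features
X_best = selector.fit_transform(X_scaled, y)

print(X_best.shape)                              # (n_rows, 10)
print(selector.scores_)                          # chi-squared score of every original feature
```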
After the run of the recursive feature elimination, you will see that you have this True/False array - true, false, false, true, false, true and so on - which is saying which of the columns should be used: the ones labelled True are the most important for the logistic regression. It's good practice to use inside the recursive feature elimination the same classifier that will be applied later in the classification. And after running it on the train set, you still need to evaluate on the test set: first you select which features to use, then you retrain the whole classifier - the logistic regression - on the train set, and after that you predict on the test set. This can be quite confusing; when I was starting out I also had these troubles. The most important part is the keyword recursive feature elimination, which can help you select a subset of the features, not all of them, for your predictions. Now, if you are curious, this is the 005 final preprocessing notebook, which I covered in the previous lecture, where we were comparing the performance of the classifier with the polynomial features and the standard scaler - and here is the feature selection part where I used recursive feature elimination for the logistic regression. Some of you may ask: can I use the K-nearest neighbors classifier here? The answer is no, because you can run the recursive feature elimination only with classifiers that expose feature importances, and KNN doesn't have feature importances, so we cannot say which of the features are important for its results. That's the reason why the recursive feature elimination is not applicable to every classifier - but it is applicable to logistic regression. So I ran this RFE for the logistic regression: as you can see, I ran it on the whole X_train, fitting against y_train, and then I printed the support and the ranking, the same way as I put it here. The support is the True/False array over the columns, and the ranking is an output saying: order the columns based on the recursive feature elimination results. So you can see the first two columns are very important, while the third one is not so much, and so on. You can even get the column names, but you need to refer to the documentation for that - it's very useful. So I ran it on the whole training set, and for each iteration I got a warning. Please be aware that scikit-learn often outputs warnings; unless it's an error, just read the warning and wait for the output. Many times the warning is just saying the dataset is small, or that the solver did not converge within the iteration limit - basically saying: I'm not using many rows, my train set is very small - so I ignored them. As you can see, there were a lot of these warnings, and it took about 20 minutes, because internally the recursive feature elimination is running the logistic regression all over again for these cross-validation sets.
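A minimal sketch of the cross-validated recursive feature elimination just described, using scikit-learn's RFECV; the MinMaxScaler, cv=3, max_iter and the use of the numerical training columns are assumptions mirroring the lecture loosely.

```python
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFECV

X_train_scaled = MinMaxScaler().fit_transform(X_train[numerical_columns])

# Repeatedly drop the least important features, scoring each subset by cross-validation
rfe = RFECV(estimator=LogisticRegression(max_iter=1000), cv=3)
rfe.fit(X_train_scaled, y_train)

print(rfe.support_)      # True/False per column: which features were kept
print(rfe.ranking_)      # 1 = kept, higher numbers = eliminated earlier
```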
Over and over again, the recursive feature elimination takes the train set, splits it into a train part and a cross-validation part, trains the logistic regression on the train part, evaluates it on the cross-validation part, and deletes the feature with the lowest importance. Then in the next step the train set is split again into a cross-validation and train part - a different cross-validation portion is used, fewer columns are used - the performance on the cross-validation set is computed, and then the feature with the lowest importance is deleted, again and again, as long as the classifier keeps improving. As you can see, there were a lot of iterations, and if I go down, pressing page down, I have the True/False array for all the columns - there were a lot of important features - and then there is the ranking, showing which are the least important. And when I ran the prediction, I only had a minor improvement in the precision for the "Yes" class. So the lesson learned from this is that feature selection can help in some cases; here it helped just a little, and it even made some predictions worse - the recall is lower than without the feature selection. But sometimes feature selection can improve the performance of a classifier significantly. So it's test, evaluate, improve, repeat: you apply various mechanisms of data science to your machine learning problem and keep improving the classifier. In this case the feature selection was not working so well. Okay guys, this was a longer video about feature selection; I hope you learned some mechanisms for applying feature selection to select the important features, or columns, from your dataset. I hope I will see you in the next video. Thank you. 11. Grid search: Hey guys, ItGuyMichal here and welcome to another lecture of our course on classification in Python. In this lecture I'm going to talk about grid search, or hyperparameter optimization, hyperparameter tuning. Now let's jump right back into JupyterLab. So what is the issue? So far we have been using KNN and logistic regression as black boxes: we just imported the algorithms without any knowledge of their internals. Two videos ago I briefly described what KNN is, and I just said that it determines the class based on the neighbors from the training set. If you look into the scikit-learn documentation, the KNeighborsClassifier has all these parameters: you can set the number of neighbors, the weighting function, the algorithm, the leaf size, p, the metric, metric_params and n_jobs. All of these things are parameters of the KNeighborsClassifier, and so far we just imported and used it without any settings, so only the default parameters were used. Now, the thing is: what if some other parameter values are better for your classification problem? How are we going to find out? For this purpose, something called grid search is used. In this example I'm just going to test these three parameters. First, I'm going to test how many neighbors are optimal.
So, how many of the K that K-nearest neighbors stands for - how many neighbors should take part in the vote on the class? I'm going to try several values and see how the number affects the performance. Then, what distance metric should be used - how are we going to compute the distance in the feature space: Manhattan distance, Minkowski distance, Euclidean distance? And we are also going to tune something called leaf_size, which can noticeably affect the results; I won't go into its details here - you can read about it in the documentation on the internet, so please feel free. Many times there are parameters that even I don't fully understand in some algorithms, but using grid search allows me to experiment with these parameters and try all the possible combinations to find the best result - and that is exactly what grid search does. So let's look at it. I'm importing the KNeighborsClassifier, the splitting function and the metrics, and then I'm setting up the X/y and train/test sets the same way as before. As you can see, this is the prediction without any tuning: I just fit on X_train and y_train and get the results. Now I'm going to do the grid search. First, how do we specify it? You need to specify the model, which in this case is just an empty KNeighborsClassifier. Then you specify the parameters you want to play with. The p parameter says which distance metric we're going to use: as you can see here, in the case of 1 it's equivalent to the Manhattan distance, in the case of 2 it's the Euclidean distance, and in other cases it's the more general Minkowski distance - so I'm just going to play with the three options 1, 2 and 3: Manhattan, Euclidean and Minkowski. For the number of neighbors, I'm going to play with a list running from 2 up to 30, increasing by 5. Then I put this Python dictionary into GridSearchCV - the CV stands for cross-validation; I talked a little about cross-validation in the previous video and we can go over it one more time here. Here I'm setting the model, here I'm setting param_dict, which is the parameter dictionary I created above, and here the number of times the cross-validation should repeat for each combination of parameters. Grid search will create a separate run for each parameter combination, and it will always run them on the train set. For each run, the cross-validation will be done three times. What does that mean? In each cross-validation round, the train set is split into a training part and a cross-validation part - mostly 80 to 20, it depends. This is done three times: first, second, third, and each time the classifier is trained on the training part and evaluated on the cross-validation part. In each cross-validation round, different portions of the data are used for the training and cross-validation parts, so for each parameter combination you end up with three values.
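A minimal sketch of the grid search setup just described, assuming X_train and y_train are the numerical train data from the notebook; the exact value lists only mirror the lecture loosely, and the averaging over the three folds is explained right below.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

model = KNeighborsClassifier()

param_grid = {
    "p": [1, 2, 3],                          # 1 = Manhattan, 2 = Euclidean, otherwise Minkowski
    "n_neighbors": list(range(2, 31, 5)),    # 2, 7, 12, ..., 27
    "leaf_size": [1, 10, 30, 50],            # illustrative values
}

grid = GridSearchCV(model, param_grid, cv=3)  # 3-fold cross-validation per combination
grid.fit(X_train, y_train)

print(grid.best_params_)
best_model = grid.best_estimator_             # refitted on the whole train set with the best params
```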
Then the average of those values is computed as the result of the cross-validation for that grid search run. This is called the k-fold cross-validation approach, because we are running the cross-validation over k folds. The result is then computed for each of the grid search combinations of parameters, and the output is the best combination of parameters based on the cross-validation. So we are not using the test set in the grid search - we are using the train set, by applying the k-fold cross-validation technique to it. This could be quite confusing; if you're still confused about the k-fold cross-validation technique, write a comment down below in the discussion or try to find something on the internet - I'll be glad to help you. Then, as you can see, it's as simple as running grid.fit, which will try all the possible combinations of models. Then you can say: okay, give me the best model with best_estimator_ and show me its params. This went through 540 combinations of all these parameters, and it took 19 seconds, which is quite fast - there wasn't a lot of data, and the K-nearest neighbors classifier is very fast. The best combination was a leaf_size of 1, the Minkowski metric (a default parameter I didn't set), p of 3, and uniform weights (also a default). Then I'm using the best model to predict on the training and test data, and look at the accuracy and precision: I have 100% precision for class one, predicting whether somebody will go or stay. Of course I'm not catching all of them, because the recall is lower, but I have 96% overall precision and the F1 score is 0.96, which is a very good F1 score for this kind of problem. And as you can see from the confusion matrix, there are zero false positive cases: whenever I say that somebody will leave the company, he will leave the company. How amazing is that - by leveraging the grid search, I can find the optimal combination of parameters for the given KNN. Okay guys, so this was the grid search. In the next video I will show you how you can apply all the knowledge we've learned so far to this kind of example, and how you can build a very clean pipeline implementation of the machine learning workflow in JupyterLab. Thank you very much for watching this lecture. 12. Pipelines: Hey guys, ItGuyMichal here and welcome to the grand finale, the amazing last lecture of the course classification in Python for beginners - the practical application of machine learning algorithms in Python to a real-world example from Kaggle. In the previous video we reached quite a good precision for predicting whether somebody will leave, but we still have some room to improve for the people who will not leave. We are analyzing the HR dataset and we want to be as precise as possible. So in this video we're going to gather all the knowledge we've learned about preprocessing, about how to predict, about hyperparameter tuning and grid search, and we're going to apply it together to reach the best accuracy, precision and recall possible.
During the videos you also saw that you often need to transform the data here, store something there, pass it along - and it's quite easy to make a mistake doing that. scikit-learn has been around for many years, and the people behind it are amazing: they know that we are human and tend to make mistakes, so they created something called pipelines. I think that the scikit-learn pipelines are one of the best things in machine learning - yes, there is TensorFlow for neural networks, but besides that, scikit-learn with this pipeline machinery is amazing. Okay, so what are we going to do? Let's jump to the 009 pipelines notebook in JupyterLab. Here I'm importing all this stuff - just copy-paste it and read through it if you want; I'm using all the things I used throughout these lectures. I'm reading the DataFrame here and dropping the unused columns which, as I said at the beginning, make no sense to use. I'm putting the categorical column names in one variable and the numerical column names in another, and then I'm setting the dtype of the categorical columns to category so I can apply the categorical preprocessing as I did before - nothing special. Now I'm splitting the DataFrame into X and y, and then splitting those into train and test sets. Some of you may ask: what is this random_state? Python uses a random generator to create the train/test split, so it splits the data randomly every time you run it. To get the same results every time, you can say: run it with the random seed 42 - if you put random_state=42 there, you will always end up with the same train/test split. You can put any number there; it's just a seed. And now we will create something called a pipeline. I import Pipeline here, and I create a numerical pipeline which consists of the steps: a SimpleImputer, then a StandardScaler, then PolynomialFeatures - and that's everything, this is the numerical pipeline. So what is the SimpleImputer? I wasn't using it before: the SimpleImputer fills in the NaN values for us. If you remember, in previous videos I was just getting rid of the rows with NaN values, but there is also another good approach: filling the NaN values of the numerical data with the median, in this case - or the mean; you can put whatever strategy you want there, I just use the median because I'm a fan of the median. So if you put a DataFrame through this, instead of getting rid of the rows with NaNs it will set each NaN to the median, then it will standard-scale the data, and then it will compute the polynomial features of the second degree. Then the categorical transformer will be a pipeline with a SimpleImputer using strategy "constant" and fill value "missing" - what we are saying here is that every NaN value is replaced with the constant value "missing", so when there is a NaN in a categorical column, there will be the string "missing" instead. And then we are applying the one-hot encoder the same way as we did before.
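A minimal sketch of the two preprocessing pipelines just described; the step names and the handle_unknown option are my assumptions, not necessarily the exact ones used in the notebook.

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder

# Numerical branch: fill NaNs with the median, scale, then add degree-2 polynomial features
numerical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2)),
])

# Categorical branch: replace NaNs with the string "missing", then one-hot encode
categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
```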
And then we specify something which is called a ColumnTransformer, where we say: for the numerical columns, the column names stored in the numerical columns variable, please apply the numerical pipeline, and for the categorical columns apply the categorical transformer. And then at the end, and this is the most important part, we create the classifier as a pipeline whose steps are the preprocessor, which is the ColumnTransformer specified above, and the classifier, which is the K-nearest neighbors classifier. Take a deep breath: these ten lines of code now replace all the hundred lines which I wrote before. Everything is now brought together into one pipeline. Why is this even possible? Remember that in the previous videos I was always calling fit, transform and fit_transform on the classes which I imported from scikit-learn. scikit-learn does this for every one of its estimators, and these estimators can be combined in pipelines like this thanks to that consistent naming convention for the methods. I don't want to go too deep into the implementation details; the point is that you can create these pipelines as simply as this, and then you can just call fit on the resulting pipeline with the train data, predict on the test data, and compute the scores. Okay? So as you can see, if I fit the pipeline like this, the results with the baseline K-nearest neighbors classifier are quite poor, right? Nothing special. But we learned something about grid search last time, right? You can also print a nice diagram of the pipeline with set_config here, so you can see that there is a preprocessor ColumnTransformer, and that for the numerical data I'm applying these steps; you can even expand it. For the K-nearest neighbors there isn't much to see, but if it's a long pipeline, you can visualize it like this and click through it. But now let's apply the grid search to this. So how do you apply the grid search when you have a pipeline like this? I just copy-pasted it here, and I want to run the grid search only for the classifier. So in the param dict I'm saying: for the classifier, set the leaf size to these values, and for the classifier, set the n_neighbors to these values. Notice that there are two underscores between the step name and the parameter name, okay? (You can do even fancier grid search things, which I will show in a moment.) Then I'm setting up the grid search and running it. This ran for three minutes, quite long compared to the previous one, and I get the best estimator. And look at that, this is the best prediction which I had so far. The precision is 99%; it's not 100% as before, but the recall and the F1 score are better now. Okay, so how good is that? By running this pipeline with the grid search, I found the most optimal combination of parameters so far, and I reached a precision of 99%. So in almost every case I can be sure that when I say about somebody that he will leave the company, he really will leave the company, right?
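Here is a sketch of how the ColumnTransformer, the full pipeline, and the grid search on top of it fit together. The step names ("num", "cat", "preprocessor", "classifier"), the candidate values, and the variables numerical_pipeline, categorical_pipeline, numerical_columns and categorical_columns (from the previous sketch and the earlier data preparation) are assumptions; adjust them to match the actual notebook.

```python
# Sketch of the full pipeline plus grid search over its named steps.
from sklearn import set_config
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Route the numerical and categorical columns to their sub-pipelines.
preprocessor = ColumnTransformer(transformers=[
    ("num", numerical_pipeline, numerical_columns),
    ("cat", categorical_pipeline, categorical_columns),
])

# The whole workflow as one estimator: preprocessing, then KNN.
clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", KNeighborsClassifier()),
])

set_config(display="diagram")  # in Jupyter, displaying clf now renders a clickable diagram

# Double underscores address a parameter of a named step:
# "classifier__n_neighbors" means the n_neighbors parameter of the
# "classifier" step.
param_grid = {
    "classifier__n_neighbors": [3, 5, 7, 9],
    "classifier__leaf_size": [1, 10, 30],
    # You can reach into the preprocessor the same way, e.g.
    # "preprocessor__num__poly__degree": [2, 3],  # much slower to search
}

grid = GridSearchCV(clf, param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)

best_pipeline = grid.best_estimator_
print(grid.best_params_)
```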
And some of you may say: okay, but what if I want to tune things like the degree of the polynomial features, or whether to put the median or the mean in the imputer? It's simple. You just put here the preprocessor, because it's part of the preprocessor, then you go one layer deeper, which is the numerical part of the preprocessor, then one layer deeper into the poly step, and the parameter there is called degree, right? So you can grid search even these preprocessing parameters. I don't want to run it right now, because if I put a degree of three in there, it will take, I don't know, an hour until it finishes; the polynomial features produce a big dataframe, and there will be a lot of combinations which the grid search needs to go through. But if you have a powerful computer, feel free to run it and you will see the results on your own. Guys, that was it; these were the pipelines. The pipelines are an amazing concept in scikit-learn. They helped me to create great, professional classifiers which were used in the real world, so I wanted to share this information with you. I will provide all the source code. Now, use the source code wisely. I'm not telling you that you cannot use it in your company; feel free to make money with it, and you can even reference me sometimes, it doesn't matter. But the thing is, you cannot just copy-paste this onto every problem. You need to think, right? Always ask: should I use the StandardScaler or the MinMaxScaler? Is it optimal to use the median strategy for this kind of problem? In this case it was optimal, but sometimes even the mean strategy, or sometimes getting rid of the NaN values, is better, right? So please always go through all these steps: make the exploratory analysis, make the preprocessing on your own, try to look at the data, visualize the stuff with charts, so you are a hundred percent sure you know what you are working with and that you know your data. Okay? Don't just be the fast guy who wants to do everything the same way and get the money for it. Okay? Enjoy the work, play with the data, and talk to others. You are not the smartest one out there; there are a lot of great data scientists who can help you. You can talk about your solutions with others, share your thoughts in discussions online or with your colleagues at work. Okay. I just wanted to show you this whole process of machine learning for classification. I hope you enjoyed it, and I hope I will see you in the next lecture. Thank you very much, guys. 13. Project: Hey guys, ItGuyMichal here, and welcome to the last video of this course about classification in Python on Skillshare. If you reached this video, thank you very much for going through all the lectures. I hope you learned something new about classification. In this video, we're going to talk about the project, which is your assignment for this course. Now, in my previous courses I made it a little loose, right? I just let you play around with different data and stuff like that. I want to make the assignment, or the project, in this course a little different. I want you to analyze one exact dataset. The dataset is about the prediction of rain, like whether it will rain or not, in Australia. There is a very nice dataset linked in the project description of this course, so please download the dataset as the first step. Now, during the project, please go through each of the steps which I was explaining in this course.
So namely, after downloading the dataset, go through the description of the dataset on Kaggle: go through the columns, what are their meanings, what are they referring to, whether they are numerical, categorical or text, right? Make sure you are working only with the numerical and categorical data from the dataset; please don't use the text from there. If you are interested in how to use text in classification, I can create a video, just leave a comment, but this time use only the numbers and the categorical data. Now, after going through the columns and the features of the dataset, please make sure to go through the exploratory analysis, right? So plot the numbers, look at the distributions, plot the scatter plots, plot the correlation matrix to see which are the most valuable features for predicting the rain in Australia. Now, after doing this, make sure you go through the preprocessing step and evaluate your preprocessing with a first prediction, right? Create a simple KNN classifier with the default values, or a logistic regression without tuning anything, and see what the predictions are. Print the precision, recall and F1 score, and get the confusion matrix. Okay, get familiar with the data and make sure you understand what you are trying to predict. Now, after the preprocessing, you can apply the feature selection algorithms; try to play with them and find out which are the best features for your prediction, for your classifier. Then, at the end, apply the grid search to find the most optimal parameters of the classifier, and don't forget to apply the pipelines to make your code more readable and more efficient (there is a small starter sketch at the very end of this lecture to get you going). Then you can take a picture of your best prediction and please post it into the project section of this course; we can make a small challenge of who from the students can predict with the highest precision, recall and F1 score on the dataset which is in this project, the rain prediction dataset. Okay guys, if you are more into machine learning and classification, I will publish some more courses later. And please don't forget to practice the code. Everything that you learn is important to be practiced; you need to be able to write it any time. So find some datasets of your own choice and try to predict on them. You can also post some of your results here. I'll be glad to help you with any issues you may have during this project. Guys, thank you very much once again for watching all these videos and lectures of this course. Please give me feedback; I want to improve myself and make the courses a little bit better. Thank you very much, guys. I hope I will see you in the next course.
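As promised, a minimal starter sketch for the project. The file name weatherAUS.csv and the target column RainTomorrow are my assumptions about the Kaggle rain-in-Australia dataset; check the dataset linked in the project description and adjust the names to match what you actually download.

```python
# Hypothetical starter skeleton for the rain-prediction project.
# "weatherAUS.csv" and "RainTomorrow" are assumed names; verify them
# against the dataset linked in the project description.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("weatherAUS.csv")

# Drop rows where the target itself is missing, then separate X and y.
df = df.dropna(subset=["RainTomorrow"])
X = df.drop(columns=["RainTomorrow"])
y = df["RainTomorrow"]

# Fixed random_state so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# From here: exploratory analysis, preprocessing pipelines, a baseline
# KNN / logistic regression, feature selection, grid search; the same
# steps as in the lectures.
```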