Machine Learning Regression Profits - Top 8 REGRESSION Models You Must Know in 2021 | Python Profits | Skillshare



Python Profits, Master Python & Accelerate Your Profits


Lessons in This Class

63 Lessons (6h 12m)
    • 1. Course introduction

    • 2. Course benefits

    • 3. Course methodology

    • 4. Course overview

    • 5. Machine learning big picture

    • 6. The need for pre processing

    • 7. Data pre processing normalization

    • 8. Data pre processing standardization

    • 9. Encoding categorical features

    • 10. Data pre processing discretization

    • 11. Handling missing values

    • 12. Custom transformers

    • 13. Feature engineering

    • 14. Train test split

    • 15. Cross validation

    • 16. Data pre processing quiz and exercise

    • 17. Data pre processing exercise solution

    • 18. Data pre processing summary

    • 19. What is regression

    • 20. Applications of regression

    • 21. Introduction to linear regression

    • 22. Scikit learn linear regression class

    • 23. Running linear regression on a dummy dataset

    • 24. Running linear regression on a Kaggle dataset

    • 25. Exploratory data analysis with visualizations

    • 26. Performance metrics measuring how well we are doing

    • 27. Current shortcomings

    • 28. Addressing the shortcomings by adding data preprocessing

    • 29. Putting it all together a full machine learning pipeline

    • 30. Exploring the model hyperparameters

    • 31. Ridge a more advanced model

    • 32. The linear regression showdown

    • 33. Linear regression exercise solution

    • 34. Linear regression summary

    • 35. Introduction to non linear regression

    • 36. Polynomial features for achieving polynomial regression

    • 37. Polynomial versus linear regression on two real world datasets

    • 38. Polynomial versus linear regression on two real world datasets

    • 39. Support vector regression

    • 40. Decision tree regression - 1

    • 41. Decision tree regression - 2

    • 42. Random forest regression

    • 43. Artificial neural network regression

    • 44. The final regression showdown

    • 45. Non linear regression exercise

    • 46. Non linear regression exercise solution

    • 47. Non linear regression quiz

    • 48. Non linear regression summary

    • 49. Making the most out of Kaggle

    • 50. Understanding an existing Kaggle approach

    • 51. Advanced ensemble methods intro

    • 52. Bagging

    • 53. AdaBoost

    • 54. Gradient boosting regressor

    • 55. Advanced ensemble methods - exercise

    • 56. Advanced ensemble methods - exercise solution

    • 57. Advanced ensemble methods quiz

    • 58. Advanced ensemble methods summary

    • 59. Capstone project

    • 60. Capstone project solution

    • 61. Questions and answers

    • 62. Extras tips, tricks, and pitfalls to avoid

    • 63. Tools and resources







About This Class

Big Chunks of Data Will Take Over the World Soon, and There’s Nothing You Can Do About It!

But, I can tell you how to make a career out of it if you want to earn a lot of money in the future. So, don’t exit this page without reading what I am about to tell you. 

Dear Digital Adventurer,

The datafication of our lives and everything around us started when the internet was first created. Now, we’re nearing the point where everything will be turned into digital information.

This is the reason many businesses are looking for experts to help them take advantage of this trend and position themselves as one of the leaders of their industry.

They do this by collecting information and using it as variables to predict possible problems and solutions that will help their business and customers.

If you’re someone who is looking for the next big thing after IoT or cryptocurrency, then this is something you should look forward to.

Just like its predecessors, Datafication opens up a lot of opportunities…

Especially in the data science industry!

So, if you are a machine learning or deep learning developer, and you want to take it to the next level, then this is definitely for you!

So, what’s the next level?

I know this is your question…

You’ve always wondered what the next step is after learning the basics of machine learning and deep learning, right?

So, to answer this question…

The next step is to learn Regression Analysis Models!

If you’re unsure what this is, here’s a simple explanation:

Regression models consist of a set of machine learning methods that allow us to predict a continuous outcome variable based on the value of one or multiple predictor variables. Its goal is to build a mathematical equation that defines the dependent variable as a function of the predictor variables.
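The definition above can be illustrated with a short sketch. This is a minimal example using scikit-learn's LinearRegression (one common regression implementation); the experience/salary numbers are made up purely for illustration and are not from the course:

```python
# A minimal regression sketch: predict a continuous outcome (salary)
# from one predictor variable (years of experience). Toy data only.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # predictor: years of experience
y = np.array([30, 35, 41, 46, 50])        # outcome: salary in $1000s

model = LinearRegression()
model.fit(X, y)

# The fitted model is the "mathematical equation": y = coef * x + intercept
print(model.coef_, model.intercept_)

# Predict a continuous outcome for an unseen input
print(model.predict([[6]]))
```

The key point is that the output is a continuous number, not a class label; that is what distinguishes regression from classification.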

It means you’ll be dealing with numbers and a lot of data when using this model.

What does this have to do with the opportunities I mentioned earlier?

The answer: Everything.

You see, understanding the concepts of regression models and implementing them properly provides two major benefits in a business.

First, it helps the company predict the future demand for their products or services. This means they can adjust their processes accordingly to make sure they can provide the quality they’ve promised to their customers.

Second, it allows them to fine-tune their operations and improve the quality of their products and services.

So, why is this important for businesses?

The simple answer is it greatly affects the revenue they generate.

The better they solve problems, the more money comes in.

What’s in it for you?

Remember, I mentioned earlier that you will greatly benefit from this trend, right?

If you are looking for a lucrative career where you can earn a hefty amount of money, then this is for you.

A lot of companies will be looking for the best data scientists and machine learning or deep learning experts out there.

This means the trend is just starting…

So, you have the chance to ride the tide as early as now and reap what you sow later on.

You have to start planting the seeds of knowledge within you that will make companies want you to work with them.

And if you already have the basic knowledge of machine learning and deep learning, you are already ahead of the pack!

Now, this doesn’t mean they can’t catch up to you.

This is why the only way to stay ahead is to continuously improve your knowledge!

So, here’s the deal…

We want to offer you the chance to learn the advanced lessons that will help you easily apply your machine learning and deep learning knowledge in your day-to-day tasks!

The ONLY course you will ever need to take your machine learning and deep learning knowledge to the next level!

You also get these exciting FREE bonuses!

Bonus #1: Big Insider Secrets
These are industry secrets that most experts don’t share without getting paid thousands of dollars. These include how they successfully debug and fix projects that are usually dead ends, or how they successfully launch a machine learning program.

Bonus #2: 4 Advanced Lessons
We will teach you the advanced lessons that are not included in most machine learning courses out there. They contain shortcuts and programming “hacks” that will make your life as a machine learning developer easier.

Bonus #3: Solved Capstone Project
You will be given access to apply your new-found knowledge through the capstone project. This ensures that both your mind and body will remember all the things that you’ve learned. After all, experience is the best teacher.

Bonus #4: 20+ Jupyter Code Notebooks 
You’ll be able to download files that contain live code, narrative text, numerical simulations, visualizations, and equations that most experts use to create their own projects. This can help you write better code that you can use to innovate within this industry.

Meet Your Teacher


Python Profits

Master Python & Accelerate Your Profits


We are Python Profits, and our goal is to help people like you become better prepared for future opportunities in Data Science using Python.

The amount of data collected by businesses has exploded in the past 20 years. But the human skills to study and decode it have not caught up with that speed.

It is our goal to make sure that we are not left behind in terms of analyzing these pieces of information for our future.

This is why throughout the years, we’ve studied methods and hired experts in Data Science to create training courses that will help those who seek the power to become better in this field.





1. Course introduction: Regression models, with real-world exercises. Let me tell you a few things about me first. Most importantly, I hope that I am able to teach you and inspire confidence in you for learning these topics. I have over six years of teaching experience at the high school and college levels. I have a PhD in machine learning, focused on applying machine learning in various other fields to solve important problems in those fields. So my PhD is quite practical, which is what the focus of this course will be as well. Last but not least, I love to explore all things tech related. The target audience for this course is anyone interested in studying machine learning, regardless of their programming background. Whether you are just starting out with coding in general or you are an experienced coder seeking to broaden your skill set, this is the right course for you. You will learn how to get things done fast and correctly, and how to understand them at an intuitive level. The emphasis in this course is on practicality. This means that we will only cover the theory that is absolutely necessary for understanding something. Otherwise, we will focus on getting things done and understanding how to use the right tools for that. We will also favor libraries that focus on ease of use, accomplish things fast, and are used for developing real-world applications.

2. Course benefits: Here are some of the benefits that you will get by studying this course. It's going to be highly practical and applicable. You will learn how and where to find the real-world datasets that highly skilled professionals use to do their jobs and add value to their businesses. Next, it's going to be suitable for everyone. We will introduce all the necessary prerequisites before advancing to new topics, and we will make everything as easy to understand as possible without sacrificing depth and correctness. Another important benefit is that you will learn essential skills.
Most of the jobs on LinkedIn currently have some machine learning or data science related requirements, and this will only become more true as the field advances each day. No one wants to get left behind, and it's as easy as ever not to be. Last but not least, we will use the latest stack and datasets developed by industry-leading professionals and researchers. So we will not teach anything that is already obsolete. We will also teach best practices, which, compared to the tech itself, do not change often, so you will be able to adapt to new tech easily.

3. Course methodology: Let's talk about the methodology employed in this course. First, we have the how, when, and why approach. You might have noticed that we've already covered how, when, and why in relation to this course's overview: we've talked about what we'll be teaching, why, a bit about how, and now we're talking more about the how. This will be done throughout the course for every topic and concept, so that you have a clear picture of how to do something, when to do it, and most importantly, why you might want to do it in the first place. Next, we will have quizzes. It's important to test your understanding of things, so we will have regular quizzes to make sure that you are following along. They will all be explained if you ever get stuck. Next, we will also have exercises. It's also important to put what you learn to good use outside the classroom. The exercises we have prepared for you will help you do just that by being as similar as possible to what you might encounter in the industry. Don't be discouraged, though: you won't have to spend hours on each one. Last but not least, tutorial hell is when you are stuck just following tutorials about a technology but still don't have a clear idea about what to actually do with what you're learning. We'll definitely want to avoid that by always pointing out things that you can build on your own to avoid this pitfall. 4.
Course overview: In this video, we are going to go over the course contents. Here is the course map. We are going to start with the crash courses, which you should follow if you think you need to brush up on your knowledge of pandas, NumPy, Matplotlib, Seaborn, and so on. These are very important because we will make quite heavy use of them, so it's very important that you are familiar with them. If you think you are not, please watch the corresponding crash courses. Next, we're going to move on to data preprocessing. Here we are going to talk about cleaning and preparing data for machine learning. This is a very important step that is always done, so it's a very important section that you shouldn't skip. You should watch the videos in the order presented in this course map; that will ensure the best possible understanding. Finally, we'll move on to regression and its applications, which starts the main part of this course. Here we'll talk about some generalities, give some examples of regression, and also discuss its usefulness and its applications. Then we will move on to linear regression, which will be the first really hands-on chapter of this course. We are going to do actual data preprocessing on real-world data, and we're going to introduce the easiest forms of regression using scikit-learn. This is basically the module in which we start getting our hands dirty and writing code. Next, we're going to move on to a bit more advanced stuff with non-linear regression. We are also going to introduce more complex datasets, which will require more data preprocessing and analysis. This is where things will start to look a lot more professional and real-world like, so this is a very important module as well. We are going to continue with a tips and tricks module, where we will discuss how to make the most out of Kaggle as a community, how to use the existing solutions on Kaggle, and so on.
From there, we will move to more advanced models of regression. In particular, we are going to discuss ensemble methods, which make use of multiple regression models whose results are combined in certain ways such that the final result is better than the result of a single model. This will show that more can actually be better. And from there, we will end this course with the capstone project, which is a rather big project that you will have to do from scratch. This is where you test yourself and make sure that you have gotten the most out of this course. Here you should be able to apply most of the things we've talked about by yourself, without looking at our code. And don't forget that you also have some extras, such as a Q&A video and a video with common mistakes. These are helpful in order to answer any questions you might have, and to make sure that you don't fall for some common mistakes that beginners usually tend to make. Also, most of these chapters end with a quiz and an exercise. Make sure that you do the quiz and the exercise, because they will make sure that you have understood the content properly and are able to move on to more advanced content. Also note that all videos have associated Jupyter notebooks that you can check. So after watching a video, you might want to open the Jupyter notebook and take a closer look at the code, run it, try to change it, and so on. It will help you understand things better if you do that. However, keep in mind that these are examples and that it's always possible to improve them. In fact, some of them are purposefully not made to be the best that they can be, in order for you to dig into them, experiment, and make them even better, or to fix some bugs that were left there on purpose. But who knows, maybe not even on purpose.
The conclusion is that it's very important to work by yourself. Even if you think you understood the code and the concept, it's always important to write code yourself and implement those concepts yourself from scratch. It's usually trickier than it sounds, and it's definitely a good exercise that will significantly improve your understanding of the subject matter. So I strongly suggest you do this: rewrite the code, mess around with it, break it, fix it, et cetera.

5. Machine learning big picture: In this video, we are going to talk about the machine learning big picture. Let's talk a bit about machine learning in general, as an overview of the field. So machine learning basically means programs that can perform tasks without explicit instructions. They do this by learning from data how to perform those tasks. The market size is expected to reach almost $100 billion by 2025; you can read more about this fact at this link here. It encompasses software and hardware alike. By software, we mean various algorithms that can process the data and learn from it. By hardware, we mean things like processors and even video cards that can make this learning happen much faster due to their processing power. A lot has been invested in this, and there are some very powerful dedicated machine learning hardware systems out there that facilitate this learning. Because of this, it's quite ubiquitous across the entire IT sector. Both software professionals and hardware engineers have definitely heard of it, and we'll definitely hear about machine learning a lot more in the coming years. Also, it can work with huge amounts of data, because there is dedicated hardware that can run the necessary algorithms for extracting useful information from such data. It can automate a lot of jobs because of this as well, and we will see some of them in a few minutes.
However, it can also create a lot of jobs. So people losing jobs is not as big of a problem, because new jobs will also be created. It's more like a shift: some jobs will go extinct, but others will come up to replace them. Let's see some real-world applications of machine learning. Let's start with self-driving cars. For example, Tesla does this surprisingly well, and in this video you can see an example. This example is filmed in the UK, where the autopilot feature of Tesla is not even the best performing one; in the US, it performs quite a bit better. And there are already some US cities that have implemented self-driving taxi services. So this is one job that we might not see anymore in the future: the driver job. In a few years, we might no longer see taxis with a human being behind the wheel. Or, to go a step even further, we might no longer see big trucks with people behind the wheel. This will significantly reduce costs; however, people will have to adapt to the changing economy with these kinds of jobs going extinct. It also has very important applications in cancer diagnosis. For example, an AI developed by Google can now beat doctors at breast cancer detection. You can read more about that at this link here. Next, we have video upscaling, and not just any upscaling, but to 4K quality, and also not by a giant company either this time: someone used neural networks to upscale a famous movie from over 100 years ago to 4K quality. And that's an amazing accomplishment, if you ask me. What would you like to see machine learning do? We've seen examples from three different categories: self-driving cars, medical applications, and entertainment applications. There are a lot more applications; honestly, there are simply too many to list here. We could spend hours listing them all and discussing them. The good news is that whatever you can think of, it's very, very likely that machine learning has some application to that thing.
Even if it cannot do something directly by itself, it can definitely help in some way. All you need is imagination and some skills, which hopefully we can develop in the upcoming course. Machine learning is big, and by big, we mean the following. The demand is very high: according to LinkedIn's Emerging Jobs Report for 2020 in the United States, which you can read here, AI specialist is the number one emerging job, with 74% annual growth over the last four years. AI specialist is an all-encompassing term; it does not refer just to machine learning experts, but to some other jobs as well. However, the unique skills required for it are machine learning, deep learning, TensorFlow, Python, and natural language processing, all of which are covered in our upcoming courses. Data scientist, which is also an occupation closely related to machine learning and machine learning jobs in general, is at number three, with 37% annual growth. The unique skills for this are machine learning, data science, Python, and Apache Spark, which we also cover with courses. This is definitely mirrored on a global level. Even though LinkedIn's Emerging Jobs Report is only for the US, if you search for jobs in Europe, Asia, and so on, you will see that the same thing is definitely taking place on the other continents as well. So it's definitely a global phenomenon, and you can find the need for machine learning experts in pretty much any country. This is also evidenced by the fact that big companies like to snatch up AI companies and startups. You can read more about this here. A good example of this is Siri, which was a separate company that was acquired by Apple, and there are many others as well. Companies like Google, Facebook, Microsoft, and Apple like to acquire these companies in order to stay ahead of the game and to get an edge on the competition in development and innovation.
These are all innovative companies that like to think outside the box and develop great tools that will benefit people using machine learning, and of course, the big companies want to stay on top of this innovation competition. Machine learning is definitely the future. Everything digital nowadays generates huge amounts of data, and we need ways to extract information from this data, which is what data science, AI, and machine learning help with. We need algorithms to help us make decisions, because this is a lot of data, and therefore a lot of information that we have to process. And unfortunately, evolution does not work as fast as technology does, so we are not able to process all that information well on our own. That's why we need algorithms to help us, or even to decide for us; for example, a self-driving car decides for us what it should do. It's a high-impact, highly disruptive field with relatively low barriers to entry. The courses we provide and the information available online will definitely enable you to make a contribution and work in this field. It's highly disruptive because of its very nature. Can you imagine all the big trucks on the road today being without a driver behind the wheel in just a few years? Machine learning can very likely do that. Job automation means more income for companies and more specialized skill requirements from employees. That's why now is a great time to start learning. You might think that you're a developer and machine learning is not going to threaten your job anytime soon, but you never know how fast these things progress, so don't delay. It's not just drivers that are going to be affected by this. Even though it's so powerful, it's still in its infancy. It has only been a buzzword for a few years. In fact, maybe even less than ten years ago, the term machine learning might have only been known to experts and researchers.
But now most people with technical inclinations have likely heard about it. So imagine what this field will be in just a few years from now. That's why it's definitely the future, and it's definitely wise to start learning about it. It will, without doubt, disrupt everyone's life for the better very, very soon, if it hasn't already. Also, there is great demand without it being over-hyped. It's not just something with theoretical applications; we have concrete results available. We have cars with autopilot features, and we have systems that can make medical diagnoses better than doctors can. Those doctors, in order to be able to make those diagnoses, have studied for years, maybe even decades, and now we have a relatively simple algorithm that can beat those medical professionals. And that is a great thing. It's a very concrete thing, something that definitely exists, and it proves that this isn't just hype. It's something real and very useful.

6. The need for pre processing: In this video, we are going to talk about the need for preprocessing. An important issue in machine learning is that the algorithms are very sensitive to the data that goes into them. Perhaps the most important sensitivity is to different data types. Most algorithms only work on numerical data, but real-world data is obviously not just numerical: we have images, text, and all kinds of different binary data. All of these must be transformed into numerical versions if we want to be able to feed them to machine learning algorithms. We will see in the upcoming videos how this data type transformation is done during data preprocessing. Just having numerical data is still not enough for the machine learning algorithms to behave well and learn what and how we want them to. Another important issue is that numerical data often has different orders of magnitude. That is, some numbers in our data might be very small and some might be very large.
For example, let's say we have some data about weights and distances for some objects. If we had 0.3 kilograms for an object and a distance of 3,000 kilometers for the same or even for another object, then this wouldn't be a good thing to feed to most machine learning algorithms. The difference between these two numbers is simply too large. The algorithm doesn't care about where in the dataset this difference comes up; it just won't like the fact that we feed it such strikingly differently sized numbers. In general, we want to keep our numbers small, so that they are close to one another. Something like 0.7 and 1 is very likely to make most algorithms perform better. We will see how we can achieve this in upcoming videos. Note that I removed the kilograms and kilometers from the new values. After a preprocessing step that changes the orders of magnitude, the units are almost always not going to make sense anymore, but that is fine: the algorithms don't really care what those numbers represent anyway. We will see that some algorithms are even pickier and like the data we feed them to have some special mathematical properties as well. During preprocessing, we can change the data such that it gains those properties, and that makes learning easier. These properties and how to obtain them can be quite theoretical, so we will discuss them a bit later on in this section and see what they refer to with practical examples. For now, it's enough to know that this can be an issue and that it has solutions.

7. Data pre processing normalization: In this video, we are going to talk about normalization. This usually refers to scaling our numerical data to be between 0 and 1. So we are going to assume that our data is numerical, and we will see how we can achieve this for pandas and NumPy data using scikit-learn and some pandas methods. Normalization can also refer to a few different things as well.
But for now, we are only going to focus on this 0-1 scaling, as it is the most often used meaning of the term. So let's start by importing NumPy and pandas. We're going to declare a NumPy array that will store our data, and also a pandas DataFrame that will store the same data; the difference is that the DataFrame will also have column names. And let's display them. This is the NumPy array, and this is the pandas DataFrame. As you can see, they contain the same data. Next, we're going to see how we can do this using scikit-learn. So first we have to import MinMaxScaler. Next, we have to instantiate this object; we won't be passing any parameters, as the defaults are satisfactory for now. Next, we're going to store the result in another variable called scaled_data, and we're going to assign to it normalizer.fit_transform, to which we pass the original data. What fit_transform does is two things. First, the fit part makes the normalizer compute the minimum and the maximum values for each of the columns in our data. Then the transform part uses those minimum and maximum values to transform each column accordingly; we will see exactly how in a few seconds. For now, let's just print this scaled data. You can see it looks like this, and the important part is that each number here is between 0 and 1. So how did this happen? Well, it uses the following formula. For each number, let's say 100 here in the first column, what it did is subtract the minimum of that column from it, so in this case this 4 here, and divide the result by the difference between the maximum in that column, which is 10,000 in this case, and the minimum in that column, which is 4 in this case. So we will write here 10,000 minus 4. Let's see what the result is. You can see that the result is the same as the one scikit-learn has computed for us. So this is all there is to the normalization process.
You simply subtract the minimum from the number and divide the result by the difference between the maximum and the minimum for that column, and you do this for every number in every column. Let's see how we can do this using pandas now. In pandas, it's actually a one-liner. We can write the following to get a new DataFrame that will contain the scaled numbers: scaled_df = (df - df.min()) / (df.max() - df.min()). This is the same formula that we have explained above. And if we show scaled_df, you can see that the values are the same as the ones we obtained using scikit-learn's MinMaxScaler. So this is how you can achieve normalization, or more specifically min-max scaling, using scikit-learn and pandas.

8. Data pre processing standardization: In this video, we are going to talk about standardization. Standardization refers to scaling data so that its mean is equal to 0 and its standard deviation is equal to 1. If you are not familiar with standard deviation, don't worry about it; you don't really need to be in order to understand this video. Just know that this method solves some statistical issues that certain datasets have that affect certain machine learning algorithms. We are going to use the same data as we used for normalization, and we are going to have it in both a NumPy array and a pandas DataFrame. We can display the data by running the code below, and you can see that it looks the same. In order to achieve standardization in scikit-learn, we will have to import the StandardScaler class from sklearn.preprocessing, instantiate the StandardScaler class into an object that we will store in a variable called standardizer, and call the fit_transform method on this object in order to get the standardized data from the original data. So standardized_data is equal to standardizer.fit_transform of the original data. And finally, let's display the standardized data.
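As a standalone sketch of the standardization recipe just listed (import, instantiate, fit_transform), assuming a small made-up single-column dataset rather than the video's exact data:

```python
# Standardization: rescale each column to mean 0 and standard deviation 1,
# via the formula (x - mean) / std applied per column.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = np.array([[100.0], [4.0], [10000.0]])
df = pd.DataFrame(data, columns=["weight"])

standardized = StandardScaler().fit_transform(data)
print(standardized.mean(axis=0))  # approximately 0
print(standardized.std(axis=0))   # approximately 1

# pandas equivalent; ddof=0 makes pandas use the same (population)
# standard deviation that NumPy and scikit-learn use by default
standardized_df = (df - df.mean()) / df.std(ddof=0)
```

Note the `ddof=0`: without it, pandas divides by n-1 when computing the standard deviation, so its results would differ slightly from scikit-learn's.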
If we run this cell, we get a NumPy array that contains the preprocessed data. And if you look at it, the average of the elements in each column is 0. Let's take a look at that. So np.mean, and we pass it a Python list containing the elements in the first column of our data array. But we actually have to pass the corresponding values after the standardization process, so we'll just copy-paste them. And if we run this, we can see that the mean is 0 and the standard deviation is 1. It's not exactly 1 here because of some rounding errors, but in reality it's actually 1. So what the StandardScaler does is apply the following formula for each number in the original array. For example, let's exemplify this with 100. It subtracts the mean, or the average, of all the elements in that column, and it divides the whole thing by the standard deviation of those same elements. You can see that this value is equal to the value computed by the StandardScaler. In pandas, we can do something very similar to what we did for normalization, and that is compute (df - df.mean()) / df.std(). And if we run this, we get a new DataFrame that contains the standardized values. But if we look at them, they are not exactly equal to the ones given by NumPy. That is because pandas computes the standard deviation a bit differently by default. It's not really a big deal; both of these are valid. But if you want to make them the same, you should pass ddof=0 to the std method. And now we get the same results. 9. Encoding categorical features: In this video, we are going to talk about categorical features. A feature will be a column in our dataset. Starting from this video, we are going to lose the training wheels and use a real dataset. In this case, we are going to use the 80 Cereals dataset from Kaggle.
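The standardization steps from the previous lesson can be sketched as follows, again with my own made-up sample data. Note the ddof detail: NumPy and StandardScaler use the population standard deviation (ddof=0), while pandas defaults to ddof=1:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up sample data: rows are observations, columns are features.
data = np.array([[100.0, 4.0],
                 [10000.0, 8.0],
                 [5000.0, 6.0]])

standardizer = StandardScaler()
standardized = standardizer.fit_transform(data)

# By hand: subtract the column mean, divide by the column standard deviation.
# NumPy's std() uses ddof=0 by default, just like StandardScaler.
manual = (data - data.mean(axis=0)) / data.std(axis=0)

# pandas: std() defaults to ddof=1, so pass ddof=0 to match scikit-learn.
df = pd.DataFrame(data, columns=["a", "b"])
pandas_version = (df - df.mean()) / df.std(ddof=0)

print(standardized)
```

After this transform, each column has mean 0 and standard deviation 1 (up to floating-point rounding).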
You can read more about it on the Kaggle website. All I'm going to go over in this video is that you can download it by clicking this button here, and that you need to save it in the same folder that your Jupyter notebook is in. Let's talk about how we can load this dataset. We can load it by importing pandas and using the read_csv function from pandas. So we're going to write df = pd.read_csv and pass it the filename. This has loaded the file into our df DataFrame, which we can now show by writing df in the next cell. And here we can see the dataset. One problem with this dataset that makes it unusable for most machine learning algorithms is that we have strings for the mfr and type columns. We also have strings for the name column, but the name is not all that important and we could simply remove that column. However, the manufacturer and type might be very important. So we're going to see how we can turn these categorical features into numerical features. We call them categorical features because each of these values, the N, Q, K under mfr, the C in the type column, and so on, represents a category for the manufacturer or for the type. We can turn categorical features into numerical features by using one-hot encoding. Let's see how we can do that using pandas. We are going to declare a variable called df_onehot_mfr, which is going to represent the one-hot encoded columns for the manufacturer column. And this is going to be equal to pd.get_dummies of df['mfr'], that is, the original column name in the df DataFrame. And the prefix for this will be 'onehot_mfr'. We're going to do something very similar for the type column, so we will only need to replace a few things: type here, here, and here. And let's show the df_onehot_mfr DataFrame. And here it is. You can see that we have seven columns, where we had one before.
So the mfr column has become seven columns that all contain numerical data. More specifically, they all contain 0 or 1. So why do we have a 1 here? Well, we have a 1 here because the original DataFrame had an N under the mfr column for the first row. Let's check that. So the first row, the mfr column, and we have an N. We have a Q for the second row and a K for the third row. So let's see what we now have in the one-hot encoded data. For the first row, we have a 1 under N. For the second row, we have a 1 under Q, and for the third, we have a 1 under K. So hopefully, this gives you an intuitive understanding of how one-hot encoding works. By doing it this way, we are helping many machine learning algorithms perform better, and for some of them, it's absolutely mandatory that we feed them the data in this format. However, so far, we have only obtained a new DataFrame for each of the columns that we wanted to one-hot encode. But we would like to concatenate these DataFrames into one big DataFrame that also contains the numerical features we had in the original dataset. So in order to do that, we're going to do new_df = df.drop and pass it columns= a list of the columns that we want to remove from the original DataFrame, that is, 'mfr' and 'type'. And this will get rid of the original mfr and type columns. We will replace those with our one-hot encoded columns. So if we show new_df now, you can see that it's missing the mfr and type columns, but it contains all the other columns, and all the other columns except the name contain numerical data. And now we just need to add our one-hot encoded columns back in. And we do that by using the pandas concat function, to which we pass a list of DataFrames. And those DataFrames are df_onehot_mfr, df_onehot_type, and new_df itself. And we pass axis=1 to tell pandas that this is a column-wise concatenation, so we want to add columns. Let's show new_df now.
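The get_dummies / drop / concat sequence can be sketched on a tiny frame. The data below is made up (three rows, one categorical and one numeric column), not the actual cereal file:

```python
import pandas as pd

# Tiny made-up dataset with one categorical column and one numeric column.
df = pd.DataFrame({"mfr": ["N", "Q", "K"],
                   "calories": [70, 120, 50]})

# One-hot encode the categorical column; the prefix names the new columns.
df_onehot_mfr = pd.get_dummies(df["mfr"], prefix="onehot_mfr")

# Drop the original text column, then concatenate the encoded columns back in
# column-wise (axis=1).
new_df = df.drop(columns=["mfr"])
new_df = pd.concat([df_onehot_mfr, new_df], axis=1)

print(new_df)
```

Row 0 had "N", so it gets a 1 under onehot_mfr_N and 0 elsewhere, while the numeric calories column is kept unchanged.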
You can see here that we have the one-hot encoded columns and also the original numerical data. We will now see how we can achieve something similar using scikit-learn. We're going to import the OneHotEncoder class from sklearn.preprocessing, instantiate it, and pass sparse=False. Passing sparse=False will make it return a NumPy array that is not sparse. Working with sparse arrays can be a little tricky, and not all scikit-learn machine learning models work with them, so we pass this mostly for compatibility. Next, encoded_data will be equal to encoder.fit_transform, to which we pass df.values. Recall that df.values gives us a NumPy array of the DataFrame's contents. And we want to one-hot encode every row, and only the second and third columns, that is, the mfr and type columns. So in order to do that, we have to pass a colon for the rows and the slice 1:3 for the columns. Let's show the encoded data. Here it is, and you can see that the positions of the ones correspond to the positions of the ones in the pandas one-hot encoded DataFrames that we have seen above. 10. Data pre processing discretization: In this video, we are going to talk about discretization. Discretization means turning real-valued features into integers, or more generally, putting numbers into bins. Here is a toy example, as this is very similar to what we have seen so far in the previous videos on data pre-processing. We import NumPy and the KBinsDiscretizer class from sklearn.preprocessing. And here we have some real-valued numerical data. By real-valued, we mean that the numbers are not integers, so they are something point something. We instantiate a discretizer object to which we pass a list of numbers of bins. What this means is that for each column, the values in that column will be put into this many bins. So for the first column, the values will be put in one of three bins.
For the second column, they will be put in one of two bins, and for the third column, each number will also be put in one of two bins. And we are going to use something called ordinal encoding. We call fit_transform on our data, and we show it. Let's see how it looks by running this cell. OK, so like we said, the numbers in the first column will all be put in one of three bins. So -1.2 goes to bin 0, 4.5 goes to bin 2, and 1.4 goes to bin 1. And if you look at the numbers, this kind of makes sense. It's kind of a sorting of the numbers: -1.2 is the smallest, 1.4 is the second smallest, and so on. And the same happens for all the other columns as well, except for the last. For the last, both 5.8 and 6.5 are put in bin number 1. This is because there is an algorithm that determines the range of numbers that each bin corresponds to, and the numbers are put into a bin according to that range. But knowing exactly how that works is not very important in general. What you should take from this is that discretization is used when you want to turn data into categorical features. After turning it into categorical features, you usually apply another preprocessing step that involves turning them into one-hot encodings. So usually this is not a final step in pre-processing; there is a one-hot encoding step that follows this one. However, discretization is quite rarely used in machine learning. It's generally only used for more advanced algorithms and special kinds of datasets that are heavy on real numbers. We'll talk more about it when the time comes and when we absolutely have to use it and there is no better way to do things. 11. Handling missing values: In this video, we are going to talk about handling missing values. Sometimes datasets have missing values for some of the rows and features. This makes them unusable by machine learning algorithms.
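The discretization walkthrough above can be sketched like this. The sample array is my own (chosen so the first column behaves like the video's example), and strategy="uniform" is an assumption, since the video does not say which binning strategy it uses:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Made-up real-valued data: rows are observations, columns are features.
data = np.array([[-1.2, 0.3, 5.8],
                 [ 4.5, 2.1, 6.5],
                 [ 1.4, 0.9, 1.1]])

# 3 bins for the first column, 2 each for the other two.
# encode="ordinal" returns bin indices instead of one-hot vectors.
# strategy="uniform" splits each column's range into equal-width bins (an assumption here).
discretizer = KBinsDiscretizer(n_bins=[3, 2, 2], encode="ordinal", strategy="uniform")
binned = discretizer.fit_transform(data)
print(binned)
```

In the first column, -1.2 lands in bin 0, 1.4 in bin 1, and 4.5 in bin 2, which is the "sorting" intuition from the lesson; in the last column, 5.8 and 6.5 both fall in bin 1.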
So we must find a way to get rid of, or to fill in, those missing values. We are going to explain this using the UFC fights dataset from Kaggle. This is the one, if you want to look it up. We are going to be using this data.csv file. So let's load it using pandas. I called mine ufc_data.csv. Let's display the data we have from this file. OK, here it is. If you scroll to the right and down a bit, you can see here that we have some NaN values. This stands for "not a number". For some numerical columns here, such as B_age and R_age, we have NaNs for both of them. And we need a way to either get rid of those or to fill them in appropriately for machine learning algorithms to work on this dataset. The first and easiest approach is to simply remove the rows or columns that have missing values. Recall that by rows, we mean records or observations. For example, each row in the above dataset is an observation for some fighter or for some fight. And columns means features. For example, a fighter's stance is a feature, or their number of wins is also a feature. Let's see how we can drop all rows that contain missing values. We're going to do this by calling the dropna method on the DataFrame. And let's display the result. Here it is. And if you scroll to the right, you can see that all the rows with missing values are now gone. However, this isn't to say that this is a good approach. Note that we only have about 3,200 rows now, and we had almost 5,200 before. So we lost almost 2,000 rows doing this. That's not very good. Let's see what happens if we drop the columns instead. We do this by calling the dropna method as well, but we pass axis=1. This will make it drop columns instead of rows. Let's display it. In this case, you can see that it dropped the offending columns. But again, it dropped a lot of them. If we look here, we have 36 columns, but we had 145 before. So we lost over 100 features.
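The two dropna variants can be sketched on a toy frame. The values and column names below are made up, not the UFC data:

```python
import numpy as np
import pandas as pd

# Made-up frame: two columns contain NaNs, one does not.
df = pd.DataFrame({"age":    [25.0, np.nan, 31.0],
                   "wins":   [10,   7,      4],
                   "height": [np.nan, 180.0, 175.0]})

rows_dropped = df.dropna()          # drops every row containing a NaN
cols_dropped = df.dropna(axis=1)    # drops every column containing a NaN

print(rows_dropped.shape, cols_dropped.shape)
```

Here dropping rows keeps only one of three rows, and dropping columns keeps only one of three columns, which illustrates how much data either approach can throw away.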
Again, that's not good at all. A better approach is usually to fill in the missing values. Removing the rows or columns with missing values only works if either you have features with mostly missing values or just a few rows that contain missing values. Otherwise, it's best to try to fill them in. One way of doing this is using interpolation. Interpolation is a method that tries to fill in values based on the values close to them. So we can do df of some column name that we want to fill in missing values for, for example B_age, which will be equal to df['B_age'].interpolate(). And we're going to pass limit_direction='both'. This will make sure that as many values as possible will be filled in if they are missing. Let's run this. And let's display the DataFrame with this column's missing values filled in. And if we scroll to the right and look at the B_age column, we can see that there are no more missing values. And we could do the same for the other columns as well. And most importantly, we haven't lost any rows or any columns. We can also do this using scikit-learn, like so. We're going to need NumPy, and from sklearn.impute we're going to import the SimpleImputer class, which we are going to instantiate and pass it missing_values=np.nan and strategy='mean'. This will make it replace all the values that show up as NaN with the mean of the values across that column. And after that, we simply do nan_filled = imputer.fit_transform(df.values). However, there is an issue with this code and it will crash. Feel free to pause the video here and think about why that is. Also think about what kinds of datasets this approach would work for. So let's run it. And like I said, it crashes, and it says that it cannot use the mean strategy with non-numeric data. So this approach only works on numerical data. And there is no simple way of only passing it some columns for it to look at.
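Both fill-in approaches from this lesson can be sketched like this. The toy column below is purely numeric and made up; the real UFC frame also has text columns, which is exactly why SimpleImputer crashed on it:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up numeric column with two missing values.
df = pd.DataFrame({"age": [25.0, np.nan, 31.0, np.nan, 29.0]})

# pandas interpolation: fills each NaN from its neighboring values.
interpolated = df["age"].interpolate(limit_direction="both")

# scikit-learn: replace every NaN with the column mean (numeric columns only).
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
nan_filled = imputer.fit_transform(df.values)

print(interpolated.tolist())
print(nan_filled.ravel())
```

Interpolation fills each gap from its neighbors (28 between 25 and 31, 30 between 31 and 29), while the imputer fills every gap with the same column mean.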
So we're going to save this for numerical data. We are not going to try to get it working for this one. Just know that it is hard to get working on mixed data types. Another approach would be to use placeholders, which I will leave as an exercise, but I will give you a few hints and also tell you what you are supposed to do. If we have string columns, we can make the missing values something like 'unknown' if those columns are empty or show up as NaN. And for numerical columns, we can make missing values something like 0 or -1. Now you should look up how to do this using pandas and scikit-learn, and try to apply it to this dataset. 12. Custom transformers: In this video, we are going to talk about custom transformers. Sometimes, what we are given isn't enough and we need our own functions to transform the data in the ways that we want. We are going to work on the UFC fights dataset here as well. So let's read the data and show the DataFrame. OK, it's the same data that you should be used to by now. And we're going to implement our own data transformer that will simply remove some columns from this dataset. We do that by importing FunctionTransformer from sklearn.preprocessing. And we define a function that does the actual preprocessing. We're going to call it all_but_first_cols, and it will take an X parameter. And all we're going to do for this example is return every row in X and every column starting from the index 5 column. Next, we declare the transformer and pass it the function. Be careful here: we pass it the function, not the function call. So there are no parentheses here. This would be wrong; this is correct. And we also pass it validate=False, so it doesn't do some validations that can lead to errors and that won't help us in this case. Then we simply do transformed = transformer.transform and pass it the values in our DataFrame. Let's run this. It runs fine. And let's display the transformed data.
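A minimal sketch of the FunctionTransformer idea described above; the array here is made-up toy data, but the function name and the column cutoff of 5 follow the video's example:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def all_but_first_cols(X):
    # Keep every row, and every column from index 5 onward.
    return X[:, 5:]

# Pass the function itself (no parentheses); validate=False skips input checks.
transformer = FunctionTransformer(all_but_first_cols, validate=False)

data = np.arange(20).reshape(2, 10)   # 2 rows, 10 columns of made-up numbers
transformed = transformer.transform(data)
print(transformed.shape)
```

The transformer drops the first five columns, leaving a 2-by-5 array, and because it is a proper scikit-learn transformer it can later be dropped into a pipeline.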
And as you can see, it's the same data, except it's missing the first few columns. You might be wondering why we didn't just call this function on its own. Why did we have to go through the FunctionTransformer class? Well, this will become obvious a bit later on, when we will see how we can stack multiple data pre-processing steps into a machine learning pipeline. As part of a pipeline, we will need to pass something like a FunctionTransformer to it. So it's good to have this knowledge available to us when we get there. 13. Feature engineering: In this video, we are going to talk about feature engineering. Feature engineering is the process of using our own domain knowledge to extract features from data. We are going to discuss some possible feature engineering directions on the UFC fights dataset. We're not going to be implementing them, but they serve as a good implementation exercise for you and also as good thinking material about the topic. For example, we might add a column for who you thought the winner was going to be. Let's say you collected this dataset yourself over the years. In that case, you would probably have some records about who you thought the winner was going to be in each fight. That column might be useful for a machine learning algorithm, as long as your predictions were better than a coin toss, so better than 50-50. Next, we might add columns for who the experts predicted the winner would be. In this case, we don't even need much domain knowledge at all. We simply need to look up those fights in the news and find out what the experts thought. And we might have two or three or four columns for the experts' predictions. Next, we might think of what other columns we think might improve learning by machine learning algorithms. So what other columns would you take into consideration if you were to make your own predictions?
In this case, we need quite advanced domain knowledge to know what is useful and what isn't. For example, the weather that day. Is that useful? At first thought, maybe not for this dataset, but who knows? For other datasets, the weather might be very important, and it might not be obvious at all whether the weather is or isn't important unless you have some intimate domain knowledge. Feature engineering doesn't just deal with adding columns. We might also change some columns. Maybe the numbers in those columns are not as we would like them from a domain knowledge point of view. So in that case, we might increase them or decrease them, or we might add something to them, or we might multiply them by something, and so on. If you couldn't answer most of the questions above, don't worry. It just means you maybe don't have that much domain knowledge about the UFC. I don't either. So try to think about what kind of dataset you would be able to answer such questions for, and then try to look up something like it on Kaggle. And actually try to come up with a few columns that you would add to it, and maybe a few columns that you would change in it. 14. Train test split: In this video, we are going to talk about train test splits. This is very similar to when we study for a topic in school. We use some material to study, and then we take a test at the end on similar material, but usually not exactly the same material. This is the same way that machine learning algorithms are tested as well. They are trained on some data, and then they are provided some unseen data to find out how well they actually perform and what they have actually learned. We're going to use the UFC fights dataset for this task as well. We read this dataset here, and now let's see how to perform a train test split on it. We need to import a few things first. From sklearn.utils, we're going to import shuffle. And from sklearn.model_selection,
we're going to import train_test_split, which is a function. We are going to set data = shuffle(df.values). What this does is shuffle all the data in our DataFrame. This is usually a good idea so that the machine learning algorithm won't learn the order the data is in, which might lead us to believe that it is performing better than it actually will in a real-world scenario. Note that this is not always a good idea for all datasets. So this is a general way of doing things; it might not be appropriate for this particular dataset. That is because with this dataset, our data is somewhat chronological. The data for each fight contains the fighters' data up until that fight. So by shuffling it, we might actually mess things up more than we should, and we might unrealistically get worse performance than would happen in a real-world scenario. But that would require a more thorough analysis. So I'm not saying it's happening here on this dataset. I'm just saying it's something to keep in mind, and that this is a general example. Generally we shuffle, but not always. And now we will introduce some notation. With a capital X, we will usually refer to the features of our dataset, and with a lowercase y, we will usually refer to the labels, or the things that we need to learn and then predict. OK, so we are going to set X and y to be equal to parts of data. So this right here will be set to X, and the thing we write here will be set to y. And let's fill in the y first because that's easier. So this will be every row and column 5. We set this to column 5 because that column gives us the winner, and that is what we want to learn how to predict. And for X, we need every column except column 5. We are going to use a list comprehension for that. So we are going to say i for i in range(data.shape[1]), so all the columns, if the column index is not 5. And after that, we set X_train, X_test, y_train, y_test
to be equal to train_test_split of X, y, with the test size as a percentage. So this will be 0.33 for 33%. And that's all. Let's run this and see what happens. OK, no errors. Now let's display some things. First of all, let's print y_train. We can see that this gives us the winners for each row. And then let's print some shapes: X_train.shape, X_test.shape, y_train.shape, and y_test.shape. And as you can see, the dataset was successfully split into a train-test pair. What we will do in future videos is train on the pair X_train, y_train, and then see how well the algorithm performs on X_test. So we will have it make predictions on X_test and compare those predictions with y_test. Think about it like the algorithm taking a test on X_test, and then the professor comparing those answers with his known correct answers, which are in y_test. This is a very important step in machine learning, so please make sure that you have understood this video before moving on. 15. Cross validation: In this video, we are going to talk about cross validation. This is another way of testing how well machine learning algorithms perform after training. We are going to show how this works on the UFC fights dataset as well. Let's read it and show it. The rest of the code is quite similar to what we have seen for train test splits. The only thing that changes is here. Instead of doing one train test split, we will do multiple such train and test splits. First, we import the KFold class from sklearn.model_selection, and we instantiate it passing n_splits=5. Let's run this code and see what happens first. As you can see, five lines are displayed. This means that within this loop, we should train the algorithm on this part and then test it on this part, then train on this and test on this, and so on. This is called a five-fold cross validation.
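The shuffle, X/y, and split steps from the train-test-split lesson can be sketched end to end. The data here is randomly generated, with column 5 playing the role of the label, as in the video:

```python
import numpy as np
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

# Made-up dataset: 100 rows, 7 columns; column 5 plays the role of the label.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 7))

data = shuffle(data, random_state=0)   # shuffle the rows

y = data[:, 5]                         # the label column
# Every column except column 5 becomes a feature.
X = data[:, [i for i in range(data.shape[1]) if i != 5]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

With test_size=0.33 and 100 rows, 33 rows go to the test set and 67 to the training set, and the feature matrix has 6 columns because the label column was removed.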
You split the dataset into five parts, train on four of the parts, and test on the remaining fifth, and repeat this five times, once for each of the five parts. If we pass in ten here, we will get ten lines. There are actually ten lines, but they are large; that's why they show up a bit differently. If we pass in two, we get two lines. And for training, we can simply use X[train] to get the training instances. After doing this K times, you would average the results and report a final result. This is usually used for smaller datasets because it's obviously less efficient and it will take a lot more time than a simple train test split. However, it usually gives a more accurate idea of how well an algorithm will perform in practice on new data. 16. Data pre processing quiz and exercise: In this video, we are going to go through a data pre-processing quiz and then an exercise. I am going to read each question, present the options, and then you should pause the video, think about the choices presented, and choose at least one and at most all of them. After that, resume the video and see what the correct answer is. First question: which of the following are disadvantages of scikit-learn's SimpleImputer class? We cannot choose the columns; it does not work for all data types; it only works for strings. Pause the video here, choose at least one option, and then keep watching to find out the correct answer. The correct answer is that we cannot choose the columns. The SimpleImputer class from scikit-learn does not allow us to do this easily. If we want to be able to choose the columns, it's usually easier to use pandas. From what you've seen so far, when do you think we will use pandas over scikit-learn? For loading the data, for training the algorithms, or mostly for data pre-processing, as it makes most such tasks easier than scikit-learn does? The correct answer is C, mostly for data pre-processing. Indeed, these tasks are usually easier in pandas than in scikit-learn.
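The K-fold loop described in the cross-validation lesson can be sketched as follows, on a small made-up array:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # 10 made-up rows, 2 features

kfold = KFold(n_splits=5)
n_rounds = 0
for train_idx, test_idx in kfold.split(X):
    # Train on 4 of the 5 parts, test on the remaining fifth.
    X_train, X_test = X[train_idx], X[test_idx]
    assert len(X_train) == 8 and len(X_test) == 2
    n_rounds += 1

print(n_rounds)
```

The loop runs once per fold, so with n_splits=5 every row appears in the test part exactly once; in practice you would fit a model inside the loop and average the five test scores.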
Although we have used pandas for loading the data, this can easily be done with NumPy as well, and with other libraries in general. So this is not a pandas-only feature, and that is why we don't consider it a correct answer here. But if you did choose it, don't worry about it; it's not that wrong. Which data pre-processing steps are applicable when we have very large integers? Normalization, standardization, or imputation of missing values? The correct answers are A and B. These two lead to smaller numbers that are closer together, and they usually help machine learning algorithms. Congratulations if you made it this far, and don't worry if you didn't get everything right. Now, we are going to discuss your exercise. Look at the UFC fights dataset on its Kaggle page. You will see that there is a preprocessed_data.csv file. Try to describe what pre-processing steps were applied to obtain this file and write them here in this notebook. Then try to write your own code that preprocesses data.csv, which is available on the same Kaggle page, in a similar way. It doesn't have to be exactly the same, but you should obtain something that looks kind of like preprocessed_data.csv. After you do that, try to complete this section here. Write where you got stuck and how you got unstuck. It's quite likely that some things that you know you have to do, you will not be able to do right away. You might have to look some things up in the documentation, or you might have to re-watch some of the videos. Make a note of them here, and write it such that it helps you avoid this in the future. Also, write whether you had to look at the official solution in order to get unstuck. Finally, after you're happy with your own solution, do take a look at the official one. What did you do better?
Maybe we were a bit superficial about something in the official solution. Was it worth the extra time that you spent doing it better, or would it have been mostly the same if you had taken the same approach that we took in the official solution? Basically, try to be objective about it and really decide whether you overcomplicated something, or whether we really did do it superficially in the official solution. Then try to answer: what did you do worse than the official solution? Maybe, again, you overcomplicated something that ultimately gave the same result. Maybe you wrote more code for something that could have been done in one line of code. In general, what can you learn from the official solution that you didn't learn from completing this task on your own? It's always helpful to look at other people's implementations. And finally, write any other helpful things that you can think of and that you took note of while implementing this. And again, congratulations if you made it this far, and I hope you will keep enjoying these videos. 17. Data pre processing exercise solution: In this video, we are going to talk about the data pre-processing exercise solution. So we got the data from here. We got the two files, data.csv and preprocessed_data.csv, for reference and to have something to compare ourselves with. The first step that seems to have been done is the one-hot encoding of the weight class and the stances for each fighter. In order to see this, I printed the columns of both datasets. And if we scroll down, see, we have here weight_class in the data file, so the unprocessed file. And if we scroll down, we also have this Stance column, and we have one for the red fighter as well. And if we look on Kaggle, we will see that these contain text data. They are categorical text columns, so they need to be one-hot encoded. If we scroll down, we see this is where the pre-processed file starts. Scroll way down, see.
We have these columns here, which definitely suggest a one-hot encoding was done on them. We have the same for the stances. So we will have to do one-hot encoding on these. And then, the file contains missing values. This is also something you can see on Kaggle, or we can simply do data.isna().sum(). And if we run this (let's comment out the print statements to make things easier to see), we can see that there are many missing values. So those will have to be filled in, and we will fill them in with the median. I'm not entirely sure if that was what was done on Kaggle, but it's a good idea anyway, and it's likely that it was filled in with the median on Kaggle as well. So if we look at the preprocessed data, we can see there are no missing values. Also, by looking on Kaggle, we can see that some useless columns were dropped, such as the date of the fight, the location of the fight, and the referee for the fight. OK, and here is some code that I've written that does a similar type of pre-processing. So first of all, I dropped the useless columns: the location, the date, and the referee, which will no longer be used. And you have to make sure to put them here with copy-paste from the output above, because you can see some have a capital first letter and others do not. So in order to not make any mistakes, copy-pasting works best here, I feel. Then we use get_dummies to do the one-hot encoding and concat to add the one-hot encoded columns back to the full DataFrame. We pass get_dummies a prefix to make it look like the preprocessed file on Kaggle, but you don't really have to pass in the prefix; if you didn't do it this way, it's completely fine. And we do this for the weight class and both fighters' stance columns. And of course, after we do this, we must get rid of the original text columns. So we do that using drop here on the DataFrame.
And then we print the number of columns in each file: in our DataFrame and in the preprocessed file from Kaggle. If we run this, we can see that we have 161 and the Kaggle file has 160. So what's up with that? Well, I wasn't sure what was up with that myself, so I built this tiny bit of code here to print the columns in my data that are not part of the columns in the preprocessed data. And apparently the column at fault here is the one for the sideways stance. I'm not sure why this doesn't show up in the preprocessed data file on Kaggle, but it's just one column and it's not really worth investigating further, so I just left it in. OK, and then we have some code for filling in the missing values. This is done by iterating over all the columns in my data, checking if a column has missing values, getting the median across that column, and using the fillna method on that column to replace all missing values with the median across the existing values. And after that, if we check the missing values again, we can see that the counts are all zeros. So this is all we have to do to get something very similar to the preprocessed file. I hope that after watching this you are able to answer all of these questions, and make sure that these observations will help you in the future. Congratulations if you completed this exercise, and if you got close, that's also very good and congratulations are in order. 18. Data pre processing summary: In this video, we are going to do a summary of the data preprocessing module. We're going to go over what we learned. This will help you go back and rewatch videos if you think you're not very familiar with something we mention here. Or it can also serve as a future reference for you to come back to, if you want to see whether some problem you are having was covered in this module or not. So we started by talking about value ranges and how to turn text data into numerical data.
By value ranges, we mean that machine learning algorithms are not happy with numbers that are too small or too big. They want them to be about the same size. So they are not happy if you give them multiple instances of data, and those instances contain both very small numbers and very big numbers. One way of handling that is through normalization, which will bring the data into a certain interval, usually between 0 and 1. Another way is through standardization, which will ensure that the data has certain statistical properties, such as a mean of 0 and a standard deviation of 1. So zero mean and unit variance. But of course, these two can only be applied to numerical data. So if our data is not numerical, what we can do is use one-hot encoding to make it numerical. This only works for categorical features. By categorical features, we mean something like having a column that says what kind of animal we are dealing with, where the values might be cat, mouse, and dog. If there are only these three values, we can use one-hot encoding to turn them into numerical columns, and after that we can apply normalization and standardization, if we think we need them. Then we discussed enriching our data. Real-world data is not very friendly most of the time. It can have missing values, for example. So going back to that animal column, maybe there's an entry that does not contain anything, so we don't know what animal it is. We can use various methods of filling in missing values so that the machine learning algorithms will be able to work with such data as well. We talked about applying our own custom transformers for more advanced pre-processing methods. And we also talked about feature engineering. Feature engineering is when you use your domain knowledge or your intuition to extract more value from the data.
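For instance, here is a made-up sketch of feature engineering (this example is mine, not from the course): computing a body-mass-index column out of two existing columns, on the intuition that the ratio carries more signal than either raw measurement.

```python
import pandas as pd

# Hypothetical people data; BMI is a classic engineered feature:
# a new column computed as a function of existing ones.
people = pd.DataFrame({"height_m": [1.60, 1.75, 1.82],
                       "weight_kg": [55.0, 72.0, 95.0]})
people["bmi"] = people["weight_kg"] / people["height_m"] ** 2

print(people["bmi"].round(1).tolist())
```

The learning algorithm could in principle discover this ratio itself, but handing it the engineered column directly often makes its job easier.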
So you might use it to come up with new features, which are a function of existing features, but that you think will help the learning algorithms perform better. And finally, we talked about train-test splits and cross-validation. For train-test splits, we used the school exam analogy. The train set is the part of the data that you use to study. The test set is what you get under examination and what the examiner evaluates and scores, depending on how well you did. It's the same here: the train part of the data is what we will train our algorithms on, and the test part is what we will test them on. We will have them make predictions on the test data based on what they have learned from the training data. So congratulations if you made it this far, and I hope that you will enjoy the videos in the next modules as well. See you then. 19. What is regression: We are going to talk about what regression is. Regression is a way to model relationships between variables, and it therefore allows us to make predictions. Let's see what this means with a concrete example. Consider the following figure, taken from the scikit-learn documentation. The blue line is called a regression line. We will say that it has been fitted to the black dots. But what do the figure's axes represent? What about the black dots? Let's make something up for this example. Let's say the horizontal axis represents height and the vertical axis represents weight. A black dot then represents a person's height and weight. Don't worry if this seems a bit unrealistic to you while looking at the picture; we will look at real examples soon. Now, let's say you are given a height and want to find what the corresponding weight should be. Well, you know where on the height axis this height should go. Let's say it's here. Now draw a straight line upwards from it until you hit the blue regression line, like this, then go left, like this.
And this is your corresponding weight. What we did was predict the weight of a person, given their height, using the learned regression line. Machine learning will help us learn such lines, and as we'll see, they don't even have to be straight lines. 20. Applications of regression: In this video, we are going to talk about applications of regression. Here are some possible uses. First, we have forecasting. Let's say you have a business and are considering buying some video ads to show up in various places, like YouTube videos. You could use regression to get an idea about how many people would watch your ad without skipping it. This could help you decide how much to pay for such ads. Next, there is business process optimization. Let's consider a food delivery business. We might analyze the impact of electric bike delivery during rush hour traffic on customer satisfaction. This would help us decide whether or not it's worth it to buy more electric bikes. Most people nowadays are faced with decisions on a daily basis; you don't even need to be a manager for this to be the case. Regression techniques can help you make sense of large amounts of data and test the impact of a certain decision before you actually make it. We can also use regression to correct certain errors in judgment. For example, one topic that can become quite controversial is whether a minimum wage increase is a good thing. One side argues that it will benefit low-income workers, while the other argues that small businesses won't be able to pay and it will thus increase unemployment. One side is in error, but it might not be the same one everywhere on the globe. Regression could help with this. Over time, businesses have gathered a lot of data that has the potential to yield valuable insights. A lot of this data is also made public.
However, this data requires proper analysis, and regression analysis techniques can find relationships between different variables by uncovering patterns that were previously unnoticed. So as you can see, regression has plenty of uses, and you can almost always find something to use it on, whether you're a business owner or simply looking to improve small things in your life. 21. Introduction to linear regression: In this video, we will introduce linear regression. Linear regression is any method that finds linear relationships between variables. We've already seen an example of linear regression with this figure from the scikit-learn documentation. By the term linear regression or linear regressor, we will refer to any algorithm that can find the blue line in the image below. We will discuss how to use scikit-learn's implementation of these algorithms in the next videos. 22. Scikit learn linear regression class: In this video, we are going to talk about scikit-learn's LinearRegression class. I think this is a good time to introduce scikit-learn's documentation and tell you how you should read it in order to understand how a certain model provided by scikit-learn works. I'm going to show you what you should focus on and what you should ignore, as there is a lot of theoretical information that doesn't really benefit you at this stage. So this is how it's going to look once you open the documentation page for one of scikit-learn's classifiers, or anything else provided by this library. Here we have where the model is located. So if you want to use the LinearRegression class, you would have to import it from this package. This part here tells us how this class is defined, and it also tells us that it's a class. These are the parameters it takes; this is what you must pass to it if you want to instantiate this class. However, we can see that they all have default values, which is true for most classifiers.
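A minimal sketch of the import and instantiation the documentation page is describing (get_params is a standard scikit-learn method for inspecting the defaults an instance was created with):

```python
from sklearn.linear_model import LinearRegression

# Every constructor parameter has a default value,
# so the class can be instantiated with no arguments at all.
model = LinearRegression()

# Show the default parameter values the instance received.
print(model.get_params())
```

The exact set of parameters printed depends on your installed scikit-learn version, which is one more reason to read the documentation page that matches your version.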
So you can instantiate it without passing any parameters, and they will take their default values. Next, we have a short description of what this model does and how it does it. In our case, this is quite technical, so we don't really care about it; all we care about is that it's a linear regressor, so it's going to give us a linear regression line. Next, all of the parameters are described. Again, most of them are not very important at this point. I'm going to focus on the normalize parameter of the LinearRegression class, and in particular on how the normalization is done. It says here that it's done by subtracting the mean and dividing by something called the L2 norm. I won't get into what the L2 norm is for now. I'm just going to say that you can pass True to this parameter when you instantiate the class, and it will automatically take care of scaling the data for you. However, if you want to use one of the pre-processing methods that we have discussed previously, you can pass False to it and then use the StandardScaler we discussed, prior to running this linear regression on the data. We will see exactly how in a future video. For now, let's discuss the other parts of this documentation. If we scroll down a bit, we have the attributes, which are things that you can access on the object after you have instantiated it. Again, so far they are not very important, so we'll skip over them. Next, we usually have a See Also section for every model, which tells us what other similar models exist in the library; you can click on them to go to their documentation pages. Then we have some notes. Here, we might be told something about how well the algorithm performs, what its time complexity is, and so on, or simply some implementation details. For now, they are not very important either. And next we have an example. We could actually take this code and run it in a Jupyter Notebook, and it should run.
So these are usually complete examples that you can copy, paste, and run, and they should work. For now, I'm not going to discuss the example. Feel free to run it, change some things, and see how those changes impact the outputs. But we're going to be using our own example soon. Next, we are given a list of all the methods, together with a description. The most important methods here are going to be fit and predict. The fit method will train the model; going back to our example, it will make it build that blue line we showed in the previous introductory video. And predict is going to make predictions on unseen data. Recall the height and weight example that we also discussed in a previous video, where we were given the height and wanted to find the corresponding weight, or vice versa. Also, recall that a capital X stands for features, so these are the features that the algorithm considers, and y stands for the labels or the target, which is what the algorithm must learn for each instance in X. Next, each method is presented in even more detail, with each parameter described. So like we said, X is the training data and y is the target values; each entry in X has a corresponding entry in y. And fit fits the model, or trains it. For predict, we pass the samples that we want to make predictions for, and we are returned the predicted values. The other methods I don't want to go into details about right now, but we will touch on them in this section. Some more notes about each method can be given as well. Finally, at the end, we are given a list of examples from the documentation that use this class. You can go to each of them by clicking on them, and you will usually find full working code that you can use to better understand how the model works. Feel free to take a look at each of these examples, but don't get too caught up in trying to understand the code at this point. 23.
Running linear regression on a dummy dataset: In this video, we are going to talk about running linear regression on a dummy dataset. First, we are going to generate and visualize some test data. Let's write the following. We're going to import from sklearn.datasets the make_regression function; we'll see what this does in a moment. Then we're going to import pandas, matplotlib, and Seaborn. We'll use these to visualize the data. Next, let's write the following code to generate our data: X, y will be equal to make_regression, and we're going to pass it n_samples=100. This means that we will have 100 samples, or 100 rows of data, in our dummy dataset. We'll set n_features=1, which means that we will have a single feature, and noise=30. Noise is a parameter that controls how chaotic our data is, or how spread out it is. It will become clear once we visualize it. Next, we're going to make a pandas DataFrame out of this data, like so: DataFrame, and we'll pass it a dictionary with x being assigned the input data, so the features, that is X, and we only choose the 0th column here. And for y, we pass it y. Now we will use Seaborn to display a regression plot: sns.regplot, we pass the labels here, data=data, and we'll pass it fit_reg=False. Let's run this code and see what happens. As you can see, it generates some data and displays it. Let's run this again and see what happens. You can see the dataset changed, and it actually changes after each run, so it generates a random dataset. If we change the noise parameter, say we set it to 5, you can see that the dataset is much more well-organized. But you generally want this to be quite noisy in order for it to be a realistic dataset, as realistic as it gets considering it's a dummy dataset. So let's leave it at 30. Okay, so you can probably guess what make_regression does by now.
It creates a dataset comprised of an X and a y, so features and labels. The number of features, so the number of columns in X, will be one, because we passed 1 for n_features. And the labels are going to be randomly generated. Okay, now let's see how we can run our linear regressor on this. We're going to import it from sklearn.linear_model and instantiate it: linear_regressor is equal to LinearRegression, and we're not going to pass any parameters to this class yet. We're going to call the fit method, and then we're going to have it make predictions on X. So predicted is equal to linear_regressor.predict(X). And let's print the predictions. Okay, here they are. They look quite okay: if we look at this figure, they look about the same range as the values in this figure on the y axis. But how do we know how well it is actually doing? Well, we can do that by displaying the regression line we generated and comparing it with Seaborn's own regression line. Let's see how we can do that. First of all, we're going to make a DataFrame out of our predicted data. So predicted_data is equal to pd.DataFrame, and we are going to pass the same thing for x, and we're going to pass predicted, like this. So instead of the initial y, we now pass our predictions. And now we have some code for displaying. I'm just going to copy-paste it, as it's just display code, and I'll briefly explain it. We will do two plots side by side; that's why we create subplots here with two columns. We're going to do a regplot, which will plot the initial data in green, and then we are going to do a lineplot that will plot our prediction as a line in red. On the right side, we're going to plot the initial data, just like we did above, but you can see here we have fit_reg=True. This will make Seaborn draw its own regression line on the data we tell it to display.
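Putting the generate, fit, and predict pieces of this walkthrough together in one sketch (the random_state argument is my addition so the sketch is reproducible; the lesson's version regenerates random data on every run):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# 100 rows, a single feature, noisy targets; fixed seed for reproducibility.
X, y = make_regression(n_samples=100, n_features=1, noise=30, random_state=0)

linear_regressor = LinearRegression()
linear_regressor.fit(X, y)               # learn the regression line
predicted = linear_regressor.predict(X)  # predictions for every sample

# DataFrame used for the red line plot described in the video.
predicted_data = pd.DataFrame({"x": X[:, 0], "y": predicted})
```

Plotting predicted_data with sns.lineplot next to a regplot of the original data is the side-by-side comparison the video walks through.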
And if our own regression line looks the same as or similar to Seaborn's line, then we have probably done a good job. So let's see what happens. Okay, here we go. This is our line in red, and this is Seaborn's line. Ignore the shaded area for now. As you can see, this is quite similar; in fact, it's almost the same, if not the same. These are shifted a bit, but you can see they are actually the same figure, and our line is very similar to Seaborn's. So this means that our application of the LinearRegression class in scikit-learn was successful. This is how you can generate dummy data using make_regression for testing your algorithms, and this is how you can apply the LinearRegression class in order to compute a regression line and then display it. These are the basics of regression. From now on, we will try to solve more complex, more real-world problems using similar things, but a bit more complex regarding the datasets and the application of these algorithms. 24. Running linear regression on a Kaggle dataset: In this video, we are going to talk about running linear regression on a Kaggle dataset. This is the dataset that we are going to be using. It contains graduate admissions data from an Indian perspective. You can read more about the features contained in this dataset on the Kaggle page, but we're going to be using this Admission Predict version 1.1 file. So download it and save it in the same folder as your Jupyter notebook. We'll be discussing the dataset in this notebook. First of all, we import pandas. Then we import matplotlib and Seaborn; we'll be using Seaborn for some data visualizations later on. And then we simply load the dataset: data equals pd.read_csv, and pass it the file name you have saved the dataset as. I saved it as graduate_admission.csv. Now, let's look at the Kaggle page, or display the data here. Let's do that now.
We notice that we have this serial number column that simply increases and counts the rows in the data. Obviously, that's not going to help us predict the chance of admission, so let's get rid of it. In order to do that, we'll do data equals data.drop, and we'll pass in a list of columns that we want to drop; in this case, we only want to drop the serial number. If we display the data now, we can see that the column has been dropped. So let's go over this dataset a bit and talk about what we want to do. Don't worry if it seemed a bit unclear when discussing the dummy dataset, because these things are usually easier to understand when discussing an actual real-world scenario. We're going to do that now. So we're given all of these features, and they are very likely referring to students: a student's GRE score, TOEFL score, the rating of the university they're applying to, and some other things, such as whether they have previously published research, and so on. And this is their chance of being admitted to the desired graduate program. So what we want to do is this: given this data for some student that we don't have in this dataset, we want to be able to tell that student what their chances of being admitted are. We want to predict a student's chance of admission based on their academic performance, expressed with these metrics here. So these columns are going to be our features, and this column is going to be our target. Okay, now that we have decided on that, what we can do in order to gain a bit more information about this dataset is display data.describe(). Here we get, for each column, some statistics, such as how many rows of data we have (we have 500), the mean for each column (or the average), the standard deviation, the minimum, the maximum, some percentiles, and so on. This can be useful in order to simply get a better feel for the dataset.
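A tiny sketch of the drop-and-describe steps, on a hypothetical three-row stand-in for the admissions data (the column names mirror the Kaggle file, but the values here are made up):

```python
import pandas as pd

# Made-up mini-frame: an index-like counter column plus real features.
data = pd.DataFrame({
    "Serial No.": [1, 2, 3],
    "GRE Score": [337, 324, 316],
    "Chance of Admit": [0.92, 0.76, 0.72],
})

# The running row counter carries no predictive information, so drop it.
data = data.drop(columns=["Serial No."])

# count, mean, std, min, quartiles, and max for every numeric column.
print(data.describe())
```

On the real 500-row file, the same two lines drop the counter and summarize every remaining column.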
For example, if I look at the university rating column, I can see that it's likely a categorical column, because I only have 1, 2, 3, 4, or 5 for the minimum, 25%, 50%, 75%, and maximum. And if I look at the GRE score column, I can see that the values are quite close together, so I don't have something like a value of 10 in one row and a value of 10,000 in another row. Of course, many of these things can be seen from the Kaggle page as well, but not all datasets have a Kaggle page, so it's useful to be able to do it in the notebook too. Okay, so now that we have the data, a good feel for what it looks like, and an idea of what we need to do with it, we're going to move to the train-test split part. Recall that we need a training set to train the algorithm on, then a test set to test the algorithm on, in order to see how well it has learned. This is how we do the train-test split. From sklearn.model_selection, we're going to import train_test_split. Then we're going to set X, y to be equal to data.values, all of the rows and everything up until, and excluding, the last column. So what this will do is set X to (let's display the dataset again) the values in all of these columns: this column's values, this, this, and so on, up until Research. And in y we want to set the values of the last column, so the target, the chance of admission. In order to do that, we will do data.values, take every row, and only take the last column. In order to see that this does indeed happen, let's show X. You can see the values on the first row correspond to the values on the first row here; you can check the other ones yourself. And if we show y, the values correspond to these values here. Okay, so after we set our X and y, now we must set our X_train, X_test, y_train, and y_test to be equal to train_test_split of X, y, and we are going to set our test_size to 0.2.
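The slicing-and-splitting steps above can be sketched on synthetic data of the same shape (a random 500-by-8 table standing in for the admissions DataFrame's values):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 500 rows, 7 feature columns plus a target column,
# mirroring the shape of the admissions data (values are random).
rng = np.random.default_rng(0)
table = rng.random((500, 8))

X = table[:, :-1]  # every row, all columns except the last
y = table[:, -1]   # every row, only the last column (the target)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train.shape, X_test.shape)  # (400, 7) (100, 7)
```

The same slicing works directly on data.values in the real notebook.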
This means that our test set size will be 20% of the entire data, so 20% out of 500. Usually we use something like 30 or 33 percent, but since we have very little data here, only 500 instances, we're going to use 20% to keep a bit more data for training. Let's run this code; this is all we do here. And now we need to run our linear regression. In order to do that, we have to import LinearRegression from sklearn.linear_model, instantiate the class, and call fit, passing it the train data: linear_regression.fit(X_train, y_train). Then we're going to save our predictions on the test set: predictions equals linear_regression.predict, to which we will pass X_test. So basically, think of the training data as the data we use for studying; fit is our study method, and this is the data we study on. And think of predict like taking the test: these are our exam questions, and this will contain our answers. So fit trains the algorithm, and predict uses the algorithm to make predictions on unseen data. Let's print the predictions. Here they are, and they look about right. We don't know if they are correct yet, but they look well formed, that is, between 0 and 1, just like the train data was. But how do we know if they're actually correct? Well, instead of showing the predictions, let's show predictions minus y_test. This will show us the differences between each prediction and the known correct answer. And if you look here, they are quite small, so this means that we are doing quite well, actually. If you are not familiar with this notation, it means this number multiplied by ten to this power. So this value here would be equal to minus 1.21 multiplied by ten to the power of minus one, or actually multiplied by 0.1, which is the same as dividing by ten. So if there is a minus in the exponent, you can think of it as this number divided by ten to this power.
So, taking the power without its minus sign, this value is equal to about minus 0.121. That's a very small difference; we are very close to the correct value. So this is how you run a linear regression model on a Kaggle dataset. These are the basics of regression. From now on, we will see how we can improve this and gain more insights into this dataset, and into other datasets in general. 25. Exploratory data analysis with visualizations: In this video, we are going to talk about exploratory data analysis with visualizations. We will pick up where we stopped in the last video, that is, at displaying the differences between our predictions on the test set and the known correct values. The values here might be slightly different than the ones in the previous video. That is because the train_test_split function splits the dataset randomly. So if you run these two cells multiple times, the results might be slightly different, but they should still be very similar. Data analysis is actually a step that is usually done first, before applying any machine learning algorithms, because it helps us decide what preprocessing steps to apply and what machine learning algorithms are likely to work best. But we started directly with applying the linear regressor, since this is an easy dataset and I wanted to show you some results. Also, this topic is slightly more complex, so I thought I'd leave it for after. Just know that we usually apply this first, and then keep doing it as we get results from our machine learning algorithms that we want to improve upon by understanding our data better. We are now going to introduce the pair plot. All we have to do for this is write one single line of code, that is sns.pairplot (recall that sns stands for Seaborn), and pass it our DataFrame. Let's run this cell and see what gets displayed. It might take a while to compute, so please be patient. Okay, we can see that a huge figure shows up with multiple graphs.
If you double-click on it, it enlarges a bit and is easier to see. Notice that all of our columns are shown on the left and also on the bottom of the figure. What this does is what we did when visualizing our dummy dataset, except that it does it for each pair of columns in our dataset. This allows us to see the relationships between the features themselves and between each feature and our target. In the future, this will allow us to pick the correct machine learning algorithms for the task at hand. For pairs formed out of the same feature, a histogram is shown, telling us how the values of that feature are distributed. For pairs formed out of distinct features, their values are simply plotted on the two axes. The more clustered together in a diagonal shape they are, the more strongly correlated the features are. We will discuss correlation next. Without getting into any statistical theory, two features are well correlated if they give us the same information. For example, let's look at the relationship between the GRE score, this one here, and the chance of admission. We can get to that by scrolling to the right and looking at the first row and the last column. So this one plots the GRE score against the chance of admission. As you can see, as the GRE score increases, so does the chance of admission. This means that these two are quite well correlated; they give us the same information. They both tell us how likely it is for the student to be admitted. Correlation is almost never perfect in the real world; that's why some of these values are a bit far away from most of them. These are called outliers, but we won't get into that now. Having features that are well correlated with our target, that is, with what we want to learn, means that our machine learning algorithms will do a good job learning. We've seen that our linear regressor does quite a good job indeed. While we can infer correlations from the pair plot, another plot helps more.
That is the correlations heatmap. In order to display the correlations heatmap, we have to add a bit more code, but it's not very much. So let's write: corr, from correlations, is equal to data.corr(). Then we're going to instantiate a color map; let's call it cmap, equal to sns.diverging_palette. I'm going to pass some values here that will make it look nicer, but this entire step is optional; you can get away with the default values just fine. I'm also going to pass as_cmap=True. Then, to display the actual heatmap: sns.heatmap(corr, cmap=cmap, square=True), the last argument being there to make it show up a bit nicer. If we run this code, we get the following figure. What this does is show us every pairing of columns, so each feature against each other feature and against the target column. The redder a cell is, the stronger the correlation, and the bluer it is, the weaker the correlation between the corresponding pair. So if you look at the diagonal, it is the reddest, and that makes perfect sense, because each feature is perfectly correlated with itself. And if we look at the last column, this is what gives us the correlations between the features and our target, what we want to learn. We can see here that the best correlated ones are the CGPA column, the GRE score, the TOEFL score, and that's about it. The rest are significantly less correlated, but not necessarily useless, because our minimum correlation is around 0.4, which is already quite decent. We can also display these in text by displaying the corr variable; this can make them easier to read in some cases. These steps usually let us do two things. The first is to drop the features that are badly correlated with the target, because they won't help our machine learning algorithms a lot.
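The heatmap steps can be sketched on a tiny made-up frame; the diverging_palette hues used here (220, 20) are my choice for the sketch, not necessarily the exact values from the video:

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend for this sketch
import numpy as np
import pandas as pd
import seaborn as sns

# Tiny stand-in for the admissions DataFrame (made-up values).
data = pd.DataFrame({"gre": [300, 310, 320, 340],
                     "toefl": [100, 104, 110, 118],
                     "admit": [0.60, 0.68, 0.80, 0.95]})

corr = data.corr()  # pairwise Pearson correlations, diagonal all 1.0

# Optional palette; sns.heatmap's defaults work fine too.
cmap = sns.diverging_palette(220, 20, as_cmap=True)
ax = sns.heatmap(corr, cmap=cmap, square=True)
```

Printing corr instead of plotting it gives the same numbers as a plain table, which is the text view mentioned above.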
Such features won't provide much value. The second is to drop features that are well correlated between themselves: they give us redundant information, and it might help to remove some of them if we have a lot of features. That's not the case here, so it likely wouldn't help much in this particular case, but an example of that is the TOEFL score and the GRE score. You can see that they are well correlated with each other. So if this was a huge dataset with tens of thousands of features, dropping one of these might help us reduce the feature space. We will perform more complex data analysis as our tasks, datasets, and results demand it. 26. Performance metrics measuring how well we are doing: In this video, we are going to talk about performance metrics: measuring how well we are doing. So far, we have eyeballed how well our model performs. We printed the predictions and decided they look similar enough to what they should look like. Then we printed the differences between the predictions and the known correct values, and again, we visually decided these differences are quite small. However, this isn't a very professional way of doing things. You wouldn't want your mechanic to just take a quick glance at your car, come up with a diagnosis, and then charge you. Similarly, clients wouldn't want you telling them that the results look good enough; they want you to give them objective measures of how good the algorithm's predictions on the test set are. In order to do that, we will use performance metrics. These are values that tell us how good our predictions are according to some evaluation criteria. To continue the school exam analogy, think of these as the score that your exam paper was given by the examiner. We will introduce two performance metrics: the mean absolute error, or MAE, and the mean squared error, or MSE. We will have to get into a bit of math to explain them, but don't worry, it's not a lot and I will keep it short.
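For reference, here are both metrics in symbols, with n test samples, predictions written as y-hat, and known correct targets y:

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|
\qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2
```

The only difference between the two is whether each difference is passed through an absolute value or squared before averaging.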
The mean absolute error, or MAE as it's usually referred to, is basically what we did, but expressed more robustly. It takes the absolute difference between each prediction and its corresponding correct target (that is, for negative numbers we simply get rid of the minus sign, so in this case, instead of minus 0.01 and so on, we would just take 0.01 and so on), and then takes their average. So it basically takes the average of these differences that we printed here. This way, instead of having to read hundreds or even thousands of such differences, we only have to worry about a single value. The mean squared error is very similar, except that each difference is squared, that is, multiplied by itself, before taking the average. This has the following effect: large differences will increase the final value by a lot, while small differences, especially those lower than one, will have almost no effect. To see this, open up your calculator and see what happens if you square a number such as 235 or 12, and what happens if you square a number such as 0.1 or 0.2. There are two ways to implement these metrics. The first is using NumPy, so let's import numpy. Now we can use NumPy's features to compute the mean absolute error like so: mae_np is equal to np.mean, to which we pass np.abs (this will take the absolute value), and to that we pass predicted minus y_test; recall what predicted minus y_test looks like. Let's print this value and run the code. So this is our mean absolute error on the test set for this problem and for this application of the linear regressor. This is a good value, as the mean of the differences is about this much, and since this is a very small value, the predictions are very, very close to the known correct values in y_test. Let's now do the same thing for the mean squared error: mse_np will be equal to np.mean, to which we will pass predicted minus y_test, squared.
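A sketch of both NumPy computations on made-up toy arrays (the real notebook uses the actual predicted and y_test arrays); scikit-learn's helper functions, which the video turns to next, are used here only as a cross-check:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up predictions and targets standing in for the test-set arrays.
predicted = np.array([0.72, 0.80, 0.55])
y_test = np.array([0.70, 0.84, 0.50])

mae_np = np.mean(np.abs(predicted - y_test))  # mean absolute error
mse_np = np.mean((predicted - y_test) ** 2)   # mean squared error

# scikit-learn's helpers compute the same quantities.
assert np.isclose(mae_np, mean_absolute_error(y_test, predicted))
assert np.isclose(mse_np, mean_squared_error(y_test, predicted))

print(mae_np, mse_np)
```

Since every difference here is below one, squaring shrinks it, so the MSE comes out smaller than the MAE, exactly the effect described above.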
And let's print it. If we run this code, you can see that this is a much smaller value, which makes sense, since we did say that small values, especially those lower than one (and all of these are lower than one), contribute almost nothing to the error. So it makes sense that in this case the MSE value would be lower than the MAE value. The same thing can be achieved using scikit-learn. We have to import the mean_absolute_error and mean_squared_error functions from sklearn.metrics. Then we will have very similar code. So mae_sklearn will be equal to mean_absolute_error(y_test, predicted), and let's print it. Be careful, because the order of the parameters can matter, even if it doesn't really matter for this particular function. In general, the order matters. Here it doesn't, because taking the absolute value or the square of a difference gives the same result either way. But for scikit-learn functions, try to get used to passing the parameters in the correct order given in the documentation, so you don't run into any surprises with different functions. Okay, let's run this code, and we can see that this value is the same as this one. Let's now do the same thing for the mean squared error using scikit-learn. I'm going to copy-paste this code and rename a few things: MSE here, squared here, MSE here, and MSE here. If we run this, we can see that we get the same values. These aren't the only metrics, but we won't be needing anything else for now. Just know that depending on the exact task, there might be other metrics that make more sense to use. 27. Current shortcomings: In this video, we are going to talk about the current shortcomings of our approach. We are going to write down these shortcomings just so that we are aware of them. I'm going to write them as a comment. The biggest shortcoming, which you've likely noticed, is that we don't do any data preprocessing.
You might have noticed that there are columns with relatively small values, let's say lower than ten, and other columns with large values over 100. This means that this algorithm will almost definitely benefit from these being normalized in some way. Next, there are some columns that might benefit from one-hot encoding. For example, the university ranking is a categorical column; it would probably benefit from being one-hot encoded. So: no categorical column handling. The third shortcoming is that the data analysis step is not acted upon. Although we performed data analysis in one of the previous videos, we haven't acted upon the information gained from that step. For example, we haven't done anything with the columns that are badly correlated with the chance of admission. Maybe we could remove them; this might or might not improve our result, but it should still be something that we try. So even if we do get very good results on this dataset, it might be possible to get even better results by doing some of these things. And even if you don't get any improved results after doing them, we should be aware that on more complex datasets these would definitely be beneficial. So we will do them on this dataset in the next video so that you are familiar with them. 28. Addressing the shortcomings by adding data preprocessing: In this video, we are going to talk about addressing these shortcomings by adding data pre-processing. We're going to continue in the same notebook and start by adding one more parameter to the train_test_split function. This parameter is random_state, and you can set it to any value; I'll set it to 42. What this does is make sure that the same random train and test split is generated every time you run this code. We will have to run it multiple times in order to compare our results before data pre-processing and after data pre-processing, and we want them to be compared on the same test set.
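The effect of random_state is easy to verify directly; a minimal sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)
y = np.arange(20)

# With the same random_state, the split is identical on every call
split_a = train_test_split(X, y, test_size=0.25, random_state=42)
split_b = train_test_split(X, y, test_size=0.25, random_state=42)
print(np.array_equal(split_a[0], split_b[0]))  # True: same X_train both times
```

Without random_state, each call would shuffle differently, and comparisons between runs would no longer be on the same test set.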
Another thing we are going to do is comment out this line, which prints the differences between predicted and y_test, because we no longer need those differences; we're using metrics now. Okay, so let's start by adding our first data pre-processing method, and the first one is going to be data normalization. The approach we are going to take is to copy-paste the initial code and simply pass normalize=True to the LinearRegression class. Then let's print our performance metrics after this data preprocessing step. We can start by copy-pasting the previous code and simply changing the messages. If we run these two cells now, we get the following results. But we also have to re-run this code after we added the random_state parameter, in order to take that into account. So let's run everything from the beginning. Okay, here are the results, and now we can compare them. After pre-processing we get this value for MAE, which we can see is almost the same as without the preprocessing step, and the same is true for the mean squared error. So this means that normalizing the data using this parameter did not help much. Let's try another thing and also add one-hot encoding for the university ranking column, or feature. In order to do that, we are going to declare another pandas DataFrame. Let's call it one_hot_uni_ranking and set it to pd.get_dummies, and we're going to get the one-hot encoding for the University Rating column, so we're going to pass in data['University Rating']. Next we have data_one_hot, equal to the original data without the University Rating column. And now we have to add the one_hot_uni_ranking DataFrame to our data_one_hot DataFrame. We'll use the concat function for this: pd.concat, passing in a list of the DataFrames we want to concatenate. The first one is one_hot_uni_ranking, and the next one is data_one_hot itself. And the order matters here.
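The get_dummies plus concat step just described can be sketched on a toy DataFrame (the column names below are illustrative, not the exact Kaggle column names; note the one-hot frame comes first in the concat list so the target column stays last):

```python
import pandas as pd

data = pd.DataFrame({
    "gre": [330, 310, 325],
    "university_rating": [1, 3, 2],   # categorical, despite being numeric
    "chance_of_admit": [0.9, 0.6, 0.8],
})

# One-hot encode the rating, then drop the original column
one_hot = pd.get_dummies(data["university_rating"], prefix="rating")
data_one_hot = pd.concat(
    [one_hot, data.drop(columns=["university_rating"])], axis=1
)
print(data_one_hot.columns.tolist())
# ['rating_1', 'rating_2', 'rating_3', 'gre', 'chance_of_admit']
```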
Because if you pass this one last, the next step, in which we set y equal to the last column, will not be valid. And we must also pass in axis=1. Note that we then have to reset our X's and y's and also our train and test sets. So we start by copy-pasting this code, making sure to rename data to data_one_hot. And this is why we passed a random_state here: so that this call here will get us the same split as this call here. Okay, that should do it. Let's run the code again. And if we look at the results now, they are slightly better than before. We get 0.0425 here after the one-hot encoding step, and we got 0.0427 before for the mean absolute error. The mean squared error is also slightly better, even if the differences are very small. This is a very small dataset, so it's very likely that we are very close to doing the best that we can considering the data. But as you can see, if you keep at it, adding data pre-processing methods, you are very likely to obtain better results in the end, even if our first attempt, which used only normalization, was unsuccessful. It can even happen that after adding some data preprocessing like normalization, your results are worse than before, but it's possible that if you add more preprocessing steps, they will eventually get better. You just have to keep trying multiple methods. 29. Putting it all together a full machine learning pipeline: In this video, we are going to talk about putting it all together: a full machine learning pipeline. We're going to pick up where we left off, with the rather messy Jupyter notebook that we had before. I cleaned up the top a bit and removed some of the cells; we're only going to be using these ones now. So: the data loading cell, which we can run now; the train test split cell, which will remain the same; and we will write two more cells, setting up our pipeline and measuring how well we are doing. I left the data analysis part at the end.
Because although you might be doing this as a first step, it's a step that you will generally not be editing that much after a while. So I think it makes sense to leave it at the end and write your new code above it. But if you want, you can move your data analysis part to the top instead; both are valid, it's just a matter of personal preference. So let's start setting up our pipeline. First of all, let me tell you a bit about what a pipeline is. A pipeline is a series of steps that will be applied to the data in order to get some results. We've seen that we have done one-hot encoding, normalization, and finally used the LinearRegression class. We can put all of these in a special type of collection called a pipeline, and scikit-learn will know how to call them one after the other. This will make our work easier, because we won't have to deal with a bunch of code sequences that each handle a specific thing. We'll have them all together, and it'll be easier to follow that way, and also easier to make changes, considering that we are working in a Jupyter notebook. So first of all, let's import a few things. From sklearn.linear_model, we are going to import the LinearRegression class. From sklearn.pipeline, we need the Pipeline class. From sklearn.preprocessing, we need MinMaxScaler and OneHotEncoder. And from sklearn.compose, which we haven't talked about before, we're going to import the ColumnTransformer class; I'll explain what it's used for in a minute. Okay, now we need to handle the one-hot encoding step of our pipeline, and here is where we are going to use the ColumnTransformer. So let's write one_hotter = ColumnTransformer, and we'll pass in a list of tuples. Each tuple represents a transformation that is applied to one or more columns. It consists of three things. The first is a name for this transformation.
In our case, this will be uni_rating_one_hot, because this transformation will be responsible for applying one-hot encoding to the University Rating column. Remember when I told you that scikit-learn is not very good at applying things like one-hot encoding only to specific columns? Well, I lied a bit. It is able to do it; it's just that we must do it in this way, using a ColumnTransformer. So we're going to use that way now. The second element of the tuple is a OneHotEncoder object, and we must pass in sparse=False, as this will make things easier: we want it to return a classic NumPy array, not a sparse array. And the next is the list of column indexes that we want to one-hot encode, so we must pass it the index of our University Rating column. If we count from 0, we have 0, 1, 2, so this has index 2. And since we only want to apply one-hot encoding to that column, we simply pass in a list with the single element 2. Next, we must specify what this ColumnTransformer should do with the columns that aren't part of this list here, or in any other tuples that we might have. The default is to simply drop those columns, which we don't want. We want those columns to be passed through to the next step of the pipeline as they are, so we will pass in remainder='passthrough'. This is very important; otherwise it will drop the other columns. Next we're going to add a min-max normalization step, which isn't really necessary, as we used the normalize parameter of the LinearRegression class before. But this is just to show you that a pipeline can contain many things, and we'll comment it out later if we decide not to use it; it's only one line of code anyway. So min_maxer will be equal to MinMaxScaler. And finally we have the regression step, where we instantiate our regressor. For now we're going to pass in normalize=False, because we'll be using this MinMaxScaler and we don't want to normalize twice.
Now we declare our pipeline, like this: pipeline is equal to Pipeline, which takes as a parameter a list of pairs. The first element of each pair is a name for that step (one_hot in this case), and the second is the transformer that we want to apply, so one_hotter. Next we have our min_maxer, and finally we have our regressor. And that is it for the pipeline. Now we simply work with the pipeline as if it were any other scikit-learn model; we use it just like we used the LinearRegression class directly. So we are going to call pipeline.fit and pass in X_train and y_train. Then we're going to call predict and save the result to a variable; again, we call predict on the pipeline. And finally, let's just show the results: predicted minus y_test. Let's run this and see if it works. Don't worry if you get some warnings. You can see that we get similar results to what we had before. Now let's write our measuring code to see if we indeed get the same results. From sklearn.metrics we're going to import the mean_squared_error function; we are only going to be using this one now, and you can write the MAE one yourself. So mse_pipeline will be equal to the mean squared error between y_test and predicted. And let's see if we get the same results as before. Okay, you can see that this is slightly different from what we had before. What do you think could be an explanation for this? Feel free to pause the video here for a few seconds and think about it. Okay, hopefully you figured it out. The problem, well, it's not really a problem, but there's a difference compared to the previous code, is that here we use this MinMaxScaler and we set normalize to False, while before we had normalize=True and we didn't use any min-max scaling. And see what I did there: I just changed one thing here and one thing here in the pipeline.
So now it's much easier to change this code, or this pipeline, than it was before, when we had to find the particular cell we were doing each thing in. So now let's run this code again, and this one. And as you can see, we now get the same result. So this code is equivalent to what we had before, except it's much easier to follow, much easier to make changes to, and it generally looks a lot more professional. This is the kind of thing you will see in practice in real machine learning applications and projects, so please take a moment to get familiar with it. Rewatch the video if you have to, because we will be working like this a lot from now on. 30. Exploring the model hyperparameters: I want to take a moment to talk about the concept of hyperparameters in machine learning. You will hear it a lot if you read about the field. A hyperparameter is simply a parameter that you pass to a machine learning model. We have four for the LinearRegression class, and they are usually listed in scikit-learn's documentation. But you might also be working with models that are less well documented, so you should be able to identify them yourself. We are only going to focus on scikit-learn for now. These hyperparameters usually control how well a model will perform: either how well it will learn and how well it will make predictions on the test set, or how fast it will learn, by specifying the computational resources available to the training algorithms. Let's go over these a bit. First we have the fit_intercept hyperparameter. I'm not going to go into the details of exactly what an intercept is and why this is a good thing to set to True; just know that it is True by default, and that in general the default parameters will usually do a good enough job in scikit-learn. Then we have the normalize hyperparameter that we talked about before, which normalizes our data.
Scikit-learn's documentation gives you more info if you want to look it up, such as the fact that it uses the L2 norm. Next, we have some hyperparameters that aren't really related to the training algorithm. We have the copy_X hyperparameter, which means that the algorithm will work on a copy of X when we call fit, for example. Another interesting one, which pretty much all scikit-learn models will have, is the n_jobs hyperparameter, which controls how many processes the training algorithm will use. So if you have, let's say, eight processor cores, you might get a speedup if you set this to eight. This is just something to be aware of; it doesn't have any immediate impact on your learning. I just wanted to make you aware of what a hyperparameter is, if you happen to hear that term. 31. Ridge a more advanced model: In this video, we are going to talk about Ridge, a more advanced model. You can read more about it in scikit-learn's documentation. But just looking at the start of this documentation page, we can already see that it has quite a few more hyperparameters than the linear regression model we used so far. This is why we are going to say that it is a bit more advanced, even if it's not fundamentally different. Okay, so here's what we're going to do. We're going to start from where we left off with the code in the previous videos. First of all, let's run the cells. If we run this cell, recall that we get this warning, so let's try to get rid of this warning first. If we read the actual warning message, we can see that it actually tells us what to do to get rid of it, and that is passing the categories='auto' parameter to the OneHotEncoder. So let's do that. And now, if we run this cell again, the warning goes away. Okay, another thing I want to do is organize this even better.
In order to do that, I'm going to put the entire code into a get_pipeline function that will take a machine learning model as a parameter, and it's going to use that machine learning model. This will let us reuse the preprocessing steps both for linear regression and for the Ridge model that we're introducing now, and also later on for other models. So what changes is that here, instead of instantiating a LinearRegression model, we're going to use our machine learning model parameter, and at the end we simply return the pipeline. We will get rid of this code as well. There are no errors if we run this cell. Now we're going to print the mean squared error for both linear regression and Ridge, and in order to do that, we are going to modify this cell. The imports are going to stay, and we will also import, from sklearn.linear_model, the LinearRegression model and the Ridge model. We are going to use an evaluate function that will evaluate these two models. So we are going to instantiate linear_regressor with get_pipeline, to which we will pass an instance of LinearRegression with the normalize parameter set to True. We're going to do the same thing for the Ridge regressor; I'm just going to rename some things in the line I copy-pasted, so we have Ridge here. We can also pass normalize=True to the Ridge model. And if you look up, you'll see that these two will get assigned here, then here, then finally put at the end of our pipeline. So these two calls will get us two pipelines: the first pipeline, from the first call, will end with a LinearRegression model, and the second pipeline, from the second call, will end with a Ridge regressor model. Now we have to fit these two: linear_regressor.fit(X_train, y_train) and ridge_regressor.fit(X_train, y_train).
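The comparison being assembled here can be sketched end to end on synthetic data (everything below, including the coefficients, noise level, and the default alpha=1.0, is illustrative rather than the course's dataset, and the preprocessing pipeline is omitted for brevity):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data with a known linear relationship plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

results = {}
for model in (LinearRegression(), Ridge(alpha=1.0)):
    model.fit(X_train, y_train)
    results[type(model).__name__] = mean_squared_error(
        y_test, model.predict(X_test)
    )

print(results)  # both MSEs should be small on this easy dataset
```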
Then we're going to print the MSE for the linear regressor, which is going to be the mean squared error between y_test and linear_regressor.predict(X_test). Let's copy this line and rename some things accordingly: so Ridge here, and call predict for the Ridge model, ridge_regressor. And it's predict here, not predicted. There we go. We can get rid of these, and all we have to do now is call evaluate. Let's see: pause the video here and make a guess about which one will get the better results. Well, let's run the cell now. Here we go. So the linear model gets the better results: we have 0.0036 for the linear model and 0.0042 for the Ridge model. So even though I said that this is a more advanced model, it still gets worse results. You might be able to make it get better results, or at least very similar results, by messing with its hyperparameters a bit; for most of them, we just accepted the default values. Try to do that and see if you can get better results. But if you don't, don't worry; like I said, this is a very basic dataset, and for such small values, for all intents and purposes, the results are the same. There's no practically useful difference between these two, and a more advanced model is unlikely to show its true strength on such a basic dataset. 32. The linear regression showdown: In this video, we are going to talk about the linear regression showdown. We are going to compare two scikit-learn linear regression models on two Kaggle datasets. The first dataset is the graduate admissions dataset that we are familiar with by now. The second dataset is this house sales in King County, USA, Kaggle dataset. You can get it from here; download this file, it's only one file. As you can see, it has a lot more columns, and therefore features, and also a lot more instances. I have already written some code that reads this dataset.
So we read the house_data.csv file, which is the file I downloaded from Kaggle and saved in the same folder as this notebook. Then I drop some columns that are definitely not going to contribute to the learning, such as the ID, date, zip code, latitude and longitude. You might think latitude and longitude could contribute, but they would make things more difficult, so I'm just going to drop them for now. Anyway, we don't really want the model to learn some geographical representation; we want it to learn how the other features affect the house price. Next, we write some code to move the first column. The first column is the price column, the column that we want to predict, so our target. Initially, it's the first one after we dropped the date and the ID, but we want it to be the last one, because if it's not the last one, this code here won't work: it assumes that the last column is the target column and everything before it the features. So basically we use list comprehensions to move it and then rebuild our DataFrame. Then we describe our DataFrame. Here you can see some info about each feature, and we can also see the exact count of instances, or observations, in our dataset. We have about 21 thousand, which is significantly more than the 500 in the graduate admissions dataset. I have already run the data analysis part. Here is the pair plot, and we can see that this data is a lot less nice than our previous dataset: the values in these pair plots are a lot more scattered, as you can see. And the same is evident from the correlations heatmap. If you look at the correlations with the price, we can see that some features are quite well correlated, but we also have a lot that are badly correlated. So it might make sense to remove some of them here as well, but we're not going to do that right now; you can consider it an exercise. We're going to start by modifying the get_pipeline function a bit, in the following way.
For the graduate admissions dataset, we have a column that we need to one-hot encode, the index 2 column. But for this dataset we have no such column; all of them are numerical, so there is no need for any one-hot encoding. However, we want to use the same function to handle both of them, so we must parameterize this value here. This list containing the integer 2 must come as a parameter; it must not be hard-coded. This will help if you add any other datasets that require other columns to be one-hot encoded. So we're going to take a parameter called one_hot_cols, and we're going to use it here. Nothing else changes. And under this free-for-all cell, we are going to write our evaluation code that will evaluate both the linear regression model and the Ridge model on both datasets. We are going to make it as general and as professional as we can. First we are going to import our metrics: from sklearn.metrics, we're going to import mean_squared_error and mean_absolute_error. Then we're going to import our models: from sklearn.linear_model, we're going to import LinearRegression and Ridge. That should be enough. Our evaluate function will take as parameters a list of regressors and a list of datasets. For each data in our datasets, we're going to print the dataset we're working on, so we're going to print data['label']. Datasets will be a list of dictionaries, and each dictionary will have a key called 'label' that gives us the name of that dataset, just to make things a bit more user-friendly. Be careful if you use this notation: if you use single quotes here, you must use double quotes here, and vice versa, otherwise you will get an error. Next we have X_train, X_test, y_train, y_test equal to get_splits, and we're going to pass in data['df']. So the 'df' key of the dictionary will contain the actual dataset's DataFrame.
So, the actual data itself. Next we need to iterate over our regressors. So for each regressor in regressors, get the corresponding pipeline by calling get_pipeline, to which we pass the regressor and data['1h']. The '1h' key will correspond to the list of columns that need to be one-hot encoded. Next, we fit this pipeline on the train data and then print the results. We're going to use MAE for this example, and I'll explain why we choose MAE and why it makes more sense here in a minute. So we show the regressor that we are working on, and then the actual results. You don't really need to do it exactly like this; this is just so it looks a bit nicer. Then pipeline.predict(X_test), and that's it for the evaluate function. Now we just need to call it. And before I forget: I don't remember if I ran the changed get_pipeline cell, so let's run that to make sure we have the latest version. Now we're going to pass in, as the list of regressors, LinearRegression with normalize=True and Ridge with normalize=True. For the list of datasets, we must create dictionaries. The first one will have a label of 'grad data', 'df' will be grad_data, and '1h' will be the list we had before as a hard-coded value. Copy-paste this and update it accordingly: instead of grad data we'll have house data, the 'df' will be house_data, and no columns need to be one-hot encoded, so we'll pass in an empty list. Let's run this code and see what it does. Here it is. So we can see that the results for the graduate admissions data are what we expect them to be, and these are the results for the new dataset, the house data dataset. They are quite large, and by now you have probably realized that this is why we chose mean absolute error: because we are working with values, for the price column, in the range of hundreds of thousands. The minimum we can check out here; this will tell us the min.
So the minimum for the price is about ten thousand, ten to the power of four, and the maximum is about 1 million. So it makes sense that we'll have such large errors. And if we used mean squared error as the performance metric, we would have much larger values here, and they would be hard to make sense of. So mean absolute error, in my opinion, makes more sense here. What this tells us is that we are usually off on the price of a house by about $140 thousand. It's up to you to decide if this is a good or a bad thing, or up to your client. But in general, and with all datasets, you can never get perfect results, so this is quite realistic for a linear model. But again, what's more interesting to us, I think, is that the linear regression model does better: its mean absolute error is 143 thousand, compared to Ridge's almost 150 thousand. So let's see if we can improve on this. Let's try to pass an alpha to the Ridge model. The default is one, so let's try passing in 0.1, and run this again. We've obtained a better result now: we have 142 thousand. So messing with the hyperparameters can definitely help. Let's try 0.01. Okay, this is still better than the default, but worse than it was before. So maybe we also want to try 0.05; still worse than it was before. So back to 0.1; this looks like the best. You can keep trying other values, and also feel free to explore the other hyperparameters as well. But the most important thing we've introduced in this video is this way of working, I think: try to parameterize everything and make your code as reusable as possible. That will let you easily introduce new datasets and new models into your code, saving time, and as you know, time is money, especially in this field. Writing the code in this way will improve your GitHub portfolio as well, if you decide to upload your Kaggle work to GitHub, for example. 33.
Linear regression exercise solution: In this video, we are going to talk about the linear regression exercise solution. I decided to use this dataset from Kaggle, which contains data about sales in the Rossmann stores. I downloaded this train.csv file, and we're only going to be using this one; we're not going to be using the test one. Normally, since this is a Kaggle competition, what you would do is download the train file, do your training on it, and then submit your results for the test file. But like I said, we're not going to do that; we are only going to work on the train data. So I saved the file as store_sales.csv, and I started by dropping the store and the date columns. Then I rearranged the columns so that the number of sales, which we want to predict, is the last column. And then I did a little preprocessing, because this data can be a bit messy: there are some instances that contain string data. I'm not sure why, but it doesn't really matter. What we do here is use pandas' apply method to make sure that each column is numeric. The way we do that is I wrote an ensure_float function that tries to convert its parameter to a floating point value. If it succeeds, it simply returns that value; otherwise, it returns 0. So for example, if val is a non-numeric string, this line will raise an exception, this branch will execute, and it'll return 0. The apply method applies the function given as a parameter to every entry in the column we call it on. Okay, and if we describe our data, store_sales, here it is. The mean of the sales is about 5 thousand, and we have various values here for the max and so on. Okay, the rest stays unchanged: the train test split part and the pipeline part stay the same. And now we're at the evaluation part. I chose the Lasso model, which is also imported from sklearn.linear_model, and I plugged in store_data here. Let's run this cell again and see the results.
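As an aside, the ensure_float cleaning idea described above can be sketched on a toy Series (the values below are made up, and this is a sketch of the approach rather than the exact course code):

```python
import pandas as pd


def ensure_float(val):
    # Convert to float where possible, otherwise fall back to 0
    try:
        return float(val)
    except (TypeError, ValueError):
        return 0.0


sales = pd.Series([5000, "6200", "oops", None])
cleaned = sales.apply(ensure_float)
print(cleaned.tolist())  # [5000.0, 6200.0, 0.0, 0.0]
```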
They are starting to show up. Okay, so Lasso does best for the store data. Apparently, it doesn't do so well for the grad data, and it doesn't do so well for the house data either. So considering that we don't really do any data analysis, nor, really, any data pre-processing, these are decent results: about 990. This means we are off by this much for every prediction. It's arguable whether this is good or bad. It's not great, that's for sure, but considering the amount of work we put in, it's definitely decent. So if you chose any other dataset, as long as you've got some results showing up like this: congratulations. And if not, I recommend that you do this again until you can do it without watching any solutions such as this video. 34. Linear regression summary: In this video, we are going to do a summary of this linear regression module. So here's what we learned. We learned about selecting datasets and plugging them in, reading them with pandas. Then we introduced some concepts such as performance metrics, pre-processing data, and hyperparameters. We looked over the documentation concerning some of these, and we even implemented some. Then we did a train test split in order to split our data into train and test sets. And then we discussed some models, in particular the linear, the Ridge and the Lasso regressors, which we used to make predictions on a couple of datasets, and on another dataset while doing the exercise. All of this is nicely complemented by the data analysis part, which can also be used to make certain decisions; we saw that this consists of pair plots, correlations, and domain knowledge and intuition. These will be analyzed more in depth in future videos and sections. But what's very important here is the pipeline concept that we talked about. For example, we saw that we can put the preprocessing part into a pipeline. Same with metrics reporting.
And the pre-processing part can be neatly followed by a model in that scikit-learn pipeline. And this makes things very easy to use and understand and change later on. So congratulations for finishing this module, and I hope you enjoy the next module's videos as well, where we'll talk about non-linear regression. 35. Introduction to non linear regression: In this video, we are going to talk about non-linear regression. Non-linear regression is used when we have nonlinear relationships in our data. Because these relationships can describe more complex processes, and because they also include linear relationships, they are applicable to a much wider array of datasets. Consider the following image from the scikit-learn documentation. It contains the same dataset fitted with a linear regression line and with a non-linear one. We can clearly see that the non-linear one better fits the data and therefore would be more suitable for making predictions. 36. Polynomial features for achieving polynomial regression: In this video, we are going to talk about polynomial features for achieving polynomial regression. We are going to generate a simple synthetic dataset and show how non-linear regression differs from linear regression, when it's a better choice, and what disadvantages we may encounter. Most of the code in this notebook is similar to the code in which we ran linear regression on a dummy dataset. I have therefore pre-written a lot of it, and while I will explain most things, I recommend that you watch that video before continuing. This is the class that we are going to use. It generates polynomial and interaction features, according to the documentation. And what this means is that, given the features a and b, it will turn them into these features here: each feature will be squared and added as a new feature, and the products between the features themselves will also appear as new features.
That is, if we have the degree hyperparameter set to at least two, which is its default value. And you might notice that this class is in the preprocessing module, so it's not an actual machine learning model, it's a pre-processing class. That is because it changes the features; it doesn't do any learning itself. The learning will still be done by our linear regression class, but that class will be applied to the changed features, and therefore the entire thing, the entire pipeline, will act as a non-linear regressor. And let's see exactly how that works. So here we are making a dummy dataset just like before, with some noise. And then we change x by applying the exponential function to it. The exponential function means that every value in x will be replaced by e to the power of that value. And why we do this is to make the relationship non-linear, because the exponential function is a non-linear function. And let's see what happens if we run this code. Okay, and as you can see, this data is different than what we had before. For example, if we comment this out, we get much nicer data. If we put it back, it's nonlinear, meaning that you cannot draw a straight line that is close to all of these points; you would need to draw a curved line, something like this. And let's see if we can get that to happen automatically using scikit-learn. First, we're going to import a few things. We're going to import our linear regression model first, then a pipeline, because we will make a pipeline out of the polynomial features class and the linear regressor, so we will need a pipeline. And finally, we import our polynomial features class. We instantiate our polynomial features class, and let's pass in degree equal to two. Even if this is the default value, I want to pass it in, so I remember to change it later and show you what happens if we change it.
It's always good to pass in the parameters that you plan to play with later, even if you pass in the default value initially. Then we instantiate our linear regressor. And finally, we make our pipeline: the first element in the pipeline will be our polynomial features, and the second will be our linear regressor. Next we fit the pipeline, save our predictions, and let's print the predictions. Here they are. But because we have a non-linear dataset now, it's hard to tell if these are good or not, if they are exponential or not. So we will have to add visualization. In order to have visualization, we're going to use the same code as before, where we create a DataFrame for our predictions. We just have to rename this accordingly, because we called it predictions above, not predicted. And then we simply plot the data, our own predictions, and the predictions using seaborn's linear regressor as well, for comparison. Let's see what happens with this. So let's look at the figure on the right for now. You can see that seaborn drew a straight line, because it does linear regression, and obviously there are some data points that are quite far from this line. And it makes sense, because a straight line cannot be close to all of these points. And if you look on the left, the red line is the non-linear, non-straight line that our pipeline came up with, the pipeline formed from the polynomial features data pre-processor and the linear regression model. And you can see that this line is closer to more points than that one, so this line fits the data much better. Okay, so you might think: what happens if we increase the degree, will that make it even better? Let's try. Okay, so this indeed seems to make it better, right? This part here is almost perfect; this is about as good as a line can get regarding how close it is to all of these points. And this one is also doing a good job of handling this lone point here.
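The fitting just described can be sketched end to end, and we can also measure how the training error behaves as the degree grows. This is a minimal sketch on a made-up exponential dataset, not the notebook's exact code. Note that the training MSE can only shrink as the degree increases, because each higher-degree feature set contains the lower-degree one, so a lower training error alone doesn't prove a better model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.linspace(0.0, 2.0, 100).reshape(-1, 1)
y = np.exp(X).ravel() + rng.normal(scale=0.2, size=100)  # noisy exponential

train_mse = {}
for degree in [1, 2, 10]:
    # PolynomialFeatures feeds expanded features into plain linear regression.
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X, y)
    train_mse[degree] = float(np.mean((model.predict(X) - y) ** 2))

print(train_mse)  # the error on the TRAINING data keeps going down
```

Whether the higher degree also helps on held-out data is a separate question, which is exactly what the overfitting discussion is about.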
Can we do even better? Let's set the degree to ten. Okay, here it is. And now you might think that this is even better, but it's actually not, and I'll tell you why. But to make it maybe even more obvious, let's try to do something extreme here, like 30. Okay, so what happened is that by giving such a large degree, the algorithm basically tries to draw a line that fits every point exactly; it kind of tries to connect each of these points with the red line. And obviously it will fail. And that is what is called overfitting. So let's go back to ten here. See, this point is very indicative of overfitting. Because let's say that we are given a new data point here. If the algorithm hadn't taken this detour here, it would give a very good guess for that point. But because of this detour, it will give us a value close to, let's say, this value here, so a little over 100, when it should be around 120 maybe. And that's not good. So think of overfitting like memorizing last year's exam questions and answers. That's very, very unlikely to help you with this year's exam, unless the exam is exactly the same, which it almost definitely won't be. So you don't want the algorithm to perfectly fit the data. You want it to get close enough so that it generalizes well. So passing in two here, or maybe even three, that's good enough. But anything more than that and you run the risk of over-fitting. And as we'll see later, a symptom of overfitting is when we have a very good result on the train dataset, but a very bad result on the test data. We'll talk more about these in the future videos. 37. Polynomial versus linear regression on two real world datasets: In this video, we are going to talk about polynomial versus linear regression on two real-world datasets. Here is one of the datasets that we are going to use. It's called Berlin Airbnb data, and this is from Kaggle.
It contains Airbnb activity data in Berlin, Germany. In order to run the code using this dataset, download this file here, the listings summary file with 96 columns. And the second dataset is very similar: it contains data about Airbnb listings in Amsterdam. So we have two European capitals. Also download this file here, the listings details CSV with 96 columns. And if you investigate this dataset and compare it with the Berlin dataset, you'll see that they have the same columns. In this video, all of the code is going to be pretty much pre-written. By now, you should have a good idea about how we are doing things, and we are going to keep using the template from the previous section about linear regression. If you haven't watched those videos, I strongly recommend that you watch them, as the code aspect is going to be less emphasized from now on. That is, we're going to focus on concepts rather than on what each line of code does. I'm going to expect you to understand the code that is written, with minimal explanations. And what I'm going to focus on teaching from now on is the actual concepts, or various implementation optimizations, or tips and tricks. But we're still going to be writing code, don't worry about it. It's just that the level is going to be a bit higher from now on. Okay? So what we're going to do is read the two data files, and we made a list of columns to keep. And I am going to explain how I chose these columns. So I chose these columns by eliminating stuff like IDs or host name, stuff like this that is pretty much unique to every listing and therefore cannot contribute to any learning. And using these columns, we're going to be able to learn to predict prices, so the price column, for a new listing. So let's say you have a property in Berlin, and you can tell if it's available 365 days a year, what amenities you have, what room type it is, stuff like this.
And you might be interested in knowing how much you can earn from it. So that is why such a task is helpful if we can solve it, because it can automate this thing. Instead of having to do a lot of market research, you would simply input these features into a web page, maybe, and that web page, using a machine learning algorithm similar to what we will discuss in this video, will tell you what price you can expect to get. And this works for other stuff as well. If you can find the data, maybe you have data about car sales; again, you could plug them into a similar algorithm, and you could make something that predicts the market price for that car. Okay? So now that we have these columns that we're going to keep, we simply select them from both datasets. And notice that whatever we do for the Amsterdam data, we also do for the Berlin data, so I'm only going to be explaining things for one of them. Right. Now, if we look on Kaggle, if you get this, it means that you have to refresh the page, usually, and after that the data should load. Okay, here it is. So to look at the amenities column, we can go here and click select all to display all columns, after waiting a while. Okay. So let's search for the amenities column here. With these, you can see that it's a string column. And what I'm doing here is making a new column for each of these things, that has a 0 or a one in it. So we're going to have a 0 if there is no Wi-Fi listed, in the Wi-Fi column, and a one if there is. And similarly for TV, whether it has a kitchen, and so on. And how we do that is we use pandas' apply function, to which we pass this contains_value function, which is a closure: it returns another function which checks if our desired value is in the given amenities string, and that returned function gives a one if it is, and a 0 if it's not. We can add other things here as well, if you can think of them. Like maybe if it has fiber Internet, stuff like that.
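The closure just described might look like this minimal sketch; the exact helper name and the amenities strings are made up for illustration:

```python
import pandas as pd

def contains_value(value):
    # A closure: returns a function that checks whether `value`
    # appears in an amenities string, encoded as 1 or 0.
    def check(amenities):
        return 1 if value in amenities else 0
    return check

# Hypothetical amenities strings, in the style of the Kaggle column.
df = pd.DataFrame({"amenities": ["{TV,Wifi,Kitchen}",
                                 "{Wifi,Heating}",
                                 "{Kitchen}"]})
df["Wifi"] = df["amenities"].apply(contains_value("Wifi"))
df["TV"] = df["amenities"].apply(contains_value("TV"))
print(df["Wifi"].tolist(), df["TV"].tolist())  # [1, 1, 0] [1, 0, 0]
```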
I'm not going to add it right now, because I don't know if it's contained in any amenities listing for any property in the dataset. But you can tweak some of these and see if there's other things that you think should be added. Right, so let's see: maybe parking, street parking, maybe you think that has an effect on the price, then you would add that too. Moving on, the next thing we do is remove the amenities column and replace it with a count of amenities. And we do that by splitting everything by a comma. So we consider that whenever a comma appears in the amenities column, that comma introduces another amenity, and we replace the string with the count, because a count of amenities is much easier to work with than a string. Another thing that we are doing: if we look at this column, the price column, so the actual column that we are predicting, it has a dollar sign in front. And we don't want that; the dollar sign just messes things up, we just want the number. So what this code here does is simply get rid of that dollar sign. And for some large values, the numbers are split up, so 1 million is given as 1,000,000, and we don't want that either. So we get rid of any commas in our data. We only do that for the extra_people and price columns, because these are the only two columns, from the ones that we selected, that I know have this problem. Next we have these columns here, which are Boolean columns, so they should contain either a one or a 0. But unfortunately, they don't contain one or 0 in the actual data; they contain true or false, or something else for a missing value maybe. What we do here is use the map function to replace those values with one or 0. Recall that we only need numerical data. Another thing is to handle any missing values. This dataset unfortunately does have missing values. And what we do is, for each column, we see if the sum of missing values in that column is higher than 0, and if so, we'll replace those entries with the median.
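A compact sketch of this cleaning, with made-up rows in the same style as the dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "price": ["$1,000,000.00", "$85.00", "$120.00"],
    "instant_bookable": ["t", "f", "t"],
    "amenities": ["{TV,Wifi,Kitchen}", "{Wifi}", "{TV,Wifi}"],
})

# Drop the dollar sign and the thousands commas, then convert to numbers.
df["price"] = df["price"].str.replace("$", "", regex=False)
df["price"] = df["price"].str.replace(",", "", regex=False)
df["price"] = pd.to_numeric(df["price"])

# Replace the amenities string with a simple count (split on commas).
df["amenities"] = df["amenities"].apply(lambda s: len(s.strip("{}").split(",")))

# Map the true/false column to 1/0.
df["instant_bookable"] = df["instant_bookable"].map({"t": 1, "f": 0})

print(df["price"].tolist())      # [1000000.0, 85.0, 120.0]
print(df["amenities"].tolist())  # [3, 1, 2]
```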
And we do that by calculating the median, using the median method of pandas for that column, and then we use the fillna method to fill any missing values with the median. Here's something interesting that you can do. Let's take this line here. Of course, we don't want to do it for any particular column; we want to do it for the whole dataframe, so we have to remove that part. And you get something like this, that tells you, for each column, how many missing values you have. And that can be useful to make decisions: if you have a lot of missing values, then it makes sense to remove that column altogether. But we don't really have a lot of missing values in any column, because our dataset is quite large. So even this 2600 here is not that much, so I think it's a good decision to simply fill it in with the median. Next, you've probably seen this code before if you have watched the other videos: we simply move the price column to the end, and then we convert our price column to numeric. Right, we make it a numeric column in pandas. And the reason we do this is to help with our data analysis, because the data analysis part, especially the correlations part, will not display some columns if they are not numbers, if it considers them categorical columns. So let's run the pair plot. I already started running it; it takes a while. And then let's run the correlations heatmap, and this takes a while too. 38. Polynomial versus linear regression on two real world datasets: Alright, so here is the pair plot. It's very big, as you can see. And it's almost the same for the Amsterdam dataset, so I'm not going to run that one. And if we look at the pair plot, we can see that the data is quite scattered, or the plots are. But some of them are linearly correlated, apparently. Let's look at the correlations heatmap. And if we look at the price, we can see that we don't have very good correlations.
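The missing-value handling described above can be sketched like this, on a made-up column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"bedrooms": [1.0, np.nan, 3.0, np.nan, 2.0],
                   "bathrooms": [1.0, 2.0, 1.0, 1.0, 1.0]})

# Count the missing values in every column at once.
missing = df.isna().sum()
print(missing.to_dict())  # {'bedrooms': 2, 'bathrooms': 0}

# For each column with missing entries, fill them with that column's median.
for col in df.columns:
    if df[col].isna().sum() > 0:
        df[col] = df[col].fillna(df[col].median())

print(df["bedrooms"].tolist())  # [1.0, 2.0, 3.0, 2.0, 2.0]
```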
But some of them are decent. It seems that this one here is quite well correlated with the price, or well correlated compared to the others at least; it should be around here somewhere, and this color is 0.2, maybe. Okay. So as you noticed, we spent a lot of time doing this data pre-processing and data cleaning: turning everything to numbers, doing some feature engineering here for each of these amenities, cleaning the dollar signs, and stuff like that. And for most machine learning tasks, this is the step that takes the most time, because you will never have perfect data. You must spend time making it as perfect as possible for the task at hand. Since our task is to predict Airbnb property prices, we must make the most out of the data we have and use our own knowledge and common sense to try to improve it. And this step usually takes a lot more than actually applying a machine learning model to the data. And of course, there is still more that you could do. For example, we have latitude and longitude columns as well; we just didn't keep them here. And what you could do is compute the distance, from the longitude and latitude given in the data for each property, to the city center, so to the Amsterdam city centre or the Berlin city centre. Because it makes sense, intuitively at least, that the price would be correlated with, or affected by, how far a property is from the city centre, or how far it is from other points of interest in the city, for example from the airport, or some train station, or some important tourist attraction. So feel free to do these as an exercise: try to improve this data preprocessing part and see how that affects the results, and if you can beat the results we're going to show in a minute. The train test split part remains unchanged: we simply keep 30% of the data as test data. We have plenty of data, so we can afford to keep that much for testing.
As for the pipeline, there is one small change: this handle_unknown parameter passed to our one-hot encoder. What this does is it ignores inputs that it hasn't seen in the train set. So let's say, for some column, you have something like a value of 'single bed'. And let's say that this value was not seen in the train data for that column, so the one-hot encoder does not know how to encode it. So what should it do if it sees it in the test data? We tell it to ignore it. If you don't tell it anything, it will give an error, and we don't want that; it's better if we just ignore it. And here we have our evaluate function. Most of the inputs are the same; this one we can ignore, I only added it to test something out. And here is how we call it for the data part: we pass in the corresponding DataFrames, and we pass in the columns that need one-hot encoding. You just have to print the DataFrame and count these. And they are the same across datasets, because we did the same preprocessing, and they had the same columns to begin with. For our models, we are going to use Ridge, linear regression, and a pipeline that contains polynomial features of degree two and Ridge with alpha set to two. The make_pipeline function creates a pipeline out of what you give it as parameters, but it's a bit simpler than calling the Pipeline class directly, because it doesn't force us to name the steps. So it doesn't expect tuples; we can simply pass it the actual models. And you import it from sklearn, for our pipeline as well. So let's run this code and see what results we get. And feel free to add your model from last section's exercise as well, and the other datasets too, if you have them. Okay, feel free to pause the video here and see if you can make a prediction for what our mean absolute error will be for each of these models. Okay, so some results are showing up already.
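Here is a minimal sketch of the handle_unknown behavior and of make_pipeline; the column names, category values and alpha are illustrative, not the notebook's exact ones:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"room_type": ["Entire home", "Private room"],
                      "accommodates": [4, 2]})
test = pd.DataFrame({"room_type": ["Shared room"],  # never seen in train
                     "accommodates": [1]})
y_train = [120.0, 60.0]

# handle_unknown="ignore": an unseen category encodes to all zeros
# instead of raising an error at predict time.
enc = OneHotEncoder(handle_unknown="ignore").fit(train[["room_type"]])
row = enc.transform(test[["room_type"]]).toarray()
print(row)  # [[0. 0.]]

# make_pipeline: no (name, step) tuples needed, just the steps themselves.
pre = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["room_type"]),
    remainder="passthrough")
model = make_pipeline(pre, Ridge(alpha=2))
model.fit(train, y_train)
preds = model.predict(test)  # works despite the unseen category
```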
So for the Ridge model and Amsterdam, we get about 42 for the mean absolute error. And what that means is that, on average, we are about $42 away; so we are wrong by $42 for every prediction. Again, it's debatable if that is good or if that is bad, it depends. But for linear regression, we get a very bad result, of like over a 100 million. So we will simply ignore that: linear regression definitely doesn't work here, right? And for our pipeline consisting of polynomial features and the Ridge regressor, we get a result which is actually worse than just Ridge itself. So what we can do here is try tuning the degree and the alpha hyperparameters; maybe that will get us better results. For Berlin, things are looking a bit better. Simple Ridge gets us 30, so we are off by an average of $30 per property. Linear regression again fails completely, so we don't care about it. And our polynomial regression gets 26.5, which is quite a bit better than simple Ridge. So again, these results are not perfect, but neither is our data pre-processing. And that data pre-processing is the most important part, so I do encourage you to try to add things to it and try to improve on these results. But until then, let's see what happens if we change this degree parameter. Let's say we set it to three here. And we get an error, actually. And that is because we've run out of memory; it's a memory error, and it says it's unable to allocate an array of shape 14 thousand by 471 thousand. So where does this 471 thousand come from? Well, recall what polynomial features does: it multiplies the features between themselves to obtain nonlinear regression. So if you set the degree to three here, our number of features will increase roughly with the number of features to the power of three. And that's a lot; we already have quite a few features, about 20, I believe. So 20 to the power of three is a lot, and it makes sense that we run out of memory.
Even if we didn't run out of memory, it would definitely run slowly as well. So this is not a very good way of achieving non-linear regression, and we'll see better ways in the next videos. 39. Support vector regression: In this video, we are going to talk about support vector regression. Support vector regression, or epsilon support vector regression, is a class of machine learning algorithms that are all-around nonlinear regressors. We can see that the fit time complexity is a little more than quadratic in the number of samples, which makes it unfeasible to use for datasets with more than about 20 thousand samples. But fortunately, in our case we only have about 20 thousand samples, so we should be fine. And for more than that, we are given some alternatives in the documentation, the first of which is the linear support vector regression, which does linear regression. The more general SVR does non-linear regression. It does that by specifying a kernel, which defaults to the RBF kernel. The kernel is a special type of function that is used to precompute a kernel matrix based on the training samples. And that function will act kind of like the polynomial features pre-processor that we used before. We have a bunch of hyperparameters here, most of which affect the performance of the algorithm. And one problem with this algorithm is that it's quite hard to tune: you have to mess with these hyperparameters a lot to get the best results. And the most important ones are gamma and C. Thankfully, gamma only has two options, scale and auto, but C is the regularization parameter, same as alpha that we've seen in Ridge, with one notable difference: while a higher alpha meant more regularization, a higher C means less regularization. It says so right here: the strength of the regularization is inversely proportional to C. But we haven't talked much about regularization in general. Regularization is something that we use to avoid overfitting.
Recall that overfitting is when you fit the training data too well and then fail to provide good answers on the test data. Regularization helps avoid this by penalizing models that try to overfit the training data: it won't allow them to fit the training data too well. We have seen that too much alpha or too little alpha can cause problems with Ridge, and the same is true for support vector machines, or support vector regression in particular. You can read more about it in the documentation; we'll put it into our pipeline in a bit and see how it behaves. You can also read the more general page that talks about support vector machines, which are a set of supervised learning methods that can be used not only for regression, but also for classification and outlier detection. And we are given here some advantages and disadvantages. The advantages, according to the scikit-learn documentation, are that it's effective in high dimensional spaces, and high dimensional spaces means that we have a lot of features. It says it's still effective when the number of dimensions is greater than the number of samples. This means that our features are more than our observations, or data instances. But this is arguable, because most algorithms will not work very well in that case. So take this 'effective' with a grain of salt: it might mean it's more effective than others, but in general you don't want this to happen; we don't want to have more features than samples. It's also memory efficient, because it only uses a subset of training points, called support vectors (that's why they're called support vector machines), in order to do the training. So that makes them memory efficient; recall that for the polynomial features pre-processor we got the memory error. That shouldn't happen here. And they're versatile, because we can plug in different kernel functions. Some common ones are provided, but we can also write our own if we want to.
And the page says here that, if the number of features is greater than the number of samples, the regularization term, that is the C hyperparameter, is very important. The other point is not very relevant to our case so far; it mostly refers to classification problems, so we don't really care about it now. Okay, so let's go into our code, to our pipeline, and let's plug in a support vector regressor here and see how well it does. Let's get rid of the linear regressor, because it got us very bad results and we don't need it cluttering our output window. So we import the SVR class from sklearn.svm, and I'm going to instantiate it here at the end, passing in the RBF kernel, and let's try gamma set to auto. Okay. And let's also do C equals one. Most of these are the default values, but if we want to change them later, we'll have them here already. I forgot to run the first cells; let's go back and do that. So the way this notebook is set up, you have to run the cells in order, otherwise you will get these undefined errors. But that's not hard: if you get an undefined error, just go back and run each cell. And see, now it works. And as you can see, it's taking a bit of time for the SVR class to finish training and outputting its results. And that's kind of normal; recall that the complexity is quite nasty, so it's normal for it to take a bit of time, since we have about 20 thousand data instances. Okay, we're starting to get results. As you can see, we got a better result for Amsterdam: we got about 36.5, compared to 41 for Ridge and for our polynomial regressor. But for Berlin, the support vector regressor results are a bit worse: we got 30.4, which is only slightly better than Ridge, and definitely worse than the polynomial regressor.
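A minimal, self-contained sketch of the regressor just configured, on a made-up non-linear dataset; the kernel, gamma and C values follow the narration, everything else is illustrative:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.uniform(0.0, 3.0, size=(200, 1))
y = np.exp(X).ravel() + rng.normal(scale=0.1, size=200)  # non-linear target

# RBF kernel, gamma="auto"; remember a HIGHER C means LESS regularization.
svr = SVR(kernel="rbf", gamma="auto", C=1.0)
svr.fit(X, y)

mae = float(np.abs(svr.predict(X) - y).mean())
baseline = float(np.abs(y - y.mean()).mean())  # always predicting the mean
print(mae, baseline)
```

Comparing against the constant-mean baseline is a quick sanity check that the kernel machine is actually learning the curve rather than just outputting an average.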
So it's not always going to give better results. But it might be possible to get comparable results with the polynomial regressor, or even better for Berlin, if we mess with these hyperparameters. So for example, we might go here and make C equal to two. Let's see what happens now, if anything at all. Okay, so we still get better results for Amsterdam; that's a good thing, we haven't made anything worse there at least. And for Berlin, we are still waiting on the results; hopefully we get something closer this time, but it's not a sure thing by any means. Almost there. Okay, so it's a bit closer now. Definitely better than the Ridge regressor, but we're still quite far away from this. So feel free to experiment with these hyperparameters a bit and see if you can get close to the polynomial regressor. Overall, what you gain from this is the understanding that support vector regressors are good overall non-linear models, and they're a good first attempt at a machine learning problem, to get a baseline for what kind of results you should be obtaining. Other models might be able to do better, but they likely won't do a lot better. 40. Decision tree regression - 1: In this video, we are going to talk about decision tree regression. Decision trees are a class of models that basically learn a set of if-else conditions that allow them to make the predictions we want them to. For example, they might learn that if a room has a TV, we should check if it also has a fridge before making a decision; things like this. This has a very important advantage: they are white box models. This means that, after training, we can use them to improve our own understanding of the subject matter, how to predict the Airbnb prices in our case. For support vector regression and other models, we couldn't do that; that was a black box model. It learns some complex internal representation of the data.
But from our point of view, it might as well be magic, because we can't really use it to figure out a set of human-like decisions. However, there are also disadvantages: it's very easy for the model to overfit, and it's not suitable for all problems. You can read more about the advantages and disadvantages here; as you can see, there are quite a lot. Also, this is not a model that is suitable for all problems. In fact, it usually works better for classification problems rather than regression problems, but we will try it out here as well. Now, we will focus on some implementation details. 41. Decision tree regression - 2: Okay, let's go down to our evaluation cell. And in order to plug in decision trees, we'll have to import a few things. So from sklearn.tree we're going to import the DecisionTreeRegressor class, the export_graphviz function, and the export_text function. And I'm going to explain what these are in a few minutes. Until then, let's first instantiate our decision tree; I'm going to do it here. And this is going to be equal to DecisionTreeRegressor, to which we'll pass one very important hyperparameter, and that is max_depth. This is going to be equal to five. You can play around with other values as well; they should be relatively small, and again, I'll explain why in a minute. But until a more detailed explanation, just know that this helps avoid overfitting. If you don't pass anything, or if you pass in a rather large value, it will definitely overfit the training data: it will kind of memorize what the training data is, and the predictions on the test set will not make much sense, so they will be very bad. We are going to comment out the support vector regression model in the evaluate function call, because that takes a bit of time, and this will make things go faster. Okay, so we just plug in the decision tree here, and let's run this and see what happens. We are starting to get some results. So we got our results for the Berlin dataset.
The decision tree regressor actually got the best result out of the ones we ran. And for Amsterdam, it's close, but it's not exactly the best; it's in second place. What you can do here is also run the support vector regression model and see how it compares to that. Or, if you only want to compare the non-linear models, comment out Ridge, or you can comment out the pipeline with polynomial features as well, if you want to compare decision trees with support vector regression on these datasets. All right, so it performs quite well; it's definitely competitive. But I mentioned an important advantage, and that is that it's a white box model, which means that we can learn from it ourselves, in order to gain a better understanding of the problem that we're dealing with. And in order to do that, we are going to visualize the tree, which means basically visualizing the if-else conditions that it has learned. In order to do that, we're going to import graphviz and pydot. And you can pip install these; I suggest you use pip to install them. Don't use Anaconda, because you can have some issues with that: even if they install correctly, you might not be able to import them. So just use pip to install them. And in order to export a PNG figure, we're going to do dot_data equals export_graphviz; we're going to use that function, to which we will pass our decision tree. The output file will be None, we want filled to be True, rounded to be True; this will just make it look a little better. And let's set special_characters to be True as well. And this dot_data is a special type of format that this function gives us. We're going to use pydot to turn it into an image file that we can display. So in order to do that, we can use graph equals pydot.graph_from_dot_data, to which we will pass our dot_data variable. You don't have to remember this code.
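A self-contained sketch of the steps just narrated, on a made-up dataset. It stops at the DOT string, which is what pydot would then turn into a PNG, so graphviz and pydot are not required to run it:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_graphviz

rng = np.random.RandomState(0)
X = rng.uniform(0.0, 3.0, size=(500, 2))
y = 10 * X[:, 0] + np.exp(X[:, 1]) + rng.normal(scale=0.5, size=500)

# max_depth caps the number of if/else levels, which limits overfitting.
tree = DecisionTreeRegressor(max_depth=5)
tree.fit(X, y)
print(tree.get_depth())  # never more than 5

# export_graphviz returns the DOT description of the learned tree;
# pydot's graph_from_dot_data plus write_png would render it as an image.
dot_data = export_graphviz(tree, out_file=None, filled=True,
                           rounded=True, special_characters=True)
```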
I don't remember it either; I always get it from the documentation or online examples. So I'll just keep a reference to it in this Jupyter notebook, and you can just copy it from here whenever you need it. Then we can do graph.write_png, and you can give it whatever name you want here; I'm going to give it decision_tree.png. And if I run this, it should finish quite fast, and it should give you this file, which should be in the same folder as your Jupyter notebook, and it should look something like this. So this is the tree that the algorithm built, and here's how we read it. The tree here has X[28] <= 0.5. This means that it looks at the feature with index 28. You could also pass it a list of feature names, but in our case that's a bit hard to do, because our feature names are kind of lost when we use the one-hot encoder. If you want to use the feature names so that they are displayed in the tree, what you can do is get rid of the one-hot encoder here and use pandas to do the one-hot encoding somewhere here. Then you will have the feature names, which correspond to the pandas DataFrame columns. That's a good exercise for you to do. I'm not going to do it now, because it would be quite simple to accomplish based on what you have learned so far, and also it doesn't really help in understanding how decision trees themselves work. So the tree asks itself if the feature at index 28 is lower than or equal to 0.5, and if yes, it goes to this branch here, looks at the feature at index 54, and asks itself if that is lower than or equal to 13.5, and so on, until it reaches a leaf node, or final node. And that's the end of it: that's what it uses to make a prediction about the price, which can be 8000, 2500 and so on, or smaller values as well.
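The two export helpers can be sketched like this. The tree here is fit on synthetic data as a stand-in; pydot and the Graphviz binaries are third-party installs, so the rendering step is guarded in case they are missing:

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor, export_graphviz, export_text

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# out_file=None makes export_graphviz return the DOT source as a string
dot_data = export_graphviz(tree, out_file=None, filled=True,
                           rounded=True, special_characters=True)

# pydot (pip install pydot; also needs the Graphviz binaries) renders
# the DOT source to a PNG; skipped gracefully if unavailable
try:
    import pydot
    (graph,) = pydot.graph_from_dot_data(dot_data)
    graph.write_png("decision_tree.png")
except Exception:
    pass  # no pydot/Graphviz installed

# the text alternative needs no extra installs
text = export_text(tree)
```

Printing `text` shows the same tree as nested `feature_i <= threshold` lines, which is handy when image rendering isn't an option.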
And if you notice, the height of this tree is five. That means that if we count the maximum number of these lines here — these edges — until we reach a leaf node, that number will be five: we have one here, two, three, four, five. That's because we set the max depth to five. If we didn't limit it, we would have a huge tree, with a height of up to thousands, and obviously that would overfit. We have only about 20 thousand data instances, and if we had a tree of height even 100 or thereabouts, it would have hundreds upon hundreds of nodes, and it would basically have about a node per data instance. That's a clear case of overfitting: it basically memorizes the training set, and we don't want that. We have a few more things here, but they are not very important at the moment. We have this MSE, which stands for the mean squared error corresponding to the splits here, the number of samples corresponding to the splits here, and so on. These are all controllable through hyperparameters, which you can read about in the documentation. I encourage you to play around with these: try to tweak them and visualize the tree, to see how they affect the resulting tree. So this is a very important thing, because we can easily understand the kind of logic or insight that the algorithm has gained into this data by analyzing such a tree. That's a great advantage of decision trees, and it's what makes them very useful for a lot of problems. If you have trouble installing graphviz, or if you simply don't want to work with images, or don't have a way to visualize them for whatever reason, you can also visualize the tree in a text format. To do that, we can use text = export_text, to which we pass our decision tree, and then simply print the variable. And we get something very similar, in text format — it's the same tree. So you should use decision trees
if you want to have a white box model that you can understand, and as a baseline algorithm, because it generally behaves quite well and gives good results on most datasets. They are easy to visualize, which can tell you a lot about what the model is learning, and also about what you should tweak in order to make the learning even better. 42. Random forest regression: In this video, we are going to talk about random forest regression. In order to do random forest regression, we're going to use the RandomForestRegressor class from sklearn.ensemble. The ensemble part means that this is a combination of multiple other models — in this case, a combination of multiple decision tree regressors. So basically what this does is combine multiple trees, the number of which is given by the n_estimators parameter, which specifies the number of trees in the forest. By doing this, we obtain better results in general, and less overfitting as well, in general. Most of the time, this is a better choice than a single decision tree, so I always recommend that you start with this rather than use a decision tree. It will almost always give better results than decision trees, so use this if you can. Okay, let's see how we implement this. Let's scroll down to our evaluation part, and from sklearn.ensemble we import RandomForestRegressor, which we instantiate down here. Make sure we pass in the number of estimators, which can be quite high — let's say ten — and the max_depth parameter, which will specify the maximum depth for each tree in the forest. So it's the same as the one here: max_depth will be five. Let's run this and see how it compares with our decision tree. Okay, we got some results for the Amsterdam dataset. We can see that the random forest regressor got a bit better results than the others, so it's a bit better than the decision tree regressor.
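A rough sketch of this comparison on a synthetic stand-in dataset (variable names are placeholders, not the course's pipeline):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators is the number of trees in the forest; max_depth limits
# each individual tree, just like for a single decision tree
forest = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=0)
forest.fit(X_train, y_train)

tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_train, y_train)

forest_mae = mean_absolute_error(y_test, forest.predict(X_test))
tree_mae = mean_absolute_error(y_test, tree.predict(X_test))
print(forest_mae, tree_mae)
```

On most datasets the forest's MAE comes out a bit lower than the single tree's, though as noted in the lesson the gap is not always large.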
And for Berlin, it's also a bit better than the decision tree regressor, but not by a lot; it's quite insignificant, but it's there. Let's see what happens if we pass in 100 estimators. Okay — also better results on the Amsterdam dataset, and again slightly better results on Berlin as well, but not by a lot; we can hardly call it a significant difference here. Again, it depends a lot on the dataset, but generally you want to use a random forest rather than a single decision tree; it will generally be better. As for advantages and disadvantages, they're mostly the same as those for decision trees. By using multiple estimators, we reduce the chances of overfitting a little, but we still have to control that using the max_depth parameter as well, so make sure you don't forget that one. Other than that, they are quite similar, and there's not much more to say about them. Just prefer a random forest in general, rather than a single decision tree: a forest is generally better than a tree. Quite intuitive, I believe, and quite an easy thing to remember. 43. Artificial neural network regression: In this video, we are going to talk about artificial neural network regression. Artificial neural network regression, or multilayer perceptron regression — because artificial neural networks are also called multilayer perceptrons — is a method that basically generalizes linear regression by forming a model such as this, in which each node, out of many nodes, consists of a linear regression model. More specifically, we have a bunch of layers, like you see here. Each of these stacks of nodes is a layer, okay? The first one from the left is called the input layer, and these Xs are basically our feature values. We also have a bias term, which is one all the time. And what happens is we have multiple hidden layers — in this case it's this layer here, which consists of multiple nodes. So in this case we have one hidden layer with k nodes. And what happens?
All of the inputs get fed into each hidden node. So a1 here, for example, would get the values of all of our features, plus one — so basically a special feature x0 whose value is always one. And that is why I say that each of these nodes is basically a linear regressor, because it will act like that: it will do linear regression on these inputs. So we will basically have k linear regressors here, and their outputs will go to the output node, which will apply a function to them and provide the final output of the entire network. That function is generally also a linear regression, followed by some special type of function called a nonlinearity. And it gets quite complicated. For example, the outputs of these hidden nodes that go into the final output node can also go through a function first, which is also called a nonlinearity. But we don't have to go into details; just know that they are quite complex, and that scikit-learn is actually not the best library to use if you want to use neural networks. Neural network models are highly complex. They are highly dependent on their hyperparameter settings, on their hidden layer configuration, their activation functions for each hidden layer, and so on. So there is a lot to configure about them; it's very easy to get them wrong, and it's very hard to implement a proper neural network model using scikit-learn. We're going to plug them into our template project and see how they behave, but again, scikit-learn is not the best library for this: it does not support a lot of things that you would need to make a proper neural network model. You would need GPU support, for example, and generally all kinds of methods to make them behave properly in practice, which scikit-learn does not provide, but which other libraries dedicated to neural networks, which you will see in other videos on this platform, do support.
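The description above — k linear regressors feeding an output node, each followed by a nonlinearity — can be sketched as a toy forward pass. The weights here are random made-up numbers, and ReLU is just one common choice of nonlinearity:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)          # feature values (the input layer)
W1 = rng.normal(size=(3, 4))    # k=3 hidden nodes, each a linear regressor over the 4 inputs
b1 = rng.normal(size=3)         # bias terms (the special "always 1" feature)
w2 = rng.normal(size=3)         # weights of the final linear combination
b2 = 0.5

hidden = np.maximum(0.0, W1 @ x + b1)   # k linear regressions + ReLU nonlinearity
output = w2 @ hidden + b2               # the output node: one more linear regression
print(output)
```

With the nonlinearity removed, the whole thing collapses into one big linear regression, which is why the nonlinearity is what gives the network its extra power.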
Okay, so what you should get from this is that it's basically a big generalization of linear regression, that they usually work well for very large datasets and complex problems, especially stuff to do with image classification, and that scikit-learn is not the best library to use them from. Now let's go into our code and see how we can implement them. Right, so from sklearn.neural_network we are going to import the MLPRegressor class, which we are going to instantiate down here. It's usually a good idea to pass in hidden_layer_sizes, which will specify the size of each hidden layer — so the number of nodes in it. It's going to be a tuple, and let's try 120. Let's run this and see what happens. As you can see, it can take a long time, at least much longer than the random forest and decision tree, and that's another disadvantage: they can be quite slow. And look, we got some results. We got 1.9 — a very bad result compared to the others, the worst so far. And that's not because the model is a bad one; it's because our configuration — our choices of hyperparameters, or rather the defaults, because we left most of them at their default settings — is not the best one, at least for this dataset. We also get this warning here, which says that the maximum iterations count was reached and the optimization hasn't converged yet. So, without going into too much detail about how neural networks are trained, what this means is that the optimizer that does the training — basically the one that optimizes the neural network model so that it learns as best as it can from this dataset — did not manage to properly converge, to properly learn from the training set, in the number of iterations that we have set, which was 200. So it basically tells us that the neural network hasn't finished training in the amount of time we have given it. It doesn't directly correspond to time.
There's no fully one-to-one link between time and iterations, but it's close enough, so you can think about it like it hasn't managed to learn properly in the amount of time we have given it. That can happen for various reasons. For example, the hyperparameters might be bad, or the dataset might be bad, or our choice of optimization algorithm — the solver parameter here — might not be the right one, and so on. It can be a bunch of things that are wrong; maybe our hidden layer sizes aren't good enough. Like I said, it can be a lot of things, and it's very hard to work with neural networks in general because of that. So they are kind of an expert model, because they give you a lot of power: you can configure a lot, and if you do it properly, you will get state-of-the-art results — a lot of the state-of-the-art models are neural network models. But we're not going to get into that for now; I suggest you use other models until you gain more experience and more knowledge from our future videos. Let's see how we did for Berlin. Again, for Berlin the result is the worst, but it's still quite close; it's closer than it is for Amsterdam, a bit closer. So that's about it for neural networks. The bottom line is, I don't recommend that you use them yet, and in general I don't recommend that you use them from scikit-learn; if you do have to use them, choose another library. 44. The final regression showdown: In this video, we are going to talk about the final regression showdown, where we compare all of our non-linear and linear models on both of our datasets. We have this cell here where we plug in our models, so let's make sure we have everything plugged in. I'm going to uncomment the support vector regressor, which we had commented out because it's quite slow; we want its result as well. And if we run this now, we will get something like this, with the support vector regressor included.
But this looks quite ugly, and it's frankly quite hard to read, because it's hard to take a quick look at it and say which dataset is best for these kinds of tasks, or which model is best on each dataset. So we're going to make this whole thing a bit more visual, so we can draw some conclusions. Okay, in order to do that, we are going to change this evaluate function, like so. It's no longer going to just print. We're going to have a verbose parameter whose default value is True, and if this is True, it will also print the results. But its main function will be to display a figure that will visually present the results on each dataset for each of the models. In order to do that, we are going to use a dictionary, which at the end will be turned into a pandas DataFrame. So here we will have a dataset column, a regressor column, and a mean absolute error column. Next we're going to get our dataset label, which is just data['label']; the rest stays unchanged. And here we're going to say results['dataset'].append(dataset_label) — make sure these keys are all spelled the same. get_pipeline remains unchanged, fit remains unchanged, but this reporting will change a bit, like so. First we will have a regressor name, which will be equal to the regressor's name without the parameters: we only want this part out of each of them; we don't really care about these at this point. You might care about them while debugging, but generally you know what they are, so you don't feel the need to print them. Okay, so in order to get that part, we can use the split function — split by the open parenthesis — and get the first result out of that. That provides what goes in the regressor column: results['regressor'].append(regressor_name). Then we compute the mean absolute error, or MAE, which is simply this part here.
And if verbose is True, we're also going to print this as text: the dataset label, the regressor name, and the MAE. Now we can get rid of this, and here we simply put the MAE into our results dictionary: results['mae'].append. After this is all done, we have to turn results into a pandas DataFrame, so df = pd.DataFrame(results). And we're going to use a bar plot from seaborn to display our figure, like so: sns.barplot. The x-axis will be 'mae' — this one has to correspond to the key here; it's important that they correspond, otherwise you will get an error. y will be our 'regressor' label, and we're going to pass hue='dataset' and data=df. hue='dataset' basically tells seaborn to group our bar plots by the dataset, and we'll see exactly what that means when we run this. The rest should stay as it is; let's run it and wait for the new results. You can see that even when printed as text, it looks much easier to read. Let's wait for the whole thing to finish, and we'll come back and talk about it when it's done. Okay, we've got our results, and here's what they look like. So we have a figure that looks kind of like a histogram — it's not a histogram, it's a bar plot. What this means is we have bars for each regressor and each dataset. The blue bars represent the Amsterdam dataset and the brown bars represent Berlin. The horizontal axis represents our mean absolute error, and the vertical axis, or y-axis, represents our regressor. This helps us get a bird's-eye view of how things are going for each dataset and for each regressor. For example, just glancing at this figure, we can say that for Berlin, the best model is the random forest regressor, because it has the lowest MAE — about 23, maybe. If you want to know exactly, we can go here and see that it's 24.869 and so on. So this makes things much easier to read than the short text here.
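The dictionary-to-DataFrame-to-barplot flow can be sketched like this. The MAE numbers are dummy stand-ins for the real evaluation results, but the column names mirror the ones built in the lesson:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import pandas as pd
import seaborn as sns

# dummy results standing in for the real evaluate() output
results = {
    "dataset": ["Amsterdam", "Berlin", "Amsterdam", "Berlin"],
    "regressor": ["DecisionTreeRegressor", "DecisionTreeRegressor",
                  "RandomForestRegressor", "RandomForestRegressor"],
    "mae": [30.1, 26.4, 28.7, 24.9],
}
df = pd.DataFrame(results)

# hue="dataset" groups the bars by dataset, one color per dataset
ax = sns.barplot(x="mae", y="regressor", hue="dataset", data=df)
ax.figure.savefig("model_comparison.png")
```

The x/y arguments must match the dictionary keys exactly, which is why the lesson stresses keeping them spelled the same.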
And also for Amsterdam, we can see that they are all quite the same, except support vector regression, which kind of distances itself; on this run at least, it has an MAE of just under 40. We can scroll up and look at the exact value if we want to, but that's not really the point here. The point is that by displaying such figures, you can draw conclusions much faster. In practice, you would probably have even more models here; maybe you would have more variants of the same model. So maybe you would have multiple random forest regressors with different hyperparameters, or multiple support vector regressors with different hyperparameters, that you would want to compare against each other. In order to do that, you would use a figure such as this, or even multiple such figures, if you had enough models and datasets that a single figure would get too crowded. So in general — you know what they say, a picture is worth a thousand words — this picture is definitely worth more than these printouts here, so definitely prefer such visualizations when possible. 45. Non linear regression exercise: In this video, we are going to talk about your exercise for non-linear regression. What you have to do for this exercise is go down here, to where the evaluate function is called, and pick two of these non-linear regressors. Let's say you're going to pick SVR and the random forest regressor. Try to add another variant for each — so something like this, and this — and tweak the hyperparameters on each one, so that the one you are tweaking gets better results than the one we had before. Okay, so maybe you will change the kernel for SVR, or maybe change gamma or C. For the random forest,
maybe you'll change the number of estimators, or the max depth, or some others that you can read about in the documentation. Try to spend a bit of time on this — about 30 minutes maybe — and really make sure that you have tried as best as you can. I'm not saying it's definitely possible to get consistently better results, but you should be able to achieve at least slightly better results, even if they only happen on some runs and not all of them. Anyway, congratulations for making it this far and watching all of the videos in this module. If you have to go back and rewatch any, feel free to do so; there is no shame in it, no problem with it. It's normal to have to do that. For a lot of the code that I have written here, I had to look stuff up in the documentation before I wrote it. So if you have to do that, it's no problem — you will always have to do that. We all have to do that, no matter what our skill level and our experience are. So congratulations, keep up the good work for the future videos, and good luck with this exercise. 46. Non linear regression exercise solution: In this video, we are going to talk about the non-linear regression exercise solution. Congratulations if you were able to obtain better results with hyperparameters different from the default ones, and if not, don't worry, because I haven't really been able to obtain better results myself. What I did was, for support vector regression, I set C to 10, and for the random forest regressor I messed around only with the max_depth parameter. In general, I messed around with the C parameter and the kernel for SVR, and with max_depth, and also a bit with the number of estimators, for the random forest, but that did not consistently get better results. In this particular run, we have a bit better results, at least for the Berlin dataset, but they are not really significant. The new SVR with a higher C obtains a slightly better mean absolute error, and the random forest regressor with an extra level of depth obtains an only slightly better mean absolute error, and also only on the Berlin dataset. But the point of this exercise was to make you realize that the default values are generally good ones to start from, and that it's generally not worth spending a lot of time coming up with other hyperparameters only to get insignificantly better results. Even here, the difference is quite insignificant, and if we ran this cell again, due to the randomized nature of these algorithms, the results might switch around — so we might not even have this difference here, this one here. So don't worry if you haven't been able to obtain better results. The most important point of the exercise was to make you read the documentation and get comfortable with the code and with modifying it. Also, you might have noticed that if we plug in multiple variants of the same model, both of them don't show up in the figure here. That's because of the way we're doing it: we're only getting the SVR part and using that as the name, so it won't be able to differentiate between them. What you might want to do here — this is more of a coding exercise than a machine learning or data science exercise — is make it so we pass in tuples here, or dictionaries, where one of the elements is a label for each regressor. That way you can use that label here, instead of doing this clumsy splitting, and then you will be able to see both of them displayed in the figure as well. Congratulations if you spent a bit of time trying this, and even more congratulations if you managed to obtain consistently better results than what we had initially. And if not, don't worry about it: we will see later on that there are methods that can automate the search for these hyperparameters, so they can find the best ones for us. That's the kind of thing that people use in practice.
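One hypothetical way to do the labeling suggested here — pairing each regressor with an explicit label in a tuple, instead of deriving the name by splitting str(regressor) on the open parenthesis — could look like this:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# each entry is (label, regressor); the label is what goes into the
# results dictionary, so two variants of the same model stay distinct
regressors = [
    ("SVR (default)", SVR()),
    ("SVR (C=10)", SVR(C=10)),
    ("RF (depth=5)", RandomForestRegressor(max_depth=5, n_estimators=100)),
    ("RF (depth=6)", RandomForestRegressor(max_depth=6, n_estimators=100)),
]

for label, regressor in regressors:
    # use `label` for reporting instead of str(regressor).split("(")[0]
    print(label, type(regressor).__name__)
```

With distinct labels, each variant gets its own bar in the comparison figure.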
We don't really spend much time optimizing these by hand, at least not in the case of most machine learning models. 47. Non linear regression quiz: Welcome to the non-linear regression quiz. Why did the neural network perform so badly in the showdown video? A: We introduced a bug when rewriting the evaluate function. B: Because they are very sensitive to the choice of hyperparameters and we didn't do any tuning, so it can happen. C: Because they are generally bad models. The correct answer is B: because they are very sensitive to the choice of hyperparameters, and we did no tuning, so it can happen. I don't think we introduced any bug when rewriting the evaluate function, because it worked well for all the other models, and it's likely that if you run it multiple times, it will also work decently for the neural network. Also, they are not generally bad models; they are generally very good models, but they do require a lot of hyperparameter tuning in order to perform at their best. What happens if we don't specify the max depth for decision trees? A: The default value will prevent the tree from getting too big. B: The default value will let the tree grow very big. C: The default value will very likely cause overfitting. The correct answers are B and C: the default value will let the tree grow very big, and this in turn will very likely cause overfitting. It will not prevent the tree from getting too big, because the default value lets it grow as big as it wants to, or as big as the training algorithm can make it. Can the SVR model be a linear model? A: Yes, by using the linear kernel. B: Yes, but the default kernel is non-linear. C: No, it's a strictly non-linear model. The correct answers are A and B: we can make it a linear model by plugging in the linear kernel, and indeed, the default kernel is non-linear. Which of these are disadvantages of the polynomial features approach? A: It generates a lot of extra features and is therefore memory inefficient.
B: Not all the generated features are likely to be useful. C: It requires installing an add-on library to scikit-learn. The correct answers are A and B: it does generate a lot of extra features, which makes it memory inefficient, and indeed, it's very unlikely that all those generated features will be useful. It does not require installing any add-on library; it simply requires importing it, like any other class from scikit-learn. 48. Non linear regression summary: In this video, we are going to go through a summary of this non-linear regression chapter. So let's see what we did in this module. We talked about how to pick the right datasets from Kaggle, and why those datasets might be suitable for a real-world application of some non-linear model from scikit-learn. We downloaded those datasets and set them up. Then we spent a lot of time preprocessing them. We insisted a lot on this, especially the cleanup aspect, the feature selection part, and the feature engineering part. Feature selection is what we did when we eliminated some features and decided to only keep those that we thought were relevant, or that we thought would let us spend the least time putting them in a correct format for our machine learning models. Feature engineering was when we decided to extract some extra data from other features, such as the amenities part, where we parsed the amenities string given to us and extracted things like whether the room has a TV, or a fridge, or a kitchen, and so on. That's called feature engineering. And then we did our train test split. Here I just want to mention that in an absolutely perfect world, you would put this step before the preprocessing step, and you would only do the preprocessing on the train set. So you would not touch the test set until you are absolutely done with everything relating to your solution — until you are certain that you have a great machine learning model that you can deliver to your client. You would not touch the test set until then.
But for our example, and in order to keep things shorter, this is a decent way of doing it. Just know that this box here would ideally go here. Then we did our train test split, and then we get to the model selection part. Model selection means picking a model. We started with a polynomial features preprocessor followed by a linear regressor, which ultimately leads to a non-linear regressor. Then we analyzed support vector regression, decision trees, random forests, and the multilayer perceptron, or neural network. We compared all of these against each other to find out which model is best for each of our datasets. And then we spent some time trying to tune them a bit — to tune their hyperparameters — in order to see what makes them behave at their best. For some of these, this is an easier step than for others. For example, for the neural network, we talked about how this is very difficult, and how the neural network is very sensitive to this tuning. For other models, such as support vector regression, decision trees, and random forests, the default values were decent enough, or we only had to make some very minor changes to them, such as setting the max depth for decision trees. This is called the model selection step, and it's called that because this is where you select the model: you pick the best one by comparing it to the others, and you keep tuning it. It's a step you will repeat until you are happy with what you have; it's not something that you do once and move on — you will keep doing it. And then you have your final model, the one that for now gives the best results. In a real-world setting, you would deliver this to your client, who would be happy with it, or they would ask you for some other changes. Of course, all of this is tightly coupled with the data analysis part. For data analysis we have pair plots, the correlations — the correlations heatmap in particular.
And we also use some domain knowledge and intuition in order to make the best decisions. For example, for the feature engineering part, we intuitively decided that it might be best to have columns that tell us if there is a TV, if there is Wi-Fi, and stuff like that. This part is tightly coupled to all the others, because it can affect all of them — including picking the datasets, because we might decide through data analysis that a dataset is good enough, or that the dataset we have picked is unusable and we cannot get good results on it. Therefore, data analysis can make us decide to pick another dataset. It can also make us decide to do more preprocessing: if we see that nothing is well correlated, we might decide we need to select more features or do more feature engineering. It can also affect our train test split in some situations, but this is quite rare, so we won't get into it right now. And it can definitely affect our model selection, because we can tell from some data analysis that some models might not work at all. For example, we might decide that linear models are definitely not good, or that a dataset is so complex that we need a deep neural network and we will just have to try and make it work — things like that. So you can use this figure here to decide on the order you should be doing things in. And that was a general overview of what we did; I hope it helps you move on from here. Feel free to reference this video in the future if you forget some of these steps or what they entail. 49. Making the most out of Kaggle: In this video, we are going to talk about making the most out of Kaggle — looking at what other people are doing. So Kaggle has datasets, such as the ones we have been working on so far, where you download a CSV file, or a file in some other format, and do all kinds of things with it. In our case, we've done machine learning. It also has tasks, which are suggestions on what to do.
These are usually explicit — there is a Tasks tab; we will see it in a minute — or they can also be implicit, which means that Kaggle will not tell you exactly what you can do with that dataset, but it's usually quite obvious based on the columns, or features, in that dataset. Usually you can do prediction tasks, and in our case, with the Airbnb data, Kaggle does explicitly tell us on the dataset's page that we can predict the prices. It also has kernels, which are what other people have done for the listed tasks, or just stuff that they have done in general with the dataset, if there is no explicit task given. So let's go to the Berlin Airbnb data Kaggle page and see what we have here. This is the data page, which we have seen before and from where we have downloaded our dataset. There are a few interesting things that we can see here. The first one, and an important one, is the license, which tells us how we can use this data and whether we have to give any credit. In this case, this data was created by Murray Cox, and more information can be found here — so thanks to Murray Cox for this data. Then we have a usability score, where we are given some checklists for the three categories here, which tell us how usable this data is: how good it is, or how easy it is to use in practice. The first one is ease of understanding, including essential metadata: it has a subtitle, it has relevant tags, a description, a cover image. Then it has richness of file formats and metadata — rich metadata means they have good descriptions, though column descriptions are missing in this case; the license is specified; and acceptable file formats, such as CSV (comma separated values) files, are used. Then we have assurance that the dataset is maintained, which means that in the future, new versions might show up with some fixes, maybe if some data was wrong. It has a public kernel; the update frequency is not specified in this case; and the provenance is not clearly specified.
Although there is a link here which might give more details. So a usability score of around eight is generally pretty good; if you find such datasets, they are generally good. Then we have this Tasks tab, and if we click on it, we have to wait for it to load. Okay, it loaded, and you can see it says here that one of the tasks, and the only task in this case, is to predict Berlin Airbnb rental prices. So it's a prediction task on the price. Next, if we go to Kernels, this is perhaps the most important page. Here is the work that other people have done on this dataset. Let's open a random one, but let's keep to the ones that have a lot of upvotes. And let's try maybe this one. It will take us to a page such as this, which will contain something like a Jupyter notebook. It is a Jupyter notebook, actually, of what this person has done on the dataset. Usually people start with some data analysis. We have a small table of contents on the left here, and they describe their work in general. If we scroll down to maybe the model part, we should be able to get some results as well. So let's see what kind of results this person obtained. They use linear regression with lasso, and they actually use R2, which is an error metric that we haven't discussed. And it seems they also use mean squared error. And these mean squared errors are very, very small, but there might be another reason for why they are so small: maybe they preprocessed the targets as well. So that's something to keep in mind. They also used random forests. So basically here you can read in detail what this person has done. And you can also download this file if you click Download code here: click this ellipsis here and Download code. And of course you can also upload your own code. So feel free to explore these kernels.
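The R2 metric mentioned above is available in scikit-learn's metrics module. Here is a minimal sketch of what it measures; the numbers below are made up purely for illustration, not taken from the kernel:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true prices and model predictions, just for illustration.
y_true = np.array([60.0, 90.0, 120.0, 45.0])
y_pred = np.array([58.0, 95.0, 110.0, 50.0])

# R2 is the fraction of the target's variance the model explains:
# 1.0 is a perfect fit, 0.0 is no better than predicting the mean.
r2 = r2_score(y_true, y_pred)
print(r2)
```

Note that unlike the error metrics we have used so far, higher R2 is better, which is something to keep in mind when comparing a kernel's numbers to your own.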
They are very helpful because they will save you a lot of time if you just need to get something working on this dataset. You can start from what other people have done. Maybe you will start with just the data preprocessing part, the data cleanup, the feature engineering and so on. You don't necessarily have to completely reuse what they have done, but it helps to see what other people have come up with as well. Maybe they have an idea that you didn't think of, or maybe it just gets you started and you can improve on their code. Don't be afraid to look at what other people are doing. After all, that is the basic building block of science: we take what other people have done and we make it better by doing something of our own with it. That's how science works, and that's how real-life machine learning research and engineering works as well. And actually, I have done the same thing when preparing this video, when preparing this course. So if we look at this kernel by Mohammad, who I'm grateful to, we will see that a lot of the data preprocessing is the same as what we have done. The columns that we are keeping are the same, or very close to the same. They also use the distance from Berlin, which we didn't use, in order to make a distance column. They also turn the 't's and 'f's into ones and zeros, which we have done as well. We also clean the prices; the code may be a little different, I have written some things differently, but most of it is the same. So yes, this definitely helps to get you started and to have a baseline on which to work, so you don't start from scratch. 50. Understanding an existing Kaggle approach: In this video, we are going to talk about understanding an existing solution for a Kaggle dataset. We're going to discuss this solution here for the Berlin Airbnb data. I'm going to go over almost every line and explain what it does and what you should think about.
That is, how you should approach understanding an existing solution that you're reading for almost the first time on Kaggle. Okay, so there's usually an introduction, and it's a good idea to put the results at the top so people know what they're getting into, how good the solution is going to be in the end. And this person says that they obtained a root mean squared error of about 28. Root mean squared error simply means taking the square root of the mean squared error that we discussed. This value should be close enough to the mean absolute error, but it's not exactly the same, of course. However, our mean absolute error is around this mark here, if I recall correctly. So they should have obtained about what we obtained. If you want to compare exactly, go back to our code and make it display this root mean squared error. You can do that by making it display the mean squared error and then taking the square root of it. Alright, so it starts by importing the necessary libraries. It even tells us what each one is for, in case you forgot or see something you're not familiar with. Also read the comments, because the authors will usually tell you what they did, at least the good ones, the ones with a lot of positive votes. The warnings are then silenced. This is unnecessary, but it might help in some cases. Then the data is read, and this person works with both data files, the listings one and the listings summary one. So that's interesting; that's something we didn't do. Then they display a data instance from the listings file. The listings file is the one with fewer columns, as we can see here. And then the important features, the ones that this person wants to keep, are put in this array here, and only those are kept from the dataset. So this will drop a few columns. The info is displayed. Again, this helps. For example, one important thing this helps with is checking null values.
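The root-mean-squared-error comparison suggested above, taking the square root of the mean squared error we already compute, can be sketched like this; the arrays here are made-up stand-ins for our pipeline's `y_test` and predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions standing in for our pipeline's output.
y_test = np.array([100.0, 50.0, 80.0])
predictions = np.array([90.0, 60.0, 80.0])

mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)  # root mean squared error = square root of the MSE

print(f"MSE: {mse:.2f}, RMSE: {rmse:.2f}")
```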
We can see here that reviews per month only has about 18 thousand non-null values, while the others have more. So this column here definitely contains missing values. A seaborn heatmap is then displayed, giving us the number of null values in every column. We can see that indeed the reviews-per-month column contains a lot of missing values. This step isn't really necessary in my opinion, because the info output already gives us that information, but it is nice to have it in a visual form like this. Okay, and the author draws the same conclusion, and decides to delete this column altogether here, because it has too many missing values. Okay, and now they look at the second dataset variable, which contains, let's go back a bit, the listings summary data. And this one has a few more columns, and it also has these text columns. And the info here is a lot bigger, because we have many more columns. And again, we can see here that multiple columns have missing values. Okay, a plot similar to the one above gives this. We can see here in a visual form how we stand with the missing values in this data file. Okay, so again, features that contain too many nulls will be ignored. And here is how this is done. The columns are iterated, and using the isna method, which we sum and divide by the length of the dataset, we get the percentage of missing values in that column. If this percentage is lower than 30%, we decide to use that column, and otherwise we decide to drop it. But no actual dropping is being performed here. What is done here is we only save the columns that we want to keep, and we also print some informational text regarding that. So we will keep 75 columns, and here they are. This is the same way we used to only keep the columns we wanted. Then we display the info again, and see:
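The null-percentage filter just described can be sketched in a few lines of pandas. This is a sketch on a tiny made-up DataFrame; the kernel applies the same idea to the listings summary data:

```python
import pandas as pd

# A tiny stand-in DataFrame with some missing values.
df = pd.DataFrame({
    "price": [50.0, 80.0, None, 60.0],
    "reviews_per_month": [None, None, None, 1.2],
    "bedrooms": [1, 2, 1, 3],
})

# Keep only columns where less than 30% of the values are missing:
# isna().sum() counts the nulls, dividing by len(df) gives the fraction.
keep = [col for col in df.columns
        if df[col].isna().sum() / len(df) < 0.30]
df = df[keep]

print(keep)  # 'reviews_per_month' is 75% missing, so it is dropped
```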
This is a much smaller dataset now with regards to the number of features, but we still have some missing values. Let's see how we are going to handle those. Okay, so the author here says that they will fill in missing values with the majority value of the feature. And they also say that we can create new features from existing features, for example a distance computed from the latitude and the longitude, which we didn't do in our version. The idea here is that it's important how far away the apartment or the location is from the center of Berlin, for example. And in order to do that, they use the geopy module. Now, if you're not familiar with it, what you should do is right-click on it and Google search it; you'll likely find documentation. And they say that the Berlin city center is located at these coordinates. Again, how did they get this? Well, it's not something you should know by heart, of course. What you should do is, again, Google it: Berlin city center GPS coordinates, latitude and longitude. That's it. You need to get used to googling stuff, because a lot of things like this will show up in practice and you need to be able to handle them. All right, so they compute the distance. Now they handle the categorical features for the room type, and they use pd.Categorical to do this, and pd.get_dummies, passing in a prefix of room type. We didn't use get_dummies; we preferred scikit-learn's one-hot encoder, but this is definitely a valid way of doing things as well. Okay, now they're switching back to the listings summary dataset. They are making a copy of it, probably because they are going to be changing things. Okay, so now they are deleting some columns. Let's see which columns they are deleting. Yes, so these ones, for example: they don't need the room type feature anymore, because if we look here, room type is a string, and it has been converted to one-hot encoding, so it makes no sense to keep the original anymore.
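The get_dummies one-hot encoding described above looks roughly like this; the room-type values below are illustrative, and the exact column set in the kernel may differ:

```python
import pandas as pd

df = pd.DataFrame({"room_type": ["Private room", "Entire home/apt",
                                 "Private room", "Shared room"]})

# pd.Categorical fixes the set of categories; get_dummies then expands
# the column into one 0/1 indicator column per category, each name
# prefixed with "room_type".
df["room_type"] = pd.Categorical(df["room_type"])
dummies = pd.get_dummies(df["room_type"], prefix="room_type")
df = pd.concat([df, dummies], axis=1).drop(columns=["room_type"])

print(list(df.columns))
```

The result matches what scikit-learn's OneHotEncoder produces; the main practical difference is that get_dummies works directly on a DataFrame but does not remember the category set for transforming future data, which is what makes the encoder-in-a-pipeline approach nicer for production use.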
After that, they convert the true and false strings to the numeric 1 and 0, and they do that by making a list of the boolean features and mapping a string-to-bool function over each of those columns. That turns them into numerical features. They do something similar for bed type and property type, using Categorical and get_dummies to do one-hot encoding. The amenities are turned into a count. They do that by splitting on commas and taking the length of the resulting array, the array that results from the split function. And of course, we also did this. So as you can see, and this is going to be a trend for most Kaggle datasets, a lot of the code you see in a public kernel is going to deal with data cleanup, feature selection, feature engineering, and so on: just trying to understand the data and make it more suitable for the upcoming machine learning part. Next, the datasets are converted to float. We also did this for some of the columns. NaN values are filled in with the median. We also did this. So if isna().sum() is larger than 0, they save the features that contain nulls, then compute their median and use fillna to replace those missing values with the median. Except we did this in one step; we didn't use two steps to do it, but this is just as good, really. Okay. Then they also clean the outliers using something called a z-score. I'm not going to go into too much detail here. Basically, outliers are data instances that are quite different from the norm. So let's say you have some data that contains information about student heights, and you would have someone with a height of maybe 2.5 meters. Well, 2.5 meters is quite a lot; I'm not sure anyone in the world is that tall. So it could be a mistake, right? And we might think that it's a good idea to remove it. But that's debatable, because ideally our machine learning algorithms should be able to handle it.
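The three cleanup steps just described, mapping 't'/'f' to 1/0, counting amenities, and median-filling, can be sketched like this; the column names and values are illustrative stand-ins, not the kernel's exact code:

```python
import pandas as pd

df = pd.DataFrame({
    "instant_bookable": ["t", "f", "t"],
    "amenities": ["{TV,Wifi,Kitchen}", "{Wifi}", "{TV,Wifi}"],
    "bedrooms": [1.0, None, 3.0],
})

# 't'/'f' strings become the numbers 1/0.
df["instant_bookable"] = df["instant_bookable"].map({"t": 1, "f": 0})

# The amenities string becomes a simple count: split on commas and
# take the length of the resulting list.
df["amenities"] = df["amenities"].map(lambda s: len(s.split(",")))

# Missing numeric values are replaced with the column's median.
df["bedrooms"] = df["bedrooms"].fillna(df["bedrooms"].median())

print(df)
```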
And we shouldn't need to baby the algorithm this much and remove outliers for it. But some people think it's useful. I think it depends a lot on the data, and it's debatable. But they are doing it here. So if you want to read more about it, you can read about how to eliminate outliers in scikit-learn. Here, they don't use scikit-learn, as far as I can see. What they do is compute the z-score using the formula for it directly, and they use NumPy in order to be able to do all these algebraic operations. They also use NumPy to do some concatenations. We don't really care about that; we didn't do it, and it's okay not to do it. It's okay not to fully understand something in the Kaggle kernel that you're reading. And it's also okay to put it into your own notebook for future reference, in case you decide to use it, or maybe even try using it to see exactly how it works and what it does. Both of these are okay approaches. All right. After that, a heatmap of correlations is shown. This looks quite similar to what we had. Then they do the random split. They only use 10% of the data as test data. They use a MinMaxScaler, and they don't seem to be using any pipeline here. They only use the MinMaxScaler: they call fit_transform on the training data and transform on the test data, which is fine. There's no real reason not to do it this way. But I think the pipeline approach that we introduced is more professional, and it's definitely easier to work with once you decide to add more things in, like we decided to add more models and so on. It just makes things more general and nicer to understand, especially for someone who is reading the code for the first time. Because in a real-world setting, you will not be the only one writing the code; other people will probably work on it as well, and they will need to understand it. And after that, unfortunately, this kernel gets a bit far away from what we have done, because it uses Keras, and we haven't gotten to that yet; we use scikit-learn.
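The z-score idea from the student-heights example above can be sketched directly in NumPy. This is a self-contained illustration, not the kernel's exact code; the threshold of 3 mentioned in many references only works on larger samples, so this tiny example uses 2 so that the single extreme value stands out:

```python
import numpy as np

# Made-up student heights (meters) with one obvious outlier: 2.5 m.
heights = np.array([1.62, 1.75, 1.80, 1.68, 2.50, 1.71])

# z-score: how many standard deviations each value lies from the mean.
z = (heights - heights.mean()) / heights.std()

# Keep only values whose |z| is below the chosen threshold.
filtered = heights[np.abs(z) < 2]

print(filtered)  # the 2.5 m entry is dropped, the rest survive
```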
But the most difficult part, in my opinion, has already ended, and that was the data analysis, preprocessing and cleanup part that we have discussed until now. From now on, if you have been following along, you should be able to intervene and plug in your own regressors for any dataset, even if you don't understand this part. So let's say you see something like this for another dataset that we haven't talked about, and you understand the data preprocessing part and the cleanup part and so on, but you don't understand all the Keras stuff. That's perfectly fine. Just use what you have learned so far about regression in scikit-learn and plug in your own regressors here. So what we're going to do now is skip to some results. Okay, the results here talk about the loss. Again, we don't really care about this part for now. And here it is: the test set mean squared error loss is about 850, and the root mean squared error is about 29.1. So now you should be able to easily make the code we had print the mean squared error and root mean squared error and compare yourself with this. Give it a try. Did we get better results or not? I hope this helps you understand how you should approach reading other people's kernels and what kind of information and knowledge you can gain from doing this. 51. Advanced ensemble methods intro: In this video, we are going to talk about ensemble methods. We have already seen an ensemble model, and that is the random forest regressor. In general, ensemble methods combine the predictions of several other models built with a given learning algorithm in order to improve the quality of the overall combination. The idea is that more is better than one. And there are two main families of ensemble methods that we are going to discuss in future videos. One consists of averaging methods, which is further divided into bagging methods
and forests of randomized trees. We've already seen a part of this, and we're going to talk about it in a future video. The idea here is to build several models independently and then take an average of their predictions. And this has the effect that the combination is going to be better in the end, because the resulting variance of the model will be reduced. This introduces a new concept, the concept of variance. And variance basically means overfitting: when you have some noise in the data, and all real data is a little noisy, not perfect, overfitting means that you end up with a model that basically memorizes the data, or part of it. So it will not generalize well. You don't want high variance; you want to keep it low. Averaging methods do this quite well. A contrasting approach is the boosting methods, which we will also discuss in a future video, where the estimators are built sequentially, and here the idea is to reduce the bias of the combination. And now we have another term here, which is bias. And bias basically means underfitting, right? So when you use a weak model such as linear regression in order to try to estimate some polynomial data, for example, that's underfitting, because a straight line going something like this will not be able to properly fit something that curves like this, for example. So ideally, you want low bias and also low variance, but this is not possible most of the time; there has to be a compromise, or a trade-off, between bias and variance. So depending on your problem, one of these will probably work better than the other. But it doesn't usually cost much to try both methods, or multiple methods from the two families. In general, these are the types of algorithms that give the best results on a lot of problems, so they are very important to know about, and they are widely used in practice. 52. Bagging: In this video, we are going to talk about bagging.
Although this is an advanced method, it's only advanced from a theoretical point of view; in actual practice and implementation, especially using scikit-learn, it's not complex at all. All we have to do is import the bagging regressor. And I've already pre-written some code, because it's very menial and there is no point in you watching me while I write it. So I'm just going to go over it a bit. First of all, I modified this evaluate function to accept the models to assess as a dictionary with a label and a model, and I updated the code accordingly. We use the regressor of the model here too when calling get_pipeline, and we append the label instead of doing the thing that we did before. And we also use the regressor's label here when printing results in text form, if verbose is true. Those are the changes I made to the evaluate function. And then, for each model except for the polynomial regression and the multilayer perceptron, I added a bagging variant as well, by instantiating the BaggingRegressor class and passing in the same model that we pass directly. And this takes a while to run, especially for support vector regression, because support vector regression is very slow, and the bagging regressor basically instantiates about ten such models. So this line here should execute about ten times slower than this one here, and this one here is already quite slow. But if you let it run for a while, you should get some results like these. And what we notice here is two things. First of all, for every model, the bagging version is never worse, or is only insignificantly worse. Let's look at ridge, for example. We get 41.98 with bagging and 41.96 without bagging. But for all intents and purposes, these results are the same. For SVR, the numbers with and without bagging are again essentially the same. For decision trees, we get the same results, at least for the Amsterdam dataset.
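Wrapping a model in a BaggingRegressor, as described above, is essentially a one-line change. A minimal sketch on synthetic data (the real code passes our Airbnb pipeline's regressors instead of this decision tree):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# A small synthetic regression problem standing in for the Airbnb data.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0,
                       random_state=0)

# Wrap any base regressor; by default 10 copies are trained on
# bootstrap samples and their predictions are averaged.
model = BaggingRegressor(DecisionTreeRegressor(), n_estimators=10,
                         random_state=0)
model.fit(X, y)

print(len(model.estimators_))  # one fitted tree per ensemble member
```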
And also for random forests. For Berlin, things change a bit. For ridge, the results are again quite close. For SVR, quite close. And for decision trees, the bagging version is a bit better; it's noticeably better, by about two mean absolute error points: we have 23.6 with bagging and 25.56 without bagging. So there is a noticeable difference here. And for random forests, the results are also a bit better with bagging, but this is not that significant. So those two things I mentioned are: one, with bagging, we never seem to get a worse result; and two, with bagging, we sometimes get a slightly better result. So it makes sense to always try bagging. Even if you've decided on a model already, try the bagging version of it as well. We can see this in visualized form here as well. And now I want to talk about the documentation for it. If we go to scikit-learn's documentation on the bagging regressor and we scroll all the way down, we can see that we have some examples here using the bagging regressor. And this is true for most models provided by scikit-learn: if you scroll all the way down, you will usually have a list of examples. If you click on them, we get to this page in the bagging regressor's case. If we scroll down, we see some results and figures. Let's look at the bottom-most figures, because the top ones are hard to make sense of; they are quite small and they look very, very similar. So we're going to focus on the bottom ones. We see here that the bias, the bias squared, is given in blue, and the variance is given in green. Let's look at the blue line, the bias line, in the left figure. It's the bias without bagging for a single tree. Okay, and you can see it's quite low. And with bagging, it gets a bit larger, so the line goes a bit more upwards. That means the bias is larger. And that makes sense, because remember, in the introductory video we talked about the bias-variance tradeoff.
And we also said that bagging helps reduce variance, and if it reduces variance, it stands to reason that it will increase bias a bit, because we can't have the best of both. Okay, and if we look at variance, the green line, we can see that it's lower with bagging. So for this example here, bagging indeed reduces the variance, but it gives up a bit of performance regarding the bias in the process. And this is even more clear in this text form here. So for a simple tree, the bias squared is this, 0.0003, and with bagging it is 0.0004. So a bit more, and the variance is significantly less; it's a little more than half with bagging compared to without bagging. So that's it: we gave up just a bit of bias for a lot less variance. So it's definitely a good thing. And we also kind of noticed this in our Airbnb data. So this is definitely a good approach to plug in with your models. It's very easy to do, and you will not get, or should not get, worse results; you should almost always get at least the same results. 53. AdaBoost: In this video, we are going to talk about AdaBoost. This is a boosting method and not a bagging method, as its name suggests. We also import it from sklearn.ensemble, and the code is very similar. This class is used exactly how we used the BaggingRegressor class: we simply instantiate it, passing in a model, and it will use that model to do boosting. But here are the results. Before we get into detail on exactly how it works, you can see that with boosting, all of the results are significantly worse. And we're going to talk about why. First of all, let's go to the documentation page and open this example here. Okay, and while that's loading, let's scroll up and talk a bit about what AdaBoost is doing. So what it does is it also uses multiple estimators, which are copies of the base estimator, and it fits them sequentially.
But the weights of the instances are adjusted according to the error of the current prediction. So basically, later regressors will focus on the more difficult cases, the ones that the previous regressors did not predict very well. Here is an example in the documentation. They have this sinusoidal data, so it's basically a toy dataset. The green line is a decision tree, and the red line is a boosted decision tree with 300 estimators. And remember what we said about fitting data points too well: it leads to overfitting. So I would not really like an estimator, or a model, to fit my data like that. The green line is not ideal either, but in my opinion, the red line is much worse, because it can definitely lead to overfitting. And this is likely what happens to us here as well. You might be able to fix this by tweaking the boosting parameters, and this is definitely worth trying. But you should be careful about actually using it, or even trying this boosting approach, because as you can see, it can really mess up results. The bagging approach seems to be safer. So I suggest, if you have to pick one of them, pick bagging. But if you can afford to try things out, definitely try AdaBoost as well, as it can work very well on other datasets. And of course, feel free to play with the hyperparameters a bit. 54. Gradient boosting regressor: In this video, we are going to talk about the gradient boosting regressor. This is another boosting method, but it doesn't involve us passing in a model beforehand. We can use it with all the parameters left as default, and it will not use any external model. How it does this is a bit more complex; there's a bit more theory involved here. But just know that it also uses this n_estimators parameter, which in this case doesn't represent the number of estimators that are copied or duplicated, but the number of boosting stages to perform. And this model is fairly robust to overfitting.
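The two boosting classes covered in these lessons are used like this; a sketch on synthetic data, with the base model and hyperparameter values chosen only for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0,
                       random_state=0)

# AdaBoost wraps an external base model, like BaggingRegressor does,
# but fits the copies sequentially, reweighting hard examples.
ada = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                        n_estimators=50, random_state=0).fit(X, y)

# Gradient boosting builds its own internal trees; n_estimators is the
# number of boosting stages, not copies of an external model.
gbr = GradientBoostingRegressor(n_estimators=100,
                                random_state=0).fit(X, y)

print(ada.predict(X[:1]), gbr.predict(X[:1]))
```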
So we shouldn't have that problem like we did with AdaBoost. A large number of these estimators usually results in better performance. And there are a bunch of hyperparameters we can tweak; I'm not going to get into those details, and you have some examples of how to use it, which I recommend that you look over. Okay, looking at our code, if we scroll down, I've already added it. So again, from sklearn.ensemble you have to import the GradientBoostingRegressor. And because it doesn't accept any external model as a parameter, I only added one instance of it at the end. It will run for both the Amsterdam and the Berlin datasets. And because this takes a while, I ran it before starting the video, and the results are quite good. So for Amsterdam, we only got a bit worse than ridge with bagging, and boosting didn't do as badly as AdaBoost did, but it didn't really improve on the best result. For Berlin, however, it did improve on the previous best result: we got 22.9 here, and all the others are quite well above this. So there is some noticeable difference for the Berlin dataset. So as a conclusion, this is definitely one that is worth trying. It's not going to give significantly worse results, and it has a good chance, or a decent chance, of giving slightly better results. If you want to play it completely safe, however, and you can only pick one for whatever reason, bagging still looks like a better choice, at least according to these two datasets. 55. Advanced ensemble methods - exercise: Here is your ensemble methods exercise. You have to find another dataset, add it here, and compare the ensemble methods on that dataset. So basically, you have two choices here. Pause the video here if you don't want a small hint. Okay, so the hint is that you might be able to find a dataset that contains Airbnb data from another city. And if you're lucky, and that dataset looks exactly like these, you'll be able to simply add it here and run everything as it is.
If you are not able to find that, then you will need to change these data preprocessing steps, but the rest should be doable if everything else remains unchanged regarding the code in these cells, or regarding setting up our pipeline and performing non-linear regression. Okay, so either you find a dataset in the same format and don't change much at all, or you find a different regression dataset, in which case you have to make a few changes, but they shouldn't be too many. 56. Advanced ensemble methods - exercise solution: In this video, we are going to talk about the ensemble methods exercise solution. So the first step was what I told you about in the hint in the last video, which is to try to find another Airbnb dataset. Searching on Kaggle for Airbnb, I found a Seattle Airbnb dataset that is very similar to the other two we have, and that we can use in the same way, so we can use the same columns. The next step was to turn the data preprocessing part into a function. So I put all that code into a function, so we no longer have to copy-paste lines of code for each dataset, because that was getting quite clumsy given three datasets. In fact, it was clumsy to begin with, with two datasets. And the reason I did it that way was because I also wanted to show you the bad ways of doing things that you might encounter in practice, from other people's code. I think it's a good idea to also know what you shouldn't do. Most courses and tutorials tell you how you should do things, but that's not very realistic, because in practice you will also find a lot of bad ways of doing things, and you won't be able to recognize them if you are only taught the one way that works. I think you should also be shown bad ways of doing things, so you can recognize them, avoid them, and refactor the code when you see things done that way. So in doing this refactoring, I also fixed a problem that has been going on for a while.
Some of you may have noticed; if not, don't worry about it. It didn't really affect our videos: the information I presented to you was valid. It was just a small mistake that didn't affect any theory or any practice regarding choosing machine learning models or applying various data preprocessing operations. So the mistake was in this line. We have something like Amsterdam data's price being converted with to_numeric, and it should have been Amsterdam data's price here, but due to copy-pasting we had Berlin data twice here. So Amsterdam's price column was set to Berlin's price column, which of course is not a good thing. But like I said, it did not affect the discussion regarding the preprocessing methods and choosing the right machine learning algorithms for the task, using one-hot encoding, hyperparameters and so on. It might have affected our results a little, as we'll see in a minute. But those results aren't really that important; the actual values aren't that important in our case, since this is a tutorial. In actual practice, of course, you don't want to make this mistake; you don't want to send something like this to your client. The way to avoid it is to avoid copy-pasting, right? Because that was the issue that led to this, in this case intentional, error. And the way you avoid copy-pasting is: when you find yourself pasting a line of code, you should write a function that accepts a parameter and does those copy-pasted things to that parameter. And that's why we ended up writing this preprocess-Airbnb function that takes a DataFrame as a parameter, preprocesses it, and returns the data. Now we simply call that function for each dataset, and that results in far fewer chances of making mistakes, and also in fewer lines of code. So congratulations if you spotted that. If not, don't worry about it. As long as you understood everything else, the explanations and so on, that's perfectly fine.
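The refactoring described above can be sketched like this. The function name, column names, and data here are illustrative, not the course's exact code; the try/except around the median fill mirrors the workaround for columns whose median cannot be computed:

```python
import pandas as pd

def preprocess_airbnb(df):
    """Shared preprocessing for one city's Airbnb listings DataFrame."""
    df = df.copy()  # avoid mutating the caller's frame
    # The price column should be numeric; unparseable values become NaN.
    df["price"] = pd.to_numeric(df["price"], errors="coerce")
    # Fill missing values column by column; non-numeric columns have no
    # median, so fall back to a placeholder string for those.
    for col in df.columns:
        try:
            df[col] = df[col].fillna(df[col].median())
        except (TypeError, ValueError):
            df[col] = df[col].fillna("not available")
    return df

# One call per dataset instead of copy-pasted blocks:
amsterdam = preprocess_airbnb(pd.DataFrame(
    {"price": ["80", "bad", "60"], "name": ["a", None, "c"]}))
print(amsterdam)
```

Because each city's frame goes through the same function, a Berlin-vs-Amsterdam mix-up like the one discussed above simply cannot happen.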
Okay, so let's make sure we don't leave it like this; let's fix that data. And a good thing to do after some refactoring like this is to search: for example, search for Berlin. It only shows up here, which means that we didn't forget to rename anything. Let's also search for Amsterdam. Okay, it's only in the right places. So this looks right now. Another thing that was added is this try/except, because the Seattle dataset has some missing values for non-numerical columns, and we can't really compute the median of a non-numerical column. So we try to do it, and if we get an error, we simply fill the missing values there with 'not available'. Okay, so that's just a quick way of handling this error. And if you run this code, you will get some warnings that say that you shouldn't modify a DataFrame this way. We're doing that in various places where it isn't really a problem, and we're returning the DataFrame anyway, so that shouldn't affect us. And if you want to make extra sure, you can do something like describe the Seattle data, and maybe we only want the price column described. Let's print it for Amsterdam as well. If we run this, we get the warnings again, and you can see now that the prices are different, so we no longer have the same problem. We can leave this in; the code remains unchanged, basically, because we selected a dataset that has the same format. So the models were trained, and here are the results. For Seattle, we get slightly worse results than for Berlin. Berlin is still the one that gives the best results, or the one that we get the best results on. And for Amsterdam, even after fixing that, the results are about what we had before, maybe a little worse, actually. But of course, it had to be fixed anyway; we couldn't deliver something like that to an actual client. So I hope you are not too upset about me trying to trick you with that small error.
But I think it's important for you to know what not to do as well, and to recognize these things when you find them in practice. I don't only want to show you good ways of doing things; I also want to make you aware of some bad ways of doing things that increase the chances of making mistakes, such as copy-pasting code. It's quite tempting when doing machine learning, because a lot of the code is the same, and it's very tempting to copy-paste and rename things rather than doing it properly with functions and the like. So I hope you enjoyed this solution. Congratulations if you picked a completely different dataset and refactored the code along the way, and congratulations for making it to this point and learning as much as you have so far. I wish you good luck in the future videos — there is still much to learn, and I'm sure you will enjoy doing it. 57. Advanced ensemble methods quiz: Welcome to the ensemble methods quiz. Which ensemble model was covered in a different module: XGBoost, decision trees, or random forests? The correct answer is random forests — we talked about them in the nonlinear regression module. Decision trees are not an ensemble model, and XGBoost is an ensemble model, but we did not talk about it at all. Which ensemble model gave the worst results: random forests, AdaBoost, or bagging regressors? The correct answer is AdaBoost. Its results were significantly worse than anything else; in fact, they were as much as two to three times worse. Which of these are principles of ensemble methods: using more models whose predictions are combined in some way; using a single, highly advanced model; or using special GPU algorithms that improve results? The correct answer is A: using more models whose predictions are combined in some way. Ensemble methods combine multiple models to get better results. Using a single, highly advanced model implies a single model.
Of course, that does not have anything to do with ensemble methods. And using special GPU algorithms is quite irrelevant here, because we did not talk about that at all; by themselves, GPUs do not improve results — as long as you have enough time, chances are that a CPU algorithm will perform the same. GPUs generally only optimize for time. 58. Advanced ensemble methods summary: In this video, we are going to go over the summary of the ensemble methods module. Here's what we learned about bagging at the start. We said that bagging leads to less variance, which means that it has less chance of overfitting, because the model will not try to fit the outliers and will not try to fit the data perfectly — fitting perfectly might seem like a good thing, but in fact it leads to bad generalization. Also, we found out, at least on our data, that the bagging approaches are less extreme: even if they don't really give us much improvement over simple algorithms that are not bagged, they also don't cause the results to become noticeably worse. We either get a little better or remain at about the same level. And we said that, in general, they are a good thing to try if you can afford it, because of that. Then we talked about boosting. Boosting tries to reduce the bias, which means that it fits the data as well as it can. But we saw with AdaBoost that this seems to cause overfitting on our data, which was kind of confirmed by an example from the scikit-learn documentation, where boosting is shown to almost perfectly fit some sinusoidal data. And that's a bad thing, because it's very likely that it won't generalize well: it will basically memorize the answers for the train set, and then it cannot make good predictions on the test set. That's what overfitting is. And of course, both of these are affected by the bias-variance tradeoff, which we talked about. It means that we cannot minimize both the variance and the bias.
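The bagging-versus-boosting contrast can be demonstrated on a noisy sinusoid, loosely in the spirit of the scikit-learn docs example mentioned above. This is a sketch with made-up data and arbitrary hyperparameters, not the course's code:

```python
# Bagging averages many trees (lower variance); AdaBoost chains weak
# learners to drive training error down (lower bias, higher overfitting risk).
import numpy as np
from sklearn.ensemble import BaggingRegressor, AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(200, 1), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)  # noisy sinusoidal target

bag = BaggingRegressor(DecisionTreeRegressor(max_depth=4),
                       n_estimators=50, random_state=0).fit(X, y)
boost = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                          n_estimators=50, random_state=0).fit(X, y)

# Training-set error: boosting will typically chase the noise harder,
# which is exactly the overfitting tendency discussed in the module.
bag_err = mean_absolute_error(y, bag.predict(X))
boost_err = mean_absolute_error(y, boost.predict(X))
```

Plotting both models' predictions over the noisy points (e.g. with matplotlib) makes the difference visible: the boosted curve hugs individual noisy samples more tightly than the bagged one.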
So there has to be a compromise between them — there is a trade-off — and you should pick the best compromise according to your own task and your own data. For example, in some use cases bagging might be better, and in others boosting might be better, and that's that. We also noted that although these are considered advanced methods, because of the way they work internally and the ideas behind them, that doesn't make them hard to use: they are as easy to use as any other model we've seen so far. That's a very good argument for trying them, whatever your dataset and whatever your task. 59. Capstone project: In this video, we are going to talk about your capstone project. This is a project that you will have to complete from scratch, using most of the things we have learned in this course. We're going to go over some of the things you will have to do. First of all, you need to get this dataset. It's from the UCI Machine Learning Repository, which contains datasets similar to the ones on Kaggle. It's an older site and it doesn't have all the features Kaggle has, but it will do for this project. Here are the steps you will need to perform, along with some questions that you should answer while solving the project. First, perform some data analysis. What is the target — what are we predicting in this dataset? You can answer this by reading the UCI page for this dataset, so this page here; this will make sure that you understand the data. Next, show a pair plot and write some words about it: do any pairs look to be linearly related or not? Show a correlation plot: are any pairs well correlated? Is any feature correlated with the target by more than 0.5? We've already done this in previous videos, so you should be familiar with these steps and able to do them yourself. Next, are there any missing values? How would you solve this issue? Next, we get ready for training. Perform the train-test split necessary for training and testing.
We need this to evaluate how well our models perform. Use 20% of the data for the test set, and use a random seed of 2020. Next, we do the training. Use any three of the discussed ensemble methods. You might need to use data preprocessing here; if you do, add it either as a pipeline or as sequential lines of code. Report the mean squared error and the mean absolute error on this dataset for each method you use. Discuss a bit about how the methods fare against each other: which is the best, which is the second best, how big the differences between them are, and so on. Perhaps most importantly, relative to the dataset and the target variable, how good do you think the results are? Are they very good, with the error being basically negligible, or is the error very large, which would basically make this kind of prediction infeasible to use in practice? You will have to use your intuition and your common sense a bit here to answer the question, and you will also have to understand what the dataset is about. Finally, try not to look at our code while solving this exercise. You can look at it before starting if you must, and if you feel like you don't fully understand something or are not fully capable of replicating it from scratch. While solving, you can use Google and the scikit-learn docs, but you should not use the code we have written in previous videos. You should not rewatch the previous videos while doing this project, and you should also not read the accompanying Jupyter notebooks. Finally, you should organize everything nicely using markdown cells. For example, this here is a markdown cell. We can make markdown cells by going to Cell, Cell Type, and selecting Markdown, and here we can write text in various formats: smaller, larger, et cetera. You should have a few of these for each of the main sections here. Feel free to watch the solution video after you finish, to compare your approach with my official approach. Don't worry if there are slight differences.
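The split step from the instructions — 20% held out for testing, random seed fixed at 2020 — is one call in scikit-learn. Stand-in arrays here; in the project you'd load the UCI dataset first:

```python
# Hold out 20% of the data for testing; fixing random_state makes the
# split reproducible, so everyone gets comparable results.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # placeholder features
y = np.arange(50)                  # placeholder target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2020)
```

With 50 rows, that leaves 40 for training and 10 for testing.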
The important thing is that you manage to do it, and that you get results close to what I get in the official solution. Maybe you did some things better, maybe you did some things a little worse; the important thing is that you learned something from it. I'm sure you can do this, and I wish you the best of luck. 60. Capstone project solution: In this video, we are going to talk about the capstone project solution. If you are watching this, I hope that you already have a solution of your own running, that you are quite happy with it, and that you are just curious about how I have done it, in order to maybe see a different approach or to learn something. If that is the case, welcome to this video. If that is not the case and you got stuck, I strongly recommend that you keep trying to get it working on your own before watching this. Anyway, let's talk about the solution. What I have done, and what I suggest you do as well in general, is to answer these questions inline here. That will help you keep track of your progress, and you can also come back to it later in case you want to see how you did, or maybe to learn something about the dataset — maybe you'll have to use it again, and you'll have this reference to check. So let's go over the answers first, and then we'll get into the code. What is the target, what are we predicting? We're predicting the sound pressure level in decibels, which is the sixth column. Show a pair plot and write some words about it: columns with indexes four and one, which correspond to suction side displacement thickness and angle of attack, look somewhat related — and you get this info from the dataset page. Ok, show a correlations plot: those pairs are well correlated, but nothing is that well correlated with the target — nothing more than 0.5. We'll look at these two plots in a minute; I just wanted to go over the answers first. There are no missing values.
But if there were, we would fill them using fillna on the pandas DataFrame — or rather, to be more specific, on the columns we want to fill in. Ok. Then we perform the train-test split; I just wrote "ok" here, basically to track my progress and make sure I've done it. Use any three of the discussed ensemble methods — I've done this too. Report the mean squared error and MAE — I haven't written anything here, because this is only partially done: I only computed the MAE; we'll see about the MSE in a minute. How do the methods fare against each other, which is the best? We will see that there are some noticeable differences between them, but gradient boosting works best. Relative to the dataset and target variable, how good are the results? The best mean absolute error I got was about 1.897, which means we are off by less than two decibels on average, and that seems very good. Consider the sound the traffic outside your house makes: would you be able to approximate the decibel level it's at, within an error of two decibels? That sounds like something very difficult to do, so I think this is very good — it would be a barely noticeable error to the human ear. Ok. And we have everything organized using markdown cells, like here, for example. Here we describe the dataset. This is an important step, because it helps us gain some insights into the data. For example, we can see that the decibel values are around 100 — between about 103 and 140 — and that they are quite well grouped together. We have our train-test split here, with a random state of 2020. Let's run these cells to make sure we're running the latest code. Ok. We have the pair plot here, which is just one line of code — I'm not going to run it again. And here are the two features I said were somewhat linearly related, here and also here.
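The "is any feature correlated with the target by more than 0.5?" question can also be answered numerically, without eyeballing a plot, using plain pandas. The data below is made up purely to show the technique; the capstone uses the UCI dataset:

```python
# Check each feature's correlation with the target, keeping only those
# whose absolute correlation exceeds 0.5.
import pandas as pd
import numpy as np

rng = np.random.RandomState(0)
df = pd.DataFrame({"freq": rng.rand(100)})
df["angle"] = rng.rand(100)  # unrelated to the target
df["target"] = 2 * df["freq"] + rng.normal(0, 0.1, 100)  # tied to freq

corr_with_target = df.corr()["target"].drop("target")
strong = corr_with_target[corr_with_target.abs() > 0.5]
```

Taking the absolute value matters because, as noted below, a strong negative correlation is just as informative as a strong positive one.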
And here is the correlations heatmap, where we can see that there are no features correlated with the target by more than 0.5, though this negative value here gets quite close — even a negative correlation is useful; if we had, say, minus 0.9 here, that would be very good. You might remember from the previous videos that the first and last cells got cut off a bit in the correlations plot, and we fixed that with this: ax equals the heatmap here, and we call set_ylim with 6 and 0 to make it display nicely. The reason we had to do this is that we also passed annot=True to the heatmap call, which displays the correlation value in each square, and if we didn't fix the limits, we wouldn't be able to read these numbers. Right. Then we get to the model selection part, and we decided to use the BaggingRegressor, the AdaBoostRegressor, and the GradientBoostingRegressor — and we also used the StandardScaler. We put it all in an evaluate function, so we don't repeat any code. Ok, and if we run this, we get these results — this is the 1.89791 value I mentioned, with gradient boosting. Let's also make it so we display the mean squared error. To do that, we can make the results dictionary hold, for each model label, both the MAE and the MSE, and we have to import mean_squared_error. I'll start by copy-pasting this — and not forget to rename it: mean_squared_error. Let's run this again, and here it is: these are the mean squared error values. So we got that printed as well, and since we got it in, let's fill it in up here, so we can keep track of what we've done. There we go. All right, so that's about it for the capstone project. It wasn't very difficult; the main point of it was to familiarize yourself with the code and get some exercise doing things by yourself.
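An evaluate helper in the spirit of the one described above might look like this — synthetic data and default hyperparameters, so the numbers won't match the course's, but the shape of the code is the point:

```python
# Fit each ensemble model once and record both MAE and MSE, instead of
# copy-pasting the same fit/predict/score lines three times.
import numpy as np
from sklearn.ensemble import (BaggingRegressor, AdaBoostRegressor,
                              GradientBoostingRegressor)
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(300, 5)
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(0, 0.1, 300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2020)

def evaluate(model, name, results):
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    results[name] = {"MAE": mean_absolute_error(y_test, preds),
                     "MSE": mean_squared_error(y_test, preds)}

results = {}
for name, model in [("bagging", BaggingRegressor(random_state=0)),
                    ("adaboost", AdaBoostRegressor(random_state=0)),
                    ("gboost", GradientBoostingRegressor(random_state=0))]:
    evaluate(model, name, results)
```

Adding a new metric now means editing one function, not three pasted blocks — exactly the lesson from the Amsterdam/Berlin bug earlier.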
And also exploring datasets by yourself: downloading them, reading their documentation pages to make sure you understand them, and so on. So it's basically just an exercise to get you more familiar with the field. It doesn't have to be too complex; it's just very important that you feel comfortable writing this kind of code and working with this type of dataset. I hope you enjoyed doing this, and I'll see you in future videos, hopefully. 61. Questions and answers: In this video, we're going to do a quick Q&A based on what we have discussed so far, which is regression. This looks complicated — do I need a PhD? The good news is that no, you don't need any formal education at all. You can find most of the material online, and even the latest research gets quickly implemented in the various libraries, so you can make good use of it. The documentation of the various tools is also very good, and that lets you take advantage of them very quickly. In fact, the barrier to entry in this field is very, very low; as long as you spend a few months studying, you will do great. Are robots going to take over? They are even driving cars now. This is very unlikely, and it's mostly media saturation. Although these things are impressive — such as self-driving cars, or AIs that can make certain medical diagnoses better than doctors can — the field is nowhere near human-like levels of thought. These algorithms are not capable of thinking like we do; they are not capable of critical thought, and they are not capable of the levels of inference we are capable of. So no, this is not going to happen. Is the automation introduced by machine learning going to make some jobs obsolete? This is likely. For example, we already have things like self-driving taxis, and grocery stores that are not staffed at all — you just go in, get your stuff, and you're billed automatically. Drone deliveries are starting to become a thing, slowly but surely I think, and some others as well.
For example, there are already AIs that can make certain medical diagnoses. So in the future — maybe even the near future for some of them — some jobs are probably going to go away. And that's why this is a great time to start learning: this hasn't happened yet, it's not an immediate issue, but it's good to be prepared. Can I do anything with what I know so far? Almost. Assuming this is the only course you've watched so far and this is your question, I suggest you keep watching our courses, at least the classification one. Then you will have a good all-around knowledge of what the field is about, and you will likely be able to be productive on certain projects. But of course, I do recommend that you watch them all, and as fast as you can, as they will give you an extra edge and you will be able to be useful on many more project types. Do I need to find a new data scientist job if I'm already a developer? Probably not, because there is an increasing need in most organizations for this kind of expertise regarding data science, machine learning, AI, and so on. They might prefer to promote from within if they already have a developer they know is reliable, and if that developer says they can fulfill some of these data-science-related needs, then they are likely to be chosen. So ask around, and see if there is anything you can contribute in this regard. Is it worth it to join a startup? Yes — startups are the ones that usually bring the most innovation, but it's not going to be easy. However, we do hope that our videos will help you along this path, that you'll be able to pass all the required interviews, and that you'll make a great contribution at that startup. And hey, why not think about starting your own as well? Maybe you'll have some great idea that machine learning can help with — definitely something to consider. How can I gain some demonstrable experience?
So, every employer wants to see that the people they hire have some experience under their belts. This is true of all programming positions, all development positions. One thing you can do here is to be active on Kaggle: share your kernels, take part in competitions, and be active in the various discussions, not just in tutorials. You can do this on Stack Overflow as well, for example, and on Quora, and on various other sites you can find online. This can go a long way in proving that you have experience, that you know what you're talking about, and that you not only know the theory but are able to put it into practice as well. Also, consider freelancing. Freelancing is very easy nowadays, and as long as you find the right project and deliver good work, you can definitely put it on your resume — that's practical experience you can demonstrate. 62. Extras — tips, tricks, and pitfalls to avoid: In this video, we are going to talk about tips, tricks, and pitfalls to avoid. The first would be to not listen to just one author. You should always get your information from more places and more people. Don't just watch one video and think that whatever it says is the absolute truth, or that it's all you need to know. A good video will link you to documentation, and it will give you other resources, such as Kaggle, the UCI Machine Learning Repository, and things like that. For example, another good resource is Kaggle's kernels page for each dataset, where you can see what other people are doing. So that's also very important: you don't want to be stuck in a trap where you only follow one person. Next, you should always write your own code. Don't assume that you don't need to write code because you watched the video and understood the code presented in it.
This is true even for my own videos: when I present some code, even if you understood it, you should maybe pause the video and try to write it by yourself, without looking at what I've written, as much as possible — or you can do it after the end of the video. This is what builds practice and confidence in your own abilities. Next, you should do your best to avoid tutorial hell. Tutorial hell is when you keep following tutorials but never put them to actual, real use. So don't get hung up on any one tutorial, or on more tutorials: once you finish watching them, try to apply them to something you're passionate about, or just to any toy project. For example, maybe now you can go and find another regression dataset and apply machine learning models to it — maybe a more complex one than you worked with in the capstone project. Things like this — things that make you work on your own, with no supervision and without being told exactly what to do. Another important thing is to be critical of what you watch. Teachers also make mistakes, be they on purpose or by accident, and they don't always use the best methods either. Take the best from them, but be prepared to correct the rest. If you notice a small mistake, or something you know can be done better, definitely fix it. And since, as you've seen, we usually continue with the same code across multiple videos, this will force you to adapt the future videos according to the changes you've made, and that will make you understand the concepts even better. So this is very important: even small changes will force you to pay more attention and think about how to adapt the future videos and their code to what you changed. It's very important, and I strongly suggest that you do it. Reading the documentation is, again, very important.
We didn't cover everything there is to know about regression, and no video will cover everything there is to know about a given topic. You should always read the documentation of the tools that were talked about, and the documentation about the algorithms as well. You should always look at the examples — for instance, the examples in scikit-learn's documentation — and take some time to understand them and try to reimplement them. In general, you should always try to learn more. Don't just be satisfied with what you're told in some video tutorial, or even, you know, a university lecture; always try to learn more. I also think it's important to make a watch list. For example, you now have some idea about what machine learning is, so I think it's a good idea for you to go out, find some people with YouTube channels or blogs that you like, and follow them regularly. In general, you should be part of the community — maybe there are some local meetups you can attend. Just be involved with other people with similar interests. This is very important, because it will keep you up to date with new developments and expose you to a variety of approaches and opinions within the community. 63. Tools and resources: In this video, we are going to talk about additional resources, next steps, and recommended tools. For additional resources, I suggest Kaggle, which we've talked about, but here it is for reference; and the UCI Machine Learning Repository, which we also mentioned briefly at the end of the course, again for reference. Even if you are aware of these now, you might want to look them up later, so save them somewhere you'll remember, and you'll have a quick and handy reference for things like this. The scikit-learn documentation can be found here. And one we didn't mention is Google AI, which has a lot of resources and tools regarding machine learning; you have the link here.
And of course, last but not least, there are our own courses — they are definitely helpful, and you should keep following them. For next steps, I suggest that you make use of the other courses and combine them with what you learned here. For example, Data Visualization with Real Exercises is something that can definitely help you come up with more interesting data visualizations for the datasets we used in this course. You can also integrate some interactive visualizations, and that's definitely a good exercise that will make you more comfortable both with machine learning and with the various data visualization libraries. Next, you should watch the Classification Models with Real Exercises videos, because classification is a very important topic in machine learning — perhaps even more so than regression. A lot of datasets are built for classification tasks, and they have a lot of real-world applications, so it's a good and natural next step. Next, try to think about a field you're passionate about and how regression models could be applied to it. This isn't something you are expected to come up with in a few hours, or even a few days; it's just something to keep in mind and see if anything comes to you. It can be a sport, or some hobby you have — anything, really. Here's a list of recommended tools. Mostly, we recommend Python with Jupyter and Anaconda; these will help you do most things in machine learning. Then there are the Google AI tools — they're good for a lot of interesting things, so I suggest you check them out. Next, there's OpenML, which also offers a lot of datasets and interesting tools. We're not going to go into detail on what each of these is; I suggest you go to their web pages, and if there's anything you don't understand, just leave it for later. Also try to find some interesting tools yourself, and make sure to keep track of them — for example, you can edit this file and add them here.