Transcripts
1. Marketing Video: Welcome to the Machine Learning with Python course, where our goal is to make the machine learning coding experience easy to learn for people from every walk of life. Our focus throughout the course will be to teach you the in-demand machine learning skills using the Python platform for coding through hands-on lab exercises. We will use the Jupyter Notebook coding environment, which is easy to use and very functional. We have enhanced the learning experience by explaining key concepts visually. Later, we have a capstone project at the end of the course that will let you learn all the concepts in one place and in one go. The quizzes are designed to test the students' learning of key concepts. Meet our team, who played a role in researching, designing, and creating this course. Together we have close to 30 years of experience in practical implementation of technology and in teaching technical courses at a university level. Our team members have specializations in the areas of information technology, software engineering, data science, and design. We will have a total of ten sections that are designed to help the students learn progressively. We will start from the basic introduction of the course and gradually move to the intermediate concepts. By the end of these lessons, you will be able to start coding in Python for your machine learning projects. Key concepts are explained visually to enable our students to grasp the technical concepts quickly and effectively. The Jupyter Notebook is an open-source coding environment. Some of its key benefits are: an easy-to-use graphical user interface; text noted in Markdown syntax to give you more insight into the code; easy generation of graphics and charts; interactive code and data exploration; and lastly, easy coding exercises. Our capstone project will help you understand the overall steps which should be followed in order to achieve the desired predicted result.
It involves implementing all the concepts taught in this course. In the present world, machine learning is rooted in almost all fields of life relevant to the common man, and so it is important to study and learn its key concepts. Here are some of the companies requiring machine learning skills to fuel their day-to-day operations and critical software development needs. We look forward to having you join our course, and we promise that this course will help you build the foundational coding skills in machine learning that will make your resume stand out and let you demand a competitive salary in the marketplace.
2. What is ML?: Hi and welcome to lecture one of Python machine learning. Before we move on to machine learning, I have a question for you. How could you tell apples and oranges apart? That's right. We can tell both apart by their physical appearance: shape, texture, color, size, and even taste. All these traits are called features. When we see a person or an object, we subconsciously absorb their features and tell them apart. But the question remains, how can a computer perform such tasks? The answer to this question is through machine learning. So let us see what machine learning actually is. After seeing a person or an object, our brain automatically feeds in the features of that particular thing or person, so in future we know who is who and what is what. For computers to do so, we provide them with a feature set, which is also referred to as an observation. And along with that, we give them labels for each object so that they can run machine learning algorithms on the provided input and train themselves to make future predictions, learn from that data, and improve throughout the process. Let's see the phenomenon discussed diagrammatically. Over here you can see that input data and labels are being supplied to the machine learning algorithm. As a result, a machine learning model is trained. After training the model, test data is provided to the model. In this case, you can see that two shapes are supplied as test data to make further predictions, but over here the labels are not provided. The model predicts what type of shapes these are and gives the predicted output. All of this summarizes the machine learning procedure, and we're going to discuss it in detail throughout the course.
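The train-then-predict flow described above can be sketched in Python. This is a minimal illustration, not part of the course materials: the fruit weights and texture codes are made-up example values, and scikit-learn's DecisionTreeClassifier is assumed as a stand-in learning algorithm.

```python
from sklearn.tree import DecisionTreeClassifier

# Each observation is a feature set: [weight_in_grams, texture]
# texture encoded as 0 = smooth, 1 = bumpy (illustrative values only)
features = [[140, 0], [130, 0], [150, 1], [170, 1]]
labels = ["apple", "apple", "orange", "orange"]

# Training: data and labels are supplied to the learning algorithm
model = DecisionTreeClassifier(random_state=0)
model.fit(features, labels)

# Prediction: unlabeled test data goes in, a predicted label comes out
prediction = model.predict([[160, 1]])[0]
```

The model never sees a label for the test observation; it infers one from the feature patterns it learned during training.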
3. Applications of ML: Up till now we have talked about the theoretical concepts of machine learning. But in order to relate to it, we must see where machine learning exists in our daily lives. Facebook has a huge role to play in socially connecting us with one another. Facebook helps you connect and share with people in your life. Facebook uses machine learning algorithms that run feeds, ads, and search results, and creates new text-understanding algorithms that keep spam and misleading content at bay. Amazon has made our lives easier as it has enabled us to shop from home. Amazon suggests new items or products you might find of interest on the basis of your buying patterns. Similarly, Netflix suggests movies and TV shows on the basis of your video watching trend. All of this is thanks to machine learning. Looking again into Amazon, some more of its features made possible by machine learning include not only product suggestion, but also customer segmentation based on purchasing behavior and variation of product price on the basis of demand. The gaming industry has evolved a great deal thanks to machine learning and artificial intelligence. We now have the facility of virtually experiencing the gaming environment due to VR glasses. Not only that, but physical activity at home is possible due to the gesture game control provided by Nintendo. The last practical example that I'm going to discuss is Uber. Uber has made our lives easier, and machine learning has made it possible. Features such as suggesting the shortest route, suggestions for drop-off and pickup points, finding customers traveling on similar routes, and predicting fares based on route and time of booking are all because of machine learning.
4. Data Acquiring: When a machine wants to learn a certain algorithm, it needs data to learn from. Now there are many different techniques for gathering this data. In this module, we will go through the step-by-step procedures for collecting data, processing it, knowing what formats it is contained in, and how you can bring it to your machine in order for it to learn and become familiar with it. The question which comes first to mind is: what actually is data acquisition? The simple answer is that data acquisition is the process of collecting a dataset that is genuine to the scope of the learning algorithm that you are looking for. In further detail, there are mainly two basic requirements for the application of machine learning: data collection and a model. Data acquisition is the core part of machine learning and artificial intelligence; without a dataset, one cannot simply apply these techniques. The three basic approaches to dataset collection are as follows. Number one is data discovery. Data discovery is the collection and analysis of data from various sources to gain insights from hidden patterns and trends. Through the data discovery process, data is gathered, combined, and analyzed in a sequence of steps. The goal is to make messy and scattered data clean, understandable, and user-friendly. Next comes data augmentation. Data augmentation is a strategy that enables practitioners to significantly increase the diversity of data available for training models. This is done without actually collecting new data. Data augmentation techniques such as cropping, padding, and horizontal flipping are commonly used to train a machine learning model. Then comes data generation. This refers to the theory and methods used by researchers to create data from a sample data source in a qualitative study. Data sources include human participants, documents, organizations, electronic media, and events.
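The augmentation techniques just mentioned, such as horizontal flipping and padding, can be sketched with NumPy. This is a hedged illustration: the tiny 2x3 array of invented numbers stands in for a real image.

```python
import numpy as np

# A tiny made-up 2x3 "image" standing in for real training data
image = np.array([[1, 2, 3],
                  [4, 5, 6]])

# Horizontal flip: a common augmentation that creates a mirrored copy
flipped = np.fliplr(image)

# Padding: surround the image with a one-pixel border of zeros
padded = np.pad(image, pad_width=1, mode="constant")
```

Each transformed copy can be added to the training set, increasing its diversity without collecting new data.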
When collecting data for machine learning, there are some challenges as well; a few of the challenges are stated over here. Let's just see them one by one. Number one is data quantity: continuous learning is possible with the use of existing data, which means that a lot of data is required for training your machine. Then comes data presentation: limited data and faulty presentations can interrupt the predictive analysis for which machine learning data is built. Then we have variety of data: machine learning needs data that is comprehensive to perform automated tasks, so data from multiple areas is useful. Then the accuracy of the data itself is a challenge, especially if it is inconsistent, biased, or insufficient. Next is procuring data: obtaining large datasets requires a lot of effort from companies. Other than that, deduplication and removing inconsistencies are some of the major and time-consuming activities. And lastly, we have data permissions: data, if collected without permission, can create legal issues. Collecting your data must have a source, depending upon what type of model is to be trained, varying by application. The dataset which you may need for an application can be sourced either from a freely available data source over the internet, or you can get the data internally from your organization's internal data source. Depending upon the type of application you need to train your machine for, the data might be freely available, or, if the data is confidential, it might only be with some specific organizations. We will be going through some of the data sources: public data sources, Excel or Google Sheets, web scraping, and lastly, internal data sources. Let's put some light on public data sources first. There are more than a million free data sources available over the internet. These datasets are contributed by companies and government bodies and made publicly available for data science enthusiasts.
This is where we will be mainly focusing. Some of the common data sources available for free are Google Dataset Search, powered by the Google platform, Kaggle, .gov datasets, FiveThirtyEight, and the UCI ML repository. The first question that we're going to address is: what are internal data sources? The most common source of data is the organization's internal database. This data is already collected; it is partially clean and preprocessed. Other internal data sources include online transaction processing, also referred to as OLTP data, which contains the transactional information which could possibly be modified, updated, or deleted. Next is the data warehousing data, which is the analytical data, or data using which you can make reports. Then come the log files. These are generally events recorded at a certain instance, which can be IP addresses or the locations of different people trying to access certain types of content. Then there are sensors and networks, referring to the currently in-demand topic, the Internet of Things. These use sensors over the network, and they generate a very large amount of data in very little time. So coming up next, we will discuss the data preparation process, of which data collection is the first stage. The problem statement should be clear in order to know what type of data needs to be collected, what the data features should be, and where we can collect the data for this task. After the data collection, we move on to the data preprocessing stage. In this stage, the data is formatted to a standard protocol. And in the final stage, the data is transformed into a machine-understandable format, which is numeric only.
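As a small sketch of that final transformation step, here is one way, assuming pandas, to turn a categorical text feature into a numeric one. The gender and grade values are invented for illustration.

```python
import pandas as pd

# Made-up data: a text column the machine cannot use directly
df = pd.DataFrame({"gender": ["M", "F", "M"],
                   "grade": [85, 92, 78]})

# Map the categorical text values to numeric codes
df["gender_code"] = df["gender"].map({"M": 0, "F": 1})
```

After this step, every column the model sees is numeric, which is what most learning algorithms require.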
5. Data Formats: When talking about a dataset for machine learning, the data format is of great importance. Understanding the basic data formats you get the data in, and manipulating them, can be helpful for getting fruitful results in the machine learning process. There are numerous formats of datasets. You might get a humongous dataset which can be in one of many formats, say CSV, text, JSON, XLS, or even some other format. Each format has a different layout and a different look and feel. So among these numerous formats, the popular data formats are: number one, CSV, which stands for comma-separated values; number two, Excel or Google Sheets; and number three, JSON, which stands for JavaScript Object Notation. Let's see these common data formats in detail. First, we have comma-separated values. This format contains data values separated by commas. Each value before a comma belongs to a specified feature. Let's take a look at the CSV file that we have over here. Now considering the first three labels, we see PassengerId, comma, Survived, comma, Pclass, and so on. Now moving on to the next row: the first value, 1, belongs to the first label, which is PassengerId. Similarly, the second value, 0, belongs to the second label, which is Survived. And in the same way, the third value, 3, belongs to the third label, which is Pclass. This could be easier to understand if the data were less in quantity. But the data isn't for us; it is for a machine to understand, and the machine understands it with the help of the commas, just as we discussed. A comma-separated values file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. The use of the comma as a field separator is the source of the name for this file format. Next we have the Excel file format, the extension for which is .xls or .xlsx. This is one of the common formats which we also use in our daily lives.
This format is easier to understand because of its style in table form. So we see here the same data that was previously used to understand the CSV format, now shown in the Excel format. And here it is very easy to understand the labels and the respective data. In Excel-formatted data, the columns represent the features of the dataset, whereas the rows represent the recorded values against each feature. As most machine learning tools use the CSV format, while the data which humans usually require is in the tabulated format, that is, the Excel format, it is important to know how to convert the data from Excel to the CSV format and vice versa. JSON, which stands for JavaScript Object Notation, is a very simple and lightweight data interchange format. It is easy for humans to understand and for machines to interpret. JSON is adopted by hundreds of applications as a data format. Some of these applications include networking automation, programming, configuration files, and data science. Data in the JSON format can easily be converted to CSV, XML, or spreadsheet format. This is an example of what a JSON file actually looks like.
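Moving between these formats can be sketched with pandas. This is a hedged illustration: the three-column sample mirrors the PassengerId/Survived/Pclass example above, with invented values, and an in-memory string stands in for an actual file on disk.

```python
import pandas as pd
from io import StringIO

# An in-memory stand-in for a CSV file on disk
csv_text = "PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n"

# Parse comma-separated values: each comma separates one feature's value
df = pd.read_csv(StringIO(csv_text))

# Convert the same data to JSON (one object per record) and back again
json_text = df.to_json(orient="records")
df2 = pd.read_json(StringIO(json_text))
```

The same `read_csv` call works with a file path; writing back out is `df.to_csv(...)`, and Excel conversion follows the same pattern with `read_excel`/`to_excel` (which need an Excel engine such as openpyxl installed).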
6. Importing Datasets from Public Sources: Till now, we have looked at what data acquisition is, the data formats, and what their sources are. Now we're going to see how we can use the public sources in order to obtain these datasets. Being a data scientist, data collection is a very important process. Data that is readily available and easy to access, such as data available publicly over the internet, is preferred over other sources. Before moving on to the public sources that we will be discussing, let's have an overview of what a dataset is. A dataset is a collection of related data that contains useful information for modeling a machine. A dataset may have one or more database tables holding data of different data entities. A column in our dataset represents a queryable attribute, or you can say a feature. Any dataset can have multiple features, but it depends upon the data scientist which to use and which not. A row in the dataset represents a record or an observation. The total count of the rows accounts for the total number of data points in our dataset. A point to be noted here is that in order to apply a machine learning model for prediction, you should have at least a thousand data points in your dataset. You can download data from various public data sources for practicing data analysis and data science. An expert-level data analyst may create a dataset by themselves, but if you are new to the field and learning data analysis and data science, then you can grab datasets from the free and public data sources that we are about to discuss. Among those, the most common are Kaggle, .gov datasets, and the UCI Machine Learning repository. Let's just explore them one by one. Kaggle is one of the most famous and commonly used sources for acquiring machine learning datasets. It has the largest data science community. Kaggle offers a web-integrated platform for coding your notebook and getting the results in online form, and you can add collaborators to your projects as well.
Kaggle also hosts machine learning competitions in which the competitors design a machine learning model which is fast as well as accurate enough to predict the outputs. Next, .gov datasets are countrywide repositories of data made available by the respective governments. With the United States, China, and many more countries becoming artificial intelligence superpowers, data is being democratized. The rules and regulations related to these datasets are usually stringent, as they are actual data collected from various sectors of the nation; thus, cautious use is recommended. Some of the countries that are openly sharing their datasets include the Australian government dataset, the European Open Data portal, New Zealand's government dataset, and Singapore's government dataset. The UCI machine learning repository is a collection of databases, domain theories, and data generators. These are used by the machine learning community for the empirical analysis of machine learning algorithms. It is used by students, educators, and researchers all over the world as a primary source of machine learning datasets. As an indication of the impact of the archive, it has been cited over 1000 times.
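As one concrete illustration of using a public dataset: the Iris dataset from the UCI repository is so widely used that scikit-learn bundles a copy, so it can be loaded without downloading anything. This is a minimal sketch, assuming scikit-learn is installed.

```python
from sklearn.datasets import load_iris

# Load the bundled copy of the UCI Iris dataset
iris = load_iris()

# 150 observations (rows), each with 4 features (columns),
# plus a label for each observation
X, y = iris.data, iris.target
```

For datasets that are not bundled, the usual workflow is to download the CSV from the public source and read it with `pandas.read_csv`.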
7. What is EDA?: Once we have acquired the data, the next thing we want to do is look into the data. Let's say we have Mike, Julie, and John, some college students, and we want to see whether they play basketball or not. Can you answer this? Wait, no. But why? The answer to this question is because we don't have enough information about their background. What does it mean? It means that the set of features needed to predict the result is unavailable to us, due to which we won't be able to predict the results. Now here we are provided with a set of parameters for some students. It contains the roll numbers, exam grades, and their heights and weights. So can you use the given dataset to predict the answer we're looking for? Yes. But the question is how? In order to predict whether a given student will play basketball or not, we need to design a model that is trained on the training dataset. Here is the training dataset. The given features are the input for the training model, and the predicted result is the output, represented by y. Now there is a feature set accounting for the results. Here we need to check which values are responsible, which features play an important role, and what doesn't affect the predicted results. All this needs to be evaluated. For this, we use the exploratory data analysis technique. Let's see what it is. Just pause for a moment and try to memorize this dataset, as we are going to discuss it while applying the EDA techniques. EDA, also known as exploratory data analysis, is a scheme to explore the dataset you acquired in the earlier step. During this process, the data scientist gains an insight into the data and its underlying structure. By doing this, you become aware of the dataset, understand what story the data tells, get an idea of the next step for the processing of data, and gain a hypothesis to answer the questions for your research.
With the help of the statistical tools in Python, we can perform EDA on our data to explore it and analyze it. In a nutshell, EDA summarizes the main characteristics of the dataset. Looking at the steps involved in the process of EDA, the main steps are stated here, and we're going to look into each of them individually. Let's start with features, data, and variable types. I hope you remember the basketball example and the dataset that we discussed earlier. Let's start the exploratory data analysis of that dataset. Let's look at the variable types we have, or, to be more specific, the data features we have. We see that it has input variables, or the predictor variables, and the output variable, which is our target variable. The features, that is, gender, exam grade, height, and weight, are our predictor variables, whereas the target variable is play basketball. Next, talking about the data types that we have in our dataset: the character-type data is the gender feature, as it contains the value M or F. The numeric data types we have are the exam grades, the height and the weight of the students, and the play basketball output. Moving on to the variable category: this is the term used to identify the variable type. In common, there are two basic variable types, the categorical and the continuous. By categorical variables, we mean those variables which can be categorized, for example the gender feature: it can be categorized as either male or female only. Similarly, the play basketball variable has the value 0 or 1, representing that the student will not play or will play basketball, respectively. Whereas the continuous variables are those which can have any value, such that we cannot categorize them, for example the exam marks, as they can be any value in a continuous range. Likewise, the same goes for the height and the weight values. Next is the univariate analysis. The method for univariate analysis depends upon the variable type.
When talking about the continuous type, these are the values which are continuous within a range, which means that we can analyze the values of such a single feature using the statistical tools applicable to continuous datasets, for example using the mean, median, or mode on the feature values, and similarly finding the min and max values and applying the range or quartiles. This information will help us understand the spread of values and their range. Contrary to this, the categorical variables can be analyzed using the discrete statistical tools, or those functions that can be applied on a grouped dataset. For example, our dataset can have gender male or female; using a frequency table, we can see how many of them are males and how many of them are females. Similarly, we can acquire the percentages for it. Bivariate analysis finds out the relationship between two variables. Here we look for association and disassociation between variables at a predefined significance level. We can perform bivariate analysis for any combination of categorical and continuous variables. This combination can be categorical and categorical, categorical and continuous, or continuous and continuous. Different methods are used to tackle these combinations during the analysis process. Let's say we have three variables, A, B, and C, and we want to perform bivariate analysis on them. The simplest approach is the correlation matrix. Here we can see a matrix for the correlation. The diagonals contain ones, because A is fully dependent on itself, and so are B and C. But we can also see some positive and negative numbers. A positive number means that as one variable increases, the other also increases, whereas a negative number indicates that as one variable increases, the other decreases. The magnitude of the numbers represents how strong the relationship is between the variables. Although we're going to discuss this in detail in a later section, here we'll have a brief overview of what missing values and outliers are.
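The correlation-matrix idea can be sketched with pandas. A small invented dataset is used here, with perfectly linear values chosen so the signs are easy to read: weight rises with height (positive correlation), while grade falls as height rises (negative correlation).

```python
import pandas as pd

# Made-up, perfectly linear data for three variables
df = pd.DataFrame({
    "height": [150, 160, 170, 180],
    "weight": [50, 58, 66, 74],    # rises with height
    "grade":  [90, 85, 80, 75],    # falls as height rises
})

# The correlation matrix: diagonal entries are 1.0, because each
# variable is fully dependent on itself
corr = df.corr()
```

Reading off `corr`: height vs weight is positive, height vs grade is negative, exactly as described above.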
Let's begin. A missing value is something that your dataset does not contain: either it was not recorded or it was left out. It can also be referred to as an incomplete dataset. Outliers are values which are out of the expected range. These are very critical in model training, as they can lead to very dangerous outcomes; calculating the statistics of data including outliers gives misleading results. Feature engineering is the science and art of extracting more information from existing data. For example, let's say you're trying to predict footfall in a shopping mall based on the dates. If you try and use the dates directly, you may not be able to extract meaningful insights from the data. This is because the footfall is less affected by the day of the month than it is by the day of the week. Now this information about the day of the week is implicit in your data; you need to bring it out to make your model better. This exercise of bringing out the information from the data is known as feature engineering. The machine learning process involves the following mentioned steps and processes. Let's just discuss each of them separately so that we get to know the importance of EDA. The first step in the machine learning process is the problem statement, the purpose of developing a model. Then we look at whether there is a need to solve this problem, and that too using the machine learning approach. Next comes the problem-solving approach, which connects us to the data collection step, the reason being that the approach is identified once the data collection is done. After the problem-defining step is done, we go for the data collection process, which we have studied in the previous lectures. The data collection source is very crucial, as the data collected must be authentic and valid in the context of the problem which is to be solved. The quality of the data depends upon the authenticity of the dataset obtained during the collection process.
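Returning to the footfall example, the day-of-week feature-engineering step can be sketched with pandas. The dates here are invented for illustration; the point is extracting the implicit weekday signal from raw dates.

```python
import pandas as pd

# Made-up dates standing in for the mall's visit records
df = pd.DataFrame({"date": pd.to_datetime(
    ["2023-06-05", "2023-06-10", "2023-06-11"])})

# Bring out the implicit day-of-week information as an explicit feature
df["day_of_week"] = df["date"].dt.day_name()
```

The new `day_of_week` column is something a model can actually learn from, unlike the raw dates.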
Next, the data remediation process involves refinement of the data by passing it through a filtration process, organizing it, and migrating it so that it is fit for the purpose and use. After we have organized the data, then comes into play the whole analysis that we have discussed in this lecture, the exploratory data analysis, EDA. This process is the backbone for the training of the machine learning model; any negligence at this stage of analysis can lead to serious effects negatively impacting our model. With the knowledge we have gained during our initial lectures, and depending upon the dataset and its category or type, we train our model: the model which is suitable to the questions or the problems we formulated in the first stage of the process. Once the model is trained, it is ready, after the evaluation, to provide us the results. The last step is to communicate them to the user in a presentable format which is easily understandable. Till now we have seen the process of EDA in numerical format, which is a bit boring, and people tend to avoid the mathematical values. These explorations can be done visually as well. In the next lecture, we will see how we can explore our data visually using the Python interface.
9. Data Standardization: Having some standard for the data makes life easier. Data standardization is the process of re-scaling one or more attributes so that they have a mean value of 0 and a standard deviation of 1. Data standardization is a data processing workflow that converts the structure of disparate datasets into a common data format. As part of the data preparation field, data standardization deals with the transformation of datasets after the data is pulled from the source systems and before it's loaded into the target systems. It is about making sure that the data is internally consistent; that is, each data type has the same content and format. Standardized values are useful for tracking data that isn't easy to compare otherwise. For example, suppose you and your friend went to different universities. One day you both got your midterm grades for your chemistry classes. Your professor sticks to the normal grading scale out of 100, so you got a grade of 84; the test has a mean of 77 and a standard deviation of 6. Your friend's professor, though, uses her own grading scale, so your friend got a grade of 450; her test has a scale of 750, a mean of 400, and a standard deviation of 100. Both of you scored above average, but who did better? While the raw data points might not be immediately comparable, there is a way to standardize and compare them. Converting them to percentages shows that you come out ahead with 84% compared to your friend's 60%. Collect data in common formats: this is when you make sure that your survey is set up to record the same data point in the same format every time. For example, people's dates of birth shouldn't be collected as "June 1986", "21 January 1974", and "1956" in the same survey. Collect data based on preset standards: if there are pre-existing international or local standards for how to measure and count a particular data point, stick to them.
For example, the SDG indicators are a great international standard that more organizations are adopting today. Transform data into a common format: during data cleaning, data standardization involves changing different data formats into just one format. Convert data to z-scores: rather than showing a data point on its own scale, a z-score shows how many standard deviations a data point is from the mean. This conversion happens during data cleaning or analysis.
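The z-score comparison from the grading example can be sketched directly, using the numbers quoted above: your grade of 84 on a test with mean 77 and standard deviation 6, versus your friend's 450 on a test with mean 400 and standard deviation 100.

```python
def z_score(x, mean, std):
    """How many standard deviations x lies from the mean."""
    return (x - mean) / std

your_z = z_score(84, 77, 6)        # about 1.17 standard deviations above
friend_z = z_score(450, 400, 100)  # 0.5 standard deviations above
```

Both z-scores are positive, so both of you scored above average, and comparing them shows you did better relative to your own class.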
10. Data Normalization: Hi guys. In this lecture we'll be talking about data normalization, which is an important technique to understand in data preprocessing. Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form, such as the dot product or any other kernel, to quantify the similarity of any pair of samples. Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values. For machine learning, not every dataset requires normalization; it is required only when the features have different ranges. When we take a look at the used car dataset, we notice that the feature length ranges from 150 to 250, while the features width and height range from 50 to 100. We may want to normalize these variables so that the range of the values is consistent. This normalization can make some statistical analyses easier by making the ranges consistent between variables. Normalization enables a fair comparison between different features, making sure they have the same impact. It is also important for computational reasons. Here is another example that will help you understand why normalization is important. Consider a dataset containing two features, one being the age and the other being the income, where age ranges from 0 to 100, while the income ranges from 20 thousand to 500 thousand and higher. The income is about 1,000 times larger than the age. So you see that both of the features have very different ranges, and when we do further analysis, for instance linear regression, the attribute income will have an immense effect on the result due to its larger values. But this doesn't necessarily mean that it is more important as a predictor.
So that's why, to avoid a bias in the linear regression model that weighs the income more heavily than the age, we can normalize these two variables into values that range from 0 to 1. Compare the two tables at the right: after normalization, both variables now have a similar influence on the models we will build later. There are several ways to normalize data, but we will just outline three techniques over here. The first method is a very simple technique called simple feature scaling: it just divides each value by the maximum value for that feature. This makes the new values range between 0 and 1. The second method, called min-max, takes each value x_old, subtracts the minimum value of that feature, then divides by the range of that feature. Again, the resulting new values range between 0 and 1. The third method is called the z-score, or standard score. In this formula, for each value you subtract mu, which is the average of the feature, and then divide by the standard deviation, which is sigma. The resulting values hover around 0 and typically range between negative three and positive three, but can be higher or lower. Following our earlier example, we can apply the normalization methods on the length feature. First we use the simple feature scaling method, where we divide by the maximum value in the feature using the pandas method max. This can be done in just one line of code. Here is the min-max method on the length feature: we subtract each value by the minimum of that column and then divide by the range of that column, the maximum minus the minimum. And finally, we apply the z-score method on the length feature to normalize the values. Here we apply the mean and standard deviation methods on the length feature: the mean method returns the average value of the feature in the dataset, and the standard deviation method returns the standard deviation of the feature in the dataset.
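The three one-liners described above can be sketched with pandas. The length values here are invented to match the 150-to-250 range mentioned earlier; note that pandas' `std` uses the sample standard deviation by default.

```python
import pandas as pd

# Invented length values spanning the range mentioned in the lecture
df = pd.DataFrame({"length": [150, 200, 250]})

# 1. Simple feature scaling: divide each value by the feature's maximum
df["length_sfs"] = df["length"] / df["length"].max()

# 2. Min-max: subtract the minimum, then divide by the range (max - min)
df["length_mm"] = (df["length"] - df["length"].min()) / (
    df["length"].max() - df["length"].min())

# 3. Z-score: subtract the mean (mu), divide by the standard deviation (sigma)
df["length_z"] = (df["length"] - df["length"].mean()) / df["length"].std()
```

After each method, the column is on a common scale: 0 to 1 for the first two, and values hovering around 0 for the z-score.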
11. Lab Section 6: Hi guys, and welcome to the first lab session of our course. This lab is for section six, in which we're going to see how to import data, perform some exploratory data analysis, and do data cleaning. Formally, the steps we're going to discuss are: collecting the data, importing the CSV files, performing EDA on this data, visually analysing the data, and lastly, cleaning this data. So let's get to it. In the first step, collecting the data, we're going to download the Titanic dataset. You can do that by clicking the link provided in the document and downloading all of the files; a Titanic zip file will be downloaded into your Downloads folder. Just unzip that file and you'll see three CSV files. For your ease, I've pasted a sample image: you get one train CSV, one for test, and one gender submission file. After you're done downloading your dataset, the next step is importing the dataset into Jupyter. This is a very important step: for our data to mean something, and for us to perform machine learning on it, we need to bring it into the Jupyter notebook. Before we do that, let's see how we can import libraries into our Jupyter notebook. We're going to import the pandas library, which we already discussed in the conceptual phase of our course. This library is written for the Python language, mainly for data manipulation and analysis; in particular, it offers data structures and operations for manipulating numerical tables and time series. So let's see how we can import this library, or as a matter of fact any library. The keyword that we use is import. After that, we type the name of the library that we want to import, and then we give an alias to this library; using this alias, we can use the functions of the library. For executing this cell, we press Shift and Enter simultaneously.
And you can see that the line, or the cell, has been executed perfectly. Another way to import libraries is using the pylab magic function. This pylab is a magic function that can be called in IPython, and by calling it, the interpreter will import the matplotlib and numpy modules. For that, we need a percent sign and then the pylab keyword; the inline option basically draws all of your figures inline in the notebook. We'll see that practically. Moving on, similar to the previous cell, for executing this we need to press Shift and Enter simultaneously. And you see that this has been performed and there is no error; it's been executed. Whenever we import a file for data analysis, the notebook needs to know where the files to read, or import, actually live. To ensure that our file is placed at the current location, we have a very important library, which is os; it stands for operating system. To import it, we use the import keyword along with the os library name, and Shift and Enter to execute. Moving on to the next cell, we're going to call its function getcwd, which gets the current working directory. When I execute it, you see that my current working directory is a folder under Users. So it's telling me what my current working directory actually is. After building up this foundation, we're now ready to read, or import, the CSV file. We're going to use the alias of the pandas library, pd, and call read_csv, which is a function of this library. I give it the path, or the name of my file, from which I want it to read the data, and the output is going to be loaded into this variable name, which is df. I press Shift Enter for this to execute. And the file has not actually been found.
This is a very common error that you might get if your file is not in the current working directory. Since it says my current working directory is this folder, I actually need to paste my file there. So I'm going to go into my Downloads, take these files, go to that folder under Users, and paste them here. Now let's try this again. Yes, it's done: all of the data has been imported into the df variable. Now let's see what's in this data. By using the variable name df — I'm not using print here, simply df — and Shift and Enter to execute, you see the output: I have all of this data. We're going to see this data in detail in a bit. You see that I have 891 rows and 12 columns. So let's move on to the exploratory data analysis part. The very first thing is that I want to see the shape of my data; for that I have the command .shape. Let me press Shift and Enter, and you see that I have 891 rows and 12 columns, same as above. And if I want to see the end of the data, that is, the data at the end of the entire variable, the tail of the data, for that I have the function .tail. Shift Enter, and this is the tail of my data, the data which is at the end. You can see the index numbers here: it's showing me rows 886, 887, 888, and so on. So these are the end rows. Now I want a summary of my data. To know your data better in order to perform EDA, you need to know the details of your data, and for that we use .describe. Shift, Enter. Here it tells me the count of PassengerId, Survived, Pclass, and Age, and you can see that the count for Age is slightly less than all of the others.
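The import-and-read steps above can be sketched in a self-contained way. Since the lab's train.csv lives on disk, this sketch reads a tiny made-up CSV from a string instead, so it runs anywhere; the column names mimic the Titanic file, but the rows are invented.

```python
import os
import io
import pandas as pd

# pd.read_csv("train.csv") only works if train.csv sits in the current
# working directory, which you can check with:
cwd = os.getcwd()

# Self-contained stand-in: a tiny CSV read from a string rather than a
# file (the real lab reads the Titanic train.csv from disk).
csv_text = """PassengerId,Survived,Pclass
1,0,3
2,1,1
3,1,3
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)  # (number of rows, number of columns)
```

If the file path is wrong, `pd.read_csv` raises `FileNotFoundError`, which is exactly the error shown in the lab before the file is copied into the working directory.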
That's probably because there are some missing values, as we discussed earlier, which is why it's giving me a smaller number here. The mean for all of these is stated here; likewise I have the standard deviation, the minimum value, then the 25 percent quartile, 50 percent, 75 percent, and lastly the maximum value. You can see that somebody paid a fare of 512 — I don't actually know the currency, but this is the highest fare that was paid — and at the lowest value you can see a 0, so somebody actually paid nothing to get on the Titanic. We're going to explore that. Similar to the previous case where we saw the tail of the data, now we're going to see the head of the data, that is, the starting rows. For that we have the function .head; let me press Shift Enter. This is the head of the data, and now you see the indices from 0 to 4, contrary to the tail function. So let's discuss all of the columns that we have. We have 12 columns in our dataset. Number one is PassengerId; this is the passenger number for each row in our dataset, going from 1 to 891. Then the Survived column, which is basically going to be our target column: here 0 means that the person did not survive, that is, the person died, and 1 means that they survived. Pclass is the passenger ticket class, where 1 stands for first class, 2 for second class, and 3 for third class. Sex represents the gender of the passenger, and Age contains the age of the passenger. The SibSp column records whether the passenger was travelling with a spouse or siblings; the number shows the total count. Then Parch: this value represents whether the passenger had any parents or children on board.
If present, it shows the total count. Next, Ticket is the ticket number of the passenger, Fare is the amount of fare they paid in order to board, and Cabin is the cabin number of the passenger. Lastly, Embarked contains the port from which they embarked: C stands for Cherbourg, Q for Queenstown, and S for Southampton. So let's see the type of the data for each column, that is, the data type of the column. For that we have the attribute dtypes. I'm printing df.dtypes, and Shift Enter; it tells me that PassengerId is an integer, while Name and Sex are objects. It's an interesting detail that all of the strings appear as objects rather than strings. Age is a float, and you can see that Cabin, Embarked, and Ticket also contain strings, so these are objects too. And what if I don't print and simply write df.dtypes? Just pressing Shift Enter, we get the same result, so there's no need to print it. Okay, now for the number of survived passengers: I want to find out how many of the passengers actually survived and how many did not. For that I have df, which is my variable name, .Survived, which is my column name, and value_counts, which is the function name. If I execute it, you can see that the number of people who didn't survive is 549 and the number who survived is 342. Now let's count gender-wise: I want to see how many were male and how many were female. Instead of Survived, I'm going to use Sex as the column name. Shift Enter, and these are the males and females that were on board, irrespective of whether they survived or not: the number of males is 577 and the number of females is 314. Okay, now I want my data to be visually explored.
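The tail, describe, dtypes, and value_counts steps above can be sketched on a tiny stand-in frame. The rows here are invented for illustration, not the real Titanic data.

```python
import pandas as pd

# Tiny stand-in for the Titanic frame (values invented for illustration).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "Sex": ["male", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, None, 35.0],
})

tail = df.tail(2)                    # last rows of the frame
summary = df.describe()              # count/mean/std/quartiles per numeric column
dtypes = df.dtypes                   # string columns show up as dtype "object"
counts = df["Sex"].value_counts()    # how many of each category
```

Note how `describe()` reports a count of 4 for Age even though the frame has 5 rows: the missing value is excluded, which is exactly the clue the lab uses to spot missing data.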
So I want to visually explore it by making a bar chart of the male and female counts. I don't want the output in the row-and-column form I got earlier; now I want it plotted. For that I'll use the plot function, and in the kind parameter I'm going to give it 'bar', that is, I want the plot to have bars — it's basically a bar plot. Here you can see that males are more numerous than females, and again this is irrespective of whether they survived or not. Next I want to plot a histogram of the fare. I'm writing df.Fare, which is my column name, and instead of writing .plot with kind='bar', I'm simply writing hist, which is a function that plots the histogram. Shift Enter, and yes, this is my fare histogram. Here we can see that the people who paid a fare from 0 to 50 number slightly more than 700, the people who paid a fare from 50 to 100 are a little more than 100, and so on. Very few people paid more than 500, or between 400 and 500, to board the Titanic. Moving on to the data cleaning phase: we're going to see how we can extract the name titles, that is Mr., Mrs., et cetera, from the Name feature, and we're going to fill in the missing information. Moreover, we're going to separate and categorize the uncategorized data in the dataset, and we're going to create a CSV file of the cleaned dataset. This is the preprocessing phase of data analysis. Before we start the data cleaning part, we will first take a look at the complete data to get insight into the cleaning we need to do. And before we begin, you know the drill: we need to import some libraries. I'm going to import matplotlib.pyplot and give it the alias plt, then use the inline magic, because I want all of my plots to appear inline, and I'm also going to import the library seaborn.
We saw that both of these libraries are used for visualization; this is its alias. Let's Shift and Enter to execute this. It has executed. Now we're going to explore the categorical features, and for that we define a function, bar_chart. Firstly, we make a copy of the existing variable df in df_copy. In the bar_chart function I have survived and dead variables: in survived we keep only the value counts of the rows where the Survived value is one, and in dead we get the counts for the people who did not survive, where the Survived value equals 0. Then we have another variable, df_new, which is of DataFrame type and contains the survived and dead counts, with indices for Survived and Dead. Lastly, we plot this information: the function takes a feature parameter and plots a bar chart for that feature. Let's begin with gender-based survival; the feature for that is Sex, the column name. Let me execute it, and you can see that we have two bars here — it's a stacked bar plot. When we look at the Survived bar, you can see that more females survived than males. And when we look at the other bar, fewer females died than males. Now let's do the same for Pclass. Here you can see in the chart that the first class survived more than the third class: the colour for first class is blue, so they survived more than the third class, which is green, and the second class, which is orange. And when we look at the people who died, you can see that third-class passengers were more likely to die than the first and the second class. Now let's perform this analysis on the sibling-and-spouse feature.
So what does this chart show us? It shows that passengers with no sibling or spouse were more likely to die than others. Let's perform this analysis again on parent-and-child survival, and you can see that people with a parent-or-child count of 0 were more likely to die. Now we're going to see feature extraction: we will extract the title, Mr. or Mrs., from the Name feature. Here is the code for that. Firstly, we have another variable, df_new, which contains the df DataFrame. In the dataset's Title column, we extract all of the titles from the Name feature, and extract is the function we're using; we extract the word that sits between a space and a dot, because those are the patterns surrounding every title. Let me execute this, and after that let's print out the head. Here you can see we have an additional feature, Title, which contains Mr., Mrs., or whatever the title is. Now let's see the title counts. Here you can see the counts: Mr. is 517, Miss is 182, and so on. So now let's clean our dataset. The convention we're going to use is to give Mr. the value 0, Miss 1, Mrs. 2, and the others 3: all of the remaining titles, which don't have a very high count, get the value 3. Here is our title mapping; we're using a dictionary, so for the Mr. key we give the value 0, for Miss 1, Mrs. 2, and the rest of them the value 3. After that, we change the value of Title using this mapping; we use the map function to map each of these values to its respective title. Now let's print the head, and you see that the title is now 0, 1, 2, or 3. Next, let's use this Title feature in the bar_chart function that we defined.
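The extraction and mapping steps just described can be sketched as follows. This is a minimal illustration on made-up names; the regular expression (a word between a space and a dot) matches what the lab describes, and the mapping dictionary here is deliberately shortened to three titles.

```python
import pandas as pd

# Made-up names; only the " Title." pattern matters for the example.
df = pd.DataFrame({"Name": [
    "Braund, Mr. Owen",
    "Cumings, Mrs. John",
    "Heikkinen, Miss. Laina",
]})

# Extract the word that sits between a space and a dot, e.g. "Mr".
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Map titles to ordinal codes. Titles missing from the dictionary would
# become NaN, which is why the lab first lumps rare titles together.
title_mapping = {"Mr": 0, "Miss": 1, "Mrs": 2}
df["Title"] = df["Title"].map(title_mapping)
```

After this, the Title column holds small integers instead of strings, which is the form the later model-training steps expect.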
And here you can see that the people who died were more likely to be Mr., as per the map — recall that the Mr. value is 0 and Mrs. is 2. So more of the people who died were male, and more of the females survived. Now let's do the same for gender: we're going to assign male the value 1 and female the value 0, so I'm just going to replace male with 1 and female with 0. Printing it, you can see that the values for male and female have changed. The second step we're going to see here is filling in the missing values, or missing information, in the dataset. This is very important, and before we fill in the missing information, we must first know what information is missing from our dataset. Let's see how we can do that. Once again I'm going to use the shape attribute; it now gives me 13 columns, since we added one more. And instead of describe, let's use the info function. This info function gives you the non-null count and the data type, and you can see that in Age we do have some missing values: the total number of rows is 891, but we have 714 here, which means some values are missing. Similar is the case with Cabin and Embarked. Now let's check the sum of the null values. We use the function isnull, which returns a Boolean value indicating whether or not each entry is null, and then we calculate the sum of that. Let me Shift Enter, and now you can see that we have 177 Age rows that are null, 687 for Cabin, and 2 for Embarked. So this is the sum of all of the null values. We discussed a number of approaches for filling in missing values earlier; here, we will fill in the missing values with respect to the Title feature.
That is, to fill in the ages of males we will take the mean age of the male titles, and similarly for the females. Here we have another variable, missing_ages, which will contain only those rows whose Age feature is NaN. Let me execute this and then display missing_ages. This is my DataFrame containing all of the rows where Age is null — NaN stands for not a number. Then I take the mean ages, grouped by Sex and Pclass. After that, let's see what the mean ages are. Here you can see the mean ages for female, which is 0, across Pclass 1, 2, and 3, and similarly for male we have these mean ages. Now we have another function which fills in the null values: it checks whether the age is null, and if so replaces it with the mean value — it checks which group the row belongs to and replaces the value with that group's mean age; otherwise it just returns the row's age as it is. We execute this and apply it to the Age column. Now let's see how many nulls we have with respect to Age: here you can see we now have 0 nulls in the Age column. Moving on, we're going to see how we can categorize the uncategorized data in the dataset. The ages of different people take various, continuous values, and we need to categorize them into classes. This process, or method, is known as binning, or categorizing, the data. In binning, we map the feature values onto categories: if a person's age is in the range of a child, we give it category 0; if he or she is young, they get the young category, which is 1; adults get a higher category, and seniors the highest. Here is the code for that: according to a certain condition, if the age is less than or equal to 16, that person gets 0, that is, he or she is a child.
Similarly, if the age is greater than 16 but less than or equal to 26, we have category 1, and so on. According to these conditions, we will categorize, or bin, the Age feature. Now let's execute this and see what we get. Here you can see that in the Age feature I'm getting the categories instead of the continuous values. Let's enter this feature into the bar chart, and now you can see that the people who died mostly belonged to category 2, and category 2 is the adults — so more of the adults did not survive. Lastly, let's export the cleaned CSV file. First I'm just going to print the head, and then I'm going to convert my DataFrame to CSV using the to_csv function, giving it the name titanic_dataset_cleaned.csv. Let's execute this. After that, I have another variable, df_clean, into which I'm going to read my cleaned CSV. Yes, it has been successfully executed, and let's print the head of this cleaned CSV. Here you can see that it has been printed. Thank you for watching the lab, and I hope it cleared up many of your theoretical concepts, as we had a hands-on session on them today.
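The binning step just described can be sketched with chained `.loc` conditions. Only the first two boundaries (16 and 26) come from the transcript; the upper boundaries here are illustrative assumptions.

```python
import pandas as pd

# Toy ages chosen to land in each bin (illustrative values).
df = pd.DataFrame({"Age": [10.0, 20.0, 30.0, 50.0, 70.0]})

# Replace continuous ages with category codes:
# 0 = child (<=16), 1 = young (16-26), then higher codes for older groups.
# The boundaries above 26 are assumptions for this sketch.
df.loc[df["Age"] <= 16, "Age"] = 0
df.loc[(df["Age"] > 16) & (df["Age"] <= 26), "Age"] = 1
df.loc[(df["Age"] > 26) & (df["Age"] <= 36), "Age"] = 2
df.loc[(df["Age"] > 36) & (df["Age"] <= 62), "Age"] = 3
df.loc[df["Age"] > 62, "Age"] = 4
```

The conditions are applied in increasing order, so a value already replaced by a small category code is not re-binned by a later condition.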
12. Model Selection: Before training, we need to select the model best fit to our data and needs. Given easy-to-use machine learning Python libraries like scikit-learn and Keras, it is straightforward to fit many different machine learning models on a given predictive modelling dataset. The challenge of applied machine learning therefore becomes how to choose among the range of different models that you can use for your problem. There are three types of model used in machine learning, namely the classification model, the clustering model, and the regression model, which we have already discussed in module three. Let us now look at model selection with cross-validation. We'll go over why cross-validation is important, understand how it works, and see how it can be applied in many different ways. Suppose you need to design a machine learning model that classifies a given fruit as either an apple or an orange; this can be an issue for a farmer who needs to pack them separately on a production line for dispatch. In the machine learning approach, the input is represented by x and the output by y. The output y can be 0 for an apple or 1 for an orange. In order to design our system, we need to collect data from the real world. Our dataset will contain different photographs of various fruits along with their labels. Once we are done with the data collection, we will train our system and then put it to the test in the real world, helping the farmers to distinguish between the apples and the oranges. Let's look at the training process in more detail. As it turns out, we know many ways of training machine learning systems, each with different parameters and settings. For example, we can learn a 1-nearest-neighbour system, a 3-nearest-neighbour system, a 5-nearest-neighbour system, a kernel regression system with a sigma of one, a kernel regression system with a sigma of two, maybe a support vector machine, and many others.
The problem of choosing which method to use from a pool of possible methods is known as model selection. We want to choose the model that will perform the best at test time, in the real world — but all we have is a fixed dataset. One way to choose would be to train each method on our data and then test on the same data. This is a terrible idea: it amounts to giving the answer key to the model itself before even providing the test dataset. Instead, we will do the following. We split our data into sections; each section is called a fold. In this example, we have four folds. Next, we iterate through the folds as follows. In the first iteration, we train on folds two through four and then test the method on fold one. In this case, the algorithm has never seen fold one before. Next, we measure the error rate of our method on this fold. We then swap the roles of folds one and two: we retrain on folds one, three, and four, and test on fold two. We repeat this process for each fold, withholding that fold from training and then computing the error on that fold at test time. Some folds are easier to learn than others. Finally, we combine the four error rates into a single average. This average is known as the cross-validation error. For any single method, the cross-validation error is an estimate of how the method would perform if the data we collected is an accurate representation of the real world. We repeat the cross-validation procedure for each method we might select during training; then we can select the model with minimum cross-validation error. In this case, five nearest neighbours is our best guess, and for this problem it will be the best fruit identifier for us. Now that we have chosen our model, we can evaluate it on the real data. But what sort of performance do we expect? Do we expect exactly the same performance as our cross-validation estimate? Maybe our estimate was optimistic, or maybe it was too conservative.
In fact, the 17 percent error we found during the model selection process is almost certainly optimistic. This is because model selection has biased our estimate of test error: we chose the best cross-validation error out of many possibilities. Even if we had a pool of one million random classifiers, we would still expect at least one of them to have low cross-validation error purely by chance. So we need to take another look at our data. We still use cross-validation, but this time we apply cross-validation twice. First, we separate our data into two parts: the first part will be used for model selection, and the second will be used for testing, to represent the unseen world. The important point is that the held-out data is never touched by the model selection procedure. To perform model selection, we divide the first part of the data into folds, just like before; in this case, we have six folds. We then perform cross-validation for each of our methods to determine an error rate. In this case, three nearest neighbours is the method with the lowest cross-validation error. We can now evaluate the result of model selection on our held-out test data; this time we use all folds of training data during training. Now, what does this final number estimate? In fact, it is an estimate of our entire learning process: we look at the data, we train multiple methods, and then we select the best according to cross-validation. Finally, we test on held-out data not seen by the algorithm. In other words, we obtain an estimate of how our entire learning procedure, which includes model selection as part of training, will perform on unseen data — again, if the world happens to be well represented by our dataset. This time our estimate of 16 percent is most likely conservative, since we are only using a portion of the data that we have in order to train our model. In conclusion, firstly, cross-validation is a simple and useful method of model selection.
But more importantly, cross-validation is also necessary to obtain an estimate of the error of our model selection method.
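The fold-by-fold procedure from this lecture can be written from scratch in a few lines. This is a sketch, not the course's code: the one-dimensional "fruit" feature, the seed, the fold count, and the candidate k values are all invented for illustration, and in practice you would reach for scikit-learn's cross_val_score instead of hand-rolling the loop.

```python
import numpy as np

# Synthetic 1-D fruit data: a made-up "redness" feature,
# label 0 = apple, 1 = orange.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.3, 0.1, 40), rng.normal(0.7, 0.1, 40)])
y = np.array([0] * 40 + [1] * 40)

def knn_error(X_tr, y_tr, X_te, y_te, k):
    """Error rate of a k-nearest-neighbour classifier on a test fold."""
    errors = 0
    for xi, yi in zip(X_te, y_te):
        nearest = np.argsort(np.abs(X_tr - xi))[:k]   # k closest training points
        pred = int(round(y_tr[nearest].mean()))       # majority vote
        errors += pred != yi
    return errors / len(y_te)

def cross_val_error(X, y, k, n_folds=4):
    """Average error over folds: train on the rest, test on the held-out fold."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    errs = []
    for fold in folds:
        mask = np.ones(len(y), dtype=bool)
        mask[fold] = False                            # withhold this fold
        errs.append(knn_error(X[mask], y[mask], X[fold], y[fold], k))
    return float(np.mean(errs))

# Model selection: pick the k with the lowest cross-validation error.
cv_errors = {k: cross_val_error(X, y, k) for k in (1, 3, 5)}
best_k = min(cv_errors, key=cv_errors.get)
```

Which k wins depends on the random data; the point is the procedure — every candidate is scored by the same fold-averaged error, and the minimum is selected.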
13. Training Model: An essential part of a machine learning model is its training. In most cases where companies or organizations are interested in developing a machine learning model, the data is one complete set of information, meaning that there are no separate test data and training data. When we train a model, the training process needs a dataset which contains the features essential to making a prediction. As the original data contains the answers, so does the training data: this data helps the model train itself using statistical tools and algorithms. Once the model is developed using the training dataset, it is exposed to the test data. The test data is without the answers, and this helps us to test the model, whether it gives the desired output or not. For complete data without the test and training sets kept apart, splitting the dataset is of great importance in machine learning. The accuracy of machine learning models is one of their important aspects. One can train and test the model using the same dataset, but this has flaws, as the goal is to estimate the likely performance of the model on out-of-sample data. There are several advantages of data splitting. First, the model is trained and tested on different data. Second, the predicted values are known for the test dataset, so the model can be evaluated. Third, testing accuracy is a better estimate of performance than training accuracy. Data splitting is of two types. The very first type is sequential splitting: in this type, you have a dataset of, for example, 1,000 samples. Using the 80-20 ratio, you split your data: we take the first 800 samples as the training dataset and the remaining 200 samples as our test dataset, removing their result answers. Here the yellow shows the training dataset and the green shows the test dataset. The other approach is random splitting.
In this type, again supposing that we have a complete dataset of 1,000 samples, we take 800 samples randomly from the dataset as our training data and the remaining samples as our test data. These samples are chosen randomly: as you can see, the yellow blocks and the green blocks in random splitting are not in contiguous locations.
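The two splitting schemes just described can be sketched in plain Python, with index lists standing in for the 1,000 rows; the seed is arbitrary. In practice scikit-learn's train_test_split does the random version for you.

```python
import random

# 1,000 dummy samples (indices stand in for the actual rows).
data = list(range(1000))

# Sequential split (80/20): first 800 rows train, last 200 test.
train_seq, test_seq = data[:800], data[800:]

# Random split (80/20): shuffle, then take 800 rows for training and
# the remaining 200 for testing. Train and test never overlap.
rng = random.Random(42)
shuffled = data[:]
rng.shuffle(shuffled)
train_rand, test_rand = shuffled[:800], shuffled[800:]
```

In the sequential split the test rows all come from one contiguous block at the end, which can bias the evaluation if the data is ordered; the random split avoids that, which is why it is usually preferred.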
14. Lab Section 7: Hi guys, and welcome to lab two of our course. This lab is associated with section seven, in which we are going to see how we can select a particular model and how we are going to train that model. Previously we saw how to import a dataset, and we used the Titanic survivor-prediction dataset; we're going to continue working with that dataset in this lab as well. Before we move any further, we need to import the necessary libraries. Here are some data analytics and data visualization libraries: we have pandas, numpy, and random, and for visualization we're going to use seaborn and matplotlib.pyplot. Let's execute the cell. It's taking a little time for me. Okay, it's done. We also need to import the machine learning libraries: from sklearn we're going to import linear regression, logistic regression, the K-neighbors classifier, the decision tree classifier, and Gaussian Naive Bayes. And the other imports you see are basically for evaluation purposes, in which we're going to use the accuracy score, K-fold, cross_val_score, cross_val_predict, and the confusion matrix. Let me execute this cell first. The next essential step is to read our data. We have two CSV files, one is train and the other is test: I'm going to read the test CSV into test_df and train into train_df, and then we're going to combine both of these in one variable. Let's execute this, and after that let's print the info for both of them in order to gain a little more perspective. Okay, so you can see here that in the first DataFrame we have around 12 columns — we have Survived, PassengerId, Pclass, and so on, and their datatypes — but in the second DataFrame we don't have the Survived column. That is because Survived is our target column, and we need to actually predict that column; that is why this column is missing in our test DataFrame.
So we are training the model by giving it the Survived column, but when we are testing it, we are testing how well it predicts that column; that's why the column is missing. Next, I'm just going to describe the train DataFrame. Okay, we actually saw this in the previous lab: it's giving me the count, mean, standard deviation, quartiles, minimum, and maximum. Let's also print out this DataFrame; it gives me the number of rows, the number of columns, and all of the data. We analysed this data in the previous lecture, so I'm not going to go into the details again. Some of the features which we feel are not relevant to the machine learning techniques we're going to drop, and the two features that we are going to drop are Cabin and Ticket. Before dropping, we'll look at the shape of train and test: this is the shape for train and this is the shape for test. Earlier we declared a variable combine, into which we put both of these DataFrames; you can see that at index zero it has the shape which matches the train DataFrame, and at the next index it has the shape which matches the test DataFrame. Now we're going to drop Ticket and Cabin from both of the DataFrames. I'm using the drop function for that, and then again putting both of these into the combine variable. Now let's see their shapes. Here we have the shapes of train and test, in the same order as before, and you can see that we now have ten columns instead of 12, and over here nine columns instead of 11; the number of rows remains the same in both cases. Let's print train_df, and now you can see that those columns have been dropped. Next, we're going to create a new feature. This is the same Title feature as we made in the previous lab: we're going to extract it from the Name feature, and then we're going to represent this information in a crosstab.
Now you can see how many females and males have each respective title: we have 517 Mr, 182 Miss, and 125 Mrs. Now we can replace many titles with a more common name, or classify them as Rare. The ones that are rare are these, and we're going to replace their titles. Okay, that has been replaced, and now let's just group them: group Survived with respect to Title and take the mean. So the mean for Master who survived is this, the mean for Miss who survived is that, and so on and so forth. We can convert the categorical titles to ordinals as well: for Mr I'm giving the value 1, Miss the value 2, Mrs 3, Master 4, and Rare 5. We saw earlier how we can do that, so I'm not going to explain the code over here, but we're going to perform this action and then I'm going to print the head. Now you can see that our data has actually been categorized: in the Title column I have the categories instead of the strings. So now we can safely drop the Name feature, since we have extracted the information that we want from it. Here in this code we are dropping the Name feature and the PassengerId, which also doesn't contain much information that we can do anything with. Since I'm dropping it from the train, I am dropping it from the test as well, in order to keep some harmony between both, and then let's see their shapes. So now we have nine columns in both of them. Next, let's just convert the strings to numeric values; that is, we are going to map gender to categorical data as well: for female I have 1, and for male I have 0. This is the same as we discussed previously, and now you see that we have categories over here for Sex. The next step is to treat the missing values, and we're going to treat the missing values in the Age feature, but we're going to do that very smartly.
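The title extraction and ordinal mapping described above can be sketched like this (the regex pulls the word that ends in a period after the comma in each name; the sample names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Braund, Mr. Owen Harris',
                            'Heikkinen, Miss. Laina',
                            'Cumings, Mrs. John Bradley',
                            'Palsson, Master. Gosta',
                            'Uruchurtu, Don. Manuel']})

# Extract the word ending in '.', e.g. 'Mr' from 'Braund, Mr. Owen Harris'.
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Collapse uncommon titles into 'Rare', then map each title to an ordinal.
common = ['Mr', 'Miss', 'Mrs', 'Master']
df['Title'] = df['Title'].where(df['Title'].isin(common), 'Rare')
title_map = {'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Master': 4, 'Rare': 5}
df['Title'] = df['Title'].map(title_map)
print(df)
```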
First of all, we're going to visualize the age in terms of Pclass and in terms of Sex, and then we are going to impute it accordingly. We have used the seaborn visualization for that. I'm using a grid over here, and you can see that we have six graphs. We have age on the x-axis, and on the y-axis we have the count for that age. For Pclass == 1 and Sex == 0 we have one graph, another for Pclass == 1 and Sex == 1, and so on and so forth. So you can see how the age varies according to Pclass and according to Sex. In the graph here you can see that age between 20 and 40 has higher bars; this is for Pclass 3, that is, the people who traveled in third class, and for Sex 0, which is the male gender. Similarly for the females, you can compare both of the age distributions. Let's start by preparing an empty array to contain the guessed values of age based on the Pclass and gender combinations. This is how we're going to create the array. We have created it successfully, and now we're going to iterate over Sex 0, 1 and Pclass 1, 2, 3 to calculate the guessed ages. You can see in this code that we are iterating over the Sex and iterating over the Pclass, and we have another variable, guess_df, which contains all of the Age values for that Sex and Pclass where Age is not null; we're dropping all of the null values. Then we calculate the median of this DataFrame, which is named age_guess. Now we have a list which holds the guessed ages, and we are rounding off each age: if the age is, say, 45.31, we round it off to the nearest 0.5. After we're done with that, we replace the null values of Age with the guessed values that we have computed over here.
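The age-guessing loop described above can be sketched as follows, using a tiny hypothetical frame instead of the Titanic data; the median per (Sex, Pclass) cell is rounded to the nearest 0.5 and written back into the null slots:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Sex':    [0, 0, 0, 1, 1, 1],
    'Pclass': [1, 1, 1, 1, 1, 1],
    'Age':    [40.0, 45.31, np.nan, 30.0, np.nan, 35.0],
})

guess_ages = np.zeros((2, 3))        # rows: Sex 0/1, cols: Pclass 1/2/3
for sex in range(2):
    for pclass in range(1, 4):
        guess_df = df.loc[(df.Sex == sex) & (df.Pclass == pclass), 'Age'].dropna()
        if len(guess_df):
            age_guess = guess_df.median()
            # round to the nearest 0.5, e.g. 42.655 -> 42.5
            guess_ages[sex, pclass - 1] = int(age_guess / 0.5 + 0.5) * 0.5

# Fill each null Age with the guessed value for its Sex/Pclass cell.
for sex in range(2):
    for pclass in range(1, 4):
        mask = df.Age.isnull() & (df.Sex == sex) & (df.Pclass == pclass)
        df.loc[mask, 'Age'] = guess_ages[sex, pclass - 1]
print(df)
```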
And lastly, we are assigning the guessed values back into the Age feature of the dataset. Then we're just going to print the head of the train DataFrame and the info of the train DataFrame. Let's just see what we get. Here you can see that the null values are now gone: we don't have any null values in the Age feature. This is a smart way of actually guessing the age and then fixing the missing value problem. After eliminating the null values, the next step is to produce the bins. We have a function cut in pandas which can get us the bins; we just need to tell it on which feature we want the bins and how many bins we want. So I'm telling it that I want to apply this cut function on the Age feature and I want five bins, or five ranges, for it. It has given me that, but along with that, in the next line, I'm saying that I want Survived to be displayed alongside it: I want to group by the age band and take the mean of Survived. So it's telling me, for each age range, what fraction of people survived in the Titanic dataset. The next step is to categorize all of the Age values. We've got the ranges for the ages and we've got the mean of Survived, and according to this we're just going to categorize them. We have these ranges over here, according to which we are giving the values 0, 1, 2, 3, and then we are going to print the head. Now you can see that we have the AgeBand, but in the Age column we now have the categories. Next we're going to drop the AgeBand column, since we don't need it anymore; using the drop function, we're going to drop it, and now you see that it has been dropped. Another thing that we're going to do is create a new feature based on the existing features. This is known as feature engineering. We're going to create a family size based on the sibling/spouse and the parent/child features.
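The binning step above can be sketched with `pd.cut` on a handful of hypothetical ages: first display the five equal-width bands alongside the mean of Survived, then replace the continuous age with its band index:

```python
import pandas as pd

df = pd.DataFrame({'Age': [4, 22, 38, 55, 70], 'Survived': [1, 0, 1, 0, 0]})

# Five equal-width age ranges, like pd.cut(train_df['Age'], 5) in the lab.
df['AgeBand'] = pd.cut(df['Age'], 5)
print(df.groupby('AgeBand', observed=False)['Survived'].mean())

# Replace the continuous Age by an ordinal band index 0..4.
df['Age'] = pd.cut(df['Age'], 5, labels=False)
df = df.drop(columns=['AgeBand'])   # the band column is no longer needed
print(df)
```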
We're going to iterate through the combine variable and we're going to introduce another feature, which is FamilySize. It's just adding the siblings and spouses for each of the passengers along with their parents and children, plus one for the passenger him or herself. Next, we are going to display this information: we have FamilySize and Survived, grouped by the family size, and we're taking the mean of Survived. Let's just see that information. Now you can see the family size and what the chances of surviving were. We're also going to create another feature, which is called IsAlone, and we're going to see that if a person was alone, what were his or her chances of survival. I'm not going to explain this code, since we have gone through it many times. So if the person was alone, the survival chance was about 0.30, and if he was accompanied by his family, his survival chance was about 0.50. Now we're going to drop the parent/child, siblings/spouses, and FamilySize features, because we think that IsAlone is giving us a lot of information, and from the rest of the features we've extracted the meaningful information and we don't require them anymore. Okay, we can also create an artificial feature combining Pclass and Age; let's just see how that is done. We have another feature, which is Age*Class. We're making this feature by multiplying the Age with the Pclass, and here's its head; you can see that this is the feature we have got. Up till now we have seen how we can fill the missing values of a continuous feature, but how can we do that with a categorical feature? We saw earlier that Embarked has missing values, and its categories are basically S, Q and C. So let's just see how we can deal with this missing feature. In our dataset, we saw that Embarked has basically two missing values.
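The FamilySize and IsAlone engineering described above can be sketched like this on a hypothetical three-passenger frame:

```python
import pandas as pd

df = pd.DataFrame({'SibSp': [1, 0, 3], 'Parch': [0, 0, 2], 'Survived': [1, 0, 0]})

# Family size = siblings/spouses + parents/children + the passenger themselves.
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

df['IsAlone'] = 0
df.loc[df['FamilySize'] == 1, 'IsAlone'] = 1

print(df.groupby('IsAlone')['Survived'].mean())

# Keep IsAlone, drop the raw columns it summarises.
df = df.drop(columns=['SibSp', 'Parch', 'FamilySize'])
```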
So we are just going to fill them with the most common value. We take the mode for Embarked where we don't have any null values, and the most frequently used category is S, so we're going to simply fill the null spaces with S. Then again, we're going to see the Survived mean for each category. If a person embarked from port C, the survival mean is about 0.55, from port Q it is 0.389, and from port S it is 0.339. How can we convert a categorical feature to numeric? We can convert the Embarked feature by mapping it to a new numeric feature: we're just giving the category S the value 0, C the value 1, and Q the value 2, and now you can see that we have mapped it to numeric values. We have another numeric feature that has missing values, similar to Age. This feature is Fare, and we are going to compute the median for this feature and replace the nulls with that median. We are going to do this in a single line of code; here is that line of code. Now you can see that we have filled Fare's nulls with the median value of Fare. Similar to Age, we are going to use the qcut function over here. Previously we used the cut function, but now we want to use qcut, and we're going to make four bins for Fare. Let me just display this answer. Here we have four bins, and we have the respective mean of Survived, according to which we are going to make the categories for the Fare feature. Let's just run this code. Now you can see that the Fare feature has been categorized and there are no null values in it. Moving on, we're going to see how we can select a model and how it can help us predict the target output and solve our problem. Firstly, we are going to split our data: we have X_train and Y_train. In X_train we're dropping the Survived column, and Y_train has only the Survived column.
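The two imputation patterns above, mode for a categorical feature and median for a numeric one, plus the `qcut` quantile binning, can be sketched on hypothetical values like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Embarked': ['S', 'C', np.nan, 'S', 'Q', np.nan],
    'Fare':     [7.25, 71.28, 8.05, np.nan, 13.00, 30.00],
})

# Categorical: fill missing ports with the most frequent one (the mode).
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

# Numeric: fill missing fares with the median, then bin into quantiles.
df['Fare'] = df['Fare'].fillna(df['Fare'].median())
df['FareBand'] = pd.qcut(df['Fare'], 4, labels=False)   # 4 equal-frequency bins
print(df)
```

Unlike `cut`, which makes equal-width ranges, `qcut` makes ranges holding roughly equal numbers of rows, which suits a skewed feature like Fare.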
In X_test, we are dropping the PassengerId, and then we're printing out all of their shapes. For X_train we have 891 rows and 8 columns, and for Y_train we have 891 rows. For X_test, we have 418 rows and 8 columns. Let's just print out X_train, and you can see that these are the columns that we have and these are the values for those columns. First, we're using the logistic regression model. The code for this is very simple: we're using the LogisticRegression function, we're fitting X_train and Y_train into this function, and then we're predicting our output values. Next we're checking the accuracy for this model. We have a built-in function to check the accuracy score for this model, and it's 80.36; not bad at all. Now let's just check the correlation of each feature with the target class, and these are the correlations. We discussed in the conceptual study that a strongly positively correlated feature has a positive value greater than 0, and a negatively correlated one has a negative value. You can see that Sex is a very strongly correlated feature. Next, we're going to use the decision tree model for the prediction; here is its code. Then again, we're going to fit X_train and Y_train into the model, we are going to predict the result into Y_pred, and we're going to calculate its accuracy score, which is 86.76. Similarly, we are going to do it for k-nearest neighbors, but here you see that we need to give the parameter n, that is, we want three nearest neighbors to be the parameter. Fitting the algorithm and checking the accuracy, it is 83.8. After that, we're going to do it with linear regression, and for linear regression you see my accuracy score is 39.06; a very poor score. And lastly, we are going to see Naive Bayes; you see the accuracy score is 72.2. Let's just visualize all of these scores on a graph so that we can get a better understanding of which model is actually better for us in this problem.
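The fit/predict/score loop over several classifiers can be sketched as below. A synthetic dataset from `make_classification` stands in for the Titanic (X_train, Y_train), and linear regression is left out since `score` there returns R², not classification accuracy, which is why it fared so poorly above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (X_train, Y_train); the lab uses the Titanic features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree':       DecisionTreeClassifier(random_state=0),
    'KNN (n=3)':           KNeighborsClassifier(n_neighbors=3),
    'Naive Bayes':         GaussianNB(),
}
scores = {}
for name, model in models.items():
    model.fit(X, y)
    # Training accuracy, as in the lab: round(score * 100, 2)
    scores[name] = round(model.score(X, y) * 100, 2)
print(scores)
```

Note these are training-set accuracies, which is why a fully grown decision tree looks unbeatable here; the next lectures show why cross-validation gives a fairer comparison.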
Here we have all of the scores of all of the models in tabulated form. But since we have also seen how we can visually represent our data, let's just do that. Over here you can see that the decision tree has the highest score; after that we have KNN, then logistic regression, then comes Naive Bayes, and lastly we have linear regression. So the decision tree has outperformed all of the models in this case. Thank you for watching this lab.
15. K folds: In this module, we're going to discuss the evaluation techniques for a machine learning model. The problem with machine learning models is that you won't get to know how well a model performs until you test its performance on an independent dataset, a dataset which was not used for training the machine learning model. So we use some model evaluation techniques. The first model evaluation technique that is going to be discussed is cross-validation. In machine learning, we can't just fit the model on the training data and say that the model will work accurately for real data. For this, we must show that the model got the correct patterns from the data and that it is not picking up too much noise. For this purpose, we can use the cross-validation technique. Cross-validation helps you estimate the performance of your model. One type of cross-validation is k-fold cross-validation. When you are given a machine learning problem, you will be given two types of datasets: known data, which is the training dataset, and unknown data, which is the test dataset. In one round of cross-validation, you will have to divide your original training dataset into two parts: number one, the cross-validation training set, and number two, the cross-validation testing set, or the validation set. You will train your machine learning model on the cross-validation training set and test the model's predictions against the validation set. You will get to know how accurate your machine learning model's predictions are when you compare the model's predictions on the validation set with the actual labels of the data points in the validation set. For reducing the variance, several rounds of cross-validation are performed using different cross-validation training sets and cross-validation testing sets. The results from all rounds are averaged to estimate the accuracy of the machine learning model. K-fold cross-validation is a common type of cross-validation.
The steps of k-fold cross-validation are: number one, partition the original training dataset into k equal subsets. Each subset is called a fold. Suppose the names of the folds are F1, F2, and up to Fk. Now we loop through i = 1 to k, and we keep the fold Fi as the validation set, keeping all the remaining k minus 1 folds in the cross-validation training set. The model is trained on the cross-validation training set, and the accuracy is calculated by validating the predicted results against the validation set. The average accuracy of all iterations is the estimated accuracy. In the k-fold cross-validation method, all the entries in the original training dataset are used for both training as well as validation, and each entry is used for validation just once. Generally the value of k is taken to be ten, but it is not a strict rule, and k can take any value.
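The steps above can be sketched in plain Python: split the sample indices into k disjoint folds, then use each fold once as the validation set and the rest as the training set. No model is actually trained here, so the per-fold "accuracy" is only a placeholder:

```python
def kfold_indices(n_samples, k):
    """Yield (training, validation) index lists for each of the k folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]  # k disjoint folds
    for i in range(k):
        validation = folds[i]                       # fold Fi is the validation set
        training = [idx for j, fold in enumerate(folds)
                    if j != i for idx in fold]      # the remaining k-1 folds
        yield training, validation

accuracies = []
for train_idx, val_idx in kfold_indices(10, k=5):
    # ...here you would train on train_idx and validate on val_idx...
    accuracies.append(len(val_idx) / 10)   # placeholder "accuracy" for the sketch

estimated = sum(accuracies) / len(accuracies)   # average over all k rounds
```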
16. Accuracy and precision: After talking about the cross-validation model evaluation technique, we're going to talk about accuracy and precision. Both accuracy and precision hold extremely high importance when addressing model evaluation techniques, so let's discuss them. Accuracy is the closeness of a measurement to a specific value, whereas precision is the closeness of the measurements to each other. In other words, accuracy describes the difference between the measurement and the actual value, while precision describes the variation you see when you measure the same thing repeatedly with the same device. Let's take a look at the following confusion matrix. Can you tell what the accuracy for this model is? According to the matrix in front of us, the values on the diagonal, that is, actual negative and predicted negative, and actual positive and predicted positive, are the correct predictions. So the accuracy of this model is very, very high: 99.9%. But what if I mentioned that the positive over here is actually someone who is sick and carrying a virus that can spread very quickly? Well, you get the idea. The cost of having a misclassified actual positive, or false negative, is very high here. A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true negative is an outcome where the model correctly predicts the negative class. A false positive is an outcome where the model incorrectly predicts the positive class, and a false negative is an outcome where the model incorrectly predicts the negative class. Now let's look at precision first. You can see the formula for the calculation of precision on your screen. Immediately you can see that precision talks about how precise or accurate your model is: out of those predicted positive, how many of them are actually positive. Precision is a good measure to use when the cost of a false positive is high; for instance, email spam detection.
In email spam detection, a false positive means that an email that is non-spam (actual negative) has been identified as spam (predicted positive). The email user might lose important mails if the precision is not high for the spam detection model. Now let's see, in the same context, what recall is. The formula for its calculation is also on your screen. It actually calculates how many of the actual positives our model captures by labeling them as positive, that is, true positive. Applying the same understanding, we know that recall shall be the model metric we use to select our best model when there is a high cost associated with a false negative. For instance, in fraud detection, if a fraudulent transaction (which is actually positive) is predicted as non-fraudulent (that means predicted negative), the consequences can be very bad for the bank.
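The three metrics from the lecture can be computed directly from the four confusion-matrix cells. The counts below are hypothetical, chosen to echo the sick-patient example: accuracy looks excellent while recall is terrible, because most actual positives are missed:

```python
# Hypothetical confusion-matrix counts for a rare-positive problem.
tp, fp, fn, tn = 10, 5, 40, 945

accuracy  = (tp + tn) / (tp + tn + fp + fn)  # overall fraction correct
precision = tp / (tp + fp)   # of predicted positives, how many are real
recall    = tp / (tp + fn)   # of actual positives, how many we caught

print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```

Here accuracy is 0.955 even though the model catches only 10 of the 50 actual positives, which is exactly why recall matters when false negatives are costly.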
17. Lab Section 8: Hi everyone, welcome to lab number three of our course. In this lab we are going to evaluate the models that we previously trained in lab number two. We're going to begin with logistic regression, and the technique that we're going to use in order to evaluate it is cross-validation. I'm going to explain the code over here, and this code is pretty much the same for all of the remaining models. So we are going to go into the detail of the first block of code, and then I assume that you will have a good enough understanding of it that we can move on to checking the outputs and comparing them across the rest of the models. Let's just get to it. Okay, you can see over here that we have a variable k_fold and we are calling the KFold method. It's a method of the sklearn library, and it provides the train and test indices to split the data into training and test sets; it splits the data into k consecutive folds. You can see some parameters that we have defined over here; let's just discuss them. Firstly, we have n_splits, which is basically the number of folds, and it must be at least two; over here we have assigned it the value ten. The second is random_state, and we have declared that this is 22; it basically affects the ordering of your indices. So this line splits the data into ten equal parts. Up next we have cross_val_score. It's also a method of the sklearn library, and we are going to use it in order to evaluate a score by cross-validation. We're giving it an estimator, which is our trained logistic regression model; it's the object used to fit the data. Next we have an array over here, which is all the features; this is the data to fit. And then we have the targeted feature, which is basically our target variable, the one we are trying to predict.
Then we have cv=10, which is basically ten cross-validation folds, and the scoring technique, or the scorer, is accuracy. We're basically printing the result over here: we're taking the mean and rounding it off to two decimal places. Next we have cross_val_predict, which generates a cross-validated estimate for each input data point. We have our estimator over here, then we have all of the features, which is the data to fit, and lastly we have the targeted feature, and then cross-validation, which is set to ten. Finally, we're going to plot a heatmap, and we are going to plot it for the targeted feature against the predicted values that we got using cross_val_predict. Let's just execute this and see the result. Okay, we can see over here that the cross-validated score for logistic regression is 80.24, and in terms of the confusion matrix, we can see that the zeros correctly identified are 477 and the ones correctly identified are 238, while the wrongly predicted ones are 104, and the wrongly predicted zeros are 72. Let's just use the same technique in order to evaluate the decision tree. The code is the same except that I use the decision tree as the estimator. Let's just execute this, and now you can see that the cross-validated score for the decision tree is 79.92. In terms of the confusion matrix, the zeros correctly predicted are 485 and the ones correctly predicted are 228, whereas the false predictions are 114 and 64. Now for k-NN: let me just make a slight change over here and set the estimator to KNN, and just rerun this. So yes, the cross-validated score for KNN is 78.46, and this is the confusion matrix; these are correctly predicted and these are wrongly predicted. Now for Naive Bayes, the score is 71, and this is the result of the confusion matrix.
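The KFold / cross_val_score / cross_val_predict pipeline described above can be sketched as follows. A synthetic dataset stands in for the Titanic features; note that recent scikit-learn versions require shuffle=True whenever a random_state is passed to KFold, which is a small deviation from the code as read out in the lab:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

X, y = make_classification(n_samples=200, n_features=6, random_state=1)
model = LogisticRegression(max_iter=1000)

# Ten equal folds; shuffle=True is required when random_state is set.
kfold = KFold(n_splits=10, shuffle=True, random_state=22)
scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print('cross-validated accuracy:', round(scores.mean() * 100, 2))

# One out-of-fold prediction per data point; the lab plots cm as a heatmap.
y_pred = cross_val_predict(model, X, y, cv=10)
cm = confusion_matrix(y, y_pred)
print(cm)
```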
All of these are actually quite close to each other in terms of the confusion matrix. So we have calculated the cross-validation accuracies; now let's just compare them. I'm just taking the mean and rounding it off to two decimal places. Let's just run this and see them in tabulated form. Over here you can see that the cross-validation accuracy score for logistic regression is 80.24, for the decision tree it's 79.92, for KNN it's 78.46, and for Naive Bayes it's 71. Let's just plot this to get a better understanding, and here is the plot. Earlier, in the previous plot, we saw that the decision tree came out to be the best, but now you can see that logistic regression proves to be the best model and the best fit in order to predict more accurate results as compared to the others. So, as a deployable model for this problem statement, that is, prediction of the survivors, the logistic regression model has proven to be the most suitable.
18. Deploying ML Model: The deployment of machine learning models is the process of making your models available in the production environment, where they can provide predictions to other software systems. It is only once models are deployed to production that they start adding value, which makes deployment a crucial step, and we're going to discuss it in this module. Now that you are done with the machine learning modelling and design process, you have a good working knowledge to generate your own machine learning model. The major remaining part is to deploy the model you have created so that it can be used by the end users, so that they can input parameters, change them, and predict the results they are looking for. So let's begin with a definition of model deployment. The practical process of machine learning deployment, or what we call model deployment, simply means the integration of a machine learning model into an existing production environment, where it can take some input parameters and return a predicted output. The purpose of deploying your model is to make the predictions from a trained machine learning model available to end users and other systems existing in the network, such as stock prediction, real estate price estimation, investment plans, et cetera. For model deployment, there are some criteria that the machine learning model needs to achieve before it's ready for the deployment phase. The criteria that should be met by our machine learning model before deployment are: number one, portability. This refers to the ability of the software to be transferred from one machine or system to another. A portable model is one with a relatively low response time and one that can be rewritten with minimal effort. The next one is scalability. This refers to how large your model can scale. A scalable model is one that doesn't need to be redesigned to maintain its performance.
Let's have a look at the high-level architecture for machine learning. The machine learning modelling process carries four major layers. The first one is the data layer, which provides access to all the data sources that the model requires. The second is the feature layer, which is responsible for generating feature data in a transparent, scalable, and usable manner. The third one is the scoring layer, which is responsible for transforming features into predictions. Scikit-learn is most commonly used here and is the industry standard for scoring. And the last layer is the evaluation layer. The evaluation layer checks the equivalence of two models and can be used to monitor production models, which means it is used to monitor and compare how closely the training predictions match the predictions on the live input data. In general, there are three ways of deploying your machine learning model. The first one is one-off. It is not always the case that a machine learning model needs to be trained continuously; sometimes a model is only needed once or on a periodic basis. In such cases, the model can be trained ad hoc as per the requirement and moved to production until it degrades and requires attention to address the fixes. Next is batch training. This allows constant use of updated model versions and has a scalable capability, which allows the use of sub-samples to train the model instead of using the complete dataset. This is better if you are using a model on a consistent basis but don't require real-time predictions. And third comes real-time. In many cases it is required to get a real-time prediction; let's say we want to determine whether a transaction is fraudulent or not. This can be done with the help of online machine learning models, like linear regression using stochastic gradient descent. There are a few factors that need to be considered when determining the chosen method of deployment.
Number one is how frequently predictions will be generated and how urgently the results are needed. Second is whether the predictions should be generated individually or in batches. Third is the latency requirement of the model and the computing power capabilities that one has. And lastly, the operational implications and the cost required to deploy and maintain the model.
19. Capstone Project: Hello everyone. Now we're at the capstone project, and in this phase of our course we are going to have an overview of all of the discussed concepts; we're going to apply them and see how we can work with them. The problem at hand is Boston house price prediction. The first step is that we need to load the necessary libraries. We have numpy, matplotlib.pyplot, pandas, seaborn, scipy.stats, sklearn, and rcParams from matplotlib. Let's just import these. They have been successfully imported, and now we're going to load our dataset. The sklearn library contains a variety of pre-stored datasets, and Boston house prediction is one of them, so we're going to load it from the library. Here is the code for that: we have a variable boston into which we load this dataset. Let's just print out the keys for this dataset. We have the data, we have target, we have feature names, we have the description, and the file name. Let's just print out the shape of the data: we have 506 rows and 13 columns. Now we're going to describe our data; let's just go through this so we can get a better understanding of it. This is the Boston house prediction dataset, and the number of instances, which is the number of rows, is 506, and the number of attributes is 13; this doesn't include the target. Here is the description for each of the columns that we have. We have CRIM, which is the per capita crime rate by town. We have ZN, which is the proportion of residential land zoned for lots over 25,000 square feet. INDUS is the proportion of non-retail business acres per town. CHAS is the Charles River dummy variable, which is 1 if the tract bounds the river and 0 otherwise. NOX is the nitric oxide concentration. RM is the average number of rooms per dwelling. AGE is the proportion of owner-occupied units built prior to 1940.
DIS is the weighted distance to five Boston employment centers. RAD is the index of accessibility to radial highways. TAX is the full-value property tax rate per $10,000. PTRATIO is the pupil-teacher ratio. B is the proportion of blacks per town. LSTAT is the percent lower status of the population. And MEDV is the median value of owner-occupied homes in $1000s. There are no missing attribute values, and the creators of this dataset are Harrison and Rubinfeld. We're going to use this dataset for predicting the prices of Boston houses. Let us just see the feature names; these are the tags for the features. Now we're going to convert the loaded dataset into a DataFrame by applying the pandas library function. pd is the alias for the pandas library, and we're calling its DataFrame function, giving it as a parameter the data that we have already loaded, and we're saving it in the boston_df variable. Lastly, we are going to see the head of this DataFrame, and here it is. The thing to be noted over here is that we don't see the column names yet; they are tagged as 0, 1, 2, 3, whereas they should be given the names CRIM, ZN, and so on and so forth. So let's just do that. Over here we are replacing the integers with the tags, that is, boston_df.columns = boston.feature_names, and now let's just display the head. Here it is; they have been replaced. Lastly, we're going to create another column. The name of that column is going to be price, and we're going to give it the value of boston.target, and you can see that it has been added. Let's just see the shape now: we have 14 columns instead of 13. Let's use the describe function to see the summary of each column: this is the count, the mean, standard deviation, minimum value, the 25% quartile, 50% quartile, 75% quartile, and the maximum values.
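The DataFrame-building steps above can be sketched as follows. Note that `load_boston` has been removed from recent scikit-learn releases, so this sketch uses random stand-in arrays with the same shapes and column names; only the DataFrame mechanics (naming the columns, appending the target) mirror the lab:

```python
import numpy as np
import pandas as pd

# Stand-in for boston.data / boston.target, since sklearn's load_boston
# is no longer available in recent versions of the library.
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
rng = np.random.default_rng(0)
data = rng.random((506, 13))        # stand-in for boston.data
target = rng.random(506) * 50       # stand-in for boston.target (MEDV)

boston_df = pd.DataFrame(data)      # columns start out tagged 0, 1, 2, ...
boston_df.columns = feature_names   # replace the integer tags with names
boston_df['PRICE'] = target         # add the target as a 14th column
print(boston_df.shape)
```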
Now we're going to find out the correlation between the features. For that we are using the built-in function corr, and then we're going to see the shape: it is 14 by 14. Now let's just plot this on a heatmap. It's taking a little bit of time, and we have it here. You can see over here that we have positively correlated features and some negatively correlated features. You can see that with price, LSTAT is very strongly correlated, but negatively, and similarly RM is strongly correlated, but this is a positive correlation. Now let's look into further detail of the correlation, using a pair plot over here; it's taking a bit of time, and here it is. Using this pair plot, we can see the distribution of the features with respect to price. We can see that RM, TAX, PTRATIO and LSTAT have a very strong relationship with price; they are in good correlation, so we are going to focus on these. Let's just plot a graph that only has these four features: we are going to shorten our DataFrame, and then we're going to look into the correlation of these four with price. Let's just plot the heatmap, and now it's a bit more clear. We can see that TAX has a very strong but negative correlation with price, as is the case with PTRATIO, and LSTAT is even more strongly correlated, but negatively. We were missing one feature over here, which is RM; we can see that it's positively correlated, and it has a very strong correlation when it comes to price. So these are our features of interest. Here is the correlation in terms of a matrix; these are the values of correlation. Now let's just make a pair plot again. It's going to be a five by five figure, and it will help us understand the data in more detail; we will see the distribution of the data with the help of this pair plot. All right, we can see that RM and LSTAT are normally distributed.
Whereas in TAX and in price, we do see some outliers, as over here. And some here as well, as some of the prices are very high; most of the prices lie somewhere here. So we're going to see how we can handle these outliers as well. Okay, so let's describe our limited dataset. And here are the count, mean, standard deviation, minimum, 25%, 50%, and 75% quartiles, and the maximum values for these four features and our target feature. Now we're going to understand the feature correlation. Let's just see the individual relationship of each feature with price, starting with the number of rooms, which is RM. So we're going to have a regplot: the x-axis has the value RM, and the y-axis has the value price. Let's just execute this. And we can see that there is a positive correlation; that is, with the increase in the number of rooms, the price also increases, and with the decrease in the number of rooms, the price also decreases. So there's a positive correlation between the two. Next, we have the lower-status population, which is LSTAT. And now we can see that we have a negative correlation; that is, with the increase in LSTAT, we have a decrease in price, and vice versa too. We have a negative correlation. Now let's just see TAX and price. So there's a direct correlation between tax and price; that is, when the price is more, we can see that the taxes are more as well. And lastly, for PTRATIO and price, we see a sort of negative correlation between the two features. Moving on, we're going to have a univariate and multivariate analysis. For that, we're going to analyze the price first. We are going to analyze price using a boxplot and a distribution plot. And here are both of our plots. So you can see some extreme values on the left and some extreme values on the right; these could be potential outliers. And by the shape of the distribution plot, we can see that price is normally distributed.
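The correlation checks described above can be reproduced on any DataFrame with corr(). Here is a minimal sketch on synthetic data, where the price is constructed to rise with RM and fall with LSTAT; the column names are reused from the lesson, but the data is invented.

```python
import numpy as np
import pandas as pd

# Synthetic data: price rises with RM (+) and falls with LSTAT (-)
rng = np.random.default_rng(1)
n = 200
rm = rng.normal(6.3, 0.7, n)                        # number of rooms
lstat = rng.uniform(2, 35, n)                       # lower-status population (%)
price = 5 * rm - 0.5 * lstat + rng.normal(0, 2, n)  # target with some noise
df = pd.DataFrame({"RM": rm, "LSTAT": lstat, "Price": price})

corr = df.corr()                                    # pairwise Pearson correlations
rm_price = corr.loc["RM", "Price"]                  # comes out positive
lstat_price = corr.loc["LSTAT", "Price"]            # comes out negative
```

The same corr matrix is what seaborn's heatmap and pairplot visualize; here we just read the signs directly.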
Now, using the prerequisite knowledge of outliers, we will observe these data points as potential outliers: one which is less than quartile one minus 1.5 times the interquartile range, and one which is greater than quartile three plus 1.5 times the interquartile range. So now let's just find out the values of these potential outliers. That is, we're going to compare the high and low values of the price to these quartile ranges. So firstly, we have the prices in the lower price range, and these are the potential outliers which have a value less than quartile one minus 1.5 times the interquartile range. So we have a few observations: you can see that the lower the price, the higher the taxes. And we can see that the tax of 666 is very high for a house having almost five rooms. So in conclusion, we can say that as both TAX and LSTAT are negatively correlated to price, this means that the higher the TAX and the LSTAT value, the lower will be the price. So both of these values are of importance, and we cannot just remove them from our dataset; we need them as we go forward. The same analysis will be applied for the upper value range of prices. And here we can see that we have lots of higher house prices, which are out-of-the-ordinary higher values. So we can see over here that for houses whose price is high, for instance, over here we have lots of 50s, the number of rooms ranges from about five to nine. So here you can see that they range from five up to nine. Also, for these houses the TAX ranges from low to very high. So these are some low values, and then again we have some highs. So for house prices between 37 and less than 50, the room number is higher than 75% of the total data points. And since RM is positively correlated to the price, this can be a possible reason for the little higher house prices.
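The outlier rule above, anything below Q1 − 1.5·IQR or above Q3 + 1.5·IQR, can be sketched as follows on a synthetic price column with two extreme values planted in it; the data is invented for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic price column with two planted extremes (60.0 high, 1.0 low)
rng = np.random.default_rng(3)
price = pd.Series(np.concatenate([rng.normal(22, 5, 200), [60.0, 1.0]]))

q1, q3 = price.quantile(0.25), price.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr          # anything below this is a potential outlier
upper = q3 + 1.5 * iqr          # anything above this is a potential outlier

low_outliers = price[price < lower]
high_outliers = price[price > upper]
```

This is exactly the fence a boxplot draws its whiskers at, which is why the extreme points show up as separate dots there.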
And also for these houses, the PTRATIO and the LSTAT lie in about the 25% to 50% range. So these are both of these ratios. Since PTRATIO and LSTAT are negatively correlated to price, this can be the reason for the little higher house prices. So what can we conclude from this? We can conclude that we will have to drop the data points for the houses that have a price equal to 50, but we keep the points between 37 and 49 as they don't have any unusual behavior. But the ones having a price equal to 50 tend to have this behavior, so they ought to be removed. Let's just remove those points and see the DataFrame's shape before and after. So firstly, we have 506 rows, but after removing the outliers where the house price equals 50, we have 490 rows. Now let's just analyze TAX in the same manner as we did for the price. So over here you can see that we have a boxplot for TAX and a distribution of TAX. And lastly, we have made a scatterplot of TAX versus the price. The distribution of TAX is not normal. And in the boxplot we can see that there are no outliers, but instead there are some tax values that are way too high. From the scatterplot, we see that these high tax values are for price values ranging from low to high. Now let's just see the shape where TAX is greater than 600. So we have about 132 entries in which TAX is greater than 600; that is mostly the value we saw, which came out to be 666. OK, let's dive a little deeper into the analysis and print the temporary DataFrame. We have 132 rows and five columns. Let's just describe this DataFrame. So these are the count, mean, standard deviation, minimum, quartiles, and the maximum value. So what can we observe from this description? From the summary, we can say that RM ranges from about 3.56 to about 8.78. And for PTRATIO, we can see that the value is mostly 20.2 with a little variation. For LSTAT, we can see that the range is from 5.29 to about 37.97.
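Dropping the capped rows and counting the suspicious tax values can be sketched like this, on a tiny invented frame; in the lesson, the real counts were 506 → 490 rows, and 132 entries with TAX above 600.

```python
import pandas as pd

# Tiny invented frame: two houses capped at price 50, two taxed at 666
df = pd.DataFrame({"TAX": [296.0, 666.0, 666.0, 311.0, 402.0],
                   "Price": [24.0, 50.0, 13.8, 50.0, 31.0]})

before = df.shape[0]                  # 5 rows
df = df[df["Price"] != 50.0]          # drop the houses priced exactly 50
after = df.shape[0]                   # 3 rows

high_tax = (df["TAX"] > 600).sum()    # suspicious tax entries that remain
```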
And for the price, we see that the range is from about five to about 29.80. So all of these observations are quite unusual, and it isn't possible to have such high tax values for all of these houses. These values are most likely missing values that were imputed casually by someone. So the conclusion that we can make is that, since LSTAT is most correlated to TAX, as seen above in the heatmap, we will replace those 132 TAX values with the mean of the remaining TAX values, dividing them into some intervals with the help of LSTAT. In interval one, which is LSTAT 0 to 10, we will replace the extreme TAX values having LSTAT between 0 and 10 with the mean of the other TAX values whose LSTAT is between 0 and 10. In interval two, which is LSTAT 10 to 20, we will replace the extreme TAX values having LSTAT between 10 and 20 with the mean of the other TAX values whose LSTAT is between 10 and 20. The next thing that we're going to do, which is interval three, is to replace the extreme TAX values having LSTAT between 20 and 30 with the mean of the TAX values whose LSTAT is between 20 and 30. And similarly for the fourth interval. So all of those steps that I've mentioned have been coded over here. Let's just execute this, and the values have been successfully imputed. Now let's just see the count where TAX is greater than 600; it comes out to be 0. So we have successfully replaced those values. And lastly, let's just see the distribution of TAX now. You can see over here that now it's normally distributed and there's not much variation. So all of the values which were greater than 600 have been imputed. Next, we are going to analyze the PTRATIO. So you know the drill: we're going to have a boxplot, a distribution plot, and a scatterplot with respect to price. So over here we can see that PTRATIO is normally distributed, and there are a few values which are to the extreme left.
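The interval-wise imputation described above can be sketched as follows. The data is synthetic, with a block of 666 tax values planted, but the loop implements the same idea: within each LSTAT band, replace the extreme TAX values with the mean of the normal TAX values in that band.

```python
import numpy as np
import pandas as pd

# Synthetic data with a planted block of suspicious TAX = 666 entries
rng = np.random.default_rng(4)
n = 200
lstat = rng.uniform(0, 40, n)
tax = rng.normal(330, 60, n)
tax[:20] = 666.0
df = pd.DataFrame({"LSTAT": lstat, "TAX": tax})

# For each LSTAT interval, replace TAX > 600 with the mean of the
# normal TAX values falling in the same interval
for lo, hi in [(0, 10), (10, 20), (20, 30), (30, 40)]:
    in_band = df["LSTAT"].between(lo, hi)
    normal = in_band & (df["TAX"] <= 600)
    extreme = in_band & (df["TAX"] > 600)
    df.loc[extreme, "TAX"] = df.loc[normal, "TAX"].mean()

remaining = (df["TAX"] > 600).sum()   # should now be 0
```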
So these are some of the outliers that we can see with the help of the boxplot. Let's just see what these values are. So here you can see the values where the PTRATIO is less than 14. These are the house prices, these are the RM values, and these are the TAX values for them. So what do we observe? The PTRATIO for all of the above points is about 13, with a little variation: we have some 12.6s, but mostly we have 13s. And the RM and the price are increasing simultaneously; as RM and the price are positively correlated, the price is increasing and so is the RM. And lastly, we can say that LSTAT increases as the price decreases, because there's a negative correlation between the two. So we don't observe any abnormalities, and we'll keep this data. And now we're going to analyze LSTAT. We're going to plot the three graphs, and you can see that we do see some outliers over here, but the distribution is quite normal. We can see, though, that LSTAT is skewed towards the right. So there are some high LSTAT values, and then we can see some low ones as well. First we're going to calculate the upper range for the outliers. And you can see that these are the upper ranges, or the upper values. From these calculated values, we can observe that the price of a house is actually low for a high LSTAT value, which represents the negative correlation. And for the RM value, we can see that it is low and the TAX value is a little higher, which means that a low price has a negative correlation with RM and a positive correlation with TAX. So we don't see any abnormality over here, so we will keep this data. And now lastly we're going to observe RM. Firstly, we're going to draw the boxplot, the distribution plot, and the scatterplot. And we can see that RM is normally distributed, but there are some expected outliers on the left and on the right.
So the scatterplot of RM versus the price shows a good positive linear relationship between the two. Let's just calculate the lower values for the outliers. So here we can make an observation that in row numbers 365 and 367, we see that the house prices are much higher as compared to other prices, while the RM is quite low, though RM and the price are positively correlated. And also for these two data points, TAX and the PTRATIO are above 50% of the data points, respectively, and both are negatively correlated with the price. For the rest of the data points, we don't see any unusual behavior. So what can we conclude from this observation? The two points at row numbers 365 and 367 may influence the prediction capability of our model. So we are going to keep the rest, but we're going to delete these. And after that we're going to see the shape. So earlier we had 490 rows, but now we have 488 rows. So we can see that there is a difference in shape after removing the outliers. Now let's just see the upper values. So these are the upper values that we have. In the above data points, we see that only one data point, at row 364, has a very low house price as compared to the other house prices that we see here, while the RM is very high at this point, though RM and the price are positively correlated. Also for this data point, the LSTAT is low and the price is also low, though these two are negatively correlated; still we see this relationship between the two. For the rest of the data points, there isn't any unusual behavior, so they're good to go. But we're going to remove this last data point. And after removing it, we're going to see the shape of our data. So here we see that now we have 487 rows after removing that one, and we can see the difference in the shape. And now we're done with the multivariate and the univariate analysis.
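Removing individual flagged rows by their index labels, as done above for rows 365 and 367 and then for row 364, can be sketched like this; the frame is a tiny invented one whose index labels mimic the flagged rows from the lesson.

```python
import pandas as pd

# Tiny invented frame whose index labels mimic the flagged rows
df = pd.DataFrame({"RM": [6.0, 4.1, 6.5, 8.8, 5.9],
                   "Price": [24.0, 50.0, 27.0, 10.0, 21.0]},
                  index=[10, 365, 20, 364, 367])

before = df.shape[0]                       # 5 rows
df = df.drop(index=[365, 367])             # first pass: the two low-RM points
df = df.drop(index=[364])                  # second pass: the upper-range point
after = df.shape[0]                        # 2 rows
```

Note that drop works on index labels, not positions, which is why the labels survive earlier filtering steps and can still be used to target specific houses.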
So let's just split our data into test and training sets so that we can train a model, check its accuracy, and see which model is best for the prediction of house prices. For that, we are going to print out our DataFrame. So we have 487 rows and five columns, and this includes our target column as well. So let's just drop the price column and include it in the variable y; the rest of the features are going to be in variable X. Now let's just see variable X. So here is variable X; we have four features. And in variable y we have this one feature, which is the price. Let's just print out the shapes of both. So we have four columns in the first DataFrame and one column in the other. Now we're going to split our data into training and test data. That is going to be done with the help of the sklearn library, which has train_test_split. The test size that we're going to give is 0.3, so our split is going to be 70/30. Let's just execute this, and now let's just see the shape of all of our DataFrames. So we have an X_train which contains 340 rows, an X_test which has 147 rows, a y_train which has 340 rows, and a y_test which has 147 rows. Now we're going to train our model. We're going to use linear regression in order to train our model. We've already seen how we can apply linear regression with a model. So we're using the fit function, we are giving it X_train and y_train, and then we're going to predict the result. And lastly, let's just plot it out on a graph. So over here, we can see the predicted values versus the original values. This is on the test set: this is our y predicted and this is our y test. So now we're going to calculate the accuracy. And the accuracy is going to be measured using R-square and adjusted R-square. So these are basically the formulas for them, and we are just going to execute this. So the R-square value comes out to be about 0.749.
And the adjusted R-square comes out to be about 0.742, so it's pretty much 0.74; that is the accuracy for our model. Now let's just train a decision tree on the same data. So we are training a decision tree: this is the DecisionTreeRegressor that we have, and then we train it. Let's just plot the predicted values. And over here we can see the y predicted and the y test for the decision tree regressor. Let's just calculate the accuracy for this. Similarly, we are going to calculate the R-square and adjusted R-square. And this comes out to be much closer to the linear regression too; it's also about 0.74, rounded off to 0.74 for both of them. So let's just plot both of the models and the results for them. OK. So as a result, we can see that for linear regression we have an R-square of 0.749007, and for the decision tree, 0.748. So linear regression has about 0.001 better accuracy than the decision tree. Now for the adjusted values, we can see that we have 0.742 for linear regression and 0.740963 for the decision tree. So then again, by about 0.002, linear regression is better than the decision tree. So we can conclude that the R-square and adjusted R-square values for linear regression are better than for the decision tree. Therefore this model is the better approach for the prediction of house prices on the Boston dataset. Thank you for watching this entire session, and I hope that this helped you a lot in making your concepts clear.
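The whole modelling step, a 70/30 split, a LinearRegression and a DecisionTreeRegressor, and R-square with adjusted R-square, can be sketched on synthetic data as below. The lesson's 0.749 / 0.748 figures came from the real Boston frame; the numbers here will differ because the data is invented.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Synthetic data: four features driving a mostly-linear target
rng = np.random.default_rng(5)
n = 500
X = rng.random((n, 4))
y = X @ np.array([3.0, -2.0, 1.5, 0.5]) + rng.normal(0, 0.1, n)

# test_size=0.3 gives the 70/30 split used in the lesson
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

lin = LinearRegression().fit(X_train, y_train)
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

def adjusted_r2(r2, n_rows, n_features):
    # Adjusted R-square penalises extra features
    return 1 - (1 - r2) * (n_rows - 1) / (n_rows - n_features - 1)

r2_lin = r2_score(y_test, lin.predict(X_test))
r2_tree = r2_score(y_test, tree.predict(X_test))
adj_lin = adjusted_r2(r2_lin, *X_test.shape)
adj_tree = adjusted_r2(r2_tree, *X_test.shape)
```

On this synthetic linear data the linear model naturally wins; on real data the comparison of R-square and adjusted R-square is what decides, just as in the lesson.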